There is a lot of interest in Deep Reinforcement Learning lately, and there are many examples of it on the net. Most Deep RL algorithms use convolutional networks. Nevertheless, there is a prevalent opinion that only shallow convolutional networks (2-3 convolutional layers) are easy to train for Reinforcement Learning. This may or may not be correct, but there is a workaround: take a pretrained deep network and freeze most of it, leaving only the upper part to train. The frozen part works as a feature detector without trainable parameters.

We ran an experiment on Torcs using a modified gym-torcs framework and a pretrained Resnet-18.

The input state was purely image-based (otherwise it obviously wouldn’t make sense to use Resnet).

Some modifications were made to gym-torcs to produce images with higher resolution, more suitable for Resnet input. The original Resnet has an RGB input. We used the three input channels of the former RGB image to stack three consecutive greyscale images. Following common practice for image-based driving agents, we used the differences between consecutive images produced by Torcs instead of the original images.
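The preprocessing described above could be sketched roughly as follows. This is a minimal pure-Python illustration, not the actual gym-torcs modification: the class `DiffStacker`, the helper names, the plain channel-average greyscale conversion and the zero-padding on the first frame are all assumptions made for the example. A real pipeline would operate on image arrays rather than nested lists.

```python
from collections import deque


def to_grey(rgb_frame):
    """Convert rows of (r, g, b) tuples to greyscale floats (plain average, an assumption)."""
    return [[(r + g + b) / 3.0 for (r, g, b) in row] for row in rgb_frame]


def frame_diff(curr, prev):
    """Pixel-wise difference between two greyscale frames of equal size."""
    return [[c - p for c, p in zip(crow, prow)] for crow, prow in zip(curr, prev)]


class DiffStacker:
    """Keeps the last `channels` difference images, oldest first (hypothetical helper)."""

    def __init__(self, channels=3):
        self.channels = channels
        self.prev = None
        self.diffs = deque(maxlen=channels)

    def push(self, rgb_frame):
        grey = to_grey(rgb_frame)
        if self.prev is None:
            # No previous frame yet: use an all-zero difference image.
            diff = [[0.0] * len(row) for row in grey]
        else:
            diff = frame_diff(grey, self.prev)
        self.prev = grey
        self.diffs.append(diff)
        # On the first call, pad the stack so it always has `channels` entries.
        while len(self.diffs) < self.channels:
            self.diffs.appendleft(diff)
        return list(self.diffs)
```

Each call to `push` returns three greyscale difference images that take the place of the R, G and B channels of the network input.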

We added three fully-connected layers on top of Resnet, the last of which produces the action (steering angle).

Most of Resnet-18 was frozen. Only the last block and the fully-connected layers on top of it were trained.

We trained Resnet-18 to drive the car with a discrete Q-learning algorithm. (A simple explanation of Deep Q-learning can be found here.)

The original Resnet-18 uses Batch Normalization layers, but a frozen net doesn’t need Batch Normalization. Because our Resnet-18 is mostly frozen, the Batch Normalization layers were merged into the convolutional layers.
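Merging a frozen Batch Normalization layer into the preceding convolution is a per-output-channel rescaling of weights and biases. The sketch below shows the arithmetic in pure Python on flattened kernels; it is an illustration of the general technique, not the Caffe code used in the experiment, and the function names and the `eps` value are assumptions.

```python
import math


def fold_batchnorm(weights, biases, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (x - mean) / sqrt(var + eps) + beta into conv weights.

    weights: one flattened kernel (list of floats) per output channel.
    All other arguments hold one float per output channel.
    """
    folded_w, folded_b = [], []
    for w, b, g, bt, m, v in zip(weights, biases, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([wi * scale for wi in w])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b


def conv_at_one_position(weights, biases, patch):
    """Convolution output at a single spatial position (dot product per channel)."""
    return [sum(wi * xi for wi, xi in zip(w, patch)) + b
            for w, b in zip(weights, biases)]


def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time Batch Normalization with fixed statistics."""
    return [g * (xi - m) / math.sqrt(v + eps) + bt
            for xi, g, bt, m, v in zip(x, gamma, beta, mean, var)]
```

Applying `conv_at_one_position` with the folded parameters gives the same output as the original convolution followed by `batchnorm`, so the normalization layer can be dropped from the frozen part.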

Training was done in the Caffe framework using Stochastic Gradient Descent.

### Training

The only action the network produces is the steering angle. We tried discretizing the action into 3 values and into 11 values. The 11-value discretization produces much better results, but could still be overkill: approximately the same quality of steering could be achieved with a 7-value discretization.
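A discretized steering action can be illustrated as a mapping from the index of the largest Q-value to an evenly spaced steering command. This is only a sketch: the normalized steering range of [-1, 1], the even spacing and the function names are assumptions, not details taken from the experiment.

```python
def steering_values(n_actions):
    """Evenly spaced steering commands in [-1, 1] (assumed normalized range)."""
    if n_actions < 2:
        raise ValueError("need at least two discrete actions")
    step = 2.0 / (n_actions - 1)
    return [-1.0 + i * step for i in range(n_actions)]


def greedy_action(q_values):
    """Index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def action_to_steering(action_index, n_actions=11):
    """Map a discrete action index to a steering command."""
    return steering_values(n_actions)[action_index]
```

With 11 actions the middle index 5 corresponds to driving straight, and the network output layer has one Q-value per steering command.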

Acceleration is produced by a simple heuristic in gym_torcs. Speed is randomized in the range of approximately 30-40 m/s.

Another simple heuristic tries to move the car back to the road after an out-of-lane event.

First we trained the net for 600k batch iterations (batch size 32) on 17 tracks. The cut-off time for each track was set at ~30 minutes (5000 steps). Even with 600k iterations, out-of-lane events still happened a lot, especially on more difficult tracks like the city street. So we switched to only 7 tracks, and after retraining (500k iterations, batch size 24) the results became much better. Out-of-lane events now happen on average ~1.5 times per episode. Driving is still not quite smooth, and an out-of-lane event can be fatal if the car gets stuck at the border of the road.

One of the problems is the high latency of Torcs (Torcs latency is set to 0.4 seconds). Another is “rare situations” which confuse the network: hard shadows resembling the borders of the road, white starting lines across the road, and in some cases the not quite consistent width of the road produced by Torcs.

### Some details of implementation

#### Balancing

Left to its own devices, the network tends to train mostly one positive and one negative action. To obtain smooth actions we use “guided exploration” in addition to random exploratory noise. Some randomly chosen actions are smoothed in time and clipped in value. This produces a smoother distribution of actions for the network to train on.

#### Positive reward

Following Ben Lau’s DDPG implementation, we use a simple reward function

*r = C (cos(angle) – |sin(angle)| – trackPos)*

where *C* is a constant, *angle* is the angle between the velocity vector and the road, and *trackPos* is the normalized distance to the center of the road.

We are training in an environment with a “termination” condition. A training episode stops when it reaches a terminating state, that is, a situation when the car is stuck or the speed is too low. In such an environment it is important for the reward function to be non-negative on the whole reachable state space. If the reward is negative on some part of the state space, the network can learn to terminate instead of trying to exit the negative-reward region, which would mean accumulating even more negative reward on the way. Termination is equivalent to zero reward for an infinite number of steps, and the network may prefer zero reward over many steps of negative reward. The network would learn to “suicide” instead of continuing to suffer negative reward.

Thus we make the reward non-negative:

*r = max(r, 0) + indicator(1 – |trackPos|)*
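The two reward formulas above can be put together in a few lines. This is a sketch under stated assumptions: the indicator is interpreted here as 1 when `1 - |trackPos| > 0` (the car is within the road borders) and 0 otherwise, the distance to the center is taken as the absolute value of a signed `track_pos`, and the function names and the default `c=1.0` are made up for the example.

```python
import math


def base_reward(angle, track_pos, c=1.0):
    """r = C (cos(angle) - |sin(angle)| - trackPos), with trackPos as a distance."""
    return c * (math.cos(angle) - abs(math.sin(angle)) - abs(track_pos))


def non_negative_reward(angle, track_pos, c=1.0):
    """r = max(r, 0) + indicator(1 - |trackPos|).

    Assumption: the indicator is 1.0 while the car is inside the road
    borders (|trackPos| < 1) and 0.0 otherwise.
    """
    r = base_reward(angle, track_pos, c)
    on_road = 1.0 if 1.0 - abs(track_pos) > 0.0 else 0.0
    return max(r, 0.0) + on_road
```

Under this reading the reward never drops below zero, so terminating early is never better than continuing to drive.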

#### Target network

As per common practice, we use a replay buffer and two networks: the current network, which drives the car, and the target network, which changes slowly over time. The parameters of the target network are updated as a running average of the current network’s parameters:

$$\theta^- = (1-\tau) \ \theta^- + \tau \ \theta \\ \theta \ \ \text{- parameters of the current training network,} \\ \theta^- \ \ \text{- parameters of the target (averaged over time) network}$$
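The running-average update of the target network can be sketched on a flat list of parameters. The function name `soft_update` and the flat-list representation are assumptions for illustration; in a real framework the same formula is applied to every weight blob.

```python
def soft_update(target_params, current_params, tau):
    """Return theta_minus = (1 - tau) * theta_minus + tau * theta, element-wise."""
    return [(1.0 - tau) * t + tau * c
            for t, c in zip(target_params, current_params)]


# Usage: with a small tau the target drifts slowly toward the current network.
target = [0.0, 1.0]
current = [1.0, 1.0]
for _ in range(3):
    target = soft_update(target, current, tau=0.1)
```

A small *tau* keeps the target values nearly constant between updates, which is what stabilizes the Q-learning targets.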

#### Looking into future

Our total discounted reward is

$$R = \sum_i r_i \ \gamma^{i-1} \\ \gamma \ \text{is discount parameter}$$

Let’s look at the limiting case

$$\gamma = 0, \ R = r_1$$

In that case the total reward is just the reward for the current action, and for discrete actions reinforcement learning becomes simple regression. Obviously, simple regression is much easier than reinforcement learning.

This suggests that training with a smaller gamma is easier than training with gamma close to 1.

The gamma parameter can be very important for training. Gamma close to 1 can produce erratic behavior in the middle of training, because some erratic trajectory can yield a better reward in the far future than a simpler trajectory. Such an erratic trajectory can persist for a long time during training. This kind of behavior encourages choosing a “minimal” gamma: the smallest gamma which still provides a stable predictive policy. In our experiments both gamma 0.98 and gamma 0.99 provide enough prediction for driving, but gamma 0.99 seems to provide smoother driving.
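The total discounted reward and its limiting case can be checked numerically with a short helper. The function name `discounted_return` is an assumption for this sketch; the list index 0 corresponds to *r*<sub>1</sub> in the formula.

```python
def discounted_return(rewards, gamma):
    """R = sum_i r_i * gamma^(i-1), accumulated backwards from the last reward."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# With gamma = 0 only the immediate reward counts.
immediate_only = discounted_return([1.0, 2.0, 3.0], gamma=0.0)
# With gamma close to 1 distant rewards contribute almost fully.
far_sighted = discounted_return([1.0, 2.0, 3.0], gamma=0.99)
```

The closer gamma is to 1, the more the return depends on rewards far in the future, which is where the erratic trajectories described above get their advantage.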

#### Simplified priority sampling

Unlike DeepMind’s priority sampling, our priority sampling is not based on the Temporal Difference error. Instead, while choosing a batch for training we make *k* tries and choose the batch with maximal variance in relative reward.
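The batch selection described above could look roughly like this. It is a sketch with several assumptions: transitions are dictionaries with a `"reward"` key, the variance is computed on the raw stored reward (the exact definition of “relative reward” is not reproduced here), and the function names are hypothetical.

```python
import random
import statistics


def reward_variance(batch):
    """Population variance of the rewards in a batch."""
    return statistics.pvariance([t["reward"] for t in batch])


def sample_high_variance_batch(replay_buffer, batch_size, k, rng=None):
    """Draw k random batches and keep the one with the largest reward variance."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = rng or random
    best_batch, best_var = None, -1.0
    for _ in range(k):
        batch = rng.sample(replay_buffer, batch_size)
        var = reward_variance(batch)
        if var > best_var:
            best_batch, best_var = batch, var
    return best_batch
```

Compared with TD-error priorities, this needs no per-transition bookkeeping: the only extra cost is drawing *k* candidate batches instead of one.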

#### N-step Q learning

Simple Q-learning is very slow to train. To accelerate training we use n-step Q-learning.

We use the standard Q-learning loss, the squared difference between the current Q-value and the target estimate

$$loss = (Q(s_t, a_t, \theta) - Q_y)^2$$

Simple n-step learning uses the target

$$Q_{y}^{nstep} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \dots + \gamma^n V_{s_{t+n}, \theta^-}$$

or

$$Q_{y,t-1}^{nstep} = r_{t-1} + \gamma \ Q_{y,t}^{nstep}$$
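The recursive form of the n-step target translates directly into a backward loop that starts from the bootstrap value *V* at step *t+n*. The function name and argument layout are assumptions for this sketch.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """n-step target via Q_{t-1} = r_{t-1} + gamma * Q_t.

    rewards: [r_t, r_{t+1}, ..., r_{t+n-1}]
    bootstrap_value: V of the state s_{t+n}, taken from the target network.
    """
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target
```

For example, `n_step_target([1.0, 2.0], 4.0, 0.5)` evaluates 1 + 0.5 · (2 + 0.5 · 4), so the bootstrap value ends up multiplied by gamma once per step.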

However, to better account for exploration noise we can use

$$Q_{y,t-1}^{expl} = r_{t-1} + \gamma \ \max(V_{s_t, \theta^-}, \ Q_{y,t}^{expl})$$

that is, we recursively take the max with the corresponding *V* before adding each new reward,

or, unrolling it,

$$Q_y^{expl} = r_t + \gamma \ \max(V_{s_{t+1}, \theta^-},\ r_{t+1} + \gamma \ \max(V_{s_{t+2}, \theta^-}, \ r_{t+2} + \gamma \ \max(V_{s_{t+3}, \theta^-},\ r_{t+3} + \dots + \gamma V_{s_{t+n}, \theta^-}) \dots ))$$

where *V* is the value function

$$V_{s, \theta^-} = \max_a Q(s, a, \theta^-)$$

We can also use a weighted average of both

$$Q_y = \lambda \ Q^{nstep}_y + (1-\lambda) \ Q^{expl}_y$$

This is a simplified modification of the well-known TD-Lambda algorithm.
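The exploration-aware target and the weighted average can be sketched in the same backward-loop style as the plain n-step target. The function names and the list layout are assumptions: `rewards[i]` stands for the reward at step *t+i* and `values[i]` for *V* of the state at step *t+i+1*, both taken from a stored trajectory segment.

```python
def n_step_target(rewards, values, gamma):
    """Plain n-step target; only the last value V(s_{t+n}) is used."""
    target = values[-1]
    for r in reversed(rewards):
        target = r + gamma * target
    return target


def exploration_target(rewards, values, gamma):
    """Q_{t-1} = r_{t-1} + gamma * max(V(s_t), Q_t), unrolled backwards."""
    target = values[-1]
    for i in range(len(rewards) - 1, -1, -1):
        target = rewards[i] + gamma * max(values[i], target)
    return target


def mixed_target(rewards, values, gamma, lam):
    """Q_y = lam * Q_nstep + (1 - lam) * Q_expl."""
    return (lam * n_step_target(rewards, values, gamma)
            + (1.0 - lam) * exploration_target(rewards, values, gamma))
```

Whenever a noisy exploratory action led to a worse continuation than the target network’s own estimate, the max replaces that continuation with *V*, so the exploration-aware target is never below the plain n-step one.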

### Source code

Source code for Linux is here. You need to install Caffe and Torcs to use it (see the installation instructions in README.md).

Acknowledgement:

Caffe is developed by Berkeley AI Research (BAIR)/The Berkeley Vision and Learning Center (BVLC) and community contributors.

Naoto Yoshida is the author of gym_torcs.

The implementation of the replay buffer is based on Ben Lau’s work.