There has been a lot of interest in Deep Reinforcement Learning lately, and there are many examples of it on the net. Most Deep RL algorithms use convolutional networks. Nevertheless, there is a prevalent opinion that only shallow convolutional networks (2-3 convolutional layers) are easy to train for Reinforcement Learning. That may or may not be correct, but there is a workaround for the problem: take a pretrained deep network and freeze most of it, leaving only the upper part to train. The frozen part then works as a feature detector without trainable parameters.

We ran the experiment on Torcs using a modified gym-torcs framework.

The input state was purely image-based (otherwise it obviously wouldn't make sense to use Resnet).

Some modifications were made so that gym-torcs produces images with higher resolution, more suitable for Resnet input. The original Resnet has RGB input. We used the three input channels of the former RGB image to stack three consecutive greyscale images. Following common practice for image-based neural network drivers, we used the difference between consecutive images produced by Torcs instead of the original images.
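
The input construction described above could look roughly like the sketch below. It assumes NumPy, assumes that each of the three channels holds one difference of consecutive greyscale frames (so four raw frames are needed), and uses the common luma weights for the greyscale conversion; the function names are hypothetical and not taken from gym-torcs.

```python
import numpy as np

def to_grey(rgb):
    """Convert an HxWx3 uint8 RGB frame to float32 greyscale in [0, 1]."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return (rgb.astype(np.float32) / 255.0) @ weights

def stacked_differences(frames):
    """Build a 3xHxW network input from four consecutive RGB frames.

    Each channel is the difference between two consecutive greyscale
    frames, oldest difference first (an assumed ordering).
    """
    if len(frames) != 4:
        raise ValueError("need exactly four consecutive frames")
    grey = [to_grey(f) for f in frames]
    diffs = [grey[i + 1] - grey[i] for i in range(3)]
    return np.stack(diffs, axis=0)
```

The result has the channel-first layout that a Caffe-style network expects; a static scene gives an all-zero input, so only motion is visible to the network.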

We added 3 fully-connected (fc) layers on top of Resnet.

Most of Resnet-18 was frozen. Only the last block and the fully-connected layers on top of it were trained.

After that we trained Resnet-18 to drive the car with the discrete Q-learning algorithm. (A simple explanation of Deep Q-learning can be found here.)

###

Resnet-18 produces the steering angle; brake/acceleration was produced in gym_torcs by a simple controller that keeps the speed approximately constant.
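
As an illustration only, the split between the learned steering and the hand-written speed control might be sketched as follows. The evenly spaced steering bins, the proportional controller, the gain value and both function names are assumptions made for this example; they are not the actual gym_torcs code.

```python
def steering_from_action(action, n_actions):
    """Map a discrete action index to a steering command in [-1, 1]."""
    if n_actions < 2 or not 0 <= action < n_actions:
        raise ValueError("invalid action index or number of actions")
    return -1.0 + 2.0 * action / (n_actions - 1)

def speed_control(speed, target_speed, gain=0.1):
    """Proportional controller returning (accel, brake), each in [0, 1]."""
    error = gain * (target_speed - speed)
    accel = min(max(error, 0.0), 1.0)   # push the pedal when too slow
    brake = min(max(-error, 0.0), 1.0)  # brake when too fast
    return accel, brake
```

With this split the network only has to choose among `n_actions` steering values, while the speed stays near `target_speed` regardless of what the network does.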

###

The original Resnet-18 uses Batch Normalization layers, but a frozen net doesn't need Batch Normalization. Because our Resnet-18 is mostly frozen, the Batch Normalization layers were merged into the convolutional layers.
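
Merging a frozen Batch Normalization layer into the preceding convolution is a standard per-channel rescaling. The sketch below shows the arithmetic with NumPy; it assumes the running mean, running variance and learned scale/shift have already been extracted as plain arrays, and the function and parameter names are hypothetical.

```python
import numpy as np

def fold_batchnorm(weight, bias, mean, var, bn_scale, bn_shift, eps=1e-5):
    """Fold y = bn_scale * (conv(x) - mean) / sqrt(var + eps) + bn_shift
    into the convolution's own weight and bias.

    weight: (out_channels, in_channels, kh, kw)
    bias, mean, var, bn_scale, bn_shift: (out_channels,)
    """
    factor = bn_scale / np.sqrt(var + eps)
    folded_weight = weight * factor[:, None, None, None]
    folded_bias = (bias - mean) * factor + bn_shift
    return folded_weight, folded_bias
```

After folding, the convolution alone reproduces the output of the original convolution followed by normalization, so the normalization layer can be dropped from the frozen part.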

Training was done in the Caffe framework with Stochastic Gradient Descent.

Discretization – more or less?

3 vs 11

**Target network**

As per common practice we use a replay buffer and two networks: the current network, which drives the car, and the target network, which changes slowly over time. The parameters of the target network are updated as a running average of the current network:

$$\theta^- = (1-\tau) \theta^- + \tau \theta \\ \theta \ \ \text{- parameters of the current (training) network,} \\ \theta^- \ \ \text{- parameters of the target (time-averaged) network}$$
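
In code, the running-average update is a one-liner per parameter. The sketch below treats the parameters as a flat list of numbers (the same expression also works elementwise on NumPy arrays); the function name is hypothetical.

```python
def soft_update(target_params, current_params, tau):
    """Return new target parameters: (1 - tau) * target + tau * current."""
    if len(target_params) != len(current_params):
        raise ValueError("parameter lists must have the same length")
    return [(1.0 - tau) * t + tau * c
            for t, c in zip(target_params, current_params)]
```

A small `tau` makes the target network lag far behind the current one; `tau = 1` would simply copy the current network at every step.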

Torcs problem

Lag spikes (packet loss?)

Inconsistent road (-1,1)

crossing white line (few – no train) hard shadows create this prob

Rebound

Looking into the future: importance of gamma

Our total discounted reward is

$$R = \sum_i r_i \ \gamma^{i-1} \\ \gamma \ \text{is discount parameter}$$

Let's look at the limiting case

$$\gamma = 0, \ R = r_1$$

In that case the total reward is just the reward for the current action, and for discrete actions reinforcement learning becomes regression. Obviously, simple regression is much easier than reinforcement learning.

That indicates that a smaller gamma is easier to train with.
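
The discounted sum and its limiting case are easy to check numerically. This is a minimal sketch; the function name and the reward values in the usage note are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute R = sum_i r_i * gamma**(i - 1) for rewards [r_1, r_2, ...]."""
    total = 0.0
    # Accumulate from the last reward backwards: R_i = r_i + gamma * R_{i+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For rewards `[1.0, 2.0, 3.0]`, `gamma = 0` gives 1.0 (only the first reward counts), while `gamma = 0.5` gives 1 + 0.5 * 2 + 0.25 * 3 = 2.75.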

The gamma parameter can be very important for training. A gamma close to 1 can produce erratic behavior in the middle of training, because some erratic trajectory may produce a better reward than a simpler trajectory. If this erratic trajectory is stable, it can persist for a long time. That kind of behavior encourages the choice of a “minimal” gamma: the smallest gamma that still provides a stable predictive policy.

**Positive reward**

We are training in an environment with a “termination” condition: when a terminating state is reached, the training episode stops. In such an environment it is important for the reward function to be nonnegative on the whole reachable state space. If the reward is negative on some undesirable part of the state space, the network can learn to terminate the episode instead of trying to leave that part and accumulating more negative reward. Termination is equivalent to zero reward for an infinite number of steps, which the network would prefer over many steps of negative reward. The network would learn to commit suicide instead of continuing to suffer.
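
A small worked comparison, with made-up numbers, makes the argument concrete. Suppose the car needs m more steps in the undesirable region, each with the same reward r, while terminating immediately ends the episode:

$$R_{stay} = \sum_{i=1}^{m} \gamma^{i-1} r = r \ \frac{1-\gamma^m}{1-\gamma}, \qquad R_{terminate} = 0$$

For example, with r = -1, gamma = 0.9 and m = 3 we get R_stay = -1 - 0.9 - 0.81 = -2.71, which is less than 0, so termination looks better unless the reward collected afterwards outweighs it. With any r >= 0 the sum is never below 0 and that incentive disappears.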

**N-step Q learning**

It turns out that simple Q-learning is very slow to train. To accelerate training we used n-step Q-learning.

In our form of n-step Q-learning we use the standard Q-learning loss, the squared Euclidean distance between the current Q value and the target estimate

$$loss=(Q(s_t, a_t, \theta) - Q_y)^2$$

Standard n-step learning uses the target

$$Q_{y}^{nstep} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V_{s_{t+n}, \theta^-}$$

or

$$Q_{y,t-1}^{nstep} = r_{t-1} + \gamma \ Q_{y,t}^{nstep}$$
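
The recursive form maps directly onto a backward loop over the rewards of one n-step block. In this sketch the bootstrap value is assumed to be the target network's V at the state that ends the block; the function and argument names are hypothetical.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """n-step Q-learning target for the first step of a block.

    rewards: [r_t, ..., r_{t+n-1}]
    bootstrap_value: V(s_{t+n}) computed with the target network
    """
    target = bootstrap_value
    # Apply Q_{y,t-1} = r_{t-1} + gamma * Q_{y,t} from the end of the block.
    for r in reversed(rewards):
        target = r + gamma * target
    return target
```

With rewards `[1.0, 1.0]`, a bootstrap value of 4.0 and `gamma = 0.5` this gives 1 + 0.5 * (1 + 0.5 * 4) = 2.5.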

However, to better account for exploration noise we can use

$$Q_{y,t-1}^{expl} = r_{t-1} + \gamma \ max(V_{s_t, \theta^-}, \ Q_{y,t}^{expl})$$

that is, recursively taking the max with the corresponding V before adding each new reward,

or, unrolling it,

$$Q_y^{expl} = r_t + \gamma \ \max(V_{s_{t+1}, \theta^-},\ r_{t+1} + \gamma \ \max(V_{s_{t+2}, \theta^-}, \ r_{t+2} + \gamma \ \max(V_{s_{t+3}, \theta^-},\ r_{t+3} + \dots + \gamma \ V_{s_{t+n}, \theta^-}) \dots ))$$

where V is the value function

$$V_{s, \theta^-} = \max_a Q(s, a, \theta^-) $$
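
The exploration-aware target can be computed with the same kind of backward loop, taking the max with V at every step. The sketch assumes the target network's V values along the block are already available as a list; the names are hypothetical.

```python
def exploration_target(rewards, values, gamma):
    """Exploration-aware n-step target for the first step of a block.

    rewards: [r_t, ..., r_{t+n-1}]
    values:  [V(s_{t+1}), ..., V(s_{t+n})] from the target network
    """
    if not rewards or len(rewards) != len(values):
        raise ValueError("rewards and values must be non-empty and equal in length")
    target = values[-1]  # the block ends by bootstrapping from V(s_{t+n})
    for r, v in zip(reversed(rewards), reversed(values)):
        # If the tail of the block did worse than V predicts (for example
        # because of an exploratory action), fall back to V.
        target = r + gamma * max(v, target)
    return target
```

For rewards `[1.0, -4.0]`, values `[3.0, 2.0]` and `gamma = 0.5`, the tail is worth -4 + 0.5 * 2 = -3, which is below V = 3, so the target becomes 1 + 0.5 * 3 = 2.5 instead of the plain n-step value of -0.5.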

We can also use weighted average of both

$$Q_y = \alpha \ Q^{nstep}_y + (1-\alpha) \ Q^{expl}_y$$

Rebound

**Simplified priority sampling**

When choosing the starting n-step block, we make k tries and choose the block with the maximal variance in reward.
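
A minimal sketch of this sampling rule is given below. It assumes the replay buffer exposes one episode's rewards as a plain list, that candidate starts are drawn uniformly, and that the variance is taken over the n rewards inside the block; the function name and signature are hypothetical.

```python
import random
import statistics

def sample_block_start(rewards, n, k, rng=random):
    """Pick the start index of an n-step block.

    Draws k candidate starts uniformly at random and returns the one
    whose n rewards have the largest (population) variance.
    """
    if n < 1 or k < 1 or len(rewards) < n:
        raise ValueError("need n >= 1, k >= 1 and at least n rewards")
    last_start = len(rewards) - n
    candidates = [rng.randint(0, last_start) for _ in range(k)]
    return max(candidates,
               key=lambda s: statistics.pvariance(rewards[s:s + n]))
```

Larger `k` biases sampling more strongly towards blocks where the reward changes, while `k = 1` reduces to uniform sampling.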

Training

We started with an action consisting of only the steering angle and later retrained to a steering/acceleration action.

We trained the net for 600k batch iterations (batch size 32) on 17 tracks. Each track was cut off after 10 minutes of running. Even with 600k iterations, rebound events still happen on the more difficult tracks (sometimes shadows cause problems).

For steering/acceleration we used 2 Q-functions.

Source code is here

Acknowledgement:

**Caffe** is developed by Berkeley AI Research (BAIR)/The Berkeley Vision and Learning Center (BVLC) and community contributors

Naoto Yoshida is the author of **gym torcs**.

The implementation of the replay buffer is based on **Ben Lau**'s work.