
How to set layer-wise learning rate in Tensorflow?

I am wondering if there is a way to use different learning rates for different layers, as in Caffe. I am trying to modify a pre-trained model and use it for other tasks. What I want is to speed up training for the newly added layers and keep the pre-trained layers at a low learning rate in order to prevent them from being distorted. For example, I have a 5-conv-layer pre-trained model. Now I add a new conv layer and fine-tune it. The first 5 layers would have a learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?

TF 1.8's tf.custom_gradient now greatly simplifies this problem -- see my answer below.

Community

It can be achieved quite easily with 2 optimizers:

var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
train_op1 = tf.train.GradientDescentOptimizer(0.00001).minimize(loss, var_list=var_list1)
train_op2 = tf.train.GradientDescentOptimizer(0.0001).minimize(loss, var_list=var_list2)
train_op = tf.group(train_op1, train_op2)

One disadvantage of this implementation is that it computes tf.gradients(.) twice inside the optimizers and thus it might not be optimal in terms of execution speed. This can be mitigated by explicitly calling tf.gradients(.), splitting the list into 2 and passing corresponding gradients to both optimizers.

Related question: Holding variables constant during optimizer

EDIT: Added more efficient but longer implementation:

var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
opt1 = tf.train.GradientDescentOptimizer(0.00001)
opt2 = tf.train.GradientDescentOptimizer(0.0001)
grads = tf.gradients(loss, var_list1 + var_list2)
grads1 = grads[:len(var_list1)]
grads2 = grads[len(var_list1):]
train_op1 = opt1.apply_gradients(zip(grads1, var_list1))
train_op2 = opt2.apply_gradients(zip(grads2, var_list2))
train_op = tf.group(train_op1, train_op2)

You can use tf.trainable_variables() to get all trainable variables and select from them. The difference is that in the first implementation tf.gradients(.) is called twice inside the optimizers, which may execute redundant operations: gradients for the first layers can reuse computations done for the gradients of the following layers, and that sharing only happens within a single tf.gradients(.) call.
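For instance, the two lists can be built by filtering tf.trainable_variables() on scope names. A minimal sketch, assuming the pre-trained layers were created under variable scopes named "conv1" through "conv5" (hypothetical names; substitute whatever your model actually uses):

# Hypothetical scope names for the five pre-trained conv layers.
pretrained_scopes = ['conv%d' % i for i in range(1, 6)]
all_vars = tf.trainable_variables()
var_list1 = [v for v in all_vars
             if any(v.op.name.startswith(s) for s in pretrained_scopes)]
names1 = set(v.op.name for v in var_list1)
var_list2 = [v for v in all_vars if v.op.name not in names1]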


Thanks for your answer, Rafal. I am still wondering about the disadvantage you mentioned in terms of speed: how much would it affect performance? If I am training a large network and this is a big burden, it would not be a good option. Besides, could you be more specific about your second method? How do I explicitly call tf.gradients()? Sorry, I am still a newbie.
thx man, is the second last line supposed to be "train_op2 = opt2.apply_gradients(.)"? if I understand it right.
In your first example you used loss in minimize so I think your second example should use tf.gradients(loss, var_list1 + var_list2)
Any reason this can't be extended to 3 or more separate lists?
Also, if I were using global_step within apply_gradients, would you use it for both opt1 and opt2, or only on the last apply_gradients?
P-Gn

TensorFlow 1.7 introduced tf.custom_gradient, which greatly simplifies setting learning rate multipliers, in a way that is compatible with any optimizer, including those that accumulate gradient statistics. For example:

import tensorflow as tf

def lr_mult(alpha):
  @tf.custom_gradient
  def _lr_mult(x):
    def grad(dy):
      # Identity in the forward pass; scales the incoming gradient by alpha.
      return dy * alpha * tf.ones_like(x)
    return x, grad
  return _lr_mult

x0 = tf.Variable(1.)
x1 = tf.Variable(1.)
loss = tf.square(x0) + tf.square(lr_mult(0.1)(x1))

step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
tf.local_variables_initializer().run()

for _ in range(5):
  sess.run([step])
  print(sess.run([x0, x1, loss]))

Actually, custom_gradient is available from TF 1.7
Great post, but I think it's super useful to see the output of the above code:

[1] [0.8, 0.98, 1.6004001]
[2] [0.64, 0.96040004, 1.3319682]
[3] [0.51199996, 0.94119203, 1.1479864]
[4] [0.40959996, 0.92236817, 1.0185351]
[5] [0.32767996, 0.9039208, 0.924447]
Yaroslav Bulatov

Update Jan 22: the recipe below is only a good idea for GradientDescentOptimizer. Other optimizers that keep a running average apply the learning rate before the parameter update, so the recipe below won't affect that part of the equation.

In addition to Rafal's approach, you can use the compute_gradients, apply_gradients interface of Optimizer. For instance, here's a toy network where I use 2x the learning rate for the second parameter:

x = tf.Variable(tf.ones([]))
y = tf.Variable(tf.zeros([]))
loss = tf.square(x-y)
global_step = tf.Variable(0, name="global_step", trainable=False)

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss, [x, y])
ygrad, _ = grads_and_vars[1]
train_op = opt.apply_gradients([grads_and_vars[0], (ygrad*2, y)], global_step=global_step)

init_op = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init_op)
for i in range(5):
  sess.run([train_op, loss, global_step])
  print(sess.run([x, y]))

You should see

[0.80000001, 0.40000001]
[0.72000003, 0.56]
[0.68800002, 0.62400001]
[0.67520005, 0.64960003]
[0.67008007, 0.65984005]

That's good when using SGD, but I'm not sure whether it's optimal for fancier optimizers that compute statistics over past gradient values... It probably doesn't make a difference as long as you don't want to change that learning rate during training.
@YaroslavBulatov Does this work for MomentumOptimizer as well? What exactly does the compute_gradients and apply_gradients functions do in this case?
Sergey Demyanov

Collect learning rate multipliers for each variable like:

self.lr_multipliers[var.op.name] = lr_mult

and then apply them before applying the gradients, like:

def _train_op(self):
  tf.scalar_summary('learning_rate', self._lr_placeholder)
  opt = tf.train.GradientDescentOptimizer(self._lr_placeholder)
  grads_and_vars = opt.compute_gradients(self._loss)
  grads_and_vars_mult = []
  for grad, var in grads_and_vars:
    # Scale each gradient by the multiplier registered for its variable.
    grad *= self._network.lr_multipliers[var.op.name]
    grads_and_vars_mult.append((grad, var))
    tf.histogram_summary('variables/' + var.op.name, var)
    tf.histogram_summary('gradients/' + var.op.name, grad)
  return opt.apply_gradients(grads_and_vars_mult)

You can find the whole example here.


Lewis Smith

A slight variation of Sergey Demyanov's answer, where you only have to specify the learning rates you would like to change:

from collections import defaultdict

self.learning_rates = defaultdict(lambda: 1.0)  # default multiplier leaves a gradient unchanged
...
x = tf.layers.Dense(3)(x)
self.learning_rates[x.op.name] = 2.0
...
optimizer = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
grads_and_vars_mult = []
for grad, var in grads_and_vars:
    grad *= self.learning_rates[var.op.name]
    grads_and_vars_mult.append((grad, var))
train_op = optimizer.apply_gradients(grads_and_vars_mult, tf.train.get_global_step())
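
For reference, here is a self-contained sketch of the same gradient-multiplier pattern on two scalar variables (variable names and multiplier values are made up for illustration). With plain gradient descent, scaling a gradient by k is exactly equivalent to scaling that variable's learning rate by k:

import tensorflow as tf

x0 = tf.Variable(1.0, name='x0')  # uses the default multiplier of 1.0
x1 = tf.Variable(1.0, name='x1')  # trained 10x slower
lr_multipliers = {'x1': 0.1}

loss = tf.square(x0) + tf.square(x1)
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# Scale each gradient by the multiplier registered for its variable.
grads_and_vars_mult = [(g * lr_multipliers.get(v.op.name, 1.0), v)
                       for g, v in opt.compute_gradients(loss)]
train_op = opt.apply_gradients(grads_and_vars_mult)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        sess.run(train_op)
        print(sess.run([x0, x1]))  # x1 moves 10x slower than x0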

Nicolas Pinchaud

The first 5 layers would have learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?

There is an easy way to do that using tf.stop_gradient. Here is an example with 3 layers:

x = layer1(input)
x = layer2(x)
output = layer3(x)

You can shrink the gradient flowing into the first two layers by a factor of 1/100:

x = layer1(input)
x = layer2(x)
x = 1/100 * x + (1 - 1/100) * tf.stop_gradient(x)
output = layer3(x)

At layer2, the "flow" is split into two branches: one, with a contribution of 1/100, computes its gradient regularly but with the gradient magnitude shrunk by a factor of 1/100; the other branch provides the remaining "flow" without contributing to the gradient, because of the tf.stop_gradient operator. As a result, if you use a learning rate of 0.001 on your model optimizer, the first two layers will effectively have a learning rate of 0.00001.
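
To convince yourself of the scaling, you can compare the gradient with and without the mixing trick. A minimal sketch where a single weight stands in for the parameters of the first two layers:

import tensorflow as tf

w = tf.Variable(2.0)   # stands in for a layer1/layer2 parameter
x = w * 3.0            # "output" of layer2, value 6.0
x_mixed = 1/100 * x + (1 - 1/100) * tf.stop_gradient(x)

loss_plain = tf.square(x)        # d(loss)/dw = 2 * 6.0 * 3    = 36.0
loss_mixed = tf.square(x_mixed)  # d(loss)/dw = 2 * 6.0 * 0.03 = 0.36

g_plain = tf.gradients(loss_plain, w)[0]
g_mixed = tf.gradients(loss_mixed, w)[0]
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Forward values are identical; the mixed gradient is 100x smaller.
    print(sess.run([g_plain, g_mixed]))  # approximately [36.0, 0.36]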


This affects the gradients of upstream variables as well, though. I.e., if you wanted layer1 to have a different LR than layer2, you'd have to scale the layer1 gradient according to how you scaled the layer2 gradient. I could see this cascading effect getting pretty hard to keep track of, especially for deeper networks.
Morty

If you happen to be using tf.slim + slim.learning.create_train_op, there is a nice example here: https://github.com/google-research/tf-slim/blob/master/tf_slim/learning.py#L65

# Create the train_op and scale the gradients by providing a map from variable
# name (or variable) to a scaling coefficient:
gradient_multipliers = {
    'conv0/weights': 1.2,
    'fc8/weights': 3.4,
}
train_op = slim.learning.create_train_op(
    total_loss,
    optimizer,
    gradient_multipliers=gradient_multipliers)

Unfortunately, it doesn't seem possible to use a tf.Variable instead of a float value if you want to gradually modify the multiplier.
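
A possible workaround (a sketch, not part of slim's API) is to fall back on the manual compute_gradients/apply_gradients pattern from the answers above and feed the multiplier through a placeholder, so it can change at every step. The variable name below is hypothetical:

import tensorflow as tf

mult = tf.placeholder(tf.float32, shape=[])  # fed with a fresh value each step
w = tf.Variable(1.0, name='fc8_weights')     # hypothetical variable to rescale
loss = tf.square(w)
opt = tf.train.GradientDescentOptimizer(0.1)
grads_and_vars_mult = [(g * mult if v.op.name == 'fc8_weights' else g, v)
                       for g, v in opt.compute_gradients(loss)]
train_op = opt.apply_gradients(grads_and_vars_mult)
# Later: sess.run(train_op, feed_dict={mult: current_multiplier})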

