Gist by @senkumartup, created April 30, 2018 06:31
Out of Memory on Y510p - how to monitor memory allocation with TensorFlow

CIFAR notebook: tinyurl.com/yderp956

Machine: Lenovo Y510p with NVIDIA GeForce GT 750M (2 GB GDDR5 SDRAM)

The OOM error raised while training on CIFAR:
ResourceExhaustedError: OOM when allocating tensor with shape[16,54,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: batch_normalization_8/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NHWC", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](concatenate_7/concat, batch_normalization_8/gamma/read, batch_normalization_8/beta/read, batch_normalization_1/Const_4, batch_normalization_1/Const_4)]]
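The size of the failing tensor follows directly from its shape. Regardless of axis order (NHWC vs NCHW), a dense float32 tensor of shape [16, 54, 32, 32] needs the product of its dimensions times 4 bytes. A quick sanity check (plain Python, nothing TensorFlow-specific):

```python
# Memory footprint of the tensor that failed to allocate:
# shape [16, 54, 32, 32], dtype float (4 bytes per element).
from functools import reduce

def tensor_bytes(shape, dtype_bytes=4):
    """Bytes needed for a dense tensor of the given shape."""
    return reduce(lambda a, b: a * b, shape) * dtype_bytes

size = tensor_bytes([16, 54, 32, 32])
print(size)              # 3538944 bytes
print(size / (1 << 20))  # 3.375 MiB
```

So the single allocation that tipped the 2 GB card over was only about 3.4 MiB; the report below shows the many similarly sized activations and gradient buffers that had already filled the GPU.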
To monitor memory allocation, set report_tensor_allocations_upon_oom to True in tf.RunOptions and pass the options to model.compile:
import tensorflow as tf
from keras.optimizers import Adam

# Ask TensorFlow to dump per-tensor allocations whenever an OOM occurs.
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Determine loss function and optimizer; the run options are forwarded
# to the underlying session.run calls.
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'],
              options=run_opts)
Sample report
Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
3.62MiB from batch_normalization_5/FusedBatchNorm
3.38MiB from concatenate_7/concat
3.00MiB from batch_normalization_7/FusedBatchNorm
3.00MiB from concatenate_6/concat
3.00MiB from training_1/Adam/gradients/zeros_271
2.62MiB from batch_normalization_6/FusedBatchNorm
2.62MiB from concatenate_5/concat
2.62MiB from training_1/Adam/gradients/zeros_277
2.37MiB from concatenate_2/concat
2.25MiB from concatenate_4/concat
2.25MiB from training_1/Adam/gradients/zeros_283
1.88MiB from batch_normalization_4/FusedBatchNorm
1.88MiB from concatenate_3/concat
1.88MiB from training_1/Adam/gradients/zeros_289
1.50MiB from batch_normalization_3/FusedBatchNorm
1.50MiB from training_1/Adam/gradients/zeros_295
1.28MiB from training_1/Adam/gradients/zeros_307
1.12MiB from batch_normalization_2/FusedBatchNorm
1.12MiB from concatenate_1/concat
1.12MiB from training_1/Adam/gradients/zeros_301
768.5KiB from batch_normalization_1/FusedBatchNorm
768.0KiB from conv2d_1/convolution
384.0KiB from training_1/Adam/gradients/zeros_304
384.0KiB from dropout_1/cond/dropout/random_uniform/RandomUniform
384.0KiB from training_1/Adam/gradients/zeros_298
384.0KiB from dropout_2/cond/dropout/random_uniform/RandomUniform
384.0KiB from training_1/Adam/gradients/zeros_292
384.0KiB from dropout_3/cond/dropout/random_uniform/RandomUniform
384.0KiB from training_1/Adam/gradients/zeros_286
384.0KiB from dropout_4/cond/dropout/random_uniform/RandomUniform
384.0KiB from training_1/Adam/gradients/zeros_280
384.0KiB from dropout_5/cond/dropout/random_uniform/RandomUniform
384.0KiB from training_1/Adam/gradients/zeros_274
384.0KiB from dropout_6/cond/dropout/random_uniform/RandomUniform
384.0KiB from training_1/Adam/gradients/zeros_268
384.0KiB from dropout_7/cond/dropout/random_uniform/RandomUniform
16.5KiB from training_1/Adam/mul_181
16.5KiB from training_1/Adam/mul_183
15.2KiB from training_1/Adam/mul_571
15.2KiB from training_1/Adam/mul_766
15.2KiB from training_1/Adam/mul_573
15.2KiB from training_1/Adam/mul_768
15.2KiB from training_1/Adam/mul_166
15.2KiB from training_1/Adam/mul_168
15.2KiB from training_1/Adam/mul_376
15.2KiB from training_1/Adam/mul_378
14.0KiB from training_1/Adam/mul_556
14.0KiB from training_1/Adam/mul_751
14.0KiB from training_1/Adam/mul_558
14.0KiB from training_1/Adam/mul_753
14.0KiB from training_1/Adam/mul_151
14.0KiB from training_1/Adam/mul_153
14.0KiB from training_1/Adam/mul_361
14.0KiB from training_1/Adam/mul_363
12.8KiB from training_1/Adam/mul_541
12.8KiB from training_1/Adam/mul_736
12.8KiB from training_1/Adam/mul_543
12.8KiB from training_1/Adam/mul_738
12.8KiB from training_1/Adam/mul_136
12.8KiB from training_1/Adam/mul_138
12.8KiB from training_1/Adam/mul_346
12.8KiB from training_1/Adam/mul_348
12.2KiB from training_1/Adam/mul_781
12.2KiB from training_1/Adam/mul_783
11.5KiB from training_1/Adam/mul_526
11.5KiB from training_1/Adam/mul_721
Remaining 291 nodes with 523.8KiB
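Each report line pairs an allocation size with the op that owns it. To see which layers dominate, the lines can be parsed and totalled by op prefix; a minimal sketch (the regex and prefix grouping are my own, not part of TensorFlow's output format guarantees):

```python
import re

# Parse lines like "3.62MiB from batch_normalization_5/FusedBatchNorm"
# and total the bytes per top-level op prefix.
UNITS = {'KiB': 1 << 10, 'MiB': 1 << 20, 'GiB': 1 << 30}
LINE = re.compile(r'([\d.]+)(KiB|MiB|GiB) from (\S+)')

def totals_by_prefix(report):
    totals = {}
    for m in LINE.finditer(report):
        size = float(m.group(1)) * UNITS[m.group(2)]
        prefix = m.group(3).split('/')[0]  # group by first path segment
        totals[prefix] = totals.get(prefix, 0) + size
    return totals

# A few lines copied from the report above, as sample input.
report = """3.62MiB from batch_normalization_5/FusedBatchNorm
3.38MiB from concatenate_7/concat
768.0KiB from conv2d_1/convolution"""

for prefix, size in sorted(totals_by_prefix(report).items(),
                           key=lambda kv: -kv[1]):
    print('%-24s %10.1f KiB' % (prefix, size / 1024))
```

Aggregating like this makes it obvious that the bulk of the memory sits in the dense-block concatenations and batch-norm activations, not in the small Adam slot variables at the bottom of the report.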
[[Node: loss_1/mul/_3181 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_21641_loss_1/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
(allocation report repeated verbatim in the log; see above)
Caused by op u'batch_normalization_8/FusedBatchNorm', defined at:
File "/home/ssk/anaconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/home/ssk/anaconda2/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/ssk/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/home/ssk/anaconda2/lib/python2.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/home/ssk/anaconda2/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 486, in start
self.io_loop.start()
File "/home/ssk/anaconda2/lib/python2.7/site-packages/tornado/ioloop.py", line 1008, in start
self._run_callback(self._callbacks.popleft())
File "/home/ssk/anaconda2/lib/python2.7/site-packages/tornado/ioloop.py", line 759, in _run_callback
ret = callback()
File "/home/ssk/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 276, in null_wrapper
return fn(*args, **kwargs)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 536, in <lambda>
self.io_loop.add_callback(lambda : self._handle_events(self.socket, 0))
File "/home/ssk/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
self._handle_recv()
File "/home/ssk/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 276, in null_wrapper
return fn(*args, **kwargs)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
handler(stream, idents, msg)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2714, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2818, in run_ast_nodes
if self.run_code(code, result):
File "/home/ssk/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2878, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-9-2efee5f85923>", line 7, in <module>
First_Block = add_denseblock(First_Conv2D, num_filter, dropout_rate)
File "<ipython-input-6-7ad1df438d81>", line 6, in add_denseblock
BatchNorm = BatchNormalization()(temp)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 619, in __call__
output = self.call(inputs, **kwargs)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/keras/layers/normalization.py", line 181, in call
epsilon=self.epsilon)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 1827, in normalize_batch_in_training
epsilon=epsilon)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 1802, in _fused_normalize_batch_in_training
data_format=tf_data_format)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 906, in fused_batch_norm
name=name)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 2224, in _fused_batch_norm
is_training=is_training, name=name)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/home/ssk/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,54,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: batch_normalization_8/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NHWC", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](concatenate_7/concat, batch_normalization_8/gamma/read, batch_normalization_8/beta/read, batch_normalization_1/Const_4, batch_normalization_1/Const_4)]]
(allocation report repeated verbatim in the log; see above)