| | GPU fp32 | GPU fp16 | TPU fp32 | TPU fp16 |
|---|---|---|---|---|
| Largest batch size | 32 | 64 | 1024+ | 1024+ |
| Min-inference batch size | 4 | 64 | 1024 | 1024 |
| Masker inference (s/i) | 0.059 | 0.019 | 5.76e-5 | 6.15e-5 |
| Painter inference (s/i) | 0.068 | 0.041 | 4.69e-5 | 5.08e-5 |
| Masker + painter inference (i/s) | ~8 | ~17 | ~9570 | ~8900 |
| Inference loop (s) | 130.382 | 60.567 | 0.123 | 0.134 |
| Full dataset with loading (s) | 151.546 | 76.953 | 16.146 | 17.272 |
| Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |
- Largest batch size => largest batch size that fits in memory (testing powers of 2)
- Min-inference batch size => batch size with the smallest pure on-device inference time. We look at this metric because we assume loading time is linear in batch size.
- Masker inference => average time per image for the masker's inference (s/i = seconds/image)
- Painter inference => average time per image for the painter's inference (s/i = seconds/image)
- Masker + painter inference => number of images per second for pure inference (i/s)
- Inference loop => smallest on-device inference time for the entire dataset
- Full dataset with loading => overall time to process the entire dataset: numpy array -> torch tensor -> transforms -> inference, but not back to CPU
- Device -> CPU => (average time to get one batch's inference back from the device) * (number of batches)
- data => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU and TPUs still have the back-to-CPU issue, I did not try larger batches.
- inf => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`; I have to kill it (`kill -9`).
- TPU fp16 => prepend `XLA_USE_BF16=1` to the command
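For reference, the i/s row is the reciprocal of the two per-image times summed (e.g. 1/(0.059 + 0.068) ≈ 8 i/s for GPU fp32). A minimal sketch of that bookkeeping, with dummy callables standing in for the masker and painter models (the real script is not reproduced here):

```python
import time

def seconds_per_image(infer, batches, batch_size):
    """Average pure-inference seconds per image over a list of batches."""
    start = time.perf_counter()
    for batch in batches:
        infer(batch)
    elapsed = time.perf_counter() - start
    return elapsed / (len(batches) * batch_size)

# Dummy stand-ins for the masker and painter models
masker = lambda batch: [x * 2.0 for x in batch]
painter = lambda batch: [x + 1.0 for x in batch]

batches = [[0.0] * 32 for _ in range(4)]  # 4 batches of 32 "images"
masker_spi = seconds_per_image(masker, batches, 32)
painter_spi = seconds_per_image(painter, batches, 32)

# "Masker + painter inference" row: images/second for the combined pipeline
images_per_second = 1.0 / (masker_spi + painter_spi)
```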
Hello @vict0rsch,
A couple of comments on this script:
It's only when you call `xm.mark_step()`, or send the tensor to device as you're doing here, that the execution happens. `Timer('inference')` only measures construction of the IR graph. `Timer('back from device')` measures compilation, execution, and data transfer.
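The lazy-execution point can be illustrated without a TPU. `LazyTensor` below is a made-up stand-in for XLA's deferred IR graph, not part of torch_xla; it only shows why a timer around the forward pass measures graph construction, while the real work lands at the sync point:

```python
import time

class LazyTensor:
    """Toy stand-in for an XLA tensor: ops only record a graph; nothing runs
    until a sync point (the analogue of xm.mark_step() or a .cpu() transfer)."""
    def __init__(self, thunks=()):
        self.thunks = list(thunks)

    def apply(self, fn):
        # Recording an op is cheap: just append it to the graph
        return LazyTensor(self.thunks + [fn])

    def materialize(self):
        # This is where "compilation + execution" actually happens
        value = 0.0
        for fn in self.thunks:
            value = fn(value)
        return value

x = LazyTensor()
for _ in range(3):
    # Each "expensive" op is only recorded, not executed
    x = x.apply(lambda v: v + sum(i * i for i in range(200_000)))

start = time.perf_counter()
y = x.apply(lambda v: v)      # analogue of Timer('inference'): graph build only
build_time = time.perf_counter() - start

start = time.perf_counter()
result = y.materialize()      # analogue of Timer('back from device')
exec_time = time.perf_counter() - start
# exec_time dwarfs build_time: the work happens at the sync point
```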