Largest batch size => largest batch size that fits in memory (trying multiples of 2)
Min-inference batch size => batch size with the smallest pure on-device inference time. We look at this metric because we assume loading time is linear in batch size
Masker inference => average time per image for the masker's inferences (s/i = seconds/image)
Painter inference => average time per image for the painter's inferences (s/i = seconds/image)
Masker + painter inference => number of images per second for pure inference (i/s)
Inference loop => smallest on-device inference time for the entire dataset
Full dataset with loading => overall time to process the entire dataset: numpy array -> torch tensor -> transforms -> inference (not including the transfer back to CPU)
Device -> CPU => (average time to get one batch's inference results back from the device) * (number of batches)
More
data => 1024 images (a list of 100 distinct images, repeated 5+ times). Because runs get really long on GPU, and TPUs still have the back-to-CPU issue, I did not try larger sets. Images have a wide range of shapes but are all transformed into a 3 x 640 x 640 tensor
inf => after 10+ minutes, still no answer. The process cannot even be stopped with ctrl+c; it has to be killed (kill -9)
In torch_xla, most of the Python code only builds an IR graph and sends it to the device. This is very fast. However, due to asynchronous execution, the fact that a line has been executed does not mean its results have been materialized.
Execution only happens when you mark the step with xm.mark_step(), or when you send the tensor back to the CPU as you're doing here.
Therefore, timing the code is not straightforward.
On top of this, when a TPU runs the program for the first time, it has to compile the graph to a lower-level representation before executing on the device. This compilation is known to be slow.
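The build-then-materialize behavior can be illustrated with a toy lazy-evaluation sketch in plain Python. This is only an analogy for the idea described above, not the torch_xla implementation; all names here are made up:

```python
# Toy illustration of deferred ("lazy") execution: operations are cheap
# to record, and actual computation only happens at an explicit barrier,
# analogous to xm.mark_step() in torch_xla.

class LazyTensor:
    def __init__(self, thunk):
        self._thunk = thunk   # deferred computation ("IR graph" node)
        self._value = None    # filled in only at materialization

    def __add__(self, other):
        # Building the graph is fast: we just compose thunks,
        # nothing is computed yet.
        return LazyTensor(lambda: self.materialize() + other.materialize())

    def materialize(self):
        # The analogue of mark_step(): actually run the recorded work.
        if self._value is None:
            self._value = self._thunk()
        return self._value

def constant(v):
    return LazyTensor(lambda: v)

x = constant(2) + constant(3)  # returns immediately; nothing computed
assert x._value is None        # results not materialized yet
assert x.materialize() == 5    # computation happens here
```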
So, this code really does the following underneath:
- Timer('inference') only measures construction of the IR graph.
- Timer('back from device') measures compilation, execution, and data transfer.
If you did this forward pass 1000 times, the first few ones will be by far slowest due to compilation. So I suggest the following modifications to above:
# DO NOT MEASURE THIS -- GETS RID OF COMPILATION OVERHEADforiinrange(20):
y=model(data)
xm.mark_step() # at this point results get materialized.# MEASURE THISwithTimer('inference'):
y=model(data)
xm.mark_step()
print(y.shape)
withTimer("back from device"):
y=y.cpu().numpy()
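For reference, the Timer used above is presumably a simple wall-clock context manager; its implementation isn't shown in the thread, so this is only a minimal sketch of the assumed interface:

```python
import time

class Timer:
    """Minimal wall-clock timer context manager.
    (Assumed interface -- the thread's actual Timer is not shown.)"""

    def __init__(self, name):
        self.name = name
        self.elapsed = None  # set on exit, in seconds

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.name}: {self.elapsed:.4f}s")

# usage
with Timer('inference') as t:
    sum(range(1000))  # stand-in for the timed work
```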
@taylanbil this changes everything... thanks for pointing it out. I should always call xm.mark_step() then, right? I don't think I was. I'm sorry for having missed this!
I'll update the above chart accordingly and get back to you asap. Thank you very much.
In an ordinary training script, such as our imagenet example, you'll notice that we don't actually use xm.mark_step. That is because it is called every time the XLA dataloader yields. In this script you're not using a dataloader, so you'll have to call it yourself.
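The "mark step on every yield" behavior can be sketched as a generator wrapper. The names here are hypothetical, for illustration only; the real logic lives in torch_xla's parallel loader:

```python
def stepping_loader(loader, mark_step):
    """Yield batches, calling mark_step() between iterations --
    a toy sketch of what the XLA dataloader does implicitly."""
    for batch in loader:
        yield batch
        mark_step()  # materialize the graph built for the previous batch

# usage, with a stand-in for xm.mark_step:
steps = []
for batch in stepping_loader([1, 2, 3], lambda: steps.append('step')):
    pass  # forward pass would go here
# one step is marked per batch consumed
```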
As to your last comment: yes, if you'd like to measure the two models separately, that's how you'd do it. I'd run the whole thing 10-20 times first to get rid of compilation overhead, and then measure as pointed out in my first comment.
Also, note that every time you mark the step, you force execution, and most of the time you'll get better performance if you don't do that, i.e. if you mark the step less frequently and let a larger graph accumulate before executing.
The exception is when the accumulated IR graph grows too large and executing it actually causes an OOM or swapping on the TPU host.