| | GPU fp32 | GPU fp16 | TPU fp32 | TPU fp16 |
|---|---|---|---|---|
| Largest batch size | 32 | 64 | 1024+ | 1024+ |
| Min-inference batch size | 4 | 64 | 1024 | 1024 |
| Masker inference (s/i) | 0.059 | 0.019 | 5.76e-5 | 6.15e-5 |
| Painter inference (s/i) | 0.068 | 0.041 | 4.69e-5 | 5.08e-5 |
| Masker + painter inference (i/s) | ~8 | ~17 | ~9570 | ~8900 |
| Inference loop (s) | 130.382 | 60.567 | 0.123 | 0.134 |
| Full dataset with loading (s) | 151.546 | 76.953 | 16.146 | 17.272 |
| Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |
- Largest batch size => largest batch size that fits in memory (testing powers of 2)
- Min-inference batch size => batch size with the smallest pure on-device inference time. We look at this metric because we assume loading time is linear in batch size.
- Masker inference => average time per image for the masker's inference (s/i = seconds/image)
- Painter inference => average time per image for the painter's inference (s/i = seconds/image)
- Masker + painter inference => number of images per second for pure inference (i/s)
- Inference loop => smallest on-device inference time for the entire dataset
- Full dataset with loading => overall time to process the entire dataset: numpy array -> torch tensor -> transforms -> inference, but not back to CPU
- Device -> CPU => (average time to get one batch's inference back from the device) * (number of batches)
- data => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU and TPUs still have the back-to-CPU issue, I did not try larger batches.
- inf => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`; I have to kill it (`kill -9`).
- TPU fp16 => prepend `XLA_USE_BF16=1` to the command
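For reference, the i/s row is the reciprocal of the two per-image times summed (e.g. 1/(0.059 + 0.068) ≈ 8 i/s for GPU fp32). A minimal sketch of that bookkeeping, with dummy callables standing in for the masker and painter models (the real script is not reproduced here):

```python
import time

def seconds_per_image(infer, batches, batch_size):
    """Average pure-inference seconds per image over a list of batches."""
    start = time.perf_counter()
    for batch in batches:
        infer(batch)
    elapsed = time.perf_counter() - start
    return elapsed / (len(batches) * batch_size)

# Dummy stand-ins for the masker and painter models
masker = lambda batch: [x * 2.0 for x in batch]
painter = lambda batch: [x + 1.0 for x in batch]

batches = [[0.0] * 32 for _ in range(4)]  # 4 batches of 32 "images"
masker_spi = seconds_per_image(masker, batches, 32)
painter_spi = seconds_per_image(painter, batches, 32)

# "Masker + painter inference" row: images/second for the combined pipeline
images_per_second = 1.0 / (masker_spi + painter_spi)
```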
Hello @vict0rsch,
A couple of comments on this script:
It's only when you call `xm.mark_step()`, or send the tensor to device as you're doing here, that the execution happens. `Timer('inference')` only measures construction of the IR graph. `Timer('back from device')` measures compilation, execution, and data transfer.
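The lazy-execution point can be illustrated without a TPU. `LazyTensor` below is a made-up stand-in for XLA's deferred IR graph, not part of torch_xla; it only shows why a timer around the forward pass measures graph construction, while the real work lands at the sync point:

```python
import time

class LazyTensor:
    """Toy stand-in for an XLA tensor: ops only record a graph; nothing runs
    until a sync point (the analogue of xm.mark_step() or a .cpu() transfer)."""
    def __init__(self, thunks=()):
        self.thunks = list(thunks)

    def apply(self, fn):
        # Recording an op is cheap: just append it to the graph
        return LazyTensor(self.thunks + [fn])

    def materialize(self):
        # This is where "compilation + execution" actually happens
        value = 0.0
        for fn in self.thunks:
            value = fn(value)
        return value

x = LazyTensor()
for _ in range(3):
    # Each "expensive" op is only recorded, not executed
    x = x.apply(lambda v: v + sum(i * i for i in range(200_000)))

start = time.perf_counter()
y = x.apply(lambda v: v)      # analogue of Timer('inference'): graph build only
build_time = time.perf_counter() - start

start = time.perf_counter()
result = y.materialize()      # analogue of Timer('back from device')
exec_time = time.perf_counter() - start
# exec_time dwarfs build_time: the work happens at the sync point
```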