Gist by vict0rsch (`vict0rsch/5063b4ce5c14424507b6448a1ac32045`), last active October 30, 2020 23:42.
# Benchmarking TPUs vs GPUs

|                                  | GPU fp32 | GPU fp16 | TPU fp32* | TPU fp16* |
| :------------------------------- | -------: | -------: | --------: | --------: |
| Largest batch size               |       32 |       64 |      1024 |      2048 |
| Min-inference batch size         |        4 |       64 |      1024 |      2048 |
| Masker inference (s/i)           |    0.059 |    0.019 |   3.80e-5 |   1.36e-5 |
| Painter inference (s/i)          |    0.068 |    0.041 |   2.53e-5 |   1.16e-5 |
| Inference loop (s)               |  130.382 |   60.567 |     0.073 |    0.0392 |
| Inference loop (i/s)             |       ~8 |      ~17 |   ~14 000 |   ~26 000 |
| Full dataset with loading (s)    |  151.546 |   76.953 |     18.05 |     15.31 |
| Total Device -> CPU (s)          |    2.816 |    2.528 |       inf |       inf |

## Rows

* `Largest batch size` => largest batch size that would fit in memory (trying multiples of 2)
* `Min-inference batch size` => batch size that yields the smallest pure on-device inference time. We look at this metric because we assume loading time is linear in batch size.
* `Masker inference` => average time per image for the masker's inferences (`s/i` = seconds/image)
* `Painter inference` => average time per image for the painter's inferences (`s/i` = seconds/image)
* `Inference loop` => smallest on-device inference time for the entire dataset (`s` = total seconds, `i/s` = images/second)
* `Full dataset with loading` => overall time to process the _entire_ dataset: numpy array -> torch tensor -> transforms -> inference, but **not back to cpu**
* `Device -> CPU` => (average time to get one batch of inferences back from the device) * (number of batches)

## Comments

* **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because processing gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger datasets. Images have a wide range of shapes but are all transformed into a `3 x 640 x 640` tensor.
* **TPU fp32** and **TPU fp16** => numbers in these columns were computed after loading **4096** images, to have respectively 4 and 2 batches. To account for 4 times more data than the GPU columns, the measures for `Inference loop` and `Full dataset with loading` are divided by 4 (which could be slightly off).
* `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`; have to kill it (`kill -9`).
* `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
* TPU: time to perform transforms (per sample): 0.004; time to send to device (per sample): 0.011
* Numbers on TPU have a high variance (not something I measured, but observed: some full runs take 60 seconds, others 75 or 80, with the same params).
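The `Inference loop (i/s)` row is simply the dataset size divided by the `Inference loop (s)` row; a quick cross-check of the rounded figures, using the 1024-image dataset size from the comments above:

```python
dataset_size = 1024  # images, as described in the Comments section

inference_loop_s = {
    "GPU fp32": 130.382,
    "GPU fp16": 60.567,
    "TPU fp32": 0.073,
    "TPU fp16": 0.0392,
}

for column, seconds in inference_loop_s.items():
    # Reproduces the ~8 / ~17 / ~14 000 / ~26 000 figures from the table.
    print(f"{column}: {dataset_size / seconds:,.0f} i/s")
```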
## Slow back to cpu

```python
import time

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm


class Timer:
    def __init__(self, name="", precision=3):
        self.name = name
        self.precision = precision

    def format(self, n):
        return f"{n:.{self.precision}f}"

    def __enter__(self):
        """Start a new timer as a context manager"""
        self._start_time = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        """Stop the context manager timer and print the elapsed time"""
        elapsed = time.perf_counter() - self._start_time
        print(f"[{self.name}] Elapsed time: {self.format(elapsed)}")


if __name__ == "__main__":
    device = xm.xla_device()
    torch.set_grad_enabled(False)
    model = nn.Sequential(
        nn.Conv2d(3, 256, 3, 1, 1),
        nn.Conv2d(256, 512, 3, 1, 1),
        nn.Conv2d(512, 256, 3, 1, 1),
        nn.Conv2d(256, 3, 3, 1, 1),
    ).to(device)
    data = torch.rand(2, 3, 640, 640, device=device)

    with Timer("inference", precision=6):
        y = model(data)
    print(y.shape)

    with Timer("back from device"):
        y = y.cpu().numpy()
```

```
[inference] Elapsed time: 0.000643
torch.Size([2, 3, 640, 640])
[back from device] Elapsed time: 11.328
```
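Why is `back from device` so slow while `inference` is near-instant? My reading (an assumption about torch_xla's execution model, not something this benchmark verifies) is that XLA tensors are lazily evaluated: `model(data)` mostly records the computation, and the cost of compiling, running, and transferring it is paid when `y.cpu()` first needs the values. The timing pattern can be reproduced with plain Python deferred execution:

```python
import time


def heavy_compute():
    # Stand-in for the deferred device work (graph compilation + execution).
    return sum(i * i for i in range(2_000_000))


t0 = time.perf_counter()
thunk = heavy_compute  # "recording" the work: nothing executes yet
record_time = time.perf_counter() - t0

t0 = time.perf_counter()
result = thunk()  # forcing the value, like `y.cpu()` on an XLA tensor
force_time = time.perf_counter() - t0

# Nearly all the time is attributed to the step that forces evaluation.
print(record_time < force_time)
```

If that reading is right, flushing pending work with `xm.mark_step()` before starting the second timer should move most of the cost out of the transfer measurement.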
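Going back to the first row of the table, the `Largest batch size` search tries batch sizes in multiples of 2; a generic sketch of that search (the `fits_in_memory` probe is hypothetical — real code would run one forward pass and catch the framework's out-of-memory error):

```python
def largest_power_of_two_batch(fits_in_memory, start=1, limit=1 << 16):
    """Double the batch size until the probe fails; return the last success."""
    best = None
    bs = start
    while bs <= limit and fits_in_memory(bs):
        best = bs
        bs *= 2
    return best


# Toy probe: pretend anything up to 100 samples fits.
print(largest_power_of_two_batch(lambda bs: bs <= 100))  # -> 64
```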