@vict0rsch
Last active October 30, 2020 23:42
Revisions

  1. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion in benchmark-v0.md

    ```diff
    @@ -6,7 +6,7 @@
     | Painter inference (s/i) | 0.068 | 0.041 | 2.53e-5 | 1.16e-5 |
     | Inference loop (s) | 130.382 | 60.567 | 0.073 | 0.0392 |
     | Inference loop (i/s) | ~8 | ~17 | ~14 000 | ~26 000 |
    -| Full dataset with loading (s) | 151.546 | 76.953 | 18.05 | 18.05 |
    +| Full dataset with loading (s) | 151.546 | 76.953 | 18.05 | 15.31 |
     | Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |

     ## Rows
    @@ -25,6 +25,7 @@
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
     * TPU: time to perform transforms (per sample): 0.004, time to send to device (per sample): 0.011
    +* Numbers on TPU have a high variance (not something I measured, but observed: some full processings take 60 seconds, others 75 or 80, with the same params)

     ## Slow back to cpu
    ```
  2. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions in benchmark-v0.md

    ```diff
    @@ -24,6 +24,7 @@
     * **TPU fp32** and **TPU fp16** => Numbers in these columns were computed after loading **4096** images, to have respectively 4 and 2 batches. To account for 4 times more data than the GPU columns, measures for `Inference loop` and `Full dataset with loading` are divided by 4 (which could be slightly off).
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
    +* TPU: time to perform transforms (per sample): 0.004, time to send to device (per sample): 0.011

     ## Slow back to cpu
    ```
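    The division-by-4 in the TPU note above is just a per-image normalization: timings measured over 4096 images are scaled down to the 1024-image workload of the GPU columns. A minimal sketch of that normalization (the function name and the 0.292 s input are illustrative, not from the gist):

    ```python
    # Normalize a timing measured on 4096 images down to the 1024-image
    # workload used for the GPU columns (ratio = 4096 / 1024 = 4).
    def normalize(measured_seconds, measured_images=4096, reference_images=1024):
        ratio = measured_images / reference_images
        return measured_seconds / ratio

    # Illustrative numbers: a 0.292 s inference loop over 4096 images
    # corresponds to ~0.073 s for 1024 images, as reported in the table.
    print(normalize(0.292))  # 0.073
    ```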

  3. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 12 additions and 13 deletions.
    25 changes: 12 additions & 13 deletions in benchmark-v0.md

    ```diff
    @@ -1,15 +1,13 @@
     # Benchmarking TPUs vs GPUs

    -| | GPU fp32 | GPU fp16 | TPU fp32 | TPU fp16 |
    -| :------------------------------- | -------: | -------: | -------: | -------: |
    -| Largest batch size | 32 | 64 | 1024+ | 1024+ |
    -| Min-inference batch size | 4 | 64 | 1024 | 1024 |
    -| Masker inference (s/i) | 0.059 | 0.019 | 5.76e-5 | 6.15e-5 |
    -| Painter inference (s/i) | 0.068 | 0.041 | 4.69e-5 | 5.08e-5 |
    -| Masker + painter inference (i/s) | ~8 | ~17 | ~9570 | ~8900 |
    -| Inference loop (s) | 130.382 | 60.567 | 0.123 | 0.134 |
    -| Full dataset with loading (s) | 151.546 | 76.953 | 16.146 | 17.272 |
    -| Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |
    +| | GPU fp32 | GPU fp16 | TPU fp32* | TPU fp16* |
    +| :------------------------------- | -------: | -------: | --------: | --------: |
    +| Largest batch size | 32 | 64 | 1024 | 2048 |
    +| Min-inference batch size | 4 | 64 | 1024 | 2048 |
    +| Masker inference (s/i) | 0.059 | 0.019 | 3.80e-5 | 1.36e-5 |
    +| Painter inference (s/i) | 0.068 | 0.041 | 2.53e-5 | 1.16e-5 |
    +| Inference loop (s) | 130.382 | 60.567 | 0.073 | 0.0392 |
    +| Inference loop (i/s) | ~8 | ~17 | ~14 000 | ~26 000 |
    +| Full dataset with loading (s) | 151.546 | 76.953 | 18.05 | 18.05 |
    +| Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |

     ## Rows
     * `Largest batch size` => largest batch size that would fit in memory (for multiples of 2)
    @@ -21,8 +19,9 @@
     * `Full dataset with loading` => overall time to process the _entire_ dataset: numpy array -> torch tensor -> transforms -> inference, but **not back to cpu**
     * `Device -> CPU` => (average time taken to get the inference back from the device for 1 batch) * (number of batches)

    -## More
    +## Comments
     * **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches. Images have a wide range of shapes but are all transformed into a `3 x 640 x 640` tensor
    +* **TPU fp32** and **TPU fp16** => Numbers in these columns were computed after loading **4096** images, to have respectively 4 and 2 batches. To account for 4 times more data than the GPU columns, measures for `Inference loop` and `Full dataset with loading` are divided by 4 (which could be slightly off).
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
    ```

  4. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 8 additions and 8 deletions.
    16 changes: 8 additions & 8 deletions in benchmark-v0.md

    ````diff
    @@ -61,10 +61,10 @@ if __name__ == "__main__":
         torch.set_grad_enabled(False)
         model = nn.Sequential(
             *[
    -            nn.Conv2d(3, 256, 3, 2),
    -            nn.Conv2d(256, 512, 3, 2),
    -            nn.Conv2d(512, 256, 3, 2),
    -            nn.Conv2d(256, 3, 3, 2, 1),
    +            nn.Conv2d(3, 256, 3, 1, 1),
    +            nn.Conv2d(256, 512, 3, 1, 1),
    +            nn.Conv2d(512, 256, 3, 1, 1),
    +            nn.Conv2d(256, 3, 3, 1, 1),
             ]
         ).to(device)
    @@ -73,11 +73,11 @@ if __name__ == "__main__":
         y = model(data)
         print(y.shape)
         with Timer("back from device"):
    -        y = y.detach().cpu().numpy()
    +        y = y.cpu().numpy()
     ```

     ```
    -[inference] Elapsed time: 0.000500
    -torch.Size([2, 3, 40, 40])
    -[back from device] Elapsed time: 5.083
    +[inference] Elapsed time: 0.000643
    +torch.Size([2, 3, 640, 640])
    +[back from device] Elapsed time: 11.328
     ```
    ````
  5. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 56 additions and 0 deletions.
    56 changes: 56 additions & 0 deletions in benchmark-v0.md

    ````diff
    @@ -25,3 +25,59 @@
     * **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches. Images have a wide range of shapes but are all transformed into a `3 x 640 x 640` tensor
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
    +
    +## Slow back to cpu
    +
    +```python
    +import time
    +import torch
    +import torch.nn as nn
    +
    +import torch_xla.core.xla_model as xm
    +
    +
    +class Timer:
    +    def __init__(self, name="", store=None, precision=3):
    +        self.name = name
    +        self.precision = precision
    +
    +    def format(self, n):
    +        return f"{n:.{self.precision}f}"
    +
    +    def __enter__(self):
    +        """Start a new timer as a context manager"""
    +        self._start_time = time.perf_counter()
    +        return self
    +
    +    def __exit__(self, *exc_info):
    +        """Stop the context manager timer"""
    +        t = time.perf_counter()
    +        new_time = t - self._start_time
    +        print(f"[{self.name}] Elapsed time: {self.format(new_time)}")
    +
    +
    +if __name__ == "__main__":
    +    device = xm.xla_device()
    +    torch.set_grad_enabled(False)
    +    model = nn.Sequential(
    +        *[
    +            nn.Conv2d(3, 256, 3, 2),
    +            nn.Conv2d(256, 512, 3, 2),
    +            nn.Conv2d(512, 256, 3, 2),
    +            nn.Conv2d(256, 3, 3, 2, 1),
    +        ]
    +    ).to(device)
    +
    +    data = torch.rand(2, 3, 640, 640, device=device)
    +    with Timer("inference", precision=6):
    +        y = model(data)
    +    print(y.shape)
    +    with Timer("back from device"):
    +        y = y.detach().cpu().numpy()
    +```
    +
    +```
    +[inference] Elapsed time: 0.000500
    +torch.Size([2, 3, 40, 40])
    +[back from device] Elapsed time: 5.083
    +```
    ````
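    The `Timer` helper in the listing above is plain Python and can be exercised without a TPU. A quick standalone sketch (the class is copied from the listing; note its `store` argument is unused there, and `precision` only controls the printed formatting):

    ```python
    import time


    class Timer:
        """Minimal timing context manager, as in the benchmark script above."""

        def __init__(self, name="", store=None, precision=3):
            self.name = name
            self.precision = precision

        def format(self, n):
            return f"{n:.{self.precision}f}"

        def __enter__(self):
            self._start_time = time.perf_counter()
            return self

        def __exit__(self, *exc_info):
            elapsed = time.perf_counter() - self._start_time
            print(f"[{self.name}] Elapsed time: {self.format(elapsed)}")


    # `format` rounds to `precision` decimal places:
    print(Timer(precision=2).format(1.2345))  # 1.23

    with Timer("sleep"):
        time.sleep(0.05)  # prints roughly: [sleep] Elapsed time: 0.050
    ```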
  6. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion in benchmark-v0.md

    ```diff
    @@ -22,6 +22,6 @@
     * `Device -> CPU` => (average time taken to get the inference back from the device for 1 batch) * (number of batches)

     ## More
    -* **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches
    +* **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches. Images have a wide range of shapes but are all transformed into a `3 x 640 x 640` tensor
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
    ```
  7. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 4 additions and 0 deletions.
    4 changes: 4 additions & 0 deletions in benchmark-v0.md

    ```diff
    @@ -1,3 +1,5 @@
    +# Benchmarking TPUs vs GPUs
    +
     | | GPU fp32 | GPU fp16 | TPU fp32 | TPU fp16 |
     | :------------------------------- | -------: | -------: | -------: | -------: |
     | Largest batch size | 32 | 64 | 1024+ | 1024+ |
    @@ -9,6 +11,7 @@
     | Full dataset with loading (s) | 151.546 | 76.953 | 16.146 | 17.272 |
     | Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |

    +## Rows
     * `Largest batch size` => largest batch size that would fit in memory (for multiples of 2)
     * `Min-inference batch size` => batch size associated with the smallest pure on-device inference time. We look at this metric because we assume linear loading time
     * `Masker inference` => average time per image for the masker's inferences (`s/i` = seconds/image)
    @@ -18,6 +21,7 @@
     * `Full dataset with loading` => overall time to process the _entire_ dataset: numpy array -> torch tensor -> transforms -> inference, but **not back to cpu**
     * `Device -> CPU` => (average time taken to get the inference back from the device for 1 batch) * (number of batches)

    +## More
     * **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
    ```
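    Prepending `XLA_USE_BF16=1` as noted above sets the environment variable for that single command, which makes torch_xla store float tensors as bfloat16 on the TPU. A small sketch of the prefix form (the `benchmark.py` script name is illustrative, not from the gist):

    ```shell
    # With torch_xla installed, the fp16 runs would look like:
    #   XLA_USE_BF16=1 python benchmark.py   # benchmark.py is illustrative
    # The prefix form scopes the variable to that single invocation:
    XLA_USE_BF16=1 sh -c 'echo "inside: XLA_USE_BF16=$XLA_USE_BF16"'
    echo "after: XLA_USE_BF16=${XLA_USE_BF16:-unset}"
    ```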
  8. vict0rsch created this gist Oct 30, 2020.
    23 changes: 23 additions & 0 deletions in benchmark-v0.md

    | | GPU fp32 | GPU fp16 | TPU fp32 | TPU fp16 |
    | :------------------------------- | -------: | -------: | -------: | -------: |
    | Largest batch size | 32 | 64 | 1024+ | 1024+ |
    | Min-inference batch size | 4 | 64 | 1024 | 1024 |
    | Masker inference (s/i) | 0.059 | 0.019 | 5.76e-5 | 6.15e-5 |
    | Painter inference (s/i) | 0.068 | 0.041 | 4.69e-5 | 5.08e-5 |
    | Masker + painter inference (i/s) | ~8 | ~17 | ~9570 | ~8900 |
    | Inference loop (s) | 130.382 | 60.567 | 0.123 | 0.134 |
    | Full dataset with loading (s) | 151.546 | 76.953 | 16.146 | 17.272 |
    | Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |

    * `Largest batch size` => largest batch size that would fit in memory (for multiples of 2)
    * `Min-inference batch size` => batch size associated with the smallest pure on-device inference time. We look at this metric because we assume linear loading time
    * `Masker inference` => average time per image for the masker's inferences (`s/i` = seconds/image)
    * `Painter inference` => average time per image for the painter's inferences (`s/i` = seconds/image)
    * `Masker + painter inference` => number of images per second for pure inference (`i/s`)
    * `Inference loop` => smallest on-device inference time for the entire dataset
    * `Full dataset with loading` => overall time to process the _entire_ dataset: numpy array -> torch tensor -> transforms -> inference, but **not back to cpu**
    * `Device -> CPU` => (average time taken to get the inference back from the device for 1 batch) * (number of batches)

    * **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches
    * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
    * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
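    The `i/s` row follows directly from the per-image rows: throughput is the reciprocal of the summed masker and painter times. A quick sanity check against the GPU columns of the table (the helper function name is illustrative):

    ```python
    # Masker + painter throughput (images/second) from per-image timings.
    def throughput(masker_s_per_img, painter_s_per_img):
        return 1 / (masker_s_per_img + painter_s_per_img)

    print(round(throughput(0.059, 0.068)))  # GPU fp32 -> 8, matching "~8"
    print(round(throughput(0.019, 0.041)))  # GPU fp16 -> 17, matching "~17"
    ```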