@vict0rsch
Last active October 30, 2020 23:42
Revisions

  1. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion in benchmark-v0.md

    ```diff
    @@ -6,7 +6,7 @@
     | Painter inference (s/i) | 0.068 | 0.041 | 2.53e-5 | 1.16e-5 |
     | Inference loop (s) | 130.382 | 60.567 | 0.073 | 0.0392 |
     | Inference loop (i/s) | ~8 | ~17 | ~14 000 | ~26 000 |
    -| Full dataset with loading (s) | 151.546 | 76.953 | 18.05 | 18.05 |
    +| Full dataset with loading (s) | 151.546 | 76.953 | 18.05 | 15.31 |
     | Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |

     ## Rows
    @@ -25,6 +25,7 @@
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
     * TPU: time to perform transforms (per sample): 0.004, time to send to device (per sample): 0.011
    +* Numbers on TPU have a high variance (not something I measured, but observed: some full processings take 60 seconds, others 75 or 80, with the same params)

     ## Slow back to cpu
    ```
  2. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions in benchmark-v0.md

    ```diff
    @@ -24,6 +24,7 @@
     * **TPU fp32** and **TPU fp16** => Numbers in these columns were computed after loading **4096** images, to have respectively 4 and 2 batches. To account for 4 times more data than the GPU columns, measures for `Inference loop` and `Full dataset with loading` are divided by 4 (which could be slightly off).
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
    +* TPU: time to perform transforms (per sample): 0.004, time to send to device (per sample): 0.011

     ## Slow back to cpu
    ```
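    The division-by-4 in the TPU note above is just a per-image normalization: timings measured over 4096 images are scaled down to the 1024-image workload of the GPU columns. A minimal sketch of that normalization (the function name and the 0.292 s input are illustrative, not from the gist):

    ```python
    # Normalize a timing measured on 4096 images down to the 1024-image
    # workload used for the GPU columns (ratio = 4096 / 1024 = 4).
    def normalize(measured_seconds, measured_images=4096, reference_images=1024):
        ratio = measured_images / reference_images
        return measured_seconds / ratio

    # Illustrative numbers: a 0.292 s inference loop over 4096 images
    # corresponds to ~0.073 s for 1024 images, as reported in the table.
    print(normalize(0.292))  # 0.073
    ```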

  3. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 12 additions and 13 deletions.
    25 changes: 12 additions & 13 deletions in benchmark-v0.md

    ```diff
    @@ -1,15 +1,13 @@
     # Benchmarking TPUs vs GPUs

    -| | GPU fp32 | GPU fp16 | TPU fp32 | TPU fp16 |
    -| :------------------------------- | -------: | -------: | -------: | -------: |
    -| Largest batch size | 32 | 64 | 1024+ | 1024+ |
    -| Min-inference batch size | 4 | 64 | 1024 | 1024 |
    -| Masker inference (s/i) | 0.059 | 0.019 | 5.76e-5 | 6.15e-5 |
    -| Painter inference (s/i) | 0.068 | 0.041 | 4.69e-5 | 5.08e-5 |
    -| Masker + painter inference (i/s) | ~8 | ~17 | ~9570 | ~8900 |
    -| Inference loop (s) | 130.382 | 60.567 | 0.123 | 0.134 |
    -| Full dataset with loading (s) | 151.546 | 76.953 | 16.146 | 17.272 |
    -| Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |
    +| | GPU fp32 | GPU fp16 | TPU fp32* | TPU fp16* |
    +| :------------------------------- | -------: | -------: | --------: | --------: |
    +| Largest batch size | 32 | 64 | 1024 | 2048 |
    +| Min-inference batch size | 4 | 64 | 1024 | 2048 |
    +| Masker inference (s/i) | 0.059 | 0.019 | 3.80e-5 | 1.36e-5 |
    +| Painter inference (s/i) | 0.068 | 0.041 | 2.53e-5 | 1.16e-5 |
    +| Inference loop (s) | 130.382 | 60.567 | 0.073 | 0.0392 |
    +| Inference loop (i/s) | ~8 | ~17 | ~14 000 | ~26 000 |
    +| Full dataset with loading (s) | 151.546 | 76.953 | 18.05 | 18.05 |
    +| Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |

     ## Rows
     * `Largest batch size` => largest batch size that would fit in memory (for multiples of 2)
    @@ -21,8 +19,9 @@
     * `Full dataset with loading` => overall time to process the _entire_ dataset: numpy array -> torch tensor -> transforms -> inference, but **not back to cpu**
     * `Device -> CPU` => (average time taken to get the inference back from the device for 1 batch) * (number of batches)

    -## More
    +## Comments
     * **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches. Images have a wide range of shapes but are all transformed into a `3 x 640 x 640` tensor
    +* **TPU fp32** and **TPU fp16** => Numbers in these columns were computed after loading **4096** images, to have respectively 4 and 2 batches. To account for 4 times more data than the GPU columns, measures for `Inference loop` and `Full dataset with loading` are divided by 4 (which could be slightly off).
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
    ```

  4. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 8 additions and 8 deletions.
    16 changes: 8 additions & 8 deletions in benchmark-v0.md

    ````diff
    @@ -61,10 +61,10 @@ if __name__ == "__main__":
         torch.set_grad_enabled(False)
         model = nn.Sequential(
             *[
    -            nn.Conv2d(3, 256, 3, 2),
    -            nn.Conv2d(256, 512, 3, 2),
    -            nn.Conv2d(512, 256, 3, 2),
    -            nn.Conv2d(256, 3, 3, 2, 1),
    +            nn.Conv2d(3, 256, 3, 1, 1),
    +            nn.Conv2d(256, 512, 3, 1, 1),
    +            nn.Conv2d(512, 256, 3, 1, 1),
    +            nn.Conv2d(256, 3, 3, 1, 1),
             ]
         ).to(device)
    @@ -73,11 +73,11 @@ if __name__ == "__main__":
         y = model(data)
         print(y.shape)
         with Timer("back from device"):
    -        y = y.detach().cpu().numpy()
    +        y = y.cpu().numpy()
     ```

     ```
    -[inference] Elapsed time: 0.000500
    -torch.Size([2, 3, 40, 40])
    -[back from device] Elapsed time: 5.083
    +[inference] Elapsed time: 0.000643
    +torch.Size([2, 3, 640, 640])
    +[back from device] Elapsed time: 11.328
     ```
    ````
  5. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 56 additions and 0 deletions.
    56 changes: 56 additions & 0 deletions in benchmark-v0.md

    ````diff
    @@ -25,3 +25,59 @@
     * **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches. Images have a wide range of shapes but are all transformed into a `3 x 640 x 640` tensor
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
    +
    +## Slow back to cpu
    +
    +```python
    +import time
    +import torch
    +import torch.nn as nn
    +
    +import torch_xla.core.xla_model as xm
    +
    +
    +class Timer:
    +    def __init__(self, name="", store=None, precision=3):
    +        self.name = name
    +        self.precision = precision
    +
    +    def format(self, n):
    +        return f"{n:.{self.precision}f}"
    +
    +    def __enter__(self):
    +        """Start a new timer as a context manager"""
    +        self._start_time = time.perf_counter()
    +        return self
    +
    +    def __exit__(self, *exc_info):
    +        """Stop the context manager timer"""
    +        t = time.perf_counter()
    +        new_time = t - self._start_time
    +        print(f"[{self.name}] Elapsed time: {self.format(new_time)}")
    +
    +
    +if __name__ == "__main__":
    +    device = xm.xla_device()
    +    torch.set_grad_enabled(False)
    +    model = nn.Sequential(
    +        *[
    +            nn.Conv2d(3, 256, 3, 2),
    +            nn.Conv2d(256, 512, 3, 2),
    +            nn.Conv2d(512, 256, 3, 2),
    +            nn.Conv2d(256, 3, 3, 2, 1),
    +        ]
    +    ).to(device)
    +
    +    data = torch.rand(2, 3, 640, 640, device=device)
    +    with Timer("inference", precision=6):
    +        y = model(data)
    +    print(y.shape)
    +    with Timer("back from device"):
    +        y = y.detach().cpu().numpy()
    +```
    +
    +```
    +[inference] Elapsed time: 0.000500
    +torch.Size([2, 3, 40, 40])
    +[back from device] Elapsed time: 5.083
    +```
    ````
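    The `Timer` helper in the listing above is plain Python and can be exercised without a TPU. A quick standalone sketch (the class is copied from the listing; note its `store` argument is unused there, and `precision` only controls the printed formatting):

    ```python
    import time


    class Timer:
        """Minimal timing context manager, as in the benchmark script above."""

        def __init__(self, name="", store=None, precision=3):
            self.name = name
            self.precision = precision

        def format(self, n):
            return f"{n:.{self.precision}f}"

        def __enter__(self):
            self._start_time = time.perf_counter()
            return self

        def __exit__(self, *exc_info):
            elapsed = time.perf_counter() - self._start_time
            print(f"[{self.name}] Elapsed time: {self.format(elapsed)}")


    # `format` rounds to `precision` decimal places:
    print(Timer(precision=2).format(1.2345))  # 1.23

    with Timer("sleep"):
        time.sleep(0.05)  # prints roughly: [sleep] Elapsed time: 0.050
    ```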
  6. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion in benchmark-v0.md

    ```diff
    @@ -22,6 +22,6 @@
     * `Device -> CPU` => (average time taken to get the inference back from the device for 1 batch) * (number of batches)

     ## More
    -* **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches
    +* **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches. Images have a wide range of shapes but are all transformed into a `3 x 640 x 640` tensor
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
    ```
  7. vict0rsch revised this gist Oct 30, 2020. 1 changed file with 4 additions and 0 deletions.
    4 changes: 4 additions & 0 deletions in benchmark-v0.md

    ```diff
    @@ -1,3 +1,5 @@
    +# Benchmarking TPUs vs GPUs
    +
     | | GPU fp32 | GPU fp16 | TPU fp32 | TPU fp16 |
     | :------------------------------- | -------: | -------: | -------: | -------: |
     | Largest batch size | 32 | 64 | 1024+ | 1024+ |
    @@ -9,6 +11,7 @@
     | Full dataset with loading (s) | 151.546 | 76.953 | 16.146 | 17.272 |
     | Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |

    +## Rows
     * `Largest batch size` => largest batch size that would fit in memory (for multiples of 2)
     * `Min-inference batch size` => batch size associated with the smallest pure on-device inference time. We look at this metric because we assume linear loading time
     * `Masker inference` => average time per image for the masker's inferences (`s/i` = seconds/image)
    @@ -18,6 +21,7 @@
     * `Full dataset with loading` => overall time to process the _entire_ dataset: numpy array -> torch tensor -> transforms -> inference, but **not back to cpu**
     * `Device -> CPU` => (average time taken to get the inference back from the device for 1 batch) * (number of batches)

    +## More
     * **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches
     * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
     * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
    ```
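    Prepending `XLA_USE_BF16=1` as noted above sets the environment variable for that single command, which makes torch_xla store float tensors as bfloat16 on the TPU. A small sketch of the prefix form (the `benchmark.py` script name is illustrative, not from the gist):

    ```shell
    # With torch_xla installed, the fp16 runs would look like:
    #   XLA_USE_BF16=1 python benchmark.py   # benchmark.py is illustrative
    # The prefix form scopes the variable to that single invocation:
    XLA_USE_BF16=1 sh -c 'echo "inside: XLA_USE_BF16=$XLA_USE_BF16"'
    echo "after: XLA_USE_BF16=${XLA_USE_BF16:-unset}"
    ```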
  8. vict0rsch created this gist Oct 30, 2020.
    23 changes: 23 additions & 0 deletions in benchmark-v0.md

    | | GPU fp32 | GPU fp16 | TPU fp32 | TPU fp16 |
    | :------------------------------- | -------: | -------: | -------: | -------: |
    | Largest batch size | 32 | 64 | 1024+ | 1024+ |
    | Min-inference batch size | 4 | 64 | 1024 | 1024 |
    | Masker inference (s/i) | 0.059 | 0.019 | 5.76e-5 | 6.15e-5 |
    | Painter inference (s/i) | 0.068 | 0.041 | 4.69e-5 | 5.08e-5 |
    | Masker + painter inference (i/s) | ~8 | ~17 | ~9570 | ~8900 |
    | Inference loop (s) | 130.382 | 60.567 | 0.123 | 0.134 |
    | Full dataset with loading (s) | 151.546 | 76.953 | 16.146 | 17.272 |
    | Total Device -> CPU (s) | 2.816 | 2.528 | inf | inf |

    * `Largest batch size` => largest batch size that would fit in memory (for multiples of 2)
    * `Min-inference batch size` => batch size associated with the smallest pure on-device inference time. We look at this metric because we assume linear loading time
    * `Masker inference` => average time per image for the masker's inferences (`s/i` = seconds/image)
    * `Painter inference` => average time per image for the painter's inferences (`s/i` = seconds/image)
    * `Masker + painter inference` => number of images per second for pure inference (`i/s`)
    * `Inference loop` => smallest on-device inference time for the entire dataset
    * `Full dataset with loading` => overall time to process the _entire_ dataset: numpy array -> torch tensor -> transforms -> inference, but **not back to cpu**
    * `Device -> CPU` => (average time taken to get the inference back from the device for 1 batch) * (number of batches)

    * **data** => 1024 images (a list of 100 different images, repeated 5+ times). Because it gets really long on GPU, and TPUs still have the back-to-cpu issue, I did not try larger batches
    * `inf` => after 10+ minutes, still no answer. Cannot even stop the process with `ctrl+c`, have to kill it (`kill -9`)
    * `TPU fp16` => prepend `XLA_USE_BF16=1` to the command
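    The `i/s` row follows directly from the per-image rows: throughput is the reciprocal of the summed masker and painter times. A quick sanity check against the GPU columns of the table (the helper function name is illustrative):

    ```python
    # Masker + painter throughput (images/second) from per-image timings.
    def throughput(masker_s_per_img, painter_s_per_img):
        return 1 / (masker_s_per_img + painter_s_per_img)

    print(round(throughput(0.059, 0.068)))  # GPU fp32 -> 8, matching "~8"
    print(round(throughput(0.019, 0.041)))  # GPU fp16 -> 17, matching "~17"
    ```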