Gist by @msm-code, created December 29, 2021, 02:41.

    # Karton benchmark notes

    Predictably, most of the time was spent in Redis.

    90k tasks, nothing to collect:
    ```
    [2021-12-29 00:31:53,585][INFO] Tasks: Get all tasks: 4.495668649673462
    [2021-12-29 00:31:53,586][INFO] Tasks: Tasks from queue: 0.00021696090698242188
    [2021-12-29 00:31:53,675][INFO] Tasks: Big loop: 0.08784604072570801
    [2021-12-29 00:31:53,675][INFO] Tasks: Smol loop: 1.0728836059570312e-05
    [2021-12-29 00:31:53,812][INFO] Resources: List objects: 0.003173351287841797
    [2021-12-29 00:31:58,107][INFO] Resources: Get all tasks: 4.294421672821045
    [2021-12-29 00:31:58,268][INFO] Resources: For loop: 0.15995168685913086
    [2021-12-29 00:31:58,268][INFO] Resources: Remove loop: 1.1205673217773438e-05
    [2021-12-29 00:31:58,407][INFO] GC done in 10.318462610244751 seconds
    ```
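The per-phase numbers in these logs come from simple wall-clock timing around each GC phase. A minimal sketch of such a helper (illustrative only, not the actual instrumentation; `backend.get_all_tasks` in the usage comment is a stand-in name):

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Log the wall-clock duration of the wrapped block, mimicking the log lines above."""
    start = time.time()
    try:
        yield
    finally:
        logging.info("%s: %s", label, time.time() - start)


# usage:
# with timed("Tasks: Get all tasks"):
#     tasks = backend.get_all_tasks()
```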

    90k tasks, including 60k to collect:
    ```
    [2021-12-29 00:36:48,376][INFO] Tasks: Get all tasks: 4.544401407241821
    [2021-12-29 00:36:48,377][INFO] Tasks: Tasks from queue: 0.00017404556274414062
    [2021-12-29 00:37:01,953][INFO] Tasks: Big loop: 13.575872898101807
    [2021-12-29 00:37:01,954][INFO] Tasks: Smol loop: 9.775161743164062e-06
    [2021-12-29 00:37:02,108][INFO] Resources: List objects: 0.005395174026489258
    [2021-12-29 00:37:03,493][INFO] Resources: Get all tasks: 1.3837084770202637
    [2021-12-29 00:37:03,545][INFO] Resources: For loop: 0.05156588554382324
    [2021-12-29 00:37:03,546][INFO] Resources: Remove loop: 8.821487426757812e-06
    [2021-12-29 00:37:03,593][INFO] GC done in 20.767701625823975 seconds
    ```

    60k tasks, including 60k to collect:
    ```
    [2021-12-29 00:49:17,727][INFO] Tasks: Get all tasks: 2.212865114212036
    [2021-12-29 00:49:17,728][INFO] Tasks: Tasks from queue: 0.00018286705017089844
    [2021-12-29 00:49:26,743][INFO] Tasks: Big loop: 9.015160083770752
    [2021-12-29 00:49:26,744][INFO] Tasks: Smol loop: 1.2874603271484375e-05
    [2021-12-29 00:49:27,397][INFO] Resources: List objects: 0.5849082469940186
    [2021-12-29 00:49:27,399][INFO] Resources: Get all tasks: 0.0008819103240966797
    [2021-12-29 00:49:27,400][INFO] Resources: For loop: 3.337860107421875e-05
    [2021-12-29 00:49:37,279][INFO] Resources: Remove loop: 9.878673791885376
    [2021-12-29 00:49:37,280][INFO] GC done in 22.77230477333069 seconds
    ```

    Standard batch from now on: 20k tasks including 20k to collect:
    ```
    [2021-12-29 01:04:37,191][INFO] Tasks: Get all tasks: 1.994135856628418
    [2021-12-29 01:04:37,192][INFO] Tasks: Tasks from queue: 0.00023102760314941406
    [2021-12-29 01:04:46,747][INFO] Tasks: Big loop (record): 9.555027723312378
    [2021-12-29 01:04:46,748][INFO] Tasks: Smol loop: 1.8596649169921875e-05
    [2021-12-29 01:04:46,841][INFO] Resources: List objects: 0.005769491195678711
    [2021-12-29 01:04:46,843][INFO] Resources: Get all tasks: 0.001276254653930664
    [2021-12-29 01:04:46,843][INFO] Resources: For loop: 3.981590270996094e-05
    [2021-12-29 01:04:46,846][INFO] Resources: Remove loop: 0.002631664276123047
    [2021-12-29 01:04:46,847][INFO] GC done in 12.65667200088501 seconds
    ```

    ### Part 1: Tasks

    `Get all tasks` is (unsurprisingly) slow.
    The main loop of `gc_collect_tasks` is also very slow when there are resources to remove. Deleting all tasks with a single `DEL` call, instead of one call per task, helps:

    ```python
    def delete_tasks(self, tasks: List[Task]) -> None:
        """
        Remove multiple tasks from Redis
        :param tasks: list of Task objects
        """
        keys = [f"{KARTON_TASK_NAMESPACE}:{task.uid}" for task in tasks]
        self.redis.delete(*keys)
    ```
    Result:
    ```
    [2021-12-29 01:09:28,965][INFO] Tasks: Get all tasks: 1.9590599536895752
    [2021-12-29 01:09:28,967][INFO] Tasks: Tasks from queue: 0.00023603439331054688
    [2021-12-29 01:09:29,052][INFO] Tasks: Big loop (record): 0.08496618270874023
    [2021-12-29 01:09:29,053][INFO] Tasks: Deleting 40001 tasks
    [2021-12-29 01:09:29,216][INFO] Tasks: Big loop (delete): 0.16314005851745605
    [2021-12-29 01:09:33,712][INFO] Tasks: Big loop (metrics): 4.494740009307861
    [2021-12-29 01:09:33,713][INFO] Tasks: Smol loop: 1.5020370483398438e-05
    [2021-12-29 01:09:33,776][INFO] Resources: List objects: 0.004618406295776367
    [2021-12-29 01:09:33,778][INFO] Resources: Get all tasks: 0.0007326602935791016
    [2021-12-29 01:09:33,778][INFO] Resources: For loop: 2.384185791015625e-05
    [2021-12-29 01:09:33,780][INFO] Resources: Remove loop: 0.001951456069946289
    [2021-12-29 01:09:33,781][INFO] GC done in 7.78088116645813 seconds
    ```
    Wow! We **slashed** the time it took to delete 40k tasks from 9.5 seconds to 0.16 seconds.

    Well, almost - incrementing all the metrics still takes 4.5 seconds. One idea is to delete the metrics entirely (system-side metrics are a bit redundant with service-side metrics, though they are still a nice feature to have). But the fact that this accounts for half of the time suggests that network latency plays a huge role here. Let's try to speed it up with pipelining:

    ```python
    def increment_metrics_list(self, metric: KartonMetrics, identities: List[str]) -> None:
        p = self.redis.pipeline()
        for identity in identities:
            p.hincrby(metric.value, identity, 1)
        p.execute()
    ```
    Result:
    ```
    [2021-12-29 01:15:55,718][INFO] Tasks: Get all tasks: 2.1441361904144287
    [2021-12-29 01:15:55,719][INFO] Tasks: Tasks from queue: 0.0002739429473876953
    [2021-12-29 01:15:55,805][INFO] Tasks: Big loop (record): 0.08557796478271484
    [2021-12-29 01:15:55,806][INFO] Tasks: Deleting 40001 tasks
    [2021-12-29 01:15:55,965][INFO] Tasks: Big loop (delete): 0.15918326377868652
    [2021-12-29 01:15:56,424][INFO] Tasks: Big loop (metrics): 0.4580235481262207
    [2021-12-29 01:15:56,424][INFO] Tasks: Smol loop: 2.2649765014648438e-05
    [2021-12-29 01:15:56,494][INFO] Resources: List objects: 0.005167245864868164
    [2021-12-29 01:15:56,495][INFO] Resources: Get all tasks: 0.0008485317230224609
    [2021-12-29 01:15:56,496][INFO] Resources: For loop: 2.9325485229492188e-05
    [2021-12-29 01:15:56,498][INFO] Resources: Remove loop: 0.002457857131958008
    [2021-12-29 01:15:56,499][INFO] GC done in 3.9317240715026855 seconds
    ```
    Well, there goes another 4 seconds. Total GC time improved from ~12.7 seconds to ~3.9 seconds. Nice.

    Can we do better? Well, it depends on why `get all tasks` is slow. There are three separate operations:
    - `keys`
    - `mget`
    - deserialisation
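Put together, the fetch path presumably looks something like this (a sketch: the key prefix, the use of plain JSON, and the function shape are assumptions, not the actual implementation):

```python
import json

KARTON_TASK_NAMESPACE = "karton.task"  # assumed key prefix


def get_all_tasks(redis):
    """Fetch and parse every task, in the three phases measured separately below."""
    keys = redis.keys(f"{KARTON_TASK_NAMESPACE}:*")      # phase 1: keys
    blobs = redis.mget(keys)                             # phase 2: mget
    return [json.loads(blob) for blob in blobs if blob]  # phase 3: deserialise
```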

    I've added logging for them all. A bit surprisingly, deserialisation is by far the slowest part:
    ```
    [2021-12-29 01:26:56,155][INFO] Keys: 0.18313837051391602
    [2021-12-29 01:26:56,583][INFO] Mget: 0.4270799160003662
    [2021-12-29 01:26:58,366][INFO] Deserialise: 1.7819278240203857
    [2021-12-29 01:26:58,372][INFO] Tasks: Get all tasks: 2.40012788772583
    ```
    This is good news: it means we can optimise it in the future without redesigning everything.
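One illustrative direction (not measured here): skip building full Task objects during GC and extract only the fields the collector actually needs. The field names below are assumptions about the serialised task format:

```python
import json


def light_parse(raw: bytes):
    """Pull out just uid and status; GC only needs these to decide what to collect.

    This avoids constructing full Task objects (headers, payloads, resources)
    for tasks that will just be deleted anyway.
    """
    data = json.loads(raw)
    return data["uid"], data["status"]
```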

    ### Part 2: Resources

    But for now, let's focus on resources. I've changed the producer a bit to produce a different resource every time:

    ```python
    import sys
    import logging
    from karton.core import Config, Producer, Task, Resource

    config = Config("karton.ini")
    producer = Producer(config)

    howmany = int(sys.argv[1])
    for i in range(howmany):
        if i % 1000 == 0:
            print(i)

        # a different resource every time, so resource GC has real work to do
        contents = str(i).encode()
        resource = Resource(str(i), contents)

        task = Task({"type": "sample", "kind": "raw"})
        logging.info("pushing file to karton %s, task %s", i, task)

        task.add_resource("sample", resource)
        task.add_payload("tags", ["simple_producer"])
        task.add_payload("additional_info", ["This sample has been added by simple producer example"])
        producer.send_task(task)
    ```

    New times:
    ```
    [2021-12-29 02:24:10,000][INFO] Tasks: Get all tasks: 0.8783807754516602
    [2021-12-29 02:24:10,048][INFO] Tasks: Deleting 20000 tasks
    [2021-12-29 02:24:10,532][INFO] Tasks: Big loop: 0.48422670364379883
    [2021-12-29 02:24:11,426][INFO] Resources: List objects: 0.8508834838867188
    [2021-12-29 02:24:11,427][INFO] Resources: List objects: 10000 objects
    [2021-12-29 02:24:11,428][INFO] Resources: Get all tasks: 0.00037384033203125
    [2021-12-29 02:24:11,428][INFO] Resources: For loop: 7.62939453125e-06
    [2021-12-29 02:24:11,429][INFO] Resources: Removing 10000 resources
    [2021-12-29 02:24:26,314][INFO] Resources: Remove loop: 14.885408878326416
    [2021-12-29 02:24:26,315][INFO] GC done in 18.199631452560425 seconds
    ```

    Let's use basically the same technique and pipeline object removal. The result:

    ```
    [2021-12-29 02:21:48,814][INFO] Tasks: Get all tasks: 0.8932852745056152
    [2021-12-29 02:21:48,871][INFO] Tasks: Deleting 20000 tasks
    [2021-12-29 02:21:49,267][INFO] Tasks: Big loop: 0.39539289474487305
    [2021-12-29 02:21:50,224][INFO] Resources: List objects: 0.9117233753204346
    [2021-12-29 02:21:50,225][INFO] Resources: List objects: 10000 objects
    [2021-12-29 02:21:50,226][INFO] Resources: Get all tasks: 0.0004830360412597656
    [2021-12-29 02:21:50,226][INFO] Resources: For loop: 6.198883056640625e-06
    [2021-12-29 02:21:50,226][INFO] Resources: Removing 10000 resources
    [2021-12-29 02:21:52,237][INFO] Resources: Remove loop: 2.0103540420532227
    [2021-12-29 02:21:52,239][INFO] GC done in 5.3230650424957275 seconds
    ```

    Batching wins again, cutting the remove time from 14.89s to 2.01s.
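The pipelined removal code itself isn't shown above, but the general pattern is backend-independent: group uids and issue one bulk call per chunk. A minimal sketch (`delete_batch` is a hypothetical stand-in for whatever bulk-delete operation the storage backend offers):

```python
def remove_in_batches(object_uids, delete_batch, batch_size=1000):
    """Issue one backend call per batch_size uids instead of one round-trip per object."""
    for i in range(0, len(object_uids), batch_size):
        delete_batch(object_uids[i:i + batch_size])
```

With a bulk delete of 1000 objects per call, removing 10k resources costs 10 round-trips instead of 10000.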

    ### Part 3: Routing

    With these changes, GC should be fast enough. There is one optimisation pending (don't call `get_all_tasks` twice), but maybe it's time to focus on routing.
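That pending optimisation is mechanical: fetch the task list once and share it between both GC phases. A sketch with hypothetical method names (the phase bodies are stubs):

```python
class GarbageCollector:
    """Sketch: gc_collect fetches tasks once instead of once per phase."""

    def __init__(self, backend):
        self.backend = backend

    def gc_collect(self):
        tasks = self.backend.get_all_tasks()  # one round of keys + mget + deserialise
        self.gc_collect_tasks(tasks)
        self.gc_collect_resources(tasks)

    def gc_collect_tasks(self, tasks):
        pass  # collect finished/orphaned tasks

    def gc_collect_resources(self, tasks):
        pass  # collect resources no remaining task references
```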

    The ways to go from here are:
    - improve routing performance (by using pipelining in `process_task`, for example);
    - reverse-engineer why routing is not thread-safe and fix it, so we can scale routers;
    - implement batched routing (route multiple tasks at once to reduce round-trips).
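The third option could look roughly like this (a sketch: `queue_for` and the exact commands are assumptions about the routing logic; the point is one pipelined round-trip for the whole batch):

```python
def route_batch(redis, tasks, queue_for):
    """Route a whole batch of tasks with a single pipelined round-trip."""
    p = redis.pipeline()
    for task in tasks:
        # queue_for() decides which consumer queue matches the task's headers
        p.rpush(queue_for(task), task.uid)
    p.execute()
```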

    Sadly, it's getting a bit late here, so I won't be able to look at it today.