msm-code created this gist on December 29, 2021.
# Karton benchmark notes

Predictably, most of the time was spent in Redis.

90k tasks, nothing to collect:

```
[2021-12-29 00:31:53,585][INFO] Tasks: Get all tasks: 4.495668649673462
[2021-12-29 00:31:53,586][INFO] Tasks: Tasks from queue: 0.00021696090698242188
[2021-12-29 00:31:53,675][INFO] Tasks: Big loop: 0.08784604072570801
[2021-12-29 00:31:53,675][INFO] Tasks: Smol loop: 1.0728836059570312e-05
[2021-12-29 00:31:53,812][INFO] Resources: List objects: 0.003173351287841797
[2021-12-29 00:31:58,107][INFO] Resources: Get all tasks: 4.294421672821045
[2021-12-29 00:31:58,268][INFO] Resources: For loop: 0.15995168685913086
[2021-12-29 00:31:58,268][INFO] Resources: Remove loop: 1.1205673217773438e-05
[2021-12-29 00:31:58,407][INFO] GC done in 10.318462610244751 seconds
```

90k tasks, including 60k to collect:

```
[2021-12-29 00:36:48,376][INFO] Tasks: Get all tasks: 4.544401407241821
[2021-12-29 00:36:48,377][INFO] Tasks: Tasks from queue: 0.00017404556274414062
[2021-12-29 00:37:01,953][INFO] Tasks: Big loop: 13.575872898101807
[2021-12-29 00:37:01,954][INFO] Tasks: Smol loop: 9.775161743164062e-06
[2021-12-29 00:37:02,108][INFO] Resources: List objects: 0.005395174026489258
[2021-12-29 00:37:03,493][INFO] Resources: Get all tasks: 1.3837084770202637
[2021-12-29 00:37:03,545][INFO] Resources: For loop: 0.05156588554382324
[2021-12-29 00:37:03,546][INFO] Resources: Remove loop: 8.821487426757812e-06
[2021-12-29 00:37:03,593][INFO] GC done in 20.767701625823975 seconds
```

60k tasks, including 60k to collect:

```
[2021-12-29 00:49:17,727][INFO] Tasks: Get all tasks: 2.212865114212036
[2021-12-29 00:49:17,728][INFO] Tasks: Tasks from queue: 0.00018286705017089844
[2021-12-29 00:49:26,743][INFO] Tasks: Big loop: 9.015160083770752
[2021-12-29 00:49:26,744][INFO] Tasks: Smol loop: 1.2874603271484375e-05
[2021-12-29 00:49:27,397][INFO] Resources: List objects: 0.5849082469940186
[2021-12-29 00:49:27,399][INFO] Resources: Get all tasks: 0.0008819103240966797
[2021-12-29 00:49:27,400][INFO] Resources: For loop: 3.337860107421875e-05
[2021-12-29 00:49:37,279][INFO] Resources: Remove loop: 9.878673791885376
[2021-12-29 00:49:37,280][INFO] GC done in 22.77230477333069 seconds
```

Standard batch from now on: 20k tasks, including 20k to collect:

```
[2021-12-29 01:04:37,191][INFO] Tasks: Get all tasks: 1.994135856628418
[2021-12-29 01:04:37,192][INFO] Tasks: Tasks from queue: 0.00023102760314941406
[2021-12-29 01:04:46,747][INFO] Tasks: Big loop (record): 9.555027723312378
[2021-12-29 01:04:46,748][INFO] Tasks: Smol loop: 1.8596649169921875e-05
[2021-12-29 01:04:46,841][INFO] Resources: List objects: 0.005769491195678711
[2021-12-29 01:04:46,843][INFO] Resources: Get all tasks: 0.001276254653930664
[2021-12-29 01:04:46,843][INFO] Resources: For loop: 3.981590270996094e-05
[2021-12-29 01:04:46,846][INFO] Resources: Remove loop: 0.002631664276123047
[2021-12-29 01:04:46,847][INFO] GC done in 12.65667200088501 seconds
```

### Part 1: Tasks

`Get all tasks` is (unsurprisingly) slow.
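For context, a fetch like this presumably boils down to three Redis operations: `keys` to enumerate the tasks, `mget` to bulk-fetch their bodies, and JSON deserialisation (the per-step breakdown later in these notes confirms this split). Here is a minimal runnable sketch using an in-memory stand-in for Redis; the key pattern and function shape are assumptions for illustration, not karton's real API:

```python
import json

class FakeRedis:
    """Tiny in-memory stand-in, so the sketch runs without a server."""
    def __init__(self, data):
        self.data = data

    def keys(self, pattern):
        # Real Redis matches `pattern` against the keyspace; ignored here.
        return list(self.data)

    def mget(self, keys):
        return [self.data.get(k) for k in keys]

def get_all_tasks(redis):
    # The three steps whose timings appear below: keys, mget, deserialise.
    keys = redis.keys("karton.task:*")
    blobs = redis.mget(keys) if keys else []
    return [json.loads(b) for b in blobs if b is not None]

r = FakeRedis({"karton.task:1": '{"uid": "1", "status": "Finished"}'})
tasks = get_all_tasks(r)
```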
The main loop of `gc_collect_tasks` is very slow when there are tasks to remove. Let's delete them in a single batched call instead of one at a time:

```python
def delete_tasks(self, tasks: List[Task]) -> None:
    """
    Remove multiple tasks from Redis
    :param tasks: list of Task objects
    """
    keys = [f"{KARTON_TASK_NAMESPACE}:{task.uid}" for task in tasks]
    self.redis.delete(*keys)
```

Result:

```
[2021-12-29 01:09:28,965][INFO] Tasks: Get all tasks: 1.9590599536895752
[2021-12-29 01:09:28,967][INFO] Tasks: Tasks from queue: 0.00023603439331054688
[2021-12-29 01:09:29,052][INFO] Tasks: Big loop (record): 0.08496618270874023
[2021-12-29 01:09:29,053][INFO] Tasks: Deleting 40001 tasks
[2021-12-29 01:09:29,216][INFO] Tasks: Big loop (delete): 0.16314005851745605
[2021-12-29 01:09:33,712][INFO] Tasks: Big loop (metrics): 4.494740009307861
[2021-12-29 01:09:33,713][INFO] Tasks: Smol loop: 1.5020370483398438e-05
[2021-12-29 01:09:33,776][INFO] Resources: List objects: 0.004618406295776367
[2021-12-29 01:09:33,778][INFO] Resources: Get all tasks: 0.0007326602935791016
[2021-12-29 01:09:33,778][INFO] Resources: For loop: 2.384185791015625e-05
[2021-12-29 01:09:33,780][INFO] Resources: Remove loop: 0.001951456069946289
[2021-12-29 01:09:33,781][INFO] GC done in 7.78088116645813 seconds
```

Wow! We **slashed** the time it takes to delete 40k tasks from 9.5 seconds to 0.16 seconds. Well, almost: incrementing all the metrics still takes 4.5 seconds. One idea is to drop these metrics entirely (system-side metrics are a bit redundant with service-side metrics, but they're still a nice feature to have). More importantly, the fact that metrics account for half of the total time suggests that network latency plays a huge role here.
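As a side note, pipelining is not the only way to cut roundtrips here: since many of the 40k deleted tasks share the same identity, the increments could also be aggregated client-side, so each distinct identity costs a single `HINCRBY`. A hypothetical helper, not part of karton:

```python
from collections import Counter

def aggregate_increments(identities):
    # Collapse per-task increments into one (identity, count) pair each;
    # e.g. 40k tasks spread over 5 identities become 5 HINCRBY commands.
    return sorted(Counter(identities).items())

pairs = aggregate_increments(["classifier", "classifier", "mwdb-reporter"])
# then: for identity, count in pairs: redis.hincrby(metric, identity, count)
```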
Let's try to speed it up with pipelining:

```python
def increment_metrics_list(self, metric: KartonMetrics, identities: List[str]) -> None:
    p = self.redis.pipeline()
    for identity in identities:
        p.hincrby(metric.value, identity, 1)
    p.execute()
```

Result:

```
[2021-12-29 01:15:55,718][INFO] Tasks: Get all tasks: 2.1441361904144287
[2021-12-29 01:15:55,719][INFO] Tasks: Tasks from queue: 0.0002739429473876953
[2021-12-29 01:15:55,805][INFO] Tasks: Big loop (record): 0.08557796478271484
[2021-12-29 01:15:55,806][INFO] Tasks: Deleting 40001 tasks
[2021-12-29 01:15:55,965][INFO] Tasks: Big loop (delete): 0.15918326377868652
[2021-12-29 01:15:56,424][INFO] Tasks: Big loop (metrics): 0.4580235481262207
[2021-12-29 01:15:56,424][INFO] Tasks: Smol loop: 2.2649765014648438e-05
[2021-12-29 01:15:56,494][INFO] Resources: List objects: 0.005167245864868164
[2021-12-29 01:15:56,495][INFO] Resources: Get all tasks: 0.0008485317230224609
[2021-12-29 01:15:56,496][INFO] Resources: For loop: 2.9325485229492188e-05
[2021-12-29 01:15:56,498][INFO] Resources: Remove loop: 0.002457857131958008
[2021-12-29 01:15:56,499][INFO] GC done in 3.9317240715026855 seconds
```

Well, there goes another 4.5 seconds. Task GC time improved from ~12 seconds to ~3.5 seconds. Nice.

Can we do better? Well, it depends on why `get all tasks` is slow. There are three separate operations:

- `keys`
- `mget`
- deserialisation

I've added logging for all three. A bit surprisingly, deserialisation is by far the slowest part:

```
[2021-12-29 01:26:56,155][INFO] Keys: 0.18313837051391602
[2021-12-29 01:26:56,583][INFO] Mget: 0.4270799160003662
[2021-12-29 01:26:58,366][INFO] Deserialise: 1.7819278240203857
[2021-12-29 01:26:58,372][INFO] Tasks: Get all tasks: 2.40012788772583
```

This is good news: it means we can optimise it in the future without redesigning everything.

### Part 2: Resources

But for now, let's focus on resources.
I've changed the producer a bit, to produce a different resource every time:

```python
import sys
import os
import logging

from karton.core import Config, Producer, Task, Resource

config = Config("karton.ini")
producer = Producer(config)

howmany = int(sys.argv[1])
for i in range(howmany):
    if i % 1000 == 0:
        print(i)
    contents = str(i).encode()
    resource = Resource(str(i), contents)
    task = Task({"type": "sample", "kind": "raw"})
    logging.info('pushing file to karton %s, task %s' % (i, task))
    task.add_resource("sample", resource)
    task.add_payload("tags", ["simple_producer"])
    task.add_payload("additional_info", ["This sample has been added by simple producer example"])
    producer.send_task(task)
```

New times:

```
[2021-12-29 02:24:10,000][INFO] Tasks: Get all tasks: 0.8783807754516602
[2021-12-29 02:24:10,048][INFO] Tasks: Deleting 20000 tasks
[2021-12-29 02:24:10,532][INFO] Tasks: Big loop: 0.48422670364379883
[2021-12-29 02:24:11,426][INFO] Resources: List objects: 0.8508834838867188
[2021-12-29 02:24:11,427][INFO] Resources: List objects: 10000 objects
[2021-12-29 02:24:11,428][INFO] Resources: Get all tasks: 0.00037384033203125
[2021-12-29 02:24:11,428][INFO] Resources: For loop: 7.62939453125e-06
[2021-12-29 02:24:11,429][INFO] Resources: Removing 10000 resources
[2021-12-29 02:24:26,314][INFO] Resources: Remove loop: 14.885408878326416
[2021-12-29 02:24:26,315][INFO] GC done in 18.199631452560425 seconds
```

Let's use basically the same technique and pipeline the object removal:

```
[2021-12-29 02:21:48,814][INFO] Tasks: Get all tasks: 0.8932852745056152
[2021-12-29 02:21:48,871][INFO] Tasks: Deleting 20000 tasks
[2021-12-29 02:21:49,267][INFO] Tasks: Big loop: 0.39539289474487305
[2021-12-29 02:21:50,224][INFO] Resources: List objects: 0.9117233753204346
[2021-12-29 02:21:50,225][INFO] Resources: List objects: 10000 objects
[2021-12-29 02:21:50,226][INFO] Resources: Get all tasks: 0.0004830360412597656
[2021-12-29 02:21:50,226][INFO] Resources: For loop: 6.198883056640625e-06
[2021-12-29 02:21:50,226][INFO] Resources: Removing 10000 resources
[2021-12-29 02:21:52,237][INFO] Resources: Remove loop: 2.0103540420532227
[2021-12-29 02:21:52,239][INFO] GC done in 5.3230650424957275 seconds
```

Batching wins again: the remove time decreased from 14.88s to 2.01s.

### Part 3: Routing

With these changes, GC should be fast enough. There is one optimisation pending (don't call `get_all_tasks` twice), but maybe it's time to focus on routing. The ways to go from here are:

- improve routing performance (by using pipelining in `process_task`, for example),
- reverse-engineer why routing is not thread-safe and fix it, so we can scale routers,
- implement batched routing? (route multiple tasks at once to reduce roundtrips).

Sadly, it's getting a bit late here, so I won't be able to look at it.
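The batched-routing idea could follow the same pattern as the GC fixes: group outgoing tasks by destination queue, then push each batch with a single command instead of one call per task. A hypothetical sketch; the `(queue, body)` shape is an assumption for illustration, not karton's real routing API:

```python
from collections import defaultdict

def group_by_queue(routed):
    # Group serialized task bodies by their destination queue, so each
    # queue needs a single RPUSH with many values (one roundtrip each).
    batches = defaultdict(list)
    for queue, body in routed:
        batches[queue].append(body)
    return dict(batches)

batches = group_by_queue([("karton.classifier", "t1"),
                          ("karton.classifier", "t2"),
                          ("karton.reporter", "t3")])
# then: for queue, bodies in batches.items(): redis.rpush(queue, *bodies)
```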