tsz is a Rust port of the TypeScript compiler, aiming for parity with tsc behavior plus better speed than tsgo (the Go port). On most fixtures we're competitive or ahead, but on multi-package monorepo workloads we are catastrophically behind:
| Fixture | Size | tsgo | tsz | Factor |
|---|---|---|---|---|
| utility-types | small | 117ms | 100ms | tsz 1.17× faster |
| type-fest (heavy .d.ts mapped types) | 16.5K LOC | 103ms | 219ms | tsgo 2.11× faster |
| large-ts-repo (monorepo) | 6086 files | 2.45s | 706s workstation / 30+ min on this dev box | tsgo 288× faster |
| large-ts-repo peak heap | — | 16 MB | 10.1 GB | tsgo ~600× less memory |
Goal stated by the project lead: tsz must be 2× faster than tsgo on large-ts-repo, i.e. ~1.2 s. We're not 30% off — we're ~600× off that target (706 s vs 1.2 s) on that workload. The user's directive was "do fundamentally correct work and dig deep", so I'm trying to understand what the right architectural moves are, not just chase 5% wins.
- tsz-binder: per-file binder produces a `BinderState` (symbols, scopes, file_locals, etc.). Lib symbols are merged in per-file (lots of duplication).
- tsz-checker: per-file `CheckerState` with a `CheckerContext` that owns the file's binder reference plus shared cross-file indices. Type computation goes through `compute_type_of_symbol` → `delegate_cross_arena_symbol_resolution` for cross-file symbols.
- tsz-solver: shared `TypeInterner` (DashMap-sharded) and a `QueryCache` (per-checker RefCell, optionally backed by a shared `SharedQueryCache`).
- CLI driver (tsz-cli): does a `read_source_files` BFS to walk the import graph, builds the program, then runs `collect_diagnostics`, which has a per-file `par_iter` over `work_items` calling `check_file_for_parallel`, plus a sequential post-merge "lib recheck" loop.
Profile of full bench at ~5 s into a fresh run:
```
read_source_files                 85% inclusive
  module_resolver::lookup         64%
    resolve_module_specifier      22%
    Path::is_dir / Path::is_file  stat() (multiple call sites)
    read_package_json             open() + serde_json::from_str
```
The calling thread holds 100% of CPU; 10 rayon worker threads (already initialized via ensure_rayon_global_pool()) sit idle. Hot leaves: Path::is_dir, Path::is_file, open, read. Each import expands into ~10 candidate paths (extension fan-out + path-mappings) and each candidate is stat()-checked. With 6086 files × ~20 imports × ~10 candidates = ~1.2M syscalls.
Action I took (PR #1623): restructured the BFS as level-synchronous — drain pending into a per-level batch, classify (cached/skip-js/read), parallelize the read+import-scan phase via par_iter, then serial-resolve imports back into the queue. Workers now do real work (ps -p shows ~0.6 s UTIME each during BFS phase). Subset3 (1429 files, where I/O Read is only 4.4 s of 261 s) showed -4% I/O Read, expected to be much larger on full bench. Cannot validate full-bench impact — full bench takes 30+ min on this workstation and we have no CI bench job that tracks tsz on large-ts-repo yet (PR #1618 enables it).
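For reference, the level-synchronous shape in miniature — stand-in types and functions, visited-set bookkeeping and the cached/skip-js/read classification elided, so this is a sketch of the idea rather than the actual PR code:

```rust
use rayon::prelude::*;

type FileId = usize;

// Stand-ins for the real per-file read+scan and the serial resolver call.
fn read_and_scan_imports(_f: FileId) -> Vec<String> { Vec::new() }
fn resolve(_from: FileId, _spec: &str) -> Option<FileId> { None }

fn level_synchronous_bfs(mut pending: Vec<FileId>) {
    while !pending.is_empty() {
        // Drain the frontier into one batch per level.
        let batch: Vec<FileId> = pending.drain(..).collect();

        // Parallel phase: file read + import-specifier scan. Pure per-file
        // work, no resolver state touched, so rayon workers get real work.
        let scanned: Vec<(FileId, Vec<String>)> = batch
            .into_par_iter()
            .map(|f| (f, read_and_scan_imports(f)))
            .collect();

        // Serial phase: module resolution (the resolver is !Send),
        // feeding the next level's frontier.
        for (from, specs) in scanned {
            for spec in specs {
                if let Some(next) = resolve(from, &spec) {
                    pending.push(next);
                }
            }
        }
    }
}
```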
Sample of subset3 mid-check (250 s of 261 s in check phase): exactly one worker thread at 98% CPU, the other nine at 0%, deep in:
```
collect_diagnostics → check_source_file → prepare_source_file_for_checking
  → build_type_environment → compute_type_of_symbol (recursive)
  → delegate_cross_arena_interface_type
  → compute_interface_type_from_declarations
  → get_type_of_interface
  → ...
  → copy_symbol_file_targets_to   (HashMap clone, 543/3777 = 14% of CPU samples!)
```
The check phase IS parallelized at the source-file level (`work_items.par_iter().zip(per_file_binders.into_par_iter()).map(|(file_idx, binder)| check_file_for_parallel(...))`), but in practice one file with deep `delegate_cross_arena` cascades dominates the wall time. Adding `.with_min_len(1)` to force fine-grained work-stealing helped — measured -5.4% on subset3 across two runs each (PR #1626) — so the chunking is genuinely an issue at the rayon level, not just an artifact of the work being sequential.
But the gain plateaued at ~5% because the underlying problem (one file's compute_type_of_symbol dominates the entire wall time) is unsolved. The deeply-recursive type-computation cascade for that one file STILL runs on a single thread, and other workers run out of independent work.
Every cross-arena delegation creates a child checker via `Box::new(CheckerState::with_parent_cache(...))` and copies the parent's local overlay:

```rust
pub fn copy_symbol_file_targets_to(&self, child: &mut CheckerContext<'_>) {
    let overlay = self.cross_file_symbol_targets.borrow(); // RefCell<FxHashMap<SymbolId, usize>>
    if !overlay.is_empty() {
        // Full deep clone of the overlay map on every delegation.
        *child.cross_file_symbol_targets.borrow_mut() = overlay.clone();
    }
}
```

The sample showed `_platform_memmove` calls inside `RawTable::clone` at 543 samples (~14%), reached via `delegate_cross_arena_symbol_resolution`. The recursion chain is very deep (delegate → `compute_type_of_symbol` → `type_alias_variable_alias` → `resolve_global_jsdoc_typedef_info` → `copy_symbol_file_targets_to`).
I considered `RefCell<Arc<FxHashMap<...>>>` with `Arc::make_mut` for copy-on-write semantics, but `register_symbol_file_target` is called in many places (JSX orchestration, type_node resolution) — children DO write to the overlay, so `make_mut` would clone on first write anyway. Net: the same number of HashMap clones, just deferred (see the sketch below). There might be some win if some children never write at all, but I have no measurement showing that.
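For concreteness, the COW variant I considered and rejected — hypothetical names, with `u64` standing in for `SymbolId`:

```rust
use std::{cell::RefCell, collections::HashMap, sync::Arc};

struct Overlay {
    // COW variant of cross_file_symbol_targets: parent -> child "copy"
    // becomes an Arc refcount bump instead of a full map clone.
    targets: RefCell<Arc<HashMap<u64, usize>>>,
}

impl Overlay {
    fn copy_to(&self, child: &Overlay) {
        // O(1): share the map.
        *child.targets.borrow_mut() = Arc::clone(&self.targets.borrow());
    }

    fn register(&self, sym: u64, file: usize) {
        // The catch: if the Arc is shared (it is, right after copy_to),
        // make_mut deep-clones the whole map here — the clone is deferred
        // to the child's first write, not eliminated.
        Arc::make_mut(&mut *self.targets.borrow_mut()).insert(sym, file);
    }
}
```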
PR #1618's description (which enables large-ts-repo in CI) explicitly says:
> The 288× perf gap on this workload is structural (per-file Check phase = 95% of runtime; `par_iter` already used but TypeInterner DashMap contention likely defeats parallelism — 16-core actual ≈ 16× slower than ideal).
I tried parallelizing the per-iteration lib-recheck loop (`for lib_idx in 0..checker_libs.files.len() { check_checker_lib_file(...) }`, ~30 lib files at ~7-8 s each). Workers DID get scheduled (the 10-thread pool went from 1 active thread to all active during that phase), but total wall time went UP ~12% in one run and was flat in another (high variance). Each iteration of `check_checker_lib_file` shares `&program.type_interner` (DashMap) and `shared_lib_cache: Arc<DashMap>`. My theory: TypeInterner shard contention eats the parallelism budget. Did not pursue further.
Peak heap is 10.1 GB vs tsgo's 16 MB — roughly 600× more memory on the same workload. Some of it is cache state (~2 GB pre-merge bind data, ~500 MB bound files, ~300 MB AST arenas — all program-wide and proportional to file count), but that enumerated state accounts for under 3 GB of the peak. The remaining ~7 GB strongly suggests tsz allocates and frees a LOT of intermediate type representations during checking, vs tsgo which apparently computes types lazily/sparingly.
- Unbounded `is_file`/`is_dir` cache on the BFS resolve path: every candidate path is unique (cache hit rate ~0%), so the cache grew without bound, hit 25 GB+ of memory, and rehashing dominated CPU.
- Bounded `read_package_json` cache: within noise on subset3 (4.43 s of I/O Read there leaves no room to measure improvement). Couldn't validate on the full bench in reasonable time.
- Extending `.with_min_len(1)` to other par_iter sites (`build_cross_file_binders`, `collect_module_specifiers`, `per_file_ts7016_diagnostics`, `prepare_binders`): slowed subset3 from 248 s to 263 s. Those phases have uniform per-file work; fine-grained stealing adds scheduler overhead without any load-balancing benefit.
- Side-by-side benching lies: my early v3 PR claimed a 35% subset3 win that turned out to be I/O contention from running two binaries simultaneously; solo runs of v3 vs main were within noise.
- PR #1618 — bench-script enable for tsz on large-ts-repo; bumps the run timeout 600 s → 1500 s.
- PR #1619 — `resolve_import_target_from_file` index-first fast path. Within noise solo on subset3; expected to help on workloads where the fallback resolver fires often (project-relative bare specifiers like `packages/foo/src/bar.ts`).
- PR #1623 — parallel `read_source_files` BFS file-read + import-scan (level-synchronous restructure). -4% I/O Read on subset3; expected to be larger on the full bench, where read_source_files is 85% of wall time.
- PR #1626 — `.with_min_len(1)` on the per-file check par_iter. -5.4% measured on subset3 across two runs; the only PR with a directly attributable wall-clock win.
I want to push past incremental wins and make the architectural moves that could close a 200×+ gap. Please prioritize answers in terms of where you'd actually start.
The sample stack on subset3 shows one file's check work taking 95% of total wall time, with the cascade going delegate_cross_arena_interface_type → compute_interface_type_from_declarations → get_type_of_interface → ... → recursive compute_type_of_symbol → ... → delegate again. Why is one file's check work non-parallelizable like this even when 10 workers are available? Is the right fix to:
- (a) make `delegate_cross_arena_symbol_resolution` itself parallelize its sub-tasks (split the cross-file resolution graph across workers)?
- (b) make the type-computation cascade lazier, so a cross-file delegation triggers a small amount of work instead of a big one (Salsa-style demand-driven — see the sketch after this list)?
- (c) accept that one heavy file gates per-file parallelism, and instead improve single-thread throughput in the cascade?
What's the standard play in compilers (rustc, clang, swift) for this shape of problem?
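To make (b) concrete, the shape I have in mind is a program-wide memo keyed by symbol, so a delegation demands exactly one symbol's type instead of rebuilding a file's environment. A minimal sketch, all names hypothetical (this is not tsz's current API; cycle detection elided):

```rust
use dashmap::DashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct SymbolId(u32);

#[derive(Clone)]
struct Ty(u32); // an interned type id in the real thing

struct Db {
    // Program-wide memo: anything already computed is a cheap shared read,
    // from any worker thread.
    memo: DashMap<SymbolId, Ty>,
}

fn type_of(db: &Db, sym: SymbolId) -> Ty {
    if let Some(t) = db.memo.get(&sym) {
        return t.value().clone();
    }
    // Compute only this symbol's type; recursive dependencies re-enter
    // type_of and hit the memo. Two threads may race and compute the same
    // type twice — harmless if Ty is a stable interned id.
    let t = compute_type(db, sym);
    db.memo.entry(sym).or_insert(t).value().clone()
}

fn compute_type(_db: &Db, _sym: SymbolId) -> Ty {
    Ty(0) // placeholder for the real cascade
}
```

Salsa adds dependency tracking and invalidation on top of this shape; for a one-shot batch check, the memo alone might be enough.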
We use a sharded DashMap (16 shards by default) for the type interner. Per the PR #1618 description, "16-core actual ≈ 16x slower than ideal" implies the shards are very hot. Is the right move:
- (a) increase shard count (256? 1024?)
- (b) per-thread interner with periodic merge (and accept that some types get duplicate IDs until merge)
- (c) thread-local "scratch" interners that are batched into the global on commit (a read-through variant is sketched below)
- (d) something else entirely (e.g., a CRDT-style append-only structure)
Are there patterns from other Rust/concurrent compilers (rustc's interner, salsa's incremental queries) that are directly applicable?
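And a read-through cousin of (c) in miniature: because interning is append-only, a thread-local cache never goes stale, so hits need no merge step at all (names made up; a `Mutex<HashMap>` stands in for the sharded DashMap):

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Hash, PartialEq, Eq)]
struct TypeKey(String); // stand-in for a structural type representation

#[derive(Clone, Copy)]
struct TypeId(u32);

struct SharedInterner {
    table: Mutex<HashMap<TypeKey, TypeId>>, // sharded DashMap in the real thing
}

thread_local! {
    // Per-thread read-through cache: hits never touch the shared table.
    static SCRATCH: RefCell<HashMap<TypeKey, TypeId>> = RefCell::new(HashMap::new());
}

impl SharedInterner {
    fn intern(&self, key: TypeKey) -> TypeId {
        // Fast path: thread-local hit, zero cross-thread traffic.
        if let Some(id) = SCRATCH.with(|s| s.borrow().get(&key).copied()) {
            return id;
        }
        // Slow path: one shared-table access, then cache locally so repeated
        // interns of the same type on this thread never contend again.
        let id = {
            let mut table = self.table.lock().unwrap();
            let next = TypeId(table.len() as u32);
            *table.entry(key.clone()).or_insert(next)
        };
        SCRATCH.with(|s| s.borrow_mut().insert(key, id));
        id
    }
}
```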
Per-checker `RefCell<FxHashMap<SymbolId, usize>>`, copied parent → child on each cross-arena delegation. I'm aware of the `Arc<HashMap>` + `Arc::make_mut` COW pattern, but children write to the overlay (it gains new mappings as resolution discovers them), so `make_mut` would just defer the clone. Is there a smarter data structure here? The fact that the overlay is "discoveries during this check" suggests append-only — would a HAMT (an immutable persistent map) avoid the clone entirely?
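What I picture, using the `im` crate's persistent HashMap as the stand-in HAMT (`u64` for `SymbolId`) — clone is O(1) via structural sharing:

```rust
use std::cell::RefCell;
// im::HashMap is a persistent HAMT: clone() shares structure in O(1);
// insert path-copies only the touched trie nodes.
use im::HashMap as PersistentMap;

struct Overlay {
    targets: RefCell<PersistentMap<u64, usize>>,
}

impl Overlay {
    fn copy_to(&self, child: &Overlay) {
        // O(1) regardless of map size — no RawTable::clone memmove.
        *child.targets.borrow_mut() = self.targets.borrow().clone();
    }

    fn register(&self, sym: u64, file: usize) {
        // O(log n) path copy; the parent's snapshot is unaffected, so the
        // "children write too" objection to make_mut doesn't apply here.
        self.targets.borrow_mut().insert(sym, file);
    }
}
```

The trade-off is slower point lookups than FxHashMap; it only wins if clone volume dominates lookup volume, which the 14% memmove figure suggests it might.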
Today the BFS uses module_resolver (a Rust port of TypeScript's resolver), which has per-instance RefCell caches (`package_json_cache`, `node_modules_dir_cache`, `skip_fallback_cache`, etc.). I parallelized the file-read phase, but the resolve-imports phase is still serial because the resolver isn't `Send + Sync`. Is the right move to make the entire resolver thread-safe (DashMap caches everywhere, `&self` everywhere instead of `&mut self`), or is there a simpler split — e.g., N per-thread resolvers that share their caches via DashMap (sketched below)?
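The simpler split, sketched (all names hypothetical): keep the resolver `!Sync` with its RefCell internals, build one per rayon worker via `map_init`, and move only the filesystem-fact caches behind shared DashMaps:

```rust
use std::{path::PathBuf, sync::Arc};
use dashmap::DashMap;
use rayon::prelude::*;

// Shared, Sync caches holding the expensive stat()/read results.
#[derive(Clone, Default)]
struct SharedCaches {
    dir_exists: Arc<DashMap<PathBuf, bool>>,
    package_json: Arc<DashMap<PathBuf, Arc<String>>>, // parsed form in reality
}

// The resolver keeps its per-instance RefCell caches and stays !Sync;
// we simply construct one per worker thread.
struct Resolver {
    caches: SharedCaches,
    // ... per-instance RefCell caches unchanged ...
}

impl Resolver {
    fn new(caches: SharedCaches) -> Self {
        Resolver { caches }
    }
    fn resolve(&mut self, _from: usize, _spec: &str) -> Option<PathBuf> {
        None // real lookup here, consulting self.caches before hitting the FS
    }
}

fn resolve_level(specs: &[(usize, String)], caches: &SharedCaches) -> Vec<PathBuf> {
    specs
        .par_iter()
        .map_init(
            || Resolver::new(caches.clone()), // one resolver per thread
            |resolver, (from, spec)| resolver.resolve(*from, spec),
        )
        .flatten()
        .collect()
}
```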
The other observation: tsgo apparently does this (or doesn't have this problem) — tsgo finishes the same workload in 2.45 s. Does tsgo run module resolution in parallel? Or does it skip most of the work somehow (e.g., it doesn't validate as many candidate paths)?
The 10.1 GB peak heap is the most damning number. Can you guess what tsz is doing wrong here without seeing more profiles? My hypotheses:
- per-file CheckerState construction allocates fresh maps that don't survive the file's check, summing to a multi-GB peak (the `extract_type_cache` flag exists exactly to avoid this in `--noEmit` mode, but per-file state still grows during the check)
- type-environment intermediate computations (mapped types, conditional types) materialize giant temporary structures
- 6086 files × 1429 lib symbols × per-file binder merge (~8.7M symbol merges) → some quadratic blowup in the binder phase
Where would you look first?
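For attribution, the harness I'd reach for first — a minimal heap-profiling setup with the `dhat` crate (an assumption on my part; any allocation profiler with site attribution would do):

```rust
// Cargo.toml: dhat = "0.3"
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // Writes dhat-heap.json on drop; the DHAT viewer's peak-heap view should
    // split the 10.1 GB between per-file CheckerState maps, interned types,
    // and binder-merge duplication.
    let _profiler = dhat::Profiler::new_heap();
    // ... run the check phase on large-ts-repo (or subset3) here ...
}
```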
Honest gut-check question. Even if I do everything right (real parallelism, no contention, lazy types), tsz's algorithmic shape is fundamentally tsc's. tsc itself takes ~30+ s on large-ts-repo. tsgo at 2.45 s is already 12× faster than tsc. Is the right framing:
- "tsgo wins by being a leaner reimplementation; tsz can match it but not beat it 2× given the same algorithm"
- "tsgo wins because Go's GC pattern matches the workload and Rust's allocator pattern doesn't; we need a different memory strategy"
- "there's plenty of room on top of tsgo if we go genuinely demand-driven (Salsa) instead of eagerly checking everything"
Or is the 2× target a stretch goal that's expected to take multiple architectural rewrites to reach?
Any of these I'm framing wrong? Anything I should be measuring that I'm not? Profiling tool recommendations beyond samply for catching contention specifically (perf c2c-style)?
Thanks for reading this far.