tsz is a Rust port of the TypeScript compiler, aiming for parity with tsc behavior plus better speed than tsgo (the Go port). On most fixtures we're competitive or ahead, but on multi-package monorepo workloads we are catastrophically behind:
| Fixture | Size | tsgo | tsz | Factor |
|---|---|---|---|---|
| utility-types | small | 117ms | 100ms | tsz 1.17× faster |
| type-fest (heavy .d.ts mapped types) | 16.5K LOC | 103ms | 219ms | tsgo 2.11× faster |
| large-ts-repo (monorepo) | 6086 files | 2.45s | 706s workstation / 30+ min on this dev box | tsgo 288× faster |
| large-ts-repo peak heap | — | 16 MB | 10.1 GB | tsgo ~600× less memory |
Goal stated by the project lead: tsz must be 2× faster than tsgo on large-ts-repo, i.e. ~1.2 s. We're not 30% off — we're ~600× off that target (706 s vs 1.2 s) on that workload. The user's directive was "do fundamentally correct work and dig deep", so I'm trying to understand what the right architectural moves are, not just chase 5% wins.
- tsz-binder: per-file binder produces a `BinderState` (symbols, scopes, file_locals, etc.). Lib symbols are merged in per-file (lots of duplication).
- tsz-checker: per-file `CheckerState` with a `CheckerContext` that owns the file's binder reference plus shared cross-file indices. Type computation goes through `compute_type_of_symbol` → `delegate_cross_arena_symbol_resolution` for cross-file symbols.
- tsz-solver: shared `TypeInterner` (DashMap-sharded) and a `QueryCache` (per-checker RefCell, optionally backed by a shared `SharedQueryCache`).
- CLI driver (tsz-cli): does a `read_source_files` BFS to walk the import graph, builds the program, then runs `collect_diagnostics`, which has a per-file `par_iter` over `work_items` calling `check_file_for_parallel`, plus a sequential post-merge "lib recheck" loop.
Profile of full bench at ~5 s into a fresh run:
```
read_source_files                 85% inclusive
  module_resolver::lookup         64%
    resolve_module_specifier      22%
    Path::is_dir / Path::is_file  stat() (multiple call sites)
    read_package_json             open() + serde_json::from_str
```
The calling thread holds 100% of CPU; 10 rayon worker threads (already initialized via ensure_rayon_global_pool()) sit idle. Hot leaves: Path::is_dir, Path::is_file, open, read. Each import expands into ~10 candidate paths (extension fan-out + path-mappings) and each candidate is stat()-checked. With 6086 files × ~20 imports × ~10 candidates = ~1.2M syscalls.
Action I took (PR #1623): restructured the BFS as level-synchronous — drain pending into a per-level batch, classify (cached/skip-js/read), parallelize the read+import-scan phase via par_iter, then serial-resolve imports back into the queue. Workers now do real work (ps -p shows ~0.6 s UTIME each during BFS phase). Subset3 (1429 files, where I/O Read is only 4.4 s of 261 s) showed -4% I/O Read, expected to be much larger on full bench. Cannot validate full-bench impact — full bench takes 30+ min on this workstation and we have no CI bench job that tracks tsz on large-ts-repo yet (PR #1618 enables it).
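For reference, the level-synchronous shape in miniature — stand-in types and functions, visited-set bookkeeping and the cached/skip-js/read classification elided, so this is a sketch of the idea rather than the actual PR code:

```rust
use rayon::prelude::*;

type FileId = usize;

// Stand-ins for the real per-file read+scan and the serial resolver call.
fn read_and_scan_imports(_f: FileId) -> Vec<String> { Vec::new() }
fn resolve(_from: FileId, _spec: &str) -> Option<FileId> { None }

fn level_synchronous_bfs(mut pending: Vec<FileId>) {
    while !pending.is_empty() {
        // Drain the frontier into one batch per level.
        let batch: Vec<FileId> = pending.drain(..).collect();

        // Parallel phase: file read + import-specifier scan. Pure per-file
        // work, no resolver state touched, so rayon workers get real work.
        let scanned: Vec<(FileId, Vec<String>)> = batch
            .into_par_iter()
            .map(|f| (f, read_and_scan_imports(f)))
            .collect();

        // Serial phase: module resolution (the resolver is !Send),
        // feeding the next level's frontier.
        for (from, specs) in scanned {
            for spec in specs {
                if let Some(next) = resolve(from, &spec) {
                    pending.push(next);
                }
            }
        }
    }
}
```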
Sample of subset3 mid-check (250 s of 261 s in check phase): exactly one worker thread at 98% CPU, the other nine at 0%, deep in:
```
collect_diagnostics → check_source_file → prepare_source_file_for_checking
  → build_type_environment → compute_type_of_symbol (recursive)
  → delegate_cross_arena_interface_type
  → compute_interface_type_from_declarations
  → get_type_of_interface
  → ...
  → copy_symbol_file_targets_to   (HashMap clone, 543/3777 = 14% of CPU samples!)
```
The check phase IS parallelized at the source-file level (`work_items.par_iter().zip(per_file_binders.into_par_iter()).map(|(file_idx, binder)| check_file_for_parallel(...))`), but in practice one file with deep `delegate_cross_arena` cascades dominates the wall time. Adding `.with_min_len(1)` to force fine-grained work-stealing helped — measured -5.4% on subset3 across two runs each (PR #1626) — so the chunking is genuinely an issue at the rayon level, not just an artifact of the work being sequential.
But the gain plateaued at ~5% because the underlying problem (one file's compute_type_of_symbol dominates the entire wall time) is unsolved. The deeply-recursive type-computation cascade for that one file STILL runs on a single thread, and other workers run out of independent work.
Every cross-arena delegation creates a child checker via `Box::new(CheckerState::with_parent_cache(...))` and copies the parent's local overlay:

```rust
pub fn copy_symbol_file_targets_to(&self, child: &mut CheckerContext<'_>) {
    let overlay = self.cross_file_symbol_targets.borrow(); // RefCell<FxHashMap<SymbolId, usize>>
    if !overlay.is_empty() {
        // Full deep clone of the overlay map on every delegation.
        *child.cross_file_symbol_targets.borrow_mut() = overlay.clone();
    }
}
```

The sample showed `_platform_memmove` calls inside `RawTable::clone` at 543 samples (~14%), reached via `delegate_cross_arena_symbol_resolution`. The recursion chain is very deep (delegate → `compute_type_of_symbol` → `type_alias_variable_alias` → `resolve_global_jsdoc_typedef_info` → `copy_symbol_file_targets_to`).
I considered `RefCell<Arc<FxHashMap<...>>>` with `Arc::make_mut` for copy-on-write semantics, but `register_symbol_file_target` is called in many places (JSX orchestration, type_node resolution) — children DO write to the overlay, so `make_mut` would clone on first write anyway. Net: the same number of HashMap clones, just deferred (see the sketch below). There might be some win if some children never write at all, but I have no measurement showing that.
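For concreteness, the COW variant I considered and rejected — hypothetical names, with `u64` standing in for `SymbolId`:

```rust
use std::{cell::RefCell, collections::HashMap, sync::Arc};

struct Overlay {
    // COW variant of cross_file_symbol_targets: parent -> child "copy"
    // becomes an Arc refcount bump instead of a full map clone.
    targets: RefCell<Arc<HashMap<u64, usize>>>,
}

impl Overlay {
    fn copy_to(&self, child: &Overlay) {
        // O(1): share the map.
        *child.targets.borrow_mut() = Arc::clone(&self.targets.borrow());
    }

    fn register(&self, sym: u64, file: usize) {
        // The catch: if the Arc is shared (it is, right after copy_to),
        // make_mut deep-clones the whole map here — the clone is deferred
        // to the child's first write, not eliminated.
        Arc::make_mut(&mut *self.targets.borrow_mut()).insert(sym, file);
    }
}
```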
PR #1618's description (which enables large-ts-repo in CI) explicitly says:
> The 288× perf gap on this workload is structural (per-file Check phase = 95% of runtime; `par_iter` already used but TypeInterner DashMap contention likely defeats parallelism — 16-core actual ≈ 16× slower than ideal).
I tried parallelizing the per-iteration lib-recheck loop (`for lib_idx in 0..checker_libs.files.len() { check_checker_lib_file(...) }`, ~30 lib files at ~7-8 s each). Workers DID get scheduled (the 10-thread pool went from 1 active thread to all active during that phase), but total wall time went UP ~12% in one run and was flat in another (high variance). Each iteration of `check_checker_lib_file` shares `&program.type_interner` (DashMap) and `shared_lib_cache: Arc<DashMap>`. My theory: TypeInterner shard contention eats the parallelism budget. Did not pursue further.
Peak heap is 10.1 GB vs tsgo's 16 MB — roughly 600× more memory on the same workload. Some of it is cache state (~2 GB pre-merge bind data, ~500 MB bound files, ~300 MB AST arenas — all program-wide and proportional to file count), but that enumerated state accounts for under 3 GB of the peak. The remaining ~7 GB strongly suggests tsz allocates and frees a LOT of intermediate type representations during checking, vs tsgo which apparently computes types lazily/sparingly.
- Unbounded `is_file`/`is_dir` cache on the BFS resolve path: every candidate path is unique (cache hit rate ~0%), so the cache grew without bound, hit 25 GB+ of memory, and rehashing dominated CPU.
- Bounded `read_package_json` cache: within noise on subset3 (4.43 s of I/O Read there leaves no room to measure improvement). Couldn't validate on the full bench in reasonable time.
- Extending `.with_min_len(1)` to other par_iter sites (`build_cross_file_binders`, `collect_module_specifiers`, `per_file_ts7016_diagnostics`, `prepare_binders`): slowed subset3 from 248 s to 263 s. Those phases have uniform per-file work; fine-grained stealing adds scheduler overhead without any load-balancing benefit.
- Side-by-side benching lies: my early v3 PR claimed a 35% subset3 win that turned out to be I/O contention from running two binaries simultaneously; solo runs of v3 vs main were within noise.
- PR #1618 — bench-script enable for tsz on large-ts-repo; bumps the run timeout 600 s → 1500 s.
- PR #1619 — `resolve_import_target_from_file` index-first fast path. Within noise solo on subset3; expected to help on workloads where the fallback resolver fires often (project-relative bare specifiers like `packages/foo/src/bar.ts`).
- PR #1623 — parallel `read_source_files` BFS file-read + import-scan (level-synchronous restructure). -4% I/O Read on subset3; expected to be larger on the full bench, where read_source_files is 85% of wall time.
- PR #1626 — `.with_min_len(1)` on the per-file check par_iter. -5.4% measured on subset3 across two runs; the only PR with a directly attributable wall-clock win.
I want to push past incremental wins and make the architectural moves that could close a 200×+ gap. Please prioritize answers in terms of where you'd actually start.
The sample stack on subset3 shows one file's check work taking 95% of total wall time, with the cascade going delegate_cross_arena_interface_type → compute_interface_type_from_declarations → get_type_of_interface → ... → recursive compute_type_of_symbol → ... → delegate again. Why is one file's check work non-parallelizable like this even when 10 workers are available? Is the right fix to:
- (a) make `delegate_cross_arena_symbol_resolution` itself parallelize its sub-tasks (split the cross-file resolution graph across workers)?
- (b) make the type-computation cascade lazier, so a cross-file delegation triggers a small amount of work instead of a big one (Salsa-style demand-driven — see the sketch after this list)?
- (c) accept that one heavy file gates per-file parallelism, and instead improve single-thread throughput in the cascade?
What's the standard play in compilers (rustc, clang, swift) for this shape of problem?
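To make (b) concrete, the shape I have in mind is a program-wide memo keyed by symbol, so a delegation demands exactly one symbol's type instead of rebuilding a file's environment. A minimal sketch, all names hypothetical (this is not tsz's current API; cycle detection elided):

```rust
use dashmap::DashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct SymbolId(u32);

#[derive(Clone)]
struct Ty(u32); // an interned type id in the real thing

struct Db {
    // Program-wide memo: anything already computed is a cheap shared read,
    // from any worker thread.
    memo: DashMap<SymbolId, Ty>,
}

fn type_of(db: &Db, sym: SymbolId) -> Ty {
    if let Some(t) = db.memo.get(&sym) {
        return t.value().clone();
    }
    // Compute only this symbol's type; recursive dependencies re-enter
    // type_of and hit the memo. Two threads may race and compute the same
    // type twice — harmless if Ty is a stable interned id.
    let t = compute_type(db, sym);
    db.memo.entry(sym).or_insert(t).value().clone()
}

fn compute_type(_db: &Db, _sym: SymbolId) -> Ty {
    Ty(0) // placeholder for the real cascade
}
```

Salsa adds dependency tracking and invalidation on top of this shape; for a one-shot batch check, the memo alone might be enough.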
We use a sharded DashMap (16 shards by default) for the type interner. Per the PR #1618 description, "16-core actual ≈ 16x slower than ideal" implies the shards are very hot. Is the right move:
- (a) increase shard count (256? 1024?)
- (b) per-thread interner with periodic merge (and accept that some types get duplicate IDs until merge)
- (c) thread-local "scratch" interners that are batched into the global on commit (a read-through variant is sketched below)
- (d) something else entirely (e.g., a CRDT-style append-only structure)
Are there patterns from other Rust/concurrent compilers (rustc's interner, salsa's incremental queries) that are directly applicable?
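And a read-through cousin of (c) in miniature: because interning is append-only, a thread-local cache never goes stale, so hits need no merge step at all (names made up; a `Mutex<HashMap>` stands in for the sharded DashMap):

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Hash, PartialEq, Eq)]
struct TypeKey(String); // stand-in for a structural type representation

#[derive(Clone, Copy)]
struct TypeId(u32);

struct SharedInterner {
    table: Mutex<HashMap<TypeKey, TypeId>>, // sharded DashMap in the real thing
}

thread_local! {
    // Per-thread read-through cache: hits never touch the shared table.
    static SCRATCH: RefCell<HashMap<TypeKey, TypeId>> = RefCell::new(HashMap::new());
}

impl SharedInterner {
    fn intern(&self, key: TypeKey) -> TypeId {
        // Fast path: thread-local hit, zero cross-thread traffic.
        if let Some(id) = SCRATCH.with(|s| s.borrow().get(&key).copied()) {
            return id;
        }
        // Slow path: one shared-table access, then cache locally so repeated
        // interns of the same type on this thread never contend again.
        let id = {
            let mut table = self.table.lock().unwrap();
            let next = TypeId(table.len() as u32);
            *table.entry(key.clone()).or_insert(next)
        };
        SCRATCH.with(|s| s.borrow_mut().insert(key, id));
        id
    }
}
```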
Per-checker `RefCell<FxHashMap<SymbolId, usize>>`, copied parent → child on each cross-arena delegation. I'm aware of the `Arc<HashMap>` + `Arc::make_mut` COW pattern, but children write to the overlay (it gains new mappings as resolution discovers them), so `make_mut` would just defer the clone. Is there a smarter data structure here? The fact that the overlay is "discoveries during this check" suggests append-only — would a HAMT (an immutable persistent map) avoid the clone entirely?
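What I picture, using the `im` crate's persistent HashMap as the stand-in HAMT (`u64` for `SymbolId`) — clone is O(1) via structural sharing:

```rust
use std::cell::RefCell;
// im::HashMap is a persistent HAMT: clone() shares structure in O(1);
// insert path-copies only the touched trie nodes.
use im::HashMap as PersistentMap;

struct Overlay {
    targets: RefCell<PersistentMap<u64, usize>>,
}

impl Overlay {
    fn copy_to(&self, child: &Overlay) {
        // O(1) regardless of map size — no RawTable::clone memmove.
        *child.targets.borrow_mut() = self.targets.borrow().clone();
    }

    fn register(&self, sym: u64, file: usize) {
        // O(log n) path copy; the parent's snapshot is unaffected, so the
        // "children write too" objection to make_mut doesn't apply here.
        self.targets.borrow_mut().insert(sym, file);
    }
}
```

The trade-off is slower point lookups than FxHashMap; it only wins if clone volume dominates lookup volume, which the 14% memmove figure suggests it might.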
Today the BFS uses module_resolver (a Rust port of TypeScript's resolver), which has per-instance RefCell caches (`package_json_cache`, `node_modules_dir_cache`, `skip_fallback_cache`, etc.). I parallelized the file-read phase, but the resolve-imports phase is still serial because the resolver isn't `Send + Sync`. Is the right move to make the entire resolver thread-safe (DashMap caches everywhere, `&self` everywhere instead of `&mut self`), or is there a simpler split — e.g., N per-thread resolvers that share their caches via DashMap (sketched below)?
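The simpler split, sketched (all names hypothetical): keep the resolver `!Sync` with its RefCell internals, build one per rayon worker via `map_init`, and move only the filesystem-fact caches behind shared DashMaps:

```rust
use std::{path::PathBuf, sync::Arc};
use dashmap::DashMap;
use rayon::prelude::*;

// Shared, Sync caches holding the expensive stat()/read results.
#[derive(Clone, Default)]
struct SharedCaches {
    dir_exists: Arc<DashMap<PathBuf, bool>>,
    package_json: Arc<DashMap<PathBuf, Arc<String>>>, // parsed form in reality
}

// The resolver keeps its per-instance RefCell caches and stays !Sync;
// we simply construct one per worker thread.
struct Resolver {
    caches: SharedCaches,
    // ... per-instance RefCell caches unchanged ...
}

impl Resolver {
    fn new(caches: SharedCaches) -> Self {
        Resolver { caches }
    }
    fn resolve(&mut self, _from: usize, _spec: &str) -> Option<PathBuf> {
        None // real lookup here, consulting self.caches before hitting the FS
    }
}

fn resolve_level(specs: &[(usize, String)], caches: &SharedCaches) -> Vec<PathBuf> {
    specs
        .par_iter()
        .map_init(
            || Resolver::new(caches.clone()), // one resolver per thread
            |resolver, (from, spec)| resolver.resolve(*from, spec),
        )
        .flatten()
        .collect()
}
```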
The other observation: tsgo apparently does this (or doesn't have this problem) — tsgo finishes the same workload in 2.45 s. Does tsgo run module resolution in parallel? Or does it skip most of the work somehow (e.g., it doesn't validate as many candidate paths)?
The 10.1 GB peak heap is the most damning number. Can you guess what tsz is doing wrong here without seeing more profiles? My hypotheses:
- per-file CheckerState construction allocates fresh maps that don't survive the file's check, summing to a multi-GB peak (the `extract_type_cache` flag exists exactly to avoid this in `--noEmit` mode, but per-file state still grows during the check)
- type-environment intermediate computations (mapped types, conditional types) materialize giant temporary structures
- 6086 files × 1429 lib symbols × per-file binder merge (~8.7M symbol merges) → some quadratic blowup in the binder phase
Where would you look first?
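For attribution, the harness I'd reach for first — a minimal heap-profiling setup with the `dhat` crate (an assumption on my part; any allocation profiler with site attribution would do):

```rust
// Cargo.toml: dhat = "0.3"
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // Writes dhat-heap.json on drop; the DHAT viewer's peak-heap view should
    // split the 10.1 GB between per-file CheckerState maps, interned types,
    // and binder-merge duplication.
    let _profiler = dhat::Profiler::new_heap();
    // ... run the check phase on large-ts-repo (or subset3) here ...
}
```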
Honest gut-check question. Even if I do everything right (real parallelism, no contention, lazy types), tsz's algorithmic shape is fundamentally tsc's. tsc itself takes ~30+ s on large-ts-repo. tsgo at 2.45 s is already 12× faster than tsc. Is the right framing:
- "tsgo wins by being a leaner reimplementation; tsz can match it but not beat it 2× given the same algorithm"
- "tsgo wins because Go's GC pattern matches the workload and Rust's allocator pattern doesn't; we need a different memory strategy"
- "there's plenty of room on top of tsgo if we go genuinely demand-driven (Salsa) instead of eagerly checking everything"
Or is the 2× target a stretch goal that's expected to take multiple architectural rewrites to reach?
Any of these I'm framing wrong? Anything I should be measuring that I'm not? Profiling tool recommendations beyond samply for catching contention specifically (perf c2c-style)?
Thanks for reading this far.