perf(commit-walk): per-process pack cache + pack-first ordering (~120× faster) #24
commit-walk-pack-cache
into main
Summary
Two complementary fixes for slow commit-graph walks (Graph.Fallback.ahead_behind/4, commits_between/4, ancestor?/4).
1. Per-process pack cache
Every ObjectResolver.read/2 was calling Repo.storage_call(:get_pack), which on the Filesystem backend is a File.read/1 of the entire pack file, all serialized through Erlang’s singleton file server process (:file_server_2, which performs the I/O via :prim_file). A 3000-commit walk therefore meant 3000 full pack reads.
Memoize four things in the process dictionary:
- pack data (get_pack)
- parsed pack index (get_pack_index + Index.parse)
- index-derived SHA→offset cache (build_sha_cache)
- pack listing (list_packs)
Process-dict scope is correct because pack files are content-addressed (filename has SHA, contents immutable) and one LiveView mount = one process = one cache lifetime. Public clear_pack_cache/0 for tests / explicit invalidation.
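A minimal sketch of the process-dictionary memoization described above. The module name `PackCacheSketch`, the key shape, and the loader function are hypothetical and chosen for illustration; the PR's actual cache keys and its `clear_pack_cache/0` implementation may differ.

```elixir
defmodule PackCacheSketch do
  # Hypothetical module, not the PR's code.

  # Return the cached value for `key` in the calling process, running `fun`
  # only on the first request. A `nil` result is not cached, because
  # Process.get/1 returns nil for missing keys.
  def memoize(key, fun) when is_function(fun, 0) do
    cache_key = {__MODULE__, key}

    case Process.get(cache_key) do
      nil ->
        value = fun.()
        Process.put(cache_key, value)
        value

      value ->
        value
    end
  end

  # Drop every entry this module stored in the calling process,
  # leaving unrelated process-dictionary keys untouched.
  def clear do
    for {{__MODULE__, _} = key, _value} <- Process.get(), do: Process.delete(key)
    :ok
  end
end
```

A caller would wrap each expensive read, for example `PackCacheSketch.memoize({:pack, pack_sha}, fn -> load_pack.(pack_sha) end)`, where `load_pack` is a stand-in for the storage call. Because the cache lives in the process dictionary, it disappears when the process exits.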
2. Pack-first ordering
After (1), profiling showed loose-object existence checks dominating: Object.read was checking for a loose object before falling through to packs. Every commit-walk SHA paid one :prim_file round-trip to confirm “no loose file” on a repo where all commits are packed. That adds up to roughly 90k synchronous round-trips per page mount (consistent with 30 PRs × ~3000 commits).
Inverts ObjectResolver.read/2 to check packs first, fall through to loose objects on miss. After the cache is warm, pack lookup is Index.lookup/2 (binary search in memory) — no syscall, no GenServer. Loose check now only fires for SHAs not in any pack (recently-written, not-yet-packed case).
Safe by content addressing: a SHA in both loose AND a pack must have byte-identical content. Git itself uses pack-first ordering for the same reason.
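The lookup order can be sketched as follows. This is an illustration under stated assumptions, not the PR's ObjectResolver: `PackFirstSketch` is a hypothetical module, each pack is modeled as a plain map from SHA to object (standing in for a warm in-memory index), and `read_loose` is a caller-supplied function standing in for the loose-object file read.

```elixir
defmodule PackFirstSketch do
  # Hypothetical stand-in for the resolver.
  # `packs` is a list of maps (SHA => object) playing the role of cached packs.
  # `read_loose` is only invoked when no pack contains the SHA.
  def read(sha, packs, read_loose) when is_function(read_loose, 1) do
    case Enum.find_value(packs, fn pack -> Map.get(pack, sha) end) do
      # Miss in every pack: fall through to the loose-object path (file I/O).
      nil -> read_loose.(sha)
      # Hit: answered from memory, no file-system round-trip.
      object -> {:ok, object}
    end
  end
end
```

With this ordering, a fully packed repository never reaches `read_loose` during a commit walk; only SHAs missing from every pack pay the file-system cost.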
Empirical impact
Profiled Git.ahead_behind for chiron’s 30 open PRs against bare clone of fangorn/chiron (~3000-commit main history, all packed) inside a memory-capped OrbStack container.
| Metric | Before | After pack cache only | After + pack-first |
|---|---|---|---|
| Total wall (30 PRs) | ~10 min (extrapolated) | 11.8 s | 4.9 s |
| Per-PR mean | 19500 ms | 395 ms | 164 ms |
| First call (cold cache) | 28776 ms | 957 ms | 486 ms |
| Subsequent calls | 15-19 s | 400-500 ms | 150-170 ms |
| Flame samples for :gen.do_call/4 | ~22000 | ~2200 | ~38 |
Cumulative speedup is ~120× end-to-end (per-PR mean 19500 ms → 164 ms). The page, previously unusable at an extrapolated ~10 min, now loads in about 5 s.
What’s left at 164ms/call
Pure zlib decompression of commit objects. Reader.inflate_prefix_size/4 and :zlib.* calls now dominate the flame graph. That’s the actual algorithmic work of unpacking each commit — unavoidable without batching/parallelism.
Further improvements possible but lower-leverage:
- Reuse zlib state across object reads (avoid :zlib.inflateInit/inflateEnd per-object)
- Consumer-side parallelism (Task.async_stream over PRs in Anvil’s compute_ahead_behind/2); would cut to ~500ms wall via concurrency
- Architectural fix using :file.pread/3 with :raw mode to bypass :prim_file for uncached cold reads
None of those are in this PR.
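Purely as an illustration of the consumer-side parallelism idea, and not part of this PR, a sketch might look like the following. `ParallelAheadBehindSketch` and the `compute` callback are hypothetical; `compute` stands in for one per-PR ahead/behind call.

```elixir
defmodule ParallelAheadBehindSketch do
  # Hypothetical helper: runs `compute` once per PR, concurrently,
  # and returns the results in the same order as `prs`.
  def run(prs, compute, opts \\ []) when is_function(compute, 1) do
    prs
    |> Task.async_stream(compute,
      max_concurrency: Keyword.get(opts, :max_concurrency, System.schedulers_online()),
      timeout: Keyword.get(opts, :timeout, 30_000),
      ordered: true
    )
    # With the default on_timeout: :exit, every emitted element is {:ok, result}.
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

One design consideration: `Task.async_stream` runs each element in its own process, so a process-dictionary cache would start cold in every task unless the cached data is shared some other way.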
API surface
- clear_pack_cache/0: public, drops cached entries in the calling process
Verified
- mix test: 903/903 pass (51 excluded, same as main)
- mix format --check-formatted: clean
- mix credo --strict: clean on changed file