perf(delta): unroll copy-arg decode — 90× faster, profile-driven #30
delta-apply-fast-path into main
Sub-issue under fangorn/anvil#153 umbrella, validated this time with an actual profile.
What the profile said
Sample-based stack profiler captured ~48k samples on Pack.Reader.parse/2 applied to a real ovs pack (96 MB, 134k objects, 108k deltas). The dominant hot path was NOT the broken offset cache (#29), NOR the find_compressed_length binary search I’d theorized as the “actual” bottleneck — it was Pack.Delta.read_if_bit/3:
| samples | path |
|---|---|
| 18,471 | apply > apply_instructions > read_copy_size |
| 14,628 | apply > apply_instructions > read_copy_offset |
| 1,625 | read_copy_offset > read_if_bit |
| 1,172 | read_copy_size > read_if_bit |
Together these paths account for ~74% of total CPU. Each copy command made 7 sequential pattern matches via read_if_bit/3 (4 offset + 3 size), and the false branch reconstructed <<byte, rest::binary>> instead of returning the original binary. Across ovs’s ~5 million delta instructions, that adds up to tens of millions of redundant binary allocations.
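For readers who have not seen the old code, here is a rough reconstruction of the shape described above. It is a sketch only: the module name, the clause bodies, and the return tuples are assumptions made for illustration, not the actual Pack.Delta source. It shows the two costs the profile pointed at: one function call and one pattern match per flag bit, and a bit-clear clause that splits the input only to rebuild it.

```elixir
defmodule ReadIfBitSketch do
  import Bitwise

  # Bit set: consume one byte and return the tail.
  def read_if_bit(cmd, mask, <<byte, rest::binary>>) when (cmd &&& mask) != 0,
    do: {byte, rest}

  # Bit clear: nothing is consumed, yet the input is split and then rebuilt.
  # (Hypothetical shape; returning the original binary unchanged would avoid this.)
  def read_if_bit(_cmd, _mask, <<byte, rest::binary>>),
    do: {0, <<byte, rest::binary>>}

  # Four chained calls for the offset alone; the size needs three more.
  def read_copy_offset(cmd, data) do
    {b0, data} = read_if_bit(cmd, 0x01, data)
    {b1, data} = read_if_bit(cmd, 0x02, data)
    {b2, data} = read_if_bit(cmd, 0x04, data)
    {b3, data} = read_if_bit(cmd, 0x08, data)
    {b0 ||| (b1 <<< 8) ||| (b2 <<< 16) ||| (b3 <<< 24), data}
  end
end
```

With cmd = 0x03, only the first two calls consume a byte; the other two still run a full match and hand back a rebuilt copy of the same data.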
Fix
apply_instructions/3’s copy clause now calls decode_copy_args(cmd, data), which consumes exactly the right number of bytes in ONE binary pattern match. The 128 specialized clauses (16 offset bitmaps × 8 size bitmaps) are generated at compile time via macros, so the BEAM compiler picks the matching clause in a single dispatch. No more read_if_bit, no more 7-step splits, no more no-op binary reconstruction.
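One way such clauses could be generated is sketched below. This is a minimal illustration, not the code in this PR: the module name CopyArgsSketch, the {offset, size, rest} return shape, and the use of a module-body comprehension with unquote fragments (rather than a defmacro) are all assumptions. The bit layout follows the git delta copy instruction: bits 0..3 of the command byte flag which little-endian offset bytes are present, bits 4..6 flag the size bytes, and an encoded size of 0 means 0x10000.

```elixir
defmodule CopyArgsSketch do
  import Bitwise

  # Builds the AST for `0 ||| (v0 <<< s0) ||| (v1 <<< s1) ...`.
  combine = fn fields ->
    Enum.reduce(fields, 0, fn {var, shift}, acc ->
      quote(do: unquote(acc) ||| (unquote(var) <<< unquote(shift)))
    end)
  end

  # One clause per combination of the 7 presence bits: 16 x 8 = 128 clauses.
  for bits <- 0..127 do
    offset_fields =
      for i <- 0..3, (bits &&& (1 <<< i)) != 0,
        do: {Macro.var(:"o#{i}", __MODULE__), 8 * i}

    size_fields =
      for i <- 0..2, (bits &&& (1 <<< (4 + i))) != 0,
        do: {Macro.var(:"s#{i}", __MODULE__), 8 * i}

    vars = for {var, _shift} <- offset_fields ++ size_fields, do: var
    rest = Macro.var(:rest, __MODULE__)

    # Pattern <<o0, o1, ..., s0, ..., rest::binary>> holding only the bytes present.
    pattern = {:<<>>, [], vars ++ [quote(do: unquote(rest) :: binary)]}

    def decode_copy_args(unquote(0x80 ||| bits), unquote(pattern)) do
      size = unquote(combine.(size_fields))
      # Per the git delta format, an encoded size of 0 stands for 0x10000.
      {unquote(combine.(offset_fields)), if(size == 0, do: 0x10000, else: size), unquote(rest)}
    end
  end
end
```

For example, CopyArgsSketch.decode_copy_args(0x91, <<0x05, 0x0A, 0xFF>>) matches the clause for "offset byte 0 and size byte 0 present" and returns {5, 10, <<0xFF>>}. Because the command byte is a literal in every clause head, clause selection can compile down to a jump on that byte followed by a single binary match.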
Measured improvement
Same Pack.Reader.parse/2 on the same 96 MB ovs pack:
| Before | After |
|---|---|
| >27 min, never completed | 18.4 s |
In the post-fix profile, Pack.Delta dropped from ~37k samples (~74%) to ~270 samples (~12% of the remaining samples). The new top entry is decompress_data → probe_compressed_length, the binary-search bottleneck I’d theorized originally; that’s the next sub-issue tracked under #153.
Test plan
- 928/0 across full ex_git_objectstore suite.
- mix format --check-formatted clean.
- mix dialyzer clean.
- Live ovs push test against prod once this and the anvil mix.lock bump deploy. Expectation: parse phase drops from ‘never finishes’ to ~20 s. Total push: ~2 min.
Memory note
Peak RSS during the 18 s parse was ~3.5 GB on this pack — GC churn plus the buffered resolved entries list. Streaming parse-and-store is still the next big sub-issue; this PR just makes the existing path actually finish in finite time.