perf(delta): unroll copy-arg decode — 90× faster, profile-driven #30

merged colechristensen cole.christensen@gmail.com wants to merge delta-apply-fast-path into main
No CI

Sub-issue under fangorn/anvil#153 umbrella, validated this time with an actual profile.

What the profile said

Sample-based stack profiler captured ~48k samples on Pack.Reader.parse/2 applied to a real ovs pack (96 MB, 134k objects, 108k deltas). The dominant hot path was NOT the broken offset cache (#29), NOR the find_compressed_length binary search I’d theorized as the “actual” bottleneck — it was Pack.Delta.read_if_bit/3:

samples | path
18,471 | apply > apply_instructions > read_copy_size
14,628 | apply > apply_instructions > read_copy_offset
1,625 | read_copy_offset > read_if_bit
1,172 | read_copy_size > read_if_bit

Together these frames account for ~74% of total CPU. Each copy command made 7 sequential pattern matches via read_if_bit/3 (4 offset + 3 size), and the false branch reconstructed <<byte, rest::binary>> instead of returning the original binary: tens of millions of redundant binary allocations across ovs’s ~5 million delta instructions.
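To make the allocation problem concrete, here is a minimal sketch of the two shapes a helper like read_if_bit/3 could take. This is not the code from the repository: the module name, the function bodies, and the argument order (data, cmd, mask) are all assumptions made for illustration, reconstructed only from the description above.

```elixir
defmodule ReadIfBitSketch do
  import Bitwise

  # Hypothetical "slow" shape: the function head always splits off a byte,
  # so when the bit is NOT set the byte has to be glued back on. That builds
  # a fresh binary equal to the input instead of returning the input itself.
  def read_if_bit_slow(<<byte, rest::binary>>, cmd, mask) do
    if (cmd &&& mask) != 0 do
      {byte, rest}
    else
      {0, <<byte, rest::binary>>}
    end
  end

  # Hypothetical cheaper shape: test the bit first and hand back `data`
  # untouched when the optional byte is absent.
  def read_if_bit(data, cmd, mask) when (cmd &&& mask) == 0, do: {0, data}
  def read_if_bit(<<byte, rest::binary>>, _cmd, _mask), do: {byte, rest}
end
```

Both variants return the same values; the difference is only in whether the "bit not set" path touches the binary at all. Even the cheaper shape still costs one call per optional byte, which is the overhead the fix below removes.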

Fix

apply_instructions/3’s copy clause now calls decode_copy_args(cmd, data), which consumes exactly the right number of bytes in one binary pattern match. The 128 specialized clauses (16 offset bitmaps × 8 size bitmaps) are generated at compile time via macros, so the BEAM compiler picks the matching clause in a single dispatch. No more read_if_bit, no more 7-step splits, no more no-op binary reconstruction.
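A rough sketch of how such compile-time clause generation can look, assuming the standard Git delta copy encoding (bits 0x01..0x08 of the command byte flag offset bytes 0..3, bits 0x10..0x40 flag size bytes 0..2, all little-endian, and a decoded size of 0 means 0x10000). The module name, the return tuple {offset, size, rest}, and the normalize_size/1 helper are assumptions for illustration; the actual decode_copy_args/2 in this PR may be structured differently.

```elixir
defmodule CopyArgsSketch do
  import Bitwise

  # One clause per copy command byte (high bit set): 0x80..0xFF, 128 in total.
  for cmd <- 0x80..0xFF do
    # Variables for the offset/size bytes that this command byte says are present.
    offset_vars =
      for i <- 0..3, (cmd &&& (1 <<< i)) != 0, do: {i, Macro.var(:"o#{i}", __MODULE__)}

    size_vars =
      for i <- 0..2, (cmd &&& (0x10 <<< i)) != 0, do: {i, Macro.var(:"s#{i}", __MODULE__)}

    byte_vars = Enum.map(offset_vars ++ size_vars, fn {_i, var} -> var end)

    # A single binary pattern consuming exactly the bytes present for this cmd.
    pattern = quote do: <<unquote_splicing(byte_vars), rest::binary>>

    # Build the little-endian combination: b0 ||| (b1 <<< 8) ||| ...
    combine = fn vars ->
      Enum.reduce(vars, 0, fn {i, var}, acc ->
        quote do: unquote(acc) ||| (unquote(var) <<< unquote(8 * i))
      end)
    end

    body =
      quote do
        {unquote(combine.(offset_vars)), normalize_size(unquote(combine.(size_vars))), rest}
      end

    def decode_copy_args(unquote(cmd), unquote(pattern)), do: unquote(body)
  end

  # In the Git delta format a copy size of 0 stands for 0x10000.
  defp normalize_size(0), do: 0x10000
  defp normalize_size(size), do: size
end
```

For example, under these assumptions `CopyArgsSketch.decode_copy_args(0x91, <<0x2A, 0x05, "tail">>)` returns `{42, 5, "tail"}`: command byte 0x91 flags one offset byte and one size byte, and both are consumed in the same match that binds the remainder.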

Measured improvement

Same Pack.Reader.parse/2 on the same 96 MB ovs pack:

Before | After
>27 min, never completed | 18.4 s

In the post-fix profile, Pack.Delta dropped from ~37k samples (~74%) to ~270 samples (~12% of remaining). New top is decompress_data → probe_compressed_length (the binary-search bottleneck I’d theorized originally) — that’s the next sub-issue tracked under #153.

Test plan

  • 928 tests, 0 failures across the full ex_git_objectstore suite.
  • mix format --check-formatted clean.
  • mix dialyzer clean.
  • Live ovs push test against prod once this PR and the anvil mix.lock bump are deployed. Expectation: parse phase drops from ‘never finishes’ to ~20 s. Total push: ~2 min.

Memory note

Peak RSS during the 18 s parse was ~3.5 GB on this pack — GC churn plus the buffered resolved entries list. Streaming parse-and-store is still the next big sub-issue; this PR just makes the existing path actually finish in finite time.

Created May 06, 2026 at 15:17 UTC | Merged May 06, 2026 at 15:58 UTC by colechristensen cole.christensen@gmail.com