fix(diff): linear-space Myers (Myers 1986 §4b) + :atomics V table #23

Merged: colechristensen (cole.christensen@gmail.com) merged myers-linear-space-counters into main
No CI

Summary

The hand-rolled ExGitObjectstore.Diff.Myers had two structural problems that together OOM’d the BEAM on large diffs (the root cause of the Anvil chiron PR #68 crash, tracked at fangorn/anvil#81):

  1. O(D²) memory: find_d accumulated one V table per d iteration into a list and materialized it into a tuple for backtracking. After d steps V has 2d+1 entries, so the sum across the trace is (D+1)² entries. For D=10k that’s ~8 GB of map nodes for ONE file.
  2. Map for V: diagonal index k is a contiguous integer range [-d, d], a perfect fit for an array. Using Map cost ~80 B per entry and O(log n) per access. Profile showed Map.get/3 taking 38% of total CPU.
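
The ~8 GB figure in item 1 can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: TraceCost is a hypothetical module name, and the 80 B per entry is the estimate quoted in item 2, not a measured constant.

```elixir
defmodule TraceCost do
  # One V table is kept per d iteration, each with 2d+1 entries, for d = 0..max_d.
  def entries(max_d), do: Enum.sum(for d <- 0..max_d, do: 2 * d + 1)

  # bytes_per_entry defaults to the ~80 B Map estimate from this PR (an assumption).
  def bytes(max_d, bytes_per_entry \\ 80), do: entries(max_d) * bytes_per_entry
end

TraceCost.entries(10_000) # 100_020_001, which equals (10_000 + 1) squared
TraceCost.bytes(10_000)   # 8_001_600_080 bytes, roughly 8 GB
```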

This PR replaces both with the actual linear-space variant from Myers’ 1986 paper §4b, with V stored in :atomics.

Two commits

1. fix(diff): linear-space Myers (Myers 1986 §4b)

Divide-and-conquer at the middle snake: find the split point (x, y) where the optimal edit script crosses the middle of the edit graph, then recurse on a[0..x) vs b[0..y) and a[x..n) vs b[y..m). The snake’s equalities fall out of the recursion naturally because they appear in both halves.
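
The shape of that recursion can be sketched as follows. This is not the code from this PR: SplitRecursionSketch and its function names are hypothetical, the middle-snake search is abstracted into a find_split argument, and the sketch assumes (as diff_match_patch does) that every call strips the common prefix and suffix first, which is where the equal runs are emitted. The only split function supplied is the top-right-corner fallback, so the demo output is a valid but not minimal edit script.

```elixir
defmodule SplitRecursionSketch do
  # Returns a list of {:eq | :del | :ins, elements} runs.
  # find_split stands in for the middle-snake search: fn a, b -> {x, y} end.
  def diff(a, b, find_split) do
    {prefix, a1, b1} = strip_prefix(a, b, [])
    # A common suffix is a common prefix of the reversed lists.
    {suffix_rev, a2_rev, b2_rev} = strip_prefix(Enum.reverse(a1), Enum.reverse(b1), [])
    middle = diff_middle(Enum.reverse(a2_rev), Enum.reverse(b2_rev), find_split)
    run(:eq, prefix) ++ middle ++ run(:eq, Enum.reverse(suffix_rev))
  end

  # Peel off matching heads; the repeated h in the pattern forces equality.
  defp strip_prefix([h | ta], [h | tb], acc), do: strip_prefix(ta, tb, [h | acc])
  defp strip_prefix(a, b, acc), do: {Enum.reverse(acc), a, b}

  defp diff_middle([], [], _find_split), do: []
  defp diff_middle(a, [], _find_split), do: [{:del, a}]
  defp diff_middle([], b, _find_split), do: [{:ins, b}]

  defp diff_middle(a, b, find_split) do
    {x, y} = find_split.(a, b)
    {a_left, a_right} = Enum.split(a, x)
    {b_left, b_right} = Enum.split(b, y)
    # Recurse on a[0..x) vs b[0..y), then on a[x..n) vs b[y..m).
    diff(a_left, b_left, find_split) ++ diff(a_right, b_right, find_split)
  end

  defp run(_tag, []), do: []
  defp run(tag, elems), do: [{tag, elems}]

  # Stand-in split: top-right corner, i.e. delete all of a, then insert all of b.
  def corner_split(a, _b), do: {length(a), 0}
end

SplitRecursionSketch.diff(~w(a b c d), ~w(a x d), &SplitRecursionSketch.corner_split/2)
# => [eq: ["a"], del: ["b", "c"], ins: ["x"], eq: ["d"]]
```

Plugging a real middle-snake search in as find_split would leave the recursion unchanged and make the result minimal.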

Memory: O(N+M) total. Each bisect call holds two V tables only while it searches for the split point and releases them before recursing. Recursion depth is O(log(N+M)) on average.

Translation reference: Google diff_match_patch’s diff_bisect (a faithful port of Myers §4b), cross-checked against git xdiff’s xdl_split.

V stored as Map in this commit — correctness first.

2. perf(diff): swap Myers V table from Map to :atomics

:atomics is the right primitive for V: fixed-size array of signed 64-bit ints, mutable in place, lives off the BEAM term heap, no GC pressure. Sentinel -1 for “not yet reached” still works (signed default).
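
A minimal sketch of how such a table might be laid out, assuming diagonals k in -max_d..max_d are shifted onto 1-based :atomics indices. VTableSketch and its helpers are hypothetical names, not the module in this PR; only the :atomics calls are real OTP API. Cells start at 0, so the sketch writes the -1 sentinel explicitly.

```elixir
defmodule VTableSketch do
  # Allocates one cell per diagonal k in -max_d..max_d.
  def new(max_d) do
    size = 2 * max_d + 1
    ref = :atomics.new(size, signed: true)
    # :atomics cells are zero-initialized; fill with the "not yet reached" sentinel.
    Enum.each(1..size, fn ix -> :atomics.put(ref, ix, -1) end)
    {ref, max_d}
  end

  # :atomics indices are 1-based, so diagonal -max_d maps to index 1.
  def get({ref, max_d}, k), do: :atomics.get(ref, k + max_d + 1)
  def put({ref, max_d}, k, x), do: :atomics.put(ref, k + max_d + 1, x)
end

v = VTableSketch.new(3)
VTableSketch.get(v, -3)         # -1, diagonal not yet reached
:ok = VTableSketch.put(v, 1, 0) # mutates in place, no new table is returned
VTableSketch.get(v, 1)          # 0
```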

Mutability simplifies the recursion: forward_sweep and reverse_sweep no longer thread updated v1/v2 through their return — they mutate in place and return only the bounds.

Verified

  • All existing diff tests pass byte-identical (10/10 myers_test.exs, 23/23 diff/, 903/903 full suite).
  • New stress test: 10k-line × ~30%-diff input peaks at ~4.5 MB process heap (sampled every 5 ms). Old impl peaked in the GBs and OOM’d inside a 6 GB-capped container.
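
One way to take a peak-heap sample like the one above is to run the diff in its own process and poll it. The sketch below is an assumption about the approach, not the stress test from this PR: PeakHeapSketch is a hypothetical name, and a polling loop can miss short-lived spikes between samples.

```elixir
defmodule PeakHeapSketch do
  # Runs fun in a separate process and polls that process's heap every interval_ms.
  # Returns {result_of_fun, peak_heap_in_bytes}.
  def measure(fun, interval_ms \\ 5) do
    task = Task.async(fun)
    peak_words = poll(task.pid, interval_ms, 0)
    {Task.await(task, :infinity), peak_words * :erlang.system_info(:wordsize)}
  end

  defp poll(pid, interval_ms, peak) do
    case Process.info(pid, :total_heap_size) do
      # :total_heap_size is reported in machine words.
      {:total_heap_size, words} ->
        Process.sleep(interval_ms)
        poll(pid, interval_ms, max(peak, words))

      # Process.info/2 returns nil once the process has exited.
      nil ->
        peak
    end
  end
end

{_result, _peak_bytes} = PeakHeapSketch.measure(fn -> Enum.to_list(1..100_000) |> length() end)
```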

Bench

10k lines, ~33% changed, single Myers.diff_lines/2 call:

| | Old hand-rolled | Linear-space + Map V | Linear-space + :atomics V |
|---|---|---|---|
| Wall | OOM | 12.2 s | 4.7 s |
| Peak heap | unbounded GBs | 9.1 MB | 4.5 MB |
| GCs | thrashing | ~21k | ~1k |

Real chiron PR #68 (216 files, 45k diff lines) inside a 6 GB cgroup’d container:

| | Before this PR | After this PR |
|---|---|---|
| Phase A (full diff compute) | OOM at 4 min | 13.6 s, completes |
| BEAM peak allocator | 16.2 GB | 0.70 GB |
| GC count | thrashing | 18,361 |

Critical correctness rules from the paper

Earlier hand-roll attempts in this branch’s history got these wrong; capturing them here for future reference:

  • The parity of Δ = N − M decides WHICH sweep checks for overlap (forward when Δ is odd, reverse when Δ is even). Checking in both is wrong.
  • Reverse-frame ↔ forward-frame mapping: k_other = delta − k_self (minus, not plus).
  • When bisect runs out of d iterations without finding overlap (tiny inputs like n=m=1 with no match, or no commonality at all), fall back to splitting at the top-right corner so both halves are STRICTLY smaller — otherwise the recursion can re-call itself on the same range.
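
The three rules above can be written down as small pure functions. This is an illustrative sketch only: BisectRulesSketch and its function names are hypothetical, n and m are the lengths of a and b, and the corner is taken to be (x, y) = (n, 0) on the assumption that x runs along a and y along b.

```elixir
defmodule BisectRulesSketch do
  # Odd delta: only the forward sweep checks for overlap.
  # Even delta: only the reverse sweep does.
  def overlap_checker(n, m) do
    if rem(n - m, 2) != 0, do: :forward, else: :reverse
  end

  # The same diagonal as seen from the other sweep's frame: delta minus k, not plus.
  def other_frame(k_self, n, m), do: n - m - k_self

  # Fallback when no overlap is found: split at the top-right corner.
  # Halves are a[0..n) vs the empty list, and the empty list vs b[0..m),
  # both strictly smaller than the input whenever n >= 1 and m >= 1.
  def fallback_split(n, _m), do: {n, 0}
end

BisectRulesSketch.overlap_checker(7, 4) # :forward, since delta = 3 is odd
BisectRulesSketch.other_frame(1, 7, 4)  # 2
BisectRulesSketch.fallback_split(1, 1)  # {1, 0}
```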

Requirements

  • REQ-DIFF-001 (memory bounded) — covered by new stress test (annotated)
  • REQ-DIFF-002 (no Map.get in inner loop) — satisfied by :atomics V table
  • REQ-DIFF-003 (output unchanged) — covered by existing test corpus passing byte-identical

Test plan

  • mix test test/ex_git_objectstore/diff/ — 24/24 pass
  • mix test — full suite 903/903 pass
  • After merge: bump ex_git_objectstore in Anvil’s mix.lock, re-run profile against chiron PR #68 to confirm Phase A completes inside the 6 GB cap

Closes #59. Unblocks fangorn/anvil#81.

Created Apr 29, 2026 at 03:31 UTC | Merged Apr 29, 2026 at 05:15 UTC by colechristensen cole.christensen@gmail.com