ref:main

Myers diff uses O(D²) memory and Map-as-V — fix linear-space + counters #59

closed Opened by cole.christensen@gmail.com

Links

No links yet.

Summary

ExGitObjectstore.Diff.Myers (lib/ex_git_objectstore/diff/myers.ex) uses two structurally bad memory patterns that together OOM the BEAM on large diffs. Surfaced as the root cause of Anvil chiron PR #68 (~OOM tracked in fangorn/anvil#81).

Reproducer

Anvil.Git.diff for chiron PR #68, profiled inside a 6 GB-capped container by lib/mix/tasks/anvil.profile_diff_memory.ex:

  • BEAM allocators climb to 16 GB (cgroup capped at 6, sawtooth thrashing for ~4 minutes)
  • 75% of CPU in Myers.process_diag / Myers.snake / Map.get
  • 38% on Map.get/3 alone (V-table inner loop)
  • Phase A (Git.diff) never completes

Flame graph + memory TSV captured under priv/profiles/pr68/ in the anvil repo.

Root causes

A. Trace keeps every V table alive — O(D²) memory

myers.ex:67-78 accumulates one V map per d iteration into a list, then materializes it into a tuple for backtracking:

```elixir find_d(…, d + 1, v_new, [v_new | trace]) … Enum.reverse([v_new | trace]) ```

After step d, V has 2d+1 entries. Sum over the trace: (D+1)² total map entries kept alive simultaneously. At BEAM’s ~80 B/Map-node, a file with D=10k edits costs ~8 GB just for the V trace.

This is the classical Myers naive variant. Fix: linear-space refinement from Myers 1986 §4b — divide-and-conquer at the middle snake, recursing on the two halves. Memory becomes O(N+M) regardless of D.

B. V is a Map, not a contiguous array

myers.ex:51, 85-93 — V is %{1 => 0} with Map.put / Map.get. The diagonal index k is a contiguous integer range [-d, +d] — a perfect fit for a tuple/array, not a hash map.

  • ~80 B per Map entry vs ~16 B per :counters slot (~5× per-slot cost)
  • Map.get/put is O(log n); :counters.get/put is O(1)

Only safe to use :counters after Fix A — once only one V is alive at a time.

Acceptance criteria

  • Diff.diff_blobs/4 for synthetic 10k-line × 30%-divergent input completes under 200 MB process heap peak.
  • No Map.get in the diff inner loop (verified by flame graph).
  • All existing diff tests pass byte-for-byte (output must be unchanged — only memory profile changes).
  • Add a stress test that asserts the memory bound.

Plan

One PR, two commits:

  1. Replace ses / find_d / backtrack with linear-space middle-snake recursion.
  2. Replace Map V with :counters (depends on #1).

Closes Anvil chiron PR #68 OOM (tracked in fangorn/anvil#81).