fangorn/ex_git_objectstore
public
feat: commit-graph index with generation numbers + corrected commit date #26
Links
No links yet.
Why
Anvil’s PR list and PR detail pages walk the commit DAG via ExGitObjectstore.cat_object/2 one commit at a time to compute ahead/behind counts and commit lists (Walk.merge_base/3 and Anvil’s own collect_ancestors/walk_range). With no ancestry index, every walk is O(max_walk) random object reads — thousands of calls per page render. Under production traffic this saturates a 2-core box (observed 201% CPU, slow-request warnings, LiveView reconnect storm on anvil.fangorn.io).
The right fix is an in-memory commit-graph index so ahead/behind and commits_between run proportional to (ahead + behind), not to a hardcoded walk fence.
Design
One binary blob per repo at repos/<id>/graph/commit-graph.v1:
Header : magic "ECG1", u32 version, u32 commit_count
Fan-out : 256 × u32
OID table : sorted [20-byte sha] × N
Entries[N] :
tree_sha 20 bytes
parent_count u8
parents u32 × n (index into OID table; 0xFFFFFFFF = missing)
generation u32 (topological, max(parent.gen)+1, roots=1)
corrected_commit_date u64 (CCD — max(parent.CCD, commit_time); handles clock skew)
commit_time u64
~50 B/commit → 100k commits ≈ 5 MB. Deserialize wholesale into RAM; walks are in-memory.
Why both generation and CCD
- Generation (topological) — unambiguous ordering for the priority-queue walker that replaces the current date-sorted queue.
- Corrected commit date (CCD) — git v2’s “topologically consistent committer date” (CCD(c) = max(c.commit_time, max(CCD(p) for p in parents))). Lets us preserve chronological ordering for UI while still pruning correctly when a commit’s wall-clock time is earlier than a parent’s (clock-skewed pushes, rebases, etc.).
New module ExGitObjectstore.Graph
load(repo) :: {:ok, graph} | {:error, :missing}
build(repo) :: {:ok, graph} # full scan from refs
update(repo, graph, new_shas) :: {:ok, graph} # incremental
generation(graph, sha)
corrected_commit_date(graph, sha)
parents(graph, sha)
is_ancestor?(graph, anc, desc)
ahead_behind(graph, base, head)
commits_between(graph, base, head)
Plus top-level passthroughs on ExGitObjectstore (ahead_behind/3, commits_between/3, is_ancestor?/3) with auto-load and lazy build.
Walk algorithm
Priority queue by generation with CCD as tie-breaker for chronological ordering. Pop highest-gen, mark side, stop when a SHA is marked from both sides (= merge base). No max_walk fence needed — termination is natural.
Write-path integration
Any API that produces new commits calls Graph.update/3:
commit_tree/3,merge_branches/4,squash_merge/4,rebase_commits/4,cherry_pick/*- receive-pack protocol handler (batch update with new-commit set from a push)
Serialization via per-repo writer lock. Incremental cost: N cat_objects for the new commits + one blob write.
In-memory cache
Per-repo graph held in an ETS table or :persistent_term keyed by repo id. First call loads + deserializes (~few ms for 5 MB). Subsequent calls are constant-time lookups.
Rollout / backward compat
- If
Graph.load/1returns:missing, callers fall back to today’scat_objectwalker. Deployment order does not matter. - First call on a pre-existing repo lazily runs
Graph.build/1, persists, returns. One-time cost. - fsck: add
--rebuild-graphmode that recomputes from refs and compares to on-disk graph.
PR plan
- Binary format +
Graph.{load,build}+ round-trip serialization tests. No callers. Graph.{ahead_behind,commits_between,is_ancestor?}+ property tests asserting equivalence with today’sWalkmodule on fixture repos.Graph.update/3incremental + wire into all commit-producing APIs. Fuzz test: random push sequences,updateresult must equal freshbuild.- Public top-level
ExGitObjectstore.{ahead_behind,commits_between,is_ancestor?}with auto-load / lazy build. This is the API Anvil will switch to.
Non-goals
- Not matching git’s commit-graph v2 binary layout on disk (no interop requirement).
- No bloom-filter / changed-paths index (separate optimization, not path-history-related).
Acceptance
- All four PRs merged, tests green.
- Benchmark:
ahead_behindon anvil repo (base=main, head=recent feature branch) ≤ 5 ms after warm load, vs. current seconds-range cold. - Fuzz: incremental
updateproduces bit-identical blob to fullbuildover 1000 random push sequences.