fangorn/ex_git_objectstore
public
ahead_behind_many: batch API for shared-base graph walks #60
Links
No links yet.
Summary
Future investigation. Most callers of Graph.Fallback.ahead_behind/4 ask the same question for many heads against the same base (e.g. Anvil’s PR list page: 30 PRs all with base = main). The current single-pair API redundantly re-walks the base history for every call.
After the per-process pack cache + pack-first ordering landed (fangorn/ex_git_objectstore#24), per-call cost dropped from ~17 s to ~164 ms on the chiron PR-list workload. At that scale the structural redundancy becomes the next problem to address. Caching parsed commits would mask it; eliminating it is the right primary fix.
Why this is a separate, future issue
Per the performance handbook (anvil/docs/PERFORMANCE.md), once the per-unit primitive is fast, the next question to ask is “do we need to do this work at all?” — not “can we make the redundant work cheap?” (caching). The rule says to prefer eliminating redundancy (priority 3) over caching (priority 4).
Proposed API
@spec ahead_behind_many(Repo.t(), String.t(), [String.t()], opts()) ::
{:ok, %{String.t() => %{ahead: non_neg_integer(), behind: non_neg_integer()}}}
| {:error, term()}
def ahead_behind_many(repo, base_sha, head_shas, opts \\ [])
Walks base_sha’s ancestor set ONCE. For each head, walks with early termination when hitting a commit already in the base set (that whole subtree is contained — no need to descend). Returns ahead/behind counts for each head against the shared base.
Estimated impact (chiron PR list, 30 PRs all with base = main)
Currently after #24:
- 30 ×
ahead_behind(main, branch_X)= 30 base walks + 30 head walks - Total ~5 s wall (~164 ms each)
With ahead_behind_many(main, [branch_1, …, branch_30]):
- 1 base walk + 30 head walks WITH early termination on known-in-base
- Each head walk’s interesting frontier is small (the “behind” delta = 100-400 commits)
- Estimated total: hundreds of ms, possibly under 1 s
Roughly another 5-10× speedup on top of #24. More importantly, scales linearly in head count instead of N × base.
Anvil-side consumer change
AnvilWeb.PrLive.Index.compute_ahead_behind/2 would group PRs by base_branch (typically all share one base), resolve base + head SHAs, call Git.ahead_behind_many/3 once per group. Trivial change once the underlying API exists.
Why not cache parsed commits instead?
Considered. Process-dict cache for parsed %Commit{} structs would skip inflate+parse on the 30× repeated reads of main’s ancestors. ~3-5× speedup on the chiron workload. But:
- Caching is do the redundant work, but cheaply. Multi-head API is don’t do the redundant work.
- Cache only helps within one process. Multi-head API helps any caller — CLI, scheduled jobs, future API consumers, single-process workers.
- Multi-head also benefits from in-pack early termination (subtree skip) which a parsed-commit cache doesn’t.
Caching is still potentially worth doing as a follow-up if other workloads (single-call ahead_behind, repeated reads on overlapping graphs) benefit, but it’s not the primary fix.
Acceptance criteria
- New
ahead_behind_many/4inExGitObjectstore.Graph.Fallback(and the matching native-graph variant if/when commit-graph file support lands) - Documented behavior on edge cases: empty head list, head == base, head not reachable from base
- Stress test: 100 heads against a 3000-commit base completes in under 1 second of process CPU
- Anvil-side consumer change in a separate PR
Tracking
- Surfaced by: chiron PR-list slow-load investigation (anvil#81, anvil#103, ex_git_objectstore#23, ex_git_objectstore#24)
- Profile harness: anvil’s
mix anvil.profile_ahead_behind