fix(upload-pack): don't taint reachability with gitlink (mode 160000) SHAs #33
fix/sideband-multi-frame-regression
into main
Fixes a long-standing walker bug in the v2 upload-pack code, surfaced by a real Anvil prod fetch failure on fangorn/hephaestus.
Symptom
Cloning fangorn/hephaestus over SSH (full clone, no shallow) fails deterministically:
``` fatal: did not receive expected object 36e9a2230dcd1e6d11944e5e92518fd5b300c641 fatal: fetch-pack: invalid index-pack output ```
The same SHA, every clone. Shallow clones, smaller repos, and repos with synthetic content (200 MiB random blobs, 8000×25 KB files) all clone fine on the same daemon.
Root cause
Tree `d85a381c…` in hephaestus has a gitlink entry:
``` { mode: "160000", name: "ecad-drc", sha: 36e9a2230dcd1e6d11944e5e92518fd5b300c641 } ```
By coincidence (probably because the submodule URL loops back at this repo at some point in history), SHA `36e9a223` is also a real commit in hephaestus’s own history, reachable from every wanted branch tip.
The walker in `collect_single_tree_entry/6` had two function heads — one for `mode: "40000"` (subtree) and a catch-all for everything else. The catch-all treated any tree entry as a potential blob:
```elixir defp collect_single_tree_entry(repo, %{sha: sha}, acc, vis, _depth, opts) do cond do MapSet.member?(vis, sha) -> {acc, vis} opts[:skip_blobs] -> {acc, MapSet.put(vis, sha)} true -> read_blob_entry(repo, sha, acc, MapSet.put(vis, sha)) # <– puts sha in vis FIRST end end ```
That last branch adds `sha` to `vis` before calling `read_blob_entry`. `read_blob_entry` returns `{acc, vis}` unchanged when the read isn’t a `%Blob{}` — but `vis` is already mutated. So for the gitlink entry the walker emits nothing (correct — submodule commits live in another repo), but `36e9a223` is now in `vis`.
Later, when the depth-first walk follows the parent chain of branch tip `70366d6b…` and reaches `abc92a55` (whose parent is `36e9a223`), `collect_reachable/5` checks `MapSet.member?(visited, sha)`, sees the poisoned entry, returns `{[], visited}`, and never reads or emits the commit. Everything reachable through it goes with it.
Damage
On the actual hephaestus fetch we measured against donut (after deploying the matching anvil streaming change):
| True reachable | Walker output | Missing | |
|---|---|---|---|
| Commits | 861 | 740 | 121 |
| Trees | 4707 | 4145 | 562 |
| Blobs | 4342 | 3982 | 360 |
| Total | 9910 | 8867 | 1043 |
Client’s `git index-pack` builds an index off the 8867 it received, finds the wanted SHA `36e9a223` missing, errors out.
Fix
Give mode `160000` its own function head. Gitlinks point to commits in other repositories — the walker should never emit them and should never touch the visited set on their behalf.
```elixir defp collect_single_tree_entry(_repo, %{mode: "160000"}, acc, vis, _depth, _opts) do {acc, vis} end ```
Why narrow this fix to mode 160000
The same bug shape would also fire if any `mode: "100644"` entry referenced a non-blob SHA, but SHA-1 collisions between a real commit/tree and a tree-entry blob SHA are astronomically unlikely. The only realistic case is the deliberate "this SHA is a commit" semantics of mode 160000. Filtering more broadly (e.g. moving the `MapSet.put` inside the blob match in `read_blob_entry`) costs blob-read deduplication across trees — a real perf hit for repos that reference the same blob from many directories — for no practical correctness gain.
Regression test
`test/ex_git_objectstore/protocol/upload_pack_v2_test.exs` — "submodule (mode 160000) gitlink does NOT poison reachability of a same-SHA commit": builds a 3-commit chain where the middle commit’s tree has a gitlink whose SHA equals the root commit, asserts all three commits land in the pack.
Pre-fix this test fails: the root commit is pruned via exactly the mechanism above.
Verified against donut
After applying this patch in the local checkout used by donut’s `mix.exs` path-dep, the walker now produces 9910 objects (matching the true reachable count) and includes `36e9a223`. Client clone of hephaestus is expected to succeed once an anvil mix.lock bump pulls this commit in.
Test plan
- `mix test` — 947 tests, 0 failures (including the new regression test, which fails on `main` and passes on this branch)
- `mix format –check-formatted` clean
- `mix credo –strict` clean on touched files
- After merge + anvil mix.lock bump + donut deploy: full clone of fangorn/hephaestus succeeds end-to-end