Commit 6a8dd64 - fangorn/ex_git_objectstore



ref:6a8dd6440b1b6f37a15d907b2291303a0ccf0ec6

feat: blob_sizes/3 for batched size lookups (#22)

Closes #22

Adds a bulk variant of `blob_size/2` that runs the lookups in parallel with bounded concurrency. Designed for consumers (Anvil's directory listing renderer, primarily) that currently call `blob_size/2` inside an `Enum.map` — on an S3-backed store each call is a separate round-trip, so a 1000-file directory takes ~100s at 100 ms/call. `blob_sizes/3` issues all lookups via `Task.async_stream` with `max_concurrency: 16` (configurable), dropping the same workload to a few seconds.

Semantics:

* `blob_sizes(repo, [])` short-circuits to `{:ok, %{}}`
* Input shas are deduplicated before dispatch
* Returns `{:ok, %{sha => size}}` for every sha that resolved
* Shas that fail to resolve (missing, `:not_a_blob`, storage error, task timeout) are silently dropped from the result map
* `:max_concurrency` and `:timeout` opts allow per-call tuning

Note this does not yet reduce the per-blob read cost — each sha still reads full blob content via the existing `blob/2` path. The win here is purely parallelism. A follow-up could add Storage-level bulk size callbacks that use S3 HEAD + pack header parsing to skip content transfer entirely.

Tests cover: happy path, missing shas silently dropped, dedup, empty input, max_concurrency: 1 smoke test, and a non-blob (tree) sha being filtered out.

6 new test cases; full suite 566 tests, 0 failures; dialyzer clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
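The semantics listed in the commit message can be tried without a repository. The sketch below is not the library's code: `BatchSizeSketch`, its `sizes/3` function, and the in-memory `lookup` function are hypothetical stand-ins, with `lookup` playing the role of `blob_size/2`. It only illustrates deduplication, silent dropping of failed or timed-out lookups, and the `:max_concurrency` and `:timeout` options.

```elixir
defmodule BatchSizeSketch do
  # Hypothetical stand-in for a batched size lookup.
  # `lookup` must return {:ok, size} or {:error, reason} for a key.
  def sizes(keys, lookup, opts \\ []) do
    max_concurrency = Keyword.get(opts, :max_concurrency, 16)
    timeout = Keyword.get(opts, :timeout, 30_000)

    result =
      keys
      # Duplicate keys are looked up only once.
      |> Enum.uniq()
      |> Task.async_stream(
        fn key ->
          case lookup.(key) do
            {:ok, size} -> {key, size}
            {:error, _} -> :skip
          end
        end,
        max_concurrency: max_concurrency,
        timeout: timeout,
        # A lookup that exceeds the timeout is killed and reported as {:exit, :timeout}.
        on_timeout: :kill_task,
        ordered: false
      )
      |> Enum.reduce(%{}, fn
        {:ok, {key, size}}, acc -> Map.put(acc, key, size)
        {:ok, :skip}, acc -> acc
        {:exit, _reason}, acc -> acc
      end)

    {:ok, result}
  end
end

# In-memory "storage" used only for this illustration.
store = %{"a" => "x", "b" => "yy"}

lookup = fn
  "slow" ->
    # Simulates a backend read that never finishes within the timeout.
    Process.sleep(5_000)
    {:ok, 0}

  key ->
    case Map.fetch(store, key) do
      {:ok, content} -> {:ok, byte_size(content)}
      :error -> {:error, :not_found}
    end
end

{:ok, sizes} =
  BatchSizeSketch.sizes(["a", "b", "a", "missing", "slow"], lookup,
    max_concurrency: 4,
    timeout: 100
  )

IO.inspect(sizes)
# %{"a" => 1, "b" => 2}
```

The duplicate `"a"`, the unknown `"missing"`, and the timed-out `"slow"` key all vanish from the result without an error, so a caller that needs to tell "missing" from "failed" has to compare the returned keys against its input.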

SHA: 6a8dd6440b1b6f37a15d907b2291303a0ccf0ec6

Author: Cole Christensen <cole.christensen@macmillan.com>

Date: 2026-04-13 16:51

Parents: 66017f9

3 files changed +131 -0

	CHANGELOG.md	+4 −0
@@ -9,6 +9,10 @@
### Added

- `blob_sizes/3` — batched variant of `blob_size/2` with bounded-concurrency parallel reads, deduplication, and `{:ok, %{sha => size}}` return. Drops the 100s-of-sequential-round-trips cost of rendering large directory listings on S3-backed storage. See fangorn/ex_git_objectstore#22.
- Repository integrity verification (`Fsck.check/2`) with full and quick modes
- `list_objects/2` callback on Storage behaviour for enumerating loose objects
- Dialyzer enforced in CI pipeline
	lib/ex_git_objectstore.ex	+75 −0
@@ -219,6 +219,81 @@
  end

  @doc """
  Batch variant of `blob_size/2`.

  Looks up the sizes of many blobs in one call, using bounded concurrency to
  parallelize the backend reads. Returns a map of `sha => size` containing only
  the shas that resolved successfully. Input shas that can't be resolved
  (missing object, `:not_a_blob`, storage error) are silently omitted from the
  result map.

  Input shas are deduplicated before dispatch, so passing the same sha multiple
  times costs the same as passing it once.

  ## Options

    * `:max_concurrency` — maximum number of parallel backend reads. Defaults
      to 16. Match this to your storage backend's sweet spot — S3 likes higher
      concurrency, local filesystem benefits less.
    * `:timeout` — per-sha timeout in milliseconds. Defaults to 30_000.

  ## Why a dedicated bulk API

  Anvil's tree-rendering path (and similar UIs) repeatedly call `blob_size/2`
  once per entry in an `Enum.map`. On an S3-backed store, each call is a
  separate network round-trip — a 1000-file directory listing costs ~100s at
  100 ms/call. `blob_sizes/2` issues all lookups in parallel so the same
  workload drops to a few seconds.

  This version does not optimize the single-blob read cost — it still reads
  full blob content via `blob/2` for each sha. Backends that want to compute
  sizes without transferring blob content (e.g. parsing pack headers or using
  S3 HEAD requests once the format supports it) can override this function in
  a future revision.

  ## Examples

      iex> ExGitObjectstore.blob_sizes(repo, [sha1, sha2, missing_sha])
      {:ok, %{^sha1 => 42, ^sha2 => 100}}  # missing_sha silently dropped

      iex> ExGitObjectstore.blob_sizes(repo, [])
      {:ok, %{}}
  """
  @spec blob_sizes(Repo.t(), [sha()], keyword()) ::
          {:ok, %{optional(sha()) => non_neg_integer()}}
  def blob_sizes(repo, shas, opts \\ [])

  def blob_sizes(%Repo{}, [], _opts), do: {:ok, %{}}

  def blob_sizes(%Repo{} = repo, shas, opts) when is_list(shas) do
    max_concurrency = Keyword.get(opts, :max_concurrency, 16)
    timeout = Keyword.get(opts, :timeout, 30_000)

    result =
      shas
      |> Enum.uniq()
      |> Task.async_stream(
        fn sha ->
          case blob_size(repo, sha) do
            {:ok, size} -> {sha, size}
            {:error, _} -> :skip
          end
        end,
        max_concurrency: max_concurrency,
        timeout: timeout,
        on_timeout: :kill_task,
        ordered: false
      )
      |> Enum.reduce(%{}, fn
        {:ok, {sha, size}}, acc -> Map.put(acc, sha, size)
        {:ok, :skip}, acc -> acc
        {:exit, _reason}, acc -> acc
      end)

    {:ok, result}
  end

  @doc """
  Three-way merge of two commits.

  Finds the merge base (LCA), then merges the trees.
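The "~100s" and "a few seconds" figures quoted in the docstring follow from simple arithmetic. The sketch below is a back-of-the-envelope model, not a benchmark: the `LatencyEstimate` module is hypothetical, and it assumes every lookup takes exactly the same time and that lookups complete in full waves of `concurrency` tasks. Real backends vary per call, so treat the result as a rough lower bound on wall-clock time.

```elixir
defmodule LatencyEstimate do
  # Hypothetical helper: idealized wall-clock cost of n lookups.

  # One lookup after another, as with blob_size/2 inside Enum.map.
  def sequential_ms(n, per_call_ms), do: n * per_call_ms

  # Lookups run in waves of `concurrency`; the last wave may be partial.
  def batched_ms(n, per_call_ms, concurrency) do
    waves = div(n + concurrency - 1, concurrency)
    waves * per_call_ms
  end
end

IO.inspect(LatencyEstimate.sequential_ms(1000, 100))
# 100000 (100 s)

IO.inspect(LatencyEstimate.batched_ms(1000, 100, 16))
# 6300 (63 waves of 100 ms, about 6.3 s)
```

Under this model, raising `:max_concurrency` shrinks the number of waves but not the per-lookup cost, which matches the note that the change gains only from parallelism and still transfers full blob content.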
	test/ex_git_objectstore/walk_test.exs	+52 −0
@@ -351,4 +351,56 @@
      assert {:ok, 5} = ExGitObjectstore.blob_size(repo, sha)
    end
  end

  describe "blob_sizes/3" do
    test "returns map of sha => size for every resolvable blob", %{repo: repo} do
      {:ok, a} = Object.write(repo, Blob.from_content("a"))
      {:ok, b} = Object.write(repo, Blob.from_content("bb"))
      {:ok, c} = Object.write(repo, Blob.from_content("ccc"))

      assert {:ok, sizes} = ExGitObjectstore.blob_sizes(repo, [a, b, c])
      assert sizes == %{a => 1, b => 2, c => 3}
    end

    test "silently drops shas that don't resolve", %{repo: repo} do
      {:ok, sha} = Object.write(repo, Blob.from_content("real"))
      missing = String.duplicate("0", 40)

      assert {:ok, sizes} = ExGitObjectstore.blob_sizes(repo, [sha, missing])
      assert sizes == %{sha => 4}
    end

    test "deduplicates input shas", %{repo: repo} do
      {:ok, sha} = Object.write(repo, Blob.from_content("dupe"))

      assert {:ok, %{^sha => 4} = sizes} = ExGitObjectstore.blob_sizes(repo, [sha, sha, sha])
      assert map_size(sizes) == 1
    end

    test "empty input returns empty map without touching storage", %{repo: repo} do
      assert {:ok, %{}} = ExGitObjectstore.blob_sizes(repo, [])
    end

    test "max_concurrency option is respected (smoke test at 1)", %{repo: repo} do
      {:ok, a} = Object.write(repo, Blob.from_content("one"))
      {:ok, b} = Object.write(repo, Blob.from_content("two"))

      assert {:ok, sizes} = ExGitObjectstore.blob_sizes(repo, [a, b], max_concurrency: 1)
      assert sizes == %{a => 3, b => 3}
    end

    test "a non-blob sha (tree) is dropped from the result", %{repo: repo} do
      {:ok, blob_sha} = Object.write(repo, Blob.from_content("leaf"))
      tree = Tree.new([%{mode: "100644", name: "f", sha: blob_sha}])
      {:ok, tree_sha} = Object.write(repo, tree)

      assert {:ok, sizes} = ExGitObjectstore.blob_sizes(repo, [blob_sha, tree_sha])
      assert sizes == %{blob_sha => 4}
    end
  end
end