
feat: bulk blob_sizes/2 for batched tree-entry size lookups #22

closed Opened by cole.christensen@gmail.com

Links

Blocks
  • 🔒 private issue

Problem

Anvil’s tree-rendering path calls `ExGitObjectstore.blob_size/2` once per entry inside an `Enum.map` (see `anvil/lib/anvil/git/objectstore.ex:522-536`), so a 1000-file directory triggers 1000 individual calls. On the S3 storage backend each call is a separate network round-trip (100+ ms), and because the calls run one after another, a single directory listing can take 100 seconds or more.

Proposal

Add a batched API:

```elixir
@spec blob_sizes(repo :: Repo.t(), [sha :: binary]) ::
        {:ok, %{binary => non_neg_integer}} | {:error, term}
def blob_sizes(repo, shas)
```

On success, returns `{:ok, sizes}`, where `sizes` maps `sha => size` for every input sha found in the store. Unknown shas are omitted from the map rather than reported as errors.
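To make the contract concrete, here is a minimal sketch of the intended semantics. It assumes a plain map of `sha => blob contents` as a stand-in for the repo; `BlobSizesSketch` and the sample shas are hypothetical and not part of ExGitObjectstore.

```elixir
defmodule BlobSizesSketch do
  # Hypothetical stand-in: the "repo" here is just a map of sha => blob contents.
  # The real repo struct and storage layer are not modelled.
  @spec blob_sizes(%{binary => binary}, [binary]) :: {:ok, %{binary => non_neg_integer}}
  def blob_sizes(store, shas) when is_map(store) and is_list(shas) do
    sizes =
      shas
      # Duplicate input shas collapse to a single lookup.
      |> Enum.uniq()
      |> Enum.reduce(%{}, fn sha, acc ->
        case Map.fetch(store, sha) do
          {:ok, blob} -> Map.put(acc, sha, byte_size(blob))
          # Unknown shas are left out of the result instead of failing the call.
          :error -> acc
        end
      end)

    {:ok, sizes}
  end
end

store = %{"aaa" => "hello", "bbb" => ""}
{:ok, sizes} = BlobSizesSketch.blob_sizes(store, ["aaa", "bbb", "aaa", "missing"])

# The caller decides how to treat a sha that is absent from the result.
missing_size = Map.get(sizes, "missing", :unknown)
```

Omitting unknown shas means a caller rendering a tree has to pick its own default (here `:unknown`) for entries whose size could not be resolved.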

Implementation notes

  • Deduplicate input shas internally before touching storage.
  • Filesystem backend: loop on top of the existing single-object read path — cheap.
  • S3 backend: use a batched attributes call such as `S3.BatchGetObjectAttributes` if the client offers one; otherwise issue parallel `HeadObject` requests with bounded concurrency (e.g. `Task.async_stream` with `max_concurrency: 16`).
  • Memory backend: trivial `Map.take`.
  • Pack-backed objects: read sizes from the pack index without inflating the blob.
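The bounded-concurrency fallback for the S3 backend could look roughly like the sketch below. `ParallelSizesSketch` and `head_fun` are hypothetical: `head_fun` stands in for a single `HeadObject` request and is assumed to return `{:ok, size}`, `{:error, :not_found}`, or `{:error, reason}`. Treating `:not_found` as "omit from the map" is an assumption drawn from the proposal above.

```elixir
defmodule ParallelSizesSketch do
  # head_fun is a hypothetical stand-in for one HeadObject request:
  # it takes a sha and returns {:ok, size} or {:error, reason}.
  def bulk_sizes(shas, head_fun, max_concurrency \\ 16) do
    shas
    # Deduplicate before touching storage.
    |> Enum.uniq()
    |> Task.async_stream(fn sha -> {sha, head_fun.(sha)} end,
      max_concurrency: max_concurrency,
      ordered: false,
      timeout: 5_000
    )
    |> Enum.reduce_while({:ok, %{}}, fn
      {:ok, {sha, {:ok, size}}}, {:ok, acc} ->
        {:cont, {:ok, Map.put(acc, sha, size)}}

      # Assumption: a missing object is omitted rather than treated as a failure.
      {:ok, {_sha, {:error, :not_found}}}, acc ->
        {:cont, acc}

      # Any other error aborts the whole batch.
      {:ok, {_sha, {:error, reason}}}, _acc ->
        {:halt, {:error, reason}}
    end)
  end
end
```

With `ordered: false` results are collected as they complete, which is fine here because the output is a map. Note that with 16 requests in flight and roughly 100 ms per request, 1000 shas would still take several seconds, so the concurrency limit is a tuning knob rather than a fixed choice.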

Out of scope

  • Typed bulk reads (`read_commits/2`, `read_trees/2`) — separate issue.
  • Prefetch/warmup for future reads — separate issue.

Acceptance

  • `blob_sizes/2` exists, documented, typed.
  • Storage behaviour gains a `bulk_sizes/2` callback with a generic fallback so backends don’t have to implement it right away.
  • Benchmarked: resolving 1000 shas on S3 drops from ~100 s to under one second.
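One possible shape for the optional callback with a generic fallback is sketched below. All names here (`StorageSketch`, `MemoryBackendSketch`, the `size/2` callback) are hypothetical; the real storage behaviour's callbacks are not shown in this issue and may differ.

```elixir
defmodule StorageSketch do
  # Hypothetical behaviour; the real storage behaviour may define other callbacks.
  @callback size(store :: term, sha :: binary) :: {:ok, non_neg_integer} | {:error, term}
  @callback bulk_sizes(store :: term, shas :: [binary]) ::
              {:ok, %{binary => non_neg_integer}} | {:error, term}
  # Backends may skip bulk_sizes/2 and still satisfy the behaviour.
  @optional_callbacks bulk_sizes: 2

  # Use the backend's bulk_sizes/2 when it exists, else loop over size/2.
  def bulk_sizes(backend, store, shas) do
    shas = Enum.uniq(shas)

    if Code.ensure_loaded?(backend) and function_exported?(backend, :bulk_sizes, 2) do
      backend.bulk_sizes(store, shas)
    else
      generic_bulk_sizes(backend, store, shas)
    end
  end

  defp generic_bulk_sizes(backend, store, shas) do
    Enum.reduce_while(shas, {:ok, %{}}, fn sha, {:ok, acc} ->
      case backend.size(store, sha) do
        {:ok, size} -> {:cont, {:ok, Map.put(acc, sha, size)}}
        # Assumption: unknown shas are omitted, matching the proposal.
        {:error, :not_found} -> {:cont, {:ok, acc}}
        {:error, reason} -> {:halt, {:error, reason}}
      end
    end)
  end
end

defmodule MemoryBackendSketch do
  @behaviour StorageSketch

  # Only size/2 is implemented; bulk lookups go through the generic fallback.
  @impl true
  def size(store, sha) do
    case Map.fetch(store, sha) do
      {:ok, blob} -> {:ok, byte_size(blob)}
      :error -> {:error, :not_found}
    end
  end
end
```

A backend that later gains a native batch path only needs to define `bulk_sizes/2`; callers keep going through the same dispatch function.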