Parallelize S3 backend hot paths: list_refs GETs and put_pack uploads #25

Problem

The S3 backend has two sequential-I/O patterns that dominate latency for MinIO/S3-backed deployments:

1. `list_refs/3` fetches each ref value with a sequential GET

lib/ex_git_objectstore/storage/s3.ex:200-227

def list_refs(config, prefix, ref_prefix) do
  full_prefix = "#{prefix}/#{ref_prefix}"

  case s3_list(config, full_prefix) do
    {:ok, keys} ->
      refs =
        keys
        |> Enum.map(fn key ->
          fetch_ref_from_key(config, key, prefix)  # Sequential GET per ref
        end)
        |> Enum.reject(&is_nil/1)
        |> Enum.sort()

Each ref requires a separate GET because the ref value lives in the object body, not in the LIST response. At roughly 100 ms per GET, a repo with 50 branches and 200 tags issues 250 sequential GETs, or about 25 seconds of latency.

Every git clone, git fetch, and git push pays this cost because UploadPack.list_all_refs_with_head and ReceivePack.list_all_refs_with_head both call Ref.list(repo, "refs/heads/") and Ref.list(repo, "refs/tags/") during protocol advertisement.
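A minimal sketch of how the per-ref GETs could be fanned out with Task.async_stream. The module name `ParallelRefsSketch`, the `fetch_fun` argument (standing in for `fetch_ref_from_key/3` with config and prefix already applied), and the `:max_concurrency` and `:timeout` options are assumptions for illustration, not the repository's actual code.

```elixir
defmodule ParallelRefsSketch do
  # Assumed default, taken from the acceptance criteria in this issue.
  @default_max_concurrency 32

  # fetch_fun is a hypothetical stand-in for fetch_ref_from_key/3:
  # it takes a key and returns a ref term, or nil when the ref is unusable.
  def fetch_refs(keys, fetch_fun, opts \\ []) do
    max_concurrency = Keyword.get(opts, :max_concurrency, @default_max_concurrency)
    timeout = Keyword.get(opts, :timeout, 30_000)

    keys
    # At most max_concurrency fetches are in flight at any time.
    # Completion order does not matter because the result is sorted below.
    |> Task.async_stream(fetch_fun,
      max_concurrency: max_concurrency,
      timeout: timeout,
      ordered: false
    )
    # With the default on_timeout: :exit, a timed-out fetch exits the caller,
    # so only {:ok, result} tuples reach this point.
    |> Enum.flat_map(fn
      {:ok, nil} -> []
      {:ok, ref} -> [ref]
    end)
    |> Enum.sort()
  end
end
```

Under idealized assumptions (no pool contention, no throttling), 250 refs at 100 ms per GET with a concurrency of 32 need ceil(250 / 32) = 8 rounds, so roughly 0.8 seconds instead of about 25.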

2. `put_pack/5` uploads pack and idx sequentially

lib/ex_git_objectstore/storage/s3.ex:135-139

def put_pack(config, prefix, pack_sha, pack_data, idx_data) do
  with :ok <- s3_put(config, pack_key(prefix, pack_sha, "pack"), pack_data) do
    s3_put(config, pack_key(prefix, pack_sha, "idx"), idx_data)
  end
end

The two PUTs run back to back, so write latency is the sum of both uploads rather than the slower of the two. For big packs on a pushed commit, this can increase write latency by up to 2x unnecessarily.
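One way the two uploads could overlap, sketched with Task.async and Task.await_many (available since Elixir 1.11). The module name `ConcurrentPutPackSketch` and the `put_fun` argument (standing in for `s3_put/3` with the config already applied) are hypothetical; the real function would keep the `put_pack/5` signature.

```elixir
defmodule ConcurrentPutPackSketch do
  # put_fun is a hypothetical stand-in for s3_put/3:
  # it takes (key, data) and returns :ok or {:error, reason}.
  def put_pack(put_fun, pack_key, pack_data, idx_key, idx_data, timeout \\ 60_000) do
    tasks = [
      Task.async(fn -> put_fun.(pack_key, pack_data) end),
      Task.async(fn -> put_fun.(idx_key, idx_data) end)
    ]

    # Wait for both uploads, then return the first non-:ok result,
    # or :ok when both succeeded (same return shape as the sequential version).
    tasks
    |> Task.await_many(timeout)
    |> Enum.find(:ok, &(&1 != :ok))
  end
end
```

One behavioral difference worth noting: the sequential `with` never uploads the idx when the pack PUT fails, while a concurrent version can leave an idx object without its pack (or the reverse) after a partial failure. Whether that needs cleanup depends on how readers discover packs, which this issue does not specify.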

Impact

Pattern 1 (sequential ref GETs) is the dominant cost of every git protocol operation against S3/MinIO backends
Clone UX on ref-heavy repos feels broken (20+ seconds before any data transfers)
Pattern 2 (serialized pack/idx PUTs) adds anywhere from 100 ms to several seconds to every git push, depending on pack size

Acceptance Criteria

S3.list_refs/3 parallelizes the per-ref GETs with Task.async_stream (pattern already used in ExGitObjectstore.blob_sizes/3)
S3.put_pack/5 uploads pack and idx concurrently
Both use bounded concurrency (default 32, configurable)
Preserves existing return types and ordering (refs still sorted)
Filesystem and Memory backends unchanged (no parallelism needed)
Benchmark or test demonstrating the speedup on a repo with >50 refs
CHANGELOG entry under [Unreleased]
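For the benchmark criterion, a self-contained timing harness along these lines might be enough to show the effect without a live S3 endpoint. Everything here is illustrative: `RefFetchBench`, the simulated `fake_get/2`, and the chosen ref count and latency are assumptions, and a real test would exercise `S3.list_refs/3` against MinIO or a stubbed HTTP layer.

```elixir
defmodule RefFetchBench do
  # Simulated GET: sleeps to mimic one network round trip, then returns a ref.
  def fake_get(key, latency_ms) do
    Process.sleep(latency_ms)
    {key, "0000000000000000000000000000000000000000"}
  end

  # Mirrors the current behavior: one GET after another.
  def sequential(keys, latency_ms) do
    keys |> Enum.map(&fake_get(&1, latency_ms)) |> Enum.sort()
  end

  # Mirrors the proposed behavior: bounded concurrent GETs, sorted afterwards.
  def parallel(keys, latency_ms, max_concurrency) do
    keys
    |> Task.async_stream(&fake_get(&1, latency_ms),
      max_concurrency: max_concurrency,
      ordered: false
    )
    |> Enum.map(fn {:ok, ref} -> ref end)
    |> Enum.sort()
  end

  # Returns {sequential_ms, parallel_ms, results_identical?}.
  def run(ref_count \\ 64, latency_ms \\ 10, max_concurrency \\ 32) do
    keys = for i <- 1..ref_count, do: "refs/tags/v#{i}"

    {seq_us, seq} = :timer.tc(fn -> sequential(keys, latency_ms) end)
    {par_us, par} = :timer.tc(fn -> parallel(keys, latency_ms, max_concurrency) end)

    {div(seq_us, 1000), div(par_us, 1000), seq == par}
  end
end
```

Calling `RefFetchBench.run()` returns both timings in milliseconds plus a flag confirming that the parallel path yields the same sorted list, which covers the "preserves ordering" criterion as well.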

Notes

Reference implementation: commit 6a8dd64 (blob_sizes/3) uses the same pattern successfully
max_concurrency default of 32 is chosen to fit within the hackney connection pool (hackney's default pool allows 50 connections); document bumping the hackney pool if deployers need higher concurrency
The underlying architectural question — should refs be stored in a packed-refs-style single blob? — is deferred to a separate issue. This change works within the current one-key-per-ref layout.
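For the hackney pool note above, deployers who raise max_concurrency could start a larger dedicated pool from their own supervision tree. This is a configuration sketch only: `MyApp.Application`, the pool name `:s3_pool`, and the limits are hypothetical values, and how this backend's HTTP client selects a pool (hackney's `pool` request option) is not stated in this issue.

```elixir
# Hypothetical host-application supervisor; names and limits are illustrative.
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Dedicated hackney pool sized above the backend's max_concurrency.
      :hackney_pool.child_spec(:s3_pool, timeout: 15_000, max_connections: 64)
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```

Requests only use this pool if the HTTP client passes `pool: :s3_pool` to hackney; otherwise they keep using the default pool.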