ref:main

feat: has_objects/2 existence check without object read #23

open Opened by cole.christensen@gmail.com

Links

Blocks
  • 🔒 private issue

Problem

Anvil’s clone/fork import path at `anvil/lib/anvil/code_review.ex:1204-1330` calls `put_object` for each loose object with no way to check if it already exists. The S3 backend must `GET` then `PUT` for every object regardless of whether it’s already in the destination. Clones and forks re-transfer every object on every import.

There is currently no cheap existence probe in the library.

Proposal

Add:

```elixir
@spec has_object?(repo :: Repo.t(), sha :: binary) :: boolean
@spec has_objects(repo :: Repo.t(), [binary]) :: %{binary => boolean}
```

Neither variant deserializes the object. On S3, `has_object?` uses `HEAD` instead of `GET`. On the filesystem, it uses `File.exists?/1` on the loose-object path plus a pack-index lookup.
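For illustration, the filesystem probe described above might look like the sketch below. It assumes Git's loose-object layout (`objects/<first two hex chars>/<rest>`) and stands in a plain `MapSet` of SHAs for the pack index; `FsProbeSketch`, `loose_path/2`, and the `packed` argument are hypothetical names, not part of the library.

```elixir
defmodule FsProbeSketch do
  # Assumption: loose objects live at <root>/objects/<2 hex chars>/<remaining hex chars>,
  # as in Git's on-disk format. The real backend's layout may differ.
  def loose_path(root, <<prefix::binary-size(2), rest::binary>>) do
    Path.join([root, "objects", prefix, rest])
  end

  # `packed` is a stand-in for the pack index: a MapSet of hex SHAs known to be packed.
  # Neither branch opens or decodes the object itself.
  def has_object?(root, hex_sha, packed \\ MapSet.new()) do
    File.exists?(loose_path(root, hex_sha)) or MapSet.member?(packed, hex_sha)
  end
end
```

The loose-object check costs one `stat`-style call, and the pack lookup never touches object data.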

Implementation notes

  • Add `has_object?/2` to the Storage behaviour with a default implementation that calls `read_object` and pattern-matches on `{:ok, _}` — slow but always correct.
  • Override in FS, S3, Memory backends with cheap versions.
  • Consider having `has_objects/2` return a `MapSet` of the SHAs that exist, rather than the `%{binary => boolean}` map in the spec above, for ergonomic `MapSet.member?/2` checks downstream.
  • Pack index already tracks every object SHA; use that for O(1) checks on packed objects.
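One possible shape for the first two notes is sketched below. It uses a `__using__` macro with `defoverridable` to supply the slow default, which is only one of several ways to give a behaviour a default implementation. `StorageSketch`, `SlowBackend`, and `MemoryBackend` are hypothetical stand-ins, and the backend state here is a bare map of SHA to bytes rather than the library's real state.

```elixir
defmodule StorageSketch do
  # Hypothetical stand-in for the Storage behaviour.
  @callback read_object(state :: term(), sha :: binary()) ::
              {:ok, binary()} | {:error, term()}
  @callback has_object?(state :: term(), sha :: binary()) :: boolean()

  defmacro __using__(_opts) do
    quote do
      @behaviour StorageSketch

      # Default: perform a full read and discard the payload. Slow but always correct.
      @impl StorageSketch
      def has_object?(state, sha) do
        match?({:ok, _}, read_object(state, sha))
      end

      # Backends may replace the default with a cheaper probe.
      defoverridable has_object?: 2
    end
  end
end

defmodule SlowBackend do
  # Relies on the default has_object?/2.
  use StorageSketch

  @impl StorageSketch
  def read_object(objects, sha) do
    case Map.fetch(objects, sha) do
      {:ok, data} -> {:ok, data}
      :error -> {:error, :not_found}
    end
  end
end

defmodule MemoryBackend do
  # Overrides the default with a key lookup that never returns the payload.
  use StorageSketch

  @impl StorageSketch
  def read_object(objects, sha) do
    case Map.fetch(objects, sha) do
      {:ok, data} -> {:ok, data}
      :error -> {:error, :not_found}
    end
  end

  @impl StorageSketch
  def has_object?(objects, sha), do: Map.has_key?(objects, sha)
end
```

With this shape, a backend that has not been optimized yet still answers correctly, and each override can land independently.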

Acceptance

  • Fork/clone incremental imports can skip objects the destination already has.
  • S3 backend never issues a `GET` for existence checks.
  • Benchmarked: re-importing 10k objects into a destination that already has them drops from ~15 min (S3) to a few seconds.
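As a rough illustration of the first criterion, an import path could ask for existence in bulk and write only what is missing. The sketch below is not the Anvil code: `ImportSketch` and `missing/2` are hypothetical, and the repo is modeled as a map holding a `MapSet` of SHAs instead of a real `Repo.t()`.

```elixir
defmodule ImportSketch do
  # Hypothetical repo shape: %{objects: MapSet.t(binary)}. A real backend would
  # answer has_object?/2 with HEAD (S3) or a path/index check (filesystem).
  def has_object?(%{objects: objects}, sha), do: MapSet.member?(objects, sha)

  # Bulk variant with the map return type from the proposal.
  def has_objects(repo, shas) do
    Map.new(shas, fn sha -> {sha, has_object?(repo, sha)} end)
  end

  # SHAs the destination lacks, in their original order. Only these need put_object.
  def missing(repo, shas) do
    present = has_objects(repo, shas)
    Enum.reject(shas, fn sha -> Map.fetch!(present, sha) end)
  end
end
```

On a re-import where the destination already holds everything, `missing/2` returns an empty list and no object bodies are transferred.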