* Poor performance using reftable with many refs
@ 2025-02-13 0:01 brian m. carlson
2025-02-13 6:11 ` Patrick Steinhardt
0 siblings, 1 reply; 12+ messages in thread
From: brian m. carlson @ 2025-02-13 0:01 UTC (permalink / raw)
To: git; +Cc: Patrick Steinhardt, Karthik Nayak
[-- Attachment #1: Type: text/plain, Size: 4272 bytes --]
I've been doing some testing of reftable at $DAYJOB and I found an
interesting performance problem when creating many refs.
I've attached a script which takes 50,000 recent commits, creates a file
suitable for `git update-ref --stdin`, deletes all of the existing refs,
and then uses that file to create the 50,000 refs. The ref creation is
timed using Linux's `/usr/bin/time`. (This is partially extracted from
a larger script, so please accept my apologies for some untidiness.)
With the files backend, the output is as below:
1.75user 3.73system 0:05.50elapsed 99%CPU (0avgtext+0avgdata 166344maxresident)k
0inputs+442880outputs (0major+27962minor)pagefaults 0swaps
With the reftable backend, this is the output:
56.91user 0.52system 0:57.44elapsed 99%CPU (0avgtext+0avgdata 160416maxresident)k
0inputs+6784outputs (0major+26151minor)pagefaults 0swaps
Both measurements are on next, so they should have all relevant patches
that I'm aware of. I've tested on two X1 Carbons, one with Ubuntu 24.04
and one with Debian unstable, so they're both reasonably beefy machines
with modern Linux OSes.
It takes about 30 times as long to perform using the reftable backend,
which is concerning. While this is a synthetic measurement, I had
intended to use it to determine the performance characteristics of
the reference update portion when pushing a large repository for the
first time.
I admit I haven't done any other particular investigation as to what's
going wrong here, but the behaviour is very noticeable so it may be easy
to profile.
One note: the script will be faster and more useful to reproduce if you
change the repository source to a local clone of the Linux repo.
----
#!/bin/sh -e
# This script will reproduce a performance problem with many (50000) refs using
# the current version of reftable in next. The directory `testcase` under the
# current directory will be removed and replaced.
#
# Once the script is finished, you can do `cat testcase/tracedir/*/re-creation`
# to see the performance characteristics of the files backend (first) and the
# reftable backend (second).
# Your friendly neighbourhood Linux repository. This may be any valid remote,
# including an HTTPS or SSH URL.
REPO_SRC="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git"
TAG="v6.13"
export GIT_CONFIG_GLOBAL=/dev/null
# Run a command under /usr/bin/time, logging results to "$TRACEDIR/$output".
timed_op () {
local output="$1"
local message="$2"
shift
shift
printf '%s...' "$message" >&2
/usr/bin/time -o "$TRACEDIR/$output" "$@"
printf 'done.\n' >&2
}
# Delete every ref in the repository in a single update-ref transaction.
delete_refs () {
local output="$1"
(
echo "start"
git for-each-ref --format="%(refname)" | sed -e 's/^/delete /'
echo "prepare"
echo "commit"
) | timed_op "$output" "Deleting all references" git update-ref --stdin
}
fake_refs=true
while [ $# -gt 0 ]
do
case "$1" in
--real-refs)
fake_refs=false
shift
;;
*)
break
;;
esac
done
rm -fr testcase
mkdir testcase
cd testcase
git clone --bare "$REPO_SRC" repo
mkdir tracedir
for backend in files reftable
do
git clone --bare repo $backend
(
set -e
cd $backend
TRACEDIR="$(realpath "../tracedir/$backend")"
mkdir -p "$TRACEDIR"
if [ "$backend" = reftable ]
then
timed_op "migration" "Migrating to reftable" git refs migrate --ref-format=reftable
fi
if $fake_refs
then
git rev-list "$TAG" | head -n 50000 | perl -pe '
$count++;
$choice = $count % 4;
if ($choice == 0) {
s!^(.*)$!create refs/heads/ref-$count $1!;
} elsif ($choice == 1) {
s!^(.*)$!create refs/remotes/bk2204/ref-$count $1!;
} elsif ($choice == 2) {
s!^(.*)$!create refs/remotes/origin/ref-$count $1!;
} elsif ($choice == 3) {
s!^(.*)$!create refs/tags/tag-$count $1!;
}
' | sort >all-refs
else
git for-each-ref --format="%(refname) %(objectname)" | sed -e 's/^/create /' >all-refs
fi
delete_refs "deletion"
timed_op "re-creation" "Re-creating refs" git update-ref --stdin <all-refs
)
done
----
--
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]
^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Poor performance using reftable with many refs
2025-02-13 0:01 Poor performance using reftable with many refs brian m. carlson
@ 2025-02-13 6:11 ` Patrick Steinhardt
2025-02-13 7:13 ` Patrick Steinhardt
0 siblings, 1 reply; 12+ messages in thread
From: Patrick Steinhardt @ 2025-02-13 6:11 UTC (permalink / raw)
To: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 12:01:59AM +0000, brian m. carlson wrote:
> It takes about 30 times as long to perform using the reftable backend,
> which is concerning. While this is a synthetic measurement, I had
> intended to use it to determine the performance characteristics of
> the reference update portion when pushing a large repository for the
> first time.

Interesting, that's an edge case I didn't yet see. I know about some
cases where reftables are ~10% slower, but 30x slower is in a different
ballpark.

> I admit I haven't done any other particular investigation as to what's
> going wrong here, but the behaviour is very noticeable so it may be easy
> to profile.

No worries, I'm already happy to have gotten the report in the first
place :) I'll investigate, but probably won't get to it before next
week.

Thanks!

Patrick
* Re: Poor performance using reftable with many refs
2025-02-13 6:11 ` Patrick Steinhardt
@ 2025-02-13 7:13 ` Patrick Steinhardt
2025-02-13 8:22 ` Jeff King
2025-02-13 9:27 ` Christian Couder
0 siblings, 2 replies; 12+ messages in thread
From: Patrick Steinhardt @ 2025-02-13 7:13 UTC (permalink / raw)
To: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 07:11:33AM +0100, Patrick Steinhardt wrote:
> On Thu, Feb 13, 2025 at 12:01:59AM +0000, brian m. carlson wrote:
> > It takes about 30 times as long to perform using the reftable backend,
> > which is concerning. While this is a synthetic measurement, I had
> > intended to use it to determine the performance characteristics of
> > the reference update portion when pushing a large repository for the
> > first time.
>
> Interesting, that's an edge case I didn't yet see. I know about some
> cases where reftables are ~10% slower, but 30x slower is in a different
> ballpark.

Well, I just cannot resist and had to investigate immediately. I can
indeed reproduce the issue with "linux.git" rather easily:

    Benchmark 1: update-ref (refformat = files)
      Time (mean ± σ):     223.0 ms ±   2.4 ms    [User: 76.1 ms, System: 145.6 ms]
      Range (min … max):   220.2 ms … 226.6 ms    5 runs

    Benchmark 2: update-ref (refformat = reftable)
      Time (mean ± σ):     17.472 s ±  0.153 s    [User: 17.402 s, System: 0.049 s]
      Range (min … max):   17.390 s … 17.745 s    5 runs

    Summary
      update-ref (refformat = files) ran
       78.35 ± 1.09 times faster than update-ref (refformat = reftable)

Oops, that indeed doesn't look great. Turns out that you're hitting
quite a funny edge case: the issue comes from you first deleting all
preexisting refs in the target repository before recreating them. With
"packed-refs", this leads to a repository that has neither a
"packed-refs" file nor any loose ref, except for HEAD of course. But
with "reftables" it doesn't:

    total 368
    -rw-r--r-- 1 pks users 332102 Feb 13 08:00 0x000000000001-0x000000000001-d8285c7c.ref
    -rw-r--r-- 1 pks users  32941 Feb 13 08:00 0x000000000002-0x000000000003-f1a8ebf9.ref
    -rw-r--r-- 1 pks users     86 Feb 13 08:00 tables.list

We end up with two tables: the first one has been created when cloning
the repository and contains all references. The second one has been
created when deleting all references, so it only contains ref deletions.
Because deletions don't have to carry an object ID, the resulting table
is also much smaller. This has the effect that auto-compaction does not
kick in, because we see that the geometric sequence is still intact. And
consequently, all the checks that we perform when recreating the refs
are way more expensive now because we have to search for conflicts.

A "fix" would be to pack references after you have deleted refs. This
leads to a significant speedup and makes the reftable backend outperform
the files backend:

    Benchmark 1: update-ref (refformat = files)
      Time (mean ± σ):     223.1 ms ±   0.6 ms    [User: 71.2 ms, System: 150.8 ms]
      Range (min … max):   222.5 ms … 224.2 ms    5 runs

    Benchmark 2: update-ref (refformat = reftable)
      Time (mean ± σ):     129.1 ms ±   2.1 ms    [User: 84.4 ms, System: 44.1 ms]
      Range (min … max):   127.2 ms … 132.7 ms    5 runs

    Summary
      update-ref (refformat = reftable) ran
        1.73 ± 0.03 times faster than update-ref (refformat = files)

I don't really think there's a general fix for this issue though, as the
issue comes from the design of how tombstone references work.

That being said, I found an optimization in how we parse ref updates in
git-update-ref(1): when we see an exact object ID, we can skip the call
to `repo_get_oid()`. This function is quite expensive because it doesn't
only parse object IDs, but revisions in general. This didn't have much
of an impact on "packed-refs", because there are no references in the
first place. But it did have a significant impact on the "reftable"
backend, where we do have deleted references.

So optimizing this edge case leads to a significant speedup for the
"reftable" backend, but also to a small speedup for the "files" backend:

    Benchmark 1: update-ref (refformat = files, revision = master)
      Time (mean ± σ):     224.7 ms ±   2.9 ms    [User: 79.4 ms, System: 143.5 ms]
      Range (min … max):   220.2 ms … 228.0 ms    5 runs

    Benchmark 2: update-ref (refformat = reftable, revision = master)
      Time (mean ± σ):     16.304 s ±  0.429 s    [User: 16.216 s, System: 0.051 s]
      Range (min … max):   15.865 s … 16.862 s    5 runs

    Benchmark 3: update-ref (refformat = files, revision = pks-reftable-optimization)
      Time (mean ± σ):     181.3 ms ±   2.4 ms    [User: 69.5 ms, System: 110.7 ms]
      Range (min … max):   178.5 ms … 185.0 ms    5 runs

    Benchmark 4: update-ref (refformat = reftable, revision = pks-reftable-optimization)
      Time (mean ± σ):      5.939 s ±  0.060 s    [User: 5.895 s, System: 0.028 s]
      Range (min … max):    5.875 s …  6.026 s    5 runs

    Summary
      update-ref (refformat = files, revision = pks-reftable-optimization) ran
        1.24 ± 0.02 times faster than update-ref (refformat = files, revision = master)
       32.76 ± 0.55 times faster than update-ref (refformat = reftable, revision = pks-reftable-optimization)
       89.93 ± 2.65 times faster than update-ref (refformat = reftable, revision = master)

I will continue digging a bit to see whether there is more to find in
this context and will send a patch to the mailing list later today or
tomorrow.

Patrick
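The auto-compaction behaviour Patrick describes can be sketched as a toy model. This is illustrative Python, not the actual reftable stack logic in git: the function name `segments_to_compact`, the `factor` of 2, and the exact rule are assumptions made for the sketch.

```python
def segments_to_compact(sizes, factor=2):
    """Given table sizes ordered oldest to newest, return how many of
    the newest tables should be compacted together (0 = none).

    Walking from the newest table backwards, the stack counts as
    "geometric" as long as each older table is at least `factor` times
    the combined size of everything newer.  A tiny deletion-only table
    on top of one huge table keeps the sequence intact, so nothing is
    compacted.
    """
    total = 0
    count = 0
    for size in reversed(sizes):
        if total and size >= factor * total:
            break  # sequence intact from here on down
        total += size
        count += 1
    return count if count > 1 else 0

# The two tables from the listing above: the big clone table plus the
# small deletion-only table.  The sequence looks intact, so no
# compaction happens and every lookup keeps paying for both tables.
assert segments_to_compact([332102, 32941]) == 0
# Two similarly-sized tables, by contrast, would be merged.
assert segments_to_compact([1000, 900]) == 2
```

This is also why `git pack-refs` works as the manual "fix" above: it collapses the stack into a single table, making the tombstones disappear.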
* Re: Poor performance using reftable with many refs
2025-02-13 7:13 ` Patrick Steinhardt
@ 2025-02-13 8:22 ` Jeff King
2025-02-13 11:20 ` Patrick Steinhardt
2025-02-13 22:17 ` brian m. carlson
1 sibling, 2 replies; 12+ messages in thread
From: Jeff King @ 2025-02-13 8:22 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 08:13:38AM +0100, Patrick Steinhardt wrote:

> Turns out that you're hitting quite a funny edge case: the issue comes
> from you first deleting all preexisting refs in the target repository
> before recreating them. With "packed-refs", this leads to a repository
> that has neither a "packed-refs" file nor any loose ref, except for HEAD
> of course. But with "reftables" it doesn't:
>
>     total 368
>     -rw-r--r-- 1 pks users 332102 Feb 13 08:00 0x000000000001-0x000000000001-d8285c7c.ref
>     -rw-r--r-- 1 pks users  32941 Feb 13 08:00 0x000000000002-0x000000000003-f1a8ebf9.ref
>     -rw-r--r-- 1 pks users     86 Feb 13 08:00 tables.list
>
> We end up with two tables: the first one has been created when cloning
> the repository and contains all references. The second one has been
> created when deleting all references, so it only contains ref deletions.
> Because deletions don't have to carry an object ID, the resulting table
> is also much smaller. This has the effect that auto-compaction does not
> kick in, because we see that the geometric sequence is still intact. And
> consequently, all the checks that we perform when recreating the refs
> are way more expensive now because we have to search for conflicts.

That makes sense. But that's only 360k of reftables. Why does it take so
long to process?

It's been a while since I looked at reftables, but I'd think for a
normal lookup we should be able to binary-search or similar in each
table, find the relevant entries, and be done.

But I guess we can't easily do that for finding write conflicts, because
writing "foo/bar" means we need to care about "foo" and "foo/bar/baz" as
well. Finding "foo" is easy; we just break apart the proposed refname
and look for each leading path. But "foo/bar/baz" is harder; we have to
merge the tables to get an authoritative sorted list of the current
state, and then look for the entries adjacent to where our proposed ref
goes. Looking at a profiling output, we're spending a lot of time in
merged_iter_next_void() and its children, which supports that.

But the run-time scales linearly with the number of refs we're adding,
which implies that we're doing this iteration independently once per ref
we're adding. Instead, if we're given a list of N refs to write, we
should be able to sort that list and walk it in parallel with the
merged_iter result, making a single pass over the lists.

So I guess we'd need a version of refs_verify_refname_available() that
takes a list of refs rather than a single name. And then you could do
that single-pass walk. And as a bonus, you'd be able to de-dup the
leading prefixes you're looking for (e.g., most of your refs will start
with "refs/heads/", so we only need to check it once).

> That being said, I found an optimization in how we parse ref updates in
> git-update-ref(1): when we see an exact object ID, we can skip the call
> to `repo_get_oid()`. This function is quite expensive because it doesn't
> only parse object IDs, but revisions in general. This didn't have much
> of an impact on "packed-refs", because there are no references in the
> first place. But it did have a significant impact on the "reftable"
> backend, where we do have deleted references.

Yes, we do similarly spend a lot of time there. But the problem isn't
quite that repo_get_oid() also parses revisions. When we see a full
object id we return it quickly. But you can fall afoul of 798c35fcd8
(get_sha1: warn about full or short object names that look like refs,
2013-05-29), which does a full dwim_ref() lookup for each one!

Try:

    git -c core.warnAmbiguousRefs=false update-ref --stdin

to disable that. Internally there's a warn_on_object_refname_ambiguity
flag that some code (like cat-file) sets when it knows it may be asked
to resolve a lot of entries that are likely to be oids.

I kind of wonder if we should ditch that warning. But if we wanted to
keep it, maybe adding a flag to get_oid_with_context() would be a less
hacky way of disabling that warning on a per-call basis.

-Peff
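The single-pass walk Jeff sketches can be modelled like this. It is illustrative Python under the assumption that both the existing refs and the proposed names arrive sorted; `find_df_conflicts` is an invented name, not a git API.

```python
def find_df_conflicts(existing, proposed):
    """Report directory/file conflicts between two sorted ref lists.

    A proposed name conflicts with an existing ref that is a leading
    path of it ("foo" blocks "foo/bar") or that lives underneath it
    ("foo/bar/baz" blocks "foo/bar").  Because both lists are sorted,
    one merged pass suffices: leading-path candidates sort before the
    proposed name, nested entries sort directly after it.
    """
    conflicts = []
    prefixes = []  # existing refs that may still prefix later names
    i, n = 0, len(existing)
    for name in proposed:
        # drop prefixes that no longer apply to this (later) name
        while prefixes and not name.startswith(prefixes[-1] + "/"):
            prefixes.pop()
        # consume existing entries sorting at or before `name`,
        # remembering the ones that are leading paths of it
        while i < n and existing[i] <= name:
            if name.startswith(existing[i] + "/"):
                prefixes.append(existing[i])
            i += 1
        conflicts.extend((name, p) for p in prefixes)
        # entries living under "name/" sort immediately after `name`
        j = i
        while j < n and existing[j].startswith(name + "/"):
            conflicts.append((name, existing[j]))
            j += 1
    return conflicts

assert find_df_conflicts(["foo", "foo/bar/baz"], ["foo/bar"]) == [
    ("foo/bar", "foo"),
    ("foo/bar", "foo/bar/baz"),
]
assert find_df_conflicts(["refs/heads/main"], ["refs/heads/feature"]) == []
```

An exact match (updating an existing ref) is deliberately not a conflict here; only the directory-versus-file cases are reported.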
* Re: Poor performance using reftable with many refs
2025-02-13 8:22 ` Jeff King
@ 2025-02-13 11:20 ` Patrick Steinhardt
2025-02-13 14:31 ` Patrick Steinhardt
2025-02-13 19:42 ` Jeff King
1 sibling, 2 replies; 12+ messages in thread
From: Patrick Steinhardt @ 2025-02-13 11:20 UTC (permalink / raw)
To: Jeff King; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 03:22:21AM -0500, Jeff King wrote:
> On Thu, Feb 13, 2025 at 08:13:38AM +0100, Patrick Steinhardt wrote:
>
> > Turns out that you're hitting quite a funny edge case: the issue comes
> > from you first deleting all preexisting refs in the target repository
> > before recreating them. With "packed-refs", this leads to a repository
> > that has neither a "packed-refs" file nor any loose ref, except for HEAD
> > of course. But with "reftables" it doesn't:
> >
> >     total 368
> >     -rw-r--r-- 1 pks users 332102 Feb 13 08:00 0x000000000001-0x000000000001-d8285c7c.ref
> >     -rw-r--r-- 1 pks users  32941 Feb 13 08:00 0x000000000002-0x000000000003-f1a8ebf9.ref
> >     -rw-r--r-- 1 pks users     86 Feb 13 08:00 tables.list
> >
> > We end up with two tables: the first one has been created when cloning
> > the repository and contains all references. The second one has been
> > created when deleting all references, so it only contains ref deletions.
> > Because deletions don't have to carry an object ID, the resulting table
> > is also much smaller. This has the effect that auto-compaction does not
> > kick in, because we see that the geometric sequence is still intact. And
> > consequently, all the checks that we perform when recreating the refs
> > are way more expensive now because we have to search for conflicts.
>
> That makes sense. But that's only 360k of reftables. Why does it take so
> long to process?
>
> It's been a while since I looked at reftables, but I'd think for a
> normal lookup we should be able to binary-search or similar in each
> table, find the relevant entries, and be done.
>
> But I guess we can't easily do that for finding write conflicts, because
> writing "foo/bar" means we need to care about "foo" and "foo/bar/baz" as
> well. Finding "foo" is easy; we just break apart the proposed refname
> and look for each leading path. But "foo/bar/baz" is harder; we have to
> merge the tables to get an authoritative sorted list of the current
> state, and then look for the entries adjacent to where our proposed ref
> goes. Looking at a profiling output, we're spending a lot of time in
> merged_iter_next_void() and its children, which supports that.
>
> But the run-time scales linearly with the number of refs we're adding,
> which implies that we're doing this iteration independently once per ref
> we're adding. Instead, if we're given a list of N refs to write, we
> should be able to sort that list and walk it in parallel with the
> merged_iter result, making a single pass over the lists.
>
> So I guess we'd need a version of refs_verify_refname_available() that
> takes a list of refs rather than a single name. And then you could do
> that single-pass walk. And as a bonus, you'd be able to de-dup the
> leading prefixes you're looking for (e.g., most of your refs will start
> with "refs/heads/", so we only need to check it once).

Yes, `refs_verify_refname_available()` is exactly the problem. We spend
~80% of the time in that function after the optimization I have pointed
out for `repo_get_oid()`. I assume that we'd see similar performance for
the "files" backend if we had 360k refs and inserted 360k other refs,
but haven't verified this claim.

I've already noticed multiple times that this function is a significant
slowdown in lots of cases. I've already started looking at it a bit, and
will think about ways to fix this.

> > That being said, I found an optimization in how we parse ref updates in
> > git-update-ref(1): when we see an exact object ID, we can skip the call
> > to `repo_get_oid()`. This function is quite expensive because it doesn't
> > only parse object IDs, but revisions in general. This didn't have much
> > of an impact on "packed-refs", because there are no references in the
> > first place. But it did have a significant impact on the "reftable"
> > backend, where we do have deleted references.
>
> Yes, we do similarly spend a lot of time there. But the problem isn't
> quite that repo_get_oid() also parses revisions. When we see a full
> object id we return it quickly. But you can fall afoul of 798c35fcd8
> (get_sha1: warn about full or short object names that look like refs,
> 2013-05-29), which does a full dwim_ref() lookup for each one!
>
> Try:
>
>     git -c core.warnAmbiguousRefs=false update-ref --stdin
>
> to disable that. Internally there's a warn_on_object_refname_ambiguity
> flag that some code (like cat-file) sets when it knows it may be asked
> to resolve a lot of entries that are likely to be oids.
>
> I kind of wonder if we should ditch that warning. But if we wanted to
> keep it, maybe adding a flag to get_oid_with_context() would be a less
> hacky way of disabling that warning on a per-call basis.

Ah, that makes a lot of sense. I don't really think the warning makes
sense for git-update-ref(1), so introducing a flag is probably the best
way to go about this.

Thanks for your input!

Patrick
* Re: Poor performance using reftable with many refs
2025-02-13 11:20 ` Patrick Steinhardt
@ 2025-02-13 14:31 ` Patrick Steinhardt
2025-02-13 19:53 ` Jeff King
0 siblings, 1 reply; 12+ messages in thread
From: Patrick Steinhardt @ 2025-02-13 14:31 UTC (permalink / raw)
To: Jeff King; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 12:20:03PM +0100, Patrick Steinhardt wrote:
> On Thu, Feb 13, 2025 at 03:22:21AM -0500, Jeff King wrote:
> > But I guess we can't easily do that for finding write conflicts, because
> > writing "foo/bar" means we need to care about "foo" and "foo/bar/baz" as
> > well. Finding "foo" is easy; we just break apart the proposed refname
> > and look for each leading path. But "foo/bar/baz" is harder; we have to
> > merge the tables to get an authoritative sorted list of the current
> > state, and then look for the entries adjacent to where our proposed ref
> > goes. Looking at a profiling output, we're spending a lot of time in
> > merged_iter_next_void() and its children, which supports that.
> >
> > But the run-time scales linearly with the number of refs we're adding,
> > which implies that we're doing this iteration independently once per ref
> > we're adding. Instead, if we're given a list of N refs to write, we
> > should be able to sort that list and walk it in parallel with the
> > merged_iter result, making a single pass over the lists.
> >
> > So I guess we'd need a version of refs_verify_refname_available() that
> > takes a list of refs rather than a single name. And then you could do
> > that single-pass walk. And as a bonus, you'd be able to de-dup the
> > leading prefixes you're looking for (e.g., most of your refs will start
> > with "refs/heads/", so we only need to check it once).
>
> Yes, `refs_verify_refname_available()` is exactly the problem. We spend
> ~80% of the time in that function after the optimization I have pointed
> out for `repo_get_oid()`. I assume that we'd see similar performance for
> the "files" backend if we had 360k refs and inserted 360k other refs,
> but haven't verified this claim.
>
> I've already noticed multiple times that this function is a significant
> slowdown in lots of cases. I've already started looking at it a bit, and
> will think about ways to fix this.

This turns out to be harder to implement than anticipated. While
iterating through refnames and the ref iterator in tandem sounds nice,
it would cause problems when the ref iterator yields millions of refs.
You don't want to fully iterate through all of them.

What we really want to do is to reuse the iterator and have it skip
entries: we'd basically create the iterator and re-seek it for every
refname we want to check for collisions. This allows us to reuse some of
the data structures, and in the best case the underlying backend knows
to optimize.

This is something that I have spent a significant time on to implement
in the last couple months for the reftable backend. But while we have
reseekable iterators there, which already got us a bit of a performance
improvement due to more reuse of data structures, we don't yet know to
specifically optimize for some specific seeks. We could e.g. easily skip
re-reading a block if we already know that it will contain the reference
we're searching for.

But the bigger problem is that the generic reftable iterator does not
yet have this capability. We first have to revamp their lifetime in the
same way that I revamped the lifetime of reftable iterators. A generic
iterator that has hit its end is currently getting free'd immediately,
which I always found to be a bit awkward. But because of this it's
impossible to reseek them, as they have lost all their state.

Oh, well, I guess that's what I'll be working on now then. But please,
somebody stop me if I'm not seeing the forest for the trees.

Patrick
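The reseekable-iterator idea can be modelled roughly like this. It is illustrative Python only: `ReseekableIter`, the block layout, and the read counter are invented for the sketch and do not mirror git's actual reftable API.

```python
import bisect

class ReseekableIter:
    """Toy model of a reseekable sorted-table iterator.

    The table is split into fixed-size "blocks"; seek() still does the
    binary search, but only counts a block read when the target falls
    outside the block already in hand.  Reseeking forward through
    sorted queries therefore loads each block at most once, which is
    the optimization discussed above: skip re-reading a block when we
    already know it contains the reference we're searching for.
    """

    def __init__(self, entries, block_size=16):
        self.entries = sorted(entries)
        self.block_size = block_size
        self.cur_block = -1   # no block loaded yet
        self.block_reads = 0  # stand-in for "blocks read and parsed"

    def seek(self, name):
        """Position at the first entry >= name and return it (or None)."""
        pos = bisect.bisect_left(self.entries, name)
        block = pos // self.block_size
        if block != self.cur_block:
            self.cur_block = block
            self.block_reads += 1
        self.pos = pos
        return self.entries[pos] if pos < len(self.entries) else None

refs = ["refs/heads/ref-%04d" % i for i in range(100)]
it = ReseekableIter(refs, block_size=16)
for name in refs[:32]:    # sorted queries: seeks only ever move forward
    it.seek(name)
assert it.block_reads == 2  # 32 lookups, but only two blocks loaded
```

Without reuse, the same 32 lookups would pay for 32 block reads; the win grows with the number of refs checked per block.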
* Re: Poor performance using reftable with many refs
2025-02-13 14:31 ` Patrick Steinhardt
@ 2025-02-13 19:53 ` Jeff King
0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2025-02-13 19:53 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 03:31:47PM +0100, Patrick Steinhardt wrote:

> This turns out to be harder to implement than anticipated. While
> iterating through refnames and the ref iterator in tandem sounds nice,
> it would cause problems when the ref iterator yields millions of refs.
> You don't want to fully iterate through all of them.
>
> What we really want to do is to reuse the iterator and have it skip
> entries: we'd basically create the iterator and re-seek it for every
> refname we want to check for collisions. This allows us to reuse some of
> the data structures, and in the best case the underlying backend knows
> to optimize.

Not having looked at the reftable code much, I'll take your word that it
isn't easy. ;)

I suspect the files backend isn't very good at this either. It knows to
take a starting point in ref_iterator_begin(), and should binary-search
the packed-refs file to the right point. But if you want to then ask it
"OK, now skip ahead until you see entry 'foo'", it would just walk
linearly to get there. I.e., there is no real seeking. So I guess the
best it could do is restart that same binary-search, and the order in
which we feed the refnames is not even really important.

And I don't know how seeking works with reftables, since "skip ahead"
requires merging all of the tables.

> But the bigger problem is that the generic reftable iterator does not
> yet have this capability. We first have to revamp their lifetime in the
> same way that I revamped the lifetime of reftable iterators. A generic
> iterator that has hit its end is currently getting free'd immediately,
> which I always found to be a bit awkward. But because of this it's
> impossible to reseek them, as they have lost all their state.

I do think if you feed the refnames in sorted order your seeks would
always be forward. So if you hit the end of iteration, there should be
nothing left to check.

> Oh, well, I guess that's what I'll be working on now then. But please,
> somebody stop me if I'm not seeing the forest for the trees.

No, I think it's actually a hard problem. :)

-Peff
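The forward-only-seek observation can be checked with a tiny model (illustrative Python; `seek_positions` is a made-up helper, with `bisect` standing in for the per-table binary search):

```python
import bisect

def seek_positions(entries, queries):
    """Index at which each query would land in a sorted table."""
    return [bisect.bisect_left(entries, q) for q in queries]

entries = ["refs/heads/a", "refs/heads/m", "refs/tags/v1"]
queries = sorted(["refs/tags/v0", "refs/heads/b", "refs/heads/z"])
positions = seek_positions(entries, queries)
# Sorted queries land at non-decreasing positions, so an iterator fed
# sorted refnames never has to seek backwards, and hitting the end of
# iteration means nothing is left to check.
assert positions == sorted(positions)
```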
* Re: Poor performance using reftable with many refs 2025-02-13 11:20 ` Patrick Steinhardt 2025-02-13 14:31 ` Patrick Steinhardt @ 2025-02-13 19:42 ` Jeff King 2025-02-13 20:12 ` Junio C Hamano 1 sibling, 1 reply; 12+ messages in thread From: Jeff King @ 2025-02-13 19:42 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: brian m. carlson, git, Karthik Nayak On Thu, Feb 13, 2025 at 12:20:03PM +0100, Patrick Steinhardt wrote: > Yes, `refs_verify_refname_available()` is exactly the problem. We spend > ~80% of the time in that function after the optimization I have pointed > out for `repo_get_oid()`. I assume that we'd see similar performance for > the "files" backend if we had 360k refs and inserted 360k other refs, > but haven't verified this claim. Yeah. I didn't test it, but reading your analysis, I similarly thought that having non-deleted refs might cause the same problem. > > Try: > > > > git -c core.warnAmbiguousRefs=false update-ref --stdin > > > > to disable that. Internally there's a warn_on_object_refname_ambiguity > > flag that some code (like cat-file) sets when it knows it may be asked > > to do a resolve a lot of entries that are likely to be oids. > > > > I kind of wonder if we should ditch that warning. But if we wanted to > > keep it, maybe adding a flag to get_oid_with_context() would be a less > > hacky way of disabling that warning on a per-call basis. > > Ah, that makes a lot of sense. I don't really think the warning makes > sense for git-update-ref(1), so introducing a flag is probably the best > way to go about this. 
So I think the core of my suggestion is to just turn off the existing flag, like: diff --git a/builtin/update-ref.c b/builtin/update-ref.c index 4d35bdc4b4..00e340a53b 100644 --- a/builtin/update-ref.c +++ b/builtin/update-ref.c @@ -618,6 +618,8 @@ static void update_refs_stdin(void) if (!transaction) die("%s", err.buf); + warn_on_object_refname_ambiguity = 0; /* XXX should also restore after */ + /* Read each line dispatch its command */ while (!strbuf_getwholeline(&input, stdin, line_termination)) { const struct parse_cmd *cmd = NULL; which would solve the immediate problem, and is what we do elsewhere for stdin modes. But I'm not sure if my suggestion to do a per-call flag makes sense. A patch for that is below, but it gets weird for commands like rev-list. There we turn off the warning for --stdin, but leave it on for arguments on the command line. But they both end up in handle_revision(), so if we convert that global into a per-call flag, they'll both be affected. You could mitigate that by passing the flags down to handle_revision(), but it gets even weirder with stuff like collect_changed_submodules(), which puts a potentially large number of entries into a fake argv to call setup_revisions(). So the patch below, which illustrates the idea, probably weakens the warning to the point of being useless, as commands that use revision.c would never warn at all. So probably we're better off with the hacky global, or just getting rid of the warning entirely. I have trouble imagining it helping much beyond something like: git branch $some_oid which creates refs/heads/$some_oid. But we probably would be better off warning about that at the time of writing rather than checking for ambiguity on each read. 
-Peff

---
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b13561cf73..ad7a0a6407 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -526,7 +526,7 @@ static void batch_one_object(const char *obj_name,
 {
 	struct object_context ctx = {0};
 	int flags =
-		GET_OID_HASH_ANY |
+		GET_OID_HASH_ANY | GET_OID_SKIP_OBJREF |
 		(opt->follow_symlinks ? GET_OID_FOLLOW_SYMLINKS : 0);
 	enum get_oid_result result;
@@ -781,7 +781,6 @@ static int batch_objects(struct batch_options *opt)
 	struct strbuf input = STRBUF_INIT;
 	struct strbuf output = STRBUF_INIT;
 	struct expand_data data;
-	int save_warning;
 	int retval = 0;
 
 	/*
@@ -850,16 +849,6 @@ static int batch_objects(struct batch_options *opt)
 		return 0;
 	}
 
-	/*
-	 * We are going to call get_sha1 on a potentially very large number of
-	 * objects. In most large cases, these will be actual object sha1s. The
-	 * cost to double-check that each one is not also a ref (just so we can
-	 * warn) ends up dwarfing the actual cost of the object lookups
-	 * themselves. We can work around it by just turning off the warning.
-	 */
-	save_warning = warn_on_object_refname_ambiguity;
-	warn_on_object_refname_ambiguity = 0;
-
 	if (opt->batch_mode == BATCH_MODE_QUEUE_AND_DISPATCH) {
 		batch_objects_command(opt, &output, &data);
 		goto cleanup;
@@ -886,7 +875,6 @@ static int batch_objects(struct batch_options *opt)
 cleanup:
 	strbuf_release(&input);
 	strbuf_release(&output);
-	warn_on_object_refname_ambiguity = save_warning;
 	return retval;
 }
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b16126..128cc39860 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4190,17 +4190,13 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 	};
 	char line[1000];
 	int flags = 0;
-	int save_warning;
 
 	save_commit_buffer = 0;
 	setup_revisions(ac, av, revs, &s_r_opt);
 
 	/* make sure shallows are read */
 	is_repository_shallow(the_repository);
 
-	save_warning = warn_on_object_refname_ambiguity;
-	warn_on_object_refname_ambiguity = 0;
-
 	while (fgets(line, sizeof(line), stdin) != NULL) {
 		int len = strlen(line);
 		if (len && line[len - 1] == '\n')
@@ -4227,8 +4223,6 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 			die(_("bad revision '%s'"), line);
 	}
 
-	warn_on_object_refname_ambiguity = save_warning;
-
 	if (use_bitmap_index && !get_object_list_from_bitmap(revs))
 		return;
 
diff --git a/environment.c b/environment.c
index e5b361bb5d..b619d0c2b8 100644
--- a/environment.c
+++ b/environment.c
@@ -36,7 +36,6 @@ int minimum_abbrev = 4, default_abbrev = -1;
 int ignore_case;
 int assume_unchanged;
 int is_bare_repository_cfg = -1; /* unspecified */
-int warn_on_object_refname_ambiguity = 1;
 int repository_format_precious_objects;
 char *git_commit_encoding;
 char *git_log_output_encoding;
diff --git a/environment.h b/environment.h
index 2f43340f0b..87f71807b7 100644
--- a/environment.h
+++ b/environment.h
@@ -156,7 +156,6 @@ extern int has_symlinks;
 extern int minimum_abbrev, default_abbrev;
 extern int ignore_case;
 extern int assume_unchanged;
-extern int warn_on_object_refname_ambiguity;
 extern char *apply_default_whitespace;
 extern char *apply_default_ignorewhitespace;
 extern char *git_attributes_file;
diff --git a/hash.h b/hash.h
index 4367acfec5..e9d881c06e 100644
--- a/hash.h
+++ b/hash.h
@@ -204,6 +204,7 @@ struct object_id {
 #define GET_OID_ONLY_TO_DIE    04000
 #define GET_OID_REQUIRE_PATH  010000
 #define GET_OID_HASH_ANY      020000
+#define GET_OID_SKIP_OBJREF   040000
 
 #define GET_OID_DISAMBIGUATORS \
 	(GET_OID_COMMIT | GET_OID_COMMITTISH | \
diff --git a/object-name.c b/object-name.c
index 945d5bdef2..516f214562 100644
--- a/object-name.c
+++ b/object-name.c
@@ -961,7 +961,8 @@ static int get_oid_basic(struct repository *r, const char *str, int len,
 	int fatal = !(flags & GET_OID_QUIETLY);
 
 	if (len == r->hash_algo->hexsz && !get_oid_hex(str, oid)) {
-		if (repo_settings_get_warn_ambiguous_refs(r) && warn_on_object_refname_ambiguity) {
+		if (repo_settings_get_warn_ambiguous_refs(r) &&
+		    !(flags & GET_OID_SKIP_OBJREF)) {
 			refs_found = repo_dwim_ref(r, str, len, &tmp_oid, &real_ref, 0);
 			if (refs_found > 0) {
 				warning(warn_msg, len, str);
diff --git a/revision.c b/revision.c
index 474fa1e767..5c8e80a0c2 100644
--- a/revision.c
+++ b/revision.c
@@ -2184,7 +2184,7 @@ static int handle_revision_arg_1(const char *arg_, struct rev_info *revs, int fl
 	int local_flags;
 	const char *arg = arg_;
 	int cant_be_filename = revarg_opt & REVARG_CANNOT_BE_FILENAME;
-	unsigned get_sha1_flags = GET_OID_RECORD_PATH;
+	unsigned get_sha1_flags = GET_OID_RECORD_PATH | GET_OID_SKIP_OBJREF;
 	int ret;
 
 	flags = flags & UNINTERESTING ? flags | BOTTOM : flags & ~BOTTOM;
@@ -2914,12 +2914,8 @@ static void read_revisions_from_stdin(struct rev_info *revs,
 	struct strbuf sb;
 	int seen_dashdash = 0;
 	int seen_end_of_options = 0;
-	int save_warning;
 	int flags = 0;
 
-	save_warning = warn_on_object_refname_ambiguity;
-	warn_on_object_refname_ambiguity = 0;
-
 	strbuf_init(&sb, 1000);
 	while (strbuf_getline(&sb, stdin) != EOF) {
 		if (!sb.len)
@@ -2952,7 +2948,6 @@ static void read_revisions_from_stdin(struct rev_info *revs,
 		read_pathspec_from_stdin(&sb, prune);
 
 	strbuf_release(&sb);
-	warn_on_object_refname_ambiguity = save_warning;
 }
 
 static void NORETURN diagnose_missing_default(const char *def)
diff --git a/submodule.c b/submodule.c
index b361076c5b..e63019a6f5 100644
--- a/submodule.c
+++ b/submodule.c
@@ -916,16 +916,12 @@ static void collect_changed_submodules(struct repository *r,
 {
 	struct rev_info rev;
 	const struct commit *commit;
-	int save_warning;
 	struct setup_revision_opt s_r_opt = {
 		.assume_dashdash = 1,
 	};
 
-	save_warning = warn_on_object_refname_ambiguity;
-	warn_on_object_refname_ambiguity = 0;
 	repo_init_revisions(r, &rev, NULL);
 	setup_revisions(argv->nr, argv->v, &rev, &s_r_opt);
-	warn_on_object_refname_ambiguity = save_warning;
 
 	if (prepare_revision_walk(&rev))
 		die(_("revision walk setup failed"));

^ permalink raw reply related	[flat|nested] 12+ messages in thread
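[Editor's note: a toy Python model, not Git's code, of the cost pattern the removed cat-file comment describes. All names are made up for illustration; the real `repo_dwim_ref()` consults git's ref-resolution rules against the ref store, which is what gets expensive with a reftable stack that is never compacted. The point is only that the ambiguity warning turns one lookup per input line into several.]

```python
# Roughly models git's ref_rev_parse_rules: the candidate refnames
# that a dwim lookup tries for each input string.
DWIM_PATTERNS = [
    "%s", "refs/%s", "refs/tags/%s", "refs/heads/%s",
    "refs/remotes/%s", "refs/remotes/%s/HEAD",
]

def resolve(oid, objects, refs, warn_ambiguous=True):
    """Resolve a full hex object ID; count the lookups performed."""
    assert oid in objects
    lookups = 1  # the object lookup itself
    if warn_ambiguous:
        # Models the extra repo_dwim_ref() call made only to decide
        # whether to print the "ambiguous" warning.
        for pat in DWIM_PATTERNS:
            lookups += 1
            if pat % oid in refs:
                break  # a ref shadows this oid; git would warn here
    return lookups

# 50k full object IDs fed to a batch command, as in the report:
objects = {f"{i:040x}" for i in range(50_000)}
refs = {f"refs/heads/branch-{i}" for i in range(50_000)}

with_warning = sum(resolve(o, objects, refs, True) for o in objects)
without = sum(resolve(o, objects, refs, False) for o in objects)
print(with_warning, without)  # several lookups per line vs. exactly one
```

In this toy every lookup costs the same, so the overhead is only a constant factor; in the real case each dwim probe hits the ref backend, which is why the warning check can dwarf the object lookups themselves.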
* Re: Poor performance using reftable with many refs
  2025-02-13 19:42             ` Jeff King
@ 2025-02-13 20:12               ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2025-02-13 20:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Steinhardt, brian m. carlson, git, Karthik Nayak

Jeff King <peff@peff.net> writes:

> ... I have trouble
> imagining it helping much beyond something like:
>
>   git branch $some_oid
>
> which creates refs/heads/$some_oid. But we probably would be better off
> warning about that at the time of writing rather than checking for
> ambiguity on each read.

Certainly.  Thanks for a well thought-out analysis.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Poor performance using reftable with many refs
  2025-02-13  8:22         ` Jeff King
  2025-02-13 11:20           ` Patrick Steinhardt
@ 2025-02-13 22:17         ` brian m. carlson
  1 sibling, 0 replies; 12+ messages in thread
From: brian m. carlson @ 2025-02-13 22:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Steinhardt, git, Karthik Nayak

[-- Attachment #1: Type: text/plain, Size: 1579 bytes --]

On 2025-02-13 at 08:22:21, Jeff King wrote:
> Yes, we do similarly spend a lot of time there. But the problem isn't
> quite that repo_get_oid() also parses revisions. When we see a full
> object id we return it quickly. But you can fall afoul of 798c35fcd8
> (get_sha1: warn about full or short object names that look like refs,
> 2013-05-29), which does a full dwim_ref() lookup for each one!
>
> Try:
>
>   git -c core.warnAmbiguousRefs=false update-ref --stdin
>
> to disable that. Internally there's a warn_on_object_refname_ambiguity
> flag that some code (like cat-file) sets when it knows it may be asked
> to do a resolve a lot of entries that are likely to be oids.

Yeah, I think both of these would be great improvements.  The kind of
case I'm interested in is reference updates in the context of pushes or
various API calls, where we're always going to have a full object ID
and there is never a human connected to the output of Git.  I expect
that is the case for a lot of users.

I also think it's unlikely that even the general scripter who's working
with `git update-ref` is going to want that warning.  Most users of
that command are GUIs, APIs, server backends, or the like, and even if
I used the command in my Git aliases or some custom commands, I still
wouldn't really care for that kind of extra output.

I _do_ always want screaming performance in `git update-ref` though,
since I frequently, even in my everyday scripting, work with large
numbers of refs.
-- 
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Poor performance using reftable with many refs
  2025-02-13  7:13       ` Patrick Steinhardt
  2025-02-13  8:22         ` Jeff King
@ 2025-02-13  9:27       ` Christian Couder
  2025-02-13 13:21         ` Patrick Steinhardt
  1 sibling, 1 reply; 12+ messages in thread
From: Christian Couder @ 2025-02-13  9:27 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 8:13 AM Patrick Steinhardt <ps@pks.im> wrote:

> We end up with two tables: the first one has been created when cloning
> the repository and contains all references. The second one has been
> created when deleting all references, so it only contains ref deletions.
> Because deletions don't have to carry an object ID, the resulting table
> is also much smaller. This has the effect that auto-compaction does not
> kick in, because we see that the geometric sequence is still intact.

Not that I think we should work on this right now, but theoretically,
could we "just" count the number of entries in each file and base the
geometric sequence on the number of entries in each file instead of
file size?

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Poor performance using reftable with many refs
  2025-02-13  9:27       ` Christian Couder
@ 2025-02-13 13:21         ` Patrick Steinhardt
  0 siblings, 0 replies; 12+ messages in thread
From: Patrick Steinhardt @ 2025-02-13 13:21 UTC (permalink / raw)
  To: Christian Couder; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 10:27:39AM +0100, Christian Couder wrote:
> On Thu, Feb 13, 2025 at 8:13 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> > We end up with two tables: the first one has been created when cloning
> > the repository and contains all references. The second one has been
> > created when deleting all references, so it only contains ref deletions.
> > Because deletions don't have to carry an object ID, the resulting table
> > is also much smaller. This has the effect that auto-compaction does not
> > kick in, because we see that the geometric sequence is still intact.
>
> Not that I think we should work on this right now, but theoretically,
> could we "just" count the number of entries in each file and base the
> geometric sequence on the number of entries in each file instead of
> file size?

In theory we could, and that may lead to better results in edge cases
like these indeed. If either the header or footer of reftables
contained a total count of contained records, that might have been a
viable thing to do. But they don't, so we'd have to open and parse
every complete reftable to get those counts.

Because of that I think the cost of this would ultimately outweigh the
benefit. After all, this logic kicks in on every write to determine
whether we need to auto-compact. As a result, it needs to remain cheap.

Patrick

^ permalink raw reply	[flat|nested] 12+ messages in thread
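[Editor's note: a simplified Python sketch of the size-based geometric check being discussed. The factor of 2 and the exact comparison are illustrative assumptions, not the precise reftable auto-compaction heuristic, but they show why a tiny deletion-only table on top of one huge table does not trigger compaction by size, while a record-count check would.]

```python
def is_geometric(sizes, factor=2):
    """Sizes ordered oldest -> newest: the stack looks compacted enough
    if every older table is at least `factor` times the next newer one."""
    return all(older >= factor * newer
               for older, newer in zip(sizes, sizes[1:]))

# The reported scenario, by file size: one big table from the clone
# (all refs, each carrying an object ID) plus one small table holding
# only deletions (no object IDs). The sequence still looks geometric,
# so auto-compaction is skipped and lookups keep scanning both tables.
print(is_geometric([40_000_000, 300_000]))  # True -> no compaction

# Christian's alternative, by record count: 50k refs followed by 50k
# deletion tombstones is clearly not geometric, so compaction would
# fire -- but getting these counts requires parsing every table.
print(is_geometric([50_000, 50_000]))       # False -> would compact
```

The sketch also makes Patrick's cost argument concrete: the size-based check only needs each file's length (cheap on every write), whereas the count-based variant needs data that the reftable format does not store in its header or footer.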
end of thread, other threads:[~2025-02-13 22:17 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2025-02-13  0:01 Poor performance using reftable with many refs brian m. carlson
2025-02-13  6:11 ` Patrick Steinhardt
2025-02-13  7:13   ` Patrick Steinhardt
2025-02-13  8:22     ` Jeff King
2025-02-13 11:20       ` Patrick Steinhardt
2025-02-13 14:31         ` Patrick Steinhardt
2025-02-13 19:53           ` Jeff King
2025-02-13 19:42       ` Jeff King
2025-02-13 20:12         ` Junio C Hamano
2025-02-13 22:17     ` brian m. carlson
2025-02-13  9:27   ` Christian Couder
2025-02-13 13:21     ` Patrick Steinhardt