* Poor performance using reftable with many refs
@ 2025-02-13 0:01 brian m. carlson
2025-02-13 6:11 ` Patrick Steinhardt
0 siblings, 1 reply; 12+ messages in thread
From: brian m. carlson @ 2025-02-13 0:01 UTC (permalink / raw)
To: git; +Cc: Patrick Steinhardt, Karthik Nayak
[-- Attachment #1: Type: text/plain, Size: 4272 bytes --]
I've been doing some testing of reftable at $DAYJOB and I found an
interesting performance problem when creating many refs.
I've attached a script which takes 50,000 recent commits, creates a file
suitable for `git update-ref --stdin`, deletes all of the existing refs,
and then uses that file to create the 50,000 refs. The ref creation is
timed using Linux's `/usr/bin/time`. (This is partially extracted from
a larger script, so please accept my apologies for some untidiness.)
With the files backend, the output is as below:
1.75user 3.73system 0:05.50elapsed 99%CPU (0avgtext+0avgdata 166344maxresident)k
0inputs+442880outputs (0major+27962minor)pagefaults 0swaps
With the reftable backend, this is the output:
56.91user 0.52system 0:57.44elapsed 99%CPU (0avgtext+0avgdata 160416maxresident)k
0inputs+6784outputs (0major+26151minor)pagefaults 0swaps
Both measurements are on next, so they should have all relevant patches
that I'm aware of. I've tested on two X1 Carbons, one with Ubuntu 24.04
and one with Debian unstable, so they're both reasonably beefy machines
with modern Linux OSes.
It takes about 30 times as long to perform using the reftable backend,
which is concerning. While this is a synthetic measurement, I had
intended to use it to determine the performance characteristics of
the reference update portion when pushing a large repository for the
first time.
I admit I haven't done any other particular investigation as to what's
going wrong here, but the behaviour is very noticeable so it may be easy
to profile.
One note: the script will be faster and more useful to reproduce if you
change the repository source to a local clone of the Linux repo.
----
#!/bin/sh -e
# This script will reproduce a performance problem with many (50000) refs using
# the current version of reftable in next. The directory `testcase` under the
# current directory will be removed and replaced.
#
# Once the script is finished, you can do `cat testcase/tracedir/*/re-creation`
# to see the performance characteristics of the files backend (first) and the
# reftable backend (second).
# Your friendly neighbourhood Linux repository. This may be any valid remote,
# including an HTTPS or SSH URL.
REPO_SRC="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git"
TAG="v6.13"
export GIT_CONFIG_GLOBAL=/dev/null
# Run a command under /usr/bin/time, logging results to "$TRACEDIR/$output".
timed_op () {
local output="$1"
local message="$2"
shift
shift
printf '%s...' "$message" >&2
/usr/bin/time -o "$TRACEDIR/$output" "$@"
printf 'done.\n' >&2
}
# Delete every ref in the repository in a single update-ref transaction.
delete_refs () {
local output="$1"
(
echo "start"
git for-each-ref --format="%(refname)" | sed -e 's/^/delete /'
echo "prepare"
echo "commit"
) | timed_op "$output" "Deleting all references" git update-ref --stdin
}
fake_refs=true
while [ $# -gt 0 ]
do
case "$1" in
--real-refs)
fake_refs=false
shift
;;
*)
break
;;
esac
done
rm -fr testcase
mkdir testcase
cd testcase
git clone --bare "$REPO_SRC" repo
mkdir tracedir
for backend in files reftable
do
git clone --bare repo $backend
(
set -e
cd $backend
TRACEDIR="$(realpath "../tracedir/$backend")"
mkdir -p "$TRACEDIR"
if [ "$backend" = reftable ]
then
timed_op "migration" "Migrating to reftable" git refs migrate --ref-format=reftable
fi
if $fake_refs
then
git rev-list "$TAG" | head -n 50000 | perl -pe '
$count++;
$choice = $count % 4;
if ($choice == 0) {
s!^(.*)$!create refs/heads/ref-$count $1!;
} elsif ($choice == 1) {
s!^(.*)$!create refs/remotes/bk2204/ref-$count $1!;
} elsif ($choice == 2) {
s!^(.*)$!create refs/remotes/origin/ref-$count $1!;
} elsif ($choice == 3) {
s!^(.*)$!create refs/tags/tag-$count $1!;
}
' | sort >all-refs
else
git for-each-ref --format="%(refname) %(objectname)" | sed -e 's/^/create /' >all-refs
fi
delete_refs "deletion"
timed_op "re-creation" "Re-creating refs" git update-ref --stdin <all-refs
)
done
----
--
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]
^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Poor performance using reftable with many refs
2025-02-13 0:01 Poor performance using reftable with many refs brian m. carlson
@ 2025-02-13 6:11 ` Patrick Steinhardt
2025-02-13 7:13 ` Patrick Steinhardt
0 siblings, 1 reply; 12+ messages in thread
From: Patrick Steinhardt @ 2025-02-13 6:11 UTC (permalink / raw)
To: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 12:01:59AM +0000, brian m. carlson wrote:
> It takes about 30 times as long to perform using the reftable backend,
> which is concerning. While this is a synthetic measurement, I had
> intended to use it to determine the performance characteristics of
> the reference update portion when pushing a large repository for the
> first time.

Interesting, that's an edge case I didn't yet see. I know about some
cases where reftables are ~10% slower, but 30x slower is in a different
ballpark.

> I admit I haven't done any other particular investigation as to what's
> going wrong here, but the behaviour is very noticeable so it may be easy
> to profile.

No worries, I'm already happy to have gotten the report in the first
place :) I'll investigate, but probably won't get to it before next
week.

Thanks!

Patrick
* Re: Poor performance using reftable with many refs
2025-02-13 6:11 ` Patrick Steinhardt
@ 2025-02-13 7:13 ` Patrick Steinhardt
2025-02-13 8:22 ` Jeff King
2025-02-13 9:27 ` Christian Couder
0 siblings, 2 replies; 12+ messages in thread
From: Patrick Steinhardt @ 2025-02-13 7:13 UTC (permalink / raw)
To: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 07:11:33AM +0100, Patrick Steinhardt wrote:
> On Thu, Feb 13, 2025 at 12:01:59AM +0000, brian m. carlson wrote:
> > It takes about 30 times as long to perform using the reftable backend,
> > which is concerning. While this is a synthetic measurement, I had
> > intended to use it to determine the performance characteristics of
> > the reference update portion when pushing a large repository for the
> > first time.
>
> Interesting, that's an edge case I didn't yet see. I know about some
> cases where reftables are ~10% slower, but 30x slower is in a different
> ballpark.

Well, I just cannot resist and had to investigate immediately. I can
indeed reproduce the issue with "linux.git" rather easily:

    Benchmark 1: update-ref (refformat = files)
      Time (mean ± σ):     223.0 ms ±   2.4 ms    [User: 76.1 ms, System: 145.6 ms]
      Range (min … max):   220.2 ms … 226.6 ms    5 runs

    Benchmark 2: update-ref (refformat = reftable)
      Time (mean ± σ):     17.472 s ±  0.153 s    [User: 17.402 s, System: 0.049 s]
      Range (min … max):   17.390 s … 17.745 s    5 runs

    Summary
      update-ref (refformat = files) ran
       78.35 ± 1.09 times faster than update-ref (refformat = reftable)

Oops, that indeed doesn't look great. Turns out that you're hitting
quite a funny edge case: the issue comes from you first deleting all
preexisting refs in the target repository before recreating them. With
"packed-refs", this leads to a repository that has neither a
"packed-refs" file nor any loose ref, except for HEAD of course. But
with "reftables" it doesn't:

    total 368
    -rw-r--r-- 1 pks users 332102 Feb 13 08:00 0x000000000001-0x000000000001-d8285c7c.ref
    -rw-r--r-- 1 pks users  32941 Feb 13 08:00 0x000000000002-0x000000000003-f1a8ebf9.ref
    -rw-r--r-- 1 pks users     86 Feb 13 08:00 tables.list

We end up with two tables: the first one has been created when cloning
the repository and contains all references. The second one has been
created when deleting all references, so it only contains ref deletions.
Because deletions don't have to carry an object ID, the resulting table
is also much smaller. This has the effect that auto-compaction does not
kick in, because we see that the geometric sequence is still intact. And
consequently, all the checks that we perform when recreating the refs
are way more expensive now because we have to search for conflicts.

A "fix" would be to pack references after you have deleted refs. This
leads to a significant speedup and makes the reftable backend outperform
the files backend:

    Benchmark 1: update-ref (refformat = files)
      Time (mean ± σ):     223.1 ms ±   0.6 ms    [User: 71.2 ms, System: 150.8 ms]
      Range (min … max):   222.5 ms … 224.2 ms    5 runs

    Benchmark 2: update-ref (refformat = reftable)
      Time (mean ± σ):     129.1 ms ±   2.1 ms    [User: 84.4 ms, System: 44.1 ms]
      Range (min … max):   127.2 ms … 132.7 ms    5 runs

    Summary
      update-ref (refformat = reftable) ran
        1.73 ± 0.03 times faster than update-ref (refformat = files)

I don't really think there's a general fix for this issue though, as the
issue comes from the design of how tombstone references work.

That being said, I found an optimization in how we parse ref updates in
git-update-ref(1): when we see an exact object ID, we can skip the call
to `repo_get_oid()`. This function is quite expensive because it doesn't
only parse object IDs, but revisions in general. This didn't have much
of an impact on "packed-refs", because there are no references in the
first place. But it did have a significant impact on the "reftable"
backend, where we do have deleted references.

So optimizing this edge case leads to a significant speedup for the
"reftable" backend, but also to a small speedup for the "files" backend:

    Benchmark 1: update-ref (refformat = files, revision = master)
      Time (mean ± σ):     224.7 ms ±   2.9 ms    [User: 79.4 ms, System: 143.5 ms]
      Range (min … max):   220.2 ms … 228.0 ms    5 runs

    Benchmark 2: update-ref (refformat = reftable, revision = master)
      Time (mean ± σ):     16.304 s ±  0.429 s    [User: 16.216 s, System: 0.051 s]
      Range (min … max):   15.865 s … 16.862 s    5 runs

    Benchmark 3: update-ref (refformat = files, revision = pks-reftable-optimization)
      Time (mean ± σ):     181.3 ms ±   2.4 ms    [User: 69.5 ms, System: 110.7 ms]
      Range (min … max):   178.5 ms … 185.0 ms    5 runs

    Benchmark 4: update-ref (refformat = reftable, revision = pks-reftable-optimization)
      Time (mean ± σ):      5.939 s ±  0.060 s    [User: 5.895 s, System: 0.028 s]
      Range (min … max):    5.875 s …  6.026 s    5 runs

    Summary
      update-ref (refformat = files, revision = pks-reftable-optimization) ran
        1.24 ± 0.02 times faster than update-ref (refformat = files, revision = master)
       32.76 ± 0.55 times faster than update-ref (refformat = reftable, revision = pks-reftable-optimization)
       89.93 ± 2.65 times faster than update-ref (refformat = reftable, revision = master)

I will continue digging a bit to see whether there is more to find in
this context and will send a patch to the mailing list later today or
tomorrow.

Patrick
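The auto-compaction behaviour Patrick describes can be sketched as a toy model. This is illustrative Python, not the actual reftable stack logic in git: the function name `segments_to_compact`, the `factor` of 2, and the exact rule are assumptions made for the sketch.

```python
def segments_to_compact(sizes, factor=2):
    """Given table sizes ordered oldest to newest, return how many of
    the newest tables should be compacted together (0 = none).

    Walking from the newest table backwards, the stack counts as
    "geometric" as long as each older table is at least `factor` times
    the combined size of everything newer.  A tiny deletion-only table
    on top of one huge table keeps the sequence intact, so nothing is
    compacted.
    """
    total = 0
    count = 0
    for size in reversed(sizes):
        if total and size >= factor * total:
            break  # sequence intact from here on down
        total += size
        count += 1
    return count if count > 1 else 0

# The two tables from the listing above: the big clone table plus the
# small deletion-only table.  The sequence looks intact, so no
# compaction happens and every lookup keeps paying for both tables.
assert segments_to_compact([332102, 32941]) == 0
# Two similarly-sized tables, by contrast, would be merged.
assert segments_to_compact([1000, 900]) == 2
```

This is also why `git pack-refs` works as the manual "fix" above: it collapses the stack into a single table, making the tombstones disappear.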
* Re: Poor performance using reftable with many refs
2025-02-13 7:13 ` Patrick Steinhardt
@ 2025-02-13 8:22 ` Jeff King
2025-02-13 11:20 ` Patrick Steinhardt
2025-02-13 22:17 ` brian m. carlson
1 sibling, 2 replies; 12+ messages in thread
From: Jeff King @ 2025-02-13 8:22 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 08:13:38AM +0100, Patrick Steinhardt wrote:

> Turns out that you're hitting quite a funny edge case: the issue comes
> from you first deleting all preexisting refs in the target repository
> before recreating them. With "packed-refs", this leads to a repository
> that has neither a "packed-refs" file nor any loose ref, except for HEAD
> of course. But with "reftables" it doesn't:
>
>     total 368
>     -rw-r--r-- 1 pks users 332102 Feb 13 08:00 0x000000000001-0x000000000001-d8285c7c.ref
>     -rw-r--r-- 1 pks users  32941 Feb 13 08:00 0x000000000002-0x000000000003-f1a8ebf9.ref
>     -rw-r--r-- 1 pks users     86 Feb 13 08:00 tables.list
>
> We end up with two tables: the first one has been created when cloning
> the repository and contains all references. The second one has been
> created when deleting all references, so it only contains ref deletions.
> Because deletions don't have to carry an object ID, the resulting table
> is also much smaller. This has the effect that auto-compaction does not
> kick in, because we see that the geometric sequence is still intact. And
> consequently, all the checks that we perform when recreating the refs
> are way more expensive now because we have to search for conflicts.

That makes sense. But that's only 360k of reftables. Why does it take so
long to process?

It's been a while since I looked at reftables, but I'd think for a
normal lookup we should be able to binary-search or similar in each
table, find the relevant entries, and be done.

But I guess we can't easily do that for finding write conflicts, because
writing "foo/bar" means we need to care about "foo" and "foo/bar/baz" as
well. Finding "foo" is easy; we just break apart the proposed refname
and look for each leading path. But "foo/bar/baz" is harder; we have to
merge the tables to get an authoritative sorted list of the current
state, and then look for the entries adjacent to where our proposed ref
goes. Looking at a profiling output, we're spending a lot of time in
merged_iter_next_void() and its children, which supports that.

But the run-time scales linearly with the number of refs we're adding,
which implies that we're doing this iteration independently once per ref
we're adding. Instead, if we're given a list of N refs to write, we
should be able to sort that list and walk it in parallel with the
merged_iter result, making a single pass over the lists.

So I guess we'd need a version of refs_verify_refname_available() that
takes a list of refs rather than a single name. And then you could do
that single-pass walk. And as a bonus, you'd be able to de-dup the
leading prefixes you're looking for (e.g., most of your refs will start
with "refs/heads/", so we only need to check it once).

> That being said, I found an optimization in how we parse ref updates in
> git-update-ref(1): when we see an exact object ID, we can skip the call
> to `repo_get_oid()`. This function is quite expensive because it doesn't
> only parse object IDs, but revisions in general. This didn't have much
> of an impact on "packed-refs", because there are no references in the
> first place. But it did have a significant impact on the "reftable"
> backend, where we do have deleted references.

Yes, we do similarly spend a lot of time there. But the problem isn't
quite that repo_get_oid() also parses revisions. When we see a full
object id we return it quickly. But you can fall afoul of 798c35fcd8
(get_sha1: warn about full or short object names that look like refs,
2013-05-29), which does a full dwim_ref() lookup for each one!

Try:

    git -c core.warnAmbiguousRefs=false update-ref --stdin

to disable that. Internally there's a warn_on_object_refname_ambiguity
flag that some code (like cat-file) sets when it knows it may be asked
to resolve a lot of entries that are likely to be oids.

I kind of wonder if we should ditch that warning. But if we wanted to
keep it, maybe adding a flag to get_oid_with_context() would be a less
hacky way of disabling that warning on a per-call basis.

-Peff
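The single-pass walk Jeff sketches can be modelled like this. It is illustrative Python under the assumption that both the existing refs and the proposed names arrive sorted; `find_df_conflicts` is an invented name, not a git API.

```python
def find_df_conflicts(existing, proposed):
    """Report directory/file conflicts between two sorted ref lists.

    A proposed name conflicts with an existing ref that is a leading
    path of it ("foo" blocks "foo/bar") or that lives underneath it
    ("foo/bar/baz" blocks "foo/bar").  Because both lists are sorted,
    one merged pass suffices: leading-path candidates sort before the
    proposed name, nested entries sort directly after it.
    """
    conflicts = []
    prefixes = []  # existing refs that may still prefix later names
    i, n = 0, len(existing)
    for name in proposed:
        # drop prefixes that no longer apply to this (later) name
        while prefixes and not name.startswith(prefixes[-1] + "/"):
            prefixes.pop()
        # consume existing entries sorting at or before `name`,
        # remembering the ones that are leading paths of it
        while i < n and existing[i] <= name:
            if name.startswith(existing[i] + "/"):
                prefixes.append(existing[i])
            i += 1
        conflicts.extend((name, p) for p in prefixes)
        # entries living under "name/" sort immediately after `name`
        j = i
        while j < n and existing[j].startswith(name + "/"):
            conflicts.append((name, existing[j]))
            j += 1
    return conflicts

assert find_df_conflicts(["foo", "foo/bar/baz"], ["foo/bar"]) == [
    ("foo/bar", "foo"),
    ("foo/bar", "foo/bar/baz"),
]
assert find_df_conflicts(["refs/heads/main"], ["refs/heads/feature"]) == []
```

An exact match (updating an existing ref) is deliberately not a conflict here; only the directory-versus-file cases are reported.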
* Re: Poor performance using reftable with many refs
2025-02-13 8:22 ` Jeff King
@ 2025-02-13 11:20 ` Patrick Steinhardt
2025-02-13 14:31 ` Patrick Steinhardt
2025-02-13 19:42 ` Jeff King
1 sibling, 2 replies; 12+ messages in thread
From: Patrick Steinhardt @ 2025-02-13 11:20 UTC (permalink / raw)
To: Jeff King; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 03:22:21AM -0500, Jeff King wrote:
> On Thu, Feb 13, 2025 at 08:13:38AM +0100, Patrick Steinhardt wrote:
>
> > Turns out that you're hitting quite a funny edge case: the issue comes
> > from you first deleting all preexisting refs in the target repository
> > before recreating them. With "packed-refs", this leads to a repository
> > that has neither a "packed-refs" file nor any loose ref, except for HEAD
> > of course. But with "reftables" it doesn't:
> >
> >     total 368
> >     -rw-r--r-- 1 pks users 332102 Feb 13 08:00 0x000000000001-0x000000000001-d8285c7c.ref
> >     -rw-r--r-- 1 pks users  32941 Feb 13 08:00 0x000000000002-0x000000000003-f1a8ebf9.ref
> >     -rw-r--r-- 1 pks users     86 Feb 13 08:00 tables.list
> >
> > We end up with two tables: the first one has been created when cloning
> > the repository and contains all references. The second one has been
> > created when deleting all references, so it only contains ref deletions.
> > Because deletions don't have to carry an object ID, the resulting table
> > is also much smaller. This has the effect that auto-compaction does not
> > kick in, because we see that the geometric sequence is still intact. And
> > consequently, all the checks that we perform when recreating the refs
> > are way more expensive now because we have to search for conflicts.
>
> That makes sense. But that's only 360k of reftables. Why does it take so
> long to process?
>
> It's been a while since I looked at reftables, but I'd think for a
> normal lookup we should be able to binary-search or similar in each
> table, find the relevant entries, and be done.
>
> But I guess we can't easily do that for finding write conflicts, because
> writing "foo/bar" means we need to care about "foo" and "foo/bar/baz" as
> well. Finding "foo" is easy; we just break apart the proposed refname
> and look for each leading path. But "foo/bar/baz" is harder; we have to
> merge the tables to get an authoritative sorted list of the current
> state, and then look for the entries adjacent to where our proposed ref
> goes. Looking at a profiling output, we're spending a lot of time in
> merged_iter_next_void() and its children, which supports that.
>
> But the run-time scales linearly with the number of refs we're adding,
> which implies that we're doing this iteration independently once per ref
> we're adding. Instead, if we're given a list of N refs to write, we
> should be able to sort that list and walk it in parallel with the
> merged_iter result, making a single pass over the lists.
>
> So I guess we'd need a version of refs_verify_refname_available() that
> takes a list of refs rather than a single name. And then you could do
> that single-pass walk. And as a bonus, you'd be able to de-dup the
> leading prefixes you're looking for (e.g., most of your refs will start
> with "refs/heads/", so we only need to check it once).

Yes, `refs_verify_refname_available()` is exactly the problem. We spend
~80% of the time in that function after the optimization I have pointed
out for `repo_get_oid()`. I assume that we'd see similar performance for
the "files" backend if we had 360k refs and inserted 360k other refs,
but haven't verified this claim.

I've already noticed multiple times that this function is a significant
slowdown in lots of cases. I've already started looking at it a bit, and
will think about ways to fix this.

> > That being said, I found an optimization in how we parse ref updates in
> > git-update-ref(1): when we see an exact object ID, we can skip the call
> > to `repo_get_oid()`. This function is quite expensive because it doesn't
> > only parse object IDs, but revisions in general. This didn't have much
> > of an impact on "packed-refs", because there are no references in the
> > first place. But it did have a significant impact on the "reftable"
> > backend, where we do have deleted references.
>
> Yes, we do similarly spend a lot of time there. But the problem isn't
> quite that repo_get_oid() also parses revisions. When we see a full
> object id we return it quickly. But you can fall afoul of 798c35fcd8
> (get_sha1: warn about full or short object names that look like refs,
> 2013-05-29), which does a full dwim_ref() lookup for each one!
>
> Try:
>
>     git -c core.warnAmbiguousRefs=false update-ref --stdin
>
> to disable that. Internally there's a warn_on_object_refname_ambiguity
> flag that some code (like cat-file) sets when it knows it may be asked
> to resolve a lot of entries that are likely to be oids.
>
> I kind of wonder if we should ditch that warning. But if we wanted to
> keep it, maybe adding a flag to get_oid_with_context() would be a less
> hacky way of disabling that warning on a per-call basis.

Ah, that makes a lot of sense. I don't really think the warning makes
sense for git-update-ref(1), so introducing a flag is probably the best
way to go about this.

Thanks for your input!

Patrick
* Re: Poor performance using reftable with many refs
2025-02-13 11:20 ` Patrick Steinhardt
@ 2025-02-13 14:31 ` Patrick Steinhardt
2025-02-13 19:53 ` Jeff King
0 siblings, 1 reply; 12+ messages in thread
From: Patrick Steinhardt @ 2025-02-13 14:31 UTC (permalink / raw)
To: Jeff King; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 12:20:03PM +0100, Patrick Steinhardt wrote:
> On Thu, Feb 13, 2025 at 03:22:21AM -0500, Jeff King wrote:
> > But I guess we can't easily do that for finding write conflicts, because
> > writing "foo/bar" means we need to care about "foo" and "foo/bar/baz" as
> > well. Finding "foo" is easy; we just break apart the proposed refname
> > and look for each leading path. But "foo/bar/baz" is harder; we have to
> > merge the tables to get an authoritative sorted list of the current
> > state, and then look for the entries adjacent to where our proposed ref
> > goes. Looking at a profiling output, we're spending a lot of time in
> > merged_iter_next_void() and its children, which supports that.
> >
> > But the run-time scales linearly with the number of refs we're adding,
> > which implies that we're doing this iteration independently once per ref
> > we're adding. Instead, if we're given a list of N refs to write, we
> > should be able to sort that list and walk it in parallel with the
> > merged_iter result, making a single pass over the lists.
> >
> > So I guess we'd need a version of refs_verify_refname_available() that
> > takes a list of refs rather than a single name. And then you could do
> > that single-pass walk. And as a bonus, you'd be able to de-dup the
> > leading prefixes you're looking for (e.g., most of your refs will start
> > with "refs/heads/", so we only need to check it once).
>
> Yes, `refs_verify_refname_available()` is exactly the problem. We spend
> ~80% of the time in that function after the optimization I have pointed
> out for `repo_get_oid()`. I assume that we'd see similar performance for
> the "files" backend if we had 360k refs and inserted 360k other refs,
> but haven't verified this claim.
>
> I've already noticed multiple times that this function is a significant
> slowdown in lots of cases. I've already started looking at it a bit, and
> will think about ways to fix this.

This turns out to be harder to implement than anticipated. While
iterating through refnames and the ref iterator in tandem sounds nice,
it would cause problems when the ref iterator yields millions of refs.
You don't want to fully iterate through all of them.

What we really want to do is to reuse the iterator and have it skip
entries: we'd basically create the iterator and re-seek it for every
refname we want to check for collisions. This allows us to reuse some of
the data structures, and in the best case the underlying backend knows
to optimize.

This is something that I have spent a significant time on to implement
in the last couple months for the reftable backend. But while we have
reseekable iterators there, which already got us a bit of a performance
improvement due to more reuse of data structures, we don't yet know to
specifically optimize for some specific seeks. We could e.g. easily skip
re-reading a block if we already know that it will contain the reference
we're searching for.

But the bigger problem is that the generic reftable iterator does not
yet have this capability. We first have to revamp their lifetime in the
same way that I revamped the lifetime of reftable iterators. A generic
iterator that has hit its end is currently getting free'd immediately,
which I always found to be a bit awkward. But because of this it's
impossible to reseek them, as they have lost all their state.

Oh, well, I guess that's what I'll be working on now then. But please,
somebody stop me if I'm not seeing the forest for the trees.

Patrick
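The reseekable-iterator idea can be modelled roughly like this. It is illustrative Python only: `ReseekableIter`, the block layout, and the read counter are invented for the sketch and do not mirror git's actual reftable API.

```python
import bisect

class ReseekableIter:
    """Toy model of a reseekable sorted-table iterator.

    The table is split into fixed-size "blocks"; seek() still does the
    binary search, but only counts a block read when the target falls
    outside the block already in hand.  Reseeking forward through
    sorted queries therefore loads each block at most once, which is
    the optimization discussed above: skip re-reading a block when we
    already know it contains the reference we're searching for.
    """

    def __init__(self, entries, block_size=16):
        self.entries = sorted(entries)
        self.block_size = block_size
        self.cur_block = -1   # no block loaded yet
        self.block_reads = 0  # stand-in for "blocks read and parsed"

    def seek(self, name):
        """Position at the first entry >= name and return it (or None)."""
        pos = bisect.bisect_left(self.entries, name)
        block = pos // self.block_size
        if block != self.cur_block:
            self.cur_block = block
            self.block_reads += 1
        self.pos = pos
        return self.entries[pos] if pos < len(self.entries) else None

refs = ["refs/heads/ref-%04d" % i for i in range(100)]
it = ReseekableIter(refs, block_size=16)
for name in refs[:32]:    # sorted queries: seeks only ever move forward
    it.seek(name)
assert it.block_reads == 2  # 32 lookups, but only two blocks loaded
```

Without reuse, the same 32 lookups would pay for 32 block reads; the win grows with the number of refs checked per block.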
* Re: Poor performance using reftable with many refs
2025-02-13 14:31 ` Patrick Steinhardt
@ 2025-02-13 19:53 ` Jeff King
0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2025-02-13 19:53 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 03:31:47PM +0100, Patrick Steinhardt wrote:

> This turns out to be harder to implement than anticipated. While
> iterating through refnames and the ref iterator in tandem sounds nice,
> it would cause problems when the ref iterator yields millions of refs.
> You don't want to fully iterate through all of them.
>
> What we really want to do is to reuse the iterator and have it skip
> entries: we'd basically create the iterator and re-seek it for every
> refname we want to check for collisions. This allows us to reuse some of
> the data structures, and in the best case the underlying backend knows
> to optimize.

Not having looked at the reftable code much, I'll take your word that it
isn't easy. ;)

I suspect the files backend isn't very good at this either. It knows to
take a starting point in ref_iterator_begin(), and should binary-search
the packed-refs file to the right point. But if you want to then ask it
"OK, now skip ahead until you see entry 'foo'", it would just walk
linearly to get there. I.e., there is no real seeking. So I guess the
best it could do is restart that same binary-search, and the order in
which we feed the refnames is not even really important.

And I don't know how seeking works with reftables, since "skip ahead"
requires merging all of the tables.

> But the bigger problem is that the generic reftable iterator does not
> yet have this capability. We first have to revamp their lifetime in the
> same way that I revamped the lifetime of reftable iterators. A generic
> iterator that has hit its end is currently getting free'd immediately,
> which I always found to be a bit awkward. But because of this it's
> impossible to reseek them, as they have lost all their state.

I do think if you feed the refnames in sorted order your seeks would
always be forward. So if you hit the end of iteration, there should be
nothing left to check.

> Oh, well, I guess that's what I'll be working on now then. But please,
> somebody stop me if I'm not seeing the forest for the trees.

No, I think it's actually a hard problem. :)

-Peff
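The forward-only-seek observation can be checked with a tiny model (illustrative Python; `seek_positions` is a made-up helper, with `bisect` standing in for the per-table binary search):

```python
import bisect

def seek_positions(entries, queries):
    """Index at which each query would land in a sorted table."""
    return [bisect.bisect_left(entries, q) for q in queries]

entries = ["refs/heads/a", "refs/heads/m", "refs/tags/v1"]
queries = sorted(["refs/tags/v0", "refs/heads/b", "refs/heads/z"])
positions = seek_positions(entries, queries)
# Sorted queries land at non-decreasing positions, so an iterator fed
# sorted refnames never has to seek backwards, and hitting the end of
# iteration means nothing is left to check.
assert positions == sorted(positions)
```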
* Re: Poor performance using reftable with many refs 2025-02-13 11:20 ` Patrick Steinhardt 2025-02-13 14:31 ` Patrick Steinhardt @ 2025-02-13 19:42 ` Jeff King 2025-02-13 20:12 ` Junio C Hamano 1 sibling, 1 reply; 12+ messages in thread From: Jeff King @ 2025-02-13 19:42 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: brian m. carlson, git, Karthik Nayak On Thu, Feb 13, 2025 at 12:20:03PM +0100, Patrick Steinhardt wrote: > Yes, `refs_verify_refname_available()` is exactly the problem. We spend > ~80% of the time in that function after the optimization I have pointed > out for `repo_get_oid()`. I assume that we'd see similar performance for > the "files" backend if we had 360k refs and inserted 360k other refs, > but haven't verified this claim. Yeah. I didn't test it, but reading your analysis, I similarly thought that having non-deleted refs might cause the same problem. > > Try: > > > > git -c core.warnAmbiguousRefs=false update-ref --stdin > > > > to disable that. Internally there's a warn_on_object_refname_ambiguity > > flag that some code (like cat-file) sets when it knows it may be asked > > to do a resolve a lot of entries that are likely to be oids. > > > > I kind of wonder if we should ditch that warning. But if we wanted to > > keep it, maybe adding a flag to get_oid_with_context() would be a less > > hacky way of disabling that warning on a per-call basis. > > Ah, that makes a lot of sense. I don't really think the warning makes > sense for git-update-ref(1), so introducing a flag is probably the best > way to go about this. 
So I think the core of my suggestion is to just turn off the existing flag, like: diff --git a/builtin/update-ref.c b/builtin/update-ref.c index 4d35bdc4b4..00e340a53b 100644 --- a/builtin/update-ref.c +++ b/builtin/update-ref.c @@ -618,6 +618,8 @@ static void update_refs_stdin(void) if (!transaction) die("%s", err.buf); + warn_on_object_refname_ambiguity = 0; /* XXX should also restore after */ + /* Read each line dispatch its command */ while (!strbuf_getwholeline(&input, stdin, line_termination)) { const struct parse_cmd *cmd = NULL; which would solve the immediate problem, and is what we do elsewhere for stdin modes. But I'm not sure if my suggestion to do a per-call flag makes sense. A patch for that is below, but it gets weird for commands like rev-list. There we turn off the warning for --stdin, but leave it on for arguments on the command line. But they both end up in handle_revision(), so if we convert that global into a per-call flag, they'll both be affected. You could mitigate that by passing the flags down to handle_revision(), but it gets even weirder with stuff like collect_changed_submodules(), which puts a potentially large number of entries into a fake argv to call setup_revisions(). So the patch below, which illustrates the idea, probably weakens the warning to the point of being useless, as commands that use revision.c would never warn at all. So probably we're better off with the hacky global, or just getting rid of the warning entirely. I have trouble imagining it helping much beyond something like: git branch $some_oid which creates refs/heads/$some_oid. But we probably would be better off warning about that at the time of writing rather than checking for ambiguity on each read. 
-Peff

---
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b13561cf73..ad7a0a6407 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -526,7 +526,7 @@ static void batch_one_object(const char *obj_name,
 {
 	struct object_context ctx = {0};
 	int flags =
-		GET_OID_HASH_ANY |
+		GET_OID_HASH_ANY | GET_OID_SKIP_OBJREF |
 		(opt->follow_symlinks ? GET_OID_FOLLOW_SYMLINKS : 0);
 	enum get_oid_result result;
@@ -781,7 +781,6 @@ static int batch_objects(struct batch_options *opt)
 	struct strbuf input = STRBUF_INIT;
 	struct strbuf output = STRBUF_INIT;
 	struct expand_data data;
-	int save_warning;
 	int retval = 0;
 
 	/*
@@ -850,16 +849,6 @@ static int batch_objects(struct batch_options *opt)
 		return 0;
 	}
 
-	/*
-	 * We are going to call get_sha1 on a potentially very large number of
-	 * objects. In most large cases, these will be actual object sha1s. The
-	 * cost to double-check that each one is not also a ref (just so we can
-	 * warn) ends up dwarfing the actual cost of the object lookups
-	 * themselves. We can work around it by just turning off the warning.
-	 */
-	save_warning = warn_on_object_refname_ambiguity;
-	warn_on_object_refname_ambiguity = 0;
-
 	if (opt->batch_mode == BATCH_MODE_QUEUE_AND_DISPATCH) {
 		batch_objects_command(opt, &output, &data);
 		goto cleanup;
@@ -886,7 +875,6 @@ static int batch_objects(struct batch_options *opt)
 cleanup:
 	strbuf_release(&input);
 	strbuf_release(&output);
-	warn_on_object_refname_ambiguity = save_warning;
 	return retval;
 }
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b16126..128cc39860 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4190,17 +4190,13 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 	};
 	char line[1000];
 	int flags = 0;
-	int save_warning;
 
 	save_commit_buffer = 0;
 	setup_revisions(ac, av, revs, &s_r_opt);
 
 	/* make sure shallows are read */
 	is_repository_shallow(the_repository);
 
-	save_warning = warn_on_object_refname_ambiguity;
-	warn_on_object_refname_ambiguity = 0;
-
 	while (fgets(line, sizeof(line), stdin) != NULL) {
 		int len = strlen(line);
 		if (len && line[len - 1] == '\n')
@@ -4227,8 +4223,6 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
 			die(_("bad revision '%s'"), line);
 	}
 
-	warn_on_object_refname_ambiguity = save_warning;
-
 	if (use_bitmap_index && !get_object_list_from_bitmap(revs))
 		return;
 
diff --git a/environment.c b/environment.c
index e5b361bb5d..b619d0c2b8 100644
--- a/environment.c
+++ b/environment.c
@@ -36,7 +36,6 @@ int minimum_abbrev = 4, default_abbrev = -1;
 int ignore_case;
 int assume_unchanged;
 int is_bare_repository_cfg = -1; /* unspecified */
-int warn_on_object_refname_ambiguity = 1;
 int repository_format_precious_objects;
 char *git_commit_encoding;
 char *git_log_output_encoding;
diff --git a/environment.h b/environment.h
index 2f43340f0b..87f71807b7 100644
--- a/environment.h
+++ b/environment.h
@@ -156,7 +156,6 @@ extern int has_symlinks;
 extern int minimum_abbrev, default_abbrev;
 extern int ignore_case;
 extern int assume_unchanged;
-extern int warn_on_object_refname_ambiguity;
 extern char *apply_default_whitespace;
 extern char *apply_default_ignorewhitespace;
 extern char *git_attributes_file;
diff --git a/hash.h b/hash.h
index 4367acfec5..e9d881c06e 100644
--- a/hash.h
+++ b/hash.h
@@ -204,6 +204,7 @@ struct object_id {
 #define GET_OID_ONLY_TO_DIE    04000
 #define GET_OID_REQUIRE_PATH  010000
 #define GET_OID_HASH_ANY      020000
+#define GET_OID_SKIP_OBJREF   040000
 
 #define GET_OID_DISAMBIGUATORS \
 	(GET_OID_COMMIT | GET_OID_COMMITTISH | \
diff --git a/object-name.c b/object-name.c
index 945d5bdef2..516f214562 100644
--- a/object-name.c
+++ b/object-name.c
@@ -961,7 +961,8 @@ static int get_oid_basic(struct repository *r, const char *str, int len,
 	int fatal = !(flags & GET_OID_QUIETLY);
 
 	if (len == r->hash_algo->hexsz && !get_oid_hex(str, oid)) {
-		if (repo_settings_get_warn_ambiguous_refs(r) && warn_on_object_refname_ambiguity) {
+		if (repo_settings_get_warn_ambiguous_refs(r) &&
+		    !(flags & GET_OID_SKIP_OBJREF)) {
 			refs_found = repo_dwim_ref(r, str, len, &tmp_oid, &real_ref, 0);
 			if (refs_found > 0) {
 				warning(warn_msg, len, str);
diff --git a/revision.c b/revision.c
index 474fa1e767..5c8e80a0c2 100644
--- a/revision.c
+++ b/revision.c
@@ -2184,7 +2184,7 @@ static int handle_revision_arg_1(const char *arg_, struct rev_info *revs, int fl
 	int local_flags;
 	const char *arg = arg_;
 	int cant_be_filename = revarg_opt & REVARG_CANNOT_BE_FILENAME;
-	unsigned get_sha1_flags = GET_OID_RECORD_PATH;
+	unsigned get_sha1_flags = GET_OID_RECORD_PATH | GET_OID_SKIP_OBJREF;
 	int ret;
 
 	flags = flags & UNINTERESTING ? flags | BOTTOM : flags & ~BOTTOM;
@@ -2914,12 +2914,8 @@ static void read_revisions_from_stdin(struct rev_info *revs,
 	struct strbuf sb;
 	int seen_dashdash = 0;
 	int seen_end_of_options = 0;
-	int save_warning;
 	int flags = 0;
 
-	save_warning = warn_on_object_refname_ambiguity;
-	warn_on_object_refname_ambiguity = 0;
-
 	strbuf_init(&sb, 1000);
 	while (strbuf_getline(&sb, stdin) != EOF) {
 		if (!sb.len)
@@ -2952,7 +2948,6 @@ static void read_revisions_from_stdin(struct rev_info *revs,
 		read_pathspec_from_stdin(&sb, prune);
 
 	strbuf_release(&sb);
-	warn_on_object_refname_ambiguity = save_warning;
 }
 
 static void NORETURN diagnose_missing_default(const char *def)
diff --git a/submodule.c b/submodule.c
index b361076c5b..e63019a6f5 100644
--- a/submodule.c
+++ b/submodule.c
@@ -916,16 +916,12 @@ static void collect_changed_submodules(struct repository *r,
 {
 	struct rev_info rev;
 	const struct commit *commit;
-	int save_warning;
 	struct setup_revision_opt s_r_opt = {
 		.assume_dashdash = 1,
 	};
 
-	save_warning = warn_on_object_refname_ambiguity;
-	warn_on_object_refname_ambiguity = 0;
 	repo_init_revisions(r, &rev, NULL);
 	setup_revisions(argv->nr, argv->v, &rev, &s_r_opt);
-	warn_on_object_refname_ambiguity = save_warning;
 
 	if (prepare_revision_walk(&rev))
 		die(_("revision walk setup failed"));

^ permalink raw reply related	[flat|nested] 12+ messages in thread
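[Editor's note: a toy Python model, not Git's code, of the cost pattern the removed cat-file comment describes. All names are made up for illustration; the real `repo_dwim_ref()` consults git's ref-resolution rules against the ref store, which is what gets expensive with a reftable stack that is never compacted. The point is only that the ambiguity warning turns one lookup per input line into several.]

```python
# Roughly models git's ref_rev_parse_rules: the candidate refnames
# that a dwim lookup tries for each input string.
DWIM_PATTERNS = [
    "%s", "refs/%s", "refs/tags/%s", "refs/heads/%s",
    "refs/remotes/%s", "refs/remotes/%s/HEAD",
]

def resolve(oid, objects, refs, warn_ambiguous=True):
    """Resolve a full hex object ID; count the lookups performed."""
    assert oid in objects
    lookups = 1  # the object lookup itself
    if warn_ambiguous:
        # Models the extra repo_dwim_ref() call made only to decide
        # whether to print the "ambiguous" warning.
        for pat in DWIM_PATTERNS:
            lookups += 1
            if pat % oid in refs:
                break  # a ref shadows this oid; git would warn here
    return lookups

# 50k full object IDs fed to a batch command, as in the report:
objects = {f"{i:040x}" for i in range(50_000)}
refs = {f"refs/heads/branch-{i}" for i in range(50_000)}

with_warning = sum(resolve(o, objects, refs, True) for o in objects)
without = sum(resolve(o, objects, refs, False) for o in objects)
print(with_warning, without)  # several lookups per line vs. exactly one
```

In this toy every lookup costs the same, so the overhead is only a constant factor; in the real case each dwim probe hits the ref backend, which is why the warning check can dwarf the object lookups themselves.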
* Re: Poor performance using reftable with many refs
  2025-02-13 19:42             ` Jeff King
@ 2025-02-13 20:12               ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2025-02-13 20:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Steinhardt, brian m. carlson, git, Karthik Nayak

Jeff King <peff@peff.net> writes:

> ... I have trouble
> imagining it helping much beyond something like:
>
>   git branch $some_oid
>
> which creates refs/heads/$some_oid. But we probably would be better off
> warning about that at the time of writing rather than checking for
> ambiguity on each read.

Certainly.  Thanks for a well thought-out analysis.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Poor performance using reftable with many refs
  2025-02-13  8:22         ` Jeff King
  2025-02-13 11:20           ` Patrick Steinhardt
@ 2025-02-13 22:17         ` brian m. carlson
  1 sibling, 0 replies; 12+ messages in thread
From: brian m. carlson @ 2025-02-13 22:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Steinhardt, git, Karthik Nayak

[-- Attachment #1: Type: text/plain, Size: 1579 bytes --]

On 2025-02-13 at 08:22:21, Jeff King wrote:
> Yes, we do similarly spend a lot of time there. But the problem isn't
> quite that repo_get_oid() also parses revisions. When we see a full
> object id we return it quickly. But you can fall afoul of 798c35fcd8
> (get_sha1: warn about full or short object names that look like refs,
> 2013-05-29), which does a full dwim_ref() lookup for each one!
>
> Try:
>
>   git -c core.warnAmbiguousRefs=false update-ref --stdin
>
> to disable that. Internally there's a warn_on_object_refname_ambiguity
> flag that some code (like cat-file) sets when it knows it may be asked
> to do a resolve a lot of entries that are likely to be oids.

Yeah, I think both of these would be great improvements.  The kind of
case I'm interested in is reference updates in the context of pushes or
various API calls, where we're always going to have a full object ID
and there is never a human connected to the output of Git.  I expect
that is the case for a lot of users.

I also think it's unlikely that even the general scripter who's working
with `git update-ref` is going to want that warning.  Most users of
that command are GUIs, APIs, server backends, or the like, and even if
I used the command in my Git aliases or some custom commands, I still
wouldn't really care for that kind of extra output.

I _do_ always want screaming performance in `git update-ref` though,
since I frequently, even in my everyday scripting, work with large
numbers of refs.
-- 
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Poor performance using reftable with many refs
  2025-02-13  7:13       ` Patrick Steinhardt
  2025-02-13  8:22         ` Jeff King
@ 2025-02-13  9:27       ` Christian Couder
  2025-02-13 13:21         ` Patrick Steinhardt
  1 sibling, 1 reply; 12+ messages in thread
From: Christian Couder @ 2025-02-13  9:27 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 8:13 AM Patrick Steinhardt <ps@pks.im> wrote:

> We end up with two tables: the first one has been created when cloning
> the repository and contains all references. The second one has been
> created when deleting all references, so it only contains ref deletions.
> Because deletions don't have to carry an object ID, the resulting table
> is also much smaller. This has the effect that auto-compaction does not
> kick in, because we see that the geometric sequence is still intact.

Not that I think we should work on this right now, but theoretically,
could we "just" count the number of entries in each file and base the
geometric sequence on the number of entries in each file instead of
file size?

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Poor performance using reftable with many refs
  2025-02-13  9:27       ` Christian Couder
@ 2025-02-13 13:21         ` Patrick Steinhardt
  0 siblings, 0 replies; 12+ messages in thread
From: Patrick Steinhardt @ 2025-02-13 13:21 UTC (permalink / raw)
  To: Christian Couder; +Cc: brian m. carlson, git, Karthik Nayak

On Thu, Feb 13, 2025 at 10:27:39AM +0100, Christian Couder wrote:
> On Thu, Feb 13, 2025 at 8:13 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> > We end up with two tables: the first one has been created when cloning
> > the repository and contains all references. The second one has been
> > created when deleting all references, so it only contains ref deletions.
> > Because deletions don't have to carry an object ID, the resulting table
> > is also much smaller. This has the effect that auto-compaction does not
> > kick in, because we see that the geometric sequence is still intact.
>
> Not that I think we should work on this right now, but theoretically,
> could we "just" count the number of entries in each file and base the
> geometric sequence on the number of entries in each file instead of
> file size?

In theory we could, and that may lead to better results in edge cases
like these indeed. If either the header or footer of reftables
contained a total count of contained records, that might have been a
viable thing to do. But they don't, so we'd have to open and parse
every complete reftable to get those counts.

Because of that I think the cost of this would ultimately outweigh the
benefit. After all, this logic kicks in on every write to determine
whether we need to auto-compact. As a result, it needs to remain cheap.

Patrick

^ permalink raw reply	[flat|nested] 12+ messages in thread
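[Editor's note: a simplified Python sketch of the size-based geometric check being discussed. The factor of 2 and the exact comparison are illustrative assumptions, not the precise reftable auto-compaction heuristic, but they show why a tiny deletion-only table on top of one huge table does not trigger compaction by size, while a record-count check would.]

```python
def is_geometric(sizes, factor=2):
    """Sizes ordered oldest -> newest: the stack looks compacted enough
    if every older table is at least `factor` times the next newer one."""
    return all(older >= factor * newer
               for older, newer in zip(sizes, sizes[1:]))

# The reported scenario, by file size: one big table from the clone
# (all refs, each carrying an object ID) plus one small table holding
# only deletions (no object IDs). The sequence still looks geometric,
# so auto-compaction is skipped and lookups keep scanning both tables.
print(is_geometric([40_000_000, 300_000]))  # True -> no compaction

# Christian's alternative, by record count: 50k refs followed by 50k
# deletion tombstones is clearly not geometric, so compaction would
# fire -- but getting these counts requires parsing every table.
print(is_geometric([50_000, 50_000]))       # False -> would compact
```

The sketch also makes Patrick's cost argument concrete: the size-based check only needs each file's length (cheap on every write), whereas the count-based variant needs data that the reftable format does not store in its header or footer.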
end of thread, other threads:[~2025-02-13 22:17 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2025-02-13  0:01 Poor performance using reftable with many refs brian m. carlson
2025-02-13  6:11 ` Patrick Steinhardt
2025-02-13  7:13   ` Patrick Steinhardt
2025-02-13  8:22     ` Jeff King
2025-02-13 11:20       ` Patrick Steinhardt
2025-02-13 14:31         ` Patrick Steinhardt
2025-02-13 19:53           ` Jeff King
2025-02-13 19:42       ` Jeff King
2025-02-13 20:12         ` Junio C Hamano
2025-02-13 22:17     ` brian m. carlson
2025-02-13  9:27   ` Christian Couder
2025-02-13 13:21     ` Patrick Steinhardt