* Why is "git fetch --prune" so much slower than "git remote prune"?

From: Ævar Arnfjörð Bjarmason
Date: 2015-03-06 16:48 UTC
To: Git Mailing List
Cc: Jay Soffian, Björn Gustavsson

The --prune option to fetch added in v1.6.5-8-gf360d84 seems to be
around 20-30x slower than the equivalent operation with git remote
prune. I'm wondering if I'm missing something and fetch does something
more, but it doesn't seem so.

To test this, clone git.git, create 1000 branches in it, create two
local clones of that clone, and then delete the 1000 branches in the
original. I have a script to do this at
https://gist.github.com/avar/497c8c8fbd641fb756ef

Then in each of the clones:

    $ git branch -a | wc -l; time (~/g/git/git-fetch --prune origin >/dev/null 2>&1); git branch -a | wc -l
    1003

    real    0m3.337s
    user    0m2.996s
    sys     0m0.336s
    3

    $ git branch -a | wc -l; time (~/g/git/git-remote prune origin >/dev/null 2>&1); git branch -a | wc -l
    1003

    real    0m0.067s
    user    0m0.020s
    sys     0m0.040s
    3

Both of these end up doing a "git fetch", so it's not that. I'm quite
rusty in C profiling, but here's a gprof of the git-fetch command:

    $ gprof ~/g/git/git-fetch | head -n 20
    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self               self     total
     time   seconds   seconds     calls  s/call   s/call  name
    26.42      0.33      0.33   1584583    0.00     0.00  strbuf_getwholeline
    14.63      0.51      0.18  90601347    0.00     0.00  strbuf_grow
    13.82      0.68      0.17   1045676    0.00     0.00  find_pack_entry_one
     8.13      0.78      0.10   1050062    0.00     0.00  check_refname_format
     6.50      0.86      0.08   1584675    0.00     0.00  get_sha1_hex
     5.69      0.93      0.07   2100529    0.00     0.00  starts_with
     3.25      0.97      0.04   1044043    0.00     0.00  refname_is_safe
     3.25      1.01      0.04      8007    0.00     0.00  get_packed_ref_cache
     2.44      1.04      0.03   2605595    0.00     0.00  search_ref_dir
     2.44      1.07      0.03   1040500    0.00     0.00  peel_entry
     1.63      1.09      0.02   2632661    0.00     0.00  get_ref_dir
     1.63      1.11      0.02   1044043    0.00     0.00  create_ref_entry
     1.63      1.13      0.02      8024    0.00     0.00  do_for_each_entry_in_dir
     0.81      1.14      0.01   2155105    0.00     0.00  memory_limit_check
     0.81      1.15      0.01   1580503    0.00     0.00  sha1_to_hex

And of the git-remote command:

    $ gprof ~/g/git/git-remote | head -n 20
    Flat profile:

    Each sample counts as 0.01 seconds.
     no time accumulated

      %   cumulative   self              self     total
     time   seconds   seconds    calls  Ts/call  Ts/call  name
     0.00      0.00      0.00   197475     0.00     0.00  strbuf_grow
     0.00      0.00      0.00    24214     0.00     0.00  sort_ref_dir
     0.00      0.00      0.00    24190     0.00     0.00  search_ref_dir
     0.00      0.00      0.00    21661     0.00     0.00  memory_limit_check
     0.00      0.00      0.00    20236     0.00     0.00  get_ref_dir
     0.00      0.00      0.00     9187     0.00     0.00  xrealloc
     0.00      0.00      0.00     7048     0.00     0.00  strbuf_add
     0.00      0.00      0.00     6348     0.00     0.00  do_xmalloc
     0.00      0.00      0.00     6126     0.00     0.00  xcalloc
     0.00      0.00      0.00     6056     0.00     0.00  cleanup_path
     0.00      0.00      0.00     6050     0.00     0.00  get_git_dir
     0.00      0.00      0.00     6050     0.00     0.00  vsnpath
     0.00      0.00      0.00     5554     0.00     0.00  config_file_fgetc

Aside from the slowness of git-fetch, it seems git-remote can be sped
up quite a bit by more aggressively allocating a larger string buffer
from the get-go.
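The setup above can also be reproduced locally without the gist. Here is a
scaled-down sketch (100 branches instead of 1000, and the repository and
branch names are arbitrary); the stale refs should vanish after a single
"fetch --prune":

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)

# An "upstream" repository with one commit and many branches.
git init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.name=test -c user.email=test@example.com \
    commit -q --allow-empty -m base
i=1
while [ "$i" -le 100 ]; do
    git -C "$tmp/upstream" branch "topic-$i"
    i=$((i + 1))
done

# Clone it, then delete every topic branch upstream so the clone's
# remote-tracking refs become stale.
git clone -q "$tmp/upstream" "$tmp/clone"
git -C "$tmp/upstream" for-each-ref --format='%(refname:short)' 'refs/heads/topic-*' |
    xargs git -C "$tmp/upstream" branch -q -D

# Pruning should remove all 100 stale refs/remotes/origin/topic-* refs.
git -C "$tmp/clone" fetch -q --prune origin
git -C "$tmp/clone" for-each-ref 'refs/remotes/origin/topic-*' | wc -l
```

The final count should be zero; timing the fetch against "git remote prune
origin" in a second clone reproduces the comparison in the message above.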
* Re: Why is "git fetch --prune" so much slower than "git remote prune"?

From: Jeff King
Date: 2015-03-06 22:59 UTC
To: Ævar Arnfjörð Bjarmason
Cc: Michael Haggerty, Stefan Beller, Git Mailing List, Jay Soffian,
    Björn Gustavsson

On Fri, Mar 06, 2015 at 05:48:39PM +0100, Ævar Arnfjörð Bjarmason wrote:

> The --prune option to fetch added in v1.6.5-8-gf360d84 seems to be
> around 20-30x slower than the equivalent operation with git remote
> prune. I'm wondering if I'm missing something and fetch does something
> more, but it doesn't seem so.

"git fetch --prune" is "do a normal fetch, and also prune anything
necessary". "git remote prune" is "ls-remote the other side and see if
there is anything we can prune; do not touch anything else".

If your fetch is a noop (i.e., the other side has not advanced any
branches), the outcome is the same. But perhaps fetch is doing more
work to find out that it is a noop. One way to measure that would be
to see how expensive a noop "git fetch" is (if it's expensive, then
there is room to improve there. If not, then it is the pruning itself
that is expensive).

But just guessing (I do not have time to dig in deeper right now), and
seeing this:

> $ gprof ~/g/git/git-fetch | head -n 20
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self               self     total
>  time   seconds   seconds     calls  s/call   s/call  name
> 26.42      0.33      0.33   1584583    0.00     0.00  strbuf_getwholeline
> 14.63      0.51      0.18  90601347    0.00     0.00  strbuf_grow
> 13.82      0.68      0.17   1045676    0.00     0.00  find_pack_entry_one
>  8.13      0.78      0.10   1050062    0.00     0.00  check_refname_format
>  6.50      0.86      0.08   1584675    0.00     0.00  get_sha1_hex
>  5.69      0.93      0.07   2100529    0.00     0.00  starts_with
>  3.25      0.97      0.04   1044043    0.00     0.00  refname_is_safe
>  3.25      1.01      0.04      8007    0.00     0.00  get_packed_ref_cache
>  2.44      1.04      0.03   2605595    0.00     0.00  search_ref_dir
>  2.44      1.07      0.03   1040500    0.00     0.00  peel_entry
>  1.63      1.09      0.02   2632661    0.00     0.00  get_ref_dir
>  1.63      1.11      0.02   1044043    0.00     0.00  create_ref_entry
>  1.63      1.13      0.02      8024    0.00     0.00  do_for_each_entry_in_dir
>  0.81      1.14      0.01   2155105    0.00     0.00  memory_limit_check
>  0.81      1.15      0.01   1580503    0.00     0.00  sha1_to_hex

We spend a lot of time checking refs here. Probably this comes from
writing the `packed-refs` file out 1000 times in your example, because
fetch handles each ref individually. Whereas since c9e768b (remote:
repack packed-refs once when deleting multiple refs, 2014-05-23),
git-remote does it in one pass.

Now that we have ref_transaction_*, I think if git-fetch fed all of
the deletes (along with the updates) into a single transaction, we
would get the same optimization for free. Maybe that is even part of
some of the pending ref_transaction work from Stefan or Michael (both
cc'd). I haven't kept up very well with what is cooking in pu.

-Peff
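The batching described here (many ref deletions, one packed-refs rewrite)
is an internals change, but the same effect is observable from the command
line: all updates fed to a single "git update-ref --stdin" invocation are
applied as one transaction, so with the files backend the packed-refs file
is rewritten once rather than once per ref. A sketch with made-up ref
names:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.name=test -c user.email=test@example.com \
    commit -q --allow-empty -m base

# Simulate stale remote-tracking refs, packed as a fetch might find them.
for i in 1 2 3; do
    git -C "$repo" update-ref "refs/remotes/origin/topic-$i" HEAD
done
git -C "$repo" pack-refs --all

# Deleting the refs one "update-ref -d" at a time would touch packed-refs
# per ref; feeding them all to one --stdin invocation applies them as a
# single transaction.
git -C "$repo" for-each-ref --format='delete %(refname)' 'refs/remotes/origin/topic-*' |
    git -C "$repo" update-ref --stdin

git -C "$repo" for-each-ref 'refs/remotes/origin/*' | wc -l   # prints 0
```

This is only a command-line analogy for the internal optimization being
discussed, not the code path git-fetch itself takes.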
* Re: Why is "git fetch --prune" so much slower than "git remote prune"?

From: Michael Haggerty
Date: 2015-03-19 14:49 UTC
To: Jeff King, Ævar Arnfjörð Bjarmason
Cc: Stefan Beller, Git Mailing List, Jay Soffian, Björn Gustavsson

On 03/06/2015 11:59 PM, Jeff King wrote:
> On Fri, Mar 06, 2015 at 05:48:39PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> The --prune option to fetch added in v1.6.5-8-gf360d84 seems to be
>> around 20-30x slower than the equivalent operation with git remote
>> prune. I'm wondering if I'm missing something and fetch does something
>> more, but it doesn't seem so.
>
> [...]
> We spend a lot of time checking refs here. Probably this comes from
> writing the `packed-refs` file out 1000 times in your example, because
> fetch handles each ref individually. Whereas since c9e768b (remote:
> repack packed-refs once when deleting multiple refs, 2014-05-23),
> git-remote does it in one pass.
>
> Now that we have ref_transaction_*, I think if git-fetch fed all of
> the deletes (along with the updates) into a single transaction, we
> would get the same optimization for free. Maybe that is even part of
> some of the pending ref_transaction work from Stefan or Michael (both
> cc'd). I haven't kept up very well with what is cooking in pu.

I am looking into this now.

For pruning, we can't use a ref_transaction as it is currently
implemented because it would fail if any of the reference deletions
failed. But in this case I think if any deletions fail, we would
prefer to emit a warning but keep going.

I'm trying to decide whether to have a separate function in the refs
API to "delete as many of the following refs as possible", or whether
to add a flag to ref_transaction_update() that says "try this update,
but don't fail the transaction if it fails". The latter would probably
be more work because we would need to invent a way to return multiple
error messages from a single transaction.

Michael

--
Michael Haggerty
mhagger@alum.mit.edu
* Re: Why is "git fetch --prune" so much slower than "git remote prune"?

From: Jeff King
Date: 2015-03-19 17:14 UTC
To: Michael Haggerty
Cc: Ævar Arnfjörð Bjarmason, Stefan Beller, Git Mailing List,
    Jay Soffian, Björn Gustavsson

On Thu, Mar 19, 2015 at 03:49:08PM +0100, Michael Haggerty wrote:

> For pruning, we can't use a ref_transaction as it is currently
> implemented because it would fail if any of the reference deletions
> failed. But in this case I think if any deletions fail, we would
> prefer to emit a warning but keep going.
>
> I'm trying to decide whether to have a separate function in the refs
> API to "delete as many of the following refs as possible", or whether
> to add a flag to ref_transaction_update() that says "try this update,
> but don't fail the transaction if it fails". The latter would probably
> be more work because we would need to invent a way to return multiple
> error messages from a single transaction.

Hmm, yeah. That is sort of antithetical to the original purpose of ref
transactions (which was atomicity). Given the way it is implemented
now, I don't think it would be too big a deal to keep a per-ref status
(one per "struct ref_update"). But I would worry in the long run that
it makes things much harder on proposed ref backends, which otherwise
can consider a transaction an atomic unit.

OTOH, I do not think there is an abstracted way to do this _without_
the ref backend knowing about it. You cannot just do a series of
transactions; that defeats the purpose. It is a bit of a layering
violation that builtin/remote.c (which does this optimization already)
calls repack_without_refs directly.

-Peff
* Re: Why is "git fetch --prune" so much slower than "git remote prune"?

From: Junio C Hamano
Date: 2015-03-19 19:24 UTC
To: Michael Haggerty
Cc: Jeff King, Ævar Arnfjörð Bjarmason, Stefan Beller,
    Git Mailing List, Jay Soffian, Björn Gustavsson

Michael Haggerty <mhagger@alum.mit.edu> writes:

>> Now that we have ref_transaction_*, I think if git-fetch fed all of
>> the deletes (along with the updates) into a single transaction, we
>> would get the same optimization for free. Maybe that is even part of
>> some of the pending ref_transaction work from Stefan or Michael (both
>> cc'd). I haven't kept up very well with what is cooking in pu.
>
> I am looking into this now.
>
> For pruning, we can't use a ref_transaction as it is currently
> implemented because it would fail if any of the reference deletions
> failed. But in this case I think if any deletions fail, we would
> prefer to emit a warning but keep going.

I am not quite sure what you mean here. I agree with you if you meant
"we shouldn't fail the fetch only because 'fetch --prune' failed to
remove only one of the remote-tracking refs that are no longer there",
but that can easily be solved by splitting the pruning phase into a
separate transaction. If you meant "we would rather remove
origin/{a,b} non-atomically when we noticed that origin/{a,b,c} are
all gone than leaving all three intact only because we failed to
remove origin/c for whatever reason", my knee-jerk reaction is "does
it make a practical difference to the end user between these two?"

What are the plausible causes of failing to prune unused
remote-tracking refs?
* Re: Why is "git fetch --prune" so much slower than "git remote prune"?

From: Jeff King
Date: 2015-03-19 21:26 UTC
To: Junio C Hamano
Cc: Michael Haggerty, Ævar Arnfjörð Bjarmason, Stefan Beller,
    Git Mailing List, Jay Soffian, Björn Gustavsson

On Thu, Mar 19, 2015 at 12:24:21PM -0700, Junio C Hamano wrote:

> > For pruning, we can't use a ref_transaction as it is currently
> > implemented because it would fail if any of the reference deletions
> > failed. But in this case I think if any deletions fail, we would
> > prefer to emit a warning but keep going.
>
> I am not quite sure what you mean here. I agree with you if you meant
> "we shouldn't fail the fetch only because 'fetch --prune' failed to
> remove only one of the remote-tracking refs that are no longer there",
> but that can easily be solved by splitting the pruning phase into a
> separate transaction. If you meant "we would rather remove
> origin/{a,b} non-atomically when we noticed that origin/{a,b,c} are
> all gone than leaving all three intact only because we failed to
> remove origin/c for whatever reason", my knee-jerk reaction is "does
> it make a practical difference to the end user between these two?"
>
> What are the plausible causes of failing to prune unused
> remote-tracking refs?

I had assumed earlier that Michael meant to use a single
ref_transaction for all updates. Thinking on it more, that is probably
a bad idea, as it makes fetch atomic in a user-visible way, whereas
currently the updates are always per-ref (i.e., some may fail, but we
let others succeed). I don't know if people actually care or not
(certainly the exit code of fetch represents all of the refs, so it is
not like you could say "eh, git-fetch returned 1, but it probably got
the ref I wanted" without parsing the human-readable output).

If it is just a single atomic commit for all of the deletions, I agree
it is less of a big deal. They are unlikely to fail, and when they do,
you are not blocking the new refs from coming in.

I think the ref_transaction code does have some smarts to handle a
case where we are updating "refs/foo/bar" while "refs/foo" exists but
is deleted in the transaction. We switched to pruning before updating
in 10a6cc8 (fetch --prune: Run prune before fetching, 2014-01-02), so
it is a non-issue, but what is there now is technically racy[1], and
it would have been nice to let the ref-storage code handle it. I guess
we still can if we introduce a "git fetch --atomic" option.

-Peff

[1] http://article.gmane.org/gmane.comp.version-control.git/239519
* Re: Why is "git fetch --prune" so much slower than "git remote prune"?

From: Michael Haggerty
Date: 2015-03-20 4:51 UTC
To: Junio C Hamano
Cc: Jeff King, Ævar Arnfjörð Bjarmason, Stefan Beller,
    Git Mailing List, Jay Soffian, Björn Gustavsson

On 03/19/2015 08:24 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
>
>>> Now that we have ref_transaction_*, I think if git-fetch fed all of
>>> the deletes (along with the updates) into a single transaction, we
>>> would get the same optimization for free. Maybe that is even part of
>>> some of the pending ref_transaction work from Stefan or Michael
>>> (both cc'd). I haven't kept up very well with what is cooking in pu.
>>
>> I am looking into this now.
>>
>> For pruning, we can't use a ref_transaction as it is currently
>> implemented because it would fail if any of the reference deletions
>> failed. But in this case I think if any deletions fail, we would
>> prefer to emit a warning but keep going.
>
> I am not quite sure what you mean here. I agree with you if you meant
> "we shouldn't fail the fetch only because 'fetch --prune' failed to
> remove only one of the remote-tracking refs that are no longer there",
> but that can easily be solved by splitting the pruning phase into a
> separate transaction. If you meant "we would rather remove
> origin/{a,b} non-atomically when we noticed that origin/{a,b,c} are
> all gone than leaving all three intact only because we failed to
> remove origin/c for whatever reason", my knee-jerk reaction is "does
> it make a practical difference to the end user between these two?"
>
> What are the plausible causes of failing to prune unused
> remote-tracking refs?

I wasn't proposing to combine the updates with the prunes in a single
transaction.

(Though now that I think about it, when "fetch --atomic" is specified
it wouldn't be so crazy. It would also have the virtue of allowing the
reference transaction code to deal with any D/F conflicts between
references being added and references being deleted.)

What I meant is the second thing you mentioned, namely that currently
we prune each reference on a "best-effort" basis and wouldn't skip
*all* prunes just because one failed.

But you raise an interesting question: what could cause a failure to
prune an unused remote-tracking ref (i.e. if all pruning is done in a
single transaction)?

* I think the most likely reason for a failure would be a lock
  conflict with another process. We don't retry lock attempts [1], so
  this would cause the whole transaction to fail. This could happen,
  for example, if the user launches two "fetch --prune" operations at
  the same time, or runs "pack-refs" while fetching [2].

* It is *not* a failure if another process deletes or updates the
  reference before we get to it, because we don't provide an "old"
  value of the reference for a pre-check.

* I haven't verified this, but I can imagine it could fail if another
  process deletes the reference (say refs/remotes/origin/foo/bar) and
  adds another reference that would D/F conflict with it (say
  refs/remotes/origin/foo) while we are working, because our attempt
  to create refs/remotes/origin/foo/bar.lock would fail.

* A repository problem, like for example wrong file permissions
  somewhere in the loose refs directory, could prevent us from being
  able to create the lockfile or delete the loose reference file.

The lock conflict scenario is the only one that really worries me. But
I don't think it is *so* hard to keep the current "best-effort"
behavior. I am working on a function delete_refs() for the reference
API that deletes a bunch of references on a "best effort" basis,
reporting per-reference errors for any deletions that fail.

Longer term, we could add something like a REFS_NON_ATOMIC flag to the
refs_transaction_update() functions, which allows the rest of the
transaction to proceed even if that particular update fails.

Michael

[1] I hope to submit a patch to make it possible to retry lock
    acquisition with a specified timeout. This should help in
    scenarios like this.

[2] Maybe a careful analysis or a slight change to our locking
    protocol could prove that the only lock acquisition that can fail
    is the one on packed-refs, which would cause all of the deletes to
    fail anyway. In that case per-reference failures would cease to be
    a worry.

--
Michael Haggerty
mhagger@alum.mit.edu
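The "best effort" semantics described for the proposed delete_refs() can
be illustrated from the command line. In this sketch the ref names are
invented, and a ref that does not exist stands in for any per-ref failure:
the failing deletion produces a warning but does not stop the rest.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.name=test -c user.email=test@example.com \
    commit -q --allow-empty -m base
for i in 1 2 3; do
    git -C "$repo" update-ref "refs/remotes/origin/gone-$i" HEAD
done

# Best-effort pruning: delete each ref individually; on failure, warn and
# keep going instead of aborting the remaining deletions.
for ref in refs/remotes/origin/gone-1 \
           refs/remotes/origin/already-removed \
           refs/remotes/origin/gone-2 \
           refs/remotes/origin/gone-3; do
    git -C "$repo" update-ref -d "$ref" 2>/dev/null ||
        echo "warning: could not prune $ref" >&2
done
git -C "$repo" for-each-ref 'refs/remotes/origin/*' | wc -l   # prints 0
```

An all-or-nothing delete_refs would instead correspond to feeding all four
deletions to one "git update-ref --stdin" call, where the missing ref
would fail the whole batch; that contrast is the design question under
discussion here.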
* Re: Why is "git fetch --prune" so much slower than "git remote prune"?

From: Junio C Hamano
Date: 2015-03-20 7:04 UTC
To: Michael Haggerty
Cc: Jeff King, Ævar Arnfjörð Bjarmason, Stefan Beller,
    Git Mailing List, Jay Soffian, Björn Gustavsson

Michael Haggerty <mhagger@alum.mit.edu> writes:

> The lock conflict scenario is the only one that really worries me.

Actually, I'd feel worried if this were on the receive-pack side, as
it is entirely valid that two or more people make uncoordinated pushes
into a single meeting point, but much less so for something like
"fetch". I do not see a valid reason why somebody cannot avoid
fetching into the same repository in a racy way.

> But I don't think it is *so* hard to keep the current "best-effort"
> behavior.

I agree that a transaction is the wrong way to look at this. I view
"we know packed-refs is a lot more efficient if written once" as in
the same category as "we know we will register many new objects with
this operation, so instead of creating them as separate loose object
files, declare that the object store is 'plugged' and send them into a
single new packfile" (aka "bulk checkin"). Both are uninterested in
all-or-none execution, but merely want to declare a boundary around a
group of operations to help Git optimize them.

> I am working on a function delete_refs() for the reference API
> that deletes a bunch of references on a "best effort" basis,
> reporting per-reference errors for any deletions that fail.

I think it would make sense, but I suspect you would want to make it
clear that the function is best-effort-bulk somewhere in its name
(e.g. delete-refs-if-able or something) and make it not automatically
fail the surrounding transaction, if any. That way an application
could do things like this:

    transaction-begin
    create-ref
    update-ref
    ...
    delete-refs-if-able
    if trying to be as atomic as possible
    then
        if previous delete-refs-if-able found undeletable refs
        then
            transaction-abort
        fi
    fi
    transaction-commit

Of course, supplying a delete-refs that is not best-effort separately
would make it even easier to code the above (the last if/then/fi would
become unnecessary).