* git-blame extremely slow in partial clones due to serial object fetching
From: Burke Libbey @ 2024-11-19 20:16 UTC
To: git

When running git-blame in a partial clone (--filter=blob:none), it fetches
missing blob objects one at a time. This can result in thousands of serial
fetch operations, making blame extremely slow, regardless of network latency.

For example, in one large repository, blaming a single large file required
fetching about 6500 objects. With each fetch requiring a round-trip, this
operation would have taken something on the order of an hour to complete.

The core issue appears to be in fill_origin_blob(), which is called
individually for each blob needed during the blame process. While the blame
algorithm does need blob contents to make detailed line-matching decisions,
it seems like we don't necessarily need the contents just to determine which
blobs we'll examine.

It seems like this could be optimized by batch-fetching the needed objects
upfront, rather than fetching them one at a time. This would convert O(n)
round-trips into a small number of batch fetches.

Reproduction:

1. Create a partial clone with --filter=blob:none
2. Run git blame on a file with significant history
3. Observe serial fetching of objects in the trace output

Let me know if you need any additional information to investigate this issue.

--burke
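[Editorial note: the reproduction steps above can be sketched end to end with a throwaway repository. All paths and names below are illustrative; the file:// remote stands in for a real server, and the server-side config options are what a partial-clone-capable server would typically enable.]

```shell
#!/bin/sh
# Reproduce blame-triggered serial lazy fetching in a blob:none clone.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/server"
git -C "$tmp/server" config uploadpack.allowFilter true
git -C "$tmp/server" config uploadpack.allowAnySHA1InWant true

# A file with some history: five commits, one new line each.
for i in 1 2 3 4 5; do
    echo "line $i" >>"$tmp/server/file.txt"
    git -C "$tmp/server" add file.txt
    git -C "$tmp/server" commit -qm "commit $i"
done

# Partial clone: commits and trees come over, blobs stay on the server.
git clone -q --filter=blob:none "file://$tmp/server" "$tmp/lazy"

# Blame needs every historical version of file.txt; each missing blob
# triggers its own fetch subprocess, visible in the trace output.
GIT_TRACE=1 git -C "$tmp/lazy" blame file.txt \
    >"$tmp/blame.out" 2>"$tmp/trace.log"
grep -c fetch "$tmp/trace.log"
```

On a real remote, each of those traced fetches is a full network round-trip, which is where the hour-long blame comes from.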
* Re: git-blame extremely slow in partial clones due to serial object fetching
From: Manoraj K @ 2024-11-20 11:59 UTC
To: burke.libbey; Cc: git

Hi Burke,

> When running git-blame in a partial clone (--filter=blob:none), it fetches
> missing blob objects one at a time. This can result in thousands of serial
> fetch operations, making blame extremely slow, regardless of network
> latency.

We are also experiencing this issue with partial clones. Have you found any
workarounds since reporting this?

Git team - would appreciate any insights or suggestions on handling this
performance bottleneck in the meantime.

Best regards,
Manoraj K
* Re: git-blame extremely slow in partial clones due to serial object fetching
From: Jonathan Tan @ 2024-11-20 18:52 UTC
To: Burke Libbey; Cc: Jonathan Tan, git

Burke Libbey <burke.libbey@shopify.com> writes:
> The core issue appears to be in fill_origin_blob(), which is called
> individually for each blob needed during the blame process. While the
> blame algorithm does need blob contents to make detailed line-matching
> decisions, it seems like we don't necessarily need the contents just to
> determine which blobs we'll examine.

Technically, we do need the contents, because the contents determine
whether we are done with the blame (all lines are accounted for) and
whether we need to start looking at the blob at a different path (because
there was a rename).

> It seems like this could be optimized by batch-fetching the needed
> objects upfront, rather than fetching them one at a time. This would
> convert O(n) round-trips into a small number of batch fetches.

That is one possible way (assuming you mean that whenever "git blame"
notices that a blob is missing, it should walk the commits until a certain
depth, collecting all the object IDs for a given path, and prefetching all
of them). This runs the risk of overfetching, as I stated above, but
perhaps overfetching is an acceptable tradeoff for speed.

There are other ways:

- If we can teach the client to collect object IDs for prefetching,
  perhaps it would be just as easy to teach the server. We could instead
  make filter-by-path an acceptable argument to pass to "fetch --filter",
  then teach the lazy fetch to use that argument. This also opens the door
  to future performance improvements - since the server has all the
  objects, it can give us precisely the objects that we need, and not just
  give us a quantity of objects based on a heuristic (so the client does
  not need to say "give me 10, and if I need more, I'll ask you again",
  but can say "give me all I need to complete the blame"). This, however,
  relies on server implementers to implement and turn on such a feature.

- We could also teach the server to "blame" a file for us and then teach
  the client to stitch together the server's result with the local
  findings, but this is more complicated.

It may also be possible that even if we fix this issue, the scale of the
repos involved might be such that a user would rather "blame" over the
network (e.g. using a web UI) than download all the relevant blobs (even
if the blobs were batched into one download).

So... there are ideas for solutions, but I don't think anyone has analyzed
them (or tried them) yet.
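[Editorial note: the client-side prefetch idea can already be approximated with existing plumbing. This sketch, under the assumption that the server enables uploadpack.allowAnySHA1InWant, collects the missing blob IDs for one path with `rev-list --missing=print` and hands them to a single `fetch --stdin`, the same plumbing git's internal lazy fetch uses; the repo setup is illustrative.]

```shell
#!/bin/sh
# Batch-prefetch the blobs blame will need, then blame without any
# per-object round-trips. A local file:// "server" stands in for the
# real remote.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/server"
git -C "$tmp/server" config uploadpack.allowFilter true
git -C "$tmp/server" config uploadpack.allowAnySHA1InWant true
for i in 1 2 3 4 5; do
    echo "line $i" >>"$tmp/server/file.txt"
    git -C "$tmp/server" add file.txt
    git -C "$tmp/server" commit -qm "commit $i"
done
git clone -q --filter=blob:none "file://$tmp/server" "$tmp/lazy"
cd "$tmp/lazy"

# List the objects blame will walk for this path; --missing=print
# reports absent ones (prefixed '?') instead of fetching them.
git rev-list --objects --missing=print HEAD -- file.txt |
    sed -n 's/^?//p' >missing.txt

# One batched fetch for all of them, instead of one round-trip each.
git -c fetch.negotiationAlgorithm=noop fetch -q origin \
    --no-tags --no-write-fetch-head --stdin <missing.txt

git blame file.txt >blame.out
```

After the batched fetch, a second `rev-list --missing=print` pass reports nothing missing, and the blame runs entirely locally.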
* Re: git-blame extremely slow in partial clones due to serial object fetching
From: Junio C Hamano @ 2024-11-20 22:55 UTC
To: Jonathan Tan; Cc: Burke Libbey, git

Jonathan Tan <jonathantanmy@google.com> writes:

> Technically, we do need the contents, ...
> There are other ways:
>
> - If we can teach the client to collect object IDs for prefetching,
>   perhaps it would be just as easy to teach the server. We could
>   instead make filter-by-path an acceptable argument to pass to "fetch
>   --filter", then teach the lazy fetch to use that argument. This also
>   opens the door to future performance improvements - since the server
>   has all the objects, it can give us precisely the objects that we
>   need, and not just give us a quantity of objects based on a heuristic
>   (so the client does not need to say "give me 10, and if I need more,
>   I'll ask you again", but can say "give me all I need to complete the
>   blame"). This, however, relies on server implementers to implement
>   and turn on such a feature.

This is an interesting half-way point, but I have a suspicion that in
order for the server side to give you all you need, the server side has
to do something close to computing the full blame.

Start from a child commit plus the entire file as input, find blocks of
text in that entire file that are different in its parent (these are the
lines that are "blamed" on the child commit), pass control to the same
algorithm but using the parent commit plus the remainder of the file
(excluding the lines of text that have already been "blamed") as the
input, rinse and repeat, until the "remainder of the file" shrinks to
empty. When everything is "blamed", you know you can stop.

So, a server that can give you something better than a heuristic would
have spent enough cycles to know the final result of "blame" by the time
it knows where it should/can stop, wouldn't it?

> - We could also teach the server to "blame" a file for us and then
>   teach the client to stitch together the server's result with the
>   local findings, but this is more complicated.

Your local lazy repository, if you have anything you have to "stitch
together", would have your locally modified contents, and for you to be
able to make such modifications, it would also have at least the blobs
from HEAD, which you based your modifications on. So you should be able
to locally run "git blame @{u}.." to find lines for which your locally
modified contents are to be blamed, ask the other side to give you a
blame for @{u}, and overlay the former on top of the latter.
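[Editorial note: the local half of the overlay Junio describes is already expressible today. In this minimal sketch (repo names illustrative), only lines touched by commits ahead of @{u} are blamed locally; everything older shows up as a boundary line, prefixed '^', which is exactly the part a server-computed blame of @{u} would fill in.]

```shell
#!/bin/sh
# Blame only what upstream has not seen; boundary lines ('^') are the
# part a server-side blame of @{u} would cover.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/server"
echo "upstream line" >"$tmp/server/file.txt"
git -C "$tmp/server" add file.txt
git -C "$tmp/server" commit -qm "upstream commit"

# A clone whose branch tracks the server, plus one local-only commit.
git clone -q "$tmp/server" "$tmp/work"
echo "local line" >>"$tmp/work/file.txt"
git -C "$tmp/work" commit -qam "local commit"

# Limit blame to commits not yet at upstream; older lines are marked
# as boundary commits with a leading '^'.
git -C "$tmp/work" blame '@{u}..' -- file.txt >"$tmp/blame.out"
cat "$tmp/blame.out"
```

The '^'-prefixed line is the seam along which the local result would be overlaid on the server's blame of @{u}.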
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Han Young @ 2024-11-21 3:12 UTC
To: Junio C Hamano; Cc: Jonathan Tan, Burke Libbey, git

On Thu, Nov 21, 2024 at 7:00 AM Junio C Hamano <gitster@pobox.com> wrote:
> > - We could also teach the server to "blame" a file for us and then
> >   teach the client to stitch together the server's result with the
> >   local findings, but this is more complicated.
>
> Your local lazy repository, if you have anything you have to "stitch
> together", would have your locally modified contents, and for you to
> be able to make such modifications, it would also have at least the
> blobs from HEAD, which you based your modifications on. So you should
> be able to locally run "git blame @{u}.." to find lines that your
> locally modified contents are to be blamed, ask the other side to give
> you a blame for @{u}, and overlay the former on top of the latter.

At $DAY_JOB, we modified the server to run blame for the client. To deal
with changes not yet pushed to the server, we have the client pack the
local-only blobs for the blamed file, along with the local-only commits
that touch that file, into one packfile, and send a 'remote-blame' request
to the server.

The server then unpacks the relevant objects into memory (by reusing code
from git-unpack-objects), runs the blame, and returns the result to the
client. This way we avoid running blame twice and having to interleave
the results.

It works quite well in very large repos; with result caching, it can even
be faster than a local blame in a full repo.
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Shubham Kanodia @ 2024-11-22 3:32 UTC
To: Han Young, Junio C Hamano; Cc: Jonathan Tan, Burke Libbey, git

On 21/11/24 8:42 am, Han Young wrote:
> At $DAY_JOB, we modified the server to run blame for the client. To deal
> with changes not yet pushed to the server, we have the client pack the
> local-only blobs for the blamed file, along with the local-only commits
> that touch that file, into one packfile, and send a 'remote-blame'
> request to the server.
>
> The server then unpacks the relevant objects into memory (by reusing
> code from git-unpack-objects), runs the blame, and returns the result
> to the client. This way we avoid running blame twice and having to
> interleave the results.
>
> It works quite well in very large repos; with result caching, it can
> even be faster than a local blame in a full repo.

In a large partially cloned repo that I have, a `git blame` can take
several minutes and many network round-trips.

Junio, would it make sense to add an option (and config) for `git blame`
that limits how far back it looks when fetching blobs? This would prevent
someone accidentally firing several cascading fetches as they open new
files in an editor that runs git blame by default (IntelliJ) or through
popular plugins (GitLens for VS Code), which can start up multiple heavy
git processes and bring a user's system to a crawl.
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Junio C Hamano @ 2024-11-22 8:29 UTC
To: Shubham Kanodia; Cc: Han Young, Jonathan Tan, Burke Libbey, git

Shubham Kanodia <shubham.kanodia10@gmail.com> writes:

> Junio, would it make sense to add an option (and config) for `git
> blame` that limits how far back it looks when fetching blobs?

No, I do not think it would.

What would our workaround be for the next one, when people say "oh, 'git
log -p' fetches blobs on demand and latency kills me"? Yet another such
option, only for 'git log'?
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Shubham Kanodia @ 2024-11-22 8:51 UTC
To: Junio C Hamano; Cc: Han Young, Jonathan Tan, Burke Libbey, git

On 22/11/24 1:59 pm, Junio C Hamano wrote:
> Shubham Kanodia <shubham.kanodia10@gmail.com> writes:
>
>> Junio, would it make sense to add an option (and config) for `git
>> blame` that limits how far back it looks when fetching blobs?
>
> No, I do not think it would.
>
> What would our workaround be for the next one, when people say "oh,
> 'git log -p' fetches blobs on demand and latency kills me"? Yet another
> such option, only for 'git log'?

I'm guessing `git log` already provides options to limit history using
`-n` or `--since`, so ideally it's not unbounded if you use those, unlike
with `git blame`?

I understand the concerns regarding adding new config options, though.
Between the solutions discussed in this thread (batching, adding
server-side support, or something else), what do you think would be a
good track to pursue here? As it stands, this makes using `git blame` on
larger partially cloned repos a possible footgun.
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Jonathan Tan @ 2024-11-22 17:55 UTC
To: Shubham Kanodia; Cc: Jonathan Tan, Junio C Hamano, Han Young, Burke Libbey, git

Shubham Kanodia <shubham.kanodia10@gmail.com> writes:
> On 22/11/24 1:59 pm, Junio C Hamano wrote:
> > Shubham Kanodia <shubham.kanodia10@gmail.com> writes:
> >
> >> Junio, would it make sense to add an option (and config) for `git
> >> blame` that limits how far back it looks when fetching blobs?
> >
> > No, I do not think it would.
> >
> > What would our workaround be for the next one, when people say "oh,
> > 'git log -p' fetches blobs on demand and latency kills me"? Yet
> > another such option, only for 'git log'?
>
> I'm guessing `git log` already provides options to limit history using
> `-n` or `--since`, so ideally it's not unbounded if you use those,
> unlike with `git blame`?

`git blame` also has options. See "SPECIFYING RANGES" in its man page,
which teaches you how to specify revision ranges (and also line ranges,
but that is not relevant here).

> I understand the concerns regarding adding new config options, though.
> Between the solutions discussed in this thread (batching, adding
> server-side support, or something else), what do you think would be a
> good track to pursue here? As it stands, this makes using `git blame`
> on larger partially cloned repos a possible footgun.

Typically, questions like this should be answered by the person who is
actually going to pursue the track. If you'd like to pursue a track but
don't know which to pursue, maybe start with what you believe the best
solution to be. It seems that you think that limiting either the number
of blobs fetched or the time range of the commits to be considered is
best, so maybe you could try one of them.

Limiting the time range is already possible, so I'll provide my ideas
about limiting the number of blobs fetched. You can detect when a blob is
missing (and therefore needs to be fetched) via a flag in
oid_object_info_extended() (or by using has_object()), so you can count
the number of blobs fetched as the blame is being run. My biggest concern
is that there is no good limit - I suspect that for a file that is
extensively changed, 10 blobs is too few and you'll need something like
50 blobs. But 50 blobs means 50 RTTs, which also might be too much for an
end user.

But in any case, you know your users' needs better than we do.
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Junio C Hamano @ 2024-11-25 0:22 UTC
To: Jonathan Tan; Cc: Shubham Kanodia, Han Young, Burke Libbey, git

Jonathan Tan <jonathantanmy@google.com> writes:

> number of blobs fetched as the blame is being run. My biggest concern
> is that there is no good limit - I suspect that for a file that is
> extensively changed, 10 blobs is too few and you'll need something like
> 50 blobs. But 50 blobs means 50 RTTs, which also might be too much for
> an end user.

Depending on the project, the size of a typical change to a blob may
differ, so "10 commits that touched this blob" may touch 20% of the
contents in one project, while in another project that tends to prefer
finer-grained commits, it may take 50 commits to make the same amount of
change. I agree with you that there is no good default that fits all
projects.

Do 50 blobs have to mean 50 RTTs? I wonder if there is a good way to say
"please give me all the necessary tree and blob objects to complete the
blame of path $F for the past 50 commits" to the lazy fetch machinery and
receive a single pack that contains all the objects listed in "git
rev-list --objects HEAD~50.. -- $F".

I am not sure what should happen at the commit in that range where the
path $F appears (meaning: the path did not exist, and its contents came
from a different path in the parent of that commit). You'd need (a subset
of) the objects in "git rev-list --objects C^!" for that commit to find
out where it came from, but what subset should we use? Fully hydrating
the trees of these commits at the rename boundary would ensure you'd
catch the same rename as in a non-lazy repository, but that is far more
than what the user can afford (otherwise, you wouldn't be using a narrow
clone in the first place).

So, I dunno.
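[Editorial note: the single-pack shape Junio describes is easy to produce server-side with existing plumbing. This sketch pipes the exact rev-list invocation from the message above into pack-objects; the repo, depth, and path are illustrative, and it deliberately ignores the rename-boundary question.]

```shell
#!/bin/sh
# Pack everything "git rev-list --objects HEAD~N.. -- $F" lists into a
# single pack stream: the shape a blame-aware lazy fetch could request.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/server"
for i in 1 2 3 4 5; do
    echo "line $i" >>"$tmp/server/file.txt"
    git -C "$tmp/server" add file.txt
    git -C "$tmp/server" commit -qm "commit $i"
done

# All objects needed to blame file.txt over the last three commits,
# emitted as one pack on stdout.
git -C "$tmp/server" rev-list --objects HEAD~3.. -- file.txt |
    git -C "$tmp/server" pack-objects --stdout -q >"$tmp/blame.pack"
wc -c <"$tmp/blame.pack"
```

The resulting stream starts with the usual "PACK" magic, so the client could index it with the ordinary index-pack machinery; what is missing is a protocol verb for asking the server to run this on the client's behalf.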