* git-blame extremely slow in partial clones due to serial object fetching
From: Burke Libbey @ 2024-11-19 20:16 UTC
To: git

When running git-blame in a partial clone (--filter=blob:none), it fetches
missing blob objects one at a time. This can result in thousands of serial
fetch operations, making blame extremely slow, regardless of network latency.

For example, in one large repository, blaming a single large file required
fetching about 6500 objects. With each fetch requiring a round-trip, this
operation would have taken something on the order of an hour to complete.

The core issue appears to be in fill_origin_blob(), which is called
individually for each blob needed during the blame process. While the blame
algorithm does need blob contents to make detailed line-matching decisions,
it seems like we don't necessarily need the contents just to determine which
blobs we'll examine.

It seems like this could be optimized by batch-fetching the needed objects
upfront, rather than fetching them one at a time. This would convert O(n)
round-trips into a small number of batch fetches.

Reproduction:

1. Create a partial clone with --filter=blob:none
2. Run git blame on a file with significant history
3. Observe serial fetching of objects in the trace output

Let me know if you need any additional information to investigate this issue.

--burke
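[Editorial note: the reproduction steps above can be sketched end to end with a throwaway repository. All paths and names below are illustrative; the file:// remote stands in for a real server, and the server-side config options are what a partial-clone-capable server would typically enable.]

```shell
#!/bin/sh
# Reproduce blame-triggered serial lazy fetching in a blob:none clone.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/server"
git -C "$tmp/server" config uploadpack.allowFilter true
git -C "$tmp/server" config uploadpack.allowAnySHA1InWant true

# A file with some history: five commits, one new line each.
for i in 1 2 3 4 5; do
    echo "line $i" >>"$tmp/server/file.txt"
    git -C "$tmp/server" add file.txt
    git -C "$tmp/server" commit -qm "commit $i"
done

# Partial clone: commits and trees come over, blobs stay on the server.
git clone -q --filter=blob:none "file://$tmp/server" "$tmp/lazy"

# Blame needs every historical version of file.txt; each missing blob
# triggers its own fetch subprocess, visible in the trace output.
GIT_TRACE=1 git -C "$tmp/lazy" blame file.txt \
    >"$tmp/blame.out" 2>"$tmp/trace.log"
grep -c fetch "$tmp/trace.log"
```

On a real remote, each of those traced fetches is a full network round-trip, which is where the hour-long blame comes from.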
* Re: git-blame extremely slow in partial clones due to serial object fetching
From: Manoraj K @ 2024-11-20 11:59 UTC
To: burke.libbey; Cc: git

Hi Burke,

> When running git-blame in a partial clone (--filter=blob:none), it fetches
> missing blob objects one at a time. This can result in thousands of serial
> fetch operations, making blame extremely slow, regardless of network
> latency.

We are also experiencing this issue with partial clones. Have you found any
workarounds since reporting this?

Git team - would appreciate any insights or suggestions on handling this
performance bottleneck in the meantime.

Best regards,
Manoraj K
* Re: git-blame extremely slow in partial clones due to serial object fetching
From: Jonathan Tan @ 2024-11-20 18:52 UTC
To: Burke Libbey; Cc: Jonathan Tan, git

Burke Libbey <burke.libbey@shopify.com> writes:
> The core issue appears to be in fill_origin_blob(), which is called
> individually for each blob needed during the blame process. While the
> blame algorithm does need blob contents to make detailed line-matching
> decisions, it seems like we don't necessarily need the contents just to
> determine which blobs we'll examine.

Technically, we do need the contents, because the contents determine
whether we are done with the blame (all lines are accounted for) and
whether we need to start looking at the blob at a different path (because
there was a rename).

> It seems like this could be optimized by batch-fetching the needed
> objects upfront, rather than fetching them one at a time. This would
> convert O(n) round-trips into a small number of batch fetches.

That is one possible way (assuming you mean that whenever "git blame"
notices that a blob is missing, it should walk the commits until a certain
depth, collecting all the object IDs for a given path, and prefetching all
of them). This runs the risk of overfetching, as I stated above, but
perhaps overfetching is an acceptable tradeoff for speed.

There are other ways:

- If we can teach the client to collect object IDs for prefetching,
  perhaps it would be just as easy to teach the server. We could instead
  make filter-by-path an acceptable argument to pass to "fetch --filter",
  then teach the lazy fetch to use that argument. This also opens the door
  to future performance improvements - since the server has all the
  objects, it can give us precisely the objects that we need, and not just
  give us a quantity of objects based on a heuristic (so the client does
  not need to say "give me 10, and if I need more, I'll ask you again",
  but can say "give me all I need to complete the blame"). This, however,
  relies on server implementers to implement and turn on such a feature.

- We could also teach the server to "blame" a file for us and then teach
  the client to stitch together the server's result with the local
  findings, but this is more complicated.

It may also be possible that even if we fix this issue, the scale of the
repos involved might be such that a user would rather "blame" over the
network (e.g. using a web UI) than download all the relevant blobs (even
if the blobs were batched into one download).

So... there are ideas for solutions, but I don't think anyone has analyzed
them (or tried them) yet.
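[Editorial note: the client-side prefetch idea can already be approximated with existing plumbing. This sketch, under the assumption that the server enables uploadpack.allowAnySHA1InWant, collects the missing blob IDs for one path with `rev-list --missing=print` and hands them to a single `fetch --stdin`, the same plumbing git's internal lazy fetch uses; the repo setup is illustrative.]

```shell
#!/bin/sh
# Batch-prefetch the blobs blame will need, then blame without any
# per-object round-trips. A local file:// "server" stands in for the
# real remote.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/server"
git -C "$tmp/server" config uploadpack.allowFilter true
git -C "$tmp/server" config uploadpack.allowAnySHA1InWant true
for i in 1 2 3 4 5; do
    echo "line $i" >>"$tmp/server/file.txt"
    git -C "$tmp/server" add file.txt
    git -C "$tmp/server" commit -qm "commit $i"
done
git clone -q --filter=blob:none "file://$tmp/server" "$tmp/lazy"
cd "$tmp/lazy"

# List the objects blame will walk for this path; --missing=print
# reports absent ones (prefixed '?') instead of fetching them.
git rev-list --objects --missing=print HEAD -- file.txt |
    sed -n 's/^?//p' >missing.txt

# One batched fetch for all of them, instead of one round-trip each.
git -c fetch.negotiationAlgorithm=noop fetch -q origin \
    --no-tags --no-write-fetch-head --stdin <missing.txt

git blame file.txt >blame.out
```

After the batched fetch, a second `rev-list --missing=print` pass reports nothing missing, and the blame runs entirely locally.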
* Re: git-blame extremely slow in partial clones due to serial object fetching
From: Junio C Hamano @ 2024-11-20 22:55 UTC
To: Jonathan Tan; Cc: Burke Libbey, git

Jonathan Tan <jonathantanmy@google.com> writes:

> Technically, we do need the contents, ...
> There are other ways:
>
> - If we can teach the client to collect object IDs for prefetching,
>   perhaps it would be just as easy to teach the server. We could
>   instead make filter-by-path an acceptable argument to pass to "fetch
>   --filter", then teach the lazy fetch to use that argument. This also
>   opens the door to future performance improvements - since the server
>   has all the objects, it can give us precisely the objects that we
>   need, and not just give us a quantity of objects based on a heuristic
>   (so the client does not need to say "give me 10, and if I need more,
>   I'll ask you again", but can say "give me all I need to complete the
>   blame"). This, however, relies on server implementers to implement
>   and turn on such a feature.

This is an interesting half-way point, but I have a suspicion that in
order for the server side to give you all you need, the server side has
to do something close to computing the full blame.

Start from a child commit plus the entire file as input, find blocks of
text in that entire file that are different in its parent (these are the
lines that are "blamed" on the child commit), pass control to the same
algorithm but using the parent commit plus the remainder of the file
(excluding the lines of text that have already been "blamed") as the
input, rinse and repeat, until the "remainder of the file" shrinks to
empty. When everything is "blamed", you know you can stop.

So, a server that can give you something better than a heuristic would
have spent enough cycles to know the final result of "blame" by the time
it knows where it should/can stop, wouldn't it?

> - We could also teach the server to "blame" a file for us and then
>   teach the client to stitch together the server's result with the
>   local findings, but this is more complicated.

Your local lazy repository, if you have anything you have to "stitch
together", would have your locally modified contents, and for you to be
able to make such modifications, it would also have at least the blobs
from HEAD, which you based your modifications on. So you should be able
to locally run "git blame @{u}.." to find lines for which your locally
modified contents are to be blamed, ask the other side to give you a
blame for @{u}, and overlay the former on top of the latter.
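[Editorial note: the local half of the overlay Junio describes is already expressible today. In this minimal sketch (repo names illustrative), only lines touched by commits ahead of @{u} are blamed locally; everything older shows up as a boundary line, prefixed '^', which is exactly the part a server-computed blame of @{u} would fill in.]

```shell
#!/bin/sh
# Blame only what upstream has not seen; boundary lines ('^') are the
# part a server-side blame of @{u} would cover.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/server"
echo "upstream line" >"$tmp/server/file.txt"
git -C "$tmp/server" add file.txt
git -C "$tmp/server" commit -qm "upstream commit"

# A clone whose branch tracks the server, plus one local-only commit.
git clone -q "$tmp/server" "$tmp/work"
echo "local line" >>"$tmp/work/file.txt"
git -C "$tmp/work" commit -qam "local commit"

# Limit blame to commits not yet at upstream; older lines are marked
# as boundary commits with a leading '^'.
git -C "$tmp/work" blame '@{u}..' -- file.txt >"$tmp/blame.out"
cat "$tmp/blame.out"
```

The '^'-prefixed line is the seam along which the local result would be overlaid on the server's blame of @{u}.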
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Han Young @ 2024-11-21 3:12 UTC
To: Junio C Hamano; Cc: Jonathan Tan, Burke Libbey, git

On Thu, Nov 21, 2024 at 7:00 AM Junio C Hamano <gitster@pobox.com> wrote:
> > - We could also teach the server to "blame" a file for us and then
> >   teach the client to stitch together the server's result with the
> >   local findings, but this is more complicated.
>
> Your local lazy repository, if you have anything you have to "stitch
> together", would have your locally modified contents, and for you to
> be able to make such modifications, it would also have at least the
> blobs from HEAD, which you based your modifications on. So you should
> be able to locally run "git blame @{u}.." to find lines that your
> locally modified contents are to be blamed, ask the other side to give
> you a blame for @{u}, and overlay the former on top of the latter.

At $DAY_JOB, we modified the server to run blame for the client. To deal
with changes not yet pushed to the server, we have the client pack the
local-only blobs for the blamed file, along with the local-only commits
that touch that file, into one packfile, and send a 'remote-blame' request
to the server.

The server then unpacks the relevant objects into memory (by reusing code
from git-unpack-objects), runs the blame, and returns the result to the
client. This way we avoid running blame twice and having to interleave
the results.

It works quite well in very large repos; with result caching, it can even
be faster than a local blame in a full repo.
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Shubham Kanodia @ 2024-11-22 3:32 UTC
To: Han Young, Junio C Hamano; Cc: Jonathan Tan, Burke Libbey, git

On 21/11/24 8:42 am, Han Young wrote:
> At $DAY_JOB, we modified the server to run blame for the client. To deal
> with changes not yet pushed to the server, we have the client pack the
> local-only blobs for the blamed file, along with the local-only commits
> that touch that file, into one packfile, and send a 'remote-blame'
> request to the server.
>
> The server then unpacks the relevant objects into memory (by reusing
> code from git-unpack-objects), runs the blame, and returns the result
> to the client. This way we avoid running blame twice and having to
> interleave the results.
>
> It works quite well in very large repos; with result caching, it can
> even be faster than a local blame in a full repo.

In a large partially cloned repo that I have, a `git blame` can take
several minutes and many network round-trips.

Junio, would it make sense to add an option (and config) for `git blame`
that limits how far back it looks when fetching blobs? This would prevent
someone accidentally firing several cascading fetches as they open new
files in an editor that runs git blame by default (IntelliJ) or through
popular plugins (GitLens for VS Code), which can start up multiple heavy
git processes and bring a user's system to a crawl.
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Junio C Hamano @ 2024-11-22 8:29 UTC
To: Shubham Kanodia; Cc: Han Young, Jonathan Tan, Burke Libbey, git

Shubham Kanodia <shubham.kanodia10@gmail.com> writes:

> Junio, would it make sense to add an option (and config) for `git
> blame` that limits how far back it looks when fetching blobs?

No, I do not think it would.

What would our workaround be for the next one, when people say "oh, 'git
log -p' fetches blobs on demand and latency kills me"? Yet another such
option, only for 'git log'?
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Shubham Kanodia @ 2024-11-22 8:51 UTC
To: Junio C Hamano; Cc: Han Young, Jonathan Tan, Burke Libbey, git

On 22/11/24 1:59 pm, Junio C Hamano wrote:
> Shubham Kanodia <shubham.kanodia10@gmail.com> writes:
>
>> Junio, would it make sense to add an option (and config) for `git
>> blame` that limits how far back it looks when fetching blobs?
>
> No, I do not think it would.
>
> What would our workaround be for the next one, when people say "oh,
> 'git log -p' fetches blobs on demand and latency kills me"? Yet another
> such option, only for 'git log'?

I'm guessing `git log` already provides options to limit history using
`-n` or `--since`, so ideally it's not unbounded if you use those, unlike
with `git blame`?

I understand the concerns regarding adding new config options, though.
Between the solutions discussed in this thread (batching, adding
server-side support, or something else), what do you think would be a
good track to pursue here? As it stands, this makes using `git blame` on
larger partially cloned repos a possible footgun.
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Jonathan Tan @ 2024-11-22 17:55 UTC
To: Shubham Kanodia; Cc: Jonathan Tan, Junio C Hamano, Han Young, Burke Libbey, git

Shubham Kanodia <shubham.kanodia10@gmail.com> writes:
> On 22/11/24 1:59 pm, Junio C Hamano wrote:
> > Shubham Kanodia <shubham.kanodia10@gmail.com> writes:
> >
> >> Junio, would it make sense to add an option (and config) for `git
> >> blame` that limits how far back it looks when fetching blobs?
> >
> > No, I do not think it would.
> >
> > What would our workaround be for the next one, when people say "oh,
> > 'git log -p' fetches blobs on demand and latency kills me"? Yet
> > another such option, only for 'git log'?
>
> I'm guessing `git log` already provides options to limit history using
> `-n` or `--since`, so ideally it's not unbounded if you use those,
> unlike with `git blame`?

`git blame` also has options. See "SPECIFYING RANGES" in its man page,
which teaches you how to specify revision ranges (and also line ranges,
but that is not relevant here).

> I understand the concerns regarding adding new config options, though.
> Between the solutions discussed in this thread (batching, adding
> server-side support, or something else), what do you think would be a
> good track to pursue here? As it stands, this makes using `git blame`
> on larger partially cloned repos a possible footgun.

Typically, questions like this should be answered by the person who is
actually going to pursue the track. If you'd like to pursue a track but
don't know which to pursue, maybe start with what you believe the best
solution to be. It seems that you think that limiting either the number
of blobs fetched or the time range of the commits to be considered is
best, so maybe you could try one of them.

Limiting the time range is already possible, so I'll provide my ideas
about limiting the number of blobs fetched. You can detect when a blob is
missing (and therefore needs to be fetched) via a flag in
oid_object_info_extended() (or by using has_object()), so you can count
the number of blobs fetched as the blame is being run. My biggest concern
is that there is no good limit - I suspect that for a file that is
extensively changed, 10 blobs is too few and you'll need something like
50 blobs. But 50 blobs means 50 RTTs, which also might be too much for an
end user.

But in any case, you know your users' needs better than we do.
* Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching
From: Junio C Hamano @ 2024-11-25 0:22 UTC
To: Jonathan Tan; Cc: Shubham Kanodia, Han Young, Burke Libbey, git

Jonathan Tan <jonathantanmy@google.com> writes:

> number of blobs fetched as the blame is being run. My biggest concern
> is that there is no good limit - I suspect that for a file that is
> extensively changed, 10 blobs is too few and you'll need something like
> 50 blobs. But 50 blobs means 50 RTTs, which also might be too much for
> an end user.

Depending on the project, the size of a typical change to a blob may
differ, so "10 commits that touched this blob" may touch 20% of the
contents in one project, while in another project that tends to prefer
finer-grained commits, it may take 50 commits to make the same amount of
change. I agree with you that there is no good default that fits all
projects.

Do 50 blobs have to mean 50 RTTs? I wonder if there is a good way to say
"please give me all the necessary tree and blob objects to complete the
blame of path $F for the past 50 commits" to the lazy fetch machinery and
receive a single pack that contains all the objects listed in "git
rev-list --objects HEAD~50.. -- $F".

I am not sure what should happen at the commit in that range where the
path $F appears (meaning: the path did not exist, and its contents came
from a different path in the parent of that commit). You'd need (a subset
of) the objects in "git rev-list --objects C^!" for that commit to find
out where it came from, but what subset should we use? Fully hydrating
the trees of these commits at the rename boundary would ensure you'd
catch the same rename as in a non-lazy repository, but that is far more
than what the user can afford (otherwise, you wouldn't be using a narrow
clone in the first place).

So, I dunno.
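[Editorial note: the single-pack shape Junio describes is easy to produce server-side with existing plumbing. This sketch pipes the exact rev-list invocation from the message above into pack-objects; the repo, depth, and path are illustrative, and it deliberately ignores the rename-boundary question.]

```shell
#!/bin/sh
# Pack everything "git rev-list --objects HEAD~N.. -- $F" lists into a
# single pack stream: the shape a blame-aware lazy fetch could request.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/server"
for i in 1 2 3 4 5; do
    echo "line $i" >>"$tmp/server/file.txt"
    git -C "$tmp/server" add file.txt
    git -C "$tmp/server" commit -qm "commit $i"
done

# All objects needed to blame file.txt over the last three commits,
# emitted as one pack on stdout.
git -C "$tmp/server" rev-list --objects HEAD~3.. -- file.txt |
    git -C "$tmp/server" pack-objects --stdout -q >"$tmp/blame.pack"
wc -c <"$tmp/blame.pack"
```

The resulting stream starts with the usual "PACK" magic, so the client could index it with the ordinary index-pack machinery; what is missing is a protocol verb for asking the server to run this on the client's behalf.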