* [PATCH] blame.c: don't drop origin blobs as eagerly
From: David Kastrup @ 2019-04-02 11:56 UTC
To: git; +Cc: David Kastrup

When a parent blob already has chunks queued up for blaming, dropping
the blob at the end of one blame step will cause it to get reloaded
right away, doubling the amount of I/O and unpacking when processing a
linear history.

Keeping such parent blobs in memory seems like a reasonable optimization
that should incur additional memory pressure mostly when processing the
merges from old branches.

Signed-off-by: David Kastrup <dak@gnu.org>
---
 blame.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/blame.c b/blame.c
index 5c07dec190..c11c516921 100644
--- a/blame.c
+++ b/blame.c
@@ -1562,7 +1562,8 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 	}
 	for (i = 0; i < num_sg; i++) {
 		if (sg_origin[i]) {
-			drop_origin_blob(sg_origin[i]);
+			if (!sg_origin[i]->suspects)
+				drop_origin_blob(sg_origin[i]);
 			blame_origin_decref(sg_origin[i]);
 		}
 	}
-- 
2.20.1
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: Junio C Hamano @ 2019-04-03 7:45 UTC
To: David Kastrup; +Cc: git

David Kastrup <dak@gnu.org> writes:

> When a parent blob already has chunks queued up for blaming, dropping
> the blob at the end of one blame step will cause it to get reloaded
> right away, doubling the amount of I/O and unpacking when processing a
> linear history.
>
> Keeping such parent blobs in memory seems like a reasonable optimization
> that should incur additional memory pressure mostly when processing the
> merges from old branches.

Thanks for finding an age-old one that dates back to 7c3c7962
("blame: drop blob data after passing blame to the parent",
2007-12-11).

Interestingly, the said commit claims:

    When passing blame from a parent to its parent (i.e. the
    grandparent), the blob data for the parent may need to be read
    again, but it should be relatively cheap, thanks to delta-base
    cache.

but perhaps you found a case where the delta-base cache is not all
that effective in the benchmark?

Will queue.  Thanks.

>
> Signed-off-by: David Kastrup <dak@gnu.org>
> ---
>  blame.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/blame.c b/blame.c
> index 5c07dec190..c11c516921 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -1562,7 +1562,8 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
>  	}
>  	for (i = 0; i < num_sg; i++) {
>  		if (sg_origin[i]) {
> -			drop_origin_blob(sg_origin[i]);
> +			if (!sg_origin[i]->suspects)
> +				drop_origin_blob(sg_origin[i]);
>  			blame_origin_decref(sg_origin[i]);
>  		}
>  	}
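Junio's question about the delta-base cache can be probed directly by
re-running the same blame with a much larger cache; the file name and the
cache size below are illustrative assumptions, not figures from this thread:

    # Baseline run with the default delta-base cache size.
    $ time git blame MAINTAINERS >/dev/null

    # Same run with a much larger cache; if the eager blob dropping were
    # fully absorbed by the delta-base cache, most of the gap between the
    # patched and unpatched binaries should disappear here.
    $ time git -c core.deltaBaseCacheLimit=512m blame MAINTAINERS >/dev/null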
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: Duy Nguyen @ 2019-04-03 9:32 UTC
To: Junio C Hamano; +Cc: David Kastrup, Git Mailing List

On Wed, Apr 3, 2019 at 2:45 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> David Kastrup <dak@gnu.org> writes:
>
> > When a parent blob already has chunks queued up for blaming, dropping
> > the blob at the end of one blame step will cause it to get reloaded
> > right away, doubling the amount of I/O and unpacking when processing a
> > linear history.
> >
> > Keeping such parent blobs in memory seems like a reasonable optimization
> > that should incur additional memory pressure mostly when processing the
> > merges from old branches.
>
> Thanks for finding an age-old one that dates back to 7c3c7962
> ("blame: drop blob data after passing blame to the parent",
> 2007-12-11).
>
> Interestingly, the said commit claims:
>
>     When passing blame from a parent to its parent (i.e. the
>     grandparent), the blob data for the parent may need to be read
>     again, but it should be relatively cheap, thanks to delta-base
>     cache.
>
> but perhaps you found a case where the delta-base cache is not all
> that effective in the benchmark?

Interesting. For some reason I keep remembering the delta-base cache is
for caching base objects, not all packed objects.

That might explain why I could not see significant gain when blaming
linux.git's MAINTAINERS file (0.5s was shaved out of 13s) even though
the number of objects read was cut by half (8424 vs 15083).

I just tried again. The number of actual pack reads is slightly reduced
with the patch (498260 vs 502140). Not by a large margin. But I imagine
if the cache is under pressure (the MAINTAINERS file is quite small,
426k), we may get more evictions and misses from the delta-base cache.

It might still help when we need to read loose objects, though. But I
guess this matters even less. And I don't know how lazy clones behave in
this case. If we get new objects and store them as loose objects, then
it helps a bit more.

> Will queue.  Thanks.
>
> >
> > Signed-off-by: David Kastrup <dak@gnu.org>
> > ---
> >  blame.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/blame.c b/blame.c
> > index 5c07dec190..c11c516921 100644
> > --- a/blame.c
> > +++ b/blame.c
> > @@ -1562,7 +1562,8 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
> >  	}
> >  	for (i = 0; i < num_sg; i++) {
> >  		if (sg_origin[i]) {
> > -			drop_origin_blob(sg_origin[i]);
> > +			if (!sg_origin[i]->suspects)
> > +				drop_origin_blob(sg_origin[i]);
> >  			blame_origin_decref(sg_origin[i]);
> >  		}
> >  	}
-- 
Duy
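One way to reproduce this kind of object-read count — an assumption about
methodology, since Duy does not say how he instrumented his runs — is git's
pack-access tracing:

    # Log every pack access during a blame run, then count the accesses.
    $ GIT_TRACE_PACK_ACCESS=/tmp/pack-access.log git blame MAINTAINERS >/dev/null
    $ wc -l /tmp/pack-access.log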
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: Jeff King @ 2019-04-03 11:36 UTC
To: Duy Nguyen; +Cc: Junio C Hamano, David Kastrup, Git Mailing List

On Wed, Apr 03, 2019 at 04:32:30PM +0700, Duy Nguyen wrote:

> That might explain why I could not see significant gain when blaming
> linux.git's MAINTAINERS file (0.5s was shaved out of 13s) even though
> the number of objects read was cut by half (8424 vs 15083).

I did a few timings, too, and managed to come up with similar
improvements (only a small fraction, and only for large files).

I think the main thing is simply that loading the blob from the object
database is a fraction of the total work done. We still have to actually
diff the blobs, which is at least as expensive as loading them from
disk. We also have to load commits and trees from disk as we traverse.
Enabling the commit-graph would shrink that portion (and make
improvements in the blob loading proportionally more impressive).

All that said, this seems like an easy and obvious win, and worth doing.
0.5s is still something.

I suspect we could do even better by storing and reusing not just the
original blob between diffs, but the intermediate diff state (i.e., the
hashes produced by xdl_prepare(), which should be usable between
multiple diffs). That's quite a bit more complex, though, and I imagine
would require some surgery to xdiff.

-Peff
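For reference, enabling the commit-graph that Peff mentions is a one-time
step (available in Git 2.19 and later); the file being blamed is again an
illustrative assumption:

    # Build the commit-graph once and allow git to read it.
    $ git config core.commitGraph true
    $ git commit-graph write --reachable

    # Commit and tree traversal is now cheaper, so the blob-loading savings
    # make up a larger share of the remaining run time.
    $ time git blame MAINTAINERS >/dev/null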
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: Duy Nguyen @ 2019-04-03 12:06 UTC
To: Jeff King; +Cc: Junio C Hamano, David Kastrup, Git Mailing List

On Wed, Apr 3, 2019 at 6:36 PM Jeff King <peff@peff.net> wrote:
> I suspect we could do even better by storing and reusing not just the
> original blob between diffs, but the intermediate diff state (i.e., the
> hashes produced by xdl_prepare(), which should be usable between
> multiple diffs). That's quite a bit more complex, though, and I imagine
> would require some surgery to xdiff.

Amazing. xdl_prepare_ctx and xdl_hash_record (called inside
xdl_prepare_ctx) account for 36% according to 'perf report'. Please
tell me you just did not get this on your first guess.

I tracked and dumped all the hashes that are sent to xdl_prepare() and
it looks like the amount of duplicates is quite high. There are only
about 1000 one-time hashes out of 7000 (didn't really draw a histogram
to examine closer). So yeah this looks really promising, assuming
somebody is going to do something about it.
-- 
Duy
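A profiling run of the kind Duy describes could look roughly like this; the
blamed file and the symbol filter are assumptions chosen for illustration:

    # Record a call-graph profile of one blame run (requires perf and a
    # git binary built with debug symbols).
    $ perf record -g -- git blame MAINTAINERS >/dev/null

    # Show how much time lands in the xdiff preparation routines.
    $ perf report --no-children --stdio | grep -E 'xdl_prepare_ctx|xdl_hash_record'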
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: Jeff King @ 2019-04-03 12:19 UTC
To: Duy Nguyen; +Cc: Junio C Hamano, David Kastrup, Git Mailing List

On Wed, Apr 03, 2019 at 07:06:02PM +0700, Duy Nguyen wrote:

> On Wed, Apr 3, 2019 at 6:36 PM Jeff King <peff@peff.net> wrote:
> > I suspect we could do even better by storing and reusing not just the
> > original blob between diffs, but the intermediate diff state (i.e., the
> > hashes produced by xdl_prepare(), which should be usable between
> > multiple diffs). That's quite a bit more complex, though, and I imagine
> > would require some surgery to xdiff.
>
> Amazing. xdl_prepare_ctx and xdl_hash_record (called inside
> xdl_prepare_ctx) account for 36% according to 'perf report'. Please
> tell me you just did not get this on your first guess.

Sorry, it was a guess. ;)

But an educated one, based on previous experiments with speeding up "log
-p". Remember XDL_FAST_HASH, which produced speedups there (but
unfortunately had some pathological slowdowns, because it produced too
many collisions). I've also played around with using other hashes like
murmur or siphash, but was never able to get anything remarkably faster
than what we have now.

> I tracked and dumped all the hashes that are sent to xdl_prepare() and
> it looks like the amount of duplicates is quite high. There are only
> about 1000 one-time hashes out of 7000 (didn't really draw a histogram
> to examine closer). So yeah this looks really promising, assuming
> somebody is going to do something about it.

I don't think counting the unique hash outputs tells you much about what
can be sped up. After all, two related blobs are likely to overlap quite
a bit in their hashes (i.e., all of their non-unique lines). The trick
is finding in each blob those ones that _are_ unique. :)

But if we spend 36% of our time in hashing the blobs, then that implies
that we could gain back 18% by caching and reusing the work from a
previous diff (as David notes, a simple keep-the-last-parent cache only
yields 100% cache hits in a linear history, but it's probably good
enough for our purposes). This should likewise make "git log -p -- file"
faster, though with more files you'd need a bigger cache.

So I do think it's a promising lead. I don't have immediate plans to
work on it, though. Maybe it would be a good GSoC project. ;)

-Peff
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: David Kastrup @ 2019-04-03 12:32 UTC
To: Jeff King; +Cc: Duy Nguyen, Junio C Hamano, Git Mailing List

Jeff King <peff@peff.net> writes:

> On Wed, Apr 03, 2019 at 07:06:02PM +0700, Duy Nguyen wrote:
>
>> On Wed, Apr 3, 2019 at 6:36 PM Jeff King <peff@peff.net> wrote:
>> > I suspect we could do even better by storing and reusing not just the
>> > original blob between diffs, but the intermediate diff state (i.e., the
>> > hashes produced by xdl_prepare(), which should be usable between
>> > multiple diffs). That's quite a bit more complex, though, and I imagine
>> > would require some surgery to xdiff.
>>
>> Amazing. xdl_prepare_ctx and xdl_hash_record (called inside
>> xdl_prepare_ctx) account for 36% according to 'perf report'. Please
>> tell me you just did not get this on your first guess.
>
> Sorry, it was a guess. ;)
>
> But an educated one, based on previous experiments with speeding up "log
> -p". Remember XDL_FAST_HASH, which produced speedups there (but
> unfortunately had some pathological slowdowns, because it produced too
> many collisions). I've also played around with using other hashes like
> murmur or siphash, but was never able to get anything remarkably faster
> than what we have now.
>
>> I tracked and dumped all the hashes that are sent to xdl_prepare() and
>> it looks like the amount of duplicates is quite high. There are only
>> about 1000 one-time hashes out of 7000 (didn't really draw a histogram
>> to examine closer). So yeah this looks really promising, assuming
>> somebody is going to do something about it.
>
> I don't think counting the unique hash outputs tells you much about what
> can be sped up. After all, two related blobs are likely to overlap quite
> a bit in their hashes (i.e., all of their non-unique lines). The trick
> is finding in each blob those ones that _are_ unique. :)
>
> But if we spend 36% of our time in hashing the blobs, then that implies
> that we could gain back 18% by caching and reusing the work from a
> previous diff (as David notes, a simple keep-the-last-parent cache only
> yields 100% cache hits in a linear history, but it's probably good
> enough for our purposes).

Of course, if you really want to get tricky, you'll not even compare
stuff that is expanded from the same delta-chain location.

Basically, there are a number of separate layers that are doing rather
similar work with rather similar data, but intermingling the layers'
work is not going to be good for maintainability. Caching at the
various layers can keep their separation while still reducing some of
the redundancy costs.

-- 
David Kastrup
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: David Kastrup @ 2019-04-03 11:08 UTC
To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> When a parent blob already has chunks queued up for blaming, dropping
>> the blob at the end of one blame step will cause it to get reloaded
>> right away, doubling the amount of I/O and unpacking when processing a
>> linear history.
>>
>> Keeping such parent blobs in memory seems like a reasonable optimization
>> that should incur additional memory pressure mostly when processing the
>> merges from old branches.
>
> Thanks for finding an age-old one that dates back to 7c3c7962
> ("blame: drop blob data after passing blame to the parent",
> 2007-12-11).
>
> Interestingly, the said commit claims:
>
>     When passing blame from a parent to its parent (i.e. the
>     grandparent), the blob data for the parent may need to be read
>     again, but it should be relatively cheap, thanks to delta-base
>     cache.
>
> but perhaps you found a case where the delta-base cache is not all
> that effective in the benchmark?

The most relevant contribution is in a linear history where the diff
between commit and parent is followed by the diff between parent and
grandparent. It seems wasteful to recreate the blobs in this case.

Of course this is also the case where any close cache layers are more
likely to still be warm, so the savings may be less apparent. They are
likely more for deep delta chains in long histories where the
delta-chain cache is more thoroughly exercised.

-- 
David Kastrup
* [PATCH] blame.c: don't drop origin blobs as eagerly
From: David Kastrup @ 2016-05-27 13:35 UTC
To: git; +Cc: David Kastrup

When a parent blob already has chunks queued up for blaming, dropping
the blob at the end of one blame step will cause it to get reloaded
right away, doubling the amount of I/O and unpacking when processing a
linear history.

Keeping such parent blobs in memory seems like a reasonable
optimization. It's conceivable that this may incur additional memory
pressure particularly when the history contains lots of merges from
long-diverged branches. In practice, this optimization appears to
behave quite benignly, and a viable strategy for limiting the total
amount of cached blobs in a useful manner seems rather hard to
implement.

In addition, calling git-blame with -C leads to similar memory
retention patterns.
---
 builtin/blame.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 21f42b0..2596fbc 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -1556,7 +1556,8 @@ finish:
 	}
 	for (i = 0; i < num_sg; i++) {
 		if (sg_origin[i]) {
-			drop_origin_blob(sg_origin[i]);
+			if (!sg_origin[i]->suspects)
+				drop_origin_blob(sg_origin[i]);
 			origin_decref(sg_origin[i]);
 		}
 	}
-- 
2.7.4
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: Johannes Schindelin @ 2016-05-27 15:00 UTC
To: David Kastrup; +Cc: git, gitster

Hi David,

it is good practice to Cc: the original author of the code in question,
in this case Junio. I guess he sees this anyway, but that is really just
an assumption.

On Fri, 27 May 2016, David Kastrup wrote:

> When a parent blob already has chunks queued up for blaming, dropping
> the blob at the end of one blame step will cause it to get reloaded
> right away, doubling the amount of I/O and unpacking when processing a
> linear history.

It is obvious from your commit message that you have studied the code
quite deeply. To make it easier for the reader (which might be your
future self), it is advisable to give at least a little bit of
introduction, e.g. what the "parent blob" is. I would *guess* that it is
the blob corresponding to the same path in the parent of the current
revision, but that should be spelled out explicitly.

> Keeping such parent blobs in memory seems like a reasonable
> optimization. It's conceivable that this may incur additional memory

This sentence would be easier to read if "It's conceivable that" was
simply deleted.

> pressure particularly when the history contains lots of merges from
> long-diverged branches. In practice, this optimization appears to
> behave quite benignly,

Why not just stop here? I say that because...

> and a viable strategy for limiting the total amount of cached blobs in a
> useful manner seems rather hard to implement.

... this sounds awfully handwaving. Since we already have reference
counting, it sounds fishy to claim that simply piggybacking a global
counter on top of it would be "rather hard".

> In addition, calling git-blame with -C leads to similar memory retention
> patterns.

This is a red herring. Just delete it. I, for one, being a heavy user of
`git blame`, could count the number of times I used blame's -C option
without any remaining hands. Zero times.

Besides, -C is *supposed* to look harder. By that argument, you could
read all blobs in rev-list even when the user did not specify --objects
"because --objects leads to similar memory retention patterns".

So: let's just forget about that statement.

The commit message is missing your sign-off.

Also: is there an easy way to reproduce your claims of better I/O
characteristics? Something like a command-line, ideally with a file in
git.git's own history, that demonstrates the I/O before and after the
patch, would be an excellent addition to the commit message.

Further: I would have at least expected some rudimentary discussion why
this patch -- which seems to at least partially contradict 7c3c796
(blame: drop blob data after passing blame to the parent, 2007-12-11) --
is not regressing on the intent of said commit.

> diff --git a/builtin/blame.c b/builtin/blame.c
> index 21f42b0..2596fbc 100644
> --- a/builtin/blame.c
> +++ b/builtin/blame.c
> @@ -1556,7 +1556,8 @@ finish:
>  	}
>  	for (i = 0; i < num_sg; i++) {
>  		if (sg_origin[i]) {
> -			drop_origin_blob(sg_origin[i]);
> +			if (!sg_origin[i]->suspects)
> +				drop_origin_blob(sg_origin[i]);
>  			origin_decref(sg_origin[i]);
>  		}

It would be good to mention in the commit message that this patch does
not change anything for blobs with only one remaining reference (the
current one) because origin_decref() would do the same job as
drop_origin_blob when decrementing the reference counter to 0.

In fact, I suspect that simply removing the drop_origin_blob() call
might result in the exact same I/O pattern.

Ciao,
Johannes
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: David Kastrup @ 2016-05-27 15:41 UTC
To: Johannes Schindelin; +Cc: git, gitster

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Fri, 27 May 2016, David Kastrup wrote:
>
>> pressure particularly when the history contains lots of merges from
>> long-diverged branches. In practice, this optimization appears to
>> behave quite benignly,
>
> Why not just stop here?

Because there is a caveat.

> I say that because...
>
>> and a viable strategy for limiting the total amount of cached blobs in a
>> useful manner seems rather hard to implement.
>
> ... this sounds awfully handwaving.

Because it is.

> Since we already have reference counting, it sounds fishy to claim
> that simply piggybacking a global counter on top of it would be
> "rather hard".

You'll see that the patch is from 2014. When I actively worked on it, I
found no convincing/feasible way to enforce a reasonable hard limit. I
am not picking up work on this again but am merely flushing my queue so
that the patch going to waste is not on my conscience.

>> In addition, calling git-blame with -C leads to similar memory retention
>> patterns.
>
> This is a red herring. Just delete it. I, for one, being a heavy user of
> `git blame`, could count the number of times I used blame's -C option
> without any remaining hands. Zero times.
>
> Besides, -C is *supposed* to look harder.

We are not talking about "looking harder" but "taking more memory than
the set limit".

> Also: is there an easy way to reproduce your claims of better I/O
> characteristics? Something like a command-line, ideally with a file in
> git.git's own history, that demonstrates the I/O before and after the
> patch, would be an excellent addition to the commit message.

I've used it on the wortliste repository and system time goes down from
about 70 seconds to 50 seconds (this is a flash drive). User time from
about 4:20 to 4:00. It is a rather degenerate repository (predominantly
small changes in one humongous text file) so savings for more typical
cases might end up less than that. But then it is degenerate
repositories that are most costly to blame.

> Further: I would have at least expected some rudimentary discussion
> why this patch -- which seems to at least partially contradict 7c3c796
> (blame: drop blob data after passing blame to the parent, 2007-12-11)
> -- is not regressing on the intent of said commit.

It is regressing on the intent of said commit by _retaining_ blob data
that it is _sure_ to look at again because of already having this data
asked for again in the priority queue. That's the point. It only drops
the blob data for which it has no request queued yet. But there is no
handle on when the request is going to pop up until it actually leaves
the priority queue: the priority queue is a heap IIRC and thus only
provides partial orderings.

>> diff --git a/builtin/blame.c b/builtin/blame.c
>> index 21f42b0..2596fbc 100644
>> --- a/builtin/blame.c
>> +++ b/builtin/blame.c
>> @@ -1556,7 +1556,8 @@ finish:
>>  	}
>>  	for (i = 0; i < num_sg; i++) {
>>  		if (sg_origin[i]) {
>> -			drop_origin_blob(sg_origin[i]);
>> +			if (!sg_origin[i]->suspects)
>> +				drop_origin_blob(sg_origin[i]);
>>  			origin_decref(sg_origin[i]);
>>  		}
>
> It would be good to mention in the commit message that this patch does not
> change anything for blobs with only one remaining reference (the current
> one) because origin_decref() would do the same job as drop_origin_blob
> when decrementing the reference counter to 0.

A sizable number of references but not blobs are retained and needed
for producing the information in the final printed output (at least
when using the default sequential output instead of --incremental or
--porcelain or similar).

> In fact, I suspect that simply removing the drop_origin_blob() call
> might result in the exact same I/O pattern.

It's been years since I actually worked on the code but I'm still
pretty sure you are wrong.

-- 
David Kastrup
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: Johannes Schindelin @ 2016-05-28 6:37 UTC
To: David Kastrup; +Cc: git, gitster

Hi David,

On Fri, 27 May 2016, David Kastrup wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > On Fri, 27 May 2016, David Kastrup wrote:
> >
> >> pressure particularly when the history contains lots of merges from
> >> long-diverged branches. In practice, this optimization appears to
> >> behave quite benignly,
> >
> > Why not just stop here?
>
> Because there is a caveat.
>
> > I say that because...
> >
> >> and a viable strategy for limiting the total amount of cached blobs in a
> >> useful manner seems rather hard to implement.
> >
> > ... this sounds awfully handwaving.
>
> Because it is.

Hrm. Are you really happy with this, on public record?

> > Since we already have reference counting, it sounds fishy to claim
> > that simply piggybacking a global counter on top of it would be
> > "rather hard".
>
> You'll see that the patch is from 2014.

No I do not. There was no indication of that.

> When I actively worked on it, I found no convincing/feasible way to
> enforce a reasonable hard limit. I am not picking up work on this again
> but am merely flushing my queue so that the patch going to waste is not
> on my conscience.

Hmm. Speaking from my point of view as a maintainer, this raises a
yellow alert. Sure, I do not maintain git.git, but if I were, I would
want my confidence in the quality of this patch, and in the patch
author being around when things go south with it, strengthened. This
paragraph achieves the opposite.

> >> In addition, calling git-blame with -C leads to similar memory retention
> >> patterns.
> >
> > This is a red herring. Just delete it. I, for one, being a heavy user of
> > `git blame`, could count the number of times I used blame's -C option
> > without any remaining hands. Zero times.
> >
> > Besides, -C is *supposed* to look harder.
>
> We are not talking about "looking harder" but "taking more memory than
> the set limit".

I meant to say: *of course* it uses more memory, it has a much harder
job.

> > Also: is there an easy way to reproduce your claims of better I/O
> > characteristics? Something like a command-line, ideally with a file in
> > git.git's own history, that demonstrates the I/O before and after the
> > patch, would be an excellent addition to the commit message.
>
> I've used it on the wortliste repository and system time goes down from
> about 70 seconds to 50 seconds (this is a flash drive). User time from
> about 4:20 to 4:00. It is a rather degenerate repository (predominantly
> small changes in one humongous text file) so savings for more typical
> cases might end up less than that. But then it is degenerate
> repositories that are most costly to blame.

Well, obviously I did not mean to measure time, but I/O. Something like
an strace call that pipes into some grep that in turn pipes into wc.

> > Further: I would have at least expected some rudimentary discussion
> > why this patch -- which seems to at least partially contradict 7c3c796
> > (blame: drop blob data after passing blame to the parent, 2007-12-11)
> > -- is not regressing on the intent of said commit.
>
> It is regressing on the intent of said commit by _retaining_ blob data
> that it is _sure_ to look at again because of already having this data
> asked for again in the priority queue. That's the point. It only drops
> the blob data for which it has no request queued yet. But there is no
> handle on when the request is going to pop up until it actually leaves
> the priority queue: the priority queue is a heap IIRC and thus only
> provides partial orderings.

Again, this lowers my confidence in the patch. Mind you, the patch could
be totally sound! Yet its commit message, and unfortunately even more so
the discussion we are having right now, offer no adequate reason to
assume that. If you, as the patch author, state that you are not sure
what this thing does exactly, we must conservatively assume that it
contains flaws.

> >> diff --git a/builtin/blame.c b/builtin/blame.c
> >> index 21f42b0..2596fbc 100644
> >> --- a/builtin/blame.c
> >> +++ b/builtin/blame.c
> >> @@ -1556,7 +1556,8 @@ finish:
> >>  	}
> >>  	for (i = 0; i < num_sg; i++) {
> >>  		if (sg_origin[i]) {
> >> -			drop_origin_blob(sg_origin[i]);
> >> +			if (!sg_origin[i]->suspects)
> >> +				drop_origin_blob(sg_origin[i]);
> >>  			origin_decref(sg_origin[i]);
> >>  		}
> >
> > It would be good to mention in the commit message that this patch does not
> > change anything for blobs with only one remaining reference (the current
> > one) because origin_decref() would do the same job as drop_origin_blob
> > when decrementing the reference counter to 0.
>
> A sizable number of references but not blobs are retained and needed for
> producing the information in the final printed output (at least when
> using the default sequential output instead of --incremental or
> --porcelain or similar).

Sorry, please help me understand how that response relates to my
suggestion to improve the commit message.

> > In fact, I suspect that simply removing the drop_origin_blob() call
> > might result in the exact same I/O pattern.
>
> It's been years since I actually worked on the code but I'm still pretty
> sure you are wrong.

The short version of your answer is that you will leave this patch in
its current form and address none of my concerns because you moved on,
correct? If so, that's totally okay, it just needs to be spelled out.

Ciao,
Johannes
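A concrete form of the strace pipeline Johannes describes might look like
the following; the traced syscalls and the blamed file are assumptions
chosen for illustration, not something specified in the thread:

    # Count pack-file reads for one blame run; repeat with a git binary
    # built before and after the patch and compare the two counts.
    $ strace -f -y -e trace=openat,pread64 -o /tmp/blame.trace \
          ./git blame MAINTAINERS >/dev/null
    $ grep 'objects/pack/' /tmp/blame.trace | wc -l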
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: David Kastrup @ 2016-05-28 8:29 UTC
To: Johannes Schindelin; +Cc: git, gitster

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Fri, 27 May 2016, David Kastrup wrote:
>
>> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>>
>> > On Fri, 27 May 2016, David Kastrup wrote:
>> >
>> >> pressure particularly when the history contains lots of merges from
>> >> long-diverged branches. In practice, this optimization appears to
>> >> behave quite benignly,
>> >
>> > Why not just stop here?
>>
>> Because there is a caveat.
>>
>> > I say that because...
>> >
>> >> and a viable strategy for limiting the total amount of cached blobs in a
>> >> useful manner seems rather hard to implement.
>> >
>> > ... this sounds awfully handwaving.
>>
>> Because it is.
>
> Hrm. Are you really happy with this, on public record?

Shrug. git-blame has limits for its memory usage which it generally
heeds, throwing away potentially useful data to do so. git-blame -C has
a store of data retained outside of those limits because nobody bothered
reining them in, and this patch similarly retains data outside of those
limits, though strictly limited to the pending priority queue size which
controls the descent in reverse commit time order.

In actual measurements on actual heavy-duty repositories I could not
discern a difference in memory usage after the patch. The difference is
likely only noticeable when you have lots of old long branches which is
not a typical development model.

The current data structures don't make harder constraints on memory
pressure feasible, and enforcing such constraints would incur a lot of
swapping (not by the operating system which does it more efficiently
particularly since it does not need to decompress every time, but by the
application constantly rereading and recompressing data). git-blame -C
cares jack-shit about that (and it's used by git-gui's second pass if I
remember correctly) and nobody raised concerns about its memory usage
ever. I've taken out a lot of hand-waving of the "eh, good enough" kind
from git-blame.

If one wanted to have the memory limits enforced more stringently than
they currently are, one would need a whole new layer of FIFO and
accounting. This is not done in the current code base for existing
functionality. And nobody bothered implementing it in decades or
complaining about it in spite of it affecting git-blame -C (and
consequently git-gui blame).

Yes, this patch is another hand waving beside an already waving crowd.
I am not going to lie about it but I am also not going to invest the
time to fix yet another part of git-blame that has in decades not shown
to be a problem anybody considered worth reporting let alone fixing.

The limits for git-blame are set quite conservatively: on actual
developer systems, exceeding them by a factor of 10 will not exhaust the
available memory. And I could not even measure a difference in memory
usage in a small sample set of large examples. That's good enough for me
(and better than the shape most of git-blame was in before I started on
it and also after I finished), but it's still hand-waving. And it's not
like it doesn't have a lot of company.

>> > Since we already have reference counting, it sounds fishy to claim
>> > that simply piggybacking a global counter on top of it would be
>> > "rather hard".
>>
>> You'll see that the patch is from 2014.
>
> No I do not. There was no indication of that.

I thought that git-send-email/git-am retained most patch metadata as
opposed to git-patch, but you are entirely correct that the author date
is nowhere to be found. Sorry for the presumption.

>> When I actively worked on it, I found no convincing/feasible way to
>> enforce a reasonable hard limit. I am not picking up work on this
>> again but am merely flushing my queue so that the patch going to
>> waste is not on my conscience.
>
> Hmm. Speaking from my point of view as a maintainer, this raises a
> yellow alert. Sure, I do not maintain git.git, but if I were, I would
> want my confidence in the quality of this patch, and in the patch
> author being around when things go south with it, strengthened. This
> paragraph achieves the opposite.

It's completely reasonable to apply the same standards here as with an
author having terminal cancer. I am not going to be around when things
go South with git-blame but I should be very much surprised if this
patch were significantly involved.

>> > Besides, -C is *supposed* to look harder.
>>
>> We are not talking about "looking harder" but "taking more memory
>> than the set limit".
>
> I meant to say: *of course* it uses more memory, it has a much harder
> job.

Still a non-sequitur. If you apply memory limits harder, it will still
achieve the same job but finish later due to (decompressing) swapping.

>> > Also: is there an easy way to reproduce your claims of better I/O
>> > characteristics? Something like a command-line, ideally with a file
>> > in git.git's own history, that demonstrates the I/O before and
>> > after the patch, would be an excellent addition to the commit
>> > message.
>>
>> I've used it on the wortliste repository and system time goes down
>> from about 70 seconds to 50 seconds (this is a flash drive). User
>> time from about 4:20 to 4:00. It is a rather degenerate repository
>> (predominantly small changes in one humongous text file) so savings
>> for more typical cases might end up less than that. But then it is
>> degenerate repositories that are most costly to blame.
>
> Well, obviously I did not mean to measure time, but I/O.

System time on an SSD is highly correlated with I/O for this task.
Actually, so is system time on a hard drive, it just takes more real
time (waiting for I/O to complete) then as well.

>> > Further: I would have at least expected some rudimentary discussion
>> > why this patch -- which seems to at least partially contradict
>> > 7c3c796 (blame: drop blob data after passing blame to the parent,
>> > 2007-12-11) -- is not regressing on the intent of said commit.
>>
>> It is regressing on the intent of said commit by _retaining_ blob
>> data that it is _sure_ to look at again because of already having
>> this data asked for again in the priority queue. That's the point.
>> It only drops the blob data for which it has no request queued yet.
>> But there is no handle on when the request is going to pop up until
>> it actually leaves the priority queue: the priority queue is a heap
>> IIRC and thus only provides partial orderings.
>
> Again, this lowers my confidence in the patch. Mind you, the patch
> could be totally sound! Yet its commit message, and unfortunately even
> more so the discussion we are having right now, offer no adequate
> reason to assume that. If you, as the patch author, state that you are
> not sure what this thing does exactly,

Uh, I very definitely know what this thing does exactly. I never stated
otherwise.

> we must conservatively assume that it contains flaws.

It has consequences without a theoretical hard limit on memory usage but
where every additional retained blob will provably be needed.

Basically, you don't want to apply this patch because I know what I'm
doing and documenting it and the underlying rationale without
sugar-coating. That's perfectly fine with me. It matches my expectations
and experience.

>> >> diff --git a/builtin/blame.c b/builtin/blame.c
>> >> index 21f42b0..2596fbc 100644
>> >> --- a/builtin/blame.c
>> >> +++ b/builtin/blame.c
>> >> @@ -1556,7 +1556,8 @@ finish:
>> >>  	}
>> >>  	for (i = 0; i < num_sg; i++) {
>> >>  		if (sg_origin[i]) {
>> >> -			drop_origin_blob(sg_origin[i]);
>> >> +			if (!sg_origin[i]->suspects)
>> >> +				drop_origin_blob(sg_origin[i]);
>> >>  			origin_decref(sg_origin[i]);
>> >>  		}
>> >
>> > It would be good to mention in the commit message that this patch does not
>> > change anything for blobs with only one remaining reference (the current
>> > one) because origin_decref() would do the same job as drop_origin_blob
>> > when decrementing the reference counter to 0.
>>
>> A sizable number of references but not blobs are retained and needed for
>> producing the information in the final printed output (at least when
>> using the default sequential output instead of --incremental or
>> --porcelain or similar).
>
> Sorry, please help me understand how that response relates to my
> suggestion to improve the commit message.

Blobs and origins don't have separate reference counts. This patch has
the same effect as maintaining such separate counts explicitly. Since
the majority of commits lead to some lines in the current source (apart
from commits being weeded out by later refactoring/reimplementation),
this means that the majority of origins never reach a reference count of
0 before the end of the run.

>> > In fact, I suspect that simply removing the drop_origin_blob() call
>> > might result in the exact same I/O pattern.
>>
>> It's been years since I actually worked on the code but I'm still
>> pretty sure you are wrong.
>
> The short version of your answer is that you will leave this patch in
> its current form and address none of my concerns because you moved on,
> correct? If so, that's totally okay, it just needs to be spelled out.

Yes, that's it. You'll notice that the code change itself is both
minuscule as well as purely functional, so it contains nothing
copyrightable. I'm perfectly fine with anybody else writing whatever
commit message he wants based on whatever additional testing he wants
and claiming whole credit for what indeed amounts to being the actually
hard work, given the Git maintainership which is dependent on judging
quality by how a patch author advertises his code when the author is not
a frequent contributor.

I've not authored more than a dozen or two patches spread out over more
of a decade. If you do git-blame on them, I doubt you'll find any that
ever caused problems or that called for refactoring/replacement.
Nevertheless, I clearly am not a regular and different standards apply.
I'm fine with that. It's off my lawn, my responsibility, and my
conscience.

-- 
David Kastrup
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: Johannes Schindelin @ 2016-05-28 12:34 UTC
To: David Kastrup; +Cc: git, gitster

Hi David,

On Sat, 28 May 2016, David Kastrup wrote:

> > The short version of your answer is that you will leave this patch in
> > its current form and address none of my concerns because you moved on,
> > correct? If so, that's totally okay, it just needs to be spelled out.
>
> Yes, that's it. You'll notice that the code change itself is both
> minuscule as well as purely functional, so it contains nothing
> copyrightable.

That is unfortunately an interpretation of the law that would need to
come from a professional lawyer. As it is, the patch was clearly
authored by you, and anybody else who would claim authorship would most
likely be breaking the law.

So I won't touch it.

Sorry,
Johannes
* Re: [PATCH] blame.c: don't drop origin blobs as eagerly
From: David Kastrup @ 2016-05-28 14:00 UTC
To: Johannes Schindelin; +Cc: git, gitster

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi David,
>
> On Sat, 28 May 2016, David Kastrup wrote:
>
>> > The short version of your answer is that you will leave this patch in
>> > its current form and address none of my concerns because you moved on,
>> > correct? If so, that's totally okay, it just needs to be spelled out.
>>
>> Yes, that's it. You'll notice that the code change itself is both
>> minuscule as well as purely functional, so it contains nothing
>> copyrightable.
>
> That is unfortunately an interpretation of the law that would need to
> come from a professional lawyer.

A professional lawyer would laugh at "Signed-off-by:" being of more
legal significance than a written record of intent but of course you
know that. This is mere bluster.

> As it is, the patch was clearly authored by you, and anybody else who
> would claim authorship would most likely be breaking the law.

The _diff_ is not "clearly authored" by me but just the simplest
expression of the intent. The commit message is clearly authored by me
but is not acceptable anyway. Whoever gets to write an acceptable commit
message is up for all copyrightable credit in my book. Feel free to keep
the authorship if you really want to, but when replacing the commit
message it is not a particularly accurate attribution.

> So I won't touch it.

Signed-off-by: David Kastrup <dak@gnu.org>

You don't get more than that for other patches either and it's a few
bytes compared to a mail conversation. Here is a PGP signature on top.
As I said: I am not going to put more work into it anyway and if it's an
occasion for theatrics, it has at least accomplished something.

-- 
David Kastrup

[Attachment: signature.asc (application/pgp-signature)]