From: Jan Hudec <bulb@ucw.cz>
To: Jeff King <peff@peff.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Daniel Berlin <dberlin@dberlin.org>,
Git Mailing List <git@vger.kernel.org>
Subject: Re: git annotate runs out of memory
Date: Tue, 18 Dec 2007 00:24:50 +0100 [thread overview]
Message-ID: <20071217232450.GA13012@efreet.light.src> (raw)
In-Reply-To: <20071212075725.GA7676@coredump.intra.peff.net>
On Wed, Dec 12, 2007 at 02:57:25 -0500, Jeff King wrote:
> On Tue, Dec 11, 2007 at 11:50:08AM -0800, Linus Torvalds wrote:
> > And, btw: the diff is totally different from the xdelta we have, so even
> > if we have an already prepared nice xdelta between the two versions, we'll
> > end up re-generating the files in full, and then do a diff on the end
> > result.
The problem is whether git does not end-up re-generating the same file
multiple times. When it needs to construct the diff between two versions of
a file and one is delta-base (even indirect) of the other, does it know to
create the first, remember it, continue to the other and calculate the diff?
> > Of course, part of that is that git logically *never* works with deltas,
> > except in the actual code-paths that generate objects (or generate packs,
> > of course). So even if we had used a delta algorithm that would be
> > amenable to be turned into a diff directly, it would have been a layering
> > violation to actually do that.
>
> That doesn't mean we can't opportunistically jump layers when available,
> and fall back on the regular behavior otherwise. The nice thing about
> clean and simple layers is that you can always add optimizations later
> by poking sane holes.
>
> Let's assume for the sake of argument that we can convert an xdelta into
> a diff fairly cheaply. Using the patch below, we can count the places
> where we are diffing two blobs, and one blob is a delta base of the
> other (assuming our magical conversion function can also reverse diffs.
> ;) ).
>
> For a "git log -p" on git.git, I get:
>
> 9951 diffs could be optimized
> 10958 diffs could not be optimized
>
> or about 48%. It would be nice if we could drop the cost by almost 50%
> (if our magical function is free to call, too!).
This is actually a gross underestimation. The idea would be to know all the
diffs we need to calculate and than remember all useful results. Ie. if we
know we'll want objects A and C, A's delta base is B and B's delta base is C,
start calculating A and when it turns out to need C at some point, just
remember it for purpose of doing the final diff. On the other hand B can be
thrown away early (because we don't need it) to save memory.
Now git can know the list of deltas it will need in advance. First generate
the list of revisions -- nothing helps there, but their delta bases are
likely to be randomish anyway -- and than with the knowledge of full list of
trees, start doing the diffs to see which touched the subtree in question.
Repeat for each level.
Since the list of deltas that will be needed is known, the objects from
which all deltas were already generated can be expired from cache (but not
thrown away immediately, as they may help building other objects).
> Of course, I haven't even looked at whether converting xdeltas to
> unified diffs is possible. I suspect in some cases it is (e.g., pure
> addition of text) and in some cases it isn't (I assume xdelta doesn't
> have any context lines, which might hurt). And it's possible that a
> specialized diff user like git-blame can just learn to use the xdeltas
> by itself (I didn't get a "could optimize" count for git-blame since
> it seems to follow a different codepath for its diffs).
Well, it's about as hard as applying them, because you can remember the
necessary stuff when applying. The imporant bit would be to avoid applying
the same delta more than once during the whole annotate.
--
Jan 'Bulb' Hudec <bulb@ucw.cz>
next prev parent reply other threads:[~2007-12-17 23:25 UTC|newest]
Thread overview: 51+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-12-11 17:33 git annotate runs out of memory Daniel Berlin
2007-12-11 17:47 ` Nicolas Pitre
2007-12-11 17:53 ` Daniel Berlin
2007-12-11 18:01 ` Nicolas Pitre
2007-12-11 18:32 ` Marco Costalba
2007-12-11 19:03 ` Daniel Berlin
2007-12-11 19:14 ` Marco Costalba
2007-12-11 19:27 ` Jason Sewall
2007-12-11 19:46 ` Daniel Barkalow
2007-12-11 20:14 ` Marco Costalba
2007-12-11 18:40 ` Linus Torvalds
2007-12-11 19:01 ` Matthieu Moy
2007-12-11 19:22 ` Linus Torvalds
2007-12-11 19:24 ` Daniel Berlin
2007-12-11 19:42 ` Pierre Habouzit
2007-12-11 21:09 ` Daniel Berlin
2007-12-11 23:37 ` Matthieu Moy
2007-12-11 23:48 ` Linus Torvalds
2007-12-11 19:06 ` Nicolas Pitre
2007-12-11 20:31 ` Jon Smirl
2007-12-11 19:09 ` Daniel Berlin
2007-12-11 19:26 ` Daniel Barkalow
2007-12-11 19:34 ` Pierre Habouzit
2007-12-11 19:59 ` Junio C Hamano
2007-12-11 19:42 ` Linus Torvalds
2007-12-11 19:50 ` Linus Torvalds
2007-12-11 21:14 ` Daniel Berlin
2007-12-11 21:34 ` Linus Torvalds
2007-12-12 7:57 ` Jeff King
2007-12-17 23:24 ` Jan Hudec [this message]
2007-12-18 0:05 ` Linus Torvalds
2007-12-11 21:14 ` Linus Torvalds
2007-12-11 21:54 ` Junio C Hamano
2007-12-11 23:36 ` Linus Torvalds
2007-12-12 0:02 ` Linus Torvalds
2007-12-12 0:22 ` Davide Libenzi
2007-12-12 0:50 ` Linus Torvalds
2007-12-12 1:12 ` Davide Libenzi
2007-12-12 2:10 ` Linus Torvalds
2007-12-12 3:35 ` Linus Torvalds
2007-12-12 0:56 ` Junio C Hamano
2007-12-12 2:20 ` Linus Torvalds
2007-12-12 2:39 ` Linus Torvalds
2007-12-12 19:43 ` Daniel Berlin
2007-12-12 4:48 ` Junio C Hamano
2007-12-11 21:24 ` Daniel Berlin
2007-12-12 3:57 ` Shawn O. Pearce
2007-12-11 20:29 ` Marco Costalba
2007-12-11 19:29 ` Steven Grimm
2007-12-11 20:14 ` Jakub Narebski
2007-12-12 10:36 ` Florian Weimer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20071217232450.GA13012@efreet.light.src \
--to=bulb@ucw.cz \
--cc=dberlin@dberlin.org \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).