From: Jeff King <peff@peff.net>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Daniel Berlin <dberlin@dberlin.org>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: git annotate runs out of memory
Date: Wed, 12 Dec 2007 02:57:25 -0500
Message-ID: <20071212075725.GA7676@coredump.intra.peff.net>
In-Reply-To: <alpine.LFD.0.9999.0712111146200.25032@woody.linux-foundation.org>

On Tue, Dec 11, 2007 at 11:50:08AM -0800, Linus Torvalds wrote:

> And, btw: the diff is totally different from the xdelta we have, so even 
> if we have an already prepared nice xdelta between the two versions, we'll 
> end up re-generating the files in full, and then do a diff on the end 
> result.
> 
> Of course, part of that is that git logically *never* works with deltas, 
> except in the actual code-paths that generate objects (or generate packs, 
> of course). So even if we had used a delta algorithm that would be 
> amenable to be turned into a diff directly, it would have been a layering 
> violation to actually do that.

That doesn't mean we can't opportunistically jump layers when available,
and fall back on the regular behavior otherwise. The nice thing about
clean and simple layers is that you can always add optimizations later
by poking sane holes.

Let's assume for the sake of argument that we can convert an xdelta into
a diff fairly cheaply.  Using the patch below, we can count the places
where we are diffing two blobs and one blob is the delta base of the
other. The check is made in both directions, which assumes our magical
conversion function can also reverse diffs. ;)

For a "git log -p" on git.git, I get:

   9951 diffs could be optimized
  10958 diffs could not be optimized

or about 48%. It would be nice if we could drop the cost by almost 50%
(if our magical function is free to call, too!).
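
For anyone who wants to reproduce the numbers, the counting can be
sketched roughly as below. This is only an illustration, not part of
the patch: the struct and function names are made up, and it assumes
the stderr of "git log -p" contains the "could optimize: yes/no" lines
that the debugging fprintf in the patch prints.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type: running totals of the two kinds of lines. */
struct tally {
	unsigned long yes;
	unsigned long no;
};

/* Count one line if it is a "could optimize: yes/no" line; ignore the rest. */
void tally_line(struct tally *t, const char *line)
{
	static const char prefix[] = "could optimize: ";
	size_t n = sizeof(prefix) - 1;

	if (strncmp(line, prefix, n))
		return;	/* not one of the debugging lines */
	if (!strncmp(line + n, "yes", 3))
		t->yes++;
	else if (!strncmp(line + n, "no", 2))
		t->no++;
}

/* Feed every line of a stream (e.g. the captured stderr) to tally_line(). */
void tally_stream(struct tally *t, FILE *in)
{
	char buf[256];

	while (fgets(buf, sizeof(buf), in))
		tally_line(t, buf);
}

/* Share of diffs that had a usable delta, in percent; 0 if nothing counted. */
double tally_percent(const struct tally *t)
{
	unsigned long total = t->yes + t->no;

	return total ? 100.0 * t->yes / total : 0.0;
}
```

With the counts above, tally_percent() works out to 9951 / (9951 + 10958),
i.e. a bit under 48%.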

Of course, I haven't even looked at whether converting xdeltas to
unified diffs is possible. I suspect in some cases it is (e.g., pure
addition of text) and in some cases it isn't (I assume xdelta doesn't
have any context lines, which might hurt). And it's possible that a
specialized diff user like git-blame can just learn to use the xdeltas
by itself (I didn't get a "could optimize" count for git-blame since
it seems to follow a different codepath for its diffs).
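
To make the "pure addition" case a bit more concrete, here is a rough
sketch of how one might recognize it. It is not from the patch and the
function names are hypothetical. It assumes the delta bytes use the
copy/insert encoding found in git packfiles: two variable-length sizes
(base and result), then instructions that either copy a byte range from
the base or insert literal bytes. The check is conservative and works
on bytes, not lines, so it says nothing about context lines.

```c
#include <stddef.h>

/* Decode one size header: 7 bits per byte, least significant group first. */
static int read_size(const unsigned char **pp, const unsigned char *end,
		     unsigned long *out)
{
	const unsigned char *p = *pp;
	unsigned long v = 0;
	unsigned shift = 0;
	unsigned char c;

	do {
		if (p == end || shift >= sizeof(v) * 8)
			return -1;
		c = *p++;
		v |= (unsigned long)(c & 0x7f) << shift;
		shift += 7;
	} while (c & 0x80);
	*pp = p;
	*out = v;
	return 0;
}

/*
 * Return 1 if the delta copies the whole base, in order and without
 * gaps, and only inserts new bytes around those copies; 0 if it drops,
 * repeats or reorders base data; -1 if the delta looks malformed.
 */
int delta_is_pure_addition(const unsigned char *delta, size_t len)
{
	const unsigned char *p = delta, *end = delta + len;
	unsigned long src_size, dst_size, src_pos = 0;

	if (read_size(&p, end, &src_size) || read_size(&p, end, &dst_size))
		return -1;

	while (p < end) {
		unsigned char cmd = *p++;

		if (cmd & 0x80) {	/* copy a range from the base */
			unsigned long off = 0, sz = 0;
			int i;

			for (i = 0; i < 4; i++)	/* optional offset bytes */
				if (cmd & (0x01 << i)) {
					if (p == end)
						return -1;
					off |= (unsigned long)*p++ << (8 * i);
				}
			for (i = 0; i < 3; i++)	/* optional size bytes */
				if (cmd & (0x10 << i)) {
					if (p == end)
						return -1;
					sz |= (unsigned long)*p++ << (8 * i);
				}
			if (!sz)
				sz = 0x10000;	/* zero encodes 64k */
			if (off > src_size || sz > src_size - off)
				return -1;
			if (off != src_pos)
				return 0;	/* skipped or reused base data */
			src_pos += sz;
		} else if (cmd) {	/* insert "cmd" literal bytes */
			if ((size_t)(end - p) < cmd)
				return -1;
			p += cmd;
		} else {
			return -1;	/* opcode 0 is reserved */
		}
	}
	return src_pos == src_size;
}
```

For example, a delta turning "abc" into "abXc" (copy 2 bytes, insert
"X", copy 1 byte) is a pure addition, while one turning "abc" into "ac"
skips a base byte and is not. Whether the inserted bytes fall on line
boundaries would still have to be checked before emitting a diff hunk.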

---
diff --git a/cache.h b/cache.h
index 27d90fe..0d672be 100644
--- a/cache.h
+++ b/cache.h
@@ -569,6 +569,7 @@ extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsign
 extern unsigned long unpack_object_header_gently(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern const char *packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int have_xdelta(unsigned char from[20], unsigned char to[20]);
 extern int matches_pack_name(struct packed_git *p, const char *name);
 
 /* Dumb servers support */
diff --git a/diff.c b/diff.c
index f780e3e..5402900 100644
--- a/diff.c
+++ b/diff.c
@@ -1299,6 +1299,10 @@ static void builtin_diff(const char *name_a,
 		}
 	}
 
+	fprintf(stderr, "could optimize: %s\n",
+			(have_xdelta(one->sha1, two->sha1) ||
+			have_xdelta(two->sha1, one->sha1)) ? "yes" : "no");
+
 	if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
 		die("unable to read files to diff");
 
diff --git a/sha1_file.c b/sha1_file.c
index b0c2435..f811ddc 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2422,3 +2422,20 @@ int read_pack_header(int fd, struct pack_header *header)
 		return PH_ERROR_PROTOCOL;
 	return 0;
 }
+
+int have_xdelta(unsigned char from[20], unsigned char to[20])
+{
+	struct pack_entry e;
+	unsigned char base_sha1[20];
+	const char *type;
+	unsigned long size;
+	unsigned long store_size;
+	unsigned int delta_chain_length;
+
+	if (!find_pack_entry(to, &e, NULL))
+		return 0;
+
+	type = packed_object_info_detail(e.p, e.offset, &size, &store_size,
+					 &delta_chain_length, base_sha1);
+	return !hashcmp(base_sha1, from);
+}
