From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jay Soffian <jaysoffian@gmail.com>,
Michael J Gruber <git@drmicha.warpmail.net>,
git <git@vger.kernel.org>
Subject: Re: [PATCH 3/5] combine-diff: handle binary files as binary
Date: Mon, 30 May 2011 10:36:27 -0400 [thread overview]
Message-ID: <20110530143627.GC31490@sigill.intra.peff.net> (raw)
In-Reply-To: <7vpqn0ofy5.fsf@alter.siamese.dyndns.org>
On Sun, May 29, 2011 at 11:33:38PM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > @@ -852,6 +862,29 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
> > }
> > }
> >
> > + if (userdiff->binary != -1)
> > + is_binary = userdiff->binary;
> > + else {
> > + is_binary = buffer_is_binary(result, result_size);
> > + for (i = 0; !is_binary && i < num_parent; i++) {
> > + char *buf;
> > + unsigned long size;
> > + buf = grab_blob(elem->parent[i].sha1,
> > + elem->parent[i].mode,
> > + &size);
> > + if (buffer_is_binary(buf, size))
> > + is_binary = 1;
> > + free(buf);
> > + }
> > + }
>
> Two comments.
>
> - This loop will grab the blob from all parents just to peek and discard
> for most of commits. It feels somewhat wasteful, especially because
> "binary" that is not marked as binary is a rare minor case (most of the
> paths are text). Stopping immediately at the first binary blob does not
> optimize the loop even though it is better than not having one.
Yeah, I am not too happy with that bit. Our choices are:
1. Grab each blob, check binary-ness, and free. This double-loads in
the common, non-binary case.
2. Grab each blob, check binary-ness, and keep it in memory. This
means using N times as much memory, where N is the number of
parents. In practice, N generally equals 2, and the file sizes
aren't gigantic, so it's not that bad. But there are some corner
cases.
3. Choose (1) or (2) based on file size, or based on the number of
parents. The problem cases for (2) are going to be big files
(bigger than bigFileThreshold?), and gigantic numbers of parents.
I'd also eventually like to do a "peek" binary-ness check on top of your
streaming work, as I showed for the non-combined diff case (speaking of
which, I need to polish that now that the streaming series seems to have
settled). And we would probably want to peek only in the big file case.
I'll try to take a look at it this week and get some measurements on (1)
versus (2) for both speed and peak memory usage. And then see if I can
do better with (3), and implement the "peek" solution both here and in
regular diff.
> - It may make sense to compare [i].sha1 with earlier [j].sha1 (j < i) and
> avoid grab_blob() altogether? Cf. 3c39e9b (combine-diff: reuse diff
> from the same blob., 2006-02-01).
Yeah, that is probably worth doing, as it collapses the "N" above into
"how many parents all modified the same file". I'll add it to my list to
measure.
-Peff
next prev parent reply other threads:[~2011-05-30 14:36 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-05-22 20:12 combined diff does not detect binary files and ignores -diff attribute Jay Soffian
2011-05-23 13:30 ` Michael J Gruber
2011-05-23 15:17 ` Jay Soffian
2011-05-23 17:07 ` Junio C Hamano
2011-05-23 18:11 ` Jeff King
2011-05-23 20:15 ` Jeff King
2011-05-23 20:16 ` [PATCH 1/5] combine-diff: split header printing into its own function Jeff King
2011-05-23 20:16 ` [PATCH 2/5] combine-diff: calculate mode_differs earlier Jeff King
2011-05-23 20:27 ` [PATCH 3/5] combine-diff: handle binary files as binary Jeff King
2011-05-23 23:02 ` Junio C Hamano
2011-05-23 23:50 ` Jeff King
2011-05-30 6:33 ` Junio C Hamano
2011-05-30 14:36 ` Jeff King [this message]
2011-05-30 16:19 ` Jeff King
2011-05-30 19:32 ` Junio C Hamano
2011-05-31 22:42 ` Junio C Hamano
2011-05-23 20:30 ` [PATCH 4/5] refactor get_textconv to not require diff_filespec Jeff King
2011-05-23 20:31 ` [PATCH 5/5] combine-diff: respect textconv attributes Jeff King
2011-05-23 22:47 ` Junio C Hamano
2011-05-23 23:39 ` Jeff King
2011-05-24 16:20 ` Junio C Hamano
2011-05-24 18:52 ` Jeff King
2011-05-23 22:55 ` combined diff does not detect binary files and ignores -diff attribute Jay Soffian
2011-05-23 23:31 ` Jay Soffian
2011-05-23 23:49 ` Jeff King
2011-05-24 0:59 ` Jay Soffian
2011-05-23 23:41 ` Jeff King
2011-05-24 4:46 ` Junio C Hamano
2011-05-24 7:19 ` Michael J Gruber
2011-05-24 15:36 ` Junio C Hamano
2011-05-24 16:38 ` Michael J Gruber
2011-05-24 16:43 ` Junio C Hamano
2011-05-24 16:52 ` Jay Soffian
2011-05-24 19:13 ` Jeff King
2011-05-25 7:38 ` Michael J Gruber
2011-05-25 15:29 ` Jeff King
2011-05-24 14:40 ` Jay Soffian
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110530143627.GC31490@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@drmicha.warpmail.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jaysoffian@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).