From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 00/11] writing out a huge blob to working tree
Date: Wed, 18 May 2011 04:17:33 -0400
Message-ID: <20110518081733.GF27482@sigill.intra.peff.net>
In-Reply-To: <1305505831-31587-1-git-send-email-gitster@pobox.com>

On Sun, May 15, 2011 at 05:30:20PM -0700, Junio C Hamano wrote:

> Recently "diff" learned to avoid reading the contents only to say "Binary
> files differ" when these large blobs are marked as binary.

With your series, we should be able to get similar speedups even if the
user didn't explicitly mark a file as binary. We need only peek at the
beginning of a blob to see if it is binary, so we can be conservative
with big files. Something like this (which doesn't work because of the
"size" bug I mentioned elsewhere, but is meant to be illustrative):

diff --git a/diff.c b/diff.c
index ba5f7aa..bfe1b2d 100644
--- a/diff.c
+++ b/diff.c
@@ -15,6 +15,7 @@
 #include "sigchain.h"
 #include "submodule.h"
 #include "ll-merge.h"
+#include "streaming.h"
 
 #ifdef NO_FAST_WORKING_DIRECTORY
 #define FAST_WORKING_DIRECTORY 0
@@ -1931,6 +1932,37 @@ static void diff_filespec_load_driver(struct diff_filespec *one)
 		one->driver = userdiff_find_by_name("default");
 }
 
+static char *populate_or_peek(struct diff_filespec *df,
+			      unsigned long want,
+			      unsigned long *got)
+{
+	struct git_istream *st;
+	enum object_type type;
+	char *buf;
+
+	st = open_istream(df->sha1, &type, &df->size);
+	if (!st) {
+		diff_populate_filespec(df, 0);
+		*got = df->size;
+		return df->data;
+	}
+
+	if (df->size < big_file_threshold) {
+		buf = df->data = xmallocz(df->size);
+		want = df->size;
+		df->should_free = 1;
+	}
+	else
+		buf = xmallocz(want);
+
+	/* looks like it will always read_in_full? */
+	if (read_istream(st, buf, want) != want)
+		die("failed to read object");
+	close_istream(st);
+	*got = want;
+	return buf;
+}
+
 int diff_filespec_is_binary(struct diff_filespec *one)
 {
 	if (one->is_binary == -1) {
@@ -1938,13 +1970,25 @@ int diff_filespec_is_binary(struct diff_filespec *one)
 		if (one->driver->binary != -1)
 			one->is_binary = one->driver->binary;
 		else {
-			if (!one->data && DIFF_FILE_VALID(one))
-				diff_populate_filespec(one, 0);
-			if (one->data)
-				one->is_binary = buffer_is_binary(one->data,
-						one->size);
+			char *buf;
+			unsigned long size;
+
+			if (one->data) {
+				buf = one->data;
+				size = one->size;
+			}
+			else if (DIFF_FILE_VALID(one))
+				buf = populate_or_peek(one, 8192, &size);
+			else
+				buf = NULL;
+
+			if (buf)
+				one->is_binary = buffer_is_binary(buf, size);
 			if (one->is_binary == -1)
 				one->is_binary = 0;
+
+			if (buf != one->data)
+				free(buf);
 		}
 	}
 	return one->is_binary;
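
To spell out why a peek is enough here: the binary heuristic only looks
for a NUL byte near the start of the buffer, so nothing past the first
few kilobytes can change the answer. A minimal standalone sketch of that
idea, assuming an 8000-byte window (which I believe is what
buffer_is_binary() uses, but treat the exact number as an assumption);
looks_binary() and PEEK_WINDOW are made-up names for illustration, not
git functions:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical window size; assumed to match git's heuristic. */
#define PEEK_WINDOW 8000

/*
 * Return 1 if a NUL byte appears within the first PEEK_WINDOW
 * bytes of buf, 0 otherwise. Bytes past the window are ignored,
 * which is why reading only a prefix of a huge blob is sufficient.
 */
static int looks_binary(const char *buf, size_t size)
{
	if (size > PEEK_WINDOW)
		size = PEEK_WINDOW;
	return memchr(buf, 0, size) != NULL;
}
```

So asking for 8192 bytes in the patch above covers the whole window, and
a NUL that first shows up later in the file would not be noticed whether
we peek or read everything.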

I think a "peek" function like this would be a nice addition to the
streaming API. Something like:

  char *peek_sha1(const unsigned char sha1[20], /* which object */
                  enum object_type *type, /* out: type */
                  unsigned long want, /* how much do we need */
                  unsigned long big, /* if less than this, just give us
                                        everything in the name of
                                        efficiency */
                  unsigned long *got, /* out: how much did we peek */
                  unsigned long *size /* out: how big is the whole thing */
                  );

but maybe diff is the only place where that is useful. I dunno.
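
To make the contract I have in mind concrete, here is a toy model of the
want/big/got/size semantics. It is only a sketch: struct fake_object and
peek_fake() are invented for illustration, the "object" is just an
in-memory buffer, and a real version would read through the streaming
API instead of memcpy():

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory stand-in for an object in the store. */
struct fake_object {
	const char *data;
	unsigned long size;
};

/*
 * If the object is smaller than "big", copy all of it; otherwise
 * copy only the first "want" bytes (clamped to the object size).
 * *got is the number of bytes copied, *size the full object size.
 * The caller frees the returned buffer; NULL means malloc failed.
 */
static char *peek_fake(const struct fake_object *obj,
		       unsigned long want, unsigned long big,
		       unsigned long *got, unsigned long *size)
{
	unsigned long n = obj->size;
	char *buf;

	if (n >= big && want < n)
		n = want;
	buf = malloc(n + 1);
	if (!buf)
		return NULL;
	memcpy(buf, obj->data, n);
	buf[n] = '\0';	/* NUL-terminate, in the spirit of xmallocz() */
	*got = n;
	*size = obj->size;
	return buf;
}
```

A caller can tell whether it ended up with the whole object by checking
got == size, and in that case could keep the buffer around instead of
re-reading the object later.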

-Peff
