From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 00/11] writing out a huge blob to working tree
Date: Wed, 18 May 2011 04:17:33 -0400
Message-ID: <20110518081733.GF27482@sigill.intra.peff.net>
In-Reply-To: <1305505831-31587-1-git-send-email-gitster@pobox.com>

On Sun, May 15, 2011 at 05:30:20PM -0700, Junio C Hamano wrote:

> Recently "diff" learned to avoid reading the contents only to say "Binary
> files differ" when these large blobs are marked as binary.

With your series, we should be able to get similar speedups even if the
user didn't explicitly mark a file as binary. We need only peek at the
beginning of a blob to see if it is binary, so we can be conservative
with big files. Something like this (which doesn't work because of the
"size" bug I mentioned elsewhere, but is meant to be illustrative):

diff --git a/diff.c b/diff.c
index ba5f7aa..bfe1b2d 100644
--- a/diff.c
+++ b/diff.c
@@ -15,6 +15,7 @@
#include "sigchain.h"
#include "submodule.h"
#include "ll-merge.h"
+#include "streaming.h"
#ifdef NO_FAST_WORKING_DIRECTORY
#define FAST_WORKING_DIRECTORY 0
@@ -1931,6 +1932,37 @@ static void diff_filespec_load_driver(struct diff_filespec *one)
one->driver = userdiff_find_by_name("default");
}
+static char *populate_or_peek(struct diff_filespec *df,
+ unsigned long want,
+ unsigned long *got)
+{
+ struct git_istream *st;
+ enum object_type type;
+ char *buf;
+
+ st = open_istream(df->sha1, &type, &df->size);
+ if (!st) {
+ diff_populate_filespec(df, 0);
+ *got = df->size;
+ return df->data;
+ }
+
+ if (df->size < big_file_threshold) {
+ buf = df->data = xmallocz(df->size);
+ want = df->size;
+ df->should_free = 1;
+ }
+ else
+ buf = xmallocz(want);
+
+ /* looks like it will always read_in_full? */
+ if (read_istream(st, buf, want) != want)
+ die("failed to read object");
+ close_istream(st);
+ *got = want;
+ return buf;
+}
+
int diff_filespec_is_binary(struct diff_filespec *one)
{
if (one->is_binary == -1) {
@@ -1938,13 +1970,25 @@ int diff_filespec_is_binary(struct diff_filespec *one)
if (one->driver->binary != -1)
one->is_binary = one->driver->binary;
else {
- if (!one->data && DIFF_FILE_VALID(one))
- diff_populate_filespec(one, 0);
- if (one->data)
- one->is_binary = buffer_is_binary(one->data,
- one->size);
+ char *buf;
+ unsigned long size;
+
+ if (one->data) {
+ buf = one->data;
+ size = one->size;
+ }
+ else if (DIFF_FILE_VALID(one))
+ buf = populate_or_peek(one, 8192, &size);
+ else
+ buf = NULL;
+
+ if (buf)
+ one->is_binary = buffer_is_binary(buf, size);
if (one->is_binary == -1)
one->is_binary = 0;
+
+ if (buf != one->data)
+ free(buf);
}
}
return one->is_binary;
I think a "peek" function like this would be a nice addition to the
streaming API. Something like:
  char *peek_sha1(const unsigned char sha1[20], /* which object */
                  enum object_type *type,       /* out: type */
                  unsigned long want,           /* how much do we need */
                  unsigned long big,            /* if less than this, just give us
                                                   everything in the name of
                                                   efficiency */
                  unsigned long *got,           /* out: how much did we peek */
                  unsigned long *size);         /* out: how big is the whole thing */
but maybe diff is the only place where that is useful. I dunno.
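
To make the intended semantics of such a peek concrete, here is a minimal
standalone sketch. It does not use the streaming API at all: "struct
fake_object", peek_object() and looks_binary() are hypothetical names
invented for illustration, the object is just an in-memory buffer, and the
NUL-byte test is only a stand-in for whatever buffer_is_binary() does.
The point is the want/big/got/size contract: small objects are returned
whole, big ones only up to "want" bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an object in the object store. */
struct fake_object {
	const char *data;
	unsigned long size;
};

/*
 * Return a freshly allocated, NUL-terminated copy of the beginning of
 * "obj". If the object is smaller than "big", copy all of it; otherwise
 * copy at most "want" bytes. "*got" is set to the number of bytes copied
 * and "*size" to the full object size. Returns NULL if allocation fails.
 */
static char *peek_object(const struct fake_object *obj,
			 unsigned long want, unsigned long big,
			 unsigned long *got, unsigned long *size)
{
	unsigned long n;
	char *buf;

	assert(obj && got && size);
	*size = obj->size;

	if (obj->size < big)
		n = obj->size;		/* small: just take everything */
	else
		n = want < obj->size ? want : obj->size;

	buf = malloc(n + 1);
	if (!buf)
		return NULL;
	if (n)
		memcpy(buf, obj->data, n);
	buf[n] = '\0';
	*got = n;
	return buf;
}

/* Simplified heuristic: treat any NUL byte in the peeked prefix as binary. */
static int looks_binary(const char *buf, unsigned long len)
{
	return memchr(buf, 0, len) != NULL;
}
```

A caller would peek with something like want = 8192, run the heuristic on
the "got" bytes, and free the buffer afterwards; only when the result says
"not binary" would it need to load the whole object for a real diff.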
-Peff