From: Thomas Braun <thomas.braun@virtuell-zuhause.de>
To: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Cc: git@vger.kernel.org, worley@alum.mit.edu,
Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 4/4] diff: mark any file larger than core.bigfilethreshold binary
Date: Thu, 19 Jun 2014 14:27:25 +0200 [thread overview]
Message-ID: <1403180845.10052.16.camel@thomas-debian-x64> (raw)
In-Reply-To: <1401368227-14469-4-git-send-email-pclouds@gmail.com>
Am Donnerstag, den 29.05.2014, 19:57 +0700 schrieb Nguyễn Thái Ngọc Duy:
Hi,
sorry for chiming in so late.
I've just played around with patch 3 and 4 of that series.
And I like it very much as I work often with large files so any further
enhancement in that area is really nice.
(see comments below)
> Too large files may lead to failure to allocate memory. If it happens
> here, it could impact quite a few commands that involve
> diff. Moreover, too large files are inefficient to compare anyway (and
> most likely non-text), so mark them binary and skip looking at their
> content.
>
> Noticed-by: Dale R. Worley <worley@alum.mit.edu>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
> diff.c | 26 ++++++++++++++++++--------
> diffcore.h | 1 +
> t/t1050-large.sh | 4 ++++
> 3 files changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/diff.c b/diff.c
> index 54281cb..0a2f865 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
> one->is_binary = one->driver->binary;
> else {
> if (!one->data && DIFF_FILE_VALID(one))
> - diff_populate_filespec(one, 0);
> - if (one->data)
> + diff_populate_filespec(one, DIFF_POPULATE_IS_BINARY);
> + if (one->is_binary == -1 && one->data)
> one->is_binary = buffer_is_binary(one->data,
> one->size);
> if (one->is_binary == -1)
> @@ -2721,6 +2721,11 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
> }
> if (size_only)
> return 0;
> + if ((flags & DIFF_POPULATE_IS_BINARY) &&
> + s->size > big_file_threshold && s->is_binary == -1) {
> + s->is_binary = 1;
> + return 0;
> + }
Why do you check for s->is_binary == -1 here? I think it does not matter
what s_is_binary says here.
> fd = open(s->path, O_RDONLY);
> if (fd < 0)
> goto err_empty;
> @@ -2742,16 +2747,21 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
> }
> else {
> enum object_type type;
> - if (size_only) {
> + if (size_only || (flags & DIFF_POPULATE_IS_BINARY)) {
> type = sha1_object_info(s->sha1, &s->size);
> if (type < 0)
> die("unable to read %s", sha1_to_hex(s->sha1));
> - } else {
> - s->data = read_sha1_file(s->sha1, &type, &s->size);
> - if (!s->data)
> - die("unable to read %s", sha1_to_hex(s->sha1));
> - s->should_free = 1;
> + if (size_only)
> + return 0;
> + if (s->size > big_file_threshold && s->is_binary == -1) {
same as above.
> + s->is_binary = 1;
> + return 0;
> + }
> }
> + s->data = read_sha1_file(s->sha1, &type, &s->size);
> + if (!s->data)
> + die("unable to read %s", sha1_to_hex(s->sha1));
> + s->should_free = 1;
> }
> return 0;
> }
> diff --git a/diffcore.h b/diffcore.h
> index a186d7c..e7760d9 100644
> --- a/diffcore.h
> +++ b/diffcore.h
> @@ -56,6 +56,7 @@ extern void fill_filespec(struct diff_filespec *, const unsigned char *,
> int, unsigned short);
>
> #define DIFF_POPULATE_SIZE_ONLY 1
> +#define DIFF_POPULATE_IS_BINARY 2
> extern int diff_populate_filespec(struct diff_filespec *, unsigned int);
> extern void diff_free_filespec_data(struct diff_filespec *);
> extern void diff_free_filespec_blob(struct diff_filespec *);
> diff --git a/t/t1050-large.sh b/t/t1050-large.sh
> index 333909b..4d922e2 100755
> --- a/t/t1050-large.sh
> +++ b/t/t1050-large.sh
> @@ -112,6 +112,10 @@ test_expect_success 'diff --raw' '
> git diff --raw HEAD^
> '
>
> +test_expect_success 'diff --stat' '
> + git diff --stat HEAD^ HEAD
> +'
> +
> test_expect_success 'hash-object' '
> git hash-object large1
> '
I would also add a note to the documentation e. g:
diff --git a/Documentation/config.txt b/Documentation/config.txt
index 9f467d3..7a2f27d 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -499,7 +499,8 @@ core.bigFileThreshold::
Files larger than this size are stored deflated, without
attempting delta compression. Storing large files without
delta compression avoids excessive memory usage, at the
- slight expense of increased disk usage.
+ slight expense of increased disk usage. Additionally files
+ larger than this size are allways treated as binary.
+
Default is 512 MiB on all platforms. This should be reasonable
for most projects as source code and other text files can still
Thomas
next prev parent reply other threads:[~2014-06-19 12:27 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-05-27 16:47 Git chokes on large file Dale R. Worley
2014-05-28 13:32 ` Duy Nguyen
2014-05-28 17:10 ` Junio C Hamano
2014-05-28 18:18 ` Dale R. Worley
2014-05-28 18:15 ` Dale R. Worley
2014-05-28 18:23 ` David Lang
2014-05-28 18:47 ` Dale R. Worley
2014-05-28 19:05 ` David Lang
2014-05-29 19:12 ` Dale R. Worley
2014-05-28 18:54 ` Junio C Hamano
2014-05-28 19:09 ` David Lang
2014-05-29 12:57 ` [PATCH 1/4] wrapper.c: introduce gentle xmallocz that does not die() Nguyễn Thái Ngọc Duy
2014-05-29 12:57 ` [PATCH 2/4] fsck: do not die when not enough memory to examine a pack entry Nguyễn Thái Ngọc Duy
2014-05-29 12:57 ` [PATCH 3/4] diff.c: allow to pass more flags to diff_populate_filespec Nguyễn Thái Ngọc Duy
2014-05-29 12:57 ` [PATCH 4/4] diff: mark any file larger than core.bigfilethreshold binary Nguyễn Thái Ngọc Duy
2014-06-19 12:27 ` Thomas Braun [this message]
2014-06-23 12:18 ` Duy Nguyen
2014-06-23 19:21 ` Thomas Braun
2014-06-24 11:45 ` [PATCH v2 1/4] wrapper.c: introduce gentle xmallocz that does not die() Nguyễn Thái Ngọc Duy
2014-06-24 11:45 ` [PATCH v2 2/4] fsck: do not die when not enough memory to examine a pack entry Nguyễn Thái Ngọc Duy
2014-06-26 18:09 ` Junio C Hamano
2014-06-29 0:40 ` Duy Nguyen
2014-06-24 11:45 ` [PATCH v2 3/4] diff.c: allow to pass more flags to diff_populate_filespec Nguyễn Thái Ngọc Duy
2014-06-24 11:45 ` [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary Nguyễn Thái Ngọc Duy
2014-06-26 17:55 ` Junio C Hamano
2014-06-27 18:56 ` Thomas Braun
2014-06-29 1:11 ` Duy Nguyen
2014-08-13 10:57 ` [PATCH v3 0/6] Large file improvements Nguyễn Thái Ngọc Duy
2014-08-13 10:57 ` [PATCH v3 1/6] wrapper.c: introduce gentle xmalloc(z) that does not die() Nguyễn Thái Ngọc Duy
2014-08-14 16:38 ` Junio C Hamano
2014-08-13 10:57 ` [PATCH v3 2/6] sha1_file.c: do not die failing to malloc in unpack_compressed_entry Nguyễn Thái Ngọc Duy
2014-08-13 21:13 ` Junio C Hamano
2014-08-13 10:57 ` [PATCH v3 3/6] unpack-objects: continue when fail to malloc due to large objects Nguyễn Thái Ngọc Duy
2014-08-14 16:58 ` Junio C Hamano
2014-08-15 5:24 ` Duy Nguyen
2014-08-13 10:57 ` [PATCH v3 4/6] diff.c: allow to pass more flags to diff_populate_filespec Nguyễn Thái Ngọc Duy
2014-08-13 10:57 ` [PATCH v3 5/6] diff --stat: mark any file larger than core.bigfilethreshold binary Nguyễn Thái Ngọc Duy
2014-08-13 19:32 ` Eric Sunshine
2014-08-13 10:57 ` [PATCH v3 6/6] diff: shortcut for diff'ing two binary SHA-1 objects Nguyễn Thái Ngọc Duy
2014-08-14 17:00 ` Junio C Hamano
2014-08-15 12:11 ` Duy Nguyen
2014-08-14 17:17 ` Junio C Hamano
2014-08-16 3:08 ` [PATCH v4 0/5] Large file improvements Nguyễn Thái Ngọc Duy
2014-08-16 3:08 ` [PATCH v4 1/5] wrapper.c: introduce gentle xmallocz that does not die() Nguyễn Thái Ngọc Duy
2014-08-16 3:08 ` [PATCH v4 2/5] sha1_file.c: do not die failing to malloc in unpack_compressed_entry Nguyễn Thái Ngọc Duy
2014-08-16 3:08 ` [PATCH v4 3/5] diff.c: allow to pass more flags to diff_populate_filespec Nguyễn Thái Ngọc Duy
2014-08-16 3:08 ` [PATCH v4 4/5] diff --stat: mark any file larger than core.bigfilethreshold binary Nguyễn Thái Ngọc Duy
2014-08-16 3:08 ` [PATCH v4 5/5] diff: shortcut for diff'ing two binary SHA-1 objects Nguyễn Thái Ngọc Duy
2014-05-28 15:05 ` Git chokes on large file Thomas Braun
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1403180845.10052.16.camel@thomas-debian-x64 \
--to=thomas.braun@virtuell-zuhause.de \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=pclouds@gmail.com \
--cc=worley@alum.mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).