git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Braun <thomas.braun@virtuell-zuhause.de>
To: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Cc: git@vger.kernel.org, worley@alum.mit.edu,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 4/4] diff: mark any file larger than core.bigfilethreshold binary
Date: Thu, 19 Jun 2014 14:27:25 +0200	[thread overview]
Message-ID: <1403180845.10052.16.camel@thomas-debian-x64> (raw)
In-Reply-To: <1401368227-14469-4-git-send-email-pclouds@gmail.com>

Am Donnerstag, den 29.05.2014, 19:57 +0700 schrieb Nguyễn Thái Ngọc Duy:

Hi,

sorry for chiming in so late.

I've just played around with patch 3 and 4 of that series.
And I like it very much as I work often with large files so any further 
enhancement in that area is really nice.

(see comments below)

> Too large files may lead to failure to allocate memory. If it happens
> here, it could impact quite a few commands that involve
> diff. Moreover, too large files are inefficient to compare anyway (and
> most likely non-text), so mark them binary and skip looking at their
> content.
> 
> Noticed-by: Dale R. Worley <worley@alum.mit.edu>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  diff.c           | 26 ++++++++++++++++++--------
>  diffcore.h       |  1 +
>  t/t1050-large.sh |  4 ++++
>  3 files changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/diff.c b/diff.c
> index 54281cb..0a2f865 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
>  			one->is_binary = one->driver->binary;
>  		else {
>  			if (!one->data && DIFF_FILE_VALID(one))
> -				diff_populate_filespec(one, 0);
> -			if (one->data)
> +				diff_populate_filespec(one, DIFF_POPULATE_IS_BINARY);
> +			if (one->is_binary == -1 && one->data)
>  				one->is_binary = buffer_is_binary(one->data,
>  						one->size);
>  			if (one->is_binary == -1)
> @@ -2721,6 +2721,11 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
>  		}
>  		if (size_only)
>  			return 0;
> +		if ((flags & DIFF_POPULATE_IS_BINARY) &&
> +		    s->size > big_file_threshold && s->is_binary == -1) {
> +			s->is_binary = 1;
> +			return 0;
> +		}

Why do you check for s->is_binary == -1 here? I think it does not matter
what s_is_binary says here.

>  		fd = open(s->path, O_RDONLY);
>  		if (fd < 0)
>  			goto err_empty;
> @@ -2742,16 +2747,21 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
>  	}
>  	else {
>  		enum object_type type;
> -		if (size_only) {
> +		if (size_only || (flags & DIFF_POPULATE_IS_BINARY)) {
>  			type = sha1_object_info(s->sha1, &s->size);
>  			if (type < 0)
>  				die("unable to read %s", sha1_to_hex(s->sha1));
> -		} else {
> -			s->data = read_sha1_file(s->sha1, &type, &s->size);
> -			if (!s->data)
> -				die("unable to read %s", sha1_to_hex(s->sha1));
> -			s->should_free = 1;
> +			if (size_only)
> +				return 0;
> +			if (s->size > big_file_threshold && s->is_binary == -1) {
same as above.
> +				s->is_binary = 1;
> +				return 0;
> +			}
>  		}
> +		s->data = read_sha1_file(s->sha1, &type, &s->size);
> +		if (!s->data)
> +			die("unable to read %s", sha1_to_hex(s->sha1));
> +		s->should_free = 1;
>  	}
>  	return 0;
>  }
> diff --git a/diffcore.h b/diffcore.h
> index a186d7c..e7760d9 100644
> --- a/diffcore.h
> +++ b/diffcore.h
> @@ -56,6 +56,7 @@ extern void fill_filespec(struct diff_filespec *, const unsigned char *,
>  			  int, unsigned short);
>  
>  #define DIFF_POPULATE_SIZE_ONLY 1
> +#define DIFF_POPULATE_IS_BINARY 2
>  extern int diff_populate_filespec(struct diff_filespec *, unsigned int);
>  extern void diff_free_filespec_data(struct diff_filespec *);
>  extern void diff_free_filespec_blob(struct diff_filespec *);
> diff --git a/t/t1050-large.sh b/t/t1050-large.sh
> index 333909b..4d922e2 100755
> --- a/t/t1050-large.sh
> +++ b/t/t1050-large.sh
> @@ -112,6 +112,10 @@ test_expect_success 'diff --raw' '
>  	git diff --raw HEAD^
>  '
>  
> +test_expect_success 'diff --stat' '
> +	git diff --stat HEAD^ HEAD
> +'
> +
>  test_expect_success 'hash-object' '
>  	git hash-object large1
>  '

I would also add a note to the documentation e. g:

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 9f467d3..7a2f27d 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -499,7 +499,8 @@ core.bigFileThreshold::
        Files larger than this size are stored deflated, without
        attempting delta compression.  Storing large files without
        delta compression avoids excessive memory usage, at the
-       slight expense of increased disk usage.
+       slight expense of increased disk usage.  Additionally files
+       larger than this size are allways treated as binary.
 +
 Default is 512 MiB on all platforms.  This should be reasonable
 for most projects as source code and other text files can still

Thomas

  reply	other threads:[~2014-06-19 12:27 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-27 16:47 Git chokes on large file Dale R. Worley
2014-05-28 13:32 ` Duy Nguyen
2014-05-28 17:10   ` Junio C Hamano
2014-05-28 18:18     ` Dale R. Worley
2014-05-28 18:15   ` Dale R. Worley
2014-05-28 18:23     ` David Lang
2014-05-28 18:47       ` Dale R. Worley
2014-05-28 19:05         ` David Lang
2014-05-29 19:12           ` Dale R. Worley
2014-05-28 18:54       ` Junio C Hamano
2014-05-28 19:09         ` David Lang
2014-05-29 12:57   ` [PATCH 1/4] wrapper.c: introduce gentle xmallocz that does not die() Nguyễn Thái Ngọc Duy
2014-05-29 12:57     ` [PATCH 2/4] fsck: do not die when not enough memory to examine a pack entry Nguyễn Thái Ngọc Duy
2014-05-29 12:57     ` [PATCH 3/4] diff.c: allow to pass more flags to diff_populate_filespec Nguyễn Thái Ngọc Duy
2014-05-29 12:57     ` [PATCH 4/4] diff: mark any file larger than core.bigfilethreshold binary Nguyễn Thái Ngọc Duy
2014-06-19 12:27       ` Thomas Braun [this message]
2014-06-23 12:18         ` Duy Nguyen
2014-06-23 19:21           ` Thomas Braun
2014-06-24 11:45     ` [PATCH v2 1/4] wrapper.c: introduce gentle xmallocz that does not die() Nguyễn Thái Ngọc Duy
2014-06-24 11:45       ` [PATCH v2 2/4] fsck: do not die when not enough memory to examine a pack entry Nguyễn Thái Ngọc Duy
2014-06-26 18:09         ` Junio C Hamano
2014-06-29  0:40           ` Duy Nguyen
2014-06-24 11:45       ` [PATCH v2 3/4] diff.c: allow to pass more flags to diff_populate_filespec Nguyễn Thái Ngọc Duy
2014-06-24 11:45       ` [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary Nguyễn Thái Ngọc Duy
2014-06-26 17:55         ` Junio C Hamano
2014-06-27 18:56           ` Thomas Braun
2014-06-29  1:11             ` Duy Nguyen
2014-08-13 10:57       ` [PATCH v3 0/6] Large file improvements Nguyễn Thái Ngọc Duy
2014-08-13 10:57         ` [PATCH v3 1/6] wrapper.c: introduce gentle xmalloc(z) that does not die() Nguyễn Thái Ngọc Duy
2014-08-14 16:38           ` Junio C Hamano
2014-08-13 10:57         ` [PATCH v3 2/6] sha1_file.c: do not die failing to malloc in unpack_compressed_entry Nguyễn Thái Ngọc Duy
2014-08-13 21:13           ` Junio C Hamano
2014-08-13 10:57         ` [PATCH v3 3/6] unpack-objects: continue when fail to malloc due to large objects Nguyễn Thái Ngọc Duy
2014-08-14 16:58           ` Junio C Hamano
2014-08-15  5:24             ` Duy Nguyen
2014-08-13 10:57         ` [PATCH v3 4/6] diff.c: allow to pass more flags to diff_populate_filespec Nguyễn Thái Ngọc Duy
2014-08-13 10:57         ` [PATCH v3 5/6] diff --stat: mark any file larger than core.bigfilethreshold binary Nguyễn Thái Ngọc Duy
2014-08-13 19:32           ` Eric Sunshine
2014-08-13 10:57         ` [PATCH v3 6/6] diff: shortcut for diff'ing two binary SHA-1 objects Nguyễn Thái Ngọc Duy
2014-08-14 17:00           ` Junio C Hamano
2014-08-15 12:11             ` Duy Nguyen
2014-08-14 17:17           ` Junio C Hamano
2014-08-16  3:08         ` [PATCH v4 0/5] Large file improvements Nguyễn Thái Ngọc Duy
2014-08-16  3:08           ` [PATCH v4 1/5] wrapper.c: introduce gentle xmallocz that does not die() Nguyễn Thái Ngọc Duy
2014-08-16  3:08           ` [PATCH v4 2/5] sha1_file.c: do not die failing to malloc in unpack_compressed_entry Nguyễn Thái Ngọc Duy
2014-08-16  3:08           ` [PATCH v4 3/5] diff.c: allow to pass more flags to diff_populate_filespec Nguyễn Thái Ngọc Duy
2014-08-16  3:08           ` [PATCH v4 4/5] diff --stat: mark any file larger than core.bigfilethreshold binary Nguyễn Thái Ngọc Duy
2014-08-16  3:08           ` [PATCH v4 5/5] diff: shortcut for diff'ing two binary SHA-1 objects Nguyễn Thái Ngọc Duy
2014-05-28 15:05 ` Git chokes on large file Thomas Braun

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1403180845.10052.16.camel@thomas-debian-x64 \
    --to=thomas.braun@virtuell-zuhause.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=worley@alum.mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).