[PATCH v2 0/2] Stream fd to clean filter

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] Stream fd to clean filter
@ 2014-08-06  5:32 Steffen Prohaska
  2014-08-06  5:32 ` [PATCH v2 1/2] convert: Refactor would_convert_to_git() to single arg 'path' Steffen Prohaska
  2014-08-06  5:32 ` [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap Steffen Prohaska
  0 siblings, 2 replies; 13+ messages in thread
From: Steffen Prohaska @ 2014-08-06  5:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, schacon, Steffen Prohaska

The main difference to the previous version is that I've split off the
refactoring into a separate commit.  The rest is polishing the style.

Steffen Prohaska (2):
  convert: Refactor would_convert_to_git() to single arg 'path'
  convert: Stream from fd to required clean filter instead of mmap

 convert.c             | 60 ++++++++++++++++++++++++++++++++++++++++-----
 convert.h             | 10 +++++---
 sha1_file.c           | 29 ++++++++++++++++++++--
 t/t0021-conversion.sh | 68 +++++++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 151 insertions(+), 16 deletions(-)

-- 
2.1.0.rc1.6.gb569bd8

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/2] convert: Refactor would_convert_to_git() to single arg 'path'
  2014-08-06  5:32 [PATCH v2 0/2] Stream fd to clean filter Steffen Prohaska
@ 2014-08-06  5:32 ` Steffen Prohaska
  2014-08-06  5:32 ` [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap Steffen Prohaska
  1 sibling, 0 replies; 13+ messages in thread
From: Steffen Prohaska @ 2014-08-06  5:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, schacon, Steffen Prohaska

It is only the path that matters in the decision whether to filter or
not.  Clarify this by making path the single argument of
would_convert_to_git().

Signed-off-by: Steffen Prohaska <prohaska@zib.de>
---
 convert.h   | 5 ++---
 sha1_file.c | 2 +-
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/convert.h b/convert.h
index 0c2143c..c638b33 100644
--- a/convert.h
+++ b/convert.h
@@ -40,10 +40,9 @@ extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
 extern int renormalize_buffer(const char *path, const char *src, size_t len,
 			      struct strbuf *dst);
-static inline int would_convert_to_git(const char *path, const char *src,
-				       size_t len, enum safe_crlf checksafe)
+static inline int would_convert_to_git(const char *path)
 {
-	return convert_to_git(path, src, len, NULL, checksafe);
+	return convert_to_git(path, NULL, 0, NULL, 0);
 }
 
 /*****************************************************************
diff --git a/sha1_file.c b/sha1_file.c
index 3f70b1d..00c07f2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3144,7 +3144,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
 	if (!S_ISREG(st->st_mode))
 		ret = index_pipe(sha1, fd, type, path, flags);
 	else if (size <= big_file_threshold || type != OBJ_BLOB ||
-		 (path && would_convert_to_git(path, NULL, 0, 0)))
+		 (path && would_convert_to_git(path)))
 		ret = index_core(sha1, fd, size, type, path, flags);
 	else
 		ret = index_stream(sha1, fd, size, type, path, flags);
-- 
2.1.0.rc1.6.gb569bd8

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-06  5:32 [PATCH v2 0/2] Stream fd to clean filter Steffen Prohaska
  2014-08-06  5:32 ` [PATCH v2 1/2] convert: Refactor would_convert_to_git() to single arg 'path' Steffen Prohaska
@ 2014-08-06  5:32 ` Steffen Prohaska
  2014-08-16 10:27   ` John Keeping
  1 sibling, 1 reply; 13+ messages in thread
From: Steffen Prohaska @ 2014-08-06  5:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, schacon, Steffen Prohaska

The data is streamed to the filter process anyway.  Better avoid mapping
the file if possible.  This is especially useful if a clean filter
reduces the size, for example if it computes a sha1 for binary data,
like git media.  The file size that the previous implementation could
handle was limited by the available address space; large files for
example could not be handled with (32-bit) msysgit.  The new
implementation can filter files of any size as long as the filter output
is small enough.

The new code path is only taken if the filter is required.  The filter
consumes data directly from the fd.  The original data is not available
to git, so it must fail if the filter fails.

The test that exercises required filters is modified to verify that the
data actually has been modified on its way from the file system to the
object store.

The expectation on the process size is tested using /usr/bin/time.  An
alternative would have been tcsh, which could be used to print memory
information as follows:

    tcsh -c 'set time=(0 "%M"); <cmd>'

Although the logic could perhaps be simplified with tcsh, I chose to use
'time' to avoid a dependency on tcsh.

Signed-off-by: Steffen Prohaska <prohaska@zib.de>
---
 convert.c             | 60 ++++++++++++++++++++++++++++++++++++++++-----
 convert.h             |  5 ++++
 sha1_file.c           | 27 +++++++++++++++++++-
 t/t0021-conversion.sh | 68 +++++++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 148 insertions(+), 12 deletions(-)

diff --git a/convert.c b/convert.c
index cb5fbb4..463f6de 100644
--- a/convert.c
+++ b/convert.c
@@ -312,11 +312,12 @@ static int crlf_to_worktree(const char *path, const char *src, size_t len,
 struct filter_params {
 	const char *src;
 	unsigned long size;
+	int fd;
 	const char *cmd;
 	const char *path;
 };
 
-static int filter_buffer(int in, int out, void *data)
+static int filter_buffer_or_fd(int in, int out, void *data)
 {
 	/*
 	 * Spawn cmd and feed the buffer contents through its stdin.
@@ -325,6 +326,7 @@ static int filter_buffer(int in, int out, void *data)
 	struct filter_params *params = (struct filter_params *)data;
 	int write_err, status;
 	const char *argv[] = { NULL, NULL };
+	int fd;
 
 	/* apply % substitution to cmd */
 	struct strbuf cmd = STRBUF_INIT;
@@ -355,7 +357,17 @@ static int filter_buffer(int in, int out, void *data)
 
 	sigchain_push(SIGPIPE, SIG_IGN);
 
-	write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
+	if (params->src) {
+	    write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
+	} else {
+	    /* dup(), because copy_fd() closes the input fd. */
+	    fd = dup(params->fd);
+	    if (fd < 0)
+		write_err = error("failed to dup file descriptor.");
+	    else
+		write_err = copy_fd(fd, child_process.in);
+	}
+
 	if (close(child_process.in))
 		write_err = 1;
 	if (write_err)
@@ -371,7 +383,7 @@ static int filter_buffer(int in, int out, void *data)
 	return (write_err || status);
 }
 
-static int apply_filter(const char *path, const char *src, size_t len,
+static int apply_filter(const char *path, const char *src, size_t len, int fd,
                         struct strbuf *dst, const char *cmd)
 {
 	/*
@@ -392,11 +404,12 @@ static int apply_filter(const char *path, const char *src, size_t len,
 		return 1;
 
 	memset(&async, 0, sizeof(async));
-	async.proc = filter_buffer;
+	async.proc = filter_buffer_or_fd;
 	async.data = &params;
 	async.out = -1;
 	params.src = src;
 	params.size = len;
+	params.fd = fd;
 	params.cmd = cmd;
 	params.path = path;
 
@@ -747,6 +760,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 	}
 }
 
+int would_convert_to_git_filter_fd(const char *path)
+{
+	struct conv_attrs ca;
+
+	convert_attrs(&ca, path);
+	if (!ca.drv)
+	    return 0;
+
+	/* Apply a filter to an fd only if the filter is required to succeed.
+	 * We must die if the filter fails, because the original data before
+	 * filtering is not available.
+	 */
+	if (!ca.drv->required)
+	    return 0;
+
+	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
+}
+
 int convert_to_git(const char *path, const char *src, size_t len,
                    struct strbuf *dst, enum safe_crlf checksafe)
 {
@@ -761,7 +792,7 @@ int convert_to_git(const char *path, const char *src, size_t len,
 		required = ca.drv->required;
 	}
 
-	ret |= apply_filter(path, src, len, dst, filter);
+	ret |= apply_filter(path, src, len, -1, dst, filter);
 	if (!ret && required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
@@ -778,6 +809,23 @@ int convert_to_git(const char *path, const char *src, size_t len,
 	return ret | ident_to_git(path, src, len, dst, ca.ident);
 }
 
+void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
+			      enum safe_crlf checksafe)
+{
+	struct conv_attrs ca;
+	convert_attrs(&ca, path);
+
+	assert(ca.drv);
+	assert(ca.drv->clean);
+
+	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv->clean))
+		die("%s: clean filter '%s' failed", path, ca.drv->name);
+
+	ca.crlf_action = input_crlf_action(ca.crlf_action, ca.eol_attr);
+	crlf_to_git(path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
+	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
+}
+
 static int convert_to_working_tree_internal(const char *path, const char *src,
 					    size_t len, struct strbuf *dst,
 					    int normalizing)
@@ -811,7 +859,7 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
-	ret_filter = apply_filter(path, src, len, dst, filter);
+	ret_filter = apply_filter(path, src, len, -1, dst, filter);
 	if (!ret_filter && required)
 		die("%s: smudge filter %s failed", path, ca.drv->name);
 
diff --git a/convert.h b/convert.h
index c638b33..d9d853c 100644
--- a/convert.h
+++ b/convert.h
@@ -44,6 +44,11 @@ static inline int would_convert_to_git(const char *path)
 {
 	return convert_to_git(path, NULL, 0, NULL, 0);
 }
+/* Precondition: would_convert_to_git_filter_fd(path) == true */
+extern void convert_to_git_filter_fd(const char *path, int fd,
+				     struct strbuf *dst,
+				     enum safe_crlf checksafe);
+extern int would_convert_to_git_filter_fd(const char *path);
 
 /*****************************************************************
  *
diff --git a/sha1_file.c b/sha1_file.c
index 00c07f2..eaf3220 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3074,6 +3074,29 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 	return ret;
 }
 
+static int index_stream_convert_blob(unsigned char *sha1, int fd,
+				     const char *path, unsigned flags)
+{
+	int ret;
+	const int write_object = flags & HASH_WRITE_OBJECT;
+	struct strbuf sbuf = STRBUF_INIT;
+
+	assert(path);
+	assert(would_convert_to_git_filter_fd(path));
+
+	convert_to_git_filter_fd(path, fd, &sbuf,
+				 write_object ? safe_crlf : SAFE_CRLF_FALSE);
+
+	if (write_object)
+		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
+				      sha1);
+	else
+		ret = hash_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
+				     sha1);
+	strbuf_release(&sbuf);
+	return ret;
+}
+
 static int index_pipe(unsigned char *sha1, int fd, enum object_type type,
 		      const char *path, unsigned flags)
 {
@@ -3141,7 +3164,9 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
 	int ret;
 	size_t size = xsize_t(st->st_size);
 
-	if (!S_ISREG(st->st_mode))
+	if (type == OBJ_BLOB && path && would_convert_to_git_filter_fd(path))
+		ret = index_stream_convert_blob(sha1, fd, path, flags);
+	else if (!S_ISREG(st->st_mode))
 		ret = index_pipe(sha1, fd, type, path, flags);
 	else if (size <= big_file_threshold || type != OBJ_BLOB ||
 		 (path && would_convert_to_git(path)))
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index f890c54..2cb2414 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -153,17 +153,23 @@ test_expect_success 'filter shell-escaped filenames' '
 	:
 '
 
-test_expect_success 'required filter success' '
-	git config filter.required.smudge cat &&
-	git config filter.required.clean cat &&
+test_expect_success 'required filter should filter data' '
+	git config filter.required.smudge ./rot13.sh &&
+	git config filter.required.clean ./rot13.sh &&
 	git config filter.required.required true &&
 
 	echo "*.r filter=required" >.gitattributes &&
 
-	echo test >test.r &&
+	cat test.o >test.r &&
 	git add test.r &&
+
 	rm -f test.r &&
-	git checkout -- test.r
+	git checkout -- test.r &&
+	cmp test.o test.r &&
+
+	./rot13.sh <test.o >expected &&
+	git cat-file blob :test.r >actual &&
+	cmp expected actual
 '
 
 test_expect_success 'required filter smudge failure' '
@@ -190,6 +196,58 @@ test_expect_success 'required filter clean failure' '
 	test_must_fail git add test.fc
 '
 
+# Handle differences in /usr/bin/time.
+#
+#  - Linux: call with '-v'.
+#    output: <spaces><description>:<space><value-in-KBytes>
+#
+#  - Mac: call with '-l'.
+#    output: <spaces><value-in-Bytes><spaces><description>
+#    Strip three digits to get to KB (base 10 is good enough).
+#
+case $(uname -s) in
+Linux)
+	test_set_prereq HAVE_MAX_MEM_USAGE
+	max_mem_usage_KB () {
+	    /usr/bin/time -v "$@" 2>&1 |
+	    grep 'Maximum resident set size' |
+	    cut -d ':' -f 2
+	}
+	;;
+Darwin)
+	test_set_prereq HAVE_MAX_MEM_USAGE
+	max_mem_usage_KB () {
+		/usr/bin/time -l "$@" 2>&1 |
+		grep 'maximum resident set size' |
+		sed -e 's/  */ /' |
+		cut -d ' ' -f 2 |
+		sed -e 's/...$//'
+	}
+	;;
+esac
+
+max_mem_usage_is_lt_KB () {
+	limit=$1
+	shift
+	mem_usage=$(max_mem_usage_KB "$@")
+	if [ $mem_usage -lt $limit ]; then
+		true
+	else
+		printf 'Command used too much memory (expected limit %dKB, actual usage %dKB).\n' \
+			$limit $mem_usage
+		false
+	fi
+}
+
+test_expect_success HAVE_MAX_MEM_USAGE \
+'filtering large input to small output should use little memory' '
+	git config filter.devnull.clean "cat >/dev/null" &&
+	git config filter.devnull.required true &&
+	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
+	echo "30MB filter=devnull" >.gitattributes &&
+	max_mem_usage_is_lt_KB 15000 git add 30MB
+'
+
 test_expect_success EXPENSIVE 'filter large file' '
 	git config filter.largefile.smudge cat &&
 	git config filter.largefile.clean cat &&
-- 
2.1.0.rc1.6.gb569bd8

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-06  5:32 ` [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap Steffen Prohaska
@ 2014-08-16 10:27   ` John Keeping
  2014-08-16 16:26     ` Steffen Prohaska
  0 siblings, 1 reply; 13+ messages in thread
From: John Keeping @ 2014-08-16 10:27 UTC (permalink / raw)
  To: Steffen Prohaska; +Cc: Junio C Hamano, git, peff, schacon

On Wed, Aug 06, 2014 at 07:32:14AM +0200, Steffen Prohaska wrote:
[...]
> The expectation on the process size is tested using /usr/bin/time.  An
> alternative would have been tcsh, which could be used to print memory
> information as follows:
> 
>     tcsh -c 'set time=(0 "%M"); <cmd>'
> 
> Although the logic could perhaps be simplified with tcsh, I chose to use
> 'time' to avoid a dependency on tcsh.
[...]
> @@ -190,6 +196,58 @@ test_expect_success 'required filter clean failure' '
>  	test_must_fail git add test.fc
>  '
>  
> +# Handle differences in /usr/bin/time.
> +#
> +#  - Linux: call with '-v'.
> +#    output: <spaces><description>:<space><value-in-KBytes>
> +#
> +#  - Mac: call with '-l'.
> +#    output: <spaces><value-in-Bytes><spaces><description>
> +#    Strip three digits to get to KB (base 10 is good enough).
> +#
> +case $(uname -s) in
> +Linux)
> +	test_set_prereq HAVE_MAX_MEM_USAGE
> +	max_mem_usage_KB () {
> +	    /usr/bin/time -v "$@" 2>&1 |
> +	    grep 'Maximum resident set size' |
> +	    cut -d ':' -f 2
> +	}
> +	;;
> +Darwin)
> +	test_set_prereq HAVE_MAX_MEM_USAGE
> +	max_mem_usage_KB () {
> +		/usr/bin/time -l "$@" 2>&1 |
> +		grep 'maximum resident set size' |
> +		sed -e 's/  */ /' |
> +		cut -d ' ' -f 2 |
> +		sed -e 's/...$//'
> +	}
> +	;;
> +esac
> +
> +max_mem_usage_is_lt_KB () {
> +	limit=$1
> +	shift
> +	mem_usage=$(max_mem_usage_KB "$@")
> +	if [ $mem_usage -lt $limit ]; then
> +		true
> +	else
> +		printf 'Command used too much memory (expected limit %dKB, actual usage %dKB).\n' \
> +			$limit $mem_usage
> +		false
> +	fi
> +}
> +
> +test_expect_success HAVE_MAX_MEM_USAGE \
> +'filtering large input to small output should use little memory' '
> +	git config filter.devnull.clean "cat >/dev/null" &&
> +	git config filter.devnull.required true &&
> +	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
> +	echo "30MB filter=devnull" >.gitattributes &&
> +	max_mem_usage_is_lt_KB 15000 git add 30MB
> +'

This test fails for me:

-- 8< --
expecting success: 
        git config filter.devnull.clean "cat >/dev/null" &&
        git config filter.devnull.required true &&
        for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
        echo "30MB filter=devnull" >.gitattributes &&
        max_mem_usage_is_lt_KB 15000 git add 30MB

Command used too much memory (expected limit 15000KB, actual usage 15808KB).
not ok 8 - filtering large input to small output should use little memory
-- >8 --

This is on Linux 3.16 x86_64 with GCC 4.8.3 and glibc 2.19.  My GCC also
has "-fstack-protector" and "-D_FORTIFY_SOURCE=2" enabled by default;
turning those off for Git decreased the memory usage by about 500KB but
not enough to make the test pass.  Of course, all of the libraries Git
is linking are also compiled with those flags...

Is the 15MB limit supposed to be imposed somewhere or is it just a guide
of how much memory we expect Git to use in this scenario?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-16 10:27   ` John Keeping
@ 2014-08-16 16:26     ` Steffen Prohaska
  2014-08-16 17:00       ` Andreas Schwab
  2014-08-17  7:27       ` Jeff King
  0 siblings, 2 replies; 13+ messages in thread
From: Steffen Prohaska @ 2014-08-16 16:26 UTC (permalink / raw)
  To: John Keeping, Junio C Hamano; +Cc: Git Mailing List, Jeff King, Scott Chacon


On Aug 16, 2014, at 12:27 PM, John Keeping <john@keeping.me.uk> wrote:

>> +test_expect_success HAVE_MAX_MEM_USAGE \
>> +'filtering large input to small output should use little memory' '
>> +	git config filter.devnull.clean "cat >/dev/null" &&
>> +	git config filter.devnull.required true &&
>> +	for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
>> +	echo "30MB filter=devnull" >.gitattributes &&
>> +	max_mem_usage_is_lt_KB 15000 git add 30MB
>> +'
> 
> This test fails for me:
> 
> -- 8< --
> expecting success: 
>        git config filter.devnull.clean "cat >/dev/null" &&
>        git config filter.devnull.required true &&
>        for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
>        echo "30MB filter=devnull" >.gitattributes &&
>        max_mem_usage_is_lt_KB 15000 git add 30MB
> 
> Command used too much memory (expected limit 15000KB, actual usage 15808KB).
> not ok 8 - filtering large input to small output should use little memory
> -- >8 --
> 
> This is on Linux 3.16 x86_64 with GCC 4.8.3 and glibc 2.19.  My GCC also
> has "-fstack-protector" and "-D_FORTIFY_SOURCE=2" enabled by default;
> turning those off for Git decreased the memory usage by about 500KB but
> not enough to make the test pass.  Of course, all of the libraries Git
> is linking are also compiled with those flags...
> 
> Is the 15MB limit supposed to be imposed somewhere or is it just a guide
> of how much memory we expect Git to use in this scenario?

The test should confirm that the the file that is added is not mmapped to memory.  The process size should be relatively small independently of the size of the file that is added.  I wanted to keep the file size small.  The chosen sizes worked for me on Mac and Linux.

A simple solution could be to increase the size of the test file and the limit.  I'd suggest to squash the diff below.  The test should still run reasonably fast with a 50 MB test file.

Junio, shall I send a whole updated patch series?

	Steffen


diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 2cb2414..43de87d 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -243,9 +243,9 @@ test_expect_success HAVE_MAX_MEM_USAGE \
 'filtering large input to small output should use little memory' '
        git config filter.devnull.clean "cat >/dev/null" &&
        git config filter.devnull.required true &&
-       for i in $(test_seq 1 30); do printf "%1048576d" 1; done >30MB &&
-       echo "30MB filter=devnull" >.gitattributes &&
-       max_mem_usage_is_lt_KB 15000 git add 30MB
+       for i in $(test_seq 1 50); do printf "%1048576d" 1; done >50MB &&
+       echo "50MB filter=devnull" >.gitattributes &&
+       max_mem_usage_is_lt_KB 35000 git add 50MB
 '

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-16 16:26     ` Steffen Prohaska
@ 2014-08-16 17:00       ` Andreas Schwab
  2014-08-17 10:08         ` Steffen Prohaska
  2014-08-17  7:27       ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Andreas Schwab @ 2014-08-16 17:00 UTC (permalink / raw)
  To: Steffen Prohaska
  Cc: John Keeping, Junio C Hamano, Git Mailing List, Jeff King,
	Scott Chacon

Steffen Prohaska <prohaska@zib.de> writes:

> The test should confirm that the the file that is added is not mmapped to
> memory.

RSS doesn't tell you that.  You can mmap a big file without RSS getting
bigger.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-16 16:26     ` Steffen Prohaska
  2014-08-16 17:00       ` Andreas Schwab
@ 2014-08-17  7:27       ` Jeff King
  2014-08-17 10:25         ` Steffen Prohaska
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff King @ 2014-08-17  7:27 UTC (permalink / raw)
  To: Steffen Prohaska
  Cc: John Keeping, Junio C Hamano, Git Mailing List, Scott Chacon

On Sat, Aug 16, 2014 at 06:26:08PM +0200, Steffen Prohaska wrote:

> > Is the 15MB limit supposed to be imposed somewhere or is it just a guide
> > of how much memory we expect Git to use in this scenario?
> 
> The test should confirm that the the file that is added is not mmapped
> to memory.  The process size should be relatively small independently
> of the size of the file that is added.  I wanted to keep the file size
> small.  The chosen sizes worked for me on Mac and Linux.

Measuring memory usage seems inherently a bit flaky for the test suite.
It's also a little out of place, as the test suite is generally about
correctness and outcomes, and this is a performance issue.

Would it make more sense to construct a t/perf test that shows off the
change? I suppose the run-time change may not be that impressive, but it
would be cool if t/perf could measure max memory use, too. Then we can
just compare results between versions, which is enough to detect
regressions.

There's some prior art in the jk/pack-bitmap-reuse-deltas series (which
is not merged), where I taught it to measure output sizes of commands.
That should provide the necessary refactoring base, I think.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-16 17:00       ` Andreas Schwab
@ 2014-08-17 10:08         ` Steffen Prohaska
  2014-08-17 11:03           ` Andreas Schwab
  0 siblings, 1 reply; 13+ messages in thread
From: Steffen Prohaska @ 2014-08-17 10:08 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: John Keeping, Junio C Hamano, Git Mailing List, Jeff King,
	Scott Chacon


On Aug 16, 2014, at 7:00 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:

> Steffen Prohaska <prohaska@zib.de> writes:
> 
>> The test should confirm that the the file that is added is not mmapped to
>> memory.
> 
> RSS doesn't tell you that.  You can mmap a big file without RSS getting
> bigger.

All data are accessed after mapping, so RSS should be fine.  The test always did what I expected.

	Steffen

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-17  7:27       ` Jeff King
@ 2014-08-17 10:25         ` Steffen Prohaska
  2014-08-17 11:33           ` Duy Nguyen
  0 siblings, 1 reply; 13+ messages in thread
From: Steffen Prohaska @ 2014-08-17 10:25 UTC (permalink / raw)
  To: Jeff King; +Cc: John Keeping, Junio C Hamano, Git Mailing List, Scott Chacon


On Aug 17, 2014, at 9:27 AM, Jeff King <peff@peff.net> wrote:

> On Sat, Aug 16, 2014 at 06:26:08PM +0200, Steffen Prohaska wrote:
> 
>>> Is the 15MB limit supposed to be imposed somewhere or is it just a guide
>>> of how much memory we expect Git to use in this scenario?
>> 
>> The test should confirm that the the file that is added is not mmapped
>> to memory.  The process size should be relatively small independently
>> of the size of the file that is added.  I wanted to keep the file size
>> small.  The chosen sizes worked for me on Mac and Linux.
> 
> Measuring memory usage seems inherently a bit flaky for the test suite.
> It's also a little out of place, as the test suite is generally about
> correctness and outcomes, and this is a performance issue.

For files >2GB on a 32-bit system (e.g. msysgit), filtering with the previous code always failed.  Now it works.  I created the patch to change git from 'fundamentally doesn't handle this' to 'works as expected'.


> Would it make more sense to construct a t/perf test that shows off the
> change? I suppose the run-time change may not be that impressive, but it
> would be cool if t/perf could measure max memory use, too. Then we can
> just compare results between versions, which is enough to detect
> regressions.

I wasn't aware of t/perf.  Thanks for suggesting this.

I agree that testing memory usage might be a bit flaky.  t/perf might indeed be a better place.

I'm not yet entirely convinced, though.  I'm wondering whether the proposed test would be robust enough with a large enough threshold to keep it in the main test suite.

	Steffen

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-17 10:08         ` Steffen Prohaska
@ 2014-08-17 11:03           ` Andreas Schwab
  0 siblings, 0 replies; 13+ messages in thread
From: Andreas Schwab @ 2014-08-17 11:03 UTC (permalink / raw)
  To: Steffen Prohaska
  Cc: John Keeping, Junio C Hamano, Git Mailing List, Jeff King,
	Scott Chacon

Steffen Prohaska <prohaska@zib.de> writes:

> On Aug 16, 2014, at 7:00 PM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>
>> Steffen Prohaska <prohaska@zib.de> writes:
>> 
>>> The test should confirm that the the file that is added is not mmapped to
>>> memory.
>> 
>> RSS doesn't tell you that.  You can mmap a big file without RSS getting
>> bigger.
>
> All data are accessed after mapping, so RSS should be fine.

RSS can shink any time when memory is tight.

> The test always did what I expected.

Not a sufficient condition.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-17 10:25         ` Steffen Prohaska
@ 2014-08-17 11:33           ` Duy Nguyen
  2014-08-19  7:53             ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Duy Nguyen @ 2014-08-17 11:33 UTC (permalink / raw)
  To: Steffen Prohaska
  Cc: Jeff King, John Keeping, Junio C Hamano, Git Mailing List,
	Scott Chacon

On Sun, Aug 17, 2014 at 5:25 PM, Steffen Prohaska <prohaska@zib.de> wrote:
>
> On Aug 17, 2014, at 9:27 AM, Jeff King <peff@peff.net> wrote:
>
>> On Sat, Aug 16, 2014 at 06:26:08PM +0200, Steffen Prohaska wrote:
>>
>>>> Is the 15MB limit supposed to be imposed somewhere or is it just a guide
>>>> of how much memory we expect Git to use in this scenario?
>>>
>>> The test should confirm that the the file that is added is not mmapped
>>> to memory.  The process size should be relatively small independently
>>> of the size of the file that is added.  I wanted to keep the file size
>>> small.  The chosen sizes worked for me on Mac and Linux.
>>
>> Measuring memory usage seems inherently a bit flaky for the test suite.
>> It's also a little out of place, as the test suite is generally about
>> correctness and outcomes, and this is a performance issue.
>
> For files >2GB on a 32-bit system (e.g. msysgit), filtering with the previous code always failed.  Now it works.  I created the patch to change git from 'fundamentally doesn't handle this' to 'works as expected'.

I deal with similar problem and added $GIT_ALLOC_LIMIT to test large
blob code. Maybe we could add $GIT_MMAP_LIMIT? I don't suppose we call
xmmap/xmalloc so often that the extra test would slow git down.
-- 
Duy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-17 11:33           ` Duy Nguyen
@ 2014-08-19  7:53             ` Jeff King
  2014-08-19  8:26               ` Steffen Prohaska
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2014-08-19  7:53 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Steffen Prohaska, John Keeping, Junio C Hamano, Git Mailing List,
	Scott Chacon

On Sun, Aug 17, 2014 at 06:33:54PM +0700, Duy Nguyen wrote:

> > For files >2GB on a 32-bit system (e.g. msysgit), filtering with the
> > previous code always failed.  Now it works.  I created the patch to
> > change git from 'fundamentally doesn't handle this' to 'works as
> > expected'.
> 
> I deal with similar problem and added $GIT_ALLOC_LIMIT to test large
> blob code. Maybe we could add $GIT_MMAP_LIMIT? I don't suppose we call
> xmmap/xmalloc so often that the extra test would slow git down.

Yeah, I think I'd prefer that approach. We should mmap _way_ less than
we malloc, and I do not think the malloc check has caused any problems.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap
  2014-08-19  7:53             ` Jeff King
@ 2014-08-19  8:26               ` Steffen Prohaska
  0 siblings, 0 replies; 13+ messages in thread
From: Steffen Prohaska @ 2014-08-19  8:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Duy Nguyen, John Keeping, Junio C Hamano, Git Mailing List,
	Scott Chacon


On Aug 19, 2014, at 9:53 AM, Jeff King <peff@peff.net> wrote:

>>> For files >2GB on a 32-bit system (e.g. msysgit), filtering with the
>>> previous code always failed.  Now it works.  I created the patch to
>>> change git from 'fundamentally doesn't handle this' to 'works as
>>> expected'.
>> 
>> I deal with similar problem and added $GIT_ALLOC_LIMIT to test large
>> blob code. Maybe we could add $GIT_MMAP_LIMIT? I don't suppose we call
>> xmmap/xmalloc so often that the extra test would slow git down.
> 
> Yeah, I think I'd prefer that approach. We should mmap _way_ less than
> we malloc, and I do not think the malloc check has caused any problems.

I am going to update the patch accordingly.

	Steffen

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-08-19  8:27 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-08-06  5:32 [PATCH v2 0/2] Stream fd to clean filter Steffen Prohaska
2014-08-06  5:32 ` [PATCH v2 1/2] convert: Refactor would_convert_to_git() to single arg 'path' Steffen Prohaska
2014-08-06  5:32 ` [PATCH v2 2/2] convert: Stream from fd to required clean filter instead of mmap Steffen Prohaska
2014-08-16 10:27   ` John Keeping
2014-08-16 16:26     ` Steffen Prohaska
2014-08-16 17:00       ` Andreas Schwab
2014-08-17 10:08         ` Steffen Prohaska
2014-08-17 11:03           ` Andreas Schwab
2014-08-17  7:27       ` Jeff King
2014-08-17 10:25         ` Steffen Prohaska
2014-08-17 11:33           ` Duy Nguyen
2014-08-19  7:53             ` Jeff King
2014-08-19  8:26               ` Steffen Prohaska

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).