[PATCH 0/3] fast textconv

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] fast textconv
@ 2010-03-28 14:53 Jeff King
  2010-03-28 14:53 ` [PATCH 1/3] textconv: refactor calls to run_textconv Jeff King
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Jeff King @ 2010-03-28 14:53 UTC (permalink / raw)
  To: git

The normal textconv procedure is to dump the binary file to a tempfile
(optionally using a working tree file if available), then run the
textconv helper to produce a textual version on stdout. This is a very
convenient interface, as helpers don't need to be aware of git at all
and many standard commands can be used without wrappers.

Unfortunately, it can be slow for large binary files. We spool the file
to disk before invoking the textconv helper, so the helper has no way to
do any optimizations. For example, the helper may need only part of the
file (e.g., when showing metadata at the beginning of a media file), or
it may implement a caching scheme to avoid repeating expensive
conversions.

This series introduces a "fast textconv", which does not automatically
spool a tempfile, but instead gives the helper program the sha1 of the
blob to be converted.

Here are some timings from my photo repository, on a commit with 37
JPEGs and 8 AVIs. Each file had two lines added to its exif metadata.
My textconv helper is a perl script that dumps the exif tags, and
implements its own caching scheme.

  $ time git show >/dev/null ;# before patch
  real    0m13.818s
  user    0m12.137s
  sys     0m1.552s

  $ time git show >/dev/null ;# after patch, first run
  real    0m15.076s
  user    0m13.321s
  sys     0m1.772s

  $ time git show >/dev/null ;# after patch, subsequent runs
  real    0m2.502s
  user    0m1.820s
  sys     0m0.592s

So you can see a 5.5x speedup. The first run is a little bit slower,
presumably due to the extra git-cat-file calls by the helper.

The speedup is purely from caching; I am not using the "we only need to
read the first part of the file" optimization. My files are only a few
megabytes. Probably that would be more useful for people storing files
in the hundreds of megabytes, where a full cat-file will cause a lot of
unwanted I/O.

There are two things I'm still not 100% happy with:

 1. 2.5 seconds is still a little slower than I would like. The slowness
    comes from the fact that my helper is written in perl, and therefore
    perl gets invoked for each diff. I could try collecting all of the
    to-be-textconv'd files at the beginning of the diff process and just
    invoking the helper once. But that means we need to store the
    results in core, and they could potentially be long (in my case,
    they are only a few hundred bytes, but somebody could potentially be
    textconv'ing a large documents).

 2. It is up to the helper to implement a caching layer. This offers a
    lot of flexibility, but it means each helper must implement its own.
    It also means we have to run the helper even for a cache hit, which
    causes slowness.

    An alternative would be for git to support textconv caching
    natively, probably by using the notes mechanism to map blob sha1's
    to their textconv'd contents. But that opens a whole can of worms
    with how the cache is managed. If I change my textconv helper to
    produce different results, how do I invalidate the cache? Should it
    happen automatically if I change the contents of
    diff.$method.textconv? Or do I need to do it manually (you will
    still need to do it manually if, e.g., you upgrade your textconv
    helper. Git can't know about that). How do I evict entries if the
    cache gets too large when notes are stored as a history?

So I'm not sure. This series works and is simple from git's perspective.
But caching textconv results in notes would be faster, and easier for
people to write helper scripts.

The patches are:

  [1/3]: textconv: refactor calls to run_textconv
  [2/3]: textconv: refactor to handle multiple textconv types
  [3/3]: diff: add "fasttextconv" config option

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] textconv: refactor calls to run_textconv
  2010-03-28 14:53 [PATCH 0/3] fast textconv Jeff King
@ 2010-03-28 14:53 ` Jeff King
  2010-03-28 14:53 ` [PATCH 2/3] textconv: refactor to handle multiple textconv types Jeff King
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2010-03-28 14:53 UTC (permalink / raw)
  To: git

This patch adds a fill_textconv wrapper, which centralizes
some minor logic like error checking and handling the case
of no-textconv.

In addition to dropping the number of lines, this will make
it easier in future patches to handle multiple types of
textconv.

Signed-off-by: Jeff King <peff@peff.net>
---
 diff.c |   58 ++++++++++++++++++++++++----------------------------------
 1 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/diff.c b/diff.c
index f5d93e9..3ddc05e 100644
--- a/diff.c
+++ b/diff.c
@@ -43,7 +43,8 @@ static char diff_colors[][COLOR_MAXLEN] = {
 };
 
 static void diff_filespec_load_driver(struct diff_filespec *one);
-static char *run_textconv(const char *, struct diff_filespec *, size_t *);
+static size_t fill_textconv(const char *cmd,
+		struct diff_filespec *df, char **outbuf);
 
 static int parse_diff_color_slot(const char *var, int ofs)
 {
@@ -477,7 +478,7 @@ static void emit_rewrite_diff(const char *name_a,
 	const char *reset = diff_get_color(color_diff, DIFF_RESET);
 	static struct strbuf a_name = STRBUF_INIT, b_name = STRBUF_INIT;
 	const char *a_prefix, *b_prefix;
-	const char *data_one, *data_two;
+	char *data_one, *data_two;
 	size_t size_one, size_two;
 	struct emit_callback ecbdata;
 
@@ -501,24 +502,8 @@ static void emit_rewrite_diff(const char *name_a,
 
 	diff_populate_filespec(one, 0);
 	diff_populate_filespec(two, 0);
-	if (textconv_one) {
-		data_one = run_textconv(textconv_one, one, &size_one);
-		if (!data_one)
-			die("unable to read files to diff");
-	}
-	else {
-		data_one = one->data;
-		size_one = one->size;
-	}
-	if (textconv_two) {
-		data_two = run_textconv(textconv_two, two, &size_two);
-		if (!data_two)
-			die("unable to read files to diff");
-	}
-	else {
-		data_two = two->data;
-		size_two = two->size;
-	}
+	size_one = fill_textconv(textconv_one, one, &data_one);
+	size_two = fill_textconv(textconv_two, two, &data_two);
 
 	memset(&ecbdata, 0, sizeof(ecbdata));
 	ecbdata.color_diff = color_diff;
@@ -1713,20 +1698,8 @@ static void builtin_diff(const char *name_a,
 			strbuf_reset(&header);
 		}
 
-		if (textconv_one) {
-			size_t size;
-			mf1.ptr = run_textconv(textconv_one, one, &size);
-			if (!mf1.ptr)
-				die("unable to read files to diff");
-			mf1.size = size;
-		}
-		if (textconv_two) {
-			size_t size;
-			mf2.ptr = run_textconv(textconv_two, two, &size);
-			if (!mf2.ptr)
-				die("unable to read files to diff");
-			mf2.size = size;
-		}
+		mf1.size = fill_textconv(textconv_one, one, &mf1.ptr);
+		mf2.size = fill_textconv(textconv_two, two, &mf2.ptr);
 
 		pe = diff_funcname_pattern(one);
 		if (!pe)
@@ -3907,3 +3880,20 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
 
 	return strbuf_detach(&buf, outsize);
 }
+
+static size_t fill_textconv(const char *cmd,
+			    struct diff_filespec *df,
+			    char **outbuf)
+{
+	size_t size;
+
+	if (!cmd) {
+		*outbuf = df->data;
+		return df->size;
+	}
+
+	*outbuf = run_textconv(cmd, df, &size);
+	if (!*outbuf)
+		die("unable to read files to diff");
+	return size;
+}
-- 
1.7.0.3.294.g8e730.dirty

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/3] textconv: refactor to handle multiple textconv types
  2010-03-28 14:53 [PATCH 0/3] fast textconv Jeff King
  2010-03-28 14:53 ` [PATCH 1/3] textconv: refactor calls to run_textconv Jeff King
@ 2010-03-28 14:53 ` Jeff King
  2010-03-28 14:54 ` [PATCH 3/3] diff: add "fasttextconv" config option Jeff King
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2010-03-28 14:53 UTC (permalink / raw)
  To: git

We pass around textconv information as a string which is
either NULL (no textconv) or not (and contains the textconv
command). In preparation for handling multiple types of
textconv interface, this patch converts it into a struct
with an "interface" field.

Signed-off-by: Jeff King <peff@peff.net>
---
 diff.c     |   26 ++++++++++++++++----------
 userdiff.c |    9 ++++++++-
 userdiff.h |   12 +++++++++++-
 3 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/diff.c b/diff.c
index 3ddc05e..0d31320 100644
--- a/diff.c
+++ b/diff.c
@@ -43,7 +43,7 @@ static char diff_colors[][COLOR_MAXLEN] = {
 };
 
 static void diff_filespec_load_driver(struct diff_filespec *one);
-static size_t fill_textconv(const char *cmd,
+static size_t fill_textconv(struct userdiff_extcmd *cmd,
 		struct diff_filespec *df, char **outbuf);
 
 static int parse_diff_color_slot(const char *var, int ofs)
@@ -466,8 +466,8 @@ static void emit_rewrite_diff(const char *name_a,
 			      const char *name_b,
 			      struct diff_filespec *one,
 			      struct diff_filespec *two,
-			      const char *textconv_one,
-			      const char *textconv_two,
+			      struct userdiff_extcmd *textconv_one,
+			      struct userdiff_extcmd *textconv_two,
 			      struct diff_options *o)
 {
 	int lc_a, lc_b;
@@ -1567,14 +1567,16 @@ void diff_set_mnemonic_prefix(struct diff_options *options, const char *a, const
 		options->b_prefix = b;
 }
 
-static const char *get_textconv(struct diff_filespec *one)
+static struct userdiff_extcmd *get_textconv(struct diff_filespec *one)
 {
 	if (!DIFF_FILE_VALID(one))
 		return NULL;
 	if (!S_ISREG(one->mode))
 		return NULL;
 	diff_filespec_load_driver(one);
-	return one->driver->textconv;
+	if (!one->driver->textconv.interface)
+		return NULL;
+	return &one->driver->textconv;
 }
 
 static void builtin_diff(const char *name_a,
@@ -1591,7 +1593,7 @@ static void builtin_diff(const char *name_a,
 	const char *set = diff_get_color_opt(o, DIFF_METAINFO);
 	const char *reset = diff_get_color_opt(o, DIFF_RESET);
 	const char *a_prefix, *b_prefix;
-	const char *textconv_one = NULL, *textconv_two = NULL;
+	struct userdiff_extcmd *textconv_one = NULL, *textconv_two = NULL;
 	struct strbuf header = STRBUF_INIT;
 
 	if (DIFF_OPT_TST(o, SUBMODULE_LOG) &&
@@ -3848,7 +3850,7 @@ void diff_unmerge(struct diff_options *options,
 	diff_queue(&diff_queued_diff, one, two)->is_unmerged = 1;
 }
 
-static char *run_textconv(const char *pgm, struct diff_filespec *spec,
+static char *run_textconv_simple(const char *pgm, struct diff_filespec *spec,
 		size_t *outsize)
 {
 	struct diff_tempfile *temp;
@@ -3881,18 +3883,22 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
 	return strbuf_detach(&buf, outsize);
 }
 
-static size_t fill_textconv(const char *cmd,
+static size_t fill_textconv(struct userdiff_extcmd *cmd,
 			    struct diff_filespec *df,
 			    char **outbuf)
 {
 	size_t size;
 
-	if (!cmd) {
+	if (!cmd || !cmd->interface) {
 		*outbuf = df->data;
 		return df->size;
 	}
 
-	*outbuf = run_textconv(cmd, df, &size);
+	if (cmd->interface == USERDIFF_EXTCMD_SIMPLE)
+		*outbuf = run_textconv_simple(cmd->cmd, df, &size);
+	else
+		die("BUG: invalid internal textconv type");
+
 	if (!*outbuf)
 		die("unable to read files to diff");
 	return size;
diff --git a/userdiff.c b/userdiff.c
index df99249..6850f21 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -158,6 +158,13 @@ static int parse_string(const char **d, const char *k, const char *v)
 	return 1;
 }
 
+static int parse_extcmd(struct userdiff_extcmd *e, const char *k,
+		const char *v)
+{
+	e->interface = USERDIFF_EXTCMD_SIMPLE;
+	return parse_string(&e->cmd, k, v);
+}
+
 static int parse_tristate(int *b, const char *k, const char *v)
 {
 	if (v && !strcasecmp(v, "auto"))
@@ -180,7 +187,7 @@ int userdiff_config(const char *k, const char *v)
 	if ((drv = parse_driver(k, v, "command")))
 		return parse_string(&drv->external, k, v);
 	if ((drv = parse_driver(k, v, "textconv")))
-		return parse_string(&drv->textconv, k, v);
+		return parse_extcmd(&drv->textconv, k, v);
 	if ((drv = parse_driver(k, v, "wordregex")))
 		return parse_string(&drv->word_regex, k, v);
 
diff --git a/userdiff.h b/userdiff.h
index c315159..0c1ac06 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -6,13 +6,23 @@ struct userdiff_funcname {
 	int cflags;
 };
 
+enum userdiff_extcmd_interface {
+	USERDIFF_EXTCMD_NONE = 0,
+	USERDIFF_EXTCMD_SIMPLE,
+};
+
+struct userdiff_extcmd {
+	const char *cmd;
+	enum userdiff_extcmd_interface interface;
+};
+
 struct userdiff_driver {
 	const char *name;
 	const char *external;
 	int binary;
 	struct userdiff_funcname funcname;
 	const char *word_regex;
-	const char *textconv;
+	struct userdiff_extcmd textconv;
 };
 
 int userdiff_config(const char *k, const char *v);
-- 
1.7.0.3.294.g8e730.dirty

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/3] diff: add "fasttextconv" config option
  2010-03-28 14:53 [PATCH 0/3] fast textconv Jeff King
  2010-03-28 14:53 ` [PATCH 1/3] textconv: refactor calls to run_textconv Jeff King
  2010-03-28 14:53 ` [PATCH 2/3] textconv: refactor to handle multiple textconv types Jeff King
@ 2010-03-28 14:54 ` Jeff King
  2010-03-28 18:23   ` Johannes Sixt
  2010-03-28 16:09 ` [PATCH 0/3] fast textconv Michael J Gruber
  2010-03-30  3:52 ` Junio C Hamano
  4 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2010-03-28 14:54 UTC (permalink / raw)
  To: git

The normal textconv procedure is to dump the binary file to
a tempfile (optionally using a working tree file if
available), then run the textconv helper to produce a
textual version on stdout. This is a very convenient
interface, as helpers don't need to be aware of git at all
and many standard commands can be used without wrappers.

Unfortunately, it can be slow for large binary files. We
spool the file to disk before invoking the textconv helper,
so the helper has no way to do any optimizations. For
example, the helper may need only part of the file (e.g.,
when showing metadata at the beginning of a media file), or
it may implement a caching scheme to avoid repeating
expensive conversions.

This patch introduces diff.*.fasttextconv, which invokes the
textconv helper with the sha1 of the blob to be converted.
The helper can then retrieve the file with git-cat-file if
necessary.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/gitattributes.txt |   11 +++++++++++
 diff.c                          |   32 ++++++++++++++++++++++++++++++++
 t/t4030-diff-textconv.sh        |   15 +++++++++++++++
 userdiff.c                      |    8 +++++---
 userdiff.h                      |    1 +
 5 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index d892e64..6066256 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -414,6 +414,17 @@ because it quickly conveys the changes you have made), you
 should generate it separately and send it as a comment _in
 addition to_ the usual binary diff that you might send.
 
+The `fasttextconv` config option is similar to `textconv`, but
+uses a different interface that provides more opporunity for
+optimization at the cost of increased complexity. The program
+specified by `fasttextconv` receives on the command line the sha1
+of the blob to be converted and the pathname, if any, from which
+the blob was accessed. It is the responsibility of the invoked
+program to retrieve the blob's contents (e.g., using `git
+cat-file blob $sha1`). This allows the helper program to avoid
+looking at the blob contents in some situations (e.g., if it
+implements a caching layer keyed by sha1).
+
 
 Performing a three-way merge
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/diff.c b/diff.c
index 0d31320..5208d4d 100644
--- a/diff.c
+++ b/diff.c
@@ -3883,6 +3883,36 @@ static char *run_textconv_simple(const char *pgm, struct diff_filespec *spec,
 	return strbuf_detach(&buf, outsize);
 }
 
+static char *run_textconv_fast(const char *pgm, struct diff_filespec *spec,
+		size_t *outsize)
+{
+	const char *argv[4];
+	const char **arg = argv;
+	struct child_process child;
+	struct strbuf buf = STRBUF_INIT;
+
+	*arg++ = pgm;
+	*arg++ = sha1_to_hex(spec->sha1);
+	*arg++ = spec->path;
+	*arg = NULL;
+
+	memset(&child, 0, sizeof(child));
+	child.use_shell = 1;
+	child.argv = argv;
+	child.out = -1;
+	if (start_command(&child) != 0 ||
+	    strbuf_read(&buf, child.out, 0) < 0 ||
+	    finish_command(&child) != 0) {
+		close(child.out);
+		strbuf_release(&buf);
+		error("error running textconv command '%s'", pgm);
+		return NULL;
+	}
+	close(child.out);
+
+	return strbuf_detach(&buf, outsize);
+}
+
 static size_t fill_textconv(struct userdiff_extcmd *cmd,
 			    struct diff_filespec *df,
 			    char **outbuf)
@@ -3896,6 +3926,8 @@ static size_t fill_textconv(struct userdiff_extcmd *cmd,
 
 	if (cmd->interface == USERDIFF_EXTCMD_SIMPLE)
 		*outbuf = run_textconv_simple(cmd->cmd, df, &size);
+	else if (cmd->interface == USERDIFF_EXTCMD_FAST)
+		*outbuf = run_textconv_fast(cmd->cmd, df, &size);
 	else
 		die("BUG: invalid internal textconv type");
 
diff --git a/t/t4030-diff-textconv.sh b/t/t4030-diff-textconv.sh
index 88c5619..4b62434 100755
--- a/t/t4030-diff-textconv.sh
+++ b/t/t4030-diff-textconv.sh
@@ -25,6 +25,13 @@ perl -e '$/ = undef; $_ = <>; s/./ord($&)/ge; print $_' < "$1"
 EOF
 chmod +x hexdump
 
+cat >hexdump.fast <<'EOF'
+#!/bin/sh
+git cat-file blob "$1" |
+perl -e '$/ = undef; $_ = <>; s/./ord($&)/ge; print $_'
+EOF
+chmod +x hexdump.fast
+
 test_expect_success 'setup binary file with history' '
 	printf "\\0\\n" >file &&
 	git add file &&
@@ -49,6 +56,7 @@ test_expect_success 'file is considered binary by plumbing' '
 test_expect_success 'setup textconv filters' '
 	echo file diff=foo >.gitattributes &&
 	git config diff.foo.textconv "\"$(pwd)\""/hexdump &&
+	git config diff.fast.fasttextconv "\"$(pwd)\""/hexdump.fast &&
 	git config diff.fail.textconv false
 '
 
@@ -84,6 +92,13 @@ test_expect_success 'status -v produces text' '
 	git reset --soft HEAD@{1}
 '
 
+test_expect_success 'fast textconv works' '
+	echo file diff=fast >.gitattributes &&
+	git diff HEAD^ HEAD >diff &&
+	find_diff <diff >actual &&
+	test_cmp expect.text actual
+'
+
 cat >expect.stat <<'EOF'
  file |  Bin 2 -> 4 bytes
  1 files changed, 0 insertions(+), 0 deletions(-)
diff --git a/userdiff.c b/userdiff.c
index 6850f21..c913dc2 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -159,9 +159,9 @@ static int parse_string(const char **d, const char *k, const char *v)
 }
 
 static int parse_extcmd(struct userdiff_extcmd *e, const char *k,
-		const char *v)
+		const char *v, enum userdiff_extcmd_interface type)
 {
-	e->interface = USERDIFF_EXTCMD_SIMPLE;
+	e->interface = type;
 	return parse_string(&e->cmd, k, v);
 }
 
@@ -187,7 +187,9 @@ int userdiff_config(const char *k, const char *v)
 	if ((drv = parse_driver(k, v, "command")))
 		return parse_string(&drv->external, k, v);
 	if ((drv = parse_driver(k, v, "textconv")))
-		return parse_extcmd(&drv->textconv, k, v);
+		return parse_extcmd(&drv->textconv, k, v, USERDIFF_EXTCMD_SIMPLE);
+	if ((drv = parse_driver(k, v, "fasttextconv")))
+		return parse_extcmd(&drv->textconv, k, v, USERDIFF_EXTCMD_FAST);
 	if ((drv = parse_driver(k, v, "wordregex")))
 		return parse_string(&drv->word_regex, k, v);
 
diff --git a/userdiff.h b/userdiff.h
index 0c1ac06..84fd3df 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -9,6 +9,7 @@ struct userdiff_funcname {
 enum userdiff_extcmd_interface {
 	USERDIFF_EXTCMD_NONE = 0,
 	USERDIFF_EXTCMD_SIMPLE,
+	USERDIFF_EXTCMD_FAST,
 };
 
 struct userdiff_extcmd {
-- 
1.7.0.3.294.g8e730.dirty

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] fast textconv
  2010-03-28 14:53 [PATCH 0/3] fast textconv Jeff King
                   ` (2 preceding siblings ...)
  2010-03-28 14:54 ` [PATCH 3/3] diff: add "fasttextconv" config option Jeff King
@ 2010-03-28 16:09 ` Michael J Gruber
  2010-03-28 16:17   ` Jeff King
  2010-03-30  3:52 ` Junio C Hamano
  4 siblings, 1 reply; 19+ messages in thread
From: Michael J Gruber @ 2010-03-28 16:09 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King venit, vidit, dixit 28.03.2010 16:53:
> The normal textconv procedure is to dump the binary file to a tempfile
> (optionally using a working tree file if available), then run the
> textconv helper to produce a textual version on stdout. This is a very
> convenient interface, as helpers don't need to be aware of git at all
> and many standard commands can be used without wrappers.
> 
> Unfortunately, it can be slow for large binary files. We spool the file
> to disk before invoking the textconv helper, so the helper has no way to
> do any optimizations. For example, the helper may need only part of the
> file (e.g., when showing metadata at the beginning of a media file), or
> it may implement a caching scheme to avoid repeating expensive
> conversions.
> 
> This series introduces a "fast textconv", which does not automatically
> spool a tempfile, but instead gives the helper program the sha1 of the
> blob to be converted.
> 
> Here are some timings from my photo repository, on a commit with 37
> JPEGs and 8 AVIs. Each file had two lines added to its exif metadata.
> My textconv helper is a perl script that dumps the exif tags, and
> implements its own caching scheme.
> 
>   $ time git show >/dev/null ;# before patch
>   real    0m13.818s
>   user    0m12.137s
>   sys     0m1.552s
> 
>   $ time git show >/dev/null ;# after patch, first run
>   real    0m15.076s
>   user    0m13.321s
>   sys     0m1.772s
> 
>   $ time git show >/dev/null ;# after patch, subsequent runs
>   real    0m2.502s
>   user    0m1.820s
>   sys     0m0.592s
> 
> So you can see a 5.5x speedup. The first run is a little bit slower,
> presumably due to the extra git-cat-file calls by the helper.
> 
> The speedup is purely from caching; I am not using the "we only need to
> read the first part of the file" optimization. My files are only a few
> megabytes. Probably that would be more useful for people storing files
> in the hundreds of megabytes, where a full cat-file will cause a lot of
> unwanted I/O.
> 
> There are two things I'm still not 100% happy with:
> 
>  1. 2.5 seconds is still a little slower than I would like. The slowness
>     comes from the fact that my helper is written in perl, and therefore
>     perl gets invoked for each diff. I could try collecting all of the
>     to-be-textconv'd files at the beginning of the diff process and just
>     invoking the helper once. But that means we need to store the
>     results in core, and they could potentially be long (in my case,
>     they are only a few hundred bytes, but somebody could potentially be
>     textconv'ing a large documents).
> 
>  2. It is up to the helper to implement a caching layer. This offers a
>     lot of flexibility, but it means each helper must implement its own.
>     It also means we have to run the helper even for a cache hit, which
>     causes slowness.
> 
>     An alternative would be for git to support textconv caching
>     natively, probably by using the notes mechanism to map blob sha1's
>     to their textconv'd contents. But that opens a whole can of worms
>     with how the cache is managed. If I change my textconv helper to
>     produce different results, how do I invalidate the cache? Should it
>     happen automatically if I change the contents of
>     diff.$method.textconv? Or do I need to do it manually (you will
>     still need to do it manually if, e.g., you upgrade your textconv
>     helper. Git can't know about that). How do I evict entries if the
>     cache gets too large when notes are stored as a history?

Really, "Notes!" was my first thought even before reading 2. Happy to
have found a like mind :)

This would still need a mechanism where the conv helper gets the blob's
SHA1 - hey, it's there in your patch...

How about:

Set fasttextconv=notestextconv

notestextconv does the following:

- If $sha1 has a note in refs/notes/bikeshed display it.
- If not create one and then display it.

In fact, the creation could be done using the textconv setting!

Pruning the cache is done be deleting the refs/notes/bikeshed ref,
truncating it by truncating it's DAG (filter-branch...).

Cheers,
Michael

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] fast textconv
  2010-03-28 16:09 ` [PATCH 0/3] fast textconv Michael J Gruber
@ 2010-03-28 16:17   ` Jeff King
  2010-03-28 16:19     ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2010-03-28 16:17 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: git

On Sun, Mar 28, 2010 at 06:09:35PM +0200, Michael J Gruber wrote:

> Really, "Notes!" was my first thought even before reading 2. Happy to
> have found a like mind :)
> 
> This would still need a mechanism where the conv helper gets the blob's
> SHA1 - hey, it's there in your patch...
> 
> How about:
> 
> Set fasttextconv=notestextconv
> 
> notestextconv does the following:
> 
> - If $sha1 has a note in refs/notes/bikeshed display it.
> - If not create one and then display it.
> 
> In fact, the creation could be done using the textconv setting!

If I understand you right, you are proposing a separate program
that you would pass as a fasttextconv helper, and that would look in a
notes tree. So you would still have a per-diff fork/exec, and pipe all
the data.

I was thinking of actually doing it in-core, so cache hits would be as
lightweight as a notes lookup (and cache misses obviously would still
fork/exec a helper, but we don't care too much since the helper's time
to convert will dominate in that path).

> Pruning the cache is done be deleting the refs/notes/bikeshed ref,
> truncating it by truncating it's DAG (filter-branch...).

Yeah, that would work. It just means it's one more thing the user has to
deal with. I didn't want to have to introduce a "git textconv"
management helper. ;)

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] fast textconv
  2010-03-28 16:17   ` Jeff King
@ 2010-03-28 16:19     ` Jeff King
  2010-03-28 16:56       ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2010-03-28 16:19 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: git

On Sun, Mar 28, 2010 at 12:17:28PM -0400, Jeff King wrote:

> If I understand you right, you are proposing a separate program
> that you would pass as a fasttextconv helper, and that would look in a
> notes tree. So you would still have a per-diff fork/exec, and pipe all
> the data.
> 
> I was thinking of actually doing it in-core, so cache hits would be as
> lightweight as a notes lookup (and cache misses obviously would still
> fork/exec a helper, but we don't care too much since the helper's time
> to convert will dominate in that path).

Side note: I think I might prototype it as a separate program and see
what kind of speed I can get.

> > Pruning the cache is done be deleting the refs/notes/bikeshed ref,
> > truncating it by truncating it's DAG (filter-branch...).
> 
> Yeah, that would work. It just means it's one more thing the user has to
> deal with. I didn't want to have to introduce a "git textconv"
> management helper. ;)

Other side note: it's not even that it's that hard to prune. It's that
you might tweak your config, and then you have to _remember_ to prune.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] fast textconv
  2010-03-28 16:19     ` Jeff King
@ 2010-03-28 16:56       ` Jeff King
  2010-03-28 17:34         ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2010-03-28 16:56 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: git

On Sun, Mar 28, 2010 at 12:19:21PM -0400, Jeff King wrote:

> On Sun, Mar 28, 2010 at 12:17:28PM -0400, Jeff King wrote:
> 
> > If I understand you right, you are proposing a separate program
> > that you would pass as a fasttextconv helper, and that would look in a
> > notes tree. So you would still have a per-diff fork/exec, and pipe all
> > the data.
> > 
> > I was thinking of actually doing it in-core, so cache hits would be as
> > lightweight as a notes lookup (and cache misses obviously would still
> > fork/exec a helper, but we don't care too much since the helper's time
> > to convert will dominate in that path).
> 
> Side note: I think I might prototype it as a separate program and see
> what kind of speed I can get.

Better, but not perfect. My script is below. I get:

  $ time git show >/dev/null
  real    0m1.036s
  user    0m0.412s
  sys     0m0.672s

which is still a 2.5x speedup (versus my other fast-textconv solution
earlier in the thread), but I suspect we can do better. The notes
mechanism does some up-front work to get very fast lookups, but because
we invoke git-notes repeatedly, we never get the amortized benefit of
that up-front work.  Doing it in-core would fix that.

My script was:

-- >8 --
#!/bin/sh

type=$1; shift
program=$1; shift
sha1=$1; shift
filename=$1; shift

GIT_NOTES_REF=refs/notes/textconv/$type; export GIT_NOTES_REF

# try the cache
git notes show $sha1 2>/dev/null && exit 0

# otherwise, insert the cache entry.
# We can be as slow as we like.
ext=`echo "$filename" | sed 's/.*\.//'`
tmp=`mktemp notes-textconv-XXXXXX.$ext`
git cat-file blob $sha1 >$tmp
$program $tmp | git notes add -f -F - $sha1
git notes show $sha1
rm -f $tmp
-- 8< --

and my config is:

  $ git config diff.mfo.textconv /home/peff/fast mfo mfo-tags

where "mfo-tags" is the program to display metadata and "mfo" is a
user-selected shorthand name for it.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] fast textconv
  2010-03-28 16:56       ` Jeff King
@ 2010-03-28 17:34         ` Jeff King
  2010-03-28 18:13           ` Sverre Rabbelier
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2010-03-28 17:34 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: git

On Sun, Mar 28, 2010 at 12:56:46PM -0400, Jeff King wrote:

> Better, but not perfect. My script is below. I get:
> 
>   $ time git show >/dev/null
>   real    0m1.036s
>   user    0m0.412s
>   sys     0m0.672s
> 
> which is still a 2.5x speedup (versus my other fast-textconv solution
> earlier in the thread), but I suspect we can do better. The notes
> mechanism does some up-front work to get very fast lookups, but because
> we invoke git-notes repeatedly, we never get the amortized benefit of
> that up-front work.  Doing it in-core would fix that.

Here is a quick and dirty in-core implementation. The most notable
defect is that all textconvs store under refs/notes/textconv, which is
obviously bogus if you might textconv the same blob in two different
ways.

It was relatively easy to code thanks to a very nice notes interface
from Johan Herland. That same commit is now:

  $ time git show >/dev/null
  real    0m0.350s
  user    0m0.172s
  sys     0m0.176s

which is starting to feel sufficiently snappy.

Patch is on top of 1/3 from my earlier series (it uses some of the
centralizing cleanup).

diff --git a/diff.c b/diff.c
index 3ddc05e..e1ef25e 100644
--- a/diff.c
+++ b/diff.c
@@ -14,6 +14,8 @@
 #include "userdiff.h"
 #include "sigchain.h"
 #include "submodule.h"
+#include "notes.h"
+#include "refs.h"
 
 #ifdef NO_FAST_WORKING_DIRECTORY
 #define FAST_WORKING_DIRECTORY 0
@@ -3881,19 +3883,57 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
 	return strbuf_detach(&buf, outsize);
 }
 
+static struct notes_tree textconv_notes;
+static int textconv_notes_initialized;
+
+static const unsigned char *run_and_cache_textconv(const char *cmd,
+		struct diff_filespec *df)
+{
+	char *data;
+	size_t size;
+	unsigned char blob_sha1[20];
+	unsigned char tree_sha1[20];
+
+	data = run_textconv(cmd, df, &size);
+	if (!data)
+		die("unable to read files to diff");
+	if (write_sha1_file(data, size, "blob", blob_sha1) < 0)
+		die("unable to write textconv cache blob");
+	free(data);
+
+	add_note(&textconv_notes, df->sha1, blob_sha1, combine_notes_overwrite);
+	if (write_notes_tree(&textconv_notes, tree_sha1) < 0)
+		die("unable to write textconv cache notes tree");
+	update_ref("textconv cache update", "refs/notes/textconv",
+			tree_sha1, NULL, 0, DIE_ON_ERR);
+
+	return get_note(&textconv_notes, df->sha1);
+}
+
 static size_t fill_textconv(const char *cmd,
 			    struct diff_filespec *df,
 			    char **outbuf)
 {
-	size_t size;
+	const unsigned char *sha1;
+	unsigned long size;
+	enum object_type type;
 
 	if (!cmd) {
 		*outbuf = df->data;
 		return df->size;
 	}
 
-	*outbuf = run_textconv(cmd, df, &size);
+	if (!textconv_notes_initialized) {
+		init_notes(&textconv_notes, "refs/notes/textconv", 0, 0);
+		textconv_notes_initialized = 1;
+	}
+
+	sha1 = get_note(&textconv_notes, df->sha1);
+	if (!sha1)
+		sha1 = run_and_cache_textconv(cmd, df);
+
+	*outbuf = read_sha1_file(sha1, &type, &size);
 	if (!*outbuf)
-		die("unable to read files to diff");
+		die("unable to read textconv cache object");
 	return size;
 }

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] fast textconv
  2010-03-28 17:34         ` Jeff King
@ 2010-03-28 18:13           ` Sverre Rabbelier
  2010-03-30 16:04             ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Sverre Rabbelier @ 2010-03-28 18:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael J Gruber, git

Heya,

On Sun, Mar 28, 2010 at 11:34, Jeff King <peff@peff.net> wrote:
> Here is a quick and dirty in-core implementation. The most notable
> defect is that all textconvs store under refs/notes/textconv, which is
> obviously bogus if you might textconv the same blob in two different
> ways.

What I did for the implementation of remote helpers is hash the url of
the remote when storing it locally, that way you don't have to worry
about escaping urls etc. You could do the same with the textconvs,
store it under  refs/notes/textconv/<hash of textconv filter>?

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] diff: add "fasttextconv" config option
  2010-03-28 14:54 ` [PATCH 3/3] diff: add "fasttextconv" config option Jeff King
@ 2010-03-28 18:23   ` Johannes Sixt
  2010-03-30 16:30     ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Johannes Sixt @ 2010-03-28 18:23 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Sonntag, 28. März 2010, Jeff King wrote:
> +	if (start_command(&child) != 0 ||
> +	    strbuf_read(&buf, child.out, 0) < 0 ||
> +	    finish_command(&child) != 0) {

This conditional is somewhat dubious. If strbuf_read fails, you do not wait 
for the child, and a zombie remains.

The have this sequence already in run_textconv().

-- Hannes

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] fast textconv
  2010-03-28 14:53 [PATCH 0/3] fast textconv Jeff King
                   ` (3 preceding siblings ...)
  2010-03-28 16:09 ` [PATCH 0/3] fast textconv Michael J Gruber
@ 2010-03-30  3:52 ` Junio C Hamano
  2010-03-30 17:07   ` Jeff King
  4 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2010-03-30  3:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> The speedup is purely from caching; I am not using the "we only need to
> read the first part of the file" optimization.

This made me wonder if the end result might be easier to use if the
interface does not change "textconv", but adds some property to the
filter, i.e. "the output from this filter is stable and it is safe to
reuse a cached conversion result for a given blob object", boolean.  E.g.

        [diff "jpg"]
                textconv = exif
                textconv_stable = true

and let the calling side handle the caching.  I further suspect that
an unstable textconv filter would be an anomaly, so this could even be on
by default.

If we do so, stock conversion filters people have accumulated in the past
could be sped up without any additional change from the end user's side.

I guess that I am suggesting to postpone the potential speed-up that could
come from being able to inspect the header information as a separate
topic.  Besides, some file format has metadata at the end, which won't
help you.

About the caching scheme; to help invalidating the cache, it probably is a
good idea to use not just the blob object name but also at least the name
(command line) of the textconv filter as the key for the caching layer.

Instead of the "textconv_stable" boolean depicted above, you could add a
"textconv_filter_version" variable there, compute a hash over blob object
name, textconv filter name and textconv filter version, and use that as
the key to look into the cache (filters lacking textconv_filter_version
would then get no caching, and if you update your "exif" program you bump
the "textconv_filter_version" variable).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] fast textconv
  2010-03-28 18:13           ` Sverre Rabbelier
@ 2010-03-30 16:04             ` Jeff King
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2010-03-30 16:04 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Michael J Gruber, git

On Sun, Mar 28, 2010 at 12:13:10PM -0600, Sverre Rabbelier wrote:

> On Sun, Mar 28, 2010 at 11:34, Jeff King <peff@peff.net> wrote:
> > Here is a quick and dirty in-core implementation. The most notable
> > defect is that all textconvs store under refs/notes/textconv, which is
> > obviously bogus if you might textconv the same blob in two different
> > ways.
> 
> What I did for the implementation of remote helpers is hash the url of
> the remote when storing it locally, that way you don't have to worry
> about escaping urls etc. You could do the same with the textconvs,
> store it under  refs/notes/textconv/<hash of textconv filter>?

I thought about something like that, but we want to keep something that
is comprehensible to the user. If my install of "exif" changes, I need
to be able to say "invalidate the cache for exif". If it's
human-readable, I know I need to clear refs/notes/textconv/exif. But if
it's a hash, then I need a special tool to figure out which ref is
relevant.

Instead, my plan is to use refs/notes/textconv/$TYPE, where $TYPE is the
same string used by the config (e.g., "diff.$TYPE.textconv") and the
gitattributes file (e.g., "*.jpg diff=$TYPE"). The user has already
provided us a with a short, unique, meaningful name.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] diff: add "fasttextconv" config option
  2010-03-28 18:23   ` Johannes Sixt
@ 2010-03-30 16:30     ` Jeff King
  2010-03-30 17:36       ` [PATCH] diff: fix textconv error zombies Johannes Sixt
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2010-03-30 16:30 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Junio C Hamano, git

On Sun, Mar 28, 2010 at 08:23:00PM +0200, Johannes Sixt wrote:

> On Sonntag, 28. März 2010, Jeff King wrote:
> > +	if (start_command(&child) != 0 ||
> > +	    strbuf_read(&buf, child.out, 0) < 0 ||
> > +	    finish_command(&child) != 0) {
> 
> This conditional is somewhat dubious. If strbuf_read fails, you do not wait 
> for the child, and a zombie remains.
> 
> The have this sequence already in run_textconv().

Ugh. I would blame the author of run_textconv, but that is also me. :)

I doubt it is a big deal in practice, as the read() call is the least
likely to produce an error, and even if we do get a zombie, it will die
along with the diffing process. But it is still worth fixing. Thanks for
noticing.

Junio, can you apply the patch below to maint?

-- >8 --
Subject: [PATCH] diff: fix textconv error zombies

To make the code simpler, run_textconv lumps all of its
error checking into one conditional. However, the
short-circuit means that an error in reading will prevent us
from calling finish_command, leaving a zombie child.

The cleanup requirements are actually different for each of
the three error checks, so let's just write them out
longhand.

Signed-off-by: Jeff King <peff@peff.net>
---
Yes, there are clever ways to make this shorter, but I think it is
clearer just written out.

 diff.c |   16 +++++++++++++---
 1 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/diff.c b/diff.c
index 0d465fa..c268cfc 100644
--- a/diff.c
+++ b/diff.c
@@ -3875,9 +3875,19 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
 	child.use_shell = 1;
 	child.argv = argv;
 	child.out = -1;
-	if (start_command(&child) != 0 ||
-	    strbuf_read(&buf, child.out, 0) < 0 ||
-	    finish_command(&child) != 0) {
+	if (start_command(&child) != 0) {
+		remove_tempfile();
+		error("error running textconv command '%s'", pgm);
+		return NULL;
+	}
+	if (strbuf_read(&buf, child.out, 0) < 0) {
+		close(child.out);
+		finish_command(&child);
+		remove_tempfile();
+		error("error reading from textconv command '%s'", pgm);
+		return NULL;
+	}
+	if (finish_command(&child) != 0) {
 		close(child.out);
 		strbuf_release(&buf);
 		remove_tempfile();
-- 
1.7.0.3.460.g6f052

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] fast textconv
  2010-03-30  3:52 ` Junio C Hamano
@ 2010-03-30 17:07   ` Jeff King
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2010-03-30 17:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Mar 29, 2010 at 08:52:04PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The speedup is purely from caching; I am not using the "we only need to
> > read the first part of the file" optimization.
> 
> This made me wonder if the end result might be easier to use if the
> interface does not change "textconv", but adds some property to the
> filter, i.e. "the output from this filter is stable and it is safe to
> reuse a cached conversion result for a given blob object", boolean.  E.g.
> 
>         [diff "jpg"]
>                 textconv = exif
>                 textconv_stable = true
> 
> and let the calling side handle the caching.  I further suspect that
> an unstable textconv filter would be an anomaly, so this could even be on
> by default.

Yep, that is the conclusion I came to later in the thread. I am tempted
not to even make it configurable, but it is a space-time tradeoff, so I
can see how certain corner cases would want to disable.

So consider "fasttextconv" (and this series) scrapped, and I'll work on
a new series that does internal caching via notes.

> If we do so, stock conversion filters people have accumulated in the past
> could be sped up without any additional change from the end user's side.
> 
> I guess that I am suggesting to postpone the potential speed-up that could
> come from being able to inspect the header information as a separate
> topic.  Besides, some file format has metadata at the end, which won't
> help you.

Yes, I agree that the "header only" speed-up is not worth pursuing at
this point. With caching, actually looking at the file at all becomes
the slow path, and you just don't care as much. It might help if you
have 2G files or something where even just spooling the file the first
time is awful, but I will leave that as a later itch to be scratched if
people want.

For metadata at the end (or other spots), probably it would need to be
accompanied by an extension to cat-file to peek at random positions
within the file.

> About the caching scheme; to help invalidating the cache, it probably is a
> good idea to use not just the blob object name but also at least the name
> (command line) of the textconv filter as the key for the caching layer.
> 
> Instead of the "textconv_stable" boolean depicted above, you could add a
> "textconv_filter_version" variable there, compute a hash over blob object
> name, textconv filter name and textconv filter version, and use that as
> the key to look into the cache (filters lacking textconv_filter_version
> would then get no caching, and if you update your "exif" program you bump
> the "textconv_filter_version" variable).

I was thinking of having a separate notes tree for each textconv helper.
Most projects will only have a small number, so the extra tree setup
cost is negligible. And because we know which helper we are using, we
can look in each tree separately, which means you may not even need to
load cache entries at all for filetypes which are not being diffed.

For cache validity, I'm planning to store a special "validity" entry
(either keyed by the null sha1 in a cache tree, or to wrap the tree in a
parentless commit with that information in the comment field). The entry
would contain the command line used to run the helper. If it matches,
the cache is valid. If not, we clear the notes tree.

I think this ends up being slightly simpler for the user to manage the
cache, and it has better cache growth characteristics.

In both schemes, git will automatically correctly handle:

  $ git config diff.foo.textconv foo-helper
  $ git show
  $ git config diff.foo.textconv "foo-helper --options"
  $ git show

In my case, by clearing refs/notes/textconv/foo, and in your case by
storing under a different key sha1. I would argue my scheme is better
here, because it actually _throws away_ the now useless cache entries.
So the cache is automatically pruned (and if you really want to go back
in time to salvage an old cache, you can do it via the reflog).

They will also both automatically handle:

  $ echo '*.jpg diff=foo' >.gitattributes
  $ git show
  $ echo '*.jpg diff=bar' >.gitattributes
  $ git show

In my case, by using a totally different tree. In your case, again a
different key sha1. Which means I'm now leaving the old "foo" tree in
place with stale entries for '*.jpg' files. But this is the same data
wastage as with your scheme. In both cases, you can clean things out by
blowing away the cache. But in my scheme, you are allowed to blow away
only the "foo" cache. The cache data for other helpers isn't precious,
obviously, but it does make things slower.

Neither scheme will automatically handle the semantics of a helper
changing (e.g., upgrade of the "foo" helper). In my scheme, the solution
is to clear the "foo" cache (git update-ref -d refs/notes/textconv/foo).
In yours, it would be to update the textconv_filter_version field in the
config. I would argue that mine is conceptually simpler for the user.
It's a cache. When a cache is in a broken or invalid state, you clear it
and start again.

You could try to get very fancy and include the stat() data for the
helper script in the cache validity information to detect version
changes, but I doubt that it's worth it. It's much more complex (we run
helpers with the shell, so it may not be a single program), and it still
isn't foolproof (your environment, dynamic libraries, or other factors
may be what's changing the answer).

I'll see if I can work up some patches in the next day or two.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] diff: fix textconv error zombies
  2010-03-30 16:30     ` Jeff King
@ 2010-03-30 17:36       ` Johannes Sixt
  2010-03-30 21:46         ` Junio C Hamano
  0 siblings, 1 reply; 19+ messages in thread
From: Johannes Sixt @ 2010-03-30 17:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

To make the code simpler, run_textconv lumps all of its
error checking into one conditional. However, the
short-circuit means that an error in reading will prevent us
from calling finish_command, leaving a zombie child.
Clean up properly after errors.

Based-on-work-by: Jeff King <peff@peff.net>
Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
Jeff,

> Yes, there are clever ways to make this shorter, but I think it is
> clearer just written out.

Thanks for your work, but I'm worried that in your version the close()
call is not before the finish_command (but that's really not _that_
important in this case). My version is perhaps not as easy to read, but
has a slightly better diff-stat.

Oh, I removed the error messages after start_command and finish_command,
because these two print one themselves; Junio, if you disagree with
this change, ditch my version and take Jeff's.

-- Hannes

 diff.c |   17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/diff.c b/diff.c
index dfdfa1a..b96faea 100644
--- a/diff.c
+++ b/diff.c
@@ -3874,6 +3874,7 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
 	const char **arg = argv;
 	struct child_process child;
 	struct strbuf buf = STRBUF_INIT;
+	int err = 0;
 
 	temp = prepare_temp_file(spec->path, spec);
 	*arg++ = pgm;
@@ -3884,16 +3885,20 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
 	child.use_shell = 1;
 	child.argv = argv;
 	child.out = -1;
-	if (start_command(&child) != 0 ||
-	    strbuf_read(&buf, child.out, 0) < 0 ||
-	    finish_command(&child) != 0) {
-		close(child.out);
-		strbuf_release(&buf);
+	if (start_command(&child)) {
 		remove_tempfile();
-		error("error running textconv command '%s'", pgm);
 		return NULL;
 	}
+
+	if (strbuf_read(&buf, child.out, 0) < 0)
+		err = error("error reading from textconv command '%s'", pgm);
 	close(child.out);
+
+	if (finish_command(&child) || err) {
+		strbuf_release(&buf);
+		remove_tempfile();
+		return NULL;
+	}
 	remove_tempfile();
 
 	return strbuf_detach(&buf, outsize);
-- 
1.7.0.2.250.ge5578

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] diff: fix textconv error zombies
  2010-03-30 17:36       ` [PATCH] diff: fix textconv error zombies Johannes Sixt
@ 2010-03-30 21:46         ` Junio C Hamano
  2010-03-30 22:17           ` Johannes Sixt
  0 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2010-03-30 21:46 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Jeff King, git

Johannes Sixt <j6t@kdbg.org> writes:

> Thanks for your work, but I'm worried that in your version the close()
> call is not before the finish_command (but that's really not _that_
> important in this case).

Are you referring to this part in api-run-command.txt?

    This describes the arguments, redirections, and environment of a
    command to run in a sub-process.

    The caller:

    1. allocates and clears (memset(&chld, 0, sizeof(chld));) a
       struct child_process variable;
    2. initializes the members;
    3. calls start_command();
    4. processes the data;
    5. closes file descriptors (if necessary; see below);
    6. calls finish_command().

In "if necessary; see below", I tried to find something to help me judge
why you said "That's really not _that_ important in this case", but I
think I failed.  Perhaps we would want a bit more detailed discussion in
the document there?

The patch looks fine.  Thanks; will queue this one.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] diff: fix textconv error zombies
  2010-03-30 21:46         ` Junio C Hamano
@ 2010-03-30 22:17           ` Johannes Sixt
  2010-03-30 22:56             ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Johannes Sixt @ 2010-03-30 22:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Dienstag, 30. März 2010, Junio C Hamano wrote:
> In "if necessary; see below", I tried to find something to help me judge
> why you said "That's really not _that_ important in this case", but I
> think I failed.  Perhaps we would want a bit more detailed discussion in
> the document there?

The "if necessary; see below" refers to the cases where the caller requests a 
file descriptor from start_command(). It is safe to follow the recipe. But I 
actually would not want to expand on the special case like we have in 
run_textconv, where it is not strictly necessary to close the fd before 
finish_command(), namely when we know that the writer will exit after it has 
closed its end of the pipe and the reader reads until EOF.

This could deadlock in the theoretical case that the read() in the parent 
returns an error even though the child is still writing data; the child would 
hang while waiting for the pipe being drained by the parent; but the parent 
would wait for the child to exit, which never happens. Doing the close() 
before finish_command() avoids this situation because the child dies of 
SIGPIPE.

But since the communication is via a pipe, and read failures are highly 
unlikely (and a read failure in a way that the writer does not fail, too, is 
even more unlikely, I think), the close() before finish_command() is "not 
important".

-- Hannes

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] diff: fix textconv error zombies
  2010-03-30 22:17           ` Johannes Sixt
@ 2010-03-30 22:56             ` Jeff King
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2010-03-30 22:56 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Junio C Hamano, git

On Wed, Mar 31, 2010 at 12:17:24AM +0200, Johannes Sixt wrote:

> This could deadlock in the theoretical case that the read() in the parent 
> returns an error even though the child is still writing data; the child would 
> hang while waiting for the pipe being drained by the parent; but the parent 
> would wait for the child to exit, which never happens. Doing the close() 
> before finish_command() avoids this situation because the child dies of 
> SIGPIPE.

Yeah, I actually thought about that while writing the error case for
strbuf_read (I do close(child.out) before finish_command). But I didn't
consider it in the success case. I don't think it matters here because
we will already have gotten EOF from child.out, which should mean the
child is no longer writing and will not block.

So I am not sure there is any deadlock, but I am fine with your version
of the patch.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2010-03-30 22:56 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-03-28 14:53 [PATCH 0/3] fast textconv Jeff King
2010-03-28 14:53 ` [PATCH 1/3] textconv: refactor calls to run_textconv Jeff King
2010-03-28 14:53 ` [PATCH 2/3] textconv: refactor to handle multiple textconv types Jeff King
2010-03-28 14:54 ` [PATCH 3/3] diff: add "fasttextconv" config option Jeff King
2010-03-28 18:23   ` Johannes Sixt
2010-03-30 16:30     ` Jeff King
2010-03-30 17:36       ` [PATCH] diff: fix textconv error zombies Johannes Sixt
2010-03-30 21:46         ` Junio C Hamano
2010-03-30 22:17           ` Johannes Sixt
2010-03-30 22:56             ` Jeff King
2010-03-28 16:09 ` [PATCH 0/3] fast textconv Michael J Gruber
2010-03-28 16:17   ` Jeff King
2010-03-28 16:19     ` Jeff King
2010-03-28 16:56       ` Jeff King
2010-03-28 17:34         ` Jeff King
2010-03-28 18:13           ` Sverre Rabbelier
2010-03-30 16:04             ` Jeff King
2010-03-30  3:52 ` Junio C Hamano
2010-03-30 17:07   ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).