[PATCH 0/7] textconv caching

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/7] textconv caching
@ 2010-04-02  0:01 Jeff King
  2010-04-02  0:03 ` [PATCH 1/7] fix const-correctness of write_sha1_file Jeff King
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Jeff King @ 2010-04-02  0:01 UTC (permalink / raw)
  To: git

This is replacement for my fast-textconv series from a week ago. It's
much faster, and doesn't require the user to do any additional work when
writing or configuring a textconv helper. The cache is stored as a notes
tree.

On my sample commit, 45 jpegs and avi files with their metadata changed,
and a textconv helper that extracts the metadata, the speedup is:

  [before]
  $ time git show >/dev/null
  real    0m13.724s
  user    0m12.057s
  sys     0m1.624s

  [after (with cache primed)]
  $ time git show >/dev/null
  real    0m0.009s
  user    0m0.004s
  sys     0m0.004s

If you just blinked at those numbers, yes, it really is that much
faster.  The caching dropped it to about .35 seconds, which I showed in
an earlier "how about this" patch. The rest of it comes from patch 7/7,
where we can avoid even opening the binary blobs at all (in my sample,
they total about 180M).

The patches are:

  [1/7]: fix const-correctness of write_sha1_file
  [2/7]: fix textconv leak in emit_rewrite_diff
  [3/7]: make commit_tree a library function
  [4/7]: introduce notes-cache interface
  [5/7]: textconv: refactor calls to run_textconv
  [6/7]: diff: cache textconv output
  [7/7]: diff: avoid useless filespec population

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/7] fix const-correctness of write_sha1_file
  2010-04-02  0:01 [PATCH 0/7] textconv caching Jeff King
@ 2010-04-02  0:03 ` Jeff King
  2010-04-02  0:04 ` [PATCH 2/7] fix textconv leak in emit_rewrite_diff Jeff King
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2010-04-02  0:03 UTC (permalink / raw)
  To: git

These should take const buffers as input data, but zlib's
next_in pointer is not const-correct. Let's fix it at the
zlib level, though, so the cast happens in one obvious
place. This should be safe, as a similar cast is used in
zlib's example code for a const array.

Signed-off-by: Jeff King <peff@peff.net>
---
This helps me avoid casting at a higher level in the code, but I also
think it's the right thing to do.

 cache.h     |    2 +-
 sha1_file.c |    6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/cache.h b/cache.h
index 6dcb100..5eb0573 100644
--- a/cache.h
+++ b/cache.h
@@ -701,7 +701,7 @@ static inline void *read_sha1_file(const unsigned char *sha1, enum object_type *
 	return read_sha1_file_repl(sha1, type, size, NULL);
 }
 extern int hash_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1);
-extern int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
+extern int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
 extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned char *);
 extern int force_object_loose(const unsigned char *sha1, time_t mtime);
 
diff --git a/sha1_file.c b/sha1_file.c
index a08a9d0..ff65328 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2271,7 +2271,7 @@ static int create_tmpfile(char *buffer, size_t bufsiz, const char *filename)
 }
 
 static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
-			      void *buf, unsigned long len, time_t mtime)
+			      const void *buf, unsigned long len, time_t mtime)
 {
 	int fd, ret;
 	unsigned char compressed[4096];
@@ -2307,7 +2307,7 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	git_SHA1_Update(&c, hdr, hdrlen);
 
 	/* Then the data itself.. */
-	stream.next_in = buf;
+	stream.next_in = (void *)buf;
 	stream.avail_in = len;
 	do {
 		unsigned char *in0 = stream.next_in;
@@ -2342,7 +2342,7 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	return move_temp_to_file(tmpfile, filename);
 }
 
-int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
+int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
 {
 	unsigned char sha1[20];
 	char hdr[32];
-- 
1.7.0.4.299.gba9d4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/7] fix textconv leak in emit_rewrite_diff
  2010-04-02  0:01 [PATCH 0/7] textconv caching Jeff King
  2010-04-02  0:03 ` [PATCH 1/7] fix const-correctness of write_sha1_file Jeff King
@ 2010-04-02  0:04 ` Jeff King
  2010-04-02  0:05 ` [PATCH 3/7] make commit_tree a library function Jeff King
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2010-04-02  0:04 UTC (permalink / raw)
  To: git

We correctly free() for the normal diff case, but leak for
rewrite diffs.

Signed-off-by: Jeff King <peff@peff.net>
---
I suppose we never noticed this because rewrite diffs and textconv are
both uncommonly used codepaths, and leaking a little bit of textconv
probably just isn't that big a deal in a short-running process. But
stopping leaks is good.

 diff.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/diff.c b/diff.c
index 2daa732..db2cd5d 100644
--- a/diff.c
+++ b/diff.c
@@ -550,6 +550,10 @@ static void emit_rewrite_diff(const char *name_a,
 		emit_rewrite_lines(&ecbdata, '-', data_one, size_one);
 	if (lc_b)
 		emit_rewrite_lines(&ecbdata, '+', data_two, size_two);
+	if (textconv_one)
+		free(data_one);
+	if (textconv_two)
+		free(data_two);
 }
 
 struct diff_words_buffer {
-- 
1.7.0.4.299.gba9d4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/7] make commit_tree a library function
  2010-04-02  0:01 [PATCH 0/7] textconv caching Jeff King
  2010-04-02  0:03 ` [PATCH 1/7] fix const-correctness of write_sha1_file Jeff King
  2010-04-02  0:04 ` [PATCH 2/7] fix textconv leak in emit_rewrite_diff Jeff King
@ 2010-04-02  0:05 ` Jeff King
  2010-04-02  0:07 ` [PATCH 4/7] introduce notes-cache interface Jeff King
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2010-04-02  0:05 UTC (permalink / raw)
  To: git

Until now, this has been part of the commit-tree builtin.
However, it is already used by other builtins (like commit,
merge, and notes), and it would be useful to access it from
library code.

The check_valid helper has to come along, too, but is given
a more library-ish name of "assert_sha1_type".

Otherwise, the code is unchanged. There are still a few
rough edges for a library function, like printing the utf8
warning to stderr, but we can address those if and when they
come up as inappropriate.

Signed-off-by: Jeff King <peff@peff.net>
---
I'm not inclined to try lib-ifying it any more, as it would just feel
like code churn. It's not as if we aren't calling it from all over
already, so this is really just about moving it out of builtin/.

 builtin.h             |    3 --
 builtin/commit-tree.c |   70 +------------------------------------------------
 cache.h               |    2 +
 commit.c              |   55 ++++++++++++++++++++++++++++++++++++++
 commit.h              |    4 +++
 sha1_file.c           |   10 +++++++
 6 files changed, 72 insertions(+), 72 deletions(-)

diff --git a/builtin.h b/builtin.h
index 464588b..5c887ef 100644
--- a/builtin.h
+++ b/builtin.h
@@ -16,9 +16,6 @@ extern const char *help_unknown_cmd(const char *cmd);
 extern void prune_packed_objects(int);
 extern int fmt_merge_msg(int merge_summary, struct strbuf *in,
 	struct strbuf *out);
-extern int commit_tree(const char *msg, unsigned char *tree,
-		struct commit_list *parents, unsigned char *ret,
-		const char *author);
 extern int commit_notes(struct notes_tree *t, const char *msg);
 
 struct notes_rewrite_cfg {
diff --git a/builtin/commit-tree.c b/builtin/commit-tree.c
index 90dac34..87f0591 100644
--- a/builtin/commit-tree.c
+++ b/builtin/commit-tree.c
@@ -9,19 +9,6 @@
 #include "builtin.h"
 #include "utf8.h"
 
-/*
- * FIXME! Share the code with "write-tree.c"
- */
-static void check_valid(unsigned char *sha1, enum object_type expect)
-{
-	enum object_type type = sha1_object_info(sha1, NULL);
-	if (type < 0)
-		die("%s is not a valid object", sha1_to_hex(sha1));
-	if (type != expect)
-		die("%s is not a valid '%s' object", sha1_to_hex(sha1),
-		    typename(expect));
-}
-
 static const char commit_tree_usage[] = "git commit-tree <sha1> [-p <sha1>]* < changelog";
 
 static void new_parent(struct commit *parent, struct commit_list **parents_p)
@@ -38,61 +25,6 @@ static void new_parent(struct commit *parent, struct commit_list **parents_p)
 	commit_list_insert(parent, parents_p);
 }
 
-static const char commit_utf8_warn[] =
-"Warning: commit message does not conform to UTF-8.\n"
-"You may want to amend it after fixing the message, or set the config\n"
-"variable i18n.commitencoding to the encoding your project uses.\n";
-
-int commit_tree(const char *msg, unsigned char *tree,
-		struct commit_list *parents, unsigned char *ret,
-		const char *author)
-{
-	int result;
-	int encoding_is_utf8;
-	struct strbuf buffer;
-
-	check_valid(tree, OBJ_TREE);
-
-	/* Not having i18n.commitencoding is the same as having utf-8 */
-	encoding_is_utf8 = is_encoding_utf8(git_commit_encoding);
-
-	strbuf_init(&buffer, 8192); /* should avoid reallocs for the headers */
-	strbuf_addf(&buffer, "tree %s\n", sha1_to_hex(tree));
-
-	/*
-	 * NOTE! This ordering means that the same exact tree merged with a
-	 * different order of parents will be a _different_ changeset even
-	 * if everything else stays the same.
-	 */
-	while (parents) {
-		struct commit_list *next = parents->next;
-		strbuf_addf(&buffer, "parent %s\n",
-			sha1_to_hex(parents->item->object.sha1));
-		free(parents);
-		parents = next;
-	}
-
-	/* Person/date information */
-	if (!author)
-		author = git_author_info(IDENT_ERROR_ON_NO_NAME);
-	strbuf_addf(&buffer, "author %s\n", author);
-	strbuf_addf(&buffer, "committer %s\n", git_committer_info(IDENT_ERROR_ON_NO_NAME));
-	if (!encoding_is_utf8)
-		strbuf_addf(&buffer, "encoding %s\n", git_commit_encoding);
-	strbuf_addch(&buffer, '\n');
-
-	/* And add the comment */
-	strbuf_addstr(&buffer, msg);
-
-	/* And check the encoding */
-	if (encoding_is_utf8 && !is_utf8(buffer.buf))
-		fprintf(stderr, commit_utf8_warn);
-
-	result = write_sha1_file(buffer.buf, buffer.len, commit_type, ret);
-	strbuf_release(&buffer);
-	return result;
-}
-
 int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 {
 	int i;
@@ -117,7 +49,7 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix)
 
 		if (get_sha1(b, sha1))
 			die("Not a valid object name %s", b);
-		check_valid(sha1, OBJ_COMMIT);
+		assert_sha1_type(sha1, OBJ_COMMIT);
 		new_parent(lookup_commit(sha1), &parents);
 	}
 
diff --git a/cache.h b/cache.h
index 5eb0573..bfc4d82 100644
--- a/cache.h
+++ b/cache.h
@@ -718,6 +718,8 @@ extern int has_loose_object_nonlocal(const unsigned char *sha1);
 
 extern int has_pack_index(const unsigned char *sha1);
 
+extern void assert_sha1_type(const unsigned char *sha1, enum object_type expect);
+
 extern const signed char hexval_table[256];
 static inline unsigned int hexval(unsigned char c)
 {
diff --git a/commit.c b/commit.c
index 731191e..e9b0750 100644
--- a/commit.c
+++ b/commit.c
@@ -790,3 +790,58 @@ struct commit_list *reduce_heads(struct commit_list *heads)
 	free(other);
 	return result;
 }
+
+static const char commit_utf8_warn[] =
+"Warning: commit message does not conform to UTF-8.\n"
+"You may want to amend it after fixing the message, or set the config\n"
+"variable i18n.commitencoding to the encoding your project uses.\n";
+
+int commit_tree(const char *msg, unsigned char *tree,
+		struct commit_list *parents, unsigned char *ret,
+		const char *author)
+{
+	int result;
+	int encoding_is_utf8;
+	struct strbuf buffer;
+
+	assert_sha1_type(tree, OBJ_TREE);
+
+	/* Not having i18n.commitencoding is the same as having utf-8 */
+	encoding_is_utf8 = is_encoding_utf8(git_commit_encoding);
+
+	strbuf_init(&buffer, 8192); /* should avoid reallocs for the headers */
+	strbuf_addf(&buffer, "tree %s\n", sha1_to_hex(tree));
+
+	/*
+	 * NOTE! This ordering means that the same exact tree merged with a
+	 * different order of parents will be a _different_ changeset even
+	 * if everything else stays the same.
+	 */
+	while (parents) {
+		struct commit_list *next = parents->next;
+		strbuf_addf(&buffer, "parent %s\n",
+			sha1_to_hex(parents->item->object.sha1));
+		free(parents);
+		parents = next;
+	}
+
+	/* Person/date information */
+	if (!author)
+		author = git_author_info(IDENT_ERROR_ON_NO_NAME);
+	strbuf_addf(&buffer, "author %s\n", author);
+	strbuf_addf(&buffer, "committer %s\n", git_committer_info(IDENT_ERROR_ON_NO_NAME));
+	if (!encoding_is_utf8)
+		strbuf_addf(&buffer, "encoding %s\n", git_commit_encoding);
+	strbuf_addch(&buffer, '\n');
+
+	/* And add the comment */
+	strbuf_addstr(&buffer, msg);
+
+	/* And check the encoding */
+	if (encoding_is_utf8 && !is_utf8(buffer.buf))
+		fprintf(stderr, commit_utf8_warn);
+
+	result = write_sha1_file(buffer.buf, buffer.len, commit_type, ret);
+	strbuf_release(&buffer);
+	return result;
+}
diff --git a/commit.h b/commit.h
index 3cf5166..2b7fd89 100644
--- a/commit.h
+++ b/commit.h
@@ -158,4 +158,8 @@ static inline int single_parent(struct commit *commit)
 
 struct commit_list *reduce_heads(struct commit_list *heads);
 
+extern int commit_tree(const char *msg, unsigned char *tree,
+		struct commit_list *parents, unsigned char *ret,
+		const char *author);
+
 #endif /* COMMIT_H */
diff --git a/sha1_file.c b/sha1_file.c
index ff65328..28c056e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2516,3 +2516,13 @@ int read_pack_header(int fd, struct pack_header *header)
 		return PH_ERROR_PROTOCOL;
 	return 0;
 }
+
+void assert_sha1_type(const unsigned char *sha1, enum object_type expect)
+{
+	enum object_type type = sha1_object_info(sha1, NULL);
+	if (type < 0)
+		die("%s is not a valid object", sha1_to_hex(sha1));
+	if (type != expect)
+		die("%s is not a valid '%s' object", sha1_to_hex(sha1),
+		    typename(expect));
+}
-- 
1.7.0.4.299.gba9d4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/7] introduce notes-cache interface
  2010-04-02  0:01 [PATCH 0/7] textconv caching Jeff King
                   ` (2 preceding siblings ...)
  2010-04-02  0:05 ` [PATCH 3/7] make commit_tree a library function Jeff King
@ 2010-04-02  0:07 ` Jeff King
  2010-04-02  0:09 ` [PATCH 5/7] textconv: refactor calls to run_textconv Jeff King
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2010-04-02  0:07 UTC (permalink / raw)
  To: git

Notes provide a fast lookup mechanism for data keyed by
sha1. This is ideal for caching certain operations, like
textconv filters.

This patch builds some infrastructure to make it simpler to
use notes trees as caches. In particular, caches:

  1. don't have arbitrary commit messages. They store a
     cache validity string in the commit, and clear the tree
     when the cache validity string changes.

  2. don't keep any commit history. The accumulated history
     of a a cache is just useless cruft.

  3. use a looser form of locking for ref updates. If two
     processes try to write to the cache simultaneously, it
     is OK if one overwrites the other, losing some changes.
     It's just a cache, so we will just end up with an extra
     miss.

Signed-off-by: Jeff King <peff@peff.net>
---
I tried lib-ifying commit_notes from builtin/notes.c, but for the
reasons mentioned above, plus the totally different error handling (we
need silent error returns because not writing to the cache should not be
fatal), the code ended up having more "if (quiet)" and "if (nohistory)"
than there was actual code. The notes_cache_write below ended up pretty
terse and readable, I think.

 Makefile      |    2 +
 notes-cache.c |   94 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 notes-cache.h |   20 ++++++++++++
 3 files changed, 116 insertions(+), 0 deletions(-)
 create mode 100644 notes-cache.c
 create mode 100644 notes-cache.h

diff --git a/Makefile b/Makefile
index 8a0f5c4..24e92ab 100644
--- a/Makefile
+++ b/Makefile
@@ -486,6 +486,7 @@ LIB_H += log-tree.h
 LIB_H += mailmap.h
 LIB_H += merge-recursive.h
 LIB_H += notes.h
+LIB_H += notes-cache.h
 LIB_H += object.h
 LIB_H += pack.h
 LIB_H += pack-refs.h
@@ -575,6 +576,7 @@ LIB_OBJS += merge-file.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
+LIB_OBJS += notes-cache.o
 LIB_OBJS += object.o
 LIB_OBJS += pack-check.o
 LIB_OBJS += pack-refs.o
diff --git a/notes-cache.c b/notes-cache.c
new file mode 100644
index 0000000..dee6d62
--- /dev/null
+++ b/notes-cache.c
@@ -0,0 +1,94 @@
+#include "cache.h"
+#include "notes-cache.h"
+#include "commit.h"
+#include "refs.h"
+
+static int notes_cache_match_validity(const char *ref, const char *validity)
+{
+	unsigned char sha1[20];
+	struct commit *commit;
+	struct pretty_print_context pretty_ctx;
+	struct strbuf msg = STRBUF_INIT;
+	int ret;
+
+	if (read_ref(ref, sha1) < 0)
+		return 0;
+
+	commit = lookup_commit_reference_gently(sha1, 1);
+	if (!commit)
+		return 0;
+
+	memset(&pretty_ctx, 0, sizeof(pretty_ctx));
+	format_commit_message(commit, "%s", &msg, &pretty_ctx);
+	strbuf_trim(&msg);
+
+	ret = !strcmp(msg.buf, validity);
+	strbuf_release(&msg);
+
+	return ret;
+}
+
+void notes_cache_init(struct notes_cache *c, const char *name,
+		     const char *validity)
+{
+	struct strbuf ref = STRBUF_INIT;
+	int flags = 0;
+
+	memset(c, 0, sizeof(*c));
+	c->validity = xstrdup(validity);
+
+	strbuf_addf(&ref, "refs/notes/%s", name);
+	if (!notes_cache_match_validity(ref.buf, validity))
+		flags = NOTES_INIT_EMPTY;
+	init_notes(&c->tree, ref.buf, combine_notes_overwrite, flags);
+	strbuf_release(&ref);
+}
+
+int notes_cache_write(struct notes_cache *c)
+{
+	unsigned char tree_sha1[20];
+	unsigned char commit_sha1[20];
+
+	if (!c || !c->tree.initialized || !c->tree.ref || !*c->tree.ref)
+		return -1;
+	if (!c->tree.dirty)
+		return 0;
+
+	if (write_notes_tree(&c->tree, tree_sha1))
+		return -1;
+	if (commit_tree(c->validity, tree_sha1, NULL, commit_sha1, NULL) < 0)
+		return -1;
+	if (update_ref("update notes cache", c->tree.ref, commit_sha1, NULL,
+		       0, QUIET_ON_ERR) < 0)
+		return -1;
+
+	return 0;
+}
+
+char *notes_cache_get(struct notes_cache *c, unsigned char key_sha1[20],
+		      size_t *outsize)
+{
+	const unsigned char *value_sha1;
+	enum object_type type;
+	char *value;
+	unsigned long size;
+
+	value_sha1 = get_note(&c->tree, key_sha1);
+	if (!value_sha1)
+		return NULL;
+	value = read_sha1_file(value_sha1, &type, &size);
+
+	*outsize = size;
+	return value;
+}
+
+int notes_cache_put(struct notes_cache *c, unsigned char key_sha1[20],
+		    const char *data, size_t size)
+{
+	unsigned char value_sha1[20];
+
+	if (write_sha1_file(data, size, "blob", value_sha1) < 0)
+		return -1;
+	add_note(&c->tree, key_sha1, value_sha1, NULL);
+	return 0;
+}
diff --git a/notes-cache.h b/notes-cache.h
new file mode 100644
index 0000000..356f88f
--- /dev/null
+++ b/notes-cache.h
@@ -0,0 +1,20 @@
+#ifndef NOTES_CACHE_H
+#define NOTES_CACHE_H
+
+#include "notes.h"
+
+struct notes_cache {
+	struct notes_tree tree;
+	char *validity;
+};
+
+void notes_cache_init(struct notes_cache *c, const char *name,
+		     const char *validity);
+int notes_cache_write(struct notes_cache *c);
+
+char *notes_cache_get(struct notes_cache *c, unsigned char sha1[20], size_t
+		      *outsize);
+int notes_cache_put(struct notes_cache *c, unsigned char sha1[20],
+		    const char *data, size_t size);
+
+#endif /* NOTES_CACHE_H */
-- 
1.7.0.4.299.gba9d4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 5/7] textconv: refactor calls to run_textconv
  2010-04-02  0:01 [PATCH 0/7] textconv caching Jeff King
                   ` (3 preceding siblings ...)
  2010-04-02  0:07 ` [PATCH 4/7] introduce notes-cache interface Jeff King
@ 2010-04-02  0:09 ` Jeff King
  2010-04-02  0:12 ` [PATCH 6/7] diff: cache textconv output Jeff King
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2010-04-02  0:09 UTC (permalink / raw)
  To: git

This patch adds a fill_textconv wrapper, which centralizes
some minor logic like error checking and handling the case
of no-textconv.

In addition to dropping the number of lines, this will make
it easier in future patches to handle multiple types of
textconv.

Signed-off-by: Jeff King <peff@peff.net>
---
Similar to 1/3 from the previous series, but fixes some deficiencies
there. Specifically, it handles !DIFF_FILE_VALID and filespec population
better, which makes fill_textconv a full replacement for fill_mmfile,
just with optional textconv-ing.

 diff.c |   66 +++++++++++++++++++++++++++++----------------------------------
 1 files changed, 30 insertions(+), 36 deletions(-)

diff --git a/diff.c b/diff.c
index db2cd5d..9665d6d 100644
--- a/diff.c
+++ b/diff.c
@@ -43,7 +43,8 @@ static char diff_colors[][COLOR_MAXLEN] = {
 };
 
 static void diff_filespec_load_driver(struct diff_filespec *one);
-static char *run_textconv(const char *, struct diff_filespec *, size_t *);
+static size_t fill_textconv(const char *cmd,
+			    struct diff_filespec *df, char **outbuf);
 
 static int parse_diff_color_slot(const char *var, int ofs)
 {
@@ -477,7 +478,7 @@ static void emit_rewrite_diff(const char *name_a,
 	const char *reset = diff_get_color(color_diff, DIFF_RESET);
 	static struct strbuf a_name = STRBUF_INIT, b_name = STRBUF_INIT;
 	const char *a_prefix, *b_prefix;
-	const char *data_one, *data_two;
+	char *data_one, *data_two;
 	size_t size_one, size_two;
 	struct emit_callback ecbdata;
 
@@ -499,26 +500,8 @@ static void emit_rewrite_diff(const char *name_a,
 	quote_two_c_style(&a_name, a_prefix, name_a, 0);
 	quote_two_c_style(&b_name, b_prefix, name_b, 0);
 
-	diff_populate_filespec(one, 0);
-	diff_populate_filespec(two, 0);
-	if (textconv_one) {
-		data_one = run_textconv(textconv_one, one, &size_one);
-		if (!data_one)
-			die("unable to read files to diff");
-	}
-	else {
-		data_one = one->data;
-		size_one = one->size;
-	}
-	if (textconv_two) {
-		data_two = run_textconv(textconv_two, two, &size_two);
-		if (!data_two)
-			die("unable to read files to diff");
-	}
-	else {
-		data_two = two->data;
-		size_two = two->size;
-	}
+	size_one = fill_textconv(textconv_one, one, &data_one);
+	size_two = fill_textconv(textconv_two, two, &data_two);
 
 	memset(&ecbdata, 0, sizeof(ecbdata));
 	ecbdata.color_diff = color_diff;
@@ -1717,20 +1700,8 @@ static void builtin_diff(const char *name_a,
 			strbuf_reset(&header);
 		}
 
-		if (textconv_one) {
-			size_t size;
-			mf1.ptr = run_textconv(textconv_one, one, &size);
-			if (!mf1.ptr)
-				die("unable to read files to diff");
-			mf1.size = size;
-		}
-		if (textconv_two) {
-			size_t size;
-			mf2.ptr = run_textconv(textconv_two, two, &size);
-			if (!mf2.ptr)
-				die("unable to read files to diff");
-			mf2.size = size;
-		}
+		mf1.size = fill_textconv(textconv_one, one, &mf1.ptr);
+		mf2.size = fill_textconv(textconv_two, two, &mf2.ptr);
 
 		pe = diff_funcname_pattern(one);
 		if (!pe)
@@ -3916,3 +3887,26 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
 
 	return strbuf_detach(&buf, outsize);
 }
+
+static size_t fill_textconv(const char *cmd,
+			    struct diff_filespec *df,
+			    char **outbuf)
+{
+	size_t size;
+
+	if (!cmd) {
+		if (!DIFF_FILE_VALID(df)) {
+			*outbuf = "";
+			return 0;
+		}
+		if (diff_populate_filespec(df, 0))
+			die("unable to read files to diff");
+		*outbuf = df->data;
+		return df->size;
+	}
+
+	*outbuf = run_textconv(cmd, df, &size);
+	if (!*outbuf)
+		die("unable to read files to diff");
+	return size;
+}
-- 
1.7.0.4.299.gba9d4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 6/7] diff: cache textconv output
  2010-04-02  0:01 [PATCH 0/7] textconv caching Jeff King
                   ` (4 preceding siblings ...)
  2010-04-02  0:09 ` [PATCH 5/7] textconv: refactor calls to run_textconv Jeff King
@ 2010-04-02  0:12 ` Jeff King
  2010-04-02  7:23   ` Junio C Hamano
  2010-04-02  0:14 ` [PATCH 7/7] diff: avoid useless filespec population Jeff King
  2010-04-02  6:14 ` [PATCH 0/7] textconv caching Jeff King
  7 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2010-04-02  0:12 UTC (permalink / raw)
  To: git

Running a textconv filter can take a long time. It's
particularly bad for a large file which needs to be spooled
to disk, but even for small files, the fork+exec overhead
can add up for something like "git log -p".

This patch uses the notes-cache mechanism to keep a fast
cache of textconv output. Caches are stored in
refs/notes/textconv/$x, where $x is the userdiff driver
defined in gitattributes.

Caching is enabled only if diff.$x.cachetextconv is true.

In my test repo, on a commit with 45 jpg and avi files
changed and a textconv to show their exif tags:

  [before]
  $ time git show >/dev/null
  real    0m13.724s
  user    0m12.057s
  sys     0m1.624s

  [after, first run]
  $ git config diff.mfo.cachetextconv true
  $ time git show >/dev/null
  real    0m14.252s
  user    0m12.197s
  sys     0m1.800s

  [after, subsequent runs]
  $ time git show >/dev/null
  real    0m0.352s
  user    0m0.148s
  sys     0m0.200s

So for a slight (3.8%) cost on the first run, we achieve an
almost 40x speed up on subsequent runs.

Signed-off-by: Jeff King <peff@peff.net>
---
The moral equivalent of what I posted earlier, but most of the heavy
lifting is in notes-cache.[ch] from earlier in the series.

It's off by default. I think it should probably swithc to on by default,
but I figured we could let the feature mature a bit first.

 Documentation/gitattributes.txt  |   20 +++++++
 diff.c                           |   52 +++++++++++++++---
 t/t4042-diff-textconv-caching.sh |  109 ++++++++++++++++++++++++++++++++++++++
 userdiff.c                       |    9 +++
 userdiff.h                       |    4 ++
 5 files changed, 185 insertions(+), 9 deletions(-)
 create mode 100755 t/t4042-diff-textconv-caching.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index d892e64..a8500d1 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -414,6 +414,26 @@ because it quickly conveys the changes you have made), you
 should generate it separately and send it as a comment _in
 addition to_ the usual binary diff that you might send.
 
+Because text conversion can be slow, especially when doing a
+large number of them with `git log -p`, git provides a mechanism
+to cache the output and use it in future diffs.  To enable
+caching, set the "cachetextconv" variable in your diff driver's
+config. For example:
+
+------------------------
+[diff "jpg"]
+	textconv = exif
+	cachetextconv = true
+------------------------
+
+This will cache the result of running "exif" on each blob
+indefinitely. If you change the textconv config variable for a
+diff driver, git will automatically invalidate the cache entries
+and re-run the textconv filter. If you want to invalidate the
+cache manually (e.g., because your version of "exif" was updated
+and now produces better output), you can remove the cache
+manually with `git update-ref -d refs/notes/textconv/jpg` (where
+"jpg" is the name of the diff driver, as in the example above).
 
 Performing a three-way merge
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/diff.c b/diff.c
index 9665d6d..72d8503 100644
--- a/diff.c
+++ b/diff.c
@@ -43,7 +43,7 @@ static char diff_colors[][COLOR_MAXLEN] = {
 };
 
 static void diff_filespec_load_driver(struct diff_filespec *one);
-static size_t fill_textconv(const char *cmd,
+static size_t fill_textconv(struct userdiff_driver *driver,
 			    struct diff_filespec *df, char **outbuf);
 
 static int parse_diff_color_slot(const char *var, int ofs)
@@ -466,8 +466,8 @@ static void emit_rewrite_diff(const char *name_a,
 			      const char *name_b,
 			      struct diff_filespec *one,
 			      struct diff_filespec *two,
-			      const char *textconv_one,
-			      const char *textconv_two,
+			      struct userdiff_driver *textconv_one,
+			      struct userdiff_driver *textconv_two,
 			      struct diff_options *o)
 {
 	int lc_a, lc_b;
@@ -1569,14 +1569,26 @@ void diff_set_mnemonic_prefix(struct diff_options *options, const char *a, const
 		options->b_prefix = b;
 }
 
-static const char *get_textconv(struct diff_filespec *one)
+static struct userdiff_driver *get_textconv(struct diff_filespec *one)
 {
 	if (!DIFF_FILE_VALID(one))
 		return NULL;
 	if (!S_ISREG(one->mode))
 		return NULL;
 	diff_filespec_load_driver(one);
-	return one->driver->textconv;
+	if (!one->driver->textconv)
+		return NULL;
+
+	if (one->driver->textconv_want_cache && !one->driver->textconv_cache) {
+		struct notes_cache *c = xmalloc(sizeof(*c));
+		struct strbuf name = STRBUF_INIT;
+
+		strbuf_addf(&name, "textconv/%s", one->driver->name);
+		notes_cache_init(c, name.buf, one->driver->textconv);
+		one->driver->textconv_cache = c;
+	}
+
+	return one->driver;
 }
 
 static void builtin_diff(const char *name_a,
@@ -1593,7 +1605,8 @@ static void builtin_diff(const char *name_a,
 	const char *set = diff_get_color_opt(o, DIFF_METAINFO);
 	const char *reset = diff_get_color_opt(o, DIFF_RESET);
 	const char *a_prefix, *b_prefix;
-	const char *textconv_one = NULL, *textconv_two = NULL;
+	struct userdiff_driver *textconv_one = NULL;
+	struct userdiff_driver *textconv_two = NULL;
 	struct strbuf header = STRBUF_INIT;
 
 	if (DIFF_OPT_TST(o, SUBMODULE_LOG) &&
@@ -3888,13 +3901,13 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
 	return strbuf_detach(&buf, outsize);
 }
 
-static size_t fill_textconv(const char *cmd,
+static size_t fill_textconv(struct userdiff_driver *driver,
 			    struct diff_filespec *df,
 			    char **outbuf)
 {
 	size_t size;
 
-	if (!cmd) {
+	if (!driver || !driver->textconv) {
 		if (!DIFF_FILE_VALID(df)) {
 			*outbuf = "";
 			return 0;
@@ -3905,8 +3918,29 @@ static size_t fill_textconv(const char *cmd,
 		return df->size;
 	}
 
-	*outbuf = run_textconv(cmd, df, &size);
+	if (driver->textconv_cache) {
+		*outbuf = notes_cache_get(driver->textconv_cache, df->sha1,
+					  &size);
+		if (*outbuf)
+			return size;
+	}
+
+	*outbuf = run_textconv(driver->textconv, df, &size);
 	if (!*outbuf)
 		die("unable to read files to diff");
+
+	if (driver->textconv_cache) {
+		/* ignore errors, as we might be in a readonly repository */
+		notes_cache_put(driver->textconv_cache, df->sha1, *outbuf,
+				size);
+		/*
+		 * we could save up changes and flush them all at the end,
+		 * but we would need an extra call after all diffing is done.
+		 * Since generating a cache entry is the slow path anyway,
+		 * this extra overhead probably isn't a big deal.
+		 */
+		notes_cache_write(driver->textconv_cache);
+	}
+
 	return size;
 }
diff --git a/t/t4042-diff-textconv-caching.sh b/t/t4042-diff-textconv-caching.sh
new file mode 100755
index 0000000..91f8198
--- /dev/null
+++ b/t/t4042-diff-textconv-caching.sh
@@ -0,0 +1,109 @@
+#!/bin/sh
+
+test_description='test textconv caching'
+. ./test-lib.sh
+
+cat >helper <<'EOF'
+#!/bin/sh
+sed 's/^/converted: /' "$@" >helper.out
+cat helper.out
+EOF
+chmod +x helper
+
+test_expect_success 'setup' '
+	echo foo content 1 >foo.bin &&
+	echo bar content 1 >bar.bin &&
+	git add . &&
+	git commit -m one &&
+	echo foo content 2 >foo.bin &&
+	echo bar content 2 >bar.bin &&
+	git commit -a -m two &&
+	echo "*.bin diff=magic" >.gitattributes &&
+	git config diff.magic.textconv ./helper &&
+	git config diff.magic.cachetextconv true
+'
+
+cat >expect <<EOF
+diff --git a/bar.bin b/bar.bin
+index fcf9166..28283d5 100644
+--- a/bar.bin
++++ b/bar.bin
+@@ -1 +1 @@
+-converted: bar content 1
++converted: bar content 2
+diff --git a/foo.bin b/foo.bin
+index d5b9fe3..1345db2 100644
+--- a/foo.bin
++++ b/foo.bin
+@@ -1 +1 @@
+-converted: foo content 1
++converted: foo content 2
+EOF
+
+test_expect_success 'first textconv works' '
+	git diff HEAD^ HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'cached textconv produces same output' '
+	git diff HEAD^ HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'cached textconv does not run helper' '
+	rm -f helper.out &&
+	git diff HEAD^ HEAD >actual &&
+	test_cmp expect actual &&
+	! test -r helper.out
+'
+
+cat >expect <<EOF
+diff --git a/bar.bin b/bar.bin
+index fcf9166..28283d5 100644
+--- a/bar.bin
++++ b/bar.bin
+@@ -1,2 +1,2 @@
+ converted: other
+-converted: bar content 1
++converted: bar content 2
+diff --git a/foo.bin b/foo.bin
+index d5b9fe3..1345db2 100644
+--- a/foo.bin
++++ b/foo.bin
+@@ -1,2 +1,2 @@
+ converted: other
+-converted: foo content 1
++converted: foo content 2
+EOF
+test_expect_success 'changing textconv invalidates cache' '
+	echo other >other &&
+	git config diff.magic.textconv "./helper other" &&
+	git diff HEAD^ HEAD >actual &&
+	test_cmp expect actual
+'
+
+cat >expect <<EOF
+diff --git a/bar.bin b/bar.bin
+index fcf9166..28283d5 100644
+--- a/bar.bin
++++ b/bar.bin
+@@ -1,2 +1,2 @@
+ converted: other
+-converted: bar content 1
++converted: bar content 2
+diff --git a/foo.bin b/foo.bin
+index d5b9fe3..1345db2 100644
+--- a/foo.bin
++++ b/foo.bin
+@@ -1 +1 @@
+-converted: foo content 1
++converted: foo content 2
+EOF
+test_expect_success 'switching diff driver produces correct results' '
+	git config diff.moremagic.textconv ./helper &&
+	echo foo.bin diff=moremagic >>.gitattributes &&
+	git diff HEAD^ HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_done
diff --git a/userdiff.c b/userdiff.c
index df99249..67003fb 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -1,3 +1,4 @@
+#include "cache.h"
 #include "userdiff.h"
 #include "cache.h"
 #include "attr.h"
@@ -167,6 +168,12 @@ static int parse_tristate(int *b, const char *k, const char *v)
 	return 1;
 }
 
+static int parse_bool(int *b, const char *k, const char *v)
+{
+	*b = git_config_bool(k, v);
+	return 1;
+}
+
 int userdiff_config(const char *k, const char *v)
 {
 	struct userdiff_driver *drv;
@@ -181,6 +188,8 @@ int userdiff_config(const char *k, const char *v)
 		return parse_string(&drv->external, k, v);
 	if ((drv = parse_driver(k, v, "textconv")))
 		return parse_string(&drv->textconv, k, v);
+	if ((drv = parse_driver(k, v, "cachetextconv")))
+		return parse_bool(&drv->textconv_want_cache, k, v);
 	if ((drv = parse_driver(k, v, "wordregex")))
 		return parse_string(&drv->word_regex, k, v);
 
diff --git a/userdiff.h b/userdiff.h
index c315159..942d594 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -1,6 +1,8 @@
 #ifndef USERDIFF_H
 #define USERDIFF_H
 
+#include "notes-cache.h"
+
 struct userdiff_funcname {
 	const char *pattern;
 	int cflags;
@@ -13,6 +15,8 @@ struct userdiff_driver {
 	struct userdiff_funcname funcname;
 	const char *word_regex;
 	const char *textconv;
+	struct notes_cache *textconv_cache;
+	int textconv_want_cache;
 };
 
 int userdiff_config(const char *k, const char *v);
-- 
1.7.0.4.299.gba9d4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 7/7] diff: avoid useless filespec population
  2010-04-02  0:01 [PATCH 0/7] textconv caching Jeff King
                   ` (5 preceding siblings ...)
  2010-04-02  0:12 ` [PATCH 6/7] diff: cache textconv output Jeff King
@ 2010-04-02  0:14 ` Jeff King
  2010-04-02  7:12   ` Junio C Hamano
  2010-04-02  6:14 ` [PATCH 0/7] textconv caching Jeff King
  7 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2010-04-02  0:14 UTC (permalink / raw)
  To: git

builtin_diff calls fill_mmfile fairly early, which in turn
calls diff_populate_filespec, which actually retrieves the
file's blob contents into a buffer. Long ago, this was
sensible as we would need to look at the blobs eventually.

These days, however, we may not ever want those blobs if we
end up using a textconv cache, and for large binary files
(exactly the sort for which you might have a textconv
cache), just retrieving the objects can be costly.

This patch just pushes the fill_mmfile call a bit later, so
we can avoid populating the filespec in some cases.  There
is one thing to note that looks like a bug but isn't. We
push the fill_mmfile down into the first branch of a
conditional. It seems like we would need it on the other
branch, too, but we don't; fill_textconv does it for us (in
fact, before this, we were just writing over the results of
the fill_mmfile on that branch).

Here's a timing sample on a commit with 45 changed jpgs and
avis. The result is fully textconv cached, but we still
wasted a lot of time just pulling the blobs from storage.
The total size of the blobs (source and dest) is about
180M.

  [before]
  $ time git show >/dev/null
  real    0m0.352s
  user    0m0.148s
  sys     0m0.200s

  [after]
  $ time git show >/dev/null
  real    0m0.009s
  user    0m0.004s
  sys     0m0.004s

And that's on a warm cache. On a cold cache, the "after"
case is not much worse, but the "before" case has to do an
extra 180M of I/O.

Signed-off-by: Jeff King <peff@peff.net>
---
I remember when I first started using git having it go so fast that I
had to double-check that it had actually worked correctly. I had another
of those moments when I ran this. But it correctly processed everything
I threw at it. More stress-testing appreciated. :)

 diff.c |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/diff.c b/diff.c
index 72d8503..4cb6d9a 100644
--- a/diff.c
+++ b/diff.c
@@ -1680,12 +1680,11 @@ static void builtin_diff(const char *name_a,
 		}
 	}

-	if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
-		die("unable to read files to diff");
-
 	if (!DIFF_OPT_TST(o, TEXT) &&
-	    ( (diff_filespec_is_binary(one) && !textconv_one) ||
-	      (diff_filespec_is_binary(two) && !textconv_two) )) {
+	    ( (!textconv_one && diff_filespec_is_binary(one)) ||
+	      (!textconv_two && diff_filespec_is_binary(two)) )) {
+		if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
+			die("unable to read files to diff");
 		/* Quite common confusing case */
 		if (mf1.size == mf2.size &&
 		    !memcmp(mf1.ptr, mf2.ptr, mf1.size))
-- 
1.7.0.4.299.gba9d4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/7] textconv caching
  2010-04-02  0:01 [PATCH 0/7] textconv caching Jeff King
                   ` (6 preceding siblings ...)
  2010-04-02  0:14 ` [PATCH 7/7] diff: avoid useless filespec population Jeff King
@ 2010-04-02  6:14 ` Jeff King
  7 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2010-04-02  6:14 UTC (permalink / raw)
  To: git

On Thu, Apr 01, 2010 at 08:01:59PM -0400, Jeff King wrote:

>   [before]
>   $ time git show >/dev/null
>   real    0m13.724s
>   user    0m12.057s
>   sys     0m1.624s
> 
>   [after (with cache primed)]
>   $ time git show >/dev/null
>   real    0m0.009s
>   user    0m0.004s
>   sys     0m0.004s

Since this is a space-time tradeoff, I thought it would make sense to
show some size numbers as a followup.

To get a sense of the size of the repo (it's almost all photos and
videos):

  [size of the repo, already fully packed]
  $ du -sh .git/objects
  4.0G    .git/objects

  [the number of unique blobs through all history; most are binary media]
  $ git log --raw --no-abbrev | awk '/^:/ {print $3 "\n" $4}' | sort -u | wc -l
  10605

In comparison, the metadata for a given file (produced by the textconv)
is about 200 bytes of text.

So I did a big cache priming:

  $ time git log -p >/dev/null
  real    39m29.748s
  user    23m1.090s
  sys     3m46.642s

Slow, and unsurprisingly spends quite a bit of time waiting on I/O. The
result is a notes tree with almost one textconv per blob:

  $ git ls-tree -r notes/textconv/mfo | wc -l
  10317

We're now using almost 200M:

  $ git count-objects
  39513 objects, 198604 kilobytes

But wait. Many of those objects are trees for stale versions of the
cache.

  $ git repack -d
  $ (cd .git/objects/pack && du -k *.pack)
  2056    pack-34170e72ec40a07e98aae044479abccc9e02751b.pack
  4089224 pack-81797628f3aebf6a0bdc082fa05ec14932910534.pack
  $ git count-objects
  30685 objects, 163288 kilobytes

In actuality, a fully packed cache is only about 2M (from 35M of
loose objects; it deltas quite well because there is a lot of overlap
in my metadata). And we can prune away the other 160M of cruft:

  $ git prune
  $ git count-objects
  0 objects, 0 kilobytes

And of course, the final speed result:

  $ time git log -p >/dev/null
  real    0m7.606s
  user    0m6.084s
  sys     0m0.788s

So what I take away from this is two things:

  1. The size tradeoff is definitely worthwhile for some workloads. In
     this case, the textconv version is orders of magnitude smaller than
     the original. I'd be interested to see numbers for something like a
     repository of documents that get textconv'd to pure ascii.

  2. We had 460% loose object overhead just from tree objects in
     intermediate versions of the cache. While it was not too hard to
     get rid of with a repack and a prune, we are probably better off
     not generating it in the first place. In theory we could have
     written only one notes tree, and kept the intermediate state in
     memory. In practice, flushing once per commit-diff (instead of once
     per file) would probably be fine, and would be simpler to
     implement.

And of course, now that I have a completely primed cache, I can push it
around with "git push $dest notes/textconv/mfo". Yay for storing notes
as git objects.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 7/7] diff: avoid useless filespec population
  2010-04-02  0:14 ` [PATCH 7/7] diff: avoid useless filespec population Jeff King
@ 2010-04-02  7:12   ` Junio C Hamano
  2010-04-02  7:24     ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2010-04-02  7:12 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> This patch just pushes the fill_mmfile call a bit later, so we can avoid
> populating the filespec in some cases.  There is one thing to note that
> looks like a bug but isn't. We push the fill_mmfile down into the first
> branch of a conditional. It seems like we would need it on the other
> branch, too, but we don't; fill_textconv does it for us (in fact, before
> this, we were just writing over the results of the fill_mmfile on that
> branch).

... and leaking the memory in the "cached" codepath, perhaps?

> I remember when I first started using git having it go so fast that I
> had to double-check that it had actually worked correctly. I had another
> of those moments when I ran this. But it correctly processed everything
> I threw at it. More stress-testing appreciated. :)

Heh.  Will queue.

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 6/7] diff: cache textconv output
  2010-04-02  0:12 ` [PATCH 6/7] diff: cache textconv output Jeff King
@ 2010-04-02  7:23   ` Junio C Hamano
  2010-04-02  7:38     ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2010-04-02  7:23 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> Running a textconv filter can take a long time. It's
> particularly bad for a large file which needs to be spooled
> to disk, but even for small files, the fork+exec overhead
> can add up for something like "git log -p".

Another reason that "log -p" gets benefit from caching is that you would
typically end up running textconv twice on the same blob, once when you
compare $commit:$path with $commit~1:$path, and again when you compare
$commit~$n-1:$path with $commit~$n:$path (assuming that the $path didn't
change between $commit~$n-1 and $commit~1).

It _might_ give you even better performance characteristics if you noice
that you are walking history running many textconv, and cache the textconv
result from the "older" (i.e. "one" side) tree in-core, until it is used
in a "newer" (i.e. "two" side) tree, at which time you would evict it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 7/7] diff: avoid useless filespec population
  2010-04-02  7:12   ` Junio C Hamano
@ 2010-04-02  7:24     ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2010-04-02  7:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, Apr 02, 2010 at 12:12:22AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > This patch just pushes the fill_mmfile call a bit later, so we can avoid
> > populating the filespec in some cases.  There is one thing to note that
> > looks like a bug but isn't. We push the fill_mmfile down into the first
> > branch of a conditional. It seems like we would need it on the other
> > branch, too, but we don't; fill_textconv does it for us (in fact, before
> > this, we were just writing over the results of the fill_mmfile on that
> > branch).
> 
> ... and leaking the memory in the "cached" codepath, perhaps?

I don't think so. fill_mmfile never allocates, but rather points to
what's in the diff_filespec. So though we overwrote what was in the
mmfile, that data was still pointed to and eventually freed by the
diff_filespec struct.

fill_textconv will allocate if and only if we are doing a textconv
(either a buffer we get from read_sha1_file in the case of a cached
entry, or the detached strbuf from run-textconv otherwise). If we're not
doing a textconv, it falls back to basically doing the same as
fill_mmfile. And that's the reason for the

  if (textconv_one)
          free(data_one);

seen in patch 2/7 (and the matching one in builtin_diff from before).

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 6/7] diff: cache textconv output
  2010-04-02  7:23   ` Junio C Hamano
@ 2010-04-02  7:38     ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2010-04-02  7:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, Apr 02, 2010 at 12:23:06AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Running a textconv filter can take a long time. It's
> > particularly bad for a large file which needs to be spooled
> > to disk, but even for small files, the fork+exec overhead
> > can add up for something like "git log -p".
> 
> Another reason that "log -p" gets benefit from caching is that you would
> typically end up running textconv twice on the same blob, once when you
> compare $commit:$path with $commit~1:$path, and again when you compare
> $commit~$n-1:$path with $commit~$n:$path (assuming that the $path didn't
> change between $commit~$n-1 and $commit~1).

Yep. I pointed out in one of my timing tests a slight slowdown in "git
show" when generating the cache. But for revision walking, it should
actually be faster, since you see each blob twice but cache after the
first time.

> It _might_ give you even better performance characteristics if you noice
> that you are walking history running many textconv, and cache the textconv
> result from the "older" (i.e. "one" side) tree in-core, until it is used
> in a "newer" (i.e. "two" side) tree, at which time you would evict it.

I doubt it is worth the effort. We are already caching the sha1 in-core
due to the notes mechanism. So we could really only save one object
retrieval. Which is already what a non-textconv diff will need to do, so
we should have performance on par with regular diffs at this point.

In fact, your optimization could be applied to all diff revision
walking, not just textconv, and you can halve the number of object
retrievals. The problem is that you may have blobs sitting in the
in-core cache as you walk many revisions, waiting for them to be changed
again. Depending on the locality of changes and the size of your
project, you won't be able to fit it all comfortably in memory, and will
end up discarding entries.

And all of that to save a few object retrievals, which are something
that git does very quickly already. Not to mention the ugly code that
would be involved.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2010-04-02  7:38 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-02  0:01 [PATCH 0/7] textconv caching Jeff King
2010-04-02  0:03 ` [PATCH 1/7] fix const-correctness of write_sha1_file Jeff King
2010-04-02  0:04 ` [PATCH 2/7] fix textconv leak in emit_rewrite_diff Jeff King
2010-04-02  0:05 ` [PATCH 3/7] make commit_tree a library function Jeff King
2010-04-02  0:07 ` [PATCH 4/7] introduce notes-cache interface Jeff King
2010-04-02  0:09 ` [PATCH 5/7] textconv: refactor calls to run_textconv Jeff King
2010-04-02  0:12 ` [PATCH 6/7] diff: cache textconv output Jeff King
2010-04-02  7:23   ` Junio C Hamano
2010-04-02  7:38     ` Jeff King
2010-04-02  0:14 ` [PATCH 7/7] diff: avoid useless filespec population Jeff King
2010-04-02  7:12   ` Junio C Hamano
2010-04-02  7:24     ` Jeff King
2010-04-02  6:14 ` [PATCH 0/7] textconv caching Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).