git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] `log -c` speedup
@ 2014-01-20 16:20 Kirill Smelkov
  2014-01-20 16:20 ` [PATCH 1/4] diffcore-order: Export generic ordering interface Kirill Smelkov
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Kirill Smelkov @ 2014-01-20 16:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

Hello up there,

I'm using `git log --raw` to reconstruct file dates (readonly filesystem for
git archives) and, as it turned out, for --raw to emit diffs for merges we need
to explicitly activate combine-diff via -c.

The combined-diff turned out to be slow, I'm trying to optimize it. Please apply.

Thanks beforehand,
Kirill


Kirill Smelkov (4):
  diffcore-order: Export generic ordering interface
  diff test: Add tests for combine-diff with orderfile
  combine-diff: Optimize combine_diff_path sets intersection
  combine-diff: combine_diff_path.len is not needed anymore

 combine-diff.c        | 121 +++++++++++++++++++++++++++++++++-----------------
 diff-lib.c            |   2 -
 diff.h                |   1 -
 diffcore-order.c      |  53 ++++++++++++++--------
 diffcore.h            |  15 +++++++
 t/t4056-diff-order.sh |  21 +++++++++
 6 files changed, 151 insertions(+), 62 deletions(-)

-- 
1.9.rc0.143.g6fd479e

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/4] diffcore-order: Export generic ordering interface
  2014-01-20 16:20 [PATCH 0/4] `log -c` speedup Kirill Smelkov
@ 2014-01-20 16:20 ` Kirill Smelkov
  2014-01-20 16:20 ` [PATCH 2/4] diff test: Add tests for combine-diff with orderfile Kirill Smelkov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Kirill Smelkov @ 2014-01-20 16:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

At present, diffcore_order() interface is to accept only queue of
`struct diff_filepair`.

In the next patches, we'll need to order `struct combine_diff_path` by path,
so let's first rework diffcore-order to also provide generic low-level
interface for ordering arbitrary objects, provided they have path accessors.

The new interface is:

    - `struct obj_order`    for describing objects to ordering routine, and
    - order_objects()       for actually doing the ordering work.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 diffcore-order.c | 53 +++++++++++++++++++++++++++++++++++------------------
 diffcore.h       | 15 +++++++++++++++
 2 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/diffcore-order.c b/diffcore-order.c
index fe7f1f4..327a93e 100644
--- a/diffcore-order.c
+++ b/diffcore-order.c
@@ -57,11 +57,7 @@ static void prepare_order(const char *orderfile)
 	}
 }
 
-struct pair_order {
-	struct diff_filepair *pair;
-	int orig_order;
-	int order;
-};
+
 
 static int match_order(const char *path)
 {
@@ -84,35 +80,56 @@ static int match_order(const char *path)
 	return order_cnt;
 }
 
-static int compare_pair_order(const void *a_, const void *b_)
+static int compare_objs_order(const void *a_, const void *b_)
 {
-	struct pair_order const *a, *b;
-	a = (struct pair_order const *)a_;
-	b = (struct pair_order const *)b_;
+	struct obj_order const *a, *b;
+	a = (struct obj_order const *)a_;
+	b = (struct obj_order const *)b_;
 	if (a->order != b->order)
 		return a->order - b->order;
 	return a->orig_order - b->orig_order;
 }
 
+
+void order_objects(const char *orderfile, obj_path_fn_t obj_path,
+			struct obj_order *objs, int nr)
+{
+	int i;
+
+	if (!nr)
+		return;
+
+	prepare_order(orderfile);
+	for (i = 0; i < nr; i++) {
+		objs[i].orig_order = i;
+		objs[i].order = match_order(obj_path(objs[i].obj));
+	}
+	qsort(objs, nr, sizeof(*objs), compare_objs_order);
+}
+
+
+static const char *pair_pathtwo(void *obj)
+{
+	struct diff_filepair *pair = (struct diff_filepair *)obj;
+
+	return pair->two->path;
+}
+
 void diffcore_order(const char *orderfile)
 {
 	struct diff_queue_struct *q = &diff_queued_diff;
-	struct pair_order *o;
+	struct obj_order *o;
 	int i;
 
 	if (!q->nr)
 		return;
 
 	o = xmalloc(sizeof(*o) * q->nr);
-	prepare_order(orderfile);
-	for (i = 0; i < q->nr; i++) {
-		o[i].pair = q->queue[i];
-		o[i].orig_order = i;
-		o[i].order = match_order(o[i].pair->two->path);
-	}
-	qsort(o, q->nr, sizeof(*o), compare_pair_order);
 	for (i = 0; i < q->nr; i++)
-		q->queue[i] = o[i].pair;
+		o[i].obj = q->queue[i];
+	order_objects(orderfile, pair_pathtwo, o, q->nr);
+	for (i = 0; i < q->nr; i++)
+		q->queue[i] = o[i].obj;
 	free(o);
 	return;
 }
diff --git a/diffcore.h b/diffcore.h
index 1c16c85..1fd00fc 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -111,6 +111,21 @@ extern void diffcore_merge_broken(void);
 extern void diffcore_pickaxe(struct diff_options *);
 extern void diffcore_order(const char *orderfile);
 
+/* low-level interface to diffcore_order */
+struct obj_order {
+	void *obj;	/* setup by caller */
+
+	/* setup/used by order_objects() */
+	int orig_order;
+	int order;
+};
+
+typedef const char *(*obj_path_fn_t)(void *obj);
+
+void order_objects(const char *orderfile, obj_path_fn_t obj_path,
+			struct obj_order *objs, int nr);
+
+
 #define DIFF_DEBUG 0
 #if DIFF_DEBUG
 void diff_debug_filespec(struct diff_filespec *, int, const char *);
-- 
1.9.rc0.143.g6fd479e

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/4] diff test: Add tests for combine-diff with orderfile
  2014-01-20 16:20 [PATCH 0/4] `log -c` speedup Kirill Smelkov
  2014-01-20 16:20 ` [PATCH 1/4] diffcore-order: Export generic ordering interface Kirill Smelkov
@ 2014-01-20 16:20 ` Kirill Smelkov
  2014-01-20 16:20 ` [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection Kirill Smelkov
  2014-01-20 16:20 ` [PATCH 4/4] combine-diff: combine_diff_path.len is not needed anymore Kirill Smelkov
  3 siblings, 0 replies; 8+ messages in thread
From: Kirill Smelkov @ 2014-01-20 16:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

In the next patch combine-diff will have special code-path for taking
orderfile into account. Prepare for making changes by introducing
coverage tests for that case.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 t/t4056-diff-order.sh | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/t/t4056-diff-order.sh b/t/t4056-diff-order.sh
index 9e2b29e..c0460bb 100755
--- a/t/t4056-diff-order.sh
+++ b/t/t4056-diff-order.sh
@@ -97,4 +97,25 @@ do
 	'
 done
 
+test_expect_success 'setup for testing combine-diff order' '
+	git checkout -b tmp HEAD~ &&
+	create_files 3 &&
+	git checkout master &&
+	git merge --no-commit -s ours tmp &&
+	create_files 5
+'
+
+test_expect_success "combine-diff: no order (=tree object order)" '
+	git diff --name-only HEAD HEAD^ HEAD^2 >actual &&
+	test_cmp expect_none actual
+'
+
+for i in 1 2
+do
+	test_expect_success "combine-diff: orderfile using option ($i)" '
+		git diff -Oorder_file_$i --name-only HEAD HEAD^ HEAD^2 >actual &&
+		test_cmp expect_$i actual
+	'
+done
+
 test_done
-- 
1.9.rc0.143.g6fd479e

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection
  2014-01-20 16:20 [PATCH 0/4] `log -c` speedup Kirill Smelkov
  2014-01-20 16:20 ` [PATCH 1/4] diffcore-order: Export generic ordering interface Kirill Smelkov
  2014-01-20 16:20 ` [PATCH 2/4] diff test: Add tests for combine-diff with orderfile Kirill Smelkov
@ 2014-01-20 16:20 ` Kirill Smelkov
  2014-01-28 15:46   ` Kirill Smelkov
  2014-01-28 21:55   ` Junio C Hamano
  2014-01-20 16:20 ` [PATCH 4/4] combine-diff: combine_diff_path.len is not needed anymore Kirill Smelkov
  3 siblings, 2 replies; 8+ messages in thread
From: Kirill Smelkov @ 2014-01-20 16:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

Currently, when generating combined diff, for a commit, we intersect
diff paths from diff(parent_0,commit) to diff(parent_i,commit) comparing
all paths pairs, i.e. doing it the quadratic way. That is correct, but
could be optimized:

Paths come from trees in sorted (= tree) order, and so does diff_tree()
emits resulting paths in that order too. Now if we look at diffcore
transformations, all of them, except diffcore_order, preserve resulting
path ordering:

    - skip_stat_unmatch, grep, pickaxe, filter
                            -- just skip elements -> order stays preserved

    - break                 -- just breaks diff for a path, adding path
                               dup after the path -> order stays preserved

    - detect rename/copy    -- resulting paths are emitted sorted
                               (verified empirically)

So only diffcore_order changes diff paths ordering.

But diffcore_order meaning affects only presentation - i.e. only how to
show the diff, so we could do all the internal computations without
paths reordering, and order only resultant paths set. This is faster,
since, if we know two paths sets are all ordered, their intersection
could be done in linear time.

This patch does just that.

Timings for `git log --raw --no-abbrev --no-renames` without `-c` ("git log")
and with `-c` ("git log -c") before and after the patch are as follows:

                linux.git v3.10..v3.11

            log     log -c

    before  1.9s    20.4s
    after   1.9s    16.6s

                navy.git    (private repo)

            log     log -c

    before  0.83s   15.6s
    after   0.83s    2.1s

P.S.

I think linux.git case is sped up not so much as the second one, since
in navy.git, there are more exotic (subtree, etc) merges.

P.P.S.

My tracing showed that the rest of the time (16.6s vs 1.9s) is usually
spent in computing huge diffs from commit to second parent. Will try to
deal with it, if I'll have time.

P.P.P.S.

For combine_diff_path, ->len is not needed anymore - will remove it in
the next noisy cleanup path, to maintain good signal/noise ratio here.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 combine-diff.c | 93 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 72 insertions(+), 21 deletions(-)

diff --git a/combine-diff.c b/combine-diff.c
index 3b92c448..98c2562 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -15,8 +15,8 @@
 static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr, int n, int num_parent)
 {
 	struct diff_queue_struct *q = &diff_queued_diff;
-	struct combine_diff_path *p;
-	int i;
+	struct combine_diff_path *p, *pprev, *ptmp;
+	int i, cmp;
 
 	if (!n) {
 		struct combine_diff_path *list = NULL, **tail = &list;
@@ -47,28 +47,43 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
 		return list;
 	}
 
-	for (p = curr; p; p = p->next) {
-		int found = 0;
-		if (!p->len)
+	/*
+	 * NOTE paths are coming sorted here (= in tree order)
+	 */
+
+	pprev = NULL;
+	p = curr;
+	i = 0;
+
+	while (1) {
+		if (!p)
+			break;
+
+		cmp = (i >= q->nr) ? -1
+				   : strcmp(p->path, q->queue[i]->two->path);
+		if (cmp < 0) {
+			if (pprev)
+				pprev->next = p->next;
+			ptmp = p;
+			p = p->next;
+			free(ptmp);
+			if (curr == ptmp)
+				curr = p;
 			continue;
-		for (i = 0; i < q->nr; i++) {
-			const char *path;
-			int len;
+		}
 
-			if (diff_unmodified_pair(q->queue[i]))
-				continue;
-			path = q->queue[i]->two->path;
-			len = strlen(path);
-			if (len == p->len && !memcmp(path, p->path, len)) {
-				found = 1;
-				hashcpy(p->parent[n].sha1, q->queue[i]->one->sha1);
-				p->parent[n].mode = q->queue[i]->one->mode;
-				p->parent[n].status = q->queue[i]->status;
-				break;
-			}
+		if (cmp > 0) {
+			i++;
+			continue;
 		}
-		if (!found)
-			p->len = 0;
+
+		hashcpy(p->parent[n].sha1, q->queue[i]->one->sha1);
+		p->parent[n].mode = q->queue[i]->one->mode;
+		p->parent[n].status = q->queue[i]->status;
+
+		pprev = p;
+		p = p->next;
+		i++;
 	}
 	return curr;
 }
@@ -1295,6 +1310,13 @@ static void handle_combined_callback(struct diff_options *opt,
 	free(q.queue);
 }
 
+static const char *path_path(void *obj)
+{
+	struct combine_diff_path *path = (struct combine_diff_path *)obj;
+
+	return path->path;
+}
+
 void diff_tree_combined(const unsigned char *sha1,
 			const struct sha1_array *parents,
 			int dense,
@@ -1310,6 +1332,8 @@ void diff_tree_combined(const unsigned char *sha1,
 	diffopts.output_format = DIFF_FORMAT_NO_OUTPUT;
 	DIFF_OPT_SET(&diffopts, RECURSIVE);
 	DIFF_OPT_CLR(&diffopts, ALLOW_EXTERNAL);
+	/* tell diff_tree to emit paths in sorted (=tree) order */
+	diffopts.orderfile = NULL;
 
 	show_log_first = !!rev->loginfo && !rev->no_commit_id;
 	needsep = 0;
@@ -1335,6 +1359,13 @@ void diff_tree_combined(const unsigned char *sha1,
 				printf("%s%c", diff_line_prefix(opt),
 				       opt->line_termination);
 		}
+
+		/* if showing diff, show it in requested order */
+		if (diffopts.output_format != DIFF_FORMAT_NO_OUTPUT &&
+		    opt->orderfile) {
+			diffcore_order(opt->orderfile);
+		}
+
 		diff_flush(&diffopts);
 	}
 
@@ -1343,6 +1374,26 @@ void diff_tree_combined(const unsigned char *sha1,
 		if (p->len)
 			num_paths++;
 	}
+
+	/* order paths according to diffcore_order */
+	if (opt->orderfile && num_paths) {
+		struct obj_order *o;
+
+		o = xmalloc(sizeof(*o) * num_paths);
+		for (i = 0, p = paths; p; p = p->next, i++)
+			o[i].obj = p;
+		order_objects(opt->orderfile, path_path, o, num_paths);
+		for (i = 0; i < num_paths - 1; i++) {
+			p = o[i].obj;
+			p->next = o[i+1].obj;
+		}
+
+		p = o[num_paths-1].obj;
+		p->next = NULL;
+		paths = o[0].obj;
+	}
+
+
 	if (num_paths) {
 		if (opt->output_format & (DIFF_FORMAT_RAW |
 					  DIFF_FORMAT_NAME |
-- 
1.9.rc0.143.g6fd479e

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 4/4] combine-diff: combine_diff_path.len is not needed anymore
  2014-01-20 16:20 [PATCH 0/4] `log -c` speedup Kirill Smelkov
                   ` (2 preceding siblings ...)
  2014-01-20 16:20 ` [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection Kirill Smelkov
@ 2014-01-20 16:20 ` Kirill Smelkov
  3 siblings, 0 replies; 8+ messages in thread
From: Kirill Smelkov @ 2014-01-20 16:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

Brefore previous patch, ->len was used to speedup name compares and also
to mark removed paths via len=0. Now we do significantly less strcmp and
also just remove paths from list and free right after we know a path
will not be needed, so ->len is not needed anymore.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 combine-diff.c | 30 +++++++++---------------------
 diff-lib.c     |  2 --
 diff.h         |  1 -
 3 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/combine-diff.c b/combine-diff.c
index 98c2562..07faa96 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -31,7 +31,6 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
 			p->path = (char *) &(p->parent[num_parent]);
 			memcpy(p->path, path, len);
 			p->path[len] = 0;
-			p->len = len;
 			p->next = NULL;
 			memset(p->parent, 0,
 			       sizeof(p->parent[0]) * num_parent);
@@ -1234,8 +1233,6 @@ void show_combined_diff(struct combine_diff_path *p,
 {
 	struct diff_options *opt = &rev->diffopt;
 
-	if (!p->len)
-		return;
 	if (opt->output_format & (DIFF_FORMAT_RAW |
 				  DIFF_FORMAT_NAME |
 				  DIFF_FORMAT_NAME_STATUS))
@@ -1299,11 +1296,8 @@ static void handle_combined_callback(struct diff_options *opt,
 	q.queue = xcalloc(num_paths, sizeof(struct diff_filepair *));
 	q.alloc = num_paths;
 	q.nr = num_paths;
-	for (i = 0, p = paths; p; p = p->next) {
-		if (!p->len)
-			continue;
+	for (i = 0, p = paths; p; p = p->next)
 		q.queue[i++] = combined_pair(p, num_parent);
-	}
 	opt->format_callback(&q, opt, opt->format_callback_data);
 	for (i = 0; i < num_paths; i++)
 		free_combined_pair(q.queue[i]);
@@ -1369,11 +1363,9 @@ void diff_tree_combined(const unsigned char *sha1,
 		diff_flush(&diffopts);
 	}
 
-	/* find out surviving paths */
-	for (num_paths = 0, p = paths; p; p = p->next) {
-		if (p->len)
-			num_paths++;
-	}
+	/* find out number of surviving paths */
+	for (num_paths = 0, p = paths; p; p = p->next)
+		num_paths++;
 
 	/* order paths according to diffcore_order */
 	if (opt->orderfile && num_paths) {
@@ -1398,10 +1390,8 @@ void diff_tree_combined(const unsigned char *sha1,
 		if (opt->output_format & (DIFF_FORMAT_RAW |
 					  DIFF_FORMAT_NAME |
 					  DIFF_FORMAT_NAME_STATUS)) {
-			for (p = paths; p; p = p->next) {
-				if (p->len)
-					show_raw_diff(p, num_parent, rev);
-			}
+			for (p = paths; p; p = p->next)
+				show_raw_diff(p, num_parent, rev);
 			needsep = 1;
 		}
 		else if (opt->output_format &
@@ -1414,11 +1404,9 @@ void diff_tree_combined(const unsigned char *sha1,
 			if (needsep)
 				printf("%s%c", diff_line_prefix(opt),
 				       opt->line_termination);
-			for (p = paths; p; p = p->next) {
-				if (p->len)
-					show_patch_diff(p, num_parent, dense,
-							0, rev);
-			}
+			for (p = paths; p; p = p->next)
+				show_patch_diff(p, num_parent, dense,
+						0, rev);
 		}
 	}
 
diff --git a/diff-lib.c b/diff-lib.c
index e6d33b3..938869d 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -121,7 +121,6 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			dpath->path = (char *) &(dpath->parent[5]);
 
 			dpath->next = NULL;
-			dpath->len = path_len;
 			memcpy(dpath->path, ce->name, path_len);
 			dpath->path[path_len] = '\0';
 			hashclr(dpath->sha1);
@@ -323,7 +322,6 @@ static int show_modified(struct rev_info *revs,
 		p = xmalloc(combine_diff_path_size(2, pathlen));
 		p->path = (char *) &p->parent[2];
 		p->next = NULL;
-		p->len = pathlen;
 		memcpy(p->path, new->name, pathlen);
 		p->path[pathlen] = 0;
 		p->mode = mode;
diff --git a/diff.h b/diff.h
index 0e6898f..a24a767 100644
--- a/diff.h
+++ b/diff.h
@@ -198,7 +198,6 @@ extern int diff_root_tree_sha1(const unsigned char *new, const char *base,
 
 struct combine_diff_path {
 	struct combine_diff_path *next;
-	int len;
 	char *path;
 	unsigned int mode;
 	unsigned char sha1[20];
-- 
1.9.rc0.143.g6fd479e

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection
  2014-01-20 16:20 ` [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection Kirill Smelkov
@ 2014-01-28 15:46   ` Kirill Smelkov
  2014-01-28 21:55   ` Junio C Hamano
  1 sibling, 0 replies; 8+ messages in thread
From: Kirill Smelkov @ 2014-01-28 15:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Jan 20, 2014 at 08:20:40PM +0400, Kirill Smelkov wrote:
[...]

> @@ -1343,6 +1374,26 @@ void diff_tree_combined(const unsigned char *sha1,
>  		if (p->len)
>  			num_paths++;
>  	}
> +
> +	/* order paths according to diffcore_order */
> +	if (opt->orderfile && num_paths) {
> +		struct obj_order *o;
> +
> +		o = xmalloc(sizeof(*o) * num_paths);
> +		for (i = 0, p = paths; p; p = p->next, i++)
> +			o[i].obj = p;
> +		order_objects(opt->orderfile, path_path, o, num_paths);
> +		for (i = 0; i < num_paths - 1; i++) {
> +			p = o[i].obj;
> +			p->next = o[i+1].obj;
> +		}
> +
> +		p = o[num_paths-1].obj;
> +		p->next = NULL;
> +		paths = o[0].obj;
> +	}

I found I've introduced memory leak here (malloc without free). Please
apply the fix.  Thanks, Kirill.

---- 8< ----
From: Kirill Smelkov <kirr@mns.spb.ru>
Date: Tue, 28 Jan 2014 19:39:16 +0400
Subject: [PATCH] fixup! combine-diff: Optimize combine_diff_path sets intersection

Plug a memory leak.
---
 combine-diff.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/combine-diff.c b/combine-diff.c
index 07faa96..2d79312 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1383,6 +1383,7 @@ void diff_tree_combined(const unsigned char *sha1,
 		p = o[num_paths-1].obj;
 		p->next = NULL;
 		paths = o[0].obj;
+		free(o);
 	}
 
 
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection
  2014-01-20 16:20 ` [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection Kirill Smelkov
  2014-01-28 15:46   ` Kirill Smelkov
@ 2014-01-28 21:55   ` Junio C Hamano
  2014-01-29 11:21     ` Kirill Smelkov
  1 sibling, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2014-01-28 21:55 UTC (permalink / raw)
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@mns.spb.ru> writes:

> diff --git a/combine-diff.c b/combine-diff.c
> index 3b92c448..98c2562 100644
> --- a/combine-diff.c
> +++ b/combine-diff.c
> @@ -15,8 +15,8 @@
> ...
> +	while (1) {
> ...
> +		if (cmp < 0) {
> +			if (pprev)
> +				pprev->next = p->next;
> +			ptmp = p;
> +			p = p->next;
> +			free(ptmp);
> +			if (curr == ptmp)
> +				curr = p;
>  			continue;
> ...
> +		if (cmp > 0) {
> +			i++;
> +			continue;
>  		}
> ...
> +
> +		pprev = p;
> +		p = p->next;
> +		i++;
>  	}
>  	return curr;
>  }

Thanks. I very much like the approach.

I was staring at the above part of the code, but couldn't help
recalling this gem (look for "understand pointers" in the article):

  http://meta.slashdot.org/story/12/10/11/0030249/linus-torvalds-answers-your-questions

How about doing it this way (on top of your patch)?  It reduces 7
lines even though it adds two comment lines ;-)

 combine-diff.c | 37 +++++++++++++++----------------------
 1 file changed, 15 insertions(+), 22 deletions(-)

diff --git a/combine-diff.c b/combine-diff.c
index 2d79312..0809e79 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -15,11 +15,10 @@
 static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr, int n, int num_parent)
 {
 	struct diff_queue_struct *q = &diff_queued_diff;
-	struct combine_diff_path *p, *pprev, *ptmp;
+	struct combine_diff_path *p, **tail = &curr;
 	int i, cmp;
 
 	if (!n) {
-		struct combine_diff_path *list = NULL, **tail = &list;
 		for (i = 0; i < q->nr; i++) {
 			int len;
 			const char *path;
@@ -43,35 +42,30 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
 			*tail = p;
 			tail = &p->next;
 		}
-		return list;
+		return curr;
 	}
 
 	/*
-	 * NOTE paths are coming sorted here (= in tree order)
+	 * paths in curr (linked list) and q->queue[] (array) are
+	 * both sorted in the tree order.
 	 */
-
-	pprev = NULL;
-	p = curr;
 	i = 0;
+	while ((p = *tail) != NULL) {
+		cmp = ((i >= q->nr)
+		       ? -1 : strcmp(p->path, q->queue[i]->two->path));
 
-	while (1) {
-		if (!p)
-			break;
-
-		cmp = (i >= q->nr) ? -1
-				   : strcmp(p->path, q->queue[i]->two->path);
 		if (cmp < 0) {
-			if (pprev)
-				pprev->next = p->next;
-			ptmp = p;
-			p = p->next;
-			free(ptmp);
-			if (curr == ptmp)
-				curr = p;
+			/* p->path not in q->queue[]; drop it */
+			struct combine_diff_path *next = p->next;
+
+			if ((*tail = next) != NULL)
+				tail = &next->next;
+			free(p);
 			continue;
 		}
 
 		if (cmp > 0) {
+			/* q->queue[i] not in p->path; skip it */
 			i++;
 			continue;
 		}
@@ -80,8 +74,7 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
 		p->parent[n].mode = q->queue[i]->one->mode;
 		p->parent[n].status = q->queue[i]->status;
 
-		pprev = p;
-		p = p->next;
+		tail = &p->next;
 		i++;
 	}
 	return curr;

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection
  2014-01-28 21:55   ` Junio C Hamano
@ 2014-01-29 11:21     ` Kirill Smelkov
  0 siblings, 0 replies; 8+ messages in thread
From: Kirill Smelkov @ 2014-01-29 11:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Jan 28, 2014 at 01:55:09PM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@mns.spb.ru> writes:
> 
> > diff --git a/combine-diff.c b/combine-diff.c
> > index 3b92c448..98c2562 100644
> > --- a/combine-diff.c
> > +++ b/combine-diff.c
> > @@ -15,8 +15,8 @@
> > ...
> > +	while (1) {
> > ...
> > +		if (cmp < 0) {
> > +			if (pprev)
> > +				pprev->next = p->next;
> > +			ptmp = p;
> > +			p = p->next;
> > +			free(ptmp);
> > +			if (curr == ptmp)
> > +				curr = p;
> >  			continue;
> > ...
> > +		if (cmp > 0) {
> > +			i++;
> > +			continue;
> >  		}
> > ...
> > +
> > +		pprev = p;
> > +		p = p->next;
> > +		i++;
> >  	}
> >  	return curr;
> >  }
> 
> Thanks. I very much like the approach.
> 
> I was staring at the above part of the code, but couldn't help
> recalling this gem (look for "understand pointers" in the article):
> 
>   http://meta.slashdot.org/story/12/10/11/0030249/linus-torvalds-answers-your-questions
> 
> How about doing it this way (on top of your patch)?  It reduces 7
> lines even though it adds two comment lines ;-)
> 
>  combine-diff.c | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)

Thanks, this is sound approach and adding guiding comments is good, and
also now some of us with self-taught heritage understand (or at least
they think so) pointers a bit better :)

Now some nitpicks:

> diff --git a/combine-diff.c b/combine-diff.c
> index 2d79312..0809e79 100644
> --- a/combine-diff.c
> +++ b/combine-diff.c
> @@ -15,11 +15,10 @@
>  static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr, int n, int num_parent)
>  {
>  	struct diff_queue_struct *q = &diff_queued_diff;
> -	struct combine_diff_path *p, *pprev, *ptmp;
> +	struct combine_diff_path *p, **tail = &curr;
>  	int i, cmp;
>  
>  	if (!n) {
> -		struct combine_diff_path *list = NULL, **tail = &list;
>  		for (i = 0; i < q->nr; i++) {
>  			int len;
>  			const char *path;
> @@ -43,35 +42,30 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
>  			*tail = p;
>  			tail = &p->next;
>  		}
> -		return list;
> +		return curr;
>  	}
>  
>  	/*
> -	 * NOTE paths are coming sorted here (= in tree order)
> +	 * paths in curr (linked list) and q->queue[] (array) are
> +	 * both sorted in the tree order.
>  	 */
> -
> -	pprev = NULL;
> -	p = curr;
>  	i = 0;
> +	while ((p = *tail) != NULL) {
> +		cmp = ((i >= q->nr)
> +		       ? -1 : strcmp(p->path, q->queue[i]->two->path));

I liked cmp assignment being the original way - when "-1" is on one line
and strcmp is on another - to me it reads better. I'm not insisting on
it though.


> -	while (1) {
> -		if (!p)
> -			break;
> -
> -		cmp = (i >= q->nr) ? -1
> -				   : strcmp(p->path, q->queue[i]->two->path);
>  		if (cmp < 0) {
> -			if (pprev)
> -				pprev->next = p->next;
> -			ptmp = p;
> -			p = p->next;
> -			free(ptmp);
> -			if (curr == ptmp)
> -				curr = p;
> +			/* p->path not in q->queue[]; drop it */
> +			struct combine_diff_path *next = p->next;
> +
> +			if ((*tail = next) != NULL)
> +				tail = &next->next;
> +			free(p);
>  			continue;
>  		}

A bug crept in here - if we are removing the first element, i.e. when
p=curr, we have to advance curr as well - as we are returning curr back
as new intersected paths set list start. That's why there was curr
change.

Now curr stays the same, and if we'll remove the first element, curr
will be pointing to freed memory -> oops. A possible fixup could be:

---- 8< ----
diff --git a/combine-diff.c b/combine-diff.c
index 0809e79..6a61a25 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -60,6 +60,8 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
 
                        if ((*tail = next) != NULL)
                                tail = &next->next;
+                       if (curr == p)
+                               curr = next;
                        free(p);
                        continue;
                }
---- 8< ----

but this is blind code, as I had not tested it.


>  
>  		if (cmp > 0) {
> +			/* q->queue[i] not in p->path; skip it */
>  			i++;
>  			continue;
>  		}
> @@ -80,8 +74,7 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
>  		p->parent[n].mode = q->queue[i]->one->mode;
>  		p->parent[n].status = q->queue[i]->status;
>  
> -		pprev = p;
> -		p = p->next;
> +		tail = &p->next;
>  		i++;
>  	}
>  	return curr;


P.S. I'm slowly working on to speedup combine-diff further - the same
way as diff_tree() skips path for two trees, for combine-diff we could
traverse a merge tree and n parents simultaneously, even not delving
into generating (usually huge) diff(merge,parent_i) for a path, if we
know such diff for parent_j will be empty.

I have no numbers yet, but this should give significant speedup, as my
tracing showed for e.g. linux.git a lot of diffing is done for
combine-diff for merges to e.g. second parents (mean value of diff to
HEAD^2 is ~ 1500 paths) and almost all of them annulate when intersected
to diff(HEAD, HEAD^1).

Only this can't work (or at least I don't know how) if rename/copy
detection is on, so there will be two codepaths - fast, if we run
without -M/-C, and generic, but slower, where combine-diff paths are
computed as intersections:

    D(A,P1,P2,...Pn) = D(A,P1) ^ D(A,P2) ^ ... ^ D(A,Pn)

i.e. the current way.

Does this approach sound reasonable? My draft not-working-yet code is here:

http://repo.or.cz/w/git/kirr.git/shortlog/refs/heads/x/combinediff-sorted
(look for diff_tree_combined_X)


Thanks,
Kirill

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-01-29 11:19 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-20 16:20 [PATCH 0/4] `log -c` speedup Kirill Smelkov
2014-01-20 16:20 ` [PATCH 1/4] diffcore-order: Export generic ordering interface Kirill Smelkov
2014-01-20 16:20 ` [PATCH 2/4] diff test: Add tests for combine-diff with orderfile Kirill Smelkov
2014-01-20 16:20 ` [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection Kirill Smelkov
2014-01-28 15:46   ` Kirill Smelkov
2014-01-28 21:55   ` Junio C Hamano
2014-01-29 11:21     ` Kirill Smelkov
2014-01-20 16:20 ` [PATCH 4/4] combine-diff: combine_diff_path.len is not needed anymore Kirill Smelkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).