git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] blame: large-scale performance rewrite
@ 2014-04-25 23:56 David Kastrup
  2014-04-25 23:56 ` [PATCH 2/2] Mention "git blame" improvements in release notes David Kastrup
  2014-04-26  0:53 ` [PATCH 1/2] blame: large-scale performance rewrite Shawn Pearce
  0 siblings, 2 replies; 20+ messages in thread
From: David Kastrup @ 2014-04-25 23:56 UTC (permalink / raw)
  To: git; +Cc: David Kastrup

The previous implementation used a single sorted linear list of blame
entries for organizing all partial or completed work.  Every subtask had
to scan the whole list, with most entries not being relevant to the
task.  The resulting run-time was quadratic to the number of separate
chunks.

This change gives every subtask its own data to work with.  Subtasks are
organized into "struct origin" chains hanging off particular commits.
Commits are organized into a priority queue, processing them in commit
date order in order to keep most of the work affecting a particular blob
collated even in the presence of an extensive merge history.

For large files with a diversified history, a speedup by a factor of 3
or more is not unusual.

Signed-off-by: David Kastrup <dak@gnu.org>
---
 builtin/blame.c | 865 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 567 insertions(+), 298 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 88cb799..224f0ff 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -1,7 +1,8 @@
 /*
  * Blame
  *
- * Copyright (c) 2006, Junio C Hamano
+ * Copyright (c) 2006, 2014 by its authors
+ * See COPYING for licensing conditions
  */
 
 #include "cache.h"
@@ -18,7 +19,9 @@
 #include "cache-tree.h"
 #include "string-list.h"
 #include "mailmap.h"
+#include "mergesort.h"
 #include "parse-options.h"
+#include "prio-queue.h"
 #include "utf8.h"
 #include "userdiff.h"
 #include "line-range.h"
@@ -83,11 +86,42 @@ static unsigned blame_copy_score;
  */
 struct origin {
 	int refcnt;
+	/* Record preceding blame record for this blob */
 	struct origin *previous;
+	/* origins are put in a list linked via `next' hanging off the
+	 * corresponding commit's util field in order to make finding
+	 * them fast.  The presence in this chain does not count
+	 * towards the origin's reference count.  It is tempting to
+	 * let it count as long as the commit is pending examination,
+	 * but even under circumstances where the commit will be
+	 * present multiple times in the priority queue of unexamined
+	 * commits, processing the first instance will not leave any
+	 * work requiring the origin data for the second instance.  An
+	 * interspersed commit changing that would have to be
+	 * preexisting with a different ancestry and with the same
+	 * commit date in order to wedge itself between two instances
+	 * of the same commit in the priority queue _and_ produce
+	 * blame entries relevant for it.  While we don't want to let
+	 * us get tripped up by this case, it certainly does not seem
+	 * worth optimizing for.
+	 */
+	struct origin *next;
 	struct commit *commit;
+	/* `suspects' contains blame entries that may be attributed to
+	 * this origin's commit or to parent commits.  When a commit
+	 * is being processed, all suspects will be moved, either by
+	 * assigning them to an origin in a different commit, or by
+	 * shipping them to the scoreboard's ent list because they
+	 * cannot be attributed to a different commit.
+	 */
+	struct blame_entry *suspects;
 	mmfile_t file;
 	unsigned char blob_sha1[20];
 	unsigned mode;
+	/* guilty gets set when shipping any suspects to the final
+	 * blame list instead of other commits
+	 */
+	char guilty;
 	char path[FLEX_ARRAY];
 };
 
@@ -176,10 +210,22 @@ static inline struct origin *origin_incref(struct origin *o)
 static void origin_decref(struct origin *o)
 {
 	if (o && --o->refcnt <= 0) {
+		struct origin *p, *l = NULL;
 		if (o->previous)
 			origin_decref(o->previous);
 		free(o->file.ptr);
-		free(o);
+		/* Should be present exactly once in commit chain */
+		for (p = o->commit->util; p; l = p, p = p->next) {
+			if (p == o) {
+				if (l)
+					l->next = p->next;
+				else
+					o->commit->util = p->next;
+				free(o);
+				return;
+			}
+		}
+		die("internal error in blame::origin_decref");
 	}
 }
 
@@ -193,8 +239,12 @@ static void drop_origin_blob(struct origin *o)
 
 /*
  * Each group of lines is described by a blame_entry; it can be split
- * as we pass blame to the parents.  They form a linked list in the
- * scoreboard structure, sorted by the target line number.
+ * as we pass blame to the parents.  They are arranged in linked lists
+ * kept as `suspects' of some unprocessed origin, or entered (when the
+ * blame origin has been finalized) into the scoreboard structure.
+ * While the scoreboard structure is only sorted at the end of
+ * processing (according to final image line number), the lists
+ * attached to an origin are sorted by the target line number.
  */
 struct blame_entry {
 	struct blame_entry *next;
@@ -210,15 +260,6 @@ struct blame_entry {
 	/* the commit that introduced this group into the final image */
 	struct origin *suspect;
 
-	/* true if the suspect is truly guilty; false while we have not
-	 * checked if the group came from one of its parents.
-	 */
-	char guilty;
-
-	/* true if the entry has been scanned for copies in the current parent
-	 */
-	char scanned;
-
 	/* the line number of the first line of this group in the
 	 * suspect's file; internally all line numbers are 0 based.
 	 */
@@ -231,11 +272,112 @@ struct blame_entry {
 };
 
 /*
+ * Any merge of blames happens on lists of blames that arrived via
+ * different parents in a single suspect.  In this case, we want to
+ * sort according to the suspect line numbers as opposed to the final
+ * image line numbers.  The function body is somewhat longish because
+ * it avoids unnecessary writes.
+ */
+
+static struct blame_entry *blame_merge(struct blame_entry *list1,
+				       struct blame_entry *list2)
+{
+	struct blame_entry *p1 = list1, *p2 = list2,
+		**tail = &list1;
+
+	if (!p1)
+		return p2;
+	if (!p2)
+		return p1;
+
+	if (p1->s_lno <= p2->s_lno) {
+		do {
+			tail = &p1->next;
+			if ((p1 = *tail) == NULL) {
+				*tail = p2;
+				return list1;
+			}
+		} while (p1->s_lno <= p2->s_lno);
+	}
+	for (;;) {
+		*tail = p2;
+		do {
+			tail = &p2->next;
+			if ((p2 = *tail) == NULL)  {
+				*tail = p1;
+				return list1;
+			}
+		} while (p1->s_lno > p2->s_lno);
+		*tail = p1;
+		do {
+			tail = &p1->next;
+			if ((p1 = *tail) == NULL) {
+				*tail = p2;
+				return list1;
+			}
+		} while (p1->s_lno <= p2->s_lno);
+	}
+}
+
+static void *get_next_blame(const void *p)
+{
+	return ((struct blame_entry *)p)->next;
+}
+
+static void set_next_blame(void *p1, void *p2)
+{
+	((struct blame_entry *)p1)->next = p2;
+}
+
+/*
+ * Final image line numbers are all different, so we don't need a
+ * three-way comparison here.
+ */
+
+static int compare_blame_final(const void *p1, const void *p2)
+{
+	return ((struct blame_entry *)p1)->lno > ((struct blame_entry *)p2)->lno
+		? 1 : -1;
+}
+
+static int compare_blame_suspect(const void *p1, const void *p2)
+{
+	const struct blame_entry *s1 = p1, *s2 = p2;
+	/*
+	 * to allow for collating suspects, we sort according to the
+	 * respective pointer value as the primary sorting criterion.
+	 * The actual relation is pretty unimportant as long as it
+	 * establishes a total order.  Comparing as integers gives us
+	 * that.
+	 */
+	if (s1->suspect != s2->suspect)
+		return (intptr_t)s1->suspect > (intptr_t)s2->suspect ? 1 : -1;
+	if (s1->s_lno == s2->s_lno)
+		return 0;
+	return s1->s_lno > s2->s_lno ? 1 : -1;
+}
+
+static struct blame_entry *blame_sort(struct blame_entry *head,
+				      int (*compare_fn)(const void *, const void *))
+{
+	return llist_mergesort (head, get_next_blame, set_next_blame, compare_fn);
+}
+
+static int compare_commits_by_reverse_commit_date(const void *a,
+						  const void *b,
+						  void *c)
+{
+	return -compare_commits_by_commit_date(a, b, c);
+}
+
+/*
  * The current state of the blame assignment.
  */
 struct scoreboard {
 	/* the final commit (i.e. where we started digging from) */
 	struct commit *final;
+	/* Priority queue for commits with unassigned blame records */
+	struct prio_queue commits;
 	struct rev_info *revs;
 	const char *path;
 
@@ -268,7 +410,6 @@ static void coalesce(struct scoreboard *sb)
 
 	for (ent = sb->ent; ent && (next = ent->next); ent = next) {
 		if (ent->suspect == next->suspect &&
-		    ent->guilty == next->guilty &&
 		    ent->s_lno + ent->num_lines == next->s_lno) {
 			ent->num_lines += next->num_lines;
 			ent->next = next->next;
@@ -284,6 +425,30 @@ static void coalesce(struct scoreboard *sb)
 }
 
 /*
+ * Merge the given sorted list of blames into a preexisting origin.
+ * If there were no previous blames to that commit, it is entered into
+ * the commit priority queue of the score board.
+ */
+
+static void queue_blames(struct scoreboard *sb, struct origin *porigin,
+			 struct blame_entry *sorted)
+{
+	if (porigin->suspects)
+		porigin->suspects = blame_merge(porigin->suspects, sorted);
+	else {
+		struct origin *o;
+		for (o = porigin->commit->util; o; o = o->next) {
+			if (o->suspects) {
+				porigin->suspects = sorted;
+				return;
+			}
+		}
+		porigin->suspects = sorted;
+		prio_queue_put(&sb->commits, porigin->commit);
+	}
+}
+
+/*
  * Given a commit and a path in it, create a new origin structure.
  * The callers that add blame to the scoreboard should use
  * get_origin() to obtain shared, refcounted copy instead of calling
@@ -295,23 +460,32 @@ static struct origin *make_origin(struct commit *commit, const char *path)
 	o = xcalloc(1, sizeof(*o) + strlen(path) + 1);
 	o->commit = commit;
 	o->refcnt = 1;
+	o->next = commit->util;
+	commit->util = o;
 	strcpy(o->path, path);
 	return o;
 }
 
 /*
  * Locate an existing origin or create a new one.
+ * This moves the origin to front position in the commit util list.
  */
 static struct origin *get_origin(struct scoreboard *sb,
 				 struct commit *commit,
 				 const char *path)
 {
-	struct blame_entry *e;
+	struct origin *o, *l;
 
-	for (e = sb->ent; e; e = e->next) {
-		if (e->suspect->commit == commit &&
-		    !strcmp(e->suspect->path, path))
-			return origin_incref(e->suspect);
+	for (o = commit->util, l = NULL; o; l = o, o = o->next) {
+		if (!strcmp(o->path, path)) {
+			/* bump to front */
+			if (l) {
+				l->next = o->next;
+				o->next = commit->util;
+				commit->util = o;
+			}
+			return origin_incref(o);
+		}
 	}
 	return make_origin(commit, path);
 }
@@ -350,41 +524,19 @@ static struct origin *find_origin(struct scoreboard *sb,
 				  struct commit *parent,
 				  struct origin *origin)
 {
-	struct origin *porigin = NULL;
+	struct origin *porigin;
 	struct diff_options diff_opts;
 	const char *paths[2];
 
-	if (parent->util) {
-		/*
-		 * Each commit object can cache one origin in that
-		 * commit.  This is a freestanding copy of origin and
-		 * not refcounted.
-		 */
-		struct origin *cached = parent->util;
-		if (!strcmp(cached->path, origin->path)) {
+	/* First check any existing origins */
+	for (porigin = parent->util; porigin; porigin = porigin->next)
+		if (!strcmp(porigin->path, origin->path)) {
 			/*
 			 * The same path between origin and its parent
 			 * without renaming -- the most common case.
 			 */
-			porigin = get_origin(sb, parent, cached->path);
-
-			/*
-			 * If the origin was newly created (i.e. get_origin
-			 * would call make_origin if none is found in the
-			 * scoreboard), it does not know the blob_sha1/mode,
-			 * so copy it.  Otherwise porigin was in the
-			 * scoreboard and already knows blob_sha1/mode.
-			 */
-			if (porigin->refcnt == 1) {
-				hashcpy(porigin->blob_sha1, cached->blob_sha1);
-				porigin->mode = cached->mode;
-			}
-			return porigin;
+			return origin_incref (porigin);
 		}
-		/* otherwise it was not very useful; free it */
-		free(parent->util);
-		parent->util = NULL;
-	}
 
 	/* See if the origin->path is different between parent
 	 * and origin first.  Most of the time they are the
@@ -450,19 +602,6 @@ static struct origin *find_origin(struct scoreboard *sb,
 	}
 	diff_flush(&diff_opts);
 	free_pathspec(&diff_opts.pathspec);
-	if (porigin) {
-		/*
-		 * Create a freestanding copy that is not part of
-		 * the refcounted origin found in the scoreboard, and
-		 * cache it in the commit.
-		 */
-		struct origin *cached;
-
-		cached = make_origin(porigin->commit, porigin->path);
-		hashcpy(cached->blob_sha1, porigin->blob_sha1);
-		cached->mode = porigin->mode;
-		parent->util = cached;
-	}
 	return porigin;
 }
 
@@ -509,46 +648,31 @@ static struct origin *find_rename(struct scoreboard *sb,
 }
 
 /*
- * Link in a new blame entry to the scoreboard.  Entries that cover the
- * same line range have been removed from the scoreboard previously.
+ * Append a new blame entry to a given output queue.
  */
-static void add_blame_entry(struct scoreboard *sb, struct blame_entry *e)
+static void add_blame_entry(struct blame_entry ***queue, struct blame_entry *e)
 {
-	struct blame_entry *ent, *prev = NULL;
-
 	origin_incref(e->suspect);
 
-	for (ent = sb->ent; ent && ent->lno < e->lno; ent = ent->next)
-		prev = ent;
-
-	/* prev, if not NULL, is the last one that is below e */
-
-	if (prev) {
-		e->next = prev->next;
-		prev->next = e;
-	}
-	else {
-		e->next = sb->ent;
-		sb->ent = e;
-	}
+	e->next = **queue;
+	**queue = e;
+	*queue = &e->next;
 }
 
 /*
  * src typically is on-stack; we want to copy the information in it to
- * a malloced blame_entry that is already on the linked list of the
- * scoreboard.  The origin of dst loses a refcnt while the origin of src
- * gains one.
+ * a malloced blame_entry that gets added to the given queue.  The
+ * origin of dst loses a refcnt.
  */
-static void dup_entry(struct blame_entry *dst, struct blame_entry *src)
+static void dup_entry(struct blame_entry ***queue,
+		      struct blame_entry *dst, struct blame_entry *src)
 {
-	struct blame_entry *n;
-
-	n = dst->next;
 	origin_incref(src->suspect);
 	origin_decref(dst->suspect);
 	memcpy(dst, src, sizeof(*src));
-	dst->next = n;
-	dst->score = 0;
+	dst->next = **queue;
+	**queue = dst;
+	*queue = &dst->next;
 }
 
 static const char *nth_line(struct scoreboard *sb, long lno)
@@ -620,10 +744,11 @@ static void split_overlap(struct blame_entry *split,
 
 /*
  * split_overlap() divided an existing blame e into up to three parts
- * in split.  Adjust the linked list of blames in the scoreboard to
+ * in split.  Any assigned blame is moved to queue to
  * reflect the split.
  */
-static void split_blame(struct scoreboard *sb,
+static void split_blame(struct blame_entry ***blamed,
+			struct blame_entry ***unblamed,
 			struct blame_entry *split,
 			struct blame_entry *e)
 {
@@ -631,61 +756,39 @@ static void split_blame(struct scoreboard *sb,
 
 	if (split[0].suspect && split[2].suspect) {
 		/* The first part (reuse storage for the existing entry e) */
-		dup_entry(e, &split[0]);
+		dup_entry(unblamed, e, &split[0]);
 
 		/* The last part -- me */
 		new_entry = xmalloc(sizeof(*new_entry));
 		memcpy(new_entry, &(split[2]), sizeof(struct blame_entry));
-		add_blame_entry(sb, new_entry);
+		add_blame_entry(unblamed, new_entry);
 
 		/* ... and the middle part -- parent */
 		new_entry = xmalloc(sizeof(*new_entry));
 		memcpy(new_entry, &(split[1]), sizeof(struct blame_entry));
-		add_blame_entry(sb, new_entry);
+		add_blame_entry(blamed, new_entry);
 	}
 	else if (!split[0].suspect && !split[2].suspect)
 		/*
 		 * The parent covers the entire area; reuse storage for
 		 * e and replace it with the parent.
 		 */
-		dup_entry(e, &split[1]);
+		dup_entry(blamed, e, &split[1]);
 	else if (split[0].suspect) {
 		/* me and then parent */
-		dup_entry(e, &split[0]);
+		dup_entry(unblamed, e, &split[0]);
 
 		new_entry = xmalloc(sizeof(*new_entry));
 		memcpy(new_entry, &(split[1]), sizeof(struct blame_entry));
-		add_blame_entry(sb, new_entry);
+		add_blame_entry(blamed, new_entry);
 	}
 	else {
 		/* parent and then me */
-		dup_entry(e, &split[1]);
+		dup_entry(blamed, e, &split[1]);
 
 		new_entry = xmalloc(sizeof(*new_entry));
 		memcpy(new_entry, &(split[2]), sizeof(struct blame_entry));
-		add_blame_entry(sb, new_entry);
-	}
-
-	if (DEBUG) { /* sanity */
-		struct blame_entry *ent;
-		int lno = sb->ent->lno, corrupt = 0;
-
-		for (ent = sb->ent; ent; ent = ent->next) {
-			if (lno != ent->lno)
-				corrupt = 1;
-			if (ent->s_lno < 0)
-				corrupt = 1;
-			lno += ent->num_lines;
-		}
-		if (corrupt) {
-			lno = sb->ent->lno;
-			for (ent = sb->ent; ent; ent = ent->next) {
-				printf("L %8d l %8d n %8d\n",
-				       lno, ent->lno, ent->num_lines);
-				lno = ent->lno + ent->num_lines;
-			}
-			die("oops");
-		}
+		add_blame_entry(unblamed, new_entry);
 	}
 }
 
@@ -702,74 +805,146 @@ static void decref_split(struct blame_entry *split)
 }
 
 /*
- * Helper for blame_chunk().  blame_entry e is known to overlap with
- * the patch hunk; split it and pass blame to the parent.
+ * reverse_blame reverses the list given in head, appending tail.
+ * That allows us to build lists in reverse order, then reverse them
+ * afterwards.  This can be faster than building the list in proper
+ * order right away.  The reason is that building in proper order
+ * requires writing a link in the _previous_ element, while building
+ * in reverse order just requires placing the list head into the
+ * _current_ element.
  */
-static void blame_overlap(struct scoreboard *sb, struct blame_entry *e,
-			  int tlno, int plno, int same,
-			  struct origin *parent)
-{
-	struct blame_entry split[3];
-
-	split_overlap(split, e, tlno, plno, same, parent);
-	if (split[1].suspect)
-		split_blame(sb, split, e);
-	decref_split(split);
-}
 
-/*
- * Find the line number of the last line the target is suspected for.
- */
-static int find_last_in_target(struct scoreboard *sb, struct origin *target)
+static struct blame_entry *reverse_blame(struct blame_entry *head,
+					 struct blame_entry *tail)
 {
-	struct blame_entry *e;
-	int last_in_target = -1;
-
-	for (e = sb->ent; e; e = e->next) {
-		if (e->guilty || e->suspect != target)
-			continue;
-		if (last_in_target < e->s_lno + e->num_lines)
-			last_in_target = e->s_lno + e->num_lines;
+	while (head) {
+		struct blame_entry *next = head->next;
+		head->next = tail;
+		tail = head;
+		head = next;
 	}
-	return last_in_target;
+	return tail;
 }
 
 /*
  * Process one hunk from the patch between the current suspect for
- * blame_entry e and its parent.  Find and split the overlap, and
- * pass blame to the overlapping part to the parent.
+ * blame_entry e and its parent.  This first blames any unfinished
+ * entries before the chunk (which is where target and parent start
+ * differing) on the parent, and then splits blame entries at the
+ * start and at the end of the difference region.  Since use of -M and
+ * -C options may lead to overlapping/duplicate source line number
+ * ranges, all we can rely on from sorting/merging is the order of the
+ * first suspect line number.
  */
-static void blame_chunk(struct scoreboard *sb,
-			int tlno, int plno, int same,
-			struct origin *target, struct origin *parent)
+static void blame_chunk(struct blame_entry ***dstq, struct blame_entry ***srcq,
+			int tlno, int offset, int same,
+			struct origin *parent)
 {
-	struct blame_entry *e;
+	struct blame_entry *e = **srcq;
+	struct blame_entry *samep = NULL, *diffp = NULL;
 
-	for (e = sb->ent; e; e = e->next) {
-		if (e->guilty || e->suspect != target)
-			continue;
-		if (same <= e->s_lno)
-			continue;
-		if (tlno < e->s_lno + e->num_lines)
-			blame_overlap(sb, e, tlno, plno, same, parent);
+	while (e && e->s_lno < tlno) {
+		struct blame_entry *next = e->next;
+		/*
+		 * current record starts before differing portion.  If
+		 * it reaches into it, we need to split it up and
+		 * examine the second part separately.
+		 */
+		if (e->s_lno + e->num_lines > tlno) {
+			/* Move second half to a new record */
+			int len = tlno - e->s_lno;
+			struct blame_entry *n = xcalloc(1, sizeof (struct blame_entry));
+			n->suspect = e->suspect;
+			n->lno = e->lno + len;
+			n->s_lno = e->s_lno + len;
+			n->num_lines = e->num_lines - len;
+			e->num_lines = len;
+			e->score = 0;
+			/* Push new record to diffp */
+			n->next = diffp;
+			diffp = n;
+		} else
+			origin_decref(e->suspect);
+		/* Pass blame for everything before the differing
+		 * chunk to the parent */
+		e->suspect = origin_incref(parent);
+		e->s_lno += offset;
+		e->next = samep;
+		samep = e;
+		e = next;
+	}
+	/*
+	 * As we don't know how much of a common stretch after this
+	 * diff will occur, the currently blamed parts are all that we
+	 * can assign to the parent for now.
+	 */
+
+	if (samep) {
+		**dstq = reverse_blame(samep, **dstq);
+		*dstq = &samep->next;
 	}
+	/*
+	 * Prepend the split off portions: everything after e starts
+	 * after the blameable portion.
+	 */
+	e = reverse_blame(diffp, e);
+
+	/*
+	 * Now retain records on the target while parts are different
+	 * from the parent.
+	 */
+	samep = NULL;
+	diffp = NULL;
+	while (e && e->s_lno < same) {
+		struct blame_entry *next = e->next;
+
+		/*
+		 * If current record extends into sameness, need to split.
+		 */
+		if (e->s_lno + e->num_lines > same) {
+			/*
+			 * Move second half to a new record to be
+			 * processed by later chunks
+			 */
+			int len = same - e->s_lno;
+			struct blame_entry *n = xcalloc(1, sizeof (struct blame_entry));
+			n->suspect = origin_incref(e->suspect);
+			n->lno = e->lno + len;
+			n->s_lno = e->s_lno + len;
+			n->num_lines = e->num_lines - len;
+			e->num_lines = len;
+			e->score = 0;
+			/* Push new record to samep */
+			n->next = samep;
+			samep = n;
+		}
+		e->next = diffp;
+		diffp = e;
+		e = next;
+	}
+	**srcq = reverse_blame(diffp, reverse_blame(samep, e));
+	/* Move across elements that are in the unblamable portion */
+	if (diffp)
+		*srcq = &diffp->next;
 }
 
 struct blame_chunk_cb_data {
-	struct scoreboard *sb;
-	struct origin *target;
 	struct origin *parent;
-	long plno;
-	long tlno;
+	long offset;
+	struct blame_entry **dstq;
+	struct blame_entry **srcq;
 };
 
+/* diff chunks are from parent to target */
 static int blame_chunk_cb(long start_a, long count_a,
 			  long start_b, long count_b, void *data)
 {
 	struct blame_chunk_cb_data *d = data;
-	blame_chunk(d->sb, d->tlno, d->plno, start_b, d->target, d->parent);
-	d->plno = start_a + count_a;
-	d->tlno = start_b + count_b;
+	if (start_a - start_b != d->offset)
+		die("internal error in blame::blame_chunk_cb");
+	blame_chunk(&d->dstq, &d->srcq, start_b, start_a - start_b,
+		    start_b + count_b, d->parent);
+	d->offset = start_a + count_a - (start_b + count_b);
 	return 0;
 }
 
@@ -778,29 +953,32 @@ static int blame_chunk_cb(long start_a, long count_a,
  * for the lines it is suspected to its parent.  Run diff to find
  * which lines came from parent and pass blame for them.
  */
-static int pass_blame_to_parent(struct scoreboard *sb,
-				struct origin *target,
-				struct origin *parent)
+static void pass_blame_to_parent(struct scoreboard *sb,
+				 struct origin *target,
+				 struct origin *parent)
 {
-	int last_in_target;
 	mmfile_t file_p, file_o;
 	struct blame_chunk_cb_data d;
+	struct blame_entry *newdest = NULL;
 
-	memset(&d, 0, sizeof(d));
-	d.sb = sb; d.target = target; d.parent = parent;
-	last_in_target = find_last_in_target(sb, target);
-	if (last_in_target < 0)
-		return 1; /* nothing remains for this target */
+	if (!target->suspects)
+		return; /* nothing remains for this target */
+
+	d.parent = parent;
+	d.offset = 0;
+	d.dstq = &newdest; d.srcq = &target->suspects;
 
 	fill_origin_blob(&sb->revs->diffopt, parent, &file_p);
 	fill_origin_blob(&sb->revs->diffopt, target, &file_o);
 	num_get_patch++;
 
 	diff_hunks(&file_p, &file_o, 0, blame_chunk_cb, &d);
-	/* The rest (i.e. anything after tlno) are the same as the parent */
-	blame_chunk(sb, d.tlno, d.plno, last_in_target, target, parent);
+	/* The rest are the same as the parent */
+	blame_chunk(&d.dstq, &d.srcq, INT_MAX, d.offset, INT_MAX, parent);
+	*d.dstq = NULL;
+	queue_blames(sb, parent, newdest);
 
-	return 0;
+	return;
 }
 
 /*
@@ -945,43 +1123,80 @@ static void find_copy_in_blob(struct scoreboard *sb,
 	handle_split(sb, ent, d.tlno, d.plno, ent->num_lines, parent, split);
 }
 
+/* Move all blame entries from list *source that have a score smaller
+ * than score_min to the front of list *small.
+ * Returns a pointer to the link pointing to the old head of the small list.
+ */
+
+static struct blame_entry **filter_small(struct scoreboard *sb,
+					 struct blame_entry **small,
+					 struct blame_entry **source,
+					 unsigned score_min)
+{
+	struct blame_entry *p = *source;
+	struct blame_entry *oldsmall = *small;
+	while (p) {
+		if (ent_score(sb, p) <= score_min) {
+			*small = p;
+			small = &p->next;
+			p = *small;
+		} else {
+			*source = p;
+			source = &p->next;
+			p = *source;
+		}
+	}
+	*small = oldsmall;
+	*source = NULL;
+	return small;
+}
+
 /*
  * See if lines currently target is suspected for can be attributed to
  * parent.
  */
-static int find_move_in_parent(struct scoreboard *sb,
-			       struct origin *target,
-			       struct origin *parent)
+static void find_move_in_parent(struct scoreboard *sb,
+				struct blame_entry ***blamed,
+				struct blame_entry **toosmall,
+				struct origin *target,
+				struct origin *parent)
 {
-	int last_in_target, made_progress;
 	struct blame_entry *e, split[3];
+	struct blame_entry *unblamed = target->suspects;
+	struct blame_entry *leftover = NULL;
 	mmfile_t file_p;
 
-	last_in_target = find_last_in_target(sb, target);
-	if (last_in_target < 0)
-		return 1; /* nothing remains for this target */
+	if (!unblamed)
+		return; /* nothing remains for this target */
 
 	fill_origin_blob(&sb->revs->diffopt, parent, &file_p);
 	if (!file_p.ptr)
-		return 0;
+		return;
 
-	made_progress = 1;
-	while (made_progress) {
-		made_progress = 0;
-		for (e = sb->ent; e; e = e->next) {
-			if (e->guilty || e->suspect != target ||
-			    ent_score(sb, e) < blame_move_score)
-				continue;
+	/* At each iteration, unblamed has a NULL-terminated list of
+	 * entries that have not yet been tested for blame.  leftover
+	 * contains the reversed list of entries that have been tested
+	 * without being assignable to the parent.
+	 */
+	do {
+		struct blame_entry **unblamedtail = &unblamed;
+		struct blame_entry *next;
+		for (e = unblamed; e; e = next) {
+			next = e->next;
 			find_copy_in_blob(sb, e, parent, split, &file_p);
 			if (split[1].suspect &&
 			    blame_move_score < ent_score(sb, &split[1])) {
-				split_blame(sb, split, e);
-				made_progress = 1;
+				split_blame(blamed, &unblamedtail, split, e);
+			} else {
+				e->next = leftover;
+				leftover = e;
 			}
 			decref_split(split);
 		}
-	}
-	return 0;
+		*unblamedtail = NULL;
+		toosmall = filter_small(sb, toosmall, &unblamed, blame_move_score);
+	} while (unblamed);
+	target->suspects = reverse_blame(leftover, NULL);
 }
 
 struct blame_list {
@@ -993,62 +1208,46 @@ struct blame_list {
  * Count the number of entries the target is suspected for,
  * and prepare a list of entry and the best split.
  */
-static struct blame_list *setup_blame_list(struct scoreboard *sb,
-					   struct origin *target,
-					   int min_score,
+static struct blame_list *setup_blame_list(struct blame_entry *unblamed,
 					   int *num_ents_p)
 {
 	struct blame_entry *e;
 	int num_ents, i;
 	struct blame_list *blame_list = NULL;
 
-	for (e = sb->ent, num_ents = 0; e; e = e->next)
-		if (!e->scanned && !e->guilty &&
-		    e->suspect == target &&
-		    min_score < ent_score(sb, e))
-			num_ents++;
+	for (e = unblamed, num_ents = 0; e; e = e->next)
+		num_ents++;
 	if (num_ents) {
 		blame_list = xcalloc(num_ents, sizeof(struct blame_list));
-		for (e = sb->ent, i = 0; e; e = e->next)
-			if (!e->scanned && !e->guilty &&
-			    e->suspect == target &&
-			    min_score < ent_score(sb, e))
-				blame_list[i++].ent = e;
+		for (e = unblamed, i = 0; e; e = e->next)
+			blame_list[i++].ent = e;
 	}
 	*num_ents_p = num_ents;
 	return blame_list;
 }
 
 /*
- * Reset the scanned status on all entries.
- */
-static void reset_scanned_flag(struct scoreboard *sb)
-{
-	struct blame_entry *e;
-	for (e = sb->ent; e; e = e->next)
-		e->scanned = 0;
-}
-
-/*
  * For lines target is suspected for, see if we can find code movement
  * across file boundary from the parent commit.  porigin is the path
  * in the parent we already tried.
  */
-static int find_copy_in_parent(struct scoreboard *sb,
-			       struct origin *target,
-			       struct commit *parent,
-			       struct origin *porigin,
-			       int opt)
+static void find_copy_in_parent(struct scoreboard *sb,
+				struct blame_entry ***blamed,
+				struct blame_entry **toosmall,
+				struct origin *target,
+				struct commit *parent,
+				struct origin *porigin,
+				int opt)
 {
 	struct diff_options diff_opts;
 	int i, j;
-	int retval;
 	struct blame_list *blame_list;
 	int num_ents;
+	struct blame_entry *unblamed = target->suspects;
+	struct blame_entry *leftover = NULL;
 
-	blame_list = setup_blame_list(sb, target, blame_copy_score, &num_ents);
-	if (!blame_list)
-		return 1; /* nothing remains for this target */
+	if (!unblamed)
+		return; /* nothing remains for this target */
 
 	diff_setup(&diff_opts);
 	DIFF_OPT_SET(&diff_opts, RECURSIVE);
@@ -1078,9 +1277,9 @@ static int find_copy_in_parent(struct scoreboard *sb,
 	if (!DIFF_OPT_TST(&diff_opts, FIND_COPIES_HARDER))
 		diffcore_std(&diff_opts);
 
-	retval = 0;
-	while (1) {
-		int made_progress = 0;
+	do {
+		struct blame_entry **unblamedtail = &unblamed;
+		blame_list = setup_blame_list(unblamed, &num_ents);
 
 		for (i = 0; i < diff_queued_diff.nr; i++) {
 			struct diff_filepair *p = diff_queued_diff.queue[i];
@@ -1117,27 +1316,21 @@ static int find_copy_in_parent(struct scoreboard *sb,
 			struct blame_entry *split = blame_list[j].split;
 			if (split[1].suspect &&
 			    blame_copy_score < ent_score(sb, &split[1])) {
-				split_blame(sb, split, blame_list[j].ent);
-				made_progress = 1;
+				split_blame(blamed, &unblamedtail, split,
+					    blame_list[j].ent);
+			} else {
+				blame_list[j].ent->next = leftover;
+				leftover = blame_list[j].ent;
 			}
-			else
-				blame_list[j].ent->scanned = 1;
 			decref_split(split);
 		}
 		free(blame_list);
-
-		if (!made_progress)
-			break;
-		blame_list = setup_blame_list(sb, target, blame_copy_score, &num_ents);
-		if (!blame_list) {
-			retval = 1;
-			break;
-		}
-	}
-	reset_scanned_flag(sb);
+		*unblamedtail = NULL;
+		toosmall = filter_small(sb, toosmall, &unblamed, blame_copy_score);
+	} while (unblamed);
+	target->suspects = reverse_blame(leftover, NULL);
 	diff_flush(&diff_opts);
 	free_pathspec(&diff_opts.pathspec);
-	return retval;
 }
 
 /*
@@ -1147,20 +1340,21 @@ static int find_copy_in_parent(struct scoreboard *sb,
 static void pass_whole_blame(struct scoreboard *sb,
 			     struct origin *origin, struct origin *porigin)
 {
-	struct blame_entry *e;
+	struct blame_entry *e, *suspects;
 
 	if (!porigin->file.ptr && origin->file.ptr) {
 		/* Steal its file */
 		porigin->file = origin->file;
 		origin->file.ptr = NULL;
 	}
-	for (e = sb->ent; e; e = e->next) {
-		if (e->suspect != origin)
-			continue;
+	suspects = origin->suspects;
+	origin->suspects = NULL;
+	for (e = suspects; e; e = e->next) {
 		origin_incref(porigin);
 		origin_decref(e->suspect);
 		e->suspect = porigin;
 	}
+	queue_blames(sb, porigin, suspects);
 }
 
 /*
@@ -1184,6 +1378,27 @@ static int num_scapegoats(struct rev_info *revs, struct commit *commit)
 	return cnt;
 }
 
+/* Distribute collected unsorted blames to the respected sorted lists
+ * in the various origins.
+ */
+static void distribute_blame(struct scoreboard *sb, struct blame_entry *blamed)
+{
+	blamed = blame_sort(blamed, compare_blame_suspect);
+	while (blamed)
+	{
+		struct origin *porigin = blamed->suspect;
+		struct blame_entry *suspects = NULL;
+		do {
+			struct blame_entry *next = blamed->next;
+			blamed->next = suspects;
+			suspects = blamed;
+			blamed = next;
+		} while (blamed && blamed->suspect == porigin);
+		suspects = reverse_blame(suspects, NULL);
+		queue_blames(sb, porigin, suspects);
+	}
+}
+
 #define MAXSG 16
 
 static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
@@ -1194,6 +1409,8 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
 	struct commit_list *sg;
 	struct origin *sg_buf[MAXSG];
 	struct origin *porigin, **sg_origin = sg_buf;
+	struct blame_entry *toosmall = NULL;
+	struct blame_entry *blames, **blametail = &blames;
 
 	num_sg = num_scapegoats(revs, commit);
 	if (!num_sg)
@@ -1255,38 +1472,71 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
 			origin_incref(porigin);
 			origin->previous = porigin;
 		}
-		if (pass_blame_to_parent(sb, origin, porigin))
+		pass_blame_to_parent(sb, origin, porigin);
+		if (!origin->suspects)
 			goto finish;
 	}
 
 	/*
 	 * Optionally find moves in parents' files.
 	 */
-	if (opt & PICKAXE_BLAME_MOVE)
-		for (i = 0, sg = first_scapegoat(revs, commit);
-		     i < num_sg && sg;
-		     sg = sg->next, i++) {
-			struct origin *porigin = sg_origin[i];
-			if (!porigin)
-				continue;
-			if (find_move_in_parent(sb, origin, porigin))
-				goto finish;
+	if (opt & PICKAXE_BLAME_MOVE) {
+		filter_small(sb, &toosmall, &origin->suspects, blame_move_score);
+		if (origin->suspects) {
+			for (i = 0, sg = first_scapegoat(revs, commit);
+			     i < num_sg && sg;
+			     sg = sg->next, i++) {
+				struct origin *porigin = sg_origin[i];
+				if (!porigin)
+					continue;
+				find_move_in_parent(sb, &blametail, &toosmall, origin, porigin);
+				if (!origin->suspects)
+					break;
+			}
 		}
+	}
 
 	/*
 	 * Optionally find copies from parents' files.
 	 */
-	if (opt & PICKAXE_BLAME_COPY)
+	if (opt & PICKAXE_BLAME_COPY) {
+		if (blame_copy_score > blame_move_score)
+			filter_small(sb, &toosmall, &origin->suspects, blame_copy_score);
+		else if (blame_copy_score < blame_move_score) {
+			origin->suspects = blame_merge(origin->suspects, toosmall);
+			toosmall = NULL;
+			filter_small(sb, &toosmall, &origin->suspects, blame_copy_score);
+		}
+		if (!origin->suspects)
+			goto finish;
+
 		for (i = 0, sg = first_scapegoat(revs, commit);
 		     i < num_sg && sg;
 		     sg = sg->next, i++) {
 			struct origin *porigin = sg_origin[i];
-			if (find_copy_in_parent(sb, origin, sg->item,
-						porigin, opt))
+			find_copy_in_parent(sb, &blametail, &toosmall,
+					    origin, sg->item, porigin, opt);
+			if (!origin->suspects)
 				goto finish;
 		}
+	}
 
- finish:
+finish:
+	*blametail = NULL;
+	distribute_blame(sb, blames);
+	/*
+	 * prepend toosmall to origin->suspects
+	 *
+	 * There is no point in sorting: this ends up on a big
+	 * unsorted list in the caller anyway.
+	 */
+	if (toosmall) {
+		struct blame_entry **tail = &toosmall;
+		while (*tail)
+			tail = &(*tail)->next;
+		*tail = origin->suspects;
+		origin->suspects = toosmall;
+	}
 	for (i = 0; i < num_sg; i++) {
 		if (sg_origin[i]) {
 			drop_origin_blob(sg_origin[i]);
@@ -1481,14 +1731,11 @@ static int emit_one_suspect_detail(struct origin *suspect, int repeat)
 }
 
 /*
- * The blame_entry is found to be guilty for the range.  Mark it
- * as such, and show it in incremental output.
+ * The blame_entry is found to be guilty for the range.
+ * Show it in incremental output.
  */
 static void found_guilty_entry(struct blame_entry *ent)
 {
-	if (ent->guilty)
-		return;
-	ent->guilty = 1;
 	if (incremental) {
 		struct origin *suspect = ent->suspect;
 
@@ -1502,32 +1749,34 @@ static void found_guilty_entry(struct blame_entry *ent)
 }
 
 /*
- * The main loop -- while the scoreboard has lines whose true origin
- * is still unknown, pick one blame_entry, and allow its current
- * suspect to pass blames to its parents.
- */
+ * The main loop -- while we have blobs with lines whose true origin
+ * is still unknown, pick one blob, and allow its lines to pass blames
+ * to its parents. */
 static void assign_blame(struct scoreboard *sb, int opt)
 {
 	struct rev_info *revs = sb->revs;
+	struct commit *commit = prio_queue_get(&sb->commits);
 
-	while (1) {
+	while (commit) {
 		struct blame_entry *ent;
-		struct commit *commit;
-		struct origin *suspect = NULL;
+		struct origin *suspect = commit->util;
 
 		/* find one suspect to break down */
-		for (ent = sb->ent; !suspect && ent; ent = ent->next)
-			if (!ent->guilty)
-				suspect = ent->suspect;
-		if (!suspect)
-			return; /* all done */
+		while (suspect && !suspect->suspects)
+			suspect = suspect->next;
+
+		if (!suspect) {
+			commit = prio_queue_get(&sb->commits);
+			continue;
+		}
+
+		assert(commit == suspect->commit);
 
 		/*
 		 * We will use this suspect later in the loop,
 		 * so hold onto it in the meantime.
 		 */
 		origin_incref(suspect);
-		commit = suspect->commit;
 		parse_commit(commit);
 		if (reverse ||
 		    (!(commit->object.flags & UNINTERESTING) &&
@@ -1543,9 +1792,22 @@ static void assign_blame(struct scoreboard *sb, int opt)
 			commit->object.flags |= UNINTERESTING;
 
 		/* Take responsibility for the remaining entries */
-		for (ent = sb->ent; ent; ent = ent->next)
-			if (ent->suspect == suspect)
+		ent = suspect->suspects;
+		if (ent) {
+			suspect->guilty = 1;
+			for (;;) {
+				struct blame_entry *next = ent->next;
 				found_guilty_entry(ent);
+				if (next) {
+					ent = next;
+					continue;
+				}
+				ent->next = sb->ent;
+				sb->ent = suspect->suspects;
+				suspect->suspects = NULL;
+				break;
+			}
+		}
 		origin_decref(suspect);
 
 		if (DEBUG) /* sanity */
@@ -1602,9 +1864,8 @@ static void emit_porcelain(struct scoreboard *sb, struct blame_entry *ent,
 	char hex[41];
 
 	strcpy(hex, sha1_to_hex(suspect->commit->object.sha1));
-	printf("%s%c%d %d %d\n",
+	printf("%s %d %d %d\n",
 	       hex,
-	       ent->guilty ? ' ' : '*', /* purely for debugging */
 	       ent->s_lno + 1,
 	       ent->lno + 1,
 	       ent->num_lines);
@@ -1717,17 +1978,16 @@ static void output(struct scoreboard *sb, int option)
 
 	if (option & OUTPUT_PORCELAIN) {
 		for (ent = sb->ent; ent; ent = ent->next) {
-			struct blame_entry *oth;
-			struct origin *suspect = ent->suspect;
-			struct commit *commit = suspect->commit;
+			int count = 0;
+			struct origin *suspect;
+			struct commit *commit = ent->suspect->commit;
 			if (commit->object.flags & MORE_THAN_ONE_PATH)
 				continue;
-			for (oth = ent->next; oth; oth = oth->next) {
-				if ((oth->suspect->commit != commit) ||
-				    !strcmp(oth->suspect->path, suspect->path))
-					continue;
-				commit->object.flags |= MORE_THAN_ONE_PATH;
-				break;
+			for (suspect = commit->util; suspect; suspect = suspect->next) {
+				if (suspect->guilty && count++) {
+					commit->object.flags |= MORE_THAN_ONE_PATH;
+					break;
+				}
 			}
 		}
 	}
@@ -2092,7 +2352,6 @@ static struct commit *fake_working_tree_commit(struct diff_options *opt,
 	origin->file.ptr = buf.buf;
 	origin->file.size = buf.len;
 	pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_sha1);
-	commit->util = origin;
 
 	/*
 	 * Read the current index, replace the path entry with
@@ -2403,12 +2662,16 @@ parse_done:
 	memset(&sb, 0, sizeof(sb));
 
 	sb.revs = &revs;
-	if (!reverse)
+	if (!reverse) {
 		final_commit_name = prepare_final(&sb);
+		sb.commits.compare = compare_commits_by_commit_date;
+	}
 	else if (contents_from)
 		die("--contents and --children do not blend well.");
-	else
+	else {
 		final_commit_name = prepare_initial(&sb);
+		sb.commits.compare = compare_commits_by_reverse_commit_date;
+	}
 
 	if (!sb.final) {
 		/*
@@ -2497,12 +2760,16 @@ parse_done:
 		ent->next = next;
 		origin_incref(o);
 	}
+
+	o->suspects = ent;
+	prio_queue_put(&sb.commits, o->commit);
+
 	origin_decref(o);
 
 	range_set_release(&ranges);
 	string_list_clear(&range_list, 0);
 
-	sb.ent = ent;
+	sb.ent = NULL;
 	sb.path = path;
 
 	read_mailmap(&mailmap, NULL);
@@ -2515,6 +2782,8 @@ parse_done:
 	if (incremental)
 		return 0;
 
+	sb.ent = blame_sort(sb.ent, compare_blame_final);
+
 	coalesce(&sb);
 
 	if (!(output_option & OUTPUT_PORCELAIN))
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] Mention "git blame" improvements in release notes
  2014-04-25 23:56 [PATCH 1/2] blame: large-scale performance rewrite David Kastrup
@ 2014-04-25 23:56 ` David Kastrup
  2014-04-26 17:28   ` Junio C Hamano
  2014-04-26  0:53 ` [PATCH 1/2] blame: large-scale performance rewrite Shawn Pearce
  1 sibling, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-04-25 23:56 UTC (permalink / raw)
  To: git; +Cc: David Kastrup

Includes reasonably tasteful begging.

Signed-off-by: David Kastrup <dak@gnu.org>
---
 Documentation/RelNotes/2.0.0.txt | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/RelNotes/2.0.0.txt b/Documentation/RelNotes/2.0.0.txt
index ffd4899..27b23c3 100644
--- a/Documentation/RelNotes/2.0.0.txt
+++ b/Documentation/RelNotes/2.0.0.txt
@@ -144,6 +144,12 @@ UI, Workflows & Features
 
 Performance, Internal Implementation, etc.
 
+ * Significant parts of "git blame" have been reimplemented by David
+   Kastrup <dak@gnu.org> for a vast gain in performance with complex
+   histories and large files.  As working on free software is his sole
+   source of income, please consider contributing to his remuneration
+   if you find this useful.
+
  * The compilation options to port to AIX and to MSVC have been
    updated.
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-25 23:56 [PATCH 1/2] blame: large-scale performance rewrite David Kastrup
  2014-04-25 23:56 ` [PATCH 2/2] Mention "git blame" improvements in release notes David Kastrup
@ 2014-04-26  0:53 ` Shawn Pearce
  2014-04-26  7:48   ` David Kastrup
  1 sibling, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2014-04-26  0:53 UTC (permalink / raw)
  To: git

On Fri, Apr 25, 2014 at 4:56 PM, David Kastrup <dak@gnu.org> wrote:
> The previous implementation used a single sorted linear list of blame
> entries for organizing all partial or completed work.  Every subtask had
> to scan the whole list, with most entries not being relevant to the
> task.  The resulting run-time was quadratic to the number of separate
> chunks.
>
> This change gives every subtask its own data to work with.  Subtasks are
> organized into "struct origin" chains hanging off particular commits.
> Commits are organized into a priority queue, processing them in commit
> date order in order to keep most of the work affecting a particular blob
> collated even in the presence of an extensive merge history.

Without reading the code, this sounds like how JGit runs blame.

> For large files with a diversified history, a speedup by a factor of 3
> or more is not unusual.

And JGit was already usually slower than git-core. Now it will be even
slower! :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-26  0:53 ` [PATCH 1/2] blame: large-scale performance rewrite Shawn Pearce
@ 2014-04-26  7:48   ` David Kastrup
  2014-04-26 16:01     ` Shawn Pearce
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-04-26  7:48 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Shawn Pearce <spearce@spearce.org> writes:

> On Fri, Apr 25, 2014 at 4:56 PM, David Kastrup <dak@gnu.org> wrote:
>> The previous implementation used a single sorted linear list of blame
>> entries for organizing all partial or completed work.  Every subtask had
>> to scan the whole list, with most entries not being relevant to the
>> task.  The resulting run-time was quadratic to the number of separate
>> chunks.
>>
>> This change gives every subtask its own data to work with.  Subtasks are
>> organized into "struct origin" chains hanging off particular commits.
>> Commits are organized into a priority queue, processing them in commit
>> date order in order to keep most of the work affecting a particular blob
>> collated even in the presence of an extensive merge history.
>
> Without reading the code, this sounds like how JGit runs blame.
>
>> For large files with a diversified history, a speedup by a factor of 3
>> or more is not unusual.
>
> And JGit was already usually slower than git-core. Now it will be even
> slower! :-)

If your statement about JGit is accurate, it should likely have beat Git
for large use cases (where the performance improvements are most
important) as O(n) beats O(n^2) in the long run.

At any rate, I see that I ended up posting this patch series at the end
of the week again which makes for a somewhat lacklustre initial response
from those who code Git for a regular living.

Apropos: shaking the bugs regarding -M and -C options out of the code
had taken a large toll because -M can cause the same or overlapping line
regions to be responsible for different target regions and the original
code implementing the "straightforward" blame blew up on the overlap.
I spent a _lot_ of time tracking down that problem.

As I am lousy focusing on more than one task, and as I don't get a
regular paycheck anyway, this will have to remain my last contribution
to Git if I am not going to recoup my losses.

Patch 2 of this series tries giving the community of Git a serious
chance at picking that option (I mean, there are literally millions of
Git users around with a sizable number profiting) while not being
obnoxious about it.

My personal guess is that it will fail regarding both objectives.  But
then I've been surprised before by other free software communities when
trying to make those particular two ends meet.

At any rate, feedback about the performance of the patch from users
disappointed by regular git blame would be welcome.

Apart from the objective measurement of "total time", the more
subjective impression of interactive/incremental response (like in git
gui blame) where the order of results will significantly differ (current
git-blame --incremental focuses on getting blames resolved in
first-lines-first manner, the proposed git-blame rather works on a
newest-commits-first basis which might better match typical use cases)
might be worth reporting.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-26  7:48   ` David Kastrup
@ 2014-04-26 16:01     ` Shawn Pearce
  2014-04-26 16:50       ` David Kastrup
  2014-04-26 17:02       ` David Kastrup
  0 siblings, 2 replies; 20+ messages in thread
From: Shawn Pearce @ 2014-04-26 16:01 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

On Sat, Apr 26, 2014 at 12:48 AM, David Kastrup <dak@gnu.org> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> On Fri, Apr 25, 2014 at 4:56 PM, David Kastrup <dak@gnu.org> wrote:
>>> The previous implementation used a single sorted linear list of blame
>>> entries for organizing all partial or completed work.  Every subtask had
>>> to scan the whole list, with most entries not being relevant to the
>>> task.  The resulting run-time was quadratic to the number of separate
>>> chunks.
>>>
>>> This change gives every subtask its own data to work with.  Subtasks are
>>> organized into "struct origin" chains hanging off particular commits.
>>> Commits are organized into a priority queue, processing them in commit
>>> date order in order to keep most of the work affecting a particular blob
>>> collated even in the presence of an extensive merge history.
>>
>> Without reading the code, this sounds like how JGit runs blame.
>>
>>> For large files with a diversified history, a speedup by a factor of 3
>>> or more is not unusual.
>>
>> And JGit was already usually slower than git-core. Now it will be even
>> slower! :-)
>
> If your statement about JGit is accurate, it should likely have beat Git
> for large use cases (where the performance improvements are most
> important) as O(n) beats O(n^2) in the long run.

Agreed.

In a few cases yes, JGit did beat git-core at blame running time.
Unfortunately according to my profiling blame performance is still
dominated by inflation and scanning of commit and tree objects to
identify unmodified blobs and advance to the next scapegoat ancestor.

Its entirely possible my blame implementation in JGit is still doing
something stupid. Or its possible JIT translated Java just takes
longer than natively compiled C. I am including JVM startup and JIT
time in the timing comparison with git-core.

> Apart from the objective measurement of "total time", the more
> subjective impression of interactive/incremental response (like in git
> gui blame) where the order of results will significantly differ (current
> git-blame --incremental focuses on getting blames resolved in
> first-lines-first manner, the proposed git-blame rather works on a
> newest-commits-first basis which might better match typical use cases)
> might be worth reporting.

Seeing this fill during execution was the initial motivation I had for
writing git gui blame. I don't think anyone cares about the order it
displays in. In fact ordering my timestamp may be more what the user
wants anyway, as you suggest above.

Thanks for doing this. Unfortunately I can't read the patch itself as
I am also trying to improve JGit's blame code for $DAY_JOB, and JGit
is BSD licensed.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-26 16:01     ` Shawn Pearce
@ 2014-04-26 16:50       ` David Kastrup
  2014-04-26 17:09         ` Shawn Pearce
  2014-04-26 17:02       ` David Kastrup
  1 sibling, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-04-26 16:50 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Shawn Pearce <spearce@spearce.org> writes:

> On Sat, Apr 26, 2014 at 12:48 AM, David Kastrup <dak@gnu.org> wrote:
>> Shawn Pearce <spearce@spearce.org> writes:
>>>
>>> And JGit was already usually slower than git-core. Now it will be
>>> even slower! :-)
>>
>> If your statement about JGit is accurate, it should likely have beat
>> Git for large use cases (where the performance improvements are most
>> important) as O(n) beats O(n^2) in the long run.
>
> Agreed.
>
> In a few cases yes, JGit did beat git-core at blame running time.
> Unfortunately according to my profiling blame performance is still
> dominated by inflation and scanning of commit and tree objects to
> identify unmodified blobs and advance to the next scapegoat ancestor.

Oh, the C version is most certainly significantly impacted by that after
my patch.  One can _significantly_ speed it up by increasing
core.deltaBaseCacheLimit from its rather silly value of 16M.  If you
have a comparable control in JGit and if your benchmarking points to the
unpacking, that's where I'd suggest tweaking first.

>> Apart from the objective measurement of "total time", the more
>> subjective impression of interactive/incremental response (like in
>> git gui blame) where the order of results will significantly differ
>> (current git-blame --incremental focuses on getting blames resolved
>> in first-lines-first manner, the proposed git-blame rather works on a
>> newest-commits-first basis which might better match typical use
>> cases) might be worth reporting.
>
> Seeing this fill during execution was the initial motivation I had for
> writing git gui blame.

It does not look impressively better to me, actually.  Probably because
"git gui blame" is running "git blame" several times.

> I don't think anyone cares about the order it displays in. In fact
> ordering my timestamp may be more what the user wants anyway, as you
> suggest above.

What the user wants anyway is that "git gui blame" notifies "git blame"
of the currently displayed window area whenever that changes, and that
git blame then _first_ deals with all chunks with a visible on-screen
part.

> Thanks for doing this. Unfortunately I can't read the patch itself as
> I am also trying to improve JGit's blame code for $DAY_JOB, and JGit
> is BSD licensed.

Shrug.  The patch is functionally equivalent to the previous behavior,
the arrangement of linear lists on underlying data structures is hardly
copyrightable, and Java implements linear lists differently anyhow.
Merging two sorted linear lists is a straightforward operation,
splitting a linear list into several others also is.

The really tricky/expensive part was realizing that as opposed to target
line number ranges, source line number ranges may overlap and/or be
duplicate when using -M or -C options.  That really messed things up for
a long time and was hard to debug.  Once I figured out what was going
wrong, recoding the respective stuff was straightforward.

I doubt that there is much copyrightable material to transfer as I seem
to remember that Java does not have anything like a pointer.  So the
main stuff, juggling with linear lists, would not likely transfer in a
reasonably recognizable manner.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-26 16:01     ` Shawn Pearce
  2014-04-26 16:50       ` David Kastrup
@ 2014-04-26 17:02       ` David Kastrup
  2014-04-26 17:30         ` David Kastrup
  1 sibling, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-04-26 17:02 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Shawn Pearce <spearce@spearce.org> writes:

> Thanks for doing this. Unfortunately I can't read the patch itself as
> I am also trying to improve JGit's blame code for $DAY_JOB, and JGit
> is BSD licensed.

Actually, I'd have suggested asking $EMPLOYER to buy the rights for
looking at the code, but as I wrote previously, I'd seriously doubt that
he'd get his money's worth for use in a _Java_ implementation.

The C code I proposed is good for files with many small changes: I'd
suggest benchmarking your JGit code with some of them.  If the JGit is
still dominated by unpacking, your implementation should be fine.  The
current C version instead thrashes around digging through its own
all-purpose single linear list.

Here are two real-world test cases:

git://git.savannah.gnu.org/emacs.git
git blame [-M / -C] src/xdisp.c

http://repo.or.cz/r/wortliste.git
git blame [-M / -C] wortliste

The latter one is _really_ taking a severe hit from the O(n^2)
algorithms.  If your benchmarks for that one still point mostly to the
unpacking, your jgit blame should be fine regarding the stuff
I reimplemented.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-26 16:50       ` David Kastrup
@ 2014-04-26 17:09         ` Shawn Pearce
  2014-04-26 17:22           ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2014-04-26 17:09 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

On Sat, Apr 26, 2014 at 9:50 AM, David Kastrup <dak@gnu.org> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> On Sat, Apr 26, 2014 at 12:48 AM, David Kastrup <dak@gnu.org> wrote:
>>> Shawn Pearce <spearce@spearce.org> writes:
>>>>
>>>> And JGit was already usually slower than git-core. Now it will be
>>>> even slower! :-)
>>>
>>> If your statement about JGit is accurate, it should likely have beat
>>> Git for large use cases (where the performance improvements are most
>>> important) as O(n) beats O(n^2) in the long run.
>>
>> Agreed.
>>
>> In a few cases yes, JGit did beat git-core at blame running time.
>> Unfortunately according to my profiling blame performance is still
>> dominated by inflation and scanning of commit and tree objects to
>> identify unmodified blobs and advance to the next scapegoat ancestor.
>
> Oh, the C version is most certainly significantly impacted by that after
> my patch.  One can _significantly_ speed it up by increasing
> core.deltaBaseCacheLimit from its rather silly value of 16M.  If you
> have a comparable control in JGit and if your benchmarking points to the
> unpacking, that's where I'd suggest tweaking first.

Good point, we have that control but I always forget to play with it
during benchmarking.

>>> Apart from the objective measurement of "total time", the more
>>> subjective impression of interactive/incremental response (like in
>>> git gui blame) where the order of results will significantly differ
>>> (current git-blame --incremental focuses on getting blames resolved
>>> in first-lines-first manner, the proposed git-blame rather works on a
>>> newest-commits-first basis which might better match typical use
>>> cases) might be worth reporting.
>>
>> Seeing this fill during execution was the initial motivation I had for
>> writing git gui blame.
>
> It does not look impressively better to me, actually.  Probably because
> "git gui blame" is running "git blame" several times.

IIRC git gui blame runs blame only twice. Once with the simple blame,
and then again with -M -C or something like that. Its a nice display
because you can see who moved code here, and then who originally wrote
it. Unfortunately it has to be done as two passes as the blame
implementation does not know how to compute both in one pass.

The move/copy detection is also computationally more expensive per
revision visited, so it really slows down the simple "who put it here"
if the two passes were somehow run in a single iteration.

>> I don't think anyone cares about the order it displays in. In fact
>> ordering my timestamp may be more what the user wants anyway, as you
>> suggest above.
>
> What the user wants anyway is that "git gui blame" notifies "git blame"
> of the currently displayed window area whenever that changes, and that
> git blame then _first_ deals with all chunks with a visible on-screen
> part.

That is a really good idea. Unfortunately finding that part of the
window in the blame data can still take a while. You may compute other
parts of the file anyway while looking for this part, as you had to
compare two revisions and scapegoat sections to find out that commit
didn't touch the interesting region and keep going back through
history.

You may get lucky and be able to fill in a recent region quickly, but
I somehow suspect giving priority to the visible region in the
priority queue won't really help to reduce latency of the visible
region for the user.

> The really tricky/expensive part was realizing that as opposed to target
> line number ranges, source line number ranges may overlap and/or be
> duplicate when using -M or -C options.  That really messed things up for
> a long time and was hard to debug.  Once I figured out what was going
> wrong, recoding the respective stuff was straightforward.

Right, and JGit blame still is missing the -M and -C options, as I
have not implemented those yet. I got basic blame and reverse blame
working a few years ago and then stopped working on the code for a
while. Now we have interest in improving the latency for $DAY_JOB, so
I've been poking at the code again for the last week or so.

But that -M and -C thing is still not implemented, and I know its
going to be tricky to get right with the way the scapegoating is
passed along.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-26 17:09         ` Shawn Pearce
@ 2014-04-26 17:22           ` David Kastrup
  0 siblings, 0 replies; 20+ messages in thread
From: David Kastrup @ 2014-04-26 17:22 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Shawn Pearce <spearce@spearce.org> writes:

> Right, and JGit blame still is missing the -M and -C options, as I
> have not implemented those yet. I got basic blame and reverse blame
> working a few years ago and then stopped working on the code for a
> while. Now we have interest in improving the latency for $DAY_JOB, so
> I've been poking at the code again for the last week or so.
>
> But that -M and -C thing is still not implemented, and I know its
> going to be tricky to get right with the way the scapegoating is
> passed along.

Actually, for -M/-C it would be saner to rip open the whole xdiff
blackbox which internally goes to quite some effort in order to produce
the linear ordering of a diff, and then -M has to simulate dropping that
linear ordering requirement by doing a host of parallel diffs for each
chunk.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] Mention "git blame" improvements in release notes
  2014-04-25 23:56 ` [PATCH 2/2] Mention "git blame" improvements in release notes David Kastrup
@ 2014-04-26 17:28   ` Junio C Hamano
  2014-04-26 18:28     ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2014-04-26 17:28 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

David Kastrup <dak@gnu.org> writes:

> Includes reasonably tasteful begging.

Thanks, but no thanks---I do not see it tasteful.

In any case, any large change that is not a regression fix (or a fix
to a code added since 1.9 series) is way too late for 2.0 at this
point, but I do look forward to reading the patch over, queuing to
my tree, cooking in 'next' and eventually having this in 2.1 or
later.

If you want help in a fundraising campaign, I can lend my name
(especially after this change settles and proves to be useful ;-),
but let's do that elsewhere. I do not want to do this in the release
notes (e.g., an entry in git-blame blog can mention it when it
touches the blame improvements).

This by the way touches another thing I have been wondering.
Perhaps I should stop having the top-level RelNotes as a symbolic
link, but keep it as a regular file, which I *copy* to its current
location in the commit that tags the release.  And then I start a
skeletal RelNotes at the top of the tree when the next cycle begins,
and new topics will build on top of that commit.

That way, this patch would have been against the top-level RelNotes,
applied as part of the topic, and when the topic is merged to
'master', it would make it less likely for me to forget about
mentioning it.

Thanks for working on the topic.

>
> Signed-off-by: David Kastrup <dak@gnu.org>
> ---
>  Documentation/RelNotes/2.0.0.txt | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/Documentation/RelNotes/2.0.0.txt b/Documentation/RelNotes/2.0.0.txt
> index ffd4899..27b23c3 100644
> --- a/Documentation/RelNotes/2.0.0.txt
> +++ b/Documentation/RelNotes/2.0.0.txt
> @@ -144,6 +144,12 @@ UI, Workflows & Features
>  
>  Performance, Internal Implementation, etc.
>  
> + * Significant parts of "git blame" have been reimplemented by David
> +   Kastrup <dak@gnu.org> for a vast gain in performance with complex
> +   histories and large files.  As working on free software is his sole
> +   source of income, please consider contributing to his remuneration
> +   if you find this useful.
> +
>   * The compilation options to port to AIX and to MSVC have been
>     updated.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-26 17:02       ` David Kastrup
@ 2014-04-26 17:30         ` David Kastrup
  2014-04-26 17:56           ` Shawn Pearce
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-04-26 17:30 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

David Kastrup <dak@gnu.org> writes:

> http://repo.or.cz/r/wortliste.git
> git blame [-M / -C] wortliste
>
> The latter one is _really_ taking a severe hit from the O(n^2)
> algorithms.  If your benchmarks for that one still point mostly to the
> unpacking, your jgit blame should be fine regarding the stuff
> I reimplemented.

Here's some example:

dak@lola:/usr/local/tmp/wortliste$ time git blame -n -s wortliste >/tmp/wl1

real	15m47.118s
user	14m39.928s
sys	1m1.872s
dak@lola:/usr/local/tmp/wortliste$ time ../git/git blame -n -s wortliste >/tmp/wl2

real	3m40.947s
user	2m40.296s
sys	0m59.440s

Note how the system time is almost the same.  I have some patches which
make quite a bit of difference with that (at best, saving about half of
the system time), but I have not yet found the silver bullet where I'd
be reasonably sure that temporary memory use with non-linear history
stays strictly in nice bounds.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-26 17:30         ` David Kastrup
@ 2014-04-26 17:56           ` Shawn Pearce
  2014-04-26 21:39             ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2014-04-26 17:56 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

On Sat, Apr 26, 2014 at 10:30 AM, David Kastrup <dak@gnu.org> wrote:
> David Kastrup <dak@gnu.org> writes:
>
>> http://repo.or.cz/r/wortliste.git
>> git blame [-M / -C] wortliste
>>
>> The latter one is _really_ taking a severe hit from the O(n^2)
>> algorithms.  If your benchmarks for that one still point mostly to the
>> unpacking, your jgit blame should be fine regarding the stuff
>> I reimplemented.
>
> Here's some example:
>
> dak@lola:/usr/local/tmp/wortliste$ time git blame -n -s wortliste >/tmp/wl1
>
> real    15m47.118s
> user    14m39.928s
> sys     1m1.872s

Hah, this is quite the torture test. git before your patch is taking
22m11s on my laptop to compute this. (This was with default options, I
noticed you passed -s to suppress the author formatting.)

> dak@lola:/usr/local/tmp/wortliste$ time ../git/git blame -n -s wortliste >/tmp/wl2
>
> real    3m40.947s
> user    2m40.296s
> sys     0m59.440s

Meanwhile JGit computed in 4m30s on the same hardware. So I guess we
are "fine". Its still not as fast as I want it to be. :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] Mention "git blame" improvements in release notes
  2014-04-26 17:28   ` Junio C Hamano
@ 2014-04-26 18:28     ` David Kastrup
       [not found]       ` <xmqqzjj5s8hs.fsf@gitster.dls.corp.google.com>
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-04-26 18:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Includes reasonably tasteful begging.
>
> Thanks, but no thanks---I do not see it tasteful.

Well, begging rarely is.  The point simply is that without commensurate
recompensation, I cannot afford any more work of that kind on Git, and
there is a reasonable likelihood that such work is worth more to some
subset of Git users than what it would take to enable me doing it.

My experience with "tasteful" asking for contributions in the context of
AUCTeX and preview-latex development is about €100 plus two cases of
beer in 10 years.

With GNU LilyPond, I've been way more blunt.  Its community certainly is
dwarved by the Git community, and still they've been able to support my
work with more than €1000 per month for several years now.  I've been
letting those people down for several months now because of the
git-blame stuff, with a respective decline in support to show for that.
Sure, partly because of misestimates of the involved work and the
involved self-motivation to get it over with.

If that's not worth anything to the Git community, I can just chalk it
off as a somewhat expensive one-time experience and that's it.  I can
live with that.

What I want to avoid, however, is the situation where this kind of work
would actually _have_ been worth enough to enough people to enable it
but they don't get to make a decision whether to support more of it
and/or express their appreciation in the manner that actually counts,
because of being blissfully unaware.

Now of course, people having an independent and/or guaranteed can afford
to be tasteful.  And there are probably enough of those around for
running the show.

But then "git blame" performance has been sub-par for a very long time
already.

> In any case, any large change that is not a regression fix (or a fix
> to a code added since 1.9 series) is way too late for 2.0 at this
> point,

For what it's worth, the user interface is unchanged.  And results
should be the same as previously apart from the runtime requirements.
Naturally, this is "should", and problems, particularly regarding
different output, may take a long time until somebody notices since few
people will actually compare the old and new results.

> but I do look forward to reading the patch over, queuing to my tree,
> cooking in 'next' and eventually having this in 2.1 or later.
>
> If you want help in a fundraising campaign, I can lend my name
> (especially after this change settles and proves to be useful ;-),

In my book, it is a large usability improvement but not necessarily a
game changer.  Waiting for 1 minute rather than 3 minutes is still
nothing one wants enabled in a web server, or that turns stuff into the
"interactive response" ballpark.

To get that, one will have to work on the remaining performance which is
primarily the responsibility of the object store and associated caching.
The advantage is that its impact on the performance is now readily
visible: previous to this patch it is strongly masked by the sub-par
performance of the git-blame code itself.

> but let's do that elsewhere.

If you have a reasonable idea for that.  It would be pointless wherever
it safely becomes "somebody else's problem" for pretty much everybody.
I'm not overly happy with trying to recruit active developers/power
users for that kind of thing when they are

a) actively investing time and effort themselves
b) outnumbered by profiting users 1000:1

but I've not yet found a better approach myself with regard to LilyPond.
If you have a better idea for Git...

> I do not want to do this in the release notes (e.g., an entry in
> git-blame blog can mention it when it touches the blame improvements).

Again: it's important to be visible to those people who might care about
putting money in, or it's pointless.

At any rate, I'm glad that the work is closed for now.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-26 17:56           ` Shawn Pearce
@ 2014-04-26 21:39             ` David Kastrup
  2014-04-27 17:53               ` Shawn Pearce
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-04-26 21:39 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Shawn Pearce <spearce@spearce.org> writes:

> On Sat, Apr 26, 2014 at 10:30 AM, David Kastrup <dak@gnu.org> wrote:
>> David Kastrup <dak@gnu.org> writes:
>>
>> Here's some example:
>>
>> dak@lola:/usr/local/tmp/wortliste$ time git blame -n -s wortliste >/tmp/wl1
>>
>> real    15m47.118s
>> user    14m39.928s
>> sys     1m1.872s
>
> Hah, this is quite the torture test. git before your patch is taking
> 22m11s on my laptop to compute this. (This was with default options, I
> noticed you passed -s to suppress the author formatting.)
>
>> dak@lola:/usr/local/tmp/wortliste$ time ../git/git blame -n -s wortliste >/tmp/wl2
>>
>> real    3m40.947s
>> user    2m40.296s
>> sys     0m59.440s
>
> Meanwhile JGit computed in 4m30s on the same hardware. So I guess we
> are "fine".

At least the stuff I fixed with regard to performance would seem to be
done right in JGit to start with.

> Its still not as fast as I want it to be. :-)

Most of the diff data/CRC is computed over and over because of the
blackbox use of xdiff.  And then the delta-chain storage is packing
stuff based on CRCs as well (not sure whether it keeps them around for
unpacking).  So there is a lot that could likely be improved while
keeping the same basic algorithms, by cracking open the black boxes of
the xdiff engine and the delta-chain coding.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] blame: large-scale performance rewrite
  2014-04-26 21:39             ` David Kastrup
@ 2014-04-27 17:53               ` Shawn Pearce
  0 siblings, 0 replies; 20+ messages in thread
From: Shawn Pearce @ 2014-04-27 17:53 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

On Sat, Apr 26, 2014 at 2:39 PM, David Kastrup <dak@gnu.org> wrote:
>
> At least the stuff I fixed with regard to performance would seem to be
> done right in JGit to start with.

Hah! Its Java. We have to do things right, otherwise its too slow. :-)

>> Its still not as fast as I want it to be. :-)
>
> Most of the diff data/CRC is computed over and over because of the
> blackbox use of xdiff.

Yes, I have been thinking about this all week.

JGit blame uses HistogramDiff by default instead of MyersDiff. The
first stage after we trim common header/trailer from both files is to
compute a hash of each line and store those hashes. Those hashes are
discarded as the blame algorithm moves to the next commit.

Clearly for a commit chain of A -> B -> C, the hashes computed at B
for the A->B compare can be reused for the B->C compare. This is not
the case in either git-core or JGit, because the diff algorithm is a
block box to the blame algorithm. I think this is what you mean by the
CRC being computed again.


For any given compare blame has a list of regions it is interested in
learning about from the diff algorithm. Anything outside of those
regions is useless noise that will be discarded. I have been pondering
pushing that region list down into the diff algorithm so it can avoid
executing on sections that are not relevant to the caller. At least
for HistogramDiff this makes some sense, the algorithm is recursively
applied after it finds a longest common subsequence. If one side of
the LCS is outside of the region of interest from blame, there is no
value in recursing on that portion.

If the blame region list covers a small enough portion, it may even
make sense to avoid the common header/trailer elimination
preprocessing step. Unfortunately that sounds "hard" as you could be
working with a file like a ChangeLog which grows on one of those
sides.


>  And then the delta-chain storage is packing
> stuff based on CRCs as well (not sure whether it keeps them around for
> unpacking).

There are CRCs validated by libz during inflation, but these aren't
rechecked once the inflated bytes are cached in that silly undersized
16M delta base cache.

> So there is a lot that could likely be improved while
> keeping the same basic algorithms, by cracking open the black boxes of
> the xdiff engine and the delta-chain coding.

The delta chain coding has no relationship to the source file.
Currently even plain text files are delta chain coded on a byte basis,
not a line basis. Just matching up the delta coding against a source
text file to determine lines 0-N are common is costly, since you have
a byte range in the delta coding and you want a line range out the
end.

To make things more challenging, the delta chain coding can be against
completely different blobs. In a compare of A->B from the commit graph
being walked by blame there is no requirement the delta coding uses
this pairing, and it almost certainly never uses this direction (its
usually B->A, if its even this pair!).


Given your comments in the other patch, I understand why you probably
won't be working on blame more. But the above may help someone else
that has available time to continue.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] Mention "git blame" improvements in release notes
       [not found]       ` <xmqqzjj5s8hs.fsf@gitster.dls.corp.google.com>
@ 2014-04-28 17:39         ` David Kastrup
  2014-04-28 19:35           ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-04-28 17:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> David Kastrup <dak@gnu.org> writes:
>
>>> Thanks, but no thanks---I do not see it tasteful.
>>
>> Well, begging rarely is....
>> ...
>> If that's not worth anything to the Git community,...
>
> Concurred on the first point.  If you thought in any way I meant
> to say that blame improvements do not matter, then I am sorry, I did
> not mean that.
>
> But still, I am not convinced that the release notes is a good place
> to do this, and would be happier if you can think of a better venue.

"This change has been contributed by an independent developer on a
contingency base.  To make this approach work, please contact him if you
consider it worth recompensating."

This sort of text can be placed in the commit message (where it will be
mostly visible to actual other developers) and in the "What's cooking"
reports while, well, it's cooking.  It won't reach the mass of "ordinary
users", but while they are quite a large audience, they are also least
likely to care enough about isolated performance changes.  And while it
will be a limited run, "What's cooking" is also read by non-developers.

Those are the two venues I can currently think of that would seem
"scalable", at least with a text of that size.  Namely where it would
not become quite a total nuisance (though naturally less effective) if
everybody tried doing the same.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] Mention "git blame" improvements in release notes
  2014-04-28 17:39         ` David Kastrup
@ 2014-04-28 19:35           ` Junio C Hamano
  2014-04-28 19:57             ` David Kastrup
  2014-04-28 20:05             ` Ronnie Sahlberg
  0 siblings, 2 replies; 20+ messages in thread
From: Junio C Hamano @ 2014-04-28 19:35 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

David Kastrup <dak@gnu.org> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> But still, I am not convinced that the release notes is a good place
>> to do this, and would be happier if you can think of a better venue.
>
> "This change has been contributed by an independent developer on a
> contingency base.  To make this approach work, please contact him if you
> consider it worth recompensating."

I write things under three personalities.  As just one of the people
active in the Git development community, as the maintainer of the
project, and saying things on behalf of "the Git project".

The distinction between the latter two may be subtle, but it matters
to me.  And in my mind, I write the Release Notes on behalf of the
project.

 * The performance of "git blame" has been greatly improved.  Thanks
   David Kastrup for his huge effort.

is perhaps as far as I can go in that capacity, without singling out
one contributor among 80+ contributors with changes between 1.9 and
2.0 (among which a dozen or so have more than 10 patches---some are
trivial and patch count alone does not do justice, though) with
similar "pay them to show your appreciation" pleas.

I however feel that I can certainly do that as an active (and highly
visible) contributor, and even as the maintainer.

I guess we probably can add "See $URL if you are interested in his
further plans" after that two-line item and let you write whatever
you want at that page pointed at by the URL, though.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] Mention "git blame" improvements in release notes
  2014-04-28 19:35           ` Junio C Hamano
@ 2014-04-28 19:57             ` David Kastrup
  2014-04-28 20:05             ` Ronnie Sahlberg
  1 sibling, 0 replies; 20+ messages in thread
From: David Kastrup @ 2014-04-28 19:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> I guess we probably can add "See $URL if you are interested in his
> further plans" after that two-line item and let you write whatever
> you want at that page pointed at by the URL, though.

I most definitely am _not_ planning to invest any more time into Git
since even designing such plans would be throwing good time after bad
time.  And I don't have a web presence anyway.  As it does not appear
that there is any realistic manner in which Git users could even be made
aware of a connection between monetary requirements and work for a
freelancer like myself, I'll be just writing this off as a one-time
mistake on my part given my personal situation.  It's not the first, and
it will certainly not be the last, but at least I can avoid doing the
same mistake twice on the same project.

There are some low-hanging fruit for further speeding up git-blame now
that its internal thrashing has been addressed.  I will point out those
low-hanging fruit so that anybody can follow up on it and do all the
arguing and benchmarking required to go anywhere and get the credit for
it.

But that's as far as my willingness to "do the right thing" will carry.
If nobody picks up either the tab or the rather simple followup tasks,
then that's what the community and customer base of Git is capable of
sustaining and I'm not in a position to change it.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] Mention "git blame" improvements in release notes
  2014-04-28 19:35           ` Junio C Hamano
  2014-04-28 19:57             ` David Kastrup
@ 2014-04-28 20:05             ` Ronnie Sahlberg
  2014-04-28 20:26               ` David Kastrup
  1 sibling, 1 reply; 20+ messages in thread
From: Ronnie Sahlberg @ 2014-04-28 20:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: David Kastrup, git@vger.kernel.org

On Mon, Apr 28, 2014 at 12:35 PM, Junio C Hamano <gitster@pobox.com> wrote:
> David Kastrup <dak@gnu.org> writes:
>
>> Junio C Hamano <gitster@pobox.com> writes:
>>
>>> But still, I am not convinced that the release notes is a good place
>>> to do this, and would be happier if you can think of a better venue.
>>
>> "This change has been contributed by an independent developer on a
>> contingency base.  To make this approach work, please contact him if you
>> consider it worth recompensating."
>
> I write things under three personalities.  As just one of the people
> active in the Git development community, as the maintainer of the
> project, and saying things on behalf of "the Git project".
>
> The distinction between the latter two may be subtle, but it matters
> to me.  And in my mind, I write the Release Notes on behalf of the
> project.
>
>  * The performance of "git blame" has been greatly improved.  Thanks
>    David Kastrup for his huge effort.
>
> is perhaps as far as I can go in that capacity, without singling out
> one contributor among 80+ contributors with changes between 1.9 and
> 2.0 (among which a dozen or so have more than 10 patches---some are
> trivial and patch count alone does not do justice, though) with
> similar "pay them to show your appreciation" pleas.
>
> I however feel that I can certainly do that as an active (and highly
> visible) contributor, and even as the maintainer.
>
> I guess we probably can add "See $URL if you are interested in his
> further plans" after that two-line item and let you write whatever
> you want at that page pointed at by the URL, though.
>

Some projects, for example samba, provide a dedicated page on the
project web site
where vendors, and I think individuals, that provide services can list
their information :

http://www.samba.org/samba/support/

Would this perhaps be a better solution?


> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] Mention "git blame" improvements in release notes
  2014-04-28 20:05             ` Ronnie Sahlberg
@ 2014-04-28 20:26               ` David Kastrup
  0 siblings, 0 replies; 20+ messages in thread
From: David Kastrup @ 2014-04-28 20:26 UTC (permalink / raw)
  To: Ronnie Sahlberg; +Cc: Junio C Hamano, git@vger.kernel.org

Ronnie Sahlberg <sahlberg@google.com> writes:

> Some projects, for example samba, provide a dedicated page on the
> project web site
> where vendors, and I think individuals, that provide services can list
> their information :
>
> http://www.samba.org/samba/support/
>
> Would this perhaps be a better solution?

Actually, it does not work for my situation at all but then my situation
is likely not typical enough to be worth catering for specifically.

The salient point with me is that my productivity drops by more than a
factor of 100 (no, that's not an exaggeration) when having to do
something I'm not interested in.

Which is the reason my deal with GNU LilyPond where I'm (interrupted by
the git-blame episode) basically lead programmer is that LilyPond users
give me whatever money they consider my work to be worth to them, and in
return I work on whatever I like on LilyPond.  Nobody gets to say _what_
I do, and that's to the best of everyone's interest since only that way
a reasonable amount of work actually gets done.

So I cannot actually provide "services for hire" but it's more like a
"themed money sink" I can offer.  And in the case of Git, since it does
not even appear that there is a continuing base to make it work
reasonably in the context of other contributors and their interests,
it's more like "oops, I ended up wasting months on your project, but at
least there was something to show for it from my side, how about
yours?".  That's not something one can reasonably put on a "support"
themed page.  It's actually bloody ridiculous but then that's the sort
of handicap I have to organize my life around.  Even while it's
unpredictable what I end up doing, once I do get something done it tends
to be pretty good (there are a few old performance patches of mine in
the Git code base where I did not start out from what happens to be
pretty awful code and still got considerable return).

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-04-28 20:26 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-04-25 23:56 [PATCH 1/2] blame: large-scale performance rewrite David Kastrup
2014-04-25 23:56 ` [PATCH 2/2] Mention "git blame" improvements in release notes David Kastrup
2014-04-26 17:28   ` Junio C Hamano
2014-04-26 18:28     ` David Kastrup
     [not found]       ` <xmqqzjj5s8hs.fsf@gitster.dls.corp.google.com>
2014-04-28 17:39         ` David Kastrup
2014-04-28 19:35           ` Junio C Hamano
2014-04-28 19:57             ` David Kastrup
2014-04-28 20:05             ` Ronnie Sahlberg
2014-04-28 20:26               ` David Kastrup
2014-04-26  0:53 ` [PATCH 1/2] blame: large-scale performance rewrite Shawn Pearce
2014-04-26  7:48   ` David Kastrup
2014-04-26 16:01     ` Shawn Pearce
2014-04-26 16:50       ` David Kastrup
2014-04-26 17:09         ` Shawn Pearce
2014-04-26 17:22           ` David Kastrup
2014-04-26 17:02       ` David Kastrup
2014-04-26 17:30         ` David Kastrup
2014-04-26 17:56           ` Shawn Pearce
2014-04-26 21:39             ` David Kastrup
2014-04-27 17:53               ` Shawn Pearce

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).