How to substructure rewrites?

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* How to substructure rewrites?
@ 2014-01-25 12:44 David Kastrup
  2014-01-25 18:23 ` [PATCH 0/3] "Teaser" patch for rewriting blame for efficiency David Kastrup
  2014-01-27 15:58 ` How to substructure rewrites? Junio C Hamano
  0 siblings, 2 replies; 11+ messages in thread
From: David Kastrup @ 2014-01-25 12:44 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 94 bytes --]

Hi,

I'm currently rewriting the internals of git-blame.  For the somewhat
academic example

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: blametest.sh --]
[-- Type: text/x-sh, Size: 235 bytes --]

#/bin/sh

cd /tmp
rm -rf testit
mkdir testit
cd testit
git init
for inc in 8 4 3 7 2 9 5 6 1
do
    seq 252000 -$inc 0 > testfile
    git add testfile
    git commit -m "Run with increment $inc"
done
time git blame testfile >/dev/null

[-- Attachment #3: Type: text/plain, Size: 3802 bytes --]

the performance of git-blame increases by 4000%.  It turns out that in
"real-world" bad cases, particularly after aggressively repacking
repositories, the gains are quite more modest.  I've seen a factor of 2
with source code and a factor of 3 with a large word list.  The bulk of
the remaining time is spent in system time (which has basically not
changed by my work), so it is likely that the repository gets read over
and over, calling for some better caching strategies.

So the question is how to restructure the work.  I want to put out a
"teaser" soonish but I think that there is little point in
substructuring the commits I have so far.  They start with a few
janitorial patches already in pu, but after that there is

commit 7c610731e9452f3e8c175ab7e7542eafed1c5b93
Author: David Kastrup <dak@gnu.org>
Date:   Sat Jan 25 13:19:17 2014 +0100

    wip

commit dca4c8c447b94e1a3f8f45874d5415665d503189
Author: David Kastrup <dak@gnu.org>
Date:   Wed Jan 22 01:16:06 2014 +0100

    wip

commit f64b41c472442ae9971321fe8f62c3885ba4d8b7
Author: David Kastrup <dak@gnu.org>
Date:   Sun Jan 19 02:16:21 2014 +0100

    blame.c: Let output determine MORE_THAN_ONE_PATH more efficiently

commit 607b094537d6fe4a75cf9a297470966bda934c73
Author: David Kastrup <dak@gnu.org>
Date:   Mon Jan 13 18:06:06 2014 +0100

    wip

commit 1cd4e172533fb55793d02147c9596c59f7b05343
Author: David Kastrup <dak@gnu.org>
Date:   Sun Jan 19 01:30:02 2014 +0100

    Reorganize blame data structures

commit 5ab2add67d5fcf3837623a7bc028bcaec1006cc4
Author: David Kastrup <dak@gnu.org>
Date:   Sun Jan 19 01:29:28 2014 +0100

    Add blame_sort function for sorting a blame list

commit d3745752968617979878962ead957a6b5f829f0b
Author: David Kastrup <dak@gnu.org>
Date:   Sun Jan 19 01:28:22 2014 +0100

    Add merge_blame function for merging two sorted blame lists.

As it can easily be guessed, the "add xxx function" commits are
basically adding not-yet-used code (and so will not disrupt
compilation), but everything starting with "Reorganize blame data
structures" up until the final commit will not work or compile since the
code does not match the data structures.

So there is little point in substructing all that, right?  Even
something seemingly isolated like

commit f64b41c472442ae9971321fe8f62c3885ba4d8b7
Author: David Kastrup <dak@gnu.org>
Date:   Sun Jan 19 02:16:21 2014 +0100

    blame.c: Let output determine MORE_THAN_ONE_PATH more efficiently

is not really useful as a separate commit since while it does implement
a particular task, this is done starting with non-working code relying
on no-longer existent data structures.

I'm seriously considering squashing all that in a single commit.  It
seems like it would be utterly pointless to introduce artificially
working in-between states that simulate a partial "migration" to new
data structures that never happened.  Where the algorithmic setup
happened, I rewrite from scratch, and it's pretty absurd to match and
marry efficient and inefficient half-algorithms.

Which brings us to point two: the "teaser" part.  The current stuff has
#if 0/#endif around the old implementation of the expensive -M and -C
options.  They are currently not operative due to the mismatch to data
structures.  Obviously, that makes the change unsuitable for putting
into master or anywhere close.

I'd still want to solicit feedback: when not using those options (which
will likely cause me another week of headaches), the code is working
well and delivering identical information.  So should I just squash the
stuff and leave just the commits that made it into pu separate (so that
people may git-am whether or not they already have those commits)?

In general, the rule is likely "any commit should not create a
non-working state" right?

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 0/3] "Teaser" patch for rewriting blame for efficiency
  2014-01-25 12:44 How to substructure rewrites? David Kastrup
@ 2014-01-25 18:23 ` David Kastrup
  2014-01-25 18:23   ` [PATCH 1/3] builtin/blame.c: struct blame_entry does not need a prev link David Kastrup
                     ` (2 more replies)
  2014-01-27 15:58 ` How to substructure rewrites? Junio C Hamano
  1 sibling, 3 replies; 11+ messages in thread
From: David Kastrup @ 2014-01-25 18:23 UTC (permalink / raw)
  To: git; +Cc: David Kastrup

Ok, here is the "teaser" for the git-blame rewrite.  The first two
patches are already in pu and only contained for simplicity.  The
third patch gives a pretty good idea about the work that's up.  It is
missing support for the -M and -C options (I think, if that's what the
"move" and "copy" detection are supposed to be about).  Part of
preexisting code is #if 0/#endif removed since I don't know yet
whether the implementation of copy/move will need some of that.

Apart from these rather obvious problems, the question is whether it
is ok to do the finished rewrite as a single commit since I cannot
really envision useful intermediate stages.

While there is not much of a point on commenting on the code (and its
tentative changes) within #if 0/#endif, the code that is actually
compiled is of course open to preliminary criticism.

For obvious reasons, patch #3 is not signed off: it should never be
committed in its current form.

David Kastrup (3):
  builtin/blame.c: struct blame_entry does not need a prev link
  Eliminate same_suspect function in builtin/blame.c
  builtin/blame.c: large-scale rewrite

 builtin/blame.c | 595 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 371 insertions(+), 224 deletions(-)

-- 
1.8.3.2

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] builtin/blame.c: struct blame_entry does not need a prev link
  2014-01-25 18:23 ` [PATCH 0/3] "Teaser" patch for rewriting blame for efficiency David Kastrup
@ 2014-01-25 18:23   ` David Kastrup
  2014-01-25 18:23   ` [PATCH 2/3] Eliminate same_suspect function in builtin/blame.c David Kastrup
  2014-01-25 18:23   ` [PATCH 3/3] builtin/blame.c: large-scale rewrite David Kastrup
  2 siblings, 0 replies; 11+ messages in thread
From: David Kastrup @ 2014-01-25 18:23 UTC (permalink / raw)
  To: git; +Cc: David Kastrup

Signed-off-by: David Kastrup <dak@gnu.org>
---
 builtin/blame.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index e44a6bb..2195595 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -197,7 +197,6 @@ static void drop_origin_blob(struct origin *o)
  * scoreboard structure, sorted by the target line number.
  */
 struct blame_entry {
-	struct blame_entry *prev;
 	struct blame_entry *next;
 
 	/* the first line of this group in the final image;
@@ -282,8 +281,6 @@ static void coalesce(struct scoreboard *sb)
 		    ent->s_lno + ent->num_lines == next->s_lno) {
 			ent->num_lines += next->num_lines;
 			ent->next = next->next;
-			if (ent->next)
-				ent->next->prev = ent;
 			origin_decref(next->suspect);
 			free(next);
 			ent->score = 0;
@@ -534,7 +531,7 @@ static void add_blame_entry(struct scoreboard *sb, struct blame_entry *e)
 		prev = ent;
 
 	/* prev, if not NULL, is the last one that is below e */
-	e->prev = prev;
+
 	if (prev) {
 		e->next = prev->next;
 		prev->next = e;
@@ -543,8 +540,6 @@ static void add_blame_entry(struct scoreboard *sb, struct blame_entry *e)
 		e->next = sb->ent;
 		sb->ent = e;
 	}
-	if (e->next)
-		e->next->prev = e;
 }
 
 /*
@@ -555,14 +550,12 @@ static void add_blame_entry(struct scoreboard *sb, struct blame_entry *e)
  */
 static void dup_entry(struct blame_entry *dst, struct blame_entry *src)
 {
-	struct blame_entry *p, *n;
+	struct blame_entry *n;
 
-	p = dst->prev;
 	n = dst->next;
 	origin_incref(src->suspect);
 	origin_decref(dst->suspect);
 	memcpy(dst, src, sizeof(*src));
-	dst->prev = p;
 	dst->next = n;
 	dst->score = 0;
 }
@@ -2502,8 +2495,6 @@ parse_done:
 		ent->suspect = o;
 		ent->s_lno = bottom;
 		ent->next = next;
-		if (next)
-			next->prev = ent;
 		origin_incref(o);
 	}
 	origin_decref(o);
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] Eliminate same_suspect function in builtin/blame.c
  2014-01-25 18:23 ` [PATCH 0/3] "Teaser" patch for rewriting blame for efficiency David Kastrup
  2014-01-25 18:23   ` [PATCH 1/3] builtin/blame.c: struct blame_entry does not need a prev link David Kastrup
@ 2014-01-25 18:23   ` David Kastrup
  2014-01-25 18:23   ` [PATCH 3/3] builtin/blame.c: large-scale rewrite David Kastrup
  2 siblings, 0 replies; 11+ messages in thread
From: David Kastrup @ 2014-01-25 18:23 UTC (permalink / raw)
  To: git; +Cc: David Kastrup

Since the origin pointers are "interned" and reference-counted, comparing
the pointers rather than the content is enough.  The only uninterned
origins are cached values kept in commit->util, but same_suspect is not
called on them.

Signed-off-by: David Kastrup <dak@gnu.org>
---
 builtin/blame.c | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 2195595..ead6148 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -255,15 +255,6 @@ struct scoreboard {
 	int *lineno;
 };
 
-static inline int same_suspect(struct origin *a, struct origin *b)
-{
-	if (a == b)
-		return 1;
-	if (a->commit != b->commit)
-		return 0;
-	return !strcmp(a->path, b->path);
-}
-
 static void sanity_check_refcnt(struct scoreboard *);
 
 /*
@@ -276,7 +267,7 @@ static void coalesce(struct scoreboard *sb)
 	struct blame_entry *ent, *next;
 
 	for (ent = sb->ent; ent && (next = ent->next); ent = next) {
-		if (same_suspect(ent->suspect, next->suspect) &&
+		if (ent->suspect == next->suspect &&
 		    ent->guilty == next->guilty &&
 		    ent->s_lno + ent->num_lines == next->s_lno) {
 			ent->num_lines += next->num_lines;
@@ -735,7 +726,7 @@ static int find_last_in_target(struct scoreboard *sb, struct origin *target)
 	int last_in_target = -1;
 
 	for (e = sb->ent; e; e = e->next) {
-		if (e->guilty || !same_suspect(e->suspect, target))
+		if (e->guilty || e->suspect != target)
 			continue;
 		if (last_in_target < e->s_lno + e->num_lines)
 			last_in_target = e->s_lno + e->num_lines;
@@ -755,7 +746,7 @@ static void blame_chunk(struct scoreboard *sb,
 	struct blame_entry *e;
 
 	for (e = sb->ent; e; e = e->next) {
-		if (e->guilty || !same_suspect(e->suspect, target))
+		if (e->guilty || e->suspect != target)
 			continue;
 		if (same <= e->s_lno)
 			continue;
@@ -985,7 +976,7 @@ static int find_move_in_parent(struct scoreboard *sb,
 	while (made_progress) {
 		made_progress = 0;
 		for (e = sb->ent; e; e = e->next) {
-			if (e->guilty || !same_suspect(e->suspect, target) ||
+			if (e->guilty || e->suspect != target ||
 			    ent_score(sb, e) < blame_move_score)
 				continue;
 			find_copy_in_blob(sb, e, parent, split, &file_p);
@@ -1020,14 +1011,14 @@ static struct blame_list *setup_blame_list(struct scoreboard *sb,
 
 	for (e = sb->ent, num_ents = 0; e; e = e->next)
 		if (!e->scanned && !e->guilty &&
-		    same_suspect(e->suspect, target) &&
+		    e->suspect == target &&
 		    min_score < ent_score(sb, e))
 			num_ents++;
 	if (num_ents) {
 		blame_list = xcalloc(num_ents, sizeof(struct blame_list));
 		for (e = sb->ent, i = 0; e; e = e->next)
 			if (!e->scanned && !e->guilty &&
-			    same_suspect(e->suspect, target) &&
+			    e->suspect == target &&
 			    min_score < ent_score(sb, e))
 				blame_list[i++].ent = e;
 	}
@@ -1171,7 +1162,7 @@ static void pass_whole_blame(struct scoreboard *sb,
 		origin->file.ptr = NULL;
 	}
 	for (e = sb->ent; e; e = e->next) {
-		if (!same_suspect(e->suspect, origin))
+		if (e->suspect != origin)
 			continue;
 		origin_incref(porigin);
 		origin_decref(e->suspect);
@@ -1560,7 +1551,7 @@ static void assign_blame(struct scoreboard *sb, int opt)
 
 		/* Take responsibility for the remaining entries */
 		for (ent = sb->ent; ent; ent = ent->next)
-			if (same_suspect(ent->suspect, suspect))
+			if (ent->suspect == suspect)
 				found_guilty_entry(ent);
 		origin_decref(suspect);
 
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/3] builtin/blame.c: large-scale rewrite
  2014-01-25 18:23 ` [PATCH 0/3] "Teaser" patch for rewriting blame for efficiency David Kastrup
  2014-01-25 18:23   ` [PATCH 1/3] builtin/blame.c: struct blame_entry does not need a prev link David Kastrup
  2014-01-25 18:23   ` [PATCH 2/3] Eliminate same_suspect function in builtin/blame.c David Kastrup
@ 2014-01-25 18:23   ` David Kastrup
  2014-01-27 16:54     ` Junio C Hamano
  2 siblings, 1 reply; 11+ messages in thread
From: David Kastrup @ 2014-01-25 18:23 UTC (permalink / raw)
  To: git; +Cc: David Kastrup

The previous implementation uses a sorted linear list of struct
blame_entry in a struct scoreboard for organizing all partial or
completed work.  Every task that is done requires going through the
whole list where most entries are not relevant to the task at hand.

This commit reorganizes the data structures in order to have each
remaining subtask work with its own sorted linear list it can work off
front to back.  Subtasks are organized into "struct origin" chains
hanging off particular commits.  Commits are organized into a priority
queue, processing them in commit date order in order to keep most of
the work affecting a particular blob collated even in the presence of
an extensive merge history.  In that manner, linear searches can be
basically avoided anywhere.  They still are done for identifying one
of multiple analyzed files in a given commit, but the degenerate case
of a single large file being assembled from a multitude of smaller
files in the past is not likely to occur often enough to warrant
special treatment.
---
 builtin/blame.c | 569 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 367 insertions(+), 202 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index ead6148..994a9e8 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -18,7 +18,9 @@
 #include "cache-tree.h"
 #include "string-list.h"
 #include "mailmap.h"
+#include "mergesort.h"
 #include "parse-options.h"
+#include "prio-queue.h"
 #include "utf8.h"
 #include "userdiff.h"
 #include "line-range.h"
@@ -83,11 +85,43 @@ static unsigned blame_copy_score;
  */
 struct origin {
 	int refcnt;
+	/* Record preceding blame record for this blob */
 	struct origin *previous;
+	/* origins are put in a list linked via `next' hanging off the
+	 * corresponding commit's util field in order to make finding
+	 * them fast.  The presence in this chain does not count
+	 * towards the origin's reference count.  It is tempting to
+	 * let it count as long as the commit is pending examination,
+	 * but even under circumstances where the commit will be
+	 * present multiple times in the priority queue of unexamined
+	 * commits, processing the first instance will not leave any
+	 * work requiring the origin data for the second instance.  An
+	 * interspersed commit changing that would have to be
+	 * preexisting with a different ancestry and with the same
+	 * commit date in order to wedge itself between two instances
+	 * of the same commit in the priority queue _and_ produce
+	 * blame entries relevant for it.  While we don't want to let
+	 * us get tripped up by this case, it certainly does not seem
+	 * worth optimizing for.
+	 */
+	struct origin *next;
 	struct commit *commit;
+	/* `suspects' contains blame entries that may be attributed to
+	 * this origin's commit or to parent commits.  When a commit
+	 * is being processed, all suspects will be moved, either by
+	 * assigning them to an origin in a different commit, or by
+	 * shipping them to the scoreboard's ent list because they
+	 * cannot be attributed to a different commit.
+	 */
+	struct blame_entry *suspects;
 	mmfile_t file;
 	unsigned char blob_sha1[20];
 	unsigned mode;
+	/* shipped gets set when shipping any suspects to the final
+	 * blame list instead of other commits
+	 */
+	char shipped;
+	char guilty;
 	char path[FLEX_ARRAY];
 };
 
@@ -176,10 +210,23 @@ static inline struct origin *origin_incref(struct origin *o)
 static void origin_decref(struct origin *o)
 {
 	if (o && --o->refcnt <= 0) {
+		struct origin *p, *l = NULL;
 		if (o->previous)
 			origin_decref(o->previous);
 		free(o->file.ptr);
-		free(o);
+		/* Should be present exactly once in commit chain */
+		for (p = o->commit->util; p; l = p, p = p->next) {
+			if (p == o)
+			{
+				if (l)
+					l->next = p->next;
+				else
+					o->commit->util = p->next;
+				free(o);
+				return;
+			}
+		}
+		die("internal error in blame::origin_decref");
 	}
 }
 
@@ -193,8 +240,12 @@ static void drop_origin_blob(struct origin *o)
 
 /*
  * Each group of lines is described by a blame_entry; it can be split
- * as we pass blame to the parents.  They form a linked list in the
- * scoreboard structure, sorted by the target line number.
+ * as we pass blame to the parents.  They are arranged in linked lists
+ * kept as `suspects' of some unprocessed origin, or entered (when the
+ * blame origin has been finalized) into the scoreboard structure.
+ * While the scoreboard structure is only sorted at the end of
+ * processing (according to final image line number), the lists
+ * attached to an origin are sorted by the target line number.
  */
 struct blame_entry {
 	struct blame_entry *next;
@@ -210,11 +261,6 @@ struct blame_entry {
 	/* the commit that introduced this group into the final image */
 	struct origin *suspect;
 
-	/* true if the suspect is truly guilty; false while we have not
-	 * checked if the group came from one of its parents.
-	 */
-	char guilty;
-
 	/* true if the entry has been scanned for copies in the current parent
 	 */
 	char scanned;
@@ -230,12 +276,94 @@ struct blame_entry {
 	unsigned score;
 };
 
+/* This is subtle.  Any _merge_ of blames happens on lists of blames
+ * that arrived via different parents in a single suspect.  In this
+ * case, we want to sort according to the _suspect_ line numbers.  In
+ * contrast to that, blame_sort is used for the final sorting of blame
+ * data and consequently works with final image line numbers.
+ */
+
+static struct blame_entry *
+blame_merge(struct blame_entry *list1,
+	    struct blame_entry *list2)
+{
+	struct blame_entry *p1 = list1, *p2 = list2,
+		**tail = &list1;
+	
+	if (!p1)
+		return p2;
+	if (!p2)
+		return p1;
+
+	if (p1->s_lno <= p2->s_lno)
+	{
+		do {
+			tail = &p1->next;
+			if ((p1 = *tail) == NULL)
+			{
+				*tail = p2;
+				return list1;
+			}
+		} while (p1->s_lno <= p2->s_lno);
+	}
+	for (;;)
+	{
+		*tail = p2;
+		do {
+			tail = &p2->next;
+			if ((p2 = *tail) == NULL)  {
+				*tail = p1;
+				return list1;
+			}
+		} while (p1->s_lno > p2->s_lno);
+		*tail = p1;
+		do {
+			tail = &p1->next;
+			if ((p1 = *tail) == NULL) {
+				*tail = p2;
+				return list1;
+			}
+		} while (p1->s_lno <= p2->s_lno);
+	}
+}
+
+static void *
+get_next_blame (const void *p)
+{
+	return ((struct blame_entry *)p)->next;
+}
+
+static void
+set_next_blame (void *p1, void *p2)
+{
+	((struct blame_entry *)p1)->next = p2;
+}
+
+static int
+compare_blame (const void *p1, const void *p2)
+{
+	return ((struct blame_entry *)p1)->lno - ((struct blame_entry *)p2)->lno;
+}
+
+static struct blame_entry *
+blame_sort (struct blame_entry *head)
+{
+	return llist_mergesort (head, get_next_blame, set_next_blame, compare_blame);
+}
+
+int compare_commits_by_reverse_commit_date(const void *a, const void *b, void *c)
+{
+	return -compare_commits_by_commit_date(a, b, c);
+}
+
 /*
  * The current state of the blame assignment.
  */
 struct scoreboard {
 	/* the final commit (i.e. where we started digging from) */
 	struct commit *final;
+	/* Priority queue for commits with unassigned blame records */
+	struct prio_queue commits;
 	struct rev_info *revs;
 	const char *path;
 
@@ -268,7 +396,6 @@ static void coalesce(struct scoreboard *sb)
 
 	for (ent = sb->ent; ent && (next = ent->next); ent = next) {
 		if (ent->suspect == next->suspect &&
-		    ent->guilty == next->guilty &&
 		    ent->s_lno + ent->num_lines == next->s_lno) {
 			ent->num_lines += next->num_lines;
 			ent->next = next->next;
@@ -295,23 +422,33 @@ static struct origin *make_origin(struct commit *commit, const char *path)
 	o = xcalloc(1, sizeof(*o) + strlen(path) + 1);
 	o->commit = commit;
 	o->refcnt = 1;
+	o->next = commit->util;
+	commit->util = o;
 	strcpy(o->path, path);
 	return o;
 }
 
 /*
  * Locate an existing origin or create a new one.
+ * This moves the origin to front position in the commit util list.
  */
 static struct origin *get_origin(struct scoreboard *sb,
 				 struct commit *commit,
 				 const char *path)
 {
-	struct blame_entry *e;
+	struct origin *o, *l;
 
-	for (e = sb->ent; e; e = e->next) {
-		if (e->suspect->commit == commit &&
-		    !strcmp(e->suspect->path, path))
-			return origin_incref(e->suspect);
+	for (o = commit->util, l = NULL; o; l = o, o = o->next) {
+		if (!strcmp(o->path, path))
+		{
+			/* bump to front */
+			if (l) {
+				l->next = o->next;
+				o->next = commit->util;
+				commit->util = o;
+			}
+			return origin_incref(o);
+		}
 	}
 	return make_origin(commit, path);
 }
@@ -350,41 +487,19 @@ static struct origin *find_origin(struct scoreboard *sb,
 				  struct commit *parent,
 				  struct origin *origin)
 {
-	struct origin *porigin = NULL;
+	struct origin *porigin;
 	struct diff_options diff_opts;
 	const char *paths[2];
 
-	if (parent->util) {
-		/*
-		 * Each commit object can cache one origin in that
-		 * commit.  This is a freestanding copy of origin and
-		 * not refcounted.
-		 */
-		struct origin *cached = parent->util;
-		if (!strcmp(cached->path, origin->path)) {
+	/* First check any existing origins */
+	for (porigin = parent->util; porigin; porigin = porigin->next)
+		if (!strcmp(porigin->path, origin->path)) {
 			/*
 			 * The same path between origin and its parent
 			 * without renaming -- the most common case.
 			 */
-			porigin = get_origin(sb, parent, cached->path);
-
-			/*
-			 * If the origin was newly created (i.e. get_origin
-			 * would call make_origin if none is found in the
-			 * scoreboard), it does not know the blob_sha1/mode,
-			 * so copy it.  Otherwise porigin was in the
-			 * scoreboard and already knows blob_sha1/mode.
-			 */
-			if (porigin->refcnt == 1) {
-				hashcpy(porigin->blob_sha1, cached->blob_sha1);
-				porigin->mode = cached->mode;
-			}
-			return porigin;
+			return origin_incref (porigin);
 		}
-		/* otherwise it was not very useful; free it */
-		free(parent->util);
-		parent->util = NULL;
-	}
 
 	/* See if the origin->path is different between parent
 	 * and origin first.  Most of the time they are the
@@ -450,19 +565,6 @@ static struct origin *find_origin(struct scoreboard *sb,
 	}
 	diff_flush(&diff_opts);
 	free_pathspec(&diff_opts.pathspec);
-	if (porigin) {
-		/*
-		 * Create a freestanding copy that is not part of
-		 * the refcounted origin found in the scoreboard, and
-		 * cache it in the commit.
-		 */
-		struct origin *cached;
-
-		cached = make_origin(porigin->commit, porigin->path);
-		hashcpy(cached->blob_sha1, porigin->blob_sha1);
-		cached->mode = porigin->mode;
-		parent->util = cached;
-	}
 	return porigin;
 }
 
@@ -508,49 +610,53 @@ static struct origin *find_rename(struct scoreboard *sb,
 	return porigin;
 }
 
+#if 0
 /*
- * Link in a new blame entry to the scoreboard.  Entries that cover the
- * same line range have been removed from the scoreboard previously.
+ * Add a new blame entry to a given output queue.
  */
-static void add_blame_entry(struct scoreboard *sb, struct blame_entry *e)
+static void add_blame_entry(struct blame_entry ***queue, struct blame_entry *e)
 {
-	struct blame_entry *ent, *prev = NULL;
-
 	origin_incref(e->suspect);
 
-	for (ent = sb->ent; ent && ent->lno < e->lno; ent = ent->next)
-		prev = ent;
-
-	/* prev, if not NULL, is the last one that is below e */
-
-	if (prev) {
-		e->next = prev->next;
-		prev->next = e;
-	}
-	else {
-		e->next = sb->ent;
-		sb->ent = e;
-	}
+	e->next = **queue;
+	**queue = e;
+	*queue = &e->next;
 }
 
 /*
  * src typically is on-stack; we want to copy the information in it to
- * a malloced blame_entry that is already on the linked list of the
- * scoreboard.  The origin of dst loses a refcnt while the origin of src
- * gains one.
+ * a blame_entry that is allocated in one queue and will be relocated
+ * to another.
  */
-static void dup_entry(struct blame_entry *dst, struct blame_entry *src)
+static void dup_entry(struct blame_entry ***dstq, struct blame_entry ***srcq,
+		      struct blame_entry *src)
 {
-	struct blame_entry *n;
-
-	n = dst->next;
+	struct blame_entry *dst = **srcq;
 	origin_incref(src->suspect);
 	origin_decref(dst->suspect);
+	**srcq = dst->next;	/* unlinked from source queue */
 	memcpy(dst, src, sizeof(*src));
-	dst->next = n;
 	dst->score = 0;
+	dst->next = **dstq;
+	*dstq = &dst->next;	/* linked into destination queue */
 }
 
+/* This copies a changed entry over the existing entry in the queue */
+static void trim_entry(struct blame_entry ***queue, struct blame_entry *src)
+{
+	struct blame_entry *dst = **queue;
+	struct blame_entry *n = dst->next;
+	/* No need to mess with reference counts as we stay with the
+	 * same suspect
+	 */
+	memcpy(dst, src, sizeof(*src));
+	dst->score = 0;
+	dst->next = n;
+	*queue = &dst->next;	/* move after entry */
+}
+
+#endif
+
 static const char *nth_line(struct scoreboard *sb, long lno)
 {
 	return sb->final_buf + sb->lineno[lno];
@@ -561,6 +667,7 @@ static const char *nth_line_cb(void *data, long lno)
 	return nth_line((struct scoreboard *)data, lno);
 }
 
+#if 0
 /*
  * It is known that lines between tlno to same came from parent, and e
  * has an overlap with that range.  it also is known that parent's
@@ -573,7 +680,10 @@ static const char *nth_line_cb(void *data, long lno)
  *             <------------------>
  *
  * Split e into potentially three parts; before this chunk, the chunk
- * to be blamed for the parent, and after that portion.
+ * to be blamed for the parent, and after that portion.  As the chunks
+ * are processed in sequence, any part before the overlap can be
+ * terminally blamed on the target while parts after the overlap can
+ * only be properly evaluated after examining future diffs.
  */
 static void split_overlap(struct blame_entry *split,
 			  struct blame_entry *e,
@@ -620,72 +730,55 @@ static void split_overlap(struct blame_entry *split,
 
 /*
  * split_overlap() divided an existing blame e into up to three parts
- * in split.  Adjust the linked list of blames in the scoreboard to
- * reflect the split.
+ * in split.  Adjust the linked list of blames to reflect the split,
+ * the first part being common to target and parent and consequently
+ * being blamed on the parent, the second part being different and
+ * consequently blamed on the target, and the third part starting out
+ * as being the same but without conclusive information where to place
+ * the blame, so leaving it in the target subject to further
+ * processing.
  */
-static void split_blame(struct scoreboard *sb,
-			struct blame_entry *split,
-			struct blame_entry *e)
+static void split_blame(struct blame_entry *split,
+			struct blame_entry ***dstq,
+			struct blame_entry ***srcq)
 {
 	struct blame_entry *new_entry;
 
 	if (split[0].suspect && split[2].suspect) {
 		/* The first part (reuse storage for the existing entry e) */
-		dup_entry(e, &split[0]);
+		trim_entry(srcq, &split[0]);
 
 		/* The last part -- me */
 		new_entry = xmalloc(sizeof(*new_entry));
 		memcpy(new_entry, &(split[2]), sizeof(struct blame_entry));
-		add_blame_entry(sb, new_entry);
+		add_blame_entry(srcq, new_entry);
 
 		/* ... and the middle part -- parent */
 		new_entry = xmalloc(sizeof(*new_entry));
 		memcpy(new_entry, &(split[1]), sizeof(struct blame_entry));
-		add_blame_entry(sb, new_entry);
+		add_blame_entry(dstq, new_entry);
 	}
 	else if (!split[0].suspect && !split[2].suspect)
 		/*
 		 * The parent covers the entire area; reuse storage for
 		 * e and replace it with the parent.
 		 */
-		dup_entry(e, &split[1]);
+		dup_entry(dstq, srcq, &split[1]);
 	else if (split[0].suspect) {
 		/* me and then parent */
-		dup_entry(e, &split[0]);
+		trim_entry(srcq, &split[0]);
 
 		new_entry = xmalloc(sizeof(*new_entry));
 		memcpy(new_entry, &(split[1]), sizeof(struct blame_entry));
-		add_blame_entry(sb, new_entry);
+		add_blame_entry(dstq, new_entry);
 	}
 	else {
 		/* parent and then me */
-		dup_entry(e, &split[1]);
-
 		new_entry = xmalloc(sizeof(*new_entry));
-		memcpy(new_entry, &(split[2]), sizeof(struct blame_entry));
-		add_blame_entry(sb, new_entry);
-	}
-
-	if (DEBUG) { /* sanity */
-		struct blame_entry *ent;
-		int lno = sb->ent->lno, corrupt = 0;
+		memcpy(new_entry, &(split[1]), sizeof(struct blame_entry));
+		add_blame_entry(dstq, new_entry);
 
-		for (ent = sb->ent; ent; ent = ent->next) {
-			if (lno != ent->lno)
-				corrupt = 1;
-			if (ent->s_lno < 0)
-				corrupt = 1;
-			lno += ent->num_lines;
-		}
-		if (corrupt) {
-			lno = sb->ent->lno;
-			for (ent = sb->ent; ent; ent = ent->next) {
-				printf("L %8d l %8d n %8d\n",
-				       lno, ent->lno, ent->num_lines);
-				lno = ent->lno + ent->num_lines;
-			}
-			die("oops");
-		}
+		trim_entry(srcq, &split[2]);
 	}
 }
 
@@ -702,74 +795,116 @@ static void decref_split(struct blame_entry *split)
 }
 
 /*
- * Helper for blame_chunk().  blame_entry e is known to overlap with
+ * Helper for blame_chunk().  blame_entry in srcq is known to overlap with
  * the patch hunk; split it and pass blame to the parent.
  */
-static void blame_overlap(struct scoreboard *sb, struct blame_entry *e,
+static void blame_overlap(struct blame entry ***dstq, struct blame_entry ***srcq,
 			  int tlno, int plno, int same,
 			  struct origin *parent)
 {
 	struct blame_entry split[3];
 
-	split_overlap(split, e, tlno, plno, same, parent);
+	split_overlap(split, **srcq, tlno, plno, same, parent);
 	if (split[1].suspect)
-		split_blame(sb, split, e);
+	  split_blame(split, dstq, srcq);
 	decref_split(split);
 }
+#endif
 
 /*
- * Find the line number of the last line the target is suspected for.
+ * Process one hunk from the patch between the current suspect for
+ * blame_entry e and its parent.  This first blames any unfinished
+ * entries before the chunk (which is where target and parent start
+ * differing) on the parent, and then splits blame entries at the
+ * start and at the end of the difference region.
  */
-static int find_last_in_target(struct scoreboard *sb, struct origin *target)
+static void blame_chunk(struct blame_entry ***dstq, struct blame_entry ***srcq,
+			int tlno, int offset, int same,
+			struct origin *target, struct origin *parent)
 {
-	struct blame_entry *e;
-	int last_in_target = -1;
-
-	for (e = sb->ent; e; e = e->next) {
-		if (e->guilty || e->suspect != target)
-			continue;
-		if (last_in_target < e->s_lno + e->num_lines)
-			last_in_target = e->s_lno + e->num_lines;
+	struct blame_entry *e = **srcq;
+	
+/* Pass blame for everything before the differing chunk to the parent */
+	while (e && e->s_lno + e->num_lines <= tlno) {
+		**dstq = e;
+		origin_decref(e->suspect);
+		e->suspect = origin_incref(parent);
+		e->s_lno += offset;
+		e = *(*dstq = &e->next);
 	}
-	return last_in_target;
-}
-
+	**srcq = e;
+	if (!e)
+		return;
 /*
- * Process one hunk from the patch between the current suspect for
- * blame_entry e and its parent.  Find and split the overlap, and
- * pass blame to the overlapping part to the parent.
+ * current record _ends_ after the start of the difference region.  If
+ * it starts before it, we need to split it up and pass blame for its
+ * first part to the parent.
  */
-static void blame_chunk(struct scoreboard *sb,
-			int tlno, int plno, int same,
-			struct origin *target, struct origin *parent)
-{
-	struct blame_entry *e;
-
-	for (e = sb->ent; e; e = e->next) {
-		if (e->guilty || e->suspect != target)
-			continue;
-		if (same <= e->s_lno)
-			continue;
-		if (tlno < e->s_lno + e->num_lines)
-			blame_overlap(sb, e, tlno, plno, same, parent);
+	if (e->s_lno < tlno) {
+		/* we move the first half to a new record */
+		int len = tlno - e->s_lno;
+		struct blame_entry *n = xcalloc(1, sizeof (struct blame_entry));
+		n->suspect = origin_incref(parent);
+		n->lno = e->lno;
+		n->s_lno = e->s_lno + offset;
+		n->num_lines = len;
+		n->score = 0;
+		**dstq = n;
+		*dstq = &n->next;
+
+		e->lno += len;
+		e->s_lno = tlno;
+		e->num_lines -= len;
+		e->score = 0;
 	}
+/*
+ * Now retain records on the target while it is different from the parent.
+ */
+	while (e->s_lno + e->num_lines <= same) {
+		e = *(*srcq = &e->next);
+		if (!e)
+			return;
+	}
+/*
+ * If current record starts before sameness, need to split.
+ */
+	if (e->s_lno < same) {
+		int len = same - e->s_lno;
+		struct blame_entry *n = xcalloc(1, sizeof (struct blame_entry));
+		n->suspect = origin_incref(target);
+		n->lno = e->lno;
+		n->s_lno = e->s_lno;
+		n->num_lines = len;
+		n->score = 0;
+		**srcq = n;
+		*(*srcq = &n->next) = e;
+		/* keep second half in target queue */
+		e->lno += len;
+		e->s_lno = same;
+		e->num_lines -= len;
+		e->score = 0;
+	}
+	return;
 }
 
 struct blame_chunk_cb_data {
-	struct scoreboard *sb;
 	struct origin *target;
 	struct origin *parent;
-	long plno;
-	long tlno;
+	long offset;
+	struct blame_entry **dstq;
+	struct blame_entry **srcq;
 };
 
+/* diff chunks are from parent to target */
 static int blame_chunk_cb(long start_a, long count_a,
 			  long start_b, long count_b, void *data)
 {
 	struct blame_chunk_cb_data *d = data;
-	blame_chunk(d->sb, d->tlno, d->plno, start_b, d->target, d->parent);
-	d->plno = start_a + count_a;
-	d->tlno = start_b + count_b;
+	if (start_a - start_b != d->offset)
+		die("internal error in blame::blame_chunk_cb");
+	blame_chunk(&d->dstq, &d->srcq, start_b, start_a - start_b,
+		    start_b + count_b, d->target, d->parent);
+	d->offset = start_a + count_a - (start_b + count_b);
 	return 0;
 }
 
@@ -782,23 +917,28 @@ static int pass_blame_to_parent(struct scoreboard *sb,
 				struct origin *target,
 				struct origin *parent)
 {
-	int last_in_target;
 	mmfile_t file_p, file_o;
 	struct blame_chunk_cb_data d;
+	struct blame_entry *newdest = NULL;
 
-	memset(&d, 0, sizeof(d));
-	d.sb = sb; d.target = target; d.parent = parent;
-	last_in_target = find_last_in_target(sb, target);
-	if (last_in_target < 0)
+	if (!target->suspects)
 		return 1; /* nothing remains for this target */
 
+	d.target = target; d.parent = parent;
+	d.offset = 0;
+	d.dstq = &newdest; d.srcq = &target->suspects;
+
 	fill_origin_blob(&sb->revs->diffopt, parent, &file_p);
 	fill_origin_blob(&sb->revs->diffopt, target, &file_o);
 	num_get_patch++;
 
 	diff_hunks(&file_p, &file_o, 0, blame_chunk_cb, &d);
-	/* The rest (i.e. anything after tlno) are the same as the parent */
-	blame_chunk(sb, d.tlno, d.plno, last_in_target, target, parent);
+	/* The rest are the same as the parent */
+	blame_chunk(&d.dstq, &d.srcq, INT_MAX, d.offset, INT_MAX, target, parent);
+	*d.dstq = NULL;
+	parent->suspects = blame_merge (parent->suspects, newdest);
+	if (parent->suspects)
+		prio_queue_put(&sb->commits, parent->commit);
 
 	return 0;
 }
@@ -833,6 +973,7 @@ static unsigned ent_score(struct scoreboard *sb, struct blame_entry *e)
 	return score;
 }
 
+#if 0
 /*
  * best_so_far[] and this[] are both a split of an existing blame_entry
  * that passes blame to the parent.  Maintain best_so_far the best split
@@ -1147,6 +1288,8 @@ static int find_copy_in_parent(struct scoreboard *sb,
 	return retval;
 }
 
+#endif
+
 /*
  * The blobs of origin and porigin exactly match, so everything
  * origin is suspected for can be blamed on the parent.
@@ -1154,20 +1297,22 @@ static int find_copy_in_parent(struct scoreboard *sb,
 static void pass_whole_blame(struct scoreboard *sb,
 			     struct origin *origin, struct origin *porigin)
 {
-	struct blame_entry *e;
+	struct blame_entry *e, *suspects;
 
 	if (!porigin->file.ptr && origin->file.ptr) {
 		/* Steal its file */
 		porigin->file = origin->file;
 		origin->file.ptr = NULL;
 	}
-	for (e = sb->ent; e; e = e->next) {
-		if (e->suspect != origin)
-			continue;
+	suspects = origin->suspects;
+	origin->suspects = NULL;
+	for (e = suspects; e; e = e->next) {
 		origin_incref(porigin);
 		origin_decref(e->suspect);
 		e->suspect = porigin;
 	}
+	porigin->suspects = blame_merge(porigin->suspects, suspects);
+	prio_queue_put(&sb->commits, porigin->commit);
 }
 
 /*
@@ -1266,6 +1411,7 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
 			goto finish;
 	}
 
+#if 0
 	/*
 	 * Optionally find moves in parents' files.
 	 */
@@ -1292,6 +1438,7 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
 						porigin, opt))
 				goto finish;
 		}
+#endif
 
  finish:
 	for (i = 0; i < num_sg; i++) {
@@ -1488,14 +1635,11 @@ static int emit_one_suspect_detail(struct origin *suspect, int repeat)
 }
 
 /*
- * The blame_entry is found to be guilty for the range.  Mark it
- * as such, and show it in incremental output.
+ * The blame_entry is found to be guilty for the range.
+ * Show it in incremental output.
  */
 static void found_guilty_entry(struct blame_entry *ent)
 {
-	if (ent->guilty)
-		return;
-	ent->guilty = 1;
 	if (incremental) {
 		struct origin *suspect = ent->suspect;
 
@@ -1509,32 +1653,34 @@ static void found_guilty_entry(struct blame_entry *ent)
 }
 
 /*
- * The main loop -- while the scoreboard has lines whose true origin
- * is still unknown, pick one blame_entry, and allow its current
- * suspect to pass blames to its parents.
- */
+ * The main loop -- while we have blobs with lines whose true origin
+ * is still unknown, pick one blob, and allow its lines to pass blames
+ * to its parents. */
 static void assign_blame(struct scoreboard *sb, int opt)
 {
 	struct rev_info *revs = sb->revs;
+	struct commit *commit = prio_queue_get (&sb->commits);
 
-	while (1) {
+	while (commit) {
 		struct blame_entry *ent;
-		struct commit *commit;
-		struct origin *suspect = NULL;
+		struct origin *suspect = commit->util;
 
 		/* find one suspect to break down */
-		for (ent = sb->ent; !suspect && ent; ent = ent->next)
-			if (!ent->guilty)
-				suspect = ent->suspect;
-		if (!suspect)
-			return; /* all done */
+		while (suspect && !suspect->suspects)
+			suspect = suspect->next;
+		
+		if (!suspect) {
+			commit = prio_queue_get(&sb->commits);
+			continue;
+		}
+
+		assert(commit == suspect->commit);
 
 		/*
 		 * We will use this suspect later in the loop,
 		 * so hold onto it in the meantime.
 		 */
 		origin_incref(suspect);
-		commit = suspect->commit;
 		parse_commit(commit);
 		if (reverse ||
 		    (!(commit->object.flags & UNINTERESTING) &&
@@ -1550,9 +1696,22 @@ static void assign_blame(struct scoreboard *sb, int opt)
 			commit->object.flags |= UNINTERESTING;
 
 		/* Take responsibility for the remaining entries */
-		for (ent = sb->ent; ent; ent = ent->next)
-			if (ent->suspect == suspect)
+		ent = suspect->suspects;
+		if (ent) {
+			for (;;) {
+				struct blame_entry *next = ent->next;
 				found_guilty_entry(ent);
+				if (next)
+				{
+					ent = next;
+					continue;
+				}
+				ent->next = sb->ent;
+				sb->ent = suspect->suspects;
+				suspect->suspects = NULL;
+				break;
+			}
+		}
 		origin_decref(suspect);
 
 		if (DEBUG) /* sanity */
@@ -1609,9 +1768,8 @@ static void emit_porcelain(struct scoreboard *sb, struct blame_entry *ent,
 	char hex[41];
 
 	strcpy(hex, sha1_to_hex(suspect->commit->object.sha1));
-	printf("%s%c%d %d %d\n",
+	printf("%s %d %d %d\n",
 	       hex,
-	       ent->guilty ? ' ' : '*', /* purely for debugging */
 	       ent->s_lno + 1,
 	       ent->lno + 1,
 	       ent->num_lines);
@@ -1724,17 +1882,16 @@ static void output(struct scoreboard *sb, int option)
 
 	if (option & OUTPUT_PORCELAIN) {
 		for (ent = sb->ent; ent; ent = ent->next) {
-			struct blame_entry *oth;
-			struct origin *suspect = ent->suspect;
-			struct commit *commit = suspect->commit;
+			int count = 0;
+			struct origin *suspect;
+			struct commit *commit = ent->suspect->commit;
 			if (commit->object.flags & MORE_THAN_ONE_PATH)
 				continue;
-			for (oth = ent->next; oth; oth = oth->next) {
-				if ((oth->suspect->commit != commit) ||
-				    !strcmp(oth->suspect->path, suspect->path))
-					continue;
-				commit->object.flags |= MORE_THAN_ONE_PATH;
-				break;
+			for (suspect = commit->util; suspect; suspect = suspect->next) {
+				if (suspect->guilty && count++) {
+					commit->object.flags |= MORE_THAN_ONE_PATH;
+					break;
+				}
 			}
 		}
 	}
@@ -2083,7 +2240,6 @@ static struct commit *fake_working_tree_commit(struct diff_options *opt,
 	origin->file.ptr = buf.buf;
 	origin->file.size = buf.len;
 	pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_sha1);
-	commit->util = origin;
 
 	/*
 	 * Read the current index, replace the path entry with
@@ -2394,12 +2550,16 @@ parse_done:
 	memset(&sb, 0, sizeof(sb));
 
 	sb.revs = &revs;
-	if (!reverse)
+	if (!reverse) {
 		final_commit_name = prepare_final(&sb);
+		sb.commits.compare = compare_commits_by_commit_date;
+	}
 	else if (contents_from)
 		die("--contents and --children do not blend well.");
-	else
+	else {
 		final_commit_name = prepare_initial(&sb);
+		sb.commits.compare = compare_commits_by_reverse_commit_date;
+	}
 
 	if (!sb.final) {
 		/*
@@ -2493,7 +2653,7 @@ parse_done:
 	range_set_release(&ranges);
 	string_list_clear(&range_list, 0);
 
-	sb.ent = ent;
+	sb.ent = NULL;
 	sb.path = path;
 
 	read_mailmap(&mailmap, NULL);
@@ -2501,11 +2661,16 @@ parse_done:
 	if (!incremental)
 		setup_pager();
 
+	prio_queue_put(&sb.commits, o->commit);
+	o->suspects = ent;
+	
 	assign_blame(&sb, opt);
 
 	if (incremental)
 		return 0;
 
+	sb.ent = blame_sort(sb.ent);
+
 	coalesce(&sb);
 
 	if (!(output_option & OUTPUT_PORCELAIN))
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: How to substructure rewrites?
  2014-01-25 12:44 How to substructure rewrites? David Kastrup
  2014-01-25 18:23 ` [PATCH 0/3] "Teaser" patch for rewriting blame for efficiency David Kastrup
@ 2014-01-27 15:58 ` Junio C Hamano
  2014-01-27 16:27   ` David Kastrup
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2014-01-27 15:58 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

David Kastrup <dak@gnu.org> writes:

> As it can easily be guessed, the "add xxx function" commits are
> basically adding not-yet-used code (and so will not disrupt
> compilation), but everything starting with "Reorganize blame data
> structures" up until the final commit will not work or compile since the
> code does not match the data structures.
>
> So there is little point in substructing all that, right?  Even
> something seemingly isolated like
>
> commit f64b41c472442ae9971321fe8f62c3885ba4d8b7
> Author: David Kastrup <dak@gnu.org>
> Date:   Sun Jan 19 02:16:21 2014 +0100
>
>     blame.c: Let output determine MORE_THAN_ONE_PATH more efficiently
>
> is not really useful as a separate commit since while it does implement
> a particular task, this is done starting with non-working code relying
> on no-longer existent data structures.

Small pieces that are incrementally added with their own
documentation would certainly be a lot easier to read than one big
ball of wax.  I am wondering if it would make it easier for
everybody to tentatively do "git-blame vs git-blame2" dance here,
just like we did "git-blame vs git-annotate" dance some years ago.
That is, to add a completely new command and have them in parallel
while cooking in 'next' (or we could even keep them in a few
releases if we are not absolutely certain about the correctness of
the result of the new code), aiming to eventually retire the current
implementation and replace it with the new one.  We have already
have test infrastructure to allow us to run variants of blames, too,
to help that kind of transition.

> In general, the rule is likely "any commit should not create a
> non-working state" right?

Yes.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How to substructure rewrites?
  2014-01-27 15:58 ` How to substructure rewrites? Junio C Hamano
@ 2014-01-27 16:27   ` David Kastrup
  0 siblings, 0 replies; 11+ messages in thread
From: David Kastrup @ 2014-01-27 16:27 UTC (permalink / raw)
  To: git

Junio C Hamano <gitster@pobox.com> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> As it can easily be guessed, the "add xxx function" commits are
>> basically adding not-yet-used code (and so will not disrupt
>> compilation), but everything starting with "Reorganize blame data
>> structures" up until the final commit will not work or compile since the
>> code does not match the data structures.
>>
>> So there is little point in substructing all that, right?  Even
>> something seemingly isolated like
>>
>> commit f64b41c472442ae9971321fe8f62c3885ba4d8b7
>> Author: David Kastrup <dak@gnu.org>
>> Date:   Sun Jan 19 02:16:21 2014 +0100
>>
>>     blame.c: Let output determine MORE_THAN_ONE_PATH more efficiently
>>
>> is not really useful as a separate commit since while it does implement
>> a particular task, this is done starting with non-working code relying
>> on no-longer existent data structures.
>
> Small pieces that are incrementally added with their own
> documentation would certainly be a lot easier to read than one big
> ball of wax.

Sure.  The problem is that my rewrite is characterized by doing as
little as possible in order to achieve identical results (with the
conceivable exception of picking a different, equally scored variant in
those parts of the algorithm choosing a maximum).  That also means that
the basic logic and layout of the program stays the same while the data
flow and parts of the data structures are replaced.

> I am wondering if it would make it easier for everybody to tentatively
> do "git-blame vs git-blame2" dance here, just like we did "git-blame
> vs git-annotate" dance some years ago.  That is, to add a completely
> new command and have them in parallel while cooking in 'next' (or we
> could even keep them in a few releases if we are not absolutely
> certain about the correctness of the result of the new code), aiming
> to eventually retire the current implementation and replace it with
> the new one.  We have already have test infrastructure to allow us to
> run variants of blames, too, to help that kind of transition.

Well, the point is that the implementation is supposed to
a) deliver identical results
b) reuse as much code as possible
so there is no real point in working with a separate source file.

For the "if we are not absolutely certain about the correctness of the
result of the new code" angle, this should be covered with the usual
stable/unstable/proposed division most projects have in some way or
another for quality assurance.  I have absolutely no clue how Git
organizes that, but it would usually mean that the new code is not
placed in a different _file_ (or a differently named command) but rather
in a different _branch_ as compared with the current implementation.

>> In general, the rule is likely "any commit should not create a
>> non-working state" right?
>
> Yes.

My current aim is to complete the code to the point where it is
a) fully operative and delivering equivalent results to the current
implementation
b) in every aspect at least as efficient as the current implementation
and in a state that is not basically less comprehensible than what I
started with

Since the change of the data structures and data flow requires changing
all affected program parts to get to a working state, and since I don't
have ambitions to do more than that which is required to get there,
I don't see how the bulk of the work can sensibly avoid coming as one
"omnibus" patch.  Most changes, however, will be understandable quite
well locally.

For example, currently the code has a number of loops traversing one
global linked list, ignoring all entries not relevant to a particular
target, and doing something with the rest.  Those loops generally are
replaced with a simpler loop just running through a single _completely_
relevant linked list.  Even while those replacements are scattered
throughout the patch, they make sense without having to look at the rest
of the patch.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] builtin/blame.c: large-scale rewrite
  2014-01-25 18:23   ` [PATCH 3/3] builtin/blame.c: large-scale rewrite David Kastrup
@ 2014-01-27 16:54     ` Junio C Hamano
  2014-01-27 19:45       ` David Kastrup
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2014-01-27 16:54 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

David Kastrup <dak@gnu.org> writes:

> The previous implementation uses a sorted linear list of struct
> blame_entry in a struct scoreboard for organizing all partial or
> completed work.  Every task that is done requires going through the
> whole list where most entries are not relevant to the task at hand.
>
> This commit reorganizes the data structures in order to have each
> remaining subtask work with its own sorted linear list it can work off
> front to back.  Subtasks are organized into "struct origin" chains
> hanging off particular commits.  Commits are organized into a priority
> queue, processing them in commit date order in order to keep most of
> the work affecting a particular blob collated even in the presence of
> an extensive merge history.  In that manner, linear searches can be
> basically avoided anywhere.  They still are done for identifying one
> of multiple analyzed files in a given commit, but the degenerate case
> of a single large file being assembled from a multitude of smaller
> files in the past is not likely to occur often enough to warrant
> special treatment.
> ---

Sign-off?

Actually, I'd like to take my previous suggestion to add this as
blame2 (to replace blame in the future) back.  That was based on my
fear/hope to see an implementation based on a completely different
approach, but the basic premise of having one blame_entry record per
the lines of the final image in the scoreboard, using diff between
parents to the child to find common lines for passing the blame up,
etc. have not changed at all and the change is all about organizing
the pieces of data in a *much* *better* way to avoid needlessly
finding what we already have computed.  It does look big (and if you
removed those "#if 0/#endif", the final patch would be a lot bigger
because we will see a lot of deleted lines), but I do not think the
code conceptually is a different implementation to warrant "create a
totally new blame2 and let it sit next to the original".

Looks nice overall.

> diff --git a/builtin/blame.c b/builtin/blame.c
> index ead6148..994a9e8 100644
> --- a/builtin/blame.c
> +++ b/builtin/blame.c
> @@ -18,7 +18,9 @@
>  #include "cache-tree.h"
>  #include "string-list.h"
>  #include "mailmap.h"
> +#include "mergesort.h"
>  #include "parse-options.h"
> +#include "prio-queue.h"
>  #include "utf8.h"
>  #include "userdiff.h"
>  #include "line-range.h"
> @@ -83,11 +85,43 @@ static unsigned blame_copy_score;
>   */
>  struct origin {
>  	int refcnt;
> +	/* Record preceding blame record for this blob */
>  	struct origin *previous;
> +	/* origins are put in a list linked via `next' hanging off the
> +	 * corresponding commit's util field in order to make finding
> +	 * them fast.  The presence in this chain does not count
> +	 * towards the origin's reference count.  It is tempting to
> +	 * let it count as long as the commit is pending examination,
> +	 * but even under circumstances where the commit will be
> +	 * present multiple times in the priority queue of unexamined
> +	 * commits, processing the first instance will not leave any
> +	 * work requiring the origin data for the second instance.  An
> +	 * interspersed commit changing that would have to be
> +	 * preexisting with a different ancestry and with the same
> +	 * commit date in order to wedge itself between two instances
> +	 * of the same commit in the priority queue _and_ produce
> +	 * blame entries relevant for it.  While we don't want to let
> +	 * us get tripped up by this case, it certainly does not seem
> +	 * worth optimizing for.
> +	 */

Style; please make /* and */ sit on their own line without anything
else in multi-line comments.

> +	struct origin *next;
>  	struct commit *commit;
> +	/* `suspects' contains blame entries that may be attributed to
> +	 * this origin's commit or to parent commits.  When a commit
> +	 * is being processed, all suspects will be moved, either by
> +	 * assigning them to an origin in a different commit, or by
> +	 * shipping them to the scoreboard's ent list because they
> +	 * cannot be attributed to a different commit.
> +	 */
> +	struct blame_entry *suspects;
>  	mmfile_t file;
>  	unsigned char blob_sha1[20];
>  	unsigned mode;
> +	/* shipped gets set when shipping any suspects to the final
> +	 * blame list instead of other commits
> +	 */
> +	char shipped;

Unused?

> +	char guilty;
>  	char path[FLEX_ARRAY];
>  };
>  
> @@ -176,10 +210,23 @@ static inline struct origin *origin_incref(struct origin *o)
>  static void origin_decref(struct origin *o)
>  {
>  	if (o && --o->refcnt <= 0) {
> +		struct origin *p, *l = NULL;
>  		if (o->previous)
>  			origin_decref(o->previous);
>  		free(o->file.ptr);
> -		free(o);
> +		/* Should be present exactly once in commit chain */
> +		for (p = o->commit->util; p; l = p, p = p->next) {
> +			if (p == o)
> +			{

Style; please have { sit with the control structure that opens the
block, unless it is the opening brace of the function body or
struct/union definition.

> +				if (l)
> +					l->next = p->next;
> +				else
> +					o->commit->util = p->next;
> +				free(o);
> +				return;
> +			}
> +		}
> +		die("internal error in blame::origin_decref");
>  	}
>  }
>  
> @@ -193,8 +240,12 @@ static void drop_origin_blob(struct origin *o)
>  
>  /*
>   * Each group of lines is described by a blame_entry; it can be split
> - * as we pass blame to the parents.  They form a linked list in the
> - * scoreboard structure, sorted by the target line number.
> + * as we pass blame to the parents.  They are arranged in linked lists
> + * kept as `suspects' of some unprocessed origin, or entered (when the
> + * blame origin has been finalized) into the scoreboard structure.
> + * While the scoreboard structure is only sorted at the end of
> + * processing (according to final image line number), the lists
> + * attached to an origin are sorted by the target line number.
>   */
>  struct blame_entry {
>  	struct blame_entry *next;
> @@ -210,11 +261,6 @@ struct blame_entry {
>  	/* the commit that introduced this group into the final image */
>  	struct origin *suspect;
>  
> -	/* true if the suspect is truly guilty; false while we have not
> -	 * checked if the group came from one of its parents.
> -	 */
> -	char guilty;
> -
>  	/* true if the entry has been scanned for copies in the current parent
>  	 */
>  	char scanned;
> @@ -230,12 +276,94 @@ struct blame_entry {
>  	unsigned score;
>  };
>  
> +/* This is subtle.  Any _merge_ of blames happens on lists of blames
> + * that arrived via different parents in a single suspect.  In this
> + * case, we want to sort according to the _suspect_ line numbers.  In
> + * contrast to that, blame_sort is used for the final sorting of blame
> + * data and consequently works with final image line numbers.
> + */
> +
> +static struct blame_entry *
> +blame_merge(struct blame_entry *list1,
> +	    struct blame_entry *list2)

Style; we do not do GNU-ish "function name begins a line".  Even
though I personally find it easier to use for things like 'grep',
that is not the style we use around here.

> +{
> +	struct blame_entry *p1 = list1, *p2 = list2,
> +		**tail = &list1;
> +	
> +	if (!p1)
> +		return p2;
> +	if (!p2)
> +		return p1;
> +
> +	if (p1->s_lno <= p2->s_lno)
> +	{
> +		do {
> +			tail = &p1->next;
> +			if ((p1 = *tail) == NULL)
> +			{
> +				*tail = p2;
> +				return list1;
> +			}
> +		} while (p1->s_lno <= p2->s_lno);
> +	}
> +	for (;;)
> +	{
> +		*tail = p2;
> +		do {
> +			tail = &p2->next;
> +			if ((p2 = *tail) == NULL)  {
> +				*tail = p1;
> +				return list1;
> +			}
> +		} while (p1->s_lno > p2->s_lno);
> +		*tail = p1;
> +		do {
> +			tail = &p1->next;
> +			if ((p1 = *tail) == NULL) {
> +				*tail = p2;
> +				return list1;
> +			}
> +		} while (p1->s_lno <= p2->s_lno);
> +	}
> +}
> +
> +static void *
> +get_next_blame (const void *p)
> +{
> +	return ((struct blame_entry *)p)->next;
> +}
> +
> +static void
> +set_next_blame (void *p1, void *p2)
> +{
> +	((struct blame_entry *)p1)->next = p2;
> +}
> +
> +static int
> +compare_blame (const void *p1, const void *p2)
> +{
> +	return ((struct blame_entry *)p1)->lno - ((struct blame_entry *)p2)->lno;
> +}
> +
> +static struct blame_entry *
> +blame_sort (struct blame_entry *head)
> +{
> +	return llist_mergesort (head, get_next_blame, set_next_blame, compare_blame);
> +}
> +
> +int compare_commits_by_reverse_commit_date(const void *a, const void *b, void *c)
> +{
> +	return -compare_commits_by_commit_date(a, b, c);
> +}
> +
>  /*
>   * The current state of the blame assignment.
>   */
>  struct scoreboard {
>  	/* the final commit (i.e. where we started digging from) */
>  	struct commit *final;
> +	/* Priority queue for commits with unassigned blame records */
> +	struct prio_queue commits;
>  	struct rev_info *revs;
>  	const char *path;
>  
> @@ -268,7 +396,6 @@ static void coalesce(struct scoreboard *sb)
>  
>  	for (ent = sb->ent; ent && (next = ent->next); ent = next) {
>  		if (ent->suspect == next->suspect &&
> -		    ent->guilty == next->guilty &&
>  		    ent->s_lno + ent->num_lines == next->s_lno) {
>  			ent->num_lines += next->num_lines;
>  			ent->next = next->next;
> @@ -295,23 +422,33 @@ static struct origin *make_origin(struct commit *commit, const char *path)
>  	o = xcalloc(1, sizeof(*o) + strlen(path) + 1);
>  	o->commit = commit;
>  	o->refcnt = 1;
> +	o->next = commit->util;
> +	commit->util = o;
>  	strcpy(o->path, path);
>  	return o;
>  }
>  
>  /*
>   * Locate an existing origin or create a new one.
> + * This moves the origin to front position in the commit util list.
>   */
>  static struct origin *get_origin(struct scoreboard *sb,
>  				 struct commit *commit,
>  				 const char *path)
>  {
> -	struct blame_entry *e;
> +	struct origin *o, *l;
>  
> -	for (e = sb->ent; e; e = e->next) {
> -		if (e->suspect->commit == commit &&
> -		    !strcmp(e->suspect->path, path))
> -			return origin_incref(e->suspect);
> +	for (o = commit->util, l = NULL; o; l = o, o = o->next) {
> +		if (!strcmp(o->path, path))
> +		{
> +			/* bump to front */
> +			if (l) {
> +				l->next = o->next;
> +				o->next = commit->util;
> +				commit->util = o;
> +			}
> +			return origin_incref(o);
> +		}
>  	}
>  	return make_origin(commit, path);
>  }
> @@ -350,41 +487,19 @@ static struct origin *find_origin(struct scoreboard *sb,
>  				  struct commit *parent,
>  				  struct origin *origin)
>  {
> -	struct origin *porigin = NULL;
> +	struct origin *porigin;
>  	struct diff_options diff_opts;
>  	const char *paths[2];
>  
> -	if (parent->util) {
> -		/*
> -		 * Each commit object can cache one origin in that
> -		 * commit.  This is a freestanding copy of origin and
> -		 * not refcounted.
> -		 */
> -		struct origin *cached = parent->util;
> -		if (!strcmp(cached->path, origin->path)) {
> +	/* First check any existing origins */
> +	for (porigin = parent->util; porigin; porigin = porigin->next)
> +		if (!strcmp(porigin->path, origin->path)) {
>  			/*
>  			 * The same path between origin and its parent
>  			 * without renaming -- the most common case.
>  			 */
> -			porigin = get_origin(sb, parent, cached->path);
> -
> -			/*
> -			 * If the origin was newly created (i.e. get_origin
> -			 * would call make_origin if none is found in the
> -			 * scoreboard), it does not know the blob_sha1/mode,
> -			 * so copy it.  Otherwise porigin was in the
> -			 * scoreboard and already knows blob_sha1/mode.
> -			 */
> -			if (porigin->refcnt == 1) {
> -				hashcpy(porigin->blob_sha1, cached->blob_sha1);
> -				porigin->mode = cached->mode;
> -			}
> -			return porigin;
> +			return origin_incref (porigin);
>  		}
> -		/* otherwise it was not very useful; free it */
> -		free(parent->util);
> -		parent->util = NULL;
> -	}
>  
>  	/* See if the origin->path is different between parent
>  	 * and origin first.  Most of the time they are the
> @@ -450,19 +565,6 @@ static struct origin *find_origin(struct scoreboard *sb,
>  	}
>  	diff_flush(&diff_opts);
>  	free_pathspec(&diff_opts.pathspec);
> -	if (porigin) {
> -		/*
> -		 * Create a freestanding copy that is not part of
> -		 * the refcounted origin found in the scoreboard, and
> -		 * cache it in the commit.
> -		 */
> -		struct origin *cached;
> -
> -		cached = make_origin(porigin->commit, porigin->path);
> -		hashcpy(cached->blob_sha1, porigin->blob_sha1);
> -		cached->mode = porigin->mode;
> -		parent->util = cached;
> -	}
>  	return porigin;
>  }
>  
> @@ -508,49 +610,53 @@ static struct origin *find_rename(struct scoreboard *sb,
>  	return porigin;
>  }
>  
> +#if 0
>  /*
> - * Link in a new blame entry to the scoreboard.  Entries that cover the
> - * same line range have been removed from the scoreboard previously.
> + * Add a new blame entry to a given output queue.
>   */
> -static void add_blame_entry(struct scoreboard *sb, struct blame_entry *e)
> +static void add_blame_entry(struct blame_entry ***queue, struct blame_entry *e)
>  {
> -	struct blame_entry *ent, *prev = NULL;
> -
>  	origin_incref(e->suspect);
>  
> -	for (ent = sb->ent; ent && ent->lno < e->lno; ent = ent->next)
> -		prev = ent;
> -
> -	/* prev, if not NULL, is the last one that is below e */
> -
> -	if (prev) {
> -		e->next = prev->next;
> -		prev->next = e;
> -	}
> -	else {
> -		e->next = sb->ent;
> -		sb->ent = e;
> -	}
> +	e->next = **queue;
> +	**queue = e;
> +	*queue = &e->next;
>  }
>  
>  /*
>   * src typically is on-stack; we want to copy the information in it to
> - * a malloced blame_entry that is already on the linked list of the
> - * scoreboard.  The origin of dst loses a refcnt while the origin of src
> - * gains one.
> + * a blame_entry that is allocated in one queue and will be relocated
> + * to another.
>   */
> -static void dup_entry(struct blame_entry *dst, struct blame_entry *src)
> +static void dup_entry(struct blame_entry ***dstq, struct blame_entry ***srcq,
> +		      struct blame_entry *src)
>  {
> -	struct blame_entry *n;
> -
> -	n = dst->next;
> +	struct blame_entry *dst = **srcq;
>  	origin_incref(src->suspect);
>  	origin_decref(dst->suspect);
> +	**srcq = dst->next;	/* unlinked from source queue */
>  	memcpy(dst, src, sizeof(*src));
> -	dst->next = n;
>  	dst->score = 0;
> +	dst->next = **dstq;
> +	*dstq = &dst->next;	/* linked into destination queue */
>  }
>  
> +/* This copies a changed entry over the existing entry in the queue */
> +static void trim_entry(struct blame_entry ***queue, struct blame_entry *src)
> +{
> +	struct blame_entry *dst = **queue;
> +	struct blame_entry *n = dst->next;
> +	/* No need to mess with reference counts as we stay with the
> +	 * same suspect
> +	 */
> +	memcpy(dst, src, sizeof(*src));
> +	dst->score = 0;
> +	dst->next = n;
> +	*queue = &dst->next;	/* move after entry */
> +}
> +
> +#endif
> +
>  static const char *nth_line(struct scoreboard *sb, long lno)
>  {
>  	return sb->final_buf + sb->lineno[lno];
> @@ -561,6 +667,7 @@ static const char *nth_line_cb(void *data, long lno)
>  	return nth_line((struct scoreboard *)data, lno);
>  }
>  
> +#if 0
>  /*
>   * It is known that lines between tlno to same came from parent, and e
>   * has an overlap with that range.  it also is known that parent's
> @@ -573,7 +680,10 @@ static const char *nth_line_cb(void *data, long lno)
>   *             <------------------>
>   *
>   * Split e into potentially three parts; before this chunk, the chunk
> - * to be blamed for the parent, and after that portion.
> + * to be blamed for the parent, and after that portion.  As the chunks
> + * are processed in sequence, any part before the overlap can be
> + * terminally blamed on the target while parts after the overlap can
> + * only be properly evaluated after examining future diffs.
>   */
>  static void split_overlap(struct blame_entry *split,
>  			  struct blame_entry *e,
> @@ -620,72 +730,55 @@ static void split_overlap(struct blame_entry *split,
>  
>  /*
>   * split_overlap() divided an existing blame e into up to three parts
> - * in split.  Adjust the linked list of blames in the scoreboard to
> - * reflect the split.
> + * in split.  Adjust the linked list of blames to reflect the split,
> + * the first part being common to target and parent and consequently
> + * being blamed on the parent, the second part being different and
> + * consequently blamed on the target, and the third part starting out
> + * as being the same but without conclusive information where to place
> + * the blame, so leaving it in the target subject to further
> + * processing.
>   */
> -static void split_blame(struct scoreboard *sb,
> -			struct blame_entry *split,
> -			struct blame_entry *e)
> +static void split_blame(struct blame_entry *split,
> +			struct blame_entry ***dstq,
> +			struct blame_entry ***srcq)
>  {
>  	struct blame_entry *new_entry;
>  
>  	if (split[0].suspect && split[2].suspect) {
>  		/* The first part (reuse storage for the existing entry e) */
> -		dup_entry(e, &split[0]);
> +		trim_entry(srcq, &split[0]);
>  
>  		/* The last part -- me */
>  		new_entry = xmalloc(sizeof(*new_entry));
>  		memcpy(new_entry, &(split[2]), sizeof(struct blame_entry));
> -		add_blame_entry(sb, new_entry);
> +		add_blame_entry(srcq, new_entry);
>  
>  		/* ... and the middle part -- parent */
>  		new_entry = xmalloc(sizeof(*new_entry));
>  		memcpy(new_entry, &(split[1]), sizeof(struct blame_entry));
> -		add_blame_entry(sb, new_entry);
> +		add_blame_entry(dstq, new_entry);
>  	}
>  	else if (!split[0].suspect && !split[2].suspect)
>  		/*
>  		 * The parent covers the entire area; reuse storage for
>  		 * e and replace it with the parent.
>  		 */
> -		dup_entry(e, &split[1]);
> +		dup_entry(dstq, srcq, &split[1]);
>  	else if (split[0].suspect) {
>  		/* me and then parent */
> -		dup_entry(e, &split[0]);
> +		trim_entry(srcq, &split[0]);
>  
>  		new_entry = xmalloc(sizeof(*new_entry));
>  		memcpy(new_entry, &(split[1]), sizeof(struct blame_entry));
> -		add_blame_entry(sb, new_entry);
> +		add_blame_entry(dstq, new_entry);
>  	}
>  	else {
>  		/* parent and then me */
> -		dup_entry(e, &split[1]);
> -
>  		new_entry = xmalloc(sizeof(*new_entry));
> -		memcpy(new_entry, &(split[2]), sizeof(struct blame_entry));
> -		add_blame_entry(sb, new_entry);
> -	}
> -
> -	if (DEBUG) { /* sanity */
> -		struct blame_entry *ent;
> -		int lno = sb->ent->lno, corrupt = 0;
> +		memcpy(new_entry, &(split[1]), sizeof(struct blame_entry));
> +		add_blame_entry(dstq, new_entry);
>  
> -		for (ent = sb->ent; ent; ent = ent->next) {
> -			if (lno != ent->lno)
> -				corrupt = 1;
> -			if (ent->s_lno < 0)
> -				corrupt = 1;
> -			lno += ent->num_lines;
> -		}
> -		if (corrupt) {
> -			lno = sb->ent->lno;
> -			for (ent = sb->ent; ent; ent = ent->next) {
> -				printf("L %8d l %8d n %8d\n",
> -				       lno, ent->lno, ent->num_lines);
> -				lno = ent->lno + ent->num_lines;
> -			}
> -			die("oops");
> -		}
> +		trim_entry(srcq, &split[2]);
>  	}
>  }
>  
> @@ -702,74 +795,116 @@ static void decref_split(struct blame_entry *split)
>  }
>  
>  /*
> - * Helper for blame_chunk().  blame_entry e is known to overlap with
> + * Helper for blame_chunk().  blame_entry in srcq is known to overlap with
>   * the patch hunk; split it and pass blame to the parent.
>   */
> -static void blame_overlap(struct scoreboard *sb, struct blame_entry *e,
> +static void blame_overlap(struct blame entry ***dstq, struct blame_entry ***srcq,
>  			  int tlno, int plno, int same,
>  			  struct origin *parent)
>  {
>  	struct blame_entry split[3];
>  
> -	split_overlap(split, e, tlno, plno, same, parent);
> +	split_overlap(split, **srcq, tlno, plno, same, parent);
>  	if (split[1].suspect)
> -		split_blame(sb, split, e);
> +	  split_blame(split, dstq, srcq);
>  	decref_split(split);
>  }
> +#endif
>  
>  /*
> - * Find the line number of the last line the target is suspected for.
> + * Process one hunk from the patch between the current suspect for
> + * blame_entry e and its parent.  This first blames any unfinished
> + * entries before the chunk (which is where target and parent start
> + * differing) on the parent, and then splits blame entries at the
> + * start and at the end of the difference region.
>   */
> -static int find_last_in_target(struct scoreboard *sb, struct origin *target)
> +static void blame_chunk(struct blame_entry ***dstq, struct blame_entry ***srcq,
> +			int tlno, int offset, int same,
> +			struct origin *target, struct origin *parent)
>  {
> -	struct blame_entry *e;
> -	int last_in_target = -1;
> -
> -	for (e = sb->ent; e; e = e->next) {
> -		if (e->guilty || e->suspect != target)
> -			continue;
> -		if (last_in_target < e->s_lno + e->num_lines)
> -			last_in_target = e->s_lno + e->num_lines;
> +	struct blame_entry *e = **srcq;
> +	
> +/* Pass blame for everything before the differing chunk to the parent */
> +	while (e && e->s_lno + e->num_lines <= tlno) {
> +		**dstq = e;
> +		origin_decref(e->suspect);
> +		e->suspect = origin_incref(parent);
> +		e->s_lno += offset;
> +		e = *(*dstq = &e->next);
>  	}
> -	return last_in_target;
> -}
> -
> +	**srcq = e;
> +	if (!e)
> +		return;
>  /*
> - * Process one hunk from the patch between the current suspect for
> - * blame_entry e and its parent.  Find and split the overlap, and
> - * pass blame to the overlapping part to the parent.
> + * current record _ends_ after the start of the difference region.  If
> + * it starts before it, we need to split it up and pass blame for its
> + * first part to the parent.
>   */
> -static void blame_chunk(struct scoreboard *sb,
> -			int tlno, int plno, int same,
> -			struct origin *target, struct origin *parent)
> -{
> -	struct blame_entry *e;
> -
> -	for (e = sb->ent; e; e = e->next) {
> -		if (e->guilty || e->suspect != target)
> -			continue;
> -		if (same <= e->s_lno)
> -			continue;
> -		if (tlno < e->s_lno + e->num_lines)
> -			blame_overlap(sb, e, tlno, plno, same, parent);
> +	if (e->s_lno < tlno) {
> +		/* we move the first half to a new record */
> +		int len = tlno - e->s_lno;
> +		struct blame_entry *n = xcalloc(1, sizeof (struct blame_entry));
> +		n->suspect = origin_incref(parent);
> +		n->lno = e->lno;
> +		n->s_lno = e->s_lno + offset;
> +		n->num_lines = len;
> +		n->score = 0;
> +		**dstq = n;
> +		*dstq = &n->next;
> +
> +		e->lno += len;
> +		e->s_lno = tlno;
> +		e->num_lines -= len;
> +		e->score = 0;
>  	}
> +/*
> + * Now retain records on the target while it is different from the parent.
> + */
> +	while (e->s_lno + e->num_lines <= same) {
> +		e = *(*srcq = &e->next);
> +		if (!e)
> +			return;
> +	}
> +/*
> + * If current record starts before sameness, need to split.
> + */
> +	if (e->s_lno < same) {
> +		int len = same - e->s_lno;
> +		struct blame_entry *n = xcalloc(1, sizeof (struct blame_entry));
> +		n->suspect = origin_incref(target);
> +		n->lno = e->lno;
> +		n->s_lno = e->s_lno;
> +		n->num_lines = len;
> +		n->score = 0;
> +		**srcq = n;
> +		*(*srcq = &n->next) = e;
> +		/* keep second half in target queue */
> +		e->lno += len;
> +		e->s_lno = same;
> +		e->num_lines -= len;
> +		e->score = 0;
> +	}
> +	return;
>  }
>  
>  struct blame_chunk_cb_data {
> -	struct scoreboard *sb;
>  	struct origin *target;
>  	struct origin *parent;
> -	long plno;
> -	long tlno;
> +	long offset;
> +	struct blame_entry **dstq;
> +	struct blame_entry **srcq;
>  };
>  
> +/* diff chunks are from parent to target */
>  static int blame_chunk_cb(long start_a, long count_a,
>  			  long start_b, long count_b, void *data)
>  {
>  	struct blame_chunk_cb_data *d = data;
> -	blame_chunk(d->sb, d->tlno, d->plno, start_b, d->target, d->parent);
> -	d->plno = start_a + count_a;
> -	d->tlno = start_b + count_b;
> +	if (start_a - start_b != d->offset)
> +		die("internal error in blame::blame_chunk_cb");
> +	blame_chunk(&d->dstq, &d->srcq, start_b, start_a - start_b,
> +		    start_b + count_b, d->target, d->parent);
> +	d->offset = start_a + count_a - (start_b + count_b);
>  	return 0;
>  }
>  
> @@ -782,23 +917,28 @@ static int pass_blame_to_parent(struct scoreboard *sb,
>  				struct origin *target,
>  				struct origin *parent)
>  {
> -	int last_in_target;
>  	mmfile_t file_p, file_o;
>  	struct blame_chunk_cb_data d;
> +	struct blame_entry *newdest = NULL;
>  
> -	memset(&d, 0, sizeof(d));
> -	d.sb = sb; d.target = target; d.parent = parent;
> -	last_in_target = find_last_in_target(sb, target);
> -	if (last_in_target < 0)
> +	if (!target->suspects)
>  		return 1; /* nothing remains for this target */
>  
> +	d.target = target; d.parent = parent;
> +	d.offset = 0;
> +	d.dstq = &newdest; d.srcq = &target->suspects;
> +
>  	fill_origin_blob(&sb->revs->diffopt, parent, &file_p);
>  	fill_origin_blob(&sb->revs->diffopt, target, &file_o);
>  	num_get_patch++;
>  
>  	diff_hunks(&file_p, &file_o, 0, blame_chunk_cb, &d);
> -	/* The rest (i.e. anything after tlno) are the same as the parent */
> -	blame_chunk(sb, d.tlno, d.plno, last_in_target, target, parent);
> +	/* The rest are the same as the parent */
> +	blame_chunk(&d.dstq, &d.srcq, INT_MAX, d.offset, INT_MAX, target, parent);
> +	*d.dstq = NULL;
> +	parent->suspects = blame_merge (parent->suspects, newdest);
> +	if (parent->suspects)
> +		prio_queue_put(&sb->commits, parent->commit);
>  
>  	return 0;
>  }
> @@ -833,6 +973,7 @@ static unsigned ent_score(struct scoreboard *sb, struct blame_entry *e)
>  	return score;
>  }
>  
> +#if 0
>  /*
>   * best_so_far[] and this[] are both a split of an existing blame_entry
>   * that passes blame to the parent.  Maintain best_so_far the best split
> @@ -1147,6 +1288,8 @@ static int find_copy_in_parent(struct scoreboard *sb,
>  	return retval;
>  }
>  
> +#endif
> +
>  /*
>   * The blobs of origin and porigin exactly match, so everything
>   * origin is suspected for can be blamed on the parent.
> @@ -1154,20 +1297,22 @@ static int find_copy_in_parent(struct scoreboard *sb,
>  static void pass_whole_blame(struct scoreboard *sb,
>  			     struct origin *origin, struct origin *porigin)
>  {
> -	struct blame_entry *e;
> +	struct blame_entry *e, *suspects;
>  
>  	if (!porigin->file.ptr && origin->file.ptr) {
>  		/* Steal its file */
>  		porigin->file = origin->file;
>  		origin->file.ptr = NULL;
>  	}
> -	for (e = sb->ent; e; e = e->next) {
> -		if (e->suspect != origin)
> -			continue;
> +	suspects = origin->suspects;
> +	origin->suspects = NULL;
> +	for (e = suspects; e; e = e->next) {
>  		origin_incref(porigin);
>  		origin_decref(e->suspect);
>  		e->suspect = porigin;
>  	}
> +	porigin->suspects = blame_merge(porigin->suspects, suspects);
> +	prio_queue_put(&sb->commits, porigin->commit);
>  }
>  
>  /*
> @@ -1266,6 +1411,7 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
>  			goto finish;
>  	}
>  
> +#if 0
>  	/*
>  	 * Optionally find moves in parents' files.
>  	 */
> @@ -1292,6 +1438,7 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
>  						porigin, opt))
>  				goto finish;
>  		}
> +#endif
>  
>   finish:
>  	for (i = 0; i < num_sg; i++) {
> @@ -1488,14 +1635,11 @@ static int emit_one_suspect_detail(struct origin *suspect, int repeat)
>  }
>  
>  /*
> - * The blame_entry is found to be guilty for the range.  Mark it
> - * as such, and show it in incremental output.
> + * The blame_entry is found to be guilty for the range.
> + * Show it in incremental output.
>   */
>  static void found_guilty_entry(struct blame_entry *ent)
>  {
> -	if (ent->guilty)
> -		return;
> -	ent->guilty = 1;
>  	if (incremental) {
>  		struct origin *suspect = ent->suspect;
>  
> @@ -1509,32 +1653,34 @@ static void found_guilty_entry(struct blame_entry *ent)
>  }
>  
>  /*
> - * The main loop -- while the scoreboard has lines whose true origin
> - * is still unknown, pick one blame_entry, and allow its current
> - * suspect to pass blames to its parents.
> - */
> + * The main loop -- while we have blobs with lines whose true origin
> + * is still unknown, pick one blob, and allow its lines to pass blames
> + * to its parents. */
>  static void assign_blame(struct scoreboard *sb, int opt)
>  {
>  	struct rev_info *revs = sb->revs;
> +	struct commit *commit = prio_queue_get (&sb->commits);
>  
> -	while (1) {
> +	while (commit) {
>  		struct blame_entry *ent;
> -		struct commit *commit;
> -		struct origin *suspect = NULL;
> +		struct origin *suspect = commit->util;
>  
>  		/* find one suspect to break down */
> -		for (ent = sb->ent; !suspect && ent; ent = ent->next)
> -			if (!ent->guilty)
> -				suspect = ent->suspect;
> -		if (!suspect)
> -			return; /* all done */
> +		while (suspect && !suspect->suspects)
> +			suspect = suspect->next;
> +		
> +		if (!suspect) {
> +			commit = prio_queue_get(&sb->commits);
> +			continue;
> +		}
> +
> +		assert(commit == suspect->commit);
>  
>  		/*
>  		 * We will use this suspect later in the loop,
>  		 * so hold onto it in the meantime.
>  		 */
>  		origin_incref(suspect);
> -		commit = suspect->commit;
>  		parse_commit(commit);
>  		if (reverse ||
>  		    (!(commit->object.flags & UNINTERESTING) &&
> @@ -1550,9 +1696,22 @@ static void assign_blame(struct scoreboard *sb, int opt)
>  			commit->object.flags |= UNINTERESTING;
>  
>  		/* Take responsibility for the remaining entries */
> -		for (ent = sb->ent; ent; ent = ent->next)
> -			if (ent->suspect == suspect)
> +		ent = suspect->suspects;
> +		if (ent) {
> +			for (;;) {
> +				struct blame_entry *next = ent->next;
>  				found_guilty_entry(ent);
> +				if (next)
> +				{
> +					ent = next;
> +					continue;
> +				}
> +				ent->next = sb->ent;
> +				sb->ent = suspect->suspects;
> +				suspect->suspects = NULL;
> +				break;
> +			}
> +		}
>  		origin_decref(suspect);
>  
>  		if (DEBUG) /* sanity */
> @@ -1609,9 +1768,8 @@ static void emit_porcelain(struct scoreboard *sb, struct blame_entry *ent,
>  	char hex[41];
>  
>  	strcpy(hex, sha1_to_hex(suspect->commit->object.sha1));
> -	printf("%s%c%d %d %d\n",
> +	printf("%s %d %d %d\n",
>  	       hex,
> -	       ent->guilty ? ' ' : '*', /* purely for debugging */
>  	       ent->s_lno + 1,
>  	       ent->lno + 1,
>  	       ent->num_lines);
> @@ -1724,17 +1882,16 @@ static void output(struct scoreboard *sb, int option)
>  
>  	if (option & OUTPUT_PORCELAIN) {
>  		for (ent = sb->ent; ent; ent = ent->next) {
> -			struct blame_entry *oth;
> -			struct origin *suspect = ent->suspect;
> -			struct commit *commit = suspect->commit;
> +			int count = 0;
> +			struct origin *suspect;
> +			struct commit *commit = ent->suspect->commit;
>  			if (commit->object.flags & MORE_THAN_ONE_PATH)
>  				continue;
> -			for (oth = ent->next; oth; oth = oth->next) {
> -				if ((oth->suspect->commit != commit) ||
> -				    !strcmp(oth->suspect->path, suspect->path))
> -					continue;
> -				commit->object.flags |= MORE_THAN_ONE_PATH;
> -				break;
> +			for (suspect = commit->util; suspect; suspect = suspect->next) {
> +				if (suspect->guilty && count++) {
> +					commit->object.flags |= MORE_THAN_ONE_PATH;
> +					break;
> +				}
>  			}
>  		}
>  	}
> @@ -2083,7 +2240,6 @@ static struct commit *fake_working_tree_commit(struct diff_options *opt,
>  	origin->file.ptr = buf.buf;
>  	origin->file.size = buf.len;
>  	pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_sha1);
> -	commit->util = origin;
>  
>  	/*
>  	 * Read the current index, replace the path entry with
> @@ -2394,12 +2550,16 @@ parse_done:
>  	memset(&sb, 0, sizeof(sb));
>  
>  	sb.revs = &revs;
> -	if (!reverse)
> +	if (!reverse) {
>  		final_commit_name = prepare_final(&sb);
> +		sb.commits.compare = compare_commits_by_commit_date;
> +	}
>  	else if (contents_from)
>  		die("--contents and --children do not blend well.");
> -	else
> +	else {
>  		final_commit_name = prepare_initial(&sb);
> +		sb.commits.compare = compare_commits_by_reverse_commit_date;
> +	}
>  
>  	if (!sb.final) {
>  		/*
> @@ -2493,7 +2653,7 @@ parse_done:
>  	range_set_release(&ranges);
>  	string_list_clear(&range_list, 0);
>  
> -	sb.ent = ent;
> +	sb.ent = NULL;
>  	sb.path = path;
>  
>  	read_mailmap(&mailmap, NULL);
> @@ -2501,11 +2661,16 @@ parse_done:
>  	if (!incremental)
>  		setup_pager();
>  
> +	prio_queue_put(&sb.commits, o->commit);
> +	o->suspects = ent;
> +	
>  	assign_blame(&sb, opt);
>  
>  	if (incremental)
>  		return 0;
>  
> +	sb.ent = blame_sort(sb.ent);
> +
>  	coalesce(&sb);
>  
>  	if (!(output_option & OUTPUT_PORCELAIN))

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] builtin/blame.c: large-scale rewrite
  2014-01-27 16:54     ` Junio C Hamano
@ 2014-01-27 19:45       ` David Kastrup
  2014-01-27 20:51         ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: David Kastrup @ 2014-01-27 19:45 UTC (permalink / raw)
  To: git

Junio C Hamano <gitster@pobox.com> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> The previous implementation uses a sorted linear list of struct
>> blame_entry in a struct scoreboard for organizing all partial or
>> completed work.  Every task that is done requires going through the
>> whole list where most entries are not relevant to the task at hand.
>>
>> This commit reorganizes the data structures in order to have each
>> remaining subtask work with its own sorted linear list it can work off
>> front to back.  Subtasks are organized into "struct origin" chains
>> hanging off particular commits.  Commits are organized into a priority
>> queue, processing them in commit date order in order to keep most of
>> the work affecting a particular blob collated even in the presence of
>> an extensive merge history.  In that manner, linear searches can be
>> basically avoided anywhere.  They still are done for identifying one
>> of multiple analyzed files in a given commit, but the degenerate case
>> of a single large file being assembled from a multitude of smaller
>> files in the past is not likely to occur often enough to warrant
>> special treatment.
>> ---
>
> Sign-off?

Not while this is not fit for merging because of #if 0 etc and missing
functionality.  This is just for review.

> Actually, I'd like to take my previous suggestion to add this as
> blame2 (to replace blame in the future) back.  That was based on my
> fear/hope to see an implementation based on a completely different
> approach, but the basic premise of having one blame_entry record per
> the lines of the final image in the scoreboard, using diff between
> parents to the child to find common lines for passing the blame up,
> etc. have not changed at all and the change is all about organizing
> the pieces of data in a *much* *better* way to avoid needlessly
> finding what we already have computed.

Yes.  Basically, the call graph outside of blame.c itself should be
pretty much the same.

> Style; please make /* and */ sit on their own line without anything
> else in multi-line comments.

Will do.

>> +	struct origin *next;
>>  	struct commit *commit;
>> +	/* `suspects' contains blame entries that may be attributed to
>> +	 * this origin's commit or to parent commits.  When a commit
>> +	 * is being processed, all suspects will be moved, either by
>> +	 * assigning them to an origin in a different commit, or by
>> +	 * shipping them to the scoreboard's ent list because they
>> +	 * cannot be attributed to a different commit.
>> +	 */
>> +	struct blame_entry *suspects;
>>  	mmfile_t file;
>>  	unsigned char blob_sha1[20];
>>  	unsigned mode;
>> +	/* shipped gets set when shipping any suspects to the final
>> +	 * blame list instead of other commits
>> +	 */
>> +	char shipped;
>
> Unused?

More like "added, forgotten, added under yet another name":

>> +	char guilty;

I'll have to decide which one to keep.

>> +		/* Should be present exactly once in commit chain */
>> +		for (p = o->commit->util; p; l = p, p = p->next) {
>> +			if (p == o)
>> +			{
>
> Style; please have { sit with the control structure that opens the
> block, unless it is the opening brace of the function body or
> struct/union definition.

Ok.

>> +static struct blame_entry *
>> +blame_merge(struct blame_entry *list1,
>> +	    struct blame_entry *list2)
>
> Style; we do not do GNU-ish "function name begins a line".  Even
> though I personally find it easier to use for things like 'grep',
> that is not the style we use around here.

Ok.  I'm also certain to have a few "space between function name and
arguments" left and will grep for those before submitting the next
version.

Next version will at least include -M option, possibly leaving -C for
later.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] builtin/blame.c: large-scale rewrite
  2014-01-27 19:45       ` David Kastrup
@ 2014-01-27 20:51         ` Junio C Hamano
  2014-01-27 21:21           ` David Kastrup
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2014-01-27 20:51 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

David Kastrup <dak@gnu.org> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> David Kastrup <dak@gnu.org> writes:
>>
>>> The previous implementation uses a sorted linear list of struct
>>> blame_entry in a struct scoreboard for organizing all partial or
>>> completed work.  Every task that is done requires going through the
>>> whole list where most entries are not relevant to the task at hand.
>>>
>>> This commit reorganizes the data structures in order to have each
>>> remaining subtask work with its own sorted linear list it can work off
>>> front to back.  Subtasks are organized into "struct origin" chains
>>> hanging off particular commits.  Commits are organized into a priority
>>> queue, processing them in commit date order in order to keep most of
>>> the work affecting a particular blob collated even in the presence of
>>> an extensive merge history.  In that manner, linear searches can be
>>> basically avoided anywhere.  They still are done for identifying one
>>> of multiple analyzed files in a given commit, but the degenerate case
>>> of a single large file being assembled from a multitude of smaller
>>> files in the past is not likely to occur often enough to warrant
>>> special treatment.
>>> ---
>>
>> Sign-off?
>
> Not while this is not fit for merging because of #if 0 etc and missing
> functionality.  This is just for review.

That is not what Signing off a patch means, though ;-)

>> Actually, I'd like to take my previous suggestion to add this as
>> blame2 (to replace blame in the future) back.  That was based on my
>> fear/hope to see an implementation based on a completely different
>> approach, but the basic premise of having one blame_entry record per
>> the lines of the final image in the scoreboard, using diff between
>> parents to the child to find common lines for passing the blame up,
>> etc. have not changed at all and the change is all about organizing
>> the pieces of data in a *much* *better* way to avoid needlessly
>> finding what we already have computed.
>
> Yes.  Basically, the call graph outside of blame.c itself should be
> pretty much the same.
> ...
> Ok.  I'm also certain to have a few "space between function name and
> arguments" left and will grep for those before submitting the next
> version.
>
> Next version will at least include -M option, possibly leaving -C for
> later.

OK.  As we do not want to break the build in the middle of the
series, but we still want to keep the individual steps reviewable
incrementally, I would think the best structure would be have the
series that consists of multiple steps "the basic infrastructure is
there, but no rename following, and neither -M or -C works", "now
renames are followed", "now -M works", etc., with tests that are not
yet working (in the beginning, most of "git blame" test may no
longer work until the series is finished) marked with

    s/test_expect_success/test_expect_failure/

and turn expect_failure into expect_success as the series advances.
That way, we may get a better picture of what each step achieves.  I
dunno.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] builtin/blame.c: large-scale rewrite
  2014-01-27 20:51         ` Junio C Hamano
@ 2014-01-27 21:21           ` David Kastrup
  0 siblings, 0 replies; 11+ messages in thread
From: David Kastrup @ 2014-01-27 21:21 UTC (permalink / raw)
  To: git

Junio C Hamano <gitster@pobox.com> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Junio C Hamano <gitster@pobox.com> writes:
>>
>>> David Kastrup <dak@gnu.org> writes:
>>>
>>>> The previous implementation uses a sorted linear list of struct
>>>> blame_entry in a struct scoreboard for organizing all partial or
>>>> completed work.  Every task that is done requires going through the
>>>> whole list where most entries are not relevant to the task at hand.
>>>>
>>>> This commit reorganizes the data structures in order to have each
>>>> remaining subtask work with its own sorted linear list it can work off
>>>> front to back.  Subtasks are organized into "struct origin" chains
>>>> hanging off particular commits.  Commits are organized into a priority
>>>> queue, processing them in commit date order in order to keep most of
>>>> the work affecting a particular blob collated even in the presence of
>>>> an extensive merge history.  In that manner, linear searches can be
>>>> basically avoided anywhere.  They still are done for identifying one
>>>> of multiple analyzed files in a given commit, but the degenerate case
>>>> of a single large file being assembled from a multitude of smaller
>>>> files in the past is not likely to occur often enough to warrant
>>>> special treatment.
>>>> ---
>>>
>>> Sign-off?
>>
>> Not while this is not fit for merging because of #if 0 etc and missing
>> functionality.  This is just for review.
>
> That is not what Signing off a patch means, though ;-)

From Documentation/SubmittingPatches:

    The sign-off is a simple line at the end of the explanation for
    the patch, which certifies that you wrote it or otherwise have
    the right to pass it on as a open-source patch.  The rules are
    pretty simple: if you can certify the below:

            Developer's Certificate of Origin 1.1

            By making a contribution to this project, I certify that:

Well, and the patch is not in a state where I want to contribute it to
this project.  Basically I first want to bring it into a state where
I am not embarrassed by having my name attached to it.

Yes, that's probably not quite the idea of signing off...

> OK.  As we do not want to break the build in the middle of the
> series, but we still want to keep the individual steps reviewable
> incrementally, I would think the best structure would be have the
> series that consists of multiple steps "the basic infrastructure is
> there, but no rename following, and neither -M or -C works", "now
> renames are followed", "now -M works", etc., with tests that are not
> yet working (in the beginning, most of "git blame" test may no
> longer work until the series is finished) marked with
>
>     s/test_expect_success/test_expect_failure/
>
> and turn expect_failure into expect_success as the series advances.
> That way, we may get a better picture of what each step achieves.  I
> dunno.

Seems a bit pointless as the various functionalities are implemented in
separate code passages.  Basically the only "common" code is

static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
{
	struct rev_info *revs = sb->revs;
	int i, pass, num_sg;
	struct commit *commit = origin->commit;
[...]

#if 0
	/*
	 * Optionally find moves in parents' files.
	 */
	if (opt & PICKAXE_BLAME_MOVE)
		for (i = 0, sg = first_scapegoat(revs, commit);
		     i < num_sg && sg;
		     sg = sg->next, i++) {
			struct origin *porigin = sg_origin[i];
			if (!porigin)
				continue;
			if (find_move_in_parent(sb, origin, porigin))
				goto finish;
		}

	/*
	 * Optionally find copies from parents' files.
	 */
	if (opt & PICKAXE_BLAME_COPY)
		for (i = 0, sg = first_scapegoat(revs, commit);
		     i < num_sg && sg;
		     sg = sg->next, i++) {
			struct origin *porigin = sg_origin[i];
			if (find_copy_in_parent(sb, origin, sg->item,
						porigin, opt))
				goto finish;
		}
#endif

 finish:
	for (i = 0; i < num_sg; i++) {
		if (sg_origin[i]) {
			drop_origin_blob(sg_origin[i]);
			origin_decref(sg_origin[i]);
		}
	}
	drop_origin_blob(origin);
	if (sg_buf != sg_origin)
		free(sg_origin);
}


So what will happen here is that the #if 0 will get removed again, and
there will be somewhat different code for

        /*
	 * Optionally find moves in parents' files.
	 */

and then somewhat different code for

	/*
	 * Optionally find copies from parents' files.
	 */

and some section elsewhere with the functions being called here.  It's
not like there will be significant intersection between the
"same-file-and-rename" code and the "move-in-file or
copy-from-some-other-file" code.  It's just a separate piece of code,
marked with a separate comment.  Not an "evolution" of code, but rather
a simple addition.  Basically each of those additions does "try blaming
some more entries to some source different from the currently considered
target" and whatever remains at the end is blamed on the current target.
And every such addition has its own algorithm and functions.

At any rate: I'll just propose the thing as one piece first and then we
can still discuss what kind of subdivision and/or
commenting/restructuring might be warranted and would simplify
reviewing.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-01-27 21:21 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-25 12:44 How to substructure rewrites? David Kastrup
2014-01-25 18:23 ` [PATCH 0/3] "Teaser" patch for rewriting blame for efficiency David Kastrup
2014-01-25 18:23   ` [PATCH 1/3] builtin/blame.c: struct blame_entry does not need a prev link David Kastrup
2014-01-25 18:23   ` [PATCH 2/3] Eliminate same_suspect function in builtin/blame.c David Kastrup
2014-01-25 18:23   ` [PATCH 3/3] builtin/blame.c: large-scale rewrite David Kastrup
2014-01-27 16:54     ` Junio C Hamano
2014-01-27 19:45       ` David Kastrup
2014-01-27 20:51         ` Junio C Hamano
2014-01-27 21:21           ` David Kastrup
2014-01-27 15:58 ` How to substructure rewrites? Junio C Hamano
2014-01-27 16:27   ` David Kastrup

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).