Git development
 help / color / mirror / Atom feed
* Re: [PATCH] Don't recurse into parents marked uninteresting.
From: smurf @ 2006-03-09 10:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0603082144450.32577@g5.osdl.org>

Hi,

Linus Torvalds:
> So what was it that triggered this "parents had already been parsed" 
> situation? Is it because we've generated a huge list of "I have it" 
> objects when pulling? That would explain it..
> 
Something like that. I've converted a large number of older heads
(ranging from a few months to a few years) from $EVIL_SCM to git,
and tried to push them up to our main repository, which contains the
current development.

-- 
Matthias Urlichs   |   {M:U} IT Design @ m-u-it.de   |  smurf@smurf.noris.de
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
 - -
:read-only user: n. Describes a {luser} who uses computers almost
   exclusively for reading Usenet, bulletin boards, and/or email, rather
   than writing code or purveying useful information. See {twink},
   {terminal junkie}, {lurker}.

^ permalink raw reply

* Re: [PATCH] Allow git-repack to optionally run git-prune-packed.
From: Martin Atukunda @ 2006-03-09 10:24 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git
In-Reply-To: <20060307212918.GA9474@steel.home>

Your suggestion has merit, though it's different from the behaviour I
desired. I _sometimes_ need the pruning, and it felt appropriate to make
it an option as opposed to default behaviour.

What do you think?

- Martin -

On Tue, Mar 07, 2006 at 10:29:18PM +0100, Alex Riesen wrote:
> Martin Atukunda, Tue, Mar 07, 2006 16:16:12 +0100:
> > +-p::
> > +	Run `git-prune-packed` after packing, see
> > +	gitlink:git-prune-packed[1]
> > +
> 
> Maybe just make "-d" work? I.e. "git repack -a -d" repacks and prunes
> everything, and "git repack -d" prunes just what was packed
> incrementally.
> Something like this:
> 
> diff --git a/git-repack.sh b/git-repack.sh
> index 3d6fec1..be6c7ab 100755
> --- a/git-repack.sh
> +++ b/git-repack.sh
> @@ -74,6 +74,8 @@ then
>  			esac
>  		  done
>  		)
> +	else
> +		git-prune-packed
>  	fi
>  fi
>  
> 
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Girl on cell: Can you hear me when I roll my eyes?
	-- www.overheardinnewyourk.com

^ permalink raw reply

* Re: git-unpack-objects < pack file in repository doesn't work!
From: Junio C Hamano @ 2006-03-09 10:14 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git
In-Reply-To: <20060307040255.GA29544@spearce.org>

Shawn Pearce <spearce@spearce.org> writes:

> I wanted to explode a pack because I'm starting work on an Eclipse
> plugin for GIT.  I thought I'd try going down the road of letting
> the plugin read the repository directly, and write loose objects
> directly, but leave pack construction to the native C code.  So I
> tried to clone my local GIT repository to a new directory (thus
> had no loose objects at all) and unpack it to get loose objects.
> That didn't go so well.  :-)

Before "git-repack -a" was invented, I used to do this by hand:

	$ mkdir ./++preserve
        $ mv .git/objects/pack/pack-*.pack ./++preserve
        $ for p in ./++preserve/pack-*.pack
          do git-unpack-objects <$p; done

^ permalink raw reply

* Re: [PATCH] contrib/git-svn: fix UUID reading w/pre-1.2 svn; fetch args
From: Junio C Hamano @ 2006-03-09 10:08 UTC (permalink / raw)
  To: Eric Wong; +Cc: git
In-Reply-To: <20060308015730.GA28056@localdomain>

Eric Wong <normalperson@yhbt.net> writes:

> Junio: please don't apply this patch to git.git just yet.  It seems fine
> to me, but I haven't tested it heavily yet (Yann can help me, I hope :)
> I hardly slept the past few days and I may have broken something badly
> (it pasts all the tests, though).

I won't be applying it then.

I think this part is wrong.

> @@ -922,7 +930,9 @@ sub git_commit {
>  	}
>  	my @update_ref = ('git-update-ref',"refs/remotes/$GIT_SVN",$commit);
>  	if (my $primary_parent = shift @exec_parents) {
> -		push @update_ref, $primary_parent;
> +		if (!system('git-rev-parse',"refs/remotes/$GIT_SVN")){
> +			push @update_ref, $primary_parent;
> +		}

I think you are trying to see if you have .git/refs/remotes/foo,
and I think you actually have tried it to determine that is the
case.

But "git-rev-parse refs/remotes/foo" dies not because there is
no valid file .git/refs/remotes/foo that records SHA1 of an
existing commit.  If there is refs/remotes/foo file, it thinks
you have asked for it and gives it back happily.

A demonstration:

	$ cd /var/tmp/ && rm -fr junk && mkdir junk && cd junk
        $ git init-db
	defaulting to local storage area
        $ git-rev-parse refs/remotes/foo ; echo $?
        refs/remotes/foo
        fatal: 'refs/remotes/foo': No such file or directory
        128
        $ mkdir -p refs/remotes/foo
        $ ls -a
        ./  ../  .git/	refs/
        $ git-rev-parse refs/remotes/foo; echo $?
        refs/remotes/foo
        0

If you are trying to see if there is such a ref, I would do
this:

	$ git-rev-parse --verify refs/remotes/foo^0
        git-rev-parse --verify refs/remotes/foo^0
        fatal: Needed a single revision
	128

The --verify flag makes sure that the argument resolves to a
valid 40-hexadigit string (note that it does not verify if that
object actually exists), so asking for zeroth parent makes sure
you are dealing with a ref that actually points at a commit
object that exists.

^ permalink raw reply

* Re: [PATCH 0/3] Teach git-blame about renames
From: Fredrik Kuivinen @ 2006-03-09  7:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git
In-Reply-To: <7v4q28h38p.fsf@assigned-by-dhcp.cox.net>

On Wed, Mar 08, 2006 at 04:27:02PM -0800, Junio C Hamano wrote:
> > -static int compare_tree(struct tree *t1, struct tree *t2)
> > +int compare_tree(struct tree *t1, struct tree *t2)
> > ...
> > -static int same_tree_as_empty(struct tree *t1)
> > +int same_tree_as_empty(struct tree *t1)
> 
> Maybe the names are a bit too generic to be used as a global?
> 

Yes.. maybe. They are quite general though. Any suggestions for better
names? We could prefix everything in revision.h with "rev_" or
something like that.


Thanks for the comments. I will send an updated patch series soon.

- Fredrik

^ permalink raw reply

* Re: [PATCH] Don't recurse into parents marked uninteresting.
From: Linus Torvalds @ 2006-03-09  5:48 UTC (permalink / raw)
  To: Matthias Urlichs; +Cc: git
In-Reply-To: <pan.2006.03.09.04.04.34.617873@smurf.noris.de>



On Thu, 9 Mar 2006, Matthias Urlichs wrote:
>
> revision.c:make_parents_uninteresting() is exponential with the number
> of merges in the tree. That's fine -- unless some other part of git
> already has pulled the whole commit tree into memory ...

Good call.

However, I would have expected the normal case to be that we haven't even 
parsed the parent yet (as per the comment), so the parent normally 
shouldn't even have the parent pointer (due to not having been parsed).

So what was it that triggered this "parents had already been parsed" 
situation? Is it because we've generated a huge list of "I have it" 
objects when pulling? That would explain it..

		Linus

^ permalink raw reply

* [stgit] [PATCH] common: parse 'email (name)' correctly
From: Sam Vilain @ 2006-03-09  5:29 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Git Mailing List

Currently only e-mails of the form "Name <email>" are accepted by
stgit import etc, however some people use "email (Name)".  Accept this
alternate form.
---
Note: the function just below it might need consideration, too.

 stgit/commands/common.py |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/stgit/commands/common.py b/stgit/commands/common.py
index 2985379..4bfa4dd 100644
--- a/stgit/commands/common.py
+++ b/stgit/commands/common.py
@@ -175,12 +175,15 @@ def push_patches(patches, check_merged =
 
 def name_email(address):
     """Return a tuple consisting of the name and email parsed from a
-    standard 'name <email>' string
+    standard 'name <email>' or 'email (name)' string
     """
     address = re.sub('[\\\\"]', '\\\\\g<0>', address)
     str_list = re.findall('^(.*)\s*<(.*)>\s*$', address)
     if not str_list:
-        raise CmdException, 'Incorrect "name <email>" string: %s' % address
+        str_list = re.findall('^(.*)\s*\((.*)\)\s*$', address)
+        if not str_list:
+            raise CmdException, 'Incorrect "name <email>"/"email
(name)" string: %s' % address
+        return ( str_list[0][1], str_list[0][0] )
 
     return str_list[0]

^ permalink raw reply related

* [PATCH] Don't recurse into parents marked uninteresting.
From: Matthias Urlichs @ 2006-03-09  4:04 UTC (permalink / raw)
  To: git
In-Reply-To: <pan.2006.03.08.20.04.24.62170@smurf.noris.de>

revision.c:make_parents_uninteresting() is exponential with the number
of merges in the tree. That's fine -- unless some other part of git
already has pulled the whole commit tree into memory ...

---

... or, in other words, "Don't do that, please."

With this patch, all tests still succeed, and the "git push" which
triggered the problem takes 5min instead of an estimated 10mio years.

---

 revision.c |   24 +++++++++++++-----------
 1 files changed, 13 insertions(+), 11 deletions(-)

32c9750691d1ef225ca1641fdf6902e53c25fe5b
diff --git a/revision.c b/revision.c
index 2a33637..713f27e 100644
--- a/revision.c
+++ b/revision.c
@@ -82,18 +82,20 @@ void mark_parents_uninteresting(struct c
 
 	while (parents) {
 		struct commit *commit = parents->item;
-		commit->object.flags |= UNINTERESTING;
+		if (!(commit->object.flags & UNINTERESTING)) {
+			commit->object.flags |= UNINTERESTING;
 
-		/*
-		 * Normally we haven't parsed the parent
-		 * yet, so we won't have a parent of a parent
-		 * here. However, it may turn out that we've
-		 * reached this commit some other way (where it
-		 * wasn't uninteresting), in which case we need
-		 * to mark its parents recursively too..
-		 */
-		if (commit->parents)
-			mark_parents_uninteresting(commit);
+			/*
+			 * Normally we haven't parsed the parent
+			 * yet, so we won't have a parent of a parent
+			 * here. However, it may turn out that we've
+			 * reached this commit some other way (where it
+			 * wasn't uninteresting), in which case we need
+			 * to mark its parents recursively too..
+			 */
+			if (commit->parents)
+				mark_parents_uninteresting(commit);
+		}
 
 		/*
 		 * A missing commit is ok iff its parent is marked
-- 
Matthias Urlichs

^ permalink raw reply related

* Re: fsck-object --standalone got errors
From: Junio C Hamano @ 2006-03-09  3:17 UTC (permalink / raw)
  To: Ming Lei; +Cc: git
In-Reply-To: <440F945C.2010401@brocade.com>

Ming Lei <mlei@brocade.com> writes:

> I have a repository created by GIT itself(not cognito, etc). It has
> branches called base, master and origin. When I did git-fsck-objects
> --full there is nothing shown, but when I did git-fsck-objects
> --standalone, it displayed following:

That is really an artifact from a distant past.  Please do not
worry about that error -- as long as --full does not see
anything long, there is nothing wrong with your repository.

The fsck-object command (back then it was called fsck-cache)
complained if objects referred to by files in .git/refs/ or
objects stored in files under .git/objects/??/ are not found as
stand-alone SHA1 files (i.e.  found in alternate object pools or
packed archives stored under .git/objects/pack).  Back then,
packs and alternates were curiosity and having everything as
loose objects were the norm.

When we adjusted the behaviour of fsck-cache to consider objects
found in packs are OK, we introduced the --standalone flag as a
backward compatibility measure.

It still correctly checks if your repository is complete and
consists only of loose objects, so in that sense it is doing the
"right" thing, but checking that is pointless these days.  We
probably should remove that flag.

^ permalink raw reply

* fsck-object --standalone got errors
From: Ming Lei @ 2006-03-09  2:35 UTC (permalink / raw)
  To: git

I have a repository created by GIT itself(not cognito, etc). It has 
branches called base, master and origin. When I did git-fsck-objects 
--full there is nothing shown, but when I did git-fsck-objects 
--standalone, it displayed following:

error: refs/heads/master: invalid sha1 pointer 
ea51c414519ffe78c5bf95c488e94e82d3603472
error: refs/heads/base: invalid sha1 pointer 
4b75aaeb5af2dc69374ad080020758e4de6a45d2
error: refs/heads/origin: invalid sha1 pointer 
f4c9503abb2ad52752004e8e9b77b72e23d18d3e
fatal: No default references


The question is:
what's the purpose for this standalone check? what's these errors about? 
Do I need to care these errors?
what's the step to ensure my repository always be in a good shape? Is 
running fsck-objects --full sufficient?


Thanks
Ming

^ permalink raw reply

* Re: git-fmt-merge-msg cleanup
From: Junio C Hamano @ 2006-03-09  2:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0603081753270.32577@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> Since I've started using the "merge.summary" flag in my repo, my merge 
> messages look nicer, but I dislike how I get notifications of merges 
> within merges.
>
> So I'd suggest this trivial change..

Makes sense.  Thanks.

^ permalink raw reply

* git-fmt-merge-msg cleanup
From: Linus Torvalds @ 2006-03-09  1:56 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


Since I've started using the "merge.summary" flag in my repo, my merge 
messages look nicer, but I dislike how I get notifications of merges 
within merges.

So I'd suggest this trivial change..

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
----
diff --git a/git-fmt-merge-msg.perl b/git-fmt-merge-msg.perl
index dae383f..afe80e6 100755
--- a/git-fmt-merge-msg.perl
+++ b/git-fmt-merge-msg.perl
@@ -47,7 +47,7 @@ sub current_branch {
 sub shortlog {
 	my ($tip) = @_;
 	my @result;
-	foreach ( qx{git-log --topo-order --pretty=oneline $tip ^HEAD} ) {
+	foreach ( qx{git-log --no-merges --topo-order --pretty=oneline $tip ^HEAD} ) {
 		s/^[0-9a-f]{40}\s+//;
 		push @result, $_;
 	}

^ permalink raw reply related

* Re: [PATCH 0/3] Teach git-blame about renames
From: Junio C Hamano @ 2006-03-09  0:27 UTC (permalink / raw)
  To: Fredrik Kuivinen; +Cc: git
In-Reply-To: <20060308225412.GA461@c165.ib.student.liu.se>

Fredrik Kuivinen <freku045@student.liu.se> writes:

(from part 1/3)

> +void default_setter(struct commit* c, void* data)
> +{
> +	c->object.util = data;
> +}
> +
> +void* default_getter(struct commit* c)
> +{
> +	return c->object.util;
> +}
> +

These names are too generic to be used as a global.

The rest of the git code tends to say "void *default_getter()".

(from part 2/3)

> @@ -224,7 +224,7 @@ static struct commit_list *find_bisectio
>  	nr = 0;
>  	p = list;
>  	while (p) {
> -		if (!revs.paths || (p->item->object.flags & TREECHANGE))
> +		if (!revs.prune_data || (p->item->object.flags & TREECHANGE))
>  			nr++;
>  		p = p->next;
>  	}

Here you test with revs.prune_data, but the rest you test with
revs.prune_fn.  It is conceivable that some prune_fn could be
written without using prune_data, so I'd suggest to check
consistently with prune_fn.

> -static int compare_tree(struct tree *t1, struct tree *t2)
> +int compare_tree(struct tree *t1, struct tree *t2)
> ...
> -static int same_tree_as_empty(struct tree *t1)
> +int same_tree_as_empty(struct tree *t1)

Maybe the names are a bit too generic to be used as a global?

> -	if (revs->paths)
> +	/* if (revs->paths)
>  		try_to_simplify_commit(revs, commit);
> +	*/

Leftover commenting.


(from part 3/3)

> -	struct util_info *util;
> -	if (commit->object.util)
> -		return 0;
> +	struct util_info *util = commit->object.util;
> +
> +	if(util)
> +		return util;

The rest of the git code tends to say "if (util)".

^ permalink raw reply

* [PATCH 3/3] blame: Rename detection
From: Fredrik Kuivinen @ 2006-03-08 23:00 UTC (permalink / raw)
  To: git; +Cc: junkio
In-Reply-To: <20060308225412.GA461@c165.ib.student.liu.se>

Signed-off-by: Fredrik Kuivinen <freku045@student.liu.se>

---

The output format could certainly be improved. Any suggestions?

 blame.c |  239 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 199 insertions(+), 40 deletions(-)

diff --git a/blame.c b/blame.c
index 90338af..2cc4c96 100644
--- a/blame.c
+++ b/blame.c
@@ -14,6 +14,7 @@
 #include "tree.h"
 #include "blob.h"
 #include "diff.h"
+#include "diffcore.h"
 #include "revision.h"
 
 #define DEBUG 0
@@ -34,7 +35,9 @@ struct util_info {
 	char *buf;
 	unsigned long size;
 	int num_lines;
-//    const char* path;
+	const char* pathname;
+
+	void* topo_data;
 };
 
 struct chunk {
@@ -342,25 +345,34 @@ static int map_line(struct commit *commi
 	return info->line_map[line];
 }
 
-static int fill_util_info(struct commit *commit, const char *path)
+static struct util_info* get_util(struct commit *commit)
 {
-	struct util_info *util;
-	if (commit->object.util)
-		return 0;
+	struct util_info *util = commit->object.util;
+
+	if(util)
+		return util;
 
 	util = xmalloc(sizeof(struct util_info));
+	util->buf = NULL;
+	util->size = 0;
+	util->line_map = NULL;
+	util->num_lines = -1;
+	util->pathname = NULL;
+	commit->object.util = util;
+	return util;
+}
+
+static int fill_util_info(struct commit *commit)
+{
+	struct util_info *util = commit->object.util;
 
-	if (get_blob_sha1(commit->tree, path, util->sha1)) {
-		free(util);
+	assert(util);
+	assert(util->pathname);
+	
+	if (get_blob_sha1(commit->tree, util->pathname, util->sha1))
 		return 1;
-	} else {
-		util->buf = NULL;
-		util->size = 0;
-		util->line_map = NULL;
-		util->num_lines = -1;
-		commit->object.util = util;
+	else
 		return 0;
-	}
 }
 
 static void alloc_line_map(struct commit *commit)
@@ -389,10 +401,11 @@ static void alloc_line_map(struct commit
 
 static void init_first_commit(struct commit* commit, const char* filename)
 {
-	struct util_info* util;
+	struct util_info* util = commit->object.util;
 	int i;
 
-	if (fill_util_info(commit, filename))
+	util->pathname = filename;
+	if (fill_util_info(commit))
 		die("fill_util_info failed");
 
 	alloc_line_map(commit);
@@ -453,7 +466,7 @@ static void process_commits(struct rev_i
 		if(num_parents == 0)
 			*initial = commit;
 
-		if(fill_util_info(commit, path))
+		if(fill_util_info(commit))
 			continue;
 
 		alloc_line_map(commit);
@@ -471,7 +484,7 @@ static void process_commits(struct rev_i
 				printf("parent: %s\n",
 				       sha1_to_hex(parent->object.sha1));
 
-			if(fill_util_info(parent, path)) {
+			if(fill_util_info(parent)) {
 				num_parents--;
 				continue;
 			}
@@ -511,6 +524,135 @@ static void process_commits(struct rev_i
 	} while ((commit = get_revision(rev)) != NULL);
 }
 
+
+static int compare_tree_path(struct rev_info* revs,
+			     struct commit* c1, struct commit* c2)
+{
+	const char* paths[2];
+	struct util_info* util = c2->object.util;
+	paths[0] = util->pathname;
+	paths[1] = NULL;
+
+	diff_tree_setup_paths(get_pathspec(revs->prefix, paths));
+	return compare_tree(c1->tree, c2->tree);
+}
+
+
+static int same_tree_as_empty_path(struct rev_info *revs, struct tree* t1,
+				   const char* path)
+{
+	const char* paths[2];
+	paths[0] = path;
+	paths[1] = NULL;
+
+	diff_tree_setup_paths(get_pathspec(revs->prefix, paths));
+	return same_tree_as_empty(t1);
+}
+
+static const char* find_rename(struct commit* commit, struct commit* parent)
+{
+	struct util_info* cutil = commit->object.util;
+	struct diff_options diff_opts;
+	const char *paths[1];
+	int i;
+
+	if(DEBUG) {
+		printf("find_rename commit: %s ",
+		       sha1_to_hex(commit->object.sha1));
+		puts(sha1_to_hex(parent->object.sha1));
+	}
+
+	diff_setup(&diff_opts);
+	diff_opts.recursive = 1;
+	diff_opts.detect_rename = DIFF_DETECT_RENAME;
+	paths[0] = NULL;
+	diff_tree_setup_paths(paths);
+	if(diff_setup_done(&diff_opts) < 0)
+		die("diff_setup_done failed");
+
+	diff_tree_sha1(commit->tree->object.sha1, parent->tree->object.sha1,
+		       "", &diff_opts);
+	diffcore_std(&diff_opts);
+
+	for(i = 0; i < diff_queued_diff.nr; i++) {
+		struct diff_filepair *p = diff_queued_diff.queue[i];
+
+		if(p->status == 'R' && !strcmp(p->one->path, cutil->pathname)) {
+			if(DEBUG)
+				printf("rename %s -> %s\n", p->one->path, p->two->path);
+			return p->two->path;
+		}
+	}
+
+	return 0;
+}
+
+static void simplify_commit(struct rev_info *revs, struct commit *commit)
+{
+	struct commit_list **pp, *parent;
+
+	if (!commit->tree)
+		return;
+
+	if (!commit->parents) {
+		struct util_info* util = commit->object.util;
+		if (!same_tree_as_empty_path(revs, commit->tree,
+					     util->pathname))
+			commit->object.flags |= TREECHANGE;
+		return;
+	}
+
+	pp = &commit->parents;
+	while ((parent = *pp) != NULL) {
+		struct commit *p = parent->item;
+
+		if (p->object.flags & UNINTERESTING) {
+			pp = &parent->next;
+			continue;
+		}
+
+		parse_commit(p);
+		switch (compare_tree_path(revs, p, commit)) {
+		case TREE_SAME:
+			parent->next = NULL;
+			commit->parents = parent;
+			get_util(p)->pathname = get_util(commit)->pathname;
+			return;
+
+		case TREE_NEW:
+		{
+			
+			struct util_info* util = commit->object.util;
+			if (revs->remove_empty_trees &&
+			    same_tree_as_empty_path(revs, p->tree,
+						    util->pathname)) {
+				const char* new_name = find_rename(commit, p);
+				if(new_name) {
+					struct util_info* putil = get_util(p);
+					if(!putil->pathname)
+						putil->pathname = strdup(new_name);
+				} else {
+					*pp = parent->next;
+					continue;
+				}
+			}
+		}
+
+		/* fallthrough */
+		case TREE_DIFFERENT:
+			pp = &parent->next;
+			if(!get_util(p)->pathname)
+				get_util(p)->pathname =
+					get_util(commit)->pathname;
+			continue;
+		}
+		die("bad tree compare for commit %s",
+		    sha1_to_hex(commit->object.sha1));
+	}
+	commit->object.flags |= TREECHANGE;
+}
+
+
 struct commit_info
 {
 	char* author;
@@ -569,6 +711,18 @@ static const char* format_time(unsigned 
 	return time_buf;
 }
 
+static void topo_setter(struct commit* c, void* data)
+{
+	struct util_info* util = c->object.util;
+	util->topo_data = data;
+}
+
+static void* topo_getter(struct commit* c)
+{
+	struct util_info* util = c->object.util;
+	return util->topo_data;
+}
+
 int main(int argc, const char **argv)
 {
 	int i;
@@ -580,8 +734,8 @@ int main(int argc, const char **argv)
 	int sha1_len = 8;
 	int compability = 0;
 	int options = 1;
+	struct commit* start_commit;
 
-	int num_args;
 	const char* args[10];
 	struct rev_info rev;
 
@@ -634,28 +788,29 @@ int main(int argc, const char **argv)
 		strcpy(filename_buf, filename);
 	filename = filename_buf;
 
-	{
-		struct commit* c;
-		if (get_sha1(commit, sha1))
-			die("get_sha1 failed, commit '%s' not found", commit);
-		c = lookup_commit_reference(sha1);
-
-		if (fill_util_info(c, filename)) {
-			printf("%s not found in %s\n", filename, commit);
-			return 1;
-		}
+	if (get_sha1(commit, sha1))
+		die("get_sha1 failed, commit '%s' not found", commit);	
+	start_commit = lookup_commit_reference(sha1);
+	get_util(start_commit)->pathname = filename;
+	if (fill_util_info(start_commit)) {
+		printf("%s not found in %s\n", filename, commit);
+		return 1;
 	}
 
-	num_args = 0;
-	args[num_args++] = NULL;
-	args[num_args++] = "--topo-order";
-	args[num_args++] = "--remove-empty";
-	args[num_args++] = commit;
-	args[num_args++] = "--";
-	args[num_args++] = filename;
-	args[num_args] = NULL;
 
-	setup_revisions(num_args, args, &rev, "HEAD");
+	init_revisions(&rev);	
+	rev.remove_empty_trees = 1;
+	rev.topo_order = 1;
+	rev.prune_fn = simplify_commit;
+	rev.topo_setter = topo_setter;
+	rev.topo_getter = topo_getter;
+	rev.limited = 1;
+
+	commit_list_insert(start_commit, &rev.commits);
+	
+	args[0] = filename;
+	args[1] = NULL;
+	diff_tree_setup_paths(args);
 	prepare_revision_walk(&rev);
 	process_commits(&rev, filename, &initial);
 
@@ -665,17 +820,21 @@ int main(int argc, const char **argv)
 
 	for (i = 0; i < num_blame_lines; i++) {
 		struct commit *c = blame_lines[i];
+		struct util_info* u;
+
 		if (!c)
 			c = initial;
 
+		u = c->object.util;
 		get_commit_info(c, &ci);
 		fwrite(sha1_to_hex(c->object.sha1), sha1_len, 1, stdout);
 		if(compability)
 			printf("\t(%10s\t%10s\t%d)", ci.author,
 			       format_time(ci.author_time, ci.author_tz), i+1);
 		else
-			printf(" (%-15.15s %10s %*d) ", ci.author,
-			       format_time(ci.author_time, ci.author_tz),
+			printf(" %s (%-15.15s %10s %*d) ", u->pathname,
+			       ci.author, format_time(ci.author_time,
+						      ci.author_tz),
 			       max_digits, i+1);
 
 		if(i == num_blame_lines - 1) {

^ permalink raw reply related

* [PATCH 2/3] rev-lib: Make it easy to do rename tracking
From: Fredrik Kuivinen @ 2006-03-08 22:59 UTC (permalink / raw)
  To: git; +Cc: junkio
In-Reply-To: <20060308225412.GA461@c165.ib.student.liu.se>

prune_fn in the rev_info structure is called in place of
try_to_simplify_commit. This makes it possible to do rename tracking
with a custom try_to_simplify_commit-like function.

This commit also introduces init_revisions which initialises the rev_info
structure with default values.

Signed-off-by: Fredrik Kuivinen <freku045@student.liu.se>

---

 rev-list.c |    6 +++---
 revision.c |   59 +++++++++++++++++++++++++++++++++++++----------------------
 revision.h |   20 +++++++++++++++++++-
 3 files changed, 59 insertions(+), 26 deletions(-)

diff --git a/rev-list.c b/rev-list.c
index 8e4d83e..83a8ec2 100644
--- a/rev-list.c
+++ b/rev-list.c
@@ -190,7 +190,7 @@ static int count_distance(struct commit_
 
 		if (commit->object.flags & (UNINTERESTING | COUNTED))
 			break;
-		if (!revs.paths || (commit->object.flags & TREECHANGE))
+		if (!revs.prune_data || (commit->object.flags & TREECHANGE))
 			nr++;
 		commit->object.flags |= COUNTED;
 		p = commit->parents;
@@ -224,7 +224,7 @@ static struct commit_list *find_bisectio
 	nr = 0;
 	p = list;
 	while (p) {
-		if (!revs.paths || (p->item->object.flags & TREECHANGE))
+		if (!revs.prune_data || (p->item->object.flags & TREECHANGE))
 			nr++;
 		p = p->next;
 	}
@@ -234,7 +234,7 @@ static struct commit_list *find_bisectio
 	for (p = list; p; p = p->next) {
 		int distance;
 
-		if (revs.paths && !(p->item->object.flags & TREECHANGE))
+		if (revs.prune_data && !(p->item->object.flags & TREECHANGE))
 			continue;
 
 		distance = count_distance(p);
diff --git a/revision.c b/revision.c
index 2a33637..991dfe8 100644
--- a/revision.c
+++ b/revision.c
@@ -197,9 +197,6 @@ static int everybody_uninteresting(struc
 	return 1;
 }
 
-#define TREE_SAME	0
-#define TREE_NEW	1
-#define TREE_DIFFERENT	2
 static int tree_difference = TREE_SAME;
 
 static void file_add_remove(struct diff_options *options,
@@ -241,7 +238,7 @@ static struct diff_options diff_opt = {
 	.change = file_change,
 };
 
-static int compare_tree(struct tree *t1, struct tree *t2)
+int compare_tree(struct tree *t1, struct tree *t2)
 {
 	if (!t1)
 		return TREE_NEW;
@@ -253,7 +250,7 @@ static int compare_tree(struct tree *t1,
 	return tree_difference;
 }
 
-static int same_tree_as_empty(struct tree *t1)
+int same_tree_as_empty(struct tree *t1)
 {
 	int retval;
 	void *tree;
@@ -321,6 +318,7 @@ static void try_to_simplify_commit(struc
 	commit->object.flags |= TREECHANGE;
 }
 
+
 static void add_parents_to_list(struct rev_info *revs, struct commit *commit, struct commit_list **list)
 {
 	struct commit_list *parent = commit->parents;
@@ -358,8 +356,11 @@ static void add_parents_to_list(struct r
 	 * simplify the commit history and find the parent
 	 * that has no differences in the path set if one exists.
 	 */
-	if (revs->paths)
+	/* if (revs->paths)
 		try_to_simplify_commit(revs, commit);
+	*/
+	if (revs->prune_fn)
+		revs->prune_fn(revs, commit);
 
 	parent = commit->parents;
 	while (parent) {
@@ -381,9 +382,6 @@ static void limit_list(struct rev_info *
 	struct commit_list *newlist = NULL;
 	struct commit_list **p = &newlist;
 
-	if (revs->paths)
-		diff_tree_setup_paths(revs->paths);
-
 	while (list) {
 		struct commit_list *entry = list;
 		struct commit *commit = list->item;
@@ -435,6 +433,24 @@ static void handle_all(struct rev_info *
 	for_each_ref(handle_one_ref);
 }
 
+
+void init_revisions(struct rev_info *revs)
+{
+	memset(revs, 0, sizeof(*revs));
+	revs->lifo = 1;
+	revs->dense = 1;
+	revs->prefix = setup_git_directory();
+	revs->max_age = -1;
+	revs->min_age = -1;
+	revs->max_count = -1;
+
+	revs->prune_fn = NULL;
+	revs->prune_data = NULL;
+
+	revs->topo_setter = default_setter;
+	revs->topo_getter = default_getter;
+}
+
 /*
  * Parse revision information, filling in the "rev_info" structure,
  * and removing the used arguments from the argument list.
@@ -448,14 +464,8 @@ int setup_revisions(int argc, const char
 	const char **unrecognized = argv + 1;
 	int left = 1;
 
-	memset(revs, 0, sizeof(*revs));
-	revs->lifo = 1;
-	revs->dense = 1;
-	revs->prefix = setup_git_directory();
-	revs->max_age = -1;
-	revs->min_age = -1;
-	revs->max_count = -1;
-
+	init_revisions(revs);
+	
 	/* First, search for "--" */
 	seen_dashdash = 0;
 	for (i = 1; i < argc; i++) {
@@ -464,7 +474,7 @@ int setup_revisions(int argc, const char
 			continue;
 		argv[i] = NULL;
 		argc = i;
-		revs->paths = get_pathspec(revs->prefix, argv + i + 1);
+		revs->prune_data = get_pathspec(revs->prefix, argv + i + 1);
 		seen_dashdash = 1;
 		break;
 	}
@@ -628,7 +638,7 @@ int setup_revisions(int argc, const char
 				if (lstat(argv[j], &st) < 0)
 					die("'%s': %s", arg, strerror(errno));
 			}
-			revs->paths = get_pathspec(revs->prefix, argv + i);
+			revs->prune_data = get_pathspec(revs->prefix, argv + i);
 			break;
 		}
 		commit = get_commit_reference(revs, arg, sha1, flags ^ local_flags);
@@ -642,8 +652,13 @@ int setup_revisions(int argc, const char
 		commit = get_commit_reference(revs, def, sha1, 0);
 		add_one_commit(commit, revs);
 	}
-	if (revs->paths)
+
+	if (revs->prune_data) {
+		diff_tree_setup_paths(revs->prune_data);
+		revs->prune_fn = try_to_simplify_commit;
 		revs->limited = 1;
+	}
+
 	return left;
 }
 
@@ -653,7 +668,7 @@ void prepare_revision_walk(struct rev_in
 	if (revs->limited)
 		limit_list(revs);
 	if (revs->topo_order)
-		sort_in_topological_order(&revs->commits, revs->lifo);
+		sort_in_topological_order_fun(&revs->commits, revs->lifo, revs->topo_setter, revs->topo_getter);
 }
 
 static int rewrite_one(struct commit **pp)
@@ -709,7 +724,7 @@ struct commit *get_revision(struct rev_i
 			return NULL;
 		if (revs->no_merges && commit->parents && commit->parents->next)
 			goto next;
-		if (revs->paths && revs->dense) {
+		if (revs->prune_fn && revs->dense) {
 			if (!(commit->object.flags & TREECHANGE))
 				goto next;
 			rewrite_parents(commit);
diff --git a/revision.h b/revision.h
index 31e8f61..510d936 100644
--- a/revision.h
+++ b/revision.h
@@ -7,6 +7,10 @@
 #define SHOWN		(1u<<3)
 #define TMP_MARK	(1u<<4) /* for isolated cases; clean after use */
 
+struct rev_info;
+
+typedef void (prune_fn_t)(struct rev_info *revs, struct commit *commit);
+
 struct rev_info {
 	/* Starting list */
 	struct commit_list *commits;
@@ -14,7 +18,9 @@ struct rev_info {
 
 	/* Basic information */
 	const char *prefix;
-	const char **paths;
+
+	void* prune_data;
+	prune_fn_t* prune_fn;
 
 	/* Traversal flags */
 	unsigned int	dense:1,
@@ -33,9 +39,21 @@ struct rev_info {
 	int max_count;
 	unsigned long max_age;
 	unsigned long min_age;
+
+	set_fn_t topo_setter;
+	get_fn_t topo_getter;
 };
 
+#define TREE_SAME	0
+#define TREE_NEW	1
+#define TREE_DIFFERENT	2
+
 /* revision.c */
+extern int same_tree_as_empty(struct tree *t1);
+extern int compare_tree(struct tree *t1, struct tree *t2);
+
+
+extern void init_revisions(struct rev_info *revs);
 extern int setup_revisions(int argc, const char **argv, struct rev_info *revs, const char *def);
 extern void prepare_revision_walk(struct rev_info *revs);
 extern struct commit *get_revision(struct rev_info *revs);

^ permalink raw reply related

* [PATCH 1/3] Make it possible to not clobber object.util in sort_in_topological_order
From: Fredrik Kuivinen @ 2006-03-08 22:59 UTC (permalink / raw)
  To: git; +Cc: junkio
In-Reply-To: <20060308225412.GA461@c165.ib.student.liu.se>

Signed-off-by: Fredrik Kuivinen <freku045@student.liu.se>

---

 commit.c |   27 +++++++++++++++++++++------
 commit.h |    9 +++++++++
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/commit.c b/commit.c
index 06d5439..2552ffd 100644
--- a/commit.c
+++ b/commit.c
@@ -569,11 +569,26 @@ int count_parents(struct commit * commit
         return count;
 }
 
+void default_setter(struct commit* c, void* data)
+{
+	c->object.util = data;
+}
+
+void* default_getter(struct commit* c)
+{
+	return c->object.util;
+}
+
 /*
  * Performs an in-place topological sort on the list supplied.
  */
 void sort_in_topological_order(struct commit_list ** list, int lifo)
 {
+	sort_in_topological_order_fun(list, lifo, default_setter, default_getter);
+}
+
+void sort_in_topological_order_fun(struct commit_list ** list, int lifo, set_fn_t setter, get_fn_t getter)
+{
 	struct commit_list * next = *list;
 	struct commit_list * work = NULL, **insert;
 	struct commit_list ** pptr = list;
@@ -596,7 +611,7 @@ void sort_in_topological_order(struct co
 	next=*list;
 	while (next) {
 		next_nodes->list_item = next;
-		next->item->object.util = next_nodes;
+		setter(next->item, next_nodes);
 		next_nodes++;
 		next = next->next;
 	}
@@ -606,7 +621,7 @@ void sort_in_topological_order(struct co
 		struct commit_list * parents = next->item->parents;
 		while (parents) {
 			struct commit * parent=parents->item;
-			struct sort_node * pn = (struct sort_node *)parent->object.util;
+			struct sort_node * pn = (struct sort_node *) getter(parent);
 			
 			if (pn)
 				pn->indegree++;
@@ -624,7 +639,7 @@ void sort_in_topological_order(struct co
 	next=*list;
 	insert = &work;
 	while (next) {
-		struct sort_node * node = (struct sort_node *)next->item->object.util;
+		struct sort_node * node = (struct sort_node *) getter(next->item);
 
 		if (node->indegree == 0) {
 			insert = &commit_list_insert(next->item, insert)->next;
@@ -637,12 +652,12 @@ void sort_in_topological_order(struct co
 		sort_by_date(&work);
 	while (work) {
 		struct commit * work_item = pop_commit(&work);
-		struct sort_node * work_node = (struct sort_node *)work_item->object.util;
+		struct sort_node * work_node = (struct sort_node *) getter(work_item);
 		struct commit_list * parents = work_item->parents;
 
 		while (parents) {
 			struct commit * parent=parents->item;
-			struct sort_node * pn = (struct sort_node *)parent->object.util;
+			struct sort_node * pn = (struct sort_node *) getter(parent);
 			
 			if (pn) {
 				/* 
@@ -667,7 +682,7 @@ void sort_in_topological_order(struct co
 		*pptr = work_node->list_item;
 		pptr = &(*pptr)->next;
 		*pptr = NULL;
-		work_item->object.util = NULL;
+		setter(work_item, NULL);
 	}
 	free(nodes);
 }
diff --git a/commit.h b/commit.h
index 70a7c75..0be61eb 100644
--- a/commit.h
+++ b/commit.h
@@ -76,4 +76,13 @@ int count_parents(struct commit * commit
  *   sorted in the dates order.
  */
 void sort_in_topological_order(struct commit_list ** list, int lifo);
+
+typedef void (*set_fn_t)(struct commit*, void* data);
+typedef void* (*get_fn_t)(struct commit*);
+
+void default_setter(struct commit* c, void* data);
+void* default_getter(struct commit* c);
+
+void sort_in_topological_order(struct commit_list ** list, int lifo);
+void sort_in_topological_order_fun(struct commit_list ** list, int lifo, set_fn_t, get_fn_t);
 #endif /* COMMIT_H */

^ permalink raw reply related

* [PATCH 0/3] Teach git-blame about renames
From: Fredrik Kuivinen @ 2006-03-08 22:54 UTC (permalink / raw)
  To: git; +Cc: junkio

Hi,

This patch series teaches git-blame about renames. To do this I have
changed the revision.h interface a bit. In particular, it is now
possible for the user of revision.h to specify a
try_to_simply_commit-like function. That function can then do the
rename tracking.

I have also made a small change to sort_in_topological_order to make
it possible to use the object.util field at the same time as a
topological sort is done. Previously the object.util field was
clobbered by the topological sort. In the new interface the auxiliary
data that the topological sort needs to store for each commit object
is stored with a setter function and retrieved by a getter. Pointers
to those functions are passed to sort_in_topological_order.

- Fredrik

^ permalink raw reply

* Re: git-svn, tree moves, and --no-stop-on-copy
From: Yann Dirson @ 2006-03-08 22:41 UTC (permalink / raw)
  To: Eric Wong; +Cc: GIT list
In-Reply-To: <20060308221524.GF12638@nowhere.earth>

On Wed, Mar 08, 2006 at 11:15:24PM +0100, Yann Dirson wrote:
> OTOH, this does work:
> 
>  svn co -r1 https://svn.sourceforge.net/svnroot/ufoai/trunk@1

Let's go further:

@@ -224,12 +227,14 @@ sub fetch {
        unless (-d $SVN_WC) {
                my @svn_co = ('svn','co',"-r$base->{revision}");
                push @svn_co,'--ignore-externals' unless $_no_ignore_ext;
-               sys(@svn_co, $SVN_URL, $SVN_WC);
+               sys(@svn_co, $SVN_URL . "\@$base->{revision}", $SVN_WC);
                chdir $SVN_WC or croak $!;


That allows git-svn not to fail at r1 (or at r3 when checking out
trunk/src only), and the 1st fetch runs OK.

The second fetch runs OK as well, but it shows the following somewhat
scary stuff before starting the import:

Checked out revision 166.
refs/remotes/git-newsvn
fatal: 'refs/remotes/git-newsvn': No such file or directory
r166 = 38a6ba0e3db486c65a611c54c53f838210ce7551


The results looks fine.  I don't know if it is expected to have the
master head stuck at remotes/git-oldsvn, though.

Thanks much for your help!
-- 
Yann Dirson    <ydirson@altern.org> |
Debian-related: <dirson@debian.org> |   Support Debian GNU/Linux:
                                    |  Freedom, Power, Stability, Gratis
     http://ydirson.free.fr/        | Check <http://www.debian.org/>

^ permalink raw reply

* Re: git-svn, tree moves, and --no-stop-on-copy
From: Yann Dirson @ 2006-03-08 22:15 UTC (permalink / raw)
  To: Eric Wong; +Cc: GIT list
In-Reply-To: <20060308014207.GA31137@localdomain>

On Tue, Mar 07, 2006 at 05:42:07PM -0800, Eric Wong wrote:
> If you want full repository history for reorganized repositories,
> easiest way is to pay the price for full repository and all of its
> history.
> 
> 	git-svn init https://svn.sourceforge.net/svnroot/ufoai
> 	git-svn fetch
> 	# this puts all your branches and tags into one single big git tree.
> 
> However, the following should always work: (after the following patch,
> 
> 	GIT_SVN_ID=git-oldsvn git-svn init \
> 		https://svn.sourceforge.net/svnroot/ufoai/trunk
> 	GIT_SVN_ID=git-oldsvn git-svn fetch -r1:165
> 
> 	GIT_SVN_ID=git-newsvn git-svn init
> 		https://svn.sourceforge.net/svnroot/ufoai/ufoai/trunk
> 	GIT_SVN_ID=git-newsvn git-svn fetch \
> 		166=`git-rev-parse refs/remotes/git-oldsvn`

Thanks much for the hint - it should definitively be a good example
for the doc.

> Unfortunately, it does not, at least with svn 1.2.3...  I have a patch
> coming that should fix things for 1.1.1 (and give better 1.1.x support
> in general).  I'm not sure, but it feels like something is screwed up
> with svn 1.2.3dfsg1-3:
> 
> This works:	svn log -r1 https://svn.sourceforge.net/svnroot/ufoai/trunk

> This doesn't:	svn  co -r1 https://svn.sourceforge.net/svnroot/ufoai/trunk
> 
> But this:	svn  co -r1 https://svn.sourceforge.net/svnroot/ufoai
> will create the following structure:
> 	ufoai/{trunk,branches,tags}
> 
> I'm quite puzzled about it, as I swear I've seen it work on a different
> project recently (of course I cannot remember which :<)

Looks like svn may be looking at the current revision to find out
which path you are requesting, and that path does not exist any more.
Could look like something normal, if "svn log" complained - but the
problem may just be with "svn log".

OTOH, this does work:

 svn co -r1 https://svn.sourceforge.net/svnroot/ufoai/trunk@1


> In the face of repository reorgs, git-svn is happiest tracking partial
> history.  Or tracking the entire repository from the root.

Well, that could be a solution, if I could 1) filter out parts of the
tree I do not care about, and 2) strip the leading /whatever/trunk.
Not sure it's worth it :)


> Hopefully I've been reasonably coherent, having insomnia lately.

At least, my state of insomnia makes it look perferctly coherent :)

Best regards,
-- 
Yann Dirson    <ydirson@altern.org> |
Debian-related: <dirson@debian.org> |   Support Debian GNU/Linux:
                                    |  Freedom, Power, Stability, Gratis
     http://ydirson.free.fr/        | Check <http://www.debian.org/>

^ permalink raw reply

* Re: git-rev-list bug?
From: Junio C Hamano @ 2006-03-08 20:16 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: git
In-Reply-To: <b0943d9e0603080819i227c637fo@mail.gmail.com>

"Catalin Marinas" <catalin.marinas@gmail.com> writes:

> Sorry if this was previously discussed. I ran git-rev-list on a linear
> graph and tried to filter the results by a file name:
>
>   git rev-list since.. path/to/file
>
> but it always shows the child commit of "since" even if it didn't
> touch the file. The same behaviour is for git-log (since it uses
> git-rev-list) but git-whatchanged seems to be fine.
>
> Is this the intended behaviour? The "stg patches" command based on
> git-rev-list used to work fine a few weeks ago but now it is always
> reporting the bottom patch in the stack as modifying a given file.

I can confirm that this is a recent breakage, but since it is
unfortunately my day-job day the more detailed analysis and fix
needs to wait.  Sorry.

^ permalink raw reply

* endless loop: ?
From: Matthias Urlichs @ 2006-03-08 20:04 UTC (permalink / raw)
  To: git

Hmmm...

# ps axurw
smurf    26367 84.1  0.2   3200  2068 pts/1    R    20:08  40:30 /usr/bin/git-rev-list --objects 5771c72e2908fb68906020d07d4e0cb77d2

# strace -p 26367
stat64("/daten/src/noris/kundet.git/.git/objects/23/ae4dd0bab2f05dba8a3c77d4c792542b07b73e", {st_mode=S_IFREG|0644, st_size=204, ...}) = 0
[ lots of different filenames, however ..:]

# strace -p 26367 | grep ae4dd
stat64("/daten/src/noris/kundet.git/.git/objects/23/ae4dd0bab2f05dba8a3c77d4c792542b07b73e", {st_mode=S_IFREG|0644, st_size=204, ...}) = 0
[ lots of repetitions of the same filename => we're in a bad loop of some sort ]

# gdb /daten/src/git/git/git-rev-list 26367
(gdb) whe
#0  0xffffe410 in __kernel_vsyscall ()
#1  0xb7ec08f8 in _xstat () from /lib/tls/i686/cmov/libc.so.6
#2  0x0804e2f0 in find_sha1_file (
    sha1=0x81ddb20 "#\uffffM\u043a\uffff\uffff]\uffff\212<w\uffff\uffff\222T+\a\uffff>\uffffr\005\b\uffff\uffff\035\b",
    st=0xbf87b450) at stat.h:366
#3  0x0804e752 in has_sha1_file (
    sha1=0x81ddb20 "#\uffffM\u043a\uffff\uffff]\uffff\212<w\uffff\uffff\222T+\a\uffff>\uffffr\005\b\uffff\uffff\035\b")
    at sha1_file.c:1600
#4  0x080514f8 in mark_parents_uninteresting (commit=0x0) at revision.c:106
#5  0x080514ed in mark_parents_uninteresting (commit=0x0) at revision.c:96
[ bah ]
#104 0x080514ed in mark_parents_uninteresting (commit=0x0) at revision.c:96
#105 0x080516d8 in prepare_revision_walk (revs=0x8066560) at revision.c:347
#106 0x08049ab4 in main (argc=1, argv=0xbf87c294) at rev-list.c:352

I'll dig further tomorrow...

^ permalink raw reply

* [PATCH] diff-delta: bound hash list length to avoid O(m*n) behavior
From: Nicolas Pitre @ 2006-03-08 19:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git


The diff-delta code can exhibit O(m*n) behavior with some patological
data set where most hash entries end up in the same hash bucket.

To prevent this, a limit is imposed to the number of entries that can 
exist in the same hash bucket.

Because of the above the code is a tiny bit more expensive on average, 
even if some small optimizations were added as well to atenuate the 
overhead. But the problematic samples used to diagnoze the issue are now 
orders of magnitude less expensive to process with only a slight loss in 
compression.

Signed-off-by: Nicolas Pitre <nico@cam.org>

---

For example, Carl Baldwin provided me with a couple 20MB files, and 
deltifying one against another one with test-delta takes around 
TEN MINUTES for only one delta on my P4 @ 3GHz.

Nnow imagine using git-repack -a with a default window
of 10 ... 

With this patch the test-delta time dropped to only 9 seconds.  And the 
resulting delta, once compressed, is about 2% larger.

diff --git a/diff-delta.c b/diff-delta.c
index 2ed5984..aaee7be 100644
--- a/diff-delta.c
+++ b/diff-delta.c
@@ -40,17 +40,18 @@ struct index {
 
 static struct index ** delta_index(const unsigned char *buf,
 				   unsigned long bufsize,
+				   unsigned long trg_bufsize,
 				   unsigned int *hash_shift)
 {
-	unsigned int hsize, hshift, entries, blksize, i;
+	unsigned int i, hsize, hshift, hlimit, entries, *hash_count;
 	const unsigned char *data;
 	struct index *entry, **hash;
 	void *mem;
 
 	/* determine index hash size */
-	entries = (bufsize + BLK_SIZE - 1) / BLK_SIZE;
+	entries = bufsize  / BLK_SIZE;
 	hsize = entries / 4;
-	for (i = 4; (1 << i) < hsize && i < 16; i++);
+	for (i = 4; (1 << i) < hsize && i < 31; i++);
 	hsize = 1 << i;
 	hshift = 32 - i;
 	*hash_shift = hshift;
@@ -63,20 +64,62 @@ static struct index ** delta_index(const
 	entry = mem + hsize * sizeof(*hash);
 	memset(hash, 0, hsize * sizeof(*hash));
 
-	/* then populate it */
+	/* allocate an array to count hash entries */
+	hash_count = calloc(hsize, sizeof(*hash_count));
+	if (!hash_count) {
+		free(hash);
+		return NULL;
+	}
+
+	/* then populate the index */
 	data = buf + entries * BLK_SIZE - BLK_SIZE;
-	blksize = bufsize - (data - buf);
 	while (data >= buf) {
-		unsigned int val = adler32(0, data, blksize);
+		unsigned int val = adler32(0, data, BLK_SIZE);
 		i = HASH(val, hshift);
 		entry->ptr = data;
 		entry->val = val;
 		entry->next = hash[i];
 		hash[i] = entry++;
-		blksize = BLK_SIZE;
+		hash_count[i]++;
 		data -= BLK_SIZE;
  	}
 
+	/*
+	 * Determine a limit on the number of entries in the same hash
+	 * bucket.  This guard us against patological data sets causing
+	 * really bad hash distribution with most entries in the same hash
+	 * bucket that would bring us to O(m*n) computing costs (m and n
+	 * corresponding to reference and target buffer sizes).
+	 *
+	 * The more the target buffer is large, the more it is important to
+	 * have small entry lists for each hash buckets.  With such a limit
+	 * the cost is bounded to something more like O(m+n).
+	 */
+	hlimit = (1 << 26) / trg_bufsize;
+	if (hlimit < 4*BLK_SIZE)
+		hlimit = 4*BLK_SIZE;
+
+	/*
+	 * Now make sure none of the hash buckets has more entries than
+	 * we're willing to test.  Otherwise we cull the entry list
+	 * uniformly to still preserve a good repartition across
+	 * the reference buffer.
+	 */
+	for (i = 0; i < hsize; i++) {
+		if (hash_count[i] < hlimit)
+			continue;
+		entry = hash[i];
+		do {
+			struct index *keep = entry;
+			int skip = hash_count[i] / hlimit / 2;
+			do {
+				entry = entry->next;
+			} while(--skip && entry);
+			keep->next = entry;
+		} while(entry);
+	}
+	free(hash_count);
+
 	return hash;
 }
 
@@ -100,7 +143,7 @@ void *diff_delta(void *from_buf, unsigne
 
 	if (!from_size || !to_size)
 		return NULL;
-	hash = delta_index(from_buf, from_size, &hash_shift);
+	hash = delta_index(from_buf, from_size, to_size, &hash_shift);
 	if (!hash)
 		return NULL;
 
@@ -141,29 +184,27 @@ void *diff_delta(void *from_buf, unsigne
 
 	while (data < top) {
 		unsigned int moff = 0, msize = 0;
-		unsigned int blksize = MIN(top - data, BLK_SIZE);
-		unsigned int val = adler32(0, data, blksize);
-		i = HASH(val, hash_shift);
-		for (entry = hash[i]; entry; entry = entry->next) {
-			const unsigned char *ref = entry->ptr;
-			const unsigned char *src = data;
-			unsigned int ref_size = ref_top - ref;
-			if (entry->val != val)
-				continue;
-			if (ref_size > top - src)
-				ref_size = top - src;
-			while (ref_size && *src++ == *ref) {
-				ref++;
-				ref_size--;
-			}
-			ref_size = ref - entry->ptr;
-			if (ref_size > msize) {
-				/* this is our best match so far */
-				moff = entry->ptr - ref_data;
-				msize = ref_size;
-				if (msize >= 0x10000) {
-					msize = 0x10000;
+		if (data + BLK_SIZE <= top) {
+			unsigned int val = adler32(0, data, BLK_SIZE);
+			i = HASH(val, hash_shift);
+			for (entry = hash[i]; entry; entry = entry->next) {
+				const unsigned char *ref = entry->ptr;
+				const unsigned char *src = data;
+				unsigned int ref_size = ref_top - ref;
+				if (entry->val != val)
+					continue;
+				if (ref_size > top - src)
+					ref_size = top - src;
+				if (ref_size > 0x10000)
+					ref_size = 0x10000;
+				if (ref_size <= msize)
 					break;
+				while (ref_size-- && *src++ == *ref)
+					ref++;
+				if (msize < ref - entry->ptr) {
+					/* this is our best match so far */
+					msize = ref - entry->ptr;
+					moff = entry->ptr - ref_data;
 				}
 			}
 		}

^ permalink raw reply related

* Re: [PATCH] git-blame: Make the output human readable
From: linux @ 2006-03-08 19:06 UTC (permalink / raw)
  To: linux, vsu; +Cc: git, junkio
In-Reply-To: <20060308183059.GD9555@procyon.home>

> So that mk_wcwidth() must be used unconditionally, and not as a
> fallback for systems which do not provide wcwidth() in libc.

Ah, the light dawns!  I now understand your confusion.
That was exactly the idea; apologies for being unclear.

Kind of like the existing issspace(), isdigit(), isalpha(), etc.
implementations in git-compat-util.h to avoid the mysteries and vagaries
of locales.  A UTF-8-only solution is desired.

^ permalink raw reply

* gitk: bug report: invalid command name "contmergediff"
From: Rutger Nijlunsing @ 2006-03-08 18:37 UTC (permalink / raw)
  To: git

Bug report on gitk, maybe already reported.

do:
   gitk HEAD~1000..

...select first commit (why is this not done automatically?), and keep
pressing down-arrow.

This gives me:
   Error: invalid command name "contmergediff"
with detail window:
   invalid command name "contmergediff"
   invalid command name "contmergediff"
       while executing
   "contmergediff $ids"
       (procedure "gettreediffline" line 14)
       invoked from within
   "gettreediffline file7 de84f99c12d1819479116685393afb1ebe99810b"


-- 
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------

^ permalink raw reply

* Re: [PATCH] git-blame: Make the output human readable
From: Sergey Vlasov @ 2006-03-08 18:30 UTC (permalink / raw)
  To: linux; +Cc: git, junkio
In-Reply-To: <20060308180422.27978.qmail@science.horizon.com>

[-- Attachment #1: Type: text/plain, Size: 1004 bytes --]

On Wed, Mar 08, 2006 at 01:04:22PM -0500, linux@horizon.com wrote:
> > And this won't work, unless you also add that wcwidth() implementation
> > to git.
> 
> That was the general idea.  It is freely usable.
> 
> > The problem is that the wchar_t encoding is not specified anywhere -
> > glibc uses Unicode for it, but other systems can use whatever they want
> > (even locale-dependent).
> 
> Why is that a problem?  None of the code mentioned even uses wchar_t.
> The code I wrote converts from UTF-8 to straight Unicode, and that's
> what Markus Kuhn's wcwidth() expects as an argument.

wcwidth() is a standard library function which takes a wchar_t:

http://www.opengroup.org/onlinepubs/009695399/functions/wcwidth.html

It is easy to write a program which assumes that wchar_t is Unicode
without noticing it, because it will work fine with glibc...

So that mk_wcwidth() must be used unconditionally, and not as a
fallback for systems which do not provide wcwidth() in libc.

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox