Git development

Git development
 help / color / mirror / Atom feed

* Re: the war on trailing whitespace
From: Andrew Morton @ 2006-02-26  5:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7v1wxq7psj.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano <junkio@cox.net> wrote:
>
> Andrew Morton <akpm@osdl.org> writes:
> 
> > It's invariably pointless to add lines which have trailing whitespace. 
> > Nobody cares much, but my scripts spam me when it happens, so I've become
> > obsessive....
> 
> I do not call me obsessive, but I do enable pre-commit and
> pre-applypatch hooks I ship with git myself.

It's apparent that few others do this.

> > I realise that we cannot do this when doing git fetches, but when importing
> > patches and mboxes, git ought to whine loudly about input which matches the
> > above regexp, and it should offer an option to tidy it up.  Perhaps by
> > default.
> 
> I stole the policy the sample hook scripts use from you; it is
> not enabled by default, and as the tool manufacturer I am a bit
> reluctant to do so.
> 

It's not strong enough.  I mean, if you ask a developer "do you wish to add
new trialing whitespace to the kernel" then obviously their answer would be
"no".   So how do we help them in this?

I'd suggest a) git will simply refuse to apply such a patch unless given a
special `forcing' flag, b) even when thus forced, it will still warn and c)
with a different flag, it will strip-then-apply, without generating a
warning.

^ permalink raw reply

* Re: [PATCH] First cut at libifying revlist generation
From: Junio C Hamano @ 2006-02-26  3:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0602251608160.22647@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> This really just splits things up partially, and creates the
> interface to set things up by parsign the command line.

Sorry, I'll keep this but I ended up wasting the whole day
setting up a new machine for my wife (a windows box).  The next
task would be to install Cygwin so I can have fun with git...

^ permalink raw reply

* Re: the war on trailing whitespace
From: Junio C Hamano @ 2006-02-26  3:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: git
In-Reply-To: <20060225174047.0e9a6d29.akpm@osdl.org>

Andrew Morton <akpm@osdl.org> writes:

> It's invariably pointless to add lines which have trailing whitespace. 
> Nobody cares much, but my scripts spam me when it happens, so I've become
> obsessive....

I do not call me obsessive, but I do enable pre-commit and
pre-applypatch hooks I ship with git myself.

> I realise that we cannot do this when doing git fetches, but when importing
> patches and mboxes, git ought to whine loudly about input which matches the
> above regexp, and it should offer an option to tidy it up.  Perhaps by
> default.

I stole the policy the sample hook scripts use from you; it is
not enabled by default, and as the tool manufacturer I am a bit
reluctant to do so.

However, as a kernel project maintainer high in the foodchain,
I'd imagine your plea to your fellow maintainers who apply
patches using git tools would be heard well.

^ permalink raw reply

* [PATCH] annotate: Convert all -| calls to use a helper open_pipe().
From: Ryan Anderson @ 2006-02-26  3:02 UTC (permalink / raw)
  To: junkio; +Cc: git, Ryan Anderson
In-Reply-To: <11409185133616-git-send-email-ryan@michonline.com>

When we settle on a solution for ActiveState's forking issues, all
compatibility checks can be handled inside this one function.

Also, fixed an abuse of global variables in the process of cleaning this up.

Signed-off-by: Ryan Anderson <ryan@michonline.com>

---

 I should have done this before I sent the other patch out, sorry about
 that.  This one will also be in my annotate-upstream branch, but I'm a
 bit less sure where we stand on this kind of fixup at the moment.  Even
 if this is the wrong place to start, this at least flags all the spots
 that need fixing in a nice way.

 git-annotate.perl |   73 ++++++++++++++++++++++++++++++++---------------------
 1 files changed, 44 insertions(+), 29 deletions(-)

3e72e9de7ece2b9c2c9447e761ee721bdd94a33f
diff --git a/git-annotate.perl b/git-annotate.perl
index 91da6d5..ee8ff15 100755
--- a/git-annotate.perl
+++ b/git-annotate.perl
@@ -98,15 +98,15 @@ while (my $bound = pop @stack) {
 push @revqueue, $head;
 init_claim( defined $starting_rev ? $starting_rev : 'dirty');
 unless (defined $starting_rev) {
-	open(DIFF,"-|","git","diff","-R", "HEAD", "--",$filename)
+	my $diff = open_pipe("git","diff","-R", "HEAD", "--",$filename)
 		or die "Failed to call git diff to check for dirty state: $!";
 
-	_git_diff_parse(*DIFF, $head, "dirty", (
+	_git_diff_parse($diff, $head, "dirty", (
 				'author' => gitvar_name("GIT_AUTHOR_IDENT"),
 				'author_date' => sprintf("%s +0000",time()),
 				)
 			);
-	close(DIFF);
+	close($diff);
 }
 handle_rev();
 
@@ -172,20 +172,21 @@ sub handle_rev {
 sub git_rev_list {
 	my ($rev, $file) = @_;
 
+	my $revlist;
 	if ($rev_file) {
-		open(P, '<' . $rev_file);
+		open($revlist, '<' . $rev_file);
 	} else {
-		open(P,"-|","git-rev-list","--parents","--remove-empty",$rev,"--",$file)
+		$revlist = open_pipe("git-rev-list","--parents","--remove-empty",$rev,"--",$file)
 			or die "Failed to exec git-rev-list: $!";
 	}
 
 	my @revs;
-	while(my $line = <P>) {
+	while(my $line = <$revlist>) {
 		chomp $line;
 		my ($rev, @parents) = split /\s+/, $line;
 		push @revs, [ $rev, @parents ];
 	}
-	close(P);
+	close($revlist);
 
 	printf("0 revs found for rev %s (%s)\n", $rev, $file) if (@revs == 0);
 	return @revs;
@@ -194,22 +195,22 @@ sub git_rev_list {
 sub find_parent_renames {
 	my ($rev, $file) = @_;
 
-	open(P,"-|","git-diff-tree", "-M50", "-r","--name-status", "-z","$rev")
+	my $patch = open_pipe("git-diff-tree", "-M50", "-r","--name-status", "-z","$rev")
 		or die "Failed to exec git-diff: $!";
 
 	local $/ = "\0";
 	my %bound;
-	my $junk = <P>;
-	while (my $change = <P>) {
+	my $junk = <$patch>;
+	while (my $change = <$patch>) {
 		chomp $change;
-		my $filename = <P>;
+		my $filename = <$patch>;
 		chomp $filename;
 
 		if ($change =~ m/^[AMD]$/ ) {
 			next;
 		} elsif ($change =~ m/^R/ ) {
 			my $oldfilename = $filename;
-			$filename = <P>;
+			$filename = <$patch>;
 			chomp $filename;
 			if ( $file eq $filename ) {
 				my $parent = git_find_parent($rev, $oldfilename);
@@ -218,7 +219,7 @@ sub find_parent_renames {
 			}
 		}
 	}
-	close(P);
+	close($patch);
 
 	return \%bound;
 }
@@ -227,14 +228,14 @@ sub find_parent_renames {
 sub git_find_parent {
 	my ($rev, $filename) = @_;
 
-	open(REVPARENT,"-|","git-rev-list","--remove-empty", "--parents","--max-count=1","$rev","--",$filename)
+	my $revparent = open_pipe("git-rev-list","--remove-empty", "--parents","--max-count=1","$rev","--",$filename)
 		or die "Failed to open git-rev-list to find a single parent: $!";
 
-	my $parentline = <REVPARENT>;
+	my $parentline = <$revparent>;
 	chomp $parentline;
 	my ($revfound,$parent) = split m/\s+/, $parentline;
 
-	close(REVPARENT);
+	close($revparent);
 
 	return $parent;
 }
@@ -245,13 +246,13 @@ sub git_find_parent {
 sub git_diff_parse {
 	my ($parent, $rev, %revinfo) = @_;
 
-	open(DIFF,"-|","git-diff-tree","-M","-p",$rev,$parent,"--",
+	my $diff = open_pipe("git-diff-tree","-M","-p",$rev,$parent,"--",
 			$revs{$rev}{'filename'}, $revs{$parent}{'filename'})
 		or die "Failed to call git-diff for annotation: $!";
 
-	_git_diff_parse(*DIFF, $parent, $rev, %revinfo);
+	_git_diff_parse($diff, $parent, $rev, %revinfo);
 
-	close(DIFF);
+	close($diff);
 }
 
 sub _git_diff_parse {
@@ -264,7 +265,7 @@ sub _git_diff_parse {
 	my $gotheader = 0;
 	my ($remstart);
 	my ($hunk_start, $hunk_index);
-	while(<DIFF>) {
+	while(<$diff>) {
 		chomp;
 		if (m/^@@ -(\d+),(\d+) \+(\d+),(\d+)/) {
 			$remstart = $1;
@@ -335,15 +336,15 @@ sub git_cat_file {
 
 	my $blob = git_ls_tree($rev, $filename);
 
-	open(C,"-|","git","cat-file", "blob", $blob)
+	my $catfile = open_pipe("git","cat-file", "blob", $blob)
 		or die "Failed to git-cat-file blob $blob (rev $rev, file $filename): " . $!;
 
 	my @lines;
-	while(<C>) {
+	while(<$catfile>) {
 		chomp;
 		push @lines, $_;
 	}
-	close(C);
+	close($catfile);
 
 	return @lines;
 }
@@ -351,15 +352,15 @@ sub git_cat_file {
 sub git_ls_tree {
 	my ($rev, $filename) = @_;
 
-	open(T,"-|","git","ls-tree",$rev,$filename)
+	my $lstree = open_pipe("git","ls-tree",$rev,$filename)
 		or die "Failed to call git ls-tree: $!";
 
 	my ($mode, $type, $blob, $tfilename);
-	while(<T>) {
+	while(<$lstree>) {
 		($mode, $type, $blob, $tfilename) = split(/\s+/, $_, 4);
 		last if ($tfilename eq $filename);
 	}
-	close(T);
+	close($lstree);
 
 	return $blob if $filename eq $filename;
 	die "git-ls-tree failed to find blob for $filename";
@@ -379,11 +380,11 @@ sub claim_line {
 
 sub git_commit_info {
 	my ($rev) = @_;
-	open(COMMIT, "-|","git-cat-file", "commit", $rev)
+	my $commit = open_pipe("git-cat-file", "commit", $rev)
 		or die "Failed to call git-cat-file: $!";
 
 	my %info;
-	while(<COMMIT>) {
+	while(<$commit>) {
 		chomp;
 		last if (length $_ == 0);
 
@@ -397,7 +398,7 @@ sub git_commit_info {
 			$info{'committer_date'} = $3;
 		}
 	}
-	close(COMMIT);
+	close($commit);
 
 	return %info;
 }
@@ -430,3 +431,17 @@ sub gitvar_name {
     return join(' ', @field[0...(@field-4)]);
 }
 
+
+sub open_pipe {
+	my (@execlist) = @_;
+
+	my $pid = open my $kid, "-|";
+	defined $pid or die "Cannot fork: $!";
+
+	unless ($pid) {
+		exec @execlist;
+		die "Cannot exec @execlist: $!";
+	}
+
+	return $kid;
+}
-- 
1.2.3.g9ca3

^ permalink raw reply related

* [PATCH] annotate: Handle dirty state and arbitrary revisions.
From: Ryan Anderson @ 2006-02-26  1:48 UTC (permalink / raw)
  To: junkio; +Cc: git, Ryan Anderson

Also, use Getopt::Long and only process each rev once.

(Thanks to Morten Welinder for spotting the performance problems.)

Signed-off-by: Ryan Anderson <ryan@michonline.com>

---

Or pull from http://h4x0r5.com/~ryan/git/ryan.git/ annotate-upstream

 git-annotate.perl |  150 ++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 113 insertions(+), 37 deletions(-)

9ca393b055b53df4eb9284b39f38c5b42b7f93b1
diff --git a/git-annotate.perl b/git-annotate.perl
index 3800c46..91da6d5 100755
--- a/git-annotate.perl
+++ b/git-annotate.perl
@@ -8,44 +8,62 @@
 
 use warnings;
 use strict;
-use Getopt::Std;
+use Getopt::Long;
 use POSIX qw(strftime gmtime);
 
 sub usage() {
-	print STDERR 'Usage: ${\basename $0} [-s] [-S revs-file] file
-
-	-l		show long rev
-	-r		follow renames
-	-S commit	use revs from revs-file instead of calling git-rev-list
+	print STDERR 'Usage: ${\basename $0} [-s] [-S revs-file] file [ revision ]
+	-l, --long
+			Show long rev (Defaults off)
+	-r, --rename
+			Follow renames (Defaults on).
+	-S, --rev-file revs-file
+			use revs from revs-file instead of calling git-rev-list
+	-h, --help
+			This message.
 ';
 
 	exit(1);
 }
 
-our ($opt_h, $opt_l, $opt_r, $opt_S);
-getopts("hlrS:") or usage();
-$opt_h && usage();
+our ($help, $longrev, $rename, $starting_rev, $rev_file) = (0, 0, 1);
+
+my $rc = GetOptions(	"long|l" => \$longrev,
+			"help|h" => \$help,
+			"rename|r" => \$rename,
+			"rev-file|S" => \$rev_file);
+if (!$rc or $help) {
+	usage();
+}
 
 my $filename = shift @ARGV;
+if (@ARGV) {
+	$starting_rev = shift @ARGV;
+}
 
 my @stack = (
 	{
-		'rev' => "HEAD",
+		'rev' => defined $starting_rev ? $starting_rev : "HEAD",
 		'filename' => $filename,
 	},
 );
 
-our (@lineoffsets, @pendinglineoffsets);
 our @filelines = ();
-open(F,"<",$filename)
-	or die "Failed to open filename: $!";
 
-while(<F>) {
-	chomp;
-	push @filelines, $_;
+if (defined $starting_rev) {
+	@filelines = git_cat_file($starting_rev, $filename);
+} else {
+	open(F,"<",$filename)
+		or die "Failed to open filename: $!";
+
+	while(<F>) {
+		chomp;
+		push @filelines, $_;
+	}
+	close(F);
+
 }
-close(F);
-our $leftover_lines = @filelines;
+
 our %revs;
 our @revqueue;
 our $head;
@@ -66,7 +84,7 @@ while (my $bound = pop @stack) {
 			next;
 		}
 
-		if (!$opt_r) {
+		if (!$rename) {
 			next;
 		}
 
@@ -78,8 +96,18 @@ while (my $bound = pop @stack) {
 	}
 }
 push @revqueue, $head;
-init_claim($head);
-$revs{$head}{'lineoffsets'} = {};
+init_claim( defined $starting_rev ? $starting_rev : 'dirty');
+unless (defined $starting_rev) {
+	open(DIFF,"-|","git","diff","-R", "HEAD", "--",$filename)
+		or die "Failed to call git diff to check for dirty state: $!";
+
+	_git_diff_parse(*DIFF, $head, "dirty", (
+				'author' => gitvar_name("GIT_AUTHOR_IDENT"),
+				'author_date' => sprintf("%s +0000",time()),
+				)
+			);
+	close(DIFF);
+}
 handle_rev();
 
 
@@ -88,7 +116,7 @@ foreach my $l (@filelines) {
 	my ($output, $rev, $committer, $date);
 	if (ref $l eq 'ARRAY') {
 		($output, $rev, $committer, $date) = @$l;
-		if (!$opt_l && length($rev) > 8) {
+		if (!$longrev && length($rev) > 8) {
 			$rev = substr($rev,0,8);
 		}
 	} else {
@@ -102,7 +130,6 @@ foreach my $l (@filelines) {
 
 sub init_claim {
 	my ($rev) = @_;
-	my %revinfo = git_commit_info($rev);
 	for (my $i = 0; $i < @filelines; $i++) {
 		$filelines[$i] = [ $filelines[$i], '', '', '', 1];
 			# line,
@@ -117,7 +144,9 @@ sub init_claim {
 
 sub handle_rev {
 	my $i = 0;
+	my %seen;
 	while (my $rev = shift @revqueue) {
+		next if $seen{$rev}++;
 
 		my %revinfo = git_commit_info($rev);
 
@@ -143,8 +172,8 @@ sub handle_rev {
 sub git_rev_list {
 	my ($rev, $file) = @_;
 
-	if ($opt_S) {
-		open(P, '<' . $opt_S);
+	if ($rev_file) {
+		open(P, '<' . $rev_file);
 	} else {
 		open(P,"-|","git-rev-list","--parents","--remove-empty",$rev,"--",$file)
 			or die "Failed to exec git-rev-list: $!";
@@ -216,24 +245,31 @@ sub git_find_parent {
 sub git_diff_parse {
 	my ($parent, $rev, %revinfo) = @_;
 
-	my ($ri, $pi) = (0,0);
 	open(DIFF,"-|","git-diff-tree","-M","-p",$rev,$parent,"--",
 			$revs{$rev}{'filename'}, $revs{$parent}{'filename'})
 		or die "Failed to call git-diff for annotation: $!";
 
+	_git_diff_parse(*DIFF, $parent, $rev, %revinfo);
+
+	close(DIFF);
+}
+
+sub _git_diff_parse {
+	my ($diff, $parent, $rev, %revinfo) = @_;
+
+	my ($ri, $pi) = (0,0);
 	my $slines = $revs{$rev}{'lines'};
 	my @plines;
 
 	my $gotheader = 0;
-	my ($remstart, $remlength, $addstart, $addlength);
-	my ($hunk_start, $hunk_index, $hunk_adds);
+	my ($remstart);
+	my ($hunk_start, $hunk_index);
 	while(<DIFF>) {
 		chomp;
 		if (m/^@@ -(\d+),(\d+) \+(\d+),(\d+)/) {
-			($remstart, $remlength, $addstart, $addlength) = ($1, $2, $3, $4);
+			$remstart = $1;
 			# Adjust for 0-based arrays
 			$remstart--;
-			$addstart--;
 			# Reinit hunk tracking.
 			$hunk_start = $remstart;
 			$hunk_index = 0;
@@ -279,7 +315,6 @@ sub git_diff_parse {
 		}
 		$hunk_index++;
 	}
-	close(DIFF);
 	for (my $i = $ri; $i < @{$slines} ; $i++) {
 		push @plines, $slines->[$ri++];
 	}
@@ -295,13 +330,13 @@ sub get_line {
 }
 
 sub git_cat_file {
-	my ($parent, $filename) = @_;
-	return () unless defined $parent && defined $filename;
-	my $blobline = `git-ls-tree $parent $filename`;
-	my ($mode, $type, $blob, $tfilename) = split(/\s+/, $blobline, 4);
+	my ($rev, $filename) = @_;
+	return () unless defined $rev && defined $filename;
 
-	open(C,"-|","git-cat-file", "blob", $blob)
-		or die "Failed to git-cat-file blob $blob (rev $parent, file $filename): " . $!;
+	my $blob = git_ls_tree($rev, $filename);
+
+	open(C,"-|","git","cat-file", "blob", $blob)
+		or die "Failed to git-cat-file blob $blob (rev $rev, file $filename): " . $!;
 
 	my @lines;
 	while(<C>) {
@@ -313,6 +348,25 @@ sub git_cat_file {
 	return @lines;
 }
 
+sub git_ls_tree {
+	my ($rev, $filename) = @_;
+
+	open(T,"-|","git","ls-tree",$rev,$filename)
+		or die "Failed to call git ls-tree: $!";
+
+	my ($mode, $type, $blob, $tfilename);
+	while(<T>) {
+		($mode, $type, $blob, $tfilename) = split(/\s+/, $_, 4);
+		last if ($tfilename eq $filename);
+	}
+	close(T);
+
+	return $blob if $filename eq $filename;
+	die "git-ls-tree failed to find blob for $filename";
+
+}
+
+
 
 sub claim_line {
 	my ($floffset, $rev, $lines, %revinfo) = @_;
@@ -354,3 +408,25 @@ sub format_date {
 	return strftime("%Y-%m-%d %H:%M:%S " . $timezone, gmtime($timestamp));
 }
 
+# Copied from git-send-email.perl - We need a Git.pm module..
+sub gitvar {
+    my ($var) = @_;
+    my $fh;
+    my $pid = open($fh, '-|');
+    die "$!" unless defined $pid;
+    if (!$pid) {
+	exec('git-var', $var) or die "$!";
+    }
+    my ($val) = <$fh>;
+    close $fh or die "$!";
+    chomp($val);
+    return $val;
+}
+
+sub gitvar_name {
+    my ($name) = @_;
+    my $val = gitvar($name);
+    my @field = split(/\s+/, $val);
+    return join(' ', @field[0...(@field-4)]);
+}
+
-- 
1.2.3.g9ca3

^ permalink raw reply related

* the war on trailing whitespace
From: Andrew Morton @ 2006-02-26  1:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

It's invariably pointless to add lines which have trailing whitespace. 
Nobody cares much, but my scripts spam me when it happens, so I've become
obsessive.    Looking at Dave Miller's current net devel tree:

bix:/usr/src/25> grep '^+.*[    ]$' patches/git-net.patch | wc -l
    170

Note that this is purely _added_ trailing whitespace.  I'm not proposing
that the revision control system should care about pre-existing trailing
whitespace.

I got the quilt guys to generate warnings when patches add trailing
whitespace, and to provide the tools to strip it.  And I believe Larry made
similar changes to bk.

I realise that we cannot do this when doing git fetches, but when importing
patches and mboxes, git ought to whine loudly about input which matches the
above regexp, and it should offer an option to tidy it up.  Perhaps by
default.

Of course, this means that the person who sent the patch will find that his
as-yet-unsent patches don't apply to the upstream tree any more.  Well,
that's tough luck - perhaps it'll motivate him to stop adding trailing
whitespace.  The patches will still apply with `patch -l'.

Thanks for listening ;)

^ permalink raw reply

* [PATCH] First cut at libifying revlist generation
From: Linus Torvalds @ 2006-02-26  0:19 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


This really just splits things up partially, and creates the
interface to set things up by parsign the command line.

No real code changes so far, although the parsing of filenames is a bit 
stricter. In particular, if there is a "--", then we do not accept any 
filenames before it, and if there isn't any "--", then we check that _all_ 
paths listed are valid, not just the first one.

The new argument parsing automatically also gives us "--default" and 
"--not" handling as in git-rev-parse.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---

The path checking just makes sense. 

This also makes git-rev-list handle "--all" and "--not" correctly (before, 
it wouldn't handle "--not"), and teaches it to handle "--default". I 
didn't do "--since" and "--before", but I will eventually. The final end 
result is that we won't need git-rev-parse for most things.

But basically there should be no code changes, and this is just a mid-way 
point where a _partial_ set of routines from "git-rev-list" has been moved 
into a library files - revision.c. The half-way point has some ugly 
interfaces, but I don't want to do it all in one go.

The _plan_ is to:

 - move the actual history traversal into revision.c too

 - leave all the git-rev-list -specific stuff (the "--bisect" logic, 
   the print-out logic etc) in rev-list.c

 - make it trivial to make a small C version of "git diff" that just uses 
   the same "setup_revisions()" interface and then looks at 
   "revs->commits" to see what revisions were passed in (the library 
   interface is already at the point where that should work)

 - when the actual history _traversal_ is moved into "revision.c" too, it 
   should then be possible to make an equally small "git log" and friends 
   (whatchanged etc) into C using the revision.c code.

Comments? I think this is safe to apply, because I've done a diff of the 
old non-split rev-list.c against both the new rev-list.c and revision.c, 
and all the changes _look_ like just moving things around and taking the 
new "struct rev_info" into account.

It also passes all the tests, and in general seems straightforward enough. 
HOEVER, I'd still like to point out that I could have screwed something 
up, and this is just about the most core program in all of git, so it 
would make a lot of sense if more people double-checked my "trivial" 
split-up.

		Linus


diff --git a/Makefile b/Makefile
index 6c59cee..3575489 100644
--- a/Makefile
+++ b/Makefile
@@ -192,7 +192,7 @@ LIB_FILE=libgit.a
 LIB_H = \
 	blob.h cache.h commit.h count-delta.h csum-file.h delta.h \
 	diff.h epoch.h object.h pack.h pkt-line.h quote.h refs.h \
-	run-command.h strbuf.h tag.h tree.h git-compat-util.h
+	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h
 
 DIFF_OBJS = \
 	diff.o diffcore-break.o diffcore-order.o diffcore-pathspec.o \
@@ -205,7 +205,7 @@ LIB_OBJS = \
 	quote.o read-cache.o refs.o run-command.o \
 	server-info.o setup.o sha1_file.o sha1_name.o strbuf.o \
 	tag.o tree.o usage.o config.o environment.o ctype.o copy.o \
-	fetch-clone.o \
+	fetch-clone.o revision.o \
 	$(DIFF_OBJS)
 
 LIBS = $(LIB_FILE)
diff --git a/epoch.c b/epoch.c
index 3a76748..0f37492 100644
--- a/epoch.c
+++ b/epoch.c
@@ -15,6 +15,7 @@
 
 #include "cache.h"
 #include "commit.h"
+#include "revision.h"
 #include "epoch.h"
 
 struct fraction {
diff --git a/epoch.h b/epoch.h
index 7493d5a..3756009 100644
--- a/epoch.h
+++ b/epoch.h
@@ -11,7 +11,6 @@ typedef int (*emitter_func) (struct comm
 int sort_list_in_merge_order(struct commit_list *list, emitter_func emitter);
 
 /* Low bits are used by rev-list */
-#define UNINTERESTING   (1u<<10)
 #define BOUNDARY        (1u<<11)
 #define VISITED         (1u<<12)
 #define DISCONTINUITY   (1u<<13)
diff --git a/rev-list.c b/rev-list.c
index 67d2a48..d1c52a6 100644
--- a/rev-list.c
+++ b/rev-list.c
@@ -6,9 +6,10 @@
 #include "blob.h"
 #include "epoch.h"
 #include "diff.h"
+#include "revision.h"
+
+/* bits #0 and #1 in revision.h */
 
-#define SEEN		(1u << 0)
-#define INTERESTING	(1u << 1)
 #define COUNTED		(1u << 2)
 #define SHOWN		(1u << 3)
 #define TREECHANGE	(1u << 4)
@@ -38,60 +39,20 @@ static const char rev_list_usage[] =
 "    --bisect"
 ;
 
-static int dense = 1;
+struct rev_info revs;
+
 static int unpacked = 0;
 static int bisect_list = 0;
-static int tag_objects = 0;
-static int tree_objects = 0;
-static int blob_objects = 0;
-static int edge_hint = 0;
 static int verbose_header = 0;
 static int abbrev = DEFAULT_ABBREV;
 static int show_parents = 0;
 static int hdr_termination = 0;
 static const char *commit_prefix = "";
-static unsigned long max_age = -1;
-static unsigned long min_age = -1;
-static int max_count = -1;
 static enum cmit_fmt commit_format = CMIT_FMT_RAW;
 static int merge_order = 0;
 static int show_breaks = 0;
 static int stop_traversal = 0;
-static int topo_order = 0;
-static int lifo = 1;
 static int no_merges = 0;
-static const char **paths = NULL;
-static int remove_empty_trees = 0;
-
-struct name_path {
-	struct name_path *up;
-	int elem_len;
-	const char *elem;
-};
-
-static char *path_name(struct name_path *path, const char *name)
-{
-	struct name_path *p;
-	char *n, *m;
-	int nlen = strlen(name);
-	int len = nlen + 1;
-
-	for (p = path; p; p = p->up) {
-		if (p->elem_len)
-			len += p->elem_len + 1;
-	}
-	n = xmalloc(len);
-	m = n + len - (nlen + 1);
-	strcpy(m, name);
-	for (p = path; p; p = p->up) {
-		if (p->elem_len) {
-			m -= p->elem_len + 1;
-			memcpy(m, p->elem, p->elem_len);
-			m[p->elem_len] = '/';
-		}
-	}
-	return n;
-}
 
 static void show_commit(struct commit *commit)
 {
@@ -168,15 +129,15 @@ static int filter_commit(struct commit *
 		return STOP;
 	if (commit->object.flags & (UNINTERESTING|SHOWN))
 		return CONTINUE;
-	if (min_age != -1 && (commit->date > min_age))
+	if (revs.min_age != -1 && (commit->date > revs.min_age))
 		return CONTINUE;
-	if (max_age != -1 && (commit->date < max_age)) {
+	if (revs.max_age != -1 && (commit->date < revs.max_age)) {
 		stop_traversal=1;
 		return CONTINUE;
 	}
 	if (no_merges && (commit->parents && commit->parents->next))
 		return CONTINUE;
-	if (paths && dense) {
+	if (revs.paths && revs.dense) {
 		if (!(commit->object.flags & TREECHANGE))
 			return CONTINUE;
 		rewrite_parents(commit);
@@ -196,7 +157,7 @@ static int process_commit(struct commit 
 		return CONTINUE;
 	}
 
-	if (max_count != -1 && !max_count--)
+	if (revs.max_count != -1 && !revs.max_count--)
 		return STOP;
 
 	show_commit(commit);
@@ -204,19 +165,6 @@ static int process_commit(struct commit 
 	return CONTINUE;
 }
 
-static struct object_list **add_object(struct object *obj,
-				       struct object_list **p,
-				       struct name_path *path,
-				       const char *name)
-{
-	struct object_list *entry = xmalloc(sizeof(*entry));
-	entry->item = obj;
-	entry->next = *p;
-	entry->name = path_name(path, name);
-	*p = entry;
-	return &entry->next;
-}
-
 static struct object_list **process_blob(struct blob *blob,
 					 struct object_list **p,
 					 struct name_path *path,
@@ -224,7 +172,7 @@ static struct object_list **process_blob
 {
 	struct object *obj = &blob->object;
 
-	if (!blob_objects)
+	if (!revs.blob_objects)
 		return p;
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return p;
@@ -241,7 +189,7 @@ static struct object_list **process_tree
 	struct tree_entry_list *entry;
 	struct name_path me;
 
-	if (!tree_objects)
+	if (!revs.tree_objects)
 		return p;
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return p;
@@ -314,75 +262,6 @@ static void show_commit_list(struct comm
 	}
 }
 
-static void mark_blob_uninteresting(struct blob *blob)
-{
-	if (!blob_objects)
-		return;
-	if (blob->object.flags & UNINTERESTING)
-		return;
-	blob->object.flags |= UNINTERESTING;
-}
-
-static void mark_tree_uninteresting(struct tree *tree)
-{
-	struct object *obj = &tree->object;
-	struct tree_entry_list *entry;
-
-	if (!tree_objects)
-		return;
-	if (obj->flags & UNINTERESTING)
-		return;
-	obj->flags |= UNINTERESTING;
-	if (!has_sha1_file(obj->sha1))
-		return;
-	if (parse_tree(tree) < 0)
-		die("bad tree %s", sha1_to_hex(obj->sha1));
-	entry = tree->entries;
-	tree->entries = NULL;
-	while (entry) {
-		struct tree_entry_list *next = entry->next;
-		if (entry->directory)
-			mark_tree_uninteresting(entry->item.tree);
-		else
-			mark_blob_uninteresting(entry->item.blob);
-		free(entry);
-		entry = next;
-	}
-}
-
-static void mark_parents_uninteresting(struct commit *commit)
-{
-	struct commit_list *parents = commit->parents;
-
-	while (parents) {
-		struct commit *commit = parents->item;
-		commit->object.flags |= UNINTERESTING;
-
-		/*
-		 * Normally we haven't parsed the parent
-		 * yet, so we won't have a parent of a parent
-		 * here. However, it may turn out that we've
-		 * reached this commit some other way (where it
-		 * wasn't uninteresting), in which case we need
-		 * to mark its parents recursively too..
-		 */
-		if (commit->parents)
-			mark_parents_uninteresting(commit);
-
-		/*
-		 * A missing commit is ok iff its parent is marked 
-		 * uninteresting.
-		 *
-		 * We just mark such a thing parsed, so that when
-		 * it is popped next time around, we won't be trying
-		 * to parse it and get an error.
-		 */
-		if (!has_sha1_file(commit->object.sha1))
-			commit->object.parsed = 1;
-		parents = parents->next;
-	}
-}
-
 static int everybody_uninteresting(struct commit_list *orig)
 {
 	struct commit_list *list = orig;
@@ -413,7 +292,7 @@ static int count_distance(struct commit_
 
 		if (commit->object.flags & (UNINTERESTING | COUNTED))
 			break;
-		if (!paths || (commit->object.flags & TREECHANGE))
+		if (!revs.paths || (commit->object.flags & TREECHANGE))
 			nr++;
 		commit->object.flags |= COUNTED;
 		p = commit->parents;
@@ -447,7 +326,7 @@ static struct commit_list *find_bisectio
 	nr = 0;
 	p = list;
 	while (p) {
-		if (!paths || (p->item->object.flags & TREECHANGE))
+		if (!revs.paths || (p->item->object.flags & TREECHANGE))
 			nr++;
 		p = p->next;
 	}
@@ -457,7 +336,7 @@ static struct commit_list *find_bisectio
 	for (p = list; p; p = p->next) {
 		int distance;
 
-		if (paths && !(p->item->object.flags & TREECHANGE))
+		if (revs.paths && !(p->item->object.flags & TREECHANGE))
 			continue;
 
 		distance = count_distance(p);
@@ -483,7 +362,7 @@ static void mark_edge_parents_uninterest
 		if (!(parent->object.flags & UNINTERESTING))
 			continue;
 		mark_tree_uninteresting(parent->tree);
-		if (edge_hint && !(parent->object.flags & SHOWN)) {
+		if (revs.edge_hint && !(parent->object.flags & SHOWN)) {
 			parent->object.flags |= SHOWN;
 			printf("-%s\n", sha1_to_hex(parent->object.sha1));
 		}
@@ -613,7 +492,7 @@ static void try_to_simplify_commit(struc
 			return;
 
 		case TREE_NEW:
-			if (remove_empty_trees && same_tree_as_empty(p->tree)) {
+			if (revs.remove_empty_trees && same_tree_as_empty(p->tree)) {
 				*pp = parent->next;
 				continue;
 			}
@@ -664,7 +543,7 @@ static void add_parents_to_list(struct c
 	 * simplify the commit history and find the parent
 	 * that has no differences in the path set if one exists.
 	 */
-	if (paths)
+	if (revs.paths)
 		try_to_simplify_commit(commit);
 
 	parent = commit->parents;
@@ -693,7 +572,7 @@ static struct commit_list *limit_list(st
 		list = list->next;
 		free(entry);
 
-		if (max_age != -1 && (commit->date < max_age))
+		if (revs.max_age != -1 && (commit->date < revs.max_age))
 			obj->flags |= UNINTERESTING;
 		if (unpacked && has_sha1_pack(obj->sha1))
 			obj->flags |= UNINTERESTING;
@@ -704,155 +583,40 @@ static struct commit_list *limit_list(st
 				break;
 			continue;
 		}
-		if (min_age != -1 && (commit->date > min_age))
+		if (revs.min_age != -1 && (commit->date > revs.min_age))
 			continue;
 		p = &commit_list_insert(commit, p)->next;
 	}
-	if (tree_objects)
+	if (revs.tree_objects)
 		mark_edges_uninteresting(newlist);
 	if (bisect_list)
 		newlist = find_bisection(newlist);
 	return newlist;
 }
 
-static void add_pending_object(struct object *obj, const char *name)
-{
-	add_object(obj, &pending_objects, NULL, name);
-}
-
-static struct commit *get_commit_reference(const char *name, const unsigned char *sha1, unsigned int flags)
-{
-	struct object *object;
-
-	object = parse_object(sha1);
-	if (!object)
-		die("bad object %s", name);
-
-	/*
-	 * Tag object? Look what it points to..
-	 */
-	while (object->type == tag_type) {
-		struct tag *tag = (struct tag *) object;
-		object->flags |= flags;
-		if (tag_objects && !(object->flags & UNINTERESTING))
-			add_pending_object(object, tag->tag);
-		object = parse_object(tag->tagged->sha1);
-		if (!object)
-			die("bad object %s", sha1_to_hex(tag->tagged->sha1));
-	}
-
-	/*
-	 * Commit object? Just return it, we'll do all the complex
-	 * reachability crud.
-	 */
-	if (object->type == commit_type) {
-		struct commit *commit = (struct commit *)object;
-		object->flags |= flags;
-		if (parse_commit(commit) < 0)
-			die("unable to parse commit %s", name);
-		if (flags & UNINTERESTING)
-			mark_parents_uninteresting(commit);
-		return commit;
-	}
-
-	/*
-	 * Tree object? Either mark it uniniteresting, or add it
-	 * to the list of objects to look at later..
-	 */
-	if (object->type == tree_type) {
-		struct tree *tree = (struct tree *)object;
-		if (!tree_objects)
-			return NULL;
-		if (flags & UNINTERESTING) {
-			mark_tree_uninteresting(tree);
-			return NULL;
-		}
-		add_pending_object(object, "");
-		return NULL;
-	}
-
-	/*
-	 * Blob object? You know the drill by now..
-	 */
-	if (object->type == blob_type) {
-		struct blob *blob = (struct blob *)object;
-		if (!blob_objects)
-			return NULL;
-		if (flags & UNINTERESTING) {
-			mark_blob_uninteresting(blob);
-			return NULL;
-		}
-		add_pending_object(object, "");
-		return NULL;
-	}
-	die("%s is unknown object", name);
-}
-
-static void handle_one_commit(struct commit *com, struct commit_list **lst)
-{
-	if (!com || com->object.flags & SEEN)
-		return;
-	com->object.flags |= SEEN;
-	commit_list_insert(com, lst);
-}
-
-/* for_each_ref() callback does not allow user data -- Yuck. */
-static struct commit_list **global_lst;
-
-static int include_one_commit(const char *path, const unsigned char *sha1)
-{
-	struct commit *com = get_commit_reference(path, sha1, 0);
-	handle_one_commit(com, global_lst);
-	return 0;
-}
-
-static void handle_all(struct commit_list **lst)
-{
-	global_lst = lst;
-	for_each_ref(include_one_commit);
-	global_lst = NULL;
-}
-
 int main(int argc, const char **argv)
 {
-	const char *prefix = setup_git_directory();
-	struct commit_list *list = NULL;
+	struct commit_list *list;
 	int i, limited = 0;
 
+	argc = setup_revisions(argc, argv, &revs);
+
 	for (i = 1 ; i < argc; i++) {
-		int flags;
 		const char *arg = argv[i];
-		char *dotdot;
-		struct commit *commit;
-		unsigned char sha1[20];
 
 		/* accept -<digit>, like traditilnal "head" */
 		if ((*arg == '-') && isdigit(arg[1])) {
-			max_count = atoi(arg + 1);
+			revs.max_count = atoi(arg + 1);
 			continue;
 		}
 		if (!strcmp(arg, "-n")) {
 			if (++i >= argc)
 				die("-n requires an argument");
-			max_count = atoi(argv[i]);
+			revs.max_count = atoi(argv[i]);
 			continue;
 		}
 		if (!strncmp(arg,"-n",2)) {
-			max_count = atoi(arg + 2);
-			continue;
-		}
-		if (!strncmp(arg, "--max-count=", 12)) {
-			max_count = atoi(arg + 12);
-			continue;
-		}
-		if (!strncmp(arg, "--max-age=", 10)) {
-			max_age = atoi(arg + 10);
-			limited = 1;
-			continue;
-		}
-		if (!strncmp(arg, "--min-age=", 10)) {
-			min_age = atoi(arg + 10);
-			limited = 1;
+			revs.max_count = atoi(arg + 2);
 			continue;
 		}
 		if (!strcmp(arg, "--header")) {
@@ -893,23 +657,6 @@ int main(int argc, const char **argv)
 			bisect_list = 1;
 			continue;
 		}
-		if (!strcmp(arg, "--all")) {
-			handle_all(&list);
-			continue;
-		}
-		if (!strcmp(arg, "--objects")) {
-			tag_objects = 1;
-			tree_objects = 1;
-			blob_objects = 1;
-			continue;
-		}
-		if (!strcmp(arg, "--objects-edge")) {
-			tag_objects = 1;
-			tree_objects = 1;
-			blob_objects = 1;
-			edge_hint = 1;
-			continue;
-		}
 		if (!strcmp(arg, "--unpacked")) {
 			unpacked = 1;
 			limited = 1;
@@ -923,100 +670,42 @@ int main(int argc, const char **argv)
 			show_breaks = 1;
 			continue;
 		}
-		if (!strcmp(arg, "--topo-order")) {
-		        topo_order = 1;
-			lifo = 1;
-		        limited = 1;
-			continue;
-		}
-		if (!strcmp(arg, "--date-order")) {
-		        topo_order = 1;
-			lifo = 0;
-		        limited = 1;
-			continue;
-		}
-		if (!strcmp(arg, "--dense")) {
-			dense = 1;
-			continue;
-		}
-		if (!strcmp(arg, "--sparse")) {
-			dense = 0;
-			continue;
-		}
-		if (!strcmp(arg, "--remove-empty")) {
-			remove_empty_trees = 1;
-			continue;
-		}
-		if (!strcmp(arg, "--")) {
-			i++;
-			break;
-		}
-
-		if (show_breaks && !merge_order)
-			usage(rev_list_usage);
+		usage(rev_list_usage);
 
-		flags = 0;
-		dotdot = strstr(arg, "..");
-		if (dotdot) {
-			unsigned char from_sha1[20];
-			char *next = dotdot + 2;
-			*dotdot = 0;
-			if (!*next)
-				next = "HEAD";
-			if (!get_sha1(arg, from_sha1) && !get_sha1(next, sha1)) {
-				struct commit *exclude;
-				struct commit *include;
-				
-				exclude = get_commit_reference(arg, from_sha1, UNINTERESTING);
-				include = get_commit_reference(next, sha1, 0);
-				if (!exclude || !include)
-					die("Invalid revision range %s..%s", arg, next);
-				limited = 1;
-				handle_one_commit(exclude, &list);
-				handle_one_commit(include, &list);
-				continue;
-			}
-			*dotdot = '.';
-		}
-		if (*arg == '^') {
-			flags = UNINTERESTING;
-			arg++;
-			limited = 1;
-		}
-		if (get_sha1(arg, sha1) < 0) {
-			struct stat st;
-			if (lstat(arg, &st) < 0)
-				die("'%s': %s", arg, strerror(errno));
-			break;
-		}
-		commit = get_commit_reference(arg, sha1, flags);
-		handle_one_commit(commit, &list);
 	}
 
+	list = revs.commits;
+	if (list && list->next)
+		limited = 1;
+
+	if (revs.topo_order)
+		limited = 1;
+
 	if (!list &&
-	    (!(tag_objects||tree_objects||blob_objects) && !pending_objects))
+	    (!(revs.tag_objects||revs.tree_objects||revs.blob_objects) && !revs.pending_objects))
 		usage(rev_list_usage);
 
-	paths = get_pathspec(prefix, argv + i);
-	if (paths) {
+	if (revs.paths) {
 		limited = 1;
-		diff_tree_setup_paths(paths);
+		diff_tree_setup_paths(revs.paths);
 	}
+	if (revs.max_age || revs.min_age)
+		limited = 1;
 
 	save_commit_buffer = verbose_header;
 	track_object_refs = 0;
 
 	if (!merge_order) {		
 		sort_by_date(&list);
-		if (list && !limited && max_count == 1 &&
-		    !tag_objects && !tree_objects && !blob_objects) {
+		if (list && !limited && revs.max_count == 1 &&
+		    !revs.tag_objects && !revs.tree_objects && !revs.blob_objects) {
 			show_commit(list->item);
 			return 0;
 		}
 	        if (limited)
 			list = limit_list(list);
-		if (topo_order)
-			sort_in_topological_order(&list, lifo);
+		if (revs.topo_order)
+			sort_in_topological_order(&list, revs.lifo);
 		show_commit_list(list);
 	} else {
 #ifndef NO_OPENSSL
diff --git a/revision.c b/revision.c
new file mode 100644
index 0000000..17dbf9a
--- /dev/null
+++ b/revision.c
@@ -0,0 +1,370 @@
+#include "cache.h"
+#include "tag.h"
+#include "blob.h"
+#include "tree.h"
+#include "commit.h"
+#include "refs.h"
+#include "revision.h"
+
+static char *path_name(struct name_path *path, const char *name)
+{
+	struct name_path *p;
+	char *n, *m;
+	int nlen = strlen(name);
+	int len = nlen + 1;
+
+	for (p = path; p; p = p->up) {
+		if (p->elem_len)
+			len += p->elem_len + 1;
+	}
+	n = xmalloc(len);
+	m = n + len - (nlen + 1);
+	strcpy(m, name);
+	for (p = path; p; p = p->up) {
+		if (p->elem_len) {
+			m -= p->elem_len + 1;
+			memcpy(m, p->elem, p->elem_len);
+			m[p->elem_len] = '/';
+		}
+	}
+	return n;
+}
+
+struct object_list **add_object(struct object *obj,
+				       struct object_list **p,
+				       struct name_path *path,
+				       const char *name)
+{
+	struct object_list *entry = xmalloc(sizeof(*entry));
+	entry->item = obj;
+	entry->next = *p;
+	entry->name = path_name(path, name);
+	*p = entry;
+	return &entry->next;
+}
+
+static void mark_blob_uninteresting(struct blob *blob)
+{
+	if (blob->object.flags & UNINTERESTING)
+		return;
+	blob->object.flags |= UNINTERESTING;
+}
+
+void mark_tree_uninteresting(struct tree *tree)
+{
+	struct object *obj = &tree->object;
+	struct tree_entry_list *entry;
+
+	if (obj->flags & UNINTERESTING)
+		return;
+	obj->flags |= UNINTERESTING;
+	if (!has_sha1_file(obj->sha1))
+		return;
+	if (parse_tree(tree) < 0)
+		die("bad tree %s", sha1_to_hex(obj->sha1));
+	entry = tree->entries;
+	tree->entries = NULL;
+	while (entry) {
+		struct tree_entry_list *next = entry->next;
+		if (entry->directory)
+			mark_tree_uninteresting(entry->item.tree);
+		else
+			mark_blob_uninteresting(entry->item.blob);
+		free(entry);
+		entry = next;
+	}
+}
+
+void mark_parents_uninteresting(struct commit *commit)
+{
+	struct commit_list *parents = commit->parents;
+
+	while (parents) {
+		struct commit *commit = parents->item;
+		commit->object.flags |= UNINTERESTING;
+
+		/*
+		 * Normally we haven't parsed the parent
+		 * yet, so we won't have a parent of a parent
+		 * here. However, it may turn out that we've
+		 * reached this commit some other way (where it
+		 * wasn't uninteresting), in which case we need
+		 * to mark its parents recursively too..
+		 */
+		if (commit->parents)
+			mark_parents_uninteresting(commit);
+
+		/*
+		 * A missing commit is ok iff its parent is marked 
+		 * uninteresting.
+		 *
+		 * We just mark such a thing parsed, so that when
+		 * it is popped next time around, we won't be trying
+		 * to parse it and get an error.
+		 */
+		if (!has_sha1_file(commit->object.sha1))
+			commit->object.parsed = 1;
+		parents = parents->next;
+	}
+}
+
+static void add_pending_object(struct rev_info *revs, struct object *obj, const char *name)
+{
+	add_object(obj, &revs->pending_objects, NULL, name);
+}
+
+static struct commit *get_commit_reference(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags)
+{
+	struct object *object;
+
+	object = parse_object(sha1);
+	if (!object)
+		die("bad object %s", name);
+
+	/*
+	 * Tag object? Look what it points to..
+	 */
+	while (object->type == tag_type) {
+		struct tag *tag = (struct tag *) object;
+		object->flags |= flags;
+		if (revs->tag_objects && !(object->flags & UNINTERESTING))
+			add_pending_object(revs, object, tag->tag);
+		object = parse_object(tag->tagged->sha1);
+		if (!object)
+			die("bad object %s", sha1_to_hex(tag->tagged->sha1));
+	}
+
+	/*
+	 * Commit object? Just return it, we'll do all the complex
+	 * reachability crud.
+	 */
+	if (object->type == commit_type) {
+		struct commit *commit = (struct commit *)object;
+		object->flags |= flags;
+		if (parse_commit(commit) < 0)
+			die("unable to parse commit %s", name);
+		if (flags & UNINTERESTING)
+			mark_parents_uninteresting(commit);
+		return commit;
+	}
+
+	/*
+	 * Tree object? Either mark it uniniteresting, or add it
+	 * to the list of objects to look at later..
+	 */
+	if (object->type == tree_type) {
+		struct tree *tree = (struct tree *)object;
+		if (!revs->tree_objects)
+			return NULL;
+		if (flags & UNINTERESTING) {
+			mark_tree_uninteresting(tree);
+			return NULL;
+		}
+		add_pending_object(revs, object, "");
+		return NULL;
+	}
+
+	/*
+	 * Blob object? You know the drill by now..
+	 */
+	if (object->type == blob_type) {
+		struct blob *blob = (struct blob *)object;
+		if (!revs->blob_objects)
+			return NULL;
+		if (flags & UNINTERESTING) {
+			mark_blob_uninteresting(blob);
+			return NULL;
+		}
+		add_pending_object(revs, object, "");
+		return NULL;
+	}
+	die("%s is unknown object", name);
+}
+
+static void add_one_commit(struct commit *commit, struct rev_info *revs)
+{
+	if (!commit || (commit->object.flags & SEEN))
+		return;
+	commit->object.flags |= SEEN;
+	commit_list_insert(commit, &revs->commits);
+}
+
+static int all_flags;
+static struct rev_info *all_revs;
+
+static int handle_one_ref(const char *path, const unsigned char *sha1)
+{
+	struct commit *commit = get_commit_reference(all_revs, path, sha1, all_flags);
+	add_one_commit(commit, all_revs);
+	return 0;
+}
+
+static void handle_all(struct rev_info *revs, unsigned flags)
+{
+	all_revs = revs;
+	all_flags = flags;
+	for_each_ref(handle_one_ref);
+}
+
+/*
+ * Parse revision information, filling in the "rev_info" structure,
+ * and removing the used arguments from the argument list.
+ *
+ * Returns the number of arguments left ("new argc").
+ */
+int setup_revisions(int argc, const char **argv, struct rev_info *revs)
+{
+	int i, flags, seen_dashdash;
+	const char *def = NULL;
+	const char **unrecognized = argv+1;
+	int left = 1;
+
+	memset(revs, 0, sizeof(*revs));
+	revs->lifo = 1;
+	revs->dense = 1;
+	revs->prefix = setup_git_directory();
+	revs->max_age = -1;
+	revs->min_age = -1;
+	revs->max_count = -1;
+
+	/* First, search for "--" */
+	seen_dashdash = 0;
+	for (i = 1; i < argc; i++) {
+		const char *arg = argv[i];
+		if (strcmp(arg, "--"))
+			continue;
+		argv[i] = NULL;
+		argc = i;
+		revs->paths = get_pathspec(revs->prefix, argv + i + 1);
+		seen_dashdash = 1;
+		break;
+	}
+
+	flags = 0;
+	for (i = 1; i < argc; i++) {
+		struct commit *commit;
+		const char *arg = argv[i];
+		unsigned char sha1[20];
+		char *dotdot;
+		int local_flags;
+
+		if (*arg == '-') {
+			if (!strncmp(arg, "--max-count=", 12)) {
+				revs->max_count = atoi(arg + 12);
+				continue;
+			}
+			if (!strncmp(arg, "--max-age=", 10)) {
+				revs->max_age = atoi(arg + 10);
+				continue;
+			}
+			if (!strncmp(arg, "--min-age=", 10)) {
+				revs->min_age = atoi(arg + 10);
+				continue;
+			}
+			if (!strcmp(arg, "--all")) {
+				handle_all(revs, flags);
+				continue;
+			}
+			if (!strcmp(arg, "--not")) {
+				flags ^= UNINTERESTING;
+				continue;
+			}
+			if (!strcmp(arg, "--default")) {
+				if (++i >= argc)
+					die("bad --default argument");
+				def = argv[i];
+				continue;
+			}
+			if (!strcmp(arg, "--topo-order")) {
+				revs->topo_order = 1;
+				continue;
+			}
+			if (!strcmp(arg, "--date-order")) {
+				revs->lifo = 0;
+				revs->topo_order = 1;
+				continue;
+			}
+			if (!strcmp(arg, "--dense")) {
+				revs->dense = 1;
+				continue;
+			}
+			if (!strcmp(arg, "--sparse")) {
+				revs->dense = 0;
+				continue;
+			}
+			if (!strcmp(arg, "--remove-empty")) {
+				revs->remove_empty_trees = 1;
+				continue;
+			}
+			if (!strcmp(arg, "--objects")) {
+				revs->tag_objects = 1;
+				revs->tree_objects = 1;
+				revs->blob_objects = 1;
+				continue;
+			}
+			if (!strcmp(arg, "--objects-edge")) {
+				revs->tag_objects = 1;
+				revs->tree_objects = 1;
+				revs->blob_objects = 1;
+				revs->edge_hint = 1;
+				continue;
+			}
+			*unrecognized++ = arg;
+			left++;
+			continue;
+		}
+		dotdot = strstr(arg, "..");
+		if (dotdot) {
+			unsigned char from_sha1[20];
+			char *next = dotdot + 2;
+			*dotdot = 0;
+			if (!*next)
+				next = "HEAD";
+			if (!get_sha1(arg, from_sha1) && !get_sha1(next, sha1)) {
+				struct commit *exclude;
+				struct commit *include;
+
+				exclude = get_commit_reference(revs, arg, from_sha1, flags ^ UNINTERESTING);
+				include = get_commit_reference(revs, next, sha1, flags);
+				if (!exclude || !include)
+					die("Invalid revision range %s..%s", arg, next);
+				add_one_commit(exclude, revs);
+				add_one_commit(include, revs);
+				continue;
+			}
+			*dotdot = '.';
+		}
+		local_flags = 0;
+		if (*arg == '^') {
+			local_flags = UNINTERESTING;
+			arg++;
+		}
+		if (get_sha1(arg, sha1) < 0) {
+			struct stat st;
+			int j;
+
+			if (seen_dashdash || local_flags)
+				die("bad revision '%s'", arg);
+
+			/* If we didn't have a "--", all filenames must exist */
+			for (j = i; j < argc; j++) {
+				if (lstat(argv[j], &st) < 0)
+					die("'%s': %s", arg, strerror(errno));
+			}
+			revs->paths = get_pathspec(revs->prefix, argv + i);
+			break;
+		}
+		commit = get_commit_reference(revs, arg, sha1, flags ^ local_flags);
+		add_one_commit(commit, revs);
+	}
+	if (def && !revs->commits) {
+		unsigned char sha1[20];
+		struct commit *commit;
+		if (get_sha1(def, sha1) < 0)
+			die("bad default revision '%s'", def);
+		commit = get_commit_reference(revs, def, sha1, 0);
+		add_one_commit(commit, revs);
+	}
+	*unrecognized = NULL;
+	return left;
+}
diff --git a/revision.h b/revision.h
new file mode 100644
index 0000000..5170ac4
--- /dev/null
+++ b/revision.h
@@ -0,0 +1,48 @@
+#ifndef REVISION_H
+#define REVISION_H
+
+#define SEEN		(1u<<0)
+#define UNINTERESTING   (1u<<1)
+
+struct rev_info {
+	/* Starting list */
+	struct commit_list *commits;
+	struct object_list *pending_objects;
+
+	/* Basic information */
+	const char *prefix;
+	const char **paths;
+
+	/* Traversal flags */
+	unsigned int	dense:1,
+			remove_empty_trees:1,
+			lifo:1,
+			topo_order:1,
+			tag_objects:1,
+			tree_objects:1,
+			blob_objects:1,
+			edge_hint:1;
+
+	/* special limits */
+	int max_count;
+	unsigned long max_age;
+	unsigned long min_age;
+};
+
+/* revision.c */
+extern int setup_revisions(int argc, const char **argv, struct rev_info *revs);
+extern void mark_parents_uninteresting(struct commit *commit);
+extern void mark_tree_uninteresting(struct tree *tree);
+
+struct name_path {
+	struct name_path *up;
+	int elem_len;
+	const char *elem;
+};
+
+extern struct object_list **add_object(struct object *obj,
+				       struct object_list **p,
+				       struct name_path *path,
+				       const char *name);
+
+#endif

^ permalink raw reply related

* Re: [PATCH] git-fetch: print the new and old ref when fast-forwarding
From: Junio C Hamano @ 2006-02-25 20:53 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Lukas Sandström, Git Mailing List
In-Reply-To: <Pine.LNX.4.63.0602252107110.17999@wbgn013.biozentrum.uni-wuerzburg.de>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> This is useful when you check out new changes with gitk.
>> Just copy/paste the old ref into gitk from the terminal.
>
> Why does "gitk ORIG_HEAD..HEAD" not work? (It also does the correct thing 
> when pulling...)

For most projects and repositories with single interesting head,
that would work just fine.

If you use additional Pull: lines to track more than one remote
refs, this patch would help.

For example, if you are tracking my "next" while keeping an eye
on my "master" and "pu", your .git/remotes/origin file may have
something like this:

	URL: git://git.kernel.org/pub/scm/git/git.git
        Pull: next:origin
        Pull: master:ko-master
        Pull: pu:ko-pu

When "git pull origin" pulls my next branch into your current
branch (typically "master"), it also fast forwards your tracking
branches ko-master and ko-pu.  If you want to see what I merged
in the meantime, you would want to get the old value of
ko-master and the new value and feed them to gitk (or git log).
ORIG_HEAD in this case was the old value of _your_ current
branch head, and is not useful to see what happened to my master
branch.

^ permalink raw reply

* Re: [PATCH] git-fetch: print the new and old ref when fast-forwarding
From: Johannes Schindelin @ 2006-02-25 20:08 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: Git Mailing List, Junio C Hamano
In-Reply-To: <44003D6D.2010406@etek.chalmers.se>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 273 bytes --]

Hi,

On Sat, 25 Feb 2006, Lukas Sandström wrote:

> This is useful when you check out new changes with gitk.
> Just copy/paste the old ref into gitk from the terminal.

Why does "gitk ORIG_HEAD..HEAD" not work? (It also does the correct thing 
when pulling...)

Hth,
Dscho

^ permalink raw reply

* Re: [PATCH] diff-delta: produce optimal pack data
From: Linus Torvalds @ 2006-02-25 19:18 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git, Carl Baldwin
In-Reply-To: <Pine.LNX.4.64.0602242326381.31162@localhost.localdomain>

On Sat, 25 Feb 2006, Nicolas Pitre wrote:
> 
> Yes, the hash is larger.  There is a cost in memory usage but not really 
> in CPU cycles.

Note that memory usage translates almost 1:1 (or worse) to CPU cycles in 
almost all real-life behaviours. Only in carefully tuned benchmarks does 
it not.

Increased memory usage means more paging, and worse cache behaviour. Now, 
hashes aren't wonderful for caches in the first place, but imagine the 
hump you pass when the data doesn't fit in a 64kB L1 any more (or a 256kB 
L2). Huge.

> > You'll find a lot of that in any file: three or four bytes of similarity 
> > just doesn't sound worthwhile to go digging after. 
> 
> Well after having experimented a lot with multiple parameters I think 
> they are worth it after all.  Not only they provide for optimal deltas, 
> but their hash is faster to compute than larger blocks which seems to 
> counter balance for the cost of increased hash list.

Hey, numbers talk. If you've got the numbers, I'll just shut up ;)

		Linus

^ permalink raw reply

* [PATCH] git-fetch: print the new and old ref when fast-forwarding
From: Lukas Sandström @ 2006-02-25 11:20 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano

Signed-off-by: Lukas Sandström <lukass@etek.chalmers.se>

---
This is useful when you check out new changes with gitk.
Just copy/paste the old ref into gitk from the terminal.

 git-fetch.sh |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

78f2844c17bb8627e84389e3026906d074c43363
diff --git a/git-fetch.sh b/git-fetch.sh
index fcc24f8..a3e2a34 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -164,6 +164,7 @@ fast_forward_local () {
 		;;
 	    *,$local)
 		echo >&2 "* $1: fast forward to $3"
+		echo >&2 "  from $local to $2"
 		git-update-ref "$1" "$2" "$local"
 		;;
 	    *)
-- 
1.2.3.gc412

^ permalink raw reply related

* Re: gitview: Use monospace font to draw the branch and tag name
From: Aneesh Kumar @ 2006-02-25  7:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vzmkg6kj7.fsf@assigned-by-dhcp.cox.net>

On 2/25/06, Junio C Hamano <junkio@cox.net> wrote:
>
> >> I have an impression that hardcoding UI policy like this is
> >> generally frowned upon.
> >
> > But with that changes branch and the tag name looks neat.
>
> The point being that it only looks neat on your display, with
> its size and resolution, and to your eye.
>
> People have different tastes and equipments.  One thing I really
> liked gitk (I do not know if gitview does this as well -- if so
> I am happy) is I can do - and + to change the text size
> depending on what display I am working on.
>
>

Point taken. Right now gitview doesn't support + and - key binding. I
will try to add that. For the time being can you can drop my font and
font size related  changes and apply rest of the patches.

-aneesh

^ permalink raw reply

* Re: [PATCH] git-rm: Fix to properly handle files with spaces, tabs, newlines, etc.
From: Junio C Hamano @ 2006-02-25  6:05 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git, Carl Worth, Krzysiek Pawlik
In-Reply-To: <81b0412b0602240523l1b10c910q6e5d2e3cef82e306@mail.gmail.com>

"Alex Riesen" <raa.lkml@gmail.com> writes:

> On 2/23/06, Carl Worth <cworth@cworth.org> wrote:
>> +# Setup some files to be removed, some with funny characters
>> +touch -- foo bar baz 'space embedded' 'tab     embedded' 'newline
>> +embedded' -q
>> +git-add -- foo bar baz 'space embedded' 'tab   embedded' 'newline
>> +embedded' -q
>> +git-commit -m "add files"
>
> This doesn't work on some exotic filesystems (ntfs and fat):

Sorry to have applied this without thinking.  Yes, we had
disabled a test that uses tab for this exact reason, but I have
forgotten about it.

^ permalink raw reply

* Re: gitview: Use monospace font to draw the branch and tag name
From: Junio C Hamano @ 2006-02-25  6:05 UTC (permalink / raw)
  To: Aneesh Kumar; +Cc: git
In-Reply-To: <cc723f590602240829y3ebffaf9mdd7de0cd2a9051e@mail.gmail.com>

"Aneesh Kumar" <aneesh.kumar@gmail.com> writes:

> On 2/24/06, Junio C Hamano <junkio@cox.net> wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@gmail.com> writes:
>>
>> > This patch address the below:
>> > Use monospace font to draw branch and tag name
>> > set the font size to 13.
>>
>> I have an impression that hardcoding UI policy like this is
>> generally frowned upon.
>
> But with that changes branch and the tag name looks neat.

The point being that it only looks neat on your display, with
its size and resolution, and to your eye.

People have different tastes and equipments.  One thing I really
liked gitk (I do not know if gitview does this as well -- if so
I am happy) is I can do - and + to change the text size
depending on what display I am working on.

^ permalink raw reply

* Re: FYI: git-am allows creation of empty commits.
From: Junio C Hamano @ 2006-02-25  6:04 UTC (permalink / raw)
  To: Eric Wong; +Cc: Eric W. Biederman, git, Martin Langhoff, Matthias Urlichs
In-Reply-To: <20060224131922.GA19401@localdomain>

Eric Wong <normalperson@yhbt.net> writes:

>> I think 99.9% of the time it is a mistake if a single-parented
>> commit has the same tree as its parent commit has, so having a
>> check in commit-tree may not be a bad idea.
>
> This would break importers, more than 0.1% I think...  Arch
> definitely allows empty commits for getting log messages in.
> SVN forbids them from their POV, but they also have things
> that we can't see when we import (properties like: mime,
> externals, eol-style) causing us to write the same tree twice.
> Not sure about CVS...
>
> Maybe a flag such as --force could be added.

Good point perhaps.  Maybe an explicit --no-empty should ask for
it, and git-am should use it.  As to git-commit I am unsure.

In the meantime I'd drop the patch.

^ permalink raw reply

* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-25  5:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git, Carl Baldwin
In-Reply-To: <Pine.LNX.4.64.0602242326381.31162@localhost.localdomain>

OOps.... Forgot to complete one paragraph.

On Sat, 25 Feb 2006, Nicolas Pitre wrote:

> I of course looked at the time to pack vs the size reduction in my 
> tests.  And really like I said above the cost is well balanced.  The 
> only issue is that smaller blocks are more likely to trap into 
> patological data sets.  But that problem does exist with larger blocks 
> too, to a lesser degree of course but still.  For example, using a 16 
> block size with adler32, computing a delta between two files 

... as provided by Carl takes up to _nine_ minutes for a _single_ delta !

So regardless of the block size used, the issue right now has more to do 
with that combinatorial explosion than the actual block size.  And 
preventing that patological case from expending out of bounds is pretty 
easy to do.

OK I just tested a tentative patch to trap that case and the time to 
delta those two 20MB files passed from over 9 minutes to only 36 seconds 
here, with less than 10% in delta size difference.  So I think I might 
be on the right track.  Further tuning might help even further.

Nicolas

^ permalink raw reply

* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-25  5:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git, Carl Baldwin
In-Reply-To: <Pine.LNX.4.64.0602241952140.22647@g5.osdl.org>

On Fri, 24 Feb 2006, Linus Torvalds wrote:

> 
> 
> On Fri, 24 Feb 2006, Nicolas Pitre wrote:
> > 
> > Well, that's the compromize to make.  By default the version with 
> > adler32 used 16 byte blocks to index the reference buffer.  That means 
> > you can match target data against the reference only if whole 16 byte 
> > blocks match.  Then, if you fix a typo in the target buffer then you'll 
> > inevitably need 16 literal bytes in the delta instead of only 
> > one because you won't be able to resynchronize with the reference buffer 
> > until the next 16 byte block.
> 
> That shouldn't be true. Once you find a 16-byte match, you can search 
> forward (in theory you should be able to search backwards too, but by that 
> time we've already expanded the non-matching previous bytes, I think).

Obviously, but that's not my point.  What I mean is that small changes 
will kind of sabotage the match since you'll have to scan forward 
(adding literal bytes to the delta output in the mean time) until you 
find another 16 byte block that matches.  This is harder to find than a 
3 byte block.  Hence you'll end up adding more literal bytes (up to 16 
of them) until you can match the reference buffer again even if you only 
flipped one byte in the target buffer.  In some cases that makes up for 
deltas that are twice as large.

> > What I've made in my last delta patch is to reduce that 16 byte block to 
> > only 3 bytes.  Why 3 bytes? Because less than that produces smaller 
> > delta data if done with literal bytes directly, and 3 bytes provided 
> > enough bits to hash.
> 
> On the other hand, the cost is that your lookups _are_ going to be more 
> expensive. Regardless of how good the hash is, basically you have 16/3 
> more hash-entries to look up, so you've made compression more expensive in 
> footprint, at least (I assume you've made the hash appropriately larger).

Yes, the hash is larger.  There is a cost in memory usage but not really 
in CPU cycles.

> Also, at 3 bytes, insertion is at least equally dense (three bytes of data 
> vs three bytes of offset into the source), and can be worse (the offset 
> might be 5 bytes, no?). So it would seem like you'd be better off with 4+ 
> bytes, at which point the delta should be a win.

The code already discriminate the space of a block copy notation given 
the offset and size vs the space for the equivalent literal bytes.  So 
the optimal encoding is always chosen already.

In fact, if you want to copy up to 15 bytes from offset 0 that will be 
encoded with only 2 bytes in the delta.  The only case that is 
suboptimal is when you want to copy only two bytes from offset 0 (2 
delta bytes) but only two bytes is mever matched by the hash lookup 
since the hash is computed with 3 bytes.  In that case 2 literal bytes 
will be added to the delta plus opcode = 3 bytes.  I considered that 
special case not worth it.  However copying a block of 3 bytes that gets 
encoded into 3 bytes of delta is quite common (that'd take 4 bytes of 
delta if they were literals).

As for using more bytes for block hashing, that increase thenumber of 
cycles to compute the hash.  The adler32 version reads 16 bytes for 
every byte offset in the target file while my latest version only reads 
3 bytes for every byte offset.  So in effect my target hash computation 
is faster than the adler32 one.  However there is potentially more 
entries in the same hash bucket to validate especially with repetitive 
data.

> Have you tried some half-way point, like ~8 bytes?

Yes, and while the needed cycles tend to remain the same on average, the 
resulting pack gets larger.

> > Now the problem comes when indexing a reference file full of:
> > 
> >         0x46f8, 0x000b, 0x42fe, 0x0000, 0xffc0, 0x0001, 0xff00, 0x0008,
> ...
> > 
> > There is a bunch of ", 0x" that get hashed to the same thing.
> 
> You'll find a lot of that in any file: three or four bytes of similarity 
> just doesn't sound worthwhile to go digging after. 

Well after having experimented a lot with multiple parameters I think 
they are worth it after all.  Not only they provide for optimal deltas, 
but their hash is faster to compute than larger blocks which seems to 
counter balance for the cost of increased hash list.

> > The adler32 made that particular example a non issue since the 
> > likelyhood of many 16 byte blocks to be the same is pretty low in this 
> > case.  But the flaw remains if for example there is lots of similar 16 
> > byte blocks, like a binary file with lots of zeroes for example.  In 
> > fact, the performance problem Carl is having does use the diff-delta 
> > version still using adler32.
> 
> Agreed. I think limiting the hash length is a fine idea regardless, I just 
> think it sounds dangerous with the three-byte thing where a lot of matches 
> should be expected (never mind ", 0x", just things like newlines and tabs 
> in source code).

They are usually less than the number of lines.  Yet if you have a 1000 
line source file and let's suppose that we keep only 50 hashed "\n\t\t" 
this is sufficient to provide enough opportunities for matching that 
plus common patterns like a following "if (" for example.

> Only considering 16-byte sequences of similarities is a trade-off of 
> packing cost vs win.. Not saying other block-sizes aren't worth testing, 
> but I suspect trying too hard is going to be too costly.

I of course looked at the time to pack vs the size reduction in my 
tests.  And really like I said above the cost is well balanced.  The 
only issue is that smaller blocks are more likely to trap into 
patological data sets.  But that problem does exist with larger blocks 
too, to a lesser degree of course but still.  For example, using a 16 
block size with adler32, computing a delta between two files 

> Especially as deltas compress _worse_ the smaller they are. Bigger 
> "insert" chunks probably compress a lot better than a copy chunk. 

Yes, but given that we favor deltas from larger to smaller already those 
inserts are already not making much differences.  They have to be quite 
large to effectively provide better zlib compression.

> Have you looked at the delta size vs compression?

That's certainly an additional test worth adding to try_delta(). 
max_size should be smaller than the original object deflated size making 
sure we won't store deltas that might end up larger than the undeltified 
object.

Nicolas

^ permalink raw reply

* Re: [PATCH] diff-delta: produce optimal pack data
From: Linus Torvalds @ 2006-02-25  4:05 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git, Carl Baldwin
In-Reply-To: <Pine.LNX.4.64.0602242130030.31162@localhost.localdomain>

On Fri, 24 Feb 2006, Nicolas Pitre wrote:
> 
> Well, that's the compromize to make.  By default the version with 
> adler32 used 16 byte blocks to index the reference buffer.  That means 
> you can match target data against the reference only if whole 16 byte 
> blocks match.  Then, if you fix a typo in the target buffer then you'll 
> inevitably need 16 literal bytes in the delta instead of only 
> one because you won't be able to resynchronize with the reference buffer 
> until the next 16 byte block.

That shouldn't be true. Once you find a 16-byte match, you can search 
forward (in theory you should be able to search backwards too, but by that 
time we've already expanded the non-matching previous bytes, I think).

But I'm no xdelta expert, so I migt be wrong.

> What I've made in my last delta patch is to reduce that 16 byte block to 
> only 3 bytes.  Why 3 bytes? Because less than that produces smaller 
> delta data if done with literal bytes directly, and 3 bytes provided 
> enough bits to hash.

On the other hand, the cost is that your lookups _are_ going to be more 
expensive. Regardless of how good the hash is, basically you have 16/3 
more hash-entries to look up, so you've made compression more expensive in 
footprint, at least (I assume you've made the hash appropriately larger).

Also, at 3 bytes, insertion is at least equally dense (three bytes of data 
vs three bytes of offset into the source), and can be worse (the offset 
might be 5 bytes, no?). So it would seem like you'd be better off with 4+ 
bytes, at which point the delta should be a win.

Have you tried some half-way point, like ~8 bytes?

> Now the problem comes when indexing a reference file full of:
> 
>         0x46f8, 0x000b, 0x42fe, 0x0000, 0xffc0, 0x0001, 0xff00, 0x0008,
...
> 
> There is a bunch of ", 0x" that get hashed to the same thing.

You'll find a lot of that in any file: three or four bytes of similarity 
just doesn't sound worthwhile to go digging after. 

> The adler32 made that particular example a non issue since the 
> likelyhood of many 16 byte blocks to be the same is pretty low in this 
> case.  But the flaw remains if for example there is lots of similar 16 
> byte blocks, like a binary file with lots of zeroes for example.  In 
> fact, the performance problem Carl is having does use the diff-delta 
> version still using adler32.

Agreed. I think limiting the hash length is a fine idea regardless, I just 
think it sounds dangerous with the three-byte thing where a lot of matches 
should be expected (never mind ", 0x", just things like newlines and tabs 
in source code).

Only considering 16-byte sequences of similarities is a trade-off of 
packing cost vs win.. Not saying other block-sizes aren't worth testing, 
but I suspect trying too hard is going to be too costly.

Especially as deltas compress _worse_ the smaller they are. Bigger 
"insert" chunks probably compress a lot better than a copy chunk. 

Have you looked at the delta size vs compression?

		Linus

^ permalink raw reply

* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-25  3:53 UTC (permalink / raw)
  To: Carl Baldwin; +Cc: Junio C Hamano, git
In-Reply-To: <20060224225023.GA28538@hpsvcnb.fc.hp.com>

On Fri, 24 Feb 2006, Carl Baldwin wrote:

> > If you can add them into a single .tgz with instructions on how 
> > to reproduce the issue and provide me with an URL where I can fetch it 
> > that'd be perfect.
> 
> I will do this in an email off of the list because these files really
> shouldn't be available on a public list.

OK I have the files, and I can confirm that your problem is of the same 
combinatorial explosion type I already talked about, _even_ with the 
version using adler32.  This is really O(m*n) where m and n being the 
size of two consecutive versions of the files.  But since m and n are 
both _huge_ then the delta code really goes out to lunch.

I'm working on a patch to cap this to something like O(m+n).

Stay tuned.

Nicolas

^ permalink raw reply

* Re: [PATCH] diff-delta: produce optimal pack data
From: Nicolas Pitre @ 2006-02-25  3:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git, Carl Baldwin
In-Reply-To: <Pine.LNX.4.64.0602241637480.22647@g5.osdl.org>

On Fri, 24 Feb 2006, Linus Torvalds wrote:

> 
> 
> On Fri, 24 Feb 2006, Nicolas Pitre wrote:
> > 
> > Currently, diff-delta takes blocks of data in the reference file and 
> > hash them.  When the target file is scanned, it uses the hash to match 
> > blocks from the target file with the reference file.
> > 
> > If blocks are hashed evenly the cost of  producing a delta is at most 
> > O(n+m) where n and m are the size of the reference and target files 
> > respectively.  In other words, with good data set the cost is linear.
> 
> Assuming the hash is good, of course.
> 
> I think this was the problem with you trying something simpler than 
> adler32..

Well, that's the compromize to make.  By default the version with 
adler32 used 16 byte blocks to index the reference buffer.  That means 
you can match target data against the reference only if whole 16 byte 
blocks match.  Then, if you fix a typo in the target buffer then you'll 
inevitably need 16 literal bytes in the delta instead of only 
one because you won't be able to resynchronize with the reference buffer 
until the next 16 byte block.

What I've made in my last delta patch is to reduce that 16 byte block to 
only 3 bytes.  Why 3 bytes? Because less than that produces smaller 
delta data if done with literal bytes directly, and 3 bytes provided 
enough bits to hash.  I also made those 3 byte blocks overlap so 
indexing would start at any offset with byte precision.  This really 
allows for optimal deltas so that they cannot be smaller.

Now the problem comes when indexing a reference file full of:

        0x46f8, 0x000b, 0x42fe, 0x0000, 0xffc0, 0x0001, 0xff00, 0x0008,
        0x03e0, 0x0009, 0x0f01, 0x0003, 0x8072, 0x0000, 0x0400, 0x0000,
        0x0046, 0x0003, 0x9180, 0x0001, 0x0003, 0x0008, 0x02eb, 0x0003,
        0x8072, 0x0000, 0x0400, 0x0000, 0x8010, 0x0008, 0x0010, 0x0000,
        0x0361, 0x0003, 0x037e, 0x0004, 0x3941, 0x0002, 0x0b0f, 0x0003,
        0x8072, 0x0000, 0x0400, 0x0000, 0x000a, 0x000b, 0x0346, 0x000c,
        0x11fe, 0x0000, 0x3717, 0x0003, 0x8072, 0x0000, 0x0400, 0x0000,
        0x8010, 0x0008, 0x000e, 0x0000, 0x0361, 0x0003, 0x8060, 0x0000,

There is a bunch of ", 0x" that get hashed to the same thing.  And when 
the second phase i.e. trying to find the best match into the reference 
buffer for each occurrence of the same many ", 0x" in the target buffer 
you get a conbinatorial explosion.

The adler32 made that particular example a non issue since the 
likelyhood of many 16 byte blocks to be the same is pretty low in this 
case.  But the flaw remains if for example there is lots of similar 16 
byte blocks, like a binary file with lots of zeroes for example.  In 
fact, the performance problem Carl is having does use the diff-delta 
version still using adler32.

> > But if many blocks from the reference buffer do hash to the same bucket 
> > then for each block in the target file many blocks from the reference 
> > buffer have to be tested against, making it tend towards O(n^m) which is 
> > pretty highly exponential.
> > 
> > The solution I'm investigating is to put a limit on the number of 
> > entries in the same hash bucket so to bring the cost back to something 
> > more linear.  That means the delta might miss on better matches that 
> > have not been hashed but still benefit from a limited set.
> 
> Sounds fair enough.

Testing appear to show that this is a worthwhile safety valve.  And in 
most case that safety valve should not be activated at all.

> However, you migt also want to consider another approach..
> 
> One of the biggest costs for the xdelta algorithm is probably just the 
> "delta_prepare()", but at the same time, that is constant wrt the source 
> buffer.

Actually it is not that costly.  Much much less than computing the sha1 
of the same buffer for example.

> Now, the sad part is that when I wrote pack-objects, I didn't really 
> understand the diff-delta algorithm, I just plugged it in. Which means 
> that when I did it, I made the (obvious and simple) decision to keep the 
> _result_ that we are looking at constant, and try to delta against 
> different sources.
> 
> HOWEVER.
> 
> I suspect you already see where this is going..
> 
> We _could_ switch the "pack-objects" window handling around, and instead 
> of looking at the object we want to pack, and looking at the ten (or 
> "window") previous objects to delta against, we could do it the other way 
> around: keep the object we delta against constant, and see what deltas we 
> could prepare for the ten next objects.
> 
> And since the source would now be constant, you'd need to do the 
> "delta_prepare()" just _once_ per window, instead of every single time.

Might be worth trying.  Actually, this can be tested without even 
changing the window handling just yet, since diff-delta() could return 
the index data instead of freeing it, and pack-objects can store it 
along side with the object data it tries to delta against.  That 
wouldn't be memory efficient, but at least that would give an idea of 
the magnitude of the saving on CPU time.  But I really doubt that'll 
save more than a few percent.

> 
> Now, I haven't done any profiling on the diff-delta code, and maybe my 
> guess that delta_prepare() is a pretty expensive part is wrong, and maybe 
> it wouldn't help to switch the window probing around. But I thought I'd 
> mention it as one thing to explore..

Just to give you an idea, the bulk of my current "prepare" code looks 
like this:

        /* then populate the index */
        data = buf + bufsize - 2;
        while (data > buf) {
                entry->ptr = --data;
                i = (data[0] << hshift) ^ data[1];
                i ^= (i << hshift) ^ data[2];
                entry->next = hash[i];
                hash[i] = entry++;
        }

As you can see it is pretty lightweight.

But that would probably be a worthwhile optimization to have even if it 
saves 10% of CPU time.

Nicolas

^ permalink raw reply

* Re: git-annotate
From: Nicolas Vilz 'niv' @ 2006-02-25  1:21 UTC (permalink / raw)
  To: GIT Mailing List
In-Reply-To: <118833cc0602240721s1c64219y161cc0536fb361e4@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 668 bytes --]

Morten Welinder wrote:
> git-annotate ought to call usage() if it doesn't get its file.
Full Ack.

>>git annotate
> 
> Use of uninitialized value in open at
> /scratch/welinder/git/git-annotate line 40.
> Failed to open filename: No such file or directory at
> /scratch/welinder/git/git-annotate line 40.

oh yes, that was something, i slipped over today, didn't have the time
to report it myself.

additionally i got following with git-annotate --help

niv@hermes ~ $ git-annotate --help
/usr/bin/git-annotate version [unknown] calling Getopt::Std::getopts
(version 1.05 [paranoid]),
running under Perl version 5.8.8.

i don't know if that should happen so.

Nicolas

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 256 bytes --]

^ permalink raw reply

* Re: [PATCH] diff-delta: produce optimal pack data
From: Linus Torvalds @ 2006-02-25  0:45 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git, Carl Baldwin
In-Reply-To: <Pine.LNX.4.64.0602241613030.31162@localhost.localdomain>

On Fri, 24 Feb 2006, Nicolas Pitre wrote:
> 
> Currently, diff-delta takes blocks of data in the reference file and 
> hash them.  When the target file is scanned, it uses the hash to match 
> blocks from the target file with the reference file.
> 
> If blocks are hashed evenly the cost of  producing a delta is at most 
> O(n+m) where n and m are the size of the reference and target files 
> respectively.  In other words, with good data set the cost is linear.

Assuming the hash is good, of course.

I think this was the problem with you trying something simpler than 
adler32..

> But if many blocks from the reference buffer do hash to the same bucket 
> then for each block in the target file many blocks from the reference 
> buffer have to be tested against, making it tend towards O(n^m) which is 
> pretty highly exponential.
> 
> The solution I'm investigating is to put a limit on the number of 
> entries in the same hash bucket so to bring the cost back to something 
> more linear.  That means the delta might miss on better matches that 
> have not been hashed but still benefit from a limited set.

Sounds fair enough.

However, you migt also want to consider another approach..

One of the biggest costs for the xdelta algorithm is probably just the 
"delta_prepare()", but at the same time, that is constant wrt the source 
buffer.

Now, the sad part is that when I wrote pack-objects, I didn't really 
understand the diff-delta algorithm, I just plugged it in. Which means 
that when I did it, I made the (obvious and simple) decision to keep the 
_result_ that we are looking at constant, and try to delta against 
different sources.

HOWEVER.

I suspect you already see where this is going..

We _could_ switch the "pack-objects" window handling around, and instead 
of looking at the object we want to pack, and looking at the ten (or 
"window") previous objects to delta against, we could do it the other way 
around: keep the object we delta against constant, and see what deltas we 
could prepare for the ten next objects.

And since the source would now be constant, you'd need to do the 
"delta_prepare()" just _once_ per window, instead of every single time.

Now, I haven't done any profiling on the diff-delta code, and maybe my 
guess that delta_prepare() is a pretty expensive part is wrong, and maybe 
it wouldn't help to switch the window probing around. But I thought I'd 
mention it as one thing to explore..

		Linus

^ permalink raw reply

* Re: [PATCH] diff-delta: produce optimal pack data
From: Junio C Hamano @ 2006-02-24 23:55 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0602241029360.23719@localhost.localdomain>

Nicolas Pitre <nico@cam.org> writes:

> On Fri, 24 Feb 2006, Junio C Hamano wrote:
>
>> In Linux 2.6 repository, these object pairs take forever to
>> delta.
>> 
>>         blob 9af06ba723df75fed49f7ccae5b6c9c34bc5115f -> 
>>         blob dfc9cd58dc065d17030d875d3fea6e7862ede143
>>         size (491102 -> 496045)
>>         58 seconds
>> 
>>         blob 4917ec509720a42846d513addc11cbd25e0e3c4f -> 
>>         blob dfc9cd58dc065d17030d875d3fea6e7862ede143
>>         size (495831 -> 496045)
>>         64 seconds
>
> Thanks for this.  I'll see what I can do to tweak the code to better 
> cope with those.  Just keep my fourth delta patch in the pu branch for 
> now.

If apply this on top of pack-objects.c, you can find more of
them yourself.

---
diff --git a/pack-objects.c b/pack-objects.c
index be7a200..3f88e86 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -62,6 +62,7 @@ static const char *base_name;
 static unsigned char pack_file_sha1[20];
 static int progress = 1;
 static volatile int progress_update = 0;
+static volatile int progress_update_cnt = 0;
 
 /*
  * The object names in objects array are hashed with this hashtable,
@@ -826,6 +827,7 @@ static int try_delta(struct unpacked *cu
 	struct object_entry *old_entry = old->entry;
 	int old_preferred = (old_entry->preferred_base ||
 			     old_entry->based_on_preferred);
+	int last_up;
 	unsigned long size, oldsize, delta_size, sizediff;
 	long max_size;
 	void *delta_buf;
@@ -890,8 +892,17 @@ static int try_delta(struct unpacked *cu
 	}
 	if (sizediff >= max_size)
 		return -1;
+	last_up = progress_update_cnt;
 	delta_buf = diff_delta(old->data, oldsize,
 			       cur->data, size, &delta_size, max_size);
+	if (last_up + 1 < progress_update_cnt) {
+		/* It took more than one second */
+		fprintf(stderr, "%d -> %d: %s -> ",
+			last_up, progress_update_cnt,
+			sha1_to_hex(old_entry->sha1));
+		fprintf(stderr, "%s (%lu -> %lu)\n",
+			sha1_to_hex(cur_entry->sha1), oldsize, size);
+	}
 	if (!delta_buf)
 		return 0;
 	cur_entry->delta = old_entry;
@@ -906,6 +917,7 @@ static void progress_interval(int signum
 {
 	signal(SIGALRM, progress_interval);
 	progress_update = 1;
+	progress_update_cnt++;
 }
 
 static void find_deltas(struct object_entry **list, int window, int depth)

^ permalink raw reply related

* [PATCH] Add missing programs to ignore list
From: Shawn Pearce @ 2006-02-24 22:51 UTC (permalink / raw)
  To: git

Added recently added programs to the default exclude list.

---
 I found these were missing in master.  pg won't let me work in a
 directory if there are untracked files laying around which aren't
 excluded by an ignore pattern.

 .gitignore |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

base 809da5f8a21a10112eece4ee9be55fe64371ce68
last 9752e066a00144fc1b7657e6d54569e9d9764092
diff --git a/.gitignore b/.gitignore
index 94f66d5a1ef9d1e2bec2c14b392565ee5bf98dd6..5be239a4aa95f78cb09e4921459e1678b9f3c36e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,6 +2,7 @@ GIT-VERSION-FILE
 git
 git-add
 git-am
+git-annotate
 git-apply
 git-applymbox
 git-applypatch
@@ -22,6 +23,7 @@ git-convert-objects
 git-count-objects
 git-cvsexportcommit
 git-cvsimport
+git-cvsserver
 git-daemon
 git-diff
 git-diff-files
@@ -53,6 +55,7 @@ git-mailsplit
 git-merge
 git-merge-base
 git-merge-index
+git-merge-tree
 git-merge-octopus
 git-merge-one-file
 git-merge-ours
@@ -60,6 +63,7 @@ git-merge-recursive
 git-merge-resolve
 git-merge-stupid
 git-mktag
+git-mktree
 git-name-rev
 git-mv
 git-pack-redundant
-- 
1.2.3.g809d

^ permalink raw reply related

* Re: [PATCH] diff-delta: produce optimal pack data
From: Carl Baldwin @ 2006-02-24 22:50 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git
In-Reply-To: <Pine.LNX.4.64.0602241544270.31162@localhost.localdomain>

On Fri, Feb 24, 2006 at 04:12:14PM -0500, Nicolas Pitre wrote:
> On Fri, 24 Feb 2006, Carl Baldwin wrote:
> > After the twelve large objects were packed into individual packs the
> > rest of the packing went very quickly and git v1.2.3's date reuse worked
> > very well.  This was sort of my attempt at simulating how things would
> > be if git avoided deltification of each of these large files. I'm sorry
> > to have been so harsh earlier I just didn't understand that
> > incrementally packing one-by-one was going to help this much.
> 
> Hmmmmmmm....
> 
> I don't think I understand what is going on here.
> 
> You say that, if you add those big files and incrementally repack after 
> each commit using git repack with no option, then it requires only about 
> one second each time.  Right?

Well, actually I was packing them individually by calling
git-pack-objects directly on each blob.

I'll try doing it exactly as you describe...

Ok, I tried it.  Basically I do the following.

% mkdir test
% cd test
% git init-db
% cp ../files/binfile.1 binfile
% time git add binfile

real    0m2.459s
user    0m2.443s
sys     0m0.019s
% git commit -a -m "Rev 1"
% time git repack
[snip]

real    0m1.111s
user    0m1.046s
sys     0m0.061s
% for i in $(seq 2 12); do
    cp ../files/binfile.$i binfile
    time git commit -a -m "Rev $i"
    time git repack
done

Each commit takes around 2.8-3.5 seconds and each repack takes about
1.2-1.5 seconds.  These are prettly reasonable.

Now, I try 'git repack -a -f' (or even without -f) and it goes out to
lunch.  I think it would take on the order of a day to actually finish
because it wasn't very far after an hour.

[snip]
> How many files besides those 12 big blobs do you have?

This repository has been completely stripped to the 12 revisions of the
one file.  So, there are 36 objects.

12 blobs.
12 trees.
12 commits.

That is all.

[snip]
> If you can add them into a single .tgz with instructions on how 
> to reproduce the issue and provide me with an URL where I can fetch it 
> that'd be perfect.

I will do this in an email off of the list because these files really
shouldn't be available on a public list.

Carl

-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Carl Baldwin                        RADCAD (R&D CAD)
 Hewlett Packard Company
 MS 88                               work: 970 898-1523
 3404 E. Harmony Rd.                 work: Carl.N.Baldwin@hp.com
 Fort Collins, CO 80525              home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox