Git development


* Re: git + ssh + key authentication feature-request
From: Nicolas Vilz 'niv' @ 2006-02-08 23:23 UTC
  To: git
In-Reply-To: <7vhd79o6m5.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano wrote:
> Nicolas Vilz 'niv' <niv@iaglans.de> writes:
> 
> 
>>I would like to ask if it is possible to use ssh keys to authenticate
>>via ssh on a git repository via git-pull/git-push. Since ssh supports
>>them, wouldn't it be nice to use them in git, too?
> 
> 
> Please read what has been discussed within the last couple of
> weeks at least.  I could say the last couple of months but I
> know that is asking too much ;-).
> 
> http://thread.gmane.org/gmane.comp.version-control.git/15462
> 

Sorry, I hadn't found that yet, so I asked.

In my case there would be only one system user with full access to
several repositories. At the moment, the people who use that account
have to give a password, which isn't that bad. It would be easier and
more secure for me not to hand out a password, but to ask the users
for their ssh public keys instead.

I can still live with the password thing :)

Sincerely
Nicolas

* Re: git + ssh + key authentication feature-request
From: Junio C Hamano @ 2006-02-08 21:58 UTC
  To: Nicolas Vilz 'niv'; +Cc: git
In-Reply-To: <43EA73C3.2040309@iaglans.de>

Nicolas Vilz 'niv' <niv@iaglans.de> writes:

> I would like to ask if it is possible to use ssh keys to authenticate
> via ssh on a git repository via git-pull/git-push. Since ssh supports
> them, wouldn't it be nice to use them in git, too?

Please read what has been discussed within the last couple of
weeks at least.  I could say the last couple of months but I
know that is asking too much ;-).

http://thread.gmane.org/gmane.comp.version-control.git/15462

* Re: [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Junio C Hamano @ 2006-02-08 21:45 UTC
  To: Ryan Anderson; +Cc: git
In-Reply-To: <20060208210756.GA9490@mythryan2.michonline.com>

Ryan Anderson <ryan@michonline.com> writes:

>> It's been a while since I looked at it the last time so it may
>> not even work with the current git, but here it is..
>
> I'll take a look through this in greater detail later, hopefully your
> approach can be applied.  Diff-analyzing is apparently tricky.

Reading diff output is tricky, but I was too lazy to match up the
lines by hand, which is also real work ;-).

There are a few things I should add to that ancient code:

 - It wants old ls-tree behaviour.  The command line used in the
   "sub find_file" needs to be updated to something like this:

    open $fh, '-|', 'git-ls-tree', '-z', '-r', $commit->{TREE}, $path
	or die "cannot read git-ls-tree $commit->{TREE}";

 - It only cares about the line numbers and its output is meant
   to be postprocessed with the contents from the latest blob.

 - It predates the recent rev-list that skips commits that do
   not change the specified paths, and it literally follows each
   parent and optimizes not to diff with uninteresting parents
   by hand.
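
The line-number bookkeeping mentioned in the second item can be
sketched in a few lines.  This is only an illustration in Python, not
the script's actual code: parse_hunks and find_parent_line are names
chosen here, and the sketch assumes hunk headers in the form printed
by "diff -u0", where a length of 0 means the start number names the
line just before the gap.

```python
import re

# Matches unified-diff hunk headers such as "@@ -3 +3 @@" or "@@ -2,2 +1,0 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunks(diff_text):
    """Return (old_start, old_len, new_start, new_len) for every hunk header."""
    hunks = []
    for line in diff_text.splitlines():
        m = HUNK_RE.match(line)
        if m:
            old_start, old_len, new_start, new_len = m.groups()
            hunks.append((
                int(old_start),
                int(old_len) if old_len is not None else 1,  # omitted length means 1
                int(new_start),
                int(new_len) if new_len is not None else 1,
            ))
    return hunks


def find_parent_line(hunks, child_lineno):
    """Map a 1-based line number in the child to the parent.

    Returns None when the child added or rewrote that line.
    """
    offset = 0  # parent line number minus child line number so far
    for old_start, old_len, new_start, new_len in hunks:
        if new_len == 0:
            # Pure deletion: new_start is the last child line before the gap.
            if child_lineno <= new_start:
                return child_lineno + offset
        else:
            if child_lineno < new_start:
                return child_lineno + offset
            if child_lineno < new_start + new_len:
                return None  # inside lines the child introduced
        offset += old_len - new_len
    return child_lineno + offset
```

For a child line that no hunk touches, the function returns the
matching parent line number; for a line the child added or rewrote it
returns None, which is the case where blame cannot be passed on to
that parent.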

I suspect if you go with the diff-reading approach, it might be
easy to convert it to C (or even write the initial version in C)
using the machinery similar to what is in combine-diff.c.

The algorithm combine-diff.c uses keeps the lines discarded from
each parent in lline structure linked to the sline structure
(which keeps track of the lines in the final version), but for
your annotate purposes what you care about is only what the
child adds to the parent (IOW, we do not care about the lines
that do not appear in the final version), so the logic and the
data structure could be greatly simplified.  You only need to
keep "flag" element in the sline structure, and maybe bol and
len that point at the contents of the resulting line from the
final version.  In addition, you would need to store "the
current suspect commit" (starts from the final revision and
updated as you pass the blame along) and another bool that says
if "the current suspect" is known to be the guilty party or if
the true culprit is one of its ancestors (capital vs lowercase
difference in that explanatory note).
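
A reduced per-line record along these lines might look as follows.
This is a Python sketch for illustration only; BlameLine, start_blame
and the field names are invented here and do not mirror the actual
sline layout in combine-diff.c.

```python
from dataclasses import dataclass


@dataclass
class BlameLine:
    text: str              # contents of the line in the final version
    suspect: str           # the current suspect commit
    guilty: bool = False   # True once the suspect is known to be the culprit

    def label(self):
        # Uppercase: the suspect is known to be guilty.
        # Lowercase: the suspect or one of its ancestors is.
        return self.suspect.upper() if self.guilty else self.suspect.lower()


def start_blame(final_commit, contents):
    """Every line of the final version starts out suspected on the final revision."""
    return [BlameLine(text, final_commit) for text in contents.splitlines()]
```

Passing blame along then amounts to overwriting "suspect", and taking
responsibility amounts to setting "guilty"; nothing about the lines
discarded from each parent needs to be kept.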

* git + ssh + key authentication feature-request
From: Nicolas Vilz 'niv' @ 2006-02-08 22:42 UTC
  To: git

Hi guys,

first of all, great work.

I just discovered git and I like it.

I would like to ask if it is possible to use ssh keys to authenticate 
via ssh on a git repository via git-pull/git-push. Since ssh supports 
them, wouldn't it be nice to use them in git, too?

The layout would be as follows:

You have a system user with a git-shell and several keys in
.ssh/authorized_keys. These are the keys of your contributors, who
are allowed to log in and work with the repository.

I haven't found a possibility to do this. Maybe I just haven't
discovered it yet...

Sincerely
Nicolas

* Re: Handling large files with GIT
From: Florian Weimer @ 2006-02-08 21:20 UTC
  To: git
In-Reply-To: <46a038f90602080114r2205d72cmc2b5c93f6fffe03d@mail.gmail.com>

* Martin Langhoff:

> SVN does reasonably well tracking his >1GB mbox file. Now, I don't
> know if I like the idea of putting my own mbox file under version
> control, but it looks like projects with large and slow-changing files
> would be in trouble with GIT.

To my surprise, it's not that bad.  The Debian testing-security team
uses a single 1.8 MB file (400 KB compressed) to keep vulnerability
data.  Most changes to that file involve just a few lines.  But even
in this extreme case, git doesn't compare too badly against Subversion
if you pack regularly (but not too often).  Disk usage is actually
*below* Subversion FSFS even with --depth=10 (the default,
unfortunately a bit hard to override).

I plan to do another experiment for GCC, which contains marvels such
as:

  35905  126056 1379093 gcc/ChangeLog-2005
  12610   61215  417584 gcc/combine.c

But the outcome will likely be quite similar to the secure-testing
case: comparable disk space usage, not a difference of one or more
orders of magnitude.

But Subversion still has a significant advantage: I can get a
working copy without downloading full history (several gigabytes in
GCC's case).  There's also the slight drawback that you shouldn't pack
too often, otherwise you'll reduce its effectiveness.  You can always
run "git-repack -a -d", but it's rather expensive.  This means that
you need to keep compressed fulltexts from a few dozen revisions, but
I don't think this is a huge burden.  All in all, the compressed
fulltexts/packs model is a pretty good trade-off between disk usage,
end-user usability and code complexity.

In your mbox case, you should simply try Maildir.  The tree object
(which lists all files in the Maildir folder) will still be rather
large (about 40 to 50 bytes per message stored), though.

* Re: gitk changing line color for no reason after merge
From: Pavel Roskin @ 2006-02-08 21:06 UTC
  To: Paul Mackerras; +Cc: Junio C Hamano, git
In-Reply-To: <17385.22468.218755.833713@cargo.ozlabs.ibm.com>

On Wed, 2006-02-08 at 13:30 +1100, Paul Mackerras wrote:
> Pavel Roskin writes:
> 
> > I'm trying to make it easier to follow a line.  It's easier if its color
> > is not changing, especially on trivial nodes (one parent, one child).
> 
> OK, you're using "line" to mean something a bit different from the
> connection between a commit and its children, which is how I use it.

I see.  Actually, your choice seems to me quite random and
non-intuitive.  You group together changes that have the same parent,
likely made independently by different people.  In fact, only those
changes are shown that would lead to the current revision of the
repository, unless "--all" is used.  Changes on unmerged branches are
not shown.

If you prefer "horizontal" grouping, it would be more logical to turn it
upside down, i.e. group commits with their parents.  In this case, the
line group would represent one act of merging, performed by one person.
No parents are hidden from view even without "--all".

> You seem to be using it more as a "line of development", or as a
> series of related patches.  Which is fine, if you can find a way to
> identify lines of development automatically.  (I know it looks obvious
> when you look at the gitk display, but that's a lot different from
> writing down an algorithm to do it.)

As usual, let's go from the newest commits to the root of the tree.
The idea is to assign a branch ID to changesets, i.e. to combinations of
sha1 and parent number.  The branch ID should be inherited from the
children by the first parent.  Other parents get a new branch ID.  There
should be a list of active branches, i.e. those branch IDs whose parents
are yet to be seen.  A color should be assigned to a branch ID at
creation time.  The color should be selected according to two rules,
whenever possible: it should be unique among the colors already assigned
for the same child, and it should avoid the colors of the active
branches.
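
To make the idea a bit more concrete, here is a rough Python sketch.
It is not gitk or qgit code.  The function name, the palette and the
data layout are made up for illustration, and it assumes two things
the description above does not spell out: commits arrive with children
before their parents, and a commit reached by several edges continues
on the first branch that reached it.

```python
from itertools import count

# Arbitrary example palette; any list of distinct colors would do.
PALETTE = ["green", "red", "blue", "magenta", "darkgrey", "brown", "orange"]


def assign_branches(commits):
    """commits: list of (sha1, [parent sha1s]), newest first.

    Returns (edge_branch, branch_color), where edge_branch maps
    (sha1, parent number) to a branch ID.
    """
    next_id = count(1)
    edge_branch = {}
    branch_color = {}
    waiting = {}  # parent sha1 -> IDs of active branches leading to it

    def new_branch(avoid):
        bid = next(next_id)
        free = [c for c in PALETTE if c not in avoid]
        # "Whenever possible": reuse a color once the palette is exhausted.
        branch_color[bid] = free[0] if free else PALETTE[bid % len(PALETTE)]
        return bid

    for sha1, parents in commits:
        incoming = waiting.pop(sha1, [])
        active_colors = {branch_color[b] for ids in waiting.values() for b in ids}
        # A head starts a new branch; otherwise continue the one we arrived on.
        own = incoming[0] if incoming else new_branch(active_colors)
        used_here = {branch_color[own]}
        for number, parent in enumerate(parents):
            if number == 0:
                bid = own  # the first parent inherits the branch ID
            else:
                bid = new_branch(active_colors | used_here)
                used_here.add(branch_color[bid])
            edge_branch[(sha1, number)] = bid
            waiting.setdefault(parent, []).append(bid)
    return edge_branch, branch_color
```

With this, a run of trivial nodes keeps one branch ID and therefore one
color, and a new color only appears on the second and later parents of
a merge.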

Actually, qgit does a pretty reasonable thing.  I haven't used gitk for
months, but I had to inspect a Mercurial repository using hgk.  I was
surprised by its "crazy" color changes (or so it seemed to me after
qgit), then I found that gitk had the same problem, then I fixed it and
started this thread :-)

> > http://red-bean.com/proski/gitk/gitk-ideal.png - made in GIMP.  Trivial
> > nodes never change line color, because it changes as soon as the line
> > forks.
> 
> My problem with that is that it isn't clear that e.g. the green and
> brown lines near the bottom actually represent the same parent - and
> that will get worse with more complex graphs.

You are right.  qgit only uses vertical and horizontal lines, so it's
easier to find the parent.

-- 
Regards,
Pavel Roskin

* Re: [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Ryan Anderson @ 2006-02-08 21:07 UTC
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vd5hxpr2d.fsf@assigned-by-dhcp.cox.net>

On Wed, Feb 08, 2006 at 11:51:22AM -0800, Junio C Hamano wrote:
> Ryan Anderson <ryan@michonline.com> writes:
> 
> > Signed-off-by: Ryan Anderson <ryan@michonline.com>
> >
> > ---
> >
> > I think this version is mostly ready to go.
> >
> > Junio, the post you pointed me at was very helpful (once I got around to
> > listening to it), but the code it links to is missing - if that's a
> > better partial implementation than this, can you resurrect it
> > somewhere?  I'd be happy to reintegrate it together.
> 
> I still have it, but the reason why I withdrew circulating it
> was because I found that on some inputs it did not work
> correctly as intended.  Not that the algorithm was necessarily
> broken but the implementation certainly was.
> 
> Unlike yours mine reads and interprets diff output to find which
> lines are common and which lines are added, and I think the diff
> interpretation logic has various corner cases wrong.  I did
> combine-diff.c diff interpreter without looking at my
> 'git-blame', so I do not remember where I got it wrong,
> though...

I tried that approach at first, and it was much much more confusing to
try to keep track of.  The problem Linus found (that of a missing
"all_lines_claimed()") was related to that code.  This implementation is
simple, though it has to have some problems with guessing at duplicated
lines incorrectly.

> It's been a while since I looked at it the last time so it may
> not even work with the current git, but here it is..

I'll take a look through this in greater detail later, hopefully your
approach can be applied.  Diff-analyzing is apparently tricky.

-- 

Ryan Anderson
  sometimes Pug Majere

* Re: Handling large files with GIT
From: Junio C Hamano @ 2006-02-08 20:11 UTC
  To: Linus Torvalds; +Cc: git, Johannes Schindelin
In-Reply-To: <Pine.LNX.4.64.0602080853480.2458@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> Side note: the original explicit git "delta" objects by Nicolas Pitre 
> would have handled this large-file-case much more gracefully. 

True.

> The pack-files had absolutely huge advantages, though, so I think we (I) 
> did the right thing there in making the delta code only a very specific 
> special case..

Well the blame for ripping that out falls on me, actually...

> It is possible that we could re-introduce the "explicit delta" object, 
> though (it's not incompatible with also doing pack-files, it's just that 
> pack-files made 99% of all the arguments for an explicit delta go away).

I do not remember whether we had 'rev-list --objects' support for
Nico's explicit delta object chains.  If we didn't, that would be a new
development that needs to be done to resurrect it.  I know
pack-objects never had support for it so obviously that needs to
be added as well.  Probably explicit delta objects should always
be packed in full without spending cost to find delta candidates.

Personally I feel that post-1.2.0 would be a good time to start
looking at enhancing the pack generation chain, rev-list piped
to pack-objects.  The "large files" use case is helped by
less self-contained packs, while the "shallow clone" use case
we discussed earlier is helped by more self-contained packs (we
had a discussion a long time ago on this and I think we have the
code to do so [*1*]).

An addition to pack-objects is needed to make it capable of reading
a list of objects that we do not want to include in the resulting
pack but that can be used as base objects for deltification.

BTW, as to the "shallow clone", I changed my mind and am
inclined to agree with Johannes that handling cut-offs
differently from grafts makes it easier to deal with a later "give
me more history" operation, so I am planning to chuck my jc/clone
topic branch that I have included in the proposed updates so
far.

[Footnote]

*1* http://article.gmane.org/gmane.comp.version-control.git/5779

* Re: [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Junio C Hamano @ 2006-02-08 19:51 UTC
  To: Ryan Anderson; +Cc: git
In-Reply-To: <11394103753694-git-send-email-ryan@michonline.com>

Ryan Anderson <ryan@michonline.com> writes:

> Signed-off-by: Ryan Anderson <ryan@michonline.com>
>
> ---
>
> I think this version is mostly ready to go.
>
> Junio, the post you pointed me at was very helpful (once I got around to
> listening to it), but the code it links to is missing - if that's a
> better partial implementation than this, can you resurrect it
> somewhere?  I'd be happy to reintegrate it together.

I still have it, but the reason why I withdrew circulating it
was because I found that on some inputs it did not work
correctly as intended.  Not that the algorithm was necessarily
broken but the implementation certainly was.

Unlike yours mine reads and interprets diff output to find which
lines are common and which lines are added, and I think the diff
interpretation logic has various corner cases wrong.  I did
combine-diff.c diff interpreter without looking at my
'git-blame', so I do not remember where I got it wrong,
though...

It's been a while since I looked at it the last time so it may
not even work with the current git, but here it is..

--
#!/usr/bin/perl -w

use strict;

package main;
$::debug = 0;

sub read_blob {
    my $sha1 = shift;
    my $fh = undef;
    my $result;
    local ($/) = undef;
    open $fh, '-|', 'git-cat-file', 'blob', $sha1
	or die "cannot read blob $sha1";
    $result = join('', <$fh>);
    close $fh
	or die "failure while closing pipe to git-cat-file";
    return $result;
}

sub read_diff_raw {
    my ($parent, $filename) = @_;
    my $fh = undef;
    local ($/) = "\0";
    my @result = (); 
    my ($meta, $status, $sha1_1, $sha1_2, $file1, $file2);

    print STDERR "* diff-index --cached $parent $filename\n" if $::debug;
    my $has_changes = 0;
    open $fh, '-|', 'git-diff-index', '--cached', '-z', $parent, $filename
	or die "cannot read git-diff-index $parent $filename";
    while (defined ($meta = <$fh>)) {
	$has_changes = 1;
    }
    close $fh
	or die "failure while closing pipe to git-diff-index";
    if (!$has_changes) {
	return ();
    }

    $fh = undef;
    print STDERR "* diff-index -B -C --find-copies-harder --cached $parent\n" if $::debug;
    open($fh, '-|', 'git-diff-index', '-B', '-C', '--find-copies-harder',
	 '--cached', '-z', $parent)
	or die "cannot read git-diff-index with $parent";
    while (defined ($meta = <$fh>)) {
	chomp($meta);
	(undef, undef, $sha1_1, $sha1_2, $status) = split(/ /, $meta);
	$file1 = <$fh>;
	chomp($file1);
	if ($status =~ /^[CR]/) {
	    $file2 = <$fh>;
	    chomp($file2);
	} elsif ($status =~ /^D/) {
	    next;
	} else {
	    $file2 = $file1;
	}
	if ($file2 eq $filename) {
	    push @result, [$status, $sha1_1, $sha1_2, $file1, $file2];
	}
    }
    close $fh
	or die "failure while closing pipe to git-diff-index";
    return @result;
}

sub write_temp_blob {
    my ($sha1, $temp) = @_;
    my $fh = undef;
    my $blob = read_blob($sha1);
    open $fh, '>', $temp
	or die "cannot open temporary file $temp";
    print $fh $blob;
    close($fh);
}

package Git::Patch;
sub new {
    my ($class, $sha1_1, $sha1_2) = @_;
    my $self = bless [], $class;
    my $fh = undef;
    ::write_temp_blob($sha1_1, "/tmp/blame-$$-1");
    ::write_temp_blob($sha1_2, "/tmp/blame-$$-2");
    open $fh, '-|', 'diff', '-u0', "/tmp/blame-$$-1", "/tmp/blame-$$-2"
	or die "cannot read diff";
    while (<$fh>) {
	if (/^\@\@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? \@\@/) {
	    push @$self, [$1, (defined $2 ? $2 : 1),
			  $3, (defined $4 ? $4 : 1)];
	}
    }
    close $fh;
    unlink "/tmp/blame-$$-1", "/tmp/blame-$$-2";
    return $self;
}

sub find_parent_line {
    my ($self, $commit_lineno) = @_;
    my $ofs = 0;
    for (@$self) {
	my ($line_1, $len_1, $line_2, $len_2) = @$_;
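	if ($commit_lineno < $line_2) {
	    # $ofs is (parent line - commit line) after the hunks seen
	    # so far, so it is added here just as in the final return.
	    return $commit_lineno + $ofs;
	}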
	if ($line_2 <= $commit_lineno && $commit_lineno < $line_2 + $len_2) {
	    return -1; # changed by commit.
	}
	$ofs += ($len_1 - $len_2);
    }
    return $commit_lineno + $ofs;
}

package Git::Commit;

my %author_name_canon = 
('Linus Torvalds <torvalds@evo.osdl.org>' =>
 'Linus Torvalds <torvalds@osdl.org>',
 'Linus Torvalds <torvalds@ppc970.osdl.org.(none)>' =>
 'Linus Torvalds <torvalds@osdl.org>',
 'Linus Torvalds <torvalds@ppc970.osdl.org>' =>
 'Linus Torvalds <torvalds@osdl.org>',
 'Linus Torvalds <torvalds@g5.osdl.org>' =>
 'Linus Torvalds <torvalds@osdl.org>',
 'Matthias Urlichs <smurf@kiste.(none)>' =>
 'Matthias Urlichs <smurf@smurf.noris.de>',
 'Paul Mackerras <paulus@dorrigo.(none)>' =>
 'Paul Mackerras <paulus@samba.org>',
 'Paul Mackerras <paulus@pogo.(none)>' =>
 'Paul Mackerras <paulus@samba.org>',
 'Petr Baudis <pasky@ucw.cz>' =>
 'Petr Baudis <pasky@suse.cz>',
 'tony.luck@intel.com <tony.luck@intel.com>' =>
 'Tony Luck <tony.luck@intel.com>',
 'barkalow@iabervon.org <barkalow@iabervon.org>' =>
 'Daniel Barkalow <barkalow@iabervon.org>',
 'jon@blackcubes.dyndns.org <jon@blackcubes.dyndns.org>' =>
 'Jon Seymour <jon.seymour@gmail.com>',
 'Sven Verdoolaege <skimo@kotnet.org>' =>
 'Sven Verdoolaege <skimo@liacs.nl>',
 'Bryan Larsen <bryanlarsen@yahoo.com>' =>
 'Bryan Larsen <bryan.larsen@gmail.com>',
 'Junio C Hamano <junio@twinsun.com>' =>
 'Junio C Hamano <junkio@cox.net>',
 );

sub canon_author_name {
    my ($name) = @_;
    if (exists $author_name_canon{$name}) {
	return $author_name_canon{$name};
    }
    return $name;
}

sub new {
    my $class = shift;
    my $self = bless {
	PARENT => [],
	TREE => undef,
	AUTHOR => undef,
	COMMITTER => undef,
    }, $class;
    my $commit_sha1 = shift;
    $self->{SHA1} = $commit_sha1;
    my $fh = undef;
    open $fh, '-|', 'git-cat-file', 'commit', $commit_sha1
	or die "cannot read commit object $commit_sha1";
    while (<$fh>) {
	chomp;
	if (/^tree ([0-9a-f]{40})$/) { $self->{TREE} = $1; }
	elsif (/^parent ([0-9a-f]{40})$/) { push @{$self->{PARENT}}, $1; }
	elsif (/^author ([^>]+>)/) {
	    $self->{AUTHOR} = canon_author_name($1);
	}
	elsif (/^committer ([^>]+>)/) {
	    $self->{COMMITTER} = canon_author_name($1);
	}
    }
    close $fh
	or die "failure while closing pipe to git-cat-file";
    return $self;
}

sub find_file {
    my ($commit, $path) = @_;
    my $result = undef;
    my $fh = undef;
    local ($/) = "\0";
    open $fh, '-|', 'git-ls-tree', '-z', '-r', '-d', $commit->{TREE}, $path
	or die "cannot read git-ls-tree $commit->{TREE}";
    while (<$fh>) {
	chomp;
	if (/^[0-7]{6} blob ([0-9a-f]{40})	(.*)$/) {
	    if ($2 ne $path) {
		die "$2 ne $path???";
	    }
	    $result = $1;
	    last;
	}
    }
    close $fh
	or die "failure while closing pipe to git-ls-tree";
    return $result;
}

package Git::Blame;
sub new {
    my $class = shift;
    my $self = bless {
	LINE => [],
	UNKNOWN => undef,
	WORK => [],
    }, $class;
    my $commit = shift;
    my $filename = shift;
    my $sha1 = $commit->find_file($filename);
    my $blob = ::read_blob($sha1);
    my @blob = (split(/\n/, $blob));
    for (my $i = 0; $i < @blob; $i++) {
	$self->{LINE}[$i] = +{
	    COMMIT => $commit,
	    FOUND => undef,
	    FILENAME => $filename,
	    LINENO => ($i + 1),
	};
    }
    $self->{UNKNOWN} = scalar @blob;
    push @{$self->{WORK}}, [$commit, $filename];
    return $self;
}

sub read_blame_cache {
    my $self = shift;
    my $filename = shift;
    my $fh = undef;
    my $pi = $self->{'PATHINFO'} = {};
    open $fh, '<', $filename
	or die "cannot read blame cache $filename";
    while (<$fh>) {
	chomp;
	my ($commit, $parent, $path) = split(/\t/, $_);
	$pi->{$path}{$commit}{$parent} = 1;
    }
    close $fh;
}

sub print {
    my $self = shift;
    my $line_termination = shift;
    for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
	my $l = $self->{LINE}[$i];
	print(($l->{FOUND} ? ':' : '?'));
	print "$l->{COMMIT}->{SHA1}	";
	print "$l->{COMMIT}->{AUTHOR}	";
	print "$l->{COMMIT}->{COMMITTER}	";
	print "$l->{LINENO}	$l->{FILENAME}";
	print $line_termination;
    }
}

sub take_responsibility {
    my ($self, $commit) = @_;
    for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
	my $l = $self->{LINE}[$i];
	if (! $l->{FOUND} && ($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) {
	    $l->{FOUND} = 1;
	    $self->{UNKNOWN}--;
	}
    }
}

sub blame_parent {
    my ($self, $commit, $parent, $filename) = @_;
    my @diff = ::read_diff_raw($parent->{SHA1}, $filename);
    my $filename_in_parent;
    my $passed_blame_to_parent = undef;
    if (@diff == 0) {
	# We have not touched anything.  Blame parent for everything
	# that we are suspected for.
	for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
	    my $l = $self->{LINE}[$i];
	    if (! $l->{FOUND} && ($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) {
		$l->{COMMIT} = $parent;
		$passed_blame_to_parent = 1;
	    }
	}
	$filename_in_parent = $filename;
    }
    elsif (@diff != 1) {
	# This should not happen.
	for (@diff) {
	    print "** @$_\n";
	}
	die "Oops";
    }
    else {
	my ($status, $sha1_1, $sha1_2, $file1, $file2) = @{$diff[0]};
	print STDERR "** $status $file1 $file2\n" if $::debug;
	if ($status =~ /A/ || $status =~ /M[0-9][0-9]/) {
	    # Either some of other parents created it, or we did.
	    # At this point the only thing we know is that this
	    # parent is not responsible for it.
	    ;
	}
	else {
	    my $patch = Git::Patch->new($sha1_1, $sha1_2);
	    $filename_in_parent = $file1;
	    for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
		my $l = $self->{LINE}[$i];
		if (! $l->{FOUND} && $l->{COMMIT}->{SHA1} eq $commit->{SHA1}) {
		    # We are suspected to have introduced this line.
		    # Does it exist in the parent?
		    my $lineno = $l->{LINENO};
		    my $parent_line = $patch->find_parent_line($lineno);
		    if ($parent_line < 0) {
			# No, we may be the guilty ones, or some other
			# parent might be.  We do not assign blame to
			# ourselves here yet.
			;
		    }
		    else {
			# This line is coming from the parent, so pass
			# blame to it.
			$l->{COMMIT} = $parent;
			$l->{FILENAME} = $file1;
			$l->{LINENO} = $parent_line;
			$passed_blame_to_parent = 1;
		    }
		}
	    }
	}
    }
    if ($passed_blame_to_parent && $self->{UNKNOWN}) {
	unshift @{$self->{WORK}},
	[$parent, $filename_in_parent];
    }
}

sub assign {
    my ($self, $commit, $filename) = @_;
    # We do read-tree of the current commit and diff-index
    # with each parents, instead of running diff-tree.  This
    # is because diff-tree does not look for copies hard enough.

    if (exists $self->{'PATHINFO'} && exists $self->{'PATHINFO'}{$filename} &&
	!exists $self->{'PATHINFO'}{$filename}{$commit->{SHA1}} &&
	@{$commit->{PARENT}} == 1) {
	# This commit did not touch the path at all, and
	# has only one parent.  It is all that parent's fault.

	my $parent = Git::Commit->new($commit->{PARENT}[0]);
	my $passed_blame_to_parent = 0;
	for (my $i = 0; $i < @{$self->{LINE}}; $i++) {
	    my $l = $self->{LINE}[$i];
	    if (! $l->{FOUND} &&
		($l->{COMMIT}->{SHA1} eq $commit->{SHA1})) {
		$l->{COMMIT} = $parent;
		$passed_blame_to_parent = 1;
	    }
	}
	if ($passed_blame_to_parent && $self->{UNKNOWN}) {
	    unshift @{$self->{WORK}},
	    [$parent, $filename];
	}
	return;
    }

    print STDERR "* read-tree  $commit->{SHA1}\n" if $::debug;
    system('git-read-tree', '-m', $commit->{SHA1});
    for my $parent (@{$commit->{PARENT}}) {
	$self->blame_parent($commit, Git::Commit->new($parent), $filename);
    }
    $self->take_responsibility($commit);
}

sub assign_blame {
    my ($self) = @_;
    while ($self->{UNKNOWN} && @{$self->{WORK}}) {
	my $wk = shift @{$self->{WORK}};
	my ($commit, $filename) = @$wk;
	$self->assign($commit, $filename);
    }
}



################################################################
package main;
my $usage = "blame [-z] <commit> filename";
my $line_termination = "\n";

$::ENV{GIT_INDEX_FILE} = "/tmp/blame-$$-index";
unlink($::ENV{GIT_INDEX_FILE});

if ($ARGV[0] eq '-z') {
    $line_termination = "\0";
    shift;
}

if (@ARGV != 2) {
    die $usage;
}

my $head_commit = Git::Commit->new($ARGV[0]);
my $filename = $ARGV[1];
my $blame = Git::Blame->new($head_commit, $filename);
if (-f ".blame-cache") {
    $blame->read_blame_cache(".blame-cache");
}

$blame->assign_blame();
$blame->print($line_termination);

unlink($::ENV{GIT_INDEX_FILE});

__END__

How does this work, and what do we do about merges?

The algorithm considers the first parent to be our main line of
development and treats it somewhat differently from the other parents.
So we pass on the blame to the first parent if a line has not changed
from it.  For lines that have changed from the first parent, we must
have either inherited that change from some other parent, or it could
have been a merge conflict resolution edit we did on our own.

The following picture illustrates how we pass on and assign blames.

In the sample, the original O was forked into A and B and then merged
into M.  Lines 1, 2, and 4 did not change.  Lines 3 and 5 are changed
in A, and Lines 5 and 6 are changed in B.  M made its own decision to
resolve merge conflicts at Line 5 to something different from A and B:

                A: 1 2 T 4 T 6
               /               \ 
O: 1 2 3 4 5 6                  M: 1 2 T 4 M S
               \               / 
                B: 1 2 3 4 S S

In the following picture, each line is annotated with a blame letter.
A lowercase blame (e.g. "a" for "1") means that commit or its ancestor
is the guilty party but we do not know which particular ancestor is
responsible for the change yet.  An uppercase blame means that we know
that commit is the guilty party.

First we look at M (the HEAD) and initialize Git::Blame->{LINE} like
this:

             M: 1 2 T 4 M S
                m m m m m m

That is, we know all lines are results of modification made by M or
some ancestor of M, so we assign lowercase 'm' to all of them.

Then we examine our first parent A.  Throughout the algorithm, we are
only ever interested in the lines for which we are the suspect, but
this being the initial round, we are the suspect for all of them.  We
notice that 1 2 T 4 are the same as in the parent A, so we pass the
blame for these four lines to A.  M and S are different from A, so we
leave them as they are (note that we do not immediately take the blame
for them):

             M: 1 2 T 4 M S
                a a a a m m

Next we go on to examine parent B.  Again, we are only interested in
the lines for which we are still the suspect (i.e. M and S).  We
notice S is something we inherited from B, so we pass the blame on to
it, like this:

             M: 1 2 T 4 M S
                a a a a m b

Once we have exhausted the parents, we look at the results and take
responsibility for the remaining lines for which we are still the
suspect:

             M: 1 2 T 4 M S
                a a a a M b

We are done with M.  And we know commits A and B need to be examined
further, so we do them recursively.  When we look at A, we again only
look at the lines for which A is the suspect:

             A: 1 2 T 4 T 6
                a a a a M b

Among 1 2 T 4, comparing against its parent O, we notice 1 2 4 are
the same so pass the blame for those lines to O:

             A: 1 2 T 4 T 6
                o o a o M b

A is a non-merge commit; we have already exhausted the parents and
take responsibility for the remaining lines for which A is the suspect:

             A: 1 2 T 4 T 6
                o o A o M b

We go on like this and the final result would become:

             O: 1 2 3 4 5 6
                O O A O M B
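
The walk above can be replayed with a small Python sketch.  It is only
an illustration and not a port of the Perl script: VERSIONS, PARENTS
and assign_blame are names made up here, and because every version in
the picture has six lines with edits made in place, the sketch compares
lines by position instead of reading a diff.

```python
# Contents and parents of the four commits in the picture.
VERSIONS = {
    "O": "1 2 3 4 5 6".split(),
    "A": "1 2 T 4 T 6".split(),
    "B": "1 2 3 4 S S".split(),
    "M": "1 2 T 4 M S".split(),
}
PARENTS = {"M": ["A", "B"], "A": ["O"], "B": ["O"], "O": []}


def assign_blame(head):
    n = len(VERSIONS[head])
    suspect = [head] * n    # lowercase blame: this commit or an ancestor
    guilty = [False] * n    # uppercase blame: this commit is responsible
    work = [head]
    while work and not all(guilty):
        commit = work.pop(0)
        for parent in PARENTS[commit]:  # the first parent gets first pick
            passed = False
            for i in range(n):
                if (not guilty[i] and suspect[i] == commit
                        and VERSIONS[parent][i] == VERSIONS[commit][i]):
                    suspect[i] = parent  # the line came from this parent
                    passed = True
            if passed:
                work.append(parent)
        # Parents exhausted: take responsibility for what is still ours.
        for i in range(n):
            if not guilty[i] and suspect[i] == commit:
                guilty[i] = True
    return suspect, guilty
```

Calling assign_blame("M") walks M, A, B and O in that order and ends
with the suspects O O A O M B, all marked guilty, matching the final
picture above.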

* Re: [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Junio C Hamano @ 2006-02-08 19:19 UTC
  To: Johannes Schindelin; +Cc: Franck Bui-Huu, git
In-Reply-To: <Pine.LNX.4.63.0602081843220.20568@wbgn013.biozentrum.uni-wuerzburg.de>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> Are there any rules on the choice of the script language ?
>
> Yes. Do not try to introduce unnecessary dependencies. But if it is 
> the right tool to do the job, you should use it. As of now, we have perl, 
> python and Tcl/Tk.

Very well said.  That's what currently stands.

* Re: [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Linus Torvalds @ 2006-02-08 19:09 UTC
  To: Ryan Anderson; +Cc: Junio C Hamano, git
In-Reply-To: <11394103753694-git-send-email-ryan@michonline.com>

On Wed, 8 Feb 2006, Ryan Anderson wrote:
> 
> I think this version is mostly ready to go.

Hmm.. I get

   [torvalds@g5 git]$ ./git-annotate Makefile
   fatal: 'e83c5163316f89bfbde7d9ab23ca2e25604af290^1..e83c5163316f89bfbde7d9ab23ca2e25604af290': No such file or directory
   Undefined subroutine &main::all_lines_claimed called at ./git-annotate line 124.

where that fatal error is because e83c51.. doesn't _have_ a parent, it's 
the root (so doing ^1 on it doesn't work).

After fixing the "all_lines_claimed" problem as outlined by Dscho, I get a 
lot of

	Skipping diff-parse - i = filelines)

and no actual output.

Doing it on a file that didn't exist in the root commit still gives those 
"Skipping" messages, but at least it did actually output something. 

However, what it output was clearly not correct, so there's still some 
tweaking to do.

For example, doing

	./git-annotate apply.c

annotates most of that file to Junio's commit 1c15afb9, which is totally 
incorrect: that commit actually only changed a few lines.

So it looks like there's still some work to be done on this..

			Linus

^ permalink raw reply

* Re: [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Randal L. Schwartz @ 2006-02-08 18:47 UTC (permalink / raw)
  To: Franck Bui-Huu; +Cc: Ryan Anderson, Junio C Hamano, git
In-Reply-To: <cda58cb80602080835s38713193t@mail.gmail.com>

>>>>> "Franck" == Franck Bui-Huu <vagabon.xyz@gmail.com> writes:

Franck> another perl script :(

Franck> Are there any rules on the choice of the script language ?

I could argue that they should all be Perl. :)

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

^ permalink raw reply

* Re: [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Johannes Schindelin @ 2006-02-08 17:45 UTC (permalink / raw)
  To: Franck Bui-Huu; +Cc: Junio C Hamano, git
In-Reply-To: <cda58cb80602080835s38713193t@mail.gmail.com>

Hi,

On Wed, 8 Feb 2006, Franck Bui-Huu wrote:

> another perl script :(
> 
> Are there any rules on the choice of the script language ?

Yes. Do not try to introduce unnecessary dependencies. But if it is 
the right tool to do the job, you should use it. As of now, we have perl, 
python and Tcl/Tk.

Hth,
Dscho

^ permalink raw reply

* Re: Handling large files with GIT
From: Linus Torvalds @ 2006-02-08 17:01 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Langhoff, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0602080815180.2458@g5.osdl.org>

On Wed, 8 Feb 2006, Linus Torvalds wrote:
>
> The fact that all the operations work on a full object, and the deltas 
> are (on purpose) just a very specific and limited kind of size 
> compression is just very ingrained.

Side note: the original explicit git "delta" objects by Nicolas Pitre 
would have handled this large-file-case much more gracefully. 

The pack-files had absolutely huge advantages, though, so I think we (I) 
did the right thing there in making the delta code only a very specific 
special case..

It is possible that we could re-introduce the "explicit delta" object, 
though (it's not incompatible with also doing pack-files, it's just that 
pack-files made 99% of all the arguments for an explicit delta go away).

		Linus

^ permalink raw reply

* Re: Shortest path between commits
From: Linus Torvalds @ 2006-02-08 16:43 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: git
In-Reply-To: <20060208160308.GB3484@linux-mips.org>

On Wed, 8 Feb 2006, Ralf Baechle wrote:
>
> I wonder if there is some way to find the shortest path between two commits?
> That is, if there is a merge between the two commits I only want the merge
> commit itself, not the potentially large list of commits that were merged.

The problem is that it's entirely possible that no such path even 
exists.

Two commits are not necessarily directly related, and asking for the 
shortest path may involve having to go both backwards _and_ forwards in 
history to get from one to the other. The most trivial case is

	    a  <- head of tree
	   / \
	  /   \
	 b     c
	  \   /
	   \ /
	    d  <- root

where the shortest path between "b" and "c" is not really a well-defined 
notion.

Now, _if_ you know that one of the commits is a direct descendant of the 
other, a sensible path can be decided on, but even then the notion of 
"shortest" is not obvious. Look at "a" vs "d" above - which path is the 
shortest one? The one through "b" or the one through "c"? There's really 
no way to tell them apart (you could select "first parent", but in more 
complex graphs that might not be unambiguous either).

That said, and to finally answer your question: selecting _one_ short path 
between two commits (if they are directly related) is certainly possible, 
but no, we don't have anything like that available right now. It wouldn't 
be hugely difficult to do an addition to git-rev-list to do so, though.

Can you describe your usage case? The operation really _isn't_ sensible in 
general, so while I could add a flag to git-rev-list to only print out as 
direct a chain as possible, I'd like to know that there is at least _one_ 
entirely sane usage for such a thing.
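
For the directly-related case, picking _one_ short chain amounts to a
breadth-first search along parent links. A minimal sketch in Python,
assuming the history is available as a plain dictionary from commit name
to parent names (this is an illustration, not an existing git-rev-list
option, and the function name is made up):

```python
from collections import deque


def shortest_ancestry_path(parents, descendant, ancestor):
    """Breadth-first search from descendant towards ancestor along parent links.

    parents maps each commit name to a list of its parent names.
    Returns one shortest chain [descendant, ..., ancestor], or None if
    ancestor is not reachable from descendant.
    """
    previous = {descendant: None}
    queue = deque([descendant])
    while queue:
        commit = queue.popleft()
        if commit == ancestor:
            path = []
            while commit is not None:
                path.append(commit)
                commit = previous[commit]
            return path[::-1]
        for parent in parents.get(commit, []):
            if parent not in previous:
                previous[parent] = commit
                queue.append(parent)
    return None


# The diamond from the example above: a is the head, d is the root.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(shortest_ancestry_path(graph, "a", "d"))  # ['a', 'b', 'd']
print(shortest_ancestry_path(graph, "b", "c"))  # None
```

Parents are visited in order, so the tie between the "b" and "c" routes
is broken in favour of the first parent, which is an arbitrary choice as
noted above. For "b" and "c" no chain exists at all.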

		Linus

^ permalink raw reply

* Re: [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Franck Bui-Huu @ 2006-02-08 16:35 UTC (permalink / raw)
  To: Ryan Anderson; +Cc: Junio C Hamano, git
In-Reply-To: <11394103753694-git-send-email-ryan@michonline.com>

2006/2/8, Ryan Anderson <ryan@michonline.com>:
> Signed-off-by: Ryan Anderson <ryan@michonline.com>
>
> ---
>
> I think this version is mostly ready to go.
>

another perl script :(

Are there any rules on the choice of the script language ?

> Junio, the post you pointed me at was very helpful (once I got around to
> listening to it), but the code it links to is missing - if that's a
> better partial implementation than this, can you resurrect it
> somewhere?  I'd be happy to reintegrate it together.
>

Thanks
--
               Franck

^ permalink raw reply

* Re: Handling large files with GIT
From: Linus Torvalds @ 2006-02-08 16:34 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Langhoff, Git Mailing List
In-Reply-To: <Pine.LNX.4.63.0602081248270.31700@wbgn013.biozentrum.uni-wuerzburg.de>

On Wed, 8 Feb 2006, Johannes Schindelin wrote:
> 
> I am uncertain if it is possible to extend git to handle large files 
> gracefully, without slowing it down for its main use case.

Indeed. The git architecture simply sucks for big objects. It was 
discussed somewhat during the early stages, but a lot of it really is 
pretty fundamental. The fact that all the operations work on a full 
object, and the deltas are (on purpose) just a very specific and limited 
kind of size compression is just very ingrained.

> [thinking] A potentially silly idea just hit me: We could virtually cut 
> every file into 256kB chunks. That would not affect source code at all: 
> anybody producing a 256kB C file should be shot anyway.

It probably wouldn't help that much, really. And it would probably impact 
source code users too: I bet we'd have bugs. It would be a very strange 
special case.

It also would only help for things that purely grow at the end. Which 
isn't even true for a mailbox: it may or may not be true for your INBOX, 
but anybody who _uses_ a mailbox format to read his email will be adding 
status flags to the mbox format (or deleting mbox entries etc). 

So every time a small change happened that changed the offset, you'd have 
an explosion of these 256kB chunk objects, and while the delta would work 
(probably slowly - remember how the git deltification algorithm tries to 
compare against the ten "nearest" neighbors), at _commit_ time you'd have 
to write that 1GB (compressed) out anyway.
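
The offset problem is easy to see with a toy model. A minimal sketch in
Python, assuming each fixed-size chunk would be stored as an object named
by a hash of its bytes (a plain SHA-1 of the chunk stands in for the
object name here, and the chunk size is shrunk to 16 bytes for the demo):

```python
import hashlib


def chunk_ids(data, size):
    """Hash consecutive fixed-size chunks of data, as a stand-in for chunk objects."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def shared_chunks(old, new, size):
    """Count chunks of new whose hash already exists among the chunks of old."""
    known = set(chunk_ids(old, size))
    return sum(1 for cid in chunk_ids(new, size) if cid in known)


size = 16  # tiny chunk size for the demo; the proposal was 256kB
old = bytes(range(256))                  # 16 distinct chunks
appended = old + b"new mail at the end"  # growth at the end only
flagged = b"X" + old                     # one byte inserted at the front

print(shared_chunks(old, appended, size))  # 16: every old chunk is reused
print(shared_chunks(old, flagged, size))   # 0: no chunk survives the shift
```

Appending leaves all existing chunks intact, while a single inserted byte
near the start shifts every later boundary and produces a completely new
set of chunk objects.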

Realistically, I think the answer is that git just doesn't work for his 
usage case. There's two alternatives:

 - convince him to not have big mailboxes (an answer I don't particularly 
   like: it's a tool limitation, and you shouldn't change your behaviour 
   just because the tool doesn't work for it - you should just try to find 
   the right tool).

   That said: git should actually work beautifully for email if you 
   _don't_ keep it as one big mbox. You could probably very reasonably use 
   git as a database backend, where each email is its own object, and you 
   can have many different ways of indexing them into trees (by content, 
   by date, by author, by thread).

   But that's very different from what the suggested "home directory" 
   setup would be.

 - try to work around some of the worst git issues. While I don't think 
   the 256kB blocking thing would help (the git protocol would still 
   always send the base versions), there _are_ probably things that could 
   be done. They'd be very invasive, though, and somebody would seriously 
   have to look at the architectural issues.

   For example, right now the decision to send only "self-contained" packs 
   in the git protocol was a very conscious one: it's much safer, and it 
   makes the unpacking a lot easier (the unpacking doesn't ever have to 
   even read any other objects than the stream it gets). It's also (for 
   packs that we use on-disk) the only sane way to avoid nasty inter-pack 
   dependencies.

   But for the git protocol, the inter-pack dependencies don't matter, 
   if we'd always unpack the thing on reception if it is not a 
   self-contained pack. So we _could_ allow deltas that depend on the 
   receiver already having the objects we delta against.

   However, the deltification itself is likely very slow, exactly because 
   git (again, very much by design) generates the deltas dynamically 
   rather than depending on things already being in delta format.
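
To illustrate the "each email is its own object" idea from the first
alternative: a minimal sketch in Python that splits an mbox into messages
and names each one the way git names a blob (SHA-1 over a "blob <size>"
header, a NUL byte, and the content). The mbox splitting is deliberately
naive (it only looks for lines starting with "From ") and the sample
messages are invented:

```python
import hashlib


def blob_id(content):
    """Object name git gives a blob: SHA-1 over 'blob <size>', a NUL, then content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def split_mbox(mbox):
    """Split mbox bytes into messages at lines that start with 'From '."""
    messages, current = [], []
    for line in mbox.splitlines(keepends=True):
        if line.startswith(b"From ") and current:
            messages.append(b"".join(current))
            current = []
        current.append(line)
    if current:
        messages.append(b"".join(current))
    return messages


mbox = (b"From alice Wed Feb  8 09:14:00 2006\nSubject: one\n\nfirst\n"
        b"From bob Wed Feb  8 11:54:00 2006\nSubject: two\n\nsecond\n")
before = {blob_id(m) for m in split_mbox(mbox)}

# Adding a status flag to the first message changes only that message's object.
flagged = mbox.replace(b"Subject: one\n", b"Subject: one\nStatus: RO\n")
after = {blob_id(m) for m in split_mbox(flagged)}
print(len(before & after))  # 1: the second message keeps its object name
```

With one object per message, a flag change rewrites one small object
instead of the whole mailbox.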

Personally, I think the answer is "git is good for lots of small files". 
It's very much what git was designed for, and the fact that it doesn't 
work for everything is a trade-off for the things it _does_ work well for.

			Linus

^ permalink raw reply

* Re: [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Johannes Schindelin @ 2006-02-08 16:05 UTC (permalink / raw)
  To: Peter Eriksen; +Cc: git
In-Reply-To: <20060208150950.GA29346@ebar091.ebar.dtu.dk>

Hi,

On Wed, 8 Feb 2006, Peter Eriksen wrote:

> On Wed, Feb 08, 2006 at 09:52:55AM -0500, Ryan Anderson wrote:
> > Signed-off-by: Ryan Anderson <ryan@michonline.com>
> > 
> > ---
> > 
> > I think this version is mostly ready to go.
> > 
> > Junio, the post you pointed me at was very helpful (once I got around to
> > listening to it), but the code it links to is missing - if that's a
> > better partial implementation than this, can you resurrect it
> > somewhere?  I'd be happy to reintegrate it together.
> 
> Does it depend on some earlier patch?  I get this:
> 
> git]$ git-annotate diff-delta.c
> Undefined subroutine &main::all_lines_claimed called at
> /home/peter/bin/git-annotate line 124.

Just add a function like

-- snip --
sub all_lines_claimed {
        return ($leftover_lines == 0);
}
-- snap --

and you're done.

However, it does not yet do the correct thing: it does not show the root 
commit. For example, if you do "git annotate git-am.sh" it should show 
"d1c5f2a4" for the first lines, not "a1451104" as it does.
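
The expected behaviour can be sketched independently of the Perl script.
A minimal sketch in Python, assuming a strictly linear history given as
in-memory (revision, lines) pairs and using `difflib` for line matching
rather than the patch's own matching loop; it is not a port of
git-annotate, and the revision names are placeholders:

```python
import difflib


def annotate(history):
    """Attribute each line of the newest version to the oldest revision that has it.

    history is a list of (revision, lines) pairs, newest first, for a strictly
    linear history. Returns a list of (revision, line) for the newest version.
    """
    newest_rev, newest_lines = history[0]
    owner = [newest_rev] * len(newest_lines)
    # position[i] is the index of newest line i in the version being examined,
    # or None once the line was found to be absent from an older version.
    position = list(range(len(newest_lines)))
    current = newest_lines
    for rev, lines in history[1:]:
        matcher = difflib.SequenceMatcher(None, lines, current, autojunk=False)
        to_parent = {}
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                to_parent[block.b + k] = block.a + k
        for i, pos in enumerate(position):
            if pos is None:
                continue
            if pos in to_parent:
                owner[i] = rev          # the older revision also has this line
                position[i] = to_parent[pos]
            else:
                position[i] = None      # line was introduced after this revision
        current = lines
    return list(zip(owner, newest_lines))


history = [
    ("c3", ["a", "b2", "c", "d"]),  # newest
    ("c2", ["a", "b", "c"]),
    ("c1", ["a", "c"]),             # root commit
]
for rev, line in annotate(history):
    print(rev, line)
```

Lines that survive all the way back are credited to the root commit "c1",
which is the behaviour asked for above.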

Ciao,
Dscho

^ permalink raw reply

* Shortest path between commits
From: Ralf Baechle @ 2006-02-08 16:03 UTC (permalink / raw)
  To: git

I wonder if there is some way to find the shortest path between two commits?
That is, if there is a merge between the two commits I only want the merge
commit itself, not the potentially large list of commits that were merged.

I need that for commit notification scripts; I don't want to spam users
with too many emails and aggregating everything that came through a single
merge would be a reasonable approach.

  Ralf

^ permalink raw reply

* Re: [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Peter Eriksen @ 2006-02-08 15:09 UTC (permalink / raw)
  To: git
In-Reply-To: <11394103753694-git-send-email-ryan@michonline.com>

On Wed, Feb 08, 2006 at 09:52:55AM -0500, Ryan Anderson wrote:
> Signed-off-by: Ryan Anderson <ryan@michonline.com>
> 
> ---
> 
> I think this version is mostly ready to go.
> 
> Junio, the post you pointed me at was very helpful (once I got around to
> listening to it), but the code it links to is missing - if that's a
> better partial implementation than this, can you resurrect it
> somewhere?  I'd be happy to reintegrate it together.

Does it depend on some earlier patch?  I get this:

git]$ git-annotate diff-delta.c
Undefined subroutine &main::all_lines_claimed called at
/home/peter/bin/git-annotate line 124.

The patch was applied to: git version 1.1.6.gd19e-dirty.

Peter

^ permalink raw reply

* [PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.
From: Ryan Anderson @ 2006-02-08 14:52 UTC (permalink / raw)
  To: Junio C Hamano, git; +Cc: Ryan Anderson

Signed-off-by: Ryan Anderson <ryan@michonline.com>

---

I think this version is mostly ready to go.

Junio, the post you pointed me at was very helpful (once I got around to
listening to it), but the code it links to is missing - if that's a
better partial implementation than this, can you resurrect it
somewhere?  I'd be happy to reintegrate it together.

 Makefile          |    1 
 git-annotate.perl |  291 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 292 insertions(+), 0 deletions(-)
 create mode 100755 git-annotate.perl

86fa163e7fd1bee2929b7946456407dbc7745193
diff --git a/Makefile b/Makefile
index 5c32934..8d24660 100644
--- a/Makefile
+++ b/Makefile
@@ -117,6 +117,7 @@ SCRIPT_SH = \
 SCRIPT_PERL = \
 	git-archimport.perl git-cvsimport.perl git-relink.perl \
 	git-shortlog.perl git-fmt-merge-msg.perl git-rerere.perl \
+	git-annotate.perl \
 	git-svnimport.perl git-mv.perl git-cvsexportcommit.perl
 
 SCRIPT_PYTHON = \
diff --git a/git-annotate.perl b/git-annotate.perl
new file mode 100755
index 0000000..a3ea201
--- /dev/null
+++ b/git-annotate.perl
@@ -0,0 +1,291 @@
+#!/usr/bin/perl
+# Copyright 2006, Ryan Anderson <ryan@michonline.com>
+#
+# GPL v2 (See COPYING)
+#
+# This file is licensed under the GPL v2, or a later version
+# at the discretion of Linus Torvalds.
+
+use warnings;
+use strict;
+
+use Data::Dumper;
+
+my $filename = shift @ARGV;
+
+
+my @stack = (
+	{
+		'rev' => "HEAD",
+		'filename' => $filename,
+	},
+);
+
+our (@lineoffsets, @pendinglineoffsets);
+our @filelines = ();
+open(F,"<",$filename)
+	or die "Failed to open filename: $!";
+
+while(<F>) {
+	chomp;
+	push @filelines, $_;
+}
+close(F);
+our $leftover_lines = @filelines;
+our %revs;
+our @revqueue;
+our $head;
+
+my $revsprocessed = 0;
+while (my $bound = pop @stack) {
+	my @revisions = git_rev_list($bound->{'rev'}, $bound->{'filename'});
+	foreach my $revinst (@revisions) {
+		my ($rev, @parents) = @$revinst;
+		$head ||= $rev;
+
+		if (scalar @parents > 0) {
+			$revs{$rev}{'parents'} = \@parents;
+			$revs{$rev}{'filename'} = $bound->{'filename'};
+			next;
+		}
+
+		my $newbound = find_parent_renames($rev, $bound->{'filename'});
+		if ( exists $newbound->{'filename'} && $newbound->{'filename'} ne $bound->{'filename'}) {
+			push @stack, $newbound;
+			$revs{$rev}{'parents'} = [$newbound->{'rev'}];
+		}
+	}
+}
+push @revqueue, $head;
+init_claim($head);
+$revs{$head}{'lineoffsets'} = {};
+handle_rev();
+
+
+my $i = 0;
+foreach my $l (@filelines) {
+	my ($output, $rev, $committer, $date);
+	if (ref $l eq 'ARRAY') {
+		($output, $rev, $committer, $date) = @$l;
+		if (length($rev) > 8) {
+			$rev = substr($rev,0,8);
+		}
+	} else {
+		$output = $l;
+		($rev, $committer, $date) = ('unknown', 'unknown', 'unknown');
+	}
+
+	printf("(%8s %10s %10s %d)%s\n", $rev, $committer, $date, $i++, $output);
+}
+
+sub init_claim {
+	my ($rev) = @_;
+	for (my $i = 0; $i < @filelines; $i++) {
+		$filelines[$i] = [ $filelines[$i], $rev, 'unknown', 'unknown', 0];
+			# line,
+			# rev,
+			# author,
+			# date,
+			# confirmed to actually belong to this rev (0 = tentative)
+	}
+}
+
+
+sub handle_rev {
+	my $i = 0;
+	while (my $rev = shift @revqueue) {
+
+		my %revinfo = git_commit_info($rev);
+
+		foreach my $p (@{$revs{$rev}{'parents'}}) {
+
+			my $nlineoffsets = {%{$revs{$rev}{'lineoffsets'}}};
+			git_line_assign($p, $rev, $revs{$p}{'filename'}, $nlineoffsets,
+				%revinfo);
+			push @revqueue, $p;
+			$revs{$p}{'lineoffsets'} = $nlineoffsets;
+		}
+
+		for (my $i = 0; $i < @filelines; $i++) {
+			if ($filelines[$i][1] eq $rev) {
+				claim_line($i, $rev, %revinfo);
+			}
+		}
+
+		if (scalar @{$revs{$rev}{parents}} == 0) {
+			# We must be at the initial rev here, so claim everything that is left.
+			for (my $i = 0; $i < @filelines; $i++) {
+				if (ref $filelines[$i] eq '') {
+					claim_line($i, $rev, %revinfo);
+				}
+			}
+		}
+	
+		return 1 if all_lines_claimed();
+	}	
+}
+
+
+sub git_rev_list {
+	my ($rev, $file) = @_;
+	#printf("grl = %s, %s\n", $rev, $file);
+
+# 	printf("Calling: %s\n",join(" ","git-rev-list","--parents","--remove-empty",$rev,"--",$file));
+	open(P,"-|","git-rev-list","--parents","--remove-empty",$rev,"--",$file)
+		or die "Failed to exec git-rev-list: $!";
+
+	my @revs;
+	while(my $line = <P>) {
+# 		print $line;
+		chomp $line;
+		my ($rev, @parents) = split /\s+/, $line;
+		push @revs, [ $rev, @parents ];
+	}
+	close(P);
+
+	printf("0 revs found for rev %s (%s)\n", $rev, $file) if (@revs == 0);
+	return @revs;
+}
+
+sub find_parent_renames {
+	my ($rev, $file) = @_;
+
+	open(P,"-|","git-diff", "-r","--name-status", "-z","$rev^1..$rev")
+		or die "Failed to exec git-diff: $!";
+
+	local $/ = "\0";
+	my %bound;
+	while (my $change = <P>) {
+		chomp $change;
+		my $filename = <P>;
+		chomp $filename;
+
+		if ($change =~ m/^[AMD]$/ ) {
+			next;
+		} elsif ($change =~ m/^R/ ) {
+			my $oldfilename = $filename;
+			$filename = <P>;
+			chomp $filename;
+			if ( $file eq $filename ) {
+				my $parent = git_find_parent($rev);
+				#printf("Found rename at boundary: %s-%s, %s\n", $rev, $parent, $oldfilename);
+				@bound{'rev','filename'} = ($parent, $oldfilename);
+
+				last;
+			} else {
+				#printf("Found unknown rename of %s => %s\n", $oldfilename, $filename);
+			}
+		} else {
+			#printf("Unknown name-status type of '%s'\n", $change);
+		}
+	}
+	close(P);
+
+	return \%bound;
+}
+
+
+sub git_find_parent {
+	my ($rev) = @_;
+
+	open(REVPARENT,"-|","git-rev-list","--parents","$rev^1..$rev")
+		or die "Failed to open git-rev-list to find a single parent: $!";
+
+	my $parentline = <REVPARENT>;
+	chomp $parentline;
+	my ($revfound,$parent) = split m/\s+/, $parentline;
+
+	close(REVPARENT);
+
+	return $parent;
+}
+
+
+# Examine a revision to see if it has unclaimed lines that we have,
+# if so, give those lines to that revision.
+sub git_line_assign {
+	my ($parent, $rev, $filename, $lineoffsets, %revinfo) = @_;
+
+	my @plines = git_cat_file($parent, $filename);
+
+	my ($i, $j, $jbase) = (0,0,0);
+	while ($i < @filelines && $filelines[$i][1] ne $rev) {
+		$i++;
+	}
+
+	if ($i == @filelines) {
+		printf("Skipping diff-parse - i = filelines)\n");
+	}
+	return if $i == @filelines;
+
+	while($i < @filelines && $j < @plines) {
+		if ($filelines[$i][0] eq $plines[$j]) {
+			# Our parent has this line, give it away.
+			$filelines[$i][1] = $parent;
+			$jbase = $j;
+			$i++;
+			$j++;
+			
+		} elsif ($j+1 == @plines) {
+			$i++;
+			$j = $jbase;
+		} else {
+			$j++;
+		}
+	}
+}
+
+sub git_cat_file {
+	my ($parent, $filename) = @_;
+	return () unless defined $parent && defined $filename;
+	my $blobline = `git-ls-tree $parent $filename`;
+	my ($mode, $type, $blob, $tfilename) = split(/\s+/, $blobline, 4);
+
+	open(C,"-|","git-cat-file", "blob", $blob)
+		or die "Failed to git-cat-file blob $blob (rev $parent, file $filename): " . $!;
+
+	my @lines;
+	while(<C>) {
+		chomp;
+		push @lines, $_;
+	}
+	close(C);
+
+	return @lines;
+}
+
+
+sub claim_line {
+	my ($floffset, $rev, %revinfo) = @_;
+	my $oline = $filelines[$floffset][0];
+	$filelines[$floffset] =	[ $oline, $rev,
+		$revinfo{'author'}, $revinfo{'author_date'} ];
+	$leftover_lines--;
+	printf("Claiming line %d with rev %s: '%s'\n",
+			$floffset, $rev, $oline) if 0;
+}
+
+sub git_commit_info {
+	my ($rev) = @_;
+	open(COMMIT, "-|","git-cat-file", "commit", $rev)
+		or die "Failed to call git-cat-file: $!";
+
+	my %info;
+	while(<COMMIT>) {
+		chomp;
+		last if (length $_ == 0);
+
+		if (m/^author (.*) <(.*)> (.*)$/) {
+			$info{'author'} = $1;
+			$info{'author_email'} = $2;
+			$info{'author_date'} = $3;
+		} elsif (m/^committer (.*) <(.*)> (.*)$/) {
+			$info{'committer'} = $1;
+			$info{'committer_email'} = $2;
+			$info{'committer_date'} = $3;
+		}
+	}
+	close(COMMIT);
+
+	return %info;
+}
-- 
1.1.6.g3b91b

^ permalink raw reply related

* Re: Handling large files with GIT
From: Johannes Schindelin @ 2006-02-08 11:54 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List
In-Reply-To: <46a038f90602080114r2205d72cmc2b5c93f6fffe03d@mail.gmail.com>

Hi,

On Wed, 8 Feb 2006, Martin Langhoff wrote:

> Roland Stigge recently pointed out a use case using very large files
> where GIT has some serious limitations.

That is intentional: git handles source code very well, where you tend to 
have small files, and it handles branches very well, where you tend to 
have mostly the same files in different branches.

I am uncertain if it is possible to extend git to handle large files 
gracefully, without slowing it down for its main use case.

[thinking] A potentially silly idea just hit me: We could virtually cut 
every file into 256kB chunks. That would not affect source code at all: 
anybody producing a 256kB C file should be shot anyway.

If the files just keep growing, this should help enormously. If the files 
change subtly, the diff algorithm should work quite well on 'em.

Comments?

Ciao,
Dscho

^ permalink raw reply

* Handling large files with GIT
From: Martin Langhoff @ 2006-02-08  9:14 UTC (permalink / raw)
  To: Git Mailing List

Roland Stigge recently pointed out a use case using very large files
where GIT has some serious limitations. He is one of several Debian
developers keeping their homedir under version control with SVN (blame
Joey Hess for this - http://www.kitenet.net/~joey/svnhome.html ).

SVN does reasonably well tracking his >1GB mbox file. Now, I don't
know if I like the idea of putting my own mbox file under version
control, but it looks like projects with large and slow-changing files
would be in trouble with GIT. Not literally trouble, but gross
inefficiencies.

The problems are two. At commit time, a full copy is stored in the
object database until git-repack && git-prune-packed are called. And
during the transfer over the git protocol we send the full object,
even if both ends have objects that are good candidates for a small
delta.

I'm not strong on either aspect of git (packfile format or git
protocol), and I don't personally deal with large files. So feel free
to ignore me for the time being. If it ever itches, you might get a
patch...

cheers,

martin

^ permalink raw reply

* Re: [PATCH] git-commit: revamp the git-commit semantics.
From: Junio C Hamano @ 2006-02-08  5:41 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0602071412390.5397@localhost.localdomain>

Nicolas Pitre <nico@cam.org> writes:

> As someone refreshed my memory in private, there is no "unstable" branch 
> like Linux used to have.  But hopefully you all understood what I meant 
> i.e. in the main branch after the stable 1.2.0 branch is forked.

Yes I understood what you meant.  Eventually we ship with --only
as the default and the best timing is between 1.2.0 and 1.3.0.

I am just worried if it might turn out to be like shipping a
bicycle with training wheels welded onto it (so it cannot be
easily removed).  That is where my reluctance comes from.

But I do not have to be convinced 100% myself in order to set it
to the default, as long as the more important users are happy.

^ permalink raw reply

* Re: [PATCH] .gitignore git-rerere and config.mak
From: Junio C Hamano @ 2006-02-08  5:34 UTC (permalink / raw)
  To: Jason Riedy; +Cc: git
In-Reply-To: <1303.1139370931@lotus.CS.Berkeley.EDU>

Jason Riedy <ejr@EECS.Berkeley.EDU> writes:

> And Junio C Hamano writes:
>  - I am not sure about this part.  It is plausible that somebody
>  - who privately uses config.mak has it in _his_ repository under
>  - version control.
>
> Like me.  That way I don't have to worry about conflicts in the
> Makefile.  But I can change .gitignore in those branches...

Or you can leave that as is.  .gitignore is used to sift
untracked files into two categories - ignored and unknown.  So
my initial worry was unfounded.
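
The sifting can be sketched as follows. A minimal illustration in Python,
assuming simple shell-style patterns matched with `fnmatch`; real
.gitignore matching has more rules than this, and the file names and the
`classify` helper are only examples:

```python
from fnmatch import fnmatch


def classify(paths, tracked, ignore_patterns):
    """Label each path as tracked, ignored, or unknown.

    Ignore patterns are consulted only for paths that are not tracked.
    """
    result = {}
    for path in paths:
        if path in tracked:
            result[path] = "tracked"
        elif any(fnmatch(path, pattern) for pattern in ignore_patterns):
            result[path] = "ignored"
        else:
            result[path] = "unknown"
    return result


labels = classify(
    paths=["Makefile", "config.mak", "git-rerere", "notes.txt"],
    tracked={"Makefile", "config.mak"},
    ignore_patterns=["config.mak", "git-rerere"],
)
for path, label in labels.items():
    print(path, label)
```

Here config.mak stays "tracked" even though a pattern names it, because
the patterns only apply to untracked files.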

Sorry for the noise.

^ permalink raw reply
