* Merge with git-pasky II.
@ 2005-04-14  0:29 Petr Baudis
  2005-04-13 21:25 ` Christopher Li
                   ` (2 more replies)
  0 siblings, 3 replies; 130+ messages in thread
From: Petr Baudis @ 2005-04-14  0:29 UTC (permalink / raw)
  To: torvalds; +Cc: git
  Hello Linus,
  I think my tree should be ready for merging with you. It is the final
tree and I've already switched my main branch to it, so it's what
people doing git pull have been getting for some time already.
  Its main contents are all of my shell scripts. Apart from that, some
tiny fixes scattered all around can be found there, as well as some
patches which went through the mailing list. My last merge with you
concerned your commit 39021759c903a943a33a28cfbd5070d36d851581.
  It's again
	rsync://pasky.or.cz/git/
this time my HEAD is fba83970090ef54c6eb86dcc2c2d5087af5ac637.
  Note that my rsync tree still contains my old branch as well; I thought
I'd leave it around in the public objects database for some time, should
anyone want to have a look at the history of some of the scripts. But if
you want it gone, tell me and I will prune it (and perhaps offer it in
/git-old/ or whatever). I'm using the following:
	fsck-cache --unreachable $(commit-id) | grep unreachable \
		| cut -d ' ' -f 2 | sed 's/^\(..\)/.git\/objects\/\1\//' \
		| xargs rm
  Thanks,
-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
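
The sed expression in the pruning pipeline above turns an object name into
its file under .git/objects/ by splitting off the first two hex digits as a
directory. A minimal Python sketch of that mapping, for illustration only;
the function name loose_object_path and the git_dir parameter are made up
here and are not part of git:

```python
import os


def loose_object_path(sha1, git_dir=".git"):
    """Return the loose-object path for a 40-hex-digit object name.

    Hypothetical helper mirroring the sed expression in the message:
    the first two hex digits name the directory, the rest the file.
    """
    if len(sha1) != 40 or any(c not in "0123456789abcdef" for c in sha1):
        raise ValueError("not a SHA-1 object name: %r" % sha1)
    return os.path.join(git_dir, "objects", sha1[:2], sha1[2:])
```

Validating the name before building the path is a deliberate choice in this
sketch: whatever comes out of the pipeline ends up as an argument to rm.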
^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-14  0:29 Merge with git-pasky II Petr Baudis
@ 2005-04-13 21:25 ` Christopher Li
  2005-04-14  0:45   ` Petr Baudis
  2005-04-14  0:30 ` Petr Baudis
  2005-04-14 22:11 ` git merge Petr Baudis
  2 siblings, 1 reply; 130+ messages in thread
From: Christopher Li @ 2005-04-13 21:25 UTC (permalink / raw)
  To: Petr Baudis; +Cc: torvalds, git

While you are there, do you mind to move the shell script
to a sub directory? Let's try how rename works.

Chris

On Thu, Apr 14, 2005 at 02:29:02AM +0200, Petr Baudis wrote:
>   Hello Linus,
> 
>   I think my tree should be ready for merging with you. It is the final
> tree and I've already switched my main branch to it, so it's what
> people doing git pull have been getting for some time already.
> 
>   Its main contents are all of my shell scripts. Apart from that, some
> tiny fixes scattered all around can be found there, as well as some
> patches which went through the mailing list. My last merge with you
> concerned your commit 39021759c903a943a33a28cfbd5070d36d851581.
> 
>   It's again
> 
> 	rsync://pasky.or.cz/git/
> 
> this time my HEAD is fba83970090ef54c6eb86dcc2c2d5087af5ac637.
> 
>   Note that my rsync tree still contains my old branch as well; I thought
> I'd leave it around in the public objects database for some time, should
> anyone want to have a look at the history of some of the scripts. But if
> you want it gone, tell me and I will prune it (and perhaps offer it in
> /git-old/ or whatever). I'm using the following:
> 
> 	fsck-cache --unreachable $(commit-id) | grep unreachable \
> 		| cut -d ' ' -f 2 | sed 's/^\(..\)/.git\/objects\/\1\//' \
> 		| xargs rm
> 
>   Thanks,
> 
> -- 
> 				Petr "Pasky" Baudis
> Stuff: http://pasky.or.cz/
> C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-13 21:25 ` Christopher Li
@ 2005-04-14  0:45   ` Petr Baudis
  2005-04-13 22:00     ` Christopher Li
  2005-04-14  3:51     ` Linus Torvalds
  0 siblings, 2 replies; 130+ messages in thread
From: Petr Baudis @ 2005-04-14  0:45 UTC (permalink / raw)
  To: Christopher Li; +Cc: torvalds, git

Dear diary, on Wed, Apr 13, 2005 at 11:25:46PM CEST, I got a letter
where Christopher Li <git@chrisli.org> told me that...
> While you are there, do you mind to move the shell script
> to a sub directory? Let's try how rename works.

Well, unless Linus will want me otherwise, I'd like to postpone this
until I'm finally done with the damn merge - enough things already got
into my way today, so I would really like to focus on this tomorrow. So
I'll be probably merging only (or mostly) bugfixes until I have that
finished.

P.S.: Just staring at
http://www.theregister.co.uk/2005/04/11/torvalds_attack/ ... I'm nothing
like a regular reader of (R), but I thought the guys have at least a bit
of sense. Duh. :/ Or is April 11 now yet another joke day after April 1?

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-14  0:45   ` Petr Baudis
@ 2005-04-13 22:00     ` Christopher Li
  2005-04-14  3:51     ` Linus Torvalds
  1 sibling, 0 replies; 130+ messages in thread
From: Christopher Li @ 2005-04-13 22:00 UTC (permalink / raw)
  To: Petr Baudis; +Cc: torvalds, git

On Thu, Apr 14, 2005 at 02:45:04AM +0200, Petr Baudis wrote:
> Dear diary, on Wed, Apr 13, 2005 at 11:25:46PM CEST, I got a letter
> where Christopher Li <git@chrisli.org> told me that...
> Well, unless Linus will want me otherwise, I'd like to postpone this
> until I'm finally done with the damn merge - enough things already got
> into my way today, so I would really like to focus on this tomorrow. So
> I'll be probably merging only (or mostly) bugfixes until I have that
> finished.

Sure, whenever you are ready.

> P.S.: Just staring at
> http://www.theregister.co.uk/2005/04/11/torvalds_attack/ ... I'm nothing
> like a regular reader of (R), but I thought the guys have at least a bit
> of sense. Duh. :/ Or is April 11 now yet another joke day after April 1?

Whatever, it is a news site. They never mention git, though.

Chris

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-14  0:45   ` Petr Baudis
  2005-04-13 22:00     ` Christopher Li
@ 2005-04-14  3:51     ` Linus Torvalds
  2005-04-14  1:23       ` Christopher Li
  2005-04-14  7:05       ` Junio C Hamano
  1 sibling, 2 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-14  3:51 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Christopher Li, git

On Thu, 14 Apr 2005, Petr Baudis wrote:
> 
> http://www.theregister.co.uk/2005/04/11/torvalds_attack/ ... I'm nothing
> like a regular reader of (R), but I thought the guys have at least a bit
> of sense. Duh. :/ Or is April 11 now yet another joke day after April 1?

I actually _am_ a fairly regular reader, and hey, being opinionated and a
bit over the top is what makes the site worthwhile. It's obviously what
motivates the people.

And then, occasionally, when they bite you, hey, that's the price of
having a high profile. I worry more about sometimes not listening to
critics than I do about the critics themselves.

Thick skin is the name of the game. I'd not get any work done otherwise.

On that note - I've been avoiding doing the merge-tree thing, in the hope
that somebody else does what I've described. I really do suck at scripting
things, yet this is clearly something where using C to do a lot of the
stuff is pointless.

Almost all the parts do seem to be there, ie Daniel did the "common
parent" part, and the rest really does seem to be more about scripting
than writing more C plumbing stuff..

		Linus

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-14  3:51     ` Linus Torvalds
@ 2005-04-14  1:23       ` Christopher Li
  2005-04-14  5:03         ` Paul Jackson
  2005-04-14  7:05       ` Junio C Hamano
  1 sibling, 1 reply; 130+ messages in thread
From: Christopher Li @ 2005-04-14  1:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, git

On Wed, Apr 13, 2005 at 08:51:50PM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 14 Apr 2005, Petr Baudis wrote:
> 
> Thick skin is the name of the game. I'd not get any work done otherwise.
> 
> On that note - I've been avoiding doing the merge-tree thing, in the hope
> that somebody else does what I've described. I really do suck at scripting
> things, yet this is clearly something where using C to do a lot of the
> stuff is pointless.
> 
> Almost all the parts do seem to be there, ie Daniel did the "common
> parent" part, and the rest really does seem to be more about scripting
> than writing more C plumbing stuff..

Do you have preference about what language of script we used?
I actually hesitated to introduce my Python script to git.

I can build some script extension for git just like the one I
did for sparse, is that something you want to see?

Chris

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-14  1:23       ` Christopher Li
@ 2005-04-14  5:03         ` Paul Jackson
  2005-04-14  2:16           ` Christopher Li
  0 siblings, 1 reply; 130+ messages in thread
From: Paul Jackson @ 2005-04-14  5:03 UTC (permalink / raw)
  To: Christopher Li; +Cc: torvalds, pasky, git

> Do you have preference about what language of script we used?

Do you have a thick skin? <grin>

Can you easily ignore language wars with an amused wave of the hand and
a happy chuckle at the oh so predictable weaknesses of humans?

Then I'd wager it will be fine.

If you have a thin skin or tend to annoy others with a bit too much
attitude or can't pass up a good language war (which is my failing, and
why I am responding to a discussion that I've not been involved in for
days) then the resulting flamage could be distracting.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-14  5:03         ` Paul Jackson
@ 2005-04-14  2:16           ` Christopher Li
  2005-04-14  6:16             ` Paul Jackson
  0 siblings, 1 reply; 130+ messages in thread
From: Christopher Li @ 2005-04-14  2:16 UTC (permalink / raw)
  To: Paul Jackson; +Cc: torvalds, pasky, git

On Wed, Apr 13, 2005 at 10:03:41PM -0700, Paul Jackson wrote:
> 
> If you have a thin skin or tend to annoy others with a bit too much
> attitude or can't pass up a good language war (which is my failing, and
> why I am responding to a discussion that I've not been involved in for
> days) then the resulting flamage could be distracting.

Oh, my bad. I am not trying to start a language war here.
That is why I am hesitant about Python. Just trying to find out
the acceptability. No pushing.

Chris

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-14  2:16           ` Christopher Li
@ 2005-04-14  6:16             ` Paul Jackson
  0 siblings, 0 replies; 130+ messages in thread
From: Paul Jackson @ 2005-04-14  6:16 UTC (permalink / raw)
  To: Christopher Li; +Cc: torvalds, pasky, git

> Oh, my bad. I am not trying to start a language war here.

Neither am I - no problem whatsoever. <chuckle ...>

Besides, I think we'd be on the same side.

My point was only a gentle one -- as is often the case when dealing with
the strange species called human, whether or not you can get away with
something is often a simple matter of one's attitude.

There is one Python script already in the kernel: scripts/show_delta.
But it's too small a sample to mean much.

I think first thing is "get it right." Python is good for that
in the hands of someone who enjoys coding in it.

I wish you well.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-14  3:51     ` Linus Torvalds
  2005-04-14  1:23       ` Christopher Li
@ 2005-04-14  7:05       ` Junio C Hamano
  2005-04-14  8:06         ` Linus Torvalds
  1 sibling, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-14  7:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, Christopher Li, git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> On that note - I've been avoiding doing the merge-tree thing, in the hope
LT> that somebody else does what I've described.

I now have a Perl script that uses rev-tree, cat-file,
diff-tree, show-files (with one modification so that it can deal
with pathnames with embedded newlines), update-cache (with one
modification so that I can add an entry for a file that does not
exist to the dircache) and merge (from RCS). Quick and dirty.

The change to show-files is to give it an optional '-z' flag,
which changes record terminator to NUL character instead of LF.

The script git-merge.perl takes two head commits. It basically
follows what you described as I remember ;-):

 1. runs rev-tree with --edges to find the common ancestor.

 2. creates a temporary directory "./,,merge-temp"; create a
    symlink ./,,merge-temp/.git/objects that points at
    .git/objects.

 3. sets up dircache there, initially populated with this
    common ancestor tree. No files are checked out. Just set
    up .git/index and that's it.

 4. runs diff-tree to find what has been changed in each head.

 5. for each path involved:

    5.0 if neither heads change it, leave it as is;
    5.1 if only one head changes a path and the other does not, just
        get the changed version;
    5.2 if both heads change it, check all three out and run merge.

It does not currently commit. You can go to ./,,merge-temp/ and
see show-diff to see the result of the merge. Files added in
one head has already been run "update-cache" when the script
ends, but changed and merged files are not---dircache still has
the common ancestor view. So show-diff you will be seeing may
be enormous and not very useful if two forks were done in the
distant past.

After reviewing the merge result, you can update-cache,
write-tree and commit-tree as usual, but with one caveat: do not
run "show-files | xargs update-cache" if you are running
git-merge.perl without -f flag!

By default, git-merge.perl creates absolute minimum number of
files in ./,,merge-temp---only the merged files are left there
so that you can inspect them. You will not see unmodified
files nor files changed only by one side of the merge.

If you give '-o' (oneside checkout) flag to git-merge.perl, then
the files only one side of the merge changed are also checked
out in ./,,merge-temp. If you give '-f' (full checkout) flag to
git-merge.perl, then in addition to what '-o' checks out,
unchanged files are checked out in ./,,merge-temp.

This default is geared towards a huge tree with small merges
(favorite case of Linus, if I understand correctly).

Running 'show-diff' in such a sparsely populated merge result
tree gives you huge results because recent show-diff shows diffs
with empty files. I added a '-r' flag to show-diff, which
squelches diffs with empty files.

Also to implement 'changed only by one-side' without actually
checking the file out, I needed to add one option to
'update-cache'. --cacheinfo flag is used this way:

    $ update-cache --cacheinfo mode sha1 path

and adds the pathname with mode and sha1 to the .git/index
without actually requiring you to have such a file there.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 show-diff.c    |   11 ++-
 show-files.c   |   12 ++-
 update-cache.c |   25 +++++++
 git-merge.perl |  193 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 234 insertions(+), 7 deletions(-)

show-diff.c: a531ca4078525d1c8dcf84aae0bfa89fed6e5d96
--- show-diff.c
+++ show-diff.c	2005-04-13 22:47:33.000000000 -0700
@@ -58,15 +58,20 @@
 int main(int argc, char **argv)
 {
 	int silent = 0;
+	int silent_on_nonexisting_files = 0;
 	int entries = read_cache();
 	int i;
 
 	while (argc-- > 1) {
 		if (!strcmp(argv[1], "-s")) {
-			silent = 1;
+			silent_on_nonexisting_files = silent = 1;
 			continue;
 		}
-		usage("show-diff [-s]");
+		if (!strcmp(argv[1], "-r")) {
+			silent_on_nonexisting_files = 1;
+			continue;
+		}
+		usage("show-diff [-s] [-r]");
 	}
 
 	if (entries < 0) {
@@ -83,7 +88,7 @@
 
 		if (stat(ce->name, &st) < 0) {
 			printf("%s: %s\n", ce->name, strerror(errno));
-			if (errno == ENOENT && !silent)
+			if (errno == ENOENT && !silent_on_nonexisting_files)
 				show_diff_empty(ce);
 			continue;
 		}
show-files.c: a9fa6767a418f870a34b39379f417bf37b17ee18
--- show-files.c
+++ show-files.c	2005-04-13 21:18:40.000000000 -0700
@@ -14,6 +14,7 @@
 static int show_cached = 0;
 static int show_others = 0;
 static int show_ignored = 0;
+static int line_terminator = '\n';
 
 static const char **dir;
 static int nr_dir;
@@ -105,12 +106,12 @@
 	}
 	if (show_others) {
 		for (i = 0; i < nr_dir; i++)
-			printf("%s\n", dir[i]);
+			printf("%s%c", dir[i], line_terminator);
 	}
 	if (show_cached) {
 		for (i = 0; i < active_nr; i++) {
 			struct cache_entry *ce = active_cache[i];
-			printf("%s\n", ce->name);
+			printf("%s%c", ce->name, line_terminator);
 		}
 	}
 	if (show_deleted) {
@@ -119,7 +120,7 @@
 			struct stat st;
 			if (!stat(ce->name, &st))
 				continue;
-			printf("%s\n", ce->name);
+			printf("%s%c", ce->name, line_terminator);
 		}
 	}
 	if (show_ignored) {
@@ -134,6 +135,11 @@
 	for (i = 1; i < argc; i++) {
 		char *arg = argv[i];
 
+		if (!strcmp(arg, "-z")) {
+			line_terminator = 0;
+			continue;
+		}
+
 		if (!strcmp(arg, "--cached")) {
 			show_cached = 1;
 			continue;
update-cache.c: 8f149d5a4ab60e030a0ab19fdb59b8ee2576ee71
--- update-cache.c
+++ update-cache.c	2005-04-13 23:27:54.000000000 -0700
@@ -203,6 +203,8 @@
 {
 	int i, newfd, entries;
 	int allow_options = 1;
+	const char *sha1_force = NULL;
+	const char *mode_force = NULL;
 
 	newfd = open(".git/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600);
 	if (newfd < 0)
@@ -235,14 +237,35 @@
 				refresh_cache();
 				continue;
 			}
+			if (!strcmp(path, "--cacheinfo")) {
+				mode_force = argv[++i];
+				sha1_force = argv[++i];
+				continue;
+			}
 			die("unknown option %s", path);
 		}
 		if (!verify_path(path)) {
 			fprintf(stderr, "Ignoring path %s\n", argv[i]);
 			continue;
 		}
-		if (add_file_to_cache(path))
+		if (sha1_force && mode_force) {
+			struct cache_entry *ce;
+			int namelen = strlen(path);
+			int mode;
+			int size = cache_entry_size(namelen);
+			sscanf(mode_force, "%o", &mode);
+			ce = malloc(size);
+			memset(ce, 0, size);
+			memcpy(ce->name, path, namelen);
+			ce->namelen = namelen;
+			ce->st_mode = mode;
+			get_sha1_hex(sha1_force, ce->sha1);
+
+			add_cache_entry(ce, 1);
+		}
+		else if (add_file_to_cache(path))
 			die("Unable to add %s to database", path);
+		mode_force = sha1_force = NULL;
 	}
 	if (write_cache(newfd, active_cache, active_nr) ||
 	    rename(".git/index.lock", ".git/index"))
--- /dev/null	2005-03-19 15:28:25.000000000 -0800
+++ git-merge.perl	2005-04-13 23:45:23.000000000 -0700
@@ -0,0 +1,193 @@
+#!/usr/bin/perl -w
+
+use Getopt::Long;
+
+my $full_checkout = 0;
+my $oneside_checkout = 0;
+GetOptions("full" => \$full_checkout,
+           "oneside" => \$oneside_checkout)
+    or die;
+
+if ($full_checkout) {
+    $oneside_checkout = 1;
+}
+
+sub read_rev_tree {
+    my (@head) = @_;
+    my ($fhi);
+    open $fhi, '-|', 'rev-tree', '--edges', @head
+        or die "$!: rev-tree --edges @head";
+    my $common;
+    while (<$fhi>) {
+        chomp;
+        (undef, undef, $common) = split(/ /, $_);
+        if ($common =~ s/^([a-f0-f]{40}):\d+$/$1/) {
+            last;
+        }
+    }
+    close $fhi;
+    return $common;
+}
+
+sub read_commit_tree {
+    my ($commit) = @_;
+    my ($fhi);
+    open $fhi, '-|', 'cat-file', 'commit', $commit
+        or die "$!: cat-file commit $commit";
+    my $tree = <$fhi>;
+    close $fhi;
+    $tree =~ s/^tree //;
+    return $tree;
+}
+
+sub read_diff_tree {
+    my (@tree) = @_;
+    my ($fhi);
+    local ($_, $/);
+    $/ = "\0";
+    my %path;
+    open $fhi, '-|', 'diff-tree', '-r', @tree
+        or die "$!: diff-tree -r @tree";
+    while (<$fhi>) {
+        chomp;
+        if (/^\*[0-7]+->([0-7]+)\tblob\t[0-9a-f]+->([0-9a-f]{40})\t(.*)$/s) {
+            # mode newsha path
+            $path{$3} = [$1, $2];
+        }
+        elsif (/^\+([0-7]+)\tblob\t([0-9a-f]{40})\t(.*)$/s) {
+            # mode newsha path
+            $path{$3} = [$1, $2];
+        }
+        else {
+            print STDERR "$_??";
+        }
+    }
+    close $fhi;
+    return %path;
+}
+
+sub read_show_files {
+    my ($fhi);
+    local ($_, $/);
+    $/ = "\0";
+    open $fhi, '-|', 'show-files', '-z'
+        or die "$!: show-files -z";
+    my (@path) = map { chomp; $_ } <$fhi>;
+    close $fhi;
+    return @path;
+}
+
+sub checkout_file {
+    my ($path, $info) = @_;
+    my (@elt) = split(/\//, $path);
+    my $j = '';
+    my $tail = pop @elt;
+    my ($fhi, $fho);
+    for (@elt) {
+        mkdir "$j$_";
+        $j = "$j$_/";
+    }
+    open $fho, '>', "$path";
+    open $fhi, '-|', 'cat-file', 'blob', $info->[1]
+        or die "$!: cat-file blob $info->[1]";
+    while (<$fhi>) {
+        print $fho $_;
+    }
+    close $fhi;
+    close $fho;
+    chmod oct("0$info->[0]"), "$path";
+}
+
+sub record_file {
+    my ($path, $info) = @_;
+    system 'update-cache', '--cacheinfo', @$info, $path;
+}
+
+sub merge_tree {
+    my ($path, $info0, $info1) = @_;
+    print STDERR "M - $path\n";
+    checkout_file(',,merge-0', $info0);
+    checkout_file(',,merge-1', $info1);
+    system 'checkout-cache', $path;
+    my ($fhi, $fho);
+    open $fhi, '-|', 'merge', '-p', ',,merge-0', $path, ',,merge-1';
+    open $fho, '>', "$path+";
+    local ($/);
+    while (<$fhi>) { print $fho $_; }
+    close $fhi;
+    close $fho;
+    unlink ',,merge-0', ',,merge-1';
+    rename "$path+", $path;
+    # There is no reason to prefer info0 over info1 but
+    # we need to pick one.
+    chmod oct("0$info0->[0]"), "$path";
+}
+
+# Find common ancestor of two trees.
+my $common = read_rev_tree(@ARGV);
+print "Common ancestor: $common\n";
+
+# Create a temporary directory and go there.
+system 'rm', '-rf', ',,merge-temp';
+for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; }
+symlink "../../.git/objects", "objects";
+chdir '..';
+
+my $ancestor_tree = read_commit_tree($common);
+system 'read-tree', $ancestor_tree;
+
+my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0]));
+my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1]));
+
+my @ancestor_file = read_show_files();
+my %ancestor_file = map { $_ => 1 } @ancestor_file;
+
+for (@ancestor_file) {
+    if (! exists $tree0{$_} && ! exists $tree1{$_}) {
+        if ($full_checkout) {
+            system 'checkout-cache', $_;
+        }
+        print STDERR "O - $_\n";
+    }
+}
+
+my %need_merge = ();
+
+for $path (keys %tree0) {
+    if (! exists $tree1{$path}) {
+        # Only changed in tree 0 --- take his version
+        print STDERR "0 - $path\n";
+        if (! exists $ancestor_file{$path}) {
+            checkout_file($path, $tree0{$path});
+            system 'update-cache', '--add', "$path";
+        }
+        elsif ($oneside_checkout) {
+            checkout_file($path, $tree0{$path});
+        }
+        else {
+            record_file($path, $tree0{$path});
+        }
+    }
+    else {
+        merge_tree($path, $tree0{$path}, $tree1{$path});
+    }
+}
+
+for $path (keys %tree1) {
+    if (! exists $tree0{$path}) {
+        # Only changed in tree 1 --- take his version
+        print STDERR "1 - $path\n";
+        if (! exists $ancestor_file{$path}) {
+            checkout_file($path, $tree1{$path});
+            system 'update-cache', '--add', "$path";
+        }
+        elsif ($oneside_checkout) {
+            checkout_file($path, $tree1{$path});
+        }
+        else {
+            record_file($path, $tree1{$path});
+        }
+    }
+}
+
+# system 'show-diff';

^ permalink raw reply [flat|nested] 130+ messages in thread
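
The read_diff_tree helper in the patch above splits NUL-terminated
"diff-tree -r" records and keeps the new mode and SHA-1 for each path. A
rough Python sketch of the same parsing follows. It assumes the record
layout implied by the script's own regular expressions (a '*' record for a
changed blob, a '+' record for an added one); the name parse_diff_tree is
made up for illustration and is not part of git. Like this first version of
the script, it does not understand removal records, but it raises an error
for them instead of printing the record and carrying on.

```python
import re

# Record layouts are taken from the regular expressions in git-merge.perl;
# they are an assumption about diff-tree output at that point in time.
CHANGED = re.compile(
    rb"^\*[0-7]+->([0-7]+)\tblob\t[0-9a-f]+->([0-9a-f]{40})\t(.*)$", re.S)
ADDED = re.compile(rb"^\+([0-7]+)\tblob\t([0-9a-f]{40})\t(.*)$", re.S)


def parse_diff_tree(raw):
    """Map path (bytes) -> (new_mode, new_sha1) from NUL-terminated records.

    Hypothetical helper: records that match neither layout raise
    ValueError rather than being skipped.
    """
    changes = {}
    for record in raw.split(b"\0"):
        if not record:
            continue  # the trailing terminator leaves one empty record
        m = CHANGED.match(record) or ADDED.match(record)
        if m is None:
            raise ValueError("cannot parse record: %r" % record)
        mode, sha1, path = m.groups()
        changes[path] = (mode.decode(), sha1.decode())
    return changes
```

Splitting on NUL rather than on newlines is the point of the '-z' style
output: a path containing a newline still arrives as a single record.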
* Re: Merge with git-pasky II.
  2005-04-14  7:05       ` Junio C Hamano
@ 2005-04-14  8:06         ` Linus Torvalds
  2005-04-14  8:39           ` Junio C Hamano
  2005-04-15 19:57           ` Junio C Hamano
  0 siblings, 2 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-14  8:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Christopher Li, git

On Thu, 14 Apr 2005, Junio C Hamano wrote:
> 
> I now have a Perl script that uses rev-tree, cat-file,
> diff-tree, show-files (with one modification so that it can deal
> with pathnames with embedded newlines), update-cache (with one
> modification so that I can add an entry for a file that does not
> exist to the dircache) and merge (from RCS). Quick and dirty.

That's exactly what I wanted. Q'n'D is how the ball gets rolling.

In the meantime I wrote a very stupid "merge-tree" which does things
slightly differently, but I really think your approach (aka my original
approach) is actually a lot faster. I was just starting to worry that the
ball didn't start, so I wrote an even hackier one.

My really hacky one is called "merge-tree", and it really only merges one
directory. For each entry in the directory it says either

	select <mode> <sha1> path

or

	merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path

depending on whether it could directly select the right object or not.
It's actually exactly the same algorithm as the first one, but I was
afraid the first one would be so abstract that it (a) might not work and
(b) wouldn't get people to work it out. This "one directory at a time
with very explicit output" thing is much more down-to-earth, but it's also
likely slower because it will need script help more often.

That said, I don't know. MOST of the time there will be just a single
"directory" entry that needs merging, and then the script would just need
to recurse into that directory with the new "tree" objects. So it might
not be too horrible.

But I'm really happy that you seem to have implemented my first
suggestion and I seem to have been wasting my time.

> 5. for each path involved:
> 
>    5.0 if neither heads change it, leave it as is;
>    5.1 if only one head changes a path and the other does not, just
>        get the changed version;
>    5.2 if both heads change it, check all three out and run merge.

You missed one case:

    5.0.1 if both heads change it to the same thing, take the new thing

but maybe you counted that as 5.0 (it _should_ fall out automatically from
the fact that "diff-tree" between the two destination trees shows no
difference for such a file).

Now, arguably, your 5.2 will do things right, but the thing is, it's
actually fairly _common_ that both heads have changed something to the
same thing. Namely if there was a previous merge that already handled that
case, but that previous merge may not be a proper parent of the new
commits.

So from a performance standpoint you really don't want to consider that to
be a merge - you just pick up the new contents directly.

See? (My stupid "merge-tree" should show the algorithm in painful
obviousity. Of course, my stupid merge-tree may also be painfully buggy.
You be the judge).

> It does not currently commit. You can go to ./,,merge-temp/ and
> see show-diff to see the result of the merge. Files added in
> one head has already been run "update-cache" when the script
> ends, but changed and merged files are not---dircache still has
> the common ancestor view.

That sounds good.

> Also to implement 'changed only by one-side' without actually
> checking the file out, I needed to add one option to
> 'update-cache'. --cacheinfo flag is used this way:
> 
>     $ update-cache --cacheinfo mode sha1 path

Yes. My "merge-tree" needs the exact same thing.

Looks good from your explanation, but I'm too tired to look at the code.
It's 1AM, and the kids get up at 7. I'm not much of a hacker, I usually
crash by 10PM these days ;^)

		Linus

^ permalink raw reply [flat|nested] 130+ messages in thread
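
The per-entry decision described in the message above, including the 5.0.1
case where both heads arrive at the same content, can be sketched in a few
lines of Python. This only illustrates the rule and is not the merge-tree
code; the function name resolve_entry and the (mode, sha1) tuple layout are
assumptions, and the returned strings merely imitate the "select" and
"merge" lines quoted in the message.

```python
def resolve_entry(path, base, head_a, head_b):
    """Decide one directory entry; each side is a (mode, sha1) tuple.

    Returns a "select ..." line when the result is unambiguous and a
    "merge ..." line when a real three-way merge is needed.
    """
    if head_a == head_b:
        # Unchanged on both sides (5.0) or changed identically (5.0.1).
        winner = head_a
    elif head_a == base:
        winner = head_b  # only head B changed it (5.1)
    elif head_b == base:
        winner = head_a  # only head A changed it (5.1)
    else:
        # Both changed it, differently (5.2): leave it to a content merge.
        return "merge %s->%s,%s %s->%s,%s %s" % (
            base[0], head_a[0], head_b[0],
            base[1], head_a[1], head_b[1], path)
    return "select %s %s %s" % (winner[0], winner[1], path)
```

Comparing the two heads first is what makes the "changed to the same thing"
case cheap: no blob has to be checked out or handed to merge.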
* Re: Merge with git-pasky II.
  2005-04-14  8:06         ` Linus Torvalds
@ 2005-04-14  8:39           ` Junio C Hamano
  2005-04-14  9:10             ` Linus Torvalds
  2005-04-15 19:57           ` Junio C Hamano
  1 sibling, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-14  8:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, Christopher Li, git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> But I'm really happy that you seem to have implemented my first
LT> suggestion and I seem to have been wasting my time.

Thanks for the kind words.

>> 5. for each path involved:
>> 
>>    5.0 if neither heads change it, leave it as is;
>>    5.1 if only one head changes a path and the other does not, just
>>        get the changed version;
>>    5.2 if both heads change it, check all three out and run merge.

LT> You missed one case:

LT>     5.0.1 if both heads change it to the same thing, take the new thing

LT> but maybe you counted that as 5.0 (it _should_ fall out automatically from
LT> the fact that "diff-tree" between the two destination trees shows no
LT> difference for such a file).

Actually I am not handling that. It really is 5.1a---the exact
same code path as 5.1 can be used for this case, and as you
point out it is really a quite important optimization.

I have to handle the following cases. I think I currently do
wrong things to them:

   5.1a both head modify to the same thing.
   5.1b one head removes, the other does not do anything.
   5.1c both head remove.
   5.3  one head removes, the other head modifies.

Handling of 5.1a, 5.1b and 5.1c are obvious.

   5.1a Update dircache to the same new thing. Without -f or -o
        flag do not touch ,,merge-temp/. directory; with -f or
        -o, leave the new file in ,,merge-temp/.

   5.1b Remove the path from dircache and do not have the file
        in ,,merge-temp/. directory regardless of -f or -o
        flags.

   5.1c Same as 5.1b

I am not sure what to do with 5.3. My knee-jerk reaction is to
leave the modified result in ,,merge-temp/$path~ without
touching dircache. If the merger wants to pick it up, he can
rename $path~ to $path temporarily, run show-diff on it (I think
giving an option to show-diff to specify paths would be helpful
for this workflow), to decide if he wants to keep the file or
not.

Suggestions?

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-14  8:39           ` Junio C Hamano
@ 2005-04-14  9:10             ` Linus Torvalds
  2005-04-14 11:14               ` Junio C Hamano
  0 siblings, 1 reply; 130+ messages in thread
From: Linus Torvalds @ 2005-04-14  9:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Christopher Li, git

On Thu, 14 Apr 2005, Junio C Hamano wrote:
> 
> I have to handle the following cases. I think I currently do
> wrong things to them:
> 
>    5.1a both head modify to the same thing.
>    5.1b one head removes, the other does not do anything.
>    5.1c both head remove.
>    5.3  one head removes, the other head modifies.

There's another interesting set of cases: one side creates a file, and the
other one creates a directory.

> I am not sure what to do with 5.3.

My very _strong_ preference is to just inform the user about a merge that
cannot be performed, and not let it be automated. BIG warning, with some
way for the user to specify the end result.

The thing is, these are pretty rare cases. But in order to make people
feel good about the _common_ case, it's important that they feel safe
about the rare one.

Put another way: if git tells me when it can't do something (with some
specificity), I can then fix the situation up and try again. I might curse
a while, and maybe it ends up being so common that I might even automate
it, but at least I'll be able to trust the end result.

In contrast, if git does something that _may_ be nonsensical, then I'll
worry all the time, and not trust git. That's much worse than an
occasional curse.

So the rule should be: only merge when it's "obviously the right thing".
If it's not obvious, the merge should _not_ try to guess what the right
thing is. It's much better to fail loudly.

(That's especially true early on. There may be cases that end up being
obvious after some usage. But I'd rather find them by having git be too
stupid, than find out the hard way that git lost some data because it
thought it was ok to remove a file that had been modified)

		Linus

^ permalink raw reply [flat|nested] 130+ messages in thread
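
The rule of only merging when it is "obviously the right thing" and failing
loudly otherwise could look roughly like this for the removal cases listed
earlier in the thread (5.1a, 5.1b, 5.1c, 5.3). This is a sketch in Python
under stated assumptions: None stands for "path absent", and MergeConflict
and merge_presence are hypothetical names, not anything in git.

```python
class MergeConflict(Exception):
    """Raised when the result is not obvious and a human has to decide."""


def merge_presence(path, base, head_a, head_b):
    """Resolve a path whose sides may be missing (None = not present).

    Each present side is a sha1 string. Returns the sha1 to keep, or
    None when the path should be dropped from the result. Anything
    that is not trivially resolvable raises MergeConflict.
    """
    if head_a == head_b:
        return head_a  # same change on both sides, or both removed (5.1a, 5.1c)
    if head_a == base:
        return head_b  # only B touched it, possibly by removing it (5.1b)
    if head_b == base:
        return head_a  # only A touched it, possibly by removing it (5.1b)
    if head_a is None or head_b is None:
        # One side removed, the other modified (5.3): do not guess.
        raise MergeConflict(
            "%s: removed in one head, modified in the other" % path)
    raise MergeConflict(
        "%s: changed differently in both heads; needs a content merge" % path)
```

Raising instead of picking a side keeps the failure visible, so a caller can
report every such path with some specificity and let the user choose the
end result.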
* Re: Merge with git-pasky II.
  2005-04-14  9:10             ` Linus Torvalds
@ 2005-04-14 11:14               ` Junio C Hamano
  2005-04-14 12:16                 ` Petr Baudis
  0 siblings, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-14 11:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Here is a diff to update the git-merge.perl script I showed you
earlier today ;-). It contains the following updates against
your HEAD (bb95843a5a0f397270819462812735ee29796fb4).

 * git-merge.perl command we talked about on the git list. I've
   covered the changed-to-the-same case etc. I still haven't done
   anything about file-vs-directory case yet.

   It does warn when it needed to run merge to automerge and let
   merge give a warning message about conflicts if any. In
   modify/remove cases, modified in one but removed in the other
   files are left in either $path~A~ or $path~B~ in the merge
   temporary directory, and the script issues a warning at the
   end.

 * show-files and ls-tree updates to add -z flag to NUL terminate
   records; this is needed for git-merge.perl to work.

 * show-diff updates to add -r flag to squelch diffs for files not in
   the working directory. This is mainly useful when verifying the
   result of an automated merge.

 * update-cache updates to add "--cacheinfo mode sha1" flag to
   register a file that is not in the current working directory.
   Needed for minimum-checkout merging by git-merge.perl.

Signed-off-by: Junio C Hamano <junkio@cox.net> --- git-merge.perl | 247 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ls-tree.c | 9 +- show-diff.c | 11 +- show-files.c | 12 ++ update-cache.c | 25 +++++ 5 files changed, 296 insertions(+), 8 deletions(-) diff -x .git -Nru ,,1/git-merge.perl ,,2/git-merge.perl --- ,,1/git-merge.perl 1969-12-31 16:00:00.000000000 -0800 +++ ,,2/git-merge.perl 2005-04-14 04:00:14.000000000 -0700 @@ -0,0 +1,247 @@ +#!/usr/bin/perl -w + +use Getopt::Long; + +my $full_checkout = 0; +my $oneside_checkout = 0; +GetOptions("full" => \$full_checkout, + "oneside" => \$oneside_checkout) + or die; + +if ($full_checkout) { + $oneside_checkout = 1; +} + +sub read_rev_tree { + my (@head) = @_; + my ($fhi); + open $fhi, '-|', 'rev-tree', '--edges', @head + or die "$!: rev-tree --edges @head"; + my %common; + while (<$fhi>) { + chomp; + (undef, undef, my @common) = split(/ /, $_); + for (@common) { + if (s/^([a-f0-f]{40}):3$/$1/) { + $common{$_}++; + } + } + } + close $fhi; + + my @common = (map { $_->[1] } + sort { $b->[0] <=> $a->[0] } + map { [ $common{$_} => $_ ] } + keys %common); + + return $common[0]; +} + +sub read_commit_tree { + my ($commit) = @_; + my ($fhi); + open $fhi, '-|', 'cat-file', 'commit', $commit + or die "$!: cat-file commit $commit"; + my $tree = <$fhi>; + close $fhi; + $tree =~ s/^tree //; + return $tree; +} + +# Reads diff-tree -r output and gives a hash that maps a path +# to 3-tuple (old-mode new-mode new-sha). +# When creating, old-mode is undef. When removing, new-* are undef. 
+sub read_diff_tree { + my (@tree) = @_; + my ($fhi); + local ($_, $/); + $/ = "\0"; + my %path; + open $fhi, '-|', 'diff-tree', '-r', @tree + or die "$!: diff-tree -r @tree"; + while (<$fhi>) { + chomp; + if (/^\*([0-7]+)->([0-7]+)\tblob\t[0-9a-f]+->([0-9a-f]{40})\t(.*)$/s) { + $path{$4} = [$1, $2, $3]; + } + elsif (/^\+([0-7]+)\tblob\t([0-9a-f]{40})\t(.*)$/s) { + $path{$3} = [undef, $1, $2]; + } + elsif (/^\-([0-7]+)\tblob\t[0-9a-f]{40}\t(.*)$/s) { + $path{$2} = [$1, undef, undef]; + } + else { + die "cannot parse diff-tree output: $_"; + } + } + close $fhi; + return %path; +} + +sub read_show_files { + my ($fhi); + local ($_, $/); + $/ = "\0"; + open $fhi, '-|', 'show-files', '-z' + or die "$!: show-files -z"; + my (@path) = map { chomp; $_ } <$fhi>; + close $fhi; + return @path; +} + +sub checkout_file { + my ($path, $info) = @_; + my (@elt) = split(/\//, $path); + my $j = ''; + my $tail = pop @elt; + my ($fhi, $fho); + for (@elt) { + mkdir "$j$_"; + $j = "$j$_/"; + } + open $fho, '>', "$path"; + open $fhi, '-|', 'cat-file', 'blob', $info->[2] + or die "$!: cat-file blob $info->[2]"; + while (<$fhi>) { + print $fho $_; + } + close $fhi; + close $fho; + chmod oct("0$info->[1]"), "$path"; +} + +sub record_file { + my ($path, $info) = @_; + system ('update-cache', '--add', '--cacheinfo', + $info->[1], $info->[2], $path); +} + +sub merge_tree { + my ($path, $info0, $info1) = @_; + checkout_file(',,merge-0', $info0); + checkout_file(',,merge-1', $info1); + system 'checkout-cache', $path; + my ($fhi, $fho); + open $fhi, '-|', 'merge', '-p', ',,merge-0', $path, ',,merge-1'; + open $fho, '>', "$path+"; + local ($/); + while (<$fhi>) { print $fho $_; } + close $fhi; + close $fho; + unlink ',,merge-0', ',,merge-1'; + rename "$path+", $path; + # There is no reason to prefer info0 over info1 but + # we need to pick one. + chmod oct("0$info0->[1]"), "$path"; +} + +# Find common ancestor of two trees. 
+my $common = read_rev_tree(@ARGV); +print "Common ancestor: $common\n"; + +# Create a temporary directory and go there. +system 'rm', '-rf', ',,merge-temp'; +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; } +symlink "../../.git/objects", "objects"; +chdir '..'; + +my $ancestor_tree = read_commit_tree($common); +system 'read-tree', $ancestor_tree; + +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0])); +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1])); + +my @ancestor_file = read_show_files(); +my %ancestor_file = map { $_ => 1 } @ancestor_file; + +for (@ancestor_file) { + if (! exists $tree0{$_} && ! exists $tree1{$_}) { + if ($full_checkout) { + system 'checkout-cache', $_; + } + print STDERR "O - $_\n"; + } +} + +for my $set ([\%tree0, \%tree1, 'A'], [\%tree1, \%tree0, 'B']) { + my ($treeA, $treeB, $side) = @$set; + while (my ($path, $info) = each %$treeA) { + # In this loop we do not deal with overlaps. + next if (exists $treeB->{$path}); + + if (! defined $info->[1]) { + # deleted in this tree only. + unlink $path; + system 'update-cache', '--remove', $path; + print STDERR "$side D $path\n"; + } + else { + # modified or created in this tree only. + print STDERR "$side M $path\n"; + if ($oneside_checkout) { + checkout_file($path, $info); + system 'update-cache', '--add', "$path"; + } else { + record_file($path, $info); + } + } + } +} + +my @warning = (); + +while (my ($path, $info0) = each %tree0) { + # We need to deal only with overlaps. + next if (!exists $tree1{$path}); + + my $info1 = $tree1{$path}; + if (! defined $info0->[1]) { + # deleted in this tree. + if (! defined $info1->[1]) { + # deleted in both trees. Obvious. + print STDERR "*DD $path\n"; + unlink $path; + system 'update-cache', '--remove', $path; + } + else { + # oops. tree0 wants to remove but tree1 wants to modify it. 
+ print STDERR "*DM $path\n"; + checkout_file("$path~B~", $info1); + push @warning, $path; + } + } + else { + # modified or created in tree0 + if (! defined $info1->[1]) { + # oops. tree0 wants to modify but tree1 wants to remove it. + print STDERR "*MD $path\n"; + checkout_file("$path~A~", $info0); + push @warning, $path; + } + else { + # modified both in tree0 and tree1 + # are they modifying to the same contents? + if ($info0->[2] eq $info1->[2]) { + # just mode changes (or no changes) + # we prefer tree0 over tree1 for no particular reason. + print STDERR "*MM $path\n"; + record_file($path, $info0); + } + else { + # modified in both. Needs merge. + print STDERR "MRG $path\n"; + merge_tree($path, $info0, $info1); + } + } + } +} + +if (@warning) { + print "\nThere are some files that were deleted in one branch and\n" + . "modified in another. Please examine them carefully:\n"; + for (@warning) { + print "$_\n"; + } +} + +# system 'show-diff'; diff -x .git -Nru ,,1/ls-tree.c ,,2/ls-tree.c --- ,,1/ls-tree.c 2005-04-14 03:47:18.000000000 -0700 +++ ,,2/ls-tree.c 2005-04-14 04:00:14.000000000 -0700 @@ -5,6 +5,8 @@ */ #include "cache.h" +int line_termination = '\n'; + static int list(unsigned char *sha1) { void *buffer; @@ -31,7 +33,8 @@ * It seems not worth it to read each file just to get this * and the file size. -- pasky@ucw.cz */ type = S_ISDIR(mode) ? 
"tree" : "blob"; - printf("%03o\t%s\t%s\t%s\n", mode, type, sha1_to_hex(sha1), path); + printf("%03o\t%s\t%s\t%s%c", mode, type, sha1_to_hex(sha1), + path, line_termination); } return 0; } @@ -40,6 +43,10 @@ { unsigned char sha1[20]; + if (argc == 3 && !strcmp(argv[1], "-z")) { + line_termination = 0; + argc--; argv++; + } if (argc != 2) usage("ls-tree <key>"); if (get_sha1_hex(argv[1], sha1) < 0) diff -x .git -Nru ,,1/show-diff.c ,,2/show-diff.c --- ,,1/show-diff.c 2005-04-14 03:47:18.000000000 -0700 +++ ,,2/show-diff.c 2005-04-14 04:00:14.000000000 -0700 @@ -58,15 +58,20 @@ int main(int argc, char **argv) { int silent = 0; + int silent_on_nonexisting_files = 0; int entries = read_cache(); int i; while (argc-- > 1) { if (!strcmp(argv[1], "-s")) { - silent = 1; + silent_on_nonexisting_files = silent = 1; continue; } - usage("show-diff [-s]"); + if (!strcmp(argv[1], "-r")) { + silent_on_nonexisting_files = 1; + continue; + } + usage("show-diff [-s] [-r]"); } if (entries < 0) { @@ -83,7 +88,7 @@ if (stat(ce->name, &st) < 0) { printf("%s: %s\n", ce->name, strerror(errno)); - if (errno == ENOENT && !silent) + if (errno == ENOENT && !silent_on_nonexisting_files) show_diff_empty(ce); continue; } diff -x .git -Nru ,,1/show-files.c ,,2/show-files.c --- ,,1/show-files.c 2005-04-14 03:47:18.000000000 -0700 +++ ,,2/show-files.c 2005-04-14 04:00:14.000000000 -0700 @@ -14,6 +14,7 @@ static int show_cached = 0; static int show_others = 0; static int show_ignored = 0; +static int line_terminator = '\n'; static const char **dir; static int nr_dir; @@ -105,12 +106,12 @@ } if (show_others) { for (i = 0; i < nr_dir; i++) - printf("%s\n", dir[i]); + printf("%s%c", dir[i], line_terminator); } if (show_cached) { for (i = 0; i < active_nr; i++) { struct cache_entry *ce = active_cache[i]; - printf("%s\n", ce->name); + printf("%s%c", ce->name, line_terminator); } } if (show_deleted) { @@ -119,7 +120,7 @@ struct stat st; if (!stat(ce->name, &st)) continue; - printf("%s\n", ce->name); + 
printf("%s%c", ce->name, line_terminator); } } if (show_ignored) { @@ -134,6 +135,11 @@ for (i = 1; i < argc; i++) { char *arg = argv[i]; + if (!strcmp(arg, "-z")) { + line_terminator = 0; + continue; + } + if (!strcmp(arg, "--cached")) { show_cached = 1; continue; diff -x .git -Nru ,,1/update-cache.c ,,2/update-cache.c --- ,,1/update-cache.c 2005-04-14 03:47:18.000000000 -0700 +++ ,,2/update-cache.c 2005-04-14 04:00:14.000000000 -0700 @@ -250,6 +250,8 @@ { int i, newfd, entries; int allow_options = 1; + const char *sha1_force = NULL; + const char *mode_force = NULL; newfd = open(".git/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600); if (newfd < 0) @@ -282,14 +284,35 @@ refresh_cache(); continue; } + if (!strcmp(path, "--cacheinfo")) { + mode_force = argv[++i]; + sha1_force = argv[++i]; + continue; + } die("unknown option %s", path); } if (!verify_path(path)) { fprintf(stderr, "Ignoring path %s\n", argv[i]); continue; } - if (add_file_to_cache(path)) + if (sha1_force && mode_force) { + struct cache_entry *ce; + int namelen = strlen(path); + int mode; + int size = cache_entry_size(namelen); + sscanf(mode_force, "%o", &mode); + ce = malloc(size); + memset(ce, 0, size); + memcpy(ce->name, path, namelen); + ce->namelen = namelen; + ce->st_mode = mode; + get_sha1_hex(sha1_force, ce->sha1); + + add_cache_entry(ce, 1); + } + else if (add_file_to_cache(path)) die("Unable to add %s to database", path); + mode_force = sha1_force = NULL; } if (write_cache(newfd, active_cache, active_nr) || rename(".git/index.lock", ".git/index")) ^ permalink raw reply [flat|nested] 130+ messages in thread
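In the update-cache hunk above, the results of sscanf() and get_sha1_hex() are not checked when "--cacheinfo" is handled, so a malformed mode or object name would be recorded silently. A minimal sketch of a validating parser follows; parse_cacheinfo and hexval are hypothetical names introduced here for illustration, not functions from the patch or from git, and the upper bound on the mode is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Decode one hex digit, or return -1 if `c` is not a hex digit. */
static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical validating parser for the two "--cacheinfo" arguments.
 * Returns 0 and fills *mode and sha1[20] on success; returns -1 if the
 * mode is not a plain octal number or the id is not exactly 40 hex digits.
 */
int parse_cacheinfo(const char *mode_str, const char *hex,
                    unsigned int *mode, unsigned char sha1[20])
{
    char *end;
    unsigned long m;

    if (!mode_str || !hex || !*mode_str)
        return -1;
    m = strtoul(mode_str, &end, 8);
    if (*end != '\0' || m > 0177777)   /* assumed limit: 16 bits of mode */
        return -1;
    if (strlen(hex) != 40)
        return -1;
    for (int i = 0; i < 20; i++) {
        int hi = hexval(hex[2 * i]);
        int lo = hexval(hex[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        sha1[i] = (unsigned char)((hi << 4) | lo);
    }
    *mode = (unsigned int)m;
    return 0;
}
```

A caller in the spirit of the patch would die() with a message naming the bad argument when this returns -1, instead of adding the entry.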
* Re: Re: Merge with git-pasky II.
  2005-04-14 11:14 ` Junio C Hamano
@ 2005-04-14 12:16   ` Petr Baudis
  2005-04-14 18:12     ` Junio C Hamano
  0 siblings, 1 reply; 130+ messages in thread
From: Petr Baudis @ 2005-04-14 12:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

Dear diary, on Thu, Apr 14, 2005 at 01:14:13PM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> told me that...
> Here is a diff to update the git-merge.perl script I showed you
> earlier today ;-).  It contains the following updates against
> your HEAD (bb95843a5a0f397270819462812735ee29796fb4).

Bah, you outran me. ;-)

>  * git-merge.perl command we talked about on the git list.  I've
>    covered the changed-to-the-same case etc.  I still haven't done
>    anything about file-vs-directory case yet.
>
>    It does warn when it needed to run merge to automerge and let
>    merge give a warning message about conflicts if any.  In
>    modify/remove cases, modified in one but removed in the other
>    files are left in either $path~A~ or $path~B~ in the merge
>    temporary directory, and the script issues a warning at the
>    end.

I think I will take it rather than my working git merge implementation -
it's getting insane in bash. ;-)

I'll change it to use the cool git-pasky stuff (commit-id etc) and its
style of committing - that is, it will merely record the update-caches
to be done upon commit, and it will read-tree the branch we are merging
to instead of the ancestor. (So that git diff gives useful output.)

>  * show-files and ls-tree updates to add -z flag to NUL terminate records;
>    this is needed for git-merge.perl to work.
>
>  * show-diff updates to add -r flag to squelch diffs for files not in
>    the working directory.  This is mainly useful when verifying the
>    result of an automated merge.

-r traditionally means recursive - what's the reasoning behind the
choice of this letter?

> * update-cache updates to add "--cacheinfo mode sha1" flag to register > a file that is not in the current working directory. Needed for > minimum-checkout merging by git-merge.perl. > > > Signed-off-by: Junio C Hamano <junkio@cox.net> > > --- > > git-merge.perl | 247 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > ls-tree.c | 9 +- > show-diff.c | 11 +- > show-files.c | 12 ++ > update-cache.c | 25 +++++ > 5 files changed, 296 insertions(+), 8 deletions(-) > > diff -x .git -Nru ,,1/git-merge.perl ,,2/git-merge.perl > --- ,,1/git-merge.perl 1969-12-31 16:00:00.000000000 -0800 > +++ ,,2/git-merge.perl 2005-04-14 04:00:14.000000000 -0700 > @@ -0,0 +1,247 @@ > +#!/usr/bin/perl -w > + > +use Getopt::Long; use strict? > + > +my $full_checkout = 0; > +my $oneside_checkout = 0; > +GetOptions("full" => \$full_checkout, > + "oneside" => \$oneside_checkout) > + or die; > + > +if ($full_checkout) { > + $oneside_checkout = 1; > +} > + > +sub read_rev_tree { > + my (@head) = @_; > + my ($fhi); > + open $fhi, '-|', 'rev-tree', '--edges', @head > + or die "$!: rev-tree --edges @head"; > + my %common; > + while (<$fhi>) { > + chomp; > + (undef, undef, my @common) = split(/ /, $_); > + for (@common) { > + if (s/^([a-f0-f]{40}):3$/$1/) { > + $common{$_}++; > + } > + } > + } > + close $fhi; > + > + my @common = (map { $_->[1] } > + sort { $b->[0] <=> $a->[0] } > + map { [ $common{$_} => $_ ] } > + keys %common); > + > + return $common[0]; > +} It'd be simpler to do just my @common = (map { $common{$_} } sort { $b <=> $a } keys %common) But I really think this is a horrible heuristic. I believe you should take the latest commit in the --edges output, and from that choose the base whose rev-tree --edges the_base merged_branch has the least lines on output. (That is, the path to it is shortest - ideally it's already part of the merged_branch.) 
> + > +sub read_commit_tree { > + my ($commit) = @_; > + my ($fhi); > + open $fhi, '-|', 'cat-file', 'commit', $commit > + or die "$!: cat-file commit $commit"; > + my $tree = <$fhi>; > + close $fhi; > + $tree =~ s/^tree //; > + return $tree; > +} > + > +# Reads diff-tree -r output and gives a hash that maps a path > +# to 3-tuple (old-mode new-mode new-sha). > +# When creating, old-mode is undef. When removing, new-* are undef. What about sub OLDMODE { 0 } sub NEWMODE { 1 } sub NEWSHA { 2 } and then using that when accessing the tuple? Would make the code much more readable. > +sub read_diff_tree { > + my (@tree) = @_; > + my ($fhi); > + local ($_, $/); > + $/ = "\0"; > + my %path; > + open $fhi, '-|', 'diff-tree', '-r', @tree > + or die "$!: diff-tree -r @tree"; > + while (<$fhi>) { > + chomp; > + if (/^\*([0-7]+)->([0-7]+)\tblob\t[0-9a-f]+->([0-9a-f]{40})\t(.*)$/s) { > + $path{$4} = [$1, $2, $3]; > + } > + elsif (/^\+([0-7]+)\tblob\t([0-9a-f]{40})\t(.*)$/s) { > + $path{$3} = [undef, $1, $2]; > + } > + elsif (/^\-([0-7]+)\tblob\t[0-9a-f]{40}\t(.*)$/s) { > + $path{$2} = [$1, undef, undef]; > + } > + else { > + die "cannot parse diff-tree output: $_"; > + } > + } > + close $fhi; > + return %path; > +} > + > +sub read_show_files { > + my ($fhi); > + local ($_, $/); > + $/ = "\0"; > + open $fhi, '-|', 'show-files', '-z' > + or die "$!: show-files -z"; > + my (@path) = map { chomp; $_ } <$fhi>; > + close $fhi; > + return @path; > +} > + > +sub checkout_file { > + my ($path, $info) = @_; > + my (@elt) = split(/\//, $path); > + my $j = ''; > + my $tail = pop @elt; > + my ($fhi, $fho); > + for (@elt) { > + mkdir "$j$_"; > + $j = "$j$_/"; > + } > + open $fho, '>', "$path"; > + open $fhi, '-|', 'cat-file', 'blob', $info->[2] > + or die "$!: cat-file blob $info->[2]"; > + while (<$fhi>) { > + print $fho $_; > + } > + close $fhi; > + close $fho; > + chmod oct("0$info->[1]"), "$path"; > +} > + > +sub record_file { > + my ($path, $info) = @_; > + system ('update-cache', 
'--add', '--cacheinfo', > + $info->[1], $info->[2], $path); > +} > + > +sub merge_tree { > + my ($path, $info0, $info1) = @_; > + checkout_file(',,merge-0', $info0); > + checkout_file(',,merge-1', $info1); > + system 'checkout-cache', $path; > + my ($fhi, $fho); > + open $fhi, '-|', 'merge', '-p', ',,merge-0', $path, ',,merge-1'; > + open $fho, '>', "$path+"; > + local ($/); > + while (<$fhi>) { print $fho $_; } > + close $fhi; > + close $fho; > + unlink ',,merge-0', ',,merge-1'; > + rename "$path+", $path; > + # There is no reason to prefer info0 over info1 but > + # we need to pick one. > + chmod oct("0$info0->[1]"), "$path"; > +} It is a good idea to check merge's exit code and give a notice at the end if there were any conflicts. > + > +# Find common ancestor of two trees. > +my $common = read_rev_tree(@ARGV); > +print "Common ancestor: $common\n"; > + > +# Create a temporary directory and go there. > +system 'rm', '-rf', ',,merge-temp'; Can't we call it just ,,merge? > +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; } > +symlink "../../.git/objects", "objects"; > +chdir '..'; > + > +my $ancestor_tree = read_commit_tree($common); > +system 'read-tree', $ancestor_tree; > + > +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0])); > +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1])); > + > +my @ancestor_file = read_show_files(); > +my %ancestor_file = map { $_ => 1 } @ancestor_file; > + > +for (@ancestor_file) { > + if (! exists $tree0{$_} && ! exists $tree1{$_}) { > + if ($full_checkout) { > + system 'checkout-cache', $_; > + } > + print STDERR "O - $_\n"; Huh, what are you trying to do here? I think you should just record remove, no? (And I wouldn't do anything with my read-tree. ;-) > + } > +} > + > +for my $set ([\%tree0, \%tree1, 'A'], [\%tree1, \%tree0, 'B']) { > + my ($treeA, $treeB, $side) = @$set; > + while (my ($path, $info) = each %$treeA) { > + # In this loop we do not deal with overlaps. 
> + next if (exists $treeB->{$path}); > + > + if (! defined $info->[1]) { > + # deleted in this tree only. > + unlink $path; > + system 'update-cache', '--remove', $path; > + print STDERR "$side D $path\n"; > + } > + else { > + # modified or created in this tree only. > + print STDERR "$side M $path\n"; > + if ($oneside_checkout) { > + checkout_file($path, $info); > + system 'update-cache', '--add', "$path"; > + } else { > + record_file($path, $info); > + } > + } > + } > +} ..snip.. Hmm, I think I will just need to play with the script a lot. ;-) -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 130+ messages in thread
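On the suggestion above to check merge's exit code: RCS merge(1) documents exit status 0 for no conflicts, 1 for some conflicts, and 2 for trouble. A minimal sketch in C of turning that into an outcome the caller can collect and report at the end; the enum and function names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <sys/wait.h>

enum merge_outcome { MERGE_CLEAN, MERGE_CONFLICTS, MERGE_TROUBLE };

/* Map merge(1)'s exit code: 0 = clean, 1 = conflicts, anything else = trouble. */
enum merge_outcome merge_outcome_from_exit(int code)
{
    if (code == 0)
        return MERGE_CLEAN;
    if (code == 1)
        return MERGE_CONFLICTS;
    return MERGE_TROUBLE;
}

/*
 * Interpret a raw status as returned by system(3) or waitpid(2).
 * A failed fork/exec (-1) or death by signal counts as trouble.
 */
enum merge_outcome merge_outcome_from_status(int status)
{
    if (status == -1 || !WIFEXITED(status))
        return MERGE_TROUBLE;
    return merge_outcome_from_exit(WEXITSTATUS(status));
}
```

Keeping conflicts and trouble apart matters for the final notice: a conflicted file still holds a usable result with markers in it, while trouble means the output should not be trusted at all.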
* Re: Merge with git-pasky II.
  2005-04-14 12:16 ` Petr Baudis
@ 2005-04-14 18:12   ` Junio C Hamano
  2005-04-14 18:36     ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 130+ messages in thread
From: Junio C Hamano @ 2005-04-14 18:12 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, git

>>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:

PB> Bah, you outran me. ;-)

Just being in a different timezone, I guess.

PB> I'll change it to use the cool git-pasky stuff (commit-id etc) and its
PB> style of committing - that is, it will merely record the update-caches
PB> to be done upon commit, and it will read-tree the branch we are merging
PB> to instead of the ancestor. (So that git diff gives useful output.)

Sorry, I have not seen what you have been doing since pasky 0.3,
and I have not even started to understand the mental model of
the world your tool is building.  That said, my gut feeling is
that telling this script about git-pasky's world model might be
a mistake.  I'd rather see you consider the script as mere "part
of the plumbing".  Maybe adding an extra parameter to the script
to let the user explicitly specify the common ancestor to use
would be needed, but I would prefer git-pasky-merge to do its
own magic (converting symbolic commit names into raw commit
names and such) before calling this low level script.  That way
people like me who have not migrated to your framework can still
keep using it.

All the script currently needs is a bare git object database;
i.e., nothing other than what is in .git/objects and a couple of
commit record SHA1s as its parameters.  No .git/heads/, no
.git/HEAD.local, no .git/tags, are involved for it to work, and
I would prefer to keep things that way if possible.

>> * show-diff updates to add -r flag to squelch diffs for files not in
>> the working directory.  This is mainly useful when verifying the
>> result of an automated merge.

PB> -r traditionally means recursive - what's the reasoning behind the
PB> choice of this letter?

Well, '-r' is not necessarily recursive.  "ls -r" is reverse,
"sort -r" is reverse.  "less -r" is raw.  "cat -r" is
reversible.  "nethack -r" is race ;-).

You are thinking as an SCM person so it may look that way.
"diff -r" is recursive.  "darcs add -r" is recursive.  But even
in the SCM world, "cvs add -r" is not (it means read-only), and
neither is "co -r" (explicit revision) ;-).

I would rather pick '-q' if I were doing the patch today, but I
was too tired and did not think of a letter when I wrote it.  I
guess '-r' stood for removed, but I agree it is a bad choice.
Any objections to '-q'?

PB> use strict?

Not in this iteration but eventually yes.

PB> It'd be simpler to do just
PB>   my @common = (map { $common{$_} }
PB>                 sort { $b <=> $a }
PB>                 keys %common)

Well, actually you spotted a bug between the implementation and
what I wanted to do.  It should have been:

    map { $_->[0] }
    sort { $b->[1] <=> $a->[1] }
    map { [ $_ => $common{$_} ] }
    keys %common

That is, sort the [ancestor => number of times it appears] tuples
by the "number of times it appears" in decreasing order, and
project the resulting list to a list of ancestors.  It is trying
to deal with the following pattern in rev-tree output:

    TIMESTAMP1 EDGE1:1 ANCESTOR1:3 ANCESTOR2:3
    TIMESTAMP2 EDGE2:2 ANCESTOR1:3

and when the above happens I wanted to pick up ANCESTOR1, but
there was no sound reason for that.

PB> But I really think this is a horrible heuristic. I believe you should
PB> take the latest commit in the --edges output, and from that choose the
PB> base whose rev-tree --edges the_base merged_branch has the least lines
PB> on output. (That is, the path to it is shortest - ideally it's already
PB> part of the merged_branch.)

I'll try something along that line.  Honestly the ancestor
selection part was what I had most trouble with.  Thanks.

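For illustration only, the counting heuristic discussed above can be sketched in C. It assumes the line format shown in the message (a timestamp, one edge token, then tokens of the form "<40 hex digits>:3" for common commits); pick_common_ancestor and its helpers are hypothetical names, and this is the frequency heuristic the thread itself calls weak, not a recommended way to find a merge base.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_CANDIDATES 64

struct candidate {
    char id[41];
    int count;
};

/* Return 1 if `tok` is 40 lowercase hex digits followed by ":3". */
static int is_common_token(const char *tok)
{
    if (strlen(tok) != 42 || strcmp(tok + 40, ":3") != 0)
        return 0;
    for (int i = 0; i < 40; i++)
        if (!isdigit((unsigned char)tok[i]) && !(tok[i] >= 'a' && tok[i] <= 'f'))
            return 0;
    return 1;
}

/*
 * Count how often each ":3" commit appears after the first two fields of
 * every line and copy the most frequent one into best[41].  Returns its
 * count, or 0 when no candidate was seen.  Ties go to the first one seen.
 * Lines longer than the local buffer are truncated.
 */
int pick_common_ancestor(const char **lines, size_t nlines, char best[41])
{
    struct candidate cand[MAX_CANDIDATES];
    size_t ncand = 0;
    int best_count = 0;

    for (size_t l = 0; l < nlines; l++) {
        char buf[1024];
        int field = 0;

        strncpy(buf, lines[l], sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
        for (char *tok = strtok(buf, " \n"); tok; tok = strtok(NULL, " \n")) {
            size_t i;

            if (field++ < 2 || !is_common_token(tok))
                continue;
            tok[40] = '\0';             /* drop the ":3" suffix */
            for (i = 0; i < ncand && strcmp(cand[i].id, tok) != 0; i++)
                ;
            if (i == ncand) {
                if (ncand == MAX_CANDIDATES)
                    continue;           /* table full: ignore new ids */
                strcpy(cand[ncand].id, tok);
                cand[ncand++].count = 0;
            }
            cand[i].count++;
        }
    }
    for (size_t i = 0; i < ncand; i++)
        if (cand[i].count > best_count) {
            best_count = cand[i].count;
            strcpy(best, cand[i].id);
        }
    return best_count;
}
```

With the two-line pattern from the message, ANCESTOR1 is counted twice and ANCESTOR2 once, so ANCESTOR1 is returned.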
PB> What about
PB>   sub OLDMODE { 0 }
PB>   sub NEWMODE { 1 }
PB>   sub NEWSHA { 2 }
PB> and then using that when accessing the tuple? Would make the code
PB> much more readable.

Totally agreed; readability cleanup is needed, just as "use
strict" you mentioned, before it is ready for public
consumption.  Remember, however, the primary purpose of the
message was to share it with Linus so that I can ask his opinion
while the script was still slushy; the contents of that array
were still changing then, and it was too early for symbolic
constants.  I'll do that in the next round.

PB> It is a good idea to check merge's exit code and give a notice at the
PB> end if there were any conflicts.

In principle yes, but I noticed that merge already gave me a
nice warning message when it found conflicts, so there was no
need to do so myself in this case.  See sample output:

    $ perl ./git-merge.perl \
        71796686221a0a56ccc25b02386ed8ea648da14d \
        bb95843a5a0f397270819462812735ee29796fb4
    Common ancestor: 9f02d4d233223462d3f6217b5837b786e6286ba4
    O - COPYING
    O - README
    ...
    O - write-tree.c
    A M write-blob.c
    A M show-diff.c
    ...
    A M update-cache.c
    A M git-merge.perl
    B M merge-tree.c
    MRG Makefile
    merge: warning: conflicts during merge
    $

>> +# Create a temporary directory and go there.
>> +system 'rm', '-rf', ',,merge-temp';

PB> Can't we call it just ,,merge?

I'd rather have a command line option '-o' (scrapping the
current '-o' and renaming it to something else; as you can see I
am terrible at picking option names ;-)) to mean "output to this
directory".  I am not really an Arch person so I do not
particularly care about /^,,/.  How about "git~merge~$$"?

>> +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; }
>> +symlink "../../.git/objects", "objects";
>> +chdir '..';
>> +
>> +my $ancestor_tree = read_commit_tree($common);
>> +system 'read-tree', $ancestor_tree;
>> +
>> +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0]));
>> +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1]));
>> +
>> +my @ancestor_file = read_show_files();
>> +my %ancestor_file = map { $_ => 1 } @ancestor_file;
>> +
>> +for (@ancestor_file) {
>> +    if (! exists $tree0{$_} && ! exists $tree1{$_}) {
>> +        if ($full_checkout) {
>> +            system 'checkout-cache', $_;
>> +        }
>> +        print STDERR "O - $_\n";

PB> Huh, what are you trying to do here? I think you should just record
PB> remove, no? (And I wouldn't do anything with my read-tree. ;-)

At this moment in the script, we have run "read-tree" the
ancestor so the dircache has the original.  %tree0 and %tree1
both did not touch the path ($_ here) so it is the same as
ancestor.  When '-f' is specified we are populating the output
working tree with the merge result so that is what that
'checkout-cache' is about.  "O - $path" means "we took the
original".

The idea is to populate the dircache of merge-temp with the
merge result and leave uncertain stuff as in the common ancestor
state, so that the user can fix them starting from there.

Maybe it is a good time for me to summarize the output somewhere
in a document.

    O - $path   Tree-A and tree-B did not touch this; the result
                is taken from the ancestor (O for original).

    A D $path   Only tree-A (or tree-B) deleted this and the other
    B D $path   branch did not touch this; the result is to delete.

    A M $path   Only tree-A (or tree-B) modified this and the other
    B M $path   branch did not touch this; the result is to use one
                from tree-A (or tree-B).  This includes file
                creation case.

    *DD $path   Both tree-A and tree-B deleted this; the result
                is to delete.

    *DM $path   Tree-A deleted while tree-B modified this (or
    *MD $path   vice versa), and manual conflict resolution is
                needed; dircache is left as in the ancestor, and
                the modified file is saved as $path~B~ (or
                $path~A~) in the working directory.  The user can
                rename it to $path and run show-diff to see what
                the modifying side wanted to do and decide before
                running update-cache.

    *MM $path   Tree-A and tree-B did the exact same
                modification; the result is to use that.

    MRG $path   Tree-A and tree-B have different modifications;
                run "merge" and the merge result is left as
                $path in the working directory.

In cases other than *DM, *MD, and MRG, the result is trivial and
is recorded in the dircache.  Without '-o' (to be renamed ;-)
nor '-f' there will not be a file checked out in the working
directory for them.

The three merge cases need human attention.  The dircache is not
touched in these cases and left as the ancestor version, and the
working directory gets some file as described above.

NOTE NOTE NOTE: I am not dealing with a case where both branches
create the same file but with different contents.  In such a
case the current code falls into MRG path without having a
common ancestor, which is nonsense---I can use /dev/null as the
common ancestor, I guess.

Also NOTE NOTE NOTE I need to detect the case where one branch
creates a directory while the other creates a file.  There is
nothing an automated tool can do in that case but it needs to be
detected and the user needs to be told loudly.
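The status table above amounts to a small decision function. A minimal sketch in C, assuming each branch's change to a path is summarized as "untouched", "deleted", or "now has blob X"; struct change, classify and needs_attention are hypothetical names for this illustration, not code from git-merge.perl. As the first NOTE says of the script, the case where both branches create the same path is not distinguished here either.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * What one branch did to a path relative to the common ancestor.
 * A NULL pointer stands for "this branch did not touch the path".
 */
struct change {
    int deleted;            /* 1 if the branch removed the path */
    const char *new_sha1;   /* new blob id when !deleted */
};

/* Return the three-character status code for one path. */
const char *classify(const struct change *a, const struct change *b)
{
    if (!a && !b)
        return "O -";
    if (a && !b)
        return a->deleted ? "A D" : "A M";
    if (!a && b)
        return b->deleted ? "B D" : "B M";
    if (a->deleted && b->deleted)
        return "*DD";
    if (a->deleted)
        return "*DM";
    if (b->deleted)
        return "*MD";
    if (strcmp(a->new_sha1, b->new_sha1) == 0)
        return "*MM";
    return "MRG";
}

/* Only these outcomes are left for a human; the rest are recorded as-is. */
int needs_attention(const char *code)
{
    return !strcmp(code, "*DM") || !strcmp(code, "*MD") || !strcmp(code, "MRG");
}
```

Keeping the classification separate from the actions (checkout, update-cache, running merge) makes it easy to add a further code later, for example for the file-versus-directory case, without touching the existing branches.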
* Re: Merge with git-pasky II.
  2005-04-14 18:12 ` Junio C Hamano
@ 2005-04-14 18:36   ` Linus Torvalds
  2005-04-14 19:59     ` Junio C Hamano
                       ` (2 more replies)
  2005-04-14 18:51   ` Christopher Li
  2005-04-14 19:35   ` Petr Baudis
  2 siblings, 3 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-14 18:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, git

On Thu, 14 Apr 2005, Junio C Hamano wrote:
>
> Sorry, I have not seen what you have been doing since pasky 0.3,
> and I have not even started to understand the mental model of
> the world your tool is building.  That said, my gut feeling is
> that telling this script about git-pasky's world model might be
> a mistake.  I'd rather see you consider the script as mere "part
> of the plumbing".

I agree. Having separate abstraction layers is good. I'm actually very
happy with Pasky's cleaned-up-tree, exactly because unlike the first one,
Pasky did a great job of maintaining the abstraction between "plumbing"
and user interfaces.

The plumbing should take user interface needs into account, but the more
conceptually separate it is ("does it make sense on its own?") the better
off we'll be. And "merge these two trees" (which works on a _tree_ level)
or "find the common commit" (which works on a _commit_ level) look like
plumbing to me - the kind of things I should have written, if I weren't
such a lazy slob.

		Linus
* Re: Merge with git-pasky II.
  2005-04-14 18:36 ` Linus Torvalds
@ 2005-04-14 19:59   ` Junio C Hamano
  2005-04-14 20:20     ` Petr Baudis
  2005-04-15  0:42     ` Linus Torvalds
  2005-04-15  2:21   ` [Patch] ls-tree enhancements Junio C Hamano
  2005-04-15  9:14   ` Merge with git-pasky II David Woodhouse
  2 siblings, 2 replies; 130+ messages in thread
From: Junio C Hamano @ 2005-04-14 19:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> On Thu, 14 Apr 2005, Junio C Hamano wrote:

>> Sorry, I have not seen what you have been doing since pasky 0.3,
>> and I have not even started to understand the mental model of
>> the world your tool is building.  That said, my gut feeling is
>> that telling this script about git-pasky's world model might be
>> a mistake.  I'd rather see you consider the script as mere "part
>> of the plumbing".

LT> I agree. Having separate abstraction layers is good. I'm actually very
LT> happy with Pasky's cleaned-up-tree, exactly because unlike the first one,
LT> Pasky did a great job of maintaining the abstraction between "plumbing"
LT> and user interfaces.

Agreed, not just with your agreeing with me, but with the
statement that Pasky did a good job (although I am ashamed to
say I have not caught up with the "userland" tools).

LT> The plumbing should take user interface needs into account, but the more
LT> conceptually separate it is ("does it make sense on its own?") the better
LT> off we'll be. And "merge these two trees" (which works on a _tree_ level)
LT> or "find the common commit" (which works on a _commit_ level) look like
LT> plumbing to me - the kind of things I should have written, if I weren't
LT> such a lazy slob.

I am planning to drop the ancestor computation from the script,
and make it another command line parameter to the script.  Dan
Barkalow's merge-base program should be used to compute it and
his result should drive the merge.  That sounds more UNIXy to
me.

I may even want to make the script take three trees rather than
commits, since the merge script does not need commits (it only
needs trees).  As plumbing, that would be a cleaner interface.
The wrapper SCM scripts can and should make sure it is fed trees
when the user gives it commits (or symbolic representation of it
like .git/tags/blah, or `cat .git/HEAD`).

But one different thing to note here.

You say "merge these two trees" above (I take it that you mean
"merge these two trees, taking account of this tree as their
common ancestor", so actually you are dealing with three trees),
and I am tending to agree with the notion of merging trees not
commits.  However you might get richer context and more sensible
resulting merge if you say "merge these two commits".  Since
commit chaining is part of the fundamental git object model you
may as well use it.

This however opens up another can of worms---it would involve
not just three trees but all the trees in the commit chain in
between.

That's when you start wondering if it would be better to add
renames in the git object model, which is the topic of another
thread.  I have not formed an opinion on that one myself yet.
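If the merge script is fed trees, a wrapper has to turn each commit into its tree first. A minimal sketch in C of that step, assuming the commit text (as printed by "cat-file commit") begins with a line "tree <40 hex digits>", which is the same assumption read_commit_tree in the posted script makes; commit_tree_id is a hypothetical helper name. Unlike the Perl version, which does not chomp the line it reads, this sketch leaves the trailing newline out of the result.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy the tree id from the first line of a commit object's text
 * ("tree <40 hex digits>\n...") into out[41].  Returns 0 on success,
 * or -1 if the text does not start with such a line.
 */
int commit_tree_id(const char *commit_text, char out[41])
{
    static const char hex[] = "0123456789abcdef";
    const char *p;

    if (strncmp(commit_text, "tree ", 5) != 0)
        return -1;
    p = commit_text + 5;
    /* exactly 40 hex digits, then end of line or end of text */
    if (strspn(p, hex) != 40 || (p[40] != '\n' && p[40] != '\0'))
        return -1;
    memcpy(out, p, 40);
    out[40] = '\0';
    return 0;
}
```

A wrapper would call this three times (ancestor and both heads) and pass the three resulting ids on, which keeps the merge script itself free of any knowledge about commits.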
* Re: Re: Merge with git-pasky II. 2005-04-14 19:59 ` Junio C Hamano @ 2005-04-14 20:20 ` Petr Baudis 2005-04-15 0:42 ` Linus Torvalds 1 sibling, 0 replies; 130+ messages in thread From: Petr Baudis @ 2005-04-14 20:20 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Dear diary, on Thu, Apr 14, 2005 at 09:59:04PM CEST, I got a letter where Junio C Hamano <junkio@cox.net> told me that... > >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: > > LT> On Thu, 14 Apr 2005, Junio C Hamano wrote: > > >> Sorry, I have not seen what you have been doing since pasky 0.3, > >> and I have not even started to understand the mental model of > >> the world your tool is building. That said, my gut feeling is > >> that telling this script about git-pasky's world model might be > >> a mistake. I'd rather see you consider the script as mere "part > >> of the plumbing". > > LT> I agree. Having separate abstraction layers is good. I'm actually very > LT> happy with Pasky's cleaned-up-tree, exactly because unlike the first one, (Just a side-note - functionally and even organizationally, the cleaned up tree does not differ significantly from the original one.) > LT> Pasky did a great job of maintaining the abstraction between "plumbing" > LT> and user interfaces. > > Agreed, not just with your agreeing with me, but with the > statement that Pasky did a good job (although I am ashamed to > say I have not caught up with the "userland" tools). Thanks. :-) > LT> The plumbing should take user interface needs into account, but the more > LT> conceptually separate it is ("does it makes sense on its own?") the better > LT> off we'll be. And "merge these two trees" (which works on a _tree_ level) > LT> or "find the common commit" (which works on a _commit_ level) look like > LT> plumbing to me - the kind of things I should have written, if I weren't > LT> such a lazy slob. 
> > I am planning drop the ancestor computation from the script, and > make it another command line parameter to the script. Dan > Barkalow's merge-base program should be used to compute it and > his result should drive the merge. That sounds more UNIXy to > me. Good move, I say! > I even may want to make the script take three trees not > commits, since the merge script does not need commits (it only > needs trees). As plumbing it would be cleaner interface to it > to do so. The wrapper SCM scripts can and should make sure it > is fed trees when the user gives it commits (or symbolic > representation of it like .git/tags/blah, or `cat .git/HEAD`). Agreed. > But one different thing to note here. > > You say "merge these two trees" above (I take it that you mean > "merge these two trees, taking account of this tree as their > common ancestor", so actually you are dealing with three trees), > and I am tending to agree with the notion of merging trees not > commits. However you might get richer context and more sensible > resulting merge if you say "merge these two commits". Since > commit chaining is part of the fundamental git object model you > may as well use it. Could you be more particular on the richer context etc? I think this script should stay strictly on the level of trees. When someone invents it, there could be a merge-commits script which does something very smart about two commits, traversing the graph between them etc, and doing a set of merge-tree invocations, possibly preparing the staging area for them etc. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 19:59 ` Junio C Hamano 2005-04-14 20:20 ` Petr Baudis @ 2005-04-15 0:42 ` Linus Torvalds 2005-04-15 2:33 ` Barry Silverman 2005-04-15 10:02 ` David Woodhouse 1 sibling, 2 replies; 130+ messages in thread From: Linus Torvalds @ 2005-04-15 0:42 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, git On Thu, 14 Apr 2005, Junio C Hamano wrote: > > You say "merge these two trees" above (I take it that you mean > "merge these two trees, taking account of this tree as their > common ancestor", so actually you are dealing with three trees), Yes. We're definitely talking three trees. > and I am tending to agree with the notion of merging trees not > commits. However you might get richer context and more sensible > resulting merge if you say "merge these two commits". Since > commit chaining is part of the fundamental git object model you > may as well use it. Yes and no. There are real advantages to using the commit state to just figure out the trees, and then at least have the _option_ to do the merge at a pure tree object. In particular, if you ever find yourself wanting to graft together two different commit histories, that almost certainly is what you'd want to do. Somebody might have arrived at the exact same tree some other way, starting with a 2.6.12 tar.ball or something, and I think we should at least support the notion of saying "these two totally unrelated commits actually have the same base tree, so let's merge them in "space" (ie data) even if we can't really sanely join them in "time" (ie "commits"). I dunno. And it's also a question of sanity. The fact is, we know how to make tree merges unambiguous, by just totally ignoring the history between them. Ie we know how to merge data. I am pretty damn sure that _nobody_ knows how to merge "data over time". Maybe BK does. I'm pretty sure it actually takes the "over time" into account. 
But my goal is to get something that works, and something that is reliable because it is simple and it has simple rules.

As you say:

> This however opens up another set of can of worms---it would
> involve not just three trees but all the trees in the commit
> chain in between.

Exactly. I seriously believe that the model is _broken_, simply because it gets too complicated. At some point it boils down to "keep it simple, stupid".

> That's when you start wondering if it would
> be better to add renames in the git object model, which is the
> topic of another thread. I have not formed an opinion on that
> one myself yet.

I've not even been convinced that renames are worth it. Nobody has really given a good reason why.

There are two reasons for renames I can think of:

 - space efficiency in delta-based trees. This is a total non-issue for git, and trying to explicitly track renames is going to cause _more_ space to be wasted rather than less.

 - "annotate". Something git doesn't really handle anyway, and it has little to do with renames. You can fake an annotate, but let's face it, it's _always_ going to be depending on interpreting a diff. In fact, that ends up being how traditional SCMs do it too - they don't really annotate lines, they just interpret the diff.

I think you might as well interpret the whole object thing. Git _does_ tell you how the objects changed, and I actually believe that a diff that works in between objects (ie can show "these lines moved from this file X to that file Y") is a _hell_ of a lot more powerful than "rename" is.

So I'd seriously suggest that instead of worrying about renames, people think about global diffs that aren't per-file. Git is good at limiting the changes to a set of objects, and it should be entirely possible to think of diffs as ways of moving lines _between_ objects and not just within objects. It's quite common to move a function from one file to another - certainly more so than renaming the whole file.
In other words, I really believe renames are just a meaningless special case of a much more interesting problem. Which is just one reason why I'm not at all interested in bothering with them other than as a "data moved" thing, which git already handles very well indeed. So there, Linus ^ permalink raw reply [flat|nested] 130+ messages in thread
* RE: Merge with git-pasky II. 2005-04-15 0:42 ` Linus Torvalds @ 2005-04-15 2:33 ` Barry Silverman 2005-04-15 10:02 ` David Woodhouse 1 sibling, 0 replies; 130+ messages in thread
From: Barry Silverman @ 2005-04-15 2:33 UTC (permalink / raw)
  To: 'Linus Torvalds', 'Junio C Hamano'
  Cc: 'Petr Baudis', git

>>In particular, if you ever find yourself wanting to graft together two
>>different commit histories, that almost certainly is what you'd want to
>>do. Somebody might have arrived at the exact same tree some other way,
>>starting with a 2.6.12 tar.ball or something, and I think we should at
>>least support the notion of saying "these two totally unrelated commits
>>actually have the same base tree, so let's merge them in "space" (ie data)
>>even if we can't really sanely join them in "time" (ie "commits").

If this is true - then the tree-id's of the two commits would be identical, but the commit-id's wouldn't. Does this imply that common ancestor lookup should work by comparing the tree-id's (space-wise the same) rather than the commit-ids (time-wise the same)?

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 0:42 ` Linus Torvalds 2005-04-15 2:33 ` Barry Silverman @ 2005-04-15 10:02 ` David Woodhouse 2005-04-15 15:32 ` Linus Torvalds 1 sibling, 1 reply; 130+ messages in thread From: David Woodhouse @ 2005-04-15 10:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git On Thu, 2005-04-14 at 17:42 -0700, Linus Torvalds wrote: > I've not even been convinved that renames are worth it. Nobody has > really given a good reason why. > > There are two reasons for renames I can think of: > > - space efficiency in delta-based trees. > - "annotate". Neither of those were my motivation for looking at renames. The reasons I wanted to track renames were: - Per-file revision history which doesn't stop dead at a rename. - Merging where files have been renamed in one branch and modified in another. Which is basically a special case of the above; we need to see the per-file revision history. > So I'd seriously suggest that instead of worryign about renames, people > think about global diffs that aren't per-file. Git is good at limiting > the changes to a set of objects, and it should be entirely possible to > think of diffs as ways of moving lines _between_ objects and not just > within objects. It's quite common to move a function from one file to > another - certainly more so than renaming the whole file. > > In other words, I really believe renames are just a meaningless special > case of a much more interesting problem. Which is just one reason why > I'm not at all interested in bothering with them other than as a "data > moved" thing, which git already handles very well indeed. Git doesn't handle 'data moved' except at a whole-tree level. For each commit, it says "these are the old trees; this is the new tree". Git doesn't actually look hard into the contents of tree; certainly it has no business looking at the contents of individual files; that is something that the SCM or possibly only the user should do. 
The storage of 'rename' information in the commit object is another kind of 'xattr' storage which git would provide but not directly interpret.

And you're right; it shouldn't have to be for renames only. There's no need for us to limit it to one "source" and one "destination"; the SCM can use it to track content as it sees fit.

As I said, the main aim of this is to track revision history of given content, for displaying to the user and for performing merges. So when a file is split up, or a function is moved from it to another file, a 'rename' xattr can be included to mark that files 'foo' and 'bar' in the new tree are both associated with file 'wibble' in the parent.

That's as much as we need to provide for content tracking, and it _does_ handle the general case as well as we should be attempting to. We don't want to get into dealing with file contents ourselves; we just want to store the hint for the SCM or the user that "your data went thataway".

-- 
dwmw2

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 10:02 ` David Woodhouse @ 2005-04-15 15:32 ` Linus Torvalds 2005-04-15 16:01 ` David Woodhouse ` (3 more replies) 0 siblings, 4 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-15 15:32 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Junio C Hamano, Petr Baudis, git

On Fri, 15 Apr 2005, David Woodhouse wrote:
>
> And you're right; it shouldn't have to be for renames only. There's no
> need for us to limit it to one "source" and one "destination"; the SCM
> can use it to track content as it sees fit.

Listen to yourself, and think about the problem for a second.

First off, let's just posit that "files" do not matter. The only thing that matters is how "content" moved in the tree. Ok? If I copy a function from one file to another, the perfect SCM will notice that, and show it as a diff that removes it from one file and adds it to another, and is _still_ able to track authorship past the move. Agreed?

Now, you basically propose to put that information in the "commit" log, and that's certainly valid. You can have the commit log say "lines 50-89 in file kernel/sched.c moved to lines 100-139 in kernel/timer.c", and then renames fall out of that as one very small special case.

You can even say "lines 50-89 in file kernel/sched.c copied to.." and allow data to be tracked past not just movement, but also duplication.

Do you agree that this is kind of what you'd want to aim for? That's a winning SCM concept.

How do you think the SCM _gets_ at this information? In particular, how are you proposing that we determine this, especially since 90% of all stuff comes in as patches etc?

You propose that we spend time when generating the tree on doing so. I'm telling you that that is wrong, for several reasons:

 - you're ignoring different paths for the same data.
   For example, you will make it impossible to merge two trees that have done exactly the same thing, except one did it as a patch (create/delete) and one did it using some other heuristic.

 - you're doing the work at the wrong point. Doing it _well_ is quite expensive. So if you do it at commit time, you cannot _afford_ to do it well, and you'll always fall back to doing an ass-backwards job that doesn't really get you to the good state, and only gets you to a not-very-interesting easy 1% of the solution (ie full file renames).

 - you're doing the work at the wrong point for _another_ reason. You're freezing your (crappy) algorithm at tree creation time, and basically making it pointless to ever create something better later, because even if hardware and software improve, you've codified that "we have to have crappy information".

Now, look at my proposal:

 - the actual information tracking tracks _nothing_ but information. You have an SCM that tracks what changed at the only level that really matters, namely the whole project. None of the information actually makes any sense at all at a smaller granularity, since by definition, a "project" depends on the other files, or it wouldn't be a project, it would be _two_ projects or more.

 - When you're interested in the history of the information, you actually track it, and you try to be _intelligent_ about it. You can actually do a HELL of a lot better than what you propose if you go the extra mile.

For example, let's say that you have a visualization tool that you can use for finding out where a line of code came from. You start out at some arbitrary point in the tree, and you drill down. That's how it works, right?

So how do you drill down?
You simply go backwards in history for that project, tracking when that file+line changed (a "file+line" thing is actually a "sensible" tracking unit at this point, because it makes sense within the query you're doing - it's _not_ a sensible thing to track at "commit" time, but when you ask yourself "where did this line come from", that _question_ makes it sensible. Also note that "where did this _file_ come from" is not a sensible question, since the file may have been the combination (or split) of several files, so there is no _answer_ to that question).

So the question then becomes: "how can you reasonably _efficiently_ find the history of one particular line", and in fact it turns out that by asking the question that way, it's pretty obvious: now that you don't have to track the whole repository, you can always try to minimize the thing you're looking for.

So what you do is walk back the history, and look at the tree objects (both sides when you hit a merge), and see if that file ever changes. That's actually a very efficient operation in GIT - it matches _exactly_ how git tracks things anyway. So it's not expensive at all.

When that file changes, you need to look if that _line_ changed (and here is where it comes down to usability: from a practical standpoint you probably don't care about a single line, you really _probably_ want to see changes around it too). So you diff the old state and the new state, and you see if you can still find where you were. If you still can, and the line (and a few lines around it) is still the same, you just continue to drill down. So that's not the interesting case.

So what happens when you found "ok, that area changed"? Your visualization tool now shows it to the user, AND BECAUSE IT SEES THE WHOLE TREE DIFF, it also shows where it probably came from. At _that_ point, it is actually very trivial to use a modest amount of CPU time, and look for probable sources within that diff.
You can do it on modern hardware in basically no time, so your visualization tool can actually notice that "oops, that line didn't even exist in the previous version, BUT I FOUND FIVE PLACES that matched almost perfectly in the same diff, and here they are" and voila, your tool now very efficiently showed the programmer that the source of the line in question was actually that we had merged 5 copies of the same code in different architectures into one common helper function.

And if you didn't find some source that matched, or if the old file was actually very similar around that line, and that line hadn't been "totally new"? That's the easy case again - you show the programmer the diff at that point in time, and you let him decide whether that diff was what he was looking for, or whether he wants to continue to "zoom down" into the history.

The above tool is (a) fairly easy to write for git (if you can do visualization tools) and (b) _exactly_ what I think most programmers actually want. Tell me I'm wrong. Honestly..

And notice? My clearly _superior_ algorithm never needed any rename information at all. It would have been a total waste of time. It would also have hidden the _real_ pattern, which was that a piece of code was merged from several other matching pieces of code into one new helper function. But if it _had_ been a pure rename, my superior tool would have trivially found that _too_. So rename information really really doesn't matter.

So I'm claiming that any SCM that tries to track renames is fundamentally broken unless it does so for internal reasons (ie to allow efficient deltas), exactly because renames do not matter. They don't help you, and they aren't what you were interested in _anyway_.

What matters is finding "where did this come from", and the git architecture does that very well indeed - much better than anything else out there. I outlined a simple algorithm that can be fairly trivially coded up by somebody who really cares.
Sure, pattern matching isn't trivial, but you start out with just saying "let's find that exact line, and two lines on each side", and then you start improving on that.

And that "where did this come from" decision should be done at _search_ time, not commit time. Because at that time it's not only trivial to do, but at that time you can _dynamically_ change your search criteria. For example, you can make the "match" algorithm be dependent on what you are looking at. If it's C source code, it might want to ignore variable names when it searches for matching code. And if it's an OpenOffice document, you might have some open-office-specific tools to do so. See? Also, the person doing the searches can say whether he is interested in that particular line (or even that particular _identifier_ on a line), or whether he wants to see the changes "around" that line.

All of which are very valid things to do, and all of which my world-view supports very well indeed. And all of which your pitiful "files matter" world-view totally doesn't get at all.

In other words, I'm right. I'm always right, but sometimes I'm more right than other times. And dammit, when I say "files don't matter", I'm really really Right(tm).

Please stop this "track files" crap. Git tracks _exactly_ what matters, namely "collections of files". Nothing else is relevant, and even _thinking_ that it is relevant only limits your world-view. Notice how the notion of CVS "annotate" always inevitably ends up limiting how people use it. I think it's a totally useless piece of crap, and I've described something that I think is a million times more useful, and it all fell out _exactly_ because I'm not limiting my thinking to the wrong model of the world.

		Linus

^ permalink raw reply [flat|nested] 130+ messages in thread
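A rough sketch of the "exact line, and two lines on each side" starting point described in the message above. Everything here is an assumption made for illustration: history is modelled as a plain list of snapshots (newest first), each mapping a path to its list of lines, and `find_window` and `trace_line` are hypothetical names. A real tool would walk commit and tree objects and would only look at files that changed in each step.

```python
def find_window(lines, window):
    """Return the index where `window` occurs verbatim in `lines`, or None."""
    n = len(window)
    for i in range(len(lines) - n + 1):
        if lines[i:i + n] == window:
            return i
    return None


def trace_line(history, path, lineno, context=2):
    """Follow one line (0-based `lineno`) backwards through `history`.

    Returns a list of (snapshot index, path, line number) positions where the
    line plus `context` lines on each side was found verbatim.
    """
    trail = [(0, path, lineno)]
    for idx in range(1, len(history)):
        lines = history[idx - 1][path]
        start = max(0, lineno - context)
        window = lines[start:lineno + context + 1]
        offset = lineno - start
        older = history[idx]
        # Try the same file first, then every other file in the older tree.
        candidates = [path] + sorted(p for p in older if p != path)
        for cand in candidates:
            if cand not in older:
                continue
            hit = find_window(older[cand], window)
            if hit is not None:
                path, lineno = cand, hit + offset
                trail.append((idx, path, lineno))
                break
        else:
            # No exact match: this is the point where a person would be
            # shown the diff and decide whether to keep drilling down.
            break
    return trail
```

Because the search looks across all files of the older snapshot, a function that moved from one file to another is followed without any recorded rename information; swapping `find_window` for a fuzzier matcher is the obvious next refinement.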
* Re: Merge with git-pasky II. 2005-04-15 15:32 ` Linus Torvalds @ 2005-04-15 16:01 ` David Woodhouse 2005-04-15 16:31 ` C. Scott Ananian 2005-04-16 15:33 ` Johannes Schindelin 2005-04-15 19:20 ` Paul Jackson ` (2 subsequent siblings) 3 siblings, 2 replies; 130+ messages in thread From: David Woodhouse @ 2005-04-15 16:01 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git On Fri, 2005-04-15 at 08:32 -0700, Linus Torvalds wrote: > - you're doing the work at the wrong point. Doing it _well_ is quite > expensive. So if you do it at commit time, you cannot _afford_ to do it > well, and you'll always fall back to doing an ass-backwards job that > doesn't really get you to the good state, and only gets you to a > not-very-interesting easy 1% of the solution (ie full file renames). > > - you're doing the work at the wrong point for _another_ reason. You're > freezing your (crappy) algorithm at tree creation time, and basically > making it pointless to ever create something better later, because even > if hardware and software improves, you've codified that "we have to > have crappy information". OK, I'm inclined to agree. The only thing that prevents me from capitulating entirely and resubscribing to the "Torvalds is always right" school is the concern that it _is_ expensive, and that's why I originally wanted to do it at commit time because then it's a one-off cost rather than recurring every time we want to track the history of a given piece of content. Also because we actually have the developer's attention at commit time, and we can get _real_ answers from the user about what she was doing, instead of having to guess. But if it can be done cheaply enough at a later date even though we end up repeating ourselves, and if it can be done _well_ enough that we shouldn't have just asked the user in the first place, then yes, OK I agree. -- dwmw2 ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 16:01 ` David Woodhouse @ 2005-04-15 16:31 ` C. Scott Ananian 2005-04-15 17:11 ` Linus Torvalds 2005-04-16 15:33 ` Johannes Schindelin 1 sibling, 1 reply; 130+ messages in thread From: C. Scott Ananian @ 2005-04-15 16:31 UTC (permalink / raw) To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git On Fri, 15 Apr 2005, David Woodhouse wrote: > given piece of content. Also because we actually have the developer's > attention at commit time, and we can get _real_ answers from the user > about what she was doing, instead of having to guess. Yes, but it's still hard to get *accurate* information. And developers tend to use very short commit messages already... > But if it can be done cheaply enough at a later date even though we end > up repeating ourselves, and if it can be done _well_ enough that we > shouldn't have just asked the user in the first place, then yes, OK I > agree. I think examining the rsync algorithms should convince you that finding common chunks can be fairly efficient. (See my next message for a more concrete proposal.) --scott Rijndael AMLASH Moscow Ft. Bragg shotgun HTKEEPER SHERWOOD overthrow Uzi anthrax Yeltsin Indonesia Suharto LITEMPO Dictionary Yakima KUBARK ( http://cscott.net/ ) ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 16:31 ` C. Scott Ananian @ 2005-04-15 17:11 ` Linus Torvalds 0 siblings, 0 replies; 130+ messages in thread From: Linus Torvalds @ 2005-04-15 17:11 UTC (permalink / raw) To: C. Scott Ananian; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git On Fri, 15 Apr 2005, C. Scott Ananian wrote: > > I think examining the rsync algorithms should convince you that finding > common chunks can be fairly efficient. Note that "efficient" really depends on how good a job you want to do, so you can tune it to how much CPU you can afford to waste on the problem. For example, my example had this thing where we merged five different functions into one function, and it is truly pretty efficient to find things like that _IF_ we only look at the files that changed (since the set of files that change in any one particular commit tends to be small, relative to the whole repository). There are many good algorithms for finding "common code", and with modern hardware that is basically instantaneous if you look at a few tens of files. For example, people wrote efficient things to compare _millions_ of lines of code for the whole SCO saga - you can do quite well. Some googling comes up with for example http://minnie.tuhs.org/Programs and applying those to a smallish set of files is quite efficient. What is _not_ necessarily as easy is the situation where you notice that a new set of lines appeared, but you don't see any place that matches that set of lines in the set of CHANGED files. That's actually quite common, ie let's say that you have a new filesystem or a new driver, and almost always it's based on a template or something, and you _would_ be able to see where that template came from, except it's not in that _changed_ set. And that is still doable, but now you really have to compare against the whole tree if you want to do it. 
Even _that_ is actually efficient if you cache the hashes - that's how the comparison tools compare two totally independent trees against each other, and it makes it practically possible to do even that expensive O(n**2) operation in reasonable time. It's certainly possible to do exactly the same thing for the "new code got added, does it bear any similarity to old code" case.

Note! This is a question that is relevant and actually is in the realm of the "possible to find the answer interactively". It may be fairly expensive, but the point is that this is the kind of relevant question that really does depend on the fundamental notion that "data matters more than any local changes". And when you think about the problem in that form, you find these kinds of interesting questions that you _can_ answer.

Because of the way git identifies data, the example "is there any other relevant code that may actually be similar to the newly added code" is actually not that hard to do in git. Remember: the way to answer that question is to have a cache of hashes of the contents. Guess what git _is_? You can now index your line-based hashes of contents against the _object_ hashes that git keeps track of, and you suddenly have an efficient way to actually look up those hashes.

NOTE! All of this is outside the scope of git itself. This is all "visualization and comparison tools" built up on top of git. And I'm not at all interested in writing those tools myself, and I'm absolutely not signing up for that part. All I'm arguing for is that the git architecture is actually a very good architecture for doing these kinds of very very cool tools.

		Linus

^ permalink raw reply [flat|nested] 130+ messages in thread
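One way the "index line-based hashes against the object hashes" idea above might look, as a sketch only. The names `object_id`, `build_index` and `similar_objects` are made up for illustration, and `object_id` is a stand-in: it hashes the raw contents with SHA-1 and does not claim to reproduce how git itself computes object names.

```python
import hashlib
from collections import defaultdict


def object_id(data: bytes) -> str:
    """Stand-in content id (assumption: plain SHA-1 of the raw bytes)."""
    return hashlib.sha1(data).hexdigest()


def build_index(blobs):
    """Map the hash of each non-blank, whitespace-stripped line to the ids
    of all objects that contain that line."""
    index = defaultdict(set)
    for data in blobs:
        oid = object_id(data)
        for line in data.splitlines():
            stripped = line.strip()
            if stripped:
                index[hashlib.sha1(stripped).hexdigest()].add(oid)
    return index


def similar_objects(index, new_data):
    """Rank known objects by how many distinct lines they share with new_data."""
    counts = defaultdict(int)
    for line in {l.strip() for l in new_data.splitlines()}:
        if line:
            for oid in index.get(hashlib.sha1(line).hexdigest(), ()):
                counts[oid] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Since object contents never change once stored, such an index would only need to be extended when new objects appear, which is what makes the "compare new code against the whole tree" query plausible to answer interactively.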
* Re: Merge with git-pasky II.
  2005-04-15 16:01 ` David Woodhouse
  2005-04-15 16:31 ` C. Scott Ananian
@ 2005-04-16 15:33 ` Johannes Schindelin
  2005-04-17 13:14 ` David Woodhouse
  1 sibling, 1 reply; 130+ messages in thread
From: Johannes Schindelin @ 2005-04-16 15:33 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

Hi,

On Fri, 15 Apr 2005, David Woodhouse wrote:
> But if it can be done cheaply enough at a later date even though we end
> up repeating ourselves, and if it can be done _well_ enough that we
> shouldn't have just asked the user in the first place, then yes, OK I
> agree.

The repetition could be helped by using a cache.

Ciao,
Dscho

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-16 15:33 ` Johannes Schindelin
@ 2005-04-17 13:14 ` David Woodhouse
  0 siblings, 0 replies; 130+ messages in thread
From: David Woodhouse @ 2005-04-17 13:14 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

On Sat, 2005-04-16 at 17:33 +0200, Johannes Schindelin wrote:
> > But if it can be done cheaply enough at a later date even though we end
> > up repeating ourselves, and if it can be done _well_ enough that we
> > shouldn't have just asked the user in the first place, then yes, OK I
> > agree.
>
> The repetition could be helped by using a cache.

Perhaps. Since neither such a cache nor even the commit comments are
strictly part of the git data, they probably shouldn't be included in
the sha1 hash of the commit object. However, I don't see a fundamental
reason why we couldn't store them in the same file but omit them from
the hash calculations.

That also allows us to retrospectively edit commit comments without
completely changing the entire subsequent history.

Or is that a little too heretical a suggestion?

-- 
dwmw2

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-15 15:32 ` Linus Torvalds
  2005-04-15 16:01 ` David Woodhouse
@ 2005-04-15 19:20 ` Paul Jackson
  2005-04-16 1:44 ` Simon Fowler
  2005-04-16 20:29 ` Sanjoy Mahajan
  3 siblings, 0 replies; 130+ messages in thread
From: Paul Jackson @ 2005-04-15 19:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dwmw2, junkio, pasky, git

These notions that one can always best answer questions by looking
at the content, and that "Individual files DO NOT EXIST" seem
overstated, to me.

Granted, overstated for a good reason. A couple sticks of dynamite
are needed to shake loose some old SCM thinking habits.

===

Ingo has a point when he states:
> i believe the fundamental thing to think about is not file or line or
> namespace, but 'tracking developer intent'.

He too overstates - it's not _the_ (as in one and only) thing.
But it's useful. Given the traditional terseness of many engineers,
it's certainly not the _only_ thing. The code speaks too.

===

The above two are related in this way. Traditional SCM uses per file
(version-controlled file, as in s.* or *,v files) metadata to track
'developer intent'.

I'm afraid we are at risk of confusing baby (developer intent) and
bathwater (version-controlled file structure of classic SCM's).

===

But we already have a pretty damn good way of tracking developer
intent that needs to fit naturally with whatever we build on top
of git.

    Mr. McGuire: I just want to say one word to you - just one word.
    Ben: Yes sir.
    Mr. McGuire: Are you listening?
    Ben: Yes I am.
    Mr. McGuire: 'Patches.'

    # the original word was 'Plastics' - The Graduate (1967)

Andrew and the other maintainers do a pretty good job of 'encouraging'
developers to provide useful statements of 'intent' in their patch
headers.

The patch series in something like *-mm, including per-patch
commentary, are a valuable part of this project.

===

I have not looked closely at what is being done here, on top of git,
for SCM-like capabilities. Hopefully the next two questions are not
too stupid:

 1) How do we track the patch header commentary?

 2) Why can't we have a quilt-like SCM, not bk/rcs/cvs/sccs/... like?

For (2), anyone publishing a Linux source would periodically announce
an <sha1> value, attached to some name suitable for public consumption.

For example, sometime in the next month or so, Linus would announce
that the <sha1> of 2.6.12 is so-and-so. That would identify the result
of applying a specific set of patches, resulting in a specific source
tree contents. He would announce a few 2.6.12-rc* <sha1>'s between now
and then.

Between now and then, Andrew would (if using these tools) have
published several <sha1> values, one each for various 2.6.12-rc*-mm*
versions.

If you explode such a <sha1> all out into a working directory, you get
both the source contents in the appropriately named files, and the
quilt-style patches subdirectory, of the patch series that gets you
here, starting from some Time Zero for that series of published kernel
versions.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-15 15:32 ` Linus Torvalds
  2005-04-15 16:01 ` David Woodhouse
  2005-04-15 19:20 ` Paul Jackson
@ 2005-04-16 1:44 ` Simon Fowler
  2005-04-16 12:19 ` David Lang
  2005-04-16 20:29 ` Sanjoy Mahajan
  3 siblings, 1 reply; 130+ messages in thread
From: Simon Fowler @ 2005-04-16 1:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git

[-- Attachment #1.1: Type: text/plain, Size: 1189 bytes --]

On Fri, Apr 15, 2005 at 08:32:46AM -0700, Linus Torvalds wrote:
> In other words, I'm right. I'm always right, but sometimes I'm more right
> than other times. And dammit, when I say "files don't matter", I'm really
> really Right(tm).
>
You're right, of course (All Hail Linus!), if you can make it work
efficiently enough.

Just to put something else on the table, here's how I'd go about
tracking renames and the like, in another world where Linus /does/
make the odd mistake - it's basically a unique id for files in the
repository, added when the file is first recognised and updated when
update-cache adds a new version to the cache. Renames copy the id
across to the new name, and add it into the cache.

This gives you an O(n) way to tell what file was what across
renames, and it might even be useful in Linus' world, or if someone
wanted to build a traditional SCM on top of a git-a-like.

Attached is a patch, and a rename-file.c to use it.

Simon -- PGP public key Id 0x144A991C, or http://himi.org/stuff/himi.asc (crappy) Homepage: http://himi.org doe #237 (see http://www.lemuria.org/DeCSS) My DeCSS mirror: ftp://himi.org/pub/mirrors/css/ [-- Attachment #1.2: guid2.patch --] [-- Type: text/plain, Size: 19027 bytes --] COPYING: fe2a4177a760fd110e78788734f167bd633be8de Makefile: ca50293c4f211452d999b81f122e99babb9f2987 --- Makefile +++ Makefile 2005-04-15 22:17:49.000000000 +1000 @@ -14,7 +14,7 @@ PROG= update-cache show-diff init-db write-tree read-tree commit-tree \ cat-file fsck-cache checkout-cache diff-tree rev-tree show-files \ - check-files ls-tree + check-files ls-tree rename-file SCRIPT= parent-id tree-id git gitXnormid.sh gitadd.sh gitaddremote.sh \ gitcommit.sh gitdiff-do gitdiff.sh gitlog.sh gitls.sh gitlsobj.sh \ @@ -73,6 +73,9 @@ ls-tree: ls-tree.o read-cache.o $(CC) $(CFLAGS) -o ls-tree ls-tree.o read-cache.o $(LIBS) +rename-file: rename-file.o read-cache.o + $(CC) $(CFLAGS) -o rename-file rename-file.o read-cache.o $(LIBS) + read-cache.o: cache.h show-diff.o: cache.h README: ded1a3b20e9bbe1f40e487ba5f9361719a1b6b85 VERSION: c27bd67cd632cc15dd520fbfbf807d482efa2dcf cache.h: 4d382549041d3281f8d44aa2e52f9f8ec47dd420 --- cache.h +++ cache.h 2005-04-14 22:35:59.000000000 +1000 @@ -55,6 +55,7 @@ unsigned int st_gid; unsigned int st_size; unsigned char sha1[20]; + unsigned char guid[20]; unsigned short namelen; char name[0]; }; cat-file.c: 45be1badaa8517d4e3a69e0bf1cac2e90191e475 check-files.c: 927b0b9aca742183fc8e7ccd73d73d8d5427e98f checkout-cache.c: f06871cdbc1b18ea93bdf4e17126aeb4cca1373e commit-id: 65c81756c8f10d513d073ecbd741a3244663c4c9 commit-tree.c: 12196c79f31d004dff0df1f50dda67d8204f5568 diff-tree.c: 7dcc9eb7782fa176e27f1677b161ce78ac1d2070 --- diff-tree.c +++ diff-tree.c 2005-04-16 10:46:52.000000000 +1000 @@ -1,33 +1,144 @@ +#include <sys/param.h> #include "cache.h" -static int recursive = 0; +enum diff_type { + REMOVE, + ADD, + RENAME, + MODIFY, +}; + +struct guid_cache_entry { + 
enum diff_type diff; + unsigned char guid[20]; + unsigned char sha1[20]; + struct guid_cache_entry *old; + unsigned int mode; + unsigned int pathlen; + unsigned char path[0]; +}; + +struct guid_cache { + unsigned int nr; + unsigned int alloc; + struct guid_cache_entry **cache; +}; -static int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base); +struct guid_cache guid_cache; +struct guid_cache *cache = &guid_cache; -static void update_tree_entry(void **bufp, unsigned long *sizep) +int guid_cache_pos(const char *guid) { - void *buf = *bufp; - unsigned long size = *sizep; - int len = strlen(buf) + 1 + 20; + int first, last; - if (size < len) - die("corrupt tree file"); - *bufp = buf + len; - *sizep = size - len; + first = 0; + last = cache->nr; + while (last > first) { + int next = (last + first) >> 1; + struct guid_cache_entry *gce = cache->cache[next]; + int cmp = memcmp(guid, gce->guid, 20); + if (!cmp) + return next; + if (cmp < 0) { + last = next; + continue; + } + first = next + 1; + } + return - first-1; } -static const unsigned char *extract(void *tree, unsigned long size, const char **pathp, unsigned int *modep) +int add_guid_cache_entry(struct guid_cache_entry *gce) +{ + int pos; + + pos = guid_cache_pos(gce->guid); + + /* if this is a rename or modify, the guid will show up a + * second time */ + if (pos >= 0) { + struct guid_cache_entry *old = cache->cache[pos]; + int cmp = cache_name_compare(old->path, old->pathlen, gce->path, gce->pathlen); + + if (!cmp) { + /* pathname matches, so this must be a + * modify. */ + gce->old = old; + gce->diff = MODIFY; + cache->cache[pos] = gce; + } else { + /* the pathnames are different, so the file + * must have been renamed somewhere along the + * line. 
*/ + gce->old = old; + gce->diff = RENAME; + cache->cache[pos] = gce; + } + return 0; + } + pos = -pos-1; + + if (cache->nr == cache->alloc) { + cache->alloc = alloc_nr(cache->alloc); + cache->cache = realloc(cache->cache, cache->alloc * sizeof(struct guid_cache_entry *)); + } + + cache->nr++; + if (cache->nr > pos) + memmove(cache->cache + pos + 1, cache->cache + pos, (cache->nr - pos - 1) * sizeof(struct guid_cache_entry *)); + cache->cache[pos] = gce; + return 0; +} + +static const unsigned char *extract(void *tree, unsigned long size, const char **pathp, unsigned int *modep, const unsigned char **guid) { int len = strlen(tree)+1; const unsigned char *sha1 = tree + len; const char *path = strchr(tree, ' '); - if (!path || size < len + 20 || sscanf(tree, "%o", modep) != 1) + if (!path || size < len + 40 || sscanf(tree, "%o", modep) != 1) die("corrupt tree file"); *pathp = path+1; + *guid = tree + len + 20; return sha1; } +static void guid_cache_tree_entry(void *buf, unsigned int len, const char *base, enum diff_type diff) +{ + unsigned mode; + const char *path; + const unsigned char *guid; + const unsigned char *sha1 = extract(buf, len, &path, &mode, &guid); + struct guid_cache_entry *gce; + int baselen = strlen(base); + + gce = calloc(1, sizeof(struct guid_cache_entry) + baselen + strlen(path) + 1); + memcpy(gce->guid, guid, 20); + memcpy(gce->sha1, sha1, 20); + gce->diff = diff; + gce->mode = mode; + gce->pathlen = snprintf(gce->path, MAXPATHLEN, "%s%s", base, path); + gce->path[gce->pathlen + 1] = '\0'; + + add_guid_cache_entry(gce); +} + +static int recursive = 0; + +static int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base); + +static void update_tree_entry(void **bufp, unsigned long *sizep) +{ + void *buf = *bufp; + unsigned long size = *sizep; + int len = strlen(buf) + 1 + 40; + + if (size < len) + die("corrupt tree file"); + *bufp = buf + len; + *sizep = size - len; +} + static char *malloc_base(const char *base, const 
char *path, int pathlen) { int baselen = strlen(base); @@ -38,23 +149,24 @@ return newbase; } -static void show_file(const char *prefix, void *tree, unsigned long size, const char *base); +static void changed_file(void *tree, unsigned long size, const char *base, enum diff_type diff); /* A whole sub-tree went away or appeared */ -static void show_tree(const char *prefix, void *tree, unsigned long size, const char *base) +static void changed_tree(void *tree, unsigned long size, const char *base, enum diff_type diff) { while (size) { - show_file(prefix, tree, size, base); + changed_file(tree, size, base, diff); update_tree_entry(&tree, &size); } } /* A file entry went away or appeared */ -static void show_file(const char *prefix, void *tree, unsigned long size, const char *base) +static void changed_file(void *tree, unsigned long size, const char *base, enum diff_type diff) { unsigned mode; const char *path; - const unsigned char *sha1 = extract(tree, size, &path, &mode); + const unsigned char *guid; + const unsigned char *sha1 = extract(tree, size, &path, &mode, &guid); if (recursive && S_ISDIR(mode)) { char type[20]; @@ -66,38 +178,96 @@ if (!tree || strcmp(type, "tree")) die("corrupt tree sha %s", sha1_to_hex(sha1)); - show_tree(prefix, tree, size, newbase); + changed_tree(tree, size, newbase, diff); free(tree); free(newbase); return; } - printf("%s%o\t%s\t%s\t%s%s%c", prefix, mode, - S_ISDIR(mode) ? "tree" : "blob", - sha1_to_hex(sha1), base, path, 0); + guid_cache_tree_entry(tree, size, base, diff); } +static void show_one_file(struct guid_cache_entry *gce) +{ + struct guid_cache_entry *old; + char old_sha1[50]; + char old_sha2[50]; + + switch(gce->diff) { + case REMOVE: + sprintf(old_sha1, "%s", sha1_to_hex(gce->sha1)); + printf("-%o\t%s\t%s\t%s\t%s%c", gce->mode, + S_ISDIR(gce->mode) ? 
"tree" : "blob", + old_sha1, sha1_to_hex(gce->guid), gce->path, 0); + break; + case ADD: + sprintf(old_sha1, "%s", sha1_to_hex(gce->sha1)); + printf("+%o\t%s\t%s\t%s\t%s%c", gce->mode, + S_ISDIR(gce->mode) ? "tree" : "blob", + old_sha1, sha1_to_hex(gce->guid), gce->path, 0); + break; + case MODIFY: + old = gce->old; + if (old) { + sprintf(old_sha1, "%s", sha1_to_hex(old->sha1)); + sprintf(old_sha2, "%s", sha1_to_hex(gce->sha1)); + + printf("*%o->%o\t%s\t%s->%s\t%s\t%s%c", old->mode, gce->mode, + S_ISDIR(old->mode) ? "tree" : "blob", + old_sha1, old_sha2, sha1_to_hex(gce->guid), gce->path, 0); + } else { + die("diff-tree: internal error"); + } + break; + case RENAME: + old = gce->old; + if (old) { + sprintf(old_sha1, "%s", sha1_to_hex(gce->sha1)); + sprintf(old_sha2, "%s", sha1_to_hex(old->sha1)); + + printf("r%o->%o\t%s\t%s->%s\t%s\t%s%c", gce->mode, old->mode, + S_ISDIR(old->mode) ? "tree" : "blob", + old_sha1, old_sha2, sha1_to_hex(old->guid), old->path, 0); + } else { + die("diff-tree: internal error"); + } + break; + default: + die("diff-tree: internal error"); + } +} + +/* simply iterate over both caches looking for matching guids, + * showing all files in both caches */ +static void show_cache(void) +{ + int i; + + for (i = 0; i < cache->nr; i++) + show_one_file(cache->cache[i]); +} + static int compare_tree_entry(void *tree1, unsigned long size1, void *tree2, unsigned long size2, const char *base) { unsigned mode1, mode2; const char *path1, *path2; const unsigned char *sha1, *sha2; + const unsigned char *guid1, *guid2; int cmp, pathlen1, pathlen2; - char old_sha1_hex[50]; - sha1 = extract(tree1, size1, &path1, &mode1); - sha2 = extract(tree2, size2, &path2, &mode2); + sha1 = extract(tree1, size1, &path1, &mode1, &guid1); + sha2 = extract(tree2, size2, &path2, &mode2, &guid2); pathlen1 = strlen(path1); pathlen2 = strlen(path2); cmp = cache_name_compare(path1, pathlen1, path2, pathlen2); if (cmp < 0) { - show_file("-", tree1, size1, base); + 
changed_file(tree1, size1, base, REMOVE); return -1; } if (cmp > 0) { - show_file("+", tree2, size2, base); + changed_file(tree2, size2, base, ADD); return 1; } if (!memcmp(sha1, sha2, 20) && mode1 == mode2) @@ -108,8 +278,8 @@ * file, we need to consider it a remove and an add. */ if (S_ISDIR(mode1) != S_ISDIR(mode2)) { - show_file("-", tree1, size1, base); - show_file("+", tree2, size2, base); + changed_file(tree1, size1, base, REMOVE); + changed_file(tree2, size2, base, ADD); return 0; } @@ -121,10 +291,14 @@ return retval; } - strcpy(old_sha1_hex, sha1_to_hex(sha1)); - printf("*%o->%o\t%s\t%s->%s\t%s%s%c", mode1, mode2, - S_ISDIR(mode1) ? "tree" : "blob", - old_sha1_hex, sha1_to_hex(sha2), base, path1, 0); + if (!memcmp(guid1, guid2, 20)) { + changed_file(tree1, size1, base, MODIFY); + changed_file(tree2, size2, base, MODIFY); + return 0; + } + + changed_file(tree1, size1, base, REMOVE); + changed_file(tree2, size2, base, ADD); return 0; } @@ -132,12 +306,12 @@ { while (size1 | size2) { if (!size1) { - show_file("+", tree2, size2, base); + changed_file(tree2, size2, base, ADD); update_tree_entry(&tree2, &size2); continue; } if (!size2) { - show_file("-", tree1, size1, base); + changed_file(tree1, size1, base, REMOVE); update_tree_entry(&tree1, &size1); continue; } @@ -179,6 +353,7 @@ int main(int argc, char **argv) { unsigned char old[20], new[20]; + int retval; while (argc > 3) { char *arg = argv[1]; @@ -193,5 +368,7 @@ if (argc != 3 || get_sha1_hex(argv[1], old) || get_sha1_hex(argv[2], new)) usage("diff-tree <tree sha1> <tree sha1>"); - return diff_tree_sha1(old, new, ""); + retval = diff_tree_sha1(old, new, ""); + show_cache(); + return retval; } fsck-cache.c: 9c900fe458cecd2bdb4c4571a584115b5cf24f22 --- fsck-cache.c +++ fsck-cache.c 2005-04-15 20:39:49.000000000 +1000 @@ -165,9 +165,10 @@ while (size) { int len = 1+strlen(data); unsigned char *file_sha1 = data + len; + unsigned char *guid = file_sha1 + 20; char *path = strchr(data, ' '); unsigned int mode; 
- if (size < len + 20 || !path || sscanf(data, "%o", &mode) != 1) + if (size < len + 40 || !path || sscanf(data, "%o", &mode) != 1) return -1; /* Warn about trees that don't do the recursive thing.. */ @@ -176,8 +177,8 @@ warn_old_tree = 0; } - data += len + 20; - size -= len + 20; + data += len + 40; + size -= len + 40; mark_needs_sha1(sha1, S_ISDIR(mode) ? "tree" : "blob", file_sha1); } return 0; git: 2c557dcf2032325acc265b577ee104e605fdaede gitXnormid.sh: a5d7a9f4a6e8d4860f35f69500965c2a493d80de gitadd.sh: 3ed93ea0fcb995673ba9ee1982e0e7abdbe35982 gitaddremote.sh: bf1f28823da5b5270aa8fa05b321faa514a57a11 gitapply.sh: d0e3c46e2ce1ee74e1a87ee6137955fa9b35c27b gitcancel.sh: ec58f7444a42cd3cbaae919fc68c70a3866420c0 gitcommit.sh: 3629f67bbd3f171d091552814908b67af7537f4d gitdiff-do: d6174abceab34d22010c36a8453a6c3f3f184fe0 gitdiff.sh: 5e47c4779d73c3f2f39f6be714c0145175933197 gitexport.sh: dad00bf251b38ce522c593ea9631f842d8ccc934 gitlntree.sh: 17c4966ea64aeced96ae4f1b00f3775c1904b0f1 gitlog.sh: 177c6d12dd9fa4b4920b08451ffe4badde544a39 gitls.sh: b6f15d82f16c1e9982c5031f3be22eb5430273af gitlsobj.sh: 128461d3de6a42cfaaa989fc6401bebdfa885b3f gitmerge.sh: 23e4a3ff342c6005928ceea598a2f52de6fb9817 gitpull.sh: 0883898dda579e3fa44944b7b1d909257f6dc63e gitrm.sh: 5c18c38a890c9fd9ad2b866ee7b529539d2f3f8f gittag.sh: c8cb31385d5a9622e95a4e0b2d6a4198038a659c gittrack.sh: 03d6db1fb3a70605ef249c632c04e542457f0808 init-db.c: aa00fbb1b95624f6c30090a17354c9c08a6ac596 ls-tree.c: 3e2a6c7d183a42e41f1073dfec6794e8f8a5e75c --- ls-tree.c +++ ls-tree.c 2005-04-15 15:55:40.000000000 +1000 @@ -10,6 +10,7 @@ void *buffer; unsigned long size; char type[20]; + char old_sha1[50]; buffer = read_sha1_file(sha1, type, &size); if (!buffer) @@ -19,19 +20,21 @@ while (size) { int len = strlen(buffer)+1; unsigned char *sha1 = buffer + len; + unsigned char *guid = buffer + len + 20; char *path = strchr(buffer, ' ')+1; unsigned int mode; unsigned char *type; - if (size < len + 20 || sscanf(buffer, "%o", &mode) 
!= 1) + if (size < len + 40 || sscanf(buffer, "%o", &mode) != 1) die("corrupt 'tree' file"); - buffer = sha1 + 20; - size -= len + 20; + buffer = sha1 + 40; + size -= len + 40; /* XXX: We do some ugly mode heuristics here. * It seems not worth it to read each file just to get this * and the file size. -- pasky@ucw.cz */ type = S_ISDIR(mode) ? "tree" : "blob"; - printf("%03o\t%s\t%s\t%s\n", mode, type, sha1_to_hex(sha1), path); + sprintf(old_sha1, sha1_to_hex(guid)); + printf("%03o\t%s\t%s\t%s\t%s\n", mode, type, sha1_to_hex(sha1), old_sha1, path); } return 0; } parent-id: 1801c6fe426592832e7250f8b760fb9d2e65220f read-cache.c: 7a6ae8b9b489f6b67c82e065dedd5716a6bfc0ef --- read-cache.c +++ read-cache.c 2005-04-16 10:52:51.000000000 +1000 @@ -4,6 +4,8 @@ * Copyright (C) Linus Torvalds, 2005 */ #include <stdarg.h> +#include <time.h> +#include <sys/param.h> #include "cache.h" const char *sha1_file_directory = NULL; @@ -233,6 +235,22 @@ return 0; } +void new_guid(const char *filename, int namelen, unsigned char *returnguid) +{ + size_t size; + time_t now = time(NULL); + char buf[MAXPATHLEN + 20]; + unsigned char guid[20]; + + size = snprintf(buf, MAXPATHLEN + 20, "%ld%s", now, filename) + 1; + + SHA1(buf, size, guid); + + if (returnguid) + memcpy(returnguid, guid, 20); + return; +} + static inline int collision_check(char *filename, void *buf, unsigned int size) { #ifdef COLLISION_CHECK @@ -363,11 +381,14 @@ int add_cache_entry(struct cache_entry *ce, int ok_to_add) { int pos; + unsigned char guid[20]; pos = cache_name_pos(ce->name, ce->namelen); /* existing match? Just replace it */ if (pos >= 0) { + struct cache_entry *old_ce = active_cache[pos]; + memcpy(ce->guid, old_ce->guid, 20); active_cache[pos] = ce; return 0; } @@ -376,6 +397,12 @@ if (!ok_to_add) return -1; + memset(guid, 0, 20); + if (!memcmp(ce->guid, guid, 20)) { + new_guid(ce->name, ce->namelen, guid); + memcpy(ce->guid, guid, 20); + } + /* Make sure the array is big enough .. 
*/ if (active_nr == active_alloc) { active_alloc = alloc_nr(active_alloc); read-tree.c: eb548148aa6d212f05c2c622ffbe62a06cd072f9 --- read-tree.c +++ read-tree.c 2005-04-16 10:41:46.000000000 +1000 @@ -5,7 +5,9 @@ */ #include "cache.h" -static int read_one_entry(unsigned char *sha1, const char *base, int baselen, const char *pathname, unsigned mode) +static int read_one_entry(unsigned char *sha1, unsigned char *guid, + const char *base, int baselen, + const char *pathname, unsigned mode) { int len = strlen(pathname); unsigned int size = cache_entry_size(baselen + len); @@ -18,6 +20,7 @@ memcpy(ce->name, base, baselen); memcpy(ce->name + baselen, pathname, len+1); memcpy(ce->sha1, sha1, 20); + memcpy(ce->guid, guid, 20); return add_cache_entry(ce, 1); } @@ -35,14 +38,15 @@ while (size) { int len = strlen(buffer)+1; unsigned char *sha1 = buffer + len; + unsigned char *guid = buffer + len + 20; char *path = strchr(buffer, ' ')+1; unsigned int mode; - - if (size < len + 20 || sscanf(buffer, "%o", &mode) != 1) + + if (size < len + 40 || sscanf(buffer, "%o", &mode) != 1) return -1; - buffer = sha1 + 20; - size -= len + 20; + buffer = sha1 + 40; + size -= len + 40; if (S_ISDIR(mode)) { int retval; @@ -57,7 +61,7 @@ return -1; continue; } - if (read_one_entry(sha1, base, baselen, path, mode) < 0) + if (read_one_entry(sha1, guid, base, baselen, path, mode) < 0) return -1; } return 0; rev-tree.c: 395b0b3bfadb0537ae0c62744b25ead4b487f3f6 show-diff.c: a531ca4078525d1c8dcf84aae0bfa89fed6e5d96 show-files.c: a9fa6767a418f870a34b39379f417bf37b17ee18 tree-id: cb70e2c508a18107abe305633612ed702aa3ee4f update-cache.c: 62d0a6c41560d40863c44599355af10d9e089312 write-tree.c: 1534477c91169ebddcf953e3f4d2872495477f6b --- write-tree.c +++ write-tree.c 2005-04-15 13:46:05.000000000 +1000 @@ -47,6 +47,7 @@ const char *pathname = ce->name, *filename, *dirname; int pathlen = ce->namelen, entrylen; unsigned char *sha1; + unsigned char *guid; unsigned int mode; /* Did we hit the end of the 
directory? Return how many we wrote */ @@ -54,6 +55,7 @@ break; sha1 = ce->sha1; + guid = ce->guid; mode = ce->st_mode; /* Do we have _further_ subdirectories? */ @@ -86,6 +88,8 @@ buffer[offset++] = 0; memcpy(buffer + offset, sha1, 20); offset += 20; + memcpy(buffer + offset, guid, 20); + offset += 20; nr++; } while (nr < maxentries); [-- Attachment #1.3: rename-file.c --] [-- Type: text/x-csrc, Size: 1703 bytes --] /* * rename files in a git repository, keeping the guid. * * Copyright Simon Fowler <simon@dreamcraft.com.au>, 2005. */ #include <unistd.h> #include <sys/stat.h> #include <errno.h> #include "cache.h" static int remove_lock = 0; static void remove_lock_file(void) { if (remove_lock) unlink(".git/index.lock"); } int main(int argc, char *argv[]) { struct stat stats; struct cache_entry *ce, *new; int newfd, entries, pos, pos2; if (argc != 3) usage("rename-file <old> <new>"); if (stat(argv[1], &stats)) { perror("rename-file: "); exit(1); } if (!stat(argv[2], &stats)) die("rename-file: destination file already exists"); newfd = open(".git/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600); if (newfd < 0) die("unable to create new cachefile"); atexit(remove_lock_file); remove_lock = 1; entries = read_cache(); if (entries < 0) die("cache corrupted"); pos = cache_name_pos(argv[1], strlen(argv[1])); pos2 = cache_name_pos(argv[2], strlen(argv[2])); if (pos < 0) die("original file not in cache"); if (pos2 >= 0) die("destination file already in cache"); ce = active_cache[pos]; new = malloc(sizeof(struct cache_entry) + strlen(argv[2]) + 1); memcpy(new, ce, sizeof(struct cache_entry)); new->namelen = strlen(argv[2]); memcpy(new->name, argv[2], new->namelen); if (rename(argv[1], argv[2])) { perror("rename-file: "); exit(1); } remove_file_from_cache(argv[1]); add_cache_entry(new, 1); if (write_cache(newfd, active_cache, active_nr) || rename(".git/index.lock", ".git/index")) die("Unable to write new cachefile"); remove_lock = 0; return 0; } [-- Attachment #2: Digital 
signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 130+ messages in thread
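The decision rule that the patch's add_guid_cache_entry() applies when the
same guid shows up twice can be condensed into a few lines. The following is
a simplified sketch of that rule only, not a replacement for the patch: the
struct and the names classify and CHANGE_* are invented for the example, and
pairing of old and new entries by guid is assumed to have been done by the
caller.

```c
#include <assert.h>
#include <string.h>

enum change { CHANGE_ADD, CHANGE_REMOVE, CHANGE_MODIFY, CHANGE_RENAME };

/* Simplified stand-in for a tree entry carrying a per-file id. */
struct entry {
	unsigned char guid[20];
	const char *path;
};

/*
 * Classify a change given the old and new entries that share one guid.
 * Either side may be NULL when the guid appears in only one tree.
 */
static enum change classify(const struct entry *old, const struct entry *cur)
{
	assert(old || cur);
	if (!old)
		return CHANGE_ADD;
	if (!cur)
		return CHANGE_REMOVE;

	/* Both present: the caller is assumed to have matched them by guid. */
	assert(!memcmp(old->guid, cur->guid, sizeof(old->guid)));

	/* Same id under a different name is a rename, otherwise a modify. */
	return strcmp(old->path, cur->path) ? CHANGE_RENAME : CHANGE_MODIFY;
}
```

Because the id travels with the file, the classification needs no content
comparison at all, which is where the O(n) claim in the message comes from.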
* Re: Merge with git-pasky II.
  2005-04-16 1:44 ` Simon Fowler
@ 2005-04-16 12:19 ` David Lang
  2005-04-16 15:55 ` Simon Fowler
  0 siblings, 1 reply; 130+ messages in thread
From: David Lang @ 2005-04-16 12:19 UTC (permalink / raw)
  To: simon; +Cc: Linus Torvalds, David Woodhouse, Junio C Hamano, Petr Baudis, git

On Fri, Apr 15, 2005 at 08:32:46AM -0700, Linus Torvalds wrote:
> In other words, I'm right. I'm always right, but sometimes I'm more right
> than other times. And dammit, when I say "files don't matter", I'm really
> really Right(tm).
>
You're right, of course (All Hail Linus!), if you can make it work
efficiently enough.

Just to put something else on the table, here's how I'd go about
tracking renames and the like, in another world where Linus /does/
make the odd mistake - it's basically a unique id for files in the
repository, added when the file is first recognised and updated when
update-cache adds a new version to the cache. Renames copy the id
across to the new name, and add it into the cache.

This gives you an O(n) way to tell what file was what across
renames, and it might even be useful in Linus' world, or if someone
wanted to build a traditional SCM on top of a git-a-like.

Attached is a patch, and a rename-file.c to use it.

Simon

given that you have multiple machines creating files, how do you deal with
the idea of the same 'unique id' being assigned to different files by
different machines?

David Lang

-- 
There are two ways of constructing a software design. One way is to make it
so simple that there are obviously no deficiencies. And the other way is to
make it so complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-16 12:19 ` David Lang
@ 2005-04-16 15:55 ` Simon Fowler
  2005-04-16 16:03 ` Petr Baudis
  0 siblings, 1 reply; 130+ messages in thread
From: Simon Fowler @ 2005-04-16 15:55 UTC (permalink / raw)
  To: David Lang; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 797 bytes --]

On Sat, Apr 16, 2005 at 05:19:24AM -0700, David Lang wrote:
> Simon
>
> given that you have multiple machines creating files, how do you deal with
> the idea of the same 'unique id' being assigned to different files by
> different machines?
>
The id is a sha1 hash of the current time and the full path of the
file being added - the chances of that being replicated without
malicious intent is extremely small. There are other things that
could be used, like the hostname, username of the person running the
program, etc, but I don't really see them being necessary.

Simon

-- 
PGP public key Id 0x144A991C, or http://himi.org/stuff/himi.asc
(crappy) Homepage: http://himi.org
doe #237 (see http://www.lemuria.org/DeCSS)
My DeCSS mirror: ftp://himi.org/pub/mirrors/css/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 130+ messages in thread
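To make the scheme under discussion concrete: the id is a hash of the
string "<time><path>", as in the patch's new_guid(). The sketch below
assumes a one-second timestamp (the patch uses time()) and swaps in FNV-1a
for SHA-1 to stay self-contained; fnv1a and make_file_id are hypothetical
names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in 64-bit FNV-1a hash; the patch itself calls SHA1(). */
static uint64_t fnv1a(const char *s, size_t len)
{
	uint64_t h = 1469598103934665603ULL;
	for (size_t i = 0; i < len; i++) {
		h ^= (unsigned char)s[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Hypothetical equivalent of new_guid(): hash of "<time><path>". */
static uint64_t make_file_id(long now, const char *path)
{
	char buf[4096 + 32];
	int n = snprintf(buf, sizeof(buf), "%ld%s", now, path);

	/* Refuse silently truncated input instead of hashing a prefix. */
	assert(n >= 0 && (size_t)n < sizeof(buf));
	return fnv1a(buf, (size_t)n);
}
```

The id is a pure function of the timestamp and the path, so two machines
that add the same path within the same second get the same id, while the
same path added a second later gets a different one. How likely that is in
practice is exactly what the thread is arguing about.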
* Re: Re: Merge with git-pasky II.
  2005-04-16 15:55 ` Simon Fowler
@ 2005-04-16 16:03 ` Petr Baudis
  2005-04-16 16:26 ` Simon Fowler
  2005-04-16 16:26 ` Linus Torvalds
  0 siblings, 2 replies; 130+ messages in thread
From: Petr Baudis @ 2005-04-16 16:03 UTC (permalink / raw)
  To: Simon Fowler; +Cc: David Lang, git

Dear diary, on Sat, Apr 16, 2005 at 05:55:37PM CEST, I got a letter
where Simon Fowler <simon@himi.org> told me that...
> On Sat, Apr 16, 2005 at 05:19:24AM -0700, David Lang wrote:
> > Simon
> >
> > given that you have multiple machines creating files, how do you deal with
> > the idea of the same 'unique id' being assigned to different files by
> > different machines?
> >
> The id is a sha1 hash of the current time and the full path of the
> file being added - the chances of that being replicated without
> malicious intent is extremely small. There are other things that
> could be used, like the hostname, username of the person running the
> program, etc, but I don't really see them being necessary.

Why not just use UUID?

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-16 16:03 ` Petr Baudis
@ 2005-04-16 16:26 ` Simon Fowler
  2005-04-16 16:26 ` Linus Torvalds
  1 sibling, 0 replies; 130+ messages in thread
From: Simon Fowler @ 2005-04-16 16:26 UTC (permalink / raw)
  To: Petr Baudis; +Cc: David Lang, git

[-- Attachment #1: Type: text/plain, Size: 1411 bytes --]

On Sat, Apr 16, 2005 at 06:03:33PM +0200, Petr Baudis wrote:
> Dear diary, on Sat, Apr 16, 2005 at 05:55:37PM CEST, I got a letter
> where Simon Fowler <simon@himi.org> told me that...
> > On Sat, Apr 16, 2005 at 05:19:24AM -0700, David Lang wrote:
> > > Simon
> > >
> > > given that you have multiple machines creating files, how do you deal with
> > > the idea of the same 'unique id' being assigned to different files by
> > > different machines?
> > >
> > The id is a sha1 hash of the current time and the full path of the
> > file being added - the chances of that being replicated without
> > malicious intent is extremely small. There are other things that
> > could be used, like the hostname, username of the person running the
> > program, etc, but I don't really see them being necessary.
>
> Why not just use UUID?
>
Hey, everything else in git seems to use sha1, so I just copied
Linus' sha1 code ;-)

All I wanted was something that had a good chance of being unique
across any potential set of distributed repositories, to avoid the
chance of accidental clashes. A sha1 hash of something that's not
likely to be replicated is a simple way to do that.

Simon

-- 
PGP public key Id 0x144A991C, or http://himi.org/stuff/himi.asc
(crappy) Homepage: http://himi.org
doe #237 (see http://www.lemuria.org/DeCSS)
My DeCSS mirror: ftp://himi.org/pub/mirrors/css/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-16 16:03 ` Petr Baudis
  2005-04-16 16:26 ` Simon Fowler
@ 2005-04-16 16:26 ` Linus Torvalds
  2005-04-16 23:02 ` David Lang
  2005-04-17 14:52 ` Ingo Molnar
  1 sibling, 2 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-16 16:26 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Simon Fowler, David Lang, git

On Sat, 16 Apr 2005, Petr Baudis wrote:
> Dear diary, on Sat, Apr 16, 2005 at 05:55:37PM CEST, I got a letter
> where Simon Fowler <simon@himi.org> told me that...
>
> > The id is a sha1 hash of the current time and the full path of the
> > file being added - the chances of that being replicated without
> > malicious intent is extremely small. There are other things that
> > could be used, like the hostname, username of the person running the
> > program, etc, but I don't really see them being necessary.
>
> Why not just use UUID?

Note that using anything that isn't data-related totally destroys the
whole point of the object database. Remember: any time we don't uniquely
generate the same name for the same object, we'll waste disk-space.

So adding in user/machine/uuid's to the thing is always a mistake. The
whole thing depends on the hash being as close to 1:1 with the contents as
humanly possible.

There's also the issue of size. Yes, I could have chosen sha256 instead of
sha1. But the keys would be almost twice as big, which in turn means that
the "tree" objects would be bigger, and that the "index" file would be
bigger.

Is that a huge problem? No. We can certainly move to it if sha1 ever shows
itself to be weak. But I really think we are much better off just
re-generating the whole tree and history at that point, rather than try to
predict the future.

The fact is, with current knowledge, sha1 _is_ safe for what git uses it
for, for the foreseeable future. And we have a migration strategy if I'm
wrong. Don't worry about it.

Almost all attacks on sha1 will depend on _replacing_ a file with a bogus
new one. So guys, instead of using sha256 or going overboard, just make
sure that when you synchronize, you NEVER import a file you already have.

It's really that simple. Add "--ignore-existing" to your rsync scripts,
and you're pretty much done. That guarantees that a new evil blob by the
next mad scientist out to take over the world will never touch your
repository, and if we make this part of the _standard_ scripts, then
dammit, security is in good _practices_ rather than just relying blindly
on the hash being secure.

In other words, I think we could have used md5's as the hash, if we just
make sure we have good practices. And it wouldn't have been "insecure".

The fact is, you don't merge with people you don't trust. If you don't
trust them, they have a much easier time corrupting your repository by
just creating bugs in the code and checking that thing in. Who cares about
hash collisions, when you can generate a kernel root vulnerability by just
adding a single line of code and use the _correct_ hash for it.

So the sha1 hash does not replace _trust_. That comes from something else
altogether.

		Linus
^ permalink raw reply [flat|nested] 130+ messages in thread
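A minimal sketch of the "never import a file you already have" rule from
the message above, written in Python rather than as an rsync invocation.
The function name and the flat directory of object files are assumptions
made for illustration; this is not code from git or git-pasky.

```python
import os
import shutil


def import_missing_objects(src_dir, dst_dir):
    """Copy files from src_dir into dst_dir, skipping names that already exist.

    Returns the sorted list of names that were actually copied.
    """
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        if not os.path.isfile(src):
            continue  # this sketch assumes a flat directory of object files
        if os.path.exists(dst):
            continue  # never overwrite an object we already have
        shutil.copyfile(src, dst)
        copied.append(name)
    return copied
```

With this policy a remote object whose name matches a local one is simply
ignored, so a forged replacement never reaches the local repository. As
the follow-up messages point out, it does not detect a collision; it only
refuses to act on one.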
* Re: Re: Merge with git-pasky II.
  2005-04-16 16:26 ` Linus Torvalds
@ 2005-04-16 23:02 ` David Lang
  2005-04-17 14:52 ` Ingo Molnar
  1 sibling, 0 replies; 130+ messages in thread
From: David Lang @ 2005-04-16 23:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, Simon Fowler, git

On Sat, 16 Apr 2005, Linus Torvalds wrote:

> Almost all attacks on sha1 will depend on _replacing_ a file with a bogus
> new one. So guys, instead of using sha256 or going overboard, just make
> sure that when you synchronize, you NEVER import a file you already have.
>
> It's really that simple. Add "--ignore-existing" to your rsync scripts,
> and you're pretty much done. That guarantees that a new evil blob by the
> next mad scientist out to take over the world will never touch your
> repository, and if we make this part of the _standard_ scripts, then
> dammit, security is in good _practices_ rather than just relying blindly
> on the hash being secure.
>
> In other words, I think we could have used md5's as the hash, if we just
> make sure we have good practices. And it wouldn't have been "insecure".
>
> The fact is, you don't merge with people you don't trust. If you don't
> trust them, they have a much easier time corrupting your repository by
> just creating bugs in the code and checking that thing in. Who cares about
> hash collisions, when you can generate a kernel root vulnerability by just
> adding a single line of code and use the _correct_ hash for it.
>
> So the sha1 hash does not replace _trust_. That comes from something else
> altogether.

What I am bringing up is not intended to be a trust thing, but instead a
safety thing: accidents, not evil intent.

Making the rsync scripts use --ignore-existing will avoid corrupting local
data when pulling remotely, but it won't solve the problem of running into
a collision locally (and won't do much to help you figure out what's wrong
when you run into a remote collision).

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-16 16:26 ` Linus Torvalds
  2005-04-16 23:02 ` David Lang
@ 2005-04-17 14:52 ` Ingo Molnar
  2005-04-17 15:08 ` Brad Roberts
  2005-04-17 15:28 ` Ingo Molnar
  1 sibling, 2 replies; 130+ messages in thread
From: Ingo Molnar @ 2005-04-17 14:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, Simon Fowler, David Lang, git

* Linus Torvalds <torvalds@osdl.org> wrote:

> Almost all attacks on sha1 will depend on _replacing_ a file with a
> bogus new one. So guys, instead of using sha256 or going overboard,
> just make sure that when you synchronize, you NEVER import a file you
> already have.

here is a somewhat complex, but still practical attack that doesn't rely
on replacement and which can only be detected if we check the sha1
uniqueness assumptions.

Suppose an attacker can generate a duplicate sha1 key for an arbitrary
'target' file, and Malice sends you a GIT-generated patch that introduces
a new file (which doesn't exist in the current tree), which you review (in
the email) and which looks harmless and safe to apply. Maybe the patch
has a bit weird formatting and some weird comments (which in reality
Malice used to generate the proper sha1 key) but otherwise the patch is
for some seldom used arcane driver that no-one used for quite some time
and no-one really cares about, so you are happy to apply the patch.

The compromise occurs when you apply the patch: the seemingly harmless
patch has an sha1 key that Malice manufactured to match that of an already
existing, 'dangerous' object in your database.

With tens of thousands (or hundreds of thousands) of objects expected in
the repository sooner or later, there's quite a selection to pick from.
Once you apply the patch, instead of the expected new file that you
reviewed and found safe, the attacker has the other object included in
the official kernel.

A dangerous object can be anything: e.g. a debugging hack that allows
arbitrary kernel-space writes. Or a known-insecure module (which since
then got fixed, but the buggy code still exists in the DB). The module
is in a single file and is self-installing (e.g. it has __init code to
register itself as some driver.)

Malice might even previously plant a dangerous object as some 'firmware
module' in another arcane driver, which doesn't get compiled by default,
but still shows up in the DB. Or Malice might plant a dangerous object
via an innocent-looking documentation file. (which contains some sample
code and is called sample.txt)

this type of 'false sharing attack' can only be prevented if an object
is only 'shared' with another object if it has been memcmp-ed with the
object in the repository. I.e. if we trust the sharing decision!

Once the attack has occurred it cannot be detected automatically: only
people will notice it. (why did that weird unrelated module show up in
that old driver?)

The compromise relies on you having reviewed something harmless, while
in reality what happened within the DB was far less harmless. And the DB
remains self-consistent: neither fsck, nor others importing your tree
will be able to detect the compromise. This attack can only be detected
when you apply the patch, after that point all the information (except
Malice's message in your inbox) is gone.

so unless we actively check for collisions, once an sha1 key can be
generated at will on near-arbitrary input, it's not a secure system
anymore. We might be lucky and safe, but we won't be secure.

	Ingo
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-17 14:52 ` Ingo Molnar
@ 2005-04-17 15:08 ` Brad Roberts
  2005-04-17 15:18 ` Ingo Molnar
  2005-04-17 15:28 ` Ingo Molnar
  1 sibling, 1 reply; 130+ messages in thread
From: Brad Roberts @ 2005-04-17 15:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Petr Baudis, Simon Fowler, David Lang, git

On Sun, 17 Apr 2005, Ingo Molnar wrote:

> Date: Sun, 17 Apr 2005 16:52:32 +0200
> From: Ingo Molnar <mingo@elte.hu>
> To: Linus Torvalds <torvalds@osdl.org>
> Cc: Petr Baudis <pasky@ucw.cz>, Simon Fowler <simon@himi.org>,
>     David Lang <david.lang@digitalinsight.com>, git@vger.kernel.org
> Subject: Re: Re: Merge with git-pasky II.
>
>
> * Linus Torvalds <torvalds@osdl.org> wrote:
>
> > Almost all attacks on sha1 will depend on _replacing_ a file with a
> > bogus new one. So guys, instead of using sha256 or going overboard,
> > just make sure that when you synchronize, you NEVER import a file you
> > already have.
>
> With tens of thousands (or hundreds of thousands) of objects expected in
> the repository sooner or later, there's quite a selection to pick from.
> Once you apply the patch, instead of the expected new file that you
> reviewed and found safe, the attacker has the other object included in
> the official kernel.
>
> A dangerous object can be anything: e.g. a debugging hack that allows
> arbitrary kernel-space writes. Or a known-insecure module (which since
> then got fixed, but the buggy code still exists in the DB). The module
> is in a single file and is self-installing (e.g. it has __init code to
> register itself as some driver.)

While I agree that a hash collision is bad and certainly worth preventing
during new object creation, for it to actually implant a trojan in a build
successfully it'd have to meet even more criteria than you've laid out.
It'd have to...

  - be shadowing an object that's part of an active tree
  - provide all the public symbols the shadowed object provided so that it
    would still build and link successfully

Shadowing an object that's not part of the working tree (that is, something
on another branch or obsoleted some time in the past) is still db
corruption, but not nearly as big an issue from a trojan standpoint.

Later,
Brad
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-17 15:08 ` Brad Roberts
@ 2005-04-17 15:18 ` Ingo Molnar
  0 siblings, 0 replies; 130+ messages in thread
From: Ingo Molnar @ 2005-04-17 15:18 UTC (permalink / raw)
  To: Brad Roberts; +Cc: Linus Torvalds, Petr Baudis, Simon Fowler, David Lang, git

* Brad Roberts <braddr@puremagic.com> wrote:

> While I agree that a hash collision is bad and certainly worth
> preventing during new object creation, for it to actually implant a
> trojan in a build successfully it'd have to meet even more criteria
> than you've laid out. It'd have to...

> - provide all the public symbols the shadowed object provided so that it
> would still build and link successfully

that's not a problem. Most modules don't provide public symbols.
Especially not drivers. Generally it's the modules that _don't_ have any
global impact that get reviewed less stringently - an attacker would
thus choose them for psychological reasons anyway.

> - be shadowing an object that's part of an active tree
>
> Shadowing an object that's not part of the working tree means
> something on another branch or obsoleted some time in the past is
> still db corruption, but not nearly as big an issue from a trojan
> standpoint.

it's not DB corruption, it's a feature of GIT: it's a content _cache_,
new and old alike. Nothing in GIT says that old objects in the
repository (which are still very much part of history) cannot be revived
in newer trees. (in fact it regularly happens - e.g. if a fix is undone
manually.)

	Ingo
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-17 14:52 ` Ingo Molnar
  2005-04-17 15:08 ` Brad Roberts
@ 2005-04-17 15:28 ` Ingo Molnar
  2005-04-17 17:34 ` Linus Torvalds
  1 sibling, 1 reply; 130+ messages in thread
From: Ingo Molnar @ 2005-04-17 15:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, Simon Fowler, David Lang, git

* Ingo Molnar <mingo@elte.hu> wrote:

> The compromise relies on you having reviewed something harmless, while
> in reality what happened within the DB was far less harmless. And the
> DB remains self-consistent: neither fsck, nor others importing your
> tree will be able to detect the compromise. This attack can only be
> detected when you apply the patch, after that point all the
> information (except Malice's message in your inbox) is gone.

in fact, this attack cannot even be proven to be malicious, purely via
the email from Malice: it could be incredibly bad luck that caused that
good-looking patch to be mistakenly matching a dangerous object.

In fact this could happen even today, _accidentally_. (but i'm willing
to bet that hell will be freezing over first, and i'll have some really
good odds ;) There's probably a much higher likelihood of Linus' tree
getting corrupted in some old fashioned way and introducing a security
hole by accident)

	Ingo
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-17 15:28 ` Ingo Molnar
@ 2005-04-17 17:34 ` Linus Torvalds
  2005-04-17 22:12 ` Herbert Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-17 17:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Petr Baudis, Simon Fowler, David Lang, git

On Sun, 17 Apr 2005, Ingo Molnar wrote:
>
> in fact, this attack cannot even be proven to be malicious, purely via
> the email from Malice: it could be incredible bad luck that caused that
> good-looking patch to be mistakenly matching a dangerous object.

I really hate theoretical discussions.

The fact is, a lot of _crap_ engineering gets done because of the question
"what if?". It results in over-engineering, often to the point where the
end result is quite a lot measurably worse than the sane results.

You are _literally_ arguing for the equivalent of "what if a meteorite hit
my plane while it was in flight - maybe I should add three inches of
high-tension armored steel around the plane, so that my passengers would
be protected".

That's not engineering. That's five-year-olds discussing building their
imaginary forts ("I want gun-turrets and a mechanical horse one mile high,
and my command center is 5 miles under-ground and totally encased in 5
meters of lead").

I absolutely _hate_ doing engineering on the principle of "this might be
possible in theory", and I'm violently opposed to it. So far, I have not
heard a single argument that I consider even _remotely_ likely.

The thing is, even if you can force a hash collision by sending somebody
a patch, it's really pretty much almost guaranteed that the patch is not
just "a few strange characters", unless sha1 is really broken to the point
where it's not cryptographically secure _at_all_.

In other words, unless somebody finds a way to make sha1 appear as nothing
more than a complicated set of parity bits, all brute-force "get the same
sha1" is likely to be about generating a really strange blob based on the
thing you want to replace - and by "really strange" I mean total binary
crap. And likely _much_ bigger too. And by "much bigger" I mean "possibly
gigabytes of data".

And the thing is, _if_ somebody finds a way to make sha1 act as just a
complex parity bit, and comes up with generating a clashing object that
actually makes sense, then going to sha256 is likely pointless too - I
think the algorithm is basically the same, just with more bits. If you've
broken sha1 to the point where it's _that_ breakable, then you've likely
broken sha256 too. Nobody has ever proven that you couldn't break sha256
with some really clever algorithm...

So if you start playing "what if?" games, dammit, I can play mine.

If we want to have any kind of confidence that the hash is really
unbreakable, we should make it not just longer than 160 bits, we should
make sure that it's two or more hashes, and that they are based on totally
different principles.

And we should all digitally sign every single object too, and we should
use 4096-bit PGP keys and unguessable passphrases that are at least 20
words in length. And we should then build a bunker 5 miles underground,
encased in lead, so that somebody cannot flip a few bits with a ray-gun,
and make us believe that the sha1's match when they don't. Oh, and we need
to all wear aluminum propeller beanies to make sure that they don't use
that ray-gun to make us do the modification _ourselves_.

And the thing is, that's just crazy talk. The difference between a crazy
person and an intelligent one is that the crazy one doesn't realize what
makes sense in the world. The goal of good engineering is not to ask "what
if?", but to ask "how do I make this work as well as possible".

So please stop with the theoretical sha1 attacks. It is simply NOT TRUE
that you can generate an object that looks halfway sane and still gets you
the sha1 you want. Even the "breakage" doesn't actually do that. And if it
ever _does_ become true, it will quite possibly be thanks to some
technology that breaks other hashes too.

So until proven otherwise, I worry about accidental hashes, and in 160
bits of good hashing, that just isn't an issue either.

Anybody who compares a 128-bit md5-sum to a 160-bit sha1 doesn't
understand the math. It didn't get "slightly less likely" to happen. It
got so _unbelievably_ less likely to happen that it's not even funny.

		Linus
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-17 17:34 ` Linus Torvalds
@ 2005-04-17 22:12 ` Herbert Xu
  2005-04-17 22:35 ` Linus Torvalds
  2005-04-18 4:16 ` Sanjoy Mahajan
  2005-04-18 7:42 ` Ingo Molnar
  2 siblings, 1 reply; 130+ messages in thread
From: Herbert Xu @ 2005-04-17 22:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mingo, pasky, simon, david.lang, git

Linus Torvalds <torvalds@osdl.org> wrote:
>
> If we want to have any kind of confidence that the hash is really
> unbreakable, we should make it not just longer than 160 bits, we should
> make sure that it's two or more hashes, and that they are based on totally
> different principles.

Sorry, it has already been shown that combining two different hashes
doesn't necessarily provide the security that you would hope.

I think what hasn't been discussed here is the cost of actually doing
the comparisons. In other words, what is the minimum number of
comparisons we can get away with and still deal with hash collisions
successfully?

Once we know what the cost is then we can decide whether it's worthwhile
considering the odds involved.
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-17 22:12 ` Herbert Xu
@ 2005-04-17 22:35 ` Linus Torvalds
  2005-04-17 23:29 ` Herbert Xu
  0 siblings, 1 reply; 130+ messages in thread
From: Linus Torvalds @ 2005-04-17 22:35 UTC (permalink / raw)
  To: Herbert Xu; +Cc: mingo, pasky, simon, david.lang, git

On Mon, 18 Apr 2005, Herbert Xu wrote:
>
> Sorry, it has already been shown that combining two different hashes
> doesn't necessarily provide the security that you would hope.

Sorry, that's not true.

Quite the reverse. Again, you bring up totally theoretical arguments. In
_practice_ it has indeed been shown that using two hashes _does_ catch
hash collisions.

The trivial example is using md5 sums with a length. The "length" is a
really bad "hash" of the file contents too. And the fact is, that simple
combination of hashes has proven to be more resistant to attack than the
hash itself. It clearly _does_ make a difference in practice.

So _please_, can we drop the obviously bogus "in theory" arguments. They
do not matter. What matters is practice.

And the fact is, in _theory_ we don't know if somebody may be trivially
able to break any particular hash. But in practice we do know that it's
less likely that you can break a combination of two totally unrelated
hashes than that you can break one particular one.

NOTE! I'm not actually arguing that we should do that. I'm actually
arguing totally the reverse: I'm arguing that there is a fine line between
being "very very careful" and being "crazy to the point of being
incompetent".

		Linus
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-17 22:35 ` Linus Torvalds
@ 2005-04-17 23:29 ` Herbert Xu
  2005-04-17 23:34 ` Petr Baudis
  2005-04-17 23:50 ` Linus Torvalds
  0 siblings, 2 replies; 130+ messages in thread
From: Herbert Xu @ 2005-04-17 23:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mingo, pasky, simon, david.lang, git

On Sun, Apr 17, 2005 at 03:35:17PM -0700, Linus Torvalds wrote:
>
> Quite the reverse. Again, you bring up totally theoretical arguments. In
> _practice_ it has indeed been shown that using two hashes _does_ catch
> hash collisions.
>
> The trivial example is using md5 sums with a length. The "length" is a
> really bad "hash" of the file contents too. And the fact is, that simple
> combination of hashes has proven to be more resistant to attack than the
> hash itself. It clearly _does_ make a difference in practice.

I wasn't disputing that of course. However, the same effect can be
achieved by using a single hash with a bigger length, e.g., sha256
or sha512.

> So _please_, can we drop the obviously bogus "in theory" arguments. They
> do not matter. What matters is practice.

I agree. However, what is the actual cost in practice of detecting
collisions?

I get the feeling that it isn't that bad. For example, if we did it
at the points where the blobs actually entered the tree, then the cost
is always proportional to the change size (the number of new blobs).

Is this really that bad considering that the average blob isn't very
big?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-17 23:29 ` Herbert Xu
@ 2005-04-17 23:34 ` Petr Baudis
  2005-04-17 23:53 ` Kenneth Johansson
  2005-04-18 0:49 ` Herbert Xu
  2005-04-17 23:50 ` Linus Torvalds
  1 sibling, 2 replies; 130+ messages in thread
From: Petr Baudis @ 2005-04-17 23:34 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Linus Torvalds, mingo, simon, david.lang, git

Dear diary, on Mon, Apr 18, 2005 at 01:29:05AM CEST, I got a letter
where Herbert Xu <herbert@gondor.apana.org.au> told me that...
> I get the feeling that it isn't that bad. For example, if we did it
> at the points where the blobs actually entered the tree, then the cost
> is always proportional to the change size (the number of new blobs).

No. The collision check is done in the opposite case - when you want to
write a blob and there is already a file of the same hash in the tree.
So either the blob is already in the database, or you have a collision.

Therefore, the cost is proportional to the size of what stays unchanged.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-17 23:34 ` Petr Baudis
@ 2005-04-17 23:53 ` Kenneth Johansson
  2005-04-18 0:49 ` Herbert Xu
  1 sibling, 0 replies; 130+ messages in thread
From: Kenneth Johansson @ 2005-04-17 23:53 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, mingo, simon, david.lang, git

Petr Baudis wrote:
> Dear diary, on Mon, Apr 18, 2005 at 01:29:05AM CEST, I got a letter
> where Herbert Xu <herbert@gondor.apana.org.au> told me that...
>
>>I get the feeling that it isn't that bad. For example, if we did it
>>at the points where the blobs actually entered the tree, then the cost
>>is always proportional to the change size (the number of new blobs).
>
>
> No. The collision check is done in the opposite case - when you want to
> write a blob and there is already a file of the same hash in the tree.
> So either the blob is already in the database, or you have a collision.
>
> Therefore, the cost is proportional to the size of what stays unchanged.
>

?? now I'm confused. Surely the only cost involved is to never write over
a file that already exists in the cache, and that is already done NOW as
far as I read the code. So there is NO extra cost in detecting a
collision.
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-17 23:34 ` Petr Baudis
  2005-04-17 23:53 ` Kenneth Johansson
@ 2005-04-18 0:49 ` Herbert Xu
  2005-04-18 0:55 ` Petr Baudis
  1 sibling, 1 reply; 130+ messages in thread
From: Herbert Xu @ 2005-04-18 0:49 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, mingo, simon, david.lang, git

On Mon, Apr 18, 2005 at 01:34:41AM +0200, Petr Baudis wrote:
>
> No. The collision check is done in the opposite case - when you want to
> write a blob and there is already a file of the same hash in the tree.
> So either the blob is already in the database, or you have a collision.
> Therefore, the cost is proportional to the size of what stays unchanged.

This is only true if we're calling update-cache on all unchanged files.
If that's what git is doing then we're in trouble anyway.

Remember that prior to the collision check we've already spent the
effort in

1) Compressing the file.
2) Computing a SHA1 hash on the result.

These two steps together (especially the first one) are much more
expensive than a file content comparison of the blob versus what's
already in the tree.

Somehow I have a hard time seeing how this can be at all efficient if
we're compressing all checked out files including those which are
unchanged.

Therefore the only conclusion I can draw is that we're only calling
update-cache on the set of changed files, or at most a small superset
of them. In that case, the cost of the collision check *is* proportional
to the size of the change.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-18 0:49 ` Herbert Xu
@ 2005-04-18 0:55 ` Petr Baudis
  0 siblings, 0 replies; 130+ messages in thread
From: Petr Baudis @ 2005-04-18 0:55 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Linus Torvalds, mingo, simon, david.lang, git

Dear diary, on Mon, Apr 18, 2005 at 02:49:06AM CEST, I got a letter
where Herbert Xu <herbert@gondor.apana.org.au> told me that...
> Therefore the only conclusion I can draw is that we're only calling
> update-cache on the set of changed files, or at most a small superset
> of them. In that case, the cost of the collision check *is* proportional
> to the size of the change.

Yes, of course, sorry for the confusion.

We only consider files you either specify manually or which have their
stat metadata changed relative to the directory cache. (That is from the
git-pasky perspective; from the plumbing perspective, the user just does
update-cache on whatever he picks.)

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
^ permalink raw reply [flat|nested] 130+ messages in thread
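The write-time collision check discussed in the last few messages can be
sketched as follows. This is an illustration under stated assumptions,
not git's code: the function and exception names are hypothetical, the
object directory is flat, and the name is the SHA-1 of the raw bytes,
whereas the thread describes hashing the compressed result.

```python
import hashlib
import os


class CollisionError(Exception):
    """Raised when two different contents map to the same object name."""


def write_object(objects_dir, data):
    """Store data under its SHA-1 name; compare contents if the name exists."""
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(objects_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            if f.read() != data:  # the memcmp-style check from the thread
                raise CollisionError(name)
        return name  # identical content already stored: nothing to write
    with open(path, "wb") as f:
        f.write(data)
    return name
```

The comparison runs only for objects that are actually being written, so
its cost follows the set of files handed to the write path, which matches
the conclusion reached above that the check is proportional to the size
of the change.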
* Re: Merge with git-pasky II.
  2005-04-17 23:29 ` Herbert Xu
  2005-04-17 23:34 ` Petr Baudis
@ 2005-04-17 23:50 ` Linus Torvalds
  1 sibling, 0 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-17 23:50 UTC (permalink / raw)
  To: Herbert Xu; +Cc: mingo, pasky, simon, david.lang, git

On Mon, 18 Apr 2005, Herbert Xu wrote:
>
> I wasn't disputing that of course. However, the same effect can be
> achieved in using a single hash with a bigger length, e.g., sha256
> or sha512.

No it cannot.

If somebody actually literally totally breaks that hash, length won't
matter. There are (bad) hashes where you can literally edit the content of
the file, and make sure that the end result has the same hash.

In that case, when the hash algorithm has actually been broken, the length
of the hash ends up being not very relevant.

For example, you might "hash" your file by blocking it up in 16-byte
blocks, and xoring all blocks together - the result is a 16-byte hash.
It's a terrible hash, and obviously trivially breakable, and once broken
it does _not_ help to make it use its 32-byte cousin. Not at all. You can
just modify the breaking thing to equally cheaply make modifications to a
file and get the 32-byte hash "right" again.

Is that kind of breakage likely for sha1? Hell no. Is it possible? In your
"in theory" world where practice doesn't matter, yes.

		Linus
^ permalink raw reply [flat|nested] 130+ messages in thread
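The xor-of-blocks "hash" described above is easy to write down, and so
is the forgery. The sketch below is only an illustration of that toy
hash; the function names and the zero-padding of the last block are
assumptions, and nothing here applies to sha1.

```python
def xor_hash(data, width=16):
    """Toy hash from the message above: xor all width-byte blocks together."""
    digest = bytearray(width)
    padded = data + b"\x00" * (-len(data) % width)  # zero-pad the last block
    for i in range(0, len(padded), width):
        for j in range(width):
            digest[j] ^= padded[i + j]
    return bytes(digest)


def forge(target, evil, width=16):
    """Append one block to `evil` so that it hashes like `target`."""
    padded = evil + b"\x00" * (-len(evil) % width)
    # xor of the two digests is exactly the block that cancels the difference
    fix = bytes(a ^ b for a, b in zip(xor_hash(padded, width),
                                      xor_hash(target, width)))
    return padded + fix
```

Calling forge(target, evil, width=32) works the same way and costs one
extra 32-byte block instead of a 16-byte one, which is the point being
made: once the construction is broken, widening the digest does not help.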
* Re: Merge with git-pasky II.
  2005-04-17 17:34 ` Linus Torvalds
  2005-04-17 22:12 ` Herbert Xu
@ 2005-04-18 4:16 ` Sanjoy Mahajan
  2005-04-18 7:42 ` Ingo Molnar
  2 siblings, 0 replies; 130+ messages in thread
From: Sanjoy Mahajan @ 2005-04-18 4:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, Petr Baudis, Simon Fowler, David Lang, git

> So until proven otherwise, I worry about accidental hashes, and in
> 160 bits of good hashing, that just isn't an issue either...[Going
> from 128 bits to 160 bits made it] so _unbelievably_ less likely to
> happen that it's not even funny.

You are right. Here's how I learnt to stop worrying and love the 160
bits.

A 160-bit hash requires 2^80=10^24 files before the collision
probability is roughly 0.5 (actually 1-e^{-1/2}). Now be very
conservative: Instead of tolerating a 0.5 probability, worry about
even a 10^-8 probability of a collision anywhere, anytime. The magic
number of files for that probability is 10^20 (roughly 10^40 pairs for
2^160=10^48 boxes).

Given 10 billion people using git, each producing 1 source file per
second -- busy beavers all -- they would need 300 years to produce
10^20 files. And to reach the 10^-8 collision probability, all 10^20
files must belong to the same project, and even OpenOffice will not be
that bloated.

-Sanjoy
^ permalink raw reply [flat|nested] 130+ messages in thread
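The estimates above can be reproduced with the usual birthday
approximation, p ~= 1 - exp(-n(n-1) / (2 * 2^bits)). The sketch below
assumes an ideal, uniformly distributed hash; the function names are
hypothetical.

```python
import math


def collision_probability(n_files, hash_bits=160):
    """Birthday approximation for at least one collision among n_files."""
    pairs = n_files * (n_files - 1) / 2
    return -math.expm1(-pairs / 2.0 ** hash_bits)


def files_produced(people, years, files_per_second=1):
    """Total files written at a constant rate (365.25-day years)."""
    return people * files_per_second * years * 365.25 * 86400
```

With these definitions, collision_probability(2**80) comes out near
1 - e^(-1/2), about 0.39; collision_probability(10**20) is a few times
10^-9, the same order as the 10^-8 quoted above; and
files_produced(10**10, 300) is a little under 10^20.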
* Re: Re: Merge with git-pasky II.
  2005-04-17 17:34 ` Linus Torvalds
  2005-04-17 22:12 ` Herbert Xu
  2005-04-18 4:16 ` Sanjoy Mahajan
@ 2005-04-18 7:42 ` Ingo Molnar
  2 siblings, 0 replies; 130+ messages in thread
From: Ingo Molnar @ 2005-04-18 7:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, Simon Fowler, David Lang, git

* Linus Torvalds <torvalds@osdl.org> wrote:

> On Sun, 17 Apr 2005, Ingo Molnar wrote:
> >
> > in fact, this attack cannot even be proven to be malicious, purely via
> > the email from Malice: it could be incredible bad luck that caused that
> > good-looking patch to be mistakenly matching a dangerous object.
>
> I really hate theoretical discussions.

i was only replying to your earlier point:

> > > Almost all attacks on sha1 will depend on _replacing_ a file with
> > > a bogus new one. So guys, instead of using sha256 or going
> > > overboard, just make sure that when you synchronize, you NEVER
> > > import a file you already have.

which point i still believe is subtly wrong.

You were suggesting to concentrate on file replacement to counter most
of the practical attacks, while i pointed out an attack _using the same
basic mechanism that your point above supposed_.

[ if you can replace a file with a known hash, with a bogus new one, and
you still have enough control over the contents of your bogus new file
that it is 1) a valid file that builds 2) compromises the kernel, then
you likely have the same amount of control my 'theoretical' attack
requires. ]

> And the thing is, _if_ somebody finds a way to make sha1 act as just a
> complex parity bit, and comes up with generating a clashing object
> that actually makes sense, then going to sha256 is likely pointless
> too [...]

yes, that's why i suggested to not actually trust the hash to be
cryptographically secure, but to just assume it's a good generic hash we
can design a DB around, and to turn -DCOLLISION_CHECK on and enforce
consistency rules on boundaries.

[ it's not bad to keep sha1 because even my suggested enhancement still
leaves 'content-less trust-pointers to untrusted content via email'
vectors open against attack (maintainer sends you an email that commit X
in Malice's repository Y is fine to pull, and you pull it blindly, while
the attacker has replaced his content with the compromised one
meanwhile), but it at least validates the bulk traffic that goes into
the DB: patches via emails and trusted repositories. ]

so all i was suggesting was to extend your suggested 'overwrite
collision check' to a stricter 'content we throw away and use the sha1
shortcut for needs to be checked against the in-DB content as well'.

in other words, your suggested 'rename check' is checking for 'positive
duplicate content', while my addition would also check for 'negative
duplicate content'.

but as usual, i could be wrong, so don't take this too seriously :-)

	Ingo
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-15 15:32 ` Linus Torvalds
                   ` (2 preceding siblings ...)
  2005-04-16  1:44 ` Simon Fowler
@ 2005-04-16 20:29 ` Sanjoy Mahajan
  2005-04-16 20:41 ` Linus Torvalds
  3 siblings, 1 reply; 130+ messages in thread
From: Sanjoy Mahajan @ 2005-04-16 20:29 UTC
  To: Linus Torvalds; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git

> And that "where did this come from" decision should be done at _search_
> time, not commit time.

I like this elegant approach, but clever pattern matching can help even
at commit time.  Suppose hello.c is simply:

  printf ("Hello %d\n", year);

And then developer A updates hello.c to:

  printf ("Hello %d\n", year);
  printf ("And %d\n", year+1);

Meanwhile developer B updates hello.c to:

  printf ("Hello %d\n", yyyy);

How to merge these two changes?  The psychic solution is

  printf ("Hello %d\n", yyyy);
  printf ("And %d\n", yyyy+1);

Darcs handles token renames specially, but it's not a general solution
so let's leave it aside.  The example does not have enough information
to make the psychic solution unique or reliable, but imagine that the
example were longer to solve that problem.

You'd want to describe the delta A(hello.c) as

1. duplicated message line
2. changed 2nd line a bit

And B(hello.c) as

1. Changed year to yyyy

In that representation, merging the two deltas becomes

1. duplicated message line
2. changed 2nd line a bit
3. Changed year to yyyy in both lines

Or, by commuting the merge operations and adjusting for their
non-commutativity (in terminology like darcs's -- I'm also a physicist):

1. Changed year to yyyy
2. duplicated message line
3. changed 2nd line a bit

So here some of the computation that Linus wants only at question time
(e.g. 'how did that line get here??') is also useful at merge time.

It's difficult (expensive, unreliable) to describe deltas in the form
above or, worse, to merge two such descriptions, but I hope it
illustrates the point.  And perhaps a robust and easier-to-compute
change-description language can be dreamt up, even if the general
problem of describing changes compactly is not computable -- it's almost
the same, or is the same, as finding the Kolmogorov complexity of a data
set.

Or have I missed a fundamental point?

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenel
* Re: Merge with git-pasky II.
  2005-04-16 20:29 ` Sanjoy Mahajan
@ 2005-04-16 20:41 ` Linus Torvalds
  0 siblings, 0 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-16 20:41 UTC
  To: Sanjoy Mahajan; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git

On Sat, 16 Apr 2005, Sanjoy Mahajan wrote:
>
> I like this elegant approach, but clever pattern matching can help even
> at commit time.  Suppose hello.c is simply:

Here, what you're talking about is not "commit", but "merge".

The git model very much separates the two events. You first generate a
merged tree, and you commit that merge as a separate and largely totally
independent phase.

And yes, I agree that with merging, you do end up potentially wanting to
try different things. When I've done my "git" merges, all I've really
done is to make sure that the trivial parts basically merge in zero
time, so that you can afford to perhaps spend some effort on handling
the _real_ merge conflicts.

Many systems seem to be designed around a "clever merge" algorithm, with
darcs perhaps being the most extreme example. The problem with that
design is that 99.9% of all the work is not at all about being clever,
and if you try to base your design around the clever things, your
performance will definitely suck.

So I think that with git, you can actually really try to be clever,
because when you get a merge conflict, you're now only worrying about
one file out of 17,000, and then you can go wild on that one and try
different merge algorithms (token merge, character-merge, line-based
merge, you name it).

Of course, I might not actually personally want to depend on any clever
merges, but the git infrastructure really doesn't care. My plumbing
doesn't merge the conflicts that arise within one single object, or the
filename differences - you can do anything you want on that.

		Linus
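The "trivial parts merge in zero time" idea can be illustrated with the classic three-way rule applied to object names alone. The following is a minimal sketch, not git's actual code: `trivial_merge` and `enum merge_result` are hypothetical names, and each tree entry is assumed to be represented by a hex object name for the common ancestor and for each side.

```c
#include <string.h>

/* Outcome of merging one path across three trees. */
enum merge_result { TAKE_OURS, TAKE_THEIRS, CONFLICT };

/*
 * Decide how to merge a single tree entry from the object names (hex
 * strings) recorded for it in the common ancestor ("base") and in the
 * two trees being merged.  No file contents are read.
 */
enum merge_result trivial_merge(const char *base, const char *ours,
				const char *theirs)
{
	if (strcmp(ours, theirs) == 0)
		return TAKE_OURS;	/* both sides agree, or neither changed */
	if (strcmp(base, ours) == 0)
		return TAKE_THEIRS;	/* only their side changed */
	if (strcmp(base, theirs) == 0)
		return TAKE_OURS;	/* only our side changed */
	return CONFLICT;		/* both changed differently */
}
```

Only the `CONFLICT` entries would then be handed to a content-level merge (line-based, token-based, or anything else), which is the "one file out of 17,000" case described above.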
* [Patch] ls-tree enhancements
  2005-04-14 18:36 ` Linus Torvalds
  2005-04-14 19:59 ` Junio C Hamano
@ 2005-04-15  2:21 ` Junio C Hamano
  2005-04-15 16:13 ` Petr Baudis
  2005-04-15  9:14 ` Merge with git-pasky II David Woodhouse
  2 siblings, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-15  2:21 UTC
  To: Linus Torvalds; +Cc: Petr Baudis, git

This adds '-r' (recursive) option and '-z' (NUL terminated)
option to ls-tree.  I need it so that the merge-trees (formerly
known as git-merge.perl) script does not need to create any
temporary dircache while merging.  It used to use show-files on
a temporary dircache to get the list of files in the ancestor
tree, and also used the dircache to store the result of its
automerge.  I probably still need it for the latter reason, but
with this patch not for the former reason anymore.

It is relative to bb95843a5a0f397270819462812735ee29796fb4

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 ls-tree.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 90 insertions(+), 18 deletions(-)

--- ,,Linus/ls-tree.c	2005-04-14 19:08:17.000000000 -0700
+++ ,,Siam/ls-tree.c	2005-04-14 19:11:23.000000000 -0700
@@ -5,45 +5,117 @@
  */
 #include "cache.h"
 
-static int list(unsigned char *sha1)
+int line_termination = '\n';
+int recursive = 0;
+
+struct path_prefix {
+	struct path_prefix *prev;
+	const char *name;
+};
+
+static void print_path_prefix(struct path_prefix *prefix)
 {
-	void *buffer;
-	unsigned long size;
-	char type[20];
+	if (prefix) {
+		if (prefix->prev)
+			print_path_prefix(prefix->prev);
+		fputs(prefix->name, stdout);
+		putchar('/');
+	}
+}
+
+static void list_recursive(void *buffer,
+			   unsigned char *type,
+			   unsigned long size,
+			   struct path_prefix *prefix)
+{
+	struct path_prefix this_prefix;
+	this_prefix.prev = prefix;
 
-	buffer = read_sha1_file(sha1, type, &size);
-	if (!buffer)
-		die("unable to read sha1 file");
 	if (strcmp(type, "tree"))
 		die("expected a 'tree' node");
+
 	while (size) {
-		int len = strlen(buffer)+1;
-		unsigned char *sha1 = buffer + len;
-		char *path = strchr(buffer, ' ')+1;
+		int namelen = strlen(buffer)+1;
+		void *eltbuf;
+		char elttype[20];
+		unsigned long eltsize;
+		unsigned char *sha1 = buffer + namelen;
+		char *path = strchr(buffer, ' ') + 1;
 		unsigned int mode;
-		unsigned char *type;
 
-		if (size < len + 20 || sscanf(buffer, "%o", &mode) != 1)
+		if (size < namelen + 20 || sscanf(buffer, "%o", &mode) != 1)
 			die("corrupt 'tree' file");
 		buffer = sha1 + 20;
-		size -= len + 20;
+		size -= namelen + 20;
+
 		/* XXX: We do some ugly mode heuristics here.
 		 * It seems not worth it to read each file just to get this
-		 * and the file size. -- pasky@ucw.cz */
-		type = S_ISDIR(mode) ? "tree" : "blob";
-		printf("%03o\t%s\t%s\t%s\n", mode, type, sha1_to_hex(sha1), path);
+		 * and the file size. -- pasky@ucw.cz
+		 * ... that is, when we are not recursive -- junkio@cox.net
+		 */
+		eltbuf = (recursive ? read_sha1_file(sha1, elttype, &eltsize) :
+			  NULL);
+		if (! eltbuf) {
+			if (recursive)
+				error("cannot read %s", sha1_to_hex(sha1));
+			type = S_ISDIR(mode) ? "tree" : "blob";
+		}
+		else
+			type = elttype;
+
+		printf("%03o\t%s\t%s\t", mode, type, sha1_to_hex(sha1));
+		print_path_prefix(prefix);
+		fputs(path, stdout);
+		putchar(line_termination);
+
+		if (eltbuf && !strcmp(type, "tree")) {
+			this_prefix.name = path;
+			list_recursive(eltbuf, elttype, eltsize, &this_prefix);
+		}
+		free(eltbuf);
 	}
+}
+
+static int list(unsigned char *sha1)
+{
+	void *buffer;
+	unsigned long size;
+	char type[20];
+
+	buffer = read_sha1_file(sha1, type, &size);
+	if (!buffer)
+		die("unable to read sha1 file");
+	list_recursive(buffer, type, size, NULL);
 	return 0;
 }
 
+static void _usage(void)
+{
+	usage("ls-tree [-r] [-z] <key>");
+}
+
 int main(int argc, char **argv)
 {
 	unsigned char sha1[20];
 
+	while (1 < argc && argv[1][0] == '-') {
+		switch (argv[1][1]) {
+		case 'z':
+			line_termination = 0;
+			break;
+		case 'r':
+			recursive = 1;
+			break;
+		default:
+			_usage();
+		}
+		argc--; argv++;
+	}
+
 	if (argc != 2)
-		usage("ls-tree <key>");
+		_usage();
 	if (get_sha1_hex(argv[1], sha1) < 0)
-		usage("ls-tree <key>");
+		_usage();
 	sha1_file_directory = getenv(DB_ENVIRONMENT);
 	if (!sha1_file_directory)
 		sha1_file_directory = DEFAULT_DB_ENVIRONMENT;
* Re: ls-tree enhancements
  2005-04-15  2:21 ` [Patch] ls-tree enhancements Junio C Hamano
@ 2005-04-15 16:13 ` Petr Baudis
  2005-04-15 18:25 ` Junio C Hamano
  0 siblings, 1 reply; 130+ messages in thread
From: Petr Baudis @ 2005-04-15 16:13 UTC
  To: Junio C Hamano; +Cc: Linus Torvalds, git

Dear diary, on Fri, Apr 15, 2005 at 04:21:30AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> told me that...
> +static void _usage(void)
> +{
> +	usage("ls-tree [-r] [-z] <key>");
> +}

(namespace-nazi-hat
	This infringes the system namespaces. FWIW, I prefer to add the
	underscore at the end of the identifier if wanting to do stuff
	like this. Or just call it my_usage().
)

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
* Re: ls-tree enhancements
  2005-04-15 16:13 ` Petr Baudis
@ 2005-04-15 18:25 ` Junio C Hamano
  0 siblings, 0 replies; 130+ messages in thread
From: Junio C Hamano @ 2005-04-15 18:25 UTC
  To: Petr Baudis; +Cc: Linus Torvalds, git

>>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:

>> +static void _usage(void)

PB> This infringes the system namespaces. FWIW, I prefer to add the
PB> underscore at the end of the identifier if wanting to do stuff like
PB> this. Or just call it my_usage().

Thanks.  My bad.  Noted.
* Re: Merge with git-pasky II.
  2005-04-14 18:36 ` Linus Torvalds
  2005-04-14 19:59 ` Junio C Hamano
  2005-04-15  2:21 ` [Patch] ls-tree enhancements Junio C Hamano
@ 2005-04-15  9:14 ` David Woodhouse
  2005-04-15  9:36 ` Ingo Molnar
                   ` (2 more replies)
  2 siblings, 3 replies; 130+ messages in thread
From: David Woodhouse @ 2005-04-15  9:14 UTC
  To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git

On Thu, 2005-04-14 at 11:36 -0700, Linus Torvalds wrote:
> And "merge these two trees" (which works on a _tree_ level)
> or "find the common commit" (which works on a _commit_ level)

I suspect that finding the common commit is actually a per-file thing;
it's not just something you do for the _commit_ graph, then use for
merging each file in the two branches you're trying to merge.

Consider a simple repository which contains two files A and B. We start
off with the first version of each ('A1B1'), and the owner of each file
takes a branch and modifies their own file. There is cross-pulling
between the two, and then each modifies the _other's_ file as well as
their own...

          (A1B2)--(A2B2)--(A2'B3)
         /     \   /             \
        /       \ /               \
  (A1B1)         X                 (...)
        \       / \               /
         \     /   \             /
          (A2B1)--(A2B2)--(A3B2')

Now, we're trying to merge the two branches. It appears that the most
useful common ancestor to use for a three-way merge of file A is the
version from tree 'A2B1', while the most useful common ancestor for
merging file B is that in 'A1B2'.

(I think it's a coincidence that in my example the useful files 'A2' and
'B2' actually do end up in a single tree together at some point.)

-- 
dwmw2
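The commit-level view of this graph can be made concrete with a small search for common ancestors. This is only a sketch under stated assumptions: `struct commit`, `merge_bases` and `example_graph` are hypothetical names, the graph is a hand-written copy of the diagram above, and the search is a plain reachability walk rather than anything git itself does.

```c
#define MAX_COMMITS 16

/* A toy commit graph: each commit has up to two parents (-1 = none). */
struct commit {
	const char *name;
	int parent[2];
};

/* The cross-pull graph from the message above, oldest commit first. */
static const struct commit example_graph[7] = {
	{ "A1B1",         { -1, -1 } },	/* 0: common starting point */
	{ "A1B2",         {  0, -1 } },	/* 1: top branch edits B    */
	{ "A2B1",         {  0, -1 } },	/* 2: bottom branch edits A */
	{ "A2B2 (top)",   {  1,  2 } },	/* 3: top pulls bottom      */
	{ "A2B2 (bot)",   {  2,  1 } },	/* 4: bottom pulls top      */
	{ "A2'B3",        {  3, -1 } },	/* 5: head of top branch    */
	{ "A3B2'",        {  4, -1 } },	/* 6: head of bottom branch */
};

/* Mark every commit reachable from 'c', including 'c' itself. */
static void mark_ancestors(const struct commit *g, int c, unsigned char *seen)
{
	if (c < 0 || seen[c])
		return;
	seen[c] = 1;
	mark_ancestors(g, g[c].parent[0], seen);
	mark_ancestors(g, g[c].parent[1], seen);
}

/*
 * Store in 'out' the common ancestors of 'a' and 'b' that are not
 * themselves ancestors of another common ancestor, and return how many
 * were found.  'n' is the number of commits in 'g' (at most MAX_COMMITS).
 */
int merge_bases(const struct commit *g, int n, int a, int b, int *out)
{
	unsigned char from_a[MAX_COMMITS] = {0}, from_b[MAX_COMMITS] = {0};
	int i, j, count = 0;

	mark_ancestors(g, a, from_a);
	mark_ancestors(g, b, from_b);

	for (i = 0; i < n; i++) {
		int redundant = 0;

		if (!from_a[i] || !from_b[i])
			continue;	/* not a common ancestor */
		for (j = 0; j < n && !redundant; j++) {
			unsigned char below[MAX_COMMITS] = {0};

			if (j == i || !from_a[j] || !from_b[j])
				continue;
			mark_ancestors(g, j, below);
			redundant = below[i];	/* i is older than j */
		}
		if (!redundant)
			out[count++] = i;
	}
	return count;
}
```

Calling `merge_bases(example_graph, 7, 5, 6, out)` reports both 'A1B2' and 'A2B1': the commit graph alone offers two equally recent common ancestors, with 'A1B1' behind both of them, which is why the choice of a single base is not obvious here.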
* Re: Merge with git-pasky II.
  2005-04-15  9:14 ` Merge with git-pasky II David Woodhouse
@ 2005-04-15  9:36 ` Ingo Molnar
  2005-04-15 10:05 ` David Woodhouse
  2005-04-15 12:03 ` Johannes Schindelin
  2005-04-15 14:53 ` Linus Torvalds
  2 siblings, 1 reply; 130+ messages in thread
From: Ingo Molnar @ 2005-04-15  9:36 UTC
  To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

* David Woodhouse <dwmw2@infradead.org> wrote:

> Consider a simple repository which contains two files A and B. We
> start off with the first version of each ('A1B1'), and the owner of
> each file takes a branch and modifies their own file. There is
> cross-pulling between the two, and then each modifies the _other's_
> file as well as their own...
>
>           (A1B2)--(A2B2)--(A2'B3)
>          /     \   /             \
>         /       \ /               \
>   (A1B1)         X                 (...)
>         \       / \               /
>          \     /   \             /
>           (A2B1)--(A2B2)--(A3B2')
>
> Now, we're trying to merge the two branches. It appears that the most
> useful common ancestor to use for a three-way merge of file A is the
> version from tree 'A2B1', while the most useful common ancestor for
> merging file B is that in 'A1B2'.

do such cases occur frequently? In the kernel at least it's not too
typical. Would it be a problem to go for the simple solution of using
(A1B1) as the common ancestor (based on the tree graph), and then to do
a 3-way merge of all changes from that point on?

	Ingo
* Re: Merge with git-pasky II.
  2005-04-15  9:36 ` Ingo Molnar
@ 2005-04-15 10:05 ` David Woodhouse
  2005-04-15 14:53 ` Ingo Molnar
  0 siblings, 1 reply; 130+ messages in thread
From: David Woodhouse @ 2005-04-15 10:05 UTC
  To: Ingo Molnar; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

On Fri, 2005-04-15 at 11:36 +0200, Ingo Molnar wrote:
> do such cases occur frequently? In the kernel at least it's not too
> typical.

Isn't it? I thought it was a fairly accurate representation of the
process "I make a whole bunch of changes to files I maintain, pulling
from Linus while occasionally asking him to pull from my tree. Sometimes
my files are changed by someone else in Linus' tree, and sometimes I
change files that I don't actually own.".

-- 
dwmw2
* Re: Merge with git-pasky II.
  2005-04-15 10:05 ` David Woodhouse
@ 2005-04-15 14:53 ` Ingo Molnar
  2005-04-15 15:09 ` David Woodhouse
  0 siblings, 1 reply; 130+ messages in thread
From: Ingo Molnar @ 2005-04-15 14:53 UTC
  To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

* David Woodhouse <dwmw2@infradead.org> wrote:

> On Fri, 2005-04-15 at 11:36 +0200, Ingo Molnar wrote:
> > do such cases occur frequently? In the kernel at least it's not too
> > typical.
>
> Isn't it? I thought it was a fairly accurate representation of the
> process "I make a whole bunch of changes to files I maintain, pulling
> from Linus while occasionally asking him to pull from my tree.
> Sometimes my files are changed by someone else in Linus' tree, and
> sometimes I change files that I don't actually own.".

but the specific scenario you described would require _Linus'_ tree to
be in limbo for a long time, and have uncommitted half-done edits. I.e.:

          (A1B2)--(A2B2)--(A2'B3)
         /     \   /             \
        /       \ /               \
  (A1B1)         X                 (...)
        \       / \               /
         \     /   \             /
          (A2B1)--(A2B2)--(A3B2')

in the above scenario Linus' tree needs to 'cross' with a maintainer's
tree. (maintainer's tree won't cross with another maintainer's tree, as
maintainer-to-maintainer merges are rare.)

but for the scenario to occur, i think there needs to be a prolonged
"limbo" period in Linus' tree for a 'crossing' to happen. But Linus'
merges are typically almost atomic: they are done then they are pushed
out. It's definitely not in the 'days, sometimes weeks' timescale as
maintainer trees are.

so for the scenario to occur, a maintainer, from whom Linus has just
pulled an update and Linus is merging the tree manually without
committing, has to pull a file from the earlier Linus tree, and then
Linus has to modify that same file again. This does not seem to be a
common scenario.

so i think to avoid the scenario, maintainers should not pull from each
other - they should only pull/push to/from Linus' tree. Maybe this is an
unacceptable limitation?

	Ingo
* Re: Merge with git-pasky II.
  2005-04-15 14:53 ` Ingo Molnar
@ 2005-04-15 15:09 ` David Woodhouse
  0 siblings, 0 replies; 130+ messages in thread
From: David Woodhouse @ 2005-04-15 15:09 UTC
  To: Ingo Molnar; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

On Fri, 2005-04-15 at 16:53 +0200, Ingo Molnar wrote:
> but the specific scenario you described would require _Linus'_ tree to
> be in limbo for a long time, and have uncommitted half-done edits.
> I.e.:
>
>           (A1B2)--(A2B2)--(A2'B3)
>          /     \   /             \
>         /       \ /               \
>   (A1B1)         X                 (...)
>         \       / \               /
>          \     /   \             /
>           (A2B1)--(A2B2)--(A3B2')
>
> in the above scenario Linus' tree needs to 'cross' with a maintainer's
> tree. (maintainer's tree wont cross with another maintainer's tree,
> as maintainer-to-maintainer merges rare.)

Is that true? Consider (A2B1) to be a bugfixes-only tree which I make
available for Linus to pull from. I keep doing more experimental stuff
in my own private copy of the tree along the bottom branch, while Linus
_eventually_ responds to my pull request and moves on, stopping only to
add a 'static' to one of my new functions. I move on too but don't pull
from Linus again for a little while; the final merge happens when I _do_
pull again.

-- 
dwmw2
* Re: Merge with git-pasky II.
  2005-04-15  9:14 ` Merge with git-pasky II David Woodhouse
  2005-04-15  9:36 ` Ingo Molnar
@ 2005-04-15 12:03 ` Johannes Schindelin
  2005-04-15 10:22 ` Theodore Ts'o
  2005-04-15 14:53 ` Linus Torvalds
  2 siblings, 1 reply; 130+ messages in thread
From: Johannes Schindelin @ 2005-04-15 12:03 UTC
  To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

Hi,

On Fri, 15 Apr 2005, David Woodhouse wrote:

> On Thu, 2005-04-14 at 11:36 -0700, Linus Torvalds wrote:
> > And "merge these two trees" (which works on a _tree_ level)
> > or "find the common commit" (which works on a _commit_ level)
>
> I suspect that finding the common commit is actually a per-file thing;
> it's not just something you do for the _commit_ graph, then use for
> merging each file in the two branches you're trying to merge.

I disagree. In order to be trusted, this thing has to catch the
following scenario:

Skywalker and Solo start from the same base. They commit quite a lot to
their trees. In between, Skywalker commits a tree, where the function
"kazoom()" has been added to the file "deathstar.c", but Solo also added
this function, but to the file "moon.c". A file-based merge would have
no problem merging each file, such that in the end, "kazoom()" is
defined twice.

The same problems arise when one tries to merge line-wise, i.e. when for
each line a (possibly different) merge-parent is sought.

The concept here is a *transaction*: when going from one tree to the
next tree via a commit, a sort of integrity is maintained, which is
breached when only looking at files and commits.

Ciao,
Dscho
* Re: Merge with git-pasky II.
  2005-04-15 12:03 ` Johannes Schindelin
@ 2005-04-15 10:22 ` Theodore Ts'o
  0 siblings, 0 replies; 130+ messages in thread
From: Theodore Ts'o @ 2005-04-15 10:22 UTC
  To: Johannes Schindelin
  Cc: David Woodhouse, Linus Torvalds, Junio C Hamano, Petr Baudis, git

On Fri, Apr 15, 2005 at 02:03:08PM +0200, Johannes Schindelin wrote:
> I disagree. In order to be trusted, this thing has to catch the following
> scenario:
>
> Skywalker and Solo start from the same base. They commit quite a lot to
> their trees. In between, Skywalker commits a tree, where the function
> "kazoom()" has been added to the file "deathstar.c", but Solo also added
> this function, but to the file "moon.c". A file-based merge would have no
> problem merging each file, such that in the end, "kazoom()" is defined
> twice.
>
> The same problems arise when one tries to merge line-wise, i.e. when for
> each line a (possibly different) merge-parent is sought.

Be careful.  There is a very big tradeoff between 100% perfection in
catching these sorts of errors, and usability.  There exist SCMs where
you are not allowed to commit such merges until you do a test compile,
or run a regression test suite (that being the only way to catch these
sorts of problems when we merge two branches like this).

BitKeeper never caught this sort of thing, and we trusted it.  In
practice it was also rarely a problem.

I'll also note that BitKeeper doesn't restrict you from committing a
changeset when you have modified files that have yet to be checked in
to the tree.  Same issue; you can accidentally check in changesets that
result in trees that won't build, but if we added this kind of
SCM-by-straightjacket philosophy it would decrease our productivity and
people would simply not use such an SCM, thus negating its
effectiveness.

						- Ted
* Re: Merge with git-pasky II.
  2005-04-15  9:14 ` Merge with git-pasky II David Woodhouse
  2005-04-15  9:36 ` Ingo Molnar
  2005-04-15 12:03 ` Johannes Schindelin
@ 2005-04-15 14:53 ` Linus Torvalds
  2005-04-15 15:29 ` David Woodhouse
  2005-04-15 15:54 ` Paul Jackson
  2 siblings, 2 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-15 14:53 UTC
  To: David Woodhouse; +Cc: Junio C Hamano, Petr Baudis, git

On Fri, 15 Apr 2005, David Woodhouse wrote:
>
> I suspect that finding the common commit is actually a per-file thing;
> it's not just something you do for the _commit_ graph, then use for
> merging each file in the two branches you're trying to merge.

I disagree. Conceptually, you should never do _anything_ on a file
level.

Why? Because individual files don't matter. You shouldn't merge two
files cleanly just because they look fine - they _depend_ on the other
files in the archive, and that's quite fundamentally why per-file
tracking is really wrong from a project standpoint.

So if you can't merge two files cleanly because the "project" history
ended up being further back than the "file" history, then that's a
_good_ thing. You don't know what the hell happened to the other files
that this file depended on. Merging one file independently of the others
is WRONG.

Also, I suspect that you'll find that if you do cross-merges, you'll
basically always end up in:

> (I think it's a coincidence that in my example the useful files 'A2' and
> 'B2' actually do end up in a single tree together at some point.)

nope, I don't think that's coincidence. I think that's the normal case.
Your file-based history is the one that can _incorrectly_ and
coincidentally happen to have a single file at some point, but since
that file doesn't stand alone, that's really not a fundamentally good
reason to merge it.

Really, this "individual files matter" approach is a _disease_. They
don't. Individual files DO NOT EXIST. Files always exist as part of the
project, and the _only_ time you track a single file is when the project
is a single file (and then that will be very very obvious in a git
archive, thank you very much).

So the single-file mentality is a disease brought on by decades of
_crap_. And by the fact that it ends up limiting the problem scope, so
you can do certain things easier.

For example, just doing intra-file diffs is a lot _easier_ and less
time-consuming than doing inter-file diffs. But it is _absolutely_ not
better. In fact, it is clearly inferior to anybody who spends even five
seconds thinking about it - yet we still do it, because of the
historical (and INCORRECT) mindset that "files matter".

Files DO NOT matter. Never have. It's an implementation limitation to
think they do. You'll screw yourself up, and when somebody comes up with
a half-way efficient way to generate inter-file diffs, your architecture
is totally and utterly unable to handle it.

I don't care what you do at an SCM level, and if the crud you put on top
of git wants to perpetuate mistakes of yesteryear, that's _your_ issue.
But dammit, git is designed to do the right thing, and I will fight
tooth and nail against anybody who thinks individual files matter.

		Linus
* Re: Merge with git-pasky II.
  2005-04-15 14:53 ` Linus Torvalds
@ 2005-04-15 15:29 ` David Woodhouse
  2005-04-15 15:51 ` Linus Torvalds
  2005-04-15 15:54 ` Paul Jackson
  1 sibling, 1 reply; 130+ messages in thread
From: David Woodhouse @ 2005-04-15 15:29 UTC
  To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git

On Fri, 2005-04-15 at 07:53 -0700, Linus Torvalds wrote:
> Files DO NOT matter. Never have. It's an implementation limitation to
> think they do. You'll screw yourself up, and when somebody comes up with a
> half-way efficient way to generate inter-file diffs, your architecture is
> totally and utterly unable to handle it.
>
> I don't care what you do at an SCM level, and if the crud you put on top
> of git wants to perpetuate mistakes of yesteryear, that's _your_ issue.
> But dammit, git is designed to do the right thing, and I will fight tooth
> and nail against anybody who thinks individual files matter.

No, really: individual files _DO_ matter. There's a reason we split
stuff up into separate files, and if you look closely you'll find that
we don't just randomly put different functions into different files with
neither rhyme nor reason -- there's a pattern to it; usually some kind
of functional grouping.

And when I'm looking for the change that broke something, I can almost
always tell which file it's in and go looking in _that_ file. It's a
_whole_ lot easier to use the equivalent of 'bk revtool' than it is to
sift through all the unrelated commits in the whole tree. If that's an
implementation limitation, then it's an implementation limitation in my
_brain_ not just in my tools.

OK, in fact it shouldn't be 'show me the history of this file'; it's
often really 'show me the history of this function' which I want. But
that's fine. All I'm suggesting is that we should include the metadata
which says "content moved from file XXX to file YYY" along with the
commit objects.

I'm certainly not suggesting that we should implement jejb's idea of
explicit 'file revision history' objects -- the tree-based philosophy is
perfectly sane and sufficient. But we do _also_ need a little
information which allows us to track content as it moves around within
the tree, and the SCM has to have a sane way to filter out the noise
when we're looking for what broke. Yes, that's part of the SCM
functionality, and can live in an xattr-type field in the commit object
-- but it does need to be stored, and in practice I suspect it _will_ be
useful for merging too.

It's not about ditching the per-tree tracking and doing per-file
tracking instead. I agree that would be wrong. It's about storing enough
information to track what happened to given content as it moved around
within the tree.

-- 
dwmw2
* Re: Merge with git-pasky II.
  2005-04-15 15:29 ` David Woodhouse
@ 2005-04-15 15:51 ` Linus Torvalds
  0 siblings, 0 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-15 15:51 UTC
  To: David Woodhouse; +Cc: Junio C Hamano, Petr Baudis, git

On Fri, 15 Apr 2005, David Woodhouse wrote:
>
> And when I'm looking for the change that broke something, I can almost
> always tell which file it's in and go looking in _that_ file.

Read my email about finding "what changed" that I sent out a minute ago.

I claim that my algorithm for finding "what changed" handles your
"single file" case as a very small (and usually quite uninteresting)
special case.

I claim (and if you just look at my proposal I think you'll agree) that
I can track single functions, and do it efficiently. WITHOUT adding any
meta-data at all.

The thing is, if the question is "I have this piece of code, and I want
to see what changed", you fundamentally _can_ do that efficiently.
That's really what git was designed for. It's the whole _point_ of
having history in the first place. If git didn't care, it wouldn't have
a back-pointer to the tree it came from, and we'd all be just merging
pure trees.

But you mix that question up with "how do I save that information in the
commit", which is a totally unnecessary mix-up, and which makes things
MUCH more complicated, for absolutely zero gain. In fact, because you
mixed up those two issues, the problem now became so complicated that
you can no longer solve it, so you start doing hacks like "the user has
to tell us what he did" (aka "bk mv" or "svn rename"), and you start
mentally to limit yourself to files, because you realize that you
_have_ to limit your intractable problem to make it at all solvable.

And I'm telling you that your problem is STUPID. You made it stupid by
thinking that every question about the source tree should be answered at
commit time. Which just clearly isn't true!

If you just drop the tying-together, and accept that "what changed" is a
valid question _regardless_ of trying to track it at commit time, now
your whole world opens up. Birds sing, the sun is shining on you, and
beautiful scantily clad women (or men) dance around you. The world is
suddenly a good place, just _filled_ with possibilities.

Suddenly you realize that if the question is just "what changed in this
piece of code" (and let's face it, that _is_ the question), you can
track it afterwards. Trying to tie in "commit time" into the question
was what made it hard. If you do _not_ do that (totally unnecessary)
tie-in, the question suddenly becomes easy to answer, and several
obvious and simple answers spring to mind pretty immediately.

> It's not about ditching the per-tree tracking and doing per-file
> tracking instead. I agree that would be wrong. It's about storing enough
> information to track what happened to given content as it moved around
> within the tree.

No. Git absolutely does have everything you need already. You just
aren't realizing that it's already there - in the data - and that you
can do much more intelligent searches for changes if you accept that
undeniable fact.

The fact that you can NOT do those searches at commit-time (which is a
global op), and can only do them if you have a specific question in mind
("what changed _here_"), is the big issue.

The thing is, at commit-time you'd need to answer every possible
question ("what changed here, and here, and here, and in this function,
and in this file, and in this directory and why did this identifier get
renamed and why is the sky blue"). AND YOU FUNDAMENTALLY CANNOT DO THAT.
It's impossible.

But once you _know_ the question (which is the only time when the answer
is actually relevant, so why care about it before that time?), you can
find out the answer by just automating the job of looking at the _data_.
It's easy. The question makes it obvious by its nature. The question is
the thing that gives you the specifics that makes the search possible in
the first place.

And _this_ is why the data matters. Renames and file boundaries do not.
And until you accept that, you just limit yourself.

		Linus
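The idea of answering "what changed here" at search time, from the stored data alone, can be illustrated with a toy history walk. The proposal referred to above is not spelled out in this message, so the following is only an assumption-laden sketch: `struct file`, `struct snapshot`, `find_in_snapshot` and `introduced_at` are hypothetical names, history is a single line of snapshots ordered newest first, and matching is an exact substring search.

```c
#include <stddef.h>
#include <string.h>

struct file {
	const char *path;
	const char *content;
};

struct snapshot {
	const char *id;			/* commit name, for display only */
	const struct file *files;
	size_t nfiles;
};

/* Return the path of the first file in 's' containing 'snippet', or NULL. */
const char *find_in_snapshot(const struct snapshot *s, const char *snippet)
{
	size_t i;

	for (i = 0; i < s->nfiles; i++)
		if (strstr(s->files[i].content, snippet))
			return s->files[i].path;
	return NULL;
}

/*
 * 'history' is ordered newest first.  Walk back while 'snippet' is still
 * present somewhere in the tree, whatever file it lives in, and return
 * the index of the oldest such snapshot, or -1 if the newest snapshot
 * does not contain it.
 */
int introduced_at(const struct snapshot *history, size_t n, const char *snippet)
{
	int last = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!find_in_snapshot(&history[i], snippet))
			break;
		last = (int)i;
	}
	return last;
}
```

Because the search looks at the whole tree in each snapshot, a function that moved from one file to another is still followed, with no rename record stored at commit time.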
* Re: Merge with git-pasky II. 2005-04-15 14:53 ` Linus Torvalds 2005-04-15 15:29 ` David Woodhouse @ 2005-04-15 15:54 ` Paul Jackson 2005-04-15 16:30 ` C. Scott Ananian 1 sibling, 1 reply; 130+ messages in thread From: Paul Jackson @ 2005-04-15 15:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: dwmw2, junkio, pasky, git Linus wrote: > For example, just doing intra-file diffs is a lot _easier_ and less > time-consuming than doing inter-file diffs. Um ah ... could you explain what you mean by inter and intra file diffs? Google found a three year old message by Andrew Morton, discussing inter and intra file fragmentation on ext2/ext3 file systems and the find_group_dir() routine. I don't think that's what you had in mind ;). When I run the 'diff' command, it usually between two files, not between two parts of a file. So I'd have thought inter file diffs were easier. Clearly, I don't git it. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401 ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 15:54 ` Paul Jackson @ 2005-04-15 16:30 ` C. Scott Ananian 2005-04-15 18:29 ` Paul Jackson 0 siblings, 1 reply; 130+ messages in thread From: C. Scott Ananian @ 2005-04-15 16:30 UTC (permalink / raw) To: Paul Jackson; +Cc: Linus Torvalds, dwmw2, junkio, pasky, git On Fri, 15 Apr 2005, Paul Jackson wrote: > Um ah ... could you explain what you mean by inter and intra file diffs? intra file diffs: here are two versions of the same file. what changed? inter file diffs: here is a new file, and here are *all the files in the current committed version*. Where did the contents of this new file come from? (Note that the new file is often a slightly changed version of an existing file in the current committed version. But we don't assume that must be true.) --scott supercomputer Pakistan WSHOOFS SECANT LCPANGS SDI assassination ZPSECANT SEQUIN AEBARMAN ESCOBILLA bomb mustard STANDEL ESGAIN Nazi FJDEFLECT ( http://cscott.net/ ) ^ permalink raw reply [flat|nested] 130+ messages in thread
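The "inter file diff" question described above (here is a new file, here are
all committed files, where did the content come from?) can be illustrated
with a small Python sketch. This is only an illustration under stated
assumptions, not anything git itself does: the committed tree is modelled as
a plain dict of path to text, `best_source` is a made-up name, and the
similarity measure is the standard library's `difflib.SequenceMatcher` over
lines.

```python
import difflib

def best_source(new_text, committed):
    """Return (path, ratio) of the committed file most similar to new_text.

    `committed` is a hypothetical stand-in for "all the files in the
    current committed version": a dict mapping path -> file contents.
    Returns (None, 0.0) when nothing shares any line with the new file.
    """
    new_lines = new_text.splitlines()
    best_path, best_ratio = None, 0.0
    for path, old_text in sorted(committed.items()):
        matcher = difflib.SequenceMatcher(None, old_text.splitlines(), new_lines)
        ratio = matcher.ratio()
        if ratio > best_ratio:
            best_path, best_ratio = path, ratio
    return best_path, best_ratio

if __name__ == "__main__":
    committed = {
        "a.c": "int main()\n{\n return 0;\n}\n",
        "README": "hello\nworld\n",
    }
    new_file = "int main()\n{\n return 1;\n}\n"
    print(best_source(new_file, committed))
```

An intra file diff needs only the one pairwise comparison; the inter file
case has to run it against every candidate, which is why it is the more
expensive question.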
* Re: Merge with git-pasky II. 2005-04-15 16:30 ` C. Scott Ananian @ 2005-04-15 18:29 ` Paul Jackson 0 siblings, 0 replies; 130+ messages in thread From: Paul Jackson @ 2005-04-15 18:29 UTC (permalink / raw) To: C. Scott Ananian; +Cc: torvalds, dwmw2, junkio, pasky, git > intra file diffs: here are two versions of the same file. Ah so. Linus faked me out. I was _sure_ that by "file" he meant "file" -- as in a bucket of bits with a unique identifying <sha1>. In that message, I guess by "file" he meant "a version controlled file, consisting of a series of content versions and meta-data" That's what I get for trusting Linus to always speak as a kernel hacker, not an SCM hacker. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401 ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 18:12 ` Junio C Hamano 2005-04-14 18:36 ` Linus Torvalds @ 2005-04-14 18:51 ` Christopher Li 2005-04-14 19:35 ` Petr Baudis 2 siblings, 0 replies; 130+ messages in thread From: Christopher Li @ 2005-04-14 18:51 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git On Thu, Apr 14, 2005 at 11:12:35AM -0700, Junio C Hamano wrote: > >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: > > At this moment in the script, we have run "read-tree" the > ancestor so the dircache has the original. %tree0 and %tree1 > both did not touch the path ($_ here) so it is the same as > ancestor. When '-f' is specified we are populating the output > working tree with the merge result so that is what that > 'checkout-cache' is about. "O - $path" means "we took the > original". > > The idea is to populate the dircache of merge-temp with the > merge result and leave uncertain stuff as in the common ancestor > state, so that the user can fix them starting from there. > > Maybe it is a good time for me to summarize the output somewhere > in a document. > > O - $path Tree-A and tree-B did not touch this; the result > is taken from the ancestor (O for original). > > A D $path Only tree-A (or tree-B) deleted this and the other > B D $path branch did not touch this; the result is to delete. > > A M $path Only tree-A (or tree-B) modified this and the other > B M $path branch did not touch this; the result is to use one > from tree-A (or tree-B). This includes file > creation case. > > *DD $path Both tree-A and tree-B deleted this; the result > is to delete. > > *DM $path Tree-A deleted while tree-B modified this (or > *MD $path vice versa), and manual conflict resolution is > needed; dircache is left as in the ancestor, and > the modified file is saved as $path~A~ in the > working directory. The user can rename it to $path > and run show-diff to see what Tree-A wanted to do > and decide before running update-cache. 
>
> *MM $path     Tree-A and tree-B did the exact same
>               modification; the result is to use that.
>
> MRG $path     Tree-A and tree-B have different modifications;
>               run "merge" and the merge result is left as
>               $path in the working directory.
>
> In cases other than *DM, *MD, and MRG, the result is trivial and

I believe there is a simpler way to do it, as in my demo Python script.
I started it earlier, but you beat me to it.

It is a demo script; it only prints the actions instead of actually
carrying them out. Changing that to the corresponding os.system("") calls
is left to the reader.

Again, this is a demo of how it can be done, not a Python vs. Perl thing.
I did not choose Perl only because I am not good at it.

#!/usr/bin/env python

import re
import sys
import os

def get_tree(commit):
    # Read the commit object and pull out the tree it points at.
    data = os.popen("cat-file commit %s" % commit).read()
    return re.findall(r"(?m)^tree (\w+)", data)[0]

# Indexes into a parsed diff-tree record.
PREFIX = 0
PATH = -1
SHA = -2
ORIGSHA = -3

def get_difftree(old, new):
    # Map each changed path to its parsed diff-tree record.
    lines = os.popen("diff-tree %s %s" % (old, new)).read().split("\x00")
    patterns = (r"(\*)(\d+)->(\d+)\s(\w+)\s(\w+)->(\w+)\s(.*)",
                r"([+-])(\d+)\s(\w+)\s(\w+)\s(.*)")
    res = {}
    for l in lines:
        if not l:
            continue
        for p in patterns:
            m = re.findall(p, l)
            if m:
                m = m[0]
                res[m[PATH]] = m
                break
        else:
            raise ValueError("difftree: unknown line: %r" % l)
    return res

def analyze(diff1, diff2):
    diff1only = [diff1[k] for k in diff1 if k not in diff2]
    diff2only = [diff2[k] for k in diff2 if k not in diff1]
    both = [(diff1[k], diff2[k]) for k in diff2 if k in diff1]

    action(diff1only)
    action(diff2only)
    action_two(both)

def action(diffs):
    # Paths touched by one side only: just take that side's change.
    for act in diffs:
        if act[PREFIX] == '*':
            print("modify", act[PATH], act[SHA])
        elif act[PREFIX] == '-':
            print("remove", act[PATH], act[SHA])
        elif act[PREFIX] == '+':
            print("add", act[PATH], act[SHA])
        else:
            raise ValueError("unknown action")

def action_two(diffs):
    # Paths touched by both sides.
    for act1, act2 in diffs:
        if len(act1) == len(act2):      # same kind of record
            if act1[PREFIX] == act2[PREFIX]:
                if act1[SHA] == act2[SHA] or act1[PREFIX] == '-':
                    # Both sides made the same change; apply it once.
                    action([act1])
                    continue
                if act1[PREFIX] == '*':
                    print("3way-merge", act1[PATH], act1[ORIGSHA],
                          act1[SHA], act2[SHA])
                    continue
        print("unable to handle", act1[PATH])
        print("one side wants", act1[PREFIX])
        print("the other side wants", act2[PREFIX])

args = sys.argv[1:]
trees = [get_tree(arg) for arg in args]
print("check out tree", trees[0])
diff1 = get_difftree(trees[0], trees[1])
diff2 = get_difftree(trees[0], trees[2])
analyze(diff1, diff2)

^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II. 2005-04-14 18:12 ` Junio C Hamano 2005-04-14 18:36 ` Linus Torvalds 2005-04-14 18:51 ` Christopher Li @ 2005-04-14 19:35 ` Petr Baudis 2005-04-14 20:01 ` Live Merging from remote repositories Barry Silverman ` (2 more replies) 2 siblings, 3 replies; 130+ messages in thread From: Petr Baudis @ 2005-04-14 19:35 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Dear diary, on Thu, Apr 14, 2005 at 08:12:35PM CEST, I got a letter where Junio C Hamano <junkio@cox.net> told me that... > >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: > > PB> Bah, you outran me. ;-) > > Just being in a different timezone, I guess. > > PB> I'll change it to use the cool git-pasky stuff (commit-id etc) and its > PB> style of committing - that is, it will merely record the update-caches > PB> to be done upon commit, and it will read-tree the branch we are merging > PB> to instead of the ancestor. (So that git diff gives useful output.) > > Sorry, I have not seen what you have been doing since pasky 0.3, > and I have not even started to understand the mental model of > the world your tool is building. That said, my gut feeling is > that telling this script about git-pasky's world model might be > a mistake. I'd rather see you consider the script as mere "part > of the plumbing". Maybe adding an extra parameter to the script > to let the user explicitly specify the common ancestor to use > would be needed, but I would prefer git-pasky-merge to do its > own magic (converting symbolic commit names into raw commit > names and such) before calling this low level script. > > That way people like me who have not migrated to your framework > can still keep using it. All the script currently needs is a > bare git object database; i.e., nothing other than what is in > .git/objects and a couple of commit record SHA1s as its > parameters. 
No .git/heads/, no .git/HEAD.local, no .git/tags, > are involved for it to work, and I would prefer to keep things > that way if possible. I see, and I actually agree with it. However, I'll want merge-tree.pl to do a little less than it does now for that, though. The mechanics in "kernel" is fine as long as I can control policy in my "userspace". ;-) BTW, the git* name sorta imply my toilet instead of the core plumbing, and it'd be more consistent with the current plumbnaming; and could we have it with the .pl extension, please? :-) What I would like your script to do is therefore just do the merge in a given already prepared (including built index) directory, with a passed base. The base should be determined by a separate tool (I already saw some patches); most future "science" will probably go to a clever selection of this base, anyway. This will give the tool maximal flexibility. E.g., then someone who wants to can just merge with his working copy (if you don't give checkout-cache -f - but why would you anyway), or do whatever other cleverness he wants. > >> * show-diff updates to add -r flag to squelch diffs for files not in > >> the working directory. This is mainly useful when verifying the > >> result of an automated merge. ..snip.. > was too tired and did not think of a letter when I wrote it. I > guess '-r' stood for removed, but I agree it is a bad choice. > Any objections to '-q'? None here. > >> +# Create a temporary directory and go there. > >> +system 'rm', '-rf', ',,merge-temp'; > > PB> Can't we call it just ,,merge? > > I'd rather have a command line option '-o' (scrapping the > current '-o' and renaming it to something else; as you can see I > am terrible at picking option names ;-)) to mean "output to this > directory". I am not really an Arch person so I do not > particulary care about /^,,/. How about "git~merge~$$"? I'm all for an -o, and I don't mind ,, - I just don't want it uselessly long. I hope "git~merge~$$" was a joke... 
:-) > >> +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; } > >> +symlink "../../.git/objects", "objects"; > >> +chdir '..'; > >> + > >> +my $ancestor_tree = read_commit_tree($common); > >> +system 'read-tree', $ancestor_tree; > >> + > >> +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0])); > >> +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1])); > >> + > >> +my @ancestor_file = read_show_files(); > >> +my %ancestor_file = map { $_ => 1 } @ancestor_file; > >> + > >> +for (@ancestor_file) { > >> + if (! exists $tree0{$_} && ! exists $tree1{$_}) { By the way, what about indentation with tabs? If you have a strong opinion about this, I don't insist - but if you really don't mind/care either way, it'd be great to use tabs as in the rest of the git code. > >> + if ($full_checkout) { > >> + system 'checkout-cache', $_; > >> + } > >> + print STDERR "O - $_\n"; > > PB> Huh, what are you trying to do here? I think you should just record > PB> remove, no? (And I wouldn't do anything with my read-tree. ;-) > > At this moment in the script, we have run "read-tree" the > ancestor so the dircache has the original. %tree0 and %tree1 > both did not touch the path ($_ here) so it is the same as > ancestor. When '-f' is specified we are populating the output > working tree with the merge result so that is what that > 'checkout-cache' is about. "O - $path" means "we took the > original". Aha! Thanks. Is there a fundamental reason why the directory cache contains the ancestor instead of the destination branch? It makes no sense to me and I think the script actually does not fundamentally depend on it. My main motivation is that the user can then trivially see what is he actually going to commit to his destination branch, which would be bought for free by that. 
> The idea is to populate the dircache of merge-temp with the > merge result and leave uncertain stuff as in the common ancestor > state, so that the user can fix them starting from there. And this is another thing I dislike a lot. I'd like merge-tree.pl to leave my directory cache alone, thank you very much. You know, I see what goes to the directory cache as actually part of the policy part. What you actually do is interfering with my different policy choice, which is to record stuff to index only at the time of commit (I've asked about this and noone replied, so I assume it's an ok choice). show-diff does the right thing for me then, and I don't need to care about losing *any* information when replacing/rebuilding the index for any reason. I have full control, and I like that. :-) I'd be happy with parsing merge-tree.pl output and doing the right thing on my side. Of course I could then blast away the tediously modified index with my one, but I didn't need to do any such hacking before and I'd prefer not to now either. Actually, the only time I need to do explicit update-cache (with my policy) when doing git merge is when deleting stuff or adding new stuff; both of this is not so common as modifying, when I need not to do anything. > Maybe it is a good time for me to summarize the output somewhere > in a document. > > O - $path Tree-A and tree-B did not touch this; the result > is taken from the ancestor (O for original). > > A D $path Only tree-A (or tree-B) deleted this and the other > B D $path branch did not touch this; the result is to delete. > > A M $path Only tree-A (or tree-B) modified this and the other > B M $path branch did not touch this; the result is to use one > from tree-A (or tree-B). This includes file > creation case. Could we please have the file creation case separately? Modification is much more common and creation has pretty different consequences (especially that it can't combine with anything else :-). 
> *DD $path     Both tree-A and tree-B deleted this; the result
>               is to delete.
>
> *DM $path     Tree-A deleted while tree-B modified this (or
> *MD $path     vice versa), and manual conflict resolution is
>               needed; dircache is left as in the ancestor, and
>               the modified file is saved as $path~A~ in the
>               working directory. The user can rename it to $path
>               and run show-diff to see what Tree-A wanted to do
>               and decide before running update-cache.
>
> *MM $path     Tree-A and tree-B did the exact same
>               modification; the result is to use that.
>
> MRG $path     Tree-A and tree-B have different modifications;
>               run "merge" and the merge result is left as
>               $path in the working directory.

Hmm. I actually don't like this naming. I think it's not too consistent,
is irregular, therefore parsing it would be ugly. What I propose:

    12c\tname  <- legend
               <- original file
    D          <- tree #1 removed file
     D         <- tree #2 removed file
    DD         <- both trees removed file
    M          <- tree #1 modified file
     M
    DM*        <- conflict, tree #1 removed file, tree #2 modified file
    MD*
    MM         <- exact same modification
    MM*        <- different modifications, merging

This is generic, theoretically scales well even to more trees, is easy
to parse trivially, still is human readable (actually the asterisk in
the 'conflict' column is there basically only for the humans), is
completely regular and consistent.

Now that we have the notion of tree A and tree B gone, I'd prefer to use
numbers instead of letters for the ~1~ and ~2~ suffixes. Not insisting,
though.

What do you think?

> In cases other than *DM, *MD, and MRG, the result is trivial and
> is recorded in the dircache. Without '-o' (to be renamed ;-)
> nor '-f' there will not be a file checked out in the working
> directory for them. The three merge cases need human attention.
> The dircache is not touched in these cases and left as the
> ancestor version, and the working directory gets some file as
> described above.
> > NOTE NOTE NOTE: I am not dealing with a case where both branches > create the same file but with different contents. In such a > case the current code falls into MRG path without having a > common ancestor, which is nonsense---I can use /dev/null as the > common ancestor, I guess. Also NOTE NOTE NOTE I need to detect That might be the best way at least for the start, although I suspect that merge will fail horribly this way even in the case of slightest differences; still better than nothing. > the case where one branch creates a directory while the other > creates a file. There is nothing an automated tool can do in > that case but it needs to be detected and be told the user > loudly. Or when both branches create directories... ;-) -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 130+ messages in thread
* Live Merging from remote repositories 2005-04-14 19:35 ` Petr Baudis @ 2005-04-14 20:01 ` Barry Silverman 2005-04-14 23:22 ` Junio C Hamano 2005-04-14 20:23 ` Re: Merge with git-pasky II Erik van Konijnenburg 2005-04-14 23:12 ` Junio C Hamano 2 siblings, 1 reply; 130+ messages in thread From: Barry Silverman @ 2005-04-14 20:01 UTC (permalink / raw) To: Petr Baudis, Junio C Hamano; +Cc: Linus Torvalds, git If you are merging from many distributed developers, than you would need to replicate every one of their repositories into your own. Is this necessary? I have been looking at Junio's code for merging, and it looks like it would be (relatively) easy change to make it run live across two remote repositories - assuming "future" science to develop remote common ancestor lookup... IE, merge.pl $COMMON-BASE $LOCAL-CHANGESET remote::$REMOTE-CHANGESET To make this work, only a couple of things need to happen: 1) be able to remotely run "remote::diff-tree $BASE $REMOTE-CHANGESET", and copy the results over the net to the place in the script where it is done locally. This is not a LOT of data, and is bounded by the number of total number of blobs in the resulting tree. 2) When a remote blob is required (for merging, or copying), then copy it from the remote .git/objects to the local one. You only copy the blobs that will end up in the merged result (or be used for file merge). The way Junio has done it, no intermediate trees or commits are used... You don't copy remote's tree of commits between $BASE and $REMOTE-CHANGESET, or any of their associated trees and blobs (unless used to merge). Is this a bug or a feature? Barry Silverman ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Live Merging from remote repositories
  2005-04-14 20:01 ` Live Merging from remote repositories Barry Silverman
@ 2005-04-14 23:22 ` Junio C Hamano
  2005-04-15 1:07 ` Question about git process model Barry Silverman
  0 siblings, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-14 23:22 UTC (permalink / raw)
  To: Barry Silverman; +Cc: Petr Baudis, Junio C Hamano, Linus Torvalds, git

>>>>> "BS" == Barry Silverman <barry@disus.com> writes:

I have not thought about remote issues at all, other than the
distribution mechanism vaguely outlined in my previous mail (not
cc'ed to git list but I would not mind if you reproduced it here
if somebody asked), so I am not qualified to comment on that
part of your message.

BS> The way Junio has done it, no intermediate trees or commits
BS> are used...
BS> Is this a bug or a feature?

I would call that a feature in that there is no need to look at
intermediate state. I also might call that a misfeature in that
it may have resulted in a better merge if it looked at
intermediate state.

I just have this fuzzy feeling that, when doing this merge:

            A-1 --- A-2 --- A-3
           /                   \
    Common Ancestor          Merge Result
           \                   /
            B-1 --- B-2 --- B-3

looking at diff(Common Ancestor, A-1), diff(Common Ancestor,
B-1), diff(A-1, A-2), ... might give you richer context than
just merging 3-way using Common Ancestor, A-3, and B-3 to derive
the Merge Result. It might not. I honestly do not know.

BTW, Pasky, the above paragraph is my answer to your question in
the other message <20050414202016.GC22699@pasky.ji.cz>:

> But one different thing to note here.
>
> You say "merge these two trees" above (I take it that you mean
> "merge these two trees, taking account of this tree as their
> common ancestor", so actually you are dealing with three trees),
> and I am tending to agree with the notion of merging trees not
> commits. However you might get richer context and more sensible
> resulting merge if you say "merge these two commits". Since
> commit chaining is part of the fundamental git object model you
> may as well use it.

Pasky> Could you be more particular on the richer context etc?

^ permalink raw reply	[flat|nested] 130+ messages in thread
* Question about git process model
  2005-04-14 23:22 ` Junio C Hamano
@ 2005-04-15 1:07 ` Barry Silverman
  0 siblings, 0 replies; 130+ messages in thread
From: Barry Silverman @ 2005-04-15 1:07 UTC (permalink / raw)
  To: 'Junio C Hamano'
  Cc: 'Petr Baudis', 'Linus Torvalds', git

JH->Junio Hamano, LT->Linus Torvalds

JH>>I just have this fuzzy feeling that, when doing this merge:

            A-1 --- A-2 --- A-3
           /                   \
    Common Ancestor          Merge Result
           \                   /
            B-1 --- B-2 --- B-3

JH>>looking at diff(Common Ancestor, A-1), diff(Common Ancestor,
JH>>B-1), diff(A-1, A-2), ... might give you richer context than
JH>>just merging 3-way using Common Ancestor, A-3, and B-3 to derive
JH>>the Merge Result. It might not. I honestly do not know.

In the distributed git model, with lots of parallel development, and lots
of merging - Is it important at the "business process" level that
intermediate history (and content) be present for any forks off the
mainline (in particular, forks maintained by someone else)?

Git has the property that it is NOT delta based. In Junio's example above,
if branch A were your mainline, and branch B were imported changes from
elsewhere, Is it necessary to have B1, B2 available to you, when all that
was required for you to merge successfully was B3?

Which leads me to....

LT>>In particular, if you ever find yourself wanting to graft together two
LT>>different commit histories, that almost certainly is what you'd want to
LT>>do. Somebody might have arrived at the exact same tree some other way,
LT>>starting with a 2.6.12 tar.ball or something, and I think we should at
LT>>least support the notion of saying "these two totally unrelated commits
LT>>actually have the same base tree, so let's merge them in "space" (ie
LT>>data) even if we can't really sanely join them in "time" (ie "commits").

So in the case of only merging B3 - we would write into the commit record
of Merge-Result, that it was parented by A1, and another new commit record
-> B3-Prime.

B3-Prime is space-wise identical to B3, but "time-wise" different.

B3-Prime would be different than B3 because it would necessarily not have
the same SHA1 as B3. Why? B3 has B2 as a parent, B3-Prime has the Common
Ancestor as a parent - thus the Commit record is different, and so is the
SHA1.

We would need a facility to recognize that B3 and B3-Prime were space-wise
the same, (and maybe have the SHA for B3 kept in somewhere in the SCM
portion of B3-Prime's commit record???)

Linus, is that what you were saying?

^ permalink raw reply	[flat|nested] 130+ messages in thread
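The B3 versus B3-Prime point above (same tree, different parent, hence a
different SHA1) can be checked with a toy Python model. This is a simplified
sketch, not git's actual commit format: the record layout, the `commit_id`
helper and the placeholder ids are all assumptions made for illustration, and
real commit records also carry author, committer and date fields.

```python
import hashlib

def commit_id(tree, parents, message):
    """Hash a toy commit record made of a tree id, parent ids and a message.

    Hypothetical layout for illustration only; the point is just that the
    parent list is part of what gets hashed.
    """
    body = "tree %s\n" % tree
    body += "".join("parent %s\n" % p for p in parents)
    body += "\n" + message
    data = body.encode("utf-8")
    header = ("commit %d\0" % len(data)).encode("utf-8")
    return hashlib.sha1(header + data).hexdigest()

if __name__ == "__main__":
    tree = "tree-of-B3"  # the same tree ("space") in both commits
    b3 = commit_id(tree, ["id-of-B2"], "import changes")
    b3_prime = commit_id(tree, ["id-of-common-ancestor"], "import changes")
    print(b3 != b3_prime)  # True: different "time", different id
```

Since the tree id is itself stored in the record, "space-wise the same" can
be recognized by comparing the tree ids of the two commits rather than the
commit ids.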
* Re: Re: Merge with git-pasky II. 2005-04-14 19:35 ` Petr Baudis 2005-04-14 20:01 ` Live Merging from remote repositories Barry Silverman @ 2005-04-14 20:23 ` Erik van Konijnenburg 2005-04-14 20:24 ` Petr Baudis 2005-04-14 23:12 ` Junio C Hamano 2 siblings, 1 reply; 130+ messages in thread From: Erik van Konijnenburg @ 2005-04-14 20:23 UTC (permalink / raw) To: Petr Baudis; +Cc: Junio C Hamano, Linus Torvalds, git On Thu, Apr 14, 2005 at 09:35:07PM +0200, Petr Baudis wrote: > Hmm. I actually don't like this naming. I think it's not too consistent, > is irregular, therefore parsing it would be ugly. What I propose: > > 12c\tname <- legend > <- original file > D <- tree #1 removed file > D <- tree #2 removed file > DD <- both trees removed file > M <- tree #1 modified file > M > DM* <- conflict, tree #1 removed file, tree #2 modified file > MD* > MM <- exact same modification > MM* <- different modifications, merging > > This is generic, theoretically scales well even to more trees, is easy > to parse trivially, still is human readable (actually the asterisk in > the 'conflict' column is there basically only for the humans), is > completely regular and consistent. Detail: perhaps use underscore instead of space, to avoid space/tab typos that are invisible on paper and user friendly mail clients? Regards, Erik ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Re: Merge with git-pasky II. 2005-04-14 20:23 ` Re: Merge with git-pasky II Erik van Konijnenburg @ 2005-04-14 20:24 ` Petr Baudis 0 siblings, 0 replies; 130+ messages in thread From: Petr Baudis @ 2005-04-14 20:24 UTC (permalink / raw) To: Erik van Konijnenburg; +Cc: Junio C Hamano, Linus Torvalds, git Dear diary, on Thu, Apr 14, 2005 at 10:23:26PM CEST, I got a letter where Erik van Konijnenburg <ekonijn@xs4all.nl> told me that... > On Thu, Apr 14, 2005 at 09:35:07PM +0200, Petr Baudis wrote: > > Hmm. I actually don't like this naming. I think it's not too consistent, > > is irregular, therefore parsing it would be ugly. What I propose: > > > > 12c\tname <- legend > > <- original file > > D <- tree #1 removed file > > D <- tree #2 removed file > > DD <- both trees removed file > > M <- tree #1 modified file > > M > > DM* <- conflict, tree #1 removed file, tree #2 modified file > > MD* > > MM <- exact same modification > > MM* <- different modifications, merging > > > > This is generic, theoretically scales well even to more trees, is easy > > to parse trivially, still is human readable (actually the asterisk in > > the 'conflict' column is there basically only for the humans), is > > completely regular and consistent. > > Detail: perhaps use underscore instead of space, to avoid space/tab typos > that are invisible on paper and user friendly mail clients? I'd go for dots in that case. Looks less intrusive. :^) -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 130+ messages in thread
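The status format discussed in this sub-thread (one column per tree, a
conflict column, then a tab and the path, with dots for untouched columns)
can be sketched in Python as follows. It is only an illustration under
assumptions: trees are modelled as dicts of path to blob id, the name
`classify` is made up, and file creation is folded into "M" because the
proposal does not yet assign it a letter of its own.

```python
def classify(ancestor, tree1, tree2, blank="."):
    """Return sorted 'XYc<TAB>path' status lines for a three-way comparison.

    X and Y describe what tree #1 and tree #2 did relative to the ancestor
    ('D' removed, 'M' modified or created, blank untouched); c is '*' when
    the combination needs a merge or manual resolution.
    """
    def change(base, side):
        if side == base:
            return blank
        if side is None:
            return "D"
        return "M"

    lines = []
    for path in sorted(set(ancestor) | set(tree1) | set(tree2)):
        base = ancestor.get(path)
        side1, side2 = tree1.get(path), tree2.get(path)
        c1, c2 = change(base, side1), change(base, side2)
        if c1 == "M" and c2 == "M":
            conflict = side1 != side2       # MM only conflicts if they differ
        else:
            conflict = {c1, c2} == {"D", "M"}
        lines.append("%s%s%s\t%s" % (c1, c2, "*" if conflict else blank, path))
    return lines

if __name__ == "__main__":
    ancestor = {"keep": "a1", "gone1": "b1", "clash": "e1", "dm": "g1"}
    tree1 = {"keep": "a1", "clash": "e2"}
    tree2 = {"keep": "a1", "gone1": "b1", "clash": "e3", "dm": "g2"}
    print("\n".join(classify(ancestor, tree1, tree2)))
```

Because every line has the same three status characters before the tab, a
consumer can split on the first tab and index the columns directly.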
* Re: Merge with git-pasky II. 2005-04-14 19:35 ` Petr Baudis 2005-04-14 20:01 ` Live Merging from remote repositories Barry Silverman 2005-04-14 20:23 ` Re: Merge with git-pasky II Erik van Konijnenburg @ 2005-04-14 23:12 ` Junio C Hamano 2005-04-14 20:24 ` Christopher Li 2005-04-14 23:31 ` Petr Baudis 2 siblings, 2 replies; 130+ messages in thread From: Junio C Hamano @ 2005-04-14 23:12 UTC (permalink / raw) To: Petr Baudis; +Cc: Linus Torvalds, git >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: PB> What I would like your script to do is therefore just do the PB> merge in a given already prepared (including built index) PB> directory, with a passed base. The base should be determined PB> by a separate tool (I already saw some patches); most future PB> "science" will probably go to a clever selection of this PB> base, anyway. I think you are contradicting yourself for saying the above after agreeing with me that the script should just work on trees not commits. My understanding is that the tools is just to merge two related trees relative to another ancestor tree, nothing more. Especially, it should not care what is in the working directory---that is SCM person's business. I am just trying to follow my understanding of what Linus wanted. One of the guiding principle is to do as much things as in dircache without ever checking things out or touching working files unnecessarily. PB> This will give the tool maximal flexibility. I suspect it would force me to have a working directory populated with files, just to do a merge. PB> I'm all for an -o, and I don't mind ,, - I just don't want it uselessly PB> long. I hope "git~merge~$$" was a joke... :-) Which part do you object to? PID part? or tilde? Would git~merge do, perhaps? It probably would not matter to you because as an SCM you would always give an explicit --output parameter to the script anyway. PB> By the way, what about indentation with tabs? 
If you have a PB> strong opinion about this, I don't insist - but if you PB> really don't mind/care either way, it'd be great to use tabs PB> as in the rest of the git code. I do not have a strong opinion, but it is more trouble for me only because I am lazy and am used to the indentation my Emacs gives me. I write code other than git, so changing Perl-mode indentation setting globally for all .pl files is not an option for me. I'll see what I can do when I have time. PB> Is there a fundamental reason why the directory cache PB> contains the ancestor instead of the destination branch? Because you are thinking as an SCM person where there are distinction between tree-A and tree-B, two heads being merged. There is no "destination branch" nor "source branch" in what I am doing. It is a merge of two equals derived from the same ancestor. PB> I think the script actually does not fundamentally depend on it. My main PB> motivation is that the user can then trivially see what is he actually PB> going to commit to his destination branch, which would be bought for PB> free by that. And again the user is *not* commiting to his "destination branch". At the level I am working at, the merge result should be commited with two -p parameters to commit-tree --- tree-A and tree-B, both being equal parents from the POV of git object storage. PB> And this is another thing I dislike a lot. I'd like merge-tree.pl to PB> leave my directory cache alone, thank you very much. You know, I see PB> what goes to the directory cache as actually part of the policy part. Remember I am not touching *your* dircache. It is a dircache in the temporary merge area, specifically set up to help you review the merge. Can't the SCM driver do things along this line, perhaps? - You have your working files and your dircache. They may not match because you have uncommitted changes to your environment. You want to merge with Linus head. You know its SHA1 (call it COMMIT-Linus). 
Your SCM knows which commit you started with (call it COMMIT-Current). - First you merge the tree associated with COMMIT-Current. Use it and COMMIT-Linus to find the common ancestor to use. - Now use the tree SHA of COMMIT-Current, tree SHA1 of COMMIT-Linus, and tree SHA1 of the common ancestor commit to drive git-merge.perl (to be renamed ;-). You will get a temporary directory. Have your user examine what is in there, and fix the merge and have them tell you they are happy. - You go to that temporary directory, do write-tree and commit-tree with -p parameter of COMMIT-Linus and COMMIT-Current. This will result in a new commit. Call that COMMIT-Merge. - You, as an SCM, should know what your user have done in the working directory relative to COMMIT-Current. Especially you should know the set of paths involved in that change. Go in to the temporary area, checkout-cache those files if you have not done so. Apply the changes you have there. Optionally have the user examine the changes and have him confirm. Lift those files into the user's working directory. - Do your bookkeeping like "echo COMMIT-Merge >.git/Head", to make the user's working files based on COMMIT-Merge, and run read-tree using the COMMIT-Merge in the user's working directory. At this point, show-diff output should show what the changes your user have had made if he had started working based on COMMIT-Merge instead of starting from COMMIT-Current. I think the above would result in what SCM person would call "merge upstream/sidestream changes into my working directory". ^ permalink raw reply [flat|nested] 130+ messages in thread
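The "find the common ancestor" step in the workflow above can be sketched
with a naive graph walk in Python. This is a sketch under assumptions, not
the tool the thread ends up with: the commit graph is a plain dict mapping
each commit name to its list of parents, the function names are made up, and
histories with several equally good candidates are not handled specially.

```python
from collections import deque

def ancestors(parents, commit):
    """Return the set of commits reachable from `commit`, itself included."""
    seen = {commit}
    queue = deque([commit])
    while queue:
        for parent in parents.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def common_ancestor(parents, a, b):
    """Walk back breadth-first from `b`; return the first commit `a` also reaches."""
    reachable = ancestors(parents, a)
    seen = {b}
    queue = deque([b])
    while queue:
        commit = queue.popleft()
        if commit in reachable:
            return commit
        for parent in parents.get(commit, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None  # unrelated histories

if __name__ == "__main__":
    # Hypothetical history: both heads descend from "Common".
    parents = {
        "COMMIT-Current": ["A-1"], "A-1": ["Common"],
        "COMMIT-Linus": ["B-1"], "B-1": ["Common"],
        "Common": [],
    }
    print(common_ancestor(parents, "COMMIT-Current", "COMMIT-Linus"))
```

The three tree ids handed to the merge script would then be the trees of the
returned commit, COMMIT-Current and COMMIT-Linus.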
* Re: Merge with git-pasky II. 2005-04-14 23:12 ` Junio C Hamano @ 2005-04-14 20:24 ` Christopher Li 2005-04-14 23:31 ` Petr Baudis 1 sibling, 0 replies; 130+ messages in thread From: Christopher Li @ 2005-04-14 20:24 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git Hi Junio, I think if the merge tree belong to plumbing, you can do even less in the merge.perl. You can just print out the instruction for the upper level SCM what to to without actually doing it yourself. So you don't have to do touch anything in the tree. That is the way I use in my previous python script. You just print out some easy to modify e.g. in my python script it prints: (BTW, poor choice of print out name) check out tree 253290af8b9ebc8565dd8de4cda24d0432a92b57 modify pre-process.c 7684c115a87e41a9226ce79478101c746cf22c34 3way-merge check.c dcb970cc1c5a83284dc5986abf07b6da76a8758c f77bfe119c19d928879091e0e3ee6debe3f1e1bf d315b43b025350d0107568a4d42cc2494d38621d Your merge tree can do the smae. Then the supper level SCM can easily follow instruction. Save your effort and make no assumption what SCM module is. Chris On Thu, Apr 14, 2005 at 04:12:34PM -0700, Junio C Hamano wrote: > >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: > > I think you are contradicting yourself for saying the above > after agreeing with me that the script should just work on trees > not commits. My understanding is that the tools is just to > merge two related trees relative to another ancestor tree, > nothing more. Especially, it should not care what is in the > working directory---that is SCM person's business. > > I am just trying to follow my understanding of what Linus > wanted. One of the guiding principle is to do as much things as > in dircache without ever checking things out or touching working > files unnecessarily. > > PB> This will give the tool maximal flexibility. > > I suspect it would force me to have a working directory > populated with files, just to do a merge. 
> > PB> I'm all for an -o, and I don't mind ,, - I just don't want it uselessly > PB> long. I hope "git~merge~$$" was a joke... :-) > > Which part do you object to? PID part? or tilde? Would > git~merge do, perhaps? It probably would not matter to you > because as an SCM you would always give an explicit --output > parameter to the script anyway. > > PB> By the way, what about indentation with tabs? If you have a > PB> strong opinion about this, I don't insist - but if you > PB> really don't mind/care either way, it'd be great to use tabs > PB> as in the rest of the git code. > > I do not have a strong opinion, but it is more trouble for me > only because I am lazy and am used to the indentation my Emacs > gives me. I write code other than git, so changing Perl-mode > indentation setting globally for all .pl files is not an option > for me. I'll see what I can do when I have time. > > PB> Is there a fundamental reason why the directory cache > PB> contains the ancestor instead of the destination branch? > > Because you are thinking as an SCM person where there are > distinction between tree-A and tree-B, two heads being merged. > There is no "destination branch" nor "source branch" in what I > am doing. It is a merge of two equals derived from the same > ancestor. > > PB> I think the script actually does not fundamentally depend on it. My main > PB> motivation is that the user can then trivially see what is he actually > PB> going to commit to his destination branch, which would be bought for > PB> free by that. > > And again the user is *not* commiting to his "destination > branch". At the level I am working at, the merge result should > be commited with two -p parameters to commit-tree --- tree-A and > tree-B, both being equal parents from the POV of git object > storage. > > PB> And this is another thing I dislike a lot. I'd like merge-tree.pl to > PB> leave my directory cache alone, thank you very much. 
You know, I see > PB> what goes to the directory cache as actually part of the policy part. > > Remember I am not touching *your* dircache. It is a dircache in > the temporary merge area, specifically set up to help you review > the merge. > > Can't the SCM driver do things along this line, perhaps? > > - You have your working files and your dircache. They may not > match because you have uncommitted changes to your > environment. You want to merge with Linus head. You know > its SHA1 (call it COMMIT-Linus). Your SCM knows which commit > you started with (call it COMMIT-Current). > > - First you merge the tree associated with COMMIT-Current. Use > it and COMMIT-Linus to find the common ancestor to use. > > - Now use the tree SHA of COMMIT-Current, tree SHA1 of > COMMIT-Linus, and tree SHA1 of the common ancestor commit to > drive git-merge.perl (to be renamed ;-). You will get a > temporary directory. Have your user examine what is in > there, and fix the merge and have them tell you they are > happy. > > - You go to that temporary directory, do write-tree and > commit-tree with -p parameter of COMMIT-Linus and > COMMIT-Current. This will result in a new commit. Call that > COMMIT-Merge. > > - You, as an SCM, should know what your user have done in the > working directory relative to COMMIT-Current. Especially you > should know the set of paths involved in that change. Go in > to the temporary area, checkout-cache those files if you have > not done so. Apply the changes you have there. Optionally > have the user examine the changes and have him confirm. Lift > those files into the user's working directory. > > - Do your bookkeeping like "echo COMMIT-Merge >.git/Head", to > make the user's working files based on COMMIT-Merge, and run > read-tree using the COMMIT-Merge in the user's working > directory. 
At this point, show-diff output should show what > the changes your user have had made if he had started working > based on COMMIT-Merge instead of starting from > COMMIT-Current. > > I think the above would result in what SCM person would call > "merge upstream/sidestream changes into my working directory". > ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II. 2005-04-14 23:12 ` Junio C Hamano 2005-04-14 20:24 ` Christopher Li @ 2005-04-14 23:31 ` Petr Baudis 2005-04-14 20:30 ` Christopher Li ` (2 more replies) 1 sibling, 3 replies; 130+ messages in thread From: Petr Baudis @ 2005-04-14 23:31 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Dear diary, on Fri, Apr 15, 2005 at 01:12:34AM CEST, I got a letter where Junio C Hamano <junkio@cox.net> told me that... > >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: > > PB> What I would like your script to do is therefore just do the > PB> merge in a given already prepared (including built index) > PB> directory, with a passed base. The base should be determined > PB> by a separate tool (I already saw some patches); most future > PB> "science" will probably go to a clever selection of this > PB> base, anyway. > > I think you are contradicting yourself for saying the above > after agreeing with me that the script should just work on trees > not commits. My understanding is that the tools is just to > merge two related trees relative to another ancestor tree, > nothing more. Especially, it should not care what is in the > working directory---that is SCM person's business. Yes. Isn't this exactly what I'm saying? I'm arguing for doing less in my paragraph, you are arguing for doing less in your paragraph, and we even seem to agree on the direction in which we should do less. > I am just trying to follow my understanding of what Linus > wanted. One of the guiding principle is to do as much things as > in dircache without ever checking things out or touching working > files unnecessarily. I'm just arguing that instead of directly touching the directory cache, you should just list what would you do there - and you already do this, I think. So I'd be happy with a switch which would just do that and not touch the directory cache. I'll parse your output and do the right thing for me. > PB> This will give the tool maximal flexibility. 
> > I suspect it would force me to have a working directory > populated with files, just to do a merge. Why would that be so? > PB> I'm all for an -o, and I don't mind ,, - I just don't want it uselessly > PB> long. I hope "git~merge~$$" was a joke... :-) > > Which part do you object to? PID part? or tilde? Would > git~merge do, perhaps? It probably would not matter to you > because as an SCM you would always give an explicit --output > parameter to the script anyway. Yes. I'll just override it with ,,merge, I think. So, do whatever you want. ;-)) > PB> By the way, what about indentation with tabs? If you have a > PB> strong opinion about this, I don't insist - but if you > PB> really don't mind/care either way, it'd be great to use tabs > PB> as in the rest of the git code. > > I do not have a strong opinion, but it is more trouble for me > only because I am lazy and am used to the indentation my Emacs > gives me. I write code other than git, so changing Perl-mode > indentation setting globally for all .pl files is not an option > for me. I'll see what I can do when I have time. Doesn't Emacs have something equivalent to ./.vimrc? I've also seen those funny -*- strings. Well, if it would mean a lot of trouble for you, just forget about it. > PB> Is there a fundamental reason why the directory cache > PB> contains the ancestor instead of the destination branch? > > Because you are thinking as an SCM person where there are > distinction between tree-A and tree-B, two heads being merged. > There is no "destination branch" nor "source branch" in what I > am doing. It is a merge of two equals derived from the same > ancestor. That's a valid point of view too. Actually, when you would have a mode in which you would not write to the directory cache, do you need to read from it? You could do just direct cat-files like for the other trees, and it would be even faster. Then, you could do without a directory cache altogether in this mode. 
> PB> And this is another thing I dislike a lot. I'd like merge-tree.pl to > PB> leave my directory cache alone, thank you very much. You know, I see > PB> what goes to the directory cache as actually part of the policy part. > > Remember I am not touching *your* dircache. It is a dircache in > the temporary merge area, specifically set up to help you review > the merge. Yes, but I want to have a control over its dircache too. :-) That is because I want the user to be able to use the regular git commands like "git diff" there. > Can't the SCM driver do things along this line, perhaps? > > - You have your working files and your dircache. They may not > match because you have uncommitted changes to your > environment. You want to merge with Linus head. You know > its SHA1 (call it COMMIT-Linus). Your SCM knows which commit > you started with (call it COMMIT-Current). > > - First you merge the tree associated with COMMIT-Current. Use > it and COMMIT-Linus to find the common ancestor to use. > > - Now use the tree SHA of COMMIT-Current, tree SHA1 of > COMMIT-Linus, and tree SHA1 of the common ancestor commit to > drive git-merge.perl (to be renamed ;-). You will get a > temporary directory. Have your user examine what is in > there, and fix the merge and have them tell you they are > happy. > > - You go to that temporary directory, do write-tree and > commit-tree with -p parameter of COMMIT-Linus and > COMMIT-Current. This will result in a new commit. Call that > COMMIT-Merge. > > - You, as an SCM, should know what your user have done in the > working directory relative to COMMIT-Current. Especially you > should know the set of paths involved in that change. Go in > to the temporary area, checkout-cache those files if you have > not done so. Apply the changes you have there. Optionally > have the user examine the changes and have him confirm. Lift > those files into the user's working directory. 
> > - Do your bookkeeping like "echo COMMIT-Merge >.git/Head", to > make the user's working files based on COMMIT-Merge, and run > read-tree using the COMMIT-Merge in the user's working > directory. At this point, show-diff output should show what > the changes your user have had made if he had started working > based on COMMIT-Merge instead of starting from > COMMIT-Current. > > I think the above would result in what SCM person would call > "merge upstream/sidestream changes into my working directory". And that's exactly what I'm doing now with git merge. ;-) In fact, ideally the whole change in my scripts when your script is finished would be replacing checkout-cache `diff-tree` # symbolic git diff $base $merged | git apply with merge-tree.pl -b $base $(tree-id) $merged | parse-your-output -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-14 23:31 ` Petr Baudis
@ 2005-04-14 20:30   ` Christopher Li
  2005-04-14 20:37     ` Christopher Li
  2005-04-15  0:58   ` Junio C Hamano
  2005-04-15 10:22   ` Junio C Hamano
  2 siblings, 1 reply; 130+ messages in thread
From: Christopher Li @ 2005-04-14 20:30 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Junio C Hamano, Linus Torvalds, git

On Fri, Apr 15, 2005 at 01:31:59AM +0200, Petr Baudis wrote:
> > I am just trying to follow my understanding of what Linus
> > wanted.  One of the guiding principle is to do as much things as
> > in dircache without ever checking things out or touching working
> > files unnecessarily.
>
> I'm just arguing that instead of directly touching the directory cache,
> you should just list what would you do there - and you already do this,

That is exactly what I suggest in the previous email.
And my python script does exactly that ;-)

Chris

^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-14 20:30 ` Christopher Li
@ 2005-04-14 20:37   ` Christopher Li
  2005-04-14 20:50     ` Christopher Li
  0 siblings, 1 reply; 130+ messages in thread
From: Christopher Li @ 2005-04-14 20:37 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Junio C Hamano, Linus Torvalds, git

Is that some thing you want to see? Maybe clean up the error printing.

Chris

--- /dev/null	2003-01-30 05:24:37.000000000 -0500
+++ merge.py	2005-04-14 16:34:39.000000000 -0400
@@ -0,0 +1,76 @@
+#!/usr/bin/env python
+
+import re
+import sys
+import os
+from pprint import pprint
+
+def get_tree(commit):
+    data = os.popen("cat-file commit %s"%commit).read()
+    return re.findall(r"(?m)^tree (\w+)", data)[0]
+
+PREFIX = 0
+PATH = -1
+SHA = -2
+ORIGSHA = -3
+
+def get_difftree(old, new):
+    lines = os.popen("diff-tree %s %s"%(old, new)).read().split("\x00")
+    patterns = (r"(\*)(\d+)->(\d+)\s(\w+)\s(\w+)->(\w+)\s(.*)",
+                r"([+-])(\d+)\s(\w+)\s(\w+)\s(.*)")
+    res = {}
+    for l in lines:
+        if not l: continue
+        for p in patterns:
+            m = re.findall(p, l)
+            if m:
+                m = m[0]
+                res[m[-1]] = m
+                break
+        else:
+            raise "difftree: unknow line", l
+    return res
+
+def analyze(diff1, diff2):
+    diff1only = [ diff1[k] for k in diff1 if k not in diff2 ]
+    diff2only = [ diff2[k] for k in diff2 if k not in diff1 ]
+    both = [ (diff1[k],diff2[k]) for k in diff2 if k in diff1 ]
+
+    action(diff1only)
+    action(diff2only)
+    action_two(both)
+
+def action(diffs):
+    for act in diffs:
+        if act[PREFIX] == "*":
+            print "modify", act[PATH], act[SHA]
+        elif act[PREFIX] == '-':
+            print "remove", act[PATH], act[SHA]
+        elif act[PREFIX] == '+':
+            print "add", act[PATH], act[SHA]
+        else:
+            raise "unknow action"
+
+def action_two(diffs):
+    for act1, act2 in diffs:
+        if len(act1) == len(act2):      # same kind type
+            if act1[PREFIX] == act2[PREFIX]:
+                if act1[SHA] == act2[SHA] or act1[PREFIX] == '-':
+                    return action(act1)
+                if act1[PREFIX]=='*':
+                    print "do_merge", act1[PATH], act1[ORIGSHA], act1[SHA], act2[SHA]
+                    return
+        print "unable to handle", act[PATH]
+        print "one side wants", act1[PREFIX]
+        print "the other side wants", act2[PREFIX]
+
+
+args = sys.argv[1:]
+if len(args)!=3:
+    print "Usage merge.py <common> <rev1> <rev2>"
+trees = map(get_tree, args)
+print "checkout-tree", trees[0]
+diff1 = get_difftree(trees[0], trees[1])
+diff2 = get_difftree(trees[0], trees[2])
+analyze(diff1, diff2)
+

^ permalink raw reply	[flat|nested] 130+ messages in thread
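[Editorial sketch] The classification that merge.py performs (changes seen on one side only, on both sides identically, or on both sides differently) might be restated in Python 3 roughly as below. This is only an illustration under assumptions: `Change` and `plan_merge` are hypothetical names, the diff entries are assumed to be parsed already, and nothing here is the script that was posted. Like the patch, it only lists what an upper layer should do and touches no files.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Change(NamedTuple):
    """One parsed diff-tree entry (hypothetical model, not a git structure)."""
    kind: str                # "+" added, "-" removed, "*" modified
    path: str
    old_sha: Optional[str]   # blob in the common ancestor (None for "+")
    new_sha: Optional[str]   # blob on this side (None for "-")


def plan_merge(side1: Dict[str, Change], side2: Dict[str, Change]) -> List[Tuple]:
    """Return instruction tuples for the upper layer; nothing is executed."""
    verbs = {"+": "add", "-": "remove", "*": "modify"}
    plan: List[Tuple] = []
    for path in sorted(set(side1) | set(side2)):
        a, b = side1.get(path), side2.get(path)
        if a is None or b is None:
            # Only one side touched this path: take that side's change.
            c = a if a is not None else b
            plan.append((verbs[c.kind], path, c.new_sha))
        elif a.kind == b.kind and a.new_sha == b.new_sha:
            # Both sides made the identical change (e.g. the same patch applied twice).
            plan.append((verbs[a.kind], path, a.new_sha))
        elif a.kind == b.kind == "*":
            # Both modified the file differently: a file-level 3-way merge is needed.
            plan.append(("do_merge", path, a.old_sha, a.new_sha, b.new_sha))
        else:
            # Anything else (add vs. add with different content, remove vs. modify, ...).
            plan.append(("conflict", path, a.kind, b.kind))
    return plan
```

Unlike the posted loop, this sketch keeps going after the first path changed on both sides, so every path gets exactly one instruction.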
* Re: Re: Merge with git-pasky II. 2005-04-14 20:37 ` Christopher Li @ 2005-04-14 20:50 ` Christopher Li 0 siblings, 0 replies; 130+ messages in thread From: Christopher Li @ 2005-04-14 20:50 UTC (permalink / raw) To: Petr Baudis; +Cc: Junio C Hamano, Linus Torvalds, git BTW, I am not competing with Junio script. If that is the way we all agree on. It is should be very easy for Junio to fix his perl script. right? Chris On Thu, Apr 14, 2005 at 04:37:17PM -0400, Christopher Li wrote: > Is that some thing you want to see? Maybe clean up the error printing. > > > Chris > > --- /dev/null 2003-01-30 05:24:37.000000000 -0500 > +++ merge.py 2005-04-14 16:34:39.000000000 -0400 > @@ -0,0 +1,76 @@ > +#!/usr/bin/env python > + > +import re > +import sys > +import os > +from pprint import pprint > + > +def get_tree(commit): > + data = os.popen("cat-file commit %s"%commit).read() > + return re.findall(r"(?m)^tree (\w+)", data)[0] > + > +PREFIX = 0 > +PATH = -1 > +SHA = -2 > +ORIGSHA = -3 > + > +def get_difftree(old, new): > + lines = os.popen("diff-tree %s %s"%(old, new)).read().split("\x00") > + patterns = (r"(\*)(\d+)->(\d+)\s(\w+)\s(\w+)->(\w+)\s(.*)", > + r"([+-])(\d+)\s(\w+)\s(\w+)\s(.*)") > + res = {} > + for l in lines: > + if not l: continue > + for p in patterns: > + m = re.findall(p, l) > + if m: > + m = m[0] > + res[m[-1]] = m > + break > + else: > + raise "difftree: unknow line", l > + return res > + > +def analyze(diff1, diff2): > + diff1only = [ diff1[k] for k in diff1 if k not in diff2 ] > + diff2only = [ diff2[k] for k in diff2 if k not in diff1 ] > + both = [ (diff1[k],diff2[k]) for k in diff2 if k in diff1 ] > + > + action(diff1only) > + action(diff2only) > + action_two(both) > + > +def action(diffs): > + for act in diffs: > + if act[PREFIX] == "*": > + print "modify", act[PATH], act[SHA] > + elif act[PREFIX] == '-': > + print "remove", act[PATH], act[SHA] > + elif act[PREFIX] == '+': > + print "add", act[PATH], act[SHA] > + else: > + raise "unknow action" > + 
> +def action_two(diffs): > + for act1, act2 in diffs: > + if len(act1) == len(act2): # same kind type > + if act1[PREFIX] == act2[PREFIX]: > + if act1[SHA] == act2[SHA] or act1[PREFIX] == '-': > + return action(act1) > + if act1[PREFIX]=='*': > + print "do_merge", act1[PATH], act1[ORIGSHA], act1[SHA], act2[SHA] > + return > + print "unable to handle", act[PATH] > + print "one side wants", act1[PREFIX] > + print "the other side wants", act2[PREFIX] > + > + > +args = sys.argv[1:] > +if len(args)!=3: > + print "Usage merge.py <common> <rev1> <rev2>" > +trees = map(get_tree, args) > +print "checkout-tree", trees[0] > +diff1 = get_difftree(trees[0], trees[1]) > +diff2 = get_difftree(trees[0], trees[2]) > +analyze(diff1, diff2) > + > - > To unsubscribe from this list: send the line "unsubscribe git" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-14 23:31 ` Petr Baudis
  2005-04-14 20:30   ` Christopher Li
@ 2005-04-15  0:58   ` Junio C Hamano
  2005-04-14 22:30     ` Christopher Li
  2005-04-15 19:54     ` Re: Merge with git-pasky II Petr Baudis
  2005-04-15 10:22   ` Junio C Hamano
  2 siblings, 2 replies; 130+ messages in thread
From: Junio C Hamano @ 2005-04-15  0:58 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, git

>>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:

>> I think the above would result in what SCM person would call
>> "merge upstream/sidestream changes into my working directory".

PB> And that's exactly what I'm doing now with git merge. ;-) In fact,
PB> ideally the whole change in my scripts when your script is finished
PB> would be replacing

PB> 	checkout-cache `diff-tree` # symbolic
PB> 	git diff $base $merged | git apply

PB> with

PB> 	merge-tree.pl -b $base $(tree-id) $merged | parse-your-output

In the above I presume by $merged you mean the tree ID (or
commit ID) the user's working directory is based upon?  Well,
merge-trees (Linus has a single directory merge-tree already)
looks at tree IDs (or commit IDs); it would never involve
working files in random state that is not recorded as part of a
tree (committed or not).  Given those constraints I am not sure
how well that would pan out.  I have to think about this a bit.

I do like, however, the idea of separating the step of deciding
on any checkout/merge etc. from actually doing them.  So the
command set of parse-your-output needs to be defined.  Based on
what I have done so far, it would consist of the following:

 - Result is this object $SHA1 with mode $mode at $path (takes
   one of the trees); you can do update-cache --cacheinfo (if
   you want to muck with dircache) or cat-file blob (if you want
   to get the file) or both.

 - Result is to delete $path.

 - Result is a merge between object $SHA1-1 and $SHA1-2 with
   mode $mode-1 or $mode-2 at $path.

Would this be a good enough command set?

PB> Doesn't Emacs have something equivalent to ./.vimrc? I've also seen
PB> those funny -*- strings.

The former is global per user (that is me including other Perl
files I work on outside of git context), which is exactly what I
said is unacceptable to me.  The latter is per file (applying to
everybody else who touches the file), so if it is short and
sweet I should use one.

^ permalink raw reply	[flat|nested] 130+ messages in thread
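[Editorial sketch] The three-result command set proposed above could be modelled as plain records that a consumer (the "parse-your-output" side) applies however it likes. The following is a minimal sketch under assumptions: the class names `Take`, `Delete` and `Merge`, the function `apply_plan`, and the in-memory `index` dictionary are all hypothetical and stand in for whatever the SCM layer really uses; no git command is invoked.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Take:
    """Result is blob `sha1` with `mode` at `path` (taken from one of the trees)."""
    path: str
    mode: int
    sha1: str


@dataclass(frozen=True)
class Delete:
    """Result is to delete `path`."""
    path: str


@dataclass(frozen=True)
class Merge:
    """Result is a file-level merge between two blobs at `path`."""
    path: str
    mode1: int
    sha1_1: str
    mode2: int
    sha1_2: str


Instruction = Union[Take, Delete, Merge]


def apply_plan(index: Dict[str, Tuple[int, str]], plan: List[Instruction]):
    """Apply Take/Delete to a copy of `index`; hand Merge entries back untouched."""
    result = dict(index)          # never mutate the caller's index
    pending: List[Merge] = []
    for ins in plan:
        if isinstance(ins, Take):
            result[ins.path] = (ins.mode, ins.sha1)
        elif isinstance(ins, Delete):
            result.pop(ins.path, None)
        else:
            pending.append(ins)   # policy decision left to the SCM layer
    return result, pending
```

Keeping the merge entries separate reflects the split discussed in the thread: the plumbing reports what should happen, and the layer above decides how and where to carry it out.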
* Re: Merge with git-pasky II.
  2005-04-15  0:58 ` Junio C Hamano
@ 2005-04-14 22:30   ` Christopher Li
  2005-04-15  7:43     ` Junio C Hamano
  2005-04-15 19:54   ` Re: Merge with git-pasky II Petr Baudis
  1 sibling, 1 reply; 130+ messages in thread
From: Christopher Li @ 2005-04-14 22:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git

On Thu, Apr 14, 2005 at 05:58:25PM -0700, Junio C Hamano wrote:
>
> I do like, however, the idea of separating the step of doing any
> checkout/merge etc. and actually doing them.  So the command set
> of parse-your-output needs to be defined.  Based on what I have
> done so far, it would consist of the following:
>
>  - Result is this object $SHA1 with mode $mode at $path (takes
>    one of the trees); you can do update-cache --cacheinfo (if
>    you want to muck with dircache) or cat-file blob (if you want
>    to get the file) or both.

Is that SHA1 for a tree or a file object? If it is a tree, it doesn't
need the $mode any more. If it is a file, you might need to emit an
entry for its parent directory, including the mode of the directory.

>
>  - Result is to delete $path.
>
>  - Result is a merge between object $SHA1-1 and $SHA1-2 with
>    mode $mode-1 or $mode-2 at $path.
>
> Would this be a good enough command set?

And of course an error/command for the files that cannot be merged
automatically, including information on both revisions.
That needs to be defined as well.

Chris

^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-14 22:30 ` Christopher Li
@ 2005-04-15  7:43   ` Junio C Hamano
  2005-04-15  6:28     ` Christopher Li
  0 siblings, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-15  7:43 UTC (permalink / raw)
  To: Christopher Li; +Cc: Petr Baudis, Linus Torvalds, git

>>>>> "CL" == Christopher Li <git@chrisli.org> writes:

>> - Result is this object $SHA1 with mode $mode at $path (takes
>> one of the trees); you can do update-cache --cacheinfo (if
>> you want to muck with dircache) or cat-file blob (if you want
>> to get the file) or both.

CL> Is that SHA1 for tree or the file object?

I am talking about a single file here.

^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-15  7:43 ` Junio C Hamano
@ 2005-04-15  6:28   ` Christopher Li
  2005-04-15 11:11     ` Junio C Hamano
  0 siblings, 1 reply; 130+ messages in thread
From: Christopher Li @ 2005-04-15 6:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git

On Fri, Apr 15, 2005 at 12:43:47AM -0700, Junio C Hamano wrote:
> >>>>> "CL" == Christopher Li <git@chrisli.org> writes:
>
> CL> Is that SHA1 for tree or the file object?
>
> I am talking about a single file here.
>

Then do you emit the entry for its parent directory?
E.g. /foo/bar gets created and foo doesn't exist. You have
to create foo first, but you don't have mode information for foo yet.

If it gives the top-level tree, the SCM can check it out by
tree and hopefully get the modes on the directories right.

Well, if they care about those little details.

Chris

^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-15  6:28 ` Christopher Li
@ 2005-04-15 11:11   ` Junio C Hamano
       [not found]     ` <7vaco0i3t9.fsf_-_@assigned-by-dhcp.cox.net>
  0 siblings, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-15 11:11 UTC (permalink / raw)
  To: Christopher Li, Linus Torvalds; +Cc: Petr Baudis, git

>>>>> "CL" == Christopher Li <git@chrisli.org> writes:

CL> Then do you emit the entry for it's parents directory?

In GIT object model, directory modes do not matter.  It is not
designed to record directories, and running "update-cache --add
foo" when foo is a directory fails.

The data model of GIT is that it associates file datablob to a
string called "pathname" that happens to contain slashes in it.
It is kinda weird.  When you externalize it with checkout-cache,
these slashes are mapped to hierarchical UNIX filesystem paths,
relative to wherever you happened to run checkout-cache.  The
hierarchical "tree" representation in the GIT database was
started as just a space optimization thing.

CL> e.g. /foo/bar get created. foo doesn't exists. You have
CL> to create foo first. You don't have mode information for
CL> foo yet.

And you will never have that information, since it is not
recorded anywhere.  If I say you should have foo/bar (by the
way, no leading slashes are placed in the dircache either), and
if it so happens that you do not have foo yet, you'd better
create one without waiting to be told, because I will never tell
you to just create a directory.

By the way, Linus, while I was studying how the new hierarchical
trees are written out, I think I have found one small funny (I
would not call this a *bug*) there.

Here is an excerpt from write-tree (around ll. 56; I am basing
on pasky-0.4 so your line numbers may have some offsets):

    sha1 = ce->sha1;
    mode = ntohl(ce->st_mode);

    /* Do we have _further_ subdirectories? */
    filename = pathname + baselen;
    dirname = strchr(filename, '/');
    if (dirname) {
        int subdir_written;

        subdir_written = write_tree(cachep + nr, maxentries - nr, pathname, dirname-pathname+1, subdir_sha1);
        nr += subdir_written;

        /* Now we need to write out the directory entry into this tree.. */
        mode |= S_IFDIR;
        pathlen = dirname - pathname;

        /* ..but the directory entry doesn't count towards the total count */
        nr--;
        sha1 = subdir_sha1;
    }

This code is going through a flat list of cache entries sorted
by pathnames.  The list is flat in the sense that the pathnames
are like "foo/bar" i.e. with slashes inside.  The if() statement
there, upon seeing "foo/bar", slurps all the entries in foo/
subhierarchy and writes into a separate tree, recursively, to
"represent" foo/.

Notice what mode the "tree" object gets in this case?  File mode
for foo/bar (or whatever happens to be sorted the first among
the stuff in dircache from foo/ directory) ORed with S_IFDIR.  I
think this is nonsense, and we should just store constant
S_IFDIR.  Another option, probably better from the SCM purist's
POV, would be to start recording directories in dircaches, so
that people can actually keep track of directory modes.

Does it matter? --- I would say not.  GIT does not have to be
tar or cpio.

^ permalink raw reply	[flat|nested] 130+ messages in thread
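[Editorial sketch] The idea described above, turning a flat list of slash-separated pathnames into nested trees whose directory entries carry a constant S_IFDIR rather than a file's mode bits, can be illustrated in a few lines of Python. This is only a model under assumptions: `build_tree` is a hypothetical helper, entries are `(path, mode, sha1)` tuples, and no tree objects are hashed or written, so it is not how write-tree itself works.

```python
import stat
from typing import Dict, List, Tuple


def build_tree(entries: List[Tuple[str, int, str]]) -> Dict:
    """Nest flat (path, mode, sha1) entries into {name: (mode, sha1_or_subtree)}."""
    root: Dict = {}
    for path, mode, sha1 in sorted(entries):
        *dirs, name = path.split("/")
        node = root
        for d in dirs:
            # Directories are created on demand and always get plain S_IFDIR;
            # the mode of whichever file sorts first is never mixed in.
            node = node.setdefault(d, (stat.S_IFDIR, {}))[1]
        node[name] = (mode, sha1)
    return root
```

Creating the intermediate directories on demand also mirrors the answer given in the message: a consumer told to produce foo/bar is expected to create foo itself, since no separate directory entry is ever emitted.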
[parent not found: <7vaco0i3t9.fsf_-_@assigned-by-dhcp.cox.net>]
* Re: write-tree is pasky-0.4
       [not found] ` <7vaco0i3t9.fsf_-_@assigned-by-dhcp.cox.net>
@ 2005-04-15 18:44   ` Linus Torvalds
  2005-04-15 18:56     ` Petr Baudis
  2005-04-15 20:10     ` Junio C Hamano
  0 siblings, 2 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-15 18:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, git

On Fri, 15 Apr 2005, Junio C Hamano wrote:
>
> Linus, sorry for bothering you with a false alarm.  The problem
> turns out to be introduced in pasky-0.4 and does not exist in
> your HEAD.

Hey, all the code I write is always perfect, of course ;)

That said, I'm having some trouble merging with your perfect code,
especially since I decided that Russell's "always big-endian" thing was
definitely the right way to go (but ended up doing it slightly
differently).

I did my own version of "update-cache --cacheinfo", although mine is a bit
more anal, and if you add a new filename it wants that "--add" flag in
there first (why? I really like to make sure that people who add or remove
files from the cache say so explicitly, so that there are no surprises).
Otherwise it should be compatible with yours.

And I merged your "Add -z option to show-files", but you had based your
other patches on Petr's tree which due to my other changes is not going to
merge totally cleanly with mine, so I'm wondering if you might want to try
to re-merge your mergepoint stuff against my current tree? That way I can
continue to maintain a set of "core files", and Pasky can maintain the
"usable interfaces" part..

		Linus

^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: Re: write-tree is pasky-0.4 2005-04-15 18:44 ` write-tree is pasky-0.4 Linus Torvalds @ 2005-04-15 18:56 ` Petr Baudis 2005-04-15 20:13 ` Linus Torvalds 2005-04-15 20:10 ` Junio C Hamano 1 sibling, 1 reply; 130+ messages in thread From: Petr Baudis @ 2005-04-15 18:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, git Dear diary, on Fri, Apr 15, 2005 at 08:44:02PM CEST, I got a letter where Linus Torvalds <torvalds@osdl.org> told me that... > And I merged your "Add -z option to show-files", but you had based your > other patches on Petr's tree which due to my other changes is not going to > merge totally cleanly with mine, so I'm wondering if you might want to try > to re-merge your mergepoint stuff against my current tree? That way I can > continue to maintain a set of "core files", and Pasky can maintain the > "usable interfaces" part.. Actually, I wanted to ask about this. :-) So, I assume that you don't want to merge my "SCM layer" (which is perfectly fine by me). However, I also apply plenty of patches concerning the "core git" - be it portability, leak fixes, argument parsing fixes and so on. Would it be of any benefit if I maintained two trees, one with just your core git but what I merge (I think I'd call this branch git-pb), and one with my git-pasky (to be renamed to Cogito) layer. I'd then put the "core git" changes to the git-pb branch and pull from it to the Cogito branch regularily, but it should be safe for you to pull from it too. In fact, in that case I might even end up entirely separating the Cogito tools from the core git and distributing them independently. BTW, just out of interest, are you personally planning to use Cogito for your kernel and sparse (and possibly even git) work, or will you stay with your lowlevel plumbing for that? Thanks, -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: write-tree is pasky-0.4 2005-04-15 18:56 ` Petr Baudis @ 2005-04-15 20:13 ` Linus Torvalds 2005-04-15 22:36 ` Petr Baudis 0 siblings, 1 reply; 130+ messages in thread From: Linus Torvalds @ 2005-04-15 20:13 UTC (permalink / raw) To: Petr Baudis; +Cc: Junio C Hamano, git On Fri, 15 Apr 2005, Petr Baudis wrote: > > So, I assume that you don't want to merge my "SCM layer" (which is > perfectly fine by me). However, I also apply plenty of patches > concerning the "core git" - be it portability, leak fixes, argument > parsing fixes and so on. I'm actually perfectly happy to merge your SCM layer too eventually, but I'm nervous at this point. Especially while people are discussing some SCM options that I'm personally very leery of, and think that may make sense for others, but that I personally distrust. > BTW, just out of interest, are you personally planning to use Cogito for > your kernel and sparse (and possibly even git) work, or will you stay > with your lowlevel plumbing for that? I'm really really hoping I'd use cogito, and that it ends up being just one project. In particular, I'm hoping that in a few days, I'll have done enough plumbing that I don't even care any more, and then I'd not even maintain a tree of my own. I'm really not that much of an SCM guy. I detest pretty much all SCM's out there, and while it's been interesting to do 'git', I've done it because I was forced to, and because I really wanted to put _my_ needs and opinions first in an SCM, and see how that works. That's why I've been so adamant about having a "philosophy", because otherwise I'd probably just end up with yet another SCM that I'd despise. So for me, the "optimal" situation really ends up that you guys end up as the maintainers. I don't even _want_ to maintain it, although I'd be more than happy to be part of the engineering team. I just want to mark out the direction well enough and get it to a point where I can _use_ it, that I feel like I'm done. 
But before I can do that, I need to feel like I can live with the end
result. The only missing part is merges, and I think you and Junio are
getting pretty close (with Daniel's parent finder, Junio's merger etc).

		Linus

^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: Re: Re: write-tree is pasky-0.4
  2005-04-15 20:13 ` Linus Torvalds
@ 2005-04-15 22:36 ` Petr Baudis
  2005-04-16  0:22 ` Linus Torvalds
  0 siblings, 1 reply; 130+ messages in thread
From: Petr Baudis @ 2005-04-15 22:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git

Dear diary, on Fri, Apr 15, 2005 at 10:13:21PM CEST, I got a letter
where Linus Torvalds <torvalds@osdl.org> told me that...
> 
> 
> On Fri, 15 Apr 2005, Petr Baudis wrote:
> > 
> > So, I assume that you don't want to merge my "SCM layer" (which is
> > perfectly fine by me). However, I also apply plenty of patches
> > concerning the "core git" - be it portability, leak fixes, argument
> > parsing fixes and so on.
> 
> I'm actually perfectly happy to merge your SCM layer too eventually, but
> I'm nervous at this point. Especially while people are discussing some
> SCM options that I'm personally very leery of, and think that may make
> sense for others, but that I personally distrust.

You mean the rename tracking and similar, as yet mostly theoretical,
discussions? Or do you dislike something already implemented? I'd be
happy to hear about it in that case. (To argue about it and likely get
persuaded... ;-)

But otherwise it is great news to me. Actually, in that case, is it
worth renaming it to Cogito and using cg to invoke it? Wouldn't that
actually be more confusing after it gets merged? IOW, should I stick to
"git" or feel free to rename it to "cg"?

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: Re: Re: write-tree is pasky-0.4 2005-04-15 22:36 ` Petr Baudis @ 2005-04-16 0:22 ` Linus Torvalds 2005-04-16 1:13 ` Daniel Barkalow 2005-04-16 15:34 ` Re: Re: " Petr Baudis 0 siblings, 2 replies; 130+ messages in thread From: Linus Torvalds @ 2005-04-16 0:22 UTC (permalink / raw) To: Petr Baudis; +Cc: Junio C Hamano, git On Sat, 16 Apr 2005, Petr Baudis wrote: > > But otherwise it is great news to me. Actually, in that case, is it > worth renaming it to Cogito and using cg to invoke it? Wouldn't be that > actually more confusing after it gets merged? IOW, should I stick to > "git" or feel free to rename it to "cg"? I'm perfectly happy for it to stay as "git", and in general I don't have any huge preferences either way. You guys can discuss names as much as you like, it's the "tracking renames" and "how to merge" things that worry me. I think I've explained my name tracking worries. When it comes to "how to merge", there's three issues: - we do commonly have merge clashes where both trees have applied the exact same patch. That should merge perfectly well using the 3-way merge from a common parent that Junio has, but not your current "bring patches forward" kind of strategy. - I _do_ actually sometimes merge with dirty state in my working directory, which is why I want the merge to take place in a separate (and temporary) directory, which allows for a failed merge without having any major cleanup. If the merge fails, it's not a big deal, and I can just blow the merge directory away without losing the work I had in my "real" working directory. - reliability. I care much less for "clever" than I care for "guaranteed to never do the wrong thing". If I have to fix up some stuff by hand, I'll happily do so. But if I can't trust the merge and have to _check_ things by hand afterwards, that will make me leery of the merges, and _that_ is bad. The third point is why I'm going to the ultra-conservative "three-way merge from the common parent". 
It's not fancy, but it's something I feel comfortable with as a merge strategy. For example, arch (and in particular darcs) seems to want to try to be "clever" about the merges, and I'd always live in fear. And, finally, there's obviously performance. I _think_ a normal merge with nary a conflict and just a few tens of files changed should be possible in a second. I realize that sounds crazy to some people, but I think it's entirely doable. Half of that is writing the new tree out (that is a relative costly op due to the compression). The other half is the "work". Linus ^ permalink raw reply [flat|nested] 130+ messages in thread
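Editor's sketch: the conservative "three-way merge from the common parent" described above can be illustrated at the tree level. This is a minimal sketch, not the merge-tree or read-tree code from the thread. It assumes each tree is a plain mapping from path to blob ID, and the function name `merge_trees` here is a hypothetical Python helper, unrelated to Junio's Perl script of the same name.

```python
def merge_trees(ancestor, ours, theirs):
    """Three-way merge of path -> blob-id mappings (hypothetical helper).

    Returns (merged, conflicts). A path missing from a tree is treated
    as None, so additions and deletions follow the same rules as edits.
    """
    merged, conflicts = {}, []
    for path in sorted(set(ancestor) | set(ours) | set(theirs)):
        base = ancestor.get(path)
        a = ours.get(path)
        b = theirs.get(path)
        if a == b:
            result = a              # both sides agree, e.g. the same patch applied twice
        elif a == base:
            result = b              # only their side changed this path
        elif b == base:
            result = a              # only our side changed this path
        else:
            conflicts.append(path)  # both changed it differently: needs a file-level merge
            continue
        if result is not None:      # None means the path was deleted
            merged[path] = result
    return merged, conflicts


ancestor = {"Makefile": "m1", "cache.h": "c1", "README": "r1"}
ours = {"Makefile": "m2", "cache.h": "c1", "README": "r2"}
theirs = {"Makefile": "m2", "cache.h": "c2", "README": "r3"}
merged, conflicts = merge_trees(ancestor, ours, theirs)
print(merged)     # {'Makefile': 'm2', 'cache.h': 'c2'}
print(conflicts)  # ['README']
```

Note how the first rule covers the "both trees have applied the exact same patch" case: identical results on both sides merge cleanly without looking at the ancestor at all.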
* Re: Re: Re: write-tree is pasky-0.4 2005-04-16 0:22 ` Linus Torvalds @ 2005-04-16 1:13 ` Daniel Barkalow 2005-04-16 2:18 ` Linus Torvalds 2005-04-16 15:34 ` Re: Re: " Petr Baudis 1 sibling, 1 reply; 130+ messages in thread From: Daniel Barkalow @ 2005-04-16 1:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Petr Baudis, Junio C Hamano, git On Fri, 15 Apr 2005, Linus Torvalds wrote: > I think I've explained my name tracking worries. When it comes to "how to > merge", there's three issues: > > - we do commonly have merge clashes where both trees have applied the > exact same patch. That should merge perfectly well using the 3-way > merge from a common parent that Junio has, but not your current "bring > patches forward" kind of strategy. I think 3-way merge is probably the best starting point, but I think that there might be value in being able to identify the commits of each side involved in a conflict. I think this would help with cases where both sides pick up an identical patch, and then each side makes a further change to a different part of the changed region (you find out that the other guy's change was supposed to follow the patch, and don't conflict with it). > - I _do_ actually sometimes merge with dirty state in my working > directory, which is why I want the merge to take place in a separate > (and temporary) directory, which allows for a failed merge without > having any major cleanup. If the merge fails, it's not a big deal, and > I can just blow the merge directory away without losing the work I had > in my "real" working directory. Is there some reason you don't commit before merging? All of the current merge theory seems to want to merge two commits, using the information git keeps about them. It should be cheap to get a new clean working directory to merge in, too, particularly if we add a cache of hardlinkable expanded blobs. > - reliability. I care much less for "clever" than I care for "guaranteed > to never do the wrong thing". 
If I have to fix up some stuff by hand, > I'll happily do so. But if I can't trust the merge and have to _check_ > things by hand afterwards, that will make me leery of the merges, and > _that_ is bad. > > The third point is why I'm going to the ultra-conservative "three-way > merge from the common parent". It's not fancy, but it's something I feel > comfortable with as a merge strategy. For example, arch (and in particular > darcs) seems to want to try to be "clever" about the merges, and I'd > always live in fear. How much do you care about the situation where there is no best common ancestor (which can happen if you're merging two main lines, each of which has merged with both of a pair of minor trees)? I think that arch is even more conservative, in that it doesn't look for a common ancestor, and reports conflicts whenever changes overlap at all. Of course, reliability by virtue of never working without help is not a big win over living in fear; you always have to check over it, not because you're afraid, but because it needs you to. > And, finally, there's obviously performance. I _think_ a normal merge with > nary a conflict and just a few tens of files changed should be possible in > a second. I realize that sounds crazy to some people, but I think it's > entirely doable. Half of that is writing the new tree out (that is a > relative costly op due to the compression). The other half is the "work". I think that the time spent on I/O will be overwhelmed by the time spent issuing the command at that rate. It might matter if you start getting into merging lots of things at once, but that's more like a minute for a merge group with 600 changes rather than a second per merge; we could potentially save a lot of time based of having a bunch of information left over from the previous merge when starting merge number 2. So 15 seconds plus half a second per merge might be better than a second per merge in the case that matters. 
	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: Re: Re: write-tree is pasky-0.4 2005-04-16 1:13 ` Daniel Barkalow @ 2005-04-16 2:18 ` Linus Torvalds 2005-04-16 2:49 ` Daniel Barkalow 0 siblings, 1 reply; 130+ messages in thread From: Linus Torvalds @ 2005-04-16 2:18 UTC (permalink / raw) To: Daniel Barkalow; +Cc: Petr Baudis, Junio C Hamano, git On Fri, 15 Apr 2005, Daniel Barkalow wrote: > > Is there some reason you don't commit before merging? All of the current > merge theory seems to want to merge two commits, using the information git > keeps about them. Note that the 3-way merge would _only_ merge the committed state. The thing is, 99% of all merges end up touching files that I never touch myself (ie other architectures), so me being able to merge them even when _I_ am in the middle of something is a good thing. So even when I have dirty state, the "merge" would only merge the clean state. And then before the merge information is put back into my working directory, I'd do a "check-files" on the result, making sure that nothing that got changed by the merge isn't up-to-date. > How much do you care about the situation where there is no best common > ancestor I care. Even if the best common parent is 3 months ago, I care. I'd much rather get a big explicit conflict than a "clean merge" that ends up being debatable because people played games with per-file merging or something questionable like that. > I think that the time spent on I/O will be overwhelmed by the time spent > issuing the command at that rate. There is no time at all spent on IO. All my email is local, and if this all ends up working out well, I can track the other peoples object trees in local subdirectories with some daily rsyncs. And I have enough memory in my machines that there is basically no disk IO - the only tree I normally touch is the kernel trees, they all stay in cache. Linus ^ permalink raw reply [flat|nested] 130+ messages in thread
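Editor's sketch: the final safety check described above (refuse to update the working directory if the merge changed a file that has local edits) might look roughly like this. It is a sketch under stated assumptions, not the behavior of "check-files" itself: trees are modeled as path-to-blob-ID mappings, and `blocked_paths` is a hypothetical name.

```python
def blocked_paths(old_tree, merged_tree, dirty_paths):
    """Return the paths the merge would rewrite that also have local edits.

    old_tree / merged_tree: path -> blob-id mappings before and after the merge.
    dirty_paths: paths whose working-directory copy differs from old_tree.
    An empty result means the merged tree can be checked out safely.
    """
    touched = {
        path
        for path in set(old_tree) | set(merged_tree)
        if old_tree.get(path) != merged_tree.get(path)
    }
    return sorted(touched & set(dirty_paths))


old = {"arch/ppc/setup.c": "p1", "kernel/sched.c": "s1"}
new = {"arch/ppc/setup.c": "p2", "kernel/sched.c": "s1"}

# Local work in a file the merge did not touch: nothing blocks the checkout.
print(blocked_paths(old, new, {"kernel/sched.c"}))    # []
# Local work in a file the merge rewrote: the checkout step should fail.
print(blocked_paths(old, new, {"arch/ppc/setup.c"}))  # ['arch/ppc/setup.c']
```

The merge itself never consults `dirty_paths`; only the last "show the result" step does, which matches the separation described in the message.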
* Re: Re: Re: write-tree is pasky-0.4 2005-04-16 2:18 ` Linus Torvalds @ 2005-04-16 2:49 ` Daniel Barkalow 2005-04-16 3:13 ` Linus Torvalds 0 siblings, 1 reply; 130+ messages in thread From: Daniel Barkalow @ 2005-04-16 2:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: Petr Baudis, Junio C Hamano, git On Fri, 15 Apr 2005, Linus Torvalds wrote: > On Fri, 15 Apr 2005, Daniel Barkalow wrote: > > > > Is there some reason you don't commit before merging? All of the current > > merge theory seems to want to merge two commits, using the information git > > keeps about them. > > Note that the 3-way merge would _only_ merge the committed state. The > thing is, 99% of all merges end up touching files that I never touch > myself (ie other architectures), so me being able to merge them even when > _I_ am in the middle of something is a good thing. > > So even when I have dirty state, the "merge" would only merge the clean > state. And then before the merge information is put back into my working > directory, I'd do a "check-files" on the result, making sure that nothing > that got changed by the merge isn't up-to-date. So you want to merge someone else's tree into your committed state, and then merge the result with your working directory to get the working directory you continue with, provided that the second merge is trivial? > > How much do you care about the situation where there is no best common > > ancestor > > I care. Even if the best common parent is 3 months ago, I care. I'd much > rather get a big explicit conflict than a "clean merge" that ends up being > debatable because people played games with per-file merging or something > questionable like that. Are you thinking that the best common ancestor is the one that ties up absolutely all of the chains of commits, or the closest one that the sides have in common? 
I have the feeling that the former isn't going to be useful, because there will be lines you're considering merging which go back to ancient kernels, where they keep merging in your changes, but they still have a lineage back to 2.6.0 or something. For the latter, there are sometimes multiple ancestors which fit this criterion, and different ones of them are most helpful for different portions of the merge. I think this primarily happens when a branch you want to merge has accepted multiple patches that you've also accepted (and the history identifies this fact); this may or may not be a situation you want to allow on a regular basis. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Re: write-tree is pasky-0.4 2005-04-16 2:49 ` Daniel Barkalow @ 2005-04-16 3:13 ` Linus Torvalds 2005-04-16 3:56 ` Daniel Barkalow 2005-04-16 6:59 ` Paul Jackson 0 siblings, 2 replies; 130+ messages in thread From: Linus Torvalds @ 2005-04-16 3:13 UTC (permalink / raw) To: Daniel Barkalow; +Cc: Petr Baudis, Junio C Hamano, git On Fri, 15 Apr 2005, Daniel Barkalow wrote: > > So you want to merge someone else's tree into your committed state, and > then merge the result with your working directory to get the working > directory you continue with, provided that the second merge is trivial? No, you don't even "merge" the working directory. The low-level tools should entirely ignore the working directory. To a low-level merge, the working directory doesn't even exist. It just gets three commits (or trees) and merges two of them with the third as a parent, and does all of it in it's own temporary "merge working directory". So on a technical level, the "plumbing" part really really doesn't care at all. However, from a _usability_ part, you expect after a merge that your working directory has been updated to be the merged tree. And that's where the "if I have a working tree that is dirty, I want that part to fail" comes in. In other words, the final phase (after the "tree-merge" has actually successfully already finished) is to go back to the working directory, and check out the merged results. But that checkout would be a variation on "checkout-cache -a" which first checks that none of the files it is going to overwrite are dirty. Don't worry about this part. It's really totally separate from the true merge itself. The "real work" has already been done by the time we notice that "oops, we can't actually show him the newly merged tree, because he has got dirty data where we want to show it". > > I care. Even if the best common parent is 3 months ago, I care. 
I'd much > > rather get a big explicit conflict than a "clean merge" that ends up being > > debatable because people played games with per-file merging or something > > questionable like that. > > Are you thinking that the best common ancestor is the one that ties up > absolutely all of the chains of commits, or the closest one that the sides > have in common? The closest common one. > For the latter, there are sometimes multiple ancestors which fit this > criterion Yes. Let's just pick one at random (or more likely, the latest one by date - let's not actually be _random_ random) at first. There are other heuristics we can try, ie if it turns out that it's common to have a couple of alternatives (but no more than some small number, say five or so), we can literally just -try- to do a tree-only merge, and see how many lines out common output you get from "diff-tree". Because that "how mnay files do we need to merge" is the number you want to minimize, and doing a couple of extra "diff-tree" + "join" operations should be so fast that nobody will notice that we actually tried five different merges to see which one looked the best. But hey, especially if the merge fails with real clashes (ie there are changes in common and running "merge" leaves conflicts), and there were other alternate parents to choose, there's nothing wrong with just printing them out and saying "you might try to specify one of these manually". I really don't think we should worry too much about this until we've actually used the system for a while and seen what it does. So just start with "nearest common parent with most recent date". Which I think you already implemented, no? Linus ^ permalink raw reply [flat|nested] 130+ messages in thread
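Editor's sketch: the heuristic above (when several common parents are possible, prefer the one that leaves the fewest files to merge, and otherwise the most recent one) can be sketched as follows. This is an illustration only. It assumes trees are path-to-blob-ID mappings and that dates are plain numbers; `changed_paths` and `pick_base` are hypothetical names standing in for the "diff-tree" + "join" pipeline mentioned in the message.

```python
def changed_paths(old, new):
    """Paths whose blob id differs between two trees (a stand-in for diff-tree)."""
    return {p for p in set(old) | set(new) if old.get(p) != new.get(p)}


def pick_base(candidates, ours, theirs):
    """Choose a merge base from (commit_id, date, tree) candidates.

    The cost of a candidate is the number of paths changed on BOTH sides
    relative to it (the "join" of the two diffs). Ties go to the most
    recent date.
    """
    def cost(candidate):
        _commit_id, date, tree = candidate
        overlap = changed_paths(tree, ours) & changed_paths(tree, theirs)
        return (len(overlap), -date)

    return min(candidates, key=cost)[0]


ours = {"a": "2", "b": "2"}
theirs = {"a": "2", "b": "3"}
candidates = [
    ("old", 100, {"a": "1", "b": "1"}),  # both sides changed a and b: overlap of 2
    ("new", 200, {"a": "2", "b": "1"}),  # both sides changed only b: overlap of 1
]
print(pick_base(candidates, ours, theirs))  # new
```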
* Re: Re: Re: write-tree is pasky-0.4 2005-04-16 3:13 ` Linus Torvalds @ 2005-04-16 3:56 ` Daniel Barkalow 2005-04-16 6:59 ` Paul Jackson 1 sibling, 0 replies; 130+ messages in thread From: Daniel Barkalow @ 2005-04-16 3:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: Petr Baudis, Junio C Hamano, git On Fri, 15 Apr 2005, Linus Torvalds wrote: > On Fri, 15 Apr 2005, Daniel Barkalow wrote: > > > > So you want to merge someone else's tree into your committed state, and > > then merge the result with your working directory to get the working > > directory you continue with, provided that the second merge is trivial? > > No, you don't even "merge" the working directory. > > The low-level tools should entirely ignore the working directory. To a > low-level merge, the working directory doesn't even exist. It just gets > three commits (or trees) and merges two of them with the third as a > parent, and does all of it in it's own temporary "merge working > directory". It seems like users won't expect there to be a new working directory for the merge in which they are supposed to resolve te conflicts, but where they don't see their uncommited changes. In any case, the low-level tools have to care about *some* working directory, even if it isn't the parent of .git, and the parent of .git seems like where other similar things happen. If we're being conservative about merging, we're likely to report a lot of conflicts, at least until we work out better techniques than a simple 3-way merge. > > For the latter, there are sometimes multiple ancestors which fit this > > criterion > > Yes. Let's just pick one at random (or more likely, the latest one by > date - let's not actually be _random_ random) at first. Okay; I've currently got the one where the number of generations it is away from the further head is the smallest, and of equal ones, an arbitrary choice. If people are generally similar in the amount they diverge before commiting, this should be the most similar ancestor. 
> There are other heuristics we can try, ie if it turns out that it's common > to have a couple of alternatives (but no more than some small number, say > five or so), we can literally just -try- to do a tree-only merge, and see > how many lines out common output you get from "diff-tree". > > Because that "how mnay files do we need to merge" is the number you want > to minimize, and doing a couple of extra "diff-tree" + "join" operations > should be so fast that nobody will notice that we actually tried five > different merges to see which one looked the best. > > But hey, especially if the merge fails with real clashes (ie there are > changes in common and running "merge" leaves conflicts), and there were > other alternate parents to choose, there's nothing wrong with just > printing them out and saying "you might try to specify one of these > manually". I think we should be able to get good results out of doing the 5 merges and reporting a conflict only if there's a conflict in all of them; it shouldn't be possible for two to succeed but give different results (if it did, clearly our current algorithm is unsafe, since it would give some undesired output if it happened to use the wrong ancestor). I'm thinking of not actually calling "merge(1)" for this at all; it just calls diff3, and diff3 is only 1745 lines including option parsing. We can probably arrange to look around for better ancestors in case of conflicts we'd otherwise have to report, and get this all tidy and more efficient than having diff3 re-read files. And if we only go to other ancestors in case of conflicts, we're going to be a lot faster total than getting a reaction from the user, almost no matter what we do. > I really don't think we should worry too much about this until we've > actually used the system for a while and seen what it does. So just start > with "nearest common parent with most recent date". Which I think you > already implemented, no? 
I've got something like that (see above); did you want it in some form other than the patch I sent you? -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 130+ messages in thread
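Editor's sketch: the ancestor choice Daniel describes (the common ancestor whose distance, in generations, from the further of the two heads is smallest) can be sketched with a breadth-first walk. This is not the patch he sent. It assumes the history is available as a mapping from commit ID to a list of parent IDs, and the names `generations` and `nearest_common_ancestor` are hypothetical. Ties are broken by commit ID only to keep the sketch deterministic; the message calls that choice arbitrary.

```python
from collections import deque


def generations(parents, head):
    """Breadth-first distance, in generations, from head to each of its ancestors."""
    dist = {head: 0}
    queue = deque([head])
    while queue:
        commit = queue.popleft()
        for parent in parents.get(commit, ()):
            if parent not in dist:
                dist[parent] = dist[commit] + 1
                queue.append(parent)
    return dist


def nearest_common_ancestor(parents, head_a, head_b):
    """Common ancestor minimizing the distance from the further head, or None."""
    dist_a = generations(parents, head_a)
    dist_b = generations(parents, head_b)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None
    return min(common, key=lambda c: (max(dist_a[c], dist_b[c]), c))


# A -> B -> C -> D and X -> C -> D: C is two generations from A and one from X.
history = {"A": ["B"], "B": ["C"], "X": ["C"], "C": ["D"], "D": []}
print(nearest_common_ancestor(history, "A", "X"))  # C
```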
* Re: write-tree is pasky-0.4
  2005-04-16  3:13 ` Linus Torvalds
  2005-04-16  3:56 ` Daniel Barkalow
@ 2005-04-16  6:59 ` Paul Jackson
  1 sibling, 0 replies; 130+ messages in thread
From: Paul Jackson @ 2005-04-16  6:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: barkalow, pasky, junkio, git

One trick I've used to separate good automatic merges from ones
that need human interaction is to run both the 'patch' and 'merge'
commands, which use different approaches to determining the result.

If they agree, take it.

To apply the changes between file1 and file2 to filez:

	diff -au file1 file2 | patch -f filez
	merge -q filez file1 file2

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
^ permalink raw reply	[flat|nested] 130+ messages in thread
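Editor's sketch: the "run two different mergers and accept only if they agree" trick can be expressed as a small gate. This sketch deliberately does not shell out to 'patch' or 'merge'; it assumes each strategy is a Python callable taking (base, ours, theirs) text and returning merged text or raising on failure. The name `cross_checked_merge` and the toy strategies are hypothetical.

```python
def cross_checked_merge(base, ours, theirs, strategies):
    """Return merged text only if every strategy succeeds and all results agree.

    Returns None when any strategy fails or when two strategies disagree,
    which is the signal that a human should look at the merge.
    """
    results = []
    for strategy in strategies:
        try:
            results.append(strategy(base, ours, theirs))
        except Exception:
            return None  # a failed strategy means no automatic answer
    if results and all(r == results[0] for r in results):
        return results[0]
    return None


# Toy strategies standing in for wrappers around two real tools.
def take_theirs_if_ours_unchanged(base, ours, theirs):
    if ours != base:
        raise ValueError("ours was modified")
    return theirs


def always_theirs(base, ours, theirs):
    return theirs


both = [take_theirs_if_ours_unchanged, always_theirs]
print(cross_checked_merge("v1", "v1", "v2", both))  # v2
print(cross_checked_merge("v1", "v9", "v2", both))  # None
```

In the setting of the message, `base` would be file1, `theirs` file2 and `ours` filez, and each tool would need its own copy of filez since both commands modify it in place.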
* Re: Re: Re: Re: write-tree is pasky-0.4 2005-04-16 0:22 ` Linus Torvalds 2005-04-16 1:13 ` Daniel Barkalow @ 2005-04-16 15:34 ` Petr Baudis 1 sibling, 0 replies; 130+ messages in thread From: Petr Baudis @ 2005-04-16 15:34 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, git Dear diary, on Sat, Apr 16, 2005 at 02:22:45AM CEST, I got a letter where Linus Torvalds <torvalds@osdl.org> told me that... > > > On Sat, 16 Apr 2005, Petr Baudis wrote: > > > > But otherwise it is great news to me. Actually, in that case, is it > > worth renaming it to Cogito and using cg to invoke it? Wouldn't be that > > actually more confusing after it gets merged? IOW, should I stick to > > "git" or feel free to rename it to "cg"? > > I'm perfectly happy for it to stay as "git", and in general I don't have > any huge preferences either way. You guys can discuss names as much as you > like, it's the "tracking renames" and "how to merge" things that worry me. :-) > I think I've explained my name tracking worries. When it comes to "how to > merge", there's three issues: > > - we do commonly have merge clashes where both trees have applied the > exact same patch. That should merge perfectly well using the 3-way > merge from a common parent that Junio has, but not your current "bring > patches forward" kind of strategy. My current "bring patches forward" strategy is only very interim, to have something working well enough for me to merge with you. I will gladly change it to use merge-tree*, when it is done. (Or read-tree -m - I will yet have to have a look, but it looks extremely promising.) > - I _do_ actually sometimes merge with dirty state in my working > directory, which is why I want the merge to take place in a separate > (and temporary) directory, which allows for a failed merge without > having any major cleanup. If the merge fails, it's not a big deal, and > I can just blow the merge directory away without losing the work I had > in my "real" working directory. Ok. 
But still, especially when you do some nontrivial conflicts resolving, how do you check if it even compiles after the merge? Or do you just commit it and possibly fix the compilation in another commit? > - reliability. I care much less for "clever" than I care for "guaranteed > to never do the wrong thing". If I have to fix up some stuff by hand, > I'll happily do so. But if I can't trust the merge and have to _check_ > things by hand afterwards, that will make me leery of the merges, and > _that_ is bad. > > The third point is why I'm going to the ultra-conservative "three-way > merge from the common parent". It's not fancy, but it's something I feel > comfortable with as a merge strategy. For example, arch (and in particular > darcs) seems to want to try to be "clever" about the merges, and I'd > always live in fear. I agree and I would like to achieve the same. I too think the three-way merge from the common parent is the best way to go for now. > And, finally, there's obviously performance. I _think_ a normal merge with > nary a conflict and just a few tens of files changed should be possible in > a second. I realize that sounds crazy to some people, but I think it's > entirely doable. Half of that is writing the new tree out (that is a > relative costly op due to the compression). The other half is the "work". Being written in shell, there is plenty of space for optimization - from using bash internals instead of textutils to rewriting parts of it in C. My priority now is to get it right first, though. :-) -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: write-tree is pasky-0.4
  2005-04-15 18:44 ` write-tree is pasky-0.4 Linus Torvalds
  2005-04-15 18:56 ` Petr Baudis
@ 2005-04-15 20:10 ` Junio C Hamano
  2005-04-15 20:58 ` C. Scott Ananian
  2005-04-15 21:48 ` [PATCH 1/2] merge-trees script for Linus git Junio C Hamano
  1 sibling, 2 replies; 130+ messages in thread
From: Junio C Hamano @ 2005-04-15 20:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> Hey, all the code I write is always perfect, of course ;)

And you are always right ;-) Liked that blast-from-the-past?

LT> That said, I'm having some trouble merging with your perfect code,
LT> especially since I decided that Russell's "always big-endian" thing was
LT> definitely the right way to go (but ended up doing it slightly
LT> differently).

LT> I did my own version of "update-cache --cacheinfo", although
LT> mine is a bit more anal, and if you add a new filename it
LT> wants that "--add" flag in there first (why? I really like
LT> to make sure that people who add or remove files from the
LT> cache say so explicitly, so that there are no surprises).
LT> Otherwise it should be compatible with yours.

Thanks. Not honoring "--add" was an oversight on my part.

LT> And I merged your "Add -z option to show-files", but you had
LT> based your other patches on Petr's tree which due to my
LT> other changes is not going to merge totally cleanly with
LT> mine, so I'm wondering if you might want to try to re-merge
LT> your mergepoint stuff against my current tree?

My pleasure. I am currently less interested in the toilet part than
in the plumbing part, so rebasing my tree back to yours is no problem
for me. Currently I see your HEAD is at
461aef08823a18a6c69d472499ef5257f8c7f6c8, so I will generate a
set of patches against it.

^ permalink raw reply	[flat|nested] 130+ messages in thread
* Re: write-tree is pasky-0.4 2005-04-15 20:10 ` Junio C Hamano @ 2005-04-15 20:58 ` C. Scott Ananian 2005-04-15 21:22 ` Petr Baudis 2005-04-15 23:16 ` Junio C Hamano 2005-04-15 21:48 ` [PATCH 1/2] merge-trees script for Linus git Junio C Hamano 1 sibling, 2 replies; 130+ messages in thread From: C. Scott Ananian @ 2005-04-15 20:58 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, Petr Baudis, git On Fri, 15 Apr 2005, Junio C Hamano wrote: > to yours is no problem for me. Currently I see your HEAD is at > 461aef08823a18a6c69d472499ef5257f8c7f6c8, so I will generate a > set of patches against it. Have you considered using an s/key-like system to make these hashes more human-readable? Using the S/Key translation (11-bit chunks map to a 1-4 letter word), Linus' HEAD is at: WOW-SCAN-NAVE-AUK-JILL-BASH-HI-LACE-LID-RIDE-RUSE-LINE-GLEE-WICK-A ...which is a little longer, but speaking of branch "wow-scan" (which gives 22 bits of disambiguation) is probably less error-prone than discussing branch '461...' (only 12 bits). You could supercharge this algorithm by using (say) /usr/dict/american-english-large (>2^17 words; 160 bits of hash = 10 dictionary words), or mixing upper and lower case (likely to reduce the 15 word s/key phrase to ~11 words) to give something like RiDe-Rift-rIMe-rOSy-ScaR-sCat-ShiN-sIde-Sine-seeK-TIEd-TINT My personal feeling is that case is likely to be dropped in casual conversation, so speaking of branch 'wow', 'wow-scan', or 'wow-scan-nave' is likely to be significantly more useful than trying to pronounce mixed-cased versions of these. This is obviously a cogito issue, rather than a git-fs thing. --scott [More info is in RFCs 2289 and 1760, although all I'm really using from these is the word dictionary in the appendix.] http://www.faqs.org/rfcs/rfc1760.html http://www.faqs.org/rfcs/rfc2289.html SKIMMER MKOFTEN Ft. 
Bragg Sabana Seca ESMERALDITE NORAD HTAUTOMAT radar interception Pakistan BOND Kennedy postcard corporate globalization ( http://cscott.net/ ) ^ permalink raw reply [flat|nested] 130+ messages in thread
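Editor's sketch: the arithmetic behind the proposal above is that a 160-bit hash splits into 15 chunks of 11 bits (14 full chunks plus 6 leftover bits), so two words identify 22 bits where three hex digits identify only 12. The sketch below shows the chunking only. It does NOT use the RFC 1760 dictionary: `placeholder_word` generates made-up four-letter words as a stand-in, and padding the final chunk with zero bits on the right is an assumption, not something the message specifies.

```python
WORD_BITS = 11                   # one word per 11-bit chunk, 2048 possible words
CONSONANTS = "bdfghjklmnprstvz"  # 16 letters, encodes 4 bits
VOWELS = "aeio"                  # 4 letters, encodes 2 bits


def placeholder_word(index):
    """Map an 11-bit index to a made-up word (stand-in for a real dictionary)."""
    return (
        CONSONANTS[(index >> 7) & 15]
        + VOWELS[(index >> 5) & 3]
        + CONSONANTS[(index >> 1) & 15]
        + "ao"[index & 1]
    )


def hash_to_words(hex_id):
    """Split a hex object id into 11-bit chunks, most significant chunk first."""
    total_bits = len(hex_id) * 4
    word_count = -(-total_bits // WORD_BITS)  # ceiling division
    # Assumption: pad with zero bits on the right to fill the last chunk.
    padded = int(hex_id, 16) << (word_count * WORD_BITS - total_bits)
    mask = (1 << WORD_BITS) - 1
    return [
        placeholder_word((padded >> (WORD_BITS * i)) & mask)
        for i in reversed(range(word_count))
    ]


words = hash_to_words("461aef08823a18a6c69d472499ef5257f8c7f6c8")
print(len(words))            # 15
print("-".join(words[:2]))   # a two-word prefix covers the first 22 bits
```

With a real 2048-word dictionary in place of `placeholder_word`, the same chunking would produce phrases of the kind shown in the message.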
* Re: Re: write-tree is pasky-0.4 2005-04-15 20:58 ` C. Scott Ananian @ 2005-04-15 21:22 ` Petr Baudis 2005-04-15 23:16 ` Junio C Hamano 1 sibling, 0 replies; 130+ messages in thread From: Petr Baudis @ 2005-04-15 21:22 UTC (permalink / raw) To: C. Scott Ananian; +Cc: Junio C Hamano, Linus Torvalds, git Dear diary, on Fri, Apr 15, 2005 at 10:58:10PM CEST, I got a letter where "C. Scott Ananian" <cscott@cscott.net> told me that... > On Fri, 15 Apr 2005, Junio C Hamano wrote: > > >to yours is no problem for me. Currently I see your HEAD is at > >461aef08823a18a6c69d472499ef5257f8c7f6c8, so I will generate a > >set of patches against it. > > Have you considered using an s/key-like system to make these hashes more > human-readable? Using the S/Key translation (11-bit chunks map to a 1-4 > letter word), Linus' HEAD is at: > WOW-SCAN-NAVE-AUK-JILL-BASH-HI-LACE-LID-RIDE-RUSE-LINE-GLEE-WICK-A > ...which is a little longer, but speaking of branch "wow-scan" (which > gives 22 bits of disambiguation) is probably less error-prone than > discussing branch '461...' (only 12 bits). > > You could supercharge this algorithm by using (say) > /usr/dict/american-english-large (>2^17 words; 160 bits of hash = 10 > dictionary words), or mixing upper and lower case (likely to reduce the 15 > word s/key phrase to ~11 words) to give something like > RiDe-Rift-rIMe-rOSy-ScaR-sCat-ShiN-sIde-Sine-seeK-TIEd-TINT > My personal feeling is that case is likely to be dropped in casual > conversation, so speaking of branch 'wow', 'wow-scan', or 'wow-scan-nave' > is likely to be significantly more useful than trying to pronounce > mixed-cased versions of these. > > This is obviously a cogito issue, rather than a git-fs thing. I kind of like it, the only thing I fear is possible conflict with branch names; it is not very likely though, I think. I believe (at least) the first three words should be used if possible. I'm not sure in what cases do you think we should use those "verbal" names, though. 
Of course we should accept them as IDs, but I don't think we should ever show them automatically. Probably provide a trivial to use tool to convert to them, and parameters for *-id tools to show them. I assume we would have a custom tool for the translation? -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: write-tree is pasky-0.4 2005-04-15 20:58 ` C. Scott Ananian 2005-04-15 21:22 ` Petr Baudis @ 2005-04-15 23:16 ` Junio C Hamano 1 sibling, 0 replies; 130+ messages in thread From: Junio C Hamano @ 2005-04-15 23:16 UTC (permalink / raw) To: C. Scott Ananian; +Cc: Linus Torvalds, Petr Baudis, git >>>>> "CSA" == C Scott Ananian <cscott@cscott.net> writes: CSA> On Fri, 15 Apr 2005, Junio C Hamano wrote: >> to yours is no problem for me. Currently I see your HEAD is at >> 461aef08823a18a6c69d472499ef5257f8c7f6c8, so I will generate a >> set of patches against it. CSA> Have you considered using an s/key-like system to make these hashes CSA> more human-readable? Using the S/Key translation (11-bit chunks map CSA> to a 1-4 CSA> letter word), Linus' HEAD is at: CSA> WOW-SCAN-NAVE-AUK-JILL-BASH-HI-LACE-LID-RIDE-RUSE-LINE-GLEE-WICK-A CSA> ...which is a little longer, but speaking of branch "wow-scan" (which CSA> gives 22 bits of disambiguation) is probably less error-prone than CSA> discussing branch '461...' (only 12 bits). I understand monotone folks have the same issue and they let you use unambiguous prefix string. And why do you stop counting at "461" in your example? To my eyes, "461aef" in this particular string stands out and is easily typable, which gives me 24 bits ;-). But seriously I doubt the hex format is needed to be shown to humans very often. E-mail communications like this one being a very special exception. I do not expect for people to be talking about "Hey, Junio's patch against 461aef... from Linus is a total crap" like that. The only reason I mentioned his then-HEAD by hex is because I do not have a public archive for him to pull from, and I wanted to make it easy for him to do: $ export SHA1_FILE_DIRECTORY $ mkdir junk && cd junk && mkdir .git && read-tree `cat-file commit 461aef... | sed -e 's/^tree //;q'` $ patch < ../stupid-patch-from-junio-01 $ show-diff (it might have been better if I used the tree ID for this purpose). 
For Cogito users the hex format does not matter. "git pull" will get whatever HEAD is recorded in the file on the sending end, and the end user does not even have to know about it.

CSA> This is obviously a cogito issue, rather than a git-fs thing.

Yes.
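The prefix arithmetic in this exchange (each hex digit carries 4 bits, so "461" gives 12 bits and "461aef" gives 24) and the "unambiguous prefix" lookup attributed to monotone can be sketched in C roughly as follows. This is only an illustration: prefix_bits() and resolve_prefix() are hypothetical names, not functions from git or monotone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bits of disambiguation carried by a hex prefix: 4 bits per hex digit. */
unsigned prefix_bits(const char *hex_prefix)
{
	return 4u * (unsigned)strlen(hex_prefix);
}

/*
 * Hypothetical helper: return the index of the only ID in ids[] that
 * starts with prefix, -1 if nothing matches, -2 if the prefix is ambiguous.
 */
int resolve_prefix(const char *const *ids, size_t n, const char *prefix)
{
	size_t len = strlen(prefix);
	int found = -1;
	size_t i;

	assert(len <= 40);	/* a SHA-1 written in hex is 40 characters */
	for (i = 0; i < n; i++) {
		if (strncmp(ids[i], prefix, len) != 0)
			continue;
		if (found >= 0)
			return -2;	/* a second match: ambiguous */
		found = (int)i;
	}
	return found;
}
```

With a small object database a 6-digit prefix is very likely unique, which is the point made above about "461aef".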
* [PATCH 1/2] merge-trees script for Linus git 2005-04-15 20:10 ` Junio C Hamano 2005-04-15 20:58 ` C. Scott Ananian @ 2005-04-15 21:48 ` Junio C Hamano 2005-04-15 21:54 ` [PATCH 2/2] " Junio C Hamano 2005-04-15 23:33 ` [PATCH 3/2] " Junio C Hamano 1 sibling, 2 replies; 130+ messages in thread From: Junio C Hamano @ 2005-04-15 21:48 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus, what you have in 461aef08823a18a6c69d472499ef5257f8c7f6c8 is fine by me for the essential support for merge-trees (sorry for the confusing name, but this is a stop-gap Q&D script until I do the real merge-tree.c conversion). This patch contains the merge-trees script itself and Makefile entry for it. I have some more fixes to merge-trees in the works but that will follow later. I have an optional patch to add '-q' option to show-diff so that complaints for missing files can be squelched, which I will be sending you in a separate message. Signed-off-by: Junio C Hamano <junkio@cox.net> --- Makefile | 2 merge-trees | 302 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 303 insertions(+), 1 deletion(-) Makefile: b39b4ea37586693dd707d1d0750a9b580350ec50 --- Makefile +++ Makefile 2005-04-15 13:32:06.000000000 -0700 @@ -14,7 +14,7 @@ PROG= update-cache show-diff init-db write-tree read-tree commit-tree \ cat-file fsck-cache checkout-cache diff-tree rev-tree show-files \ - check-files ls-tree merge-tree + check-files ls-tree merge-tree merge-trees all: $(PROG) --- /dev/null 2005-03-19 15:28:25.000000000 -0800 +++ merge-trees 2005-04-15 13:32:20.000000000 -0700 @@ -0,0 +1,302 @@ +#!/usr/bin/perl -w + +use strict; +use Cwd; +use Getopt::Long; + +my $full_checkout = 0; +my $partial_checkout = 0; +my $output_directory = ',,merge~tree'; + +GetOptions("full-checkout" => \$full_checkout, + "partial-checkout" => \$partial_checkout, + "output-directory=s" => \$output_directory) + or die; + + +if (@ARGV != 3) { + die "Usage: $0 -o [output-directory] [-f] [-p] ancestor 
A B\n"; +} + +if ($full_checkout) { + $partial_checkout = 1; +} + +################################################################ +# UI helper -- although it is encouraged to give tree ID, +# it is OK to give commit ID. +sub possibly_commit_to_tree { + my ($commit_or_tree_id) = @_; + my $type = read_cat_file_t($commit_or_tree_id); + if ($type eq 'tree') { return $commit_or_tree_id } + if ($type ne 'commit') { + die "Tree ID (or commit ID) required, given $type."; + } + + my ($fhi); + open $fhi, '-|', 'cat-file', 'commit', $commit_or_tree_id + or die "$!: cat-file commit $commit_or_tree_id"; + my ($tree) = <$fhi>; + close $fhi; + ($tree =~ s/^tree (.*)$/$1/) + or die "$tree: Linus says the first line is guaranteed to be tree."; + return $tree; +} + +sub read_cat_file_t { + my ($id) = @_; + my ($fhi); + open $fhi, '-|', 'cat-file', '-t', $id + or die "$!: cat-file -t $id"; + my ($t) = <$fhi>; + close $fhi; + chomp($t); + return $t; +} + +################################################################ +# Reads diff-tree -r output and gives a hash that maps a path +# to 4-tuple (old-mode new-mode old-oid new-oid). +# When creating, old-* are undef. When removing, new-* are undef. + +sub OLD_MODE () { 0 } +sub NEW_MODE () { 1 } +sub OLD_OID () { 2 } +sub NEW_OID () { 3 } + +sub read_diff_tree { + my (@tree) = @_; + my ($fhi); + + # Regular expression piece for mode + my $reM = '[0-7]+'; + + # Regular expression piece for object ID. + # There is a talk about base-64 so better make it easier to modify... 
+ my $reID = '[0-9a-f]{40}'; + + local ($_, $/); + $/ = "\0"; + my %path; + open $fhi, '-|', 'diff-tree', '-r', @tree + or die "$!: diff-tree -r @tree"; + while (<$fhi>) { + chomp; + if (/^\*($reM)->($reM)\tblob\t($reID)->($reID)\t(.*)$/so) { + $path{$5} = [$1, $2, $3, $4]; # modified + } + elsif (/^\+($reM)\tblob\t($reID)\t(.*)$/so) { + $path{$3} = [undef, $1, undef, $2]; # added + } + elsif (/^\-($reM)\tblob\t($reID)\t(.*)$/so) { + $path{$3} = [$1, undef, $2, undef]; # deleted + } + else { + die "cannot parse diff-tree output: $_"; + } + } + close $fhi; + return %path; +} + +################################################################ +# Read show-files output to figure out the set of files contained +# in the tree. This is used to figure out what ancestor had. +sub read_show_files { + my ($fhi); + local ($_, $/); + $/ = "\0"; + open $fhi, '-|', 'show-files', '-z', '--cached' + or die "$!: show-files -z --cached"; + my (@path) = map { chomp; $_ } <$fhi>; + close $fhi; + return @path; +} + +################################################################ +# Given path and info (typically returned from read_diff_tree), +# create the file in the working directory to match the NEW tree. +# This does not touch dircache. +sub checkout_file { + my ($path, $info) = @_; + my (@elt) = split(/\//, $path); + my $j = ''; + my $tail = pop @elt; + my ($fhi, $fho); + for (@elt) { + mkdir "$j$_"; + $j = "$j$_/"; + } + open $fho, '>', "$path"; + open $fhi, '-|', 'cat-file', 'blob', $info->[NEW_OID] + or die "$!: cat-file blob $info->[NEW_OID]"; + while (<$fhi>) { + print $fho $_; + } + close $fhi; + close $fho; + chmod oct("0$info->[NEW_MODE]"), "$path"; +} + +################################################################ +# Given path and info record the file in the dircache without +# affecting working directory. 
+sub record_file { + my ($path, $info) = @_; + system ('update-cache', '--add', '--cacheinfo', + $info->[NEW_MODE], $info->[NEW_OID], $path); +} + +################################################################ +# Merge info from two trees and leave it in path, without +# affecting dircache. +sub merge_tree { + my ($path, $infoA, $infoB) = @_; + checkout_file("$path~A~", $infoA); + checkout_file("$path~B~", $infoB); + system 'checkout-cache', $path; + rename $path, "$path~O~"; + my ($fhi, $fho); + open $fhi, '-|', 'merge', '-p', "$path~A~", "$path~O~", "$path~B~"; + open $fho, '>', $path; + local ($/); + while (<$fhi>) { print $fho $_; } + close $fhi; + close $fho; + # There is no reason to prefer infoA over infoB but + # we need to pick one. + chmod oct("0$infoA->[NEW_MODE]"), $path; +} + +################################################################ + +# O stands for "the original". A and B are being merged. +my ($treeO, $treeA, $treeB) = map { possibly_commit_to_tree $_ } @ARGV; + +# Create a temporary directory and go there. +system('rm', '-rf', $output_directory) == 0 && +system('mkdir', '-p', "$output_directory/.git") == 0 && +symlink(Cwd::getcwd . "/.git/objects", "$output_directory/.git/objects") && +chdir $output_directory && +system('read-tree', $treeO) == 0 + or die "$!: Failed to set up merge working area $output_directory"; + +# Find out edits done in each branch. +my %treeA = read_diff_tree($treeO, $treeA); +my %treeB = read_diff_tree($treeO, $treeB); + +# The list of files that was in the ancestor. +my @ancestor_file = read_show_files(); +my %ancestor_file = map { $_ => 1 } @ancestor_file; + +# Report output is formated as follows: +# +# The first letter shows the origin of the result. +# O - original +# A - treeA +# B - treeB +# M - both treeA and treeB +# * - treeA and treeB conflicts; needs human action. +# +# The second and third letter shows what each tree did. +# . 
- no change +# A - created +# M - modified +# D - deleted + +for (@ancestor_file) { + if (! exists $treeA{$_} && ! exists $treeB{$_}) { + if ($full_checkout) { + system 'checkout-cache', $_; + } + print STDERR "O.. $_\n"; # keep original + } +} + +for my $set ([\%treeA, \%treeB, 'A'], [\%treeB, \%treeA, 'B']) { + my ($this, $other, $side) = @$set; + my $delete_sign = ($side eq 'A') ? 'D.' : '.D'; + my $create_sign = ($side eq 'A') ? 'A.' : '.A'; + my $modify_sign = ($side eq 'A') ? 'M.' : '.M'; + while (my ($path, $info) = each %$this) { + # In this loop we do not deal with overlaps. + next if (exists $other->{$path}); + + if (! defined $info->[NEW_OID]) { + # deleted in this tree only. + unlink $path; + system 'update-cache', '--remove', $path; + print STDERR "${side}${delete_sign} $path\n"; + } + else { + # modified or created in this tree only. + my $create_or_modify = + (! defined $info->[OLD_OID]) ? $create_sign : $modify_sign; + print STDERR "${side}${create_or_modify} $path\n"; + if ($partial_checkout) { + checkout_file($path, $info); + system 'update-cache', '--add', $path; + } else { + record_file($path, $info); + } + } + } +} + +my @warning = (); + +while (my ($path, $infoA) = each %treeA) { + # We need to deal only with overlaps. + next if (!exists $treeB{$path}); + + my $infoB = $treeB{$path}; + if (! defined $infoA->[NEW_OID]) { + # Deleted in tree A. + if (! defined $infoB->[NEW_OID]) { + # Deleted in both trees (obvious). + print STDERR "MDD $path\n"; + unlink $path; + system 'update-cache', '--remove', $path; + } + else { + # TreeA wants to remove but TreeB wants to modify it. + print STDERR "*DM $path\n"; + checkout_file("$path~B~", $infoB); + push @warning, $path; + } + } + else { + # Modified or created in tree A + if (! defined $infoB->[NEW_OID]) { + # TreeA wants to modify but treeB wants to remove it. + print STDERR "*MD $path\n"; + checkout_file("$path~A~", $infoA); + push @warning, $path; + } + else { + # Modified both in treeA and treeB. 
+ # Are they modifying to the same contents? + if ($infoA->[NEW_OID] eq $infoB->[NEW_OID]) { + # No changes or just the mode. + # we prefer TreeA over TreeB for no particular reason. + print STDERR "MMM $path\n"; + record_file($path, $infoA); + } + else { + # Modified in both. Needs merge. + print STDERR "*MM $path\n"; + merge_tree($path, $infoA, $infoB); + } + } + } +} + +if (@warning) { + print "\nThere are some files that were deleted in one branch and\n" + . "modified in another. Please examine them carefully:\n"; + for (@warning) { + print "$_\n"; + } +} + +# system 'show-diff', '-q'; ^ permalink raw reply [flat|nested] 130+ messages in thread
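For readers who do not want to trace the Perl, the three-letter report codes the script prints (origin letter, then what tree A did, then what tree B did) can be sketched in C roughly as below. The struct and function names are hypothetical and not part of the patch. One deliberate difference: the sketch derives the second and third letters from the data, so a path created on both sides would read "AA" where the script prints a literal "MM".

```c
#include <assert.h>
#include <string.h>

struct change {
	int touched;		/* 0 if this branch left the path alone */
	const char *old_oid;	/* NULL when the branch created the path */
	const char *new_oid;	/* NULL when the branch deleted the path */
};

/* One-letter action of a branch: '.', 'A', 'M' or 'D'. */
char action_letter(const struct change *c)
{
	if (!c->touched)
		return '.';
	if (!c->new_oid)
		return 'D';
	return c->old_oid ? 'M' : 'A';
}

/* Write the three-letter report code for one path into out[4]. */
void report_code(const struct change *a, const struct change *b, char out[4])
{
	char origin;

	assert(a && b && out);
	if (!a->touched && !b->touched)
		origin = 'O';	/* keep original */
	else if (!b->touched)
		origin = 'A';	/* only tree A changed it */
	else if (!a->touched)
		origin = 'B';	/* only tree B changed it */
	else if (!a->new_oid && !b->new_oid)
		origin = 'M';	/* deleted in both trees */
	else if (!a->new_oid || !b->new_oid)
		origin = '*';	/* delete vs. modify: needs a human */
	else if (strcmp(a->new_oid, b->new_oid) == 0)
		origin = 'M';	/* both arrived at the same contents */
	else
		origin = '*';	/* different contents: 3-way merge */

	out[0] = origin;
	out[1] = action_letter(a);
	out[2] = action_letter(b);
	out[3] = '\0';
}
```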
* [PATCH 2/2] merge-trees script for Linus git 2005-04-15 21:48 ` [PATCH 1/2] merge-trees script for Linus git Junio C Hamano @ 2005-04-15 21:54 ` Junio C Hamano 2005-04-15 23:33 ` [PATCH 3/2] " Junio C Hamano 1 sibling, 0 replies; 130+ messages in thread From: Junio C Hamano @ 2005-04-15 21:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus, This is the '-q' option for show-diff.c to squelch complaints for missing files. It is handy if you want to run it in the merge temporary directory after running merge-trees with its minimum checkout mode, which is the default, because you would not find any files other than the ones that needs human validation after the merge there. It also fixes the argument parsing bug Paul Mackerras noticed in <16991.42305.118284.139777@cargo.ozlabs.ibm.com> but slightly differently. Signed-off-by: Junio C Hamano <junkio@cox.net> --- show-diff.c | 17 ++++++++++++----- 1 files changed, 12 insertions(+), 5 deletions(-) show-diff.c: 3f7acd2a692a03026784a18f28521b9af322b71e --- show-diff.c +++ show-diff.c 2005-04-15 14:14:53.000000000 -0700 @@ -58,15 +58,20 @@ int main(int argc, char **argv) { int silent = 0; + int silent_on_nonexisting_files = 0; int entries = read_cache(); int i; - while (argc-- > 1) { - if (!strcmp(argv[1], "-s")) { - silent = 1; + for (i = 1; i < argc; i++) { + if (!strcmp(argv[i], "-s")) { + silent_on_nonexisting_files = silent = 1; continue; } - usage("show-diff [-s]"); + if (!strcmp(argv[i], "-q")) { + silent_on_nonexisting_files = 1; + continue; + } + usage("show-diff [-s] [-q]"); } if (entries < 0) { @@ -82,8 +87,10 @@ void *new; if (stat(ce->name, &st) < 0) { + if (errno == ENOENT && silent_on_nonexisting_files) + continue; printf("%s: %s\n", ce->name, strerror(errno)); - if (errno == ENOENT && !silent) + if (errno == ENOENT) show_diff_empty(ce); continue; } ^ permalink raw reply [flat|nested] 130+ messages in thread
* [PATCH 3/2] merge-trees script for Linus git 2005-04-15 21:48 ` [PATCH 1/2] merge-trees script for Linus git Junio C Hamano 2005-04-15 21:54 ` [PATCH 2/2] " Junio C Hamano @ 2005-04-15 23:33 ` Junio C Hamano 2005-04-16 1:02 ` Linus Torvalds 1 sibling, 1 reply; 130+ messages in thread From: Junio C Hamano @ 2005-04-15 23:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus, the merge-trees I sent you earlier was expecting the old diff-tree behaviour, and I did not realize that I need an explicit -z flag now. Here is a fix. Signed-off-by: Junio C Hamano <junkio@cox.net> --- merge-trees | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) --- merge-trees 2005-04-15 13:21:35.000000000 -0700 +++ merge-trees+ 2005-04-15 16:27:34.000000000 -0700 @@ -78,8 +78,8 @@ local ($_, $/); $/ = "\0"; my %path; - open $fhi, '-|', 'diff-tree', '-r', @tree - or die "$!: diff-tree -r @tree"; + open $fhi, '-|', 'diff-tree', '-r', '-z', @tree + or die "$!: diff-tree -r -z @tree"; while (<$fhi>) { chomp; if (/^\*($reM)->($reM)\tblob\t($reID)->($reID)\t(.*)$/so) { ^ permalink raw reply [flat|nested] 130+ messages in thread
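The reason the script sets $/ to "\0" is that with -z each record is terminated by a NUL byte rather than a newline, so a path may itself contain a newline. A minimal C sketch of the same record splitting, assuming the whole output has already been read into a buffer (split_nul_records() is a hypothetical name, not something in the tree):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk a buffer of NUL-terminated records.  Pointers to the first `max`
 * records are stored in recs[]; the return value is the number of complete
 * records found, which may be larger than max.  Trailing bytes that are
 * not followed by a NUL are treated as an incomplete record and ignored.
 */
size_t split_nul_records(const char *buf, size_t len,
			 const char **recs, size_t max)
{
	size_t count = 0, start = 0, i;

	assert(buf || len == 0);
	for (i = 0; i < len; i++) {
		if (buf[i] != '\0')
			continue;
		if (count < max)
			recs[count] = buf + start;
		count++;
		start = i + 1;
	}
	return count;
}
```

Each record found this way can then be matched against the "*", "+" and "-" patterns used in read_diff_tree.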
* Re: [PATCH 3/2] merge-trees script for Linus git 2005-04-15 23:33 ` [PATCH 3/2] " Junio C Hamano @ 2005-04-16 1:02 ` Linus Torvalds 2005-04-16 4:10 ` Junio C Hamano 0 siblings, 1 reply; 130+ messages in thread From: Linus Torvalds @ 2005-04-16 1:02 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Fri, 15 Apr 2005, Junio C Hamano wrote: > > the merge-trees I sent you earlier was expecting the old > diff-tree behaviour, and I did not realize that I need an > explicit -z flag now. You didn't need one - I just didn't want to merge your "ls-tree" change without making things be consistent. Once we started using the "-z" flag for ls-tree, it just didn't make any sense not to do the same thing for diff-tree. Just a heads-up - I'd really want to do the same thing to "merge-tree.c" too, but since you said that you were working on extending that to do recursion etc, I decided to hold off. So if you're working on it, maybe you can add the "-z" flag there too? I'm actually holding off merging the perl version exactly because you seemed to be working on the C version. I don't mind perl per se, but if there's a real solution coming down the line.. Linus ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: [PATCH 3/2] merge-trees script for Linus git 2005-04-16 1:02 ` Linus Torvalds @ 2005-04-16 4:10 ` Junio C Hamano 2005-04-16 5:02 ` Linus Torvalds 0 siblings, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-16 4:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> Just a heads-up - I'd really want to do the same thing to "merge-tree.c"
LT> too, but since you said that you were working on extending that to do
LT> recursion etc, I decided to hold off. So if you're working on it, maybe
LT> you can add the "-z" flag there too?

Sent as a separate patch already.

LT> I'm actually holding off merging the perl version exactly because you
LT> seemed to be working on the C version. I don't mind perl per se, but if
LT> there's a real solution coming down the line..

I'd take the hint, but I would say the current Perl version would be far more usable than the C version I would come up with by the end of this weekend because:

 - the Perl version creates a new temporary directory and leaves a ready-to-use dircache there---the only thing you need to do from that point is to fix up any conflicts and run update-cache on that dircache. In that sense it is already usable (Linus-usable, but probably not Pasky-usable due to differences in philosophy).

 - the enhancement I am planning on the C version does not do the real work itself, as you have originally written (the workings and the output from it are outlined in [*R1*]). Somebody has to write the executor part that does read-tree the base, update-cache --cacheinfo --add for the selects, runs 3-way merge on conflicting files and runs update-cache for the merges, update-cache --remove for the deletes, before it matches the usability of the Perl version. I do not expect to have enough time this weekend to finish this.
I know of one case in Perl version I need to see if it does the right thing but other than that it would be far better than the C version I'm toying with. Just to let you know, here is the plan I have for my part. 1. I am currently writing some test cases. The plan is first to make sure the Perl version works OK with the test cases to flush initial problems out. 2. After that I'll see if a dumb but recursive C version I already have spits out the right instructions. This step is to make sure that the test cases are sane, and by making that sure, we will be able to say that we have something usable in extremely short run (i.e. the Perl version) after this step. 3. After that is done, I'll add the fourth argument to merge-tree.c to specify the base so that it can cut down 90% of trivial selects. Only after this happens the executioner script would be useful performance wise. [References] *R1* <7vr7hbhky9.fsf@assigned-by-dhcp.cox.net> ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: [PATCH 3/2] merge-trees script for Linus git 2005-04-16 4:10 ` Junio C Hamano @ 2005-04-16 5:02 ` Linus Torvalds 2005-04-16 6:26 ` Linus Torvalds 2005-04-16 8:12 ` Junio C Hamano 0 siblings, 2 replies; 130+ messages in thread From: Linus Torvalds @ 2005-04-16 5:02 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Fri, 15 Apr 2005, Junio C Hamano wrote: > > I'd take the hint, but I would say the current Perl version > would be far more usable than the C version I would come up with > by the end of this weekend because: Actually, it turns out that I have a cunning plan. I'm full of cunning plans, in fact. It turns out that I can do merges even more simply, if I just allow the notion of "state" into an index entry, and allow multiple index entries with the same name as long as they differ in "state". And that means that I can do all the merging in the regular index tree, using very simple rules. Let's see how that works out. I'm writing the code now. Linus ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: [PATCH 3/2] merge-trees script for Linus git 2005-04-16 5:02 ` Linus Torvalds @ 2005-04-16 6:26 ` Linus Torvalds 2005-04-16 8:12 ` Junio C Hamano 1 sibling, 0 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-16 6:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, 15 Apr 2005, Linus Torvalds wrote:
>
> Actually, it turns out that I have a cunning plan.

Damn, my cunning plan is some good stuff. Or maybe it is _so_ cunning that I just confuse even myself. But it looks like it is actually working, and that it allows pretty much instantaneous merges.

The plan goes like this:

 - each "index" entry has two bits worth of "stage" state. stage 0 is the normal one, and is the only one you'd see in any kind of normal use.

 - however, when you do "read-tree" with multiple trees, the "stage" starts out at 0, but increments for each tree you read. And in particular, the old "-m" flag (which used to be "merge with old state") has a new meaning: it now means "start at stage 1" instead.

 - this means that you can do

	read-tree -m <tree1> <tree2> <tree3>

   and you will end up with an index with all of the <tree1> entries in "stage1", all of the <tree2> entries in "stage2" and all of the <tree3> entries in "stage3".

 - furthermore, "read-tree" has this special-case logic that says: if you see a file that matches in all respects in all three states, it "collapses" back to "stage0".

 - write-tree refuses to write a nonsensical tree, so write-tree will complain about unmerged entries if it sees a single entry that is not "stage 0".

Ok, this all sounds like a collection of totally nonsensical rules, but it's actually exactly what you want in order to do a fast merge. The different stages represent the "result tree" (stage 0, aka "merged"), the original tree (stage 1, aka "orig"), and the two trees you are trying to merge (stage 2 and 3 respectively).
In fact, the way "read-tree" works, it's entirely agnostic about how you assign the stages, and you could really assign them any which way, and the above is just a suggested way to do it (except since "write-tree" refuses to write anything but stage0 entries, it makes sense to always consider stage 0 to be the "full merge" state). So what happens? Try it out. Select the original tree, and two trees to merge, and look how it works: - if a file exists in identical format in all three trees, it will automatically collapse to "merged" state by the new read-tree. - a file that has _any_ difference what-so-ever in the three trees will stay as separate entries in the index. It's up to "script policy" to determine how to remove the non-0 stages, and insert a merged version. But since the index is always sorted, they're easy to find: they'll be clustered together. - the index file saves and restores with all this information, so you can merge things incrementally, but as long as it has entries in stages 1/2/3 (ie "unmerged entries") you can't write the result. So now the merge algorithm ends up being really simple: - you walk the index in order, and ignore all entries of stage 0, since they've already been done. - if you find a "stage1", but no matching "stage2" or "stage3", you know it's been removed from both trees (it only existed in the original tree), and you remove that entry. - if you find a matching "stage2" and "stage3" tree, you remove one of them, and turn the other into a "stage0" entry. Remove any matching "stage1" entry if it exists too. .. all the normal trivial rules .. NOTE NOTE NOTE! I could make "read-tree" do some of these nontrivial merges, but I ended up deciding that only the "matches in all three states" thing collapses by default. Why? 
Because even though there are other trivial cases ("matches in both merge trees but not in the original one"), those cases might actually be interesting for the merge logic to know about, so I thought I'd leave all that information around. I expect it to be fairly rare anyway, so writing out a few extra index entries to disk so that others can decide to annotate the merge a bit more sounded like a fair deal. I should make "ls-files" have a "-l" format, which shows the index and the mode for each file too. Right now it's very hard to see what the contents of the index is. But all my tests seem to say that not only does this work, it's pretty efficient too. And it's dead _simple_, thanks to having all the merge information in just one place, the same index we always use anyway. Btw, it also means that you don't even have to have a separate subdirectory for this. All the information literally is in the index file, which is a temporary thing anyway. We don't need to worry about what is in the working directory, since we'll never show it, and we'll never need to use it. Damn, I'm good. (On the other hand, it is Friday evening at 11PM, and I'm sitting in front of the computer. I'm a sad case. I will now go take a beer, and relax. I think this is another of my "Really Good Ideas" (tm), and is worth the beer. This "feels" right). Linus ^ permalink raw reply [flat|nested] 130+ messages in thread
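The per-path decision a merge script would make on top of the staged index can be sketched in C as below. Treat it as a reading aid under stated assumptions, not as what read-tree or any script in this thread actually does. The first two rules (path present only in stage 1, or stage 2 and stage 3 identical) are the ones spelled out in the message above. The "one side left the original untouched" rules are an assumption about what "the normal trivial rules" would include, based on the case matrix in the generate-merge-test.sh commentary elsewhere in this thread. All names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum verdict { RESOLVED_REMOVE, RESOLVED_TAKE, UNMERGED };

/* Two stages match if both are absent (NULL) or both carry the same ID. */
int same_id(const char *x, const char *y)
{
	if (!x || !y)
		return x == y;
	return strcmp(x, y) == 0;
}

/*
 * orig, a and b are the object IDs of one path at stage 1, 2 and 3,
 * NULL when the path has no entry at that stage.  On RESOLVED_TAKE,
 * *result is the ID to record at stage 0.
 */
enum verdict resolve_path(const char *orig, const char *a, const char *b,
			  const char **result)
{
	assert(result);
	assert(orig || a || b);	/* at least one stage must exist */
	*result = NULL;

	if (same_id(a, b)) {
		/* Both trees agree, including "both removed it". */
		*result = a;
		return a ? RESOLVED_TAKE : RESOLVED_REMOVE;
	}
	if (same_id(orig, a)) {
		/* Assumed rule: only tree B changed the path. */
		*result = b;
		return b ? RESOLVED_TAKE : RESOLVED_REMOVE;
	}
	if (same_id(orig, b)) {
		/* Assumed rule: only tree A changed the path. */
		*result = a;
		return a ? RESOLVED_TAKE : RESOLVED_REMOVE;
	}
	return UNMERGED;	/* leave stages 1/2/3 for a real merge */
}
```

Anything that comes back UNMERGED keeps its stage 1/2/3 entries, which is exactly the state write-tree refuses to write.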
* Re: [PATCH 3/2] merge-trees script for Linus git 2005-04-16 5:02 ` Linus Torvalds 2005-04-16 6:26 ` Linus Torvalds @ 2005-04-16 8:12 ` Junio C Hamano 2005-04-16 9:27 ` [PATCH] Byteorder fix for read-tree, new -m semantics version Junio C Hamano ` (3 more replies) 1 sibling, 4 replies; 130+ messages in thread From: Junio C Hamano @ 2005-04-16 8:12 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> Damn, my cunning plan is some good stuff. I really like this a lot. It is *so* *simple*, clear, flexible and an example of elegance. This is one of the things I would happily say "Sheeeeeeeeeeeeeesh! Why didn't *I* think of *THAT* first!!!" to. LT> NOTE NOTE NOTE! I could make "read-tree" do some of these nontrivial LT> merges, but I ended up deciding that only the "matches in all three LT> states" thing collapses by default. * Understood and agreed. LT> Damn, I'm good. * Agreed ;-). Wholeheartedly. So what's next? Certainly I'd immediately drop (and I would imagine you would as well) both C or Perl version of merge-tree(s). The userland merge policies need ways to extract the stage information and manipulate them. Am I correct to say that you mean by "ls-files -l" the extracting part? LT> I should make "ls-files" have a "-l" format, which shows the LT> index and the mode for each file too. You probably meant "ls-tree". You used the word "mode" but it already shows the mode so I take it to mean "stage". Perhaps something like this? $ ls-tree -l -r 49c200191ba2e3cd61978672a59c90e392f54b8b 100644 blob fe2a4177a760fd110e78788734f167bd633be8de COPYING 100644 blob b39b4ea37586693dd707d1d0750a9b580350ec50:1 man/frotz.6 100644 blob b39b4ea37586693dd707d1d0750a9b580350ec50:2 man/frotz.6 100664 blob eeed997e557fb079f38961354473113ca0d0b115:3 man/frotz.6 ... The above example shows that COPYING has merged successfully, and O and A have the same contents and B has something different at man/frotz.6. 
Assuming that you would be working on that, I'd like to take the dircache manipulation part. Let's think about the minimally necessary set of operations:

 * The merge policy decides to take one of the existing stages. In this case we need a way to register a known mode/sha1 at a path. We already have this as "update-cache --cacheinfo". We just need to make sure that when "update-cache" puts things at stage 0 it clears other stages as well.

 * The merge policy comes up with a desired blob somewhere on the filesystem (perhaps by running an external merge program). It wants to register it as the result of the merge. We could do this today by first storing the "desired blob" in a temporary file somewhere in the path the dircache controls, "update-cache --add" the temporary file, ls-tree to find its mode/sha1, "update-cache --remove" the temporary file and finally "update-cache --cacheinfo" the mode/sha1. This is workable but clumsy. How about:

	$ update-cache --graft [--add] desired-blob path

   to say "I want to register mode/sha1 from desired-blob, whose name may not satisfy verify_path(), at path in the dircache"?

 * The merge policy decides to delete the path. We could do this today by first stashing away the file at the path if it exists, "update-cache --remove" it, and restore if necessary. This is again workable but clumsy. How about:

	$ update-cache --force-remove path

   to mean "I want to remove the path from dircache even though it may exist in my working tree"?

So it all boils down to update-cache. The new things to be introduced are:

 * An explicit update-cache always removes stage 1/2/3 entries associated with the named path.
 * update-cache --graft
 * update-cache --force-remove

Am I on the right track?

You might want to go even lower level by letting them say something like:

 * update-cache --register-stage mode sha1 stage path

   Registers the mode/sha1 at stage for path. Does not look at the working tree.
stage is [0-3] * update-cache --delete-stage stage-list path Removes the entry at named stages for path. Does not look at the working tree. stage-list is either [0-3](,[0-3])+ or bitmask (i.e. (1 << stage-number) ORed together). The former would probably be easier to work with by scripts * write-blob path Hashes and registers the file at path (regardless of what verify_path() says) and writes the resulting blob's mode/sha1 to the standard output. If you take this lower-level approach, an explicit update-cache would not clear stage1/2/3. My preference is the former, not so low-level, interface. Guidance? ^ permalink raw reply [flat|nested] 130+ messages in thread
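The two stage-list spellings proposed for --delete-stage (a comma-separated list, or a bitmask of (1 << stage-number) values) convert into each other easily. A hedged C sketch follows; parse_stage_list() is a hypothetical helper, and unlike the [0-3](,[0-3])+ pattern as written above it also accepts a single stage, which is an assumption about the intent.

```c
#include <assert.h>

/*
 * Parse a stage list such as "1,2,3" into a bitmask with bit N set for
 * stage N, i.e. (1 << 1) | (1 << 2) | (1 << 3).  Returns -1 if the
 * input is malformed.
 */
int parse_stage_list(const char *s)
{
	int mask = 0;

	assert(s);
	for (;;) {
		if (*s < '0' || *s > '3')
			return -1;	/* not a stage number */
		mask |= 1 << (*s - '0');
		s++;
		if (*s == '\0')
			return mask;
		if (*s != ',')
			return -1;	/* only commas may separate stages */
		s++;
	}
}
```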
* [PATCH] Byteorder fix for read-tree, new -m semantics version. 2005-04-16 8:12 ` Junio C Hamano @ 2005-04-16 9:27 ` Junio C Hamano 2005-04-16 10:35 ` [PATCH 1/2] Add --stage to show-files for new stage dircache Junio C Hamano ` (2 subsequent siblings) 3 siblings, 0 replies; 130+ messages in thread From: Junio C Hamano @ 2005-04-16 9:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: git The ce_namelen field has been renamed to ce_flags and split into the top 2-bit unused, next 2-bit stage number and the lowest 12-bit name-length, stored in the network byte order. A new macro create_ce_flags() is defined to synthesize this value from length and stage, but it forgets to turn the value into the network byte order. Here is a fix. The patch is against 9c03bd47892d11d0bb28c442184786db3c189978. Signed-off-by: Junio C Hamano <junkio@cox.net> --- cache.h | 2 +- 1 files changed, 1 insertion(+), 1 deletion(-) --- cache.h +++ cache.h 2005-04-16 02:22:05.000000000 -0700 @@ -66,7 +66,7 @@ #define CE_NAMEMASK (0x0fff) #define CE_STAGEMASK (0x3000) -#define create_ce_flags(len, stage) ((len) | ((stage) << 12)) +#define create_ce_flags(len, stage) htons((len) | ((stage) << 12)) const char *sha1_file_directory; struct cache_entry **active_cache; ^ permalink raw reply [flat|nested] 130+ messages in thread
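To make the layout described in this patch concrete, here is a small C sketch of packing and unpacking the 16-bit flags word: the low 12 bits hold the name length, the next 2 bits hold the stage, and the value is stored in network byte order. CE_NAMEMASK and CE_STAGEMASK are the masks from cache.h quoted above; CE_STAGESHIFT and the three function names are introduced here for illustration only and are not part of the patch.

```c
#include <arpa/inet.h>	/* htons, ntohs (POSIX) */
#include <assert.h>
#include <stdint.h>

#define CE_NAMEMASK   (0x0fff)
#define CE_STAGEMASK  (0x3000)
#define CE_STAGESHIFT 12	/* hypothetical name for the shift count */

/* Build the on-disk (network byte order) flags word. */
uint16_t pack_ce_flags(unsigned len, unsigned stage)
{
	assert(len <= CE_NAMEMASK);
	assert(stage <= 3);
	return htons((uint16_t)(len | (stage << CE_STAGESHIFT)));
}

/* Recover the name length from an on-disk flags word. */
unsigned ce_flags_namelen(uint16_t flags)
{
	return ntohs(flags) & CE_NAMEMASK;
}

/* Recover the stage number from an on-disk flags word. */
unsigned ce_flags_stage(uint16_t flags)
{
	return (ntohs(flags) & CE_STAGEMASK) >> CE_STAGESHIFT;
}
```

Without the htons() call the word would be written in host order, so an index written on a little-endian machine would decode to the wrong length and stage anywhere the readers apply ntohs().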
* [PATCH 1/2] Add --stage to show-files for new stage dircache. 2005-04-16 8:12 ` Junio C Hamano 2005-04-16 9:27 ` [PATCH] Byteorder fix for read-tree, new -m semantics version Junio C Hamano @ 2005-04-16 10:35 ` Junio C Hamano 2005-04-16 10:42 ` [PATCH 2/2] " Junio C Hamano 2005-04-16 14:03 ` Issues with higher-order stages in dircache Junio C Hamano 2005-04-16 15:28 ` [PATCH 3/2] merge-trees script for Linus git Linus Torvalds 3 siblings, 1 reply; 130+ messages in thread From: Junio C Hamano @ 2005-04-16 10:35 UTC (permalink / raw) To: Linus Torvalds; +Cc: git >>>>> "JNH" == Junio C Hamano <junkio@cox.net> writes: >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> I should make "ls-files" have a "-l" format, which shows the LT> index and the mode for each file too. JNH> You probably meant "ls-tree". You used the word "mode" but it JNH> already shows the mode so I take it to mean "stage". I was *wrong*. Of course you meant "show-files". Instead of sending you an apology, I am sending you the one I wrote myself. Please find it in the next message ;-). Here is its sample output. It shows file-mode, SHA1, stage and pathname. I am attaching this one because this is a verification that your read-tree -m passed the test. 
$ ../show-files --stage 100664 578cc900ed980b72acfbdd1eea63e688a893c458 2 AA 100664 f355077379fce072c210628691da232b59b6f25c 3 AA 100664 d698ebc45d0edfe6e5b95aebb5983cb5c760960b 2 AN 100664 0fa6a8e41814531679e1c76e968a9066fceb689d 1 DD 100664 aff448a9467a4d83b164ef969cfe92ff18eb96be 1 DM 100664 4bfe111723f11cb4a4deec7c837e12601030285f 3 DM 100664 9b0f86e5cded99b9de3bd9d234747ec2d1a4cddd 1 DN 100664 9b0f86e5cded99b9de3bd9d234747ec2d1a4cddd 3 DN 100664 a6772f2a2c15bac796d8c7bb55885891956534cf 1 MD 100664 dc2088ce13f659f2bd554b2c1b343f4966143b9b 2 MD 100664 e4310204563a9059828644464779874c3a406fee 1 MM 100664 fe5ddcd7618d26384cf98c6fcd15780c7125e6d6 2 MM 100664 53a9d14868dbe346a9f0cf01fcda742545b55987 3 MM 100664 f48f37ea0205a7e5591777b4d3ae0d153d3ef131 1 MN 100664 d7600381b69b92f61bad50c5f8408e831b622ef0 2 MN 100664 f48f37ea0205a7e5591777b4d3ae0d153d3ef131 3 MN 100664 67fb1517ea8d59949a8e4f5f07f0422b212f64dc 3 NA 100664 0e5842253af8881b2c9f579029d7b50a8e03d7f6 1 ND 100664 0e5842253af8881b2c9f579029d7b50a8e03d7f6 2 ND 100664 0d45c04c9d05fa9c21edf95fc2c1a43519a8c440 1 NM 100664 0d45c04c9d05fa9c21edf95fc2c1a43519a8c440 2 NM 100664 849bfa41d15951f5e97cb93e22cbcc2924ce4517 3 NM 100664 83d94b8fd056921f22ad2ca0122dd7f64974be7c 0 NN This is taken from the dircache after I ran $ read-tree -m O A B using the merge testcase I prepared earlier. Very trivial, single ancestor O, with two branches A & B merge case. This covers all possible patterns, except file vs directory conflicts. The filenames are all two letters, first letter being what the first branch does to that file while the second one encodes what the second branch does to it. The actions are: - A means "Added in this branch --- did not exist in the ancestor." - N means "No change in this branch." - D means "Deleted in this branch." - M means "Modified in this branch." So, for example, the first branch modified file MN while the second one did not touch it. Of course it existed in the ancestor. 
You can see that read-tree did the right thing because SHA1 for stage 1 and stage 3 match, and stage 2 is different. 100664 f48f37ea0205a7e5591777b4d3ae0d153d3ef131 1 MN 100664 d7600381b69b92f61bad50c5f8408e831b622ef0 2 MN 100664 f48f37ea0205a7e5591777b4d3ae0d153d3ef131 3 MN I verified all of the above result and it shows your algorithm is doing exactly what is expected. For those of you who are interested, this is the recipe to reproduce this merge testcase. NOTE! NOTE! NOTE! Do not run this in your working tree, because it trashes .git in its working directory. Signed-off-by: Junio C Hamano <junkio@cox.net> --- --- /dev/null +++ generate-merge-test.sh @@ -0,0 +1,163 @@ +#!/bin/sh + +: Skip execution up to <<\End_of_Commentary + +This directory is to hold a test case for merges. + +There is one ancestor (called O for Original) and two branches A +and B derived from it. We want to do 3-way merge between A and +B, using O as the common ancestor. + + merge A O B + diff3 A O B + +Decisions are made by comparing contents of O, A and B pathname +by pathname. The result is determined by the following guiding +principle: + + - If only A does something to it and B does not touch it, take + whatever A does. + + - If only B does something to it and A does not touch it, take + whatever B does. + + - If both A and B does something but in the same way, take + whatever they do. + + - If A and B does something but different things, we need a + 3-way merge: + + - We cannot do anything about the following cases: + + * O does not have it. A and B both must be adding to the + same path independently. + + * A deletes it. B must be modifying. + + - Otherwise, A and B are modifying. Run 3-way merge. + + +First, the case matrix. + + - Vertical axis is for A's actions. + - Horizontal axis is for B's actions. + +.----------------------------------------------------------------. 
+| A B | No Action | Delete | Modify | Add | +|------------+------------+------------+------------+------------| +| No Action | | | | | +| | select O | delete | select B | select B | +| | | | | | +|------------+------------+------------+------------+------------| +| Delete | | | ********** | can | +| | delete | delete | merge | not | +| | | | | happen | +|------------+------------+------------+------------+------------| +| Modify | | ********** | ?????????? | can | +| | select A | merge | select A=B | not | +| | | | merge | happen | +|------------+------------+------------+------------+------------| +| Add | | can | can | ?????????? | +| | select A | not | not | select A=B | +| | | happen | happen | merge | +.----------------------------------------------------------------. + +End_of_Commentary + +rm -fr [NDMA][NDMA] S .git Trivial +init-db + +# Original tree. +mkdir S +for a in N D M +do + for b in N D M + do + p=$a$b + echo This is $p from the original tree. >$p + echo This is S/$p from the original tree. >S/$p + update-cache --add $p || exit + update-cache --add S/$p || exit + done +done +cat >Trivial <<\EOF +This is a trivial merge sample text. +Branch A is expected to upcase this word. +There are some filler words to foil diff contexts here, +like this one, +and this one, +and this one is yet another one of them. +At the very end, here comes another line, that is +the word, expected to be upcased by Branch B. +This concludes the trivial merge sample file. +EOF +update-cache --add Trivial || exit +tree_O=$(write-tree) +commit_O=$(echo 'Original tree for the merge test.' | commit-tree $tree_O) + +# Branch A and B makes the changes according to the above matrix. +# Branch A +to_remove=$(echo D? S/D?) +rm -f $to_remove +update-cache --remove $to_remove || exit + +for p in M? S/M? +do + echo This is modified $p in the branch A. >$p + update-cache $p || exit +done + +for p in AN AA +do + echo This is added $p in the branch A. 
>$p + update-cache --add $p || exit +done +mv Trivial ,,Trivial +sed -e '/Branch A/s/word/WORD/g' <,,Trivial >Trivial +rm -f ,,Trivial +update-cache Trivial || exit + +tree_A=$(write-tree) +commit_A=$(echo 'Branch A for the merge test.' | + commit-tree $tree_A -p $commit_O) + + +# Branch B +# Start from O +rm -rf [NDMA][NDMA] S Trivial +mkdir S +../read-tree $tree_O +checkout-cache -a + +to_remove=$(echo ?D S/?D) +rm -f $to_remove +update-cache --remove $to_remove || exit + +for p in ?M S/?M +do + echo This is modified $p in the branch B. >$p + update-cache $p || exit +done + +for p in NA AA +do + echo This is added $p in the branch B. >$p + update-cache --add $p || exit +done +mv Trivial ,,Trivial +sed -e '/Branch B/s/word/WORD/g' <,,Trivial >Trivial +rm -f ,,Trivial +update-cache Trivial || exit + +tree_B=$(write-tree) +commit_B=$(echo 'Branch B for the merge test.' | + commit-tree $tree_B -p $commit_O) + +for commit in $commit_O $commit_A $commit_B +do + echo ================ + echo commit $commit + cat-file commit $commit +done +echo ================ + ^ permalink raw reply [flat|nested] 130+ messages in thread
* [PATCH 2/2] Add --stage to show-files for new stage dircache. 2005-04-16 10:35 ` [PATCH 1/2] Add --stage to show-files for new stage dircache Junio C Hamano @ 2005-04-16 10:42 ` Junio C Hamano 0 siblings, 0 replies; 130+ messages in thread From: Junio C Hamano @ 2005-04-16 10:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: git This adds --stage option to show-files command. It shows file-mode, SHA1, stage and pathname. Record separator follows the usual convention of -z option as before. The patch is on top of the byte order fix for create_ce_flags in my previous message. Signed-off-by: Junio C Hamano <junkio@cox.net> --- cache.h | 12 +++++++----- show-files.c | 22 ++++++++++++++++++---- 2 files changed, 25 insertions(+), 9 deletions(-) --- cache.h 2005-04-16 03:02:36.000000000 -0700 +++ cache.h=show-files-stage-flags 2005-04-16 02:48:47.000000000 -0700 @@ -65,8 +65,14 @@ #define CE_NAMEMASK (0x0fff) #define CE_STAGEMASK (0x3000) +#define CE_STAGESHIFT 12 -#define create_ce_flags(len, stage) htons((len) | ((stage) << 12)) +#define create_ce_flags(len, stage) htons((len) | ((stage) << CE_STAGESHIFT)) +#define ce_namelen(ce) (CE_NAMEMASK & ntohs((ce)->ce_flags)) +#define ce_size(ce) cache_entry_size(ce_namelen(ce)) +#define ce_stage(ce) ((CE_STAGEMASK & ntohs((ce)->ce_flags)) >> CE_STAGESHIFT) + +#define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7) const char *sha1_file_directory; struct cache_entry **active_cache; @@ -75,10 +81,6 @@ #define DB_ENVIRONMENT "SHA1_FILE_DIRECTORY" #define DEFAULT_DB_ENVIRONMENT ".git/objects" -#define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7) -#define ce_namelen(ce) (CE_NAMEMASK & ntohs((ce)->ce_flags)) -#define ce_size(ce) cache_entry_size(ce_namelen(ce)) - #define alloc_nr(x) (((x)+16)*3/2) /* Initialize and use the cache information */ --- show-files.c +++ show-files.c 2005-04-16 02:58:32.000000000 -0700 @@ -14,6 +14,7 @@ static int show_cached = 0; static int 
show_others = 0; static int show_ignored = 0; +static int show_stage = 0; static int line_terminator = '\n'; static const char **dir; @@ -108,10 +109,19 @@ for (i = 0; i < nr_dir; i++) printf("%s%c", dir[i], line_terminator); } - if (show_cached) { + if (show_cached | show_stage) { for (i = 0; i < active_nr; i++) { struct cache_entry *ce = active_cache[i]; - printf("%s%c", ce->name, line_terminator); + if (!show_stage) + printf("%s%c", ce->name, line_terminator); + else + printf(/* "%06o %s %d %10d %s%c", */ + "%06o %s %d %s%c", + ntohl(ce->ce_mode), + sha1_to_hex(ce->sha1), + ce_stage(ce), + /* ntohl(ce->ce_size), */ + ce->name, line_terminator); } } if (show_deleted) { @@ -156,12 +166,16 @@ show_ignored = 1; continue; } + if (!strcmp(arg, "--stage")) { + show_stage = 1; + continue; + } - usage("show-files (--[cached|deleted|others|ignored])*"); + usage("show-files [-z] (--[cached|deleted|others|ignored|stage])*"); } /* With no flags, we default to showing the cached files */ - if (!(show_cached | show_deleted | show_others | show_ignored)) + if (!(show_stage | show_deleted | show_others | show_ignored)) show_cached = 1; read_cache(); ^ permalink raw reply [flat|nested] 130+ messages in thread
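The flag layout in the cache.h hunk above can be tried out with
plain shell arithmetic.  This is only a sketch: the function names
mirror the macros in the patch, the 0x0fff and 0x3000 masks and the
shift of 12 are taken from it, and the htons()/ntohs() byte-order
conversion is deliberately left out, so the numbers are host-order
values and not what is stored on disk.

```shell
#!/bin/sh
# Sketch of the ce_flags layout from the patch: the low 12 bits hold
# the name length, bits 12-13 hold the merge stage.  The byte-order
# conversion done by htons/ntohs in the C macros is omitted here.
CE_NAMEMASK=$((0x0fff))
CE_STAGEMASK=$((0x3000))
CE_STAGESHIFT=12

create_ce_flags() {   # usage: create_ce_flags NAMELEN STAGE
    echo $(( $1 | ($2 << CE_STAGESHIFT) ))
}

ce_namelen() {        # usage: ce_namelen FLAGS
    echo $(( $1 & CE_NAMEMASK ))
}

ce_stage() {          # usage: ce_stage FLAGS
    echo $(( ($1 & CE_STAGEMASK) >> CE_STAGESHIFT ))
}

# A two-letter path (like the test files) recorded at stage 3.
flags=$(create_ce_flags 2 3)
echo "flags=$flags namelen=$(ce_namelen "$flags") stage=$(ce_stage "$flags")"
```

Run as is, this prints "flags=12290 namelen=2 stage=3", which shows
that the stage bits and the name length can be recovered
independently from the same 16-bit field.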
* Issues with higher-order stages in dircache
  2005-04-16  8:12 ` Junio C Hamano
  2005-04-16  9:27   ` [PATCH] Byteorder fix for read-tree, new -m semantics version Junio C Hamano
  2005-04-16 10:35   ` [PATCH 1/2] Add --stage to show-files for new stage dircache Junio C Hamano
@ 2005-04-16 14:03   ` Junio C Hamano
  2005-04-17  5:11     ` Junio C Hamano
  2005-04-17 10:00     ` Summary of "read-tree -m O A B" mechanism Junio C Hamano
  2005-04-16 15:28   ` [PATCH 3/2] merge-trees script for Linus git Linus Torvalds
  3 siblings, 2 replies; 130+ messages in thread
From: Junio C Hamano @ 2005-04-16 14:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "JCH" == Junio C Hamano <junkio@cox.net> writes:

JCH> So what's next?

Here is my current thinking on the impact your higher-order
stage dircache entries would have on the rest of the system and
how to deal with them.

 * read-tree

   - When merging two trees, i.e. "read-tree -m A B", shouldn't
     we collapse identical stage-1/2 into stage-0?

 * update-cache

   - An explicit "update-cache [--add] [--remove] path" should
     be taken as a signal from the user (or Cogito) to tell the
     dircache layer "the merge is done and here is the result".
     So just delete higher-order stages for the path and record
     the specified path at stage 0 (or remove it altogether).

   - "update-cache --refresh" should just ignore a path that has
     not been merged.  Maybe say "needs merge", just like "needs
     update" [*1*].

   - "update-cache --cacheinfo" should get an extra "stage"
     argument.  Unmerged state is typically produced by running
     "read-tree -m", but the user or Cogito can do it by hand
     with this if he wanted to.

   - I do not think we need a separate "remove the entry for
     this path at this stage" thing.  That is only necessary if
     the user or Cogito is doing things by hand (as opposed to
     "read-tree -m"), which should be a very rare case.  He can
     always do "update-cache --remove" followed by "update-cache
     --cacheinfo" to obtain the desired result if he really
     wanted to.  For that, "update-cache --force-remove" may
     come in handy.

 * show-diff

   - What should we do about unmerged paths?  Showing diffs
     between the combinations (1->2), (1->3), and (2->3) that
     exist may not be a bad idea.  It would not be confusing
     because by definition dircache with higher-order stages is
     a merge temporary directory and the user should not have a
     working file there to begin with.  I think the current
     implementation does a very bad thing: repeating the same
     diff as many times as it has higher-order stages for the
     same path.

 * checkout-cache

   - When checkout-cache is run with explicit paths that are
     unmerged, what should we do?  What does that mean in the
     first place?  One use scenario I can think of is that the
     user or Cogito wants the contents at all three stages, in
     order to run a merge tool on them.  From this point of
     view, checking out all the available stages for the path
     makes sense.  My "cunning plan" is to drop ".1-$file",
     ".2-$file", and ".3-$file" in the working directory.  How
     does that sound?

   - When checkout-cache -a is run, presumably the user wants to
     check out everything to verify (e.g. build-test) the
     result.  In this case, we should skip unmerged paths, give
     a warning, and check out only the merged ones.

[Footnotes]

*1* Unrelated note.  Who is the intended consumer of this "needs
update" message?  Should we make it machine readable with '-z'
flag as well?  Otherwise, shouldn't it go to stderr?  Currently
it goes to stdout.

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Issues with higher-order stages in dircache
  2005-04-16 14:03 ` Issues with higher-order stages in dircache Junio C Hamano
@ 2005-04-17  5:11   ` Junio C Hamano
  2005-04-17  5:31     ` Linus Torvalds
  2005-04-17 10:00   ` Summary of "read-tree -m O A B" mechanism Junio C Hamano
  1 sibling, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-17  5:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus,

    earlier I wrote [*R1*]:

   - An explicit "update-cache [--add] [--remove] path" should
     be taken as a signal from the user (or Cogito) to tell the
     dircache layer "the merge is done and here is the result".
     So just delete higher-order stages for the path and record
     the specified path at stage 0 (or remove it altogether).

and I think this commit of yours implements the adding half.

    commit be7b1f05cea8e5213ffef8f74ebdefed2aacb6fc:1
    author Linus Torvalds <torvalds@ppc970.osdl.org> 1113678345 -0700
    committer Linus Torvalds <torvalds@ppc970.osdl.org> 1113678345 -0700

    When inserting a index entry of stage 0, remove all old unmerged entries.

I am wondering if you have a particular reason not to do the
same for the removing half.  Without it, currently I do not see
a way for the user or Cogito to tell dircache layer that the
merge should result in removal.  That is, other than first
adding a phony entry there (which brings the entry down to stage
0) and then immediately doing a regular update-cache --remove.
That is two instead of one reading of 1.6MB index file for the
kernel case.

Also do you have any comments on this one from the same message?

 * read-tree

   - When merging two trees, i.e. "read-tree -m A B", shouldn't
     we collapse identical stage-1/2 into stage-0?

[References]

*R1* http://marc.theaimsgroup.com/?l=git&m=111366023126466&w=2

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Issues with higher-order stages in dircache
  2005-04-17  5:11 ` Junio C Hamano
@ 2005-04-17  5:31   ` Linus Torvalds
  2005-04-17  6:01     ` Junio C Hamano
  0 siblings, 1 reply; 130+ messages in thread
From: Linus Torvalds @ 2005-04-17  5:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sat, 16 Apr 2005, Junio C Hamano wrote:
>
> I am wondering if you have a particular reason not to do the
> same for the removing half.

No. Except for me being silly.

Please just make it so.

> Also do you have any comments on this one from the same message?
>
>  * read-tree
>
>    - When merging two trees, i.e. "read-tree -m A B", shouldn't
>      we collapse identical stage-1/2 into stage-0?

How do you actually intend to merge two trees?

That sounds like a total special case, and better done with "diff-tree".
But regardless, since I assume the result is the later tree, why do a
"read-tree -m A B", since what you really want is "read-tree B"?

The real merge always needs the base tree, and I'd hate to complicate the
real merge with some special-case that isn't relevant for that real case.

                Linus

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Issues with higher-order stages in dircache
  2005-04-17  5:31 ` Linus Torvalds
@ 2005-04-17  6:01   ` Junio C Hamano
  0 siblings, 0 replies; 130+ messages in thread
From: Junio C Hamano @ 2005-04-17  6:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

>> - When merging two trees, i.e. "read-tree -m A B", shouldn't
>> we collapse identical stage-1/2 into stage-0?

LT> How do you actually intend to merge two trees?

How silly of me.  *BLUSH*

^ permalink raw reply [flat|nested] 130+ messages in thread
* Summary of "read-tree -m O A B" mechanism
  2005-04-16 14:03 ` Issues with higher-order stages in dircache Junio C Hamano
  2005-04-17  5:11   ` Junio C Hamano
@ 2005-04-17 10:00   ` Junio C Hamano
  1 sibling, 0 replies; 130+ messages in thread
From: Junio C Hamano @ 2005-04-17 10:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Earlier I wrote down a list of issues your recent "merge stage"
changes have introduced to the rest of the plumbing, with a set
of suggested adaptions.  I think all of them are cleared now
(you have a pile of patches from me in your mailbox).

I do not know what percentage of people on this list are using
git without the Cogito part, but I suspect that the number might
be quite small.  I also suspect, from the description Petr gave
us on how the merging in Cogito works, that Cogito does not
currently use the "read-tree -m O A B" mechanism, and that the
majority who do not deal with the low level tools themselves
would not have to know about the merge issues yet.

But I think it is a good time, now that things have started to
settle down, to summarize how various commands work when they
see those "funny" dircache entries created after "read-tree -m
O A B" has run.  Of course, people working on Cogito need to
know them, once they decide to use the "read-tree -m O A B"
mechanism.

 * read-tree -m O A B

   - For description on how this works, the definitive reading
     is [*R1*].  In short:

     - unlike ordinary read-tree, "-m" form reads up to three
       trees and creates paths that are "unmerged".

     - trivial merges are done by read-tree itself.  only
       conflicting paths will be in unmerged state when
       read-tree returns.

 * write-tree

   - write-tree refuses to give you a tree until all the
     unmerged paths are resolved.

 * show-files

   - "show-files --unmerged" and "show-files --stage" can be
     used to examine detailed information on unmerged paths.
     For an unmerged path, instead of recording a single
     mode/SHA1 pair, the dircache records up to three such
     pairs; one from tree O in stage 1, A in stage 2, and B in
     stage 3.  This information can be used by the user (or
     Cogito) to see what should eventually be recorded at the
     path.

 * update-cache

   - An explicit "update-cache [--add] path" or "update-cache
     [--add] --cacheinfo mode SHA1 path" tells the plumbing that
     the user (or Cogito) wants to resolve it by storing
     mode/SHA1 of the given working file or mode SHA1 specified
     on the command line.  The path ceases to be in unmerged
     state after this happens.

     Similarly, "update-cache --remove path" resolves the
     unmerged state and the merge result is not having anything
     at that path.

   - "update-cache --refresh", in addition to the "needs update"
     message people are now familiar with, says "needs merge"
     for unmerged paths.

 * show-diff

   - show-diff on an unmerged path simply says "unmerged" (the
     plumbing would not know what to diff with what among three
     stages and the working file).

 * checkout-cache

   - "checkout-cache -a" warns about unmerged paths and checks
     out only the merged paths.

   - "checkout-cache [-f] path" on an unmerged path says
     "Unmerged", just like the same command on non-existent path
     says "not in the cache", and does not touch the working
     file.

I hope the descriptions in this summary are correct enough to be
useful to somebody.

[Reference]

*R1* http://marc.theaimsgroup.com/?l=git&m=111363270608902&w=2

^ permalink raw reply [flat|nested] 130+ messages in thread
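As an illustration of the stage bookkeeping summarized above, the
"mode SHA1 stage path" lines printed by "show-files --stage" can be
filtered for paths that still need resolving.  This is a rough
sketch and not part of the plumbing: unmerged_paths is a made-up
helper name, it assumes paths contain no whitespace (the -z output
is not handled), and "show-files --unmerged" is the real tool for
this job.

```shell
#!/bin/sh
# Reads "mode SHA1 stage path" lines, as printed by show-files --stage
# in the sample earlier in the thread, and prints each path that has
# an entry at a non-zero stage (i.e. is still unmerged) exactly once.
# Assumption: paths contain no whitespace.
unmerged_paths() {
    awk '$3 != 0 && !seen[$4]++ { print $4 }'
}

# Three entries for MN (stages 1-3) and one merged entry for NN,
# copied from the sample output.
printf '%s\n' \
    '100664 f48f37ea0205a7e5591777b4d3ae0d153d3ef131 1 MN' \
    '100664 d7600381b69b92f61bad50c5f8408e831b622ef0 2 MN' \
    '100664 f48f37ea0205a7e5591777b4d3ae0d153d3ef131 3 MN' \
    '100664 83d94b8fd056921f22ad2ca0122dd7f64974be7c 0 NN' |
unmerged_paths
```

With the four sample lines this prints only "MN"; NN sits at stage
0 and is therefore already merged.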
* Re: [PATCH 3/2] merge-trees script for Linus git
  2005-04-16  8:12 ` Junio C Hamano
                     ` (2 preceding siblings ...)
  2005-04-16 14:03   ` Issues with higher-order stages in dircache Junio C Hamano
@ 2005-04-16 15:28   ` Linus Torvalds
  2005-04-16 16:36     ` Linus Torvalds
  3 siblings, 1 reply; 130+ messages in thread
From: Linus Torvalds @ 2005-04-16 15:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sat, 16 Apr 2005, Junio C Hamano wrote:
>
> LT> NOTE NOTE NOTE! I could make "read-tree" do some of these nontrivial
> LT> merges, but I ended up deciding that only the "matches in all three
> LT> states" thing collapses by default.
>
>  * Understood and agreed.

Having slept on it, I think I'll merge all the trivial cases that don't
involve a file going away or being added. Ie if the file is in all three
trees, but it's the same in two of them, we know what to do.

That way we'll leave things where the tree itself changed (files added or
removed at any point) and/or cases where you actually need a 3-way merge.

> The userland merge policies need ways to extract the stage
> information and manipulate them.  Am I correct to say that you
> mean by "ls-files -l" the extracting part?

No, I meant "show-files", since we need to show the index, not a tree (no
valid tree can ever have the "modes" information, since (a) it doesn't
have the space for it anyway and (b) we refuse to write out a dirty index
file).

>
> LT> I should make "ls-files" have a "-l" format, which shows the
> LT> index and the mode for each file too.
>
> You probably meant "ls-tree".  You used the word "mode" but it
> already shows the mode so I take it to mean "stage".  Perhaps
> something like this?
>
>     $ ls-tree -l -r 49c200191ba2e3cd61978672a59c90e392f54b8b
>     100644 blob fe2a4177a760fd110e78788734f167bd633be8de    COPYING
>     100644 blob b39b4ea37586693dd707d1d0750a9b580350ec50:1  man/frotz.6
>     100644 blob b39b4ea37586693dd707d1d0750a9b580350ec50:2  man/frotz.6
>     100664 blob eeed997e557fb079f38961354473113ca0d0b115:3  man/frotz.6

Apart from the fact that it would be

        show-files -l

since there are no tree objects that can have anything but fully merged
state, yes.

> Assuming that you would be working on that, I'd like to take the
> dircache manipulation part.  Let's think about the minimally
> necessary set of operations:
>
>  * The merge policy decides to take one of the existing stage.
>
>    In this case we need a way to register a known mode/sha1 at a
>    path.  We already have this as "update-cache --cacheinfo".
>    We just need to make sure that when "update-cache" puts
>    things at stage 0 it clears other stages as well.
>
>  * The merge policy comes up with a desired blob somewhere on
>    the filesystem (perhaps by running an external merge
>    program).  It wants to register it as the result of the
>    merge.
>
>    We could do this today by first storing the "desired blob"
>    in a temporary file somewhere in the path the dircache
>    controls, "update-cache --add" the temporary file, ls-tree to
>    find its mode/sha1, "update-cache --remove" the temporary
>    file and finally "update-cache --cacheinfo" the mode/sha1.
>    This is workable but clumsy.  How about:
>
>    $ update-cache --graft [--add] desired-blob path
>
>    to say "I want to register mode/sha1 from desired-blob, which
>    may not be of verify_path() satisfying name, at path in the
>    dircache"?
>
>  * The merge policy decides to delete the path.
>
>    We could do this today by first stashing away the file at the
>    path if it exists, "update-cache --remove" it, and restore
>    if necessary.  This is again workable but clumsy.  How about:
>
>    $ update-cache --force-remove path
>
>    to mean "I want to remove the path from dircache even though
>    it may exist in my working tree"?

Yes.

> Am I on the right track?

Exactly.

> You might want to go even lower level by letting them say
> something like:
>
>  * update-cache --register-stage mode sha1 stage path
>
>    Registers the mode/sha1 at stage for path.  Does not look at
>    the working tree.  stage is [0-3]

I'd prefer not.

I'd avoid playing games with the stages at any other level than the "full
tree" level until we show a real need for it.

Let's go with the known-needed minimal cases that are high-level enough to
make the scripting simple, and see if there is any reason to ever touch
the tree any other way.

                Linus

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: [PATCH 3/2] merge-trees script for Linus git
  2005-04-16 15:28 ` [PATCH 3/2] merge-trees script for Linus git Linus Torvalds
@ 2005-04-16 16:36   ` Linus Torvalds
  2005-04-16 17:14     ` Junio C Hamano
  0 siblings, 1 reply; 130+ messages in thread
From: Linus Torvalds @ 2005-04-16 16:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sat, 16 Apr 2005, Linus Torvalds wrote:
>
> Having slept on it, I think I'll merge all the trivial cases that don't
> involve a file going away or being added. Ie if the file is in all three
> trees, but it's the same in two of them, we know what to do.

Junio, I pushed this out, along with the two patches from you. It's still
more anal than my original "tree-diff" algorithm, in that it refuses to
touch anything where the name isn't the same in all three versions
(original, new1 and new2), but now it does the "if two of them match, just
select the result directly" trivial merges.

I really cannot see any sane case where user policy might dictate doing
anything else, but if somebody can come up with an argument for a merge
algorithm that wouldn't do what that trivial merge does, we can make a
flag for "don't merge at all".

The reason I do want to merge at all in "read-tree" is that I want to
avoid having to write out a huge index-file (it's 1.6MB on the kernel, so
if you don't do _any_ trivial merges, it would be 4.8MB after reading
three trees) and then having people read it and parse it just to do stuff
that is obvious. Touching 5MB of data isn't cheap, even if you don't do a
whole lot to it.

Anyway, with the modified read-tree, as far as I can tell it will now
merge all the cases where one side has done something to a file, and the
other side has left it alone (or where both sides have done the exact same
modification). That should _really_ cut down the cases to just a few files
for most of the kernel merges I can think of.

Does it do the right thing for your tests?

                Linus

^ permalink raw reply [flat|nested] 130+ messages in thread
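The "if two of them match, just select the result directly" rule
described above is small enough to sketch in shell.  This is an
illustration of the rule as stated in the message, not the code in
read-tree: trivial_merge is a made-up name, it only covers a path
that is present in all three trees, and it compares the blob SHA1s
as plain strings.

```shell
#!/bin/sh
# Sketch of the "two of the three match" rule for one path that is
# present in all three trees.  Arguments are the blob SHA1s from the
# original (O) and the two branches (A, B).  Prints the SHA1 to keep,
# or the word "unmerged" when a real 3-way merge is needed.
# trivial_merge is a hypothetical helper, not part of read-tree.
trivial_merge() {
    o=$1 a=$2 b=$3
    if [ "$a" = "$b" ]; then
        echo "$a"          # both sides agree (or nobody changed it)
    elif [ "$o" = "$a" ]; then
        echo "$b"          # only B changed it
    elif [ "$o" = "$b" ]; then
        echo "$a"          # only A changed it
    else
        echo unmerged      # both changed it, in different ways
    fi
}

# The MN entry from the earlier sample: A modified it, B did not.
trivial_merge f48f37ea0205a7e5591777b4d3ae0d153d3ef131 \
              d7600381b69b92f61bad50c5f8408e831b622ef0 \
              f48f37ea0205a7e5591777b4d3ae0d153d3ef131
```

For the MN entry this prints d7600381b69b92f61bad50c5f8408e831b622ef0,
i.e. branch A's version, which is the case where one side has done
something to a file and the other side has left it alone.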
* Re: [PATCH 3/2] merge-trees script for Linus git
  2005-04-16 16:36 ` Linus Torvalds
@ 2005-04-16 17:14   ` Junio C Hamano
  0 siblings, 0 replies; 130+ messages in thread
From: Junio C Hamano @ 2005-04-16 17:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> Anyway, with the modified read-tree, as far as I can tell it will now
LT> merge all the cases where one side has done something to a file, and the
LT> other side has left it alone (or where both sides have done the exact same
LT> modification). That should _really_ cut down the cases to just a few files
LT> for most of the kernel merges I can think of.

LT> Does it do the right thing for your tests?

Yes.

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Re: Merge with git-pasky II.
  2005-04-15  0:58 ` Junio C Hamano
  2005-04-14 22:30   ` Christopher Li
@ 2005-04-15 19:54   ` Petr Baudis
  1 sibling, 0 replies; 130+ messages in thread
From: Petr Baudis @ 2005-04-15 19:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

Dear diary, on Fri, Apr 15, 2005 at 02:58:25AM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> told me that...
> >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:
>
> >> I think the above would result in what SCM person would call
> >> "merge upstream/sidestream changes into my working directory".
>
> PB> And that's exactly what I'm doing now with git merge. ;-) In fact,
> PB> ideally the whole change in my scripts when your script is finished
> PB> would be replacing
>
> PB>     checkout-cache `diff-tree` # symbolic
> PB>     git diff $base $merged | git apply
>
> PB> with
>
> PB>     merge-tree.pl -b $base $(tree-id) $merged | parse-your-output
>
> In the above I presume by $merged you mean the tree ID (or
> commit ID) the user's working directory is based upon?  Well,
> merge-trees (Linus has a single directory merge-tree already)
> looks at tree IDs (or commit IDs); it would never involve
> working files in random state that is not recorded as part of a
> tree (committed or not).  Given that constraints I am not sure
> how well that would pan out.  I have to think about this a bit.

No, $(tree-id) is the "destination branch", what the user directory is
based upon; $merged is the branch you are merging now, relative to
$base.

When I throw away the useless "-b" argument, in practice it would look
like

        merge-trees abcd 1234 5678

for doing

                  /------ 1234
        -+- abcd <
          /       \------ 5678

(not that the order of 1234 and 5678 would actually really matter)

I fear I don't understand the rest of your paragraph. :-(

> I do like, however, the idea of separating the step of doing any
> checkout/merge etc. and actually doing them.  So the command set
> of parse-your-output needs to be defined.  Based on what I have
> done so far, it would consist of the following:
>
>  - Result is this object $SHA1 with mode $mode at $path (takes
>    one of the trees); you can do update-cache --cacheinfo (if
>    you want to muck with dircache) or cat-file blob (if you want
>    to get the file) or both.
>
>  - Result is to delete $path.
>
>  - Result is a merge between object $SHA1-1 and $SHA1-2 with
>    mode $mode-1 or $mode-2 at $path.
>
> Would this be a good enough command set?

What about the conflicts? Like one tree deleting, other tree modifying?

-- 
                                Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Merge with git-pasky II.
  2005-04-14 23:31 ` Petr Baudis
  2005-04-14 20:30   ` Christopher Li
  2005-04-15  0:58   ` Junio C Hamano
@ 2005-04-15 10:22   ` Junio C Hamano
  2005-04-15 20:40     ` Petr Baudis
  2 siblings, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-15 10:22 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, git

After I re-read [*R1*], in which Linus talks about dircache,
especially this section:

     - The "current directory cache" describes some baseline. In
       particular, note the "some" part. It's not tied to any
       special baseline, and you can change your baseline any
       way you please.

       So it does NOT have to track any particular state in
       either the object database _or_ in your actual current
       working tree. In fact, all real interactions with "git"
       are really about updating this staging area one way or
       the other: you might check out the state from it into
       your working area (partially or fully), you can push your
       working area into the staging area (again, partially or
       fully).

       And if you want to, you can write the thing that the
       staging area represents as a "tree" into the object
       database, or you can merge a tree from the object
       database into the staging area.

       In other words: the staging area aka "current directory
       cache" is really how all interaction takes place. The
       object database never interacts directly with your
       working directory contents. ALL interactions go through
       the current directory cache.

I started to have more doubts on the approach of *not*
performing the merge in the dircache I set up specifically for
merging, which is the direction in which you are pushing if I
understand you correctly.  Maybe I completely misunderstand what
you want.  This message is long but I need a clear understanding
of what is expected to be useful to you, so please bear with me.

PB> merge-tree.pl -b $base $(tree-id) $merged | parse-your-output

Please help me understand this example you have given earlier.
Here is my understanding of your assumption when the above
pipeline takes place.  Correct me if I am mistaken.

 * The user is in a working directory $W.  It is controlled by
   git-tools and there are $W/.git/ directory and $W/.git/index
   dircache.

 * The dircache $W/.git/index started its life as a read-tree
   from some commit.  The git-tools is keeping track of which
   commit it is somewhere, presumably in $W/.git/ directory.
   Let's call it $C (commit).

   ? Question.  Is the $(tree-id) in your example the same as $C
     above?

 * The user has run [*1*] (see Footnote below) checkout-cache on
   $W/.git/index some time in the past and $W is full of working
   files.  Some of them may or may not have been modified.
   There may be some additions or deletions.  So the contents of
   the working directory may not match the tree associated with
   $C.

 * The user may or may not have run [*1*] update-cache in $W.
   The contents of the dircache $W/.git/index may not match the
   tree associated with $C.

   ? Question.  Are you forbidding the user to run update-cache
     by hand, and keeping track of the changes yourself, to be
     applied all at once at "git commit" time, thereby
     guaranteeing the $W/.git/index to match the tree associated
     with $C all times?  From the description of The "GIT
     toolkit" section in README, it is not clear to me which
     part of his repository an end user is not supposed to muck
     with himself.

 * Now the user has some changes in his working directory and
   notices upstream or a side branch has notable changes
   desirable to be picked up.  So he runs some git-tools command
   to cause the above quoted pipeline to run.

   ? Question.  Does $merged in your example mean such an
     upstream or side branch?  Is $base in your example the
     common ancestor between $C and $merged?

Assuming that my above understanding of your model is correct,
here is my "thinking aloud".

 - "merge-trees $base $C $merged" looks only at the git object
   database for those three trees named.  The data structure of
   git object database is optimized to distinguish differences
   in those recorded trees (and hence recorded blobs they point
   at) without unpacking most of the files if the changes are
   small, because all the blobs involved are already hashed.  It
   is not very good at comparing things in git object store and
   working files in random states, which would involve unpacking
   blobs and comparing, so "merge-trees" does not bother.

 - What can come out from merge-trees is therefore one of the
   following for each path from the union of paths contained in
   $base, $C, and $merged:

   (a) Neither $C nor $merged changed it --- merge result is
       what is in $C.

   (b) $C changed it but $merged did not --- merge result is
       what is in $C.

   (c) Both $C and $merged changed it in the same way --- merge
       result is what is in $C.

   (d) $C did not change it but $merged did --- merge result is
       what is in $merged.

   (e) Both $C and $merged changed it differently --- merge is
       needed and automatically succeeds between $C and
       $merged.

   (f) Both $C and $merged changed it differently --- merge is
       needed but has conflicts.

 - Assuming we are dealing with the case where working files are
   dirty and do not match what is in $C, among the above,
   (a)-(c) can be ignored by SCM.  What the user has in his
   working files is exactly what he would have got if he started
   working from the merge result, although in reality the work
   was started from $C.

   Handling (d), (e) and (f) from SCM's point of view would be
   the same.  They all involve 3-way merges between the file in
   the working directory, and the file from $merged, pivoting on
   the file from $base.  In order to help SCM, merge-trees
   therefore should output SHA1 of blobs for such a file from
   $base and $merged and expect SCM to run "cat-file blob" on
   them and then merge or diff3.  Up to the point of giving
   those two SHA1 out is the business of merge-trees and after
   that it is up to SCM.

   That would work.  So I should base the design of output from
   merge-trees on the above analysis, which probably needs to be
   extended to cover differences between creation, modification,
   and deletion.

 - However, the above is quite different from the way Linus
   envisioned initially, on which my current implementation is
   based [*3*].

   My current implementation is to record the merge outcome in
   the temporary dircache $W/,,merge/.git/index for cases
   (a)-(e).  The last case (f) is problematic and needs human
   validation [*2*], so it is not recorded in that temporary
   dircache, but the files to be merged are left in that
   temporary directory and merge-trees stops there.

   It is expected that the end-user or SCM would merge the
   resulting file and run update-cache to update
   $W/,,merge/.git/index.  After that happens,
   $W/,,merge/.git/index has the tree representing the desired
   result of the merge.  It is expected that the end-user or SCM
   would write-tree, commit-tree there in the temporary
   directory, creating a new commit $C1.

   Then, it is expected that the SCM would make a patch file
   between $C and the user working directory, check out $C1
   (either in the user's working directory or another temporary
   directory; at this point merge-trees does not care because it
   has already done its job and exited), and apply that patch to
   bring the user edits over to $C1.  Then that directory would
   contain the desired merge of user edits.

   That is my understanding of how Linus originally wanted the
   tool he does his kernel work with to work.  My hesitation to
   suggestions from you to change it not to keep its own merge
   dircache is coming from here.  Not doing what I am currently
   doing to $W/,,merge/.git/index dircache would mean that SCM
   would have to do more, not less, to arrive at $C1 (the result
   of the clean $merged and $C merge pivoted at $base), where
   the real SCM merge begins.
Although I suspect I am misunderstanding what you want, your messages so far suggest that what you want might be quite different from what Linus wants. Please do not misunderstand what I mean by saying this. I am not saying that Linus is always right [*4*] and therefore you are wrong for wanting something else. It is just that, if what I started writing needs to support both of those quite different needs, I need to know what they are. I think I understand what Linus wants well enough [*5*], but I am not certain about yours. [Footnotes] *1* By "The user have run" I mean either the user directly used the low-level plumbing command himself, or used git-tools to cause such command to run. *2* Strictly speaking, case (e) needs human validation as well, because successful textual merge does not guarantee sensible semantic merge. *3* See [*R2*] for descriptions on the way Linus wanted merge in git to happen. Especially around "5) At this point you need to MERGE" onwards. The current implementation handles (or attempts to handle) the `your working directory was fully committed' case described there. *4* According to Linus himself, he is always right ;-). [*R3*] *5* I consider [*R1*] and [*R2*] essential read for anybody wanting to understand merging operation in git object model (I am saying this for others; not for Pasky --- it would be like preaching to the choir ;-)). [References] *R1* <Pine.LNX.4.58.0504110928360.1267@ppc970.osdl.org> http://marc.theaimsgroup.com/?i=%3CPine.LNX.4.58.0504110928360.1267%20()%20ppc970%20!%20osdl%20!%20org%3E *R2* <Pine.LNX.4.58.0504121606580.4501@ppc970.osdl.org> http://marc.theaimsgroup.com/?i=%3CPine.LNX.4.58.0504121606580.4501%20()%20ppc970%20!%20osdl%20!%20org%3E *R3* http://www.uwsg.indiana.edu/hypermail/linux/kernel/0008.3/0555.html ^ permalink raw reply [flat|nested] 130+ messages in thread
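The per-path cases (a)-(f) listed in the message above can be sketched as a
comparison of three blob IDs.  This is only an illustration, not the actual
merge-trees code: the function name `classify` and its return labels are
hypothetical, and `None` is assumed to stand for "path absent in that tree".
Cases (e) and (f) cannot be told apart from the IDs alone, since that needs
an actual textual 3-way merge, so the sketch reports them together.

```python
from typing import Optional, Tuple


def classify(base: Optional[str], c: Optional[str],
             merged: Optional[str]) -> Tuple[str, str]:
    """Classify one path given its blob SHA1 in $base, $C and $merged.

    Returns (case, action).  None means the path does not exist in
    that tree (an assumption made for this sketch).
    """
    if c == base and merged == base:
        return ("a", "take-C")            # neither side changed it
    if merged == base:
        return ("b", "take-C")            # only $C changed it
    if c == merged:
        return ("c", "take-C")            # both changed it the same way
    if c == base:
        return ("d", "take-merged")       # only $merged changed it
    # Both sides changed it differently: a content-level 3-way merge
    # decides whether this ends up as case (e) or case (f).
    return ("e/f", "content-merge")


if __name__ == "__main__":
    for ids in [("1", "1", "1"), ("1", "2", "1"), ("1", "2", "2"),
                ("1", "1", "3"), ("1", "2", "3")]:
        print(ids, "->", classify(*ids))
```

Only the last outcome requires unpacking blobs; every other case is decided
by comparing hashes, which is the property the message relies on.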
* Re: Re: Merge with git-pasky II.
  2005-04-15 10:22 ` Junio C Hamano
@ 2005-04-15 20:40   ` Petr Baudis
  2005-04-15 22:41     ` Junio C Hamano
  0 siblings, 1 reply; 130+ messages in thread
From: Petr Baudis @ 2005-04-15 20:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

Dear diary, on Fri, Apr 15, 2005 at 12:22:26PM CEST, I got a letter
where Junio C Hamano <junkio@cox.net> told me that...
> After I re-read [*R1*], in which Linus talks about dircache,
> especially this section:
>
>     - The "current directory cache" describes some baseline. In particular,
>       note the "some" part. It's not tied to any special baseline, and you
>       can change your baseline any way you please.
>
>       So it does NOT have to track any particular state in either the object
>       database _or_ in your actual current working tree. In fact, all real
>       interactions with "git" are really about updating this staging area one
>       way or the other: you might check out the state from it into your
>       working area (partially or fully), you can push your working area into
>       the staging area (again, partially or fully).
>
>       And if you want to, you can write the thing that the staging area
>       represents as a "tree" into the object database, or you can merge a
>       tree from the object database into the staging area.
>
>       In other words: the staging area aka "current directory cache" is
>       really how all interaction takes place. The object database never
>       interacts directly with your working directory contents. ALL
>       interactions go through the current directory cache.
>
> I started to have more doubts on the approach of *not*
> performing the merge in the dircache I set up specifically for
> merging, which is the direction in which you are pushing if I
> understand you correctly.  Maybe I completely misunderstand what
> you want.  This message is long but I need a clear understanding
> of what is expected to be useful to you, so please bear with me.
>
>     PB> merge-tree.pl -b $base $(tree-id) $merged | parse-your-output
>
> Please help me understand this example you have given earlier.
> Here is my understanding of your assumption when the above
> pipeline takes place.  Correct me if I am mistaken.
>
>  * The user is in a working directory $W.  It is controlled by
>    git-tools and there are $W/.git/. directory and $W/.git/index
>    dircache.
>
>  * The dircache $W/.git/index started its life as a read-tree
>    from some commit.  The git-tools is keeping track of which
>    commit it is somewhere, presumably in $W/.git/ directory.
>    Let's call it $C (commit).
>
>  ? Question.  Is the $(tree-id) in your example the same as $C
>    above?

Yes. Actually $(tree-id) returns ID of the tree object, not the commit
object; but that doesn't matter here, probably - let's ignore that
distinction for simplicity.

>  * The user has run [*1*] (see Footnote below) checkout-cache
>    on $W/.git/index some time in the past and $W is full of
>    working files.  Some of them may or may not have been modified.
>    There may be some additions or deletions.  So the contents of
>    the working directory may not match the tree associated with
>    $C.
>
>  * The user may or may not have run [*1*] update-cache in $W.
>    The contents of the dircache $W/.git/index may not match the
>    tree associated with $C.
>
>  ? Question.  Are you forbidding the user to run update-cache by
>    hand, and keeping track of the changes yourself, to be
>    applied all at once at "git commit" time, thereby
>    guaranteeing the $W/.git/index to match the tree associated
>    with $C at all times?  From the description of The "GIT toolkit"
>    section in README, it is not clear to me which part of his
>    repository an end user is not supposed to muck with himself.

Ideally, he shouldn't be using *any* of the low-level plumbing by now.
The only exception is update-cache --refresh, which he can do at will
(I'm still thinking what to do with it :-). The git-tools always assume
that index basically contains the state as of the last commit.
(Actually the only time when this matters *now* might be git diff - the
user would get confused from the results.)

>  * Now the user has some changes in his working directory and
>    notices upstream or a side branch has notable changes
>    desirable to be picked up.  So he runs some git-tools command
>    to cause the above quoted pipeline to run.
>
>  ? Question.  Does $merged in your example mean such an upstream
>    or side branch?  Is $base in your example the common ancestor
>    between $C and $merged?

Correct. *HOWEVER* what is not correct is that git-tools would let you
merge in your working directory while you have local changes there.

In the past, the merge would happen in your working tree, but git-tools
wouldn't let you go for it unless your working tree has no local
changes. It would complain loudly and refuse to, since it's *not* what
you want to do and it was most likely a mistake.

Currently, git merge just creates a ,,merge/ subdirectory sharing the
object database with your working tree, but with an independent checkout
of it; it will do the merge there, and when you commit it there, it will
update your working tree with the merged changes.

I'm describing both behaviors since I might revert back to the first
one, based on what (if anything) Linus will reply to my mail about
out-of-tree merges. But either way, when a merge is about to happen upon
us, the working tree is "clean".

> Assuming that my above understanding of your model is correct,
> here are my "thinking aloud".
>
>  - "merge-trees $base $C $merged" looks only at the git object
>    database for those three trees named.  The data structure of
>    git object database is optimized to distinguish differences
>    in those recorded trees (and hence recorded blobs they point
>    at) without unpacking most of the files if the changes are
>    small, because all the blobs involved are already hashed.  It
>    is not very good at comparing things in git object store and
>    working files in random states, which would involve unpacking
>    blobs and comparing, so "merge-trees" does not bother.
>
>  - What can come out from merge-trees is therefore one of the
>    following for each path from the union of paths contained in
>    $base, $C, and $merged:
>
>    (a) Neither $C nor $merged changed it --- merge result is what
>        is in $C.

(Or in $base, if you don't want to give $C "unfair advantage", since it
does not matter. ;-)

>    (b) $C changed it but $merged did not --- merge result is what
>        is in $C.
>
>    (c) Both $C and $merged changed it in the same way --- merge
>        result is what is in $C.
>
>    (d) $C did not change it but $merged did --- merge result is
>        what is in $merged.
>
>    (e) Both $C and $merged changed it differently --- merge is
>        needed and automatically succeeds between $C and $merged.
>
>    (f) Both $C and $merged changed it differently --- merge is
>        needed but has conflicts.
>
>  - Assuming we are dealing with the case where working files are
>    dirty and do not match what is in $C, among the above,
>    (a)-(c) can be ignored by SCM.  What the user has in his
>    working files is exactly what he would have got if he started
>    working from the merge result, although in reality the work
>    was started from $C.

Yes. Actually they can be ignored by git-tools in any case since what is
in the directory cache is $C. So it never needs to do any special
action.

>    Handling (d), (e) and (f) from SCM's point of view would be
>    the same.  They all involve 3-way merges between the file in
>    the working directory, and the file from $merged, pivoting on
>    the file from $base.  In order to help SCM, merge-trees
>    therefore should output SHA1 of blobs for such a file from
>    $base and $merged and expect SCM to run "cat-file blob" on
>    them and then merge or diff3.  Up to the point of giving
>    those two SHA1 out is the business of merge-trees and after
>    that it is up to SCM.
>
>    That would work.  So I should base the design of output from
>    merge-trees on the above analysis, which probably needs to be
>    extended to cover differences between creation, modification,
>    and deletion.

Yes, it sounds sensible. Actually, you don't even need to make $C more
special than $merged; I can filter out only the $merged changes on the
SCM level. I guess that would add no complexity to your tool and make it
usable even for more exotic kinds of merges (like the
floating-in-the-void merge of two "equally important" trees).

>  - However, the above is quite different from the way Linus
>    envisioned initially, on which my current implementation is
>    based [*3*].
>
>    My current implementation is to record the merge outcome in
>    the temporary dircache $W/,,merge/.git/index for cases
>    (a)-(e).  The last case (f) is problematic and needs human
>    validation [*2*], so it is not recorded in that temporary
>    dircache, but the files to be merged are left in that
>    temporary directory and merge-trees stops there.  It is
>    expected that the end-user or SCM would merge the resulting
>    file and run update-cache to update $W/,,merge/.git/index.
>    After that happens, $W/,,merge/.git/index has the tree
>    representing the desired result of the merge.  It is expected
>    that the end-user or SCM would write-tree, commit-tree there
>    in the temporary directory, creating a new commit $C1.
>
>    Then, it is expected that the SCM would make a patch file
>    between $C and the user working directory, checks out $C1
>    (either in the user's working directory or another temporary
>    directory; at this point merge-trees does not care because it
>    has already done its job and exited), applies that patch to
>    bring the user edits over to $C1.  Then that directory would
>    contain the desired merge of user edits.
>
>    That is my understanding of how Linus originally wanted the
>    tool to do his kernel work with to work.  My hesitation to
>    suggestions from you to change it not to keep its own merge
>    dircache is coming from here.  Not doing what I am currently
>    doing to $W/,,merge/.git/index dircache would mean that SCM
>    would have to do more, not less, to arrive at $C1 (the result
>    of the clean $merged and $C merge pivoted at $base), where the
>    real SCM merge begins.

Well. Currently, apart from the directory cache part, I do it like you
describe. I create a new directory, after commit I apply the diff back
to the original tree etc. The only problem is really the dircache, and
that's because it would be done totally differently than in the original
tree, and I would unnecessarily have to introduce crowds of special
cases to my tools in order for them to be usable in the merge tree (I
call the ",,merge" temporary directory a "merge tree"). And the user
would still lose the capability of easily seeing the changes being
committed. I admit that I'm using this largely as an excuse and there
*could* be a tool made which would compare the given tree with the
cache, but it would be clumsy to use, violate Linus' "ALL interactions
go through the current directory cache" paradigm (whew, the first time
in my life I used this word), and we could do just fine with our current
tools.

> Although I suspect I am misunderstanding what you want, your
> messages so far suggest that what you want might be quite
> different from what Linus wants.  Please do not misunderstand
> what I mean by saying this.  I am not saying that Linus is
> always right [*4*] and therefore you are wrong for wanting
> something else.  It is just that, if what I started writing
> needs to support both of those quite different needs, I need to
> know what they are.  I think I understand what Linus wants well
> enough [*5*], but I am not certain about yours.

I can't see the conflicts between what I want and what Linus wants.
After all, Linus says that I can use the directory cache in any way I
please (well, the user can, but I'm speaking for him ;-). So I'm doing
so, and with your tool I would get into problems, since it is suddenly
imposing a policy on what should be in the index.

..snip..
> *2* Strictly speaking, case (e) needs human validation as
>     well, because successful textual merge does not guarantee
>     sensible semantic merge.
..snip..

Actually, I think _all_ the cases should make the human validation
_possible_ (by showing the diff of what would be merged), and it is
trivial to do so by having pristine index.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
* Re: Merge with git-pasky II.
  2005-04-15 20:40 ` Petr Baudis
@ 2005-04-15 22:41   ` Junio C Hamano
  0 siblings, 0 replies; 130+ messages in thread
From: Junio C Hamano @ 2005-04-15 22:41 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, git

>>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:

PB> I can't see the conflicts between what I want and what Linus wants.
PB> After all, Linus says that I can use the directory cache in any way I
PB> please (well, the user can, but I'm speaking for him ;-). So I'm doing
PB> so, and with your tool I would get into problems, since it is suddenly
PB> imposing a policy on what should be in the index.

I think our misunderstanding is coming from the use of the word
"merge tree".  I think you have been assuming that I wanted you
to run "merge-trees -o ,,merge" --- which would certainly cause
me to muck with your dircache there.  I totally agree with you
that that is a *BAD* *THING*.  No question there.

However, my assumption has been different.  I was assuming that
you would run "merge-trees -o merge~tree" (i.e. different from
your "merge tree"), so that you can get the merge results in a
form parsable by you.  And then, using that information, you can
make your changes in ,,merge.  After you are done with that
information, you can remove "merge~tree", of course.

The format I chose for the "merge result in a form parsable by
you" happens to be a dircache in "merge~tree", with minimum
number of files checked out when merge cannot be automatically
done safely.

In the simplest case of not having any conflicting merge between
$C and $merged, Cogito can immediately run write-tree in
"merge~tree" (not ,,merge) to obtain its tree-ID $T, so that it
can feed it to diff-tree to compare it with whatever tree state
Cogito wants to apply the merges between $C and $merged to.

I still do not understand what you do in ,,merge directory, but
here is one way you can update the user working directory
in-place without having a ,,merge directory [*2*].  You can run
your "git diff" between $C and $T [*1*].  The result is the diff
you need to apply on top of your user's working files.  If the
user does not like the result of running that diff, it can
easily be reversed.

If a manual merge were needed between $C and $merged, Cogito
could guide the user through that manual edit in "merge~tree",
and run update-cache on those hand merged files in "merge~tree",
before running write-tree in "merge~tree" to obtain $T; after
that, everything else is the same.

You make interesting points in other parts of your message I
need to regurgitate for a while, so I would not comment on them
in this message.

[Footnote]

*1* I really like the convenience of being able to use tree-ID
    and commit-ID interchangeably there.  Thanks.

*2* I understand that this would change the user's "git-tools"
    experience a bit.  The user will not be told to "go to ,,merge
    and commit there which will be reflected back to your working
    tree" anymore.  Instead the merge happens in-place.
    Committing, not committing, or further hand-fixing the merge
    is up to the user.  I suspect this change might even be for
    the better.
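The in-place update described above (diff $C against $T, apply the result
on top of the working files, reverse it if unwanted) can be sketched as
follows.  This is a deliberately simplified model and an assumption of this
sketch: trees and the working directory are plain path-to-blob-ID mappings,
and changes are applied at whole-file granularity, whereas a real
"git diff | git apply" works on text hunks.  The names `diff_trees`,
`reverse` and `apply_diff` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Tree = Dict[str, str]                        # path -> blob ID
Change = Tuple[Optional[str], Optional[str]]  # (old ID, new ID); None = absent


def diff_trees(c: Tree, t: Tree) -> Dict[str, Change]:
    """Whole-file differences needed to turn tree c into tree t."""
    paths = sorted(set(c) | set(t))
    return {p: (c.get(p), t.get(p)) for p in paths if c.get(p) != t.get(p)}


def reverse(diff: Dict[str, Change]) -> Dict[str, Change]:
    """The diff that undoes `diff`."""
    return {p: (new, old) for p, (old, new) in diff.items()}


def apply_diff(work: Tree, diff: Dict[str, Change]) -> Tuple[Tree, List[str]]:
    """Apply diff to a copy of work; report paths that did not match."""
    result = dict(work)
    rejects: List[str] = []
    for path, (old, new) in diff.items():
        if result.get(path) != old:
            rejects.append(path)   # locally edited: needs a real 3-way merge
        elif new is None:
            del result[path]       # file removed by the merge
        else:
            result[path] = new     # file added or replaced by the merge
    return result, rejects


if __name__ == "__main__":
    C = {"a": "1", "b": "2"}
    T = {"a": "1", "b": "3", "c": "4"}
    work = {"a": "9", "b": "2"}    # "a" carries an uncommitted local edit
    d = diff_trees(C, T)
    merged, rejects = apply_diff(work, d)
    print(merged, rejects)
    print(apply_diff(merged, reverse(d)))
```

The local edit to "a" survives because the merge did not touch that path;
a path changed both locally and by the merge lands in the reject list
instead of being overwritten.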
* Re: Merge with git-pasky II.
  2005-04-14  8:06 ` Linus Torvalds
  2005-04-14  8:39   ` Junio C Hamano
@ 2005-04-15 19:57   ` Junio C Hamano
  2005-04-15 20:45     ` Linus Torvalds
  1 sibling, 1 reply; 130+ messages in thread
From: Junio C Hamano @ 2005-04-15 19:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, Christopher Li, git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> In the meantime I wrote a very stupid "merge-tree" which
LT> does things slightly differently, but I really think your
LT> approach (aka my original approach) is actually a lot
LT> faster. I was just starting to worry that the ball didn't
LT> start, so I wrote an even hackier one.

LT> ... This "one directory at a time with very explicit output"
LT> thing is much more down-to-earth, but it's also likely
LT> slower because it will need script help more often.

I was looking at merge-tree.c last night to add recursive
behaviour (my favorite these days ;-) to it [*1*].

But then I started thinking.

LT> ... For each entry in the directory it says either
LT> 	select <mode> <sha1> path
LT> or
LT> 	merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path
LT> depending on whether it could directly select the right object or not.

Given that the case you are primarily interested in is the one
that affects only small parts of a huge tree (i.e. common kernel
merge pattern I understand from your previous messages), your
"hacky version" [*2*], extended for recursive operation, would
spit out 98% select and 2% merge, and probably the origin of
these selects are distributed across ancestor=90%, his=4%,
my=4%, or something similar.  Am I misestimating grossly?

Assuming I am correct in the above, this would not scale for a
huge project.  We need to cut down the number of "90% select"
part of the output to make it manageable.  I am thinking about:

 - adding recursive behaviour (I am almost done with this);

 - adding another command line argument to merge-tree.c, to
   tell "do not output anything for the path if the resulting
   merge is the same as what is in this tree";

 - adding another output type, "delete" to make the output type
   repertoire these three:

	delete path
	select <mode> <sha1> path
	merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path

When the user of the output of

    $ merge-tree <ancestor-sha1> <my-sha1> <his-sha1> <result-base-sha1>

wants to get a dircache populated with the merged result, he can:

 1. read-tree <result-base-sha1>
 2. for each output:
    a) "delete" -- delete path from dircache
    b) "select" -- register mode-sha1 at path
    c) "merge"  -- do the 3-way merge and register result at path

Do you think this is sensible?

The reason I have the separate <result-base-sha1> instead of
always using <ancestor-sha1> is because the user may be thinking
of patching an existing base which is different from "my" or
"his" or "ancestor" and doing it in place.  That way, probably
Pasky's SCM can use it to patch the dircache it creates in its
own ,,merge/ directory, which would most likely be initially
populated from the dircache in the user's working directory---
which may or may not match "my-sha1" if the user has uncommitted
update-cache there.

Pasky, do you think this is workable?  If so do you think this
would make your life easier?

[Footnotes]

*1* That's how I found the S_IFDIR problem (not in your tree but
    in the copy I had).

*2* I did not find it quite "hacky".  It was a pleasant read.
    Especially I liked "smaller()" part.
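The three-step recipe in the message above (read-tree the result base, then
act on each "delete", "select" or "merge" line) can be sketched like this.
It is an illustration under stated assumptions, not merge-tree's real
behaviour: the dircache is modelled as a dict, the "merge" fields are
assumed to be ordered ancestor->mine,his, the merged entry is assumed to
keep "my" mode, and the 3-way content merge is passed in as a hypothetical
callback `merge3` that returns the resulting blob ID.

```python
from typing import Callable, Dict, Iterable, Tuple

Entry = Tuple[str, str]  # (mode, sha1)


def apply_merge_output(index: Dict[str, Entry],
                       lines: Iterable[str],
                       merge3: Callable[[str, str, str], str]
                       ) -> Dict[str, Entry]:
    """Apply delete/select/merge lines to a copy of a dircache-like dict."""
    result = dict(index)              # step 1: start from the result base
    for line in lines:                # step 2: one action per output line
        kind, rest = line.split(" ", 1)
        if kind == "delete":
            result.pop(rest, None)    # a) drop the path
        elif kind == "select":
            mode, sha1, path = rest.split(" ", 2)
            result[path] = (mode, sha1)            # b) register mode-sha1
        elif kind == "merge":
            modes, sha1s, path = rest.split(" ", 2)
            _base_mode, side_modes = modes.split("->")
            my_mode, _his_mode = side_modes.split(",")
            base_sha, side_shas = sha1s.split("->")
            my_sha, his_sha = side_shas.split(",")
            # c) 3-way merge, then register the result (mode: assumption)
            result[path] = (my_mode, merge3(base_sha, my_sha, his_sha))
        else:
            raise ValueError("unknown output type: " + kind)
    return result


if __name__ == "__main__":
    base = {"old.c": ("100644", "ooo"), "keep.c": ("100644", "kkk")}
    out = apply_merge_output(
        base,
        ["delete old.c",
         "select 100755 bbb a.c",
         "merge 100644->100644,100644 b0->m1,h2 x.c"],
        lambda b, m, h: b + "+" + m + "+" + h,  # stand-in for a real merge
    )
    print(out)
```

Paths that produce no output line are left as they were in the result base,
which is what keeps the output small for a merge touching a few files in a
huge tree.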
* Re: Merge with git-pasky II.
  2005-04-15 19:57 ` Junio C Hamano
@ 2005-04-15 20:45   ` Linus Torvalds
  0 siblings, 0 replies; 130+ messages in thread
From: Linus Torvalds @ 2005-04-15 20:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Christopher Li, git

On Fri, 15 Apr 2005, Junio C Hamano wrote:
>
> I was looking at merge-tree.c last night to add recursive
> behaviour (my favorite these days ;-) to it [*1*].
>
> But then I started thinking.

Always good.

> LT> ... For each entry in the directory it says either
> LT> 	select <mode> <sha1> path
> LT> or
> LT> 	merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path
> LT> depending on whether it could directly select the right object or not.
>
> Given that the case you are primarily interested in is the one
> that affects only small parts of a huge tree (i.e. common kernel
> merge pattern I understand from your previous messages), your
> "hacky version" [*2*], extended for recursive operation, would
> spit out 98% select and 2% merge, and probably the origin of
> these selects are distributed across ancestor=90%, his=4%,
> my=4%, or something similar.  Am I misestimating grossly?

No. That's _exactly_ right. You do not want a recursive merge-tree.

The "diff-tree" thing is different, exactly because it prunes out all the
differences early on.

> I am thinking about:
>
>  - adding recursive behaviour (I am almost done with this);

I think your suggestion sounds perfectly reasonable.

		Linus
* Re: Merge with git-pasky II.
  2005-04-14  0:29 Merge with git-pasky II Petr Baudis
  2005-04-13 21:25 ` Christopher Li
@ 2005-04-14  0:30 ` Petr Baudis
  2005-04-14 22:11 ` git merge Petr Baudis
  2 siblings, 0 replies; 130+ messages in thread
From: Petr Baudis @ 2005-04-14  0:30 UTC (permalink / raw)
  To: torvalds; +Cc: git

Dear diary, on Thu, Apr 14, 2005 at 02:29:02AM CEST, I got a letter
where Petr Baudis <pasky@ucw.cz> told me that...
>   Its main contents are all of my shell scripts. Apart of that, some
> tiny fixes scattered all around can be found there, as well as some
> patches which went through the mailing list. My last merge with you
> concerned your commit 39021759c903a943a33a28cfbd5070d36d851581.
>
>   It's again
>
> 	rsync://pasky.or.cz/git/
>
> this time my HEAD is fba83970090ef54c6eb86dcc2c2d5087af5ac637.

I forgot to add that after merging, you will probably want to change the
VERSION file (to contain whatever you want).

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
* git merge
  2005-04-14  0:29 Merge with git-pasky II Petr Baudis
  2005-04-13 21:25 ` Christopher Li
  2005-04-14  0:30 ` Petr Baudis
@ 2005-04-14 22:11 ` Petr Baudis
  2 siblings, 0 replies; 130+ messages in thread
From: Petr Baudis @ 2005-04-14 22:11 UTC (permalink / raw)
  To: git; +Cc: torvalds

  Hi,

  note that in my git tree there is a git merge implementation which
does out-of-tree merges now. It is still very trivial, and basically
just does something along the lines of (symbolically written)

	checkout-cache $(diff-tree)
	git diff $base $mergedbranch | git apply
	.. fix rejects etc ..
	git commit

  It seems to work, but it is only very lightly tested - it is likely
there are various tiny mistakes and typos in various unusual code paths
and other weird corners of the scripts. Testing is encouraged, and
especially patches fixing bugs you come across.

  It is designed in a way to make it possible to just replace the
checkout-cache and git diff | git apply steps with the merge-tree.pl
tool when it is finished.

  Thanks,

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
end of thread, other threads:[~2005-04-18  7:38 UTC | newest]
Thread overview: 130+ messages
2005-04-14  0:29 Merge with git-pasky II Petr Baudis
2005-04-13 21:25 ` Christopher Li
2005-04-14  0:45   ` Petr Baudis
2005-04-13 22:00     ` Christopher Li
2005-04-14  3:51     ` Linus Torvalds
2005-04-14  1:23       ` Christopher Li
2005-04-14  5:03         ` Paul Jackson
2005-04-14  2:16           ` Christopher Li
2005-04-14  6:16             ` Paul Jackson
2005-04-14  7:05       ` Junio C Hamano
2005-04-14  8:06         ` Linus Torvalds
2005-04-14  8:39           ` Junio C Hamano
2005-04-14  9:10             ` Linus Torvalds
2005-04-14 11:14               ` Junio C Hamano
2005-04-14 12:16                 ` Petr Baudis
2005-04-14 18:12                   ` Junio C Hamano
2005-04-14 18:36                     ` Linus Torvalds
2005-04-14 19:59                       ` Junio C Hamano
2005-04-14 20:20                         ` Petr Baudis
2005-04-15  0:42                         ` Linus Torvalds
2005-04-15  2:33                           ` Barry Silverman
2005-04-15 10:02                           ` David Woodhouse
2005-04-15 15:32                             ` Linus Torvalds
2005-04-15 16:01                               ` David Woodhouse
2005-04-15 16:31                                 ` C. Scott Ananian
2005-04-15 17:11                                   ` Linus Torvalds
2005-04-16 15:33                                 ` Johannes Schindelin
2005-04-17 13:14                                   ` David Woodhouse
2005-04-15 19:20                               ` Paul Jackson
2005-04-16  1:44                               ` Simon Fowler
2005-04-16 12:19                                 ` David Lang
2005-04-16 15:55                                   ` Simon Fowler
2005-04-16 16:03                                     ` Petr Baudis
2005-04-16 16:26                                       ` Simon Fowler
2005-04-16 16:26                                       ` Linus Torvalds
2005-04-16 23:02                                         ` David Lang
2005-04-17 14:52                                         ` Ingo Molnar
2005-04-17 15:08                                           ` Brad Roberts
2005-04-17 15:18                                             ` Ingo Molnar
2005-04-17 15:28                                           ` Ingo Molnar
2005-04-17 17:34                                             ` Linus Torvalds
2005-04-17 22:12                                               ` Herbert Xu
2005-04-17 22:35                                                 ` Linus Torvalds
2005-04-17 23:29                                                   ` Herbert Xu
2005-04-17 23:34                                                     ` Petr Baudis
2005-04-17 23:53                                                       ` Kenneth Johansson
2005-04-18  0:49                                                       ` Herbert Xu
2005-04-18  0:55                                                         ` Petr Baudis
2005-04-17 23:50                                                     ` Linus Torvalds
2005-04-18  4:16                                               ` Sanjoy Mahajan
2005-04-18  7:42                                               ` Ingo Molnar
2005-04-16 20:29                               ` Sanjoy Mahajan
2005-04-16 20:41                                 ` Linus Torvalds
2005-04-15  2:21                       ` [Patch] ls-tree enhancements Junio C Hamano
2005-04-15 16:13                         ` Petr Baudis
2005-04-15 18:25                           ` Junio C Hamano
2005-04-15  9:14                       ` Merge with git-pasky II David Woodhouse
2005-04-15  9:36                         ` Ingo Molnar
2005-04-15 10:05                           ` David Woodhouse
2005-04-15 14:53                             ` Ingo Molnar
2005-04-15 15:09                               ` David Woodhouse
2005-04-15 12:03                         ` Johannes Schindelin
2005-04-15 10:22                           ` Theodore Ts'o
2005-04-15 14:53                         ` Linus Torvalds
2005-04-15 15:29                           ` David Woodhouse
2005-04-15 15:51                             ` Linus Torvalds
2005-04-15 15:54                           ` Paul Jackson
2005-04-15 16:30                             ` C. Scott Ananian
2005-04-15 18:29                               ` Paul Jackson
2005-04-14 18:51                     ` Christopher Li
2005-04-14 19:35                     ` Petr Baudis
2005-04-14 20:01                       ` Live Merging from remote repositories Barry Silverman
2005-04-14 23:22                         ` Junio C Hamano
2005-04-15  1:07                           ` Question about git process model Barry Silverman
2005-04-14 20:23                       ` Re: Merge with git-pasky II Erik van Konijnenburg
2005-04-14 20:24                         ` Petr Baudis
2005-04-14 23:12                       ` Junio C Hamano
2005-04-14 20:24                         ` Christopher Li
2005-04-14 23:31                         ` Petr Baudis
2005-04-14 20:30                           ` Christopher Li
2005-04-14 20:37                             ` Christopher Li
2005-04-14 20:50                               ` Christopher Li
2005-04-15  0:58                           ` Junio C Hamano
2005-04-14 22:30                             ` Christopher Li
2005-04-15  7:43                               ` Junio C Hamano
2005-04-15  6:28                                 ` Christopher Li
2005-04-15 11:11                                   ` Junio C Hamano
     [not found]                                     ` <7vaco0i3t9.fsf_-_@assigned-by-dhcp.cox.net>
2005-04-15 18:44                                       ` write-tree is pasky-0.4 Linus Torvalds
2005-04-15 18:56                                         ` Petr Baudis
2005-04-15 20:13                                           ` Linus Torvalds
2005-04-15 22:36                                             ` Petr Baudis
2005-04-16  0:22                                               ` Linus Torvalds
2005-04-16  1:13                                                 ` Daniel Barkalow
2005-04-16  2:18                                                   ` Linus Torvalds
2005-04-16  2:49                                                     ` Daniel Barkalow
2005-04-16  3:13                                                       ` Linus Torvalds
2005-04-16  3:56                                                         ` Daniel Barkalow
2005-04-16  6:59                                                         ` Paul Jackson
2005-04-16 15:34                                                 ` Re: Re: " Petr Baudis
2005-04-15 20:10                                         ` Junio C Hamano
2005-04-15 20:58                                           ` C. Scott Ananian
2005-04-15 21:22                                             ` Petr Baudis
2005-04-15 23:16                                             ` Junio C Hamano
2005-04-15 21:48                                           ` [PATCH 1/2] merge-trees script for Linus git Junio C Hamano
2005-04-15 21:54                                             ` [PATCH 2/2] " Junio C Hamano
2005-04-15 23:33                                             ` [PATCH 3/2] " Junio C Hamano
2005-04-16  1:02                                               ` Linus Torvalds
2005-04-16  4:10                                                 ` Junio C Hamano
2005-04-16  5:02                                                   ` Linus Torvalds
2005-04-16  6:26                                                     ` Linus Torvalds
2005-04-16  8:12                                                     ` Junio C Hamano
2005-04-16  9:27                                                       ` [PATCH] Byteorder fix for read-tree, new -m semantics version Junio C Hamano
2005-04-16 10:35                                                       ` [PATCH 1/2] Add --stage to show-files for new stage dircache Junio C Hamano
2005-04-16 10:42                                                         ` [PATCH 2/2] " Junio C Hamano
2005-04-16 14:03                                                       ` Issues with higher-order stages in dircache Junio C Hamano
2005-04-17  5:11                                                         ` Junio C Hamano
2005-04-17  5:31                                                           ` Linus Torvalds
2005-04-17  6:01                                                             ` Junio C Hamano
2005-04-17 10:00                                                         ` Summary of "read-tree -m O A B" mechanism Junio C Hamano
2005-04-16 15:28                                                       ` [PATCH 3/2] merge-trees script for Linus git Linus Torvalds
2005-04-16 16:36                                                         ` Linus Torvalds
2005-04-16 17:14                                                           ` Junio C Hamano
2005-04-15 19:54                             ` Re: Merge with git-pasky II Petr Baudis
2005-04-15 10:22                           ` Junio C Hamano
2005-04-15 20:40                             ` Petr Baudis
2005-04-15 22:41                               ` Junio C Hamano
2005-04-15 19:57           ` Junio C Hamano
2005-04-15 20:45             ` Linus Torvalds
2005-04-14  0:30 ` Petr Baudis
2005-04-14 22:11 ` git merge Petr Baudis