Git development


* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: David Lang @ 2006-10-20 20:49 UTC (permalink / raw)
  To: Petr Baudis
  Cc: Linus Torvalds, Shawn Pearce, Aaron Bentley, Jakub Narebski,
	bazaar-ng, git
In-Reply-To: <20061020202318.GJ20017@pasky.or.cz>

On Fri, 20 Oct 2006, Petr Baudis wrote:

> 
> Dear diary, on Fri, Oct 20, 2006 at 07:48:58PM CEST, I got a letter
> where Linus Torvalds <torvalds@osdl.org> said that...
>> So yeah, I've seen a few strange cases myself, but they've actually been
>> interesting. Like seeing how much of a file was just a copyright license,
>> and then a file being considered a "copy" just because it didn't actually
>> introduce any real new code.
>
> Well it's certainly "interesting" and fun to see, but is it equally fun
> to handle mismerges caused by a broken detection?
>
> I've talked to some people who really didn't mind (or even liked) Git's
> heuristics when it came to _inspecting_ movement of content, but were
> really nervous about merge following such heuristics.

remember, git only stores the results. so when you are merging it doesn't even 
look for renames.

the only time you get renames is after-the-fact when you ask git for a report 
about what changed. then (if you enable rename detection) it will tell you what 
files have changed, and what files look like they may have been renames 
(possibly with changes). but if you don't ask git to look for renames it won't 
bother and you can just ignore the concept entirely.

or if you only want complete renames (as opposed to rename + change) then use 
the option to tell it that you don't want to consider it a rename unless it's 
100% the same (or 99%, or whatever satisfies you)
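To illustrate the point above, here is a minimal Python sketch (not git's code) of rename reporting as an optional pass over two stored snapshots. The `report_changes` name and the dict-based trees are assumptions made for this example. Only the 100%-identical case is handled; a lower threshold would need a content similarity score instead of an equality test.

```python
def report_changes(old, new, find_renames=False):
    """Compare two snapshots (dict: path -> content) after the fact."""
    deleted = sorted(p for p in old if p not in new)
    added = sorted(p for p in new if p not in old)
    changes = [("modified", p) for p in old if p in new and old[p] != new[p]]

    if find_renames:
        # Pair a deleted path with an added path only when contents are identical.
        unmatched_added = []
        for path in added:
            source = next((d for d in deleted if old[d] == new[path]), None)
            if source is None:
                unmatched_added.append(path)
            else:
                deleted.remove(source)
                changes.append(("renamed", source, path))
        added = unmatched_added

    changes += [("deleted", p) for p in deleted]
    changes += [("added", p) for p in added]
    return sorted(changes)


old = {"data/hello.txt": "Hello World!\n", "README": "v1\n"}
new = {"greetings/hello.txt": "Hello World!\n", "README": "v2\n"}
print(report_changes(old, new))                     # no rename pass requested
print(report_changes(old, new, find_renames=True))  # rename pass requested
```

Nothing about the rename is stored in either snapshot; the same two dicts yield either report depending only on what the caller asks for.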

David Lang


* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Jakub Narebski @ 2006-10-20 20:40 UTC (permalink / raw)
  To: git
In-Reply-To: <7v1wp2oi6s.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano wrote:

>   2. git-pickaxe -M: blame line movements within a file.
> 
>      This adds logic to find swapped groups of lines in the same
>      file.  When the file in the parent had A and B and the child
>      has B and A, "single diff with parent" would find only one
>      of A or B is inherited from the parent, not both.  This
>      re-diffs the remainder with the parent's file to find both.
> 
>      I used to have heuristics to avoid trivial groups of lines
>      from being subject to this step, but in this version they
>      have been removed, so that we can see the core logic and
>      need for heuristics more clearly.
> 
>      On the other hand, the version I used to have in "pu" gave
>      blame to the first match.  This one tries to find the best
>      match and assign the blame to it.
> 
>   3. git-pickaxe -C: blame cut-and-pasted lines.
> 
>      This adds logic to find groups of lines brought in from
>      existing file in the parent.  We scan the remainder using
>      the same logic as -M detection, but it is done against
>      other files in the parent.
> 
>      There was a heuristic that gave the blame to the parent
>      right then and there when we find a copy-and-paste instead
>      of allowing the parent to pass blame further on to its
>      ancestors; again I removed this heuristic in the reordered
>      series.

The names of the options clash somewhat with -M and -C in diffcore,
which detect contents 'M'oving (renaming files) and contents
'C'opying (copying files), whereas in git-pickaxe -C is still about
code movement, only across files (-M -M or --MM?).

Would git-pickaxe also try to detect copy-and-paste within a file,
and across files?
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Aaron Bentley @ 2006-10-20 20:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, Jan Hudec, bazaar-ng, git
In-Reply-To: <Pine.LNX.4.64.0610201231570.3962@g5.osdl.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
> 
> On Fri, 20 Oct 2006, Aaron Bentley wrote:
> 
>>Linus Torvalds wrote:
>>
>>>Git goes one step further: it _really_ doesn't matter about how you got to 
>>>a certain state. Absolutely _none_ of what the commits in between the 
>>>final stages and the common ancestor matter in the least. The only thing 
>>>that matters is what the states at the end-point are.
>>
>>That's interesting, because I've always thought one of the strengths of
>>file-ids was that you only had to worry about end-points, not how you
>>got there.
>>
>>How do you handle renames without looking at the history?
> 
> 
> You first handle all the non-renames that just merge on their own.
> If you were to use one hundredth of a second per file regardless of file, 
> a stupid per-file merge would take 210 seconds, which is just 
> unacceptable. So you really don't want to do that.

Agreed.  We start by comparing BASE and OTHER, so all those comparisons
are in-memory operations that don't hit disk.  Only for files where BASE
and OTHER differ do we even examine the THIS version.
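The ordering described above can be sketched as a tree-level three-way merge. This is a Python illustration under stated assumptions, not Bazaar's code: trees are plain dicts mapping a path to a content id, a missing path means the file does not exist, and `merge_trees` is a name invented for the example.

```python
def merge_trees(base, this, other):
    """Return (merged tree, conflicting paths) for three path -> id dicts."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(this) | set(other)):
        b, o = base.get(path), other.get(path)
        if b == o:
            # OTHER did not touch this path: keep THIS without comparing it.
            result = this.get(path)
        else:
            t = this.get(path)  # THIS is only examined when BASE and OTHER differ
            if b == t or t == o:
                result = o      # only OTHER changed, or both made the same change
            else:
                conflicts.append(path)  # both changed differently: needs a text merge
                result = t
        if result is not None:
            merged[path] = result
    return merged, conflicts


base = {"a": "1", "b": "1", "c": "1"}
this = {"a": "1", "b": "2", "c": "2"}
other = {"a": "1", "b": "1", "c": "3", "d": "9"}
print(merge_trees(base, this, other))
```

Only the paths reported as conflicts would go on to the (slower) text merge.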

We can do a do-nothing kernel merge in < 20 seconds, and that's
comparing every single file in the tree.  In Python.  I was aiming for
less than 10 seconds, but didn't quite hit it.

> So recursive basically generates the matrix of similarity for the 
> new/deleted files, and tries to match them up, and there you have your 
> renames - without ever looking at the history of how you ended up where 
> you are.

So in the simple case, you compare unmatched THIS, OTHER and BASE files
to find the renames?

>   I don't know if people appreciate how good it is to do a merge of two 
>   21000-file branches in less than a second. It didn't have any renames, 
>   and it only had a single well-defined common parent, but not only is 
>   that the common case, being that fast for the simple case is what 
>   _allows_ you to do well on the complex cases too, because it's what gets 
>   rid of all the files you should _not_ worry about ]

Well, I certainly appreciate that.  I've never worried about the speed
of text merge algorithms, because you rarely merge very many files.  The
key is making the tree merge fast.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFFOTGN0F+nu1YWqI0RAii+AJ0eduC3bYya5Ao8vm1EpBb38tJP4ACeJRYe
9/D+ahDRJa87NTryc7j3C+U=
=plWA
-----END PGP SIGNATURE-----


* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Petr Baudis @ 2006-10-20 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Shawn Pearce, Aaron Bentley, Jakub Narebski, bazaar-ng, git
In-Reply-To: <Pine.LNX.4.64.0610201045550.3962@g5.osdl.org>

Dear diary, on Fri, Oct 20, 2006 at 07:48:58PM CEST, I got a letter
where Linus Torvalds <torvalds@osdl.org> said that...
> So yeah, I've seen a few strange cases myself, but they've actually been 
> interesting. Like seeing how much of a file was just a copyright license, 
> and then a file being considered a "copy" just because it didn't actually 
> introduce any real new code.

Well it's certainly "interesting" and fun to see, but is it equally fun
to handle mismerges caused by a broken detection?

I've talked to some people who really didn't mind (or even liked) Git's
heuristics when it came to _inspecting_ movement of content, but were
really nervous about merge following such heuristics.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
#!/bin/perl -sp0777i<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<j]dsj
$/=unpack('H*',$_);$_=`echo 16dio\U$k"SK$/SM$n\EsN0p[lN*1
lK[d2%Sa2/d0$^Ixp"|dc`;s/\W//g;$_=pack('H*',/((..)*)$/)


* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Junio C Hamano @ 2006-10-20 20:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0610201049250.3962@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> ...  We're starting to see 
> git actually being able to track file content moving between files: even 
> when the files themselves didn't move (ie Junio's "git pickaxe" work could 
> do things like that).

I've reordered the git-pickaxe series I parked in "pu" during the
1.4.3-rc cycle and merged it into "next".

The earlier one I was futzing with in "pu" had built-in
heuristics and pure mechanisms mixed together in the same patch,
which was quite bad as development history.  I think the
reordered sequence shows the logical evolution better.

  1. git-pickaxe: blame rewritten.

     This implements the infrastructure (parent traversal,
     identifying "corresponding path" in the parent -- aka
     "handling renames", passing blames to the parents and
     taking responsibility for the remainder) and uses the
     same old "single diff with parent file identifies what we
     inherited from the parent" logic git-blame uses for passing
     blames.

  2. git-pickaxe -M: blame line movements within a file.

     This adds logic to find swapped groups of lines in the same
     file.  When the file in the parent had A and B and the child
     has B and A, "single diff with parent" would find only one
     of A or B is inherited from the parent, not both.  This
     re-diffs the remainder with the parent's file to find both.

     I used to have heuristics to avoid trivial groups of lines
     from being subject to this step, but in this version they
     have been removed, so that we can see the core logic and
     need for heuristics more clearly.

     On the other hand, the version I used to have in "pu" gave
     blame to the first match.  This one tries to find the best
     match and assign the blame to it.

  3. git-pickaxe -C: blame cut-and-pasted lines.

     This adds logic to find groups of lines brought in from
     existing file in the parent.  We scan the remainder using
     the same logic as -M detection, but it is done against
     other files in the parent.

     There was a heuristic that gave the blame to the parent
     right then and there when we find a copy-and-paste instead
     of allowing the parent to pass blame further on to its
     ancestors; again I removed this heuristic in the reordered
     series.
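The -M step in item 2 can be sketched roughly as follows. This is a Python illustration built on the standard `difflib` module, not the git-pickaxe implementation; the function names are made up for the example, and it does no "best match" selection and applies no heuristics.

```python
from difflib import SequenceMatcher


def matched_child_lines(parent, child):
    """Indexes of child lines that one diff against the parent accounts for."""
    matcher = SequenceMatcher(None, parent, child, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.b, block.b + block.size))
    return matched


def blame_with_moves(parent, child):
    """Return (lines inherited by the single diff, lines found by the re-diff)."""
    inherited = matched_child_lines(parent, child)
    # The remainder is what the single diff would leave blamed on the child.
    remainder = [i for i in range(len(child)) if i not in inherited]
    remainder_lines = [child[i] for i in remainder]
    # Second pass: re-diff only the remainder against the whole parent file.
    found = matched_child_lines(parent, remainder_lines)
    moved = {remainder[j] for j in found}
    return inherited, moved


parent = ["a1", "a2", "a3", "b1", "b2"]   # group A followed by group B
child = ["b1", "b2", "a1", "a2", "a3"]    # the same groups, swapped
print(blame_with_moves(parent, child))
```

With the swapped groups above, the first diff accounts only for group A; the re-diff of the remainder is what lets group B be passed to the parent as well.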

The next logical step is to come up with a good set of
heuristics to avoid excessive nonsense matches the code
currently gives.

Groups of a small number of empty lines, lines with indentation
blanks followed by a closing brace, and '#include' lines that
include common header files occur so commonly that, without any
heuristics (as can be seen in the "next" branch today), the
algorithm gives surprisingly idiotic results.  For example:

	git -p pickaxe -C -f -n v1.4.3 -- commit.c

tells you that the first line of commit.c in v1.4.3 release,
which is '#include "cache.h"' came from the first line of
receive-pack.c which is total nonsense (this particular line
could actually be a bug in the -M or -C logic -- I need to
check).

A less "obviously wrong" but still idiotic case is that we find
ll.409-411 came from ll.94-96 of describe.c in commit 908e5310.
These three lines read as:

	409		}
	410	}
	411

While this blame assignment might be technically correct, it
does not add much value to pass blames on in such a case.

On the brighter side, we find that ll.415-419 (the beginning of
function "static int get_one_line()") originally came from
diff-tree.c (commit cee99d22, ll.275-279).


* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Aaron Bentley @ 2006-10-20 20:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, bazaar-ng, git
In-Reply-To: <Pine.LNX.4.64.0610201214530.3962@g5.osdl.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
> 
> On Fri, 20 Oct 2006, Aaron Bentley wrote:
> 
>>>Btw, this is a pet peeve of mine, and it is not at all restricted to 
>>>the SCM world.
>>
>>I guess I don't mind a bit of high-mmv discussion, so long as it doesn't
>>get in the way of real work.  Polishing these kinds of things seems to
>>fall in the category of 10% of functionality that takes 90% of effort.
> 
> 
> Well, the thing is, that 10% of the functionality usually takes a whole 
> lot _less_ than 10% of the work.

I guess this depends on whether you consider the brainstorming and
discussion to be part of the work of polishing, and I do mean polishing.
Getting from something that works 90% of the time to something that
works 99% of the time can be a questionable expenditure of time and effort.

> The same is actually true of SCM's too, I'm totally convinced. At least in 
> git, we really haven't spent _that_ much time on merges, for example. My 
> original stupid three-way merge was really simple, and I think the way I 
> introduced "stages" into the git index was really clever, but it was still 
> a small detail. And it worked surprisingly well.

I did rewrite our merge code once, but that was because the API was
quite hard to deal with and made it hard to maintain.  I agree that it's
important to focus effort on the areas that make a difference.

On the other hand, our "exotic" text merge algorithms have been praised
by the people who work on Launchpad.  So that's a win.

> As an example: I suspect that in git just the CVS importer has gotten 
> _way_ more attention than merging ever got. Importing from CVS is simply a 
> much harder problem in practice, and we've probably had more people 
> working on it (and that's _despite_ the fact that this is one of the areas 
> where git has successfully re-used other projects that had similar goals: 
> cvsps, cvs2svn etc). It's hard to "think" about, because a lot of the 
> problems with importing from CVS are literally all about the details and 
> the nasty crud. I really think "merging" is _way_ easier.

Yes, I don't even want to think about CVS when I don't have to.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFFOS2Y0F+nu1YWqI0RAiOcAJ0TXmBdiCcvnTzmg+nnF+kayJ25cgCggMFx
w6xFlFHwPoNm9dt/T4LnmCU=
=zNuy
-----END PGP SIGNATURE-----


* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Linus Torvalds @ 2006-10-20 19:46 UTC (permalink / raw)
  To: Aaron Bentley; +Cc: Jakub Narebski, Jan Hudec, bazaar-ng, git
In-Reply-To: <45391F1C.80100@utoronto.ca>

On Fri, 20 Oct 2006, Aaron Bentley wrote:
> 
> Linus Torvalds wrote:
> > Git goes one step further: it _really_ doesn't matter about how you got to 
> > a certain state. Absolutely _none_ of what the commits in between the 
> > final stages and the common ancestor matter in the least. The only thing 
> > that matters is what the states at the end-point are.
> 
> That's interesting, because I've always thought one of the strengths of
> file-ids was that you only had to worry about end-points, not how you
> got there.
> 
> How do you handle renames without looking at the history?

You first handle all the non-renames that just merge on their own. That 
takes care of 99.99% of the stuff (and I'm not exaggerating: in the 
kernel, you have ~21000 files, and most merges don't have a single rename 
to worry about - and even when you do have them, they tend to be in the 
"you can count them on one hand" kind of situation).

Then you just look at all the pathnames you _couldn't_ resolve, and by 
then you have usually cut things down to the point where you can literally 
use a lot of CPU power per file, because you only have a small number of 
candidates left.

If you were to use one hundredth of a second per file regardless of file, 
a stupid per-file merge would take 210 seconds, which is just 
unacceptable. So you really don't want to do that. You want to merge whole 
subdirectories in one go (and with git, you can: since the SHA1 of a 
directory defines _all_ of the contents under it, if the two branches you 
merge have an identical subdirectory, you don't need to do anything at 
_all_ about that one. See?).
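A rough Python sketch of the subtree shortcut described above. It is an illustration only: `tree_id` uses an ad-hoc hash layout rather than git's tree object format, it recomputes ids on every call (git has them stored), and both function names are assumptions made for this example.

```python
import hashlib

# The naive cost from the text: 21000 files at one hundredth of a second each.
NAIVE_SECONDS = 21000 / 100  # 210.0


def tree_id(tree):
    """Hash a tree given as dict: name -> bytes (file) or dict (subdirectory)."""
    h = hashlib.sha1()
    for name in sorted(tree):
        entry = tree[name]
        if isinstance(entry, dict):
            child = tree_id(entry)
        else:
            child = hashlib.sha1(entry).hexdigest()
        h.update(f"{name}\0{child}\n".encode())
    return h.hexdigest()


def differing_files(a, b, prefix=""):
    """List paths that differ, never descending into identical subtrees."""
    if tree_id(a) == tree_id(b):
        return []  # same id: everything underneath is identical
    paths = []
    for name in sorted(set(a) | set(b)):
        x, y = a.get(name), b.get(name)
        if isinstance(x, dict) and isinstance(y, dict):
            paths += differing_files(x, y, prefix + name + "/")
        elif x != y:
            paths.append(prefix + name)
    return paths


ours = {"README": b"hi", "kernel": {"sched.c": b"v1"},
        "drivers": {"net": {"e100.c": b"v1"}}}
theirs = {"README": b"hi", "kernel": {"sched.c": b"v1"},
          "drivers": {"net": {"e100.c": b"v2"}}}
print(differing_files(ours, theirs))
```

In this toy tree the whole `kernel` directory is dismissed by one id comparison, which is the effect the paragraph above describes for identical subdirectories.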

So instead of trying to be really fast on individual files and doing them 
one at a time, git makes individual files basically totally free (you 
literally often don't need to look at them AT ALL). And then for the few 
files you can't resolve, you can afford to spend more time.

So say that you spend one second per file-pair because you do complex 
heuristics etc - you'll still have a merge that is a _lot_ faster than 
your 210-second one.

So recursive basically generates the matrix of similarity for the 
new/deleted files, and tries to match them up, and there you have your 
renames - without ever looking at the history of how you ended up where 
you are.
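The "matrix of similarity" idea can be sketched like this. It is a Python illustration, not the code of git's recursive strategy: the line-based `difflib` score, the greedy best-first pairing, and the 0.5 cut-off are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(old_text, new_text):
    """Line-based similarity in the range 0.0 to 1.0."""
    matcher = SequenceMatcher(None, old_text.splitlines(),
                              new_text.splitlines(), autojunk=False)
    return matcher.ratio()


def detect_renames(deleted, added, threshold=0.5):
    """Pair deleted and added paths (dict: path -> text) by best similarity."""
    # Score every deleted/added pair: this is the similarity matrix.
    scores = sorted(
        ((similarity(old_text, new_text), old, new)
         for old, old_text in deleted.items()
         for new, new_text in added.items()),
        reverse=True,
    )
    renames, used_old, used_new = [], set(), set()
    for score, old, new in scores:
        if score < threshold:
            break  # everything after this point is even less similar
        if old in used_old or new in used_new:
            continue  # each path takes part in at most one rename
        renames.append((old, new, score))
        used_old.add(old)
        used_new.add(new)
    return renames


deleted = {"data/hello.txt": "Hello World!\nHow are you?\n",
           "old/notes.txt": "alpha\nbeta\n"}
added = {"greetings/hello.txt": "Hello World!\nHow are you?\nGoodbye!\n",
         "misc/new.c": "int main(void) { return 0; }\n"}
print(detect_renames(deleted, added))
```

The inputs are only the paths left unresolved at the two end-points; no commit history is consulted, and the quadratic scoring stays affordable because that set is small.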

Btw, that "210 second" merge is not at all unlikely. Some of the SCM's 
seem to scale much worse than that to big archives, and I've heard people 
talk about merges that took 20 minutes or more. In contrast, git doing a 
merge in ~2-3 seconds for the kernel is _normal_.

[ In fact, I just re-tested doing my last kernel merge: it took 0.970 
  seconds, and that was _including_ the diffstat of the result - though 
  obviously not including the time to fetch the other branch over the 
  network.

  I don't know if people appreciate how good it is to do a merge of two 
  21000-file branches in less than a second. It didn't have any renames, 
  and it only had a single well-defined common parent, but not only is 
  that the common case, being that fast for the simple case is what 
  _allows_ you to do well on the complex cases too, because it's what gets 
  rid of all the files you should _not_ worry about ]

Performance does matter. 

			Linus


* [PATCH] git-clone: define die() and use it.
From: Dmitry V. Levin @ 2006-10-20 19:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: GIT mailing list

Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
---
 git-clone.sh |   61 +++++++++++++++++++++++-----------------------------------
 1 files changed, 24 insertions(+), 37 deletions(-)

diff --git a/git-clone.sh b/git-clone.sh
index bf54a11..786d65a 100755
--- a/git-clone.sh
+++ b/git-clone.sh
@@ -8,11 +8,15 @@ # Clone a repository into a different di
 # See git-sh-setup why.
 unset CDPATH
 
-usage() {
-	echo >&2 "Usage: $0 [--template=<template_directory>] [--use-separate-remote] [--reference <reference-repo>] [--bare] [-l [-s]] [-q] [-u <upload-pack>] [--origin <name>] [-n] <repo> [<dir>]"
+die() {
+	echo >&2 "$@"
 	exit 1
 }
 
+usage() {
+	die "Usage: $0 [--template=<template_directory>] [--use-separate-remote] [--reference <reference-repo>] [--bare] [-l [-s]] [-q] [-u <upload-pack>] [--origin <name>] [-n] <repo> [<dir>]"
+}
+
 get_repo_base() {
 	(cd "$1" && (cd .git ; pwd)) 2> /dev/null
 }
@@ -35,11 +39,9 @@ clone_dumb_http () {
 		"`git-repo-config --bool http.noEPSV`" = true ]; then
 		curl_extra_args="${curl_extra_args} --disable-epsv"
 	fi
-	http_fetch "$1/info/refs" "$clone_tmp/refs" || {
-		echo >&2 "Cannot get remote repository information.
+	http_fetch "$1/info/refs" "$clone_tmp/refs" ||
+		die "Cannot get remote repository information.
 Perhaps git-update-server-info needs to be run there?"
-		exit 1;
-	}
 	while read sha1 refname
 	do
 		name=`expr "z$refname" : 'zrefs/\(.*\)'` &&
@@ -143,17 +145,12 @@ while
 		'')
 		    usage ;;
 		*/*)
-		    echo >&2 "'$2' is not suitable for an origin name"
-		    exit 1
+		    die "'$2' is not suitable for an origin name"
 		esac
-		git-check-ref-format "heads/$2" || {
-		    echo >&2 "'$2' is not suitable for a branch name"
-		    exit 1
-		}
-		test -z "$origin_override" || {
-		    echo >&2 "Do not give more than one --origin options."
-		    exit 1
-		}
+		git-check-ref-format "heads/$2" ||
+		    die "'$2' is not suitable for a branch name"
+		test -z "$origin_override" ||
+		    die "Do not give more than one --origin options."
 		origin_override=yes
 		origin="$2"; shift
 		;;
@@ -169,24 +166,19 @@ do
 done
 
 repo="$1"
-if test -z "$repo"
-then
-    echo >&2 'you must specify a repository to clone.'
-    exit 1
-fi
+test -n "$repo" ||
+    die 'you must specify a repository to clone.'
 
 # --bare implies --no-checkout
 if test yes = "$bare"
 then
 	if test yes = "$origin_override"
 	then
-		echo >&2 '--bare and --origin $origin options are incompatible.'
-		exit 1
+		die '--bare and --origin $origin options are incompatible.'
 	fi
 	if test t = "$use_separate_remote"
 	then
-		echo >&2 '--bare and --use-separate-remote options are incompatible.'
-		exit 1
+		die '--bare and --use-separate-remote options are incompatible.'
 	fi
 	no_checkout=yes
 fi
@@ -206,7 +198,7 @@ fi
 dir="$2"
 # Try using "humanish" part of source repo if user didn't specify one
 [ -z "$dir" ] && dir=$(echo "$repo" | sed -e 's|/$||' -e 's|:*/*\.git$||' -e 's|.*[/:]||g')
-[ -e "$dir" ] && echo "$dir already exists." && usage
+[ -e "$dir" ] && die "destination directory '$dir' already exists."
 mkdir -p "$dir" &&
 D=$(cd "$dir" && pwd) &&
 trap 'err=$?; cd ..; rm -rf "$D"; exit $err' 0
@@ -233,7 +225,7 @@ then
 		 cd reference-tmp &&
 		 tar xf -)
 	else
-		echo >&2 "$reference: not a local directory." && usage
+		die "reference repository '$reference' is not a local directory."
 	fi
 fi
 
@@ -242,10 +234,8 @@ rm -f "$GIT_DIR/CLONE_HEAD"
 # We do local magic only when the user tells us to.
 case "$local,$use_local" in
 yes,yes)
-	( cd "$repo/objects" ) || {
-		echo >&2 "-l flag seen but $repo is not local."
-		exit 1
-	}
+	( cd "$repo/objects" ) ||
+		die "-l flag seen but repository '$repo' is not local."
 
 	case "$local_shared" in
 	no)
@@ -307,18 +297,15 @@ yes,yes)
 		then
 			clone_dumb_http "$repo" "$D"
 		else
-			echo >&2 "http transport not supported, rebuild Git with curl support"
-			exit 1
+			die "http transport not supported, rebuild Git with curl support"
 		fi
 		;;
 	*)
 		case "$upload_pack" in
 		'') git-fetch-pack --all -k $quiet "$repo" ;;
 		*) git-fetch-pack --all -k $quiet "$upload_pack" "$repo" ;;
-		esac >"$GIT_DIR/CLONE_HEAD" || {
-			echo >&2 "fetch-pack from '$repo' failed."
-			exit 1
-		}
+		esac >"$GIT_DIR/CLONE_HEAD" ||
+			die "fetch-pack from '$repo' failed."
 		;;
 	esac
 	;;
-- 
1.4.3.GIT


* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Linus Torvalds @ 2006-10-20 19:31 UTC (permalink / raw)
  To: Aaron Bentley; +Cc: bazaar-ng, git, Jakub Narebski
In-Reply-To: <45391DC3.7060002@utoronto.ca>

On Fri, 20 Oct 2006, Aaron Bentley wrote:
> > 
> > Btw, this is a pet peeve of mine, and it is not at all restricted to 
> > the SCM world.
> 
> I guess I don't mind a bit of high-mmv discussion, so long as it doesn't
> get in the way of real work.  Polishing these kinds of things seems to
> fall in the category of 10% of functionality that takes 90% of effort.

Well, the thing is, that 10% of the functionality usually takes a whole 
lot _less_ than 10% of the work.

The stuff you can think through (and argue about) tends to be the easy 
stuff. Exactly because you _can_ think about it abstractly.

The stuff that is actually really hard and time-consuming is the stuff 
that you find out in practice, and you have to iterate on.

In kernels, for example, it seems like 99% of the effort ends up being 
hardware-dependent stuff. Getting architecture interfaces right, and 
getting working drivers. Hotplugging and device management turns out to be 
a _much_ bigger issue than schedulers or VM page-out have _ever_ been. 

But show me a single paper about them. I'm sure they exist. I'm just 
saying that they're sure as heck not getting 99% of the attention (or even 
1% of the attention) in discussions, even though they're definitely 99% of 
the real everyday work and effort.

(Maybe it's not 99%. Numbers taken out of my nether regions. The point 
should be clear).

The same is actually true of SCM's too, I'm totally convinced. At least in 
git, we really haven't spent _that_ much time on merges, for example. My 
original stupid three-way merge was really simple, and I think the way I 
introduced "stages" into the git index was really clever, but it was still 
a small detail. And it worked surprisingly well.

After that merge, people improved it. And "recursive" is a _huge_ 
improvement, don't get me wrong: it's still entirely a 3-way merge on the 
file contents, but it now does those 3-way merges in several stages if 
there are multiple independent common parents, and the rename logic is 
clearly important.

But if you actually look at how much effort was spent on merging, and how 
much was spent on just "details in general", I think you'll find merging 
to be pretty low down the list, even though the recursive merge ended up 
_also_ getting re-written in C. Perhaps it was one of the bigger 
_individual_ efforts, but compared to all the work we've continually done 
on performance and usability, for example, it's been pretty small in the 
end.

As an example: I suspect that in git just the CVS importer has gotten 
_way_ more attention than merging ever got. Importing from CVS is simply a 
much harder problem in practice, and we've probably had more people 
working on it (and that's _despite_ the fact that this is one of the areas 
where git has successfully re-used other projects that had similar goals: 
cvsps, cvs2svn etc). It's hard to "think" about, because a lot of the 
problems with importing from CVS are literally all about the details and 
the nasty crud. I really think "merging" is _way_ easier.

			Linus


* Re: Signed git-tag doesn't find default key
From: Andy Parkins @ 2006-10-20 19:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.64.0610200922170.3962@g5.osdl.org>


On Friday 2006, October 20 17:32, Linus Torvalds wrote:

> and then do an "adduid", and then add your UID _without_ the "(Google)" in
> there, and that should solve all your problems.

Yeah, obviously that's one way; and while it doesn't really matter to me, it 
seems poor form that git doesn't work with gpg as it is.  While one could of 
course use the "-u" switch, if that is the answer, then why bother with 
having the "-s" switch at all?

> You're probably better off with something like
>
> 	git var GIT_COMMITTER_IDENT | sed 's/\(.*\)<\(.*\)>\(.*\)/\2/'

I've actually settled on:

: ${username:=$(expr "z$tagger" : 'z.*<\(.*\)>')}

In git-tag.sh.
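For readers less used to `expr`, the same extraction can be restated in Python. This is only a sketch of what the pattern does, not a proposed change to git-tag.sh; `email_from_ident` is a made-up name and the ident string below is an invented sample in the "name, address in angle brackets, timestamp, timezone" layout.

```python
import re


def email_from_ident(ident):
    """Return the text between the last '<' and the last '>', or None."""
    # Like the expr pattern, the leading '.*' is greedy and anchored at the start.
    match = re.match(r".*<(.*)>", ident)
    return match.group(1) if match else None


# Hypothetical ident line; the timestamp and timezone are made up.
tagger = "Andy Parkins <andyparkins@gmail.com> 1161372060 +0100"
print(email_from_ident(tagger))
```

One difference from the shell version: `expr` prints an empty string when nothing matches, whereas this sketch returns `None`.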

> That said, I've never understood why gpg matches on the comment field.
> Dammit, it _should_ find the key anyway. Stupid program.

I think it's doing the right thing unfortunately.  If you search on any part
 "Andy Parkins"
 "<andyparkins@gmail.com>"
 "andyparkins@gmail.com"
 "andyparkins"
It finds it fine; the only thing it doesn't find is
 "Andy Parkins <andyparkins@gmail.com>"
Which I suppose is fair enough, as it's a fairly specific format to be 
searching for.

I'm going to advocate my change of only searching on the email address for 
finding the key - there shouldn't be two keys with the same email address 
anyway, so there shouldn't be a danger of ambiguity of key.  Also, it deals 
with the case when someone has entered a different name in git and in their 
gpg UID.  For example, I would think it shouldn't be a problem that I like to 
be called "Andy" on the git list, and yet want my key to say "A. D. 
Parkins", "Andrew Parkins" or "Sparky McFly". 

Now, I think I've written my name far, far too many times in this email.

Sparky McFly
-- 
Dr Andrew Parkins, M Eng (Hons), AMIEE
andyparkins@gmail.com



* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Jakub Narebski @ 2006-10-20 19:14 UTC (permalink / raw)
  To: Jan Hudec; +Cc: bazaar-ng, git
In-Reply-To: <20061020181210.GA29843@artax.karlin.mff.cuni.cz>

Jan Hudec wrote:

> Let's consider following scenario:
> 
> (where A$ means working in branch A, B$ means working in branch B and
>  VCT stands for version control tool of choice)
[...]
> At this point, I expect the tree to look like this:
> A$ ls -R
> .:
> data/
> data:
> hello.txt
> A$ cat data/hello.txt
> Hello World!
[...]
> Oh, and there is one more complicated case, that I also require to work
> and that works in Bzr, but did not work in Arch:
> 
> ...let's start with the tree at the end of previous example...
> 
> A$ VCT mv data greetings
> A$ VCT commit -m "Renamed the data directory to greetings"
> B$ echo "Goodbye World!" > data/goodbye.txt
> B$ VCT add data/goodbye.txt
> B$ VCT commit -m "Added goodbye message."
> A$ VCT merge B

(slightly corrected example).

A$ git branch B
A$ git mv data greetings
A$ git commit -a -m "Renamed the data directory to greetings"
A$ git checkout B
B$ echo 'Goodbye World!' > data/goodbye.txt
B$ git add data/goodbye.txt
B$ git commit -a -m "Added goodbye message."
B$ git checkout A
A$ git pull . B
Trying really trivial in-index merge...
fatal: Merge requires file-level merging
Nope.
Merging HEAD with 4a8a1a7941f214c6173786b583830b4f74a67c1f
Merging: 
96738390ba0b4de5b234059081701badc1c86693 Renamed the data directory to greetings 
4a8a1a7941f214c6173786b583830b4f74a67c1f Added goodbye message. 
found 1 common ancestor(s): 
7cfd8edd06b7cb016856737d8fd98d5d096955b5 Merge branch 'B' into A 

Merge made by recursive.
 data/goodbye.txt |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 data/goodbye.txt

> And now I expect to have tree looking like this:
> 
> A$ ls -R
> .:
> greetings/
> greetings:
> hello.txt
> goodbye.txt

So git _fails_ (your expectations) in this case:
A$ ls -R
.:
data  greetings

./data:
goodbye.txt

./greetings:
hello.txt


* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Aaron Bentley @ 2006-10-20 19:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, Jan Hudec, bazaar-ng, git
In-Reply-To: <Pine.LNX.4.64.0610201151130.3962@g5.osdl.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
> Git goes one step further: it _really_ doesn't matter about how you got to 
> a certain state. Absolutely _none_ of what the commits in between the 
> final stages and the common ancestor matter in the least. The only thing 
> that matters is what the states at the end-point are.

That's interesting, because I've always thought one of the strengths of
file-ids was that you only had to worry about end-points, not how you
got there.

How do you handle renames without looking at the history?

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFFOR8c0F+nu1YWqI0RAkhJAJ9QJ3nyP/437/bNPI3VEVHZP0dEZACfZyEg
SWAp+673iTDEZfH00M4RG4k=
=1XO+
-----END PGP SIGNATURE-----


* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Aaron Bentley @ 2006-10-20 19:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, bazaar-ng, git
In-Reply-To: <Pine.LNX.4.64.0610201110320.3962@g5.osdl.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
> 
> On Fri, 20 Oct 2006, Linus Torvalds wrote:
> 
>>So yes, merges are the situation where renames are normally considered a 
>>"problem", but it's actually not nearly the most every-day situation at 
>>all.
> 
> 
> Btw, this is a pet peeve of mine, and it is not at all restricted to 
> the SCM world.

I guess I don't mind a bit of high-mmv discussion, so long as it doesn't
get in the way of real work.  Polishing these kinds of things seems to
fall in the category of 10% of functionality that takes 90% of effort.

> Of the rest, most by far need some trivial 3-way merging. And the ones 
> that have trouble? In practice, that trivial and maligned 3-way does 
> _better_ than anything more complicated.

I think the great motivator for exploring other merge algorithms has
been criss-cross merge.  There are some workflows (e.g. the Launchpad
workflow) in which heavy mesh-merging takes place, leading to frequent
criss-crosses.

Bog-standard three-way doesn't handle that criss-cross very well.  I
understand git uses recursive three-way in that situation.

The other motivator has been cherry-picking.

So I'm happy that people are trying to devise merge algorithms that are
better than three-way.  When someone gets it right, we'll implement it.

And then there are other more incremental tweaks, like
merge-across-indent and merge-across-line-ending-change that I'd like to
see.

> Go to revctrl.org for prime example of this. I think half the stuff is 
> about merge algorithms, some of it is about glossary, and almost none of 
> it is about something as pedestrian and simple as performance and 
> scalability.

Partly this is because of Bram's interests.  AIUI, he started with a
merge algorithm and built a VCS around it.

> (Actually, to be honest, I think some of the #revctrl noise has become 
> better lately.

I used to spend time on #revctrl, but I think that was before you
started visiting.  Too bad I missed ya.

> So maybe at least this area is getting more about
> real every-day problems, and less about the theoretical-but-not-very- 
> important issues).

It wouldn't surprise me if the early phases of VCS development tended
toward more theoretical discussion, just because so many questions are open.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFFOR3D0F+nu1YWqI0RAo5lAJ99+5ShvLXaVIRG1A8XN7HRicoPngCeLO+y
meMZVcjdX7AX9JCfhSN5uK4=
=AI8p
-----END PGP SIGNATURE-----

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Linus Torvalds @ 2006-10-20 19:00 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Jan Hudec, Aaron Bentley, bazaar-ng, git
In-Reply-To: <200610202047.11291.jnareb@gmail.com>

On Fri, 20 Oct 2006, Jakub Narebski wrote:

> Jan Hudec wrote:
> 
> > And note that it is /not/ required to use file-ids to handle this.
> > Darcs handles this just as well with its patch algebra
> > (http://darcs.net/DarcsWiki/PatchTheory) without need of any IDs.
> 
> And Darcs is, from opinions I've read, dog-slow.

You really cannot expect to get any kind of performance at all unless you:

 - are able to ignore 99.9% of all files on merging (ie you have to be 
   able to totally ignore the files that are identical in both sides, and 
   you really shouldn't even _care_ about why they ended up being 
   identical)

 - are able to ignore 99% of what the commits _did_ in between the merges 
   (ie if you need to look at them at all, only look at the part that 
   matters for the 0.1% of files that you couldn't ignore)

If you have to parse all the commit details all the way down to the common 
parent, you're basically already screwed. There's no _way_ you can make it 
fast. 

Git goes one step further: it _really_ doesn't matter about how you got to 
a certain state. Absolutely _none_ of what the commits in between the 
final stages and the common ancestor matter in the least. The only thing 
that matters is what the states at the end-point are.

(Of course, you _could_ plug in a merge algorithm that cares, since there 
is more data there. I'm just talking about the standard "recursive" 
algorithm here.)

That's why git can be so fast, but it's actually more important than that: 
the fact that it doesn't matter _how_ you got to a certain state is 
actually a huge and important feature. In other words, you should see it 
as a guarantee, not as a "lack of knowledge".

Darcs thinks it matters how you got somewhere. Git consciously says: none 
of the individual patches matter, the only thing that matters is the end 
result, because you could have gotten the same result in a lot of 
different ways, and nobody _cares_.

			Linus

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Linus Torvalds @ 2006-10-20 18:48 UTC (permalink / raw)
  To: Jan Hudec; +Cc: bazaar-ng, git, Jakub Narebski
In-Reply-To: <20061020181210.GA29843@artax.karlin.mff.cuni.cz>

On Fri, 20 Oct 2006, Jan Hudec wrote:
>
> Let's consider the following scenario:

Here's a real-life scenario that we hit several times with BK over the 
years:

 - take a real repository, and a patch that gets discussed that adds a new 
   file.
 - take two different people applying that patch to their trees (or, do 
   the equivalent thing, which is to just create the same filename
   independently, because the solution is obvious - and the same - to 
   both developers).
 - now, have somebody merge both of those two people's trees (e.g. me)
 - have the two people continue to use their trees, modifying it, and 
   getting merged.

Trust me, this isn't even _unlikely_. It happens. And it's a serious 
problem for a file-ID case. Why? Because you have two different file ID's 
for the same pathname. 

(It happily only happened a handful of times, so it was never a big enough 
problem to cause me to think that BK was crap. But it definitely was a 
real issue).

What BK did (and what is likely the only reasonable thing to do) is to 
move one of the file-ID's to an "Attic" kind of place, and just go with 
the other. The nasty part is that now the developer whose file was 
"dropped" (and anybody who got the work from him) may still be continuing 
to work with _his_ copy of the same file, never even realizing that when 
his work gets merged, all his fixes GET THROWN AWAY!

And trust me, this isn't a theoretical thing. This actually happens. So 
you have problems at many levels: you have the problems that happen during 
the merge (where somebody needs to decide how to resolve the file-ID 
clash), but what a lot of SCM people seem to not have understood is that 
the problem actually _remains_ after the merge, and causes problems even 
down the line.

So yeah, content-based merging has its own problems (especially if you do 
things like re-indent a file as you move it, or if you have files that 
just look the same because they share 99% of their content through a 
copyright message), but at least so far, we've not really ever hit that 
issue in the kernel.
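
To make the "files that just look the same" failure mode concrete, here is
a toy similarity score in C. It is only a sketch under loud assumptions: it
counts whole lines of one file that also occur in the other, which is not
how git actually scores renames, and the names similarity_percent() and
has_line() are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the line starting at s, not counting the newline. */
static size_t line_len(const char *s)
{
	const char *nl = strchr(s, '\n');
	return nl ? (size_t)(nl - s) : strlen(s);
}

/* Step over a line of length n that starts at s. */
static const char *skip_line(const char *s, size_t n)
{
	return s[n] == '\n' ? s + n + 1 : s + n;
}

/* Does `text` contain a line identical to line[0..n)? */
static int has_line(const char *text, const char *line, size_t n)
{
	while (*text) {
		size_t m = line_len(text);
		if (m == n && memcmp(text, line, n) == 0)
			return 1;
		text = skip_line(text, m);
	}
	return 0;
}

/* Percentage (0..100) of the lines in `dst` that also occur in `src`. */
static unsigned similarity_percent(const char *src, const char *dst)
{
	size_t total = 0, shared = 0;

	assert(src != NULL && dst != NULL);
	while (*dst) {
		size_t n = line_len(dst);
		total++;
		if (has_line(src, dst, n))
			shared++;
		dst = skip_line(dst, n);
	}
	return total ? (unsigned)(shared * 100 / total) : 0;
}
```

With two four-line files that share three license lines and differ only in
their single line of real code, this scores 75, so any threshold below
that would call the second file a copy of the first even though none of
the actual code is shared.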

And we are actually approaching the old kernel BK tree in size with the 
current git tree (we're about 2/3rds of the way if you count number of 
commits). That's despite the fact that we actually have been moving things 
around.  So from a purely _practical_ standpoint, I really do have 
anecdotal evidence that I'm right.

I didn't have that evidence when I started, but I knew I was right anyway ;)

		Linus

PS. It's undoubtedly true that the SCM you use impacts _how_ you do 
development, so any project will almost automatically align itself with 
whatever SCM rules there are in place.

So "anecdotal evidence" in that sense isn't really wonderful, since it 
obviously is always a matter of a certain project/SCM combination - but 
the above example is about as neutral as you can get, since it's the 
_same_ project, with the _same_ maintainer, and roughly the _same_ rules, 
just two different approaches wrt renames of the SCM's in question.

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Jakub Narebski @ 2006-10-20 18:47 UTC (permalink / raw)
  To: Jan Hudec; +Cc: bazaar-ng, git
In-Reply-To: <20061020181210.GA29843@artax.karlin.mff.cuni.cz>

Jan Hudec wrote:

> And note that it is /not/ required to use file-ids to handle this.
> Darcs handles this just as well with its patch algebra
> (http://darcs.net/DarcsWiki/PatchTheory) without need of any IDs.

And Darcs is, from opinions I've read, dog-slow.

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Jakub Narebski @ 2006-10-20 18:46 UTC (permalink / raw)
  To: Jan Hudec; +Cc: Aaron Bentley, bazaar-ng, git
In-Reply-To: <200610202035.26227.jnareb@gmail.com>

Jakub Narebski wrote:
>> A$ VCT commit -m "Moved hello.txt to data dir"
> 1092:jnareb@roke:/tmp/jnareb/tmp> git commit -a -m "Moved hello.txt to data dir"
> 
>> B$ ed hello.txt
>> ? 1s/Warld/World/
>> ? wq
Sorry, I forgot to include "git checkout B" in the email; it is needed
to actually switch to branch B.

> 1094:jnareb@roke:/tmp/jnareb/tmp> ed hello.txt 
> 13
> 1s/Warld/World/
> wq
> 13

-- 
Jakub Narebski
Poland

* [PATCH] add the capability for index-pack to read from a stream
From: Nicolas Pitre @ 2006-10-20 18:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

This patch only adds the streaming capability to index-pack.  Although 
the code is different it has the exact same functionality as before to 
make sure nothing broke.

This is in preparation for receiving packs over the net, parse them on 
the fly, fix them up if they are "thin" packs, and keep the resulting 
pack instead of exploding it into loose objects.  But such functionality 
should come separately.

One immediate advantage of this patch is that index-pack can now deal 
with packs up to 4GB in size even on 32-bit architectures since the pack 
is not entirely mmap()'d all at once anymore.

Signed-off-by: Nicolas Pitre <nico@cam.org>

---
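
For readers following the diff, the fill()/use() contract it introduces can
be sketched on its own as below. This is a minimal illustration, not the
code from the patch: the struct and the names stream_fill() and
stream_use() are hypothetical, the buffer is tiny so that refills are easy
to observe, errors are reported by returning NULL instead of calling
die(), and the running SHA1 update is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical reader state; index-pack keeps the equivalent in globals. */
struct in_stream {
	int fd;
	unsigned char buf[8];
	size_t off;	/* where the unconsumed data starts in buf */
	size_t len;	/* number of unconsumed bytes */
	size_t total;	/* bytes consumed so far */
};

/*
 * Return a pointer to at least `min` unconsumed bytes, or NULL on EOF,
 * read error, or a request larger than the buffer.
 */
static unsigned char *stream_fill(struct in_stream *s, size_t min)
{
	if (min > sizeof(s->buf))
		return NULL;
	if (min <= s->len)
		return s->buf + s->off;
	/* Slide the leftover bytes to the front; the regions may overlap. */
	memmove(s->buf, s->buf + s->off, s->len);
	s->off = 0;
	while (s->len < min) {
		/* A real reader would also retry on EINTR, as xread() does. */
		ssize_t n = read(s->fd, s->buf + s->len,
				 sizeof(s->buf) - s->len);
		if (n <= 0)
			return NULL;
		s->len += (size_t)n;
	}
	return s->buf;
}

/* Mark `n` bytes, previously returned by stream_fill(), as consumed. */
static void stream_use(struct in_stream *s, size_t n)
{
	assert(n <= s->len);
	s->len -= n;
	s->off += n;
	s->total += n;
}
```

A caller alternates the two: stream_fill(&s, 4) to look at a 4-byte
signature, stream_use(&s, 4) to step past it, and so on. The total field
plays the role that consumed_bytes plays in the patch, giving each object
its offset without ever needing the whole pack in memory.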

diff --git a/index-pack.c b/index-pack.c
index 56c590e..e33f605 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -13,6 +13,8 @@ static const char index_pack_usage[] =
 struct object_entry
 {
 	unsigned long offset;
+	unsigned long size;
+	unsigned int hdr_size;
 	enum object_type type;
 	enum object_type real_type;
 	unsigned char sha1[20];
@@ -36,51 +38,68 @@ struct delta_entry
 };
 
 static const char *pack_name;
-static unsigned char *pack_base;
-static unsigned long pack_size;
 static struct object_entry *objects;
 static struct delta_entry *deltas;
 static int nr_objects;
 static int nr_deltas;
 
-static void open_pack_file(void)
+/* We always read in 4kB chunks. */
+static unsigned char input_buffer[4096];
+static unsigned long input_offset, input_len, consumed_bytes;
+static SHA_CTX input_ctx;
+static int input_fd;
+
+/*
+ * Make sure at least "min" bytes are available in the buffer, and
+ * return the pointer to the buffer.
+ */
+static void * fill(int min)
 {
-	int fd;
-	struct stat st;
+	if (min <= input_len)
+		return input_buffer + input_offset;
+	if (min > sizeof(input_buffer))
+		die("cannot fill %d bytes", min);
+	if (input_offset) {
+		SHA1_Update(&input_ctx, input_buffer, input_offset);
+		memmove(input_buffer, input_buffer + input_offset, input_len);
+		input_offset = 0;
+	}
+	do {
+		int ret = xread(input_fd, input_buffer + input_len,
+				sizeof(input_buffer) - input_len);
+		if (ret <= 0) {
+			if (!ret)
+				die("early EOF");
+			die("read error on input: %s", strerror(errno));
+		}
+		input_len += ret;
+	} while (input_len < min);
+	return input_buffer;
+}
+
+static void use(int bytes)
+{
+	if (bytes > input_len)
+		die("used more bytes than were available");
+	input_len -= bytes;
+	input_offset += bytes;
+	consumed_bytes += bytes;
+}
 
-	fd = open(pack_name, O_RDONLY);
-	if (fd < 0)
+static void open_pack_file(void)
+{
+	input_fd = open(pack_name, O_RDONLY);
+	if (input_fd < 0)
 		die("cannot open packfile '%s': %s", pack_name,
 		    strerror(errno));
-	if (fstat(fd, &st)) {
-		int err = errno;
-		close(fd);
-		die("cannot fstat packfile '%s': %s", pack_name,
-		    strerror(err));
-	}
-	pack_size = st.st_size;
-	pack_base = mmap(NULL, pack_size, PROT_READ, MAP_PRIVATE, fd, 0);
-	if (pack_base == MAP_FAILED) {
-		int err = errno;
-		close(fd);
-		die("cannot mmap packfile '%s': %s", pack_name,
-		    strerror(err));
-	}
-	close(fd);
+	SHA1_Init(&input_ctx);
 }
 
 static void parse_pack_header(void)
 {
-	const struct pack_header *hdr;
-	unsigned char sha1[20];
-	SHA_CTX ctx;
-
-	/* Ensure there are enough bytes for the header and final SHA1 */
-	if (pack_size < sizeof(struct pack_header) + 20)
-		die("packfile '%s' is too small", pack_name);
+	struct pack_header *hdr = fill(sizeof(struct pack_header));
 
 	/* Header consistency check */
-	hdr = (void *)pack_base;
 	if (hdr->hdr_signature != htonl(PACK_SIGNATURE))
 		die("packfile '%s' signature mismatch", pack_name);
 	if (!pack_version_ok(hdr->hdr_version))
@@ -88,13 +107,8 @@ static void parse_pack_header(void)
 		    pack_name, ntohl(hdr->hdr_version));
 
 	nr_objects = ntohl(hdr->hdr_entries);
-
-	/* Check packfile integrity */
-	SHA1_Init(&ctx);
-	SHA1_Update(&ctx, pack_base, pack_size - 20);
-	SHA1_Final(sha1, &ctx);
-	if (hashcmp(sha1, pack_base + pack_size - 20))
-		die("packfile '%s' SHA1 mismatch", pack_name);
+	use(sizeof(struct pack_header));
+	/*fprintf(stderr, "Indexing %d objects\n", nr_objects);*/
 }
 
 static void bad_object(unsigned long offset, const char *format,
@@ -112,85 +126,78 @@ static void bad_object(unsigned long off
 	    pack_name, offset, buf);
 }
 
-static void *unpack_entry_data(unsigned long offset,
-			       unsigned long *current_pos, unsigned long size)
+static void *unpack_entry_data(unsigned long offset, unsigned long size)
 {
-	unsigned long pack_limit = pack_size - 20;
-	unsigned long pos = *current_pos;
 	z_stream stream;
 	void *buf = xmalloc(size);
 
 	memset(&stream, 0, sizeof(stream));
 	stream.next_out = buf;
 	stream.avail_out = size;
-	stream.next_in = pack_base + pos;
-	stream.avail_in = pack_limit - pos;
+	stream.next_in = fill(1);
+	stream.avail_in = input_len;
 	inflateInit(&stream);
 
 	for (;;) {
 		int ret = inflate(&stream, 0);
-		if (ret == Z_STREAM_END)
+		use(input_len - stream.avail_in);
+		if (stream.total_out == size && ret == Z_STREAM_END)
 			break;
 		if (ret != Z_OK)
 			bad_object(offset, "inflate returned %d", ret);
+		stream.next_in = fill(1);
+		stream.avail_in = input_len;
 	}
 	inflateEnd(&stream);
-	if (stream.total_out != size)
-		bad_object(offset, "size mismatch (expected %lu, got %lu)",
-			   size, stream.total_out);
-	*current_pos = pack_limit - stream.avail_in;
 	return buf;
 }
 
-static void *unpack_raw_entry(unsigned long offset,
-			      enum object_type *obj_type,
-			      unsigned long *obj_size,
-			      union delta_base *delta_base,
-			      unsigned long *next_obj_offset)
+static void *unpack_raw_entry(struct object_entry *obj, union delta_base *delta_base)
 {
-	unsigned long pack_limit = pack_size - 20;
-	unsigned long pos = offset;
-	unsigned char c;
+	unsigned char *p, c;
 	unsigned long size, base_offset;
 	unsigned shift;
-	enum object_type type;
-	void *data;
 
-	c = pack_base[pos++];
-	type = (c >> 4) & 7;
+	obj->offset = consumed_bytes;
+
+	p = fill(1);
+	c = *p;
+	use(1);
+	obj->type = (c >> 4) & 7;
 	size = (c & 15);
 	shift = 4;
 	while (c & 0x80) {
-		if (pos >= pack_limit)
-			bad_object(offset, "object extends past end of pack");
-		c = pack_base[pos++];
+		p = fill(1);
+		c = *p;
+		use(1);
 		size += (c & 0x7fUL) << shift;
 		shift += 7;
 	}
+	obj->size = size;
 
-	switch (type) {
+	switch (obj->type) {
 	case OBJ_REF_DELTA:
-		if (pos + 20 >= pack_limit)
-			bad_object(offset, "object extends past end of pack");
-		hashcpy(delta_base->sha1, pack_base + pos);
-		pos += 20;
+		hashcpy(delta_base->sha1, fill(20));
+		use(20);
 		break;
 	case OBJ_OFS_DELTA:
 		memset(delta_base, 0, sizeof(*delta_base));
-		c = pack_base[pos++];
+		p = fill(1);
+		c = *p;
+		use(1);
 		base_offset = c & 127;
 		while (c & 128) {
 			base_offset += 1;
 			if (!base_offset || base_offset & ~(~0UL >> 7))
-				bad_object(offset, "offset value overflow for delta base object");
-			if (pos >= pack_limit)
-				bad_object(offset, "object extends past end of pack");
-			c = pack_base[pos++];
+				bad_object(obj->offset, "offset value overflow for delta base object");
+			p = fill(1);
+			c = *p;
+			use(1);
 			base_offset = (base_offset << 7) + (c & 127);
 		}
-		delta_base->offset = offset - base_offset;
-		if (delta_base->offset >= offset)
-			bad_object(offset, "delta base offset is out of bound");
+		delta_base->offset = obj->offset - base_offset;
+		if (delta_base->offset >= obj->offset)
+			bad_object(obj->offset, "delta base offset is out of bound");
 		break;
 	case OBJ_COMMIT:
 	case OBJ_TREE:
@@ -198,13 +205,38 @@ static void *unpack_raw_entry(unsigned l
 	case OBJ_TAG:
 		break;
 	default:
-		bad_object(offset, "bad object type %d", type);
+		bad_object(obj->offset, "bad object type %d", obj->type);
 	}
+	obj->hdr_size = consumed_bytes - obj->offset;
+
+	return unpack_entry_data(obj->offset, obj->size);
+}
+
+static void * get_data_from_pack(struct object_entry *obj)
+{
+	unsigned long from = obj[0].offset + obj[0].hdr_size;
+	unsigned long len = obj[1].offset - from;
+	unsigned pg_offset = from % getpagesize();
+	unsigned char *map, *data;
+	z_stream stream;
+	int st;
 
-	data = unpack_entry_data(offset, &pos, size);
-	*obj_type = type;
-	*obj_size = size;
-	*next_obj_offset = pos;
+	map = mmap(NULL, len + pg_offset, PROT_READ, MAP_PRIVATE,
+		   input_fd, from - pg_offset);
+	if (map == MAP_FAILED)
+		die("cannot mmap packfile '%s': %s", pack_name, strerror(errno));
+	data = xmalloc(obj->size);
+	memset(&stream, 0, sizeof(stream));
+	stream.next_out = data;
+	stream.avail_out = obj->size;
+	stream.next_in = map + pg_offset;
+	stream.avail_in = len;
+	inflateInit(&stream);
+	while ((st = inflate(&stream, Z_FINISH)) == Z_OK);
+	inflateEnd(&stream);
+	if (st != Z_STREAM_END || stream.total_out != obj->size)
+		die("serious inflate inconsistency");
+	munmap(map, len + pg_offset);
 	return data;
 }
 
@@ -280,15 +312,12 @@ static void resolve_delta(struct delta_e
 	unsigned long delta_size;
 	void *result;
 	unsigned long result_size;
-	enum object_type delta_type;
 	union delta_base delta_base;
-	unsigned long next_obj_offset;
 	int j, first, last;
 
 	obj->real_type = type;
-	delta_data = unpack_raw_entry(obj->offset, &delta_type,
-				      &delta_size, &delta_base,
-				      &next_obj_offset);
+	delta_data = get_data_from_pack(obj);
+	delta_size = obj->size;
 	result = patch_delta(base_data, base_size, delta_data, delta_size,
 			     &result_size);
 	free(delta_data);
@@ -321,13 +350,13 @@ static int compare_delta_entry(const voi
 	return memcmp(&delta_a->base, &delta_b->base, UNION_BASE_SZ);
 }
 
-static void parse_pack_objects(void)
+/* Parse all objects and return the pack content SHA1 hash */
+static void parse_pack_objects(unsigned char *sha1)
 {
 	int i;
-	unsigned long offset = sizeof(struct pack_header);
 	struct delta_entry *delta = deltas;
 	void *data;
-	unsigned long data_size;
+	struct stat st;
 
 	/*
 	 * First pass:
@@ -337,19 +366,29 @@ static void parse_pack_objects(void)
 	 */
 	for (i = 0; i < nr_objects; i++) {
 		struct object_entry *obj = &objects[i];
-		obj->offset = offset;
-		data = unpack_raw_entry(offset, &obj->type, &data_size,
-					&delta->base, &offset);
+		data = unpack_raw_entry(obj, &delta->base);
 		obj->real_type = obj->type;
 		if (obj->type == OBJ_REF_DELTA || obj->type == OBJ_OFS_DELTA) {
 			nr_deltas++;
 			delta->obj = obj;
 			delta++;
 		} else
-			sha1_object(data, data_size, obj->type, obj->sha1);
+			sha1_object(data, obj->size, obj->type, obj->sha1);
 		free(data);
 	}
-	if (offset != pack_size - 20)
+	objects[i].offset = consumed_bytes;
+
+	/* Check pack integrity */
+	SHA1_Update(&input_ctx, input_buffer, input_offset);
+	SHA1_Final(sha1, &input_ctx);
+	if (hashcmp(fill(20), sha1))
+		die("packfile '%s' SHA1 mismatch", pack_name);
+	use(20);
+
+	/* If input_fd is a file, we should have reached its end now. */
+	if (fstat(input_fd, &st))
+		die("cannot fstat packfile '%s': %s", pack_name, strerror(errno));
+	if (S_ISREG(st.st_mode) && st.st_size != consumed_bytes)
 		die("packfile '%s' has junk at the end", pack_name);
 
 	/* Sort deltas by base SHA1/offset for fast searching */
@@ -378,18 +417,17 @@ static void parse_pack_objects(void)
 		ofs = !find_delta_childs(&base, &ofs_first, &ofs_last);
 		if (!ref && !ofs)
 			continue;
-		data = unpack_raw_entry(obj->offset, &obj->type, &data_size,
-					&base, &offset);
+		data = get_data_from_pack(obj);
 		if (ref)
 			for (j = ref_first; j <= ref_last; j++)
 				if (deltas[j].obj->type == OBJ_REF_DELTA)
 					resolve_delta(&deltas[j], data,
-						      data_size, obj->type);
+						      obj->size, obj->type);
 		if (ofs)
 			for (j = ofs_first; j <= ofs_last; j++)
 				if (deltas[j].obj->type == OBJ_OFS_DELTA)
 					resolve_delta(&deltas[j], data,
-						      data_size, obj->type);
+						      obj->size, obj->type);
 		free(data);
 	}
 
@@ -408,6 +446,10 @@ static int sha1_compare(const void *_a, 
 	return hashcmp(a->sha1, b->sha1);
 }
 
+/*
+ * On entry *sha1 contains the pack content SHA1 hash, on exit it is
+ * the SHA1 hash of sorted object names.
+ */
 static void write_index_file(const char *index_name, unsigned char *sha1)
 {
 	struct sha1file *f;
@@ -467,7 +509,7 @@ static void write_index_file(const char 
 		sha1write(f, obj->sha1, 20);
 		SHA1_Update(&ctx, obj->sha1, 20);
 	}
-	sha1write(f, pack_base + pack_size - 20, 20);
+	sha1write(f, sha1, 20);
 	sha1close(f, NULL, 1);
 	free(sorted_by_sha);
 	SHA1_Final(sha1, &ctx);
@@ -513,9 +555,9 @@ int main(int argc, char **argv)
 
 	open_pack_file();
 	parse_pack_header();
-	objects = xcalloc(nr_objects, sizeof(struct object_entry));
+	objects = xcalloc(nr_objects + 1, sizeof(struct object_entry));
 	deltas = xcalloc(nr_objects, sizeof(struct delta_entry));
-	parse_pack_objects();
+	parse_pack_objects(sha1);
 	free(deltas);
 	write_index_file(index_name, sha1);
 	free(objects);

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Jakub Narebski @ 2006-10-20 18:35 UTC (permalink / raw)
  To: Jan Hudec; +Cc: Aaron Bentley, bazaar-ng, git
In-Reply-To: <20061020181210.GA29843@artax.karlin.mff.cuni.cz>

Jan Hudec wrote:
> On Fri, Oct 20, 2006 at 06:21:34PM +0200, Jakub Narebski wrote:
> > Aaron Bentley wrote:
> > 
> > > === added directory  // file-id:TREE_ROOT
> > 
> > Gaaah, so rename detection in bzr is done using file-ids?
> > Linus will tell you the inherent problems with that "solution".
> 
> Ok, I tried to read
> http://permalink.gmane.org/gmane.comp.version-control.git/217
> 
> It's all nice and well, but my question is whether the below cases work
> in git. Yes, they are particular cases, but they are particularly
> important. If they don't, I'd rather have a file-id scheme that is
> limited to just them, but handles them, than something with big plans
> but nothing working.
> 
> Let's consider the following scenario:
> 
> (where A$ means working in branch A, B$ means working in branch B and
>  VCT stands for version control tool of choice)

1077:jnareb@roke:/tmp/jnareb> mkdir tmp
1078:jnareb@roke:/tmp/jnareb> cd tmp/
1079:jnareb@roke:/tmp/jnareb/tmp> git init-db
defaulting to local storage area

> A$ echo Hello Warld! > hello.txt
1081:jnareb@roke:/tmp/jnareb/tmp> echo 'Hello Warld!' > hello.txt

> A$ VCT add hello.txt
1082:jnareb@roke:/tmp/jnareb/tmp> git add hello.txt

> A$ VCT commit -m "Created greeting"
1083:jnareb@roke:/tmp/jnareb/tmp> git commit -a -m "Created greeting"

(we are still on the default branch 'master' here; let us switch to A)
1084:jnareb@roke:/tmp/jnareb/tmp> git branch A
1088:jnareb@roke:/tmp/jnareb/tmp> git checkout A

> $ VCT branch A B
1085:jnareb@roke:/tmp/jnareb/tmp> git branch B A
(create branch B based on A)

> A$ VCT mkdir data
1089:jnareb@roke:/tmp/jnareb/tmp> mkdir data

> A$ VCT mv hello.txt data/
1090:jnareb@roke:/tmp/jnareb/tmp> git mv hello.txt data/

> A$ VCT commit -m "Moved hello.txt to data dir"
1092:jnareb@roke:/tmp/jnareb/tmp> git commit -a -m "Moved hello.txt to data dir"

> B$ ed hello.txt
> ? 1s/Warld/World/
> ? wq
1094:jnareb@roke:/tmp/jnareb/tmp> ed hello.txt 
13
1s/Warld/World/
wq
13

> B$ VCT commit -m "Fixed typo in greeting"
1096:jnareb@roke:/tmp/jnareb/tmp> git commit -a -m "Fixed typo in greeting"

> A$ VCT merge B
1097:jnareb@roke:/tmp/jnareb/tmp> git checkout A
1098:jnareb@roke:/tmp/jnareb/tmp> git pull . B
Trying really trivial in-index merge...
fatal: Merge requires file-level merging
Nope.
Merging HEAD with 9de7290d385ec2b0c2ade9b888f6c3a6633ac926
Merging: 
5f0eb04467538f0f1414af85ec6481150107c0b2 Moved hello.txt to data dir 
9de7290d385ec2b0c2ade9b888f6c3a6633ac926 Fixed typo in greeting 
found 1 common ancestor(s): 
f49a520e40143cb9d84b00e9728c5742897c0a22 Created greeting 

Merge made by recursive.
 data/hello.txt |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

> At this point, I expect the tree to look like this:
> A$ ls -R
1099:jnareb@roke:/tmp/jnareb/tmp> ls -R
.:
data

./data:
hello.txt

> A$ cat data/hello.txt
1100:jnareb@roke:/tmp/jnareb/tmp> cat data/hello.txt 
Hello World!



> A$ VCT mv data greetings
1102:jnareb@roke:/tmp/jnareb/tmp> git mv data greetings

> A$ VCT commit -m "Renamed the data directory to greetings"
1105:jnareb@roke:/tmp/jnareb/tmp> git commit -a -m "Renamed the data directory to greetings"

> B$ echo "Goodbye World!" > data/goodbye.txt
1106:jnareb@roke:/tmp/jnareb/tmp> git checkout B
1109:jnareb@roke:/tmp/jnareb/tmp> echo 'Goodbye World!' > data/goodbye.txt
bash: data/goodbye.txt: There is no such file or directory
1110:jnareb@roke:/tmp/jnareb/tmp> ls -R
.:
hello.txt

You need to revise your example.
-- 
Jakub Narebski
Poland

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Linus Torvalds @ 2006-10-20 18:30 UTC (permalink / raw)
  To: Aaron Bentley; +Cc: Jakub Narebski, bazaar-ng, git
In-Reply-To: <Pine.LNX.4.64.0610201100070.3962@g5.osdl.org>

On Fri, 20 Oct 2006, Linus Torvalds wrote:
> 
> So yes, merges are the situation where renames are normally considered a 
> "problem", but it's actually not nearly the most every-day situation at 
> all.

Btw, this is a pet peeve of mine, and it is not at all restricted to 
the SCM world.

In CompSci in general, you see a _lot_ of papers about things that almost 
don't matter - not because the issues are that important in practice, but 
because the issues are something small enough to be something you can 
discuss and explain without having to delve into tons of ugly detail, and 
because it's something that has a lot of "mental masturbation" associated 
with it - ie you can discuss it endlessly.

In the OS world, it's things like schedulers. You find an _inordinate_ 
number of papers on scheduling, considering that the actual algorithm then 
tends to be something that can be expressed in a hundred lines of code or 
so, but it's got quite high "mental masturbatory value" (hereafter called 
MMV).

Other high-MMV areas are page-out algorithms (never mind that almost all 
_real_ VM problems are elsewhere) and some zero-copy schemes (never mind 
that if you actually need to _work_ with the data, zero-copy DMA may 
actually be much worse because it ends up having bad cache behaviour).

In the SCM world, file renames and merging seem to be the high-MMV things. 
Never mind that the real issues tend to be elsewhere (like _performance_ 
when you have a few thousand commits that you want to merge).

For example, in the kernel, I think about half of all merges are what git 
calls "trivial in-index merges". That's HALF. Being a trivial in-index 
merge means that there was not a single file-level conflict that even 
needed a three-way merge, much less any study of the history AT ALL (other 
than finding the common ancestor, of course).
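
As a rough illustration of why such a merge needs no history, the per-path
decision can be sketched in C as follows. This is a simplification of the
case table git really uses, under stated assumptions: blob IDs are
stand-in integers (0 meaning "path absent"), and the names resolve_path()
and count_file_merges() are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

enum resolution { TAKE_OURS, TAKE_THEIRS, NEEDS_FILE_MERGE };

/* One path as seen in the common ancestor and in the two end states. */
struct path_state {
	const char *path;
	unsigned base, ours, theirs;	/* stand-in blob IDs, 0 = absent */
};

/* Decide a path by comparing IDs only; no commit in between is read. */
static enum resolution resolve_path(const struct path_state *p)
{
	if (p->ours == p->theirs)
		return TAKE_OURS;	/* both sides ended up identical */
	if (p->base == p->ours)
		return TAKE_THEIRS;	/* only their side changed it */
	if (p->base == p->theirs)
		return TAKE_OURS;	/* only our side changed it */
	return NEEDS_FILE_MERGE;	/* both changed it, differently */
}

/* Number of paths that cannot be settled by comparing IDs alone. */
static size_t count_file_merges(const struct path_state *paths, size_t n)
{
	size_t i, count = 0;

	assert(paths != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (resolve_path(&paths[i]) == NEEDS_FILE_MERGE)
			count++;
	return count;
}
```

In these terms a merge is "trivial" when count_file_merges() comes back
zero: every path was decided from three IDs, and how either side arrived
at its ID never entered into it.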

Of the rest, most by far need some trivial 3-way merging. And the ones 
that have trouble? In practice, that trivial and maligned 3-way does 
_better_ than anything more complicated.

Yet, if you actually bother to follow all the discussion on #revctrl and 
other places, what do you find discussed? Right: various high-MMV issues 
like "staircase merge" etc crap.

Go to revctrl.org for prime example of this. I think half the stuff is 
about merge algorithms, some of it is about glossary, and almost none of 
it is about something as pedestrian and simple as performance and 
scalability.

(Actually, to be honest, I think some of the #revctrl noise has become 
better lately. I'm not seeing quite as much theoretical discussion, it may 
be that as open-source distributed SCM's are getting to be more "real", 
people start to slowly realize that the masturbatory crap isn't actually 
what it's all about. So maybe at least this area is getting more about 
real every-day problems, and less about the theoretical-but-not-very- 
important issues).

		Linus

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Jon Smirl @ 2006-10-20 18:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Shawn Pearce, Aaron Bentley, Jakub Narebski, bazaar-ng, git
In-Reply-To: <Pine.LNX.4.64.0610201045550.3962@g5.osdl.org>

On 10/20/06, Linus Torvalds <torvalds@osdl.org> wrote:
> So yeah, I've seen a few strange cases myself, but they've actually been
> interesting. Like seeing how much of a file was just a copyright license,
> and then a file being considered a "copy" just because it didn't actually
> introduce any real new code.

It may be worth doing something special for licenses. Logs of small
Mozilla files are also getting tripped up by the large copyright
notices. The notices take up a lot of space too. The Mozilla license
has been changed five times. That is 110,000 files times one to five
licenses at 800-1500 characters each. 500MB+ of junk before
compression.

You could have a file of macro substitutions that is applied/expanded
when files go in/out of git. The macros would replace the copyright
notices, improving the move/rename tracking and reducing the repository
size. The macros could be recorded out of band to eliminate the need
for escaping the file contents. Even simpler, the only valid place for
the macro could be the beginning of the file.
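
A rough sketch of what that could look like, in C. Everything here is
hypothetical: git has no such feature, and the names stored_file,
collapse_notice() and expand_notice() exist only for this illustration. A
known notice at the very start of a file is dropped on the way in and its
index is kept next to the content rather than inside it, which is what
makes escaping unnecessary.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* What would be stored: the notice index travels out of band. */
struct stored_file {
	int notice_id;		/* index into notices[], or -1 if none */
	const char *body;	/* content with the leading notice removed */
};

/* On the way in: strip a known notice from the beginning of the file. */
static struct stored_file collapse_notice(const char *content,
					  const char *const *notices, int nr)
{
	struct stored_file sf = { -1, content };
	int i;

	for (i = 0; i < nr; i++) {
		size_t n = strlen(notices[i]);
		if (n && strncmp(content, notices[i], n) == 0) {
			sf.notice_id = i;
			sf.body = content + n;
			break;
		}
	}
	return sf;
}

/* On the way out: put the notice back. The caller frees the result. */
static char *expand_notice(struct stored_file sf,
			   const char *const *notices, int nr)
{
	const char *head = "";
	size_t hlen, blen = strlen(sf.body);
	char *out;

	assert(sf.notice_id < nr);
	if (sf.notice_id >= 0)
		head = notices[sf.notice_id];
	hlen = strlen(head);
	out = malloc(hlen + blen + 1);
	if (!out)
		return NULL;
	memcpy(out, head, hlen);
	memcpy(out + hlen, sf.body, blen + 1);	/* includes the NUL */
	return out;
}
```

Similarity would then be computed on the body alone, so two small files
that share nothing but a license header would no longer look related.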

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Jan Hudec @ 2006-10-20 18:12 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Aaron Bentley, bazaar-ng, git
In-Reply-To: <200610201821.34712.jnareb@gmail.com>

On Fri, Oct 20, 2006 at 06:21:34PM +0200, Jakub Narebski wrote:
> Aaron Bentley wrote:
> 
> > === added directory  // file-id:TREE_ROOT
> 
> Gaaah, so rename detection in bzr is done using file-ids?
> Linus will tell you the inherent problems with that "solution".

Ok, I tried to read
http://permalink.gmane.org/gmane.comp.version-control.git/217

It's all nice and well, but my question is whether the below cases work
in git. Yes, they are particular cases, but they are particularly
important. If they don't, I'd rather have a file-id scheme that is
limited to just them, but handles them, than something with big plans
but nothing working.

Let's consider the following scenario:

(where A$ means working in branch A, B$ means working in branch B and
 VCT stands for version control tool of choice)

A$ echo Hello Warld! > hello.txt
A$ VCT add hello.txt
A$ VCT commit -m "Created greeting"
$ VCT branch A B
A$ VCT mkdir data
A$ VCT mv hello.txt data/
A$ VCT commit -m "Moved hello.txt to data dir"
B$ ed hello.txt
? 1s/Warld/World/
? wq
B$ VCT commit -m "Fixed typo in greeting"
A$ VCT merge B

At this point, I expect the tree to look like this:
A$ ls -R
.:
data/
data:
hello.txt
A$ cat data/hello.txt
Hello World!

The file-id algorithm is not exceptionally clever, is a bit of a
special case and all that, but it handles the above case right. And
while that scenario is just a special case of general moving contents,
it is:
1) Very common
2) Possible to handle in an obviously correct way

It is very important for me that a version control tool I use handles
this case. If it handles the more general cases, that's nice, but this
is a must.

Oh, and there is one more complicated case, that I also require to work
and that works in Bzr, but did not work in Arch:

...let's start with the tree at the end of previous example...

A$ VCT mv data greetings
A$ VCT commit -m "Renamed the data directory to greetings"
B$ echo "Goodbye World!" > data/goodbye.txt
B$ VCT add data/goodbye.txt
B$ VCT commit -m "Added goodbye message."
A$ VCT merge B

And now I expect to have tree looking like this:

A$ ls -R
.:
greetings/
greetings:
hello.txt
goodbye.txt
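
A sketch of this second case with git as the VCT follows. It assumes a Git
recent enough to know the merge.directoryRenames setting; current Git
documentation gives "conflict" as the default for that setting, so it is
set to "true" for this one merge. The temporary directory and the
placeholder identity are likewise assumptions for the illustration.

```shell
#!/bin/sh
# Sketch: directory renamed on A, new file added to the old directory on B.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/A          # name the initial branch A
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.invalid"

# Starting tree shared by both branches: data/hello.txt
mkdir data
echo 'Hello World!' > data/hello.txt
git add data/hello.txt
git commit -q -m "Starting tree"
git branch B

# Branch A: rename the directory.
git mv data greetings
git commit -q -m "Renamed the data directory to greetings"

# Branch B: add a file under the old directory name.
git checkout -q B
echo 'Goodbye World!' > data/goodbye.txt
git add data/goodbye.txt
git commit -q -m "Added goodbye message."

# Merge B into A, asking git to follow the directory rename.
git checkout -q A
git -c merge.directoryRenames=true merge -q -m "Merge branch B" B
ls -R
```

On a Git without directory rename detection, goodbye.txt would be expected
to stay behind in data/ instead.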

Note that file-ids are /not/ required to handle this.
Darcs handles it just as well with its patch algebra
(http://darcs.net/DarcsWiki/PatchTheory) without need of any IDs.

--------------------------------------------------------------------------------
                  				- Jan Hudec `Bulb' <bulb@ucw.cz>

^ permalink raw reply

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Linus Torvalds @ 2006-10-20 18:06 UTC (permalink / raw)
  To: Aaron Bentley; +Cc: bazaar-ng, git, Jakub Narebski
In-Reply-To: <45390BAF.5040405@utoronto.ca>

On Fri, 20 Oct 2006, Aaron Bentley wrote:
> > 
> > Git _definitely_ handles renames, both in everyday life and when merging.
> 
> Hmm.  Could you say more here?  The only examples I can think of for
> handling renames are situations that can be expressed as a merge.

So yes, merges are the situation where renames are normally considered a 
"problem", but it's actually not nearly the most every-day situation at 
all.

The most common one is actually just showing things as a diff.

If you are looking at a code-change, there's an absolutely _huge_ 
difference if you look at the result as a "delete this huge file" and 
"create this other huge file" and seeing it as a "move this huge file from 
here to here, and change a few lines in the process".

So the most _important_ part of rename tracking from a user perspective is 
for the person who walks through somebody else's code history, and wants to 
know how a certain state came to be. The merges are usually not as big of 
a deal for the user (although they are clearly the most hairy case for the 
SCM - which is why SCM people concentrate on merges).
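
The difference between the two views can be sketched with git diff. The
file names and the generated 200-line file below are made up for the
illustration; --no-renames and -M are the git diff options that turn
rename detection off and on.

```shell
#!/bin/sh
# Sketch: one commit viewed without and with rename detection.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.invalid"

# Stand-in for a "huge" file: 200 generated lines.
awk 'BEGIN { for (i = 1; i <= 200; i++) print "line " i }' > old-name.txt
git add old-name.txt
git commit -q -m "Add file"

# Move it and change a little in the process.
git mv old-name.txt new-name.txt
echo 'one more line' >> new-name.txt
git commit -q -a -m "Move file and touch it up"

# "Delete this file" plus "create this other file".
without=$(git diff --no-renames --name-status HEAD~1 HEAD)
# "Move this file, and change a few lines in the process".
with=$(git diff -M --name-status HEAD~1 HEAD)

printf 'without rename detection:\n%s\n' "$without"
printf 'with rename detection:\n%s\n' "$with"
```

The first listing shows a D and an A entry; the second shows a single R
entry with a similarity score.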

			Linus

^ permalink raw reply

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: Linus Torvalds @ 2006-10-20 17:59 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: bazaar-ng, git
In-Reply-To: <200610201945.43957.jnareb@gmail.com>

On Fri, 20 Oct 2006, Jakub Narebski wrote:
> 
> If I remember correctly, git decided on contents (plus filename)
> similarity based renames detection because 1), it is more generic
> as it covers (or can cover) contents moving not only wholesome rename
> of a file, and 2) because file-id based renames handling works only
> if you explicitly use SCM command to rename file, which is not the
> case of non-SCM-aware channel like for example patches (and accepting
> ordinary patches is important for Linux kernel, the project git was
> created for).

There are lots of problems with file ID's. One of the more obvious ones is 
indeed that if you arrive at the same state two different ways (eg patches 
vs "native SCM"), you end up with two fundamentally different trees. Even 
though clearly there was no real difference.
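
A small sketch of the git side of this point: reaching the same content
through "git mv" and through a plain mv (standing in for a non-SCM-aware
channel such as a patch) gives the same tree object. The branch names
"native" and "plain", the temporary directory and the placeholder identity
are hypothetical.

```shell
#!/bin/sh
# Sketch: two ways to the same state produce the same tree in git.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.invalid"

echo 'Hello World!' > hello.txt
git add hello.txt
git commit -q -m "Created greeting"
base=$(git rev-parse HEAD)

# Way 1: rename through the SCM.
git checkout -q -b native "$base"
mkdir -p data
git mv hello.txt data/
git commit -q -m "Moved with git mv"
tree_native=$(git rev-parse 'HEAD^{tree}')

# Way 2: rename behind the SCM's back, then record whatever is there.
git checkout -q -b plain "$base"
mkdir -p data
mv hello.txt data/
git add -A
git commit -q -m "Moved with plain mv"
tree_plain=$(git rev-parse 'HEAD^{tree}')

echo "native: $tree_native"
echo "plain:  $tree_plain"
```

Both lines print the same object name, since the tree depends only on the
resulting paths and contents.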

There are other serious problems. For example, file-ID based systems 
invariably have _huge_ problems with handling two branches deleting and 
renaming things differently, and we had several issues with that during 
the BK days (ie two people would move files differently, and end up 
with different file ID's for the same path, and merging that inevitably 
causes problems not just during the merge, but ever after, since one of 
the file ID's will then have to be "deleted" even though it might be 
active in one of the branches).

Finally, file-ID based systems fundamentally cannot handle some simple and 
interesting cases, like partial content movement. We're starting to see 
git actually being able to track file content moving between files: even 
when the files themselves didn't move (ie Junio's "git pickaxe" work could 
do things like that).

And there really aren't as many advantages to tracking renames as people 
claim. The biggest advantage of tracking renames is to avoid the trap that 
CVS fell into: being file-ID based _and_ not being able to track the file 
ID moving is clearly the worst of all worlds.

So for anybody coming from a CVS background, tracking renames explicitly 
is a _huge_ advantage, which is, I think, why some SCM people have gotten 
so hung up about them. It's just that if you don't have the file-ID 
problem in the first place (and git doesn't), then rename tracking doesn't 
actually make any sense, and only makes things much worse.

			Linus

^ permalink raw reply

* Re: [ANNOUNCE] Example Cogito Addon - cogito-bundle
From: David Lang @ 2006-10-20 17:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Shawn Pearce, Aaron Bentley, Jakub Narebski, bazaar-ng, git
In-Reply-To: <Pine.LNX.4.64.0610201045550.3962@g5.osdl.org>

On Fri, 20 Oct 2006, Linus Torvalds wrote:

> On Fri, 20 Oct 2006, Shawn Pearce wrote:
>>
>> I renamed hundreds of small files in one shot and also did a few
>> hundred adds and deletes of other small XML files.  Git generated
>> a lot of those unrelated adds/deletes as rename/modifies, as their
>> content was very similar.  Some people involved in the project
>> freaked as the files actually had nothing in common with one
>> another... except for a lot of XML elements (as they shared the
>> same DTD).
>
> Heh. We can probably tweak the heuristics (one of the _great_ things about
> content detection is that you can fix it after the fact, unlike the
> alternative).
>
> That said, I've personally actually found the content-based similarity
> analysis to often be quite informative, even when (and perhaps
> _especially_ when) it ended up showing something that the actual author of
> the thing didn't intend.
>
> So yeah, I've seen a few strange cases myself, but they've actually been
> interesting. Like seeing how much of a file was just a copyright license,
> and then a file being considered a "copy" just because it didn't actually
> introduce any real new code.
>

isn't the default to consider them a copy if they are 80% the same, with a 
command-line option to tweak this (IIRC -m, but I could easily be wrong)?
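
For reference, a sketch of how the threshold can be tweaked. In current Git
documentation the diff options are -M[<n>] for renames and -C[<n>] for
copies, with a default similarity index of 50%; whether that matches the
defaults at the time of this thread is not checked here. The file names and
contents are made up for the illustration.

```shell
#!/bin/sh
# Sketch: the same change classified under different similarity thresholds.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.invalid"

awk 'BEGIN { for (i = 1; i <= 20; i++) print "line " i }' > a.txt
git add a.txt
git commit -q -m "Add a.txt"

# Replace a.txt with b.txt, rewriting the first three of twenty lines.
sed '1,3s/^/changed /' a.txt > b.txt
git rm -q a.txt
git add b.txt
git commit -q -m "Rename a.txt to b.txt with edits"

loose=$(git diff -M50% --name-status HEAD~1 HEAD)    # lenient threshold
strict=$(git diff -M90% --name-status HEAD~1 HEAD)   # nearly identical only
exact=$(git diff -M100% --name-status HEAD~1 HEAD)   # exact renames only

printf -- '-M50%%:\n%s\n' "$loose"
printf -- '-M90%%:\n%s\n' "$strict"
printf -- '-M100%%:\n%s\n' "$exact"
```

With most of the content shared, the 50% run reports a rename, while the
90% and 100% runs fall back to a separate delete and add.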

David Lang

^ permalink raw reply
