Git development
 help / color / mirror / Atom feed
* Re: [PATCH 5/6] Do linear-time/space rename logic for exact renames
From: Linus Torvalds @ 2007-10-25 20:25 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0710251522190.7345@iabervon.org>



On Thu, 25 Oct 2007, Daniel Barkalow wrote:
> 
> Creating a list of the pointers doesn't work correctly with the grow 
> implementation, because growing the hash may turn a collision into a 
> non-collision, at which point items other than the first cannot be found 
> (since they're listed inside a bucket that's now wrong for them). AFAIK, 
> resizing a hash table requires being able to figure out what happened with 
> collisions.

Nope. 

The hash algorithm is much smarter than that.

I *always* uses a full 32-bit hash, and no amount of resizing is ever 
going to change that. The index into the hash-table is in fact entirely 
unused.

This has several good properties:

 - it means that hash-table resizing is a non-event

 - it means that you always have the full 32-bit hash, and a collision in 
   the hash size never causes unnecessary work apart from the fact that 
   the code walks the hash table a bit more.

 - because the hash is embedded in the table itself, it has relatively 
   good cache behaviour when compared to something that needs to actually 
   follow the pointer to validate the full data. So assuming that the full 
   32-bit hash is good enough to effectively never have any collisions 
   (or, assuming you don't even *care* about the collisions, which is the 
   case when you're just generating content fingerprints for lines when 
   comparing the data in two files), you never end up with unnecessarily 
   following pointers to cachelines that you are not interested in.

The last point at least somewhat mitigates the (inevitably) bad cache 
behaviour that hash tables tend to have. It's not like it's going to be 
wonderful in the cache, but at least it's less horrid than the more common 
implementation that needs to follow the pointer to validate each hash 
entry that may or may not be a collision.

but the important part is #1, which is what allows the code to be a 
generic hash algorithm that resizes the hash table without even 
understanding or caring what is behind the pointer.

		Linus

^ permalink raw reply

* Re: recent change in git.git/master broke my repos
From: Andreas Ericsson @ 2007-10-25 20:23 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Karl Hasselström, Randal L. Schwartz, git
In-Reply-To: <alpine.LFD.0.9999.0710251344220.22100@xanadu.home>

Nicolas Pitre wrote:
> On Thu, 25 Oct 2007, Karl Hasselström wrote:
> 
>> On 2007-10-25 07:32:36 -0700, Randal L. Schwartz wrote:
>>
>>> And when are we gonna get "fast forward only" for git-merge?
>> I'd like that too. For cases when I know I don't have to do a merge,
>> and want git to yell at me if I'm mistaken. For example, in a
>> repository that tracks an upstream so I can build the latest version,
>> but where I don't normally do any development.
> 
> Isn't that called a remote branch that gets updated with "git fetch' ?
> You can even trick Git into not using the refs/remotes/ namespace for 
> them if you wish.
> 

You'd lose the ability to do "git diff origin/master" while disconnected
though. It's quite valuable.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply

* Re: [PATCH 5/6] Do linear-time/space rename logic for exact renames
From: Daniel Barkalow @ 2007-10-25 20:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Junio C Hamano, Git Mailing List
In-Reply-To: <20071025194859.GB27745@coredump.intra.peff.net>

On Thu, 25 Oct 2007, Jeff King wrote:

> On Thu, Oct 25, 2007 at 03:43:46PM -0400, Daniel Barkalow wrote:
> 
> > Creating a list of the pointers doesn't work correctly with the grow 
> > implementation, because growing the hash may turn a collision into a 
> > non-collision, at which point items other than the first cannot be found 
> > (since they're listed inside a bucket that's now wrong for them). AFAIK, 
> > resizing a hash table requires being able to figure out what happened with 
> > collisions.
> 
> I thought this at first, too, but there are two types of collisions in
> this hash implementation: those that come from having the actual 32-bit
> hash collide, and those that come from not having enough buckets.
> 
> The client code gets a pointer kicked back to it when there is a
> collision on the actual hash value (i.e., two things had the exact same
> hash value). The number of buckets grows when you simply have more
> buckets filled than you like. Two different hashes that would be in the
> same bucket don't actually occupy the same bucket -- the second one to
> arrive gets shoved into the next available bucket.

Ah, right, nevermind. The comment might be a bit misleading in that case, 
if we both missed this at first.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply

* Re: best git practices, was Re: Git User's Survey 2007 unfinishedsummary continued
From: Andreas Ericsson @ 2007-10-25 20:19 UTC (permalink / raw)
  To: Federico Mena Quintero; +Cc: J. Bruce Fields, git
In-Reply-To: <1193335562.4522.403.camel@cacharro.xalalinux.org>

Federico Mena Quintero wrote:
> On Thu, 2007-10-25 at 12:38 -0400, J. Bruce Fields wrote:
> 
>> It's definitely not a simple cut-and-paste--even with permission from
>> the author of "Git for computer scientists", fitting this in would
>> require rethinking the ordering of topics in the manual.
> 
> Oh, that can be done.  It's easier to move text around than to
> rearchitect code :)
> 

It misses the point though. Machines should work while humans are
lounging. If the humans have to read a lot to get the machines to
work, there's less time for lounging ;-)

>> Also, there's
>> the restriction that we'd like to keep it looking good in plain ascii,
>> so diagrams have to be done in ascii somehow.
> 
> Hmm, what's the rationale for this?  I'd assume that most people read
> the user's manual as a web page (or as bedside reading if they can print
> a PDF thereof), where diagrams can be pretty.
> 

man pages.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply

* Re: best git practices, was Re: Git User's Survey 2007 unfinished summary continued
From: Andreas Ericsson @ 2007-10-25 20:18 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Theodore Tso, Johannes Schindelin, Steffen Prohaska,
	Peter Baumann, J. Bruce Fields, Jakub Narebski,
	Federico Mena Quintero, git
In-Reply-To: <7vejfj11tk.fsf@gitster.siamese.dyndns.org>

Junio C Hamano wrote:
> Andreas Ericsson <ae@op5.se> writes:
> 
>> However, there's still this issue:
>> $ git checkout -b foo origin/pu
>> Branch foo set up to track remote branch refs/remotes/origin/pu.
>> Switched to a new branch "foo"
>>
>> git checkout will say that every time a branch is created from a
>> tracking branch, unless one tells it --no-track (which people don't
>> learn about unless they're really into git), so it's quite natural
>> that people think git will actually make sure, within reasonable
>> limits, that 'foo' is kept in sync with refs/remotes/origin/pu.
>> That's not the case, however.
>>
>> So we could either change the message to be:
>> "Branch foo set up to track remote branch refs/remotes/origin/pu,
>> provided you only ever issue git-pull while having branch foo
>> checked out."
>>
>> Or we could make 'git checkout -b' default to --no-track, perhaps
>> giving annoying messages everytime someone "git-checkout -b"'s a
>> remote tracking branch.
>> Or we could make git-pull keep git checkout's promises.
> 
> The thing is, if you have 200 local branches (because you
> interact with 50 repositories with 4 primary branches each), you
> do not constantly check all of them out anyway.  And the only
> place that staleness of the local tracking fork matters is when
> you check it out (that is, as long as you train your users that
> the way to check differences with the upstream 'pu' in your case
> is by doing operations with 'origin/pu' not with your local
> 'foo').
> 

Probably, although I think the confusion of 'foo' being something
else than 'origin/pu' after it's been checked out would be hard
to explain. I'll see how the patch turns out. If it all goes tits
up, I'll see if a post-checkout hook can solve it.

> With that in mind, how about making "git checkout foo", after
> foo is set up thusly, to show:
> 
> 	git log --pretty=oneline --left-right origin/pu...foo
> 
> if (and only if) they have diverged?  Then you can deal with the
> staleness of local tracking fork 'foo' in any way you want.
> 
> You could even go one step further and make this "checkout foo",
> in addition to or instead of showing the above left-right log,
> 
>  - automatically run "git merge origin/pu" if it is a
>    fast-forward, and say it did _not_ run that merge if it is
>    not a fast-forward;
> 
>  - automatically run "git merge origin/pu" always, even if it is
>    not a fast-forward;
> 
>  - automatically run "git rebase origin/pu" always;
> 
> Would that make your life easier?

That it would, except the confusion would then be that it's automatically
rebased for the branches one currently hasn't got checked out while pulling,
and the branch that *is* checked out gets merged (crazy, yes), so those
who prefer the rebase would get what they want by doing something completely
bonkers, such as:

git checkout -b just-gonna-pull HEAD^
git pull
git checkout whatever-other-branch-they-were-on

(yes, "aggresively ignorant", I think Ted said in an earlier mail)

It'd probably be better to go with Dscho's suggestion, although I'm not quite
sure what that was any more. It involved automagical rebasing on fetch or pull
though.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply

* [PATCH] Fix generation of perl/perl.mak
From: Alex Riesen @ 2007-10-25 20:17 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

The code generating perl/Makefile from Makefile.PL was causing trouble
because it didn't considered NO_PERL_MAKEMAKER and ran makemaker
unconditionally, rewriting perl.mak. Makemaker is FUBAR in ActiveState Perl,
and perl/Makefile has a replacement for it.

Besides, a changed Git.pm is *NOT* a reason to rebuild all the perl scripts,
so remove the dependency too.

Signed-off-by: Alex Riesen <raa.lkml@gmail.com>
---
 Makefile |    6 +-----
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/Makefile b/Makefile
index b728920..72f5ef4 100644
--- a/Makefile
+++ b/Makefile
@@ -812,7 +812,7 @@ $(patsubst %.sh,%,$(SCRIPT_SH)) : % : %.sh
 
 $(patsubst %.perl,%,$(SCRIPT_PERL)): perl/perl.mak
 
-perl/perl.mak: GIT-CFLAGS
+perl/perl.mak: GIT-CFLAGS perl/Makefile perl/Makefile.PL
 	$(QUIET_SUBDIR0)perl $(QUIET_SUBDIR1) PERL_PATH='$(PERL_PATH_SQ)' prefix='$(prefix_SQ)' $(@F)
 
 $(patsubst %.perl,%,$(SCRIPT_PERL)): % : %.perl
@@ -931,10 +931,6 @@ $(XDIFF_LIB): $(XDIFF_OBJS)
 	$(QUIET_AR)$(RM) $@ && $(AR) rcs $@ $(XDIFF_OBJS)
 
 
-perl/Makefile: perl/Git.pm perl/Makefile.PL GIT-CFLAGS
-	(cd perl && $(PERL_PATH) Makefile.PL \
-		PREFIX='$(prefix_SQ)')
-
 doc:
 	$(MAKE) -C Documentation all
 
-- 
1.5.3.4.403.g401f6

^ permalink raw reply related

* Re: best git practices, was Re: Git User's Survey 2007 unfinishedsummary continued
From: Andreas Ericsson @ 2007-10-25 20:08 UTC (permalink / raw)
  To: Federico Mena Quintero; +Cc: Theodore Tso, git
In-Reply-To: <1193335339.4522.398.camel@cacharro.xalalinux.org>

Federico Mena Quintero wrote:
> On Thu, 2007-10-25 at 11:21 -0400, Theodore Tso wrote:
> 
>> And of course it's inelegant.  You just told us we were dealing with
>> CVS-brain-damaged corporate developers who can't be bothered to learn
>> about the fine points of using things the git way.
> 
> Ignore the corporate developers who use SCMs only because their company
> requires them to.  Git is not the right thing for them;

Quite contrary to what I think, and quite contrary to what Linus said in
his google speach. The problem with using a chainsaw instead of a tooth-
pick is that it's much easier to hurt yourself with a chainsaw. It's a
lot easier to get real work done though.

> some
> Eclipse-based monstrosity probably is.  It's like the horrendous
> Oracle-based expense-reporting thing we have to use at Novell; I use it
> because they make me, not because I'm particularly excited about
> reporting expenses :)
> 

Nobody's particularly excited about reporting expenses. That's why there
are so few OSS solutions for it, and the ones that exist suck horribly
because whoever got the job of making the system knew his users would
simply hate it, no matter how perfect it was. It's one of those things ;-)

> However, *do think* of the free software developers who have been using
> CVS forever.  You won't make friends among them if you keep saying, "you
> use CVS?  You are brain-damaged, then."  CVS has been as good/bad to
> them as to anyone else, and they are probably delighted to get a better
> solution.  That solution needs to take into account the concepts to
> which they have been exposed for the past N years.  Just because your
> new concepts are better, doesn't mean that their old ones were wrong in
> their time.
> 

Well, perhaps they were. CVS was a fairly important learning phase, much
like Thomas Edison who first discovered 2000 ways of not making a light-
bulb. It doesn't make them less valuable though, and the reason to keep
going until perfection is reached is that machines are made for work, and
humans are made for fun.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply

* stgit restrictions on patch names
From: Yann Dirson @ 2007-10-25 19:48 UTC (permalink / raw)
  To: Catalin Marinas, Karl Hasselström; +Cc: GIT list

Looks like stgit is now more picky on patch names than in used to be:

$ stg branch --clone v2.0.6-debian
Checking for changes in the working directory ... done
Cloning current branch to "v2.0.6-debian" ...
  No log for 01_springelectrical
stg branch: Invalid patch name: "10_g++4.0_build_failures"
$


=> the result of the cloning operation is a partial clone.  Do we want to:

- implement a mechanism for checking beforehand that the operation
will not fail ?  Seems awkward to duplicate checks already found
elsewhere.

- wait for proper transactions so we can rollback on error ?

- on clone error, delete the newly-created stack ?  I'd vote for this
one, until the previous one gets done.


=> is there any particular reason why we would refuse "+" ?

^ permalink raw reply

* Re: Git and Windows
From: Robin Rosenberg @ 2007-10-25 19:54 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Bo Yang, git
In-Reply-To: <Pine.LNX.4.64.0710251517350.25221@racer.site>

torsdag 25 oktober 2007 skrev Johannes Schindelin:
> Hi,
> 
> On Thu, 25 Oct 2007, Bo Yang wrote:
> 
> >   I am a new comer to this list but I have used git for two week 
> > development control. I think it is a very cool tool, the only flaw is 
> > that I have not found Windows version of it. Does git just aim at Linux 
> > kernel development? Is there any plan or in the future to migrate it to 
> > windows?
> 
> Funny.  The first three hits I get from Google are
> 
> 	Wikipedia,
> 	GitWiki and
> 	msysgit
> 
> The first two pointing to the third.  And happily enough, there is a 
> Download page at the third site.  Oh, and it has a description what its 
> affiliation with git is.

The "featured download" is still not the one I'd recommend.

-- robin

^ permalink raw reply

* Re: [PATCH 5/6] Do linear-time/space rename logic for exact renames
From: Jeff King @ 2007-10-25 19:48 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Linus Torvalds, Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0710251522190.7345@iabervon.org>

On Thu, Oct 25, 2007 at 03:43:46PM -0400, Daniel Barkalow wrote:

> Creating a list of the pointers doesn't work correctly with the grow 
> implementation, because growing the hash may turn a collision into a 
> non-collision, at which point items other than the first cannot be found 
> (since they're listed inside a bucket that's now wrong for them). AFAIK, 
> resizing a hash table requires being able to figure out what happened with 
> collisions.

I thought this at first, too, but there are two types of collisions in
this hash implementation: those that come from having the actual 32-bit
hash collide, and those that come from not having enough buckets.

The client code gets a pointer kicked back to it when there is a
collision on the actual hash value (i.e., two things had the exact same
hash value). The number of buckets grows when you simply have more
buckets filled than you like. Two different hashes that would be in the
same bucket don't actually occupy the same bucket -- the second one to
arrive gets shoved into the next available bucket.

-Peff

^ permalink raw reply

* Re: [PATCH] git-cvsimport: Add -N option to force a new import
From: Robin Rosenberg @ 2007-10-25 19:45 UTC (permalink / raw)
  To: Matt McCutchen; +Cc: Junio C Hamano, git
In-Reply-To: <1193284913.2619.23.camel@mattlaptop2>

torsdag 25 oktober 2007 skrev Matt McCutchen:
> On Wed, 2007-10-24 at 20:17 -0700, Junio C Hamano wrote:
> > Matt McCutchen <matt@mattmccutchen.net> writes:
> > 
> > > I had a git repository for development of rsync and wanted to start
> > > importing the upstream CVS with git-cvsimport, but git-cvsimport saw
> > > that the git repository existed and insisted on updating a previous
> > > import.  This patch adds an -N option to git-cvsimport to force a new
> > > import and updates the documentation appropriately.
> > 
> > Sounds like a useful addition.  Tests?
> 
> Are there existing tests for git-cvsimport somewhere whose example I
> could follow?  (I didn't see any in t/ .)  If not, I suppose I will just
> write a simple script that runs git-cvsimport with and without -N and
> with and without an existing, empty git repository and checks that the
> right things happen.

None, but there should be. I also think cvsps should be included in the git 
repo since it is required and AFAIK, only git people maintain it. 

Now I don't use cvsimport to import my CVS repos, so I'll pass on adding test 
cases. It is a non-trivial task 

I did it for cvsexportcommit which didn't have any tests when I started 
hacking it.

-- robin

^ permalink raw reply

* Re: [PATCH 5/6] Do linear-time/space rename logic for exact renames
From: Daniel Barkalow @ 2007-10-25 19:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
In-Reply-To: <alpine.LFD.0.999.0710251120590.30120@woody.linux-foundation.org>

> +/*
> + * Insert a new hash entry pointer into the table.
> + *
> + * If that hash entry already existed, return the pointer to
> + * the existing entry (and the caller can create a list of the
> + * pointers or do anything else). If it didn't exist, return
> + * NULL (and the caller knows the pointer has been inserted).
> + */

Creating a list of the pointers doesn't work correctly with the grow 
implementation, because growing the hash may turn a collision into a 
non-collision, at which point items other than the first cannot be found 
(since they're listed inside a bucket that's now wrong for them). AFAIK, 
resizing a hash table requires being able to figure out what happened with 
collisions.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply

* Re: [PATCH 0/6] Cleaned-up rename detection patch-series
From: Jeff King @ 2007-10-25 19:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
In-Reply-To: <alpine.LFD.0.999.0710251112120.30120@woody.linux-foundation.org>

On Thu, Oct 25, 2007 at 11:15:16AM -0700, Linus Torvalds wrote:

> Ok, here's the patch-series to do rename detection of exact renames in 
> linear time, except it's cleaned up into a nice series of 6 patches. The 
> end result is identical to the previous patches (which got smushed down 
> into just two patches in Junio's tree), apart from a fixed dependency in 
> the Makefile that caused me grief and a broken test-suite due to some 
> object files simply not being correctly recompiled.

Thanks, I have been basing my inexact rename work off of your messy
patches, so I will rebase onto this.

Sorry I am lagging behind on getting that work out, but I hope to have
something to show soon.

-Peff

^ permalink raw reply

* Re: best git practices, was Re: Git User's Survey 2007 unfinished summary continued
From: Carl Worth @ 2007-10-25 19:04 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Federico Mena Quintero, Andreas Ericsson, git
In-Reply-To: <Pine.LNX.4.64.0710230017120.25221@racer.site>

[-- Attachment #1: Type: text/plain, Size: 791 bytes --]

On Tue, 23 Oct 2007 00:21:03 +0100 (BST), Johannes Schindelin wrote:
> On Mon, 22 Oct 2007, Federico Mena Quintero wrote:
> >
> > The "branches should not track their origin by default"


Did this change recently? I just wrote a long message arguing for the
"autosetupmerge = 1" behavior by default, but when I tested with
1.5.3.4 it seems to be there already.

Am I seeing that correctly?

If so, thank you and congratulations! Not having to pass --track nor
manually configure autosetupmerge to get this very useful behavior is
a very nice change!

A nice followup would be to improve the documentation for "git pull"
to better indicate some of the smarts that are now inherent in it.

I'll take a whack at that, (just as soon as I finish reading the rest
of this giant thread...).

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* Re: [PATCH 0/6] Cleaned-up rename detection patch-series
From: Linus Torvalds @ 2007-10-25 18:37 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List
In-Reply-To: <alpine.LFD.0.999.0710251112120.30120@woody.linux-foundation.org>



Final notes:

 - all of these were done on top of the current 'master' branch.  Not that 
   it likely matters much, but I thought I'd mention it in case there are 
   conflicts with anything currently cooking in 'next'. I didn't even 
   check.

 - The commits depend on each other, but apart from the ordering, they all 
   stand on their own. They pass the test-suite at all stages.

 - In particular, the first four commits are cleanups and infrastructure 
   that really is totally independent of the last two. I'd argue that the 
   first one (the dependency fix) could/should be merged into stable, and 
   the next three can probably be merged more aggressively than the last 
   two.

 - I *think* this series is good to go. I've used the new rename detection 
   code in my own "production" environment (ie the kernel) since sending 
   it out, and the end result of all these patches is identical - apart 
   from the the dependency fix - to what I've been running for the last 
   week. But maybe it makes more sense to merge the first four into 
   'next', and leave the last two in 'pu', for example (or 'master' and 
   'next' respectively, depending on how comfy people are with the 
   patches).

Splitting the thing up definitely makes the series more readable, and 
maybe that means that somebody will actually comment on it. I don't think 
I got any comments on the original series once I fixed the stupid bugs. 
Hopefully the cleaner series is more amenable to people reviewing it.

			Linus

^ permalink raw reply

* Re: best git practices, was Re: Git User's Survey 2007 unfinished summary continued
From: Junio C Hamano @ 2007-10-25 18:33 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Theodore Tso, Johannes Schindelin, Steffen Prohaska,
	Peter Baumann, J. Bruce Fields, Jakub Narebski,
	Federico Mena Quintero, git
In-Reply-To: <4720CCE0.2090007@op5.se>

Andreas Ericsson <ae@op5.se> writes:

> However, there's still this issue:
> $ git checkout -b foo origin/pu
> Branch foo set up to track remote branch refs/remotes/origin/pu.
> Switched to a new branch "foo"
>
> git checkout will say that every time a branch is created from a
> tracking branch, unless one tells it --no-track (which people don't
> learn about unless they're really into git), so it's quite natural
> that people think git will actually make sure, within reasonable
> limits, that 'foo' is kept in sync with refs/remotes/origin/pu.
> That's not the case, however.
>
> So we could either change the message to be:
> "Branch foo set up to track remote branch refs/remotes/origin/pu,
> provided you only ever issue git-pull while having branch foo
> checked out."
>
> Or we could make 'git checkout -b' default to --no-track, perhaps
> giving annoying messages everytime someone "git-checkout -b"'s a
> remote tracking branch.
> Or we could make git-pull keep git checkout's promises.

The thing is, if you have 200 local branches (because you
interact with 50 repositories with 4 primary branches each), you
do not constantly check all of them out anyway.  And the only
place that staleness of the local tracking fork matters is when
you check it out (that is, as long as you train your users that
the way to check differences with the upstream 'pu' in your case
is by doing operations with 'origin/pu' not with your local
'foo').

With that in mind, how about making "git checkout foo", after
foo is set up thusly, to show:

	git log --pretty=oneline --left-right origin/pu...foo

if (and only if) they have diverged?  Then you can deal with the
staleness of local tracking fork 'foo' in any way you want.

You could even go one step further and make this "checkout foo",
in addition to or instead of showing the above left-right log,

 - automatically run "git merge origin/pu" if it is a
   fast-forward, and say it did _not_ run that merge if it is
   not a fast-forward;

 - automatically run "git merge origin/pu" always, even if it is
   not a fast-forward;

 - automatically run "git rebase origin/pu" always;

Would that make your life easier?

^ permalink raw reply

* Re: recent change in git.git/master broke my repos
From: Randal L. Schwartz @ 2007-10-25 18:29 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Karl Hasselström, git
In-Reply-To: <alpine.LFD.0.9999.0710251344220.22100@xanadu.home>

>>>>> "Nicolas" == Nicolas Pitre <nico@cam.org> writes:

Nicolas> Isn't that called a remote branch that gets updated with "git fetch' ?
Nicolas> You can even trick Git into not using the refs/remotes/ namespace for 
Nicolas> them if you wish.

No.  I have a local master, and I want to be able to blindly say
"git-merge" on that master when doing so is only a fast forward.


-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

^ permalink raw reply

* [PATCH 5/6] Do linear-time/space rename logic for exact renames
From: Linus Torvalds @ 2007-10-25 18:23 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List
In-Reply-To: <alpine.LFD.0.999.0710251112120.30120@woody.linux-foundation.org>


From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 5/6] Do linear-time/space rename logic for exact renames

This implements a smarter rename detector for exact renames, which
rather than doing a pairwise comparison (time O(m*n)) will just hash the
files into a hash-table (size O(n+m)), and only do pairwise comparisons
to renames that have the same hash (time O(n+m) except for unrealistic
hash collissions, which we just cull aggressively).

Admittedly the exact rename case is not nearly as interesting as the
generic case, but it's an important case none-the-less.A similar general
approach should work for the generic case too, but even then you do need
to handle the exact renames/copies separately (to avoid the inevitable
added cost factor that comes from the _size_ of the file), so this is
worth doing.

In the expectation that we will indeed do the same hashing trick for the
general rename case, this code uses a generic hash-table implementation
that can be used for other things too.  In fact, we might be able to
consolidate some of our existing hash tables with the new generic code
in hash.[ch].

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This is obviously the actual meat of the new exact rename detection: 
everything else was just leading up to this.

I could have split out "hash.[ch]" as a separate patch just to introduce 
the infrastructure, but I hate patches that add code that isn't used at 
all. But hash.[ch] really is potentially independently useful, and has 
nothing that is specific to the rename detection per se in it.

 Makefile          |    4 +-
 diffcore-rename.c |  211 +++++++++++++++++++++++++++++++++++++----------------
 hash.c            |  110 +++++++++++++++++++++++++++
 hash.h            |   43 +++++++++++
 4 files changed, 303 insertions(+), 65 deletions(-)
 create mode 100644 hash.c
 create mode 100644 hash.h

diff --git a/Makefile b/Makefile
index 8a0082d..845f811 100644
--- a/Makefile
+++ b/Makefile
@@ -290,7 +290,7 @@ LIB_H = \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
 	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
 	utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
-	mailmap.h remote.h transport.h diffcore.h
+	mailmap.h remote.h transport.h diffcore.h hash.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -300,7 +300,7 @@ DIFF_OBJS = \
 LIB_OBJS = \
 	blob.o commit.o connect.o csum-file.o cache-tree.o base85.o \
 	date.o diff-delta.o entry.o exec_cmd.o ident.o \
-	interpolate.o \
+	interpolate.o hash.o \
 	lockfile.o \
 	patch-ids.o \
 	object.o pack-check.o pack-write.o patch-delta.o path.o pkt-line.o \
diff --git a/diffcore-rename.c b/diffcore-rename.c
index edb2424..e7e370b 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -4,6 +4,7 @@
 #include "cache.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "hash.h"
 
 /* Table of rename/copy destinations */
 
@@ -93,29 +94,6 @@ static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
 	return &(rename_src[first]);
 }
 
-static int is_exact_match(struct diff_filespec *src,
-			  struct diff_filespec *dst,
-			  int contents_too)
-{
-	if (src->sha1_valid && dst->sha1_valid &&
-	    !hashcmp(src->sha1, dst->sha1))
-		return 1;
-	if (!contents_too)
-		return 0;
-	if (diff_populate_filespec(src, 1) || diff_populate_filespec(dst, 1))
-		return 0;
-	if (src->size != dst->size)
-		return 0;
-	if (src->sha1_valid && dst->sha1_valid)
-	    return !hashcmp(src->sha1, dst->sha1);
-	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
-		return 0;
-	if (src->size == dst->size &&
-	    !memcmp(src->data, dst->data, src->size))
-		return 1;
-	return 0;
-}
-
 static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
 {
 	int src_len = strlen(src->path), dst_len = strlen(dst->path);
@@ -242,56 +220,163 @@ static int score_compare(const void *a_, const void *b_)
 	return b->score - a->score;
 }
 
+struct file_similarity {
+	int src_dst, index;
+	struct diff_filespec *filespec;
+	struct file_similarity *next;
+};
+
+static int find_identical_files(struct file_similarity *src,
+				struct file_similarity *dst)
+{
+	int renames = 0;
+
+	/*
+	 * Walk over all the destinations ...
+	 */
+	do {
+		struct diff_filespec *one = dst->filespec;
+		struct file_similarity *p, *best;
+		int i = 100;
+
+		/*
+		 * .. to find the best source match
+		 */
+		best = NULL;
+		for (p = src; p; p = p->next) {
+			struct diff_filespec *two = p->filespec;
+
+			/* False hash collission? */
+			if (hashcmp(one->sha1, two->sha1))
+				continue;
+			/* Non-regular files? If so, the modes must match! */
+			if (!S_ISREG(one->mode) || !S_ISREG(two->mode)) {
+				if (one->mode != two->mode)
+					continue;
+			}
+			best = p;
+			if (basename_same(one, two))
+				break;
+
+			/* Too many identical alternatives? Pick one */
+			if (!--i)
+				break;
+		}
+		if (best) {
+			record_rename_pair(dst->index, best->index, MAX_SCORE);
+			renames++;
+		}
+	} while ((dst = dst->next) != NULL);
+	return renames;
+}
+
+/*
+ * Note: the rest of the rename logic depends on this
+ * phase also populating all the filespecs for any
+ * entry that isn't matched up with an exact rename.
+ */
+static void free_similarity_list(struct file_similarity *p)
+{
+	while (p) {
+		struct file_similarity *entry = p;
+		p = p->next;
+
+		/* Stupid special case, see note above! */
+		diff_populate_filespec(entry->filespec, 0);
+		free(entry);
+	}
+}
+
+static int find_same_files(void *ptr)
+{
+	int ret;
+	struct file_similarity *p = ptr;
+	struct file_similarity *src = NULL, *dst = NULL;
+
+	/* Split the hash list up into sources and destinations */
+	do {
+		struct file_similarity *entry = p;
+		p = p->next;
+		if (entry->src_dst < 0) {
+			entry->next = src;
+			src = entry;
+		} else {
+			entry->next = dst;
+			dst = entry;
+		}
+	} while (p);
+
+	/*
+	 * If we have both sources *and* destinations, see if
+	 * we can match them up
+	 */
+	ret = (src && dst) ? find_identical_files(src, dst) : 0;
+
+	/* Free the hashes and return the number of renames found */
+	free_similarity_list(src);
+	free_similarity_list(dst);
+	return ret;
+}
+
+static unsigned int hash_filespec(struct diff_filespec *filespec)
+{
+	unsigned int hash;
+	if (!filespec->sha1_valid) {
+		if (diff_populate_filespec(filespec, 0))
+			return 0;
+		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
+	}
+	memcpy(&hash, filespec->sha1, sizeof(hash));
+	return hash;
+}
+
+static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)
+{
+	void **pos;
+	unsigned int hash;
+	struct file_similarity *entry = xmalloc(sizeof(*entry));
+
+	entry->src_dst = src_dst;
+	entry->index = index;
+	entry->filespec = filespec;
+	entry->next = NULL;
+
+	hash = hash_filespec(filespec);
+	pos = insert_hash(hash, entry, table);
+
+	/* We already had an entry there? */
+	if (pos) {
+		entry->next = *pos;
+		*pos = entry;
+	}
+}
+
 /*
  * Find exact renames first.
  *
  * The first round matches up the up-to-date entries,
  * and then during the second round we try to match
  * cache-dirty entries as well.
- *
- * Note: the rest of the rename logic depends on this
- * phase also populating all the filespecs for any
- * entry that isn't matched up with an exact rename,
- * see "is_exact_match()".
  */
 static int find_exact_renames(void)
 {
-	int rename_count = 0;
-	int contents_too;
-
-	for (contents_too = 0; contents_too < 2; contents_too++) {
-		int i;
-
-		for (i = 0; i < rename_dst_nr; i++) {
-			struct diff_filespec *two = rename_dst[i].two;
-			int j;
-
-			if (rename_dst[i].pair)
-				continue; /* dealt with an earlier round */
-			for (j = 0; j < rename_src_nr; j++) {
-				int k;
-				struct diff_filespec *one = rename_src[j].one;
-				if (!is_exact_match(one, two, contents_too))
-					continue;
+	int i;
+	struct hash_table file_table;
 
-				/* see if there is a basename match, too */
-				for (k = j; k < rename_src_nr; k++) {
-					one = rename_src[k].one;
-					if (basename_same(one, two) &&
-						is_exact_match(one, two,
-							contents_too)) {
-						j = k;
-						break;
-					}
-				}
-
-				record_rename_pair(i, j, (int)MAX_SCORE);
-				rename_count++;
-				break; /* we are done with this entry */
-			}
-		}
-	}
-	return rename_count;
+	init_hash(&file_table);
+	for (i = 0; i < rename_src_nr; i++)
+		insert_file_table(&file_table, -1, i, rename_src[i].one);
+
+	for (i = 0; i < rename_dst_nr; i++)
+		insert_file_table(&file_table, 1, i, rename_dst[i].two);
+
+	/* Find the renames */
+	i = for_each_hash(&file_table, find_same_files);
+
+	/* .. and free the hash data structure */
+	free_hash(&file_table);
+
+	return i;
 }
 
 void diffcore_rename(struct diff_options *options)
diff --git a/hash.c b/hash.c
new file mode 100644
index 0000000..7b492d4
--- /dev/null
+++ b/hash.c
@@ -0,0 +1,110 @@
+/*
+ * Some generic hashing helpers.
+ */
+#include "cache.h"
+#include "hash.h"
+
+/*
+ * Look up a hash entry in the hash table. Return the pointer to
+ * the existing entry, or the empty slot if none existed. The caller
+ * can then look at the (*ptr) to see whether it existed or not.
+ */
+static struct hash_table_entry *lookup_hash_entry(unsigned int hash, struct hash_table *table)
+{
+	unsigned int size = table->size, nr = hash % size;
+	struct hash_table_entry *array = table->array;
+
+	while (array[nr].ptr) {
+		if (array[nr].hash == hash)
+			break;
+		nr++;
+		if (nr >= size)
+			nr = 0;
+	}
+	return array + nr;
+}
+
+
+/*
+ * Insert a new hash entry pointer into the table.
+ *
+ * If that hash entry already existed, return the pointer to
+ * the existing entry (and the caller can create a list of the
+ * pointers or do anything else). If it didn't exist, return
+ * NULL (and the caller knows the pointer has been inserted).
+ */
+static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)
+{
+	struct hash_table_entry *entry = lookup_hash_entry(hash, table);
+
+	if (!entry->ptr) {
+		entry->ptr = ptr;
+		entry->hash = hash;
+		table->nr++;
+		return NULL;
+	}
+	return &entry->ptr;
+}
+
+static void grow_hash_table(struct hash_table *table)
+{
+	unsigned int i;
+	unsigned int old_size = table->size, new_size;
+	struct hash_table_entry *old_array = table->array, *new_array;
+
+	new_size = alloc_nr(old_size);
+	new_array = xcalloc(sizeof(struct hash_table_entry), new_size);
+	table->size = new_size;
+	table->array = new_array;
+	table->nr = 0;
+	for (i = 0; i < old_size; i++) {
+		unsigned int hash = old_array[i].hash;
+		void *ptr = old_array[i].ptr;
+		if (ptr)
+			insert_hash_entry(hash, ptr, table);
+	}
+	free(old_array);
+}
+
+void *lookup_hash(unsigned int hash, struct hash_table *table)
+{
+	if (!table->array)
+		return NULL;
+	return &lookup_hash_entry(hash, table)->ptr;
+}
+
+void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
+{
+	unsigned int nr = table->nr;
+	if (nr >= table->size/2)
+		grow_hash_table(table);
+	return insert_hash_entry(hash, ptr, table);
+}
+
+int for_each_hash(struct hash_table *table, int (*fn)(void *))
+{
+	int sum = 0;
+	unsigned int i;
+	unsigned int size = table->size;
+	struct hash_table_entry *array = table->array;
+
+	for (i = 0; i < size; i++) {
+		void *ptr = array->ptr;
+		array++;
+		if (ptr) {
+			int val = fn(ptr);
+			if (val < 0)
+				return val;
+			sum += val;
+		}
+	}
+	return sum;
+}
+
+void free_hash(struct hash_table *table)
+{
+	free(table->array);
+	table->array = NULL;
+	table->size = 0;
+	table->nr = 0;
+}
diff --git a/hash.h b/hash.h
new file mode 100644
index 0000000..a8b0fbb
--- /dev/null
+++ b/hash.h
@@ -0,0 +1,43 @@
+#ifndef HASH_H
+#define HASH_H
+
+/*
+ * These are some simple generic hash table helper functions.
+ * Not necessarily suitable for all users, but good for things
+ * where you want to just keep track of a list of things, and
+ * have a good hash to use on them.
+ *
+ * It keeps the hash table at roughly 50-75% free, so the memory
+ * cost of the hash table itself is roughly
+ *
+ *	3 * 2*sizeof(void *) * nr_of_objects
+ *
+ * bytes.
+ *
+ * FIXME: on 64-bit architectures, we waste memory. It would be
+ * good to have just 32-bit pointers, requiring a special allocator
+ * for hashed entries or something.
+ */
+struct hash_table_entry {
+	unsigned int hash;
+	void *ptr;
+};
+
+struct hash_table {
+	unsigned int size, nr;
+	struct hash_table_entry *array;
+};
+
+extern void *lookup_hash(unsigned int hash, struct hash_table *table);
+extern void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table);
+extern int for_each_hash(struct hash_table *table, int (*fn)(void *));
+extern void free_hash(struct hash_table *table);
+
+static inline void init_hash(struct hash_table *table)
+{
+	table->size = 0;
+	table->nr = 0;
+	table->array = NULL;
+}
+
+#endif
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related

* [PATCH 1/6] Add 'diffcore.h' to LIB_H
From: Linus Torvalds @ 2007-10-25 18:16 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List
In-Reply-To: <alpine.LFD.0.999.0710251112120.30120@woody.linux-foundation.org>


From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 1/6] Add 'diffcore.h' to LIB_H

The diffcore.h header file is included by more than just the internal
diff generation files, and needs to be part of the proper dependencies.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This caused me pain.

I spent way too much time trying to debug something that wasn't buggy to 
begin with (which made it really hard to see the bug ;).

We really should have a proper (automatic) dependency generation phase. 
This patch does *not* do that, it just fixes up the crappy stuff we have 
now.

 Makefile |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index b728920..8a0082d 100644
--- a/Makefile
+++ b/Makefile
@@ -290,7 +290,7 @@ LIB_H = \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
 	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
 	utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
-	mailmap.h remote.h transport.h
+	mailmap.h remote.h transport.h diffcore.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -917,7 +917,6 @@ git-http-push$X: revision.o http.o http-push.o $(GITLIBS)
 
 $(LIB_OBJS) $(BUILTIN_OBJS): $(LIB_H)
 $(patsubst git-%$X,%.o,$(PROGRAMS)): $(LIB_H) $(wildcard */*.h)
-$(DIFF_OBJS): diffcore.h
 
 $(LIB_FILE): $(LIB_OBJS)
 	$(QUIET_AR)$(RM) $@ && $(AR) rcs $@ $(LIB_OBJS)
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related

* [PATCH 6/6] Do exact rename detection regardless of rename limits
From: Linus Torvalds @ 2007-10-25 18:24 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List
In-Reply-To: <alpine.LFD.0.999.0710251112120.30120@woody.linux-foundation.org>


From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 6/6] Do exact rename detection regardless of rename limits

Now that the exact rename detection is linear-time (with a very small
constant factor to boot), there is no longer any reason to limit it by
the number of files involved.

In some trivial testing, I created a repository with a directory that
had a hundred thousand files in it (all with different contents), and
then moved that directory to show the effects of renaming 100,000 files.

With the new code, that resulted in

	[torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
	400006

	real    0m2.071s
	user    0m1.520s
	sys     0m0.576s

ie the code can correctly detect the hundred thousand renames in about 2
seconds (the number "400006" comes from four lines for each rename:

	diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
	similarity index 100%
	rename from really-big-dir/file-1-1-1-1-1
	rename to moved-big-dir/file-1-1-1-1-1

and the extra six lines is from a one-liner commit message and all the
commit information and spacing).

Most of those two seconds weren't even really the rename detection, it's
really all the other stuff needed to get there.

With the old code, this wouldn't have been practically possible.  Doing
a pairwise check of the ten billion possible pairs would have been
prohibitively expensive.  In fact, even with the rename limiter in
place, the old code would waste a lot of time just on the diff_filespec
checks, and despite not even trying to find renames, it used to look
like:

	[torvalds@woody big-rename]$ time git show -C | wc -l
	1400006

	real    0m12.337s
	user    0m12.285s
	sys     0m0.192s

ie we used to take 12 seconds for this load and not even do any rename
detection! (The number 1400006 comes from fourteen lines per file moved:
seven lines each for the delete and the create of a one-liner file, and
the same extra six lines of commit information).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This obviously just moves the rename detection call. It made sense as a 
separate patch largely because of the explanation that goes along with it, 
but it conceptually is very different from actually improving the rename 
detection logic itself. So it got a patch of its own.

 diffcore-rename.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index e7e370b..3946932 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -429,6 +429,12 @@ void diffcore_rename(struct diff_options *options)
 		goto cleanup; /* nothing to do */
 
 	/*
+	 * We really want to cull the candidates list early
+	 * with cheap tests in order to avoid doing deltas.
+	 */
+	rename_count = find_exact_renames();
+
+	/*
 	 * This basically does a test for the rename matrix not
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
@@ -444,12 +450,6 @@ void diffcore_rename(struct diff_options *options)
 	if (rename_dst_nr * rename_src_nr > rename_limit * rename_limit)
 		goto cleanup;
 
-	/*
-	 * We really want to cull the candidates list early
-	 * with cheap tests in order to avoid doing deltas.
-	 */
-	rename_count = find_exact_renames();
-
 	/* Have we run out the created file pool?  If so we can avoid
 	 * doing the delta matrix altogether.
 	 */
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related

* Re: best git practices, was Re: Git User's Survey 2007 unfinishedsummary continued
From: Theodore Tso @ 2007-10-25 18:23 UTC (permalink / raw)
  Cc: git
In-Reply-To: <1193335339.4522.398.camel@cacharro.xalalinux.org>

On Thu, Oct 25, 2007 at 01:02:19PM -0500, Federico Mena Quintero wrote:
> On Thu, 2007-10-25 at 11:21 -0400, Theodore Tso wrote:
> 
> > And of course it's inelegant.  You just told us we were dealing with
> > CVS-brain-damaged corporate developers who can't be bothered to learn
> > about the fine points of using things the git way.
> 
> Ignore the corporate developers who use SCMs only because their company
> requires them to.  Git is not the right thing for them; some
> Eclipse-based monstrosity probably is.  It's like the horrendous
> Oracle-based expense-reporting thing we have to use at Novell; I use it
> because they make me, not because I'm particularly excited about
> reporting expenses :)

I think I misunderstand Andreas' problem statement.  What I proposed
is useful for corporate developers who are deeply confused by
branches, especially when a single working directory is constantly
jumping back and forth between several branches.  (Having the current
branch in your bash prompt is a *big* help here, but we can't count on
them having it.)

So setting up a solution where each branch gets its own working
directory is a great solution where you have some number of newbie
developers in a company that get easily confused, while still
providing advanced users the ability to use the full power of git, and
giving both the newbie and advanced users the advantages of
disconnected operations.  And, of course, hopefully some day the
newbie users will grow up to become advanced users.

Right now I suspect a number of projects who have picked hg or bzr do
so because the traditional git model is too confusing to newbie users.
So for those people, creating the model where branch == a separate
directory may make life easier for them.  That's probably the one
thing that bzr does much better than git; it has a number of modes
which act as training wheels for the easily confused user.  For
example, the bzr's "bound branch" requires you to have network access,
since anything that modifies the local repository requires hitting the
remote server as well.  Horrible!  Gives you all of the downsides of
CVS!  But it allows some users to use the SCM is CVS-style mode, while
allowing more advanced users to use it in a more distributed mode.

So I think it *is* useful to help the corporate developers, because
that means there are more git users --- and someday some of us on this
list might have to work at such a company, and better that they use
git than something like perforce or Clearcase, right?  :-)

> However, *do think* of the free software developers who have been using
> CVS forever.  You won't make friends among them if you keep saying, "you
> use CVS?  You are brain-damaged, then."  

Fair enough.  I used the term somewhat toungue-in-cheek, and I
probably should have said "newbie user" instead.

							- Ted

^ permalink raw reply

* [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops
From: Linus Torvalds @ 2007-10-25 18:20 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List
In-Reply-To: <alpine.LFD.0.999.0710251112120.30120@woody.linux-foundation.org>


From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops

The core rename detection had some rather stupid code to check if a
pathname was used by a later modification or rename, which basically
walked the whole pathname space for all renames for each rename, in
order to tell whether it was a pure rename (no remaining users) or
should be considered a copy (other users of the source file remaining).

That's really silly, since we can just keep a count of users around, and
replace all those complex and expensive loops with just testing that
simple counter (but this all depends on the previous commit that shared
the diff_filespec data structure by using a separate reference count).

Note that the reference count is not the same as the rename count: they
behave otherwise rather similarly, but the reference count is tied to
the allocation (and decremented at de-allocation, so that when it turns
zero we can get rid of the memory), while the rename count is tied to
the renames and is decremented when we find a rename (so that when it
turns zero we know that it was a rename, not a copy).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This fixes the generic diffcode infrastructure, but doesn't do the actual 
*detection* any faster. But as you can tell, it's a huge cleanup on its 
own, and basically removes the "incidental" and stupid support costs of 
rename detection.

 diff.c            |   40 +++++++++++++------------------
 diffcore-rename.c |   68 +++++++++++++---------------------------------------
 diffcore.h        |    2 +-
 3 files changed, 35 insertions(+), 75 deletions(-)

diff --git a/diff.c b/diff.c
index 0b320f6..af85b94 100644
--- a/diff.c
+++ b/diff.c
@@ -2597,9 +2597,9 @@ void diff_debug_filepair(const struct diff_filepair *p, int i)
 {
 	diff_debug_filespec(p->one, i, "one");
 	diff_debug_filespec(p->two, i, "two");
-	fprintf(stderr, "score %d, status %c stays %d broken %d\n",
+	fprintf(stderr, "score %d, status %c rename_used %d broken %d\n",
 		p->score, p->status ? p->status : '?',
-		p->source_stays, p->broken_pair);
+		p->one->rename_used, p->broken_pair);
 }
 
 void diff_debug_queue(const char *msg, struct diff_queue_struct *q)
@@ -2617,8 +2617,8 @@ void diff_debug_queue(const char *msg, struct diff_queue_struct *q)
 
 static void diff_resolve_rename_copy(void)
 {
-	int i, j;
-	struct diff_filepair *p, *pp;
+	int i;
+	struct diff_filepair *p;
 	struct diff_queue_struct *q = &diff_queued_diff;
 
 	diff_debug_queue("resolve-rename-copy", q);
@@ -2640,27 +2640,21 @@ static void diff_resolve_rename_copy(void)
 		 * either in-place edit or rename/copy edit.
 		 */
 		else if (DIFF_PAIR_RENAME(p)) {
-			if (p->source_stays) {
-				p->status = DIFF_STATUS_COPIED;
-				continue;
-			}
-			/* See if there is some other filepair that
-			 * copies from the same source as us.  If so
-			 * we are a copy.  Otherwise we are either a
-			 * copy if the path stays, or a rename if it
-			 * does not, but we already handled "stays" case.
+			/*
+			 * A rename might have re-connected a broken
+			 * pair up, causing the pathnames to be the
+			 * same again. If so, that's not a rename at
+			 * all, just a modification..
+			 *
+			 * Otherwise, see if this source was used for
+			 * multiple renames, in which case we decrement
+			 * the count, and call it a copy.
 			 */
-			for (j = i + 1; j < q->nr; j++) {
-				pp = q->queue[j];
-				if (strcmp(pp->one->path, p->one->path))
-					continue; /* not us */
-				if (!DIFF_PAIR_RENAME(pp))
-					continue; /* not a rename/copy */
-				/* pp is a rename/copy from the same source */
+			if (!strcmp(p->one->path, p->two->path))
+				p->status = DIFF_STATUS_MODIFIED;
+			else if (--p->one->rename_used > 0)
 				p->status = DIFF_STATUS_COPIED;
-				break;
-			}
-			if (!p->status)
+			else
 				p->status = DIFF_STATUS_RENAMED;
 		}
 		else if (hashcmp(p->one->sha1, p->two->sha1) ||
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3da06b7..edb2424 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -55,12 +55,10 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
 static struct diff_rename_src {
 	struct diff_filespec *one;
 	unsigned short score; /* to remember the break score */
-	unsigned src_path_left : 1;
 } *rename_src;
 static int rename_src_nr, rename_src_alloc;
 
 static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
-						   int src_path_left,
 						   unsigned short score)
 {
 	int first, last;
@@ -92,7 +90,6 @@ static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
 			(rename_src_nr - first - 1) * sizeof(*rename_src));
 	rename_src[first].one = one;
 	rename_src[first].score = score;
-	rename_src[first].src_path_left = src_path_left;
 	return &(rename_src[first]);
 }
 
@@ -216,6 +213,7 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 		die("internal error: dst already matched.");
 
 	src = rename_src[src_index].one;
+	src->rename_used++;
 	src->count++;
 
 	dst = rename_dst[dst_index].two;
@@ -227,7 +225,6 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 		dp->score = rename_src[src_index].score;
 	else
 		dp->score = score;
-	dp->source_stays = rename_src[src_index].src_path_left;
 	rename_dst[dst_index].pair = dp;
 }
 
@@ -245,21 +242,6 @@ static int score_compare(const void *a_, const void *b_)
 	return b->score - a->score;
 }
 
-static int compute_stays(struct diff_queue_struct *q,
-			 struct diff_filespec *one)
-{
-	int i;
-	for (i = 0; i < q->nr; i++) {
-		struct diff_filepair *p = q->queue[i];
-		if (strcmp(one->path, p->two->path))
-			continue;
-		if (DIFF_PAIR_RENAME(p)) {
-			return 0; /* something else is renamed into this */
-		}
-	}
-	return 1;
-}
-
 /*
  * Find exact renames first.
  *
@@ -338,15 +320,25 @@ void diffcore_rename(struct diff_options *options)
 				locate_rename_dst(p->two, 1);
 		}
 		else if (!DIFF_FILE_VALID(p->two)) {
-			/* If the source is a broken "delete", and
+			/*
+			 * If the source is a broken "delete", and
 			 * they did not really want to get broken,
 			 * that means the source actually stays.
+			 * So we increment the "rename_used" score
+			 * by one, to indicate ourselves as a user
+			 */
+			if (p->broken_pair && !p->score)
+				p->one->rename_used++;
+			register_rename_src(p->one, p->score);
+		}
+		else if (detect_rename == DIFF_DETECT_COPY) {
+			/*
+			 * Increment the "rename_used" score by
+			 * one, to indicate ourselves as a user.
 			 */
-			int stays = (p->broken_pair && !p->score);
-			register_rename_src(p->one, stays, p->score);
+			p->one->rename_used++;
+			register_rename_src(p->one, p->score);
 		}
-		else if (detect_rename == DIFF_DETECT_COPY)
-			register_rename_src(p->one, 1, p->score);
 	}
 	if (rename_dst_nr == 0 || rename_src_nr == 0)
 		goto cleanup; /* nothing to do */
@@ -472,16 +464,7 @@ void diffcore_rename(struct diff_options *options)
 					pair_to_free = p;
 			}
 			else {
-				for (j = 0; j < rename_dst_nr; j++) {
-					if (!rename_dst[j].pair)
-						continue;
-					if (strcmp(rename_dst[j].pair->
-						   one->path,
-						   p->one->path))
-						continue;
-					break;
-				}
-				if (j < rename_dst_nr)
+				if (p->one->rename_used)
 					/* this path remains */
 					pair_to_free = p;
 			}
@@ -507,23 +490,6 @@ void diffcore_rename(struct diff_options *options)
 	*q = outq;
 	diff_debug_queue("done collapsing", q);
 
-	/* We need to see which rename source really stays here;
-	 * earlier we only checked if the path is left in the result,
-	 * but even if a path remains in the result, if that is coming
-	 * from copying something else on top of it, then the original
-	 * source is lost and does not stay.
-	 */
-	for (i = 0; i < q->nr; i++) {
-		struct diff_filepair *p = q->queue[i];
-		if (DIFF_PAIR_RENAME(p) && p->source_stays) {
-			/* If one appears as the target of a rename-copy,
-			 * then mark p->source_stays = 0; otherwise
-			 * leave it as is.
-			 */
-			p->source_stays = compute_stays(q, p->one);
-		}
-	}
-
 	for (i = 0; i < rename_dst_nr; i++)
 		free_filespec(rename_dst[i].two);
 
diff --git a/diffcore.h b/diffcore.h
index 30055ac..cc96c20 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -31,6 +31,7 @@ struct diff_filespec {
 	unsigned long size;
 	int count;               /* Reference count */
 	int xfrm_flags;		 /* for use by the xfrm */
+	int rename_used;         /* Count of rename users */
 	unsigned short mode;	 /* file mode */
 	unsigned sha1_valid : 1; /* if true, use sha1 and trust mode;
 				  * if false, use the name and read from
@@ -58,7 +59,6 @@ struct diff_filepair {
 	struct diff_filespec *two;
 	unsigned short int score;
 	char status; /* M C R N D U (see Documentation/diff-format.txt) */
-	unsigned source_stays : 1; /* all of R/C are copies */
 	unsigned broken_pair : 1;
 	unsigned renamed_pair : 1;
 	unsigned is_unmerged : 1;
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related

* [PATCH 0/6] Cleaned-up rename detection patch-series
From: Linus Torvalds @ 2007-10-25 18:15 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


Ok, here's the patch-series to do rename detection of exact renames in 
linear time, except it's cleaned up into a nice series of 6 patches. The 
end result is identical to the previous patches (which got smushed down 
into just two patches in Junio's tree), apart from a fixed dependency in 
the Makefile that caused me grief and a broken test-suite due to some 
object files simply not being correctly recompiled.

The end result should be more understandable this way.

		Linus

^ permalink raw reply

* [PATCH 3/6] Ref-count the filespecs used by diffcore
From: Linus Torvalds @ 2007-10-25 18:19 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List
In-Reply-To: <alpine.LFD.0.999.0710251112120.30120@woody.linux-foundation.org>



From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 3/6] Ref-count the filespecs used by diffcore

Rather than copy the refcounts when introducing new versions of them
(for rename or copy detection), use a refcount and increment the count
when reusing the diff_filespec.

This avoids unnecessary allocations, but the real reason behind this is
a future enhancement: we will want to track shared data across the
copy/rename detection.  In order to efficiently notice when a filespec
is used by a rename, the rename machinery wants to keep track of a
rename usage count which is shared across all different users of the
filespec.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This was sent out in the original series as a fix on top of broken code. 
Now it sets up the infrastructure so that the next patch won't be broken.

 diff.c            |   15 +++++++++++----
 diffcore-rename.c |   16 ++++++----------
 diffcore.h        |    2 ++
 3 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/diff.c b/diff.c
index dfb8595..0b320f6 100644
--- a/diff.c
+++ b/diff.c
@@ -1440,9 +1440,18 @@ struct diff_filespec *alloc_filespec(const char *path)
 	memset(spec, 0, sizeof(*spec));
 	spec->path = (char *)(spec + 1);
 	memcpy(spec->path, path, namelen+1);
+	spec->count = 1;
 	return spec;
 }
 
+void free_filespec(struct diff_filespec *spec)
+{
+	if (!--spec->count) {
+		diff_free_filespec_data(spec);
+		free(spec);
+	}
+}
+
 void fill_filespec(struct diff_filespec *spec, const unsigned char *sha1,
 		   unsigned short mode)
 {
@@ -2435,10 +2444,8 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *queue,
 
 void diff_free_filepair(struct diff_filepair *p)
 {
-	diff_free_filespec_data(p->one);
-	diff_free_filespec_data(p->two);
-	free(p->one);
-	free(p->two);
+	free_filespec(p->one);
+	free_filespec(p->two);
 	free(p);
 }
 
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 2077a9b..3da06b7 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -209,21 +209,19 @@ static int estimate_similarity(struct diff_filespec *src,
 
 static void record_rename_pair(int dst_index, int src_index, int score)
 {
-	struct diff_filespec *one, *two, *src, *dst;
+	struct diff_filespec *src, *dst;
 	struct diff_filepair *dp;
 
 	if (rename_dst[dst_index].pair)
 		die("internal error: dst already matched.");
 
 	src = rename_src[src_index].one;
-	one = alloc_filespec(src->path);
-	fill_filespec(one, src->sha1, src->mode);
+	src->count++;
 
 	dst = rename_dst[dst_index].two;
-	two = alloc_filespec(dst->path);
-	fill_filespec(two, dst->sha1, dst->mode);
+	dst->count++;
 
-	dp = diff_queue(NULL, one, two);
+	dp = diff_queue(NULL, src, dst);
 	dp->renamed_pair = 1;
 	if (!strcmp(src->path, dst->path))
 		dp->score = rename_src[src_index].score;
@@ -526,10 +524,8 @@ void diffcore_rename(struct diff_options *options)
 		}
 	}
 
-	for (i = 0; i < rename_dst_nr; i++) {
-		diff_free_filespec_data(rename_dst[i].two);
-		free(rename_dst[i].two);
-	}
+	for (i = 0; i < rename_dst_nr; i++)
+		free_filespec(rename_dst[i].two);
 
 	free(rename_dst);
 	rename_dst = NULL;
diff --git a/diffcore.h b/diffcore.h
index eb618b1..30055ac 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -29,6 +29,7 @@ struct diff_filespec {
 	void *cnt_data;
 	const char *funcname_pattern_ident;
 	unsigned long size;
+	int count;               /* Reference count */
 	int xfrm_flags;		 /* for use by the xfrm */
 	unsigned short mode;	 /* file mode */
 	unsigned sha1_valid : 1; /* if true, use sha1 and trust mode;
@@ -43,6 +44,7 @@ struct diff_filespec {
 };
 
 extern struct diff_filespec *alloc_filespec(const char *);
+extern void free_filespec(struct diff_filespec *);
 extern void fill_filespec(struct diff_filespec *, const unsigned char *,
 			  unsigned short);
 
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related

* Re: best git practices, was Re: Git User's Survey 2007 unfinishedsummary continued
From: J. Bruce Fields @ 2007-10-25 18:18 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Federico Mena Quintero, Theodore Tso, git
In-Reply-To: <20071025180451.GA6349@glandium.org>

On Thu, Oct 25, 2007 at 08:04:51PM +0200, Mike Hommey wrote:
> On Thu, Oct 25, 2007 at 01:02:19PM -0500, Federico Mena Quintero wrote:
> > On Thu, 2007-10-25 at 11:21 -0400, Theodore Tso wrote:
> > 
> > > And of course it's inelegant.  You just told us we were dealing with
> > > CVS-brain-damaged corporate developers who can't be bothered to learn
> > > about the fine points of using things the git way.
> > 
> > Ignore the corporate developers who use SCMs only because their company
> > requires them to.  Git is not the right thing for them; some
> > Eclipse-based monstrosity probably is.  It's like the horrendous
> > Oracle-based expense-reporting thing we have to use at Novell; I use it
> > because they make me, not because I'm particularly excited about
> > reporting expenses :)
> > 
> > However, *do think* of the free software developers who have been using
> > CVS forever.  You won't make friends among them if you keep saying, "you
> > use CVS?  You are brain-damaged, then."  CVS has been as good/bad to
> > them as to anyone else, and they are probably delighted to get a better
> > solution.  That solution needs to take into account the concepts to
> > which they have been exposed for the past N years.  Just because your
> > new concepts are better, doesn't mean that their old ones were wrong in
> > their time.
> 
> It's probably just a matter of writing a "git for CVS users" document.

First google hit for "git for CVS users":

	http://www.kernel.org/pub/software/scm/git/docs/cvs-migration.html

patches welcomed....

--b.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox