impure renames / history tracking

git.vger.kernel.org archive mirror

* impure renames / history tracking
@ 2006-03-01 14:01 Paul Jakma
  2006-03-01 15:38 ` Andreas Ericsson
  0 siblings, 1 reply; 16+ messages in thread
From: Paul Jakma @ 2006-03-01 14:01 UTC (permalink / raw)
  To: git list

Hi,

I'm trying to understand git better (so I can explain it better to 
others, with an eye to them considering switching to git), one 
question I have is about renames.

- git obviously detects pure renames perfectly well

- git doesn't however record renames, so 'impure' renames may not be
   detected

My question is:

- why not record rename information explicitly in the commit object?

I.e. so as to be able to follow history information through 'impure' 
renames without having to resort to heuristics.

E.g. imagine a project where development typically occurs through:

o: commit
m: merge

    o---o-m--o-o-o--o----m <- project
   /     /              /
o-o-o-o-o--o-o-o--o-o-o <- main branch

The project merges back to main in one 'big' combined merge 
(collapsing all of the commits on 'project' into one commit). This 
leads to 'impure renames' being not uncommon. The desired end-result 
of merging back to 'main' being to rebase 'project' as one commit 
against 'main', and merge that single commit back, a la:

    o---o-m--o-o-o--o----m <- project
   /     /              /
o-o-o-o-o--o-o-o--o-o-o---m <- main branch
                        \ /
                         o <- project_collapsed

So that 'm' on 'main' is that one commit[1].

The merits or demerits of such merging practice aside, what reason 
would there be /against/ recording explicit rename information in the 
commit object, so as to help browsers follow history (particularly 
impure renames) better in a commit?

I.e. would there be resistance to adding meta-info rename headers to 
commit objects, and having diffcore and other tools use those 
headers to /augment/ their existing heuristics in detecting renames?

Thanks!

1. Git currently doesn't have 'porcelain' to do this, presumably 
there'd be no objection to one?

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
It is the quality rather than the quantity that matters.
- Lucius Annaeus Seneca (4 B.C. - A.D. 65)


* Re: impure renames / history tracking
  2006-03-01 14:01 Paul Jakma
@ 2006-03-01 15:38 ` Andreas Ericsson
  2006-03-01 16:27   ` Paul Jakma
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Ericsson @ 2006-03-01 15:38 UTC (permalink / raw)
  To: Paul Jakma; +Cc: git list

Paul Jakma wrote:
> 
> - git obviously detects pure renames perfectly well
> 
> - git doesn't however record renames, so 'impure' renames may not be
>   detected
> 
> My question is:
> 
> - why not record rename information explicitely in the commit object?
> 

Mainly for two reasons, iirc:
1. Extensive metadata is evil.
2. Backwards compatibility. Old repos should always work with new tools. 
Old tools should work with new repos, at least until a new major-release 
is released.

> I.e. so as to be able to follow history information through 'impure' 
> renames without having to resort to heuristics.
> 
> E.g. imagine a project where development typically occurs through:
> 
> o: commit
> m: merge
> 
>    o---o-m--o-o-o--o----m <- project
>   /     /              /
> o-o-o-o-o--o-o-o--o-o-o <- main branch
> 
> The project merge back to main in one 'big' combined merge (collapsing 
> all of the commits on 'project' into one commit). This leads to 'impure 
> renames' being not uncommon. The desired end-result of merging back to 
> 'main' being to rebase 'project' as one commit against 'main', and merge 
> that single commit back, a la:
> 
>    o---o-m--o-o-o--o----m <- project
>   /     /              /
> o-o-o-o-o--o-o-o--o-o-o---m <- main branch
>                        \ /
>                         o <- project_collapsed
> 
> So that 'm' on 'main' is that one commit[1].
> 

I think you're misunderstanding the git meaning of rebase here. "git 
rebase" moves all commits since "project" forked from "main branch" to 
the tip of "main branch".

Other than that, this is the recommended workflow, and exactly how Linux 
and git both are managed (i.e. topic branches eventually merged into 
'master').

In your drawings, 'main branch' would be 'master' and 'project' would be 
any amount of topic-branches (or just one, if you like that better).

I'm not sure what you mean by 'project_collapsed' though. If I 
understand you correctly, each branch-head represents one 'collapse'. I 
suggest you clone the git repo and do

	$ gitk master
	$ gitk next
	$ gitk pu

gitk is great for visualizing what you've done and what the repo looks 
like. Use and abuse it every time you're unsure what you just did. 
It's the best way to quickly learn what happens, really.

If you just want to distribute snapshots I suggest you take a look at 
git-tar-tree. Junio makes nice use of it in the git Makefile (the dist: 
target).

> The merits or demerits of such merging practice aside, what reason would 
> there be /against/ recording explicit rename information in the commit 
> object, so as to help browsers follow history (particularly impure 
> renames) better in a commit?
> 
> I.e. would there be resistance to adding meta-info rename headers commit 
> objects, and having diffcore and other tools to use those headers to 
> /augment/ their existing heuristics in detecting renames?
> 

Personally I think metadata is evil. Renames will still be auto-detected 
anyway, and with the distributed repo setup the only reason git 
shouldn't be able to detect a rename is if you rename a file and hack it 
up so it doesn't even come close to matching its origin (close in this 
case is 80% by default, I think). In those cases it isn't so much a 
rename as a rewrite. If you find the commit where the file was renamed 
it should be listed in that commit, like so:

	similarity index 92%
	rename from Documentation/git-log-script.txt
	rename to Documentation/git-log.txt

(this is gitk output from the git repo. Search for "Big tool rename")

IMO this is far better than having to tell git "I renamed this file to 
that", since it also detects code-copying with modifications, and it's 
usually quick enough to find those renames as well.
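
The detection described above can be tried in a scratch repository. What follows is a minimal sketch, not anything from the git sources: the file names and the 20-line content are made up for illustration, and it assumes a reasonably recent git on the PATH. Note that current git documentation gives the default threshold for -M as 50%, so the 80% figure above may not match your version.

```shell
#!/bin/sh
# Scratch-repo sketch: rename a file with a small edit and let git detect it.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q .

# 20 distinct lines, so that one changed line keeps the similarity high.
i=1
while [ "$i" -le 20 ]; do
    echo "line $i of the original text"
    i=$((i + 1))
done > old-name.txt
git add old-name.txt
git commit -q -m "add old-name.txt"

# An 'impure' rename: move the file and change one line in the same commit.
git mv old-name.txt new-name.txt
sed 's/^line 7 .*/line 7 was rewritten/' new-name.txt > new-name.tmp
mv new-name.tmp new-name.txt
git add new-name.txt
git commit -q -m "rename and edit"

# -M turns on rename detection; the status column starts with R on a hit,
# followed by the similarity score and the old and new paths.
rename_status=$(git diff -M --name-status HEAD~1 HEAD)
echo "$rename_status"
```

Nothing about the rename was recorded at commit time; the R line is computed from the two trees when the diff is requested.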

> Thanks!
> 
> 1. Git currently doesn't have 'porcelain' to do this, presumably there'd 
> be no objection to one?
> 

	$ git checkout master
	$ git pull . project

The dot means "pull from the local repo". "project" is the branch you 
want to merge into master. You can pull an arbitrary amount of branches 
in one go ("octopus" merge). The current tested limit is 12 (thanks, Len 
;) ).

If, for some reason, you want to combine lots of commits into a single 
mega-patch (like Linus does for each release of the kernel), you can do:

	$ git diff $(git merge-base main project) project > patch-file

Then you can apply patch-file to whatever branch you want and make the 
commit as if it was a single change-set. I'd recommend against it unless 
you're just toying around though. It's a bad idea to lie in a project's 
history.
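
For reference, later git versions ship porcelain for this kind of collapse: `git merge --squash`. The sketch below assumes such a version is installed; the branch names follow the thread, while the file name and commit messages are invented for the example.

```shell
#!/bin/sh
# Scratch-repo sketch: collapse a multi-commit branch into one commit.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # fix the initial branch name

echo "base" > file.txt
git add file.txt
git commit -q -m "base"

# Several small commits on the project branch.
git checkout -q -b project
for n in 1 2 3; do
    echo "project change $n" >> file.txt
    git commit -q -a -m "project: step $n"
done

# --squash stages the combined change on master without committing and
# without recording 'project' as a parent of the resulting commit.
git checkout -q master
git merge -q --squash project
git commit -q -m "project, collapsed into one commit"

master_commits=$(git rev-list --count master)
tree_diff=$(git diff master project)
echo "commits on master: $master_commits"
```

master ends up with two commits (the base and the collapsed one) and the same tree as 'project', but the three intermediate commits are not reachable from it, which is exactly the loss of history warned about above.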

Hope that helps.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: impure renames / history tracking
  2006-03-01 15:38 ` Andreas Ericsson
@ 2006-03-01 16:27   ` Paul Jakma
  2006-03-01 17:13     ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Paul Jakma @ 2006-03-01 16:27 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: git list

On Wed, 1 Mar 2006, Andreas Ericsson wrote:

> Mainly for two reasons, iirc:

> 1. Extensive metadata is evil.

Only if /required/. I wouldn't argue for rename meta-data to be 
'core', only as an additional hint into the rename-detection process.

FWIW, I think git's rename handling is really nice. It's just I 
suspect, being a heuristic, it won't be able to follow history 
reliably across 'very impure' renames.

> 2. Backwards compatibility. Old repos should always work with new 
> tools. Old tools should work with new repos, at least until a new 
> major-release is released.

Absolutely.

>> o: commit
>> m: merge
>>
>>    o---o-m--o-o-o--o----m <- project
>>   /     /              /
>> o-o-o-o-o--o-o-o--o-o-o <- main branch
>> 
>> The project merge back to main in one 'big' combined merge (collapsing all 
>> of the commits on 'project' into one commit). This leads to 'impure 
>> renames' being not uncommon. The desired end-result of merging back to 
>> 'main' being to rebase 'project' as one commit against 'main', and merge 
>> that single commit back, a la:
>>
>>    o---o-m--o-o-o--o----m <- project
>>   /     /              /
>> o-o-o-o-o--o-o-o--o-o-o---m <- main branch
>>                        \ /
>>                         o <- project_collapsed
>> 
>> So that 'm' on 'main' is that one commit[1].

> I think you're misunderstanding the git meaning of rebase here. 
> "git rebase" moves all commits since "project" forked from "main 
> branch" to the tip of "main branch".

Right, I'm referring to 'rebase' generally, as a concept, not to 
git-rebase specifically. E.g. git diff main..project is another way 
of rebasing I think.

> Other than that, this is the recommended workflow, and exactly how Linux and 
> git both are managed (i.e. topic branches eventually merged into 'master').

They're not rebased though, generally. They're pulled. Ie, in Linux 
and git when 'project' is merged, things look like:

     o---o-m--o-o-o--o----m   <- project
    /     /              / \
o-o-o-o-o--o-o-o--o-o-o----m <- main branch

The rest of the world sees /all/ the individual commits of 'project' 
right? The traditional process for the case I'm thinking of results 
in the 'main' tree seeing only /one/ single commit for the project.

> I'm not sure what you mean by 'project_collapsed' though.

All the commits on the project branch are 'collapsed' into one single 
commit/delta, and then that /single/ commit is merged to 'main'. Rest 
of the world sees:

o-o-o-o-o--o-o-o--o-o-o---m <- main branch
                        \ /
                         o <- project

> correctly, each branch-head represents one 'collapse'.

Not quite. It represents a branch with one or more commits. In the 
Linux and git work flow, multiple commits are left as is.

> gitk is great for visualizing what you've done and what the repo 
> looks like. Use and abuse it frequently every time you're unsure 
> what was you just did. It's the best way to quickly learn what 
> happens, really.

I do. It rocks! :)

> If you just want to distribute snapshots I suggest you do take a 
> look at git-tar-tree. Junio makes nice use of it in the git 
> Makefile (the dist: target).

Neat.

Though, I probably should stay away from the git Makefile for now. 
<cough>.

> Personally I think metadata is evil.

Not sure I agree. Silly/redundant meta-data can be evil alright. But 
I'm talking about meta-data which is not there and potentially not 
reconstructable.

> Renames will still be auto-detected anyway,

Chances are so, yes. Definitely with the git and Linux workflows.

The traditional workflow for the software project I'm thinking of is 
different though. One commit may encompass multiple renames and edits 
of a file (discouraged, but it's possible).

If my understanding is correct, following back history for such cases 
would be difficult.

There is an argument that that 'traditional' process should be 
changed. However, leaving aside that argument, I'd like to know if 
git could accommodate that process.

> be able to detect a rename is if you rename a file and hack it up 
> so it doesn't even come close to matching its origin (close in this 
> case is 80% by default, I think). In those cases it isn't so much a 
> rename as a rewrite.

Exactly - this is the case I'm concerned about. Imagine that you'd 
like to be able to follow the history back through the rewrite and 
through to the original file.

> IMO this is far better than having to tell git "I renamed this file 
> to that", since it also detects code-copying with modifications, 
> and it's usually quick enough to find those renames as well.

I think so too, but that involves arguing that very very 
long-standing workflows should be changed to accommodate git. I intend 
to make that argument to the 'project' concerned, however I would 
also like to be able to say git could equally well deal with the 
'traditional' workflow, modulo having to explicitly use (say) 
git-mv.

>> 1. Git currently doesn't have 'porcelain' to do this, presumably there'd be 
>> no objection to one?
>> 
>
> 	$ git checkout master
> 	$ git pull . project

Right, but 'pull' isn't what I mean :).

I mean:

 	$ git checkout project
 	$ git pull . master
 	$ git checkout -b tmp project
 	$ git diff project..master | <git apply I think>

> If, for some reason, you want to combine lots of commits into a single 
> mega-patch (like Linus does for each release of the kernel), you can do:
>
> 	$ git diff $(git merge-base main project) project > patch-file

Right.

> Then you can apply patch-file to whatever branch you want and make 
> the commit as if it was a single change-set. I'd recommend against 
> it unless you're just toying around though. It's a bad idea to lie 
> in a projects history.

Presume that 'project' in the workflow is defined as

 	"achieve one goal with one commit to the master"

So by definition, it is always correct that the project only ever has 
one commit.

The trouble is that /sometimes/ projects do indeed 'rename and 
rewrite' a file. At present, chances are git might not notice this, 
and ability to follow history through the rename+rewrite would be 
lost.

I'm wondering whether:

- this could be solved?
- how? (some additional advisory-only meta-data in the
   index-cache and commit?)

If there is consensus on an acceptable way, I'm willing to implement 
it. (I was thinking of just adding 'rename' headers to the commit 
objects, then teaching diffcore to consider them in addition to 
current heuristics).

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Be nice to people on the way up, because you'll meet them on your way down.
 		-- Wilson Mizner


* Re: impure renames / history tracking
  2006-03-01 16:27   ` Paul Jakma
@ 2006-03-01 17:13     ` Linus Torvalds
  2006-03-01 18:50       ` Paul Jakma
  2006-03-01 17:43     ` Andreas Ericsson
  2006-03-01 18:05     ` Martin Langhoff
  2 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2006-03-01 17:13 UTC (permalink / raw)
  To: Paul Jakma; +Cc: Andreas Ericsson, git list

On Wed, 1 Mar 2006, Paul Jakma wrote:
> 
> FWIW, I think git's rename handling is really nice. It's just I suspect, being
> a heuristic, it won't be able to follow history reliably across 'very impure'
> renames.

The thing is, it does better than anything that _tries_ to be "reliable".

I can pretty much _guarantee_ that you can't do it better.

Tracking "inodes" - aka file identities - (which is what BK does, and I 
assume what SVN does) is fundamentally problematic. In particular, it's a 
horrible problem when two inodes "meet" under the same name. You now have 
two identities for the same file, and you're fundamentally screwed.

And don't tell me it doesn't happen. It _does_ happen, and it did happen 
with the kernel under BK.

It doesn't even need renames to be a problem. JUST THE FACT THAT YOU TRY 
TO TRACK FILE "IDENTITY" HISTORY IS BROKEN. For example, take CVS, which 
doesn't actually try to do renames, but _does_ try to track the identity 
of a file, since all the history is tied into that identity: think about 
what happens in Attic when a file is deleted. Completely broken model.

Now, CVS doesn't tend to show the problems very much, because people don't 
actually use branches that much (they are a pain in the neck), and they 
sure as hell try to avoid deleting and creating the same filename under a 
branch and on HEAD. I'm sure you can do it, but I'm also pretty sure 
there's a lot of old projects around that have ended up moving the ,v 
files around to play rename/delete games.

And that's really fundamental. CVS doesn't show the problems so much, 
because CVS actively tries to make it hard to do these things.

With renames-tracking-file-identities, it's _really_ easy to get some 
major confusion going. What happens when one branch creates a file, and 
another one renames a file to that same name, and they merge?

Don't tell me it doesn't happen. It happened under BK. The way BK "solved" 
it was to keep the two separate identities: one of them got resolved to 
the new filename, the other one went into the "deleted" directory. Guess 
what happens when the side that got merged into "deleted" continues to 
edit the file? That's right - their edits happen on the deleted file, and 
never show up in the real tree in a subsequent merge ever again.

And as far as I can tell, BK really did the best you can do. Following 
file identities really _is_ fundamentally broken. It sounds like a nice 
idea, but while you might solve a few problems, you create a whole raft of 
much more fundamental problems.

So next time you think about a merge that might have been improved by 
tracking renames, please also think about a merge where one of the 
filenames came from two or more different sources through an earlier 
merge, and thank your benevolent Gods that they instructed me to make git 
be based purely on file contents.

		Linus


* Re: impure renames / history tracking
  2006-03-01 16:27   ` Paul Jakma
  2006-03-01 17:13     ` Linus Torvalds
@ 2006-03-01 17:43     ` Andreas Ericsson
  2006-03-02 21:10       ` Paul Jakma
  2006-03-01 18:05     ` Martin Langhoff
  2 siblings, 1 reply; 16+ messages in thread
From: Andreas Ericsson @ 2006-03-01 17:43 UTC (permalink / raw)
  To: Paul Jakma; +Cc: git list

Paul Jakma wrote:
> On Wed, 1 Mar 2006, Andreas Ericsson wrote:
> 
>>> o: commit
>>> m: merge
>>>
>>>    o---o-m--o-o-o--o----m <- project
>>>   /     /              /
>>> o-o-o-o-o--o-o-o--o-o-o <- main branch
>>>
>>> The project merge back to main in one 'big' combined merge 
>>> (collapsing all of the commits on 'project' into one commit). This 
>>> leads to 'impure renames' being not uncommon. The desired end-result 
>>> of merging back to 'main' being to rebase 'project' as one commit 
>>> against 'main', and merge that single commit back, a la:
>>>
>>>    o---o-m--o-o-o--o----m <- project
>>>   /     /              /
>>> o-o-o-o-o--o-o-o--o-o-o---m <- main branch
>>>                        \ /
>>>                         o <- project_collapsed
>>>
>>> So that 'm' on 'main' is that one commit[1].
> 
> 
>> I think you're misunderstanding the git meaning of rebase here. "git 
>> rebase" moves all commits since "project" forked from "main branch" to 
>> the tip of "main branch".
> 
> 
> Right, I'm referring to 'rebase' generally, as a concept, not to 
> git-rebase specifically. E.g. git diff main..project is another way of 
> rebasing I think.
> 

Yes, but imo a poor one, as you're losing all the history. git *can* do 
what you want, but it was designed to maintain a long history so that 
everyone can see it and improve on the code with many chains of small 
and simultaneous changes.

>> Other than that, this is the recommended workflow, and exactly how 
>> Linux and git both are managed (i.e. topic branches eventually merged 
>> into 'master').
> 
> 
> They're not rebased though, generally. They're pulled. Ie, in Linux and 
> git when 'project' is merged, things look like:
> 
>     o---o-m--o-o-o--o----m   <- project
>    /     /              / \
> o-o-o-o-o--o-o-o--o-o-o----m <- main branch
> 
> The rest of the world sees /all/ the individual commits of 'project' 
> right? The traditional process for the case I'm thinking of results in 
> the 'main' tree seeing only /one/ single commit for the project.
> 

Perhaps we have a nomenclature clash here. When you say "one single 
commit", I can't help but think "snapshot". It's completely 
impossible to fold *ALL* the history into a single commit, and since you 
want heuristics I would imagine you wouldn't want that either.

>> I'm not sure what you mean by 'project_collapsed' though.
> 
> 
> All the commits on the project branch are 'collapsed' into one single 
> commit/delta, and then that /single/ commit is merged to 'main'. Rest of 
> the world sees:
> 
> o-o-o-o-o--o-o-o--o-o-o---m <- main branch
>                        \ /
>                         o <- project
> 

The only sane way to represent this is by doing a mega-patch and 
applying it with a new commit message. That way renamed files will show 
up as

	renamed from /path/to/foo
	renamed to /path/to/some/where/else

Since you're removing all the history in between one mega-patch and the 
next (as if Linus would have v2.6.12 one day and in the next commit it 
would be v2.6.13... strange thought), the history for that tree can't 
very well know about renames that don't exist in its history. Again, if 
you want to keep "master" (can we please call it that? I can't keep up 
with what you call "project" and "main branch") to a single commit you'll 
have no history in it. In essence, that's a snapshot (or a release, 
which is just a snapshot with a tag).

>> Personally I think metadata is evil.
> 
> 
> Not sure I agree. Silly/redundant meta-data can be evil alright. But I'm 
> talking about meta-data which is not there and potentially not 
> reconstructable.
> 
>> Renames will still be auto-detected anyway,
> 
> 
> Chances are so, yes. Definitely with the git and Linux workflows.
> 
> The traditional workflow for the software project I'm thinking of is 
> different though. One commit may encompass multiple renames and edits of 
> a file (discouraged, but it's possible).
> 
> If my understanding is correct, following back history for such cases 
> would be difficult.
> 

It would be impossible. At best you can get "before mega-patch 64, the 
tree looked like this", "after mega-patch 64, it looked like this, and 
here are the files with 80% of above similarity index".

> There is an argument that that 'traditional' process should be changed. 
> However, leaving aside that argument, I'd like to know if git could 
> accomodate that process.
> 
>> be able to detect a rename is if you rename a file and hack it up so 
>> it doesn't even come close to matching its origin (close in this case 
>> is 80% by default, I think). In those cases it isn't so much a rename 
>> as a rewrite.
> 
> 
> Exactly - this is the case I'm concerned about. Imagine that you'd like 
> to be follow the history back through the rewrite and through to the 
> original file.
> 

I'm confused. First you say you want to have one single mega-patch for 
each commit, then you say you want to be able to follow history back. 
It's like deciding to throw away your wallet and then trying to get 
someone to pick it up and carry it around for you.

>> IMO this is far better than having to tell git "I renamed this file to 
>> that", since it also detects code-copying with modifications, and it's 
>> usually quick enough to find those renames as well.
> 
> 
> I think so too, but that involves arguing that very very long-standing 
> workflows should be changed to accomodate git. I intend to make that 
> argument to the 'project' concerned, however I would also like to be say 
> git could equally well deal with the 'traditional' workflow, modulo 
> having to explicitely use (say) git-mv.
> 

The simple fact is that once you start juggling 12MB patches instead of 
keeping the commits, your history is out the window anyway. Adding 
meta-data to accommodate for the lack of history when you throw it away 
is, to be honest, an approach that leaves "insane" in the dust.

As for convincing others, shove git-bisect under their noses and ask 
them if they'd like a tool to find their bugs for them.
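
As a rough illustration of that pitch, here is a sketch of `git bisect run` locating the commit that introduced a marker line. Everything in it is hypothetical: the ten-commit history, the file name code.txt, the tags 'first' and 'culprit', and the grep standing in for a real test suite. It only works this well because the history is kept as small commits.

```shell
#!/bin/sh
# Scratch-repo sketch: let git bisect find the commit that introduced a bug.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Ten small commits; the marker line BUG appears in commit 6.
n=1
while [ "$n" -le 10 ]; do
    if [ "$n" -eq 6 ]; then
        echo "BUG" >> code.txt
    else
        echo "change $n" >> code.txt
    fi
    git add code.txt
    git commit -q -m "commit $n"
    if [ "$n" -eq 1 ]; then git tag first; fi
    if [ "$n" -eq 6 ]; then git tag culprit; fi
    n=$((n + 1))
done

# HEAD is known bad, 'first' is known good. The test command must exit 0
# for a good revision and non-zero for a bad one.
git bisect start HEAD first > /dev/null
git bisect run sh -c '! grep -q BUG code.txt' > /dev/null

found=$(git rev-parse refs/bisect/bad)
expected=$(git rev-parse culprit)
git bisect reset > /dev/null 2>&1

echo "first bad commit: $(git log -1 --format=%s "$found")"
```

With one mega-patch per project the same search could only say which mega-patch was at fault, not which small change inside it.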

>>
>>     $ git checkout master
>>     $ git pull . project
> 
> 
> Right, but 'pull' isn't what I mean :).
> 
> I mean:
> 
>     $ git checkout project
>     $ git pull . master
>     $ git checkout -b tmp project
>     $ git diff project..master | <git apply I think>
>

This way, 'project' and 'tmp' both would hold all patches since you 
merge 'master' into 'project' before creating the 'tmp' branch at the 
head of 'project'. As such, 'project' is ahead of 'master' (it has its 
own changes, those in master and the merge between 'project' and 
'master'), so the diff will be empty.
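
The state after that merge is easy to inspect in a scratch repository. The sketch below uses invented file names and messages and assumes a current git. One thing worth checking there: `git log project..master` lists the commits on master that project lacks, which is nothing after the merge, whereas `git diff project..master` compares the two tip trees, so in this sketch it prints project's own change in reverse rather than an empty diff.

```shell
#!/bin/sh
# Scratch-repo sketch: what 'project..master' means to log and to diff
# once master has been merged into project.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # fix the initial branch name

echo "base" > file.txt
git add file.txt
git commit -q -m "base"

git checkout -q -b project
echo "project work" >> file.txt
git commit -q -a -m "project: work"

git checkout -q master
echo "master work" > other.txt
git add other.txt
git commit -q -m "master: work"

# First two steps of the quoted sequence: merge master into project.
git checkout -q project
git merge -q -m "merge master into project" master

# Commits reachable from master but not from project.
log_out=$(git log --oneline project..master)
# Tree comparison between the tip of project and the tip of master.
diff_out=$(git diff project..master)

echo "log: ${log_out:-<empty>}"
echo "$diff_out"
```

Piping that diff into git apply on a branch created from 'project' would therefore undo the project's work instead of collapsing it, so the direction of the range matters.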

If 'master' is where you commit regularly (i.e. not mega-patches), you 
can do these two steps to create the mega-patch branch

	$ git checkout -b mega; # create the mega-patch branch
	$ # rewind the mega-patch branch to the dawn of time
	$ git reset --hard $(git rev-list HEAD | tail -n 1)

And for each mega-patch, do this:

	$ # create and apply mega-patch 1
	$ git diff project..master | git apply
	$ # commit the changes we just applied
	$ git commit -s -a -m "mega-patch 1"
	$ git checkout project; # back to project branch
	$ # Merge with 'master', or the next mega-patch won't apply
	$ git pull . master

>> Then you can apply patch-file to whatever branch you want and make the 
>> commit as if it was a single change-set. I'd recommend against it 
>> unless you're just toying around though. It's a bad idea to lie in a 
>> projects history.
> 
> 
> Presume that 'project' in the workflow is defined as
> 
>     "achieve one goal with one commit to the master"
> 
> So by definition, it always correct that the project only ever has one 
> commit.
> 

But that can't be true either, unless you intend to stop working on the 
project. At "best", you would get a chain of commits in 
'master' where each commit holds several tons of changes.

The topic-branch approach to this would be to
a) Implement all changes required for a certain feature in one go and 
commit all of them. Do "git pull . topic-branch" when on the master branch. 
This will result in a "fast-forward" (i.e. top of 'master' is the 
merge-base between 'master' and 'topic-branch'), so no merge will happen.

b) Implement all changes required for a certain feature in small steps 
and then apply the diff between 'master..topic-branch' to master. The 
topic-branch has to be thrown away, since it can't ever be merged back 
into master, and master can't be merged into the topic-branch (that's 
ok, topic-branches are made to throw away).

For small changes, or one change and some stupid bugfixes, I'd say b) is 
a viable option. The kind of changes you talk about, with several 
renames of files and sometimes near-complete rewrite of them, would 
certainly warrant a merge (or a fast-forward).

> The trouble is that /sometimes/ projects do indeed 'rename and rewrite' 
> a file. At present, chances are git might not notice this, and ability 
> to follow history through the rename+rewrite would be lost.
> 
> I'm wondering whether:
> 
> - this could be solved?

Not with the mega-patch approach.

> - how? (some additional advisory-only meta-data in the
>   index-cache and commit?)
> 

You could maintain that data yourself in either an external or versioned 
file. I've never heard of anyone employing the workflow you describe so 
I doubt it's very common. I also shudder to think that git will be made 
less efficient for the benefit of throwing history away, when tracking 
history efficiently is what it's all about in the first place.

> If there is consensus on an acceptable way, I'm willing to implement it. 
> (I was thinking of just adding 'rename' headers to the commit objects, 
> then teaching diffcore to consider them in addition to current heuristics).
> 

The code is mightier than the mail. Perhaps if I see an implementation 
of this I could wrap my head around what you really mean. I'm sure I 
must misunderstand you one way or another.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: impure renames / history tracking
  2006-03-01 16:27   ` Paul Jakma
  2006-03-01 17:13     ` Linus Torvalds
  2006-03-01 17:43     ` Andreas Ericsson
@ 2006-03-01 18:05     ` Martin Langhoff
  2006-03-01 19:13       ` Paul Jakma
  2 siblings, 1 reply; 16+ messages in thread
From: Martin Langhoff @ 2006-03-01 18:05 UTC (permalink / raw)
  To: paul; +Cc: Andreas Ericsson, git list

On 3/2/06, Paul Jakma <paul@clubi.ie> wrote:
> I mean:
>
>         $ git checkout project
>         $ git pull . master
>         $ git checkout -b tmp project
>         $ git diff project..master | <git apply I think>

The moment you 'merge' by using git-diff | patch you lose all the
support git gives you, because you are discarding all of git's
metadata! git's metadata is about all the commits you are merging, and
is good enough that it will help future merges across renames.

You should really use git-pull/git-merge at that point.

My guess is that you do this to achieve what you describe later:

> Presume that 'project' in the workflow is defined as
>
>         "achieve one goal with one commit to the master"
>
> So by definition, it always correct that the project only ever has
> one commit.

What happens if you rephrase that to read: "achieve one goal with one
merge to the master"? Long term, it gives you much better support from
the SCM. If a particular commit broke something, you can use
whatchanged, log, annotate and bisect to figure out in which /small/
commit things went astray.

And you can modify your practices ever so slightly to match the
benefits of the old model:

 - force merge message editing in git-merge, and prepare appropriate
commit messages for your merges
 - write a modified git-log that displays only the merges to master

that way, you get the best of both worlds.
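
A sketch of those two practices with options found in current git: `--no-ff` with `-m` forces a merge commit carrying a prepared message, and `git log --first-parent --merges` then shows only those merges on master. The topic names, file names and messages are invented for the example.

```shell
#!/bin/sh
# Scratch-repo sketch: one merge per goal, and a log that shows only merges.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # fix the initial branch name

echo "base" > file.txt
git add file.txt
git commit -q -m "base"

for topic in feature-a feature-b; do
    # Two small commits per topic branch.
    git checkout -q -b "$topic" master
    echo "$topic step 1" > "$topic.txt"
    git add "$topic.txt"
    git commit -q -m "$topic: step 1"
    echo "$topic step 2" >> "$topic.txt"
    git commit -q -a -m "$topic: step 2"

    # --no-ff forces a merge commit even where a fast-forward is possible;
    # -m supplies the one-goal summary message.
    git checkout -q master
    git merge -q --no-ff -m "Merge $topic: one goal, one merge" "$topic"
done

# Walk only master's own line of history and keep only merge commits.
merge_log=$(git log --first-parent --merges --format=%s master)
echo "$merge_log"
```

The summary view has one entry per goal, while the small commits stay reachable for whatchanged, annotate and bisect.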

> The trouble is that /sometimes/ projects do indeed 'rename and
> rewrite' a file. At present, chances are git might not notice this,

It will, if you preserve git's metadata.

The thing is that with any scm that tracks metadata of some kind, the
moment you bypass its tools and do diff|patch to discard the
metadata... well, you lose its benefits...

And what I've found, managing a project with 13K files, is that in
practice git does far better at tracking renames than several SCMs that
do explicit tracking. Don't be distracted by the 'we don't track
renames' posturing. We do, and it's so magic that it just works.

cheers,


* Re: impure renames / history tracking
  2006-03-01 17:13     ` Linus Torvalds
@ 2006-03-01 18:50       ` Paul Jakma
  0 siblings, 0 replies; 16+ messages in thread
From: Paul Jakma @ 2006-03-01 18:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Ericsson, git list

Hi Linus,

On Wed, 1 Mar 2006, Linus Torvalds wrote:

> The thing is, it does better than anything that _tries_ to be 
> "reliable".
>
> I can pretty much _guarantee_ that you can't do it better.

I'm willing to take that argument to the 'project' concerned, I just 
need to be pretty sure of it.

> Tracking "inodes" - aka file identities - (which is what BK does, 
> and I assume what SVN does) is fundamentally problematic. I 
> particular, it's a horrible problem when two inodes "meet" under 
> the same name. You now have two identities for the same file, and 
> you're fundamentally screwed.

Yes, in that model it is. This interestingly, is not the BK model, I 
suspect (see below).

> It doesn't even need renames to be a problem. JUST THE FACT THAT 
> YOU TRY TO TRACK FILE "IDENTITY" HISTORY IS BROKEN.

If it's "file identity" globally across the lifetime of the project, 
I agree 100 per cent. The 'traditional' SCM concerned does this.

That's not a solution I'd want to explore either; I'm only 
interested in the identity of files for any /one/ commit. In 
saying that, I recognise it's pointless to try to annotate file-change 
information in multi-parent commits (merges).

> For example, take CVS, which doesn't actually try to do renames, 
> but _does_ try to track the identity of a file, since all the 
> history is tied into that identity: think about what happens in 
> Attic when a file is deleted. Completely broken model.

ACK, {Attic,deleted_files}/ is just horrid.

> And that's really fundamental. CVS doesn't show the problems so 
> much, because CVS actively tries to make it hard to do these 
> things.

ACK.

> With renames-tracking-file-identities, it's _really_ easy to get 
> some major confusion going. What happens when one branch creates a 
> file, and another one renames a file to that same name, and they 
> merge?

Well, the conflict has to be resolved somehow, even today.

> Don't tell me it doesn't happen. It happened under BK. The way BK 
> "solved" it was to keep the two separate identities: one of them 
> got resolved to the new filename, the other one went into the 
> "deleted" directory.

Right. That's what the 'traditional workflow' SCM I'm thinking of 
does - not BK funnily enough, but an SCM predating BK which also 
happens to use SCCS files, and with some of the same high-level 
push/pull constructs as BK (interestingly).

It also tracks name history globally using a deleted_files/ history, 
which is maintained, but I don't think it does this for name merges 
like the above.

In the one I'm thinking of, it does (I /think/, I'm not an expert in 
it) the following:

Given two files, say:

old:

1.1---1.2---1.3

new:

1.1

- constructs a 'fake' base SCCS revision, empty
- adds the top 'old' version as a branch
- adds the top new version as a new delta

    1.1.1.1
   /
1.1---------1.2

Where in the merged file:

 	1.1: empty
 	1.1.1.1: was 1.3 from 'old'
 	1.2: is 1.1 from 'new'

However, it does /not/ create a deleted_files entry for the 'old' 
file. (AFAICT - I may not have a sufficiently full understanding of 
this SCM)

> Guess what happens when the side that got merged into "deleted" 
> continues to edit the file? That's right - their edits happen on 
> the deleted file, and never show up in the real tree in a 
> subsequent merge ever again.

Indeed - horrid.

> And as far as I can tell, BK really did the best you can do. 
> Following file identities really _is_ fundamentally broken. It 
> sounds like a nice idea, but while you might solve a few problems, 
> you create a whole raft of much more fundamental problems.

For tracking identity across more than one commit - I fully agree.

That's not quite what I'm thinking of though. Is it worth going on 
with the discussion on a:

 	 'track identities *only* from context of /the/ parent to
           this commit'

> So next time you think about a merge that might have been improved 
> by tracking renames, please also think about a merge where one of 
> the filenames came from two or more different sources through an 
> earlier merge, and thank your benevolent Gods that they instructed 
> me to make git be based purely on file contents.

Oh, I agree muchely here.

I wouldn't change git. I only wonder if its rename heuristics could 
be given an additional advisory-only hint (for single-parent commits 
at least - never merges - and only on a per-commit basis).

I probably should first explore how git deals with rename clashes..

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
I'm glad I was not born before tea.
 		-- Sidney Smith (1771-1845)

* Re: impure renames / history tracking
  2006-03-01 18:05     ` Martin Langhoff
@ 2006-03-01 19:13       ` Paul Jakma
  2006-03-01 19:56         ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Paul Jakma @ 2006-03-01 19:13 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Andreas Ericsson, git list

On Thu, 2 Mar 2006, Martin Langhoff wrote:

> The moment you 'merge' by using git-diff | patch you lose all the 
> support git gives you, because you are discarding all of git's 
> metadata! git's metadata is about all the commits you are merging, 
> and is good enough that it will help future merges across renames.

> You should really use git-pull/git-merge at that point.

Let's try not to get stuck on the workflow.

I probably shouldn't have brought it up. However, just assume it's 
been decided that 'detail' of the project implementation is too much 
clutter for the 'master'. I note that people do this already even in 
the "keep all the details" Linux and Git workflows, where they 
rejiggle commits in order to cut-out 'oops, made a typo' type of 
commits.

So the level of detail that is suitable for 'merging upstream' 
is clearly arbitrary and subjective, and even with git and Linux that 
knob is already set past 0 (all detail), maybe to 1 - the workflow 
I'm thinking of has it set to (say) 2.

For sake of argument assume the workflow corresponds to:

     o-o-o-o---o--o
    /              \
--o----------------m->

And collapsing just the 'oops, made a typo' commits so it looks like:

     o-----o------o
    /              \
--o----------------m->

The /real/ point, other than workflow, is:

- can we track 'rename and rewrite'?

> And you can modify your practices ever so slightly to match the
> benefits of the old model:

I agree completely on the workflow argument, I intend to make it to 
the project concerned ;).

> And what I've found, managing a project with 13K files, is that in 
> practice git does far better tracking renames than several SCMs 
> that do explicit tracking. Don't be distracted by the 'we don't 
> track renames' posturing. We do, and it's so magic that it just 
> works.

Yep, I know. :).

I just wonder if that magic could use additional hints (*not* Attic/ 
type stuff, ick ye gods no! Agree fully there!). Cause 'rename and 
rewrite' it just does not get right.

Simplest test-case (simulating 'rename and rewrite half the file') 
is:

- create a one-line file
- commit to git
- mv it and add a line

To show:

$ git status
nothing to commit
$ cat test
foo
$ git-mv test toast
$ echo bar >> toast
$ git-update-index toast
$ git status
#
# Updated but not checked in:
#   (will commit)
#
#       deleted:  test
#       new file: toast
#

A year later, someone comes along and looks at the history for 
'toast', they'll never know they can look back further by following 
'test'.

I'd like to fix the above somehow, possibly by adding 'renamed test 
toast' meta-data to index cache and commit objects. Having git-mv / 
git-cp add that meta-data.

Then diffcore would use that meta-data as /advisory/ and auxiliary 
information *only*, to /help/ in determining renames, as an 
additional input to its existing heuristics. This meta-data would not 
be intrinsic to the operation of git; it would /only/ be there to aid 
humans (or rather their tools) in tracking back/forward through history.

Would that be the best way to explore solving the above problem?

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Human resources are human first, and resources second.
 		-- J. Garbers

* Re: impure renames / history tracking
  2006-03-01 19:13       ` Paul Jakma
@ 2006-03-01 19:56         ` Junio C Hamano
  2006-03-01 21:25           ` Paul Jakma
  0 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2006-03-01 19:56 UTC (permalink / raw)
  To: paul; +Cc: Andreas Ericsson, git list

Paul Jakma <paul@clubi.ie> writes:

> For sake of argument assume the workflow corresponds to:
>
>     o-o-o-o---o--o
>    /              \
> --o----------------m->
>
> And collapsing just the 'oops, made a typo' commits so it looks like:
>
>     o-----o------o
>    /              \
> --o----------------m->
>
>
> The /real/ point, other than workflow, is:
>
> - can we track 'rename and rewrite'?

Yes.  Especially if the collapsing is of the 'oops, made a typo' kind.

Interestingly enough, there are two levels of "rename tracking"
the current git does.  When you run "git whatchanged -M", you
are looking at renames between each commit in the commit chain,
one step at a time.  There, as long as the rename+rewrite does
not amount to too much rewrite, what should be detected as a
rename is detected as a rename.  I found the
current default threshold parameters to be about right, maybe a
bit too tight sometimes, though.  If you want to loosen the
default, you can specify a similarity index after -M.
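
To make the threshold idea concrete, here is a minimal sketch, not diffcore's actual estimator: it uses Python's difflib as a stand-in, scores each deleted/added pair by matched characters over the larger file size, and calls a pair a rename only when the score reaches the threshold. The `similarity` and `detect_renames` names and the 0.5 default are assumptions made for illustration, so the scores will not match what git itself computes.

```python
from difflib import SequenceMatcher


def similarity(old: str, new: str) -> float:
    """Toy score in [0, 1]: characters in common over the larger file size."""
    if not old and not new:
        return 1.0
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(old), len(new))


def detect_renames(deleted, added, threshold=0.5):
    """Pair each deleted path with its best-scoring added path, if good enough."""
    renames = []
    for old_path, old_text in deleted.items():
        scored = [(similarity(old_text, new_text), new_path)
                  for new_path, new_text in added.items()]
        if not scored:
            continue
        score, new_path = max(scored)
        if score >= threshold:
            renames.append((old_path, new_path, score))
    return renames


# The one-line 'test' -> 'toast' case from earlier in the thread.
deleted = {"test": "foo\n"}
added = {"toast": "foo\nbar\n"}
print(detect_renames(deleted, added))                 # looser threshold
print(detect_renames(deleted, added, threshold=0.6))  # tighter threshold
```

In this toy scoring the pair sits right on the boundary, which illustrates why a small file with a small edit is sensitive to where the threshold is set.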

The way recursive merge strategy uses the rename detection,
unlike what whatchanged shows you, does not use chains of
commits down to the common merge base in order to detect renames
(my recollection may be wrong here -- it's a while since I
looked at the recursive merge the last time).  It just looks at
the two heads being merged, and detects similarity between
them.  So it does not make _any_ difference with the current
implementation of recursive merge if you kept a history full of
"honest but disgusting" commits or collapsed them into a history
with small number of "cleaned up" commits.

One thing it _could_ do (and you _could_ implement as another
merge strategy and call it "pauls-rename" merge) is to follow
the commit chain one by one down to the common merge base from
both heads being merged, and analyze rename history on both
commit chains.  Then, you would get better rename+rewrite
detection than what it currently does.
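
A rough sketch of what following the chain buys you, assuming each commit step has already been reduced to a map of detected renames (old path to new path). `compose_renames` is a hypothetical helper, not part of any existing merge strategy; it folds the per-step maps into one end-to-end map, which a merge could then consult for each side.

```python
def compose_renames(steps):
    """Fold per-commit rename maps (old -> new) into one end-to-end map."""
    origin_of = {}  # current path -> path it had at the start of the chain
    for step in steps:
        moved = {}
        for old, new in step.items():
            # If 'old' was itself the result of an earlier rename, keep its origin.
            moved[new] = origin_of.pop(old, old)
        origin_of.update(moved)
    # Drop paths that ended up back under their original name.
    return {orig: cur for cur, orig in origin_of.items() if orig != cur}


# Each step is a small rename+rewrite; end to end, the contents may no longer
# look similar enough for a comparison of the two endpoints to pair them up.
chain = [{"test": "toast"}, {"toast": "bread"}]
print(compose_renames(chain))
```

Each individual step only needs to pass the similarity threshold against its immediate neighbour, which is the advantage over comparing just the two heads.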

HOWEVER.

If you have that kind of rename-following merge, a workflow that
collapses a useful history into a single huge commit "Ok, this
commit is a roll-up patch between version 2.6.14 and 2.6.15"
becomes far less attractive than it currently already is.  At
that point, you _are_ throwing away useful history.

* Re: impure renames / history tracking
  2006-03-01 19:56         ` Junio C Hamano
@ 2006-03-01 21:25           ` Paul Jakma
  2006-03-01 22:12             ` Andreas Ericsson
  0 siblings, 1 reply; 16+ messages in thread
From: Paul Jakma @ 2006-03-01 21:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Andreas Ericsson, git list

Hi Junio,

On Wed, 1 Mar 2006, Junio C Hamano wrote:

> Interestingly enough, there are two levels of "rename tracking" the 
> current git does.  When you run "git whatchanged -M", you are 
> looking at renames between each commit in the commit chain, one 
> step at a time.  There as long as the rename+rewrite does not 
> amount to too much rewrite, you would see what should be detected 
> as rename to be detected as renames.

Right.

> I found the current default threshold parameters to be about right, 
> maybe a bit too tight sometimes, though.  If you want to loosen the 
> default, you can specify similarity index after -M.

That's one option.

I'm wondering though if we couldn't also allow for users to 
additionally encode naming 'hints', to aid this 'similarity' 
detection process.

> The way recursive merge strategy uses the rename detection, unlike 
> what whatchanged shows you, does not use chains of commits down to 
> the common merge base in order to detect renames (my recollection 
> may be wrong here -- it's a while since I looked at the recursive 
> merge the last time).  It just looks at the two heads being merged, 
> and detects similarity between them.  So it does not make _any_ 
> difference with the current implementation of recursive merge if 
> you kept a history full of "honest but disgusting" commits or 
> collapsed them into a history with small number of "cleaned up" 
> commits.

I'm going to have to stare at this paragraph a lot longer and harder 
to understand it :).

> One thing it _could_ do (and you _could_ implement as another merge 
> strategy and call it "pauls-rename" merge) is to follow the commit 
> chain one by one down to the common merge base from both heads 
> being merged, and analyze rename history on both commit chains.

Right, I was just thinking that while making tea actually. This could 
be part of the 'collapsing' process (or call it the "coalesce 
too-detailed commits" process, if that is less offensive to one's sense 
of process ;) ).

Actually, you're sort of suggesting following the chains in parallel, 
right? Ie in wall-clock time order, rather than chain order. And 
doing name resolution across the 'to-be-merged' chains at each step 
of the way? Sort of a lesser subset of how other SCMs maintain state 
for names globally?

It's not so much /resolving/ names I'm worried about in the first 
place. It's there simply being no information in the first place to 
indicate (from one single-parent commit to the next) which names were 
renamed.

> Then, you would get better rename+rewrite detection than what it 
> currently does.

But if I follow the commit chain in order to try extract

> HOWEVER.

> If you have that kind of rename-following merge, a workflow that 
> collapses a useful history into a single huge commit "Ok, this 
> commit is a roll-up patch between version 2.6.14 and 2.6.15" 
> becomes far less attractive than it currently already is.  At that 
> point, you _are_ throwing away useful history.

Yes, I agree. And as part of arguing git's case (several SCMs 
are being evaluated and considered, I'm the git proponent at the 
moment), I'm going to suggest the workflow ought to be re-evaluated to 
ensure it is generally reasonable, rather than be kept for its own 
sake (particularly as it may be tailored to the 
needs/limitations of $TRADITIONAL_SCM).

However, I suspect at least some level of collapsing will be desired 
(just as it is with Linux and git).

The workflow issue is separate from the 'impure rename' issue though. 
Even if the workflow I gave as an example exacerbates the issue, 
"rename and rewrite half of it" and hard-to-detect renames can still 
occur in the detailed git/linux workflows, surely?

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
If you really knew C++, you wouldn't even joke about putting it
in the kernel.

 	- Richard Johnson on linux-kernel

* Re: impure renames / history tracking
  2006-03-01 21:25           ` Paul Jakma
@ 2006-03-01 22:12             ` Andreas Ericsson
  2006-03-01 22:28               ` Paul Jakma
  2006-03-01 22:46               ` Junio C Hamano
  0 siblings, 2 replies; 16+ messages in thread
From: Andreas Ericsson @ 2006-03-01 22:12 UTC (permalink / raw)
  To: Paul Jakma; +Cc: Junio C Hamano, git list

Just to cap off my own engagement in this discussion, here's the last 
time rename detection was seriously discussed on the list:

http://www.gelato.unsw.edu.au/archives/git/0504/0147.html

If you're going to implement something you might benefit from the 
suggestions made there.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

* Re: impure renames / history tracking
  2006-03-01 22:12             ` Andreas Ericsson
@ 2006-03-01 22:28               ` Paul Jakma
  2006-03-01 22:46               ` Junio C Hamano
  1 sibling, 0 replies; 16+ messages in thread
From: Paul Jakma @ 2006-03-01 22:28 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Junio C Hamano, git list

On Wed, 1 Mar 2006, Andreas Ericsson wrote:

> http://www.gelato.unsw.edu.au/archives/git/0504/0147.html

In terms of format, that's pretty much exactly what I was thinking, 
except it's been vetoed.

> If you're going to implement something you might benefit from the 
> suggestions made there.

Cheers.

Is there a correct way to extend the git header? To add meta-data 
that normal git porcelain won't display? (there doesn't appear to 
be..)

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Zombie processes haunting the computer

* Re: impure renames / history tracking
  2006-03-01 22:12             ` Andreas Ericsson
  2006-03-01 22:28               ` Paul Jakma
@ 2006-03-01 22:46               ` Junio C Hamano
  1 sibling, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2006-03-01 22:46 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Paul Jakma, git list

Andreas Ericsson <ae@op5.se> writes:

> Just to cap off my own engagement in this discussion, here's the last
> time rename detection was seriously discussed on the list:
>
> http://www.gelato.unsw.edu.au/archives/git/0504/0147.html
>
> If you're going to implement something you might benefit from the
> suggestions made there.

Also, today's #git log has some interesting material.

	http://colabti.de/irclogger/irclogger_logs/git

For anybody who wants to discuss rename recording (not
tracking), the following is a must-read:

	http://article.gmane.org/gmane.comp.version-control.git/217

* Re: impure renames / history tracking
  2006-03-01 17:43     ` Andreas Ericsson
@ 2006-03-02 21:10       ` Paul Jakma
  2006-03-02 22:06         ` Andreas Ericsson
  0 siblings, 1 reply; 16+ messages in thread
From: Paul Jakma @ 2006-03-02 21:10 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: git list

On Wed, 1 Mar 2006, Andreas Ericsson wrote:

> Yes, but imo a poor one, as you're losing all the history.

Well, not per se. You might keep the original 'detail' branch. It's a 
terminal branch obviously, you can't pull master's changes to it once 
the aggregate patch goes into master. But you can keep it around.

> git *can* do what you want, but it was designed to maintain a long 
> history so that everyone can see it and improve on the code with 
> many chains of small and simultaneous changes.

Indeed, and I appreciate that.

> Perhaps we have a nomenclature clash here. When you say "one single 
> commit", I can't help but thinking "snapshot".

I mean:

 	git diff upstream..bugfix_xyz

or:

 	git diff upstream..project_foo_phase1

type of thing.

> It's completely impossible to fold *ALL* the history into a single 
> commit, and since you want heuristics I would imagine you wouldn't 
> want that either.

I want to know whether additional meta-data to help the existing 
heuristics would be acceptable. From a discussion on #git yesterday I 
gather the best way forward would be to first prototype something 
keeping state in a file in .git.

All that's needed really is something that relates the following 3 
things:

 	commit-id obj1-id obj2-id

Ie: For <commit-id>, <obj1-id> is similar to <obj2-id>.

Maintaining this state could be done via the git-mv/rename wrappers 
and an additional git-edit wrapper. Those who are quite happy with 
the existing diff-input only similarity heuristics wouldn't have to 
bother using a git-edit wrapper obviously, those who want to let git 
gather additional 'similarity hint' in this way could.
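
A prototype of that state could be as small as the sketch below. Everything in it is an assumption for illustration: the three-column text format follows the relation above, the ids are shortened placeholders, and the policy that a hint merely lowers the similarity bar (rather than overriding the content heuristic) is one possible reading of "advisory".

```python
from collections import defaultdict


def parse_hints(text):
    """Parse 'commit-id obj1-id obj2-id' lines into {commit: {(obj1, obj2)}}."""
    hints = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        commit_id, obj1, obj2 = fields
        hints[commit_id].add((obj1, obj2))
    return dict(hints)


def is_rename(commit_id, obj1, obj2, score, hints,
              threshold=0.5, hinted_threshold=0.2):
    """Content score decides; a hint only lowers the bar, never replaces it."""
    hinted = (obj1, obj2) in hints.get(commit_id, set())
    return score >= (hinted_threshold if hinted else threshold)


# Placeholder ids; a real file would hold full object names.
HINTS = """\
c0ffee 1111 2222
"""
hints = parse_hints(HINTS)
print(is_rename("c0ffee", "1111", "2222", 0.3, hints))  # hinted pair
print(is_rename("c0ffee", "1111", "3333", 0.3, hints))  # same score, no hint
```

Because the hint is keyed on the commit, it says nothing about file identity across the rest of the history, which keeps it per-commit as described.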

Aside:

Git might be easier to extend generally if it adopted just /one/ new 
core header, say "see-also" - that could serve as a pointer to 
arbitrary commit-related meta-info objects that aren't of immediate 
interest to either:

a) core git

or

b) the user

Format:

 	see-also <word> <obj-id>

E.g.:

 	see-also similars <obj-id>

Where <obj-id> would list the 'commit obj1 obj2', but just as:

 	obj1 obj2

Would ultimately be neater than fishing around in .git/, and would 
allow other extensions in the future too.

The <word> identifier preferably would need to be centrally 
co-ordinated.
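
As a sketch of how a tool might read such a header, assuming the usual commit layout of header lines, a blank line, then the message: the "see-also" header is only the proposal above and does not exist in git, and the object ids in the sample are made-up placeholders.

```python
def see_also_headers(commit_text):
    """Return {word: obj-id} for every proposed 'see-also' header line."""
    headers, _, _message = commit_text.partition("\n\n")
    found = {}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "see-also":
            word, obj_id = value.split()
            found[word] = obj_id
    return found


# Hypothetical commit text; ids are placeholders, not real object names.
SAMPLE = """\
tree aaaa1111
parent bbbb2222
author A U Thor <author@example.com> 1141333800 +0000
committer A U Thor <author@example.com> 1141333800 +0000
see-also similars cccc3333

Rename test to toast and extend it
"""
print(see_also_headers(SAMPLE))
```

A tool that does not know the header just skips the line, and one that does can fetch the referenced object and read the 'obj1 obj2' pairs from it.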

> I'm confused. First you say you want to have one single mega-patch 
> for each commit, then you say you want to be able to follow history 
> back. It's like deciding to throw away your wallet and then trying 
> to get someone to pick it up and carry it around for you.

I'm not sure why you think mega-patch. Collapsing a bunch of commits 
related to one project need not result in a big patch relative to the 
repository as a whole.

In Linux terms think project == "Add ATAPI support to SATA" or 
"Change the foo VFS method and update its filesystem users" type of 
thing (ok, the latter would be big enough, but still not /that/ big 
in terms of the whole Linux source base). Where the project concerned 
is like BSD, not just a kernel but a complete userland (so 1.1GB of 
source code).

I'm aware of the workflow arguments, I /do/ intend to make those but 
elsewhere ;).

> As for convincing others, shove git-bisect under their noses and 
> ask them if they'd like a tool to find their bugs for them.

;)

[snip - thanks, interesting]

> The code is mightier than the mail. Perhaps if I see an implementation of 
> this I could wrap my head around what you really mean. I'm sure I must 
> misunderstand you one way or another.

Yes, you're right. I think Junio gave me the required hints on 
directions last night on #git.

I think now at least it's quite possible to achieve without violating 
git's "track the /content/" philosophy, via .git.

Thanks!

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Factorials were someone's attempt to make math LOOK exciting.

* Re: impure renames / history tracking
  2006-03-02 21:10       ` Paul Jakma
@ 2006-03-02 22:06         ` Andreas Ericsson
  0 siblings, 0 replies; 16+ messages in thread
From: Andreas Ericsson @ 2006-03-02 22:06 UTC (permalink / raw)
  To: Paul Jakma; +Cc: git list

Paul Jakma wrote:
> On Wed, 1 Mar 2006, Andreas Ericsson wrote:
> 
>> It's completely impossible to fold *ALL* the history into a single 
>> commit, and since you want heuristics I would imagine you wouldn't 
>> want that either.
> 
> 
> I want to know whether additional meta-data to help the existing 
> heuristics would be acceptable. From a discussion on #git yesterday I 
> gather the best way forward would be to first prototype something 
> keeping state in a file in .git.
> 
> All that's needed really is something that relates the following 3 things:
> 
>     commit-id obj1-id obj2-id
> 
> Ie: For <commit-id>, <obj1-id> is similar to <obj2-id>.
> 
> Maintaining this state could be done via the git-mv/rename wrappers and 
> an additional git-edit wrapper. Those who are quite happy with the 
> existing diff-input only similarity heuristics wouldn't have to bother 
> using a git-edit wrapper obviously, those who want to let git gather 
> additional 'similarity hint' in this way could.
> 
> Aside:
> 
> Git might be easier to extend generally if it adopted just /one/ new 
> core header, say "see-also" - that could serve as a pointer to arbitrary 
> commit-related meta-info objects that aren't of immediate interest to 
> either:
> 
> a) core git
> 
> or
> 
> b) the user
> 

Things that aren't of interest to either core git or the user are already 
handled properly. It's called "cruft". ;)

However, I see what you're trying for here. Something like the X-* 
headers inside a mailer. Not all MUAs understand them, but if they do 
they can make use of them to the user's benefit.


> Format:
> 
>     see-also <word> <obj-id>
> 
> E.g.:
> 
>     see-also similars <obj-id>
> 
> Where <obj-id> would list the 'commit obj1 obj2', but just as:
> 
>     obj1 obj2
> 
> Would ultimately be neater than fishing around in .git/, and would allow 
> other extensions in the future too.
> 
> The <word> identifier preferably would need to be centrally co-ordinated.
> 

With X-* headers I don't see why it should have to be. Only the X-* part 
is mentioned in the RFC, so with a proper format Junio won't have to 
coordinate cross-SCM tools, git-tortoise, etc, etc...


>> I'm confused. First you say you want to have one single mega-patch for 
>> each commit, then you say you want to be able to follow history back. 
>> It's like deciding to throw away your wallet and then trying to get 
>> someone to pick it up and carry it around for you.
> 
> 
> I'm not sure why you think mega-patch. Collapsing a bunch of commits related 
> to one project need not result in a big patch relative to the repository 
> as a whole.
> 

Mainly I think it's because you mentioned several renames of a single 
file and many files renamed + rewritten (beyond git's current ability to 
recognize it). That's definitely a mega-patch in my book.


> Where the project concerned is like BSD, not 
> just a kernel but a complete userland (so 1.1GB of source code).
> 

<just curious>
Such a large project surely must be split in several smaller 
sub-projects? GNU is, after all, several small (and not so small) 
components. X works the same way. Linux is a large project, but each 
compartment of code can be managed on its own, so long as they adhere to 
the ABI hooking them back in to the kernel core.
</just curious>

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

* Re: impure renames / history tracking
@ 2006-03-02 22:24 linux
  0 siblings, 0 replies; 16+ messages in thread
From: linux @ 2006-03-02 22:24 UTC (permalink / raw)
  To: git, paul

>> Yes, but imo a poor one, as you're losing all the history.
>
> Well, not per se. You might keep the original 'detail' branch. It's a 
> terminal branch obviously, you can't pull master's changes to it once 
> the aggregate patch goes into master. But you can keep it around.

Actually, you can!  That's what the "ours" merge strategy is for!
It creates a merge whose result is a verbatim copy of the first parent.

The intended use is for when you've cherry-picked or otherwise manually
merged everything interesting from a branch and want to tie up the loose
end so you can delete the branch name.
