git.vger.kernel.org archive mirror
* [RFH] Finding all commits that touch the same files as a specific commit
@ 2008-07-12 15:58 Sverre Rabbelier
  2008-07-13  1:24 ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Sverre Rabbelier @ 2008-07-12 15:58 UTC
  To: Git Mailinglist

Heya,

Currently I do the following to get all the files touched by the commit:
$ git diff-tree --name-status --no-commit-id -r <hash>
Then, to find all the commits that touch those files, I do:
$ git rev-list HEAD -- <all the returned paths here>
This works perfectly, except when the subtree merge strategy is used,
since in that case I get (example from git.git):
$ git diff-tree --name-status --no-commit-id -r 5821988f97b827f6ba81dfeebff932067c88ba6c
M	git-gui.sh
M	lib/diff.tcl
$ git rev-list HEAD -- git-gui.sh lib/diff.tcl
$

Now it was noticed on #git that git log has a --follow argument which
-does- catch the rename, but it only works on one file at a time. So,
my question is this:
How do I find all commits that touch the same files as a specific commit?
I have described my current approach above, which does not work when
the subtree merge strategy is used. I am not wedded to this approach,
though; if someone comes up with a better way to do this than with
'git diff-tree' / 'git rev-list', I'm fine with that. I provided my
current approach in the hope that someone comes up with a similar
solution, which means I'll have to edit less ;).
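
For reference, a minimal sketch of the two steps above combined into one
pipeline. The function name is made up for illustration, and it matches
paths literally, so it has the same blind spot for subtree merges:

```shell
# Hypothetical helper (the name is made up for this sketch): list every
# commit reachable from HEAD that touches any file changed by <commit>.
# Paths are matched literally, so renames are not followed.
commits_touching_same_files() {
    # --name-only prints just the paths; -z with xargs -0 keeps unusual
    # file names intact. Caveat: diff-tree prints no paths for a merge or
    # root commit by default, and some xargs implementations then still
    # run rev-list once with no pathspec at all.
    git diff-tree --name-only --no-commit-id -r -z "$1" |
        xargs -0 git rev-list HEAD --
}
```

Calling it as `commits_touching_same_files <hash>` should print one commit
id per line. For the git-gui commit above it would presumably still print
nothing in git.git, for the reason described.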

-- 
Cheers,

Sverre Rabbelier


* Re: [RFH] Finding all commits that touch the same files as a specific commit
  2008-07-12 15:58 [RFH] Finding all commits that touch the same files as a specific commit Sverre Rabbelier
@ 2008-07-13  1:24 ` Junio C Hamano
  2008-07-13 14:43   ` Sverre Rabbelier
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2008-07-13  1:24 UTC
  To: sverre; +Cc: Git Mailinglist

"Sverre Rabbelier" <alturin@gmail.com> writes:

> Currently I do the following:
> $git diff-tree --name-status --no-commit-id -r <hash>
> To get all the files touched by the commit, I do:
> $git rev-list HEAD -- all the returned paths here
> This works perfectly, except when the subtree merge strategy is used,
> since in that case I get (example from git.git):
> $ git diff-tree --name-status --no-commit-id -r
> 5821988f97b827f6ba81dfeebff932067c88ba6c
> M	git-gui.sh
> M	lib/diff.tcl
> $ git rev-list HEAD -- git-gui.sh lib/diff.tcl
> $
>
> Now it was noticed on #git that git log has a --follow argument which
> -does- catch the rename, but it only works on one file at a time. So,
> my question is this:
> How do I find all commits that touch the same files as a specific commit?
> I have described my current approach above, which does not work when
> the subtree merge strategy is used. I am not wedded to this approach,
> though; if someone comes up with a better way to do this than with
> 'git diff-tree' / 'git rev-list', I'm fine with that. I provided my
> current approach in the hope that someone comes up with a similar
> solution, which means I'll have to edit less ;).

First of all, some bad news that everybody should have known since day 1,
when the --follow option was introduced: it is merely a cute hack that
works most of the time in trivial histories.  The data structure it uses
cannot reliably follow renames if you have any nontrivial history.

Revision traversal machinery has a single list of pathspecs to filter the
results with, and in the usual traversal, the list never changes.  That is
why you would need to give a list of three pathspecs upfront, like this:

	git log -- arch/i386 arch/x86 arch/x86_64

to get the whole picture of how things are consolidated into a single
arch/x86 hierarchy over time from originally two hierarchies.  The
revision traversal works by simplifying away commits that do not touch
any path matching one of the given pathspecs, so giving the "current" path
(i.e. arch/x86) is not sufficient.

The --follow option changes the behaviour slightly.  When you have this history:

    ---o---o---o---x---x---x

where a file you are interested in (say, arch/i386/kernel/reboot.c)
existed in the past in 'o' commits, but was renamed to something else
(say, arch/x86/kernel/reboot.c) in newer 'x' commits, you would start
following from the tip of the history like this:

	git log --follow arch/x86/kernel/reboot.c

And the machinery traverses down the history, showing only the commits
that touch the given path.  An interesting thing happens, however, when it
hits the earliest 'x' commit and realizes that its parent 'o' does not
have that path.  It runs the rename detection there, realizes the path it
is interested in corresponds to a different path in the parent, and
_updates_ the pathspec to the old name.  I.e., from that point on it
behaves as if you started digging from the tip of this history:

    ---o---o---o

with a different pathspec:

	git log --follow arch/i386/kernel/reboot.c
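
The rename lookup at that boundary can be approximated from the command
line.  What follows is only a sketch under assumptions: the helper name is
hypothetical, it relies on "git diff-tree -M" for rename detection, and it
looks at a single non-merge commit rather than reproducing what the
revision machinery does internally.

```shell
# Hypothetical helper (not part of git): print the name that <path> had in
# the parent of <commit>. If the commit did not rename the file, the same
# path is printed back. diff-tree shows nothing for merge or root commits
# by default, so for those the path also comes back unchanged.
name_in_parent() {
    # With --diff-filter=R every output line is "R<score><TAB>old<TAB>new".
    git diff-tree -r -M --diff-filter=R --name-status --no-commit-id "$1" |
        awk -F'\t' -v p="$2" '
            $3 == p { print $2; found = 1; exit }
            END     { if (!found) print p }'
}
```

For the earliest 'x' commit in the picture above, something like
"name_in_parent <that-commit> arch/x86/kernel/reboot.c" would be expected
to print arch/i386/kernel/reboot.c, provided the rename is similar enough
for -M to detect.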

This works as long as your history is trivial, but in real life, the world
is not linear.

          x---x---x---x
         /       /
    ----o---o---o

If commits 'x' have git-gui/git-gui.sh and commits 'o' have git-gui.sh at
the root level, you would start digging from the tip with --follow:

	git log --follow git-gui/git-gui.sh

When it hits the rightmost merge 'x', it realizes the changes to the file
came from lower history and switches the pathspec to "git-gui.sh" at the
root level (the commits that have already been traversed are marked with
uppercase letters here).

          x---x---X---X
         /       /
    ----o---o---O

Switching the pathspec from "git-gui/git-gui.sh" to "git-gui.sh" is fine
for the purpose of traversing the 'o' history down, but there is a
problem.  Remember I said there is a _single_ list of pathspecs the
revision traversal machinery keeps track of?  If you switch that single
list to "git-gui.sh", it means you completely forget that you were
following "git-gui/git-gui.sh".  You cannot follow the upper history
anymore.

In order to follow renames reliably in a merge heavy history, you need to
keep track of the pathname the file you are interested in appears as _in
each commit_.  As you traverse down the history, you pass down the
pathname to the parent you visit, so while you are traversing from 'x' to
earlier 'x', you will keep following "git-gui/git-gui.sh", while you
traverse down to 'o', you will inspect "git-gui.sh".

The data structure the revision traversal machinery uses does not support
this "path-per-commit" natively.
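
To make the "path-per-commit" idea concrete, here is a rough sketch done
outside the revision machinery.  It is an illustration under assumptions,
not how git is implemented: the function name is hypothetical, it keeps a
work queue of <commit, path> pairs in temporary files, it runs one
"git diff-tree -M" per parent (so it is slow), and it assumes paths
without tabs, newlines or backslashes.

```shell
# Hypothetical sketch: walk history from <start>, remembering a separate
# path for every commit visited. Prints "<commit><TAB><path>" for each
# commit that changed the file relative to all of its parents. Output is
# in breadth-first order, not sorted by date.
follow_per_commit() {
    tab=$(printf '\t')
    queue=$(mktemp) seen=$(mktemp)
    printf '%s\t%s\n' "$(git rev-parse "$1")" "$2" >"$queue"
    while [ -s "$queue" ]; do
        # Pop the next <commit, path> pair off the work queue.
        IFS=$tab read -r commit path <"$queue"
        sed 1d "$queue" >"$queue.tmp" && mv "$queue.tmp" "$queue"
        if grep -Fxq "$commit$tab$path" "$seen"; then continue; fi
        printf '%s\t%s\n' "$commit" "$path" >>"$seen"
        nparents=0 touched=0
        for parent in $(git log -1 --format=%P "$commit"); do
            nparents=$((nparents + 1))
            # Compare this parent with the commit at our path. The awk
            # filter prints "R<TAB>old-name" for a rename and
            # "<status><TAB>path" for any other change to the path.
            info=$(git diff-tree -r -M --name-status "$parent" "$commit" |
                awk -F'\t' -v p="$path" '
                    $1 ~ /^R/ { if ($3 == p) { print "R\t" $2; exit } next }
                    $2 == p   { print $1 "\t" p; exit }')
            case $info in
            '') # Unchanged relative to this parent: keep the same name.
                printf '%s\t%s\n' "$parent" "$path" >>"$queue" ;;
            A*) # The file does not exist in this parent: stop on this side.
                touched=$((touched + 1)) ;;
            *)  # Modified or renamed: follow the parent under its own name.
                touched=$((touched + 1))
                printf '%s\t%s\n' "$parent" "${info#*"$tab"}" >>"$queue" ;;
            esac
        done
        if [ "$touched" -eq "$nparents" ]; then
            printf '%s\t%s\n' "$commit" "$path"
        fi
    done
    rm -f "$queue" "$seen"
}
```

With the history pictured above, "follow_per_commit HEAD git-gui/git-gui.sh"
would be expected to keep that name on the 'x' side and switch to
"git-gui.sh" only for the pairs queued from the 'o' side, as long as -M
detects the rename at the merge.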

This is the reason "git-blame" uses its own traversal engine.  It keeps
track of <commit, path> pairs so that it can mark which line came from
what path in what commit.  When copy/move detection is used, we can even
notice that the contents we are interested in came from more than one file
in the same commit, and the data structure supports it (i.e. it is not
just a pointer to a single string from "struct commit").

For the purpose of "git log" traversal and the "file renames" people
usually talk about, this is overkill; you should however be able to
backport the basic idea to revision machinery, if you really cared.

In a real history, "file rename" is a very ill defined concept and is not
always useful in practice.  I did a fairly detailed analysis on one
real-world history more than two years ago, which is found here:

    http://thread.gmane.org/gmane.comp.version-control.git/13746/focus=13769

In our own "git.git" history, the evolution of what finally landed in
revision.c is interesting.  The interesting part of content movement never
involved any file renames --- only bits and pieces migrated over across
many files.  That is not something for which "file rename tracking", even
with an extension to the revision traversal machinery to keep one path per
commit to record the file you are interested in, can ever give a meaningful
explanation of the history.  You need a much more fine-grained "blame"
traversal machinery for that.


* Re: [RFH] Finding all commits that touch the same files as a specific commit
  2008-07-13  1:24 ` Junio C Hamano
@ 2008-07-13 14:43   ` Sverre Rabbelier
  2008-07-13 18:30     ` Johannes Schindelin
  0 siblings, 1 reply; 7+ messages in thread
From: Sverre Rabbelier @ 2008-07-13 14:43 UTC
  To: Junio C Hamano; +Cc: Git Mailinglist

On Sun, Jul 13, 2008 at 3:24 AM, Junio C Hamano <gitster@pobox.com> wrote:

<explanation of the git log traversal machinery snipped>

> In order to follow renames reliably in a merge heavy history, you need to
> keep track of the pathname the file you are interested in appears as _in
> each commit_.  As you traverse down the history, you pass down the
> pathname to the parent you visit, so while you are traversing from 'x' to
> earlier 'x', you will keep following "git-gui/git-gui.sh", while you
> traverse down to 'o', you will inspect "git-gui.sh".
>
> The data structure the revision traversal machinery uses does not support
> this "path-per-commit" natively.

Would it be possible to go for a slightly less complicated approach
and, instead of replacing the tracked file, append it? We
already have a list of files we are tracking, so I assume the data
structure does support that. That would run the risk of tracking
too much (e.g., you rename a.txt => b.txt, and then later on
create/rename a new a.txt which is then tracked as well).
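
As a rough illustration of the "append" idea, done from the outside rather
than inside the traversal machinery: collect every older name first, then
hand the whole list to rev-list. The function names are made up, the
combination of -m and -M is assumed to be enough to see renames recorded
at merges, and it over-tracks in exactly the way described above.

```shell
# Hypothetical helper: print <path> followed by every older name it can be
# traced back to through renames found while walking HEAD's history from
# newest to oldest. Names are only ever added, never replaced.
all_names_of() {
    git log --topo-order -m -M --diff-filter=R --name-status --format= HEAD |
        awk -F'\t' -v p="$1" '
            BEGIN { names[p] = 1; print p }
            $1 ~ /^R/ && ($3 in names) && !($2 in names) {
                names[$2] = 1; print $2
            }'
}

# Hypothetical helper: list commits touching the file under any known name.
commits_touching_any_name() {
    all_names_of "$1" | tr '\n' '\0' | xargs -0 git rev-list HEAD --
}
```

Something like "commits_touching_any_name git-gui/git-gui.sh" would then
amount to running rev-list with both the new and the old name as pathspecs.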

> This is the reason "git-blame" uses its own traversal engine.  It keeps
> track of <commit, path> pairs so that it can mark which line came from
> what path in what commit.  When copy/move detection is used, we can even
> notice that the contents we are interested in came from more than one file
> in the same commit, and the data structure supports it (i.e. it is not
> just a pointer to a single string from "struct commit").

So what could be done is use a blame-like mechanism that invokes
rename detection on each interesting commit and then record that
information? Purely hypothetical though, since I know neither and have
no time to do so.

> For the purpose of "git log" traversal and the "file renames" people
> usually talk about, this is overkill; you should however be able to
> backport the basic idea to revision machinery, if you really cared.

Right, that'd teach git log how to follow across renames in an
intelligent manner that works also for non-linear histories at the
cost of using up more memory and cpu?

> In a real history, "file rename" is a very ill defined concept and is not
> always useful in practice.  I did a fairly detailed analysis on one
> real-world history more than two years ago, which is found here:
>
>    http://thread.gmane.org/gmane.comp.version-control.git/13746/focus=13769

Aye, I agree that a 'rename' is hard to define and that a lot of
effort could be put into supporting 'renames' that are not trivial
(e.g., more complex than 'git mv foo.txt bar.txt').

> In our own "git.git" history, the evolution of what finally landed in
> revision.c is interesting.  The interesting part of content movement never
> involved any file renames --- only bits and pieces migrated over across
> many files.  That is not something for which "file rename tracking", even
> with an extension to the revision traversal machinery to keep one path per
> commit to record the file you are interested in, can ever give a meaningful
> explanation of the history.  You need a much more fine-grained "blame"
> traversal machinery for that.

This makes sense, but it (using blame traversal machinery) is overkill
for what I am interested in. What I think would be a good goal is
supporting the subtree merge strategy. It would be nice if 'git log
--follow-subtree-merge refspec -- filefilter' or such would Just Work
(TM). Perhaps the hunk-tracking I am working on with Dscho could
help make 'git log --numstat' more accurate. Those two combined (git
log being able to follow across subtree merges and being able to
recognise hunks being moved) would be all that I need.


-- 
Cheers,

Sverre Rabbelier


* Re: [RFH] Finding all commits that touch the same files as a specific commit
  2008-07-13 14:43   ` Sverre Rabbelier
@ 2008-07-13 18:30     ` Johannes Schindelin
  2008-07-14 11:17       ` Sverre Rabbelier
  0 siblings, 1 reply; 7+ messages in thread
From: Johannes Schindelin @ 2008-07-13 18:30 UTC
  To: Sverre Rabbelier; +Cc: Junio C Hamano, Git Mailinglist

Hi,

On Sun, 13 Jul 2008, Sverre Rabbelier wrote:

> On Sun, Jul 13, 2008 at 3:24 AM, Junio C Hamano <gitster@pobox.com> wrote:
> 
> <explanation of the git log traversal machinery snipped>
> 
> > In order to follow renames reliably in a merge heavy history, you need to
> > keep track of the pathname the file you are interested in appears as _in
> > each commit_.  As you traverse down the history, you pass down the
> > pathname to the parent you visit, so while you are traversing from 'x' to
> > earlier 'x', you will keep following "git-gui/git-gui.sh", while you
> > traverse down to 'o', you will inspect "git-gui.sh".
> >
> > The data structure the revision traversal machinery uses does not support
> > this "path-per-commit" natively.
> 
> Would it be possible to go for a slightly less complicated approach
> and, instead of replacing the tracked file, append it?

Maybe I am missing something, but do you not have to keep track of the
file names in order to keep track of the proper statistics?

If that is the case, appending does not cut it.

Ciao,
Dscho


* Re: [RFH] Finding all commits that touch the same files as a specific commit
  2008-07-13 18:30     ` Johannes Schindelin
@ 2008-07-14 11:17       ` Sverre Rabbelier
  2008-07-14 12:13         ` Johannes Schindelin
  0 siblings, 1 reply; 7+ messages in thread
From: Sverre Rabbelier @ 2008-07-14 11:17 UTC
  To: Johannes Schindelin; +Cc: Junio C Hamano, Git Mailinglist

On Sun, Jul 13, 2008 at 8:30 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> On Sun, 13 Jul 2008, Sverre Rabbelier wrote:
>> On Sun, Jul 13, 2008 at 3:24 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> > The data structure the revision traversal machinery uses does not support
>> > this "path-per-commit" natively.
>>
>> Would it be possible to go for a slightly less complicated approach
>> and, instead of replacing the tracked file, append it?
>
> Maybe I am missing something, but do you not have to keep track of the
> file names in order to keep track of the proper statistics?

Hmm, no, this is just to get commits that touched a (set of) file(s).
I use it to limit the commits I have to check when searching for
reverts.

> If that is the case, appending does not cut it.

For the activity metric I think it would make sense to pretend that all
files with the same name, or renamed versions of those, are one file, which
is what appending the new name would do. The downside is that all files
with the same name get grouped together. I'm not sure which is the lesser
of two evils: leaving out information, or (possibly) including too
much.

-- 
Cheers,

Sverre Rabbelier


* Re: [RFH] Finding all commits that touch the same files as a specific commit
  2008-07-14 11:17       ` Sverre Rabbelier
@ 2008-07-14 12:13         ` Johannes Schindelin
  2008-07-14 14:30           ` Sverre Rabbelier
  0 siblings, 1 reply; 7+ messages in thread
From: Johannes Schindelin @ 2008-07-14 12:13 UTC
  To: Sverre Rabbelier; +Cc: Junio C Hamano, Git Mailinglist

Hi,

On Mon, 14 Jul 2008, Sverre Rabbelier wrote:

> For the activity metric I think it would make sense to pretend that all
> files with the same name, or renamed versions of those, are one file,
> which is what appending the new name would do. The downside is that all
> files with the same name get grouped together. I'm not sure which is the
> lesser of two evils: leaving out information, or (possibly) including
> too much.

IMO following the file renames/code moves precisely is really worth the 
time it takes to calculate.  Otherwise, the statistics will not reflect 
what was going on in the project, right?

Ciao,
Dscho


* Re: [RFH] Finding all commits that touch the same files as a specific commit
  2008-07-14 12:13         ` Johannes Schindelin
@ 2008-07-14 14:30           ` Sverre Rabbelier
  0 siblings, 0 replies; 7+ messages in thread
From: Sverre Rabbelier @ 2008-07-14 14:30 UTC
  To: Johannes Schindelin; +Cc: Junio C Hamano, Git Mailinglist

On Mon, Jul 14, 2008 at 2:13 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> IMO following the file renames/code moves precisely is really worth the
> time it takes to calculate.  Otherwise, the statistics will not reflect
> what was going on in the project, right?

Ah, I did not mean to imply that it is not worth the time it takes to
calculate, more that I do not know how to implement it that way. Of
course if someone has the time/motivation to do so I would very much
make use of it, but I do not have time to do so myself (at least not
until after GSoC). I can have GitStats just ignore the subtree merge
cases (and say that it is beyond the scope of my project to take
such into account) and have a working program by the end of GSoC. But
if I spend time on getting this to work instead I might end up with a
program that does follow sub-tree merges but is only half-done, and as
such probably won't receive an "OK" grade. So, yes, I would very much
like to see this, but I have no time to look into doing so myself until
after GSoC.

-- 
Cheers,

Sverre Rabbelier


Thread overview: 7+ messages
2008-07-12 15:58 [RFH] Finding all commits that touch the same files as a specific commit Sverre Rabbelier
2008-07-13  1:24 ` Junio C Hamano
2008-07-13 14:43   ` Sverre Rabbelier
2008-07-13 18:30     ` Johannes Schindelin
2008-07-14 11:17       ` Sverre Rabbelier
2008-07-14 12:13         ` Johannes Schindelin
2008-07-14 14:30           ` Sverre Rabbelier
