Re: Finding file revisions - David Woodhouse

From: David Woodhouse <dwmw2@infradead.org>
To: Chris Mason <mason@suse.com>
Cc: Linus Torvalds <torvalds@osdl.org>, git@vger.kernel.org
Subject: Re: Finding file revisions
Date: Thu, 28 Apr 2005 14:01:58 +0100
Message-ID: <1114693318.27227.111.camel@hades.cambridge.redhat.com>
In-Reply-To: <200504271423.37433.mason@suse.com>

[-- Attachment #1: Type: text/plain, Size: 2865 bytes --]

On Wed, 2005-04-27 at 14:23 -0400, Chris Mason wrote:
> Thanks.  I originally called diff-tree without the file list so that I could 
> do the regexp matching, but this is probably one of those features that will 
> never get used.

When I added this functionality to diff-tree I didn't want to add regexp
support, but I did make sure it could handle the simple case of "changes
within directory xxx/yyy". It can also take _multiple_ names. 
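
A minimal sketch of that kind of matching, for illustration only. The
function name path_is_interesting() and its signature are hypothetical,
not taken from diff-tree; it assumes each name is either a file or a
directory prefix, with no regexp support.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the actual diff-tree code: return 1 if
 * 'path' is one of the 'nr' entries in 'names', or lies inside a
 * directory named there ("xxx/yyy" matches "xxx/yyy/file.c" but
 * not "xxx/yyyz/file.c").
 */
int path_is_interesting(const char *path, const char **names, size_t nr)
{
	size_t i;

	assert(path != NULL);
	for (i = 0; i < nr; i++) {
		size_t len = strlen(names[i]);

		/* Ignore one trailing slash so "xxx/yyy/" works too. */
		if (len > 0 && names[i][len - 1] == '/')
			len--;
		if (strncmp(path, names[i], len) != 0)
			continue;
		/* Exact match, or a prefix ending at a directory boundary. */
		if (path[len] == '\0' || path[len] == '/')
			return 1;
	}
	return 0;
}
```

Checking the boundary character after the prefix is what keeps
"xxx/yyy" from also matching a sibling such as "xxx/yyyz".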

At the same time, I also posted a primitive script which attempted to do
something similar to what you're doing. The output of rev-tree is
useless, as Linus pointed out. Chronological sorting is
counterproductive in all cases and should be avoided _everywhere_.

My script is based on the original 'gitlog.sh' script, which walks the
commit tree from the head to its parents. It lists only those commits
where the file(s) in question actually changed, giving the commit ID and
the changes.

There's one problem with that already documented in my (attached) mail
-- we don't print merge changesets where the file in the child is
identical to the file in all the parents, but the changeset in question
_is_ relevant to the history because it's merging two branches on which
the file _independently_ changed.

The other problem is that we still don't have enough information to
piece together the full tree. With each commit we print, we're also
printing the last _relevant_ child (see $lastprinted in the script). 

That allows us to piece together most of the graph, but when we
eventually reach a commit which has already been processed (but not
necessarily _printed_), we just stop -- so we don't have useful parent
information for the oldest change in each branch and can't tie it back
to the point at which it branched. We know the _immediate_ parent, but
that parent isn't necessarily going to have been one of the commits we
actually printed.
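
The walk and the point where the link is lost can be sketched roughly
as follows. This is a toy model in C rather than the attached shell
script: struct toy_commit, walk_commits() and the 'seen' flag are all
assumptions made for illustration, with 'lastprinted' playing the role
of $lastprinted.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_PARENTS 2

/* Toy stand-in for a commit; every name here is hypothetical. */
struct toy_commit {
	const char *id;
	int touches_file;	/* did this commit change the file(s)? */
	int seen;		/* already processed by the walk? */
	int nr_parents;
	struct toy_commit *parents[MAX_PARENTS];
};

/*
 * Walk from 'c' towards its parents, printing only the commits that
 * touch the file, each with the last printed commit below it.
 * Returns the number of commits printed.
 */
int walk_commits(struct toy_commit *c, const struct toy_commit *lastprinted,
		 FILE *out)
{
	int printed = 0;
	int i;

	assert(c != NULL);
	if (c->seen)
		return 0;	/* the link from 'lastprinted' is lost here */
	c->seen = 1;

	if (c->touches_file) {
		fprintf(out, "commit %s child %s\n", c->id,
			lastprinted ? lastprinted->id : "");
		lastprinted = c;
		printed = 1;
	}
	for (i = 0; i < c->nr_parents; i++)
		printed += walk_commits(c->parents[i], lastprinted, out);
	return printed;
}
```

Take a head M merging A and B, both of which sit on X, which sits on R,
where only X leaves the file alone. The walk reports R as having child
A, but when it arrives at X again from B it stops, so the fact that B
also descends from R never appears in the output.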

I suspect the best way to do this is to start with a copy of rev-tree
and do something like..

	1. Add a 'struct commit_list *children' to 'struct commit'

	2. Make process_commit() set it correctly:
@@ wherever @@ process_commit
	        while (parents) {
	                process_commit(parents->item->object.sha1);
+	                commit_list_insert(obj, &parents->item->children);
	                parents = parents->next;
	        }

	3. Check each 'interesting' commit to see if it affects the
	   file(s) in question.

	4. Prune the tree: For each commit which isn't a merge and which
	   doesn't touch the file(s), just dump it from the tree,
	   changing the child pointer of its parent and the parent
	   pointer of its child accordingly to maintain the tree.
	   For each merge where there are no changes to the file(s)
	   between the merge point and the point at which the branch was
	   taken, drop that too.

	5. Print the remaining commits.
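
A rough sketch of steps 1, 2 and the non-merge half of step 4, using a
toy structure instead of the real 'struct commit' and 'struct
commit_list'. Everything named here (toy_commit, add_parent(),
prune_commit(), MAX_LINKS) is hypothetical, and dropping irrelevant
merges is not attempted.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINKS 4

/* Toy stand-in for 'struct commit'; all names here are hypothetical. */
struct toy_commit {
	const char *id;
	int touches_file;	/* step 3 result for this commit */
	int pruned;
	int nr_parents;
	int nr_children;
	struct toy_commit *parents[MAX_LINKS];
	struct toy_commit *children[MAX_LINKS];
};

/* Steps 1 and 2: record each parent link in both directions. */
void add_parent(struct toy_commit *child, struct toy_commit *parent)
{
	assert(child->nr_parents < MAX_LINKS);
	assert(parent->nr_children < MAX_LINKS);
	child->parents[child->nr_parents++] = parent;
	parent->children[parent->nr_children++] = child;
}

/*
 * Step 4, non-merge case only: drop 'c' and splice its children onto
 * its parent. Returns 1 if 'c' was pruned, 0 if it was kept.
 */
int prune_commit(struct toy_commit *c)
{
	struct toy_commit *parent;
	int i, j;

	if (c->pruned || c->touches_file || c->nr_parents != 1)
		return 0;	/* already gone, relevant, a merge, or a root */
	parent = c->parents[0];

	/* Remove 'c' from its parent's child list. */
	for (i = 0; i < parent->nr_children; i++) {
		if (parent->children[i] == c) {
			parent->children[i] = parent->children[--parent->nr_children];
			break;
		}
	}

	/* Each child of 'c' now hangs directly off 'parent'. */
	for (i = 0; i < c->nr_children; i++) {
		struct toy_commit *child = c->children[i];

		for (j = 0; j < child->nr_parents; j++)
			if (child->parents[j] == c)
				child->parents[j] = parent;
		assert(parent->nr_children < MAX_LINKS);
		parent->children[parent->nr_children++] = child;
	}
	c->pruned = 1;
	return 1;
}
```

With a chain R <- A <- B in which only A leaves the file untouched,
pruning A leaves B with R as its parent and R with B as its only child.
Because the links are rewritten rather than just skipped, every
surviving commit still has a parent that is itself in the pruned tree.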

-- 
dwmw2

[-- Attachment #2: Attached message - Re: [GIT PATCH] Selective diff-tree --]
[-- Type: message/rfc822, Size: 7296 bytes --]

[-- Attachment #2.1.1: Type: text/plain, Size: 2904 bytes --]

On Wed, 2005-04-13 at 14:57 +0100, David Woodhouse wrote:
> The plan is that this will also form the basis of a tool which will report the
> revision tree for a given file, which is why I really want to avoid the
> unnecessary recursion rather than just post-processing the output.

Script attached. Its output is something like this:

commit 97c9a63e76bf667c21f24a5cfa8172aff0dd1294 child
*100664->100644 blob    6e4064e920792d5b0219b9f8f55a38ab4a1af856->c1091cd15e2ed1be65b50eaa910f7b45c08d93ac      rev-tree.c

--------------------------
commit 13b6f29ac1686955e15f0250f796362460b4992e child 97c9a63e76bf667c21f24a5cfa8172aff0dd1294
*100644->100644 blob    5b3090780d49cc610339a19f070a5954dce9a8bc->c1091cd15e2ed1be65b50eaa910f7b45c08d93ac      rev-tree.c

--------------------------
commit 6420f0732f695269c0e3f28e62ed4b9aa6578d9f child 13b6f29ac1686955e15f0250f796362460b4992e
*100644->100644 blob    7429b9c4d0aab2e4a494eb4b65129a59da138106->5b3090780d49cc610339a19f070a5954dce9a8bc      rev-tree.c
*100664->100644 blob    28a980482bf2053e022409cc3e50b2ad8adafd55->5b3090780d49cc610339a19f070a5954dce9a8bc      rev-tree.c

 <...>

As we walk the tree from the HEAD to its parents, we print only those
commits which modify the file(s) in question. We remember the last
commit we printed as we recurse, so that we can generate a complete
graph. The SHA-1s of the blobs themselves aren't good enough on their own
because they're not guaranteed to be unique -- if the same change
happens on two different branches, the SHA-1 will be the same, and we
won't know how it fits together.

As it is, it's not quite perfect because I'm still omitting merge
commits where the resulting file is identical to the same file in _all_
of the parents. So if we have the following tree (for the _file_):

       ----- (AB) ----,
      /                \ 
  (A) ------ (AB) ----- (AB) --,
      \                         \
       ----- (AC) --------------(ABC)

(Where the delta A->AB is a trivial one-line fix which two people
independently reproduce, then they merge their trees together)

.. the point where the two independent instances of (AB) are merged
together won't be shown in the output of the attached script. The output
would show only this:

       ----- (AB) ----,
      /                \ 
  (A) ------ (AB) ----- (ABC)
      \                /           
       ----- (AC) ----'
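
The rule behind that omission can be written as a small predicate. This
is only a sketch of the behaviour described here, not code from the
attached script: commit_is_printed() is a hypothetical name, blobs are
compared as strings, and root commits (no parents) are not handled.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical predicate: a commit is printed only if the file's blob
 * differs from the blob in at least one parent. A merge whose result
 * is identical in all parents therefore returns 0.
 */
int commit_is_printed(const char *blob, const char **parent_blobs,
		      int nr_parents)
{
	int i;

	assert(blob != NULL);
	for (i = 0; i < nr_parents; i++)
		if (strcmp(blob, parent_blobs[i]) != 0)
			return 1;
	return 0;
}
```

Applied to the tree above, merging (AB) with (AB) gives 0, so that
merge is dropped, while merging (AB) with (AC) into (ABC) gives 1.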

Do we care about this? Or is it good enough? I don't really want to emit
output for _every_ merge commit we traverse, just in _case_ it happens
to be relevant later. Should I just give in to the voices in my head which
are telling me I should throw the damn thing away and rewrite it in C?

Given this output, it should be possible to display a pretty graph of
the history of the file, and easily find both diffs and whole files.
Creating a graphical tool which does this is left as an exercise for the
reader.

-- 
dwmw2

[-- Attachment #2.1.2: gitfilelog.sh --]
[-- Type: application/x-shellscript, Size: 1983 bytes --]
