git.vger.kernel.org archive mirror
From: Jeff King <peff@peff.net>
To: Cristian Tibirna <tibirna@kde.org>
Cc: git@vger.kernel.org
Subject: Re: git diff-tree -r -C output inexact sometimes
Date: Fri, 21 Sep 2012 02:03:39 -0400
Message-ID: <20120921060339.GA9844@sigill.intra.peff.net>
In-Reply-To: <2789023.yr3ihcVOhq@leto>

On Thu, Sep 20, 2012 at 11:20:31PM -0400, Cristian Tibirna wrote:

> Running the script in attachment produces a git repository in which were 
> operated a large number of file renames, in which many of the renamed files 
> (in this particular case all) have the same content but different names.
> 
> The commit data from the renaming operation (last commit in the script-
> generated history) is inexactly rendered by the command 
> 
> git diff-tree -r -C master
> 
> The logical result is correctly produced by the more restricted command
> 
> git diff-tree -r -M master
> 
> IMO for this particular last commit both the above commands should return the 
> same result.

Interesting. I get the same results from both commands. But I did have
to munge your script, as my "rename" command does not seem to work like
the one you expect in your script. So I may have misinterpreted the
intent of it.

However, I would not be surprised if one could construct a situation in
which "-C" and "-M" produced different results. Since the content of all
the files is the same, git has to make a guess about which files match
up based on their filenames. The current heuristic is very stupid and
just tries to match basenames (e.g., moving "foo/Makefile" to
"bar/Makefile" is a better match than moving the same content to
"bar/foo.c"). But in this case, the basenames don't match at all.
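To make the basename heuristic concrete, here is a minimal sketch in
Python. It is only an illustration of the idea described above, not
git's actual code; the function name basename_score is made up for this
example.

```python
import posixpath


def basename_score(src: str, dst: str) -> int:
    """Return 1 if both paths end in the same basename, else 0.

    Hypothetical helper: it models "same basename is a better match"
    and nothing more.
    """
    return int(posixpath.basename(src) == posixpath.basename(dst))


# Same basename, different directory: counts as a good match.
print(basename_score("foo/Makefile", "bar/Makefile"))  # 1
# Same content moved to an unrelated name: no bonus at all.
print(basename_score("foo/Makefile", "bar/foo.c"))     # 0
```

When no candidate shares a basename with the destination, every pair
gets the same score and the heuristic cannot tell them apart.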

By using "-C", we will typically have more rename sources available, and
we may therefore process the possible pairs in a different order. Since
our name heuristic is largely useless, our results depend on that order.
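A toy model of that order dependence, in Python. This is a sketch under
loud assumptions: pair_greedily and flat_score are invented for the
example, the "first best score wins" tie-breaking is assumed, and none
of it mirrors git's real data structures.

```python
def pair_greedily(sources, destinations, score):
    """Pick, for each destination, the first source with the best score.

    Ties are broken by position in `sources` (an assumption of this toy
    model), so the result depends on the order of the candidates.
    """
    pairs = {}
    for dst in destinations:
        best, best_score = None, -1
        for src in sources:
            s = score(src, dst)
            if s > best_score:  # strict: the earliest candidate wins ties
                best, best_score = src, s
        pairs[dst] = best
    return pairs


def flat_score(src, dst):
    """A name heuristic that cannot distinguish any pair."""
    return 0


# With a small candidate list there is only one possible answer...
print(pair_greedily(["old/b.txt"], ["new/x.txt"], flat_score))
# ...but one extra candidate placed earlier changes the pairing.
print(pair_greedily(["old/a.txt", "old/b.txt"], ["new/x.txt"], flat_score))
```

Both answers are equally "correct" as far as content goes, since the
files are identical; only the candidate list changed.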

I think the real solution is to improve the name heuristic. Something
like an edit distance would make more sense (though I think it is not as
simple as an edit distance across the whole pathname, as moving a
basename across directories should probably be preferred to changing the
filename inside a directory).
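One possible shape for such a heuristic, sketched in Python. The split
into a (basename distance, directory distance) pair is just one way to
express "prefer keeping the basename"; the names levenshtein and
name_distance are hypothetical, and nothing like this exists in git as
far as this message is concerned.

```python
import posixpath


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def name_distance(src: str, dst: str) -> tuple:
    """Compare basenames first, directories only as a tie-breaker.

    Tuples sort lexicographically, so any pair that keeps its basename
    beats any pair that changes it, regardless of the directory change.
    """
    src_dir, src_base = posixpath.split(src)
    dst_dir, dst_base = posixpath.split(dst)
    return (levenshtein(src_base, dst_base), levenshtein(src_dir, dst_dir))


candidates = ["foo/Makefile.bak", "bar/Makefile"]
best = min(candidates, key=lambda dst: name_distance("foo/Makefile", dst))
print(best)  # bar/Makefile
```

A lower tuple means a closer match. A real implementation would also
have to decide how this interacts with content similarity, which the
sketch ignores entirely.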

Largely I think nobody has cared much because this only comes up when
you move multiple identical files. Quite often there is a minor
difference even between very similar files, and that is enough to come
up with sane results.

-Peff
