move detection doesnt take filename into account

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* move detection doesnt take filename into account
@ 2014-06-30  6:38 Elliot Wolk
  2014-07-01  9:16 ` Robin Rosenberg
  0 siblings, 1 reply; 11+ messages in thread
From: Elliot Wolk @ 2014-06-30  6:38 UTC (permalink / raw)
  To: git

if you move two identical {e.g.: empty} files to two new locations in a 
single commit, the move detection picks them {seemingly?} arbitrarily. 
it should use a statistical algorithm to compare the filenames and pick 
a likely match.

my apologies in advance if this isnt the right venue or is improperly 
formatted, or if this is extraneous noise, or widely known, etc.

+ cd /tmp
+ mkdir repo
+ cd repo
+ git init
Initialized empty Git repository in /tmp/repo/.git/
+ touch a1 b1 c1
+ git add a1 b1 c1
+ git commit -m 1
[master (root-commit) 72f8c89] 1
  3 files changed, 0 insertions(+), 0 deletions(-)
  create mode 100644 a1
  create mode 100644 b1
  create mode 100644 c1
+ git mv a1 a2
+ git mv b1 b2
+ git mv c1 c2
+ git commit -m 2
[master 359da78] 2
  3 files changed, 0 insertions(+), 0 deletions(-)
  rename c1 => a2 (100%)
  rename b1 => b2 (100%)
  rename a1 => c2 (100%)
+ git log --name-status -M
commit 359da78caaaf06848ae32359abfeb87db35cdb30
Author: Elliot Wolk <elliot.wolk@gmail.com>
Date:   Mon Jun 30 02:26:49 2014 -0400

     2

R100    c1      a2
R100    b1      b2
R100    a1      c2

commit 72f8c89b418e3b1d13ec350f4c30b5088fc69e83
Author: Elliot Wolk <elliot.wolk@gmail.com>
Date:   Mon Jun 30 02:26:49 2014 -0400

     1

A       a1
A       b1
A       c1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: move detection doesnt take filename into account
  2014-06-30  6:38 move detection doesnt take filename into account Elliot Wolk
@ 2014-07-01  9:16 ` Robin Rosenberg
  2014-07-01 14:40   ` Elliot Wolk
  2014-07-01 14:57   ` Junio C Hamano
  0 siblings, 2 replies; 11+ messages in thread
From: Robin Rosenberg @ 2014-07-01  9:16 UTC (permalink / raw)
  To: Elliot Wolk; +Cc: git



----- Ursprungligt meddelande -----
> Från: "Elliot Wolk" <elliot.wolk@gmail.com>
> Till: git@vger.kernel.org
> Skickat: måndag, 30 jun 2014 8:38:18
> Ämne: move detection doesnt take filename into account
> 
> if you move two identical {e.g.: empty} files to two new locations in a
> single commit, the move detection picks them {seemingly?} arbitrarily.
> it should use a statistical algorithm to compare the filenames and pick
> a likely match.

I think it does, but based on filename suffix. E.g. here is a rename of
three empty files with a suffix.

 3 files changed, 0 insertions(+), 0 deletions(-)
 rename 1.a => 2.a (100%)
 rename 1.b => 2.b (100%)
 rename 1.c => 2.c (100%)

-- robin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: move detection doesnt take filename into account
  2014-07-01  9:16 ` Robin Rosenberg
@ 2014-07-01 14:40   ` Elliot Wolk
  2014-07-01 14:57   ` Junio C Hamano
  1 sibling, 0 replies; 11+ messages in thread
From: Elliot Wolk @ 2014-07-01 14:40 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: git

interesting that it considers suffixes {only suffixes following 
periods?}. this is insufficient, in my opinion.

with all other things being equal, it ought to find the closest match 
{using smith-waterman or some such algorithm}.

as a real-world use case, i have a repository with empty files that 
mirrors the file structure of a directory containing large binary files.
when i move a dir, it seems to select the files renamed at random.

On 07/01/2014 05:16 AM, Robin Rosenberg wrote:
>
> ----- Ursprungligt meddelande -----
>> Från: "Elliot Wolk" <elliot.wolk@gmail.com>
>> Till: git@vger.kernel.org
>> Skickat: måndag, 30 jun 2014 8:38:18
>> Ämne: move detection doesnt take filename into account
>>
>> if you move two identical {e.g.: empty} files to two new locations in a
>> single commit, the move detection picks them {seemingly?} arbitrarily.
>> it should use a statistical algorithm to compare the filenames and pick
>> a likely match.
> I think it does, but based on filename suffix. E.g. here is a rename of
> three empty files with a suffix.
>
>   3 files changed, 0 insertions(+), 0 deletions(-)
>   rename 1.a => 2.a (100%)
>   rename 1.b => 2.b (100%)
>   rename 1.c => 2.c (100%)
>
> -- robin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: move detection doesnt take filename into account
  2014-07-01  9:16 ` Robin Rosenberg
  2014-07-01 14:40   ` Elliot Wolk
@ 2014-07-01 14:57   ` Junio C Hamano
  2014-07-01 15:05     ` Elliot Wolk
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2014-07-01 14:57 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Elliot Wolk, git

Robin Rosenberg <robin.rosenberg@dewire.com> writes:

> I think it does, but based on filename suffix. E.g. here is a rename of
> three empty files with a suffix.
>
>  3 files changed, 0 insertions(+), 0 deletions(-)
>  rename 1.a => 2.a (100%)
>  rename 1.b => 2.b (100%)
>  rename 1.c => 2.c (100%)

This is not more than a chance.

We tie-break rename source candidates that have the same content
similarity score to a rename destination using "name similarity",
whose implementation has been diffcore-rename.c::basename_same(),
which scores 1 if `basename $src` and `basename $dst` are the same
and 0 otherwise, i.e. from 1.a to a/1.a is judged to be a better
rename than from 1.a to a/2.a but otherwise there is nothing that
favors rename from 1.a to 2.a over 1.a to 2.b.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: move detection doesnt take filename into account
  2014-07-01 14:57   ` Junio C Hamano
@ 2014-07-01 15:05     ` Elliot Wolk
  2014-07-01 17:08       ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Elliot Wolk @ 2014-07-01 15:05 UTC (permalink / raw)
  To: Junio C Hamano, Robin Rosenberg; +Cc: git

thanks for the info!
then i suppose my bug is a petition to have name similarity instead use 
a different statistical matching algorithm.

On 07/01/2014 10:57 AM, Junio C Hamano wrote:
> Robin Rosenberg <robin.rosenberg@dewire.com> writes:
>
>> I think it does, but based on filename suffix. E.g. here is a rename of
>> three empty files with a suffix.
>>
>>   3 files changed, 0 insertions(+), 0 deletions(-)
>>   rename 1.a => 2.a (100%)
>>   rename 1.b => 2.b (100%)
>>   rename 1.c => 2.c (100%)
> This is not more than a chance.
>
> We tie-break rename source candidates that have the same content
> similarity score to a rename destination using "name similarity",
> whose implementation has been diffcore-rename.c::basename_same(),
> which scores 1 if `basename $src` and `basename $dst` are the same
> and 0 otherwise, i.e. from 1.a to a/1.a is judged to be a better
> rename than from 1.a to a/2.a but otherwise there is nothing that
> favors rename from 1.a to 2.a over 1.a to 2.b.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: move detection doesnt take filename into account
  2014-07-01 15:05     ` Elliot Wolk
@ 2014-07-01 17:08       ` Junio C Hamano
  2014-07-09  6:45         ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2014-07-01 17:08 UTC (permalink / raw)
  To: Elliot Wolk; +Cc: Robin Rosenberg, git

Elliot Wolk <elliot.wolk@gmail.com> writes:

> On 07/01/2014 10:57 AM, Junio C Hamano wrote:
>> Robin Rosenberg <robin.rosenberg@dewire.com> writes:
>>
>>> I think it does, but based on filename suffix. E.g. here is a rename of
>>> three empty files with a suffix.
>>>
>>>   3 files changed, 0 insertions(+), 0 deletions(-)
>>>   rename 1.a => 2.a (100%)
>>>   rename 1.b => 2.b (100%)
>>>   rename 1.c => 2.c (100%)
>> This is not more than a chance.
>>
>> We tie-break rename source candidates that have the same content
>> similarity score to a rename destination using "name similarity",
>> whose implementation has been diffcore-rename.c::basename_same(),
>> which scores 1 if `basename $src` and `basename $dst` are the same
>> and 0 otherwise, i.e. from 1.a to a/1.a is judged to be a better
>> rename than from 1.a to a/2.a but otherwise there is nothing that
>> favors rename from 1.a to 2.a over 1.a to 2.b.
>
> thanks for the info!
> then i suppose my bug is a petition to have name similarity instead
> use a different statistical matching algorithm.

[administrivia: please do not top-post on this list]

I didn't think it through but my gut feeling is that we could change
the name similarity score to be the length of the tail part that
matches (e.g. 1.a to a/2.a that has the same two bytes at the tail
is a better match than to a/2.b that does not share any tail, and to
a/1.a that shares the three bytes at the tail is an even better
match).

Oh, and rename basename_same() to something else; currently it is
only used as the "name similarity", and after such a change, it will
stay to be "name similarity" but will not be asking "are basenames
the same?" anymore.

Hint, hint...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: move detection doesnt take filename into account
  2014-07-01 17:08       ` Junio C Hamano
@ 2014-07-09  6:45         ` Jeff King
  2014-07-09 15:51           ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2014-07-09  6:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Elliot Wolk, Robin Rosenberg, git

On Tue, Jul 01, 2014 at 10:08:15AM -0700, Junio C Hamano wrote:

> I didn't think it through but my gut feeling is that we could change
> the name similarity score to be the length of the tail part that
> matches (e.g. 1.a to a/2.a that has the same two bytes at the tail
> is a better match than to a/2.b that does not share any tail, and to
> a/1.a that shares the three bytes at the tail is an even better
> match).

The delta heuristics in pack-objects use pack_name_hash, which claims:

        /*
         * This effectively just creates a sortable number from the
         * last sixteen non-whitespace characters. Last characters
         * count "most", so things that end in ".c" sort together.
         */

which might be another option (and seems like a superset of the basename
check, short of basenames that are longer than 16 characters).

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: move detection doesnt take filename into account
  2014-07-09  6:45         ` Jeff King
@ 2014-07-09 15:51           ` Junio C Hamano
  2014-07-09 22:03             ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2014-07-09 15:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Elliot Wolk, Robin Rosenberg, git

Jeff King <peff@peff.net> writes:

> On Tue, Jul 01, 2014 at 10:08:15AM -0700, Junio C Hamano wrote:
>
>> I didn't think it through but my gut feeling is that we could change
>> the name similarity score to be the length of the tail part that
>> matches (e.g. 1.a to a/2.a that has the same two bytes at the tail
>> is a better match than to a/2.b that does not share any tail, and to
>> a/1.a that shares the three bytes at the tail is an even better
>> match).
>
> The delta heuristics in pack-objects use pack_name_hash, which claims:
>
>         /*
>          * This effectively just creates a sortable number from the
>          * last sixteen non-whitespace characters. Last characters
>          * count "most", so things that end in ".c" sort together.
>          */
>
> which might be another option (and seems like a superset of the basename
> check, short of basenames that are longer than 16 characters).

Perhaps.

I am however not sure if the code to compute similarity score is as
OK with false positives, i.e. dissimilar names that happen to hash
together getting clumped in a same bin or in close bins, as the
existing callers of pack_name_hash().

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: move detection doesnt take filename into account
  2014-07-09 15:51           ` Junio C Hamano
@ 2014-07-09 22:03             ` Jeff King
  2014-07-09 22:18               ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2014-07-09 22:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Elliot Wolk, Robin Rosenberg, git

On Wed, Jul 09, 2014 at 08:51:07AM -0700, Junio C Hamano wrote:

> > The delta heuristics in pack-objects use pack_name_hash, which claims:
> >
> >         /*
> >          * This effectively just creates a sortable number from the
> >          * last sixteen non-whitespace characters. Last characters
> >          * count "most", so things that end in ".c" sort together.
> >          */
> >
> > which might be another option (and seems like a superset of the basename
> > check, short of basenames that are longer than 16 characters).
> 
> Perhaps.
> 
> I am however not sure if the code to compute similarity score is as
> OK with false positives, i.e. dissimilar names that happen to hash
> together getting clumped in a same bin or in close bins, as the
> existing callers of pack_name_hash().

I think the hash here does not collide in that way. It really is just
the last sixteen characters shoved into a uint32_t.

But thinking on it more, that is useful to the delta code because it
wants to create a sorted list of items. In the rename code we are doing
pairwise comparisons, so we are more flexible. We can compare whole
basenames, or whole suffixes (so "a/foo/bar.c" is closer to
"b/foo/bar.c" than to "c/other/bar.c"). Or just use a general-purpose
edit-distance function.

The tricky part is that the rename detection seems to take the score as
a binary 0/1 "is it the same", but we would want to express more nuance
(i.e., the "best" match among those that have similar content scores).

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: move detection doesnt take filename into account
  2014-07-09 22:03             ` Jeff King
@ 2014-07-09 22:18               ` Junio C Hamano
  2014-07-10  3:53                 ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2014-07-09 22:18 UTC (permalink / raw)
  To: Jeff King; +Cc: Elliot Wolk, Robin Rosenberg, git

Jeff King <peff@peff.net> writes:

> I think the hash here does not collide in that way. It really is just
> the last sixteen characters shoved into a uint32_t.

All bytes overlap with their adjacent byte because they are shifted
by only 2 bits, not 8 bits, when a new byte is brought in.  We can
say that the topmost two bits of the result must have come from the
last character, but other than these, there are more than one input
byte for each bit position to be set/unset by, so two names that human
would not consider "similar" would be given the same hash, no?

That is useful for delta code because the code only needs that
similar things are grouped together, it does not mind things that
are not similar is also mixed to a group, as the end result is
primarily determined by similarity of the actual contents, not
pathnames.

What is under topic in this discussion is the other way around; we
know two paths have contents of the same similarity to the third one
and want to tie-break these two using how similar their pathnames
are to the third one.  

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: move detection doesnt take filename into account
  2014-07-09 22:18               ` Junio C Hamano
@ 2014-07-10  3:53                 ` Jeff King
  0 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2014-07-10  3:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Elliot Wolk, Robin Rosenberg, git

On Wed, Jul 09, 2014 at 03:18:43PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I think the hash here does not collide in that way. It really is just
> > the last sixteen characters shoved into a uint32_t.
> 
> All bytes overlap with their adjacent byte because they are shifted
> by only 2 bits, not 8 bits, when a new byte is brought in.  We can
> say that the topmost two bits of the result must have come from the
> last character, but other than these, there are more than one input
> byte for each bit position to be set/unset by, so two names that human
> would not consider "similar" would be given the same hash, no?

Yeah, you're right. I didn't look at the algorithm closely enough.

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-07-10  3:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-06-30  6:38 move detection doesnt take filename into account Elliot Wolk
2014-07-01  9:16 ` Robin Rosenberg
2014-07-01 14:40   ` Elliot Wolk
2014-07-01 14:57   ` Junio C Hamano
2014-07-01 15:05     ` Elliot Wolk
2014-07-01 17:08       ` Junio C Hamano
2014-07-09  6:45         ` Jeff King
2014-07-09 15:51           ` Junio C Hamano
2014-07-09 22:03             ` Jeff King
2014-07-09 22:18               ` Junio C Hamano
2014-07-10  3:53                 ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).