git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Bug: impure renames may be reported as pure renames
@ 2024-02-20 20:58 Mário Guimarães
  2024-02-21  2:12 ` Elijah Newren
  0 siblings, 1 reply; 4+ messages in thread
From: Mário Guimarães @ 2024-02-20 20:58 UTC (permalink / raw)
  To: git

Hello Git community,

please see the report below of what may be a bug.

Yours sincerely
Mário Guimarães

======================================================
Thank you for filling out a Git bug report!
Please answer the following questions to help us understand your issue.

What did you do before the bug happened? (Steps to reproduce your issue)

In the rust-lang/rust repository (I cloned today from GitHub), if we
run the command

    git diff-tree -r -M a04e649^2 a04e649 --
tests/ui/issues/issue-83190.rs
tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs

we get this result

    :100644 100644 da931c3edaf6f9de6805e771f2b3b28edd27001f
11b659eec97323ea5190dac1771c7ca3241861e7 R100
tests/ui/issues/issue-83190.rs
tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs

However, the source and destination files of the rename are not 100%
equal. If we run this other command

    git diff -M a04e649^2 a04e649 -- tests/ui/issues/issue-83190.rs
tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs

we get the following result

    diff --git a/tests/ui/issues/issue-83190.rs
b/tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
    similarity index 100%
    rename from tests/ui/issues/issue-83190.rs
    rename to tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
    index da931c3edaf..11b659eec97 100644
    --- a/tests/ui/issues/issue-83190.rs
    +++ b/tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
    @@ -1,7 +1,7 @@
    -// check-pass
    -
     // Regression test for issue #83190, triggering an ICE in borrowck.

    +// check-pass
    +
     pub trait Any {}
     impl<T> Any for T {}

What did you expect to happen? (Expected behavior)

I was expecting to get from the `git diff-tree ...` command above a
status of `R099`, but never `R100` which should be reserved for pure
renames (i.e., renames with no content modifications).

In fact, if we search for pure renames only, by using the `-M100%`
option, such as running this command

    git diff-tree -r -M100% a04e649^2 a04e649 --
tests/ui/issues/issue-83190.rs
tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs

we get this output

    :100644 000000 da931c3edaf6f9de6805e771f2b3b28edd27001f
0000000000000000000000000000000000000000 D
tests/ui/issues/issue-83190.rs
    :000000 100644 0000000000000000000000000000000000000000
11b659eec97323ea5190dac1771c7ca3241861e7 A
tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs

This suggests that the initial command in this report cannot return an
`R100` status, as doing so it contradicts this last command.

Note that we only get a rename in case we lower the similarity
percentage, as demonstrated by this other command using the `-M99%`
option

    git diff-tree -r -M99% a04e649^2 a04e649 --
tests/ui/issues/issue-83190.rs
tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs

which outputs

    :100644 100644 da931c3edaf6f9de6805e771f2b3b28edd27001f
11b659eec97323ea5190dac1771c7ca3241861e7 R100
tests/ui/issues/issue-83190.rs
tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs

in which we see again `R100`, whereas the correct status should be `R099`.

What happened instead? (Actual behavior)

What's different between what you expected and what actually happened?

Anything else you want to add:

The rename detection machinery should be fixed to report pure renames
(`R100`) only when the source and destination files have exactly the
same contents.

Please review the rest of the bug report below.
You can delete any lines you don't wish to share.


[Info de sistema]
versão git:
git version 2.42.0
cpu: x86_64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
feature: fsmonitor--daemon
uname: Darwin 20.6.0 Darwin Kernel Version 20.6.0: Thu Jul  6 22:12:47
PDT 2023; root:xnu-7195.141.49.702.12~1/RELEASE_X86_64 x86_64
info de compilador: clang: 13.0.0 (clang-1300.0.29.30)
info de libc: informação libc indisponível

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bug: impure renames may be reported as pure renames
  2024-02-20 20:58 Bug: impure renames may be reported as pure renames Mário Guimarães
@ 2024-02-21  2:12 ` Elijah Newren
  2024-02-21 17:11   ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Elijah Newren @ 2024-02-21  2:12 UTC (permalink / raw)
  To: Mário Guimarães; +Cc: git

On Tue, Feb 20, 2024 at 12:58 PM Mário Guimarães
<mario.luis.guimaraes@gmail.com> wrote:
>
> Hello Git community,
>
> please see the report below of what may be a bug.
>
> Yours sincerely
> Mário Guimarães
>
> ======================================================
> Thank you for filling out a Git bug report!
> Please answer the following questions to help us understand your issue.
>
> What did you do before the bug happened? (Steps to reproduce your issue)
>
> In the rust-lang/rust repository (I cloned today from GitHub), if we
> run the command
>
>     git diff-tree -r -M a04e649^2 a04e649 --
> tests/ui/issues/issue-83190.rs
> tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>
> we get this result
>
>     :100644 100644 da931c3edaf6f9de6805e771f2b3b28edd27001f
> 11b659eec97323ea5190dac1771c7ca3241861e7 R100
> tests/ui/issues/issue-83190.rs
> tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>
> However, the source and destination files of the rename are not 100%
> equal. If we run this other command
>
>     git diff -M a04e649^2 a04e649 -- tests/ui/issues/issue-83190.rs
> tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>
> we get the following result
>
>     diff --git a/tests/ui/issues/issue-83190.rs
> b/tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>     similarity index 100%
>     rename from tests/ui/issues/issue-83190.rs
>     rename to tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>     index da931c3edaf..11b659eec97 100644
>     --- a/tests/ui/issues/issue-83190.rs
>     +++ b/tests/ui/type-alias-impl-trait/nested-rpit-with-lifetimes.rs
>     @@ -1,7 +1,7 @@
>     -// check-pass
>     -
>      // Regression test for issue #83190, triggering an ICE in borrowck.
>
>     +// check-pass
>     +
>      pub trait Any {}
>      impl<T> Any for T {}

Heh, good point.  And more generally, due to how the similarity checks
work (split the file into lines/chunks, hash each to an integer, then
sort the list of integers and compare the list of integers between two
files), whenever you keep all the original lines of a file but permute
their order, you will see a 100% match.

Maybe a simple hack like the below would suffice to fix (untested --
anyone want to test it out for me?):

diff --git a/diff.c b/diff.c
index a2def45644b..6b7b3a8b9af 100644
--- a/diff.c
+++ b/diff.c
@@ -4405,7 +4405,7 @@ static void run_external_diff(const char *pgm,

 static int similarity_index(struct diff_filepair *p)
 {
-       return p->score * 100 / MAX_SCORE;
+       return p->score * 100 / (MAX_SCORE+1);
 }

 static const char *diff_abbrev_oid(const struct object_id *oid, int abbrev)
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 5a6e2bcac71..3228a898e0b 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -314,7 +314,7 @@ static int find_identical_files(struct hashmap *srcs,
                        break;
        }
        if (best) {
-               record_rename_pair(dst_index, best->index, MAX_SCORE);
+               record_rename_pair(dst_index, best->index, MAX_SCORE+1);
                renames++;
        }
        return renames;

Which is based on the premise that 100% of source lines can match 100%
of destination lines and result in a score of MAX_SCORE, so we
manually set the scoring to MAX_SCORE+1 when we detect exact renames
based on matching hashes and then use that higher value as the divisor
when computing percentages.  (Note that there are several other places
where "MAX_SCORE" is used and which I'm pretty sure we would _not_
want to replace with "MAX_SCORE+1", so if you don't like my patch and
want to redefine MAX_SCORE, then all those other locations would need
to be adjusted instead.)

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: Bug: impure renames may be reported as pure renames
  2024-02-21  2:12 ` Elijah Newren
@ 2024-02-21 17:11   ` Junio C Hamano
  2024-02-21 21:00     ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Junio C Hamano @ 2024-02-21 17:11 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Mário Guimarães, git

Elijah Newren <newren@gmail.com> writes:

> Heh, good point.  And more generally, due to how the similarity checks
> work (split the file into lines/chunks, hash each to an integer, then
> sort the list of integers and compare the list of integers between two
> files), whenever you keep all the original lines of a file but permute
> their order, you will see a 100% match.

Stepping back a bit, there internally is a distinction between
"identical files" and "similar but not identical files".  The
file with permuted lines are not identical but are very similar.

The decision not to surface these two to end-users was made long
ago, and after finding identical renames, we skip similarity
computation when minimum_score is set to -M100 from the command
line, which is a documented UI that cannot change.

So you are right to say that similarity index of an inexact rename
should be capped at 99 and never reach 100.  I do not know offhand
if your "how about" computation ...

>  static int similarity_index(struct diff_filepair *p)
>  {
> -       return p->score * 100 / MAX_SCORE;
> +       return p->score * 100 / (MAX_SCORE+1);
>  }

... is the way to go, though.  We have filepair, so shouldn't we be
doing more like

	if (identical_rename(p))
		return 100;
	else
		return 99 * p->score / MAX_SCORE;

instead?

It will also keep working when a minor pet peeve of mine is fixed.
If you have a file with 1000 lines, all hashing to distinct values,
and if you change only one line, even though the similarity is
computed as 99% by the current logic, it actually is 99.9% that is
much closer to 100%.  But we always round down, which feels wrong.

Alternatively, we could cap p->score to (MAX_SCORE-1) when
similarity is _computed_ for filepair between non-identical blobs.
That should happen in diffcore-rename.c:estimate_similarity().
It would allow us to still _show_ "1 line changed among 1000 lines"
case as R100 and still reject inexact renames via "-M100" from the
command line, I think, as the "exact renames only" short-cut works
with the finer-grained "score" values and not the coarser "percent"
values, which might give us a better approach.





^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bug: impure renames may be reported as pure renames
  2024-02-21 17:11   ` Junio C Hamano
@ 2024-02-21 21:00     ` Junio C Hamano
  0 siblings, 0 replies; 4+ messages in thread
From: Junio C Hamano @ 2024-02-21 21:00 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Mário Guimarães, git

Junio C Hamano <gitster@pobox.com> writes:

> Alternatively, we could cap p->score to (MAX_SCORE-1) when
> similarity is _computed_ for filepair between non-identical blobs.
> That should happen in diffcore-rename.c:estimate_similarity().
> It would allow us to still _show_ "1 line changed among 1000 lines"
> case as R100 and still reject inexact renames via "-M100" from the
> command line, I think, as the "exact renames only" short-cut works
> with the finer-grained "score" values and not the coarser "percent"
> values, which might give us a better approach.

... the above may be "better" under a specific assumption: in the
future, we would correct the rounding of scaling from 0..MAX_SCORE
(or 0..MAX_SCORE-1) to 0..100 from the current "always round down"
to "below 0.5 is rounded down, 0.5 and above is rounded up".  If
that happens, we would still honor the "Give -M100 to limit the
rename detection to exact ones", and will exclude a change that
modifies a single line in 1000, but if you say -M99, we would show
such a change and it would be labelled as R100 as correctly rounded.


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-02-21 21:00 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-20 20:58 Bug: impure renames may be reported as pure renames Mário Guimarães
2024-02-21  2:12 ` Elijah Newren
2024-02-21 17:11   ` Junio C Hamano
2024-02-21 21:00     ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).