From: Junio C Hamano <gitster@pobox.com>
To: Elijah Newren <newren@gmail.com>
Cc: "Mário Guimarães" <mario.luis.guimaraes@gmail.com>, git@vger.kernel.org
Subject: Re: Bug: impure renames may be reported as pure renames
Date: Wed, 21 Feb 2024 09:11:12 -0800 [thread overview]
Message-ID: <xmqqttm1pwsv.fsf@gitster.g> (raw)
In-Reply-To: <CABPp-BG2m3_fyxFL6Hw4o5HhoKVd7_tYchDxCdTaXxdxZke-9g@mail.gmail.com> (Elijah Newren's message of "Tue, 20 Feb 2024 18:12:00 -0800")
Elijah Newren <newren@gmail.com> writes:
> Heh, good point. And more generally, due to how the similarity checks
> work (split the file into lines/chunks, hash each to an integer, then
> sort the list of integers and compare the list of integers between two
> files), whenever you keep all the original lines of a file but permute
> their order, you will see a 100% match.
Stepping back a bit, there internally is a distinction between
"identical files" and "similar but not identical files". The
file with permuted lines are not identical but are very similar.
The decision not to surface these two to end-users was made long
ago, and after finding identical renames, we skip similarity
computation when minimum_score is set to -M100 from the command
line, which is a documented UI that cannot change.
So you are right to say that similarity index of an inexact rename
should be capped at 99 and never reach 100. I do not know offhand
if your "how about" computation ...
> static int similarity_index(struct diff_filepair *p)
> {
> - return p->score * 100 / MAX_SCORE;
> + return p->score * 100 / (MAX_SCORE+1);
> }
... is the way to go, though. We have filepair, so shouldn't we be
doing more like
if (identical_rename(p))
return 100;
else
return 99 * p->score / MAX_SCORE;
instead?
It will also keep working when a minor pet peeve of mine is fixed.
If you have a file with 1000 lines, all hashing to distinct values,
and if you change only one line, even though the similarity is
computed as 99% by the current logic, it actually is 99.9% that is
much closer to 100%. But we always round down, which feels wrong.
Alternatively, we could cap p->score to (MAX_SCORE-1) when
similarity is _computed_ for filepair between non-identical blobs.
That should happen in diffcore-rename.c:estimate_similarity().
It would allow us to still _show_ "1 line changed among 1000 lines"
case as R100 and still reject inexact renames via "-M100" from the
command line, I think, as the "exact renames only" short-cut works
with the finer-grained "score" values and not the coarser "percent"
values, which might give us a better approach.
next prev parent reply other threads:[~2024-02-21 17:11 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-20 20:58 Bug: impure renames may be reported as pure renames Mário Guimarães
2024-02-21 2:12 ` Elijah Newren
2024-02-21 17:11 ` Junio C Hamano [this message]
2024-02-21 21:00 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqqttm1pwsv.fsf@gitster.g \
--to=gitster@pobox.com \
--cc=git@vger.kernel.org \
--cc=mario.luis.guimaraes@gmail.com \
--cc=newren@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).