From: Jeff King <peff@peff.net>
To: "sebastien.stettler" <sebastien.stettler@proton.me>
Cc: "phillip.wood@dunelm.org.uk" <phillip.wood@dunelm.org.uk>,
"chris.torek@gmail.com" <chris.torek@gmail.com>,
"git@vger.kernel.org" <git@vger.kernel.org>,
"j6t@kdbg.org" <j6t@kdbg.org>
Subject: Re: git rename/moved status unreliable in ruby
Date: Mon, 4 May 2026 06:00:56 -0400
Message-ID: <20260504100056.GB599780@coredump.intra.peff.net>
In-Reply-To: <IC7a4NnSKMdvXlVyaSDYEtU7iRlKdJGzCwrXNCFKrtFfnBJTMrwY522rHF8PfzYxFs43huo0KFGrqB6f4IQjmvYi2B8Ehh0cwfjHHOYW_RU=@proton.me>
On Sat, May 02, 2026 at 09:34:18AM +0000, sebastien.stettler wrote:
> Have there been explorations of ignoring whitespace for the similarity checker? I would
> assume that the majority of whitespace movements across many languages would result in a
> semantically similar document in most cases.
I don't think anybody has ever looked into it. We do have "-w" and
friends for diffs, and it makes sense that there might be some mode to
soften renames in the same way (especially if you are doing a "-w"
diff, or a merge that ignores whitespace).
The line you need to touch is probably this:
diff --git a/diffcore-delta.c b/diffcore-delta.c
index 2b7db39983..379f6010d3 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -147,6 +147,8 @@ static struct spanhash_top *hash_chars(struct repository *r,
 		/* Ignore CR in CRLF sequence if text */
 		if (is_text && c == '\r' && sz && *buf == '\n')
 			continue;
+		if (is_text && (c == ' ' || c == '\t'))
+			continue;

 		accum1 = (accum1 << 7) ^ (accum2 >> 25);
 		accum2 = (accum2 << 7) ^ (old_1 >> 25);
but:
  1. The option to ignore whitespace would need to be plumbed through
     the rest of the diffcore code.

  2. This concept probably throws off some other rename heuristics.
     E.g., I think we do a rough check that the sizes of the objects are
     not too far apart before even looking at the content. So you could
     construct a pathological case where the line "a\n" was changed to
     have a million spaces, and the files would look like they couldn't
     possibly be similar, even though they are identical when ignoring
     whitespace. I think in practice you could just ignore this, as sane
     cases would tend to have a reasonable ratio of content to
     whitespace changes.
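
To make the pathological case in point 2 concrete, here is a minimal
sketch. It is not git's code: sizes_plausibly_similar() is a
hypothetical stand-in for the rough size pre-check, and
size_without_blanks() is a hypothetical helper that counts bytes after
dropping the same two characters the patch above skips. The 50%
threshold used in the note below is likewise just an assumption for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a size pre-check (not git's actual code).
 * If two blobs differ in size by too much, they cannot reach
 * min_percent similarity no matter what their content is, so the
 * expensive content comparison can be skipped.
 */
static int sizes_plausibly_similar(size_t a, size_t b, unsigned min_percent)
{
	size_t max = a > b ? a : b;
	size_t min = a > b ? b : a;

	assert(min_percent <= 100);
	/* The size difference is a lower bound on how much content changed. */
	return (max - min) * 100 <= max * (100 - min_percent);
}

/*
 * Hypothetical helper: size of a buffer once spaces and tabs are
 * dropped, i.e. the characters the patch above skips while hashing.
 */
static size_t size_without_blanks(const char *buf, size_t len)
{
	size_t n = 0;

	for (size_t i = 0; i < len; i++)
		if (buf[i] != ' ' && buf[i] != '\t')
			n++;
	return n;
}
```

With "a\n" (2 bytes) on one side and a million spaces followed by
"a\n" (1000002 bytes) on the other, sizes_plausibly_similar(2, 1000002, 50)
rejects the pair on raw sizes alone. Feeding it the results of
size_without_blanks() instead gives 2 and 2, and the pair passes. So a
whitespace-ignoring mode would have to either use whitespace-stripped
sizes in the pre-check or accept missing cases like this one.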
-Peff