Git development
From: Jeff King <peff@peff.net>
To: "sebastien.stettler" <sebastien.stettler@proton.me>
Cc: "phillip.wood@dunelm.org.uk" <phillip.wood@dunelm.org.uk>,
	"chris.torek@gmail.com" <chris.torek@gmail.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>,
	"j6t@kdbg.org" <j6t@kdbg.org>
Subject: Re: git rename/moved status unreliable in ruby
Date: Mon, 4 May 2026 06:00:56 -0400
Message-ID: <20260504100056.GB599780@coredump.intra.peff.net>
In-Reply-To: <IC7a4NnSKMdvXlVyaSDYEtU7iRlKdJGzCwrXNCFKrtFfnBJTMrwY522rHF8PfzYxFs43huo0KFGrqB6f4IQjmvYi2B8Ehh0cwfjHHOYW_RU=@proton.me>

On Sat, May 02, 2026 at 09:34:18AM +0000, sebastien.stettler wrote:

> Has there been any exploration of ignoring whitespace in the similarity checker? I would
> assume that the majority of whitespace movements across many languages would result in a
> semantically similar document in most cases.

I don't think anybody has ever looked into it. We do have "-w" and
friends for diffs, and it makes sense that there might be some mode to
soften renames in the same way (especially if you are doing a "-w"
diff, or a merge that ignores whitespace).

The line you need to touch is probably this:

diff --git a/diffcore-delta.c b/diffcore-delta.c
index 2b7db39983..379f6010d3 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -147,6 +147,8 @@ static struct spanhash_top *hash_chars(struct repository *r,
 		/* Ignore CR in CRLF sequence if text */
 		if (is_text && c == '\r' && sz && *buf == '\n')
 			continue;
+		if (is_text && (c == ' ' || c == '\t'))
+			continue;
 
 		accum1 = (accum1 << 7) ^ (accum2 >> 25);
 		accum2 = (accum2 << 7) ^ (old_1 >> 25);

but:

  1. The option to ignore whitespace would need to be plumbed through
     the rest of the diffcore code.

  2. This concept probably throws off some other rename heuristics.
     E.g., I think we do a rough check that the sizes of the objects are
     not too far apart before even looking at the content. So you could
     construct a pathological case where the line "a\n" was changed to
     have a million spaces, and the files would look like they couldn't
     possibly be similar, even though they are identical when ignoring
     whitespace. I think in practice you could just ignore this, as sane
     cases would tend to have a reasonable ratio of content to
     whitespace changes.
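
To make the pathological case in point 2 concrete, here is a minimal
sketch of the kind of size pre-check described there. It is not copied
from Git: the function name, the SKETCH_MAX_SCORE scale, and the way the
threshold is expressed are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed score scale for this sketch only; not taken from the message. */
#define SKETCH_MAX_SCORE 60000.0

/*
 * Hypothetical pre-check: return 1 when the two sizes differ so much
 * that the pair could not reach minimum_score even if every byte of
 * the smaller blob were found in the larger one, else 0.
 */
static int sizes_too_far_apart(size_t src_size, size_t dst_size,
			       double minimum_score)
{
	size_t max_size = src_size > dst_size ? src_size : dst_size;
	size_t base_size = src_size > dst_size ? dst_size : src_size;
	size_t delta_size = max_size - base_size;

	assert(minimum_score >= 0 && minimum_score <= SKETCH_MAX_SCORE);

	/* The size difference alone already rules out a match. */
	return (double)max_size * (SKETCH_MAX_SCORE - minimum_score) <
	       (double)delta_size * SKETCH_MAX_SCORE;
}
```

With a 50% threshold (30000 on this scale), "a\n" at 2 bytes against the
same line padded with a million spaces is rejected on size alone, before
any content is hashed. A whitespace-ignoring mode would have to relax or
skip a check of this shape to catch that case.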

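For point 1 and the hunk itself, a standalone sketch of the idea may
help. It is deliberately not the real hash_chars(): fingerprint() and
its ignore_ws parameter are hypothetical, the hash is plain FNV-1a
rather than Git's span hashing, and the is_text gate is left out. It
only shows a caller-supplied flag deciding whether spaces and tabs
reach the hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of Git: fold a buffer into a 32-bit
 * FNV-1a hash. When ignore_ws is non-zero, spaces and tabs are skipped,
 * mirroring the two lines added in the hunk.
 */
static uint32_t fingerprint(const char *buf, size_t len, int ignore_ws)
{
	uint32_t h = 2166136261u; /* FNV-1a offset basis */
	size_t i;

	assert(buf != NULL || len == 0);

	for (i = 0; i < len; i++) {
		unsigned char c = (unsigned char)buf[i];

		/* Ignore CR in a CRLF sequence, as the existing code does */
		if (c == '\r' && i + 1 < len && buf[i + 1] == '\n')
			continue;
		/* The proposed addition: drop horizontal whitespace */
		if (ignore_ws && (c == ' ' || c == '\t'))
			continue;

		h ^= c;
		h *= 16777619u; /* FNV prime */
	}
	return h;
}
```

With ignore_ws set, "if x\n" and "  if  x\n" hash the same, which is the
effect a re-indented file would need in order to still count as a
rename. In the real code the flag would have to come from the diff
options rather than from a function argument.
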
-Peff

