From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Taylor Blau <me@ttaylorr.com>, Elijah Newren <newren@gmail.com>
Subject: [PATCH v2] diffcore-delta: avoid ignoring final 'line' of file
Date: Sat, 13 Jan 2024 04:26:13 +0000
Message-ID: <pull.1637.v2.git.1705119973690.gitgitgadget@gmail.com>
In-Reply-To: <pull.1637.git.1705006074626.gitgitgadget@gmail.com>

From: Elijah Newren <newren@gmail.com>

hash_chars() hashes each 'line' of a file to an integer and records it
in a spanhash table, where a 'line' ends at a newline or after 64
characters, whichever comes first.  The problem is that the final part
of the file might neither end with a newline nor be exactly 64
characters long, in which case it was never recorded and was ignored
when computing similarity.  This could, for example, cause an 85-byte
file with 12 lines and only the first character in the file differing
to appear merely 23% similar rather than the expected 97%.  Ensure the
final 'line' is included, and add a testcase that would have caught
this problem.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    diffcore-delta: avoid ignoring final 'line' of file
    
    Found while experimenting with converting portions of diffcore-delta to
    Rust.
    
    Changes since v1:
    
     * Removed the unnecessary similarity threshold specification
     * Munged the discovered similarity percentage so we are only checking
       that a rename is detected
     * Fixed up filenames

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1637%2Fnewren%2Ffix-diffcore-final-line-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1637/newren/fix-diffcore-final-line-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1637

Range-diff vs v1:

 1:  f62b22a54c3 ! 1:  e0223bbfeb5 diffcore-delta: avoid ignoring final 'line' of file
     @@ t/t4001-diff-rename.sh: test_expect_success 'basename similarity vs best similar
       '
       
      +test_expect_success 'last line matters too' '
     -+	test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline &&
     -+	printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
     -+	git add nonewline &&
     ++	{
     ++		test_write_lines a 0 1 2 3 4 5 6 7 8 9 &&
     ++		printf "git ignores final up to 63 characters if not newline terminated"
     ++	} >no-final-lf &&
     ++	git add no-final-lf &&
      +	git commit -m "original version of file with no final newline" &&
      +
      +	# Change ONLY the first character of the whole file
     -+	test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline &&
     -+	printf "git ignores final up to 63 characters if not newline terminated" >>nonewline &&
     -+	git add nonewline &&
     -+	git mv nonewline still-no-newline &&
     -+	git commit -a -m "rename nonewline -> still-no-newline" &&
     -+	git diff-tree -r -M01 --name-status HEAD^ HEAD >actual &&
     ++	{
     ++		test_write_lines b 0 1 2 3 4 5 6 7 8 9 &&
     ++		printf "git ignores final up to 63 characters if not newline terminated"
     ++	} >no-final-lf &&
     ++	git add no-final-lf &&
     ++	git mv no-final-lf still-absent-final-lf &&
     ++	git commit -a -m "rename no-final-lf -> still-absent-final-lf" &&
     ++	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
     ++	sed -e "s/^R[0-9]*	/R	/" actual >actual.munged &&
      +	cat >expected <<-\EOF &&
     -+	R097	nonewline	still-no-newline
     ++	R	no-final-lf	still-absent-final-lf
      +	EOF
     -+	test_cmp expected actual
     ++	test_cmp expected actual.munged
      +'
      +
       test_done


 diffcore-delta.c       |  4 ++++
 t/t4001-diff-rename.sh | 24 ++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/diffcore-delta.c b/diffcore-delta.c
index c30b56e983b..7136c3dd203 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -159,6 +159,10 @@ static struct spanhash_top *hash_chars(struct repository *r,
 		n = 0;
 		accum1 = accum2 = 0;
 	}
+	if (n > 0) {
+		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
+		hash = add_spanhash(hash, hashval, n);
+	}
 	QSORT(hash->data, (size_t)1ul << hash->alloc_log2, spanhash_cmp);
 	return hash;
 }
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index 85be1367de6..49c042a38ae 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -286,4 +286,28 @@ test_expect_success 'basename similarity vs best similarity' '
 	test_cmp expected actual
 '
 
+test_expect_success 'last line matters too' '
+	{
+		test_write_lines a 0 1 2 3 4 5 6 7 8 9 &&
+		printf "git ignores final up to 63 characters if not newline terminated"
+	} >no-final-lf &&
+	git add no-final-lf &&
+	git commit -m "original version of file with no final newline" &&
+
+	# Change ONLY the first character of the whole file
+	{
+		test_write_lines b 0 1 2 3 4 5 6 7 8 9 &&
+		printf "git ignores final up to 63 characters if not newline terminated"
+	} >no-final-lf &&
+	git add no-final-lf &&
+	git mv no-final-lf still-absent-final-lf &&
+	git commit -a -m "rename no-final-lf -> still-absent-final-lf" &&
+	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
+	sed -e "s/^R[0-9]*	/R	/" actual >actual.munged &&
+	cat >expected <<-\EOF &&
+	R	no-final-lf	still-absent-final-lf
+	EOF
+	test_cmp expected actual.munged
+'
+
 test_done

base-commit: 055bb6e9969085777b7fab83e3fee0017654f134
-- 
gitgitgadget

Thread overview: 10+ messages
2024-01-11 20:47 [PATCH] diffcore-delta: avoid ignoring final 'line' of file Elijah Newren via GitGitGadget
2024-01-11 21:45 ` Taylor Blau
2024-01-11 23:00 ` Junio C Hamano
2024-01-13  1:45   ` Elijah Newren
2024-01-13  6:21     ` Junio C Hamano
2024-01-19  1:54       ` Elijah Newren
2024-01-19  3:06         ` Junio C Hamano
2024-01-19  5:05           ` Elijah Newren
2024-01-19  6:27             ` Junio C Hamano
2024-01-13  4:26 ` Elijah Newren via GitGitGadget [this message]
