* buggy smudge/clean of empty files
@ 2025-05-22 19:01 Joey Hess
2025-05-22 22:15 ` Jeff King
0 siblings, 1 reply; 3+ messages in thread
From: Joey Hess @ 2025-05-22 19:01 UTC (permalink / raw)
To: git
[-- Attachment #1: Type: text/plain, Size: 1118 bytes --]
git seems to be buggy in its handling of empty files when smudge/clean filters
are used.
I've attached a script setup_smudge_clean.sh, which configures a git
repository to use a very simple smudge and clean filter pair for all
files. The clean filter prepends a line "hi" to the file content, and the
smudge filter removes the line. There is nothing very special about this
smudge/clean, it's just a simple one for the sake of an example.
Here's the bug:
# git init repo
# cd repo
# sh ~/setup_smudge_clean.sh
# echo foo > foofile
# git add foofile
# git diff
# touch emptyfile
# git add emptyfile
# git commit -m added
# git status
On branch master
nothing to commit, working tree clean
# git diff
diff --git a/emptyfile b/emptyfile
--- a/emptyfile
+++ b/emptyfile
@@ -1 +0,0 @@
-hi
emptyfile is unchanged, as git status shows, so why is git diff displaying
a change?
It seems that git diff runs the clean filter (GIT_TRACE shows it does),
but it must ignore its output when the file is empty, and always use an empty
file as the current content for the diff. Which differs from what was staged.
--
see shy jo
[-- Attachment #2: setup_smudge_clean.sh --]
[-- Type: application/x-sh, Size: 321 bytes --]
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: buggy smudge/clean of empty files
2025-05-22 19:01 buggy smudge/clean of empty files Joey Hess
@ 2025-05-22 22:15 ` Jeff King
2025-05-22 22:26 ` Junio C Hamano
0 siblings, 1 reply; 3+ messages in thread
From: Jeff King @ 2025-05-22 22:15 UTC (permalink / raw)
To: Joey Hess; +Cc: git
On Thu, May 22, 2025 at 03:01:38PM -0400, Joey Hess wrote:
> git seems to be buggy in its handling of empty files when smudge/clean filters
> are used.
>
> I've attached a script setup_smudge_clean.sh, which configures a git
> repository to use a very simple smudge and clean filter pair for all
> files. The clean filter prepends a line "hi" to the file content, and the
> smudge filter removes the line. There is nothing very special about this
> smudge/clean, it's just a simple one for the sake of an example.
>
> Here's the bug:
Thanks for a reproducible example. Running it through the debugger, I'd
guess the problem is in ce_match_stat_basic(), specifically this bit:
/* Racily smudged entry? */
if (!ce->ce_stat_data.sd_size) {
if (!is_empty_blob_oid(&ce->oid, the_repository->hash_algo))
changed |= DATA_CHANGED;
}
That comes from f49c2c22fe (racy-git: an empty blob has a fixed object
name, 2008-06-10), which says:
We use size=0 as the magic token to say the entry is known to be racily
clean, but a sequence that does:
- update the path with a non-empty blob and write the index;
- update an unrelated path and write the index -- this smudges
the above entry;
- truncate the path to size zero.
would make both the size field for the path in the index and the size on
the filesystem zero. We should not mistake it as a clean index entry.
but I suspect the is_empty_blob_oid() check is out of date for a world
with clean/smudge filters. The blob content inside the repository is
going to be "hi\n" in this case, so we will mark it as DATA_CHANGED. But
what we really want to know is: when smudged for the worktree, is the
content expected to be empty?
Something like the patch below, but it feels very dirty.
I wondered if we might be able to just catch these cases later in
diffcore (like we do for other stat-unmatch cases), but I do think this
conditional has false positives and false negatives. Your case is
confusing an empty file in the worktree which fails to match its smudged
content. But the opposite one is where the file should have content in
the worktree (due to smudging), but is cleaned to empty inside the
repository.
So I dunno. I'm hoping somebody more familiar with the index and/or
clean/smudge conversions can show a better way.
-Peff
diff --git a/read-cache.c b/read-cache.c
index 73f83a7e7a..0f19440514 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -48,6 +48,7 @@
#include "csum-file.h"
#include "promisor-remote.h"
#include "hook.h"
+#include "convert.h"
/* Mask for the name length in ce_flags in the on-disk index */
@@ -342,8 +343,37 @@ static int ce_match_stat_basic(const struct cache_entry *ce, struct stat *st)
/* Racily smudged entry? */
if (!ce->ce_stat_data.sd_size) {
- if (!is_empty_blob_oid(&ce->oid, the_repository->hash_algo))
+ /*
+ * Yuck, we'd really like to be able to ask if there is any
+ * conversion configured so we can just check the oid in
+ * the common non-smudge case. But there is no worktree
+ * equivalent to would_convert_to_git().
+ *
+ * It would not be correct to check is_empty_blob_oid() first
+ * here (and skip the more expensive check). I think that would
+ * be wrong for cases where the clean in-repo blob is empty,
+ * but the smudged version has data.
+ */
+ char *data;
+ unsigned long len;
+ enum object_type type;
+ struct strbuf expected_wt = STRBUF_INIT;
+
+ /*
+ * skip error handling for this example. What would we do? Set
+ * DATA_CHANGED pessimistically?
+ */
+ data = repo_read_object_file(the_repository,
+ &ce->oid,
+ &type, &len);
+ convert_to_working_tree(the_repository->index,
+ ce->name, data, len,
+ &expected_wt, NULL);
+
+ if (expected_wt.len)
changed |= DATA_CHANGED;
+ strbuf_release(&expected_wt);
+ free(data);
}
return changed;
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: buggy smudge/clean of empty files
2025-05-22 22:15 ` Jeff King
@ 2025-05-22 22:26 ` Junio C Hamano
0 siblings, 0 replies; 3+ messages in thread
From: Junio C Hamano @ 2025-05-22 22:26 UTC (permalink / raw)
To: Jeff King; +Cc: Joey Hess, git
Jeff King <peff@peff.net> writes:
> So I dunno. I'm hoping somebody more familiar with the index and/or
> clean/smudge conversions can show a better way.
My initial reaction was "well, do not do it then---an emptyness
smudges to and cleans to an emptyness and that is either by design,
or is a known limitation of the design of smudge/clean".
If somebody can show a clean way to lift that assumption buried deep
inside the design, that would indeed be nice ;-)
Thanks.
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2025-05-22 22:26 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-22 19:01 buggy smudge/clean of empty files Joey Hess
2025-05-22 22:15 ` Jeff King
2025-05-22 22:26 ` Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).