* An interaction with ce_match_stat_basic() and autocrlf
@ 2008-01-08 12:12 Junio C Hamano
2008-01-08 16:10 ` Linus Torvalds
2008-01-08 17:12 ` Pēteris Kļaviņš
0 siblings, 2 replies; 6+ messages in thread
From: Junio C Hamano @ 2008-01-08 12:12 UTC (permalink / raw)
To: torvalds; +Cc: git
There is an interesting interaction with the stat matching and
autocrlf.
$ git init
$ git config core.autocrlf true
$ echo a >a.txt
$ git add a.txt
$ unix2dos a.txt
$ git diff
diff --git a/a.txt b/a.txt
At this point, the index records a blob with LF line ending,
while the work tree file has the same content with CRLF line
ending. And the funny thing is that once you get into this
situation it is unfixable short of "git add a.txt". Most
notably, "git update-index --refresh" (and the equilvalent
auto-refresh that is implicitly run by "git diff" Porcelain)
will not update the cached stat information.
This is caused partly by the breakage in size_only codepath of
diff.c::diff_populate_filespec(). When taking the file contents
from the work tree, it just gets stat data and thinks it got the
final size, but it should actually convert the blob data into
canonical format. diff.c::diffcore_skip_stat_unmatch() is
fooled by this and declares that the path is modified.
This can be fixed by not returning early even when size_only is
asked in the codepath. It will make everything quite a lot more
expensive, as there currently is not a cheap way to ask "is this
path going to be munged by autocrlf or clean filter", but
getting the correct result is more important than getting a
quick but wrong result.
But that is just a half of the story.
(1) It won't make the entry stat clean, as refresh_index()
later called from builtin-diff.c to clean up the stat
dirtiness works without paying attention to the autocrlf
conversion.
(2) It won't help lower-level diff-files and internal callers
to ce_match_stat() that checks if the path were touched.
The "read-tree -m -u" codepath uses it to avoid touching
the path with local modifications. The standard way to
clear the stat-dirtiness with "git update-index --refresh"
still needs to be fixed anyway.
I was going to conclude this message by saying "I need to sleep
on this to see if I can come up with a clean solution", but it
appears I do not have much time left for actually sleeping X-<.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: An interaction with ce_match_stat_basic() and autocrlf
2008-01-08 12:12 An interaction with ce_match_stat_basic() and autocrlf Junio C Hamano
@ 2008-01-08 16:10 ` Linus Torvalds
2008-01-08 18:04 ` Junio C Hamano
2008-01-10 2:11 ` Junio C Hamano
2008-01-08 17:12 ` Pēteris Kļaviņš
1 sibling, 2 replies; 6+ messages in thread
From: Linus Torvalds @ 2008-01-08 16:10 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Tue, 8 Jan 2008, Junio C Hamano wrote:
>
> This is caused partly by the breakage in size_only codepath of
> diff.c::diff_populate_filespec().
Only partially.
The more fundamental behaviour (that of git update-index) is caused by
ie_modified() thinking that when DATA_CHANGED is true, it cannot possibly
need to call "ce_modified_check_fs()":
>From ie_modified():
/* Immediately after read-tree or update-index --cacheinfo,
* the length field is zero. For other cases the ce_size
* should match the SHA1 recorded in the index entry.
*/
if ((changed & DATA_CHANGED) && ce->ce_size != htonl(0))
return changed;
and that DATA_CHANGED comes from ce_match_stat_basic() which notices that
the size has changed.
Similarly, I think that the problem with "diff" not realizing they might
be the same comes from ie_match_stat(), which has a similar problem in not
realizing that DATA_CHANGED could possibly still mean that it's the same.
This patch should fix it, but I suspect we should think hard about that
change to ie_modified(), and see what the performance issues are (ie that
code has tried to avoid doing the more expensive ce_modified_check_fs()
for a reason).
The change to diff.c is similarly interesting. It is logically wrong to
use the worktree_file there (since we have to read the object anyway), but
since "reuse_worktree_file" is also tied into the whole refresh logic, I
think the diff.c change is correct.
I dunno. This is not meant to be applied, it is meant to be thought about.
Linus
---
diff.c | 2 +-
read-cache.c | 2 ++
2 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/diff.c b/diff.c
index b18c140..9f699b7 100644
--- a/diff.c
+++ b/diff.c
@@ -1512,7 +1512,7 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int
ce = active_cache[pos];
if ((lstat(name, &st) < 0) ||
!S_ISREG(st.st_mode) || /* careful! */
- ce_match_stat(ce, &st, 0) ||
+ ce_modified(ce, &st, 0) ||
hashcmp(sha1, ce->sha1))
return 0;
/* we return 1 only when we can stat, it is a regular file,
diff --git a/read-cache.c b/read-cache.c
index 7db5588..e1fc880 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -253,12 +253,14 @@ int ie_modified(struct index_state *istate,
if (changed & (MODE_CHANGED | TYPE_CHANGED))
return changed;
+#if 0
/* Immediately after read-tree or update-index --cacheinfo,
* the length field is zero. For other cases the ce_size
* should match the SHA1 recorded in the index entry.
*/
if ((changed & DATA_CHANGED) && ce->ce_size != htonl(0))
return changed;
+#endif
changed_fs = ce_modified_check_fs(ce, st);
if (changed_fs)
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: An interaction with ce_match_stat_basic() and autocrlf
2008-01-08 12:12 An interaction with ce_match_stat_basic() and autocrlf Junio C Hamano
2008-01-08 16:10 ` Linus Torvalds
@ 2008-01-08 17:12 ` Pēteris Kļaviņš
2008-01-08 17:30 ` Linus Torvalds
1 sibling, 1 reply; 6+ messages in thread
From: Pēteris Kļaviņš @ 2008-01-08 17:12 UTC (permalink / raw)
To: git
> At this point, the index records a blob with LF line ending,
> while the work tree file has the same content with CRLF line
> ending.
I think this needs more than just sleeping on.
There are two separate problems related to crlf treatment in git that
manifest themselves in the quirks you see in the current implementation:
(1) The fact that the index may be misaligned with the work tree. Junio's
example demonstrates this well. I have resorted to
$ rm -rf *
$ git reset --hard
in the past to get a work tree that passes
$ git status
without false positives after changing the value of autocrlf.
(2) The fact that repository content may be mangled in an indeterminate way
because of the current work tree <-> repository transformation algorithm.
While criticism in the past has mainly been levelled at not knowing whether
a truly binary file will be correctly determined as such, content can be
lost in the round trip work tree -> repository -> work tree much more
simply:
$ git init
$ git config core.autocrlf true
$ echo ab | tr ab \\r\\n >a.txt
$ od -t a a.txt
0000000 cr nl nl
0000003
$ git add a.txt
$ git commit
$ rm a.txt
$ git reset --hard
$ od -t a a.txt
0000000 cr nl cr nl
0000004
In summary, it irks me that autocrlf true mode is a second cousin of
autocrlf false and I think that there *should* be an acceptable
deterministic solution to this.
The solution to (2) seems easier than (1): could the transformation
algorithm be made deterministic and changed to something like "convert all
crlf pairs to lf if and only if no singleton cr or lf exist in the file
before conversion"? If a binary file gets mangled in error, it would be an
easy transformation with standard tools to get the file back again. If an
otherwise text file has mixed lf and crlf endings, or additional cr or lf
sprinkled randomly through it, the file is not transformed.
Given a deterministic transformation algorithm, the solution to (1) boils
down to recording for each file in the work tree whether the transformation
algorithm was used or not in arriving at the file's current contents,
together with a way of telling git to force the use of the transformation
algorithm or not for a particular file. It seems to me the place that this
information *should* be recorded is the index, given that both .git/config
and .gitattributes can be changed independently of the work tree. Recording
the information in the index would mean that both autocrlf true and autocrlf
false clones of the same repository would produce equally valid work trees
with no loss of information. I am however not well versed enough in git
internals at the moment to know whether this is an acceptable solution or
not.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: An interaction with ce_match_stat_basic() and autocrlf
2008-01-08 17:12 ` Pēteris Kļaviņš
@ 2008-01-08 17:30 ` Linus Torvalds
0 siblings, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2008-01-08 17:30 UTC (permalink / raw)
To: Pēteris Kļaviņš; +Cc: git
On Tue, 8 Jan 2008, Pēteris Kļaviņš wrote:
>
> In summary, it irks me that autocrlf true mode is a second cousin of autocrlf
> false and I think that there *should* be an acceptable deterministic solution
> to this.
Well, I think the real issue is simply that most the main git developers
do development on architectures where CRLF just isn't an issue.
So it's not that autocrlf is a "second cousin", it's that
- CRLF is stupid to begin with, and slightly anathemical to the git
worldview of trying to be as exact as possible.
- ..and almost nobody in the git community is actually affected, so
people don't even notice when it's an issue.
People who actually care and use crlf are probably best off sending in
test-cases for particular behaviour they notice.
Linus
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: An interaction with ce_match_stat_basic() and autocrlf
2008-01-08 16:10 ` Linus Torvalds
@ 2008-01-08 18:04 ` Junio C Hamano
2008-01-10 2:11 ` Junio C Hamano
1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2008-01-08 18:04 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Tue, 8 Jan 2008, Junio C Hamano wrote:
>>
>> This is caused partly by the breakage in size_only codepath of
>> diff.c::diff_populate_filespec().
>
> Only partially.
Agreed. That's why it is "just a half of the story".
> The more fundamental behaviour (that of git update-index) is caused by
> ie_modified() thinking that when DATA_CHANGED is true, it cannot possibly
> need to call "ce_modified_check_fs()":
> ...
> Similarly, I think that the problem with "diff" not realizing they might
> be the same comes from ie_match_stat(), which has a similar problem in not
> realizing that DATA_CHANGED could possibly still mean that it's the same.
Yes, I think your patch to ie_modified() should take care of the
issue from the diff-files front-end side, which is the right
approach. The optimization diffcore_populate_filespec() makes
when asked to do size_only, which predates the addition of
convert_to_git(), needs to be updated regardless, though. The
size field in diffcore_filespec is never about on-filesystem
size.
> This patch should fix it, but I suspect we should think hard about that
> change to ie_modified(), and see what the performance issues are (ie that
> code has tried to avoid doing the more expensive ce_modified_check_fs()
> for a reason).
I think the reason was I simply avoided doing any unnecessary
operation that goes to the filesystem. We did not even have
that modified_check_fs() code before the racy-git safety, and
when I added it I do not think I benched it with a real-life
workload; the logic there was simply a valid optimization back
then.
It is not anymore. Addition of convert_to_git() made cached
stat info essentially ineffective in the sense that:
(1) if a user changes the work tree files in such a way that
does not change convert_to_git() output, the index will say
"file contents in external representation has definitely
changed, the sizes no longer match". We need to actually
go to the data to find out that there is no change at the
canonical level.
(2) if a user changes the crlf setting (or .gitattributes)
without touching the work tree files, the index will say
"unchanged and do not have to compare". We need to
actually go to the data to find out that they do not match
anymore.
The latter is an opposite issue of what I brought up in this
thread. I personally do not want to "fix" it --- it means
destroying one of the most important optimizations. The use
case is essentially a one-shot operation for a user to "fix" a
broken crlf setting, and having to re-checkout everything is a
small cost to pay to maintain it.
But the former is something we should be able to deal with
sanely.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: An interaction with ce_match_stat_basic() and autocrlf
2008-01-08 16:10 ` Linus Torvalds
2008-01-08 18:04 ` Junio C Hamano
@ 2008-01-10 2:11 ` Junio C Hamano
1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2008-01-10 2:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@linux-foundation.org> writes:
> This patch should fix it, but I suspect we should think hard about that
> change to ie_modified(), and see what the performance issues are (ie that
> code has tried to avoid doing the more expensive ce_modified_check_fs()
> for a reason).
>
> The change to diff.c is similarly interesting. It is logically wrong to
> use the worktree_file there (since we have to read the object anyway), but
> since "reuse_worktree_file" is also tied into the whole refresh logic, I
> think the diff.c change is correct.
>
> I dunno. This is not meant to be applied, it is meant to be thought about.
There are a few cases around the changing value of autocrlf (and
filter attributes --- anything that affects convert_to_git() and
convert_to_working_tree()).
* The cached stat information matches the work tree, but user
changed convert_to_working_tree(). "git diff" reports
nothing. The user needs to remove the work tree file and
check it out again.
* The cached stat information matches the work tree, but user
changed convert_to_git(). Again, diff reports nothing. The
user needs to "git add" to cause rehashing.
* The cached stat information does not match. What the working
tree file stores hasn't changed, but convert_to_git() was
changed.
The fact that the working tree "file" contents did not change
does not have much significance in this case. What defines
the "contents" as far as git is concerned is the combination
of the working tree file contents _and_ what convert_to_git()
does to it.
Depending on the nature of the change to convert_to_git(),
"git diff-files" may or may not report real changes in this
case.
* The working tree file has changed, and convert_to_git() also
has changed.
Depending on the nature of the change to convert_to_git(),
"git diff" may or may not report change in this case. The
most extreme case is when unix2dos is run on the working tree
file and convert_to_git() is made to strip CR. The object
registered in the index won't change in this case.
But in practice, the most problematic case also falls into
this category. The user has _real_ changes to the work tree
file, but at the same time flipped convert_to_git() to
operate differently from before. Users should not be making
such a change, not because of git, but because a commit like
that will be impossible to review (and understand three
months later while archaeologying).
The ie_modified() change you suggested will not be hurt by the
first two cases (which I see are one-shot events and re-checkout
and re-add are good enough solution to them, and I do not want
them to hurt the performance for normal use cases).
I originally thought it was a _bug_, but I suspect the false
positive changes reported by "git diff" is even a good thing.
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2008-01-10 2:12 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-08 12:12 An interaction with ce_match_stat_basic() and autocrlf Junio C Hamano
2008-01-08 16:10 ` Linus Torvalds
2008-01-08 18:04 ` Junio C Hamano
2008-01-10 2:11 ` Junio C Hamano
2008-01-08 17:12 ` Pēteris Kļaviņš
2008-01-08 17:30 ` Linus Torvalds
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).