* An interaction with ce_match_stat_basic() and autocrlf
From: Junio C Hamano @ 2008-01-08 12:12 UTC
To: torvalds; +Cc: git

There is an interesting interaction with the stat matching and
autocrlf.

    $ git init
    $ git config core.autocrlf true
    $ echo a >a.txt
    $ git add a.txt
    $ unix2dos a.txt
    $ git diff
    diff --git a/a.txt b/a.txt

At this point, the index records a blob with LF line ending,
while the work tree file has the same content with CRLF line
ending.

And the funny thing is that once you get into this situation it
is unfixable short of "git add a.txt". Most notably, "git
update-index --refresh" (and the equivalent auto-refresh that is
implicitly run by "git diff" Porcelain) will not update the
cached stat information.

This is caused partly by the breakage in size_only codepath of
diff.c::diff_populate_filespec(). When taking the file contents
from the work tree, it just gets stat data and thinks it got the
final size, but it should actually convert the blob data into
canonical format. diff.c::diffcore_skip_stat_unmatch() is
fooled by this and declares that the path is modified.

This can be fixed by not returning early even when size_only is
asked in the codepath. It will make everything quite a lot more
expensive, as there currently is not a cheap way to ask "is this
path going to be munged by autocrlf or clean filter", but
getting the correct result is more important than getting a
quick but wrong result.

But that is just a half of the story.

 (1) It won't make the entry stat clean, as refresh_index()
     later called from builtin-diff.c to clean up the stat
     dirtiness works without paying attention to the autocrlf
     conversion.

 (2) It won't help lower-level diff-files and internal callers
     of ce_match_stat() that check whether the path was touched.
     The "read-tree -m -u" codepath uses it to avoid touching
     the path with local modifications.

The standard way to clear the stat-dirtiness with "git
update-index --refresh" still needs to be fixed anyway.

I was going to conclude this message by saying "I need to sleep
on this to see if I can come up with a clean solution", but it
appears I do not have much time left for actually sleeping X-<.
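The size mismatch described above can be illustrated with a minimal C sketch. It is not git's code: canonical_size(), size_differs_raw() and size_differs_canonical() are hypothetical names, and the conversion is assumed to be a plain "collapse every CRLF pair to LF" rule, which is simpler than what convert_to_git() actually does.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not git's convert_to_git()): return how many
 * bytes remain once every CRLF pair is collapsed to a single LF.
 */
size_t canonical_size(const char *buf, size_t len)
{
	size_t i, out = 0;

	for (i = 0; i < len; i++) {
		/* Drop a CR only when it is immediately followed by LF. */
		if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n')
			continue;
		out++;
	}
	return out;
}

/* Size-only check against the raw on-disk size (what stat reports). */
int size_differs_raw(size_t blob_size, size_t file_size)
{
	return blob_size != file_size;
}

/* Size-only check against the size after the assumed conversion. */
int size_differs_canonical(size_t blob_size, const char *buf, size_t len)
{
	return blob_size != canonical_size(buf, len);
}
```

For the a.txt example, the blob is "a\n" (2 bytes) and the work tree file after unix2dos is "a\r\n" (3 bytes): the raw check reports a difference, while the check on the converted size does not. The price is that the file has to be read, which is the cost mentioned in the message.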
* Re: An interaction with ce_match_stat_basic() and autocrlf
From: Linus Torvalds @ 2008-01-08 16:10 UTC
To: Junio C Hamano; +Cc: git

On Tue, 8 Jan 2008, Junio C Hamano wrote:
>
> This is caused partly by the breakage in size_only codepath of
> diff.c::diff_populate_filespec().

Only partially.

The more fundamental behaviour (that of git update-index) is caused by
ie_modified() thinking that when DATA_CHANGED is true, it cannot possibly
need to call "ce_modified_check_fs()":

From ie_modified():

	/* Immediately after read-tree or update-index --cacheinfo,
	 * the length field is zero.  For other cases the ce_size
	 * should match the SHA1 recorded in the index entry.
	 */
	if ((changed & DATA_CHANGED) && ce->ce_size != htonl(0))
		return changed;

and that DATA_CHANGED comes from ce_match_stat_basic() which notices that
the size has changed.

Similarly, I think that the problem with "diff" not realizing they might
be the same comes from ie_match_stat(), which has a similar problem in not
realizing that DATA_CHANGED could possibly still mean that it's the same.

This patch should fix it, but I suspect we should think hard about that
change to ie_modified(), and see what the performance issues are (ie that
code has tried to avoid doing the more expensive ce_modified_check_fs()
for a reason).

The change to diff.c is similarly interesting. It is logically wrong to
use the worktree_file there (since we have to read the object anyway), but
since "reuse_worktree_file" is also tied into the whole refresh logic, I
think the diff.c change is correct.

I dunno. This is not meant to be applied, it is meant to be thought about.

		Linus

---
 diff.c       |    2 +-
 read-cache.c |    2 ++
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/diff.c b/diff.c
index b18c140..9f699b7 100644
--- a/diff.c
+++ b/diff.c
@@ -1512,7 +1512,7 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int
 	ce = active_cache[pos];
 	if ((lstat(name, &st) < 0) ||
 	    !S_ISREG(st.st_mode) || /* careful! */
-	    ce_match_stat(ce, &st, 0) ||
+	    ce_modified(ce, &st, 0) ||
 	    hashcmp(sha1, ce->sha1))
 		return 0;
 	/* we return 1 only when we can stat, it is a regular file,
diff --git a/read-cache.c b/read-cache.c
index 7db5588..e1fc880 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -253,12 +253,14 @@ int ie_modified(struct index_state *istate,
 	if (changed & (MODE_CHANGED | TYPE_CHANGED))
 		return changed;
 
+#if 0
 	/* Immediately after read-tree or update-index --cacheinfo,
 	 * the length field is zero.  For other cases the ce_size
 	 * should match the SHA1 recorded in the index entry.
 	 */
 	if ((changed & DATA_CHANGED) && ce->ce_size != htonl(0))
 		return changed;
+#endif
 
 	changed_fs = ce_modified_check_fs(ce, st);
 	if (changed_fs)
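The effect of disabling the early return can be sketched with a toy model. Everything here is an assumption made for illustration: struct toy_entry, the toy_* functions and the flag value are hypothetical, the content check assumes a plain CRLF-to-LF conversion, and the control flow is only loosely modelled on the ie_modified() excerpt quoted in the message.

```c
#include <stddef.h>

#define TOY_DATA_CHANGED 0x1	/* illustrative flag value, not git's */

/* Hypothetical stand-in for an index entry. */
struct toy_entry {
	const char *blob;	/* canonical (LF) content in the index */
	size_t blob_len;
	size_t cached_size;	/* file size recorded when the path was added */
};

/* Stat-level check: compares the cached size with the current file size. */
int toy_match_stat(const struct toy_entry *e, size_t file_len)
{
	return e->cached_size != file_len ? TOY_DATA_CHANGED : 0;
}

/* Content-level check: compares the blob with the file after CRLF->LF. */
int toy_check_fs(const struct toy_entry *e, const char *file, size_t file_len)
{
	size_t i, j = 0;

	for (i = 0; i < file_len; i++) {
		if (file[i] == '\r' && i + 1 < file_len && file[i + 1] == '\n')
			continue;	/* CR of a CRLF pair is dropped */
		if (j >= e->blob_len || e->blob[j] != file[i])
			return TOY_DATA_CHANGED;
		j++;
	}
	return j == e->blob_len ? 0 : TOY_DATA_CHANGED;
}

/* With the early return: a size change is taken as proof of modification. */
int toy_modified_early_return(const struct toy_entry *e,
			      const char *file, size_t file_len)
{
	int changed = toy_match_stat(e, file_len);

	if (!changed)
		return 0;
	if ((changed & TOY_DATA_CHANGED) && e->cached_size != 0)
		return changed;
	return toy_check_fs(e, file, file_len);
}

/* With the early return disabled: a stat mismatch leads to a content check. */
int toy_modified_check_content(const struct toy_entry *e,
			       const char *file, size_t file_len)
{
	if (!toy_match_stat(e, file_len))
		return 0;
	return toy_check_fs(e, file, file_len);
}
```

With an entry holding "a\n" and a work tree file "a\r\n", the first variant reports a change and the second does not; a real edit such as "b\r\n" is still reported by both. The second variant reads the file whenever the stat data disagrees, which is the performance question raised in the message.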
* Re: An interaction with ce_match_stat_basic() and autocrlf
From: Junio C Hamano @ 2008-01-08 18:04 UTC
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, 8 Jan 2008, Junio C Hamano wrote:
>>
>> This is caused partly by the breakage in size_only codepath of
>> diff.c::diff_populate_filespec().
>
> Only partially.

Agreed. That's why it is "just a half of the story".

> The more fundamental behaviour (that of git update-index) is caused by
> ie_modified() thinking that when DATA_CHANGED is true, it cannot possibly
> need to call "ce_modified_check_fs()":
> ...
> Similarly, I think that the problem with "diff" not realizing they might
> be the same comes from ie_match_stat(), which has a similar problem in not
> realizing that DATA_CHANGED could possibly still mean that it's the same.

Yes, I think your patch to ie_modified() should take care of the
issue from the diff-files front-end side, which is the right
approach.

The optimization diff_populate_filespec() makes when asked to do
size_only, which predates the addition of convert_to_git(), needs
to be updated regardless, though. The size field in diff_filespec
is never about on-filesystem size.

> This patch should fix it, but I suspect we should think hard about that
> change to ie_modified(), and see what the performance issues are (ie that
> code has tried to avoid doing the more expensive ce_modified_check_fs()
> for a reason).

I think the reason was I simply avoided doing any unnecessary
operation that goes to the filesystem. We did not even have
that ce_modified_check_fs() code before the racy-git safety, and
when I added it I do not think I benched it with a real-life
workload; the logic there was simply a valid optimization back
then.

It is not anymore. Addition of convert_to_git() made cached
stat info essentially ineffective in the sense that:

 (1) if a user changes the work tree files in such a way that
     does not change convert_to_git() output, the index will say
     "file contents in external representation has definitely
     changed, the sizes no longer match". We need to actually
     go to the data to find out that there is no change at the
     canonical level.

 (2) if a user changes the crlf setting (or .gitattributes)
     without touching the work tree files, the index will say
     "unchanged and do not have to compare". We need to
     actually go to the data to find out that they do not match
     anymore.

The latter is an opposite issue of what I brought up in this
thread. I personally do not want to "fix" it --- it means
destroying one of the most important optimizations. The use
case is essentially a one-shot operation for a user to "fix" a
broken crlf setting, and having to re-checkout everything is a
small cost to pay to maintain it.

But the former is something we should be able to deal with
sanely.
* Re: An interaction with ce_match_stat_basic() and autocrlf
From: Junio C Hamano @ 2008-01-10 2:11 UTC
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> This patch should fix it, but I suspect we should think hard about that
> change to ie_modified(), and see what the performance issues are (ie that
> code has tried to avoid doing the more expensive ce_modified_check_fs()
> for a reason).
>
> The change to diff.c is similarly interesting. It is logically wrong to
> use the worktree_file there (since we have to read the object anyway), but
> since "reuse_worktree_file" is also tied into the whole refresh logic, I
> think the diff.c change is correct.
>
> I dunno. This is not meant to be applied, it is meant to be thought about.

There are a few cases around the changing value of autocrlf (and
filter attributes --- anything that affects convert_to_git() and
convert_to_working_tree()).

 * The cached stat information matches the work tree, but user
   changed convert_to_working_tree(). "git diff" reports
   nothing. The user needs to remove the work tree file and
   check it out again.

 * The cached stat information matches the work tree, but user
   changed convert_to_git(). Again, diff reports nothing. The
   user needs to "git add" to cause rehashing.

 * The cached stat information does not match. What the working
   tree file stores hasn't changed, but convert_to_git() was
   changed.

   The fact that the working tree "file" contents did not change
   does not have much significance in this case. What defines
   the "contents" as far as git is concerned is the combination
   of the working tree file contents _and_ what convert_to_git()
   does to it. Depending on the nature of the change to
   convert_to_git(), "git diff-files" may or may not report real
   changes in this case.

 * The working tree file has changed, and convert_to_git() also
   has changed.

   Depending on the nature of the change to convert_to_git(),
   "git diff" may or may not report change in this case. The
   most extreme case is when unix2dos is run on the working tree
   file and convert_to_git() is made to strip CR. The object
   registered in the index won't change in this case.

   But in practice, the most problematic case also falls into
   this category. The user has _real_ changes to the work tree
   file, but at the same time flipped convert_to_git() to
   operate differently from before. Users should not be making
   such a change, not because of git, but because a commit like
   that will be impossible to review (and understand three
   months later while archaeologying).

The ie_modified() change you suggested will not be hurt by the
first two cases (which I see as one-shot events, and re-checkout
and re-add are a good enough solution to them; I do not want
them to hurt the performance of normal use cases).

I originally thought it was a _bug_, but I suspect the false
positive changes reported by "git diff" are even a good thing.
* Re: An interaction with ce_match_stat_basic() and autocrlf
From: Pēteris Kļaviņš @ 2008-01-08 17:12 UTC
To: git

> At this point, the index records a blob with LF line ending,
> while the work tree file has the same content with CRLF line
> ending.

I think this needs more than just sleeping on. There are two separate
problems related to crlf treatment in git that manifest themselves in the
quirks you see in the current implementation:

(1) The fact that the index may be misaligned with the work tree. Junio's
example demonstrates this well. I have resorted to

    $ rm -rf *
    $ git reset --hard

in the past to get a work tree that passes

    $ git status

without false positives after changing the value of autocrlf.

(2) The fact that repository content may be mangled in an indeterminate
way because of the current work tree <-> repository transformation
algorithm. While criticism in the past has mainly been levelled at not
knowing whether a truly binary file will be correctly determined as such,
content can be lost in the round trip work tree -> repository -> work tree
much more simply:

    $ git init
    $ git config core.autocrlf true
    $ echo ab | tr ab \\r\\n >a.txt
    $ od -t a a.txt
    0000000  cr  nl  nl
    0000003
    $ git add a.txt
    $ git commit
    $ rm a.txt
    $ git reset --hard
    $ od -t a a.txt
    0000000  cr  nl  cr  nl
    0000004

In summary, it irks me that autocrlf true mode is a second cousin of
autocrlf false and I think that there *should* be an acceptable
deterministic solution to this.

The solution to (2) seems easier than (1): could the transformation
algorithm be made deterministic and changed to something like "convert
all crlf pairs to lf if and only if no singleton cr or lf exist in the
file before conversion"? If a binary file gets mangled in error, it would
be an easy transformation with standard tools to get the file back again.
If an otherwise text file has mixed lf and crlf endings, or additional cr
or lf sprinkled randomly through it, the file is not transformed.

Given a deterministic transformation algorithm, the solution to (1) boils
down to recording for each file in the work tree whether the
transformation algorithm was used or not in arriving at the file's current
contents, together with a way of telling git to force the use of the
transformation algorithm or not for a particular file. It seems to me the
place that this information *should* be recorded is the index, given that
both .git/config and .gitattributes can be changed independently of the
work tree.

Recording the information in the index would mean that both autocrlf true
and autocrlf false clones of the same repository would produce equally
valid work trees with no loss of information. I am however not well versed
enough in git internals at the moment to know whether this is an
acceptable solution or not.
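The rule proposed for (2), "convert all crlf pairs to lf if and only if no singleton cr or lf exist in the file", could look roughly like the following C sketch. It only illustrates the proposal as worded in the message and is not code from git; only_crlf_pairs() and crlf_to_lf_if_uniform() are hypothetical names, and the caller is assumed to supply an output buffer at least as large as the input.

```c
#include <stddef.h>

/*
 * Return 1 when every CR is immediately followed by LF and every LF is
 * immediately preceded by CR, i.e. the buffer has no singleton CR or LF.
 */
int only_crlf_pairs(const char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (buf[i] == '\r') {
			if (i + 1 >= len || buf[i + 1] != '\n')
				return 0;	/* singleton CR */
			i++;			/* skip the LF of this pair */
		} else if (buf[i] == '\n') {
			return 0;		/* singleton LF */
		}
	}
	return 1;
}

/*
 * Copy buf into out, collapsing CRLF to LF only when the whole buffer
 * passes only_crlf_pairs(); otherwise copy it unchanged.
 * "out" must have room for at least len bytes. Returns the output length.
 */
size_t crlf_to_lf_if_uniform(const char *buf, size_t len, char *out)
{
	size_t i, n = 0;
	int convert = only_crlf_pairs(buf, len);

	for (i = 0; i < len; i++) {
		/* When converting, every CR is known to be part of a pair. */
		if (convert && buf[i] == '\r')
			continue;
		out[n++] = buf[i];
	}
	return n;
}
```

Under this rule the "cr nl nl" file from the example has a singleton LF, so it is stored unchanged and nothing needs to be added back on checkout, whereas a file whose line endings are all CRLF is stored with LF only and can be restored exactly by the inverse transformation.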
* Re: An interaction with ce_match_stat_basic() and autocrlf
From: Linus Torvalds @ 2008-01-08 17:30 UTC
To: Pēteris Kļaviņš; +Cc: git

On Tue, 8 Jan 2008, Pēteris Kļaviņš wrote:
>
> In summary, it irks me that autocrlf true mode is a second cousin of autocrlf
> false and I think that there *should* be an acceptable deterministic solution
> to this.

Well, I think the real issue is simply that most of the main git
developers do development on architectures where CRLF just isn't an
issue.

So it's not that autocrlf is a "second cousin", it's that

 - CRLF is stupid to begin with, and slightly anathemical to the git
   worldview of trying to be as exact as possible.

 - ..and almost nobody in the git community is actually affected, so
   people don't even notice when it's an issue.

People who actually care and use crlf are probably best off sending in
test-cases for particular behaviour they notice.

		Linus