git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* An interaction with ce_match_stat_basic() and autocrlf
@ 2008-01-08 12:12 Junio C Hamano
  2008-01-08 16:10 ` Linus Torvalds
  2008-01-08 17:12 ` Pēteris Kļaviņš
  0 siblings, 2 replies; 6+ messages in thread
From: Junio C Hamano @ 2008-01-08 12:12 UTC (permalink / raw)
  To: torvalds; +Cc: git

There is an interesting interaction with the stat matching and
autocrlf.

    $ git init
    $ git config core.autocrlf true
    $ echo a >a.txt
    $ git add a.txt
    $ unix2dos a.txt
    $ git diff
    diff --git a/a.txt b/a.txt

At this point, the index records a blob with LF line ending,
while the work tree file has the same content with CRLF line
ending.  And the funny thing is that once you get into this
situation it is unfixable short of "git add a.txt".  Most
notably, "git update-index --refresh" (and the equilvalent
auto-refresh that is implicitly run by "git diff" Porcelain)
will not update the cached stat information.

This is caused partly by the breakage in size_only codepath of
diff.c::diff_populate_filespec().  When taking the file contents
from the work tree, it just gets stat data and thinks it got the
final size, but it should actually convert the blob data into
canonical format.  diff.c::diffcore_skip_stat_unmatch() is
fooled by this and declares that the path is modified.

This can be fixed by not returning early even when size_only is
asked in the codepath.  It will make everything quite a lot more
expensive, as there currently is not a cheap way to ask "is this
path going to be munged by autocrlf or clean filter", but
getting the correct result is more important than getting a
quick but wrong result.

But that is just a half of the story.

 (1) It won't make the entry stat clean, as refresh_index()
     later called from builtin-diff.c to clean up the stat
     dirtiness works without paying attention to the autocrlf
     conversion.

 (2) It won't help lower-level diff-files and internal callers
     to ce_match_stat() that checks if the path were touched.
     The "read-tree -m -u" codepath uses it to avoid touching
     the path with local modifications.  The standard way to
     clear the stat-dirtiness with "git update-index --refresh"
     still needs to be fixed anyway.

I was going to conclude this message by saying "I need to sleep
on this to see if I can come up with a clean solution", but it
appears I do not have much time left for actually sleeping X-<.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: An interaction with ce_match_stat_basic() and autocrlf
  2008-01-08 12:12 An interaction with ce_match_stat_basic() and autocrlf Junio C Hamano
@ 2008-01-08 16:10 ` Linus Torvalds
  2008-01-08 18:04   ` Junio C Hamano
  2008-01-10  2:11   ` Junio C Hamano
  2008-01-08 17:12 ` Pēteris Kļaviņš
  1 sibling, 2 replies; 6+ messages in thread
From: Linus Torvalds @ 2008-01-08 16:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Tue, 8 Jan 2008, Junio C Hamano wrote:
> 
> This is caused partly by the breakage in size_only codepath of
> diff.c::diff_populate_filespec().

Only partially.

The more fundamental behaviour (that of git update-index) is caused by 
ie_modified() thinking that when DATA_CHANGED is true, it cannot possibly 
need to call "ce_modified_check_fs()":

>From ie_modified():

        /* Immediately after read-tree or update-index --cacheinfo,
         * the length field is zero.  For other cases the ce_size
         * should match the SHA1 recorded in the index entry.
         */
        if ((changed & DATA_CHANGED) && ce->ce_size != htonl(0))
                return changed;

and that DATA_CHANGED comes from ce_match_stat_basic() which notices that 
the size has changed.

Similarly, I think that the problem with "diff" not realizing they might 
be the same comes from ie_match_stat(), which has a similar problem in not 
realizing that DATA_CHANGED could possibly still mean that it's the same.

This patch should fix it, but I suspect we should think hard about that 
change to ie_modified(), and see what the performance issues are (ie that 
code has tried to avoid doing the more expensive ce_modified_check_fs() 
for a reason).

The change to diff.c is similarly interesting. It is logically wrong to 
use the worktree_file there (since we have to read the object anyway), but 
since "reuse_worktree_file" is also tied into the whole refresh logic, I 
think the diff.c change is correct.

I dunno. This is not meant to be applied, it is meant to be thought about.

		Linus

---
 diff.c       |    2 +-
 read-cache.c |    2 ++
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/diff.c b/diff.c
index b18c140..9f699b7 100644
--- a/diff.c
+++ b/diff.c
@@ -1512,7 +1512,7 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int
 	ce = active_cache[pos];
 	if ((lstat(name, &st) < 0) ||
 	    !S_ISREG(st.st_mode) || /* careful! */
-	    ce_match_stat(ce, &st, 0) ||
+	    ce_modified(ce, &st, 0) ||
 	    hashcmp(sha1, ce->sha1))
 		return 0;
 	/* we return 1 only when we can stat, it is a regular file,
diff --git a/read-cache.c b/read-cache.c
index 7db5588..e1fc880 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -253,12 +253,14 @@ int ie_modified(struct index_state *istate,
 	if (changed & (MODE_CHANGED | TYPE_CHANGED))
 		return changed;
 
+#if 0
 	/* Immediately after read-tree or update-index --cacheinfo,
 	 * the length field is zero.  For other cases the ce_size
 	 * should match the SHA1 recorded in the index entry.
 	 */
 	if ((changed & DATA_CHANGED) && ce->ce_size != htonl(0))
 		return changed;
+#endif
 
 	changed_fs = ce_modified_check_fs(ce, st);
 	if (changed_fs)

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: An interaction with ce_match_stat_basic() and autocrlf
  2008-01-08 12:12 An interaction with ce_match_stat_basic() and autocrlf Junio C Hamano
  2008-01-08 16:10 ` Linus Torvalds
@ 2008-01-08 17:12 ` Pēteris Kļaviņš
  2008-01-08 17:30   ` Linus Torvalds
  1 sibling, 1 reply; 6+ messages in thread
From: Pēteris Kļaviņš @ 2008-01-08 17:12 UTC (permalink / raw)
  To: git

> At this point, the index records a blob with LF line ending,
> while the work tree file has the same content with CRLF line
> ending.

I think this needs more than just sleeping on.

There are two separate problems related to crlf treatment in git that 
manifest themselves in the quirks you see in the current implementation:

(1) The fact that the index may be misaligned with the work tree. Junio's 
example demonstrates this well. I have resorted to

$ rm -rf *
$ git reset --hard

in the past to get a work tree that passes

$ git status

without false positives after changing the value of autocrlf.

(2) The fact that repository content may be mangled in an indeterminate way 
because of the current work tree <-> repository transformation algorithm. 
While criticism in the past has mainly been levelled at not knowing whether 
a truly binary file will be correctly determined as such, content can be 
lost in the round trip work tree -> repository -> work tree much more 
simply:

$ git init
$ git config core.autocrlf true
$ echo ab | tr ab \\r\\n >a.txt
$ od -t a a.txt
0000000  cr  nl  nl
0000003
$ git add a.txt
$ git commit
$ rm a.txt
$ git reset --hard
$ od -t a a.txt
0000000  cr  nl  cr  nl
0000004

In summary, it irks me that autocrlf true mode is a second cousin of 
autocrlf false and I think that there *should* be an acceptable 
deterministic solution to this.

The solution to (2) seems easier than (1): could the transformation 
algorithm be made deterministic and changed to something like "convert all 
crlf pairs to lf if and only if no singleton cr or lf exist in the file 
before conversion"? If a binary file gets mangled in error, it would be an 
easy transformation with standard tools to get the file back again. If an 
otherwise text file has mixed lf and crlf endings, or additional cr or lf 
sprinkled randomly through it, the file is not transformed.

Given a deterministic transformation algorithm, the solution to (1) boils 
down to recording for each file in the work tree whether the transformation 
algorithm was used or not in arriving at the file's current contents, 
together with a way of telling git to force the use of the transformation 
algorithm or not for a particular file. It seems to me the place that this 
information *should* be recorded is the index, given that both .git/config 
and .gitattributes can be changed independently of the work tree. Recording 
the information in the index would mean that both autocrlf true and autocrlf 
false clones of the same repository would produce equally valid work trees 
with no loss of information. I am however not well versed enough in git 
internals at the moment to know whether this is an acceptable solution or 
not. 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: An interaction with ce_match_stat_basic() and autocrlf
  2008-01-08 17:12 ` Pēteris Kļaviņš
@ 2008-01-08 17:30   ` Linus Torvalds
  0 siblings, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2008-01-08 17:30 UTC (permalink / raw)
  To: Pēteris Kļaviņš; +Cc: git



On Tue, 8 Jan 2008, Pēteris Kļaviņš wrote:
> 
> In summary, it irks me that autocrlf true mode is a second cousin of autocrlf
> false and I think that there *should* be an acceptable deterministic solution
> to this.

Well, I think the real issue is simply that most the main git developers 
do development on architectures where CRLF just isn't an issue.

So it's not that autocrlf is a "second cousin", it's that

 - CRLF is stupid to begin with, and slightly anathemical to the git 
   worldview of trying to be as exact as possible.

 - ..and almost nobody in the git community is actually affected, so 
   people don't even notice when it's an issue.

People who actually care and use crlf are probably best off sending in 
test-cases for particular behaviour they notice. 

		Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: An interaction with ce_match_stat_basic() and autocrlf
  2008-01-08 16:10 ` Linus Torvalds
@ 2008-01-08 18:04   ` Junio C Hamano
  2008-01-10  2:11   ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2008-01-08 18:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, 8 Jan 2008, Junio C Hamano wrote:
>> 
>> This is caused partly by the breakage in size_only codepath of
>> diff.c::diff_populate_filespec().
>
> Only partially.

Agreed.  That's why it is "just a half of the story".

> The more fundamental behaviour (that of git update-index) is caused by 
> ie_modified() thinking that when DATA_CHANGED is true, it cannot possibly 
> need to call "ce_modified_check_fs()":
> ...
> Similarly, I think that the problem with "diff" not realizing they might 
> be the same comes from ie_match_stat(), which has a similar problem in not 
> realizing that DATA_CHANGED could possibly still mean that it's the same.

Yes, I think your patch to ie_modified() should take care of the
issue from the diff-files front-end side, which is the right
approach.  The optimization diffcore_populate_filespec() makes
when asked to do size_only, which predates the addition of
convert_to_git(), needs to be updated regardless, though.  The
size field in diffcore_filespec is never about on-filesystem
size.

> This patch should fix it, but I suspect we should think hard about that 
> change to ie_modified(), and see what the performance issues are (ie that 
> code has tried to avoid doing the more expensive ce_modified_check_fs() 
> for a reason).

I think the reason was I simply avoided doing any unnecessary
operation that goes to the filesystem.  We did not even have
that modified_check_fs() code before the racy-git safety, and
when I added it I do not think I benched it with a real-life
workload; the logic there was simply a valid optimization back
then.

It is not anymore.  Addition of convert_to_git() made cached
stat info essentially ineffective in the sense that:

 (1) if a user changes the work tree files in such a way that
     does not change convert_to_git() output, the index will say
     "file contents in external representation has definitely
     changed, the sizes no longer match".  We need to actually
     go to the data to find out that there is no change at the
     canonical level.

 (2) if a user changes the crlf setting (or .gitattributes)
     without touching the work tree files, the index will say
     "unchanged and do not have to compare".  We need to
     actually go to the data to find out that they do not match
     anymore.

The latter is an opposite issue of what I brought up in this
thread.  I personally do not want to "fix" it --- it means
destroying one of the most important optimizations.  The use
case is essentially a one-shot operation for a user to "fix" a
broken crlf setting, and having to re-checkout everything is a
small cost to pay to maintain it.

But the former is something we should be able to deal with
sanely.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: An interaction with ce_match_stat_basic() and autocrlf
  2008-01-08 16:10 ` Linus Torvalds
  2008-01-08 18:04   ` Junio C Hamano
@ 2008-01-10  2:11   ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2008-01-10  2:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> This patch should fix it, but I suspect we should think hard about that 
> change to ie_modified(), and see what the performance issues are (ie that 
> code has tried to avoid doing the more expensive ce_modified_check_fs() 
> for a reason).
>
> The change to diff.c is similarly interesting. It is logically wrong to 
> use the worktree_file there (since we have to read the object anyway), but 
> since "reuse_worktree_file" is also tied into the whole refresh logic, I 
> think the diff.c change is correct.
>
> I dunno. This is not meant to be applied, it is meant to be thought about.

There are a few cases around the changing value of autocrlf (and
filter attributes --- anything that affects convert_to_git() and
convert_to_working_tree()).

 * The cached stat information matches the work tree, but user
   changed convert_to_working_tree().  "git diff" reports
   nothing.  The user needs to remove the work tree file and
   check it out again.

 * The cached stat information matches the work tree, but user
   changed convert_to_git().  Again, diff reports nothing.  The
   user needs to "git add" to cause rehashing.

 * The cached stat information does not match.  What the working
   tree file stores hasn't changed, but convert_to_git() was
   changed.

   The fact that the working tree "file" contents did not change
   does not have much significance in this case.  What defines
   the "contents" as far as git is concerned is the combination
   of the working tree file contents _and_ what convert_to_git()
   does to it.

   Depending on the nature of the change to convert_to_git(),
   "git diff-files" may or may not report real changes in this
   case.

 * The working tree file has changed, and convert_to_git() also
   has changed.

   Depending on the nature of the change to convert_to_git(),
   "git diff" may or may not report change in this case.  The
   most extreme case is when unix2dos is run on the working tree
   file and convert_to_git() is made to strip CR.  The object
   registered in the index won't change in this case.

   But in practice, the most problematic case also falls into
   this category.  The user has _real_ changes to the work tree
   file, but at the same time flipped convert_to_git() to
   operate differently from before.  Users should not be making
   such a change, not because of git, but because a commit like
   that will be impossible to review (and understand three
   months later while archaeologying).

The ie_modified() change you suggested will not be hurt by the
first two cases (which I see are one-shot events and re-checkout
and re-add are good enough solution to them, and I do not want
them to hurt the performance for normal use cases).

I originally thought it was a _bug_, but I suspect the false
positive changes reported by "git diff" is even a good thing.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2008-01-10  2:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-08 12:12 An interaction with ce_match_stat_basic() and autocrlf Junio C Hamano
2008-01-08 16:10 ` Linus Torvalds
2008-01-08 18:04   ` Junio C Hamano
2008-01-10  2:11   ` Junio C Hamano
2008-01-08 17:12 ` Pēteris Kļaviņš
2008-01-08 17:30   ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).