git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Schrödinger's diff
@ 2009-07-07  6:53 Eric Raible
  2009-07-07  7:28 ` Johannes Sixt
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Eric Raible @ 2009-07-07  6:53 UTC (permalink / raw)
  To: Git Mailing List

git version 1.6.3.2.1299.gee46c (msysgit)

In trying to track down some annoying crlf corruption in a repo
I have found a Schrödinger's diff.  In other words it's unknown
whether the diff will produce output or not on any particular run
of the following script.

Sometimes it does, and sometimes it doesn't (seems to be about
50/50).  But either way in any given repo rerunning the git-diff will
always give the same result.

Doing an "git ls-tree HEAD" gives an identical tree in both cases.

Can anyone explain why the output to this is not deterministic?
I'm at a complete loss.

	# Clean up from last run and start over
	rm -rf .git has-crlf
	git init
	git config core.autocrlf false

	# Add a "bad" file
	perl -e 'printf( "12%c%c", 0xd, 0xa )' > has-crlf
	git add has-crlf
	git commit -m"add crlf"

	# I realize that switching is ill-advised, but I'm
	# trying to track down a possibly related problem...
	git config core.autocrlf true

	# This sometimes produces output and sometimes it doesn't.
	# Either way rerunning just git-diff always gives the same result
	# as the first run in this repo.
	git diff

- Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-07  6:53 Schrödinger's diff Eric Raible
@ 2009-07-07  7:28 ` Johannes Sixt
  2009-07-07  7:52 ` Andreas Ericsson
  2009-07-07 17:36 ` Daniel Barkalow
  2 siblings, 0 replies; 13+ messages in thread
From: Johannes Sixt @ 2009-07-07  7:28 UTC (permalink / raw)
  To: Eric Raible; +Cc: Git Mailing List

Eric Raible schrieb:
> Sometimes it does, and sometimes it doesn't (seems to be about
> 50/50).  But either way in any given repo rerunning the git-diff will
> always give the same result.
> 
> Doing an "git ls-tree HEAD" gives an identical tree in both cases.
> 
> Can anyone explain why the output to this is not deterministic?
> I'm at a complete loss.
> 
> 	# Clean up from last run and start over
> 	rm -rf .git has-crlf
> 	git init
> 	git config core.autocrlf false
> 
> 	# Add a "bad" file
> 	perl -e 'printf( "12%c%c", 0xd, 0xa )' > has-crlf
> 	git add has-crlf
> 	git commit -m"add crlf"
> 
> 	# I realize that switching is ill-advised, but I'm
> 	# trying to track down a possibly related problem...
> 	git config core.autocrlf true
> 
> 	# This sometimes produces output and sometimes it doesn't.
> 	# Either way rerunning just git-diff always gives the same result
> 	# as the first run in this repo.
> 	git diff

If I put this in a script, I get a diff in 9 out of 10 runs. If I insert
'sleep 1' right before the 'git add', I never get a diff.

I'm handing this off to people who care about core.autocrlf and who know
how racily-clean index entries (not) work ;)

-- Hannes

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-07  6:53 Schrödinger's diff Eric Raible
  2009-07-07  7:28 ` Johannes Sixt
@ 2009-07-07  7:52 ` Andreas Ericsson
  2009-07-07 17:36 ` Daniel Barkalow
  2 siblings, 0 replies; 13+ messages in thread
From: Andreas Ericsson @ 2009-07-07  7:52 UTC (permalink / raw)
  To: Eric Raible; +Cc: Git Mailing List

Eric Raible wrote:
> git version 1.6.3.2.1299.gee46c (msysgit)
> 
> In trying to track down some annoying crlf corruption in a repo
> I have found a Schrödinger's diff.  In other words it's unknown
> whether the diff will produce output or not on any particular run
> of the following script.
> 
> Sometimes it does, and sometimes it doesn't (seems to be about
> 50/50).  But either way in any given repo rerunning the git-diff will
> always give the same result.
> 

I don't get the same result in the same repo, although it only
differs in 1-1.5% of the tests.

> Doing an "git ls-tree HEAD" gives an identical tree in both cases.
> 
> Can anyone explain why the output to this is not deterministic?

On Linux with git version 1.6.3.3.354.g3b4cc

Pasting your commands into "repro.sh", but redirecting output from
git commit to /dev/null, and then running the following commands
has yielded 9 to 15 sample.$i files over 5 tries of the following:

sh repro.sh > correct
for i in $(seq 1 1000); do
  sh repro.sh > sample && cmp sample correct >/dev/null || \
     { echo "fail $i" && cp sample sample.$i; };
done

> I'm at a complete loss.
> 

Inserting "sync" between calls as shown below doesn't fix the issue
(although it drops from 9-15 to 4-10 fails on Linux; Not a very good
improvement and only two test-runs). I have no idea how it can behave
so strangely, and I refuse to believe that the ext3 fs driver allows
dirty reads.

> 	# Clean up from last run and start over
> 	rm -rf .git has-crlf
> 	git init
> 	git config core.autocrlf false
> 
> 	# Add a "bad" file
> 	perl -e 'printf( "12%c%c", 0xd, 0xa )' > has-crlf
> 	git add has-crlf
> 	git commit -m"add crlf"
> 

sync

> 	# I realize that switching is ill-advised, but I'm
> 	# trying to track down a possibly related problem...
> 	git config core.autocrlf true
> 

sync

> 	# This sometimes produces output and sometimes it doesn't.
> 	# Either way rerunning just git-diff always gives the same result
> 	# as the first run in this repo.
> 	git diff
> 

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-07  6:53 Schrödinger's diff Eric Raible
  2009-07-07  7:28 ` Johannes Sixt
  2009-07-07  7:52 ` Andreas Ericsson
@ 2009-07-07 17:36 ` Daniel Barkalow
  2009-07-07 19:36   ` Jeff King
  2 siblings, 1 reply; 13+ messages in thread
From: Daniel Barkalow @ 2009-07-07 17:36 UTC (permalink / raw)
  To: Eric Raible; +Cc: Git Mailing List

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2272 bytes --]

On Mon, 6 Jul 2009, Eric Raible wrote:

> git version 1.6.3.2.1299.gee46c (msysgit)
> 
> In trying to track down some annoying crlf corruption in a repo
> I have found a Schrödinger's diff.  In other words it's unknown
> whether the diff will produce output or not on any particular run
> of the following script.
> 
> Sometimes it does, and sometimes it doesn't (seems to be about
> 50/50).  But either way in any given repo rerunning the git-diff will
> always give the same result.
> 
> Doing an "git ls-tree HEAD" gives an identical tree in both cases.
> 
> Can anyone explain why the output to this is not deterministic?
> I'm at a complete loss.
> 
> 	# Clean up from last run and start over
> 	rm -rf .git has-crlf
> 	git init
> 	git config core.autocrlf false
> 
> 	# Add a "bad" file
> 	perl -e 'printf( "12%c%c", 0xd, 0xa )' > has-crlf
> 	git add has-crlf

If has-crlf and .git/index have the same timestamp, git does not know 
whether the file has been modified afterwards or not. If they have 
different timestamps, git knows the file hasn't been modified after the 
add. (More precisely, the index contains the mtime of the file, and it 
will agree with the file system. However, if the timestamp on the index 
matches a timestamp *in* the index, that means that, when the index was 
written, the time period represented by that timestamp was not yet over 
when git looked at the file. Therefore, the file could have changed 
again after that time and still gotten the same timestamp it already had. 
This means that git can't be sure that there's nothing new to see in the 
filesystem.)

> 	git commit -m"add crlf"
> 
> 	# I realize that switching is ill-advised, but I'm
> 	# trying to track down a possibly related problem...
> 	git config core.autocrlf true
> 
> 	# This sometimes produces output and sometimes it doesn't.
> 	# Either way rerunning just git-diff always gives the same result
> 	# as the first run in this repo.
> 	git diff

If git knows the file hasn't been modified, it doesn't produce a diff. 

If it doesn't know the file hasn't been modified, it looks at the actual 
contents and it find that the result of reading the disk applying autocrlf 
now doesn't match the contents of the index.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-07 17:36 ` Daniel Barkalow
@ 2009-07-07 19:36   ` Jeff King
  2009-07-07 19:48     ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2009-07-07 19:36 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Eric Raible, Git Mailing List

On Tue, Jul 07, 2009 at 01:36:08PM -0400, Daniel Barkalow wrote:

> > 	# I realize that switching is ill-advised, but I'm
> > 	# trying to track down a possibly related problem...
> > 	git config core.autocrlf true
> > 
> > 	# This sometimes produces output and sometimes it doesn't.
> > 	# Either way rerunning just git-diff always gives the same result
> > 	# as the first run in this repo.
> > 	git diff
> 
> If git knows the file hasn't been modified, it doesn't produce a diff. 
> 
> If it doesn't know the file hasn't been modified, it looks at the actual 
> contents and it find that the result of reading the disk applying autocrlf 
> now doesn't match the contents of the index.

Yes, that was my analysis upon reading the original mail, as well (and I
have been bitten by this before while testing crlf stuff). The same
thing can happen with clean/smudge, I think.

When you set up config that changes how we view worktree files (like
crlf or clean/smudge) and there is already cached stat information in
the index, you really need to invalidate the matching stat information
in the index to get sane results[1]. It might be nice for "git config"
to do this for you, but:

    1. You could just as easily be hand-editing the config.

    2. It feels wrong from a modularity standpoint. Right now "git
       config" doesn't actually care about the semantics of config,
       just the syntax. Which makes it exactly equivalent to
       hand-editing.

    3. It doesn't cover every situation. Files can also be "changed" in
       this way by editing .gitattributes, which can be changed manually
       or by any number of git commands (like checkout, reset, etc).

So I think automatically detecting this situation would require flags in
the index to say "this stat information is valid only over these
particular settings". And you would want it per-file to avoid having to
re-hash every file when you change the .gitattributes for one file. The
command using the index would check it. But even that might have holes,
I'm afraid -- we don't always look at all of the config in every
command, though perhaps we do for such core functionality.

-Peff

[1] Is there an easy way to do this with update-index? I didn't see one,
and had to resort to "git read-tree HEAD".

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-07 19:36   ` Jeff King
@ 2009-07-07 19:48     ` Junio C Hamano
  2009-07-07 19:54       ` Jeff King
  2009-07-07 20:30       ` Eric Raible
  0 siblings, 2 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-07-07 19:48 UTC (permalink / raw)
  To: Jeff King; +Cc: Daniel Barkalow, Eric Raible, Git Mailing List

Jeff King <peff@peff.net> writes:

> .... But even that might have holes,
> I'm afraid -- we don't always look at all of the config in every
> command, though perhaps we do for such core functionality.

I personally do not think it is worth it.  If you change the crlf,
clean/smudge, or anything of that sort, just doing a "rm .git/index"
followed by "git reset --hard" would restore sanity to your work tree, no?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-07 19:48     ` Junio C Hamano
@ 2009-07-07 19:54       ` Jeff King
  2009-07-07 22:22         ` Junio C Hamano
  2009-07-07 20:30       ` Eric Raible
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff King @ 2009-07-07 19:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Daniel Barkalow, Eric Raible, Git Mailing List

On Tue, Jul 07, 2009 at 12:48:45PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > .... But even that might have holes,
> > I'm afraid -- we don't always look at all of the config in every
> > command, though perhaps we do for such core functionality.
> 
> I personally do not think it is worth it.  If you change the crlf,
> clean/smudge, or anything of that sort, just doing a "rm .git/index"
> followed by "git reset --hard" would restore sanity to your work tree, no?

Yes, that works fine, but:

  1. It blows away anything unrelated you might have staged.

  2. You have to know to do it (and you get very confusing results if
     you don't), which makes it very unfriendly for newbies.

  3. You have to know to do it, and it isn't documented. :)

(3) at least is not too hard to address. And perhaps (2) is not a big
enough issue to care about. This is not a problem we have seen on the
list a lot. I suspect it is because most CRLF users are on Windows, and
therefore have it setup before the tree is checked out, and there are
simply not all that many clean/smudge users.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-07 19:48     ` Junio C Hamano
  2009-07-07 19:54       ` Jeff King
@ 2009-07-07 20:30       ` Eric Raible
  2009-07-07 20:48         ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Raible @ 2009-07-07 20:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Daniel Barkalow, Git Mailing List

On Tue, Jul 7, 2009 at 12:48 PM, Junio C Hamano<gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
>> .... But even that might have holes,
>> I'm afraid -- we don't always look at all of the config in every
>> command, though perhaps we do for such core functionality.
>
> I personally do not think it is worth it.  If you change the crlf,
> clean/smudge, or anything of that sort, just doing a "rm .git/index"
> followed by "git reset --hard" would restore sanity to your work tree, no?

Is there any technical reason why "git reset --hard" shouldn't repopulate
the index by doing a "git read-tree" or equivalent [1]?  After all the
docs claim it
"Matches the working tree and index to that of the tree being switched to".

Except in this case it doesn't.  The resulting .index is an invalid
representation
of the tree.  Sanity can be restored with "git read-tree HEAD" (as
Jeff suggested),
but that's hardly intuitive.

- Eric

[1] For instance anything at all (possibly involving time stamps) such that
the index will truly match the tree.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-07 20:30       ` Eric Raible
@ 2009-07-07 20:48         ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2009-07-07 20:48 UTC (permalink / raw)
  To: Eric Raible; +Cc: Junio C Hamano, Daniel Barkalow, Git Mailing List

On Tue, Jul 07, 2009 at 01:30:47PM -0700, Eric Raible wrote:

> > I personally do not think it is worth it.  If you change the crlf,
> > clean/smudge, or anything of that sort, just doing a "rm .git/index"
> > followed by "git reset --hard" would restore sanity to your work tree, no?
> 
> Is there any technical reason why "git reset --hard" shouldn't repopulate
> the index by doing a "git read-tree" or equivalent [1]?  After all the
> docs claim it
> "Matches the working tree and index to that of the tree being switched to".

Yes; relying on the stat cache is what makes "git reset --hard" really
fast, instead of having to re-hash each file. The real problem is that
we are invalidating the contents of that cache, but not marking the
entry as dirty.

Right now we don't mark the entries, and you get stale data until you
manually mark them as dirty. But marking them dirty on every reset will
mean that reset can't make use of the cache, which is slow. So ideally
we would have a way of marking them dirty only when necessary. One way
would be a "git reset --hard --reset-me-harder", but that obviously
still involves manual work; you're just making the command a little
easier to find and use.

As I suggested before, you could include the changed bits (i.e., the
attributes and config) as part of the cache validity information.  But I
suspect it would be hard to implement, as it involves new fields in the
index. It also involves arbitrary user commands, so getting it 100%
right would be impossible (e.g., we can record the string
"my-clean-filter" as the filter name, but we can't know when that
filter's behavior has changed); however, in practice, I think we can
assume that a string containing a text command will have stable output.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-07 19:54       ` Jeff King
@ 2009-07-07 22:22         ` Junio C Hamano
  2009-07-08  0:17           ` Eric Raible
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2009-07-07 22:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Daniel Barkalow, Eric Raible, Git Mailing List

Jeff King <peff@peff.net> writes:

> Yes, that works fine, but:
>
>   1. It blows away anything unrelated you might have staged.
>
>   2. You have to know to do it (and you get very confusing results if
>      you don't), which makes it very unfriendly for newbies.
>
>   3. You have to know to do it, and it isn't documented. :)
>
> (3) at least is not too hard to address. And perhaps (2) is not a big
> enough issue to care about. This is not a problem we have seen on the
> list a lot. I suspect it is because most CRLF users are on Windows, and
> therefore have it setup before the tree is checked out, and there are
> simply not all that many clean/smudge users.

A much more important reason is that it is a one-time event.  You notice
that you screwed up the configuration to use your peculiar work tree
representation, and you fix it once and for all.  Because not only it is a
one-time event but because it is a big-deal event, I do not think it is
something people even would want to think about doing it while having
local changes, so I suspect #1 is also a non-issue.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-07 22:22         ` Junio C Hamano
@ 2009-07-08  0:17           ` Eric Raible
  2009-07-08  2:54             ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Raible @ 2009-07-08  0:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Daniel Barkalow, Git Mailing List

On Tue, Jul 7, 2009 at 3:22 PM, Junio C Hamano<gitster@pobox.com> wrote:
> You notice
> that you screwed up the configuration to use your peculiar work tree
> representation, and you fix it once and for all.

A kinda dim-witted question:

So what's the best way of "fixing once and for all" a repo infected with
carriage returns when you want to use autocrlf=true moving forward?

And a hopefully less annoying one:

Would you accept a patch explaining how "git reset --hard" doesn't
actually rebuild the index from scratch, and how "git read-head" might
be recommended in some weird situations?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-08  0:17           ` Eric Raible
@ 2009-07-08  2:54             ` Junio C Hamano
  2009-07-08  6:13               ` Eric Raible
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2009-07-08  2:54 UTC (permalink / raw)
  To: Eric Raible; +Cc: Jeff King, Daniel Barkalow, Git Mailing List

Eric Raible <raible@gmail.com> writes:

> So what's the best way of "fixing once and for all" a repo infected with
> carriage returns when you want to use autocrlf=true moving forward?

Didn't "rm -f .git/index && git reset --hard HEAD" work?

> And a hopefully less annoying one:
>
> Would you accept a patch explaining how "git reset --hard" doesn't
> actually rebuild the index from scratch,...

Absolutely.

> ... and how "git read-head" might
> be recommended in some weird situations?

I am less certain about that; people may have easier-to-read solution than
the one with read-tree, I would suspect.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Schrödinger's diff
  2009-07-08  2:54             ` Junio C Hamano
@ 2009-07-08  6:13               ` Eric Raible
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Raible @ 2009-07-08  6:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Daniel Barkalow, Git Mailing List

On Tue, Jul 7, 2009 at 7:54 PM, Junio C Hamano<gitster@pobox.com> wrote:
> Eric Raible <raible@gmail.com> writes:
>
>> So what's the best way of "fixing once and for all" a repo infected with
>> carriage returns when you want to use autocrlf=true moving forward?
>
> Didn't "rm -f .git/index && git reset --hard HEAD" work?

Will, it "worked" in the sense that it usually [1] allows
"git diff --name-only" to correctly show the files that were
checked in with crlf endings and core.autocrlf false.

Those files are then out-of-date when core.autocrlf is
true and .git/index is up-to-date.

In other words, that first step worked around the fact
that sometimes that .git/index was out of date.

By "fixing it once and for all" I was trying to refer to creating
the correct commit to abolish the carriage returns from my repo.

I ended up with this crude-but-obvious loop which
generates many LF->CRLF warnings [2]:

for i in `git diff --name-only`; do
      echo $i
      sed 's/0x0D//' < $i > foo
      mv foo $i
      git add $i
done

>> Would you accept a patch explaining how "git reset --hard" doesn't
>> actually rebuild the index from scratch,...
>
> Absolutely.

I'll try to get to it, given the $dayjob / $significant_other constraints.

- Eric

[1] I've had cases where for whatever reasons a "git read-tree HEAD"
     seemed to be required, but I don't have the recipe yet.
[2] Which can be abolished by wrapping it in autocrlf=false
     before and autocrlf=true after

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-07-08  6:13 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-07-07  6:53 Schrödinger's diff Eric Raible
2009-07-07  7:28 ` Johannes Sixt
2009-07-07  7:52 ` Andreas Ericsson
2009-07-07 17:36 ` Daniel Barkalow
2009-07-07 19:36   ` Jeff King
2009-07-07 19:48     ` Junio C Hamano
2009-07-07 19:54       ` Jeff King
2009-07-07 22:22         ` Junio C Hamano
2009-07-08  0:17           ` Eric Raible
2009-07-08  2:54             ` Junio C Hamano
2009-07-08  6:13               ` Eric Raible
2009-07-07 20:30       ` Eric Raible
2009-07-07 20:48         ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).