git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFH] eol=lf on existing mixed line-ending files
@ 2011-04-07 23:15 Jeff King
  2011-04-08  9:36 ` Michael J Gruber
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Jeff King @ 2011-04-07 23:15 UTC (permalink / raw)
  To: git; +Cc: Mislav Marohnic

I investigated some odd git behavior with the EOL gitattributes today,
and I'm curious to hear what others on the list think of what git does.
In particular, index raciness means git produces non-deterministic
results in this case.

The repo in question has a gitattributes file with "* crlf=input" (which
we would spell "eol=lf" these days, but the results are the same), but
still contains some files with mixed line endings. Which you can
reproduce with:

  git init repo &&
  cd repo &&
  {
    printf 'one\n' &&
    printf 'two\r\n'
  } >mixed &&
  git add mixed &&
  git commit -m one &&
  echo '* eol=lf' >.gitattributes

Now if we run "git status" or "git diff", it will let us know that
"mixed" is modified, insofar as adding and committing it would perform
the LF conversion.

Now we come to the first confusing behavior. Generally one would expect
the working directory to be clean after a "git reset --hard". But not
here:

  git reset --hard &&
  git status

will still show "mixed" as modified. Because of course we are checking
out the version from HEAD into the index and working tree, which has the
mixed line endings. So we rewrite the identical file.

So that kind of makes sense. But it isn't all that helpful, if I just
want to reset my working tree to something sane without making a new
commit (more on this later).

But here's an extra helping of confusion on top. Every once in a while,
doing the reset _won't_ keep "mixed" as modified. I can trigger it
reliably by inserting an extra sleep into git:

diff --git a/unpack-trees.c b/unpack-trees.c
index 500ebcf..735b13e 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -223,6 +223,7 @@ static int check_updates(struct unpack_trees_options *o)
 		}
 	}
 	stop_progress(&progress);
+	sleep(1);
 	if (o->update)
 		git_attr_set_direction(GIT_ATTR_CHECKIN, NULL);
 	return errs != 0;

That puts a delay between when reset writes the "mixed" file, and when
we write out the refreshed index. So next time we look at the index
(e.g., in "status"), we will see that the "mixed" entry has up-to-date
stat information and not look at its actual contents.

But in the original case (without the sleep), that doesn't happen.
There, we usually end up writing the file and the index in the same
second. So when status looks at the index, the "mixed" entry is racily
clean, and we actually check it again.

So we get two different outcomes, depending on the index raciness. Which
one is right, or is it right for it to be non-deterministic?

And one final question. Let's say I don't immediately convert this mixed
file to the correct line-endings. Instead, it persists over a large
number of commits, some of them even changing the "mixed" file but not
fixing the line endings[1]. We can simulate that with:

  mv .gitattributes tmp
  echo three >>mixed &&
  git commit -a -m three &&
  mv tmp .gitattributes

Now imagine I am somebody who has cloned this repo; the clone will tend
to end the race condition in the "clean" state, since it will often take
more than 1 second to write out all of the files (at least for a
normal-sized project). We can simulate using our sleep-patched reset:

  git reset --hard

to get a "clean" repo. Now let's say I want to explore old history, so I
go to a detached HEAD, but using normal git, not the sleep-patched one:

  git checkout HEAD^

And, of course, now we think "mixed" is modified. After I'm done
exploring, I want to go back to "master", but I can't:

  $ git checkout master
  error: Your local changes to the following files would be overwritten by checkout:
          mixed

What is the best way out of this situation? You can't use "reset --hard"
to fix the working tree. I guess "git checkout -f" is the best option.

Hopefully my example made sense and was reproducible. The real repo
which triggered this puzzle was jquery. You can try:

  git clone git://github.com/jquery/jquery.git &&
  cd jquery &&
  git checkout 1.4.2 &&
  git checkout master

which will fail (but may succeed racily on a slow enough machine).
Obviously they need to fix the mixed line-ending files in their repo.
But that fix would be on HEAD, and "git checkout 1.4.2" will be forever
broken. Is there a way to fix that?

-Peff

[1] The one thing still puzzling me about the jquery repo is how they
managed to make so many commits (including ones to mixed line ending
files) without seeing the dirty working tree state and committing it. Is
there some combination of config that makes this not happen?

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFH] eol=lf on existing mixed line-ending files
  2011-04-07 23:15 [RFH] eol=lf on existing mixed line-ending files Jeff King
@ 2011-04-08  9:36 ` Michael J Gruber
  2011-04-08 16:06   ` Jeff King
  2011-04-09 18:58 ` Dmitry Potapov
  2011-04-12 13:57 ` Jay Soffian
  2 siblings, 1 reply; 7+ messages in thread
From: Michael J Gruber @ 2011-04-08  9:36 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Mislav Marohnic

Jeff King venit, vidit, dixit 08.04.2011 01:15:
> I investigated some odd git behavior with the EOL gitattributes today,
> and I'm curious to hear what others on the list think of what git does.
> In particular, index raciness means git produces non-deterministic
> results in this case.
> 
> The repo in question has a gitattributes file with "* crlf=input" (which
> we would spell "eol=lf" these days, but the results are the same), but
> still contains some files with mixed line endings. Which you can
> reproduce with:
> 
>   git init repo &&
>   cd repo &&
>   {
>     printf 'one\n' &&
>     printf 'two\r\n'
>   } >mixed &&
>   git add mixed &&
>   git commit -m one &&
>   echo '* eol=lf' >.gitattributes
> 
> Now if we run "git status" or "git diff", it will let us know that
> "mixed" is modified, insofar as adding and committing it would perform
> the LF conversion.
> 
> Now we come to the first confusing behavior. Generally one would expect
> the working directory to be clean after a "git reset --hard". But not
> here:
> 
>   git reset --hard &&
>   git status
> 
> will still show "mixed" as modified. Because of course we are checking
> out the version from HEAD into the index and working tree, which has the
> mixed line endings. So we rewrite the identical file.
> 
> So that kind of makes sense. But it isn't all that helpful, if I just
> want to reset my working tree to something sane without making a new
> commit (more on this later).
> 
> But here's an extra helping of confusion on top. Every once in a while,
> doing the reset _won't_ keep "mixed" as modified. I can trigger it
> reliably by inserting an extra sleep into git:
> 
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 500ebcf..735b13e 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -223,6 +223,7 @@ static int check_updates(struct unpack_trees_options *o)
>  		}
>  	}
>  	stop_progress(&progress);
> +	sleep(1);
>  	if (o->update)
>  		git_attr_set_direction(GIT_ATTR_CHECKIN, NULL);
>  	return errs != 0;
> 
> That puts a delay between when reset writes the "mixed" file, and when
> we write out the refreshed index. So next time we look at the index
> (e.g., in "status"), we will see that the "mixed" entry has up-to-date
> stat information and not look at its actual contents.
> 
> But in the original case (without the sleep), that doesn't happen.
> There, we usually end up writing the file and the index in the same
> second. So when status looks at the index, the "mixed" entry is racily
> clean, and we actually check it again.
> 
> So we get two different outcomes, depending on the index raciness. Which
> one is right, or is it right for it to be non-deterministic?
> 
> And one final question. Let's say I don't immediately convert this mixed
> file to the correct line-endings. Instead, it persists over a large
> number of commits, some of them even changing the "mixed" file but not
> fixing the line endings[1]. We can simulate that with:
> 
>   mv .gitattributes tmp
>   echo three >>mixed &&
>   git commit -a -m three &&
>   mv tmp .gitattributes
> 
> Now imagine I am somebody who has cloned this repo; the clone will tend
> to end the race condition in the "clean" state, since it will often take
> more than 1 second to write out all of the files (at least for a
> normal-sized project). We can simulate using our sleep-patched reset:
> 
>   git reset --hard
> 
> to get a "clean" repo. Now let's say I want to explore old history, so I
> go to a detached HEAD, but using normal git, not the sleep-patched one:
> 
>   git checkout HEAD^
> 
> And, of course, now we think "mixed" is modified. After I'm done
> exploring, I want to go back to "master", but I can't:
> 
>   $ git checkout master
>   error: Your local changes to the following files would be overwritten by checkout:
>           mixed
> 
> What is the best way out of this situation? You can't use "reset --hard"
> to fix the working tree. I guess "git checkout -f" is the best option.
> 
> Hopefully my example made sense and was reproducible. The real repo
> which triggered this puzzle was jquery. You can try:
> 
>   git clone git://github.com/jquery/jquery.git &&
>   cd jquery &&
>   git checkout 1.4.2 &&
>   git checkout master
> 
> which will fail (but may succeed racily on a slow enough machine).
> Obviously they need to fix the mixed line-ending files in their repo.
> But that fix would be on HEAD, and "git checkout 1.4.2" will be forever
> broken. Is there a way to fix that?
> 
> -Peff
> 
> [1] The one thing still puzzling me about the jquery repo is how they
> managed to make so many commits (including ones to mixed line ending
> files) without seeing the dirty working tree state and committing it. Is
> there some combination of config that makes this not happen?

When did they introduce the .gitattributes file?
Also, maybe they're jgit users.

Michael

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFH] eol=lf on existing mixed line-ending files
  2011-04-08  9:36 ` Michael J Gruber
@ 2011-04-08 16:06   ` Jeff King
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff King @ 2011-04-08 16:06 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: git, Mislav Marohnic

On Fri, Apr 08, 2011 at 11:36:20AM +0200, Michael J Gruber wrote:

> > [1] The one thing still puzzling me about the jquery repo is how they
> > managed to make so many commits (including ones to mixed line ending
> > files) without seeing the dirty working tree state and committing it. Is
> > there some combination of config that makes this not happen?
> 
> When did they introduce the .gitattributes file?
> Also, maybe they're jgit users.

The gitattributes file was introduced in 2009, and there are several
commits to mixed line-ending files between then and now. So it seems
unlikely that they just won the race condition over and over.

I don't know whether they could be using jgit.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFH] eol=lf on existing mixed line-ending files
  2011-04-07 23:15 [RFH] eol=lf on existing mixed line-ending files Jeff King
  2011-04-08  9:36 ` Michael J Gruber
@ 2011-04-09 18:58 ` Dmitry Potapov
  2011-04-09 19:32   ` Jeff King
  2011-04-12 13:57 ` Jay Soffian
  2 siblings, 1 reply; 7+ messages in thread
From: Dmitry Potapov @ 2011-04-09 18:58 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Mislav Marohnic

On Fri, Apr 8, 2011 at 3:15 AM, Jeff King <peff@peff.net> wrote:
>
>  git init repo &&
>  cd repo &&
>  {
>    printf 'one\n' &&
>    printf 'two\r\n'
>  } >mixed &&
>  git add mixed &&
>  git commit -m one &&
>  echo '* eol=lf' >.gitattributes
>
> Now if we run "git status" or "git diff", it will let us know that
> "mixed" is modified, insofar as adding and committing it would perform
> the LF conversion.

Well, git _may_ report that file is modified, but usually when you
change .gitattributes, git does not notice changes to file endings
until you touch those files. You can force git to notice changes in
all files by doing:

 $ touch -d 2000-1-1 .git/index

so it will re-read all files, but I guess it should be do that
automatically, otherwise many people end up with having inconsistent
file endings in their repository as result of editing .gitattributes
(or by just pulling a new version from the upstream).

>
> Now we come to the first confusing behavior. Generally one would expect
> the working directory to be clean after a "git reset --hard". But not
> here:
>
>  git reset --hard &&
>  git status
>
> will still show "mixed" as modified.

It is because you discard all changes except to .gitattributes.  If
.gitattributes were tracked, "reset" would discard them too, and you
would get clean original state.

> So that kind of makes sense. But it isn't all that helpful, if I just
> want to reset my working tree to something sane without making a new
> commit (more on this later).

If we do not discard changes to .gitattributes then the question is
what a sane state is? It is really difficult to define what is sane
when conversion to the work tree and back gives a different result.

> But here's an extra helping of confusion on top. Every once in a while,
> doing the reset _won't_ keep "mixed" as modified. I can trigger it
> reliably by inserting an extra sleep into git:

you can have the same effect by doing:

git reset --hard HEAD && sleep 1 && git touch .git/index

Ironically, that the race that you observed is result of fixing another
race in git when files are changed too fast, so they may have the same
timestamp. To prevent this race, git checks timestamp of .git/index
and a trcking file. If .git/index timestamp is older or same as that file,
this file is considered dirty. So, it is re-read from the disk to check
if there are any changes. This works well but only if conversion to the
work tree and back produces the same result.

> So we get two different outcomes, depending on the index raciness. Which
> one is right, or is it right for it to be non-deterministic?

I like everything being deterministic, but in this case I do not see
how it is possible without making the normal case much slower.

> And one final question. Let's say I don't immediately convert this mixed
> file to the correct line-endings.

IMHO, adding .gitattributes that specifies line endings while not
fixing actual line endings of existing files is really a bad idea.

As with any other filter, the rule is that conversion from git to
the working tree and back should give the same result for any file
in the repository, otherwise you will have a lot of troubles later.

> Hopefully my example made sense and was reproducible. The real repo
> which triggered this puzzle was jquery. You can try:
>
>  git clone git://github.com/jquery/jquery.git &&
>  cd jquery &&
>  git checkout 1.4.2 &&
>  git checkout master
>
> which will fail (but may succeed racily on a slow enough machine).
> Obviously they need to fix the mixed line-ending files in their repo.
> But that fix would be on HEAD, and "git checkout 1.4.2" will be forever
> broken. Is there a way to fix that?

You cannot change the past history. Well, you can overwrite that
setting using .git/info/attributes. It does not make sense to do
that in general, but it may be useful if you do git bisect.

BTW, nowadays, we have much better alternative than using

* crlf=input

Instead of it, you probably want to use:

* text=auto

which will automatically detect text files, so you won't have problems
with binary files. All text files are put into the repository with LF,
but users may have different endings in their working tree if they like.


Dmitry

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFH] eol=lf on existing mixed line-ending files
  2011-04-09 18:58 ` Dmitry Potapov
@ 2011-04-09 19:32   ` Jeff King
  2011-04-09 20:09     ` Dmitry Potapov
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2011-04-09 19:32 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: git, Mislav Marohnic

On Sat, Apr 09, 2011 at 10:58:59PM +0400, Dmitry Potapov wrote:

> > Now we come to the first confusing behavior. Generally one would expect
> > the working directory to be clean after a "git reset --hard". But not
> > here:
> >
> >  git reset --hard &&
> >  git status
> >
> > will still show "mixed" as modified.
> 
> It is because you discard all changes except to .gitattributes.  If
> .gitattributes were tracked, "reset" would discard them too, and you
> would get clean original state.

Yeah, in this case. But gitattributes could easily be in the repository
already, and reset still wouldn't change it (as it is in the jquery
example).

> > So that kind of makes sense. But it isn't all that helpful, if I just
> > want to reset my working tree to something sane without making a new
> > commit (more on this later).
> 
> If we do not discard changes to .gitattributes then the question is
> what a sane state is? It is really difficult to define what is sane
> when conversion to the work tree and back gives a different result.

Agreed. The problem is the disconnect between what is in the repository,
and what _would_ be in the repository if we committed the file. So
obviously what the user is giving to git in this case is slightly
insane.

I just wonder if git can do better. But the only options I could think
of are:

  1. Set the working tree file to have just LF's. But that doesn't help,
     since it is the conversion _to_ linefeeds that make it look like
     the file is changed. So we'd still see unstaged changes.

  2. Set the index file to have just LF's. That would make the working
     tree look clean, but it would look like changes are staged, which
     is even worse.

> > But here's an extra helping of confusion on top. Every once in a while,
> > doing the reset _won't_ keep "mixed" as modified. I can trigger it
> > reliably by inserting an extra sleep into git:
> 
> you can have the same effect by doing:
> 
> git reset --hard HEAD && sleep 1 && git touch .git/index

Yeah, that has the same effect. I wanted to show the sleep inside git to
demonstrate that it really is an inside-git race condition.

> Ironically, that the race that you observed is result of fixing another
> race in git when files are changed too fast, so they may have the same
> timestamp. To prevent this race, git checks timestamp of .git/index
> and a trcking file. If .git/index timestamp is older or same as that file,
> this file is considered dirty. So, it is re-read from the disk to check
> if there are any changes. This works well but only if conversion to the
> work tree and back produces the same result.

Yeah, that's my analysis, too.

> > So we get two different outcomes, depending on the index raciness. Which
> > one is right, or is it right for it to be non-deterministic?
> 
> I like everything being deterministic, but in this case I do not see
> how it is possible without making the normal case much slower.

I think if you took my (1) suggestion above, it would be deterministic.
I don't know how much that would help. It would at least force people to
always see the change and hopefully spur them to commit the fixed
line-endings.

> > And one final question. Let's say I don't immediately convert this mixed
> > file to the correct line-endings.
> 
> IMHO, adding .gitattributes that specifies line endings while not
> fixing actual line endings of existing files is really a bad idea.

I absolutely agree, and my first advice upon seeing this jquery repo was
to fix those line endings. But they went for over a year with the broken
setup, so clearly it wasn't bothering them. I wonder what git could do
better to provoke them to fix it sooner.

> As with any other filter, the rule is that conversion from git to
> the working tree and back should give the same result for any file
> in the repository, otherwise you will have a lot of troubles later.

I think that's a good rule in general, but doesn't crlf=input (and now
eol=lf, and by extension, the text attribute) encourage exactly that if
you have mixed line-ending files?

I think the moral of the story may simply be that mixed line-ending text
files are an abomination which should be rooted out and destroyed.

> >  git clone git://github.com/jquery/jquery.git &&
> >  cd jquery &&
> >  git checkout 1.4.2 &&
> >  git checkout master
> >
> > which will fail (but may succeed racily on a slow enough machine).
> > Obviously they need to fix the mixed line-ending files in their repo.
> > But that fix would be on HEAD, and "git checkout 1.4.2" will be forever
> > broken. Is there a way to fix that?
> 
> You cannot change the past history. Well, you can overwrite that
> setting using .git/info/attributes. It does not make sense to do
> that in general, but it may be useful if you do git bisect.

The problem with that is that for recent commits you want one set of
attributes (where the files have been fixed), and for going back to
older commits, you want a different set of attributes (where you say
"don't care about line endings in these files").

One solution would be to have a git-notes ref with per-commit
attributes, so you could selectively override attributes as you explore
history.

> BTW, nowadays, we have much better alternative than using
> 
> * crlf=input
> 
> Instead of it, you probably want to use:
> 
> * text=auto

Agreed, and I already recommended that to jquery people (actually, one
of the problem files you will see in the example above is a binary file,
though later on they ended up fixing its attributes by specifically
marking its extension as binary).

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFH] eol=lf on existing mixed line-ending files
  2011-04-09 19:32   ` Jeff King
@ 2011-04-09 20:09     ` Dmitry Potapov
  0 siblings, 0 replies; 7+ messages in thread
From: Dmitry Potapov @ 2011-04-09 20:09 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Mislav Marohnic

On Sat, Apr 9, 2011 at 11:32 PM, Jeff King <peff@github.com> wrote:
>
> I just wonder if git can do better. But the only options I could think
> of are:
>
>  1. Set the working tree file to have just LF's. But that doesn't help,
>     since it is the conversion _to_ linefeeds that make it look like
>     the file is changed. So we'd still see unstaged changes.

Well, eol=crlf _does_ set the working tree file to have only CRLF, but
you still have the same race as before. Just because git converts file,
it does not think about it as "dirty". It becomes "dirty" when git
tries to convert it back and gets a different result, and whether it
tries or not depend on timestamp. So, you still have the same race.

>> > So we get two different outcomes, depending on the index raciness. Which
>> > one is right, or is it right for it to be non-deterministic?
>>
>> I like everything being deterministic, but in this case I do not see
>> how it is possible without making the normal case much slower.
>
> I think if you took my (1) suggestion above, it would be deterministic.

Replace "eol=lf" with "eol=crlf" in your script, and you will see that
it does not help with the race.

>
> I absolutely agree, and my first advice upon seeing this jquery repo was
> to fix those line endings. But they went for over a year with the broken
> setup, so clearly it wasn't bothering them. I wonder what git could do
> better to provoke them to fix it sooner.

I believe git should consider all files as "dirty" if .gitattributes is
changed. So, you cannot accidentally commit changes to .gitattributes
without fixing line endings. Currently, you can provoke git to consider
all files as dirty by doing:

  touch -d 2000-1-1 .git/index

but we should not expect users to do that after editing .gitattributes.


Dmitry

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFH] eol=lf on existing mixed line-ending files
  2011-04-07 23:15 [RFH] eol=lf on existing mixed line-ending files Jeff King
  2011-04-08  9:36 ` Michael J Gruber
  2011-04-09 18:58 ` Dmitry Potapov
@ 2011-04-12 13:57 ` Jay Soffian
  2 siblings, 0 replies; 7+ messages in thread
From: Jay Soffian @ 2011-04-12 13:57 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Mislav Marohnic, Dmitry Potapov

On Thu, Apr 7, 2011 at 7:15 PM, Jeff King <peff@peff.net> wrote:
> I investigated some odd git behavior with the EOL gitattributes today,
> and I'm curious to hear what others on the list think of what git does.
> In particular, index raciness means git produces non-deterministic
> results in this case.

Relatedly, http://article.gmane.org/gmane.comp.version-control.git/169476

j.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2011-04-12 13:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-07 23:15 [RFH] eol=lf on existing mixed line-ending files Jeff King
2011-04-08  9:36 ` Michael J Gruber
2011-04-08 16:06   ` Jeff King
2011-04-09 18:58 ` Dmitry Potapov
2011-04-09 19:32   ` Jeff King
2011-04-09 20:09     ` Dmitry Potapov
2011-04-12 13:57 ` Jay Soffian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).