* Git EOL Normalization
       [not found] <20833035.39857.1306334468204.JavaMail.root@mail.hq.genarts.com>
@ 2011-05-25 15:20 ` Stephen Bash
  2011-05-25 17:58   ` Dmitry Potapov
  0 siblings, 1 reply; 8+ messages in thread
From: Stephen Bash @ 2011-05-25 15:20 UTC (permalink / raw)
  To: git

Hi all-

At the office we recently had a few commits flipping end-of-line characters on complete files, so I spent some time decoding the man pages on eol, autocrlf, and the text attribute.  From the current man pages I generated the tables below, and I'm wondering a) whether my interpretation is correct, and b) whether there's value in putting these somewhere in the Git wiki (page name suggestions welcome!).

Here are the tables (warning: poor attempt at ASCII art... sorry!):

Configuration variables:
+---------------+--------+--------------+---------------------+----------------+
| Property      | Value  | Check in/out | Applies to          | Does what      |
+---------------+--------+--------------+---------------------+----------------+
| core.eol      | native | check out    | files marked text   | set EOL to OS  |
|               |        |              |                     |    native      |
+---------------+--------+--------------+---------------------+----------------+
| core.eol      | LF     | check out    | files marked text   | set EOL to LF  |
+---------------+--------+--------------+---------------------+----------------+
| core.eol      | CRLF   | check out    | files marked text   | set EOL to CRLF|
+---------------+--------+--------------+---------------------+----------------+
| core.autocrlf | true   | check out    | detected text files | set EOL to CRLF|
+---------------+--------+--------------+---------------------+----------------+
|               |        |              | other files         | check out as is|
+---------------+--------+--------------+---------------------+----------------+
|               |        | check in     | detected text files | LF in repo,    |
|               |        |              |                     |   checkin LF   |
|               |        |              |                     | CRLF in repo,  |
|               |        |              |                     |   checkin CRLF |
+---------------+--------+--------------+---------------------+----------------+
|               |        |              | other files         | check in as is |
+---------------+--------+--------------+---------------------+----------------+
| core.autocrlf | input  | check in     | detected text files | LF in repo,    |
|               |        |              |                     |   checkin LF   |
|               |        |              |                     | CRLF in repo,  |
|               |        |              |                     |   checkin CRLF |
+---------------+--------+--------------+---------------------+----------------+
|               |        |              | other files         | check in as is |
+---------------+--------+--------------+---------------------+----------------+
| core.autocrlf | unset  | nothing                                             |
+---------------+--------+--------------+---------------------+----------------+

Git attributes:
+-----------+------------+--------------+---------------------+----------------+
| Attribute | Value      | Check in/out | Applies to          | Does what      |
+-----------+------------+--------------+---------------------+----------------+
| text      | set        | check in     | matching files      | set EOL to LF  |
+-----------+------------+--------------+---------------------+----------------+
|           | unset      | check in     | matching files      | check in as is |
+-----------+------------+--------------+---------------------+----------------+
|           | auto       | check in     | matching detected   | set EOL to LF  |
|           |            |              |   text files        |                |
+-----------+------------+--------------+---------------------+----------------+
|           |            |              | matching non-text   | check in as is |
|           |            |              |    files            |                |
+-----------+------------+--------------+---------------------+----------------+
|           | unspecified| check in     | delegate to core.autocrlf            | 
+-----------+------------+--------------+---------------------+----------------+
| eol       | LF         | check out    | matching files      | set EOL to LF  |
+-----------+------------+--------------+---------------------+----------------+
|           | CRLF       | check out    | matching files      | set EOL to CRLF|
+-----------+------------+--------------+---------------------+----------------+

The open questions for me are:
  1) what is the actual text file detection algorithm?
  2) what is the autocrlf LF/CRLF detection algorithm?
  3) how does autocrlf handle mixed line endings? (either in the working copy or repo)

Thanks,
Stephen


* Re: Git EOL Normalization
  2011-05-25 15:20 ` Git EOL Normalization Stephen Bash
@ 2011-05-25 17:58   ` Dmitry Potapov
  2011-05-25 18:06     ` Stephen Bash
  2011-05-26  6:02     ` Jakub Narebski
  0 siblings, 2 replies; 8+ messages in thread
From: Dmitry Potapov @ 2011-05-25 17:58 UTC (permalink / raw)
  To: Stephen Bash; +Cc: git

On Wed, May 25, 2011 at 7:20 PM, Stephen Bash <bash@genarts.com> wrote:
>
> The open questions for me are:
>  1) what is the actual text file detection algorithm?
>  2) what is the autocrlf LF/CRLF detection algorithm?
>  3) how does autocrlf handle mixed line endings? (either in the working copy or repo)

Git looks at the text attribute of a file. If it is set or unset then it
treats the file as text or binary accordingly. If the text attribute is
'auto', or it is unspecified but core.autocrlf is true, then git uses
heuristics to detect text files.
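
[Editorial sketch, not part of the original message.] The decision described in the paragraph above can be written out as a small Python function. This is only an illustration of the prose, not Git's code: the function name, the string values used for the attribute and config states, and the "as-is" fallback are all assumptions made for the example. Treating core.autocrlf=input the same as true for this decision is also an assumption.

```python
from typing import Optional


def classify_for_eol(text_attr: Optional[str], autocrlf: Optional[str]) -> str:
    """Return "text", "binary", "detect", or "as-is" for one path.

    text_attr: "set", "unset", "auto", or None when the attribute is unspecified.
    autocrlf:  "true", "input", "false", or None when not configured.
    (These string encodings are hypothetical, chosen for the example.)
    """
    if text_attr == "set":
        return "text"      # explicitly marked as text
    if text_attr == "unset":
        return "binary"    # explicitly marked as not text
    if text_attr == "auto":
        return "detect"    # content heuristics decide
    # Attribute unspecified: fall back to core.autocrlf.
    if autocrlf in ("true", "input"):  # "input" included as an assumption
        return "detect"
    return "as-is"         # no conversion configured


if __name__ == "__main__":
    for attr in ("set", "unset", "auto", None):
        print(attr, "->", classify_for_eol(attr, "true"))
```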

Currently, the following heuristics are used:

A file is considered text if it does not contain '\0' or a bare CR, and
the number of non-printable characters is less than 1 in 128.

Non-printable characters are DEL (127) and anything less than 32 except
CR, LF, BS, HT, ESC and FF.
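
[Editorial sketch, not part of the original message.] The heuristic as described above can be approximated in Python as follows. The function name and the exact form of the threshold are assumptions: "less than 1 in 128" is read here as "at most one non-printable byte per 128 printable bytes", and CR/LF bytes are not counted on either side. The real logic lives in convert.c and may differ in detail.

```python
# Control bytes that the description treats as printable: BS, HT, FF, ESC.
ALLOWED_CONTROL = {0x08, 0x09, 0x0C, 0x1B}


def looks_like_text(data: bytes) -> bool:
    """Approximate the text-detection heuristic described in the message."""
    printable = 0
    nonprintable = 0
    i = 0
    n = len(data)
    while i < n:
        c = data[i]
        if c == 0x00:
            return False                      # any NUL byte means binary
        if c == 0x0D:                         # CR
            if i + 1 < n and data[i + 1] == 0x0A:
                i += 2                        # CRLF pair: fine, skip both
                continue
            return False                      # bare CR means binary
        if c == 0x0A:                         # LF: fine, not counted
            i += 1
            continue
        if c == 127 or (c < 32 and c not in ALLOWED_CONTROL):
            nonprintable += 1
        else:
            printable += 1
        i += 1
    # Assumed reading of "less than 1 in 128".
    return nonprintable <= printable // 128


if __name__ == "__main__":
    print(looks_like_text(b"hello\r\nworld\r\n"))
    print(looks_like_text(b"PNG\x00\x01\x02"))
```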

Also, to avoid problems with autocrlf=true when someone has already put
a text file with CRLF, CRLF->LF conversion happens only if the tracked
file in the index does not have any CR.
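
[Editorial sketch, not part of the original message.] The safeguard described above might look roughly like this in Python. The function and parameter names are hypothetical; `indexed` stands for the content already tracked in the index (None for a new file), and the sketch assumes the file has already been judged to be text.

```python
from typing import Optional


def normalize_on_checkin(worktree: bytes, indexed: Optional[bytes]) -> bytes:
    """Convert CRLF to LF unless the tracked version already contains a CR."""
    if indexed is not None and b"\r" in indexed:
        # Someone already committed this file with CR characters;
        # leave the working-tree content untouched.
        return worktree
    return worktree.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    print(normalize_on_checkin(b"a\r\nb\r\n", None))
    print(normalize_on_checkin(b"a\r\nb\r\n", b"a\r\n"))
```

The point of the guard is that a file committed with CRLF before autocrlf was enabled does not suddenly show up as modified on every line.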


Dmitry
PS: I wrote this mostly from memory, so I may have missed some details.


* Re: Git EOL Normalization
  2011-05-25 17:58   ` Dmitry Potapov
@ 2011-05-25 18:06     ` Stephen Bash
  2011-05-26  6:02     ` Jakub Narebski
  1 sibling, 0 replies; 8+ messages in thread
From: Stephen Bash @ 2011-05-25 18:06 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: git

----- Original Message -----
> From: "Dmitry Potapov" <dpotapov@gmail.com>
> Sent: Wednesday, May 25, 2011 1:58:33 PM
> Subject: Re: Git EOL Normalization
>
> >  1) what is the actual text file detection algorithm?
> >  2) what is the autocrlf LF/CRLF detection algorithm?
> >  3) how does autocrlf handle mixed line endings? (either in the
> >  working copy or repo)
>
> Currently, the following heuristics are used:
> 
> A file is considered as text if it does not have '\0' or a bare CR,
> and the number of non-printable characters is less than 1 in 128.
> 
> Non-printable characters are DEL (127) and anything less than 32
> except CR, LF, BS, HT, ESC and FF.
>
> Also, to avoid problems with autocrlf=true when someone has already
> put a text file with CRLF, CRLF->LF conversion happens only if the tracked
> file in the index does not have any CR.
>
> PS I wrote this mostly from my memory, so I could miss some detail.

Thanks!  This is very helpful.

Stephen


* Re: Git EOL Normalization
  2011-05-25 17:58   ` Dmitry Potapov
  2011-05-25 18:06     ` Stephen Bash
@ 2011-05-26  6:02     ` Jakub Narebski
  2011-05-26  7:20       ` Dmitry Potapov
  2011-05-26 16:07       ` Junio C Hamano
  1 sibling, 2 replies; 8+ messages in thread
From: Jakub Narebski @ 2011-05-26  6:02 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Stephen Bash, git

Dmitry Potapov <dpotapov@gmail.com> writes:

> On Wed, May 25, 2011 at 7:20 PM, Stephen Bash <bash@genarts.com> wrote:
> >
> > The open questions for me are:
> >  1) what is the actual text file detection algorithm?
> >  2) what is the autocrlf LF/CRLF detection algorithm?
> >  3) how does autocrlf handle mixed line endings? (either in the working copy or repo)
> 
> Git looks at the text attribute of a file. If it is set or unset then it
> treats the file as text or binary accordingly. If the text attribute is
> 'auto', or it is unspecified but core.autocrlf is true, then git uses
> heuristics to detect text files.
> 
> Currently, the following heuristics are used:
> 
> A file is considered as text if it does not have '\0' or a bare CR, and
> the number of non-printable characters is less than 1 in 128.
> 
> Non-printable characters are DEL (127) and anything less than 32 except
> CR, LF, BS, HT, ESC and FF.

I think git examines only the first block of a file or so.  The heuristic
to detect binary-ness of a file is, as I have heard, the same as or
similar to the one that GNU diff uses.

See also `perldoc -f -X`, description of "-T" and "-B" switches,
though this might differ somewhat in detection and thresholds.
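
[Editorial sketch, not part of the original message.] A "first block only" check of the kind mentioned above could be as simple as looking for a NUL byte in a leading window of the file. The function name and the 8000-byte window are assumptions for illustration; this thread does not establish the exact block size or rule that Git or GNU diff uses.

```python
def first_block_has_nul(data: bytes, block_size: int = 8000) -> bool:
    """Return True if a NUL byte appears within the first block_size bytes."""
    return b"\x00" in data[:block_size]


if __name__ == "__main__":
    print(first_block_has_nul(b"plain text\n"))
    print(first_block_has_nul(b"\x89PNG\r\n\x1a\n\x00\x00"))
```

Note the consequence of a windowed check: a NUL byte that appears only after the window goes unnoticed, which is exactly why it matters whether the whole file or only its first block is examined.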
 
> Also, to avoid problems with autocrlf=true when someone has already put
> a text file with CRLF, CRLF->LF conversion happens only if the tracked
> file in the index does not have any CR.

See also documentation of `core.safecrlf` config variable (defaults to
true IIRC).

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Git EOL Normalization
  2011-05-26  6:02     ` Jakub Narebski
@ 2011-05-26  7:20       ` Dmitry Potapov
  2011-05-26 16:07       ` Junio C Hamano
  1 sibling, 0 replies; 8+ messages in thread
From: Dmitry Potapov @ 2011-05-26  7:20 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Stephen Bash, git

On Thu, May 26, 2011 at 10:02 AM, Jakub Narebski <jnareb@gmail.com> wrote:
>
> I think git examines only first block of a file or so.

I have looked at convert.c:

http://git.kernel.org/?p=git/git.git;a=blob;f=convert.c;h=efc7e07d475c66f7835dc6cbbd3bc358f01c41c3;hb=HEAD

and gather_stats works on the whole file as far as I can tell.

Did I miss something?


Dmitry


* Re: Git EOL Normalization
  2011-05-26  6:02     ` Jakub Narebski
  2011-05-26  7:20       ` Dmitry Potapov
@ 2011-05-26 16:07       ` Junio C Hamano
  2011-05-26 16:28         ` Stephen Bash
  1 sibling, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2011-05-26 16:07 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Dmitry Potapov, Stephen Bash, git

Jakub Narebski <jnareb@gmail.com> writes:

> I think git examines only first block of a file or so.  The heuristic
> to detect binary-ness of a file is, as I have heard, the same or
> similar to the one that GNU diff uses.

Yes, the binary detection was designed to be compatible with GNU diff. But
I do not think it has much to do with the topic of this thread. Aren't
other people discussing the line ending?


* Re: Git EOL Normalization
  2011-05-26 16:07       ` Junio C Hamano
@ 2011-05-26 16:28         ` Stephen Bash
  2011-05-31 15:01           ` Drew Northup
  0 siblings, 1 reply; 8+ messages in thread
From: Stephen Bash @ 2011-05-26 16:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Dmitry Potapov, git, Jakub Narebski

----- Original Message -----
> From: "Junio C Hamano" <gitster@pobox.com>
> To: "Jakub Narebski" <jnareb@gmail.com>
> Sent: Thursday, May 26, 2011 12:07:21 PM
> Subject: Re: Git EOL Normalization
> 
> > I think git examines only first block of a file or so. The heuristic
> > to detect binary-ness of a file is, as I have heard, the same or
> > similar to the one that GNU diff uses.
> 
> Yes, the binary detection was designed to be compatible with GNU diff. But
> I do not think it has much to do with the topic of this thread. Aren't
> other people discussing the line ending?

The binary detection may be apropos because there are situations (core.autocrlf={true,input} and text=auto) where Git will only do line ending conversion if it detects a text file...  But I'll leave it to people who know the code better to say if this binary detection is in fact part of the decision process.

Thanks,
Stephen


* Re: Git EOL Normalization
  2011-05-26 16:28         ` Stephen Bash
@ 2011-05-31 15:01           ` Drew Northup
  0 siblings, 0 replies; 8+ messages in thread
From: Drew Northup @ 2011-05-31 15:01 UTC (permalink / raw)
  To: Stephen Bash; +Cc: Junio C Hamano, Dmitry Potapov, git, Jakub Narebski


On Thu, 2011-05-26 at 12:28 -0400, Stephen Bash wrote:
> ----- Original Message -----
> > From: "Junio C Hamano" <gitster@pobox.com>
> > To: "Jakub Narebski" <jnareb@gmail.com>
> > Sent: Thursday, May 26, 2011 12:07:21 PM
> > Subject: Re: Git EOL Normalization
> > 
> > > I think git examines only first block of a file or so. The heuristic
> > > to detect binary-ness of a file is, as I have heard, the same or
> > > similar to the one that GNU diff uses.
> > 
> > Yes, the binary detection was designed to be compatible with GNU diff. But
> > I do not think it has much to do with the topic of this thread. Aren't
> > other people discussing the line ending?
> 
> The binary detection may be apropos because there are situations
> (core.autocrlf={true,input} and text=auto) where Git will only do line
> ending conversion if it detects a text file...  But I'll leave it to
> people who know the code better to say if this binary detection is in
> fact part of the decision process.

Currently UTF-16 and UTF-32 (which many consider to be text files) are
detected as binary files by Git (due to said compatibility with GNU
diff). Therefore EOL normalization does not happen on those files. 
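
[Editorial sketch, not part of the original message.] One way to see why this happens, assuming the NUL-byte rule described earlier in the thread is what triggers the binary classification: ASCII-range text encoded as UTF-16 or UTF-32 is full of zero bytes, while the same text in UTF-8 has none. The helper name below is made up for the example.

```python
def has_nul(data: bytes) -> bool:
    """Return True if the byte string contains at least one NUL byte."""
    return b"\x00" in data


text = "line one\nline two\n"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # each ASCII character becomes XX 00
utf32 = text.encode("utf-32-le")   # each ASCII character becomes XX 00 00 00

if __name__ == "__main__":
    for name, blob in (("utf-8", utf8), ("utf-16-le", utf16), ("utf-32-le", utf32)):
        print(name, len(blob), "bytes, contains NUL:", has_nul(blob))
```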

I have played a little with detecting (and eventually normalizing)
reasonably valid UTF-16 (BE and LE), but my code is nowhere near ready
for the big time, much less properly tested.

As for diff-ing UTF-16/UTF-32 for purely human consumption, I would be
tempted to iconv (smudge?) the text into UTF-8 and then let the diff-ing
algorithm deal with it. Not a perfect solution, but perfect should not
be the enemy of good in that case. Unfortunately this would not produce
proper patches for mailing. (As for how we'd know it is UTF-32 and not a
binary, I'll leave that for further discussion should we need it. I
suspect we'd have to trust the user. UGH.)
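
[Editorial sketch, not part of the original message.] The "convert, then diff" idea above can be illustrated in Python: decode both versions from UTF-16 and run an ordinary line-based diff on the decoded text. This is a stand-in for the iconv step, not anything Git does by itself; the function name is hypothetical, and the caller is trusted to name the right encoding, as the message suggests would be necessary.

```python
import difflib
from typing import List


def diff_decoded(old: bytes, new: bytes, encoding: str = "utf-16") -> List[str]:
    """Decode both blobs and return a unified diff of the resulting text."""
    old_lines = old.decode(encoding).splitlines(keepends=True)
    new_lines = new.decode(encoding).splitlines(keepends=True)
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile="a", tofile="b"))


if __name__ == "__main__":
    before = "alpha\nbeta\n".encode("utf-16")
    after = "alpha\ngamma\n".encode("utf-16")
    print("".join(diff_decoded(before, after)), end="")
```

As the message notes, the output is for human reading only: the diff describes the decoded text, so it cannot be applied as a patch to the original UTF-16 bytes.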

-- 
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59
