* Git EOL Normalization
From: Stephen Bash @ 2011-05-25 15:20 UTC
To: git

Hi all-

At the office we recently had a few commits flipping end-of-line characters
on complete files so I spent some time decoding the man pages on eol,
autocrlf, and the text attribute.  From the current man pages I generated the
tables below and I'm wondering a) if my interpretation is correct, and b) if
there's value in putting these somewhere in the Git wiki (page name
suggestions welcome!)?

Here are the tables (warning: poor attempt at ascii art... sorry!):

Configuration variables:
+---------------+--------+--------------+---------------------+----------------+
| Property      | Value  | Check in/out | Applies to          | Does what      |
+---------------+--------+--------------+---------------------+----------------+
| core.eol      | native | check out    | files marked text   | set EOL to OS  |
|               |        |              |                     | native         |
+---------------+--------+--------------+---------------------+----------------+
| core.eol      | LF     | check out    | files marked text   | set EOL to LF  |
+---------------+--------+--------------+---------------------+----------------+
| core.eol      | CRLF   | check out    | files marked text   | set EOL to CRLF|
+---------------+--------+--------------+---------------------+----------------+
| core.autocrlf | true   | check out    | detected text files | set EOL to CRLF|
+---------------+--------+--------------+---------------------+----------------+
|               |        |              | other files         | check out as is|
+---------------+--------+--------------+---------------------+----------------+
|               |        | check in     | detected text files | LF in repo,    |
|               |        |              |                     | checkin LF     |
|               |        |              |                     | CRLF in repo,  |
|               |        |              |                     | checkin CRLF   |
+---------------+--------+--------------+---------------------+----------------+
|               |        |              | other files         | check in as is |
+---------------+--------+--------------+---------------------+----------------+
| core.autocrlf | input  | check in     | detected text files | LF in repo,    |
|               |        |              |                     | checkin LF     |
|               |        |              |                     | CRLF in repo,  |
|               |        |              |                     | checkin CRLF   |
+---------------+--------+--------------+---------------------+----------------+
|               |        |              | other files         | check in as is |
+---------------+--------+--------------+---------------------+----------------+
| core.autocrlf | unset  | nothing                                             |
+---------------+--------+--------------+---------------------+----------------+

Git attributes:
+-----------+------------+--------------+---------------------+----------------+
| Attribute | Value      | Check in/out | Applies to          | Does what      |
+-----------+------------+--------------+---------------------+----------------+
| text      | set        | check in     | matching files      | set EOL to LF  |
+-----------+------------+--------------+---------------------+----------------+
|           | unset      | check in     | matching files      | check in as is |
+-----------+------------+--------------+---------------------+----------------+
|           | auto       | check in     | matching detected   | set EOL to LF  |
|           |            |              | text files          |                |
+-----------+------------+--------------+---------------------+----------------+
|           |            |              | matching non-text   | check in as is |
|           |            |              | files               |                |
+-----------+------------+--------------+---------------------+----------------+
|           | unspecified| check in     | delegate to core.autocrlf            |
+-----------+------------+--------------+---------------------+----------------+
| eol       | LF         | check out    | matching files      | set EOL to LF  |
+-----------+------------+--------------+---------------------+----------------+
|           | CRLF       | check out    | matching files      | set EOL to CRLF|
+-----------+------------+--------------+---------------------+----------------+

The open questions for me are:
  1) what is the actual text file detection algorithm?
  2) what is the autocrlf LF/CRLF detection algorithm?
  3) how does autocrlf handle mixed line endings? (either in the working
     copy or repo)

Thanks,
Stephen

* Re: Git EOL Normalization
From: Dmitry Potapov @ 2011-05-25 17:58 UTC
To: Stephen Bash; +Cc: git

On Wed, May 25, 2011 at 7:20 PM, Stephen Bash <bash@genarts.com> wrote:
>
> The open questions for me are:
>  1) what is the actual text file detection algorithm?
>  2) what is the autocrlf LF/CRLF detection algorithm?
>  3) how does autocrlf handle mixed line endings? (either in the working copy or repo)

Git looks at the text attribute of a file. If it is set or unset then it
treats the file as text or binary accordingly. If the text attribute is
'auto', or it is unspecified but core.autocrlf is true, then git uses
heuristics to detect text files.

Currently, the following heuristics are used:

A file is considered as text if it does not have '\0' or a bare CR, and
the number of non-printable characters is less than 1 in 128.

Non-printable characters are DEL (127) and anything less than 32 except
CR, LF, BS, HT, ESC and FF.

Also, to avoid problems with autocrlf=true when someone has already put
a text file with CRLF, CRLF->LF conversion happens only if the tracked
file in the index does not have any CR.

Dmitry

PS I wrote this mostly from my memory, so I could miss some detail.
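
The heuristic described in the message above can be sketched roughly as
follows. This is only an illustration of the prose description, written in
Python for brevity; it is not taken from git's convert.c. The function name
looks_like_text is made up for this sketch, and the exact reading of "less
than 1 in 128" (non-printable count compared against printable count divided
by 128) is an assumption.

```python
# Control bytes the message lists as acceptable in text: BS, HT, ESC, FF.
# CR and LF are handled separately below as line-ending bytes.
ALLOWED_CONTROL = {0x08, 0x09, 0x1B, 0x0C}


def looks_like_text(data: bytes) -> bool:
    """Hypothetical helper: apply the text-detection rules from the message."""
    printable = 0
    nonprintable = 0
    i = 0
    n = len(data)
    while i < n:
        c = data[i]
        if c == 0x00:
            return False  # any NUL byte: treat as binary
        if c == 0x0D:  # CR
            if i + 1 < n and data[i + 1] == 0x0A:
                i += 2  # CRLF pair: an ordinary line ending
                continue
            return False  # bare CR: treat as binary
        if c == 0x0A:
            pass  # LF: a line ending, counted in neither bucket
        elif c == 0x7F or (c < 0x20 and c not in ALLOWED_CONTROL):
            nonprintable += 1  # DEL or a disallowed control byte
        else:
            printable += 1
        i += 1
    # Assumption: "less than 1 in 128" is read as
    # nonprintable <= printable // 128.
    return nonprintable <= printable // 128


if __name__ == "__main__":
    print(looks_like_text(b"hello\r\nworld\n"))  # CRLF and LF line endings
    print(looks_like_text(b"a\rb"))              # bare CR
    print(looks_like_text(b"h\x00i\x00"))        # UTF-16LE-style NUL bytes
```

Under these rules a file with mixed LF and CRLF endings still counts as
text; only NUL bytes, a CR not followed by LF, or too many control bytes
tip it over to binary.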

* Re: Git EOL Normalization
From: Stephen Bash @ 2011-05-25 18:06 UTC
To: Dmitry Potapov; +Cc: git

----- Original Message -----
> From: "Dmitry Potapov" <dpotapov@gmail.com>
> Sent: Wednesday, May 25, 2011 1:58:33 PM
> Subject: Re: Git EOL Normalization
>
> > 1) what is the actual text file detection algorithm?
> > 2) what is the autocrlf LF/CRLF detection algorithm?
> > 3) how does autocrlf handle mixed line endings? (either in the
> > working copy or repo)
>
> Currently, the following heuristics are used:
>
> A file is considered as text if it does not have '\0' or a bare CR,
> and the number of non-printable characters is less than 1 in 128.
>
> Non-printable characters are DEL (127) and anything less than 32
> except CR, LF, BS, HT, ESC and FF.
>
> Also, to avoid problems with autocrlf=true when someone has already
> put a text file with CRLF, CRLF->LF conversion happens only if the tracked
> file in the index does not have any CR.
>
> PS I wrote this mostly from my memory, so I could miss some detail.

Thanks! This is very helpful.

Stephen

* Re: Git EOL Normalization
From: Jakub Narebski @ 2011-05-26 6:02 UTC
To: Dmitry Potapov; +Cc: Stephen Bash, git

Dmitry Potapov <dpotapov@gmail.com> writes:
> On Wed, May 25, 2011 at 7:20 PM, Stephen Bash <bash@genarts.com> wrote:
> >
> > The open questions for me are:
> >  1) what is the actual text file detection algorithm?
> >  2) what is the autocrlf LF/CRLF detection algorithm?
> >  3) how does autocrlf handle mixed line endings? (either in the working copy or repo)
>
> Git looks at the text attribute of a file. If it is set or unset then it
> treats the file as text or binary accordingly. If the text attribute is
> 'auto', or it is unspecified but core.autocrlf is true, then git uses
> heuristics to detect text files.
>
> Currently, the following heuristics are used:
>
> A file is considered as text if it does not have '\0' or a bare CR, and
> the number of non-printable characters is less than 1 in 128.
>
> Non-printable characters are DEL (127) and anything less than 32 except
> CR, LF, BS, HT, ESC and FF.

I think git examines only first block of a file or so.  The heuristic
to detect binary-ness of a file is, as I have heard, the same or
similar to the one that GNU diff uses.

See also `perldoc -f -X`, description of "-T" and "-B" switches, though
this might differ somewhat in detection and thresholds.

> Also, to avoid problems with autocrlf=true when someone has already put
> a text file with CRLF, CRLF->LF conversion happens only if the tracked
> file in the index does not have any CR.

See also documentation of `core.safecrlf` config variable (defaults to
true IIRC).

--
Jakub Narebski
Poland
ShadeHawk on #git
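
The check-in guard quoted above (convert CRLF to LF only when the copy
already tracked in the index has no CR) could be sketched like this. It is
a toy model of Dmitry's from-memory description, not git's implementation:
normalize_on_checkin is a made-up name, the file is assumed to have been
classified as text already, and core.safecrlf is not modelled at all.

```python
from typing import Optional


def normalize_on_checkin(worktree: bytes, index_blob: Optional[bytes]) -> bytes:
    """Hypothetical helper: CRLF -> LF on check-in, with the index guard.

    worktree   -- file content as it sits in the working copy
    index_blob -- content currently tracked in the index, or None if the
                  file is new (assumption: new files are always converted)
    """
    if index_blob is not None and b"\r" in index_blob:
        # The tracked copy already contains CRs, so leave the content
        # alone rather than rewriting every line of the file.
        return worktree
    return worktree.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    crlf = b"one\r\ntwo\r\n"
    print(normalize_on_checkin(crlf, None))            # new file
    print(normalize_on_checkin(crlf, b"one\ntwo\n"))   # index holds LF only
    print(normalize_on_checkin(crlf, crlf))            # index already has CRLF
```

The third call is the case the guard exists for: a file that someone already
committed with CRLF keeps its line endings instead of showing up as changed
on every line.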

* Re: Git EOL Normalization
From: Dmitry Potapov @ 2011-05-26 7:20 UTC
To: Jakub Narebski; +Cc: Stephen Bash, git

On Thu, May 26, 2011 at 10:02 AM, Jakub Narebski <jnareb@gmail.com> wrote:
>
> I think git examines only first block of a file or so.

I have looked at convert.c:
http://git.kernel.org/?p=git/git.git;a=blob;f=convert.c;h=efc7e07d475c66f7835dc6cbbd3bc358f01c41c3;hb=HEAD

and gather_stats works on the whole file as far as I can tell.
Did I miss something?

Dmitry

* Re: Git EOL Normalization
From: Junio C Hamano @ 2011-05-26 16:07 UTC
To: Jakub Narebski; +Cc: Dmitry Potapov, Stephen Bash, git

Jakub Narebski <jnareb@gmail.com> writes:

> I think git examines only first block of a file or so.  The heuristic
> to detect binary-ness of a file is, as I have heard, the same or
> similar to the one that GNU diff uses.

Yes, the binary detection was designed to be compatible with GNU diff. But
I do not think it has much to do with the topic of this thread. Aren't
other people discussing the line ending?

* Re: Git EOL Normalization
From: Stephen Bash @ 2011-05-26 16:28 UTC
To: Junio C Hamano; +Cc: Dmitry Potapov, git, Jakub Narebski

----- Original Message -----
> From: "Junio C Hamano" <gitster@pobox.com>
> To: "Jakub Narebski" <jnareb@gmail.com>
> Sent: Thursday, May 26, 2011 12:07:21 PM
> Subject: Re: Git EOL Normalization
>
> > I think git examines only first block of a file or so. The heuristic
> > to detect binary-ness of a file is, as I have heard, the same or
> > similar to the one that GNU diff uses.
>
> Yes, the binary detection was designed to be compatible with GNU diff. But
> I do not think it has much to do with the topic of this thread. Aren't
> other people discussing the line ending?

The binary detection may be apropos because there are situations
(core.autocrlf={true,input} and text=auto) where Git will only do line
ending conversion if it detects a text file...  But I'll leave it to
people who know the code better to say if this binary detection is in
fact part of the decision process.

Thanks,
Stephen

* Re: Git EOL Normalization
From: Drew Northup @ 2011-05-31 15:01 UTC
To: Stephen Bash; +Cc: Junio C Hamano, Dmitry Potapov, git, Jakub Narebski

On Thu, 2011-05-26 at 12:28 -0400, Stephen Bash wrote:
> ----- Original Message -----
> > From: "Junio C Hamano" <gitster@pobox.com>
> > To: "Jakub Narebski" <jnareb@gmail.com>
> > Sent: Thursday, May 26, 2011 12:07:21 PM
> > Subject: Re: Git EOL Normalization
> >
> > > I think git examines only first block of a file or so. The heuristic
> > > to detect binary-ness of a file is, as I have heard, the same or
> > > similar to the one that GNU diff uses.
> >
> > Yes, the binary detection was designed to be compatible with GNU diff. But
> > I do not think it has much to do with the topic of this thread. Aren't
> > other people discussing the line ending?
>
> The binary detection may be apropos because there are situations
> (core.autocrlf={true,input} and text=auto) where Git will only do line
> ending conversion if it detects a text file...  But I'll leave it to
> people who know the code better to say if this binary detection is in
> fact part of the decision process.

Currently UTF-16 and UTF-32 (which many consider to be text files) are
detected as binary files by Git (due to said compatibility with GNU
diff). Therefore EOL normalization does not happen on those files. I
have played a little with detecting (and eventually do the same for
normalizing) reasonably valid UTF-16 (BE and LE), but my code is nowhere
near ready for the big time, much less properly tested.

As for diff-ing UTF-16/UTF-32 for purely human consumption, I would be
tempted to iconv (smudge?) the text into UTF-8 and then let the diff-ing
algorithm deal with it. Not a perfect solution, but perfect should not
be the enemy of good in that case. Unfortunately this would not produce
proper patches for mailing.

(As for how we'd know it is UTF-32 and not a binary, I'll leave that for
further discussion should we need it. I suspect we'd have to trust the
user. UGH.)

--
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59
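
A small sketch of the "convert first, then diff" idea from the message
above, for human-readable output only. It uses Python's standard codecs
and difflib rather than iconv or any git filter; diff_utf16 and the a/file
and b/file labels are made up for this illustration, and the inputs are
assumed to be well-formed UTF-16 with a BOM.

```python
import difflib


def diff_utf16(old: bytes, new: bytes) -> list:
    """Hypothetical helper: decode two UTF-16 blobs and diff them as text."""
    old_lines = old.decode("utf-16").splitlines(keepends=True)
    new_lines = new.decode("utf-16").splitlines(keepends=True)
    return list(difflib.unified_diff(old_lines, new_lines, "a/file", "b/file"))


if __name__ == "__main__":
    old = "first\nsecond\n".encode("utf-16")
    new = "first\nchanged\n".encode("utf-16")
    # ASCII text stored as UTF-16 is full of NUL bytes, which is what makes
    # a NUL-based binary check reject it.
    print(b"\x00" in old)
    print("".join(diff_utf16(old, new)), end="")
```

As the message notes, a diff produced this way describes the decoded text,
not the stored bytes, so it would not apply as a patch to the original
UTF-16 file.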