From: Lars Schneider <larsxschneider@gmail.com>
To: "Torsten Bögershausen" <tboegi@web.de>
Cc: Lars Schneider <lars.schneider@autodesk.com>,
Git List <git@vger.kernel.org>,
Junio C Hamano <gitster@pobox.com>,
peff@peff.net, patrick@luehne.de
Subject: Re: [PATCH v1] convert: add support for 'encoding' attribute
Date: Fri, 29 Dec 2017 14:56:17 +0100 [thread overview]
Message-ID: <2ADD036C-6714-4DE5-A290-4C76EFACE166@gmail.com> (raw)
In-Reply-To: <20171229125954.GA9772@tor.lan>
> On 29 Dec 2017, at 13:59, Torsten Bögershausen <tboegi@web.de> wrote:
>
> On 2017-12-28 17:14, Lars Schneider wrote:>
>>> On 17 Dec 2017, at 18:14, Torsten Bögershausen <tboegi@web.de> wrote:
>>>
>>> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@autodesk.com wrote:
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>>
>>
>>>> +`encoding`
>>>> +^^^^^^^^^^
>>>> +
>>>> +By default Git assumes UTF-8 encoding for text files. The `encoding`
>>>
>>> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.
>>
>> I am not sure I understand what you want to tell me here.
>> Do you think UTF-8 encoding is not correct in the sentence above?
>
> Git itself was designed to handle source code in ASCII.
> Text files which are encoded in ISO-8859-1, UTF-8 or any
> super-set of ASCII are handled as well, and what exactly to do
> with bytes >0x80 is outside the responsibility of Git.
> E.g. the interpretation and rendering on the screen may be
> dependent on the locale or being guessed.
> All in all, saying that Git expects UTF-8 may be "overdriven".
> Any kind of file that uses an '\n' as an end of line
> and has no NUL bytes '\0' works as a text file in Git.
> (End of BlaBla ;-)
>
> We can probably replace:
> "By default Git assumes UTF-8 encoding for text files"
>
> with something like
> "Git handles UTF-8 encoded files as text files, but UTF-16 encoded
> files are handled as binary files"
Well, UTF-32 are handled as binary file, too :-)
How about this?
Git recognizes files encoded with ASCII or one of its supersets
(e.g. UTF-8 or ISO-8859-1) as text files. All other...
>
>>>
>>>> +attribute sets the encoding to be used in the working directory.
>>>> +If the path is added to the index, then Git encodes the content to
>>>> +UTF-8. On checkout the content is encoded back to the original
>>>> +encoding. Consequently, you can use all built-in Git text processing
>>>> +tools (e.g. git diff, line ending conversions, etc.) with your
>>>> +non-UTF-8 encoded file.
>>>> +
>>>> +Please note that re-encoding content can cause errors and requires
>>>> +resources. Use the `encoding` attribute only if you cannot store
>>>> +a file in UTF-8 encoding and if you want Git to be able to process
>>>> +the text.
>>>> +
>>>> +------------------------
>>>> +*.txt text encoding=UTF-16
>>>> +------------------------
>>>
>>> I think that encoding (or worktree-encoding as said elsewhere) must imply text.
>>> (That is not done in convert.c, but assuming binary or even auto
>>> does not make much sense to me)
>>
>> "text" only means "LF". "-text" means CRLF which would be perfectly fine
>> for UTF-16. Therefore I don't think we need to imply text.
>> Does this make sense?
> Yes and now.
>
> "text" means convert CRLF at "git add" (or commit) into LF,
> And potentially LF into CRLF at checkout,
> depending on the EOL attribute (if set), core.autocrlf and/or core.eol.
>
> "-text" means don't touch CRLF or LF at all. Files with CRLF are commited
> with CRLF.
> Running a Unix like "diff" tool will often show "^M" to indicate that there
> is a CR before the LF, which may be distracting or confusing.
>
> If someone ever wants to handle the UTF-16 files on a Linux/Mac/Unix system,
> it makes sense to convert the line endings into LF before storing them
> into the index (at least this is my experience).
>
> (Not specifying "text" or "-text" at all will trigger the auto-handling,
> which is not needed at all).
> So if we want to be sure that line endings are CRLF in the worktree we
> should ask the user to specify things like this:
>
> *.ext worktree-encoding=UTF-16 text eol=CRLF
>
> It may be enough to give this example in the documentation.
> and I would rather be over-careful here, to avoid future problems.
Agreed. I'll add that example!
>>>> +
>>>> +All `iconv` encodings with a stable round-trip conversion to and from
>>>> +UTF-8 are supported. You can see a full list with the following command:
>>>> +
>>>> +------------------------
>>>> +iconv --list
>>>> +------------------------
>>>> +
>>>> `eol`
>>>> ^^^^^
>>>>
>>>> diff --git a/convert.c b/convert.c
>>>> index 20d7ab67bd..ee19c17104 100644
>>>> --- a/convert.c
>>>> +++ b/convert.c
>>>> @@ -7,6 +7,7 @@
>>>> #include "sigchain.h"
>>>> #include "pkt-line.h"
>>>> #include "sub-process.h"
>>>> +#include "utf8.h"
>>>>
>>>> /*
>>>> * convert.c - convert a file when checking it out and checking it in.
>>>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>>>>
>>>> }
>>>
>>> I would avoid to use these #ifdefs here.
>>> All functions can be in strbuf.c (and may have #ifdefs there), but not here.
>>
>> I'll try that. But wouldn't it make more sense to move the functions to utf.c?
>
> May be.
> Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
> Especially it didn't know anything about strbuf.
> I don't know why strbuf.h and other functions had been added here,
>
> I once moved them into strbuf.c without any problems, but never send out
> a patch, because of possible merge conflicts in ongoing patches.
>
> In any case, if it is about strbuf, I would try to put it into strbuf.c
I think I simplified the code. I reused "reencode_string_len".
I added just two BOM check functions to utf8.c ... I'll send out a new series
in a bit.
- Lars
next prev parent reply other threads:[~2017-12-29 13:56 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-12-11 15:50 [PATCH v1] convert: add support for 'encoding' attribute lars.schneider
2017-12-11 18:39 ` Eric Sunshine
2017-12-11 23:47 ` Lars Schneider
2017-12-11 23:58 ` Eric Sunshine
2017-12-12 10:58 ` Lars Schneider
2017-12-11 20:47 ` Johannes Sixt
2017-12-11 23:42 ` Lars Schneider
2017-12-12 0:59 ` Junio C Hamano
2017-12-12 7:15 ` Johannes Sixt
2017-12-12 10:55 ` Lars Schneider
2017-12-12 19:31 ` Junio C Hamano
2017-12-13 17:57 ` Lars Schneider
2017-12-13 18:11 ` Junio C Hamano
2017-12-13 23:02 ` Lars Schneider
2017-12-14 23:01 ` Junio C Hamano
2017-12-12 7:09 ` Johannes Sixt
2017-12-18 10:13 ` Torsten Bögershausen
2017-12-18 13:12 ` Jeff King
2017-12-23 8:08 ` Torsten Bögershausen
2017-12-29 13:28 ` [PATCH/RFC 0/2] git diff --UTF-8 tboegi
2017-12-29 13:28 ` [PATCH/RFC 1/2] convert_to_git(): checksafe becomes an integer tboegi
2017-12-29 13:28 ` [PATCH/RFC 2/2] git diff: Allow to reencode into UTF-8 tboegi
2018-02-26 17:27 ` [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8 tboegi
2018-02-26 18:43 ` Peter Krefting
2018-02-27 22:39 ` Jeff King
2017-12-18 18:02 ` [PATCH v1] convert: add support for 'encoding' attribute Junio C Hamano
2017-12-18 21:55 ` Johannes Sixt
2017-12-15 9:58 ` Jeff King
2017-12-18 10:54 ` Lars Schneider
2017-12-18 12:59 ` Jeff King
2017-12-17 17:14 ` Torsten Bögershausen
2017-12-28 16:14 ` Lars Schneider
2017-12-29 12:59 ` Torsten Bögershausen
2017-12-29 13:56 ` Lars Schneider [this message]
2018-01-03 19:15 ` Junio C Hamano
2018-01-03 20:45 ` Lars Schneider
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2ADD036C-6714-4DE5-A290-4C76EFACE166@gmail.com \
--to=larsxschneider@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=lars.schneider@autodesk.com \
--cc=patrick@luehne.de \
--cc=peff@peff.net \
--cc=tboegi@web.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).