From: "Torsten Bögershausen" <tboegi@web.de>
To: lars.schneider@autodesk.com
Cc: git@vger.kernel.org, gitster@pobox.com, j6t@kdbg.org,
sunshine@sunshineco.com, peff@peff.net,
ramsay@ramsayjones.plus.com, Johannes.Schindelin@gmx.de,
Lars Schneider <larsxschneider@gmail.com>
Subject: Re: [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute
Date: Sat, 10 Feb 2018 10:48:52 +0100 [thread overview]
Message-ID: <20180210094852.GB11525@tor.lan> (raw)
In-Reply-To: <20180209132830.55385-6-lars.schneider@autodesk.com>
On Fri, Feb 09, 2018 at 02:28:28PM +0100, lars.schneider@autodesk.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
>
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> Documentation/gitattributes.txt | 66 ++++++++++++
> convert.c | 157 ++++++++++++++++++++++++++++-
> convert.h | 1 +
> sha1_file.c | 2 +-
> t/t0028-working-tree-encoding.sh | 210 +++++++++++++++++++++++++++++++++++++++
> 5 files changed, 434 insertions(+), 2 deletions(-)
> create mode 100755 t/t0028-working-tree-encoding.sh
>
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 30687de81a..4ecdcd4859 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -272,6 +272,72 @@ few exceptions. Even though...
> catch potential problems early, safety triggers.
>
>
> +`working-tree-encoding`
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.
> +
> +In these cases you can tell Git the encoding of a file in the working
> +directory with the `working-tree-encoding` attribute. If a file with this
> +attribute is added to Git, then Git reencodes the content from the
> +specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
> +content in its internal data structure (called "the index"). On checkout
> +the content is reencoded back to the specified encoding.
> +
> +Please note that using the `working-tree-encoding` attribute may have a
> +number of pitfalls:
> +
> +- Git clients that do not support the `working-tree-encoding` attribute
A client to Git ?
Or may be "third party Git implementations"
> + will checkout the respective files UTF-8 encoded and not in the
> + expected encoding. Consequently, these files will appear different
> + which typically causes trouble. This is in particular the case for
> + older Git versions and alternative Git implementations such as JGit
> + or libgit2 (as of February 2018).
> +
> +- Reencoding content requires resources that might slow down certain
> + Git operations (e.g 'git checkout' or 'git add').
> +
> +Use the `working-tree-encoding` attribute only if you cannot store a file
> +in UTF-8 encoding and if you want Git to be able to process the content
> +as text.
> +
> +As an example, use the following attributes if your '*.proj' files are
> +UTF-16 encoded with byte order mark (BOM) and you want Git to perform
> +automatic line ending conversion based on your platform.
> +
> +------------------------
> +*.proj text working-tree-encoding=UTF-16
> +------------------------
> +
> +Use the following attributes if your '*.proj' files are UTF-16 little
> +endian encoded without BOM and you want Git to use Windows line endings
> +in the working directory. Please note, it is highly recommended to
> +explicitly define the line endings with `eol` if the `working-tree-encoding`
> +attribute is used to avoid ambiguity.
> +
> +------------------------
> +*.proj working-tree-encoding=UTF-16LE text eol=CRLF
> +------------------------
> +
> +You can get a list of all available encodings on your platform with the
> +following command:
One question:
+*.proj text working-tree-encoding=UTF-16
vs
*.proj working-tree-encoding=UTF-16LE text eol=CRLF
Technically the order of attributes doesn't matter, but that is not what we
want to demonstrate here and now.
I would probably move the "text" attribute to the end of the line.
So that readers don't start to wonder if the order is important.
> +
> +------------------------
> +iconv --list
> +------------------------
> +
> +If you do not know the encoding of a file, then you can use the `file`
> +command to guess the encoding:
> +
> +------------------------
> +file foo.proj
> +------------------------
> +
> +
> `ident`
> ^^^^^^^
>
> diff --git a/convert.c b/convert.c
> index b976eb968c..dc9e2db6b5 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -7,6 +7,7 @@
> #include "sigchain.h"
> #include "pkt-line.h"
> #include "sub-process.h"
> +#include "utf8.h"
>
> /*
> * convert.c - convert a file when checking it out and checking it in.
> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>
> }
>
> +static struct encoding {
> + const char *name;
> + struct encoding *next;
> +} *encoding, **encoding_tail;
> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> + struct strbuf *buf, struct encoding *enc, int conv_flags)
> +{
> + char *dst;
> + int dst_len;
> +
> + /*
> + * No encoding is specified or there is nothing to encode.
> + * Tell the caller that the content was not modified.
> + */
> + if (!enc || (src && !src_len))
> + return 0;
> +
> + /*
> + * Looks like we got called from "would_convert_to_git()".
> + * This means Git wants to know if it would encode (= modify!)
> + * the content. Let's answer with "yes", since an encoding was
> + * specified.
> + */
> + if (!buf && !src)
> + return 1;
> +
> + if (has_prohibited_utf_bom(enc->name, src, src_len)) {
> + const char *error_msg = _(
> + "BOM is prohibited for '%s' if encoded as %s");
> + const char *advise_msg = _(
> + "You told Git to treat '%s' as %s. A byte order mark "
> + "(BOM) is prohibited with this encoding. Either use "
> + "%.6s as working tree encoding or remove the BOM from the "
> + "file.");
"You told Git" is probly right from Gits point of view, and advises are really helpfull.
But what should the user do about it ?
Could we give a better advise ?
"A byte order mark (BOM) is prohibited with %s.
Please remove the BOM from the file %s
or use "%s as working-tree-encoding"
I would probably suspect that a tool wrote the BOM, and that is
good and can or should not be changed by a user.
So a simply message like this could be the preferred (and only)
solution for a user:
"A byte order mark (BOM) is prohibited with %s.
Please use "%s as working-tree-encoding"
(And why %.6s and not simply %s ?)
No more comments for now, I didn't review the test cases.
next prev parent reply other threads:[~2018-02-10 9:49 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
2018-02-09 13:28 ` [PATCH v6 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-02-09 13:28 ` [PATCH v6 2/7] strbuf: add xstrdup_toupper() lars.schneider
2018-02-09 13:28 ` [PATCH v6 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-02-09 13:28 ` [PATCH v6 4/7] utf8: add function to detect a missing " lars.schneider
2018-02-09 19:28 ` Junio C Hamano
2018-02-09 19:47 ` Lars Schneider
2018-02-09 13:28 ` [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
2018-02-10 9:48 ` Torsten Bögershausen [this message]
2018-02-14 13:22 ` Lars Schneider
2018-02-09 13:28 ` [PATCH v6 6/7] convert: add tracing for " lars.schneider
2018-02-09 13:28 ` [PATCH v6 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-02-09 20:09 ` [PATCH v6 0/7] convert: add support for different encodings Junio C Hamano
2018-02-10 1:04 ` Lars Schneider
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180210094852.GB11525@tor.lan \
--to=tboegi@web.de \
--cc=Johannes.Schindelin@gmx.de \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=j6t@kdbg.org \
--cc=lars.schneider@autodesk.com \
--cc=larsxschneider@gmail.com \
--cc=peff@peff.net \
--cc=ramsay@ramsayjones.plus.com \
--cc=sunshine@sunshineco.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).