Re: [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Torsten Bögershausen" <tboegi@web.de>
To: lars.schneider@autodesk.com
Cc: git@vger.kernel.org, gitster@pobox.com, j6t@kdbg.org,
	sunshine@sunshineco.com, peff@peff.net,
	ramsay@ramsayjones.plus.com, Johannes.Schindelin@gmx.de,
	Lars Schneider <larsxschneider@gmail.com>
Subject: Re: [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute
Date: Sat, 10 Feb 2018 10:48:52 +0100	[thread overview]
Message-ID: <20180210094852.GB11525@tor.lan> (raw)
In-Reply-To: <20180209132830.55385-6-lars.schneider@autodesk.com>

On Fri, Feb 09, 2018 at 02:28:28PM +0100, lars.schneider@autodesk.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  Documentation/gitattributes.txt  |  66 ++++++++++++
>  convert.c                        | 157 ++++++++++++++++++++++++++++-
>  convert.h                        |   1 +
>  sha1_file.c                      |   2 +-
>  t/t0028-working-tree-encoding.sh | 210 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 434 insertions(+), 2 deletions(-)
>  create mode 100755 t/t0028-working-tree-encoding.sh
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 30687de81a..4ecdcd4859 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -272,6 +272,72 @@ few exceptions.  Even though...
>    catch potential problems early, safety triggers.
>  
>  
> +`working-tree-encoding`
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.
> +
> +In these cases you can tell Git the encoding of a file in the working
> +directory with the `working-tree-encoding` attribute. If a file with this
> +attribute is added to Git, then Git reencodes the content from the
> +specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
> +content in its internal data structure (called "the index"). On checkout
> +the content is reencoded back to the specified encoding.
> +
> +Please note that using the `working-tree-encoding` attribute may have a
> +number of pitfalls:
> +
> +- Git clients that do not support the `working-tree-encoding` attribute

A client to Git ?
Or may be "third party Git implementations"

> +  will checkout the respective files UTF-8 encoded and not in the
> +  expected encoding. Consequently, these files will appear different
> +  which typically causes trouble. This is in particular the case for
> +  older Git versions and alternative Git implementations such as JGit
> +  or libgit2 (as of February 2018).
> +
> +- Reencoding content requires resources that might slow down certain
> +  Git operations (e.g 'git checkout' or 'git add').
> +
> +Use the `working-tree-encoding` attribute only if you cannot store a file
> +in UTF-8 encoding and if you want Git to be able to process the content
> +as text.
> +
> +As an example, use the following attributes if your '*.proj' files are
> +UTF-16 encoded with byte order mark (BOM) and you want Git to perform
> +automatic line ending conversion based on your platform.
> +
> +------------------------
> +*.proj		text working-tree-encoding=UTF-16
> +------------------------
> +
> +Use the following attributes if your '*.proj' files are UTF-16 little
> +endian encoded without BOM and you want Git to use Windows line endings
> +in the working directory. Please note, it is highly recommended to
> +explicitly define the line endings with `eol` if the `working-tree-encoding`
> +attribute is used to avoid ambiguity.
> +
> +------------------------
> +*.proj 		working-tree-encoding=UTF-16LE text eol=CRLF
> +------------------------
> +
> +You can get a list of all available encodings on your platform with the
> +following command:

One question:
 +*.proj		text working-tree-encoding=UTF-16
vs
*.proj 		working-tree-encoding=UTF-16LE text eol=CRLF

Technically the order of attributes doesn't matter, but that is not what we
want to demonstrate here and now.
I would probably move the "text" attribute to the end of the line.
So that readers don't start to wonder if the order is important.

> +
> +------------------------
> +iconv --list
> +------------------------
> +
> +If you do not know the encoding of a file, then you can use the `file`
> +command to guess the encoding:
> +
> +------------------------
> +file foo.proj
> +------------------------
> +
> +
>  `ident`
>  ^^^^^^^
>  
> diff --git a/convert.c b/convert.c
> index b976eb968c..dc9e2db6b5 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -7,6 +7,7 @@
>  #include "sigchain.h"
>  #include "pkt-line.h"
>  #include "sub-process.h"
> +#include "utf8.h"
>  
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>  
>  }
>  
> +static struct encoding {
> +	const char *name;
> +	struct encoding *next;
> +} *encoding, **encoding_tail;
> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +			 struct strbuf *buf, struct encoding *enc, int conv_flags)
> +{
> +	char *dst;
> +	int dst_len;
> +
> +	/*
> +	 * No encoding is specified or there is nothing to encode.
> +	 * Tell the caller that the content was not modified.
> +	 */
> +	if (!enc || (src && !src_len))
> +		return 0;
> +
> +	/*
> +	 * Looks like we got called from "would_convert_to_git()".
> +	 * This means Git wants to know if it would encode (= modify!)
> +	 * the content. Let's answer with "yes", since an encoding was
> +	 * specified.
> +	 */
> +	if (!buf && !src)
> +		return 1;
> +
> +	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
> +		const char *error_msg = _(
> +			"BOM is prohibited for '%s' if encoded as %s");
> +		const char *advise_msg = _(
> +			"You told Git to treat '%s' as %s. A byte order mark "
> +			"(BOM) is prohibited with this encoding. Either use "
> +			"%.6s as working tree encoding or remove the BOM from the "
> +			"file.");

"You told Git" is probly right from Gits point of view, and advises are really helpfull.
But what should the user do about it ?
Could we give a better advise ?


"A byte order mark (BOM) is prohibited with %s.
Please remove the BOM from the file %s 
or use "%s as working-tree-encoding"

I would probably suspect that a tool wrote the BOM, and that is
good and can or should not be changed by a user.

So a simply message like this could be the preferred (and only)
solution for a user:
"A byte order mark (BOM) is prohibited with %s.
Please use "%s as working-tree-encoding"


(And why %.6s and not simply %s ?)

No more comments for now, I didn't review the test cases.

next prev parent reply	other threads:[~2018-02-10  9:49 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
2018-02-09 13:28 ` [PATCH v6 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-02-09 13:28 ` [PATCH v6 2/7] strbuf: add xstrdup_toupper() lars.schneider
2018-02-09 13:28 ` [PATCH v6 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-02-09 13:28 ` [PATCH v6 4/7] utf8: add function to detect a missing " lars.schneider
2018-02-09 19:28   ` Junio C Hamano
2018-02-09 19:47     ` Lars Schneider
2018-02-09 13:28 ` [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
2018-02-10  9:48   ` Torsten Bögershausen [this message]
2018-02-14 13:22     ` Lars Schneider
2018-02-09 13:28 ` [PATCH v6 6/7] convert: add tracing for " lars.schneider
2018-02-09 13:28 ` [PATCH v6 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-02-09 20:09 ` [PATCH v6 0/7] convert: add support for different encodings Junio C Hamano
2018-02-10  1:04   ` Lars Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180210094852.GB11525@tor.lan \
    --to=tboegi@web.de \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=j6t@kdbg.org \
    --cc=lars.schneider@autodesk.com \
    --cc=larsxschneider@gmail.com \
    --cc=peff@peff.net \
    --cc=ramsay@ramsayjones.plus.com \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).