git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Lars Schneider <larsxschneider@gmail.com>
To: Jeff King <peff@peff.net>
Cc: lars.schneider@autodesk.com, git@vger.kernel.org,
	gitster@pobox.com, tboegi@web.de, patrick@luehne.de
Subject: Re: [PATCH v1] convert: add support for 'encoding' attribute
Date: Mon, 18 Dec 2017 11:54:32 +0100	[thread overview]
Message-ID: <D1E22F0E-EC10-4133-9177-AE20F398A8AC@gmail.com> (raw)
In-Reply-To: <20171215095838.GA3567@sigill.intra.peff.net>


> On 15 Dec 2017, at 10:58, Jeff King <peff@peff.net> wrote:
> 
> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@autodesk.com wrote:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
> 
> This made me wonder what happens when you have a file that _was_ utf16,
> and then you later convert it to utf8 and add a .gitattributes entry.
> 
> I tried this with your patch:
> 
>  git init repo
>  cd repo
> 
>  echo foo | iconv -t utf16 >file
>  git add file
>  git commit -m utf16
> 
>  echo 'file encoding=utf16' >.gitattributes
>  touch file ;# make it stat-dirty
>  git commit -m convert
> 
>  git checkout HEAD^
> 
> That works OK, because we try to read the attributes from the
> destination tree for a checkout. If you do this:
> 
>  echo 'file encoding=utf16' >.git/info/attributes
>  git checkout HEAD^
> 
> then we get:
> 
>  warning: failed to encode 'file' from utf-8 to utf16
> 
> At least it figured out that it couldn't convert the content. It's
> slightly troubling that it would try in the first place, though; are
> there encoding pairs where we might accidentally generate nonsense?

At this point we interpret utf-16 content as utf-8 and try to convert
it to utf-16. That of course fails because utf-16 content is no valid
utf-8. How could we stop trying that? How could Git possibly know what
kind of encoding is used (apart from our new hint in gitattributes)?

I asked myself the question about nonsense encoding pairs, too. That
was the reason I added the "X -> UTF-8 -> Y && X == Y" check on Git
add. However, with ".git/info/attributes" you could circumvent that
check and probably generate nonsense.


> It _should_ be uncommon, I think, to have a repo-level attribute set
> like that. It's very common for us to use whatever happens to be in the
> checked-out .gitattributes for some attributes (e.g., when doing a diff
> of an older commit), but I think for checkout-related ones it's not.  So
> I think it may generally work in practice. And certainly the line-ending
> code would share any similar problems, but at least there the result is
> less confusing than mojibake.
> 
> Playing around, I also managed to do this:
> 
>  echo 'file encoding=utf16' >.gitattributes
>  echo foo >file
> 
>  # I did these with an old version of git that didn't
>  # support the new attribute, so they blindly added the utf8 content.
>  git add .
>  git commit -m convert
> 
>  git.compile checkout HEAD^
> 
> which yielded:
> 
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
> Now obviously my situation is somewhat nonsense. I was trying to force
> the in-repo representation to utf8, but ended up with a mismatched
> working tree file. But what's somewhat troubling is that I couldn't
> checkout _away_ from that state due to the die() in convert_to_git().
> Which is in turn just there as part of refresh_index().
> 
> And indeed, other commands hit the same problem:
> 
>  $ git.compile diff
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
>  $ git.compile checkout -f
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
> It may make sense to die() during "git add ." (since we're actually
> changing the index entry, and we don't want to put nonsense into a
> tree). But I'm not sure it's the best thing for operations which just
> want to read the content. For them, perhaps it would be more appropriate
> to issue a warning and return the untouched content.

Absolutely! Thanks for spotting this. I will try to run die() only on
"git add" in v2.

- Lars

  reply	other threads:[~2017-12-18 10:54 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-11 15:50 [PATCH v1] convert: add support for 'encoding' attribute lars.schneider
2017-12-11 18:39 ` Eric Sunshine
2017-12-11 23:47   ` Lars Schneider
2017-12-11 23:58     ` Eric Sunshine
2017-12-12 10:58       ` Lars Schneider
2017-12-11 20:47 ` Johannes Sixt
2017-12-11 23:42   ` Lars Schneider
2017-12-12  0:59     ` Junio C Hamano
2017-12-12  7:15       ` Johannes Sixt
2017-12-12 10:55         ` Lars Schneider
2017-12-12 19:31           ` Junio C Hamano
2017-12-13 17:57             ` Lars Schneider
2017-12-13 18:11               ` Junio C Hamano
2017-12-13 23:02                 ` Lars Schneider
2017-12-14 23:01                   ` Junio C Hamano
2017-12-12  7:09     ` Johannes Sixt
2017-12-18 10:13   ` Torsten Bögershausen
2017-12-18 13:12     ` Jeff King
2017-12-23  8:08       ` Torsten Bögershausen
2017-12-29 13:28       ` [PATCH/RFC 0/2] git diff --UTF-8 tboegi
2017-12-29 13:28       ` [PATCH/RFC 1/2] convert_to_git(): checksafe becomes an integer tboegi
2017-12-29 13:28       ` [PATCH/RFC 2/2] git diff: Allow to reencode into UTF-8 tboegi
2018-02-26 17:27       ` [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8 tboegi
2018-02-26 18:43         ` Peter Krefting
2018-02-27 22:39         ` Jeff King
2017-12-18 18:02     ` [PATCH v1] convert: add support for 'encoding' attribute Junio C Hamano
2017-12-18 21:55     ` Johannes Sixt
2017-12-15  9:58 ` Jeff King
2017-12-18 10:54   ` Lars Schneider [this message]
2017-12-18 12:59     ` Jeff King
2017-12-17 17:14 ` Torsten Bögershausen
2017-12-28 16:14   ` Lars Schneider
2017-12-29 12:59     ` Torsten Bögershausen
2017-12-29 13:56       ` Lars Schneider
2018-01-03 19:15       ` Junio C Hamano
2018-01-03 20:45         ` Lars Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=D1E22F0E-EC10-4133-9177-AE20F398A8AC@gmail.com \
    --to=larsxschneider@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=lars.schneider@autodesk.com \
    --cc=patrick@luehne.de \
    --cc=peff@peff.net \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).