From: Jonathan Nieder <jrnieder@gmail.com>
To: Joshua Juran <jjuran@gmail.com>
Cc: Drew Northup <drew.northup@maine.edu>,
Git mailing list <git@vger.kernel.org>,
Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>
Subject: Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
Date: Fri, 22 Oct 2010 14:53:31 -0500
Message-ID: <20101022195331.GA12014@burratino>
In-Reply-To: <E7645863-A3AD-4EE1-AF6B-71C50A859619@gmail.com>
Joshua Juran wrote:
> I would like to see the same thing for MacRoman-encoded text.[1]
> This is the encoding used by classic Mac development tools such as
> Metrowerks C/C++ (packaged as CodeWarrior) and Apple's Rez resource
> compiler (even the version in OS X). Clearly, UTF-8 checkouts are
> not an option here.
Yes, makes sense.
There are (at least) two approaches you could use here: treat the
content as precious and use e.g. textconv for readable diffs, or
treat the content as UTF-8 text and use clean/smudge to ensure
the checkout has the right encoding.
So let's see what happens with the latter:
> I wrote a Mac<->UTF-8 converter in C++ and set it as the
> clean/smudge filter for .r (Rez) files. Checkouts were noticeably
> slower (on a real machine, not one of my antiques).
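For readers who want to try the clean/smudge approach, a converter of the kind described in the quote above can be sketched as follows. This is a minimal sketch in Python rather than the C++ tool mentioned, assuming Python's built-in mac_roman codec is close enough to the MacRoman variant the Rez tools use; the function names are made up for illustration.

```python
def clean(worktree_bytes: bytes) -> bytes:
    """Working tree -> repository: MacRoman in, UTF-8 out."""
    return worktree_bytes.decode("mac_roman").encode("utf-8")


def smudge(repo_bytes: bytes) -> bytes:
    """Repository -> working tree: UTF-8 in, MacRoman out.

    Raises UnicodeError if the blob is not valid UTF-8 or contains a
    character that MacRoman cannot represent.
    """
    return repo_bytes.decode("utf-8").encode("mac_roman")


# Round trip on a small sample containing one non-ASCII character.
sample = "café\n".encode("mac_roman")
assert smudge(clean(sample)) == sample
```

Wired up through the filter.<name>.clean and filter.<name>.smudge configuration keys and a "*.r filter=<name>" line in .gitattributes, each direction would read the file on stdin and write the result on stdout, with one process started per file.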
Vague ideas to mitigate that:

 a) allow a single clean/smudge filter invocation for a batch of
    files
 b) cache, as Jeff hinted
 c) allow custom "native" clean/smudge filters, executed using dlopen()
> While the performance cost could be overlooked, a worse problem
> occurred when I checked out a branch into which the conversion of
> files from MacRoman to UTF-8 hadn't occurred. It automatically
> dirtied my working tree, requiring me to temporarily disable the
> filter attribute and reset --hard. I also resorted to checkout -f a
> number of times -- a bad habit, I'm sure.
The jn/merge-renormalize topic from pu might help somewhat (or might
not). In any event, if you have a test case, I would be happy to look
at it.
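The dirty-tree symptom can be reproduced with a toy model. This is a simplified sketch, not git's actual code: it assumes that a failing smudge filter leaves the content unchanged and that a file counts as modified when re-cleaning the checked-out bytes no longer yields the stored blob.

```python
def clean(data: bytes) -> bytes:
    """Check-in direction: MacRoman working tree -> UTF-8 blob."""
    return data.decode("mac_roman").encode("utf-8")


def smudge(data: bytes) -> bytes:
    """Checkout direction: UTF-8 blob -> MacRoman working tree."""
    return data.decode("utf-8").encode("mac_roman")


def checkout(blob: bytes) -> bytes:
    """Toy checkout: if smudge fails, write the blob unchanged (assumption)."""
    try:
        return smudge(blob)
    except UnicodeError:
        return blob


def looks_modified(blob: bytes) -> bool:
    """Toy status check: re-clean the checked-out file, compare to the blob."""
    return clean(checkout(blob)) != blob


converted_branch = "café\n".encode("utf-8")        # already UTF-8 in the repo
unconverted_branch = "café\n".encode("mac_roman")  # still MacRoman in the repo

print(looks_modified(converted_branch))    # False
print(looks_modified(unconverted_branch))  # True
```

On the unconverted branch the clean filter turns the MacRoman bytes into UTF-8, which no longer match the stored blob, so every such file with non-ASCII content shows up as changed right after checkout.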
> In the end I concluded that (a) these files are definitely text, and
> (b) they are natively MacRoman and should be stored that way. There
> is no advantage to using UTF-8 since the tools can't handle it, and
> even were one to write a UTF-8-capable Rez compiler, the resources
> it outputs are still MacRoman-encoded, so no Unicode support is
> possible.
>
> Finally, (c) the end-to-end principle applies.
Yep.
Although "definitely text" seems somewhat abstract to me. Is the
problem that "git diff" fails to default to --text in some situation?
> But Git should definitely
> convert data to match the encoding of the display device; writing
> anything but valid UTF-8 to a UTF-8 terminal is in error.
Oh, so this is what you mean. Except for the log output encoding, git
does not pay attention to the display encoding at all.
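What "convert data to match the display" would involve can be sketched as below. This is a minimal sketch, assuming the stored encoding is known from somewhere and the terminal encoding is passed in explicitly; the function name for_display and its parameters are hypothetical, and this is not something git does today.

```python
def for_display(data: bytes, declared: str, terminal: str = "utf-8") -> bytes:
    """Re-encode file content for the output device.

    `declared` is the encoding the content is stored in and `terminal`
    is what the display expects. Characters the terminal encoding cannot
    represent are replaced with "?" instead of being written as raw bytes.
    """
    text = data.decode(declared)
    return text.encode(terminal, errors="replace")


print(for_display(b"caf\x8e", "mac_roman"))           # b'caf\xc3\xa9'
print(for_display(b"caf\x8e", "mac_roman", "ascii"))  # b'caf?'
```

For diffs specifically, a converter like this could be tried as a textconv driver (diff.<name>.textconv), which leaves the stored content untouched.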
[...]
> But a more
> complete and robust solution would be to store the encoding
> somewhere, possibly in the blob itself, or in the tree storing the
> filename.
How about Jakub's idea of keeping it in .gitattributes (or some
similarly visible key/value store)?  Two reasons:

 1. When asked to declare encoding, half the time people will be
    wrong.  So it seems worthwhile to make the declared encoding
    visible enough to fix.

 2. Two ASCII files identical except that one is declared as
    latin1 and the other utf8 should be considered identical.
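The second reason can be made concrete with blob object names. This is a minimal sketch: the blob_id function follows git's blob hashing (SHA-1 over a "blob <size>" header, a NUL byte, and the content), while the in-blob "encoding:" line is an invented format used only to show what would go wrong.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Object name git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


ascii_text = b"hello\n"

# Encoding kept outside the blob (e.g. in .gitattributes): one object,
# whichever encoding the file is declared to have.
as_latin1 = blob_id(ascii_text)
as_utf8 = blob_id(ascii_text)

# Hypothetical in-blob declaration (invented format, illustration only).
tagged_latin1 = blob_id(b"encoding: latin1\n" + ascii_text)
tagged_utf8 = blob_id(b"encoding: utf8\n" + ascii_text)

print(as_latin1 == as_utf8)          # True
print(tagged_latin1 == tagged_utf8)  # False
```

With the declaration outside the blob, the two files share one object and compare as identical; with it inside, the same bytes of text become two different objects.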
Thanks for some food for thought.