[RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?

* [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
@ 2010-10-22 16:06 Drew Northup
  2010-10-22 16:18 ` Jonathan Nieder
  0 siblings, 1 reply; 15+ messages in thread
From: Drew Northup @ 2010-10-22 16:06 UTC (permalink / raw)
  To: Git mailing list; +Cc: Junio C Hamano

I am currently thinking about what the best way to present readable (and
safely email-able) patches to the user may be when the content is
UTF-16. This is part of my ongoing work to treat UTF-16 as text (in
other words, the crlf options will work and .gitattributes hacks won't
be required to display diffs, etc).
I was also concerned that the result be re-importable to valid UTF-16 in
the end. This has led me to consider printing diffs as UTF-8 (no data
loss, at least 16->8) when the source text is UTF-16. This should also
be git-gui / gitk friendly (in theory). I would favorably consider this
as a configurable option (export_unicode_diff_as_utf8 ?) leaving plain
UTF-16 output as the standard output from "git diff" (once I convince it
that UTF-16 is indeed text).
Also, there is the issue of being able to recognize UTF-16 as UTF-16 in
diffs/patches. Is there a precedent/standard I should be aware of with
respect to BOMs and patches? I would think that adhering to the UTF-16
standard with respect to whole text files would make sense here (no BOM
== Big Endian, BOM used to match LE/BE otherwise).
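
A minimal sketch of the round trip described above, assuming the rule
"no BOM means big-endian". The function names are hypothetical and this
is not code from git:

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes; a missing BOM is treated as big-endian."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")  # assumption: no BOM == big-endian

def utf16_to_utf8(data: bytes) -> bytes:
    """Produce the UTF-8 form that would be printed or emailed."""
    return decode_utf16(data).encode("utf-8")

def utf8_to_utf16(data: bytes, little_endian: bool, with_bom: bool) -> bytes:
    """Re-import UTF-8 text as UTF-16 with the requested byte order."""
    text = data.decode("utf-8")
    codec = "utf-16-le" if little_endian else "utf-16-be"
    bom = b""
    if with_bom:
        bom = codecs.BOM_UTF16_LE if little_endian else codecs.BOM_UTF16_BE
    return bom + text.encode(codec)
```

Note that the byte order and the presence of a BOM have to be
remembered somewhere for the re-import to reproduce the original bytes.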

Comments welcome!

-- 
-Drew Northup N1XIM
   AKA RvnPhnx on OPN
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 16:06 [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...? Drew Northup
@ 2010-10-22 16:18 ` Jonathan Nieder
  2010-10-22 17:01   ` Drew Northup
  2010-10-22 18:28   ` Joshua Juran
  0 siblings, 2 replies; 15+ messages in thread
From: Jonathan Nieder @ 2010-10-22 16:18 UTC (permalink / raw)
  To: Drew Northup; +Cc: Git mailing list, Junio C Hamano

Hi,

Drew Northup wrote:

>         This is part of my ongoing work to treat UTF-16 as text (in
> other words, the crlf options will work and .gitattributes hacks won't
> be required to display diffs, etc).

What's wrong with .gitattributes for this use case?  I would think a
clean/smudge filter would produce very good behavior from most git
commands.

If speed is the issue, maybe a built-in clean/smudge filter would
address that?

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 16:18 ` Jonathan Nieder
@ 2010-10-22 17:01   ` Drew Northup
  2010-10-22 17:12     ` Jonathan Nieder
  2010-10-22 18:28   ` Joshua Juran
  1 sibling, 1 reply; 15+ messages in thread
From: Drew Northup @ 2010-10-22 17:01 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Git mailing list, Junio C Hamano

On Fri, 2010-10-22 at 11:18 -0500, Jonathan Nieder wrote:
> Hi,
> 
> Drew Northup wrote:
> 
> >         This is part of my ongoing work to treat UTF-16 as text (in
> > other words, the crlf options will work and .gitattributes hacks won't
> > be required to display diffs, etc).
> 
> What's wrong with .gitattributes for this use case?  I would think a
> clean/smudge filter would produce very good behavior from most git
> commands.
> 
> If speed is the issue, maybe a built-in clean/smudge filter would
> address that?

That still doesn't fix the crlf issue, for starters. Also, I would like
to be able to email patches for files that are in UTF-16 and properly
re-import them. Unless I'm missing something really big there's not much
that a display filter is going to do for me there. I want this to be
two-way.
To be honest, not needing a display filter goes along with treating
UTF-16 patches as text. The same sort of code will be required for
both, so why not get predictable behavior for another text encoding
at the same time?

-- 
-Drew Northup N1XIM
   AKA RvnPhnx on OPN
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 17:01   ` Drew Northup
@ 2010-10-22 17:12     ` Jonathan Nieder
  2010-10-22 17:27       ` Drew Northup
  0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Nieder @ 2010-10-22 17:12 UTC (permalink / raw)
  To: Drew Northup; +Cc: Git mailing list, Junio C Hamano

Hi again,

Drew Northup wrote:

> That still doesn't fix the crlf issue, for starters. Also, I would like
> to be able to email patches for files that are in UTF-16 and properly
> re-import them. Unless I'm missing something really big there's not much
> that a display filter is going to do for me there.

Right, I think you're missing something big.  textconv is a display
filter.  clean/smudge convert between internal and external
representation (and your clean/smudge scripts could take care of CRLF
themselves if desired).
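
To make the distinction concrete, here is a minimal sketch of what a
clean/smudge pair could do, assuming the working tree wants UTF-16LE
with a BOM and CRLF while the repository stores UTF-8 with LF. The
function names are hypothetical; a real filter would read stdin and
write stdout, and would be wired up through the filter.<driver>.clean
and filter.<driver>.smudge config keys:

```python
import codecs

def clean(worktree_bytes: bytes) -> bytes:
    """Working tree (UTF-16 with BOM, CRLF) -> repository (UTF-8, LF)."""
    # The "utf-16" codec uses the BOM to pick the byte order and drops it.
    text = worktree_bytes.decode("utf-16")
    return text.replace("\r\n", "\n").encode("utf-8")

def smudge(repo_bytes: bytes) -> bytes:
    """Repository (UTF-8, LF) -> working tree (UTF-16LE with BOM, CRLF)."""
    text = repo_bytes.decode("utf-8").replace("\n", "\r\n")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

Since the stored form is plain UTF-8 with LF, diffs and emailed patches
operate on that form; the round trip is only exact if the working-tree
convention (byte order, BOM, line endings) is fixed as assumed here.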

That said, I wouldn't be surprised if clean/smudge filters don't do
everything you want.  If you do go that way, please keep the list
posted so the mechanism can be improved.

And longer term, maybe people will want something tailor-made after
all?  I just imagine it would be more productive to try out the
generic mechanisms first.

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 17:12     ` Jonathan Nieder
@ 2010-10-22 17:27       ` Drew Northup
  2010-10-22 17:30         ` Jonathan Nieder
  2010-10-22 17:48         ` Jakub Narebski
  0 siblings, 2 replies; 15+ messages in thread
From: Drew Northup @ 2010-10-22 17:27 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Git mailing list, Junio C Hamano


On Fri, 2010-10-22 at 12:12 -0500, Jonathan Nieder wrote:
> Hi again,
> 
> Drew Northup wrote:
> 
> > That still doesn't fix the crlf issue, for starters. Also, I would like
> > to be able to email patches for files that are in UTF-16 and properly
> > re-import them. Unless I'm missing something really big there's not much
> > that a display filter is going to do for me there.
> 
> Right, I think you're missing something big.  textconv is a display
> filter.  clean/smudge convert between internal and external
> representation (and your clean/smudge scripts could take care of CRLF
> themselves if desired).
> 
> That said, I wouldn't be surprised if clean/smudge filters don't do
> everything you want.  If you do go that way, please keep the list
> posted so the mechanism can be improved.

Well I shall plumb the documentation again.... just in case. I'm not
holding my breath that it will do what I (and frankly a fair number of
other people) want. We just want version control that treats text like
text. FULL STOP. Why isn't UTF-16 text???????

> And longer term, maybe people will want something tailor-made after
> all?  I just imagine it would be more productive to try out the
> generic mechanisms first.

Please forgive me for being offended that UTF-16 text is not "generic"
enough.

-- 
-Drew Northup N1XIM
   AKA RvnPhnx on OPN
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 17:27       ` Drew Northup
@ 2010-10-22 17:30         ` Jonathan Nieder
  2010-10-22 17:58           ` Jakub Narebski
  2010-10-22 17:48         ` Jakub Narebski
  1 sibling, 1 reply; 15+ messages in thread
From: Jonathan Nieder @ 2010-10-22 17:30 UTC (permalink / raw)
  To: Drew Northup; +Cc: Git mailing list, Junio C Hamano

Drew Northup wrote:

> Please forgive me for being offended that UTF-16 text is not "generic"
> enough.

First some words of explanation.

By "generic" I did not mean ubiquitous, unbranded, popular, or some
other almost-synonym.  What I actually meant is that it is not obvious
what to do with UTF-16.  Should it be converted to UTF-8 for output?
Should it always be normalized when added to the index, so that
switching between canonically equivalent sequences does not result
in spurious diffs?  Should the byte-for-byte representation be
faithfully preserved, even when it is not valid UTF-16?
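
A small illustration of the normalization question, using Python's
standard unicodedata module. The choice of NFC is only an assumption
for the example; nothing here says which form git should pick:

```python
import unicodedata

def canonical(text: str) -> str:
    """Return the NFC form, so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# The UTF-16 byte sequences differ, so a byte-wise diff reports a change...
bytes_differ = composed.encode("utf-16-le") != decomposed.encode("utf-16-le")
# ...although the two strings are canonically equivalent.
text_equal = canonical(composed) == canonical(decomposed)
```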

When in such a situation, often a good approach is the following:
take care of mechanism first, then policy.  So the first thing to do
is to make sure that the code is _capable_ of what people are trying
to do; then one can try various configurations and see what is most
convenient; and finally, one can make sure the program behaves in an
intuitive way by setting a reasonable default.

So by "generic" I meant those mechanisms that can be used in the
context of multiple policies.

Apologies; I never meant to offend; please carry on and I will leave
you in peace.

Jonathan

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 17:30         ` Jonathan Nieder
@ 2010-10-22 17:58           ` Jakub Narebski
  0 siblings, 0 replies; 15+ messages in thread
From: Jakub Narebski @ 2010-10-22 17:58 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Drew Northup, Git mailing list, Junio C Hamano

Jonathan Nieder <jrnieder@gmail.com> writes:

> Drew Northup wrote:
> 
> > Please forgive me for being offended that UTF-16 text is not "generic"
> > enough.
> 
> First some words of explanation.
> 
> By "generic" I did not mean ubiquitous, unbranded, popular, or some
> other almost-synonym.  What I actually meant is that it is not obvious
> what to do with UTF-16.  Should it be converted to UTF-8 for output?
> Should it always be normalized when added to the index, so that
> switching between canonically equivalent sequences does not result
> in spurious diffs?  Should the byte-for-byte representation be
> faithfully preserved, even when it is not valid UTF-16?
> 
> When in such a situation, often a good approach is the following:
> take care of mechanism first, then policy.  So the first thing to do
> is to make sure that the code is _capable_ of what people are trying
> to do; then one can try various configurations and see what is most
> convenient; and finally, one can make sure the program behaves in an
> intuitive way by setting a reasonable default.
> 
> So by "generic" I meant those mechanisms that can be used in the
> context of multiple policies.

It would be nice if there were a way (perhaps steerable via
gitattributes) to change whether Git treats a file as a sequence of
bytes (as it does now) or as a sequence of characters (probably like
Perl 6, i.e. as a sequence of graphemes), though this would require
specifying the encoding (and normalization) used.

Wishful thinking
-- 
Jakub Narebski
Poland
ShadeHawk on #git

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 17:27       ` Drew Northup
  2010-10-22 17:30         ` Jonathan Nieder
@ 2010-10-22 17:48         ` Jakub Narebski
  2010-10-22 18:06           ` Drew Northup
  1 sibling, 1 reply; 15+ messages in thread
From: Jakub Narebski @ 2010-10-22 17:48 UTC (permalink / raw)
  To: Drew Northup; +Cc: Jonathan Nieder, Git mailing list, Junio C Hamano

Drew Northup <drew.northup@maine.edu> writes:

> On Fri, 2010-10-22 at 12:12 -0500, Jonathan Nieder wrote:
> > 
> > Drew Northup wrote:
> > 
> > > That still doesn't fix the crlf issue, for starters. Also, I would like
> > > to be able to email patches for files that are in UTF-16 and properly
> > > re-import them. Unless I'm missing something really big there's not much
> > > that a display filter is going to do for me there.
> > 
> > Right, I think you're missing something big.  textconv is a display
> > filter.  clean/smudge convert between internal and external
> > representation (and your clean/smudge scripts could take care of CRLF
> > themselves if desired).
> > 
> > That said, I wouldn't be surprised if clean/smudge filters don't do
> > everything you want.  If you do go that way, please keep the list
> > posted so the mechanism can be improved.
> 
> Well I shall plumb the documentation again.... just in case. I'm not
> holding my breath that it will do what I (and frankly a fair number of
> other people) want. We just want version control that treats text like
> text. FULL STOP. Why isn't UTF-16 text???????

If you are asking why Git detects files with text in UTF-16 / UCS-2 as
binary, it is because Git (re)uses the same heuristic as e.g. GNU
diff (and probably also the -T file test in Perl), and one of the
heuristics is that if a file contains a NUL ("\0") character, then it
is most probably binary (because legacy C programs for text would have
trouble with NUL characters).
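
A minimal sketch of that heuristic, to show why UTF-16 trips it. The
8000-byte prefix is an assumption about how much of the file is
inspected, and looks_binary is a hypothetical name:

```python
FIRST_FEW_BYTES = 8000  # assumed size of the inspected prefix

def looks_binary(data: bytes) -> bool:
    """Treat the content as binary if a NUL byte appears near the start."""
    return b"\0" in data[:FIRST_FEW_BYTES]

utf8_sample = "hello\n".encode("utf-8")
# Every ASCII character gains a zero high byte in UTF-16.
utf16_sample = "hello\n".encode("utf-16-le")
```

With these samples, looks_binary(utf8_sample) is false and
looks_binary(utf16_sample) is true.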

That probably doesn't help you any...
-- 
Jakub Narebski
Poland
ShadeHawk on #git

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 17:48         ` Jakub Narebski
@ 2010-10-22 18:06           ` Drew Northup
  2010-10-22 19:18             ` Jakub Narebski
  0 siblings, 1 reply; 15+ messages in thread
From: Drew Northup @ 2010-10-22 18:06 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Jonathan Nieder, Git mailing list, Junio C Hamano


On Fri, 2010-10-22 at 10:48 -0700, Jakub Narebski wrote:
> Drew Northup <drew.northup@maine.edu> writes:

> > Well I shall plumb the documentation again.... just in case. I'm not
> > holding my breath that it will do what I (and frankly a fair number of
> > other people) want. We just want version control that treats text like
> > text. FULL STOP. Why isn't UTF-16 text???????
> 
> If you are asking why Git detects files with text in UTF-16 / UCS-2 as
> binary, it is because Git (re)uses the same heuristic as e.g. GNU
> diff (and probably also the -T file test in Perl), and one of the
> heuristics is that if a file contains a NUL ("\0") character, then it
> is most probably binary (because legacy C programs for text would have
> trouble with NUL characters).
> 
> That probably doesn't help you any...

I did find that already. I still have not decided on the correct place
to shoehorn in Unicode detection, but I'll be sure to do that before I
bother anybody else with it. I already wrote code to detect (reasonably)
valid UTF-16 (if it isn't obviously valid then I'll just as soon deal
with it as binary data, so as to avoid a foot-shooting exercise).
My main motivation here has been to get some feedback as I write stuff
so as to not waste a lot of time writing something that could be
done better.
(As opposed to not done at all, which is the feeling I'm getting from a
few people around here...)
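
A minimal sketch of one conservative check for "obviously valid"
UTF-16, assuming a BOM is required and relying on Python's strict
decoder to reject unpaired surrogates. This is an illustration with a
hypothetical function name, not the detection code mentioned above:

```python
import codecs

def is_plausible_utf16(data: bytes) -> bool:
    """Accept only BOM-prefixed, even-length, strictly decodable UTF-16."""
    if len(data) < 2 or len(data) % 2:
        return False
    if data.startswith(codecs.BOM_UTF16_LE):
        codec = "utf-16-le"
    elif data.startswith(codecs.BOM_UTF16_BE):
        codec = "utf-16-be"
    else:
        return False  # assumption: without a BOM, fall back to binary
    try:
        data[2:].decode(codec)  # strict mode raises on lone surrogates
    except UnicodeDecodeError:
        return False
    return True
```

Anything that fails the check would keep being handled as binary data.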
-- 
-Drew Northup N1XIM
   AKA RvnPhnx on OPN
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 18:06           ` Drew Northup
@ 2010-10-22 19:18             ` Jakub Narebski
  0 siblings, 0 replies; 15+ messages in thread
From: Jakub Narebski @ 2010-10-22 19:18 UTC (permalink / raw)
  To: Drew Northup; +Cc: Jonathan Nieder, Git mailing list, Junio C Hamano

On Fri, 22 Oct 2010, Drew Northup wrote:
> On Fri, 2010-10-22 at 10:48 -0700, Jakub Narebski wrote:
> > Drew Northup <drew.northup@maine.edu> writes:
> 
> > > Well I shall plumb the documentation again.... just in case. I'm not
> > > holding my breath that it will do what I (and frankly a fair number of
> > > other people) want. We just want version control that treats text like
> > > text. FULL STOP. Why isn't UTF-16 text???????
> > 
> > If you are asking why Git detects files with text in UTF-16 / UCS-2 as
> > binary, it is because Git (re)uses the same heuristic as e.g. GNU
> > diff (and probably also the -T file test in Perl), and one of the
> > heuristics is that if a file contains a NUL ("\0") character, then it
> > is most probably binary (because legacy C programs for text would have
> > trouble with NUL characters).
> > 
> > That probably doesn't help you any...
> 
> I did find that already. I still have not decided on the correct place
> to shoehorn in Unicode detection, but I'll be sure to do that before I
> bother anybody else with it. I already wrote code to detect (reasonably)
> valid UTF-16 (if it isn't obviously valid then I'll just as soon deal
> with it as binary data, so as to avoid a foot-shooting exercise).
> My main motivation here has been to get some feedback as I write stuff
> so as to not waste a lot of time writing something that could be
> done better.
>
> (As opposed to not done at all, which is the feeling I'm getting from a
> few people around here...)

Git has good support for different encodings in commit messages (which
are always text, as opposed to file contents, which might be binary or
text).

You specify the encoding you use for commit messages with
i18n.commitEncoding (defaults to 'utf-8'); if it is different from utf-8
it gets saved in the 'encoding' header.  You can even specify that the
encoding your terminal uses is different from i18n.commitEncoding with
i18n.logOutputEncoding.

The only support for different encodings of file contents is in
git-gui.  You provide the encoding that a file uses via .gitattributes
(the `encoding` attribute).  You specify what output encoding git-gui
(Tcl/Tk) uses with the `gui.encoding` config variable.

I guess that what you need to support for diffs and 'git show <file>'
etc. is respecting `encoding` .gitattribute, and providing encoding
that console uses with e.g. i18n.blobOutputEncoding (or something like
that).
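
A minimal sketch of the conversion step this would imply, assuming the
declared encoding has already been looked up from the `encoding`
attribute and the output encoding from the proposed (not yet existing)
i18n.blobOutputEncoding. blob_for_display is a hypothetical name:

```python
import codecs

def blob_for_display(blob: bytes, declared: str, output: str = "utf-8") -> bytes:
    """Re-encode blob contents from the declared to the output encoding."""
    # Compare normalized codec names so "utf8" and "UTF-8" count as equal.
    if codecs.lookup(declared).name == codecs.lookup(output).name:
        return blob
    return blob.decode(declared).encode(output)
```

For example, a UTF-16LE blob declared as such would be shown as UTF-8
on a UTF-8 console, while a blob already in the output encoding passes
through untouched.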

HTH
-- 
Jakub Narebski
Poland

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 16:18 ` Jonathan Nieder
  2010-10-22 17:01   ` Drew Northup
@ 2010-10-22 18:28   ` Joshua Juran
  2010-10-22 19:13     ` Jeff King
  2010-10-22 19:53     ` Jonathan Nieder
  1 sibling, 2 replies; 15+ messages in thread
From: Joshua Juran @ 2010-10-22 18:28 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Drew Northup, Git mailing list, Junio C Hamano

On Oct 22, 2010, at 9:18 AM, Jonathan Nieder wrote:

> Drew Northup wrote:
>
>>        This is part of my ongoing work to treat UTF-16 as text (in
>> other words, the crlf options will work and .gitattributes hacks  
>> won't
>> be required to display diffs, etc).

I would like to see the same thing for MacRoman-encoded text.[1]  This  
is the encoding used by classic Mac development tools such as  
Metrowerks C/C++ (packaged as CodeWarrior) and Apple's Rez resource  
compiler (even the version in OS X).  Clearly, UTF-8 checkouts are not  
an option here.

> What's wrong with .gitattributes for this use case?  I would think a
> clean/smudge filter would produce very good behavior from most git
> commands.

I wrote a Mac<->UTF-8 converter in C++ and set it as the clean/smudge  
filter for .r (Rez) files.  Checkouts were noticeably slower (on a  
real machine, not one of my antiques).  This would be much worse if I  
also applied it to C and C++ source files (most, but not all, of which  
are ASCII anyway).

> If speed is the issue, maybe a built-in clean/smudge filter would
> address that?

While the performance cost could be overlooked, a worse problem  
occurred when I checked out a branch into which the conversion of  
files from MacRoman to UTF-8 hadn't occurred.  It automatically  
dirtied my working tree, requiring me to temporarily disable the  
filter attribute and reset --hard.  I also resorted to checkout -f a  
number of times -- a bad habit, I'm sure.

In the end I concluded that (a) these files are definitely text, and  
(b) they are natively MacRoman and should be stored that way.  There  
is no advantage to using UTF-8 since the tools can't handle it, and  
even were one to write a UTF-8-capable Rez compiler, the resources it  
outputs are still MacRoman-encoded, so no Unicode support is possible.

Finally, (c) the end-to-end principle applies.  Don't convert data en  
route, but wait until it's necessary.  Premature conversion was the  
curse of FTP; let's not repeat it.  But Git should definitely convert  
data to match the encoding of the display device; writing anything but  
valid UTF-8 to a UTF-8 terminal is in error.  The same applies in gitk.

Another issue (on which my thoughts are less clear) is the use of CR  
newlines.  CR is also native to classic Mac OS, but in contrast to  
UTF-8, Mac developer tools are generally newline-agnostic, whereas  
typical Unix programs are less forgiving in expecting LF -- so I've  
been using linefeeds in my source code.  However, it might be useful  
to retain platform-customary newlines for the purpose of guessing non- 
UTF character encodings:  Carriage returns would almost certainly  
indicate MacRoman rather than ISO-8859-1.  But a more complete and  
robust solution would be to store the encoding somewhere, possibly in  
the blob itself, or in the tree storing the filename.
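
A rough sketch of the newline-based guess described above, assuming
only three candidates (UTF-8, MacRoman, ISO-8859-1). The function name
is hypothetical, and the guess is exactly the kind of heuristic that a
stored encoding would make unnecessary:

```python
def guess_legacy_encoding(data: bytes) -> str:
    """Guess an encoding, using bare CR newlines as a hint for MacRoman."""
    try:
        data.decode("utf-8")
        return "utf-8"  # also covers pure ASCII
    except UnicodeDecodeError:
        pass
    if b"\r" in data and b"\n" not in data:
        return "mac_roman"  # CR-only newlines: classic Mac OS convention
    return "iso-8859-1"
```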

Josh

[1] MacRoman is an extended ASCII character set native to classic Mac  
OS.  <http://en.wikipedia.org/wiki/Mac_OS_Roman>

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 18:28   ` Joshua Juran
@ 2010-10-22 19:13     ` Jeff King
  2010-10-22 19:53     ` Jonathan Nieder
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff King @ 2010-10-22 19:13 UTC (permalink / raw)
  To: Joshua Juran
  Cc: Jonathan Nieder, Drew Northup, Git mailing list, Junio C Hamano

On Fri, Oct 22, 2010 at 11:28:44AM -0700, Joshua Juran wrote:

> >What's wrong with .gitattributes for this use case?  I would think a
> >clean/smudge filter would produce very good behavior from most git
> >commands.
> 
> I wrote a Mac<->UTF-8 converter in C++ and set it as the clean/smudge
> filter for .r (Rez) files.  Checkouts were noticeably slower (on a
> real machine, not one of my antiques).  This would be much worse if I
> also applied it to C and C++ source files (most, but not all, of
> which are ASCII anyway).

Not surprising, as you were probably running your filter a lot. Clean
and smudge could perhaps benefit from the same notes-caching layer that
textconv uses (caching the "smudged" version of each clean file).

But that would only impact checkout. Most other operations use the
"clean" representation already, so they should be full-speed.

You could also cache the other way (mapping smudged sha1's into clean
sha1's). But I doubt that would do you any good. We generally see those
when updating the index with "git add", which means either the stat
information is clean (and we don't have to clean the file) or it isn't
(in which case you probably have new content that has not been seen
before, which means a cache miss).
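
A minimal in-memory sketch of the first idea, caching the smudged
result per clean content. SmudgeCache is a hypothetical name, and a
plain SHA-1 of the bytes stands in for the blob's object name; the
notes-based cache that textconv uses is not reproduced here:

```python
import hashlib

class SmudgeCache:
    """Memoize smudge results, keyed by a hash of the clean content."""

    def __init__(self, smudge):
        self._smudge = smudge
        self._cache = {}
        self.misses = 0  # how many times the real filter had to run

    def get(self, clean_bytes: bytes) -> bytes:
        key = hashlib.sha1(clean_bytes).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._smudge(clean_bytes)
        return self._cache[key]
```

Checking out the same content twice then runs the filter once, which is
where the checkout-time saving would come from.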

And of course it doesn't help the other clean/smudge inconveniences you
ran into.

-Peff

* Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?
  2010-10-22 18:28   ` Joshua Juran
  2010-10-22 19:13     ` Jeff King
@ 2010-10-22 19:53     ` Jonathan Nieder
  2010-10-22 20:18       ` Git Attribute: File Text Encoding {WAS: Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?} Drew Northup
  1 sibling, 1 reply; 15+ messages in thread
From: Jonathan Nieder @ 2010-10-22 19:53 UTC (permalink / raw)
  To: Joshua Juran; +Cc: Drew Northup, Git mailing list, Junio C Hamano, Jeff King

Joshua Juran wrote:

> I would like to see the same thing for MacRoman-encoded text.[1]
> This is the encoding used by classic Mac development tools such as
> Metrowerks C/C++ (packaged as CodeWarrior) and Apple's Rez resource
> compiler (even the version in OS X).  Clearly, UTF-8 checkouts are
> not an option here.

Yes, makes sense.

There are (at least) two approaches you could use here: treat the
content as precious and use e.g. textconv for readable diffs, or
treat the content as UTF-8 text and use clean/smudge to ensure
the checkout has the right encoding.

So let's see what happens with the latter:

> I wrote a Mac<->UTF-8 converter in C++ and set it as the
> clean/smudge filter for .r (Rez) files.  Checkouts were noticeably
> slower (on a real machine, not one of my antiques).

Vague ideas to mitigate that:

 a) allow a single clean/smudge filter invocation for a batch of
    files
 b) cache, as Jeff hinted
 c) allow custom "native" clean/smudge filters, executed using dlopen()

> While the performance cost could be overlooked, a worse problem
> occurred when I checked out a branch into which the conversion of
> files from MacRoman to UTF-8 hadn't occurred.  It automatically
> dirtied my working tree, requiring me to temporarily disable the
> filter attribute and reset --hard.  I also resorted to checkout -f a
> number of times -- a bad habit, I'm sure.

The jn/merge-renormalize topic from pu might help somewhat (or might
not).  In any event, if you have a test case, I would be happy to look
at it.

> In the end I concluded that (a) these files are definitely text, and
> (b) they are natively MacRoman and should be stored that way.  There
> is no advantage to using UTF-8 since the tools can't handle it, and
> even were one to write a UTF-8-capable Rez compiler, the resources
> it outputs are still MacRoman-encoded, so no Unicode support is
> possible.
> 
> Finally, (c) the end-to-end principle applies.

Yep.

Although "definitely text" seems somewhat abstract to me.  Is the
problem that "git diff" fails to default to --text in some situation?

>                                         But Git should definitely
> convert data to match the encoding of the display device; writing
> anything but valid UTF-8 to a UTF-8 terminal is in error.

Oh, this is what you mean.  Except for log encoding, git is not paying
attention to the display encoding at all.

[...]
>                                                      But a more
> complete and robust solution would be to store the encoding
> somewhere, possibly in the blob itself, or in the tree storing the
> filename.

How about Jakub's idea of keeping it in .gitattributes (or some
similarly visible key/value store)?  Two reasons:

 1. When asked to declare encoding, half the time people will be
    wrong.  So it seems worthwhile to make the declared encoding
    visible enough to fix.

 2. Two ASCII files identical except that one is declared as
    latin1 and the other utf8 should be considered identical.
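
The second point can be illustrated with a short sketch: identical
ASCII bytes decode to the same text under either declaration, so the
declared encoding only matters once non-ASCII bytes appear. same_text
is a hypothetical helper:

```python
def same_text(a: bytes, enc_a: str, b: bytes, enc_b: str) -> bool:
    """Compare two byte strings as text under their declared encodings."""
    return a.decode(enc_a) == b.decode(enc_b)

# Pure ASCII: the declaration makes no difference.
ascii_equal = same_text(b"hello\n", "latin-1", b"hello\n", "utf-8")
# Same byte 0xE9, different declarations: different characters.
high_byte_equal = same_text(b"caf\xe9", "latin-1", b"caf\xe9", "mac_roman")
```

Here ascii_equal is true and high_byte_equal is false, which is an
argument for keeping the declaration outside the content hash.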

Thanks for some food for thought.

* Git Attribute: File Text Encoding {WAS: Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?}
  2010-10-22 19:53     ` Jonathan Nieder
@ 2010-10-22 20:18       ` Drew Northup
  2010-10-22 21:49         ` Jakub Narebski
  0 siblings, 1 reply; 15+ messages in thread
From: Drew Northup @ 2010-10-22 20:18 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Joshua Juran, Git mailing list, Junio C Hamano, Jeff King


On Fri, 2010-10-22 at 14:53 -0500, Jonathan Nieder wrote:
> Joshua Juran wrote:

> >                                                      But a more
> > complete and robust solution would be to store the encoding
> > somewhere, possibly in the blob itself, or in the tree storing the
> > filename.
> 
> How about Jakub's idea of keeping it in .gitattributes (or some
> similarly visible key/value store)?  Two reasons:
> 
>  1. When asked to declare encoding, half the time people will be
>     wrong.  So it seems worthwhile to make the declared encoding
>     visible enough to fix.
> 
>  2. Two ASCII files identical except that one is declared as
>     latin1 and the other utf8 should be considered identical.
> 
> Thanks for some food for thought.

I think that's a fine place to start. Shall I start a branch for it when
I get home (where the code I'm working on is located)? It would be good
practice if nothing else...

-- 
-Drew Northup N1XIM
   AKA RvnPhnx on OPN
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

* Re: Git Attribute: File Text Encoding {WAS: Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?}
  2010-10-22 20:18       ` Git Attribute: File Text Encoding {WAS: Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?} Drew Northup
@ 2010-10-22 21:49         ` Jakub Narebski
  0 siblings, 0 replies; 15+ messages in thread
From: Jakub Narebski @ 2010-10-22 21:49 UTC (permalink / raw)
  To: Drew Northup
  Cc: Jonathan Nieder, Joshua Juran, Git mailing list, Junio C Hamano,
	Jeff King

Drew Northup <drew.northup@maine.edu> writes:
> On Fri, 2010-10-22 at 14:53 -0500, Jonathan Nieder wrote:
> > Joshua Juran wrote:
> 
> > >                                                      But a more
> > > complete and robust solution would be to store the encoding
> > > somewhere, possibly in the blob itself, or in the tree storing the
> > > filename.
> > 
> > How about Jakub's idea of keeping it in .gitattributes (or some
> > similarly visible key/value store)?  Two reasons:
> > 
> >  1. When asked to declare encoding, half the time people will be
> >     wrong.  So it seems worthwhile to make the declared encoding
> >     visible enough to fix.
> > 
> >  2. Two ASCII files identical except that one is declared as
> >     latin1 and the other utf8 should be considered identical.
> > 
> > Thanks for some food for thought.
> 
> I think that's a fine place to start. Shall I start a branch for it when
> I get home (where the code I'm working on is located)? It would be good
> practice if nothing else...

Just for clarification: Git supports and uses the `encoding` gitattribute,
but it is currently used _only_ by git-gui (by converting from the
`encoding` gitattribute to `gui.encoding` given by config).

The places that are missing:

1. A way to check attribute of a file in given tree.  Currently 
   git-check-attr checks for .gitattributes _only_ from working
   area (in addition to unversioned .git/info/attributes, and perhaps
   in the future core.attributesFile a la core.excludesFile).

2. A consensus where conversion should take place: at which level
   in stack, for what output destination, etc.

3. Support for i18n.blobOutputEncoding, which would convert (unless
   overridden) e.g. in 'git show <blob>', and perhaps also in diff.

-- 
Jakub Narebski
Poland
ShadeHawk on #git
