git.vger.kernel.org archive mirror
* [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
@ 2008-01-22  4:41 Sam Vilain
  2008-01-22  5:35 ` Johannes Schindelin
  2008-01-22  6:26 ` Junio C Hamano
  0 siblings, 2 replies; 11+ messages in thread
From: Sam Vilain @ 2008-01-22  4:41 UTC (permalink / raw)
  To: git
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, Martin Langhoff,
	Johannes Schindelin, Dmitry Potapov, Kevin Ballard

Some projects may like to enforce that a particular encoding is used for
all filenames in the repository.  Within the UTF-8 encoding, there are
four normal forms (see http://unicode.org/reports/tr15/), any of which
may be a reasonable repository format choice.  Additionally, some
filesystems may have a single encoding that they support when writing
local filenames.  To support this, iconv and a normalization library
must have the information they need to perform the correct conversion.

This is a configuration design proposal, and does not implement any
changes.
---
   Hi all, I think that restating the problem in these terms might be
   more productive than the previous discussion.  Design critiques?

   It is intended that this doesn't impact at all on users with C
   filesystems without explicit configuration, while adding the feature
   of allowing projects to specify Unicode normalisation (so, e.g.,
   Märchen with a precomposed ä ends up the same as Märchen spelled
   with "a" plus a combining diaeresis)

   [apologies if this hits the list twice; I sent the first with a bad
    content encoding header and assume it got dropped]
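
To make the normalisation point concrete, here is a minimal sketch in
Python.  It is only an illustration, not part of the proposal: it
assumes Python's standard unicodedata module as a stand-in for the
normalization library, and utf8_bytes() is a made-up helper name.

```python
import unicodedata

# Two spellings of the same word: precomposed U+00E4 versus "a" + U+0308.
precomposed = "M\u00e4rchen"
decomposed = "Ma\u0308rchen"


def utf8_bytes(name: str, form: str) -> bytes:
    """Normalize `name` to the given Unicode form and encode it as UTF-8."""
    return unicodedata.normalize(form, name).encode("utf-8")


# Without normalization the two names are different byte strings.
assert precomposed.encode("utf-8") != decomposed.encode("utf-8")

# After normalizing both to the same form (any of the four) they agree.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert utf8_bytes(precomposed, form) == utf8_bytes(decomposed, form)
```

Which of the four forms a project picks matters less than everybody
picking the same one.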

 Documentation/config.txt        |   16 ++++++++++++++++
 Documentation/gitattributes.txt |   19 +++++++++++++++++++
 Documentation/i18n.txt          |    9 ++++++---
 3 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ee08845..9d2567d 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -146,6 +146,22 @@ core.symlinks::
 	file. Useful on filesystems like FAT that do not support
 	symbolic links. True by default.
 
+core.repositoryPathEncoding::
+	Specify the default assumed encoding of repository paths, if
+	not specified in gitlink:gitattributes[5] for that repository.
+	The default value of this is "C".
+
+core.checkoutPathEncoding::
+	Specify the encoding of local filenames.  The default value of
+	this depends on the platform and filesystem, but for most users
+	will be "C", indicating no pathname conversion required.
+
+core.checkoutPathEncodingFromLocale::
+	Specify whether the checkout path encoding should be
+	controlled via environment locale variables.  This may have
+	some bizarre side effects if you switch locales between
+	working with a checkout.  False by default.
+
 core.gitProxy::
 	A "proxy command" to execute (as 'command host port') instead
 	of establishing direct connection to the remote server when
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index cc9c7c5..4136528 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -170,6 +170,25 @@ intent is that if someone unsets the filter driver definition,
 or does not have the appropriate filter program, the project
 should still be usable.
 
+`encoding`
+^^^^^^^^^^
+Specifies the valid encoding for file names (does not affect content)
+on the specified path.  Git enforces that all filenames are valid in
+this encoding, and if applicable and possible, will translate from the
+encoding configured (or, on relevant platform and filesystem
+combinations, detected) to this encoding.
+
+The default value of this is "C", which leaves behaviour on
+filesystems which do not support "C" semantics undefined until it is
+set.  For instance, if your filesystem supports only UTF-8, and you
+are trying to check out a repository that is in Latin-1, then you will
+need to configure the repository encoding in `.git/info/attributes` 
+before you can check files out on that system.
+
+Valid encodings are currently 'ISO-8859-1' and 'UTF-8'.  'UTF-8' may
+be followed by '+NFC', '+NFD', '+NFKD' or '+NFKC' to enforce a
+particular normalization of filenames.
+
 
 Interaction between checkin/checkout attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/Documentation/i18n.txt b/Documentation/i18n.txt
index b95f99b..fba0407 100644
--- a/Documentation/i18n.txt
+++ b/Documentation/i18n.txt
@@ -1,11 +1,14 @@
 At the core level, git is character encoding agnostic.
 
  - The pathnames recorded in the index and in the tree objects
-   are treated as uninterpreted sequences of non-NUL bytes.
+   are normally treated as uninterpreted sequences of non-NUL bytes.
    What readdir(2) returns are what are recorded and compared
    with the data git keeps track of, which in turn are expected
-   to be what lstat(2) and creat(2) accepts.  There is no such
-   thing as pathname encoding translation.
+   to be what lstat(2) and creat(2) accepts.
+
+However, if there are configured encodings for the checkout and/or
+repository, then the defined conversions will occur between the
+readdir(2) and the index, in both directions.
 
  - The contents of the blob objects are uninterpreted sequence
    of bytes.  There is no encoding translation at the core
-- 
1.5.3.5
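
The attribute values described in the gitattributes hunk above
('ISO-8859-1', 'UTF-8', optionally 'UTF-8+NFC' and friends, with "C"
as the default) could be parsed roughly as follows.  This is a sketch
under assumptions: the patch implements nothing, so PathEncoding and
parse_path_encoding() are hypothetical names, and rejecting a
normalization suffix on anything but UTF-8 is a guess at the intent.

```python
from typing import NamedTuple, Optional

# Values taken from the proposed documentation text.
VALID_ENCODINGS = {"C", "ISO-8859-1", "UTF-8"}
VALID_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


class PathEncoding(NamedTuple):
    encoding: str
    form: Optional[str]  # normalization form, only meaningful for UTF-8


def parse_path_encoding(value: str = "C") -> PathEncoding:
    """Split a value such as 'UTF-8+NFC' into encoding and normal form."""
    encoding, plus, form = value.partition("+")
    if encoding not in VALID_ENCODINGS:
        raise ValueError("unknown path encoding: %r" % encoding)
    if not plus:
        return PathEncoding(encoding, None)
    # Assumption: a '+FORM' suffix is only accepted after 'UTF-8'.
    if encoding != "UTF-8" or form not in VALID_FORMS:
        raise ValueError("bad normalization suffix in %r" % value)
    return PathEncoding(encoding, form)
```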

* Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
  2008-01-22  4:41 [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND] Sam Vilain
@ 2008-01-22  5:35 ` Johannes Schindelin
  2008-01-22  6:37   ` Junio C Hamano
  2008-01-22  6:26 ` Junio C Hamano
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Schindelin @ 2008-01-22  5:35 UTC (permalink / raw)
  To: Sam Vilain
  Cc: git, Peter Karlsson, Mark Junker, Pedro Melo, Martin Langhoff,
	Dmitry Potapov

Hi,

On Tue, 22 Jan 2008, Sam Vilain wrote:

>  Documentation/gitattributes.txt |   19 +++++++++++++++++++

As I said on IRC already, I don't think that this is served well as an
"attribute"... it is most likely that the issue either affects _all_
filenames, or _none_.

In that, it is very similar to the CR/LF issue we encountered.  There 
also, it depends more on the platform than on the filename if you want to 
enable special handling or not.

I maintain that it is even more obviously a platform issue than CR/LF,
since the UTF-8 normalisation takes place in the filesystem driver --
regardless of whether it is needed or wished for -- whereas CR/LF
conversion might not be needed or wished for in one project, but might
well be wished for in another clone _on the same platform_.

So I think that this would be a prime candidate for /etc/gitconfig, even
more so than core.autocrlf.

Ciao,
Dscho

* Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
  2008-01-22  4:41 [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND] Sam Vilain
  2008-01-22  5:35 ` Johannes Schindelin
@ 2008-01-22  6:26 ` Junio C Hamano
  2008-01-22  7:43   ` Junio C Hamano
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2008-01-22  6:26 UTC (permalink / raw)
  To: Sam Vilain
  Cc: git, Peter Karlsson, Mark Junker, Pedro Melo, Martin Langhoff,
	Johannes Schindelin, Dmitry Potapov, Kevin Ballard

Sam Vilain <sam.vilain@catalyst.net.nz> writes:

> Some projects may like to enforce that a particular encoding is used for
> all filenames in the repository.  Within the UTF-8 encoding, there are
> four normal forms (see http://unicode.org/reports/tr15/), any of which
> may be a reasonable repository format choice.  Additionally, some
> filesystems may have a single encoding that they support when writing
> local filenames.  To support this, iconv and a normalization library
> must have the information they need to perform the correct conversion.

Isn't there a chicken-and-egg problem?  The attributes are by
nature per-path, and you need to match the pathname string with
a pattern to decide which attribute definition to apply to a
given path.  Before knowing what encoding the pathname you have
just read from readdir(3), how would you match that pathname
with the pattern in the gitattributes file?

I can buy the .git/config (and an in-tree .git-encoding,
perhaps), though.
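
The chicken-and-egg problem can be shown with a short Python sketch.
It uses Python's fnmatch module purely as a stand-in for attribute
pattern matching (git's own matcher is not modelled here), and the
pattern and path are invented for the example.

```python
from fnmatch import fnmatchcase

name = "Documenta\u00e7\u00e3o/README"

# The attributes file was written by somebody working in UTF-8.
pattern = "Documenta\u00e7\u00e3o/*".encode("utf-8")

# The same path as it might come back from readdir on two systems.
path_utf8 = name.encode("utf-8")
path_latin1 = name.encode("latin-1")

# Byte-wise matching only works when both sides share an encoding.
assert fnmatchcase(path_utf8, pattern)
assert not fnmatchcase(path_latin1, pattern)
```

The Latin-1 path never reaches the rule that would have said "this
path is Latin-1", which is the circularity described above.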

* Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
  2008-01-22  5:35 ` Johannes Schindelin
@ 2008-01-22  6:37   ` Junio C Hamano
  0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2008-01-22  6:37 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Sam Vilain, git, Peter Karlsson, Mark Junker, Pedro Melo,
	Martin Langhoff, Dmitry Potapov

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Tue, 22 Jan 2008, Sam Vilain wrote:
>
>>  Documentation/gitattributes.txt |   19 +++++++++++++++++++
>
> As I said on IRC already, I don't think that this is served well as an 
> "attribute"... it is most likely that the issue either affects _all_ 
> filenames, or _none_.

I do not think .gitattributes is the way to go, but I do not
think this has to be all or nothing either.

I can well imagine somebody wanting to do:

	Documentation/ja/README-spelled-in-Japanese
	Documentation/ja/... other files in Japanese ...
	Documentation/zh/README-spelled-in-Chinese
	Documentation/zh/... other files in Chinese ...

and have all files under Documentation/ja/ in EUC-JP while
Documentation/zh/ are BIG5 or whatever (I do not speak nor write
Chinese).

Maybe the project originates from Brazil and the string
"Documentation" itself is spelled as "Documentação" and in
Latin-1 (no, I do not write pt_BR either, and I admit that at
this point this is a contrived example that I cannot imagine
_that_ well, but it is not so far-fetched).

So we _could_ have .git-encoding in Documentation/ja/ and
Documentation/zh/ each of which says "this directory and
everything below are in this encoding, unless overridden
by a deeper directory".
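
The "deepest directory wins" lookup could look roughly like this in
Python.  It is a sketch only: no .git-encoding file format exists, so
the declarations are passed in as a plain dict keyed by directory
(with "" for the top level), and encoding_for() is a hypothetical
name.  Directory names are kept ASCII here to sidestep the question
of how the directory names themselves are encoded.

```python
from pathlib import PurePosixPath
from typing import Dict


def encoding_for(path: str, declared: Dict[str, str],
                 default: str = "C") -> str:
    """Return the encoding declared by the deepest enclosing directory."""
    # .parents yields the closest directory first, ending with ".".
    for parent in PurePosixPath(path).parents:
        key = "" if str(parent) == "." else str(parent)
        if key in declared:
            return declared[key]
    return default


declared = {
    "Documentation/ja": "EUC-JP",
    "Documentation/zh": "BIG5",
}
assert encoding_for("Documentation/ja/README", declared) == "EUC-JP"
assert encoding_for("Documentation/zh/howto/intro", declared) == "BIG5"
assert encoding_for("Makefile", declared) == "C"
```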

* Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
  2008-01-22  6:26 ` Junio C Hamano
@ 2008-01-22  7:43   ` Junio C Hamano
  2008-01-22  8:09     ` Mark Junker
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Junio C Hamano @ 2008-01-22  7:43 UTC (permalink / raw)
  To: Sam Vilain
  Cc: git, Peter Karlsson, Mark Junker, Pedro Melo, Martin Langhoff,
	Johannes Schindelin, Dmitry Potapov, Kevin Ballard

Junio C Hamano <gitster@pobox.com> writes:

> Sam Vilain <sam.vilain@catalyst.net.nz> writes:
>
>> Some projects may like to enforce that a particular encoding is used for
>> all filenames in the repository.  Within the UTF-8 encoding, there are
>> four normal forms (see http://unicode.org/reports/tr15/), any of which
>> may be a reasonable repository format choice.  Additionally, some
>> filesystems may have a single encoding that they support when writing
>> local filenames.  To support this, iconv and a normalization library
>> must have the information they need to perform the correct conversion.
>
> Isn't there a chicken-and-egg problem?  The attributes are by
> nature per-path, and you need to match the pathname string with
> a pattern to decide which attribute definition to apply to a
> given path.  Before knowing what encoding the pathname you have
> just read from readdir(3), how would you match that pathname
> with the pattern in the gitattributes file?
>
> I can buy the .git/config (and an in-tree .git-encoding,
> perhaps), though.

I admit that Documentação/ja/お読み下さい example was contrived
(the last component is README-in-Japanese), and if anybody still
wanted to have such a tree sanely, the only practical
cross-platform and multi-language way to do so is to have
everything in UTF-8 at the repository level.

In that sense, the project does not need to specify anything,
other than marking that "all of the pathnames in tree objects
are in UTF-8 (we could go stronger, and say which kind of
normalization we want)".  As there is no other practical choice
than UTF-8-NFC if you want to be cross-platform, compatible, and
multi-language, the project can just declare that is what it
uses and does not have to mark it any specially.

A particular clone of such a project may want to check
everything out as-is to get an UTF-8 only tree (I'll mention
HFS+ shortly).  Another clone may want to get mixed legacy
encodings by running mkdir(utf8_to_latin1("Documentação")) and
creat(utf8_to_eucjp("お読み下さい")), but that is purely a
local matter and should not be controlled by anything in-tree,
be it .gitattributes or .git-encoding.
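
For illustration, the two conversion helpers named above might be
spelled like this in Python.  The function names come from the
paragraph; their bodies, and the choice of Python's "latin-1" and
"euc_jp" codecs, are assumptions made for this sketch.

```python
def utf8_to_latin1(name: bytes) -> bytes:
    """Re-encode one UTF-8 path component as Latin-1."""
    return name.decode("utf-8").encode("latin-1")


def utf8_to_eucjp(name: bytes) -> bytes:
    """Re-encode one UTF-8 path component as EUC-JP."""
    return name.decode("utf-8").encode("euc_jp")


# The path as recorded in the tree object: UTF-8 throughout.
repo_path = "Documentação/お読み下さい".encode("utf-8")

# A clone that wants mixed legacy encodings converts per component.
directory, filename = repo_path.split(b"/")
local_path = b"/".join([utf8_to_latin1(directory), utf8_to_eucjp(filename)])
```

The resulting local_path is no longer valid in any single encoding,
which is why such a setup can only ever be a local choice.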

On the other hand, it is not so unusual to see a legacy encoding
used in the pathnames, especially if your project does not need
to deal with multi-language issues.  In such a repository, I do
not want to enforce that all the paths in tree objects MUST be
UTF-8.  If all the project participants agree to work with EUC-JP
pathnames in tree objects, we should not make the users always
go through double conversion going from readdir(3) to index, and
coming from index back to open(2) or creat(2).  Again, that is
done by agreement by project participants, so there is nothing
that needs to be specified in-tree.

If the project uses UTF-8-NFC, we would need to adjust check-in
and check-out codepath like Linus's readdir(3) hack suggested,
but that needs to be done only on HFS+.  Of course, the project
participants need to be careful not to create files that HFS+
cannot handle (two paths that happen to be equivalent strings
should not be created), but I do not think that is as big an
issue as some people make it out to be.  If you
want to be interoperable with different filesystems, you should
not create two paths that are different only in case, and if
there are participants who are on such a filesystem, the mistake
is quickly spotted and corrected.  It happened in git.git to a
file other than that infamous Märchen.  It's exactly the same
issue [*1*].
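
A project that wants to catch such mistakes early could run a check
along these lines.  This is a hedged Python sketch, not anything git
does: collision_key() approximates "equivalent on some filesystem" as
NFC normalization followed by Unicode case folding, which is simpler
than what HFS+ or any other real filesystem implements.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def collision_key(path: str) -> str:
    """Approximate the name a normalizing, case-folding filesystem sees."""
    return unicodedata.normalize("NFC", path).casefold()


def find_collisions(paths: Iterable[str]) -> List[List[str]]:
    """Group paths that would collapse into one file on such a system."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[collision_key(path)].append(path)
    return [names for names in groups.values() if len(names) > 1]


paths = ["a", "A", "M\u00e4rchen", "Ma\u0308rchen", "README"]
assert find_collisions(paths) == [
    ["a", "A"],
    ["M\u00e4rchen", "Ma\u0308rchen"],
]
```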

In short, initially I did not like Linus's readdir(3) hack very
much, but the more I think about it, the better I like it.

We pick a reasonable default (i.e. "no conversion") at the
technical level, and recommend (but do not pay for the overhead
of enforcing) a reasonable normalization as the BCP at the human
level.  Only on filesystems that mangle the pathnames, or if you
want legacy encodings on the filesystem, we would need to pay
overhead for conversion and help people with actual code to do
so.

To support the above scenarios, I think each instance of
repository needs to be able to say "this path (specified with a
matching pattern in the filename encoding) should be converted
this way coming in, and that way going out."  A UTF-8 only project
would have NFC<->NFD on an HFS+ partition, and nothing
everywhere else.  An EUC-JP project that checks out as-is would
specify nothing either, but people on Shift_JIS platforms would
locally specify that an EUC-JP <-> Shift_JIS conversion be made.
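
The "this way coming in, that way going out" pair for a UTF-8 project
on HFS+ might be sketched in Python as below.  Two assumptions are
baked in: the repository side is composed (NFC) and the filesystem
side decomposed, and standard NFD is used as an approximation of the
decomposition HFS+ actually applies, which is its own variant.  The
function names are hypothetical.

```python
import unicodedata


def to_repository(fs_name: bytes) -> bytes:
    """Coming in: name as returned by readdir -> composed form for trees."""
    text = fs_name.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


def to_filesystem(repo_name: bytes) -> bytes:
    """Going out: name from a tree object -> decomposed form for creat."""
    text = repo_name.decode("utf-8")
    return unicodedata.normalize("NFD", text).encode("utf-8")


on_disk = "Ma\u0308rchen".encode("utf-8")   # what the filesystem hands back
in_tree = "M\u00e4rchen".encode("utf-8")    # what the project recorded

assert to_repository(on_disk) == in_tree
assert to_filesystem(in_tree) == on_disk
```

On every other platform both functions would simply be the identity,
so nobody else pays for the conversion.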


[Footnote]

*1* This is an important point, especially because the breakage was
about tests that used files "a" and "A".  No pathname encoding
enforcement in git-as-scm would have done anything to avoid
the breakage.  But there are humans involved in the project and
they are an integral part of ensuring interoperability.

* Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
  2008-01-22  7:43   ` Junio C Hamano
@ 2008-01-22  8:09     ` Mark Junker
  2008-01-22  9:16       ` Junio C Hamano
  2008-01-22  9:13     ` Rafael Garcia-Suarez
  2008-01-22  9:57     ` Sam Vilain
  2 siblings, 1 reply; 11+ messages in thread
From: Mark Junker @ 2008-01-22  8:09 UTC (permalink / raw)
  To: git

Junio C Hamano schrieb:

> To support the above scenarios, I think each instance of
> repository needs to be able to say "this path (specified with a
> matching pattern in the filename encoding) should be converted
> this way coming in, and that way going out."  UTF-8 only project
> would have NFC<->NFD on HFS+ partition, and nothing on
> everywhere else.  EUC-JP project that checks out as-is would
> specify nothing either, but people on Shift_JIS platforms would
> locally specify that EUC-JP <-> Shift_JIS conversion to be made.

Just to sum up what you wrote and to be sure that I understand you 
correctly:

Let's have two encodings:
- Encoding for path names stored in the repository
- Encoding for path names from/to file systems

Do conversion only if they are different. Both encodings are configurable.
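
In Python terms, that summary amounts to something like the following
sketch.  convert_path() is a hypothetical name, and the two encodings
are passed in as arguments because no configuration mechanism has
been settled on in this thread.

```python
import codecs


def convert_path(name: bytes, repo_encoding: str, fs_encoding: str) -> bytes:
    """Convert a repository path to the filesystem encoding if they differ."""
    # Compare canonical codec names so "UTF-8" and "utf8" count as equal.
    if codecs.lookup(repo_encoding).name == codecs.lookup(fs_encoding).name:
        return name  # same encoding: the bytes pass through untouched
    return name.decode(repo_encoding).encode(fs_encoding)


euc = "お読み下さい".encode("euc_jp")
assert convert_path(euc, "EUC-JP", "EUC-JP") is euc
assert convert_path(euc, "EUC-JP", "Shift_JIS") == "お読み下さい".encode("shift_jis")
```

Note that when the encodings match, the name is never decoded at all,
so byte sequences that are invalid in that encoding still survive.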

Regards,
Mark

* Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
  2008-01-22  7:43   ` Junio C Hamano
  2008-01-22  8:09     ` Mark Junker
@ 2008-01-22  9:13     ` Rafael Garcia-Suarez
  2008-01-22  9:57     ` Sam Vilain
  2 siblings, 0 replies; 11+ messages in thread
From: Rafael Garcia-Suarez @ 2008-01-22  9:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Sam Vilain, git, Peter Karlsson, Mark Junker, Pedro Melo,
	Martin Langhoff, Johannes Schindelin, Dmitry Potapov,
	Kevin Ballard

On 22/01/2008, Junio C Hamano wrote:
> If the project uses UTF-8-NFC, we would need to adjust check-in
> and check-out codepath like Linus's readdir(3) hack suggested,
> but that needs to be done only on HFS+.  Of course, the project
> participants need to be careful not to create files that HFS+
> cannot handle (two paths that happen to be equivalent strings
> should not be created), but I do not think that is such a big
> issue as some people seem to make a big deal out of.  If you

Right, I don't see that as a big issue -- for new files. But we can have
files that were created in the past as non-handleable by HFS+, and later
renamed to something more portable.

More generally, the consensus encoding might change over time. We can
imagine a project which contains, say, a test file with a Latin-1 name
that later gets renamed to a UTF-8 name (due to a project policy
change), making it necessary to adjust the said test. A checkout of the
earlier version would have that test failing. (But maybe I'm just
handwaving towards a non-existent problem here. I'd consider the issue
as minor anyway.)

> want to be interoperable with different filesystems, you should
> not create two paths that are different only in case, and if
> there are participants who are on such a filesystem, the mistake
> is quickly spotted and corrected.  It happened in git.git to a
> file other than that infamous Märchen.  It's exactly the same
> issue [*1*].
>
> In short, initially I did not like Linus's readdir(3) hack very
> much, but the more I think about it, I like it the better.
>
> We pick a reasonable default (i.e. "no conversion") at the
> technical level, and recommend (but do not pay for the overhead
> of enforcing) a reasonable normalization as the BCP at the human
> level.  Only on filesystems that mangle the pathnames, or if you
> want legacy encodings on the filesystem, we would need to pay
> overhead for conversion and help people with actual code to do
> so.
>
> To support the above scenarios, I think each instance of
> repository needs to be able to say "this path (specified with a
> matching pattern in the filename encoding) should be converted
> this way coming in, and that way going out."  UTF-8 only project
> would have NFC<->NFD on HFS+ partition, and nothing on
> everywhere else.  EUC-JP project that checks out as-is would
> specify nothing either, but people on Shift_JIS platforms would
> locally specify that EUC-JP <-> Shift_JIS conversion to be made.

Sounds sane, except maybe the part where you specify paths with a
pattern. Do you really need this layer of complexity? Pattern matching
in different encodings has proven to be troublesome. Usually that's
where UTF-8 normalisation rules and locale-specific behaviours kick in,
esp. when you're starting to use \w or \d character classes, or case
insensitivity. For example, if you want to do it correctly, "I" will
match /i/ case-insensitively, except in Turkish locales... (Sorry, I'm
just handwaving again here...)
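
The Turkish case can be reproduced with Python's locale-independent
string methods.  This is only a sketch of the pitfall, assuming
Unicode default case folding as the "case-insensitive" rule; the
helper name and the sample word are made up.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I, Turkish lowercase of "I"


def matches_ignoring_case(name: str, pattern: str) -> bool:
    """Locale-independent comparison using Unicode default case folding."""
    return name.casefold() == pattern.casefold()


# Default folding pairs "I" with the dotted "i" ...
assert matches_ignoring_case("isparta", "ISPARTA")
# ... so the Turkish spelling with a dotless i is not treated as a match,
assert not matches_ignoring_case(DOTLESS_I + "sparta", "ISPARTA")
# even though upper-casing the dotless i does give a plain "I".
assert DOTLESS_I.upper() == "I"
```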

* Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
  2008-01-22  8:09     ` Mark Junker
@ 2008-01-22  9:16       ` Junio C Hamano
  0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2008-01-22  9:16 UTC (permalink / raw)
  To: Mark Junker; +Cc: git

Mark Junker <mjscod@web.de> writes:

> Just to sum up what you wrote and to be sure that I understand you
> correctly:
>
> Let's have two encodings:
> - Encoding for path names stored in the repository
> - Encoding for path names from/to file systems
>
> Do conversion only if they are different. Both encodings are configurable.

Not really.

 1. Encoding for the project does not have to be specified at
    all.  The project participants are expected to know about it
    out of band.

 2. Conversion for path names between filesystems and the
    project (i.e. "paths in tree objects") can be specified per
    repository (i.e. "a particular clone of the project").  We
    could even allow the conversion function to be different
    per-path-component but I suspect that would be a far-future
    addition that nobody would use in practice.

 3. Suggest use of UTF-8-NFC as the project encoding as a BCP,
    but never enforce it.  It is a responsibility of the owner
    of the particular repository to make sure that the
    conversions used in a particular repository (again, "a
    particular clone of the project") produce the desired
    encoding in the tree objects.

But please take these with a moderately large grain of salt, as
I was more or less handwaving and pretending to know what I was
talking about ;-).  I think this should work in theory, but I at
the same time suspect that there are many more places than just
readdir(3) that need to be wrapped if we take this approach, and
the intrusiveness factor might make this infeasible in practice.

The difference between your version and my 1. and 2. is very
subtle, but comes primarily from my desire not to have to use
the word "canonical".  Yours defines "this canonical encoding is
used in the repository, and we convert back and forth to that
local encoding", as opposed to my saying "here are to and from
conversion functions".  The latter is more in line with how we
define smudge/clean filters for blob contents conversion, in
that the "encoding" used in in-repository blob does not have to
even have a name.

* Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
  2008-01-22  7:43   ` Junio C Hamano
  2008-01-22  8:09     ` Mark Junker
  2008-01-22  9:13     ` Rafael Garcia-Suarez
@ 2008-01-22  9:57     ` Sam Vilain
  2008-01-22 10:36       ` Junio C Hamano
  2 siblings, 1 reply; 11+ messages in thread
From: Sam Vilain @ 2008-01-22  9:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Peter Karlsson, Mark Junker, Pedro Melo, Martin Langhoff,
	Johannes Schindelin, Dmitry Potapov, Kevin Ballard

Junio C Hamano wrote:
> To support the above scenarios, I think each instance of
> repository needs to be able to say "this path (specified with a
> matching pattern in the filename encoding) should be converted
> this way coming in, and that way going out."  UTF-8 only project
> would have NFC<->NFD on HFS+ partition, and nothing on
> everywhere else.

I think there is another reason to do this - simple sanity.  Two people
adding the same filename should not end up with different tree IDs just
because they, for whatever reason, entered different but canonically
equivalent Unicode forms of that name.

But, that rule of sanity breaks the C semantics sanity, so it must be a
per-project setting.  Not a necessity, but a good feature I think.  It
can be enforced with external scripts/hooks of course.
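
The tree ID effect can be demonstrated with a small Python sketch.
tree_id() is a hypothetical helper that follows git's tree object
layout for a flat directory of regular files only, and is_nfc() is
one possible shape for the hook check, not an existing git feature.

```python
import hashlib
import unicodedata

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # ID of an empty file


def tree_id(entries):
    """Hash (name_bytes, blob_hex) pairs the way a flat git tree is hashed."""
    body = b"".join(
        b"100644 " + name + b"\x00" + bytes.fromhex(blob)
        for name, blob in sorted(entries)
    )
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()


def is_nfc(name: bytes) -> bool:
    """Hook-style check: is this UTF-8 path already in NFC?"""
    text = name.decode("utf-8")
    return unicodedata.normalize("NFC", text) == text


composed = "M\u00e4rchen".encode("utf-8")
decomposed = "Ma\u0308rchen".encode("utf-8")

# Same name to a human, same content, yet two different tree IDs.
assert tree_id([(composed, EMPTY_BLOB)]) != tree_id([(decomposed, EMPTY_BLOB)])

# A hook enforcing NFC would accept one spelling and reject the other.
assert is_nfc(composed) and not is_nfc(decomposed)
```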

What happens on the way in and out of the filesystem, I see that as a
side issue.  Once you define what the normalized form is for the
project, then the features should just fall into place without messy
heuristics.  There is also a correct behaviour when faced with
filesystems that have a different idea about who enforces encoding rules
- so long as you can detect what those ideas are :).  It also means that
users can choose to use the same local encoding as their locale, which
might interoperate better with other apps.

The readdir() (case|normalization) tolerance change is good in its own
right, but it's a slightly different scenario, and independent of the
question of what the normalized form is.  Of course, on case-folding,
Unicode-normalizing filesystems you'd have to have a mixture of these
settings for sane operation.

On the chicken and egg thing, I guess .gitattributes is too late, you're
right - unless you say that at each directory level, the globbing is
always C.  But I haven't thought about that very hard.  I was just
re-using a mechanism that already exists rather than try to invent
something new.  I do agree with Dscho's point that mixing encodings in a
repository is not necessarily a use case worth catering for.

Sam.

* Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
  2008-01-22  9:57     ` Sam Vilain
@ 2008-01-22 10:36       ` Junio C Hamano
  2008-01-22 10:44         ` Sam Vilain
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2008-01-22 10:36 UTC (permalink / raw)
  To: Sam Vilain
  Cc: git, Peter Karlsson, Mark Junker, Pedro Melo, Martin Langhoff,
	Johannes Schindelin, Dmitry Potapov, Kevin Ballard

Sam Vilain <sam.vilain@catalyst.net.nz> writes:

> On the chicken and egg thing, ...
> ...  I do agree with Dscho's point that mixing encodings in a
> repository is not necessarily a use case worth catering for.

Are you talking about "repository" as in "a specific clone", or
"a project that can be cloned by many people and checked out to
suit cloner's needs"?  I definitely agree that mixing encodings
in a project (i.e. "paths in tree objects") does not make any
sense _if_ clones of the projects _may_ want to check things out
in different pathname encodings from each other.  And if all
clones would want to check things out the same way, it does not
really matter what encoding the paths in tree objects are.

I am not absolutely sure if you are talking about mixing
encodings depending on parts of the tree in a specific clone (my
earlier "Documentação/ja/お読み下さい" example).  I would
certainly say it would be a very low priority for us to support
such usage, as I imagine that multi-language trees would most
likely be checked out in UTF-8 everywhere, but it _might_ be
something people may find real need for.

* Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]
  2008-01-22 10:36       ` Junio C Hamano
@ 2008-01-22 10:44         ` Sam Vilain
  0 siblings, 0 replies; 11+ messages in thread
From: Sam Vilain @ 2008-01-22 10:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

Junio C Hamano wrote:
> Sam Vilain <sam.vilain@catalyst.net.nz> writes:
> 
>> On the chicken and egg thing, ...
>> ...  I do agree with Dscho's point that mixing encodings in a
>> repository is not necessarily a use case worth catering for.
> 
> Are you talking about "repository" as in "a specific clone", or
> "a project that can be cloned by many people and checked out to
> suit cloner's needs"?  I definitely agree that mixing encodings
> in a project (i.e. "paths in tree objects") does not make any
> sense _if_ clones of the projects _may_ want to check things out
> in different pathname encodings from each other.  And if all
> clones would want to check things out the same way, it does not
> really matter what encoding the paths in tree objects are.

I'm referring to the normalized form in the object database - ie what
affects the generated SHA1s - what you check it out to locally is a
developer's choice, and assuming that they can handle whatever issues
they create by doing this, then that should be fine.

> I am not absolutely sure if you are talking about mixing
> encodings depending on parts of the tree in a specific clone (my
> earlier "Documentação/ja/お読み下さい" example).  I would
> certainly say it would be a very low priority for us to support
> such usage, as I imagine that multi-language trees would most
> likely be checked out in UTF-8 everywhere, but it _might_ be
> something people may find real need for.

Agreed - not something you want to condone, but if it's just as easy to
come up with a design that doesn't limit to one encoding for a whole
repository, it might help some people.

The use case for mixed encodings I had in mind was when you clone some
repository that's got them mixed, and you need to tell git the encoding
per-path to get the darned thing to behave sensibly for you (presumably
while you write a patch to submit upstream to fix it).
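
As a first step in that situation, one could at least list which
paths are not in the encoding the rest of the tree uses.  A minimal
Python sketch, assuming the raw path bytes have already been obtained
somehow (for instance from a NUL-separated file listing);
undecodable_paths() is a made-up name.

```python
from typing import Iterable, List


def undecodable_paths(paths: Iterable[bytes],
                      encoding: str = "utf-8") -> List[bytes]:
    """Return the paths whose bytes are not valid in `encoding`."""
    bad = []
    for raw in paths:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad.append(raw)
    return bad


paths = [
    "Documentação/README".encode("latin-1"),
    "Documentação/README".encode("utf-8"),
    b"Makefile",
]
assert undecodable_paths(paths) == [paths[0]]
```

This only finds paths that are invalid as UTF-8; it cannot tell which
legacy encoding they are in, so that part still has to be told to git
per path.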

Sam.

Thread overview: 11+ messages
2008-01-22  4:41 [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND] Sam Vilain
2008-01-22  5:35 ` Johannes Schindelin
2008-01-22  6:37   ` Junio C Hamano
2008-01-22  6:26 ` Junio C Hamano
2008-01-22  7:43   ` Junio C Hamano
2008-01-22  8:09     ` Mark Junker
2008-01-22  9:16       ` Junio C Hamano
2008-01-22  9:13     ` Rafael Garcia-Suarez
2008-01-22  9:57     ` Sam Vilain
2008-01-22 10:36       ` Junio C Hamano
2008-01-22 10:44         ` Sam Vilain
