git.vger.kernel.org archive mirror
From: "Torsten Bögershausen" <tboegi@web.de>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Torsten Bögershausen" <tboegi@web.de>, git@vger.kernel.org
Subject: Re: [PATCH/RFC] core.precomposeunicode is true by default
Date: Tue, 27 Aug 2013 15:45:44 +0200
Message-ID: <521CAD88.4080609@web.de>
In-Reply-To: <7vmwp5z3iu.fsf@alter.siamese.dyndns.org>

(Sorry for the somewhat late reply, and thanks for the review)
>Torsten Bögershausen <tboegi@web.de> writes:
>
>> When core.precomposeunicode was introduced, it was set to false
>> by default, to be compatible with older versions of Git.
>>
>> Whenever UTF-8 file names are used in a mixed environment,
>> the Mac OS users need to find out that this configuration exists
>> and set it to true manually.
>>
>> There is no measurable performance impact between false and true.
>
>The real reason we default it to auto-sensing in the current code is
>for correctness, I think. The new precompose code could be buggy,
>and by auto-sensing, we hoped that we would enable it only on
>filesystems that the codepath matters.
>
>> A smoother workflow can be achieved for new Git users,
>> so change the default to true:
>>
>> - Remove the auto-sensing
>
>Why?
>
>> - Rename the internal variable into precompose_unicode,
>>   and set it to 1 meaning true.
>
>Why the rename?
>
>> - Adjust and clean up test cases
>>
>> The configuration core.precomposeunicode is still supported.
>
>Sorry, but I do not quite understand the change.  Is this because
>the auto-sensing is not working, or after auto-sensing we do a wrong
>thing?  If that is the case, perhaps that is what we should fix?
>
>> diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
>> index 7980abd..5396b91 100644
>> --- a/compat/precompose_utf8.c
>> +++ b/compat/precompose_utf8.c
>> @@ -36,30 +36,6 @@ static size_t has_non_ascii(const char *s, size_t maxlen, size_t *strlen_c)
>>  }
>>  
>>  
>> -void probe_utf8_pathname_composition(char *path, int len)
>> -{
>> -	static const char *auml_nfc = "\xc3\xa4";
>> -	static const char *auml_nfd = "\x61\xcc\x88";
>> -	int output_fd;
>> -	if (precomposed_unicode != -1)
>> -		return; /* We found it defined in the global config, respect it */
>> -	strcpy(path + len, auml_nfc);
>> -	output_fd = open(path, O_CREAT|O_EXCL|O_RDWR, 0600);
>
>So we try to create a path under one name, and ...
>
>> -	if (output_fd >= 0) {
>> -		close(output_fd);
>> -		strcpy(path + len, auml_nfd);
>> -		/* Indicate to the user, that we can configure it to true */
>> -		if (!access(path, R_OK))
>> -			git_config_set("core.precomposeunicode", "false");
>
>... see if that path can be seen under its alias.  Why do we set it
>to "false"?  Isn't this the true culprit?
>
>After all, this is not in the "reinit" codepath, so we know we are
>dealing with a repository that was created afresh.
>

There is nothing wrong with the auto-sensing as such.
The problem for many users today is that we set core.precomposeunicode
to false, when it should be true.

A patch for that will follow in a minute. But first, let's look back
at the experience collected so far with core.precomposeunicode.

Let's have a look at the variable "precomposed_unicode"
(the one I wanted to rename to be more consistent).
It is controlled by the git config files and,
depending on the config, it is set like this:
core.precomposeunicode false -> precomposed_unicode = 0
core.precomposeunicode true  -> precomposed_unicode = 1
core.precomposeunicode <not set> -> precomposed_unicode = -1.
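
The mapping above can be sketched in C. This is only an illustration:
precompose_from_config() is a hypothetical helper, not Git's config
parser, and it assumes the value is spelled exactly "true" or "false"
(Git accepts more spellings for booleans).

```c
#include <string.h>

/*
 * Hypothetical helper, not Git's config code: map the textual value of
 * core.precomposeunicode (NULL when the key is not set) to the tri-state
 * used by the variable precomposed_unicode.
 */
static int precompose_from_config(const char *value)
{
	if (!value)
		return -1;	/* not set: auto-sensing is still allowed to run */
	if (!strcmp(value, "true"))
		return 1;	/* precompose file names */
	return 0;		/* simplification: everything else counts as false */
}
```

With this sketch, "false" maps to 0, "true" to 1, and a missing key
(NULL) to -1, the value the probe checks before it runs.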

Let's look at what precomposed_unicode does by going through a couple
of git operations.

1)
When we create a repo under Mac OS using HFS+,
we want to have precomposed_unicode = 1

2)
When we access a repo from Windows/Linux using SAMBA,
readdir() will return decomposed file names.
When the repo was created by a non-Mac OS system, core.precomposeunicode
is undefined.
Precomposition is off but should be on:
precomposed_unicode = -1, but it should be 1.

3)
When we access a repo from another Mac OS system using
SAMBA, NFS or AFP, readdir() will return decomposed file names.
As the repo was created under Mac OS, we have the same case as (1).

4)
When we access a repo from Linux using NFS, we can have
precomposed_unicode = 0 (which is technically more correct).
If Linux users do not use decomposed unicode in their file names
(according to my understanding this is the case), we can use 1
as well as 0:
precomposing an already precomposed string is a no-op, so it does no
harm.
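
To illustrate the no-op claim, here is a toy sketch. It is not the code
in compat/precompose_utf8.c: toy_precompose() is a hypothetical function
that knows only a single pair, "a" followed by U+0308 (bytes 61 cc 88),
which it rewrites to U+00E4 (bytes c3 a4). Real precomposition covers
far more characters.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy precomposition for exactly one character pair (an assumption made
 * for illustration only). "out" must have room for strlen(in) + 1 bytes;
 * the result is never longer than the input.
 */
static void toy_precompose(const char *in, char *out)
{
	static const char nfd[] = "\x61\xcc\x88";	/* decomposed a-umlaut */
	static const char nfc[] = "\xc3\xa4";		/* precomposed a-umlaut */
	const size_t nfd_len = sizeof(nfd) - 1;
	const size_t nfc_len = sizeof(nfc) - 1;

	while (*in) {
		if (!strncmp(in, nfd, nfd_len)) {
			/* replace the decomposed sequence by the precomposed one */
			memcpy(out, nfc, nfc_len);
			out += nfc_len;
			in += nfd_len;
		} else {
			/* everything else is copied through unchanged */
			*out++ = *in++;
		}
	}
	*out = '\0';
}
```

Running the function on its own output changes nothing, because the
output no longer contains the decomposed sequence. That is the property
which makes precomposed_unicode = 1 harmless for names that are already
precomposed.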


5)
When we create a repo under Linux/Windows on a USB-drive,
and run "git status" under Mac OS, we want precomposed_unicode = 1.

There are a few cases where we want to use precomposed_unicode = 0:
a) To work around bugs. This may be a short-term solution;
   I would rather see the bugs fixed.
   I'm not aware of any bugs, so please remind me if I missed something.

b) Working with a foreign VCS: e.g. bzr and hg use decomposed unicode,
   so it may be better to use decomposed unicode in git as well.

The simplified V2 patch looks like this (I will send it in a separate mail):

diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index 7980abd..95fe849 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -48,11 +48,8 @@ void probe_utf8_pathname_composition(char *path, int len)
 	if (output_fd >= 0) {
 		close(output_fd);
 		strcpy(path + len, auml_nfd);
-		/* Indicate to the user, that we can configure it to true */
-		if (!access(path, R_OK))
-			git_config_set("core.precomposeunicode", "false");
-		/* To be backward compatible, set precomposed_unicode to 0 */
-		precomposed_unicode = 0;
+		precomposed_unicode = access(path, R_OK) ? 0 : 1;
+		git_config_set("core.precomposeunicode", precomposed_unicode ? "true" : "false");

This will not affect existing repos, as they should have
core.precomposeunicode set to either true or false.
When a new repo is created, users having core.precomposeunicode in the
global config are not affected,
as the global setting switches off the auto-sensing.

Users who are new to git are affected,
and hopefully we do the right thing for them.
At least, according to my understanding, we do the best for the majority
of users and use cases.
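
For anyone who wants to see what such a probe reports on a given
filesystem, here is a standalone sketch. It is not the function from
compat/precompose_utf8.c: probe_precompose(), the "probe_" file name
prefix and the buffer sizes are assumptions made for illustration, and
the sketch writes no config.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical standalone probe: create a file whose name ends in the
 * precomposed a-umlaut inside "dir", then check whether the same file
 * can be reached under the decomposed spelling.
 * Returns 1 if both spellings name the same file, 0 if they do not,
 * and -1 if the probe file could not be created.
 */
static int probe_precompose(const char *dir)
{
	static const char auml_nfc[] = "\xc3\xa4";
	static const char auml_nfd[] = "\x61\xcc\x88";
	char nfc_path[4096], nfd_path[4096];
	int n, fd, result;

	n = snprintf(nfc_path, sizeof(nfc_path), "%s/probe_%s", dir, auml_nfc);
	if (n < 0 || n >= (int)sizeof(nfc_path))
		return -1;
	n = snprintf(nfd_path, sizeof(nfd_path), "%s/probe_%s", dir, auml_nfd);
	if (n < 0 || n >= (int)sizeof(nfd_path))
		return -1;

	fd = open(nfc_path, O_CREAT | O_EXCL | O_RDWR, 0600);
	if (fd < 0)
		return -1;
	close(fd);

	/* access() returns 0 when the decomposed alias is readable */
	result = access(nfd_path, R_OK) ? 0 : 1;
	unlink(nfc_path);	/* clean up the probe file */
	return result;
}
```

A result of 1 corresponds to the case where V2 would write
core.precomposeunicode = true, and 0 to the case where it would write
false.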

Thoughts are welcome, and arguments for and against V1 or V2.

Anybody who uses Mac OS and has experience with decomposed unicode?
I have core.precomposeunicode true in my global config.
 

