* jgit problems for file paths with non-ASCII characters
@ 2009-11-25 13:47 Marc Strapetz
2009-11-25 21:11 ` Robin Rosenberg
0 siblings, 1 reply; 10+ messages in thread
From: Marc Strapetz @ 2009-11-25 13:47 UTC (permalink / raw)
To: git
I have noticed that jgit converts file paths to UTF-8 when querying the
repository. In particular,
org.eclipse.jgit.treewalk.filter.PathFilter#PathFilter performs this
conversion:
  private PathFilter(final String s) {
    pathStr = s;
    pathRaw = Constants.encode(pathStr);
  }
Because of this conversion, a TreeWalk fails to find a file whose name
contains German umlauts. When the platform encoding is used to convert
the file path to bytes instead:
  private PathFilter(final String s) {
    pathStr = s;
    pathRaw = s.getBytes();
  }
the TreeWalk works as expected. So the file path seems to be stored
in the repository using the platform encoding.
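To illustrate the mismatch, here is a minimal sketch (not jgit code). It
assumes ISO-8859-1 as a stand-in for a Western European platform encoding;
the class name and the sample file name are made up for the example. The
same path string yields different byte sequences, so a byte-wise comparison
against the stored tree entry cannot match:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PathBytesDemo {
    // Encode a path the way a byte-wise path filter would need it.
    static byte[] encode(String path, Charset cs) {
        return path.getBytes(cs);
    }

    public static void main(String[] args) {
        String path = "B\u00fccher.txt"; // "Bücher.txt", hypothetical file name

        byte[] utf8 = encode(path, StandardCharsets.UTF_8);
        // Assumption: ISO-8859-1 stands in for the platform encoding.
        byte[] latin1 = encode(path, StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length);                 // 11 (u-umlaut takes two bytes)
        System.out.println(latin1.length);               // 10 (u-umlaut takes one byte)
        System.out.println(Arrays.equals(utf8, latin1)); // false
    }
}
```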
Is this a bug or a misconfiguration of my repository? I'm using jgit
(commit e16af839e8a0cc01c52d3648d2d28e4cb915f80f) on Windows.
Thanks!
--
Best regards,
Marc Strapetz
=============
syntevo GmbH
http://www.syntevo.com
http://blog.syntevo.com
* Re: jgit problems for file paths with non-ASCII characters
2009-11-25 13:47 jgit problems for file paths with non-ASCII characters Marc Strapetz
@ 2009-11-25 21:11 ` Robin Rosenberg
2009-11-26 0:54 ` [egit-dev] " Shawn O. Pearce
0 siblings, 1 reply; 10+ messages in thread
From: Robin Rosenberg @ 2009-11-25 21:11 UTC (permalink / raw)
To: Marc Strapetz; +Cc: git, egit-dev
On Wednesday 25 November 2009 14:47:25, Marc Strapetz wrote:
> I have noticed that jgit converts file paths to UTF-8 when querying the
> repository. Especially,
> org.eclipse.jgit.treewalk.filter.PathFilter#PathFilter performs this
> conversion:
>
>   private PathFilter(final String s) {
>     pathStr = s;
>     pathRaw = Constants.encode(pathStr);
>   }
>
> Because of this conversion, a TreeWalk fails to identify a file with
> German umlauts. When using platform encoding to convert the file path to
> bytes:
>
>   private PathFilter(final String s) {
>     pathStr = s;
>     pathRaw = s.getBytes();
>   }
>
> the TreeWalk works as expected. Actually, the file path seems to be
> stored with platform encoding in the repository.
>
> Is this a bug or a misconfiguration of my repository? I'm using jgit
> (commit e16af839e8a0cc01c52d3648d2d28e4cb915f80f) on Windows.
A bug.
The problem here is that we need to allow multiple encodings since there
is no reliable encoding specified anywhere. The approach I advocate is
the one we use for handling encoding in general, i.e. if it looks like UTF-8,
treat it as UTF-8, else fall back. This is expensive, however, and then we
have all the other issues with case-insensitive names and the funny property
of Unicode that it allows a character to be encoded using multiple sequences
of code points, as employed by Apple.
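A minimal sketch of the "if it looks like UTF-8, treat it as UTF-8, else
fall back" idea, using only the standard java.nio.charset API. This is not
the jgit implementation; the class name, the method name and the choice of
fallback charset are assumptions made for the example:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class PathDecoder {
    // Try strict UTF-8 first; if the bytes are not valid UTF-8,
    // decode them with the caller-supplied fallback charset.
    static String decode(byte[] raw, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException notUtf8) {
            return new String(raw, fallback);
        }
    }

    public static void main(String[] args) {
        String name = "B\u00fccher"; // "Bücher"
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);

        // Both byte sequences decode to the same string.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(name));
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1).equals(name));
    }
}
```

The heuristic is not watertight: some byte sequences written in a legacy
encoding also happen to be valid UTF-8, and they would be decoded as UTF-8.
The strict decoding pass is also part of why the approach is expensive.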
-- robin
* Re: [egit-dev] Re: jgit problems for file paths with non-ASCII characters
2009-11-25 21:11 ` Robin Rosenberg
@ 2009-11-26 0:54 ` Shawn O. Pearce
2009-11-26 13:09 ` Thomas Singer
2009-11-26 14:25 ` Marc Strapetz
0 siblings, 2 replies; 10+ messages in thread
From: Shawn O. Pearce @ 2009-11-26 0:54 UTC (permalink / raw)
To: EGit developer discussion; +Cc: Marc Strapetz, git
Robin Rosenberg <robin.rosenberg@dewire.com> wrote:
> On Wednesday 25 November 2009 14:47:25, Marc Strapetz wrote:
> > I have noticed that jgit converts file paths to UTF-8 when querying the
> > repository.
...
> > Is this a bug or a misconfiguration of my repository? I'm using jgit
> > (commit e16af839e8a0cc01c52d3648d2d28e4cb915f80f) on Windows.
>
> A bug.
>
> The problem here is that we need to allow multiple encodings since there
> is no reliable encoding specified anywhere.
This is a design fault of both Linux and git. git gets a byte
sequence from readdir and stores that as-is into the repository.
We have no way of knowing what that encoding is. So now everyone
touching a Git repository is screwed.
> The approach I advocate is
> the one we use for handling encoding in general. I.e. if it looks like UTF-8,
> treat it like that else fallback. This is expensive however
We should try to work harder with the git-core folks to get character
set encoding for file names worked out. We might be able to use a
configuration setting in the repository to tell us what the proper
encoding should be, and if not set, assume UTF-8.
> and then we have
> all the other issues with case insensitive name and the funny property that
> unicode has when it allows characters to be encoded using multiple sequences
> of code points as employed by Apple.
But as you said, this still doesn't make the Apple normal form
any easier. Though if we know we are on such a strange filesystem
we might be able to assume the paths in the repository are equally
damaged. Or not.
--
Shawn.
* Re: [egit-dev] Re: jgit problems for file paths with non-ASCII characters
2009-11-26 0:54 ` [egit-dev] " Shawn O. Pearce
@ 2009-11-26 13:09 ` Thomas Singer
2009-11-26 14:47 ` Johannes Schindelin
2009-11-26 16:44 ` Robin Rosenberg
2009-11-26 14:25 ` Marc Strapetz
1 sibling, 2 replies; 10+ messages in thread
From: Thomas Singer @ 2009-11-26 13:09 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: EGit developer discussion, Marc Strapetz, git
> But as you said, this still doesn't make the Apple normal form
> any easier. Though if we know we are on such a strange filesystem
> we might be able to assume the paths in the repository are equally
> damaged. Or not.
Well, if the git-core folks could standardize on, e.g., composed UTF-8
(rather than just UTF-8) for storing file names in the repository, then
everything would be clear, wouldn't it?
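For reference, a small sketch of what "composed" versus "decomposed" means
at the byte level, using java.text.Normalizer. NFC and NFD are the standard
Unicode normalization forms; HFS+ is documented to use its own variant of
decomposition, so NFD is only an approximation here. The class and method
names are made up for the example:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NormalizationDemo {
    // Normalize to the given form, then encode as UTF-8.
    static byte[] utf8(String s, Normalizer.Form form) {
        return Normalizer.normalize(s, form).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4"; // a with diaeresis, as a single code point

        // Composed: one code point U+00E4, two UTF-8 bytes.
        System.out.println(utf8(umlaut, Normalizer.Form.NFC).length); // 2
        // Decomposed: 'a' followed by U+0308, three UTF-8 bytes.
        System.out.println(utf8(umlaut, Normalizer.Form.NFD).length); // 3
    }
}
```

Both forms render as the same character, but they are different byte
sequences, which is why "just UTF-8" does not pin down the stored name.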
--
Best regards,
Thomas Singer
=============
syntevo GmbH
http://www.syntevo.com
http://blog.syntevo.com
* Re: [egit-dev] Re: jgit problems for file paths with non-ASCII characters
2009-11-26 13:09 ` Thomas Singer
@ 2009-11-26 14:47 ` Johannes Schindelin
2009-11-26 15:31 ` Thomas Singer
2009-11-26 16:44 ` Robin Rosenberg
1 sibling, 1 reply; 10+ messages in thread
From: Johannes Schindelin @ 2009-11-26 14:47 UTC (permalink / raw)
To: Thomas Singer
Cc: Shawn O. Pearce, EGit developer discussion, Marc Strapetz, git
Hi,
On Thu, 26 Nov 2009, Thomas Singer wrote:
> [someone said, Thomas did not say who]
>
> > But as you said, this still doesn't make the Apple normal form any
> > easier. Though if we know we are on such a strange filesystem we
> > might be able to assume the paths in the repository are equally
> > damaged. Or not.
>
> Well, if the git-core folks could standardize on, e.g., composed UTF-8
> (rather than just UTF-8), for storing file names in the repository, then
> everything should be clear, isn't it?
You mean we should do the same thing as Apple with HFS? Are you serious?
Ciao,
Dscho
* Re: [egit-dev] Re: jgit problems for file paths with non-ASCII characters
2009-11-26 14:47 ` Johannes Schindelin
@ 2009-11-26 15:31 ` Thomas Singer
2009-11-26 19:57 ` Shawn O. Pearce
0 siblings, 1 reply; 10+ messages in thread
From: Thomas Singer @ 2009-11-26 15:31 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Shawn O. Pearce, Marc Strapetz, git
> You mean we should do the same thing as Apple with HFS? Are you serious?
Yes, I'm serious. IMHO there should be a clearly defined encoding for
file names in the repository. Otherwise you don't know what to expect
when reading it - it could mean anything. File names are in fact strings,
which are made of characters. To convert characters to bytes (or vice versa)
you need to know the encoding.
--
Best regards,
Thomas Singer
=============
syntevo GmbH
http://www.syntevo.com
http://blog.syntevo.com
* Re: [egit-dev] Re: jgit problems for file paths with non-ASCII characters
2009-11-26 15:31 ` Thomas Singer
@ 2009-11-26 19:57 ` Shawn O. Pearce
0 siblings, 0 replies; 10+ messages in thread
From: Shawn O. Pearce @ 2009-11-26 19:57 UTC (permalink / raw)
To: Thomas Singer; +Cc: Johannes Schindelin, Marc Strapetz, git
Thomas Singer <thomas.singer@syntevo.com> wrote:
> > You mean we should do the same thing as Apple with HFS? Are you serious?
>
> Yes, I'm serious. IMHO there should be a defined clear encoding used for
> file names in the repository. Otherwise you don't know what you can expect
> by reading it - it could mean anything. File names are in fact strings which
> are based on characters. To convert characters to bytes (or vice versa) you
> need to know the encoding.
That's likely not going to fly. HFS+ has changed its decomposition
rules at least once, which means the byte sequence for the same
character sequence would differ, and a tree or commit hash would
come out different depending upon which rules you were following.
See [1] for details on what HFS+ does.
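A rough sketch of why the hash would change. For brevity it hashes only the
UTF-8 bytes of a file name, not a real Git tree object, and it uses NFC and
NFD as stand-ins for two different sets of rules; the class name and sample
name are made up. Any difference in the name bytes changes the SHA-1:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Normalizer;

public class NameHashDemo {
    // Hex SHA-1 of the given bytes.
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Hash a name after normalizing it to the given form.
    static String hashOf(String name, Normalizer.Form form) {
        return sha1Hex(Normalizer.normalize(name, form).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String name = "B\u00fccher.txt"; // hypothetical file name
        String composed = hashOf(name, Normalizer.Form.NFC);
        String decomposed = hashOf(name, Normalizer.Form.NFD);
        System.out.println(composed.equals(decomposed)); // false
    }
}
```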
Also, Linus has previously stated that HFS+ chose the worst possible
way to encode the names. Getting Linus to admit he was wrong is
impossible, and getting Linus to accept the HFS+ encoding rules as the
standard format used in a Git repository is not likely to happen.
Fortunately Linus carries a slightly smaller stick in Git than he
used to, but he is quite vocal and people tend to listen.
[1] http://developer.apple.com/mac/library/technotes/tn/tn1150.html#UnicodeSubtleties
--
Shawn.
* Re: [egit-dev] Re: jgit problems for file paths with non-ASCII characters
2009-11-26 13:09 ` Thomas Singer
2009-11-26 14:47 ` Johannes Schindelin
@ 2009-11-26 16:44 ` Robin Rosenberg
1 sibling, 0 replies; 10+ messages in thread
From: Robin Rosenberg @ 2009-11-26 16:44 UTC (permalink / raw)
To: Thomas Singer
Cc: Shawn O. Pearce, EGit developer discussion, Marc Strapetz, git
On Thursday 26 November 2009 14:09:09, Thomas Singer wrote:
> > But as you said, this still doesn't make the Apple normal form
> > any easier. Though if we know we are on such a strange filesystem
> > we might be able to assume the paths in the repository are equally
> > damaged. Or not.
>
> Well, if the git-core folks could standardize on, e.g., composed UTF-8
> (rather than just UTF-8), for storing file names in the repository, then
> everything should be clear, isn't it?
Hey, we're trying to enforce composed characters...
-- robin
* Re: [egit-dev] Re: jgit problems for file paths with non-ASCII characters
2009-11-26 0:54 ` [egit-dev] " Shawn O. Pearce
2009-11-26 13:09 ` Thomas Singer
@ 2009-11-26 14:25 ` Marc Strapetz
2009-11-26 20:03 ` Shawn O. Pearce
1 sibling, 1 reply; 10+ messages in thread
From: Marc Strapetz @ 2009-11-26 14:25 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: EGit developer discussion, git, robin.rosenberg
> We should try to work harder with the git-core folks to get character
> set encoding for file names worked out. We might be able to use a
> configuration setting in the repository to tell us what the proper
> encoding should be, and if not set, assume UTF-8.
I agree that this should be the ultimate goal, though the default should
rather be "system encoding" for compatibility with current git
repositories, with newer git versions always setting the encoding to
UTF-8. Thus, for our jgit clone I've introduced a system property to
configure Constants.PATH_ENCODING, set to the system encoding. It's used by
PathFilter and this resolves my original problem.
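Roughly, the idea looks like the sketch below. This is not the actual
patch: the class name and the property name "example.path.encoding" are
hypothetical, and falling back to the JVM default charset when the property
is unset or unusable is an assumption made for the example:

```java
import java.nio.charset.Charset;

public class PathEncoding {
    // Hypothetical property name, chosen only for this example.
    static final String PROPERTY = "example.path.encoding";

    // Resolve a configured charset name; fall back to the platform
    // default when nothing usable is configured.
    static Charset resolve(String configured) {
        if (configured == null) {
            return Charset.defaultCharset();
        }
        try {
            return Charset.forName(configured);
        } catch (IllegalArgumentException badName) {
            // Covers both illegal and unsupported charset names.
            return Charset.defaultCharset();
        }
    }

    // Encode a path using the charset selected by the system property.
    static byte[] encodePath(String path) {
        return path.getBytes(resolve(System.getProperty(PROPERTY)));
    }

    public static void main(String[] args) {
        System.out.println(resolve(System.getProperty(PROPERTY)));
    }
}
```

It would be selected at startup with something like
-Dexample.path.encoding=ISO-8859-1 on the java command line.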
I have tried to switch more usages from Constants.CHARACTER_ENCODING to
Constants.PATH_ENCODING, but ended up in confusion due to my lack of
understanding: primarily because I couldn't tell anymore whether encoded
strings were file names or not. Does it make sense to explicitly
distinguish encoding usages in that way? We could try to contribute here
(and hopefully cause less review effort to jgit developers than the
changes themselves are worth ;-)
--
Best regards,
Marc Strapetz
=============
syntevo GmbH
http://www.syntevo.com
http://blog.syntevo.com
* Re: [egit-dev] Re: jgit problems for file paths with non-ASCII characters
2009-11-26 14:25 ` Marc Strapetz
@ 2009-11-26 20:03 ` Shawn O. Pearce
0 siblings, 0 replies; 10+ messages in thread
From: Shawn O. Pearce @ 2009-11-26 20:03 UTC (permalink / raw)
To: Marc Strapetz; +Cc: EGit developer discussion, git, robin.rosenberg
Marc Strapetz <marc.strapetz@syntevo.com> wrote:
> > We should try to work harder with the git-core folks to get character
> > set encoding for file names worked out. We might be able to use a
> > configuration setting in the repository to tell us what the proper
> > encoding should be, and if not set, assume UTF-8.
>
> I agree that this should be the ultimate goal, though the default should
> better be "system encoding" for compatibility with current git
> repositories and instead have newer git versions always set encoding to
> UTF-8. Thus, for our jgit clone I've introduced a system property to
> configure Constants.PATH_ENCODING set to system encoding. It's used by
> PathFilter and this resolves my original problem.
That's probably a good point: using the system encoding on a
repository may produce the file names in a way that is more compatible
with git-core. But we probably don't want the encoding to be a
single constant for the whole JVM; we probably need to support
a per-repository configuration of the encoding for path names so
that we can eventually move to a non-platform-specific encoding.
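A sketch of the per-repository variant, to make the difference from a
JVM-wide constant concrete. Nothing here is existing jgit API: the class
name and the config key "example.pathEncoding" are hypothetical, a plain Map
stands in for the parsed repository configuration, and UTF-8 as the default
when the key is unset is one possible choice:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class RepoPathCodec {
    // Hypothetical config key, chosen only for this example.
    static final String KEY = "example.pathEncoding";

    private final Charset charset;

    // Each repository gets its own codec built from its own configuration.
    RepoPathCodec(Map<String, String> repoConfig) {
        String name = repoConfig.get(KEY);
        this.charset = (name == null) ? StandardCharsets.UTF_8 : Charset.forName(name);
    }

    byte[] encode(String path) {
        return path.getBytes(charset);
    }

    String decode(byte[] raw) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Map<String, String> legacyConfig = new HashMap<>();
        legacyConfig.put(KEY, "ISO-8859-1");

        RepoPathCodec legacy = new RepoPathCodec(legacyConfig);
        RepoPathCodec modern = new RepoPathCodec(new HashMap<>());

        // The same name is stored differently in the two repositories.
        System.out.println(legacy.encode("\u00e4").length); // 1
        System.out.println(modern.encode("\u00e4").length); // 2
    }
}
```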
> I have tried to switch more usages from Constants.CHARACTER_ENCODING to
> Constants.PATH_ENCODING, but ended up in confusion due to my lack of
> understanding: primarily because I couldn't tell anymore whether encoded
> strings were file names or not.
Heh. Yea. There are a number of file name encoding sites. I think
everything in the treewalk package, as well as the GitIndex, Tree and
DirCache* classes. Also the Patch class and its FileHeader friend.
> Does it make sense to explicitly
> distinguish encoding usages in that way? We could try to contribute here
> (and hopefully cause less review effort to jgit developers than the
> changes themselves are worth ;-)
Yes, it does. Because we eventually need to support encodings
other than the current UTF-8 we assume for file names, especially
if a repository is using the local filesystem encoding and that
isn't UTF-8.
--
Shawn.