* Git-mediawiki : Encoding problems in perl
@ 2011-06-08 13:45 Jérémie NIKAES
2011-06-08 14:37 ` Steffen Daode Nurpmeso
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Jérémie NIKAES @ 2011-06-08 13:45 UTC (permalink / raw)
To: thomas; +Cc: git
Hi,
While working on the git-mediawiki project[1], we ran into some
problems regarding utf8 encoding of files. Most of them have been
solved, however, one is still pretty annoying.
Let me illustrate it :
I want to edit a page on mediawiki using the API, with a very simple example :
my $mw = MediaWiki::API->new();
$mw->edit( {
action => 'edit',
title => 'Main_page',
text => 'été',
} ) ;
But, when I look at the page on mediawiki, I see weird characters : été.
I tried text => encode_utf8('été') with no success.
This makes pushing changes from git to mediawiki buggy, since pulling a
file with accented characters and pushing it right after changes
things on the wiki.
While googling (a lot), I found that utf8 was pretty tricky in perl...
The only thing that seems to solve things is a simple addition of 'use
encoding utf8' at the top of our script.
However
A) Adding this line requires that I remove 'use strict;'
B) I found some information about this pragma encoding and it seems to
be unadvised to use it
Do you have any information regarding this issue ?
Thanks,
--
Jérémie Nikaes
[1] https://github.com/Bibzball/Git-Mediawiki
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 13:45 Git-mediawiki : Encoding problems in perl Jérémie NIKAES
@ 2011-06-08 14:37 ` Steffen Daode Nurpmeso
2011-06-08 15:01 ` Jeff King
2011-06-08 17:04 ` Jakub Narebski
2 siblings, 0 replies; 20+ messages in thread
From: Steffen Daode Nurpmeso @ 2011-06-08 14:37 UTC (permalink / raw)
To: Jérémie NIKAES; +Cc: git, thomas
@ Jérémie NIKAES <jeremie.nikaes@gmail.com> wrote (2011-06-08 15:45+0200):
> But, when I look at the page on mediawiki, I see weird characters : été.
>
> I tried text => encode_utf8('été') with no success.
>
> Do you have any information regarding this issue ?
I'm not a Perl guru, but I ran into the very same problem when
writing a disc-ripper/CDDB lookup/DB writer, and the following
snippet helped me out:
$CDDB{TITLES} = $dinf->{ttitles};
foreach (@{$dinf->{ttitles}}) {
s/^\s*(.*?)\s*$/$1/;
my $save = $_;
eval { Encode::from_to($_, 'iso-8859-1', 'utf-8'); };
$_ = $save if $@;
Encode::_utf8_off($_);
}
I forget the exact circumstances, but as far as I remember you
need to toggle the is-UTF-8 flag on the string object in a
discouraged (according to the manual) way to make it work the way it
should.
--
Ciao, Steffen
sdaoden(*)(gmail.com)
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 13:45 Git-mediawiki : Encoding problems in perl Jérémie NIKAES
2011-06-08 14:37 ` Steffen Daode Nurpmeso
@ 2011-06-08 15:01 ` Jeff King
2011-06-08 15:37 ` Matthieu Moy
2011-06-08 17:04 ` Jakub Narebski
2 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2011-06-08 15:01 UTC (permalink / raw)
To: Jérémie NIKAES; +Cc: thomas, git
On Wed, Jun 08, 2011 at 03:45:43PM +0200, Jérémie NIKAES wrote:
> my $mw = MediaWiki::API->new();
> $mw->edit( {
> action => 'edit',
> title => 'Main_page',
> text => 'été',
> } ) ;
> [...]
> While googling (a lot), I found that utf8 was pretty tricky in perl...
> The only thing that seems to solve things is a simple addition of 'use
> encoding utf8' at the top of our script.
> However
> A) Adding this line requires that I remove 'use strict;'
> B) I found some information about this pragma encoding and it seems to
> be unadvised to use it
From the "utf8" man page:
Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8.
which is what you are doing here, since you are telling perl that the
string constant is in utf8. So from my understanding, "use utf8" is the
right solution.
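As a rough illustration of what "use utf8" changes for a string constant, here is a minimal sketch. It assumes the source file itself is saved as UTF-8; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.
# Without "use utf8", the literal is kept as the raw bytes of the file.
my $as_bytes = do { no utf8; 'été' };

# With "use utf8", the same literal is decoded into characters.
my $as_chars = do { use utf8; 'été' };

print "length without use utf8: ", length($as_bytes), "\n";
print "length with use utf8:    ", length($as_chars), "\n";
```

The two lengths differ because the first string holds UTF-8 bytes and the second holds characters.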
That being said, this is probably just a small test case, and you are
more likely to be reading the data from a file.
For file contents, you can use:
binmode($handle, ":utf8");
to read everything in as utf8.
For file names themselves, I think it depends where you get them.
Presumably from readdir() or from a glob. I think you can use
utf8::decode($string) on the result to make sure they are interpreted
as utf8 (if you already know that is how the bytes in the filename
should be interpreted).
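A small sketch of turning the bytes of a file name into a character string. It uses utf8::decode, and shows utf8::upgrade next to it for comparison, since upgrade only changes the internal storage and leaves the logical string as it was. The file name is hard-coded here as an assumption; nothing is read from disk.

```perl
use strict;
use warnings;

# Bytes of the file name "Eté.mw" as readdir() would return them on a
# UTF-8 file system (assumed; hard-coded for the example).
my $name = "Et\xC3\xA9.mw";

# utf8::upgrade changes the internal representation only.
my $upgraded = $name;
utf8::upgrade($upgraded);
print "after upgrade: ", length($upgraded), " characters\n";

# utf8::decode interprets the bytes as UTF-8, in place.
my $decoded = $name;
utf8::decode($decoded) or die "file name is not valid UTF-8";
print "after decode:  ", length($decoded), " characters\n";
```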
But I admit I am not an expert on such matters, and every time I do utf8
things in perl, I end up with a lot of trial and error.
-Peff
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 15:01 ` Jeff King
@ 2011-06-08 15:37 ` Matthieu Moy
2011-06-08 15:45 ` Jeff King
2011-06-08 15:46 ` Jérémie NIKAES
0 siblings, 2 replies; 20+ messages in thread
From: Matthieu Moy @ 2011-06-08 15:37 UTC (permalink / raw)
To: Jeff King; +Cc: Jérémie NIKAES, thomas, git
Jeff King <peff@peff.net> writes:
> On Wed, Jun 08, 2011 at 03:45:43PM +0200, Jérémie NIKAES wrote:
>
>> my $mw = MediaWiki::API->new();
>> $mw->edit( {
>> action => 'edit',
>> title => 'Main_page',
>> text => 'été',
>> } ) ;
>> [...]
[...]
> From the "utf8" man page:
>
> Do not use this pragma for anything else than telling Perl that your
> script is written in UTF-8.
>
> which is what you are doing here,
Actually, this is what the example does, but this is not where the
original problem comes from. The code of git-remote-mediawiki contains
only us-ascii characters.
The actual code is:
my $file_content = `git cat-file -p $sha1`;
chomp($file_content);
# ...
$mw->edit( {
action => 'edit',
summary => $_[1],
title => $title,
text => $file_content,
});
If the file is UTF-8 encoded, the page sent to the wiki is
double-utf8-encoded.
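To make the double encoding concrete, here is a minimal sketch using only the core Encode module. The byte string stands in for what the backticks return; it is hard-coded as an assumption so the example does not need git.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Raw UTF-8 bytes for "été", as a backtick call would hand them over
# (five one-byte characters as far as Perl is concerned).
my $bytes = "\xC3\xA9t\xC3\xA9";

# Encoding these bytes again treats each byte as a Latin-1 character;
# this is the double-encoded form that shows up as garbage on the wiki.
my $double = encode('UTF-8', $bytes);

# Decoding first yields a three-character string; encoding that once
# gives back the original bytes.
my $chars  = decode('UTF-8', $bytes);
my $single = encode('UTF-8', $chars);

printf "input: %d bytes, double-encoded: %d bytes, decoded: %d characters\n",
    length($bytes), length($double), length($chars);
```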
> That being said, this is probably just a small test case, and you are
> more likely to be reading the data from a file.
Oops, read this too late ;-).
> For file contents, you can use:
>
> binmode($handle, ":utf8");
>
> to read everything in as utf8.
That's not exactly it, since we read the output of "git cat-file", not
an actual file.
But something along the lines of:
open(my $git, "-|:encoding(UTF-8)", "git cat-file -p $sha1");
my $file_content = <$git>;
close($git);
may do it.
--
Matthieu Moy
http://www-verimag.imag.fr/~moy/
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 15:37 ` Matthieu Moy
@ 2011-06-08 15:45 ` Jeff King
2011-06-08 15:46 ` Jérémie NIKAES
1 sibling, 0 replies; 20+ messages in thread
From: Jeff King @ 2011-06-08 15:45 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Jérémie NIKAES, thomas, git
On Wed, Jun 08, 2011 at 05:37:56PM +0200, Matthieu Moy wrote:
> The actual code is:
>
> my $file_content = `git cat-file -p $sha1`;
> chomp($file_content);
> # ...
> $mw->edit( {
> action => 'edit',
> summary => $_[1],
> title => $title,
> text => $file_content,
> });
>
> If the file is UTF-8 encoded, the page sent to the wiki is
> double-utf8-encoded.
I think there might be a way to say "all descriptors are utf8"; I don't
know if that would apply to pipes from backtick commands or not. I'm
also not sure if it would interfere with something like Mediawiki::API
talking over the socket.
> That's not exactly it, since we read the output of "git cat-file", not
> an actual file.
>
> But something along the lines of:
>
> open(my $git, "-|:encoding(UTF-8)", "git cat-file -p $sha1");
> my $file_content = <$git>;
> close($git);
>
> may do it.
Yeah, I think that's the cleanest solution (assuming it works. :) ).
-Peff
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 15:37 ` Matthieu Moy
2011-06-08 15:45 ` Jeff King
@ 2011-06-08 15:46 ` Jérémie NIKAES
2011-06-08 15:58 ` Matthieu Moy
1 sibling, 1 reply; 20+ messages in thread
From: Jérémie NIKAES @ 2011-06-08 15:46 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Jeff King, thomas, git
2011/6/8 Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>:
> open(my $git, "-|:encoding(UTF-8)", "git cat-file -p $sha1");
> my $file_content = <$git>;
> close($git);
Yes, that did it ! Thank you boss :-)
And thanks to both of you Peff & Steffen for your suggestions.
I'm still encountering issues with the encoding of file names,
though; I am going to look deeper now that I have this solution.
--
Jérémie Nikaes
Élève ingénieur en deuxième année à l'Ensimag
Ingénierie des Systèmes d'Informations
Tel : +33 (0)6 12 99 78 75
Mail : jeremie.nikaes@gmail.com
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 15:46 ` Jérémie NIKAES
@ 2011-06-08 15:58 ` Matthieu Moy
2011-06-08 16:15 ` Jérémie NIKAES
0 siblings, 1 reply; 20+ messages in thread
From: Matthieu Moy @ 2011-06-08 15:58 UTC (permalink / raw)
To: Jérémie NIKAES; +Cc: Jeff King, thomas, git
Jérémie NIKAES <jeremie.nikaes@gmail.com> writes:
> 2011/6/8 Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>:
>
>> open(my $git, "-|:encoding(UTF-8)", "git cat-file -p $sha1");
there should probably have been a $/ = 1; or some other perl magic to
make sure we don't read only the first line there:
>> my $file_content = <$git>;
>> close($git);
>
> Yes, that did it ! Thank you boss :-)
Then, make it a helper function to call like
my $file_content = run_git("cat-file -p $sha1");
and use it where needed.
> I'm still encountering issues with the encoding of file names,
> though; I am going to look deeper now that I have this solution.
My advice, at least in the short-term (already discussed offline): use
urlencode ( http://php.net/manual/en/function.urlencode.php ) on pull,
and don't bother with encoding on push. Non-ascii characters in
filenames are a nightmare ...
If you go for utf8 filenames, you should test that your script works in
various environments, like
LANG=fr_FR.ISO-8859-1 xterm
(launch a terminal with latin-1 encoding inside)
and Mac OS X (which does some weird utf-8-normalization on filenames),
and probably windows (no idea how filename encoding works there).
If it doesn't work in one of them, you'll have to provide a fall-back to
plain ascii for these users, which will most likely be the short-term
solution I'm proposing.
--
Matthieu Moy
http://www-verimag.imag.fr/~moy/
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 15:58 ` Matthieu Moy
@ 2011-06-08 16:15 ` Jérémie NIKAES
2011-06-08 16:18 ` Jeff King
` (3 more replies)
0 siblings, 4 replies; 20+ messages in thread
From: Jérémie NIKAES @ 2011-06-08 16:15 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Jeff King, thomas, git
2011/6/8 Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>:
> there should probably have been a $/ = 1; or some other perl magic to
> make sure we don't read only the first line there:
>
Yes, it indeed currently reads only the first line. I'm going to see
what kind of magic I need to use.
> Then, make it a helper function to call like
>
> my $file_content = run_git("cat-file -p $sha1");
>
> and use it where needed.
Good idea, doing it right now
> My advice, at least in the short-term (already discussed offline): use
> urlencode ( http://php.net/manual/en/function.urlencode.php ) on pull,
> and don't bother with encoding on push. Non-ascii characters in
> filenames are a nightmare ...
>
Yes I tried uri_escape, but that only works in the direction mediawiki -> git.
A page named "Eté" on mediawiki comes as a Et%C3%A9.mw file on the repo.
However, when I try to send that file "Et%C3%A9" with the mediawiki
API, I get this error
"Can't use an undefined value as a HASH reference at
/usr/local/share/perl/5.10.1/MediaWiki/API.pm line 554."
So I tried to backslash the '%' but it does not do it...
Any idea ? Thanks
--
Jérémie Nikaes
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 16:15 ` Jérémie NIKAES
@ 2011-06-08 16:18 ` Jeff King
2011-06-08 16:26 ` Jérémie NIKAES
2011-06-08 16:27 ` Matthieu Moy
` (2 subsequent siblings)
3 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2011-06-08 16:18 UTC (permalink / raw)
To: Jérémie NIKAES; +Cc: Matthieu Moy, thomas, git
On Wed, Jun 08, 2011 at 06:15:15PM +0200, Jérémie NIKAES wrote:
> 2011/6/8 Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>:
>
> > there should probably have been a $/ = 1; or some other perl magic to
> > make sure we don't read only the first line there:
> >
>
> Yes, it indeed currently reads only the first line. I'm going to see
> what kind of magic I need to use.
You need to set $/ to undef. Use "local" to prevent it from polluting
other parts of the code, like:
my $var = do { local $/; <$handle> };
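Putting the pieces from this thread together, a helper along these lines might do. This is only a sketch: the name run_cmd is hypothetical, and the current perl interpreter stands in for git so that the example runs anywhere.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command (list form, no shell) and return
# all of its output, decoded from UTF-8 into Perl characters.
sub run_cmd {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "cannot run @cmd: $!";
    binmode($fh, ':encoding(UTF-8)');
    # An undefined $/ makes <$fh> return everything, not just one line.
    my $out = do { local $/; <$fh> };
    close($fh) or die "@cmd failed: $?";
    return $out;
}

# In the script discussed here the call would be something like:
#   my $file_content = run_cmd('git', 'cat-file', '-p', $sha1);
# Below, a child perl prints two lines, the second one in UTF-8 bytes.
my $out = run_cmd($^X, '-e', 'print "line 1\n\xC3\xA9t\xC3\xA9\n"');
print length($out), " characters read\n";
```

The list form of open avoids passing $sha1 through a shell, which is a side benefit rather than something the thread asked for.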
-Peff
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 16:15 ` Jérémie NIKAES
2011-06-08 16:18 ` Jeff King
@ 2011-06-08 16:27 ` Matthieu Moy
2011-06-08 16:30 ` Jérémie NIKAES
2011-06-08 17:07 ` Jakub Narebski
2011-06-08 17:11 ` Matthieu Moy
3 siblings, 1 reply; 20+ messages in thread
From: Matthieu Moy @ 2011-06-08 16:27 UTC (permalink / raw)
To: Jérémie NIKAES; +Cc: Jeff King, thomas, git
Jérémie NIKAES <jeremie.nikaes@gmail.com> writes:
> Yes I tried uri_escape, but that only works in the direction mediawiki -> git.
> A page named "Eté" on mediawiki comes as a Et%C3%A9.mw file on the repo.
> However, when I try to send that file "Et%C3%A9" with the mediawiki
> API, I get this error
>
> "Can't use an undefined value as a HASH reference at
> /usr/local/share/perl/5.10.1/MediaWiki/API.pm line 554."
>
> So I tried to backslash the '%' but it does not do it...
What if you uri_unescape before sending to MediaWiki?
--
Matthieu Moy
http://www-verimag.imag.fr/~moy/
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 16:27 ` Matthieu Moy
@ 2011-06-08 16:30 ` Jérémie NIKAES
0 siblings, 0 replies; 20+ messages in thread
From: Jérémie NIKAES @ 2011-06-08 16:30 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Jeff King, thomas, git
2011/6/8 Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>:
> Jérémie NIKAES <jeremie.nikaes@gmail.com> writes:
>
>> Yes I tried uri_escape, but that only works in the direction mediawiki -> git.
>> A page named "Eté" on mediawiki comes as a Et%C3%A9.mw file on the repo.
>> However, when I try to send that file "Et%C3%A9" with the mediawiki
>> API, I get this error
>>
>> "Can't use an undefined value as a HASH reference at
>> /usr/local/share/perl/5.10.1/MediaWiki/API.pm line 554."
>>
>> So I tried to backslash the '%' but it does not do it...
>
> What if you uri_unescape before sending to MediaWiki?
Same problem, same error. Unfortunately.
--
Jérémie Nikaes
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 16:15 ` Jérémie NIKAES
2011-06-08 16:18 ` Jeff King
2011-06-08 16:27 ` Matthieu Moy
@ 2011-06-08 17:07 ` Jakub Narebski
2011-06-08 17:11 ` Matthieu Moy
3 siblings, 0 replies; 20+ messages in thread
From: Jakub Narebski @ 2011-06-08 17:07 UTC (permalink / raw)
To: Jérémie NIKAES
Cc: Matthieu Moy, Jeff King, thomas, git, Jakub Narebski
Jérémie NIKAES <jeremie.nikaes@gmail.com> writes:
> 2011/6/8 Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>:
[...]
> > My advice, at least in the short-term (already discussed offline): use
> > urlencode ( http://php.net/manual/en/function.urlencode.php ) on pull,
> > and don't bother with encoding on push. Non-ascii characters in
> > filenames are a nightmare ...
> >
>
> Yes I tried uri_escape, but that only works in the direction mediawiki -> git.
> A page named "Eté" on mediawiki comes as a Et%C3%A9.mw file on the repo.
> However, when I try to send that file "Et%C3%A9" with the mediawiki
> API, I get this error
>
> "Can't use an undefined value as a HASH reference at
> /usr/local/share/perl/5.10.1/MediaWiki/API.pm line 554."
Can you show us this line and its neighbourhood?
It might be a bug in MediaWiki::API...
> So I tried to backslash the '%' but it does not do it...
Decode it from URI encoding to UTF-8 and mark as UTF-8 before sending
to mediawiki API.
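The decode step could look roughly like the sketch below. The two functions are hypothetical stand-ins written with core Perl only; they are not the uri_escape / uri_unescape functions mentioned in this thread, though they are meant to behave similarly for this case.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for URI escaping: characters -> UTF-8 bytes ->
# percent-encoded ASCII, usable as a file name.
sub escape_title {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# The reverse: undo the percent-encoding, which yields bytes, then
# decode those bytes so the result is a character string again.
sub unescape_title {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

my $file  = escape_title("Et\x{e9}");   # name stored in the repository
my $title = unescape_title($file);      # title to hand to the wiki
print "$file\n";
```

Skipping the final decode leaves a byte string, which would be one way to end up with a round trip that does not return the original string.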
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 16:15 ` Jérémie NIKAES
` (2 preceding siblings ...)
2011-06-08 17:07 ` Jakub Narebski
@ 2011-06-08 17:11 ` Matthieu Moy
2011-06-08 18:03 ` Jérémie NIKAES
3 siblings, 1 reply; 20+ messages in thread
From: Matthieu Moy @ 2011-06-08 17:11 UTC (permalink / raw)
To: Jérémie NIKAES; +Cc: Jeff King, thomas, git
Jérémie NIKAES <jeremie.nikaes@gmail.com> writes:
> Yes I tried uri_escape, but that only works in the direction mediawiki -> git.
> A page named "Eté" on mediawiki comes as a Et%C3%A9.mw file on the repo.
> However, when I try to send that file "Et%C3%A9" with the mediawiki
> API, I get this error
>
> "Can't use an undefined value as a HASH reference at
> /usr/local/share/perl/5.10.1/MediaWiki/API.pm line 554."
>
> So I tried to backslash the '%' but it does not do it...
> Any idea ? Thanks
OK, I know that's cheating, but reading the doc helped ;-)
http://search.cpan.org/~exobuzz/MediaWiki-API-0.35/lib/MediaWiki/API.pm#MediaWiki::API-%3Eedit%28_$query_hashref,_$options_hashref_%29
The options hashref currently has one optional parameter (skip_encoding
=> 1). This is described above in the MediaWiki::API->api call
documentation.
which leads us to:
MediaWiki's API uses UTF-8 and any 8 bit character string parameters are
encoded automatically by the API call. If your parameters are already in
UTF-8 this will be detected and the encoding will be skipped. If your
parameters for some reason contain UTF-8 data but no UTF-8 flag is set
(i.e. you did not use the "use utf8;" pragma) you should prevent
re-encoding by passing an option skip_encoding => 1 in the
$options_hash.
In other words, Perl and MediaWiki::API use some black magic to detect
UTF-8, and you want to disable it like this:
$mw->edit( {
action => 'edit',
title => $title,
text => $text,
}, {
skip_encoding => 1
} ) || die $mw->{error}->{code} . ': ' . $mw->{error}->{details};
Tried it, worked :-).
This may well be an alternative solution to the earlier UTF-8 problem.
--
Matthieu Moy
http://www-verimag.imag.fr/~moy/
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 17:11 ` Matthieu Moy
@ 2011-06-08 18:03 ` Jérémie NIKAES
2011-06-08 18:20 ` Matthieu Moy
2011-06-08 21:51 ` Jeff King
0 siblings, 2 replies; 20+ messages in thread
From: Jérémie NIKAES @ 2011-06-08 18:03 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Jeff King, thomas, git
2011/6/8 Matthieu Moy <Matthieu.Moy@grenoble-inp.fr>:
> $mw->edit( {
> action => 'edit',
> title => $title,
> text => $text,
> }, {
> skip_encoding => 1
> } ) || die $mw->{error}->{code} . ': ' . $mw->{error}->{details};
>
> Tried it, worked :-).
>
Yep this works if you manually set your $title variable earlier in the
code. However, I still have the problem which I think is on the git
side
- I pull the "Eté.mw" file from mediawiki
- I edit it
- When I commit it I get this message from git :
[master sha1] commit message
1 files changed [...]
create mode 100644 "Bl\303\251.mw"
As a result, when I parse commit information, the title of the file is
indeed Bl\303\251... so a new page is created on the mediawiki.
--
Jérémie Nikaes
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 18:03 ` Jérémie NIKAES
@ 2011-06-08 18:20 ` Matthieu Moy
2011-06-08 21:51 ` Jeff King
1 sibling, 0 replies; 20+ messages in thread
From: Matthieu Moy @ 2011-06-08 18:20 UTC (permalink / raw)
To: Jérémie NIKAES; +Cc: Jeff King, thomas, git
Jérémie NIKAES <jeremie.nikaes@gmail.com> writes:
> Yep this works if you manually set your $title variable earlier in the
> code.
Or if you uri_escape the file on pull.
> However, I still have the problem which I think is on the git side
>
> - I pull the "Eté.mw" file from mediawiki
> - I edit it
> - When I commit it I get this message from git :
> [master sha1] commit message
> 1 files changed [...]
> create mode 100644 "Bl\303\251.mw"
I guess you mean "Et\303\251.mw" ?
> As a result, when I parse commit information, the title of the file is
> indeed Bl\303\251...
The -z option of many Git commands is your friend.
Especially the one of "git diff --raw".
--
Matthieu Moy
http://www-verimag.imag.fr/~moy/
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 18:03 ` Jérémie NIKAES
2011-06-08 18:20 ` Matthieu Moy
@ 2011-06-08 21:51 ` Jeff King
2011-06-08 22:36 ` Jérémie NIKAES
1 sibling, 1 reply; 20+ messages in thread
From: Jeff King @ 2011-06-08 21:51 UTC (permalink / raw)
To: Jérémie NIKAES; +Cc: Matthieu Moy, thomas, git
On Wed, Jun 08, 2011 at 08:03:26PM +0200, Jérémie NIKAES wrote:
> - I pull the "Eté.mw" file from mediawiki
> - I edit it
> - When I commit it I get this message from git :
> [master sha1] commit message
> 1 files changed [...]
> create mode 100644 "Bl\303\251.mw"
>
> As a result, when I parse commit information, the title of the file is
> indeed Bl\303\251... so a new page is created on the mediawiki.
Ick. I hope you aren't parsing the output of "git commit"; it's not
guaranteed to be stable.
But if you are parsing "diff", then yes, filenames with high-bit
characters (or special characters like tab or double-quote) may be
quoted C-style, and you should be unquoting them. Or, as Matthieu
suggested, use "-z" to get a NUL-terminated, non-quoted version.
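Both options can be sketched as follows. The helper unquote_path is hypothetical and only covers octal bytes and the common backslash escapes; the NUL-separated string imitates "-z" output and is not produced by git here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of git): undo C-style quoting of a path.
sub unquote_path {
    my ($path) = @_;
    # Paths that needed no quoting come without surrounding quotes.
    return $path unless $path =~ /^"(.*)"$/s;
    my $body = $1;
    my %esc = (n => "\n", t => "\t", '"' => '"', '\\' => '\\');
    $body =~ s{\\([0-7]{1,3}|.)}{
        my $e = $1;
        $e =~ /^[0-7]/ ? chr(oct($e))
                       : (exists $esc{$e} ? $esc{$e} : $e)
    }gesx;
    return $body;
}

my $quoted   = '"Et\303\251.mw"';
my $unquoted = unquote_path($quoted);   # raw UTF-8 bytes of the name

# With -z the paths arrive unquoted and NUL-terminated, so split suffices.
my $z_output = "Et\xC3\xA9.mw\0Main_page.mw\0";
my @paths    = split /\0/, $z_output;

print scalar(@paths), " paths in the NUL-separated list\n";
```

Either way the result is a byte string, so it still needs decoding before being used as a page title.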
-Peff
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 21:51 ` Jeff King
@ 2011-06-08 22:36 ` Jérémie NIKAES
0 siblings, 0 replies; 20+ messages in thread
From: Jérémie NIKAES @ 2011-06-08 22:36 UTC (permalink / raw)
To: Jeff King; +Cc: Matthieu Moy, thomas, git
2011/6/8 Jeff King <peff@peff.net>:
>
> But if you are parsing "diff", then yes, filenames with high-bit
> characters (or special characters like tab or double-quote) may be
> quoted C-style, and you should be unquoting them. Or, as Matthieu
> suggested, use "-z" to get a NUL-terminated, non-quoted version.
Yes, we are parsing "diff". The -z helped a lot with non-ASCII characters.
Everything seems to be working fine now without using uri_escape. The
problem is, as Matthieu said, different file systems may handle these
characters in file names differently, so in the long run it could be
better to use uri_escape / uri_unescape.
The problem I run into now is that we are using
use encoding 'utf-8'
as Jakub suggested.
Using this mode, when you uri_escape and uri_unescape a string, you
don't get the original string. I must be missing something but my head
is kind of fuzzy with all the different existing methods to encode
things in utf8 and it is getting pretty late.
Thanks a lot to everyone who helped today, a RFC patch should follow tomorrow.
--
Jérémie Nikaes
* Re: Git-mediawiki : Encoding problems in perl
2011-06-08 13:45 Git-mediawiki : Encoding problems in perl Jérémie NIKAES
2011-06-08 14:37 ` Steffen Daode Nurpmeso
2011-06-08 15:01 ` Jeff King
@ 2011-06-08 17:04 ` Jakub Narebski
2011-06-08 17:59 ` Jérémie NIKAES
2 siblings, 1 reply; 20+ messages in thread
From: Jakub Narebski @ 2011-06-08 17:04 UTC (permalink / raw)
To: Jérémie NIKAES; +Cc: thomas, git, Jakub Narebski
Jérémie NIKAES <jeremie.nikaes@gmail.com> writes:
> While working on the git-mediawiki project[1], we ran into some
> problems regarding utf8 encoding of files. Most of them have been
> solved, however, one is still pretty annoying.
> Let me illustrate it :
>
> I want to edit a page on mediawiki using the API, with a very simple example :
>
> my $mw = MediaWiki::API->new();
> $mw->edit( {
> action => 'edit',
> title => 'Main_page',
> text => 'été',
> } ) ;
>
> But, when I look at the page on mediawiki, I see weird characters : été.
Take a look at
http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
especially the accepted answer.
In short (I don't agree with everything there, and not everything is
needed except for extreme Unicode usage): if your script is written
using UTF-8 like in the above examples, use
use utf8;
If this is a simplification, and the text comes from another file or is
the output of some command, use
use utf8::all;
or take a look at what it does and put the relevant parts in your script.
> I tried text => encode_utf8('été') with no success.
>
> This makes pushing changes from git to mediawiki buggy, since pulling a
> file with accented characters and pushing it right after changes
> things on the wiki.
>
> While googling (a lot), I found that utf8 was pretty tricky in perl...
> The only thing that seems to solve things is a simple addition of 'use
> encoding utf8' at the top of our script.
> However
> A) Adding this line requires that I remove 'use strict;'
use encoding ':utf8';
or
use encoding 'utf8';
> B) I found some information about this pragma encoding and it seems to
> be unadvised to use it
--
Jakub Narebski
Poland
ShadeHawk on #git
Thread overview: 20+ messages
2011-06-08 13:45 Git-mediawiki : Encoding problems in perl Jérémie NIKAES
2011-06-08 14:37 ` Steffen Daode Nurpmeso
2011-06-08 15:01 ` Jeff King
2011-06-08 15:37 ` Matthieu Moy
2011-06-08 15:45 ` Jeff King
2011-06-08 15:46 ` Jérémie NIKAES
2011-06-08 15:58 ` Matthieu Moy
2011-06-08 16:15 ` Jérémie NIKAES
2011-06-08 16:18 ` Jeff King
2011-06-08 16:26 ` Jérémie NIKAES
2011-06-08 16:27 ` Matthieu Moy
2011-06-08 16:30 ` Jérémie NIKAES
2011-06-08 17:07 ` Jakub Narebski
2011-06-08 17:11 ` Matthieu Moy
2011-06-08 18:03 ` Jérémie NIKAES
2011-06-08 18:20 ` Matthieu Moy
2011-06-08 21:51 ` Jeff King
2011-06-08 22:36 ` Jérémie NIKAES
2011-06-08 17:04 ` Jakub Narebski
2011-06-08 17:59 ` Jérémie NIKAES