[PATCH] gitweb: Fix chop_str not to cut in middle of utf8 multibyte chars.

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] gitweb: Fix chop_str not to cut in middle of utf8 multibyte chars.
@ 2008-05-20 20:55 Anders Waldenborg
  2008-05-20 22:19 ` Jakub Narebski
  0 siblings, 1 reply; 6+ messages in thread
From: Anders Waldenborg @ 2008-05-20 20:55 UTC (permalink / raw)
  To: git


Without this fix at least author name in short log may cut in middle of a
multibyte char. When the result comes to esc_html to_utf8 is called again,
which doesn't find valid utf8 and decodes using $fallback_encoding making
it even worse.

Signed-off-by: Anders Waldenborg <anders@0x63.nu>
---
  gitweb/gitweb.perl |    4 ++++
  1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 2facf2d..8308e22 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -866,6 +866,10 @@ sub chop_str {
         my $add_len = shift || 10;
         my $where = shift || 'right'; # 'left' | 'center' | 'right'

+       # Make sure perl knows it is utf8 encoded so we don't
+       # cut in the middle of a utf8 multibyte char.
+       $str = to_utf8($str);
+
         # allow only $len chars, but don't cut a word if it would fit in $add_len
         # if it doesn't fit, cut it if it's still longer than the dots we would add
         # remove chopped character entities entirely

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] gitweb: Fix chop_str not to cut in middle of utf8 multibyte chars.
  2008-05-20 20:55 [PATCH] gitweb: Fix chop_str not to cut in middle of utf8 multibyte chars Anders Waldenborg
@ 2008-05-20 22:19 ` Jakub Narebski
  2008-05-21  7:27   ` Junio C Hamano
  2008-05-21 11:44   ` [PATCH] gitweb: Convert string to internal form before chopping in chop_str Anders Waldenborg
  0 siblings, 2 replies; 6+ messages in thread
From: Jakub Narebski @ 2008-05-20 22:19 UTC (permalink / raw)
  To: Anders Waldenborg; +Cc: git

Anders Waldenborg <anders@0x63.nu> writes:

> Without this fix at least author name in short log may cut in middle of a
> multibyte char. When the result comes to esc_html to_utf8 is called again,
> which doesn't find valid utf8 and decodes using $fallback_encoding making
> it even worse.

Thanks a lot.  This is certainly a good thing, although I think that
the proper solution (but which would need much more work) would be to
ensure that all information is stored in Perl internal form, and not
only ensured on output.

I would change title of this commit to be more descriptive what this
commit does; or rather make current subject the first sentence of
commit summary, have something like the following for commit message:

gitweb: Convert string to Perl internal form before chopping in chop_str

Fix chop_str not to cut in middle of utf8 multibyte chars.  Without
this fix at least author name in short log may cut in middle of a
multibyte char. When the result comes to esc_html to_utf8 is called
again, which doesn't find valid utf8 and decodes using
$fallback_encoding making it even worse.

> Signed-off-by: Anders Waldenborg <anders@0x63.nu>
> ---
>   gitweb/gitweb.perl |    4 ++++
>   1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 2facf2d..8308e22 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -866,6 +866,10 @@ sub chop_str {
>          my $add_len = shift || 10;
>          my $where = shift || 'right'; # 'left' | 'center' | 'right'
> 
> +       # Make sure perl knows it is utf8 encoded so we don't
> +       # cut in the middle of a utf8 multibyte char.
> +       $str = to_utf8($str);
> +

I like the comment here. It explains the whys of code.

>          # allow only $len chars, but don't cut a word if it would fit in $add_len
>          # if it doesn't fit, cut it if it's still longer than the dots we would add
>          # remove chopped character entities entirely
> 

This patch is whitespace damaged, by the way.

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] gitweb: Fix chop_str not to cut in middle of utf8 multibyte chars.
  2008-05-20 22:19 ` Jakub Narebski
@ 2008-05-21  7:27   ` Junio C Hamano
  2008-05-21  7:45     ` Anders Waldenborg
  2008-05-21 11:44   ` [PATCH] gitweb: Convert string to internal form before chopping in chop_str Anders Waldenborg
  1 sibling, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2008-05-21  7:27 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Anders Waldenborg, git

Jakub Narebski <jnareb@gmail.com> writes:

> Anders Waldenborg <anders@0x63.nu> writes:
> ...
> Fix chop_str not to cut in middle of utf8 multibyte chars.  Without
> this fix at least author name in short log may cut in middle of a
> multibyte char. When the result comes to esc_html to_utf8 is called
> again, which doesn't find valid utf8 and decodes using
> $fallback_encoding making it even worse.
>  
>> Signed-off-by: Anders Waldenborg <anders@0x63.nu>
>> ---
>>   gitweb/gitweb.perl |    4 ++++
>>   1 files changed, 4 insertions(+), 0 deletions(-)
>> 
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 2facf2d..8308e22 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -866,6 +866,10 @@ sub chop_str {
>>          my $add_len = shift || 10;
>>          my $where = shift || 'right'; # 'left' | 'center' | 'right'
>> 
>> +       # Make sure perl knows it is utf8 encoded so we don't
>> +       # cut in the middle of a utf8 multibyte char.
>> +       $str = to_utf8($str);
>> +
>
> I like the comment here. It explains the whys of code.
>
>>          # allow only $len chars, but don't cut a word if it would fit in $add_len
>>          # if it doesn't fit, cut it if it's still longer than the dots we would add
>>          # remove chopped character entities entirely
>> 
>
> This patch is whitespace damaged, by the way.

I haven't followed the codepath but what do the callers do to the string
returned from chop_str?  Don't they assume the string hasn't been decoded
(because the old implementation of chop_str did not do this to_utf8), and
emit the result directly to the output because it also assumes the
undecoded format is what the outside world wants?  In other words, don't
they now need to do different things because returned string has gone
through the to_utf8() processing already?

Maybe I am worrying too much, after getting burned by decode_utf/encode_utf
data chains in another popular scripting language, and it is possible that
with Perl you may not have to be so careful...

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] gitweb: Fix chop_str not to cut in middle of utf8 multibyte chars.
  2008-05-21  7:27   ` Junio C Hamano
@ 2008-05-21  7:45     ` Anders Waldenborg
  2008-05-24 13:34       ` Jakub Narebski
  0 siblings, 1 reply; 6+ messages in thread
From: Anders Waldenborg @ 2008-05-21  7:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jakub Narebski, git

> I haven't followed the codepath but what do the callers do to the string
> returned from chop_str?  Don't they assume the string hasn't been decoded
> (because the old implementation of chop_str did not do this to_utf8), and
> emit the result directly to the output because it also assumes the
> undecoded format is what the outside world wants?  In other words, don't
> they now need to do different things because returned string has gone
> through the to_utf8() processing already?

The to_utf8() (defined in gitweb.perl, not part of perl it self) is kind 
of sneaky, it checks if the string already is valid utf8. (guess it 
should be called ensure_utf8())

chop_str needs to work on decoded string, otherwise character count goes 
all wrong. But maybe it is better to add the to_utf8() to the callsites?

  anders

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH] gitweb: Convert string to internal form before chopping in chop_str
  2008-05-20 22:19 ` Jakub Narebski
  2008-05-21  7:27   ` Junio C Hamano
@ 2008-05-21 11:44   ` Anders Waldenborg
  1 sibling, 0 replies; 6+ messages in thread
From: Anders Waldenborg @ 2008-05-21 11:44 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Fix chop_str not to cut in middle of utf8 multibyte chars. Without
this fix at least author name in short log may cut in middle of a
multibyte char. When the result comes to esc_html to_utf8 is called
again, which doesn't find valid utf8 and decodes using
$fallback_encoding making it even worse.

This also have the nice side effect that it actually tries to show the
first 10 _characters_, not the number of characters that happened to fit
into 10 bytes.

Signed-off-by: Anders Waldenborg <anders@0x63.nu>
---

Description updated as per your suggesion, and hopefully without the
embarrassing whitespace corruption.

gitweb/gitweb.perl |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 2facf2d..8308e22 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -866,6 +866,10 @@ sub chop_str {
 	my $add_len = shift || 10;
 	my $where = shift || 'right'; # 'left' | 'center' | 'right'

+	# Make sure perl knows it is utf8 encoded so we don't
+	# cut in the middle of a utf8 multibyte char.
+	$str = to_utf8($str);
+
 	# allow only $len chars, but don't cut a word if it would fit in $add_len
 	# if it doesn't fit, cut it if it's still longer than the dots we would add
 	# remove chopped character entities entirely
-- 
1.5.5.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] gitweb: Fix chop_str not to cut in middle of utf8 multibyte chars.
  2008-05-21  7:45     ` Anders Waldenborg
@ 2008-05-24 13:34       ` Jakub Narebski
  0 siblings, 0 replies; 6+ messages in thread
From: Jakub Narebski @ 2008-05-24 13:34 UTC (permalink / raw)
  To: Anders Waldenborg; +Cc: Junio C Hamano, git

On Wed, 21 May 2008, Anders Waldenborg wrote:
> Junio C Hamano wrote:

>> I haven't followed the codepath but what do the callers do to the string
>> returned from chop_str?  Don't they assume the string hasn't been decoded
>> (because the old implementation of chop_str did not do this to_utf8), and
>> emit the result directly to the output because it also assumes the
>> undecoded format is what the outside world wants?  In other words, don't
>> they now need to do different things because returned string has gone
>> through the to_utf8() processing already?
> 
> The to_utf8() (defined in gitweb.perl, not part of perl it self) is kind 
> of sneaky, it checks if the string already is valid utf8. (guess it 
> should be called ensure_utf8())

Perhaps it should...

> chop_str needs to work on decoded string, otherwise character count goes 
> all wrong. But maybe it is better to add the to_utf8() to the callsites?

Or do "binmode $fd, :utf8".

But yes, I guess converting to Perl internal form on input would be
good idea.  Gitweb currently does it partially...

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2008-05-24 13:35 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-20 20:55 [PATCH] gitweb: Fix chop_str not to cut in middle of utf8 multibyte chars Anders Waldenborg
2008-05-20 22:19 ` Jakub Narebski
2008-05-21  7:27   ` Junio C Hamano
2008-05-21  7:45     ` Anders Waldenborg
2008-05-24 13:34       ` Jakub Narebski
2008-05-21 11:44   ` [PATCH] gitweb: Convert string to internal form before chopping in chop_str Anders Waldenborg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).