[PATCH] diff-highlight: Fix broken multibyte string

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] diff-highlight: Fix broken multibyte string
@ 2015-03-30 15:55 Yi EungJun
  2015-03-30 22:16 ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Yi EungJun @ 2015-03-30 15:55 UTC (permalink / raw)
  To: git; +Cc: Yi EungJun, Jeff King

From: Yi EungJun <eungjun.yi@navercorp.com>

Highlighted string might be broken if the common subsequence is a proper subset
of a multibyte character. For example, if the old string is "진" and the new
string is "지", then we expect the diff is rendered as follows:

	-진
	+지

but actually it was rendered as follows:

    -<EC><A7><84>
    +<EC><A7><80>

This fixes the bug by splitting the string by multibyte characters.
---
 contrib/diff-highlight/diff-highlight | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 08c88bb..2662c1a 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -2,6 +2,9 @@
 
 use warnings FATAL => 'all';
 use strict;
+use File::Basename;
+use File::Spec::Functions qw( catdir );
+use String::Multibyte;
 
 # Highlight by reversing foreground and background. You could do
 # other things like bold or underline if you prefer.
@@ -24,6 +27,8 @@ my @removed;
 my @added;
 my $in_hunk;
 
+my $mbcs = get_mbcs();
+
 # Some scripts may not realize that SIGPIPE is being ignored when launching the
 # pager--for instance scripts written in Python.
 $SIG{PIPE} = 'DEFAULT';
@@ -164,8 +169,8 @@ sub highlight_pair {
 
 sub split_line {
 	local $_ = shift;
-	return map { /$COLOR/ ? $_ : (split //) }
-	       split /($COLOR*)/;
+	return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs->strsplit('', $_) : split //) }
+	       split /($COLOR)/;
 }
 
 sub highlight_line {
@@ -211,3 +216,19 @@ sub is_pair_interesting {
 	       $suffix_a !~ /^$BORING*$/ ||
 	       $suffix_b !~ /^$BORING*$/;
 }
+
+# Returns an instance of String::Multibyte based on the charset defined by
+# i18n.commitencoding or UTF-8, or undef if String::Multibyte doesn't support
+# the charset.
+sub get_mbcs {
+	my $dir = catdir(dirname($INC{'String/Multibyte.pm'}), 'Multibyte');
+	opendir my $dh, $dir or return;
+	my @mbcs_charsets = grep s/[.]pm\z//, readdir $dh;
+	close $dh;
+	my $expected_charset = `git config i18n.commitencoding` || "UTF-8";
+	$expected_charset =~ s/-//g;
+	my @matches = grep {/^$expected_charset$/i} @mbcs_charsets;
+	my $charset = shift @matches;
+
+	return eval 'String::Multibyte->new($charset)';
+}
-- 
2.3.2.209.gd67f9d5.dirty

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] diff-highlight: Fix broken multibyte string
  2015-03-30 15:55 [PATCH] diff-highlight: Fix broken multibyte string Yi EungJun
@ 2015-03-30 22:16 ` Jeff King
  2015-04-03  0:49   ` Kyle J. McKay
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2015-03-30 22:16 UTC (permalink / raw)
  To: Yi EungJun; +Cc: git, Yi EungJun

On Tue, Mar 31, 2015 at 12:55:33AM +0900, Yi EungJun wrote:

> From: Yi EungJun <eungjun.yi@navercorp.com>
> 
> Highlighted string might be broken if the common subsequence is a proper subset
> of a multibyte character. For example, if the old string is "진" and the new
> string is "지", then we expect the diff is rendered as follows:
> 
> 	-진
> 	+지
> 
> but actually it was rendered as follows:
> 
>     -<EC><A7><84>
>     +<EC><A7><80>
> 
> This fixes the bug by splitting the string by multibyte characters.

Yeah, I agree the current output is not ideal, and this should address
the problem. I was worried that multi-byte splitting would make things
slower, but in my tests, it actually speeds things up!

That surprised me. The big difference is calling $mbcs->strsplit instead
of regular split. Could it be that it's much faster than regular split?
Or is it that the resulting strings are faster for the rest of the
processing (e.g., perl hits a "slow path" on the sheared characters)? It
doesn't really matter, I guess, but certainly I was curious.

> +use File::Basename;
> +use File::Spec::Functions qw( catdir );
> +use String::Multibyte;

Unfortunately, String::Multibyte is not a standard module, and is not
even packed for Debian systems (I got mine from CPAN). Can we make this
a conditional include (e.g., 'eval "require String::Multibyte"' in
get_mbcs, and return undef if that fails?). Then people without it can
still use the script.

> +# Returns an instance of String::Multibyte based on the charset defined by
> +# i18n.commitencoding or UTF-8, or undef if String::Multibyte doesn't support
> +# the charset.

Hrm. The characters we are processing are not in the commit message, but
in the files themselves. In fact, there may be many different charsets
(i.e., a different one for each file), and we really don't have a good
way of knowing which is in play. I'd say that using the commit
encoding is our best guess, though. What happens with $mbcs->split when
the input is not a valid character in the charset (i.e., when we guess
wrong)?

If we are going to use the commit encoding, wouldn't
i18n.logOutputEncoding be a better choice?

> +sub get_mbcs {
> +	my $dir = catdir(dirname($INC{'String/Multibyte.pm'}), 'Multibyte');
> +	opendir my $dh, $dir or return;
> +	my @mbcs_charsets = grep s/[.]pm\z//, readdir $dh;

Yuck. This is a lot more intimate with String::Multibyte's
implementation than I'd like to be. Could we perhaps just run the
constructor on any candidates charsets, and then return the first hit
that gives us something besides undef?

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] diff-highlight: Fix broken multibyte string
  2015-03-30 22:16 ` Jeff King
@ 2015-04-03  0:49   ` Kyle J. McKay
  2015-04-03  1:24     ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Kyle J. McKay @ 2015-04-03  0:49 UTC (permalink / raw)
  To: Jeff King, Yi EungJun; +Cc: git

On Mar 30, 2015, at 15:16, Jeff King wrote:

> Yeah, I agree the current output is not ideal, and this should address
> the problem. I was worried that multi-byte splitting would make things
> slower, but in my tests, it actually speeds things up!

[...]

> Unfortunately, String::Multibyte is not a standard module, and is not
> even packed for Debian systems (I got mine from CPAN). Can we make
> this
> a conditional include (e.g., 'eval "require String::Multibyte"' in
> get_mbcs, and return undef if that fails?). Then people without it can
> still use the script.

[...]

> Yuck. This is a lot more intimate with String::Multibyte's
> implementation than I'd like to be.

So I was curious about this and played with it and was able to
reproduce the problem as described.

Here's an alternate fix that should work for everyone with Perl 5.8
or later.

-Kyle

-- 8< --
Subject: [PATCH v2] diff-highlight: do not split multibyte characters

When the input is UTF-8 and Perl is operating on bytes instead
of characters, a diff that changes one multibyte character to
another that shares an initial byte sequence will result in a
broken diff display as the common byte sequence prefix will be
separated from the rest of the bytes in the multibyte character.

For example, if a single line contains only the unicode
character U+C9C4 (encoded as UTF-8 0xEC, 0xA7, 0x84) and that
line is then changed to the unicode character U+C9C0 (encoded as
UTF-8 0xEC, 0xA7, 0x80), when operating on bytes diff-highlight
will show only the single byte change from 0x84 to 0x80 thus
creating invalid UTF-8 and a broken diff display.

Fix this by putting Perl into character mode when splitting the
line and then back into byte mode after the split is finished.

While the utf8::xxx functions are built-in and do not require
any 'use' statement, the utf8::is_utf8 function did not appear
until Perl 5.8.1, but is identical to the Encode::is_utf8
function which is available in 5.8 so we use that instead of
utf8::is_utf8.

Reported-by: Yi EungJun <semtlenori@gmail.com>
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
---
 contrib/diff-highlight/diff-highlight | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 08c88bbc..8e9b5ada 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -2,6 +2,7 @@

 use warnings FATAL => 'all';
 use strict;
+use Encode ();

 # Highlight by reversing foreground and background. You could do
 # other things like bold or underline if you prefer.
@@ -164,8 +165,10 @@ sub highlight_pair {

 sub split_line {
 	local $_ = shift;
-	return map { /$COLOR/ ? $_ : (split //) }
-	       split /($COLOR*)/;
+	utf8::decode($_);
+	return map { utf8::encode($_) if Encode::is_utf8($_); $_ }
+		map { /$COLOR/ ? $_ : (split //) }
+		split /($COLOR*)/;
 }

 sub highlight_line {
---

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] diff-highlight: Fix broken multibyte string
  2015-04-03  0:49   ` Kyle J. McKay
@ 2015-04-03  1:24     ` Jeff King
  2015-04-03  1:59       ` Kyle J. McKay
  2015-04-03  2:19       ` Yi, EungJun
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff King @ 2015-04-03  1:24 UTC (permalink / raw)
  To: Kyle J. McKay; +Cc: Yi EungJun, git

On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote:

> Subject: [PATCH v2] diff-highlight: do not split multibyte characters
> 
> When the input is UTF-8 and Perl is operating on bytes instead
> of characters, a diff that changes one multibyte character to
> another that shares an initial byte sequence will result in a
> broken diff display as the common byte sequence prefix will be
> separated from the rest of the bytes in the multibyte character.

Thanks, I had a feeling we should be able to do something with perl's
builtin utf8 support.  This doesn't help people with other encodings,
but I'm not sure the original was all that helpful either (in that we
don't actually _know_ the file encodings in the first place).

I briefly confirmed that this seems to do the right thing on po/bg.po,
which has a couple of sheared characters when viewed with the existing
code.

I timed this one versus the existing diff-highlight. It's about 7%
slower. That's not great, but is acceptable to me. The String::Multibyte
version was a lot faster, which was nice (but I'm still unclear on
_why_).

> Fix this by putting Perl into character mode when splitting the
> line and then back into byte mode after the split is finished.

I also wondered if we could simply put stdin into utf8 mode. But it
looks like it will barf whenever it gets invalid utf8. Checking for
valid utf8 and only doing the multi-byte split in that case (as you do
here) is a lot more robust.

> While the utf8::xxx functions are built-in and do not require
> any 'use' statement, the utf8::is_utf8 function did not appear
> until Perl 5.8.1, but is identical to the Encode::is_utf8
> function which is available in 5.8 so we use that instead of
> utf8::is_utf8.

Makes sense. I'm happy enough listing perl 5.8 as a dependency.

EungJun, does this version meet your needs?

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] diff-highlight: Fix broken multibyte string
  2015-04-03  1:24     ` Jeff King
@ 2015-04-03  1:59       ` Kyle J. McKay
  2015-04-03 21:47         ` Jeff King
  2015-04-03  2:19       ` Yi, EungJun
  1 sibling, 1 reply; 13+ messages in thread
From: Kyle J. McKay @ 2015-04-03  1:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Yi EungJun, git

On Apr 2, 2015, at 18:24, Jeff King wrote:

> On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote:
>
>> Subject: [PATCH v2] diff-highlight: do not split multibyte characters
>>
>> When the input is UTF-8 and Perl is operating on bytes instead
>> of characters, a diff that changes one multibyte character to
>> another that shares an initial byte sequence will result in a
>> broken diff display as the common byte sequence prefix will be
>> separated from the rest of the bytes in the multibyte character.
>
> Thanks, I had a feeling we should be able to do something with perl's
> builtin utf8 support.  This doesn't help people with other encodings,

It should work as well as the original did for any 1-byte encoding.   
That is, if it's not valid UTF-8 it should pass through unchanged and  
any single byte encoding should just work.  But, as you point out,  
multibyte encodings other than UTF-8 won't work, but they should  
behave the same as they did before.

> but I'm not sure the original was all that helpful either (in that we
> don't actually _know_ the file encodings in the first place).

I think it should work fine on any single byte encoding (i.e. ISO-8859- 
x, WINDOWS-1252, etc.).

> I timed this one versus the existing diff-highlight. It's about 7%
> slower.

I'd expect that, we're doing extra work we weren't doing before.

> That's not great, but is acceptable to me. The String::Multibyte
> version was a lot faster, which was nice (but I'm still unclear on
> _why_).

Must be the mbcs->strsplit routine has special case code for splitting  
on '' to just split on character boundaries.

>> Fix this by putting Perl into character mode when splitting the
>> line and then back into byte mode after the split is finished.
>
> I also wondered if we could simply put stdin into utf8 mode. But it
> looks like it will barf whenever it gets invalid utf8. Checking for
> valid utf8 and only doing the multi-byte split in that case (as you do
> here) is a lot more robust.
>
>> While the utf8::xxx functions are built-in and do not require
>> any 'use' statement, the utf8::is_utf8 function did not appear
>> until Perl 5.8.1, but is identical to the Encode::is_utf8
>> function which is available in 5.8 so we use that instead of
>> utf8::is_utf8.
>
> Makes sense. I'm happy enough listing perl 5.8 as a dependency.

Maybe that should be added.  The rest of Git's perl code seems to have  
a 'use 5.008;' already, so I figured that was a reasonable  
dependency.  :)

-Kyle

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] diff-highlight: Fix broken multibyte string
  2015-04-03  1:24     ` Jeff King
  2015-04-03  1:59       ` Kyle J. McKay
@ 2015-04-03  2:19       ` Yi, EungJun
  2015-04-03 22:08         ` Jeff King
  2015-04-03 22:15         ` [PATCH v3] diff-highlight: do not split multibyte characters Kyle J. McKay
  1 sibling, 2 replies; 13+ messages in thread
From: Yi, EungJun @ 2015-04-03  2:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Kyle J. McKay, Git List

> I timed this one versus the existing diff-highlight. It's about 7%
> slower. That's not great, but is acceptable to me. The String::Multibyte
> version was a lot faster, which was nice (but I'm still unclear on
> _why_).

I think the reason is here:

> sub split_line {
>    local $_ = shift;
>    return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs->strsplit('', $_) : split //) }
>           split /($COLOR)/;
> }

I removed "*" from "split /($COLOR*)/". Actually I don't know why "*"
was required but I need to remove it to make my patch works correctly.


On Fri, Apr 3, 2015 at 10:24 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote:
>
>> Subject: [PATCH v2] diff-highlight: do not split multibyte characters
>>
>> When the input is UTF-8 and Perl is operating on bytes instead
>> of characters, a diff that changes one multibyte character to
>> another that shares an initial byte sequence will result in a
>> broken diff display as the common byte sequence prefix will be
>> separated from the rest of the bytes in the multibyte character.
>
> Thanks, I had a feeling we should be able to do something with perl's
> builtin utf8 support.  This doesn't help people with other encodings,
> but I'm not sure the original was all that helpful either (in that we
> don't actually _know_ the file encodings in the first place).
>
> I briefly confirmed that this seems to do the right thing on po/bg.po,
> which has a couple of sheared characters when viewed with the existing
> code.
>
> I timed this one versus the existing diff-highlight. It's about 7%
> slower. That's not great, but is acceptable to me. The String::Multibyte
> version was a lot faster, which was nice (but I'm still unclear on
> _why_).
>
>> Fix this by putting Perl into character mode when splitting the
>> line and then back into byte mode after the split is finished.
>
> I also wondered if we could simply put stdin into utf8 mode. But it
> looks like it will barf whenever it gets invalid utf8. Checking for
> valid utf8 and only doing the multi-byte split in that case (as you do
> here) is a lot more robust.
>
>> While the utf8::xxx functions are built-in and do not require
>> any 'use' statement, the utf8::is_utf8 function did not appear
>> until Perl 5.8.1, but is identical to the Encode::is_utf8
>> function which is available in 5.8 so we use that instead of
>> utf8::is_utf8.
>
> Makes sense. I'm happy enough listing perl 5.8 as a dependency.
>
> EungJun, does this version meet your needs?
>
> -Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] diff-highlight: Fix broken multibyte string
  2015-04-03  1:59       ` Kyle J. McKay
@ 2015-04-03 21:47         ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2015-04-03 21:47 UTC (permalink / raw)
  To: Kyle J. McKay; +Cc: Yi EungJun, git

On Thu, Apr 02, 2015 at 06:59:50PM -0700, Kyle J. McKay wrote:

> It should work as well as the original did for any 1-byte encoding.  That
> is, if it's not valid UTF-8 it should pass through unchanged and any single
> byte encoding should just work.  But, as you point out, multibyte encodings
> other than UTF-8 won't work, but they should behave the same as they did
> before.

Yeah, sorry, I should have been more clear that I meant multibyte
encodings. UTF-8 is the only common multibyte encoding I run across, but
that's because Latin1 served most of my pre-UTF-8 needs.  I suspect
things are very different for people in Asia. I don't know how
badly they would want support for other encodings. I'm happy to go with
a UTF-8 solution for now, and see if anybody wants to expand it further
later.

> >I timed this one versus the existing diff-highlight. It's about 7%
> >slower.
> 
> I'd expect that, we're doing extra work we weren't doing before.

I was worried would be 200% or something. :)

> >Makes sense. I'm happy enough listing perl 5.8 as a dependency.
> 
> Maybe that should be added.  The rest of Git's perl code seems to have a
> 'use 5.008;' already, so I figured that was a reasonable dependency.  :)

I shouldn't have said "listing". I just meant "have" as a dependency. I
am also happy with adding "use 5.008", but I agree it's probably not
necessary at this point. It was released in 2002 (wow, has it really
been that long?).

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] diff-highlight: Fix broken multibyte string
  2015-04-03  2:19       ` Yi, EungJun
@ 2015-04-03 22:08         ` Jeff King
  2015-04-03 22:24           ` Kyle J. McKay
  2015-04-03 22:15         ` [PATCH v3] diff-highlight: do not split multibyte characters Kyle J. McKay
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff King @ 2015-04-03 22:08 UTC (permalink / raw)
  To: Yi, EungJun; +Cc: Kyle J. McKay, Git List

On Fri, Apr 03, 2015 at 11:19:24AM +0900, Yi, EungJun wrote:

> > I timed this one versus the existing diff-highlight. It's about 7%
> > slower. That's not great, but is acceptable to me. The String::Multibyte
> > version was a lot faster, which was nice (but I'm still unclear on
> > _why_).
> 
> I think the reason is here:
> 
> > sub split_line {
> >    local $_ = shift;
> >    return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs->strsplit('', $_) : split //) }
> >           split /($COLOR)/;
> > }
> 
> I removed "*" from "split /($COLOR*)/". Actually I don't know why "*"
> was required but I need to remove it to make my patch works correctly.

Ah, OK, that makes more sense. The "*" was meant to handle the case of
multiple groups of ANSI colors in a row. But I think it should have been
"+" in that case, as we would otherwise split on the empty field, which
would mean character-by-character. And the second "split" in the map
would then be superfluous, which would break your patch (we've already
split the multi-byte characters before we even hit $mbcs->strsplit).

Kyle's patch does not care, because it tweaks the string so that normal
split works. Which means there is an easy speedup here. :)

Doing:

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 08c88bb..1c4b599 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -165,7 +165,7 @@ sub highlight_pair {
 sub split_line {
 	local $_ = shift;
 	return map { /$COLOR/ ? $_ : (split //) }
-	       split /($COLOR*)/;
+	       split /($COLOR+)/;
 }

 sub highlight_line {

gives me a 25% speed improvement, and the same output processing
git.git's entire "git log -p" output.

I thought that meant we could also optimize out the "map" call entirely,
and just use the first split (with "*") to end up with a list of $COLOR
chunks and single characters, but it does not seem to work. So maybe I
am misreading something about what is going on.

-Peff

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3] diff-highlight: do not split multibyte characters
  2015-04-03  2:19       ` Yi, EungJun
  2015-04-03 22:08         ` Jeff King
@ 2015-04-03 22:15         ` Kyle J. McKay
  2015-04-04 14:09           ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Kyle J. McKay @ 2015-04-03 22:15 UTC (permalink / raw)
  To: Jeff King, Yi EungJun; +Cc: git

When the input is UTF-8 and Perl is operating on bytes instead of
characters, a diff that changes one multibyte character to another
that shares an initial byte sequence will result in a broken diff
display as the common byte sequence prefix will be separated from
the rest of the bytes in the multibyte character.

For example, if a single line contains only the unicode character
U+C9C4 (encoded as UTF-8 0xEC, 0xA7, 0x84) and that line is then
changed to the unicode character U+C9C0 (encoded as UTF-8 0xEC,
0xA7, 0x80), when operating on bytes diff-highlight will show only
the single byte change from 0x84 to 0x80 thus creating invalid UTF-8
and a broken diff display.

Fix this by putting Perl into character mode when splitting the line
and then back into byte mode after the split is finished.

The utf8::xxx functions require Perl 5.8 so we require that as well.

Also, since we are mucking with code in the split_line function, we
change a '*' quantifier to a '+' quantifier when matching the $COLOR
expression which has the side effect of speeding everything up while
eliminating useless '' elements in the returned array.

Reported-by: Yi EungJun <semtlenori@gmail.com>
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
---

On Apr 2, 2015, at 19:19, Yi, EungJun wrote:
>> I timed this one versus the existing diff-highlight. It's about 7%
>> slower. That's not great, but is acceptable to me. The  
>> String::Multibyte
>> version was a lot faster, which was nice (but I'm still unclear on
>> _why_).
>
> I think the reason is here:
>
>> sub split_line {
>>   local $_ = shift;
>>   return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs->strsplit('', $_) :  
>> split //) }
>>          split /($COLOR)/;
>> }
>
> I removed "*" from "split /($COLOR*)/". Actually I don't know why "*"
> was required but I need to remove it to make my patch works correctly.
>
> On Fri, Apr 3, 2015 at 10:24 AM, Jeff King <peff@peff.net> wrote:
>> EungJun, does this version meet your needs?

This version differs from the former as follows:

1) Slightly faster code that eliminates the need for Encode::is_utf8.

2) The '*' quantifier is changed to '+' in the split_line regexs which  
was probably the original intent anyway as using '*' generates useless  
empty elements.  This has the side effect of greatly increasing the  
speed so the tiny speed penalty for the UTF-8 checking is vastly  
overwhelmed by the overall speed up. :)

3) The 'use 5.008;' line has been added since the utf8::xxx functions  
require Perl 5.8

-Kyle

 contrib/diff-highlight/diff-highlight | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 08c88bbc..ffefc31a 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -1,5 +1,6 @@
 #!/usr/bin/perl

+use 5.008;
 use warnings FATAL => 'all';
 use strict;

@@ -164,8 +165,12 @@ sub highlight_pair {

 sub split_line {
 	local $_ = shift;
-	return map { /$COLOR/ ? $_ : (split //) }
-	       split /($COLOR*)/;
+	return utf8::decode($_) ?
+		map { utf8::encode($_); $_ }
+			map { /$COLOR/ ? $_ : (split //) }
+			split /($COLOR+)/ :
+		map { /$COLOR/ ? $_ : (split //) }
+		split /($COLOR+)/;
 }

 sub highlight_line {
---

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] diff-highlight: Fix broken multibyte string
  2015-04-03 22:08         ` Jeff King
@ 2015-04-03 22:24           ` Kyle J. McKay
  2015-04-04 14:10             ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Kyle J. McKay @ 2015-04-03 22:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Yi, EungJun, Git List

On Apr 3, 2015, at 15:08, Jeff King wrote:
> Doing:
>
> diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff- 
> highlight/diff-highlight
> index 08c88bb..1c4b599 100755
> --- a/contrib/diff-highlight/diff-highlight
> +++ b/contrib/diff-highlight/diff-highlight
> @@ -165,7 +165,7 @@ sub highlight_pair {
> sub split_line {
> 	local $_ = shift;
> 	return map { /$COLOR/ ? $_ : (split //) }
> -	       split /($COLOR*)/;
> +	       split /($COLOR+)/;
> }
>
> sub highlight_line {
>
> gives me a 25% speed improvement, and the same output processing
> git.git's entire "git log -p" output.
>
> I thought that meant we could also optimize out the "map" call  
> entirely,
> and just use the first split (with "*") to end up with a list of  
> $COLOR
> chunks and single characters, but it does not seem to work. So maybe I
> am misreading something about what is going on.

I think our emails crossed in flight...

Using just the first split (with "*") produces useless empty elements  
which I think ends up causing problems.  I suppose you could surround  
it with a grep /./ to remove them but that would defeat the point of  
the optimization.

-Kyle

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] diff-highlight: do not split multibyte characters
  2015-04-03 22:15         ` [PATCH v3] diff-highlight: do not split multibyte characters Kyle J. McKay
@ 2015-04-04 14:09           ` Jeff King
  2015-04-04 14:47             ` Yi, EungJun
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2015-04-04 14:09 UTC (permalink / raw)
  To: Kyle J. McKay; +Cc: Yi EungJun, git

On Fri, Apr 03, 2015 at 03:15:14PM -0700, Kyle J. McKay wrote:

> When the input is UTF-8 and Perl is operating on bytes instead of
> characters, a diff that changes one multibyte character to another
> that shares an initial byte sequence will result in a broken diff
> display as the common byte sequence prefix will be separated from
> the rest of the bytes in the multibyte character.
> 
> For example, if a single line contains only the unicode character
> U+C9C4 (encoded as UTF-8 0xEC, 0xA7, 0x84) and that line is then
> changed to the unicode character U+C9C0 (encoded as UTF-8 0xEC,
> 0xA7, 0x80), when operating on bytes diff-highlight will show only
> the single byte change from 0x84 to 0x80 thus creating invalid UTF-8
> and a broken diff display.
> 
> Fix this by putting Perl into character mode when splitting the line
> and then back into byte mode after the split is finished.
> 
> The utf8::xxx functions require Perl 5.8 so we require that as well.
> 
> Also, since we are mucking with code in the split_line function, we
> change a '*' quantifier to a '+' quantifier when matching the $COLOR
> expression which has the side effect of speeding everything up while
> eliminating useless '' elements in the returned array.
> 
> Reported-by: Yi EungJun <semtlenori@gmail.com>
> Signed-off-by: Kyle J. McKay <mackyle@gmail.com>

This version looks good to me. I looked over the diff of running "git
log -p --color" on git.git through diff-highlight before and after this
patch, and everything looks like an improvement.

  Acked-by: Jeff King <peff@peff.net>

Thanks both of you for working on this.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] diff-highlight: Fix broken multibyte string
  2015-04-03 22:24           ` Kyle J. McKay
@ 2015-04-04 14:10             ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2015-04-04 14:10 UTC (permalink / raw)
  To: Kyle J. McKay; +Cc: Yi, EungJun, Git List

On Fri, Apr 03, 2015 at 03:24:09PM -0700, Kyle J. McKay wrote:

> >I thought that meant we could also optimize out the "map" call entirely,
> >and just use the first split (with "*") to end up with a list of $COLOR
> >chunks and single characters, but it does not seem to work. So maybe I
> >am misreading something about what is going on.
> 
> I think our emails crossed in flight...
> 
> Using just the first split (with "*") produces useless empty elements which
> I think ends up causing problems.  I suppose you could surround it with a
> grep /./ to remove them but that would defeat the point of the optimization.

Yeah, the problem is the use of (). We want to keep the $COLOR
delimiters, but not the empty ones. Perhaps you could do:

  split /($COLOR+)|/

but I didn't try it. I think what you posted is good and a lot less
subtle.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] diff-highlight: do not split multibyte characters
  2015-04-04 14:09           ` Jeff King
@ 2015-04-04 14:47             ` Yi, EungJun
  0 siblings, 0 replies; 13+ messages in thread
From: Yi, EungJun @ 2015-04-04 14:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Kyle J. McKay, Git List

On Fri, Apr 3, 2015 at 10:24 AM, Jeff King <peff@peff.net> wrote:
>
> EungJun, does this version meet your needs?
>
> -Peff

Yes, this patch is enough to meet my needs because it works well on
UTF-8, the only encoding I use. And this patch looks better than my
one because it is smaller, doesn't depend on String::Multibyte and
seems to have no side-effect.

I hope someone who use another multibyte encoding will send a patch to
support the encoding in future... :)

On Sat, Apr 4, 2015 at 11:09 PM, Jeff King <peff@peff.net> wrote:
> On Fri, Apr 03, 2015 at 03:15:14PM -0700, Kyle J. McKay wrote:
>
>> When the input is UTF-8 and Perl is operating on bytes instead of
>> characters, a diff that changes one multibyte character to another
>> that shares an initial byte sequence will result in a broken diff
>> display as the common byte sequence prefix will be separated from
>> the rest of the bytes in the multibyte character.
>>
>> For example, if a single line contains only the unicode character
>> U+C9C4 (encoded as UTF-8 0xEC, 0xA7, 0x84) and that line is then
>> changed to the unicode character U+C9C0 (encoded as UTF-8 0xEC,
>> 0xA7, 0x80), when operating on bytes diff-highlight will show only
>> the single byte change from 0x84 to 0x80 thus creating invalid UTF-8
>> and a broken diff display.
>>
>> Fix this by putting Perl into character mode when splitting the line
>> and then back into byte mode after the split is finished.
>>
>> The utf8::xxx functions require Perl 5.8 so we require that as well.
>>
>> Also, since we are mucking with code in the split_line function, we
>> change a '*' quantifier to a '+' quantifier when matching the $COLOR
>> expression which has the side effect of speeding everything up while
>> eliminating useless '' elements in the returned array.
>>
>> Reported-by: Yi EungJun <semtlenori@gmail.com>
>> Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
>
> This version looks good to me. I looked over the diff of running "git
> log -p --color" on git.git through diff-highlight before and after this
> patch, and everything looks like an improvement.
>
>   Acked-by: Jeff King <peff@peff.net>
>
> Thanks both of you for working on this.
>
> -Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-04-04 14:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-30 15:55 [PATCH] diff-highlight: Fix broken multibyte string Yi EungJun
2015-03-30 22:16 ` Jeff King
2015-04-03  0:49   ` Kyle J. McKay
2015-04-03  1:24     ` Jeff King
2015-04-03  1:59       ` Kyle J. McKay
2015-04-03 21:47         ` Jeff King
2015-04-03  2:19       ` Yi, EungJun
2015-04-03 22:08         ` Jeff King
2015-04-03 22:24           ` Kyle J. McKay
2015-04-04 14:10             ` Jeff King
2015-04-03 22:15         ` [PATCH v3] diff-highlight: do not split multibyte characters Kyle J. McKay
2015-04-04 14:09           ` Jeff King
2015-04-04 14:47             ` Yi, EungJun

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).