From: "Kyle J. McKay" <mackyle@gmail.com>
To: Jeff King <peff@peff.net>, Yi EungJun <semtlenori@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] diff-highlight: Fix broken multibyte string
Date: Thu, 2 Apr 2015 17:49:24 -0700 [thread overview]
Message-ID: <ffa56a1b1257732077c287a5cfdd138@74d39fa044aa309eaea14b9f57fe79c> (raw)
In-Reply-To: <20150330221635.GB25212@peff.net>
On Mar 30, 2015, at 15:16, Jeff King wrote:
> Yeah, I agree the current output is not ideal, and this should address
> the problem. I was worried that multi-byte splitting would make things
> slower, but in my tests, it actually speeds things up!
[...]
> Unfortunately, String::Multibyte is not a standard module, and is not
> even packed for Debian systems (I got mine from CPAN). Can we make
> this
> a conditional include (e.g., 'eval "require String::Multibyte"' in
> get_mbcs, and return undef if that fails?). Then people without it can
> still use the script.
[...]
> Yuck. This is a lot more intimate with String::Multibyte's
> implementation than I'd like to be.
So I was curious about this and played with it and was able to
reproduce the problem as described.
Here's an alternate fix that should work for everyone with Perl 5.8
or later.
-Kyle
-- 8< --
Subject: [PATCH v2] diff-highlight: do not split multibyte characters
When the input is UTF-8 and Perl is operating on bytes instead
of characters, a diff that changes one multibyte character to
another that shares an initial byte sequence will result in a
broken diff display as the common byte sequence prefix will be
separated from the rest of the bytes in the multibyte character.
For example, if a single line contains only the unicode
character U+C9C4 (encoded as UTF-8 0xEC, 0xA7, 0x84) and that
line is then changed to the unicode character U+C9C0 (encoded as
UTF-8 0xEC, 0xA7, 0x80), when operating on bytes diff-highlight
will show only the single byte change from 0x84 to 0x80 thus
creating invalid UTF-8 and a broken diff display.
Fix this by putting Perl into character mode when splitting the
line and then back into byte mode after the split is finished.
While the utf8::xxx functions are built-in and do not require
any 'use' statement, the utf8::is_utf8 function did not appear
until Perl 5.8.1, but is identical to the Encode::is_utf8
function which is available in 5.8 so we use that instead of
utf8::is_utf8.
Reported-by: Yi EungJun <semtlenori@gmail.com>
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
---
contrib/diff-highlight/diff-highlight | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 08c88bbc..8e9b5ada 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -2,6 +2,7 @@
use warnings FATAL => 'all';
use strict;
+use Encode ();
# Highlight by reversing foreground and background. You could do
# other things like bold or underline if you prefer.
@@ -164,8 +165,10 @@ sub highlight_pair {
sub split_line {
local $_ = shift;
- return map { /$COLOR/ ? $_ : (split //) }
- split /($COLOR*)/;
+ utf8::decode($_);
+ return map { utf8::encode($_) if Encode::is_utf8($_); $_ }
+ map { /$COLOR/ ? $_ : (split //) }
+ split /($COLOR*)/;
}
sub highlight_line {
---
next prev parent reply other threads:[~2015-04-03 0:49 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-03-30 15:55 [PATCH] diff-highlight: Fix broken multibyte string Yi EungJun
2015-03-30 22:16 ` Jeff King
2015-04-03 0:49 ` Kyle J. McKay [this message]
2015-04-03 1:24 ` Jeff King
2015-04-03 1:59 ` Kyle J. McKay
2015-04-03 21:47 ` Jeff King
2015-04-03 2:19 ` Yi, EungJun
2015-04-03 22:08 ` Jeff King
2015-04-03 22:24 ` Kyle J. McKay
2015-04-04 14:10 ` Jeff King
2015-04-03 22:15 ` [PATCH v3] diff-highlight: do not split multibyte characters Kyle J. McKay
2015-04-04 14:09 ` Jeff King
2015-04-04 14:47 ` Yi, EungJun
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ffa56a1b1257732077c287a5cfdd138@74d39fa044aa309eaea14b9f57fe79c \
--to=mackyle@gmail.com \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
--cc=semtlenori@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).