git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Yoshihiro Sugi <sugi1982@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] contrib/diff-highlight: multibyte characters diff
Date: Wed, 12 Feb 2014 15:59:48 -0500	[thread overview]
Message-ID: <20140212205948.GA4453@sigill.intra.peff.net> (raw)
In-Reply-To: <1392109750-47852-1-git-send-email-sugi1982@gmail.com>

On Tue, Feb 11, 2014 at 06:09:10PM +0900, Yoshihiro Sugi wrote:

> diff-highlight split each hunks and compare them as byte sequences.
> it causes problems when diff hunks include multibyte characters.
> This change enable to work on such cases by decoding inputs and encoding output as utf8 string.

Thanks for looking at this. I didn't consider multibyte characters at
all when I wrote the original.

Sadly, applying your patch seems to cause diff-highlight to take about
1.5x as much CPU (I just tried it on the complete output of "git log
--all -p" in my git.git repository, which went from ~60s to ~90s). That
is not necessarily a deal-breaker, as it is more important to be
correct. But I wonder if there is any way we can optimize it. From my
understanding, it may not be the encoding/decoding itself, but rather
that the utf8-aware text routines are slower. More on that below.

>  # Highlight by reversing foreground and background. You could do
>  # other things like bold or underline if you prefer.
> @@ -15,8 +16,9 @@ my @added;
>  my $in_hunk;
>  
>  while (<>) {
> +	$_ = decode_utf8($_);

What happens when the diff content is not utf8 at all? The default
failure mode for decode_utf8 seems to be to replace non-utf8 sequences
with a substitution character. Which means we would corrupt something
like iso8859-1, which works fine in the current code.

>  	if (!$in_hunk) {
> -		print;
> +		print encode_utf8($_);

And we have the opposite question here. We know that what we have in the
string is perl's internal format. But how do we know that what the
user's terminal wants is utf8?

I think in the most general case, we would need to auto-detect the
encoding for each string (since there is no guarantee that a file even
retains the same encoding through its history). And then probably punt on
highlighting when comparing two lines in different encodings, and if
they are in the same encoding, treat them as a sequence of code points
rather than octets when comparing.

That's probably way too slow and complicated to worry about. And
frankly, I am OK with making the tool work _best_ with UTF-8, as long as
it can fail gracefully for other encodings (i.e., if you can turn off
the utf8 magic when you have another encoding, and we will pass it
through blindly as octets).

Would it be enough to do something like the patch below? The specific
things I am shooting for are:

  - the only part of the code that cares about the sequence is the
    split/highlight bit, and most lines do not get highlighted at all.
    Therefore it tries to lazily do the decode step, which gets back our
    lost performance (I measured ~12% slowdown).

  - we try the utf8 decode and if it fails, fall back to treating it
    like a sequence of bytes. That doesn't help people with other
    multibyte encodings, but at least lets single-byte encodings besides
    utf8 continue to work.

I constructed a very simple test of:

  git init &&
  printf 'cont\xc3\xa9nt\n' >file &&
  git add file &&
  printf 'cont\xc3\xb6nt\n' >file &&
  git diff

and it seemed to work. Can you confirm that it does the right thing on
your more complicated cases?

-Peff

---
diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index c4404d4..9447ba2 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -2,6 +2,7 @@
 
 use warnings FATAL => 'all';
 use strict;
+use Encode qw(decode_utf8 encode_utf8);
 
 # Highlight by reversing foreground and background. You could do
 # other things like bold or underline if you prefer.
@@ -73,13 +74,23 @@ sub show_hunk {
 
 	my @queue;
 	for (my $i = 0; $i < @$a; $i++) {
-		my ($rm, $add) = highlight_pair($a->[$i], $b->[$i]);
-		print $rm;
-		push @queue, $add;
+		my ($a_dec, $encode_rm) = decode($a->[$i]);
+		my ($b_dec, $encode_add) = decode($b->[$i]);
+		my ($rm, $add) = highlight_pair($a_dec, $b_dec);
+		print $encode_rm->($rm);
+		push @queue, $encode_add->($add);
 	}
 	print @queue;
 }
 
+sub decode {
+	my $orig = shift;
+	my $decoded = eval { decode_utf8($orig, Encode::FB_CROAK) };
+	return defined $decoded ?
+	       ($decoded, sub { encode_utf8(shift) }) :
+	       ($orig, sub { shift });
+}
+
 sub highlight_pair {
 	my @a = split_line(shift);
 	my @b = split_line(shift);

  parent reply	other threads:[~2014-02-12 20:59 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-02-11  9:09 [PATCH] contrib/diff-highlight: multibyte characters diff Yoshihiro Sugi
2014-02-11 19:30 ` Junio C Hamano
2014-02-12 20:59 ` Jeff King [this message]
2014-02-12 23:10   ` Thomas Adam
2014-02-12 23:27     ` Jeff King
2014-02-13  1:17       ` brian m. carlson
2014-02-13  1:37         ` Jeff King
2014-02-13 19:14   ` Yoshihiro Sugi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140212205948.GA4453@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=sugi1982@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).