All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Yi EungJun <semtlenori@gmail.com>
Cc: git@vger.kernel.org, Yi EungJun <eungjun.yi@navercorp.com>
Subject: Re: [PATCH] diff-highlight: Fix broken multibyte string
Date: Mon, 30 Mar 2015 18:16:35 -0400	[thread overview]
Message-ID: <20150330221635.GB25212@peff.net> (raw)
In-Reply-To: <1427730933-26189-1-git-send-email-eungjun.yi@navercorp.com>

On Tue, Mar 31, 2015 at 12:55:33AM +0900, Yi EungJun wrote:

> From: Yi EungJun <eungjun.yi@navercorp.com>
> 
> Highlighted string might be broken if the common subsequence is a proper subset
> of a multibyte character. For example, if the old string is "진" and the new
> string is "지", then we expect the diff is rendered as follows:
> 
> 	-진
> 	+지
> 
> but actually it was rendered as follows:
> 
>     -<EC><A7><84>
>     +<EC><A7><80>
> 
> This fixes the bug by splitting the string by multibyte characters.

Yeah, I agree the current output is not ideal, and this should address
the problem. I was worried that multi-byte splitting would make things
slower, but in my tests, it actually speeds things up!

That surprised me. The big difference is calling $mbcs->strsplit instead
of regular split. Could it be that it's much faster than regular split?
Or is it that the resulting strings are faster for the rest of the
processing (e.g., perl hits a "slow path" on the sheared characters)? It
doesn't really matter, I guess, but certainly I was curious.

> +use File::Basename;
> +use File::Spec::Functions qw( catdir );
> +use String::Multibyte;

Unfortunately, String::Multibyte is not a standard module, and is not
even packed for Debian systems (I got mine from CPAN). Can we make this
a conditional include (e.g., 'eval "require String::Multibyte"' in
get_mbcs, and return undef if that fails?). Then people without it can
still use the script.

> +# Returns an instance of String::Multibyte based on the charset defined by
> +# i18n.commitencoding or UTF-8, or undef if String::Multibyte doesn't support
> +# the charset.

Hrm. The characters we are processing are not in the commit message, but
in the files themselves. In fact, there may be many different charsets
(i.e., a different one for each file), and we really don't have a good
way of knowing which is in play. I'd say that using the commit
encoding is our best guess, though. What happens with $mbcs->split when
the input is not a valid character in the charset (i.e., when we guess
wrong)?

If we are going to use the commit encoding, wouldn't
i18n.logOutputEncoding be a better choice?

> +sub get_mbcs {
> +	my $dir = catdir(dirname($INC{'String/Multibyte.pm'}), 'Multibyte');
> +	opendir my $dh, $dir or return;
> +	my @mbcs_charsets = grep s/[.]pm\z//, readdir $dh;

Yuck. This is a lot more intimate with String::Multibyte's
implementation than I'd like to be. Could we perhaps just run the
constructor on any candidates charsets, and then return the first hit
that gives us something besides undef?

-Peff

  reply	other threads:[~2015-03-30 22:16 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-30 15:55 [PATCH] diff-highlight: Fix broken multibyte string Yi EungJun
2015-03-30 22:16 ` Jeff King [this message]
2015-04-03  0:49   ` Kyle J. McKay
2015-04-03  1:24     ` Jeff King
2015-04-03  1:59       ` Kyle J. McKay
2015-04-03 21:47         ` Jeff King
2015-04-03  2:19       ` Yi, EungJun
2015-04-03 22:08         ` Jeff King
2015-04-03 22:24           ` Kyle J. McKay
2015-04-04 14:10             ` Jeff King
2015-04-03 22:15         ` [PATCH v3] diff-highlight: do not split multibyte characters Kyle J. McKay
2015-04-04 14:09           ` Jeff King
2015-04-04 14:47             ` Yi, EungJun

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150330221635.GB25212@peff.net \
    --to=peff@peff.net \
    --cc=eungjun.yi@navercorp.com \
    --cc=git@vger.kernel.org \
    --cc=semtlenori@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.