git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Duy Nguyen <pclouds@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string
Date: Wed, 30 Jul 2014 17:20:22 +0700	[thread overview]
Message-ID: <20140730102022.GA5653@lanh> (raw)
In-Reply-To: <xmqqha202b2v.fsf@gitster.dls.corp.google.com>

On Tue, Jul 29, 2014 at 12:56:24PM -0700, Junio C Hamano wrote:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
> 
> > When utf8_width(&src) is called with *src == NULL (because the
> > source string ends with an ansi sequence),
> 
> I am not sure what you mean by "because" here.  Do you mean somebody
> (who?) decides to call the utf8_width() with NULL pointer stored in
> *src because of "ansi sequence"?
> 
> What do you mean by "ansi sequence"?  I'll assume that you mean
> those terminal control that all use bytes with hi-bit clear.

OK let's try with an example, strbuf_utf8_replace(&sb, 0, 0, "") where
"sb" contains "\033[m". The expected result is exactly the same string
with length of 3. After this block

		while ((n = display_mode_esc_sequence_len(src))) {
			memcpy(dst, src, n);
			src += n;
			dst += n;
		}

src will be right after 'm', *src is NUL (iow. src == end). After

		n = utf8_width((const char**)&src, NULL);

n is zero but 'src' is advanced by another character, so
src - old_src == 1. This NUL character is counted as part of the
string, so after the _setlen, we have the same string with length of
_4_ (instead of 3).

> At the very beginning of utf8_width(), *start can be cleared to
> point at a NULL pointer by pick_one_utf8_char() if the pointer that
> comes into utf8_width() originally points at an invalid UTF-8
> string, but as far as I can see, ESC (or any bytes that would be
> used in those terminal control sequences like colors and cursor
> control) will simply be returned as a single byte, without going
> into error path that clears *start = NULL.
> 
> Puzzled...
> 
> > it returns 0 and steps 'src' by one. 
> 
> Here "it" refers to utf8_width()?  Who steps 'src' by one?

utf8_width() steps 'src'.

> 
> Ahh, did you mean *src == NUL, i.e. "already at the end of the
> string"?

Yes.. I guess you have a better commit message prepared for me now :)

> I think utf8_width() called with an empty string should not move the
> pointer past that end-of-string NUL in the first place.  It makes me
> wonder if it would be a better fix to make it not to do that (and
> return 0), but if we declare it is the caller's fault, perhaps we
> may want to add
> 
> 	if (!**start)
>         	die("BUG: empty string to utf8_width()???");
> 
> at the very beginning of utf8_width(), even before it calls
> pick-one-utf8-char.
> 
> Still puzzled...

I don't know. I mean, in a UTF-8 byte sequence, NUL is a valid
character (part of the ASCII subset), so advancing '*start' by one
makes sense. Whether we have any call sites crazy enough to do that on
purpose is a different matter.

> > This stepping makes strbuf_utf8_replace add NUL to the
> > destination string at the end of the loop. Check and break the loop
> > early.
> >
> > Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> > ---
> >  utf8.c | 2 ++
> >  1 file changed, 2 insertions(+)
> 
> Tests?

Ugh.. this was triggered by one of the alignment specifier in log
--pretty. Probably easier to export strbuf_utf8_replace() as a
test-command and verify the output?

--
Duy

  reply	other threads:[~2014-07-30 10:20 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-29 13:10 [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string Nguyễn Thái Ngọc Duy
2014-07-29 19:56 ` Junio C Hamano
2014-07-30 10:20   ` Duy Nguyen [this message]
2014-07-30 18:23     ` Junio C Hamano
2014-08-10  7:05 ` [PATCH v2] utf8.c: fix strbuf_utf8_replace() consuming data beyond input string Nguyễn Thái Ngọc Duy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140730102022.GA5653@lanh \
    --to=pclouds@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).