All of lore.kernel.org
 help / color / mirror / Atom feed
From: Duy Nguyen <pclouds@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string
Date: Wed, 30 Jul 2014 17:20:22 +0700	[thread overview]
Message-ID: <20140730102022.GA5653@lanh> (raw)
In-Reply-To: <xmqqha202b2v.fsf@gitster.dls.corp.google.com>

On Tue, Jul 29, 2014 at 12:56:24PM -0700, Junio C Hamano wrote:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
> 
> > When utf8_width(&src) is called with *src == NULL (because the
> > source string ends with an ansi sequence),
> 
> I am not sure what you mean by "because" here.  Do you mean somebody
> (who?) decides to call the utf8_width() with NULL pointer stored in
> *src because of "ansi sequence"?
> 
> What do you mean by "ansi sequence"?  I'll assume that you mean
> those terminal control that all use bytes with hi-bit clear.

OK let's try with an example, strbuf_utf8_replace(&sb, 0, 0, "") where
"sb" contains "\033[m". The expected result is exactly the same string
with length of 3. After this block

		while ((n = display_mode_esc_sequence_len(src))) {
			memcpy(dst, src, n);
			src += n;
			dst += n;
		}

src will be right after 'm', *src is NUL (iow. src == end). After

		n = utf8_width((const char**)&src, NULL);

n is zero but 'src' is advanced by another character, so
src - old_src == 1. This NUL character is counted as part of the
string, so after the _setlen, we have the same string with length of
_4_ (instead of 3).

> At the very beginning of utf8_width(), *start can be cleared to
> point at a NULL pointer by pick_one_utf8_char() if the pointer that
> comes into utf8_width() originally points at an invalid UTF-8
> string, but as far as I can see, ESC (or any bytes that would be
> used in those terminal control sequences like colors and cursor
> control) will simply be returned as a single byte, without going
> into error path that clears *start = NULL.
> 
> Puzzled...
> 
> > it returns 0 and steps 'src' by one. 
> 
> Here "it" refers to utf8_width()?  Who steps 'src' by one?

utf8_width() steps 'src'.

> 
> Ahh, did you mean *src == NUL, i.e. "already at the end of the
> string"?

Yes.. I guess you have a better commit message prepared for me now :)

> I think utf8_width() called with an empty string should not move the
> pointer past that end-of-string NUL in the first place.  It makes me
> wonder if it would be a better fix to make it not to do that (and
> return 0), but if we declare it is the caller's fault, perhaps we
> may want to add
> 
> 	if (!**start)
>         	die("BUG: empty string to utf8_width()???");
> 
> at the very beginning of utf8_width(), even before it calls
> pick-one-utf8-char.
> 
> Still puzzled...

I don't know. I mean, in a UTF-8 byte sequence, NUL is a valid
character (part of the ASCII subset), so advancing '*start' by one
makes sense. Whether we have any call sites crazy enough to do that on
purpose is a different matter.

> > This stepping makes strbuf_utf8_replace add NUL to the
> > destination string at the end of the loop. Check and break the loop
> > early.
> >
> > Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> > ---
> >  utf8.c | 2 ++
> >  1 file changed, 2 insertions(+)
> 
> Tests?

Ugh.. this was triggered by one of the alignment specifier in log
--pretty. Probably easier to export strbuf_utf8_replace() as a
test-command and verify the output?

--
Duy

  reply	other threads:[~2014-07-30 10:20 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-29 13:10 [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string Nguyễn Thái Ngọc Duy
2014-07-29 19:56 ` Junio C Hamano
2014-07-30 10:20   ` Duy Nguyen [this message]
2014-07-30 18:23     ` Junio C Hamano
2014-08-10  7:05 ` [PATCH v2] utf8.c: fix strbuf_utf8_replace() consuming data beyond input string Nguyễn Thái Ngọc Duy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140730102022.GA5653@lanh \
    --to=pclouds@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.