* [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string @ 2014-07-29 13:10 Nguyễn Thái Ngọc Duy 2014-07-29 19:56 ` Junio C Hamano 2014-08-10 7:05 ` [PATCH v2] utf8.c: fix strbuf_utf8_replace() consuming data beyond input string Nguyễn Thái Ngọc Duy 0 siblings, 2 replies; 5+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2014-07-29 13:10 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy When utf8_width(&src) is called with *src == NULL (because the source string ends with an ansi sequence), it returns 0 and steps 'src' by one. This stepping makes strbuf_utf8_replace add NUL to the destination string at the end of the loop. Check and break the loop early. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- utf8.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/utf8.c b/utf8.c index b30790d..cd090a1 100644 --- a/utf8.c +++ b/utf8.c @@ -381,6 +381,8 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, int width, src += n; dst += n; } + if (src >= end) + break; old = src; n = utf8_width((const char**)&src, NULL); -- 2.1.0.rc0.66.gb9187ad ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string 2014-07-29 13:10 [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string Nguyễn Thái Ngọc Duy @ 2014-07-29 19:56 ` Junio C Hamano 2014-07-30 10:20 ` Duy Nguyen 2014-08-10 7:05 ` [PATCH v2] utf8.c: fix strbuf_utf8_replace() consuming data beyond input string Nguyễn Thái Ngọc Duy 1 sibling, 1 reply; 5+ messages in thread From: Junio C Hamano @ 2014-07-29 19:56 UTC (permalink / raw) To: Nguyễn Thái Ngọc Duy; +Cc: git Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes: > When utf8_width(&src) is called with *src == NULL (because the > source string ends with an ansi sequence), I am not sure what you mean by "because" here. Do you mean somebody (who?) decides to call the utf8_width() with NULL pointer stored in *src because of "ansi sequence"? What do you mean by "ansi sequence"? I'll assume that you mean those terminal control that all use bytes with hi-bit clear. At the very beginning of utf8_width(), *start can be cleared to point at a NULL pointer by pick_one_utf8_char() if the pointer that comes into utf8_width() originally points at an invalid UTF-8 string, but as far as I can see, ESC (or any bytes that would be used in those terminal control sequences like colors and cursor control) will simply be returned as a single byte, without going into error path that clears *start = NULL. Puzzled... > it returns 0 and steps > 'src' by one. Here "it" refers to utf8_width()? Who steps 'src' by one? Ahh, did you mean *src == NUL, i.e. "already at the end of the string"? I think utf8_width() called with an empty string should not move the pointer past that end-of-string NUL in the first place. It makes me wonder if it would be a better fix to make it not to do that (and return 0), but if we declare it is the caller's fault, perhaps we may want to add if (!**start) die("BUG: empty string to utf8_width()???"); at the very beginning of utf8_width(), even before it calls pick-one-utf8-char. Still puzzled... > This stepping makes strbuf_utf8_replace add NUL to the > destination string at the end of the loop. Check and break the loop > early. > > Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> > --- > utf8.c | 2 ++ > 1 file changed, 2 insertions(+) Tests? > diff --git a/utf8.c b/utf8.c > index b30790d..cd090a1 100644 > --- a/utf8.c > +++ b/utf8.c > @@ -381,6 +381,8 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, int width, > src += n; > dst += n; > } > + if (src >= end) > + break; > > old = src; > n = utf8_width((const char**)&src, NULL); ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string 2014-07-29 19:56 ` Junio C Hamano @ 2014-07-30 10:20 ` Duy Nguyen 2014-07-30 18:23 ` Junio C Hamano 0 siblings, 1 reply; 5+ messages in thread From: Duy Nguyen @ 2014-07-30 10:20 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Tue, Jul 29, 2014 at 12:56:24PM -0700, Junio C Hamano wrote: > Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes: > > > When utf8_width(&src) is called with *src == NULL (because the > > source string ends with an ansi sequence), > > I am not sure what you mean by "because" here. Do you mean somebody > (who?) decides to call the utf8_width() with NULL pointer stored in > *src because of "ansi sequence"? > > What do you mean by "ansi sequence"? I'll assume that you mean > those terminal control that all use bytes with hi-bit clear. OK let's try with an example, strbuf_utf8_replace(&sb, 0, 0, "") where "sb" contains "\033[m". The expected result is exactly the same string with length of 3. After this block while ((n = display_mode_esc_sequence_len(src))) { memcpy(dst, src, n); src += n; dst += n; } src will be right after 'm', *src is NUL (iow. src == end). After n = utf8_width((const char**)&src, NULL); n is zero but 'src' is advanced by another character, so src - old_src == 1. This NUL character is counted as part of the string, so after the _setlen, we have the same string with length of _4_ (instead of 3). > At the very beginning of utf8_width(), *start can be cleared to > point at a NULL pointer by pick_one_utf8_char() if the pointer that > comes into utf8_width() originally points at an invalid UTF-8 > string, but as far as I can see, ESC (or any bytes that would be > used in those terminal control sequences like colors and cursor > control) will simply be returned as a single byte, without going > into error path that clears *start = NULL. > > Puzzled... > > > it returns 0 and steps 'src' by one. > > Here "it" refers to utf8_width()? Who steps 'src' by one? utf8_width() steps 'src'. > > Ahh, did you mean *src == NUL, i.e. "already at the end of the > string"? Yes.. I guess you have a better commit message prepared for me now :) > I think utf8_width() called with an empty string should not move the > pointer past that end-of-string NUL in the first place. It makes me > wonder if it would be a better fix to make it not to do that (and > return 0), but if we declare it is the caller's fault, perhaps we > may want to add > > if (!**start) > die("BUG: empty string to utf8_width()???"); > > at the very beginning of utf8_width(), even before it calls > pick-one-utf8-char. > > Still puzzled... I don't know. I mean, in a UTF-8 byte sequence, NUL is a valid character (part of the ASCII subset), so advancing '*start' by one makes sense. Whether we have any call sites crazy enough to do that on purpose is a different matter. > > This stepping makes strbuf_utf8_replace add NUL to the > > destination string at the end of the loop. Check and break the loop > > early. > > > > Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> > > --- > > utf8.c | 2 ++ > > 1 file changed, 2 insertions(+) > > Tests? Ugh.. this was triggered by one of the alignment specifier in log --pretty. Probably easier to export strbuf_utf8_replace() as a test-command and verify the output? -- Duy ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string 2014-07-30 10:20 ` Duy Nguyen @ 2014-07-30 18:23 ` Junio C Hamano 0 siblings, 0 replies; 5+ messages in thread From: Junio C Hamano @ 2014-07-30 18:23 UTC (permalink / raw) To: Duy Nguyen; +Cc: git Duy Nguyen <pclouds@gmail.com> writes: >> > it returns 0 and steps 'src' by one. >> >> Here "it" refers to utf8_width()? Who steps 'src' by one? > > utf8_width() steps 'src'. > >> >> Ahh, did you mean *src == NUL, i.e. "already at the end of the >> string"? > > Yes.. I guess you have a better commit message prepared for me now :) Heh. At least you realized that the log message was uninformative ;-) >> I think utf8_width() called with an empty string should not move the >> pointer past that end-of-string NUL in the first place. It makes me >> wonder if it would be a better fix to make it not to do that (and >> return 0), but if we declare it is the caller's fault, perhaps we >> may want to add >> >> if (!**start) >> die("BUG: empty string to utf8_width()???"); >> >> at the very beginning of utf8_width(), even before it calls >> pick-one-utf8-char. >> >> Still puzzled... > > I don't know. I mean, in a UTF-8 byte sequence, NUL is a valid > character (part of the ASCII subset), so advancing '*start' by one > makes sense. Dubious. NUL may be valid but it is valid only in the sense that it will mark the end of the string, i.e. the caller is expected to do: const char *string = ...; while (not reached the end of the string) { this_char = pick_one_char(&string); do something to this_char; } and there are two ways to structure such a loop: (1) Make pick-one-char signal the termination condition. i.e. while ((this_char = pick_one(&string)) != EOF) do something to this_char; That's a typical getchar()-style loop. It would better not advance string when it returns EOF. (2) Make it the responsibility of the caller not to call beyond the end. i.e. while (string < end) { this_char = pick_one_char(&string); do something to this char; } and your patch takes the latter, which I think is in line with the way how existing callchain works. So the addition of "BUG" merely is to make sure that everybody follows that same convention. We cannot declare that it is caller's responsibility to feed sensible input to the function only at one callsite and allow other callsites feed garbage to the same function, after all, no? ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH v2] utf8.c: fix strbuf_utf8_replace() consuming data beyond input string 2014-07-29 13:10 [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string Nguyễn Thái Ngọc Duy 2014-07-29 19:56 ` Junio C Hamano @ 2014-08-10 7:05 ` Nguyễn Thái Ngọc Duy 1 sibling, 0 replies; 5+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2014-08-10 7:05 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy The main loop in strbuf_utf8_replace() could summed up as: while ('src' is still valid) { 1) advance 'src' to copy ANSI escape sequences 2) advance 'src' to copy/replace visible characters } The problem is after #1, 'src' may have reached the end of the string (so 'src' points to NUL) and #2 will continue to copy that NUL as if it's a normal character. Because the output is stored in a strbuf, this NULL accounted in the 'len' field as well. Check after #1 and break the loop if necessary. The test does not look obvious, but the combination of %>>() should make a call trace like this show_log() pretty_print_commit() format_commit_message() strbuf_expand() format_commit_item() format_and_pad_commit() strbuf_utf8_replace() where %C(auto)%d would insert a color reset escape sequence in the end of the string given to strbuf_utf8_replace() and show_log() uses fwrite() to send everything to stdout (including the incorrect NUL inserted by strbuf_utf8_replace) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- t/t4205-log-pretty-formats.sh | 7 +++++++ utf8.c | 3 +++ 2 files changed, 10 insertions(+) diff --git a/t/t4205-log-pretty-formats.sh b/t/t4205-log-pretty-formats.sh index 349c531..de0cc4a 100755 --- a/t/t4205-log-pretty-formats.sh +++ b/t/t4205-log-pretty-formats.sh @@ -431,6 +431,13 @@ EOF test_cmp expected actual ' +test_expect_success 'strbuf_utf8_replace() not producing NUL' ' + git log --color --pretty="tformat:%<(10,trunc)%s%>>(10,ltrunc)%C(auto)%d" | + test_decode_color | + nul_to_q >actual && + ! grep Q actual +' + # get new digests (with no abbreviations) head1=$(git rev-parse --verify HEAD~0) && head2=$(git rev-parse --verify HEAD~1) && diff --git a/utf8.c b/utf8.c index b30790d..401a6a5 100644 --- a/utf8.c +++ b/utf8.c @@ -382,6 +382,9 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, int width, dst += n; } + if (src >= end) + break; + old = src; n = utf8_width((const char**)&src, NULL); if (!src) /* broken utf-8, do nothing */ -- 2.1.0.rc0.78.gc0d8480 ^ permalink raw reply related [flat|nested] 5+ messages in thread
end of thread, other threads:[~2014-08-10 7:09 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-07-29 13:10 [PATCH] utf8.c: fix strbuf_utf8_replace copying the last NUL to dst string Nguyễn Thái Ngọc Duy 2014-07-29 19:56 ` Junio C Hamano 2014-07-30 10:20 ` Duy Nguyen 2014-07-30 18:23 ` Junio C Hamano 2014-08-10 7:05 ` [PATCH v2] utf8.c: fix strbuf_utf8_replace() consuming data beyond input string Nguyễn Thái Ngọc Duy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).