* git bug: rebase fatal failure @ 2008-05-15 7:27 Tommy Thorn 2008-05-16 10:41 ` Johannes Schindelin 0 siblings, 1 reply; 21+ messages in thread From: Tommy Thorn @ 2008-05-15 7:27 UTC (permalink / raw) To: git Hi, I have large-ish repository (143 MB) where git rebase failed unexpectedly (see below). If anyone is interested in looking at this, I have left a copy of the repository at http://thorn.ws/git-fatal-rebase-error.tar.gz Tommy PS: Please add me in CC as I'm no longer on this list. $ git --version git version 1.5.5.1.93.ge3f3e $ git rebase cacao my-cacao-build First, rewinding head to replay your work on top of it... Applying Import binutils-2.18 .dotest/patch:16878: trailing whitespace. .dotest/patch:17923: trailing whitespace. 0. Additional Definitions. .dotest/patch:18024: trailing whitespace. Version. .dotest/patch:18094: trailing whitespace. // .dotest/patch:18099: trailing whitespace. // fatal: corrupt patch at line 260912 Patch failed at 0001. When you have resolved this problem run "git rebase --continue". If you would prefer to skip this patch, instead run "git rebase --skip". To restore the original branch and stop rebasing run "git rebase --abort". ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git bug: rebase fatal failure 2008-05-15 7:27 git bug: rebase fatal failure Tommy Thorn @ 2008-05-16 10:41 ` Johannes Schindelin 2008-05-16 11:01 ` Johannes Schindelin 0 siblings, 1 reply; 21+ messages in thread From: Johannes Schindelin @ 2008-05-16 10:41 UTC (permalink / raw) To: Tommy Thorn; +Cc: git Hi, On Thu, 15 May 2008, Tommy Thorn wrote: > I have large-ish repository (143 MB) where git rebase failed > unexpectedly (see below). If anyone is interested in looking at this, I > have left a copy of the repository at > http://thorn.ws/git-fatal-rebase-error.tar.gz I can reproduce. It seems that our diff machinery generates a file that adds a file with 10668 lines, but says 10669 lines in the hunk header. It is a .info file, so I suspect strange things going on with non-printable ASCII characters. I'm on it, Dscho ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git bug: rebase fatal failure 2008-05-16 10:41 ` Johannes Schindelin @ 2008-05-16 11:01 ` Johannes Schindelin 2008-05-16 13:03 ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin 0 siblings, 1 reply; 21+ messages in thread From: Johannes Schindelin @ 2008-05-16 11:01 UTC (permalink / raw) To: Tommy Thorn; +Cc: git Hi, On Fri, 16 May 2008, Johannes Schindelin wrote: > It is a .info file, so I suspect strange things going on with > non-printable ASCII characters. And indeed, this is the culprit: -- snip -- BFD Index ********* ^@^H[index^@^H] * Menu: -- snap -- Note the NUL characters? Now, the thing is: git format-patch still outputs it correctly. But git-rebase pipes the output to git-am, which in turn calls git-mailsplit, which suppresses that line. Will keep you posted, Dscho ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 11:01 ` Johannes Schindelin @ 2008-05-16 13:03 ` Johannes Schindelin 2008-05-16 14:03 ` Avery Pennarun ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: Johannes Schindelin @ 2008-05-16 13:03 UTC (permalink / raw) To: Tommy Thorn; +Cc: git The function fgets() has a big problem with NUL characters: it reads them, but nobody will know if the NUL comes from the file stream, or was appended at the end of the line. So implement a custom read_line() function. Noticed by Tommy Thorn. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- Sorry for the binary patch: the file t5100/nul contains NUL characters, obviously. BTW I do not know how much fgetc() instead of fgets() slows down things, but I expect both to be equally fast because they are both buffered, right? builtin-mailinfo.c | 24 +++++++++++++----------- builtin-mailsplit.c | 27 +++++++++++++++++++++++---- builtin.h | 1 + t/t5100-mailinfo.sh | 9 +++++++++ t/t5100/nul | Bin 0 -> 91 bytes 5 files changed, 46 insertions(+), 15 deletions(-) create mode 100644 t/t5100/nul diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c index 11f154b..f0c4209 100644 --- a/builtin-mailinfo.c +++ b/builtin-mailinfo.c @@ -641,7 +641,7 @@ static void decode_transfer_encoding(char *line, unsigned linesize) } } -static int handle_filter(char *line, unsigned linesize); +static int handle_filter(char *line, unsigned linesize, int linelen); static int find_boundary(void) { @@ -669,7 +669,7 @@ again: "can't recover\n"); exit(1); } - handle_filter(newline, sizeof(newline)); + handle_filter(newline, sizeof(newline), strlen(newline)); /* skip to the next boundary */ if (!find_boundary()) @@ -759,14 +759,14 @@ static int handle_commit_msg(char *line, unsigned linesize) return 0; } -static int handle_patch(char *line) +static int handle_patch(char *line, int len) { - fputs(line, patchfile); + fwrite(line, 1, len, patchfile); patch_lines++; return 0; } -static int handle_filter(char *line, unsigned linesize) +static int handle_filter(char *line, unsigned linesize, int linelen) { static int filter = 0; @@ -779,7 +779,7 @@ static int handle_filter(char *line, unsigned linesize) break; filter++; case 1: - if (!handle_patch(line)) + if (!handle_patch(line, linelen)) break; filter++; default: @@ -794,6 +794,7 @@ static void handle_body(void) int rc = 0; static char newline[2000]; static char *np = newline; + int len = strlen(line); /* Skip up to the first boundary */ if (content_top->boundary) { @@ -807,7 +808,8 @@ static void handle_body(void) /* flush any leftover */ if ((transfer_encoding == TE_BASE64) && (np != newline)) { - handle_filter(newline, sizeof(newline)); + handle_filter(newline, sizeof(newline), + strlen(newline)); } if (!handle_boundary()) return; @@ -824,7 +826,7 @@ static void handle_body(void) /* binary data most likely doesn't have newlines */ if (message_type != TYPE_TEXT) { - rc = handle_filter(line, sizeof(newline)); + rc = handle_filter(line, sizeof(line), len); break; } @@ -841,7 +843,7 @@ static void handle_body(void) /* should be sitting on a new line */ *(++np) = 0; op++; - rc = handle_filter(newline, sizeof(newline)); + rc = handle_filter(newline, sizeof(newline), np - newline); np = newline; } } while (*op != 0); @@ -851,12 +853,12 @@ static void handle_body(void) break; } default: - rc = handle_filter(line, sizeof(newline)); + rc = handle_filter(line, sizeof(line), len); } if (rc) /* nothing left to filter */ break; - } while (fgets(line, sizeof(line), fin)); + } while ((len = read_line_with_nul(line, sizeof(line), fin))); return; } diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c index 46b27cd..021dc16 100644 --- a/builtin-mailsplit.c +++ b/builtin-mailsplit.c @@ -45,6 +45,25 @@ static int is_from_line(const char *line, int len) /* Could be as small as 64, enough to hold a Unix "From " line. */ static char buf[4096]; +/* We cannot use fgets() because our lines can contain NULs */ +int read_line_with_nul(char *buf, int size, FILE *in) +{ + int len = 0, c; + + for (;;) { + c = fgetc(in); + buf[len++] = c; + if (c == EOF || c == '\n' || len + 1 >= size) + break; + } + + if (c == EOF) + len--; + buf[len] = '\0'; + + return len; +} + /* Called with the first line (potentially partial) * already in buf[] -- normally that should begin with * the Unix "From " line. Write it into the specified @@ -70,19 +89,19 @@ static int split_one(FILE *mbox, const char *name, int allow_bare) * "From " and having something that looks like a date format. */ for (;;) { - int is_partial = (buf[len-1] != '\n'); + int is_partial = len && buf[len-1] != '\n'; - if (fputs(buf, output) == EOF) + if (fwrite(buf, 1, len, output) != len) die("cannot write output"); - if (fgets(buf, sizeof(buf), mbox) == NULL) { + len = read_line_with_nul(buf, sizeof(buf), mbox); + if (len == 0) { if (feof(mbox)) { status = 1; break; } die("cannot read mbox"); } - len = strlen(buf); if (!is_partial && !is_bare && is_from_line(buf, len)) break; /* done with one message */ } diff --git a/builtin.h b/builtin.h index c630d5b..d0a0ead 100644 --- a/builtin.h +++ b/builtin.h @@ -9,6 +9,7 @@ extern const char git_usage_string[]; extern void list_common_cmds_help(void); extern void help_unknown_cmd(const char *cmd); extern void prune_packed_objects(int); +extern int read_line_with_nul(char *buf, int size, FILE *file); extern int cmd_add(int argc, const char **argv, const char *prefix); extern int cmd_annotate(int argc, const char **argv, const char *prefix); diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh index d6c55c1..5a4610b 100755 --- a/t/t5100-mailinfo.sh +++ b/t/t5100-mailinfo.sh @@ -25,4 +25,13 @@ do diff ../t5100/info$mail info$mail" done +test_expect_success 'respect NULs' ' + + git mailsplit -d3 -o. ../t5100/nul && + cmp ../t5100/nul 001 && + (cat 001 | git mailinfo msg patch) && + test 4 = $(wc -l < patch) + +' + test_done diff --git a/t/t5100/nul b/t/t5100/nul new file mode 100644 index 0000000000000000000000000000000000000000..3d40691787b855cc0133514a19052492eb853d21 GIT binary patch literal 91 zcmW;6y$ygM5C%|6a#MT@Tm%~v2e7kZ0t`Q)fHOej_C}MJcXX*}a!Gh_N`s3x>;_}@ mA68>55i?ULDS<hc3BM!}T;HU$lNvE*_bo@vIHuA{lcE=x<rtIz literal 0 HcmV?d00001 -- 1.5.5.1.425.g5f464.dirty ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 13:03 ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin @ 2008-05-16 14:03 ` Avery Pennarun 2008-05-16 14:05 ` David Kastrup 2008-05-16 14:32 ` Johannes Schindelin [not found] ` <200805161539.29259.brian.foster@innova-card.com> 2008-05-21 18:08 ` Junio C Hamano 2 siblings, 2 replies; 21+ messages in thread From: Avery Pennarun @ 2008-05-16 14:03 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Tommy Thorn, git On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > BTW I do not know how much fgetc() instead of fgets() slows > down things, but I expect both to be equally fast because > they are both buffered, right? In my experience, fgetc() is pretty fantastically slow because you have a function call for every byte (and, I gather, modern libc does thread locking for every fgetc). It's usually much faster to fread() into a buffer and then access the buffer. I don't know if that's appropriate (or matters) here, though. Have fun, Avery ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 14:03 ` Avery Pennarun @ 2008-05-16 14:05 ` David Kastrup 2008-05-16 14:32 ` Johannes Schindelin 1 sibling, 0 replies; 21+ messages in thread From: David Kastrup @ 2008-05-16 14:05 UTC (permalink / raw) To: git "Avery Pennarun" <apenwarr@gmail.com> writes: > On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: >> BTW I do not know how much fgetc() instead of fgets() slows >> down things, but I expect both to be equally fast because >> they are both buffered, right? > > In my experience, fgetc() is pretty fantastically slow because you > have a function call for every byte (and, I gather, modern libc does > thread locking for every fgetc). It's usually much faster to fread() > into a buffer and then access the buffer. I don't know if that's > appropriate (or matters) here, though. Is getc an option? -- David Kastrup ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 14:03 ` Avery Pennarun 2008-05-16 14:05 ` David Kastrup @ 2008-05-16 14:32 ` Johannes Schindelin 2008-05-16 14:56 ` Avery Pennarun 1 sibling, 1 reply; 21+ messages in thread From: Johannes Schindelin @ 2008-05-16 14:32 UTC (permalink / raw) To: Avery Pennarun; +Cc: Tommy Thorn, git Hi, On Fri, 16 May 2008, Avery Pennarun wrote: > On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > BTW I do not know how much fgetc() instead of fgets() slows > > down things, but I expect both to be equally fast because > > they are both buffered, right? > > In my experience, fgetc() is pretty fantastically slow because you > have a function call for every byte (and, I gather, modern libc does > thread locking for every fgetc). It's usually much faster to fread() > into a buffer and then access the buffer. Hmpf. I hoped to get more definitive information here. Especially given that fgetc() is nothing more than a glorified fread() into a buffer, and then access the buffer. Well, at least you kind of pointed me to the _unlocked() function family. Ciao, Dscho ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 14:32 ` Johannes Schindelin @ 2008-05-16 14:56 ` Avery Pennarun 2008-05-16 23:59 ` Johannes Schindelin 0 siblings, 1 reply; 21+ messages in thread From: Avery Pennarun @ 2008-05-16 14:56 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Tommy Thorn, git On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > Hmpf. I hoped to get more definitive information here. Especially given > that fgetc() is nothing more than a glorified fread() into a buffer, and > then access the buffer. > > Well, at least you kind of pointed me to the _unlocked() function family. Point taken. /tmp $ for d in test1 test2 test3 test3u; do echo -n "$d: "; /usr/bin/time ./$d </dev/zero; done test1: 0.09user 0.05system 0:00.14elapsed 94%CPU test2: 2.50user 0.05system 0:02.54elapsed 100%CPU test3: 2.48user 0.06system 0:02.53elapsed 100%CPU test3u: 1.05user 0.05system 0:01.10elapsed 99%CPU fread is about 18x faster than fgetc(). getc() is the same speed as fgetc(). getc_unlocked() is definitely faster than getc, but still at least 7x slower than fread(). And if you think *that* sucks, you should try "c << cin" in C++ :) Source code below. Have fun, Avery === test1.c === #include <stdio.h> int main() { char buf[1024]; int i; for (i = 0; i < 102400; i++) fread(buf, 1, sizeof(buf), stdin); } === test2.c === #include <stdio.h> int main() { int i; for (i = 0; i < 1024*102400; i++) fgetc(stdin); } === test3.c === #include <stdio.h> int main() { int i; for (i = 0; i < 1024*102400; i++) getc(stdin); } === test3u.c === #include <stdio.h> int main() { int i; for (i = 0; i < 1024*102400; i++) getc_unlocked(stdin); } ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 14:56 ` Avery Pennarun @ 2008-05-16 23:59 ` Johannes Schindelin 2008-05-17 0:06 ` Tommy Thorn 2008-05-17 10:07 ` Stephen R. van den Berg 0 siblings, 2 replies; 21+ messages in thread From: Johannes Schindelin @ 2008-05-16 23:59 UTC (permalink / raw) To: Avery Pennarun; +Cc: Tommy Thorn, git Hi, On Fri, 16 May 2008, Avery Pennarun wrote: > On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hmpf. I hoped to get more definitive information here. Especially given > > that fgetc() is nothing more than a glorified fread() into a buffer, and > > then access the buffer. > > > > Well, at least you kind of pointed me to the _unlocked() function family. > > Point taken. > > /tmp $ for d in test1 test2 test3 test3u; do echo -n "$d: "; > /usr/bin/time ./$d </dev/zero; done > test1: 0.09user 0.05system 0:00.14elapsed 94%CPU > test2: 2.50user 0.05system 0:02.54elapsed 100%CPU > test3: 2.48user 0.06system 0:02.53elapsed 100%CPU > test3u: 1.05user 0.05system 0:01.10elapsed 99%CPU > > fread is about 18x faster than fgetc(). getc() is the same speed as > fgetc(). getc_unlocked() is definitely faster than getc, but still at > least 7x slower than fread(). Well, my question was more about fgetc() vs fgets(). If you feel like it, you might benchmark this patch: -- snipsnap -- diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c index 021dc16..5d8defd 100644 --- a/builtin-mailsplit.c +++ b/builtin-mailsplit.c @@ -45,13 +45,32 @@ static int is_from_line(const char *line, int len) /* Could be as small as 64, enough to hold a Unix "From " line. */ static char buf[4096]; +/* + * This is an ugly hack to avoid fgetc(), which is slow, as it is locking. + * The argument "in" must be the same for all calls to this function! + */ +static int fast_fgetc(FILE *in) +{ + static char buf[4096]; + static int offset = 0, len = 0; + + if (offset >= len) { + len = fread(buf, 1, sizeof(buf), in); + offset = 0; + if (!len && feof(in)) + return EOF; + } + + return buf[offset++]; +} + /* We cannot use fgets() because our lines can contain NULs */ int read_line_with_nul(char *buf, int size, FILE *in) { int len = 0, c; for (;;) { - c = fgetc(in); + c = fast_fgetc(in); buf[len++] = c; if (c == EOF || c == '\n' || len + 1 >= size) break; ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 23:59 ` Johannes Schindelin @ 2008-05-17 0:06 ` Tommy Thorn 2008-05-17 0:26 ` Johannes Schindelin 2008-05-17 10:07 ` Stephen R. van den Berg 1 sibling, 1 reply; 21+ messages in thread From: Tommy Thorn @ 2008-05-17 0:06 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Avery Pennarun, git Johannes Schindelin wrote: > +/* > + * This is an ugly hack to avoid fgetc(), which is slow, as it is locking. > + * The argument "in" must be the same for all calls to this function! > + */ > +static int fast_fgetc(FILE *in) > +{ > Looks great to me, but shouldn't you add an "inline" for this one? Also, maybe a double the buffer size. Tommy ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-17 0:06 ` Tommy Thorn @ 2008-05-17 0:26 ` Johannes Schindelin 0 siblings, 0 replies; 21+ messages in thread From: Johannes Schindelin @ 2008-05-17 0:26 UTC (permalink / raw) To: Tommy Thorn; +Cc: Avery Pennarun, git Hi, On Fri, 16 May 2008, Tommy Thorn wrote: > Johannes Schindelin wrote: > > +/* > > + * This is an ugly hack to avoid fgetc(), which is slow, as it is locking. > > + * The argument "in" must be the same for all calls to this function! > > + */ > > +static int fast_fgetc(FILE *in) > > +{ > > > > Looks great to me, but shouldn't you add an "inline" for this one? Also, > maybe a double the buffer size. No. This is an ugly hack, and not meant for application. If that is substantially faster than the fgetc() version (and I want this be tested in a _real-world_ scenario, i.e. not the fgetc() alone, but a real mailsplit and a real mailinfo on a huge patch, with all three versions: fgets(), fgetc() and fast_fgetc())), then I would prefer having something like struct line_reader { FILE *in; char buffer[4096]; int offset, int len; char line[1024]; int linelen; }; and corresponding functions to read lines in that setting. Maybe it would even be better to have line be a strbuf, but I am not so sure on that. Let's see what the tests show. Would you do them, please? "git format-patch --stdout bla..blub | /usr/bin/time git mailsplit -o." three times in succession should give you a good hint on the runtime. Ciao, Dscho ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 23:59 ` Johannes Schindelin 2008-05-17 0:06 ` Tommy Thorn @ 2008-05-17 10:07 ` Stephen R. van den Berg 2008-05-17 10:18 ` Johannes Schindelin 1 sibling, 1 reply; 21+ messages in thread From: Stephen R. van den Berg @ 2008-05-17 10:07 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Avery Pennarun, Tommy Thorn, git Johannes Schindelin wrote: >On Fri, 16 May 2008, Avery Pennarun wrote: >> fread is about 18x faster than fgetc(). getc() is the same speed as >> fgetc(). getc_unlocked() is definitely faster than getc, but still at >> least 7x slower than fread(). >Well, my question was more about fgetc() vs fgets(). >If you feel like it, you might benchmark this patch: Wouldn't it be better to improve the implementation of getc() in glibc instead? getc() is meant to be the fast version of fgetc(), and if it isn't (anymore), then the library needs fixing. -- Sincerely, srb@cuci.nl Stephen R. van den Berg. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-17 10:07 ` Stephen R. van den Berg @ 2008-05-17 10:18 ` Johannes Schindelin 0 siblings, 0 replies; 21+ messages in thread From: Johannes Schindelin @ 2008-05-17 10:18 UTC (permalink / raw) To: Stephen R. van den Berg; +Cc: Avery Pennarun, Tommy Thorn, git Hi, On Sat, 17 May 2008, Stephen R. van den Berg wrote: > Johannes Schindelin wrote: > >On Fri, 16 May 2008, Avery Pennarun wrote: > >> fread is about 18x faster than fgetc(). getc() is the same speed as > >> fgetc(). getc_unlocked() is definitely faster than getc, but still at > >> least 7x slower than fread(). > > >Well, my question was more about fgetc() vs fgets(). > >If you feel like it, you might benchmark this patch: > > Wouldn't it be better to improve the implementation of getc() > in glibc instead? Sure, because glibc is used on Windows, AIX, Solaris, etc. > getc() is meant to be the fast version of fgetc(), and if it isn't > (anymore), then the library needs fixing. fgetc() has to work reliably in a threaded environment, too. So I guess that it does all kinds of locks that slow it down, but it is not something to be "fixed". Hth, Dscho ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <200805161539.29259.brian.foster@innova-card.com>]
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters [not found] ` <200805161539.29259.brian.foster@innova-card.com> @ 2008-05-16 14:07 ` Brian Foster 2008-05-16 14:14 ` David Kastrup 2008-05-16 14:29 ` Johannes Schindelin 0 siblings, 2 replies; 21+ messages in thread From: Brian Foster @ 2008-05-16 14:07 UTC (permalink / raw) To: Johannes Schindelin, Tommy Thorn; +Cc: git two quibbles of no great importance ... Johannes Schindelin suggested: > The function fgets() has a big problem with NUL characters: it reads > them, but nobody will know if the NUL comes from the file stream, or > was appended at the end of the line. > > So implement a custom read_line() function. ^^^^^^^^^^^ read_line_with_nul() meaning read part or all of one line which may contain NULs. >[ ... ] > diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c > index 46b27cd..021dc16 100644 > --- a/builtin-mailsplit.c > +++ b/builtin-mailsplit.c > @@ -45,6 +45,25 @@ static int is_from_line(const char *line, int len) > /* Could be as small as 64, enough to hold a Unix "From " line. */ > static char buf[4096]; > > +/* We cannot use fgets() because our lines can contain NULs */ > +int read_line_with_nul(char *buf, int size, FILE *in) > +{ > + int len = 0, c; > + > + for (;;) { > + c = fgetc(in); > + buf[len++] = c; > + if (c == EOF || c == '\n' || len + 1 >= size) > + break; > + } > + > + if (c == EOF) > + len--; > + buf[len] = '\0'; > + > + return len; when fgetc(3) — why not use getc(3)? — returns EOF it is pointlessly stored in buf[] (as a 'char'!), len's advanced, and then the storage and advancing are undone. isn't that a bit silly? untested: assert(2 <= size); do { if ((c = getc(in)) == EOF) break; } while (((buf[len++] = c) != '\n' && len+1 < size); buf[len] = '\0' return len; I'd tend to write this in terms of pointers, something along the lines (untested): char *p, *endp; assert(1 <= size); p = buf; endp = p + (size-1); while (p < endp) { if ((c = getc(in)) == EOF || (*p++ = c) == '\n') break; } *p = '\0'; return p - buf; > + > +} -- "How many surrealists does it take to | Brian Foster change a lightbulb? Three. One calms | somewhere in south of France the warthog, and two fill the bathtub | Stop E$$o (ExxonMobil)! with brightly-coloured machine tools." | http://www.stopesso.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 14:07 ` Brian Foster @ 2008-05-16 14:14 ` David Kastrup 2008-05-16 14:29 ` Johannes Schindelin 1 sibling, 0 replies; 21+ messages in thread From: David Kastrup @ 2008-05-16 14:14 UTC (permalink / raw) To: git "Brian Foster" <brian.foster@innova-card.com> writes: > I'd tend to write this in terms of pointers, > something along the lines (untested): > > char *p, *endp; > > assert(1 <= size); > p = buf; > endp = p + (size-1); > while (p < endp) { > if ((c = getc(in)) == EOF || (*p++ = c) == '\n') > break; > } > *p = '\0'; > > return p - buf; Leave optimization to the compiler. Using pointer arithmetic more often than not screws up loop optimization and strength reduction. -- David Kastrup ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 14:07 ` Brian Foster 2008-05-16 14:14 ` David Kastrup @ 2008-05-16 14:29 ` Johannes Schindelin 2008-05-16 14:33 ` David Kastrup 1 sibling, 1 reply; 21+ messages in thread From: Johannes Schindelin @ 2008-05-16 14:29 UTC (permalink / raw) To: Brian Foster; +Cc: Tommy Thorn, git [-- Attachment #1: Type: TEXT/PLAIN, Size: 2543 bytes --] Hi, On Fri, 16 May 2008, Brian Foster wrote: > Johannes Schindelin suggested: > > The function fgets() has a big problem with NUL characters: it reads > > them, but nobody will know if the NUL comes from the file stream, or > > was appended at the end of the line. > > > > So implement a custom read_line() function. > ^^^^^^^^^^^ > read_line_with_nul() > meaning read part or all of one line which may contain NULs. Right. > >[ ... ] > > diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c > > index 46b27cd..021dc16 100644 > > --- a/builtin-mailsplit.c > > +++ b/builtin-mailsplit.c > > @@ -45,6 +45,25 @@ static int is_from_line(const char *line, int len) > > /* Could be as small as 64, enough to hold a Unix "From " line. */ > > static char buf[4096]; > > > > +/* We cannot use fgets() because our lines can contain NULs */ > > +int read_line_with_nul(char *buf, int size, FILE *in) > > +{ > > + int len = 0, c; > > + > > + for (;;) { > > + c = fgetc(in); > > + buf[len++] = c; > > + if (c == EOF || c == '\n' || len + 1 >= size) > > + break; > > + } > > + > > + if (c == EOF) > > + len--; > > + buf[len] = '\0'; > > + > > + return len; > > when fgetc(3) — why not use getc(3)? - Because mailsplit can read from a file, too. > returns EOF it is pointlessly stored in buf[] (as a 'char'!), len's > advanced, and then the storage and advancing are undone. isn't that a > bit silly? I left it at that, because it is a rare case, the buffer has to be accessed with the trailing NUL anyway, and I think it is worth to have this function quite readable. I, for one, am pretty certain that I understand what this function does, and how, in 6 months from now, without any additional documentation. > untested: > > assert(2 <= size); > do { > if ((c = getc(in)) == EOF) > break; > } while (((buf[len++] = c) != '\n' && len+1 < size); > buf[len] = '\0' > > return len; ... except this is unreadable at best ;-) > I'd tend to write this in terms of pointers, > something along the lines (untested): > > char *p, *endp; > > assert(1 <= size); > p = buf; > endp = p + (size-1); > while (p < endp) { > if ((c = getc(in)) == EOF || (*p++ = c) == '\n') > break; > } > *p = '\0'; > > return p - buf; Again, I think this is too cuddled. You have to think about every second line, and that makes for stupid mistakes with this developer. Ciao, Dscho ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 14:29 ` Johannes Schindelin @ 2008-05-16 14:33 ` David Kastrup 0 siblings, 0 replies; 21+ messages in thread From: David Kastrup @ 2008-05-16 14:33 UTC (permalink / raw) To: git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > On Fri, 16 May 2008, Brian Foster wrote: >> >> when fgetc(3) — why not use getc(3)? - > > Because mailsplit can read from a file, too. Huh? getc != getchar -- David Kastrup ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-16 13:03 ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin 2008-05-16 14:03 ` Avery Pennarun [not found] ` <200805161539.29259.brian.foster@innova-card.com> @ 2008-05-21 18:08 ` Junio C Hamano 2008-05-22 10:38 ` Johannes Schindelin 2 siblings, 1 reply; 21+ messages in thread From: Junio C Hamano @ 2008-05-21 18:08 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Tommy Thorn, git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > The function fgets() has a big problem with NUL characters: it reads > them, but nobody will know if the NUL comes from the file stream, or > was appended at the end of the line. > > So implement a custom read_line() function. Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have to wonder if this is enough. The loop seems to stop at (*op == NUL) which follows an old assumption that each line is terminated with NUL, not the new assumption you introduced that each line's length is kept in local variable len. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-21 18:08 ` Junio C Hamano @ 2008-05-22 10:38 ` Johannes Schindelin 2008-05-22 17:44 ` Junio C Hamano 0 siblings, 1 reply; 21+ messages in thread From: Johannes Schindelin @ 2008-05-22 10:38 UTC (permalink / raw) To: Junio C Hamano; +Cc: Tommy Thorn, git Hi, On Wed, 21 May 2008, Junio C Hamano wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > > The function fgets() has a big problem with NUL characters: it reads > > them, but nobody will know if the NUL comes from the file stream, or > > was appended at the end of the line. > > > > So implement a custom read_line() function. > > Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have > to wonder if this is enough. The loop seems to stop at (*op == NUL) > which follows an old assumption that each line is terminated with NUL, > not the new assumption you introduced that each line's length is kept in > local variable len. Of course! But does BASE64 and QP contain NULs? After all, even my custom read_line_with_nul() function adds a NUL IIRC. Well, the biggest problem here is my lack of time. I thought I would give Tommy a patch which kinda works, and he would actually hold through to brush it up until it shines and gets into git.git, because it is not _my_ itch. Hmmmmmmm. Ciao, Dscho ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-22 10:38 ` Johannes Schindelin @ 2008-05-22 17:44 ` Junio C Hamano 2008-05-23 11:21 ` Johannes Schindelin 0 siblings, 1 reply; 21+ messages in thread From: Junio C Hamano @ 2008-05-22 17:44 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Tommy Thorn, git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: >> Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have >> to wonder if this is enough. The loop seems to stop at (*op == NUL) >> which follows an old assumption that each line is terminated with NUL, >> not the new assumption you introduced that each line's length is kept in >> local variable len. > > Of course! But does BASE64 and QP contain NULs? The loop in question iterates over bytes _after_ decoding these encoded lines, and a typical reason you would encode the payload is because it contains something not safe over e-mail transfer, e.g. NUL. I think decode_transfer_encoding() also needs to become safe against NULs in the payload. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters 2008-05-22 17:44 ` Junio C Hamano @ 2008-05-23 11:21 ` Johannes Schindelin 0 siblings, 0 replies; 21+ messages in thread From: Johannes Schindelin @ 2008-05-23 11:21 UTC (permalink / raw) To: Junio C Hamano; +Cc: Tommy Thorn, git Hi, On Thu, 22 May 2008, Junio C Hamano wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > >> Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have > >> to wonder if this is enough. The loop seems to stop at (*op == NUL) > >> which follows an old assumption that each line is terminated with NUL, > >> not the new assumption you introduced that each line's length is kept in > >> local variable len. > > > > Of course! But does BASE64 and QP contain NULs? > > The loop in question iterates over bytes _after_ decoding these encoded > lines, and a typical reason you would encode the payload is because it > contains something not safe over e-mail transfer, e.g. NUL. > > I think decode_transfer_encoding() also needs to become safe against NULs > in the payload. Okay, I missed that. Ciao, Dscho ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2008-05-23 11:22 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-05-15 7:27 git bug: rebase fatal failure Tommy Thorn 2008-05-16 10:41 ` Johannes Schindelin 2008-05-16 11:01 ` Johannes Schindelin 2008-05-16 13:03 ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin 2008-05-16 14:03 ` Avery Pennarun 2008-05-16 14:05 ` David Kastrup 2008-05-16 14:32 ` Johannes Schindelin 2008-05-16 14:56 ` Avery Pennarun 2008-05-16 23:59 ` Johannes Schindelin 2008-05-17 0:06 ` Tommy Thorn 2008-05-17 0:26 ` Johannes Schindelin 2008-05-17 10:07 ` Stephen R. van den Berg 2008-05-17 10:18 ` Johannes Schindelin [not found] ` <200805161539.29259.brian.foster@innova-card.com> 2008-05-16 14:07 ` Brian Foster 2008-05-16 14:14 ` David Kastrup 2008-05-16 14:29 ` Johannes Schindelin 2008-05-16 14:33 ` David Kastrup 2008-05-21 18:08 ` Junio C Hamano 2008-05-22 10:38 ` Johannes Schindelin 2008-05-22 17:44 ` Junio C Hamano 2008-05-23 11:21 ` Johannes Schindelin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).