* git bug: rebase fatal failure
@ 2008-05-15 7:27 Tommy Thorn
2008-05-16 10:41 ` Johannes Schindelin
0 siblings, 1 reply; 21+ messages in thread
From: Tommy Thorn @ 2008-05-15 7:27 UTC (permalink / raw)
To: git
Hi,
I have large-ish repository (143 MB) where git rebase failed
unexpectedly (see below). If anyone is interested in looking at this, I
have left a copy of the repository at
http://thorn.ws/git-fatal-rebase-error.tar.gz
Tommy
PS: Please add me in CC as I'm no longer on this list.
$ git --version
git version 1.5.5.1.93.ge3f3e
$ git rebase cacao my-cacao-build
First, rewinding head to replay your work on top of it...
Applying Import binutils-2.18
.dotest/patch:16878: trailing whitespace.
.dotest/patch:17923: trailing whitespace.
0. Additional Definitions.
.dotest/patch:18024: trailing whitespace.
Version.
.dotest/patch:18094: trailing whitespace.
//
.dotest/patch:18099: trailing whitespace.
//
fatal: corrupt patch at line 260912
Patch failed at 0001.
When you have resolved this problem run "git rebase --continue".
If you would prefer to skip this patch, instead run "git rebase --skip".
To restore the original branch and stop rebasing run "git rebase --abort".
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git bug: rebase fatal failure
2008-05-15 7:27 git bug: rebase fatal failure Tommy Thorn
@ 2008-05-16 10:41 ` Johannes Schindelin
2008-05-16 11:01 ` Johannes Schindelin
0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 10:41 UTC (permalink / raw)
To: Tommy Thorn; +Cc: git
Hi,
On Thu, 15 May 2008, Tommy Thorn wrote:
> I have large-ish repository (143 MB) where git rebase failed
> unexpectedly (see below). If anyone is interested in looking at this, I
> have left a copy of the repository at
> http://thorn.ws/git-fatal-rebase-error.tar.gz
I can reproduce. It seems that our diff machinery generates a file that
adds a file with 10668 lines, but says 10669 lines in the hunk header.
It is a .info file, so I suspect strange things going on with
non-printable ASCII characters.
I'm on it,
Dscho
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: git bug: rebase fatal failure
2008-05-16 10:41 ` Johannes Schindelin
@ 2008-05-16 11:01 ` Johannes Schindelin
2008-05-16 13:03 ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin
0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 11:01 UTC (permalink / raw)
To: Tommy Thorn; +Cc: git
Hi,
On Fri, 16 May 2008, Johannes Schindelin wrote:
> It is a .info file, so I suspect strange things going on with
> non-printable ASCII characters.
And indeed, this is the culprit:
-- snip --
BFD Index
*********
^@^H[index^@^H]
* Menu:
-- snap --
Note the NUL characters?
Now, the thing is: git format-patch still outputs it correctly. But
git-rebase pipes the output to git-am, which in turn calls git-mailsplit,
which suppresses that line.
Will keep you posted,
Dscho
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 11:01 ` Johannes Schindelin
@ 2008-05-16 13:03 ` Johannes Schindelin
2008-05-16 14:03 ` Avery Pennarun
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 13:03 UTC (permalink / raw)
To: Tommy Thorn; +Cc: git
The function fgets() has a big problem with NUL characters: it reads
them, but nobody will know if the NUL comes from the file stream, or
was appended at the end of the line.
So implement a custom read_line() function.
Noticed by Tommy Thorn.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
Sorry for the binary patch: the file t5100/nul contains NUL
characters, obviously.
BTW I do not know how much fgetc() instead of fgets() slows
down things, but I expect both to be equally fast because
they are both buffered, right?
builtin-mailinfo.c | 24 +++++++++++++-----------
builtin-mailsplit.c | 27 +++++++++++++++++++++++----
builtin.h | 1 +
t/t5100-mailinfo.sh | 9 +++++++++
t/t5100/nul | Bin 0 -> 91 bytes
5 files changed, 46 insertions(+), 15 deletions(-)
create mode 100644 t/t5100/nul
diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 11f154b..f0c4209 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -641,7 +641,7 @@ static void decode_transfer_encoding(char *line, unsigned linesize)
}
}
-static int handle_filter(char *line, unsigned linesize);
+static int handle_filter(char *line, unsigned linesize, int linelen);
static int find_boundary(void)
{
@@ -669,7 +669,7 @@ again:
"can't recover\n");
exit(1);
}
- handle_filter(newline, sizeof(newline));
+ handle_filter(newline, sizeof(newline), strlen(newline));
/* skip to the next boundary */
if (!find_boundary())
@@ -759,14 +759,14 @@ static int handle_commit_msg(char *line, unsigned linesize)
return 0;
}
-static int handle_patch(char *line)
+static int handle_patch(char *line, int len)
{
- fputs(line, patchfile);
+ fwrite(line, 1, len, patchfile);
patch_lines++;
return 0;
}
-static int handle_filter(char *line, unsigned linesize)
+static int handle_filter(char *line, unsigned linesize, int linelen)
{
static int filter = 0;
@@ -779,7 +779,7 @@ static int handle_filter(char *line, unsigned linesize)
break;
filter++;
case 1:
- if (!handle_patch(line))
+ if (!handle_patch(line, linelen))
break;
filter++;
default:
@@ -794,6 +794,7 @@ static void handle_body(void)
int rc = 0;
static char newline[2000];
static char *np = newline;
+ int len = strlen(line);
/* Skip up to the first boundary */
if (content_top->boundary) {
@@ -807,7 +808,8 @@ static void handle_body(void)
/* flush any leftover */
if ((transfer_encoding == TE_BASE64) &&
(np != newline)) {
- handle_filter(newline, sizeof(newline));
+ handle_filter(newline, sizeof(newline),
+ strlen(newline));
}
if (!handle_boundary())
return;
@@ -824,7 +826,7 @@ static void handle_body(void)
/* binary data most likely doesn't have newlines */
if (message_type != TYPE_TEXT) {
- rc = handle_filter(line, sizeof(newline));
+ rc = handle_filter(line, sizeof(line), len);
break;
}
@@ -841,7 +843,7 @@ static void handle_body(void)
/* should be sitting on a new line */
*(++np) = 0;
op++;
- rc = handle_filter(newline, sizeof(newline));
+ rc = handle_filter(newline, sizeof(newline), np - newline);
np = newline;
}
} while (*op != 0);
@@ -851,12 +853,12 @@ static void handle_body(void)
break;
}
default:
- rc = handle_filter(line, sizeof(newline));
+ rc = handle_filter(line, sizeof(line), len);
}
if (rc)
/* nothing left to filter */
break;
- } while (fgets(line, sizeof(line), fin));
+ } while ((len = read_line_with_nul(line, sizeof(line), fin)));
return;
}
diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c
index 46b27cd..021dc16 100644
--- a/builtin-mailsplit.c
+++ b/builtin-mailsplit.c
@@ -45,6 +45,25 @@ static int is_from_line(const char *line, int len)
/* Could be as small as 64, enough to hold a Unix "From " line. */
static char buf[4096];
+/* We cannot use fgets() because our lines can contain NULs */
+int read_line_with_nul(char *buf, int size, FILE *in)
+{
+ int len = 0, c;
+
+ for (;;) {
+ c = fgetc(in);
+ buf[len++] = c;
+ if (c == EOF || c == '\n' || len + 1 >= size)
+ break;
+ }
+
+ if (c == EOF)
+ len--;
+ buf[len] = '\0';
+
+ return len;
+}
+
/* Called with the first line (potentially partial)
* already in buf[] -- normally that should begin with
* the Unix "From " line. Write it into the specified
@@ -70,19 +89,19 @@ static int split_one(FILE *mbox, const char *name, int allow_bare)
* "From " and having something that looks like a date format.
*/
for (;;) {
- int is_partial = (buf[len-1] != '\n');
+ int is_partial = len && buf[len-1] != '\n';
- if (fputs(buf, output) == EOF)
+ if (fwrite(buf, 1, len, output) != len)
die("cannot write output");
- if (fgets(buf, sizeof(buf), mbox) == NULL) {
+ len = read_line_with_nul(buf, sizeof(buf), mbox);
+ if (len == 0) {
if (feof(mbox)) {
status = 1;
break;
}
die("cannot read mbox");
}
- len = strlen(buf);
if (!is_partial && !is_bare && is_from_line(buf, len))
break; /* done with one message */
}
diff --git a/builtin.h b/builtin.h
index c630d5b..d0a0ead 100644
--- a/builtin.h
+++ b/builtin.h
@@ -9,6 +9,7 @@ extern const char git_usage_string[];
extern void list_common_cmds_help(void);
extern void help_unknown_cmd(const char *cmd);
extern void prune_packed_objects(int);
+extern int read_line_with_nul(char *buf, int size, FILE *file);
extern int cmd_add(int argc, const char **argv, const char *prefix);
extern int cmd_annotate(int argc, const char **argv, const char *prefix);
diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index d6c55c1..5a4610b 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -25,4 +25,13 @@ do
diff ../t5100/info$mail info$mail"
done
+test_expect_success 'respect NULs' '
+
+ git mailsplit -d3 -o. ../t5100/nul &&
+ cmp ../t5100/nul 001 &&
+ (cat 001 | git mailinfo msg patch) &&
+ test 4 = $(wc -l < patch)
+
+'
+
test_done
diff --git a/t/t5100/nul b/t/t5100/nul
new file mode 100644
index 0000000000000000000000000000000000000000..3d40691787b855cc0133514a19052492eb853d21
GIT binary patch
literal 91
zcmW;6y$ygM5C%|6a#MT@Tm%~v2e7kZ0t`Q)fHOej_C}MJcXX*}a!Gh_N`s3x>;_}@
mA68>55i?ULDS<hc3BM!}T;HU$lNvE*_bo@vIHuA{lcE=x<rtIz
literal 0
HcmV?d00001
--
1.5.5.1.425.g5f464.dirty
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 13:03 ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin
@ 2008-05-16 14:03 ` Avery Pennarun
2008-05-16 14:05 ` David Kastrup
2008-05-16 14:32 ` Johannes Schindelin
[not found] ` <200805161539.29259.brian.foster@innova-card.com>
2008-05-21 18:08 ` Junio C Hamano
2 siblings, 2 replies; 21+ messages in thread
From: Avery Pennarun @ 2008-05-16 14:03 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Tommy Thorn, git
On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> BTW I do not know how much fgetc() instead of fgets() slows
> down things, but I expect both to be equally fast because
> they are both buffered, right?
In my experience, fgetc() is pretty fantastically slow because you
have a function call for every byte (and, I gather, modern libc does
thread locking for every fgetc). It's usually much faster to fread()
into a buffer and then access the buffer. I don't know if that's
appropriate (or matters) here, though.
Have fun,
Avery
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 14:03 ` Avery Pennarun
@ 2008-05-16 14:05 ` David Kastrup
2008-05-16 14:32 ` Johannes Schindelin
1 sibling, 0 replies; 21+ messages in thread
From: David Kastrup @ 2008-05-16 14:05 UTC (permalink / raw)
To: git
"Avery Pennarun" <apenwarr@gmail.com> writes:
> On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> BTW I do not know how much fgetc() instead of fgets() slows
>> down things, but I expect both to be equally fast because
>> they are both buffered, right?
>
> In my experience, fgetc() is pretty fantastically slow because you
> have a function call for every byte (and, I gather, modern libc does
> thread locking for every fgetc). It's usually much faster to fread()
> into a buffer and then access the buffer. I don't know if that's
> appropriate (or matters) here, though.
Is getc an option?
--
David Kastrup
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
[not found] ` <200805161539.29259.brian.foster@innova-card.com>
@ 2008-05-16 14:07 ` Brian Foster
2008-05-16 14:14 ` David Kastrup
2008-05-16 14:29 ` Johannes Schindelin
0 siblings, 2 replies; 21+ messages in thread
From: Brian Foster @ 2008-05-16 14:07 UTC (permalink / raw)
To: Johannes Schindelin, Tommy Thorn; +Cc: git
two quibbles of no great importance ...
Johannes Schindelin suggested:
> The function fgets() has a big problem with NUL characters: it reads
> them, but nobody will know if the NUL comes from the file stream, or
> was appended at the end of the line.
>
> So implement a custom read_line() function.
^^^^^^^^^^^
read_line_with_nul()
meaning read part or all of one line which may contain NULs.
>[ ... ]
> diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c
> index 46b27cd..021dc16 100644
> --- a/builtin-mailsplit.c
> +++ b/builtin-mailsplit.c
> @@ -45,6 +45,25 @@ static int is_from_line(const char *line, int len)
> /* Could be as small as 64, enough to hold a Unix "From " line. */
> static char buf[4096];
>
> +/* We cannot use fgets() because our lines can contain NULs */
> +int read_line_with_nul(char *buf, int size, FILE *in)
> +{
> + int len = 0, c;
> +
> + for (;;) {
> + c = fgetc(in);
> + buf[len++] = c;
> + if (c == EOF || c == '\n' || len + 1 >= size)
> + break;
> + }
> +
> + if (c == EOF)
> + len--;
> + buf[len] = '\0';
> +
> + return len;
when fgetc(3) — why not use getc(3)? — returns EOF
it is pointlessly stored in buf[] (as a 'char'!),
len's advanced, and then the storage and advancing
are undone. isn't that a bit silly? untested:
assert(2 <= size);
do {
if ((c = getc(in)) == EOF)
break;
} while (((buf[len++] = c) != '\n' && len+1 < size);
buf[len] = '\0'
return len;
I'd tend to write this in terms of pointers,
something along the lines (untested):
char *p, *endp;
assert(1 <= size);
p = buf;
endp = p + (size-1);
while (p < endp) {
if ((c = getc(in)) == EOF || (*p++ = c) == '\n')
break;
}
*p = '\0';
return p - buf;
> +
> +}
--
"How many surrealists does it take to | Brian Foster
change a lightbulb? Three. One calms | somewhere in south of France
the warthog, and two fill the bathtub | Stop E$$o (ExxonMobil)!
with brightly-coloured machine tools." | http://www.stopesso.com
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 14:07 ` Brian Foster
@ 2008-05-16 14:14 ` David Kastrup
2008-05-16 14:29 ` Johannes Schindelin
1 sibling, 0 replies; 21+ messages in thread
From: David Kastrup @ 2008-05-16 14:14 UTC (permalink / raw)
To: git
"Brian Foster" <brian.foster@innova-card.com> writes:
> I'd tend to write this in terms of pointers,
> something along the lines (untested):
>
> char *p, *endp;
>
> assert(1 <= size);
> p = buf;
> endp = p + (size-1);
> while (p < endp) {
> if ((c = getc(in)) == EOF || (*p++ = c) == '\n')
> break;
> }
> *p = '\0';
>
> return p - buf;
Leave optimization to the compiler. Using pointer arithmetic more often
than not screws up loop optimization and strength reduction.
--
David Kastrup
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 14:07 ` Brian Foster
2008-05-16 14:14 ` David Kastrup
@ 2008-05-16 14:29 ` Johannes Schindelin
2008-05-16 14:33 ` David Kastrup
1 sibling, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 14:29 UTC (permalink / raw)
To: Brian Foster; +Cc: Tommy Thorn, git
[-- Attachment #1: Type: TEXT/PLAIN, Size: 2543 bytes --]
Hi,
On Fri, 16 May 2008, Brian Foster wrote:
> Johannes Schindelin suggested:
> > The function fgets() has a big problem with NUL characters: it reads
> > them, but nobody will know if the NUL comes from the file stream, or
> > was appended at the end of the line.
> >
> > So implement a custom read_line() function.
> ^^^^^^^^^^^
> read_line_with_nul()
> meaning read part or all of one line which may contain NULs.
Right.
> >[ ... ]
> > diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c
> > index 46b27cd..021dc16 100644
> > --- a/builtin-mailsplit.c
> > +++ b/builtin-mailsplit.c
> > @@ -45,6 +45,25 @@ static int is_from_line(const char *line, int len)
> > /* Could be as small as 64, enough to hold a Unix "From " line. */
> > static char buf[4096];
> >
> > +/* We cannot use fgets() because our lines can contain NULs */
> > +int read_line_with_nul(char *buf, int size, FILE *in)
> > +{
> > + int len = 0, c;
> > +
> > + for (;;) {
> > + c = fgetc(in);
> > + buf[len++] = c;
> > + if (c == EOF || c == '\n' || len + 1 >= size)
> > + break;
> > + }
> > +
> > + if (c == EOF)
> > + len--;
> > + buf[len] = '\0';
> > +
> > + return len;
>
> when fgetc(3) — why not use getc(3)? -
Because mailsplit can read from a file, too.
> returns EOF it is pointlessly stored in buf[] (as a 'char'!), len's
> advanced, and then the storage and advancing are undone. isn't that a
> bit silly?
I left it at that, because it is a rare case, the buffer has to be
accessed with the trailing NUL anyway, and I think it is worth to have
this function quite readable. I, for one, am pretty certain that I
understand what this function does, and how, in 6 months from now, without
any additional documentation.
> untested:
>
> assert(2 <= size);
> do {
> if ((c = getc(in)) == EOF)
> break;
> } while (((buf[len++] = c) != '\n' && len+1 < size);
> buf[len] = '\0'
>
> return len;
... except this is unreadable at best ;-)
> I'd tend to write this in terms of pointers,
> something along the lines (untested):
>
> char *p, *endp;
>
> assert(1 <= size);
> p = buf;
> endp = p + (size-1);
> while (p < endp) {
> if ((c = getc(in)) == EOF || (*p++ = c) == '\n')
> break;
> }
> *p = '\0';
>
> return p - buf;
Again, I think this is too cuddled. You have to think about every second
line, and that makes for stupid mistakes with this developer.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 14:03 ` Avery Pennarun
2008-05-16 14:05 ` David Kastrup
@ 2008-05-16 14:32 ` Johannes Schindelin
2008-05-16 14:56 ` Avery Pennarun
1 sibling, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 14:32 UTC (permalink / raw)
To: Avery Pennarun; +Cc: Tommy Thorn, git
Hi,
On Fri, 16 May 2008, Avery Pennarun wrote:
> On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > BTW I do not know how much fgetc() instead of fgets() slows
> > down things, but I expect both to be equally fast because
> > they are both buffered, right?
>
> In my experience, fgetc() is pretty fantastically slow because you
> have a function call for every byte (and, I gather, modern libc does
> thread locking for every fgetc). It's usually much faster to fread()
> into a buffer and then access the buffer.
Hmpf. I hoped to get more definitive information here. Especially given
that fgetc() is nothing more than a glorified fread() into a buffer, and
then access the buffer.
Well, at least you kind of pointed me to the _unlocked() function family.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 14:29 ` Johannes Schindelin
@ 2008-05-16 14:33 ` David Kastrup
0 siblings, 0 replies; 21+ messages in thread
From: David Kastrup @ 2008-05-16 14:33 UTC (permalink / raw)
To: git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> On Fri, 16 May 2008, Brian Foster wrote:
>>
>> when fgetc(3) — why not use getc(3)? -
>
> Because mailsplit can read from a file, too.
Huh? getc != getchar
--
David Kastrup
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 14:32 ` Johannes Schindelin
@ 2008-05-16 14:56 ` Avery Pennarun
2008-05-16 23:59 ` Johannes Schindelin
0 siblings, 1 reply; 21+ messages in thread
From: Avery Pennarun @ 2008-05-16 14:56 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Tommy Thorn, git
On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hmpf. I hoped to get more definitive information here. Especially given
> that fgetc() is nothing more than a glorified fread() into a buffer, and
> then access the buffer.
>
> Well, at least you kind of pointed me to the _unlocked() function family.
Point taken.
/tmp $ for d in test1 test2 test3 test3u; do echo -n "$d: ";
/usr/bin/time ./$d </dev/zero; done
test1: 0.09user 0.05system 0:00.14elapsed 94%CPU
test2: 2.50user 0.05system 0:02.54elapsed 100%CPU
test3: 2.48user 0.06system 0:02.53elapsed 100%CPU
test3u: 1.05user 0.05system 0:01.10elapsed 99%CPU
fread is about 18x faster than fgetc(). getc() is the same speed as
fgetc(). getc_unlocked() is definitely faster than getc, but still at
least 7x slower than fread().
And if you think *that* sucks, you should try "c << cin" in C++ :)
Source code below.
Have fun,
Avery
=== test1.c ===
#include <stdio.h>
int main()
{
char buf[1024];
int i;
for (i = 0; i < 102400; i++)
fread(buf, 1, sizeof(buf), stdin);
}
=== test2.c ===
#include <stdio.h>
int main()
{
int i;
for (i = 0; i < 1024*102400; i++)
fgetc(stdin);
}
=== test3.c ===
#include <stdio.h>
int main()
{
int i;
for (i = 0; i < 1024*102400; i++)
getc(stdin);
}
=== test3u.c ===
#include <stdio.h>
int main()
{
int i;
for (i = 0; i < 1024*102400; i++)
getc_unlocked(stdin);
}
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 14:56 ` Avery Pennarun
@ 2008-05-16 23:59 ` Johannes Schindelin
2008-05-17 0:06 ` Tommy Thorn
2008-05-17 10:07 ` Stephen R. van den Berg
0 siblings, 2 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 23:59 UTC (permalink / raw)
To: Avery Pennarun; +Cc: Tommy Thorn, git
Hi,
On Fri, 16 May 2008, Avery Pennarun wrote:
> On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > Hmpf. I hoped to get more definitive information here. Especially given
> > that fgetc() is nothing more than a glorified fread() into a buffer, and
> > then access the buffer.
> >
> > Well, at least you kind of pointed me to the _unlocked() function family.
>
> Point taken.
>
> /tmp $ for d in test1 test2 test3 test3u; do echo -n "$d: ";
> /usr/bin/time ./$d </dev/zero; done
> test1: 0.09user 0.05system 0:00.14elapsed 94%CPU
> test2: 2.50user 0.05system 0:02.54elapsed 100%CPU
> test3: 2.48user 0.06system 0:02.53elapsed 100%CPU
> test3u: 1.05user 0.05system 0:01.10elapsed 99%CPU
>
> fread is about 18x faster than fgetc(). getc() is the same speed as
> fgetc(). getc_unlocked() is definitely faster than getc, but still at
> least 7x slower than fread().
Well, my question was more about fgetc() vs fgets().
If you feel like it, you might benchmark this patch:
-- snipsnap --
diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c
index 021dc16..5d8defd 100644
--- a/builtin-mailsplit.c
+++ b/builtin-mailsplit.c
@@ -45,13 +45,32 @@ static int is_from_line(const char *line, int len)
/* Could be as small as 64, enough to hold a Unix "From " line. */
static char buf[4096];
+/*
+ * This is an ugly hack to avoid fgetc(), which is slow, as it is locking.
+ * The argument "in" must be the same for all calls to this function!
+ */
+static int fast_fgetc(FILE *in)
+{
+ static char buf[4096];
+ static int offset = 0, len = 0;
+
+ if (offset >= len) {
+ len = fread(buf, 1, sizeof(buf), in);
+ offset = 0;
+ if (!len && feof(in))
+ return EOF;
+ }
+
+ return buf[offset++];
+}
+
/* We cannot use fgets() because our lines can contain NULs */
int read_line_with_nul(char *buf, int size, FILE *in)
{
int len = 0, c;
for (;;) {
- c = fgetc(in);
+ c = fast_fgetc(in);
buf[len++] = c;
if (c == EOF || c == '\n' || len + 1 >= size)
break;
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 23:59 ` Johannes Schindelin
@ 2008-05-17 0:06 ` Tommy Thorn
2008-05-17 0:26 ` Johannes Schindelin
2008-05-17 10:07 ` Stephen R. van den Berg
1 sibling, 1 reply; 21+ messages in thread
From: Tommy Thorn @ 2008-05-17 0:06 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Avery Pennarun, git
Johannes Schindelin wrote:
> +/*
> + * This is an ugly hack to avoid fgetc(), which is slow, as it is locking.
> + * The argument "in" must be the same for all calls to this function!
> + */
> +static int fast_fgetc(FILE *in)
> +{
>
Looks great to me, but shouldn't you add an "inline" for this one? Also,
maybe a double the buffer size.
Tommy
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-17 0:06 ` Tommy Thorn
@ 2008-05-17 0:26 ` Johannes Schindelin
0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-17 0:26 UTC (permalink / raw)
To: Tommy Thorn; +Cc: Avery Pennarun, git
Hi,
On Fri, 16 May 2008, Tommy Thorn wrote:
> Johannes Schindelin wrote:
> > +/*
> > + * This is an ugly hack to avoid fgetc(), which is slow, as it is locking.
> > + * The argument "in" must be the same for all calls to this function!
> > + */
> > +static int fast_fgetc(FILE *in)
> > +{
> >
>
> Looks great to me, but shouldn't you add an "inline" for this one? Also,
> maybe a double the buffer size.
No. This is an ugly hack, and not meant for application.
If that is substantially faster than the fgetc() version (and I want this
be tested in a _real-world_ scenario, i.e. not the fgetc() alone, but a
real mailsplit and a real mailinfo on a huge patch, with all three
versions: fgets(), fgetc() and fast_fgetc())), then I would prefer having
something like
struct line_reader {
FILE *in;
char buffer[4096];
int offset, int len;
char line[1024];
int linelen;
};
and corresponding functions to read lines in that setting. Maybe it would
even be better to have line be a strbuf, but I am not so sure on that.
Let's see what the tests show. Would you do them, please?
"git format-patch --stdout bla..blub | /usr/bin/time git mailsplit -o."
three times in succession should give you a good hint on the runtime.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 23:59 ` Johannes Schindelin
2008-05-17 0:06 ` Tommy Thorn
@ 2008-05-17 10:07 ` Stephen R. van den Berg
2008-05-17 10:18 ` Johannes Schindelin
1 sibling, 1 reply; 21+ messages in thread
From: Stephen R. van den Berg @ 2008-05-17 10:07 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Avery Pennarun, Tommy Thorn, git
Johannes Schindelin wrote:
>On Fri, 16 May 2008, Avery Pennarun wrote:
>> fread is about 18x faster than fgetc(). getc() is the same speed as
>> fgetc(). getc_unlocked() is definitely faster than getc, but still at
>> least 7x slower than fread().
>Well, my question was more about fgetc() vs fgets().
>If you feel like it, you might benchmark this patch:
Wouldn't it be better to improve the implementation of getc()
in glibc instead?
getc() is meant to be the fast version of fgetc(), and if it isn't
(anymore), then the library needs fixing.
--
Sincerely, srb@cuci.nl
Stephen R. van den Berg.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-17 10:07 ` Stephen R. van den Berg
@ 2008-05-17 10:18 ` Johannes Schindelin
0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-17 10:18 UTC (permalink / raw)
To: Stephen R. van den Berg; +Cc: Avery Pennarun, Tommy Thorn, git
Hi,
On Sat, 17 May 2008, Stephen R. van den Berg wrote:
> Johannes Schindelin wrote:
> >On Fri, 16 May 2008, Avery Pennarun wrote:
> >> fread is about 18x faster than fgetc(). getc() is the same speed as
> >> fgetc(). getc_unlocked() is definitely faster than getc, but still at
> >> least 7x slower than fread().
>
> >Well, my question was more about fgetc() vs fgets().
> >If you feel like it, you might benchmark this patch:
>
> Wouldn't it be better to improve the implementation of getc()
> in glibc instead?
Sure, because glibc is used on Windows, AIX, Solaris, etc.
> getc() is meant to be the fast version of fgetc(), and if it isn't
> (anymore), then the library needs fixing.
fgetc() has to work reliably in a threaded environment, too. So I guess
that it does all kinds of locks that slow it down, but it is not something
to be "fixed".
Hth,
Dscho
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-16 13:03 ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin
2008-05-16 14:03 ` Avery Pennarun
[not found] ` <200805161539.29259.brian.foster@innova-card.com>
@ 2008-05-21 18:08 ` Junio C Hamano
2008-05-22 10:38 ` Johannes Schindelin
2 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2008-05-21 18:08 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Tommy Thorn, git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> The function fgets() has a big problem with NUL characters: it reads
> them, but nobody will know if the NUL comes from the file stream, or
> was appended at the end of the line.
>
> So implement a custom read_line() function.
Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have
to wonder if this is enough. The loop seems to stop at (*op == NUL) which
follows an old assumption that each line is terminated with NUL, not the
new assumption you introduced that each line's length is kept in local
variable len.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-21 18:08 ` Junio C Hamano
@ 2008-05-22 10:38 ` Johannes Schindelin
2008-05-22 17:44 ` Junio C Hamano
0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-22 10:38 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Tommy Thorn, git
Hi,
On Wed, 21 May 2008, Junio C Hamano wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > The function fgets() has a big problem with NUL characters: it reads
> > them, but nobody will know if the NUL comes from the file stream, or
> > was appended at the end of the line.
> >
> > So implement a custom read_line() function.
>
> Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have
> to wonder if this is enough. The loop seems to stop at (*op == NUL)
> which follows an old assumption that each line is terminated with NUL,
> not the new assumption you introduced that each line's length is kept in
> local variable len.
Of course! But does BASE64 and QP contain NULs? After all, even my
custom read_line_with_nul() function adds a NUL IIRC.
Well, the biggest problem here is my lack of time. I thought I would give
Tommy a patch which kinda works, and he would actually hold through to
brush it up until it shines and gets into git.git, because it is not _my_
itch.
Hmmmmmmm.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-22 10:38 ` Johannes Schindelin
@ 2008-05-22 17:44 ` Junio C Hamano
2008-05-23 11:21 ` Johannes Schindelin
0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2008-05-22 17:44 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Tommy Thorn, git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>> Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have
>> to wonder if this is enough. The loop seems to stop at (*op == NUL)
>> which follows an old assumption that each line is terminated with NUL,
>> not the new assumption you introduced that each line's length is kept in
>> local variable len.
>
> Of course! But does BASE64 and QP contain NULs?
The loop in question iterates over bytes _after_ decoding these encoded
lines, and a typical reason you would encode the payload is because it
contains something not safe over e-mail transfer, e.g. NUL.
I think decode_transfer_encoding() also needs to become safe against NULs
in the payload.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
2008-05-22 17:44 ` Junio C Hamano
@ 2008-05-23 11:21 ` Johannes Schindelin
0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-23 11:21 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Tommy Thorn, git
Hi,
On Thu, 22 May 2008, Junio C Hamano wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have
> >> to wonder if this is enough. The loop seems to stop at (*op == NUL)
> >> which follows an old assumption that each line is terminated with NUL,
> >> not the new assumption you introduced that each line's length is kept in
> >> local variable len.
> >
> > Of course! But does BASE64 and QP contain NULs?
>
> The loop in question iterates over bytes _after_ decoding these encoded
> lines, and a typical reason you would encode the payload is because it
> contains something not safe over e-mail transfer, e.g. NUL.
>
> I think decode_transfer_encoding() also needs to become safe against NULs
> in the payload.
Okay, I missed that.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2008-05-23 11:22 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-15 7:27 git bug: rebase fatal failure Tommy Thorn
2008-05-16 10:41 ` Johannes Schindelin
2008-05-16 11:01 ` Johannes Schindelin
2008-05-16 13:03 ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin
2008-05-16 14:03 ` Avery Pennarun
2008-05-16 14:05 ` David Kastrup
2008-05-16 14:32 ` Johannes Schindelin
2008-05-16 14:56 ` Avery Pennarun
2008-05-16 23:59 ` Johannes Schindelin
2008-05-17 0:06 ` Tommy Thorn
2008-05-17 0:26 ` Johannes Schindelin
2008-05-17 10:07 ` Stephen R. van den Berg
2008-05-17 10:18 ` Johannes Schindelin
[not found] ` <200805161539.29259.brian.foster@innova-card.com>
2008-05-16 14:07 ` Brian Foster
2008-05-16 14:14 ` David Kastrup
2008-05-16 14:29 ` Johannes Schindelin
2008-05-16 14:33 ` David Kastrup
2008-05-21 18:08 ` Junio C Hamano
2008-05-22 10:38 ` Johannes Schindelin
2008-05-22 17:44 ` Junio C Hamano
2008-05-23 11:21 ` Johannes Schindelin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).