git bug: rebase fatal failure

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* git bug: rebase fatal failure
@ 2008-05-15  7:27 Tommy Thorn
  2008-05-16 10:41 ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Tommy Thorn @ 2008-05-15  7:27 UTC (permalink / raw)
  To: git

Hi,

I have large-ish repository (143 MB) where git rebase failed 
unexpectedly (see below). If anyone is interested in looking at this, I 
have left a copy of the repository at 
http://thorn.ws/git-fatal-rebase-error.tar.gz

Tommy
PS: Please add me in CC as I'm no longer on this list.

$ git --version
git version 1.5.5.1.93.ge3f3e

$ git rebase cacao my-cacao-build
First, rewinding head to replay your work on top of it...
Applying Import binutils-2.18
.dotest/patch:16878: trailing whitespace.

.dotest/patch:17923: trailing whitespace.
  0. Additional Definitions.
.dotest/patch:18024: trailing whitespace.
       Version.
.dotest/patch:18094: trailing whitespace.
//
.dotest/patch:18099: trailing whitespace.
//
fatal: corrupt patch at line 260912
Patch failed at 0001.

When you have resolved this problem run "git rebase --continue".
If you would prefer to skip this patch, instead run "git rebase --skip".
To restore the original branch and stop rebasing run "git rebase --abort".

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git bug: rebase fatal failure
  2008-05-15  7:27 git bug: rebase fatal failure Tommy Thorn
@ 2008-05-16 10:41 ` Johannes Schindelin
  2008-05-16 11:01   ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 10:41 UTC (permalink / raw)
  To: Tommy Thorn; +Cc: git

Hi,

On Thu, 15 May 2008, Tommy Thorn wrote:

> I have large-ish repository (143 MB) where git rebase failed 
> unexpectedly (see below). If anyone is interested in looking at this, I 
> have left a copy of the repository at 
> http://thorn.ws/git-fatal-rebase-error.tar.gz

I can reproduce.  It seems that our diff machinery generates a file that 
adds a file with 10668 lines, but says 10669 lines in the hunk header.

It is a .info file, so I suspect strange things going on with 
non-printable ASCII characters.

I'm on it,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: git bug: rebase fatal failure
  2008-05-16 10:41 ` Johannes Schindelin
@ 2008-05-16 11:01   ` Johannes Schindelin
  2008-05-16 13:03     ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 11:01 UTC (permalink / raw)
  To: Tommy Thorn; +Cc: git

Hi,

On Fri, 16 May 2008, Johannes Schindelin wrote:

> It is a .info file, so I suspect strange things going on with 
> non-printable ASCII characters.

And indeed, this is the culprit:

-- snip --
BFD Index
*********

^@^H[index^@^H]
* Menu:

-- snap --

Note the NUL characters?

Now, the thing is: git format-patch still outputs it correctly.  But 
git-rebase pipes the output to git-am, which in turn calls git-mailsplit, 
which suppresses that line.

Will keep you posted,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 11:01   ` Johannes Schindelin
@ 2008-05-16 13:03     ` Johannes Schindelin
  2008-05-16 14:03       ` Avery Pennarun
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 13:03 UTC (permalink / raw)
  To: Tommy Thorn; +Cc: git


The function fgets() has a big problem with NUL characters: it reads
them, but nobody will know if the NUL comes from the file stream, or
was appended at the end of the line.

So implement a custom read_line() function.

Noticed by Tommy Thorn.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	Sorry for the binary patch: the file t5100/nul contains NUL
	characters, obviously.

	BTW I do not know how much fgetc() instead of fgets() slows
	down things, but I expect both to be equally fast because
	they are both buffered, right?

 builtin-mailinfo.c  |   24 +++++++++++++-----------
 builtin-mailsplit.c |   27 +++++++++++++++++++++++----
 builtin.h           |    1 +
 t/t5100-mailinfo.sh |    9 +++++++++
 t/t5100/nul         |  Bin 0 -> 91 bytes
 5 files changed, 46 insertions(+), 15 deletions(-)
 create mode 100644 t/t5100/nul

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 11f154b..f0c4209 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -641,7 +641,7 @@ static void decode_transfer_encoding(char *line, unsigned linesize)
 	}
 }
 
-static int handle_filter(char *line, unsigned linesize);
+static int handle_filter(char *line, unsigned linesize, int linelen);
 
 static int find_boundary(void)
 {
@@ -669,7 +669,7 @@ again:
 					"can't recover\n");
 			exit(1);
 		}
-		handle_filter(newline, sizeof(newline));
+		handle_filter(newline, sizeof(newline), strlen(newline));
 
 		/* skip to the next boundary */
 		if (!find_boundary())
@@ -759,14 +759,14 @@ static int handle_commit_msg(char *line, unsigned linesize)
 	return 0;
 }
 
-static int handle_patch(char *line)
+static int handle_patch(char *line, int len)
 {
-	fputs(line, patchfile);
+	fwrite(line, 1, len, patchfile);
 	patch_lines++;
 	return 0;
 }
 
-static int handle_filter(char *line, unsigned linesize)
+static int handle_filter(char *line, unsigned linesize, int linelen)
 {
 	static int filter = 0;
 
@@ -779,7 +779,7 @@ static int handle_filter(char *line, unsigned linesize)
 			break;
 		filter++;
 	case 1:
-		if (!handle_patch(line))
+		if (!handle_patch(line, linelen))
 			break;
 		filter++;
 	default:
@@ -794,6 +794,7 @@ static void handle_body(void)
 	int rc = 0;
 	static char newline[2000];
 	static char *np = newline;
+	int len = strlen(line);
 
 	/* Skip up to the first boundary */
 	if (content_top->boundary) {
@@ -807,7 +808,8 @@ static void handle_body(void)
 			/* flush any leftover */
 			if ((transfer_encoding == TE_BASE64)  &&
 			    (np != newline)) {
-				handle_filter(newline, sizeof(newline));
+				handle_filter(newline, sizeof(newline),
+						strlen(newline));
 			}
 			if (!handle_boundary())
 				return;
@@ -824,7 +826,7 @@ static void handle_body(void)
 
 			/* binary data most likely doesn't have newlines */
 			if (message_type != TYPE_TEXT) {
-				rc = handle_filter(line, sizeof(newline));
+				rc = handle_filter(line, sizeof(line), len);
 				break;
 			}
 
@@ -841,7 +843,7 @@ static void handle_body(void)
 					/* should be sitting on a new line */
 					*(++np) = 0;
 					op++;
-					rc = handle_filter(newline, sizeof(newline));
+					rc = handle_filter(newline, sizeof(newline), np - newline);
 					np = newline;
 				}
 			} while (*op != 0);
@@ -851,12 +853,12 @@ static void handle_body(void)
 			break;
 		}
 		default:
-			rc = handle_filter(line, sizeof(newline));
+			rc = handle_filter(line, sizeof(line), len);
 		}
 		if (rc)
 			/* nothing left to filter */
 			break;
-	} while (fgets(line, sizeof(line), fin));
+	} while ((len = read_line_with_nul(line, sizeof(line), fin)));
 
 	return;
 }
diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c
index 46b27cd..021dc16 100644
--- a/builtin-mailsplit.c
+++ b/builtin-mailsplit.c
@@ -45,6 +45,25 @@ static int is_from_line(const char *line, int len)
 /* Could be as small as 64, enough to hold a Unix "From " line. */
 static char buf[4096];
 
+/* We cannot use fgets() because our lines can contain NULs */
+int read_line_with_nul(char *buf, int size, FILE *in)
+{
+	int len = 0, c;
+
+	for (;;) {
+		c = fgetc(in);
+		buf[len++] = c;
+		if (c == EOF || c == '\n' || len + 1 >= size)
+			break;
+	}
+
+	if (c == EOF)
+		len--;
+	buf[len] = '\0';
+
+	return len;
+}
+
 /* Called with the first line (potentially partial)
  * already in buf[] -- normally that should begin with
  * the Unix "From " line.  Write it into the specified
@@ -70,19 +89,19 @@ static int split_one(FILE *mbox, const char *name, int allow_bare)
 	 * "From " and having something that looks like a date format.
 	 */
 	for (;;) {
-		int is_partial = (buf[len-1] != '\n');
+		int is_partial = len && buf[len-1] != '\n';
 
-		if (fputs(buf, output) == EOF)
+		if (fwrite(buf, 1, len, output) != len)
 			die("cannot write output");
 
-		if (fgets(buf, sizeof(buf), mbox) == NULL) {
+		len = read_line_with_nul(buf, sizeof(buf), mbox);
+		if (len == 0) {
 			if (feof(mbox)) {
 				status = 1;
 				break;
 			}
 			die("cannot read mbox");
 		}
-		len = strlen(buf);
 		if (!is_partial && !is_bare && is_from_line(buf, len))
 			break; /* done with one message */
 	}
diff --git a/builtin.h b/builtin.h
index c630d5b..d0a0ead 100644
--- a/builtin.h
+++ b/builtin.h
@@ -9,6 +9,7 @@ extern const char git_usage_string[];
 extern void list_common_cmds_help(void);
 extern void help_unknown_cmd(const char *cmd);
 extern void prune_packed_objects(int);
+extern int read_line_with_nul(char *buf, int size, FILE *file);
 
 extern int cmd_add(int argc, const char **argv, const char *prefix);
 extern int cmd_annotate(int argc, const char **argv, const char *prefix);
diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index d6c55c1..5a4610b 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -25,4 +25,13 @@ do
 		diff ../t5100/info$mail info$mail"
 done
 
+test_expect_success 'respect NULs' '
+
+	git mailsplit -d3 -o. ../t5100/nul &&
+	cmp ../t5100/nul 001 &&
+	(cat 001 | git mailinfo msg patch) &&
+	test 4 = $(wc -l < patch)
+
+'
+
 test_done
diff --git a/t/t5100/nul b/t/t5100/nul
new file mode 100644
index 0000000000000000000000000000000000000000..3d40691787b855cc0133514a19052492eb853d21
GIT binary patch
literal 91
zcmW;6y$ygM5C%|6a#MT@Tm%~v2e7kZ0t`Q)fHOej_C}MJcXX*}a!Gh_N`s3x>;_}@
mA68>55i?ULDS<hc3BM!}T;HU$lNvE*_bo@vIHuA{lcE=x<rtIz

literal 0
HcmV?d00001

-- 
1.5.5.1.425.g5f464.dirty

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 13:03     ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin
@ 2008-05-16 14:03       ` Avery Pennarun
  2008-05-16 14:05         ` David Kastrup
  2008-05-16 14:32         ` Johannes Schindelin
       [not found]       ` <200805161539.29259.brian.foster@innova-card.com>
  2008-05-21 18:08       ` Junio C Hamano
  2 siblings, 2 replies; 21+ messages in thread
From: Avery Pennarun @ 2008-05-16 14:03 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Tommy Thorn, git

On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>         BTW I do not know how much fgetc() instead of fgets() slows
>         down things, but I expect both to be equally fast because
>         they are both buffered, right?

In my experience, fgetc() is pretty fantastically slow because you
have a function call for every byte (and, I gather, modern libc does
thread locking for every fgetc).  It's usually much faster to fread()
into a buffer and then access the buffer.  I don't know if that's
appropriate (or matters) here, though.

Have fun,

Avery

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 14:03       ` Avery Pennarun
@ 2008-05-16 14:05         ` David Kastrup
  2008-05-16 14:32         ` Johannes Schindelin
  1 sibling, 0 replies; 21+ messages in thread
From: David Kastrup @ 2008-05-16 14:05 UTC (permalink / raw)
  To: git

"Avery Pennarun" <apenwarr@gmail.com> writes:

> On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>>         BTW I do not know how much fgetc() instead of fgets() slows
>>         down things, but I expect both to be equally fast because
>>         they are both buffered, right?
>
> In my experience, fgetc() is pretty fantastically slow because you
> have a function call for every byte (and, I gather, modern libc does
> thread locking for every fgetc).  It's usually much faster to fread()
> into a buffer and then access the buffer.  I don't know if that's
> appropriate (or matters) here, though.

Is getc an option?

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
       [not found]       ` <200805161539.29259.brian.foster@innova-card.com>
@ 2008-05-16 14:07         ` Brian Foster
  2008-05-16 14:14           ` David Kastrup
  2008-05-16 14:29           ` Johannes Schindelin
  0 siblings, 2 replies; 21+ messages in thread
From: Brian Foster @ 2008-05-16 14:07 UTC (permalink / raw)
  To: Johannes Schindelin, Tommy Thorn; +Cc: git

two quibbles of no great importance ...

Johannes Schindelin suggested:
> The function fgets() has a big problem with NUL characters: it reads
> them, but nobody will know if the NUL comes from the file stream, or
> was appended at the end of the line.
>
> So implement a custom read_line() function.
                        ^^^^^^^^^^^
                        read_line_with_nul()
meaning read part or all of one line which may contain NULs.

>[ ... ]
> diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c
> index 46b27cd..021dc16 100644
> --- a/builtin-mailsplit.c
> +++ b/builtin-mailsplit.c
> @@ -45,6 +45,25 @@ static int is_from_line(const char *line, int len)
>  /* Could be as small as 64, enough to hold a Unix "From " line. */
>  static char buf[4096];
>
> +/* We cannot use fgets() because our lines can contain NULs */
> +int read_line_with_nul(char *buf, int size, FILE *in)
> +{
> +     int len = 0, c;
> +
> +     for (;;) {
> +             c = fgetc(in);
> +             buf[len++] = c;
> +             if (c == EOF || c == '\n' || len + 1 >= size)
> +                     break;
> +     }
> +
> +     if (c == EOF)
> +             len--;
> +     buf[len] = '\0';
> +
> +     return len;

 when fgetc(3) — why not use getc(3)? — returns EOF
 it is pointlessly stored in buf[] (as a 'char'!),
 len's advanced, and then the storage and advancing
 are undone.  isn't that a bit silly?   untested:

	assert(2 <= size);
	do {
		if ((c = getc(in)) == EOF)
			break;
	} while (((buf[len++] = c) != '\n' && len+1 < size);
	buf[len] = '\0'

	return len;

 I'd tend to write this in terms of pointers,
 something along the lines (untested):

	char	*p, *endp;

	assert(1 <= size);
	p    = buf;
	endp = p + (size-1);
	while (p < endp) {
		if ((c = getc(in)) == EOF || (*p++ = c) == '\n')
			break;
	}
	*p = '\0';

	return p - buf;

> +
> +}

-- 
"How many surrealists does it take to   | Brian Foster
 change a lightbulb? Three. One calms   | somewhere in south of France
 the warthog, and two fill the bathtub  |   Stop E$$o (ExxonMobil)!
 with brightly-coloured machine tools." |      http://www.stopesso.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 14:07         ` Brian Foster
@ 2008-05-16 14:14           ` David Kastrup
  2008-05-16 14:29           ` Johannes Schindelin
  1 sibling, 0 replies; 21+ messages in thread
From: David Kastrup @ 2008-05-16 14:14 UTC (permalink / raw)
  To: git

"Brian Foster" <brian.foster@innova-card.com> writes:

>  I'd tend to write this in terms of pointers,
>  something along the lines (untested):
>
> 	char	*p, *endp;
>
> 	assert(1 <= size);
> 	p    = buf;
> 	endp = p + (size-1);
> 	while (p < endp) {
> 		if ((c = getc(in)) == EOF || (*p++ = c) == '\n')
> 			break;
> 	}
> 	*p = '\0';
>
> 	return p - buf;

Leave optimization to the compiler.  Using pointer arithmetic more often
than not screws up loop optimization and strength reduction.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 14:07         ` Brian Foster
  2008-05-16 14:14           ` David Kastrup
@ 2008-05-16 14:29           ` Johannes Schindelin
  2008-05-16 14:33             ` David Kastrup
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 14:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: Tommy Thorn, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2543 bytes --]

Hi,

On Fri, 16 May 2008, Brian Foster wrote:

> Johannes Schindelin suggested:
> > The function fgets() has a big problem with NUL characters: it reads
> > them, but nobody will know if the NUL comes from the file stream, or
> > was appended at the end of the line.
> >
> > So implement a custom read_line() function.
>                         ^^^^^^^^^^^
>                         read_line_with_nul()
> meaning read part or all of one line which may contain NULs.

Right.

> >[ ... ]
> > diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c
> > index 46b27cd..021dc16 100644
> > --- a/builtin-mailsplit.c
> > +++ b/builtin-mailsplit.c
> > @@ -45,6 +45,25 @@ static int is_from_line(const char *line, int len)
> >  /* Could be as small as 64, enough to hold a Unix "From " line. */
> >  static char buf[4096];
> >
> > +/* We cannot use fgets() because our lines can contain NULs */
> > +int read_line_with_nul(char *buf, int size, FILE *in)
> > +{
> > +     int len = 0, c;
> > +
> > +     for (;;) {
> > +             c = fgetc(in);
> > +             buf[len++] = c;
> > +             if (c == EOF || c == '\n' || len + 1 >= size)
> > +                     break;
> > +     }
> > +
> > +     if (c == EOF)
> > +             len--;
> > +     buf[len] = '\0';
> > +
> > +     return len;
> 
>  when fgetc(3) — why not use getc(3)? -

Because mailsplit can read from a file, too.

>  returns EOF it is pointlessly stored in buf[] (as a 'char'!), len's 
>  advanced, and then the storage and advancing are undone.  isn't that a 
>  bit silly?

I left it at that, because it is a rare case, the buffer has to be 
accessed with the trailing NUL anyway, and I think it is worth to have 
this function quite readable.  I, for one, am pretty certain that I 
understand what this function does, and how, in 6 months from now, without 
any additional documentation.

>  untested:
> 
> 	assert(2 <= size);
> 	do {
> 		if ((c = getc(in)) == EOF)
> 			break;
> 	} while (((buf[len++] = c) != '\n' && len+1 < size);
> 	buf[len] = '\0'
> 
> 	return len;

... except this is unreadable at best ;-)

>  I'd tend to write this in terms of pointers,
>  something along the lines (untested):
> 
> 	char	*p, *endp;
> 
> 	assert(1 <= size);
> 	p    = buf;
> 	endp = p + (size-1);
> 	while (p < endp) {
> 		if ((c = getc(in)) == EOF || (*p++ = c) == '\n')
> 			break;
> 	}
> 	*p = '\0';
> 
> 	return p - buf;

Again, I think this is too cuddled.  You have to think about every second 
line, and that makes for stupid mistakes with this developer.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 14:03       ` Avery Pennarun
  2008-05-16 14:05         ` David Kastrup
@ 2008-05-16 14:32         ` Johannes Schindelin
  2008-05-16 14:56           ` Avery Pennarun
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 14:32 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Tommy Thorn, git

Hi,

On Fri, 16 May 2008, Avery Pennarun wrote:

> On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> >         BTW I do not know how much fgetc() instead of fgets() slows
> >         down things, but I expect both to be equally fast because
> >         they are both buffered, right?
> 
> In my experience, fgetc() is pretty fantastically slow because you
> have a function call for every byte (and, I gather, modern libc does
> thread locking for every fgetc).  It's usually much faster to fread()
> into a buffer and then access the buffer.

Hmpf.  I hoped to get more definitive information here.  Especially given 
that fgetc() is nothing more than a glorified fread() into a buffer, and 
then access the buffer.

Well, at least you kind of pointed me to the _unlocked() function family.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 14:29           ` Johannes Schindelin
@ 2008-05-16 14:33             ` David Kastrup
  0 siblings, 0 replies; 21+ messages in thread
From: David Kastrup @ 2008-05-16 14:33 UTC (permalink / raw)
  To: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Fri, 16 May 2008, Brian Foster wrote:
>> 
>>  when fgetc(3) — why not use getc(3)? -
>
> Because mailsplit can read from a file, too.

Huh?  getc != getchar

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 14:32         ` Johannes Schindelin
@ 2008-05-16 14:56           ` Avery Pennarun
  2008-05-16 23:59             ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Avery Pennarun @ 2008-05-16 14:56 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Tommy Thorn, git

On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hmpf.  I hoped to get more definitive information here.  Especially given
>  that fgetc() is nothing more than a glorified fread() into a buffer, and
>  then access the buffer.
>
>  Well, at least you kind of pointed me to the _unlocked() function family.

Point taken.

/tmp $ for d in test1 test2 test3 test3u; do echo -n "$d: ";
/usr/bin/time ./$d </dev/zero; done
test1: 0.09user 0.05system 0:00.14elapsed 94%CPU
test2: 2.50user 0.05system 0:02.54elapsed 100%CPU
test3: 2.48user 0.06system 0:02.53elapsed 100%CPU
test3u: 1.05user 0.05system 0:01.10elapsed 99%CPU

fread is about 18x faster than fgetc().  getc() is the same speed as
fgetc().  getc_unlocked() is definitely faster than getc, but still at
least 7x slower than fread().

And if you think *that* sucks, you should try "c << cin" in C++ :)

Source code below.

Have fun,

Avery


=== test1.c ===
#include <stdio.h>

int main()
{
    char buf[1024];
    int i;
    for (i = 0; i < 102400; i++)
        fread(buf, 1, sizeof(buf), stdin);
}

=== test2.c ===
#include <stdio.h>

int main()
{
    int i;
    for (i = 0; i < 1024*102400; i++)
        fgetc(stdin);
}

=== test3.c ===
#include <stdio.h>

int main()
{
    int i;
    for (i = 0; i < 1024*102400; i++)
        getc(stdin);
}

=== test3u.c ===
#include <stdio.h>

int main()
{
    int i;
    for (i = 0; i < 1024*102400; i++)
        getc_unlocked(stdin);
}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 14:56           ` Avery Pennarun
@ 2008-05-16 23:59             ` Johannes Schindelin
  2008-05-17  0:06               ` Tommy Thorn
  2008-05-17 10:07               ` Stephen R. van den Berg
  0 siblings, 2 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-16 23:59 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Tommy Thorn, git

Hi,

On Fri, 16 May 2008, Avery Pennarun wrote:

> On 5/16/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > Hmpf.  I hoped to get more definitive information here.  Especially given
> >  that fgetc() is nothing more than a glorified fread() into a buffer, and
> >  then access the buffer.
> >
> >  Well, at least you kind of pointed me to the _unlocked() function family.
> 
> Point taken.
> 
> /tmp $ for d in test1 test2 test3 test3u; do echo -n "$d: ";
> /usr/bin/time ./$d </dev/zero; done
> test1: 0.09user 0.05system 0:00.14elapsed 94%CPU
> test2: 2.50user 0.05system 0:02.54elapsed 100%CPU
> test3: 2.48user 0.06system 0:02.53elapsed 100%CPU
> test3u: 1.05user 0.05system 0:01.10elapsed 99%CPU
> 
> fread is about 18x faster than fgetc().  getc() is the same speed as
> fgetc().  getc_unlocked() is definitely faster than getc, but still at
> least 7x slower than fread().

Well, my question was more about fgetc() vs fgets().

If you feel like it, you might benchmark this patch:

-- snipsnap --
diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c
index 021dc16..5d8defd 100644
--- a/builtin-mailsplit.c
+++ b/builtin-mailsplit.c
@@ -45,13 +45,32 @@ static int is_from_line(const char *line, int len)
 /* Could be as small as 64, enough to hold a Unix "From " line. */
 static char buf[4096];
 
+/*
+ *  This is an ugly hack to avoid fgetc(), which is slow, as it is locking.
+ *  The argument "in" must be the same for all calls to this function!
+ */
+static int fast_fgetc(FILE *in)
+{
+	static char buf[4096];
+	static int offset = 0, len = 0;
+
+	if (offset >= len) {
+		len = fread(buf, 1, sizeof(buf), in);
+		offset = 0;
+		if (!len && feof(in))
+			return EOF;
+	}
+
+	return buf[offset++];
+}
+
 /* We cannot use fgets() because our lines can contain NULs */
 int read_line_with_nul(char *buf, int size, FILE *in)
 {
 	int len = 0, c;
 
 	for (;;) {
-		c = fgetc(in);
+		c = fast_fgetc(in);
 		buf[len++] = c;
 		if (c == EOF || c == '\n' || len + 1 >= size)
 			break;

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 23:59             ` Johannes Schindelin
@ 2008-05-17  0:06               ` Tommy Thorn
  2008-05-17  0:26                 ` Johannes Schindelin
  2008-05-17 10:07               ` Stephen R. van den Berg
  1 sibling, 1 reply; 21+ messages in thread
From: Tommy Thorn @ 2008-05-17  0:06 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Avery Pennarun, git

Johannes Schindelin wrote:
> +/*
> + *  This is an ugly hack to avoid fgetc(), which is slow, as it is locking.
> + *  The argument "in" must be the same for all calls to this function!
> + */
> +static int fast_fgetc(FILE *in)
> +{
>   

Looks great to me, but shouldn't you add an "inline" for this one? Also, 
maybe a double the buffer size.

Tommy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-17  0:06               ` Tommy Thorn
@ 2008-05-17  0:26                 ` Johannes Schindelin
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-17  0:26 UTC (permalink / raw)
  To: Tommy Thorn; +Cc: Avery Pennarun, git

Hi,

On Fri, 16 May 2008, Tommy Thorn wrote:

> Johannes Schindelin wrote:
> > +/*
> > + *  This is an ugly hack to avoid fgetc(), which is slow, as it is locking.
> > + *  The argument "in" must be the same for all calls to this function!
> > + */
> > +static int fast_fgetc(FILE *in)
> > +{
> >   
> 
> Looks great to me, but shouldn't you add an "inline" for this one? Also, 
> maybe a double the buffer size.

No.  This is an ugly hack, and not meant for application.

If that is substantially faster than the fgetc() version (and I want this 
be tested in a _real-world_ scenario, i.e. not the fgetc() alone, but a 
real mailsplit and a real mailinfo on a huge patch, with all three 
versions: fgets(), fgetc() and fast_fgetc())), then I would prefer having 
something like

	struct line_reader {
		FILE *in;
		char buffer[4096];
		int offset, int len;
		char line[1024];
		int linelen;
	};

and corresponding functions to read lines in that setting.  Maybe it would 
even be better to have line be a strbuf, but I am not so sure on that.

Let's see what the tests show.  Would you do them, please?  
"git format-patch --stdout bla..blub | /usr/bin/time git mailsplit -o." 
three times in succession should give you a good hint on the runtime.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 23:59             ` Johannes Schindelin
  2008-05-17  0:06               ` Tommy Thorn
@ 2008-05-17 10:07               ` Stephen R. van den Berg
  2008-05-17 10:18                 ` Johannes Schindelin
  1 sibling, 1 reply; 21+ messages in thread
From: Stephen R. van den Berg @ 2008-05-17 10:07 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Avery Pennarun, Tommy Thorn, git

Johannes Schindelin wrote:
>On Fri, 16 May 2008, Avery Pennarun wrote:
>> fread is about 18x faster than fgetc().  getc() is the same speed as
>> fgetc().  getc_unlocked() is definitely faster than getc, but still at
>> least 7x slower than fread().

>Well, my question was more about fgetc() vs fgets().
>If you feel like it, you might benchmark this patch:

Wouldn't it be better to improve the implementation of getc()
in glibc instead?

getc() is meant to be the fast version of fgetc(), and if it isn't
(anymore), then the library needs fixing.
-- 
Sincerely,                                                          srb@cuci.nl
           Stephen R. van den Berg.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-17 10:07               ` Stephen R. van den Berg
@ 2008-05-17 10:18                 ` Johannes Schindelin
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-17 10:18 UTC (permalink / raw)
  To: Stephen R. van den Berg; +Cc: Avery Pennarun, Tommy Thorn, git

Hi,

On Sat, 17 May 2008, Stephen R. van den Berg wrote:

> Johannes Schindelin wrote:
> >On Fri, 16 May 2008, Avery Pennarun wrote:
> >> fread is about 18x faster than fgetc().  getc() is the same speed as
> >> fgetc().  getc_unlocked() is definitely faster than getc, but still at
> >> least 7x slower than fread().
> 
> >Well, my question was more about fgetc() vs fgets().
> >If you feel like it, you might benchmark this patch:
> 
> Wouldn't it be better to improve the implementation of getc()
> in glibc instead?

Sure, because glibc is used on Windows, AIX, Solaris, etc.

> getc() is meant to be the fast version of fgetc(), and if it isn't 
> (anymore), then the library needs fixing.

fgetc() has to work reliably in a threaded environment, too.  So I guess 
that it does all kinds of locks that slow it down, but it is not something 
to be "fixed".

Hth,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-16 13:03     ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin
  2008-05-16 14:03       ` Avery Pennarun
       [not found]       ` <200805161539.29259.brian.foster@innova-card.com>
@ 2008-05-21 18:08       ` Junio C Hamano
  2008-05-22 10:38         ` Johannes Schindelin
  2 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2008-05-21 18:08 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Tommy Thorn, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> The function fgets() has a big problem with NUL characters: it reads
> them, but nobody will know if the NUL comes from the file stream, or
> was appended at the end of the line.
>
> So implement a custom read_line() function.

Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have
to wonder if this is enough.  The loop seems to stop at (*op == NUL) which
follows an old assumption that each line is terminated with NUL, not the
new assumption you introduced that each line's length is kept in local
variable len.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-21 18:08       ` Junio C Hamano
@ 2008-05-22 10:38         ` Johannes Schindelin
  2008-05-22 17:44           ` Junio C Hamano
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-22 10:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Tommy Thorn, git

Hi,

On Wed, 21 May 2008, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > The function fgets() has a big problem with NUL characters: it reads 
> > them, but nobody will know if the NUL comes from the file stream, or 
> > was appended at the end of the line.
> >
> > So implement a custom read_line() function.
> 
> Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have 
> to wonder if this is enough.  The loop seems to stop at (*op == NUL) 
> which follows an old assumption that each line is terminated with NUL, 
> not the new assumption you introduced that each line's length is kept in 
> local variable len.

Of course!  But does BASE64 and QP contain NULs?  After all, even my 
custom read_line_with_nul() function adds a NUL IIRC.

Well, the biggest problem here is my lack of time.  I thought I would give 
Tommy a patch which kinda works, and he would actually hold through to 
brush it up until it shines and gets into git.git, because it is not _my_ 
itch.

Hmmmmmmm.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-22 10:38         ` Johannes Schindelin
@ 2008-05-22 17:44           ` Junio C Hamano
  2008-05-23 11:21             ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2008-05-22 17:44 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Tommy Thorn, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have 
>> to wonder if this is enough.  The loop seems to stop at (*op == NUL) 
>> which follows an old assumption that each line is terminated with NUL, 
>> not the new assumption you introduced that each line's length is kept in 
>> local variable len.
>
> Of course!  But does BASE64 and QP contain NULs?

The loop in question iterates over bytes _after_ decoding these encoded
lines, and a typical reason you would encode the payload is because it
contains something not safe over e-mail transfer, e.g. NUL.

I think decode_transfer_encoding() also needs to become safe against NULs
in the payload.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters
  2008-05-22 17:44           ` Junio C Hamano
@ 2008-05-23 11:21             ` Johannes Schindelin
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2008-05-23 11:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Tommy Thorn, git

Hi,

On Thu, 22 May 2008, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> >> Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have 
> >> to wonder if this is enough.  The loop seems to stop at (*op == NUL) 
> >> which follows an old assumption that each line is terminated with NUL, 
> >> not the new assumption you introduced that each line's length is kept in 
> >> local variable len.
> >
> > Of course!  But does BASE64 and QP contain NULs?
> 
> The loop in question iterates over bytes _after_ decoding these encoded
> lines, and a typical reason you would encode the payload is because it
> contains something not safe over e-mail transfer, e.g. NUL.
> 
> I think decode_transfer_encoding() also needs to become safe against NULs
> in the payload.

Okay, I missed that.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2008-05-23 11:22 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-15  7:27 git bug: rebase fatal failure Tommy Thorn
2008-05-16 10:41 ` Johannes Schindelin
2008-05-16 11:01   ` Johannes Schindelin
2008-05-16 13:03     ` [PATCH] mailsplit and mailinfo: gracefully handle NUL characters Johannes Schindelin
2008-05-16 14:03       ` Avery Pennarun
2008-05-16 14:05         ` David Kastrup
2008-05-16 14:32         ` Johannes Schindelin
2008-05-16 14:56           ` Avery Pennarun
2008-05-16 23:59             ` Johannes Schindelin
2008-05-17  0:06               ` Tommy Thorn
2008-05-17  0:26                 ` Johannes Schindelin
2008-05-17 10:07               ` Stephen R. van den Berg
2008-05-17 10:18                 ` Johannes Schindelin
     [not found]       ` <200805161539.29259.brian.foster@innova-card.com>
2008-05-16 14:07         ` Brian Foster
2008-05-16 14:14           ` David Kastrup
2008-05-16 14:29           ` Johannes Schindelin
2008-05-16 14:33             ` David Kastrup
2008-05-21 18:08       ` Junio C Hamano
2008-05-22 10:38         ` Johannes Schindelin
2008-05-22 17:44           ` Junio C Hamano
2008-05-23 11:21             ` Johannes Schindelin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).