* cleaner/better zlib sources?
From: Linus Torvalds @ 2007-03-16 1:04 UTC (permalink / raw)
To: Git Mailing List
I looked at git profiles yesterday, and some of them are pretty scary. We
spend about 50% of the time under some loads in just zlib uncompression,
and when I actually looked closer at the zlib sources I can kind of
understand why. That thing is horrid.
The sad part is that it looks like it should be quite possible to make
zlib simply just perform better. The profiles seem to say that a lot of
the cost is literally in the "inflate()" state machine code (and by that I
mean *not* the code itself, but literally in the indirect jump generated
by the case-statement).
Now, on any high-performance CPU, doing state-machines by having
for (;;)
switch (data->state) {
...
data->state = NEW_STATE;
continue;
}
(which is what zlib seems to be doing) is just about the worst possible
way to code things.
Now, it's possible that I'm just wrong, but the instruction-level profile
really did pinpoint the "look up state branch pointer and jump to it" as
some of the hottest part of that function. Which is just *evil*. You can
most likely use direct jumps within the loop (zero cost at all on most OoO
CPU's) most of the time, and the entry condition is likely quite
predictable too, so a lot of that overhead seems to be just sad and
unnecessary.
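To make the contrast concrete, here is a small toy sketch (made-up states,
nothing to do with zlib's actual code): the first function pays an indirect
branch through the switch on every single transition, the second replaces
each transition with a direct, statically predictable jump.

#include <stdio.h>

enum state { A, B, DONE };

/* dispatch through the switch on every transition */
static void run_switch(void)
{
	enum state st = A;
	for (;;) {
		switch (st) {
		case A:
			puts("state A");
			st = B;
			break;		/* back to the switch: indirect jump */
		case B:
			puts("state B");
			st = DONE;
			break;
		case DONE:
			return;
		}
	}
}

/* the same machine, with direct jumps between the states */
static void run_goto(void)
{
	puts("state A");
	goto state_b;			/* unconditional direct jump */
state_b:
	puts("state B");
}

int main(void)
{
	run_switch();
	run_goto();
	return 0;
}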
Now, I'm just wondering if anybody knows if there are better zlib
implementations out there? This really looks like it could be a noticeable
performance issue, but I'm lazy and would be much happier to hear that
somebody has already played with optimizing zlib. Especially since I'm not
100% sure it's really going to be noticeable..
Linus
* Re: cleaner/better zlib sources?
From: Shawn O. Pearce @ 2007-03-16 1:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> I looked at git profiles yesterday, and some of them are pretty scary. We
> spend about 50% of the time under some loads in just zlib uncompression,
> and when I actually looked closer at the zlib sources I can kind of
> understand why. That thing is horrid.
Yes. This is actually one of the motivations behind pack v4.
We don't store the "important bits" of commits and trees in zlib
compressed format at all, allowing us to completely bypass the
inflate() penalty you describe.
We're already much faster on the linux-2.6 kernel tree, and that's
*with* converting the pack raw data into text, then reparsing
that text into a struct commit* or a struct name_entry using the
current code. We're also planning on reworking those parsers to
parse the raw pack data, allowing us to save some very unnecessary
raw->string->raw conversion time.
But Nico and I are still looking to use zlib for commit messages and
blob content, so any improvements to inflate (or its replacement)
would still be most helpful.
--
Shawn.
* Re: cleaner/better zlib sources?
From: Jeff Garzik @ 2007-03-16 1:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, mpm, bcrl
Linus Torvalds wrote:
> Now, it's possible that I'm just wrong, but the instruction-level profile
> really did pinpoint the "look up state branch pointer and jump to it" as
> some of the hottest part of that function. Which is just *evil*. You can
ISTR there are a bunch of state transitions per byte, which would make
sense that it shows up on profiles.
> Now, I'm just wondering if anybody knows if there are better zlib
> implementations out there? This really looks like it could be a noticeable
> performance issue, but I'm lazy and would be much happier to hear that
> somebody has already played with optimizing zlib. Especially since I'm not
> 100% sure it's really going to be noticeable..
I could have sworn that either Matt Mackall or Ben LaHaise had cleaned
up the existing zlib so much that it was practically a new
implementation. I'm not aware of any open source implementations
independent of zlib (except maybe that C++ behemoth, 7zip).
Jeff
* Re: cleaner/better zlib sources?
From: Matt Mackall @ 2007-03-16 1:14 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Linus Torvalds, Git Mailing List, bcrl
On Thu, Mar 15, 2007 at 09:11:49PM -0400, Jeff Garzik wrote:
> Linus Torvalds wrote:
> >Now, it's possible that I'm just wrong, but the instruction-level profile
> >really did pinpoint the "look up state branch pointer and jump to it" as
> >some of the hottest part of that function. Which is just *evil*. You can
>
> ISTR there are a bunch of state transitions per byte, which would make
> sense that it shows up on profiles.
Yep, not surprising.
> >Now, I'm just wondering if anybody knows if there are better zlib
> >implementations out there? This really looks like it could be a noticeable
> >performance issue, but I'm lazy and would be much happier to hear that
> >somebody has already played with optimizing zlib. Especially since I'm not
> >100% sure it's really going to be noticeable..
>
> I could have sworn that either Matt Mackall or Ben LaHaise had cleaned
> up the existing zlib so much that it was practically a new
> implementation. I'm not aware of any open source implementations
> independent of zlib (except maybe that C++ behemoth, 7zip).
I cleaned up the version in lib/ that's used for boot on most systems.
It's quite a bit simpler and cleaner than the code in lib/zlib (and
elsewhere!). But making it faster is another matter entirely - I don't
know off-hand how the two compare.
--
Mathematics is the supreme nostalgia of our time.
* Re: cleaner/better zlib sources?
From: Davide Libenzi @ 2007-03-16 1:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
On Thu, 15 Mar 2007, Linus Torvalds wrote:
>
> I looked at git profiles yesterday, and some of them are pretty scary. We
> spend about 50% of the time under some loads in just zlib uncompression,
> and when I actually looked closer at the zlib sources I can kind of
> understand why. That thing is horrid.
>
> The sad part is that it looks like it should be quite possible to make
> zlib simply just perform better. The profiles seem to say that a lot of
> the cost is literally in the "inflate()" state machine code (and by that I
> mean *not* the code itself, but literally in the indirect jump generated
> by the case-statement).
>
> Now, on any high-performance CPU, doing state-machines by having
>
> for (;;)
> switch (data->state) {
> ...
> data->state = NEW_STATE;
> continue;
> }
>
> (which is what zlib seems to be doing) is just about the worst possible
> way to code things.
A quick hack would be to just define:
#define SWITCH_LBL(n) \
case n: \
lbl_##n:
#define STATE_CHANGE(s) \
state->mode = s; \
goto lbl_##s;
Then replace all the "state->mode = STATE; break;" into STATE_CHANGE(STATE);
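As a self-contained toy (a plain variable instead of state->mode, made-up
states, and a do { } while (0) wrapper so STATE_CHANGE behaves like a single
statement), the two macros compose like this:

#include <stdio.h>

enum mode { HEAD, TYPEDO, DONE };

#define SWITCH_LBL(n)	case n: lbl_##n:
#define STATE_CHANGE(s)	do { st = s; goto lbl_##s; } while (0)

int main(void)
{
	enum mode st = HEAD;

	for (;;)
		switch (st) {
		SWITCH_LBL(HEAD)
			puts("HEAD");
			STATE_CHANGE(TYPEDO);	/* direct jump, no trip back through the switch */
		SWITCH_LBL(TYPEDO)
			puts("TYPEDO");
			STATE_CHANGE(DONE);
		SWITCH_LBL(DONE)
			return 0;
		}
}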
I'm giving it a try as we speak ...
- Davide
* Re: cleaner/better zlib sources?
From: Linus Torvalds @ 2007-03-16 1:46 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Git Mailing List, mpm, bcrl
On Thu, 15 Mar 2007, Jeff Garzik wrote:
>
> ISTR there are a bunch of state transitions per byte, which would make sense
> that it shows up on profiles.
Yeah. I only looked at the kernel sources (which include a cleaned-up
older version of zlib), but they seemed to match the disassembly that I
saw when doing the instruction-level profiling.
The code *sometimes* falls through from one state to another, ie there's a
case where zlib does:
...
case DICTID:
NEEDBITS(32);
strm->adler = state->check = REVERSE(hold);
INITBITS();
state->mode = DICT;
case DICT:
...
which will obviously generate fine code. There's a few other examples of
that, *BUT* most of the stuff is just horrid, like
case HEAD:
if (state->wrap == 0) {
state->mode = TYPEDO;
break;
}
where the "break" (which simply breaks out of the case-statement, and thus
just loops back to it:
for (;;)
switch (state->mode) {
thing is nasty nasty nasty).
A trivial thing to do is to just replace such
state->mode = xxx;
break;
with
state->mode = xxx;
goto xxx_mode;
and all of that complicated run-time code *just*goes*away* and is replaced
by a relative no-op (ie an unconditional direct jump).
Some of them are slightly more complex, like
state->mode = hold & 0x200 ? DICTID : TYPE;
INITBITS();
break;
which would need to be rewritten as
old_hold = hold;
INITBITS();
if (old_hold & 0x200) {
state->mode = DICTID;
goto dictid_mode;
}
state->mode = TYPE;
goto type_mode;
but while that looks more complicated on a source code level it's a *lot*
easier for a CPU to actually execute.
Same obvious performance problems go for
case STORED:
...
case COPY:
copy = state->length;
if (copy) {
...
state->length -= copy;
break;
}
Tracev((stderr, "inflate: stored end\n"));
state->mode = TYPE;
break;
notice how when you get to the COPY state it will actually have nicely
fallen through from STORED (one of the places where it does that), but
then it will go through the expensive indirect jump not just once, but at
least *TWICE* just to get to the TYPE thing, because you'll first have to
re-enter COPY with a count of zero to get to the place where it sets TYPE,
and then does the indirect jump immediately again!
In other words, the code is *incredibly* badly written from a performance
angle.
Yeah, a perfect compiler could do this all for us even with unchanged zlib
source code, but (a) gcc isn't perfect and (b) I don't really think
anything else is either, although if things like this happen as part of
gzip in SpecInt, I wouldn't be surprised by compilers doing things like
this just to look good ;)
Especially with something like git, where we know that most of the time we
have the full input buffer (so inflate() generally won't be called in the
"middle" of a inflate event), we probably have a very well-defined start
state too, so we'd actually predict perfectlyon the *first* indirect jump
if we just never did any more of them.
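For reference, the calling pattern in question looks roughly like this (a
sketch of the one-shot shape, not git's actual code): the whole compressed
object and the whole output buffer are handed over up front, so there is
exactly one entry into the state machine per object.

#include <string.h>
#include "zlib.h"

int inflate_object(unsigned char *src, unsigned long srclen,
		   unsigned char *dst, unsigned long dstlen)
{
	z_stream s;
	int ret;

	memset(&s, 0, sizeof(s));
	if (inflateInit(&s) != Z_OK)
		return -1;
	s.next_in = src;
	s.avail_in = (uInt)srclen;
	s.next_out = dst;
	s.avail_out = (uInt)dstlen;
	ret = inflate(&s, Z_FINISH);	/* all input, all output, one call */
	inflateEnd(&s);
	return ret == Z_STREAM_END ? 0 : -1;
}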
> I could have sworn that either Matt Mackall or Ben LaHaise had cleaned up the
> existing zlib so much that it was practically a new implementation. I'm not
> aware of any open source implementations independent of zlib (except maybe
> that C++ behemoth, 7zip).
I looked at the latest release (1.2.3), so either Matt's/Ben's work hasn't
been merged, or this wasn't part of the thing they looked at.
Linus
* Re: cleaner/better zlib sources?
From: Linus Torvalds @ 2007-03-16 1:54 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Git Mailing List, mpm, bcrl, Davide Libenzi
On Thu, 15 Mar 2007, Linus Torvalds wrote:
>
> Same obvious performance problems go for
>
> case COPY:
As an example, I *think* this patch to zlib-1.2.3 not only generates
better code, but is (a) shorter and (b) more logical anyway.
Together with Davide's suggestion on using C macro expansion to make most
of the mode switches simple branches, it might get rid of most of the
indirect branches (to get rid of them all, you'd have to also find the
places where we *don't* set a new state, because it stays the same like
this one, and the ones where we have conditionals on what the mode is
going to be..)
Of course, the zlib sources are pretty horrid for other reasons (K&R
source code meant to be compiled on 16-bit architectures too). But that's
a separate issue, and at least shouldn't affect the resulting code
quality..
Linus
---
inflate.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/inflate.c b/inflate.c
index 792fdee..fb26f39 100644
--- a/inflate.c
+++ b/inflate.c
@@ -819,7 +819,7 @@ int flush;
state->mode = COPY;
case COPY:
copy = state->length;
- if (copy) {
+ while (copy) {
if (copy > have) copy = have;
if (copy > left) copy = left;
if (copy == 0) goto inf_leave;
@@ -829,7 +829,6 @@ int flush;
left -= copy;
put += copy;
state->length -= copy;
- break;
}
Tracev((stderr, "inflate: stored end\n"));
state->mode = TYPE;
* Re: cleaner/better zlib sources?
From: Davide Libenzi @ 2007-03-16 2:06 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
On Thu, 15 Mar 2007, Davide Libenzi wrote:
> On Thu, 15 Mar 2007, Linus Torvalds wrote:
>
> >
> > I looked at git profiles yesterday, and some of them are pretty scary. We
> > spend about 50% of the time under some loads in just zlib uncompression,
> > and when I actually looked closer at the zlib sources I can kind of
> > understand why. That thing is horrid.
> >
> > The sad part is that it looks like it should be quite possible to make
> > zlib simply just perform better. The profiles seem to say that a lot of
> > the cost is literally in the "inflate()" state machine code (and by that I
> > mean *not* the code itself, but literally in the indirect jump generated
> > by the case-statement).
> >
> > Now, on any high-performance CPU, doing state-machines by having
> >
> > for (;;)
> > switch (data->state) {
> > ...
> > data->state = NEW_STATE;
> > continue;
> > }
> >
> > (which is what zlib seems to be doing) is just about the worst possible
> > way to code things.
>
> A quick hack would be to just define:
>
> #define SWITCH_LBL(n) \
> case n: \
> lbl_##n:
>
> #define STATE_CHANGE(s) \
> state->mode = s; \
> goto lbl_##s;
>
> Then replace all the "state->mode = STATE; break;" into STATE_CHANGE(STATE);
> I'm giving it a try as we speak ...
I get about 5-6% boost with it AFAICS ...
- Davide
* Re: cleaner/better zlib sources?
From: Davide Libenzi @ 2007-03-16 2:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff Garzik, Git Mailing List, mpm, bcrl
On Thu, 15 Mar 2007, Linus Torvalds wrote:
>
>
> On Thu, 15 Mar 2007, Linus Torvalds wrote:
> >
> > Same obvious performance problems go for
> >
> > case COPY:
>
> As an example, I *think* this patch to zlib-1.2.3 not only generates
> better code, but is (a) shorter and (b) more logical anyway.
>
> Together with Davide's suggestion on using C macro expansion to make most
> of the mode switches simple branches, it might get rid of most of the
> indirect branches (to get rid of them all, you'd have to also find the
> places where we *don't* set a new state, because it stays the same like
> this one, and the ones where we have conditionals on what the mode is
> going to be..)
That's the diff against 1.2.3, but it does not seem to make a substantial
difference on my Opteron ...
- Davide
Index: zlib-1.2.3.quilt/inflate.c
===================================================================
--- zlib-1.2.3.quilt.orig/inflate.c 2007-03-15 18:17:19.000000000 -0700
+++ zlib-1.2.3.quilt/inflate.c 2007-03-15 18:31:14.000000000 -0700
@@ -551,6 +551,15 @@
will return Z_BUF_ERROR if it has not reached the end of the stream.
*/
+#define CASE_DECL(n) \
+ case n: \
+ lbl_##n:
+
+#define STATE_CHANGE(s) do { \
+ state->mode = s; \
+ goto lbl_##s; \
+} while (0)
+
int ZEXPORT inflate(strm, flush)
z_streamp strm;
int flush;
@@ -586,10 +595,9 @@
ret = Z_OK;
for (;;)
switch (state->mode) {
- case HEAD:
+ CASE_DECL(HEAD)
if (state->wrap == 0) {
- state->mode = TYPEDO;
- break;
+ STATE_CHANGE(TYPEDO);
}
NEEDBITS(16);
#ifdef GUNZIP
@@ -597,8 +605,7 @@
state->check = crc32(0L, Z_NULL, 0);
CRC2(state->check, hold);
INITBITS();
- state->mode = FLAGS;
- break;
+ STATE_CHANGE(FLAGS);
}
state->flags = 0; /* expect zlib header */
if (state->head != Z_NULL)
@@ -609,20 +616,17 @@
#endif
((BITS(8) << 8) + (hold >> 8)) % 31) {
strm->msg = (char *)"incorrect header check";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
if (BITS(4) != Z_DEFLATED) {
strm->msg = (char *)"unknown compression method";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
DROPBITS(4);
len = BITS(4) + 8;
if (len > state->wbits) {
strm->msg = (char *)"invalid window size";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
state->dmax = 1U << len;
Tracev((stderr, "inflate: zlib header ok\n"));
@@ -631,32 +635,30 @@
INITBITS();
break;
#ifdef GUNZIP
- case FLAGS:
+ CASE_DECL(FLAGS)
NEEDBITS(16);
state->flags = (int)(hold);
if ((state->flags & 0xff) != Z_DEFLATED) {
strm->msg = (char *)"unknown compression method";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
if (state->flags & 0xe000) {
strm->msg = (char *)"unknown header flags set";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
if (state->head != Z_NULL)
state->head->text = (int)((hold >> 8) & 1);
if (state->flags & 0x0200) CRC2(state->check, hold);
INITBITS();
state->mode = TIME;
- case TIME:
+ CASE_DECL(TIME)
NEEDBITS(32);
if (state->head != Z_NULL)
state->head->time = hold;
if (state->flags & 0x0200) CRC4(state->check, hold);
INITBITS();
state->mode = OS;
- case OS:
+ CASE_DECL(OS)
NEEDBITS(16);
if (state->head != Z_NULL) {
state->head->xflags = (int)(hold & 0xff);
@@ -665,7 +667,7 @@
if (state->flags & 0x0200) CRC2(state->check, hold);
INITBITS();
state->mode = EXLEN;
- case EXLEN:
+ CASE_DECL(EXLEN)
if (state->flags & 0x0400) {
NEEDBITS(16);
state->length = (unsigned)(hold);
@@ -677,7 +679,7 @@
else if (state->head != Z_NULL)
state->head->extra = Z_NULL;
state->mode = EXTRA;
- case EXTRA:
+ CASE_DECL(EXTRA)
if (state->flags & 0x0400) {
copy = state->length;
if (copy > have) copy = have;
@@ -699,7 +701,7 @@
}
state->length = 0;
state->mode = NAME;
- case NAME:
+ CASE_DECL(NAME)
if (state->flags & 0x0800) {
if (have == 0) goto inf_leave;
copy = 0;
@@ -720,7 +722,7 @@
state->head->name = Z_NULL;
state->length = 0;
state->mode = COMMENT;
- case COMMENT:
+ CASE_DECL(COMMENT)
if (state->flags & 0x1000) {
if (have == 0) goto inf_leave;
copy = 0;
@@ -740,13 +742,12 @@
else if (state->head != Z_NULL)
state->head->comment = Z_NULL;
state->mode = HCRC;
- case HCRC:
+ CASE_DECL(HCRC)
if (state->flags & 0x0200) {
NEEDBITS(16);
if (hold != (state->check & 0xffff)) {
strm->msg = (char *)"header crc mismatch";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
INITBITS();
}
@@ -755,28 +756,26 @@
state->head->done = 1;
}
strm->adler = state->check = crc32(0L, Z_NULL, 0);
- state->mode = TYPE;
- break;
+ STATE_CHANGE(TYPE);
#endif
- case DICTID:
+ CASE_DECL(DICTID)
NEEDBITS(32);
strm->adler = state->check = REVERSE(hold);
INITBITS();
state->mode = DICT;
- case DICT:
+ CASE_DECL(DICT)
if (state->havedict == 0) {
RESTORE();
return Z_NEED_DICT;
}
strm->adler = state->check = adler32(0L, Z_NULL, 0);
state->mode = TYPE;
- case TYPE:
+ CASE_DECL(TYPE)
if (flush == Z_BLOCK) goto inf_leave;
- case TYPEDO:
+ CASE_DECL(TYPEDO)
if (state->last) {
BYTEBITS();
- state->mode = CHECK;
- break;
+ STATE_CHANGE(CHECK);
}
NEEDBITS(3);
state->last = BITS(1);
@@ -785,39 +784,38 @@
case 0: /* stored block */
Tracev((stderr, "inflate: stored block%s\n",
state->last ? " (last)" : ""));
- state->mode = STORED;
- break;
+ DROPBITS(2);
+ STATE_CHANGE(STORED);
case 1: /* fixed block */
fixedtables(state);
Tracev((stderr, "inflate: fixed codes block%s\n",
state->last ? " (last)" : ""));
- state->mode = LEN; /* decode codes */
- break;
+ DROPBITS(2);
+ STATE_CHANGE(LEN);
case 2: /* dynamic block */
Tracev((stderr, "inflate: dynamic codes block%s\n",
state->last ? " (last)" : ""));
- state->mode = TABLE;
- break;
+ DROPBITS(2);
+ STATE_CHANGE(TABLE);
case 3:
+ DROPBITS(2);
strm->msg = (char *)"invalid block type";
- state->mode = BAD;
+ STATE_CHANGE(BAD);
}
- DROPBITS(2);
break;
- case STORED:
+ CASE_DECL(STORED)
BYTEBITS(); /* go to byte boundary */
NEEDBITS(32);
if ((hold & 0xffff) != ((hold >> 16) ^ 0xffff)) {
strm->msg = (char *)"invalid stored block lengths";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
state->length = (unsigned)hold & 0xffff;
Tracev((stderr, "inflate: stored length %u\n",
state->length));
INITBITS();
state->mode = COPY;
- case COPY:
+ CASE_DECL(COPY)
copy = state->length;
if (copy) {
if (copy > have) copy = have;
@@ -832,9 +830,8 @@
break;
}
Tracev((stderr, "inflate: stored end\n"));
- state->mode = TYPE;
- break;
- case TABLE:
+ STATE_CHANGE(TYPE);
+ CASE_DECL(TABLE)
NEEDBITS(14);
state->nlen = BITS(5) + 257;
DROPBITS(5);
@@ -845,14 +842,13 @@
#ifndef PKZIP_BUG_WORKAROUND
if (state->nlen > 286 || state->ndist > 30) {
strm->msg = (char *)"too many length or distance symbols";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
#endif
Tracev((stderr, "inflate: table sizes ok\n"));
state->have = 0;
state->mode = LENLENS;
- case LENLENS:
+ CASE_DECL(LENLENS)
while (state->have < state->ncode) {
NEEDBITS(3);
state->lens[order[state->have++]] = (unsigned short)BITS(3);
@@ -867,13 +863,12 @@
&(state->lenbits), state->work);
if (ret) {
strm->msg = (char *)"invalid code lengths set";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
Tracev((stderr, "inflate: code lengths ok\n"));
state->have = 0;
state->mode = CODELENS;
- case CODELENS:
+ CASE_DECL(CODELENS)
while (state->have < state->nlen + state->ndist) {
for (;;) {
this = state->lencode[BITS(state->lenbits)];
@@ -891,8 +886,7 @@
DROPBITS(this.bits);
if (state->have == 0) {
strm->msg = (char *)"invalid bit length repeat";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
len = state->lens[state->have - 1];
copy = 3 + BITS(2);
@@ -914,17 +908,13 @@
}
if (state->have + copy > state->nlen + state->ndist) {
strm->msg = (char *)"invalid bit length repeat";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
while (copy--)
state->lens[state->have++] = (unsigned short)len;
}
}
- /* handle error breaks in while */
- if (state->mode == BAD) break;
-
/* build code tables */
state->next = state->codes;
state->lencode = (code const FAR *)(state->next);
@@ -933,8 +923,7 @@
&(state->lenbits), state->work);
if (ret) {
strm->msg = (char *)"invalid literal/lengths set";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
state->distcode = (code const FAR *)(state->next);
state->distbits = 6;
@@ -942,12 +931,11 @@
&(state->next), &(state->distbits), state->work);
if (ret) {
strm->msg = (char *)"invalid distances set";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
Tracev((stderr, "inflate: codes ok\n"));
state->mode = LEN;
- case LEN:
+ CASE_DECL(LEN)
if (have >= 6 && left >= 258) {
RESTORE();
inflate_fast(strm, out);
@@ -975,22 +963,19 @@
Tracevv((stderr, this.val >= 0x20 && this.val < 0x7f ?
"inflate: literal '%c'\n" :
"inflate: literal 0x%02x\n", this.val));
- state->mode = LIT;
- break;
+ STATE_CHANGE(LIT);
}
if (this.op & 32) {
Tracevv((stderr, "inflate: end of block\n"));
- state->mode = TYPE;
- break;
+ STATE_CHANGE(TYPE);
}
if (this.op & 64) {
strm->msg = (char *)"invalid literal/length code";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
state->extra = (unsigned)(this.op) & 15;
state->mode = LENEXT;
- case LENEXT:
+ CASE_DECL(LENEXT)
if (state->extra) {
NEEDBITS(state->extra);
state->length += BITS(state->extra);
@@ -998,7 +983,7 @@
}
Tracevv((stderr, "inflate: length %u\n", state->length));
state->mode = DIST;
- case DIST:
+ CASE_DECL(DIST)
for (;;) {
this = state->distcode[BITS(state->distbits)];
if ((unsigned)(this.bits) <= bits) break;
@@ -1017,13 +1002,12 @@
DROPBITS(this.bits);
if (this.op & 64) {
strm->msg = (char *)"invalid distance code";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
state->offset = (unsigned)this.val;
state->extra = (unsigned)(this.op) & 15;
state->mode = DISTEXT;
- case DISTEXT:
+ CASE_DECL(DISTEXT)
if (state->extra) {
NEEDBITS(state->extra);
state->offset += BITS(state->extra);
@@ -1032,18 +1016,16 @@
#ifdef INFLATE_STRICT
if (state->offset > state->dmax) {
strm->msg = (char *)"invalid distance too far back";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
#endif
if (state->offset > state->whave + out - left) {
strm->msg = (char *)"invalid distance too far back";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
Tracevv((stderr, "inflate: distance %u\n", state->offset));
state->mode = MATCH;
- case MATCH:
+ CASE_DECL(MATCH)
if (left == 0) goto inf_leave;
copy = out - left;
if (state->offset > copy) { /* copy from window */
@@ -1066,15 +1048,15 @@
do {
*put++ = *from++;
} while (--copy);
- if (state->length == 0) state->mode = LEN;
+ if (state->length == 0)
+ STATE_CHANGE(LEN);
break;
- case LIT:
+ CASE_DECL(LIT)
if (left == 0) goto inf_leave;
*put++ = (unsigned char)(state->length);
left--;
- state->mode = LEN;
- break;
- case CHECK:
+ STATE_CHANGE(LEN);
+ CASE_DECL(CHECK)
if (state->wrap) {
NEEDBITS(32);
out -= left;
@@ -1090,36 +1072,34 @@
#endif
REVERSE(hold)) != state->check) {
strm->msg = (char *)"incorrect data check";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
INITBITS();
Tracev((stderr, "inflate: check matches trailer\n"));
}
#ifdef GUNZIP
state->mode = LENGTH;
- case LENGTH:
+ CASE_DECL(LENGTH)
if (state->wrap && state->flags) {
NEEDBITS(32);
if (hold != (state->total & 0xffffffffUL)) {
strm->msg = (char *)"incorrect length check";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
INITBITS();
Tracev((stderr, "inflate: length matches trailer\n"));
}
#endif
state->mode = DONE;
- case DONE:
+ CASE_DECL(DONE)
ret = Z_STREAM_END;
goto inf_leave;
- case BAD:
+ CASE_DECL(BAD)
ret = Z_DATA_ERROR;
goto inf_leave;
- case MEM:
+ CASE_DECL(MEM)
return Z_MEM_ERROR;
- case SYNC:
+ CASE_DECL(SYNC)
default:
return Z_STREAM_ERROR;
}
* Re: cleaner/better zlib sources?
From: Linus Torvalds @ 2007-03-16 2:56 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Jeff Garzik, Git Mailing List, mpm, bcrl
On Thu, 15 Mar 2007, Davide Libenzi wrote:
>
> That's the diff against 1.2.3, but it does not seem to make a substantial
> difference on my Opteron ...
But the "goto" stuff you did is 5-6%?
Is that 5-6% of total git costs, or just of inflate() itself?
Linus
* Re: cleaner/better zlib sources?
From: Davide Libenzi @ 2007-03-16 3:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff Garzik, Git Mailing List, mpm, bcrl
On Thu, 15 Mar 2007, Linus Torvalds wrote:
>
>
> On Thu, 15 Mar 2007, Davide Libenzi wrote:
> >
> > That's the diff against 1.2.3, but it does not seem to make a substantial
> > difference on my Opteron ...
>
> But the "goto" stuff you did is 5-6%?
>
> Is that 5-6% of total git costs, or just of inflate() itself?
Didn't do proper cache warmup and test time was fairly short. Now I'm not
able to notice substantial differences.
Hacked up test case below ...
- Davide
/* example.c -- usage example of the zlib compression library
* Copyright (C) 1995-2004 Jean-loup Gailly.
* For conditions of distribution and use, see copyright notice in zlib.h
*/
/* @(#) $Id$ */
#include <sys/time.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include "zlib.h"
#define CHECK_ERR(err, msg) do { \
if (err != Z_OK) { \
fprintf(stderr, "%s error: %d\n", msg, err); \
exit(1); \
} \
} while (0)
unsigned long long getustime(void) {
struct timeval tm;
gettimeofday(&tm, NULL);
return tm.tv_sec * 1000000ULL + tm.tv_usec;
}
/* ===========================================================================
* Test deflate() with large buffers and dynamic change of compression level
*/
void do_defl(Byte *compr, uLong *comprLen,
Byte *uncompr, uLong uncomprLen) {
z_stream c_stream; /* compression stream */
int err;
c_stream.zalloc = (alloc_func)0;
c_stream.zfree = (free_func)0;
c_stream.opaque = (voidpf)0;
err = deflateInit(&c_stream, Z_BEST_SPEED);
CHECK_ERR(err, "deflateInit");
c_stream.next_out = compr;
c_stream.avail_out = (uInt) *comprLen;
/* At this point, uncompr is still mostly zeroes, so it should compress
* very well:
*/
c_stream.next_in = uncompr;
c_stream.avail_in = (uInt) uncomprLen;
err = deflate(&c_stream, Z_FINISH);
if (err != Z_STREAM_END) {
fprintf(stderr, "whoops, got %d instead of Z_STREAM_END\n", err);
exit(1);
}
err = deflateEnd(&c_stream);
CHECK_ERR(err, "deflateEnd");
*comprLen = c_stream.next_out - compr;
}
/* ===========================================================================
* Test inflate() with large buffers
*/
void do_infl(Byte *compr, uLong comprLen,
Byte *uncompr, uLong *uncomprLen) {
int err;
z_stream d_stream; /* decompression stream */
d_stream.zalloc = (alloc_func)0;
d_stream.zfree = (free_func)0;
d_stream.opaque = (voidpf)0;
d_stream.next_in = compr;
d_stream.avail_in = (uInt)comprLen;
err = inflateInit(&d_stream);
CHECK_ERR(err, "inflateInit");
d_stream.next_out = uncompr; /* discard the output */
d_stream.avail_out = (uInt) *uncomprLen;
err = inflate(&d_stream, Z_FULL_FLUSH);
if (err != Z_STREAM_END) {
fprintf(stderr, "deflate should report Z_STREAM_END\n");
exit(1);
}
err = inflateEnd(&d_stream);
CHECK_ERR(err, "inflateEnd");
*uncomprLen = d_stream.next_out - uncompr;
}
int main(int ac, char **av) {
uLong i, n, clen, ulen, size = 8 * 1024 * 1024, range = 256;
Byte *ubuf, *cbuf, *tbuf;
unsigned long long ts, te;
srand(1);
ulen = size;
clen = 2 * ulen;
ubuf = malloc(ulen);
tbuf = malloc(ulen);
cbuf = malloc(clen);
for (i = 0; i < ulen; i++)
ubuf[i] = (Byte) (rand() % range);
/* Warming up ... */
do_defl(cbuf, &clen, ubuf, ulen);
do_infl(cbuf, clen, tbuf, &ulen);
if (ulen != size) {
fprintf(stderr, "size mismatch %lu instead of %lu\n",
(unsigned long) ulen, (unsigned long) size);
return 1;
}
if (memcmp(tbuf, ubuf, size)) {
fprintf(stderr, "whoops! we did not get back the same data\n");
return 2;
}
/* Test ... */
ts = getustime();
n = 0;
do {
for (i = 0; i < 16; i++) {
ulen = size;
do_infl(cbuf, clen, ubuf, &ulen);
}
n += i;
te = getustime();
} while (te - ts < 2 * 1000000);
fprintf(stdout, "us time / cycle = %llu\n", (te - ts) / n);
return 0;
}
* Re: cleaner/better zlib sources?
From: linux @ 2007-03-16 6:08 UTC (permalink / raw)
To: torvalds; +Cc: git
Er, it's a little hard to see, but zlib spends the bulk of its time
in inflate_fast(). This is the fast path, invoked as long as there's
at least 6 bytes of input and 258 bytes of output space available.
The code in inflate.c just handles the last few bytes when near
one limit or the other. Are you sure it's a performance problem?
There's equivalent inflate code in the PGP 5.0i distribution
(src/lib/pgp/compress/pgpZInflate.c) that's in the public domain AFAICT
(it says "Not copyrighted", and "You can do whatever you like with this
source file, though I would prefer that if you modify it and redistribute
it that you include comments to that effect with your name and the
date."), and uses switch statements only for resuming after a pause.
It's presumably well tested, but it's got a strong nasty-old-MSDOS
code smell about it, some truly bizarre indenting (don't look at lines
1118-1126 with a full stomach), and would require a lot of massaging to
integrate with zlib.
* Re: cleaner/better zlib sources?
From: Florian Weimer @ 2007-03-16 11:34 UTC (permalink / raw)
To: git
> There's equivalent inflate code in the PGP 5.0i distribution
> (src/lib/pgp/compress/pgpZInflate.c) that's in the public domain AFAICT
It's an early fork from zlib. gzip contains a copy, too.
* Re: cleaner/better zlib sources?
From: Josef Weidendorfer @ 2007-03-16 15:51 UTC (permalink / raw)
To: linux; +Cc: torvalds, git
On Friday 16 March 2007, linux@horizon.com wrote:
> Er, it's a little hard to see, but zlib spends the bulk of its time
> in inflate_fast(). This is the fast path, invoked as long as there's
> at least 6 bytes of input and 258 bytes of output space available.
Hmm...
When running "git log >/dev/null" a few times on my Pentium-M here,
I get the following from oprofile:
CPU: PIII, speed 1600 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
Counted INST_RETIRED events (number of instructions retired) with a unit mask of 0x00 (No unit mask) count 400000
samples % samples % image name app name symbol name
12816 19.6492 3221 19.0772 libz.so.1.2.3 git-log inflate_table
12186 18.6833 3640 21.5589 libz.so.1.2.3 git-log inflate
5228 8.0155 1395 8.2623 libz.so.1.2.3 git-log inflate_fast
3855 5.9104 738 4.3710 libc-2.5.so git-log strlen
3782 5.7985 1361 8.0609 git-log git-log pretty_print_commit
2022 3.1001 600 3.5537 libc-2.5.so git-log vfprintf
inflate_fast() is definitely below inflate().
Josef
* Re: cleaner/better zlib sources?
From: Linus Torvalds @ 2007-03-16 16:11 UTC (permalink / raw)
To: linux; +Cc: git
On Thu, 16 Mar 2007, linux@horizon.com wrote:
>
> Er, it's a little hard to see, but zlib spends the bulk of its time
> in inflate_fast().
Not for git.
It may be true for *big* inflate events, but the bulk of git inflates are
small trees etc. Here is one particular profile:
CPU: Core 2, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples % image name app name symbol name
169540 15.9845 git git inflate
138377 13.0464 git git inflate_fast
94738 8.9321 git git inflate_table
73511 6.9308 git git strlen
70398 6.6373 git git find_pack_entry_one
59200 5.5815 vmlinux vmlinux __copy_user_nocache
45103 4.2524 git git adler32
42973 4.0516 git git memcpy
23438 2.2098 git git interesting
..
so yes, inflate_fast() is certainly up there too, but plain "inflate()" is
actually more important.
(Btw, to get this level of detail, you need to link statically, at least
if you are using Fedora Core. If you do the normal dynamic linking,
oprofile will not be able to show you which function, and will just say
that 50% of all time was spent in libz.so.1.2.3).
Also note that the above is with oprofile: do *not* bother to try to
profile using "-pg" and gprof. That changes the binary so much as to be
useless - small functions get the profiling code overhead and things that
take no time at all will suddenly become twice or three times as
expensive. Using oprofile you get fairly correct results.
> The code in inflate.c just handles the last few bytes when near
> one limit or the other. Are you sure it's a performance problem?
See above. I'm absolutely positive.
(The load for the above was roughly:
git log drivers/usb/ > /dev/null
on the kernel tree - ie I was mainly testing the commit tree pruning,
which is one of the most fundamental operations, and is what makes
or breaks the performance of things like "git blame" etc).
I'd obviously *also* like to see inflate_fast() go down, and it has some
really strange code too, like:
# define PUP(a) *++(a)
...
len -= op;
do {
PUP(out) = PUP(from);
} while (--op);
...
which looks rather odd, wouldn't you say? That's a memcpy. But I
especially find the nice "unrolled" memcpy interesting:
...
while (len > 2) {
PUP(out) = PUP(from);
PUP(out) = PUP(from);
PUP(out) = PUP(from);
len -= 3;
}
if (len) {
PUP(out) = PUP(from);
if (len > 1)
PUP(out) = PUP(from);
}
...
yeah, that's right - we unroll memcpy() BY DOING IT THREE BYTES AT A TIME!
Whoever wrote that crap must have been on some strange medication. If you
need to do enough of a memcpy() that you want to unroll it, you sure as
hell don't want to do it a byte-at-a-time, you want to do it with full
words etc. And the thing is called "memcpy()", for chrissake!
That's the "optimized" code. Strange.
> There's equivalent inflate code in the PGP 5.0i distribution
Interesting. I'll take a look.
Linus
* Re: cleaner/better zlib sources?
From: Linus Torvalds @ 2007-03-16 16:21 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Jeff Garzik, Git Mailing List, mpm, bcrl
On Thu, 15 Mar 2007, Davide Libenzi wrote:
>
> Hacked up test case below ...
This one seems to do benchmarking with 8MB buffers if I read it right
(didn't try).
The normal size for the performance-critical git objects is a couple of
*hundred* bytes. Not kilobytes, and not megabytes.
The most performance-critical objects for uncompression are commits and
trees. At least for the kernel, the average size of a tree object is 678
bytes. And that's ignoring the fact that most of them are then deltified,
so about 80% of them are likely just a ~60-byte delta.
Linus
* Re: cleaner/better zlib sources?
From: Davide Libenzi @ 2007-03-16 16:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff Garzik, Git Mailing List, mpm, bcrl
On Fri, 16 Mar 2007, Linus Torvalds wrote:
> On Thu, 15 Mar 2007, Davide Libenzi wrote:
> >
> > Hacked up test case below ...
>
> This one seems to do benchmarking with 8MB buffers if I read it right
> (didn't try).
Yes, I just wanted to have the biggest time spent in inflate(). That's why I
use a big buffer.
> The normal size for the performance-critical git objects is a couple of
> *hundred* bytes. Not kilobytes, and not megabytes.
>
> The most performance-critical objects for uncompression are commits and
> trees. At least for the kernel, the average size of a tree object is 678
> bytes. And that's ignoring the fact that most of them are then deltified,
> so about 80% of them are likely just a ~60-byte delta.
Definitely. The nature of the data matters.
Did you try to make a zlib with my patch and oprofile git on real data
with that?
- Davide
* Re: cleaner/better zlib sources?
From: Jeff Garzik @ 2007-03-16 16:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Davide Libenzi, Git Mailing List, mpm, bcrl
Linus Torvalds wrote:
> The normal size for the performance-critical git objects is a couple of
> *hundred* bytes. Not kilobytes, and not megabytes.
>
> The most performance-critical objects for uncompression are commits and
> trees. At least for the kernel, the average size of a tree object is 678
> bytes. And that's ignoring the fact that most of them are then deltified,
> so about 80% of them are likely just a ~60-byte delta.
Ahhh. At least for me, that explains a lot. Rather than spending all
its time in inflate_fast(), git is dealing with lots of zlib
startup/shutdown overhead.
Although it sounds like zlib could indeed be optimized to reduce its
startup and shutdown overhead, I wonder if switching compression
algorithms to a pure Huffman or even RLE compression (with associated
lower startup/shutdown costs) would perform better in the face of all
those small objects.
And another random thought, though it may be useless in this thread: I
bet using a pre-built (compiled into git) static zlib dictionary for git
commit and tree objects might improve things a bit.
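zlib does already have the hooks for that, for what it's worth. A rough
sketch (the dictionary bytes below are made up for illustration; a real one
would be built from substrings that are actually common in commit and tree
objects):

#include <string.h>
#include "zlib.h"

/* made-up sample dictionary: frequent substrings of commit objects */
static const unsigned char dict[] =
	"tree parent author committer Signed-off-by: ";

int deflate_with_dict(unsigned char *dst, unsigned long *dstlen,
		      const unsigned char *src, unsigned long srclen)
{
	z_stream s;
	int ret;

	memset(&s, 0, sizeof(s));
	if (deflateInit(&s, Z_BEST_SPEED) != Z_OK)
		return -1;
	deflateSetDictionary(&s, dict, sizeof(dict) - 1);
	s.next_in = (unsigned char *)src;
	s.avail_in = (uInt)srclen;
	s.next_out = dst;
	s.avail_out = (uInt)*dstlen;
	ret = deflate(&s, Z_FINISH);
	*dstlen = s.total_out;
	deflateEnd(&s);
	return ret == Z_STREAM_END ? 0 : -1;
}

int inflate_with_dict(unsigned char *dst, unsigned long *dstlen,
		      const unsigned char *src, unsigned long srclen)
{
	z_stream s;
	int ret;

	memset(&s, 0, sizeof(s));
	if (inflateInit(&s) != Z_OK)
		return -1;
	s.next_in = (unsigned char *)src;
	s.avail_in = (uInt)srclen;
	s.next_out = dst;
	s.avail_out = (uInt)*dstlen;
	ret = inflate(&s, Z_FINISH);
	if (ret == Z_NEED_DICT) {	/* header says: preset dictionary */
		inflateSetDictionary(&s, dict, sizeof(dict) - 1);
		ret = inflate(&s, Z_FINISH);
	}
	*dstlen = s.total_out;
	inflateEnd(&s);
	return ret == Z_STREAM_END ? 0 : -1;
}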
Jeff
* Re: cleaner/better zlib sources?
From: Linus Torvalds @ 2007-03-16 16:35 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Jeff Garzik, Git Mailing List, mpm, bcrl
On Fri, 16 Mar 2007, Davide Libenzi wrote:
>
> > This one seems to do benchmarking with 8MB buffers if I read it right
> > (didn't try).
>
> Yes, I just wanted to have the biggest time spent in inflate(). That why I
> use a big buffer.
Right. But if the biggest time is spent in setup, the big-buffer thing
ends up being exactly the wrong thing to test ;)
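A quick way to bend the test toward git's workload is to hammer a single
object of a few hundred bytes instead of one 8MB buffer. A standalone
sketch, using zlib's one-shot uncompress() (which does the full
inflateInit/inflate/inflateEnd sequence on every call):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "zlib.h"

int main(void)
{
	unsigned char src[678], comp[2048], back[678];
	uLongf clen = sizeof(comp), blen;
	unsigned long long i, n = 200000, us;
	struct timeval t0, t1;

	for (i = 0; i < sizeof(src); i++)
		src[i] = (unsigned char)(rand() % 64);
	if (compress2(comp, &clen, src, sizeof(src), Z_BEST_SPEED) != Z_OK)
		return 1;
	gettimeofday(&t0, NULL);
	for (i = 0; i < n; i++) {
		blen = sizeof(back);
		if (uncompress(back, &blen, comp, clen) != Z_OK)
			return 1;
	}
	gettimeofday(&t1, NULL);
	us = (t1.tv_sec - t0.tv_sec) * 1000000ULL + (t1.tv_usec - t0.tv_usec);
	printf("%llu ns per inflate of a %u-byte object\n",
	       us * 1000 / n, (unsigned)sizeof(src));
	return 0;
}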
> Definitely. The nature of the data matters.
> Did you try to make a zlib with my patch and oprofile git on real data
> with that?
I haven't actually set it up so that I can build against my own zlib yet.
Exactly because I was hoping that somebody would already have a solution
;)
Will try to do later today, although since it's the kids school
conferences today I'll effectively be in and out all day long.
Linus
* Re: cleaner/better zlib sources?
From: Matt Mackall @ 2007-03-16 16:42 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Linus Torvalds, Davide Libenzi, Git Mailing List, bcrl
On Fri, Mar 16, 2007 at 12:35:39PM -0400, Jeff Garzik wrote:
> Linus Torvalds wrote:
> >The normal size for the performance-critical git objects is a couple of
> >*hundred* bytes. Not kilobytes, and not megabytes.
> >
> >The most performance-critical objects for uncompression are commits and
> >trees. At least for the kernel, the average size of a tree object is 678
> >bytes. And that's ignoring the fact that most of them are then deltified,
> >so about 80% of them are likely just a ~60-byte delta.
>
>
> Ahhh. At least for me, that explains a lot. Rather than spending all
> its time in inflate_fast(), git is dealing with lots of zlib
> startup/shutdown overhead.
>
> Although it sounds like zlib could indeed be optimized to reduce its
> startup and shutdown overhead, I wonder if switching compression
> algorithms to a pure Huffman or even RLE compression (with associated
> lower startup/shutdown costs) would perform better in the face of all
> those small objects.
Mercurial simply stores uncompressed objects below a threshold of 44
bytes, based on benchmarks I did in April 2005. I'd probably up that
number if I redid my measurements today. There's just not a whole lot
zlib can do at these small sizes. Given that a SHA hash is an
uncompressible 20 bytes already, you're well into the domain of
diminishing returns.
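The decision logic itself is tiny. A sketch of that kind of rule (not
Mercurial's or git's actual code; the 44-byte cutoff is just the number
quoted above):

#include "zlib.h"

#define SMALL_OBJ_THRESHOLD 44

/* Decide whether to bother deflating an object: skip tiny ones outright,
 * and skip anything where the deflated form is not actually smaller. */
int should_store_compressed(const unsigned char *buf, unsigned long len,
			    unsigned char *out, unsigned long *outlen)
{
	uLongf clen = *outlen;

	if (len < SMALL_OBJ_THRESHOLD)
		return 0;				/* store raw */
	if (compress2(out, &clen, buf, len, Z_BEST_SPEED) != Z_OK)
		return 0;
	if (clen >= len)
		return 0;				/* compression didn't win */
	*outlen = clen;
	return 1;					/* store the deflated form */
}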
> And another random thought, though it may be useless in this thread: I
> bet using a pre-built (compiled into git) static zlib dictionary for git
> commit and tree objects might improve things a bit.
Ideally, you'd compress all deltas in a chain with the same context.
You've got to decompress the delta base to do the delta
calculation, so this should allow you to recover the context up to
that point. Zlib isn't really set up for this sort of thing though.
--
Mathematics is the supreme nostalgia of our time.
* Re: cleaner/better zlib sources?
From: Linus Torvalds @ 2007-03-16 16:51 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Davide Libenzi, Git Mailing List, mpm, bcrl
On Fri, 16 Mar 2007, Jeff Garzik wrote:
>
> Although it sounds like zlib could indeed be optimized to reduce its startup
> and shutdown overhead, I wonder if switching compression algorithms to a pure
> Huffman or even RLE compression (with associated lower startup/shutdown costs)
> would perform better in the face of all those small objects.
Well, the thing is, I personally much prefer to have just a single
compression algorithm and object layout. Most of the performance-critical
objects from a decompression standpoint during commit traversal are all
small (especially if you do pathname limiting), but when you do something
like a "git add ." most objects are actually random blob objects and you
need to have a compression algorithm that works in the general case too.
Of course, pack-v4 may (likely will) end up using different strategies for
different objects (deltas in particular), but the "one single object
compression type" was a big deal for initial implementation.
It may not be fundamental to git operation (so we can fairly easily
change it and make it more complex without any higher-level stuff even
noticing), but it was definitely fundamental to "get something stable and
working" up and running quickly..
> And another random thought, though it may be useless in this thread: I bet
> using a pre-built (compiled into git) static zlib dictionary for git commit
> and tree objects might improve things a bit.
That's kind of pack-v4 area. It will happen, but I'd actually like to see
if we can just avoid stupid performance problems with zlib, independently
of trying to make more tuned formats.
Linus
* Re: cleaner/better zlib sources?
From: Nicolas Pitre @ 2007-03-16 17:06 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Davide Libenzi, Jeff Garzik, Git Mailing List, mpm, bcrl
On Fri, 16 Mar 2007, Linus Torvalds wrote:
> The most performance-critical objects for uncompression are commits and
> trees. At least for the kernel, the average size of a tree object is 678
> bytes. And that's ignoring the fact that most of them are then deltified,
> so about 80% of them are likely just a ~60-byte delta.
This is why in pack v4 there will be an alternate tree object
representation which is not deflated at all.
In short we intend to have 3 tables where common things are factored
out:
1) the path component string table (deflated)
2) author/committer string table (deflated)
3) sorted SHA1 table (obviously not deflated)
The sorted SHA1 table will be part of the pack instead of being in the
pack index. The idea is that most SHA1's are already duplicated in the
pack already anyway within commit and tree objects. With a single table
then commit and tree objects can index into that SHA1 table rather than
providing the SHA1 value inline for the objects they refer to.
This means that a tree object record would be only 6 bytes according to
the current design: 2 bytes to index into the path component string
table (which also includes the mode information), and 4 bytes to index
into the sorted SHA1 table. And similarly for commit objects.
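In struct form, the record being described would look something like this
(field names made up, since the format was still being designed):

#include <stdint.h>

/* hypothetical pack v4 tree-entry record, 6 bytes per entry */
struct packv4_tree_entry {
	uint16_t path_idx;	/* index into the path-component table, mode included */
	uint32_t sha1_idx;	/* index into the pack's sorted SHA1 table */
} __attribute__((packed));	/* keep it at 6 bytes, no padding */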
This means that the pack index will only have a table of offsets
corresponding to the table of sorted SHA1's.
So... walking revisions will become only a matter of picking the first
commit object, using the tree index value (which is not deflated), but
instead of using it in the SHA1 table it could be used in the offset
table to find the location of the corresponding tree object directly.
Same goes for tree entries, or for locating the parent's commit object.
No deflating, no binary searching, no SHA1 comparisons. Plain straight
pointer dereference.
Then, if you want to filter tree walking on path spec, you only need to
locate the path component in the path table once and use the
corresponding index to filter tree entries instead of repeated strcmp().
Same thing if you want to filter commits based on author/committer.
One side effect of this is that you can tell straight away that a path
doesn't exist in the whole pack if one of its components cannot be found
in the table (that works only if no legacy tree representations are
present of course). That should make history walking blazingly fast.
The only thing that gets deflated is the commit message which needs to
be inflated only when displaying it.
And so far that makes for quite smaller packs too!
Nicolas
* Re: cleaner/better zlib sources?
From: Nicolas Pitre @ 2007-03-16 17:12 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Linus Torvalds, Davide Libenzi, Git Mailing List, mpm, bcrl
On Fri, 16 Mar 2007, Jeff Garzik wrote:
> Although it sounds like zlib could indeed be optimized to reduce its startup
> and shutdown overhead, I wonder if switching compression algorithms to a pure
> Huffman or even RLE compression (with associated lower startup/shutdown costs)
> would perform better in the face of all those small objects.
>
> And another random thought, though it may be useless in this thread: I bet
> using a pre-built (compiled into git) static zlib dictionary for git commit
> and tree objects might improve things a bit.
See my last post. We'll do even better with special object
encoding altogether. Those representations are so dense that
compression provides no gain at all, making the point moot.
Nicolas
* Re: cleaner/better zlib sources?
From: linux @ 2007-03-16 17:39 UTC (permalink / raw)
To: linux, torvalds; +Cc: git
> so yes, inflate_fast() is certainly up there too, but plain "inflate()" is
> actually more important.
Ah, okay. Thanks for the real data.
> I'd obviously *also* like to see inflate_fast() go down, and it has some
> really strange code too, like:
...
> which looks rather odd, wouldn't you say? That's a memcpy.
Actually, no, it's not a memcpy. memcpy is undefined for overlapping
operands. The way that gzip compresses 80 "x" characters is:
- Emit literal "x"
- Copy 79 bytes starting at (relative) offset -1
For this to work, the copy *must* proceed byte-at-a-time. libc memcpy(),
or even rep movsw, won't do. It has to be rep movsb.
Now, you could check for overlap and call memcpy() when the source and
destination don't overlap (the length is no larger than the offset),
but it's not clear that it's useful.
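A tiny standalone illustration of why the byte order matters in the
overlapping case (toy code, not zlib's):

#include <stdio.h>

int main(void)
{
	char buf[16] = "x";	/* the literal "x" has already been emitted */
	char *out = buf + 1;	/* next byte to write */
	char *from = buf;	/* copy source: (relative) offset -1 */
	int len = 7;		/* "copy 7 bytes starting at offset -1" */

	/* byte-at-a-time, exactly like the PUP() loop: each byte written
	 * becomes the source of a later one, so the 'x' gets replicated */
	while (len--)
		*out++ = *from++;
	printf("%s\n", buf);	/* prints "xxxxxxxx" */

	/* memcpy(buf + 1, buf, 7) would be undefined behaviour here,
	 * because the source and destination overlap */
	return 0;
}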
> But I especially find the nice "unrolled" memcpy interesting:
>
> ...
> while (len > 2) {
> PUP(out) = PUP(from);
> PUP(out) = PUP(from);
> PUP(out) = PUP(from);
> len -= 3;
> }
> if (len) {
> PUP(out) = PUP(from);
> if (len > 1)
> PUP(out) = PUP(from);
> }
> ...
>
> yeah, that's right - we unroll memcpy() BY DOING IT THREE BYTES AT A TIME!
They're actually not that stupid. A lot of copies are very short, so
getting too fancy is counterproductive. It's been a while, so it may
not be optimal on current processors, but the code was heavily profiled
and optimized at one point.
>> There's equivalent inflate code in the PGP 5.0i distribution
> Interesting. I'll take a look.
I think I figured out the whitespace damage - that code was transmitted
using RFC1149-like technology: printed on paper and OCRed in.
I can see how that could play hell with the whitespace. It makes a lot
more sense after that is fixed. A properly indented version is appended,
if anyone cares.
/*
* inflate.c -- Not copyrighted 1992,1993 by Mark Adler
* Latest hacking on this version by Colin Plumb
* version p3, 19 Oct 1995
*
* $Id: pgpZInflate.c,v 1.2 1997/01/24 23:35:54 mhw Exp $
*/
/*
* You can do whatever you like with this source file, though I would
* prefer that if you modify it and redistribute it that you include
* comments to that effect with your name and the date. Thank you.
*
* History:
* vers date who what
* ---- --------- ---------- ------------------------------------
* a ~~ Feb 92 M. Adler used full (large, one-step) lookup table
* b1 21 Mar 92 M. Adler first version with partial lookup tables
* b2 21 Mar 92 M. Adler fixed bug in fixed-code blocks
* b3 22 Mar 92 M. Adler sped up match copies, cleaned up some
* b4 25 Mar 92 M. Adler added prototypes; removed window[] (now
* is the responsibility of unzip.h--also
* changed name to slide[]), so needs diffs
* for unzip.c and unzip.h (this allows
* compiling in the small model on MSDOS);
* fixed cast of q in huft_build();
* b5 26 Mar 92 M. Adler got rid of unintended macro recursion.
* b6 27 Mar 92 M. Adler got rid of nextbyte() routine. fixed
* bug in inflate_fixed().
* c1 30 Mar 92 M. Adler removed lbits, dbits environment variables.
* changed BMAX to 16 for explode. Removed
* OUTB usage, and replaced it with flush()--
* this was a 20% speed improvement! Added
* an explode.c (to replace unimplode.c) that
* uses the huft routines here. Removed
* register union.
* c2 4 Apr 92 M. Adler fixed bug for file sizes a multiple of 32k.
* c3 10 Apr 92 M. Adler reduced memory of code tables made by
* huft_build significantly (factor of two to
* three).
* c4 15 Apr 92 M. Adler added NOMEMCPY do kill use of memcpy().
* worked around a Turbo C optimization bug.
* c5 21 Apr 92 M. Adler added the WSIZE #define to allow reducing
* the 32K window size for specialized
* applications.
* c6 31 May 92 M. Adler added some typecasts to eliminate warnings
* c7 27 Jun 92 G. Roelofs added some more typecasts (444: MSC bug).
* c8 5 Oct 92 J-l. Gailly added ifdef'd code to deal with PKZIP bug.
* c9 9 Oct 92 M. Adler removed a memory error message (~line 416).
* c10 17 Oct 92 G. Roelofs changed ULONG/UWORD/byte to ulg/ush/uch,
* removed old inflate, renamed inflate_entry
* to inflate, added Mark's fix to a comment.
* c11 2 Jan 93 M. Adler fixed bug in detection of incomplete
* tables, and removed assumption that EOB is
* the longest code (bad assumption).
* c12 3 Jan 93 M. Adler make tables for fixed blocks only once.
* c13 5 Jan 93 M. Adler allow all zero length codes (pkzip 2.04c
* outputs one zero length code for an empty
* distance tree).
* c14 12 Mar 93 M. Adler made inflate.c standalone with the
* introduction of inflate.h.
* d0 25 Apr 93 M. Adler several speedups in inflate_codes of
* almost 20% (suggested by Urban Mueller)--
* now requires the use of backfill.[ch]
* d1 4 May 93 M. Adler deleted extraneous statement in cheap loop,
* optimized common copy a little more.
* d2 5 May 93 M. Adler calculate number of cheap loops, a few
* other small optimizations.
* p1 Nov 93 C. Plumb Adapted to be able to suspend itself.
* Pretty massive reorganization.
* Many comments added.
* p2 18 Oct 95 C. Plumb Improved interface some more. Now
* nicely re-entrant. Still more comments.
* p3 19 Oct 95 C. Plumb Changed infInflate core function so that
* it sucks the input completely dry before
* returning. This gets rid of the old ad-hoc
* technique for flushing out the last little
* bit of a compressed file by padding it
* with some dummy data, which was ugly.
*/
/*
* Inflate deflated (PKZIP's method 8 compressed) data. The compression
* method searches for as much of the current string of bytes (up to a
* length of 258) in the previous 32K bytes. If it doesn't find any
* matches (of at least length 3), it codes the next byte. Otherwise, it
* codes the length of the matched string and its distance backwards from
* the current position. There is a single Huffman code that codes both
* single bytes (called "literals") and match lengths. A second Huffman
* code codes the distance information, which follows a length code. Each
* length or distance code actually represents a base value and a number
* of "extra" (sometimes zero) bits to get to add to the base value. At
* the end of each deflated block is a special end-of-block (EOB) literal/
* length code. The decoding process is basically: get a literal/length
* code; if EOB then done; if a literal, emit the decoded byte; if a
* length then get the distance and emit the referred-to bytes from the
* sliding window of previously emitted data.
*
* There are (currently) three kinds of inflate blocks: stored, fixed, and
* dynamic. The compressor outputs a chunk of data at a time and decides
* which method to use on a chunk-by-chunk basis. A chunk might typically
* be 32K to 64K, uncompressed. If the chunk is uncompressible, then the
* "stored" method is used. In this case, the bytes are simply stored as
* is, eight bits per byte, with none of the above coding. The bytes are
* preceded by a (16-bit) count, since there is no longer an EOB code.
*
* If the data is compressible, then either the fixed or dynamic methods
* are used. In the dynamic method, the compressed data is preceded by
* an encoding of the literal/length and distance Huffman codes that are
* to be used to decode this block. The representation is itself Huffman
* coded, and so is preceded by a description of that code. These code
* descriptions take up a little space, and so for small blocks, there is
* a predefined set of codes, called the fixed codes. The fixed method is
* used if the block ends up smaller that way (usually for quite small
* chunks); otherwise the dynamic method is used. In the latter case, the
* codes are customized to the probabilities in the current block and so
* can code it much better than the pre-determined fixed codes can.
*
* The Huffman codes themselves are decoded using a multi-level table
* lookup, in order to maximize the speed of decoding plus the speed of
* building the decoding tables. See the comments below that precede the
* LBITS and DBITS tuning parameters.
*/
/*
* Notes beyond the 1.93a appnote.txt:
*
* 1. Distance pointers never point before the beginning of the output
* stream.
* 2. Distance pointers can point back across blocks, up to 32k away.
* 3. There is an implied maximum of 7 bits for the bit length table and
* 15 bits for the actual data.
* 4. If only one code exists, then it is encoded using one bit. (Zero
* would be more efficient, but perhaps a little confusing.) If two
* codes exist, they are coded using one bit each (0 and 1).
* 5. There is no way of sending zero distance codes--a dummy must be
* sent if there are none. (History: a pre 2.0 version of PKZIP would
* store blocks with no distance codes, but this was discovered to be
* too harsh a criterion.) Valid only for 1.93a. 2.04c does allow
* zero distance codes, which is sent as one code of zero bits in
* length.
* 6. There are up to 286 literal/length codes. Code 256 represents the
* end-of-block. Note however that the static length tree defines
* 288 codes just to fill out the Huffman codes. Codes 286 and 287
* cannot be used though, since there is no length base or extra bits
* defined for them. Similarly, there are up to 30 distance codes.
* However, static trees define 32 codes (all 5 bits) to fill out the
* Huffman codes, but the last two had better not show up in the data.
* 7. Unzip can check dynamic Huffman blocks for complete code sets.
* The exception is that a single code would not be complete (see #4).
* 8. The five bits following the block type are really the number of
* literal codes sent minus 257.
* 9. Length codes 8,16,16 are interpreted as 13 length codes of 8 bits
* (1+6+6). Therefore, to output three times the length, you output
* three codes (1+1+1), whereas to output four times the same length,
* you only need two codes (1+3). Hmm.
* 10. In the tree reconstruction algorithm, Code = Code + Increment
* only if BitLength(i) is not zero. (Pretty obvious.)
* 11. Correction: 4 Bits: # of Bit Length codes - 4 (4 - 19)
* 12. Note: length code 284 can represent 227-258, but length code 285
* really is 258. The last length deserves its own, short code
* since it gets used a lot in very redundant files. The length
* 258 is special since 258 - 3 (the min match length) is 255.
* 13. The literal/length and distance code bit lengths are read as a
* single stream of lengths. It is possible (and advantageous) for
* a repeat code (16, 17, or 18) to go across the boundary between
* the two sets of lengths.
*/
#ifdef HAVE_CONFIG_H
#include "config.h"
#endif
#include <string.h> /* for memset() */
#include "pgpDebug.h"
#include "pgpMem.h"
#include "pgpUsuals.h"
#include "pgpZInflate.h"
/* Increasing these only wastes space, so they can be changed if necessary */
typedef word16 ush;
typedef word32 ulg;
/* Private context structure for use by all functions. */
struct InflateContext {
/* Major state information */
int state;
int substate;
int lastblock;
/* Output-related information */
unsigned char *slide; /* Circular output buffer */
unsigned char *slideend;
unsigned slidelen; /* slideend-slide - MUST BE POWER OF 2 */
unsigned slidemask; /* slidelen-1 */
unsigned char *outptr; /* Pointer into slide */
unsigned char const *readptr; /* Pointer into slide for reading */
/* Input-related information */
unsigned char const *inptr; /* Input pointer */
int inlen; /* Input length */
ulg bitbuffer;
unsigned bufbits;
/* The literal/length tree - also used for the bit-length tree */
struct huft *tree1;
unsigned bits1;
/* The distance tree */
struct huft *tree2;
unsigned bits2;
/* Encoded huffman tree */
unsigned char bitlen[286+30];
/* For dynamic trees */
unsigned bitlengths;
unsigned litcodes;
unsigned distcodes;
/* Used in various places */
struct huft const *t;
unsigned copylen;
int distbase;
int numbits; /* # of bits in dyn. tree repeat and after dist. code*/
unsigned index; /* Position in dynamic tree bit lengths we've read to*/
};
/* Options for behaviour */
#ifndef SECURE
#define SECURE 1 /* Wipe memory before freeing it? */
#endif
#ifndef WSIZE
#define WSIZE 32768 /* Window size - 32768 for PKZIP compatibility */
#endif
#ifndef STRICT
#define STRICT 1 /* Require unused bits to be zero? */
#endif
/*
* Huffman code lookup table entry--this entry is four bytes for machines
* that have 16-bit pointers (e.g. PC's in the small or medium model).
* Valid extra bits are 0..13. e == 32 is EOB (end of block), e == 16
* means that v is a literal, e < 0 means that v is a pointer to the next
* table, which codes -e bits, and lastly e == -128 indicates an unused
* code. If a code with e == -128 is looked up, this implies an error in
* the data.
*/
struct huft {
union {
struct {
signed char exop; /* # of extra bits or operation */
char bits; /* # of bits in this code/subcode */
} what;
char *pad; /* pad structure to a power of 2 (4 bytes for*/
} word; /* 16-bit, 8 bytes for 32-bit machines) */
union {
ush base; /* literal, length base, or distance base */
struct huft *next; /* pointer to next level of table */
} more;
};
#define base more.base
#define next more.next
#define exop word.what.exop
#define bits word.what.bits
/* Function prototypes */
#ifndef OF
# if defined(__STDC__) || defined(__cplusplus)
# define OF(a) a
# else
# define OF(a) ()
# endif
#endif
static int huft_build OF((unsigned char const *, unsigned, unsigned,
ush const *, ush const *, struct huft **,
unsigned *));
static int huft_free OF((struct huft *));
/*
* The inflate algorithm uses a sliding 32K byte window on the uncompressed
* stream to find repeated byte strings. This is implemented here as a
* circular buffer. The index is updated simply by incrementing and then
* and'ing with 0x7fff (32K-1).
*/
/* Tables for deflate from PKZIP's appnote.txt. */
static unsigned const border[] = { /* Order of the bit length code lengths */
16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15};
static ush const cplens[] = { /* Copy lengths for literal codes 257..285 */
1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 15, 17, 21, 25, 29,
33, 41, 49, 57, 65, 81, 97, 113, 129, 161, 193, 225, 256, 0, 0};
/* actually lengths - 2; also see note #12 above about 258 */
static ush const cplext[] = { /* Extra bits for literal codes 257..285 */
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0, 128, 128}; /* 128==invalid */
static ush const cpdist[] = { /* Copy offsets for distance codes 0..29 */
1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
8193, 12289, 16385, 24577};
static ush const cpdext[] = { /* Extra bits for distance codes */
0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
7, 7, 8, 8, 9, 9, 10, 10, 11, 11,
12, 12, 13, 13};
/* And'ing with mask[n] masks the lower n bits */
static unsigned const mask[17] = {
0x0000,
0x0001, 0x0003, 0x0007, 0x000f, 0x001f, 0x003f, 0x007f, 0x00ff,
0x01ff, 0x03ff, 0x07ff, 0x0fff, 0x1fff, 0x3fff, 0x7fff, 0xffff
};
/* Macros for inflate() bit peeking and grabbing.
* The usage is:
*
* NEEDBITS(j)
* x = b & mask[j];
* DUMPBITS(j)
*
* where NEEDBITS makes sure that b has at least j bits in it, and
* DUMPBITS removes the bits from b. The macros use the variable k
* for the number of bits in b. Normally, b and k are register
* variables for speed, and are initialized at the beginning of a
* routine that uses these macros from a global bit buffer and count.
*
* It is allowed for more bits to be requested than actually used.
* The remainder stay in the bit buffer. NOTE that a few more
* bytes than necessary may be grabbed at the end of input.
* This should not be fatal as long as the grabbed bits stay
* in the bit buffer. The wrapup functions should check for this.
*
* There is also the macro GRABBITS which is like NEEDBITS, but uses
* *g++ to get the byte without checking availability. This requires
* using AVAILBYTES first to assure that the needed bytes are there
* already. g is set to inptr and then inptr is restored to g around
* this usage.
*/
/*
* Variable usage:
* b = bit buffer, k least-significant bits valid
* k = # of bits in bit buffer
* g = grab bits pointer (pointer to next byte to add: b |= *g++ << k)
* s = # of bytes available
* ctx = InflateContext
*/
/* Functions to save the state of the bit buffer before returning */
#define LOADBITBUF \
(b=ctx->bitbuffer, k=ctx->bufbits, g=ctx->inptr, s=ctx->inlen)
#define SAVEBITBUF \
(ctx->bitbuffer=b, ctx->bufbits=k, ctx->inptr=g, ctx->inlen=s)
#define SAVEBITBUFEMPTY \
(ctx->bitbuffer=b, ctx->bufbits=k, ctx->inptr=g, ctx->inlen=0)
/*
* A simple macro to ensure that at least "x" bits are in the bit buffer.
* Does NOT check for input underflow (s < 0).
*/
#define GRABBITS(x) \
while (k < x) { \
--s; \
b |= (ulg)*g++ << k; \
k += 8; \
}
/*
* Get "x" bits in the input buffer or execute the "whatif" code.
* NOTE NOTE NOTE that if the "whatif" code is executed, s is set to -1!
* (This is faster on a lot of machines.) You will usually have to
* reset it to 0 manually.
*/
#define GETBITSOR(x, whatif) \
while (k < (unsigned)(x)) { \
if (--s < 0) { \
whatif; \
} \
b |= (ulg)*g++ << k; \
k += 8; \
}
/*
* A macro to ensure that there are at least "x" bits available in the
* bit buffer. If there are not, the bit buffer is saved back into the
* context, and any code in "savestate" is run to save any additional
* context that might be needed. Then, "return 1" means that more input is
* needed.
*/
#define NEEDBITS(x, savestate) \
GETBITSOR(x, savestate; SAVEBITBUFEMPTY; return 1)
/*
* NEEDBITS2 uses two figures for the number of bits required. The first,
* x, is cheap to compute and is used unless we run out of input data.
* If that happens, the more expensive value, x2, is used.
*/
#define NEEDBITS2(x, x2, savestate) \
NEEDBITS(x, if (k >= (unsigned)(x2)) {s=0; break;} savestate)
#define DUMPBITS(x) (b >>= x, k -= x)
/*
* Huffman code decoding is performed using a multi-level table lookup.
* The fastest way to decode is to simply build a lookup table whose
* size is determined by the longest code. However, the time it takes
* to build this table can also be a factor if the data being decoded
* is not very long. The most common codes are necessarily the
* shortest codes, so those codes dominate the decoding time, and hence
* the speed. The idea is you can have a shorter table that decodes the
* shorter, more probable codes, and then point to subsidiary tables for
* the longer codes. The time it costs to decode the longer codes is
* then traded against the time it takes to make longer tables.
*
* The results of this trade are in the values LBITS and DBITS
* below. LBITS is the number of bits the first level table for literal/
* length codes can decode in one step, and DBITS is the same thing for
* the distance codes. Subsequent tables are also less than or equal to
* those sizes. These values may be adjusted either when all of the
* codes are shorter than that, in which case the longest code length in
* bits is used, or when the shortest code is *longer* than the requested
* table size, in which case the length of the shortest code in bits is
* used.
*
* There are two different values for the two tables, since they code a
* different number of possibilities each. The literal/length table
* codes 286 possible values, or in a flat code, a little over eight
* bits. The distance table codes 30 possible values, or a little less
* than five bits, flat. The optimum values for speed end up being
* about one bit more than those, so LBITS is 8+1 and DBITS is 5+1.
* The optimum values may differ though from machine to machine, and
* possibly even between compilers. Your mileage may vary.
*/
#define LBITS 9 /* bits in base literal/length lookup table */
#define DBITS 6 /* bits in base distance lookup table */
/* If BMAX needs to be larger than 16, then h and x[] should be ulg. */
#define BMAX 15 /* maximum bit length of any code (16 for explode) */
#define N_MAX 288 /* maximum number of codes in any set */
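/*
 * Illustrative sketch (not from the PGP sources) of the table walk the
 * comments above describe: index the first-level table with the low
 * LBITS/DBITS bits, then follow "next" sub-tables while exop is negative.
 * The real decode loops in infInflate() below do exactly this, folded
 * into their DUMPBITS bit-buffer bookkeeping; "bitbuf" here stands for a
 * hypothetical, already-filled bit buffer.
 */
static int
huft_walk_sketch(struct huft const *table, unsigned tablebits, ulg bitbuf,
	ush *value)
{
	struct huft const *t = table + ((unsigned)bitbuf & mask[tablebits]);
	int e = t->exop;

	while (e < 0) {			/* negative exop: sub-table of -e bits */
		if (e == -128)
			return -1;	/* unused code: corrupt input */
		bitbuf >>= t->bits;	/* drop the bits this level consumed */
		e = -e;
		t = t->next + ((unsigned)bitbuf & mask[e]);
		e = t->exop;
	}
	*value = t->base;	/* literal, length base, or distance base */
	return e;		/* 16 = literal, 32 = EOB, else # of extra bits */
}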
#if defined(__STDC__) || defined(__cplusplus)
static int
huft_build(unsigned char const *b, unsigned n, unsigned s, ush const *d,
ush const *e, struct huft **t, unsigned *m)
#else
static int
huft_build(b, n, s, d, e, t, m)
unsigned char *b; /* code lengths in bits (all assumed <= BMAX) */
unsigned n; /* number of codes (assumed <= N_MAX) */
unsigned s; /* number of simple-valued codes (0..s-1) */
ush const *d; /* list of base values for non-simple codes */
ush const *e; /* list of extra bits for non-simple codes */
struct huft **t; /* result: starting table */
unsigned *m; /* suggested lookup bits, returns actual */
#endif
/*
* Given a list of code lengths and a maximum table size, make a set of
* tables to decode that set of codes. Return values:
* 0 = success
* 1 = the given code set is incomplete
* (the tables are still built in this case)
* 2 = the input is invalid
* (all zero length codes or an oversubscribed set of lengths)
* 3 = not enough memory.
*/
{
unsigned a; /* counter for codes of length k */
unsigned c[BMAX+1]; /* bit length count table */
unsigned f; /* i repeats in table every f entries */
int g; /* maximum code length */
int h; /* table level */
register unsigned i; /* counter, current code */
register unsigned j; /* counter */
register int k; /* number of bits in current code */
int l; /* bits per table (returned in m) */
register unsigned char const *bp; /* pointer into b[] */
register unsigned *p; /* pointer into c[] or v[] */
register struct huft *q; /* points to current table */
struct huft r; /* table entry for structure assignment */
struct huft *u[BMAX]; /* table stack */
unsigned v[N_MAX]; /* values in order of bit length */
register int w; /* bits before this table == (l * h) */
unsigned x[BMAX+1]; /* bit offsets, then code stack */
unsigned *xp; /* pointer into x */
int y; /* number of dummy codes added */
unsigned z; /* number of entries in current table */
/* Generate counts for each bit length */
memset(c, 0, sizeof(c));
bp = b; i = n;
do {
c[*bp++]++; /* assume all entries <= BMAX */
} while (--i);
if (c[0] == n) { /* null input--all zero length codes */
*t = (struct huft *)NULL;
*m = 0;
return 0;
}
/* Find minimum and maximum length, bound *m by those */
l = *m;
for (j = 1; j <= BMAX; j++)
if (c[j])
break;
k = j; /* minimum code length */
if ((unsigned)l < j)
l = j;
for (i = BMAX; i; i--)
if (c[i])
break;
g = i; /* maximum code length */
if ((unsigned)l > i)
l = i;
*m = l;
/* Adjust last length count to fill out codes, if needed */
for (y = 1 << j; j < i; j++, y <<= 1)
if ((y -= c[j]) < 0)
return 2; /* bad input: more codes than bits */
if ((y -= c[i]) < 0)
return 2;
c[i] += y;
/* Generate starting offsets into the value table for each length */
x[1] = j = 0;
p = c + 1;
xp = x + 2;
while (--i) { /* note that i == g from above */
*xp++ = (j += *p++);
}
/* Make a table of values in order of bit lengths */
bp = b; i = 0;
do {
if ((j = *bp++) != 0)
v[x[j]++] = i;
} while (++i < n);
/* Generate the Huffman codes and for each, make the table entries */
x[0] = i = 0; /* first Huffman code is zero */
p = v; /* grab values in bit order */
h = -1; /* no tables yet--level -1 */
w = -l; /* bits decoded == (l * h) */
u[0] = (struct huft *)NULL; /* just to keep compilers happy */
q = (struct huft *)NULL; /* ditto */
z = 0; /* ditto */
/* go through the bit lengths (k already is bits in shortest code) */
for (; k <= g; k++) {
a = c[k];
while (a--) {
/* here i is the Huffman code of length k bits for value *p */
/* make tables up to required level */
while (k > w + l) {
h++;
w += l; /* previous table always l bits */
/* compute minimum size table less than or equal to l bits */
z = (z = g - w) > (unsigned)l ? l : z; /* upper limit on table size */
if ((f = 1 << (j = k - w)) > a + 1) /* try a k-w bit table */
{ /* too few codes for k-w bit table */
f -= a + 1; /* deduct codes from patterns left */
xp = c + k;
while (++j < z) { /* try smaller tables up to z bits */
if ((f <<= 1) <= *++xp)
break; /* enough codes to use up j bits */
f -= *xp; /* else deduct codes from patterns */
}
}
z = 1 << j; /* table entries for j-bit table */
/* allocate and link in new table */
q = (struct huft *)pgpMemAlloc((z + 1)*sizeof(struct huft));
if (q == (struct huft *)NULL) {
if (h)
huft_free(u[0]);
return 3; /* not enough memory */
}
#if SECURE
q->bits = (char)j; /* Track size for free */
#endif
*t = q + 1; /* link to list for huft_free() */
*(t = &(q->next)) = (struct huft *)NULL;
u[h] = ++q; /* table starts after link */
/* connect to last table, if there is one */
if (h) {
x[h] = i; /* save pattern for backing up */
r.bits = (char)l; /* bits to dump before this table */
r.exop = -(char)j; /* bits in this table */
r.next = q; /* pointer to this table */
j = i >> (w - l); /* (get around Turbo C bug) */
u[h-1][j] = r; /* connect to last table */
}
}
/* set up table entry in r */
r.bits = (char)(k - w);
if (p >= v + n) {
r.exop = -128; /* out of values--invalid code */
} else if (*p < s) {
r.exop = (char)(*p < 256 ? 16 : 32); /* 256 is end-of-block code */
r.base = *p++; /* simple code is just the value */
} else {
r.exop = (char)e[*p - s]; /* non-simple--look up in lists */
r.base = d[*p++ - s];
}
/* fill code-like entries with r */
f = 1 << (k - w);
for (j = i >> w; j < z; j += f)
q[j] = r;
/* backwards increment the k-bit code i */
for (j = 1 << (k - 1); i & j; j >>= 1)
i ^= j;
i ^= j;
/* back up over finished tables */
while ((i & ((1 << w) - 1)) != x[h]) {
h--; /* don't need to update q */
w -= l;
}
}
}
/* Return true (1) if we were given an incomplete table */
return y != 0 && g != 1;
}
#if defined(__STDC__) || defined(__cplusplus)
static int
huft_free(struct huft *t)
#else
static int
huft_free(t)
struct huft *t; /* table to free */
#endif
/*
* Free the malloc'ed tables built by huft_build(), which makes a linked
* list of the tables it made, with the links in a dummy -1st entry of
* each table.
*/
{
register struct huft *p, *q;
/* Go through linked list, freeing from the malloced (t[-1]) address. */
p = t;
while (p != (struct huft *)NULL)
{
q = (--p)->next;
#if SECURE
memset(p, 0, (1+(1<<p->bits)) * sizeof(*p));
#endif
pgpMemFree(p);
p = q;
}
return 0;
}
/*
* Colin's new code.
*/
#define STATE_START 0
#define STATE_STOREDLEN 1
#define STATE_STORED 2
#define STATE_DYN_HDR 3
#define STATE_DYN_BITS 4
#define STATE_DYN_TREE 5
#define STATE_INFLATE 6
#define STATE_DONE 7
#define STATE_STOP 8
/*
* All these functions return
* 2 if the output buffer is full
* 1 if the input buffer is empty
* 0 if things should continue with the new ctx->state, and
* <0 on error.
*/
static int
infStoredLen(struct InflateContext *ctx)
{
ulg b;
unsigned k;
int s;
unsigned char const *g;
unsigned t;
LOADBITBUF;
t = k & 7;
#if STRICT
if (b & mask[t]) /* Require unused bits are zero */
return INFERR_BADINPUT;
#endif
DUMPBITS(t);
NEEDBITS(32, (void)0);
/* At this point, k is exactly 32 */
ctx->copylen = (unsigned)b & 0xffff;
t = ~(unsigned)(b>>16) & 0xffff;
b = 0;
k = 0;
SAVEBITBUF;
if (t != ctx->copylen)
return INFERR_BADINPUT; /* Two lengths not identical */
ctx->state = STATE_STORED;
ctx->substate = 0;
return 0;
}
/*
* Process the data in a stored block. Copy the input data to
* the circular buffer, flushing it as necessary.
*/
static int
infStored(struct InflateContext *ctx)
{
int len; /* # of bytes to copy */
int retval = 1; /* Assume the input buffer runs out first */
pgpAssert(ctx->bufbits == 0);
len = ctx->inlen;
if ((unsigned)len >= ctx->copylen) {
len = ctx->copylen; /* Copy done! */
ctx->state = STATE_START;
retval = 0;
}
if (len > ctx->slideend-ctx->outptr) {
len = (int)(ctx->slideend-ctx->outptr);
retval = 2; /* Output buffer full */
}
memcpy(ctx->outptr, ctx->inptr, len);
ctx->inptr += len;
ctx->inlen -= len;
ctx->outptr += len;
ctx->copylen -= (unsigned)len;
/* Return 2 if output full, 1 if input empty, 0 otherwise (copy done)*/
return retval;
}
static int
infDynHdr(struct InflateContext *ctx)
{
ulg b;
unsigned k;
int s;
unsigned char const *g;
LOADBITBUF;
NEEDBITS(14, (void)0);
ctx->litcodes = 257 + ((unsigned)b & 0x1f);
DUMPBITS(5);
ctx->distcodes = 1 + ((unsigned)b & 0x1f);
DUMPBITS(5);
ctx->bitlengths = 4 + ((unsigned)b & 0xf);
DUMPBITS(4);
SAVEBITBUF;
#ifndef PKZIP_BUG_WORKAROUND
if (ctx->litcodes > 286 || ctx->distcodes > 30)
return INFERR_BADINPUT; /* Bad lengths */
#endif
ctx->state = STATE_DYN_BITS;
return 0;
}
/* Load up the ctx->bitlen array with the bit lengths for the encoded
* trees. Unpermute them according to the border[] array.
*/
static int
infDynBits(struct InflateContext *ctx)
{
ulg b;
unsigned k;
int s;
unsigned char const *g;
struct huft *t;
int i, lim;
LOADBITBUF;
i = ctx->substate;
lim = ctx->bitlengths;
while (i < lim) {
NEEDBITS(3, ctx->substate=i)
ctx->bitlen[border[i++]] = (unsigned)b & 7;
DUMPBITS(3);
}
SAVEBITBUF;
while (i < 19)
ctx->bitlen[border[i++]] = 0;
k = 7; /* Bits in tree table */
i = huft_build(ctx->bitlen, 19, 19, NULL, NULL, &t, &k);
if (i) {
if (i == 1)
huft_free(t); /* Incomplete code set */
return i == 3 ? INFERR_NOMEM : INFERR_BADINPUT;
}
ctx->tree1 = t;
ctx->bits1 = k;
ctx->state = STATE_DYN_TREE;
ctx->substate = 0;
return 0;
}
/* The static tree (built on demand) */
static struct huft *fixed_tl = NULL;
static struct huft *fixed_td;
static unsigned fixed_bl, fixed_bd;
static int
infFixedTree(void)
{
int i;
unsigned char l[288];
if (fixed_tl)
return 0; /* All okay */
/* literal table */
for (i = 0; i < 144; i++)
l[i] = 8;
for (; i < 256; i++)
l[i] = 9;
for (; i < 280; i++)
l[i] = 7;
for (; i < 288; i++) /* make a complete, but wrong code set */
l[i] = 8;
fixed_bl = 7;
i = huft_build(l, 288, 257, cplens, cplext, &fixed_tl, &fixed_bl);
if (i != 0) {
fixed_tl = NULL;
return INFERR_NOMEM;
}
/* distance table */
for (i = 0; i < 30; i++) /* make an incomplete code set */
l[i] = 5;
fixed_bd = 5;
i = huft_build(l, 30, 0, cpdist, cpdext, &fixed_td, &fixed_bd);
if (i > 1) {
huft_free(fixed_tl);
fixed_tl = NULL;
return INFERR_NOMEM;
}
return 0;
}
static int
infDynTree(struct InflateContext *ctx)
{
ulg b;
unsigned k;
int s;
unsigned char const *g;
struct huft *tl, *td;
unsigned bl, bd;
unsigned i, j, l, m, lim;
LOADBITBUF;
i = ctx->index;
l = ctx->numbits; /* Previous # of bits for repeat code use */
lim = ctx->litcodes + ctx->distcodes;
tl = ctx->tree1;
bl = ctx->bits1;
m = mask[bl];
switch(ctx->substate) {
case 0:
i = 0;
l = 0;
/*FALLTHROUGH*/
case 1:
while (i < lim) {
NEEDBITS2(bl, tl[(unsigned)b & m].bits,
ctx->index=i; ctx->numbits=l; ctx->substate=1)
td = tl + ((unsigned)b & m);
j = td->bits;
DUMPBITS(j);
j = td->base;
if (j < 16) {
ctx->bitlen[i++] = l = j;
} else if (j == 16) { /* 3 to 6 repeats */
/*FALLTHROUGH*/
case 2:
NEEDBITS(2, ctx->index=i; ctx->numbits=l;
ctx->substate=2)
j = 3 + ((unsigned)b & 3);
DUMPBITS(2);
if (i + j > lim)
return INFERR_BADINPUT;
do {
ctx->bitlen[i++] = (unsigned char)l;
} while (--j);
} else if (j == 17) { /* 3 to 10 zeros */
/*FALLTHROUGH*/
case 3:
NEEDBITS(3, ctx->index=i; ctx->substate=3)
j = 3 + ((unsigned)b & 7);
DUMPBITS(3);
if (i + j > lim)
return INFERR_BADINPUT;
do {
ctx->bitlen[i++] = 0;
} while (--j);
l = 0;
} else { /* j == 18 -- 11 to 138 zeros */
pgpAssert(j == 18);
/*FALLTHROUGH*/
case 4:
NEEDBITS(7, ctx->index=i; ctx->substate=4)
j = 11 + ((unsigned)b & 127);
DUMPBITS(7);
if (i + j > lim)
return INFERR_BADINPUT;
do {
ctx->bitlen[i++] = 0;
} while (--j);
l = 0;
}
}
}
/* Input finished */
SAVEBITBUF;
huft_free(tl);
ctx->tree1 = 0;
/* Now build the trees - literal/length, then distance */
bl = LBITS;
i = huft_build(ctx->bitlen, ctx->litcodes, 257,
cplens, cplext, &tl, &bl);
if (i != 0) {
if (i == 1)
huft_free(tl);
return i == 3 ? INFERR_NOMEM : INFERR_BADINPUT;
}
bd = DBITS;
i = huft_build(ctx->bitlen+ctx->litcodes, ctx->distcodes, 0,
cpdist, cpdext, &td, &bd);
#ifdef PKZIP_BUG_WORKAROUND
if (i > 1) {
#else
if (i != 0) {
if (i == 1)
huft_free(td);
#endif
huft_free(tl);
return i == 3 ? INFERR_NOMEM : INFERR_BADINPUT;
}
ctx->tree1 = tl;
ctx->bits1 = bl;
ctx->tree2 = td;
ctx->bits2 = bd;
ctx->state = STATE_INFLATE;
ctx->substate = 0;
return 0;
}
/*
* This is the heart of ZIP inflation. The code is heavily optimized
* for speed, including for many brain-damaged compilers that can only
* optimize one statement at a time. To generate better code from
* such idiot compilers (which are distressingly common on platforms
* such as MS-DOS), expressions are made big and complex, and intermediate
* results are assigned to variables in the expression where possible
* so the compiler won't try to use a disjoint temporary and have to
* spill. This makes the code a little hard to follow at times.
* Sorry.
*
* Also, due to the deep nesting (it's all in one function, again, for
* speed even on systems that can't inline), the indent amount is only
* two spaces and variable names are very short.
*
* This does NOT currently detect a reference to before the beginning of
* the file. It just blindly uses the circular buffer.
*/
static int
infInflate(struct InflateContext *ctx)
{
ulg b; /* Bit buffer info - bit buffer itself */
unsigned k; /* Number of low-order bits of b that are valid */
int s; /* Bytes of valid input remaining */
unsigned char const *g; /* Input buffer pointer */
unsigned char *w; /* Window output write pointer */
unsigned r; /* Space available after w (always >0) */
struct huft *tl, *td; /* Tree info (const except for EOB processing) */
unsigned bl, bd;
unsigned ml, md;
struct huft const *t; /* Temporary tree pointer */
unsigned char const *p; /* Copy source pointer */
int i;
int e;
unsigned n; /* Copy length */
unsigned d; /* Copy distance */
LOADBITBUF; /* Load b, k, s, g */
w = ctx->outptr;
r = ctx->slideend-w;
pgpAssert(r);
tl = ctx->tree1; /* Literal/length tree */
bl = ctx->bits1; /* Number of bits in top-level table */
ml = mask[bl]; /* Mask of that many bits */
td = ctx->tree2; /* Distance tree */
bd = ctx->bits2; /* Number of bits in top-level table */
md = mask[bd]; /* Mask of that many bits */
/*
* We don't always need all of these, but it's easier to always
* load them than to do it conditionally.
*/
t = ctx->t;
n = ctx->copylen;
d = ctx->distbase;
e = ctx->numbits;
switch (ctx->substate) {
case 0:
for (;;) { /* do until end of block */
/*
* This cheap loop does no input or output checking.
* At most 258 bytes of output are produced per iteration,
* so we must do no more than r/258 iterations.
* Also, at most 15+5+15+13 = 48 bits = 6 bytes are consumed
* every iteration, so we can do no more than s/6 iterations.
* Also, to allow for starting with an empty bit buffer and
* ending with a full one, subtract 4 bytes, so it's (s-4)/6.
*
* These are approximated by r/256-1 and (s-2)/8, respectively.
* Note: for a file that compresses to 258 bytes per symbol (e.g. all
* the same character), that approximation would break down if
*
* r/256-1 > r/258
* thus, 258 * r - 256 * 258 > 256*r
* thus, 2 * r > 256 * 258
* thus, r > 128 * 258 = 33024 > 32768
*
* Of course, the window is <= 32768, so it's not a concern.
*
* For the input approximation to break down, you'd have to have
*
* (s-2)/8 > (s-4)/6
* thus, 6*s - 12 > 8*s - 32
* thus, 20 > 2*s
* thus, 10 > s
*
* But for s<10, both approximations (since this is integer math)
* return 0, so it's not a problem. The r/258 limit is usually
* hit first, so the crudeness of this approximation is acceptable.
*/
while (r >= 512 && s >= 10) {
/* Compute number of iterations we can do, worst-case */
if ((e = (s - 2) >> 3) < (i = (r >> 8) - 1))
i = e;
/*
* This is the cheap loop, which is performed i times before
* the available buffer space is re-evaluated. If you want to
* understand how the inflation process works, this is the best
* part of the code to read, since it isn't cluttered up with
* error checking. First comes a literal/length code, which
* can be either a literal (0-255), an end of block code (256),
* or a length code (257-285) followed by a distance code.
*/
do {
GRABBITS(20); /* max bits for literal/length code */
t = tl + ((unsigned)b & ml);
if ((e = t->exop) < 0) {
do {
if (e == -128)
return INFERR_BADINPUT;
DUMPBITS(t->bits);
e = -e;
t = t->next + ((unsigned)b & mask[e]);
} while ((e = t->exop) < 0);
}
DUMPBITS(t->bits);
if (e & 16) { /* then it's a literal */
*w++ = (unsigned char)t->base;
continue;
}
if (e & 32) /* EOB */
goto EOB; /* At end of function */
/* Else length code */
/* get length of block to copy */
n = t->base + ((unsigned)b & mask[e]);
DUMPBITS(e);
/* decode distance of block to copy */
GRABBITS(15);
t = td + ((unsigned)b & md);
if ((e = t->exop) < 0) {
do {
if (e == -128)
return INFERR_BADINPUT;
DUMPBITS(t->bits);
e = -e;
t = t->next + ((unsigned)b & mask[e]);
} while ((e = t->exop) < 0);
}
DUMPBITS(t->bits);
GRABBITS((unsigned)e); /* get up to 13 extra bits */
d = t->base + ((unsigned)b & mask[e]);
DUMPBITS((unsigned)e);
#if WSIZE < 32768
if (d > ctx->slidelen)
return INFERR_BADINPUT;
#endif
/* do the copy */
if ((unsigned)(w - ctx->slide) >= d) {
p = w - d;
*w++ = *p++;
*w++ = *p++;
do {
*w++ = *p++;
} while (--n);
} else {
n += 2;
p = w - d + ctx->slidelen;
do {
n -= (e = (unsigned)(e = ctx->slideend - (p>w?p:w)) > n ? n : e);
do {
*w++ = *p++;
} while (--e);
if (p == ctx->slideend)
p = ctx->slide;
} while (n);
}
} while (--i);
r = ctx->slideend - w;
} /* while (we can use the cheap loop) */
/*
* Okay, we've fallen through from the cheap loop to the
* expensive loop. This one checks each time it gets bits
* from the input or writes bytes to the output that there
* is enough room. However, there are two versions of
* much of *this*, too! The first uses worst-case figures
* for the amount of input data needed, and obtains one batch of
* input bits for several uses.
* The second carefully avoids asking for any more bits than it really
* needs. When it gives up and returns, it is really not possible
* to extract any more symbols from the input data.
*/
/* Pessimistic estimate of bits for literal/length: 15+5 */
GETBITSOR(20, goto getlitlength2)
t = tl + ((unsigned)b & ml);
if ((e = t->exop) < 0) {
do {
if (e == -128)
return INFERR_BADINPUT;
DUMPBITS(t->bits);
e = -e;
t = t->next + ((unsigned)b & mask[e]);
} while ((e = t->exop) < 0);
}
DUMPBITS(t->bits);
if (e & 16) { /* then it's a literal */
*w++ = (unsigned char)t->base;
if (--r == 0) {
SAVEBITBUF;
ctx->outptr = w;
ctx->substate = 0; /* Restart at top of loop */
return 2; /* Output buffer full */
}
continue;
}
if (e & 32)
break; /* Leave infinite loop */
goto gotlength;
getlitlength2:
/* Method 2: We don't even have 20 bits available - do it bit-by-bit. */
t = tl + ((unsigned)b & ml);
/*
* See if higher-order bits we're missing actually matter.
* We don't have to try to fill the bit buffer, because we only get
* here if s (number of following input bytes) is supposed to be zero.
*/
if (k < (unsigned)t->bits) {
ctx->outptr = w;
ctx->substate = 0;
SAVEBITBUFEMPTY;
return 1;
}
/* Actually, s is set (see GETBITSOR comment) to -1, so reset it */
s = 0;
if ((e = t->exop) < 0) {
do {
if (e == -128)
return INFERR_BADINPUT;
DUMPBITS(t->bits);
e = -e;
t = t->next;
case 1:
NEEDBITS2(e, t[(unsigned)b & mask[e]].bits,
ctx->t=t; ctx->outptr=w; ctx->numbits=e; ctx->substate=1)
t += (unsigned)b & mask[e];
} while ((e = t->exop) < 0);
}
DUMPBITS(t->bits);
if (e & 16) { /* then it's a literal */
*w++ = (unsigned char)t->base;
if (--r == 0) {
/* We just filled the output buffer - return */
SAVEBITBUF;
ctx->outptr = w;
ctx->substate = 0; /* Restart at top of loop */
return 2; /* Output buffer full */
}
continue;
}
if (e & 32) /* EOB */
break; /* Leave infinite loop */
/*FALLTHROUGH*/
case 2:
NEEDBITS(e, ctx->t = t; ctx->outptr=w; ctx->numbits=e; ctx->substate=2)
gotlength:
/* All data is available - compute total length of block to copy */
n = t->base + ((unsigned)b & mask[e]);
DUMPBITS((unsigned)e);
/*FALLTHROUGH*/
case 3:
/* Get distance code - 15 is maximum code size */
GETBITSOR(15, goto getdistance2)
t = td + ((unsigned)b & md);
/* Negative t->exop means there's a subtable of 2^-e entries */
if ((e = t->exop) < 0) {
do {
if (e == -128)
return INFERR_BADINPUT; /* Invalid code (static table case) */
DUMPBITS(t->bits);
e = -e;
t = t->next + ((unsigned)b & mask[e]);
} while ((e = t->exop) < 0);
}
goto gotdistance;
getdistance2:
/* We don't even have 15 bits available - do it bit-by-bit. */
t = td + ((unsigned)b & md);
/*
* See if higher-order bits we're missing actually matter.
* We don't have to try to fill the bit buffer, because we only get
* here if s (number of following input bytes) is supposed to be zero.
*/
if (k < (unsigned)t->bits) {
ctx->outptr = w;
ctx->copylen = n;
ctx->substate = 3;
SAVEBITBUFEMPTY;
return 1;
}
/* Actually (see GETBITSOR comment) s is set to -1, so reset it */
s = 0;
/* Negative t->exop means there's a subtable of 2^-e entries */
if ((e = t->exop) < 0) {
do {
if (e == -128)
return INFERR_BADINPUT; /* Invalid code (static table case) */
DUMPBITS(t->bits);
e = -e;
t = t->next;
case 4:
NEEDBITS2(e, t[(unsigned)b & mask[e]].bits, ctx->t=t; ctx->outptr=w;
ctx->copylen = n; ctx->numbits=e; ctx->substate=4)
t += (unsigned)b & mask[e];
} while ((e = t->exop) < 0);
}
gotdistance:
/* All data is available - compute the base distance */
DUMPBITS(t->bits);
d = t->base;
/*FALLTHROUGH*/
case 5:
/* e is number of bits extra following distance code (0..13) */
NEEDBITS(e, ctx->outptr=w; ctx->copylen=n; ctx->distbase=d;
ctx->numbits=e; ctx->substate=5)
d += ((unsigned)b & mask[e]);
DUMPBITS((unsigned)e);
#if WSIZE < 32768
if (d > ctx->slidelen)
return INFERR_BADINPUT;
#endif
/* do the copy, with handling for wrapping around end of buffer */
if ((unsigned)(w - ctx->slide) >= d && w + n < ctx->slideend - 2) {
/* Simple case - neither source nor dest cross end of buffer */
p = w - d;
r -= n + 2;
*w++ = *p++;
*w++ = *p++;
do {
*w++ = *p++;
} while (--n);
} else {
n += 2; /* # of bytes to copy */
p = ctx->slide + ((w - ctx->slide - d) & ctx->slidemask); /* src */
do {
/* Set e to number of bytes we can copy at once */
n -= (e = (unsigned)(e = ctx->slideend - (p>w?p:w)) > n ? n : e);
r -= e;
do {
*w++ = *p++;
} while (--e);
if (r == 0) { /* Did we just write everything we could? */
SAVEBITBUF;
ctx->outptr = w;
ctx->copylen = n;
ctx->distbase = d; /* Save copy distance to recompute p */
ctx->substate = 6;
return 2; /* Need more output space */
case 6:
p = ctx->slide + ((w - ctx->slide - d) & ctx->slidemask);
}
if (p == ctx->slideend)
p = ctx->slide;
} while (n);
}
} /* for(;;) */
} /* switch(ctx->substate) */
EOB:
/* End of Block */
ctx->outptr = w;
SAVEBITBUF;
ctx->state = STATE_START;
ctx->substate = 0;
if (tl && tl != fixed_tl)
huft_free(tl);
if (td && td != fixed_td)
huft_free(td);
ctx->tree1 = 0;
ctx->tree2 = 0;
return 0;
}
/*
* Start state - read 3 bits, which are last block flag (set to 1 on
* last block), and 2 bits of block type.
*/
static int
infStart(struct InflateContext *ctx)
{
ulg b;
unsigned k;
int s;
unsigned char const *g;
int retval = 0;
if (ctx->lastblock) {
ctx->state = STATE_DONE;
return 0;
}
LOADBITBUF;
NEEDBITS(3, (void)0);
ctx->lastblock = (int)(b & 1);
DUMPBITS(1);
switch(b & 3) {
case 0: /* Stored */
ctx->state = STATE_STOREDLEN;
break;
case 1: /* Static - build fixed trees and start decoding */
retval = infFixedTree();
ctx->tree1 = fixed_tl;
ctx->bits1 = fixed_bl;
ctx->tree2 = fixed_td;
ctx->bits2 = fixed_bd;
ctx->state = STATE_INFLATE;
break;
case 2: /* Dynamic */
ctx->state = STATE_DYN_HDR;
break;
case 3: /* Illegal */
retval = INFERR_BADINPUT;
break;
}
DUMPBITS(2);
SAVEBITBUF;
ctx->substate = 0;
return retval;
}
/*
* Get a pointer to available data in the output buffer
*/
unsigned char const *
infGetBytes(struct InflateContext *ctx, unsigned *len)
{
*len = ctx->outptr - ctx->readptr;
return *len ? ctx->readptr : 0;
}
/*
* Mark data in the output buffer as already read. We don't start
* accepting new data until the entire output buffer has been read,
* at which point the outptr is set back to the beginning of the
* buffer.
*/
void
infSkipBytes(struct InflateContext *ctx, unsigned len)
{
pgpAssert(len <= (unsigned)(ctx->outptr - ctx->readptr));
ctx->readptr += len;
/* If at end of buffer, recirculate */
if (ctx->readptr == ctx->slideend) {
pgpAssert(ctx->outptr == ctx->slideend);
ctx->readptr = ctx->outptr = ctx->slide;
}
}
/*
* Returns the number of bytes of buf consumed (written into the inflater).
* *error < 0 on error, == 0 if no error (done with input or output full)
*/
size_t
infWrite(struct InflateContext *ctx, unsigned char const *buf, size_t len,
int *error)
{
int i;
if (ctx->outptr == ctx->slideend) {
*error = (ctx->state == STATE_STOP) ? 0 : ctx->substate;
return 0; /* Out of output space! */
}
ctx->inptr = buf;
/* Decompression code can't handle more than INT_MAX bytes at a time */
ctx->inlen = len > INT_MAX ? INT_MAX : (int)len;
do {
switch(ctx->state) {
case STATE_START:
i = infStart(ctx);
break;
case STATE_STOREDLEN:
i = infStoredLen(ctx);
break;
case STATE_STORED:
i = infStored(ctx);
break;
case STATE_DYN_HDR:
i = infDynHdr(ctx);
break;
case STATE_DYN_BITS:
i = infDynBits(ctx);
break;
case STATE_DYN_TREE:
i = infDynTree(ctx);
break;
case STATE_INFLATE:
i = infInflate(ctx);
break;
case STATE_DONE:
if (ctx->inlen || ctx->bufbits > 7)
i = INFERR_LONG;
#if STRICT
else if (ctx->bitbuffer & mask[ctx->bufbits])
i = INFERR_BADINPUT; /* Unused bits != 0 */
#endif
else
i = 2; /* Read output, please */
break;
case STATE_STOP:
i = ctx->substate;
break;
default:
pgpAssert(i=0); /* To shut up compiler warnings */
}
} while (i == 0);
if (i < 0) {
ctx->state = STATE_STOP;
ctx->substate = i;
*error = i;
} else {
*error = 0;
}
pgpAssert((size_t)(ctx->inptr - buf) == len - ctx->inlen);
return (size_t)(ctx->inptr - buf); /* Bytes read */
}
/*
* Signal EOF, detecting truncated messages.
* If another error has been reported, this just repeats it.
*/
/* New version, hopefully useful. */
int
infEOF(struct InflateContext *ctx)
{
/* If processing is halted, just return the last error (if any) */
if (ctx->state == STATE_STOP)
return ctx->substate;
/* If expecting still more input, we're short */
if (ctx->state != STATE_DONE)
return INFERR_SHORT;
/* Otherwise, all is well. */
return 0;
}
struct InflateContext *
infAlloc(void)
{
struct InflateContext *ctx;
unsigned char *slide;
slide = (unsigned char *)pgpMemAlloc(WSIZE);
if (!slide)
return 0;
ctx = (struct InflateContext *)pgpMemAlloc(sizeof(*ctx));
if (!ctx) {
pgpMemFree(slide);
return 0;
}
ctx->slide = slide;
ctx->slideend = slide+WSIZE;
ctx->slidelen = WSIZE;
ctx->slidemask = WSIZE-1;
ctx->outptr = slide;
ctx->readptr = slide;
ctx->state = STATE_START;
ctx->substate = 0;
ctx->inlen = 0;
ctx->bitbuffer = 0;
ctx->bufbits = 0;
ctx->lastblock = 0;
ctx->tree1 = 0;
ctx->tree2 = 0;
return ctx;
}
void
infFree(struct InflateContext *ctx)
{
if (ctx->tree1 && ctx->tree1 != fixed_tl)
huft_free(ctx->tree1);
if (ctx->tree2 && ctx->tree2 != fixed_td)
huft_free(ctx->tree2);
if (ctx->slide) {
#if SECURE
memset(ctx->slide, 0, ctx->slidelen);
#endif
pgpMemFree(ctx->slide);
}
#if SECURE
memset(ctx, 0, sizeof(*ctx));
#endif
pgpMemFree(ctx);
}
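For reference, driving this interface end-to-end looks roughly like the
following minimal sketch, using only the functions defined above;
read_more() and consume() are hypothetical stand-ins for the caller's
input and output handling:

/* Hypothetical caller-supplied I/O: fill a buffer, accept output. */
extern size_t read_more(unsigned char *buf, size_t max);
extern void consume(unsigned char const *data, unsigned len);

static int
inflate_stream_sketch(void)
{
	struct InflateContext *ctx = infAlloc();
	unsigned char inbuf[4096];
	size_t got, used;
	int error = 0;

	if (!ctx)
		return INFERR_NOMEM;
	while ((got = read_more(inbuf, sizeof(inbuf))) != 0) {
		unsigned char const *p = inbuf;
		while (got) {
			used = infWrite(ctx, p, got, &error);
			if (error < 0)
				goto done;	/* INFERR_* code */
			p += used;
			got -= used;
			/* Drain whatever landed in the 32K window. */
			for (;;) {
				unsigned avail;
				unsigned char const *outp = infGetBytes(ctx, &avail);
				if (!outp)
					break;
				consume(outp, avail);
				infSkipBytes(ctx, avail);
			}
		}
	}
	error = infEOF(ctx);	/* 0, or INFERR_SHORT if truncated */
done:
	infFree(ctx);
	return error;
}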
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-16 17:06 ` Nicolas Pitre
@ 2007-03-16 17:51 ` Linus Torvalds
2007-03-16 18:09 ` Nicolas Pitre
0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-16 17:51 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Davide Libenzi, Jeff Garzik, Git Mailing List, mpm, bcrl
On Fri, 16 Mar 2007, Nicolas Pitre wrote:
> On Fri, 16 Mar 2007, Linus Torvalds wrote:
>
> > The most performance-critical objects for uncompression are commits and
> > trees. At least for the kernel, the average size of a tree object is 678
> > bytes. And that's ignoring the fact that most of them are then deltified,
> > so about 80% of them are likely just a ~60-byte delta.
>
> This is why in pack v4 there will be an alternate tree object
> representation which is not deflated at all.
Well, the thing is, for things that really don't compress, zlib shouldn't
add much of an overhead on uncompression. It *should* just end up being a
single "memcpy()" after you've done:
- check the header for size and mode ("plain data")
- check the adler checksum (which is *really* nice - we've found real
corruption this way!).
The adler32 checksumming may sound unnecessary when you already have the
SHA1 checksum, but the thing is, we normally don't actually *check* the
SHA1 except when doing a full fsck. So I actually like the fact that
object unpacking always checks at least the adler32 checksum at each
stage, which you get "for free" when you use zlib.
So not using compression at all not only gets rid of the
compression, it also gets rid of a good safety valve - something that may not
be immediately obvious when you don't think about what all zlib entails.
People think of zlib as just compressing, but I think the checksumming is
almost as important, which is why it isn't an obviously good thing to not
compress small objects just because you don't win on size!
Remember: stability and safety of the data is *the* #1 objective here. The
git SHA1 checksums guarantees that we can find any corruption, but in
every-day git usage, the adler32 checksum is the one that generally would
*notice* the corruption and cause us to say "uhhuh, need to fsck".
Everything else is totally secondary to the goal of "your data is secure".
Yes, performance is a primary goal too, but it's always "performance with
correctness guarantees"!
But I just traced through a simple 60-byte incompressible zlib thing. It's
painful. This should be *the* simplest case, and it should really just be
the memcpy and the adler32 check. But:
[torvalds@woody ~]$ grep '<inflate' trace | wc -l
460
[torvalds@woody ~]$ grep '<adler32' trace | wc -l
403
[torvalds@woody ~]$ grep '<memcpy' trace | wc -l
59
i.e. we spend *more* instructions on just the stupid setup in "inflate()"
than we spend on the adler32 (or, obviously, on the actual 60-byte memcpy
of the incompressible data).
I dunno. I don't mind the adler32 that much. The rest seems to be
pretty annoying, though.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-16 17:51 ` Linus Torvalds
@ 2007-03-16 18:09 ` Nicolas Pitre
0 siblings, 0 replies; 85+ messages in thread
From: Nicolas Pitre @ 2007-03-16 18:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Davide Libenzi, Jeff Garzik, Git Mailing List, mpm, bcrl
On Fri, 16 Mar 2007, Linus Torvalds wrote:
> On Fri, 16 Mar 2007, Nicolas Pitre wrote:
>
> > This is why in pack v4 there will be an alternate tree object
> > representation which is not deflated at all.
>
> Well, the thing is, for things that really don't compress, zlib shouldn't
> add much of an overhead on uncompression. It *should* just end up being a
> single "memcpy()" after you've done:
> - check the header for size and mode ("plain data")
> - check the adler checksum (which is *really* nice - we've found real
> corruption this way!).
But the thing is that with tree objects, whose records are 6 fairly
random bytes, we already know that compression will never be worth it
size-wise, so it is not worth it even if the header overhead were zero.
In that case it is preferable to do without compression entirely.
> The adler32 checksumming may sound unnecessary when you already have the
> SHA1 checksum, but the thing is, we normally don't actually *check* the
> SHA1 except when doing a full fsck. So I actually like the fact that
> object unpacking always checks at least the adler32 checksum at each
> stage, which you get "for free" when you use zlib.
We can still perform adler32 on undeflated objects directly, though. But
it need not be stored in the pack. I'd store the adler32 checksum for
each object in the pack index as it can be recomputed by index-pack
(which will do the full SHA1 validation anyway).
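For what it's worth, that direct check is just zlib's own adler32() entry
point; a minimal sketch, where buf, len and expected are hypothetical
stand-ins for a raw undeflated object and the checksum recorded for it in
the pack index:

#include <zlib.h>

static int
check_raw_object(const unsigned char *buf, unsigned long len,
	unsigned long expected)
{
	uLong a = adler32(0L, Z_NULL, 0);	/* standard initial value (1) */

	a = adler32(a, buf, (uInt)len);
	return a == expected ? 0 : -1;	/* 0 = checksum matches */
}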
Nicolas
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-16 16:35 ` Linus Torvalds
@ 2007-03-16 19:21 ` Davide Libenzi
2007-03-17 0:01 ` Linus Torvalds
0 siblings, 1 reply; 85+ messages in thread
From: Davide Libenzi @ 2007-03-16 19:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff Garzik, Git Mailing List, mpm, bcrl
On Fri, 16 Mar 2007, Linus Torvalds wrote:
>
>
> On Fri, 16 Mar 2007, Davide Libenzi wrote:
> >
> > > This one seems to do benchmarking with 8MB buffers if I read it right
> > > (didn't try).
> >
> > Yes, I just wanted to have the biggest time spent in inflate(). That why I
> > use a big buffer.
>
> Right. But if the biggest time is spent in setup, the big-buffer thing
> ends up being exactly the wrong thing to test ;)
I modified ztest.c to be able to bench on various data files (you list
them on the command line), but the results are pretty much the same.
I cannot measure any noticeable difference between the two.
Attached there's ztest.c and the diff, in case you want to try on your
own.
> > Definitely. The nature of the data matters.
> > Did you try to make a zlib with my patch and oprofile git on real data
> > with that?
>
> I haven't actually set it up so that I can build against my own zlib yet.
> Exactly because I was hoping that somebody would already have a solution
> ;)
An LD_PRELOAD pointing to your own build should do it.
- Davide
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h> /* for close() */
#include <time.h>
#include "zlib.h"
#define MIN_TESTIME (2 * 1000000)
#define INNER_CYCLES 32
#define CHECK_ERR(err, msg) do { \
if (err != Z_OK) { \
fprintf(stderr, "%s error: %d\n", msg, err); \
exit(1); \
} \
} while (0)
static unsigned long long mintt = MIN_TESTIME;
static uLong incycles = INNER_CYCLES;
static unsigned long long getustime(void) {
struct timeval tm;
gettimeofday(&tm, NULL);
return tm.tv_sec * 1000000ULL + tm.tv_usec;
}
static void do_defl(Byte *cdata, uLong *clen,
Byte *udata, uLong uclen) {
z_stream c_stream; /* compression stream */
int err;
c_stream.zalloc = (alloc_func) NULL;
c_stream.zfree = (free_func) NULL;
c_stream.opaque = (voidpf) NULL;
err = deflateInit(&c_stream, Z_BEST_SPEED);
CHECK_ERR(err, "deflateInit");
c_stream.next_out = cdata;
c_stream.avail_out = (uInt) *clen;
/* Compress the whole mmap'ed input buffer in one shot: */
c_stream.next_in = udata;
c_stream.avail_in = (uInt) uclen;
err = deflate(&c_stream, Z_FINISH);
if (err != Z_STREAM_END) {
fprintf(stderr, "whoops, got %d instead of Z_STREAM_END\n", err);
exit(1);
}
err = deflateEnd(&c_stream);
CHECK_ERR(err, "deflateEnd");
*clen = c_stream.next_out - cdata;
}
static void do_infl(Byte *cdata, uLong clen,
Byte *udata, uLong *uclen) {
int err;
z_stream d_stream; /* decompression stream */
d_stream.zalloc = (alloc_func) NULL;
d_stream.zfree = (free_func) NULL;
d_stream.opaque = (voidpf) NULL;
d_stream.next_in = cdata;
d_stream.avail_in = (uInt) clen;
err = inflateInit(&d_stream);
CHECK_ERR(err, "inflateInit");
d_stream.next_out = udata; /* decompressed output goes here */
d_stream.avail_out = (uInt) *uclen;
err = inflate(&d_stream, Z_FULL_FLUSH);
if (err != Z_STREAM_END) {
fprintf(stderr, "deflate should report Z_STREAM_END\n");
exit(1);
}
err = inflateEnd(&d_stream);
CHECK_ERR(err, "inflateEnd");
*uclen = d_stream.next_out - udata;
}
static int do_filebench(char const *fpath) {
int fd, err = -1;
uLong i, n, clen, ulen, size;
Byte *ubuf, *cbuf, *tbuf;
unsigned long long ts, te;
void *addr;
struct stat stb;
if ((fd = open(fpath, O_RDONLY)) == -1 ||
fstat(fd, &stb)) {
perror(fpath);
if (fd != -1)
close(fd);
return -1;
}
size = stb.st_size;
addr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
if (addr == (void *) -1L) {
perror("mmap");
return -1;
}
ulen = size;
clen = size + 4096;
ubuf = addr;
if ((tbuf = malloc(ulen + clen)) == NULL) {
perror("malloc");
goto err_exit;
}
cbuf = tbuf + ulen;
/* Warming up ... */
do_defl(cbuf, &clen, ubuf, ulen);
do_infl(cbuf, clen, tbuf, &ulen);
if (ulen != size) {
fprintf(stderr, "size mismatch %lu instead of %lu\n",
(unsigned long) ulen, (unsigned long) size);
goto err_exit;
}
if (memcmp(tbuf, ubuf, size)) {
fprintf(stderr, "whoops! we did not get back the same data\n");
goto err_exit;
}
/* Test ... */
fprintf(stdout, "testing: %s\n", fpath);
ts = getustime();
n = 0;
do {
for (i = 0; i < incycles; i++) {
ulen = size;
do_infl(cbuf, clen, tbuf, &ulen);
}
n += i;
te = getustime();
} while (te - ts < mintt);
fprintf(stdout, "\tus time / cycle = %llu\n", (te - ts) / n);
err = 0;
err_exit:
free(tbuf);
munmap(addr, size);
return err;
}
int main(int ac, char **av) {
int i;
for (i = 1; i < ac; i++)
do_filebench(av[i]);
return 0;
}
Index: zlib-1.2.3.quilt/inflate.c
===================================================================
--- zlib-1.2.3.quilt.orig/inflate.c 2007-03-15 18:17:19.000000000 -0700
+++ zlib-1.2.3.quilt/inflate.c 2007-03-15 18:31:14.000000000 -0700
@@ -551,6 +551,15 @@
will return Z_BUF_ERROR if it has not reached the end of the stream.
*/
+#define CASE_DECL(n) \
+ case n: \
+ lbl_##n:
+
+#define STATE_CHANGE(s) do { \
+ state->mode = s; \
+ goto lbl_##s; \
+} while (0)
+
int ZEXPORT inflate(strm, flush)
z_streamp strm;
int flush;
@@ -586,10 +595,9 @@
ret = Z_OK;
for (;;)
switch (state->mode) {
- case HEAD:
+ CASE_DECL(HEAD)
if (state->wrap == 0) {
- state->mode = TYPEDO;
- break;
+ STATE_CHANGE(TYPEDO);
}
NEEDBITS(16);
#ifdef GUNZIP
@@ -597,8 +605,7 @@
state->check = crc32(0L, Z_NULL, 0);
CRC2(state->check, hold);
INITBITS();
- state->mode = FLAGS;
- break;
+ STATE_CHANGE(FLAGS);
}
state->flags = 0; /* expect zlib header */
if (state->head != Z_NULL)
@@ -609,20 +616,17 @@
#endif
((BITS(8) << 8) + (hold >> 8)) % 31) {
strm->msg = (char *)"incorrect header check";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
if (BITS(4) != Z_DEFLATED) {
strm->msg = (char *)"unknown compression method";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
DROPBITS(4);
len = BITS(4) + 8;
if (len > state->wbits) {
strm->msg = (char *)"invalid window size";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
state->dmax = 1U << len;
Tracev((stderr, "inflate: zlib header ok\n"));
@@ -631,32 +635,30 @@
INITBITS();
break;
#ifdef GUNZIP
- case FLAGS:
+ CASE_DECL(FLAGS)
NEEDBITS(16);
state->flags = (int)(hold);
if ((state->flags & 0xff) != Z_DEFLATED) {
strm->msg = (char *)"unknown compression method";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
if (state->flags & 0xe000) {
strm->msg = (char *)"unknown header flags set";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
if (state->head != Z_NULL)
state->head->text = (int)((hold >> 8) & 1);
if (state->flags & 0x0200) CRC2(state->check, hold);
INITBITS();
state->mode = TIME;
- case TIME:
+ CASE_DECL(TIME)
NEEDBITS(32);
if (state->head != Z_NULL)
state->head->time = hold;
if (state->flags & 0x0200) CRC4(state->check, hold);
INITBITS();
state->mode = OS;
- case OS:
+ CASE_DECL(OS)
NEEDBITS(16);
if (state->head != Z_NULL) {
state->head->xflags = (int)(hold & 0xff);
@@ -665,7 +667,7 @@
if (state->flags & 0x0200) CRC2(state->check, hold);
INITBITS();
state->mode = EXLEN;
- case EXLEN:
+ CASE_DECL(EXLEN)
if (state->flags & 0x0400) {
NEEDBITS(16);
state->length = (unsigned)(hold);
@@ -677,7 +679,7 @@
else if (state->head != Z_NULL)
state->head->extra = Z_NULL;
state->mode = EXTRA;
- case EXTRA:
+ CASE_DECL(EXTRA)
if (state->flags & 0x0400) {
copy = state->length;
if (copy > have) copy = have;
@@ -699,7 +701,7 @@
}
state->length = 0;
state->mode = NAME;
- case NAME:
+ CASE_DECL(NAME)
if (state->flags & 0x0800) {
if (have == 0) goto inf_leave;
copy = 0;
@@ -720,7 +722,7 @@
state->head->name = Z_NULL;
state->length = 0;
state->mode = COMMENT;
- case COMMENT:
+ CASE_DECL(COMMENT)
if (state->flags & 0x1000) {
if (have == 0) goto inf_leave;
copy = 0;
@@ -740,13 +742,12 @@
else if (state->head != Z_NULL)
state->head->comment = Z_NULL;
state->mode = HCRC;
- case HCRC:
+ CASE_DECL(HCRC)
if (state->flags & 0x0200) {
NEEDBITS(16);
if (hold != (state->check & 0xffff)) {
strm->msg = (char *)"header crc mismatch";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
INITBITS();
}
@@ -755,28 +756,26 @@
state->head->done = 1;
}
strm->adler = state->check = crc32(0L, Z_NULL, 0);
- state->mode = TYPE;
- break;
+ STATE_CHANGE(TYPE);
#endif
- case DICTID:
+ CASE_DECL(DICTID)
NEEDBITS(32);
strm->adler = state->check = REVERSE(hold);
INITBITS();
state->mode = DICT;
- case DICT:
+ CASE_DECL(DICT)
if (state->havedict == 0) {
RESTORE();
return Z_NEED_DICT;
}
strm->adler = state->check = adler32(0L, Z_NULL, 0);
state->mode = TYPE;
- case TYPE:
+ CASE_DECL(TYPE)
if (flush == Z_BLOCK) goto inf_leave;
- case TYPEDO:
+ CASE_DECL(TYPEDO)
if (state->last) {
BYTEBITS();
- state->mode = CHECK;
- break;
+ STATE_CHANGE(CHECK);
}
NEEDBITS(3);
state->last = BITS(1);
@@ -785,39 +784,38 @@
case 0: /* stored block */
Tracev((stderr, "inflate: stored block%s\n",
state->last ? " (last)" : ""));
- state->mode = STORED;
- break;
+ DROPBITS(2);
+ STATE_CHANGE(STORED);
case 1: /* fixed block */
fixedtables(state);
Tracev((stderr, "inflate: fixed codes block%s\n",
state->last ? " (last)" : ""));
- state->mode = LEN; /* decode codes */
- break;
+ DROPBITS(2);
+ STATE_CHANGE(LEN);
case 2: /* dynamic block */
Tracev((stderr, "inflate: dynamic codes block%s\n",
state->last ? " (last)" : ""));
- state->mode = TABLE;
- break;
+ DROPBITS(2);
+ STATE_CHANGE(TABLE);
case 3:
+ DROPBITS(2);
strm->msg = (char *)"invalid block type";
- state->mode = BAD;
+ STATE_CHANGE(BAD);
}
- DROPBITS(2);
break;
- case STORED:
+ CASE_DECL(STORED)
BYTEBITS(); /* go to byte boundary */
NEEDBITS(32);
if ((hold & 0xffff) != ((hold >> 16) ^ 0xffff)) {
strm->msg = (char *)"invalid stored block lengths";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
state->length = (unsigned)hold & 0xffff;
Tracev((stderr, "inflate: stored length %u\n",
state->length));
INITBITS();
state->mode = COPY;
- case COPY:
+ CASE_DECL(COPY)
copy = state->length;
if (copy) {
if (copy > have) copy = have;
@@ -832,9 +830,8 @@
break;
}
Tracev((stderr, "inflate: stored end\n"));
- state->mode = TYPE;
- break;
- case TABLE:
+ STATE_CHANGE(TYPE);
+ CASE_DECL(TABLE)
NEEDBITS(14);
state->nlen = BITS(5) + 257;
DROPBITS(5);
@@ -845,14 +842,13 @@
#ifndef PKZIP_BUG_WORKAROUND
if (state->nlen > 286 || state->ndist > 30) {
strm->msg = (char *)"too many length or distance symbols";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
#endif
Tracev((stderr, "inflate: table sizes ok\n"));
state->have = 0;
state->mode = LENLENS;
- case LENLENS:
+ CASE_DECL(LENLENS)
while (state->have < state->ncode) {
NEEDBITS(3);
state->lens[order[state->have++]] = (unsigned short)BITS(3);
@@ -867,13 +863,12 @@
&(state->lenbits), state->work);
if (ret) {
strm->msg = (char *)"invalid code lengths set";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
Tracev((stderr, "inflate: code lengths ok\n"));
state->have = 0;
state->mode = CODELENS;
- case CODELENS:
+ CASE_DECL(CODELENS)
while (state->have < state->nlen + state->ndist) {
for (;;) {
this = state->lencode[BITS(state->lenbits)];
@@ -891,8 +886,7 @@
DROPBITS(this.bits);
if (state->have == 0) {
strm->msg = (char *)"invalid bit length repeat";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
len = state->lens[state->have - 1];
copy = 3 + BITS(2);
@@ -914,17 +908,13 @@
}
if (state->have + copy > state->nlen + state->ndist) {
strm->msg = (char *)"invalid bit length repeat";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
while (copy--)
state->lens[state->have++] = (unsigned short)len;
}
}
- /* handle error breaks in while */
- if (state->mode == BAD) break;
-
/* build code tables */
state->next = state->codes;
state->lencode = (code const FAR *)(state->next);
@@ -933,8 +923,7 @@
&(state->lenbits), state->work);
if (ret) {
strm->msg = (char *)"invalid literal/lengths set";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
state->distcode = (code const FAR *)(state->next);
state->distbits = 6;
@@ -942,12 +931,11 @@
&(state->next), &(state->distbits), state->work);
if (ret) {
strm->msg = (char *)"invalid distances set";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
Tracev((stderr, "inflate: codes ok\n"));
state->mode = LEN;
- case LEN:
+ CASE_DECL(LEN)
if (have >= 6 && left >= 258) {
RESTORE();
inflate_fast(strm, out);
@@ -975,22 +963,19 @@
Tracevv((stderr, this.val >= 0x20 && this.val < 0x7f ?
"inflate: literal '%c'\n" :
"inflate: literal 0x%02x\n", this.val));
- state->mode = LIT;
- break;
+ STATE_CHANGE(LIT);
}
if (this.op & 32) {
Tracevv((stderr, "inflate: end of block\n"));
- state->mode = TYPE;
- break;
+ STATE_CHANGE(TYPE);
}
if (this.op & 64) {
strm->msg = (char *)"invalid literal/length code";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
state->extra = (unsigned)(this.op) & 15;
state->mode = LENEXT;
- case LENEXT:
+ CASE_DECL(LENEXT)
if (state->extra) {
NEEDBITS(state->extra);
state->length += BITS(state->extra);
@@ -998,7 +983,7 @@
}
Tracevv((stderr, "inflate: length %u\n", state->length));
state->mode = DIST;
- case DIST:
+ CASE_DECL(DIST)
for (;;) {
this = state->distcode[BITS(state->distbits)];
if ((unsigned)(this.bits) <= bits) break;
@@ -1017,13 +1002,12 @@
DROPBITS(this.bits);
if (this.op & 64) {
strm->msg = (char *)"invalid distance code";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
state->offset = (unsigned)this.val;
state->extra = (unsigned)(this.op) & 15;
state->mode = DISTEXT;
- case DISTEXT:
+ CASE_DECL(DISTEXT)
if (state->extra) {
NEEDBITS(state->extra);
state->offset += BITS(state->extra);
@@ -1032,18 +1016,16 @@
#ifdef INFLATE_STRICT
if (state->offset > state->dmax) {
strm->msg = (char *)"invalid distance too far back";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
#endif
if (state->offset > state->whave + out - left) {
strm->msg = (char *)"invalid distance too far back";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
Tracevv((stderr, "inflate: distance %u\n", state->offset));
state->mode = MATCH;
- case MATCH:
+ CASE_DECL(MATCH)
if (left == 0) goto inf_leave;
copy = out - left;
if (state->offset > copy) { /* copy from window */
@@ -1066,15 +1048,15 @@
do {
*put++ = *from++;
} while (--copy);
- if (state->length == 0) state->mode = LEN;
+ if (state->length == 0)
+ STATE_CHANGE(LEN);
break;
- case LIT:
+ CASE_DECL(LIT)
if (left == 0) goto inf_leave;
*put++ = (unsigned char)(state->length);
left--;
- state->mode = LEN;
- break;
- case CHECK:
+ STATE_CHANGE(LEN);
+ CASE_DECL(CHECK)
if (state->wrap) {
NEEDBITS(32);
out -= left;
@@ -1090,36 +1072,34 @@
#endif
REVERSE(hold)) != state->check) {
strm->msg = (char *)"incorrect data check";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
INITBITS();
Tracev((stderr, "inflate: check matches trailer\n"));
}
#ifdef GUNZIP
state->mode = LENGTH;
- case LENGTH:
+ CASE_DECL(LENGTH)
if (state->wrap && state->flags) {
NEEDBITS(32);
if (hold != (state->total & 0xffffffffUL)) {
strm->msg = (char *)"incorrect length check";
- state->mode = BAD;
- break;
+ STATE_CHANGE(BAD);
}
INITBITS();
Tracev((stderr, "inflate: length matches trailer\n"));
}
#endif
state->mode = DONE;
- case DONE:
+ CASE_DECL(DONE)
ret = Z_STREAM_END;
goto inf_leave;
- case BAD:
+ CASE_DECL(BAD)
ret = Z_DATA_ERROR;
goto inf_leave;
- case MEM:
+ CASE_DECL(MEM)
return Z_MEM_ERROR;
- case SYNC:
+ CASE_DECL(SYNC)
default:
return Z_STREAM_ERROR;
}
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-16 16:11 ` Linus Torvalds
2007-03-16 17:39 ` linux
@ 2007-03-16 22:45 ` Josef Weidendorfer
1 sibling, 0 replies; 85+ messages in thread
From: Josef Weidendorfer @ 2007-03-16 22:45 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux, git
On Friday 16 March 2007, Linus Torvalds wrote:
> (Btw, to get this level of detail, you need to link statically, at least
> if you are using Fedora Core. If you do the normal dynamic linking,
> oprofile will not be able to show you which function, and will just say
> that 50% of all time was spent in libz.so.1.2.3).
I see the same here on OpenSuse 10.2.
This is because opreport currently does not take symbols from the dynamic
symbol table into account. Patching the symbol loader (I just sent a
patch to the oprofile mailing list), I get the same detail with dynamic
linking.
Josef
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-16 16:35 ` cleaner/better zlib sources? Jeff Garzik
` (2 preceding siblings ...)
2007-03-16 17:12 ` Nicolas Pitre
@ 2007-03-16 23:22 ` Shawn O. Pearce
3 siblings, 0 replies; 85+ messages in thread
From: Shawn O. Pearce @ 2007-03-16 23:22 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Linus Torvalds, Davide Libenzi, Git Mailing List, mpm, bcrl
Jeff Garzik <jeff@garzik.org> wrote:
> Although it sounds like zlib could indeed be optimized to reduce its
> startup and shutdown overhead, I wonder if switching compression
> algorithms to a pure Huffman or even RLE compression (with associated
> lower startup/shutdown costs) would perform better in the face of all
> those small objects.
As Nico already stated, for pack v4 we are probably heading in a
direction where these really small (except for blobs anyway) objects
aren't compressed at all by zlib. They take less disk space,
and are faster to reconstruct into their raw format.
> And another random thought, though it may be useless in this thread: I
> bet using a pre-built (compiled into git) static zlib dictionary for git
> commit and tree objects might improve things a bit.
I've actually tried this with the Mozilla project. The improvement
was under 2% in disk space usage, with no runtime performance gains.
Not worth the pain involved. We are seeing much higher disk
space improvements and much better performance gains in the pack
v4 prototype.
Oh, and that was *with* a dictionary that was customized to Mozilla.
Not a static one. A lot of keywords in the dictionary were specific
to the Mozilla project, and would actually *hurt* compression for the
Linux kernel, Git, X.org, etc...
--
Shawn.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-16 19:21 ` Davide Libenzi
@ 2007-03-17 0:01 ` Linus Torvalds
2007-03-17 1:11 ` Linus Torvalds
0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 0:01 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Jeff Garzik, Git Mailing List, mpm, bcrl
On Fri, 16 Mar 2007, Davide Libenzi wrote:
>
> I cannot measure any sensible difference between the two.
I'm using your previous patch (is it the same?) along with the additional
patch appended.
And yes, it's not hugely faster, but I seem to see *some* difference: this
is the real-time of ten runs of
time git log drivers/usb/ > /dev/null
Before:
0m2.673s
0m2.476s
0m2.603s
0m2.576s
0m2.625s
0m2.628s
0m2.493s
0m2.696s
0m2.525s
0m2.575s
After:
0m2.639s
0m2.519s
0m2.454s
0m2.604s
0m2.499s
0m2.497s
0m2.506s
0m2.394s
0m2.409s
0m2.562s
ie after I actually get under 2.4s once, and under 2.5s most of the time,
while before it was under 2.5s just twice, and mostly in the 2.6s..
(I did end up adding the "-g", but I trust that doesn't make things
*faster*. Generally gcc is good at not actually changing code generation
based on -g)
But yeah, not very impressive changes. We're talking *maybe* 0.1s out of
2.5, so potentially about 4% of total time but more likely about 2-3%, and
it's clearly mostly in the noise. And inflate() is still at 16%, and
inflate_fast obviously got no faster.
The nice part is that the instruction-level profile for inflate() got more
interesting. Instead of clearly peaking at the silly indirect jump, the
peak now seems to be a specific path through the thing. I've not decoded
it fully yet, but it seems to be mostly the LEN/LIT cases:
file inflate.c, line 942.
file inflate.c, line 942.
file inflate.c, line 949.
file inflate.c, line 949.
file inflate.c, line 949.
file inflate.c, line 949.
file inflate.c, line 949.
file inflate.c, line 950.
file inflate.c, line 950.
file inflate.c, line 951.
file inflate.c, line 951.
file inflate.c, line 951.
file inflate.c, line 951.
file inflate.c, line 951.
file inflate.c, line 949.
file inflate.c, line 949.
file inflate.c, line 950.
file inflate.c, line 950.
file inflate.c, line 953.
file inflate.c, line 953.
file inflate.c, line 953.
file inflate.c, line 969.
file inflate.c, line 1058.
file inflate.c, line 1059.
file inflate.c, line 1061.
file inflate.c, line 884.
file inflate.c, line 963.
file inflate.c, line 964.
(those are the line numbers *after* applying my patch for where the
hotpoints are: the same line-number showing up multiple times is just
because several hot instructions came from there and got spread out)
Linus
---
diff --git a/Makefile b/Makefile
index 2fd6e45..d8e9ff4 100644
--- a/Makefile
+++ b/Makefile
@@ -18,7 +18,7 @@
CC=cc
-CFLAGS=-O
+CFLAGS=-O -g
#CFLAGS=-O -DMAX_WBITS=14 -DMAX_MEM_LEVEL=7
#CFLAGS=-g -DDEBUG
#CFLAGS=-O3 -Wall -Wwrite-strings -Wpointer-arith -Wconversion \
diff --git a/inflate.c b/inflate.c
index 190c642..3d41d6f 100644
--- a/inflate.c
+++ b/inflate.c
@@ -568,7 +568,7 @@ int flush;
unsigned char FAR *next; /* next input */
unsigned char FAR *put; /* next output */
unsigned have, left; /* available input and output */
- unsigned long hold; /* bit buffer */
+ unsigned long hold, old_hold;/* bit buffer */
unsigned bits; /* bits in bit buffer */
unsigned in, out; /* save starting available input and output */
unsigned copy; /* number of stored or match bytes to copy */
@@ -631,8 +631,11 @@ int flush;
state->dmax = 1U << len;
Tracev((stderr, "inflate: zlib header ok\n"));
strm->adler = state->check = adler32(0L, Z_NULL, 0);
- state->mode = hold & 0x200 ? DICTID : TYPE;
+ old_hold = hold;
INITBITS();
+ if (old_hold & 0x200)
+ STATE_CHANGE(DICTID);
+ STATE_CHANGE(TYPE);
break;
#ifdef GUNZIP
CASE_DECL(FLAGS)
@@ -817,7 +820,7 @@ int flush;
state->mode = COPY;
CASE_DECL(COPY)
copy = state->length;
- if (copy) {
+ while (copy) {
if (copy > have) copy = have;
if (copy > left) copy = left;
if (copy == 0) goto inf_leave;
@@ -826,8 +829,8 @@ int flush;
next += copy;
left -= copy;
put += copy;
- state->length -= copy;
- break;
+ copy = state->length - copy;
+ state->length = copy;
}
Tracev((stderr, "inflate: stored end\n"));
STATE_CHANGE(TYPE);
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-17 0:01 ` Linus Torvalds
@ 2007-03-17 1:11 ` Linus Torvalds
2007-03-17 3:28 ` Nicolas Pitre
0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 1:11 UTC (permalink / raw)
To: Git Mailing List
On Fri, 16 Mar 2007, Linus Torvalds wrote:
>
> And yes, it's not hugely faster, but I seem to see *some* difference: this
> is the real-time of ten runs of
>
> time git log drivers/usb/ > /dev/null
Damn. I think I know why it's happening, and I'm an idiot. I think it's
actually an issue I wondered about a *loong* time ago, and then forgot all
about. And later on, Nico made it almost impossible to fix with his "pack
offset" changes.
The thing that made me realize was one of the callchains into inflate()
that I looked at:
(gdb) where
#0 inflate (strm=0x7fff10d83810, flush=4) at inflate.c:566
#1 0x000000000044c165 in unpack_compressed_entry (p=0x6d52e0, w_curs=0x7fff10d838e0, curpos=94941911,
size=<value optimized out>) at sha1_file.c:1348
#2 0x000000000044c2b6 in unpack_entry (p=0x6d52e0, obj_offset=94941909, type=0x7fff10d85d8c, sizep=0x7fff10d83928)
at sha1_file.c:1408
#3 0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94942707, type=0x7fff10d85d8c, sizep=0x7fff10d83988)
at sha1_file.c:1373
#4 0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94943021, type=0x7fff10d85d8c, sizep=0x7fff10d839e8)
at sha1_file.c:1373
#5 0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94943382, type=0x7fff10d85d8c, sizep=0x7fff10d83a48)
at sha1_file.c:1373
#6 0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94943531, type=0x7fff10d85d8c, sizep=0x7fff10d83aa8)
at sha1_file.c:1373
#7 0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94943622, type=0x7fff10d85d8c, sizep=0x7fff10d83b08)
at sha1_file.c:1373
#8 0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94945357, type=0x7fff10d85d8c, sizep=0x7fff10d83b68)
at sha1_file.c:1373
#9 0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94945447, type=0x7fff10d85d8c, sizep=0x7fff10d83bc8)
at sha1_file.c:1373
#10 0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94945571, type=0x7fff10d85d8c, sizep=0x7fff10d85d80)
at sha1_file.c:1373
#11 0x000000000044c3f8 in read_packed_sha1 (sha1=<value optimized out>, type=0x7fff10d85d8c, size=0x7fff10d85d80)
at sha1_file.c:1567
#12 0x000000000044c741 in read_sha1_file (sha1=0x7fff10d85d60 "�Ab\217��236���031", type=0x7fff10d85d8c,
size=0x7fff10d85d80) at sha1_file.c:1636
....
and notice the deep recursion in sha1_file.
The way we unpack delta chains is that we do
- find a delta
- we apply it to "recursively unpack the thing it's a delta to"
which sounds totally obvious and straightforward, right?
EXCEPT it's actually O(n**2) in the delta depth, because we never save the
intermediate results, so when we have a delta depth of 10 (our default),
and we decode a lot of these things, we basically will look up the base
object 10 times, apply the first delta 9 times, apply the second delta 8
times, etc etc..
I didn't worry about it, because it never actually showed up as much of a
performance problem (and when you do a *single* tree operation you'd
never see it anyway: you apply the deltas you need, and nothing else),
but what it means is that we actually call inflate on the chain entries
55 times instead of just doing it 10 times.
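To spell out that arithmetic, here is a tiny standalone toy (not git code,
just the cost model): entry 0 of a chain is the full object, entry k is a
delta against entry k-1, and without any caching, rebuilding entry k means
inflating k+1 pack entries.

#include <stdio.h>

/*
 * Toy model only: asking once for every entry of an n-entry delta chain,
 * with no caching of intermediate results, costs 1 + 2 + ... + n
 * = n*(n+1)/2 inflate/apply steps instead of just n.
 */
int main(void)
{
        int n = 10;             /* entries in the chain */
        long total = 0;

        for (int k = 0; k < n; k++)
                total += k + 1; /* cost of rebuilding entry k from scratch */

        /* prints "55 inflates without caching, 10 with it" */
        printf("%ld inflates without caching, %d with it\n", total, n);
        return 0;
}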
It's also somewhat limited by the delta depth that we enforce anyway (I
say "somewhat", because we only limit the maximum depth, not the number of
times an object can be used as a base, and if you use an object as a base
a thousand times, it will literally be unpacked a thousand times too!)
I also didn't worry about it, because I felt that if it became a problem,
it would be easy to just add a cache of base objects (we probably do *not*
want to keep the whole unpacked object info in memory all the time just
because of memory pressure issues, so "cache of base objects" is better).
However, the "pack file + offset" thing makes it harder to do, since we
now don't even have the SHA1 of the base object before we unpack it.
But I guess we could just index this by a <packfile, offset> tuple.
Anyway, I bet that this is a much bigger issue than the pack format
itself (and is largely independent).
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-17 1:11 ` Linus Torvalds
@ 2007-03-17 3:28 ` Nicolas Pitre
2007-03-17 5:19 ` Shawn O. Pearce
2007-03-17 17:55 ` Linus Torvalds
0 siblings, 2 replies; 85+ messages in thread
From: Nicolas Pitre @ 2007-03-17 3:28 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
On Fri, 16 Mar 2007, Linus Torvalds wrote:
> The way we unpack delta chains is that we do
>
> - find a delta
> - we apply it to "recursively unpack the thing it's a delta to"
>
> which sounds totally obvious and straightforward, right?
>
> EXCEPT it's actually O(n**2) in the delta depth, because we never save the
> intermediate results, so when we have a delta depth of 10 (our default),
> and we decode a lot of these things, we basically will look up the base
> object 10 times, apply the first delta 9 times, apply the second delta 8
> times, etc etc..
In the worst case, yes. And if you're walking history then the
probability of hitting the worst case eventually is rather high.
> I also didn't worry about it, because I felt that if it became a problem,
> it would be easy to just add a cache of base objects (we probably do *not*
> want to keep the whole unpacked object info in memory all the time just
> because of memory pressure issues, so "cache of base objects" is better).
> However, the "pack file + offset" thing makes it harder to do, since we
> now don't even have the SHA1 of the base object before we unpack it.
>
> But I guess we could just index this by a <packfile, offset> tuple.
Right. Should be really trivial to hook into unpack_delta_entry()
actually replacing the call to unpack_entry() with a wrapper function
that returns cached data, or populates the cache with unpack_entry()
when no match is found.
Then it would only be a matter of coming up with a clever cache
eviction algorithm.
> Anyway, I bet that this is a much bigger issue than the pack format
> itself (and is largely independent).
Well, I think the pack format issue is significant too. But because
those are independent issues the gain in performance will be additive.
Nicolas
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-17 3:28 ` Nicolas Pitre
@ 2007-03-17 5:19 ` Shawn O. Pearce
2007-03-17 17:55 ` Linus Torvalds
1 sibling, 0 replies; 85+ messages in thread
From: Shawn O. Pearce @ 2007-03-17 5:19 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Git Mailing List
Nicolas Pitre <nico@cam.org> wrote:
> On Fri, 16 Mar 2007, Linus Torvalds wrote:
> > I also didn't worry about it, because I felt that if it became a problem,
> > it would be easy to just add a cache of base objects (we probably do *not*
> > want to keep the whole unpacked object info in memory all the time just
> > because of memory pressure issues, so "cache of base objects" is better).
> > However, the "pack file + offset" thing makes it harder to do, since we
> > now don't even have the SHA1 of the base object before we unpack it.
> >
> > But I guess we could just index this by a <packfile, offset> tuple.
...
> Then it would only be a matter of coming up with a clever cache
> eviction algorithm.
Yes. Linus above seems to imply (at least to me) that we wouldn't
want to cache the original object requested by read_sha1_file(), as
it's not the delta base. But given our packing rules, we should be
(in general anyway) first asking for the most recent revision of
a file, which is stored whole, then for an older revision, which
will be a delta of the more recent revision we just saw.
Hence we probably would want to cache an object. Well, at least
anything that had been packed as a delta. Caching a deflated
OBJ_BLOB may not be worth it.
> > Anyway, I bet that this is a much bigger issue than the pack format
> > itself (and is largely independent).
>
> Well, I think the pack format issue is significant too. But because
> those are independent issues the gain in performance will be additive.
I'm torn there.
There are two places where we do lots of object unpacking and
run into this difficult case of unpacking the same base object many
times: git-blame and a rev-list with a path limiter.
Now the git-blame case is obvious: we are constantly unpacking
various revisions of the same file, and these are probably delta'd
against each other, so the unpacking gets really brutal after a
while. A blob cache here would probably *really* help out git-blame.
What's slightly less obvious about git-blame is that we are probably also
traversing the different versions of the same trees over and over, as
we resolve the path to the correct blob in each commit we traverse.
So again here we are hitting lots of the same trees multiple times.
That last part about git-blame also obviously applies to the rev-list
with a path limiter.
But most other operations don't seem like they would benefit from a
base object cache; actually they might slow down from having such
a cache present!
Commits tend not to delta well; if they do delta, it is a very rare
occurrence. So we aren't getting huge unpacking benefits there
by caching them. Scratch any benefit of the cache for any sort of
rev-list operation that doesn't require tree access.
As for the other common operations (diff, read-tree, checkout-index,
merge-recursive): I don't think these will benefit from a cache
either. Their data access patterns are pretty spread out over
the tree. With the exception of rename detection we hit everything
only once. After touching a path, we tend to not go back to it.
So unless we are really lucky and one blob acts as a base object
for many others at different paths (possible, but I suspect not
very likely), it's not worth caching the base.
If we do hit something twice, it's probably because we are doing two
distinct passes over the data. In this case the passes are probably
because we either don't want to hold all of the data in memory (too
big of a set for some projects) or because we tried one algorithm,
failed, and are now trying a different one (internal read-tree
in merge-recursive).
Caching in merge-recursive may help, but just making the dirty
cache (index) that resulted from the internal read-tree available
for the remainder of the merge-recursive process might be faster;
especially if we only have one base and don't need to recursively
merge multiple bases.
So where does that leave us? The only places I see a base object
cache really helping are git-blame's blob access, repeated
tree access (git-blame and path limiting), and maybe we could do
better with the common cases in merge-recursive by being smarter
with the cache.
But with pack v4 I don't think I need a tree object cache.
With a 6 byte fixed record format, a strict ordering requirement,
a finite delta depth within a packfile, a stricter tree-specific
delta encoder, and a minor API change to tree-walk.h, I think we
can unpack the delta at the same time that we are walking the tree.
No upfront unpack required. Hence no reason to cache.
So yea, a base object cache may help us today. It will most
definitely help in git-blame. But I doubt it will help with trees
in pack v4, and I think it will just hurt in most cases. So maybe
it should be local to git-blame only.
--
Shawn.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-17 3:28 ` Nicolas Pitre
2007-03-17 5:19 ` Shawn O. Pearce
@ 2007-03-17 17:55 ` Linus Torvalds
2007-03-17 19:40 ` Linus Torvalds
1 sibling, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 17:55 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Git Mailing List
On Fri, 16 Mar 2007, Nicolas Pitre wrote:
>
> In the worst case, yes. And if you're walking history then the
> probability of hitting the worst case eventually is rather high.
Actually, it's even better than that.
If we're walking a certain pathspec (which is really the only thing that
is expensive), we're pretty much *guaranteed* that we'll hit exactly this
case. Doing some instrumentation on the test-case I've been using (which
is just "git log drivers/usb/ > /dev/null") shows:
[torvalds@woody linux]$ grep Needs delta-base-trace | wc -l
469334
[torvalds@woody linux]$ grep Needs delta-base-trace | sort -u | wc -l
21933
where that delta-base-trace is just a trace of which delta bases were
needed. Look how we currently generate almost half a million of them, but
only 22000 are actually unique objects - we just generate many of them
over and over again. In fact, the top delta bases with counts look like:
558 Needs 102398354
556 Needs 161353360
554 Needs 161354852
552 Needs 161354916
550 Needs 161354980
526 Needs 161355044
524 Needs 161355108
522 Needs 161355174
520 Needs 161355238
508 Needs 161445724
446 Needs 119712387
425 Needs 133406737
420 Needs 161513997
387 Needs 120784913
331 Needs 127094253
321 Needs 95694853
319 Needs 125888524
303 Needs 155109487
301 Needs 155627964
299 Needs 155628028
.....
ie the top twenty objects were all generated hundreds of times each.
More importantly, the trace also shows that it actually has very good
locality too - exactly as you'd expect, since when we traverse the trees,
we'd generally see a particular delta base used as a base when that thing
is slowly changing, so of the half-million "needs" entries in my trace, if
I pick the top delta_base (102398354), and use "cat -n" to give them all
line numbers (from 1 to half a million), and grep for that particular
delta:
grep Needs delta-base-trace | cat -n | grep 102398354 | less -S
they are *all* at lines 61624..89352, with the bulk of them being very
close together (the bulk of those are all around 88k line mark).
In other words, it's not "spread out" over time. It's very clustered,
which I'd expect anyway, which means that even a simple cache of just a
few hundred entries (statically sized) will be very effective.
So the cache doesn't need to be "complete". It will get good hit-rates
even from being very simple. I think I have a very simple and cunning
plan, I'll try it out asap.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: cleaner/better zlib sources?
2007-03-17 17:55 ` Linus Torvalds
@ 2007-03-17 19:40 ` Linus Torvalds
2007-03-17 19:42 ` [PATCH 1/2] Make trivial wrapper functions around delta base generation and freeing Linus Torvalds
2007-03-17 19:44 ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
0 siblings, 2 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 19:40 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Git Mailing List
On Sat, 17 Mar 2007, Linus Torvalds wrote:
>
> So the cache doesn't need to be "complete". It will get good hit-rates
> even from being very simple. I think I have a very simple and cunning
> plan, I'll try it out asap.
Ok, got distracted by guests coming to look at the new puppy, so it took
longer than it should have, but the following is a simple two-patch series
that improves path-following by a factor of almost 2.5 for me.
The cache is *really* simple. It's just a 256-entry hashed cache of the
last few base entries, and it brings down my test-case of
git log drivers/usb/ > /dev/null
from 2.5s to just over 1s. I have *not* tuned or tweaked this at all, and
maybe there are better ways to do this, but this was simple as hell and
obviously quite effective.
It also speeds up "git blame", for all the same reasons. Before (best
times out of a run of five):
[torvalds@woody linux]$ time git blame drivers/char/Makefile > /dev/null
real 0m1.585s
user 0m1.576s
sys 0m0.004s
after:
[torvalds@woody linux]$ time ~/git/git blame drivers/char/Makefile > /dev/null
real 0m0.763s
user 0m0.644s
sys 0m0.120s
so it's a factor of two there too (just a random file, I'm not at all
going to guarantee that this is really consistent - it should get more
testing etc).
The first patch just does some obvious re-factoring and setting up (no
real code changes). The second patch just uses the new functions to
actually add a cache.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* [PATCH 1/2] Make trivial wrapper functions around delta base generation and freeing
2007-03-17 19:40 ` Linus Torvalds
@ 2007-03-17 19:42 ` Linus Torvalds
2007-03-17 19:44 ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
1 sibling, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 19:42 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List
This doesn't change any code, it just creates a point for where we'd
actually do the caching of delta bases that have been generated.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
Done this way to make all the changes as obvious as possible.
diff --git a/sha1_file.c b/sha1_file.c
index 110d696..f11ca3f 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1352,6 +1352,18 @@ static void *unpack_compressed_entry(struct packed_git *p,
return buffer;
}
+static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
+ unsigned long *base_size, enum object_type *type)
+{
+ return unpack_entry(p, base_offset, type, base_size);
+}
+
+static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
+ void *base, unsigned long base_size, enum object_type type)
+{
+ free(base);
+}
+
static void *unpack_delta_entry(struct packed_git *p,
struct pack_window **w_curs,
off_t curpos,
@@ -1365,7 +1377,7 @@ static void *unpack_delta_entry(struct packed_git *p,
off_t base_offset;
base_offset = get_delta_base(p, w_curs, &curpos, *type, obj_offset);
- base = unpack_entry(p, base_offset, type, &base_size);
+ base = cache_or_unpack_entry(p, base_offset, &base_size, type);
if (!base)
die("failed to read delta base object"
" at %"PRIuMAX" from %s",
@@ -1378,7 +1390,7 @@ static void *unpack_delta_entry(struct packed_git *p,
if (!result)
die("failed to apply delta");
free(delta_data);
- free(base);
+ add_delta_base_cache(p, base_offset, base, base_size, *type);
return result;
}
^ permalink raw reply related [flat|nested] 85+ messages in thread
* [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 19:40 ` Linus Torvalds
2007-03-17 19:42 ` [PATCH 1/2] Make trivial wrapper functions around delta base generation and freeing Linus Torvalds
@ 2007-03-17 19:44 ` Linus Torvalds
2007-03-17 21:45 ` Linus Torvalds
2007-03-17 22:44 ` Linus Torvalds
1 sibling, 2 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 19:44 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List
This trivial 256-entry delta_base cache improves performance for some
loads by a factor of 2.5 or so.
Instead of always re-generating the delta bases (possibly over and over
and over again), just cache the last few ones. They often can get re-used.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
This should have some other people doing performance testing too, since
it's fairly core. But *dang*, it's really simple.
diff --git a/sha1_file.c b/sha1_file.c
index f11ca3f..a7e3a2a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1352,16 +1352,57 @@ static void *unpack_compressed_entry(struct packed_git *p,
return buffer;
}
+#define MAX_DELTA_CACHE (256)
+
+static struct delta_base_cache_entry {
+ struct packed_git *p;
+ off_t base_offset;
+ unsigned long size;
+ void *data;
+ enum object_type type;
+} delta_base_cache[MAX_DELTA_CACHE];
+
+static unsigned long pack_entry_hash(struct packed_git *p, off_t base_offset)
+{
+ unsigned long hash;
+
+ hash = (unsigned long)p + (unsigned long)base_offset;
+ hash += (hash >> 8) + (hash >> 16);
+ return hash & 0xff;
+}
+
static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
unsigned long *base_size, enum object_type *type)
{
+ void *ret;
+ unsigned long hash = pack_entry_hash(p, base_offset);
+ struct delta_base_cache_entry *ent = delta_base_cache + hash;
+
+ ret = ent->data;
+ if (ret && ent->p == p && ent->base_offset == base_offset)
+ goto found_cache_entry;
return unpack_entry(p, base_offset, type, base_size);
+
+found_cache_entry:
+ ent->data = NULL;
+ *type = ent->type;
+ *base_size = ent->size;
+ return ret;
}
static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
void *base, unsigned long base_size, enum object_type type)
{
- free(base);
+ unsigned long hash = pack_entry_hash(p, base_offset);
+ struct delta_base_cache_entry *ent = delta_base_cache + hash;
+
+ if (ent->data)
+ free(ent->data);
+ ent->p = p;
+ ent->base_offset = base_offset;
+ ent->type = type;
+ ent->data = base;
+ ent->size = base_size;
}
static void *unpack_delta_entry(struct packed_git *p,
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 19:44 ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
@ 2007-03-17 21:45 ` Linus Torvalds
2007-03-17 22:37 ` Junio C Hamano
` (2 more replies)
2007-03-17 22:44 ` Linus Torvalds
1 sibling, 3 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 21:45 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List
On Sat, 17 Mar 2007, Linus Torvalds wrote:
>
> Instead of always re-generating the delta bases (possibly over and over
> and over again), just cache the last few ones. They often can get re-used.
Not just to compare actual timings, this shows the difference in the
traces I did. Remember, before we had:
[torvalds@woody linux]$ grep Needs delta-base-trace | wc -l
469334
[torvalds@woody linux]$ grep Needs delta-base-trace |sort -u | wc -l
21933
and now with the simple cache, I get:
[torvalds@woody linux]$ grep Needs delta-base-trace-new | wc -l
28688
[torvalds@woody linux]$ grep Needs delta-base-trace-new | sort -u | wc -l
21933
ie, we still re-generate some of the objects multiple times, but now,
rather than generating them (on average) 20+ times each, we now generate
them an average of just 1.3 times each. Which explains why the wall-time
goes down by over a factor of two.
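Spelling out those averages from the grep counts quoted above (trivial
standalone arithmetic, nothing more):

#include <stdio.h>

int main(void)
{
        /* counts from the traces quoted above */
        double unique = 21933, before = 469334, after = 28688;

        printf("before: %.1f lookups per unique base\n", before / unique); /* ~21.4 */
        printf("after:  %.1f lookups per unique base\n", after / unique);  /* ~1.3  */
        return 0;
}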
Changing the (statically sized) cache from 256 entries to 1024 (and
updating the hash function appropriately of course) gets the number down
to 23953 delta-base lookups (the number of unique ones obviously stays the
same), for an average of just 1.1 object generates per unique object, and
also means that you occasionally get sub-second times for my test-case of
logging drivers/usb/.
It all also means that libz isn't really even the top entry in the
profiles any more, although it's still pretty high. But the profile now
says:
samples % app name symbol name
41527 15.6550 git strlen
30215 11.3905 git inflate
27504 10.3685 git inflate_table
20321 7.6607 git find_pack_entry_one
16892 6.3680 git interesting
16259 6.1294 vmlinux __copy_user_nocache
16010 6.0355 git inflate_fast
9240 3.4833 git get_mode
8863 3.3412 git tree_entry_extract
7145 2.6935 git strncmp
7131 2.6883 git memcpy
6863 2.5872 git diff_tree
6113 2.3045 git adler32
4515 1.7021 git _int_malloc
3022 1.1392 git update_tree_entry
...
(Adding up all of libz is still ~31%, but it's lower as a percentage *and*
it's obviously a smaller percentage of a much lower absolute time, so the
zlib overhead went down much more than any other git overheads did)
In general, this all seems very cool. The patches are simple enough that I
think this is very safe to merge indeed: the only question I have is that
somebody should verify that the "struct packed_git *p" is stable over the
whole lifetime of a process - so that we can use it as a hash key without
having to invalidate hashes if we unmap a pack (I *think* we just unmap
the virtual mapping, and "struct packed_git *" stays valid, but Junio
should ack that for me).
Here's the trivial patch to extend the caching to 1k entries if somebody
cares. I don't know if the small added performance is worth it.
Linus
---
diff --git a/sha1_file.c b/sha1_file.c
index a7e3a2a..372af60 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1352,7 +1352,7 @@ static void *unpack_compressed_entry(struct packed_git *p,
return buffer;
}
-#define MAX_DELTA_CACHE (256)
+#define MAX_DELTA_CACHE (1024)
static struct delta_base_cache_entry {
struct packed_git *p;
@@ -1367,8 +1367,8 @@ static unsigned long pack_entry_hash(struct packed_git *p, off_t base_offset)
unsigned long hash;
hash = (unsigned long)p + (unsigned long)base_offset;
- hash += (hash >> 8) + (hash >> 16);
- return hash & 0xff;
+ hash += (hash >> 10) + (hash >> 20);
+ return hash & (MAX_DELTA_CACHE-1);
}
static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 21:45 ` Linus Torvalds
@ 2007-03-17 22:37 ` Junio C Hamano
2007-03-17 23:09 ` Linus Torvalds
2007-03-18 1:13 ` Nicolas Pitre
2007-03-17 23:12 ` Junio C Hamano
2007-03-18 1:14 ` Morten Welinder
2 siblings, 2 replies; 85+ messages in thread
From: Junio C Hamano @ 2007-03-17 22:37 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> ie, we still re-generate some of the objects multiple times, but now,
> rather than generating them (on average) 20+ times each, we now generate
> them an average of just 1.3 times each. Which explains why the wall-time
> goes down by over a factor of two.
This is beautiful. You only cache what we were about to discard
anyway, and when giving a cached one out, you invalidate the
cached entry, so there is no way the patch can introduce leaks
nor double-frees and it is absolutely safe (as long as we can
pin the packed_git structure, which I think is the case --- even
when we re-read the packs, I do not think we discard old ones).
I've thought about possible ways to improve on it, but came up
almost empty.
When unpacking a depth-3 deltified object A, the code finds the
target object A (which is a delta), ask for its base B and put B
in the cache after using it to reconstitute A. While doing so,
the first-generation base B is also a delta so its base C (which
is a non-delta) is found and placed in the cache. When A is
returned, the cache has B and C. If you ask for B at this
point, we read the delta, pick up its base C from the cache,
apply, and return while putting C back in the cache. If you ask
for A after that, we do not read from the cache, although it is
available.
Which feels a bit wasteful at first sight, and we *could* make
read_packed_sha1() also steal from the cache, but after thinking
about it a bit, I am not sure if it is worth it. The contract
between read_packed_sha1() and read_sha1_file() and its callers
is that the returned data belongs to the caller and it is a
responsibility for the caller to free the buffer, and also the
caller is free to modify it, so stealing from the cache from
that codepath means an extra allocation and memcpy. If the
object stolen from the cache is of sufficient depth, it might be
worth it, but to decide it we somehow need to compute and store
which delta depth the cached one is at.
In any case, your code makes deeply deltified packfiles a lot
more practical. As long as the working set of delta chains fits
in the cache, after unpacking the longest delta, the objects on
the chain can be had by one lookup and one delta application.
Very good job.
> In general, this all seems very cool. The patches are simple enough that I
> think this is very safe to merge indeed: the only question I have is that
> somebody should verify that the "struct packed_git *p" is stable over the
> whole lifetime of a process - so that we can use it as a hash key without
> having to invalidate hashes if we unmap a pack (I *think* we just unmap
> the virtual mapping, and "struct packed_git *" stays valid, but Junio
> should ack that for me).
Ack ;-)
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 19:44 ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
2007-03-17 21:45 ` Linus Torvalds
@ 2007-03-17 22:44 ` Linus Torvalds
1 sibling, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 22:44 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List
On Sat, 17 Mar 2007, Linus Torvalds wrote:
>
> This trivial 256-entry delta_base cache improves performance for some
> loads by a factor of 2.5 or so.
Btw, final comment on this issue:
I was initially a bit worried about optimizing for just the "git log" with
pathspec or "git blame" kind of behaviour, and possibly pessimizing some
other load.
But the way the caching works, this is likely to be faster (or at least
not slower) even for something that doesn't ever need the cache (which in
turn is likely to be because it's a smaller footprint query and only works
on one version).
Because the way the cache works, it doesn't really do any extra work: it
basically just delays the "free()" on the buffer we allocated. So for
really small footprints it just avoids the overhead of free() (let the OS
reap the pages for it at exit), and for bigger footprints (that end up
replacing the cache entries) it will just do the same work a bit later.
Because it's a simple direct-mapped cache, the only cost is the (trivial)
hash of a few instructions, and possibly the slightly bigger D$ footprint.
I would strongly suspect that even on loads where it doesn't help by
reusing the cached objects, the delayed free'ing on its own is as likely
to help as it is to hurt.
So there really shouldn't be any downsides.
Testing on some other loads (for example, drivers/scsi/ has more activity
than drivers/usb/), the 2x performance win seems to happen for other
things too. For drivers/scsi, the log generation time went down from 3.582s
(best) to 1.448s.
"git blame Makefile" went from 1.802s to 1.243s (both best-case numbers
again: a smaller win, but still a win), but there the issue seems to be
that with a file like that, we actually spend most of our time comparing
different versions.
For the "git blame Makefile" case *all* of zlib combined is just 18%,
while the ostensibly trivial "cmp_suspect()" is 23% and another 11% is
from "assign_blame()" - so for top-level entries the costs would seem to
tend to be in the blame algorithm itself, rather than in the actual object
handling.
(I'm sure that could be improved too, but the take-home message from this
is that zlib wasn't really the problem, and our stupid re-generation of
the same delta base was.)
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 22:37 ` Junio C Hamano
@ 2007-03-17 23:09 ` Linus Torvalds
2007-03-17 23:54 ` Linus Torvalds
2007-03-18 1:13 ` Nicolas Pitre
1 sibling, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 23:09 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Git Mailing List
On Sat, 17 Mar 2007, Junio C Hamano wrote:
>
> When unpacking a depth-3 deltified object A, the code finds the
> target object A (which is a delta), ask for its base B and put B
> in the cache after using it to reconstitute A. While doing so,
> the first-generation base B is also a delta so its base C (which
> is a non-delta) is found and placed in the cache. When A is
> returned, the cache has B and C. If you ask for B at this
> point, we read the delta, pick up its base C from the cache,
> apply, and return while putting C back in the cache. If you ask
> for A after that, we do not read from the cache, although it is
> available.
Yes.
I debated that a bit with myself, but decided that:
(a) it probably doesn't really matter a lot (but I don't have the
numbers)
(b) trying to *also* fill non-delta-base queries from the delta-base
cache actually complicates things a lot. Surprisingly much so (the
current logic of removing the entry from the cache only to re-insert
it after being used made the memory management totally trivial, as
you noticed)
(c) and regardless, we could decide to do a more extensive caching layer
later if we really wanted to, and at that point it probably makes
more sense to integrate it with the delta-base cache.
Most git objects are use-once, which is why we really *just* save the
flag bits and the SHA1 hash name itself in "struct object", but doing
a generic caching layer for object content would likely obviate the
need for the current logic to do "save_commit_buffer".
That (c) in particular was what made me think that it's better to keep it
simple and obvious for now, since even the simple thing largely fixes the
performance issue. Almost three seconds I felt bad about, while just over
a second for something as complex as "git log drivers/usb/" I just cannot
make myself worry about.
> In any case, your code makes deeply deltified packfiles a lot
> more practical. As long as the working set of delta chains fits
> in the cache, after unpacking the longest delta, the objects on
> the chain can be had by one lookup and one delta application.
Yeah. I think it would be good to probably (separately and as "further
tweaks"):
- have somebody actually look at hit-rates for different repositories and
hash sizes.
- possibly allow people to set the hash size as a config option, if it
turns out that certain repository layouts or usage scenarios end up
preferring bigger caches.
For example, it may be that for historical archives you might want to
have deeper delta queues to make the repository smaller, and if they
are big anyway maybe they would prefer to have a larger-than-normal
cache as a result. On the other hand, if you are memory-constrained,
maybe you'd prefer to re-generate the objects and waste a bit of CPU
rather than cache the results.
But neither of the above is really an argument against the patch, just a
"there's certainly room for more work here if anybody cares".
> Very good job.
I'm pretty happy with the results myself. Partly because the patches just
ended up looking so *nice*.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 21:45 ` Linus Torvalds
2007-03-17 22:37 ` Junio C Hamano
@ 2007-03-17 23:12 ` Junio C Hamano
2007-03-17 23:24 ` Linus Torvalds
2007-03-18 1:14 ` Morten Welinder
2 siblings, 1 reply; 85+ messages in thread
From: Junio C Hamano @ 2007-03-17 23:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Here's the trivial patch to extend the caching to 1k entries if somebody
> cares. I don't know if the small added performance is worth it.
This would largely depend on the project, but if each cached blob
is 20kB, a 1024-entry cache could grow to 20MB. We
may need to introduce early eviction of cached objects with a
total cache size limit, configurable per repository.
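A minimal sketch of what that could look like, reusing the entry struct and
hash function from the patch above (the byte limit and the
don't-cache-huge-objects check are made-up illustrations, not part of any
posted patch):

#define DELTA_BASE_CACHE_LIMIT (16 * 1024 * 1024)  /* made-up default */

static unsigned long delta_base_cached;  /* bytes currently held in the cache */

static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
        void *base, unsigned long base_size, enum object_type type)
{
        unsigned long hash = pack_entry_hash(p, base_offset);
        struct delta_base_cache_entry *ent = delta_base_cache + hash;

        /* Huge objects are not worth caching at all; just drop them. */
        if (base_size > DELTA_BASE_CACHE_LIMIT / 4) {
                free(base);
                return;
        }
        if (ent->data) {
                delta_base_cached -= ent->size;
                free(ent->data);
        }
        ent->p = p;
        ent->base_offset = base_offset;
        ent->type = type;
        ent->data = base;
        ent->size = base_size;
        delta_base_cached += base_size;

        /*
         * A real version would also evict further entries (e.g. by walking
         * an LRU list) until delta_base_cached drops below the limit, and
         * would read the limit from the repository configuration.
         */
}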
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 23:12 ` Junio C Hamano
@ 2007-03-17 23:24 ` Linus Torvalds
2007-03-17 23:52 ` Jon Smirl
0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 23:24 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Git Mailing List
On Sat, 17 Mar 2007, Junio C Hamano wrote:
>
> This largely would depend on the project, but if a blob that is
> cached is 20kB each, a 1024-entry cache would grow to 20MB. We
> may need to introduce early eviction of cached objects with
> total cache size limit, configurable per repository.
One thing that I considered was to limit the delta-base cache to just tree
entries. Those tend to be the really performance-sensitive ones - by the
time you actually unpack blob entries, you're going to do something with
that *single* entry anyway (like compare it to another blob), and the cost
of unpacking the entry is likely to not be really all that noticeable.
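For illustration only, restricting the cache to trees would just be an early
type test on the add side (a sketch against the posted patch, not something
anyone has proposed merging):

static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
        void *base, unsigned long base_size, enum object_type type)
{
        unsigned long hash;
        struct delta_base_cache_entry *ent;

        /* Only keep trees; blobs, commits and tags get freed immediately,
         * exactly as the pre-cache code did. */
        if (type != OBJ_TREE) {
                free(base);
                return;
        }
        hash = pack_entry_hash(p, base_offset);
        ent = delta_base_cache + hash;
        if (ent->data)
                free(ent->data);
        ent->p = p;
        ent->base_offset = base_offset;
        ent->type = type;
        ent->data = base;
        ent->size = base_size;
}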
That said, it was just simpler to do it unconditionally, and it obviously
*works* fine regardless of the object type, so limiting it to trees is a
bit sad. And since the intensive tree operations tend to be in a separate
phase (ie the commit simplification phase) from the blob operations
(say, doing "git log -p <pathspec>"), I suspect that the cache locality
would still remain good.
So I didn't do anything along the lines of "only cache for case Xyzzy".
But yes, especially if a project has big blobs, it might make sense to
limit by full size of the cached entries some way.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 23:24 ` Linus Torvalds
@ 2007-03-17 23:52 ` Jon Smirl
0 siblings, 0 replies; 85+ messages in thread
From: Jon Smirl @ 2007-03-17 23:52 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List
If you still have a Mozilla pack file around it would be a good test
case. It has delta chains thousands of entries long. If I remember
correctly one had over 4,000 deltas.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 23:09 ` Linus Torvalds
@ 2007-03-17 23:54 ` Linus Torvalds
0 siblings, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-17 23:54 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Git Mailing List
On Sat, 17 Mar 2007, Linus Torvalds wrote:
>
> (a) it probably doesn't really matter a lot (but I don't have the
> numbers)
Well, to some degree I obviously *do* have the numbers.
I have the numbers showing that we used to re-generate the object data over
five *hundred* times per object in some cases, and that I got the average
delta-base usage down from 20x to 1.1-1.3x depending on cache size.
In contrast, the "use delta-base also for non-delta queries" fairly
fairly obviously cannot touch those kinds of numbers. We might avoid a *few*
object generation cases, but we're not looking at factors of 20 for any
kind of sane cases.
So I do think that a higher-level caching approach can work too, but it's
going to be more effective in other areas:
- get rid of some ugly hacks (like the "save_commit_buffer" thing I
mentioned)
- possibly help some insane loads (eg cases where we really *do* end up
seeing the same object over and over again, perhaps simply because some
idiotic automated commit system ends up switching between a few states
back-and-forth).
I really think the "insane loads" thing is unlikely, but I could construct
some crazy usage scenario where a cache of objects in general (and not
just delta bases) would work. I don't think it's a very realistic case,
but who knows - people sometimes do really stupid things.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 22:37 ` Junio C Hamano
2007-03-17 23:09 ` Linus Torvalds
@ 2007-03-18 1:13 ` Nicolas Pitre
2007-03-18 7:47 ` Junio C Hamano
1 sibling, 1 reply; 85+ messages in thread
From: Nicolas Pitre @ 2007-03-18 1:13 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List
On Sat, 17 Mar 2007, Junio C Hamano wrote:
> When unpacking a depth-3 deltified object A, the code finds the
> target object A (which is a delta), ask for its base B and put B
> in the cache after using it to reconstitute A. While doing so,
> the first-generation base B is also a delta so its base C (which
> is a non-delta) is found and placed in the cache. When A is
> returned, the cache has B and C. If you ask for B at this
> point, we read the delta, pick up its base C from the cache,
> apply, and return while putting C back in the cache. If you ask
> for A after that, we do not read from the cache, although it is
> available.
>
> Which feels a bit wasteful at first sight, and we *could* make
> read_packed_sha1() also steal from the cache, but after thinking
> about it a bit, I am not sure if it is worth it. The contract
> between read_packed_sha1() and read_sha1_file() and its callers
> is that the returned data belongs to the caller and it is a
> responsibility for the caller to free the buffer, and also the
> caller is free to modify it, so stealing from the cache from
> that codepath means an extra allocation and memcpy.
So?
A malloc() + memcpy() will always be faster than mmap() + malloc() +
inflate(). If the data is already there it is certainly better to copy
it straight away.
With the patch below I can do 'git log drivers/scsi/ > /dev/null' about
7% faster. I bet it might be even more on those platforms with bad
mmap() support.
Signed-off-by: Nicolas Pitre <nico@cam.org>
---
diff --git a/sha1_file.c b/sha1_file.c
index a7e3a2a..ee64865 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1372,7 +1372,7 @@ static unsigned long pack_entry_hash(struct packed_git *p, off_t base_offset)
}
static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
- unsigned long *base_size, enum object_type *type)
+ unsigned long *base_size, enum object_type *type, int keep_cache)
{
void *ret;
unsigned long hash = pack_entry_hash(p, base_offset);
@@ -1384,7 +1384,13 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
return unpack_entry(p, base_offset, type, base_size);
found_cache_entry:
- ent->data = NULL;
+ if (!keep_cache)
+ ent->data = NULL;
+ else {
+ ret = xmalloc(ent->size + 1);
+ memcpy(ret, ent->data, ent->size);
+ ((char *)ret)[ent->size] = 0;
+ }
*type = ent->type;
*base_size = ent->size;
return ret;
@@ -1418,7 +1424,7 @@ static void *unpack_delta_entry(struct packed_git *p,
off_t base_offset;
base_offset = get_delta_base(p, w_curs, &curpos, *type, obj_offset);
- base = cache_or_unpack_entry(p, base_offset, &base_size, type);
+ base = cache_or_unpack_entry(p, base_offset, &base_size, type, 0);
if (!base)
die("failed to read delta base object"
" at %"PRIuMAX" from %s",
@@ -1615,7 +1621,7 @@ static void *read_packed_sha1(const unsigned char *sha1,
if (!find_pack_entry(sha1, &e, NULL))
return NULL;
else
- return unpack_entry(e.p, e.offset, type, size);
+ return cache_or_unpack_entry(e.p, e.offset, size, type, 1);
}
/*
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-17 21:45 ` Linus Torvalds
2007-03-17 22:37 ` Junio C Hamano
2007-03-17 23:12 ` Junio C Hamano
@ 2007-03-18 1:14 ` Morten Welinder
2007-03-18 1:29 ` Linus Torvalds
2007-03-18 6:28 ` Avi Kivity
2 siblings, 2 replies; 85+ messages in thread
From: Morten Welinder @ 2007-03-18 1:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List
> samples % app name symbol name
> 41527 15.6550 git strlen
Almost 16% in strlen? Ugh!
That's a lot of strings, or perhaps very long strings. Or a profiling bug.
M.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 1:14 ` Morten Welinder
@ 2007-03-18 1:29 ` Linus Torvalds
2007-03-18 1:38 ` Nicolas Pitre
2007-03-18 1:44 ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
2007-03-18 6:28 ` Avi Kivity
1 sibling, 2 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 1:29 UTC (permalink / raw)
To: Morten Welinder; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List
On Sat, 17 Mar 2007, Morten Welinder wrote:
>
> > samples % app name symbol name
> > 41527 15.6550 git strlen
>
> Almost 16% in strlen? Ugh!
>
> That's a lot of strings, or perhaps very long strings. Or a profiling bug.
It's likely real, and the problem is likely lots of small strings.
Each git tree entry is:
"<octal mode> name\0" <20-byte sha1>
so you do have a *lot* of strlen() calls when doing any tree parsing. And
for some inexplicable reason, glibc thinks strings are long on average, so
it has a fancy algorithm to do 8 bytes at a time and tries to do things
aligned etc.
The size of strlen() on x86-64 with glibc is 232 bytes. I'm not kidding.
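For anyone not familiar with the on-disk format, here is roughly what walking
raw tree entries looks like (a simplified sketch, not git's actual tree-walk
code): the strlen()-sized scan on every entry is unavoidable, because only
the NUL tells you where the 20-byte sha1 starts.

#include <stdlib.h>
#include <string.h>

struct entry {
        unsigned mode;
        const char *path;
        const unsigned char *sha1;
};

/* Parse one "<octal mode> <name>\0<20-byte sha1>" record and return a
 * pointer to the next one.  Note the strlen() needed per entry. */
const char *parse_tree_entry(const char *buf, struct entry *e)
{
        e->mode = strtoul(buf, (char **)&buf, 8);
        e->path = buf + 1;                              /* skip the space */
        e->sha1 = (const unsigned char *)e->path + strlen(e->path) + 1;
        return (const char *)e->sha1 + 20;              /* start of next entry */
}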
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 1:29 ` Linus Torvalds
@ 2007-03-18 1:38 ` Nicolas Pitre
2007-03-18 1:55 ` Linus Torvalds
2007-03-18 1:44 ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
1 sibling, 1 reply; 85+ messages in thread
From: Nicolas Pitre @ 2007-03-18 1:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Morten Welinder, Junio C Hamano, Git Mailing List
On Sat, 17 Mar 2007, Linus Torvalds wrote:
>
>
> On Sat, 17 Mar 2007, Morten Welinder wrote:
> >
> > > samples % app name symbol name
> > > 41527 15.6550 git strlen
> >
> > Almost 16% in strlen? Ugh!
> >
> > That's a lot of strings, or perhaps very long strings. Or a profiling bug.
>
> It's likely real, and the problem is likely lots of small strings.
>
> Each git tree entry is:
>
> "<octal mode> name\0" <20-byte sha1>
>
> so you do have a *lot* of strlen() calls when doing any tree parsing.
This is definitely an area where pack v4 will bring that cost down to
zero.
Nicolas
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 1:29 ` Linus Torvalds
2007-03-18 1:38 ` Nicolas Pitre
@ 2007-03-18 1:44 ` Linus Torvalds
1 sibling, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 1:44 UTC (permalink / raw)
To: Morten Welinder; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List
On Sat, 17 Mar 2007, Linus Torvalds wrote:
> >
> > That's a lot of strings, or perhaps very long strings. Or a profiling bug.
Btw, the reason I'm pretty sure that it's not a profiling bug is that
(a) the rest of the profile looks fine
(b) it actually matches the rest of the profile.
In particular, while you reacted to
samples % app name symbol name
41527 15.6550 git strlen
you didn't bat an eye on
9240 3.4833 git get_mode
8863 3.3412 git tree_entry_extract
ie over 3% of time spent in tree entry extract and get_mode. But take
another look at that tree_entry_extract() function in particular and look
what it does, and ask yourself: if *that* function takes up 3% of time,
what does it tell you about strlen()?
(Side note: we could probably improve "strlen()" in particular. We
sometimes call it twice: look at "entry_extract()", which calls strlen()
on the tree entry extract, but then *also* calls strlen on the resulting
path.
I suspect the
a->pathlen = strlen(a->path);
could be written as
a->pathlen = (char *)a->sha1 - (char *)a->path - 1;
but somebody should check that I didn't off-by-one or something (see the
layout sketch below). Also, it might be better to make that part of
"tree_entry_extract()" itself, because other callers do the same thing:
see "find_tree_entry()", where doing a "strlen()" on the path returned by
tree_entry_extract() seems to be a common pattern.)
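The arithmetic does check out against the tree entry layout; a quick
sketch of why there is no off-by-one:

	/*
	 *   <octal mode> SP  p a t h \0  <20-byte sha1>
	 *                    ^path        ^sha1
	 *
	 * sha1 - path counts the path bytes plus the trailing NUL,
	 * so (char *)sha1 - (char *)path - 1 == strlen(path).
	 */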
HOWEVER!
Once we get to *that* level of optimizations, we're doing pretty damn
well. I'm sure we could probably cut down that strlen() from 16% to 8% by
being smart about it, but still - this is a "good kind of problem" to
have, if these things are your lowest-hanging fruit!
Maybe it all boils down to the same thing: I just can't seem to be really
upset about "git log drivers/usb/ > /dev/null" taking all of a second. It
just doesn't strike me as a performance problem ;)
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 1:38 ` Nicolas Pitre
@ 2007-03-18 1:55 ` Linus Torvalds
2007-03-18 2:03 ` Nicolas Pitre
0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 1:55 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Morten Welinder, Junio C Hamano, Git Mailing List
On Sat, 17 Mar 2007, Nicolas Pitre wrote:
>
> This is definitely an area where pack v4 will bring that cost down to
> zero.
Heh. I'll believe that when I see it. The thing is, unless you re-generate
the tree object data structures, you'll have to have totally different
tree walkers for different tree types, and it will all be quite ugly and
complex. And "ugly and complex" seldom translates into "zero cost".
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 1:55 ` Linus Torvalds
@ 2007-03-18 2:03 ` Nicolas Pitre
2007-03-18 2:20 ` Linus Torvalds
0 siblings, 1 reply; 85+ messages in thread
From: Nicolas Pitre @ 2007-03-18 2:03 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Morten Welinder, Junio C Hamano, Git Mailing List
On Sat, 17 Mar 2007, Linus Torvalds wrote:
>
>
> On Sat, 17 Mar 2007, Nicolas Pitre wrote:
> >
> > This is definitely an area where pack v4 will bring that cost down to
> > zero.
>
> Heh. I'll believe that when I see it. The thing is, unless you re-generate
> the tree object data structures, you'll have to have totally different
> tree walkers for different tree types, and it will all be quite ugly and
> complex. And "ugly and complex" seldom translates into "zero cost".
Well... in my opinion it is the _current_ tree walker that is quite ugly
and complex. It is always messier to parse strings than fixed width
binary fields.
Nicolas
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 2:03 ` Nicolas Pitre
@ 2007-03-18 2:20 ` Linus Torvalds
2007-03-18 3:00 ` Nicolas Pitre
2007-03-18 3:06 ` [PATCH 3/2] Avoid unnecessary strlen() calls Linus Torvalds
0 siblings, 2 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 2:20 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Morten Welinder, Junio C Hamano, Git Mailing List
On Sat, 17 Mar 2007, Nicolas Pitre wrote:
>
> Well... in my opinion it is the _current_ tree walker that is quite ugly
> and complex. It is always messier to parse strings than fixed width
> binary fields.
Sure. On the other hand, text is what made things easy to do initially,
and you're missing one *BIG* clue: you cannot remove the support without
losing compatibility with all traditional object formats.
So you have no choice. You need to support the text representation. As a
result, *your* code will now be way more ugly and messy.
The thing is, parsing some little text may sound expensive, but if the
expense is in finding the end of the string, we're doing really well.
In other words: the data structures are both simple and straightforward,
and the only reason strlen() shows up at all is:
- we pass strings around as just C strings, even when we know their
lengths. Prime example: look at tree-diff.c. And when you look at it,
realize that *for*every*single*strlen* in that file except for the very
last one (which is only used once per process for setup) we actually
know the string length from before, but we (well, *I*) decided that it
wasn't worth passing down as a parameter all the time.
- the simple parsing of the tree itself (which really isn't that
expensive - the real expense is bringing the data into the CPU cache,
but that's something we'd need to do *anyway*).
So I seriously suspect that you could get the strlen() overhead down from
that 16% pretty easily, but you'd have to pass the length of the "base"
string along all the time (and in the tree_entry cases you'd replace the
"strlen()" calls with a call to something like
static inline int tree_entry_len(const char *name, const unsigned char *sha1)
{
return (char *)sha1 - (char *)name - 1;
}
which will do it for you).
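At a call site that already has the sha1 in hand, the substitution is then
mechanical, e.g.:

	sha1 = tree_entry_extract(t1, &path1, &mode1);
	pathlen1 = tree_entry_len(path1, sha1);	/* was: strlen(path1) */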
But what you're ignoring here is that "16%" may sound like a huge deal,
but it's 16% of something that takes 1 second, and that other SCM's cannot
do AT ALL.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 2:20 ` Linus Torvalds
@ 2007-03-18 3:00 ` Nicolas Pitre
2007-03-18 3:31 ` Linus Torvalds
2007-03-18 3:06 ` [PATCH 3/2] Avoid unnecessary strlen() calls Linus Torvalds
1 sibling, 1 reply; 85+ messages in thread
From: Nicolas Pitre @ 2007-03-18 3:00 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Morten Welinder, Junio C Hamano, Git Mailing List
On Sat, 17 Mar 2007, Linus Torvalds wrote:
>
>
> On Sat, 17 Mar 2007, Nicolas Pitre wrote:
> >
> > Well... in my opinion it is the _current_ tree walker that is quite ugly
> > and complex. It is always messier to parse strings than fixed width
> > binary fields.
>
> Sure. On the other hand, text is what made things easy to do initially,
Oh indeed. No argument there.
> and you're missing one *BIG* clue: you cannot remote the support without
> losing compatibility with all traditional object formats.
>
> So you have no choice. You need to support the text representation. As a
> result, *your* code will now be way more ugly and messy.
Depends. We currently have separate parsers for trees, commits, tags,
etc. It should be easy enough to add another (separate) parser for
new tree objects while still having a common higher level accessor
interface like tree_entry().
But right now we simply regenerate the text representation whenever the
binary representation is encountered, just to make things easy, and
yet we already have a performance gain in _addition_ to a net
saving in disk footprint.
> The thing is, parsing some little text may sound expensive, but if the
> expense is in finding the end of the string, we're doing really well.
Of course the current tree parser will remain, probably forever. And it
is always a good thing to optimize it further whenever possible.
> But what you're ignoring here is that "16%" may sound like a huge deal,
> but it's 16% of somethng that takes 1 second, and that other SCM's cannot
> do AT ALL.
Sure. But at this point the reference to compare GIT performance
against might be GIT itself. And while 1 second is really nice in this
case, there are some repos where it could be (and has already been
reported to be) much more.
I still have a feeling that we can do even better than we do now. Much
much better than 16% actually. But that requires a new data format that
is designed for speed.
We'll see.
Nicolas
^ permalink raw reply [flat|nested] 85+ messages in thread
* [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-18 2:20 ` Linus Torvalds
2007-03-18 3:00 ` Nicolas Pitre
@ 2007-03-18 3:06 ` Linus Torvalds
2007-03-18 9:45 ` Junio C Hamano
1 sibling, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 3:06 UTC (permalink / raw)
To: Nicolas Pitre, Junio C Hamano; +Cc: Morten Welinder, Git Mailing List
This is a micro-optimization that grew out of the mailing list discussion
about "strlen()" showing up in profiles.
We used to pass regular C strings around to the low-level tree walking
routines, and while this worked fine, it meant that we needed to call
strlen() on strings that the caller always actually knew the size of
anyway.
So pass the length of the string down with the string, and avoid
unnecessary calls to strlen(). Also, when extracting a pathname from a
tree entry, use "tree_entry_len()" instead of strlen(), since the length
of the pathname is directly calculable from the decoded tree entry itself
without having to actually do another strlen().
This shaves off another ~5-10% from some loads that are very tree
intensive (notably doing commit filtering by a pathspec).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
On Sat, 17 Mar 2007, Linus Torvalds wrote:
>
> - we pass strings around as just C strings, even when we know their
> lengths. Prime example: look at tree-diff.c. And when you look at it,
> realize that *for*every*single*strlen* in that file except for the very
> last one (which is only used once per process for setup) we actually
> know the string length from before, but we (well, *I*) decided that it
> wasn't worth passing down as a parameter all the time.
So here's the patch.
It definitely cuts down on CPU usage, and I actually left one extra
"strlen()" around, simply because I didn't want to mess with the exported
interface of "diff_tree()".
But that other strlen() is also one that is done *once* for the whole
tree, so from a performance standpoint it doesn't matter (we *could* have
passed in that length too, but that would have involved more changes that
simply aren't really useful).
Does it help? Yes it does. It takes another 5-10% off my test-case.
"strlen()" still exists, but it's basically half of what it used to be
because we now basically only call it when literally parsing the tree data
itself (ie now it's ~8% of the total, and no longer the hottest entry.
Is it worth it? If it was just a random micro-optimization I might not
care, but I guess it's not that ugly to pass an extra "baselen" around all
the time. And that "tree_entry_len()" helper function is actually quite
nice. So yeah, I'd suggest applying this one just because it's actually a
perfectly fine patch and it does speed things up.
So it *is* very much a micro-optimization, but one that doesn't really
make the code any uglier, so why not..
I still think that if we do these kinds of optimizations and they matter,
that shows just how *well* we're actually doing here!
Anyway, Junio, it passes all the tests, as well as passing my "looks
obviously correct" filter, so..
Linus
---
diff --git a/tree-diff.c b/tree-diff.c
index c827582..f89b9d3 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -5,9 +5,8 @@
#include "diff.h"
#include "tree.h"
-static char *malloc_base(const char *base, const char *path, int pathlen)
+static char *malloc_base(const char *base, int baselen, const char *path, int pathlen)
{
- int baselen = strlen(base);
char *newbase = xmalloc(baselen + pathlen + 2);
memcpy(newbase, base, baselen);
memcpy(newbase + baselen, path, pathlen);
@@ -16,9 +15,9 @@ static char *malloc_base(const char *base, const char *path, int pathlen)
}
static void show_entry(struct diff_options *opt, const char *prefix, struct tree_desc *desc,
- const char *base);
+ const char *base, int baselen);
-static int compare_tree_entry(struct tree_desc *t1, struct tree_desc *t2, const char *base, struct diff_options *opt)
+static int compare_tree_entry(struct tree_desc *t1, struct tree_desc *t2, const char *base, int baselen, struct diff_options *opt)
{
unsigned mode1, mode2;
const char *path1, *path2;
@@ -28,15 +27,15 @@ static int compare_tree_entry(struct tree_desc *t1, struct tree_desc *t2, const
sha1 = tree_entry_extract(t1, &path1, &mode1);
sha2 = tree_entry_extract(t2, &path2, &mode2);
- pathlen1 = strlen(path1);
- pathlen2 = strlen(path2);
+ pathlen1 = tree_entry_len(path1, sha1);
+ pathlen2 = tree_entry_len(path2, sha2);
cmp = base_name_compare(path1, pathlen1, mode1, path2, pathlen2, mode2);
if (cmp < 0) {
- show_entry(opt, "-", t1, base);
+ show_entry(opt, "-", t1, base, baselen);
return -1;
}
if (cmp > 0) {
- show_entry(opt, "+", t2, base);
+ show_entry(opt, "+", t2, base, baselen);
return 1;
}
if (!opt->find_copies_harder && !hashcmp(sha1, sha2) && mode1 == mode2)
@@ -47,14 +46,14 @@ static int compare_tree_entry(struct tree_desc *t1, struct tree_desc *t2, const
* file, we need to consider it a remove and an add.
*/
if (S_ISDIR(mode1) != S_ISDIR(mode2)) {
- show_entry(opt, "-", t1, base);
- show_entry(opt, "+", t2, base);
+ show_entry(opt, "-", t1, base, baselen);
+ show_entry(opt, "+", t2, base, baselen);
return 0;
}
if (opt->recursive && S_ISDIR(mode1)) {
int retval;
- char *newbase = malloc_base(base, path1, pathlen1);
+ char *newbase = malloc_base(base, baselen, path1, pathlen1);
if (opt->tree_in_recursive)
opt->change(opt, mode1, mode2,
sha1, sha2, base, path1);
@@ -67,20 +66,20 @@ static int compare_tree_entry(struct tree_desc *t1, struct tree_desc *t2, const
return 0;
}
-static int interesting(struct tree_desc *desc, const char *base, struct diff_options *opt)
+static int interesting(struct tree_desc *desc, const char *base, int baselen, struct diff_options *opt)
{
const char *path;
+ const unsigned char *sha1;
unsigned mode;
int i;
- int baselen, pathlen;
+ int pathlen;
if (!opt->nr_paths)
return 1;
- (void)tree_entry_extract(desc, &path, &mode);
+ sha1 = tree_entry_extract(desc, &path, &mode);
- pathlen = strlen(path);
- baselen = strlen(base);
+ pathlen = tree_entry_len(path, sha1);
for (i=0; i < opt->nr_paths; i++) {
const char *match = opt->paths[i];
@@ -121,18 +120,18 @@ static int interesting(struct tree_desc *desc, const char *base, struct diff_opt
}
/* A whole sub-tree went away or appeared */
-static void show_tree(struct diff_options *opt, const char *prefix, struct tree_desc *desc, const char *base)
+static void show_tree(struct diff_options *opt, const char *prefix, struct tree_desc *desc, const char *base, int baselen)
{
while (desc->size) {
- if (interesting(desc, base, opt))
- show_entry(opt, prefix, desc, base);
+ if (interesting(desc, base, baselen, opt))
+ show_entry(opt, prefix, desc, base, baselen);
update_tree_entry(desc);
}
}
/* A file entry went away or appeared */
static void show_entry(struct diff_options *opt, const char *prefix, struct tree_desc *desc,
- const char *base)
+ const char *base, int baselen)
{
unsigned mode;
const char *path;
@@ -140,7 +139,8 @@ static void show_entry(struct diff_options *opt, const char *prefix, struct tree
if (opt->recursive && S_ISDIR(mode)) {
enum object_type type;
- char *newbase = malloc_base(base, path, strlen(path));
+ int pathlen = tree_entry_len(path, sha1);
+ char *newbase = malloc_base(base, baselen, path, pathlen);
struct tree_desc inner;
void *tree;
@@ -149,7 +149,7 @@ static void show_entry(struct diff_options *opt, const char *prefix, struct tree
die("corrupt tree sha %s", sha1_to_hex(sha1));
inner.buf = tree;
- show_tree(opt, prefix, &inner, newbase);
+ show_tree(opt, prefix, &inner, newbase, baselen + 1 + pathlen);
free(tree);
free(newbase);
@@ -160,26 +160,28 @@ static void show_entry(struct diff_options *opt, const char *prefix, struct tree
int diff_tree(struct tree_desc *t1, struct tree_desc *t2, const char *base, struct diff_options *opt)
{
+ int baselen = strlen(base);
+
while (t1->size | t2->size) {
- if (opt->nr_paths && t1->size && !interesting(t1, base, opt)) {
+ if (opt->nr_paths && t1->size && !interesting(t1, base, baselen, opt)) {
update_tree_entry(t1);
continue;
}
- if (opt->nr_paths && t2->size && !interesting(t2, base, opt)) {
+ if (opt->nr_paths && t2->size && !interesting(t2, base, baselen, opt)) {
update_tree_entry(t2);
continue;
}
if (!t1->size) {
- show_entry(opt, "+", t2, base);
+ show_entry(opt, "+", t2, base, baselen);
update_tree_entry(t2);
continue;
}
if (!t2->size) {
- show_entry(opt, "-", t1, base);
+ show_entry(opt, "-", t1, base, baselen);
update_tree_entry(t1);
continue;
}
- switch (compare_tree_entry(t1, t2, base, opt)) {
+ switch (compare_tree_entry(t1, t2, base, baselen, opt)) {
case -1:
update_tree_entry(t1);
continue;
diff --git a/tree-walk.c b/tree-walk.c
index 70f8999..a4a4e2a 100644
--- a/tree-walk.c
+++ b/tree-walk.c
@@ -32,7 +32,7 @@ static void entry_clear(struct name_entry *a)
static void entry_extract(struct tree_desc *t, struct name_entry *a)
{
a->sha1 = tree_entry_extract(t, &a->path, &a->mode);
- a->pathlen = strlen(a->path);
+ a->pathlen = tree_entry_len(a->path, a->sha1);
}
void update_tree_entry(struct tree_desc *desc)
@@ -169,7 +169,7 @@ static int find_tree_entry(struct tree_desc *t, const char *name, unsigned char
sha1 = tree_entry_extract(t, &entry, mode);
update_tree_entry(t);
- entrylen = strlen(entry);
+ entrylen = tree_entry_len(entry, sha1);
if (entrylen > namelen)
continue;
cmp = memcmp(name, entry, entrylen);
diff --git a/tree-walk.h b/tree-walk.h
index e57befa..a0d7afd 100644
--- a/tree-walk.h
+++ b/tree-walk.h
@@ -13,6 +13,11 @@ struct name_entry {
int pathlen;
};
+static inline int tree_entry_len(const char *name, const unsigned char *sha1)
+{
+ return (char *)sha1 - (char *)name - 1;
+}
+
void update_tree_entry(struct tree_desc *);
const unsigned char *tree_entry_extract(struct tree_desc *, const char **, unsigned int *);
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 3:00 ` Nicolas Pitre
@ 2007-03-18 3:31 ` Linus Torvalds
2007-03-18 5:30 ` Julian Phillips
2007-03-18 10:53 ` Robin Rosenberg
0 siblings, 2 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 3:31 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Morten Welinder, Junio C Hamano, Git Mailing List
On Sat, 17 Mar 2007, Nicolas Pitre wrote:
>
> Sure. But at this point the reference to compare GIT performance
> against might be GIT itself. And while 1 second is really nice in this
> case, there are some repos where it could be (and has already been
> reported to be) much more.
I'd still like to see the KDE repo, that thing went quiet after it was
supposed to hit sneaker-net..
If it was 30 seconds before to do a "git log" for some individual file,
after the recent optimizations it should hopefully be down to 10. And I
agree that I might be more motivated to try to get it down further if I
could just find a repository where it's that much.
Right now I can do a "git log" on any file in the kernel archive in
under a second (well, when I say "any file", I started with a script, but
with 22 thousand files I didn't bother to run it for all that long, so I
ended up testing a few random files in addition to the first few hundred
files of "git ls-files", and they are all well under a second).
And that's without the "git diff --quiet" thing that is still in "next",
and that cut down some of the overhead for other reasons (although I
suspect the effect of that will be less when combined with my patches
since the stuff it cut down I probably cut down even more).
I really suspect you'll have a hard time beating "normal" git with the
patches I sent out. I'm sure it's quite possible - don't get me wrong - I
just suspect it won't be spectacular, and it will be a lot of work.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 3:31 ` Linus Torvalds
@ 2007-03-18 5:30 ` Julian Phillips
2007-03-18 17:23 ` Linus Torvalds
2007-03-18 10:53 ` Robin Rosenberg
1 sibling, 1 reply; 85+ messages in thread
From: Julian Phillips @ 2007-03-18 5:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Morten Welinder, Junio C Hamano, Git Mailing List
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1448 bytes --]
On Sat, 17 Mar 2007, Linus Torvalds wrote:
> On Sat, 17 Mar 2007, Nicolas Pitre wrote:
>>
>> Sure. But at this point the reference to compare GIT performance
>> against might be GIT itself. And while 1 second is really nice in this
>> case, there are some repos where it could be (and has already been
>> reported to be) much more.
>
> I'd still like to see the KDE repo, that thing went quiet after it was
> supposed to hit sneaker-net..
>
> If it was 30 seconds before to do a "git log" for some individual file,
> after the recent optimizations it should hopefully be down to 10. And I
> agree that I might be more motivated to try to get it down further if I
> could just find a repository where it's that much.
In my test repository (which emulates a real repository in terms of
approximate number of commits, branches and tags) "git log f12000"
takes about 15m (using 1.5.0.4). After applying patches 1/2 and 2/2 on
top of master I get ~3m50s. With 3/2 as well it goes down a bit more to
~3m20s.
I've attached the script that generated the repository in case you feel
the urge to try some more time-shaving exercises ... ;)
(This is a rather unrealistic repository consisting of a long series of
commits of new binary files, but I don't have access to the repository
that is being approximated until I get back to work on Monday ...)
--
Julian
---
That must be wonderful: I don't understand it at all.
-- Moliere
[-- Attachment #2: mk_large_repo --]
[-- Type: TEXT/PLAIN, Size: 753 bytes --]
#!/bin/bash
# no. of commits branches and tags to make
commits=25000;
branches=900;
tags=8000;
# create a new file of this size (kb) for each commit
commit_size=102;
bs=1024;
large=$1;
((bg = $commits / $branches));
((tg = $commits / $tags));
echo "creating $large";
mkdir $large;
cd $large;
git init-db;
i=0
while [ $i -lt $commits ]; do
    dd if=/dev/urandom of=f$i bs=${bs} count=${commit_size} > /dev/null 2>&1
    git add f$i;
    git commit -m "add t$i";
    ((ig = $i % $tg));
    if [ $ig -eq 0 ]; then
        git tag t$i;
        echo -n "t";
    fi
    ((ig = $i % $bg));
    if [ $ig -eq 0 ]; then
        git branch b$i;
        echo -n "b";
    fi
    echo -n "$i ";
    ((i = $i + 1))
done
echo;
echo "complete.";
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 1:14 ` Morten Welinder
2007-03-18 1:29 ` Linus Torvalds
@ 2007-03-18 6:28 ` Avi Kivity
1 sibling, 0 replies; 85+ messages in thread
From: Avi Kivity @ 2007-03-18 6:28 UTC (permalink / raw)
To: Morten Welinder
Cc: Linus Torvalds, Junio C Hamano, Nicolas Pitre, Git Mailing List
Morten Welinder wrote:
>> samples % app name symbol name
>> 41527 15.6550 git strlen
>
> Almost 16% in strlen? Ugh!
>
> That's a lot of strings, or perhaps very long strings. Or a profiling
> bug.
>
Or maybe strlen() is the first function to touch the page/cacheline.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 1:13 ` Nicolas Pitre
@ 2007-03-18 7:47 ` Junio C Hamano
0 siblings, 0 replies; 85+ messages in thread
From: Junio C Hamano @ 2007-03-18 7:47 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Git Mailing List
Nicolas Pitre <nico@cam.org> writes:
> A malloc() + memcpy() will always be faster than mmap() + malloc() +
> inflate(). If the data is already there it is certainly better to copy
> it straight away.
I do not know if there is mmap() cost involved, but you are
correct to point out that my aversion to malloc() cost was
unfounded. We need to allocate anyway, and memcpy() should of
course be cheaper than inflate().
> With the patch below I can do 'git log drivers/scsi/ > /dev/null' about
> 7% faster. I bet it might be even more on those platforms with bad
> mmap() support.
Wonderful. I was going to nitpick but you even took care of the
convention of returning a buffer with one extra byte that
terminates the contents with NUL. Perfect.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-18 3:06 ` [PATCH 3/2] Avoid unnecessary strlen() calls Linus Torvalds
@ 2007-03-18 9:45 ` Junio C Hamano
2007-03-18 15:54 ` Linus Torvalds
0 siblings, 1 reply; 85+ messages in thread
From: Junio C Hamano @ 2007-03-18 9:45 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Morten Welinder, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> This shaves off another ~5-10% from some loads that are very tree
> intensive (notably doing commit filtering by a pathspec).
>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With your 256-entry cache, Nico's reusing objects out of delta
base cache, and this strlen() patch
git blame -C block/ll_rw_blk.c
gets these numbers:
(v1.5.0)
14.71user 0.26system 0:15.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+93622minor)pagefaults 0swaps
(master + three patches)
8.94user 0.14system 0:09.10elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+40075minor)pagefaults 0swaps
Just for fun, these are the same for the kernel history with tglx-history
repository's history grafted behind it, i.e. with this grafts file:
$ cat .git/info/grafts
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 e7e173af42dbf37b1d946f9ee00219cb3b2bea6a
(v1.5.0)
73.80user 2.57system 1:16.40elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+773077minor)pagefaults 0swaps
(master + three patches)
65.14user 0.40system 1:05.55elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+125052minor)pagefaults 0swaps
In either case, it is showing drastic reduction of minor faults.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 3:31 ` Linus Torvalds
2007-03-18 5:30 ` Julian Phillips
@ 2007-03-18 10:53 ` Robin Rosenberg
2007-03-18 17:34 ` Linus Torvalds
1 sibling, 1 reply; 85+ messages in thread
From: Robin Rosenberg @ 2007-03-18 10:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Morten Welinder, Junio C Hamano, Git Mailing List
söndag 18 mars 2007 04:31 skrev Linus Torvalds:
> I'd still like to see the KDE repo, that thing went quiet after it was
> supposed to hit sneaker-net..
>
> If it was 30 seconds before to do a "git log" for some individual file,
> after the recent optimizations it should hopefully be down to 10. And I
> agree that I might be more motivated to try to get it down further if I
> could just find a repository where it's that much.
I don't have the KDE repo, but I do have an Eclipse import. Without your
patches I get (hot cache)
# time git log -- org.eclipse.core.resources/src/org/eclipse/core/resources/ >/dev/null
65.10user 0.50system 1:12.44elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+80242minor)pagefaults 0swaps
With patch 1&2 (hot cache)
# time ~/SW/GIT/git-log -- org.eclipse.core.resources/src/org/eclipse/core/resources/ >/dev/null
27.51user 0.21system 0:28.23elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+80266minor)pagefaults 0swaps
That's quite an improvement. The Eclipse repo is about 140k commits in the master branch and
has a 3GB pack file (fromcvs import).
-- robin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-18 9:45 ` Junio C Hamano
@ 2007-03-18 15:54 ` Linus Torvalds
2007-03-18 15:57 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 15:54 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Morten Welinder, Git Mailing List
On Sun, 18 Mar 2007, Junio C Hamano wrote:
>
> git blame -C block/ll_rw_blk.c
>
> Just for fun, these are the same for the kernel history with tglx-history
> repository's history grafted behind it, i.e. with this grafts file:
>
> $ cat .git/info/grafts
> 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 e7e173af42dbf37b1d946f9ee00219cb3b2bea6a
>
> (v1.5.0)
> 73.80user 2.57system 1:16.40elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+773077minor)pagefaults 0swaps
>
> (master + three patches)
> 65.14user 0.40system 1:05.55elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+125052minor)pagefaults 0swaps
>
> In either case, it is showing drastic reduction of minor faults.
That's an interesting test-case (and I get 53 seconds, nyaah, nyaah ;)
However, it's almost totally *not* about object access any more with my
patches. All the top profiling hits are about generating the patches and
assigning blame:
samples % image name app name symbol name
470352 15.5813 git git xdl_hash_record
298683 9.8944 git git cmp_suspect
225156 7.4587 git git assign_blame
221308 7.3312 libc-2.5.so libc-2.5.so memcpy
177621 5.8840 libc-2.5.so libc-2.5.so memchr
163571 5.4186 vmlinux vmlinux __copy_user_nocache
129301 4.2833 git git xdl_prepare_ctx
99009 3.2799 libc-2.5.so libc-2.5.so _int_malloc
83899 2.7793 git git xdiff_outf
80588 2.6696 libz.so.1.2.3 libz.so.1.2.3 (no symbols)
..
so as you can see, libz is down in the 2.5% range, and strlen and the tree
accessor functions are totally in the noise.
So it looks like it *used* to be somewhat of a problem (the object access
itself must have been about 10 seconds, since that got shaved off the
time), but realistically, if you want to speed up "git blame", we can
totally ignore the git object data structures, and concentrate on xdiff
and on blame itself (cmp_suspect and assign_blame probably have some nasty
O(n^2) behaviour or something like that, which could hopefully be fixed
fairly easily. The xdl hashing is a different thing, and I don't think
it's necessarily easy to fix that one..)
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-18 15:54 ` Linus Torvalds
@ 2007-03-18 15:57 ` Linus Torvalds
2007-03-18 21:38 ` Shawn O. Pearce
2007-03-20 3:05 ` Johannes Schindelin
2007-03-20 3:16 ` Junio C Hamano
2 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 15:57 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Morten Welinder, Git Mailing List
On Sun, 18 Mar 2007, Linus Torvalds wrote:
>
> That's an interesting test-case (and I get 53 seconds, nyaah, nyaah ;)
Btw, it's also an example of why the incremental blame is so much nicer.
No way would I want to wait 53 seconds to get the whole blame. But doing
git gui blame HEAD block/ll_rw_blk.c
(the "git gui" command line is a bit unwieldly) you get something quite
usable!
Of course, the git gui blame colorization is clearly done by somebody who
is still actively popping LSD with both fists and didn't realize that the
60's are long done, but that's another issue.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 5:30 ` Julian Phillips
@ 2007-03-18 17:23 ` Linus Torvalds
0 siblings, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 17:23 UTC (permalink / raw)
To: Julian Phillips
Cc: Nicolas Pitre, Morten Welinder, Junio C Hamano, Git Mailing List
On Sun, 18 Mar 2007, Julian Phillips wrote:
>
> (This is a rather unrealistic repository consisting of a long series of
> commits of new binary files, but I don't have access to the repository that is
> being approximated until I get back to work on Monday ...)
This is a *horrible* test repo.
Is this actually really trying to approximate anything you work with? If
so, please check whether you have cyanide or some other effective poison
to kill all your cow-orkers - it's really doing them a favor - and then do
the honorable thing yourself? Use something especially painful on whoever
came up with the idea to track 25000 files in a single directory.
I'll see what the profile is, but even without the repo fully generated
yet, I can already tell you that you should *not* put tens of thousands of
files in a single directory like this.
It's not only usually horribly bad quite independently of any SCM issues
(ie most filesystems will have some bad performance behaviour with things
like this - if only because "readdir()" will inevitably be slow).
And for git it means that you lose all ability to efficiently prune away
the parts of the tree that you don't care about. git will always end up
working with a full linear file manifest instead of a nice collection of
recursive trees, and a lot of the nice tree-walking optimizations that git
has will just end up being no-ops: each tree is always one *huge*
manifest.
So it's not that git cannot handle it, it's that a lot of the nice things
that make git really efficient simply won't trigger for your repository.
In short: avoiding tens of thousands of files in a single directory is
*always* a good idea. With or without git.
(Again, SCM's that are really just "one file at a time" like CVS,
won't care as much. They never really track all files anyway, so while
they are limited by potential filesystem performance bottlenecks, they
won't have the fundamental issue of tracking 25,000 files..)
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 10:53 ` Robin Rosenberg
@ 2007-03-18 17:34 ` Linus Torvalds
2007-03-18 18:29 ` Robin Rosenberg
0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 17:34 UTC (permalink / raw)
To: Robin Rosenberg
Cc: Nicolas Pitre, Morten Welinder, Junio C Hamano, Git Mailing List
On Sun, 18 Mar 2007, Robin Rosenberg wrote:
>
> I don't have the KDE repo, but I do have an Eclipse import.
>
> The eclipse repo is about 140k commits in the master branch and
> has a 3GB pack file (fromcvs import).
Do you happen to have a fast internet connection that you can expose this
thing on?
3GB will take me a while to download, but it sounds like a great
test-case. A 3GB pack-file is what we're supposed to be able to handle
fairly comfortably right now, so it sounds like the ideal project to do
performance testing on.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 17:34 ` Linus Torvalds
@ 2007-03-18 18:29 ` Robin Rosenberg
2007-03-18 21:25 ` Shawn O. Pearce
2007-03-19 13:16 ` David Brodsky
0 siblings, 2 replies; 85+ messages in thread
From: Robin Rosenberg @ 2007-03-18 18:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Morten Welinder, Junio C Hamano, Git Mailing List
söndag 18 mars 2007 18:34 skrev Linus Torvalds:
>
> On Sun, 18 Mar 2007, Robin Rosenberg wrote:
> >
> > I don't have the KDE repo, but I do have an Eclipse import.
> >
> > The eclipse repo is about 140k commits in the master branch and
> > has a 3GB pack file (fromcvs import).
>
> Do you happen to have a fast internet connection that you can expose this
> thing on?
Not that fast, and it would take me quite a while to move the files to a public
location (it's on my laptop). I'd rather dump it somewhere directly if someone can
provide me with some suitable coordinates.
-- robin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 18:29 ` Robin Rosenberg
@ 2007-03-18 21:25 ` Shawn O. Pearce
2007-03-19 13:16 ` David Brodsky
1 sibling, 0 replies; 85+ messages in thread
From: Shawn O. Pearce @ 2007-03-18 21:25 UTC (permalink / raw)
To: Robin Rosenberg
Cc: Linus Torvalds, Nicolas Pitre, Morten Welinder, Junio C Hamano,
Git Mailing List
Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:
> söndag 18 mars 2007 18:34 skrev Linus Torvalds:
> >
> > On Sun, 18 Mar 2007, Robin Rosenberg wrote:
> > >
> > > I don't have the KDE repo, but I do have an Eclipse import.
> > >
> > > The eclipse repo is about 140k commits in the master branch and
> > > has a 3GB pack file (fromcvs import).
> >
> > Do you happen to have a fast internet connection that you can expose this
> > thing on?
>
> Not that fast and it would take me quite a time to move the files to a public
> location (it's on my laptop). I'd rather dump it somewhere directly if someone can
> provide me with some suitable coordinates.
I'd like to get a copy of one of these big repos too (KDE, Eclipse).
I probably could get a DVD onto both Internet and Internet2 from a
fast enough pipe that a few folks (e.g. Linus, Nico, Junio) could
pull it down, but I can't offer a public distribution point for
the world.
--
Shawn.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-18 15:57 ` Linus Torvalds
@ 2007-03-18 21:38 ` Shawn O. Pearce
2007-03-18 21:48 ` Linus Torvalds
0 siblings, 1 reply; 85+ messages in thread
From: Shawn O. Pearce @ 2007-03-18 21:38 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Nicolas Pitre, Morten Welinder, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Btw, it's also an example of why the incremental blame is so much nicer.
>
> No way would I want to wait 53 seconds to get the whole blame. But doing
>
> git gui blame HEAD block/ll_rw_blk.c
>
> (the "git gui" command line is a bit unwieldly) you get something quite
> usable!
>
> Of course, the git gui blame colorization is clearly done by somebody who
> is still actively popping LSD with both fists and didn't realize that the
> 60's are long done, but that's another issue.
:-)
git-gui is open source. I'd be happy to take a patch. Or,
since that is horribly messy Tcl/Tk code, just a better color
suggestion. :-)
--
Shawn.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-18 21:38 ` Shawn O. Pearce
@ 2007-03-18 21:48 ` Linus Torvalds
0 siblings, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-18 21:48 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Junio C Hamano, Nicolas Pitre, Morten Welinder, Git Mailing List
On Sun, 18 Mar 2007, Shawn O. Pearce wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > Of course, the git gui blame colorization is clearly done by somebody who
> > is still actively popping LSD with both fists and didn't realize that the
> > 60's are long done, but that's another issue.
>
> :-)
>
> git-gui is open source. I'd be happy to take a patch. Or,
> since that is horribly messy Tcl/Tk code, just a better color
> suggestion. :-)
Yeah, the Tcl/Tk part means that I take one look and decide that I have
absolutely zero clue..
Also, I'm not entirely sure what the "right" color is, but the changing
colors do confuse me. Also, maybe I'm some kind of white suburban
house-wife or something, but I prefer calmer pastel colors over the bright
ones you've selected.
I would suggest:
- some special color for "currently selected" (which defaults to being
the first one coming out of the blame thing, of course).
I'd suggest "black text on pale green background", but that may be just
me. Patricia calls the current color "hot pink", and maybe that's
appropriate for a certain segment of the population, but I'm not sure I
want to even *meet* that segment ;)
- some *stable* graduated color for the rest. I don't think it
necessarily needs to be "older" vs "newer", and in fact I'd suggest
just two slightly different shades of gray for the background - just
pick alternating shades for each blame entry that comes in (and leave
un-blamed lines white).
The flickering just makes me go "ooh, I'm really happy I don't have
epilepsy, because otherwise I'd be writhing on the floor every time I
tried to use this tool".
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-18 18:29 ` Robin Rosenberg
2007-03-18 21:25 ` Shawn O. Pearce
@ 2007-03-19 13:16 ` David Brodsky
2007-03-20 6:35 ` Robin Rosenberg
1 sibling, 1 reply; 85+ messages in thread
From: David Brodsky @ 2007-03-19 13:16 UTC (permalink / raw)
To: Robin Rosenberg
Cc: Linus Torvalds, Nicolas Pitre, Morten Welinder, Junio C Hamano,
Git Mailing List
Robin Rosenberg wrote:
> söndag 18 mars 2007 18:34 skrev Linus Torvalds:
>> On Sun, 18 Mar 2007, Robin Rosenberg wrote:
>>> I don't have the KDE repo, but I do have an Eclipse import.
>>>
>>> The eclipse repo is about 140k commits in the master branch and
>>> has a 3GB pack file (fromcvs import).
>> Do you happen to have a fast internet connection that you can expose this
>> thing on?
>
> Not that fast and it would take me quite a time to move the files to a public
> location (it's on my laptop). I'd rather dump it somewhere directly if someone can
> provide me with some suitable coordinates.
I have access to a server with enough disk space and its internet
connection should be something like 10 Mbps (or even faster). I can
provide you anonymous ftp access for upload/download (temporarily) and
http for download (permanent).
Or you can send me a dvd, but that would take some time (at least 1 week
because I'm in the Czech Republic and I don't expect that you are
anywhere near...) and I don't know if the postal service can handle such
fragile things as DVDs.
This is the least I can do for you...
David Brodsky
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-18 15:54 ` Linus Torvalds
2007-03-18 15:57 ` Linus Torvalds
@ 2007-03-20 3:05 ` Johannes Schindelin
2007-03-20 3:29 ` Shawn O. Pearce
2007-03-20 3:16 ` Junio C Hamano
2 siblings, 1 reply; 85+ messages in thread
From: Johannes Schindelin @ 2007-03-20 3:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Nicolas Pitre, Morten Welinder, Git Mailing List
Hi,
On Sun, 18 Mar 2007, Linus Torvalds wrote:
> All the top profiling hits are about generating the patches and
> assigning blame:
>
> samples % image name app name symbol name
> 470352 15.5813 git git xdl_hash_record
I felt a little left out in all that performance slashing, and so I
thought maybe, just maybe, a small change in xdl_hash_record() can do
wonders (since it _is_ really simple, but still takes almost a 6th of the
CPU time). I don't have a proper test case setup, so maybe you want to try
this:
-- snipsnap --
[PATCH] xdiff/xutils.c(xdl_hash_record): factor out whitespace handling
Since in at least one use case, xdl_hash_record() takes over 15% of the
CPU time, it makes sense to even micro-optimize it. For many cases, no
whitespace special handling is needed, and in these cases we should not
even bother to check for whitespace in _every_ iteration of the loop.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
---
Please do not consider this patch _unless_ it is proven to enhance
the profile statistics substantially.
xdiff/xutils.c | 22 ++++++++++++++++++++--
1 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 3653864..bf91c0f 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -236,12 +236,13 @@ int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
return 0;
}
-unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
+static unsigned long xdl_hash_record_with_whitespace(char const **data,
+ char const *top, long flags) {
unsigned long ha = 5381;
char const *ptr = *data;
for (; ptr < top && *ptr != '\n'; ptr++) {
- if (isspace(*ptr) && (flags & XDF_WHITESPACE_FLAGS)) {
+ if (isspace(*ptr)) {
const char *ptr2 = ptr;
while (ptr + 1 < top && isspace(ptr[1])
&& ptr[1] != '\n')
@@ -270,6 +271,23 @@ unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
}
+unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
+ unsigned long ha = 5381;
+ char const *ptr = *data;
+
+ if (flags & XDF_WHITESPACE_FLAGS)
+ return xdl_hash_record_with_whitespace(data, top, flags);
+
+ for (; ptr < top && *ptr != '\n'; ptr++) {
+ ha += (ha << 5);
+ ha ^= (unsigned long) *ptr;
+ }
+ *data = ptr < top ? ptr + 1: ptr;
+
+ return ha;
+}
+
+
unsigned int xdl_hashbits(unsigned int size) {
unsigned int val = 1, bits = 0;
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-18 15:54 ` Linus Torvalds
2007-03-18 15:57 ` Linus Torvalds
2007-03-20 3:05 ` Johannes Schindelin
@ 2007-03-20 3:16 ` Junio C Hamano
2007-03-20 4:31 ` Linus Torvalds
2 siblings, 1 reply; 85+ messages in thread
From: Junio C Hamano @ 2007-03-20 3:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Morten Welinder, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> So it looks like it *used* to be somewhat of a problem (the object access
> itself must have been about 10 seconds, since that got shaved off the
> time), but realistically, if you want to speed up "git blame", we can
> totally ignore the git object data structures, an dconcentrate on xdiff
> and on blame itself (cmp_suspect and assign_blame probably have some nasty
> O(n^2) behaviour or something like that,...
With this stupidity-removal patch (struct commit objects are interned,
so comparing the pointers is as good as comparing their sha1s), it gets
down to 7.80user from 8.72user (comparable number of minor faults) for
blaming block/ll_rw_blk.c (without tglx grafts)
diff --git a/builtin-blame.c b/builtin-blame.c
index b51cdc7..104521e 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -182,9 +182,8 @@ struct scoreboard {
static int cmp_suspect(struct origin *a, struct origin *b)
{
- int cmp = hashcmp(a->commit->object.sha1, b->commit->object.sha1);
- if (cmp)
- return cmp;
+ if (a->commit != b->commit)
+ return 1;
return strcmp(a->path, b->path);
}
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-20 3:05 ` Johannes Schindelin
@ 2007-03-20 3:29 ` Shawn O. Pearce
2007-03-20 3:40 ` Shawn O. Pearce
0 siblings, 1 reply; 85+ messages in thread
From: Shawn O. Pearce @ 2007-03-20 3:29 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Linus Torvalds, Junio C Hamano, Nicolas Pitre, Morten Welinder,
Git Mailing List
Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Sun, 18 Mar 2007, Linus Torvalds wrote:
>
> > All the top profiling hits are about generating the patches and
> > assigning blame:
> >
> > samples % image name app name symbol name
> > 470352 15.5813 git git xdl_hash_record
>
> I felt a little left out in all that performance slashing, and so I
> thought maybe, just maybe, a small change in xdl_hash_record() can do
> wonders (since it _is_ really simple, but still takes almost a 6th of the
> CPU time). I don't have a proper test case setup, so maybe you want to try
> this:
>
> -- snipsnap --
> [PATCH] xdiff/xutils.c(xdl_hash_record): factor out whitespace handling
>
> Since in at least one use case, xdl_hash_record() takes over 15% of the
> CPU time, it makes sense to even micro-optimize it. For many cases, no
> whitespace special handling is needed, and in these cases we should not
> even bother to check for whitespace in _every_ iteration of the loop.
>
> Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
>
> ---
>
> Please do not consider this patch _unless_ it is proven to enhance
> the profile statistics substantially.
This is a massive difference for me. I ran it on git-gui.sh in
the git-gui repository - this is a 6000 line file that has a lot of
revisions, and has been renamed a few times. I applied the patch on
top of current 'master' (v1.5.1-rc1), so I was testing with Linus'
delta_base_cache.
# stock v1.5.1-rc1
$ for a in 1 2 3 4 5;do /usr/bin/time ../lt-blame blame --incremental HEAD git-gui.sh >/dev/null;done
6.27 real 5.31 user 0.55 sys
6.40 real 5.32 user 0.55 sys
6.33 real 5.33 user 0.53 sys
6.67 real 5.32 user 0.55 sys
6.18 real 5.31 user 0.53 sys
# with the above patch
$ for a in 1 2 3 4 5;do /usr/bin/time ../js-blame blame --incremental HEAD git-gui.sh >/dev/null;done
3.57 real 2.87 user 0.51 sys
3.58 real 2.87 user 0.51 sys
3.53 real 2.86 user 0.52 sys
3.61 real 2.86 user 0.51 sys
3.64 real 2.87 user 0.52 sys
For the record, both versions did produce identical output.
Given how small of a change it is, and how much of an improvement
it made, I say apply it.
--
Shawn.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-20 3:29 ` Shawn O. Pearce
@ 2007-03-20 3:40 ` Shawn O. Pearce
2007-03-20 4:11 ` Linus Torvalds
0 siblings, 1 reply; 85+ messages in thread
From: Shawn O. Pearce @ 2007-03-20 3:40 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Linus Torvalds, Junio C Hamano, Nicolas Pitre, Morten Welinder,
Git Mailing List
"Shawn O. Pearce" <spearce@spearce.org> wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > -- snipsnap --
> > [PATCH] xdiff/xutils.c(xdl_hash_record): factor out whitespace handling
...
> > ---
> >
> > Please do not consider this patch _unless_ it is proven to enhance
> > the profile statistics substantially.
>
> This is a massive difference for me.
...
> # stock v1.5.1-rc1
> $ for a in 1 2 3 4 5;do /usr/bin/time ../lt-blame blame --incremental HEAD git-gui.sh >/dev/null;done
> 6.27 real 5.31 user 0.55 sys
> 6.40 real 5.32 user 0.55 sys
> 6.33 real 5.33 user 0.53 sys
> 6.67 real 5.32 user 0.55 sys
> 6.18 real 5.31 user 0.53 sys
>
> # with the above patch
> $ for a in 1 2 3 4 5;do /usr/bin/time ../js-blame blame --incremental HEAD git-gui.sh >/dev/null;done
> 3.57 real 2.87 user 0.51 sys
> 3.58 real 2.87 user 0.51 sys
> 3.53 real 2.86 user 0.52 sys
> 3.61 real 2.86 user 0.51 sys
> 3.64 real 2.87 user 0.52 sys
DrNick suggested on #git to try flipping the isspace test around.
This is a smaller change and generated the same ~3.60 seconds run
as Dscho's patch. I like DrNick's version better. ;-)
-->8--
[PATCH] xdiff/xutils.c(xdl_hash_record): factor out whitespace handling
Since in at least one use case, xdl_hash_record() takes over 15%
of the CPU time, it makes sense to even micro-optimize it. For
many cases, no whitespace special handling is needed, and in these
cases we should not even bother to check for whitespace in _every_
iteration of the loop.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---
xdiff/xutils.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 3653864..7b1f213 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -241,7 +241,7 @@ unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
char const *ptr = *data;
for (; ptr < top && *ptr != '\n'; ptr++) {
- if (isspace(*ptr) && (flags & XDF_WHITESPACE_FLAGS)) {
+ if ((flags & XDF_WHITESPACE_FLAGS) && isspace(*ptr)) {
const char *ptr2 = ptr;
while (ptr + 1 < top && isspace(ptr[1])
&& ptr[1] != '\n')
--
1.5.1.rc1.595.gd1206
--
Shawn.
^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-20 3:40 ` Shawn O. Pearce
@ 2007-03-20 4:11 ` Linus Torvalds
2007-03-20 4:18 ` Shawn O. Pearce
2007-03-20 5:44 ` Junio C Hamano
0 siblings, 2 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-20 4:11 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Johannes Schindelin, Junio C Hamano, Nicolas Pitre,
Morten Welinder, Git Mailing List
On Mon, 19 Mar 2007, Shawn O. Pearce wrote:
>
> DrNick suggested on #git to try flipping the isspace test around.
> This is a smaller change and generated the same ~3.60 seconds run
> as Dscho's patch. I like DrNick's version better. ;-)
For me, the result seems to be in the noise.
It may be due to running on Core 2. It's not very sensitive to
micro-optimizations like this. It definitely makes sense to test the
*stable* test first, since that will help branch prediction (the
"isspace()" test is *not* very predictable), so I don't disagree with the
patch, but I suspect it depends a lot on the microarchitecture just how
much it matters.
Do you perhaps have a P4? It has a very bad branch mispredict penalty, so
putting the predictable branch first could explain the big difference you
see..
Dscho's bigger patch probably helps more on an in-order architecture, and
should be equally good on a P4 (or Opteron). On Core 2, neither of the
patches seems to make a huge difference.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-20 4:11 ` Linus Torvalds
@ 2007-03-20 4:18 ` Shawn O. Pearce
2007-03-20 4:45 ` Linus Torvalds
2007-03-20 5:44 ` Junio C Hamano
1 sibling, 1 reply; 85+ messages in thread
From: Shawn O. Pearce @ 2007-03-20 4:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Johannes Schindelin, Junio C Hamano, Nicolas Pitre,
Morten Welinder, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 19 Mar 2007, Shawn O. Pearce wrote:
> >
> > DrNick suggested on #git to try flipping the isspace test around.
> > This is a smaller change and generated the same ~3.60 seconds run
> > as Dscho's patch. I like DrNick's version better. ;-)
>
> For me, the result seems to be in the noise.
>
> It may be due to running on Core 2. It's not very sensitive to
> micro-optimizations like this. It definitely makes sense to test the
> *stable* test first, since that will help branch prediction (the
> "isspace()" test is *not* very predictable), so I don't disagree with the
> patch, but I suspect it depends a lot on the microarchitecture just how
> much it matters.
>
> Do you perhaps have a P4? It has a very bad branch mispredict penalty, so
> putting the predictable branch first could explain the big difference you
> see..
I tested both patches on a PowerPC G4. (Apple PowerBook, 1.5 GHz)
Running on Mac OS X 10.4.8.
Might be more of a Linux<->Darwin thing; perhaps my isspace is
significantly slower than yours is... after all my mmap runs
like a PC from the 1980s... ;-)
--
Shawn.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-20 3:16 ` Junio C Hamano
@ 2007-03-20 4:31 ` Linus Torvalds
2007-03-20 4:39 ` Shawn O. Pearce
0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-20 4:31 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Morten Welinder, Git Mailing List
On Mon, 19 Mar 2007, Junio C Hamano wrote:
>
> With this stupidity-removal patch, it gets down to 7.80user from
> 8.72user (comparable number of minor faults) for blaming
> block/ll_rw_blk.c (without tglx grafts)
Yeah, this one works for me too. Even more than for you. For me,
git blame --incremental -C HEAD block/ll_rw_blk.c
takes 6.71s (best of ten) normally, and 4.85 (best of ten again) with your
patch and Nico's one-liner. In fact, that's a much bigger improvement than
I would have expected from the profile, but it may be that you just cut
the data cache footprint down a lot, and thus made other things more
efficient.
(I just double-checked. Nico's one-liner does help, but not nearly as
radically as it did for Nico. The "best of ten" with *just* Nico's
one-liner is 6.22 for me - better than before, but the combination of
Nico's patch and yours is much more dramatic).
Btw, Dscho's slightly more invasive patch seems to *just* edge out Nico's
one-liner for me, with best-of-ten being 6.17s.
The winner is your patch *with* Dscho's slightly more invasive one: 4.69s.
But the difference between the numbers for Dscho's bigger patch and Nico's
one-liner really is totally in the noise. Dscho *just* wins the
best-of-ten both with and without your patch, but in both cases it's
*way* in the noise. For example, while 4.69s was the best for your+Dscho
in my testing, the full series was
0:05.69
0:04.69
0:04.82
0:04.97
0:04.85
0:05.88
0:04.77
0:04.69
0:05.12
0:04.98
so the variability was big enough that I wouldn't say that 0.1s is really
all that meaningful even for "best of ten". I didn't try to make the
machine totally quiescent; I've got xmms playing in the background, etc.
But these kinds of things will definitely vary from machine to machine.
It's all good, though.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-20 4:31 ` Linus Torvalds
@ 2007-03-20 4:39 ` Shawn O. Pearce
2007-03-20 4:57 ` Linus Torvalds
0 siblings, 1 reply; 85+ messages in thread
From: Shawn O. Pearce @ 2007-03-20 4:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Junio C Hamano, Nicolas Pitre, Morten Welinder, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Btw, Dscho's slightly more invasive patch seems to *just* edge out Nico's
> one-liner for me, with best-of-ten being 6.17s.
Uh, instead of Nico here don't you mean DrNick on #git? He is in
real life Nicholas Miell. Google says he's somewhat active in the
kernel world, so maybe you know him? ;-)
--
Shawn.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-20 4:18 ` Shawn O. Pearce
@ 2007-03-20 4:45 ` Linus Torvalds
0 siblings, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-20 4:45 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Johannes Schindelin, Junio C Hamano, Nicolas Pitre,
Morten Welinder, Git Mailing List
On Tue, 20 Mar 2007, Shawn O. Pearce wrote:
>
> I tested both patches on a PowerPC G4 (Apple PowerBook, 1.5 GHz), running
> Mac OS X 10.4.8.
>
> Might be more of a Linux<->Darwin thing; perhaps my isspace is
> significantly slower than yours is... after all my mmap runs
> like a PC from the 1980s... ;-)
No, we do a git-specific isspace().
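Roughly, that means a table-driven, locale-independent classifier instead of the libc one. A minimal sketch of the shape (illustrative names, not necessarily the exact ones in git):

#define MY_SPACE 0x01
#define MY_DIGIT 0x02

/* one 256-byte lookup table; the test is a load, an AND and a branch */
static const unsigned char my_ctype[256] = {
	['\t'] = MY_SPACE, ['\n'] = MY_SPACE, ['\r'] = MY_SPACE,
	[' ']  = MY_SPACE,
	['0'] = MY_DIGIT, ['1'] = MY_DIGIT, ['2'] = MY_DIGIT,
	['3'] = MY_DIGIT, ['4'] = MY_DIGIT, ['5'] = MY_DIGIT,
	['6'] = MY_DIGIT, ['7'] = MY_DIGIT, ['8'] = MY_DIGIT,
	['9'] = MY_DIGIT,
};

#define my_istest(x, mask) ((my_ctype[(unsigned char)(x)] & (mask)) != 0)
#define my_isspace(x) my_istest(x, MY_SPACE)
#define my_isdigit(x) my_istest(x, MY_DIGIT)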
But yeah, a G4 will explain the thing even more than a P4 would. The G4
really isn't a very good uarch compared to the modern x86 ones. It's not
aggressively out-of-order with deep instruction queues, and I don't think
it does basically any memop re-ordering at all. I know Apple used to claim
that they were the fastest PC around (both with the G4 and the G5), but
let's face it, they lied.
The closer to in-order you are, the more instruction scheduling in sw
tends to matter.
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-20 4:39 ` Shawn O. Pearce
@ 2007-03-20 4:57 ` Linus Torvalds
0 siblings, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2007-03-20 4:57 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Junio C Hamano, Nicolas Pitre, Morten Welinder, Git Mailing List
On Tue, 20 Mar 2007, Shawn O. Pearce wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Btw, Dscho's slightly more invasive patch seems to *just* edge out Nico's
> > one-liner for me, with best-of-ten being 6.17s.
>
> Uh, instead of Nico here don't you mean DrNick on #git? He is in
> real life Nicholas Miell. Google says he's somewhat active in the
> kernel world, so maybe you know him? ;-)
I actually meant you.
For some reason, I confuse you and Nico. I've done it several times, and
even without any DrNick mention.
Time to take my meds,
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 3/2] Avoid unnecessary strlen() calls
2007-03-20 4:11 ` Linus Torvalds
2007-03-20 4:18 ` Shawn O. Pearce
@ 2007-03-20 5:44 ` Junio C Hamano
1 sibling, 0 replies; 85+ messages in thread
From: Junio C Hamano @ 2007-03-20 5:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shawn O. Pearce, Johannes Schindelin, Nicolas Pitre,
Morten Welinder, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Dscho's bigger patch probably helps more on an in-order architecture, and
> should be equally good on a P4 (or Opteron). On Core 2, neither of the
> patches seems to make a huge difference.
Because hoisting the stable test outside the loop is always better on
any architecture, I thought picking between the Gitte and Gitney
patches was a no-brainer and didn't bother to compare-bench,
but then I got curious.
(plain)
7.89user 0.15system 0:08.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41608minor)pagefaults 0swaps
7.93user 0.18system 0:08.14elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41608minor)pagefaults 0swaps
(gitte -- separate function for slow path)
6.98user 0.18system 0:07.17elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41606minor)pagefaults 0swaps
7.14user 0.12system 0:07.26elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41607minor)pagefaults 0swaps
(gitney -- cheap test first before isspace)
7.23user 0.18system 0:07.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41608minor)pagefaults 0swaps
7.32user 0.14system 0:07.48elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41607minor)pagefaults 0swaps
So it does not seem to make much difference on Athlon 64x2 either.
Will apply the "stupid hashcmp() removal" and Dscho's patch and
call it a day.
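For reference, the two shapes compared above, sketched with made-up code (neither is the actual gitte or gitney patch):

#include <ctype.h>

/* "gitney" shape: one loop, with the cheap and well-predicted tests
 * evaluated before the erratic isspace() lookup */
static const char *skip_blank(const char *p, const char *end)
{
	while (p < end && *p != '\n' && isspace((unsigned char)*p))
		p++;
	return p;
}

/* "gitte" shape: push the rarely needed handling into a separate
 * function so the common path in the caller stays small and tight */
static const char *skip_space_slow(const char *p, const char *end)
{
	while (p < end && isspace((unsigned char)*p))
		p++;
	return p;
}

static const char *next_nonspace(const char *p, const char *end)
{
	if (p < end && !isspace((unsigned char)*p))
		return p;			/* hot, common case */
	return skip_space_slow(p, end);		/* cold, out-of-line case */
}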
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-19 13:16 ` David Brodsky
@ 2007-03-20 6:35 ` Robin Rosenberg
2007-03-20 9:13 ` David Brodsky
0 siblings, 1 reply; 85+ messages in thread
From: Robin Rosenberg @ 2007-03-20 6:35 UTC (permalink / raw)
To: David Brodsky
Cc: Linus Torvalds, Nicolas Pitre, Morten Welinder, Junio C Hamano,
Git Mailing List
Uploaded now.
David Brodsky provides the final location.
Linus: I noticed a large extra file that I don't know where it came from. It seems
to be from the first conversion. Perhaps you won't need to download it:
ECLIPSE.git/.git/objects/pack_sETUPg
-- robin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-20 6:35 ` Robin Rosenberg
@ 2007-03-20 9:13 ` David Brodsky
2007-03-21 2:37 ` Linus Torvalds
0 siblings, 1 reply; 85+ messages in thread
From: David Brodsky @ 2007-03-20 9:13 UTC (permalink / raw)
To: Robin Rosenberg
Cc: Linus Torvalds, Nicolas Pitre, Morten Welinder, Junio C Hamano,
Git Mailing List
Robin Rosenberg wrote:
> Uploaded now.
>
> David Brodsky provides the final location.
Anonymous ftp at agnes.kajka.koleje.cuni.cz:10000 - it will stay up for
a while, but since it's my desktop machine, I don't guarantee anything.
And (hopefully) permanent http://steamer.kajka.koleje.cuni.cz/Eclipse
Enjoy
David
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-20 9:13 ` David Brodsky
@ 2007-03-21 2:37 ` Linus Torvalds
2007-03-21 2:54 ` Nicolas Pitre
0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2007-03-21 2:37 UTC (permalink / raw)
To: David Brodsky
Cc: Robin Rosenberg, Nicolas Pitre, Morten Welinder, Junio C Hamano,
Git Mailing List
On Tue, 20 Mar 2007, David Brodsky wrote:
>
> And (hopefully) permanent http://steamer.kajka.koleje.cuni.cz/Eclipse
Ok, thanks, downloaded. Although the pack-file is just 1.7GB for me, not
3.7 like somebody said.
Anyway, doing a
git blame --incremental HEAD -- org.eclipse.debug.ui/plugin.xml > /dev/null
on that thing (picked a random file that got modified in a recent commit)
took something like 12 seconds, so this is certainly a perfectly fine
test-case.
Sadly, "git-gui blame" doesn't work in a bare git repository, so I had to
do an ugly
ln -s . .git
to make git-gui happy, and that worked and was pretty usable. Still,
12 seconds should be something we can improve on.
And yeah, the profile is pretty horrid:
samples  %        app name       symbol name
  70307  20.9412  libc-2.5.so    strlen
  50925  15.1682  libz.so.1.2.3  (no symbols)
  24295   7.2364  git            tree_entry_interesting
  19816   5.9023  libc-2.5.so    memcpy
  19569   5.8287  git            tree_entry_extract
  17693   5.2699  vmlinux        memcpy_c
  17032   5.0730  git            assign_blame
  16956   5.0504  git            get_mode
  12401   3.6937  git            get_origin
  11815   3.5191  git            skip_uninteresting
  10449   3.1123  git            update_tree_entry
  10359   3.0855  git            find_pack_entry_one
   7946   2.3667  git            cmp_suspect
   4572   1.3618  libc-2.5.so    strncmp
...
so I guess we need to find some more strlen's to remove ;)
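The usual shape of that kind of fix, as a hypothetical sketch (not the actual patch): compute the length once and carry it next to the pointer, so the hot path never calls strlen() again:

#include <string.h>
#include <stddef.h>

struct name_len {
	const char *name;
	size_t namelen;		/* cached once, reused everywhere */
};

static void name_len_init(struct name_len *nl, const char *name)
{
	nl->name = name;
	nl->namelen = strlen(name);	/* the only strlen() call */
}

static int name_len_matches(const struct name_len *nl,
			    const char *path, size_t pathlen)
{
	/* hot path: a length compare plus memcmp, no strlen() */
	return pathlen == nl->namelen && !memcmp(path, nl->name, pathlen);
}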
Linus
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [PATCH 2/2] Implement a simple delta_base cache
2007-03-21 2:37 ` Linus Torvalds
@ 2007-03-21 2:54 ` Nicolas Pitre
0 siblings, 0 replies; 85+ messages in thread
From: Nicolas Pitre @ 2007-03-21 2:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Brodsky, Robin Rosenberg, Morten Welinder, Junio C Hamano,
Git Mailing List
On Tue, 20 Mar 2007, Linus Torvalds wrote:
>
>
> On Tue, 20 Mar 2007, David Brodsky wrote:
> >
> > And (hopefully) permanent http://steamer.kajka.koleje.cuni.cz/Eclipse
>
> Ok, thanks, downloaded. Although the pack-file is just 1.7GB for me, not
> 3.7 like somebody said.
There is a 1.3GB garbage pack_sETUPg file lying in eclipse.git/objects/,
probably resulting from an interrupted index-pack, making the
repository needlessly bigger. It can be safely deleted.
We probably should make git-prune get rid of those automatically.
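A rough sketch of what that cleanup could look like -- hypothetical code, not git-prune's actual behaviour -- scanning the objects directory for stale index-pack temporaries and unlinking them:

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* remove leftover "pack_XXXXXX" temporaries; real packs are named
 * "pack-<sha1>.pack", so they never match this prefix */
static void prune_stale_pack_temps(const char *objdir, time_t max_age)
{
	char path[4096];
	struct dirent *de;
	struct stat st;
	time_t now = time(NULL);
	DIR *dir = opendir(objdir);

	if (!dir)
		return;
	while ((de = readdir(dir)) != NULL) {
		if (strncmp(de->d_name, "pack_", 5))
			continue;
		snprintf(path, sizeof(path), "%s/%s", objdir, de->d_name);
		if (stat(path, &st) || !S_ISREG(st.st_mode))
			continue;
		if (now - st.st_mtime > max_age)
			unlink(path);	/* clearly stale, safe to drop */
	}
	closedir(dir);
}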
Nicolas
^ permalink raw reply [flat|nested] 85+ messages in thread
end of thread, other threads:[~2007-03-21 2:54 UTC | newest]
Thread overview: 85+ messages
-- links below jump to the message on this page --
2007-03-16 1:04 cleaner/better zlib sources? Linus Torvalds
2007-03-16 1:10 ` Shawn O. Pearce
2007-03-16 1:11 ` Jeff Garzik
2007-03-16 1:14 ` Matt Mackall
2007-03-16 1:46 ` Linus Torvalds
2007-03-16 1:54 ` Linus Torvalds
2007-03-16 2:43 ` Davide Libenzi
2007-03-16 2:56 ` Linus Torvalds
2007-03-16 3:16 ` Davide Libenzi
2007-03-16 16:21 ` Linus Torvalds
2007-03-16 16:24 ` Davide Libenzi
2007-03-16 16:35 ` Linus Torvalds
2007-03-16 19:21 ` Davide Libenzi
2007-03-17 0:01 ` Linus Torvalds
2007-03-17 1:11 ` Linus Torvalds
2007-03-17 3:28 ` Nicolas Pitre
2007-03-17 5:19 ` Shawn O. Pearce
2007-03-17 17:55 ` Linus Torvalds
2007-03-17 19:40 ` Linus Torvalds
2007-03-17 19:42 ` [PATCH 1/2] Make trivial wrapper functions around delta base generation and freeing Linus Torvalds
2007-03-17 19:44 ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
2007-03-17 21:45 ` Linus Torvalds
2007-03-17 22:37 ` Junio C Hamano
2007-03-17 23:09 ` Linus Torvalds
2007-03-17 23:54 ` Linus Torvalds
2007-03-18 1:13 ` Nicolas Pitre
2007-03-18 7:47 ` Junio C Hamano
2007-03-17 23:12 ` Junio C Hamano
2007-03-17 23:24 ` Linus Torvalds
2007-03-17 23:52 ` Jon Smirl
2007-03-18 1:14 ` Morten Welinder
2007-03-18 1:29 ` Linus Torvalds
2007-03-18 1:38 ` Nicolas Pitre
2007-03-18 1:55 ` Linus Torvalds
2007-03-18 2:03 ` Nicolas Pitre
2007-03-18 2:20 ` Linus Torvalds
2007-03-18 3:00 ` Nicolas Pitre
2007-03-18 3:31 ` Linus Torvalds
2007-03-18 5:30 ` Julian Phillips
2007-03-18 17:23 ` Linus Torvalds
2007-03-18 10:53 ` Robin Rosenberg
2007-03-18 17:34 ` Linus Torvalds
2007-03-18 18:29 ` Robin Rosenberg
2007-03-18 21:25 ` Shawn O. Pearce
2007-03-19 13:16 ` David Brodsky
2007-03-20 6:35 ` Robin Rosenberg
2007-03-20 9:13 ` David Brodsky
2007-03-21 2:37 ` Linus Torvalds
2007-03-21 2:54 ` Nicolas Pitre
2007-03-18 3:06 ` [PATCH 3/2] Avoid unnecessary strlen() calls Linus Torvalds
2007-03-18 9:45 ` Junio C Hamano
2007-03-18 15:54 ` Linus Torvalds
2007-03-18 15:57 ` Linus Torvalds
2007-03-18 21:38 ` Shawn O. Pearce
2007-03-18 21:48 ` Linus Torvalds
2007-03-20 3:05 ` Johannes Schindelin
2007-03-20 3:29 ` Shawn O. Pearce
2007-03-20 3:40 ` Shawn O. Pearce
2007-03-20 4:11 ` Linus Torvalds
2007-03-20 4:18 ` Shawn O. Pearce
2007-03-20 4:45 ` Linus Torvalds
2007-03-20 5:44 ` Junio C Hamano
2007-03-20 3:16 ` Junio C Hamano
2007-03-20 4:31 ` Linus Torvalds
2007-03-20 4:39 ` Shawn O. Pearce
2007-03-20 4:57 ` Linus Torvalds
2007-03-18 1:44 ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
2007-03-18 6:28 ` Avi Kivity
2007-03-17 22:44 ` Linus Torvalds
2007-03-16 16:35 ` cleaner/better zlib sources? Jeff Garzik
2007-03-16 16:42 ` Matt Mackall
2007-03-16 16:51 ` Linus Torvalds
2007-03-16 17:12 ` Nicolas Pitre
2007-03-16 23:22 ` Shawn O. Pearce
2007-03-16 17:06 ` Nicolas Pitre
2007-03-16 17:51 ` Linus Torvalds
2007-03-16 18:09 ` Nicolas Pitre
2007-03-16 1:33 ` Davide Libenzi
2007-03-16 2:06 ` Davide Libenzi
-- strict thread matches above, loose matches on Subject: below --
2007-03-16 6:08 linux
2007-03-16 11:34 ` Florian Weimer
2007-03-16 15:51 ` Josef Weidendorfer
2007-03-16 16:11 ` Linus Torvalds
2007-03-16 17:39 ` linux
2007-03-16 22:45 ` Josef Weidendorfer