Partitioned packs

* Partitioned packs
@ 2007-04-04  1:36 Chris Lee
  2007-04-04  1:16 ` David Lang
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Chris Lee @ 2007-04-04  1:36 UTC (permalink / raw)
  To: git

I've been running some experiments, as hinted earlier by the
discussion about just how much git-index-pack sucks (which, really,
isn't much since the gaping memleak is gone now).

These experiments include trying to see if there's a noticeable
performance improvement by splitting out objects of different types
into different packs. So far, it definitely seems to make a
difference, though not the one I was initially expecting. For all of
these tests, I did 'sysctl -w vm.drop_caches=3' before running, to
effectively simulate a cold-cache run.

Single 3.1GB pack file containing all commits, blobs, and trees
First run (cold cache):
git-rev-list --all > /dev/null  5.52s user 0.32s system 45% cpu 12.872 total
git-blame -- kdelibs/kdeui/kmenubar.cpp  0.00s user 0.01s system 0% cpu 40.218s total
git-archive --format=tar HEAD -- kdelibs >> /dev/null  0.48s user 0.10s system 5% cpu 10.143 total

Subsequent runs (warm cache):
git-rev-list --all > /dev/null  5.19s user 0.48s system 98% cpu 5.750 total
git-blame -- kdelibs/kdeui/kmenubar.cpp  0.00s user 0.00s system 0% cpu 11.960 total
git-archive --format=tar HEAD -- kdelibs >> /dev/null  0.43s user 0.04s system 100% cpu 0.472 total


Single pack for commit objects and another pack for the rest
First run (cold cache):
git-rev-list --all > /dev/null  5.84s user 0.34s system 31% cpu 19.427 total
git-blame -- kdelibs/kdeui/kmenubar.cpp  0.00s user 0.00s system 0% cpu 9:42.74 total
git-archive --format=tar HEAD -- kdelibs >> /dev/null  0.50s user 0.26s system 0% cpu 1:35.44 total

Subsequent runs (warm cache):
git-rev-list --all > /dev/null  5.94s user 0.26s system 99% cpu 6.204 total
git-blame -- kdelibs/kdeui/kmenubar.cpp  0.00s user 0.00s system 0% cpu 12.394 total
git-archive --format=tar HEAD -- kdelibs >> /dev/null  0.41s user 0.07s system 98% cpu 0.486 total

Fully-partitioned separate packs for commit, tree, and blob objects
First run (cold cache):
git-rev-list --all > /dev/null  6.24s user 0.32s system 25% cpu 25.689 total
git-blame -- kdelibs/kdeui/kmenubar.cpp  0.00s user 0.00s system 0% cpu 1:08.76 total
git-archive --format=tar HEAD -- kdelibs >> /dev/null  0.38s user 0.30s system 0% cpu 1:35.89 total

Subsequent runs (warm cache):
git-rev-list --all > /dev/null  6.28s user 0.24s system 99% cpu 6.527 total
git-blame -- kdelibs/kdeui/kmenubar.cpp  0.00s user 0.00s system 0% cpu 13.895 total
git-archive --format=tar HEAD -- kdelibs >> /dev/null  0.42s user 0.06s system 99% cpu 0.476 total

I packed all of these using --delta-base-offset, with a window of 100
and a depth of 10.

-clee

* Re: Partitioned packs
  2007-04-04  1:36 Partitioned packs Chris Lee
@ 2007-04-04  1:16 ` David Lang
  2007-04-04  1:58 ` Junio C Hamano
  2007-04-04  2:14 ` Linus Torvalds
  2 siblings, 0 replies; 5+ messages in thread
From: David Lang @ 2007-04-04  1:16 UTC (permalink / raw)
  To: Chris Lee; +Cc: git

On Tue, 3 Apr 2007, Chris Lee wrote:

> Date: Tue, 3 Apr 2007 18:36:44 -0700
> From: Chris Lee <clee@kde.org>
> To: git@vger.kernel.org
> Subject: Partitioned packs
> 
> I've been running some experiments, as hinted earlier by the
> discussion about just how much git-index-pack sucks (which, really,
> isn't much since the gaping memleak is gone now).
>
> These experiments include trying to see if there's a noticeable
> performance improvement by splitting out objects of different types
> into different packs. So far, it definitely seems to make a
> difference, though not the one I was initially expecting. For all of
> these tests, I did 'sysctl -w vm.drop_caches=3' before running, to
> effectively simulate a cold-cache run.

I wonder what order the packs ended up in. If git had to go through the wrong
pack completely before finding the pack that it needed, that could account
for the extra time.

Is it worth making up single packs that order the three types of object
differently within the one pack, to see what difference it makes to have
to walk past all the blobs to get to the commits and trees?

David Lang

* Re: Partitioned packs
  2007-04-04  1:36 Partitioned packs Chris Lee
  2007-04-04  1:16 ` David Lang
@ 2007-04-04  1:58 ` Junio C Hamano
  2007-04-04  2:14 ` Linus Torvalds
  2 siblings, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2007-04-04  1:58 UTC (permalink / raw)
  To: Chris Lee; +Cc: git

"Chris Lee" <clee@kde.org> writes:

> I've been running some experiments, as hinted earlier by the
> discussion about just how much git-index-pack sucks (which, really,
> isn't much since the gaping memleak is gone now).
>
> These experiments include trying to see if there's a noticeable
> performance improvement by splitting out objects of different types
> into different packs. So far, it definitely seems to make a
> difference, though not the one I was initially expecting. For all of
> these tests, I did 'sysctl -w vm.drop_caches=3' before running, to
> effectively simulate a cold-cache run.

Are you running on a 64-bit machine or 32-bit?

I wonder what the numbers would be if you partitioned into the
same number of packs, of similar sizes to those in your experiment,
but partitioned not by type but by age or other factors.

What I am getting at is that you may not be seeing the effect of
access pattern based on the type at all.  For example, the
performance can be affected by other factors, such as necessity
to use smaller number of pack_windows per pack.  use_pack()
iterates through the currently active windows on a linked list
per pack, and a window is 32MB on 32-bit machines, so you would
literally need hundreds of them to access that 3GB pack (the
total is limited to 256MB so 8 windows are recycled).  It is
possible that simply using more packs and knowing which pack you
need to access upfront may be cutting down the cost of finding
the pack window to use.  A single pack would have a linked list
of 8 active windows, while two packs would have one linked list
each, so the average linear search cost would be half.
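
To put numbers on the window argument above, here is a small Python
sketch. The 32MB window size, the 256MB total limit, and the 3.1GB pack
size come from this thread; the even 4/4 split of windows between two
packs and the uniform hit position are assumptions made only for
illustration, and the function name is hypothetical.

```python
import math

MiB = 1024 * 1024
WINDOW_SIZE = 32 * MiB               # window size on 32-bit, per the message
MAPPED_LIMIT = 256 * MiB             # total mapped limit, per the message
PACK_SIZE = int(3.1 * 1024 * MiB)    # the 3.1GB pack from the original post

# Windows needed to cover the whole pack, and how many can be live at once.
windows_to_cover_pack = math.ceil(PACK_SIZE / WINDOW_SIZE)
max_active_windows = MAPPED_LIMIT // WINDOW_SIZE


def average_scan_length(n):
    """Mean nodes visited in a linked list of n windows, assuming the
    wanted window is equally likely to sit at any position."""
    return (n + 1) / 2


one_pack = average_scan_length(max_active_windows)         # one list of 8
two_packs = average_scan_length(max_active_windows // 2)   # assumed 4 + 4 split
```

Under these assumptions the pack needs about 100 windows to cover it,
only 8 can be live at once, and splitting the list in two cuts the mean
scan from 4.5 to 2.5 nodes.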

* Re: Partitioned packs
  2007-04-04  1:36 Partitioned packs Chris Lee
  2007-04-04  1:16 ` David Lang
  2007-04-04  1:58 ` Junio C Hamano
@ 2007-04-04  2:14 ` Linus Torvalds
  2007-04-04  2:52   ` Linus Torvalds
  2 siblings, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2007-04-04  2:14 UTC (permalink / raw)
  To: Chris Lee; +Cc: git

On Tue, 3 Apr 2007, Chris Lee wrote:
> 
> These experiments include trying to see if there's a noticeable
> performance improvement by splitting out objects of different types
> into different packs. So far, it definitely seems to make a
> difference, though not the one I was initially expecting. For all of
> these tests, I did 'sysctl -w vm.drop_caches=3' before running, to
> effectively simulate a cold-cache run.

Ok, the wordwrap makes it a bit hard to read, but it looks like the 
single-pack always wins. Sometimes by a huge amount.

The reason is simple: not only are single packs well sorted anyway (so if 
you only look at commits, it will only look at the head of the pack 
anyway), but a single pack is much faster to look things up in: you can do 
a single binary lookup.

If you have multiple packs, you *may* be able to do a single binary 
lookup, but quite often you'll do one *failing* binary lookup, and then go 
on to the next pack - in other words, you'll do a linear search over a set 
of binary lookups.

So trying to partition things doesn't help (because the objects are 
already well sorted), and it does hurt.
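
A toy model of the lookup cost described above, in Python. It is not
git's implementation: each pack index is modelled as a sorted list of
object names, and find_object is a hypothetical helper that tries the
packs in list order and reports how many binary searches it ran.

```python
from bisect import bisect_left


def find_in_pack(sorted_ids, oid):
    """One binary search in a single pack's sorted object list."""
    i = bisect_left(sorted_ids, oid)
    return i < len(sorted_ids) and sorted_ids[i] == oid


def find_object(packs, oid):
    """Try each pack in order.

    Returns (pack_position, searches_run); pack_position is None
    when no pack contains the object.
    """
    searches = 0
    for position, sorted_ids in enumerate(packs):
        searches += 1
        if find_in_pack(sorted_ids, oid):
            return position, searches
    return None, searches


# Made-up object names, used only to exercise the model.
single_pack = [["a1", "b2", "c3"]]
split_packs = [["a1"], ["b2"], ["c3"]]

hit_single = find_object(single_pack, "c3")
hit_split = find_object(split_packs, "c3")
```

With one pack every lookup is a single binary search. With three packs,
an object that lives in the last one costs three searches, two of them
failing.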

That said, for most operations it's probably in the noise. Something bad 
happened for your "git-blame" thing for the "commits" and "everything 
else" case. Perhaps just unlucky ordering of packs.

		Linus

* Re: Partitioned packs
  2007-04-04  2:14 ` Linus Torvalds
@ 2007-04-04  2:52   ` Linus Torvalds
  0 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2007-04-04  2:52 UTC (permalink / raw)
  To: Chris Lee; +Cc: git

On Tue, 3 Apr 2007, Linus Torvalds wrote:
> 
> So trying to partition things doesn't help (because the objects are 
> already well sorted), and it does hurt.

Side note: I think that there *are* cases where partitioned packs can do 
better, but I think that in order to do better you should

 - partition by "recency", ie put objects that are not reachable from any 
   recent point in older packs.

 - make sure that the "packed_git" list is always sorted so that the older 
   data packs are at the end.

and that should actually speed up many loads, just because the recent 
objects are all in one pack, and because it's smaller, that pack can be 
looked up a bit faster.
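
As a rough Python sketch of the ordering idea above: the Pack record and
its newest_object_time field are hypothetical stand-ins, not git data
structures, and the timestamps are made up. The only point is that the
list is searched newest pack first, with older packs at the end.

```python
from dataclasses import dataclass


@dataclass
class Pack:
    name: str
    newest_object_time: int  # hypothetical: time of the newest object inside


def order_for_lookup(packs):
    """Return packs sorted so the most recent data is searched first."""
    return sorted(packs, key=lambda p: p.newest_object_time, reverse=True)


packs = [
    Pack("pack-old", 1_100_000_000),
    Pack("pack-recent", 1_175_000_000),
    Pack("pack-middle", 1_150_000_000),
]
search_order = [p.name for p in order_for_lookup(packs)]
```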

On the other hand, the power of a log(n) function like a binary search is 
that lookup in a big pack that is four times the size of four smaller 
packs is really not all that much more expensive, so the advantage is 
probably pretty small.
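
The log(n) point can be checked with a little arithmetic. The Python
sketch below counts worst-case comparisons for a plain binary search
over n sorted entries; the object counts are made up, and real pack
index lookups may need fewer probes than this simple bound.

```python
def worst_case_probes(n):
    """Worst-case comparisons for binary search over n sorted entries:
    floor(log2(n)) + 1 for n > 0, which is exactly n.bit_length()."""
    return n.bit_length()


SMALL = 1_000_000        # hypothetical object count per small pack
BIG = 4 * SMALL          # one pack holding everything

big_pack = worst_case_probes(BIG)
small_pack = worst_case_probes(SMALL)

# Four small packs searched in order: best case the first pack hits,
# worst case the first three searches fail and the fourth hits.
four_packs_best = small_pack
four_packs_worst = 4 * small_pack
```

For these counts the big pack needs 22 comparisons against 20 for one
small pack, so quadrupling the size adds two steps. Four small packs
searched in order cost anywhere from 20 comparisons (hit in the first
pack) to 80 (hit in the last).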

And for things that need old objects (and "git blame" does obviously very 
much tend to fall into that category), any partitioning is likely to be 
bad.

So I think partitioning is valid, but my suspicion is that you'd want to 
partition for *other* reasons than highest performance. Better reasons to 
have multiple packs:

 - just because you haven't repacked ;)
 - to keep "git repack" times down by marking old big packs as "keep" once 
   they get big enough (the space advantage of packing eventually flattens 
   out, so there's no real overwhelming reason to repack old stuff if you 
   have "enough")
 - filesystem and pack-file limitations (ie the 2**31 limit)

but I doubt performance is ever going to be a really compelling one.

You can obviously always optimize for some very *particular* load by 
packing optimally for just that one (keep exactly the objects you need in 
one particular pack, don't even touch any other packs), but I don't think 
any load is *so* special that you shouldn't think of other loads.

			Linus
