* [PATCH] change the unpack limit threshold to a saner value
@ 2006-12-06 21:08 Nicolas Pitre
2006-12-06 22:24 ` [PATCH] change the unpack limit threshold " Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Nicolas Pitre @ 2006-12-06 21:08 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Let's assume the average object size is x. Given n objects, the needed
storage size is n*(x + b), where b is the average wasted block size on
disk.
If those objects are packed, the needed storage becomes n*x + b, so we
save on the block overhead. But there is also the pack index, which
takes 1024 + n*24 + b.
Trying to find the value of n where packed objects become an advantage,
we have:
n*x + n*b > n*x + n*24 + 2*b + 1024
n*b - 2*b > n*24 + 1024
(n - 2)*b > n*24 + 1024
So given this, we need at least 3 objects for the loose objects to use
more space than a pack, and only if b is greater than 1096. Since a
common block size is 4096, the value of b is likely to converge towards
2048. 3 objects is also the point where more directory entries are used
than the constant of 2 for a pack (assuming that all objects would end
up with the same first 2 hex digits of hash, which is overly
optimistic). So 3 should be the optimal number of objects for not
exploding a pack. And of course larger packs are likely to take even
less space due to delta compression kicking in.
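[Editorial illustration: a minimal standalone C sketch of the break-even
computation above. It is not part of the patch; it compares only the
overhead terms, since the n*x payload appears on both sides, and assumes
b = 2048 as argued above.]

#include <stdio.h>

/*
 * Compare the storage overhead of n loose objects against one pack:
 *   loose:  n*(x + b)                      -> overhead n*b
 *   packed: (n*x + b) + (1024 + n*24 + b)  -> overhead 2*b + 1024 + n*24
 * The n*x payload cancels out, so only the overheads are compared.
 */
int main(void)
{
    const long b = 2048;  /* average wasted block size, 4096-byte blocks */
    long n;

    for (n = 1; n <= 5; n++) {
        long loose = n * b;
        long packed = 2 * b + 1024 + n * 24;
        printf("n=%ld: loose overhead %ld, pack overhead %ld -> %s\n",
               n, loose, packed,
               loose > packed ? "pack smaller" : "loose smaller");
    }
    return 0;
}

(With b = 2048 this prints "pack smaller" starting at n = 3, matching
the derivation.)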
This is why I think the current default threshold should be 3 instead of
the insane value of 5000. But since it feels a bit odd to go from 5000
to 3, I settled on 10.
Signed-off-by: Nicolas Pitre <nico@cam.org>
---
Sidenote: I think it is OK for pushes to _not_ use thin packs. When
thin packs are not exploded, they must be completed by adding the
missing base objects, which creates duplicates and uses more disk space.
The penalty for not using thin packs is a slight increase in bandwidth
for push operations, but since pushes are normally much less frequent
than fetches, it seems OK to penalize the push a bit for better disk
usage on servers.
diff --git a/receive-pack.c b/receive-pack.c
index d62ed5b..9140312 100644
--- a/receive-pack.c
+++ b/receive-pack.c
@@ -11,7 +11,7 @@
static const char receive_pack_usage[] = "git-receive-pack <git-dir>";
static int deny_non_fast_forwards = 0;
-static int unpack_limit = 5000;
+static int unpack_limit = 10;
static int report_status;
* Re: [PATCH] change the unpack limit threshold to a saner value
2006-12-06 21:08 [PATCH] change the unpack limit threshold to a saner value Nicolas Pitre
@ 2006-12-06 22:24 ` Junio C Hamano
2006-12-07 0:19 ` Nicolas Pitre
2006-12-07 1:08 ` [PATCH] change the unpack limit threshold " Linus Torvalds
2006-12-07 4:01 ` [PATCH take 2] " Nicolas Pitre
2 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2006-12-06 22:24 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git
Nicolas Pitre <nico@cam.org> writes:
> Let's assume the average object size is x. Given n objects, the needed
> storage size is n*(x + b), where b is the average wasted block size on
> disk.
> ...
> This is why I think the current default threshold should be 3 instead of
> the insane value of 5000. But since it feels a bit odd to go from 5000
> to 3, I settled on 10.
I see you are optimizing for disk footprint, and this will
result in tons of tiny packs left between "repack -a".
I have not benchmarked it yet, but the runtime pack handling code
was written assuming we have only a handful of big packs; I
suspect this change would affect runtime performance quite badly.
* Re: [PATCH] change the unpack limit threshold to a saner value
2006-12-06 22:24 ` [PATCH] change the unpack limit threshold " Junio C Hamano
@ 2006-12-07 0:19 ` Nicolas Pitre
0 siblings, 0 replies; 8+ messages in thread
From: Nicolas Pitre @ 2006-12-07 0:19 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, 6 Dec 2006, Junio C Hamano wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > Let's assume the average object size is x. Given n objects, the needed
> > storage size is n*(x + b), where b is the average wasted block size on
> > disk.
> > ...
> > This is why I think the current default threshold should be 3 instead of
> > the insane value of 5000. But since it feels a bit odd to go from 5000
> > to 3, I settled on 10.
>
> I see you are optimizing for disk footprint, and this will
> result in tons of tiny packs left between "repack -a".
Depends how often i.e. how big pushes are, and how often you repack.
> I have not benched it yet, but the runtime pack handling code
> was written assuming we have only a handful of big packs; I
> suspect this change would affect the performance at runtime in
> quite a bad way.
Possibly.
Still, a threshold of 5000 is way too large IMHO.
* Re: [PATCH] change the unpack limit threshold to a saner value
2006-12-06 21:08 [PATCH] change the unpack limit threshold to a saner value Nicolas Pitre
2006-12-06 22:24 ` [PATCH] change the unpack limit threshold " Junio C Hamano
@ 2006-12-07 1:08 ` [PATCH] change the unpack limit threshold " Linus Torvalds
2006-12-07 3:24 ` Nicolas Pitre
2006-12-07 4:01 ` [PATCH take 2] " Nicolas Pitre
2 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2006-12-07 1:08 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
On Wed, 6 Dec 2006, Nicolas Pitre wrote:
>
> This is why I think the current default threshold should be 3 instead of
> the insane value of 5000. But since it feels a bit odd to go from 5000
> to 3, I settled on 10.
Definitely not.
We have a much easier time handling many loose objects than many
pack-files. For many reasons, but two really obvious ones:
- pack-file indexes get read in on startup, and we maintain an explicit
list of them. Having lots of pack-files adds overhead that doesn't
exist for lots of loose objects.
- loose files are spread out over 256 subdirectories to make lookup
easier; packfiles are not (and always create an index file too).
So in general, as a trivial heuristic, you probably want about 512 times
as many loose objects as you want pack-files, if only because of the
latter issue, because you can much more easily handle lots of loose
objects than lots of pack-files. So it's _not_ a factor of 3. Or even 10.
But since there _is_ reason to do pack-files too, and since using too big
a value means that you never end up keeping a pack-file _at_all_ if you
pull often, I'd suggest that rather than use a limit of 512 you go for
something like 100-200 objects as the threshold (of course, the proper one
would depend on the distribution of the size of your pack-files, but I'll
just hand-wave and say that together with occasional re-packing, something
in that range is _generally_ going to be a good idea).
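[Editorial illustration: a small C sketch of the directory-entry model
behind the 512x heuristic. This is an assumption-laden model, not git
code: loose objects are taken to hash evenly into 256 fan-out
directories, and every kept pack adds 2 files.]

#include <stdio.h>

/*
 * Model the directory-entry cost: n loose objects spread over 256
 * fan-out directories add roughly n/256 entries per directory, while
 * each pack kept adds 2 entries (.pack + .idx) to objects/pack/.
 * The two costs meet at n/256 == 2, i.e. n == 512 objects per pack.
 */
int main(void)
{
    int n;

    for (n = 128; n <= 1024; n *= 2)
        printf("%5d loose objects: ~%.1f entries per fan-out dir "
               "(one pack: 2 entries)\n", n, n / 256.0);
    return 0;
}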
* Re: [PATCH] change the unpack limit threshold to a saner value
2006-12-07 1:08 ` [PATCH] change the unpack limit threshold " Linus Torvalds
@ 2006-12-07 3:24 ` Nicolas Pitre
2006-12-07 3:39 ` Linus Torvalds
0 siblings, 1 reply; 8+ messages in thread
From: Nicolas Pitre @ 2006-12-07 3:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, git
On Wed, 6 Dec 2006, Linus Torvalds wrote:
> We have a much easier time handling many loose objects than many
> pack-files. For many reasons, but two really obvious ones:
>
> - pack-file indexes get read in on startup, and we maintain an explicit
> list of them. Having lots of pack-files adds overhead that doesn't
> exist for lots of loose objects.
>
> - loose files are spread out over 256 subdirectories to make lookup
> easier; packfiles are not (and always create an index file too).
>
> So in general, as a trivial heuristic, you probably want about 512 times
> as many loose objects as you want pack-files, if only because of the
> latter issue, because you can much more easily handle lots of loose
> objects than lots of pack-files. So it's _not_ a factor of 3. Or even 10.
>
> But since there _is_ reason to do pack-files too, and since using too big
> a value means that you never end up keeping a pack-file _at_all_ if you
> pull often, I'd suggest that rather than use a limit of 512 you go for
> something like 100-200 objects as the threshold (of course, the proper one
> would depend on the distribution of the size of your pack-files, but I'll
> just hand-wave and say that together with occasional re-packing, something
> in that range is _generally_ going to be a good idea).
Note that this setting is currently observed for pushes, not pulls.
On the pull side you currently need to pass -k to avoid exploding
packs.
So the question is: how many objects does the average push have? If
most pushes are below the threshold, this is not going to be really
useful.
And I think 5000 is definitely way too high. 10 might be too small
indeed. 100 may be a good default to try out.
* Re: [PATCH] change the unpack limit threshold to a saner value
2006-12-07 3:24 ` Nicolas Pitre
@ 2006-12-07 3:39 ` Linus Torvalds
0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2006-12-07 3:39 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
On Wed, 6 Dec 2006, Nicolas Pitre wrote:
>
> Note that this setting is currently observed for pushes, not pulls.
> On the pull side you currently need to pass -k to avoid exploding
> packs.
So noted.
> So the question is: how many objects does the average push have? If
> most pushes are below the threshold, this is not going to be really
> useful.
It will depend a lot on the project, and "where" in the project you are.
For example, for most end developers, the "push" is likely going to be a
few commits (say, a day's work). Probably on the order of a few tens to
maybe a few hundred objects. It's actually hard to create a pack with fewer
than ten objects if you have a few directory levels (a single small commit
in the kernel is usually 5-7 objects: commit + 2-3 levels of directory + a
couple of blobs).
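[Editorial illustration: a worked version of that object count. A
sketch, not git code; it assumes one tree object per directory level
touched, matching the kernel example above.]

#include <stdio.h>

/* Objects created by one commit: the commit itself, one tree per
 * directory level touched, and one blob per modified file. */
static int objects_per_commit(int tree_levels, int blobs)
{
    return 1 + tree_levels + blobs;
}

int main(void)
{
    /* a small kernel commit: 3 directory levels, 2 modified files */
    printf("%d objects\n", objects_per_commit(3, 2));  /* prints 6 */
    return 0;
}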
For me, as I pull a big merge and push it out, a push can easily be in the
thousands of objects, just because I merged other people's combined work
over several weeks.
And for a "mirror" server, it will depend on the granularity of the
mirroring.
> And I think 5000 is definitely way too high. 10 might be too small
> indeed. 100 may be a good default to try out.
I think 100 is a nice round number for humans. Worth trying.
* [PATCH take 2] change the unpack limit threshold to a saner value
2006-12-06 21:08 [PATCH] change the unpack limit threshold to a saner value Nicolas Pitre
2006-12-06 22:24 ` [PATCH] change the unpack limit threshold " Junio C Hamano
2006-12-07 1:08 ` [PATCH] change the unpack limit threshold " Linus Torvalds
@ 2006-12-07 4:01 ` Nicolas Pitre
2006-12-07 7:59 ` Shawn Pearce
2 siblings, 1 reply; 8+ messages in thread
From: Nicolas Pitre @ 2006-12-07 4:01 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Currently the threshold is 5000. The likelihood of this value ever being
crossed by a single push is really small, making it not really useful.
The optimal threshold for pure space saving on a filesystem with 4KB
blocks is 3. However, this is likely to create many small packs,
concentrating a large number of files in a single directory, whereas the
same objects are spread over 256 directories when loose. This means we
would need 512 objects per pack on average to approximate the same
directory cost (a pack has 2 files because of the index).
But 512 is a really high value, just like 5000, since most pushes are
unlikely to have that many objects. So let's try a value of 100, which
should strike a good balance between small pushes being exploded into
loose objects and large pushes kept as whole packs.
This is not a replacement for periodic repacks, of course.
Signed-off-by: Nicolas Pitre <nico@cam.org>
---
diff --git a/receive-pack.c b/receive-pack.c
index d62ed5b..9140312 100644
--- a/receive-pack.c
+++ b/receive-pack.c
@@ -11,7 +11,7 @@
static const char receive_pack_usage[] = "git-receive-pack <git-dir>";
static int deny_non_fast_forwards = 0;
-static int unpack_limit = 5000;
+static int unpack_limit = 100;
static int report_status;
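[Editorial illustration: a simplified C sketch of the decision this
limit controls. This is an illustration of the mechanism only, not the
verbatim receive-pack code; in the real implementation the object count
comes from the incoming pack's header.]

#include <stdio.h>

static int unpack_limit = 100;

/* Decide how an incoming pack with nr_objects objects is stored:
 * below the limit, explode it into loose objects; at or above it,
 * keep the pack whole and generate an index for it. */
static const char *choose_storage(int nr_objects)
{
    if (nr_objects < unpack_limit)
        return "git-unpack-objects";  /* small push: loose objects */
    return "git-index-pack";          /* large push: keep the pack */
}

int main(void)
{
    printf("push of 7 objects    -> %s\n", choose_storage(7));
    printf("push of 5000 objects -> %s\n", choose_storage(5000));
    return 0;
}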
* Re: [PATCH take 2] change the unpack limit threshold to a saner value
2006-12-07 4:01 ` [PATCH take 2] " Nicolas Pitre
@ 2006-12-07 7:59 ` Shawn Pearce
0 siblings, 0 replies; 8+ messages in thread
From: Shawn Pearce @ 2006-12-07 7:59 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
Nicolas Pitre <nico@cam.org> wrote:
> Currently the threshold is 5000. The likelihood of this value ever being
> crossed by a single push is really small, making it not really useful.
I started that out at 5000 because it was a reasonably safe
threshold. Users who didn't explicitly lower this value didn't
enable the feature. But it was still useful on initial pushes into
a brand new repository when the project was very large. For example,
pushing git.git into a bare repository *sucked* before this change.
Yea, 5000 is probably too high. Nice to see it drop.
--