git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] change the unpack limit treshold to a saner value
@ 2006-12-06 21:08 Nicolas Pitre
  2006-12-06 22:24 ` [PATCH] change the unpack limit threshold " Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Nicolas Pitre @ 2006-12-06 21:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git


Let's assume the average object size is x. Given n objects, the needed 
storage size is n*(x + b), where b is the average wasted block size on 
disk.

If those objects are packed, the needed storage becomes n*x + b so we 
save on the block overhead.  But there is the pack index which is 
1024 + n*24 + b.

Trying to find the value of n where packed objects become an advantage, 
we have:

	n*x + n*b > n*x + n*24 + 2*b + 1024

	n*b - 2*b > n*24 + 1024

	(n - 2)*b > n*24 + 1024

So given this we need at least 3 objects for the whole to use more space 
than a pack, and only if b is greater than 1096.  Since a common block 
size is 4096 then the value of b is likely to converge towards 2048. 3 
objects is also where more directory entries are used over an constant 
of 2 for a pack (assuming that both objects would end up with the same 
first 2 bytes of hash which is overly optimistic).  So 3 should be the 
optimal number of objects for not exploding a pack.  And of course 
larger packs are likely to take even less space due to delta compression 
kicking in.

This is why I think the current default treshold should be 3 instead of 
the insane value of 5000.  But since it feels a bit odd to go from 5000 
to 3 I setled on 10.

Signed-off-by: Nicolas Pitre <nico@cam.org>

---

Sidenote: I think it is OK for pushes to _not_ use thin packs.  When 
not exploding thin packs, they must be completed by adding objects 
creating duplicates and using more disk space.  The penalty for not 
using thin packs is a slight increase in bandwidth for push operations, 
but since pushes are normally much less frequent than fetches it seems 
OK to penalize the push a bit for a better disk usage on servers.

diff --git a/receive-pack.c b/receive-pack.c
index d62ed5b..9140312 100644
--- a/receive-pack.c
+++ b/receive-pack.c
@@ -11,7 +11,7 @@
 static const char receive_pack_usage[] = "git-receive-pack <git-dir>";
 
 static int deny_non_fast_forwards = 0;
-static int unpack_limit = 5000;
+static int unpack_limit = 10;
 static int report_status;
 

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] change the unpack limit threshold to a saner value
  2006-12-06 21:08 [PATCH] change the unpack limit treshold to a saner value Nicolas Pitre
@ 2006-12-06 22:24 ` Junio C Hamano
  2006-12-07  0:19   ` Nicolas Pitre
  2006-12-07  1:08 ` [PATCH] change the unpack limit treshold " Linus Torvalds
  2006-12-07  4:01 ` [PATCH take 2] " Nicolas Pitre
  2 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2006-12-06 22:24 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@cam.org> writes:

> Let's assume the average object size is x. Given n objects, the needed 
> storage size is n*(x + b), where b is the average wasted block size on 
> disk.
> ...
> This is why I think the current default treshold should be 3 instead of 
> the insane value of 5000.  But since it feels a bit odd to go from 5000 
> to 3 I setled on 10.

I see you are optimizing for disk footprint, and this will
result in tons of tiny packs left between "repack -a".

I have not benched it yet, but the runtime pack handling code
was written assuming we have only a handful of big packs; I
suspect this change would affect the performance at runtime in
quite a bad way.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] change the unpack limit threshold to a saner value
  2006-12-06 22:24 ` [PATCH] change the unpack limit threshold " Junio C Hamano
@ 2006-12-07  0:19   ` Nicolas Pitre
  0 siblings, 0 replies; 8+ messages in thread
From: Nicolas Pitre @ 2006-12-07  0:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, 6 Dec 2006, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > Let's assume the average object size is x. Given n objects, the needed 
> > storage size is n*(x + b), where b is the average wasted block size on 
> > disk.
> > ...
> > This is why I think the current default treshold should be 3 instead of 
> > the insane value of 5000.  But since it feels a bit odd to go from 5000 
> > to 3 I setled on 10.
> 
> I see you are optimizing for disk footprint, and this will
> result in tons of tiny packs left between "repack -a".

Depends how often i.e. how big pushes are, and how often you repack.

> I have not benched it yet, but the runtime pack handling code
> was written assuming we have only a handful of big packs; I
> suspect this change would affect the performance at runtime in
> quite a bad way.

Possibly.

Still a treshold of 5000 is way too large IMHO.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] change the unpack limit treshold to a saner value
  2006-12-06 21:08 [PATCH] change the unpack limit treshold to a saner value Nicolas Pitre
  2006-12-06 22:24 ` [PATCH] change the unpack limit threshold " Junio C Hamano
@ 2006-12-07  1:08 ` Linus Torvalds
  2006-12-07  3:24   ` Nicolas Pitre
  2006-12-07  4:01 ` [PATCH take 2] " Nicolas Pitre
  2 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2006-12-07  1:08 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git



On Wed, 6 Dec 2006, Nicolas Pitre wrote:
> 
> This is why I think the current default treshold should be 3 instead of 
> the insane value of 5000.  But since it feels a bit odd to go from 5000 
> to 3 I setled on 10.

Definitely not.

We have a much easier time handling many loose packed objects than many 
pack-files. For many reasons, but two really obvious ones:

 - pack-file indexes get read in on startup, and we maintain an explicit 
   list of them. Having lots of pack-files adds overhead that doesn't 
   exist for lots of loose objects.

 - loose files are spread out over 256 subdirectories to make lookup 
   easier, packfiles are not (and always create an index file too).

So in general, as a trivial heuristic, you probably want about 512 times 
as many loose objects as you want pack-files, i fonly because of the 
latter issue, because you can much more easily handle lots of loose 
objects than lots of pack-files. So it's _not_ a factor of 3. Or even 10.

But since there _is_ reason to do pack-files too, and since using too big 
a value means that you never end up keeping a pack-file _at_all_ if you 
pull often, I'd suggest that rather than use a limit of 512 you go for 
something like 100-200 objects as the threshold (of course, the proper one 
would depend on the distribution of the size of your pack-files, but I'll 
just hand-wave and say that together with occasional re-packing, something 
in that range is _generally_ going to be a good idea).


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] change the unpack limit treshold to a saner value
  2006-12-07  1:08 ` [PATCH] change the unpack limit treshold " Linus Torvalds
@ 2006-12-07  3:24   ` Nicolas Pitre
  2006-12-07  3:39     ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Nicolas Pitre @ 2006-12-07  3:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> We have a much easier time handling many loose packed objects than many 
> pack-files. For many reasons, but two really obvious ones:
> 
>  - pack-file indexes get read in on startup, and we maintain an explicit 
>    list of them. Having lots of pack-files adds overhead that doesn't 
>    exist for lots of loose objects.
> 
>  - loose files are spread out over 256 subdirectories to make lookup 
>    easier, packfiles are not (and always create an index file too).
> 
> So in general, as a trivial heuristic, you probably want about 512 times 
> as many loose objects as you want pack-files, i fonly because of the 
> latter issue, because you can much more easily handle lots of loose 
> objects than lots of pack-files. So it's _not_ a factor of 3. Or even 10.
> 
> But since there _is_ reason to do pack-files too, and since using too big 
> a value means that you never end up keeping a pack-file _at_all_ if you 
> pull often, I'd suggest that rather than use a limit of 512 you go for 
> something like 100-200 objects as the threshold (of course, the proper one 
> would depend on the distribution of the size of your pack-files, but I'll 
> just hand-wave and say that together with occasional re-packing, something 
> in that range is _generally_ going to be a good idea).

Note that this setting is currently observed for pushes not pulls.
On the pull side you currentli need to provide -k for not exploding 
packs.

So the question is what number of objects on average do pushes have?  If 
most pushes are below the treshold this is not going to be really 
useful.

And I think 5000 is definitely way too high.  10 might be too small 
indeed.  100 is maybe a good default to try out.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] change the unpack limit treshold to a saner value
  2006-12-07  3:24   ` Nicolas Pitre
@ 2006-12-07  3:39     ` Linus Torvalds
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2006-12-07  3:39 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git



On Wed, 6 Dec 2006, Nicolas Pitre wrote:
> 
> Note that this setting is currently observed for pushes not pulls.
> On the pull side you currentli need to provide -k for not exploding 
> packs.

So noted.

> So the question is what number of objects on average do pushes have?  If 
> most pushes are below the treshold this is not going to be really 
> useful.

It will depend a lot on the project, and "where" in the project you are.

For example, for most end developers, the "push" is likely going to be a 
few commits (say, a days work). Probably on the order of a few tens to 
maybe a few hundred objects. It's actually hard to create a pack with less 
than ten objects if you have a few directory levels (a single small commit 
in the kernel is usually 5-7 objects: commit + 2-3 levels of directory + a 
couple of blobs).

For me, as I pull a big merge and push it out, a push can easily be in the 
thousands of objects, just because I merged other peoples combined work 
over several weeks.

And for a "mirror" server, it will depend on the granularity of the 
mirroring.

> And I think 5000 is definitely way too high.  10 might be too small 
> indeed.  100 is maybe a good default to try out.

I think 100 is a nice round number for humans. Worth trying.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH take 2] change the unpack limit treshold to a saner value
  2006-12-06 21:08 [PATCH] change the unpack limit treshold to a saner value Nicolas Pitre
  2006-12-06 22:24 ` [PATCH] change the unpack limit threshold " Junio C Hamano
  2006-12-07  1:08 ` [PATCH] change the unpack limit treshold " Linus Torvalds
@ 2006-12-07  4:01 ` Nicolas Pitre
  2006-12-07  7:59   ` Shawn Pearce
  2 siblings, 1 reply; 8+ messages in thread
From: Nicolas Pitre @ 2006-12-07  4:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Currently the treshold is 5000.  The likelihood of this value to ever be 
crossed for a single push is really small making it not really useful.

The optimal treshold for a pure space saving on a filesystem with 4kb 
blocks is 3.  However this is likely to create many small packs 
concentrating a large number of files in a single directory compared to 
the same objects which are spread over 256 directories when loose.  This 
means we would need 512 objects per pack on average to approximagte the 
same directory cost (a pack has 2 files because of the index).

But 512 is a really high value just like 5000 since most pushes are 
unlikely to have that many objects.  So let's try with a value of 100 
which should have a good balance between small pushes going to be 
exploded into loose objects and large pushes kept as whole packs.

This is not a replacement for periodic repacks of course.

Signed-off-by: Nicolas Pitre <nico@cam.org>

---

diff --git a/receive-pack.c b/receive-pack.c
index d62ed5b..9140312 100644
--- a/receive-pack.c
+++ b/receive-pack.c
@@ -11,7 +11,7 @@
 static const char receive_pack_usage[] = "git-receive-pack <git-dir>";
 
 static int deny_non_fast_forwards = 0;
-static int unpack_limit = 5000;
+static int unpack_limit = 100;
 static int report_status;
 

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH take 2] change the unpack limit treshold to a saner value
  2006-12-07  4:01 ` [PATCH take 2] " Nicolas Pitre
@ 2006-12-07  7:59   ` Shawn Pearce
  0 siblings, 0 replies; 8+ messages in thread
From: Shawn Pearce @ 2006-12-07  7:59 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

Nicolas Pitre <nico@cam.org> wrote:
> Currently the treshold is 5000.  The likelihood of this value to ever be 
> crossed for a single push is really small making it not really useful.

I started that out at 5000 because it was a reasonably safe
threshold.  Users who didn't explicitly lower this value didn't
enable the feature.  But it was still useful on initial pushes into
a brand new repository when the project was very large.  For example
pushing git.git into a bare repository *sucked* before this change.

Yea, 5000 is probably too high.  Nice to see it drop.

-- 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2006-12-07  8:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-12-06 21:08 [PATCH] change the unpack limit treshold to a saner value Nicolas Pitre
2006-12-06 22:24 ` [PATCH] change the unpack limit threshold " Junio C Hamano
2006-12-07  0:19   ` Nicolas Pitre
2006-12-07  1:08 ` [PATCH] change the unpack limit treshold " Linus Torvalds
2006-12-07  3:24   ` Nicolas Pitre
2006-12-07  3:39     ` Linus Torvalds
2006-12-07  4:01 ` [PATCH take 2] " Nicolas Pitre
2006-12-07  7:59   ` Shawn Pearce

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).