All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sha1_file.c:rearrange_packed_git() should consider packs' object sizes
@ 2007-05-24 22:20 Dana How
  0 siblings, 0 replies; only message in thread
From: Dana How @ 2007-05-24 22:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, danahow


Shawn O. Pearce wrote:
> We might be able to fix this by altering the sort_pack function
> in sha1_file.c to not only order by mtime, but also by the ratio
> of the size of the .pack to the number of objects stored in it.
> Any packfile with a high size/object ratio is likely to be what
> Dana has been calling a "metadata" pack, holding things like tags,
> commits, trees and small blobs.  Its these packfiles that we want
> to search first, as they are the most likely to be accessed.
>
> By pushing the megablob packs to the end of our packed_git search
> list we won't tend to scan their indexes, as most of our objects
> will be found earlier in the search list.  Hence we will generally
> avoid any costs associated with their indexes.

So change the sort keys in rearrange_packed_git()/sort_pack() to
  local then alternate,
  increasing "deviation",  <== NEW
  decreasing mtime (new then old)

Each packfile has a "rank",  which is the log10 of its average
stored object size.  "Deviation" is the number of standard deviations
this number exceeds its mean,  rounded down and clipped at zero.
Deviation should be 0 in normal use,  and positive for packfiles
with significant populations of huge blobs.

This definition of deviation is intended to override mtime
only when sufficiently significant.  Putting mtime before deviation
in the sort key list would cause deviation to be irrelevant.

Signed-off-by: Dana L. How <danahow@gmail.com>
---
 Makefile    |    2 +-
 cache.h     |    1 +
 sha1_file.c |   26 +++++++++++++++++++++++++-
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 29243c6..45f0a52 100644
--- a/Makefile
+++ b/Makefile
@@ -383,7 +383,7 @@ BUILTIN_OBJS = \
 	builtin-pack-refs.o
 
 GITLIBS = $(LIB_FILE) $(XDIFF_LIB)
-EXTLIBS = -lz
+EXTLIBS = -lz -lm
 
 #
 # Platform specific tweaks
diff --git a/cache.h b/cache.h
index cd875bc..630cd89 100644
--- a/cache.h
+++ b/cache.h
@@ -437,6 +437,7 @@ extern struct packed_git {
 	uint32_t num_objects;
 	int index_version;
 	time_t mtime;
+	int deviation;
 	int pack_fd;
 	int pack_local;
 	unsigned char sha1[20];
diff --git a/sha1_file.c b/sha1_file.c
index 12d2ef2..1565d91 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -6,6 +6,7 @@
  * This handles basic git sha1 object files - packing, unpacking,
  * creation etc.
  */
+#include <math.h>
 #include "cache.h"
 #include "delta.h"
 #include "pack.h"
@@ -869,6 +870,14 @@ static int sort_pack(const void *a_, const void *b_)
 		return -st;
 
 	/*
+	 * Packs with large "deviation" should be searched last.
+	 * Such packs have significantly larger blobs.
+	 */
+	st = a->deviation - b->deviation;
+	if (st)
+		return st;
+
+	/*
 	 * Younger packs tend to contain more recent objects,
 	 * and more recent objects tend to get accessed more
 	 * often.
@@ -884,6 +893,7 @@ static void rearrange_packed_git(void)
 {
 	struct packed_git **ary, *p;
 	int i, n;
+	float sum1 = 0, sum2 = 0, mean, sigma;
 
 	for (n = 0, p = packed_git; p; p = p->next)
 		n++;
@@ -892,8 +902,22 @@ static void rearrange_packed_git(void)
 
 	/* prepare an array of packed_git for easier sorting */
 	ary = xcalloc(n, sizeof(struct packed_git *));
-	for (n = 0, p = packed_git; p; p = p->next)
+	for (n = 0, p = packed_git; p; p = p->next) {
+		/* this is the log10 of the average object size (almost) */
+		float rank = log10((p->pack_size + 1.0) / (p->num_objects + 1.0));
+		sum1 += rank;
+		sum2 += rank * rank;
 		ary[n++] = p;
+	}
+	mean = sum1 / n;
+	sigma = sqrt(sum2 / n - mean * mean);
+	for (p = packed_git; p; p = p->next) {
+		float rank = log10((p->pack_size + 1.0) / (p->num_objects + 1.0));
+		int deviation = sigma > 0 ? floor((rank - mean) / sigma) : 0;
+		if ( deviation < 0 )
+			deviation = 0;
+		p->deviation = deviation;
+	}
 
 	qsort(ary, n, sizeof(struct packed_git *), sort_pack);
 
-- 
1.5.2.762.gd8c6-dirty

^ permalink raw reply related	[flat|nested] only message in thread

only message in thread, other threads:[~2007-05-24 22:20 UTC | newest]

Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-24 22:20 [PATCH] sha1_file.c:rearrange_packed_git() should consider packs' object sizes Dana How

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.