From: Dana How <danahow@gmail.com>
To: Junio C Hamano <junkio@cox.net>
Cc: Git Mailing List <git@vger.kernel.org>, danahow@gmail.com
Subject: [PATCH] sha1_file.c:rearrange_packed_git() should consider packs' object sizes
Date: Thu, 24 May 2007 15:20:48 -0700 [thread overview]
Message-ID: <46560FC0.4020204@gmail.com> (raw)
Shawn O. Pearce wrote:
> We might be able to fix this by altering the sort_pack function
> in sha1_file.c to not only order by mtime, but also by the ratio
> of the size of the .pack to the number of objects stored in it.
> Any packfile with a high size/object ratio is likely to be what
> Dana has been calling a "metadata" pack, holding things like tags,
> commits, trees and small blobs. Its these packfiles that we want
> to search first, as they are the most likely to be accessed.
>
> By pushing the megablob packs to the end of our packed_git search
> list we won't tend to scan their indexes, as most of our objects
> will be found earlier in the search list. Hence we will generally
> avoid any costs associated with their indexes.
So change the sort keys in rearrange_packed_git()/sort_pack() to
local then alternate,
increasing "deviation", <== NEW
decreasing mtime (new then old)
Each packfile has a "rank", which is the log10 of its average
stored object size. "Deviation" is the number of standard deviations
this number exceeds its mean, rounded down and clipped at zero.
Deviation should be 0 in normal use, and positive for packfiles
with significant populations of huge blobs.
This definition of deviation is intended to override mtime
only when sufficiently significant. Putting mtime before deviation
in the sort key list would cause deviation to be irrelevant.
Signed-off-by: Dana L. How <danahow@gmail.com>
---
Makefile | 2 +-
cache.h | 1 +
sha1_file.c | 26 +++++++++++++++++++++++++-
3 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/Makefile b/Makefile
index 29243c6..45f0a52 100644
--- a/Makefile
+++ b/Makefile
@@ -383,7 +383,7 @@ BUILTIN_OBJS = \
builtin-pack-refs.o
GITLIBS = $(LIB_FILE) $(XDIFF_LIB)
-EXTLIBS = -lz
+EXTLIBS = -lz -lm
#
# Platform specific tweaks
diff --git a/cache.h b/cache.h
index cd875bc..630cd89 100644
--- a/cache.h
+++ b/cache.h
@@ -437,6 +437,7 @@ extern struct packed_git {
uint32_t num_objects;
int index_version;
time_t mtime;
+ int deviation;
int pack_fd;
int pack_local;
unsigned char sha1[20];
diff --git a/sha1_file.c b/sha1_file.c
index 12d2ef2..1565d91 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -6,6 +6,7 @@
* This handles basic git sha1 object files - packing, unpacking,
* creation etc.
*/
+#include <math.h>
#include "cache.h"
#include "delta.h"
#include "pack.h"
@@ -869,6 +870,14 @@ static int sort_pack(const void *a_, const void *b_)
return -st;
/*
+ * Packs with large "deviation" should be searched last.
+ * Such packs have significantly larger blobs.
+ */
+ st = a->deviation - b->deviation;
+ if (st)
+ return st;
+
+ /*
* Younger packs tend to contain more recent objects,
* and more recent objects tend to get accessed more
* often.
@@ -884,6 +893,7 @@ static void rearrange_packed_git(void)
{
struct packed_git **ary, *p;
int i, n;
+ float sum1 = 0, sum2 = 0, mean, sigma;
for (n = 0, p = packed_git; p; p = p->next)
n++;
@@ -892,8 +902,22 @@ static void rearrange_packed_git(void)
/* prepare an array of packed_git for easier sorting */
ary = xcalloc(n, sizeof(struct packed_git *));
- for (n = 0, p = packed_git; p; p = p->next)
+ for (n = 0, p = packed_git; p; p = p->next) {
+ /* this is the log10 of the average object size (almost) */
+ float rank = log10((p->pack_size + 1.0) / (p->num_objects + 1.0));
+ sum1 += rank;
+ sum2 += rank * rank;
ary[n++] = p;
+ }
+ mean = sum1 / n;
+ sigma = sqrt(sum2 / n - mean * mean);
+ for (p = packed_git; p; p = p->next) {
+ float rank = log10((p->pack_size + 1.0) / (p->num_objects + 1.0));
+ int deviation = sigma > 0 ? floor((rank - mean) / sigma) : 0;
+ if ( deviation < 0 )
+ deviation = 0;
+ p->deviation = deviation;
+ }
qsort(ary, n, sizeof(struct packed_git *), sort_pack);
--
1.5.2.762.gd8c6-dirty
reply other threads:[~2007-05-24 22:20 UTC|newest]
Thread overview: [no followups] expand[flat|nested] mbox.gz Atom feed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=46560FC0.4020204@gmail.com \
--to=danahow@gmail.com \
--cc=git@vger.kernel.org \
--cc=junkio@cox.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.