[RFC] super indexes to span multiple packfiles

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] super indexes to span multiple packfiles
@ 2007-05-29  7:16 Shawn O. Pearce
  2007-05-29 16:05 ` Jon Smirl
  2007-05-29 17:54 ` Geert Bosch
  0 siblings, 2 replies; 7+ messages in thread
From: Shawn O. Pearce @ 2007-05-29  7:16 UTC (permalink / raw)
  To: git; +Cc: Dana How, Nicolas Pitre

Dana talking about some sort of global index that spans all packfiles
got me thinking.  Could we build a "super index" that covered all
known packfiles at the time it was created, but that doesn't really
hurt us when the packfiles are repacked?  Or that doesn't have to
be repacked every single time a packfile is added to the repository.

This patch introduces a new command `git super-index` which reads
all known pack-*.idx files and merges them together into a single
super-index.sdx file, located in $GIT_OBJECT_DIRECTORY/pack.
At runtime we load the first super-index.sdx file we come across
during our packfile scanning.  Using a single super-index works
because the super-index.sdx should contain data for this repository's
packs, and for all of its alternate's packs.

A super-index holds a list of packfile names that it will match at
the start of its data file.  A super-index can only be used for
the pack(s) mentioned there.  At runtime we try to lazily match
these pack SHA-1 names up to the struct packed_git.sha1 field,
giving us the struct packed_git* for direct index access.

The super-index does not hold the object offsets; instead it holds
an array of percentage shifts within the index fan-out ranges.
The shifts are normalized into the range 1-65535.

The basic idea is:

 1) Use the first byte of the input SHA-1 to index into the
    fan-out table of the super-index.sdx.

 2) Use the next sdx_prefix bytes of the SHA-1 to binary search
    through that region of the sdx file.  Since we are working
    only with the abbreviated prefix the memcmp() has less data to
    wade through.  We've also eliminated the common first byte,
    thanks to the fan-out table.

 3) The matching record in the sdx table holds sdx_packs uint16_t
    values.  The indexes in that list corresponding to the index
    of the pack SHA-1 names in the header of the sdx file; hence
    the first uint16_t goes to the first packfile, the second to
    the second, etc.  I call these "shifts".  I don't know why.

 4) If the shift value for a pack is 0 then that pack does not
    contain any objects that start with that prefix.  So a 0 shift
    means we do not need to scan the corresponding pack's .idx file
    for an object.

 5) If the shift is non-zero then there is at least one object in
    the corresponding packfile starting with the prefix and further
    searching in the specific packfile is required.  The shift is
    actually the percentage within the range of [lo, hi) in the
    pack's .idx of where the middle of that prefix's block appears.
    We use this as a seed value for mi, rather than (lo+hi)/2, as it
    gets us much much closer to the interesting SHA-1s in the .idx.

 6) We binary search the .idx in a traditional way, once we have
    selected a reasonable guess for the starting position.

 7) If a packfile cannot be found at runtime (e.g. it has been
    repacked away) we treat it's shift as though it were 0;
    the object is not in that packfile (since the packfile does
    not exist).

 8) If the sdx cannot give us the object, we fallback to standard
    packfile scanning.

In the single packfile case (everything repacked into one) this
is not faster; its actually slightly slower.  With a handful of
smaller recent packfiles (such as immediately after a git-fetch)
it breaks even with the stock code.  I haven't tested it yet with
a high number of packfiles (e.g. 20).  I suspect it won't gain us
a lot up there either...

So in short this shouldn't be applied, because its not any faster,
and is sometimes slower.  But I'm tossing it out here for discussion.
I'm also not documenting the new super-index command line program,
because I don't think this should be applied.  ;-)

----
 .gitignore             |    1 +
 Makefile               |    1 +
 builtin-pack-objects.c |    2 +-
 builtin-super-index.c  |  226 ++++++++++++++++++++++++++++++++++++++++++++++++
 builtin.h              |    1 +
 cache.h                |    2 +-
 git.c                  |    1 +
 pack-check.c           |    4 +-
 pack.h                 |   17 ++++
 sha1_file.c            |  208 ++++++++++++++++++++++++++++++++++++--------
 10 files changed, 423 insertions(+), 40 deletions(-)
 create mode 100644 builtin-super-index.c

diff --git a/.gitignore b/.gitignore
index 4dc0c39..3cd1b2c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -126,6 +126,7 @@ git-ssh-push
 git-ssh-upload
 git-status
 git-stripspace
+git-super-index
 git-svn
 git-svnimport
 git-symbolic-ref
diff --git a/Makefile b/Makefile
index 29243c6..26ac766 100644
--- a/Makefile
+++ b/Makefile
@@ -371,6 +371,7 @@ BUILTIN_OBJS = \
 	builtin-shortlog.o \
 	builtin-show-branch.o \
 	builtin-stripspace.o \
+	builtin-super-index.o \
 	builtin-symbolic-ref.o \
 	builtin-tar-tree.o \
 	builtin-unpack-objects.o \
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index d165f10..3b49336 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -756,7 +756,7 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
 	}
 
 	for (p = packed_git; p; p = p->next) {
-		off_t offset = find_pack_entry_one(sha1, p);
+		off_t offset = find_pack_entry_one(sha1, p, 0);
 		if (offset) {
 			if (!found_pack) {
 				found_offset = offset;
diff --git a/builtin-super-index.c b/builtin-super-index.c
new file mode 100644
index 0000000..9f47760
--- /dev/null
+++ b/builtin-super-index.c
@@ -0,0 +1,226 @@
+#include "cache.h"
+#include "pack.h"
+#include "csum-file.h"
+
+struct sdx_entry {
+	unsigned char prefix[20];
+	uint16_t *splits;
+};
+
+static unsigned pack_cnt;
+static unsigned entry_cnt;
+static unsigned avail_cnt;
+static unsigned prefix_len = 5;
+static uint32_t *current;
+static struct packed_git **packs;
+static struct sdx_entry *entries;
+
+static unsigned split_avail;
+static uint16_t *split_free;
+
+static void select_packs()
+{
+	struct packed_git *p;
+	unsigned i;
+
+	prepare_packed_git();
+	for (p = packed_git; p; p = p->next)
+		pack_cnt++;
+
+	packs = xmalloc(pack_cnt * sizeof(packs[0]));
+	current = xcalloc(pack_cnt, sizeof(current[0]));
+	for (p = packed_git, i = 0; p; p = p->next, i++) {
+		packs[i] = p;
+		open_pack_index(p);
+	}
+}
+
+static void compute_pack(struct sdx_entry *ent, unsigned pack_idx)
+{
+	struct packed_git *p = packs[pack_idx];
+	const uint32_t *level1_ofs = p->index_data;
+	uint32_t hi, lo, mi, n = current[pack_idx];
+
+	/* Determine the end of the range that matches */
+	while (n < p->num_objects) {
+		const unsigned char *c = nth_packed_object_sha1(p, n);
+		if (memcmp(c, ent->prefix, prefix_len) > 0)
+			break;
+		n++;
+	}
+	if (n == current[pack_idx])
+		return;
+
+	/* Determine the position within the fan-out subrange */
+	if (p->index_version > 1)
+		level1_ofs += 2;
+	hi = ntohl(level1_ofs[*ent->prefix]);
+	lo = *ent->prefix ? ntohl(level1_ofs[*ent->prefix - 1]) : 0;
+	mi = (current[pack_idx] + n) / 2;
+	ent->splits[pack_idx] = (mi - lo) * 65535 / (hi - lo) + 1;
+
+	current[pack_idx] = n;
+}
+
+static int compute_next_entry()
+{
+	unsigned i;
+	const unsigned char *min = NULL;
+	struct sdx_entry *ent;
+
+	/* locate the minimum hash */
+	for (i = 0; i < pack_cnt; i++) {
+		struct packed_git *p = packs[i];
+		uint32_t n = current[i];
+		if (n < p->num_objects) {
+			const unsigned char *c = nth_packed_object_sha1(p, n);
+			if (!min || memcmp(c, min, prefix_len) < 0)
+				min = c;
+		}
+	}
+	if (!min)
+		return 0;
+
+	if (entry_cnt == avail_cnt) {
+		avail_cnt = avail_cnt * 3 / 2 + 128;
+		entries = xrealloc(entries, avail_cnt * sizeof(entries[0]));
+	}
+	if (split_avail < pack_cnt) {
+		split_avail = pack_cnt * (avail_cnt - entry_cnt);
+		split_free = xcalloc(split_avail, sizeof(split_free[0]));
+	}
+
+	ent = &entries[entry_cnt++];
+	hashcpy(ent->prefix, min);
+
+	ent->splits = split_free;
+	split_free += pack_cnt;
+	split_avail -= pack_cnt;
+
+	/* determine which packs match this entry */
+	for (i = 0; i < pack_cnt; i++)
+		compute_pack(ent, i);
+
+	return 1;
+}
+
+static int read_only(const char *path)
+{
+	mode_t mode = umask(0);
+
+	umask(mode);
+	mode = 0444 & ~mode;
+	if (chmod(path, mode))
+		return -1;
+	return adjust_shared_perm(path);
+}
+
+static void write_super_index()
+{
+	char tmpname[PATH_MAX];
+	struct sha1file *f;
+	uint32_t i, array[256];
+	int fd;
+	struct pack_sdx_header hdr;
+	struct sdx_entry *curr = entries, *last = entries + entry_cnt;
+
+	snprintf(tmpname, sizeof(tmpname), "%s/%s",
+		get_object_directory(),
+		"tmp_sdx_XXXXXX");
+	fd = mkstemp(tmpname);
+	if (fd < 0)
+		die("unable to create %s: %s\n", tmpname, strerror(errno));
+	f = sha1fd(fd, tmpname);
+
+	/* Our file header, for once we plan ahead! */
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.sdx_signature = htonl(PACK_SDX_SIGNATURE);
+	hdr.sdx_version = htonl(1);
+	hdr.sdx_packs = htons(pack_cnt);
+	hdr.sdx_prefix = prefix_len - 1;
+	sha1write(f, &hdr, sizeof(hdr));
+
+	/* Immediately after is the pack SHA-1 names */
+	for (i = 0; i < pack_cnt; i++)
+		sha1write(f, packs[i]->sha1, sizeof(packs[i]->sha1));
+
+	/*
+	 * Write the first-level table (the list is sorted,
+	 * but we use a 256-entry lookup to be able to avoid
+	 * having to do eight extra binary search iterations).
+	 */
+	for (i = 0; i < 256; i++) {
+		while (curr < last && curr->prefix[0] == i)
+			curr++;
+		array[i] = htonl(curr - entries);
+	}
+	sha1write(f, array, 256 * 4);
+
+	/* Write the entries */
+	for (curr = entries; curr < last; curr++) {
+		sha1write(f, curr->prefix + 1, prefix_len - 1);
+		sha1write(f, curr->splits, pack_cnt);
+	}
+
+	sha1close(f, NULL, 1);
+	if (read_only(tmpname))
+		die("cannot make file readable: %s", strerror(errno));
+
+	if (move_temp_to_file(tmpname,
+		mkpath("%s/pack/super-index.sdx", get_object_directory())))
+		die("cannot save super-index.sdx");
+}
+
+static void show_super_index()
+{
+	unsigned i, j, multi_pack = 0;
+
+	for (i = 0; i < pack_cnt; i++)
+		printf("pack %s\n", sha1_to_hex(packs[i]->sha1));
+
+	for (i = 0; i < entry_cnt; i++) {
+		struct sdx_entry *ent = &entries[i];
+		char *prefix_str = sha1_to_hex(ent->prefix);
+		unsigned matches = 0;
+
+		prefix_str[2 * prefix_len] = 0;
+
+		printf("%s:", prefix_str);
+		for (j = 0; j < pack_cnt; j++) {
+			if (ent->splits[j]) {
+				matches++;
+				printf(" %3u", (int)ent->splits[j]);
+			} else
+				printf(" ----");
+		}
+		printf("\n");
+
+		if (matches > 1)
+			multi_pack++;
+	}
+
+	printf("%u packs, %u entries, %u%% unique\n",
+		pack_cnt, entry_cnt,
+		(entry_cnt - multi_pack) * 100 / entry_cnt);
+}
+
+static const char super_index_usage[] =
+"git-super-index";
+
+int cmd_super_index(int argc, char **argv, const char *prefix)
+{
+	select_packs();
+	if (!pack_cnt)
+		die("no packfiles to super-index");
+	if (pack_cnt == 1)
+		warning("only one packfile; a super-index is unnecessary");
+
+	while (compute_next_entry())
+		/* nothing */;
+	write_super_index();
+
+	if (0)
+		show_super_index();
+
+	return 0;
+}
diff --git a/builtin.h b/builtin.h
index d3f3a74..2a11187 100644
--- a/builtin.h
+++ b/builtin.h
@@ -71,6 +71,7 @@ extern int cmd_shortlog(int argc, const char **argv, const char *prefix);
 extern int cmd_show(int argc, const char **argv, const char *prefix);
 extern int cmd_show_branch(int argc, const char **argv, const char *prefix);
 extern int cmd_stripspace(int argc, const char **argv, const char *prefix);
+extern int cmd_super_index(int argc, const char **argv, const char *prefix);
 extern int cmd_symbolic_ref(int argc, const char **argv, const char *prefix);
 extern int cmd_tar_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_unpack_objects(int argc, const char **argv, const char *prefix);
diff --git a/cache.h b/cache.h
index 0f4a05b..3c8a41d 100644
--- a/cache.h
+++ b/cache.h
@@ -490,7 +490,7 @@ extern unsigned char* use_pack(struct packed_git *, struct pack_window **, off_t
 extern void unuse_pack(struct pack_window **);
 extern struct packed_git *add_packed_git(const char *, int, int);
 extern const unsigned char *nth_packed_object_sha1(struct packed_git *, uint32_t);
-extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
+extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *, unsigned);
 extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsigned long *);
 extern unsigned long unpack_object_header_gently(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
diff --git a/git.c b/git.c
index 29b55a1..c323479 100644
--- a/git.c
+++ b/git.c
@@ -284,6 +284,7 @@ static void handle_internal_command(int argc, const char **argv, char **envp)
 		{ "show-branch", cmd_show_branch, RUN_SETUP },
 		{ "show", cmd_show, RUN_SETUP | USE_PAGER },
 		{ "stripspace", cmd_stripspace },
+		{ "super-index", cmd_super_index, RUN_SETUP },
 		{ "symbolic-ref", cmd_symbolic_ref, RUN_SETUP },
 		{ "tar-tree", cmd_tar_tree },
 		{ "unpack-objects", cmd_unpack_objects, RUN_SETUP },
diff --git a/pack-check.c b/pack-check.c
index 7475348..ec61592 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -51,7 +51,7 @@ static int verify_packfile(struct packed_git *p,
 		sha1 = nth_packed_object_sha1(p, i);
 		if (!sha1)
 			die("internal error pack-check nth-packed-object");
-		offset = find_pack_entry_one(sha1, p);
+		offset = find_pack_entry_one(sha1, p, 0);
 		if (!offset)
 			die("internal error pack-check find-pack-entry-one");
 		data = unpack_entry(p, offset, &type, &size);
@@ -94,7 +94,7 @@ static void show_pack_info(struct packed_git *p)
 		sha1 = nth_packed_object_sha1(p, i);
 		if (!sha1)
 			die("internal error pack-check nth-packed-object");
-		offset = find_pack_entry_one(sha1, p);
+		offset = find_pack_entry_one(sha1, p, 0);
 		if (!offset)
 			die("internal error pack-check find-pack-entry-one");
 
diff --git a/pack.h b/pack.h
index d667fb8..b660833 100644
--- a/pack.h
+++ b/pack.h
@@ -43,6 +43,23 @@ struct pack_idx_header {
 };
 
 
+/*
+ * Super-pack index header
+ *
+ * This file points to multiple packfiles by name and provides a
+ * hint to tell us if it is likely that a given packfile contains
+ * a particular object.  This file is not an index file replacement,
+ * it is only a performance optimization hint.
+ */
+#define PACK_SDX_SIGNATURE 0x50534458	/* "PSDX" */
+struct pack_sdx_header {
+	uint32_t sdx_signature;
+	uint32_t sdx_version;
+	uint16_t sdx_packs;
+	uint8_t  sdx_prefix;
+	uint8_t  _unused_padding;
+};
+
 extern int verify_pack(struct packed_git *, int);
 extern void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t);
 
diff --git a/sha1_file.c b/sha1_file.c
index a3637d7..d296b32 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -405,6 +405,15 @@ static char *find_sha1_file(const unsigned char *sha1, struct stat *st)
 	return NULL;
 }
 
+struct super_index {
+	struct pack_sdx_header *sdx_header;
+	void *sdx_data;
+	unsigned char *sdx_pack_names;
+	struct packed_git **packs;
+	unsigned pack_cnt;
+};
+
+static struct super_index *super_idx;
 static unsigned int pack_used_ctr;
 static unsigned int pack_mmap_calls;
 static unsigned int peak_pack_open_windows;
@@ -831,6 +840,64 @@ void install_packed_git(struct packed_git *pack)
 	packed_git = pack;
 }
 
+static int open_super_index(const char *path)
+{
+	int fd = open(path, O_RDONLY);
+	struct stat st;
+	size_t sdx_size;
+	void *sdx_map;
+	struct pack_sdx_header *hdr;
+	unsigned char *sdx_data;
+	unsigned pack_cnt;
+
+	if (fd < 0)
+		return -1;
+	if (fstat(fd, &st)) {
+		close(fd);
+		return -1;
+	}
+
+	sdx_size = xsize_t(st.st_size);
+	if (sdx_size < 4 * 256 + 20) {
+		close(fd);
+		return error("super-index %s is too small", path);
+	}
+	sdx_map = mmap(NULL, sdx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (sdx_map == MAP_FAILED) {
+		release_pack_memory(sdx_size, fd);
+		sdx_map = mmap(NULL, sdx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+		if (sdx_map == MAP_FAILED) {
+			close(fd);
+			return -1;
+		}
+	}
+	close(fd);
+
+	hdr = sdx_map;
+	if (hdr->sdx_signature != htonl(PACK_SDX_SIGNATURE)) {
+		munmap(sdx_map, sdx_size);
+		return error("super-index %s has incorrect signature", path);
+	}
+	if (hdr->sdx_version != htonl(1)) {
+		munmap(sdx_map, sdx_size);
+		return error("super-index file %s is version %u"
+				 " and is not supported by this binary"
+				 " (try upgrading GIT to a newer version)",
+				 path, ntohl(hdr->sdx_version));
+	}
+
+	pack_cnt = ntohs(hdr->sdx_packs);
+	sdx_data = (unsigned char*)(hdr + 1);
+
+	super_idx = xmalloc(sizeof(*super_idx));
+	super_idx->sdx_header = hdr;
+	super_idx->sdx_pack_names = sdx_data;
+	super_idx->sdx_data = sdx_data + pack_cnt * 20;
+	super_idx->pack_cnt = pack_cnt;
+	super_idx->packs = xcalloc(pack_cnt, sizeof(*super_idx->packs));
+	return 0;
+}
+
 static void prepare_packed_git_one(char *objdir, int local)
 {
 	char path[PATH_MAX];
@@ -852,6 +919,12 @@ static void prepare_packed_git_one(char *objdir, int local)
 		int namelen = strlen(de->d_name);
 		struct packed_git *p;
 
+		if (!super_idx && !strcmp(de->d_name, "super-index.sdx")) {
+			strcpy(path + len, de->d_name);
+			open_super_index(path);
+			continue;
+		}
+
 		if (!has_extension(de->d_name, ".idx"))
 			continue;
 
@@ -1260,7 +1333,7 @@ static off_t get_delta_base(struct packed_git *p,
 		*curpos += used;
 	} else if (type == OBJ_REF_DELTA) {
 		/* The base entry _must_ be in the same pack */
-		base_offset = find_pack_entry_one(base_info, p);
+		base_offset = find_pack_entry_one(base_info, p, 0);
 		if (!base_offset)
 			die("failed to find delta-pack base object %s",
 				sha1_to_hex(base_info));
@@ -1362,7 +1435,7 @@ const char *packed_object_info_detail(struct packed_git *p,
 			next_sha1 = use_pack(p, &w_curs, curpos, NULL);
 			if (*delta_chain_length == 0)
 				hashcpy(base_sha1, next_sha1);
-			obj_offset = find_pack_entry_one(next_sha1, p);
+			obj_offset = find_pack_entry_one(next_sha1, p, 0);
 			break;
 		}
 		(*delta_chain_length)++;
@@ -1633,11 +1706,12 @@ static off_t nth_packed_object_offset(const struct packed_git *p, uint32_t n)
 }
 
 off_t find_pack_entry_one(const unsigned char *sha1,
-				  struct packed_git *p)
+				  struct packed_git *p,
+				  unsigned shift)
 {
 	const uint32_t *level1_ofs = p->index_data;
 	const unsigned char *index = p->index_data;
-	unsigned hi, lo;
+	unsigned hi, lo, mi;
 
 	if (!index) {
 		if (open_pack_index(p))
@@ -1652,9 +1726,12 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 	index += 4 * 256;
 	hi = ntohl(level1_ofs[*sha1]);
 	lo = ((*sha1 == 0x0) ? 0 : ntohl(level1_ofs[*sha1 - 1]));
+	if (shift)
+		mi = lo + (hi - lo) * (shift - 1) / 65535;
+	else
+		mi = (lo + hi) / 2;
 
 	do {
-		unsigned mi = (lo + hi) / 2;
 		unsigned x = (p->index_version > 1) ? (mi * 20) : (mi * 24 + 4);
 		int cmp = hashcmp(index + x, sha1);
 		if (!cmp)
@@ -1663,6 +1740,7 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 			hi = mi;
 		else
 			lo = mi+1;
+		mi = (lo + hi) / 2;
 	} while (lo < hi);
 	return 0;
 }
@@ -1685,42 +1763,100 @@ static int matches_pack_name(struct packed_git *p, const char *ig)
 	return 1;
 }
 
-static int find_pack_entry(const unsigned char *sha1, struct pack_entry *e, const char **ignore_packed)
+static int find_in_pack(struct packed_git *p, unsigned shift, const unsigned char *sha1, struct pack_entry *e, const char **ignore_packed)
 {
-	struct packed_git *p;
 	off_t offset;
 
-	prepare_packed_git();
+	if (ignore_packed) {
+		const char **ig;
+		for (ig = ignore_packed; *ig; ig++)
+			if (!matches_pack_name(p, *ig))
+				break;
+		if (*ig)
+			return 0;
+	}
 
-	for (p = packed_git; p; p = p->next) {
-		if (ignore_packed) {
-			const char **ig;
-			for (ig = ignore_packed; *ig; ig++)
-				if (!matches_pack_name(p, *ig))
-					break;
-			if (*ig)
-				continue;
-		}
-		offset = find_pack_entry_one(sha1, p);
-		if (offset) {
-			/*
-			 * We are about to tell the caller where they can
-			 * locate the requested object.  We better make
-			 * sure the packfile is still here and can be
-			 * accessed before supplying that answer, as
-			 * it may have been deleted since the index
-			 * was loaded!
-			 */
-			if (p->pack_fd == -1 && open_packed_git(p)) {
-				error("packfile %s cannot be accessed", p->pack_name);
-				continue;
+	offset = find_pack_entry_one(sha1, p, shift);
+	if (!offset)
+		return 0;
+
+	/*
+	 * We are about to tell the caller where they can
+	 * locate the requested object.  We better make
+	 * sure the packfile is still here and can be
+	 * accessed before supplying that answer, as
+	 * it may have been deleted since the index
+	 * was loaded!
+	 */
+	if (p->pack_fd == -1 && open_packed_git(p)) {
+		error("packfile %s cannot be accessed", p->pack_name);
+		return 0;
+	}
+	e->offset = offset;
+	e->p = p;
+	hashcpy(e->sha1, sha1);
+	return 1;
+}
+
+static int find_superindex_entry(const unsigned char *sha1, struct pack_entry *e, const char **ignore_packed)
+{
+	const uint32_t *level1_ofs = super_idx->sdx_data;
+	const unsigned char *index = super_idx->sdx_data;
+	const unsigned pfx_len = super_idx->sdx_header->sdx_prefix;
+	const unsigned packs = super_idx->pack_cnt;
+	unsigned hi, lo;
+
+	index += 4 * 256;
+	hi = ntohl(level1_ofs[*sha1]);
+	lo = ((*sha1 == 0x0) ? 0 : ntohl(level1_ofs[*sha1 - 1]));
+
+	do {
+		unsigned mi = (lo + hi) / 2;
+		unsigned x = mi * (pfx_len + packs);
+		int cmp = memcmp(index + x, sha1 + 1, pfx_len);
+		if (!cmp) {
+			index += x + pfx_len;
+			for (x = 0; x < packs; x++, index += 2) {
+				unsigned shift = (*index << 8) | index[1];
+				struct packed_git *p;
+				if (!shift)
+					continue;
+				p = super_idx->packs[x];
+				if (!p) {
+					for (p = packed_git; p; p = p->next) {
+						if (!hashcmp(p->sha1,
+							super_idx->sdx_pack_names + x * 20))
+							break;
+					}
+					if (!p)
+						p = MAP_FAILED;
+					super_idx->packs[x] = p;
+				}
+				if (p == MAP_FAILED)
+					continue;
+				if (find_in_pack(p, shift, sha1, e, ignore_packed))
+					return 1;
 			}
-			e->offset = offset;
-			e->p = p;
-			hashcpy(e->sha1, sha1);
-			return 1;
+			return 0;
 		}
-	}
+		if (cmp > 0)
+			hi = mi;
+		else
+			lo = mi + 1;
+	} while (lo < hi);
+	return 0;
+}
+
+static int find_pack_entry(const unsigned char *sha1, struct pack_entry *e, const char **ignore_packed)
+{
+	struct packed_git *p;
+
+	prepare_packed_git();
+	if (super_idx && find_superindex_entry(sha1, e, ignore_packed))
+		return 1;
+	for (p = packed_git; p; p = p->next)
+		if (find_in_pack(p, 0, sha1, e, ignore_packed))
+			return 1;
 	return 0;
 }
 
@@ -1730,7 +1866,7 @@ struct packed_git *find_sha1_pack(const unsigned char *sha1,
 	struct packed_git *p;
 
 	for (p = packs; p; p = p->next) {
-		if (find_pack_entry_one(sha1, p))
+		if (find_pack_entry_one(sha1, p, 0))
 			return p;
 	}
 	return NULL;
-- 
1.5.2.838.g8a923

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC] super indexes to span multiple packfiles
  2007-05-29  7:16 [RFC] super indexes to span multiple packfiles Shawn O. Pearce
@ 2007-05-29 16:05 ` Jon Smirl
  2007-05-29 16:19   ` Nicolas Pitre
  2007-05-29 16:20   ` Avi Kivity
  2007-05-29 17:54 ` Geert Bosch
  1 sibling, 2 replies; 7+ messages in thread
From: Jon Smirl @ 2007-05-29 16:05 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git, Dana How, Nicolas Pitre

Object's are not accessed in random order with git. Once an object
reference hits a pack file it is very likely that following references
will hit the same pack file. That's because you always find object
SHA's by following the chains.

So first place to look for an object is the same place the previous
object was found. If it isn't there order the search of the pack files
by creation data (just a heuristic). Make this list a circle and start
the search in the pack where the previous object was found. This can
all be done with the existing indexes.

I haven't been reading all of the messages on this subject, but is
this strategy enough to eliminate the need for a super index?

If you still need a super index, note that it may be good enough for
it to only contain the SHA's for objects that are externally
referenced. This index would be small and simply point to the correct
pack file index to find the object in. You could add a list of
dangling links to each packfile index to assist with building this
super index.

My work with databases leads me to believe that figuring out how to
pack everything into a smaller space always beats efforts put into
incrementally improving the indexing scheme. Packing into a smaller
space reduces the total IO needs and that's always a winner.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] super indexes to span multiple packfiles
  2007-05-29 16:05 ` Jon Smirl
@ 2007-05-29 16:19   ` Nicolas Pitre
  2007-05-29 16:20   ` Avi Kivity
  1 sibling, 0 replies; 7+ messages in thread
From: Nicolas Pitre @ 2007-05-29 16:19 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Shawn O. Pearce, git, Dana How

On Tue, 29 May 2007, Jon Smirl wrote:

> Object's are not accessed in random order with git. Once an object
> reference hits a pack file it is very likely that following references
> will hit the same pack file. That's because you always find object
> SHA's by following the chains.
> 
> So first place to look for an object is the same place the previous
> object was found. If it isn't there order the search of the pack files
> by creation data (just a heuristic). Make this list a circle and start
> the search in the pack where the previous object was found. This can
> all be done with the existing indexes.
> 
> I haven't been reading all of the messages on this subject, but is
> this strategy enough to eliminate the need for a super index?

I think it could.

Personally I'm not a big fan of the super index notion.  It needs extra 
maintenance to keep in synch, and when it is not in synch it requires 
extra work at run time to fall back to traditional lookup.  And Shawn's 
testing didn't provide significant performance gains either.

But a simple heuristic like the presumption that the next object is 
likely to be in the same pack as the previous is the kind of thing that 
could provide significant improvements with really little effort.

Nicolas

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] super indexes to span multiple packfiles
  2007-05-29 16:05 ` Jon Smirl
  2007-05-29 16:19   ` Nicolas Pitre
@ 2007-05-29 16:20   ` Avi Kivity
  2007-05-29 16:31     ` Nicolas Pitre
  1 sibling, 1 reply; 7+ messages in thread
From: Avi Kivity @ 2007-05-29 16:20 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Shawn O. Pearce, git, Dana How, Nicolas Pitre

Jon Smirl wrote:
>
> My work with databases leads me to believe that figuring out how to
> pack everything into a smaller space always beats efforts put into
> incrementally improving the indexing scheme. Packing into a smaller
> space reduces the total IO needs and that's always a winner.
>

Another way to achieve that is to place objects that are accessed 
together nearby, and issue a larger read so as to bring them into 
cache.  I imagine that placing commit objects and associated tree and 
blobs in history order should help here (but maybe git already does 
that, I'm not familiar with the internals).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] super indexes to span multiple packfiles
  2007-05-29 16:20   ` Avi Kivity
@ 2007-05-29 16:31     ` Nicolas Pitre
  2007-05-30  9:40       ` Avi Kivity
  0 siblings, 1 reply; 7+ messages in thread
From: Nicolas Pitre @ 2007-05-29 16:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jon Smirl, Shawn O. Pearce, git, Dana How

On Tue, 29 May 2007, Avi Kivity wrote:

> Jon Smirl wrote:
> > 
> > My work with databases leads me to believe that figuring out how to
> > pack everything into a smaller space always beats efforts put into
> > incrementally improving the indexing scheme. Packing into a smaller
> > space reduces the total IO needs and that's always a winner.
> > 
> 
> Another way to achieve that is to place objects that are accessed together
> nearby, and issue a larger read so as to bring them into cache.  I imagine
> that placing commit objects and associated tree and blobs in history order
> should help here (but maybe git already does that, I'm not familiar with the
> internals).

GIT already does that indeed, except for commit objects which are all 
together for better performances on history traversal operations.

After a fresh repack, the checkout of the latest revision should produce 
a nearly perfect linear and contigous access into the early portion of 
the same pack.  Things will get more random with access to objects 
further back in history of course, but those objects are less likely to 
be accessed as often.

Nicolas

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] super indexes to span multiple packfiles
  2007-05-29  7:16 [RFC] super indexes to span multiple packfiles Shawn O. Pearce
  2007-05-29 16:05 ` Jon Smirl
@ 2007-05-29 17:54 ` Geert Bosch
  1 sibling, 0 replies; 7+ messages in thread
From: Geert Bosch @ 2007-05-29 17:54 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Git Mailing List, Dana How, Nicolas Pitre

[resent, my email client doesn't like Shawn's name :( ]
On May 29, 2007, at 03:16, Shawn O. Pearce wrote:

> In the single packfile case (everything repacked into one) this
> is not faster; its actually slightly slower.  With a handful of
> smaller recent packfiles (such as immediately after a git-fetch)
> it breaks even with the stock code.  I haven't tested it yet with
> a high number of packfiles (e.g. 20).  I suspect it won't gain us
> a lot up there either...
>
> So in short this shouldn't be applied, because its not any faster,
> and is sometimes slower.  But I'm tossing it out here for discussion.
> I'm also not documenting the new super-index command line program,
> because I don't think this should be applied.  ;-)

A super-index should have a fan-out factor dependent on the number
of entries in the index. So an index with 2^20 entries would have
2^16 fan-outs. So there will be about 2^4 entries per fanout slot.
Each slot only contains the next byte of the key, so in this case
bit 16 through 23 (numbering bits starting at zero).

The format would look like:
   * Header
   * Fan-out table (approx 2 bits/object)
   * Next key-byte table (1 byte/object)
   * Object info (offset into packfile(s) and/or rest of object name)

So, after a single lookup to get to about the right entry, only
an expected 16 single-byte comparisons are necessary to find the
entry that is needed or return a "entry doesn't exist".

Below I repeat a previous suggestion for a new pack file index.
I'm happy to implement this, although it seems that there are
so many pending index/pack file format changes that this may
be premature. If the pack file itself will have the full SHA-1,
the

(snipped from <http://article.gmane.org/gmane.comp.version- 
control.git/43545>)

Multiple Pack Index

The linear search through packs is very inefficient
with large numbers of packs. Having packs much larger
than a GB is also problematic, due to this as repacking
and otherwise modifying packs gets very expensive.

Another issue is that binary search requires many
semi-random accesses spread over the index. Finally,
most of the information actually read consists of
SHA1's that are never needed.

This proposed pack index format does not focus on reducing
used disk space, but instead aims to reduce the number
of blocks that needs to be read to perform lookups.
This is done using three techniques:
   1) scale the number of fan-out bins with the number
      of objects in the index, keeping the expected
      number of objects in each bin constant
   2) take advantage of 1) by only storing a few bits
      following the common prefix in the main lookup table
      as a discriminant. Store the rest of the SHA1 and
      the pack offest in a separate, parallel, object table.
   3) Instead of repeating the variable-length common prefix
      and the discriminant, use the space for the prefix
      for a pack identifier and omit the discriminant altogether.

For a repository with N objects and highest PACK_NR P,
the total space used for the index is bounded by
24 * (N + P) bytes, if N is at least 512 and N >= 512 * P.

Limits:
    - Maximum number of packs: 2^27
    - Maximum number of objects: 2^40
    - Maximum repository size: 2^48 bytes

<PACK_INDEX>
    :   <IDX_PACK_LIST>
        <IDX_FANOUT_BITS>
        <IDX_FANOUT_TABLE>
        <IDX_LOOKUP_TABLE>
        <IDX_OBJECT_TABLE>
        <IDX_CHECKSUM>
    ;

<IDX_PACK_LIST>
    :   <IDX_PACK_LIST_ENTRIES>
        <ZERO_32> <IDX_PACK_LIST_CHECKSUM>
    ;
<IDX_PACK_LIST_ENTRIES>
    # List of packs sorted by ascending PACK_ID
    :  ( <IDX_PACK_NR> <PACK_ID> ) *
    ;

<PACK_ID>
    # 20-byte binary representation of the 40 hex-digit
    # value PACK_ID_HEX, such that pack-${PACK_ID_HEX}.pack
    # is the name of the pack file
    ;

<IDX_PACK_NR>
    # 32-bit unsigned integer in network order, with the same
    # value as the preceding <IDX_PACK_NR> (or zero for the
    # first entry), increased by the size of the pack file in
    # bytes, divided by 2^32 and rounded up.
    ;

<ZERO_32>
    # 32-bit zero
    ;

<IDX_PACK_LIST_CHECKSUM>
    # 20-byte SHA1 of <IDX_PACK_LIST_ENTRIES>
    ;

<IDX_FANOUT_BITS>
    # 1 byte with the smallest value N between 8 and 35,
    # such that 2^(N - 8) greater than or equal to the
    # largest IDX_PACK_NR in the IDX_PACK_LIST, and such that
    # 2^(N+5) is greater than or equal to the total number
    # of objects in all packs.
    ;

<IDX_FANOUT_TABLE>
    # Table of 2^${IDX_FANOUT_BITS} entries
    :   ( <IDX_PARTIAL_COUNT> ) *
    ;

<IDX_PARTIAL_COUNT>
    # 40 bit, network byte order, binary integer of the count of
    # objects in the pack file with the high IDX_FANOUT_BITS bits of
    # the object ID less than or equal to the index of the count,
    # starting from zero.
    ;

<IDX_LOOKUP_TABLE>
    # One 8-bit key per object indexed by the pack
    :   ( <IDX_LOOKUP_KEY> ) *
    ;

<IDX_LOOKUP_KEY>
    # Bits IDX_FANOUT_BITS through IDX_FANOUT_BITS + 7 of the
    # object ID.
    ;

<IDX_OBJECT_ENTRY>
    # The total width of each entry is 22 bytes
    :   ( <IDX_PACK_REF> <IDX_OBJECT_ID> <IDX_OFFSET> ) *
    ;

<IDX_PACK_REF>
    # A IDX_FANOUT_BITS - 8 bit wide integer value, equal to
    # PACK_NR of pack preceding the one containing the object
    # (or zero, if object is in first pack) increased with the
    # pack offset divided by 2^32.
    ;

<IDX_OBJECT_ID>
    # Bits IDX_FANOUT_BITS + 8 .. 159 of the object ID
    ;

<IDX_OFFSET>
    # 32-bits offset in network byte order
    ;

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC] super indexes to span multiple packfiles
  2007-05-29 16:31     ` Nicolas Pitre
@ 2007-05-30  9:40       ` Avi Kivity
  0 siblings, 0 replies; 7+ messages in thread
From: Avi Kivity @ 2007-05-30  9:40 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jon Smirl, Shawn O. Pearce, git, Dana How

Nicolas Pitre wrote:
>>
>> Another way to achieve that is to place objects that are accessed together
>> nearby, and issue a larger read so as to bring them into cache.  I imagine
>> that placing commit objects and associated tree and blobs in history order
>> should help here (but maybe git already does that, I'm not familiar with the
>> internals).
>>     
>
> GIT already does that indeed, except for commit objects which are all 
> together for better performances on history traversal operations.
>
> After a fresh repack, the checkout of the latest revision should produce 
> a nearly perfect linear and contigous access into the early portion of 
> the same pack.  Things will get more random with access to objects 
> further back in history of course, but those objects are less likely to 
> be accessed as often.
>
>   

Thanks.  Actually I should have deduced this from the speed of 'git log' ;-)

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2007-05-30  9:40 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-29  7:16 [RFC] super indexes to span multiple packfiles Shawn O. Pearce
2007-05-29 16:05 ` Jon Smirl
2007-05-29 16:19   ` Nicolas Pitre
2007-05-29 16:20   ` Avi Kivity
2007-05-29 16:31     ` Nicolas Pitre
2007-05-30  9:40       ` Avi Kivity
2007-05-29 17:54 ` Geert Bosch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).