Re: [RFC] Packing large repositories

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Geert Bosch <bosch@adacore.com>
Cc: Nicolas Pitre <nico@cam.org>,
	"Shawn O. Pearce" <spearce@spearce.org>,
	Dana How <danahow@gmail.com>,
	git@vger.kernel.org
Subject: Re: [RFC] Packing large repositories
Date: Sat, 31 Mar 2007 11:36:35 -0700 (PDT)	[thread overview]
Message-ID: <Pine.LNX.4.64.0703311033290.6730@woody.linux-foundation.org> (raw)
In-Reply-To: <78639417-ACDB-484F-85BB-EC0AF694949A@adacore.com>



On Sat, 31 Mar 2007, Geert Bosch wrote:
> 
> I've been working on a design for a new index file format, and I'll
> include my current draft at the end of this message. It is not finished,
> and I've not written much code yet, but I'd like to prevent duplicated
> work and give others the opportunity to borrow from the ideas so far.

Ok, here's one comment...

I think your basic assumption that the current index-pack is bad is 
broken ;)

The thing is, I think the index-pack could be improved, but I think we can 
get a bigger improvement from just being more intelligent about searching 
than from actually needing to re-organize the pack.

Here's a few clues:
 - binary search is certainly not the only way to do lookups on sorted 
   contents
 - one of the fundamental rules about cryptographic hashes is that they 
   are *evenly*distributed*
 - currently we do not take advantage of our inherent knowledge of the 
   distribution when we look things up, and we use a stupid binary lookup 
   that is guaranteed to basically always take O(log2(n)) random jumps 
   around the data area, with bad locality.

In other words, the 256-entry fan-out was meant to avoid the first eight 
levels of binary lookup. But the thing is, we should be able to generally 
do a *lot* better than any binary lookup by just doing a LINEAR SEARCH, 
which is also more cache-friendly, and prefetches much better when it's 
not in the cache.

The only thing we want for a linear search is a good starting point. Which 
gets us back to the simple: "SHA1 hashes are evenly distributed". In other 
words, getting a good starting point should be *trivial*.

So here's a suggestion:

 - start finding a range using the 256-entry fan-out, exactly the way we 
   did for the binary search. It's cheap. We could probably avoid EVEN 
   THIS, and just do one more newton-raphson iteration more. But since we 
   have the data, we migth as well use it. After all, it really *is* just 
   a first approximation of newton-raphson, and while it uses only 8 bits 
   (and we could do better), at least it's an *exact* one.

 - use newton-raphson to iterate closer. It should be a much faster way to 
   find the rough area for the entry we're searching for than binary 
   search. Two or three iterations should get us there, easily.

 - do a linear search once you're close enough.

Here's a stupid patch that does that. It's biggest weakness is that it 
only does a single iteration of newton-raphson. It should probably do at 
least one more iteration. I bet that you'd get so close that you find it 
really quickly if you did that. As it is, it doesn't seem to be any slower 
than the binary search..

And the nice thing is that I think that in large packs the "evenly 
distributed" should be *more*true* than in small packs, so this should 
scale up very nicely, and I'm hoping that it would get much better cache 
behaviour (because we wouldn't bounce around in the index file too much). 
I wouldn't be surprised if we basically hit it on the first try if we did 
two or three iterations of newton-raphson.

Comments? Does anybody want to take this and run with it?

(No guarantees that this really can ever beat binary search, but somebody 
might find this interesting. And I think the potential is there, but I 
think you do need to do at *least* two iterations).

		Linus
---
 sha1_file.c |   79 +++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 9c26038..fc4dbc7 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1543,29 +1543,78 @@ int nth_packed_object_sha1(const struct packed_git *p, uint32_t n,
 	return 0;
 }
 
-off_t find_pack_entry_one(const unsigned char *sha1,
-				  struct packed_git *p)
+struct index_entry {
+	uint32_t nb_offset;	/* pack-file offset in network byte order */
+	unsigned char sha1[20];	/* SHA1 of the object */
+};
+
+static off_t pack_search_backward(const struct index_entry *index, uint32_t guess, const unsigned char *sha1)
 {
-	const uint32_t *level1_ofs = p->index_data;
-	int hi = ntohl(level1_ofs[*sha1]);
-	int lo = ((*sha1 == 0x0) ? 0 : ntohl(level1_ofs[*sha1 - 1]));
-	const unsigned char *index = p->index_data;
+	int count = 1;
+	while (guess) {
+		int cmp;
 
-	index += 4 * 256;
+		guess--;
+		index--;
+		cmp = hashcmp(index->sha1, sha1);
+		if (!cmp)
+			return ntohl(index->nb_offset);
+		if (cmp < 0)
+			break;
+		count++;
+	}
+	return 0;
+}
 
-	do {
-		int mi = (lo + hi) / 2;
-		int cmp = hashcmp(index + 24 * mi + 4, sha1);
+static off_t pack_search_forward(const struct index_entry *index, uint32_t guess, uint32_t max, const unsigned char *sha1)
+{
+	int count = 1;
+	while (++guess < max) {
+		int cmp;
+
+		index++;
+		cmp = hashcmp(index->sha1, sha1);
 		if (!cmp)
-			return ntohl(*((uint32_t *)((char *)index + (24 * mi))));
+			return ntohl(index->nb_offset);
 		if (cmp > 0)
-			hi = mi;
-		else
-			lo = mi+1;
-	} while (lo < hi);
+			break;
+		count++;
+	}
 	return 0;
 }
 
+off_t find_pack_entry_one(const unsigned char *sha1,
+				  struct packed_git *p)
+{
+	const struct index_entry *index;
+	const uint32_t *level1_ofs = p->index_data;
+	uint32_t min, max, first;
+	uint32_t hash_val;
+	uint32_t guess;
+	int cmp;
+
+	first = *sha1;
+	min = first ? ntohl(level1_ofs[first - 1]) : 0;
+	max = ntohl(level1_ofs[first]);
+
+	hash_val = (sha1[1] << 24) |
+		   (sha1[2] << 16) |
+		   (sha1[3] << 8) |
+		   (sha1[4]);
+	guess = ((uint64_t)(max - min) * hash_val) >> 32;
+	guess += min;
+
+	index = (const struct index_entry *)((char *) p->index_data + 4 * 256);
+	index += guess;
+
+	cmp = hashcmp(index->sha1, sha1);
+	if (!cmp)
+		return ntohl(index->nb_offset);
+	if (cmp > 0)
+		return pack_search_backward(index, guess, sha1);
+	return pack_search_forward(index, guess, max, sha1);
+}
+
 static int matches_pack_name(struct packed_git *p, const char *ig)
 {
 	const char *last_c, *c;

next prev parent reply	other threads:[~2007-03-31 18:38 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-03-28  7:05 [RFC] Packing large repositories Dana How
2007-03-28 16:53 ` Linus Torvalds
2007-03-30  6:23   ` Shawn O. Pearce
2007-03-30 13:01     ` Nicolas Pitre
2007-03-31 11:04       ` Geert Bosch
2007-03-31 18:36         ` Linus Torvalds [this message]
2007-03-31 19:02           ` Nicolas Pitre
2007-03-31 20:54           ` Junio C Hamano
2007-03-31 21:20           ` Linus Torvalds
2007-03-31 21:56             ` Linus Torvalds
2007-04-02  6:22           ` Geert Bosch
2007-04-03  5:39             ` Shawn O. Pearce
2007-03-31 18:51         ` Nicolas Pitre
2007-04-02 21:19   ` Dana How
2007-04-02  1:39 ` Sam Vilain

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:9c26038 dfblob:fc4dbc7 )
 OR (
bs:"Re: [RFC] Packing large repositories" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0703311033290.6730@woody.linux-foundation.org \
    --to=torvalds@linux-foundation.org \
    --cc=bosch@adacore.com \
    --cc=danahow@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=nico@cam.org \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).