git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Karsten Blees <karsten.blees@gmail.com>
To: Jeff King <peff@peff.net>
Cc: "Thomas Rast" <tr@thomasrast.ch>,
	git@vger.kernel.org, "Vicent Martí" <vicent@github.com>
Subject: Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes
Date: Tue, 03 Dec 2013 15:40:41 +0100	[thread overview]
Message-ID: <529DED69.4080300@gmail.com> (raw)
In-Reply-To: <20131128103838.GB14615@sigill.intra.peff.net>

Am 28.11.2013 11:38, schrieb Jeff King:
> On Wed, Nov 27, 2013 at 10:08:56AM +0100, Karsten Blees wrote:
> 
>> Khash is OK for sha1 keys, but I don't think it should be advertised
>> as a second general purpose hash table implementation. Its far too
>> easy to shoot yourself in the foot by using 'straightforward' hash-
>> and comparison functions. Khash doesn't store the hash codes of the
>> keys, so you have to take care of that yourself or live with the
>> performance penalties (see [1]).
>>
>> [1] http://article.gmane.org/gmane.comp.version-control.git/237876
> 
> Yes. I wonder if we should improve it in that respect. I haven't looked
> carefully at the hash code you posted elsewhere, but I feel like many
> uses will want a macro implementation to let them store arbitrary types
> smaller or larger than a pointer.
> 
> -Peff
> 

IMO, trying to improve khash isn't worth the trouble.

With smaller-than-pointer types, khash _may_ actually save a few bytes compared to hash.[ch] or hashmap.[ch]. E.g. a set of 'int's would be a perfect use case for khash. However, you're using bitmaps for that, and khash's predefined macros for int sets and maps have been removed from the git version.

Using khash with pointers and larger-than-pointer types is just a waste of memory and performance, though. Hash tables are sparsely filled by design, and using large types means lots of large empty buckets. E.g. kh_resize requires almost four times the size or your data, and copies everything at least three times.

Additionally, there are the obvious problems with khash's macro design (hard to read, impossible to debug, and each named instance increases executable size by ~4k).


Below is a patch that converts pack-bitmap.c to hashmap. Its not even longer than the khash version, and the hashmap API forces you to think about no-brainer improvements such as specifying an expected size or skipping duplicates checks where they aren't needed. I could do the same for pack-bitmap-write.c if you like.

Karsten

-----------8<----------
Subject: [PATCH] pack-bitmap.c: convert to new hashmap implementation

Preallocates bitmap_index.bitmaps with the expected size to prevent
rehashing.

Combines the five arrays of ext_index ('objects', 'hashes' and khash's
'keys', 'flags' and 'vals') into a single structure for better locality.

Removes two unnecessary duplicates checks:
 - we don't expect a pack index file to contain duplicate sha1's
 - ext_index_add_object is only called after bitmap_position returned -1,
   so it is impossible that the entry is already there

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 pack-bitmap.c | 166 ++++++++++++++++++++++++++++------------------------------
 1 file changed, 81 insertions(+), 85 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 078f7c6..115caed 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -15,6 +15,7 @@
  * commit.
  */
 struct stored_bitmap {
+	struct hashmap_entry ent;
 	unsigned char sha1[20];
 	struct ewah_bitmap *root;
 	struct stored_bitmap *xor;
@@ -22,6 +23,16 @@ struct stored_bitmap {
 };
 
 /*
+ * An entry in the extended index.
+ */
+struct ext_entry {
+	struct hashmap_entry ent;
+	uint32_t name_hash;
+	struct object *object;
+	unsigned int nr;
+};
+
+/*
  * The currently active bitmap index. By design, repositories only have
  * a single bitmap index available (the index for the biggest packfile in
  * the repository), since bitmap indexes need full closure.
@@ -61,7 +72,7 @@ static struct bitmap_index {
 	struct ewah_bitmap *tags;
 
 	/* Map from SHA1 -> `stored_bitmap` for all the bitmapped comits */
-	khash_sha1 *bitmaps;
+	struct hashmap bitmaps;
 
 	/* Number of bitmapped commits */
 	uint32_t entry_count;
@@ -76,12 +87,7 @@ static struct bitmap_index {
 	 * packed in `pack`, these objects are added to this "fake index" and
 	 * are assumed to appear at the end of the packfile for all operations
 	 */
-	struct eindex {
-		struct object **objects;
-		uint32_t *hashes;
-		uint32_t count, alloc;
-		khash_sha1_pos *positions;
-	} ext_index;
+	struct hashmap ext_index;
 
 	/* Bitmap result of the last performed walk */
 	struct bitmap *result;
@@ -93,6 +99,21 @@ static struct bitmap_index {
 
 } bitmap_git;
 
+static int stored_bitmap_cmp(const struct stored_bitmap *entry,
+		const struct stored_bitmap *entry_or_key,
+		const unsigned char *sha1)
+{
+	return hashcmp(entry->sha1, sha1 ? sha1 : entry_or_key->sha1);
+}
+
+static int ext_entry_cmp(const struct ext_entry *entry,
+		const struct ext_entry *entry_or_key,
+		const unsigned char *sha1)
+{
+	return hashcmp(entry->object->sha1,
+		       sha1 ? sha1 : entry_or_key->object->sha1);
+}
+
 static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st)
 {
 	struct ewah_bitmap *parent;
@@ -112,6 +133,16 @@ static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st)
 	return composed;
 }
 
+static struct ewah_bitmap *lookup_stored_bitmap_sha1(const unsigned char *sha1)
+{
+	struct stored_bitmap key, *st;
+	hashmap_entry_init(&key, __kh_oid_hash(sha1));
+	st = hashmap_get(&bitmap_git.bitmaps, &key, sha1);
+	if (st)
+		return lookup_stored_bitmap(st);
+	return NULL;
+}
+
 /*
  * Read a bitmap from the current read position on the mmaped
  * index, and increase the read position accordingly
@@ -174,26 +205,15 @@ static struct stored_bitmap *store_bitmap(struct bitmap_index *index,
 					  int flags)
 {
 	struct stored_bitmap *stored;
-	khiter_t hash_pos;
-	int ret;
 
 	stored = xmalloc(sizeof(struct stored_bitmap));
+	hashmap_entry_init(stored, __kh_oid_hash(sha1));
 	stored->root = root;
 	stored->xor = xor_with;
 	stored->flags = flags;
 	hashcpy(stored->sha1, sha1);
 
-	hash_pos = kh_put_sha1(index->bitmaps, stored->sha1, &ret);
-
-	/* a 0 return code means the insertion succeeded with no changes,
-	 * because the SHA1 already existed on the map. this is bad, there
-	 * shouldn't be duplicated commits in the index */
-	if (ret == 0) {
-		error("Duplicate entry in bitmap index: %s", sha1_to_hex(sha1));
-		return NULL;
-	}
-
-	kh_value(index->bitmaps, hash_pos) = stored;
+	hashmap_add(&index->bitmaps, stored);
 	return stored;
 }
 
@@ -291,8 +311,9 @@ static int load_pack_bitmap(void)
 {
 	assert(bitmap_git.map && !bitmap_git.loaded);
 
-	bitmap_git.bitmaps = kh_init_sha1();
-	bitmap_git.ext_index.positions = kh_init_sha1_pos();
+	hashmap_init(&bitmap_git.bitmaps, (hashmap_cmp_fn) stored_bitmap_cmp,
+			bitmap_git.entry_count);
+	hashmap_init(&bitmap_git.ext_index, (hashmap_cmp_fn) ext_entry_cmp, 0);
 	bitmap_git.reverse_index = revindex_for_pack(bitmap_git.pack);
 
 	if (!(bitmap_git.commits = read_bitmap_1(&bitmap_git)) ||
@@ -362,13 +383,12 @@ struct include_data {
 
 static inline int bitmap_position_extended(const unsigned char *sha1)
 {
-	khash_sha1_pos *positions = bitmap_git.ext_index.positions;
-	khiter_t pos = kh_get_sha1_pos(positions, sha1);
+	struct ext_entry key, *e;
+	hashmap_entry_init(&key, __kh_oid_hash(sha1));
+	e = hashmap_get(&bitmap_git.bitmaps, &key, sha1);
 
-	if (pos < kh_end(positions)) {
-		int bitmap_pos = kh_value(positions, pos);
-		return bitmap_pos + bitmap_git.pack->num_objects;
-	}
+	if (e)
+		return e->nr + bitmap_git.pack->num_objects;
 
 	return -1;
 }
@@ -390,32 +410,13 @@ static int bitmap_position(const unsigned char *sha1)
 
 static int ext_index_add_object(struct object *object, const char *name)
 {
-	struct eindex *eindex = &bitmap_git.ext_index;
-
-	khiter_t hash_pos;
-	int hash_ret;
-	int bitmap_pos;
-
-	hash_pos = kh_put_sha1_pos(eindex->positions, object->sha1, &hash_ret);
-	if (hash_ret > 0) {
-		if (eindex->count >= eindex->alloc) {
-			eindex->alloc = (eindex->alloc + 16) * 3 / 2;
-			eindex->objects = xrealloc(eindex->objects,
-				eindex->alloc * sizeof(struct object *));
-			eindex->hashes = xrealloc(eindex->hashes,
-				eindex->alloc * sizeof(uint32_t));
-		}
-
-		bitmap_pos = eindex->count;
-		eindex->objects[eindex->count] = object;
-		eindex->hashes[eindex->count] = pack_name_hash(name);
-		kh_value(eindex->positions, hash_pos) = bitmap_pos;
-		eindex->count++;
-	} else {
-		bitmap_pos = kh_value(eindex->positions, hash_pos);
-	}
-
-	return bitmap_pos + bitmap_git.pack->num_objects;
+	struct ext_entry *e = xmalloc(sizeof(struct ext_entry));
+	hashmap_entry_init(e, __kh_oid_hash(object->sha1));
+	e->object = object;
+	e->name_hash = pack_name_hash(name);
+	e->nr = bitmap_git.ext_index.size;
+	hashmap_add(&bitmap_git.ext_index, e);
+	return e->nr + bitmap_git.pack->num_objects;
 }
 
 static void show_object(struct object *object, const struct name_path *path,
@@ -443,7 +444,7 @@ static int add_to_include_set(struct include_data *data,
 			      const unsigned char *sha1,
 			      int bitmap_pos)
 {
-	khiter_t hash_pos;
+	struct ewah_bitmap *bm;
 
 	if (data->seen && bitmap_get(data->seen, bitmap_pos))
 		return 0;
@@ -451,10 +452,9 @@ static int add_to_include_set(struct include_data *data,
 	if (bitmap_get(data->base, bitmap_pos))
 		return 0;
 
-	hash_pos = kh_get_sha1(bitmap_git.bitmaps, sha1);
-	if (hash_pos < kh_end(bitmap_git.bitmaps)) {
-		struct stored_bitmap *st = kh_value(bitmap_git.bitmaps, hash_pos);
-		bitmap_or_ewah(data->base, lookup_stored_bitmap(st));
+	bm = lookup_stored_bitmap_sha1(sha1);
+	if (bm) {
+		bitmap_or_ewah(data->base, bm);
 		return 0;
 	}
 
@@ -507,12 +507,9 @@ static struct bitmap *find_objects(struct rev_info *revs,
 		roots = roots->next;
 
 		if (object->type == OBJ_COMMIT) {
-			khiter_t pos = kh_get_sha1(bitmap_git.bitmaps, object->sha1);
-
-			if (pos < kh_end(bitmap_git.bitmaps)) {
-				struct stored_bitmap *st = kh_value(bitmap_git.bitmaps, pos);
-				struct ewah_bitmap *or_with = lookup_stored_bitmap(st);
-
+			struct ewah_bitmap *or_with =
+				lookup_stored_bitmap_sha1(object->sha1);
+			if (or_with) {
 				if (base == NULL)
 					base = ewah_to_bitmap(or_with);
 				else
@@ -584,17 +581,15 @@ static struct bitmap *find_objects(struct rev_info *revs,
 static void show_extended_objects(struct bitmap *objects,
 				  show_reachable_fn show_reach)
 {
-	struct eindex *eindex = &bitmap_git.ext_index;
-	uint32_t i;
-
-	for (i = 0; i < eindex->count; ++i) {
-		struct object *obj;
+	struct ext_entry *e;
+	struct hashmap_iter iter;
 
-		if (!bitmap_get(objects, bitmap_git.pack->num_objects + i))
+	for (e = hashmap_iter_first(&bitmap_git.ext_index, &iter); e;
+	     e = hashmap_iter_next(&iter)) {
+		if (!bitmap_get(objects, bitmap_git.pack->num_objects + e->nr))
 			continue;
 
-		obj = eindex->objects[i];
-		show_reach(obj->sha1, obj->type, 0, eindex->hashes[i], NULL, 0);
+		show_reach(e->object->sha1, e->object->type, 0, e->name_hash, NULL, 0);
 	}
 }
 
@@ -831,7 +826,8 @@ void traverse_bitmap_commit_list(show_reachable_fn show_reachable)
 static uint32_t count_object_type(struct bitmap *objects,
 				  enum object_type type)
 {
-	struct eindex *eindex = &bitmap_git.ext_index;
+	struct ext_entry *e;
+	struct hashmap_iter iter;
 
 	uint32_t i = 0, count = 0;
 	struct ewah_iterator it;
@@ -863,9 +859,10 @@ static uint32_t count_object_type(struct bitmap *objects,
 		count += ewah_bit_popcount64(word);
 	}
 
-	for (i = 0; i < eindex->count; ++i) {
-		if (eindex->objects[i]->type == type &&
-			bitmap_get(objects, bitmap_git.pack->num_objects + i))
+	for (e = hashmap_iter_first(&bitmap_git.ext_index, &iter); e;
+	     e = hashmap_iter_next(&iter)) {
+		if (e->object->type == type &&
+			bitmap_get(objects, bitmap_git.pack->num_objects + e->nr))
 			count++;
 	}
 
@@ -928,7 +925,7 @@ void test_bitmap_walk(struct rev_info *revs)
 {
 	struct object *root;
 	struct bitmap *result = NULL;
-	khiter_t pos;
+	struct ewah_bitmap *bm;
 	size_t result_popcnt;
 	struct bitmap_test_data tdata;
 
@@ -942,12 +939,9 @@ void test_bitmap_walk(struct rev_info *revs)
 		bitmap_git.version, bitmap_git.entry_count);
 
 	root = revs->pending.objects[0].item;
-	pos = kh_get_sha1(bitmap_git.bitmaps, root->sha1);
-
-	if (pos < kh_end(bitmap_git.bitmaps)) {
-		struct stored_bitmap *st = kh_value(bitmap_git.bitmaps, pos);
-		struct ewah_bitmap *bm = lookup_stored_bitmap(st);
 
+	bm = lookup_stored_bitmap_sha1(root->sha1);
+	if (bm) {
 		fprintf(stderr, "Found bitmap for %s. %d bits / %08x checksum\n",
 			sha1_to_hex(root->sha1), (int)bm->bit_size, ewah_checksum(bm));
 
@@ -1020,6 +1014,7 @@ int rebuild_existing_bitmaps(struct packing_data *mapping,
 	struct bitmap *rebuild;
 	struct stored_bitmap *stored;
 	struct progress *progress = NULL;
+	struct hashmap_iter iter;
 
 	khiter_t hash_pos;
 	int hash_ret;
@@ -1049,7 +1044,8 @@ int rebuild_existing_bitmaps(struct packing_data *mapping,
 	if (show_progress)
 		progress = start_progress("Reusing bitmaps", 0);
 
-	kh_foreach_value(bitmap_git.bitmaps, stored, {
+	for (stored = hashmap_iter_first(&bitmap_git.bitmaps, &iter); stored;
+	     stored = hashmap_iter_next(&iter)) {
 		if (stored->flags & BITMAP_FLAG_REUSE) {
 			if (!rebuild_bitmap(reposition,
 					    lookup_stored_bitmap(stored),
@@ -1063,7 +1059,7 @@ int rebuild_existing_bitmaps(struct packing_data *mapping,
 			bitmap_reset(rebuild);
 			display_progress(progress, ++i);
 		}
-	});
+	}
 
 	stop_progress(&progress);
 
-- 

  reply	other threads:[~2013-12-03 14:40 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-14 12:41 [PATCH v3 0/21] pack bitmaps Jeff King
2013-11-14 12:42 ` [PATCH v3 01/21] sha1write: make buffer const-correct Jeff King
2013-11-14 12:42 ` [PATCH v3 02/21] revindex: Export new APIs Jeff King
2013-11-14 12:42 ` [PATCH v3 03/21] pack-objects: Refactor the packing list Jeff King
2013-11-14 12:43 ` [PATCH v3 04/21] pack-objects: factor out name_hash Jeff King
2013-11-14 12:43 ` [PATCH v3 05/21] revision: allow setting custom limiter function Jeff King
2013-11-14 12:43 ` [PATCH v3 06/21] sha1_file: export `git_open_noatime` Jeff King
2013-11-14 12:43 ` [PATCH v3 07/21] compat: add endianness helpers Jeff King
2013-11-14 12:43 ` [PATCH v3 08/21] ewah: compressed bitmap implementation Jeff King
2013-11-14 12:44 ` [PATCH v3 09/21] documentation: add documentation for the bitmap format Jeff King
2013-11-14 12:44 ` [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes Jeff King
2013-11-24 21:36   ` Thomas Rast
2013-11-25 15:04     ` [PATCH] Document khash Thomas Rast
2013-11-28 10:35       ` Jeff King
2013-11-27  9:08     ` [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes Karsten Blees
2013-11-28 10:38       ` Jeff King
2013-12-03 14:40         ` Karsten Blees [this message]
2013-12-03 18:21           ` Jeff King
2013-12-07 20:52             ` Karsten Blees
2013-11-29 21:21   ` Thomas Rast
2013-12-02 16:12     ` Jeff King
2013-12-02 20:36       ` Junio C Hamano
2013-12-02 20:47         ` Jeff King
2013-12-02 21:43           ` Junio C Hamano
2013-11-14 12:45 ` [PATCH v3 11/21] pack-objects: use bitmaps when packing objects Jeff King
2013-12-07 15:47   ` Thomas Rast
2013-12-21 13:15     ` Jeff King
2013-11-14 12:45 ` [PATCH v3 12/21] rev-list: add bitmap mode to speed up object lists Jeff King
2013-12-07 16:05   ` Thomas Rast
2013-11-14 12:45 ` [PATCH v3 13/21] pack-objects: implement bitmap writing Jeff King
2013-12-07 16:32   ` Thomas Rast
2013-12-21 13:17     ` Jeff King
2013-11-14 12:45 ` [PATCH v3 14/21] repack: stop using magic number for ARRAY_SIZE(exts) Jeff King
2013-12-07 16:34   ` Thomas Rast
2013-11-14 12:46 ` [PATCH v3 15/21] repack: turn exts array into array-of-struct Jeff King
2013-12-07 16:34   ` Thomas Rast
2013-11-14 12:46 ` [PATCH v3 16/21] repack: handle optional files created by pack-objects Jeff King
2013-12-07 16:35   ` Thomas Rast
2013-11-14 12:46 ` [PATCH v3 17/21] repack: consider bitmaps when performing repacks Jeff King
2013-12-07 16:37   ` Thomas Rast
2013-11-14 12:46 ` [PATCH v3 18/21] count-objects: recognize .bitmap in garbage-checking Jeff King
2013-12-07 16:38   ` Thomas Rast
2013-11-14 12:46 ` [PATCH v3 19/21] t: add basic bitmap functionality tests Jeff King
2013-12-07 16:43   ` Thomas Rast
2013-12-21 13:22     ` Jeff King
2013-11-14 12:48 ` [PATCH v3 20/21] t/perf: add tests for pack bitmaps Jeff King
2013-12-07 16:51   ` Thomas Rast
2013-12-21 13:40     ` Jeff King
2013-11-14 12:48 ` [PATCH v3 21/21] pack-bitmap: implement optional name_hash cache Jeff King
2013-12-07 16:59   ` Thomas Rast
2013-11-14 19:19 ` [PATCH v3 0/21] pack bitmaps Ramsay Jones
2013-11-14 21:33   ` Jeff King
2013-11-14 23:09     ` Ramsay Jones
2013-11-18 21:16       ` Ramsay Jones
2013-11-16 10:28     ` Thomas Rast

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=529DED69.4080300@gmail.com \
    --to=karsten.blees@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=tr@thomasrast.ch \
    --cc=vicent@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).