Subject: [PATCH 1/2] obj_hash: convert to a critbit tree
From: Jeff King @ 2016-09-14 23:56 UTC
To: git
For operations that traverse the whole reachability graph,
like "rev-list --objects", the obj_hash in object.c shows up
as a hotspot. We basically have to do "nr_commits *
size_of_tree" hash lookups, because each tree we look at, we
need to say "have we seen this sha1 yet?" (it's a little
more complicated because we avoid digging into sub-trees we
know we've already seen, but it's a reasonable
approximation). So graph traversal operations like that are
very sensitive to improvements in lookup time.
For keys with length "m", a hash table is generally O(m) to
compute the hash (and verify that you've found the correct
key), and O(1) in finding the correct slot. The latter is
subject to collisions and probing strategy, but we keep the
load factor on the obj_hash table below 50%, which is quite
low. And we have a good hash (we reuse the sha1s themselves,
which are effectively random). So we'd expect relatively few
collisions, and past experiments to tweak the probing
strategy haven't yielded any benefit (we use linear probing;
I tried quadratic probing at one point, and Junio tried
cuckoo hashing).
Another data structure with similar properties is sometimes
known as a "critbit tree" (see http://cr.yp.to/critbit.html
for discussion and history). The basic idea is a binary trie
where each node is either an internal node, representing a
single bit of a stored key, or a leaf node, representing one
or more "remainder" bits. So if you were storing two bit
sequences "1011" and "1100", you'd have three nodes (besides
the root):
         (root)
         /    \
        0      1
       /        \
    NULL     (internal)
              /      \
             0        1
            /          \
          "11"        "00"
So finding "1011" involves traversing the trie: down the "1"
side, then the "0" side, and then check that the rest
matches "11".
Looking up a key is O(m), and there's no issue with
collisions or probing. It can use less memory than a hash
table, because there's no load factor to care about.
That's the good news. The bad news is that big-O analysis
is not the whole story. You'll notice that we have to do a
lot of conditional pointer-jumping to walk the trie, whereas
a hash table can jump right to a proposed key and do a
memcmp() to see if we got it.
So I wasn't overly optimistic that this would be any faster.
And indeed it's not. It's about three times slower (about
4.5s versus 1.5s running "rev-list --objects --all" on
git.git).
The reason is, I think, a combination of:
0. We care much more about performance for this hash than
memory efficiency. So we keep the load factor
quite low, and thus reduce the number of collisions.
1. For many hash tables, computing the hash is expensive.
Not so here, because we are storing sha1s. So it is
literally just casting the first 4 bytes of the sha1 to
an int; we don't even look at the other bytes until the
collision check (and because of point 0, we don't
generally have to do very many collision checks during
our probe).
2. The collision check _does_ have to look at all 20 bytes
of the sha1. And we may have to do it multiple times as
we linearly probe the collisions. _But_ it can be done
with memcmp(), which is optimized to compare 32 or 64
bits at a time. So our O(m) has a very nice constant
factor.
Whereas in the critbit tree, we pay an instruction for
_each_ bit we look at.
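To make points 1 and 2 concrete, the existing lookup (the code
removed from object.c in the patch below) condenses to roughly
this, leaving out the move-to-front trick:

  struct object *lookup_object(const unsigned char *sha1)
  {
          /* point 1: the hash is just the leading bytes of the sha1 */
          unsigned int i = sha1hash(sha1) & (obj_hash_size - 1);

          /*
           * point 2: linear probe; the full 20-byte compare happens
           * only on candidate slots, and hashcmp() is a memcmp()
           * underneath
           */
          while (obj_hash[i]) {
                  if (!hashcmp(sha1, obj_hash[i]->oid.hash))
                          return obj_hash[i];
                  if (++i == obj_hash_size)
                          i = 0;
          }
          return NULL;
  }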
It's possible that (2) would be better if instead of a
critbit tree, we used a "critbyte" tree. That would involve
fewer node traversals, at the cost of making each node
larger (right now the internal nodes store 2 pointer slots;
they'd have to store 256 to handle a full byte). I didn't
try it, but I suspect it would still be slower for the same
reasons.
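For comparison, a hypothetical critbyte node might look
something like this (versus the two-pointer critbit_node in the
patch below):

  struct critbyte_node {
          void *child[256];   /* one slot per possible byte value */
          size_t byte;        /* which byte of the key to branch on */
  };

A 20-byte sha1 then costs at most 20 hops, but each internal
node carries roughly 2KB of child pointers on a 64-bit machine,
which is in line with the suspicion above that it would still
lose on cache behavior.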
The code is below for reference. I pulled the critbit
implementation from:
https://github.com/ennorehling/critbit
but made a few tweaks:
- the critbit tree does not store the keys themselves, as we
already have them in the "struct object", and do not
want to waste space. As a result, I baked in some
assumptions about the 20-byte key-length (which is fine
for our purposes here).
- rather than malloc() each node, I used a pool allocator
(to try to keep the nodes closer together in cache)
Signed-off-by: Jeff King <peff@peff.net>
---
Makefile | 1 +
critbit.c | 350 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
critbit.h | 60 +++++++++++
object.c | 85 ++-------------
4 files changed, 417 insertions(+), 79 deletions(-)
create mode 100644 critbit.c
create mode 100644 critbit.h
diff --git a/Makefile b/Makefile
index 7f18492..99e0eb2 100644
--- a/Makefile
+++ b/Makefile
@@ -720,6 +720,7 @@ LIB_OBJS += connected.o
LIB_OBJS += convert.o
LIB_OBJS += copy.o
LIB_OBJS += credential.o
+LIB_OBJS += critbit.o
LIB_OBJS += csum-file.o
LIB_OBJS += ctype.o
LIB_OBJS += date.o
diff --git a/critbit.c b/critbit.c
new file mode 100644
index 0000000..7354375
--- /dev/null
+++ b/critbit.c
@@ -0,0 +1,350 @@
+/*
+Copyright (c) 2012, Enno Rehling <enno@eressea.de>
+
+Permission to use, copy, modify, and/or distribute this software for any
+purpose with or without fee is hereby granted, provided that the above
+copyright notice and this permission notice appear in all copies.
+
+THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+**/
+
+#include "cache.h"
+#include "critbit.h"
+
+/* see http://cr.yp.to/critbit.html */
+struct critbit_node {
+ void * child[2];
+ size_t byte;
+ unsigned int mask;
+};
+
+#define EXTERNAL_NODE 0
+#define INTERNAL_NODE 1
+
+static int decode_pointer(void ** ptr)
+{
+ ptrdiff_t numvalue = (char*)*ptr - (char*)0;
+ if (numvalue & 1) {
+ *ptr = (void*)(numvalue - 1);
+ return EXTERNAL_NODE;
+ }
+ return INTERNAL_NODE;
+}
+
+static void * make_external_node(const void * key, size_t keylen)
+{
+ return (unsigned char *)key + 1;
+}
+
+static void from_external_node(void * ptr, void **key, size_t *keylen)
+{
+ /* eek */
+ *keylen = 20;
+ *key = ptr;
+}
+
+static struct critbit_node *node_pool;
+static int node_pool_used;
+static int node_pool_alloc;
+
+static struct critbit_node * make_internal_node(void)
+{
+ if (node_pool_used == node_pool_alloc) {
+ node_pool_alloc = 1024;
+ node_pool_used = 0;
+ ALLOC_ARRAY(node_pool, node_pool_alloc);
+ }
+ return &node_pool[node_pool_used++];
+}
+
+static int cb_less(const struct critbit_node * a, const struct critbit_node * b)
+{
+ return a->byte < b->byte || (a->byte == b->byte && a->mask < b->mask);
+}
+
+int cb_insert(critbit_tree * cb, const void * key, size_t keylen)
+{
+ assert(cb);
+ assert(key);
+ if (!cb->root) {
+ cb->root = make_external_node(key, keylen);
+ return CB_SUCCESS;
+ }
+ else {
+ void ** iter = &cb->root;
+ struct critbit_node * prev = 0;
+ for (;;) {
+ void * ptr = *iter;
+ if (decode_pointer(&ptr) == INTERNAL_NODE) {
+ struct critbit_node * node = (struct critbit_node *)ptr;
+ unsigned char * bytes = (unsigned char *)key;
+ int branch = (keylen <= node->byte) ? 0 : ((1 + ((bytes[node->byte] | node->mask) & 0xFF)) >> 8);
+ iter = &node->child[branch];
+ prev = node;
+ }
+ else {
+ unsigned char *iptr, *bytes = (unsigned char *)key, *ikey = bytes;
+ void * vptr;
+ unsigned int mask, byte = 0;
+ int branch;
+ size_t len;
+ struct critbit_node * node;
+
+ from_external_node(ptr, &vptr, &len);
+
+ for (iptr = vptr; byte < keylen && byte < len && *ikey == *iptr;) {
+ ++ikey;
+ ++iptr;
+ ++byte;
+ }
+
+ if (byte == keylen && byte == len) {
+ return CB_EXISTS; /* duplicate entry */
+ }
+ node = make_internal_node();
+ node->byte = byte;
+ mask = *ikey ^ *iptr; /* these are all the bits that differ */
+ mask |= mask >> 1;
+ mask |= mask >> 2;
+ mask |= mask >> 4; /* now, every bit up to the MSB is set to 1 */
+ mask = (mask&~(mask >> 1)) ^ 0xFF;
+ node->mask = (unsigned char)mask;
+
+ /* find the right place to insert, iff prev's crit-bit is later in the string than new crit-bit */
+ if (prev && cb_less(node, prev)) {
+ for (iter = &cb->root;;) {
+ ptr = *iter;
+ if (decode_pointer(&ptr) == INTERNAL_NODE) {
+ struct critbit_node * next = (struct critbit_node *)ptr;
+ if (cb_less(next, node)) {
+ branch = ((1 + ((bytes[next->byte] | next->mask) & 0xFF)) >> 8);
+ iter = &next->child[branch];
+ }
+ else {
+ break;
+ }
+ }
+ else {
+ assert(!"should never get here");
+ }
+ }
+ }
+
+ branch = ((1 + ((*ikey | node->mask) & 0xFF)) >> 8);
+ node->child[branch] = make_external_node(key, keylen);
+ node->child[1 - branch] = *iter;
+ *iter = (void *)node;
+
+ return CB_SUCCESS;
+ }
+ }
+ }
+}
+
+static void * cb_find_top_i(const critbit_tree * cb, const void * key, size_t keylen)
+{
+ void *ptr, *top = 0;
+ assert(key);
+
+ if (!cb->root) {
+ return 0;
+ }
+ for (ptr = cb->root, top = cb->root;;) {
+ void * last = ptr;
+ if (decode_pointer(&ptr) == INTERNAL_NODE) {
+ struct critbit_node * node = (struct critbit_node *)ptr;
+ int branch;
+ if (keylen <= node->byte) {
+ break;
+ }
+ else {
+ unsigned char * bytes = (unsigned char *)key;
+ top = last;
+ branch = (1 + ((bytes[node->byte] | node->mask) & 0xFF)) >> 8;
+ ptr = node->child[branch];
+ }
+ }
+ else {
+ /* we reached an external node before exhausting the key length */
+ top = last;
+ break;
+ }
+ }
+ return top;
+}
+
+static int cb_find_prefix_i(void * ptr, const void * key, size_t keylen, void ** results, int numresults, int * offset, int next)
+{
+ assert(next <= numresults);
+ if (next == numresults) {
+ return next;
+ }
+ else if (decode_pointer(&ptr) == INTERNAL_NODE) {
+ struct critbit_node * node = (struct critbit_node *)ptr;
+ next = cb_find_prefix_i(node->child[0], key, keylen, results, numresults, offset, next);
+ if (next < numresults) {
+ next = cb_find_prefix_i(node->child[1], key, keylen, results, numresults, offset, next);
+ }
+ }
+ else {
+ /* reached an external node */
+ char * str;
+ void * vptr;
+ size_t len;
+
+ from_external_node(ptr, &vptr, &len);
+ str = vptr;
+ if (len >= keylen && memcmp(key, str, keylen) == 0) {
+ if (*offset>0) {
+ --*offset;
+ }
+ else {
+ results[next++] = str;
+ }
+ }
+ }
+ return next;
+}
+
+int cb_find_prefix(const critbit_tree * cb, const void * key, size_t keylen, void ** results, int numresults, int offset)
+{
+ if (numresults > 0) {
+ void *top = cb_find_top_i(cb, key, keylen);
+ if (top) {
+ /* recursively add all children except the ones from [0-offset) of top to the results */
+ return cb_find_prefix_i(top, key, keylen, results, numresults, &offset, 0);
+ }
+ }
+ return 0;
+}
+
+static int cb_foreach_i(void * ptr, const void * key, size_t keylen, int(*match_cb)(const void * match, const void * key, size_t keylen, void *), void *data)
+{
+ int result = 0;
+
+ if (decode_pointer(&ptr) == INTERNAL_NODE) {
+ struct critbit_node * node = (struct critbit_node *)ptr;
+ result = cb_foreach_i(node->child[0], key, keylen, match_cb, data);
+ if (!result) {
+ result = cb_foreach_i(node->child[1], key, keylen, match_cb, data);
+ }
+ }
+ else {
+ /* reached an external node */
+ void * match;
+ size_t len;
+
+ from_external_node(ptr, &match, &len);
+ if (len >= keylen && memcmp(key, match, keylen) == 0) {
+ return match_cb(match, key, keylen, data);
+ }
+ }
+ return result;
+}
+
+int cb_foreach(critbit_tree * cb, const void * key, size_t keylen, int(*match_cb)(const void * match, const void * key, size_t keylen, void *), void *data)
+{
+ void *top = cb_find_top_i(cb, key, keylen);
+ if (top) {
+ /* recursively add all children except the ones from [0-offset) of top to the results */
+ return cb_foreach_i(top, key, keylen, match_cb, data);
+ }
+ return 0;
+}
+
+const void * cb_find(critbit_tree * cb, const void * key, size_t keylen)
+{
+ void * str;
+ size_t len;
+ unsigned char * bytes = (unsigned char *)key;
+ void * ptr;
+
+ assert(cb);
+ assert(key);
+ if (!cb->root) return 0;
+ for (ptr = cb->root; decode_pointer(&ptr) == INTERNAL_NODE; ) {
+ struct critbit_node *node = (struct critbit_node *)ptr;
+ uint8_t c = 0;
+ int direction;
+ if (node->byte < keylen)
+ c = bytes[node->byte];
+ direction = (1+(c|node->mask))>>8;
+ ptr = node->child[direction];
+ }
+ from_external_node(ptr, &str, &len);
+ if (len >= keylen && memcmp(key, str, keylen) == 0) {
+ return str;
+ }
+ return 0;
+}
+
+int cb_erase(critbit_tree * cb, const void * key, size_t keylen)
+{
+ void **iter = NULL;
+ void *ptr = cb->root;
+ struct critbit_node *parent = 0;
+ unsigned char * bytes = (unsigned char *)key;
+ int branch;
+
+ if (!cb->root) return CB_NOTFOUND;
+
+ for (;;) {
+ int type;
+
+ type = decode_pointer(&ptr);
+ if (type == INTERNAL_NODE) {
+ iter = parent ? &parent->child[branch] : &cb->root;
+ parent = (struct critbit_node*)ptr;
+ branch = (keylen <= parent->byte) ? 0 : ((1 + ((bytes[parent->byte] | parent->mask) & 0xFF)) >> 8);
+ ptr = parent->child[branch];
+ }
+ else {
+ void * str;
+ size_t len;
+ from_external_node(ptr, &str, &len);
+ if (len == keylen && memcmp(key, str, len) == 0) {
+ free(ptr);
+ if (iter) {
+ *iter = parent->child[1 - branch];
+ free(parent);
+ }
+ else {
+ cb->root = 0;
+ }
+ return CB_SUCCESS;
+ }
+ return CB_NOTFOUND;
+ }
+ }
+}
+
+size_t cb_new_kv(const char *key, size_t keylen, const void * value, size_t len, void * out)
+{
+ char * dst = (char*)out;
+ if (dst != key) {
+ memmove(dst, key, keylen);
+ }
+ dst[keylen] = 0;
+ memmove(dst + keylen + 1, value, len);
+ return len + keylen + 1;
+}
+
+void cb_get_kv(const void *kv, void * value, size_t len)
+{
+ const char * key = (const char *)kv;
+ size_t keylen = strlen(key) + 1;
+ memmove(value, key + keylen, len);
+}
+
+void cb_get_kv_ex(void *kv, void ** value)
+{
+ char * key = (char *)kv;
+ size_t keylen = strlen(key) + 1;
+ *value = key + keylen;
+}
diff --git a/critbit.h b/critbit.h
new file mode 100644
index 0000000..ee774e8
--- /dev/null
+++ b/critbit.h
@@ -0,0 +1,60 @@
+/*
+Copyright (c) 2012, Enno Rehling <enno@eressea.de>
+
+Permission to use, copy, modify, and/or distribute this software for any
+purpose with or without fee is hereby granted, provided that the above
+copyright notice and this permission notice appear in all copies.
+
+THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+**/
+
+#ifndef _CRITBITT_H
+#define _CRITBITT_H
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+
+typedef struct critbit_tree {
+ void * root;
+} critbit_tree;
+
+#define CB_SUCCESS 0
+#define CB_EXISTS 1
+#define CB_NOTFOUND 2
+
+#define CRITBIT_TREE() { 0 }
+
+int cb_insert(critbit_tree * cb, const void * key, size_t keylen);
+const void * cb_find(critbit_tree * cb, const void * key, size_t keylen);
+int cb_erase(critbit_tree * cb, const void * key, size_t keylen);
+int cb_find_prefix(const critbit_tree * cb, const void * key, size_t keylen, void ** results, int numresults, int offset);
+int cb_foreach(critbit_tree * cb, const void * key, size_t keylen, int (*match_cb)(const void * match, const void * key, size_t keylen, void *), void *data);
+void cb_clear(critbit_tree * cb);
+
+#define cb_insert_str(cb, key) \
+ cb_insert(cb, (void *)key, strlen(key)+1)
+#define cb_find_str(cb, key) \
+ (const char *)cb_find(cb, (void *)key, strlen(key)+1)
+#define cb_erase_str(cb, key) \
+ cb_erase(cb, (void *)key, strlen(key)+1)
+#define cb_find_prefix_str(cb, key, results, numresults, offset) \
+ cb_find_prefix(cb, (const void *)key, strlen(key), results, numresults, offset)
+
+#define CB_KV_SIZE(keylen, datalen) (keylen+datalen+1)
+size_t cb_new_kv(const char *key, size_t keylen, const void * value, size_t len, void * out);
+void cb_get_kv(const void *kv, void * value, size_t len);
+void cb_get_kv_ex(void *kv, void ** value);
+
+#ifdef __cplusplus
+}
+#endif
+#endif
diff --git a/object.c b/object.c
index 67d9a9e..24d26a8 100644
--- a/object.c
+++ b/object.c
@@ -4,6 +4,7 @@
#include "tree.h"
#include "commit.h"
#include "tag.h"
+#include "critbit.h"
static struct object **obj_hash;
static int nr_objs, obj_hash_size;
@@ -51,32 +52,8 @@ int type_from_string_gently(const char *str, ssize_t len, int gentle)
die("invalid object type \"%s\"", str);
}
-/*
- * Return a numerical hash value between 0 and n-1 for the object with
- * the specified sha1. n must be a power of 2. Please note that the
- * return value is *not* consistent across computer architectures.
- */
-static unsigned int hash_obj(const unsigned char *sha1, unsigned int n)
-{
- return sha1hash(sha1) & (n - 1);
-}
-
-/*
- * Insert obj into the hash table hash, which has length size (which
- * must be a power of 2). On collisions, simply overflow to the next
- * empty bucket.
- */
-static void insert_obj_hash(struct object *obj, struct object **hash, unsigned int size)
-{
- unsigned int j = hash_obj(obj->oid.hash, size);
- while (hash[j]) {
- j++;
- if (j >= size)
- j = 0;
- }
- hash[j] = obj;
-}
+struct critbit_tree objects;
/*
* Look up the record for the given sha1 in the hash map stored in
@@ -84,58 +61,10 @@ static void insert_obj_hash(struct object *obj, struct object **hash, unsigned i
*/
struct object *lookup_object(const unsigned char *sha1)
{
- unsigned int i, first;
- struct object *obj;
-
- if (!obj_hash)
+ const unsigned char *ret = cb_find(&objects, sha1, 20);
+ if (!ret)
return NULL;
-
- first = i = hash_obj(sha1, obj_hash_size);
- while ((obj = obj_hash[i]) != NULL) {
- if (!hashcmp(sha1, obj->oid.hash))
- break;
- i++;
- if (i == obj_hash_size)
- i = 0;
- }
- if (obj && i != first) {
- /*
- * Move object to where we started to look for it so
- * that we do not need to walk the hash table the next
- * time we look for it.
- */
- struct object *tmp = obj_hash[i];
- obj_hash[i] = obj_hash[first];
- obj_hash[first] = tmp;
- }
- return obj;
-}
-
-/*
- * Increase the size of the hash map stored in obj_hash to the next
- * power of 2 (but at least 32). Copy the existing values to the new
- * hash map.
- */
-static void grow_object_hash(void)
-{
- int i;
- /*
- * Note that this size must always be power-of-2 to match hash_obj
- * above.
- */
- int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
- struct object **new_hash;
-
- new_hash = xcalloc(new_hash_size, sizeof(struct object *));
- for (i = 0; i < obj_hash_size; i++) {
- struct object *obj = obj_hash[i];
- if (!obj)
- continue;
- insert_obj_hash(obj, new_hash, new_hash_size);
- }
- free(obj_hash);
- obj_hash = new_hash;
- obj_hash_size = new_hash_size;
+ return (struct object *)(ret - offsetof(struct object, oid));
}
void *create_object(const unsigned char *sha1, void *o)
@@ -147,10 +76,8 @@ void *create_object(const unsigned char *sha1, void *o)
obj->flags = 0;
hashcpy(obj->oid.hash, sha1);
- if (obj_hash_size - 1 <= nr_objs * 2)
- grow_object_hash();
+ cb_insert(&objects, &obj->oid, 20);
- insert_obj_hash(obj, obj_hash, obj_hash_size);
nr_objs++;
return obj;
}
--
2.10.0.271.ge3ee153
Subject: [PATCH 2/2] use zstd zlib wrapper
From: Jeff King @ 2016-09-14 23:58 UTC
To: git
There's a fancy new compression algorithm called "zstd". The
idea is that it's supposed to get similar compression ratios
to zlib, but with much faster compression and decompression
times. On top of that, it offers a nice sliding scale to trade
off size versus time on the compression side.
The zstd site at https://facebook.github.io/zstd/ claims
close to 3x speedup for both compression and decompression
versus zlib, with similar compression ratios. There are
other fast algorithms (like lz4), but they usually compress
much worse (follow the link above for a nice table of
results).
Since any git operation that has to access objects needs to
do a zlib inflate, in theory we can speed up everything by
using zstd. And then on the packing side, we could use higher
compression levels when making on-disk packfiles (which will
be accessed many times) and lower ones when making loose
objects, or deflating packed objects on the fly when serving
fetches.
The catch, of course, is that it's a new incompatible
format. This would be a pretty huge change and totally break
backwards compatibility for git, not just on disk but
on-the-wire as well. So my goal here was not a finished
product but just a quick experiment to see if it did indeed
bring the promised speedups.
Disappointingly, the answer seems to be "no".
The patch is relatively straightforward. zstd ships with a
zlib-compatible wrapper that we can just plug in. For
compression, it writes with zlib by default, but you can
flip it to zstd mode with a run-time flag. For
decompression, it auto-detects and runs the correct choice
between zlib and zstd. So this patch _could_ serve as the
basis for a real solution. We'd bundle zstd and its wrapper
with the runtime switch off, and then after several years
people could enable zstd writing to cut off older clients.
In the patch below, that switch is just setting the
pack.compression level to "zstd1", "zstd2", etc.
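For example, flipping a repository over with this patch applied
would be something like:

  git config pack.compression zstd1    # or zstd3, zstd5, ...
  git repack -a -d -F

while a plain integer value keeps the current zlib behavior.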
I ran a series of tests where I would:
1. Repack with "-F" and a particular pack.compression
setting, measuring both the time to repack and the size
of the resulting pack.
2. Time "git rev-list --objects --all" to see the effect
on commit/tree traversal.
3. Run "log -Sfoo --raw" to see the additional effect on
accessing the blobs.
All the timings below are against linux.git. The rev-list
timings are best-of-five, but I didn't do multiple timings
for the repack or "log -S" tests, because they were so
painfully slow (and because the effect sizes seemed to be
larger than run-to-run noise).
I did these on top of my delta-cache and pack-mru patches,
since that would amplify any effect we see (i.e.,
those speed up other things so decompression time is a
relatively larger share of the pie).
Here's a baseline run using stock zlib, with the default
compression settings:
[zlib]
  - pack
      1122845190
  - repack
      real    4m43.817s
      user    17m32.396s
      sys     0m5.964s
  - rev-list --objects --all
      real    0m27.378s
      user    0m27.132s
      sys     0m0.240s
  - log -Sfoo --raw
      real    5m3.490s
      user    5m0.040s
      sys     0m3.424s
You can see that we're pretty much CPU bound. You can also
see that there's quite a bit of parallelism going on in the
repack (this is a quad-core with hyperthreading).
Here's the same thing built against the zstd wrapper, but
still using zlib:
[zstd, disabled]
  - pack
      1123089975 (+0.02%)
  - repack
      real    4m58.263s
      user    17m50.596s
      sys     0m7.200s
      (+5%)
  - rev-list --objects --all
      real    0m28.143s
      user    0m27.864s
      sys     0m0.272s
      (+3%)
  - log -Sfoo --raw
      real    5m12.742s
      user    5m9.804s
      sys     0m2.912s
      (+3%)
The percentages show the difference against the baseline run
above (wall-clock times for the timing results). The pack
size is similar, which we'd expect (it's not identical
because the threaded delta search is slightly
non-deterministic).
Sadly we lose 3-5% just to the wrapper sitting in front of
zlib. So that's not a great start. But it also means we
might be able to recover that through code optimizations of
the wrapper, the way we call it, etc.
Now here it is with zstd turned on at its lowest setting:
[zstd, 1]
  - pack
      1199180502 (+7%)
  - repack
      real    4m11.740s
      user    16m36.510s
      sys     0m7.700s
      (-11%)
  - rev-list --objects --all
      real    0m25.870s
      user    0m25.632s
      sys     0m0.232s
      (-5.5%)
  - log -Sfoo --raw
      real    4m10.899s
      user    4m7.788s
      sys     0m3.056s
      (-17%)
We've definitely traded off some space here. But the
compression is faster, _and_ so is the decompression.
So let's bump up the compression level a bit:
[zstd, 3]
  - pack
      1162416903 (+3.5%)
  - repack
      real    5m1.477s
      user    19m34.476s
      sys     0m17.720s
      (+6.2%)
  - rev-list --objects --all
      real    0m25.716s
      user    0m25.468s
      sys     0m0.244s
      (-6%)
  - log -Sfoo --raw
      real    4m10.547s
      user    4m6.500s
      sys     0m3.884s
      (-17%)
OK, the size is improving. But now our compression is
noticeably slower. The decompression remains better, which
is nice. Let's go further:
[zstd, 5]
  - pack
      1144433165 (+2%)
  - repack
      real    5m46.050s
      user    22m7.580s
      sys     1m2.212s
      (+22%)
  - rev-list --objects --all
      real    0m25.717s
      user    0m25.404s
      sys     0m0.304s
      (-6%)
  - log -Sfoo --raw
      real    4m14.651s
      user    4m11.176s
      sys     0m3.448s
      (-16%)
Getting close to parity on the size. But the compression
time keeps creeping up.
[zstd, 10]
  - pack
      1132922195 (+0.8%)
  - repack
      real    41m35.835s
      user    166m36.956s
      sys     50m11.436s
      (+780%)
  - rev-list --objects --all
      real    0m25.946s
      user    0m25.656s
      sys     0m0.284s
      (-5%)
  - log -Sfoo --raw
      real    4m8.490s
      user    4m5.864s
      sys     0m2.604s
      (-18%)
Oof. We still haven't reached size parity, and now we're
using a _lot_ more CPU on the repack.
I wanted to take it all the way to "22", zstd's maximum,
just for fun. But after 2 hours (real time, so pegging all
my cores while doing so), we were only at 12% of the delta
phase. Yikes.
So...it doesn't seem like a great tradeoff (especially
considering the costs in migrating to a new algorithm). The
decompression _is_ delightfully faster, and it's true that
we decompress more than we compress. But the speed/size
tradeoff for compression doesn't seem great.
I'm not sure why we don't get numbers as nice as the ones on
the zstd website. Obviously possibility 0 is that I simply
screwed something up. But here are some other thoughts /
wild guesses:
1. There's clearly some overhead to using the zlib
wrapper. Even just calling into zlib with it to
compress seems to cost about 5%.
For zstd, it's less clear. The regular (non-wrapper)
zstd API seems to involve allocating a zstd compressor,
and then feeding multiple compressions to it. So that
would amortize the setup cost. I'd guess that the zlib
wrapper has to allocate a new zstd compressor for each
wrapped compression (obviously they could share one,
but then you run into threading issues).
The fact that the time increases as we bump the
compression level implies to me that the main cost is
in the actual compression, though, not in the setup.
It's possible that the setup cost is higher when the
compression level is higher, but I wouldn't really
expect it to scale as rapidly as my numbers show.
So this is probably making things worse than they
otherwise could be, but I doubt it's the real cause of
the problems (there's a sketch of the non-wrapper API after
point 2).
2. Git tends to do a lot of relatively small compressions,
as we zlib-compress deltas. So I wonder (and this is a
wild guess) if zstd performs more favorably on "normal"
sized inputs.
That would be possible to test. Or could probably be
found out by digging more into the zstd benchmarks;
they used some standard corpuses, but the results are
summarized. More broken-down results may show patterns.
I didn't do those things, hence "wild guess".
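To illustrate point 1: the non-wrapper zstd API (as I read the
zstd headers; this is just a sketch, not something the patch
below uses) lets you pay the setup cost once and reuse the
context for many small compressions:

  #include <zstd.h>

  ZSTD_CCtx *cctx = ZSTD_createCCtx();       /* one-time setup */
  for (i = 0; i < nr_buffers; i++) {
          /* reuse the same context for each buffer */
          size_t out = ZSTD_compressCCtx(cctx, dst[i], dst_len[i],
                                         src[i], src_len[i], level);
          if (ZSTD_isError(out))
                  die("zstd: %s", ZSTD_getErrorName(out));
  }
  ZSTD_freeCCtx(cctx);

whereas each compression through the zlib wrapper presumably
has to build and tear down an equivalent context on its own.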
And obviously even this patch series _does_ show an
improvement on the decompression side (which happens a lot
more often than compression!). So this isn't a total dead
end. But given the ratio/time tradeoff I saw on the
compression side, there are other options. E.g., lz4 is even
faster than zstd, but comes with a size penalty (though I
think it's more like 30%, not 7%).
I would have liked to test with lz4, too, but they don't
ship a totally trivial zlib wrapper. :)
Signed-off-by: Jeff King <peff@peff.net>
---
Makefile | 9 +++++++++
builtin/pack-objects.c | 19 +++++++++++++------
cache.h | 4 ++++
3 files changed, 26 insertions(+), 6 deletions(-)
diff --git a/Makefile b/Makefile
index 99e0eb2..e29edbe 100644
--- a/Makefile
+++ b/Makefile
@@ -1138,6 +1138,15 @@ else
endif
IMAP_SEND_LDFLAGS += $(OPENSSL_LINK) $(OPENSSL_LIBSSL) $(LIB_4_CRYPTO)
+# for linking straight out of zstd build directory
+ifdef ZSTD_PATH
+ LIB_OBJS += $(ZSTD_PATH)/zlibWrapper/zstd_zlibwrapper.o
+ BASIC_CFLAGS += -I$(ZSTD_PATH)/zlibWrapper
+ BASIC_CFLAGS += -I$(ZSTD_PATH)/lib -I$(ZSTD_PATH)/lib/common
+ BASIC_CFLAGS += -DUSE_ZSTD
+ EXTLIBS += $(ZSTD_PATH)/lib/libzstd.a
+endif
+
ifdef ZLIB_PATH
BASIC_CFLAGS += -I$(ZLIB_PATH)/include
EXTLIBS += -L$(ZLIB_PATH)/$(lib) $(CC_LD_DYNPATH)$(ZLIB_PATH)/$(lib)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4a63398..b424465 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -62,6 +62,7 @@ static int num_preferred_base;
static struct progress *progress_state;
static int pack_compression_level = Z_DEFAULT_COMPRESSION;
static int pack_compression_seen;
+static int pack_compression_max = Z_BEST_COMPRESSION;
static struct packed_git *reuse_packfile;
static uint32_t reuse_packfile_objects;
@@ -2220,11 +2221,17 @@ static int git_pack_config(const char *k, const char *v, void *cb)
return 0;
}
if (!strcmp(k, "pack.compression")) {
- int level = git_config_int(k, v);
- if (level == -1)
- level = Z_DEFAULT_COMPRESSION;
- else if (level < 0 || level > Z_BEST_COMPRESSION)
- die("bad pack compression level %d", level);
+ int level;
+ if (!v)
+ return config_error_nonbool(k);
+#ifdef USE_ZSTD
+ if (skip_prefix(v, "zstd", &v)) {
+ useZSTD(1);
+ /* XXX doesn't seem to be a constant? */
+ pack_compression_max = 22;
+ }
+#endif
+ level = git_config_int(k, v);
pack_compression_level = level;
pack_compression_seen = 1;
return 0;
@@ -2765,7 +2772,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
reuse_delta = 0;
if (pack_compression_level == -1)
pack_compression_level = Z_DEFAULT_COMPRESSION;
- else if (pack_compression_level < 0 || pack_compression_level > Z_BEST_COMPRESSION)
+ else if (pack_compression_level < 0 || pack_compression_level > pack_compression_max)
die("bad pack compression level %d", pack_compression_level);
if (!delta_search_threads) /* --threads=0 means autodetect */
diff --git a/cache.h b/cache.h
index 3556326..9b2f8f5 100644
--- a/cache.h
+++ b/cache.h
@@ -37,7 +37,11 @@
#define git_SHA1_Update git_SHA1_Update_Chunked
#endif
+#ifdef USE_ZSTD
+#include "zstd_zlibwrapper.h"
+#else
#include <zlib.h>
+#endif
typedef struct git_zstream {
z_stream z;
unsigned long avail_in;
--
2.10.0.271.ge3ee153