git.vger.kernel.org archive mirror
* Re: [RFC] Cache negative delta pairs
       [not found] <20060628223744.GA24421@coredump.intra.peff.net>
@ 2006-06-29  3:09 ` Junio C Hamano
  2006-06-29  3:50   ` Jeff King
                     ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Junio C Hamano @ 2006-06-29  3:09 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> From repack to repack, we end up trying to delta many of the same object
> pairs, which is computationally expensive.
>...
> I found this especially to be a problem with repos that consist of many
> large, unrelated files (e.g., photos). For example, on my test repo
> (about 300 unrelated 1-2M jpgs), a 'git-repack -a' takes about 10
> minutes to complete. With the delta cache, subsequent repacks take only
> 13 seconds. Results are not quite as dramatic for "normal" repos, but
> there is still some speedup. Repacking a fully packed linux-2.6 repo
> went from 1m12s to 36s. Repacking the git repo goes from 5.6s to 3.0s.

Interesting idea.  I think this matters more because for a
repository with many unrelated undeltifiable files, we do the
computation for objects that result in _no_ delta.  For normal
nearly fully packed repositories, once an object is deltified
against something else, subsequent repacking of the same set of
objects (or a superset thereof) will very likely reuse the delta
without recomputation, so as long as each object _can_ be
deltified with at least _one_ other object, you should not see
much improvement for them.

So I am curious where the speed-up comes from for "normal" repos
in your experiments.  If it turns out that in "normal" repos the
objects that hit your negative cache are stored undeltified,
then that suggests that it might be worthwhile to consider using
a cache of "inherently undeltifiable objects"; in other words, a
negative cache of O(N) entries instead of O(N^2) entries.

Another interpretation of your result is that we may be using a
delta window that is unnecessarily deep, and your negative
cache is collecting less optimal candidates that we attempt to
deltify against "just in case".  Can you easily instrument your
code to see where in the sorted delta candidate list the pairs
that hit your negative cache are?  That is, in the find_deltas()
function, we have a "while (--j > 0)" loop that attempts to delta
with the entry that is j (modulo window size) entries away from
the current one, then j-1, j-2, ...; I am interested in the
distribution of "j" value for the pair "n,m" that hits your
negative cache for normal repositories, and I am speculating
that the value would probably be small relative to the delta
window size.

Another idea is to have a cache of "paths at which inherently
undeltifiable objects live".  For example, we currently do
not delta OpenOffice documents (*.odt, *.odp, etc) very well.
If one has a repository that tracks the history of "file.odp",
we know each revision of "file.odp" would not delta against any
other version anyway, and could skip attempting to deltify them.

Your message contained string "*pt-in" in the commentary part
(replace asterisk with lowercase o) and was discarded by vger
mailing list software because that was a taboo word.  If you
want to pursue this, I would suggest resending your
original patch after rephrasing that part.

>  - size. The cache is a packed sequence of binary sha1 pairs. I was
>    concerned that it would grow too large (obviously for n blobs you can
>    end up with n^2/2 entries), but it doesn't seem unreasonable for most
>    repos (either you don't have a lot of files, or if you do, they delta
>    reasonably well). My test repo's cache is only 144K. The git cache is
>    about 2.7M. The linux-2.6 cache is 22M.

The fully-packed object database is a 6.2M pack, so you are talking
about 40% bloat; the kernel's is 115M, so the overhead there is 19%.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Cache negative delta pairs
  2006-06-29  3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
@ 2006-06-29  3:50   ` Jeff King
  2006-06-29  3:58   ` Jeff King
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29  3:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Jun 28, 2006 at 08:09:43PM -0700, Junio C Hamano wrote:

> Interesting idea.  I think this matters more because for a
> repository with many unrelated undeltifiable files, we do the
> computation for objects that result in _no_ delta.  For normal

Yes, precisely.

> So I am curious where the speed-up comes from for "normal" repos
> in your experiments.  If it turns out that in "normal" repos the

I assumed they were either:
  - blobs of files with only a single revision
  - blobs with one "island" revision which is so different from previous
    revisions that it isn't worth a delta
You can get a (very rough) estimate of the former:
  $ cd linux-2.6 && git-rev-list --objects --all \
     | grep / | cut -d' ' -f2 | sort | uniq -u | wc -l
  8364

As I said, we may also be catching possible deltas at the edge of the
pack depth. I should maybe put the cache check closer to the
create_delta call.

> objects that hit your negative cache are stored undeltified,
> then that suggests that it might be worthwhile to consider using
> a cache of "inherently undeltifiable objects"; in other words, a
> negative cache of O(N) entries instead of O(N^2) entries.

That would certainly be preferable, though I'm not convinced there
aren't objects which are deltifiable, but not right now (e.g., a file
with only one revision, but which will later get a revision).

I'm not sure what would make a file inherently undeltifiable. But you
could put the O(N) cache in front of the other cache, and reduce the
size of N for the O(N^2) cache. 

> deltify against "just in case".  Can you easily instrument your
> code to see where in the sorted delta candidate list the pairs
> that hit your negative cache are?  That is, in find_deltas()

I'll look into that...

> Another idea is to have a cache of "paths at which inherently
> undeltifiable objects live".  For example, we currently do
> not delta OpenOffice documents (*.odt, *.odp, etc) very well.
> If one has a repository that tracks the history of "file.odp",
> we know each revision of "file.odp" would not delta against any
> other version anyway, and could skip attempting to deltify them.

I thought about that, but I was trying to avoid "clever" heuristics that
can often be wrong. Case in point: my repo consists mainly of jpegs,
most of which will not delta. But for a few, I edited the exif header
and they delta'd nicely against their previous revision.

> Your message contained string "*pt-in" in the commentary part
> (replace asterisk with lowercase o) and was discarded by vger
> mailing list software because that was a taboo word.  If you

Oops, I didn't know we were filtering. I'll resend in a moment.

> >    reasonably well). My test repo's cache is only 144K. The git cache is
> >    about 2.7M. The linux-2.6 cache is 22M.
> The fully-packed object database is 6.2M pack so you are talking
> about 40% bloat; the kernel is 115M so the overhead is 19%.

Yes, obviously if you're interested in maximal space saving, this is
stupid for the classic git repo case (though it makes sense for repos
with few but very large blobs). However, if you assume that:
  1. Packing is inherently space-saving and more-or-less required on
     large repos like linux-2.6 (which is 1.8G unpacked and 115M packed)
  2. It saves time (36 seconds per repack on linux-2.6)
then a more reasonable question is "do you want to spend 22M to save 30
seconds every time you repack -a?". Or, to really cook the numbers, you
can say that the cache is only wasting 1% versus an unpacked repo. :)

I think it's reasonable that this should be an optional feature. For
some repos, it turns them from almost unusable to very fast. For others,
it's a questionable space-time tradeoff (and for some pathological
cases, it would probably turn the repo from usable to unusable).

-Peff


* [RFC] Cache negative delta pairs
  2006-06-29  3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
  2006-06-29  3:50   ` Jeff King
@ 2006-06-29  3:58   ` Jeff King
  2006-06-29  4:30     ` Jeff King
  2006-06-29 16:39     ` Nicolas Pitre
  2006-06-29  4:09   ` Jeff King
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29  3:58 UTC (permalink / raw)
  To: git

From repack to repack, we end up trying to delta many of the same object
pairs, which is computationally expensive.  This patch makes a
persistent cache, $GIT_DIR/delta-cache, which contains pairs of sha1
hashes (the presence of a pair indicates that it's not worth trying to
delta those two objects).

Signed-off-by: Jeff King <peff@peff.net>
---

[This is a repost, since the original got caught by the list filter.]

I found this especially to be a problem with repos that consist of many
large, unrelated files (e.g., photos). For example, on my test repo
(about 300 unrelated 1-2M jpgs), a 'git-repack -a' takes about 10
minutes to complete. With the delta cache, subsequent repacks take only
13 seconds. Results are not quite as dramatic for "normal" repos, but
there is still some speedup. Repacking a fully packed linux-2.6 repo
went from 1m12s to 36s. Repacking the git repo goes from 5.6s to 3.0s.

Here are some of my thoughts:

 - speed. The implementation is quite fast. The sha1 pairs are stored
   sorted, and we mmap and binary search them. Certainly the extra time
   spent in lookup is justified by avoiding the delta attempts.

 - size. The cache is a packed sequence of binary sha1 pairs. I was
   concerned that it would grow too large (obviously for n blobs you can
   end up with n^2/2 entries), but it doesn't seem unreasonable for most
   repos (either you don't have a lot of files, or if you do, they delta
   reasonably well). My test repo's cache is only 144K. The git cache is
   about 2.7M. The linux-2.6 cache is 22M.

   Theoretically, I could bound the cache size and boot old entries.
   However, that means storing age information, which increases the
   required size. I think keeping it simple is best.

 - correctness. Currently the code uses the output of try_delta for
   negative caching. Should the cache checking be moved inside try_delta
   instead? This would give more control over which reasons to mark a
   delta negative (I believe right now hitting the depth limit will
   cause a negative mark; we should perhaps only do so if the content
   itself makes the delta unpalatable).
 
 - optionalness. Currently the delta-cache is always used. Since it is a
   space-time tradeoff, maybe it should be optional (it will have
   negligible performance and horrible size impact on a repo that
   consists of many very small but unrelated objects).  Possible methods
   include:
     - enable cache saves only if .git/delta-cache is present; turn it
       on initially with 'touch .git/delta-cache'
     - config variable

 Makefile       |    4 +-
 delta-cache.c  |  119 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 delta-cache.h  |   11 +++++
 pack-objects.c |   11 +++++
 4 files changed, 142 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index cde619c..39c4308 100644
--- a/Makefile
+++ b/Makefile
@@ -202,7 +202,7 @@ LIB_H = \
 	blob.h cache.h commit.h csum-file.h delta.h \
 	diff.h object.h pack.h pkt-line.h quote.h refs.h \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
-	tree-walk.h log-tree.h dir.h
+	tree-walk.h log-tree.h dir.h delta-cache.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -217,7 +217,7 @@ LIB_OBJS = \
 	server-info.o setup.o sha1_file.o sha1_name.o strbuf.o \
 	tag.o tree.o usage.o config.o environment.o ctype.o copy.o \
 	fetch-clone.o revision.o pager.o tree-walk.o xdiff-interface.o \
-	alloc.o $(DIFF_OBJS)
+	alloc.o delta-cache.o $(DIFF_OBJS)
 
 BUILTIN_OBJS = \
 	builtin-log.o builtin-help.o builtin-count.o builtin-diff.o builtin-push.o \
diff --git a/delta-cache.c b/delta-cache.c
new file mode 100644
index 0000000..d132867
--- /dev/null
+++ b/delta-cache.c
@@ -0,0 +1,119 @@
+#include "delta-cache.h"
+#include "cache.h"
+
+static const unsigned char* disk_cache = 0;
+static unsigned disk_cache_len = 0;
+static unsigned char* mem_cache = 0;
+static unsigned mem_cache_len = 0;
+static unsigned mem_cache_alloc = 0;
+
+#define GETCACHE(c, n) (c+(40*n))
+
+static void disk_cache_init()
+{
+	static int done = 0;
+	int fd;
+	struct stat st;
+
+	if (done) return;
+	done = 1;
+
+	fd = open(git_path("delta-cache"), O_RDONLY);
+	if (fd < 0)
+		return;
+	if (fstat(fd, &st) || (st.st_size == 0)) {
+		close(fd);
+		return;
+	}
+	disk_cache = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	close(fd);
+	if (disk_cache != MAP_FAILED)
+		disk_cache_len = st.st_size / 40;
+}
+
+static int
+check_cache(const unsigned char* c, unsigned len, const unsigned char s[40])
+{
+	int left, right, mid, cmp;
+
+	left = 0;
+	right = len - 1;
+	while (left <= right) {
+		mid = left + (right - left) / 2;
+		cmp = memcmp(s, GETCACHE(c, mid), 40);
+		if(cmp == 0) return 1;
+		else if(cmp <  0) right = mid - 1;
+		else left = mid + 1;
+	}
+	return 0;
+}
+
+extern int
+delta_cache_check(const unsigned char a[20], const unsigned char b[20])
+{
+	unsigned char search[40];
+
+	disk_cache_init();
+	memcpy(search, a, 20);
+	memcpy(search+20, b, 20);
+	return check_cache(disk_cache, disk_cache_len, search);
+}
+
+extern void
+delta_cache_mark(const unsigned char a[20], const unsigned char b[20])
+{
+	if (mem_cache_len == mem_cache_alloc) {
+		mem_cache_alloc = mem_cache_alloc ? mem_cache_alloc * 2 : 16;
+		mem_cache = xrealloc(mem_cache, mem_cache_alloc * 40);
+	}
+	memcpy(GETCACHE(mem_cache, mem_cache_len), a, 20);
+	memcpy(GETCACHE(mem_cache, mem_cache_len)+20, b, 20);
+	mem_cache_len++;
+}
+
+static int
+compare_sha1pair(const void *a, const void *b)
+{
+	return memcmp(a, b, 40);
+}
+
+static int
+merge_write(int fd, const unsigned char* p1, unsigned n1,
+		    const unsigned char* p2, unsigned n2)
+{
+#define EMIT(p, x) do { \
+	if (xwrite(fd, GETCACHE(p, x), 40) < 0) return -1; x++; \
+	} while(0)
+
+	int i = 0, j = 0, cmp;
+	while (i < n1 && j < n2) {
+		cmp = memcmp(GETCACHE(p1, i), GETCACHE(p2, j), 40);
+		if (cmp < 0) EMIT(p1, i);
+		else EMIT(p2, j);
+	}
+	while (i < n1) EMIT(p1, i);
+	while (j < n2) EMIT(p2, j);
+#undef EMIT
+	return 0;
+}
+
+extern void
+delta_cache_save(void)
+{
+	int fd;
+	char tmpfile[PATH_MAX];
+
+	strcpy(tmpfile, git_path("delta-cache.%u", getpid()));
+	fd = open(tmpfile, O_WRONLY|O_EXCL|O_CREAT, 0666);
+	if (fd < 0)
+		return;
+
+	qsort(mem_cache, mem_cache_len, 40, compare_sha1pair);
+	if (merge_write(fd, mem_cache, mem_cache_len,
+			    disk_cache, disk_cache_len) < 0) {
+		close(fd);
+		return;
+	}
+
+	rename(tmpfile, git_path("delta-cache"));
+}
diff --git a/delta-cache.h b/delta-cache.h
new file mode 100644
index 0000000..19201be
--- /dev/null
+++ b/delta-cache.h
@@ -0,0 +1,11 @@
+#ifndef DELTA_CACHE_H
+#define DELTA_CACHE_H
+
+extern int
+delta_cache_check(const unsigned char a[20], const unsigned char b[20]);
+extern void
+delta_cache_mark(const unsigned char a[20], const unsigned char b[20]);
+extern void
+delta_cache_save(void);
+
+#endif /* DELTA_CACHE_H */
diff --git a/pack-objects.c b/pack-objects.c
index bed2497..46b9775 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -8,6 +8,7 @@ #include "delta.h"
 #include "pack.h"
 #include "csum-file.h"
 #include "tree-walk.h"
+#include "delta-cache.h"
 #include <sys/time.h>
 #include <signal.h>
 
@@ -1083,14 +1084,21 @@ static void find_deltas(struct object_en
 		j = window;
 		while (--j > 0) {
 			unsigned int other_idx = idx + j;
+			int r;
 			struct unpacked *m;
 			if (other_idx >= window)
 				other_idx -= window;
 			m = array + other_idx;
 			if (!m->entry)
 				break;
-			if (try_delta(n, m, m->index, depth) < 0)
+			if (delta_cache_check(n->entry->sha1, m->entry->sha1))
+				continue;
+			r = try_delta(n, m, m->index, depth);
+			if (r < 0)
 				break;
+			if (r == 0)
+				delta_cache_mark(n->entry->sha1,
+						 m->entry->sha1);
 		}
 		/* if we made n a delta, and if n is already at max
 		 * depth, leaving it in the window is pointless.  we
@@ -1342,5 +1350,6 @@ int main(int argc, char **argv)
 	if (progress)
 		fprintf(stderr, "Total %d, written %d (delta %d), reused %d (delta %d)\n",
 			nr_result, written, written_delta, reused, reused_delta);
+	delta_cache_save();
 	return 0;
 }
-- 
1.4.1.rc1.g3000


* Re: [RFC] Cache negative delta pairs
  2006-06-29  3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
  2006-06-29  3:50   ` Jeff King
  2006-06-29  3:58   ` Jeff King
@ 2006-06-29  4:09   ` Jeff King
  2006-06-29 15:42   ` Nicolas Pitre
  2006-06-29 22:31   ` Jakub Narebski
  4 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29  4:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Jun 28, 2006 at 08:09:43PM -0700, Junio C Hamano wrote:

> that hit your negative cache are?  That is, in find_deltas()
> function, we have "while (--j > 0)" loop that attempts to delta
> with the entry that is j (modulo window size) entries away from
> the current one, then j-1, j-2, ...; I am interested in the
> distribution of "j" value for the pair "n,m" that hits your
> negative cache for normal repositories, and I am speculating
> that the value would probably be small relative to the delta
> window size.

Just to make sure I am understanding you correctly, you're interested in
the distribution of 'j' each time we skip a delta for being negative
(or each time we mark one negative, but that should simply match the
lookup counts on the next run). The instrumentation I added simply
prints the value of j each time we skip a delta. The counts are
surprisingly uniform (this is for linux-2.6):
  57209 j=1
  57213 j=2
  57217 j=3
  57221 j=4
  57225 j=5
  57229 j=6
  57233 j=7
  57237 j=8
  57241 j=9
  57245 j=10
They're so uniform (and in order by j!) that I feel like I must have done
something wrong...

-Peff


* Re: [RFC] Cache negative delta pairs
  2006-06-29  3:58   ` Jeff King
@ 2006-06-29  4:30     ` Jeff King
  2006-06-29 16:39     ` Nicolas Pitre
  1 sibling, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29  4:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Jun 28, 2006 at 11:58:49PM -0400, Jeff King wrote:

>    about 2.7M. The linux-2.6 cache is 22M.
>    [...]
>  - correctness. Currently the code uses the output of try_delta for
>    negative caching. Should the cache checking be moved inside try_delta
>    instead? This would give more control over which reasons to mark a

I tried moving the cache check/mark to wrap the call to create_delta in
try_delta. The speedup is the same (since all of the CPU time is spent
in create_delta anyway), and the linux-2.6 cache size dropped to 18M.
I also think this is probably more correct (it ignores every reason not
to delta EXCEPT create_delta failing).

The distribution of the window parameter 'j' is similarly uniform:
  44900 j=2
  44907 j=3
  44943 j=1
  45001 j=4
  45014 j=5
  45023 j=6
  45063 j=7
  45158 j=8
  45288 j=9
  45466 j=10

Patch on top of my other one is below.

-Peff

-- >8 --
pack-objects: move delta-cache check/mark closer to create_delta

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-objects.c |   15 ++++++---------
 1 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/pack-objects.c b/pack-objects.c
index 46b9775..83ccc8a 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -1014,9 +1014,13 @@ static int try_delta(struct unpacked *tr
 	if (sizediff >= max_size)
 		return 0;
 
+	if (delta_cache_check(src->entry->sha1, trg->entry->sha1))
+		return 0;
 	delta_buf = create_delta(src_index, trg->data, size, &delta_size, max_size);
-	if (!delta_buf)
+	if (!delta_buf) {
+		delta_cache_mark(src->entry->sha1, trg->entry->sha1);
 		return 0;
+	}
 
 	trg_entry->delta = src_entry;
 	trg_entry->delta_size = delta_size;
@@ -1084,21 +1088,14 @@ static void find_deltas(struct object_en
 		j = window;
 		while (--j > 0) {
 			unsigned int other_idx = idx + j;
-			int r;
 			struct unpacked *m;
 			if (other_idx >= window)
 				other_idx -= window;
 			m = array + other_idx;
 			if (!m->entry)
 				break;
-			if (delta_cache_check(n->entry->sha1, m->entry->sha1))
-				continue;
-			r = try_delta(n, m, m->index, depth);
-			if (r < 0)
+			if (try_delta(n, m, m->index, depth) < 0)
 				break;
-			if (r == 0)
-				delta_cache_mark(n->entry->sha1,
-						 m->entry->sha1);
 		}
 		/* if we made n a delta, and if n is already at max
 		 * depth, leaving it in the window is pointless.  we
-- 
1.4.1.rc1.g0458-dirty


* Re: [RFC] Cache negative delta pairs
  2006-06-29  3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
                     ` (2 preceding siblings ...)
  2006-06-29  4:09   ` Jeff King
@ 2006-06-29 15:42   ` Nicolas Pitre
  2006-06-29 16:35     ` Nicolas Pitre
  2006-06-29 18:00     ` Jeff King
  2006-06-29 22:31   ` Jakub Narebski
  4 siblings, 2 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 15:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Wed, 28 Jun 2006, Junio C Hamano wrote:

> Interesting idea.  I think this matters more because for a
> repository with many unrelated undeltifiable files, we do the
> computation for objects that result in _no_ delta.  For normal
> nearly fully packed repositories, once an object is deltified
> against something else, subsequent repacking of the same set of
> objects (or a superset thereof) will very likely reuse the delta
> without recomputation, so as long as each object _can_ be
> deltified with at least _one_ other object, you should not see
> improvement on them.
> 
> So I am curious where the speed-up comes from for "normal" repos
> in your experiments.

My GIT repo currently has 23622 objects, and 7610 of them are
undeltified.  Those objects are of course candidates for delta matching 
each time git-repack is run.

> If it turns out that in "normal" repos the
> objects that hit your negative cache are stored undeltified,
> then that suggests that it might be worthwhile to consider using
> a cache of "inherently undeltifiable objects"; in other words, a
> negative cache of O(N) entries instead of O(N^2) entries.

Actually... I'm not so sure.  Those objects are not "inherently 
undeltifiable".  They just happen to not have other objects to easily 
delta against in the given set of objects.  Think of a file with only 
one revision for example.  As soon as there is a second revision of that 
file added to the set of objects then the former revision would have a 
high probability of being deltifiable.

So the negative cache should not be O(N^2) either.  It just has to be 
O(N*window).

> Another interpretation of your result is that we may be using a
> delta window that is unnecessarily deep, and your negative
> cache is collecting less optimal candidates that we attempt to
> deltify against "just in case".  Can you easily instrument your
> code to see where in the sorted delta candidate list the pairs
> that hit your negative cache are?  That is, in find_deltas()
> function, we have "while (--j > 0)" loop that attempts to delta
> with the entry that is j (modulo window size) entries away from
> the current one, then j-1, j-2, ...; I am interested in the
> distribution of "j" value for the pair "n,m" that hits your
> negative cache for normal repositories, and I am speculating
> that the value would probably be small relative to the delta
> window size.

My past experiments showed that the best window size for compression is 
a bit larger than the current default of 10.  It was rather around 15 
for the kernel repository, with values higher than 15 no longer providing 
significant improvements.  That is clearly a function of the 
repository's nature (the average number of revisions per file), and the 
window size is directly connected to the computational cost.

> Another idea is to have a cache of "paths at which inherently
> undeltifiable objects live".  For example, we currently do
> not delta OpenOffice documents (*.odt, *.odp, etc) very well.
> If one has a repository that tracks the history of "file.odp",
> we know each revision of "file.odp" would not delta against any
> other version anyway, and could skip attempting to deltify them.

I'm afraid this could lead to bad behavior eventually.  Better to just 
attempt a delta once, and when an object has found no delta 
base candidate, write its sha1 and corresponding window to the 
cache.  This would imply an initial cost to create the cache the first 
time, but after that the created cache could be relied upon as hard 
information rather than as guess heuristics.

> >  - size. The cache is a packed sequence of binary sha1 pairs. I was
> >    concerned that it would grow too large (obviously for n blobs you can
> >    end up with n^2/2 entries), but it doesn't seem unreasonable for most
> >    repos (either you don't have a lot of files, or if you do, they delta
> >    reasonably well). My test repo's cache is only 144K. The git cache is
> >    about 2.7M. The linux-2.6 cache is 22M.

This is way suboptimal.  First there is no reason for the cache to ever 
grow to N^2.  At worst it should be N*10 where 10 is the current window 
size.

Next I think this can be made just N*2.  Consider that the criterion for 
skipping delta matching for a given object is the fact 
that we already know that such an object doesn't delta against 
any of the objects found in a given window.  Therefore we only have to 
compute a hash of the object names found in that window and store that 
in the cache.  So the cache entries would then be a pair of sha1s: first 
the sha1 of the victim object, then a sha1 over the names of all the 
objects against which the victim object was found not to delta well.

And this can be pushed even further by just including the sha1 of the 
victim object inside the hashed list, thus computing one hash over 
all the objects (the victim and the window) for which no delta results.  The 
cache is then simply a list of hash values corresponding to bad 
victim+window combinations.

So given my GIT repository such a cache would be 7610 * 40 = 304400 
bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.


Nicolas


* Re: [RFC] Cache negative delta pairs
  2006-06-29 15:42   ` Nicolas Pitre
@ 2006-06-29 16:35     ` Nicolas Pitre
  2006-06-29 18:00     ` Jeff King
  1 sibling, 0 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 16:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Thu, 29 Jun 2006, Nicolas Pitre wrote:

> And this can be pushed even further by just including the sha1 of the 
> victim object inside the list of objects therefore computing a hash of 
> all objects (the victim and the window) for which no delta results. The 
> cache is therefore a list of hash values corresponding to bad 
> victim+window combinations.
> 
> So given my GIT repository such a cache would be 7610 * 40 = 304400 
> bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.

Correction: the 40 bytes figure is for _ascii_ representation of sha1 
values.  The cache doesn't need ascii and therefore this number can be 
reduced by half.


Nicolas


* Re: [RFC] Cache negative delta pairs
  2006-06-29  3:58   ` Jeff King
  2006-06-29  4:30     ` Jeff King
@ 2006-06-29 16:39     ` Nicolas Pitre
  2006-06-29 18:07       ` Jeff King
  1 sibling, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 16:39 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Wed, 28 Jun 2006, Jeff King wrote:

> From repack to repack, we end up trying to delta many of the same object
> pairs, which is computationally expensive.  This patch makes a
> persistent cache, $GIT_DIR/delta-cache, which contains pairs of sha1
> hashes (the presence of a pair indicates that it's not worth trying to
> delta those two objects).

I think this could be done differently to be even more space efficient 
as I suggested earlier.

> Here are some of my thoughts:
> 
>  - speed. The implementation is quite fast. The sha1 pairs are stored
>    sorted, and we mmap and binary search them. Certainly the extra time
>    spent in lookup is justified by avoiding the delta attempts.

You do that lookup for every delta match attempt.  Instead it could be 
done once for the whole window attempt, potentially reducing the cache 
size by a factor of 20, and it might be faster too.

>  - size. The cache is a packed sequence of binary sha1 pairs. I was
>    concerned that it would grow too large (obviously for n blobs you can
>    end up with n^2/2 entries), but it doesn't seem unreasonable for most
>    repos (either you don't have a lot of files, or if you do, they delta
>    reasonably well). My test repo's cache is only 144K. The git cache is
>    about 2.7M. The linux-2.6 cache is 22M.

See my previous email for comments about this.

>    Theoretically, I could bound the cache size and boot old entries.
>    However, that means storing age information, which increases the
>    required size. I think keeping it simple is best.

You could simply recreate the cache on each run.  Or keep a bitmap 
of referenced cache entries, plus a list of new entries.  At the end, if 
the new-entry list is empty and the bitmap is all set (or all clear), then 
you just keep the current cache.  Once, say, more than 10% of cache 
entries are no longer used, you regenerate the cache.  But since 
you need to regenerate it whenever new entries are added, the 10% unused 
entry criterion might not come into play often anyway, unless packs shrink 
with time, which is probably not often the case.

>  - correctness. Currently the code uses the output of try_delta for
>    negative caching. Should the cache checking be moved inside try_delta
>    instead? This would give more control over which reasons to mark a
>    delta negative (I believe right now hitting the depth limit will
>    cause a negative mark; we should perhaps only do so if the content
>    itself makes the delta unpalatable).

Like I said earlier I think this should be moved up a bit i.e. outside 
the delta loop.  In find_deltas() right before the ...

		j = window;
		while (--j > 0) {

loop, just insert another loop that compute the sha1 of the current 
target object and window:

		SHA1_Init(&c);
		j = window;
		while (j-- > 0) { /* post-decrement includes the current object as well */
			unsigned int other_idx = idx + j;
			struct unpacked *m;
			if (other_idx >= window)
				other_idx -= window;
			m = array + other_idx;
			if (!m->entry)
				break;
			SHA1_Update(&c, m->entry->sha1, 20);
		}
		SHA1_Final(negative_delta_hash, &c);

Then you can skip over the whole of the delta loop right away if that 
negative_delta_hash matches one entry in the cache since you already 
know that this object with this window doesn't produce any delta result.

Otherwise, after the delta loop has proceeded, if there is no delta 
found then you can add negative_delta_hash to the cache.

>  - optionalness. Currently the delta-cache is always used. Since it is a
>    space-time tradeoff, maybe it should be optional (it will have
>    negligible performance and horrible size impact on a repo that
>    consists of many very small but unrelated objects).  Possible methods
>    include:
>      - enable cache saves only if .git/delta-cache is present; turn it
>        on initially with 'touch .git/delta-cache'
>      - config variable

First, I think it should be ignored (but still created) when 
--no-reuse-delta is passed.  Then, it should not be created (but still 
looked up if it exists and --no-reuse-delta is not provided) when the 
pack index file is also not created.  I don't think it is worth making 
this further configurable, and given the suggested strategy above the 
cache should remain fairly small.


Nicolas

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Cache negative delta pairs
  2006-06-29 15:42   ` Nicolas Pitre
  2006-06-29 16:35     ` Nicolas Pitre
@ 2006-06-29 18:00     ` Jeff King
  2006-06-29 18:24       ` Nicolas Pitre
  1 sibling, 1 reply; 36+ messages in thread
From: Jeff King @ 2006-06-29 18:00 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Thu, Jun 29, 2006 at 11:42:14AM -0400, Nicolas Pitre wrote:

> So the negative cache should not be O(N^2) either.  It just has to be 
> O(N*window).

Yes, unless we use the cache information to change the window (but I
don't think that's worth it, unless our window heuristics are bad).
The implementation I posted should already grow to O(N*window), since it
only saves what we try (and fail at).

> My past experiments showed that the best window size for compression is 
> a bit larger than the current default of 10.  It was rather around 15 
> for the kernel repository, with higher values than 15 not providing 
> significant improvements anymore.  But that is clearly a function of the 
> repository nature (the average number of revisions for each file).  But 
> the window size is directly connected to the computational cost.

Increasing the window, of course, has virtually no speed impact on later
repacks with the cache in effect. Packing with a window of 15 drops my
linux-2.6 pack to 108M from 122M. That helps offset the cache size
(though of course it takes longer on the initial pack).

> This is way suboptimal.  First there is no reason for the cache to ever 
> grow to N^2.  At worst it should be N*10 where 10 is the current window 
> size.

I assumed the window would change over time (though our total is still
likely to hang around N*10 rather than N^2).

> none of the objects found in a given window.  Therefore we only have to 
> compute a hash for the object names found in that window and store that 
> in the cache.  So the cache entries would then be a pair of sha1: first 
> the sha1 of the victim object, and the sha1 of all sha1 names for the 
> objects against which the victim object was found not to delta well 
> against.

This will fail to hit the cache anytime the window changes. How often
does the window change? In my test case, I would think anytime I added a
bunch of new photos, it would be likely that one of them would make it
into the window, thus invalidating the cache entry and forcing me to try
against every object in the window (even though I've already tried
9/10).

Also, is it true to say "if this object did not delta against this
window, it will never delta?" What about interactions with the depth
parameter?

> So given my GIT repository such a cache would be 7610 * 40 = 304400 
> bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.

Keep in mind that it will grow every time the window changes.

-Peff

* Re: [RFC] Cache negative delta pairs
  2006-06-29 16:39     ` Nicolas Pitre
@ 2006-06-29 18:07       ` Jeff King
  2006-06-29 18:48         ` Nicolas Pitre
  0 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2006-06-29 18:07 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

On Thu, Jun 29, 2006 at 12:39:31PM -0400, Nicolas Pitre wrote:

> You do that lookup for every delta match attempt.  Instead it could be 
> done once for the whole window attempt, potentially reducing the cache 
> size by a factor of 20, and it might be faster too.

I'm not convinced this will provide good cache hit characteristics, and
I'm not convinced it's semantically correct (see my other mail).

> You could simply recreate the cache on each run.  Or just keep a bitmap 

Yes, that would probably work and would be quite easy to do with the
existing code.

> First, I think it should be ignored (but still created) when 
> --no-reuse-delta is passed.  Then, it should not be created (but still 
> looked up if it exists and --no-reuse-delta is not provided) when the 
> pack index file is also not created.  I don't think it is worth making 
> this further configurable, and given the suggested strategy above the 
> cache should remain fairly small.

Those suggestions make sense to me.

-Peff

* Re: [RFC] Cache negative delta pairs
  2006-06-29 18:00     ` Jeff King
@ 2006-06-29 18:24       ` Nicolas Pitre
  2006-06-29 18:53         ` Jeff King
  0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 18:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Thu, 29 Jun 2006, Jeff King wrote:

> > This is way suboptimal.  First there is no reason for the cache to ever 
> > grow to N^2.  At worst it should be N*10 where 10 is the current window 
> > size.
> 
> I assumed the window would change over time (though our total is still
> likely to hang around N*10 rather than N^2).

It doesn't change unless you force a different window size.

> > none of the objects found in a given window.  Therefore we only have to 
> > compute a hash for the object names found in that window and store that 
> > in the cache.  So the cache entries would then be a pair of sha1: first 
> > the sha1 of the victim object, and the sha1 of all sha1 names for the 
> > objects against which the victim object was found not to delta well 
> > against.
> 
> This will fail to hit the cache anytime the window changes. How often
> does the window change? In my test case, I would think anytime I added a
> bunch of new photos, it would be likely that one of them would make it
> into the window, thus invalidating the cache entry and forcing me to try
> against every object in the window (even though I've already tried
> 9/10).

Sure.  But on the whole, how often will that happen?
This trades a bit of CPU for a much smaller cache, which might be worth it.

And even then, since my suggested method implies only one cache lookup 
in a much smaller cache instead of 10 lookups in a larger cache for each 
object, it might end up faster overall even if sometimes some windows 
don't match and deltas are recomputed needlessly.

> Also, is it true to say "if this object did not delta against this
> window, it will never delta?" What about interactions with the depth
> parameter?

Of course a greater depth might allow for a hit where there isn't any 
otherwise.  But changing the delta depth is not something someone does 
that often, and when the depth is changed you'd better use -f with 
git-repack as well, which, like I said, should also ignore the cache.

> > So given my GIT repository such a cache would be 7610 * 40 = 304400 
> > bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
> 
> Keep in mind that it will grow every time the window changes.

What do you mean by window change?


Nicolas

* Re: [RFC] Cache negative delta pairs
  2006-06-29 18:07       ` Jeff King
@ 2006-06-29 18:48         ` Nicolas Pitre
  2006-06-29 18:58           ` Jeff King
  0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 18:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, 29 Jun 2006, Jeff King wrote:

> On Thu, Jun 29, 2006 at 12:39:31PM -0400, Nicolas Pitre wrote:
> 
> > You do that lookup for every delta match attempt.  Instead it could be 
> > done once for the whole window attempt, potentially reducing the cache 
> > size by a factor of 20, and it might be faster too.
> 
> I'm not convinced this will provide good cache hit characteristics, and
> I'm not convinced it's semantically correct (see my other mail).

Dare to test it?  I provided you with most of the code difference 
already.

And what do you mean by "semantically correct"?


Nicolas

* Re: [RFC] Cache negative delta pairs
  2006-06-29 18:24       ` Nicolas Pitre
@ 2006-06-29 18:53         ` Jeff King
  2006-06-29 19:04           ` Nicolas Pitre
  0 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2006-06-29 18:53 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Thu, Jun 29, 2006 at 02:24:57PM -0400, Nicolas Pitre wrote:

> > I assumed the window would change over time (though our total is still
> > likely to hang around N*10 rather than N^2).
> It doesn't change unless you force a different window size.

Sorry, I meant "the items in the window for a given object would change
over time." 

> > This will fail to hit the cache anytime the window changes. How often
> > does the window change? In my test case, I would think anytime I added a
> > bunch of new photos, it would be likely that one of them would make it
> > into the window, thus invalidating the cache entry and forcing me to try
> > against every object in the window (even though I've already tried
> > 9/10).
> Sure.  But on the lot how often will that happen?

Reasonably often, according to my test. I did this to simulate usage
over time:
  - create an empty repo
  - from my test repo of 515 images, grab 20 at a time and add/commit
    them
  - after each commit, record the SHA1 of (object, window[0..n]) for
    each object to be delta'd
If doing the cache on the sha1 of the whole window is a good idea, then
we should see many of the same hashes from commit to commit. If we
don't, that means the newly added files are being placed in the old
windows, thus disrupting their hashes.

The results were that there was typically only 1 reusable window each
time I added 20 files. At that point, caching is largely pointless.

> And even then, since my suggested method implies only one cache lookup 
> in a much smaller cache instead of 10 lookups in a larger cache for each 
> objects it might end up faster overall even if sometimes some windows 
> don't match and deltas are recomputed needlessly.

I didn't benchmark, but I doubt it will have significant impact.
Especially on my photo test repo, the lookups are dominated by the
create_delta time by several orders of magnitude.

> Of course a greater depth might allow for a hit where there isn't any 
> otherwise.  But changing the delta depth is not something someone does 
> that often, and when the depth is changed then you better use -f with 
> git-repack as well which like I said should also ignore the cache.

That sounds reasonable to me for depth. What about other reasons for
try_delta to fail? Preferred base?

> > > So given my GIT repository such a cache would be 7610 * 40 = 304400 
> > > bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
> > Keep in mind that it will grow every time the window changes.
> What do you mean by window change?

I meant that when the window we use for a given object changes, it will
insert a new cache entry. But if we deal with invalidating unused cache
entries as you suggested before, it won't matter.

-Peff

* Re: [RFC] Cache negative delta pairs
  2006-06-29 18:48         ` Nicolas Pitre
@ 2006-06-29 18:58           ` Jeff King
  2006-06-29 19:06             ` Nicolas Pitre
  0 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2006-06-29 18:58 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

On Thu, Jun 29, 2006 at 02:48:23PM -0400, Nicolas Pitre wrote:

> Dare to test it?  I provided you with most of the code difference 
> already.

See my other mail. Unless I did something horribly wrong, caching full
windows is largely useless.

> And what do you mean by "semantically correct"?

I mean that right now the cache means "calling create_delta on content
that has sha1_a against content that has sha1_b will not produce a
useful delta." That makes sense since the content is never going to
change for a given sha1. However, covering the whole window takes into
account depth and preferred base. We just need to be sure that it is
correct to be including those in our cache calculation (I don't know,
and I'll defer to your judgement on that, since you clearly know more
about the delta logic than I do).
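The pair-keyed meaning described here can be sketched as follows.  The 
helper names are hypothetical; the key is just the concatenation of the 
two object names, which is why entries never go stale (content never 
changes for a given sha1).  Note the key is ordered: failing to delta a 
against b says nothing about b against a:

```c
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

#define SHA1_RAWSZ 20

/* Toy pair-keyed negative cache: each entry records that delta'ing the
 * object named sha1_a against the object named sha1_b was not useful. */
struct neg_pair { unsigned char key[2 * SHA1_RAWSZ]; };

static struct neg_pair pairs[4096];
static size_t npairs;

static void make_key(unsigned char *key,
		     const unsigned char *sha1_a, const unsigned char *sha1_b)
{
	memcpy(key, sha1_a, SHA1_RAWSZ);
	memcpy(key + SHA1_RAWSZ, sha1_b, SHA1_RAWSZ);
}

static bool known_bad_pair(const unsigned char *a, const unsigned char *b)
{
	unsigned char key[2 * SHA1_RAWSZ];
	make_key(key, a, b);
	for (size_t i = 0; i < npairs; i++)
		if (!memcmp(pairs[i].key, key, sizeof key))
			return true;
	return false;
}

static void record_bad_pair(const unsigned char *a, const unsigned char *b)
{
	if (npairs < sizeof(pairs) / sizeof(pairs[0]) && !known_bad_pair(a, b))
		make_key(pairs[npairs++].key, a, b);
}
```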

-Peff

* Re: [RFC] Cache negative delta pairs
  2006-06-29 18:53         ` Jeff King
@ 2006-06-29 19:04           ` Nicolas Pitre
  2006-06-29 19:52             ` Jeff King
  0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 19:04 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Thu, 29 Jun 2006, Jeff King wrote:

> On Thu, Jun 29, 2006 at 02:24:57PM -0400, Nicolas Pitre wrote:
> 
> > > I assumed the window would change over time (though our total is still
> > > likely to hang around N*10 rather than N^2).
> > It doesn't change unless you force a different window size.
> 
> Sorry, I meant "the items in the window for a given object would change
> over time." 
> 
> > > This will fail to hit the cache anytime the window changes. How often
> > > does the window change? In my test case, I would think anytime I added a
> > > bunch of new photos, it would be likely that one of them would make it
> > > into the window, thus invalidating the cache entry and forcing me to try
> > > against every object in the window (even though I've already tried
> > > 9/10).
> > Sure.  But on the lot how often will that happen?
> 
> Reasonably often, according to my test. I did this to simulate usage
> over time:
>   - create an empty repo
>   - from my test repo of 515 images, grab 20 at a time and add/commit
>     them
>   - after each commit, record the SHA1 of (object, window[0..n]) for
>     each object to be delta'd
> If doing the cache on the sha1 of the whole window is a good idea, then
> we should see many of the same hashes from commit to commit. If we
> don't, that means the newly added files are being placed in the old
> windows, thus disrupting their hashes.
> 
> The results were that there was typically only 1 reusable window each
> time I added 20 files. At that point, caching is largely pointless.

Right.  Your use pattern is a special case that doesn't work well with 
the whole window hash approach.  I'd expect it to work beautifully with 
the kernel repository though.

> > And even then, since my suggested method implies only one cache lookup 
> > in a much smaller cache instead of 10 lookups in a larger cache for each 
> > objects it might end up faster overall even if sometimes some windows 
> > don't match and deltas are recomputed needlessly.
> 
> I didn't benchmark, but I doubt it will have significant impact.
> Especially on my photo test repo, the lookups are dominated by the
> create_delta time by several orders of magnitude.

Again I think it is a repo like the linux kernel that would benefit 
more.

> > Of course a greater depth might allow for a hit where there isn't any 
> > otherwise.  But changing the delta depth is not something someone does 
> > that often, and when the depth is changed then you better use -f with 
> > git-repack as well which like I said should also ignore the cache.
> 
> That sounds reasonable to me for depth. What about other reasons for
> try_delta to fail? Preferred base?

Hmmm.  That might need to be dealt with (easily, but still).


Nicolas

* Re: [RFC] Cache negative delta pairs
  2006-06-29 18:58           ` Jeff King
@ 2006-06-29 19:06             ` Nicolas Pitre
  0 siblings, 0 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 19:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, 29 Jun 2006, Jeff King wrote:

> On Thu, Jun 29, 2006 at 02:48:23PM -0400, Nicolas Pitre wrote:
> 
> > Dare to test it?  I provided you with most of the code difference 
> > already.
> 
> See my other mail. Unless I did something horribly wrong, caching full
> windows is largely useless.

... on your special photo repository I agree.

I'm still unconvinced for large repos though.


Nicolas

* Re: [RFC] Cache negative delta pairs
  2006-06-29 19:04           ` Nicolas Pitre
@ 2006-06-29 19:52             ` Jeff King
  2006-06-29 20:24               ` Nicolas Pitre
  0 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2006-06-29 19:52 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Thu, Jun 29, 2006 at 03:04:15PM -0400, Nicolas Pitre wrote:

> Right.  Your use pattern is a special case that doesn't work well with 
> the whole window hash approach.  I'd expect it to work beautifully with 
> the kernel repository though.

I don't necessarily care about the kernel repository. It packs fine as
it is, and you only waste 30 seconds on a repack checking deltas that
could be avoided. I do care on my special repository where packing is
virtually unusable and I can achieve a 45x speedup. Maybe my original
caching is not worth it for the kernel and should be configurable,
but obviously this window caching cannot REPLACE mine since it fails
utterly for the one thing I wanted it for.

That being said, I'm not sure that window caching is all that great for
"normal" repos.

Same test as before, but instead of simulating the commits, I merely
looked at the window hashes produced by 
  git-rev-list --objects master~$x

For the git repo:
x=0 tries 6698 windows
x=0 and x=50 contain 5197 identical windows
x=0 and x=100 contain 2484 identical windows
x=0 and x=500 contain 455 identical windows

For linux-2.6 repo:
x=0 tries 57208 windows
x=0 and x=50 contain 53677 identical windows
x=0 and x=100 contain 52886 identical windows
x=0 and x=500 contain 41196 identical windows

Obviously the kernel repo is doing better, but x=500 is only 4 days ago.
Trying with --before=2.weeks.ago yields only 31505 matches.

So the windows do clearly experience a fair bit of churn, but whether or
not this is worth it depends on how long you think is reasonable before
something gets "churned out".

-Peff

* Re: [RFC] Cache negative delta pairs
  2006-06-29 19:52             ` Jeff King
@ 2006-06-29 20:24               ` Nicolas Pitre
  2006-06-29 21:04                 ` Linus Torvalds
                                   ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 20:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Thu, 29 Jun 2006, Jeff King wrote:

> On Thu, Jun 29, 2006 at 03:04:15PM -0400, Nicolas Pitre wrote:
> 
> > Right.  Your use pattern is a special case that doesn't work well with 
> > the whole window hash approach.  I'd expect it to work beautifully with 
> > the kernel repository though.
> 
> I don't necessarily care about the kernel repository. It packs fine as
> it is, and you only waste 30 seconds on a repack checking deltas that
> could be avoided. I do care on my special repository where packing is
> virtually unusable and I can achieve a 45x speedup. Maybe my original
> caching is not worth it for the kernel and should be configurable,
> but obviously this window caching cannot REPLACE mine since it fails
> utterly for the one thing I wanted it for.

And I agreed with you already.

> That being said, I'm not sure that window caching is all that great for
> "normal" repos.
> 
> Same test as before, but instead of simulating the commits, I merely
> looked at the window hashes produced by 
>   git-rev-list --objects master~$x
> 
> For the git repo:
> x=0 tries 6698 windows
> x=0 and x=50 contain 5197 identical windows
> x=0 and x=100 contain 2484 identical windows
> x=0 and x=500 contain 455 identical windows
> 
> For linux-2.6 repo:
> x=0 tries 57208 windows
> x=0 and x=50 contain 53677 identical windows
> x=0 and x=100 contain 52886 identical windows
> x=0 and x=500 contain 41196 identical windows
> 
> Obviously the kernel repo is doing better, but x=500 is only 4 days ago.
> Trying with --before=2.weeks.ago yields only 31505 matches.

What does this prove?  I fail to see the relation between those results 
and a possible git-pack-objects improvement.

The negative delta cache concept is certainly attractive even for normal 
repositories, especially for public servers, since when used in 
conjunction with delta reuse it makes the creation of a pack basically 
free.  So I think this idea really has merits, as long as the cache 
remains small.


Nicolas

* Re: [RFC] Cache negative delta pairs
  2006-06-29 20:24               ` Nicolas Pitre
@ 2006-06-29 21:04                 ` Linus Torvalds
  2006-06-29 21:24                   ` Nicolas Pitre
                                     ` (3 more replies)
  2006-06-29 21:26                 ` Junio C Hamano
  2006-06-29 21:37                 ` Jeff King
  2 siblings, 4 replies; 36+ messages in thread
From: Linus Torvalds @ 2006-06-29 21:04 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jeff King, Junio C Hamano, git


On Thu, 29 Jun 2006, Nicolas Pitre wrote:
> 
> The negative delta cache concept is certainly attractive even for normal 
> repositories, especially for public servers, since when used in 
> conjonction with delta reuse it makes the creation of a pack basically 
> free.  So I think this idea really has merits, as long as the cache 
> remains small.

I don't really see much of a point of this all.

Instead of having a separate cache, wouldn't it be much better to just 
take the hint from the previous pack-file?

In the repacking window, if both objects we are looking at already came 
from the same (old) pack-file, don't bother delta'ing them against each 
other. 

That means that we'll still always check for better deltas for (and 
against!) _unpacked_ objects, but assuming incremental repacks, you'll 
avoid the delta creation 99% of the time.

Ie, something really simple like the appended.

		Linus
---
diff --git a/pack-objects.c b/pack-objects.c
index bed2497..cea63e7 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -988,6 +988,13 @@ static int try_delta(struct unpacked *tr
 		return -1;
 
 	/*
+	 * We do not bother to try a delta that we discarded
+	 * on an earlier try
+	 */
+	if (trg_entry->in_pack && trg_entry->in_pack == src_entry->in_pack)
+		return -1;
+
+	/*
 	 * If the current object is at pack edge, take the depth the
 	 * objects that depend on the current object into account --
 	 * otherwise they would become too deep.

* Re: [RFC] Cache negative delta pairs
  2006-06-29 21:04                 ` Linus Torvalds
@ 2006-06-29 21:24                   ` Nicolas Pitre
  2006-06-29 21:30                     ` Linus Torvalds
  2006-06-29 21:35                   ` [RFC] Cache negative delta pairs Jeff King
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 21:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Junio C Hamano, git

On Thu, 29 Jun 2006, Linus Torvalds wrote:

> Instead of having a separate cache, wouldn't it be much better to just 
> take the hint from the previous pack-file?

DOH!  ;-)


Nicolas

* Re: [RFC] Cache negative delta pairs
  2006-06-29 20:24               ` Nicolas Pitre
  2006-06-29 21:04                 ` Linus Torvalds
@ 2006-06-29 21:26                 ` Junio C Hamano
  2006-06-29 21:37                 ` Jeff King
  2 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2006-06-29 21:26 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git, Jeff King

Nicolas Pitre <nico@cam.org> writes:

> The negative delta cache concept is certainly attractive even for normal 
> repositories, especially for public servers, since when used in 
> conjonction with delta reuse it makes the creation of a pack basically 
> free.  So I think this idea really has merits, as long as the cache 
> remains small.

Yes, I agree it is very attractive.

One thing to watch out for is that we probably would not want to
let git-daemon write into public repositories.  Which means that
use of negative cache should be strict "opt-in".

 - "$GIT_DIR/delta-cache" is read but not necessarily is written
   back when it exists; git-daemon uses it that way.

 - The owner of the repository shouldn't have to tell the tool
   to update the negative cache every time repack happens.

Which suggests that pack-objects.c can learn an option that
tells it to call delta_cache_save(), and we use it in
git-repack, perhaps like this:

diff --git a/git-repack.sh b/git-repack.sh
index 640ad8d..b07ed9b 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -44,7 +44,7 @@ case ",$all_into_one," in
 esac
 pack_objects="$pack_objects $local $quiet $no_reuse_delta$extra"
 name=$(git-rev-list --objects --all $rev_list 2>&1 |
-	git-pack-objects --non-empty $pack_objects .tmp-pack) ||
+	git-pack-objects --update-delta-cache --non-empty $pack_objects .tmp-pack) ||
 	exit 1
 if [ -z "$name" ]; then
 	echo Nothing new to pack.
 
diff --git a/pack-objects.c b/pack-objects.c
index bed2497..46b9775 100644
--- a/pack-objects.c
+++ b/pack-objects.c
...
@@ -1342,5 +1350,7 @@ int main(int argc, char **argv)
 	if (progress)
 		fprintf(stderr, "Total %d, written %d (delta %d), reused %d (delta %d)\n",
 			nr_result, written, written_delta, reused, reused_delta);
+	if (update_delta_cache)
+		delta_cache_save();
 	return 0;
 }

* Re: [RFC] Cache negative delta pairs
  2006-06-29 21:24                   ` Nicolas Pitre
@ 2006-06-29 21:30                     ` Linus Torvalds
  2006-06-29 21:39                       ` Jeff King
                                         ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Linus Torvalds @ 2006-06-29 21:30 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jeff King, Junio C Hamano, git



On Thu, 29 Jun 2006, Nicolas Pitre wrote:
>
> On Thu, 29 Jun 2006, Linus Torvalds wrote:
> 
> > Instead of having a separate cache, wouldn't it be much better to just 
> > take the hint from the previous pack-file?
> 
> DOH!  ;-)

Btw, I think this could do with a flag to turn it on/off (but probably 
default to on).

The advantage of this patch is that it should guarantee that if everything 
is already packed, a repack will basically keep the pack identical. 

However, that is obviously also the disadvantage, since it means that 
repacking cannot improve packing. So adding a flag to say "please try to 
incrementally improve the pack" might well be worth it, even if this new 
behaviour would be the _default_.

Hmm? Jeff, does this work for your load?

		Linus

* Re: [RFC] Cache negative delta pairs
  2006-06-29 21:04                 ` Linus Torvalds
  2006-06-29 21:24                   ` Nicolas Pitre
@ 2006-06-29 21:35                   ` Jeff King
  2006-06-29 21:54                   ` Junio C Hamano
  2006-06-29 22:22                   ` Junio C Hamano
  3 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29 21:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Junio C Hamano, git

On Thu, Jun 29, 2006 at 02:04:01PM -0700, Linus Torvalds wrote:

> Instead of having a separate cache, wouldn't it be much better to just 
> take the hint from the previous pack-file?

I'd like to second Nicolas' "DOH!".

This drops my nasty load with roughly the same effect as the delta
cache, and it does the same for the kernel repo. And it doesn't consume
any extra disk space. Junio, please drop my delta-cache patch in favor
of Linus' patch.

-Peff

* Re: [RFC] Cache negative delta pairs
  2006-06-29 20:24               ` Nicolas Pitre
  2006-06-29 21:04                 ` Linus Torvalds
  2006-06-29 21:26                 ` Junio C Hamano
@ 2006-06-29 21:37                 ` Jeff King
  2 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29 21:37 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Thu, Jun 29, 2006 at 04:24:01PM -0400, Nicolas Pitre wrote:

> > Obviously the kernel repo is doing better, but x=500 is only 4 days ago.
> > Trying with --before=2.weeks.ago yields only 31505 matches.
> 
> What does this prove?  I fail to see the relation between those results 
> and a possible git-pack-objects improvement.

My point is that the window cache gets invalidated over time, whereas a
sha1 by sha1 cache doesn't.  However, the point is moot, as I think
Linus has come up with a much better solution.

> The negative delta cache concept is certainly attractive even for normal 
> repositories, especially for public servers, since when used in 
> conjonction with delta reuse it makes the creation of a pack basically 
> free.  So I think this idea really has merits, as long as the cache 
> remains small.

Yes, if you're talking about a situation in which you make many packs
for a given set of commits, then it does make more sense (what I was
really trying to eliminate was a 10 minute repack followed by another 10
minute repack to push to the server).

-Peff

* Re: [RFC] Cache negative delta pairs
  2006-06-29 21:30                     ` Linus Torvalds
@ 2006-06-29 21:39                       ` Jeff King
  2006-06-29 21:43                       ` Joel Becker
  2006-06-29 21:47                       ` Nicolas Pitre
  2 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29 21:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Junio C Hamano, git

On Thu, Jun 29, 2006 at 02:30:11PM -0700, Linus Torvalds wrote:

> However, that is obviously also the dis-advantage, since it means that 
> repacking cannot improve packing. So adding a flag to say "please try to 
> incrementally improve the pack" might well be worth it, even if this new 
> behaviour would be the _default_.

We could tie the on/off to --no-reuse-delta, since you would mostly
want them at the same time. This disallows repacking over and over to
try to incrementally improve size. Do people actually do that?

> Hmm? Jeff, does this work for your load?

Yes, just as well as my original patch.

-Peff

* Re: [RFC] Cache negative delta pairs
  2006-06-29 21:30                     ` Linus Torvalds
  2006-06-29 21:39                       ` Jeff King
@ 2006-06-29 21:43                       ` Joel Becker
  2006-06-29 21:47                       ` Nicolas Pitre
  2 siblings, 0 replies; 36+ messages in thread
From: Joel Becker @ 2006-06-29 21:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Jeff King, Junio C Hamano, git

On Thu, Jun 29, 2006 at 02:30:11PM -0700, Linus Torvalds wrote:
> However, that is obviously also the dis-advantage, since it means that 
> repacking cannot improve packing. So adding a flag to say "please try to 
> incrementally improve the pack" might well be worth it, even if this new 
> behaviour would be the _default_.

	I nominate "--pack-me-harder".

Joel

-- 

"Behind every successful man there's a lot of unsuccessful years."
        - Bob Brown

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [RFC] Cache negative delta pairs
  2006-06-29 21:30                     ` Linus Torvalds
  2006-06-29 21:39                       ` Jeff King
  2006-06-29 21:43                       ` Joel Becker
@ 2006-06-29 21:47                       ` Nicolas Pitre
  2006-06-29 22:12                         ` Junio C Hamano
  2 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 21:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Junio C Hamano, git

On Thu, 29 Jun 2006, Linus Torvalds wrote:

> 
> 
> On Thu, 29 Jun 2006, Nicolas Pitre wrote:
> >
> > On Thu, 29 Jun 2006, Linus Torvalds wrote:
> > 
> > > Instead of having a separate cache, wouldn't it be much better to just 
> > > take the hint from the previous pack-file?
> > 
> > DOH!  ;-)
> 
> Btw, I think this could do with a flag to turn it on/off (but probably 
> default to on).

I think it should simply be coupled with the --no-reuse-delta flag.

> The advantage of this patch is that it should guarantee that if everything 
> is already packed, a repack will basically keep the pack identical. 
> 
> However, that is obviously also the dis-advantage, since it means that 
> repacking cannot improve packing. So adding a flag to say "please try to 
> incrementally improve the pack" might well be worth it, even if this new 
> behaviour would be the _default_.

Actually, the delta reusing already prevents those deltas from being 
improved.  So your patch only extends this notion to the non-deltified 
objects as well.  And the way out is to provide the --no-reuse-delta 
flag.


Nicolas


* Re: [RFC] Cache negative delta pairs
  2006-06-29 21:04                 ` Linus Torvalds
  2006-06-29 21:24                   ` Nicolas Pitre
  2006-06-29 21:35                   ` [RFC] Cache negative delta pairs Jeff King
@ 2006-06-29 21:54                   ` Junio C Hamano
  2006-06-29 22:22                   ` Junio C Hamano
  3 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2006-06-29 21:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Instead of having a separate cache, wouldn't it be much better to just 
> take the hint from the previous pack-file?
>
> In the repacking window, if both objects we are looking at already came 
> from the same (old) pack-file, don't bother delta'ing them against each 
> other. 
>
> That means that we'll still always check for better deltas for (and 
> against!) _unpacked_ objects, but assuming incremental repacks, you'll 
> avoid the delta creation 99% of the time.

I bow down before you.

No ugly special-case caching, just automatically "the right
thing", with very little overhead.

It just makes sense.

We have a winner.

;-)


* Re: [RFC] Cache negative delta pairs
  2006-06-29 21:47                       ` Nicolas Pitre
@ 2006-06-29 22:12                         ` Junio C Hamano
  2006-06-30  3:44                           ` [PATCH] consider previous pack undeltified object state only when reusing delta data Nicolas Pitre
  0 siblings, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2006-06-29 22:12 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Jeff King, git

Nicolas Pitre <nico@cam.org> writes:

> On Thu, 29 Jun 2006, Linus Torvalds wrote:
>
>> 
>> 
>> On Thu, 29 Jun 2006, Nicolas Pitre wrote:
>> >
>> > On Thu, 29 Jun 2006, Linus Torvalds wrote:
>> > 
>> > > Instead of having a separate cache, wouldn't it be much better to just 
>> > > take the hint from the previous pack-file?
>> > 
>> > DOH!  ;-)
>> 
>> Btw, I think this could do with a flag to turn it on/off (but probably 
>> default to on).
>
> I think it should simply be coupled with the --no-reuse-delta flag.

I agree that makes sense.


* Re: [RFC] Cache negative delta pairs
  2006-06-29 21:04                 ` Linus Torvalds
                                     ` (2 preceding siblings ...)
  2006-06-29 21:54                   ` Junio C Hamano
@ 2006-06-29 22:22                   ` Junio C Hamano
  3 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2006-06-29 22:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Thu, 29 Jun 2006, Nicolas Pitre wrote:
>> 
>> The negative delta cache concept is certainly attractive even for normal 
>> repositories, especially for public servers, since when used in 
> conjunction with delta reuse it makes the creation of a pack basically 
>> free.  So I think this idea really has merits, as long as the cache 
>> remains small.
>
> I don't really see much of a point of this all.
>
> Instead of having a separate cache, wouldn't it be much better to just 
> take the hint from the previous pack-file?
>
> In the repacking window, if both objects we are looking at already came 
> from the same (old) pack-file, don't bother delta'ing them against each 
> other. 
>
> That means that we'll still always check for better deltas for (and 
> against!) _unpacked_ objects, but assuming incremental repacks, you'll 
> avoid the delta creation 99% of the time.
>
> I.e. something really simple like the appended.
>
> 		Linus
> ---
> diff --git a/pack-objects.c b/pack-objects.c
> index bed2497..cea63e7 100644
> --- a/pack-objects.c
> +++ b/pack-objects.c
> @@ -988,6 +988,13 @@ static int try_delta(struct unpacked *tr
>  		return -1;
>  
>  	/*
> +	 * We do not bother to try a delta that we discarded
> +	 * on an earlier try
> +	 */
> +	if (trg_entry->in_pack && trg_entry->in_pack == src_entry->in_pack)
> +		return -1;
> +
> +	/*

I think you meant to return 0 from here though.  -1 means "do
not use this pair and do not bother trying to improve it with the
remaining candidates".


* Re: [RFC] Cache negative delta pairs
  2006-06-29  3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
                     ` (3 preceding siblings ...)
  2006-06-29 15:42   ` Nicolas Pitre
@ 2006-06-29 22:31   ` Jakub Narebski
  4 siblings, 0 replies; 36+ messages in thread
From: Jakub Narebski @ 2006-06-29 22:31 UTC (permalink / raw)
  To: git

Junio C Hamano wrote:

> [...] For example, we currently do
> not delta OpenOffice documents (*.odt, *.odp, etc) very well.
> If one has a repository that tracks the history of "file.odp",
> we know each revision of "file.odp" would not delta against any
> other version anyway, and could skip attempting to deltify them.

Perhaps we should steal Mercurial's idea of an EncodeDecodeFilter and store
OpenOffice documents, Mozilla extensions, and Java packages in the object
store as uncompressed archives, checking them out to the working area in the
original format.  All diffs should of course be done on the in-repository
(after-filter) format.

The original example at 
  http://www.selenic.com/mercurial/wiki/index.cgi/EncodeDecodeFilter
talks about archives (zip files) and unix2dos endline convention conversion.

Perhaps for OpenOffice and Mozilla we would need to use an [external]
XML-aware diff, too...

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* [PATCH] consider previous pack undeltified object state only when reusing delta data
  2006-06-29 22:12                         ` Junio C Hamano
@ 2006-06-30  3:44                           ` Nicolas Pitre
  2006-06-30  9:45                             ` Johannes Schindelin
  0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-30  3:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Without this there would never be a chance to improve packing for 
previously undeltified objects.

Signed-off-by: Nicolas Pitre <nico@cam.org>

---

On Thu, 29 Jun 2006, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > On Thu, 29 Jun 2006, Linus Torvalds wrote:
> >
> >> 
> >> 
> >> On Thu, 29 Jun 2006, Nicolas Pitre wrote:
> >> >
> >> > On Thu, 29 Jun 2006, Linus Torvalds wrote:
> >> > 
> >> > > Instead of having a separate cache, wouldn't it be much better to just 
> >> > > take the hint from the previous pack-file?
> >> > 
> >> > DOH!  ;-)
> >> 
> >> Btw, I think this could do with a flag to turn it on/off (but probably 
> >> default to on).
> >
> > I think it should simply be coupled with the --no-reuse-delta flag.
> 
> I agree that makes sense.

So here it is.

diff --git a/pack-objects.c b/pack-objects.c
index 6e17676..47da33b 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -989,9 +989,10 @@ static int try_delta(struct unpacked *tr
 
 	/*
 	 * We do not bother to try a delta that we discarded
-	 * on an earlier try.
+	 * on an earlier try, but only when reusing delta data.
 	 */
-	if (trg_entry->in_pack && trg_entry->in_pack == src_entry->in_pack)
+	if (!no_reuse_delta && trg_entry->in_pack &&
+	    trg_entry->in_pack == src_entry->in_pack)
 		return 0;
 
 	/*


* Re: [PATCH] consider previous pack undeltified object state only when reusing delta data
  2006-06-30  3:44                           ` [PATCH] consider previous pack undeltified object state only when reusing delta data Nicolas Pitre
@ 2006-06-30  9:45                             ` Johannes Schindelin
  2006-06-30 12:28                               ` Andreas Ericsson
  0 siblings, 1 reply; 36+ messages in thread
From: Johannes Schindelin @ 2006-06-30  9:45 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

Hi,

On Thu, 29 Jun 2006, Nicolas Pitre wrote:

> Without this there would never be a chance to improve packing for 
> previously undeltified objects.

Earlier this year, I was quite surprised to learn that multiple repackings 
actually improved packing. Does that patch mean this feature is gone?

Ciao,
Dscho


* Re: [PATCH] consider previous pack undeltified object state only when reusing delta data
  2006-06-30  9:45                             ` Johannes Schindelin
@ 2006-06-30 12:28                               ` Andreas Ericsson
  2006-06-30 16:55                                 ` Nicolas Pitre
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Ericsson @ 2006-06-30 12:28 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Nicolas Pitre, Junio C Hamano, git

Johannes Schindelin wrote:
> Hi,
> 
> On Thu, 29 Jun 2006, Nicolas Pitre wrote:
> 
> 
>>Without this there would never be a chance to improve packing for 
>>previously undeltified objects.
> 
> 
> Earlier this year, I was quite surprised to learn that multiple repackings 
> actually improved packing. Does that patch mean this feature is gone?
> 

The patch Linus sent removes that feature. This one re-introduces it.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: [PATCH] consider previous pack undeltified object state only when reusing delta data
  2006-06-30 12:28                               ` Andreas Ericsson
@ 2006-06-30 16:55                                 ` Nicolas Pitre
  2006-07-03  8:11                                   ` Andreas Ericsson
  0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-30 16:55 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Johannes Schindelin, Junio C Hamano, git

On Fri, 30 Jun 2006, Andreas Ericsson wrote:

> Johannes Schindelin wrote:
> > Hi,
> > 
> > On Thu, 29 Jun 2006, Nicolas Pitre wrote:
> > 
> > 
> > > Without this there would never be a chance to improve packing for
> > > previously undeltified objects.
> > 
> > 
> > Earlier this year, I was quite surprised to learn that multiple repackings
> > actually improved packing. Does that patch mean this feature is gone?
> > 
> 
> The patch Linus sent removes that feature. This one re-introduces it.

Not really.

Actually that multiple repacking "feature" was rather an artifact of the 
delta data reuse code and not really by design.  Here's what happened 
before:

Consider the first repack where no delta exists, or "git-repack -a -f" 
where the -f argument makes it ignore existing delta data.  In that 
case all objects are sorted and deltas are attempted on them within a window.

So to simplify things let's assume objects are numbered from 1 upwards.  
First obj #1 is added to the window.  Obj #2 attempts a delta against 
obj #1.  Obj #3 attempts a delta against objs #2 and #1.  Obj #4 
attempts a delta against objs #3, #2 and #1.  And so on for all objects: 
each new object attempts a delta against the last 10 objects (the 
default window size is 10) and the best delta, if any, is kept.

In the end, some objects get deltified, some don't, and a new pack is 
produced.

When running git-repack without -f, already deltified objects 
are simply copied as-is from the existing pack(s), avoiding costly delta 
re-computation.  Still, without Linus' patch, non-deltified objects were 
considered for deltification and deltas attempted on them.

So supposing that objects #1 through #10 were not deltified, and objects 
#11 through #50 were deltified, then those deltified objects were 
skipped over for the purpose of delta matching, and therefore object #51 
ended up attempting a delta against objs #1 to #10 instead of #41 to #50 
as in the previous run.  The net effect was similar to a larger window 
for some objects providing more opportunities for successful deltas, and 
therefore a smaller pack.

With Linus' patch those objects already known to be undeltified are 
skipped, too.  That means successive git-repack runs without the -f 
argument now produce identical packs every time, and the artifact 
above is gone.

I think this is a good thing since the packing behavior is now more 
predictable.  But nothing is lost: if you want better packing like 
before, simply specify a slightly larger window size on the first 
git-repack.  It'll take a bit more time, but running git-repack many 
times took more time in the end anyway.


Nicolas


* Re: [PATCH] consider previous pack undeltified object state only when reusing delta data
  2006-06-30 16:55                                 ` Nicolas Pitre
@ 2006-07-03  8:11                                   ` Andreas Ericsson
  0 siblings, 0 replies; 36+ messages in thread
From: Andreas Ericsson @ 2006-07-03  8:11 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Schindelin, Junio C Hamano, git

Nicolas Pitre wrote:
> On Fri, 30 Jun 2006, Andreas Ericsson wrote:
> 
>>Johannes Schindelin wrote:
>>>
>>>>Without this there would never be a chance to improve packing for
>>>>previously undeltified objects.
>>>
>>>
>>>Earlier this year, I was quite surprised to learn that multiple repackings
>>>actually improved packing. Does that patch mean this feature is gone?
>>>
>>
>>The patch Linus sent removes that feature. This one re-introduces it.
> 
> 
> Not really.
> 
> Actually that multiple repacking "feature" was rather an artifact of the 
> delta data reuse code and not really by design.  Here's what happened 
> before:
> 

Thanks for the extensive and very clear info. Lovely to see a competent 
programmer who can also explain how things work. :)

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

