* Re: [RFC] Cache negative delta pairs
[not found] <20060628223744.GA24421@coredump.intra.peff.net>
@ 2006-06-29 3:09 ` Junio C Hamano
2006-06-29 3:50 ` Jeff King
` (4 more replies)
0 siblings, 5 replies; 36+ messages in thread
From: Junio C Hamano @ 2006-06-29 3:09 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> From repack to repack, we end up trying to delta many of the same object
> pairs, which is computationally expensive.
>...
> I found this especially to be a problem with repos that consist of many
> large, unrelated files (e.g., photos). For example, on my test repo
> (about 300 unrelated 1-2M jpgs), a 'git-repack -a' takes about 10
> minutes to complete. With the delta cache, subsequent repacks take only
> 13 seconds. Results are not quite as dramatic for "normal" repos, but
> there is still some speedup. Repacking a fully packed linux-2.6 repo
> went from 1m12s to 36s. Repacking the git repo goes from 5.6s to 3.0s.
Interesting idea. I think this matters more because, for a
repository with many unrelated undeltifiable files, we do the
computation even for objects where it yields _no_ delta. For normal,
nearly fully packed repositories, once an object is deltified
against something else, subsequent repacking of the same set of
objects (or a superset thereof) will very likely reuse the delta
without recomputation, so as long as each object _can_ be
deltified against at least _one_ other object, you should not see
much improvement for them.
So I am curious where the speed-up comes from for "normal" repos
in your experiments. If it turns out that in "normal" repos the
objects that hit your negative cache are stored undeltified,
then that suggests that it might be worthwhile to consider using
a cache of "inherently undeltifiable objects": in other words, a
negative cache of O(N) entries instead of O(N^2) entries.
Another interpretation of your result is that we may be using a
delta window that is unnecessarily deep, and your negative
cache is collecting less optimal candidates that we attempt to
deltify against "just in case". Can you easily instrument your
code to see where in the sorted delta candidate list the pairs
that hit your negative cache are? That is, in the find_deltas()
function, we have a "while (--j > 0)" loop that attempts to delta
with the entry that is j (modulo window size) entries away from
the current one, then j-1, j-2, ...; I am interested in the
distribution of the "j" value for the pair "n,m" that hits your
negative cache for normal repositories, and I am speculating
that the value would probably be small relative to the delta
window size.
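In code, the candidate selection that loop performs is just a wrap-around
within the window, roughly like this (an illustrative helper, not code
from git, which open-codes it in the loop body):

```c
/* Map the loop counter j onto the circular candidate window, as the
 * find_deltas() loop does: the candidate sits j slots away from the
 * current object at position idx, wrapping modulo the window size.
 * Assumes idx < window and j < window, as in the real loop. */
unsigned int window_slot(unsigned int idx, unsigned int j, unsigned int window)
{
	unsigned int other_idx = idx + j;
	if (other_idx >= window)
		other_idx -= window;
	return other_idx;
}
```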
Another idea is to have a cache of "paths at which inherently
undeltifiable objects live in". For example, we currently do
not delta OpenOffice documents (*.odt, *.odp, etc) very well.
If one has a repository that tracks the history of "file.odp",
we know each revision of "file.odp" would not delta against any
other version anyway, and could skip attempting to deltify them.
Your message contained the string "*pt-in" in the commentary part
(replace the asterisk with a lowercase o) and was discarded by the vger
mailing list software because that is a taboo word. If you
want to pursue this, I would suggest resending your
original patch after rephrasing that part.
> - size. The cache is a packed sequence of binary sha1 pairs. I was
> concerned that it would grow too large (obviously for n blobs you can
> end up with n^2/2 entries), but it doesn't seem unreasonable for most
> repos (either you don't have a lot of files, or if you do, they delta
> reasonably well). My test repo's cache is only 144K. The git cache is
> about 2.7M. The linux-2.6 cache is 22M.
The fully-packed git object database is a 6.2M pack, so you are talking
about 40% bloat; the kernel pack is 115M, so the overhead is 19%.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
@ 2006-06-29 3:50 ` Jeff King
2006-06-29 3:58 ` Jeff King
` (3 subsequent siblings)
4 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29 3:50 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, Jun 28, 2006 at 08:09:43PM -0700, Junio C Hamano wrote:
> Interesting idea. I think this matters more because, for a
> repository with many unrelated undeltifiable files, we do the
> computation even for objects where it yields _no_ delta. For normal
Yes, precisely.
> So I am curious where the speed-up comes from for "normal" repos
> in your experiments. If it turns out that in "normal" repos the
I assumed they were either:
- blobs of files with only a single revision
- blobs with one "island" revision which is so different from previous
revisions that it isn't worth a delta
You can get a (very rough) estimate on the former:
$ cd linux-2.6 && git-rev-list --objects --all \
| grep / | cut -d' ' -f2 | sort | uniq -u | wc -l
8364
As I said, we may also be catching possible deltas at the edge of the
pack depth. I should maybe put the cache check closer to the
create_delta call.
> objects that hit your negative cache are stored undeltified,
> then that suggests that it might be worthwhile to consider using
> a cache of "inherently undeltifiable objects": in other words, a
> negative cache of O(N) entries instead of O(N^2) entries.
That would certainly be preferable, though I'm not convinced there
aren't objects which are deltifiable, but not right now (e.g., a file
with only one revision, but which will later get a revision).
I'm not sure what would make a file inherently undeltifiable. But you
could put the O(N) cache in front of the other cache, and reduce the
size of N for the O(N^2) cache.
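A layered probe along those lines might look roughly like this (a
sketch with made-up names; this is not part of my patch):

```c
#include <string.h>

/* Membership test over a sorted array of fixed-size records
 * (rec = 20 for single sha1s, rec = 40 for sha1 pairs). */
int in_sorted(const unsigned char *arr, unsigned int n, unsigned int rec,
	      const unsigned char *key)
{
	unsigned int lo = 0, hi = n;
	while (lo < hi) {
		unsigned int mid = lo + (hi - lo) / 2;
		int cmp = memcmp(key, arr + (size_t)mid * rec, rec);
		if (!cmp)
			return 1;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}

/* Consult an O(N) front cache of "undeltifiable" objects before
 * falling back to the O(N^2) pair cache.  All names hypothetical. */
int skip_delta(const unsigned char *single, unsigned int nsingle,
	       const unsigned char *pairs, unsigned int npairs,
	       const unsigned char a[20], const unsigned char b[20])
{
	unsigned char key[40];

	if (in_sorted(single, nsingle, 20, a) ||
	    in_sorted(single, nsingle, 20, b))
		return 1;

	memcpy(key, a, 20);
	memcpy(key + 20, b, 20);
	return in_sorted(pairs, npairs, 40, key);
}
```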
> deltify against "just in case". Can you easily instrument your
> code to see where in the sorted delta candidate list the pairs
> that hit your negative cache are? That is, in the find_deltas()
I'll look into that...
> Another idea is to have a cache of "paths at which inherently
> undeltifiable objects live in". For example, we currently do
> not delta OpenOffice documents (*.odt, *.odp, etc) very well.
> If one has a repository that tracks the history of "file.odp",
> we know each revision of "file.odp" would not delta against any
> other version anyway, and could skip attempting to deltify them.
I thought about that, but I was trying to avoid "clever" heuristics that
can often be wrong. Case in point: my repo consists mainly of jpegs,
most of which will not delta. But for a few, I edited the exif header
and they delta'd nicely against their previous revision.
> Your message contained the string "*pt-in" in the commentary part
> (replace the asterisk with a lowercase o) and was discarded by the vger
> mailing list software because that is a taboo word. If you
Oops, I didn't know we were filtering. I'll resend in a moment.
> > reasonably well). My test repo's cache is only 144K. The git cache is
> > about 2.7M. The linux-2.6 cache is 22M.
> The fully-packed git object database is a 6.2M pack, so you are talking
> about 40% bloat; the kernel pack is 115M, so the overhead is 19%.
Yes, obviously if you're interested in maximal space saving, this is
stupid for the classic git repo case (though it makes sense for repos
with few but very large blobs). However, if you assume that:
1. Packing is inherently space-saving and more-or-less required on
large repos like linux-2.6 (which is 1.8G unpacked and 115M packed)
2. It saves time (36 seconds per repack on linux-2.6)
then a more reasonable question is "do you want to spend 22M to save 30
seconds every time you repack -a?". Or, to really cook the numbers, you
can say that the cache is only wasting 1% versus an unpacked repo. :)
I think it's reasonable that this should be an optional feature. For
some repos, it turns them from almost unusable to very fast. For others,
it's a questionable space-time tradeoff (and for some pathological
cases, it would probably turn the repo from usable to unusable).
-Peff
^ permalink raw reply [flat|nested] 36+ messages in thread
* [RFC] Cache negative delta pairs
2006-06-29 3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
2006-06-29 3:50 ` Jeff King
@ 2006-06-29 3:58 ` Jeff King
2006-06-29 4:30 ` Jeff King
2006-06-29 16:39 ` Nicolas Pitre
2006-06-29 4:09 ` Jeff King
` (2 subsequent siblings)
4 siblings, 2 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29 3:58 UTC (permalink / raw)
To: git
>From repack to repack, we end up trying to delta many of the same object
pairs, which is computationally expensive. This patch makes a
persistent cache, $GIT_DIR/delta-cache, which contains pairs of sha1
hashes (the presence of a pair indicates that it's not worth trying to
delta those two objects).
Signed-off-by: Jeff King <peff@peff.net>
---
[This is a repost, since the original got caught by the list filter.]
I found this especially to be a problem with repos that consist of many
large, unrelated files (e.g., photos). For example, on my test repo
(about 300 unrelated 1-2M jpgs), a 'git-repack -a' takes about 10
minutes to complete. With the delta cache, subsequent repacks take only
13 seconds. Results are not quite as dramatic for "normal" repos, but
there is still some speedup. Repacking a fully packed linux-2.6 repo
went from 1m12s to 36s. Repacking the git repo goes from 5.6s to 3.0s.
Here are some of my thoughts:
- speed. The implementation is quite fast. The sha1 pairs are stored
sorted, and we mmap and binary search them. Certainly the extra time
spent in lookup is justified by avoiding the delta attempts.
- size. The cache is a packed sequence of binary sha1 pairs. I was
concerned that it would grow too large (obviously for n blobs you can
end up with n^2/2 entries), but it doesn't seem unreasonable for most
repos (either you don't have a lot of files, or if you do, they delta
reasonably well). My test repo's cache is only 144K. The git cache is
about 2.7M. The linux-2.6 cache is 22M.
Theoretically, I could bound the cache size and boot old entries.
However, that means storing age information, which increases the
required size. I think keeping it simple is best.
- correctness. Currently the code uses the output of try_delta for
negative caching. Should the cache checking be moved inside try_delta
instead? This would give more control over which reasons to mark a
delta negative (I believe right now hitting the depth limit will
cause a negative mark; we should perhaps only do so if the content
itself makes the delta unpalatable).
- optionalness. Currently the delta-cache is always used. Since it is a
space-time tradeoff, maybe it should be optional (it will have
negligible performance and horrible size impact on a repo that
consists of many very small but unrelated objects). Possible methods
include:
- enable cache saves only if .git/delta-cache is present; turn it
on initially with 'touch .git/delta-cache'
- config variable
Makefile | 4 +-
delta-cache.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
delta-cache.h | 11 +++++
pack-objects.c | 11 +++++
4 files changed, 142 insertions(+), 3 deletions(-)
diff --git a/Makefile b/Makefile
index cde619c..39c4308 100644
--- a/Makefile
+++ b/Makefile
@@ -202,7 +202,7 @@ LIB_H = \
blob.h cache.h commit.h csum-file.h delta.h \
diff.h object.h pack.h pkt-line.h quote.h refs.h \
run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
- tree-walk.h log-tree.h dir.h
+ tree-walk.h log-tree.h dir.h delta-cache.h
DIFF_OBJS = \
diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -217,7 +217,7 @@ LIB_OBJS = \
server-info.o setup.o sha1_file.o sha1_name.o strbuf.o \
tag.o tree.o usage.o config.o environment.o ctype.o copy.o \
fetch-clone.o revision.o pager.o tree-walk.o xdiff-interface.o \
- alloc.o $(DIFF_OBJS)
+ alloc.o delta-cache.o $(DIFF_OBJS)
BUILTIN_OBJS = \
builtin-log.o builtin-help.o builtin-count.o builtin-diff.o builtin-push.o \
diff --git a/delta-cache.c b/delta-cache.c
new file mode 100644
index 0000000..d132867
--- /dev/null
+++ b/delta-cache.c
@@ -0,0 +1,119 @@
+#include "delta-cache.h"
+#include "cache.h"
+
+static const unsigned char* disk_cache = 0;
+static unsigned disk_cache_len = 0;
+static unsigned char* mem_cache = 0;
+static unsigned mem_cache_len = 0;
+static unsigned mem_cache_alloc = 0;
+
+#define GETCACHE(c, n) (c+(40*n))
+
+static void disk_cache_init()
+{
+ static int done = 0;
+ int fd;
+ struct stat st;
+
+ if (done) return;
+ done = 1;
+
+ fd = open(git_path("delta-cache"), O_RDONLY);
+ if (fd < 0)
+ return;
+ if (fstat(fd, &st) || (st.st_size == 0)) {
+ close(fd);
+ return;
+ }
+ disk_cache = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+ close(fd);
+ if (disk_cache != MAP_FAILED)
+ disk_cache_len = st.st_size / 40;
+}
+
+static int
+check_cache(const unsigned char* c, unsigned len, const unsigned char s[40])
+{
+ int left, right, mid, cmp;
+
+ left = 0;
+ right = len - 1;
+ while (left <= right) {
+ mid = left + (right - left) / 2;
+ cmp = memcmp(s, GETCACHE(c, mid), 40);
+ if(cmp == 0) return 1;
+ else if(cmp < 0) right = mid - 1;
+ else left = mid + 1;
+ }
+ return 0;
+}
+
+extern int
+delta_cache_check(const unsigned char a[20], const unsigned char b[20])
+{
+ unsigned char search[40];
+
+ disk_cache_init();
+ memcpy(search, a, 20);
+ memcpy(search+20, b, 20);
+ return check_cache(disk_cache, disk_cache_len, search);
+}
+
+extern void
+delta_cache_mark(const unsigned char a[20], const unsigned char b[20])
+{
+ if (mem_cache_len == mem_cache_alloc) {
+ mem_cache_alloc = mem_cache_alloc ? mem_cache_alloc * 2 : 16;
+ mem_cache = xrealloc(mem_cache, mem_cache_alloc * 40);
+ }
+ memcpy(GETCACHE(mem_cache, mem_cache_len), a, 20);
+ memcpy(GETCACHE(mem_cache, mem_cache_len)+20, b, 20);
+ mem_cache_len++;
+}
+
+static int
+compare_sha1pair(const void *a, const void *b)
+{
+ return memcmp(a, b, 40);
+}
+
+static int
+merge_write(int fd, const unsigned char* p1, unsigned n1,
+ const unsigned char* p2, unsigned n2)
+{
+#define EMIT(p, x) do { \
+ if (xwrite(fd, GETCACHE(p, x), 40) < 0) return -1; x++; \
+ } while(0)
+
+ int i = 0, j = 0, cmp;
+ while (i < n1 && j < n2) {
+ cmp = memcmp(GETCACHE(p1, i), GETCACHE(p2, j), 40);
+ if (cmp < 0) EMIT(p1, i);
+ else EMIT(p2, j);
+ }
+ while (i < n1) EMIT(p1, i);
+ while (j < n2) EMIT(p2, j);
+#undef EMIT
+ return 0;
+}
+
+extern void
+delta_cache_save(void)
+{
+ int fd;
+ char tmpfile[PATH_MAX];
+
+ strcpy(tmpfile, git_path("delta-cache.%u", getpid()));
+ fd = open(tmpfile, O_WRONLY|O_EXCL|O_CREAT, 0666);
+ if (fd < 0)
+ return;
+
+ qsort(mem_cache, mem_cache_len, 40, compare_sha1pair);
+ if (merge_write(fd, mem_cache, mem_cache_len,
+ disk_cache, disk_cache_len) < 0) {
+ close(fd);
+ return;
+ }
+
+ rename(tmpfile, git_path("delta-cache"));
+}
diff --git a/delta-cache.h b/delta-cache.h
new file mode 100644
index 0000000..19201be
--- /dev/null
+++ b/delta-cache.h
@@ -0,0 +1,11 @@
+#ifndef DELTA_CACHE_H
+#define DELTA_CACHE_H
+
+extern int
+delta_cache_check(const unsigned char a[20], const unsigned char b[20]);
+extern void
+delta_cache_mark(const unsigned char a[20], const unsigned char b[20]);
+extern void
+delta_cache_save(void);
+
+#endif /* DELTA_CACHE_H */
diff --git a/pack-objects.c b/pack-objects.c
index bed2497..46b9775 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -8,6 +8,7 @@ #include "delta.h"
#include "pack.h"
#include "csum-file.h"
#include "tree-walk.h"
+#include "delta-cache.h"
#include <sys/time.h>
#include <signal.h>
@@ -1083,14 +1084,21 @@ static void find_deltas(struct object_en
j = window;
while (--j > 0) {
unsigned int other_idx = idx + j;
+ int r;
struct unpacked *m;
if (other_idx >= window)
other_idx -= window;
m = array + other_idx;
if (!m->entry)
break;
- if (try_delta(n, m, m->index, depth) < 0)
+ if (delta_cache_check(n->entry->sha1, m->entry->sha1))
+ continue;
+ r = try_delta(n, m, m->index, depth);
+ if (r < 0)
break;
+ if (r == 0)
+ delta_cache_mark(n->entry->sha1,
+ m->entry->sha1);
}
/* if we made n a delta, and if n is already at max
* depth, leaving it in the window is pointless. we
@@ -1342,5 +1350,6 @@ int main(int argc, char **argv)
if (progress)
fprintf(stderr, "Total %d, written %d (delta %d), reused %d (delta %d)\n",
nr_result, written, written_delta, reused, reused_delta);
+ delta_cache_save();
return 0;
}
--
1.4.1.rc1.g3000
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
2006-06-29 3:50 ` Jeff King
2006-06-29 3:58 ` Jeff King
@ 2006-06-29 4:09 ` Jeff King
2006-06-29 15:42 ` Nicolas Pitre
2006-06-29 22:31 ` Jakub Narebski
4 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29 4:09 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, Jun 28, 2006 at 08:09:43PM -0700, Junio C Hamano wrote:
> that hit your negative cache are? That is, in the find_deltas()
> function, we have a "while (--j > 0)" loop that attempts to delta
> with the entry that is j (modulo window size) entries away from
> the current one, then j-1, j-2, ...; I am interested in the
> distribution of the "j" value for the pair "n,m" that hits your
> negative cache for normal repositories, and I am speculating
> that the value would probably be small relative to the delta
> window size.
Just to make sure I am understanding you correctly, you're interested in
the distribution of 'j' each time we skip a delta for being negative
(or each time we mark a negative, but that should simply equal the
lookup times for the next run). The instrumentation I added simply
prints the value of j each time we skip a delta. The counts are
surprisingly uniform (this is for linux-2.6):
57209 j=1
57213 j=2
57217 j=3
57221 j=4
57225 j=5
57229 j=6
57233 j=7
57237 j=8
57241 j=9
57245 j=10
They're so uniform (and in order by j!) that I feel like I must have done
something wrong...
-Peff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 3:58 ` Jeff King
@ 2006-06-29 4:30 ` Jeff King
2006-06-29 16:39 ` Nicolas Pitre
1 sibling, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29 4:30 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, Jun 28, 2006 at 11:58:49PM -0400, Jeff King wrote:
> about 2.7M. The linux-2.6 cache is 22M.
> [...]
> - correctness. Currently the code uses the output of try_delta for
> negative caching. Should the cache checking be moved inside try_delta
> instead? This would give more control over which reasons to mark a
I tried moving the cache check/mark to wrap the call to create_delta in
try_delta. The speedup is the same (since all of the CPU time is spent
in create_delta anyway), and the linux-2.6 cache size dropped to 18M.
I also think this is probably more correct (it ignores every reason not
to delta EXCEPT create_delta failing).
The distribution of the window parameter 'j' is similarly uniform:
44900 j=2
44907 j=3
44943 j=1
45001 j=4
45014 j=5
45023 j=6
45063 j=7
45158 j=8
45288 j=9
45466 j=10
Patch on top of my other one is below.
-Peff
-- >8 --
pack-objects: move delta-cache check/mark closer to create_delta
Signed-off-by: Jeff King <peff@peff.net>
---
pack-objects.c | 15 ++++++---------
1 files changed, 6 insertions(+), 9 deletions(-)
diff --git a/pack-objects.c b/pack-objects.c
index 46b9775..83ccc8a 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -1014,9 +1014,13 @@ static int try_delta(struct unpacked *tr
if (sizediff >= max_size)
return 0;
+ if (delta_cache_check(src->entry->sha1, trg->entry->sha1))
+ return 0;
delta_buf = create_delta(src_index, trg->data, size, &delta_size, max_size);
- if (!delta_buf)
+ if (!delta_buf) {
+ delta_cache_mark(src->entry->sha1, trg->entry->sha1);
return 0;
+ }
trg_entry->delta = src_entry;
trg_entry->delta_size = delta_size;
@@ -1084,21 +1088,14 @@ static void find_deltas(struct object_en
j = window;
while (--j > 0) {
unsigned int other_idx = idx + j;
- int r;
struct unpacked *m;
if (other_idx >= window)
other_idx -= window;
m = array + other_idx;
if (!m->entry)
break;
- if (delta_cache_check(n->entry->sha1, m->entry->sha1))
- continue;
- r = try_delta(n, m, m->index, depth);
- if (r < 0)
+ if (try_delta(n, m, m->index, depth) < 0)
break;
- if (r == 0)
- delta_cache_mark(n->entry->sha1,
- m->entry->sha1);
}
/* if we made n a delta, and if n is already at max
* depth, leaving it in the window is pointless. we
--
1.4.1.rc1.g0458-dirty
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
` (2 preceding siblings ...)
2006-06-29 4:09 ` Jeff King
@ 2006-06-29 15:42 ` Nicolas Pitre
2006-06-29 16:35 ` Nicolas Pitre
2006-06-29 18:00 ` Jeff King
2006-06-29 22:31 ` Jakub Narebski
4 siblings, 2 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 15:42 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jeff King, git
On Wed, 28 Jun 2006, Junio C Hamano wrote:
> Interesting idea. I think this matters more because, for a
> repository with many unrelated undeltifiable files, we do the
> computation even for objects where it yields _no_ delta. For normal,
> nearly fully packed repositories, once an object is deltified
> against something else, subsequent repacking of the same set of
> objects (or a superset thereof) will very likely reuse the delta
> without recomputation, so as long as each object _can_ be
> deltified against at least _one_ other object, you should not see
> much improvement for them.
>
> So I am curious where the speed-up comes from for "normal" repos
> in your experiments.
My GIT repo currently has 23622 objects, 7610 of which are
undeltified. Those objects are of course candidates for delta matching
each time git-repack is run.
> If it turns out that in "normal" repos the
> objects that hit your negative cache are stored undeltified,
> then that suggests that it might be worthwhile to consider using
> a cache of "inherently undeltifiable objects": in other words, a
> negative cache of O(N) entries instead of O(N^2) entries.
Actually... I'm not so sure. Those objects are not "inherently
undeltifiable". They just happen to not have other objects to easily
delta against in the given set of objects. Think of a file with only
one revision for example. As soon as there is a second revision of that
file added to the set of objects then the former revision would have a
high probability of being deltifiable.
So the negative cache should not be O(N^2) either. It just has to be
O(N*window).
> Another interpretation of your result is that we may be using a
> delta window that is unnecessarily deep, and your negative
> cache is collecting less optimal candidates that we attempt to
> deltify against "just in case". Can you easily instrument your
> code to see where in the sorted delta candidate list the pairs
> that hit your negative cache are? That is, in the find_deltas()
> function, we have a "while (--j > 0)" loop that attempts to delta
> with the entry that is j (modulo window size) entries away from
> the current one, then j-1, j-2, ...; I am interested in the
> distribution of the "j" value for the pair "n,m" that hits your
> negative cache for normal repositories, and I am speculating
> that the value would probably be small relative to the delta
> window size.
My past experiments showed that the best window size for compression is
a bit larger than the current default of 10. It was around 15
for the kernel repository, with values above 15 not providing
significant further improvement. But that is clearly a function of the
repository's nature (the average number of revisions per file), and
the window size is directly connected to the computational cost.
> Another idea is to have a cache of "paths at which inherently
> undeltifiable objects live in". For example, we currently do
> not delta OpenOffice documents (*.odt, *.odp, etc) very well.
> If one has a repository that tracks the history of "file.odp",
> we know each revision of "file.odp" would not delta against any
> other version anyway, and could skip attempting to deltify them.
I'm afraid this could lead to bad behavior eventually. Better to just
attempt a delta once, and when an object has not found any delta
base candidate, write its sha1 and the corresponding window to the
cache. This would imply an initial cost to create the cache the first
time, but after that the cache could be relied upon as hard
information rather than as guess heuristics.
> > - size. The cache is a packed sequence of binary sha1 pairs. I was
> > concerned that it would grow too large (obviously for n blobs you can
> > end up with n^2/2 entries), but it doesn't seem unreasonable for most
> > repos (either you don't have a lot of files, or if you do, they delta
> > reasonably well). My test repo's cache is only 144K. The git cache is
> > about 2.7M. The linux-2.6 cache is 22M.
This is way suboptimal. First there is no reason for the cache to ever
grow to N^2. At worst it should be N*10 where 10 is the current window
size.
Next I think this can be made just N*2. Consider that the criterion for
skipping delta matching for a given object is that we already know
the object does not delta against any of the objects found in a given
window. Therefore we only have to compute a hash over the object names
found in that window and store that in the cache. The cache entries
would then be a pair of sha1s: first the sha1 of the victim object,
then a sha1 computed over all the sha1 names of the objects against
which the victim object was found not to delta well.
And this can be pushed even further by including the sha1 of the
victim object in the hashed list itself, therefore computing a single
hash over all the objects (the victim and the window) for which no
delta results. The cache is then just a list of hash values, each
corresponding to a bad victim+window combination.
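The hashing step could be sketched like this; note that a 64-bit FNV-1a
stands in for SHA-1 purely to keep the example self-contained, and the
function name is made up:

```c
#include <stdint.h>

/* Fold the sha1 names of the victim object and its whole window into
 * one fixed-size fingerprint.  A matching fingerprint in the cache
 * means this victim+window combination is known to produce no delta.
 * FNV-1a is used only as a stand-in for SHA-1 in this sketch. */
uint64_t window_fingerprint(unsigned char (*sha1s)[20], unsigned int count)
{
	uint64_t h = 14695981039346656037ULL;
	unsigned int i, j;

	for (i = 0; i < count; i++)
		for (j = 0; j < 20; j++) {
			h ^= sha1s[i][j];
			h *= 1099511628211ULL;
		}
	return h;
}
```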
So given my GIT repository such a cache would be 7610 * 40 = 304400
bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 15:42 ` Nicolas Pitre
@ 2006-06-29 16:35 ` Nicolas Pitre
2006-06-29 18:00 ` Jeff King
1 sibling, 0 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 16:35 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jeff King, git
On Thu, 29 Jun 2006, Nicolas Pitre wrote:
> And this can be pushed even further by just including the sha1 of the
> victim object inside the list of objects therefore computing a hash of
> all objects (the victim and the window) for which no delta results. The
> cache is therefore a list of hash values corresponding to bad
> victim+window combinations.
>
> So given my GIT repository such a cache would be 7610 * 40 = 304400
> bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
Correction: the 40 bytes figure is for _ascii_ representation of sha1
values. The cache doesn't need ascii and therefore this number can be
reduced by half.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 3:58 ` Jeff King
2006-06-29 4:30 ` Jeff King
@ 2006-06-29 16:39 ` Nicolas Pitre
2006-06-29 18:07 ` Jeff King
1 sibling, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 16:39 UTC (permalink / raw)
To: Jeff King; +Cc: git
On Wed, 28 Jun 2006, Jeff King wrote:
> >From repack to repack, we end up trying to delta many of the same object
> pairs, which is computationally expensive. This patch makes a
> persistent cache, $GIT_DIR/delta-cache, which contains pairs of sha1
> hashes (the presence of a pair indicates that it's not worth trying to
> delta those two objects).
I think this could be done differently to be even more space efficient
as I suggested earlier.
> Here are some of my thoughts:
>
> - speed. The implementation is quite fast. The sha1 pairs are stored
> sorted, and we mmap and binary search them. Certainly the extra time
> spent in lookup is justified by avoiding the delta attempts.
You do that lookup for every delta match attempt. Instead it could be
done once for the whole window attempt, potentially reducing the cache
size by a factor of 20, and it might be faster too.
> - size. The cache is a packed sequence of binary sha1 pairs. I was
> concerned that it would grow too large (obviously for n blobs you can
> end up with n^2/2 entries), but it doesn't seem unreasonable for most
> repos (either you don't have a lot of files, or if you do, they delta
> reasonably well). My test repo's cache is only 144K. The git cache is
> about 2.7M. The linux-2.6 cache is 22M.
See my previous email for comments about this.
> Theoretically, I could bound the cache size and boot old entries.
> However, that means storing age information, which increases the
> required size. I think keeping it simple is best.
You could simply recreate the cache on each run. Or keep a bitmap
of referenced cache entries, plus a list of new entries. At the end, if
the new entry list is empty and the bitmap is all set (or clear), then
you just keep the current cache. Once, say, more than 10% of the cache
entries are no longer used, you regenerate the cache. But since you
need to regenerate it when new entries are added anyway, the 10% unused
entry criterion might not trigger that often, unless packs shrink
over time, which is not really common.
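That bookkeeping could be sketched as follows (a hypothetical helper;
the 10% threshold is just the one suggested above):

```c
/* Decide whether the on-disk cache must be rewritten: always when new
 * negative results appeared, otherwise only once more than 10% of the
 * existing entries went unreferenced during this repack.  The bitmap
 * has one bit per existing cache entry.  Illustrative sketch only. */
int cache_needs_rewrite(const unsigned char *used_bitmap,
			unsigned int nentries, unsigned int nnew)
{
	unsigned int i, unused = 0;

	if (nnew)
		return 1;
	for (i = 0; i < nentries; i++)
		if (!(used_bitmap[i / 8] & (1u << (i % 8))))
			unused++;
	return unused * 10 > nentries;
}
```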
> - correctness. Currently the code uses the output of try_delta for
> negative caching. Should the cache checking be moved inside try_delta
> instead? This would give more control over which reasons to mark a
> delta negative (I believe right now hitting the depth limit will
> cause a negative mark; we should perhaps only do so if the content
> itself makes the delta unpalatable).
Like I said earlier, I think this should be moved up a bit, i.e. outside
the delta loop. In find_deltas(), right before the ...
j = window;
while (--j > 0) {
loop, just insert another loop that computes the sha1 of the current
target object and window:
SHA1_Init(&c);
j = window;
while (j-- > 0) { /* post-decrement: include the current object too */
unsigned int other_idx = idx + j;
struct unpacked *m;
if (other_idx >= window)
other_idx -= window;
m = array + other_idx;
if (!m->entry)
break;
SHA1_Update(&c, m->entry->sha1, 20);
}
SHA1_Final(negative_delta_hash, &c);
Then you can skip over the whole of the delta loop right away if that
negative_delta_hash matches one entry in the cache since you already
know that this object with this window doesn't produce any delta result.
Otherwise, after the delta loop has proceeded, if there is no delta
found then you can add negative_delta_hash to the cache.
> - optionalness. Currently the delta-cache is always used. Since it is a
> space-time tradeoff, maybe it should be optional (it will have
> negligible performance and horrible size impact on a repo that
> consists of many very small but unrelated objects). Possible methods
> include:
> - enable cache saves only if .git/delta-cache is present; turn it
> on initially with 'touch .git/delta-cache'
> - config variable
First, I think it should be ignored (but still created) when
--no-reuse-delta is passed. Then, it should not be created (but still
looked up if it exists and --no-reuse-delta is not provided) when the
pack index file is also not created. I don't think it is worth making
this further configurable, and given the suggested strategy above the
cache should remain fairly small.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 15:42 ` Nicolas Pitre
2006-06-29 16:35 ` Nicolas Pitre
@ 2006-06-29 18:00 ` Jeff King
2006-06-29 18:24 ` Nicolas Pitre
1 sibling, 1 reply; 36+ messages in thread
From: Jeff King @ 2006-06-29 18:00 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
On Thu, Jun 29, 2006 at 11:42:14AM -0400, Nicolas Pitre wrote:
> So the negative cache should not be O(N^2) either. It just has to be
> O(N*window).
Yes, unless we use the cache information to change the window (but I
don't think that's worth it, unless our window heuristics are bad).
The implementation I posted should already grow to O(N*window), since it
only saves what we try (and fail at).
> My past experiments showed that the best window size for compression is
> a bit larger than the current default of 10. It was rather around 15
> for the kernel repository, with higher values than 15 not providing
> significant improvements anymore. But that is clearly a function of the
> repository nature (the average number of revisions for each file). But
> the window size is directly connected to the computational cost.
Increasing the window, of course, has virtually no speed impact on later
repacks with the cache in effect. Packing with a window of 15 drops my
linux-2.6 pack to 108M from 122M. That helps offset the cache size
(though of course it takes longer on the initial pack).
> This is way suboptimal. First there is no reason for the cache to ever
> grow to N^2. At worst it should be N*10 where 10 is the current window
> size.
I assumed the window would change over time (though our total is still
likely to hang around N*10 rather than N^2).
> none of the objects found in a given window. Therefore we only have to
> compute a hash for the object names found in that window and store that
> in the cache. So the cache entries would then be a pair of sha1: first
> the sha1 of the victim object, and the sha1 of all sha1 names for the
> objects against which the victim object was found not to delta well.
This will fail to hit the cache anytime the window changes. How often
does the window change? In my test case, I would think anytime I added a
bunch of new photos, it would be likely that one of them would make it
into the window, thus invalidating the cache entry and forcing me to try
against every object in the window (even though I've already tried
9/10).
Also, is it true to say "if this object did not delta against this
window, it will never delta?" What about interactions with the depth
parameter?
> So given my GIT repository such a cache would be 7610 * 40 = 304400
> bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
Keep in mind that it will grow every time the window changes.
-Peff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 16:39 ` Nicolas Pitre
@ 2006-06-29 18:07 ` Jeff King
2006-06-29 18:48 ` Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2006-06-29 18:07 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git
On Thu, Jun 29, 2006 at 12:39:31PM -0400, Nicolas Pitre wrote:
> You do that lookup for every delta match attempt. Instead it could be
> done once for the whole window attempt, potentially reducing the cache
> size by a factor of 20, and it might be faster too.
I'm not convinced this will provide good cache hit characteristics, and
I'm not convinced it's semantically correct (see my other mail).
> You could simply recreate the cache on each run. Or just keep a bitmap
Yes, that would probably work and would be quite easy to do with the
existing code.
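Something like this standalone sketch, perhaps (all names are made up; entries stands in for however the cache keys actually end up being stored):

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch of the "keep a bitmap" idea: mark each cache entry that gets
 * hit during this repack run, then drop the unmarked (stale) entries
 * before writing the cache back out.
 */
#define NENTRIES 1024
static unsigned char used_bitmap[(NENTRIES + 7) / 8];

/* Mark cache entry n as hit during this run. */
static void mark_used(unsigned int n)
{
	used_bitmap[n / 8] |= 1u << (n % 8);
}

static int was_used(unsigned int n)
{
	return (used_bitmap[n / 8] >> (n % 8)) & 1;
}

/*
 * At save time, compact the array of 20-byte keys in place, keeping
 * only entries that were hit; returns the new entry count.
 */
static unsigned int drop_stale(unsigned char (*entries)[20], unsigned int n)
{
	unsigned int i, kept = 0;
	for (i = 0; i < n; i++)
		if (was_used(i))
			memmove(entries[kept++], entries[i], 20);
	return kept;
}
```

That keeps the on-disk cache from accumulating entries for windows that no longer occur.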
> First, I think it should be ignored (but still created) when
> --no-reuse-delta is passed. Then, it should not be created (but still
> looked up if it exists and --no-reuse-delta is not provided) when the
> pack index file is also not created. I don't think it is worth making
> this further configurable, and given the suggested strategy above the
> cache should remain fairly small.
Those suggestions make sense to me.
-Peff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 18:00 ` Jeff King
@ 2006-06-29 18:24 ` Nicolas Pitre
2006-06-29 18:53 ` Jeff King
0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 18:24 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
On Thu, 29 Jun 2006, Jeff King wrote:
> > This is way suboptimal. First there is no reason for the cache to ever
> > grow to N^2. At worst it should be N*10 where 10 is the current window
> > size.
>
> I assumed the window would change over time (though our total is still
> likely to hang around N*10 rather than N^2).
It doesn't change unless you force a different window size.
> > none of the objects found in a given window. Therefore we only have to
> > compute a hash for the object names found in that window and store that
> > in the cache. So the cache entries would then be a pair of sha1: first
> > the sha1 of the victim object, and the sha1 of all sha1 names for the
> objects against which the victim object was found not to delta well.
>
> This will fail to hit the cache anytime the window changes. How often
> does the window change? In my test case, I would think anytime I added a
> bunch of new photos, it would be likely that one of them would make it
> into the window, thus invalidating the cache entry and forcing me to try
> against every object in the window (even though I've already tried
> 9/10).
Sure. But on the whole, how often will that happen?
This trades a bit of CPU for a much smaller cache, which might be worth it.
And even then, since my suggested method implies only one cache lookup
in a much smaller cache instead of 10 lookups in a larger cache for each
object, it might end up faster overall even if sometimes some windows
don't match and deltas are recomputed needlessly.
> Also, is it true to say "if this object did not delta against this
> window, it will never delta?" What about interactions with the depth
> parameter?
Of course a greater depth might allow for a hit where there isn't any
otherwise. But changing the delta depth is not something one does
that often, and when the depth is changed you had better use -f with
git-repack as well, which, as I said, should also ignore the cache.
> > So given my GIT repository such a cache would be 7610 * 40 = 304400
> > bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
>
> Keep in mind that it will grow every time the window changes.
What do you mean by window change?
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 18:07 ` Jeff King
@ 2006-06-29 18:48 ` Nicolas Pitre
2006-06-29 18:58 ` Jeff King
0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 18:48 UTC (permalink / raw)
To: Jeff King; +Cc: git
On Thu, 29 Jun 2006, Jeff King wrote:
> On Thu, Jun 29, 2006 at 12:39:31PM -0400, Nicolas Pitre wrote:
>
> > You do that lookup for every delta match attempt. Instead it could be
> > done once for the whole window attempt, potentially reducing the cache
> > size by a factor of 20, and it might be faster too.
>
> I'm not convinced this will provide good cache hit characteristics, and
> I'm not convinced it's semantically correct (see my other mail).
Dare to test it? I provided you with most of the code difference
already.
And what do you mean by "semantically correct"?
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 18:24 ` Nicolas Pitre
@ 2006-06-29 18:53 ` Jeff King
2006-06-29 19:04 ` Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2006-06-29 18:53 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
On Thu, Jun 29, 2006 at 02:24:57PM -0400, Nicolas Pitre wrote:
> > I assumed the window would change over time (though our total is still
> > likely to hang around N*10 rather than N^2).
> It doesn't change unless you force a different window size.
Sorry, I meant "the items in the window for a given object would change
over time."
> > This will fail to hit the cache anytime the window changes. How often
> > does the window change? In my test case, I would think anytime I added a
> > bunch of new photos, it would be likely that one of them would make it
> > into the window, thus invalidating the cache entry and forcing me to try
> > against every object in the window (even though I've already tried
> > 9/10).
> Sure. But on the lot how often will that happen?
Reasonably often, according to my test. I did this to simulate usage
over time:
- create an empty repo
- from my test repo of 515 images, grab 20 at a time and add/commit
them
- after each commit, record the SHA1 of (object, window[0..n]) for
each object to be delta'd
If doing the cache on the sha1 of the whole window is a good idea, then
we should see many of the same hashes from commit to commit. If we
don't, that means the newly added files are being placed in the old
windows, thus disrupting their hashes.
The results were that there was typically only 1 reusable window each
time I added 20 files. At that point, caching is largely pointless.
> And even then, since my suggested method implies only one cache lookup
> in a much smaller cache instead of 10 lookups in a larger cache for each
> object, it might end up faster overall even if sometimes some windows
> don't match and deltas are recomputed needlessly.
I didn't benchmark, but I doubt it will have significant impact.
Especially on my photo test repo, the lookups are dominated by the
create_delta time by several orders of magnitude.
> Of course a greater depth might allow for a hit where there isn't any
> otherwise. But changing the delta depth is not something someone does
> that often, and when the depth is changed then you better use -f with
> git-repack as well which like I said should also ignore the cache.
That sounds reasonable to me for depth. What about other reasons for
try_delta to fail? Preferred base?
> > > So given my GIT repository such a cache would be 7610 * 40 = 304400
> > > bytes if we stick to the full 40 bytes of sha1 to hash bad combinations.
> > Keep in mind that it will grow every time the window changes.
> What do you mean by window change?
I meant that when the window we use for a given object changes, it will
insert a new cache entry. But if we deal with invalidating unused cache
entries as you suggested before, it won't matter.
-Peff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 18:48 ` Nicolas Pitre
@ 2006-06-29 18:58 ` Jeff King
2006-06-29 19:06 ` Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2006-06-29 18:58 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git
On Thu, Jun 29, 2006 at 02:48:23PM -0400, Nicolas Pitre wrote:
> Dare to test it? I provided you with most of the code difference
> already.
See my other mail. Unless I did something horribly wrong, caching full
windows is largely useless.
> And what do you mean by "semantically correct"?
I mean that right now the cache means "calling create_delta on content
that has sha1_a against content that has sha1_b will not produce a
useful delta." That makes sense since the content is never going to
change for a given sha1. However, covering the whole window takes into
account depth and preferred base. We just need to be sure that it is
correct to be including those in our cache calculation (I don't know,
and I'll defer to your judgement on that, since you clearly know more
about the delta logic than I do).
-Peff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 18:53 ` Jeff King
@ 2006-06-29 19:04 ` Nicolas Pitre
2006-06-29 19:52 ` Jeff King
0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 19:04 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
On Thu, 29 Jun 2006, Jeff King wrote:
> On Thu, Jun 29, 2006 at 02:24:57PM -0400, Nicolas Pitre wrote:
>
> > > I assumed the window would change over time (though our total is still
> > > likely to hang around N*10 rather than N^2).
> > It doesn't change unless you force a different window size.
>
> Sorry, I meant "the items in the window for a given object would change
> over time."
>
> > > This will fail to hit the cache anytime the window changes. How often
> > > does the window change? In my test case, I would think anytime I added a
> > > bunch of new photos, it would be likely that one of them would make it
> > > into the window, thus invalidating the cache entry and forcing me to try
> > > against every object in the window (even though I've already tried
> > > 9/10).
> > Sure. But on the lot how often will that happen?
>
> Reasonably often, according to my test. I did this to simulate usage
> over time:
> - create an empty repo
> - from my test repo of 515 images, grab 20 at a time and add/commit
> them
> - after each commit, record the SHA1 of (object, window[0..n]) for
> each object to be delta'd
> If doing the cache on the sha1 of the whole window is a good idea, then
> we should see many of the same hashes from commit to commit. If we
> don't, that means the newly added files are being placed in the old
> windows, thus disrupting their hashes.
>
> The results were that there was typically only 1 reusable window each
> time I added 20 files. At that point, caching is largely pointless.
Right. Your use pattern is a special case that doesn't work well with
the whole window hash approach. I'd expect it to work beautifully with
the kernel repository though.
> > And even then, since my suggested method implies only one cache lookup
> > in a much smaller cache instead of 10 lookups in a larger cache for each
> object, it might end up faster overall even if sometimes some windows
> > don't match and deltas are recomputed needlessly.
>
> I didn't benchmark, but I doubt it will have significant impact.
> Especially on my photo test repo, the lookups are dominated by the
> create_delta time by several orders of magnitude.
Again I think it is a repo like the linux kernel that would benefit
more.
> > Of course a greater depth might allow for a hit where there isn't any
> > otherwise. But changing the delta depth is not something someone does
> > that often, and when the depth is changed then you better use -f with
> > git-repack as well which like I said should also ignore the cache.
>
> That sounds reasonable to me for depth. What about other reasons for
> try_delta to fail? Preferred base?
Hmmm. That might need to be dealt with (easily, but still).
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 18:58 ` Jeff King
@ 2006-06-29 19:06 ` Nicolas Pitre
0 siblings, 0 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 19:06 UTC (permalink / raw)
To: Jeff King; +Cc: git
On Thu, 29 Jun 2006, Jeff King wrote:
> On Thu, Jun 29, 2006 at 02:48:23PM -0400, Nicolas Pitre wrote:
>
> > Dare to test it? I provided you with most of the code difference
> > already.
>
> See my other mail. Unless I did something horribly wrong, caching full
> windows is largely useless.
... on your special photo repository I agree.
I'm still unconvinced for large repos though.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 19:04 ` Nicolas Pitre
@ 2006-06-29 19:52 ` Jeff King
2006-06-29 20:24 ` Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2006-06-29 19:52 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
On Thu, Jun 29, 2006 at 03:04:15PM -0400, Nicolas Pitre wrote:
> Right. Your use pattern is a special case that doesn't work well with
> the whole window hash approach. I'd expect it to work beautifully with
> the kernel repository though.
I don't necessarily care about the kernel repository. It packs fine as
it is, and you only waste 30 seconds on a repack checking deltas that
could be avoided. I do care on my special repository where packing is
virtually unusable and I can achieve a 45x speedup. Maybe my original
caching is not worth it for the kernel and should be configurable,
but obviously this window caching cannot REPLACE mine since it fails
utterly for the one thing I wanted it for.
That being said, I'm not sure that window caching is all that great for
"normal" repos.
Same test as before, but instead of simulating the commits, I merely
looked at the window hashes produced by
git-rev-list --objects master~$x
For the git repo:
x=0 tries 6698 windows
x=0 and x=50 contain 5197 identical windows
x=0 and x=100 contain 2484 identical windows
x=0 and x=500 contain 455 identical windows
For linux-2.6 repo:
x=0 tries 57208 windows
x=0 and x=50 contain 53677 identical windows
x=0 and x=100 contain 52886 identical windows
x=0 and x=500 contain 41196 identical windows
Obviously the kernel repo is doing better, but x=500 is only 4 days ago.
Trying with --before=2.weeks.ago yields only 31505 matches.
So the windows do clearly experience a fair bit of churn, but whether or
not this is worth it depends on how long you think is reasonable before
something gets "churned out".
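For reference, the counting itself was just set intersection over the dumped hash lists; assuming an instrumented git-pack-objects wrote one window hash per line for each run (the instrumentation and file names here are made up), it amounts to:

```shell
# Stand-in sample data; in the real test these files came from the
# (hypothetical) instrumented pack-objects runs at x=0 and x=50.
printf 'hash-a\nhash-b\nhash-c\n' > windows-x0.txt
printf 'hash-b\nhash-c\nhash-d\n' > windows-x50.txt

# Identical windows are the lines common to both sorted lists.
sort windows-x0.txt > x0.sorted
sort windows-x50.txt > x50.sorted
comm -12 x0.sorted x50.sorted | wc -l
```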
-Peff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 19:52 ` Jeff King
@ 2006-06-29 20:24 ` Nicolas Pitre
2006-06-29 21:04 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 20:24 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
On Thu, 29 Jun 2006, Jeff King wrote:
> On Thu, Jun 29, 2006 at 03:04:15PM -0400, Nicolas Pitre wrote:
>
> > Right. Your use pattern is a special case that doesn't work well with
> > the whole window hash approach. I'd expect it to work beautifully with
> > the kernel repository though.
>
> I don't necessarily care about the kernel repository. It packs fine as
> it is, and you only waste 30 seconds on a repack checking deltas that
> could be avoided. I do care on my special repository where packing is
> virtually unusable and I can achieve a 45x speedup. Maybe my original
> caching is not worth it for the kernel and should be configurable,
> but obviously this window caching cannot REPLACE mine since it fails
> utterly for the one thing I wanted it for.
And I agreed with you already.
> That being said, I'm not sure that window caching is all that great for
> "normal" repos.
>
> Same test as before, but instead of simulating the commits, I merely
> looked at the window hashes produced by
> git-rev-list --objects master~$x
>
> For the git repo:
> x=0 tries 6698 windows
> x=0 and x=50 contain 5197 identical windows
> x=0 and x=100 contain 2484 identical windows
> x=0 and x=500 contain 455 identical windows
>
> For linux-2.6 repo:
> x=0 tries 57208 windows
> x=0 and x=50 contain 53677 identical windows
> x=0 and x=100 contain 52886 identical windows
> x=0 and x=500 contain 41196 identical windows
>
> Obviously the kernel repo is doing better, but x=500 is only 4 days ago.
> Trying with --before=2.weeks.ago yields only 31505 matches.
What does this prove? I fail to see the relation between those results
and a possible git-pack-objects improvement.
The negative delta cache concept is certainly attractive even for normal
repositories, especially for public servers, since when used in
conjunction with delta reuse it makes the creation of a pack basically
free. So I think this idea really has merits, as long as the cache
remains small.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 20:24 ` Nicolas Pitre
@ 2006-06-29 21:04 ` Linus Torvalds
2006-06-29 21:24 ` Nicolas Pitre
` (3 more replies)
2006-06-29 21:26 ` Junio C Hamano
2006-06-29 21:37 ` Jeff King
2 siblings, 4 replies; 36+ messages in thread
From: Linus Torvalds @ 2006-06-29 21:04 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jeff King, Junio C Hamano, git
On Thu, 29 Jun 2006, Nicolas Pitre wrote:
>
> The negative delta cache concept is certainly attractive even for normal
> repositories, especially for public servers, since when used in
> conjunction with delta reuse it makes the creation of a pack basically
> free. So I think this idea really has merits, as long as the cache
> remains small.
I don't really see much of a point of this all.
Instead of having a separate cache, wouldn't it be much better to just
take the hint from the previous pack-file?
In the repacking window, if both objects we are looking at already came
from the same (old) pack-file, don't bother delta'ing them against each
other.
That means that we'll still always check for better deltas for (and
against!) _unpacked_ objects, but assuming incremental repacks, you'll
avoid the delta creation 99% of the time.
I.e. something really simple like the appended.
Linus
---
diff --git a/pack-objects.c b/pack-objects.c
index bed2497..cea63e7 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -988,6 +988,13 @@ static int try_delta(struct unpacked *tr
return -1;
/*
+ * We do not bother to try a delta that we discarded
+ * on an earlier try
+ */
+ if (trg_entry->in_pack && trg_entry->in_pack == src_entry->in_pack)
+ return -1;
+
+ /*
* If the current object is at pack edge, take the depth the
* objects that depend on the current object into account --
* otherwise they would become too deep.
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 21:04 ` Linus Torvalds
@ 2006-06-29 21:24 ` Nicolas Pitre
2006-06-29 21:30 ` Linus Torvalds
2006-06-29 21:35 ` [RFC] Cache negative delta pairs Jeff King
` (2 subsequent siblings)
3 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 21:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff King, Junio C Hamano, git
On Thu, 29 Jun 2006, Linus Torvalds wrote:
> Instead of having a separate cache, wouldn't it be much better to just
> take the hint from the previous pack-file?
DOH! ;-)
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 20:24 ` Nicolas Pitre
2006-06-29 21:04 ` Linus Torvalds
@ 2006-06-29 21:26 ` Junio C Hamano
2006-06-29 21:37 ` Jeff King
2 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2006-06-29 21:26 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git, Jeff King
Nicolas Pitre <nico@cam.org> writes:
> The negative delta cache concept is certainly attractive even for normal
> repositories, especially for public servers, since when used in
> conjunction with delta reuse it makes the creation of a pack basically
> free. So I think this idea really has merits, as long as the cache
> remains small.
Yes, I agree it is very attractive.
One thing to watch out for is that we probably would not want to
let git-daemon write into public repositories. Which means that
use of the negative cache should be strictly "opt-in".
- "$GIT_DIR/delta-cache" is read but not necessarily written
back when it exists; git-daemon uses it that way.
- The owner of the repository shouldn't have to tell the tool
to update the negative cache every time repack happens.
Which suggests that pack-objects.c can learn an option that
tells it to call delta_cache_save(), and we use it in
git-repack, perhaps like this:
diff --git a/git-repack.sh b/git-repack.sh
index 640ad8d..b07ed9b 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -44,7 +44,7 @@ case ",$all_into_one," in
esac
pack_objects="$pack_objects $local $quiet $no_reuse_delta$extra"
name=$(git-rev-list --objects --all $rev_list 2>&1 |
- git-pack-objects --non-empty $pack_objects .tmp-pack) ||
+ git-pack-objects --update-delta-cache --non-empty $pack_objects .tmp-pack) ||
exit 1
if [ -z "$name" ]; then
echo Nothing new to pack.
diff --git a/pack-objects.c b/pack-objects.c
index bed2497..46b9775 100644
--- a/pack-objects.c
+++ b/pack-objects.c
...
@@ -1342,5 +1350,7 @@ int main(int argc, char **argv)
if (progress)
fprintf(stderr, "Total %d, written %d (delta %d), reused %d (delta %d)\n",
nr_result, written, written_delta, reused, reused_delta);
+ if (update_delta_cache)
+ delta_cache_save();
return 0;
}
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 21:24 ` Nicolas Pitre
@ 2006-06-29 21:30 ` Linus Torvalds
2006-06-29 21:39 ` Jeff King
` (2 more replies)
0 siblings, 3 replies; 36+ messages in thread
From: Linus Torvalds @ 2006-06-29 21:30 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jeff King, Junio C Hamano, git
On Thu, 29 Jun 2006, Nicolas Pitre wrote:
>
> On Thu, 29 Jun 2006, Linus Torvalds wrote:
>
> > Instead of having a separate cache, wouldn't it be much better to just
> > take the hint from the previous pack-file?
>
> DOH! ;-)
Btw, I think this could do with a flag to turn it on/off (but probably
default to on).
The advantage of this patch is that it should guarantee that if everything
is already packed, a repack will basically keep the pack identical.
However, that is obviously also the dis-advantage, since it means that
repacking cannot improve packing. So adding a flag to say "please try to
incrementally improve the pack" might well be worth it, even if this new
behaviour would be the _default_.
Hmm? Jeff, does this work for your load?
Linus
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 21:04 ` Linus Torvalds
2006-06-29 21:24 ` Nicolas Pitre
@ 2006-06-29 21:35 ` Jeff King
2006-06-29 21:54 ` Junio C Hamano
2006-06-29 22:22 ` Junio C Hamano
3 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29 21:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Junio C Hamano, git
On Thu, Jun 29, 2006 at 02:04:01PM -0700, Linus Torvalds wrote:
> Instead of having a separate cache, wouldn't it be much better to just
> take the hint from the previous pack-file?
I'd like to second Nicolas' "DOH!".
This drops my nasty load with roughly the same effect as the delta
cache, and it does the same for the kernel repo. And it doesn't consume
any extra disk space. Junio, please drop my delta-cache patch in favor
of Linus' patch.
-Peff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 20:24 ` Nicolas Pitre
2006-06-29 21:04 ` Linus Torvalds
2006-06-29 21:26 ` Junio C Hamano
@ 2006-06-29 21:37 ` Jeff King
2 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29 21:37 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
On Thu, Jun 29, 2006 at 04:24:01PM -0400, Nicolas Pitre wrote:
> > Obviously the kernel repo is doing better, but x=500 is only 4 days ago.
> > Trying with --before=2.weeks.ago yields only 31505 matches.
>
> What does this prove? I fail to see the relation between those results
> and a possible git-pack-objects improvement.
My point is that the window cache gets invalidated over time, whereas a
sha1-by-sha1 cache doesn't. However, the point is moot, as I think
Linus has come up with a much better solution.
> The negative delta cache concept is certainly attractive even for normal
> repositories, especially for public servers, since when used in
> conjonction with delta reuse it makes the creation of a pack basically
> free. So I think this idea really has merits, as long as the cache
> remains small.
Yes, if you're talking about a situation in which you make many packs
for a given set of commits, then it does make more sense (what I was
really trying to eliminate was a 10 minute repack followed by another 10
minute repack to push to the server).
-Peff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 21:30 ` Linus Torvalds
@ 2006-06-29 21:39 ` Jeff King
2006-06-29 21:43 ` Joel Becker
2006-06-29 21:47 ` Nicolas Pitre
2 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2006-06-29 21:39 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Junio C Hamano, git
On Thu, Jun 29, 2006 at 02:30:11PM -0700, Linus Torvalds wrote:
> However, that is obviously also the dis-advantage, since it means that
> repacking cannot improve packing. So adding a flag to say "please try to
> incrementally improve the pack" might well be worth it, even if this new
> behaviour would be the _default_.
We could tie the on/off to --no-reuse-delta, since you would mostly
want them at the same time. This disallows repacking over and over to
try to incrementally improve size. Do people actually do that?
> Hmm? Jeff, does this work for your load?
Yes, just as well as my original patch.
-Peff
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 21:30 ` Linus Torvalds
2006-06-29 21:39 ` Jeff King
@ 2006-06-29 21:43 ` Joel Becker
2006-06-29 21:47 ` Nicolas Pitre
2 siblings, 0 replies; 36+ messages in thread
From: Joel Becker @ 2006-06-29 21:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Jeff King, Junio C Hamano, git
On Thu, Jun 29, 2006 at 02:30:11PM -0700, Linus Torvalds wrote:
> However, that is obviously also the dis-advantage, since it means that
> repacking cannot improve packing. So adding a flag to say "please try to
> incrementally improve the pack" might well be worth it, even if this new
> behaviour would be the _default_.
I nominate "--pack-me-harder".
Joel
--
"Behind every successful man there's a lot of unsuccessful years."
- Bob Brown
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 21:30 ` Linus Torvalds
2006-06-29 21:39 ` Jeff King
2006-06-29 21:43 ` Joel Becker
@ 2006-06-29 21:47 ` Nicolas Pitre
2006-06-29 22:12 ` Junio C Hamano
2 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-29 21:47 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jeff King, Junio C Hamano, git
On Thu, 29 Jun 2006, Linus Torvalds wrote:
>
>
> On Thu, 29 Jun 2006, Nicolas Pitre wrote:
> >
> > On Thu, 29 Jun 2006, Linus Torvalds wrote:
> >
> > > Instead of having a separate cache, wouldn't it be much better to just
> > > take the hint from the previous pack-file?
> >
> > DOH! ;-)
>
> Btw, I think this could do with a flag to turn it on/off (but probably
> default to on).
I think it should simply be coupled with the --no-reuse-delta flag.
> The advantage of this patch is that it should guarantee that if everything
> is already packed, a repack will basically keep the pack identical.
>
> However, that is obviously also the dis-advantage, since it means that
> repacking cannot improve packing. So adding a flag to say "please try to
> incrementally improve the pack" might well be worth it, even if this new
> behaviour would be the _default_.
Actually, delta reuse already prevents those deltas from being
improved. So your patch only extends this notion to the non-deltified
objects as well. And the way out is to provide the --no-reuse-delta
flag.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 21:04 ` Linus Torvalds
2006-06-29 21:24 ` Nicolas Pitre
2006-06-29 21:35 ` [RFC] Cache negative delta pairs Jeff King
@ 2006-06-29 21:54 ` Junio C Hamano
2006-06-29 22:22 ` Junio C Hamano
3 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2006-06-29 21:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@osdl.org> writes:
> Instead of having a separate cache, wouldn't it be much better to just
> take the hint from the previous pack-file?
>
> In the repacking window, if both objects we are looking at already came
> from the same (old) pack-file, don't bother delta'ing them against each
> other.
>
> That means that we'll still always check for better deltas for (and
> against!) _unpacked_ objects, but assuming incremental repacks, you'll
> avoid the delta creation 99% of the time.
I bow down before you.
No ugly special-case caching, just automatically "the right
thing", with very little overhead.
It just makes sense.
We have a winner.
;-)
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] Cache negative delta pairs
2006-06-29 21:47 ` Nicolas Pitre
@ 2006-06-29 22:12 ` Junio C Hamano
2006-06-30 3:44 ` [PATCH] consider previous pack undeltified object state only when reusing delta data Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2006-06-29 22:12 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Jeff King, git
Nicolas Pitre <nico@cam.org> writes:
> On Thu, 29 Jun 2006, Linus Torvalds wrote:
>
>>
>>
>> On Thu, 29 Jun 2006, Nicolas Pitre wrote:
>> >
>> > On Thu, 29 Jun 2006, Linus Torvalds wrote:
>> >
>> > > Instead of having a separate cache, wouldn't it be much better to just
>> > > take the hint from the previous pack-file?
>> >
>> > DOH! ;-)
>>
>> Btw, I think this could do with a flag to turn it on/off (but probably
>> default to on).
>
> I think it should simply be coupled with the --no-reuse-delta flag.
I agree that makes sense.
* Re: [RFC] Cache negative delta pairs
2006-06-29 21:04 ` Linus Torvalds
` (2 preceding siblings ...)
2006-06-29 21:54 ` Junio C Hamano
@ 2006-06-29 22:22 ` Junio C Hamano
3 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2006-06-29 22:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@osdl.org> writes:
> On Thu, 29 Jun 2006, Nicolas Pitre wrote:
>>
>> The negative delta cache concept is certainly attractive even for normal
>> repositories, especially for public servers, since when used in
>> conjunction with delta reuse it makes the creation of a pack basically
>> free. So I think this idea really has merits, as long as the cache
>> remains small.
>
> I don't really see much of a point of this all.
>
> Instead of having a separate cache, wouldn't it be much better to just
> take the hint from the previous pack-file?
>
> In the repacking window, if both objects we are looking at already came
> from the same (old) pack-file, don't bother delta'ing them against each
> other.
>
> That means that we'll still always check for better deltas for (and
> against!) _unpacked_ objects, but assuming incremental repacks, you'll
> avoid the delta creation 99% of the time.
>
> I.e. something really simple like the appended.
>
> Linus
> ---
> diff --git a/pack-objects.c b/pack-objects.c
> index bed2497..cea63e7 100644
> --- a/pack-objects.c
> +++ b/pack-objects.c
> @@ -988,6 +988,13 @@ static int try_delta(struct unpacked *tr
> return -1;
>
> /*
> + * We do not bother to try a delta that we discarded
> + * on an earlier try
> + */
> + if (trg_entry->in_pack && trg_entry->in_pack == src_entry->in_pack)
> + return -1;
> +
> + /*
I think you meant to return 0 from here though. -1 means "do
not use this pair and do not bother trying to improve it with the
remaining candidates".
* Re: [RFC] Cache negative delta pairs
2006-06-29 3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
` (3 preceding siblings ...)
2006-06-29 15:42 ` Nicolas Pitre
@ 2006-06-29 22:31 ` Jakub Narebski
4 siblings, 0 replies; 36+ messages in thread
From: Jakub Narebski @ 2006-06-29 22:31 UTC (permalink / raw)
To: git
Junio C Hamano wrote:
> [...] For example, we currently do
> not delta OpenOffice documents (*.odt, *.odp, etc) very well.
> If one has a repository that tracks the history of "file.odp",
> we know each revision of "file.odp" would not delta against any
> other version anyway, and could skip attempting to deltify them.
Perhaps we should steal Mercurial's idea of an EncodeDecodeFilter, and store
OpenOffice documents, Mozilla extensions, and Java packages in the object
store as uncompressed archives, checking them out to the working area in the
original format. All diffs should of course be done on the in-repository
(after-filter) format.
The original example at
http://www.selenic.com/mercurial/wiki/index.cgi/EncodeDecodeFilter
talks about archives (zip files) and unix2dos line-ending conversion.
Perhaps for OpenOffice and Mozilla we would also need to use an [external]
XML-aware diff...
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* [PATCH] consider previous pack undeltified object state only when reusing delta data
2006-06-29 22:12 ` Junio C Hamano
@ 2006-06-30 3:44 ` Nicolas Pitre
2006-06-30 9:45 ` Johannes Schindelin
0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-30 3:44 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Without this there would never be a chance to improve packing for
previously undeltified objects.
Signed-off-by: Nicolas Pitre <nico@cam.org>
---
On Thu, 29 Jun 2006, Junio C Hamano wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > On Thu, 29 Jun 2006, Linus Torvalds wrote:
> >
> >>
> >>
> >> On Thu, 29 Jun 2006, Nicolas Pitre wrote:
> >> >
> >> > On Thu, 29 Jun 2006, Linus Torvalds wrote:
> >> >
> >> > > Instead of having a separate cache, wouldn't it be much better to just
> >> > > take the hint from the previous pack-file?
> >> >
> >> > DOH! ;-)
> >>
> >> Btw, I think this could do with a flag to turn it on/off (but probably
> >> default to on).
> >
> > I think it should simply be coupled with the --no-reuse-delta flag.
>
> I agree that makes sense.
So here it is.
diff --git a/pack-objects.c b/pack-objects.c
index 6e17676..47da33b 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -989,9 +989,10 @@ static int try_delta(struct unpacked *tr
/*
* We do not bother to try a delta that we discarded
- * on an earlier try.
+ * on an earlier try, but only when reusing delta data.
*/
- if (trg_entry->in_pack && trg_entry->in_pack == src_entry->in_pack)
+ if (!no_reuse_delta && trg_entry->in_pack &&
+ trg_entry->in_pack == src_entry->in_pack)
return 0;
/*
* Re: [PATCH] consider previous pack undeltified object state only when reusing delta data
2006-06-30 3:44 ` [PATCH] consider previous pack undeltified object state only when reusing delta data Nicolas Pitre
@ 2006-06-30 9:45 ` Johannes Schindelin
2006-06-30 12:28 ` Andreas Ericsson
0 siblings, 1 reply; 36+ messages in thread
From: Johannes Schindelin @ 2006-06-30 9:45 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
Hi,
On Thu, 29 Jun 2006, Nicolas Pitre wrote:
> Without this there would never be a chance to improve packing for
> previously undeltified objects.
Earlier this year, I was quite surprised to learn that multiple repackings
actually improved packing. Does that patch mean this feature is gone?
Ciao,
Dscho
* Re: [PATCH] consider previous pack undeltified object state only when reusing delta data
2006-06-30 9:45 ` Johannes Schindelin
@ 2006-06-30 12:28 ` Andreas Ericsson
2006-06-30 16:55 ` Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Andreas Ericsson @ 2006-06-30 12:28 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Nicolas Pitre, Junio C Hamano, git
Johannes Schindelin wrote:
> Hi,
>
> On Thu, 29 Jun 2006, Nicolas Pitre wrote:
>
>
>>Without this there would never be a chance to improve packing for
>>previously undeltified objects.
>
>
> Earlier this year, I was quite surprised to learn that multiple repackings
> actually improved packing. Does that patch mean this feature is gone?
>
The patch Linus sent removes that feature. This one re-introduces it.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
* Re: [PATCH] consider previous pack undeltified object state only when reusing delta data
2006-06-30 12:28 ` Andreas Ericsson
@ 2006-06-30 16:55 ` Nicolas Pitre
2006-07-03 8:11 ` Andreas Ericsson
0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-06-30 16:55 UTC (permalink / raw)
To: Andreas Ericsson; +Cc: Johannes Schindelin, Junio C Hamano, git
On Fri, 30 Jun 2006, Andreas Ericsson wrote:
> Johannes Schindelin wrote:
> > Hi,
> >
> > On Thu, 29 Jun 2006, Nicolas Pitre wrote:
> >
> >
> > > Without this there would never be a chance to improve packing for
> > > previously undeltified objects.
> >
> >
> > Earlier this year, I was quite surprised to learn that multiple repackings
> > actually improved packing. Does that patch mean this feature is gone?
> >
>
> The patch Linus sent removes that feature. This one re-introduces it.
Not really.
Actually that multiple repacking "feature" was rather an artifact of the
delta data reuse code and not really by design. Here's what happened
before:
Consider the first repack where no delta exists, or "git-repack -a -f"
where the -f argument makes it ignore existing delta data. In that
case all objects are sorted and deltas are attempted on them within a window.
So to simplify things let's assume objects are numbered from 1 upwards.
First obj #1 is added to the window. Obj #2 attempts a delta against
obj #1. Obj #3 attempts a delta against objs #2 and #1. Obj #4
attempts a delta against objs #3, #2 and #1. And so on for all objects:
each new object attempts a delta against the last 10 objects (the
default window size is 10) and the best delta, if any, is kept.
In the end, some objects get deltified, some don't, and a new pack is
produced.
When running git-repack without -f, already deltified objects
are simply copied as-is from the existing pack(s), avoiding costly delta
re-computation. Still, without Linus' patch, non-deltified objects were
considered for deltification and deltas were attempted on them.
So supposing that objects #1 through #10 were not deltified, and objects
#11 through #50 were deltified, those deltified objects were
skipped over for the purpose of delta matching, and therefore object #51
ended up attempting a delta against objs #1 to #10 instead of #41 to #50
as in the previous run. The net effect was similar to a larger window
for some objects, providing more opportunities for successful deltas and
therefore a smaller pack.
With Linus' patch, objects already known to be undeltified are
skipped too. That means successive runs of git-repack without the -f
argument now produce identical packs every time, and the artifact
above is gone.
I think this is a good thing since the packing behavior is now more
predictable. And nothing is lost: if you want better packing like
before, simply specify a slightly larger window size on the first
git-repack. It'll take a bit more time, but running git-repack many
times also took more time in the end anyway.
Nicolas
* Re: [PATCH] consider previous pack undeltified object state only when reusing delta data
2006-06-30 16:55 ` Nicolas Pitre
@ 2006-07-03 8:11 ` Andreas Ericsson
0 siblings, 0 replies; 36+ messages in thread
From: Andreas Ericsson @ 2006-07-03 8:11 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Johannes Schindelin, Junio C Hamano, git
Nicolas Pitre wrote:
> On Fri, 30 Jun 2006, Andreas Ericsson wrote:
>
>>Johannes Schindelin wrote:
>>>
>>>>Without this there would never be a chance to improve packing for
>>>>previously undeltified objects.
>>>
>>>
>>>Earlier this year, I was quite surprised to learn that multiple repackings
>>>actually improved packing. Does that patch mean this feature is gone?
>>>
>>
>>The patch Linus sent removes that feature. This one re-introduces it.
>
>
> Not really.
>
> Actually that multiple repacking "feature" was rather an artifact of the
> delta data reuse code and not really by design. Here's what happened
> before:
>
Thanks for the extensive and very clear info. Lovely to see a competent
programmer who can also explain how things work. :)
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
end of thread, other threads:[~2006-07-03 8:11 UTC | newest]
Thread overview: 36+ messages
-- links below jump to the message on this page --
[not found] <20060628223744.GA24421@coredump.intra.peff.net>
2006-06-29 3:09 ` [RFC] Cache negative delta pairs Junio C Hamano
2006-06-29 3:50 ` Jeff King
2006-06-29 3:58 ` Jeff King
2006-06-29 4:30 ` Jeff King
2006-06-29 16:39 ` Nicolas Pitre
2006-06-29 18:07 ` Jeff King
2006-06-29 18:48 ` Nicolas Pitre
2006-06-29 18:58 ` Jeff King
2006-06-29 19:06 ` Nicolas Pitre
2006-06-29 4:09 ` Jeff King
2006-06-29 15:42 ` Nicolas Pitre
2006-06-29 16:35 ` Nicolas Pitre
2006-06-29 18:00 ` Jeff King
2006-06-29 18:24 ` Nicolas Pitre
2006-06-29 18:53 ` Jeff King
2006-06-29 19:04 ` Nicolas Pitre
2006-06-29 19:52 ` Jeff King
2006-06-29 20:24 ` Nicolas Pitre
2006-06-29 21:04 ` Linus Torvalds
2006-06-29 21:24 ` Nicolas Pitre
2006-06-29 21:30 ` Linus Torvalds
2006-06-29 21:39 ` Jeff King
2006-06-29 21:43 ` Joel Becker
2006-06-29 21:47 ` Nicolas Pitre
2006-06-29 22:12 ` Junio C Hamano
2006-06-30 3:44 ` [PATCH] consider previous pack undeltified object state only when reusing delta data Nicolas Pitre
2006-06-30 9:45 ` Johannes Schindelin
2006-06-30 12:28 ` Andreas Ericsson
2006-06-30 16:55 ` Nicolas Pitre
2006-07-03 8:11 ` Andreas Ericsson
2006-06-29 21:35 ` [RFC] Cache negative delta pairs Jeff King
2006-06-29 21:54 ` Junio C Hamano
2006-06-29 22:22 ` Junio C Hamano
2006-06-29 21:26 ` Junio C Hamano
2006-06-29 21:37 ` Jeff King
2006-06-29 22:31 ` Jakub Narebski