[PATCH] git-pack-objects: cache small deltas between big objects

* [PATCH] git-pack-objects: cache small deltas between big objects
@ 2007-05-20 21:11 Martin Koegler
  2007-05-21  4:35 ` Dana How
  2007-05-21  4:54 ` Junio C Hamano
  0 siblings, 2 replies; 8+ messages in thread
From: Martin Koegler @ 2007-05-20 21:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Martin Koegler

Creating deltas between big blobs is a CPU and memory intensive task.
In the writing phase, all (not reused) deltas are redone.

This patch adds support for caching deltas from the deltifying phase, so
that the writing phase is faster.

The caching is limited to small deltas to avoid increasing memory usage very much.
The implemented limit is (memory needed to create the delta)/1024.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
 builtin-pack-objects.c |   35 +++++++++++++++++++++++++----------
 1 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index d165f10..13429d0 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -35,6 +35,7 @@ struct object_entry {
 	struct object_entry *delta_sibling; /* other deltified objects who
 					     * uses the same base as me
 					     */
+	void *delta_data;	/* cached delta (uncompressed) */
 	unsigned long delta_size;	/* delta data size (uncompressed) */
 	enum object_type type;
 	enum object_type in_pack_type;	/* could be delta */
@@ -380,17 +381,24 @@ static unsigned long write_object(struct sha1file *f,
 				 */
 
 	if (!to_reuse) {
-		buf = read_sha1_file(entry->sha1, &type, &size);
-		if (!buf)
-			die("unable to read %s", sha1_to_hex(entry->sha1));
-		if (size != entry->size)
-			die("object %s size inconsistency (%lu vs %lu)",
-			    sha1_to_hex(entry->sha1), size, entry->size);
-		if (entry->delta) {
-			buf = delta_against(buf, size, entry);
+		if (entry->delta_data) {
+			buf = entry->delta_data;
 			size = entry->delta_size;
 			obj_type = (allow_ofs_delta && entry->delta->offset) ?
-				OBJ_OFS_DELTA : OBJ_REF_DELTA;
+					OBJ_OFS_DELTA : OBJ_REF_DELTA;
+		} else {
+			buf = read_sha1_file(entry->sha1, &type, &size);
+			if (!buf)
+				die("unable to read %s", sha1_to_hex(entry->sha1));
+			if (size != entry->size)
+				die("object %s size inconsistency (%lu vs %lu)",
+				    sha1_to_hex(entry->sha1), size, entry->size);
+			if (entry->delta) {
+				buf = delta_against(buf, size, entry);
+				size = entry->delta_size;
+				obj_type = (allow_ofs_delta && entry->delta->offset) ?
+					OBJ_OFS_DELTA : OBJ_REF_DELTA;
+			}
 		}
 		/*
 		 * The object header is a byte of 'type' followed by zero or
@@ -1294,10 +1302,17 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 	if (!delta_buf)
 		return 0;
 
+	if (trg_entry->delta_data)
+		free (trg_entry->delta_data);
+	trg_entry->delta_data = 0;
 	trg_entry->delta = src_entry;
 	trg_entry->delta_size = delta_size;
 	trg_entry->depth = src_entry->depth + 1;
-	free(delta_buf);
+	/* cache delta, if objects are large enough compared to delta size */
+	if ((src_size >> 20) + (trg_size >> 21) > (delta_size >> 10))
+		trg_entry->delta_data = delta_buf;
+	else
+		free(delta_buf);
 	return 1;
 }
 
-- 
1.5.2.rc3.802.g4b4b7

* Re: [PATCH] git-pack-objects: cache small deltas between big objects
  2007-05-20 21:11 [PATCH] git-pack-objects: cache small deltas between big objects Martin Koegler
@ 2007-05-21  4:35 ` Dana How
  2007-05-21 17:59   ` Martin Koegler
  2007-05-21  4:54 ` Junio C Hamano
  1 sibling, 1 reply; 8+ messages in thread
From: Dana How @ 2007-05-21  4:35 UTC (permalink / raw)
  To: Martin Koegler; +Cc: Junio C Hamano, git, danahow

On 5/20/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote:
> Creating deltas between big blobs is a CPU and memory intensive task.
> In the writing phase, all (not reused) deltas are redone.

Actually,  just the ones selected,  which is approx 1/window.
Do you have any numbers describing the effects on runtime
and memory size for a known repo like linux-2.6?

> This patch adds support for caching deltas from the deltifying phase, so
> that the writing phase is faster.
>
> The caching is limited to small deltas to avoid increasing memory usage very much.
> The implemented limit is (memory needed to create the delta)/1024.

Your limit is applied per-object,  and there is no overall limit
on the amount of memory not freed in the delta phase.
I suspect this caching would be disastrous for the large repo
with "megablobs" I'm trying to wrestle with at the moment.
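
For illustration only, an overall limit could be enforced with a running
byte count, roughly as sketched below. The names delta_cache_used,
delta_cache_max and cache_delta_with_budget(), and the 256 MB figure, are
hypothetical and are not part of the patch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical global budget; neither variable exists in the patch. */
static size_t delta_cache_used;
static size_t delta_cache_max = 256u * 1024u * 1024u;	/* assumed cap */

/*
 * Keep delta_buf and return it if the running total stays within the cap;
 * otherwise free it and return NULL so the delta is redone when writing.
 */
static void *cache_delta_with_budget(void *delta_buf, size_t delta_size)
{
	/* delta_cache_used never exceeds delta_cache_max, so no underflow */
	if (delta_size > delta_cache_max - delta_cache_used) {
		free(delta_buf);
		return NULL;
	}
	delta_cache_used += delta_size;
	return delta_buf;
}
```

A complete version would also have to subtract the size again whenever a
cached delta is freed or replaced by a better one.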

> @@ -1294,10 +1302,17 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
>         trg_entry->delta = src_entry;
>         trg_entry->delta_size = delta_size;
>         trg_entry->depth = src_entry->depth + 1;
> -       free(delta_buf);
> +       /* cache delta, if objects are large enough compared to delta size */
> +       if ((src_size >> 20) + (trg_size >> 21) > (delta_size >> 10))
> +               trg_entry->delta_data = delta_buf;
> +       else
> +               free(delta_buf);
>         return 1;
>  }

-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

* Re: [PATCH] git-pack-objects: cache small deltas between big objects
  2007-05-21  4:35 ` Dana How
@ 2007-05-21 17:59   ` Martin Koegler
  2007-05-22  7:01     ` Dana How
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Koegler @ 2007-05-21 17:59 UTC (permalink / raw)
  To: Dana How; +Cc: git, Junio C Hamano

On Sun, May 20, 2007 at 09:35:56PM -0700, Dana How wrote:
> On 5/20/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote:
> >Creating deltas between big blobs is a CPU and memory intensive task.
> >In the writing phase, all (not reused) deltas are redone.
> 
> Actually,  just the ones selected,  which is approx 1/window.
> Do you have any numbers describing the effects on runtime
> and memory size for a known repo like linux-2.6?

Objects below 1 MB are not considered for caching.
The Linux kernel contains only such objects:
linux.git$ find -size +1000k |grep -v ".git"|wc
      0       0       0
So no caching happens. The required memory increases only by the
new pointer in object_entry.

At runtime, we have an additional (#objects)*(window size+1) null pointer
checks, (#objects)*(window size) pointer initializations to zero, and
(#objects)*(window size) evaluations of the caching policy check:
((src_size >> 20) + (trg_size >> 21) > (delta_size >> 10))
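
As a minimal sketch, the policy check can be pulled out into a helper to
see why small objects never qualify. The expression is the one from the
patch; the function name should_cache_delta() is made up for this example.

```c
/*
 * Caching test as written in the patch's try_delta().
 * With src_size below 1 MB and trg_size below 2 MB both shifts on the
 * left yield 0, and 0 > x is never true for unsigned values.
 */
static int should_cache_delta(unsigned long src_size,
			      unsigned long trg_size,
			      unsigned long delta_size)
{
	return (src_size >> 20) + (trg_size >> 21) > (delta_size >> 10);
}
```

For example, should_cache_delta(1000 * 1024, 1000 * 1024, 0) is false even
for an empty delta, while a 1 MB source with a 100 byte delta passes.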

Writing a cached delta is faster, as we avoid creating a delta. Some
calls to free are delayed.

> >This patch adds support for caching deltas from the deltifying phase, so
> >that the writing phase is faster.
> >
> >The caching is limited to small deltas to avoid increasing memory usage 
> >very much.
> >The implemented limit is (memory needed to create the delta)/1024.
> 
> Your limit is applied per-object,  and there is no overall limit
> on the amount of memory not freed in the delta phase.
> I suspect this caching would be disastrous for the large repo
> with "megablobs" I'm trying to wrestle with at the moment.

http://www.spinics.net/lists/git/msg31241.html:
> At the moment I'm experimenting on a git repository with
> a 4.5GB checkout,  and 18 months of history in 4K commits
> comprising 100GB (uncompressed) of blobs stored in
> 7 packfiles of 2GB or less. Hopefully I'll be able to say
> more about tweaking packing shortly.

If you have 100 GB of uncompressed data in your pack files, the cache
limit is between 100MB and 200MB with the current policy.
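
A rough way to arrive at such a figure, under the simplifying assumption
(made only for this sketch) that every object is cached as a delta against
a source of about its own size. The helper name estimated_cache_bound() is
hypothetical.

```c
/*
 * Rough upper bound on the total bytes held in cached deltas.
 * Per target the policy allows about (src + trg / 2) / 1024 bytes;
 * with src roughly equal to trg that is 1.5 * size / 1024 per object.
 */
static unsigned long long estimated_cache_bound(unsigned long long total_bytes)
{
	return total_bytes / 1024 + total_bytes / 2048;
}
```

For 100 GB of objects this gives about 150 MB. The real total depends on
which sources are actually picked, so it is only an estimate.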

The aim of my patch is to speed up pack writing without increasing
memory usage very much, if you have blobs of some hundred MB size in
your repository.

The caching policy could be extended to spend more memory on caching
other deltas. Ideas on this topic are welcome.

Regards, Martin Kögler
PS: If you are trying to optimize packing speed/size, you could test
the following patch: http://marc.info/?l=git&m=117908942525171&w=2

* Re: [PATCH] git-pack-objects: cache small deltas between big objects
  2007-05-21 17:59   ` Martin Koegler
@ 2007-05-22  7:01     ` Dana How
  2007-05-22  8:04       ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Dana How @ 2007-05-22  7:01 UTC (permalink / raw)
  To: Martin Koegler; +Cc: git, Junio C Hamano, danahow

On 5/21/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote:
> On Sun, May 20, 2007 at 09:35:56PM -0700, Dana How wrote:
> > On 5/20/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote:
> > > This patch adds support for caching deltas from the deltifying phase, so
> > > that the writing phase is faster.
> > >
> > > The caching is limited to small deltas to avoid increasing memory usage
> > > very much.
> > > The implemented limit is (memory needed to create the delta)/1024.
> >
> > Your limit is applied per-object,  and there is no overall limit
> > on the amount of memory not freed in the delta phase.
> > I suspect this caching would be disastrous for the large repo
> > with "megablobs" I'm trying to wrestle with at the moment.
>
> http://www.spinics.net/lists/git/msg31241.html:
> > At the moment I'm experimenting on a git repository with
> > a 4.5GB checkout,  and 18 months of history in 4K commits
> > comprising 100GB (uncompressed) of blobs stored in
> > 7 packfiles of 2GB or less. Hopefully I'll be able to say
> > more about tweaking packing shortly.
>
> If you have 100 GB of uncompressed data in your pack files, the cache
> limit is between 100MB and 200MB with the current policy.
Yes,  there is an implicit limit in your patch, and it would be
sufficient in my case.  It's still the case that there is no absolute
limit,  but perhaps you have to do something truly insane
for that to matter.

> The aim of my patch is to speed up pack writing without increasing
> memory usage very much, if you have blobs of some hundred MB size in
> your repository.
>
> The caching policy could be extended to spend more memory on caching
> other deltas. Ideas on this topic are welcome.
There _is_ something useful in your patch.
Unfortunately I don't think it helps my problem that much.

> PS: If you are trying to optimize packing speed/size, you could test
> the following patch: http://marc.info/?l=git&m=117908942525171&w=2
I remember this post -- I hope you continue to refine it.

What I've concluded is that there are cases where the packfile
treatment is just not appropriate for some part of the data.
[NOTE: I'm talking about disk storage here, not packs for communications.]
With the "delta" attribute Junio proposed,  and the "repack"
attribute I proposed in response,  we were starting to move in that
direction already.

The order of objects in the packfile(s) in my test repo after repacking
seems to be commit+ [ tree+ blob+ ]+, in other words,  the commits
are all at the beginning and the new tree blobs are interspersed amongst
the data blobs (this was imported with only straightline history, no branching).
If some of these blobs are enormous,  the tree blobs which are accessed
all the time get pushed apart.  This seemed to really hurt performance.

If I simply refuse to insert enormous blobs in the packfiles,  and keep
them loose,  the performance is better.  More importantly,  my packfiles
are now sized like everyone else's, so I'm in an operating regime which
everyone is testing and optimizing.  This was not true with 12GB+ of packfiles.
Of course, loose objects are slower, but slight extra overhead to access
something large enough to be noticeable already doesn't bother me.

Finally, loose objects don't get deltified.  This is a problem,  but I would
need to repack at least every week,  and nonzero window/depth would
be prohibitive with large objects included.  So if I put the large objects
in the packs,  not only are the large objects still undeltified, but everything
else is undeltified as well.  Note also that Perforce, what we're currently
using, doesn't deltify large objects either,  so people here who migrate
to git aren't going to lose anything, but they will gain compression
on the remaining "normal" objects (Perforce uses deltification or compression,
but not both).

So at the moment I'm finding keeping enormous objects loose
to be a reasonable compromise which keeps my packfiles
"normal" and imposes overheads only on objects whose size
already imposes an even larger overhead.

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

* Re: [PATCH] git-pack-objects: cache small deltas between big objects
  2007-05-22  7:01     ` Dana How
@ 2007-05-22  8:04       ` Junio C Hamano
  2007-05-22  9:25         ` Dana How
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2007-05-22  8:04 UTC (permalink / raw)
  To: Dana How; +Cc: Martin Koegler, git, Junio C Hamano

"Dana How" <danahow@gmail.com> writes:

> If I simply refuse to insert enormous blobs in the packfiles,  and keep
> them loose,  the performance is better.  More importantly,  my packfiles
> are now sized like everyone else's, so I'm in an operating regime which
> everyone is testing and optimizing.  This was not true with 12GB+ of packfiles.
> Of course, loose objects are slower, but slight extra overhead to access
> something large enough to be noticeable already doesn't bother me.
>
> Finally, loose objects don't get deltified.  This is a problem,  but I would
> need to repack at least every week,  and nonzero window/depth would
> be prohibitive with large objects included.

Here are a few quick comments before going to bed.

 * The objects in the packfile are ordered in "recency" order,
   as "rev-list --objects" feeds you, so it is correct that we
   get trees and blobs mixed.  It might be an interesting
   experiment, especially with a repository without huge blobs,
   to see how much improvement we might get if we keep the
   recency order _but_ emit tags, commits, trees, and then
   blobs, in this order.  In write_pack_file() we have a single
   loop to call write_one(), but we could make it a nested loop
   that writes only objects of each type.

 * Also my earlier "nodelta" attribute thing would be worth
   trying with your repository with huge blobs, with the above
   "group by object type" with further tweak to write blobs
   without "nodelta" marker first and then finally blobs with
   "nodelta" marker.
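
The first point could be sketched roughly as follows. This is a simplified
illustration, not code from pack-objects: enum obj_type, struct entry and
order_by_type() are hypothetical stand-ins, and appending to out[] takes
the place of the call to write_one().

```c
#include <stddef.h>

/* Simplified stand-ins for git's object type and object_entry. */
enum obj_type { T_COMMIT, T_TREE, T_BLOB, T_TAG };

struct entry {
	enum obj_type type;
	int id;		/* position in recency order */
};

/*
 * Emit the ids of objs[] grouped as tags, commits, trees, blobs, keeping
 * the original (recency) order inside each group.
 * Returns the number of ids written to out[].
 */
static size_t order_by_type(const struct entry *objs, size_t n, int *out)
{
	static const enum obj_type order[] = { T_TAG, T_COMMIT, T_TREE, T_BLOB };
	size_t written = 0;

	for (size_t t = 0; t < sizeof(order) / sizeof(order[0]); t++)
		for (size_t i = 0; i < n; i++)
			if (objs[i].type == order[t])
				out[written++] = objs[i].id;
	return written;
}
```

The nested loop scans the object list once per type, which is cheap next
to the cost of writing the objects themselves.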

I suspect the above two should help "git log" and "git log --
pathspec..."  performance, as these two do not look at blobs at
all (pathspec limiting does invoke diff machinery, but that is
only at the tree level).

The "I want to have packs with reasonable size as everybody
else" (which I think is a reasonable thing to want, but does not
have much technical meaning as other issues do) wish is
something we cannot _measure_ to judge pros and cons, but with
the above experiment, you could come up with three sets of packs
such that, all three sets use "nodelta" to leave the huge blobs
undeltified, and use the default window and depth for others,
and:

 (1) One set has trees and blobs mixed;

 (2) Another set has trees and blobs grouped, but "nodelta" blobs
     and others are not separated;

 (3) The third set has trees and blobs grouped, and "nodelta"
     blobs and others are separated.

Comparing (1) and (2) would show how bad it is to have huge
blobs in between trees (which are presumably accessed more
often).  I suspect that comparing (2) and (3) would show that
for most workloads, the split is not worth it.

And compare (3) with another case where you leave "nodelta"
blobs loose.  That's the true comparison that would demonstrate
why placing huge blobs in packs is bad and they should be left
loose.  I'm skeptical if there will be significant differences,
though.

* Re: [PATCH] git-pack-objects: cache small deltas between big objects
  2007-05-22  8:04       ` Junio C Hamano
@ 2007-05-22  9:25         ` Dana How
  0 siblings, 0 replies; 8+ messages in thread
From: Dana How @ 2007-05-22  9:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Koegler, git, danahow

On 5/22/07, Junio C Hamano <junkio@cox.net> wrote:
> "Dana How" <danahow@gmail.com> writes:
> > If I simply refuse to insert enormous blobs in the packfiles,  and keep
> > them loose,  the performance is better.  More importantly,  my packfiles
> > are now sized like everyone else's, so I'm in an operating regime which
> > everyone is testing and optimizing.  This was not true with 12GB+ of packfiles.
> > Of course, loose objects are slower, but slight extra overhead to access
> > something large enough to be noticeable already doesn't bother me.
> >
> > Finally, loose objects don't get deltified.  This is a problem,  but I would
> > need to repack at least every week,  and nonzero window/depth would
> > be prohibitive with large objects included.
>
> Here are a few quick comments before going to bed.
>
>  * The objects in the packfile are ordered in "recency" order,
>    as "rev-list --objects" feeds you, so it is correct that we
>    get trees and blobs mixed.  It might be an interesting
>    experiment, especially with a repository without huge blobs,
>    to see how much improvement we might get if we keep the
>    recency order _but_ emit tags, commits, trees, and then
>    blobs, in this order.  In write_pack_file() we have a single
>    loop to call write_one(), but we could make it a nested loop
>    that writes only objects of each type.
Already tried that, almost.  Added a --types=[ctgb]+ flag to
pack-objects, and changed git-repack to run in 2 passes
when -a && --max-pack-size.  The first pass would create
packfiles with all Commits/Trees/taGs [of course just 1],
the second made packfiles with just Blobs.  With a warm cache,
this was 3X to 7X slower than --max-blob-size= approach
(for the git-log --pretty=oneline example).  Why?  I'm guessing
because each lookup had to go through 7 index files instead of 1,
which would be significant when processing very small blobs
(commits and trees).  And the _slower_ one had window/depth=0/0,
so it had no delta expansion to do.

>  * Also my earlier "nodelta" attribute thing would be worth
>    trying with your repository with huge blobs, with the above
>    "group by object type" with further tweak to write blobs
>    without "nodelta" marker first and then finally blobs with
>    "nodelta" marker.
I started out enthusiastic about "nodelta",  causing me to
quickly propose "norepack" as well.  However, there is no
simple way in my repository to specify these.  Most of the
enormous files have certain suffixes,  but each of these
appears on a continuum of file sizes,  so I can't write
any *.sfx rules in .gitattributes.  I could make rules
specific to specific files,  but then I would have to write
scripts to auto-generate them.  (At commit time?)

Assuming I *could* get "nodelta" properly specified,
putting these last would help somewhat.  But we would
still be left with the problem caused by extra index files
(resulting from 2GB packfile limit).

> I suspect the above two should help "git log" and "git log --
> pathspec..."  performance, as these two do not look at blobs at
> all (pathspec limiting does invoke diff machinery, but that is
> only at the tree level).
>
> The "I want to have packs with reasonable size as everybody
> else" (which I think is a reasonable thing to want, but does not
> have much technical meaning as other issues do) wish is
> something we cannot _measure_ to judge pros and cons, ...
?? A depressingly large portion of my career has been spent
fooling optimization programs to work on new problems,  by
making the new problem look just like what they're used to.  So
wanting a program's input to look "conventional", or similar
to something in a regression,  seems pretty reasonable.
It's just the data-side version of preferring small changes in an algorithm.

> ... but with
> the above experiment, you could come up with three sets of packs
> such that, all three sets use "nodelta" to leave the huge blobs
> undeltified, and use the default window and depth for others,
> and:
>
>  (1) One set has trees and blobs mixed;
>
>  (2) Another set has trees and blobs grouped, but "nodelta" blobs
>      and others are not separated;
>
>  (3) The third set has trees and blobs grouped, and "nodelta"
>      blobs and others are separated.
>
> Comparing (1) and (2) would show how bad it is to have huge
> blobs in between trees (which are presumably accessed more
> often).  I suspect that comparing (2) and (3) would show that
> for most workloads, the split is not worth it.
>
> And compare (3) with another case where you leave "nodelta"
> blobs loose.  That's the true comparison that would demonstrate
> why placing huge blobs in packs is bad and they should be left
> loose.  I'm skeptical if there will be significant differences,
> though.
I think the difference will come from at least the different number of
index files, as pointed out above.  I can certainly start on these
comparisons.

I must say,  getting the whole repository to repack -a under an hour
with great git-log performance after just a 55-line change was a much better
experience than the 10X larger max-pack-size patch...

BTW,  why the attachment to keeping *everything* in a packfile?
If I implement the changes above,  they will be more extensive than
max-blob-size (even max-pack-size only added *1* new nested loop
to pack-objects),  and they'll be climbing uphill due to the packfiles
being THREE orders of magnitude larger and the index files one
order of magnitude more numerous.

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

* Re: [PATCH] git-pack-objects: cache small deltas between big objects
  2007-05-20 21:11 [PATCH] git-pack-objects: cache small deltas between big objects Martin Koegler
  2007-05-21  4:35 ` Dana How
@ 2007-05-21  4:54 ` Junio C Hamano
  2007-05-21 17:00   ` Martin Koegler
  1 sibling, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2007-05-21  4:54 UTC (permalink / raw)
  To: Martin Koegler; +Cc: git

Martin Koegler <mkoegler@auto.tuwien.ac.at> writes:

> Creating deltas between big blobs is a CPU and memory intensive task.
> In the writing phase, all (not reused) deltas are redone.
>
> This patch adds support for caching deltas from the deltifying phase, so
> that the writing phase is faster.
>
> The caching is limited to small deltas to avoid increasing memory usage very much.
> The implemented limit is (memory needed to create the delta)/1024.
>
> Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> ---
>  builtin-pack-objects.c |   35 +++++++++++++++++++++++++----------
>  1 files changed, 25 insertions(+), 10 deletions(-)

This is an interesting idea.

> diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
> index d165f10..13429d0 100644
> --- a/builtin-pack-objects.c
> +++ b/builtin-pack-objects.c
> ...
> @@ -1294,10 +1302,17 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
>  	if (!delta_buf)
>  		return 0;
>  
> +	if (trg_entry->delta_data)
> +		free (trg_entry->delta_data);
> +	trg_entry->delta_data = 0;
>  	trg_entry->delta = src_entry;
>  	trg_entry->delta_size = delta_size;
>  	trg_entry->depth = src_entry->depth + 1;
> -	free(delta_buf);
> +	/* cache delta, if objects are large enough compared to delta size */
> +	if ((src_size >> 20) + (trg_size >> 21) > (delta_size >> 10))
> +		trg_entry->delta_data = delta_buf;
> +	else
> +		free(delta_buf);
>  	return 1;
>  }

Care to justify this arithmetic?  Why isn't it for example like
this?

	((src_size + trg_size) >> 10) > delta_size

I am puzzled by the shifts on both ends, and differences between
20 and 21.

* Re: [PATCH] git-pack-objects: cache small deltas between big objects
  2007-05-21  4:54 ` Junio C Hamano
@ 2007-05-21 17:00   ` Martin Koegler
  0 siblings, 0 replies; 8+ messages in thread
From: Martin Koegler @ 2007-05-21 17:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, May 20, 2007 at 09:54:53PM -0700, Junio C Hamano wrote:
> Martin Koegler <mkoegler@auto.tuwien.ac.at> writes:
> > diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
> > index d165f10..13429d0 100644
> > --- a/builtin-pack-objects.c
> > +++ b/builtin-pack-objects.c
> > ...
> > @@ -1294,10 +1302,17 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
> >  	if (!delta_buf)
> >  		return 0;
> >  
> > +	if (trg_entry->delta_data)
> > +		free (trg_entry->delta_data);
> > +	trg_entry->delta_data = 0;
> >  	trg_entry->delta = src_entry;
> >  	trg_entry->delta_size = delta_size;
> >  	trg_entry->depth = src_entry->depth + 1;
> > -	free(delta_buf);
> > +	/* cache delta, if objects are large enough compared to delta size */
> > +	if ((src_size >> 20) + (trg_size >> 21) > (delta_size >> 10))
> > +		trg_entry->delta_data = delta_buf;
> > +	else
> > +		free(delta_buf);
> >  	return 1;
> >  }
> 
> Care to justify this arithmetic?  Why isn't it for example like
> this?
> 
> 	((src_size + trg_size) >> 10) > delta_size

I wanted to avoid a possible overflow in (src_size + trg_size), so
I shift both sides.
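
To make the overflow concrete, here is a small sketch that models a 32-bit
unsigned long with uint32_t. Both function names are invented for this
example. The two tests are not the same policy (the patch weights the
target at half), so the sketch only shows the wrap-around.

```c
#include <stdint.h>

/* Suggested form: the sum can wrap around in 32-bit arithmetic. */
static int cache_test_sum_first(uint32_t src, uint32_t trg, uint32_t delta)
{
	return ((uint32_t)(src + trg) >> 10) > delta;
}

/* Form used in the patch: shift each operand before adding. */
static int cache_test_shift_first(uint32_t src, uint32_t trg, uint32_t delta)
{
	return (src >> 20) + (trg >> 21) > (delta >> 10);
}
```

With a 3 GB source, a 2 GB target and a 2 MB delta, the sum wraps to 1 GB,
so the first test rejects a delta that it would accept without the
overflow. The second test is unaffected.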

> I am puzzled by the shifts on both ends, and differences between
> 20 and 21.

I base the maximum allowed delta_size for caching on the memory
required to create the delta. For the src entry, you need a
delta index, which is (about) the same size as the src entry. So I
count the src entry double.

I divide the required memory by 1024, so that the delta size is some
orders of magnitude smaller and will not cause a big increase in memory
usage, e.g.:

For two 100 MB (uncompressed) blobs, we need 300MB of memory to do the
delta (with the default window size of 10 up to 1900MB for all delta
indexes in the worst case). The patch will limit the delta size for
the target blob to 150kB.
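
The 150kB figure can be checked with a small helper that inverts the test
from the patch. The name max_cached_delta() is made up for this sketch.
Note that 150kB is the 300MB divided by 2048, because the shifts are 20
and 21 on the left against 10 on the right.

```c
/*
 * Largest delta_size in bytes that the patch's test still caches for the
 * given sizes. Returns 0 when no delta is cached at all.
 */
static unsigned long max_cached_delta(unsigned long src_size,
				      unsigned long trg_size)
{
	unsigned long budget_kb = (src_size >> 20) + (trg_size >> 21);

	if (!budget_kb)
		return 0;
	/* need (delta_size >> 10) < budget_kb, so delta_size < budget_kb * 1024 */
	return (budget_kb << 10) - 1;
}
```

For two 100 MB blobs, budget_kb is 100 + 50 = 150, so deltas of up to
150 * 1024 - 1 bytes are kept.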

The caching policy only caches really small deltas for really big
objects, as I wanted to avoid out-of-memory situations. A future patch
should probably replace it with a better strategy.

Regards, Martin Kögler
