* small question about the repack algorithm
From: Pierre Habouzit @ 2008-02-07  9:03 UTC
  To: Git ML

  I've been trying to see whether the following optimization is used,
but I was unable to find out, as the code is a bit tough :)

  I was wondering whether the repacking window uses the triangle
inequality to avoid trying some costly deltas (I assume that what
costs the most in repacking is computing the deltas). I mean, if you
consider the "size" of a delta, I'm almost sure it is very nearly a
distance.

  So, assuming that we know the delta sizes between every pair of
reference objects in the window: if the object Od that we want to
delta against the window is near enough to some reference O1, then
any Oi in the window for which len(δ(O1, Oi)) > 2 * len(δ(Od, O1))
holds is not worth investigating.

  I've seen quite a few heuristics used to avoid computing a delta at
the beginning of builtin-pack-objects.c:try_delta, but I cannot find
this one among them, even though it would encompass most of the
heuristics there. I mean, the object-size based heuristics are (if
I'm correct) a special case of the one above; e.g. I think one of the
heuristics is to not try a delta if:
  | len(Oi) - len(Od) | > 2 * len(δ(Od, O1))

But it's just a subcase of the former because
  len(δ(Od, Oi)) >= | len(Oi) - len(Od) |


  Of course, this requires knowing how the objects in the same window
delta against one another, or having a good estimate of it (having
not-too-bad lower estimates of the real delta sizes is an option
here: it just makes the test a bit less discriminating, but it allows
a trade-off between computing every pairwise delta in the window and
dropping attempted deltas early). Note that we don't need to store
the actual deltas, just their lengths (or a more elaborate score if
we want -- probably -- to take the chain depth into account; we just
need that score to look enough like a distance).
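
  For the bookkeeping, something as dumb as this would do (again a
hypothetical sketch, not git code): a window-sized square matrix of
lengths, seeded with the cheap size-difference bound and refined
whenever a real delta between two window members gets computed:

    #define WINDOW 10    /* matches the default --window of 10 */

    /* score[a][b]: length of a delta producing window object b from
     * object a, or a lower bound on it; refined as deltas get computed */
    static unsigned long score[WINDOW][WINDOW];

    static unsigned long size_diff_bound(unsigned long sa, unsigned long sb)
    {
            return sa > sb ? sa - sb : sb - sa; /* |len(Oa) - len(Ob)| */
    }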


  If this is already done, then don't mind me, though I'd like to be
pointed to where, so that I can grok that code a bit more :)
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org


* Re: small question about the repack algorithm
From: Junio C Hamano @ 2008-02-08  7:12 UTC
  To: Pierre Habouzit; +Cc: Git ML

Pierre Habouzit <madcoder@debian.org> writes:

>   I've been trying to see whether the following optimization is used,
> but I was unable to find out, as the code is a bit tough :)
>
>   I was wondering whether the repacking window uses the triangle
> inequality to avoid trying some costly deltas (I assume that what
> costs the most in repacking is computing the deltas). I mean, if you
> consider the "size" of a delta, I'm almost sure it is very nearly a
> distance.
>
>   So, assuming that we know the delta sizes between every pair of
> reference objects in the window: if the object Od that we want to
> delta against the window is near enough to some reference O1, then
> any Oi in the window for which len(δ(O1, Oi)) > 2 * len(δ(Od, O1))
> holds is not worth investigating.

We do not keep track of the delta size matrix between delta-base
candidates in the window, but I presume we could.  The storage
cost for doing so is very cheap (window^2 * size_t).  But we do
not even compute the distance matrix fully (I'll mention the
reason why the above is not (window^2 * size_t / 2) later).

    1-----------------------------i   <= delta-base candidates
     \
      \ 
       D <-- the target we are considering

Your idea is that if we want to find the cheapest delta-base,
and after we find out that candidate #1 is close to our target D
and candidate #i is very far from candidate #1, then the delta to
create D using candidate #i as the base would be much bigger.
If the distance space is Euclidean, that would be a nice
optimization.

However I do not think deltification would work that way, for
two reasons.

The deltified representation of an object is a sequence of two
kinds of opcodes:

 (1) the N bytes in the target from the current point are a copy
     of the source at offset M;

 (2) the N bytes in the target from the current point are the
     following literal string.

The first form is two integers (and we represent short integers
efficiently), and the second form is one integer plus N bytes
literal.
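
To illustrate, here is a much simplified sketch of applying such a
delta (this is not the actual pack encoding, which packs the opcode,
offset and size into variable-width bytes):

    #include <string.h>

    enum op_kind { OP_COPY, OP_INSERT };

    struct delta_op {
            enum op_kind kind;
            size_t n;            /* bytes of target this op produces */
            size_t off;          /* OP_COPY: offset into the source */
            const char *literal; /* OP_INSERT: the N literal bytes */
    };

    /* Reconstruct the target by walking the opcode sequence. */
    static size_t apply_delta(const char *src, const struct delta_op *op,
                              size_t nops, char *dst)
    {
            size_t out = 0;
            size_t i;

            for (i = 0; i < nops; i++) {
                    memcpy(dst + out,
                           op[i].kind == OP_COPY ? src + op[i].off
                                                 : op[i].literal,
                           op[i].n);
                    out += op[i].n;
            }
            return out;    /* length of the reconstructed target */
    }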

An efficient delta-base candidate is one that has a lot of contents
in common with the target image.  Saying "copy that part" is cheaper
than "here are the contents ...".

Notice that it does not matter how much other cruft a base
candidate contains.  In an extreme case, if the base candidate #1
is a prefix of the candidate #i, and the difference does not have
any commonality with target D, then the delta to represent D using
the base #1 and the delta using the base #i would be of the same
size and the same contents (as all offsets into the source image
of copied parts would be the same).  The tail part of #i would
not participate in the reconstruction of D at all.

In such a case, you would have a long distance between #1 and
#i, but that is because #i has very many unrelated contents that
are not shared with #1 (so the difference has to be represented
as "append these bytes to represent #i, as we cannot copy from #1").

But that does not mean #i has less contents in common with D
than #1 has.  In fact, we might even find common contents between
the tail part of #i and D that the delta using base #1 needed to
represent as literals.  In other words, #i could well be a
better candidate than #1.

The second reason is that deltification is not symmetric.
If you define the "distance" between #1 and #i as "the size of
the delta to reproduce #i using #1 as base", the distance between
#1 and #i is very different from the distance between #i and #1.
For example, if #i is #1 followed by a megabyte of new data,
reproducing #i from #1 takes a megabyte of literals, while
reproducing #1 from #i is a handful of copy opcodes.
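
To see the asymmetry concretely, here is a toy program (assuming
git's internal delta.h API and building inside the git source tree;
diff_delta() returns a malloc'ed delta buffer and stores its size,
and a max_delta_size of 0 means no limit):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "delta.h"

    int main(void)
    {
            unsigned long n1 = 1 << 10, ni = 1 << 20, d1i = 0, di1 = 0;
            char *o1 = malloc(n1), *oi = malloc(ni);
            unsigned long k;

            if (!o1 || !oi)
                    return 1;
            memset(o1, 'a', n1);
            memcpy(oi, o1, n1);        /* #i starts as a copy of #1 ... */
            for (k = n1; k < ni; k++)  /* ... plus ~1MB of unrelated data */
                    oi[k] = (char)((k * 2654435761ul) >> 13);

            free(diff_delta(o1, n1, oi, ni, &d1i, 0)); /* #1 as base for #i */
            free(diff_delta(oi, ni, o1, n1, &di1, 0)); /* #i as base for #1 */

            /* expect ~1MB of literals one way, a few copy opcodes back */
            printf("len(delta(#1 -> #i)) = %lu\n", d1i);
            printf("len(delta(#i -> #1)) = %lu\n", di1);
            return 0;
    }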


* Re: small question about the repack algorithm
From: Pierre Habouzit @ 2008-02-08  8:38 UTC
  To: Junio C Hamano; +Cc: Git ML

On Fri, Feb 08, 2008 at 07:12:44AM +0000, Junio C Hamano wrote:
> Pierre Habouzit <madcoder@debian.org> writes:
> 
> >   I've been trying to see whether the following optimization is used,
> > but I was unable to find out, as the code is a bit tough :)
> >
> >   I was wondering whether the repacking window uses the triangle
> > inequality to avoid trying some costly deltas (I assume that what
> > costs the most in repacking is computing the deltas). I mean, if you
> > consider the "size" of a delta, I'm almost sure it is very nearly a
> > distance.
> >
> >   So, assuming that we know the delta sizes between every pair of
> > reference objects in the window: if the object Od that we want to
> > delta against the window is near enough to some reference O1, then
> > any Oi in the window for which len(δ(O1, Oi)) > 2 * len(δ(Od, O1))
> > holds is not worth investigating.
> 
> We do not keep track of the delta size matrix between delta-base
> candidates in the window, but I presume we could.  The storage
> cost for doing so is very cheap (window^2 * size_t).  But we do
> not even compute the distance matrix fully (I'll mention the
> reason why the above is not (window^2 * size_t / 2) later).
> 
>     1-----------------------------i   <= delta-base candidates
>      \
>       \ 
>        D <-- the target we are considering
> 
> Your idea is that if we want to find the cheapest delta-base,
> and after we find out that candidate #1 is close to our target D
> and candidate #i is very far from candidate #1, then the delta to
> create D using candidate #i as the base would be much bigger.
> If the distance space is Euclidean, that would be a nice
> optimization.

  Well, Euclidean is too much; a simple metric space would be enough,
IIRC.

> The second reason is that deltification is not symmetric.
> If you define the "distance" between #1 and #i as "the size of
> the delta to reproduce #i using #1 as base", the distance between
> #1 and #i is very different from the distance between #i and #1.

  Well, that's the most obvious reason indeed, and I totally missed
it, dang. The delta is not close enough to being a distance. Too
bad :)

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org
