Re: [PATCH] Keep last used delta base in the delta window

All of lore.kernel.org
 help / color / mirror / Atom feed

From: David Kastrup <dak@gnu.org>
To: mkoegler@auto.tuwien.ac.at (Martin Koegler)
Cc: git@vger.kernel.org
Subject: Re: [PATCH] Keep last used delta base in the delta window
Date: Mon, 27 Aug 2007 23:04:20 +0200	[thread overview]
Message-ID: <8564308zaj.fsf@lola.goethe.zz> (raw)
In-Reply-To: <20070827204320.GA17126@auto.tuwien.ac.at> (Martin Koegler's message of "Mon\, 27 Aug 2007 22\:43\:20 +0200")

mkoegler@auto.tuwien.ac.at (Martin Koegler) writes:

> On Mon, Aug 27, 2007 at 12:07:58PM +0200, David Kastrup wrote:
>> Well, given the little amount of spare time I have for personal
>> projects, I should not go flaunting them too much, but anyway...
>> 
>> I am just drafting implementing a delta-diff size estimator: for every
>> hash value it does not store hash chains or pointers to memory images,
>> but just a bit map of at most 32 (possibly 64 where ulong delivers it,
>> possibly even just 16 bits when one restricts oneself to 16 element
>> windows in order to save memory) delta window files, telling which of
>> the files show such a hash value anywhere. 
>
> How do you want to select these few hash values?

Hm?  Are you familiar with diff-delta.c?  I was not going to change
the hash value computation in there.  It is a 32bit CRC over 16byte
passages: the delta source is checksummed with a stride of 16 bytes
and the resulting CRC values are masked to a convenient number of bits
and used as index into a hash table with disambiguation to actual code
passages using linked lists.  Then the destination delta is
checksummed in the same manner but with a stride of 1, and the sums
are looked up in the hash until a matching prefix is found.

So I was going to do the same calculation, but look up a bit mask
rather than a linked list (namely calculating under the assumption
that a hash hit implies an actual match).

> I eg. dump database tables as SQL statements in per table files and
> commit all changes. All lines in each file share a common prefix
> (INSERT INTO tablename (COL1, COL2, ...) VALUES). Statistics showed,
> that up to 50% of the hash elements of the delta index were dropped,
> because they had equal hash values.

Yes, I am aware of that.  One somewhat crazy consequence of that is
that git's deltaing works better on Linux style indentation than on
GNU style indentation.

> With you apropach:
>
> If rare used values are selected as the 32 hash values, you will
> probably regard files, which contain different tables (and therefore
> a different prefix), as "similar", although the files have not very
> much similarity otherwise.

It's statistics.  Random matches will tend to level out.

> If you select the hash values of the prefixes, you will create for
> each table one cluster (if the table count is < 32/64). This already
> happens by using the hash value of the path name in the sorting.
>
> I don't belive, that your idea will improve very much in this case.
> I fear, that it will be even worser than the current algorithm.

Huh?  I am not replacing the current algorithm.  I am doing some
upfront estimates for getting a good order for doing the current
algorithm, so that I might abort parts of it early when I know they
can't improve things.  If the estimates are completely random and/or
useless, it will mean that you get a slowdown by a comparatively small
constant factor.

>> When a file is considered for deltaing, it is run once against this
>> delta-diff estimator which does not even look at the files again.
>> This delivers a lower bound for the size a delta towards each of
>> the delta candidates can take.  The candidates are then sorted in
>> increasing size of expected delta size and are considered in turn.
>> Once a delta has a lower size than the lowest bound for further
>> delta candidates, no further candidates need to get considered.
>
> So this means, that we have to hash the target entry too (which
> could be merged with creating the delta index).

Huh?  I don't see what you are getting at.

> We currently only hash an entry, if it is really needed as source in
> a delta-diff.

Why would you want to hash a non-candidate?

>> Also the calculation of a delta can get aborted once it exceeds the
>> size of a previous delta.
>
> This already happens.

Ah yes.  Overlooked this.  Presorting the candidates should be helpful
right away then.

> Do you want to do your estimations only against a small window of
> objects or all objects?

The estimates are made against the candidates in the window.  When a
candidate leaves the window, its bit in the bit masks gets reused for
the next candidate entering the window.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

next prev parent reply	other threads:[~2007-08-27 21:04 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-08-26 20:56 [PATCH] Keep last used delta base in the delta window Martin Koegler
2007-08-27  8:48 ` Junio C Hamano
2007-08-27 10:07   ` David Kastrup
2007-08-27 20:43     ` Martin Koegler
2007-08-27 21:04       ` David Kastrup [this message]
2007-08-27 12:12   ` Junio C Hamano
2007-08-29 18:21     ` Nicolas Pitre
2007-08-29 19:26       ` Junio C Hamano
2007-08-29 21:02         ` Nicolas Pitre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8564308zaj.fsf@lola.goethe.zz \
    --to=dak@gnu.org \
    --cc=git@vger.kernel.org \
    --cc=mkoegler@auto.tuwien.ac.at \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.