Indexing Zlib deflated streams for pseudo-random-access to reduce delta-resolution memory requirements

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Indexing Zlib deflated streams for pseudo-random-access to reduce delta-resolution memory requirements
@ 2011-03-31 20:26 Sebastian Thiel
       [not found] ` <4D959B56.7040508@peralex.com>
  0 siblings, 1 reply; 2+ messages in thread
From: Sebastian Thiel @ 2011-03-31 20:26 UTC (permalink / raw)
  To: git

Hi,

Why would one have big files delta compressed anyway ? From my 
experience, this can save a lot of space, even though the process of 
generating the deltas will definitely take up a lot of memory. In my 
use-case the packs are generated on a server with plenty of RAM. And 
even though the most recent version of an object tends to be a base 
which wouldn't need delta resolution, querying older revisions will 
require it. As this is done client side most of the time, it would be 
great to reduce the memory footprint,  at the cost of some additional 
preprocessing time.

Currently, when reapplying deltas on rather big files, assumed that one 
didn't exclude them for delta compression in the first place, the 
current algorithm requires the base-buffer, the target buffer as well as 
the inflated delta stream itself in memory concurrently. Then the 
algorithm works its way up recursively from the first base object to the 
last delta until it is done. This causes plenty of possibly large memory 
allocations, as well as high memory demands.

In my current python and c++ implementation of pack reading, I try to 
stream all objects. Base objects can already be very efficiently 
streamed as a plain (streamed) inflate will do. This is different for 
delta objects. Even though the delta resolution machinery is hidden 
behind a stream interface, it still produces a big buffer to hold the 
resolved object.

My current research went far enough to allow delta streams to be merged 
together at first, so the last delta to be applied gets merged down onto 
its base delta, and so forth, until all deltas were merged into one big 
delta which just applies to the base buffer. This delta will have only 
copy operations which refer to the base object's data, as well as copy 
operations. Currently,  the merged delta stream is kept in memory 
deflated, and could be streamed  more or less easily. The problem I have 
is that the base object's data still needs to be available for random 
access, i.e. inflated in memory.

Although this technique safes some memory during processing, as it will 
never allocate two possibly large base and target buffers, in the end it 
turns out to be more expensive memory wise as it needs the base buffer 
as well as the merged delta to be allocated as long as the streaming is 
in progress, compared to just the possibly smaller target buffer in case 
of the default recursive algorithm.

Here comes the link to the subject line: If it was possible to index the 
zip deflated stream somehow, it wouldn't be required to keep an inflated 
version of the base buffer in memory all the time. Instead, the required 
portions can be decompressed on demand, using the said zip stream index. 
Then we would only need the merged delta stream in memory, which should 
(ideally) be smaller than the target buffer itself.

Do you think it is feasible to burn these extra cycles to reduce the 
memory footprint ? Do you know whether or how it is possible to index a 
zip deflated stream for pseudo-random access or can provide hints ?
So far, from studying the zlib manual, there seems to be some way of 
determining zip blocks which can then possibly be linked to the 
uncompressed data positions, but ... I am absolutely not sure. Ideally, 
this whole indexing process works by quickly skipping through the zlib 
stream without actually decompressing it.

Maybe, all this is too much effort, and maybe the merged delta stream's 
size ends up being larger than the target buffer so it was all for 
nothing (which one would only know once the preprocessing time was 
already spent), but maybe it could really help truly 'stream' deltified 
objects (which has been something like my holy grail for quite some time 
now).

Thanks,
Sebastian

^ permalink raw reply	[flat|nested] 2+ messages in thread

[parent not found: <4D959B56.7040508@peralex.com>]

* Re: Indexing Zlib deflated streams for pseudo-random-access to reduce delta-resolution memory requirements
       [not found] ` <4D959B56.7040508@peralex.com>
@ 2011-04-01 13:59   ` Sebastian Thiel
  0 siblings, 0 replies; 2+ messages in thread
From: Sebastian Thiel @ 2011-04-01 13:59 UTC (permalink / raw)
  To: Noel Grandin; +Cc: git

Hi Noel,

Thanks for your input ! Except for (a), I understand what you mean. In
case of (a), this would also mean that the delta stream gets resized and
becomes larger, which could be a bottleneck depending on how many
resizes you need. Yet, I didn't understand what (a) would mean.

Everything else makes sense, and should work well and fast assuming that
hard disk is large enough to possibly hold the temporary destination
file. If the system runs out of memory, it will just swap the memory out
onto the temporary file.

I definitely consider this an alternative to indexing the zip stream,
which might even be impossible or not viable performance wise.

Thanks,
Sebastian

On 01.04.11 11:31, Noel Grandin wrote:
> Nice work!
>
> Suggestion:
> (a) walk the merged delta stream, creating additional delta entries to
> fill in the bits that did not change i.e. create entries that look like
> copy from position 500, len 20 to position 500
> (b) sort the merged delta stream by base buffer file position
> (c) create the destination file (with zero contents)
> (d) mmap the destination file
> (e) stream the base buffer, applying the deltas to the mmap'ed file.
>
> -- Noel Grandin.
>
> Sebastian Thiel wrote:
>> Hi,
>>
>> Why would one have big files delta compressed anyway ? From my
>> experience, this can save a lot of space, even though the process of
>> generating the deltas will definitely take up a lot of memory. In my
>> use-case the packs are generated on a server with plenty of RAM. And
>> even though the most recent version of an object tends to be a base
>> which wouldn't need delta resolution, querying older revisions will
>> require it. As this is done client side most of the time, it would be
>> great to reduce the memory footprint,  at the cost of some additional
>> preprocessing time.
>>
>> Currently, when reapplying deltas on rather big files, assumed that
>> one didn't exclude them for delta compression in the first place, the
>> current algorithm requires the base-buffer, the target buffer as well
>> as the inflated delta stream itself in memory concurrently. Then the
>> algorithm works its way up recursively from the first base object to
>> the last delta until it is done. This causes plenty of possibly large
>> memory allocations, as well as high memory demands.
>>
>> In my current python and c++ implementation of pack reading, I try to
>> stream all objects. Base objects can already be very efficiently
>> streamed as a plain (streamed) inflate will do. This is different for
>> delta objects. Even though the delta resolution machinery is hidden
>> behind a stream interface, it still produces a big buffer to hold the
>> resolved object.
>>
>> My current research went far enough to allow delta streams to be
>> merged together at first, so the last delta to be applied gets merged
>> down onto its base delta, and so forth, until all deltas were merged
>> into one big delta which just applies to the base buffer. This delta
>> will have only copy operations which refer to the base object's data,
>> as well as copy operations. Currently,  the merged delta stream is
>> kept in memory deflated, and could be streamed  more or less easily.
>> The problem I have is that the base object's data still needs to be
>> available for random access, i.e. inflated in memory.
>>
>> Although this technique safes some memory during processing, as it
>> will never allocate two possibly large base and target buffers, in
>> the end it turns out to be more expensive memory wise as it needs the
>> base buffer as well as the merged delta to be allocated as long as
>> the streaming is in progress, compared to just the possibly smaller
>> target buffer in case of the default recursive algorithm.
>>
>> Here comes the link to the subject line: If it was possible to index
>> the zip deflated stream somehow, it wouldn't be required to keep an
>> inflated version of the base buffer in memory all the time. Instead,
>> the required portions can be decompressed on demand, using the said
>> zip stream index. Then we would only need the merged delta stream in
>> memory, which should (ideally) be smaller than the target buffer itself.
>>
>> Do you think it is feasible to burn these extra cycles to reduce the
>> memory footprint ? Do you know whether or how it is possible to index
>> a zip deflated stream for pseudo-random access or can provide hints ?
>> So far, from studying the zlib manual, there seems to be some way of
>> determining zip blocks which can then possibly be linked to the
>> uncompressed data positions, but ... I am absolutely not sure.
>> Ideally, this whole indexing process works by quickly skipping
>> through the zlib stream without actually decompressing it.
>>
>> Maybe, all this is too much effort, and maybe the merged delta
>> stream's size ends up being larger than the target buffer so it was
>> all for nothing (which one would only know once the preprocessing
>> time was already spent), but maybe it could really help truly
>> 'stream' deltified objects (which has been something like my holy
>> grail for quite some time now).
>>
>> Thanks,
>> Sebastian
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe git" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
>
>
> ------------------------------------------------------------------------
> Disclaimer: http://www.peralex.com/disclaimer.html
>

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2011-04-01 14:00 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-31 20:26 Indexing Zlib deflated streams for pseudo-random-access to reduce delta-resolution memory requirements Sebastian Thiel
     [not found] ` <4D959B56.7040508@peralex.com>
2011-04-01 13:59   ` Sebastian Thiel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).