Re: Multiblobs - Avery Pennarun

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Avery Pennarun <apenwarr@gmail.com>
To: Sergio <sergio.callegari@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Multiblobs
Date: Wed, 28 Apr 2010 20:44:07 -0400	[thread overview]
Message-ID: <h2w32541b131004281744xea800f1eq9459bbe462ba3a1e@mail.gmail.com> (raw)
In-Reply-To: <loom.20100429T010742-199@post.gmane.org>

On Wed, Apr 28, 2010 at 7:26 PM, Sergio <sergio.callegari@gmail.com> wrote:
> Avery Pennarun <apenwarr <at> gmail.com> writes:
>> But why not use a .gitattributes filter to recompress the zip/odp file
>> with no compression, as I suggested?  Then you can just dump the whole
>> thing into git directly.  When you change the file, only the changes
>> need to be stored thanks to delta compression.  Unless your
>> presentation is hundreds of megs in size, git should be able to handle
>> that just fine already.
>
> Actually, I'm doing so...  But in some occasions odf file that share many
> components do not delta, even when passed through a filter that uncompresses
> them. Multiblobs are like taking advantage of a known structure to get better
> deltas.

Hmm, it might be a good idea to investigate the specific reasons why
that's not working.  Fixing it may be easier (and help more people)
than introducing a whole new infrastructure for these multiblobs.

>> But then you're digging around inside the pdf file by hand, which is a
>> lot of pdf-specific work that probably doesn't belong inside git.
>
> I perfectly agree that git should not know about the inner structure of things
> like PDFs, Zips, Tars, Jars, whatever. But having an infrastructure allowing
> multiblobs and attributes like clean/smudge to trigger creation and use of
> multiblobs with user provided split/unsplit drivers could be nice.

Yes, it could.  Sorry to be playing the devil's advocate :)

>> Worse, because compression programs don't always produce the same
>> output, this operation would most likely actually *change* the hash of
>> your pdf file as you do it.
>
> This should depend on the split/unsplit driver that you write. If your driver
> stores a sufficient amount of metadata about the streams and their order, you
> should be able to recreate the original file.

Almost.  The one thing you can't count on replicating reliably is
compression.  If you use git-zlib the first time, and git-zlib the
second time with the same settings, of course the results will be
identical each time.  But if the original file used Acrobat-zlib, and
your new one uses git-zlib, the most likely situation is the files
will be functionally identical but not the same stream of bytes, and
that could be a problem.  (Then again, maybe it's not a problem in
some use cases.)

Another danger of this method is that different versions of git may
have slightly different versions of zlib that compress slightly
differently.  In that case, you'd (rather surprisingly) end up with
different output files depending which version of git you use to check
them out.  Maybe that's manageable, though.

>> In what way?  I doubt you'd get more efficient storage, at least.
>> Git's deltas are awfully hard to beat.
>
> Using the known structure of the file, you automatically identify the bits that
> are identical and you save the need to find a delta altogether.

bup avoids the need to find a delta altogether.  This isn't entirely a
good thing; it's a necessity because it processes huge amounts of data
and doing deltas across it all would be ungodly slow.

However, in all my tests (except with massively self-redundant files
like VMware images) deltas are at least somewhat smaller than bup
deduplication.  This isn't surprising, since deltas can eliminate
duplication on a byte-by-byte level, while bup chunks have a much
larger threshold (around 8k).

So I question the idea that this method would actually save any space
over git's existing deltas.  CPU time, yes, but only really during gc,
and you can run gc overnight while you're not waiting for it.

>> In that case, I'd like to see some comparisons of real numbers
>> (memory, disk usage, CPU usage) when storing your openoffice documents
>> (using the .gitattributes filter, of course).  I can't really imagine
>> how splitting the files into more pieces would really improve disk
>> space usage, at least.
>
> I'll try to isolate test cases, making test repos:
>
> a) with 1 odf file changing a little on each checkin
> b) the same storing the odf file with no compression with a suitable filter
> c) the same storing the tree inside the odf file.

This sounds like it would be quite interesting to see.  I would also
be interested in d) the test from (b) using bup instead of git.

You might also want to compare results with 'git gc' vs. 'git gc --aggressive'.

>> Having done some tests while writing bup, my experience has been that
>> chunking-without-deltas is great for these situations:
>> 1) you have the same data shared across *multiple* files (eg. the same
>> images in lots of openoffice documents with different filenames);
>> 2) you have the same data *repeated* in the same file at large
>> distances (so that gzip compression doesn't catch it; eg. VMware
>> images)
>> 3) your file is too big to work with the delta compressor (eg. VMware images).
>
> An aside: bup is great!!! Thanks!

Glad you like it :)

Have fun,

Avery

next prev parent reply	other threads:[~2010-04-29  0:44 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-04-28 15:12 Multiblobs Sergio Callegari
2010-04-28 18:07 ` Multiblobs Avery Pennarun
2010-04-28 19:13   ` Multiblobs Sergio Callegari
2010-04-28 21:27     ` Multiblobs Avery Pennarun
2010-04-28 23:10       ` Multiblobs Michael Witten
2010-04-28 23:26       ` Multiblobs Sergio
2010-04-29  0:44         ` Avery Pennarun [this message]
2010-04-29 11:34       ` Multiblobs Peter Krefting
2010-04-29 15:28         ` Multiblobs Avery Pennarun
2010-04-30  8:20           ` Multiblobs Peter Krefting
2010-04-30 17:26             ` Multiblobs Avery Pennarun
2010-04-30  9:14     ` Multiblobs Hervé Cauwelier
2010-04-30 17:32       ` Multiblobs Avery Pennarun
2010-04-30 18:16       ` Multiblobs Michael Witten
2010-04-30 19:06         ` Multiblobs Hervé Cauwelier
2010-04-28 18:34 ` Multiblobs Geert Bosch
2010-04-29  6:55 ` Multiblobs Mike Hommey
2010-05-06  6:26 ` Multiblobs Jeff King
2010-05-06 22:56   ` Multiblobs Sergio Callegari
2010-05-10  6:36     ` Multiblobs Jeff King
2010-05-10 13:58       ` Multiblobs Sergio Callegari

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=h2w32541b131004281744xea800f1eq9459bbe462ba3a1e@mail.gmail.com \
    --to=apenwarr@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=sergio.callegari@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).