* Multiblobs
@ 2010-04-28 15:12 Sergio Callegari
  2010-04-28 18:07 ` Multiblobs Avery Pennarun
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Sergio Callegari @ 2010-04-28 15:12 UTC (permalink / raw)
  To: git

Hi,

I happened to read an older post by Jeff King about "multiblobs"
(http://kerneltrap.org/mailarchive/git/2008/4/6/1360014) and I was wondering
whether the idea has been abandoned for some reason or just put on hold.

Apparently, this would marvellously help with
- storing large binary blobs (the split could happen with a rolling checksum
approach)
- storing "structured files", such as the many zip-based file formats
(Opendocument, Docx, Jar files, zip files themselves), tars (including
compressed tars), pdfs, etc, whose number is rising day after day...
- storing binary files with textual tags, where the tags could go on a separate
blob, greatly simplifying their readout without any need for caching them on a
note tree.
- etc...

Furthermore, this could also
- help the management of upstream trees. This could be simplified since the
"pristine tree" distributed as a tar.gz file and the exploded repo could share
their blobs, making commands such as pristine-tar unnecessary.
- help projects such as bup that currently need to provide split mechanisms of
their own.
- be used to add "different representations" to objects... for instance, when
storing a pdf one could use a fake split to store in a separate blob the
corresponding text, making the git-diff of pdfs almost instantaneous.

From Jeff's post, I guess that the major issue could be that the same file could
get a different sha1 as a multiblob versus a regular blob, but maybe it could be
possible to make the multiblob take the same sha1 as the "equivalent plain blob"
rather than its real hash.

For the moment, I am just very curious about the idea and the possible pros and
cons... can someone (maybe Jeff himself) tell me a little more? Also I wonder
about the two possibilities (implement it in git vs implement it "on top of"
git).

Sergio


* Re: Multiblobs
  2010-04-28 15:12 Multiblobs Sergio Callegari
@ 2010-04-28 18:07 ` Avery Pennarun
  2010-04-28 19:13   ` Multiblobs Sergio Callegari
  2010-04-28 18:34 ` Multiblobs Geert Bosch
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Avery Pennarun @ 2010-04-28 18:07 UTC (permalink / raw)
  To: Sergio Callegari; +Cc: git

On Wed, Apr 28, 2010 at 11:12 AM, Sergio Callegari
<sergio.callegari@gmail.com> wrote:
> - storing "structured files", such as the many zip-based file formats
> (Opendocument, Docx, Jar files, zip files themselves), tars (including
> compressed tars), pdfs, etc, whose number is rising day after day...

I'm not sure it would help very much for these sorts of files.  The
problem is that compressed files tend to change a lot even if only a
few bytes of the original data have changed.

For things like opendocument, or uncompressed tars, you'd be better
off to decompress them (or recompress with zip -0) using
.gitattributes.  Generally these files aren't *so* large that they
really need to be chunked; what you want to do is improve the deltas,
which decompressing will do.
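
To make that concrete, a "clean" filter in this spirit could rewrite a
zip container with every member stored uncompressed, so the deltas see
the raw content.  This is only a sketch: the "zip0" filter name, the
Python, and the wiring below are illustrative, not anything git ships.

    #!/usr/bin/env python3
    # Hypothetical clean filter: rewrite a zip-based file (odt/odp/docx/jar)
    # with ZIP_STORED (no compression) so git can delta the raw members.
    import io, sys, zipfile

    def store_uncompressed(data):
        src = zipfile.ZipFile(io.BytesIO(data))
        out = io.BytesIO()
        with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
            for info in src.infolist():
                dst.writestr(info.filename, src.read(info.filename))
        return out.getvalue()

    if __name__ == "__main__":
        sys.stdout.buffer.write(store_uncompressed(sys.stdin.buffer.read()))

You would wire it up with something like "*.odt filter=zip0" in
.gitattributes and "git config filter.zip0.clean <script>", plus a
matching smudge filter if you want compression back on checkout.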

> - storing binary files with textual tags, where the tags could go on a separate
> blob, greatly simplifying their readout without any need for caching them on a
> note tree.

That sounds complicated and error prone, and is suspiciously like
Apple's "resource forks," which even Apple has mostly realized were a
bad idea.

> - help the management of upstream trees. This could be simplified since the
> "pristine tree" distributed as a tar.gz file and the exploded repo could share
> their blobs, making commands such as pristine-tar unnecessary.

Sharing the blobs of a tarball with a checked-out tree would require a
tar-specific chunking algorithm.  Not impossible, but a pain, and you
might have a hard time getting it accepted into git since it's
obviously not something you really need for a normal "source code"
tracking system.

> - help projects such as bup that currently need to provide split mechanisms of
> their own.

Since bup is so awesome that it will soon rule the world of file
splitting backup systems, and bup already has a working implementation,
this reason by itself probably isn't enough to integrate the feature
into git.

> - be used to add "different representations" to objects... for instance, when
> storing a pdf one could use a fake split to store in a separate blob the
> corresponding text, making the git-diff of pdfs almost instantaneous.

Aie, files that have different content depending on how you look at them?
You'll make a lot of enemies with such a patch :)

> From Jeff's post, I guess that the major issue could be that the same file could
> get a different sha1 as a multiblob versus a regular blob, but maybe it could be
> possible to make the multiblob take the same sha1 as the "equivalent plain blob"
> rather than its real hash.

I think that's actually not a very important problem.  Files that are
different will still always have differing sha1s, which is the
important part.  Files that are the same might not have the same sha1,
which is a bit weird, but it's unlikely that any algorithm in git
depends fundamentally on the fact that the sha1s match.

Storing files as split does have a lot of usefulness for calculating
diffs, however: because you can walk through the tree of hashes and
short-circuit entire subtrees with identical sha1s, you can diff even
20GB files really rapidly.
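
(As a rough sketch of why that works, assuming the multiblob is stored
as a flat mapping of chunk names to sha1s -- the names and layout here
are invented for the example, not git's:

    # Compare two chunk tables and return only the names whose hashes
    # differ; identical chunks are skipped without reading their contents.
    def changed_chunks(a, b):
        return [name for name in sorted(set(a) | set(b))
                if a.get(name) != b.get(name)]

The same idea applies recursively when the chunks form a tree.)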

> For the moment, I am just very curious about the idea and the possible pros and
> cons... can someone (maybe Jeff himself) tell me a little more? Also I wonder
> about the two possibilities (implement it in git vs implement it "on top of"
> git).

"on top of" git has one major advantage, which is that it's easy: for
example, bup already does it.  The disadvantage is that a checkout of
the resulting repository won't be smart enough to re-merge the data,
so you have a bunch of tiny chunk files you have to concatenate by
hand.

Implementing inside git could be done in one of two ways: add support
for a new 'multiblob' data type (which is really more like a tree
object, but gets checked out as a single file), or implement chunking
at the packfile level, so that higher-level tools never have to know
about multiblobs.

The latter would probably be easier and more backward-compatible,
but you'd probably lose the ability to do really fast diffs between
multiblobs, since diff happens at the higher level.

Overall, I'm not sure git would benefit much from supporting large
files in this way; at least not yet.  As soon as you supported this,
you'd start running into other problems... such as the fact that
shallow repos don't really work very well, and you obviously don't
want to clone every single copy of a 100MB file just so you can edit
the most recent version.  So you might want to make sure shallow repos
/ sparse checkouts are fully up to speed first.

Have fun,

Avery


* Re: Multiblobs
  2010-04-28 15:12 Multiblobs Sergio Callegari
  2010-04-28 18:07 ` Multiblobs Avery Pennarun
@ 2010-04-28 18:34 ` Geert Bosch
  2010-04-29  6:55 ` Multiblobs Mike Hommey
  2010-05-06  6:26 ` Multiblobs Jeff King
  3 siblings, 0 replies; 21+ messages in thread
From: Geert Bosch @ 2010-04-28 18:34 UTC (permalink / raw)
  To: Sergio Callegari; +Cc: git


On Apr 28, 2010, at 11:12, Sergio Callegari wrote:

> Hi,
> 
> I happened to read an older post by Jeff King about "multiblobs"
> (http://kerneltrap.org/mailarchive/git/2008/4/6/1360014) and I was wondering
> whether the idea has been abandoned for some reason or just put on hold.
> 
> Apparently, this would marvellously help with
> - storing large binary blobs (the split could happen with a rolling checksum
> approach)
> - storing "structured files", such as the many zip-based file formats
> (Opendocument, Docx, Jar files, zip files themselves), tars (including
> compressed tars), pdfs, etc, whose number is rising day after day...
> - storing binary files with textual tags, where the tags could go on a separate
> blob, greatly simplifying their readout without any need for caching them on a
> note tree.
> - etc...

In the early days of GIT I once implemented a "git pipe" command that would
allow an unbounded stream of data to be stored in GIT. The stream would be
broken up into small segments using context-sensitive break points (essentially
points in the data where a hash H of the last N bytes modulo P is equal to some Q).
The average segment length will then be about P bytes long.
Multiple segments would be put in a tree with each tree entry's name being the
cumulative length of the segment or subtree it references, with enough leading
zeros to accommodate the largest length in the tree.
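
In rough Python terms, the splitting step looks something like this.
This is only a sketch of the scheme just described, not the original
code; the constants and the deliberately weak sum-of-the-window hash are
illustrative (a real implementation wants a proper rolling hash):

    from collections import deque

    def chunk_boundaries(data, N=64, P=8192, Q=0):
        """Yield (start, end) offsets of content-defined chunks of data."""
        window = deque(maxlen=N)
        h = 0                # hash of the last N bytes: here simply their sum
        start = 0
        for i, b in enumerate(data):
            if len(window) == N:
                h -= window[0]            # drop the byte leaving the window
            window.append(b)
            h += b
            if len(window) == N and h % P == Q:
                yield (start, i + 1)      # with a good hash, chunks average ~P bytes
                start = i + 1
        if start < len(data):
            yield (start, len(data))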

This works well and allows efficient diff operations or updates of arbitrarily
large files. In particular, all operations take a time proportional to the
size of the change rather than the size of the file.

The drawbacks are:

  - All of the variables H, N, P and Q above influence the final hash
    that is computed for an object, so the values picked must work well.
  - You'd only want to use this method for largish files, but because
    this threshold influences final hashes, it again should be picked with care.
  - It is more complex than having just simple, straight blobs.

One of the nice aspects of this representation is that extracting the tree
into the local filesystem and concatenating all files in the directory
tree in alphabetical order does yield the original file.
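
(For illustration only, not the original code: the entry names can be
produced as zero-padded cumulative lengths, so that lexicographic order
equals concatenation order:

    def entry_names(chunk_lengths):
        totals, running = [], 0
        for n in chunk_lengths:
            running += n
            totals.append(running)
        width = len(str(totals[-1]))
        return [str(t).zfill(width) for t in totals]

    # entry_names([100, 250, 80]) -> ['100', '350', '430']

so simply sorting and concatenating the checked-out entries reproduces
the original file.)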

  -Geert


* Re: Multiblobs
  2010-04-28 18:07 ` Multiblobs Avery Pennarun
@ 2010-04-28 19:13   ` Sergio Callegari
  2010-04-28 21:27     ` Multiblobs Avery Pennarun
  2010-04-30  9:14     ` Multiblobs Hervé Cauwelier
  0 siblings, 2 replies; 21+ messages in thread
From: Sergio Callegari @ 2010-04-28 19:13 UTC (permalink / raw)
  To: git

Avery Pennarun <apenwarr <at> gmail.com> writes:

> 
> On Wed, Apr 28, 2010 at 11:12 AM, Sergio Callegari
> <sergio.callegari <at> gmail.com> wrote:
> > - storing "structured files", such as the many zip-based file formats
> > (Opendocument, Docx, Jar files, zip files themselves), tars (including
> > compressed tars), pdfs, etc, whose number is rising day after day...
> 
> I'm not sure it would help very much for these sorts of files.  The
> problem is that compressed files tend to change a lot even if only a
> few bytes of the original data have changed.

Probably I have not provided enough detail... My idea is the following:

If you store a structured file as a multiblob, you can use a blob for each
uncompressed element of content.  For instance, when storing an opendocument
file you could use a blob for manifest.xml, one for content.xml, etc... (try
unzip -l on an odt or odp file to get an idea). When you edit your file, only a
few of these change. For instance, if we talk about a presentation, each slide
has its own content.xml, so changing one slide changes only that.

The same goes for PDF files: if you split them using a blob for each uncompressed
stream, little variations of the pdf file will touch only one blob.

In other words, to benefit from multiblobs you should use a different splitting
strategy for PDFs (1 blob per uncompressed stream + 1 header blob telling how
streams should be put together), Zip files (1 blob per uncompressed file + 1
header blob also containing metadata), long unstructured binary files (1 blob
per chunk + 1 header blob), etc.
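
As a toy illustration of the zip case (the header format and the use of
sha1 as a blob id here are made up for the example, they are not an
existing git feature):

    import hashlib, io, json, zipfile

    def split_zip(data):
        """Split a zip-based file into one blob per member plus a header blob."""
        src = zipfile.ZipFile(io.BytesIO(data))
        blobs, manifest = {}, []
        for info in src.infolist():
            member = src.read(info.filename)
            oid = hashlib.sha1(member).hexdigest()   # stand-in for a blob id
            blobs[oid] = member
            manifest.append({"name": info.filename, "blob": oid,
                             "date_time": list(info.date_time)})
        header = json.dumps(manifest, indent=1).encode()
        return header, blobs

Editing one member would then leave every other blob untouched.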

> For things like opendocument, or uncompressed tars, you'd be better
> off to decompress them (or recompress with zip -0) using
> .gitattributes.  Generally these files aren't *so* large that they
> really need to be chunked; what you want to do is improve the deltas,
> which decompressing will do.

This is what I currently do.  But using multiblobs would be a definite
improvement over this.
 
> > - storing binary files with textual tags, where the tags could go on a separate
> > blob, greatly simplifying their readout without any need for caching them on a
> > note tree.
> 
> That sounds complicated and error prone, and is suspiciously like
> Apple's "resource forks," which even Apple has mostly realized were a
> bad idea.

I did not mean the Apple way... Suppose that you need to store images with exif
tags.  In order to diff them you would typically set a textconv attribute, to
see only the tags.  However, this kind of filter needs to read the whole file
(expensive). BTW this is why a caching mechanism involving notes has recently
been proposed. Now suppose that you can set up a rule so that image files with
tags are stored as a multiblob. You can use 3 blobs... 1 as a header, one for
the raw image data and one for the tags.  Now your textconv filter only needs to
look at the content of the tags blob.

> > - help the management of upstream trees. This could be simplified since the
> > "pristine tree" distributed as a tar.gz file and the exploded repo could share
> > their blobs, making commands such as pristine-tar unnecessary.

Similar... Right now to do package management with git, you need to use pristine
tar. This is because when you check in the upstream tar you only check in its
elements, not the whole tar.gz.  So you need pristine tar to recreate the
upstream tar.gz whenever needed. But with multiblob you could store both the
content /and/ the upstream tar and there would be minimal overhead since the
blobs would be the same. 
 
> Sharing the blobs of a tarball with a checked-out tree would require a
> tar-specific chunking algorithm.  Not impossible, but a pain, and you
> might have a hard time getting it accepted into git since it's
> obviously not something you really need for a normal "source code"
> tracking system.

I agree... but there could be just a mere couple of gitattributes multiblobsplit
and multiblobcompose, so that one could provide his own splitting and composing
methods for the types of files he is interested in (and maybe contribute them to
the community).

> > - help projects such as bup that currently need to provide split mechanisms of
> > their own.
> 
> Since bup is so awesome that it will soon rule the world of file
> splitting backup systems, and bup already has a working implementation,
> this reason by itself probably isn't enough to integrate the feature
> into git.

On this I tend to agree!

> > - be used to add "different representations" to objects... for instance, when
> > storing a pdf one could use a fake split to store in a separate blob the
> > corresponding text, making the git-diff of pdfs almost instantaneous.
> 
> Aie, files that have different content depending on how you look at them?
> You'll make a lot of enemies with such a patch :)

I would not consider it as different content... rather as a way to cache data
you might need.  But I agree this is probably going too far.
 
> Overall, I'm not sure git would benefit much from supporting large
> files in this way; at least not yet.  As soon as you supported this,
> you'd start running into other problems... such as the fact that
> shallow repos don't really work very well, and you obviously don't
> want to clone every single copy of a 100MB file just so you can edit
> the most recent version.  So you might want to make sure shallow repos
> / sparse checkouts are fully up to speed first.

I am not really thinking that much about large binary files (that would anyway
come as a bonus - an many people often talk about them on the list), but of
structured files that currently do not pack well.  My personal issue is with
opendocument files, since I need to check in lots of documentation and
presentation material.


* Re: Multiblobs
  2010-04-28 19:13   ` Multiblobs Sergio Callegari
@ 2010-04-28 21:27     ` Avery Pennarun
  2010-04-28 23:10       ` Multiblobs Michael Witten
                         ` (2 more replies)
  2010-04-30  9:14     ` Multiblobs Hervé Cauwelier
  1 sibling, 3 replies; 21+ messages in thread
From: Avery Pennarun @ 2010-04-28 21:27 UTC (permalink / raw)
  To: Sergio Callegari; +Cc: git

On Wed, Apr 28, 2010 at 3:13 PM, Sergio Callegari
<sergio.callegari@gmail.com> wrote:
> Avery Pennarun <apenwarr <at> gmail.com> writes:
>> I'm not sure it would help very much for these sorts of files.  The
>> problem is that compressed files tend to change a lot even if only a
>> few bytes of the original data have changed.
>
> Probably I have not provided enough detail... My idea is the following:
>
> If you store a structured file as a multiblob, you can use a blob for each
> uncompressed element of content.  For instance, when storing an opendocument
> file you could use a blob for manifest.xml, one for content.xml, etc... (try
> unzip -l on an odt or odp file to get an idea). When you edit your file, only a
> few of these change. For instance, if we talk about a presentation, each slide
> has its own content.xml, so changing one slide changes only that.

But why not use a .gitattributes filter to recompress the zip/odp file
with no compression, as I suggested?  Then you can just dump the whole
thing into git directly.  When you change the file, only the changes
need to be stored thanks to delta compression.  Unless your
presentation is hundreds of megs in size, git should be able to handle
that just fine already.

> The same goes for PDF files: if you split them using a blob for each uncompressed
> stream, little variations of the pdf file will touch only one blob.

But then you're digging around inside the pdf file by hand, which is a
lot of pdf-specific work that probably doesn't belong inside git.
Worse, because compression programs don't always produce the same
output, this operation would most likely actually *change* the hash of
your pdf file as you do it.  (That's also true for openoffice files,
but at least those are just plain zip files, and zip files are
somewhat less of a special case.)

>> For things like opendocument, or uncompressed tars, you'd be better
>> off to decompress them (or recompress with zip -0) using
>> .gitattributes.  Generally these files aren't *so* large that they
>> really need to be chunked; what you want to do is improve the deltas,
>> which decompressing will do.
>
> This is what I currently do.  But using multiblobs would be a definite
> improvement over this.

In what way?  I doubt you'd get more efficient storage, at least.
Git's deltas are awfully hard to beat.

>> That sounds complicated and error prone, and is suspiciously like
>> Apple's "resource forks," which even Apple has mostly realized were a
>> bad idea.
>
> I did not mean the Apple way... Suppose that you need to store images with exif
> tags.  In order to diff them you would typically set a textconv attribute, to
> see only the tags.  However, this kind of filter needs to read the whole file
> (expensive). BTW this is why a caching mechanism involving notes has recently
> been proposed. Now suppose that you can set up a rule so that image files with
> tags are stored as a multiblob. You can use 3 blobs... 1 as a header, one for
> the raw image data and one for the tags.  Now your textconv filter only needs to
> look at the content of the tags blob.

A resource fork by any other name is still a resource fork, and it's
still ugly.  If you really need something like this, just cache the
attributes in a file alongside the big file, and store both files in
the git repo.

> Similar... Right now to do package management with git, you need to use pristine
> tar. This is because when you check in the upstream tar you only check in its
> elements, not the whole tar.gz.  So you need pristine tar to recreate the
> upstream tar.gz whenever needed. But with multiblob you could store both the
> content /and/ the upstream tar and there would be minimal overhead since the
> blobs would be the same.

I guess.  For something like that, though, Debian's pristine-tar
tool seems to already solve the problem and works with any VCS, not
just git.

>> Sharing the blobs of a tarball with a checked-out tree would require a
>> tar-specific chunking algorithm.  Not impossible, but a pain, and you
>> might have a hard time getting it accepted into git since it's
>> obviously not something you really need for a normal "source code"
>> tracking system.
>
> I agree... but there could be just a mere couple of gitattributes multiblobsplit
> and multiblobcompose, so that one could provide his own splitting and composing
> methods for the types of files he is interested in (and maybe contribute them to
> the community).

I guess this would be mostly harmless; the implementation could mirror
the filter stuff.

> I am not really thinking that much about large binary files (that would anyway
> come as a bonus - and many people often talk about them on the list), but of
> structured files that currently do not pack well.  My personal issue is with
> opendocument files, since I need to check in lots of documentation and
> presentation material.

In that case, I'd like to see some comparisons of real numbers
(memory, disk usage, CPU usage) when storing your openoffice documents
(using the .gitattributes filter, of course).  I can't really imagine
how splitting the files into more pieces would really improve disk
space usage, at least.

Having done some tests while writing bup, my experience has been that
chunking-without-deltas is great for these situations:
1) you have the same data shared across *multiple* files (eg. the same
images in lots of openoffice documents with different filenames);
2) you have the same data *repeated* in the same file at large
distances (so that gzip compression doesn't catch it; eg. VMware
images)
3) your file is too big to work with the delta compressor (eg. VMware images).

However, in my experience #1 is pretty rare and #2 and #3 aren't in
your use case.  And deltas-between-chunks is not very easy to do,
since it's hard to guess which chunks might be "similar" to which
other chunks.

Personally, I think it would be great if git could natively handle
large numbers of large binary files efficiently, because there are a
few use cases I would have for it.  But whenever I start investigating
my use cases, it always turns out that just "supporting large files"
is just the tip of the iceberg, and there's a huge submerged mass of
iceberg that becomes obvious as soon as you start crashing into it.

The bup use case (write-once, read-almost-never, incremental backups)
is a rare exception in which fixing *only* the file size problem has
produced useful results.

Have fun,

Avery


* Re: Multiblobs
  2010-04-28 21:27     ` Multiblobs Avery Pennarun
@ 2010-04-28 23:10       ` Michael Witten
  2010-04-28 23:26       ` Multiblobs Sergio
  2010-04-29 11:34       ` Multiblobs Peter Krefting
  2 siblings, 0 replies; 21+ messages in thread
From: Michael Witten @ 2010-04-28 23:10 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Sergio Callegari, git

On Wed, Apr 28, 2010 at 16:27, Avery Pennarun <apenwarr@gmail.com> wrote:
>
> But then you're digging around inside the pdf file by hand, which is a
> lot of pdf-specific work that probably doesn't belong inside git.

Core git could provide just the mechanisms for easily defining
'plugins' for handling different formats.


* Re: Multiblobs
  2010-04-28 21:27     ` Multiblobs Avery Pennarun
  2010-04-28 23:10       ` Multiblobs Michael Witten
@ 2010-04-28 23:26       ` Sergio
  2010-04-29  0:44         ` Multiblobs Avery Pennarun
  2010-04-29 11:34       ` Multiblobs Peter Krefting
  2 siblings, 1 reply; 21+ messages in thread
From: Sergio @ 2010-04-28 23:26 UTC (permalink / raw)
  To: git

Avery Pennarun <apenwarr <at> gmail.com> writes:

> But why not use a .gitattributes filter to recompress the zip/odp file
> with no compression, as I suggested?  Then you can just dump the whole
> thing into git directly.  When you change the file, only the changes
> need to be stored thanks to delta compression.  Unless your
> presentation is hundreds of megs in size, git should be able to handle
> that just fine already.

Actually, I'm doing so...  But on some occasions odf files that share many
components do not delta, even when passed through a filter that uncompresses
them. Multiblobs are like taking advantage of a known structure to get better
deltas.

> But then you're digging around inside the pdf file by hand, which is a
> lot of pdf-specific work that probably doesn't belong inside git.

I perfectly agree that git should not know about the inner structure of things
like PDFs, Zips, Tars, Jars, whatever. But having an infrastructure allowing
multiblobs and attributes like clean/smudge to trigger creation and use of
multiblobs with user provided split/unsplit drivers could be nice.

> Worse, because compression programs don't always produce the same
> output, this operation would most likely actually *change* the hash of
> your pdf file as you do it. 

This should depend on the split/unsplit driver that you write. If your driver
stores a sufficient amount of metadata about the streams and their order, you
should be able to recreate the original file.

> In what way?  I doubt you'd get more efficient storage, at least.
> Git's deltas are awfully hard to beat.

Using the known structure of the file, you automatically identify the bits that
are identical and you save the need to find a delta altogether.


> > I agree... but there could be just a mere couple of gitattributes multiblobsplit
> > and multiblobcompose, so that one could provide his own splitting and composing
> > methods for the types of files he is interested in (and maybe contribute them to
> > the community).
> 
> I guess this would be mostly harmless; the implementation could mirror
> the filter stuff.

This is exactly what I was thinking of: multiblobs as a generalization of the
filter infrastructure.

> In that case, I'd like to see some comparisons of real numbers
> (memory, disk usage, CPU usage) when storing your openoffice documents
> (using the .gitattributes filter, of course).  I can't really imagine
> how splitting the files into more pieces would really improve disk
> space usage, at least.

I'll try to isolate test cases, making test repos:

a) with 1 odf file changing a little on each checkin
b) the same storing the odf file with no compression with a suitable filter
c) the same storing the tree inside the odf file.

> Having done some tests while writing bup, my experience has been that
> chunking-without-deltas is great for these situations:
> 1) you have the same data shared across *multiple* files (eg. the same
> images in lots of openoffice documents with different filenames);
> 2) you have the same data *repeated* in the same file at large
> distances (so that gzip compression doesn't catch it; eg. VMware
> images)
> 3) your file is too big to work with the delta compressor (eg. VMware images).

An aside: bup is great!!! Thanks!
 
And thanks for all your comments, of course!

Sergio


* Re: Multiblobs
  2010-04-28 23:26       ` Multiblobs Sergio
@ 2010-04-29  0:44         ` Avery Pennarun
  0 siblings, 0 replies; 21+ messages in thread
From: Avery Pennarun @ 2010-04-29  0:44 UTC (permalink / raw)
  To: Sergio; +Cc: git

On Wed, Apr 28, 2010 at 7:26 PM, Sergio <sergio.callegari@gmail.com> wrote:
> Avery Pennarun <apenwarr <at> gmail.com> writes:
>> But why not use a .gitattributes filter to recompress the zip/odp file
>> with no compression, as I suggested?  Then you can just dump the whole
>> thing into git directly.  When you change the file, only the changes
>> need to be stored thanks to delta compression.  Unless your
>> presentation is hundreds of megs in size, git should be able to handle
>> that just fine already.
>
> Actually, I'm doing so...  But on some occasions odf files that share many
> components do not delta, even when passed through a filter that uncompresses
> them. Multiblobs are like taking advantage of a known structure to get better
> deltas.

Hmm, it might be a good idea to investigate the specific reasons why
that's not working.  Fixing it may be easier (and help more people)
than introducing a whole new infrastructure for these multiblobs.

>> But then you're digging around inside the pdf file by hand, which is a
>> lot of pdf-specific work that probably doesn't belong inside git.
>
> I perfectly agree that git should not know about the inner structure of things
> like PDFs, Zips, Tars, Jars, whatever. But having an infrastructure allowing
> multiblobs and attributes like clean/smudge to trigger creation and use of
> multiblobs with user provided split/unsplit drivers could be nice.

Yes, it could.  Sorry to be playing the devil's advocate :)

>> Worse, because compression programs don't always produce the same
>> output, this operation would most likely actually *change* the hash of
>> your pdf file as you do it.
>
> This should depend on the split/unsplit driver that you write. If your driver
> stores a sufficient amount of metadata about the streams and their order, you
> should be able to recreate the original file.

Almost.  The one thing you can't count on replicating reliably is
compression.  If you use git-zlib the first time, and git-zlib the
second time with the same settings, of course the results will be
identical each time.  But if the original file used Acrobat-zlib, and
your new one uses git-zlib, the most likely situation is the files
will be functionally identical but not the same stream of bytes, and
that could be a problem.  (Then again, maybe it's not a problem in
some use cases.)

Another danger of this method is that different versions of git may
have slightly different versions of zlib that compress slightly
differently.  In that case, you'd (rather surprisingly) end up with
different output files depending which version of git you use to check
them out.  Maybe that's manageable, though.

>> In what way?  I doubt you'd get more efficient storage, at least.
>> Git's deltas are awfully hard to beat.
>
> Using the known structure of the file, you automatically identify the bits that
> are identical and you save the need to find a delta altogether.

bup avoids the need to find a delta altogether.  This isn't entirely a
good thing; it's a necessity because it processes huge amounts of data
and doing deltas across it all would be ungodly slow.

However, in all my tests (except with massively self-redundant files
like VMware images) deltas are at least somewhat smaller than bup
deduplication.  This isn't surprising, since deltas can eliminate
duplication on a byte-by-byte level, while bup chunks have a much
larger threshold (around 8k).

So I question the idea that this method would actually save any space
over git's existing deltas.  CPU time, yes, but only really during gc,
and you can run gc overnight while you're not waiting for it.

>> In that case, I'd like to see some comparisons of real numbers
>> (memory, disk usage, CPU usage) when storing your openoffice documents
>> (using the .gitattributes filter, of course).  I can't really imagine
>> how splitting the files into more pieces would really improve disk
>> space usage, at least.
>
> I'll try to isolate test cases, making test repos:
>
> a) with 1 odf file changing a little on each checkin
> b) the same storing the odf file with no compression with a suitable filter
> c) the same storing the tree inside the odf file.

This sounds like it would be quite interesting to see.  I would also
be interested in d) the test from (b) using bup instead of git.

You might also want to compare results with 'git gc' vs. 'git gc --aggressive'.

>> Having done some tests while writing bup, my experience has been that
>> chunking-without-deltas is great for these situations:
>> 1) you have the same data shared across *multiple* files (eg. the same
>> images in lots of openoffice documents with different filenames);
>> 2) you have the same data *repeated* in the same file at large
>> distances (so that gzip compression doesn't catch it; eg. VMware
>> images)
>> 3) your file is too big to work with the delta compressor (eg. VMware images).
>
> An aside: bup is great!!! Thanks!

Glad you like it :)

Have fun,

Avery


* Re: Multiblobs
  2010-04-28 15:12 Multiblobs Sergio Callegari
  2010-04-28 18:07 ` Multiblobs Avery Pennarun
  2010-04-28 18:34 ` Multiblobs Geert Bosch
@ 2010-04-29  6:55 ` Mike Hommey
  2010-05-06  6:26 ` Multiblobs Jeff King
  3 siblings, 0 replies; 21+ messages in thread
From: Mike Hommey @ 2010-04-29  6:55 UTC (permalink / raw)
  To: Sergio Callegari; +Cc: git

On Wed, Apr 28, 2010 at 03:12:07PM +0000, Sergio Callegari wrote:
> Hi,
> 
> I happened to read an older post by Jeff King about "multiblobs"
> (http://kerneltrap.org/mailarchive/git/2008/4/6/1360014) and I was wondering
> whether the idea has been abandoned for some reason or just put on hold.
> 
> Apparently, this would marvellously help with
> - storing large binary blobs (the split could happen with a rolling checksum
> approach)
> - storing "structured files", such as the many zip-based file formats
> (Opendocument, Docx, Jar files, zip files themselves), tars (including
> compressed tars), pdfs, etc, whose number is rising day after day...
> - storing binary files with textual tags, where the tags could go on a separate
> blob, greatly simplifying their readout without any need for caching them on a
> note tree.
> - etc...

This sounds very much like what I've had in mind for a while, but I
always thought that git as a VCS doesn't need that, and that it could be
a feature of a new program, for which the git object database would be a
special case. That is, a program using the git object database format
for individual objects and packs, but with additional object types.

Mike


* Re: Multiblobs
  2010-04-28 21:27     ` Multiblobs Avery Pennarun
  2010-04-28 23:10       ` Multiblobs Michael Witten
  2010-04-28 23:26       ` Multiblobs Sergio
@ 2010-04-29 11:34       ` Peter Krefting
  2010-04-29 15:28         ` Multiblobs Avery Pennarun
  2 siblings, 1 reply; 21+ messages in thread
From: Peter Krefting @ 2010-04-29 11:34 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Sergio Callegari, Git Mailing List

Avery Pennarun:

> But why not use a .gitattributes filter to recompress the zip/odp file 
> with no compression, as I suggested?  Then you can just dump the whole 
> thing into git directly.

The advantage would be that you could look at the history of the individual 
components of the zip/openoffice file and follow changes. When looking at 
the entire zip file (even if using no compression), it is still a compound 
file.

The few times I need to version control zip or openoffice files, I only need
to version control them *as* zipped files; I don't need the version control
to ensure that I get exactly the file out that I put in, just that it is
zipped at both ends. If Git could do that by unzipping and storing the
individual components itself, that would be great.

Or if someone could create a "zgit" that would allow me to version control 
such a file by internally unzipping it and storing it in git, and then 
zipping it up on checkout.

Support for merging files inside the zip file would also be a wonderful
feature, especially if the zip file holds mostly-text data.

-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: Multiblobs
  2010-04-29 11:34       ` Multiblobs Peter Krefting
@ 2010-04-29 15:28         ` Avery Pennarun
  2010-04-30  8:20           ` Multiblobs Peter Krefting
  0 siblings, 1 reply; 21+ messages in thread
From: Avery Pennarun @ 2010-04-29 15:28 UTC (permalink / raw)
  To: Peter Krefting; +Cc: Sergio Callegari, Git Mailing List

On Thu, Apr 29, 2010 at 7:34 AM, Peter Krefting <peter@softwolves.pp.se> wrote:
> Avery Pennarun:
>> But why not use a .gitattributes filter to recompress the zip/odp file
>> with no compression, as I suggested?  Then you can just dump the whole thing
>> into git directly.
>
> The advantage would be that you could look at the history of the individual
> components of the zip/openoffice file and follow changes.

This use case seems to be converging more and more on the
"clean/smudge filter like" idea, which might be ok.  But I think it
would be kind of messy if the git index/worktree shows only one
file but the actual object shows up as a tree.  What should
'git show HEAD:filename.odt' do?  How about 'git cat-file
HEAD:filename.odt'?  What if I *do* want to check out one of the
individual components?

It might be saner to just write some wrapper scripts on top of git,
and cleanly just check in the individual components.  Then just build
a Makefile and run something like 'make extract' before checkin (to
make sure all the .odp files/etc are broken into components) and 'make
assemble' after checkout.
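
A toy version of that pair of steps might look like the following (the
file layout and names are invented for the example, and real odt files
need their "mimetype" member stored first, which this ignores):

    import os, zipfile

    def extract(path):                      # 'make extract': explode foo.odt
        with zipfile.ZipFile(path) as z:    # into foo.odt.d/ before commit
            z.extractall(path + ".d")

    def assemble(path):                     # 'make assemble': zip foo.odt.d/
        outdir = path + ".d"                # back into foo.odt after checkout
        with zipfile.ZipFile(path, "w", zipfile.ZIP_STORED) as z:
            for root, _, files in os.walk(outdir):
                for f in sorted(files):
                    full = os.path.join(root, f)
                    z.write(full, os.path.relpath(full, outdir))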

Have fun,

Avery


* Re: Multiblobs
  2010-04-29 15:28         ` Multiblobs Avery Pennarun
@ 2010-04-30  8:20           ` Peter Krefting
  2010-04-30 17:26             ` Multiblobs Avery Pennarun
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Krefting @ 2010-04-30  8:20 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Sergio Callegari, Git Mailing List

Avery Pennarun:

(I seem to have been unsubscribed from the list, and can't subscribe again; 
please keep cc's to me for the time being).

> This use case seems to be converging more and more on the "clean/smudge 
> filter like" idea, which might be ok.

That's what I am using now (recompressing files), but that approach is a bit 
fragile (it suddenly broke on my Mac install, and it only works 
intermittently on Windows).

> It might be saner to just write some wrapper scripts on top of git, and 
> cleanly just check in the individual components.

Yeah, that was my thought too (thus the "zgit" idea).

-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: Multiblobs
  2010-04-28 19:13   ` Multiblobs Sergio Callegari
  2010-04-28 21:27     ` Multiblobs Avery Pennarun
@ 2010-04-30  9:14     ` Hervé Cauwelier
  2010-04-30 17:32       ` Multiblobs Avery Pennarun
  2010-04-30 18:16       ` Multiblobs Michael Witten
  1 sibling, 2 replies; 21+ messages in thread
From: Hervé Cauwelier @ 2010-04-30  9:14 UTC (permalink / raw)
  To: git

On 04/28/10 21:13, Sergio Callegari wrote:
> If you store a structured file as a multiblob, you can use a blob for each
> uncompressed element of content.  For instance, when storing an opendocument
> file you could use a blob for manifest.xml, one for content.xml, etc... (try
> unzip -l on an odt or odp file to get an idea). When you edit your file, only a
> few of these change. For instance, if we talk about a presentation, each slide
> has its own content.xml, so changing one slide changes only that.

I'll obviously let the Git experts answer you, but I can answer about 
OpenDocument itself.

In a presentation each slide is a <draw:page/> inside a single 
content.xml. So if you change one slide, the whole XML will serialize 
with a different SHA.

And maybe you'll add style to that slide, or probably OpenOffice.org 
will generate an automatic style, so styles.xml will also change. Adding 
an image also changes manifest.xml, along with storing the image itself. 
OOo will surely record the last slide displayed when closing the 
application, so settings.xml will change too.

So, all in all, for a single slide, 30 to 80 % of the Zip content may 
change.

Unless you are talking about a dedicated application to store and
generate on-the-fly office documents, built on top of Git, you're better
off not touching the contents the user is entrusting git to store, and
writing a .gitattributes rule not to compress them in a pack.

You may also be interested in the git-bigfiles project that was 
mentioned last week.

http://caca.zoy.org/wiki/git-bigfiles

-- 
Hervé Cauwelier - ITAAPY - 9 rue Darwin 75018 Paris
Tél. 01 42 23 67 45 - Fax 01 53 28 27 88
http://www.itaapy.com/ - http://www.cms-migration.com


* Re: Multiblobs
  2010-04-30  8:20           ` Multiblobs Peter Krefting
@ 2010-04-30 17:26             ` Avery Pennarun
  0 siblings, 0 replies; 21+ messages in thread
From: Avery Pennarun @ 2010-04-30 17:26 UTC (permalink / raw)
  To: Peter Krefting; +Cc: Sergio Callegari, Git Mailing List

On Fri, Apr 30, 2010 at 4:20 AM, Peter Krefting <peter@softwolves.pp.se> wrote:
> Avery Pennarun:
>> This use case seems to be converging more and more on the "clean/smudge
>> filter like" idea, which might be ok.
>
> That's what I am using now (recompressing files), but that approach is a bit
> fragile (it suddenly broke on my Mac install, and it only works
> intermittently on Windows).

In general, if you find that existing features have bugs, the correct
solution is not to add more buggy features, but to fix the ones that
already exist :)

Avery


* Re: Multiblobs
  2010-04-30  9:14     ` Multiblobs Hervé Cauwelier
@ 2010-04-30 17:32       ` Avery Pennarun
  2010-04-30 18:16       ` Multiblobs Michael Witten
  1 sibling, 0 replies; 21+ messages in thread
From: Avery Pennarun @ 2010-04-30 17:32 UTC (permalink / raw)
  To: Hervé Cauwelier; +Cc: git

2010/4/30 Hervé Cauwelier <herve@itaapy.com>:
> I'll obviously let the Git experts answer you, but I can answer about
> OpenDocument itself.
>
> In a presentation each slide is a <draw:page/> inside a single content.xml.
> So if you change one slide, the whole XML will serialize with a different
> SHA.
>
> And maybe you'll add style to that slide, or probably OpenOffice.org will
> generate an automatic style, so styles.xml will also change. Adding an image
> also changes manifest.xml, along with storing the image itself. OOo will
> surely record the last slide displayed when closing the application, so
> settings.xml will change too.
>
> So, all in all, for a single slide, 30 to 80 % of the Zip content may
> change.

Sure.  But if you name the chunks consistently, git's delta
compression can deal with tiny changes like those very easily.

The question is whether it'll work equally well, or better, or worse,
with a one-big-file format.  I think we won't know this without doing
some actual tests.

(Normally, you could assume that one-big-file is the most
space-efficient storage format, because then xdelta and gzip have the
most data to work with.  But if you have a lot of *duplicated* content
inside the same file, and the distance between duplications is outside
the gzip window, you could find that more unusual methods - like the
method used by bup - result in better compression.  I know this is
true for VM images, so it may be true for other things.  I haven't
tested everything :))

> You may also be interested in the git-bigfiles project that was mentioned
> last week.
>
> http://caca.zoy.org/wiki/git-bigfiles

git-bigfiles is a worthwhile project.  Its goal of "make life
bearable" is aiming kind of low, though.  Basically they seem to be
aiming simply to make git not die horribly when given lots of large
files.  This is commendable, but the resulting repo will be very space
inefficient when your large files change frequently in small ways.  So
I think it doesn't solve the problem Sergio brought up.

Have fun,

Avery


* Re: Multiblobs
  2010-04-30  9:14     ` Multiblobs Hervé Cauwelier
  2010-04-30 17:32       ` Multiblobs Avery Pennarun
@ 2010-04-30 18:16       ` Michael Witten
  2010-04-30 19:06         ` Multiblobs Hervé Cauwelier
  1 sibling, 1 reply; 21+ messages in thread
From: Michael Witten @ 2010-04-30 18:16 UTC (permalink / raw)
  To: Hervé Cauwelier; +Cc: git

2010/4/30 Hervé Cauwelier <herve@itaapy.com>:
>
> Unless you are talking about a dedicated application to store and generate
> on-the-fly office documents, built on top of Git, you're better off not
> touching the contents the user is entrusting git to store, and writing a
> .gitattributes rule not to compress them in a pack.

Doesn't OOo provide at least some library of official code for
handling such files, so that other programs might be able to
interoperate?

If so, then it would be almost trivial for an OpenDocument 'plugin' to
be 'built on top of Git'.

If not, then OOo is crap.


* Re: Multiblobs
  2010-04-30 18:16       ` Multiblobs Michael Witten
@ 2010-04-30 19:06         ` Hervé Cauwelier
  0 siblings, 0 replies; 21+ messages in thread
From: Hervé Cauwelier @ 2010-04-30 19:06 UTC (permalink / raw)
  To: Michael Witten; +Cc: git

On 04/30/10 20:16, Michael Witten wrote:
> 2010/4/30 Hervé Cauwelier<herve@itaapy.com>:
>>
>> Unless you are talking about a dedicated application to store and generate
>> on-the-fly office documents, built on top of Git, you're better not touching
>> the contents the user is entrusting git to store, and write a .gitattribute
>> not to compress them in a pack.
>
> Doesn't OOo provide at least some library of official code for
> handling such files, so that other programs might be able to
> interoperate?

I'm not sure what you mean but the only way to interoperate with OOo is 
to run it in "server mode" with at least a framebuffer xorg in the 
background. Then you connect a client and use their RPC/Corba-like API.

OpenDocument libraries all start from scratch, or at least from the RelaxNG
schema to generate validating code.

If the chunks are Zip parts, you're almost done. If you want smarter 
splitting logic like slides in a presentation, sheets in a spreadsheet, 
and pages... no, there is no page in a text; well, you need to go 
through the XML layer or, better, use an OpenDocument library that
abstracts it. Other parts in the Zip like styles and metadata are easier 
to split since they are basically a linear collection of objects.

> If so, then it would be almost trivial for an OpenDocument 'plugin' to
> be 'built on top of Git'.
>
> If not, then OOo is crap.

I already had reasons to conclude this. But hopefully OD is an open 
standard, not restricted to OOo.

-- 
Hervé Cauwelier - ITAAPY - 9 rue Darwin 75018 Paris
Tél. 01 42 23 67 45 - Fax 01 53 28 27 88
http://www.itaapy.com/ - http://www.cms-migration.com


* Re: Multiblobs
  2010-04-28 15:12 Multiblobs Sergio Callegari
                   ` (2 preceding siblings ...)
  2010-04-29  6:55 ` Multiblobs Mike Hommey
@ 2010-05-06  6:26 ` Jeff King
  2010-05-06 22:56   ` Multiblobs Sergio Callegari
  3 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2010-05-06  6:26 UTC (permalink / raw)
  To: Sergio Callegari; +Cc: git

On Wed, Apr 28, 2010 at 03:12:07PM +0000, Sergio Callegari wrote:

> I happened to read an older post by Jeff King about "multiblobs"
> (http://kerneltrap.org/mailarchive/git/2008/4/6/1360014) and I was wondering
> whether the idea has been abandoned for some reason or just put on hold.

I am a little late getting to this thread, and I agree with a lot of
what Avery said elsewhere, so I won't repeat what's been said. But after
reading my own message that you linked and the rest of this thread, I
wanted to note a few things.

One is that many of the applications for these multiblobs are extremely
varied, and many of them are vague and hand-waving. I think you really
have to look at each application individually to see how a solution
would fit. In my original email, I mentioned linear chunking of large
blobs for:

  1. faster inexact rename detection

  2. better diffs of binary files

I think (2) is now obsolete. Since that message, we now have textconv
filters, which allow simple and fast diffs of large objects (in my
example, I talked about exif tags on images. I now textconv the images
into a text representation of the exif tags and diff those). And with
textconv caching, we can do it on the fly without impacting how we
represent the object in git (we don't even have to pull the original
large blob out of storage at all, as the cache provides a look-aside
table keyed by the object name).
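
(The look-aside idea, very roughly: memoize the decoded text by object
id, so the large blob never has to be re-read or re-decoded. The helper
below is only a sketch, not git's implementation; the decode callable
stands in for whatever textconv driver you configure:

    import subprocess

    def textconv_cached(blob_id, decode, cache={}):
        """Return decode(blob contents), memoized by blob id."""
        if blob_id not in cache:
            raw = subprocess.run(["git", "cat-file", "blob", blob_id],
                                 capture_output=True, check=True).stdout
            cache[blob_id] = decode(raw)
        return cache[blob_id]

In git itself the cache is persisted inside the repository rather than
held in memory.)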

I also mentioned in that email that in theory we could diff individual
chunks even if we don't understand their semantic meaning. In practice,
I don't think this works. Most binary formats are going to involve not
just linear chunking, but decoding the binary chunks into some
human-readable form. So smart chunking isn't enough; you need a decoder,
which is what a textconv filter does.

For item (1), this is closely related to faster (and possibly better)
delta compression. I say only possibly better, because in theory our
delta algorithm should be finding something as simple as my example
already.

And for both of those cases, the upside is a speed increase, but the
downside is a breakage of the user-visible git model (i.e., blobs get
different sha1's depending on how they've been split). But being two
years wiser than when I wrote the original message, I don't think that
breakage is justified. Instead, you should retain the simple git object
model, and consider on-the-fly content-specific splits. In other words,
at rename (or delta) time notice that blob 123abc is a PDF, and that it
can be intelligently split into several chunks, and then look for other
files which share chunks with it. As a bonus, this sort of scheme is
very easy to cache, just as textconv is. You cache the smart-split of
the blob, which is immutable for some blob/split-scheme combination. And
then you can even do rename detection on large blob 123abc without even
retrieving it from storage.
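
(A sketch of the comparison step, assuming each blob's cached smart-split
is just a collection of chunk ids; the scoring is illustrative, not git's
actual rename machinery:

    def split_similarity(chunks_a, chunks_b):
        """Fraction of chunk ids shared by two cached smart-splits."""
        a, b = set(chunks_a), set(chunks_b)
        return len(a & b) / max(len(a | b), 1)

Candidates scoring above some threshold would be considered for rename or
delta pairing, without ever loading the blobs themselves.)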

Another benefit is that you still _store_ the original (you just don't
look at it as often). Which means there is no annoyance with perfectly
reconstructing a file. I had originally envisioned straight splitting,
with concatenation as the reverse operation. But I have seen things like
zip and tar files mentioned in this thread. They are quite challenging,
because it is difficult to reproduce them byte-for-byte. But if you take
the splitting out of the git data model, then that problem just goes
away.

The other application I saw in this thread is structured files where you
actually _want_ to see all of the innards as individual files (e.g.,
being able to do "git show HEAD:foo.zip/file.txt"). And for those, I
don't think any sort of automated chunking is really desirable. If you
want git to store and process those files individually, then you should
provide them to git individually. In other words, there is no need for
git to know or care at all that "foo.zip" exists, but you should simply
feed it a directory containing the files. The right place to do that
conversion is either totally outside of git, or at the edges of git
(i.e., git-add and when git places the file in the repository). Our
current hooks may not be sufficient, but that means those hooks should
be improved, which to me is much more favorable than a scheme that
alters the core of the git data model.

So no, reading my original message, I don't think it was a good idea. :)
The things people want to accomplish are reasonable goals, but there are
better ways to go about it.

-Peff


* Re: Multiblobs
  2010-05-06  6:26 ` Multiblobs Jeff King
@ 2010-05-06 22:56   ` Sergio Callegari
  2010-05-10  6:36     ` Multiblobs Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Sergio Callegari @ 2010-05-06 22:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Many thanks for the clear and evidently very well-thought-out answer.
I wonder if I can take another minute of your time (and Avery's, and that
of anybody else who is interested) to feed my curiosity a little more.
And I apologize in advance for possible mistakes in my understanding of
git internals.

Jeff King wrote:

> And for both of those cases, the upside is a speed increase, but the
> downside is a breakage of the user-visible git model (i.e., blobs get
> different sha1's depending on how they've been split).
Is this different from what happens with clean/smudge filters? I wonder 
what hash a cleanable object gets. The hash of its cleaned version
or its original hash? If it is the first case, the hash can change if 
the filter is used/not-used or slightly modified, so I wonder if an 
enhanced "clean" filter capable of splitting an object into a multiblob 
would be different in this sense. If it gets the original hash, again I 
wonder if an enhanced "clean" filter capable of splitting an object into 
a multiblob could not do the same.
>  But being two
> years wiser than when I wrote the original message, I don't think that
> breakage is justified. Instead, you should retain the simple git object
> model, and consider on-the-fly content-specific splits. In other words,
> at rename (or delta) time notice that blob 123abc is a PDF, and that it
> can be intelligently split into several chunks, and then look for other
> files which share chunks with it. As a bonus, this sort of scheme is
> very easy to cache, just as textconv is. You cache the smart-split of
> the blob, which is immutable for some blob/split-scheme combination. And
> then you can even do rename detection on large blob 123abc without even
> retrieving it from storage.
>   
Now I see why, for things like diffing, showing textual representations,
or rename detection, caching can be much more practical.
My initial list of "potential applications" was definitely too wide and 
vague.
> Another benefit is that you still _store_ the original (you just don't
> look at it as often). 
... but of course if you keep storing the original, I guess there is no 
advantage in storage efficiency.
> Which means there is no annoyance with perfectly
> reconstructing a file. I had originally envisioned straight splitting,
> with concatenation as the reverse operation. But I have seen things like
> zip and tar files mentioned in this thread. They are quite challenging,
> because it is difficult to reproduce them byte-for-byte.
I agree, but this is already being done. For instance on odf and zip 
files, by using clean filters capable of removing compression you can 
greatly improve the storage efficiency of the delta machinery included 
in git. And of course, to re-create the original file is potentially 
challenging. But most of the time, it does not really matter. For instance,
when I use this technique with odf files, I do not need to care if the 
smudge filter recreates the original file or not, the important thing is 
that it recreates a file that can then be cleaned to the same thing (and 
this makes me think that cleanable objects get the sha1 of the cleaned 
blob, see above).

In other words, we say all the time that git is about tracking
/content/. However, when you have a structured file and you want to
track its /content/, most of the time you are not interested at all in
the /envelope/ (e.g. the compression level of the odf/zip file): the content
is what is inside (typically a tree-structured thing). And maybe scms 
could be made better at tracking structured files, by providing an easy 
way to tell the scm how to discard the envelope.

In fact, having the hash of the structured file only depend on its real 
content (the inner tree or list of files/streams/whatever), seems to me 
to be completely respectful of the git model. This is why I originally 
thought that having enhanced filters enabling the storage of the
inner matter of a structured file as a multiblob could make sense.
> The other application I saw in this thread is structured files where you
> actually _want_ to see all of the innards as individual files (e.g.,
> being able to do "git show HEAD:foo.zip/file.txt"). And for those, I
> don't think any sort of automated chunking is really desirable. If you
> want git to store and process those files individually, then you should
> provide them to git individually. In other words, there is no need for
> git to know or care at all that "foo.zip" exists, but you should simply
> feed it a directory containing the files. The right place to do that
> conversion is either totally outside of git, or at the edges of git
> (i.e., git-add and when git places the file in the repository).
Originally, I thought of creating wrappers for some git commands.
However, things like "status" or "commit -a" appeared to me too
complicated to handle in a wrapper.
>  Our
> current hooks may not be sufficient, but that means those hooks should
> be improved, which to me is much more favorable than a scheme that
> alters the core of the git data model.
>   
Having a sufficient number of hooks could help a lot. However, if I
remember correctly, one of the reasons why the clean/smudge filters
were introduced was to avoid the need to implement similar
functionality with hooks.


Thanks in advance for the further explanations that might come!

Sergio


* Re: Multiblobs
  2010-05-06 22:56   ` Multiblobs Sergio Callegari
@ 2010-05-10  6:36     ` Jeff King
  2010-05-10 13:58       ` Multiblobs Sergio Callegari
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2010-05-10  6:36 UTC (permalink / raw)
  To: Sergio Callegari; +Cc: git

On Fri, May 07, 2010 at 12:56:59AM +0200, Sergio Callegari wrote:

> >And for both of those cases, the upside is a speed increase, but the
> >downside is a breakage of the user-visible git model (i.e., blobs get
> >different sha1's depending on how they've been split).
> Is this different from what happens with clean/smudge filters? I
> wonder what hash a cleanable object gets. The hash of its cleaned
> version or its original hash? If it is the first case, the hash can

It gets the cleaned version. The idea is that the sha1 in the repository
is the "official" version, and anything else is simply a representation
suitable for use on your platform.

So in that sense, clean/smudge filters are very visible. Splitting into
multiple blobs would mean that as far as git was concerned, your data
_is_ multiple blobs. And it would diff and merge them as separate
entities. That makes sense for something where that breakdown happens
along user-visible lines, and is useful to the user. For example,
automatically breaking down a tarfile into its constituent files might
be a more desirable representation for git to diff and merge (though the
current implementation of clean/smudge filters does not allow breaking
the file into multiple blobs).

But as I argued later in my email, I think that is not the right model
for performance-oriented multiblobs. Splitting a file at certain length
boundaries simply because it is large is going to be awkward when you
want to look at it as a whole item.

> >Another benefit is that you still _store_ the original (you just don't
> >look at it as often).
> ... but of course if you keep storing the original, I guess there is
> no advantage in storage efficiency.

Yes and no. If you are storing some set of N bytes, then you need to
store N bytes whether they are in a single blob or multiple blobs. The
only way that multiple blobs can improve on that is if you can find
better delta candidates by doing so.  Which means that you are just as
well off by splitting the large blob when looking for delta candidates
as you are in splitting it in storage.

> I agree, but this is already being done. For instance, on odf and zip
> files, by using clean filters capable of removing compression you can
> greatly improve the storage efficiency of the delta machinery
> included in git. And of course, re-creating the original file
> byte-for-byte is potentially challenging. But most of the time it
> does not really matter. For instance, when I use this technique with
> odf files, I do not need to care whether the smudge filter recreates
> the original file or not; the important thing is that it recreates a
> file that can then be cleaned to the same thing (and this makes me
> think that cleanable objects get the sha1 of the cleaned blob, see
> above).

Sure. And for those cases, I think clean/smudge filters are perhaps
already doing the job.

As an aside, I don't think that _git_ cares about pristine tars. It is
that people want to store compressed tarfiles in git that have a
particular checksum because they are interacting with some _other_
system that cares about the tarfile.  In your case, where you don't care
about the particular byte pattern of the odf file, it is much simpler.
So clean/smudge filters are even easier there.

> In other words, we keep underlining that git is about tracking
> /content/. However, when you have a structured file and you want to
> track its /content/, most of the time you are not interested at all
> in the /envelope/ (e.g. the compression level of the odf/zip file):
> the content is what is inside (typically a tree-structured thing).
> And maybe scms could be made better at tracking structured files, by
> providing an easy way to tell the scm how to discard the envelope.

Right. The question is how the structured contents are handled
internally by the SCM. Git's choice is to leave contents as opaque as
possible, and let you handle conversion at the boundaries: textconv (or
a custom external diff) for viewing diffs, and clean/smudge for working
tree files.

> In fact, having the hash of the structured file only depend on its
> real content (the inner tree or list of files/streams/whatever),
> seems to me to be completely respectful of the git model. This is why

Yes, and that is how it works with clean/smudge filters.

> I originally thought that having enhanced filters enabling the
> storage of the inner matter of a structured file as a multiblob
> could make sense.

I do think it makes sense, but only for some applications. But for those
applications, rather than a multiblob, I think creating a tree structure
is a natural fit, and works well with existing git tools. But again,
that isn't really implemented. Blobs must stay as blobs. So the closest
you can come is saying:

  - an ODF file may be a collection of structured text, but we will
    store it marshalled as a single binary data stream

  - we don't want it compressed for performance reasons, so we won't use
    the native marshalling format. Instead, we'll clean/smudge it as an
    uncompressed collection format inside of git (e.g., a zip without
    compression, or a tarball).

  - even though git doesn't understand the structure, we _do_ want to
    see the structure when doing diffs or merges. For that, we define
    custom diff/merge drivers which can operate on the file. They can
    unpack the structure as necessary.

which is really not too bad, and it means git can remain blissfully
unaware of the details of any format.
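
As a sketch of the last point (the driver name and helper script here
are made up, not an existing tool), a textconv helper for zip-based
files could be as small as:

  #!/usr/bin/env python3
  # Hypothetical textconv helper for zip-based files (odf, docx, jar...).
  # git diff hands us the path of a temporary file holding the blob; we
  # print a textual view of it, which git then compares line by line.
  import sys, zipfile

  with zipfile.ZipFile(sys.argv[1]) as z:
      for info in z.infolist():
          print("== %s (%d bytes)" % (info.filename, info.file_size))
          if info.filename.endswith(".xml"):
              # dump the XML so changes inside the container show up
              print(z.read(info).decode("utf-8", errors="replace"))

hooked up with "git config diff.odf.textconv zipview.py" and
"*.odt diff=odf" in .gitattributes (again, names assumed). A merge
driver (merge.<name>.driver) would need a separate three-way script,
but the principle is the same.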

> >provide them to git individually. In other words, there is no need for
> >git to know or care at all that "foo.zip" exists, but you should simply
> >feed it a directory containing the files. The right place to do that
> >conversion is either totally outside of git, or at the edges of git
> >(i.e., git-add and when git places the file in the repository).
> Originally, I thought of creating wrappers for some git commands.
> However, things like "status" or "commit -a" appeared to me too
> complicated to handle in a wrapper.

Yes, I would just do it manually. But in theory a clean/smudge filter
could be the right sort of place for that, if somebody made an
implementation that handles exploding a single file into an arbitrary
tree/blob hierarchy. I think it was discussed when filters were
introduced, but the complexity (both in terms of implementation, and
in meeting user expectations) prevented anyone from moving it forward.

-Peff


* Re: Multiblobs
  2010-05-10  6:36     ` Multiblobs Jeff King
@ 2010-05-10 13:58       ` Sergio Callegari
  0 siblings, 0 replies; 21+ messages in thread
From: Sergio Callegari @ 2010-05-10 13:58 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 10/05/2010 08:36, Jeff King wrote:
> Sure. And for those cases, I think clean/smudge filters are perhaps
> already doing the job.
>
>    
As a matter of fact, my idea was exactly to think of a multiblob as a 
git tree (maybe plus a signature).

With this, one can set up a "multiclean" filter, triggered by a
filename pattern (as for a normal filter) or by invoking "file" to look
at the inner magic.

When this filter is invoked, it should take the file to be cleaned on
stdin and output a tree on stdout, while (as a side effect) populating
the git object storage with the normal blobs pointed to by the tree.

In a complementary fashion, the multismudge filter should receive the
"multiblob" tree on stdin and output the smudged file on stdout,
inspecting the blobs pointed to by the tree to do the work.
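
Just to illustrate what the multiclean side could do with today's
plumbing (the script is hypothetical, and nothing in git currently
accepts a tree where a blob entry is expected), a rough sketch:

  #!/usr/bin/env python3
  # Hypothetical "multiclean" sketch: explode a zip read from stdin
  # into blobs using existing plumbing, build a tree out of them, and
  # print the resulting tree's sha1.
  import io, subprocess, sys, zipfile

  def write_blob(data):
      # store the member as a loose blob and return its sha1
      return subprocess.run(["git", "hash-object", "-w", "--stdin"],
                            input=data, capture_output=True,
                            check=True).stdout.decode().strip()

  def main():
      entries = []
      with zipfile.ZipFile(io.BytesIO(sys.stdin.buffer.read())) as z:
          for info in z.infolist():
              # keep the sketch flat: mktree takes a single tree level
              if info.is_dir() or "/" in info.filename:
                  continue
              sha = write_blob(z.read(info))
              entries.append("100644 blob %s\t%s" % (sha, info.filename))
      tree = subprocess.run(["git", "mktree"],
                            input=("\n".join(entries) + "\n").encode(),
                            capture_output=True,
                            check=True).stdout.decode().strip()
      print(tree)

  if __name__ == "__main__":
      main()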

This would require letting a tree appear where a blob entry normally
would, and (I guess) also some update to the git packing machinery, but
apart from that it should fit well with the current clean/smudge
approach and would not significantly alter the git model.

Sergio


end of thread

Thread overview: 21+ messages
2010-04-28 15:12 Multiblobs Sergio Callegari
2010-04-28 18:07 ` Multiblobs Avery Pennarun
2010-04-28 19:13   ` Multiblobs Sergio Callegari
2010-04-28 21:27     ` Multiblobs Avery Pennarun
2010-04-28 23:10       ` Multiblobs Michael Witten
2010-04-28 23:26       ` Multiblobs Sergio
2010-04-29  0:44         ` Multiblobs Avery Pennarun
2010-04-29 11:34       ` Multiblobs Peter Krefting
2010-04-29 15:28         ` Multiblobs Avery Pennarun
2010-04-30  8:20           ` Multiblobs Peter Krefting
2010-04-30 17:26             ` Multiblobs Avery Pennarun
2010-04-30  9:14     ` Multiblobs Hervé Cauwelier
2010-04-30 17:32       ` Multiblobs Avery Pennarun
2010-04-30 18:16       ` Multiblobs Michael Witten
2010-04-30 19:06         ` Multiblobs Hervé Cauwelier
2010-04-28 18:34 ` Multiblobs Geert Bosch
2010-04-29  6:55 ` Multiblobs Mike Hommey
2010-05-06  6:26 ` Multiblobs Jeff King
2010-05-06 22:56   ` Multiblobs Sergio Callegari
2010-05-10  6:36     ` Multiblobs Jeff King
2010-05-10 13:58       ` Multiblobs Sergio Callegari
