Re: Multiblobs - Avery Pennarun

From: Avery Pennarun <apenwarr@gmail.com>
To: Sergio Callegari <sergio.callegari@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Multiblobs
Date: Wed, 28 Apr 2010 14:07:02 -0400
Message-ID: <k2y32541b131004281107u6d15ed4ex54b5e5c138cc0e24@mail.gmail.com>
In-Reply-To: <loom.20100428T164432-954@post.gmane.org>

On Wed, Apr 28, 2010 at 11:12 AM, Sergio Callegari
<sergio.callegari@gmail.com> wrote:
> - storing "structured files", such as the many zip-based file formats
> (Opendocument, Docx, Jar files, zip files themselves), tars (including
> compressed tars), pdfs, etc, whose number is rising day after day...

I'm not sure it would help very much for these sorts of files.  The
problem is that compressed files tend to change a lot even if only a
few bytes of the original data have changed.
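
To make that concrete, here is a minimal Python sketch (not part of
git or bup) that changes a single byte of some made-up text and
compares the zlib output before and after.  The sample data and the
helper name differing_bytes are assumptions for illustration only.

```python
import zlib


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Made-up, highly repetitive text standing in for a document's contents.
original = b"".join(b"line %d: value %d\n" % (i, i * i % 97) for i in range(5000))
# Change exactly one byte of the uncompressed data.
edited = b"L" + original[1:]

packed_original = zlib.compress(original)
packed_edited = zlib.compress(edited)

print("raw bytes changed:       ", differing_bytes(original, edited))
print("compressed bytes changed:", differing_bytes(packed_original, packed_edited))
print("compressed size:         ", len(packed_original))
```

Comparing the two printed counts shows how far a one-byte edit spreads
once the data has been through a compressor.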

For things like opendocument, or uncompressed tars, you'd be better
off decompressing them (or recompressing with zip -0) using
.gitattributes.  Generally these files aren't *so* large that they
really need to be chunked; what you want to do is improve the deltas,
which decompressing will do.
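
As a rough sketch of the "recompress with zip -0" idea, the Python
function below rewrites a zip archive so that every member is stored
without compression.  The function name store_only is hypothetical,
and hooking it up as a filter through .gitattributes is left out; this
only shows the transformation itself.

```python
import io
import zipfile


def store_only(zip_bytes: bytes) -> bytes:
    """Return a copy of the archive with every member stored uncompressed."""
    src = zipfile.ZipFile(io.BytesIO(zip_bytes))
    out = io.BytesIO()
    with src, zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            # Build a fresh entry so stale sizes and offsets are not reused.
            fresh = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            fresh.external_attr = info.external_attr
            fresh.compress_type = zipfile.ZIP_STORED
            dst.writestr(fresh, src.read(info))
    return out.getvalue()
```

With the members stored rather than deflated, a small edit to one
member stays a small change in the archive bytes, which is what lets
the deltas work.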

> - storing binary files with textual tags, where the tags could go on a separate
> blob, greatly simplifying their readout without any need for caching them on a
> note tree.

That sounds complicated and error prone, and is suspiciously like
Apple's "resource forks," which even Apple has mostly realized were a
bad idea.

> - help the management of upstream trees. This could be simplified since the
> "pristine tree" distributed as a tar.gz file and the exploded repo could share
> their blobs making commands such as pristine-tree unnecessary.

Sharing the blobs of a tarball with a checked-out tree would require a
tar-specific chunking algorithm.  Not impossible, but a pain, and you
might have a hard time getting it accepted into git since it's
obviously not something you really need for a normal "source code"
tracking system.
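
To show what "tar-specific" would mean here, a small Python sketch
follows: it splits an uncompressed tar at member boundaries and
computes, for each regular file, the id git would give that content
as a blob (sha1 over "blob <size>\0" plus the data).  The helper names
are made up for this example, and none of this is an existing git
feature.

```python
import hashlib
import io
import tarfile


def git_blob_id(data: bytes) -> str:
    """Object id git assigns to a blob holding exactly these bytes."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def make_tar(files: dict) -> bytes:
    """Build an uncompressed tar in memory from {name: content}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def tar_member_blob_ids(tar_bytes: bytes) -> dict:
    """Split the tar at member boundaries and hash each regular file."""
    ids = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                ids[member.name] = git_blob_id(tar.extractfile(member).read())
    return ids
```

Each id produced this way is the same one the file would get when
added to a checked-out tree, which is the sharing being asked for.  A
generic chunker that knows nothing about tar headers would cut in
different places and miss it.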

> - help projects such as bup that currently need to provide split mechanisms of
> their own.

Since bup is so awesome that it will soon rule the world of
file-splitting backup systems, and bup already has a working
implementation, this reason by itself probably isn't enough to
integrate the feature into git.

> - be used to add "different representations" to objects... for instance, when
> storing a pdf one could use a fake split to store in a separate blob the
> corresponding text, making the git-diff of pdfs almost instantaneous.

Aie, files that have different content depending on how you look at
them?  You'll make a lot of enemies with such a patch :)

> From Jeff's post, I guess that the major issue could be that the same file could
> get a different sha1 as a multiblob versus a regular blob, but maybe it could be
> possible to make the multiblob take the same sha1 of the "equivalent plain blob"
> rather than its real hash.

I think that's actually not a very important problem.  Files that are
different will still always have differing sha1s, which is the
important part.  Files that are the same might not have the same sha1,
which is a bit weird, but it's unlikely that any algorithm in git
depends fundamentally on the fact that the sha1s match.

Storing files split into chunks is very useful for calculating diffs,
however: because you can walk through the tree of hashes and
short-circuit entire subtrees with identical sha1s, you can diff even
20GB files really rapidly.
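
A minimal Python sketch of that walk, under simplifying assumptions:
chunks are a fixed size (a real splitter such as bup's picks chunk
boundaries from the content instead), and the tree layout, CHUNK,
FANOUT and all names here are invented for the example.

```python
import hashlib

CHUNK = 4096   # assumed fixed chunk size, for simplicity only
FANOUT = 16    # assumed number of children per tree node


class Node:
    def __init__(self, digest, children, start, end):
        self.digest = digest      # sha1 of the chunk, or of the child digests
        self.children = children  # empty tuple for a leaf (one chunk)
        self.start = start        # byte range of the file this node covers
        self.end = end


def build_tree(data: bytes) -> Node:
    """Hash fixed-size chunks, then hash groups of hashes up to one root."""
    level = [
        Node(hashlib.sha1(data[i:i + CHUNK]).digest(), (), i, min(i + CHUNK, len(data)))
        for i in range(0, len(data), CHUNK)
    ] or [Node(hashlib.sha1(b"").digest(), (), 0, 0)]
    while len(level) > 1:
        groups = [level[i:i + FANOUT] for i in range(0, len(level), FANOUT)]
        level = [
            Node(hashlib.sha1(b"".join(c.digest for c in g)).digest(),
                 tuple(g), g[0].start, g[-1].end)
            for g in groups
        ]
    return level[0]


def diff_trees(a: Node, b: Node, stats: dict) -> list:
    """Return changed byte ranges, skipping subtrees whose hashes match."""
    stats["visited"] += 1
    if a.digest == b.digest:
        return []  # identical subtree: nothing below needs to be read
    if not a.children or not b.children or len(a.children) != len(b.children):
        return [(min(a.start, b.start), max(a.end, b.end))]
    changed = []
    for child_a, child_b in zip(a.children, b.children):
        changed.extend(diff_trees(child_a, child_b, stats))
    return changed
```

The number of nodes visited grows with the number of changed chunks
and the depth of the tree rather than with the size of the file, which
is where the speed comes from.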

> For the moment, I am just very curious about the idea and the possible pros and
> cons... can someone (maybe Jeff himself) tell me a little more? Also I wonder
> about the two possibilities (implement it in git vs implement it "on top of"
> git).

"on top of" git has one major advantage, which is that it's easy: for
example, bup already does it.  The disadvantage is that checking out
the resulting repository won't be smart enough to re-merge the data,
so you have a bunch of tiny chunk files you have to concatenate by
hand.

Implementing inside git could be done in one of two ways: add support
for a new 'multiblob' data type (which is really more like a tree
object, but gets checked out as a single file), or implement chunking
at the packfile level, so that higher-level tools never have to know
about multiblobs.

The latter would probably be easier and more backward-compatible,
but you'd probably lose the ability to do really fast diffs between
multiblobs, since diff happens at the higher level.

Overall, I'm not sure git would benefit much from supporting large
files in this way; at least not yet.  As soon as you supported this,
you'd start running into other problems... such as the fact that
shallow repos don't really work very well, and you obviously don't
want to clone every single copy of a 100MB file just so you can edit
the most recent version.  So you might want to make sure shallow repos
/ sparse checkouts are fully up to speed first.

Have fun,

Avery
