Re: Multiblobs - Jeff King

From: Jeff King <peff@peff.net>
To: Sergio Callegari <sergio.callegari@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Multiblobs
Date: Mon, 10 May 2010 02:36:18 -0400
Message-ID: <20100510063618.GD13340@coredump.intra.peff.net>
In-Reply-To: <4BE3493B.8010409@gmail.com>

On Fri, May 07, 2010 at 12:56:59AM +0200, Sergio Callegari wrote:

> >And for both of those cases, the upside is a speed increase, but the
> >downside is a breakage of the user-visible git model (i.e., blobs get
> >different sha1's depending on how they've been split).
> Is this different from what happens with clean/smudge filters? I
> wonder what hash does a cleanable object get. The hash of its cleaned
> version or its original hash? If it is the first case, the hash can

It gets the cleaned version. The idea is that the sha1 in the repository
is the "official" version, and anything else is simply a representation
suitable for use on your platform.
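
To make that concrete, here is a rough Python sketch (not git's actual
code). git_blob_id() follows git's blob naming, i.e. the sha1 of a
"blob <size>\0" header followed by the content. The clean() function is
a hypothetical stand-in for a real clean filter; it assumes the working
tree file is nothing more than a zlib stream. Two working tree files
that differ only in compression level end up with the same object name,
because only the cleaned bytes are hashed.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Return the sha1 object name git gives a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def clean(worktree_bytes: bytes) -> bytes:
    """Hypothetical clean filter: strip the zlib 'envelope' before storage."""
    return zlib.decompress(worktree_bytes)


payload = b"hello\n"

# Two working tree representations of the same payload.
worktree_fast = zlib.compress(payload, 1)
worktree_best = zlib.compress(payload, 9)

# Only the cleaned content is hashed, so both map to one object name.
id_fast = git_blob_id(clean(worktree_fast))
id_best = git_blob_id(clean(worktree_best))
print(id_fast == id_best == git_blob_id(payload))
```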

So in that sense, clean/smudge filters are very visible. Splitting into
multiple blobs would mean that as far as git was concerned, your data
_is_ multiple blobs. And it would diff and merge them as separate
entities. That makes sense for something where that breakdown happens
along user-visible lines, and is useful to the user. For example,
automatically breaking down a tarfile into its constituent files might
be a more desirable representation for git to diff and merge (though the
current implementation of clean/smudge filters does not allow breaking
the file into multiple blobs).

But as I argued later in my email, I think that is not the right model
for performance-oriented multiblobs. Splitting a file at certain length
boundaries simply because it is large is going to be awkward when you
want to look at it as a whole item.

> >Another benefit is that you still _store_ the original (you just don't
> >look at it as often).
> ... but of course if you keep storing the original, I guess there is
> no advantage in storage efficiency.

Yes and no. If you are storing some set of N bytes, then you need to
store N bytes whether they are in a single blob or multiple blobs. The
only way that multiple blobs can improve on that is if you can find
better delta candidates by doing so.  Which means that you are just as
well off by splitting the large blob when looking for delta candidates
as you are in splitting it in storage.

> I agree, but this is already being done. For instance on odf and zip
> files, by using clean filters capable of removing compression you can
> greatly improve the storage efficiency of the delta machinery
> included in git. And of course, to re-create the original file is
> potentially challenging. But most time, it does not really matter.
> For instance, when I use this technique with odf files, I do not need
> to care if the smudge filter recreates the original file or not, the
> important thing is that it recreates a file that can then be cleaned
> to the same thing (and this makes me think that cleanable objects get
> the sha1 of the cleaned blob, see above).

Sure. And for those cases, I think clean/smudge filters are perhaps
already doing the job.

As an aside, I don't think that _git_ cares about pristine tars. It is
that people want to store compressed tarfiles in git that have a
particular checksum because they are interacting with some _other_
system that cares about the tarfile.  In your case, where you don't care
about the particular byte pattern of the odf file, it is much simpler.
So clean/smudge filters are even easier there.

> In other terms, all the time we underline that git is about tracking
> /content/. However, when you have a structured file, and you want to
> track its /content/, most time you are not interested at all at the
> /envelope/ (e.g. the compression level of the odf/zip file): the
> content is what is inside (typically a tree-structured thing). And
> maybe scms could be made better at tracking structured files, by
> providing an easy way to tell the scm how to discard the envelope.

Right. The question is how the structured contents are handled
internally by the SCM. Git's choice is to leave contents as opaque as
possible, and let you handle conversion at the boundaries: textconv (or
a custom external diff) for viewing diffs, and clean/smudge for working
tree files.
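
As an illustration of the textconv side, here is a minimal Python
sketch of a converter that turns a zip-style container into plain text
suitable for a line-based diff. The function name zip_to_text and the
"=== name ===" header format are made up for this example. A real
textconv helper is a program that git runs with the file's path as its
argument and whose standard output is diffed; a small command-line
wrapper around this function would be needed for that, and is left out
here.

```python
import io
import zipfile


def zip_to_text(data: bytes) -> str:
    """Render a zip container as text: a header per member, then its content.

    Members are emitted in sorted order so that reordering inside the
    container does not show up as a diff.
    """
    out = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in sorted(zf.namelist()):
            out.append(f"=== {name} ===")
            # Binary members are decoded leniently; this is only for viewing.
            out.append(zf.read(name).decode("utf-8", errors="replace"))
    return "\n".join(out) + "\n"


# Build a tiny container in memory and show its textual form.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("content.xml", "<doc>hello</doc>")
    zf.writestr("meta.xml", "<meta/>")
print(zip_to_text(demo.getvalue()), end="")
```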

> In fact, having the hash of the structured file only depend on its
> real content (the inner tree or list of files/streams/whatever),
> seems to me to be completely respectful of the git model. This is why

Yes, and that is how it works with clean/smudge filters.

> I originally thought that having enhanced filters enabling the
> storage of the inner matter of a structured file as a multiblob
> could make sense.

I do think it makes sense, but only for some applications. But for those
applications, rather than a multiblob, I think creating a tree structure
is a natural fit, and works well with existing git tools. But again,
that isn't really implemented. Blobs must stay as blobs. So the closest
you can come is saying:

  - an ODF file may be a collection of structured text, but we will
    store it marshalled as a single binary data stream

  - we don't want it compressed for performance reasons, so we won't use
    the native marshalling format. Instead, we'll clean/smudge it as an
    uncompressed collection format inside of git (e.g., a zip without
    compression, or a tarball).

  - even though git doesn't understand the structure, we _do_ want to
    see the structure when doing diffs or merges. For that, we define
    custom diff/merge drivers which can operate on the file. They can
    unpack the structure as necessary.

which is really not too bad, and it means git can remain blissfully
unaware of the details of any format.
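
The second point above could be sketched roughly like this in Python.
This is an assumption-laden illustration, not an existing tool: clean()
and smudge() are hypothetical names, and the sketch only handles plain
zip containers. clean() rewrites the archive with every member stored
uncompressed, which is the form git would keep and delta against;
smudge() rewrites it with deflate for the working tree. Member names,
order, and timestamps are carried over so that cleaning is repeatable.

```python
import io
import zipfile


def _rewrite(data: bytes, compression: int) -> bytes:
    """Copy every member of a zip into a new zip using one compression type."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", compression=compression) as dst:
        for info in src.infolist():
            # Keep name and timestamp; drop everything else from the envelope.
            member = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            member.compress_type = compression
            dst.writestr(member, src.read(info.filename))
    return out.getvalue()


def clean(worktree_bytes: bytes) -> bytes:
    """Repository form: members stored without compression."""
    return _rewrite(worktree_bytes, zipfile.ZIP_STORED)


def smudge(repo_bytes: bytes) -> bytes:
    """Working tree form: members deflated again."""
    return _rewrite(repo_bytes, zipfile.ZIP_DEFLATED)
```

An actual filter would read the file from standard input and write the
converted bytes to standard output, and the smudged file need not match
the original byte for byte; it only has to clean back to the same
thing.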

> >provide them to git individually. In other words, there is no need for
> >git to know or care at all that "foo.zip" exists, but you should simply
> >feed it a directory containing the files. The right place to do that
> >conversion is either totally outside of git, or at the edges of git
> >(i.e., git-add and when git places the file in the repository).
> Originally, I thought of creating wrappers for some git commands.
> However, things like "status" or "commit -a" appeared to me quite
> complicated to be done in a wrapper.

Yes, I would just do it manually. But in theory a clean/smudge filter
could be the right sort of place for that, if somebody made an
implementation that handled exploding a single file into an arbitrary
tree/blob hierarchy. I think it was discussed when filters were
introduced, but the complexity (both in terms of implementation, and
in meeting user expectations) prevented anyone from moving it forward.

-Peff
