Re: Git Large Object Support Proposal

From: Junio C Hamano <gitster@pobox.com>
To: david@lang.hm
Cc: Scott Chacon <schacon@gmail.com>, git list <git@vger.kernel.org>
Subject: Re: Git Large Object Support Proposal
Date: Thu, 19 Mar 2009 17:11:22 -0700
Message-ID: <7vtz5p59zp.fsf@gitster.siamese.dyndns.org>
In-Reply-To: <alpine.DEB.1.10.0903191650160.16753@asgard.lang.hm> (david@lang.hm's message of "Thu, 19 Mar 2009 16:52:19 -0700 (PDT)")

david@lang.hm writes:

> On Thu, 19 Mar 2009, Junio C Hamano wrote:
>
>> Scott Chacon <schacon@gmail.com> writes:
>>
>>> The point is that we don't keep this data as 'blob's - we don't try to
>>> compress them or add the header to them, they're too big and already
>>> compressed, it's a waste of time and often outside the memory
>>> tolerance of many systems. We keep only the stub in our db and stream
>>> the large media content directly to and from disk.  If we do a
>>> 'checkout' or something that would switch it out, we could store the
>>> data in '.git/media' or the equivalent until it's uploaded elsewhere.
>>
>> Aha, that sounds like you can just maintain a set of out-of-tree symbolic
>> links that you keep track of, and let other people (e.g. rsync) deal with
>> the complexity of managing that side of the world.
>>
>> And I think you can start experimenting with it without any change to the core
>> datastructures.  In your single-page web site in which its sole html file
>> embeds an mpeg movie, you keep track of these two things in git:
>>
>>        porn-of-the-day.html
>>        porn-of-the-day.mpg -> ../media/6066f5ae75ec.mpg
>>
>> and any time you want to feed a new movie, you update the symlink to a
>> different one that lives outside the source-controlled tree, while
>> arranging the link target to be updated out-of-band.
>
> that would work, but the proposed change has some advantages
>
> 1. you store the sha1 of the real mpg in the 'large file' blob so you
> can detect problems

You store the unique identifier of the real mpg in the symbolic link
target, which is a blob payload, so you can detect problems already.  I
deliberately said "unique identifier"; you seem to think saying SHA-1
brings something magical, but I do not think it even needs to be the
blob's SHA-1.  Hashing that much data is costly.

In any case, you can have a script (or client-side hook) that does:

    (1) find the out-of-tree symlinks in the index (or in the work tree);

    (2) if a link is dangling, and if you have a definition of where to
        get that hierarchy from (e.g. ../media), run rsync or wget or
        whatever external means to grab it.

and call it after "git pull" updates from some other place.  The "git
media" of Scott's message could be an alias to such a command.
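To make the two steps above concrete, here is a minimal sketch of what
such a script might look like.  It only walks the work tree (not the
index), and the names dangling_external_links, fetch_missing and
media_source are made up for illustration; the assumption that the
remote side stores each file under the same base name as the link
target is mine as well.

```python
import os
import subprocess
from pathlib import Path


def dangling_external_links(worktree):
    """Yield (link, target) for symlinks that point outside worktree and dangle."""
    root = Path(worktree).resolve()
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own metadata directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in dirnames + filenames:
            link = Path(dirpath) / name
            if not link.is_symlink():
                continue
            # Interpret the link target relative to the link's own directory.
            target = Path(os.path.normpath(os.path.join(dirpath, os.readlink(link))))
            outside = target != root and root not in target.parents
            if outside and not target.exists():
                yield link, target


def fetch_missing(worktree, media_source):
    """Fetch every missing link target from media_source using rsync.

    Assumption: media_source (e.g. "host:/srv/media") holds each file
    under the same base name as the symlink target.
    """
    for _link, target in dangling_external_links(worktree):
        target.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["rsync", "-a", f"{media_source}/{target.name}", str(target)],
            check=True,
        )
```

Something like fetch_missing(".", "host:/srv/media") could then be run
from a post-merge hook or behind an alias; swapping rsync for wget or
any other transport only changes the subprocess call.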

Adding a new type "external-blob" would be an unwelcome pain.  Reusing
"blob" so that the existing "blob" codepath now needs to notice a special
"0" that is not length "0" is an even bigger pain than that.

And that is a pain for unknown benefit, especially when you can start
experimenting without any changes to the existing data structure.  In the
worst case, the experiment may not pan out as well as you hoped and if
that is the end of the story, so be it.  It is not a great loss.  If it
works well enough and we can have the external large media support without
any changes to the data structure, that would be really great.  If it
sort-of works but hits a limitation, we can analyze how best to overcome
that limitation, and at that time it _might_ turn out to be the best
approach to introduce a new blob type.

But I do not think we know that yet.

In the longer run, as you speculated in your message, I think the native
blob codepaths need to be updated to tolerate large, unmappable objects
better.  With that goal in mind, I think it is a huge mistake to
prematurely introduce arbitrarily distinct "blob" and "large blob" types,
if in the end they need to be merged back again; it would force the future
code indefinitely to care about the historical "large blob" types that
were once supported.

> 2. since it knows the sha1 of the real file, it can auto-create the
> real file as needed, without wasting space on too many copies of it.

Hmm, since when is SHA-1 reversible?
