From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Lawrence Brett <lcbrett@gmail.com>, git@vger.kernel.org
Subject: Re: git for game development?
Date: Wed, 24 Aug 2011 14:26:32 -0400
Message-ID: <20110824182632.GA22659@sigill.intra.peff.net>
In-Reply-To: <7vwre2pw3m.fsf@alter.siamese.dyndns.org>
On Wed, Aug 24, 2011 at 10:17:49AM -0700, Junio C Hamano wrote:
> I suspect that people with large binary assets were scared away by rumors
> they heard second-hand, based on bad experiences other people had before
> any of the recent efforts made in various "large Git" topics, and they
> themselves haven't tried recent versions of Git enough to be able to tell
> what the remaining pain points are. I wouldn't be surprised if none of the
> core Git people tried shoving huge binary assets in test repositories with
> recent versions of Git---I certainly haven't.
I haven't tried anything really big in a while. My personal interest in
big file support has been:
1. Mid-sized photos and videos (objects top out around 50M, total repo
   size is 4G packed). Most commits are additions or tweaks of exif
   tags (so they delta well). Using gitattributes (and especially
   textconv caching), it's really quite pleasant to use; there's a
   config sketch after this list. Doing a full repack is my only
   complaint; the delta-compression isn't bad, but just the I/O on
   rewriting the whole thing is a killer.
2. Storing an entire audio collection in flac. Median file size is
   only around 20M, but the whole repo is 120G. Obviously compression
   doesn't buy much, so a git repo plus checkout is 240G, which is
   pretty hefty for most laptops. I played with this early on, but
   gave up; the data storage model just doesn't make sense.
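For the curious, the textconv setup in (1) looks roughly like this;
the "exif" driver name and the choice of exiftool are mine, nothing
built into git:

    # .gitattributes
    *.jpg diff=exif

    # turn each photo into diffable text, and cache the result so
    # that each blob goes through the filter only once
    git config diff.exif.textconv exiftool
    git config diff.exif.cachetextconv true

After that, "git log -p" over the photo tree shows metadata changes as
ordinary text diffs, and the cache keeps repeated diffs cheap.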
The two common use cases that aren't represented here are:
3. Big files, not just big repos. I.e., files that are 1G or more.

4. Medium-big files that don't delta well (e.g., metadata tweaks
   delta well, but wholesale rewrites of media assets for a game
   don't).
I think recent changes (like putting big files straight to packs) make
(3) and (4) reasonably pleasant.
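The knob behind that, for reference, is core.bigFileThreshold: blobs
over the threshold are streamed straight into a pack without any delta
attempt. E.g., for a repo full of 100M+ assets you could lower it:

    git config core.bigFileThreshold 100m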
I'm not sure of the right answer for (1). The repack is the only
annoying thing. But never repacking isn't satisfying either. You don't
get deltas where they are applicable, and the server is always
re-examining the pack for possible deltas on fetch and push. Some sort
of hybrid loose-pack storage would be nice: store delta chains for big
files in their own individual packs, but keep everything else in a
separate pack. We would want some kind of meta-index over all of these
little pack-files, not just individual pack-file indices.
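You can approximate the "leave the big pack alone" half of that today
with .keep files; a full repack skips any pack marked as kept (the
pack name below is just a placeholder):

    # keep repack from rewriting the pack with the big delta chains
    touch .git/objects/pack/pack-deadbeef.keep
    git repack -a -d

It's manual, and it doesn't give you the meta-index, but it does avoid
rewriting the bulk of the data on every repack.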
But (2) is the hardest one. It would be nice if we had some kind of
local-remote hybrid storage, where objects were fetched on demand from
somewhere else. For example, developers on workstations with a fast
local network to a storage server wouldn't have to replicate all of the
objects locally. And for a true distributed setup, when the fast network
isn't there, it would be nice to fail gracefully (which maybe just means
saying "sorry, we can't do 'log -p' right now; try 'log --raw'").
I wonder how close one can get on (2) using alternates and a
network-mounted filesystem.
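I imagine the experiment would look something like this (the mount
point is made up):

    # point the local repo at an object store on the network mount
    echo /mnt/media/audio.git/objects >>.git/objects/info/alternates

The catch is that when the mount goes away, every object lookup fails
hard, which is exactly the graceful-degradation problem above.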
> People toyed around with ideas to have a separate object store
> representation for large and possibly incompressible blobs (a possible
> complaint being that it is pointless to send them even to its own
> packfile). One possible implementation would be to add a new huge
> hierarchy under $GIT_DIR/objects/, compute the object name exactly the
> same way for huge blobs as we normally would (i.e. hash concatenation of
> object header and then contents) to decide which subdirectory under the
> "huge" hierarchy to store the data (huge/[0-9a-f]{2}/[0-9a-f]{38}/ like we
> do for loose objects, or perhaps huge/[0-9a-f]{40}/ expecting that there
> won't be very many). The data can be stored unmodified as a file in that
> directory, with type stored in a separate file---that way, we won't have
> to compress, but we just copy. You still need to hash it at least once to
> come up with the object name, but that is what gives us integrity checks,
> is unavoidable and is not going to change.
Yeah. I think one of the bonuses there is that some filesystems are
capable of referencing the same inodes in a copy-on-write way, so "add"
and "checkout" cease to be a copy operation, but rather an inode-linking
operation. Which is a big win, both for speed and storage.
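On btrfs, for example, you can already get that effect by hand:

    # share the underlying extents; a write to either file triggers a
    # copy-on-write split rather than touching the other copy
    cp --reflink=always big-asset.bin working-copy.bin

If checking out one of those "huge" blobs were such a reflink, the
working tree copy would be nearly free.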
I've had dreams of using hard-linking to do something similar, but it's
just not safe enough without some filesystem-level copy-on-write
protection.
-Peff