From: Adam Heath <doogie@brainfood.com>
To: "Marcel M. Cary" <marcel@oak.homeunix.org>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: large(25G) repository in git
Date: Thu, 26 Mar 2009 11:35:17 -0500
Message-ID: <49CBAEC5.6070606@brainfood.com>
In-Reply-To: <49CBA2AB.30304@oak.homeunix.org>
Marcel M. Cary wrote:
> My company manages code in a similar way, except we avoid this kind of
> issue (with 100 gigabytes of user-uploaded images and other data) by not
> checking in the data. We even went so far as to halve the size of
> our repository by removing 2GB of non-user-supplied images -- rounded
> corners, background gradients, logos, etc, etc. This made Git
> noticeably faster.
Disk space is cheap.
> While I'd love to be able to handle your kind of use case and data size
> with Git in that way, it's a little beyond the intended usage to handle
> hundreds of gigabytes of binary data, I think.
>
> I imagine as your web site grows, which I'm assuming is your goal, your
> problems with scaling Git will continue to be a challenge.
>
> Maybe you can find a way to:
>
> * Get along with less data in your non-production environments; we're
> hoping to be able to do this eventually
We do that by only cloning/checking out certain modules.
However, as is always the case, sometimes a bug occurs with production
data, and you need to use the real data to track it down.
> * Find other ways to copy it; we use rsync even though it does take
> forever to crawl over the file system
>
> * Put your data files in a separate Git repository, at least, assuming
> you check in, update, and release code more often than your video files.
> That way you'll experience pain less often, and maybe even be able to
> tune your repository differently.
As already mentioned, our sub-sites *are* in separate repos. There's
a base repository that has just the event/backend code. Then there are
32 *other* repositories, where the actual websites are.
We want to use *some* kind of versioning system. Being able to have
history of *all* changes is extremely useful. Not to mention being
able to track what each separate user does as they modify their files
thru their browser.
subversion is just right out. It's centralized. It leaves poop all
over the place.
mercurial is just right out. If you do several *separate* commits of
*separate* files, but don't push for some time period, then eventually
do a push/pull, where the sum total of the changes is larger than some
value, mercurial will fail when it tries to then update the local
directory. This limit is based on 2G, a hard-coded Python limit (even
on a 64-bit host), because mercurial reads the entire set of changes
into a Python string.
git mmaps files and scans the pack files in windows. It *might*
read a single file all into memory, for compression purposes; I'm not
certain on this. We certainly haven't hit any limits that cause it to
fail outright.
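To illustrate the difference, here is a minimal Python sketch of
reading a file one mmap window at a time, rather than pulling the
whole thing into a single string. This is not git's code (git is
written in C); the names checksum_windowed and WINDOW_SIZE are made
up for this example, and the 1 MiB window is an arbitrary assumption.

```python
import mmap
import os
import zlib

# Hypothetical window size, chosen only for illustration.
WINDOW_SIZE = 1024 * 1024


def checksum_windowed(path, window_size=WINDOW_SIZE):
    """Compute a CRC32 over a file while mapping only one window at a time."""
    # mmap offsets must be multiples of the allocation granularity,
    # so round the window size down to such a multiple.
    gran = mmap.ALLOCATIONGRANULARITY
    window_size = max(gran, (window_size // gran) * gran)

    size = os.path.getsize(path)
    crc = 0
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            length = min(window_size, size - offset)
            # Only `length` bytes are mapped here, never the whole file.
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as window:
                crc = zlib.crc32(window, crc)
            offset += length
    return crc
```

With this approach the amount mapped at any moment is bounded by the
window size, not by the size of the file, so the limit described
above for a single in-memory string would not apply.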
I haven't tried any others.