From: Linus Torvalds <torvalds@linux-foundation.org>
To: Anton Tropashko <atropashko@yahoo.com>
Cc: git@vger.kernel.org
Subject: Re: Errors cloning large repo
Date: Fri, 9 Mar 2007 13:37:18 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0703091312530.10832@woody.linux-foundation.org>
In-Reply-To: <284107.69764.qm@web52601.mail.yahoo.com>
On Fri, 9 Mar 2007, Anton Tropashko wrote:
>
> I managed to stuff 8.5 GB worth of files into a git repo (in two git commits, since
> it was running out of memory when I gave it the -a option)
Heh. Your usage scenario may not be one where git is useful. If a single
commit generates that much data, git will likely perform horribly badly.
But it's an interesting test-case, and I don't think anybody has really
*tried* this before, so don't give up yet.
First off, you shouldn't really need two commits. It's true that "git
commit -a" will probably have memory usage issues (because a single "git
add" will keep it all in memory while it generates the objects), but it
should be possible to just use "git add" to add even 8.5GB worth of data
in a few chunks, and then a single "git commit" should commit it.
So you might be able to just do
	git add dir1
	git add dir2
	git add dir3
	..
	git commit
or something.
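A minimal sketch of that chunked approach, for anyone who wants to script it. It assumes the data lives in top-level directories of the working tree. The names add_in_chunks and GIT are made up for illustration and are not part of git. Files sitting directly in the top-level directory are not staged by this loop and would need their own "git add".

```shell
#!/bin/sh
# Stage each top-level directory with its own "git add", then make one commit.
# GIT is a hypothetical override used for dry runs; it defaults to the real git.
add_in_chunks() {
    msg=$1
    for dir in */; do
        # Skip the unexpanded pattern when there are no directories at all.
        [ -d "$dir" ] || continue
        # One "git add" per directory keeps each staging step small.
        ${GIT:-git} add -- "$dir" || return 1
    done
    # A single commit records everything staged above.
    ${GIT:-git} commit -m "$msg"
}
```

Setting GIT=echo before calling add_in_chunks prints the commands instead of running them, which is a cheap way to check the order before touching a multi-gigabyte tree.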
But one caveat: git may not be the right tool for the job. May I inquire
what the heck you're doing? We may be able to fix git even for your kinds
of usage, but it's also possible that
 (a) git may not suit your needs
 (b) you might be better off using git differently
Especially when it comes to that "(b)" case, please realize that git is
somewhat different from something like CVS at a very fundamental level.
CVS in many ways can more easily track *humongous* projects, for one very
simple reason: CVS really deep down just tracks individual files.
So people who have used CVS may get used to the notion of putting
everything in one big repository, because in the end, it's just a ton of
small files to CVS. CVS never really looks at the big picture - even doing
something like merging or doing a full checkout is really just iterating
over all the individual files.
So if you put a million files in a CVS repository, it's just going to
basically loop over those million files, but they are still just
individual files. There's never any operation that works on *all* of the
files at once.
Git really is *fundamentally* different here. Git takes completely the
opposite approach, and git never tracks individual files at all at any
level, really. Git almost doesn't care about file boundaries (I say
"almost", because obviously git knows about them, and they are visible in
myriads of ways, but at the same time it's not entirely untrue to say that
git really doesn't care).
So git scales in a very different way from CVS. Many things are tons
faster (because git does many operations on a full directory structure at a
time, and that makes merges that only touch a few subdirectories *much*
faster), but on the other hand, it means that git will consider everything
to be *related* in a way that CVS never does.
So, for example, if your 8.5GB thing is something like your whole home
directory, putting it as one git archive now ties everything together and
that can cause issues that really aren't very nice. Tying everything
together is very important in a software project (the "total state" is
what matters), but in your home directory, many things are simply totally
independent, and tying them together can be the wrong thing to do.
So I'm not saying that git won't work for you, I'm just warning that the
whole model of operation may or may not actually match what you want to
do. Do you really want to track that 8.5GB as *one* entity?
> but when I'm cloning to another linux box I get:
>
> Generating pack...
> Done counting 152200 objects.
> Deltifying 152200 objects.
.. this is the part that makes me think git *should* be able to work for you.
Having lots of smallish files is much better for git than a few DVD
images, for example. And if those 152200 objects are just from two
commits, you obviously have lots of files ;)
However, if it packs really badly (and without any history, that's quite
likely), maybe the resulting pack-file is bigger than 4GB, and then you'd
have trouble (in fact, I think you'd hit trouble at the 2GB pack-file
mark).
Does "git repack -a -d" work for you?
> /usr/bin/git-clone: line 321: 2072 File size limit exceeded git-fetch-pack --all -k $quiet "$repo"
"File size limit exceeded" sounds like SIGXFSZ, which is either:
 - you have file limits enabled, and the resulting pack-file was just too
   big for the limits.
 - the file size is bigger than MAX_NON_LFS (2GB-1), and we don't use
   O_LARGEFILE.
I suspect the second case. Shawn and Nico have worked on 64-bit packfile
indexing, so they may have a patch / git tree for you to try out.
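To tell the two cases apart, something like the following may help. It is only a sketch: it prints the 2GB-1 boundary, shows the file size limit of the current shell, and defines a helper that triggers the first case on purpose with a tiny limit. The name try_fsize_limit is hypothetical, and the block size that "ulimit -f" counts in differs between shells, so the limit of 8 blocks is just "a few KB".

```shell
#!/bin/sh
# MAX_NON_LFS: the largest offset a signed 32-bit file offset can hold (2GB-1).
max_non_lfs=$(( (1 << 31) - 1 ))
echo "MAX_NON_LFS = $max_non_lfs bytes"

# First case: print the file size limit of the current shell.
# "unlimited" means a file limit is not what stopped the clone.
echo "file size limit: $(ulimit -f)"

# Reproduce the first case deliberately. The limit is set inside a subshell
# so it does not stick to the calling shell. The argument is a scratch file.
try_fsize_limit() {
    (
        ulimit -f 8 || exit 2      # 8 blocks: a few KB at most
        # Try to write 64KB; the write that crosses the limit fails.
        dd if=/dev/zero of="$1" bs=1024 count=64 2>/dev/null
    )
}
```

Calling try_fsize_limit on a scratch file should return a non-zero status and leave a file no larger than the limit. If "ulimit -f" already reports "unlimited" on the machine doing the clone, the 2GB-1 boundary is the more likely culprit.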
Linus