Re: Errors cloning large repo

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Anton Tropashko <atropashko@yahoo.com>
Cc: git@vger.kernel.org
Subject: Re: Errors cloning large repo
Date: Mon, 12 Mar 2007 11:40:38 -0700 (PDT)
Message-ID: <Pine.LNX.4.64.0703121057530.9690@woody.linux-foundation.org>
In-Reply-To: <315943.12751.qm@web52606.mail.yahoo.com>

On Mon, 12 Mar 2007, Anton Tropashko wrote:
> 
> > It's very likely this did fit in just under 4 GiB of packed data,
> > but as you said, without O_LARGEFILE we can't work with it.
> 
> .git is 3.5GB according to du -H :)

Ok, that's good.  That means that we really can use git without any major 
issues, and that apparently it's only receive-pack that has 
problems.

I didn't even realize that we have

	#define _FILE_OFFSET_BITS 64

in the header file, but not only is that a glibc-specific thing, it also 
won't really even cover all issues.
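
As a minimal sketch of what that define does (assuming a glibc-style 
libc; the helper names off_t_bits and off_t_holds_4gib are made up for 
illustration and are not from git), the macro has to come before any 
system header, and it then widens off_t for that translation unit:

```c
#define _FILE_OFFSET_BITS 64	/* must come before any system header */

#include <limits.h>
#include <sys/types.h>

/* Width of off_t, in bits, as seen by this translation unit. */
int off_t_bits(void)
{
	return (int)(sizeof(off_t) * CHAR_BIT);
}

/* Nonzero if off_t can represent offsets of 4 GiB and beyond. */
int off_t_holds_4gib(void)
{
	return sizeof(off_t) >= 8;
}
```

On 32-bit x86 with glibc, dropping the define would normally leave off_t 
at 32 bits. Platforms where off_t is always 64 bits simply ignore it.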

For example, if a file is opened by the shell (ie we're talking shell 
redirection etc), then the program that was built with 
_FILE_OFFSET_BITS wasn't the one that opened it. The file was opened 
without O_LARGEFILE, and as such a write() will hit the LFS 31-bit 
(2 GiB) limit.
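
For contrast with the inherited-descriptor case, here is a rough sketch 
of a program that opens the file itself after setting the define, so the 
descriptor is not subject to the 2 GiB limit. It assumes a POSIX system 
with mkstemp() and a writable /tmp; the function name seek_past_2gib and 
the scratch path are hypothetical and not taken from git:

```c
#define _FILE_OFFSET_BITS 64	/* must come before any system header */
#define _POSIX_C_SOURCE 200809L	/* expose mkstemp() under strict -std modes */

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a scratch file, seek to 3 GiB (beyond what a signed 32-bit
 * off_t can express) and return the resulting offset, or -1 on error.
 * Nothing is written, so no disk space is consumed.
 */
off_t seek_past_2gib(void)
{
	char path[] = "/tmp/lfs-demo-XXXXXX";	/* hypothetical scratch path */
	off_t target = (off_t)3 << 30;
	off_t pos;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);	/* the file goes away once fd is closed */
	pos = lseek(fd, target, SEEK_SET);
	close(fd);
	return pos;
}
```

A descriptor handed over by the shell gets no such treatment: how it was 
opened is decided by whoever called open(), not by the defines in the 
program that later writes to it.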

That said, I'm not quite seeing why the _FILE_OFFSET_BITS trick doesn't 
help. We don't have any shell redirection in that path.

I just did an "strace -f" on a git clone on x86, and all the git open() 
calls seemed to use O_LARGEFILE, but that's with a very recent git.

I think you said that you had git-1.4.1 on the client, and I think that 
the _FILE_OFFSET_BITS=64 hack went in after that, and if your client just 
upgrades to the current 1.5.x release, it will all "just work" for you.

> Just curious why won't you use something like 
> PostgreSQL for data storage at this point, but, then
> I know nothing about git internals :)

I can pretty much guarantee that if we used a "real" database, we'd have

 - really really horrendously bad performance
 - total inability to actually recover from errors.

Other SCM projects have used databases, and it *always* boils down to that. 
Most either die off, or decide to just do their own homegrown database (eg 
switching to FSFS for SVN).

Even database people seem to have figured it out lately: relational 
databases are starting to lose ground to specialized ones. These days you 
can google for something like

	relational specialized database performance

and you'll see real papers that are actually finally being taken seriously 
about how specialized databases often have performance-advantages of 
orders of magnitude. There's a paper (the above will find it, but if you 
add "one size fits all" you'll probably find it even better) that talks 
about benchmarking specialized databases against RDBMS, and they are 
*literally* talking about three and four *orders*of*magnitude* speedups 
(ie not factors of 2 or three, but factors of _seven_hundred_).

In other words, the whole relational database hype is so seventies and 
eighties. People have since figured out that yeah, they are convenient to 
program in if you want to do Visual Basic kind of things, but they really 
are *not* a replacement for good data structures.

So git has ended up writing its own data structures, but git is a lot 
better for it.

		Linus
