From: Linus Torvalds <torvalds@osdl.org>
To: Junio C Hamano <junkio@cox.net>
Cc: git@vger.kernel.org
Subject: Re: If I were redoing git from scratch...
Date: Sat, 4 Nov 2006 08:44:02 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0611040829040.25218@g5.osdl.org>
In-Reply-To: <7vpsc3xx65.fsf@assigned-by-dhcp.cox.net>



On Sat, 4 Nov 2006, Junio C Hamano wrote:
> 
> The biggest one is that we use too many static (worse, function
> scope static) variables that live for the life of the process,
> which makes many things very nice and easy ("run-once and let
> exit clean up the mess" mentality), but because of this it
> becomes awkward to do certain things.  Examples are:
> 
>  - Multiple invocations of merge-bases (needs clearing the
>    marks left on commit objects by earlier traversal),

Well, quite frankly, I dare anybody to do it differently, yet have good 
performance with millions of objects.

The fact is, I don't think it _can_ be done. I would seriously suggest 
re-visiting this in five years, just because CPUs and memory will by then 
hopefully have gotten an order of magnitude faster/bigger.

The thing is, the object database when we read it in really needs to be 
pretty compact-sized, and we need to remember objects we've seen earlier 
(exactly _because_ we depend on the flags). So there are exactly two 
alternatives:
 - global life-time allocations of objects like we do now
 - magic memory management with unknown lifetimes and keeping track of all 
   pointers.
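
To make the first alternative concrete, here is a minimal sketch in C of a
process-lifetime object table. It is not git's actual code: the names
`struct object`, `obj_table`, `slot_of` and `lookup_or_create`, the fixed
table size, and the 20-byte name length are all assumptions made for
illustration. The point it shows is that a lookup by name always hands back
the same object, so any flags set earlier are still there, and that nothing
is ever freed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HASH_LEN   20    /* assumed object-name length (SHA-1 sized) */
#define TABLE_SIZE 1024  /* fixed for the sketch; must be a power of two */

/* Hypothetical, simplified object: a name, some flag bits, a chain link. */
struct object {
	unsigned char name[HASH_LEN];
	unsigned flags;          /* traversal marks live here */
	struct object *next;     /* next object in the same hash chain */
};

/* Global table: every object ever seen stays reachable from here. */
static struct object *obj_table[TABLE_SIZE];

/* Object names are already hashes, so the first bytes make a fine index. */
static unsigned slot_of(const unsigned char *name)
{
	unsigned h;

	memcpy(&h, name, sizeof(h));
	return h & (TABLE_SIZE - 1);
}

/*
 * Return the one and only in-memory object for this name, allocating it
 * on first sight. There is deliberately no matching free(): the object
 * lives until the process exits.
 */
struct object *lookup_or_create(const unsigned char *name)
{
	unsigned slot;
	struct object *obj;

	assert(name != NULL);
	slot = slot_of(name);

	for (obj = obj_table[slot]; obj; obj = obj->next)
		if (!memcmp(obj->name, name, HASH_LEN))
			return obj;

	obj = calloc(1, sizeof(*obj));
	if (!obj)
		abort();
	memcpy(obj->name, name, HASH_LEN);
	obj->next = obj_table[slot];
	obj_table[slot] = obj;
	return obj;
}
```

Because every object hangs off `obj_table`, a tracing collector started from
that root would find all of them live, which is the "nothing to collect"
situation described in the next paragraphs.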

And I'd like to point out that the second kind of memory management is 
simply not realistic right now:

 - it's too damn hard. A simple garbage collector based on the approach we 
   have now would simply not be able to do anything, since all objects are 
   _by_definition_ reachable from the hash chains, so there's nothing to 
   collect. The lifetime of an object fundamentally _is_ the whole process 
   lifetime, exactly because we expect the objects (and the object flags 
   in particular) to be meaningful.

 - pretty much all garbage collection schemes tend to have a memory 
   footprint that is about twice what a static footprint is under any 
   normal load. Think about what we already do with "git pack-objects" for 
   something like the mozilla repository: I worked quite a lot on getting 
   the memory footprint down, and it's _still_ several hundred MB. 

In other words, I can pretty much guarantee that some kind of "smarter" 
memory management would be a huge step backwards. Yes, we now have to do 
some things explicitly, but exactly because we do them explicitly we can 
_afford_ to have the stupid and simple and VERY EFFICIENT memory 
management ("lack of memory management") that we have now.

The memory use of git had a very real correlation with performance when I 
was doing the memory shrinking a few months back (back in June). I realize 
that it's perhaps awkward, but I would really want people to realize that 
it's a huge performance issue. It was a clear performance issue for me 
(and I use machines with 2GB of RAM, so I was never swapping); it would be 
an even bigger one for anybody where the size meant that you needed to 
start doing paging.

So I would seriously ask you not to even consider changing the object 
model. Maybe add a few more helper routines to clear all object flags or 
something, but the "every object is global and will never be de-allocated" 
is really a major deal.
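
A "clear all object flags" helper of the kind suggested above could look
roughly like the following C sketch. Everything in it is hypothetical: the
`all_objects` list, `register_object`, `clear_object_flags` and the two flag
bits are stand-ins chosen for illustration, not git's real data structures or
API. It assumes only that every allocated object can be reached from one
global place, which the "never de-allocated" model guarantees.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits; the names and values are assumptions. */
#define SEEN          (1u << 0)
#define UNINTERESTING (1u << 1)

/* Hypothetical, simplified object with just the parts needed here. */
struct object {
	unsigned flags;
	struct object *next;   /* links every object ever registered */
};

/* Global list of all objects; entries are never removed. */
static struct object *all_objects;

/* Add an object to the global list with all of its flags cleared. */
void register_object(struct object *obj)
{
	assert(obj != NULL);
	obj->flags = 0;
	obj->next = all_objects;
	all_objects = obj;
}

/*
 * Clear the given flag bits on every known object, so that a second
 * traversal (for example another merge-base computation) starts clean.
 * Bits outside the mask are left untouched.
 */
void clear_object_flags(unsigned mask)
{
	struct object *obj;

	for (obj = all_objects; obj; obj = obj->next)
		obj->flags &= ~mask;
}
```

A caller would run one traversal, call `clear_object_flags(SEEN)` (or
whatever marks that traversal used), and then run the next one. The cost is a
single linear pass over the objects, with no change to how they are
allocated.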

Five years from now, or for somebody who re-implements git in Java (where 
performance isn't going to be the major issue anyway, and you probably do 
"small" things like "commit" and "diff", and never do full-database things 
like "git repack"), _then_ you can happily look at having something 
fancier. Right now, it's too easy to just look at cumbersome interfaces, 
and forget about the fact that those interfaces are sometimes what allow 
us to practically do some things in the first place.

