Darcs-git pulling from the Linux repo: a Linux VM question

* Darcs-git pulling from the Linux repo: a Linux VM question
@ 2005-04-27 13:10 Juliusz Chroboczek
  2005-04-27 15:31 ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Juliusz Chroboczek @ 2005-04-27 13:10 UTC (permalink / raw)
  To: darcs-devel, Git Mailing List

Hi,

If you are one of the few initiated who can tune the Linux VM, please
skip to the end of this mail and give me some advice.  If you are one
of the even fewer initiated who understand Darcs' memory usage, read
the whole of this message and send me a patch.  Otherwise, press D.

Now that I've got a Darcs that groks Git repos, I can play with a
fairly large tree -- the Linux 2.6 one.  All the experiments described
below were done on a 1.4 GHz Pentium-M with 640 MB of memory, running
Linux 2.6.9 (Debian branded) over Reiserfs.

All the commands that don't need to actually read the underlying blobs
are instantaneous; for example, ``darcs changes'' takes 0.4s.
Commands that require reading the blobs but allow discarding them
straight away are reasonable enough -- ``darcs changes -s'' on all but
the initial import takes a very reasonable 15s, while ``darcs changes -s''
including the initial import takes 2m30s real time (50s CPU time).

The trouble, of course, is with commands that need to read a full tree
and keep it in memory.  This is, unfortunately, the case with pull of
the initial commit, which is over 200MB in size.  Darcs' behaviour when
pulling this initial commit is as follows.

As I'm currently reading the git repository eagerly, Darcs starts by
reading the whole of the initial tree into memory; this takes roughly
2 minutes of real time (at less than 10% CPU), reads 18987 Git files
(blobs and trees), of which 18512 are unique (meaning that less than
500 were read two times or more -- yes, I should be keeping track of
the blobs I've already read).  When that is done, Darcs' VMEM usage is
beneath 300MB.
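
The "keeping track of the blobs I've already read" idea can be sketched
as a small cache keyed by object name.  This is only an illustration in
Python, not Darcs code: the BlobCache class and the read_object
callback are hypothetical names, and the callback stands in for
whatever actually reads an object from the Git store.

```python
from typing import Callable, Dict


class BlobCache:
    """Remember every object already read, keyed by its SHA-1 name."""

    def __init__(self, read_object: Callable[[str], bytes]) -> None:
        # read_object is a hypothetical low-level reader (name -> contents).
        self._read_object = read_object
        self._seen: Dict[str, bytes] = {}
        self.disk_reads = 0  # how many times the store was really consulted

    def get(self, sha1: str) -> bytes:
        # Only go to the store the first time a given name is requested.
        if sha1 not in self._seen:
            self._seen[sha1] = self._read_object(sha1)
            self.disk_reads += 1
        return self._seen[sha1]
```

With the numbers above, such a cache would perform 18512 reads instead
of 18987.  It saves I/O only; it does nothing for memory, since the
contents are still held once each.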

At that point, Darcs stops doing I/O, and starts trying to interpret
the data.  It runs between 80% and 100% of CPU, and grows steadily
until its VMEM reaches 550MB.  At that point, the system starts
swapping very lightly (no more than 200kB/s or so), and Darcs' VMEM
usage grows up to 720MB after 5 minutes CPU, 8 minutes real time.

When Darcs has grokked the fullness of the Linux kernel, it decides to
write out a patch.  So it starts touching all of its memory while
simultaneously writing out data to a patch file at a fairly sustained
rate.  It gets pretty close to the end -- over 200MB of patch are
written -- when suddenly the system appears to freeze for a second,
then the OOM killer triggers and kills the Darcs process.

Now obviously there is a problem with Darcs -- it shouldn't be needing
720MB of virtual memory just to grok a 250 MB import -- but there's
also a problem with the VM.  A 720 MB process should be reasonable on
a machine with 640 MB, and there's no apparent reason why the kernel
couldn't go more heavily into swap.  My completely uninformed guess
would be that the heavy I/O activity generated by Darcs in the final
stage causes a shortage of some resource (probably buffers) that is
essential for the VM to perform the swapping, and that the only way
the kernel sees to get itself out of the tight spot is to invoke the
OOM killer on the process that's causing the I/O activity.

So yes, in the longer term we need to fix Darcs.  For now, does anyone
know how I can tune the Linux VM to get a 720 MB process to run
reliably in 640 MB of main memory?  Obviously, adding swap or tuning
the overcommit policy doesn't help (the issue is precisely that the VM
refuses to dig into the swap early enough).  I don't understand what
``swappiness'' is, but it doesn't appear to help.  The
``min_free_kbytes'' and ``dirty_*'' knobs look promising, but nobody
seems to know what they mean.
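
For reference, these knobs live as plain files under /proc/sys/vm.
Roughly, as far as the kernel's sysctl documentation goes: swappiness
biases reclaim between swapping out process memory and dropping page
cache, min_free_kbytes is the reserve the kernel tries to keep free,
and dirty_ratio / dirty_background_ratio bound how much dirty data may
accumulate before writeback kicks in.  A minimal sketch for inspecting
the current values follows; read_vm_knobs is a hypothetical helper, and
the list of knobs is just the ones named here.

```python
from pathlib import Path
from typing import Dict, Optional

# Knob names as they appear under /proc/sys/vm on Linux.
VM_KNOBS = ["swappiness", "min_free_kbytes",
            "dirty_ratio", "dirty_background_ratio"]


def read_vm_knobs(root: str = "/proc/sys/vm") -> Dict[str, Optional[int]]:
    """Return the current value of each knob, or None if it cannot be read."""
    values: Dict[str, Optional[int]] = {}
    for name in VM_KNOBS:
        try:
            values[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing file (non-Linux, older kernel) or unexpected content.
            values[name] = None
    return values
```

This only reads the settings; changing them needs root and is a
separate matter.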

So what was it you said about self-tuning VM systems?

                                        Juliusz

* Re: Darcs-git pulling from the Linux repo: a Linux VM question
  2005-04-27 13:10 Darcs-git pulling from the Linux repo: a Linux VM question Juliusz Chroboczek
@ 2005-04-27 15:31 ` Linus Torvalds
  2005-04-27 15:54   ` Juliusz Chroboczek
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2005-04-27 15:31 UTC (permalink / raw)
  To: Juliusz Chroboczek; +Cc: Git Mailing List, darcs-devel

On Wed, 27 Apr 2005, Juliusz Chroboczek wrote:
> 
> So yes, in the longer term we need to fix Darcs.  For now, does anyone
> know how I can tune the Linux VM to get a 720 MB process to run
> reliably in 640 MB of main memory?

I really think you're screwed. The only way you have even a _chance_ of
getting it to work well is if you have very nice access patterns to
that 720MB, but my guess is that that simply isn't the case. You probably
read most of it in once (and write out changes once, but I hope you at
least notice the case of "nothing changed" so that probably is the smaller
of your problems), and the fact is, you're going to have absolutely
_horrible_ access patterns, since you'll end up not just with a 720MB
process that doesn't have much locality, you'll end up with another 720MB
that you needed to have in the page cache for the IO.

The only way I can see to fix it short-term is to try to use "mmap()"  
instead of "read()" to read the file data, and then try to avoid touching
the mapping unless you _have_ to. In other words: if you actually need to
_compare_ the data (which obviously reads from the mapping), you're
screwed.

Using mmap() will at least mean that the system can re-use the page cache 
pages, though, so it should improve memory pressure a bit.
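
The difference between the two calls can be sketched as follows.  This
is a minimal illustration using Python's standard mmap module rather
than the C calls themselves; read_copy and first_bytes_mapped are
hypothetical helper names, not anything from darcs or git.

```python
import mmap


def read_copy(path: str) -> bytes:
    """read(): the whole file is copied into a buffer owned by the process."""
    with open(path, "rb") as f:
        return f.read()


def first_bytes_mapped(path: str, n: int) -> bytes:
    """mmap(): map the file read-only and touch only its first n bytes.

    Pages that are never touched need not be brought in, and the pages
    that are touched are shared with the page cache, not duplicated.
    Note that mapping an empty file raises ValueError.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:n]
```

Slicing the mapping still copies the bytes that are asked for, so the
saving only holds as long as most of the mapping is left alone.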

> So what was it you said about self-tuning VM systems?

The kernel tries to tune itself in the sense that it automatically 
allocates the memory to user processes vs caching (page cache, directory 
caching etc) and tunes itself quite well that way.

But there's no way to tune for crappy access patterns and working sets
bigger than the amount of RAM. Sorry. You really need to fix darcs.

You _really_ shouldn't read in files that you don't absolutely need.  
That's really the biggest point of git: using the sha1 for naming the
objects is really all about "describe the contents using 20 bytes instead
of by reading the contents". Because reading the content _will_ be
expensive. Even if you have 2GB of memory and you can keep it all cached,
it will be horribly expensive.
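
As a rough sketch of what that buys: if each tree is represented as a
mapping from path to object name, the changed files fall out of
comparing names alone.  The dictionary representation and the
changed_paths helper are assumptions made for illustration; they are
not how git stores trees.

```python
from typing import Dict, List


def changed_paths(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Paths whose object name differs, found without opening any blob.

    A path present in only one tree counts as changed, since
    dict.get() returns None for the side that lacks it.
    """
    return sorted(p for p in old.keys() | new.keys()
                  if old.get(p) != new.get(p))
```

Only the blobs behind the returned paths ever need to be read.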

		Linus

* Re: Re: Darcs-git pulling from the Linux repo: a Linux VM question
  2005-04-27 15:31 ` Linus Torvalds
@ 2005-04-27 15:54   ` Juliusz Chroboczek
  2005-04-27 16:16     ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Juliusz Chroboczek @ 2005-04-27 15:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, darcs-devel

>> For now, does anyone know how I can tune the Linux VM to get a 720
>> MB process to run reliably in 640 MB of main memory?

> I really think you're screwed.

Thanks, that's what I needed to know.

> You _really_ shouldn't read in files that you don't absolutely need.

Ahem... you don't expect me to embark on hacking Git without at least
understanding that, do you?

> That's really the biggest point of git: using the sha1 for naming the
> objects is really all about "describe the contents using 20 bytes instead
> of by reading the contents".

Here we're speaking about the initial import.  Committed on 17 April
2005 by Linus Torvalds, with the comment ``Let it rip''.  220 MB of
changed files in a single commit.  2 minutes real time just to read
all the files, never mind doing anything useful with them.

To put it mildly, Darcs is not optimised for that sort of usage.

> Sorry.  You really need to fix darcs.

That's exactly why we're so interested in your repository.

                                        Juliusz

* Re: Re: Darcs-git pulling from the Linux repo: a Linux VM question
  2005-04-27 15:54   ` Juliusz Chroboczek
@ 2005-04-27 16:16     ` Linus Torvalds
  2005-04-28 11:39       ` [darcs-devel] " David Roundy
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2005-04-27 16:16 UTC (permalink / raw)
  To: Juliusz Chroboczek; +Cc: Git Mailing List, darcs-devel

On Wed, 27 Apr 2005, Juliusz Chroboczek wrote:
> 
> Here we're speaking about the initial import.  Committed on 17 April
> 2005 by Linus Torvalds, with the comment ``Let it rip''.  220 MB of
> changed files in a single commit.  2 minutes real time just to read
> all the files, never mind doing anything useful with them.

I think you may well want to consider the initial commit special. In many 
ways it is - it has no parents etc, so even apart from the fact that the 
initial commit obviously tends to be a lot bigger than any other commit, 
it actually fundamentally is _technically_ different too.

> To put it mildly, Darcs is not optimised for that sort of usage.

It shouldn't be. Make the initial one a special case, and import things 
file-by-file for that one special case.

Afterwards, you should be able to handle other commits as "diffs", and
then it's entirely reasonable to have the difference all in memory. If
somebody really does end up having a 220MB diff, and darcs sucks at it,
then at that point I don't think it's darcs' problem any more, it's the
project that you're trying to track that is doing something wrong.

So if you _just_ consider the initial git commit special (and it's easy to 
notice by just looking at the lack of parents), then you may not need to 
change darcs in the other cases.
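
Noticing the lack of parents could look something like the sketch
below.  It assumes the commit object has already been read and
decompressed into text whose header lines ("tree", "parent", "author",
"committer") come before a blank line; parent_ids and
is_initial_commit are hypothetical names.

```python
from typing import List


def parent_ids(commit_text: str) -> List[str]:
    """Collect the object names on the "parent" header lines."""
    # Headers end at the first blank line; the commit message follows.
    headers = commit_text.split("\n\n", 1)[0]
    return [line.split(" ", 1)[1]
            for line in headers.splitlines()
            if line.startswith("parent ")]


def is_initial_commit(commit_text: str) -> bool:
    """A commit with no parent lines is an initial (root) commit."""
    return not parent_ids(commit_text)
```

An importer could then branch on is_initial_commit() and take the
file-by-file path only for that one commit.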

And almost all SCMs consider the initial state a special case anyway. The
fact that GIT doesn't is just a result of the strange way of representing 
data, which doesn't care. I don't think you should emulate git in that 
respect.

		Linus

* Re: [darcs-devel] Re: Darcs-git pulling from the Linux repo: a Linux VM question
  2005-04-27 16:16     ` Linus Torvalds
@ 2005-04-28 11:39       ` David Roundy
  2005-04-28 15:36         ` Juliusz Chroboczek
  0 siblings, 1 reply; 7+ messages in thread
From: David Roundy @ 2005-04-28 11:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Juliusz Chroboczek, Git Mailing List, darcs-devel

On Wed, Apr 27, 2005 at 09:16:03AM -0700, Linus Torvalds wrote:
> On Wed, 27 Apr 2005, Juliusz Chroboczek wrote:
> > Here we're speaking about the initial import.  Committed on 17 April
> > 2005 by Linus Torvalds, with the comment ``Let it rip''.  220 MB of
> > changed files in a single commit.  2 minutes real time just to read
> > all the files, never mind doing anything useful with them.
> 
> I think you may well want to consider the initial commit special. In many 
> ways it is - it has no parents etc, so even apart from the fact that the 
> initial commit obviously tends to be a lot bigger than any other commit, 
> it actually fundamentally is _technically_ different too.

This has been discussed, and while I'm not opposed to special-casing the
initial commit, mostly we've taken the stance so far of not special-casing.
It's much nicer if we can make darcs efficient enough to perform the
initial commit without a special case, which has the nice side-effect of
also improving other cases.

When we're desperate, we'll special-case the initial commit, but currently
I'm sure we can pretty easily adjust things by making the git-tree-reading
lazy, which should pretty well address both the memory and speed
concerns--and also improve performance of other commands.  Perhaps more to
the point, it will also ensure that the same optimizations that work for
working with darcs repos will help when dealing with git repos.
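
As an analogy only (darcs is written in Haskell, where laziness works
differently), lazy tree reading can be sketched with a Python
generator: nothing is read until a consumer asks for the next blob,
and each blob can be dropped once it has been handled.  The names
lazy_blobs, total_size and read_object are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Tuple


def lazy_blobs(names: Iterable[str],
               read_object: Callable[[str], bytes]
               ) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, contents) pairs one at a time, reading on demand."""
    for name in names:
        yield name, read_object(name)


def total_size(blobs: Iterable[Tuple[str, bytes]]) -> int:
    """Example consumer: holds at most one blob's contents at a time."""
    return sum(len(data) for _, data in blobs)
```

The benefit disappears if a later stage collects everything into a
list, so the consumers have to be streaming as well.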
-- 
David Roundy
http://www.darcs.net

* Re: [darcs-devel] Re: Darcs-git pulling from the Linux repo: a Linux VM question
  2005-04-28 11:39       ` [darcs-devel] " David Roundy
@ 2005-04-28 15:36         ` Juliusz Chroboczek
  2005-04-29 11:25           ` David Roundy
  0 siblings, 1 reply; 7+ messages in thread
From: Juliusz Chroboczek @ 2005-04-28 15:36 UTC (permalink / raw)
  To: Git Mailing List, darcs-devel

> When we're desperate, we'll special-case the initial commit, but currently
> I'm sure we can pretty easily adjust things by making the git-tree-reading
> lazy,

Just to make it clear: reading the git tree is lazy.  The problem is
somewhere in the higher layers, probably in pull_cmd.

There's also another problem: reading the git tree takes 220MB.  Then
Darcs allocates a further 500MB without calling my code at all.  (Some
of it is doubtless due to linesPS, that should be more than a handful
of megabytes.)

                                        Juliusz

* Re: [darcs-devel] Re: Darcs-git pulling from the Linux repo: a Linux VM question
  2005-04-28 15:36         ` Juliusz Chroboczek
@ 2005-04-29 11:25           ` David Roundy
  0 siblings, 0 replies; 7+ messages in thread
From: David Roundy @ 2005-04-29 11:25 UTC (permalink / raw)
  To: Juliusz Chroboczek; +Cc: Git Mailing List, darcs-devel

On Thu, Apr 28, 2005 at 05:36:01PM +0200, Juliusz Chroboczek wrote:
> > When we're desperate, we'll special-case the initial commit, but
> > currently I'm sure we can pretty easily adjust things by making the
> > git-tree-reading lazy,
> 
> Just to make it clear: reading the git tree is lazy.  The problem is
> somewhere in the higher layers, probably in pull_cmd.

I guess really that's the issue.  Get itself is a special case that we've
already optimized for the initial get.  You'll also run into trouble using
pull to grab an entire plain old darcs repository with a large initial
commit.  We can (and should) also optimize pull, but it's not going to ever
be as efficient as a get is, for the case where you start with an empty
repository.

> There's also another problem: reading the git tree takes 220MB.  Then
> Darcs allocates a further 500MB without calling my code at all.  (Some
> of it is doubtless due to linesPS, that should be more than a handful
> of megabytes.)

The output of linesPS actually does take a huge amount of space.
-- 
David Roundy
http://www.darcs.net
