linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, dkegel@google.com,
	David Miller <davem@davemloft.net>,
	Daniel Phillips <phillips@google.com>
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)
Date: Mon, 20 Aug 2007 05:51:34 +0200	[thread overview]
Message-ID: <1187581894.6114.169.camel@twins> (raw)
In-Reply-To: <20070816032921.GA32197@wotan.suse.de>

[-- Attachment #1: Type: text/plain, Size: 6269 bytes --]

On Thu, 2007-08-16 at 05:29 +0200, Nick Piggin wrote:
> On Wed, Aug 15, 2007 at 03:12:06PM +0200, Peter Zijlstra wrote:
> > On Wed, 2007-08-15 at 14:22 +0200, Nick Piggin wrote:
> > > On Tue, Aug 14, 2007 at 07:21:03AM -0700, Christoph Lameter wrote:
> > > > The following patchset implements recursive reclaim. Recursive reclaim
> > > > is necessary if we run out of memory in the writeout patch from reclaim.
> > > > 
> > > > This is f.e. important for stacked filesystems or anything that does
> > > > complicated processing in the writeout path.
> > > 
> > > Filesystems (most of them) that require compilcated allocations at
> > > writeout time suck. That said, especially with network ones, it
> > > seems like making them preallocate or reserve required memory isn't
> > > progressing very smoothly.
> > 
> > Mainly because we seem to go in circles :-(
> > 
> > >  I think these patchsets are definitely
> > > worth considering as an alternative. 
> > 
> > Honestly, I don't. They very much do not solve the problem, they just
> > displace it.
> 
> Well perhaps it doesn't work for networked swap, because dirty accounting
> doesn't work the same way with anonymous memory... but for _filesystems_,
> right?
> 
> I mean, it intuitively seems like a good idea to terminate the recursive
> allocation problem with an attempt to reclaim clean pages rather than
> immediately let them have-at our memory reserve that is used for other
> things as well. 

I'm concerned about the worst case scenarios, and those don't change.
The proposed changes can be seen as an optimisation of various things,
but they do not change the fundamental issues.

> Any and all writepage() via reclaim is allowed to eat
> into all of memory (I hate that writepage() ever has to use any memory,
> and have prototyped how to fix that for simple block based filesystems
> in fsblock, but others will require it).
> 
> 
> > Christoph's suggestion to set min_free_kbytes to 20% is ridiculous - nor
> > does it solve all deadlocks :-(
> 
> Well of course it doesn't, but it is a pragmatic way to reduce some
> memory depletion cases. I don't see too much harm in it (although I didn't
> see the 20% suggestion?)

Sure, and on that note I don't object to them, they might be quite
useful at times. It just doesn't help the worst case scenarios.

> > > No substantial comments though. 
> > 
> > Please do ponder the problem and its proposed solutions, because I'm
> > going crazy here.
>  
> Well yeah I think you simply have to reserve a minimum amount of memory in
> order to reclaim a page, and I don't see any other way to do it other than
> what you describe to be _technically_ deadlock free.

Right, and I guess I have to go at it again, this time ensuring not to
touch the fast-path nor sacrificing anything NUMA for simplicity in the
reclaim path.

(I think its a good thing to be technically deadlock free - and if your
work on the fault path rewrite and buffered write rework shows anything
it is that you seem to agree with this)

> But firstly, you don't _want_ to start dropping packets when you hit a tough
> patch in reclaim -- even if you are strictly deadlock free. And secondly,
> I think recursive reclaim could reduce the deadlocks in practice which is
> not a bad thing as your patches aren't merged.

Non of the people who have actually used these patches seem to object to
the dropping packets thing. Nor do I see that as a real problem,
networks are assumed lossy - also if you really need that traffic for a
RT app that also runs on the machine you need networked swap on (odd
combination but hey, it should be possible) then I can make that work as
well with a little bit more effort.

Also, I'm a very reluctant to accept a known deadlock, esp. since the
changes needed are not _that_ complex.

> How are your deadlock patches going anyway? AFAIK they are mostly a network
> issue and I haven't been keeping up with them for a while. 

They really do rely on some VM interaction too, network does not have
enough information to break out of the deadlock on its own.

As for how its going, it seems to work quite reliably in my test setup -
that is, I can shut down the NFS server, swamp the client in network
traffic for hours (yes it will quickly stop userspace) and then restart
the NFS server and the client will reconnect and resume operation.

There are also a few people running various versions of my patches in
production environments. One university is running it on a 500-node
cluster and another on ~500 thin-clients and there is someone using it
in a blade product.

> Do you really need
> networked swap and actually encounter the deadlock, or is it just a question of
> wanting to fix the bugs? If the former, what for, may I ask?

Yes (we - not I personally) want networked swap. There is quite the
demand for it in the marked. It allows clusters and blades to be build
without any storage - which not only saves on the initial cost of a hard
drive [1] but also on maintenance but more importantly on energy cost
and heat production.

[1] a single drive is not that expensive, but when you're talking about
a 1000 node cluster things tend to add up.

> > <> What Christoph is proposing is doing recursive reclaim and not
> > initiating writeout. This will only work _IFF_ there are clean pages
> > about. Which in the general case need not be true (memory might be
> > packed with anonymous pages - consider an MPI cluster doing computation
> > stuff). So this gets us a workload dependant solution - which IMHO is
> > bad!
> 
> Although you will quite likely have at least a couple of MB worth of
> clean program text. The important part of recursive reclaim is that it
> doesn't so easily allow reclaim to blow all memory reserves (including
> interrupt context). Sure you still have theoretical deadlocks, but if
> I understand correctly, they are going to be lessened. I would be
> really interested to see if even just these recursive reclaim patches
> eliminate the problem in practice.

were we much bothered by the buffered write deadlock? - why accept a
known deadlock if a solid solution is quite attainable?


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

  parent reply	other threads:[~2007-08-20  3:51 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-08-14 14:21 [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC) Christoph Lameter
2007-08-14 14:21 ` [RFC 1/3] Allow reclaim via __GFP_NOMEMALLOC reclaim Christoph Lameter
2007-08-14 14:21 ` [RFC 2/3] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set Christoph Lameter
2007-08-14 14:21 ` [RFC 3/3] Test code for PF_MEMALLOC reclaim Christoph Lameter
2007-08-14 14:36 ` [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC) Peter Zijlstra
2007-08-14 15:29   ` Christoph Lameter
2007-08-14 19:32     ` Peter Zijlstra
2007-08-14 19:41       ` Christoph Lameter
2007-08-15 12:22 ` Nick Piggin
2007-08-15 13:12   ` Peter Zijlstra
2007-08-15 14:15     ` Andi Kleen
2007-08-15 13:55       ` Peter Zijlstra
2007-08-15 14:34         ` Andi Kleen
2007-08-15 20:32         ` Christoph Lameter
2007-08-15 20:29     ` Christoph Lameter
2007-08-16  3:29     ` Nick Piggin
2007-08-16 20:27       ` Christoph Lameter
2007-08-20  3:51       ` Peter Zijlstra [this message]
2007-08-20 19:15         ` Christoph Lameter
2007-08-21  0:32           ` Nick Piggin
2007-08-21  0:28         ` Nick Piggin
2007-08-21 15:29           ` Peter Zijlstra
2007-08-23  3:02             ` Nick Piggin
2007-09-12 22:39           ` Christoph Lameter
2007-09-05  9:20 ` Daniel Phillips
2007-09-05 10:42   ` Christoph Lameter
2007-09-05 11:42     ` Nick Piggin
2007-09-05 12:14       ` Christoph Lameter
2007-09-05 12:19         ` Nick Piggin
2007-09-10 19:29           ` Christoph Lameter
2007-09-10 19:37             ` Peter Zijlstra
2007-09-10 19:41               ` Christoph Lameter
2007-09-10 19:55                 ` Peter Zijlstra
2007-09-10 20:17                   ` Christoph Lameter
2007-09-10 20:48                     ` Peter Zijlstra
2007-09-11  7:41             ` Nick Piggin
2007-09-12 10:52         ` Peter Zijlstra
2007-09-12 22:47           ` Christoph Lameter
2007-09-13  8:19             ` Peter Zijlstra
2007-09-13 18:32               ` Christoph Lameter
2007-09-13 19:24                 ` Peter Zijlstra
2007-09-05 16:16     ` Daniel Phillips
2007-09-08  5:12       ` Mike Snitzer
2007-09-18  0:28         ` Daniel Phillips
2007-09-18  3:27           ` Mike Snitzer
     [not found]             ` <200709172211.26493.phillips@phunq.net>
2007-09-18  8:11               ` Wouter Verhelst
2007-09-18  9:58               ` Peter Zijlstra
2007-09-18 16:56                 ` Daniel Phillips
2007-09-18 19:16                   ` Peter Zijlstra
2007-09-18  9:30             ` Peter Zijlstra
2007-09-18 18:40             ` Daniel Phillips
2007-09-18 20:13               ` Mike Snitzer
2007-09-10 19:25       ` Christoph Lameter
2007-09-10 19:55         ` Peter Zijlstra
2007-09-10 20:22           ` Christoph Lameter
2007-09-10 20:48             ` Peter Zijlstra
2007-10-26 17:44               ` Pavel Machek
2007-10-26 17:55                 ` Christoph Lameter
2007-10-27 22:58                   ` Daniel Phillips
2007-10-27 23:08                 ` Daniel Phillips

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1187581894.6114.169.camel@twins \
    --to=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=clameter@sgi.com \
    --cc=davem@davemloft.net \
    --cc=dkegel@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=npiggin@suse.de \
    --cc=phillips@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).