Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

From: Daniel Phillips <phillips@phunq.net>
To: Christoph Lameter <clameter@sgi.com>
Cc: Pavel Machek <pavel@ucw.cz>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, dkegel@google.com,
	David Miller <davem@davemloft.net>, Nick Piggin <npiggin@suse.de>
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)
Date: Sat, 27 Oct 2007 15:58:36 -0700
Message-ID: <200710271558.37514.phillips@phunq.net>
In-Reply-To: <Pine.LNX.4.64.0710261052530.15895@schroedinger.engr.sgi.com>

On Friday 26 October 2007 10:55, Christoph Lameter wrote:
> On Fri, 26 Oct 2007, Pavel Machek wrote:
> > > And, _no_, it does not necessarily mean global serialisation. By
> > > simply saying there must be N pages available I say nothing about
> > > on which node they should be available, and the way the
> > > watermarks work they will be evenly distributed over the
> > > appropriate zones.
> >
> > Agreed. Scalability of emergency swapping reserved is simply
> > unimportant. Please, lets get swapping to _work_ first, then we can
> > make it faster.
>
> Global reserve means that any cpuset that runs out of memory may
> exhaust the global reserve and thereby impact the rest of the system.
> The emergencies that are currently localized to a subset of the
> system and may lead to the failure of a job may now become global and
> lead to the failure of all jobs running on it.

If it does, it is a bug in the reserve accounting.  That said, I still 
agree with you that per-node reserve is a desirable goal for numa.  I 
would just like to be clear that it is not necessary, even for numa, 
just nice.  By all means somebody should be hacking on a numa feature 
for per-node emergency reserves, but as far as fixing the immediate, 
serious kernel block IO deadlocks goes, it does not matter.

Pavel, I do not agree that efficiency is unimportant on the 
under-pressure path.  I do not even like to call that the "emergency" 
path, because under heavy load it is normal for a machine to spend a 
significant fraction of its time in that state.  However, the 
efficiency goal there does not need to be quite the same as normal 
mode.

To illustrate, I would expect to see something like 95% of normal block
IO performance on a numa machine in the case that "emergency" (aka
memalloc) memory is allocated globally instead of locally, thus paying
a penalty (modest compared to the disk transfer itself) for transfer of
disk data over the numa interconnect.  95% of normal throughput on the
block IO path is not a problem: if the machine spends 5% of its time on
the "emergency" (aka memalloc) path, then overall efficiency will be
95% * 100% + 5% * 95% = 99.75%.
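
The 99.75% figure is a time-weighted average: 95% of the time at full
throughput plus 5% of the time at 95% throughput.  A minimal sketch of
that calculation follows.  The function name and parameters are
hypothetical, chosen only for illustration, and the 5% and 95% inputs
are the estimates from the paragraph above, not measurements.

```python
def overall_efficiency(pressure_fraction, pressure_throughput,
                       normal_throughput=1.0):
    """Return time-weighted throughput relative to normal operation.

    pressure_fraction:   share of time spent on the memalloc path (0..1)
    pressure_throughput: relative block IO throughput on that path (0..1)
    normal_throughput:   relative throughput otherwise (assumed 1.0)
    """
    if not 0.0 <= pressure_fraction <= 1.0:
        raise ValueError("pressure_fraction must be between 0 and 1")
    normal_share = (1.0 - pressure_fraction) * normal_throughput
    pressure_share = pressure_fraction * pressure_throughput
    return normal_share + pressure_share


if __name__ == "__main__":
    # Estimates from the text: 5% of time under pressure, 95% throughput there.
    print(f"{overall_efficiency(0.05, 0.95):.4f}")
```

Even if the machine spent a quarter of its time under pressure, the
same formula gives 0.75 + 0.25 * 0.95 = 98.75%, so the bottom-line cost
of a global reserve stays small under these assumptions.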

Moral of this story: let's get the memory recursion fixes done in the
most obviously correct way and not get distracted by illusory
efficiency requirements for numa that do not have a big bottom-line
impact.

I'm glad to see everybody still interested in these problems.  Though we
have been a little quiet on this issue over here for a while, it does
not mean that progress has stopped.  In fact, we are testing our
solutions more heavily than ever, and getting closer to a solution that
not only works solidly, but should also enable mass deletion of the
whole creaky notion of dirty page limits in favor of nice, tight
per-device control of in-flight write traffic, as I have described
previously.

Regards,

Daniel
