From: Daniel Phillips <phillips@phunq.net>
To: Christoph Lameter <clameter@sgi.com>
Cc: Pavel Machek <pavel@ucw.cz>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, dkegel@google.com,
David Miller <davem@davemloft.net>, Nick Piggin <npiggin@suse.de>
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)
Date: Sat, 27 Oct 2007 15:58:36 -0700 [thread overview]
Message-ID: <200710271558.37514.phillips@phunq.net> (raw)
In-Reply-To: <Pine.LNX.4.64.0710261052530.15895@schroedinger.engr.sgi.com>
On Friday 26 October 2007 10:55, Christoph Lameter wrote:
> On Fri, 26 Oct 2007, Pavel Machek wrote:
> > > And, _no_, it does not necessarily mean global serialisation. By
> > > simply saying there must be N pages available I say nothing about
> > > on which node they should be available, and the way the
> > > watermarks work they will be evenly distributed over the
> > > appropriate zones.
> >
> > Agreed. Scalability of emergency swapping reserved is simply
> > unimportant. Please, lets get swapping to _work_ first, then we can
> > make it faster.
>
> Global reserve means that any cpuset that runs out of memory may
> exhaust the global reserve and thereby impact the rest of the system.
> The emergencies that are currently localized to a subset of the
> system and may lead to the failure of a job may now become global and
> lead to the failure of all jobs running on it.
If it does, it is a bug in the reserve accounting. That said, I still
agree with you that per-node reserve is a desirable goal for numa. I
would just like to be clear that it is not necessary, even for numa,
just nice. By all means somebody should be hacking on a numa feature
for per-node emergency reserves, but as far as fixing the immediate,
serious kernel block IO deadlocks goes, it does not matter.
Pavel, I do not agree that efficiency is unimportant on the
under-pressure path. I do not even like to call that the "emergency"
path, because under heavy load it is normal for a machine to spend a
significant fraction of its time in that state. However, the
efficiency goal there does not need to be quite the same as normal
mode.
To illustrate, I would expect to see something like 95% of normal block
IO performance on a numa machine in the case that "emergency" (aka
memalloc memory) is allocated globally instead of locally, thus paying
a (modest compared to the disk transfer itself) penalty for transfer of
disk data over the numa interconnect. 95% of normal throughput on the
block IO path is not a problem: if the machine spends 5% of its time on
the "emergency" (aka memalloc) path, then overall efficiency will be
95% * 95% = 99.75%.
Moral of this story: let's get the memory recursion fixes done in the
most obviously correct way and not get distracted by illusory
efficiency requirements for numa, that do not have a big bottom line
impact.
I'm glad to see everybody still interested in these problems. Though we
have been a little quiet on this issue over here for a while, it does
not mean that progress has stopped. In fact, we are testing our
solutions more heavily than ever, and getting closer to a solution that
not only works solidly, but that should enable mass deletion of the
whole creaky notion of dirty page limits in favor of nice, tight
per-device control of in flight write traffic as I have described
previously.
Regards,
Daniel
WARNING: multiple messages have this Message-ID (diff)
From: Daniel Phillips <phillips@phunq.net>
To: Christoph Lameter <clameter@sgi.com>
Cc: Pavel Machek <pavel@ucw.cz>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, dkegel@google.com,
David Miller <davem@davemloft.net>, Nick Piggin <npiggin@suse.de>
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)
Date: Sat, 27 Oct 2007 15:58:36 -0700 [thread overview]
Message-ID: <200710271558.37514.phillips@phunq.net> (raw)
In-Reply-To: <Pine.LNX.4.64.0710261052530.15895@schroedinger.engr.sgi.com>
On Friday 26 October 2007 10:55, Christoph Lameter wrote:
> On Fri, 26 Oct 2007, Pavel Machek wrote:
> > > And, _no_, it does not necessarily mean global serialisation. By
> > > simply saying there must be N pages available I say nothing about
> > > on which node they should be available, and the way the
> > > watermarks work they will be evenly distributed over the
> > > appropriate zones.
> >
> > Agreed. Scalability of emergency swapping reserved is simply
> > unimportant. Please, lets get swapping to _work_ first, then we can
> > make it faster.
>
> Global reserve means that any cpuset that runs out of memory may
> exhaust the global reserve and thereby impact the rest of the system.
> The emergencies that are currently localized to a subset of the
> system and may lead to the failure of a job may now become global and
> lead to the failure of all jobs running on it.
If it does, it is a bug in the reserve accounting. That said, I still
agree with you that per-node reserve is a desirable goal for numa. I
would just like to be clear that it is not necessary, even for numa,
just nice. By all means somebody should be hacking on a numa feature
for per-node emergency reserves, but as far as fixing the immediate,
serious kernel block IO deadlocks goes, it does not matter.
Pavel, I do not agree that efficiency is unimportant on the
under-pressure path. I do not even like to call that the "emergency"
path, because under heavy load it is normal for a machine to spend a
significant fraction of its time in that state. However, the
efficiency goal there does not need to be quite the same as normal
mode.
To illustrate, I would expect to see something like 95% of normal block
IO performance on a numa machine in the case that "emergency" (aka
memalloc memory) is allocated globally instead of locally, thus paying
a (modest compared to the disk transfer itself) penalty for transfer of
disk data over the numa interconnect. 95% of normal throughput on the
block IO path is not a problem: if the machine spends 5% of its time on
the "emergency" (aka memalloc) path, then overall efficiency will be
95% * 95% = 99.75%.
Moral of this story: let's get the memory recursion fixes done in the
most obviously correct way and not get distracted by illusory
efficiency requirements for numa, that do not have a big bottom line
impact.
I'm glad to see everybody still interested in these problems. Though we
have been a little quiet on this issue over here for a while, it does
not mean that progress has stopped. In fact, we are testing our
solutions more heavily than ever, and getting closer to a solution that
not only works solidly, but that should enable mass deletion of the
whole creaky notion of dirty page limits in favor of nice, tight
per-device control of in flight write traffic as I have described
previously.
Regards,
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2007-10-27 22:59 UTC|newest]
Thread overview: 108+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-08-14 14:21 [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC) Christoph Lameter
2007-08-14 14:21 ` Christoph Lameter
2007-08-14 14:21 ` [RFC 1/3] Allow reclaim via __GFP_NOMEMALLOC reclaim Christoph Lameter
2007-08-14 14:21 ` Christoph Lameter
2007-08-14 14:21 ` [RFC 2/3] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set Christoph Lameter
2007-08-14 14:21 ` Christoph Lameter
2007-08-14 14:21 ` [RFC 3/3] Test code for PF_MEMALLOC reclaim Christoph Lameter
2007-08-14 14:21 ` Christoph Lameter
2007-08-14 14:36 ` [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC) Peter Zijlstra
2007-08-14 14:36 ` Peter Zijlstra
2007-08-14 15:29 ` Christoph Lameter
2007-08-14 15:29 ` Christoph Lameter
2007-08-14 19:32 ` Peter Zijlstra
2007-08-14 19:32 ` Peter Zijlstra
2007-08-14 19:41 ` Christoph Lameter
2007-08-14 19:41 ` Christoph Lameter
2007-08-15 12:22 ` Nick Piggin
2007-08-15 12:22 ` Nick Piggin
2007-08-15 13:12 ` Peter Zijlstra
2007-08-15 14:15 ` Andi Kleen
2007-08-15 14:15 ` Andi Kleen
2007-08-15 13:55 ` Peter Zijlstra
2007-08-15 14:34 ` Andi Kleen
2007-08-15 14:34 ` Andi Kleen
2007-08-15 20:32 ` Christoph Lameter
2007-08-15 20:32 ` Christoph Lameter
2007-08-15 20:29 ` Christoph Lameter
2007-08-15 20:29 ` Christoph Lameter
2007-08-16 3:29 ` Nick Piggin
2007-08-16 3:29 ` Nick Piggin
2007-08-16 20:27 ` Christoph Lameter
2007-08-16 20:27 ` Christoph Lameter
2007-08-20 3:51 ` Peter Zijlstra
2007-08-20 19:15 ` Christoph Lameter
2007-08-20 19:15 ` Christoph Lameter
2007-08-21 0:32 ` Nick Piggin
2007-08-21 0:32 ` Nick Piggin
2007-08-21 0:28 ` Nick Piggin
2007-08-21 0:28 ` Nick Piggin
2007-08-21 15:29 ` Peter Zijlstra
2007-08-23 3:02 ` Nick Piggin
2007-08-23 3:02 ` Nick Piggin
2007-09-12 22:39 ` Christoph Lameter
2007-09-12 22:39 ` Christoph Lameter
2007-09-05 9:20 ` Daniel Phillips
2007-09-05 9:20 ` Daniel Phillips
2007-09-05 10:42 ` Christoph Lameter
2007-09-05 10:42 ` Christoph Lameter
2007-09-05 11:42 ` Nick Piggin
2007-09-05 11:42 ` Nick Piggin
2007-09-05 12:14 ` Christoph Lameter
2007-09-05 12:14 ` Christoph Lameter
2007-09-05 12:19 ` Nick Piggin
2007-09-05 12:19 ` Nick Piggin
2007-09-10 19:29 ` Christoph Lameter
2007-09-10 19:29 ` Christoph Lameter
2007-09-10 19:37 ` Peter Zijlstra
2007-09-10 19:41 ` Christoph Lameter
2007-09-10 19:41 ` Christoph Lameter
2007-09-10 19:55 ` Peter Zijlstra
2007-09-10 20:17 ` Christoph Lameter
2007-09-10 20:17 ` Christoph Lameter
2007-09-10 20:48 ` Peter Zijlstra
2007-09-11 7:41 ` Nick Piggin
2007-09-11 7:41 ` Nick Piggin
2007-09-12 10:52 ` Peter Zijlstra
2007-09-12 22:47 ` Christoph Lameter
2007-09-12 22:47 ` Christoph Lameter
2007-09-13 8:19 ` Peter Zijlstra
2007-09-13 18:32 ` Christoph Lameter
2007-09-13 18:32 ` Christoph Lameter
2007-09-13 19:24 ` Peter Zijlstra
2007-09-13 19:24 ` Peter Zijlstra
2007-09-05 16:16 ` Daniel Phillips
2007-09-05 16:16 ` Daniel Phillips
2007-09-08 5:12 ` Mike Snitzer
2007-09-08 5:12 ` Mike Snitzer
2007-09-18 0:28 ` Daniel Phillips
2007-09-18 0:28 ` Daniel Phillips
2007-09-18 3:27 ` Mike Snitzer
2007-09-18 3:27 ` Mike Snitzer
2007-09-18 5:37 ` Daniel Phillips
2007-09-18 9:30 ` Peter Zijlstra
2007-09-18 9:30 ` Peter Zijlstra
[not found] ` <200709172211.26493.phillips@phunq.net>
2007-09-18 8:11 ` Wouter Verhelst
2007-09-18 8:11 ` Wouter Verhelst
2007-09-18 9:58 ` Peter Zijlstra
2007-09-18 9:58 ` Peter Zijlstra
2007-09-18 16:56 ` Daniel Phillips
2007-09-18 16:56 ` Daniel Phillips
2007-09-18 19:16 ` Peter Zijlstra
2007-09-18 19:16 ` Peter Zijlstra
2007-09-18 18:40 ` Daniel Phillips
2007-09-18 20:13 ` Mike Snitzer
2007-09-10 19:25 ` Christoph Lameter
2007-09-10 19:25 ` Christoph Lameter
2007-09-10 19:55 ` Peter Zijlstra
2007-09-10 20:22 ` Christoph Lameter
2007-09-10 20:22 ` Christoph Lameter
2007-09-10 20:48 ` Peter Zijlstra
2007-10-26 17:44 ` Pavel Machek
2007-10-26 17:44 ` Pavel Machek
2007-10-26 17:55 ` Christoph Lameter
2007-10-26 17:55 ` Christoph Lameter
2007-10-27 22:58 ` Daniel Phillips [this message]
2007-10-27 22:58 ` Daniel Phillips
2007-10-27 23:08 ` Daniel Phillips
2007-10-27 23:08 ` Daniel Phillips
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200710271558.37514.phillips@phunq.net \
--to=phillips@phunq.net \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=clameter@sgi.com \
--cc=davem@davemloft.net \
--cc=dkegel@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=npiggin@suse.de \
--cc=pavel@ucw.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.