From: Andrew Morton <akpm@linux-foundation.org>
To: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <andrea@suse.de>, linux-mm@kvack.org
Subject: Re: [PATCH 01 of 16] remove nr_scan_inactive/active
Date: Thu, 28 Jun 2007 18:12:38 -0700
Message-ID: <20070628181238.372828fa.akpm@linux-foundation.org>
In-Reply-To: <46845620.6020906@redhat.com>

On Thu, 28 Jun 2007 20:45:20 -0400
Rik van Riel <riel@redhat.com> wrote:

> >> The only problem with this is that anonymous
> >> pages could be easily pushed out of memory by
> >> the page cache, because the page cache has
> >> totally different locality of reference.
> > 
> > I don't immediately see why we need to change the fundamental aging design
> > at all.   The problems afaict are
> > 
> > a) that huge burst of activity when we hit pages_high and
> > 
> > b) the fact that this huge burst happens on lots of CPUs at the same time.
> > 
> > And balancing the LRUs _prior_ to hitting pages_high can address both
> > problems?
> 
> That may work on systems with up to a few GB of memory,
> but customers are already rolling out systems with 256GB
> of RAM for general purpose use, that's 64 million pages!
> 
> Even doing a background scan on that many pages will take
> insane amounts of CPU time.
> 
> In a few years, they will be deploying systems with 1TB
> of memory and throwing random workloads at them.

I don't see how the amount of memory changes anything here: if there are
more pages, more work needs to be done regardless of when we do it.

Still confused.
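
For scale, the page counts being discussed can be checked with a few lines
of C.  This is only a sketch: it assumes 4 KiB base pages (a PAGE_SHIFT of
12, as on x86), and SKETCH_PAGE_SHIFT and pages_in_gib() are made-up names,
not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages (PAGE_SHIFT of 12), as on x86. */
#define SKETCH_PAGE_SHIFT 12

/* Number of base pages needed to cover `gib` GiB of RAM. */
static uint64_t pages_in_gib(uint64_t gib)
{
	/* Guard the left shift below against overflow. */
	assert(gib <= (UINT64_MAX >> 30));
	return (gib << 30) >> SKETCH_PAGE_SHIFT;
}
```

Under that assumption, pages_in_gib(256) is 67,108,864 (the "64 million
pages" above) and pages_in_gib(1024) is 268,435,456 for the 1TB case.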

> >>>> No matter how efficient we make the scanning of one
> >>>> individual page, we simply cannot scan through 1TB
> >>>> worth of anonymous pages (which are all referenced
> >>>> because they've been there for a week) in order to
> >>>> deactivate something.
> >>> Sure.  And we could avoid that sudden transition by balancing the LRU prior
> >>> to hitting the great pages_high wall.
> >> Yes, we will need to do some preactive balancing.
> > 
> > OK..
> > 
> > And that huge anon-vma walk might need attention.  At the least we could do
> > something to prevent lots of CPUs from piling up in there.
> 
> Speaking of which, I have also seen a thousand processes waiting
> to grab the iprune_mutex in prune_icache.
> 

It would make sense to only permit one CPU at a time to go in and do
reclamation against a particular zone (or even node).

But the problem with the vfs caches is that they aren't node/zone-specific.
We wouldn't want to get into the situation where 1023 CPUs are twiddling
thumbs waiting for one CPU to free stuff up (or less extreme variants of
this).
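
A minimal userspace sketch of the "one CPU at a time per zone" idea, using
a C11 atomic flag as a try-lock.  struct zone_sketch and both helpers are
hypothetical, not the kernel's struct zone or any existing API.  The point
it illustrates is that a CPU which loses the race returns immediately
rather than queueing up behind the winner.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone state; not the kernel's struct zone. */
struct zone_sketch {
	atomic_flag reclaim_in_progress;	/* set while one CPU is reclaiming */
	unsigned long nr_reclaimed;		/* only touched by the flag holder */
};

/*
 * Returns true if the caller is now the only reclaimer for this zone.
 * Returns false at once, without waiting, if another CPU is already in.
 */
static bool zone_try_begin_reclaim(struct zone_sketch *zone)
{
	assert(zone != NULL);
	return !atomic_flag_test_and_set_explicit(&zone->reclaim_in_progress,
						  memory_order_acquire);
}

/* Record what was freed and let the next CPU in. */
static void zone_end_reclaim(struct zone_sketch *zone, unsigned long nr_freed)
{
	assert(zone != NULL);
	zone->nr_reclaimed += nr_freed;
	atomic_flag_clear_explicit(&zone->reclaim_in_progress,
				   memory_order_release);
}
```

This does nothing for caches that are not zone-specific, which is the
problem described above: the flag only says who may scan this zone, not
where the losers should wait.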

> Maybe direct reclaim processes should not dive into this cache
> at all, but simply increase some variable indicating that kswapd
> might want to prune some extra pages from this cache on its next
> run?

Tell the node's kswapd to go off and do VFS reclaim while the CPUs on that
node wait for it?  That would help I guess, but those thousand processes
would still need to block _somewhere_ waiting for the memory to come back.
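
The "increase some variable" suggestion could look roughly like this in
userspace C11.  The counter and both function names are invented for
illustration and nothing like them is claimed to exist in the kernel.
Direct reclaimers only add to the counter; kswapd claims the accumulated
total on its next run.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical counter: objects that direct reclaimers would like pruned. */
static atomic_ulong vfs_prune_requested;

/* Direct reclaim path: record the request without touching the cache. */
static void request_vfs_prune(unsigned long nr)
{
	assert(nr > 0);
	atomic_fetch_add(&vfs_prune_requested, nr);
}

/* kswapd path: claim everything requested since the last run, reset to 0. */
static unsigned long kswapd_take_prune_requests(void)
{
	return atomic_exchange(&vfs_prune_requested, 0UL);
}
```

As noted above, this only moves the work: the requesting tasks still have
to wait somewhere for the memory to come back.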

Of course, iprune_mutex is a particularly dumb place in which to do that,
because the memory may get freed up from somewhere else.

The general design here could/should be to back off to the top-level when
there's contention (that's presently congestion_wait()) and to poll for
memory-became-allocatable.

So what we could do here is to back off when iprune_mutex is busy and, if
nothing else works out, block in congestion_wait() (which is becoming
increasingly misnamed).  Then, add some more smarts to congestion_wait():
deliver a wakeup when "enough" memory got freed from the VFS caches.

One suspects that at some stage, congestion_wait() will need to be told
what the calling task is actually waiting for (perhaps a zonelist) so that
the wakeup delivery can become smarter.
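
To make the shape of that concrete, here is a rough userspace sketch.  It
is not kernel code: struct reclaim_sketch, try_prune(), wait_for_pages()
and get_pages_sketch() are all made up, the atomic flag merely plays the
role of iprune_mutex, and a short nanosleep() poll loop stands in for
congestion_wait().  A real implementation would presumably deliver a
wakeup instead of sleeping blindly.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for nanosleep() */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
struct reclaim_sketch {
	atomic_flag prune_busy;		/* plays the role of iprune_mutex */
	atomic_ulong pages_free;	/* pages currently allocatable */
};

#define SKETCH_MAX_POLLS 5
#define SKETCH_POLL_NS 1000000L	/* 1 ms per poll */

/* Contention point: prune if nobody else is, otherwise back off at once. */
static bool try_prune(struct reclaim_sketch *r, unsigned long nr_prunable)
{
	if (atomic_flag_test_and_set(&r->prune_busy))
		return false;
	atomic_fetch_add(&r->pages_free, nr_prunable);	/* pretend these got freed */
	atomic_flag_clear(&r->prune_busy);
	return true;
}

/* Top-level wait: poll for "enough memory is free", wherever it came from. */
static bool wait_for_pages(struct reclaim_sketch *r, unsigned long need)
{
	const struct timespec nap = { .tv_sec = 0, .tv_nsec = SKETCH_POLL_NS };

	for (int i = 0; i < SKETCH_MAX_POLLS; i++) {
		if (atomic_load(&r->pages_free) >= need)
			return true;
		nanosleep(&nap, NULL);
	}
	return atomic_load(&r->pages_free) >= need;
}

/* Returns true once `need` pages are free, false if the polls run out. */
static bool get_pages_sketch(struct reclaim_sketch *r, unsigned long need,
			     unsigned long nr_prunable)
{
	assert(r != NULL);
	if (atomic_load(&r->pages_free) >= need)
		return true;
	(void)try_prune(r, nr_prunable);	/* busy? fine, do not wait here */
	return wait_for_pages(r, need);
}
```

The caller never sleeps on the contended lock itself.  If the pruner is
busy it falls through to the top-level wait, which succeeds as soon as
enough pages show up from any source.  The sketch only checks a free-page
count and does not model pages being consumed by the allocation.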


But for now, the question is: is this a reasonable overall design?  Back
off from contention points, block at the top-level, polling for allocatable
memory to turn up?

