linux-mm.kvack.org archive mirror
From: Mel Gorman <mel@csn.ul.ie>
To: Nick Piggin <npiggin@suse.de>
Cc: linux-mm@kvack.org,
	Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>,
	Chris Mason <chris.mason@oracle.com>,
	Jens Axboe <jens.axboe@oracle.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
Date: Tue, 9 Mar 2010 14:17:13 +0000	[thread overview]
Message-ID: <20100309141713.GF4883@csn.ul.ie> (raw)
In-Reply-To: <20100309133513.GL8653@laptop>

On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> > Under heavy memory pressure, the page allocator may call congestion_wait()
> > to wait for IO congestion to clear or a timeout. This is not as sensible
> > a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> > even congested as the pressure could have been due to a large number of
> > SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> > 
> > At the point of congestion_wait(), the allocator is struggling to get the
> > pages it needs and it should back off. This patch puts the allocator to sleep
> > on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> > kswapd brings the zone over the low watermark, whichever happens first.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/linux/mmzone.h |    3 ++
> >  mm/internal.h          |    4 +++
> >  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
> >  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
> >  mm/vmscan.c            |    2 +
> >  5 files changed, 101 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 30fe668..72465c1 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -398,6 +398,9 @@ struct zone {
> >  	unsigned long		wait_table_hash_nr_entries;
> >  	unsigned long		wait_table_bits;
> >  
> > +	/* queue for processes waiting for pressure to be relieved */
> > +	wait_queue_head_t	*pressure_wq;
> 
> Hmm, processes may be eligible to allocate from > 1 zone, but you
> have them only waiting for one. I wonder if we shouldn't wait for
> more zones?
> 

It's waiting for the most desirable zone. If that zone's watermarks
are met, why would it wait on any other zone? If you mean that it could
wait for any of the eligible zones to meet their watermarks, that might
have an impact on NUMA locality but it could be managed. If that behaviour
were desirable, it might make more sense to wait on a node-based queue
rather than a zone-based one.

> Congestion waiting uses a global waitqueue, which hasn't seemed to
> cause a big scalability problem. Would it be better to have a global
> waitqueue for this too?
> 

Considering that the congestion wait queue is used on a relatively slow path,
it would be surprising if lock scalability were noticeable. The pressure_wq
potentially involves no IO, so scalability may be more noticeable there.

What would the advantages of a global waitqueue be? Obviously, a smaller
memory footprint. A second potential advantage is that, on wakeup, the
watermarks of multiple zones could be checked, which might reduce
latencies in some cases. Can you think of more compelling reasons?

> 
> > +void check_zone_pressure(struct zone *zone)
> 
> I don't really like the name pressure. We use that term for the reclaim
> pressure wheras we're just checking watermarks here (actual pressure
> could be anything).
> 

pressure_wq => watermark_wq
check_zone_pressure => check_watermark_wq

?

> 
> > +{
> > +	/* If no process is waiting, nothing to do */
> > +	if (!waitqueue_active(zone->pressure_wq))
> > +		return;
> > +
> > +	/* Check if the low watermark is ok for order 0 */
> > +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> > +		wake_up_interruptible(zone->pressure_wq);
> > +}
> 
> If you were to do this under the zone lock (in your subsequent patch),
> then it could avoid races. I would suggest doing it all as a single
> patch and not doing the pressure checks in reclaim at all.
> 

That is reasonable. I've already dropped the checks in reclaim because, as
you say, if the check in the free path is cheap enough, it is also sufficient.
Checking in the reclaim paths as well would be redundant.

I'll move the call to check_zone_pressure() under the zone lock to avoid
races.

> If you are missing anything, then that needs to be explained and fixed
> rather than just adding extra checks.
> 
> > +
> > +/**
> > + * zonepressure_wait - Wait for pressure on a zone to ease off
> > + * @zone: The zone that is expected to be under pressure
> > + * @order: The order the caller is waiting on pages for
> > + * @timeout: Wait until pressure is relieved or this timeout is reached
> > + *
> > + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> > + * It's considered to be relieved if any direct reclaimer or kswapd brings
> > + * the zone above the low watermark
> > + */
> > +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> > +{
> > +	long ret;
> > +	DEFINE_WAIT(wait);
> > +
> > +wait_again:
> > +	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
> 
> I guess to do it without races you need to check watermark here.
> And possibly some barriers if it is done without zone->lock.
> 

As watermark checks are already done without the zone->lock and without
barriers, why are they needed here? Yes, there are small races. For
example, it's possible to hit a window where pages are freed between the
watermark check and going to sleep here, but that is similar to the
current behaviour.


> > +
> > +	/*
> > +	 * The use of io_schedule_timeout() here means that it gets
> > +	 * accounted for as IO waiting. This may or may not be the case
> > +	 * but at least this way it gets picked up by vmstat
> > +	 */
> > +	ret = io_schedule_timeout(timeout);
> > +	finish_wait(zone->pressure_wq, &wait);
> > +
> > +	/* If woken early, check watermarks before continuing */
> > +	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> > +		timeout = ret;
> > +		goto wait_again;
> > +	}
> 
> And then I don't know if we'd really need the extra check here. Might as
> well just let the allocator try again and avoid the code?
> 

I was considering multiple processes being woken up and racing with each
other. I can drop this check though. The worst that happens is that multiple
processes wake up and walk the full zonelist. Some will succeed and others
will go back to sleep.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
