From: Mel Gorman <mgorman@suse.de>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: akpm@linux-foundation.org, P@draigBrady.com,
	James.Bottomley@HansenPartnership.com, colin.king@canonical.com,
	minchan.kim@gmail.com, luto@mit.edu, riel@redhat.com,
	hannes@cmpxchg.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
Date: Thu, 30 Jun 2011 11:19:31 +0100
Message-ID: <20110630101931.GZ9396@suse.de>
In-Reply-To: <4E0C3C77.8010608@jp.fujitsu.com>

On Thu, Jun 30, 2011 at 06:05:59PM +0900, KOSAKI Motohiro wrote:
> (2011/06/24 23:44), Mel Gorman wrote:
> > During allocator-intensive workloads, kswapd will be woken frequently
> > causing free memory to oscillate between the high and min watermark.
> > This is expected behaviour.  Unfortunately, if the highest zone is
> > small, a problem occurs.
> > 
> > When balance_pgdat() returns, it may be at a lower classzone_idx than
> > it started because the highest zone was unreclaimable. Before checking
> > if it should go to sleep though, it checks pgdat->classzone_idx which
> > when there is no other activity will be MAX_NR_ZONES-1. It interprets
> > this as it has been woken up while reclaiming, skips scheduling and
> > reclaims again. As there is no useful reclaim work to do, it enters
> > into a loop of shrinking slab consuming loads of CPU until the highest
> > zone becomes reclaimable for a long period of time.
> > 
> > There are two problems here. 1) If the returned classzone or order is
> > lower, it'll continue reclaiming without scheduling. 2) if the highest
> > zone was marked unreclaimable but balance_pgdat() returns immediately
> > at DEF_PRIORITY, the new lower classzone is not communicated back to
> > kswapd() for sleeping.
> > 
> > This patch does two things that are related. If the end_zone is
> > unreclaimable, this information is communicated back. Second, if
> > the classzone or order was reduced due to failing to reclaim, new
> > information is not read from pgdat and instead an attempt is made to go
> > to sleep. Due to this, it is also necessary that pgdat->classzone_idx
> > be initialised each time to pgdat->nr_zones - 1 to avoid re-reads
> > being interpreted as wakeups.
> > 
> > Reported-and-tested-by: Pádraig Brady <P@draigBrady.com>
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/vmscan.c |   34 +++++++++++++++++++++-------------
> >  1 files changed, 21 insertions(+), 13 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a76b6cc2..fe854d7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2448,7 +2448,6 @@ loop_again:
> >  			if (!zone_watermark_ok_safe(zone, order,
> >  					high_wmark_pages(zone), 0, 0)) {
> >  				end_zone = i;
> > -				*classzone_idx = i;
> >  				break;
> >  			}
> >  		}
> > @@ -2528,8 +2527,11 @@ loop_again:
> >  			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> >  				sc.may_writepage = 1;
> >  
> > -			if (zone->all_unreclaimable)
> > +			if (zone->all_unreclaimable) {
> > +				if (end_zone && end_zone == i)
> > +					end_zone--;
> >  				continue;
> > +			}
> >  
> >  			if (!zone_watermark_ok_safe(zone, order,
> >  					high_wmark_pages(zone), end_zone, 0)) {
> > @@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> >   */
> >  static int kswapd(void *p)
> >  {
> > -	unsigned long order;
> > -	int classzone_idx;
> > +	unsigned long order, new_order;
> > +	int classzone_idx, new_classzone_idx;
> >  	pg_data_t *pgdat = (pg_data_t*)p;
> >  	struct task_struct *tsk = current;
> >  
> > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
> >  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >  	set_freezable();
> >  
> > -	order = 0;
> > -	classzone_idx = MAX_NR_ZONES - 1;
> > +	order = new_order = 0;
> > +	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> >  	for ( ; ; ) {
> > -		unsigned long new_order;
> > -		int new_classzone_idx;
> >  		int ret;
> >  
> > -		new_order = pgdat->kswapd_max_order;
> > -		new_classzone_idx = pgdat->classzone_idx;
> > -		pgdat->kswapd_max_order = 0;
> > -		pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > +		/*
> > +		 * If the last balance_pgdat was unsuccessful it's unlikely a
> > +		 * new request of a similar or harder type will succeed soon
> > +		 * so consider going to sleep on the basis we reclaimed at
> > +		 */
> > +		if (classzone_idx >= new_classzone_idx && order == new_order) {
> 
> I'm confused by this. In the following scenario, new_classzone_idx may be garbage.
> 
> 1. new_classzone_idx = pgdat->classzone_idx
> 2. kswapd_try_to_sleep()
> 3. classzone_idx = pgdat->classzone_idx
> 4. balance_pgdat()
> 
> Wouldn't we need to reinitialize new_classzone_idx and new_order in the
> kswapd_try_to_sleep() path too?
> 

I don't understand your question. new_classzone_idx is initialised
before the kswapd main loop and, after this patch, is only updated
when balance_pgdat() successfully balanced. However, the following
situation can arise:

1. Read for balance-request-A (order, classzone) pair
2. Fail balance_pgdat
3. Sleep based on (order, classzone) pair
4. Wake for balance-request-B (order, classzone) pair where
   balance-request-B != balance-request-A
5. Succeed balance_pgdat
6. Compare order,classzone with balance-request-A which will treat
   balance_pgdat() as fail and try to go to sleep

This is not the same as new_classzone_idx being "garbage" but is it
what you mean? If so, is this your proposed fix?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fe854d7..1a518e6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2770,6 +2770,8 @@ static int kswapd(void *p)
 			kswapd_try_to_sleep(pgdat, order, classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
+			new_order = order;
+			new_classzone_idx = classzone_idx;
 			pgdat->kswapd_max_order = 0;
 			pgdat->classzone_idx = pgdat->nr_zones - 1;
 		}
