linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@suse.de>
To: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	P?draig Brady <P@draigBrady.com>,
	James Bottomley <James.Bottomley@HansenPartnership.com>,
	Colin King <colin.king@canonical.com>,
	Andrew Lutomirski <luto@mit.edu>, Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-mm <linux-mm@kvack.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully
Date: Thu, 21 Jul 2011 17:07:06 +0100	[thread overview]
Message-ID: <20110721160706.GS5349@suse.de> (raw)
In-Reply-To: <20110721153007.GC1713@barrios-desktop>

On Fri, Jul 22, 2011 at 12:30:07AM +0900, Minchan Kim wrote:
> On Wed, Jul 20, 2011 at 11:48:47AM +0100, Mel Gorman wrote:
> > On Wed, Jul 20, 2011 at 01:09:03AM +0900, Minchan Kim wrote:
> > > Hi Mel,
> > > 
> > > Too late review.
> > 
> > Never too late.
> > 
> > > At that time, I had no time to look into this patch.
> > > 
> > > On Fri, Jun 24, 2011 at 03:44:57PM +0100, Mel Gorman wrote:
> > > > During allocator-intensive workloads, kswapd will be woken frequently
> > > > causing free memory to oscillate between the high and min watermark.
> > > > This is expected behaviour.  Unfortunately, if the highest zone is
> > > > small, a problem occurs.
> > > > 
> > > > When balance_pgdat() returns, it may be at a lower classzone_idx than
> > > > it started because the highest zone was unreclaimable. Before checking
> > > 
> > > Yes.
> > > 
> > > > if it should go to sleep though, it checks pgdat->classzone_idx which
> > > > when there is no other activity will be MAX_NR_ZONES-1. It interprets
> > > 
> > > Yes.
> > > 
> > > > this as it has been woken up while reclaiming, skips scheduling and
> > > 
> > > Hmm. I can't understand this part.
> > > If balance_pgdat returns lower classzone and there is no other activity,
> > > new_classzone_idx is always MAX_NR_ZONES - 1 so that classzone_idx would be less than
> > > new_classzone_idx. It means it doesn't skip scheduling.
> > > 
> > > Do I miss something?
> > > 
> > 
> > It was a few weeks ago so I don't rememember if this is the exact
> > sequence I had in mind at the time of writing but an example sequence
> > of events is for a node whose highest populated zone is ZONE_NORMAL,
> > very small, and gets set all_unreclaimable by balance_pgdat() looks
> > is below. The key is the "very small" part because pages are getting
> > freed in the zone but the small size means that unreclaimable gets
> > set easily.
> > 
> > /*
> >  * kswapd is woken up for ZONE_NORMAL (as this is the preferred zone
> >  * as ZONE_HIGHMEM is not populated.
> >  */
> > 
> > order = pgdat->kswapd_max_order;
> > classzone_idx = pgdat->classzone_idx;				/* classzone_idx == ZONE_NORMAL */
> > pgdat->kswapd_max_order = 0;
> > pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > order = balance_pgdat(pgdat, order, &classzone_idx);		/* classzone_idx == ZONE_NORMAL even though
> > 								 * the highest zone was set unreclaimable
> > 								 * and it exited scanning ZONE_DMA32
> > 								 * because we did not communicate that
> > 								 * information back
> 
> 								Yes. It's too bad.
> 
> > 								 */
> > new_order = pgdat->kswapd_max_order;				/* new_order = 0 */
> > new_classzone_idx = pgdat->classzone_idx;			/* new_classzone_idx == ZONE_HIGHMEM
> > 								 * because that is what classzone_idx
> > 								 * gets reset to
> 
> 								Yes. new_classzone_idx is ZONE_HIGHMEM.
> 
> > 								 */
> > if (order < new_order || classzone_idx > new_classzone_idx) {
> > 	/* does not sleep, this branch not taken */
> > } else {
> > 	/* tries to sleep, goes here */
> > 	try_to_sleep(ZONE_NORMAL)
> > 		sleeping_prematurely(ZONE_NORMAL)		/* finds zone unbalanced so skips scheduling */
> >         order = pgdat->kswapd_max_order;
> >         classzone_idx = pgdat->classzone_idx;			/* classzone_idx == ZONE_HIGHMEM now which
> > 								 * is higher than what it was originally
> > 								 * woken for
> > 								 */
> 
> 								But is it a problem?
> 								it should be reset to ZONE_NORMAL in balance_pgdat as high zone isn't populated.

At the very least, it's sloppy.

> > }
> > 
> > /* Looped around to balance_pgdat() again */
> > order = balance_pgdat()
> > 
> > Between when all_unreclaimable is set and before before kswapd
> > goes fully to sleep, a page is freed clearing all_reclaimable so
> > it rechecks all the zones, find the highest one is not balanced and
> > skip scheduling.
> 
> Yes and it could be repeated forever.

Resulting in chewing up large amounts of CPU.

> Apparently, we should fix wit this patch but I have a qustion about this patch.
> 
> Quote from your patch
> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a76b6cc2..fe854d7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2448,7 +2448,6 @@ loop_again:
> >                       if (!zone_watermark_ok_safe(zone, order,
> >                                       high_wmark_pages(zone), 0, 0)) {
> >                               end_zone = i;
> > -                             *classzone_idx = i;
> >                               break;
> >                       }
> >               }
> > @@ -2528,8 +2527,11 @@ loop_again:
> >                           total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> >                               sc.may_writepage = 1;
> >  
> > -                     if (zone->all_unreclaimable)
> > +                     if (zone->all_unreclaimable) {
> > +                             if (end_zone && end_zone == i)
> > +                                     end_zone--;
> 
> Until now, It's good.
> 
> >                               continue;
> > +                     }
> >  
> >                       if (!zone_watermark_ok_safe(zone, order,
> >                                       high_wmark_pages(zone), end_zone, 0)) {
> > @@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> >   */
> >  static int kswapd(void *p)
> >  {
> > -     unsigned long order;
> > -     int classzone_idx;
> > +     unsigned long order, new_order;
> > +     int classzone_idx, new_classzone_idx;
> >       pg_data_t *pgdat = (pg_data_t*)p;
> >       struct task_struct *tsk = current;
> >  
> > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
> >       tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >       set_freezable();
> >  
> > -     order = 0;
> > -     classzone_idx = MAX_NR_ZONES - 1;
> > +     order = new_order = 0;
> > +     classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> >       for ( ; ; ) {
> > -             unsigned long new_order;
> > -             int new_classzone_idx;
> >               int ret;
> >  
> > -             new_order = pgdat->kswapd_max_order;
> > -             new_classzone_idx = pgdat->classzone_idx;
> > -             pgdat->kswapd_max_order = 0;
> > -             pgdat->classzone_idx = MAX_NR_ZONES - 1;
> > +             /*
> > +              * If the last balance_pgdat was unsuccessful it's unlikely a
> > +              * new request of a similar or harder type will succeed soon
> > +              * so consider going to sleep on the basis we reclaimed at
> > +              */
> > +             if (classzone_idx >= new_classzone_idx && order == new_order) {
> > +                     new_order = pgdat->kswapd_max_order;
> > +                     new_classzone_idx = pgdat->classzone_idx;
> > +                     pgdat->kswapd_max_order =  0;
> > +                     pgdat->classzone_idx = pgdat->nr_zones - 1;
> > +             }
> > +
> 
> But in this part.
> Why do we need this?

Lets say it's a fork-heavy workload and it is routinely being woken
for order-1 allocations and the highest zone is very small. For the
most part, it's ok because the allocations are being satisfied from
the lower zones which kswapd has no problem balancing.

However, by reading the information even after failing to
balance, kswapd continues balancing for order-1 due to reading
pgdat->kswapd_max_order, each time failing for the highest zone. It
only takes one wakeup request per balance_pgdat() to keep kswapd
awake trying to balance the highest zone in a continual loop.

By avoiding this read, kswapd will try and go to sleep after checking
all the watermarks and all_unreclaimable. If the watermarks are ok, it
will sleep until woken up due to the lower zones hitting their min
watermarks.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2011-07-21 16:07 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-24 14:44 [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small Mel Gorman
2011-06-24 14:44 ` [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely Mel Gorman
2011-06-25 21:33   ` Rik van Riel
2011-06-27  6:10   ` Minchan Kim
2011-06-28 21:49   ` Andrew Morton
2011-06-29 10:57     ` Pádraig Brady
2011-06-30  9:39     ` Mel Gorman
2011-06-30  2:23   ` KOSAKI Motohiro
2011-06-24 14:44 ` [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone Mel Gorman
2011-06-25 21:40   ` Rik van Riel
2011-06-28 23:38   ` Minchan Kim
2011-06-30  2:37   ` KOSAKI Motohiro
2011-06-24 14:44 ` [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone Mel Gorman
2011-06-25 21:42   ` Rik van Riel
2011-06-27  6:53   ` Minchan Kim
2011-06-28 12:52     ` Mel Gorman
2011-06-28 23:23       ` Minchan Kim
2011-06-28 23:23   ` Minchan Kim
2011-06-24 14:44 ` [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully Mel Gorman
2011-06-25 23:17   ` Rik van Riel
2011-06-30  9:05   ` KOSAKI Motohiro
2011-06-30 10:19     ` Mel Gorman
2011-07-19 16:09   ` Minchan Kim
2011-07-20 10:48     ` Mel Gorman
2011-07-21 15:30       ` Minchan Kim
2011-07-21 16:07         ` Mel Gorman [this message]
2011-07-21 16:36           ` Minchan Kim
2011-07-21 17:01             ` Mel Gorman
2011-07-22  0:21               ` Minchan Kim
2011-07-22  7:42                 ` Mel Gorman
2011-06-25 14:23 ` [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small Andrew Lutomirski
2011-07-21 15:37 ` Minchan Kim
2011-07-21 16:09   ` Mel Gorman
2011-07-21 16:24     ` Minchan Kim
2011-07-21 16:36       ` Andrew Lutomirski
2011-07-21 16:42         ` Minchan Kim
2011-07-21 16:58           ` Andrew Lutomirski
2011-07-22  0:30             ` Minchan Kim
2011-07-22 13:21               ` Andrew Lutomirski
  -- strict thread matches above, loose matches on Subject: below --
2011-06-24 13:43 Mel Gorman
2011-06-24 13:43 ` [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110721160706.GS5349@suse.de \
    --to=mgorman@suse.de \
    --cc=James.Bottomley@HansenPartnership.com \
    --cc=P@draigBrady.com \
    --cc=akpm@linux-foundation.org \
    --cc=colin.king@canonical.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@mit.edu \
    --cc=minchan.kim@gmail.com \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).