From: Mel Gorman
Subject: Re: [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
Date: Thu, 19 May 2011 10:19:06 +0100
Message-ID: <20110519091906.GT5279@suse.de>
References: <1305558417-24354-1-git-send-email-mgorman@suse.de>
 <1305558417-24354-3-git-send-email-mgorman@suse.de>
 <20110516141654.2728f05a.akpm@linux-foundation.org>
 <1305614225.6008.19.camel@mulgrave.site>
 <20110517162226.96974d89.akpm@linux-foundation.org>
 <20110518094718.GP5279@suse.de>
To: Minchan Kim
Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
 Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
 Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel, linux-ext4, stable

On Thu, May 19, 2011 at 07:42:29AM +0900, Minchan Kim wrote:
> On Wed, May 18, 2011 at 6:47 PM, Mel Gorman wrote:
> > On Tue, May 17, 2011 at 04:22:26PM -0700, Andrew Morton wrote:
> >> On Tue, 17 May 2011 10:37:04 +0400
> >> James Bottomley wrote:
> >>
> >> > On Mon, 2011-05-16 at 14:16 -0700, Andrew Morton wrote:
> >> > > On Mon, 16 May 2011 16:06:57 +0100
> >> > > Mel Gorman wrote:
> >> > >
> >> > > > Under constant allocation pressure, kswapd can be in the situation where
> >> > > > sleeping_prematurely() will always return true even if kswapd has been
> >> > > > running a long time. Check if kswapd needs to be scheduled.
> >> > > >
> >> > > > Signed-off-by: Mel Gorman
> >> > > > Acked-by: Rik van Riel
> >> > > > ---
> >> > > >  mm/vmscan.c |    4 ++++
> >> > > >  1 files changed, 4 insertions(+), 0 deletions(-)
> >> > > >
> >> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > > > index af24d1e..4d24828 100644
> >> > > > --- a/mm/vmscan.c
> >> > > > +++ b/mm/vmscan.c
> >> > > > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> > > >         unsigned long balanced = 0;
> >> > > >         bool all_zones_ok = true;
> >> > > >
> >> > > > +       /* If kswapd has been running too long, just sleep */
> >> > > > +       if (need_resched())
> >> > > > +               return false;
> >> > > > +
> >> > > >         /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> >> > > >         if (remaining)
> >> > > >                 return true;
> >> > >
> >> > > I'm a bit worried by this one.
> >> > >
> >> > > Do we really fully understand why kswapd is continuously running like
> >> > > this?  The changelog makes me think "no" ;)
> >> > >
> >> > > Given that the page-allocating process is madly reclaiming pages in
> >> > > direct reclaim (yes?) and that kswapd is madly reclaiming pages on a
> >> > > different CPU, we should pretty promptly get into a situation where
> >> > > kswapd can suspend itself.  But that obviously isn't happening.  So
> >> > > what *is* going on?
> >> >
> >> > The triggering workload is a massive untar using a file on the same
> >> > filesystem, so that's a continuous stream of pages read into the cache
> >> > for the input and a stream of dirty pages out for the writes.  We
> >> > thought it might have been out of control shrinkers, so we already
> >> > debugged that and found it wasn't.
> >> > It just seems to be an imbalance in
> >> > the zones that the shrinkers can't fix which causes
> >> > sleeping_prematurely() to return true almost indefinitely.
> >>
> >> Is the untar disk-bound?  The untar has presumably hit the writeback
> >> dirty_ratio?  So its rate of page allocation is approximately equal to
> >> the write speed of the disks?
> >>
> >
> > A reasonable assumption but it gets messy.
> >
> >> If so, the VM is consuming 100% of a CPU to reclaim pages at a mere
> >> tens-of-megabytes-per-second.  If so, there's something seriously wrong
> >> here - under favorable conditions one would expect reclaim to free up
> >> 100,000 pages/sec, maybe more.
> >>
> >> If the untar is not disk-bound and the required page reclaim rate is
> >> equal to the rate at which a CPU can read, decompress and write to
> >> pagecache then, err, maybe possible.  But it still smells of
> >> inefficient reclaim.
> >>
> >
> > I think it's higher than just the rate of data but couldn't guess by
> > how much exactly. Reproducing this locally would have been nice but
> > the following conditions are likely happening on the problem machine.
> >
> >   SLUB is using high-orders for its slabs, kswapd and reclaimers are
> >   reclaiming at a faster rate than required for just the data. SLUB
> >   is using order-2 allocs for inodes so every 18 files created by
> >   untar, we need an order-2 page. For ext4_io_end, we need order-3
> >   allocs and we are allocating these due to delayed block allocation.
> >
> >   So for example: 50 files, each less than 1 page in size needs 50
> >   order-0 pages, 3 order-2 pages and 2 order-3 pages
> >
> >   To satisfy the high order pages, we are reclaiming at least 28
> >   pages. For compaction, we are migrating these so we are allocating
> >   a further 28 pages and then copying, putting further pressure on
> >   the system.
> >   We may do this multiple times as order-0 allocations
> >   could be breaking up the pages again. Without compaction, we are
> >   only reclaiming but can get stalled for significant periods of
> >   time if dirty or writeback pages are encountered in the contiguous
> >   blocks and can reclaim too many pages quite easily.
> >
> > So the rate of allocation required to write out data is higher than
> > just the data rate. The reclaim rate could be just fine but the number
> > of pages we need to reclaim to allocate slab objects can be screwy.
> >
> >> > > Secondly, taking an up-to-100ms sleep in response to a need_resched()
> >> > > seems pretty savage and I suspect it risks undesirable side-effects.  A
> >> > > plain old cond_resched() would be more cautious.  But presumably
> >> > > kswapd() is already running cond_resched() pretty frequently, so why
> >> > > didn't that work?
> >> >
> >> > So the specific problem with cond_resched() is that kswapd is still
> >> > runnable, so even if there's other work the system can be getting on
> >> > with, it quickly comes back to looping madly in kswapd.  If we return
> >> > false from sleeping_prematurely(), we stop kswapd until it's woken up to
> >> > do more work.  This manifests, even on non sandybridge systems that
> >> > don't hang, as a lot of time burned in kswapd.
> >> >
> >> > I think the sandybridge bug I see on the laptop is that cond_resched()
> >> > is somehow ineffective:  kswapd is usually hogging one CPU and there are
> >> > runnable processes but they seem to cluster on other CPUs, leaving
> >> > kswapd to spin at close to 100% system time.
> >> >
> >> > When the problem was first described, we tried sprinkling more
> >> > cond_rescheds() in the shrinker loop and it didn't work.
> >>
> >> Seems to me that kswapd for some reason is doing too much work.  Or,
> >> more specifically is doing its work very inefficiently.
> >> Making kswapd
> >> take arbitrary naps when it's misbehaving didn't fix that misbehaviour!
> >>
> >
> > It is likely to be doing work inefficiently in one of two ways
> >
> >  1. We are reclaiming far more pages than required by the data
> >     for slab objects
> >
> >  2. The rate we are reclaiming is fast enough that dirty pages are
> >     reaching the end of the LRU quickly
> >
> > The latter part is also important. I doubt we are getting stalled in
> > writepage as this is new data being written to disk so blocks aren't
> > allocated yet but kswapd is encountering the dirty_ratio of pages
> > on the LRU and churning them through the LRU and reclaims the clean
> > pages in between.
> >
> > In effect, this "sorts" the LRU lists so the dirty pages get grouped
> > together. At worst on a 2G system such as James', we have 104857
> > (20% of memory in pages) pages together on the LRU, all dirty and
> > all being skipped over by kswapd and direct reclaimers. This is at
> > least 3276 takings of the zone LRU lock assuming we isolate pages in
> > groups of SWAP_CLUSTER_MAX which is a lot of list walking and CPU usage
> > for no pages reclaimed.
> >
> > In this case, kswapd might as well take a brief nap as it can't clean
> > the pages so the flusher threads can get some work done.
> >
> >> It would be interesting to watch kswapd's page reclaim inefficiency
> >> when this is happening: /proc/vmstat:pgscan_kswapd_* versus
> >> /proc/vmstat:kswapd_steal.  If that ratio is high then kswapd is
> >> scanning many pages and not reclaiming them.
> >>
> >> But given the prominence of shrink_slab in the traces, perhaps that
> >> isn't happening.
> >>
> >
> > As we are aggressively shrinking slab, we can reach the stage where
> > we scan the requested number of objects and reclaim none of them
> > potentially setting zone->all_unreclaimable to 1 if a lot of scanning
> > has also taken place recently without pages being freed.
> > Once this
> > happens, kswapd isn't even trying to reclaim pages and is instead stuck
> > in shrink_slab until a page is freed clearing zone->all_unreclaimable
> > and zone->pages_scanned.
>
> Why does it get stuck in shrink_slab?
> If the zone is trouble to reclaim (ie, all_unreclaimable is set),
> kswapd will poll the zone only in case of DEF_PRIORITY (ie, small
> window) for when the problem goes away.

"stuck in shrink" was a poor choice of words. I should have said we
can spend a lot of time in there.

True, kswapd will only poll the zones while all_unreclaimable is set
but it only takes one page to be freed to the per-cpu list to clear
all_unreclaimable again. Once any zone has all_unreclaimable cleared,
the watermarks are checked but with enough direct reclaimers, it's
possible watermarks are met so shrink_zone() is not called but
shrink_slab() is called anyway. Depending on the result,
all_unreclaimable can get set again (possibly incorrectly, as there
are simply no reclaimable slab objects rather than the zone being
truly unreclaimable).

Another scenario is all zones except ZONE_DMA have all_unreclaimable
set when kswapd runs. kswapd finds the watermarks to be ok as the zone
is only lightly used so skips shrink_zone() but calls shrink_slab()
anyway.

Both of these situations would allow kswapd to use a lot of CPU while
spending a significant percentage of it in shrink_slab().

-- 
Mel Gorman
SUSE Labs