* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). [not found] <201405192340.FCD48964.OFQHOOJLVSFFMt@I-love.SAKURA.ne.jp> @ 2014-05-20 0:44 ` Dave Chinner 2014-05-20 3:54 ` Tetsuo Handa 2014-05-20 5:59 ` Andrew Morton 0 siblings, 2 replies; 15+ messages in thread From: Dave Chinner @ 2014-05-20 0:44 UTC (permalink / raw) To: Tetsuo Handa Cc: riel, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu [cc xfs@oss.sgi.com] On Mon, May 19, 2014 at 11:40:46PM +0900, Tetsuo Handa wrote: > >From f016db5d7f84d6321132150b13c5888ef67d694f Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Date: Mon, 19 May 2014 23:24:11 +0900 > Subject: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). > > I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when > too many pages are isolated already" causes RHEL7 environment to stall with > 0% CPU usage when a certain type of memory pressure is given. > > Upon memory pressure, kswapd calls xfs_vm_writepage() from shrink_page_list(). > xfs_vm_writepage() eventually calls wait_for_completion() which waits for > xfs_bmapi_allocate_worker(). > > Then, a kernel worker thread calls xfs_bmapi_allocate_worker() and > xfs_bmapi_allocate_worker() eventually calls xfs_btree_lookup_get_block(). > xfs_btree_lookup_get_block() eventually calls alloc_page(). > alloc_page() eventually calls shrink_inactive_list(). > > The stack trace showed that the kernel worker thread which the kswapd is > waiting for was looping at a while loop in shrink_inactive_list(). [snip stack traces indicating XFS is deferring allocation work to the allocation workqueue on behalf of kswapd] > Since the kernel worker thread needs to escape from the while loop so that > alloc_page() can allocate memory (and eventually allow xfs_vm_writepage() > to release memory), I think that we should not block forever. 
This patch > introduces a 30 second timeout for userspace processes and a 5 second timeout > for kernel processes. That's like trying to swat a fly with a runaway train. Eventually you'll hit the fly with the train.... Q: What is missing from the workqueue context that results in the workqueue being blocked in memory reclaim on congestion? Hint: XFS is deferring allocation on behalf of kswapd to a workqueue. A: PF_KSWAPD. shrink_inactive_list() has this guard: /* * Stall direct reclaim for IO completions if underlying BDIs or zone * is congested. Allow kswapd to continue until it starts encountering * unqueued dirty pages or cycling through the LRU too quickly. */ if (!sc->hibernation_mode && !current_is_kswapd()) wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); So, XFS should be passing kswapd context to the workqueue allocation context. The patch below does this. Tetsuo-san, when it comes to problems involving XFS, you should really CC xfs@oss.sgi.com because very few people really know how XFS works and even fewer still know how it is supposed to interact with memory reclaim.... Cheers, Dave. -- Dave Chinner david@fromorbit.com xfs: block allocation work needs to be kswapd aware From: Dave Chinner <dchinner@redhat.com> Upon memory pressure, kswapd calls xfs_vm_writepage() from shrink_page_list(). This can result in delayed allocation occurring and that gets deferred to the allocation workqueue. The allocation then runs outside kswapd context, which means if it needs memory (and it does to demand page metadata from disk) it can block in shrink_inactive_list() waiting for IO congestion. These blocking waits are normally avoided in kswapd context, so under memory pressure writeback from kswapd can be arbitrarily delayed by memory reclaim. 
To avoid this, pass the kswapd context to the allocation being done by the workqueue, so that memory reclaim understands correctly that the work is being done for kswapd and therefore it is not blocked and does not delay memory reclaim. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_bmap_util.c | 16 +++++++++++++--- fs/xfs/xfs_bmap_util.h | 1 + 2 files changed, 14 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 057f671..703b3ec 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker( struct xfs_bmalloca *args = container_of(work, struct xfs_bmalloca, work); unsigned long pflags; + unsigned long new_pflags = PF_FSTRANS; - /* we are in a transaction context here */ - current_set_flags_nested(&pflags, PF_FSTRANS); + /* + * we are in a transaction context here, but may also be doing work + * in kswapd context, and hence we may need to inherit that state + * temporarily to ensure that we don't block waiting for memory reclaim + * in any way. 
+ */ + if (args->kswapd) + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; + + current_set_flags_nested(&pflags, new_pflags); args->result = __xfs_bmapi_allocate(args); complete(args->done); - current_restore_flags_nested(&pflags, PF_FSTRANS); + current_restore_flags_nested(&pflags, new_pflags); } /* @@ -284,6 +293,7 @@ xfs_bmapi_allocate( args->done = &done; + args->kswapd = current_is_kswapd(); INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker); queue_work(xfs_alloc_wq, &args->work); wait_for_completion(&done); diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h index 935ed2b..d227f9d 100644 --- a/fs/xfs/xfs_bmap_util.h +++ b/fs/xfs/xfs_bmap_util.h @@ -56,6 +56,7 @@ struct xfs_bmalloca { char aeof; /* allocated space at eof */ char conv; /* overwriting unwritten extents */ char stack_switch; + char kswapd; /* allocation in kswapd context */ int flags; struct completion *done; struct work_struct work; _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 15+ messages in thread
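Dave's patch depends on the nested set/restore helpers behaving correctly: restoring must drop only the bits the worker set, while preserving any of those bits that were already set when it entered. A minimal userspace sketch of that save/OR/restore discipline — the PF_* values are copied from 3.x-era kernel headers, but `set_flags_nested()`, `restore_flags_nested()`, `bmapi_allocate_worker()`, and `task_flags` here are illustrative stand-ins for the kernel's `current_set_flags_nested()`/`current_restore_flags_nested()` macros and `current->flags`, not the real implementation:

```c
/* PF_* bit values as in 3.x-era include/linux/sched.h */
#define PF_MEMALLOC   0x00000800u
#define PF_KSWAPD     0x00040000u
#define PF_SWAPWRITE  0x00800000u
#define PF_FSTRANS    0x02000000u

static unsigned long task_flags;   /* stand-in for current->flags */

/* Model of current_set_flags_nested(): remember the old flags word,
 * then OR in the bits this context needs. */
static void set_flags_nested(unsigned long *save, unsigned long flags)
{
    *save = task_flags;
    task_flags |= flags;
}

/* Model of current_restore_flags_nested(): clear the bits we set, but
 * put back any of them that were already set before we nested, so an
 * outer nesting level is not disturbed. */
static void restore_flags_nested(const unsigned long *save,
                                 unsigned long flags)
{
    task_flags = (task_flags & ~flags) | (*save & flags);
}

/* The shape of the patched worker: inherit kswapd state only when the
 * work was queued on kswapd's behalf. */
static void bmapi_allocate_worker(int queued_by_kswapd)
{
    unsigned long new_pflags = PF_FSTRANS;
    unsigned long pflags;

    if (queued_by_kswapd)
        new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;

    set_flags_nested(&pflags, new_pflags);
    /* ... __xfs_bmapi_allocate() would run here, seen as kswapd ... */
    restore_flags_nested(&pflags, new_pflags);
}
```

Note the restore expression: it is what lets the pattern nest, since a bit such as PF_FSTRANS that an outer level had already set survives the inner restore.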
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 0:44 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Dave Chinner @ 2014-05-20 3:54 ` Tetsuo Handa 2014-05-20 5:24 ` Dave Chinner 2014-05-20 5:59 ` Andrew Morton 1 sibling, 1 reply; 15+ messages in thread From: Tetsuo Handa @ 2014-05-20 3:54 UTC (permalink / raw) To: Dave Chinner Cc: riel, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu Dave Chinner wrote: > So, XFS should be passing kswapd context to the workqueue allocation > context. The patch below does this. > > Tetsuo-san, when it comes to problems involving XFS, you should > really CC xfs@oss.sgi.com because very few people really know how > XFS works and even fewer still know how it is supposed to interact > with memory reclaim.... Thank you for the patch, but ... #define PF_KSWAPD 0x00040000 /* I am kswapd */ static inline int current_is_kswapd(void) { return current->flags & PF_KSWAPD; } I think ((char) (current->flags & 0x00040000)) == 0. Your patch wants -args->kswapd = current_is_kswapd(); +args->kswapd = (current_is_kswapd() != 0); _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
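Tetsuo's point can be demonstrated in a few lines of userspace C: PF_KSWAPD is bit 18, so storing the raw return value of the int-returning `current_is_kswapd()` of that era into a `char` field silently truncates it to 0. A sketch, with the flag value copied from the kernel header and `task_flags` standing in for `current->flags`:

```c
#define PF_KSWAPD 0x00040000u   /* bit 18: far above what a char keeps */

static unsigned long task_flags = PF_KSWAPD;  /* pretend we are kswapd */

/* The predicate as it stood in 2014: nonzero when true, but not 1. */
static int current_is_kswapd(void)
{
    return task_flags & PF_KSWAPD;   /* returns 0x40000, never 1 */
}
```

Assigning that 0x40000 to `char kswapd` keeps only the low 8 bits, all of which are zero, so the worker would never see the kswapd flag without the `!= 0` fix.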
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 3:54 ` Tetsuo Handa @ 2014-05-20 5:24 ` Dave Chinner 0 siblings, 0 replies; 15+ messages in thread From: Dave Chinner @ 2014-05-20 5:24 UTC (permalink / raw) To: Tetsuo Handa Cc: riel, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu On Tue, May 20, 2014 at 12:54:29PM +0900, Tetsuo Handa wrote: > Dave Chinner wrote: > > So, XFS should be passing kswapd context to the workqueue allocation > > context. The patch below does this. > > > > Tetsuo-san, when it comes to problems involving XFS, you should > > really CC xfs@oss.sgi.com because very few people really know how > > XFS works and even fewer still know how it is supposed to interact > > with memory reclaim.... > > Thank you for the patch, but ... > > #define PF_KSWAPD 0x00040000 /* I am kswapd */ > > static inline int current_is_kswapd(void) > { > return current->flags & PF_KSWAPD; > } > I think ((char) (current->flags & 0x00040000)) == 0. > Your patch wants > > -args->kswapd = current_is_kswapd(); > +args->kswapd = (current_is_kswapd() != 0); Thanks for pointing that out, but I think: -static inline int current_is_kswapd(void) +static inline bool current_is_kswapd(void) is a better solution. It can only be true or false. But regardless, I need to change the boolean options in that XFS structure to be, well, booleans. Cheers, dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
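Dave's alternative fix works because C's `_Bool` conversion normalizes any nonzero value to 1 before a narrowing assignment can discard it. A sketch of the bool-returning predicate — `task_flags` is an illustrative stand-in for `current->flags`:

```c
#include <stdbool.h>

#define PF_KSWAPD 0x00040000u

static unsigned long task_flags = PF_KSWAPD;  /* pretend we are kswapd */

/* With a bool return type the masked value (0x40000) is converted to
 * true, i.e. 1, so storing the result in a char field is now safe. */
static bool current_is_kswapd(void)
{
    return task_flags & PF_KSWAPD;
}
```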
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 0:44 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Dave Chinner 2014-05-20 3:54 ` Tetsuo Handa @ 2014-05-20 5:59 ` Andrew Morton 2014-05-20 6:03 ` Andrew Morton ` (2 more replies) 1 sibling, 3 replies; 15+ messages in thread From: Andrew Morton @ 2014-05-20 5:59 UTC (permalink / raw) To: Dave Chinner Cc: riel, Tetsuo Handa, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote: > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker( > struct xfs_bmalloca *args = container_of(work, > struct xfs_bmalloca, work); > unsigned long pflags; > + unsigned long new_pflags = PF_FSTRANS; > > - /* we are in a transaction context here */ > - current_set_flags_nested(&pflags, PF_FSTRANS); > + /* > + * we are in a transaction context here, but may also be doing work > + * in kswapd context, and hence we may need to inherit that state > + * temporarily to ensure that we don't block waiting for memory reclaim > + * in any way. > + */ > + if (args->kswapd) > + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; So current_is_kswapd() returns true for a thread which is not kswapd. That's a bit smelly. Should this thread really be incrementing KSWAPD_INODESTEAL instead of PGINODESTEAL, for example? current_is_kswapd() does a range of things, only one(?) of which you actually want. It would be cleaner to create a new PF_ flag to select just that behavior. That's a better model than telling the world "I am magic and special". But we're awfully close to running out of PF_ space and I don't know if this ugly justifies consuming a flag. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 5:59 ` Andrew Morton @ 2014-05-20 6:03 ` Andrew Morton 2014-05-20 6:33 ` Dave Chinner 2014-05-20 6:30 ` Dave Chinner 2014-05-20 7:20 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Christoph Hellwig 2 siblings, 1 reply; 15+ messages in thread From: Andrew Morton @ 2014-05-20 6:03 UTC (permalink / raw) To: Dave Chinner, Tetsuo Handa, riel, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu, linux-kernel, xfs On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote: > > > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker( > > struct xfs_bmalloca *args = container_of(work, > > struct xfs_bmalloca, work); > > unsigned long pflags; > > + unsigned long new_pflags = PF_FSTRANS; > > > > - /* we are in a transaction context here */ > > - current_set_flags_nested(&pflags, PF_FSTRANS); > > + /* > > + * we are in a transaction context here, but may also be doing work > > + * in kswapd context, and hence we may need to inherit that state > > + * temporarily to ensure that we don't block waiting for memory reclaim > > + * in any way. > > + */ > > + if (args->kswapd) > > + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > So current_is_kswapd() returns true for a thread which is not kswapd. > That's a bit smelly. > > Should this thread really be incrementing KSWAPD_INODESTEAL instead of > PGINODESTEAL, for example? current_is_kswapd() does a range of things, > only one(?) of which you actually want. > > It would be cleaner to create a new PF_ flag to select just that > behavior. That's a better model than telling the world "I am magic and > special". Or a new __GFP_FLAG. > But we're awfully close to running out of PF_ space and I don't know if > this ugly justifies consuming a flag. 
> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 6:03 ` Andrew Morton @ 2014-05-20 6:33 ` Dave Chinner 0 siblings, 0 replies; 15+ messages in thread From: Dave Chinner @ 2014-05-20 6:33 UTC (permalink / raw) To: Andrew Morton Cc: riel, Tetsuo Handa, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu On Mon, May 19, 2014 at 11:03:11PM -0700, Andrew Morton wrote: > On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote: > > > > > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker( > > > struct xfs_bmalloca *args = container_of(work, > > > struct xfs_bmalloca, work); > > > unsigned long pflags; > > > + unsigned long new_pflags = PF_FSTRANS; > > > > > > - /* we are in a transaction context here */ > > > - current_set_flags_nested(&pflags, PF_FSTRANS); > > > + /* > > > + * we are in a transaction context here, but may also be doing work > > > + * in kswapd context, and hence we may need to inherit that state > > > + * temporarily to ensure that we don't block waiting for memory reclaim > > > + * in any way. > > > + */ > > > + if (args->kswapd) > > > + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > > > So current_is_kswapd() returns true for a thread which is not kswapd. > > That's a bit smelly. > > > > Should this thread really be incrementing KSWAPD_INODESTEAL instead of > > PGINODESTEAL, for example? current_is_kswapd() does a range of things, > > only one(?) of which you actually want. > > > > It would be cleaner to create a new PF_ flag to select just that > > behavior. That's a better model than telling the world "I am magic and > > special". > > Or a new __GFP_FLAG. 
Sure - and with that XFS will need another PF_ flag to tell the memory allocator to set the new __GFP_FLAG on every allocation done in that kworker task context, just like it uses PF_FSTRANS to ensure that __GFP_NOFS is set for all the allocations in that kworker context.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
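The PF_FSTRANS precedent Dave mentions works by rewriting the GFP flags of every allocation made while the task flag is set. A userspace sketch of that pattern — the GFP bit values and the helper name `adjust_gfp_for_task()` are made up for illustration (in 2014 XFS did this masking in its own kmem wrappers rather than in core mm):

```c
#define __GFP_IO    0x40u        /* illustrative bit values */
#define __GFP_FS    0x80u
#define PF_FSTRANS  0x02000000u

static unsigned long task_flags;  /* stand-in for current->flags */

/* A task-wide PF_ flag adjusts the gfp mask of every allocation made
 * in that context, so callees deep in the stack cannot accidentally
 * recurse into the filesystem while a transaction is held. A new
 * __GFP_ flag for "kswapd-like" behaviour would need the same
 * treatment: a PF_ flag that ORs it in here for the whole worker. */
static unsigned int adjust_gfp_for_task(unsigned int gfp_mask)
{
    if (task_flags & PF_FSTRANS)
        gfp_mask &= ~__GFP_FS;
    return gfp_mask;
}
```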
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 5:59 ` Andrew Morton 2014-05-20 6:03 ` Andrew Morton @ 2014-05-20 6:30 ` Dave Chinner 2014-05-20 14:58 ` Tetsuo Handa 2014-05-20 7:20 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Christoph Hellwig 2 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2014-05-20 6:30 UTC (permalink / raw) To: Andrew Morton Cc: riel, Tetsuo Handa, linux-kernel, xfs, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote: > On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote: > > > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker( > > struct xfs_bmalloca *args = container_of(work, > > struct xfs_bmalloca, work); > > unsigned long pflags; > > + unsigned long new_pflags = PF_FSTRANS; > > > > - /* we are in a transaction context here */ > > - current_set_flags_nested(&pflags, PF_FSTRANS); > > + /* > > + * we are in a transaction context here, but may also be doing work > > + * in kswapd context, and hence we may need to inherit that state > > + * temporarily to ensure that we don't block waiting for memory reclaim > > + * in any way. > > + */ > > + if (args->kswapd) > > + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > So current_is_kswapd() returns true for a thread which is not kswapd. > That's a bit smelly. > > Should this thread really be incrementing KSWAPD_INODESTEAL instead of > PGINODESTEAL, for example? current_is_kswapd() does a range of things, > only one(?) of which you actually want. It's doing work on behalf of kswapd under kswapd constraints, so it should both behave and be accounted as if kswapd was executing the work directly. > It would be cleaner to create a new PF_ flag to select just that > behavior. That's a better model than telling the world "I am magic and > special". However, it is magic and special because of who the work needs to be done for. 
> But we're awfully close to running out of PF_ space and I don't know if > this ugly justifies consuming a flag. I don't really care enough to argue over what mechanism should be used. I'll push this patch through the XFS tree, and when a new flag or generic mechanism for pushing task contexts to kworker threads is provided, I'll change the XFS code to use that.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 6:30 ` Dave Chinner @ 2014-05-20 14:58 ` Tetsuo Handa 2014-05-20 16:12 ` Motohiro Kosaki 0 siblings, 1 reply; 15+ messages in thread From: Tetsuo Handa @ 2014-05-20 14:58 UTC (permalink / raw) To: david, riel Cc: linux-kernel, xfs, hch, kosaki.motohiro, akpm, fengguang.wu, kamezawa.hiroyu Today I discussed with Kosaki-san at LinuxCon Japan 2014 about this issue. He does not like the idea of adding a timeout to the throttle loop. As Dave posted a patch that fixes a bug in XFS delayed allocation, I updated my patch accordingly. Although the bug in XFS was fixed by Dave's patch, other kernel code may have bugs which would fall into this infinite throttle loop. But to keep the possibility of triggering the OOM killer to a minimum, can we agree on this updated patch (and in the future add some warning mechanism like /proc/sys/kernel/hung_task_timeout_secs for detecting memory allocation stalls)? Dave, if you are OK with this updated patch, please let me know the commit ID of your patch. Regards. ---------- >From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Date: Tue, 20 May 2014 23:34:34 +0900 Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list(). I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when too many pages are isolated already" causes a RHEL7 environment to stall with 0% CPU usage when a certain type of memory pressure is given. This is because nobody can reclaim memory due to the rules listed below. (a) XFS uses a kernel worker thread for delayed allocation (b) kswapd wakes up the kernel worker thread for delayed allocation (c) the kernel worker thread is throttled due to commit 35cd7815 This patch and commit XXXXXXXX "xfs: block allocation work needs to be kswapd aware" will solve rule (c). 
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> --- mm/vmscan.c | 20 +++++++++++++++----- 1 files changed, 15 insertions(+), 5 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 32c661d..5c6960e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, struct zone *zone = lruvec_zone(lruvec); struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; - while (unlikely(too_many_isolated(zone, file, sc))) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + /* + * Throttle only direct reclaimers. Allocations by kswapd (and + * allocation workqueue on behalf of kswapd) should not be + * throttled here; otherwise memory allocation will deadlock. + */ + if (!sc->hibernation_mode && !current_is_kswapd()) { + while (unlikely(too_many_isolated(zone, file, sc))) { + congestion_wait(BLK_RW_ASYNC, HZ/10); - /* We are about to die and free our memory. Return now. */ - if (fatal_signal_pending(current)) - return SWAP_CLUSTER_MAX; + /* + * We are about to die and free our memory. + * Return now. + */ + if (fatal_signal_pending(current)) + return SWAP_CLUSTER_MAX; + } } lru_add_drain(); -- 1.7.1 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 15+ messages in thread
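The control flow Tetsuo's patch aims for can be modeled in userspace: direct reclaimers spin in the throttle loop, while kswapd (and, with Dave's XFS patch, workers carrying PF_KSWAPD on its behalf) skip it entirely. A sketch — `maybe_throttle()`, the `isolated_pressure` counter, and the `waits` counter are fakes standing in for the head of `shrink_inactive_list()`, `too_many_isolated()` state, and `congestion_wait()` calls:

```c
#define PF_KSWAPD 0x00040000u

struct scan_control { int hibernation_mode; };

static unsigned long task_flags;    /* stand-in for current->flags */
static int isolated_pressure;       /* fake too_many_isolated() state */
static int waits;                   /* congestion_wait() call count */

static int too_many_isolated(void) { return isolated_pressure > 0; }

/* Model of the reworked head of shrink_inactive_list(): only direct
 * reclaimers are throttled; kswapd context must never block here or
 * the whole reclaim pipeline can deadlock behind it. */
static void maybe_throttle(const struct scan_control *sc)
{
    if (sc->hibernation_mode || (task_flags & PF_KSWAPD))
        return;

    while (too_many_isolated()) {
        waits++;              /* congestion_wait(BLK_RW_ASYNC, HZ/10) */
        isolated_pressure--;  /* pretend reclaim catches up meanwhile */
    }
}
```

The decisive property is the second assertion below: a task flagged PF_KSWAPD never enters the loop, which is exactly why the XFS worker must inherit the flag.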
* RE: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-20 14:58 ` Tetsuo Handa @ 2014-05-20 16:12 ` Motohiro Kosaki 2014-05-26 11:45 ` [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list() Tetsuo Handa 0 siblings, 1 reply; 15+ messages in thread From: Motohiro Kosaki @ 2014-05-20 16:12 UTC (permalink / raw) To: Tetsuo Handa, david@fromorbit.com, riel@redhat.com Cc: linux-kernel@vger.kernel.org, xfs@oss.sgi.com, hch@infradead.org, Motohiro Kosaki JP, akpm@linux-foundation.org, fengguang.wu@intel.com, kamezawa.hiroyu@jp.fujitsu.com > -----Original Message----- > From: Tetsuo Handa [mailto:penguin-kernel@I-love.SAKURA.ne.jp] > Sent: Tuesday, May 20, 2014 11:58 PM > To: david@fromorbit.com; riel@redhat.com > Cc: Motohiro Kosaki JP; fengguang.wu@intel.com; kamezawa.hiroyu@jp.fujitsu.com; akpm@linux-foundation.org; > hch@infradead.org; linux-kernel@vger.kernel.org; xfs@oss.sgi.com > Subject: Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). > > Today I discussed with Kosaki-san at LinuxCon Japan 2014 about this issue. > He does not like the idea of adding timeout to throttle loop. As Dave posted a patch that fixes a bug in XFS delayed allocation, I > updated my patch accordingly. > > Although the bug in XFS was fixed by Dave's patch, other kernel code would have bugs which would fall into this infinite throttle loop. > But to keep the possibility of triggering OOM killer minimum, can we agree with this updated patch (and in the future adding some > warning mechanism like /proc/sys/kernel/hung_task_timeout_secs for detecting memory allocation stall)? > > Dave, if you are OK with this updated patch, please let me know commit ID of your patch. > > Regards. > ---------- > >From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Date: Tue, 20 May 2014 23:34:34 +0900 > Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list(). 
> > I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when too many pages are isolated already" causes RHEL7 > environment to stall with 0% CPU usage when a certain type of memory pressure is given. > This is because nobody can reclaim memory due to rules listed below. > > (a) XFS uses a kernel worker thread for delayed allocation > (b) kswapd wakes up the kernel worker thread for delayed allocation > (c) the kernel worker thread is throttled due to commit 35cd7815 > > This patch and commit XXXXXXXX "xfs: block allocation work needs to be kswapd aware" will solve rule (c). > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > --- > mm/vmscan.c | 20 +++++++++++++++----- > 1 files changed, 15 insertions(+), 5 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 32c661d..5c6960e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, > struct zone *zone = lruvec_zone(lruvec); > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; > > - while (unlikely(too_many_isolated(zone, file, sc))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + /* > + * Throttle only direct reclaimers. Allocations by kswapd (and > + * allocation workqueue on behalf of kswapd) should not be > + * throttled here; otherwise memory allocation will deadlock. > + */ > + if (!sc->hibernation_mode && !current_is_kswapd()) { > + while (unlikely(too_many_isolated(zone, file, sc))) { > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > - /* We are about to die and free our memory. Return now. */ > - if (fatal_signal_pending(current)) > - return SWAP_CLUSTER_MAX; > + /* > + * We are about to die and free our memory. > + * Return now. 
> + */ > + if (fatal_signal_pending(current)) > + return SWAP_CLUSTER_MAX; > + } > } Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Dave, I don't like Tetsuo's first patch because this too_many_isolated() exists to prevent false OOM kills, so a simple timeout would resurrect them. Please let me know if you need further MM enhancements to solve the XFS issue. I'd like to join and assist with this. Thanks. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list(). 2014-05-20 16:12 ` Motohiro Kosaki @ 2014-05-26 11:45 ` Tetsuo Handa 2014-06-03 21:43 ` David Rientjes 0 siblings, 1 reply; 15+ messages in thread From: Tetsuo Handa @ 2014-05-26 11:45 UTC (permalink / raw) To: Motohiro.Kosaki, david, riel Cc: linux-kernel, xfs, hch, kosaki.motohiro, akpm, fengguang.wu, kamezawa.hiroyu Tetsuo Handa wrote: > Dave, if you are OK with this updated patch, please let me know commit ID of your patch. I overlooked that too_many_isolated() in mm/vmscan.c already has if (current_is_kswapd()) return 0; lines. Therefore, it turned out that my patch is irrelevant to XFS issue. Does patch shown below make sense? Regards. ---------- >From e133aadd65df3d8622efced3443888ed3e60327b Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Date: Mon, 26 May 2014 20:37:12 +0900 Subject: [PATCH] mm/vmscan: Do not throttle when hibernation_mode != 0. In shrink_inactive_list(), we do not insert delay at if (!sc->hibernation_mode && !current_is_kswapd()) wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); if sc->hibernation_mode != 0. Follow the same reason, we should not insert delay at while (unlikely(too_many_isolated(zone, file, sc))) { congestion_wait(BLK_RW_ASYNC, HZ/10); /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) return SWAP_CLUSTER_MAX; } if sc->hibernation_mode != 0. 
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> --- mm/vmscan.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 32c661d..89c42ca 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file, if (current_is_kswapd()) return 0; + if (sc->hibernation_mode) + return 0; + if (!global_reclaim(sc)) return 0; -- 1.7.1 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-05-26 11:45 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa @ 2014-06-03 21:43 ` David Rientjes 2014-06-05 12:45 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa 0 siblings, 1 reply; 15+ messages in thread From: David Rientjes @ 2014-06-03 21:43 UTC (permalink / raw) To: Tetsuo Handa Cc: riel, linux-kernel, xfs, hch, kosaki.motohiro, Motohiro.Kosaki, fengguang.wu, akpm, kamezawa.hiroyu On Mon, 26 May 2014, Tetsuo Handa wrote: > In shrink_inactive_list(), we do not insert delay at > > if (!sc->hibernation_mode && !current_is_kswapd()) > wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); > > if sc->hibernation_mode != 0. > Follow the same reason, we should not insert delay at > > while (unlikely(too_many_isolated(zone, file, sc))) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > > /* We are about to die and free our memory. Return now. */ > if (fatal_signal_pending(current)) > return SWAP_CLUSTER_MAX; > } > > if sc->hibernation_mode != 0. > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > --- > mm/vmscan.c | 3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 32c661d..89c42ca 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file, > if (current_is_kswapd()) > return 0; > > + if (sc->hibernation_mode) > + return 0; > + > if (!global_reclaim(sc)) > return 0; > This isn't the only too_many_isolated() function that does a delay; how is the too_many_isolated() in mm/compaction.c different? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list(). 2014-06-03 21:43 ` David Rientjes @ 2014-06-05 12:45 ` Tetsuo Handa 2014-06-05 13:17 ` Dave Chinner 0 siblings, 1 reply; 15+ messages in thread From: Tetsuo Handa @ 2014-06-05 12:45 UTC (permalink / raw) To: rientjes, Motohiro.Kosaki Cc: riel, linux-kernel, xfs, hch, kosaki.motohiro, akpm, fengguang.wu, kamezawa.hiroyu David Rientjes wrote: > On Mon, 26 May 2014, Tetsuo Handa wrote: > > > In shrink_inactive_list(), we do not insert delay at > > > > if (!sc->hibernation_mode && !current_is_kswapd()) > > wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); > > > > if sc->hibernation_mode != 0. > > Follow the same reason, we should not insert delay at > > > > while (unlikely(too_many_isolated(zone, file, sc))) { > > congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > /* We are about to die and free our memory. Return now. */ > > if (fatal_signal_pending(current)) > > return SWAP_CLUSTER_MAX; > > } > > > > if sc->hibernation_mode != 0. > > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > --- > > mm/vmscan.c | 3 +++ > > 1 files changed, 3 insertions(+), 0 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 32c661d..89c42ca 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file, > > if (current_is_kswapd()) > > return 0; > > > > + if (sc->hibernation_mode) > > + return 0; > > + > > if (!global_reclaim(sc)) > > return 0; > > > > This isn't the only too_many_isolated() functions that do a delay, how is > the too_many_isolated() in mm/compaction.c different? > I don't know. But today I realized that this patch is not sufficient. I'm trying to find why __alloc_pages_slowpath() cannot return for many minutes when a certain type of memory pressure is given on a RHEL7 environment with 4 CPU / 2GB RAM. 
Today I tried to use ftrace for examining the breakdown of time-consuming functions inside __alloc_pages_slowpath(). But on the first run, all processes were trapped in this too_many_isolated()/congestion_wait() loop while kswapd was not running; stalling forever because nobody could perform the operations needed to make too_many_isolated() return 0. This means that, under rare circumstances, it is possible that all processes other than kswapd are trapped in the too_many_isolated()/congestion_wait() loop while kswapd is sleeping, because this loop assumes that somebody else shall wake up kswapd and kswapd shall perform the operations that make too_many_isolated() return 0. However, we cannot guarantee that kswapd is woken by somebody, nor that kswapd is not blocked by blocking operations inside shrinker functions (e.g. mutex_lock()). We need some more changes. I'm thinking of a memory allocation watchdog thread. Add an "unsigned long" field to "struct task_struct", set jiffies to the field upon entry of GFP_WAIT-able memory allocation attempts, and clear the field upon returning from GFP_WAIT-able memory allocation attempts. A kernel thread periodically scans the task list and compares the field with jiffies, and (at least) prints warning messages (maybe optionally triggers the OOM killer or a kernel panic) if a single memory allocation attempt is taking too long (e.g. 60 seconds). What do you think? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 15+ messages in thread
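The watchdog Tetsuo proposes reduces to a per-task timestamp plus a periodic scan. A userspace sketch of just the bookkeeping, with plain integers standing in for jiffies and a small array standing in for the task list — the names `task_rec`, `alloc_enter()`, `alloc_exit()`, and `watchdog_scan()` are made up for illustration:

```c
#include <stddef.h>

/* One record per task: when its current GFP_WAIT-able allocation began,
 * or 0 if the task is not inside the allocator. */
struct task_rec {
    const char *comm;
    unsigned long alloc_start;   /* jiffies at allocation entry, 0 = idle */
};

/* Would be called at entry/exit of __alloc_pages_slowpath(). */
static void alloc_enter(struct task_rec *t, unsigned long now) { t->alloc_start = now; }
static void alloc_exit(struct task_rec *t)                     { t->alloc_start = 0; }

/* Periodic scan: count tasks whose single allocation attempt has been
 * running longer than the threshold (e.g. 60 * HZ in the real thing). */
static int watchdog_scan(struct task_rec *tasks, size_t n,
                         unsigned long now, unsigned long threshold)
{
    int stalled = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (tasks[i].alloc_start && now - tasks[i].alloc_start > threshold)
            stalled++;   /* real code would print a warning here */
    return stalled;
}
```

Clearing the field on exit is what keeps the scan cheap and false-positive-free: only tasks currently stuck inside a single allocation attempt can ever trip the threshold.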
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-06-05 12:45     ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
@ 2014-06-05 13:17       ` Dave Chinner
  2014-06-06 12:19         ` Tetsuo Handa
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-06-05 13:17 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: riel, linux-kernel, xfs, hch, kosaki.motohiro, rientjes,
	Motohiro.Kosaki, fengguang.wu, akpm, kamezawa.hiroyu

On Thu, Jun 05, 2014 at 09:45:26PM +0900, Tetsuo Handa wrote:
> David Rientjes wrote:
> > On Mon, 26 May 2014, Tetsuo Handa wrote:
[snip quoted patch adding an sc->hibernation_mode check to too_many_isolated()]
> > This isn't the only too_many_isolated() function that inserts a delay; how
> > is the too_many_isolated() in mm/compaction.c different?
> >
> I don't know.
> But today I realized that this patch is not sufficient.
>
> I'm trying to find out why __alloc_pages_slowpath() cannot return for many
> minutes when a certain type of memory pressure is applied to a RHEL7
> environment with 4 CPUs / 2GB RAM. Today I tried to use ftrace to examine
> the breakdown of time-consuming functions inside __alloc_pages_slowpath().
> But on the first run, all processes were trapped in this
> too_many_isolated()/congestion_wait() loop while kswapd was not running;
> the system stalled forever because nobody could perform the operations
> needed to make too_many_isolated() return 0.
>
> This means that, under rare circumstances, it is possible for all processes
> other than kswapd to be trapped in the too_many_isolated()/congestion_wait()
> loop while kswapd is sleeping, because this loop assumes that somebody else
> will wake up kswapd and that kswapd will perform the operations needed to
> make too_many_isolated() return 0. However, we can guarantee neither that
> kswapd will be woken by somebody nor that kswapd will avoid blocking
> operations inside shrinker functions (e.g. mutex_lock()).

So what you are saying is that kswapd is having problems with
getting blocked on locks held by processes in direct reclaim?

What are the stack traces that demonstrate such a dependency loop?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-06-05 13:17       ` Dave Chinner
@ 2014-06-06 12:19         ` Tetsuo Handa
  0 siblings, 0 replies; 15+ messages in thread
From: Tetsuo Handa @ 2014-06-06 12:19 UTC (permalink / raw)
  To: david
  Cc: riel, linux-kernel, xfs, hch, kosaki.motohiro, rientjes,
	Motohiro.Kosaki, fengguang.wu, akpm, kamezawa.hiroyu

Dave Chinner wrote:
> On Thu, Jun 05, 2014 at 09:45:26PM +0900, Tetsuo Handa wrote:
> > This means that, under rare circumstances, it is possible for all processes
> > other than kswapd to be trapped in the too_many_isolated()/congestion_wait()
> > loop while kswapd is sleeping, because this loop assumes that somebody else
> > will wake up kswapd and that kswapd will perform the operations needed to
> > make too_many_isolated() return 0. However, we can guarantee neither that
> > kswapd will be woken by somebody nor that kswapd will avoid blocking
> > operations inside shrinker functions (e.g. mutex_lock()).
>
> So what you are saying is that kswapd is having problems with
> getting blocked on locks held by processes in direct reclaim?
>
> What are the stack traces that demonstrate such a dependency loop?

If a normal task's GFP_KERNEL memory allocation calls a shrinker function,
and that shrinker function performs a GFP_WAIT-able allocation while holding
a mutex, isn't there a possibility that kswapd is woken up by the
GFP_WAIT-able allocation, calls the same shrinker function, and then blocks
trying to acquire the same mutex inside it?

Since ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan() hold a mutex
and ttm_dma_pool_shrink_scan() performs a GFP_WAIT-able allocation, I think
such a dependency loop is possible.
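The dependency loop described here can be spelled out step by step. The
following is pseudocode, not compilable kernel code; the ttm function names
are from the mail, and `_manager->lock` stands for whichever mutex the ttm
DMA pool shrinker actually holds:

```c
/* Task A: a normal GFP_KERNEL allocation enters direct reclaim and
 * reaches the ttm shrinker: */
ttm_dma_pool_shrink_scan()
	mutex_lock(&_manager->lock)        /* A now holds the pool mutex */
	alloc_pages(GFP_KERNEL)            /* GFP_WAIT-able allocation...*/
		wake_all_kswapds()         /* ...wakes kswapd            */
		/* A waits for reclaim progress, still holding the mutex */

/* kswapd, woken by A's allocation, reaches the same shrinker: */
balance_pgdat()
	shrink_slab()
		ttm_dma_pool_shrink_scan()
			mutex_lock(&_manager->lock)  /* blocks: held by A */

/* kswapd can make no further progress, so every direct reclaimer that
 * trips too_many_isolated() spins in congestion_wait() forever. */
```

The too_many_isolated() loop in shrink_inactive_list() exempts only tasks
for which current_is_kswapd() is true, which is why a blocked kswapd leaves
no task able to break the cycle.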
* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20  5:59 ` Andrew Morton
  2014-05-20  6:03   ` Andrew Morton
  2014-05-20  6:30   ` Dave Chinner
@ 2014-05-20  7:20 ` Christoph Hellwig
  2 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2014-05-20 7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: riel, Tetsuo Handa, linux-kernel, xfs, kosaki.motohiro,
	fengguang.wu, kamezawa.hiroyu

On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote:
> So current_is_kswapd() returns true for a thread which is not kswapd.
> That's a bit smelly.
>
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example? current_is_kswapd() does a range of things,
> only one(?) of which you actually want.

Actually we want all of them. The allocation workqueue is a workaround
for the incredible stack usage in the Linux I/O path. If it is called
by kswapd it should act as if it were kswapd for all purposes.
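This is the reasoning behind Dave's fix earlier in the thread: record
whether the task queueing the allocation work is kswapd, and have the
worker temporarily assume kswapd identity. A sketch of the shape of that
change (field names and details are illustrative, not the exact committed
patch):

```c
/* When deferring the allocation to the workqueue, capture the
 * caller's reclaim context: */
args->kswapd = current_is_kswapd();

/* In the worker, masquerade as kswapd for the duration of the
 * allocation, so that current_is_kswapd() — and with it the
 * too_many_isolated() and wait_iff_congested() guards in vmscan —
 * see kswapd context: */
static void
xfs_bmapi_allocate_worker(struct work_struct *work)
{
	struct xfs_bmalloca *args = container_of(work,
						 struct xfs_bmalloca, work);
	unsigned long pflags = 0;

	if (args->kswapd) {
		pflags = PF_KSWAPD;
		current->flags |= PF_KSWAPD;
	}

	args->result = __xfs_bmapi_allocate(args);
	complete(args->done);

	current->flags &= ~pflags;	/* restore the worker's own flags */
}
```

Because only the bit that was actually set is cleared afterwards, the
worker thread's flags are left untouched when the caller was not kswapd.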
end of thread, other threads:[~2014-06-06 12:20 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <201405192340.FCD48964.OFQHOOJLVSFFMt@I-love.SAKURA.ne.jp>
2014-05-20 0:44 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Dave Chinner
2014-05-20 3:54 ` Tetsuo Handa
2014-05-20 5:24 ` Dave Chinner
2014-05-20 5:59 ` Andrew Morton
2014-05-20 6:03 ` Andrew Morton
2014-05-20 6:33 ` Dave Chinner
2014-05-20 6:30 ` Dave Chinner
2014-05-20 14:58 ` Tetsuo Handa
2014-05-20 16:12 ` Motohiro Kosaki
2014-05-26 11:45 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
2014-06-03 21:43 ` David Rientjes
2014-06-05 12:45 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
2014-06-05 13:17 ` Dave Chinner
2014-06-06 12:19 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
2014-05-20 7:20 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Christoph Hellwig