[PATCH] mm/vmscan: Do not block forever at shrink_inactive

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
@ 2014-05-19 14:40 Tetsuo Handa
  2014-05-19 17:42 ` Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Tetsuo Handa @ 2014-05-19 14:40 UTC (permalink / raw)
  To: riel, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu; +Cc: linux-kernel

>From f016db5d7f84d6321132150b13c5888ef67d694f Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 19 May 2014 23:24:11 +0900
Subject: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
too many pages are isolated already" causes RHEL7 environment to stall with
0% CPU usage when a certain type of memory pressure is given.

Upon memory pressure, kswapd calls xfs_vm_writepage() from shrink_page_list().
xfs_vm_writepage() eventually calls wait_for_completion() which waits for
xfs_bmapi_allocate_worker().

Then, a kernel worker thread calls xfs_bmapi_allocate_worker() and
xfs_bmapi_allocate_worker() eventually calls xfs_btree_lookup_get_block().
xfs_btree_lookup_get_block() eventually calls alloc_page().
alloc_page() eventually calls shrink_inactive_list().

The stack trace showed that the kernel worker thread which the kswapd is
waiting for was looping at a while loop in shrink_inactive_list().

---------- stack trace start ----------
[  923.927838] kswapd0         D ffff88007fa34580     0   101      2 0x00000000
[  923.930028]  ffff880079103550 0000000000000046 ffff880079103fd8 0000000000014580
[  923.932324]  ffff880079103fd8 0000000000014580 ffff88007c31f1c0 ffff880079103680
[  923.934599]  ffff880079103688 7fffffffffffffff ffff88007c31f1c0 ffff880079103880
[  923.936855] Call Trace:
[  923.937920]  [<ffffffff815f18b9>] schedule+0x29/0x70
[  923.939538]  [<ffffffff815ef7b9>] schedule_timeout+0x209/0x2d0
[  923.941360]  [<ffffffff810976c3>] ? wake_up_process+0x23/0x40
[  923.943157]  [<ffffffff8107b464>] ? wake_up_worker+0x24/0x30
[  923.945147]  [<ffffffff8107bdf2>] ? insert_work+0x62/0xa0
[  923.946900]  [<ffffffff815f1de6>] wait_for_completion+0x116/0x170
[  923.948786]  [<ffffffff81097700>] ? wake_up_state+0x20/0x20
[  923.950572]  [<ffffffffa019ad44>] xfs_bmapi_allocate+0xa4/0xd0 [xfs]
[  923.952515]  [<ffffffffa01cc9f9>] xfs_bmapi_write+0x509/0x810 [xfs]
[  923.954398]  [<ffffffffa019a1f0>] ? xfs_next_bit+0x90/0x90 [xfs]
[  923.956223]  [<ffffffffa01abb50>] xfs_iomap_write_allocate+0x150/0x350 [xfs]
[  923.958256]  [<ffffffffa0197186>] xfs_map_blocks+0x216/0x240 [xfs]
[  923.960141]  [<ffffffffa01983b3>] xfs_vm_writepage+0x263/0x5c0 [xfs]
[  923.962053]  [<ffffffff8115497d>] shrink_page_list+0x80d/0xab0
[  923.963840]  [<ffffffff811552ca>] shrink_inactive_list+0x1ea/0x580
[  923.965677]  [<ffffffff81155dc5>] shrink_lruvec+0x375/0x6e0
[  923.967419]  [<ffffffff811b2556>] ? put_super+0x36/0x40
[  923.969072]  [<ffffffff811b2556>] ? put_super+0x36/0x40
[  923.970694]  [<ffffffff811561a6>] shrink_zone+0x76/0x1a0
[  923.972389]  [<ffffffff8115744c>] balance_pgdat+0x48c/0x5e0
[  923.974110]  [<ffffffff8115770b>] kswapd+0x16b/0x430
[  923.975682]  [<ffffffff81086ab0>] ? wake_up_bit+0x30/0x30
[  923.977395]  [<ffffffff811575a0>] ? balance_pgdat+0x5e0/0x5e0
[  923.979176]  [<ffffffff81085aef>] kthread+0xcf/0xe0
[  923.980739]  [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[  923.982692]  [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
[  923.984380]  [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[  924.642947] kworker/1:2     D ffff88007fa34580     0   328      2 0x00000000
[  924.645307] Workqueue: xfsalloc xfs_bmapi_allocate_worker [xfs]
[  924.647219]  ffff8800781b1380 0000000000000046 ffff8800781b1fd8 0000000000014580
[  924.649586]  ffff8800781b1fd8 0000000000014580 ffff880078130b60 ffff88007c254000
[  924.651900]  ffff8800781b13b0 0000000100098869 ffff88007c254000 000000000000e728
[  924.654185] Call Trace:
[  924.655305]  [<ffffffff815f18b9>] schedule+0x29/0x70
[  924.656960]  [<ffffffff815ef725>] schedule_timeout+0x175/0x2d0
[  924.658832]  [<ffffffff8106e070>] ? __internal_add_timer+0x130/0x130
[  924.660803]  [<ffffffff815f10ab>] io_schedule_timeout+0x9b/0xf0
[  924.662685]  [<ffffffff81160a32>] congestion_wait+0x82/0x110
[  924.664520]  [<ffffffff81086ab0>] ? wake_up_bit+0x30/0x30
[  924.666269]  [<ffffffff8115543c>] shrink_inactive_list+0x35c/0x580
[  924.668188]  [<ffffffff812d028d>] ? list_del+0xd/0x30
[  924.669860]  [<ffffffff81155dc5>] shrink_lruvec+0x375/0x6e0
[  924.671662]  [<ffffffff811b2556>] ? put_super+0x36/0x40
[  924.673348]  [<ffffffff811b2556>] ? put_super+0x36/0x40
[  924.675045]  [<ffffffff811561a6>] shrink_zone+0x76/0x1a0
[  924.676749]  [<ffffffff811566b0>] do_try_to_free_pages+0xf0/0x4e0
[  924.678605]  [<ffffffff81156b9c>] try_to_free_pages+0xfc/0x180
[  924.680429]  [<ffffffff8114b2ce>] __alloc_pages_nodemask+0x75e/0xb10
[  924.682378]  [<ffffffff81188689>] alloc_pages_current+0xa9/0x170
[  924.684264]  [<ffffffffa020db11>] xfs_buf_allocate_memory+0x16d/0x24a [xfs]
[  924.686324]  [<ffffffffa019e3b5>] xfs_buf_get_map+0x125/0x180 [xfs]
[  924.688225]  [<ffffffffa019ed4c>] xfs_buf_read_map+0x2c/0x140 [xfs]
[  924.690172]  [<ffffffffa0202089>] xfs_trans_read_buf_map+0x2d9/0x4a0 [xfs]
[  924.692245]  [<ffffffffa01cf698>] xfs_btree_read_buf_block.isra.18.constprop.29+0x78/0xc0 [xfs]
[  924.694673]  [<ffffffffa01cf760>] xfs_btree_lookup_get_block+0x80/0x100 [xfs]
[  924.696793]  [<ffffffffa01d38e7>] xfs_btree_lookup+0xd7/0x4b0 [xfs]
[  924.698716]  [<ffffffffa01bc211>] ? xfs_allocbt_init_cursor+0x41/0xd0 [xfs]
[  924.700787]  [<ffffffffa01b9811>] xfs_alloc_ag_vextent_near+0x91/0xa50 [xfs]
[  924.702836]  [<ffffffffa01baa3d>] xfs_alloc_ag_vextent+0xcd/0x110 [xfs]
[  924.704849]  [<ffffffffa01bb7c9>] xfs_alloc_vextent+0x429/0x5e0 [xfs]
[  924.706807]  [<ffffffffa01cb73f>] xfs_bmap_btalloc+0x2df/0x820 [xfs]
[  924.709010]  [<ffffffffa01cbc8e>] xfs_bmap_alloc+0xe/0x10 [xfs]
[  924.710887]  [<ffffffffa01cc2d7>] __xfs_bmapi_allocate+0xc7/0x2e0 [xfs]
[  924.712905]  [<ffffffffa019a221>] xfs_bmapi_allocate_worker+0x31/0x60 [xfs]
[  924.714954]  [<ffffffff8107e02b>] process_one_work+0x17b/0x460
[  924.716754]  [<ffffffff8107edfb>] worker_thread+0x11b/0x400
[  924.718468]  [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
[  924.720252]  [<ffffffff81085aef>] kthread+0xcf/0xe0
[  924.721855]  [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[  924.723815]  [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
[  924.725516]  [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
---------- stack trace end ----------

Since the kernel worker thread needs to escape from the while loop so that
alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
to release memory), I think that we should not block forever. This patch
introduces 30 seconds timeout for userspace processes and 5 seconds timeout
for kernel processes.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/vmscan.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d..3eeeda6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1459,13 +1459,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	int i = 0;
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
+	/* Throttle with timeout. */
+	while (unlikely(too_many_isolated(zone, file, sc)) && i++ < 300) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
 			return SWAP_CLUSTER_MAX;
+		/* Kernel threads should not be blocked for too long. */
+		if (i == 50 && (current->flags & PF_KTHREAD))
+			break;
 	}
 
 	lru_add_drain();
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-19 14:40 [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
@ 2014-05-19 17:42 ` Rik van Riel
  2014-05-20  0:44 ` Dave Chinner
  2014-05-20  2:35 ` Jianyu Zhan
  2 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2014-05-19 17:42 UTC (permalink / raw)
  To: Tetsuo Handa, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu; +Cc: linux-kernel

On 05/19/2014 10:40 AM, Tetsuo Handa wrote:
>>From f016db5d7f84d6321132150b13c5888ef67d694f Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Mon, 19 May 2014 23:24:11 +0900
> Subject: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
> 
> I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
> too many pages are isolated already" causes RHEL7 environment to stall with
> 0% CPU usage when a certain type of memory pressure is given.

> Since the kernel worker thread needs to escape from the while loop so that
> alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
> to release memory), I think that we should not block forever. This patch
> introduces 30 seconds timeout for userspace processes and 5 seconds timeout
> for kernel processes.

I suspect the fundamental problem is that other processes can
drive the system back to the limit, and unlucky processes can
end up starved forever.

This makes your fix (or something like it) necessary.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-19 14:40 [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
  2014-05-19 17:42 ` Rik van Riel
@ 2014-05-20  0:44 ` Dave Chinner
  2014-05-20  3:54   ` Tetsuo Handa
  2014-05-20  5:59   ` Andrew Morton
  2014-05-20  2:35 ` Jianyu Zhan
  2 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2014-05-20  0:44 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: riel, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu,
	linux-kernel, xfs

[cc xfs@oss.sgi.com]

On Mon, May 19, 2014 at 11:40:46PM +0900, Tetsuo Handa wrote:
> >From f016db5d7f84d6321132150b13c5888ef67d694f Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Mon, 19 May 2014 23:24:11 +0900
> Subject: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
> 
> I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
> too many pages are isolated already" causes RHEL7 environment to stall with
> 0% CPU usage when a certain type of memory pressure is given.
> 
> Upon memory pressure, kswapd calls xfs_vm_writepage() from shrink_page_list().
> xfs_vm_writepage() eventually calls wait_for_completion() which waits for
> xfs_bmapi_allocate_worker().
> 
> Then, a kernel worker thread calls xfs_bmapi_allocate_worker() and
> xfs_bmapi_allocate_worker() eventually calls xfs_btree_lookup_get_block().
> xfs_btree_lookup_get_block() eventually calls alloc_page().
> alloc_page() eventually calls shrink_inactive_list().
> 
> The stack trace showed that the kernel worker thread which the kswapd is
> waiting for was looping at a while loop in shrink_inactive_list().

[snip stack traces indicating XFS is deferring allocation work to the
allocation workqueue on behalf of kswapd]

> Since the kernel worker thread needs to escape from the while loop so that
> alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
> to release memory), I think that we should not block forever. This patch
> introduces 30 seconds timeout for userspace processes and 5 seconds timeout
> for kernel processes.

That's like trying to swat a fly with a runaway train. Eventually
you'll hit the fly with the train....

Q: What is missing from the workqueue context that results
in the workqueue being blocked in memory reclaim on congestion?

Hint: XFS is deferring allocation on behalf of kswapd to a
workqueue.

A: PF_KSWAPD.  shrink_inactive_list() has this guard:

        /*
         * Stall direct reclaim for IO completions if underlying BDIs or zone
         * is congested. Allow kswapd to continue until it starts encountering
         * unqueued dirty pages or cycling through the LRU too quickly.
         */
        if (!sc->hibernation_mode && !current_is_kswapd())
                wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);


So, XFS should be passing kswapd context to the workqueue allocation
context. The patch below does this.

Tetsuo-san, when it comes to problems involving XFS, you should
really CC xfs@oss.sgi.com because very few people really know how
XFS works and even fewer still know how it is supposed to interact
with memory reclaim....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: block allocation work needs to be kswapd aware

From: Dave Chinner <dchinner@redhat.com>

Upon memory pressure, kswapd calls xfs_vm_writepage() from
shrink_page_list(). This can result in delayed allocation occurring
and that gets deferred to the the allocation workqueue.

The allocation then runs outside kswapd context, which means if it
needs memory (and it does to demand page metadata from disk) it can
block in shrink_inactive_list() waiting for IO congestion. These
blocking waits are normally avoiding in kswapd context, so under
memory pressure writeback from kswapd can be arbitrarily delayed by
memory reclaim.

To avoid this, pass the kswapd context to the allocation being done
by the workqueue, so that memory reclaim understands correctly that
the work is being done for kswapd and therefore it is not blocked
and does not delay memory reclaim.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 16 +++++++++++++---
 fs/xfs/xfs_bmap_util.h |  1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 057f671..703b3ec 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
 	struct xfs_bmalloca	*args = container_of(work,
 						struct xfs_bmalloca, work);
 	unsigned long		pflags;
+	unsigned long		new_pflags = PF_FSTRANS;
 
-	/* we are in a transaction context here */
-	current_set_flags_nested(&pflags, PF_FSTRANS);
+	/*
+	 * we are in a transaction context here, but may also be doing work
+	 * in kswapd context, and hence we may need to inherit that state
+	 * temporarily to ensure that we don't block waiting for memory reclaim
+	 * in any way.
+	 */
+	if (args->kswapd)
+		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+
+	current_set_flags_nested(&pflags, new_pflags);
 
 	args->result = __xfs_bmapi_allocate(args);
 	complete(args->done);
 
-	current_restore_flags_nested(&pflags, PF_FSTRANS);
+	current_restore_flags_nested(&pflags, new_pflags);
 }
 
 /*
@@ -284,6 +293,7 @@ xfs_bmapi_allocate(
 
 
 	args->done = &done;
+	args->kswapd = current_is_kswapd();
 	INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker);
 	queue_work(xfs_alloc_wq, &args->work);
 	wait_for_completion(&done);
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 935ed2b..d227f9d 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -56,6 +56,7 @@ struct xfs_bmalloca {
 	char			aeof;	/* allocated space at eof */
 	char			conv;	/* overwriting unwritten extents */
 	char			stack_switch;
+	char			kswapd;	/* allocation in kswapd context */
 	int			flags;
 	struct completion	*done;
 	struct work_struct	work;

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-19 14:40 [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
  2014-05-19 17:42 ` Rik van Riel
  2014-05-20  0:44 ` Dave Chinner
@ 2014-05-20  2:35 ` Jianyu Zhan
  2 siblings, 0 replies; 20+ messages in thread
From: Jianyu Zhan @ 2014-05-20  2:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Rik van Riel, kosaki.motohiro, Fengguang Wu, kamezawa.hiroyu,
	LKML

On Mon, May 19, 2014 at 10:40 PM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> Since the kernel worker thread needs to escape from the while loop so that
> alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
> to release memory), I think that we should not block forever. This patch
> introduces 30 seconds timeout for userspace processes and 5 seconds timeout
> for kernel processes.
>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/vmscan.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d..3eeeda6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1459,13 +1459,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         int file = is_file_lru(lru);
>         struct zone *zone = lruvec_zone(lruvec);
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +       int i = 0;
>
> -       while (unlikely(too_many_isolated(zone, file, sc))) {
> +       /* Throttle with timeout. */
> +       while (unlikely(too_many_isolated(zone, file, sc)) && i++ < 300) {
>                 congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>                 /* We are about to die and free our memory. Return now. */
>                 if (fatal_signal_pending(current))
>                         return SWAP_CLUSTER_MAX;
> +               /* Kernel threads should not be blocked for too long. */
> +               if (i == 50 && (current->flags & PF_KTHREAD))
> +                       break;
>         }
>
>         lru_add_drain();

Hi, Tetsuo Handa

I think it is good to use a MACRO for this magic number instead of harding code
it, in a long-term maintainability view,  and would better with
appropriate document.

Thanks,
Jianyu Zhan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20  0:44 ` Dave Chinner
@ 2014-05-20  3:54   ` Tetsuo Handa
  2014-05-20  5:24     ` Dave Chinner
  2014-05-20  5:59   ` Andrew Morton
  1 sibling, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-05-20  3:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: riel, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu,
	linux-kernel, xfs

Dave Chinner wrote:
> So, XFS should be passing kswapd context to the workqueue allocation
> context. The patch below does this.
> 
> Tetsuo-san, when it comes to problems involving XFS, you should
> really CC xfs@oss.sgi.com because very few people really know how
> XFS works and even fewer still know how it is supposed to interact
> with memory reclaim....

Thank you for the patch, but ...

  #define PF_KSWAPD       0x00040000      /* I am kswapd */

  static inline int current_is_kswapd(void)
  {
  	return current->flags & PF_KSWAPD;
  }

I think ((char) (current->flags & 0x00040000)) == 0.
Your patch wants

  -args->kswapd = current_is_kswapd();
  +args->kswapd = (current_is_kswapd() != 0);

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20  3:54   ` Tetsuo Handa
@ 2014-05-20  5:24     ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-05-20  5:24 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: riel, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu,
	linux-kernel, xfs

On Tue, May 20, 2014 at 12:54:29PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > So, XFS should be passing kswapd context to the workqueue allocation
> > context. The patch below does this.
> > 
> > Tetsuo-san, when it comes to problems involving XFS, you should
> > really CC xfs@oss.sgi.com because very few people really know how
> > XFS works and even fewer still know how it is supposed to interact
> > with memory reclaim....
> 
> Thank you for the patch, but ...
> 
>   #define PF_KSWAPD       0x00040000      /* I am kswapd */
> 
>   static inline int current_is_kswapd(void)
>   {
>   	return current->flags & PF_KSWAPD;
>   }

> I think ((char) (current->flags & 0x00040000)) == 0.
> Your patch wants
> 
>   -args->kswapd = current_is_kswapd();
>   +args->kswapd = (current_is_kswapd() != 0);

Thanks for pointing that out, but I think:

-static inline int current_is_kswapd(void)
+static inline bool current_is_kswapd(void)

is a better solution. It can only be true or false.

But regardless, I need to change the boolean options in that XFS
structure to be, well, booleans.

Cheers,

dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20  0:44 ` Dave Chinner
  2014-05-20  3:54   ` Tetsuo Handa
@ 2014-05-20  5:59   ` Andrew Morton
  2014-05-20  6:03     ` Andrew Morton
                       ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Andrew Morton @ 2014-05-20  5:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tetsuo Handa, riel, kosaki.motohiro, fengguang.wu,
	kamezawa.hiroyu, linux-kernel, xfs

On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote:

> @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
>  	struct xfs_bmalloca	*args = container_of(work,
>  						struct xfs_bmalloca, work);
>  	unsigned long		pflags;
> +	unsigned long		new_pflags = PF_FSTRANS;
>  
> -	/* we are in a transaction context here */
> -	current_set_flags_nested(&pflags, PF_FSTRANS);
> +	/*
> +	 * we are in a transaction context here, but may also be doing work
> +	 * in kswapd context, and hence we may need to inherit that state
> +	 * temporarily to ensure that we don't block waiting for memory reclaim
> +	 * in any way.
> +	 */
> +	if (args->kswapd)
> +		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;

So current_is_kswapd() returns true for a thread which is not kswapd. 
That's a bit smelly.

Should this thread really be incrementing KSWAPD_INODESTEAL instead of
PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
only one(?) of which you actually want.

It would be cleaner to create a new PF_ flag to select just that
behavior.  That's a better model than telling the world "I am magic and
special".

But we're awfully close to running out of PF_ space and I don't know if
this ugly justifies consuming a flag.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20  5:59   ` Andrew Morton
@ 2014-05-20  6:03     ` Andrew Morton
  2014-05-20  6:33       ` Dave Chinner
  2014-05-20  6:30     ` Dave Chinner
  2014-05-20  7:20     ` Christoph Hellwig
  2 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2014-05-20  6:03 UTC (permalink / raw)
  To: Dave Chinner, Tetsuo Handa, riel, kosaki.motohiro, fengguang.wu,
	kamezawa.hiroyu, linux-kernel, xfs

On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
> >  	struct xfs_bmalloca	*args = container_of(work,
> >  						struct xfs_bmalloca, work);
> >  	unsigned long		pflags;
> > +	unsigned long		new_pflags = PF_FSTRANS;
> >  
> > -	/* we are in a transaction context here */
> > -	current_set_flags_nested(&pflags, PF_FSTRANS);
> > +	/*
> > +	 * we are in a transaction context here, but may also be doing work
> > +	 * in kswapd context, and hence we may need to inherit that state
> > +	 * temporarily to ensure that we don't block waiting for memory reclaim
> > +	 * in any way.
> > +	 */
> > +	if (args->kswapd)
> > +		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> 
> So current_is_kswapd() returns true for a thread which is not kswapd. 
> That's a bit smelly.
> 
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
> only one(?) of which you actually want.
> 
> It would be cleaner to create a new PF_ flag to select just that
> behavior.  That's a better model than telling the world "I am magic and
> special".

Or a new __GFP_FLAG.

> But we're awfully close to running out of PF_ space and I don't know if
> this ugly justifies consuming a flag.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20  5:59   ` Andrew Morton
  2014-05-20  6:03     ` Andrew Morton
@ 2014-05-20  6:30     ` Dave Chinner
  2014-05-20 14:58       ` Tetsuo Handa
  2014-05-20  7:20     ` Christoph Hellwig
  2 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-05-20  6:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, riel, kosaki.motohiro, fengguang.wu,
	kamezawa.hiroyu, linux-kernel, xfs

On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote:
> On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
> >  	struct xfs_bmalloca	*args = container_of(work,
> >  						struct xfs_bmalloca, work);
> >  	unsigned long		pflags;
> > +	unsigned long		new_pflags = PF_FSTRANS;
> >  
> > -	/* we are in a transaction context here */
> > -	current_set_flags_nested(&pflags, PF_FSTRANS);
> > +	/*
> > +	 * we are in a transaction context here, but may also be doing work
> > +	 * in kswapd context, and hence we may need to inherit that state
> > +	 * temporarily to ensure that we don't block waiting for memory reclaim
> > +	 * in any way.
> > +	 */
> > +	if (args->kswapd)
> > +		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> 
> So current_is_kswapd() returns true for a thread which is not kswapd. 
> That's a bit smelly.
> 
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
> only one(?) of which you actually want.

It's doing work on behalf of kswapd under kswapd constraints, so it
should both behave and be accounted as if kswapd was executing the
work directly.

> It would be cleaner to create a new PF_ flag to select just that
> behavior.  That's a better model than telling the world "I am magic and
> special".

However, it is magic and special because of who the work needs to be
done for.

> But we're awfully close to running out of PF_ space and I don't know if
> this ugly justifies consuming a flag.

I don't really care enough argue over what mechanism should be used.
I'll push this patch through the XFS tree, and when a new flag or
generic mechanism for pushing task contexts to kworker threads
is provided, I'll change the XFS code to use that....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20  6:03     ` Andrew Morton
@ 2014-05-20  6:33       ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-05-20  6:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, riel, kosaki.motohiro, fengguang.wu,
	kamezawa.hiroyu, linux-kernel, xfs

On Mon, May 19, 2014 at 11:03:11PM -0700, Andrew Morton wrote:
> On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > 
> > > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
> > >  	struct xfs_bmalloca	*args = container_of(work,
> > >  						struct xfs_bmalloca, work);
> > >  	unsigned long		pflags;
> > > +	unsigned long		new_pflags = PF_FSTRANS;
> > >  
> > > -	/* we are in a transaction context here */
> > > -	current_set_flags_nested(&pflags, PF_FSTRANS);
> > > +	/*
> > > +	 * we are in a transaction context here, but may also be doing work
> > > +	 * in kswapd context, and hence we may need to inherit that state
> > > +	 * temporarily to ensure that we don't block waiting for memory reclaim
> > > +	 * in any way.
> > > +	 */
> > > +	if (args->kswapd)
> > > +		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > 
> > So current_is_kswapd() returns true for a thread which is not kswapd. 
> > That's a bit smelly.
> > 
> > Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> > PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
> > only one(?) of which you actually want.
> > 
> > It would be cleaner to create a new PF_ flag to select just that
> > behavior.  That's a better model than telling the world "I am magic and
> > special".
> 
> Or a new __GFP_FLAG.

Sure - and with that XFS will need another PF_ flag to tell the
memory allocator to set the new __GFP_FLAG on every allocation done
in that kworker task context, just like it uses PF_FSTRANS to ensure
that __GFP_NOFS is set for all the allocations in that kworker
context....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20  5:59   ` Andrew Morton
  2014-05-20  6:03     ` Andrew Morton
  2014-05-20  6:30     ` Dave Chinner
@ 2014-05-20  7:20     ` Christoph Hellwig
  2 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2014-05-20  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Tetsuo Handa, riel, kosaki.motohiro, fengguang.wu,
	kamezawa.hiroyu, linux-kernel, xfs

On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote:
> So current_is_kswapd() returns true for a thread which is not kswapd. 
> That's a bit smelly.
> 
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
> only one(?) of which you actually want.

Actually we want all of them. The allocation workqueue is a workaround
for the incredible stack usage in the Linux I/O path.  If it is called
by kswapd it should act as if it were kswapd for all purposes.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20  6:30     ` Dave Chinner
@ 2014-05-20 14:58       ` Tetsuo Handa
  2014-05-20 16:12         ` Motohiro Kosaki
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-05-20 14:58 UTC (permalink / raw)
  To: david, riel
  Cc: kosaki.motohiro, fengguang.wu, kamezawa.hiroyu, akpm, hch,
	linux-kernel, xfs

Today I discussed with Kosaki-san at LinuxCon Japan 2014 about this issue.
He does not like the idea of adding timeout to throttle loop. As Dave
posted a patch that fixes a bug in XFS delayed allocation, I updated
my patch accordingly.

Although the bug in XFS was fixed by Dave's patch, other kernel code
would have bugs which would fall into this infinite throttle loop.
But to keep the possibility of triggering OOM killer minimum, can we
agree with this updated patch (and in the future adding some warning
mechanism like /proc/sys/kernel/hung_task_timeout_secs for detecting
memory allocation stall)?

Dave, if you are OK with this updated patch, please let me know
commit ID of your patch.

Regards.
----------
>From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Tue, 20 May 2014 23:34:34 +0900
Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list().

I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
too many pages are isolated already" causes RHEL7 environment to stall
with 0% CPU usage when a certain type of memory pressure is given.
This is because nobody can reclaim memory due to rules listed below.

  (a) XFS uses a kernel worker thread for delayed allocation
  (b) kswapd wakes up the kernel worker thread for delayed allocation
  (c) the kernel worker thread is throttled due to commit 35cd7815

This patch and commit XXXXXXXX "xfs: block allocation work needs to be
kswapd aware" will solve rule (c).

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/vmscan.c |   20 +++++++++++++++-----
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d..5c6960e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

-	while (unlikely(too_many_isolated(zone, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	/*
+	 * Throttle only direct reclaimers. Allocations by kswapd (and
+	 * allocation workqueue on behalf of kswapd) should not be
+	 * throttled here; otherwise memory allocation will deadlock.
+	 */
+	if (!sc->hibernation_mode && !current_is_kswapd()) {
+		while (unlikely(too_many_isolated(zone, file, sc))) {
+			congestion_wait(BLK_RW_ASYNC, HZ/10);

-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
+			/*
+			 * We are about to die and free our memory.
+			 * Return now.
+			 */
+			if (fatal_signal_pending(current))
+				return SWAP_CLUSTER_MAX;
+		}
 	}

 	lru_add_drain();
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* RE: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-05-20 14:58       ` Tetsuo Handa
@ 2014-05-20 16:12         ` Motohiro Kosaki
  2014-05-26 11:45           ` [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list() Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: Motohiro Kosaki @ 2014-05-20 16:12 UTC (permalink / raw)
  To: Tetsuo Handa, david@fromorbit.com, riel@redhat.com
  Cc: Motohiro Kosaki JP, fengguang.wu@intel.com,
	kamezawa.hiroyu@jp.fujitsu.com, akpm@linux-foundation.org,
	hch@infradead.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com



> -----Original Message-----
> From: Tetsuo Handa [mailto:penguin-kernel@I-love.SAKURA.ne.jp]
> Sent: Tuesday, May 20, 2014 11:58 PM
> To: david@fromorbit.com; riel@redhat.com
> Cc: Motohiro Kosaki JP; fengguang.wu@intel.com; kamezawa.hiroyu@jp.fujitsu.com; akpm@linux-foundation.org;
> hch@infradead.org; linux-kernel@vger.kernel.org; xfs@oss.sgi.com
> Subject: Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
> 
> Today I discussed with Kosaki-san at LinuxCon Japan 2014 about this issue.
> He does not like the idea of adding timeout to throttle loop. As Dave posted a patch that fixes a bug in XFS delayed allocation, I
> updated my patch accordingly.
> 
> Although the bug in XFS was fixed by Dave's patch, other kernel code would have bugs which would fall into this infinite throttle loop.
> But to keep the possibility of triggering OOM killer minimum, can we agree with this updated patch (and in the future adding some
> warning mechanism like /proc/sys/kernel/hung_task_timeout_secs for detecting memory allocation stall)?
> 
> Dave, if you are OK with this updated patch, please let me know commit ID of your patch.
> 
> Regards.
> ----------
> >From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue, 20 May 2014 23:34:34 +0900
> Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list().
> 
> I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when too many pages are isolated already" causes RHEL7
> environment to stall with 0% CPU usage when a certain type of memory pressure is given.
> This is because nobody can reclaim memory due to rules listed below.
> 
>   (a) XFS uses a kernel worker thread for delayed allocation
>   (b) kswapd wakes up the kernel worker thread for delayed allocation
>   (c) the kernel worker thread is throttled due to commit 35cd7815
> 
> This patch and commit XXXXXXXX "xfs: block allocation work needs to be kswapd aware" will solve rule (c).
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/vmscan.c |   20 +++++++++++++++-----
>  1 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d..5c6960e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	struct zone *zone = lruvec_zone(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> 
> -	while (unlikely(too_many_isolated(zone, file, sc))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	/*
> +	 * Throttle only direct reclaimers. Allocations by kswapd (and
> +	 * allocation workqueue on behalf of kswapd) should not be
> +	 * throttled here; otherwise memory allocation will deadlock.
> +	 */
> +	if (!sc->hibernation_mode && !current_is_kswapd()) {
> +		while (unlikely(too_many_isolated(zone, file, sc))) {
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
> -		/* We are about to die and free our memory. Return now. */
> -		if (fatal_signal_pending(current))
> -			return SWAP_CLUSTER_MAX;
> +			/*
> +			 * We are about to die and free our memory.
> +			 * Return now.
> +			 */
> +			if (fatal_signal_pending(current))
> +				return SWAP_CLUSTER_MAX;
> +		}
>  	}


Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


Dave, I don't like Tetsuo's first patch because this too_many_isolated exist to prevent false oom-kill. So, simple timeout
resurrect it. Please let me know if you need further MM enhancement to solve XFS issue. I'd like join and assist this.

Thanks.







^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list().
  2014-05-20 16:12         ` Motohiro Kosaki
@ 2014-05-26 11:45           ` Tetsuo Handa
  2014-06-03 21:43             ` David Rientjes
  0 siblings, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-05-26 11:45 UTC (permalink / raw)
  To: Motohiro.Kosaki, david, riel
  Cc: kosaki.motohiro, fengguang.wu, kamezawa.hiroyu, akpm, hch,
	linux-kernel, xfs

Tetsuo Handa wrote:
> Dave, if you are OK with this updated patch, please let me know commit ID of your patch.

I overlooked that too_many_isolated() in mm/vmscan.c already has

  if (current_is_kswapd())
    return 0;

lines. Therefore, it turned out that my patch is irrelevant to XFS issue.
Does patch shown below make sense?

Regards.
----------
>From e133aadd65df3d8622efced3443888ed3e60327b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 26 May 2014 20:37:12 +0900
Subject: [PATCH] mm/vmscan: Do not throttle when hibernation_mode != 0.

In shrink_inactive_list(), we do not insert delay at

  if (!sc->hibernation_mode && !current_is_kswapd())
    wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

if sc->hibernation_mode != 0.
Follow the same reason, we should not insert delay at

  while (unlikely(too_many_isolated(zone, file, sc))) {
    congestion_wait(BLK_RW_ASYNC, HZ/10);

    /* We are about to die and free our memory. Return now. */
    if (fatal_signal_pending(current))
      return SWAP_CLUSTER_MAX;
  }

if sc->hibernation_mode != 0.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/vmscan.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d..89c42ca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file,
 	if (current_is_kswapd())
 		return 0;
 
+	if (sc->hibernation_mode)
+		return 0;
+
 	if (!global_reclaim(sc))
 		return 0;
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list().
  2014-05-26 11:45           ` [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list() Tetsuo Handa
@ 2014-06-03 21:43             ` David Rientjes
  2014-06-05 12:45               ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
  0 siblings, 1 reply; 20+ messages in thread
From: David Rientjes @ 2014-06-03 21:43 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Motohiro.Kosaki, david, riel, kosaki.motohiro, fengguang.wu,
	kamezawa.hiroyu, akpm, hch, linux-kernel, xfs

On Mon, 26 May 2014, Tetsuo Handa wrote:

> In shrink_inactive_list(), we do not insert delay at
> 
>   if (!sc->hibernation_mode && !current_is_kswapd())
>     wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> 
> if sc->hibernation_mode != 0.
> Follow the same reason, we should not insert delay at
> 
>   while (unlikely(too_many_isolated(zone, file, sc))) {
>     congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
>     /* We are about to die and free our memory. Return now. */
>     if (fatal_signal_pending(current))
>       return SWAP_CLUSTER_MAX;
>   }
> 
> if sc->hibernation_mode != 0.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/vmscan.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d..89c42ca 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file,
>  	if (current_is_kswapd())
>  		return 0;
>  
> +	if (sc->hibernation_mode)
> +		return 0;
> +
>  	if (!global_reclaim(sc))
>  		return 0;
>  

This isn't the only too_many_isolated() functions that do a delay, how is 
the too_many_isolated() in mm/compaction.c different?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-06-03 21:43             ` David Rientjes
@ 2014-06-05 12:45               ` Tetsuo Handa
  2014-06-05 13:17                 ` Dave Chinner
  2014-06-09 11:53                 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
  0 siblings, 2 replies; 20+ messages in thread
From: Tetsuo Handa @ 2014-06-05 12:45 UTC (permalink / raw)
  To: rientjes, Motohiro.Kosaki
  Cc: david, riel, kosaki.motohiro, fengguang.wu, kamezawa.hiroyu, akpm,
	hch, linux-kernel, xfs

David Rientjes wrote:
> On Mon, 26 May 2014, Tetsuo Handa wrote:
> 
> > In shrink_inactive_list(), we do not insert delay at
> > 
> >   if (!sc->hibernation_mode && !current_is_kswapd())
> >     wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > 
> > if sc->hibernation_mode != 0.
> > Follow the same reason, we should not insert delay at
> > 
> >   while (unlikely(too_many_isolated(zone, file, sc))) {
> >     congestion_wait(BLK_RW_ASYNC, HZ/10);
> > 
> >     /* We are about to die and free our memory. Return now. */
> >     if (fatal_signal_pending(current))
> >       return SWAP_CLUSTER_MAX;
> >   }
> > 
> > if sc->hibernation_mode != 0.
> > 
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > ---
> >  mm/vmscan.c |    3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 32c661d..89c42ca 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file,
> >  	if (current_is_kswapd())
> >  		return 0;
> >  
> > +	if (sc->hibernation_mode)
> > +		return 0;
> > +
> >  	if (!global_reclaim(sc))
> >  		return 0;
> >  
> 
> This isn't the only too_many_isolated() functions that do a delay, how is 
> the too_many_isolated() in mm/compaction.c different?
> 

I don't know. But today I realized that this patch is not sufficient.

I'm trying to find why __alloc_pages_slowpath() cannot return for many minutes
when a certain type of memory pressure is given on a RHEL7 environment with
4 CPU / 2GB RAM. Today I tried to use ftrace for examining the breakdown of
time-consuming functions inside __alloc_pages_slowpath(). But on the first run,
all processes are trapped into this too_many_isolated()/congestion_wait() loop
while kswapd is not running; stalling forever because nobody can perform
operations for making too_many_isolated() to return 0.

This means that, under rare circumstances, it is possible that all processes
other than kswapd are trapped into too_many_isolated()/congestion_wait() loop
while kswapd is sleeping because this loop assumes that somebody else shall
wake up kswapd and kswapd shall perform operations for making
too_many_isolated() to return 0. However, we cannot guarantee that kswapd is
waken by somebody nor kswapd is not blocked by blocking operations inside
shrinker functions (e.g. mutex_lock()).

We need some more changes. I'm thinking memory allocation watchdog thread.
Add an "unsigned long" field to "struct task_struct", set jiffies to the field
upon entry of GFP_WAIT-able memory allocation attempts, and clear the field
upon returning from GFP_WAIT-able memory allocation attempts. A kernel thread
periodically scans task list and compares the field and jiffies, and (at least)
print warning messages (maybe optionally trigger OOM-killer or kernel panic)
if single memory allocation attempt is taking too long (e.g. 60 seconds).
What do you think?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-06-05 12:45               ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
@ 2014-06-05 13:17                 ` Dave Chinner
  2014-06-06 12:19                   ` [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list() Tetsuo Handa
  2014-06-09 11:53                 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-06-05 13:17 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, Motohiro.Kosaki, riel, kosaki.motohiro, fengguang.wu,
	kamezawa.hiroyu, akpm, hch, linux-kernel, xfs

On Thu, Jun 05, 2014 at 09:45:26PM +0900, Tetsuo Handa wrote:
> David Rientjes wrote:
> > On Mon, 26 May 2014, Tetsuo Handa wrote:
> > 
> > > In shrink_inactive_list(), we do not insert delay at
> > > 
> > >   if (!sc->hibernation_mode && !current_is_kswapd())
> > >     wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > > 
> > > if sc->hibernation_mode != 0.
> > > Follow the same reason, we should not insert delay at
> > > 
> > >   while (unlikely(too_many_isolated(zone, file, sc))) {
> > >     congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > 
> > >     /* We are about to die and free our memory. Return now. */
> > >     if (fatal_signal_pending(current))
> > >       return SWAP_CLUSTER_MAX;
> > >   }
> > > 
> > > if sc->hibernation_mode != 0.
> > > 
> > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > ---
> > >  mm/vmscan.c |    3 +++
> > >  1 files changed, 3 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 32c661d..89c42ca 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file,
> > >  	if (current_is_kswapd())
> > >  		return 0;
> > >  
> > > +	if (sc->hibernation_mode)
> > > +		return 0;
> > > +
> > >  	if (!global_reclaim(sc))
> > >  		return 0;
> > >  
> > 
> > This isn't the only too_many_isolated() functions that do a delay, how is 
> > the too_many_isolated() in mm/compaction.c different?
> > 
> 
> I don't know. But today I realized that this patch is not sufficient.
> 
> I'm trying to find why __alloc_pages_slowpath() cannot return for many minutes
> when a certain type of memory pressure is given on a RHEL7 environment with
> 4 CPU / 2GB RAM. Today I tried to use ftrace for examining the breakdown of
> time-consuming functions inside __alloc_pages_slowpath(). But on the first run,
> all processes are trapped into this too_many_isolated()/congestion_wait() loop
> while kswapd is not running; stalling forever because nobody can perform
> operations for making too_many_isolated() to return 0.
> 
> This means that, under rare circumstances, it is possible that all processes
> other than kswapd are trapped into too_many_isolated()/congestion_wait() loop
> while kswapd is sleeping because this loop assumes that somebody else shall
> wake up kswapd and kswapd shall perform operations for making
> too_many_isolated() to return 0. However, we cannot guarantee that kswapd is
> waken by somebody nor kswapd is not blocked by blocking operations inside
> shrinker functions (e.g. mutex_lock()).

So what you are saying is that kswapd is having problems with
getting blocked on locks held by processes in direct reclaim?

What are the stack traces that demonstrate such a dependency loop?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list().
  2014-06-05 13:17                 ` Dave Chinner
@ 2014-06-06 12:19                   ` Tetsuo Handa
  0 siblings, 0 replies; 20+ messages in thread
From: Tetsuo Handa @ 2014-06-06 12:19 UTC (permalink / raw)
  To: david
  Cc: rientjes, Motohiro.Kosaki, riel, kosaki.motohiro, fengguang.wu,
	kamezawa.hiroyu, akpm, hch, linux-kernel, xfs

Dave Chinner wrote:
> On Thu, Jun 05, 2014 at 09:45:26PM +0900, Tetsuo Handa wrote:
> > This means that, under rare circumstances, it is possible that all processes
> > other than kswapd are trapped into too_many_isolated()/congestion_wait() loop
> > while kswapd is sleeping because this loop assumes that somebody else shall
> > wake up kswapd and kswapd shall perform operations for making
> > too_many_isolated() to return 0. However, we cannot guarantee that kswapd is
> > waken by somebody nor kswapd is not blocked by blocking operations inside
> > shrinker functions (e.g. mutex_lock()).
> 
> So what you are saying is that kswapd is having problems with
> getting blocked on locks held by processes in direct reclaim?
> 
> What are the stack traces that demonstrate such a dependency loop?

If a normal task's GFP_KERNEL memory allocation called a shrinker function and
the shrinker function does GFP_WAIT-able allocation with a mutex held, there
is a possibility that kswapd is waken up due to GFP_WAIT-able allocation and
kswapd calls the shrinker function, and the kswapd is blocked at trying to
hold the same mutex inside the shrinker function, isn't it?

Since ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan() holds a mutex
and ttm_dma_pool_shrink_scan() does GFP_WAIT-able allocation, I think such
a dependency loop is possible.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-06-05 12:45               ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
  2014-06-05 13:17                 ` Dave Chinner
@ 2014-06-09 11:53                 ` Tetsuo Handa
  2014-07-02 12:40                   ` Tetsuo Handa
  1 sibling, 1 reply; 20+ messages in thread
From: Tetsuo Handa @ 2014-06-09 11:53 UTC (permalink / raw)
  To: david
  Cc: rientjes, Motohiro.Kosaki, riel, kosaki.motohiro, fengguang.wu,
	kamezawa.hiroyu, akpm, hch, linux-kernel, linux-mm

Tetsuo Handa wrote:
> We need some more changes. I'm thinking memory allocation watchdog thread.
> Add an "unsigned long" field to "struct task_struct", set jiffies to the field
> upon entry of GFP_WAIT-able memory allocation attempts, and clear the field
> upon returning from GFP_WAIT-able memory allocation attempts. A kernel thread
> periodically scans task list and compares the field and jiffies, and (at least)
> print warning messages (maybe optionally trigger OOM-killer or kernel panic)
> if single memory allocation attempt is taking too long (e.g. 60 seconds).
> What do you think?
> 
Here is a demo patch. If you can join analysis of why memory allocation
function cannot return for more than 15 minutes under severe memory pressure,
I'll invite you to private discussion in order to share steps for reproducing
such memory pressure. A quick test says that memory reclaiming functions are
too optimistic about reclaiming memory; they are needlessly called again and
again and again with an assumption that some memory will be reclaimed within
a few seconds. If I insert some delay, CPU usage during stalls can be reduced.
----------
>From 015fecd45761b2849974f37dc379edf3e86acfa6 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 9 Jun 2014 15:00:47 +0900
Subject: [PATCH] mm: Add memory allocation watchdog kernel thread.

When a certain type of memory pressure is given, the system may stall
for many minutes trying to allocate memory pages. But stalling without
any messages is annoying because (e.g.) timeout will happen without
any prior warning messages.

This patch introduces a watchdog thread which periodically reports the
longest stalling thread if __alloc_pages_nodemask() is taking more than
10 seconds. An example output from a VM with 4 CPU / 2GB RAM (without swap)
running a v3.15 kernel with this patch is shown below.

  [ 5835.136868] INFO: task pcscd:14569 blocked for 11 seconds at memory allocation
  [ 5845.137932] INFO: task pcscd:14569 blocked for 21 seconds at memory allocation
  [ 5855.142985] INFO: task pcscd:14569 blocked for 31 seconds at memory allocation
  (...snipped...)
  [ 6710.227984] INFO: task pcscd:14569 blocked for 886 seconds at memory allocation
  [ 6720.228058] INFO: task pcscd:14569 blocked for 896 seconds at memory allocation
  [ 6730.231108] INFO: task pcscd:14569 blocked for 906 seconds at memory allocation
  [ 6740.242185] INFO: task pcscd:14569 blocked for 916 seconds at memory allocation

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/sched.h |  1 +
 mm/page_alloc.c       | 61 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 221b2bd..befd496 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1610,6 +1610,7 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+	unsigned long memory_allocation_start_jiffies;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dba293..211b0b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -61,6 +61,7 @@
 #include <linux/page-debug-flags.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
+#include <linux/kthread.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -2698,6 +2699,16 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	struct mem_cgroup *memcg = NULL;
+	bool memory_allocation_recursion = false;
+	unsigned long *stamp = &current->memory_allocation_start_jiffies;
+
+	if (likely(!*stamp)) {
+		*stamp = jiffies;
+		if (unlikely(!*stamp))
+			(*stamp)++;
+	} else {
+		memory_allocation_recursion = true;
+	}
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2706,7 +2717,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
 	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+		goto nopage;
 
 	/*
 	 * Check the zones suitable for the gfp_mask contain at least one
@@ -2714,14 +2725,14 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * of GFP_THISNODE and a memoryless node
 	 */
 	if (unlikely(!zonelist->_zonerefs->zone))
-		return NULL;
+		goto nopage;
 
 	/*
 	 * Will only have any effect when __GFP_KMEMCG is set.  This is
 	 * verified in the (always inline) callee
 	 */
 	if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
-		return NULL;
+		goto nopage;
 
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
@@ -2784,10 +2795,54 @@ out:
 
 	memcg_kmem_commit_charge(page, memcg, order);
 
+nopage:
+	if (likely(!memory_allocation_recursion))
+		current->memory_allocation_start_jiffies = 0;
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
 
+static int alloc_pages_watchdog_thread(void *unused)
+{
+	while (1) {
+		const unsigned long now = jiffies;
+		unsigned long min_stamp = 0;
+		struct task_struct *p;
+		struct task_struct *t;
+		char comm[TASK_COMM_LEN];
+		pid_t pid = 0;
+
+		rcu_read_lock();
+		for_each_process_thread(p, t) {
+			const unsigned long stamp =
+				t->memory_allocation_start_jiffies;
+			if (likely(!stamp ||
+				   time_after(stamp + 10 * HZ, now)))
+				continue;
+			if (!pid || time_after(min_stamp, stamp)) {
+				min_stamp = stamp;
+				memcpy(comm, t->comm, TASK_COMM_LEN);
+				pid = task_pid_nr(t);
+			}
+		}
+		rcu_read_unlock();
+		if (pid)
+			pr_warn("INFO: task %s:%u blocked for %lu seconds at memory allocation\n",
+				comm, pid, (now - min_stamp) / HZ);
+		schedule_timeout_killable(10 * HZ);
+	}
+	return 0;
+}
+
+static int __init alloc_pages_watchdog_init(void)
+{
+	struct task_struct *p = kthread_run(alloc_pages_watchdog_thread, NULL,
+					    "alloc-watchdog");
+	BUG_ON(IS_ERR(p));
+	return 0;
+}
+late_initcall(alloc_pages_watchdog_init);
+
 /*
  * Common helper functions.
  */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
  2014-06-09 11:53                 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
@ 2014-07-02 12:40                   ` Tetsuo Handa
  0 siblings, 0 replies; 20+ messages in thread
From: Tetsuo Handa @ 2014-07-02 12:40 UTC (permalink / raw)
  To: david
  Cc: rientjes, Motohiro.Kosaki, riel, kosaki.motohiro, fengguang.wu,
	kamezawa.hiroyu, akpm, hch, linux-kernel, linux-mm, fernando_b1

Tetsuo Handa wrote:
> Here is a demo patch. If you can join analysis of why memory allocation
> function cannot return for more than 15 minutes under severe memory pressure,
> I'll invite you to private discussion in order to share steps for reproducing
> such memory pressure. A quick test says that memory reclaiming functions are
> too optimistic about reclaiming memory; they are needlessly called again and
> again and again with an assumption that some memory will be reclaimed within
> a few seconds. If I insert some delay, CPU usage during stalls can be reduced.

Here is a formal patch. This patch includes a test result of today's linux.git
tree with https://lkml.org/lkml/2014/5/29/673 applied, in order to find what
deadlock occurs next. The blocking delay on the mutex inside the ttm shrinker
has gone, but a kernel worker thread trying to perform a block I/O using
GFP_NOIO context is blocked for more than 10 minutes. I think this is not good.

---------- Start of patch ----------

>From c5274057bd71832fcf0baef64d43a49c20f29dbf Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 2 Jul 2014 09:34:51 +0900
Subject: [PATCH] mm: Remember ongoing memory allocation status.

When a stall by memory allocation problem occurs, printing how long
a thread blocked for memory allocation will be useful.

This patch allows remembering how many jiffies was spent for ongoing
__alloc_pages_nodemask() and reading it by printing backtrace and by
analyzing kdump.

Two examples are shown below. You can see that the GFP flags passed to
memory shrinker functions can be GFP_NOIO or GFP_NOFS. Therefore, when
writing memory shrinker functions, please be careful with dependency
inside shrinker functions. For example, unconditional use of GFP_KERNEL
may lead to deadlock. For another example, unconditional use of
blocking lock operations (e.g. mutex_lock()) which are called by
multiple different GFP contexts may lead to deadlock.

     kworker/2:2     R  running task        0   189      2 0x00000000
     MemAlloc: 624869 jiffies on 0x10
     Workqueue: events_freezable_power_ disk_events_workfn
      ffff880036eacfe0 000000004486d7e5 ffff88007fc83c48 ffffffff81090a3f
      ffff880036eacfe0 0000000000000000 ffff88007fc83c80 ffffffff81090b35
      ffff880036ead210 000000004486d7e5 ffffffff817bada0 0000000000000074
     Call Trace:
      [<ffffffff8158401f>] ? _raw_spin_lock+0x2f/0x50
      [<ffffffff81126b99>] list_lru_count_node+0x19/0x60
      [<ffffffff81171e10>] super_cache_count+0x50/0xd0
      [<ffffffff8111460a>] shrink_slab_node+0x3a/0x1b0
      [<ffffffff811683fc>] ? vmpressure+0x1c/0x80
      [<ffffffff811153f3>] shrink_slab+0x83/0x150
      [<ffffffff81118499>] do_try_to_free_pages+0x2f9/0x530
      [<ffffffff81118768>] try_to_free_pages+0x98/0xd0
      [<ffffffff8110e3f3>] __alloc_pages_nodemask+0x6e3/0xad0
      [<ffffffff8114b2b3>] alloc_pages_current+0xa3/0x170
      [<ffffffff81244d87>] bio_copy_user_iov+0x1c7/0x370
      [<ffffffff81244fc9>] bio_copy_kern+0x49/0xe0
      [<ffffffff8124ed4f>] blk_rq_map_kern+0x6f/0x130
      [<ffffffff81249273>] ? blk_get_request+0x83/0x140
      [<ffffffff81393381>] scsi_execute+0x131/0x160
      [<ffffffff81393484>] scsi_execute_req_flags+0x84/0xf0
      [<ffffffffa01b987c>] sr_check_events+0xbc/0x2d0 [sr_mod]
      [<ffffffffa018f173>] cdrom_check_events+0x13/0x30 [cdrom]
      [<ffffffffa01b9ced>] sr_block_check_events+0x2d/0x30 [sr_mod]
      [<ffffffff81258c75>] disk_check_events+0x55/0x1e0
      [<ffffffff81580e65>] ? _cond_resched+0x35/0x60
      [<ffffffff81258e11>] disk_events_workfn+0x11/0x20
      [<ffffffff8107d64f>] process_one_work+0x15f/0x3d0
      [<ffffffff8107de19>] worker_thread+0x119/0x620
      [<ffffffff8107dd00>] ? rescuer_thread+0x440/0x440
      [<ffffffff8108439c>] kthread+0xdc/0x100
      [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
      [<ffffffff8158483c>] ret_from_fork+0x7c/0xb0
      [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
    
     kworker/u16:2   R  running task        0 14009  13723 0x00000080
     MemAlloc: 624951 jiffies on 0x250
      0000000000000000 0000000000000100 0000000000000000 28f5c28f5c28f5c3
      0000000000001705 0000000000000060 0000000000000064 0000000000000064
      ffff880036dfea40 ffffffffffffff10 ffffffff8158401a 0000000000000010
     Call Trace:
      [<ffffffff8158401a>] ? _raw_spin_lock+0x2a/0x50
      [<ffffffff81126b99>] ? list_lru_count_node+0x19/0x60
      [<ffffffff81171e10>] ? super_cache_count+0x50/0xd0
      [<ffffffff8111460a>] ? shrink_slab_node+0x3a/0x1b0
      [<ffffffff811683fc>] ? vmpressure+0x1c/0x80
      [<ffffffff811153f3>] ? shrink_slab+0x83/0x150
      [<ffffffff81118499>] ? do_try_to_free_pages+0x2f9/0x530
      [<ffffffff81118768>] ? try_to_free_pages+0x98/0xd0
      [<ffffffff8110e3f3>] ? __alloc_pages_nodemask+0x6e3/0xad0
      [<ffffffff8114b2b3>] ? alloc_pages_current+0xa3/0x170
      [<ffffffffa0232755>] ? xfs_buf_allocate_memory+0x168/0x245 [xfs]
      [<ffffffffa01cc382>] ? xfs_buf_get_map+0xd2/0x130 [xfs]
      [<ffffffffa01cc964>] ? xfs_buf_read_map+0x24/0xc0 [xfs]
      [<ffffffffa0228609>] ? xfs_trans_read_buf_map+0xa9/0x330 [xfs]
      [<ffffffffa0217999>] ? xfs_imap_to_bp+0x69/0xf0 [xfs]
      [<ffffffffa0217e89>] ? xfs_iread+0x79/0x410 [xfs]
      [<ffffffffa01e35df>] ? kmem_zone_alloc+0x6f/0xf0 [xfs]
      [<ffffffffa01d3be3>] ? xfs_iget+0x1a3/0x510 [xfs]
      [<ffffffffa02121de>] ? xfs_lookup+0xbe/0xf0 [xfs]
      [<ffffffffa01d9023>] ? xfs_vn_lookup+0x73/0xc0 [xfs]
      [<ffffffff81178f88>] ? lookup_real+0x18/0x50
      [<ffffffff8117dced>] ? do_last+0x8bd/0xe90
      [<ffffffff8117adde>] ? link_path_walk+0x27e/0x8e0
      [<ffffffff8117e388>] ? path_openat+0xc8/0x6a0
      [<ffffffff8109700c>] ? select_task_rq_fair+0x3dc/0x7e0
      [<ffffffff8117fc18>] ? do_filp_open+0x48/0xb0
      [<ffffffff81154799>] ? kmem_cache_alloc+0x109/0x170
      [<ffffffff81208b51>] ? security_prepare_creds+0x11/0x20
      [<ffffffff811751ad>] ? do_open_exec+0x1d/0xe0
      [<ffffffff8117704d>] ? do_execve_common.isra.26+0x1bd/0x620
      [<ffffffff81154700>] ? kmem_cache_alloc+0x70/0x170
      [<ffffffff811774c3>] ? do_execve+0x13/0x20
      [<ffffffff81079ae7>] ? ____call_usermodehelper+0x117/0x1b0
      [<ffffffff81079b80>] ? ____call_usermodehelper+0x1b0/0x1b0
      [<ffffffff81079b99>] ? call_helper+0x19/0x20
      [<ffffffff8158483c>] ? ret_from_fork+0x7c/0xb0
      [<ffffffff81079b80>] ? ____call_usermodehelper+0x1b0/0x1b0

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   | 11 +++++++++++
 mm/page_alloc.c       | 19 +++++++++++++++++--
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0..8b5edc7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1655,6 +1655,8 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+	unsigned long memory_allocation_start_jiffies;
+	gfp_t memory_allocation_flags;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b..0d1eb3e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4443,6 +4443,16 @@ out_unlock:
 	return retval;
 }
 
+static void print_memalloc_info(const struct task_struct *p)
+{
+	const unsigned long stamp = p->memory_allocation_start_jiffies;
+
+	if (likely(!stamp))
+		return;
+	printk(KERN_INFO "MemAlloc: %lu jiffies on 0x%x\n", jiffies - stamp,
+	       p->memory_allocation_flags);
+}
+
 static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
@@ -4475,6 +4485,7 @@ void sched_show_task(struct task_struct *p)
 		task_pid_nr(p), ppid,
 		(unsigned long)task_thread_info(p)->flags);
 
+	print_memalloc_info(p);
 	print_worker_info(KERN_INFO, p);
 	show_stack(p, NULL);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20d17f8..cac0d32 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2721,6 +2721,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	int classzone_idx;
+	bool memory_allocation_recursion = false;
+	unsigned long *stamp = &current->memory_allocation_start_jiffies;
+
+	if (likely(!*stamp)) {
+		*stamp = jiffies;
+		if (unlikely(!*stamp))
+			(*stamp)--;
+		current->memory_allocation_flags = gfp_mask;
+	} else {
+		memory_allocation_recursion = true;
+	}
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2729,7 +2740,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
 	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+		goto nopage;
 
 	/*
 	 * Check the zones suitable for the gfp_mask contain at least one
@@ -2737,7 +2748,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * of GFP_THISNODE and a memoryless node
 	 */
 	if (unlikely(!zonelist->_zonerefs->zone))
-		return NULL;
+		goto nopage;
 
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
@@ -2799,6 +2810,10 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
+nopage:
+	if (likely(!memory_allocation_recursion))
+		current->memory_allocation_start_jiffies = 0;
+
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
-- 
1.8.3.1

---------- End of patch ----------

Above result is 2GB RAM and no swap space while below result is 2GB RAM and
4GB swap space.

If swap space is available, CPU usage during stalls tend to become 100% to
0% as waiting for disk I/O and congestion_wait() make processes sleep.

I succeeded to generate (so far only once) a CPU 0% stall that lasted for more
than 20 minutes blocked at congestion_wait() inside shrink_inactive_list().

    kthreadd        D ffff88007fcd31c0     0     2      0 0x00000000
    MemAlloc: 1374202 jiffies on 0x2000d0
     ffff88007c41f7c0 0000000000000046 ffff88007c4088e0 00000000000131c0
     ffff88007c41ffd8 00000000000131c0 ffff88007c46f360 0000000000000286
     ffff88007a14fc02 ffff88007c41f730 ffffffff81168040 ffffffff819cc300
    Call Trace:
     [<ffffffff81168040>] ? swap_cgroup_record+0x50/0x80
     [<ffffffff8106f766>] ? lock_timer_base.isra.26+0x26/0x50
     [<ffffffff81580b64>] schedule+0x24/0x70
     [<ffffffff8157ff26>] schedule_timeout+0x126/0x1c0
     [<ffffffff810be5d3>] ? ktime_get_ts+0x43/0xe0
     [<ffffffff8106f2c0>] ? add_timer_on+0xa0/0xa0
     [<ffffffff815810e6>] io_schedule_timeout+0x96/0xf0
     [<ffffffff8112123d>] congestion_wait+0x7d/0xd0
     [<ffffffff810a3da0>] ? __wake_up_sync+0x10/0x10
     [<ffffffff81116f0d>] shrink_inactive_list+0x37d/0x550
     [<ffffffff81117abb>] shrink_lruvec+0x52b/0x730
     [<ffffffff81117d54>] shrink_zone+0x94/0x1e0
     [<ffffffff811182c8>] do_try_to_free_pages+0x128/0x530
     [<ffffffff81118768>] try_to_free_pages+0x98/0xd0
     [<ffffffff8110e3f3>] __alloc_pages_nodemask+0x6e3/0xad0
     [<ffffffff8110e914>] alloc_kmem_pages_node+0x74/0x160
     [<ffffffff810617d5>] ? copy_process.part.32+0x125/0x1bb0
     [<ffffffff810617f6>] copy_process.part.32+0x146/0x1bb0
     [<ffffffff815805db>] ? __schedule+0x29b/0x800
     [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
     [<ffffffff81063427>] do_fork+0xd7/0x340
     [<ffffffff81091696>] ? set_cpus_allowed_ptr+0x76/0x120
     [<ffffffff810636b1>] kernel_thread+0x21/0x30
     [<ffffffff81084daa>] kthreadd+0x16a/0x1d0
     [<ffffffff81084c40>] ? kthread_create_on_cpu+0x60/0x60
     [<ffffffff8158483c>] ret_from_fork+0x7c/0xb0
     [<ffffffff81084c40>] ? kthread_create_on_cpu+0x60/0x60

    kswapd0         S ffff88007fc931c0     0    53      2 0x00000000
     ffff880079f17e10 0000000000000046 ffff88007c119aa0 00000000000131c0
     ffff880079f17fd8 00000000000131c0 ffff88007c46ea80 ffff880079f17dc0
     ffff880079f17dc0 0000000000000286 ffff88007c50c000 ffff880079f17d88
    Call Trace:
     [<ffffffff8106fc62>] ? try_to_del_timer_sync+0x52/0x80
     [<ffffffff8110a22c>] ? zone_watermark_ok_safe+0xac/0xc0
     [<ffffffff811150e9>] ? zone_balanced+0x19/0x50
     [<ffffffff811151ef>] ? pgdat_balanced+0xcf/0xf0
     [<ffffffff81580b64>] schedule+0x24/0x70
     [<ffffffff81119329>] kswapd+0x2f9/0x3c0
     [<ffffffff810a3da0>] ? __wake_up_sync+0x10/0x10
     [<ffffffff81119030>] ? balance_pgdat+0x640/0x640
     [<ffffffff8108439c>] kthread+0xdc/0x100
     [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
     [<ffffffff8158483c>] ret_from_fork+0x7c/0xb0
     [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0

    kworker/u16:1   D ffff88007c11e1a0     0    65      2 0x00000000
    MemAlloc: 1455121 jiffies on 0x2000d0
    Workqueue: khelper __call_usermodehelper
     ffff880036c0b6b0 0000000000000046 ffff88007c11e1a0 00000000000131c0
     ffff880036c0bfd8 00000000000131c0 ffff88007c2091c0 ffff880036c0b668
     ffffea0001ff6a40 000000000000001d ffff88007cffec00 ffffea0001dfb000
    Call Trace:
     [<ffffffff8106f766>] ? lock_timer_base.isra.26+0x26/0x50
     [<ffffffff81580b64>] schedule+0x24/0x70
     [<ffffffff8157ff26>] schedule_timeout+0x126/0x1c0
     [<ffffffff810be5d3>] ? ktime_get_ts+0x43/0xe0
     [<ffffffff8106f2c0>] ? add_timer_on+0xa0/0xa0
     [<ffffffff815810e6>] io_schedule_timeout+0x96/0xf0
     [<ffffffff8112123d>] congestion_wait+0x7d/0xd0
     [<ffffffff810a3da0>] ? __wake_up_sync+0x10/0x10
     [<ffffffff81116f0d>] shrink_inactive_list+0x37d/0x550
     [<ffffffff81117abb>] shrink_lruvec+0x52b/0x730
     [<ffffffffa01d42d7>] ? xfs_reclaim_inodes_count+0x37/0x50 [xfs]
     [<ffffffffa01d42d7>] ? xfs_reclaim_inodes_count+0x37/0x50 [xfs]
     [<ffffffff81117d54>] shrink_zone+0x94/0x1e0
     [<ffffffff811182c8>] do_try_to_free_pages+0x128/0x530
     [<ffffffff81118768>] try_to_free_pages+0x98/0xd0
     [<ffffffff8110e3f3>] __alloc_pages_nodemask+0x6e3/0xad0
     [<ffffffff8110e914>] alloc_kmem_pages_node+0x74/0x160
     [<ffffffff810617d5>] ? copy_process.part.32+0x125/0x1bb0
     [<ffffffff810617f6>] copy_process.part.32+0x146/0x1bb0
     [<ffffffff81094255>] ? sched_clock_cpu+0x85/0xc0
     [<ffffffff8109b5ac>] ? put_prev_entity+0x2c/0x2c0
     [<ffffffff8100c5c4>] ? __switch_to+0xf4/0x5a0
     [<ffffffff81079b80>] ? ____call_usermodehelper+0x1b0/0x1b0
     [<ffffffff815805db>] ? __schedule+0x29b/0x800
     [<ffffffff81063427>] do_fork+0xd7/0x340
     [<ffffffffa0079a2b>] ? mpt_fault_reset_work+0x9b/0x45c [mptbase]
     [<ffffffff810636b1>] kernel_thread+0x21/0x30
     [<ffffffff81079bf9>] __call_usermodehelper+0x29/0x90
     [<ffffffff8107d64f>] process_one_work+0x15f/0x3d0
     [<ffffffff8107de19>] worker_thread+0x119/0x620
     [<ffffffff8107dd00>] ? rescuer_thread+0x440/0x440
     [<ffffffff8108439c>] kthread+0xdc/0x100
     [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
     [<ffffffff8158483c>] ret_from_fork+0x7c/0xb0
     [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0

It seems to me that nobody was able to wake up kswapd. Therefore,
I think loops like

	while (unlikely(too_many_isolated(zone, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);
	
		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

which assume that somebody else shall wake up kswapd and kswapd shall perform
operations for making too_many_isolated() to return 0 is not good.

^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-07-02 12:40 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-19 14:40 [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
2014-05-19 17:42 ` Rik van Riel
2014-05-20  0:44 ` Dave Chinner
2014-05-20  3:54   ` Tetsuo Handa
2014-05-20  5:24     ` Dave Chinner
2014-05-20  5:59   ` Andrew Morton
2014-05-20  6:03     ` Andrew Morton
2014-05-20  6:33       ` Dave Chinner
2014-05-20  6:30     ` Dave Chinner
2014-05-20 14:58       ` Tetsuo Handa
2014-05-20 16:12         ` Motohiro Kosaki
2014-05-26 11:45           ` [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list() Tetsuo Handa
2014-06-03 21:43             ` David Rientjes
2014-06-05 12:45               ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
2014-06-05 13:17                 ` Dave Chinner
2014-06-06 12:19                   ` [PATCH] mm/vmscan: Do not block forever atshrink_inactive_list() Tetsuo Handa
2014-06-09 11:53                 ` [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list() Tetsuo Handa
2014-07-02 12:40                   ` Tetsuo Handa
2014-05-20  7:20     ` Christoph Hellwig
2014-05-20  2:35 ` Jianyu Zhan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox