[PATCH] sched/numa: use down_read_trylock for mmap_sem

* [PATCH] sched/numa: use down_read_trylock for mmap_sem
@ 2017-05-15 13:13 Vlastimil Babka
  2017-05-15 14:27 ` Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Vlastimil Babka @ 2017-05-15 13:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, linux-kernel, Vlastimil Babka

A customer has reported a soft lockup when running a proprietary, intensive
memory stress test, where the trace on multiple CPUs looks like this:

 RIP: 0010:[<ffffffff810c53fe>]
  [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
...
 Call Trace:
  [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
  [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
  [<ffffffff81098322>] task_work_run+0x72/0x90

Further investigation showed that the lock contention here is pmd_lock().

The task_numa_work() function makes sure that only one thread is allowed to
perform the work in a single scan period (via cmpxchg), but if a thread holds
mmap_sem for writing for several periods, multiple threads in
task_numa_work() can build up a convoy waiting to take mmap_sem for read and
then all get unblocked at once.
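
For illustration only (not part of the patch), the per-period gating described
above can be sketched in userspace with C11 atomics standing in for the
kernel's cmpxchg(). Everything here is an assumption made for the sketch:
next_scan is a hypothetical stand-in for the next-scan timestamp the kernel
keeps per mm, claim_scan_period() is an invented helper, and a plain integer
comparison replaces the kernel's wraparound-safe jiffies helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm "next scan" timestamp. */
static _Atomic unsigned long next_scan;

/*
 * Return true for at most one caller per scan period. Callers that arrive
 * before the deadline, or that lose the compare-and-swap race, get false.
 * Simplification: no handling of counter wraparound.
 */
static bool claim_scan_period(unsigned long now, unsigned long period)
{
	unsigned long expected = atomic_load(&next_scan);

	if (now < expected)
		return false;	/* the current period was already claimed */

	/* Only the thread that installs the new deadline wins the period. */
	return atomic_compare_exchange_strong(&next_scan, &expected,
					      now + period);
}
```

Note that the gate only limits admission to one thread per period. A thread
that wins the period and then blocks on the lock does not stop another thread
from winning the next period, which is how the convoy can form.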

This patch changes the down_read() to down_read_trylock(), which prevents the
buildup. For a workload experiencing mmap_sem contention, it's probably better
to postpone the NUMA balancing work anyway. This seems to have fixed the soft
lockups involving pmd_lock(), which is in line with the convoy theory.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dea138964b91..d70f9026defc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2475,7 +2475,8 @@ void task_numa_work(struct callback_head *work)
 		return;

-	down_read(&mm->mmap_sem);
+	if (!down_read_trylock(&mm->mmap_sem))
+		return;
 	vma = find_vma(mm, start);
 	if (!vma) {
 		reset_ptenuma_scan(p);
-- 
2.12.2

* Re: [PATCH] sched/numa: use down_read_trylock for mmap_sem
  2017-05-15 13:13 [PATCH] sched/numa: use down_read_trylock for mmap_sem Vlastimil Babka
@ 2017-05-15 14:27 ` Rik van Riel
  2017-05-15 14:35 ` Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Rik van Riel @ 2017-05-15 14:27 UTC (permalink / raw)
  To: Vlastimil Babka, Ingo Molnar, Peter Zijlstra; +Cc: Mel Gorman, linux-kernel

On Mon, 2017-05-15 at 15:13 +0200, Vlastimil Babka wrote:
> A customer has reported a soft lockup when running a proprietary, intensive
> memory stress test, where the trace on multiple CPUs looks like this:
> 
>  RIP: 0010:[<ffffffff810c53fe>]
>   [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
> ...
>  Call Trace:
>   [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
>   [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
>   [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
>   [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
>   [<ffffffff81098322>] task_work_run+0x72/0x90
> 
> Further investigation showed that the lock contention here is pmd_lock().
> 
> The task_numa_work() function makes sure that only one thread is allowed to
> perform the work in a single scan period (via cmpxchg), but if a thread holds
> mmap_sem for writing for several periods, multiple threads in
> task_numa_work() can build up a convoy waiting to take mmap_sem for read and
> then all get unblocked at once.
> 
> This patch changes the down_read() to down_read_trylock(), which prevents the
> buildup. For a workload experiencing mmap_sem contention, it's probably better
> to postpone the NUMA balancing work anyway. This seems to have fixed the soft
> lockups involving pmd_lock(), which is in line with the convoy theory.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Rik van Riel <riel@redhat.com>

* Re: [PATCH] sched/numa: use down_read_trylock for mmap_sem
  2017-05-15 13:13 [PATCH] sched/numa: use down_read_trylock for mmap_sem Vlastimil Babka
  2017-05-15 14:27 ` Rik van Riel
@ 2017-05-15 14:35 ` Mel Gorman
  2017-05-16  8:15 ` Peter Zijlstra
  2017-05-23  8:47 ` [tip:sched/core] sched/numa: Use down_read_trylock() for the mmap_sem tip-bot for Vlastimil Babka
  3 siblings, 0 replies; 5+ messages in thread
From: Mel Gorman @ 2017-05-15 14:35 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Ingo Molnar, Peter Zijlstra, Rik van Riel, linux-kernel

On Mon, May 15, 2017 at 03:13:16PM +0200, Vlastimil Babka wrote:
> A customer has reported a soft lockup when running a proprietary, intensive
> memory stress test, where the trace on multiple CPUs looks like this:
> 
>  RIP: 0010:[<ffffffff810c53fe>]
>   [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
> ...
>  Call Trace:
>   [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
>   [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
>   [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
>   [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
>   [<ffffffff81098322>] task_work_run+0x72/0x90
> 
> Further investigation showed that the lock contention here is pmd_lock().
> 
> The task_numa_work() function makes sure that only one thread is allowed to
> perform the work in a single scan period (via cmpxchg), but if a thread holds
> mmap_sem for writing for several periods, multiple threads in
> task_numa_work() can build up a convoy waiting to take mmap_sem for read and
> then all get unblocked at once.
> 
> This patch changes the down_read() to down_read_trylock(), which prevents the
> buildup. For a workload experiencing mmap_sem contention, it's probably better
> to postpone the NUMA balancing work anyway. This seems to have fixed the soft
> lockups involving pmd_lock(), which is in line with the convoy theory.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

This skips the entire scan window and defers to the next one.
Potentially, with constant contention, it'll never make progress and
there could be other disruption. However, I cannot see how that's
worse than waiting on mmap_sem, so

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] sched/numa: use down_read_trylock for mmap_sem
  2017-05-15 13:13 [PATCH] sched/numa: use down_read_trylock for mmap_sem Vlastimil Babka
  2017-05-15 14:27 ` Rik van Riel
  2017-05-15 14:35 ` Mel Gorman
@ 2017-05-16  8:15 ` Peter Zijlstra
  2017-05-23  8:47 ` [tip:sched/core] sched/numa: Use down_read_trylock() for the mmap_sem tip-bot for Vlastimil Babka
  3 siblings, 0 replies; 5+ messages in thread
From: Peter Zijlstra @ 2017-05-16  8:15 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Ingo Molnar, Mel Gorman, Rik van Riel, linux-kernel

On Mon, May 15, 2017 at 03:13:16PM +0200, Vlastimil Babka wrote:
> A customer has reported a soft lockup when running a proprietary, intensive
> memory stress test, where the trace on multiple CPUs looks like this:
> 
>  RIP: 0010:[<ffffffff810c53fe>]
>   [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
> ...
>  Call Trace:
>   [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
>   [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
>   [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
>   [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
>   [<ffffffff81098322>] task_work_run+0x72/0x90
> 
> Further investigation showed that the lock contention here is pmd_lock().
> 
> The task_numa_work() function makes sure that only one thread is allowed to
> perform the work in a single scan period (via cmpxchg), but if a thread holds
> mmap_sem for writing for several periods, multiple threads in
> task_numa_work() can build up a convoy waiting to take mmap_sem for read and
> then all get unblocked at once.
> 
> This patch changes the down_read() to down_read_trylock(), which prevents the
> buildup. For a workload experiencing mmap_sem contention, it's probably better
> to postpone the NUMA balancing work anyway. This seems to have fixed the soft
> lockups involving pmd_lock(), which is in line with the convoy theory.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

* [tip:sched/core] sched/numa: Use down_read_trylock() for the mmap_sem
  2017-05-15 13:13 [PATCH] sched/numa: use down_read_trylock for mmap_sem Vlastimil Babka
                   ` (2 preceding siblings ...)
  2017-05-16  8:15 ` Peter Zijlstra
@ 2017-05-23  8:47 ` tip-bot for Vlastimil Babka
  3 siblings, 0 replies; 5+ messages in thread
From: tip-bot for Vlastimil Babka @ 2017-05-23  8:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: vbabka, torvalds, hpa, tglx, mingo, linux-kernel, riel, mgorman,
	peterz

Commit-ID:  8655d5497735b288f8a9b458bd22e7d1bf95bb61
Gitweb:     http://git.kernel.org/tip/8655d5497735b288f8a9b458bd22e7d1bf95bb61
Author:     Vlastimil Babka <vbabka@suse.cz>
AuthorDate: Mon, 15 May 2017 15:13:16 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 23 May 2017 10:01:34 +0200

sched/numa: Use down_read_trylock() for the mmap_sem

A customer has reported a soft lockup when running an intensive
memory stress test, where the trace on multiple CPUs looks like this:

 RIP: 0010:[<ffffffff810c53fe>]
  [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
...
 Call Trace:
  [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
  [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
  [<ffffffff81098322>] task_work_run+0x72/0x90

Further investigation showed that the lock contention here is pmd_lock().

The task_numa_work() function makes sure that only one thread is allowed to
perform the work in a single scan period (via cmpxchg), but if a thread holds
mmap_sem for writing for several periods, multiple threads in
task_numa_work() can build up a convoy waiting to take mmap_sem for read and
then all get unblocked at once.

This patch changes the down_read() to down_read_trylock(), which prevents the
buildup. For a workload experiencing mmap_sem contention, it's probably better
to postpone the NUMA balancing work anyway. This seems to have fixed the soft
lockups involving pmd_lock(), which is in line with the convoy theory.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 219fe58..47a0c55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2470,7 +2470,8 @@ void task_numa_work(struct callback_head *work)
 		return;

-	down_read(&mm->mmap_sem);
+	if (!down_read_trylock(&mm->mmap_sem))
+		return;
 	vma = find_vma(mm, start);
 	if (!vma) {
 		reset_ptenuma_scan(p);
