Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Oleg Nesterov <oleg@redhat.com>
To: Balbir Singh <bsingharora@gmail.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Nicholas Piggin <nicholas.piggin@gmail.com>
Subject: Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
Date: Tue, 30 Aug 2016 14:58:27 +0200	[thread overview]
Message-ID: <20160830125827.GA17653@redhat.com> (raw)
In-Reply-To: <4050f2ce-1aee-d2aa-39e3-36e995b56252@gmail.com>

On 08/30, Balbir Singh wrote:
>
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state
> 
> I have a system with multiple threads and
> 
> Most of the threads are stuck doing
> 
> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
> [67272.593915]     LR = _raw_spin_lock_irqsave+0x9c/0x130
> [67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
> [67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
> [67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
> [67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
> [67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
> [67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
> [67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
> [67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
> [67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
> [67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
> [67272.999016] Instruction dump:
> 
> They are spinning on the sem->wait_lock, the holder of sem->wait_lock has
> irq's disabled and is doing
> 
> [c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
> [c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
> [c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
> [c00000037930fc40] c000000000121598 up_write+0x78/0x90
> [c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
> [c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
> [c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
> [c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc
> 
> The offset of try_to_wake_up is actually misleading, it is actually stuck
> doing the following in try_to_wake_up
> 
> while (p->on_cpu)
> 	cpu_relax();
> 
> Analysis
> 
> The issue is triggered, due to the following race
> 
> CPU1					CPU2
> 
> while () {
>   if (cond)
>     break;
>   do {
>     schedule();
>     set_current_state(TASK_UN..)
>   } while (!cond);
> 					rwsem_wake()
> 					  spin_lock_irqsave(wait_lock)
>   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
> }					  try_to_wake_up()
> set_current_state(TASK_RUNNING);	  ..
> list_del(&waiter.list);
> 
> CPU2 wakes up CPU1, but before it can get the wait_lock and set
> current state to TASK_RUNNING the following occurs
> 
> CPU3
> (stole the rwsem before waiter can be woken up from queue)
> up_write()
> rwsem_wake()
> raw_spin_lock_irqsave(wait_lock)
> if (!list_empty)
>   wake_up_process()
>   try_to_wake_up()
>   raw_spin_lock_irqsave(p->pi_lock)
>   ..
>   if (p->on_rq && ttwu_wakeup())
>   ..
>   while (p->on_cpu)
>     cpu_relax()
>   ..
> 
> CPU3 tries to wake up the task on CPU1 again since it finds
> it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
> after CPU2, CPU3 got it.
> 
> CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
> the task is spinning on the wait_lock. Interestingly since p->on_rq
> is checked under pi_lock, I've noticed that try_to_wake_up() finds
> p->on_rq to be 0. This was the most confusing bit of the analysis,
> but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
> check is not reliable without this fix IMHO. The race is visible
> (based on the analysis) only when ttwu_queue() does a remote wakeup
> via ttwu_queue_remote. In which case the p->on_rq change is not
> done uder the pi_lock.
> 
> The result is that after a while the entire system locks up on
> the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
> 
> Reproduction of the issue
> 
> The issue can be reproduced after a long run on my system with 80
> threads and having to tweak available memory to very low and running
> memory stress-ng mmapfork test. It usually takes a long time to
> reproduce. I am trying to work on a test case that can reproduce
> the issue faster, but thats work in progress. I am still testing the
> changes on my still in a loop and the tests seem OK thus far.
> 
> Big thanks to Benjamin and Nick for helping debug this as well.
> Ben helped catch the missing barrier, Nick caught every missing
> bit in my theory
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> 
> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
> ---
>  kernel/sched/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2a906f2..582c684 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	success = 1; /* we're going to change ->state */
>  	cpu = task_cpu(p);
>  
> +	/*
> +	 * Ensure we see on_rq and p_state consistently
> +	 *
> +	 * For example in __rwsem_down_write_failed(), we have
> +	 *    [S] ->on_rq = 1				[L] ->state
> +	 *    MB					 RMB
> +	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
> +	 * In the absence of the RMB p->on_rq can be observed to be 0
> +	 * and we end up spinning indefinitely in while (p->on_cpu)
> +	 */
> +	smp_rmb();

I think the patch is fine... but unless I am totally confused this
is not specific to __rwsem_down_write_failed(). ttwu() can hang the
same way if the target simply does

	schedule_timeout();
	current->state = TASK_INTERRUPTIBLE;
	current->state = TASK_RUNNING;

And. I am not sure I understand where this MB above comes from.

Oleg.

next prev parent reply	other threads:[~2016-08-30 12:59 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-30  8:49 [RFC][PATCH] Fix a race between rwsem and the scheduler Balbir Singh
2016-08-30  9:13 ` Nicholas Piggin
2016-08-30 12:19 ` Peter Zijlstra
2016-08-30 13:04   ` Oleg Nesterov
2016-08-30 14:13     ` Peter Zijlstra
2016-08-30 16:57       ` Oleg Nesterov
2016-08-30 18:34         ` Peter Zijlstra
2016-08-30 21:28           ` Benjamin Herrenschmidt
2016-08-31  7:18             ` Peter Zijlstra
2016-08-31 10:56               ` Benjamin Herrenschmidt
2016-08-31 13:31             ` Peter Zijlstra
2016-08-31 21:47               ` Benjamin Herrenschmidt
2016-09-01  6:49                 ` Balbir Singh
2016-09-01  6:57                 ` Peter Zijlstra
2016-09-01 14:17                   ` Boqun Feng
2016-09-01 15:33                     ` Peter Zijlstra
2016-08-30 21:25     ` Benjamin Herrenschmidt
2016-08-31  7:20       ` Peter Zijlstra
2016-08-31 10:55         ` Benjamin Herrenschmidt
2016-08-31  3:41   ` Balbir Singh
2016-08-31  7:28     ` Peter Zijlstra
2016-08-31 10:17       ` Balbir Singh
2016-08-31 10:57       ` Benjamin Herrenschmidt
2016-09-01  1:48       ` Alexey Kardashevskiy
2016-09-01 12:16         ` Alexey Kardashevskiy
2016-08-30 12:58 ` Oleg Nesterov [this message]
2016-08-31  3:25   ` Balbir Singh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160830125827.GA17653@redhat.com \
    --to=oleg@redhat.com \
    --cc=benh@kernel.crashing.org \
    --cc=bsingharora@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nicholas.piggin@gmail.com \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.