Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()

From: Oleg Nesterov <oleg@redhat.com>
To: Kautuk Consul <consul.kautuk@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.cz>,
	David Rientjes <rientjes@google.com>,
	Ionut Alexa <ionut.m.alexa@gmail.com>,
	Guillaume Morin <guillaume@morinfr.org>,
	linux-kernel@vger.kernel.org, Kirill Tkhai <tkhai@yandex.ru>
Subject: Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
Date: Mon, 25 Aug 2014 17:57:38 +0200
Message-ID: <20140825155738.GA5944@redhat.com>
In-Reply-To: <1408964064-21447-1-git-send-email-consul.kautuk@gmail.com>

On 08/25, Kautuk Consul wrote:
>
> I encountered a BUG() scenario within do_exit() on an ARM system.
>
> The problem is due to a race scenario between do_exit() and try_to_wake_up()
> on different CPUs due to usage of sleeping primitives such as __down_common
> and wait_for_common.
>
> Race Scenario
> =============
>
> Let us assume two CPUs, A and B, execute code in the following order:
> 1)	CPU A was running in user-mode and enters kernel mode via some
> 	syscall/exception handler.
> 2)	CPU A sets the current task(t) state to TASK_INTERRUPTIBLE via __down_common
> 	or wait_for_common.
> 3)	CPU A checks for signal_pending() and returns due to TIF_SIGPENDING
> 	being set in t's threadinfo due to a previous signal(say SIGKILL) being
> 	received on this task t.
> 4)	CPU A returns to the assembly trap handler and calls
> 	do_work_pending() -> do_signal() -> get_signal() -> do_group_exit()
> 	 -> do_exit()
> 	CPU A has not yet executed the following line of code before the final
> 	call to schedule:
>     /* causes final put_task_struct in finish_task_switch(). */
>     tsk->state = TASK_DEAD;
> 5)	CPU B tries to send a signal to task t (currently executing on CPU A)
> 	and thus enters: signal_wake_up_state() -> wake_up_state() ->
> 					 try_to_wake_up()
> 6)	CPU B executes all code in try_to_wake_up() till the call to
> 	ttwu_queue -> ttwu_do_activate -> ttwu_do_wakeup().
> 	CPU B has still not executed the following code in ttwu_do_wakeup():
> 	p->state = TASK_RUNNING;
> 7)	CPU A executes the following line of code:
>     /* causes final put_task_struct in finish_task_switch(). */
>     tsk->state = TASK_DEAD;
> 8)	CPU B executes the following code in ttwu_do_wakeup():
> 	p->state = TASK_RUNNING;
> 9)	CPU A continues to the call to do_exit() -> schedule().
> 	Since the tsk->state is TASK_RUNNING, the call to schedule() returns and
> 	do_exit() -> BUG() is hit on CPU A.
>
> Alternate Solution
> ==================
>
> An alternate solution would be to simply set the current task state to
> TASK_RUNNING in __down_common(), wait_for_common() and all other interruptible
> sleeping primitives in their if(signal_pending/signal_pending_state) conditional
> blocks.
>
> But this change seems to me to be more logical because:
> i)		This will involve fewer changes to the kernel core code.
> ii)		Any further sleeping primitives in the kernel also need not suffer from
> 		this kind of race scenario.
>
> Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
> ---
>  kernel/exit.c |   10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 32c58f7..69a8231 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -824,14 +824,16 @@ void do_exit(long code)
>  	 *     (or hypervisor of virtual machine switches to other guest)
>  	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
>  	 *
> -	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> -	 * is held by try_to_wake_up()
> +	 * To solve this, we have to compete for tsk->pi_lock which is held by
> +	 * try_to_wake_up().
>  	 */
> -	smp_mb();
> -	raw_spin_unlock_wait(&tsk->pi_lock);
> +	raw_spin_lock(&tsk->pi_lock);
>
>  	/* causes final put_task_struct in finish_task_switch(). */
>  	tsk->state = TASK_DEAD;
> +
> +	raw_spin_unlock(&tsk->pi_lock);
> +
>  	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
>  	schedule();
>  	BUG();
> --
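
To make the ordering in steps 7-9 above concrete, here is a minimal
user-space sketch that replays the two stores to the task state in the
order described. It is not kernel code: the struct, the helper names and
the simplified schedule() check are assumptions made for illustration, and
the state values are only illustrative.

```c
#include <assert.h>

/* Illustrative state values; assumptions for this sketch only. */
enum task_state { TASK_RUNNING = 0, TASK_INTERRUPTIBLE = 1, TASK_DEAD = 64 };

/* Hypothetical stand-in for the one task_struct field that matters here. */
struct task {
	enum task_state state;
};

/* Step 7: CPU A, in do_exit(), marks the task dead. */
static void cpu_a_set_dead(struct task *t)
{
	t->state = TASK_DEAD;
}

/* Step 8: CPU B, in ttwu_do_wakeup(), completes a wakeup it started earlier. */
static void cpu_b_finish_wakeup(struct task *t)
{
	t->state = TASK_RUNNING;
}

/*
 * Step 9, heavily simplified: a task whose state is TASK_RUNNING is not
 * dequeued, so schedule() eventually returns to its caller.
 * Returns 1 if schedule() would return.
 */
static int schedule_returns(const struct task *t)
{
	return t->state == TASK_RUNNING;
}

/* Replay steps 7-9 in order; returns 1 if do_exit() would reach BUG(). */
static int replay_race(void)
{
	struct task t = { TASK_INTERRUPTIBLE };	/* left over from step 2 */

	cpu_a_set_dead(&t);
	assert(t.state == TASK_DEAD);

	cpu_b_finish_wakeup(&t);		/* overwrites TASK_DEAD */

	return schedule_returns(&t);
}
```

Calling replay_race() returns 1: the late store from CPU B wins, so the
final schedule() in do_exit() would return and fall through to BUG().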

Peter, do you remember another problem with TASK_DEAD we discussed recently?
(prev_state == TASK_DEAD detection in finish_task_switch() still looks racy).

I am starting to think that perhaps we need something like the patch below.
What do you all think?

Oleg.

--- x/kernel/sched/core.c
+++ x/kernel/sched/core.c
@@ -2205,9 +2205,10 @@ static void finish_task_switch(struct rq
 	__releases(rq->lock)
 {
 	struct mm_struct *mm = rq->prev_mm;
-	long prev_state;
+	struct task_struct *dead = rq->dead;
 
 	rq->prev_mm = NULL;
+	rq->dead = NULL;
 
 	/*
 	 * A task struct has one reference for the use as "current".
@@ -2220,7 +2221,6 @@ static void finish_task_switch(struct rq
 	 * be dropped twice.
 	 *		Manfred Spraul <manfred@colorfullife.com>
 	 */
-	prev_state = prev->state;
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(prev, current);
@@ -2230,16 +2230,16 @@ static void finish_task_switch(struct rq
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
 		mmdrop(mm);
-	if (unlikely(prev_state == TASK_DEAD)) {
-		if (prev->sched_class->task_dead)
-			prev->sched_class->task_dead(prev);
+	if (unlikely(dead)) {
+		if (dead->sched_class->task_dead)
+			dead->sched_class->task_dead(dead);
 
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
 		 */
-		kprobe_flush_task(prev);
-		put_task_struct(prev);
+		kprobe_flush_task(dead);
+		put_task_struct(dead);
 	}
 
 	tick_nohz_task_switch(current);
@@ -2770,11 +2770,15 @@ need_resched:
 	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
+	if (unlikely(rq->dead))
+		goto deactivate;
+
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
 		} else {
+deactivate:
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
@@ -2826,6 +2830,15 @@ need_resched:
 		goto need_resched;
 }
 
+// called under preempt_disable();
+void exit_schedule()
+{
+	// TODO: kill TASK_DEAD, this is only for proc
+	current->state = TASK_DEAD;
+	task_rq(current)->dead = current;
+	__schedule();
+}
+
 static inline void sched_submit_work(struct task_struct *tsk)
 {
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -815,25 +815,8 @@ void do_exit(long code)
 		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
 	exit_rcu();
 
-	/*
-	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
-	 * when the following two conditions become true.
-	 *   - There is race condition of mmap_sem (It is acquired by
-	 *     exit_mm()), and
-	 *   - SMI occurs before setting TASK_RUNINNG.
-	 *     (or hypervisor of virtual machine switches to other guest)
-	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
-	 *
-	 * To avoid it, we have to wait for releasing tsk->pi_lock which
-	 * is held by try_to_wake_up()
-	 */
-	smp_mb();
-	raw_spin_unlock_wait(&tsk->pi_lock);
-
-	/* causes final put_task_struct in finish_task_switch(). */
-	tsk->state = TASK_DEAD;
 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
-	schedule();
+	exit_schedule();
 	BUG();
 	/* Avoid "noreturn function does return".  */
 	for (;;)
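
The idea in the patch above can be sketched in user space as follows. This
is only a model under stated assumptions: the struct layouts, the _sketch
function names, the refs counter and must_deactivate() are hypothetical
stand-ins for the kernel's rq, task_struct, put_task_struct() and the
check added to __schedule(). It shows that once the dying task is recorded
in rq->dead, a late store of TASK_RUNNING no longer changes the outcome.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative state values; assumptions for this sketch only. */
enum task_state { TASK_RUNNING = 0, TASK_DEAD = 64 };

/* Hypothetical stand-ins for task_struct and the per-CPU runqueue. */
struct task {
	enum task_state state;
	int refs;		/* models the reference held for "current" */
};

struct rq {
	struct task *dead;	/* the field the patch adds */
};

/* Models exit_schedule(): record the dying task on the runqueue. */
static void exit_schedule_sketch(struct rq *rq, struct task *t)
{
	t->state = TASK_DEAD;
	rq->dead = t;
}

/* Models the new check in __schedule(): rq->dead forces deactivation. */
static int must_deactivate(const struct rq *rq, const struct task *prev)
{
	return rq->dead != NULL || prev->state != TASK_RUNNING;
}

/*
 * Models the patched finish_task_switch(): it consults rq->dead instead
 * of prev->state. Returns 1 if the final reference was dropped.
 */
static int finish_task_switch_sketch(struct rq *rq)
{
	struct task *dead = rq->dead;

	rq->dead = NULL;
	if (dead) {
		dead->refs--;	/* stands in for put_task_struct() */
		return 1;
	}
	return 0;
}

/* Replay the exit path with a racing wakeup; returns 1 on a clean exit. */
static int replay_with_rq_dead(void)
{
	struct task t = { TASK_RUNNING, 1 };
	struct rq rq = { NULL };

	exit_schedule_sketch(&rq, &t);
	t.state = TASK_RUNNING;		/* late store from a racing wakeup */

	assert(must_deactivate(&rq, &t));	/* still taken off the rq */

	int dropped = finish_task_switch_sketch(&rq);

	assert(rq.dead == NULL);
	return dropped && t.refs == 0;
}
```

In this model replay_with_rq_dead() returns 1 even though the state was
overwritten, because neither the deactivation decision nor the final
reference drop depends on the task state any more.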
