From: Oleg Nesterov <oleg@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
Long Gao <gaolong@kylinos.com.cn>,
Al Viro <viro@zeniv.linux.org.uk>,
Andrew Morton <akpm@linux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race
Date: Tue, 13 Aug 2013 16:33:25 +0200
Message-ID: <20130813143325.GA5541@redhat.com>
In-Reply-To: <20130813075550.GS27162@twins.programming.kicks-ass.net>
On 08/13, Peter Zijlstra wrote:
>
> On Mon, Aug 12, 2013 at 07:02:57PM +0200, Oleg Nesterov wrote:
> > +/*
> > + * Despite its name it doesn't necessarily has to be a full barrier.
> > + * It should only guarantee that a STORE before the critical section
> > + * can not be reordered with a LOAD inside this section.
> > + * So the default implementation simply ensures that a STORE can not
> > + * move into the critical section, smp_wmb() should serialize it with
> > + * another STORE done by spin_lock().
> > + */
> > +#ifndef smp_mb__before_spinlock
> > +#define smp_mb__before_spinlock() smp_wmb()
> > #endif
>
> I would have expected mention of the ACQUIRE of the lock keeping the
> LOAD inside the locked section.
OK, please see v2 below.
---
From 8de96d3feae3b4b51669902b7c24ac1748ecdbfe Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <oleg@redhat.com>
Date: Mon, 12 Aug 2013 18:14:00 +0200
Subject: sched: fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock, code like

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal. This is a special case of wait-for-condition:
it relies on the try_to_wake_up()/schedule() interaction and thus
does not need mb() between __set_current_state() and
if (signal_pending).

However, this __set_current_state() can move into the critical
section protected by rq->lock. Now that try_to_wake_up() takes
another lock, we need to ensure that it can't be reordered with
the "if (signal_pending(current))" check inside that section.

The patch is actually a one-liner: it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does for the same reason.

We turn this wmb() into a new helper, smp_mb__before_spinlock(),
for better documentation and to allow architectures to change
the default implementation.

While at it, kill smp_mb__after_lock(); it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
arch/x86/include/asm/spinlock.h | 4 ----
include/linux/spinlock.h | 14 +++++++++++---
kernel/sched/core.c | 14 +++++++++++++-
3 files changed, 24 insertions(+), 8 deletions(-)
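
Not part of the patch: a minimal userspace sketch of the race shape
described in the changelog, assuming C11 atomics and pthreads. Each
side does a STORE followed by a LOAD of the other side's variable; the
wakeup is lost only if both LOADs see the old value. The names
"state", "pending", sleeper(), waker() and count_lost_wakeups() are
hypothetical stand-ins, and the sketch uses a full seq_cst fence on
both sides, which is stronger than the smp_wmb() plus lock ACQUIRE
that the patch relies on.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins: "state" models p->state, "pending" models
 * the signal-pending flag. These are not the kernel's data structures. */
static atomic_int state;
static atomic_int pending;
static int sleeper_saw_pending;	/* written by sleeper, read after join */
static int waker_saw_sleeping;	/* written by waker, read after join */

/* Models: __set_current_state(TASK_INTERRUPTIBLE); schedule(); */
static void *sleeper(void *arg)
{
	(void)arg;
	atomic_store_explicit(&state, 1, memory_order_relaxed);
	/* Assumption: a full fence, stronger than the patch's barrier. */
	atomic_thread_fence(memory_order_seq_cst);
	sleeper_saw_pending =
		atomic_load_explicit(&pending, memory_order_relaxed);
	return NULL;
}

/* Models: signal_wake_up() setting the flag, then try_to_wake_up()
 * checking p->state. */
static void *waker(void *arg)
{
	(void)arg;
	atomic_store_explicit(&pending, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	waker_saw_sleeping =
		atomic_load_explicit(&state, memory_order_relaxed);
	return NULL;
}

/* Runs both sides concurrently `rounds` times and returns how often
 * neither side observed the other's STORE (the lost wakeup). */
static int count_lost_wakeups(int rounds)
{
	int lost = 0;

	for (int i = 0; i < rounds; i++) {
		pthread_t a, b;
		int rc;

		atomic_store(&state, 0);
		atomic_store(&pending, 0);

		rc = pthread_create(&a, NULL, sleeper, NULL);
		assert(rc == 0);
		rc = pthread_create(&b, NULL, waker, NULL);
		assert(rc == 0);
		(void)rc;

		pthread_join(a, NULL);
		pthread_join(b, NULL);

		if (!sleeper_saw_pending && !waker_saw_sleeping)
			lost++;
	}
	return lost;
}
```

Build with -pthread and call count_lost_wakeups() from a main(). With
the two fences in place the C11 memory model forbids the "both missed"
outcome, so the count stays at zero. Dropping either fence makes that
outcome permitted, although thread start-up costs mean a short run may
never actually hit it.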
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..e3ddd7d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -233,8 +233,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
#define arch_read_relax(lock) cpu_relax()
#define arch_write_relax(lock) cpu_relax()
-/* The {read|write|spin}_lock() on x86 are full memory barriers. */
-static inline void smp_mb__after_lock(void) { }
-#define ARCH_HAS_SMP_MB_AFTER_LOCK
-
#endif /* _ASM_X86_SPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 7d537ce..75f3494 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -117,9 +117,17 @@ do { \
#endif /*arch_spin_is_contended*/
#endif
-/* The lock does not imply full memory barrier. */
-#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
-static inline void smp_mb__after_lock(void) { smp_mb(); }
+/*
+ * Despite its name it doesn't necessarily have to be a full barrier.
+ * It should only guarantee that a STORE before the critical section
+ * can not be reordered with a LOAD inside this section.
+ * spin_lock() is a one-way barrier, this LOAD can not escape out
+ * of the region. So the default implementation simply ensures that
+ * a STORE can not move into the critical section, smp_wmb() should
+ * serialize it with another STORE done by spin_lock().
+ */
+#ifndef smp_mb__before_spinlock
+#define smp_mb__before_spinlock() smp_wmb()
#endif
/**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6df0fbe..97dac0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1491,7 +1491,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
unsigned long flags;
int cpu, success = 0;
- smp_wmb();
+ /*
+ * If we are going to wake up a thread waiting for CONDITION we
+ * need to ensure that CONDITION=1 done by the caller can not be
+ * reordered with p->state check below. This pairs with mb() in
+ * set_current_state() the waiting thread does.
+ */
+ smp_mb__before_spinlock();
raw_spin_lock_irqsave(&p->pi_lock, flags);
if (!(p->state & state))
goto out;
@@ -2394,6 +2400,12 @@ need_resched:
if (sched_feat(HRTICK))
hrtick_clear(rq);
+ /*
+ * Make sure that signal_pending_state()->signal_pending() below
+ * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
+ * done by the caller to avoid the race with signal_wake_up().
+ */
+ smp_mb__before_spinlock();
raw_spin_lock_irq(&rq->lock);
switch_count = &prev->nivcsw;
--
1.5.5.1
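
Not part of the patch: a small sketch of how the #ifndef guard added
to include/linux/spinlock.h lets an architecture replace the default.
Everything prefixed DEMO_ or demo_ is hypothetical, and the two barrier
macros are rough userspace approximations built on C11 fences (a
release fence is at least as strong as a STORE-STORE barrier), not the
kernel's definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Rough userspace approximations, assumed for illustration only. */
#define smp_wmb()	atomic_thread_fence(memory_order_release)
#define smp_mb()	atomic_thread_fence(memory_order_seq_cst)

/*
 * What a hypothetical architecture header might contain if smp_wmb()
 * plus the lock's ACQUIRE were not enough there. It has to be seen
 * before the generic fallback below.
 */
#ifdef DEMO_ARCH_NEEDS_FULL_BARRIER
#define smp_mb__before_spinlock()	smp_mb()
#define DEMO_BARRIER_NAME		"smp_mb"
#endif

/* Generic fallback, same shape as the hunk in the patch. */
#ifndef smp_mb__before_spinlock
#define smp_mb__before_spinlock()	smp_wmb()
#define DEMO_BARRIER_NAME		"smp_wmb"
#endif

/* Issues the selected barrier and reports which variant was chosen. */
static const char *demo_barrier_name(void)
{
	smp_mb__before_spinlock();
	return DEMO_BARRIER_NAME;
}
```

Compiled as is, demo_barrier_name() reports the smp_wmb() default;
compiled with -DDEMO_ARCH_NEEDS_FULL_BARRIER it reports the full
barrier instead. Callers such as schedule() stay unchanged either way,
which is the point of wrapping the barrier in a named helper.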