* Re: Patch for lost wakeups
From: Linus Torvalds @ 2013-08-09 18:21 UTC
To: Oleg Nesterov, linux-arch@vger.kernel.org
Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On Fri, Aug 9, 2013 at 6:04 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
>> The case of signals is special, in that the "wakeup criteria" is
>> inside the scheduler itself, but conceptually the rule is the same.
>
> yes, and because the waiter lacks mb().

Hmm. Ok. So say we have a sleeper that does

	__set_task_state(TASK_INTERRUPTIBLE)

(so without the memory barrier semantics of set_task_state()), so we
have one thread doing

	prev->state = TASK_INTERRUPTIBLE
	raw_spin_lock_irq(&rq->lock);
	if (signal_pending_state(prev->state, prev))
		prev->state = TASK_RUNNING;

and since it's a spin-lock, things can in theory percolate *into* the
critical region, but not out of it. So as far as other CPUs are
concerned, that thread might actually do the "test pending signals"
before it even sets the state to TASK_INTERRUPTIBLE. And yes, that can
race against new signals coming in. So it does look like a race.

I get the feeling that we should put a smp_wmb() into the top of
__schedule(), before the spinlock - that's basically free on any sane
architecture (yeah, yeah, a bad memory ordering implementation can
make even smp_wmb() expensive, but whatever - I can't find it in
myself to care about badly implemented microarchitectures when the
common case gets it right).

That said, at least MIPS does not make spin-locks be that kind of
"semi-transparent" locks, so this race cannot hit MIPS. And I'm pretty
sure loongson has the same memory ordering as mips (ie "sync" will do
a memory barrier).

So this might be a real race for architectures that actually do
aggressive memory re-ordering _and_ actually implement spinlocks with
the aggressive acquire semantics. I think ia64 does. The ppc memory
ordering is confused enough that *maybe* it does, but I doubt it. On
most other architectures I think spinlocks end up being full memory
barriers.

I guess that instead of a "smp_wmb()", we could do another
"smp_mb__before_spinlock()" thing, like we already allow for other
architectures to do a weaker form of mb in case the spinlock is
already a full mb. That would allow avoiding extra synchronization.
Do a

	#ifndef smp_mb__before_spinlock
	#define smp_mb__before_spinlock() smp_wmb()
	#endif

in <linux/spinlock.h> to not force everybody to implement it. Because
a wmb+acquire should be close enough to a full mb that nobody cares
(ok, so reads could move into the critical region from outside, but by
the time anybody has called "schedule()", I can't see it mattering, so
"close enough").

		Linus
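For context, the wait-for-condition shape under discussion looks roughly
like the sketch below. It is a simplified illustration, not a quote of
real kernel source: the comments restate the ordering concern, and the
waker side is reduced to the one call that matters here.

	/* waiter */
	__set_current_state(TASK_INTERRUPTIBLE);   /* plain store, no mb() */
	schedule();                                /* tests signal_pending_state()
	                                              under rq->lock */

	/* waker: a signal sender, after marking the signal pending */
	signal_wake_up(p, 0);                      /* -> try_to_wake_up(), which
	                                              tests p->state under ->pi_lock */

If the waiter's store to ->state can move into the rq->lock critical
section and past the signal check, while the waker still reads the old
->state, neither side performs the wakeup and the new signal is missed;
that is the window the smp_mb__before_spinlock() proposed above is meant
to close.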
* Re: Patch for lost wakeups
From: Oleg Nesterov @ 2013-08-11 17:25 UTC
To: Linus Torvalds
Cc: linux-arch@vger.kernel.org, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/09, Linus Torvalds wrote:
>
> I guess that instead of a "smp_wmb()", we could do another
> "smp_mb__before_spinlock()" thing, like we already allow for other
> architectures to do a weaker form of mb in case the spinlock is
> already a full mb. That would allow avoiding extra synchronization.
> Do a
>
>	#ifndef smp_mb__before_spinlock
>	#define smp_mb__before_spinlock() smp_wmb()
>	#endif
>
> in <linux/spinlock.h> to not force everybody to implement it. Because
> a wmb+acquire should be close enough to a full mb that nobody cares
> (ok, so reads could move into the critical region from outside, but by
> the time anybody has called "schedule()", I can't see it mattering, so
> "close enough").

Yes, this is what I tried to suggest. And of course we should turn that
wmb() in try_to_wake_up() into smp_mb__before_spinlock().

I even started the patch, but we already have smp_mb__after_lock(), so
it should be smp_mb__before_lock() for consistency, and we need to turn
it into a "define" too. Or change ARCH_HAS_SMP_MB_AFTER_LOCK, or add
ARCH_HAS_SMP_MB_BEFORE_LOCK.

Oleg.
* Re: Patch for lost wakeups
From: Oleg Nesterov @ 2013-08-11 17:27 UTC
To: Linus Torvalds
Cc: linux-arch@vger.kernel.org, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/11, Oleg Nesterov wrote:
>
> On 08/09, Linus Torvalds wrote:
> >
> > I guess that instead of a "smp_wmb()", we could do another
> > "smp_mb__before_spinlock()" thing, like we already allow for other
> > architectures to do a weaker form of mb in case the spinlock is
> > already a full mb. That would allow avoiding extra synchronization.
> > Do a
> >
> >	#ifndef smp_mb__before_spinlock
> >	#define smp_mb__before_spinlock() smp_wmb()
> >	#endif
> >
> > in <linux/spinlock.h> to not force everybody to implement it. Because
> > a wmb+acquire should be close enough to a full mb that nobody cares
> > (ok, so reads could move into the critical region from outside, but by
> > the time anybody has called "schedule()", I can't see it mattering, so
> > "close enough").
>
> Yes, this is what I tried to suggest. And of course we should turn that
> wmb() in try_to_wake_up() into smp_mb__before_spinlock().
>
> I even started the patch, but we already have smp_mb__after_lock(), so
> it should be smp_mb__before_lock() for consistency, and we need to turn
> it into a "define" too. Or change ARCH_HAS_SMP_MB_AFTER_LOCK, or add
> ARCH_HAS_SMP_MB_BEFORE_LOCK.

Ah, please ignore. According to /bin/grep, smp_mb__after_lock() no
longer has users. So we can probably simply kill it in the same patch.

I'll try to write the changelog and send it tomorrow.

Oleg.
* Re: Patch for lost wakeups
From: Oleg Nesterov @ 2013-08-11 17:39 UTC
To: Long Gao
Cc: Linus Torvalds, linux-arch@vger.kernel.org, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/10, Long Gao wrote:
>
> By the way, could you help me join the linux kernel mailing list?

Do you mean, you want to subscribe?

Well, from http://www.tux.org/lkml/#s3-1:

	send the line "subscribe linux-kernel your_email@your_ISP"
	in the body of the message to majordomo@vger.kernel.org

but since I was never subscribed I do not really know if this works ;)

> I have sent two letters to "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
> but did not receive any confirmation letter, except for two letters containing
> instructions, 3X!

If you simply want to send emails to this list, you do not need
to subscribe. However, I do not see your email in

	http://marc.info/?t=137598606900001

Perhaps this list doesn't like Chinese characters in the subject/body?

Oleg.
* Re: Patch for lost wakeups
From: James Bottomley @ 2013-08-11 23:52 UTC
To: Oleg Nesterov
Cc: Long Gao, Linus Torvalds, linux-arch@vger.kernel.org, Al Viro, Andrew Morton, Linux Kernel Mailing List

On Sun, 2013-08-11 at 19:39 +0200, Oleg Nesterov wrote:
> On 08/10, Long Gao wrote:
> >
> > By the way, could you help me join the linux kernel mailing list?
>
> Do you mean, you want to subscribe?
>
> Well, from http://www.tux.org/lkml/#s3-1:
>
>	send the line "subscribe linux-kernel your_email@your_ISP"
>	in the body of the message to majordomo@vger.kernel.org
>
> but since I was never subscribed I do not really know if this works ;)
>
> > I have sent two letters to "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
> > but did not receive any confirmation letter, except for two letters containing
> > instructions, 3X!
>
> If you simply want to send emails to this list, you do not need
> to subscribe. However, I do not see your email in
>
>	http://marc.info/?t=137598606900001
>
> Perhaps this list doesn't like Chinese characters in the subject/body?

No, we get patches with UTF-16 characters in them OK. The usual reason
vger drops an email is if there's a text/html part. The information
about what vger drops is here:

	http://vger.kernel.org/majordomo-info.html

James
* [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race
From: Oleg Nesterov @ 2013-08-12 17:02 UTC
To: Linus Torvalds
Cc: linux-arch@vger.kernel.org, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List, Paul E. McKenney, Peter Zijlstra, Ingo Molnar

This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock, code like

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal. This is the special case of wait-for-condition:
it relies on the try_to_wake_up/schedule interaction and thus does
not need mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical
section protected by rq->lock. Now that try_to_wake_up() takes
another lock, we need to ensure that it can't be reordered with
the "if (signal_pending(current))" check inside that section.

The patch is actually a one-liner: it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does for the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(),
for better documentation and to allow the architectures to change
the default implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/include/asm/spinlock.h |    4 ----
 include/linux/spinlock.h        |   13 ++++++++++---
 kernel/sched/core.c             |   14 +++++++++++++-
 3 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..e3ddd7d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -233,8 +233,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_relax(lock)	cpu_relax()
 #define arch_write_relax(lock)	cpu_relax()
 
-/* The {read|write|spin}_lock() on x86 are full memory barriers. */
-static inline void smp_mb__after_lock(void) { }
-#define ARCH_HAS_SMP_MB_AFTER_LOCK
-
 #endif /* _ASM_X86_SPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 7d537ce..f21551e 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -117,9 +117,16 @@ do { \
 #endif /*arch_spin_is_contended*/
 #endif
 
-/* The lock does not imply full memory barrier. */
-#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
-static inline void smp_mb__after_lock(void) { smp_mb(); }
+/*
+ * Despite its name, it doesn't necessarily have to be a full barrier.
+ * It should only guarantee that a STORE before the critical section
+ * can not be reordered with a LOAD inside this section.
+ * So the default implementation simply ensures that a STORE can not
+ * move into the critical section, smp_wmb() should serialize it with
+ * another STORE done by spin_lock().
+ */
+#ifndef smp_mb__before_spinlock
+#define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6df0fbe..97dac0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1491,7 +1491,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
-	smp_wmb();
+	/*
+	 * If we are going to wake up a thread waiting for CONDITION we
+	 * need to ensure that CONDITION=1 done by the caller can not be
+	 * reordered with p->state check below. This pairs with mb() in
+	 * set_current_state() the waiting thread does.
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -2394,6 +2400,12 @@ need_resched:
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
 
+	/*
+	 * Make sure that signal_pending_state()->signal_pending() below
+	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
+	 * done by the caller to avoid the race with signal_wake_up().
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
 	switch_count = &prev->nivcsw;
-- 
1.5.5.1
* Re: [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race
From: Peter Zijlstra @ 2013-08-13 7:55 UTC
To: Oleg Nesterov
Cc: Linus Torvalds, linux-arch@vger.kernel.org, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List, Paul E. McKenney, Ingo Molnar

On Mon, Aug 12, 2013 at 07:02:57PM +0200, Oleg Nesterov wrote:
> This is only theoretical, but after try_to_wake_up(p) was changed
> to check p->state under p->pi_lock, code like
>
>	__set_current_state(TASK_INTERRUPTIBLE);
>	schedule();
>
> can miss a signal. This is the special case of wait-for-condition:
> it relies on the try_to_wake_up/schedule interaction and thus does
> not need mb() between __set_current_state() and if (signal_pending).
>
> However, this __set_current_state() can move into the critical
> section protected by rq->lock. Now that try_to_wake_up() takes
> another lock, we need to ensure that it can't be reordered with
> the "if (signal_pending(current))" check inside that section.
>
> The patch is actually a one-liner: it simply adds smp_wmb() before
> spin_lock_irq(rq->lock). This is what try_to_wake_up() already
> does for the same reason.
>
> We turn this wmb() into the new helper, smp_mb__before_spinlock(),
> for better documentation and to allow the architectures to change
> the default implementation.
>
> While at it, kill smp_mb__after_lock(), it has no callers.
>
> Perhaps we can also add smp_mb__before/after_spinunlock() for
> prepare_to_wait().
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Thanks!

> +/*
> + * Despite its name, it doesn't necessarily have to be a full barrier.
> + * It should only guarantee that a STORE before the critical section
> + * can not be reordered with a LOAD inside this section.
> + * So the default implementation simply ensures that a STORE can not
> + * move into the critical section, smp_wmb() should serialize it with
> + * another STORE done by spin_lock().
> + */
> +#ifndef smp_mb__before_spinlock
> +#define smp_mb__before_spinlock() smp_wmb()
> #endif

I would have expected mention of the ACQUIRE of the lock keeping the
LOAD inside the locked section.
* Re: [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race
From: Oleg Nesterov @ 2013-08-13 14:33 UTC
To: Peter Zijlstra
Cc: Linus Torvalds, linux-arch@vger.kernel.org, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List, Paul E. McKenney, Ingo Molnar

On 08/13, Peter Zijlstra wrote:
>
> On Mon, Aug 12, 2013 at 07:02:57PM +0200, Oleg Nesterov wrote:
> > +/*
> > + * Despite its name, it doesn't necessarily have to be a full barrier.
> > + * It should only guarantee that a STORE before the critical section
> > + * can not be reordered with a LOAD inside this section.
> > + * So the default implementation simply ensures that a STORE can not
> > + * move into the critical section, smp_wmb() should serialize it with
> > + * another STORE done by spin_lock().
> > + */
> > +#ifndef smp_mb__before_spinlock
> > +#define smp_mb__before_spinlock() smp_wmb()
> > #endif
>
> I would have expected mention of the ACQUIRE of the lock keeping the
> LOAD inside the locked section.

OK, please see v2 below.

---
From 8de96d3feae3b4b51669902b7c24ac1748ecdbfe Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <oleg@redhat.com>
Date: Mon, 12 Aug 2013 18:14:00 +0200
Subject: sched: fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock, code like

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal. This is the special case of wait-for-condition:
it relies on the try_to_wake_up/schedule interaction and thus does
not need mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical
section protected by rq->lock. Now that try_to_wake_up() takes
another lock, we need to ensure that it can't be reordered with
the "if (signal_pending(current))" check inside that section.

The patch is actually a one-liner: it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does for the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(),
for better documentation and to allow the architectures to change
the default implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/include/asm/spinlock.h |    4 ----
 include/linux/spinlock.h        |   14 +++++++++++---
 kernel/sched/core.c             |   14 +++++++++++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..e3ddd7d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -233,8 +233,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_relax(lock)	cpu_relax()
 #define arch_write_relax(lock)	cpu_relax()
 
-/* The {read|write|spin}_lock() on x86 are full memory barriers. */
-static inline void smp_mb__after_lock(void) { }
-#define ARCH_HAS_SMP_MB_AFTER_LOCK
-
 #endif /* _ASM_X86_SPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 7d537ce..75f3494 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -117,9 +117,17 @@ do { \
 #endif /*arch_spin_is_contended*/
 #endif
 
-/* The lock does not imply full memory barrier. */
-#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
-static inline void smp_mb__after_lock(void) { smp_mb(); }
+/*
+ * Despite its name, it doesn't necessarily have to be a full barrier.
+ * It should only guarantee that a STORE before the critical section
+ * can not be reordered with a LOAD inside this section.
+ * spin_lock() is a one-way barrier: this LOAD can not escape out
+ * of the region. So the default implementation simply ensures that
+ * a STORE can not move into the critical section, smp_wmb() should
+ * serialize it with another STORE done by spin_lock().
+ */
+#ifndef smp_mb__before_spinlock
+#define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6df0fbe..97dac0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1491,7 +1491,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
-	smp_wmb();
+	/*
+	 * If we are going to wake up a thread waiting for CONDITION we
+	 * need to ensure that CONDITION=1 done by the caller can not be
+	 * reordered with p->state check below. This pairs with mb() in
+	 * set_current_state() the waiting thread does.
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -2394,6 +2400,12 @@ need_resched:
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
 
+	/*
+	 * Make sure that signal_pending_state()->signal_pending() below
+	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
+	 * done by the caller to avoid the race with signal_wake_up().
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
 	switch_count = &prev->nivcsw;
-- 
1.5.5.1
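The changelog leaves room for an architecture to override the default.
A hypothetical override, not part of either version of the patch above,
could live in that architecture's asm/spinlock.h and might look like the
sketch below, assuming that architecture's spin_lock() already acts as a
full memory barrier (so the generic smp_wmb() default would be redundant)
and already contains the needed compiler barrier:

	/* hypothetical arch override, not from this thread */
	#define smp_mb__before_spinlock()	do { } while (0)

Since <linux/spinlock.h> only supplies its smp_wmb() default under
#ifndef smp_mb__before_spinlock, such an arch definition would simply
take precedence.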
Thread overview:
2013-08-09 18:21 ` Patch for lost wakeups Linus Torvalds
2013-08-11 17:25 ` Oleg Nesterov
2013-08-11 17:27 ` Oleg Nesterov
[not found] ` <tencent_293B72F26D71A4191C7C999A@qq.com>
2013-08-11 17:39 ` Oleg Nesterov
2013-08-11 23:52 ` James Bottomley
2013-08-12 17:02 ` [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race Oleg Nesterov
2013-08-13 7:55 ` Peter Zijlstra
2013-08-13 14:33 ` Oleg Nesterov