* Re: Patch for lost wakeups
From: Linus Torvalds @ 2013-08-09 18:21 UTC
To: Oleg Nesterov, linux-arch@vger.kernel.org
Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On Fri, Aug 9, 2013 at 6:04 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
>> The case of signals is special, in that the "wakeup criteria" is
>> inside the scheduler itself, but conceptually the rule is the same.
>
> yes, and because the waiter lacks mb().

Hmm. Ok. So say we have a sleeper that does

	__set_task_state(TASK_INTERRUPTIBLE)

(so without the memory barrier semantics of set_task_state()), so we
have one thread doing

	prev->state = TASK_INTERRUPTIBLE
	raw_spin_lock_irq(&rq->lock);
	if (signal_pending_state(prev->state, prev))
		prev->state = TASK_RUNNING;

and since it's a spin-lock, things can in theory percolate *into* the
critical region, but not out of it. So as far as other CPUs are
concerned, that thread might actually do the "test pending signals"
before it even sets the state to TASK_INTERRUPTIBLE. And yes, that can
race against new signals coming in. So it does look like a race.

I get the feeling that we should put a smp_wmb() into the top of
__schedule(), before the spinlock - that's basically free on any sane
architecture (yeah, yeah, a bad memory ordering implementation can
make even smp_wmb() expensive, but whatever - I can't find it in
myself to care about badly implemented microarchitectures when the
common case gets it right).

That said, at least MIPS does not make spin-locks be that kind of
"semi-transparent" locks, so this race cannot hit MIPS. And I'm pretty
sure loongson has the same memory ordering as mips (ie "sync" will do
a memory barrier).

So this might be a real race for architectures that actually do
aggressive memory re-ordering _and_ actually implement spinlocks with
the aggressive acquire semantics. I think ia64 does. The ppc memory
ordering is confused enough that *maybe* it does, but I doubt it. On
most other architectures I think spinlocks end up being full memory
barriers.

I guess that instead of a "smp_wmb()", we could do another
"smp_mb__before_spinlock()" thing, like we already allow for other
architectures to do a weaker form of mb in case the spinlock is
already a full mb. That would allow avoiding extra synchronization.
Do a

	#ifndef smp_mb__before_spinlock
	#define smp_mb__before_spinlock() smp_wmb()
	#endif

in <linux/spinlock.h> to not force everybody to implement it. Because
a wmb+acquire should be close enough to a full mb that nobody cares
(ok, so reads could move into the critical region from outside, but by
the time anybody has called "schedule()", I can't see it mattering, so
"close enough").

		Linus
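For context, the wait-for-condition shape under discussion looks roughly
like the sketch below. It is a simplified illustration, not a quote of
real kernel source: the comments restate the ordering concern, and the
waker side is reduced to the one call that matters here.

	/* waiter */
	__set_current_state(TASK_INTERRUPTIBLE);   /* plain store, no mb() */
	schedule();                                /* tests signal_pending_state()
	                                              under rq->lock */

	/* waker: a signal sender, after marking the signal pending */
	signal_wake_up(p, 0);                      /* -> try_to_wake_up(), which
	                                              tests p->state under ->pi_lock */

If the waiter's store to ->state can move into the rq->lock critical
section and past the signal check, while the waker still reads the old
->state, neither side performs the wakeup and the new signal is missed;
that is the window the smp_mb__before_spinlock() proposed above is meant
to close.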
* Re: Patch for lost wakeups
From: Oleg Nesterov @ 2013-08-11 17:25 UTC
To: Linus Torvalds
Cc: linux-arch@vger.kernel.org, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/09, Linus Torvalds wrote:
>
> I guess that instead of a "smp_wmb()", we could do another
> "smp_mb__before_spinlock()" thing, like we already allow for other
> architectures to do a weaker form of mb in case the spinlock is
> already a full mb. That would allow avoiding extra synchronization.
> Do a
>
>	#ifndef smp_mb__before_spinlock
>	#define smp_mb__before_spinlock() smp_wmb()
>	#endif
>
> in <linux/spinlock.h> to not force everybody to implement it. Because
> a wmb+acquire should be close enough to a full mb that nobody cares
> (ok, so reads could move into the critical region from outside, but by
> the time anybody has called "schedule()", I can't see it mattering, so
> "close enough").

Yes, this is what I tried to suggest. And of course we should turn that
wmb() in try_to_wake_up() into smp_mb__before_spinlock().

I even started the patch, but we already have smp_mb__after_lock(), so
it should be smp_mb__before_lock() for consistency, and we need to turn
it into a "define" too. Or change ARCH_HAS_SMP_MB_AFTER_LOCK, or add
ARCH_HAS_SMP_MB_BEFORE_LOCK.

Oleg.
* Re: Patch for lost wakeups
From: Oleg Nesterov @ 2013-08-11 17:27 UTC
To: Linus Torvalds
Cc: linux-arch@vger.kernel.org, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/11, Oleg Nesterov wrote:
>
> On 08/09, Linus Torvalds wrote:
> >
> > I guess that instead of a "smp_wmb()", we could do another
> > "smp_mb__before_spinlock()" thing, like we already allow for other
> > architectures to do a weaker form of mb in case the spinlock is
> > already a full mb. That would allow avoiding extra synchronization.
> > Do a
> >
> >	#ifndef smp_mb__before_spinlock
> >	#define smp_mb__before_spinlock() smp_wmb()
> >	#endif
> >
> > in <linux/spinlock.h> to not force everybody to implement it. Because
> > a wmb+acquire should be close enough to a full mb that nobody cares
> > (ok, so reads could move into the critical region from outside, but by
> > the time anybody has called "schedule()", I can't see it mattering, so
> > "close enough").
>
> Yes, this is what I tried to suggest. And of course we should turn that
> wmb() in try_to_wake_up() into smp_mb__before_spinlock().
>
> I even started the patch, but we already have smp_mb__after_lock(), so
> it should be smp_mb__before_lock() for consistency, and we need to turn
> it into a "define" too. Or change ARCH_HAS_SMP_MB_AFTER_LOCK, or add
> ARCH_HAS_SMP_MB_BEFORE_LOCK.

Ah, please ignore. According to /bin/grep, smp_mb__after_lock() no
longer has users. So we can probably simply kill it in the same patch.

I'll try to write the changelog and send it tomorrow.

Oleg.
* Re: Patch for lost wakeups
From: Oleg Nesterov @ 2013-08-11 17:39 UTC
To: Long Gao
Cc: Linus Torvalds, linux-arch@vger.kernel.org, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/10, Long Gao wrote:
>
> By the way, could you help me join the linux kernel mailing list?

Do you mean, you want to subscribe?

Well, from http://www.tux.org/lkml/#s3-1:

	send the line "subscribe linux-kernel your_email@your_ISP"
	in the body of the message to majordomo@vger.kernel.org

but since I was never subscribed I do not really know if this works ;)

> I have sent two letters to "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
> but did not receive any confirmation letter, except for two letters containing
> instructions, 3X!

If you simply want to send emails to this list, you do not need
to subscribe. However, I do not see your email in

	http://marc.info/?t=137598606900001

Perhaps this list doesn't like Chinese characters in the subject/body?

Oleg.
* Re: Patch for lost wakeups
From: James Bottomley @ 2013-08-11 23:52 UTC
To: Oleg Nesterov
Cc: Long Gao, Linus Torvalds, linux-arch@vger.kernel.org, Al Viro, Andrew Morton, Linux Kernel Mailing List

On Sun, 2013-08-11 at 19:39 +0200, Oleg Nesterov wrote:
> On 08/10, Long Gao wrote:
> >
> > By the way, could you help me join the linux kernel mailing list?
>
> Do you mean, you want to subscribe?
>
> Well, from http://www.tux.org/lkml/#s3-1:
>
>	send the line "subscribe linux-kernel your_email@your_ISP"
>	in the body of the message to majordomo@vger.kernel.org
>
> but since I was never subscribed I do not really know if this works ;)
>
> > I have sent two letters to "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
> > but did not receive any confirmation letter, except for two letters containing
> > instructions, 3X!
>
> If you simply want to send emails to this list, you do not need
> to subscribe. However, I do not see your email in
>
>	http://marc.info/?t=137598606900001
>
> Perhaps this list doesn't like Chinese characters in the subject/body?

No, we get patches with UTF-16 characters in them OK. The usual reason
vger drops an email is if there's a text/html part. The information
about what vger drops is here:

	http://vger.kernel.org/majordomo-info.html

James
* [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race
From: Oleg Nesterov @ 2013-08-12 17:02 UTC
To: Linus Torvalds
Cc: linux-arch@vger.kernel.org, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List, Paul E. McKenney, Peter Zijlstra, Ingo Molnar

This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock, code like

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal. This is the special case of wait-for-condition:
it relies on the try_to_wake_up/schedule interaction and thus does
not need mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical
section protected by rq->lock. Now that try_to_wake_up() takes
another lock, we need to ensure that it can't be reordered with
the "if (signal_pending(current))" check inside that section.

The patch is actually a one-liner: it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does for the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(),
for better documentation and to allow the architectures to change
the default implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/include/asm/spinlock.h |    4 ----
 include/linux/spinlock.h        |   13 ++++++++++---
 kernel/sched/core.c             |   14 +++++++++++++-
 3 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..e3ddd7d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -233,8 +233,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_relax(lock)	cpu_relax()
 #define arch_write_relax(lock)	cpu_relax()
 
-/* The {read|write|spin}_lock() on x86 are full memory barriers. */
-static inline void smp_mb__after_lock(void) { }
-#define ARCH_HAS_SMP_MB_AFTER_LOCK
-
 #endif /* _ASM_X86_SPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 7d537ce..f21551e 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -117,9 +117,16 @@ do { \
 #endif /*arch_spin_is_contended*/
 #endif
 
-/* The lock does not imply full memory barrier. */
-#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
-static inline void smp_mb__after_lock(void) { smp_mb(); }
+/*
+ * Despite its name, it doesn't necessarily have to be a full barrier.
+ * It should only guarantee that a STORE before the critical section
+ * can not be reordered with a LOAD inside this section.
+ * So the default implementation simply ensures that a STORE can not
+ * move into the critical section, smp_wmb() should serialize it with
+ * another STORE done by spin_lock().
+ */
+#ifndef smp_mb__before_spinlock
+#define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6df0fbe..97dac0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1491,7 +1491,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
-	smp_wmb();
+	/*
+	 * If we are going to wake up a thread waiting for CONDITION we
+	 * need to ensure that CONDITION=1 done by the caller can not be
+	 * reordered with p->state check below. This pairs with mb() in
+	 * set_current_state() the waiting thread does.
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -2394,6 +2400,12 @@ need_resched:
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
 
+	/*
+	 * Make sure that signal_pending_state()->signal_pending() below
+	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
+	 * done by the caller to avoid the race with signal_wake_up().
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
 	switch_count = &prev->nivcsw;
-- 
1.5.5.1
* Re: [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race
From: Peter Zijlstra @ 2013-08-13 7:55 UTC
To: Oleg Nesterov
Cc: Linus Torvalds, linux-arch@vger.kernel.org, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List, Paul E. McKenney, Ingo Molnar

On Mon, Aug 12, 2013 at 07:02:57PM +0200, Oleg Nesterov wrote:
> This is only theoretical, but after try_to_wake_up(p) was changed
> to check p->state under p->pi_lock, code like
>
>	__set_current_state(TASK_INTERRUPTIBLE);
>	schedule();
>
> can miss a signal. This is the special case of wait-for-condition:
> it relies on the try_to_wake_up/schedule interaction and thus does
> not need mb() between __set_current_state() and if (signal_pending).
>
> However, this __set_current_state() can move into the critical
> section protected by rq->lock. Now that try_to_wake_up() takes
> another lock, we need to ensure that it can't be reordered with
> the "if (signal_pending(current))" check inside that section.
>
> The patch is actually a one-liner: it simply adds smp_wmb() before
> spin_lock_irq(rq->lock). This is what try_to_wake_up() already
> does for the same reason.
>
> We turn this wmb() into the new helper, smp_mb__before_spinlock(),
> for better documentation and to allow the architectures to change
> the default implementation.
>
> While at it, kill smp_mb__after_lock(), it has no callers.
>
> Perhaps we can also add smp_mb__before/after_spinunlock() for
> prepare_to_wait().
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Thanks!

> +/*
> + * Despite its name, it doesn't necessarily have to be a full barrier.
> + * It should only guarantee that a STORE before the critical section
> + * can not be reordered with a LOAD inside this section.
> + * So the default implementation simply ensures that a STORE can not
> + * move into the critical section, smp_wmb() should serialize it with
> + * another STORE done by spin_lock().
> + */
> +#ifndef smp_mb__before_spinlock
> +#define smp_mb__before_spinlock() smp_wmb()
> #endif

I would have expected mention of the ACQUIRE of the lock keeping the
LOAD inside the locked section.
* Re: [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race
From: Oleg Nesterov @ 2013-08-13 14:33 UTC
To: Peter Zijlstra
Cc: Linus Torvalds, linux-arch@vger.kernel.org, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List, Paul E. McKenney, Ingo Molnar

On 08/13, Peter Zijlstra wrote:
>
> On Mon, Aug 12, 2013 at 07:02:57PM +0200, Oleg Nesterov wrote:
> > +/*
> > + * Despite its name, it doesn't necessarily have to be a full barrier.
> > + * It should only guarantee that a STORE before the critical section
> > + * can not be reordered with a LOAD inside this section.
> > + * So the default implementation simply ensures that a STORE can not
> > + * move into the critical section, smp_wmb() should serialize it with
> > + * another STORE done by spin_lock().
> > + */
> > +#ifndef smp_mb__before_spinlock
> > +#define smp_mb__before_spinlock() smp_wmb()
> > #endif
>
> I would have expected mention of the ACQUIRE of the lock keeping the
> LOAD inside the locked section.

OK, please see v2 below.

---
From 8de96d3feae3b4b51669902b7c24ac1748ecdbfe Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <oleg@redhat.com>
Date: Mon, 12 Aug 2013 18:14:00 +0200
Subject: sched: fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock, code like

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal. This is the special case of wait-for-condition:
it relies on the try_to_wake_up/schedule interaction and thus does
not need mb() between __set_current_state() and if (signal_pending).

However, this __set_current_state() can move into the critical
section protected by rq->lock. Now that try_to_wake_up() takes
another lock, we need to ensure that it can't be reordered with
the "if (signal_pending(current))" check inside that section.

The patch is actually a one-liner: it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does for the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(),
for better documentation and to allow the architectures to change
the default implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/include/asm/spinlock.h |    4 ----
 include/linux/spinlock.h        |   14 +++++++++++---
 kernel/sched/core.c             |   14 +++++++++++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..e3ddd7d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -233,8 +233,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_relax(lock)	cpu_relax()
 #define arch_write_relax(lock)	cpu_relax()
 
-/* The {read|write|spin}_lock() on x86 are full memory barriers. */
-static inline void smp_mb__after_lock(void) { }
-#define ARCH_HAS_SMP_MB_AFTER_LOCK
-
 #endif /* _ASM_X86_SPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 7d537ce..75f3494 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -117,9 +117,17 @@ do { \
 #endif /*arch_spin_is_contended*/
 #endif
 
-/* The lock does not imply full memory barrier. */
-#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
-static inline void smp_mb__after_lock(void) { smp_mb(); }
+/*
+ * Despite its name, it doesn't necessarily have to be a full barrier.
+ * It should only guarantee that a STORE before the critical section
+ * can not be reordered with a LOAD inside this section.
+ * spin_lock() is a one-way barrier: this LOAD can not escape out
+ * of the region. So the default implementation simply ensures that
+ * a STORE can not move into the critical section, smp_wmb() should
+ * serialize it with another STORE done by spin_lock().
+ */
+#ifndef smp_mb__before_spinlock
+#define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6df0fbe..97dac0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1491,7 +1491,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
-	smp_wmb();
+	/*
+	 * If we are going to wake up a thread waiting for CONDITION we
+	 * need to ensure that CONDITION=1 done by the caller can not be
+	 * reordered with p->state check below. This pairs with mb() in
+	 * set_current_state() the waiting thread does.
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -2394,6 +2400,12 @@ need_resched:
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
 
+	/*
+	 * Make sure that signal_pending_state()->signal_pending() below
+	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
+	 * done by the caller to avoid the race with signal_wake_up().
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
 	switch_count = &prev->nivcsw;
-- 
1.5.5.1
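The changelog leaves room for an architecture to override the default.
A hypothetical override, not part of either version of the patch above,
could live in that architecture's asm/spinlock.h and might look like the
sketch below, assuming that architecture's spin_lock() already acts as a
full memory barrier (so the generic smp_wmb() default would be redundant)
and already contains the needed compiler barrier:

	/* hypothetical arch override, not from this thread */
	#define smp_mb__before_spinlock()	do { } while (0)

Since <linux/spinlock.h> only supplies its smp_wmb() default under
#ifndef smp_mb__before_spinlock, such an arch definition would simply
take precedence.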
Thread overview:
2013-08-09 18:21 ` Patch for lost wakeups Linus Torvalds
2013-08-11 17:25 ` Oleg Nesterov
2013-08-11 17:27 ` Oleg Nesterov
[not found] ` <tencent_293B72F26D71A4191C7C999A@qq.com>
2013-08-11 17:39 ` Oleg Nesterov
2013-08-11 23:52 ` James Bottomley
2013-08-12 17:02 ` [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race Oleg Nesterov
2013-08-13 7:55 ` Peter Zijlstra
2013-08-13 14:33 ` Oleg Nesterov