linux-kernel.vger.kernel.org archive mirror
* Q: schedule() and implied barriers on arm64
@ 2015-10-16 15:18 Peter Zijlstra
  2015-10-16 16:04 ` Paul E. McKenney
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2015-10-16 15:18 UTC (permalink / raw)
  To: Will Deacon, Paul McKenney; +Cc: linux-kernel, Oleg Nesterov, Ingo Molnar

Hi,

IIRC Paul relies on schedule() implying a full memory barrier with
strong transitivity for RCU.

If not, ignore this email.

If so, however, I suspect AARGH64 is borken and would need (just like
PPC):

#define smp_mb__before_spinlock()	smp_mb()

The problem is that schedule() (when a NO-OP) does:

	smp_mb__before_spinlock();
	LOCK rq->lock

	clear_bit()

	UNLOCK rq->lock

And nothing there implies a full barrier on AARGH64, since
smp_mb__before_spinlock() defaults to WMB, LOCK is an "ldaxr" or
load-acquire, UNLOCK is "stlrh" or store-release and clear_bit() isn't
anything.

Pretty much every other arch has LOCK implying a full barrier, either
because it's strongly ordered or because it needs one for the ACQUIRE
semantics.
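For reference, the generic fallback and the PPC override look roughly like
this (a sketch from memory of include/linux/spinlock.h and the powerpc
headers of this era, not a verbatim copy):

/* Generic fallback: only keeps prior STOREs out of the critical section. */
#ifndef smp_mb__before_spinlock
#define smp_mb__before_spinlock()	smp_wmb()	/* "dmb ishst" on arm64 */
#endif

/* powerpc override: upgrade to a full barrier. */
#define smp_mb__before_spinlock()	smp_mb()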




* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 15:18 Q: schedule() and implied barriers on arm64 Peter Zijlstra
@ 2015-10-16 16:04 ` Paul E. McKenney
  2015-10-16 16:16   ` Peter Zijlstra
  2015-10-16 17:11   ` Catalin Marinas
  0 siblings, 2 replies; 20+ messages in thread
From: Paul E. McKenney @ 2015-10-16 16:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Will Deacon, linux-kernel, Oleg Nesterov, Ingo Molnar

On Fri, Oct 16, 2015 at 05:18:30PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> IIRC Paul relies on schedule() implying a full memory barrier with
> strong transitivity for RCU.
> 
> If not, ignore this email.

Not so sure about schedule(), but definitely need strong transitivity
for the rcu_node structure's ->lock field.  And the atomic operations
on the rcu_dyntick structure's fields when entering or leaving the
idle loop.

With schedule, the thread later reports the quiescent state, which
involves acquiring the rcu_node structure's ->lock field.  So I -think-
that the locks in the scheduler can be weakly transitive.
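
The pattern in question looks roughly like this in the RCU code of this
era (a trimmed sketch, not an exact copy of kernel/rcu/tree.c):

	/*
	 * Report a quiescent state: strong ordering against the previous
	 * critical section on this rcu_node lock is required.
	 */
	raw_spin_lock_irqsave(&rnp->lock, flags);
	smp_mb__after_unlock_lock();	/* empty unless the arch needs help */
	/* ... record/propagate the quiescent state under rnp->lock ... */
	raw_spin_unlock_irqrestore(&rnp->lock, flags);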

> If so, however, I suspect AARGH64 is borken and would need (just like
> PPC):
> 
> #define smp_mb__before_spinlock()	smp_mb()
> 
> The problem is that schedule() (when a NO-OP) does:
> 
> 	smp_mb__before_spinlock();
> 	LOCK rq->lock
> 
> 	clear_bit()
> 
> 	UNLOCK rq->lock
> 
> And nothing there implies a full barrier on AARGH64, since
> smp_mb__before_spinlock() defaults to WMB, LOCK is an "ldaxr" or
> load-acquire, UNLOCK is "stlrh" or store-release and clear_bit() isn't
> anything.
> 
> Pretty much every other arch has LOCK implying a full barrier, either
> because its strongly ordered or because it needs one for the ACQUIRE
> semantics.

Well, arm64 might well need smp_mb__after_unlock_lock() to be non-empty.
But I thought that it used a dmb in the spinlock code somewhere or
another...
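
For context, the definitions involved look roughly like this (paraphrased;
the generic version assumes UNLOCK+LOCK is already a full barrier, powerpc
upgrades it):

/* Generic: nothing extra needed after a lock that follows an unlock. */
#ifndef smp_mb__after_unlock_lock
#define smp_mb__after_unlock_lock()	do { } while (0)
#endif

/* powerpc: lwsync-based lock/unlock needs a full barrier here. */
#define smp_mb__after_unlock_lock()	smp_mb()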

							Thanx, Paul



* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 16:04 ` Paul E. McKenney
@ 2015-10-16 16:16   ` Peter Zijlstra
  2015-10-16 16:28     ` Paul E. McKenney
  2015-10-16 16:55     ` Catalin Marinas
  2015-10-16 17:11   ` Catalin Marinas
  1 sibling, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2015-10-16 16:16 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, linux-kernel, Oleg Nesterov, Ingo Molnar

On Fri, Oct 16, 2015 at 09:04:22AM -0700, Paul E. McKenney wrote:
> On Fri, Oct 16, 2015 at 05:18:30PM +0200, Peter Zijlstra wrote:
> > Hi,
> > 
> > IIRC Paul relies on schedule() implying a full memory barrier with
> > strong transitivity for RCU.
> > 
> > If not, ignore this email.
> 
> Not so sure about schedule(), but definitely need strong transitivity
> for the rcu_node structure's ->lock field.  And the atomic operations
> on the rcu_dyntick structure's fields when entering or leaving the
> idle loop.
> 
> With schedule, the thread later reports the quiescent state, which
> involves acquiring the rcu_node structure's ->lock field.  So I -think-
> that the locks in the scheduler can be weakly transitive.

So I _thought_ you needed this to separate the preempt_disabled
sections. Such that rcu_note_context_switch() is guaranteed to be done
before a new preempt_disabled region starts.

But if you really only need program order guarantees for that, and deal
with everything else from your tick, then that's fine too.

Maybe some previous RCU variant relied on this?

> > If so, however, I suspect AARGH64 is borken and would need (just like
> > PPC):
> > 
> > #define smp_mb__before_spinlock()	smp_mb()
> > 
> > The problem is that schedule() (when a NO-OP) does:
> > 
> > 	smp_mb__before_spinlock();
> > 	LOCK rq->lock
> > 
> > 	clear_bit()
> > 
> > 	UNLOCK rq->lock
> > 
> > And nothing there implies a full barrier on AARGH64, since
> > smp_mb__before_spinlock() defaults to WMB, LOCK is an "ldaxr" or
> > load-acquire, UNLOCK is "stlrh" or store-release and clear_bit() isn't
> > anything.
> > 
> > Pretty much every other arch has LOCK implying a full barrier, either
> > because its strongly ordered or because it needs one for the ACQUIRE
> > semantics.
> 
> But I thought that it used a dmb in the spinlock code somewhere or
> another...

arm does, arm64 not so much.

> Well, arm64 might well need smp_mb__after_unlock_lock() to be non-empty.

Its UNLOCK+LOCK should be RCsc, so that should be good. It's just that
LOCK+UNLOCK isn't anything.


* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 16:16   ` Peter Zijlstra
@ 2015-10-16 16:28     ` Paul E. McKenney
  2015-10-16 16:39       ` Peter Zijlstra
  2015-10-16 16:55     ` Catalin Marinas
  1 sibling, 1 reply; 20+ messages in thread
From: Paul E. McKenney @ 2015-10-16 16:28 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Will Deacon, linux-kernel, Oleg Nesterov, Ingo Molnar

On Fri, Oct 16, 2015 at 06:16:08PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 16, 2015 at 09:04:22AM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 16, 2015 at 05:18:30PM +0200, Peter Zijlstra wrote:
> > > Hi,
> > > 
> > > IIRC Paul relies on schedule() implying a full memory barrier with
> > > strong transitivity for RCU.
> > > 
> > > If not, ignore this email.
> > 
> > Not so sure about schedule(), but definitely need strong transitivity
> > for the rcu_node structure's ->lock field.  And the atomic operations
> > on the rcu_dyntick structure's fields when entering or leaving the
> > idle loop.
> > 
> > With schedule, the thread later reports the quiescent state, which
> > involves acquiring the rcu_node structure's ->lock field.  So I -think-
> > that the locks in the scheduler can be weakly transitive.
> 
> So I _thought_ you needed this to separate the preempt_disabled
> sections. Such that rcu_note_context_switch() is guaranteed to be done
> before a new preempt_disabled region starts.
> 
> But if you really only need program order guarantees for that, and deal
> with everything else from your tick, then that's fine too.
> 
> Maybe some previous RCU variant relied on this?

Yes, older versions did rely on this.  Now, only the CPU itself observes
RCU's state changes during context switch.  I couldn't tell you exactly
when this changed.  :-/

With the exception of some synchronize_sched_expedited() cases, but in
those cases, RCU code acquires the CPU's leaf rcu_node structure's
->lock, and with the required strong transitivity.

> > > If so, however, I suspect AARGH64 is borken and would need (just like
> > > PPC):
> > > 
> > > #define smp_mb__before_spinlock()	smp_mb()
> > > 
> > > The problem is that schedule() (when a NO-OP) does:
> > > 
> > > 	smp_mb__before_spinlock();
> > > 	LOCK rq->lock
> > > 
> > > 	clear_bit()
> > > 
> > > 	UNLOCK rq->lock
> > > 
> > > And nothing there implies a full barrier on AARGH64, since
> > > smp_mb__before_spinlock() defaults to WMB, LOCK is an "ldaxr" or
> > > load-acquire, UNLOCK is "stlrh" or store-release and clear_bit() isn't
> > > anything.
> > > 
> > > Pretty much every other arch has LOCK implying a full barrier, either
> > > because its strongly ordered or because it needs one for the ACQUIRE
> > > semantics.
> > 
> > But I thought that it used a dmb in the spinlock code somewhere or
> > another...
> 
> arm does, arm64 not so much.
> 
> > Well, arm64 might well need smp_mb__after_unlock_lock() to be non-empty.
> 
> Its UNLOCK+LOCK should be RCsc, so that should be good. Its just that
> LOCK+UNLOCK isn't anything.

Ah!  If RCU relies on LOCK+UNLOCK being a barrier of any sort, that is a
bug in RCU that needs fixing.

							Thanx, Paul



* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 16:28     ` Paul E. McKenney
@ 2015-10-16 16:39       ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2015-10-16 16:39 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, linux-kernel, Oleg Nesterov, Ingo Molnar

On Fri, Oct 16, 2015 at 09:28:24AM -0700, Paul E. McKenney wrote:

> > Maybe some previous RCU variant relied on this?
> 
> Yes, older versions did rely on this.  Now, only the CPU itself observes
> RCU's state changes during context switch.  I couldn't tell you exactly
> when this changed.  :-/
> 
> With the exception of some synchronize_sched_expedited() cases, but in
> those cases, RCU code acquires the CPU's leaf rcu_node structure's
> ->lock, and with the required strong transitivity.

OK, so I can scrap this 'requirement' from my list. All sorted, thanks!

> > > Well, arm64 might well need smp_mb__after_unlock_lock() to be non-empty.
> > 
> > Its UNLOCK+LOCK should be RCsc, so that should be good. Its just that
> > LOCK+UNLOCK isn't anything.
> 
> Ah!  If RCU relies on LOCK+UNLOCK being a barrier of any sort, that is a
> bug in RCU that needs fixing.

I don't think RCU does that, but it's what schedule() provides in the
weakest case. Hence my question here.


* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 16:16   ` Peter Zijlstra
  2015-10-16 16:28     ` Paul E. McKenney
@ 2015-10-16 16:55     ` Catalin Marinas
  2015-10-16 17:28       ` Paul E. McKenney
  2015-10-16 19:06       ` Peter Zijlstra
  1 sibling, 2 replies; 20+ messages in thread
From: Catalin Marinas @ 2015-10-16 16:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Will Deacon, Linux Kernel Mailing List,
	Oleg Nesterov, Ingo Molnar

I'll try to reply in Will's absence, though I gave up trying to
understand these threads a long time ago ;).

On 16 October 2015 at 17:16, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 16, 2015 at 09:04:22AM -0700, Paul E. McKenney wrote:
>> On Fri, Oct 16, 2015 at 05:18:30PM +0200, Peter Zijlstra wrote:
>> > If so, however, I suspect AARGH64 is borken and would need (just like
>> > PPC):
>> >
>> > #define smp_mb__before_spinlock()   smp_mb()
>> >
>> > The problem is that schedule() (when a NO-OP) does:
>> >
>> >     smp_mb__before_spinlock();
>> >     LOCK rq->lock
>> >
>> >     clear_bit()
>> >
>> >     UNLOCK rq->lock
>> >
>> > And nothing there implies a full barrier on AARGH64, since
>> > smp_mb__before_spinlock() defaults to WMB, LOCK is an "ldaxr" or
>> > load-acquire, UNLOCK is "stlrh" or store-release and clear_bit() isn't
>> > anything.
>> >
>> > Pretty much every other arch has LOCK implying a full barrier, either
>> > because its strongly ordered or because it needs one for the ACQUIRE
>> > semantics.
>>
>> But I thought that it used a dmb in the spinlock code somewhere or
>> another...
>
> arm does, arm64 not so much.

arm64 indeed does not have a dmb after spin_lock, it only has a
load-acquire. So with the default smp_mb__before_spinlock() +
spin_lock we have:

  smp_wmb()
  loop
    load-acquire
    store

So (I think) this guarantees that any writes before wmb+lock would be
visible before any reads _and_ writes after wmb+lock. However, the
ordering with reads before wmb+lock is not guaranteed.

-- 
Catalin


* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 16:04 ` Paul E. McKenney
  2015-10-16 16:16   ` Peter Zijlstra
@ 2015-10-16 17:11   ` Catalin Marinas
  1 sibling, 0 replies; 20+ messages in thread
From: Catalin Marinas @ 2015-10-16 17:11 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Peter Zijlstra, Will Deacon, Linux Kernel Mailing List,
	Oleg Nesterov, Ingo Molnar

On 16 October 2015 at 17:04, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Oct 16, 2015 at 05:18:30PM +0200, Peter Zijlstra wrote:
>> If so, however, I suspect AARGH64 is borken and would need (just like
>> PPC):
>>
>> #define smp_mb__before_spinlock()     smp_mb()
>>
>> The problem is that schedule() (when a NO-OP) does:
>>
>>       smp_mb__before_spinlock();
>>       LOCK rq->lock
>>
>>       clear_bit()
>>
>>       UNLOCK rq->lock
>>
>> And nothing there implies a full barrier on AARGH64, since
>> smp_mb__before_spinlock() defaults to WMB, LOCK is an "ldaxr" or
>> load-acquire, UNLOCK is "stlrh" or store-release and clear_bit() isn't
>> anything.
>>
>> Pretty much every other arch has LOCK implying a full barrier, either
>> because its strongly ordered or because it needs one for the ACQUIRE
>> semantics.
>
> Well, arm64 might well need smp_mb__after_unlock_lock() to be non-empty.
> But I thought that it used a dmb in the spinlock code somewhere or
> another...

unlock+lock on arm64 is a full barrier, so, IIUC the semantics of
smp_mb__after_unlock_lock(), I don't think we need this to be
non-empty. We could maybe redefine smp_mb__before_spinlock() to be
smp_mb() (rather than wmb) on arm64 if we need smp_mb__before_spinlock()
+ lock to act as a barrier for both reads and writes (see my other
reply to Peter).

-- 
Catalin


* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 16:55     ` Catalin Marinas
@ 2015-10-16 17:28       ` Paul E. McKenney
  2015-10-16 19:07         ` Peter Zijlstra
  2015-10-19 15:18         ` Catalin Marinas
  2015-10-16 19:06       ` Peter Zijlstra
  1 sibling, 2 replies; 20+ messages in thread
From: Paul E. McKenney @ 2015-10-16 17:28 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Peter Zijlstra, Will Deacon, Linux Kernel Mailing List,
	Oleg Nesterov, Ingo Molnar

On Fri, Oct 16, 2015 at 05:55:35PM +0100, Catalin Marinas wrote:
> I'll try to reply in Will's absence, though I gave up trying to
> understand these threads long time ago ;).
> 
> On 16 October 2015 at 17:16, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Oct 16, 2015 at 09:04:22AM -0700, Paul E. McKenney wrote:
> >> On Fri, Oct 16, 2015 at 05:18:30PM +0200, Peter Zijlstra wrote:
> >> > If so, however, I suspect AARGH64 is borken and would need (just like
> >> > PPC):
> >> >
> >> > #define smp_mb__before_spinlock()   smp_mb()
> >> >
> >> > The problem is that schedule() (when a NO-OP) does:
> >> >
> >> >     smp_mb__before_spinlock();
> >> >     LOCK rq->lock
> >> >
> >> >     clear_bit()
> >> >
> >> >     UNLOCK rq->lock
> >> >
> >> > And nothing there implies a full barrier on AARGH64, since
> >> > smp_mb__before_spinlock() defaults to WMB, LOCK is an "ldaxr" or
> >> > load-acquire, UNLOCK is "stlrh" or store-release and clear_bit() isn't
> >> > anything.
> >> >
> >> > Pretty much every other arch has LOCK implying a full barrier, either
> >> > because its strongly ordered or because it needs one for the ACQUIRE
> >> > semantics.
> >>
> >> But I thought that it used a dmb in the spinlock code somewhere or
> >> another...
> >
> > arm does, arm64 not so much.
> 
> arm64 indeed does not have a dmb after spin_lock, it only has a
> load-acquire. So with the default smp_mb__before_spinlock() +
> spin_lock we have:
> 
>   smp_wmb()
>   loop
>     load-acquire
>     store
> 
> So (I think) this guarantees that any writes before wmb+lock would be
> visible before any reads _and_ writes after wmb+lock. However, the
> ordering with reads before wmb+lock is not guaranteed.

So RCU needs the following sort of guarantee:

	void task1(unsigned long flags)
	{
		WRITE_ONCE(x, 1);
		WRITE_ONCE(z, 1);
		raw_spin_unlock_irqrestore(&rnp->lock, flags);
	}

	void task2(unsigned long *flags)
	{
		raw_spin_lock_irqsave(&rnp->lock, *flags);
		smp_mb__after_unlock_lock();
		r1 = READ_ONCE(y);
		r2 = READ_ONCE(z);
	}

	void task3(void)
	{
		WRITE_ONCE(y, 1);
		smp_mb();
		r3 = READ_ONCE(x);
	}

	BUG_ON(!r1 && r2 && !r3); /* After the dust settles. */

In other words, if task2() acquires the lock after task1() releases it,
all CPUs must agree on the order of the operations in the two critical
sections, even if these other CPUs don't acquire the lock.

This same guarantee is needed if task1() and then task2() run in
succession on the same CPU with no additional synchronization of any sort.

Does this work on arm64?

							Thanx, Paul



* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 16:55     ` Catalin Marinas
  2015-10-16 17:28       ` Paul E. McKenney
@ 2015-10-16 19:06       ` Peter Zijlstra
  2015-10-19  7:06         ` Ingo Molnar
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2015-10-16 19:06 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Paul E. McKenney, Will Deacon, Linux Kernel Mailing List,
	Oleg Nesterov, Ingo Molnar

On Fri, Oct 16, 2015 at 05:55:35PM +0100, Catalin Marinas wrote:
> arm64 indeed does not have a dmb after spin_lock, it only has a
> load-acquire. So with the default smp_mb__before_spinlock() +
> spin_lock we have:
> 
>   smp_wmb()
>   loop
>     load-acquire
>     store
> 
> So (I think) this guarantees that any writes before wmb+lock would be
> visible before any reads _and_ writes after wmb+lock. However, the
> ordering with reads before wmb+lock is not guaranteed.

That is my understanding as well, and stores could creep up from below
the unlock and then the reads and those stores can cross and you've
lost.
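
Spelling the NO-OP path out with what each step does and does not order
on arm64 (a sketch; X and Y stand for made-up accesses on either side of it):

	r0 = READ_ONCE(X);		/* earlier read: nothing below orders it against Y */
	smp_mb__before_spinlock();	/* smp_wmb(): orders stores against stores only */
	raw_spin_lock(&rq->lock);	/* ldaxr acquire: keeps later accesses after it */
	clear_tsk_need_resched(prev);	/* the clear_bit(): no implied barrier */
	raw_spin_unlock(&rq->lock);	/* stlrh release: keeps earlier accesses before it,
					   but not later ones */
	WRITE_ONCE(Y, 1);		/* later store: free to be observed before the unlock */

Nothing in the sequence orders the read of X against the store to Y, so
another CPU can observe the two crossing; a full smp_mb() in schedule()
would forbid exactly that.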

In any case, it's all moot now, since Paul no longer requires schedule()
to imply a full barrier.


* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 17:28       ` Paul E. McKenney
@ 2015-10-16 19:07         ` Peter Zijlstra
  2015-10-16 19:20           ` Paul E. McKenney
  2015-10-19 15:18         ` Catalin Marinas
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2015-10-16 19:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Catalin Marinas, Will Deacon, Linux Kernel Mailing List,
	Oleg Nesterov, Ingo Molnar

On Fri, Oct 16, 2015 at 10:28:11AM -0700, Paul E. McKenney wrote:
> In other words, if task2() acquires the lock after task1() releases it,
> all CPUs must agree on the order of the operations in the two critical
> sections, even if these other CPUs don't acquire the lock.
> 
> This same guarantee is needed if task1() and then task2() run in
> succession on the same CPU with no additional synchronization of any sort.
> 
> Does this work on arm64?

Yes, their load-acquire and store-release are RCsc.


* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 19:07         ` Peter Zijlstra
@ 2015-10-16 19:20           ` Paul E. McKenney
  0 siblings, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2015-10-16 19:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Catalin Marinas, Will Deacon, Linux Kernel Mailing List,
	Oleg Nesterov, Ingo Molnar

On Fri, Oct 16, 2015 at 09:07:41PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 16, 2015 at 10:28:11AM -0700, Paul E. McKenney wrote:
> > In other words, if task2() acquires the lock after task1() releases it,
> > all CPUs must agree on the order of the operations in the two critical
> > sections, even if these other CPUs don't acquire the lock.
> > 
> > This same guarantee is needed if task1() and then task2() run in
> > succession on the same CPU with no additional synchronization of any sort.
> > 
> > Does this work on arm64?
> 
> Yes, their load-acquire and store-release are RCsc.

Whew!!!

							Thanx, Paul



* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 19:06       ` Peter Zijlstra
@ 2015-10-19  7:06         ` Ingo Molnar
  2015-10-19  9:04           ` Peter Zijlstra
  2015-10-19 15:21           ` Catalin Marinas
  0 siblings, 2 replies; 20+ messages in thread
From: Ingo Molnar @ 2015-10-19  7:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Catalin Marinas, Paul E. McKenney, Will Deacon,
	Linux Kernel Mailing List, Oleg Nesterov, Linus Torvalds,
	Andrew Morton


* Peter Zijlstra <peterz@infradead.org> wrote:

> In any case, its all moot now, since Paul no longer requires schedule() to imply 
> a full barrier.
>
> [...]

Nevertheless from a least-surprise POV it might be worth guaranteeing it, because 
I bet there's tons of code that assumes that schedule() is a heavy operation and 
it's such an easy mistake to make. Since we are so close to having that guarantee, 
we might as well codify it?

Just like system calls are assumed to be barriers in general - and system calls 
are more lightweight than schedule() ...

Thanks,

	Ingo


* Re: Q: schedule() and implied barriers on arm64
  2015-10-19  7:06         ` Ingo Molnar
@ 2015-10-19  9:04           ` Peter Zijlstra
  2015-10-19 15:21           ` Catalin Marinas
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2015-10-19  9:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Catalin Marinas, Paul E. McKenney, Will Deacon,
	Linux Kernel Mailing List, Oleg Nesterov, Linus Torvalds,
	Andrew Morton

On Mon, Oct 19, 2015 at 09:06:05AM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > In any case, its all moot now, since Paul no longer requires schedule() to imply 
> > a full barrier.
> >
> > [...]
> 
> Nevertheless from a least-surprise POV it might be worth guaranteeing it, because 
> I bet there's tons of code that assumes that schedule() is a heavy operation and 
> it's such an easy mistake to make. Since we are so close to having that guarantee, 
> we might as well codify it?
> 
> Just like system calls are assumed to be barriers in general - 

Are they? I know they are on some platforms, but I'm not sure we've
audited them all and established this.

> and system calls are more lightweight than schedule() ...

Hopefully, although if you enable nohz_full there's a fair chance to
reverse that :-)


* Re: Q: schedule() and implied barriers on arm64
  2015-10-16 17:28       ` Paul E. McKenney
  2015-10-16 19:07         ` Peter Zijlstra
@ 2015-10-19 15:18         ` Catalin Marinas
  1 sibling, 0 replies; 20+ messages in thread
From: Catalin Marinas @ 2015-10-19 15:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Will Deacon, Linux Kernel Mailing List,
	Oleg Nesterov, Ingo Molnar

On Fri, Oct 16, 2015 at 10:28:11AM -0700, Paul E. McKenney wrote:
> So RCU needs the following sort of guarantee:
> 
> 	void task1(unsigned long flags)
> 	{
> 		WRITE_ONCE(x, 1);
> 		WRITE_ONCE(z, 1);
> 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> 	}
> 
> 	void task2(unsigned long *flags)
> 	{
> 		raw_spin_lock_irqsave(&rnp->lock, *flags);
> 		smp_mb__after_unlock_lock();
> 		r1 = READ_ONCE(y);
> 		r2 = READ_ONCE(z);
> 	}
> 
> 	void task3(void)
> 	{
> 		WRITE_ONCE(y, 1);
> 		smp_mb();
> 		r3 = READ_ONCE(x);
> 	}
> 
> 	BUG_ON(!r1 && r2 && !r3); /* After the dust settles. */
> 
> In other words, if task2() acquires the lock after task1() releases it,
> all CPUs must agree on the order of the operations in the two critical
> sections, even if these other CPUs don't acquire the lock.
> 
> This same guarantee is needed if task1() and then task2() run in
> succession on the same CPU with no additional synchronization of any sort.
> 
> Does this work on arm64?

I think it does. If r3 == 0, it means that READ_ONCE(x) in task3 is
"observed" (in ARM ARM terms) by task1 before WRITE_ONCE(x, 1). The
smp_mb() in task3 implies that WRITE_ONCE(y, 1) is also observed by
task1.

A store-release is multi-copy atomic when "observed" with a load-acquire
(from task2). When on the same CPU, they are always observed in program
order. The store-release on ARM has the property that writes observed by
task1 before its store-release (that is, WRITE_ONCE(y, 1) from task3) will
be observed by other observers (task2) before the store-release itself is
observed (the unlock).

The above rules guarantee that, when r3 == 0, WRITE_ONCE(y, 1) in task3
is observed by task2 (and task1), hence r1 == 1.

(a more formal proof would have to wait for Will to come back from
holiday ;))

-- 
Catalin



* Re: Q: schedule() and implied barriers on arm64
  2015-10-19  7:06         ` Ingo Molnar
  2015-10-19  9:04           ` Peter Zijlstra
@ 2015-10-19 15:21           ` Catalin Marinas
  2015-10-19 16:24             ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Catalin Marinas @ 2015-10-19 15:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Paul E. McKenney, Will Deacon,
	Linux Kernel Mailing List, Oleg Nesterov, Linus Torvalds,
	Andrew Morton

On Mon, Oct 19, 2015 at 09:06:05AM +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > In any case, its all moot now, since Paul no longer requires schedule() to imply 
> > a full barrier.
> >
> > [...]
> 
> Nevertheless from a least-surprise POV it might be worth guaranteeing it, because 
> I bet there's tons of code that assumes that schedule() is a heavy operation and 
> it's such an easy mistake to make. Since we are so close to having that guarantee, 
> we might as well codify it?

FWIW, the arm64 __switch_to() has a heavy barrier (DSB) but the reason
for this was to cope with potentially interrupted cache or TLB
maintenance (which require a DSB on the same CPU) and thread migration
to another CPU.

-- 
Catalin



* Re: Q: schedule() and implied barriers on arm64
  2015-10-19 15:21           ` Catalin Marinas
@ 2015-10-19 16:24             ` Peter Zijlstra
  2015-10-20  8:37               ` Ingo Molnar
  2015-10-27 16:19               ` Will Deacon
  0 siblings, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2015-10-19 16:24 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ingo Molnar, Paul E. McKenney, Will Deacon,
	Linux Kernel Mailing List, Oleg Nesterov, Linus Torvalds,
	Andrew Morton

On Mon, Oct 19, 2015 at 04:21:08PM +0100, Catalin Marinas wrote:
> On Mon, Oct 19, 2015 at 09:06:05AM +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > In any case, its all moot now, since Paul no longer requires schedule() to imply 
> > > a full barrier.
> > >
> > > [...]
> > 
> > Nevertheless from a least-surprise POV it might be worth guaranteeing it, because 
> > I bet there's tons of code that assumes that schedule() is a heavy operation and 
> > it's such an easy mistake to make. Since we are so close to having that guarantee, 
> > we might as well codify it?
> 
> FWIW, the arm64 __switch_to() has a heavy barrier (DSB) but the reason
> for this was to cope with potentially interrupted cache or TLB
> maintenance (which require a DSB on the same CPU) and thread migration
> to another CPU.

Right, but there's a path through schedule() that does not pass through
__switch_to(); when we pick the current task as the most eligible task
and next == prev.

In that case there really only is the wmb, a spin lock, an atomic op and
a spin unlock (and a whole bunch of 'normal' code of course).
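
Trimmed down, that path through __schedule() looks roughly like this (a
sketch of kernel/sched/core.c from this era, heavily abbreviated):

	static void __sched __schedule(void)
	{
		/* ... */
		smp_mb__before_spinlock();		/* smp_wmb() by default */
		raw_spin_lock_irq(&rq->lock);		/* acquire only on arm64 */
		/* ... */
		clear_tsk_need_resched(prev);		/* the atomic op */
		next = pick_next_task(rq, prev);
		if (likely(prev != next)) {
			rq->curr = next;
			context_switch(rq, prev, next);	/* reaches __switch_to() and its DSB */
		} else {
			raw_spin_unlock_irq(&rq->lock);	/* release only: no DSB/DMB on this path */
		}
	}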


* Re: Q: schedule() and implied barriers on arm64
  2015-10-19 16:24             ` Peter Zijlstra
@ 2015-10-20  8:37               ` Ingo Molnar
  2015-10-27 16:19               ` Will Deacon
  1 sibling, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2015-10-20  8:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Catalin Marinas, Paul E. McKenney, Will Deacon,
	Linux Kernel Mailing List, Oleg Nesterov, Linus Torvalds,
	Andrew Morton


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Oct 19, 2015 at 04:21:08PM +0100, Catalin Marinas wrote:
> > On Mon, Oct 19, 2015 at 09:06:05AM +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > In any case, its all moot now, since Paul no longer requires schedule() to imply 
> > > > a full barrier.
> > > >
> > > > [...]
> > > 
> > > Nevertheless from a least-surprise POV it might be worth guaranteeing it, 
> > > because I bet there's tons of code that assumes that schedule() is a heavy 
> > > operation and it's such an easy mistake to make. Since we are so close to 
> > > having that guarantee, we might as well codify it?
> > 
> > FWIW, the arm64 __switch_to() has a heavy barrier (DSB) but the reason for 
> > this was to cope with potentially interrupted cache or TLB maintenance (which 
> > require a DSB on the same CPU) and thread migration to another CPU.
> 
> Right, but there's a path through schedule() that does not pass through 
> __switch_to(); when we pick the current task as the most eligible task and next 
> == prev.
> 
> In that case there really only is the wmb, a spin lock, an atomic op and a spin 
> unlock (and a whole bunch of 'normal' code of course).

Yeah, so my concern is that this is a rare race that might be 'surprising' for
developers relying on various schedule() constructs. Especially as it's a full
barrier on x86 (the most prominent SMP platform at the moment), there's a real
danger of hard-to-debug bugs creeping into other architectures.

So I think we should just do the small step of making it a full barrier everywhere 
- it's very close to it in any case, and it shouldn't really matter for 
performance. Agreed?


Thanks,

	Ingo


* Re: Q: schedule() and implied barriers on arm64
  2015-10-19 16:24             ` Peter Zijlstra
  2015-10-20  8:37               ` Ingo Molnar
@ 2015-10-27 16:19               ` Will Deacon
  2015-10-27 18:40                 ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Will Deacon @ 2015-10-27 16:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Catalin Marinas, Ingo Molnar, Paul E. McKenney,
	Linux Kernel Mailing List, Oleg Nesterov, Linus Torvalds,
	Andrew Morton

On Mon, Oct 19, 2015 at 06:24:23PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 19, 2015 at 04:21:08PM +0100, Catalin Marinas wrote:
> > On Mon, Oct 19, 2015 at 09:06:05AM +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > In any case, its all moot now, since Paul no longer requires schedule() to imply 
> > > > a full barrier.
> > > >
> > > > [...]
> > > 
> > > Nevertheless from a least-surprise POV it might be worth guaranteeing it, because 
> > > I bet there's tons of code that assumes that schedule() is a heavy operation and 
> > > it's such an easy mistake to make. Since we are so close to having that guarantee, 
> > > we might as well codify it?
> > 
> > FWIW, the arm64 __switch_to() has a heavy barrier (DSB) but the reason
> > for this was to cope with potentially interrupted cache or TLB
> > maintenance (which require a DSB on the same CPU) and thread migration
> > to another CPU.
> 
> Right, but there's a path through schedule() that does not pass through
> __switch_to(); when we pick the current task as the most eligible task
> and next == prev.
> 
> In that case there really only is the wmb, a spin lock, an atomic op and
> a spin unlock (and a whole bunch of 'normal' code of course).

... and the 'normal' code will have a control hazard somewhere, followed
by the implicit ISB in exception return, so there's a barrier of sorts
there too.

The problem is that people say "full barrier" without defining what it
really means, and we end up going round the houses on things like
transitivity (which ctrl + isb doesn't always give you).

Will


* Re: Q: schedule() and implied barriers on arm64
  2015-10-27 16:19               ` Will Deacon
@ 2015-10-27 18:40                 ` Peter Zijlstra
  2015-10-28 10:39                   ` Will Deacon
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2015-10-27 18:40 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Ingo Molnar, Paul E. McKenney,
	Linux Kernel Mailing List, Oleg Nesterov, Linus Torvalds,
	Andrew Morton

On Tue, Oct 27, 2015 at 04:19:48PM +0000, Will Deacon wrote:
> ... and the 'normal' code will have a control hazard somewhere, followed
> by the implicit ISB in exception return, so there's a barrier of sorts
> there too.

Which exception return?

> The problem is that people say "full barrier" without defining what it
> really means, and we end up going round the houses on things like
> transitivity (which ctrl + isb doesn't always give you).

I pretty much meant smp_mb() here :-)


* Re: Q: schedule() and implied barriers on arm64
  2015-10-27 18:40                 ` Peter Zijlstra
@ 2015-10-28 10:39                   ` Will Deacon
  0 siblings, 0 replies; 20+ messages in thread
From: Will Deacon @ 2015-10-28 10:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Catalin Marinas, Ingo Molnar, Paul E. McKenney,
	Linux Kernel Mailing List, Oleg Nesterov, Linus Torvalds,
	Andrew Morton

On Tue, Oct 27, 2015 at 07:40:20PM +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2015 at 04:19:48PM +0000, Will Deacon wrote:
> > ... and the 'normal' code will have a control hazard somewhere, followed
> > by the implicit ISB in exception return, so there's a barrier of sorts
> > there too.
> 
> Which exception return?

The return to userspace after the interrupt/fault/system call that got us
into the kernel.

> > The problem is that people say "full barrier" without defining what it
> > really means, and we end up going round the houses on things like
> > transitivity (which ctrl + isb doesn't always give you).
> 
> I pretty much meant smp_mb() here :-)

In which case, we don't provide the transitivity guarantees that you would
get from an smp_mb().

Will

