* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context [not found] ` <b77cb817-9ffb-b600-19a9-1de685049c57@redhat.com> @ 2017-06-06 14:45 ` Christian Borntraeger 2017-06-06 15:27 ` Heiko Carstens 2017-06-06 16:12 ` Peter Zijlstra 0 siblings, 2 replies; 8+ messages in thread From: Christian Borntraeger @ 2017-06-06 14:45 UTC (permalink / raw) To: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, Heiko Carstens, linux-s390 Adding s390 folks and list On 06/06/2017 03:08 PM, Paolo Bonzini wrote: > > > On 06/06/2017 12:53, Peter Zijlstra wrote: >> On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote: >>> There would be a slowdown if 1) fast this_cpu_inc is not available and >>> cannot be implemented (this usually means that atomic_inc has implicit >>> memory barriers), >> >> I don't get this. >> >> How is per-cpu crud related to being strongly ordered? >> >> this_cpu_ has 3 forms: >> >> x86: single instruction >> arm64,s390: preempt_disable()+atomic_op >> generic: local_irq_save()+normal_op >> >> Only s390 is TSO, arm64 is very much a weak arch. > > Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC. > s390 cannot because its atomic_inc has implicit memory barriers. > > s390's this_cpu_inc is *faster* than the generic one, but still pretty slow. FWIW, we improved the performance of local_irq_save/restore some time ago with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and disable/enable seem to be reasonably fast (3-5ns on my system doing both disable/enable in a loop) on today's systems. So I would assume that the generic implementation would not be that bad. At the same time, the implicit memory barrier of the atomic_inc should be even cheaper.
In contrast to x86, a full smp_mb seems to be almost for free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I _think_ that this should be really fast enough. As a side note, I am asking myself, though, why we do need the preempt_disable/enable for the cases where we use the opcodes like lao (atomic load and or to a memory location) and friends. > >>> and 2) local_irq_save/restore is slower than disabling >>> preemption. The main architecture with these constraints is s390, which >>> however is already paying the price in __srcu_read_unlock and has not >>> complained. >> >> IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as >> fast as preempt_disable(). > > 1 = arch-specific this_cpu_inc is available > 2 = local_irq_save/restore as fast as preempt_disable/enable > > If either 1 or 2 are true, this patch makes SRCU faster or equal > > x86 (single instruction): 1 = true, 2 = false -> ok > arm64 (weakly ordered): 1 = true, 2 = false -> ok > powerpc: 1 = false, 2 = true -> ok > s390: 1 = false, 2 = false -> slower > > For other LL/SC architectures, notably arm, fast this_cpu_* ops not yet > available, but could be written pretty easily. > >>> A valid optimization on s390 would be to skip the smp_mb; >>> AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation. >> >> You mean the s390 this_cpu_inc() in specific, right? Because >> this_cpu_inc() in general does not imply any such thing. > > Yes, of course, this is only for s390. > > Alternatively, we could change the counters to atomic_t and use > smp_mb__{before,after}_atomic, as in the (unnecessary) srcutiny patch. > That should shave a few cycles on x86 too, since "lock inc" is faster > than "inc; mfence". For srcuclassic (and stable) however I'd rather > keep the simple __this_cpu_inc -> this_cpu_inc change. > > Paolo > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 14:45 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Christian Borntraeger @ 2017-06-06 15:27 ` Heiko Carstens 2017-06-06 15:37 ` Christian Borntraeger 2017-06-06 16:15 ` Peter Zijlstra 2017-06-06 16:12 ` Peter Zijlstra 1 sibling, 2 replies; 8+ messages in thread From: Heiko Carstens @ 2017-06-06 15:27 UTC (permalink / raw) To: Christian Borntraeger Cc: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > Adding s390 folks and list > >> Only s390 is TSO, arm64 is very much a weak arch. > > > > Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC. > > s390 cannot because its atomic_inc has implicit memory barriers. > > > > s390's this_cpu_inc is *faster* than the generic one, but still pretty slow. > > FWIW, we improved the performance of local_irq_save/restore some time ago > with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and > disable/enable seem to be reasonably fast (3-5ns on my system doing both > disable/enable in a loop) on todays systems. So I would assume that the > generic implementation would not be that bad. > > A the same time, the implicit memory barrier of the atomic_inc should be > even cheaper. In contrast to x86, a full smp_mb seems to be almost for > free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I > _think_ that this should be really fast enough. > > As a side note, I am asking myself, though, why we do need the > preempt_disable/enable for the cases where we use the opcodes > like lao (atomic load and or to a memory location) and friends. 
Because you want the atomic instruction to be executed on the local cpu for which you have the per cpu pointer. If you get preempted to a different cpu between the ptr__ assignment and the lan instruction it might not be executed on the local cpu. It's not really a correctness issue. #define arch_this_cpu_to_op(pcp, val, op) \ { \ typedef typeof(pcp) pcp_op_T__; \ pcp_op_T__ val__ = (val); \ pcp_op_T__ old__, *ptr__; \ preempt_disable(); \ ptr__ = raw_cpu_ptr(&(pcp)); \ asm volatile( \ op " %[old__],%[val__],%[ptr__]\n" \ : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ : [val__] "d" (val__) \ : "cc"); \ preempt_enable(); \ } #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") However in reality it doesn't matter at all, since all distributions we care about have preemption disabled. So this_cpu_inc() should just generate three instructions: two to calculate the percpu pointer and an additional asi for the atomic increment, with operand specific serialization. This is supposed to be a lot faster than disabling/enabling interrupts around a non-atomic operation. But maybe I didn't get the point of this thread :) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 15:27 ` Heiko Carstens @ 2017-06-06 15:37 ` Christian Borntraeger 2017-06-06 15:58 ` Paul E. McKenney 2017-06-06 16:15 ` Peter Zijlstra 1 sibling, 1 reply; 8+ messages in thread From: Christian Borntraeger @ 2017-06-06 15:37 UTC (permalink / raw) To: Heiko Carstens Cc: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On 06/06/2017 05:27 PM, Heiko Carstens wrote: > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: >> Adding s390 folks and list >>>> Only s390 is TSO, arm64 is very much a weak arch. >>> >>> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC. >>> s390 cannot because its atomic_inc has implicit memory barriers. >>> >>> s390's this_cpu_inc is *faster* than the generic one, but still pretty slow. >> >> FWIW, we improved the performance of local_irq_save/restore some time ago >> with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and >> disable/enable seem to be reasonably fast (3-5ns on my system doing both >> disable/enable in a loop) on todays systems. So I would assume that the >> generic implementation would not be that bad. >> >> A the same time, the implicit memory barrier of the atomic_inc should be >> even cheaper. In contrast to x86, a full smp_mb seems to be almost for >> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I >> _think_ that this should be really fast enough. >> >> As a side note, I am asking myself, though, why we do need the >> preempt_disable/enable for the cases where we use the opcodes >> like lao (atomic load and or to a memory location) and friends. 
> > Because you want the atomic instruction to be executed on the local cpu for > which you have to per cpu pointer. If you get preempted to a different cpu > between the ptr__ assignment and lan instruction it might be executed not > on the local cpu. It's not really a correctness issue. > > #define arch_this_cpu_to_op(pcp, val, op) \ > { \ > typedef typeof(pcp) pcp_op_T__; \ > pcp_op_T__ val__ = (val); \ > pcp_op_T__ old__, *ptr__; \ > preempt_disable(); \ > ptr__ = raw_cpu_ptr(&(pcp)); \ > asm volatile( \ > op " %[old__],%[val__],%[ptr__]\n" \ > : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ > : [val__] "d" (val__) \ > : "cc"); \ > preempt_enable(); \ > } > > #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") > > However in reality it doesn't matter at all, since all distributions we > care about have preemption disabled. > > So this_cpu_inc() should just generate three instructions: two to calculate > the percpu pointer and an additional asi for the atomic increment, with > operand specific serialization. This is supposed to be a lot faster than > disabling/enabling interrupts around a non-atomic operation. > > But maybe I didn't get the point of this thread :) I think on x86 a memory barrier is relatively expensive (e.g. 33 cycles for mfence on Haswell according to http://www.agner.org/optimize/instruction_tables.pdf). The thread started with a change to rcu, which now happens to use these percpu things more often, so I think Paolo's fear is that on s390 we now pay the price for an extra memory barrier due to that change. For the inc case (asi instruction) this should be really really cheap. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 15:37 ` Christian Borntraeger @ 2017-06-06 15:58 ` Paul E. McKenney 0 siblings, 0 replies; 8+ messages in thread From: Paul E. McKenney @ 2017-06-06 15:58 UTC (permalink / raw) To: Christian Borntraeger Cc: Heiko Carstens, Paolo Bonzini, Peter Zijlstra, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 05:37:05PM +0200, Christian Borntraeger wrote: > On 06/06/2017 05:27 PM, Heiko Carstens wrote: > > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > >> Adding s390 folks and list > >>>> Only s390 is TSO, arm64 is very much a weak arch. > >>> > >>> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC. > >>> s390 cannot because its atomic_inc has implicit memory barriers. > >>> > >>> s390's this_cpu_inc is *faster* than the generic one, but still pretty slow. > >> > >> FWIW, we improved the performance of local_irq_save/restore some time ago > >> with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and > >> disable/enable seem to be reasonably fast (3-5ns on my system doing both > >> disable/enable in a loop) on todays systems. So I would assume that the > >> generic implementation would not be that bad. > >> > >> A the same time, the implicit memory barrier of the atomic_inc should be > >> even cheaper. In contrast to x86, a full smp_mb seems to be almost for > >> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I > >> _think_ that this should be really fast enough. > >> > >> As a side note, I am asking myself, though, why we do need the > >> preempt_disable/enable for the cases where we use the opcodes > >> like lao (atomic load and or to a memory location) and friends. 
> > > > Because you want the atomic instruction to be executed on the local cpu for > > which you have to per cpu pointer. If you get preempted to a different cpu > > between the ptr__ assignment and lan instruction it might be executed not > > on the local cpu. It's not really a correctness issue. > > > > #define arch_this_cpu_to_op(pcp, val, op) \ > > { \ > > typedef typeof(pcp) pcp_op_T__; \ > > pcp_op_T__ val__ = (val); \ > > pcp_op_T__ old__, *ptr__; \ > > preempt_disable(); \ > > ptr__ = raw_cpu_ptr(&(pcp)); \ > > asm volatile( \ > > op " %[old__],%[val__],%[ptr__]\n" \ > > : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ > > : [val__] "d" (val__) \ > > : "cc"); \ > > preempt_enable(); \ > > } > > > > #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") > > > > However in reality it doesn't matter at all, since all distributions we > > care about have preemption disabled. > > > > So this_cpu_inc() should just generate three instructions: two to calculate > > the percpu pointer and an additional asi for the atomic increment, with > > operand specific serialization. This is supposed to be a lot faster than > > disabling/enabling interrupts around a non-atomic operation. > > > > But maybe I didn't get the point of this thread :) > > I think on x86 a memory barrier is relatively expensive (e.g. 33 cycles for mfence > on Haswell according to http://www.agner.org/optimize/instruction_tables.pdf). The > thread started with a change to rcu, which now happens to use these percpu things > more often so I think Paolos fear is that on s390 we now pay the price for an extra > memory barrier due to that change. For the inc case (asi instruction) this should be > really really cheap. So what I am seeing from this is that there aren't any real performance issues for this patch series. I will update accordingly. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 15:27 ` Heiko Carstens 2017-06-06 15:37 ` Christian Borntraeger @ 2017-06-06 16:15 ` Peter Zijlstra 2017-06-06 17:00 ` Paul E. McKenney 2017-06-06 17:20 ` Heiko Carstens 1 sibling, 2 replies; 8+ messages in thread From: Peter Zijlstra @ 2017-06-06 16:15 UTC (permalink / raw) To: Heiko Carstens Cc: Christian Borntraeger, Paolo Bonzini, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote: > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > > As a side note, I am asking myself, though, why we do need the > > preempt_disable/enable for the cases where we use the opcodes > > like lao (atomic load and or to a memory location) and friends. > > Because you want the atomic instruction to be executed on the local cpu for > which you have to per cpu pointer. If you get preempted to a different cpu > between the ptr__ assignment and lan instruction it might be executed not > on the local cpu. It's not really a correctness issue. As per the previous email, I think it is a correctness issue wrt CPU hotplug. > > #define arch_this_cpu_to_op(pcp, val, op) \ > { \ > typedef typeof(pcp) pcp_op_T__; \ > pcp_op_T__ val__ = (val); \ > pcp_op_T__ old__, *ptr__; \ > preempt_disable(); \ > ptr__ = raw_cpu_ptr(&(pcp)); \ > asm volatile( \ > op " %[old__],%[val__],%[ptr__]\n" \ > : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ > : [val__] "d" (val__) \ > : "cc"); \ > preempt_enable(); \ > } > > #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") > > However in reality it doesn't matter at all, since all distributions we > care about have preemption disabled. 
Well, either you support PREEMPT=y or you don't :-) If you do, it needs to be correct, irrespective of what distro's do with it. > So this_cpu_inc() should just generate three instructions: two to calculate > the percpu pointer and an additional asi for the atomic increment, with > operand specific serialization. This is supposed to be a lot faster than > disabling/enabling interrupts around a non-atomic operation. So typically we joke about s390 that it has an instruction for this 'very-complicated-thing', but here you guys do not, what gives? ;-) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 16:15 ` Peter Zijlstra @ 2017-06-06 17:00 ` Paul E. McKenney 2017-06-06 17:20 ` Heiko Carstens 1 sibling, 0 replies; 8+ messages in thread From: Paul E. McKenney @ 2017-06-06 17:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Heiko Carstens, Christian Borntraeger, Paolo Bonzini, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 06:15:51PM +0200, Peter Zijlstra wrote: > On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote: > > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > > > > As a side note, I am asking myself, though, why we do need the > > > preempt_disable/enable for the cases where we use the opcodes > > > like lao (atomic load and or to a memory location) and friends. > > > > Because you want the atomic instruction to be executed on the local cpu for > > which you have to per cpu pointer. If you get preempted to a different cpu > > between the ptr__ assignment and lan instruction it might be executed not > > on the local cpu. It's not really a correctness issue. > > As per the previous email, I think it is a correctness issue wrt CPU > hotplug. In the specific case of SRCU, this might actually be OK. We have not yet entered the SRCU read-side critical section, and SRCU grace periods don't interact with CPU hotplug. And the per-CPU variables stick around. So as long as one of the per-CPU counters is incremented properly, it doesn't really matter which one is incremented. But if you overwrite one CPU's counter with the incremented version of some other CPU's counter, yes, that would be very bad indeed.
> > #define arch_this_cpu_to_op(pcp, val, op) \ > > { \ > > typedef typeof(pcp) pcp_op_T__; \ > > pcp_op_T__ val__ = (val); \ > > pcp_op_T__ old__, *ptr__; \ > > preempt_disable(); \ > > ptr__ = raw_cpu_ptr(&(pcp)); \ > > asm volatile( \ > > op " %[old__],%[val__],%[ptr__]\n" \ > > : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ > > : [val__] "d" (val__) \ > > : "cc"); \ > > preempt_enable(); \ > > } > > > > #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") > > > > However in reality it doesn't matter at all, since all distributions we > > care about have preemption disabled. > > Well, either you support PREEMPT=y or you don't :-) If you do, it needs > to be correct, irrespective of what distro's do with it. > > > So this_cpu_inc() should just generate three instructions: two to calculate > > the percpu pointer and an additional asi for the atomic increment, with > > operand specific serialization. This is supposed to be a lot faster than > > disabling/enabling interrupts around a non-atomic operation. > > So typically we joke about s390 that it has an instruction for this > 'very-complicated-thing', but here you guys do not, what gives? ;-) Even mainframes have finite silicon area? ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 16:15 ` Peter Zijlstra 2017-06-06 17:00 ` Paul E. McKenney @ 2017-06-06 17:20 ` Heiko Carstens 1 sibling, 0 replies; 8+ messages in thread From: Heiko Carstens @ 2017-06-06 17:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Christian Borntraeger, Paolo Bonzini, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 06:15:51PM +0200, Peter Zijlstra wrote: > On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote: > > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > > > > As a side note, I am asking myself, though, why we do need the > > > preempt_disable/enable for the cases where we use the opcodes > > > like lao (atomic load and or to a memory location) and friends. > > > > Because you want the atomic instruction to be executed on the local cpu for > > which you have to per cpu pointer. If you get preempted to a different cpu > > between the ptr__ assignment and lan instruction it might be executed not > > on the local cpu. It's not really a correctness issue. > > As per the previous email, I think it is a correctness issue wrt CPU > hotplug. Yes, I realized that just a minute after I sent the above. > > However in reality it doesn't matter at all, since all distributions we > > care about have preemption disabled. > > Well, either you support PREEMPT=y or you don't :-) If you do, it needs > to be correct, irrespective of what distro's do with it. That is true. Our s390 specific percpu ops are supposed to be correct for PREEMPT=y, and that's apparently the only reason why I added the preempt disable/enable pairs back then. 
I just had to remember why I did that ;) > > So this_cpu_inc() should just generate three instructions: two to calculate > > the percpu pointer and an additional asi for the atomic increment, with > > operand specific serialization. This is supposed to be a lot faster than > > disabling/enabling interrupts around a non-atomic operation. > > So typically we joke about s390 that it has an instruction for this > 'very-complicated-thing', but here you guys do not, what gives? ;-) Tough luck. :) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 14:45 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Christian Borntraeger 2017-06-06 15:27 ` Heiko Carstens @ 2017-06-06 16:12 ` Peter Zijlstra 1 sibling, 0 replies; 8+ messages in thread From: Peter Zijlstra @ 2017-06-06 16:12 UTC (permalink / raw) To: Christian Borntraeger Cc: Paolo Bonzini, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, Heiko Carstens, linux-s390, H. Peter Anvin On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > A the same time, the implicit memory barrier of the atomic_inc should be > even cheaper. In contrast to x86, a full smp_mb seems to be almost for > free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I > _think_ that this should be really fast enough. So there is a patch out there that changes the x86 smp_mb() implementation to do "LOCK ADD some_stack_location, 0" which is lots cheaper than the "MFENCE" instruction and provides similar guarantees. HPA was running that through some of the architects.. ping? (Also, I can imagine OoO CPUs collapsing back-to-back ordering stuff, but what do I know). > As a side note, I am asking myself, though, why we do need the > preempt_disable/enable for the cases where we use the opcodes > like lao (atomic load and or to a memory location) and friends. I suspect the real reason is CPU hotplug, because regular preemption should not matter. It would be the same as getting migrated the moment _after_ you do the $op. But preempt_disable() also holds off hotplug and thereby serializes against hotplug notifiers that want to, for instance, move the value of the per-cpu variable to a still online CPU. 
Without this serialization it would be possible for the $op to happen _after_ the hotplug notifier runs. ^ permalink raw reply [flat|nested] 8+ messages in thread
Thread overview: 8+ messages
[not found] <20170605220919.GA27820@linux.vnet.ibm.com>
[not found] ` <1496700591-30177-1-git-send-email-paulmck@linux.vnet.ibm.com>
[not found] ` <20170606105343.ibhzrk6jwhmoja5t@hirez.programming.kicks-ass.net>
[not found] ` <b77cb817-9ffb-b600-19a9-1de685049c57@redhat.com>
2017-06-06 14:45 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Christian Borntraeger
2017-06-06 15:27 ` Heiko Carstens
2017-06-06 15:37 ` Christian Borntraeger
2017-06-06 15:58 ` Paul E. McKenney
2017-06-06 16:15 ` Peter Zijlstra
2017-06-06 17:00 ` Paul E. McKenney
2017-06-06 17:20 ` Heiko Carstens
2017-06-06 16:12 ` Peter Zijlstra