* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context [not found] ` <b77cb817-9ffb-b600-19a9-1de685049c57@redhat.com> @ 2017-06-06 14:45 ` Christian Borntraeger 2017-06-06 15:27 ` Heiko Carstens 2017-06-06 16:12 ` Peter Zijlstra 0 siblings, 2 replies; 8+ messages in thread From: Christian Borntraeger @ 2017-06-06 14:45 UTC (permalink / raw) To: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, Heiko Carstens, linux-s390 Adding s390 folks and list On 06/06/2017 03:08 PM, Paolo Bonzini wrote: > > > On 06/06/2017 12:53, Peter Zijlstra wrote: >> On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote: >>> There would be a slowdown if 1) fast this_cpu_inc is not available and >>> cannot be implemented (this usually means that atomic_inc has implicit >>> memory barriers), >> >> I don't get this. >> >> How is per-cpu crud related to being strongly ordered? >> >> this_cpu_ has 3 forms: >> >> x86: single instruction >> arm64,s390: preempt_disable()+atomic_op >> generic: local_irq_save()+normal_op >> >> Only s390 is TSO, arm64 is very much a weak arch. > > Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC. > s390 cannot because its atomic_inc has implicit memory barriers. > > s390's this_cpu_inc is *faster* than the generic one, but still pretty slow. FWIW, we improved the performance of local_irq_save/restore some time ago with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and disable/enable seem to be reasonably fast (3-5ns on my system doing both disable/enable in a loop) on today's systems. So I would assume that the generic implementation would not be that bad. At the same time, the implicit memory barrier of the atomic_inc should be even cheaper.
In contrast to x86, a full smp_mb seems to be almost for free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I _think_ that this should be really fast enough. As a side note, I am asking myself, though, why we do need the preempt_disable/enable for the cases where we use the opcodes like lao (atomic load and or to a memory location) and friends. > >>> and 2) local_irq_save/restore is slower than disabling >>> preemption. The main architecture with these constraints is s390, which >>> however is already paying the price in __srcu_read_unlock and has not >>> complained. >> >> IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as >> fast as preempt_disable(). > > 1 = arch-specific this_cpu_inc is available > 2 = local_irq_save/restore as fast as preempt_disable/enable > > If either 1 or 2 are true, this patch makes SRCU faster or equal > > x86 (single instruction): 1 = true, 2 = false -> ok > arm64 (weakly ordered): 1 = true, 2 = false -> ok > powerpc: 1 = false, 2 = true -> ok > s390: 1 = false, 2 = false -> slower > > For other LL/SC architectures, notably arm, fast this_cpu_* ops not yet > available, but could be written pretty easily. > >>> A valid optimization on s390 would be to skip the smp_mb; >>> AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation. >> >> You mean the s390 this_cpu_inc() in specific, right? Because >> this_cpu_inc() in general does not imply any such thing. > > Yes, of course, this is only for s390. > > Alternatively, we could change the counters to atomic_t and use > smp_mb__{before,after}_atomic, as in the (unnecessary) srcutiny patch. > That should shave a few cycles on x86 too, since "lock inc" is faster > than "inc; mfence". For srcuclassic (and stable) however I'd rather > keep the simple __this_cpu_inc -> this_cpu_inc change. > > Paolo > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 14:45 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Christian Borntraeger @ 2017-06-06 15:27 ` Heiko Carstens 2017-06-06 15:37 ` Christian Borntraeger 2017-06-06 16:15 ` Peter Zijlstra 2017-06-06 16:12 ` Peter Zijlstra 1 sibling, 2 replies; 8+ messages in thread From: Heiko Carstens @ 2017-06-06 15:27 UTC (permalink / raw) To: Christian Borntraeger Cc: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > Adding s390 folks and list > >> Only s390 is TSO, arm64 is very much a weak arch. > > > > Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC. > > s390 cannot because its atomic_inc has implicit memory barriers. > > > > s390's this_cpu_inc is *faster* than the generic one, but still pretty slow. > > FWIW, we improved the performance of local_irq_save/restore some time ago > with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and > disable/enable seem to be reasonably fast (3-5ns on my system doing both > disable/enable in a loop) on todays systems. So I would assume that the > generic implementation would not be that bad. > > A the same time, the implicit memory barrier of the atomic_inc should be > even cheaper. In contrast to x86, a full smp_mb seems to be almost for > free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I > _think_ that this should be really fast enough. > > As a side note, I am asking myself, though, why we do need the > preempt_disable/enable for the cases where we use the opcodes > like lao (atomic load and or to a memory location) and friends. 
Because you want the atomic instruction to be executed on the local cpu for which you have the per cpu pointer. If you get preempted to a different cpu between the ptr__ assignment and the lan instruction it might not be executed on the local cpu. It's not really a correctness issue. #define arch_this_cpu_to_op(pcp, val, op) \ { \ typedef typeof(pcp) pcp_op_T__; \ pcp_op_T__ val__ = (val); \ pcp_op_T__ old__, *ptr__; \ preempt_disable(); \ ptr__ = raw_cpu_ptr(&(pcp)); \ asm volatile( \ op " %[old__],%[val__],%[ptr__]\n" \ : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ : [val__] "d" (val__) \ : "cc"); \ preempt_enable(); \ } #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") However in reality it doesn't matter at all, since all distributions we care about have preemption disabled. So this_cpu_inc() should just generate three instructions: two to calculate the percpu pointer and an additional asi for the atomic increment, with operand specific serialization. This is supposed to be a lot faster than disabling/enabling interrupts around a non-atomic operation. But maybe I didn't get the point of this thread :) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 15:27 ` Heiko Carstens @ 2017-06-06 15:37 ` Christian Borntraeger 2017-06-06 15:58 ` Paul E. McKenney 2017-06-06 16:15 ` Peter Zijlstra 1 sibling, 1 reply; 8+ messages in thread From: Christian Borntraeger @ 2017-06-06 15:37 UTC (permalink / raw) To: Heiko Carstens Cc: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On 06/06/2017 05:27 PM, Heiko Carstens wrote: > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: >> Adding s390 folks and list >>>> Only s390 is TSO, arm64 is very much a weak arch. >>> >>> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC. >>> s390 cannot because its atomic_inc has implicit memory barriers. >>> >>> s390's this_cpu_inc is *faster* than the generic one, but still pretty slow. >> >> FWIW, we improved the performance of local_irq_save/restore some time ago >> with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and >> disable/enable seem to be reasonably fast (3-5ns on my system doing both >> disable/enable in a loop) on todays systems. So I would assume that the >> generic implementation would not be that bad. >> >> A the same time, the implicit memory barrier of the atomic_inc should be >> even cheaper. In contrast to x86, a full smp_mb seems to be almost for >> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I >> _think_ that this should be really fast enough. >> >> As a side note, I am asking myself, though, why we do need the >> preempt_disable/enable for the cases where we use the opcodes >> like lao (atomic load and or to a memory location) and friends. 
> > Because you want the atomic instruction to be executed on the local cpu for > which you have to per cpu pointer. If you get preempted to a different cpu > between the ptr__ assignment and lan instruction it might be executed not > on the local cpu. It's not really a correctness issue. > > #define arch_this_cpu_to_op(pcp, val, op) \ > { \ > typedef typeof(pcp) pcp_op_T__; \ > pcp_op_T__ val__ = (val); \ > pcp_op_T__ old__, *ptr__; \ > preempt_disable(); \ > ptr__ = raw_cpu_ptr(&(pcp)); \ > asm volatile( \ > op " %[old__],%[val__],%[ptr__]\n" \ > : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ > : [val__] "d" (val__) \ > : "cc"); \ > preempt_enable(); \ > } > > #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") > > However in reality it doesn't matter at all, since all distributions we > care about have preemption disabled. > > So this_cpu_inc() should just generate three instructions: two to calculate > the percpu pointer and an additional asi for the atomic increment, with > operand specific serialization. This is supposed to be a lot faster than > disabling/enabling interrupts around a non-atomic operation. > > But maybe I didn't get the point of this thread :) I think on x86 a memory barrier is relatively expensive (e.g. 33 cycles for mfence on Haswell according to http://www.agner.org/optimize/instruction_tables.pdf). The thread started with a change to rcu, which now happens to use these percpu things more often, so I think Paolo's fear is that on s390 we now pay the price for an extra memory barrier due to that change. For the inc case (asi instruction) this should be really really cheap. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 15:37 ` Christian Borntraeger @ 2017-06-06 15:58 ` Paul E. McKenney 0 siblings, 0 replies; 8+ messages in thread From: Paul E. McKenney @ 2017-06-06 15:58 UTC (permalink / raw) To: Christian Borntraeger Cc: Heiko Carstens, Paolo Bonzini, Peter Zijlstra, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 05:37:05PM +0200, Christian Borntraeger wrote: > On 06/06/2017 05:27 PM, Heiko Carstens wrote: > > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > >> Adding s390 folks and list > >>>> Only s390 is TSO, arm64 is very much a weak arch. > >>> > >>> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC. > >>> s390 cannot because its atomic_inc has implicit memory barriers. > >>> > >>> s390's this_cpu_inc is *faster* than the generic one, but still pretty slow. > >> > >> FWIW, we improved the performance of local_irq_save/restore some time ago > >> with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and > >> disable/enable seem to be reasonably fast (3-5ns on my system doing both > >> disable/enable in a loop) on todays systems. So I would assume that the > >> generic implementation would not be that bad. > >> > >> A the same time, the implicit memory barrier of the atomic_inc should be > >> even cheaper. In contrast to x86, a full smp_mb seems to be almost for > >> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I > >> _think_ that this should be really fast enough. > >> > >> As a side note, I am asking myself, though, why we do need the > >> preempt_disable/enable for the cases where we use the opcodes > >> like lao (atomic load and or to a memory location) and friends. 
> > > > Because you want the atomic instruction to be executed on the local cpu for > > which you have to per cpu pointer. If you get preempted to a different cpu > > between the ptr__ assignment and lan instruction it might be executed not > > on the local cpu. It's not really a correctness issue. > > > > #define arch_this_cpu_to_op(pcp, val, op) \ > > { \ > > typedef typeof(pcp) pcp_op_T__; \ > > pcp_op_T__ val__ = (val); \ > > pcp_op_T__ old__, *ptr__; \ > > preempt_disable(); \ > > ptr__ = raw_cpu_ptr(&(pcp)); \ > > asm volatile( \ > > op " %[old__],%[val__],%[ptr__]\n" \ > > : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ > > : [val__] "d" (val__) \ > > : "cc"); \ > > preempt_enable(); \ > > } > > > > #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") > > > > However in reality it doesn't matter at all, since all distributions we > > care about have preemption disabled. > > > > So this_cpu_inc() should just generate three instructions: two to calculate > > the percpu pointer and an additional asi for the atomic increment, with > > operand specific serialization. This is supposed to be a lot faster than > > disabling/enabling interrupts around a non-atomic operation. > > > > But maybe I didn't get the point of this thread :) > > I think on x86 a memory barrier is relatively expensive (e.g. 33 cycles for mfence > on Haswell according to http://www.agner.org/optimize/instruction_tables.pdf). The > thread started with a change to rcu, which now happens to use these percpu things > more often so I think Paolos fear is that on s390 we now pay the price for an extra > memory barrier due to that change. For the inc case (asi instruction) this should be > really really cheap. So what I am seeing from this is that there aren't any real performance issues for this patch series. I will update accordingly. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 15:27 ` Heiko Carstens 2017-06-06 15:37 ` Christian Borntraeger @ 2017-06-06 16:15 ` Peter Zijlstra 2017-06-06 17:00 ` Paul E. McKenney 2017-06-06 17:20 ` Heiko Carstens 1 sibling, 2 replies; 8+ messages in thread From: Peter Zijlstra @ 2017-06-06 16:15 UTC (permalink / raw) To: Heiko Carstens Cc: Christian Borntraeger, Paolo Bonzini, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote: > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > > As a side note, I am asking myself, though, why we do need the > > preempt_disable/enable for the cases where we use the opcodes > > like lao (atomic load and or to a memory location) and friends. > > Because you want the atomic instruction to be executed on the local cpu for > which you have to per cpu pointer. If you get preempted to a different cpu > between the ptr__ assignment and lan instruction it might be executed not > on the local cpu. It's not really a correctness issue. As per the previous email, I think it is a correctness issue wrt CPU hotplug. > > #define arch_this_cpu_to_op(pcp, val, op) \ > { \ > typedef typeof(pcp) pcp_op_T__; \ > pcp_op_T__ val__ = (val); \ > pcp_op_T__ old__, *ptr__; \ > preempt_disable(); \ > ptr__ = raw_cpu_ptr(&(pcp)); \ > asm volatile( \ > op " %[old__],%[val__],%[ptr__]\n" \ > : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ > : [val__] "d" (val__) \ > : "cc"); \ > preempt_enable(); \ > } > > #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") > > However in reality it doesn't matter at all, since all distributions we > care about have preemption disabled. 
Well, either you support PREEMPT=y or you don't :-) If you do, it needs to be correct, irrespective of what distro's do with it. > So this_cpu_inc() should just generate three instructions: two to calculate > the percpu pointer and an additional asi for the atomic increment, with > operand specific serialization. This is supposed to be a lot faster than > disabling/enabling interrupts around a non-atomic operation. So typically we joke about s390 that it has an instruction for this 'very-complicated-thing', but here you guys do not, what gives? ;-) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 16:15 ` Peter Zijlstra @ 2017-06-06 17:00 ` Paul E. McKenney 2017-06-06 17:20 ` Heiko Carstens 1 sibling, 0 replies; 8+ messages in thread From: Paul E. McKenney @ 2017-06-06 17:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Heiko Carstens, Christian Borntraeger, Paolo Bonzini, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 06:15:51PM +0200, Peter Zijlstra wrote: > On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote: > > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > > > > As a side note, I am asking myself, though, why we do need the > > > preempt_disable/enable for the cases where we use the opcodes > > > like lao (atomic load and or to a memory location) and friends. > > > > Because you want the atomic instruction to be executed on the local cpu for > > which you have to per cpu pointer. If you get preempted to a different cpu > > between the ptr__ assignment and lan instruction it might be executed not > > on the local cpu. It's not really a correctness issue. > > As per the previous email, I think it is a correctness issue wrt CPU > hotplug. In the specific case of SRCU, this might actually be OK. We have not yet entered the SRCU read-side critical section, and SRCU grace periods don't interact with CPU hotplug. And the per-CPU variables stick around. So as long as one of the per-CPU counters is incremented properly, it doesn't really matter which one is incremented. But if you overwrite one CPU's counter with the incremented version of some other CPU's counter, yes, that would be very bad indeed.
> > #define arch_this_cpu_to_op(pcp, val, op) \ > > { \ > > typedef typeof(pcp) pcp_op_T__; \ > > pcp_op_T__ val__ = (val); \ > > pcp_op_T__ old__, *ptr__; \ > > preempt_disable(); \ > > ptr__ = raw_cpu_ptr(&(pcp)); \ > > asm volatile( \ > > op " %[old__],%[val__],%[ptr__]\n" \ > > : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ > > : [val__] "d" (val__) \ > > : "cc"); \ > > preempt_enable(); \ > > } > > > > #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") > > > > However in reality it doesn't matter at all, since all distributions we > > care about have preemption disabled. > > Well, either you support PREEMPT=y or you don't :-) If you do, it needs > to be correct, irrespective of what distro's do with it. > > > So this_cpu_inc() should just generate three instructions: two to calculate > > the percpu pointer and an additional asi for the atomic increment, with > > operand specific serialization. This is supposed to be a lot faster than > > disabling/enabling interrupts around a non-atomic operation. > > So typically we joke about s390 that it has an instruction for this > 'very-complicated-thing', but here you guys do not, what gives? ;-) Even mainframes have finite silicon area? ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 16:15 ` Peter Zijlstra 2017-06-06 17:00 ` Paul E. McKenney @ 2017-06-06 17:20 ` Heiko Carstens 1 sibling, 0 replies; 8+ messages in thread From: Heiko Carstens @ 2017-06-06 17:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Christian Borntraeger, Paolo Bonzini, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 06:15:51PM +0200, Peter Zijlstra wrote: > On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote: > > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > > > > As a side note, I am asking myself, though, why we do need the > > > preempt_disable/enable for the cases where we use the opcodes > > > like lao (atomic load and or to a memory location) and friends. > > > > Because you want the atomic instruction to be executed on the local cpu for > > which you have to per cpu pointer. If you get preempted to a different cpu > > between the ptr__ assignment and lan instruction it might be executed not > > on the local cpu. It's not really a correctness issue. > > As per the previous email, I think it is a correctness issue wrt CPU > hotplug. Yes, I realized that just a minute after I sent the above. > > However in reality it doesn't matter at all, since all distributions we > > care about have preemption disabled. > > Well, either you support PREEMPT=y or you don't :-) If you do, it needs > to be correct, irrespective of what distro's do with it. That is true. Our s390 specific percpu ops are supposed to be correct for PREEMPT=y, and that's apparently the only reason why I added the preempt disable/enable pairs back then. 
I just had to remember why I did that ;) > > So this_cpu_inc() should just generate three instructions: two to calculate > > the percpu pointer and an additional asi for the atomic increment, with > > operand specific serialization. This is supposed to be a lot faster than > > disabling/enabling interrupts around a non-atomic operation. > > So typically we joke about s390 that it has an instruction for this > 'very-complicated-thing', but here you guys do not, what gives? ;-) Tough luck. :) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 14:45 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Christian Borntraeger 2017-06-06 15:27 ` Heiko Carstens @ 2017-06-06 16:12 ` Peter Zijlstra 1 sibling, 0 replies; 8+ messages in thread From: Peter Zijlstra @ 2017-06-06 16:12 UTC (permalink / raw) To: Christian Borntraeger Cc: Paolo Bonzini, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, Heiko Carstens, linux-s390, H. Peter Anvin On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > A the same time, the implicit memory barrier of the atomic_inc should be > even cheaper. In contrast to x86, a full smp_mb seems to be almost for > free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I > _think_ that this should be really fast enough. So there is a patch out there that changes the x86 smp_mb() implementation to do "LOCK ADD some_stack_location, 0" which is lots cheaper than the "MFENCE" instruction and provides similar guarantees. HPA was running that through some of the architects.. ping? (Also, I can imagine OoO CPUs collapsing back-to-back ordering stuff, but what do I know). > As a side note, I am asking myself, though, why we do need the > preempt_disable/enable for the cases where we use the opcodes > like lao (atomic load and or to a memory location) and friends. I suspect the real reason is CPU hotplug, because regular preemption should not matter. It would be the same as getting migrated the moment _after_ you do the $op. But preempt_disable() also holds off hotplug and thereby serializes against hotplug notifiers that want to, for instance, move the value of the per-cpu variable to a still online CPU. 
Without this serialization it would be possible for the $op to happen _after_ the hotplug notifier runs. ^ permalink raw reply [flat|nested] 8+ messages in thread
Thread overview: 8+ messages
[not found] <20170605220919.GA27820@linux.vnet.ibm.com>
[not found] ` <1496700591-30177-1-git-send-email-paulmck@linux.vnet.ibm.com>
[not found] ` <20170606105343.ibhzrk6jwhmoja5t@hirez.programming.kicks-ass.net>
[not found] ` <b77cb817-9ffb-b600-19a9-1de685049c57@redhat.com>
2017-06-06 14:45 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Christian Borntraeger
2017-06-06 15:27 ` Heiko Carstens
2017-06-06 15:37 ` Christian Borntraeger
2017-06-06 15:58 ` Paul E. McKenney
2017-06-06 16:15 ` Peter Zijlstra
2017-06-06 17:00 ` Paul E. McKenney
2017-06-06 17:20 ` Heiko Carstens
2017-06-06 16:12 ` Peter Zijlstra