* [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
       [not found] <20170605220919.GA27820@linux.vnet.ibm.com>
@ 2017-06-05 22:09 ` Paul E. McKenney
  2017-06-06 10:53   ` Peter Zijlstra
                      ` (2 more replies)
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 2/2] srcu: Allow use of Classic " Paul E. McKenney
  1 sibling, 3 replies; 23+ messages in thread
From: Paul E. McKenney @ 2017-06-05 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paolo Bonzini, kvm, Linus Torvalds, Paul E. McKenney

From: Paolo Bonzini <pbonzini@redhat.com>

Linu Cherian reported a WARN in cleanup_srcu_struct when shutting down a
guest that has iperf running on a VFIO assigned device.  This happens
because irqfd_wakeup calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq.  If the
interrupt happens while the worker thread is executing __srcu_read_lock,
lock_count can fall behind.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case.  KVM is using SRCU here not really for the "sleepable" part,
but rather due to its faster detection of grace periods; therefore it is
not possible to switch back to RCU, which would effectively revert commit
719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-01-16).

However, the docs are painting a worse situation than it actually is.
You can have an SRCU instance that only has users in irq context, and
you can mix process and irq context as long as process context users
disable interrupts.  In addition, __srcu_read_unlock() actually uses
this_cpu_dec on both srcutree and srcuclassic.  For those two
implementations, only srcu_read_lock() is unsafe.
When srcuclassic's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc, with preempt_disable/enable in the
caller.  srcutree however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock to use
this_cpu_inc, too.

There would be a slowdown if 1) fast this_cpu_inc is not available and
cannot be implemented (this usually means that atomic_inc has implicit
memory barriers), and 2) local_irq_save/restore is slower than disabling
preemption.  The main architecture with these constraints is s390, which
however is already paying the price in __srcu_read_unlock and has not
complained.  A valid optimization on s390 would be to skip the smp_mb;
AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.

Likewise, srcutiny has one increment only in __srcu_read_lock, and a
decrement in __srcu_read_unlock.  However, it's not on a percpu variable
so it cannot use this_cpu_inc/dec.  This patch changes srcutiny to use
respectively atomic_inc and atomic_dec_return_relaxed.  Because srcutiny
is only used on uniprocessor machines, the extra cost of atomic
operations is averted at least on x86, where atomic_inc and
atomic_dec_return_relaxed compile to unlocked inc and xadd instructions
respectively.

Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/srcutiny.h |  2 +-
 kernel/rcu/rcutorture.c  |  4 ++--
 kernel/rcu/srcutiny.c    | 21 ++++++++++-----------
 kernel/rcu/srcutree.c    |  5 ++---
 4 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h
index 42311ee0334f..7343e3b03087 100644
--- a/include/linux/srcutiny.h
+++ b/include/linux/srcutiny.h
@@ -27,7 +27,7 @@
 #include <linux/swait.h>

 struct srcu_struct {
-	int srcu_lock_nesting[2];	/* srcu_read_lock() nesting depth. */
+	atomic_t srcu_lock_nesting[2];	/* srcu_read_lock() nesting depth. */
 	struct swait_queue_head srcu_wq; /* Last srcu_read_unlock() wakes GP. */
 	unsigned long srcu_gp_seq;	/* GP seq # for callback tagging. */
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index ae6e574d4cf5..fa4e3895ab08 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -611,8 +611,8 @@ static void srcu_torture_stats(void)
	idx = READ_ONCE(srcu_ctlp->srcu_idx) & 0x1;
	pr_alert("%s%s Tiny SRCU per-CPU(idx=%d): (%d,%d)\n",
		 torture_type, TORTURE_FLAG, idx,
-		 READ_ONCE(srcu_ctlp->srcu_lock_nesting[!idx]),
-		 READ_ONCE(srcu_ctlp->srcu_lock_nesting[idx]));
+		 atomic_read(&srcu_ctlp->srcu_lock_nesting[!idx]),
+		 atomic_read(&srcu_ctlp->srcu_lock_nesting[idx]));
 #endif
 }
diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
index 36e1f82faed1..681bf6bc04a5 100644
--- a/kernel/rcu/srcutiny.c
+++ b/kernel/rcu/srcutiny.c
@@ -35,8 +35,8 @@
 static int init_srcu_struct_fields(struct srcu_struct *sp)
 {
-	sp->srcu_lock_nesting[0] = 0;
-	sp->srcu_lock_nesting[1] = 0;
+	atomic_set(&sp->srcu_lock_nesting[0], 0);
+	atomic_set(&sp->srcu_lock_nesting[1], 0);
 	init_swait_queue_head(&sp->srcu_wq);
 	sp->srcu_gp_seq = 0;
 	rcu_segcblist_init(&sp->srcu_cblist);
@@ -86,7 +86,8 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
  */
 void cleanup_srcu_struct(struct srcu_struct *sp)
 {
-	WARN_ON(sp->srcu_lock_nesting[0] || sp->srcu_lock_nesting[1]);
+	WARN_ON(atomic_read(&sp->srcu_lock_nesting[0]) ||
+		atomic_read(&sp->srcu_lock_nesting[1]));
 	flush_work(&sp->srcu_work);
 	WARN_ON(rcu_seq_state(sp->srcu_gp_seq));
 	WARN_ON(sp->srcu_gp_running);
@@ -97,7 +98,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);

 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
+ * srcu_struct.
  * Returns an index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
@@ -105,21 +106,19 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;

 	idx = READ_ONCE(sp->srcu_idx);
-	WRITE_ONCE(sp->srcu_lock_nesting[idx], sp->srcu_lock_nesting[idx] + 1);
+	atomic_inc(&sp->srcu_lock_nesting[idx]);
 	return idx;
 }
 EXPORT_SYMBOL_GPL(__srcu_read_lock);

 /*
  * Removes the count for the old reader from the appropriate element of
- * the srcu_struct.  Must be called from process context.
+ * the srcu_struct.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
-	int newval = sp->srcu_lock_nesting[idx] - 1;
-
-	WRITE_ONCE(sp->srcu_lock_nesting[idx], newval);
-	if (!newval && READ_ONCE(sp->srcu_gp_waiting))
+	if (atomic_dec_return_relaxed(&sp->srcu_lock_nesting[idx]) == 0 &&
+	    READ_ONCE(sp->srcu_gp_waiting))
 		swake_up(&sp->srcu_wq);
 }
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
@@ -148,7 +147,7 @@ void srcu_drive_gp(struct work_struct *wp)
 	idx = sp->srcu_idx;
 	WRITE_ONCE(sp->srcu_idx, !sp->srcu_idx);
 	WRITE_ONCE(sp->srcu_gp_waiting, true);  /* srcu_read_unlock() wakes! */
-	swait_event(sp->srcu_wq, !READ_ONCE(sp->srcu_lock_nesting[idx]));
+	swait_event(sp->srcu_wq, !atomic_read(&sp->srcu_lock_nesting[idx]));
 	WRITE_ONCE(sp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */
 	rcu_seq_end(&sp->srcu_gp_seq);
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 3ae8474557df..157654fa436a 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);

 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
+ * srcu_struct.
  * Returns an index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
@@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;

 	idx = READ_ONCE(sp->srcu_idx) & 0x1;
-	__this_cpu_inc(sp->sda->srcu_lock_count[idx]);
+	this_cpu_inc(sp->sda->srcu_lock_count[idx]);
 	smp_mb(); /* B */  /* Avoid leaking the critical section. */
 	return idx;
 }
@@ -375,7 +375,6 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different
  * CPU than that which was incremented by the corresponding srcu_read_lock().
- * Must be called from process context.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney
@ 2017-06-06 10:53   ` Peter Zijlstra
  2017-06-06 12:56     ` Paul E. McKenney
  2017-06-06 13:08     ` Paolo Bonzini
  2017-06-06 11:09   ` Peter Zijlstra
  2017-06-06 17:23   ` Peter Zijlstra
  2 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2017-06-06 10:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds

On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> There would be a slowdown if 1) fast this_cpu_inc is not available and
> cannot be implemented (this usually means that atomic_inc has implicit
> memory barriers),

I don't get this.

How is per-cpu crud related to being strongly ordered?

this_cpu_ has 3 forms:

  x86:		single instruction
  arm64,s390:	preempt_disable()+atomic_op
  generic:	local_irq_save()+normal_op

Only s390 is TSO, arm64 is very much a weak arch.

> and 2) local_irq_save/restore is slower than disabling
> preemption. The main architecture with these constraints is s390, which
> however is already paying the price in __srcu_read_unlock and has not
> complained.

IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as
fast as preempt_disable().

> A valid optimization on s390 would be to skip the smp_mb;
> AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.

You mean the s390 this_cpu_inc() in specific, right? Because
this_cpu_inc() in general does not imply any such thing.
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 10:53   ` Peter Zijlstra
@ 2017-06-06 12:56     ` Paul E. McKenney
  2017-06-06 13:08     ` Paolo Bonzini
  1 sibling, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2017-06-06 12:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds

On Tue, Jun 06, 2017 at 12:53:43PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> > There would be a slowdown if 1) fast this_cpu_inc is not available and
> > cannot be implemented (this usually means that atomic_inc has implicit
> > memory barriers),
>
> I don't get this.
>
> How is per-cpu crud related to being strongly ordered?
>
> this_cpu_ has 3 forms:
>
>   x86:		single instruction
>   arm64,s390:	preempt_disable()+atomic_op
>   generic:	local_irq_save()+normal_op
>
> Only s390 is TSO, arm64 is very much a weak arch.
>
> > and 2) local_irq_save/restore is slower than disabling
> > preemption. The main architecture with these constraints is s390, which
> > however is already paying the price in __srcu_read_unlock and has not
> > complained.
>
> IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as
> fast as preempt_disable().
>
> > A valid optimization on s390 would be to skip the smp_mb;
> > AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.
>
> You mean the s390 this_cpu_inc() in specific, right? Because
> this_cpu_inc() in general does not imply any such thing.

More generally, yes, the commit log needs some more help, good catch,
thank you!  Does the code itself also need more help?

							Thanx, Paul
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 10:53   ` Peter Zijlstra
  2017-06-06 12:56     ` Paul E. McKenney
@ 2017-06-06 13:08     ` Paolo Bonzini
  2017-06-06 14:45       ` Christian Borntraeger
  2017-06-06 16:02       ` Peter Zijlstra
  1 sibling, 2 replies; 23+ messages in thread
From: Paolo Bonzini @ 2017-06-06 13:08 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds

On 06/06/2017 12:53, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
>> There would be a slowdown if 1) fast this_cpu_inc is not available and
>> cannot be implemented (this usually means that atomic_inc has implicit
>> memory barriers),
>
> I don't get this.
>
> How is per-cpu crud related to being strongly ordered?
>
> this_cpu_ has 3 forms:
>
>   x86:		single instruction
>   arm64,s390:	preempt_disable()+atomic_op
>   generic:	local_irq_save()+normal_op
>
> Only s390 is TSO, arm64 is very much a weak arch.

Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
s390 cannot because its atomic_inc has implicit memory barriers.

s390's this_cpu_inc is *faster* than the generic one, but still pretty slow.

>> and 2) local_irq_save/restore is slower than disabling
>> preemption. The main architecture with these constraints is s390, which
>> however is already paying the price in __srcu_read_unlock and has not
>> complained.
>
> IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as
> fast as preempt_disable().

1 = arch-specific this_cpu_inc is available
2 = local_irq_save/restore as fast as preempt_disable/enable

If either 1 or 2 is true, this patch makes SRCU faster or equal:

  x86 (single instruction):	1 = true,  2 = false -> ok
  arm64 (weakly ordered):	1 = true,  2 = false -> ok
  powerpc:			1 = false, 2 = true  -> ok
  s390:				1 = false, 2 = false -> slower

For other LL/SC architectures, notably arm, fast this_cpu_* ops are not
yet available, but could be written pretty easily.

>> A valid optimization on s390 would be to skip the smp_mb;
>> AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.
>
> You mean the s390 this_cpu_inc() in specific, right? Because
> this_cpu_inc() in general does not imply any such thing.

Yes, of course, this is only for s390.

Alternatively, we could change the counters to atomic_t and use
smp_mb__{before,after}_atomic, as in the (unnecessary) srcutiny patch.
That should shave a few cycles on x86 too, since "lock inc" is faster
than "inc; mfence".  For srcuclassic (and stable) however I'd rather
keep the simple __this_cpu_inc -> this_cpu_inc change.

Paolo
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 13:08     ` Paolo Bonzini
@ 2017-06-06 14:45       ` Christian Borntraeger
  2017-06-06 15:27         ` Heiko Carstens
  2017-06-06 16:12         ` Peter Zijlstra
  1 sibling, 2 replies; 23+ messages in thread
From: Christian Borntraeger @ 2017-06-06 14:45 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky,
	Heiko Carstens, linux-s390

Adding s390 folks and list

On 06/06/2017 03:08 PM, Paolo Bonzini wrote:
> On 06/06/2017 12:53, Peter Zijlstra wrote:
>> On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
>>> There would be a slowdown if 1) fast this_cpu_inc is not available and
>>> cannot be implemented (this usually means that atomic_inc has implicit
>>> memory barriers),
>>
>> I don't get this.
>>
>> How is per-cpu crud related to being strongly ordered?
>>
>> this_cpu_ has 3 forms:
>>
>>   x86:		single instruction
>>   arm64,s390:	preempt_disable()+atomic_op
>>   generic:	local_irq_save()+normal_op
>>
>> Only s390 is TSO, arm64 is very much a weak arch.
>
> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
> s390 cannot because its atomic_inc has implicit memory barriers.
>
> s390's this_cpu_inc is *faster* than the generic one, but still pretty slow.

FWIW, we improved the performance of local_irq_save/restore some time ago
with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and
disable/enable seem to be reasonably fast (3-5ns on my system doing both
disable/enable in a loop) on today's systems.  So I would assume that the
generic implementation would not be that bad.

At the same time, the implicit memory barrier of the atomic_inc should be
even cheaper.  In contrast to x86, a full smp_mb seems to be almost for
free (looks like <= 1 cycle for a bcr 14,0 and no contention).  So I
_think_ that this should be really fast enough.

As a side note, I am asking myself, though, why we do need the
preempt_disable/enable for the cases where we use the opcodes
like lao (atomic load and or to a memory location) and friends.

>>> and 2) local_irq_save/restore is slower than disabling
>>> preemption. The main architecture with these constraints is s390, which
>>> however is already paying the price in __srcu_read_unlock and has not
>>> complained.
>>
>> IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as
>> fast as preempt_disable().
>
> 1 = arch-specific this_cpu_inc is available
> 2 = local_irq_save/restore as fast as preempt_disable/enable
>
> If either 1 or 2 are true, this patch makes SRCU faster or equal
>
> x86 (single instruction): 1 = true, 2 = false -> ok
> arm64 (weakly ordered): 1 = true, 2 = false -> ok
> powerpc: 1 = false, 2 = true -> ok
> s390: 1 = false, 2 = false -> slower
>
> For other LL/SC architectures, notably arm, fast this_cpu_* ops not yet
> available, but could be written pretty easily.
>
>>> A valid optimization on s390 would be to skip the smp_mb;
>>> AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.
>>
>> You mean the s390 this_cpu_inc() in specific, right? Because
>> this_cpu_inc() in general does not imply any such thing.
>
> Yes, of course, this is only for s390.
>
> Alternatively, we could change the counters to atomic_t and use
> smp_mb__{before,after}_atomic, as in the (unnecessary) srcutiny patch.
> That should shave a few cycles on x86 too, since "lock inc" is faster
> than "inc; mfence". For srcuclassic (and stable) however I'd rather
> keep the simple __this_cpu_inc -> this_cpu_inc change.
>
> Paolo
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 14:45       ` Christian Borntraeger
@ 2017-06-06 15:27         ` Heiko Carstens
  2017-06-06 15:37           ` Christian Borntraeger
  2017-06-06 16:15           ` Peter Zijlstra
  1 sibling, 2 replies; 23+ messages in thread
From: Heiko Carstens @ 2017-06-06 15:27 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney, linux-kernel,
	mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm,
	Linus Torvalds, Martin Schwidefsky, linux-s390

On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:
> Adding s390 folks and list
> >> Only s390 is TSO, arm64 is very much a weak arch.
> >
> > Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
> > s390 cannot because its atomic_inc has implicit memory barriers.
> >
> > s390's this_cpu_inc is *faster* than the generic one, but still pretty slow.
>
> FWIW, we improved the performance of local_irq_save/restore some time ago
> with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and
> disable/enable seem to be reasonably fast (3-5ns on my system doing both
> disable/enable in a loop) on todays systems. So I would assume that the
> generic implementation would not be that bad.
>
> A the same time, the implicit memory barrier of the atomic_inc should be
> even cheaper. In contrast to x86, a full smp_mb seems to be almost for
> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I
> _think_ that this should be really fast enough.
>
> As a side note, I am asking myself, though, why we do need the
> preempt_disable/enable for the cases where we use the opcodes
> like lao (atomic load and or to a memory location) and friends.

Because you want the atomic instruction to be executed on the local cpu
for which you have the per-cpu pointer.  If you get preempted to a
different cpu between the ptr__ assignment and the lan instruction it
might be executed not on the local cpu.  It's not really a correctness
issue.

#define arch_this_cpu_to_op(pcp, val, op)				\
{									\
	typedef typeof(pcp) pcp_op_T__;					\
	pcp_op_T__ val__ = (val);					\
	pcp_op_T__ old__, *ptr__;					\
	preempt_disable();						\
	ptr__ = raw_cpu_ptr(&(pcp));					\
	asm volatile(							\
		op "	%[old__],%[val__],%[ptr__]\n"			\
		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
		: [val__] "d" (val__)					\
		: "cc");						\
	preempt_enable();						\
}

#define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan")

However in reality it doesn't matter at all, since all distributions we
care about have preemption disabled.

So this_cpu_inc() should just generate three instructions: two to
calculate the percpu pointer and an additional asi for the atomic
increment, with operand specific serialization.  This is supposed to be
a lot faster than disabling/enabling interrupts around a non-atomic
operation.

But maybe I didn't get the point of this thread :)
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 15:27         ` Heiko Carstens
@ 2017-06-06 15:37           ` Christian Borntraeger
  2017-06-06 15:58             ` Paul E. McKenney
  1 sibling, 1 reply; 23+ messages in thread
From: Christian Borntraeger @ 2017-06-06 15:37 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney, linux-kernel,
	mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm,
	Linus Torvalds, Martin Schwidefsky, linux-s390

On 06/06/2017 05:27 PM, Heiko Carstens wrote:
> On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:
>> Adding s390 folks and list
>>>> Only s390 is TSO, arm64 is very much a weak arch.
>>>
>>> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
>>> s390 cannot because its atomic_inc has implicit memory barriers.
>>>
>>> s390's this_cpu_inc is *faster* than the generic one, but still pretty slow.
>>
>> FWIW, we improved the performance of local_irq_save/restore some time ago
>> with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and
>> disable/enable seem to be reasonably fast (3-5ns on my system doing both
>> disable/enable in a loop) on todays systems. So I would assume that the
>> generic implementation would not be that bad.
>>
>> A the same time, the implicit memory barrier of the atomic_inc should be
>> even cheaper. In contrast to x86, a full smp_mb seems to be almost for
>> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I
>> _think_ that this should be really fast enough.
>>
>> As a side note, I am asking myself, though, why we do need the
>> preempt_disable/enable for the cases where we use the opcodes
>> like lao (atomic load and or to a memory location) and friends.
>
> Because you want the atomic instruction to be executed on the local cpu for
> which you have to per cpu pointer. If you get preempted to a different cpu
> between the ptr__ assignment and lan instruction it might be executed not
> on the local cpu. It's not really a correctness issue.
>
> #define arch_this_cpu_to_op(pcp, val, op)				\
> {									\
> 	typedef typeof(pcp) pcp_op_T__;					\
> 	pcp_op_T__ val__ = (val);					\
> 	pcp_op_T__ old__, *ptr__;					\
> 	preempt_disable();						\
> 	ptr__ = raw_cpu_ptr(&(pcp));					\
> 	asm volatile(							\
> 		op "	%[old__],%[val__],%[ptr__]\n"			\
> 		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
> 		: [val__] "d" (val__)					\
> 		: "cc");						\
> 	preempt_enable();						\
> }
>
> #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan")
>
> However in reality it doesn't matter at all, since all distributions we
> care about have preemption disabled.
>
> So this_cpu_inc() should just generate three instructions: two to calculate
> the percpu pointer and an additional asi for the atomic increment, with
> operand specific serialization. This is supposed to be a lot faster than
> disabling/enabling interrupts around a non-atomic operation.
>
> But maybe I didn't get the point of this thread :)

I think on x86 a memory barrier is relatively expensive (e.g. 33 cycles
for mfence on Haswell according to
http://www.agner.org/optimize/instruction_tables.pdf).  The thread
started with a change to RCU, which now happens to use these percpu
things more often, so I think Paolo's fear is that on s390 we now pay
the price for an extra memory barrier due to that change.  For the inc
case (asi instruction) this should be really really cheap.
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 15:37           ` Christian Borntraeger
@ 2017-06-06 15:58             ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2017-06-06 15:58 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Heiko Carstens, Paolo Bonzini, Peter Zijlstra, linux-kernel,
	mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm,
	Linus Torvalds, Martin Schwidefsky, linux-s390

On Tue, Jun 06, 2017 at 05:37:05PM +0200, Christian Borntraeger wrote:
> On 06/06/2017 05:27 PM, Heiko Carstens wrote:
> > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:
> >> Adding s390 folks and list
> >>>> Only s390 is TSO, arm64 is very much a weak arch.
> >>>
> >>> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
> >>> s390 cannot because its atomic_inc has implicit memory barriers.
> >>>
> >>> s390's this_cpu_inc is *faster* than the generic one, but still pretty slow.
> >>
> >> FWIW, we improved the performance of local_irq_save/restore some time ago
> >> with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and
> >> disable/enable seem to be reasonably fast (3-5ns on my system doing both
> >> disable/enable in a loop) on todays systems. So I would assume that the
> >> generic implementation would not be that bad.
> >>
> >> A the same time, the implicit memory barrier of the atomic_inc should be
> >> even cheaper. In contrast to x86, a full smp_mb seems to be almost for
> >> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I
> >> _think_ that this should be really fast enough.
> >>
> >> As a side note, I am asking myself, though, why we do need the
> >> preempt_disable/enable for the cases where we use the opcodes
> >> like lao (atomic load and or to a memory location) and friends.
> >
> > Because you want the atomic instruction to be executed on the local cpu for
> > which you have to per cpu pointer. If you get preempted to a different cpu
> > between the ptr__ assignment and lan instruction it might be executed not
> > on the local cpu. It's not really a correctness issue.
> >
> > #define arch_this_cpu_to_op(pcp, val, op)				\
> > {									\
> > 	typedef typeof(pcp) pcp_op_T__;					\
> > 	pcp_op_T__ val__ = (val);					\
> > 	pcp_op_T__ old__, *ptr__;					\
> > 	preempt_disable();						\
> > 	ptr__ = raw_cpu_ptr(&(pcp));					\
> > 	asm volatile(							\
> > 		op "	%[old__],%[val__],%[ptr__]\n"			\
> > 		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
> > 		: [val__] "d" (val__)					\
> > 		: "cc");						\
> > 	preempt_enable();						\
> > }
> >
> > #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan")
> >
> > However in reality it doesn't matter at all, since all distributions we
> > care about have preemption disabled.
> >
> > So this_cpu_inc() should just generate three instructions: two to calculate
> > the percpu pointer and an additional asi for the atomic increment, with
> > operand specific serialization. This is supposed to be a lot faster than
> > disabling/enabling interrupts around a non-atomic operation.
> >
> > But maybe I didn't get the point of this thread :)
>
> I think on x86 a memory barrier is relatively expensive (e.g. 33 cycles for mfence
> on Haswell according to http://www.agner.org/optimize/instruction_tables.pdf). The
> thread started with a change to rcu, which now happens to use these percpu things
> more often so I think Paolos fear is that on s390 we now pay the price for an extra
> memory barrier due to that change. For the inc case (asi instruction) this should be
> really really cheap.

So what I am seeing from this is that there aren't any real performance
issues for this patch series.  I will update accordingly.  ;-)

							Thanx, Paul
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 15:27         ` Heiko Carstens
  2017-06-06 15:37           ` Christian Borntraeger
@ 2017-06-06 16:15           ` Peter Zijlstra
  2017-06-06 17:00             ` Paul E. McKenney
  2017-06-06 17:20             ` Heiko Carstens
  1 sibling, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2017-06-06 16:15 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Christian Borntraeger, Paolo Bonzini, Paul E. McKenney,
	linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky,
	linux-s390

On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote:
> On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:
>
> > As a side note, I am asking myself, though, why we do need the
> > preempt_disable/enable for the cases where we use the opcodes
> > like lao (atomic load and or to a memory location) and friends.
>
> Because you want the atomic instruction to be executed on the local cpu for
> which you have to per cpu pointer. If you get preempted to a different cpu
> between the ptr__ assignment and lan instruction it might be executed not
> on the local cpu. It's not really a correctness issue.

As per the previous email, I think it is a correctness issue wrt CPU
hotplug.

> #define arch_this_cpu_to_op(pcp, val, op)				\
> {									\
> 	typedef typeof(pcp) pcp_op_T__;					\
> 	pcp_op_T__ val__ = (val);					\
> 	pcp_op_T__ old__, *ptr__;					\
> 	preempt_disable();						\
> 	ptr__ = raw_cpu_ptr(&(pcp));					\
> 	asm volatile(							\
> 		op "	%[old__],%[val__],%[ptr__]\n"			\
> 		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
> 		: [val__] "d" (val__)					\
> 		: "cc");						\
> 	preempt_enable();						\
> }
>
> #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan")
>
> However in reality it doesn't matter at all, since all distributions we
> care about have preemption disabled.

Well, either you support PREEMPT=y or you don't :-) If you do, it needs
to be correct, irrespective of what distro's do with it.

> So this_cpu_inc() should just generate three instructions: two to calculate
> the percpu pointer and an additional asi for the atomic increment, with
> operand specific serialization. This is supposed to be a lot faster than
> disabling/enabling interrupts around a non-atomic operation.

So typically we joke about s390 that it has an instruction for this
'very-complicated-thing', but here you guys do not, what gives? ;-)
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 16:15 ` Peter Zijlstra @ 2017-06-06 17:00 ` Paul E. McKenney 2017-06-06 17:20 ` Heiko Carstens 1 sibling, 0 replies; 23+ messages in thread From: Paul E. McKenney @ 2017-06-06 17:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Heiko Carstens, Christian Borntraeger, Paolo Bonzini, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 06:15:51PM +0200, Peter Zijlstra wrote: > On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote: > > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > > > > As a side note, I am asking myself, though, why we do need the > > > preempt_disable/enable for the cases where we use the opcodes > > > like lao (atomic load and or to a memory location) and friends. > > > > Because you want the atomic instruction to be executed on the local cpu for > > which you have to per cpu pointer. If you get preempted to a different cpu > > between the ptr__ assignment and lan instruction it might be executed not > > on the local cpu. It's not really a correctness issue. > > As per the previous email, I think it is a correctness issue wrt CPU > hotplug. In the specific case of SRCU, this might actually be OK. We have not yet entered the SRCU read-side critical section, and SRCU grace periods don't interact with CPU hotplug. And the per-CPU variables stick around. So as long as one of the per-CPU counters is incremented properly, it doesn't really matter which one is incremented. But if you overwrite one CPU's counter with the incremented version of some other CPU's counter, yes, that would be very bad indeed. 
> > #define arch_this_cpu_to_op(pcp, val, op) \ > > { \ > > typedef typeof(pcp) pcp_op_T__; \ > > pcp_op_T__ val__ = (val); \ > > pcp_op_T__ old__, *ptr__; \ > > preempt_disable(); \ > > ptr__ = raw_cpu_ptr(&(pcp)); \ > > asm volatile( \ > > op " %[old__],%[val__],%[ptr__]\n" \ > > : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \ > > : [val__] "d" (val__) \ > > : "cc"); \ > > preempt_enable(); \ > > } > > > > #define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan") > > > > However in reality it doesn't matter at all, since all distributions we > > care about have preemption disabled. > > Well, either you support PREEMPT=y or you don't :-) If you do, it needs > to be correct, irrespective of what distro's do with it. > > > So this_cpu_inc() should just generate three instructions: two to calculate > > the percpu pointer and an additional asi for the atomic increment, with > > operand specific serialization. This is supposed to be a lot faster than > > disabling/enabling interrupts around a non-atomic operation. > > So typically we joke about s390 that it has an instruction for this > 'very-complicated-thing', but here you guys do not, what gives? ;-) Even mainframes have finite silicon area? ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 16:15 ` Peter Zijlstra 2017-06-06 17:00 ` Paul E. McKenney @ 2017-06-06 17:20 ` Heiko Carstens 1 sibling, 0 replies; 23+ messages in thread From: Heiko Carstens @ 2017-06-06 17:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Christian Borntraeger, Paolo Bonzini, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, linux-s390 On Tue, Jun 06, 2017 at 06:15:51PM +0200, Peter Zijlstra wrote: > On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote: > > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > > > > As a side note, I am asking myself, though, why we do need the > > > preempt_disable/enable for the cases where we use the opcodes > > > like lao (atomic load and or to a memory location) and friends. > > > > Because you want the atomic instruction to be executed on the local cpu for > > which you have to per cpu pointer. If you get preempted to a different cpu > > between the ptr__ assignment and lan instruction it might be executed not > > on the local cpu. It's not really a correctness issue. > > As per the previous email, I think it is a correctness issue wrt CPU > hotplug. Yes, I realized that just a minute after I sent the above. > > However in reality it doesn't matter at all, since all distributions we > > care about have preemption disabled. > > Well, either you support PREEMPT=y or you don't :-) If you do, it needs > to be correct, irrespective of what distro's do with it. That is true. Our s390 specific percpu ops are supposed to be correct for PREEMPT=y, and that's apparently the only reason why I added the preempt disable/enable pairs back then. 
I just had to remember why I did that ;) > > So this_cpu_inc() should just generate three instructions: two to calculate > > the percpu pointer and an additional asi for the atomic increment, with > > operand specific serialization. This is supposed to be a lot faster than > > disabling/enabling interrupts around a non-atomic operation. > > So typically we joke about s390 that it has an instruction for this > 'very-complicated-thing', but here you guys do not, what gives? ;-) Tough luck. :) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 14:45 ` Christian Borntraeger 2017-06-06 15:27 ` Heiko Carstens @ 2017-06-06 16:12 ` Peter Zijlstra 1 sibling, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2017-06-06 16:12 UTC (permalink / raw) To: Christian Borntraeger Cc: Paolo Bonzini, Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky, Heiko Carstens, linux-s390, H. Peter Anvin On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote: > A the same time, the implicit memory barrier of the atomic_inc should be > even cheaper. In contrast to x86, a full smp_mb seems to be almost for > free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I > _think_ that this should be really fast enough. So there is a patch out there that changes the x86 smp_mb() implementation to do "LOCK ADD some_stack_location, 0" which is lots cheaper than the "MFENCE" instruction and provides similar guarantees. HPA was running that through some of the architects.. ping? (Also, I can imagine OoO CPUs collapsing back-to-back ordering stuff, but what do I know). > As a side note, I am asking myself, though, why we do need the > preempt_disable/enable for the cases where we use the opcodes > like lao (atomic load and or to a memory location) and friends. I suspect the real reason is CPU hotplug, because regular preemption should not matter. It would be the same as getting migrated the moment _after_ you do the $op. But preempt_disable() also holds off hotplug and thereby serializes against hotplug notifiers that want to, for instance, move the value of the per-cpu variable to a still online CPU. Without this serialization it would be possible for the $op to happen _after_ the hotplug notifier runs. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 13:08 ` Paolo Bonzini 2017-06-06 14:45 ` Christian Borntraeger @ 2017-06-06 16:02 ` Peter Zijlstra 1 sibling, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2017-06-06 16:02 UTC (permalink / raw) To: Paolo Bonzini Cc: Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds On Tue, Jun 06, 2017 at 03:08:50PM +0200, Paolo Bonzini wrote: > > > On 06/06/2017 12:53, Peter Zijlstra wrote: > > On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote: > >> There would be a slowdown if 1) fast this_cpu_inc is not available and > >> cannot be implemented (this usually means that atomic_inc has implicit > >> memory barriers), > > > > I don't get this. > > > > How is per-cpu crud related to being strongly ordered? > > > > this_cpu_ has 3 forms: > > > > x86: single instruction > > arm64,s390: preempt_disable()+atomic_op > > generic: local_irq_save()+normal_op > > > > Only s390 is TSO, arm64 is very much a weak arch. > > Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC. > s390 cannot because its atomic_inc has implicit memory barriers. I'm not sure the ordering makes a useful differentiator between a fast and non-fast this_cpu implementation. For TSO archs making their atomic primitives fully ordered isn't _that_ much of a burden over their normal ordering requirements. And LL/SC archs can have very slow (weak) atomics (PPC for example). (Now theoretically a TSO arch could have weak(er) atomics, but I'm not aware of any that do this). I realize we're splitting hairs and slightly off topic :-) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney 2017-06-06 10:53 ` Peter Zijlstra @ 2017-06-06 11:09 ` Peter Zijlstra 2017-06-06 12:01 ` Paolo Bonzini 2017-06-06 17:23 ` Peter Zijlstra 2 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2017-06-06 11:09 UTC (permalink / raw) To: Paul E. McKenney Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds > diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c > index 36e1f82faed1..681bf6bc04a5 100644 > --- a/kernel/rcu/srcutiny.c > +++ b/kernel/rcu/srcutiny.c > @@ -35,8 +35,8 @@ > > static int init_srcu_struct_fields(struct srcu_struct *sp) > { > - sp->srcu_lock_nesting[0] = 0; > - sp->srcu_lock_nesting[1] = 0; > + atomic_set(&sp->srcu_lock_nesting[0], 0); > + atomic_set(&sp->srcu_lock_nesting[1], 0); > init_swait_queue_head(&sp->srcu_wq); > sp->srcu_gp_seq = 0; > rcu_segcblist_init(&sp->srcu_cblist); > @@ -86,7 +86,8 @@ EXPORT_SYMBOL_GPL(init_srcu_struct); > */ > void cleanup_srcu_struct(struct srcu_struct *sp) > { > - WARN_ON(sp->srcu_lock_nesting[0] || sp->srcu_lock_nesting[1]); > + WARN_ON(atomic_read(&sp->srcu_lock_nesting[0]) || > + atomic_read(&sp->srcu_lock_nesting[1])); > flush_work(&sp->srcu_work); > WARN_ON(rcu_seq_state(sp->srcu_gp_seq)); > WARN_ON(sp->srcu_gp_running); > @@ -97,7 +98,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct); > > /* > * Counts the new reader in the appropriate per-CPU element of the > - * srcu_struct. Must be called from process context. > + * srcu_struct. > * Returns an index that must be passed to the matching srcu_read_unlock(). 
> */ > int __srcu_read_lock(struct srcu_struct *sp) > @@ -105,21 +106,19 @@ int __srcu_read_lock(struct srcu_struct *sp) > int idx; > > idx = READ_ONCE(sp->srcu_idx); > - WRITE_ONCE(sp->srcu_lock_nesting[idx], sp->srcu_lock_nesting[idx] + 1); > + atomic_inc(&sp->srcu_lock_nesting[idx]); > return idx; > } > EXPORT_SYMBOL_GPL(__srcu_read_lock); > > /* > * Removes the count for the old reader from the appropriate element of > - * the srcu_struct. Must be called from process context. > + * the srcu_struct. > */ > void __srcu_read_unlock(struct srcu_struct *sp, int idx) > { > - int newval = sp->srcu_lock_nesting[idx] - 1; > - > - WRITE_ONCE(sp->srcu_lock_nesting[idx], newval); > - if (!newval && READ_ONCE(sp->srcu_gp_waiting)) > + if (atomic_dec_return_relaxed(&sp->srcu_lock_nesting[idx]) == 0 && > + READ_ONCE(sp->srcu_gp_waiting)) > swake_up(&sp->srcu_wq); > } > EXPORT_SYMBOL_GPL(__srcu_read_unlock); > @@ -148,7 +147,7 @@ void srcu_drive_gp(struct work_struct *wp) > idx = sp->srcu_idx; > WRITE_ONCE(sp->srcu_idx, !sp->srcu_idx); > WRITE_ONCE(sp->srcu_gp_waiting, true); /* srcu_read_unlock() wakes! */ > - swait_event(sp->srcu_wq, !READ_ONCE(sp->srcu_lock_nesting[idx])); > + swait_event(sp->srcu_wq, !atomic_read(&sp->srcu_lock_nesting[idx])); > WRITE_ONCE(sp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */ > rcu_seq_end(&sp->srcu_gp_seq); I'm not entirely sure this is actually needed. TINY_SRCU is !PREEMPT && !SMP. So that means all we need is to be safe from IRQs. Now, do we (want) support things like: <IRQ> srcu_read_lock(); </IRQ> srcu_read_lock(); srcu_read_unlock(); <IRQ> srcu_read_unlock(); </IRQ> _OR_ do we already (or want to) mandate that SRCU usage in IRQs must be balanced? That is, if it is used from IRQ context it must do an equal amount of srcu_read_lock() and srcu_read_unlock()s? 
Because if we have the balance requirement (as we do for preempt_disable()) then even on load-store architectures the current code should be sufficient (since if an interrupt does as many dec's as it does inc's, the actual value will not change over an interrupt, and our load from before the interrupt is still valid). ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 11:09 ` Peter Zijlstra @ 2017-06-06 12:01 ` Paolo Bonzini 2017-06-06 12:53 ` Paul E. McKenney 0 siblings, 1 reply; 23+ messages in thread From: Paolo Bonzini @ 2017-06-06 12:01 UTC (permalink / raw) To: Peter Zijlstra, Paul E. McKenney Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds On 06/06/2017 13:09, Peter Zijlstra wrote: >> diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c >> index 36e1f82faed1..681bf6bc04a5 100644 >> --- a/kernel/rcu/srcutiny.c >> +++ b/kernel/rcu/srcutiny.c >> @@ -35,8 +35,8 @@ >> >> static int init_srcu_struct_fields(struct srcu_struct *sp) >> { >> - sp->srcu_lock_nesting[0] = 0; >> - sp->srcu_lock_nesting[1] = 0; >> + atomic_set(&sp->srcu_lock_nesting[0], 0); >> + atomic_set(&sp->srcu_lock_nesting[1], 0); >> init_swait_queue_head(&sp->srcu_wq); >> sp->srcu_gp_seq = 0; >> rcu_segcblist_init(&sp->srcu_cblist); >> @@ -86,7 +86,8 @@ EXPORT_SYMBOL_GPL(init_srcu_struct); >> */ >> void cleanup_srcu_struct(struct srcu_struct *sp) >> { >> - WARN_ON(sp->srcu_lock_nesting[0] || sp->srcu_lock_nesting[1]); >> + WARN_ON(atomic_read(&sp->srcu_lock_nesting[0]) || >> + atomic_read(&sp->srcu_lock_nesting[1])); >> flush_work(&sp->srcu_work); >> WARN_ON(rcu_seq_state(sp->srcu_gp_seq)); >> WARN_ON(sp->srcu_gp_running); >> @@ -97,7 +98,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct); >> >> /* >> * Counts the new reader in the appropriate per-CPU element of the >> - * srcu_struct. Must be called from process context. >> + * srcu_struct. >> * Returns an index that must be passed to the matching srcu_read_unlock(). 
>> */ >> int __srcu_read_lock(struct srcu_struct *sp) >> @@ -105,21 +106,19 @@ int __srcu_read_lock(struct srcu_struct *sp) >> int idx; >> >> idx = READ_ONCE(sp->srcu_idx); >> - WRITE_ONCE(sp->srcu_lock_nesting[idx], sp->srcu_lock_nesting[idx] + 1); >> + atomic_inc(&sp->srcu_lock_nesting[idx]); >> return idx; >> } >> EXPORT_SYMBOL_GPL(__srcu_read_lock); >> >> /* >> * Removes the count for the old reader from the appropriate element of >> - * the srcu_struct. Must be called from process context. >> + * the srcu_struct. >> */ >> void __srcu_read_unlock(struct srcu_struct *sp, int idx) >> { >> - int newval = sp->srcu_lock_nesting[idx] - 1; >> - >> - WRITE_ONCE(sp->srcu_lock_nesting[idx], newval); >> - if (!newval && READ_ONCE(sp->srcu_gp_waiting)) >> + if (atomic_dec_return_relaxed(&sp->srcu_lock_nesting[idx]) == 0 && >> + READ_ONCE(sp->srcu_gp_waiting)) >> swake_up(&sp->srcu_wq); >> } >> EXPORT_SYMBOL_GPL(__srcu_read_unlock); >> @@ -148,7 +147,7 @@ void srcu_drive_gp(struct work_struct *wp) >> idx = sp->srcu_idx; >> WRITE_ONCE(sp->srcu_idx, !sp->srcu_idx); >> WRITE_ONCE(sp->srcu_gp_waiting, true); /* srcu_read_unlock() wakes! */ >> - swait_event(sp->srcu_wq, !READ_ONCE(sp->srcu_lock_nesting[idx])); >> + swait_event(sp->srcu_wq, !atomic_read(&sp->srcu_lock_nesting[idx])); >> WRITE_ONCE(sp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */ >> rcu_seq_end(&sp->srcu_gp_seq); > > I'm not entirely sure this is actually needed. TINY_SRCU is !PREEMPT && > !SMP. So that means all we need is to be safe from IRQs. > > Now, do we (want) support things like: > > <IRQ> > srcu_read_lock(); > </IRQ> > > srcu_read_lock(); > > srcu_read_unlock(); > > <IRQ> > srcu_read_unlock(); > </IRC> > > > _OR_ > > do we already (or want to) mandate that SRCU usage in IRQs must be > balanced? That is, if it is used from IRQ context it must do an equal > amount of srcu_read_lock() and srcu_read_unlock()s? 
> > Because if we have the balance requirement (as we do for > preempt_disable()) then even on load-store architectures the current > code should be sufficient (since if an interrupt does as many dec's as > it does inc's, the actual value will not change over an interrupt, and > our load from before the interrupt is still valid). Good point! So the srcutiny part should not be necessary. I'll reply to the other email now. Paolo ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 12:01 ` Paolo Bonzini @ 2017-06-06 12:53 ` Paul E. McKenney 2017-06-06 15:54 ` Peter Zijlstra 0 siblings, 1 reply; 23+ messages in thread From: Paul E. McKenney @ 2017-06-06 12:53 UTC (permalink / raw) To: Paolo Bonzini Cc: Peter Zijlstra, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds On Tue, Jun 06, 2017 at 02:01:54PM +0200, Paolo Bonzini wrote: > > > On 06/06/2017 13:09, Peter Zijlstra wrote: > >> diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c > >> index 36e1f82faed1..681bf6bc04a5 100644 > >> --- a/kernel/rcu/srcutiny.c > >> +++ b/kernel/rcu/srcutiny.c > >> @@ -35,8 +35,8 @@ > >> > >> static int init_srcu_struct_fields(struct srcu_struct *sp) > >> { > >> - sp->srcu_lock_nesting[0] = 0; > >> - sp->srcu_lock_nesting[1] = 0; > >> + atomic_set(&sp->srcu_lock_nesting[0], 0); > >> + atomic_set(&sp->srcu_lock_nesting[1], 0); > >> init_swait_queue_head(&sp->srcu_wq); > >> sp->srcu_gp_seq = 0; > >> rcu_segcblist_init(&sp->srcu_cblist); > >> @@ -86,7 +86,8 @@ EXPORT_SYMBOL_GPL(init_srcu_struct); > >> */ > >> void cleanup_srcu_struct(struct srcu_struct *sp) > >> { > >> - WARN_ON(sp->srcu_lock_nesting[0] || sp->srcu_lock_nesting[1]); > >> + WARN_ON(atomic_read(&sp->srcu_lock_nesting[0]) || > >> + atomic_read(&sp->srcu_lock_nesting[1])); > >> flush_work(&sp->srcu_work); > >> WARN_ON(rcu_seq_state(sp->srcu_gp_seq)); > >> WARN_ON(sp->srcu_gp_running); > >> @@ -97,7 +98,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct); > >> > >> /* > >> * Counts the new reader in the appropriate per-CPU element of the > >> - * srcu_struct. Must be called from process context. > >> + * srcu_struct. > >> * Returns an index that must be passed to the matching srcu_read_unlock(). 
> >> */ > >> int __srcu_read_lock(struct srcu_struct *sp) > >> @@ -105,21 +106,19 @@ int __srcu_read_lock(struct srcu_struct *sp) > >> int idx; > >> > >> idx = READ_ONCE(sp->srcu_idx); > >> - WRITE_ONCE(sp->srcu_lock_nesting[idx], sp->srcu_lock_nesting[idx] + 1); > >> + atomic_inc(&sp->srcu_lock_nesting[idx]); > >> return idx; > >> } > >> EXPORT_SYMBOL_GPL(__srcu_read_lock); > >> > >> /* > >> * Removes the count for the old reader from the appropriate element of > >> - * the srcu_struct. Must be called from process context. > >> + * the srcu_struct. > >> */ > >> void __srcu_read_unlock(struct srcu_struct *sp, int idx) > >> { > >> - int newval = sp->srcu_lock_nesting[idx] - 1; > >> - > >> - WRITE_ONCE(sp->srcu_lock_nesting[idx], newval); > >> - if (!newval && READ_ONCE(sp->srcu_gp_waiting)) > >> + if (atomic_dec_return_relaxed(&sp->srcu_lock_nesting[idx]) == 0 && > >> + READ_ONCE(sp->srcu_gp_waiting)) > >> swake_up(&sp->srcu_wq); > >> } > >> EXPORT_SYMBOL_GPL(__srcu_read_unlock); > >> @@ -148,7 +147,7 @@ void srcu_drive_gp(struct work_struct *wp) > >> idx = sp->srcu_idx; > >> WRITE_ONCE(sp->srcu_idx, !sp->srcu_idx); > >> WRITE_ONCE(sp->srcu_gp_waiting, true); /* srcu_read_unlock() wakes! */ > >> - swait_event(sp->srcu_wq, !READ_ONCE(sp->srcu_lock_nesting[idx])); > >> + swait_event(sp->srcu_wq, !atomic_read(&sp->srcu_lock_nesting[idx])); > >> WRITE_ONCE(sp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */ > >> rcu_seq_end(&sp->srcu_gp_seq); > > > > I'm not entirely sure this is actually needed. TINY_SRCU is !PREEMPT && > > !SMP. So that means all we need is to be safe from IRQs. > > > > Now, do we (want) support things like: > > > > <IRQ> > > srcu_read_lock(); > > </IRQ> > > > > srcu_read_lock(); > > > > srcu_read_unlock(); > > > > <IRQ> > > srcu_read_unlock(); > > </IRC> > > > > > > _OR_ > > > > do we already (or want to) mandate that SRCU usage in IRQs must be > > balanced? 
That is, if it is used from IRQ context it must do an equal > > amount of srcu_read_lock() and srcu_read_unlock()s? > > > > Because if we have the balance requirement (as we do for > > preempt_disable()) then even on load-store architectures the current > > code should be sufficient (since if an interrupt does as many dec's as > > it does inc's, the actual value will not change over an interrupt, and > > our load from before the interrupt is still valid). > > Good point! So the srcutiny part should not be necessary. I'll reply > to the other email now. Good analysis, Peter! So the only part of this patch that is needed is the changes to the comments, right? ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 12:53 ` Paul E. McKenney @ 2017-06-06 15:54 ` Peter Zijlstra 2017-06-06 15:59 ` Paul E. McKenney 0 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2017-06-06 15:54 UTC (permalink / raw) To: Paul E. McKenney Cc: Paolo Bonzini, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds On Tue, Jun 06, 2017 at 05:53:57AM -0700, Paul E. McKenney wrote: > So the only part of this patch that is needed is the changes to the > comments, right? ;-) Right, although I would suggest putting that requirement (the balanced lock vs unlock) somewhere as well. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 15:54 ` Peter Zijlstra @ 2017-06-06 15:59 ` Paul E. McKenney 0 siblings, 0 replies; 23+ messages in thread From: Paul E. McKenney @ 2017-06-06 15:59 UTC (permalink / raw) To: Peter Zijlstra Cc: Paolo Bonzini, linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds On Tue, Jun 06, 2017 at 05:54:28PM +0200, Peter Zijlstra wrote: > On Tue, Jun 06, 2017 at 05:53:57AM -0700, Paul E. McKenney wrote: > > So the only part of this patch that is needed is the changes to the > > comments, right? ;-) > > Right, although I would suggest putting that requirement (the balanced > lock vs unlock) somewhere as well. Good point! Otherwise, someone will decide it is a good idea to try sooner or later. Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney 2017-06-06 10:53 ` Peter Zijlstra 2017-06-06 11:09 ` Peter Zijlstra @ 2017-06-06 17:23 ` Peter Zijlstra 2017-06-06 17:50 ` Paul E. McKenney 2 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2017-06-06 17:23 UTC (permalink / raw) To: Paul E. McKenney Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote: > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 3ae8474557df..157654fa436a 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct); > > /* > * Counts the new reader in the appropriate per-CPU element of the > - * srcu_struct. Must be called from process context. > + * srcu_struct. > * Returns an index that must be passed to the matching srcu_read_unlock(). > */ > int __srcu_read_lock(struct srcu_struct *sp) > @@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp) > int idx; > > idx = READ_ONCE(sp->srcu_idx) & 0x1; > - __this_cpu_inc(sp->sda->srcu_lock_count[idx]); > + this_cpu_inc(sp->sda->srcu_lock_count[idx]); > smp_mb(); /* B */ /* Avoid leaking the critical section. */ > return idx; > } So again, the change is to make this an IRQ safe operation, however if we have this balance requirement, the IRQ will not visibly change the value and load-store should be good again, no? Or am I missing some other detail with this implementation? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 17:23 ` Peter Zijlstra @ 2017-06-06 17:50 ` Paul E. McKenney 2017-06-06 18:00 ` Peter Zijlstra 0 siblings, 1 reply; 23+ messages in thread From: Paul E. McKenney @ 2017-06-06 17:50 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds On Tue, Jun 06, 2017 at 07:23:42PM +0200, Peter Zijlstra wrote: > On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote: > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 3ae8474557df..157654fa436a 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct); > > > > /* > > * Counts the new reader in the appropriate per-CPU element of the > > - * srcu_struct. Must be called from process context. > > + * srcu_struct. > > * Returns an index that must be passed to the matching srcu_read_unlock(). > > */ > > int __srcu_read_lock(struct srcu_struct *sp) > > @@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp) > > int idx; > > > > idx = READ_ONCE(sp->srcu_idx) & 0x1; > > - __this_cpu_inc(sp->sda->srcu_lock_count[idx]); > > + this_cpu_inc(sp->sda->srcu_lock_count[idx]); > > smp_mb(); /* B */ /* Avoid leaking the critical section. */ > > return idx; > > } > > So again, the change is to make this an IRQ safe operation, however if > we have this balance requirement, the IRQ will not visibly change the > value and load-store should be good again, no? > > Or am I missing some other detail with this implementation? Unlike Tiny SRCU, Classic and Tree SRCU increment one counter (->srcu_lock_count[]) and decrement another (->srcu_unlock_count[]). 
So balanced srcu_read_lock() and srcu_read_unlock() within an irq handler would increment both counters, with no decrements. Therefore, __srcu_read_lock()'s counter manipulation needs to be irq-safe. Thanx, Paul ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context 2017-06-06 17:50 ` Paul E. McKenney @ 2017-06-06 18:00 ` Peter Zijlstra 2017-06-06 18:22 ` Paul E. McKenney 0 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2017-06-06 18:00 UTC (permalink / raw) To: Paul E. McKenney Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet, fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds On Tue, Jun 06, 2017 at 10:50:48AM -0700, Paul E. McKenney wrote: > On Tue, Jun 06, 2017 at 07:23:42PM +0200, Peter Zijlstra wrote: > > On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote: > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > index 3ae8474557df..157654fa436a 100644 > > > --- a/kernel/rcu/srcutree.c > > > +++ b/kernel/rcu/srcutree.c > > > @@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct); > > > > > > /* > > > * Counts the new reader in the appropriate per-CPU element of the > > > - * srcu_struct. Must be called from process context. > > > + * srcu_struct. > > > * Returns an index that must be passed to the matching srcu_read_unlock(). > > > */ > > > int __srcu_read_lock(struct srcu_struct *sp) > > > @@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp) > > > int idx; > > > > > > idx = READ_ONCE(sp->srcu_idx) & 0x1; > > > - __this_cpu_inc(sp->sda->srcu_lock_count[idx]); > > > + this_cpu_inc(sp->sda->srcu_lock_count[idx]); > > > smp_mb(); /* B */ /* Avoid leaking the critical section. */ > > > return idx; > > > } > > > > So again, the change is to make this an IRQ safe operation, however if > > we have this balance requirement, the IRQ will not visibly change the > > value and load-store should be good again, no? > > > > Or am I missing some other detail with this implementation? 
> > Unlike Tiny SRCU, Classic and Tree SRCU increment one counter > (->srcu_lock_count[]) and decrement another (->srcu_unlock_count[]). > So balanced srcu_read_lock() and srcu_read_unlock() within an irq > handler would increment both counters, with no decrements. Therefore, > __srcu_read_lock()'s counter manipulation needs to be irq-safe. Oh, duh, so much for being able to read... ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 18:00           ` Peter Zijlstra
@ 2017-06-06 18:22             ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2017-06-06 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds

On Tue, Jun 06, 2017 at 08:00:09PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 06, 2017 at 10:50:48AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 06, 2017 at 07:23:42PM +0200, Peter Zijlstra wrote:
> > > On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > index 3ae8474557df..157654fa436a 100644
> > > > --- a/kernel/rcu/srcutree.c
> > > > +++ b/kernel/rcu/srcutree.c
> > > > @@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
> > > >
> > > >  /*
> > > >   * Counts the new reader in the appropriate per-CPU element of the
> > > > - * srcu_struct.  Must be called from process context.
> > > > + * srcu_struct.
> > > >   * Returns an index that must be passed to the matching srcu_read_unlock().
> > > >   */
> > > >  int __srcu_read_lock(struct srcu_struct *sp)
> > > > @@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
> > > >  	int idx;
> > > >
> > > >  	idx = READ_ONCE(sp->srcu_idx) & 0x1;
> > > > -	__this_cpu_inc(sp->sda->srcu_lock_count[idx]);
> > > > +	this_cpu_inc(sp->sda->srcu_lock_count[idx]);
> > > >  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> > > >  	return idx;
> > > >  }
> > >
> > > So again, the change is to make this an IRQ safe operation, however if
> > > we have this balance requirement, the IRQ will not visibly change the
> > > value and load-store should be good again, no?
> > >
> > > Or am I missing some other detail with this implementation?
> >
> > Unlike Tiny SRCU, Classic and Tree SRCU increment one counter
> > (->srcu_lock_count[]) and decrement another (->srcu_unlock_count[]).
> > So balanced srcu_read_lock() and srcu_read_unlock() within an irq
> > handler would increment both counters, with no decrements.  Therefore,
> > __srcu_read_lock()'s counter manipulation needs to be irq-safe.
>
> Oh, duh, so much for being able to read...

I know that feeling!  Including the s/decrement/increment/ needed in my
erroneous paragraph above.  Classic and Tree SRCU increment both counters,
and they decrement nothing.  :-/

							Thanx, Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread
* [PATCH RFC tip/core/rcu 2/2] srcu: Allow use of Classic SRCU from both process and interrupt context
       [not found] <20170605220919.GA27820@linux.vnet.ibm.com>
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney
@ 2017-06-05 22:09 ` Paul E. McKenney
  1 sibling, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2017-06-05 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paolo Bonzini, stable, kvm, Linus Torvalds, Paul E. McKenney

From: Paolo Bonzini <pbonzini@redhat.com>

Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting down a
guest that has iperf running on a VFIO assigned device.  This happens
because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
context, while a worker thread does the same inside kvm_set_irq().  If the
interrupt happens while the worker thread is executing __srcu_read_lock(),
the lock count can fall behind.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported this
use case.  KVM is using SRCU here not really for the "sleepable" part,
but rather for its faster detection of grace periods; therefore it is not
possible to switch back to RCU, which would effectively revert commit
719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-01-16).

However, the docs paint a worse situation than it actually is.  You can
have an SRCU instance that only has users in irq context, and you can mix
process and irq context as long as process-context users disable
interrupts.  In addition, __srcu_read_unlock() already uses this_cpu_dec(),
so only srcu_read_lock() is unsafe.

When srcuclassic's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc(), with preempt_disable/enable in the
caller.  Nowadays however it only does one increment, so on most
architectures it is more efficient for __srcu_read_lock() to use
this_cpu_inc() too.

There would be a slowdown only if 1) fast this_cpu_inc() is not available
and cannot be implemented (this usually means that atomic_inc() has
implicit memory barriers), and 2) local_irq_save/restore is slower than
disabling preemption.  The main architecture with these constraints is
s390, which however is already paying the price in __srcu_read_unlock()
and has not complained.

Cc: stable@vger.kernel.org
Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/srcu.h | 2 --
 kernel/rcu/srcu.c    | 5 ++---
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 167ad8831aaf..4c1d5f7e62c4 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -172,9 +172,7 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)
 {
 	int retval;
 
-	preempt_disable();
 	retval = __srcu_read_lock(sp);
-	preempt_enable();
 	rcu_lock_acquire(&(sp)->dep_map);
 	return retval;
 }
diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index 584d8a983883..dea03614263f 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -263,7 +263,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
 
 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
+ * srcu_struct.
  * Returns an index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
@@ -271,7 +271,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;
 
 	idx = READ_ONCE(sp->completed) & 0x1;
-	__this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
+	this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
 	smp_mb(); /* B */  /* Avoid leaking the critical section. */
 	return idx;
 }
@@ -281,7 +281,6 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
 /*
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different
  * CPU than that which was incremented by the corresponding srcu_read_lock().
- * Must be called from process context.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread
end of thread, other threads: [~2017-06-06 18:22 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20170605220919.GA27820@linux.vnet.ibm.com>
2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney
2017-06-06 10:53 ` Peter Zijlstra
2017-06-06 12:56 ` Paul E. McKenney
2017-06-06 13:08 ` Paolo Bonzini
2017-06-06 14:45 ` Christian Borntraeger
2017-06-06 15:27 ` Heiko Carstens
2017-06-06 15:37 ` Christian Borntraeger
2017-06-06 15:58 ` Paul E. McKenney
2017-06-06 16:15 ` Peter Zijlstra
2017-06-06 17:00 ` Paul E. McKenney
2017-06-06 17:20 ` Heiko Carstens
2017-06-06 16:12 ` Peter Zijlstra
2017-06-06 16:02 ` Peter Zijlstra
2017-06-06 11:09 ` Peter Zijlstra
2017-06-06 12:01 ` Paolo Bonzini
2017-06-06 12:53 ` Paul E. McKenney
2017-06-06 15:54 ` Peter Zijlstra
2017-06-06 15:59 ` Paul E. McKenney
2017-06-06 17:23 ` Peter Zijlstra
2017-06-06 17:50 ` Paul E. McKenney
2017-06-06 18:00 ` Peter Zijlstra
2017-06-06 18:22 ` Paul E. McKenney
2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 2/2] srcu: Allow use of Classic " Paul E. McKenney