* Possible race condition during lock_count increment in srcu_read_lock
@ 2017-05-29 14:30 Linu Cherian
2017-05-29 15:05 ` Paolo Bonzini
0 siblings, 1 reply; 8+ messages in thread
From: Linu Cherian @ 2017-05-29 14:30 UTC (permalink / raw)
To: pbonzini, rkrcmar; +Cc: kvm, linu.cherian, sunil.goutham, Geethasowjanya.Akula
Hi,
Issue observed:
--------------
Iperf traffic running on a guest 10G interface (VFIO assigned device),
followed by a kvm guest poweroff/shutdown, results in the following
warning on the Cavium arm64 platform.
Stacktrace:
[ 1151.424692] ------------[ cut here ]------------
[ 1151.429318] WARNING: CPU: 30 PID: 6744 at kernel/rcu/srcu.c:251
cleanup_srcu_struct+0xb4/0xd0
[ 1151.437832] Modules linked in: fuse cavium_rng_vf rng_core
cavium_rng ipv6 btrfs xor zlib_deflate raid6_pq
[ 1151.448978] CPU: 30 PID: 6744 Comm: qemu-system-aar Tainted: G
W 4.11.2 #1
[ 1151.456970] Hardware name: Gigabyte gbt-mt60/gbt-mt60, BIOS 0.3 Mar
31 2017
[ 1151.463921] task: fffffe1f994a3400 task.stack: fffffe1f99590000
[ 1151.469831] PC is at cleanup_srcu_struct+0xb4/0xd0
[ 1151.474612] LR is at cleanup_srcu_struct+0x80/0xd0
[ 1151.479392] pc : [<fffffc000811698c>] lr : [<fffffc0008116958>]
pstate: 60000145
[ 1151.486776] sp : fffffe1f99593b20
[ 1151.490081] x29: fffffe1f99593b20 x28: fffffe1f994a3400
[ 1151.495385] x27: 0000000000000008 x26: fffffe1f99593de8
[ 1151.500689] x25: dead000000000100 x24: dead000000000200
[ 1151.505993] x23: fffffc0008e93f24 x22: fffffffffffffff9
[ 1151.511296] x21: ffffff000d93f0f0 x20: fffffc0008e93bc8
[ 1151.516601] x19: fffffc0008e93e18 x18: 000000000000003f
[ 1151.521904] x17: 000000000000003f x16: 000000000000000d
[ 1151.527208] x15: 0000000000000010 x14: 0000000000000000
[ 1151.532512] x13: 0000000000000040 x12: 0000000000000020
[ 1151.537816] x11: 0000000000000020 x10: 0101010101010101
[ 1151.543120] x9 : 0000000040000000 x8 : 0000000000210d00
[ 1151.548424] x7 : fefefeff71647274 x6 : 0000000000000000
[ 1151.553728] x5 : 0000000000000000 x4 : 0000000000000000
[ 1151.559032] x3 : 0000000000000000 x2 : 0000000000000040
[ 1151.564335] x1 : 0000000000000040 x0 : 0000000000000040
[ 1151.571119] ---[ end trace ff1987e1b5556fbe ]---
[ 1151.575726] Call trace:
[ 1151.578163] Exception stack(0xfffffe1f99593950 to
0xfffffe1f99593a80)
[ 1151.584594] 3940:
fffffc0008e93e18 0000040000000000
[ 1151.592414] 3960: fffffe1f99593b20 fffffc000811698c
fffffe1f99593980 fffffc000839cc28
[ 1151.600233] 3980: fffffe1f995939d0 fffffc00081344cc
fffffc000896b000 fffffe1f99593ad8
[ 1151.608053] 39a0: fffffe1f99593ab0 fffffc00081d2a5c
fffffdffc7d95ac0 ffffff1f656b1a00
[ 1151.615873] 39c0: fffffe1f99593ad0 fffffc00081d2a5c
fffffdff87e75e00 fffffe1f9d780000
[ 1151.623692] 39e0: fffffe1f994a3400 000000000000051e
0000000000000040 0000000000000040
[ 1151.631511] 3a00: 0000000000000040 0000000000000000
0000000000000000 0000000000000000
[ 1151.639330] 3a20: 0000000000000000 fefefeff71647274
0000000000210d00 0000000040000000
[ 1151.647150] 3a40: 0101010101010101 0000000000000020
0000000000000020 0000000000000040
[ 1151.654970] 3a60: 0000000000000000 0000000000000010
000000000000000d 000000000000003f
[ 1151.662789] [<fffffc000811698c>] cleanup_srcu_struct+0xb4/0xd0
[ 1151.668614] [<fffffc000809c930>] kvm_put_kvm+0x1e0/0x238
[ 1151.673916] [<fffffc000809c9f8>] kvm_vm_release+0x20/0x30
[ 1151.679308] [<fffffc00081f0c64>] __fput+0x8c/0x1d0
[ 1151.684089] [<fffffc00081f0e0c>] ____fput+0xc/0x18
[ 1151.688872] [<fffffc00080d5830>] task_work_run+0xc0/0xe0
[ 1151.694176] [<fffffc00080bf194>] do_exit+0x2c4/0x978
[ 1151.699131] [<fffffc00080bf8ac>] do_group_exit+0x34/0x98
[ 1151.704435] [<fffffc00080c9118>] get_signal+0x1e0/0x4c0
[ 1151.709653] [<fffffc00080863d0>] do_signal+0xb8/0x4d8
[ 1151.714694] [<fffffc0008086a08>] do_notify_resume+0x88/0xa8
[ 1151.720256] [<fffffc0008082ad0>] work_pending+0x8/0x10
Analysis
--------
Additional prints in srcu_readers_active indicate that the per-CPU
variable lock_count (of struct srcu_struct) lags behind the unlock
count, hence the warning.
In the KVM irq injection path, irqfd_wakeup, running in interrupt
context, calls srcu_read_lock(&kvm->irq_srcu). A worker thread,
irqfd_inject, running in process context, also calls srcu_read_lock
inside kvm_set_irq.
This can lead to a race condition while incrementing the lock_count
(using __this_cpu_inc), since atomic instructions are not used.
Replacing __this_cpu_inc with this_cpu_inc (the atomic version) works
around the problem, which backs up this analysis.
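To illustrate the suspected interleaving: __this_cpu_inc() can compile
to a plain load/add/store of the per-CPU counter, so an interrupt that
lands between the load and the store can lose one increment. The sketch
below is only our reconstruction of the race, not kernel source:

    process context (irqfd_inject)         interrupt context (irqfd_wakeup)
    -------------------------------        --------------------------------
    tmp = lock_count[idx];     /* N */
                                           tmp = lock_count[idx];      /* N   */
                                           lock_count[idx] = tmp + 1;  /* N+1 */
    lock_count[idx] = tmp + 1; /* stores N+1, the interrupt's increment is lost */

this_cpu_inc() performs the read-modify-write as a single irq-safe
operation on the local CPU, so the two increments cannot interleave this
way and lock_count can no longer fall behind the unlock count.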
Possible solution
-----------------
One way is to avoid the srcu_read_lock/unlock usage in irq context.
On the arm/arm64 architectures,

    if (kvm_arch_set_irq_inatomic(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1,
                                  false) == -EWOULDBLOCK)

in irqfd_wakeup is always true, and hence schedule_work can be called
directly:
@@ -195,6 +195,11 @@ int __attribute__((weak)) kvm_arch_set_irq_inatomic(
     int idx;
 
     if (flags & POLLIN) {
+        if (kvm_arch_set_irq_will_block_always()) {
+            schedule_work(&irqfd->inject);
+            goto skiplock;
+        }
+
         idx = srcu_read_lock(&kvm->irq_srcu);
         do {
             seq = read_seqcount_begin(&irqfd->irq_entry_sc);
@@ -208,6 +213,7 @@ int __attribute__((weak)) kvm_arch_set_irq_inatomic(
         srcu_read_unlock(&kvm->irq_srcu, idx);
     }
 
+skiplock:
     if (flags & POLLHUP) {
         /* The eventfd is closing, detach from KVM */
         unsigned long flags;
This works without giving any warnings as well.
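Note that kvm_arch_set_irq_will_block_always() above is a new helper
that would need to be introduced; the name and placement are only a
proposal. A minimal sketch of what it could look like, assuming the
arm/arm64 ports keep relying on the weak -EWOULDBLOCK stub for
kvm_arch_set_irq_inatomic():

static inline bool kvm_arch_set_irq_will_block_always(void)
{
    /*
     * True on architectures whose kvm_arch_set_irq_inatomic() always
     * returns -EWOULDBLOCK, so that irqfd_wakeup() can schedule the
     * injection work without taking kvm->irq_srcu in irq context.
     */
#if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
    return true;
#else
    return false;
#endif
}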
Is a patch welcome in that direction?
Appreciate your feedback on this.
--
Linu cherian
* Re: Possible race condition during lock_count increment in srcu_read_lock
2017-05-29 14:30 Possible race condition during lock_count increment in srcu_read_lock Linu Cherian
@ 2017-05-29 15:05 ` Paolo Bonzini
2017-05-29 15:50 ` Paul E. McKenney
0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2017-05-29 15:05 UTC (permalink / raw)
To: Linu Cherian, rkrcmar
Cc: kvm, linu.cherian, sunil.goutham, Geethasowjanya.Akula,
Paul McKenney
Adding Paul McKenney...
On 29/05/2017 16:30, Linu Cherian wrote:
> In KVM irq injection path, irqfd_wakeup in interrupt context
> calls srcu_read_lock(&kvm->irq_srcu). A worker thread, irqfd_inject
> in process context also calls srcu_read_lock inside kvm_set_irq.
>
> This can lead to race condition while incrementing the
> lock_count(using __this_cpu_inc), since atomic instructions being not
> used.
>
> Replacing __this_cpu_inc with this_cpu_inc(atomic version) workarounds the
> problem which backs up this analysis.
Paul, how would you feel about using this_cpu_inc in
__srcu_read_lock and __srcu_read_unlock? Many architectures already
provide an efficient this_cpu_inc, and any architecture that has LL/SC
could provide it easily.
Thanks,
Paolo
> Possible solution
> -----------------
> One way is to avoid the srcu_read_lock/unlock usage in irq context.
> In arm/arm64 architecture,
> if (kvm_arch_set_irq_inatomic(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1,
> false) == -EWOULDBLOCK)
> in irqfd_wakefup is always true and hence schedule_work can be called
> directly
>
> @@ -195,6 +195,11 @@ int __attribute__((weak))
> kvm_arch_set_irq_inatomic(
> int idx;
>
> if (flags & POLLIN) {
> + if (kvm_arch_set_irq_will_block_always()) {
> + schedule_work(&irqfd->inject);
> + goto skiplock;
> + }
> +
> idx = srcu_read_lock(&kvm->irq_srcu);
> do {
> seq =
> read_seqcount_begin(&irqfd->irq_entry_sc);
> @@ -208,6 +213,7 @@ int __attribute__((weak))
> kvm_arch_set_irq_inatomic(
> srcu_read_unlock(&kvm->irq_srcu, idx);
> }
>
> +skiplock:
> if (flags & POLLHUP) {
> /* The eventfd is closing, detach from KVM */
> unsigned long flags;
>
> This works without giving any warnings as well.
>
> Is a patch welcome in that direction ?
>
> Appreicate your feedback on this.
>
>
>
* Re: Possible race condition during lock_count increment in srcu_read_lock
2017-05-29 15:05 ` Paolo Bonzini
@ 2017-05-29 15:50 ` Paul E. McKenney
2017-05-29 16:38 ` Paolo Bonzini
0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2017-05-29 15:50 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Linu Cherian, rkrcmar, kvm, linu.cherian, sunil.goutham,
Geethasowjanya.Akula
On Mon, May 29, 2017 at 05:05:47PM +0200, Paolo Bonzini wrote:
> Adding Paul McKenney...
>
> On 29/05/2017 16:30, Linu Cherian wrote:
> > In KVM irq injection path, irqfd_wakeup in interrupt context
> > calls srcu_read_lock(&kvm->irq_srcu). A worker thread, irqfd_inject
> > in process context also calls srcu_read_lock inside kvm_set_irq.
> >
> > This can lead to race condition while incrementing the
> > lock_count(using __this_cpu_inc), since atomic instructions being not
> > used.
> >
> > Replacing __this_cpu_inc with this_cpu_inc(atomic version) workarounds the
> > problem which backs up this analysis.
>
> Paul, how would you feel about either using this_cpu_inc in
> __srcu_read_lock and __srcu_read_unlock? Many architectures already
> provide efficient this_cpu_inc, and any architecture that has LL/SC
> could provide it easily.
You actually are not supposed to call __srcu_read_lock() and
__srcu_read_unlock() from irq context. (Unless you know that you
interrupted some context that is guaranteed never to use the srcu_struct
in question or some such.)
Given that this is the KVM irq injection path, I am guessing that it
is executed quite frequently, so that you do need a low-cost highly
scalable solution. However, you only really need to exclude between
process level and the various interrupts, so there are three ways
that an architecture could handle this:
1. Single memory-to-memory increment instructions, as you say.
2. Atomic read-modify-write instructions, again as you say.
3. Disable interrupts across __srcu_read_lock() and __srcu_read_unlock().
Different architectures would want different approaches, depending on what
instructions are available and the relative speed of these instructions.
It would not be hard to make this an architecture choice on the one
hand or to provide an srcu_read_lock_irq() and srcu_read_unlock_irq()
on the other. But I recently got some serious pushback on RCU API
proliferation on the one hand and SRCU in particular on the other. So
it might be good to look at some alternatives as well.
But first, one question... Does the entire interrupt run in the guest in
interrupt context in the host? Or is there some sort of wakeup sequence
where the guest's interrupt handler runs in process context from the
host's perspective? From a quick glance at the code, it looks like
the guest's interrupt handler runs in process context (from the host's
perspective) and that the SRCU read-side critical section is confined
to the host's interrupt handler.
Another (stupid) question... Can the process-level uses of ->irq_srcu
instead use ->srcu? As the code is written now, it is OK to use
srcu_read_lock() and srcu_read_unlock() from interrupt handlers as
long as they are never used from either process level or from nested
interrupt handlers. Does that work for you?
If this is the case, another approach would be for updates to wait for
both an SRCU grace period and an RCU-sched grace period, perhaps with
the latter expedited, though that won't make the real-time guys happy.
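For concreteness, the update side would then do something along these
lines (a sketch of the idea only, not existing KVM code, assuming the
irq-context readers keep srcu_read_lock(&kvm->irq_srcu) and the
process-level readers switch to rcu_read_lock_sched()):

    synchronize_srcu(&kvm->irq_srcu);   /* wait out irq-context readers */
    synchronize_sched_expedited();      /* wait out process-level readers */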
Thanx, Paul
> Thanks,
>
> Paolo
>
> > Possible solution
> > -----------------
> > One way is to avoid the srcu_read_lock/unlock usage in irq context.
> > In arm/arm64 architecture,
> > if (kvm_arch_set_irq_inatomic(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1,
> > false) == -EWOULDBLOCK)
> > in irqfd_wakefup is always true and hence schedule_work can be called
> > directly
> >
> > @@ -195,6 +195,11 @@ int __attribute__((weak))
> > kvm_arch_set_irq_inatomic(
> > int idx;
> >
> > if (flags & POLLIN) {
> > + if (kvm_arch_set_irq_will_block_always()) {
> > + schedule_work(&irqfd->inject);
> > + goto skiplock;
> > + }
> > +
> > idx = srcu_read_lock(&kvm->irq_srcu);
> > do {
> > seq =
> > read_seqcount_begin(&irqfd->irq_entry_sc);
> > @@ -208,6 +213,7 @@ int __attribute__((weak))
> > kvm_arch_set_irq_inatomic(
> > srcu_read_unlock(&kvm->irq_srcu, idx);
> > }
> >
> > +skiplock:
> > if (flags & POLLHUP) {
> > /* The eventfd is closing, detach from KVM */
> > unsigned long flags;
> >
> > This works without giving any warnings as well.
> >
> > Is a patch welcome in that direction ?
> >
> > Appreicate your feedback on this.
> >
> >
> >
>
* Re: Possible race condition during lock_count increment in srcu_read_lock
2017-05-29 15:50 ` Paul E. McKenney
@ 2017-05-29 16:38 ` Paolo Bonzini
2017-05-29 20:14 ` Paul E. McKenney
0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2017-05-29 16:38 UTC (permalink / raw)
To: paulmck
Cc: Linu Cherian, rkrcmar, kvm, linu.cherian, sunil.goutham,
Geethasowjanya.Akula
On 29/05/2017 17:50, Paul E. McKenney wrote:
> On Mon, May 29, 2017 at 05:05:47PM +0200, Paolo Bonzini wrote:
>> Adding Paul McKenney...
>>
>> On 29/05/2017 16:30, Linu Cherian wrote:
>>> In KVM irq injection path, irqfd_wakeup in interrupt context
>>> calls srcu_read_lock(&kvm->irq_srcu). A worker thread, irqfd_inject
>>> in process context also calls srcu_read_lock inside kvm_set_irq.
>>>
>>> This can lead to race condition while incrementing the
>>> lock_count(using __this_cpu_inc), since atomic instructions being not
>>> used.
>>>
>>> Replacing __this_cpu_inc with this_cpu_inc(atomic version) workarounds the
>>> problem which backs up this analysis.
>>
>> Paul, how would you feel about either using this_cpu_inc in
>> __srcu_read_lock and __srcu_read_unlock? Many architectures already
>> provide efficient this_cpu_inc, and any architecture that has LL/SC
>> could provide it easily.
>
> You actually are not supposed to call __srcu_read_lock() and
> __srcu_read_unlock() from irq context. (Unless you know that you
> interrupted some context that is guaranteed never to use the srcu_struct
> in question or some such.)
>
> Given that this is the KVM irq injection path, I am guessing that it
> is executed quite frequently, so that you do need a low-cost highly
> scalable solution. However, you only really need to exclude between
> process level and the various interrupts, so there are three ways
> that an architecture could handle this:
>
> 1. Single memory-to-memory increment instructions, as you say.
>
> 2. Atomic read-modify-write instructions, again as you say.
>
> 3. Disable interrupts across __srcu_read_lock() and __srcu_read_unlock().
>
> Different architectures would want different approached, depending on what
> instructions are available and the relative speed of these instructions.
> It would not be hard to make this an architecture choice on the one
> hand or to provide an srcu_read_lock_irq() and srcu_read_unlock_irq()
> on the other. But I recently got some serious pushback on RCU API
> proliferation on the one hand and SRCU in particular on the other. So
> it might be good to look at some alternatives as well.
Yeah, I liked Linu's suggestion of this_cpu_inc because it's already
used in srcu_read_unlock, and because it can fall back to (3)
automatically. Relaxed atomics weren't much of a proposal, rather a way
to implement this_cpu_inc on some architectures that do not support it
yet (notably 32-bit ARM, MIPS and PPC).
The disadvantage would be slowing down SRCU on machines where neither
(1) nor (2) are possible, and local_irq_save/restore is slower than
disabling preemption. But I can certainly wrap ->irq_srcu lock/unlock
into my own helpers that take care of saving/restoring the interrupt
state, like
#define kvm_irq_srcu_read_lock_irqsave(kvm, flags)
#define kvm_irq_srcu_read_unlock_irqrestore(kvm, flags)
This would be very simple, confined into KVM and respectful of the SRCU
API requirements.
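A rough expansion of those helpers, as a sketch only (not existing KVM
code; note the unlock side here also takes the idx returned by the lock
side, which the two-argument form above leaves implicit):

#define kvm_irq_srcu_read_lock_irqsave(kvm, flags)          \
({                                                          \
    int __idx;                                              \
    local_irq_save(flags);                                  \
    __idx = srcu_read_lock(&(kvm)->irq_srcu);               \
    __idx;                                                  \
})

#define kvm_irq_srcu_read_unlock_irqrestore(kvm, idx, flags) \
do {                                                         \
    srcu_read_unlock(&(kvm)->irq_srcu, idx);                 \
    local_irq_restore(flags);                                \
} while (0)

Interrupts stay disabled across the (very short) ->irq_srcu read-side
critical section, so irqfd_wakeup() can never interrupt a half-done
per-CPU increment on the same CPU.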
> But first, one question... Does the entire interrupt run in the guest in
> interrupt context in the host?
It's interrupt context in the host.
> Or is there some sort of wakeup sequence
> where the guest's interrupt handler runs in process context from the
> host's perspective? From a quick glance at the code, it looks like
> the guest's interrupt handler runs in process context (from the host's
> perspective) and that the SRCU read-side critical section is confined
> to the host's interrupt handler.
Correct.
> Another (stupid) question... Can the process-level uses of ->irq_srcu
> instead use ->srcu?
I suppose you would synchronize_srcu on both in the write side?
Probably not, because ->srcu has really, really long read-side
critical sections. Basically the entire EQS of RCU is a read-side
->srcu critical section.
Unfortunately the write side of ->irq_srcu is also relatively
performance-critical, and since ->irq_srcu read-side critical sections
are very short, having a separate SRCU instance solved the issue quite
nicely. Originally Christian Borntraeger had proposed turning the
relevant synchronize_rcu into synchronize_rcu_expedited, but as you said
it's not good for real-time so here we are.
Paolo
> As the code is written now, it is OK to use
> srcu_read_lock() and srcu_read_unlock() from interrupt handlers as
> long as they are never used from either process level or from nested
> interrupt handlers. Does that work for you?
>
> If this is the case, another approach would be for updates to wait for
> both an SRCU grace period and an RCU-sched grace period, perhaps with
> the latter expedited, though that won't make the real-time guys happy.
>
> Thanx, Paul
>
>> Thanks,
>>
>> Paolo
>>
>>> Possible solution
>>> -----------------
>>> One way is to avoid the srcu_read_lock/unlock usage in irq context.
>>> In arm/arm64 architecture,
>>> if (kvm_arch_set_irq_inatomic(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1,
>>> false) == -EWOULDBLOCK)
>>> in irqfd_wakefup is always true and hence schedule_work can be called
>>> directly
>>>
>>> @@ -195,6 +195,11 @@ int __attribute__((weak))
>>> kvm_arch_set_irq_inatomic(
>>> int idx;
>>>
>>> if (flags & POLLIN) {
>>> + if (kvm_arch_set_irq_will_block_always()) {
>>> + schedule_work(&irqfd->inject);
>>> + goto skiplock;
>>> + }
>>> +
>>> idx = srcu_read_lock(&kvm->irq_srcu);
>>> do {
>>> seq =
>>> read_seqcount_begin(&irqfd->irq_entry_sc);
>>> @@ -208,6 +213,7 @@ int __attribute__((weak))
>>> kvm_arch_set_irq_inatomic(
>>> srcu_read_unlock(&kvm->irq_srcu, idx);
>>> }
>>>
>>> +skiplock:
>>> if (flags & POLLHUP) {
>>> /* The eventfd is closing, detach from KVM */
>>> unsigned long flags;
>>>
>>> This works without giving any warnings as well.
>>>
>>> Is a patch welcome in that direction ?
>>>
>>> Appreicate your feedback on this.
>>>
>>>
>>>
>>
>
* Re: Possible race condition during lock_count increment in srcu_read_lock
2017-05-29 16:38 ` Paolo Bonzini
@ 2017-05-29 20:14 ` Paul E. McKenney
2017-05-30 13:49 ` Paolo Bonzini
0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2017-05-29 20:14 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Linu Cherian, rkrcmar, kvm, linu.cherian, sunil.goutham,
Geethasowjanya.Akula
On Mon, May 29, 2017 at 06:38:19PM +0200, Paolo Bonzini wrote:
> On 29/05/2017 17:50, Paul E. McKenney wrote:
> > On Mon, May 29, 2017 at 05:05:47PM +0200, Paolo Bonzini wrote:
> >> Adding Paul McKenney...
> >>
> >> On 29/05/2017 16:30, Linu Cherian wrote:
> >>> In KVM irq injection path, irqfd_wakeup in interrupt context
> >>> calls srcu_read_lock(&kvm->irq_srcu). A worker thread, irqfd_inject
> >>> in process context also calls srcu_read_lock inside kvm_set_irq.
> >>>
> >>> This can lead to race condition while incrementing the
> >>> lock_count(using __this_cpu_inc), since atomic instructions being not
> >>> used.
> >>>
> >>> Replacing __this_cpu_inc with this_cpu_inc(atomic version) workarounds the
> >>> problem which backs up this analysis.
> >>
> >> Paul, how would you feel about either using this_cpu_inc in
> >> __srcu_read_lock and __srcu_read_unlock? Many architectures already
> >> provide efficient this_cpu_inc, and any architecture that has LL/SC
> >> could provide it easily.
> >
> > You actually are not supposed to call __srcu_read_lock() and
> > __srcu_read_unlock() from irq context. (Unless you know that you
> > interrupted some context that is guaranteed never to use the srcu_struct
> > in question or some such.)
> >
> > Given that this is the KVM irq injection path, I am guessing that it
> > is executed quite frequently, so that you do need a low-cost highly
> > scalable solution. However, you only really need to exclude between
> > process level and the various interrupts, so there are three ways
> > that an architecture could handle this:
> >
> > 1. Single memory-to-memory increment instructions, as you say.
> >
> > 2. Atomic read-modify-write instructions, again as you say.
> >
> > 3. Disable interrupts across __srcu_read_lock() and __srcu_read_unlock().
> >
> > Different architectures would want different approached, depending on what
> > instructions are available and the relative speed of these instructions.
> > It would not be hard to make this an architecture choice on the one
> > hand or to provide an srcu_read_lock_irq() and srcu_read_unlock_irq()
> > on the other. But I recently got some serious pushback on RCU API
> > proliferation on the one hand and SRCU in particular on the other. So
> > it might be good to look at some alternatives as well.
>
> Yeah, I liked Linu's suggestion of this_cpu_inc because it's already
> used in srcu_read_unlock, and because it can fall back to (3)
> automatically. Relaxed atomics weren't much of a proposal, rather a way
> to implement this_cpu_inc on some architectures that do not support it
> yet (notably 32-bit ARM, MIPS and PPC).
>
> The disadvantage would be slowing down SRCU on machines where neither
> (1) nor (2) are possible, and local_irq_save/restore is slower than
> disabling preemption. But I can certainly wrap ->irq_srcu lock/unlock
> into my own helpers that take care of saving/restoring the interrupt
> state, like
>
> #define kvm_irq_srcu_read_lock_irqsave(kvm, flags)
> #define kvm_irq_srcu_read_unlock_irqrestore(kvm, flags)
>
> This would be very simple, confined into KVM and respectful of the SRCU
> API requirements.
I do like this approach! If it turns out to be needed in (say) four
other places, then it might make sense for me to pull this into SRCU as
an additional API.
> > But first, one question... Does the entire interrupt run in the guest in
> > interrupt context in the host?
>
> It's interrupt context in the host.
>
> > Or is there some sort of wakeup sequence
> > where the guest's interrupt handler runs in process context from the
> > host's perspective? From a quick glance at the code, it looks like
> > the guest's interrupt handler runs in process context (from the host's
> > perspective) and that the SRCU read-side critical section is confined
> > to the host's interrupt handler.
>
> Correct.
>
> > Another (stupid) question... Can the process-level uses of ->irq_srcu
> > instead use ->srcu?
>
> I suppose you would synchronize_srcu on both in the write side?
> Probably not because because ->srcu has really really long read-side
> critical sections. Basically the entire EQS of RCU is a read-side
> ->srcu critical section.
>
> Unfortunately the write side of ->irq_srcu is also relatively
> performance-critical, and since ->irq_srcu read-side critical sections
> are very short, having a separate SRCU instance solved the issue quite
> nicely. Originally Christian Borntraeger had proposed turning the
> relevant synchronize_rcu into synchronize_rcu_expedited, but as you said
> it's not good for real-time so here we are.
Hey, I had to ask! ;-)
Thanx, Paul
> Paolo
>
> > As the code is written now, it is OK to use
> > srcu_read_lock() and srcu_read_unlock() from interrupt handlers as
> > long as they are never used from either process level or from nested
> > interrupt handlers. Does that work for you?
> >
> > If this is the case, another approach would be for updates to wait for
> > both an SRCU grace period and an RCU-sched grace period, perhaps with
> > the latter expedited, though that won't make the real-time guys happy.
> >
> > Thanx, Paul
> >
> >> Thanks,
> >>
> >> Paolo
> >>
> >>> Possible solution
> >>> -----------------
> >>> One way is to avoid the srcu_read_lock/unlock usage in irq context.
> >>> In arm/arm64 architecture,
> >>> if (kvm_arch_set_irq_inatomic(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1,
> >>> false) == -EWOULDBLOCK)
> >>> in irqfd_wakefup is always true and hence schedule_work can be called
> >>> directly
> >>>
> >>> @@ -195,6 +195,11 @@ int __attribute__((weak))
> >>> kvm_arch_set_irq_inatomic(
> >>> int idx;
> >>>
> >>> if (flags & POLLIN) {
> >>> + if (kvm_arch_set_irq_will_block_always()) {
> >>> + schedule_work(&irqfd->inject);
> >>> + goto skiplock;
> >>> + }
> >>> +
> >>> idx = srcu_read_lock(&kvm->irq_srcu);
> >>> do {
> >>> seq =
> >>> read_seqcount_begin(&irqfd->irq_entry_sc);
> >>> @@ -208,6 +213,7 @@ int __attribute__((weak))
> >>> kvm_arch_set_irq_inatomic(
> >>> srcu_read_unlock(&kvm->irq_srcu, idx);
> >>> }
> >>>
> >>> +skiplock:
> >>> if (flags & POLLHUP) {
> >>> /* The eventfd is closing, detach from KVM */
> >>> unsigned long flags;
> >>>
> >>> This works without giving any warnings as well.
> >>>
> >>> Is a patch welcome in that direction ?
> >>>
> >>> Appreicate your feedback on this.
> >>>
> >>>
> >>>
> >>
> >
>
* Re: Possible race condition during lock_count increment in srcu_read_lock
2017-05-29 20:14 ` Paul E. McKenney
@ 2017-05-30 13:49 ` Paolo Bonzini
2017-05-30 16:19 ` Paul E. McKenney
2017-05-31 13:03 ` Linu Cherian
0 siblings, 2 replies; 8+ messages in thread
From: Paolo Bonzini @ 2017-05-30 13:49 UTC (permalink / raw)
To: paulmck
Cc: Linu Cherian, rkrcmar, kvm, linu.cherian, sunil.goutham,
Geethasowjanya.Akula
On 29/05/2017 22:14, Paul E. McKenney wrote:
> On Mon, May 29, 2017 at 06:38:19PM +0200, Paolo Bonzini wrote:
>> On 29/05/2017 17:50, Paul E. McKenney wrote:
>>> You actually are not supposed to call __srcu_read_lock() and
>>> __srcu_read_unlock() from irq context. (Unless you know that you
>>> interrupted some context that is guaranteed never to use the srcu_struct
>>> in question or some such.)
I was going to improve the SRCU doc comments, and then started digging
into srcu.c history to find out why srcu_read_unlock uses this_cpu_inc.
It turns out that back in 2012 __srcu_read_lock did two accesses to
percpu variables, while __srcu_read_unlock did one (a decrement). Then,
commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29) changed __srcu_read_unlock to use
this_cpu_dec and remove the preempt_disable/preempt_enable pair.
Nowadays, __srcu_read_lock only does one access to percpu variables so
we could do the same as commit 5a41344a3d83 did:
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 167ad8831aaf..4c1d5f7e62c4 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -172,9 +172,7 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)
 {
     int retval;
 
-    preempt_disable();
     retval = __srcu_read_lock(sp);
-    preempt_enable();
     rcu_lock_acquire(&(sp)->dep_map);
     return retval;
 }
diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index 584d8a983883..dea03614263f 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -263,7 +263,7 @@ void cleanup_srcu_struct(struct srcu_struct *sp)
 
 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
+ * srcu_struct.
  * Returns an index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
@@ -271,7 +271,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
     int idx;
 
     idx = READ_ONCE(sp->completed) & 0x1;
-    __this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
+    this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
     smp_mb(); /* B */  /* Avoid leaking the critical section. */
     return idx;
 }
@@ -281,7 +281,6 @@ int __srcu_read_lock(struct srcu_struct *sp)
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different
  * CPU than that which was incremented by the corresponding srcu_read_lock().
- * Must be called from process context.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
As said earlier, this would slow down SRCU on machines where neither fast
this_cpu_inc nor atomic_inc_relaxed are available, and local_irq_save/restore
is slower than disabling preemption. The main one is s390, which however is
already paying the price in __srcu_read_unlock.
Any objections to the above for 4.12, with commit message explaining both the
history of __srcu_read_unlock and what this fixes in KVM?
Thanks,
Paolo
>>>
>>> Given that this is the KVM irq injection path, I am guessing that it
>>> is executed quite frequently, so that you do need a low-cost highly
>>> scalable solution. However, you only really need to exclude between
>>> process level and the various interrupts, so there are three ways
>>> that an architecture could handle this:
>>>
>>> 1. Single memory-to-memory increment instructions, as you say.
>>>
>>> 2. Atomic read-modify-write instructions, again as you say.
>>>
>>> 3. Disable interrupts across __srcu_read_lock() and __srcu_read_unlock().
>>>
>>> Different architectures would want different approached, depending on what
>>> instructions are available and the relative speed of these instructions.
>>> It would not be hard to make this an architecture choice on the one
>>> hand or to provide an srcu_read_lock_irq() and srcu_read_unlock_irq()
>>> on the other. But I recently got some serious pushback on RCU API
>>> proliferation on the one hand and SRCU in particular on the other. So
>>> it might be good to look at some alternatives as well.
>>
>> Yeah, I liked Linu's suggestion of this_cpu_inc because it's already
>> used in srcu_read_unlock, and because it can fall back to (3)
>> automatically. Relaxed atomics weren't much of a proposal, rather a way
>> to implement this_cpu_inc on some architectures that do not support it
>> yet (notably 32-bit ARM, MIPS and PPC).
>>
>> The disadvantage would be slowing down SRCU on machines where neither
>> (1) nor (2) are possible, and local_irq_save/restore is slower than
>> disabling preemption. But I can certainly wrap ->irq_srcu lock/unlock
>> into my own helpers that take care of saving/restoring the interrupt
>> state, like
>>
>> #define kvm_irq_srcu_read_lock_irqsave(kvm, flags)
>> #define kvm_irq_srcu_read_unlock_irqrestore(kvm, flags)
>>
>> This would be very simple, confined into KVM and respectful of the SRCU
>> API requirements.
>
> I do like this approach! If it turns out to be needed in (say) four
> other places, then it might make sense for me to pull this into SRCU as
> an additional API.
Good. On the other hand, I was
* Re: Possible race condition during lock_count increment in srcu_read_lock
2017-05-30 13:49 ` Paolo Bonzini
@ 2017-05-30 16:19 ` Paul E. McKenney
2017-05-31 13:03 ` Linu Cherian
1 sibling, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2017-05-30 16:19 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Linu Cherian, rkrcmar, kvm, linu.cherian, sunil.goutham,
Geethasowjanya.Akula
On Tue, May 30, 2017 at 03:49:31PM +0200, Paolo Bonzini wrote:
Just out of curiosity, why are we off-list?
> On 29/05/2017 22:14, Paul E. McKenney wrote:
> > On Mon, May 29, 2017 at 06:38:19PM +0200, Paolo Bonzini wrote:
> >> On 29/05/2017 17:50, Paul E. McKenney wrote:
> >>> You actually are not supposed to call __srcu_read_lock() and
> >>> __srcu_read_unlock() from irq context. (Unless you know that you
> >>> interrupted some context that is guaranteed never to use the srcu_struct
> >>> in question or some such.)
>
> I was going to improve the SRCU doc comments, and then started digging
> into srcu.c history to find out why srcu_read_unlock uses this_cpu_inc.
>
> It turns out that back in 2012 __srcu_read_lock did two accesses to
> percpu variables, while __srcu_read_unlock did one (a decrement). Then,
> commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
> this_cpu_dec()", 2012-11-29) changed __srcu_read_unlock to use
> this_cpu_dec and remove the preempt_disable/preempt_enable pair.
Yes.
> Nowadays, __srcu_read_lock only does one access to percpu variables so
> we could do the same as commit 5a41344a3d83 did:
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 167ad8831aaf..4c1d5f7e62c4 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -172,9 +172,7 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)
> {
> int retval;
>
> - preempt_disable();
> retval = __srcu_read_lock(sp);
> - preempt_enable();
> rcu_lock_acquire(&(sp)->dep_map);
> return retval;
> }
> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
This is the old version, which is there for fast recovery if the new
scalable version fails. Which it has not recently, so it is likely
going away in v4.13.
So kernel/rcu/srcutree.c and kernel/rcu/srcutiny.c need similar changes.
And yes, there is some push for kernel/rcu/srcutiny.c to go away as well,
along with kernel/rcu/rcutiny.c, but in v4.12 they are still here.
> index 584d8a983883..dea03614263f 100644
> --- a/kernel/rcu/srcu.c
> +++ b/kernel/rcu/srcu.c
> @@ -263,7 +263,7 @@ void cleanup_srcu_struct(struct srcu_struct *sp)
>
> /*
> * Counts the new reader in the appropriate per-CPU element of the
> - * srcu_struct. Must be called from process context.
> + * srcu_struct.
> * Returns an index that must be passed to the matching srcu_read_unlock().
> */
> int __srcu_read_lock(struct srcu_struct *sp)
> @@ -271,7 +271,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
> int idx;
>
> idx = READ_ONCE(sp->completed) & 0x1;
> - __this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
> + this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
> smp_mb(); /* B */ /* Avoid leaking the critical section. */
> return idx;
> }
> @@ -281,7 +281,6 @@ int __srcu_read_lock(struct srcu_struct *sp)
> * Removes the count for the old reader from the appropriate per-CPU
> * element of the srcu_struct. Note that this may well be a different
> * CPU than that which was incremented by the corresponding srcu_read_lock().
> - * Must be called from process context.
> */
> void __srcu_read_unlock(struct srcu_struct *sp, int idx)
> {
>
>
> As said earlier, this would slow down SRCU on machines where neither fast
> this_cpu_inc nor atomic_inc_relaxed are available, and local_irq_save/restore
> is slower than disabling preemption. The main one is s390, which however is
> already paying the price in __srcu_read_unlock.
>
> Any objections to the above for 4.12, with commit message explaining both the
> history of __srcu_read_unlock and what this fixes in KVM?
With the changes also applied to kernel/rcu/srcutree.c and
kernel/rcu/srcutiny.c, given that s390 has not complained about
__srcu_read_unlock(), no objections.
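For reference, the similar change in kernel/rcu/srcutree.c would
presumably be the same one-line substitution in its __srcu_read_lock()
(sketched from the v4.12-era tree, so treat the surrounding context as
approximate rather than verified):

--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ ... @@ int __srcu_read_lock(struct srcu_struct *sp)
     int idx;
 
     idx = READ_ONCE(sp->srcu_idx) & 0x1;
-    __this_cpu_inc(sp->sda->srcu_lock_count[idx]);
+    this_cpu_inc(sp->sda->srcu_lock_count[idx]);
     smp_mb(); /* B */  /* Avoid leaking the critical section. */
     return idx;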
Thanx, Paul
> Thanks,
>
> Paolo
>
> >>>
> >>> Given that this is the KVM irq injection path, I am guessing that it
> >>> is executed quite frequently, so that you do need a low-cost highly
> >>> scalable solution. However, you only really need to exclude between
> >>> process level and the various interrupts, so there are three ways
> >>> that an architecture could handle this:
> >>>
> >>> 1. Single memory-to-memory increment instructions, as you say.
> >>>
> >>> 2. Atomic read-modify-write instructions, again as you say.
> >>>
> >>> 3. Disable interrupts across __srcu_read_lock() and __srcu_read_unlock().
> >>>
> >>> Different architectures would want different approached, depending on what
> >>> instructions are available and the relative speed of these instructions.
> >>> It would not be hard to make this an architecture choice on the one
> >>> hand or to provide an srcu_read_lock_irq() and srcu_read_unlock_irq()
> >>> on the other. But I recently got some serious pushback on RCU API
> >>> proliferation on the one hand and SRCU in particular on the other. So
> >>> it might be good to look at some alternatives as well.
> >>
> >> Yeah, I liked Linu's suggestion of this_cpu_inc because it's already
> >> used in srcu_read_unlock, and because it can fall back to (3)
> >> automatically. Relaxed atomics weren't much of a proposal, rather a way
> >> to implement this_cpu_inc on some architectures that do not support it
> >> yet (notably 32-bit ARM, MIPS and PPC).
> >>
> >> The disadvantage would be slowing down SRCU on machines where neither
> >> (1) nor (2) are possible, and local_irq_save/restore is slower than
> >> disabling preemption. But I can certainly wrap ->irq_srcu lock/unlock
> >> into my own helpers that take care of saving/restoring the interrupt
> >> state, like
> >>
> >> #define kvm_irq_srcu_read_lock_irqsave(kvm, flags)
> >> #define kvm_irq_srcu_read_unlock_irqrestore(kvm, flags)
> >>
> >> This would be very simple, confined into KVM and respectful of the SRCU
> >> API requirements.
> >
> > I do like this approach! If it turns out to be needed in (say) four
> > other places, then it might make sense for me to pull this into SRCU as
> > an additional API.
>
> Good. On the other hand, I was
>
* Re: Possible race condition during lock_count increment in srcu_read_lock
2017-05-30 13:49 ` Paolo Bonzini
2017-05-30 16:19 ` Paul E. McKenney
@ 2017-05-31 13:03 ` Linu Cherian
1 sibling, 0 replies; 8+ messages in thread
From: Linu Cherian @ 2017-05-31 13:03 UTC (permalink / raw)
To: Paolo Bonzini
Cc: paulmck, rkrcmar, kvm, linu.cherian, sunil.goutham,
Geethasowjanya.Akula
On Tue, May 30, 2017 at 03:49:31PM +0200, Paolo Bonzini wrote:
>
>
> On 29/05/2017 22:14, Paul E. McKenney wrote:
> > On Mon, May 29, 2017 at 06:38:19PM +0200, Paolo Bonzini wrote:
> >> On 29/05/2017 17:50, Paul E. McKenney wrote:
> >>> You actually are not supposed to call __srcu_read_lock() and
> >>> __srcu_read_unlock() from irq context. (Unless you know that you
> >>> interrupted some context that is guaranteed never to use the srcu_struct
> >>> in question or some such.)
>
> I was going to improve the SRCU doc comments, and then started digging
> into srcu.c history to find out why srcu_read_unlock uses this_cpu_inc.
>
> It turns out that back in 2012 __srcu_read_lock did two accesses to
> percpu variables, while __srcu_read_unlock did one (a decrement). Then,
> commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
> this_cpu_dec()", 2012-11-29) changed __srcu_read_unlock to use
> this_cpu_dec and remove the preempt_disable/preempt_enable pair.
>
> Nowadays, __srcu_read_lock only does one access to percpu variables so
> we could do the same as commit 5a41344a3d83 did:
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 167ad8831aaf..4c1d5f7e62c4 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -172,9 +172,7 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)
> {
> int retval;
>
> - preempt_disable();
> retval = __srcu_read_lock(sp);
> - preempt_enable();
> rcu_lock_acquire(&(sp)->dep_map);
> return retval;
> }
> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> index 584d8a983883..dea03614263f 100644
> --- a/kernel/rcu/srcu.c
> +++ b/kernel/rcu/srcu.c
> @@ -263,7 +263,7 @@ void cleanup_srcu_struct(struct srcu_struct *sp)
>
> /*
> * Counts the new reader in the appropriate per-CPU element of the
> - * srcu_struct. Must be called from process context.
> + * srcu_struct.
> * Returns an index that must be passed to the matching srcu_read_unlock().
> */
> int __srcu_read_lock(struct srcu_struct *sp)
> @@ -271,7 +271,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
> int idx;
>
> idx = READ_ONCE(sp->completed) & 0x1;
> - __this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
> + this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
> smp_mb(); /* B */ /* Avoid leaking the critical section. */
> return idx;
> }
> @@ -281,7 +281,6 @@ int __srcu_read_lock(struct srcu_struct *sp)
> * Removes the count for the old reader from the appropriate per-CPU
> * element of the srcu_struct. Note that this may well be a different
> * CPU than that which was incremented by the corresponding srcu_read_lock().
> - * Must be called from process context.
> */
> void __srcu_read_unlock(struct srcu_struct *sp, int idx)
> {
>
>
> As said earlier, this would slow down SRCU on machines where neither fast
> this_cpu_inc nor atomic_inc_relaxed are available, and local_irq_save/restore
> is slower than disabling preemption. The main one is s390, which however is
> already paying the price in __srcu_read_unlock.
>
> Any objections to the above for 4.12, with commit message explaining both the
> history of __srcu_read_unlock and what this fixes in KVM?
>
> Thanks,
>
> Paolo
>
This works for me, at least on the 4.11.2 kernel on the Cavium arm64 platform.
Thanks.
--
Linu cherian