* [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU.
@ 2015-03-17 15:38 Konrad Rzeszutek Wilk
2015-03-17 16:06 ` Jan Beulich
2015-03-19 7:53 ` Jan Beulich
0 siblings, 2 replies; 9+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-17 15:38 UTC (permalink / raw)
To: jbeulich, xen-devel; +Cc: Konrad Rzeszutek Wilk
There is race when we clear the STATE_SCHED in the softirq
- which allows the 'raise_softirq_for' (on another CPU or
on the one running the softirq) to schedule the dpci.
Specifically this can happen when the other CPU receives
an interrupt, calls 'raise_softirq_for', and puts the dpci
on its per-cpu list (same dpci structure). Note that
this could also happen on the same physical CPU, however
the explanation for simplicity will assume two CPUs actors.
There would be two 'dpci_softirq' running at the same time
(on different CPUs) where on one CPU it would be executing
hvm_dirq_assist (so had cleared STATE_SCHED and set STATE_RUN)
and on the other CPU it is trying to call:
if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
BUG();
Since STATE_RUN is already set it would end badly.
The reason we can get his with this is when an interrupt
affinity is set over multiple CPUs.
Potential solutions:
a) Instead of the BUG() we can put the dpci back on the per-cpu
list to deal with later (when the softirq are activated again).
This putting the 'dpci' back on the per-cpu list is an spin
until the bad condition clears.
b) We could also expand the test-and-set(STATE_SCHED) in raise_softirq_for
to detect for 'STATE_RUN' bit being set and schedule the dpci.
The BUG() check in dpci_softirq would be replace with a spin until
'STATE_RUN' has been cleared. The dpci would still not
be scheduled when STATE_SCHED bit was set.
c) Only schedule the dpci when the state is cleared
(no STATE_SCHED and no STATE_RUN). It would spin if STATE_RUN is set
(as it is in progress and will finish). If the STATE_SCHED is set
(so hasn't run yet) we won't try to spin and just exit.
Down-sides of the solutions:
a). Live-lock of the CPU. We could be finishing an dpci, then adding
the dpci, exiting, and the processing the dpci once more. And so on.
We would eventually stop as the TIMER_SOFTIRQ would be set, which will
cause SCHEDULER_SOFTIRQ to be set as well and we would exit this loop.
Interestingly the old ('tasklet') code used this mechanism.
If the function assigned to the tasklet was running - the softirq
that ran said function (hvm_dirq_assist) would be responsible for
putting the tasklet back on the per-cpu list. This would allow
to have an running tasklet and an 'to-be-scheduled' tasklet
at the same time.
b). is similar to a) - instead of re-entering the dpci_softirq
we are looping in the softirq waiting for the correct condition to
arrive. As it does not allow unwedging ourselves because the other
softirqs are not called - it is less preferable.
c) can cause an dead-lock if the interrupt comes in when we are
processing the dpci in the softirq - iff this happens on the same CPU.
We would be looping in on raise_softirq waiting for STATE_RUN
to be cleared, while the softirq that was to clear it - is preempted
by our interrupt handler.
As such, this patch - which implements a) is the best candidate
for this quagmire.
Reported-and-Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-and-Tested-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
xen/drivers/passthrough/io.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..9b77334 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -804,7 +804,17 @@ static void dpci_softirq(void)
d = pirq_dpci->dom;
smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
- BUG();
+ {
+ unsigned long flags;
+
+ /* Put back on the list and retry. */
+ local_irq_save(flags);
+ list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+ local_irq_restore(flags);
+
+ raise_softirq(HVM_DPCI_SOFTIRQ);
+ continue;
+ }
/*
* The one who clears STATE_SCHED MUST refcount the domain.
*/
--
2.1.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU.
2015-03-17 15:38 [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU Konrad Rzeszutek Wilk
@ 2015-03-17 16:06 ` Jan Beulich
2015-03-17 17:15 ` Konrad Rzeszutek Wilk
2015-03-19 7:53 ` Jan Beulich
1 sibling, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2015-03-17 16:06 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: xen-devel
>>> On 17.03.15 at 16:38, <konrad.wilk@oracle.com> wrote:
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -804,7 +804,17 @@ static void dpci_softirq(void)
> d = pirq_dpci->dom;
> smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
> if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> - BUG();
> + {
> + unsigned long flags;
> +
> + /* Put back on the list and retry. */
> + local_irq_save(flags);
> + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> + local_irq_restore(flags);
> +
> + raise_softirq(HVM_DPCI_SOFTIRQ);
> + continue;
> + }
As just said in another mail - unless there are convincing new
arguments in favor of this (more of a hack than a real fix), I'm
not going to accept it and instead consider reverting the
offending commit. Iirc the latest we had come to looked quite a
bit better than this one.
Jan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU.
2015-03-17 16:06 ` Jan Beulich
@ 2015-03-17 17:15 ` Konrad Rzeszutek Wilk
2015-03-18 7:38 ` Jan Beulich
0 siblings, 1 reply; 9+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-17 17:15 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
[-- Attachment #1: Type: text/plain, Size: 1309 bytes --]
On Tue, Mar 17, 2015 at 04:06:14PM +0000, Jan Beulich wrote:
> >>> On 17.03.15 at 16:38, <konrad.wilk@oracle.com> wrote:
> > --- a/xen/drivers/passthrough/io.c
> > +++ b/xen/drivers/passthrough/io.c
> > @@ -804,7 +804,17 @@ static void dpci_softirq(void)
> > d = pirq_dpci->dom;
> > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
> > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> > - BUG();
> > + {
> > + unsigned long flags;
> > +
> > + /* Put back on the list and retry. */
> > + local_irq_save(flags);
> > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> > + local_irq_restore(flags);
> > +
> > + raise_softirq(HVM_DPCI_SOFTIRQ);
> > + continue;
> > + }
>
> As just said in another mail - unless there are convincing new
> arguments in favor of this (more of a hack than a real fix), I'm
> not going to accept it and instead consider reverting the
> offending commit. Iirc the latest we had come to looked quite a
> bit better than this one.
The latest one (please see attached) would cause an dead-lock iff
on the CPU we are running the softirq and an do_IRQ comes for the
exact dpci we are in process of executing.
>
> Jan
>
[-- Attachment #2: 0001-dpci-when-scheduling-spin-until-STATE_RUN-or-STATE_S.patch --]
[-- Type: text/plain, Size: 3759 bytes --]
>From 6b32dccfbe00518d3ca9cd94d19a6e007b2645d9 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 17 Mar 2015 09:46:09 -0400
Subject: [PATCH] dpci: when scheduling spin until STATE_RUN or STATE_SCHED has
been cleared.
There is race when we clear the STATE_SCHED in the softirq
- which allows the 'raise_softirq_for' (on another CPU)
to schedule the dpci.
Specifically this can happen whenthe other CPU receives
an interrupt, calls 'raise_softirq_for', and puts the dpci
on its per-cpu list (same dpci structure).
There would be two 'dpci_softirq' running at the same time
(on different CPUs) where on one CPU it would be executing
hvm_dirq_assist (so had cleared STATE_SCHED and set STATE_RUN)
and on the other CPU it is trying to call:
if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
BUG();
Since STATE_RUN is already set it would end badly.
The reason we can get his with this is when an interrupt
affinity is set over multiple CPUs.
Potential solutions:
a) Instead of the BUG() we can put the dpci back on the per-cpu
list to deal with later (when the softirq are activated again).
This putting the 'dpci' back on the per-cpu list is an spin
until the bad condition clears.
b) We could also expand the test-and-set(STATE_SCHED) in raise_softirq_for
to detect for 'STATE_RUN' bit being set and schedule the dpci
in a more safe manner (delay it). The dpci would stil not
be scheduled when STATE_SCHED bit was set.
c) This patch explores a third option - we will only schedule
the dpci when the state is cleared (no STATE_SCHED and no STATE_RUN).
We will spin if STATE_RUN is set (as it is in progress and will
finish). If the STATE_SCHED is set (so hasn't run yet) we won't
try to spin and just exit. This can cause an dead-lock if the interrupt
comes when we are processing the dpci in the softirq.
Interestingly the old ('tasklet') code used an a) mechanism.
If the function assigned to the tasklet was running - the softirq
that ran said function (hvm_dirq_assist) would be responsible for
putting the tasklet back on the per-cpu list. This would allow
to have an running tasklet and an 'to-be-scheduled' tasklet
at the same time. This solution moves this 'to-be-scheduled'
job to be done in 'raise_softirq_for' (instead of the
'softirq').
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
xen/drivers/passthrough/io.c | 28 +++++++++++++++++++++++++---
1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..9c30ebb 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -63,10 +63,32 @@ enum {
static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
{
unsigned long flags;
+ unsigned long old;
- if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
- return;
-
+ /*
+ * This cmpxchg spins until the state is zero (unused).
+ */
+ for ( ;; )
+ {
+ old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED);
+ switch ( old )
+ {
+ case (1 << STATE_SCHED):
+ /*
+ * Whenever STATE_SCHED is set we MUST not schedule it.
+ */
+ return;
+ case (1 << STATE_RUN) | (1 << STATE_SCHED):
+ case (1 << STATE_RUN):
+ /* Getting close to finish. Spin. */
+ continue;
+ }
+ /*
+ * If the 'state' is 0 (not in use) we can schedule it.
+ */
+ if ( old == 0 )
+ break;
+ }
get_knownalive_domain(pirq_dpci->dom);
local_irq_save(flags);
--
2.1.0
[-- Attachment #3: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU.
2015-03-17 17:15 ` Konrad Rzeszutek Wilk
@ 2015-03-18 7:38 ` Jan Beulich
2015-03-18 13:58 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2015-03-18 7:38 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: xen-devel
>>> On 17.03.15 at 18:15, <konrad.wilk@oracle.com> wrote:
> On Tue, Mar 17, 2015 at 04:06:14PM +0000, Jan Beulich wrote:
>> >>> On 17.03.15 at 16:38, <konrad.wilk@oracle.com> wrote:
>> > --- a/xen/drivers/passthrough/io.c
>> > +++ b/xen/drivers/passthrough/io.c
>> > @@ -804,7 +804,17 @@ static void dpci_softirq(void)
>> > d = pirq_dpci->dom;
>> > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
>> > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
>> > - BUG();
>> > + {
>> > + unsigned long flags;
>> > +
>> > + /* Put back on the list and retry. */
>> > + local_irq_save(flags);
>> > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
>> > + local_irq_restore(flags);
>> > +
>> > + raise_softirq(HVM_DPCI_SOFTIRQ);
>> > + continue;
>> > + }
>>
>> As just said in another mail - unless there are convincing new
>> arguments in favor of this (more of a hack than a real fix), I'm
>> not going to accept it and instead consider reverting the
>> offending commit. Iirc the latest we had come to looked quite a
>> bit better than this one.
>
> The latest one (please see attached) would cause an dead-lock iff
> on the CPU we are running the softirq and an do_IRQ comes for the
> exact dpci we are in process of executing.
I'm afraid I'm not seeing it - please explain.
As to the code - I think switch() is rather hiding the intentions
here, i.e. the code would be better readable if using two if()s:
+ for ( ;; )
+ {
+ old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED);
+ /* If the 'state' is 0 (not in use) we can schedule it. */
+ if ( !old )
+ break;
+ if ( !(old & (1 << STATE_RUN)) )
+ {
+ /* Whenever STATE_SCHED is set we MUST not schedule it. */
+ assert(old & (1 << STATE_SCHED));
+ return;
+ }
+ }
Jan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU.
2015-03-18 7:38 ` Jan Beulich
@ 2015-03-18 13:58 ` Konrad Rzeszutek Wilk
2015-03-18 16:06 ` Jan Beulich
0 siblings, 1 reply; 9+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-18 13:58 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
On Wed, Mar 18, 2015 at 07:38:12AM +0000, Jan Beulich wrote:
> >>> On 17.03.15 at 18:15, <konrad.wilk@oracle.com> wrote:
> > On Tue, Mar 17, 2015 at 04:06:14PM +0000, Jan Beulich wrote:
> >> >>> On 17.03.15 at 16:38, <konrad.wilk@oracle.com> wrote:
> >> > --- a/xen/drivers/passthrough/io.c
> >> > +++ b/xen/drivers/passthrough/io.c
> >> > @@ -804,7 +804,17 @@ static void dpci_softirq(void)
> >> > d = pirq_dpci->dom;
> >> > smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
> >> > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> >> > - BUG();
> >> > + {
> >> > + unsigned long flags;
> >> > +
> >> > + /* Put back on the list and retry. */
> >> > + local_irq_save(flags);
> >> > + list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> >> > + local_irq_restore(flags);
> >> > +
> >> > + raise_softirq(HVM_DPCI_SOFTIRQ);
> >> > + continue;
> >> > + }
> >>
> >> As just said in another mail - unless there are convincing new
> >> arguments in favor of this (more of a hack than a real fix), I'm
> >> not going to accept it and instead consider reverting the
> >> offending commit. Iirc the latest we had come to looked quite a
> >> bit better than this one.
> >
> > The latest one (please see attached) would cause an dead-lock iff
> > on the CPU we are running the softirq and an do_IRQ comes for the
> > exact dpci we are in process of executing.
>
> I'm afraid I'm not seeing it - please explain.
Lets assume that the device is an PCIe with MSI. We have only one
VCPU in the guest.
We receive the first interrupt, end up going:
vmx_vmexit_handler
- case EXIT_REASON_EXTERNAL_INTERRUPT
\- vmx_do_extint
\- do_IRQ
\- __do_IRQ_guest
\- hvm_do_IRQ_dpci
\- raise_softirq_for
[DPCI_SOFTIRQ bit set]
vmx_process_softirq
sti
do_softirq
-\ __do_sofitq_
\- dpci_softirq
-\ hvm_dirq_assist
[state is 'running']
[Same vector comes in]
do_IRQ
\- __do_IRQ_guest
\- hvm_do_IRQ_dpci
\- raise_softirq_for
[here we either
a) spin waiting 'running' to be done or -- dead-lock
b) we just exit out and drop this interrupt,
c) increment 'masked' to tell 'dpci_softirq' to reschedule
-- live lock if this keeps on going]
Now c) is protected - the do_IRQ has anti-storm code so eventually
it will stop.
>
> As to the code - I think switch() is rather hiding the intentions
> here, i.e. the code would be better readable if using two if()s:
>
> + for ( ;; )
> + {
> + old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED);
> + /* If the 'state' is 0 (not in use) we can schedule it. */
> + if ( !old )
> + break;
> + if ( !(old & (1 << STATE_RUN)) )
> + {
> + /* Whenever STATE_SCHED is set we MUST not schedule it. */
> + assert(old & (1 << STATE_SCHED));
> + return;
> + }
> + }
>
> Jan
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU.
2015-03-18 13:58 ` Konrad Rzeszutek Wilk
@ 2015-03-18 16:06 ` Jan Beulich
2015-03-18 16:14 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2015-03-18 16:06 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: xen-devel
>>> On 18.03.15 at 14:58, <konrad.wilk@oracle.com> wrote:
> On Wed, Mar 18, 2015 at 07:38:12AM +0000, Jan Beulich wrote:
>> >>> On 17.03.15 at 18:15, <konrad.wilk@oracle.com> wrote:
>> > The latest one (please see attached) would cause an dead-lock iff
>> > on the CPU we are running the softirq and an do_IRQ comes for the
>> > exact dpci we are in process of executing.
>>
>> I'm afraid I'm not seeing it - please explain.
>
> Lets assume that the device is an PCIe with MSI. We have only one
> VCPU in the guest.
>
> We receive the first interrupt, end up going:
> vmx_vmexit_handler
> - case EXIT_REASON_EXTERNAL_INTERRUPT
> \- vmx_do_extint
> \- do_IRQ
> \- __do_IRQ_guest
> \- hvm_do_IRQ_dpci
> \- raise_softirq_for
> [DPCI_SOFTIRQ bit set]
> vmx_process_softirq
> sti
> do_softirq
> -\ __do_sofitq_
> \- dpci_softirq
> -\ hvm_dirq_assist
> [state is 'running']
>
> [Same vector comes in]
Is that indeed possible? Afaict nothing in the code sequence above
ack-ed the interrupt, and hence another one can't come in.
Jan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU.
2015-03-18 16:06 ` Jan Beulich
@ 2015-03-18 16:14 ` Konrad Rzeszutek Wilk
0 siblings, 0 replies; 9+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-18 16:14 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
On Wed, Mar 18, 2015 at 04:06:37PM +0000, Jan Beulich wrote:
> >>> On 18.03.15 at 14:58, <konrad.wilk@oracle.com> wrote:
> > On Wed, Mar 18, 2015 at 07:38:12AM +0000, Jan Beulich wrote:
> >> >>> On 17.03.15 at 18:15, <konrad.wilk@oracle.com> wrote:
> >> > The latest one (please see attached) would cause an dead-lock iff
> >> > on the CPU we are running the softirq and an do_IRQ comes for the
> >> > exact dpci we are in process of executing.
> >>
> >> I'm afraid I'm not seeing it - please explain.
> >
> > Lets assume that the device is an PCIe with MSI. We have only one
> > VCPU in the guest.
> >
> > We receive the first interrupt, end up going:
> > vmx_vmexit_handler
> > - case EXIT_REASON_EXTERNAL_INTERRUPT
> > \- vmx_do_extint
> > \- do_IRQ
> > \- __do_IRQ_guest
> > \- hvm_do_IRQ_dpci
> > \- raise_softirq_for
> > [DPCI_SOFTIRQ bit set]
> > vmx_process_softirq
> > sti
> > do_softirq
> > -\ __do_sofitq_
> > \- dpci_softirq
> > -\ hvm_dirq_assist
> > [state is 'running']
> >
> > [Same vector comes in]
>
> Is that indeed possible? Afaict nothing in the code sequence above
> ack-ed the interrupt, and hence another one can't come in.
The do_IRQ acks it:
854 spin_lock(&desc->lock);
855 desc->handler->ack(desc);
which is:
424 static void ack_maskable_msi_irq(struct irq_desc *desc)
425 {
426 ack_nonmaskable_msi_irq(desc);
427 ack_APIC_irq(); /* ACKTYPE_NONE */
428 }
.. snip..
857 if ( likely(desc->status & IRQ_GUEST) )
..
886 __do_IRQ_guest(irq);
>
> Jan
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU.
2015-03-17 15:38 [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU Konrad Rzeszutek Wilk
2015-03-17 16:06 ` Jan Beulich
@ 2015-03-19 7:53 ` Jan Beulich
2015-03-19 19:18 ` Konrad Rzeszutek Wilk
1 sibling, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2015-03-19 7:53 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: xen-devel
>>> On 17.03.15 at 16:38, <konrad.wilk@oracle.com> wrote:
> There is race when we clear the STATE_SCHED in the softirq
> - which allows the 'raise_softirq_for' (on another CPU or
> on the one running the softirq) to schedule the dpci.
>
> Specifically this can happen when the other CPU receives
> an interrupt, calls 'raise_softirq_for', and puts the dpci
> on its per-cpu list (same dpci structure). Note that
> this could also happen on the same physical CPU, however
> the explanation for simplicity will assume two CPUs actors.
>
> There would be two 'dpci_softirq' running at the same time
> (on different CPUs) where on one CPU it would be executing
> hvm_dirq_assist (so had cleared STATE_SCHED and set STATE_RUN)
> and on the other CPU it is trying to call:
>
> if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> BUG();
>
> Since STATE_RUN is already set it would end badly.
>
> The reason we can get his with this is when an interrupt
> affinity is set over multiple CPUs.
>
> Potential solutions:
>
> a) Instead of the BUG() we can put the dpci back on the per-cpu
> list to deal with later (when the softirq are activated again).
> This putting the 'dpci' back on the per-cpu list is an spin
> until the bad condition clears.
>
> b) We could also expand the test-and-set(STATE_SCHED) in raise_softirq_for
> to detect for 'STATE_RUN' bit being set and schedule the dpci.
> The BUG() check in dpci_softirq would be replace with a spin until
> 'STATE_RUN' has been cleared. The dpci would still not
> be scheduled when STATE_SCHED bit was set.
>
> c) Only schedule the dpci when the state is cleared
> (no STATE_SCHED and no STATE_RUN). It would spin if STATE_RUN is set
> (as it is in progress and will finish). If the STATE_SCHED is set
> (so hasn't run yet) we won't try to spin and just exit.
>
> Down-sides of the solutions:
>
> a). Live-lock of the CPU. We could be finishing an dpci, then adding
> the dpci, exiting, and the processing the dpci once more. And so on.
> We would eventually stop as the TIMER_SOFTIRQ would be set, which will
> cause SCHEDULER_SOFTIRQ to be set as well and we would exit this loop.
>
> Interestingly the old ('tasklet') code used this mechanism.
> If the function assigned to the tasklet was running - the softirq
> that ran said function (hvm_dirq_assist) would be responsible for
> putting the tasklet back on the per-cpu list. This would allow
> to have an running tasklet and an 'to-be-scheduled' tasklet
> at the same time.
>
> b). is similar to a) - instead of re-entering the dpci_softirq
> we are looping in the softirq waiting for the correct condition to
> arrive. As it does not allow unwedging ourselves because the other
> softirqs are not called - it is less preferable.
>
> c) can cause an dead-lock if the interrupt comes in when we are
> processing the dpci in the softirq - iff this happens on the same CPU.
> We would be looping in on raise_softirq waiting for STATE_RUN
> to be cleared, while the softirq that was to clear it - is preempted
> by our interrupt handler.
>
> As such, this patch - which implements a) is the best candidate
> for this quagmire.
>
> Reported-and-Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
> Reported-and-Tested-by: Malcolm Crossley <malcolm.crossley@citrix.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
So I now agree that in the state we're in this is the most reasonable
fix. My reservations against the extra logic introduced earlier (and
being fixed here) stand though: From an abstract perspective the
IRQ and softirq logic alone should be sufficient to deal with the
needs we have. The complications really result from the desire to
use a per-CPU list of hvm_pirq_dpci-s, which I still think a simpler
alternative should be found for (after all real hardware doesn't use
such lists).
A first thought would be to put them all on a per-domain list and
have a cpumask tracking which CPUs they need servicing on. The
downside of this would be (apart from again not being a proper
equivalent of how actual hardware handles this) that - the softirq
handler not having any other context - domains needing servicing
would also need to be tracked in some form (in order to avoid
having to iterate over all of them), and a per-CPU list would be
undesirable for the exact same reasons. Yet a per-CPU
domain-needs-service bitmap doesn't seem very attractive either,
i.e. this would need further thought (also to make sure such an
alternative model doesn't become even more involved than what
we have now).
Jan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU.
2015-03-19 7:53 ` Jan Beulich
@ 2015-03-19 19:18 ` Konrad Rzeszutek Wilk
0 siblings, 0 replies; 9+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-19 19:18 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
On Thu, Mar 19, 2015 at 07:53:35AM +0000, Jan Beulich wrote:
> >>> On 17.03.15 at 16:38, <konrad.wilk@oracle.com> wrote:
> > There is race when we clear the STATE_SCHED in the softirq
> > - which allows the 'raise_softirq_for' (on another CPU or
> > on the one running the softirq) to schedule the dpci.
> >
> > Specifically this can happen when the other CPU receives
> > an interrupt, calls 'raise_softirq_for', and puts the dpci
> > on its per-cpu list (same dpci structure). Note that
> > this could also happen on the same physical CPU, however
> > the explanation for simplicity will assume two CPUs actors.
> >
> > There would be two 'dpci_softirq' running at the same time
> > (on different CPUs) where on one CPU it would be executing
> > hvm_dirq_assist (so had cleared STATE_SCHED and set STATE_RUN)
> > and on the other CPU it is trying to call:
> >
> > if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> > BUG();
> >
> > Since STATE_RUN is already set it would end badly.
> >
> > The reason we can get his with this is when an interrupt
> > affinity is set over multiple CPUs.
> >
> > Potential solutions:
> >
> > a) Instead of the BUG() we can put the dpci back on the per-cpu
> > list to deal with later (when the softirq are activated again).
> > This putting the 'dpci' back on the per-cpu list is an spin
> > until the bad condition clears.
> >
> > b) We could also expand the test-and-set(STATE_SCHED) in raise_softirq_for
> > to detect for 'STATE_RUN' bit being set and schedule the dpci.
> > The BUG() check in dpci_softirq would be replace with a spin until
> > 'STATE_RUN' has been cleared. The dpci would still not
> > be scheduled when STATE_SCHED bit was set.
> >
> > c) Only schedule the dpci when the state is cleared
> > (no STATE_SCHED and no STATE_RUN). It would spin if STATE_RUN is set
> > (as it is in progress and will finish). If the STATE_SCHED is set
> > (so hasn't run yet) we won't try to spin and just exit.
> >
> > Down-sides of the solutions:
> >
> > a). Live-lock of the CPU. We could be finishing an dpci, then adding
> > the dpci, exiting, and the processing the dpci once more. And so on.
> > We would eventually stop as the TIMER_SOFTIRQ would be set, which will
> > cause SCHEDULER_SOFTIRQ to be set as well and we would exit this loop.
> >
> > Interestingly the old ('tasklet') code used this mechanism.
> > If the function assigned to the tasklet was running - the softirq
> > that ran said function (hvm_dirq_assist) would be responsible for
> > putting the tasklet back on the per-cpu list. This would allow
> > to have an running tasklet and an 'to-be-scheduled' tasklet
> > at the same time.
> >
> > b). is similar to a) - instead of re-entering the dpci_softirq
> > we are looping in the softirq waiting for the correct condition to
> > arrive. As it does not allow unwedging ourselves because the other
> > softirqs are not called - it is less preferable.
> >
> > c) can cause an dead-lock if the interrupt comes in when we are
> > processing the dpci in the softirq - iff this happens on the same CPU.
> > We would be looping in on raise_softirq waiting for STATE_RUN
> > to be cleared, while the softirq that was to clear it - is preempted
> > by our interrupt handler.
> >
> > As such, this patch - which implements a) is the best candidate
> > for this quagmire.
> >
> > Reported-and-Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
> > Reported-and-Tested-by: Malcolm Crossley <malcolm.crossley@citrix.com>
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>
> So I now agree that in the state we're in this is the most reasonable
> fix. My reservations against the extra logic introduced earlier (and
Thank you.
Are you OK with me checking it in?
> being fixed here) stand though: From an abstract perspective the
> IRQ and softirq logic alone should be sufficient to deal with the
> needs we have. The complications really result from the desire to
> use a per-CPU list of hvm_pirq_dpci-s, which I still think a simpler
> alternative should be found for (after all real hardware doesn't use
> such lists).
>
> A first thought would be to put them all on a per-domain list and
> have a cpumask tracking which CPUs they need servicing on. The
> downside of this would be (apart from again not being a proper
> equivalent of how actual hardware handles this) that - the softirq
> handler not having any other context - domains needing servicing
> would also need to be tracked in some form (in order to avoid
> having to iterate over all of them), and a per-CPU list would be
> undesirable for the exact same reasons. Yet a per-CPU
> domain-needs-service bitmap doesn't seem very attractive either,
> i.e. this would need further thought (also to make sure such an
> alternative model doesn't become even more involved than what
> we have now).
HA! (yes, I completly agree on - "complex" == "unpleasant")
Perhaps we can brainstorm some of this at XenHackathon in Shanghai?
>
> Jan
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2015-03-19 19:18 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-17 15:38 [PATCH] dpci: Put the dpci back on the list if scheduled from another CPU Konrad Rzeszutek Wilk
2015-03-17 16:06 ` Jan Beulich
2015-03-17 17:15 ` Konrad Rzeszutek Wilk
2015-03-18 7:38 ` Jan Beulich
2015-03-18 13:58 ` Konrad Rzeszutek Wilk
2015-03-18 16:06 ` Jan Beulich
2015-03-18 16:14 ` Konrad Rzeszutek Wilk
2015-03-19 7:53 ` Jan Beulich
2015-03-19 19:18 ` Konrad Rzeszutek Wilk
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.