linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* A problem about SPI Interrupt Configuration
@ 2016-01-20  2:38 Yang Yingliang
  2016-01-22  7:57 ` Marc Zyngier
  0 siblings, 1 reply; 4+ messages in thread
From: Yang Yingliang @ 2016-01-20  2:38 UTC (permalink / raw)
  To: linux-arm-kernel

Hi, Marc

I got some error messages "RWP timeout, gone fishing".

The case is :

			CPU0												CPU1
													acquire desc->lock in __setup_irq()
															enable irq in __setup_irq()
read iar in gic_handle_irq()
waiting for desc->lock in handle_fasteoi_irq()
												call gic_set_affinity() from setup_affinity()
												waiting for the irq deactive in gic_do_wait_for_rwp()


The hardware will not clear GICD_CTLR.RWP until the interrupt is not
active. The interrupt is keeping active while it's waiting for
desc->lock on cpu0. But the lock is hold by cpu1 while it's waiting for
the interrupt is not active. It causes a deadlock here in 1s.


And the GICv3 SPEC says:

4.5.5 SPI Interrupt Configuration
To configure an SPI interrupt, to ensure that interrupts are never 
distributed using partially updated configuration
information, software must:
o Ensure the interrupt is not active
o Ensure that the interrupt is disabled
o This might be done either by writing to GICD_CTLR to clear the enables 
for a group, or
o By writing to GICD_ICENABLERn to clear the Enable bit of the interrupt 
(see section 5.3.11).
o In both cases, software must poll GICD_CTLR.RWP to ensure the effects 
are visible (see section
5.3.20).
o Program the routing (if appropriate), priority and group
o Enable the interrupt (if required)

Because it says "Ensure the interrupt is not active", so I can not tell
it is a hardware or software problem.

Can you please give some advice?

Thanks,
Yang

^ permalink raw reply	[flat|nested] 4+ messages in thread

* A problem about SPI Interrupt Configuration
  2016-01-20  2:38 A problem about SPI Interrupt Configuration Yang Yingliang
@ 2016-01-22  7:57 ` Marc Zyngier
  2016-01-28  2:32   ` Yang Yingliang
  0 siblings, 1 reply; 4+ messages in thread
From: Marc Zyngier @ 2016-01-22  7:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 20 Jan 2016 10:38:13 +0800
Yang Yingliang <yangyingliang@huawei.com> wrote:

Hi Yang,

> Hi, Marc
> 
> I got some error messages "RWP timeout, gone fishing".
> 
> The case is :
> 
> 			CPU0												CPU1
> 													acquire desc->lock in __setup_irq()
> 															enable irq in __setup_irq()
> read iar in gic_handle_irq()
> waiting for desc->lock in handle_fasteoi_irq()
> 												call gic_set_affinity() from setup_affinity()
> 												waiting for the irq deactive in gic_do_wait_for_rwp()
> 
> 
> The hardware will not clear GICD_CTLR.RWP until the interrupt is not
> active. The interrupt is keeping active while it's waiting for
> desc->lock on cpu0. But the lock is hold by cpu1 while it's waiting for
> the interrupt is not active. It causes a deadlock here in 1s.
> 
> 
> And the GICv3 SPEC says:
> 
> 4.5.5 SPI Interrupt Configuration
> To configure an SPI interrupt, to ensure that interrupts are never 
> distributed using partially updated configuration
> information, software must:
> o Ensure the interrupt is not active
> o Ensure that the interrupt is disabled
> o This might be done either by writing to GICD_CTLR to clear the enables 
> for a group, or
> o By writing to GICD_ICENABLERn to clear the Enable bit of the interrupt 
> (see section 5.3.11).
> o In both cases, software must poll GICD_CTLR.RWP to ensure the effects 
> are visible (see section
> 5.3.20).
> o Program the routing (if appropriate), priority and group
> o Enable the interrupt (if required)
> 
> Because it says "Ensure the interrupt is not active", so I can not tell
> it is a hardware or software problem.
> 
> Can you please give some advice?

Thanks for the accurate description of the problem. This looks to be a
core issue, or at least a problem between core code and the way the GIC
behaves, unfortunately. The architecture expects the interrupt to be
fully configured before it is enabled and made active, while the core
code does this the other way around.

Can you please have a go at the patch below and let me know if it
improve things? This is just a test, and definitely not the complete
solution, but I'd like to find out if I'm on the right track.

Thanks,

	M.

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 0eebaee..e5802fb 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1303,12 +1303,6 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 		if (new->flags & IRQF_ONESHOT)
 			desc->istate |= IRQS_ONESHOT;
 
-		if (irq_settings_can_autoenable(desc))
-			irq_startup(desc, true);
-		else
-			/* Undo nested disables: */
-			desc->depth = 1;
-
 		/* Exclude IRQ from balancing if requested */
 		if (new->flags & IRQF_NOBALANCING) {
 			irq_settings_set_no_balancing(desc);
@@ -1318,6 +1312,12 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 		/* Set default affinity mask once everything is setup */
 		setup_affinity(desc, mask);
 
+		if (irq_settings_can_autoenable(desc))
+			irq_startup(desc, true);
+		else
+			/* Undo nested disables: */
+			desc->depth = 1;
+
 	} else if (new->flags & IRQF_TRIGGER_MASK) {
 		unsigned int nmsk = new->flags & IRQF_TRIGGER_MASK;
 		unsigned int omsk = irq_settings_get_trigger_mask(desc);

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* A problem about SPI Interrupt Configuration
  2016-01-22  7:57 ` Marc Zyngier
@ 2016-01-28  2:32   ` Yang Yingliang
  2016-01-28  8:50     ` Marc Zyngier
  0 siblings, 1 reply; 4+ messages in thread
From: Yang Yingliang @ 2016-01-28  2:32 UTC (permalink / raw)
  To: linux-arm-kernel



On 2016/1/22 15:57, Marc Zyngier wrote:
> On Wed, 20 Jan 2016 10:38:13 +0800
> Yang Yingliang <yangyingliang@huawei.com> wrote:
>
> Hi Yang,
>
>> Hi, Marc
>>
>> I got some error messages "RWP timeout, gone fishing".
>>
>> The case is :
>>
>> 			CPU0												CPU1
>> 													acquire desc->lock in __setup_irq()
>> 															enable irq in __setup_irq()
>> read iar in gic_handle_irq()
>> waiting for desc->lock in handle_fasteoi_irq()
>> 												call gic_set_affinity() from setup_affinity()
>> 												waiting for the irq deactive in gic_do_wait_for_rwp()
>>
>>
>> The hardware will not clear GICD_CTLR.RWP until the interrupt is not
>> active. The interrupt is keeping active while it's waiting for
>> desc->lock on cpu0. But the lock is hold by cpu1 while it's waiting for
>> the interrupt is not active. It causes a deadlock here in 1s.
>>
>>
>> And the GICv3 SPEC says:
>>
>> 4.5.5 SPI Interrupt Configuration
>> To configure an SPI interrupt, to ensure that interrupts are never
>> distributed using partially updated configuration
>> information, software must:
>> o Ensure the interrupt is not active
>> o Ensure that the interrupt is disabled
>> o This might be done either by writing to GICD_CTLR to clear the enables
>> for a group, or
>> o By writing to GICD_ICENABLERn to clear the Enable bit of the interrupt
>> (see section 5.3.11).
>> o In both cases, software must poll GICD_CTLR.RWP to ensure the effects
>> are visible (see section
>> 5.3.20).
>> o Program the routing (if appropriate), priority and group
>> o Enable the interrupt (if required)
>>
>> Because it says "Ensure the interrupt is not active", so I can not tell
>> it is a hardware or software problem.
>>
>> Can you please give some advice?
>
> Thanks for the accurate description of the problem. This looks to be a
> core issue, or at least a problem between core code and the way the GIC
> behaves, unfortunately. The architecture expects the interrupt to be
> fully configured before it is enabled and made active, while the core
> code does this the other way around.
>
> Can you please have a go at the patch below and let me know if it
> improve things? This is just a test, and definitely not the complete
> solution, but I'd like to find out if I'm on the right track.
>
> Thanks,
>
> 	M.
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 0eebaee..e5802fb 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1303,12 +1303,6 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
>   		if (new->flags & IRQF_ONESHOT)
>   			desc->istate |= IRQS_ONESHOT;
>
> -		if (irq_settings_can_autoenable(desc))
> -			irq_startup(desc, true);
> -		else
> -			/* Undo nested disables: */
> -			desc->depth = 1;
> -
>   		/* Exclude IRQ from balancing if requested */
>   		if (new->flags & IRQF_NOBALANCING) {
>   			irq_settings_set_no_balancing(desc);
> @@ -1318,6 +1312,12 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
>   		/* Set default affinity mask once everything is setup */
>   		setup_affinity(desc, mask);
>
> +		if (irq_settings_can_autoenable(desc))
> +			irq_startup(desc, true);
> +		else
> +			/* Undo nested disables: */
> +			desc->depth = 1;
> +
>   	} else if (new->flags & IRQF_TRIGGER_MASK) {
>   		unsigned int nmsk = new->flags & IRQF_TRIGGER_MASK;
>   		unsigned int omsk = irq_settings_get_trigger_mask(desc);
>

This patch can avoid the case I said in this mail.
But in other case like set affinity by the /proc/irq/x/smp_affinity,
it's will have the same problem _if_ the hardware is waiting for deactive.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 4+ messages in thread

* A problem about SPI Interrupt Configuration
  2016-01-28  2:32   ` Yang Yingliang
@ 2016-01-28  8:50     ` Marc Zyngier
  0 siblings, 0 replies; 4+ messages in thread
From: Marc Zyngier @ 2016-01-28  8:50 UTC (permalink / raw)
  To: linux-arm-kernel

On 28/01/16 02:32, Yang Yingliang wrote:
> 
> 
> On 2016/1/22 15:57, Marc Zyngier wrote:
>> On Wed, 20 Jan 2016 10:38:13 +0800
>> Yang Yingliang <yangyingliang@huawei.com> wrote:
>>
>> Hi Yang,
>>
>>> Hi, Marc
>>>
>>> I got some error messages "RWP timeout, gone fishing".
>>>
>>> The case is :
>>>
>>> 			CPU0												CPU1
>>> 													acquire desc->lock in __setup_irq()
>>> 															enable irq in __setup_irq()
>>> read iar in gic_handle_irq()
>>> waiting for desc->lock in handle_fasteoi_irq()
>>> 												call gic_set_affinity() from setup_affinity()
>>> 												waiting for the irq deactive in gic_do_wait_for_rwp()
>>>
>>>
>>> The hardware will not clear GICD_CTLR.RWP until the interrupt is not
>>> active. The interrupt is keeping active while it's waiting for
>>> desc->lock on cpu0. But the lock is hold by cpu1 while it's waiting for
>>> the interrupt is not active. It causes a deadlock here in 1s.
>>>
>>>
>>> And the GICv3 SPEC says:
>>>
>>> 4.5.5 SPI Interrupt Configuration
>>> To configure an SPI interrupt, to ensure that interrupts are never
>>> distributed using partially updated configuration
>>> information, software must:
>>> o Ensure the interrupt is not active
>>> o Ensure that the interrupt is disabled
>>> o This might be done either by writing to GICD_CTLR to clear the enables
>>> for a group, or
>>> o By writing to GICD_ICENABLERn to clear the Enable bit of the interrupt
>>> (see section 5.3.11).
>>> o In both cases, software must poll GICD_CTLR.RWP to ensure the effects
>>> are visible (see section
>>> 5.3.20).
>>> o Program the routing (if appropriate), priority and group
>>> o Enable the interrupt (if required)
>>>
>>> Because it says "Ensure the interrupt is not active", so I can not tell
>>> it is a hardware or software problem.
>>>
>>> Can you please give some advice?
>>
>> Thanks for the accurate description of the problem. This looks to be a
>> core issue, or at least a problem between core code and the way the GIC
>> behaves, unfortunately. The architecture expects the interrupt to be
>> fully configured before it is enabled and made active, while the core
>> code does this the other way around.
>>
>> Can you please have a go at the patch below and let me know if it
>> improve things? This is just a test, and definitely not the complete
>> solution, but I'd like to find out if I'm on the right track.
>>
>> Thanks,
>>
>> 	M.
>>
>> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>> index 0eebaee..e5802fb 100644
>> --- a/kernel/irq/manage.c
>> +++ b/kernel/irq/manage.c
>> @@ -1303,12 +1303,6 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
>>   		if (new->flags & IRQF_ONESHOT)
>>   			desc->istate |= IRQS_ONESHOT;
>>
>> -		if (irq_settings_can_autoenable(desc))
>> -			irq_startup(desc, true);
>> -		else
>> -			/* Undo nested disables: */
>> -			desc->depth = 1;
>> -
>>   		/* Exclude IRQ from balancing if requested */
>>   		if (new->flags & IRQF_NOBALANCING) {
>>   			irq_settings_set_no_balancing(desc);
>> @@ -1318,6 +1312,12 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
>>   		/* Set default affinity mask once everything is setup */
>>   		setup_affinity(desc, mask);
>>
>> +		if (irq_settings_can_autoenable(desc))
>> +			irq_startup(desc, true);
>> +		else
>> +			/* Undo nested disables: */
>> +			desc->depth = 1;
>> +
>>   	} else if (new->flags & IRQF_TRIGGER_MASK) {
>>   		unsigned int nmsk = new->flags & IRQF_TRIGGER_MASK;
>>   		unsigned int omsk = irq_settings_get_trigger_mask(desc);
>>
> 
> This patch can avoid the case I said in this mail.
> But in other case like set affinity by the /proc/irq/x/smp_affinity,
> it's will have the same problem _if_ the hardware is waiting for deactive.

Indeed, and the more I think of it, the more I'm convinced that what you
are looking at is a hardware bug.

If RWP is gated by an interrupt being active, or the disable is gated by
the interrupt being active, then you will end-up in all kind of horrible
problems that software *cannot* solve. The text you quoted earlier is
not something that can be enforced from a software perspective (it
contains a race condition that cannot be avoided).

Furthermore, take the example of a VM that has an active interrupt
linked to a physical interrupt (HW bit set in the List Register) while
you are changing the affinity on the host. Nothing guarantees that the
deactivate will *ever* happen (the guest could decide it doesn't need to
do it, or takes a very long time to do so). In that case, you will
deadlock the same way, and there is nothing you can do from a SW PoV to
solve it.

I suggest you get back to your HW folks, and explain that what they have
here is not a viable design.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-01-28  8:50 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-20  2:38 A problem about SPI Interrupt Configuration Yang Yingliang
2016-01-22  7:57 ` Marc Zyngier
2016-01-28  2:32   ` Yang Yingliang
2016-01-28  8:50     ` Marc Zyngier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).