* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
[not found] <4BF30793.5070300@us.ibm.com>
@ 2010-05-18 21:52 ` Brian King
2010-05-18 22:19 ` Nivedita Singhvi
2010-05-18 22:22 ` Darren Hart
0 siblings, 2 replies; 23+ messages in thread
From: Brian King @ 2010-05-18 21:52 UTC (permalink / raw)
To: dvhltc
Cc: Jan-Bernd Themann, linux-kernel, Will Schmidt, niv,
Thomas Gleixner, Doug Maxey, linuxppc-dev, Michael Ellerman
Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline...
-Brian
On 05/18/2010 04:33 PM, dvhltc@linux.vnet.ibm.com wrote:
> From ad81664794e33d785f533c5edee37aaba20dd92d Mon Sep 17 00:00:00 2001
> From: Darren Hart <dvhltc@us.ibm.com>
> Date: Tue, 18 May 2010 11:07:13 -0700
> Subject: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
>
> The underlying hardware is edge triggered but presented by XICS as level
> triggered. The edge triggered interrupts are not reissued after masking. This
> is not a problem in mainline which does not mask the interrupt (relying on the
> EOI mechanism instead). The threaded interrupts in PREEMPT_RT do mask the
> interrupt, and can lose interrupts that occurred while masked, resulting in a
> hung ethernet interface.
>
> The receive handler simply calls napi_schedule(), as such, there is no
> significant additional overhead in making this non-threaded, since we either
> wakeup the threaded irq handler to call napi_schedule(), or just call
> napi_schedule() directly to wakeup the softirqs. As the receive handler is
> lockless, there is no need to convert any of the ehea spinlock_t's to
> atomic_spinlock_t's.
>
> Without this patch, a simple scp file copy loop would fail quickly (usually
> seconds). We have over two hours of sustained scp activity with the patch
> applied.
>
> Credit goes to Will Schmidt for lots of instrumentation and tracing which
> clarified the scenario and to Thomas Gleixner for the incredibly simple
> solution.
>
> Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
> Acked-by: Will Schmidt <will_schmidt@vnet.ibm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Jan-Bernd Themann <themann@de.ibm.com>
> Cc: Nivedita Singhvi <niv@us.ibm.com>
> Cc: Brian King <bjking1@us.ibm.com>
> Cc: Michael Ellerman <ellerman@au1.ibm.com>
> Cc: Doug Maxey <doug.maxey@us.ibm.com>
> ---
> drivers/net/ehea/ehea_main.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
> index 977c3d3..2c53df2 100644
> --- a/drivers/net/ehea/ehea_main.c
> +++ b/drivers/net/ehea/ehea_main.c
> @@ -1263,7 +1263,7 @@ static int ehea_reg_interrupts(struct net_device *dev)
> "%s-queue%d", dev->name, i);
> ret = ibmebus_request_irq(pr->eq->attr.ist1,
> ehea_recv_irq_handler,
> - IRQF_DISABLED, pr->int_send_name,
> + IRQF_DISABLED | IRQF_NODELAY, pr->int_send_name,
> pr);
> if (ret) {
> ehea_error("failed registering irq for ehea_queue "
--
Brian King
Linux on Power Virtualization
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-18 21:52 ` [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) Brian King
@ 2010-05-18 22:19 ` Nivedita Singhvi
2010-05-18 22:22 ` Darren Hart
1 sibling, 0 replies; 23+ messages in thread
From: Nivedita Singhvi @ 2010-05-18 22:19 UTC (permalink / raw)
To: Brian King
Cc: Jan-Bernd Themann, dvhltc, linux-kernel, Will Schmidt,
Thomas Gleixner, Doug Maxey, linuxppc-dev, Michael Ellerman
Brian King wrote:
> Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline...
Yep, this is RT only.
thanks,
Nivedita
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-18 21:52 ` [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) Brian King
2010-05-18 22:19 ` Nivedita Singhvi
@ 2010-05-18 22:22 ` Darren Hart
2010-05-19 1:25 ` Michael Ellerman
1 sibling, 1 reply; 23+ messages in thread
From: Darren Hart @ 2010-05-18 22:22 UTC (permalink / raw)
To: Brian King
Cc: Jan-Bernd Themann, dvhltc, linux-kernel, Will Schmidt, niv,
Thomas Gleixner, Doug Maxey, linuxppc-dev, Michael Ellerman
On 05/18/2010 02:52 PM, Brian King wrote:
> Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline...
Yes, it basically says "don't make this handler threaded".
--
Darren
>
> -Brian
>
>
> On 05/18/2010 04:33 PM, dvhltc@linux.vnet.ibm.com wrote:
>> From ad81664794e33d785f533c5edee37aaba20dd92d Mon Sep 17 00:00:00 2001
>> From: Darren Hart<dvhltc@us.ibm.com>
>> Date: Tue, 18 May 2010 11:07:13 -0700
>> Subject: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
>>
>> The underlying hardware is edge triggered but presented by XICS as level
>> triggered. The edge triggered interrupts are not reissued after masking. This
>> is not a problem in mainline which does not mask the interrupt (relying on the
>> EOI mechanism instead). The threaded interrupts in PREEMPT_RT do mask the
>> interrupt, and can lose interrupts that occurred while masked, resulting in a
>> hung ethernet interface.
>>
>> The receive handler simply calls napi_schedule(), as such, there is no
>> significant additional overhead in making this non-threaded, since we either
>> wakeup the threaded irq handler to call napi_schedule(), or just call
>> napi_schedule() directly to wakeup the softirqs. As the receive handler is
>> lockless, there is no need to convert any of the ehea spinlock_t's to
>> atomic_spinlock_t's.
>>
>> Without this patch, a simple scp file copy loop would fail quickly (usually
>> seconds). We have over two hours of sustained scp activity with the patch
>> applied.
>>
>> Credit goes to Will Schmidt for lots of instrumentation and tracing which
>> clarified the scenario and to Thomas Gleixner for the incredibly simple
>> solution.
>>
>> Signed-off-by: Darren Hart<dvhltc@us.ibm.com>
>> Acked-by: Will Schmidt<will_schmidt@vnet.ibm.com>
>> Cc: Thomas Gleixner<tglx@linutronix.de>
>> Cc: Jan-Bernd Themann<themann@de.ibm.com>
>> Cc: Nivedita Singhvi<niv@us.ibm.com>
>> Cc: Brian King<bjking1@us.ibm.com>
>> Cc: Michael Ellerman<ellerman@au1.ibm.com>
>> Cc: Doug Maxey<doug.maxey@us.ibm.com>
>> ---
>> drivers/net/ehea/ehea_main.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
>> index 977c3d3..2c53df2 100644
>> --- a/drivers/net/ehea/ehea_main.c
>> +++ b/drivers/net/ehea/ehea_main.c
>> @@ -1263,7 +1263,7 @@ static int ehea_reg_interrupts(struct net_device *dev)
>> "%s-queue%d", dev->name, i);
>> ret = ibmebus_request_irq(pr->eq->attr.ist1,
>> ehea_recv_irq_handler,
>> - IRQF_DISABLED, pr->int_send_name,
>> + IRQF_DISABLED | IRQF_NODELAY, pr->int_send_name,
>> pr);
>> if (ret) {
>> ehea_error("failed registering irq for ehea_queue "
>
>
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-18 22:22 ` Darren Hart
@ 2010-05-19 1:25 ` Michael Ellerman
2010-05-19 14:16 ` Darren Hart
0 siblings, 1 reply; 23+ messages in thread
From: Michael Ellerman @ 2010-05-19 1:25 UTC (permalink / raw)
To: Darren Hart
Cc: Jan-Bernd Themann, dvhltc, linux-kernel, Will Schmidt, Brian King,
niv, Thomas Gleixner, Doug Maxey, linuxppc-dev
On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote:
> On 05/18/2010 02:52 PM, Brian King wrote:
> > Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline...
>
> Yes, it basically says "don't make this handler threaded".
That is a good fix for EHEA, but the threaded handling is still broken
for anything else that is edge triggered isn't it?
The result of the discussion about two years ago on this was that we
needed a custom flow handler for XICS on RT.
Apart from the issue of losing interrupts there is also the fact that
masking on the XICS requires an RTAS call which takes a global lock.
cheers
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-19 1:25 ` Michael Ellerman
@ 2010-05-19 14:16 ` Darren Hart
2010-05-19 14:38 ` Thomas Gleixner
2010-05-20 1:28 ` Michael Ellerman
0 siblings, 2 replies; 23+ messages in thread
From: Darren Hart @ 2010-05-19 14:16 UTC (permalink / raw)
To: michael
Cc: Jan-Bernd Themann, dvhltc, linux-kernel, Will Schmidt, Brian King,
niv, Thomas Gleixner, Doug Maxey, linuxppc-dev
On 05/18/2010 06:25 PM, Michael Ellerman wrote:
> On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote:
>> On 05/18/2010 02:52 PM, Brian King wrote:
>>> Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline...
>>
>> Yes, it basically says "don't make this handler threaded".
>
> That is a good fix for EHEA, but the threaded handling is still broken
> for anything else that is edge triggered isn't it?
No, I don't believe so. Edge triggered interrupts that are reported as
edge triggered interrupts will use the edge handler (which was the
approach Sebastien took to make this work back in 2008). Since XICS
presents all interrupts as Level Triggered, they use the fasteoi path.
>
> The result of the discussion about two years ago on this was that we
> needed a custom flow handler for XICS on RT.
I'm still not clear on why the ultimate solution wasn't to have XICS
report edge triggered as edge triggered. Probably some complexity of the
entire power stack that I am ignorant of.
> Apart from the issue of losing interrupts there is also the fact that
> masking on the XICS requires an RTAS call which takes a global lock.
Right, one of many reasons why we felt this was the right fix. The other
is that there is no real additional overhead in running this as
non-threaded since the receive handler is so short (just napi_schedule()).
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-19 14:16 ` Darren Hart
@ 2010-05-19 14:38 ` Thomas Gleixner
2010-05-19 21:08 ` Thomas Gleixner
2010-05-20 1:32 ` Michael Ellerman
2010-05-20 1:28 ` Michael Ellerman
1 sibling, 2 replies; 23+ messages in thread
From: Thomas Gleixner @ 2010-05-19 14:38 UTC (permalink / raw)
To: Darren Hart
Cc: Jan-Bernd Themann, dvhltc, linux-kernel, Will Schmidt, Brian King,
niv, Doug Maxey, linuxppc-dev
On Wed, 19 May 2010, Darren Hart wrote:
> On 05/18/2010 06:25 PM, Michael Ellerman wrote:
> > On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote:
> > > On 05/18/2010 02:52 PM, Brian King wrote:
> > > > Is IRQF_NODELAY something specific to the RT kernel? I don't see it in
> > > > mainline...
> > >
> > > Yes, it basically says "don't make this handler threaded".
> >
> > That is a good fix for EHEA, but the threaded handling is still broken
> > for anything else that is edge triggered isn't it?
>
> No, I don't believe so. Edge triggered interrupts that are reported as edge
> triggered interrupts will use the edge handler (which was the approach
> Sebastien took to make this work back in 2008). Since XICS presents all
> interrupts as Level Triggered, they use the fasteoi path.
I wonder whether the XICS interrupts which are edge type can be
identified from the irq_desc->flags. Then we could avoid the masking
for those in the fasteoi_handler in general.
> >
> > The result of the discussion about two years ago on this was that we
> > needed a custom flow handler for XICS on RT.
>
> I'm still not clear on why the ultimate solution wasn't to have XICS report
> edge triggered as edge triggered. Probably some complexity of the entire power
> stack that I am ignorant of.
>
> > Apart from the issue of losing interrupts there is also the fact that
> > masking on the XICS requires an RTAS call which takes a global lock.
Right, I'd love to avoid that but with real level interrupts we'd run
into an interrupt storm. Though another solution would be to issue the
EOI after the threaded handler finished, that'd work as well, but
needs testing.
> Right, one of many reasons why we felt this was the right fix. The other is
> that there is no real additional overhead in running this as non-threaded
> since the receive handler is so short (just napi_schedule()).
Yes, in the case at hand it's the right thing to do, as we avoid
another wakeup/context switch.
Thanks,
tglx
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-19 14:38 ` Thomas Gleixner
@ 2010-05-19 21:08 ` Thomas Gleixner
2010-05-20 1:34 ` Michael Ellerman
2010-05-20 1:32 ` Michael Ellerman
1 sibling, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2010-05-19 21:08 UTC (permalink / raw)
To: Darren Hart
Cc: Jan-Bernd Themann, dvhltc, linux-kernel, Will Schmidt, Brian King,
niv, Doug Maxey, linuxppc-dev
On Wed, 19 May 2010, Thomas Gleixner wrote:
> > I'm still not clear on why the ultimate solution wasn't to have XICS report
> > edge triggered as edge triggered. Probably some complexity of the entire power
> > stack that I am ignorant of.
> >
> > > > Apart from the issue of losing interrupts there is also the fact that
> > > masking on the XICS requires an RTAS call which takes a global lock.
>
> Right, I'd love to avoid that but with real level interrupts we'd run
> into an interrupt storm. Though another solution would be to issue the
> EOI after the threaded handler finished, that'd work as well, but
> needs testing.
Thought more about that. The case at hand (ehea) is nasty:
The driver does _NOT_ disable the rx interrupt in the card in the rx
interrupt handler - for whatever reason.
So even in mainline you get repeated rx interrupts when packets
arrive while napi is processing the poll, which is suboptimal at
least. In fact it is counterproductive as the whole purpose of NAPI
is to _NOT_ get interrupts for consecutive incoming packets while the
poll is active.
Most of the other network drivers do:
rx_irq()
disable rx interrupts on card
napi_schedule()
Now when the napi poll is done (no more packets available) then the
driver reenables the rx interrupt on the card.
Thanks,
tglx
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-19 14:16 ` Darren Hart
2010-05-19 14:38 ` Thomas Gleixner
@ 2010-05-20 1:28 ` Michael Ellerman
2010-05-21 9:02 ` Milton Miller
1 sibling, 1 reply; 23+ messages in thread
From: Michael Ellerman @ 2010-05-20 1:28 UTC (permalink / raw)
To: Darren Hart
Cc: Jan-Bernd Themann, dvhltc, linux-kernel, Milton Miller,
Will Schmidt, Brian King, niv, Thomas Gleixner, Doug Maxey,
linuxppc-dev
On Wed, 2010-05-19 at 07:16 -0700, Darren Hart wrote:
> On 05/18/2010 06:25 PM, Michael Ellerman wrote:
> > On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote:
> >> On 05/18/2010 02:52 PM, Brian King wrote:
> >>> Is IRQF_NODELAY something specific to the RT kernel? I don't see it in mainline...
> >>
> >> Yes, it basically says "don't make this handler threaded".
> >
> > That is a good fix for EHEA, but the threaded handling is still broken
> > for anything else that is edge triggered isn't it?
>
> No, I don't believe so. Edge triggered interrupts that are reported as
> edge triggered interrupts will use the edge handler (which was the
> approach Sebastien took to make this work back in 2008). Since XICS
> presents all interrupts as Level Triggered, they use the fasteoi path.
But that's the point, no interrupts on XICS are reported as edge, even
if they are actually edge somewhere deep in the hardware. I don't think
we have any reliable way to determine what is what.
> > The result of the discussion about two years ago on this was that we
> > needed a custom flow handler for XICS on RT.
>
> I'm still not clear on why the ultimate solution wasn't to have XICS
> report edge triggered as edge triggered. Probably some complexity of the
> entire power stack that I am ignorant of.
I'm not really sure either, but I think it's a case of a leaky
abstraction on the part of the hypervisor. Edge interrupts behave as
level as long as you handle the irq before EOI, but if you mask they
don't. But Milton's the expert on that.
> > Apart from the issue of losing interrupts there is also the fact that
> > masking on the XICS requires an RTAS call which takes a global lock.
>
> Right, one of many reasons why we felt this was the right fix. The other
> is that there is no real additional overhead in running this as
> non-threaded since the receive handler is so short (just napi_schedule()).
True. It's not a fix in general though. I'm worried that we're going to
see the exact same bug for MSI(-X) interrupts.
cheers
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-19 14:38 ` Thomas Gleixner
2010-05-19 21:08 ` Thomas Gleixner
@ 2010-05-20 1:32 ` Michael Ellerman
2010-05-20 8:21 ` Thomas Gleixner
1 sibling, 1 reply; 23+ messages in thread
From: Michael Ellerman @ 2010-05-20 1:32 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Darren Hart, Jan-Bernd Themann, dvhltc, linux-kernel,
Milton Miller, Will Schmidt, Brian King, niv, Doug Maxey,
linuxppc-dev
On Wed, 2010-05-19 at 16:38 +0200, Thomas Gleixner wrote:
> On Wed, 19 May 2010, Darren Hart wrote:
>
> > On 05/18/2010 06:25 PM, Michael Ellerman wrote:
> > > On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote:
> > > > On 05/18/2010 02:52 PM, Brian King wrote:
> > > > > Is IRQF_NODELAY something specific to the RT kernel? I don't see it in
> > > > > mainline...
> > > >
> > > > Yes, it basically says "don't make this handler threaded".
> > >
> > > That is a good fix for EHEA, but the threaded handling is still broken
> > > for anything else that is edge triggered isn't it?
> >
> > No, I don't believe so. Edge triggered interrupts that are reported as edge
> > triggered interrupts will use the edge handler (which was the approach
> > Sebastien took to make this work back in 2008). Since XICS presents all
> > interrupts as Level Triggered, they use the fasteoi path.
>
> I wonder whether the XICS interrupts which are edge type can be
> identified from the irq_desc->flags. Then we could avoid the masking
> for those in the fasteoi_handler in general.
I'm not sure they can be. I know on other similar HW we can detect LSI
vs MSI, but that's without the HV in the equation.
> > > The result of the discussion about two years ago on this was that we
> > > needed a custom flow handler for XICS on RT.
> >
> > I'm still not clear on why the ultimate solution wasn't to have XICS report
> > edge triggered as edge triggered. Probably some complexity of the entire power
> > stack that I am ignorant of.
> >
> > > > Apart from the issue of losing interrupts there is also the fact that
> > > masking on the XICS requires an RTAS call which takes a global lock.
>
> Right, I'd love to avoid that but with real level interrupts we'd run
> into an interrupt storm. Though another solution would be to issue the
> EOI after the threaded handler finished, that'd work as well, but
> needs testing.
Yeah I think that was the idea for the custom flow handler. We'd reset
the processor priority so we can take other interrupts (which the EOI
usually does for you), then do the actual EOI after the handler
finished.
cheers
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-19 21:08 ` Thomas Gleixner
@ 2010-05-20 1:34 ` Michael Ellerman
2010-05-20 7:37 ` Jan-Bernd Themann
0 siblings, 1 reply; 23+ messages in thread
From: Michael Ellerman @ 2010-05-20 1:34 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Darren Hart, Jan-Bernd Themann, dvhltc, linux-kernel,
Will Schmidt, Brian King, niv, Doug Maxey, linuxppc-dev
On Wed, 2010-05-19 at 23:08 +0200, Thomas Gleixner wrote:
> On Wed, 19 May 2010, Thomas Gleixner wrote:
> > > I'm still not clear on why the ultimate solution wasn't to have XICS report
> > > edge triggered as edge triggered. Probably some complexity of the entire power
> > > stack that I am ignorant of.
> > >
> > > > Apart from the issue of losing interrupts there is also the fact that
> > > > masking on the XICS requires an RTAS call which takes a global lock.
> >
> > Right, I'd love to avoid that but with real level interrupts we'd run
> > into an interrupt storm. Though another solution would be to issue the
> > EOI after the threaded handler finished, that'd work as well, but
> > needs testing.
>
> Thought more about that. The case at hand (ehea) is nasty:
>
> The driver does _NOT_ disable the rx interrupt in the card in the rx
> interrupt handler - for whatever reason.
Yeah I saw that, but I don't know why it's written that way. Perhaps
Jan-Bernd or Doug will chime in and enlighten us? :)
cheers
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 1:34 ` Michael Ellerman
@ 2010-05-20 7:37 ` Jan-Bernd Themann
2010-05-20 8:14 ` Thomas Gleixner
0 siblings, 1 reply; 23+ messages in thread
From: Jan-Bernd Themann @ 2010-05-20 7:37 UTC (permalink / raw)
To: michael
Cc: Darren Hart, dvhltc, linux-kernel, Will Schmidt, Brian King, niv,
Thomas Gleixner, Doug Maxey, linuxppc-dev
Hi,
Michael Ellerman <michael@ellerman.id.au> wrote on 20.05.2010 03:34:08:
> On Wed, 2010-05-19 at 23:08 +0200, Thomas Gleixner wrote:
> > On Wed, 19 May 2010, Thomas Gleixner wrote:
> > > > I'm still not clear on why the ultimate solution wasn't to have XICS
> > > > report edge triggered as edge triggered. Probably some complexity of
> > > > the entire power stack that I am ignorant of.
> > > >
> > > > > Apart from the issue of losing interrupts there is also the fact that
> > > > > masking on the XICS requires an RTAS call which takes a global lock.
> > >
> > > Right, I'd love to avoid that but with real level interrupts we'd run
> > > into an interrupt storm. Though another solution would be to issue the
> > > EOI after the threaded handler finished, that'd work as well, but
> > > needs testing.
> >
> > Thought more about that. The case at hand (ehea) is nasty:
> >
> > The driver does _NOT_ disable the rx interrupt in the card in the rx
> > interrupt handler - for whatever reason.
>
> Yeah I saw that, but I don't know why it's written that way. Perhaps
> Jan-Bernd or Doug will chime in and enlighten us? :)
From our perspective there is no need to disable interrupts for the RX side
as the chip does not fire further interrupts until we tell the chip to do so
for a particular queue. We have multiple receive queues, each with its own
interrupt, so that interrupts can arrive on multiple CPUs in parallel.
Interrupts are enabled again when we leave the NAPI poll function for the
corresponding receive queue.
Michael, does this answer your question?
Regards,
Jan-Bernd
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 7:37 ` Jan-Bernd Themann
@ 2010-05-20 8:14 ` Thomas Gleixner
2010-05-20 9:05 ` Jan-Bernd Themann
2010-05-20 14:39 ` Darren Hart
0 siblings, 2 replies; 23+ messages in thread
From: Thomas Gleixner @ 2010-05-20 8:14 UTC (permalink / raw)
To: Jan-Bernd Themann
Cc: Darren Hart, dvhltc, linux-kernel, Will Schmidt, Brian King, niv,
Doug Maxey, linuxppc-dev
On Thu, 20 May 2010, Jan-Bernd Themann wrote:
> > > Thought more about that. The case at hand (ehea) is nasty:
> > >
> > > The driver does _NOT_ disable the rx interrupt in the card in the rx
> > > interrupt handler - for whatever reason.
> >
> > Yeah I saw that, but I don't know why it's written that way. Perhaps
> > Jan-Bernd or Doug will chime in and enlighten us? :)
>
> From our perspective there is no need to disable interrupts for the
> RX side as the chip does not fire further interrupts until we tell
> the chip to do so for a particular queue. We have multiple receive
The traces tell a different story though:
ehea_recv_irq_handler()
napi_reschedule()
eoi()
ehea_poll()
...
ehea_recv_irq_handler() <---------------- ???
napi_reschedule()
...
napi_complete()
Can't tell whether you can see the same behaviour in mainline, but I
don't see a reason why not.
> queues with an own interrupt each so that the interrupts can arrive
> on multiple CPUs in parallel. Interrupts are enabled again when we
> leave the NAPI Poll function for the corresponding receive queue.
I can't see a piece of code which does that, but that's probably just
lack of detailed hardware knowledge on my side.
Thanks,
tglx
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 1:32 ` Michael Ellerman
@ 2010-05-20 8:21 ` Thomas Gleixner
2010-05-21 9:18 ` [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY), " Milton Miller
0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2010-05-20 8:21 UTC (permalink / raw)
To: Michael Ellerman
Cc: Darren Hart, Jan-Bernd Themann, dvhltc, linux-kernel,
Milton Miller, Will Schmidt, Brian King, niv, Doug Maxey,
linuxppc-dev
On Thu, 20 May 2010, Michael Ellerman wrote:
> On Wed, 2010-05-19 at 16:38 +0200, Thomas Gleixner wrote:
> > On Wed, 19 May 2010, Darren Hart wrote:
> >
> > > On 05/18/2010 06:25 PM, Michael Ellerman wrote:
> > > > On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote:
>
> > > > The result of the discussion about two years ago on this was that we
> > > > needed a custom flow handler for XICS on RT.
> > >
> > > I'm still not clear on why the ultimate solution wasn't to have XICS report
> > > edge triggered as edge triggered. Probably some complexity of the entire power
> > > stack that I am ignorant of.
> > >
> > > > Apart from the issue of loosing interrupts there is also the fact that
> > > > masking on the XICS requires an RTAS call which takes a global lock.
> >
> > Right, I'd love to avoid that but with real level interrupts we'd run
> > into an interrupt storm. Though another solution would be to issue the
> > EOI after the threaded handler finished, that'd work as well, but
> > needs testing.
>
> Yeah I think that was the idea for the custom flow handler. We'd reset
> the processor priority so we can take other interrupts (which the EOI
> usually does for you), then do the actual EOI after the handler
> finished.
That only works when the card does not issue new interrupts until the
EOI happens. If the EOI is only relevant for the interrupt controller,
then you are going to lose any edge which comes in before the EOI as
well.
Thanks,
tglx
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 8:14 ` Thomas Gleixner
@ 2010-05-20 9:05 ` Jan-Bernd Themann
2010-05-20 9:19 ` Thomas Gleixner
2010-05-20 14:53 ` Will Schmidt
2010-05-20 14:39 ` Darren Hart
1 sibling, 2 replies; 23+ messages in thread
From: Jan-Bernd Themann @ 2010-05-20 9:05 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Darren Hart, dvhltc, linux-kernel, Will Schmidt, Brian King, niv,
Doug Maxey, linuxppc-dev
Hi Thomas
> On Thu, 20 May 2010, Jan-Bernd Themann wrote:
> > > > Thought more about that. The case at hand (ehea) is nasty:
> > > >
> > > > > The driver does _NOT_ disable the rx interrupt in the card in the rx
> > > > > interrupt handler - for whatever reason.
> > >
> > > Yeah I saw that, but I don't know why it's written that way. Perhaps
> > > Jan-Bernd or Doug will chime in and enlighten us? :)
> >
> > From our perspective there is no need to disable interrupts for the
> > RX side as the chip does not fire further interrupts until we tell
> > the chip to do so for a particular queue. We have multiple receive
>
> The traces tell a different story though:
>
> ehea_recv_irq_handler()
> napi_reschedule()
> eoi()
> ehea_poll()
> ...
> ehea_recv_irq_handler() <---------------- ???
> napi_reschedule()
> ...
> napi_complete()
>
> Can't tell whether you can see the same behaviour in mainline, but I
> don't see a reason why not.
Is this the same interrupt we are seeing here, or do we see a second, other
interrupt popping up on the same CPU? As I said, with multiple receive queues
(if enabled) you can have multiple interrupts in parallel.
Please check if multiple queues are enabled. The following module parameter
is used for that:
MODULE_PARM_DESC(use_mcs, " 0:NAPI, 1:Multiple receive queues, Default = 0 ");
you should also see the number of used HEA interrupts in /proc/interrupts
>
> > queues with an own interrupt each so that the interrupts can arrive
> > on multiple CPUs in parallel. Interrupts are enabled again when we
> > leave the NAPI Poll function for the corresponding receive queue.
>
> I can't see a piece of code which does that, but that's probably just
> lack of detailed hardware knowledge on my side.
If you mean the "re-enable" piece of code, it is not very obvious, you are
right. Interrupts are only generated when a particular register for our
completion queues is written. We do this in the following lines:
ehea_reset_cq_ep(pr->recv_cq);
ehea_reset_cq_ep(pr->send_cq);
ehea_reset_cq_n1(pr->recv_cq);
ehea_reset_cq_n1(pr->send_cq);
So this is, in a way, an indirect way to ask for interrupts when new
completions were written to memory. We don't really disable/enable
interrupts on the HEA chip itself.
I think there are some mechanisms built into the HEA chip that should
prevent interrupts from getting lost. But that is something that is / was
completely hidden from us, so my skill is very limited there.
If more details are needed here we should involve the PHYP guys + eHEA HW
guys, if not already done. Did anyone already talk to them?
Regards,
Jan-Bernd
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 9:05 ` Jan-Bernd Themann
@ 2010-05-20 9:19 ` Thomas Gleixner
2010-05-20 14:26 ` Nivedita Singhvi
2010-05-20 14:53 ` Will Schmidt
1 sibling, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2010-05-20 9:19 UTC (permalink / raw)
To: Jan-Bernd Themann
Cc: Darren Hart, dvhltc, linux-kernel, Will Schmidt, Brian King, niv,
Doug Maxey, linuxppc-dev
Jan-Bernd,
On Thu, 20 May 2010, Jan-Bernd Themann wrote:
>
> Hi Thomas
>
> > Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
> >
> > On Thu, 20 May 2010, Jan-Bernd Themann wrote:
> > > > > Thought more about that. The case at hand (ehea) is nasty:
> > > > >
> > > > > The driver does _NOT_ disable the rx interrupt in the card in the rx
> > > > > interrupt handler - for whatever reason.
> > > >
> > > > Yeah I saw that, but I don't know why it's written that way. Perhaps
> > > > Jan-Bernd or Doug will chime in and enlighten us? :)
> > >
> > > From our perspective there is no need to disable interrupts for the
> > > RX side as the chip does not fire further interrupts until we tell
> > > the chip to do so for a particular queue. We have multiple receive
> >
> > The traces tell a different story though:
> >
> > ehea_recv_irq_handler()
> > napi_reschedule()
> > eoi()
> > ehea_poll()
> > ...
> > ehea_recv_irq_handler() <---------------- ???
> > napi_reschedule()
> > ...
> > napi_complete()
> >
> > Can't tell whether you can see the same behaviour in mainline, but I
> > don't see a reason why not.
>
> Is this the same interrupt we are seeing here, or do we see a second,
> different interrupt popping up on the same CPU? As I said, with multiple
> receive queues (if enabled) you can have multiple interrupts in parallel.
According to the traces it's the very same interrupt number.
> Please check if multiple queues are enabled. The following module parameter
> is used for that:
>
> MODULE_PARM_DESC(use_mcs, " 0:NAPI, 1:Multiple receive queues, Default = 0 ");
>
> you should also see the number of used HEA interrupts in /proc/interrupts
I leave that for Will and Darren, they have the hardware :)
> >
> > > queues, each with its own interrupt, so that the interrupts can arrive
> > > on multiple CPUs in parallel. Interrupts are enabled again when we
> > > leave the NAPI Poll function for the corresponding receive queue.
> >
> > I can't see a piece of code which does that, but that's probably just
> > lack of detailed hardware knowledge on my side.
>
> If you mean the "re-enable" piece of code, it is not very obvious,
> you are right. Interrupts are only generated if a particular
> register for our completion queues is written. We do this in the
> following line:
>
> ehea_reset_cq_ep(pr->recv_cq);
> ehea_reset_cq_ep(pr->send_cq);
> ehea_reset_cq_n1(pr->recv_cq);
> ehea_reset_cq_n1(pr->send_cq);
>
> So this is in a way an indirect way to ask for interrupts when new
> completions were written to memory. We don't really disable/enable
> interrupts on the HEA chip itself.
Ah, ok. That's after the napi_complete which looks correct.
> I think there are some mechanisms built into the HEA chip that should
> prevent interrupts from getting lost. But that is something that
> is / was completely hidden from us, so my skill is very limited
> there.
>
> If more details are needed here we should involve the PHYP guys +
> eHEA HW guys if not already done. Did anyone already talk to them?
Will or Darren might have, but lets gather more information first
before we rack their nerves :)
Thanks,
tglx
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 9:19 ` Thomas Gleixner
@ 2010-05-20 14:26 ` Nivedita Singhvi
0 siblings, 0 replies; 23+ messages in thread
From: Nivedita Singhvi @ 2010-05-20 14:26 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Darren Hart, Jan-Bernd Themann, dvhltc, linux-kernel,
Will Schmidt, Brian King, niv, Doug Maxey, linuxppc-dev
Thomas Gleixner wrote:
>> Please check if multiple queues are enabled. The following module parameter
>> is used for that:
>>
>> MODULE_PARM_DESC(use_mcs, " 0:NAPI, 1:Multiple receive queues, Default = 0 ");
>>
>> you should also see the number of used HEA interrupts in /proc/interrupts
>
> I leave that for Will and Darren, they have the hardware :)
16: 477477 ... XICS Level IPI
17: 129 ... XICS Level hvc_console
18: 0 ... XICS Level RAS_EPOW
33: 139232 ... XICS Level mlx4_core
256: 3 ... XICS Level ehea_neq
259: 0 ... XICS Level eth0-aff
260: 2082153 ... XICS Level eth0-queue0
289: 119166 ... XICS Level ipr
305: 0 ... XICS Level ohci_hcd:usb2
306: 0 ... XICS Level ohci_hcd:usb3
307: 2389839 ... XICS Level ehci_hcd:usb1
Nope, multiple rx queues not enabled.
thanks,
Nivedita
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 8:14 ` Thomas Gleixner
2010-05-20 9:05 ` Jan-Bernd Themann
@ 2010-05-20 14:39 ` Darren Hart
2010-05-20 14:45 ` Thomas Gleixner
1 sibling, 1 reply; 23+ messages in thread
From: Darren Hart @ 2010-05-20 14:39 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Jan-Bernd Themann, dvhltc, linux-kernel, Will Schmidt, Brian King,
niv, Doug Maxey, linuxppc-dev
On 05/20/2010 01:14 AM, Thomas Gleixner wrote:
> On Thu, 20 May 2010, Jan-Bernd Themann wrote:
>>>> Thought more about that. The case at hand (ehea) is nasty:
>>>>
>>>> The driver does _NOT_ disable the rx interrupt in the card in the rx
>>>> interrupt handler - for whatever reason.
>>>
>>> Yeah I saw that, but I don't know why it's written that way. Perhaps
>>> Jan-Bernd or Doug will chime in and enlighten us? :)
>>
>> From our perspective there is no need to disable interrupts for the
>> RX side as the chip does not fire further interrupts until we tell
>> the chip to do so for a particular queue. We have multiple receive
>
> The traces tell a different story though:
>
> ehea_recv_irq_handler()
> napi_reschedule()
> eoi()
> ehea_poll()
> ...
> ehea_recv_irq_handler()<---------------- ???
> napi_reschedule()
> ...
> napi_complete()
>
> Can't tell whether you can see the same behaviour in mainline, but I
> don't see a reason why not.
I was going to suggest that because these are threaded handlers, perhaps
they are rescheduled on a different CPU and then receive the interrupt
for the other CPU/queue that Jan was mentioning.
But the handlers are affined if I remember correctly, and we aren't
running with multiple receive queues. So we're back to the same
question: why are we seeing another IRQ? It comes in before
napi_complete() and therefore before the ehea_reset*() block of calls
which do the equivalent of re-enabling interrupts.
--
Darren
>
>> queues, each with its own interrupt, so that the interrupts can arrive
>> on multiple CPUs in parallel. Interrupts are enabled again when we
>> leave the NAPI Poll function for the corresponding receive queue.
>
> I can't see a piece of code which does that, but that's probably just
> lack of detailed hardware knowledge on my side.
>
> Thanks,
>
> tglx
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 14:39 ` Darren Hart
@ 2010-05-20 14:45 ` Thomas Gleixner
2010-05-20 21:44 ` Will Schmidt
0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2010-05-20 14:45 UTC (permalink / raw)
To: Darren Hart
Cc: Jan-Bernd Themann, dvhltc, linux-kernel, Will Schmidt, Brian King,
niv, Doug Maxey, linuxppc-dev
On Thu, 20 May 2010, Darren Hart wrote:
> On 05/20/2010 01:14 AM, Thomas Gleixner wrote:
> > On Thu, 20 May 2010, Jan-Bernd Themann wrote:
> > > > > Thought more about that. The case at hand (ehea) is nasty:
> > > > >
> > > > > The driver does _NOT_ disable the rx interrupt in the card in the rx
> > > > > interrupt handler - for whatever reason.
> > > >
> > > > Yeah I saw that, but I don't know why it's written that way. Perhaps
> > > > Jan-Bernd or Doug will chime in and enlighten us? :)
> > >
> > > From our perspective there is no need to disable interrupts for the
> > > RX side as the chip does not fire further interrupts until we tell
> > > the chip to do so for a particular queue. We have multiple receive
> >
> > The traces tell a different story though:
> >
> > ehea_recv_irq_handler()
> > napi_reschedule()
> > eoi()
> > ehea_poll()
> > ...
> > ehea_recv_irq_handler()<---------------- ???
> > napi_reschedule()
> > ...
> > napi_complete()
> >
> > Can't tell whether you can see the same behaviour in mainline, but I
> > don't see a reason why not.
>
> I was going to suggest that because these are threaded handlers, perhaps they
> are rescheduled on a different CPU and then receive the interrupt for the
> other CPU/queue that Jan was mentioning.
>
> But the handlers are affined if I remember correctly, and we aren't running
> with multiple receive queues. So we're back to the same question: why are we
> seeing another IRQ? It comes in before napi_complete() and therefore before the
> ehea_reset*() block of calls which do the equivalent of re-enabling
> interrupts.
Can you slap a few trace points into that driver with a stock mainline
kernel and verify that ?
Thanks,
tglx
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 9:05 ` Jan-Bernd Themann
2010-05-20 9:19 ` Thomas Gleixner
@ 2010-05-20 14:53 ` Will Schmidt
1 sibling, 0 replies; 23+ messages in thread
From: Will Schmidt @ 2010-05-20 14:53 UTC (permalink / raw)
To: Jan-Bernd Themann
Cc: Darren Hart, dvhltc, linux-kernel, Brian King, niv,
Thomas Gleixner, Doug Maxey, linuxppc-dev
On Thu, 2010-05-20 at 11:05 +0200, Jan-Bernd Themann wrote:
> Hi Thomas
>
> > Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
> >
> > On Thu, 20 May 2010, Jan-Bernd Themann wrote:
> > > > > Thought more about that. The case at hand (ehea) is nasty:
> > > > >
> > > > > The driver does _NOT_ disable the rx interrupt in the card in the rx
> > > > > interrupt handler - for whatever reason.
> > > >
> > > > Yeah I saw that, but I don't know why it's written that way. Perhaps
> > > > Jan-Bernd or Doug will chime in and enlighten us? :)
> > >
> > > From our perspective there is no need to disable interrupts for the
> > > RX side as the chip does not fire further interrupts until we tell
> > > the chip to do so for a particular queue. We have multiple receive
> >
> > The traces tell a different story though:
> >
> > ehea_recv_irq_handler()
> > napi_reschedule()
> > eoi()
> > ehea_poll()
> > ...
> > ehea_recv_irq_handler() <---------------- ???
> > napi_reschedule()
> > ...
> > napi_complete()
> >
> > Can't tell whether you can see the same behaviour in mainline, but I
> > don't see a reason why not.
>
> Is this the same interrupt we are seeing here, or do we see a second,
> different interrupt popping up on the same CPU? As I said, with multiple
> receive queues (if enabled) you can have multiple interrupts in parallel.
Same interrupt number (260). Per the trace data, the first
ehea_recv_irq_handler (at 117.904525) was on cpu 0, the second (at
117.904689) was on cpu 1.
<...>-2180 [000] 117.904525: .ehea_recv_irq_handler: ENTER 0 c0000000e8bd08b0
<...>-2180 [000] 117.904527: .ehea_recv_irq_handler: napi_reschedule COMpleted c0000000e8bd08b0
<...>-2180 [000] 117.904528: .ehea_recv_irq_handler: EXIT reschedule(1) 1 c0000000e8bd08b0
<...>-2180 [000] 117.904529: .xics_unmask_irq: xics: unmask virq 260 772
<...>-2180 [000] 117.904547: .xics_unmask_irq: xics: unmask virq pre-xive 260 772 0 status:0 ff
<...>-2180 [000] 117.904586: .xics_unmask_irq: xics: unmask virq post-xive 260 772 0 D:11416 status:0 5
<...>-2180 [000] 117.904602: .handle_fasteoi_irq: 260 8004000
<...>-2180 [000] 117.904603: .xics_mask_irq: xics: mask virq 260 772
<...>-2180 [000] 117.904634: .xics_mask_real_irq: xics: before: mask_real 772 status:0 5
<...>-2180 [000] 117.904668: .xics_mask_real_irq: xics: after: mask_real 772 status:0 ff
<...>-2180 [000] 117.904669: .handle_fasteoi_irq: pre-action: 260 8004100
<...>-2180 [000] 117.904671: .handle_fasteoi_irq: post-action: 260 8004100
<...>-2180 [000] 117.904672: .handle_fasteoi_irq: exit. 260 8004000
<...>-7 [000] 117.904681: .ehea_poll: ENTER 1 c0000000e8bd08b0 poll_counter:0 force:0
<...>-7 [000] 117.904683: .ehea_proc_rwqes: ehea_check_cqe 0
<...>-2180 [001] 117.904689: .ehea_recv_irq_handler: ENTER 1 c0000000e8bd08b0
<...>-7 [000] 117.904690: .ehea_proc_rwqes: ehea_check_cqe 0
<...>-2180 [001] 117.904691: .ehea_recv_irq_handler: napi_reschedule inCOMplete c0000000e8bd08b0
<...>-2180 [001] 117.904692: .ehea_recv_irq_handler: EXIT reschedule(0) 1 c0000000e8bd08b0
<...>-2180 [001] 117.904694: .xics_unmask_irq: xics: unmask virq 260 772
<...>-7 [000] 117.904702: .ehea_refill_rq2: ehea_refill_rq2
<...>-7 [000] 117.904703: .ehea_refill_rq_def: ehea_refill_rq_def
<...>-7 [000] 117.904704: .ehea_refill_rq3: ehea_refill_rq3
<...>-7 [000] 117.904705: .ehea_refill_rq_def: ehea_refill_rq_def
<...>-7 [000] 117.904706: .napi_complete: napi_complete: ENTER state: 1 c0000000e8bd08b0
<...>-7 [000] 117.904707: .napi_complete: napi_complete: EXIT state: 0 c0000000e8bd08b0
<...>-7 [000] 117.904710: .ehea_poll: EXIT !cqe rx(2). 0 c0000000e8bd08b0
<...>-2180 [001] 117.904719: .xics_unmask_irq: xics: unmask virq pre-xive 260 772 0 status:0 ff
<...>-2180 [001] 117.904761: .xics_unmask_irq: xics: unmask virq post-xive 260 772 0 D:12705 status:0 5
> Please check if multiple queues are enabled. The following module parameter
> is used for that:
>
> MODULE_PARM_DESC(use_mcs, " 0:NAPI, 1:Multiple receive queues, Default = 0 ");
No module parameters were used, should be plain old defaults.
>
> you should also see the number of used HEA interrupts in /proc/interrupts
>
256: 1 0 0 0 0 0 0 0 XICS Level ehea_neq
259: 0 0 0 0 0 0 0 0 XICS Level eth0-aff
260: 361965 0 0 0 0 0 0 0 XICS Level eth0-queue0
>
> >
> > > queues, each with its own interrupt, so that the interrupts can arrive
> > > on multiple CPUs in parallel. Interrupts are enabled again when we
> > > leave the NAPI Poll function for the corresponding receive queue.
> >
> > I can't see a piece of code which does that, but that's probably just
> > lack of detailed hardware knowledge on my side.
>
> If you mean the "re-enable" piece of code, it is not very obvious, you are
> right. Interrupts are only generated if a particular register for our
> completion queues is written. We do this in the following lines:
>
> ehea_reset_cq_ep(pr->recv_cq);
> ehea_reset_cq_ep(pr->send_cq);
> ehea_reset_cq_n1(pr->recv_cq);
> ehea_reset_cq_n1(pr->send_cq);
>
> So this is, in a way, an indirect way to ask for interrupts when new
> completions were written to memory. We don't really disable/enable
> interrupts on the HEA chip itself.
>
> I think there are some mechanisms built into the HEA chip that should
> prevent interrupts from getting lost. But that is something that is / was
> completely hidden from us, so my skill is very limited there.
>
> If more details are needed here we should involve the PHYP guys + eHEA HW
> guys if not already done. Did anyone already talk to them?
>
> Regards,
> Jan-Bernd
>
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 14:45 ` Thomas Gleixner
@ 2010-05-20 21:44 ` Will Schmidt
0 siblings, 0 replies; 23+ messages in thread
From: Will Schmidt @ 2010-05-20 21:44 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Darren Hart, Jan-Bernd Themann, dvhltc, linux-kernel, Brian King,
niv, Doug Maxey, linuxppc-dev
On Thu, 2010-05-20 at 16:45 +0200, Thomas Gleixner wrote:
> On Thu, 20 May 2010, Darren Hart wrote:
>
> > On 05/20/2010 01:14 AM, Thomas Gleixner wrote:
> > > On Thu, 20 May 2010, Jan-Bernd Themann wrote:
> > > > > > Thought more about that. The case at hand (ehea) is nasty:
> > > > > >
> > > > > > The driver does _NOT_ disable the rx interrupt in the card in the rx
> > > > > > interrupt handler - for whatever reason.
> > > > >
> > > > > Yeah I saw that, but I don't know why it's written that way. Perhaps
> > > > > Jan-Bernd or Doug will chime in and enlighten us? :)
> > > >
> > > > From our perspective there is no need to disable interrupts for the
> > > > RX side as the chip does not fire further interrupts until we tell
> > > > the chip to do so for a particular queue. We have multiple receive
> > >
> > > The traces tell a different story though:
> > >
> > > ehea_recv_irq_handler()
> > > napi_reschedule()
> > > eoi()
> > > ehea_poll()
> > > ...
> > > ehea_recv_irq_handler()<---------------- ???
> > > napi_reschedule()
> > > ...
> > > napi_complete()
> > >
> > > Can't tell whether you can see the same behaviour in mainline, but I
> > > don't see a reason why not.
> >
> > I was going to suggest that because these are threaded handlers, perhaps they
> > are rescheduled on a different CPU and then receive the interrupt for the
> > other CPU/queue that Jan was mentioning.
> >
> > But the handlers are affined if I remember correctly, and we aren't running
> > with multiple receive queues. So we're back to the same question: why are we
> > seeing another IRQ? It comes in before napi_complete() and therefore before the
> > ehea_reset*() block of calls which do the equivalent of re-enabling
> > interrupts.
>
> Can you slap a few trace points into that driver with a stock mainline
> kernel and verify that ?
2.6.33.4 (non-rt kernel) with similar trace_printk hooks in place...
Most data lumps look like so:
<idle>-0 [000] 1097.685337: .handle_fasteoi_irq: ENTER 260 4000
<idle>-0 [000] 1097.685339: .handle_fasteoi_irq: pre-action 260 4100
<idle>-0 [000] 1097.685339: .ehea_recv_irq_handler: ENTER c0000000e8980700
<idle>-0 [000] 1097.685340: .ehea_recv_irq_handler: napi_schedule ... c0000000e8980700
<idle>-0 [000] 1097.685341: .ehea_recv_irq_handler: napi_schedule Calling __napi_schedule ... c0000000e8980700
<idle>-0 [000] 1097.685342: .ehea_recv_irq_handler: EXIT c0000000e8980700
<idle>-0 [000] 1097.685343: .handle_fasteoi_irq: post-action 260 4100
<idle>-0 [000] 1097.685344: .handle_fasteoi_irq: EXIT. 260 4000
<idle>-0 [000] 1097.685346: .ehea_poll: ENTER c0000000e8980700
<idle>-0 [000] 1097.685352: .napi_complete: napi_complete: ENTER c0000000e8980700
<idle>-0 [000] 1097.685352: .napi_complete: napi_complete: EXIT c0000000e8980700
<idle>-0 [000] 1097.685355: .ehea_poll: EXIT !cqe rx(1) c0000000e8980700
But I did see one like this, which shows an ehea_recv_irq_handler ENTER
within an ehea_poll ENTER (which I think is what you were expecting,
or wanted to verify).
<idle>-0 [000] 1097.616261: .handle_fasteoi_irq: ENTER 260 4000
<idle>-0 [000] 1097.616262: .handle_fasteoi_irq: pre-action 260 4100
* <idle>-0 [000] 1097.616263: .ehea_recv_irq_handler: ENTER c0000000e8980700
<idle>-0 [000] 1097.616264: .ehea_recv_irq_handler: napi_schedule ... c0000000e8980700
<idle>-0 [000] 1097.616265: .ehea_recv_irq_handler: napi_schedule Calling __napi_schedule ... c0000000e8980700
<idle>-0 [000] 1097.616265: .ehea_recv_irq_handler: EXIT c0000000e8980700
<idle>-0 [000] 1097.616266: .handle_fasteoi_irq: post-action 260 4100
<idle>-0 [000] 1097.616268: .handle_fasteoi_irq: EXIT. 260 4000
* <idle>-0 [000] 1097.616270: .ehea_poll: ENTER c0000000e8980700
<idle>-0 [000] 1097.616282: .handle_fasteoi_irq: ENTER 260 4000
<idle>-0 [000] 1097.616283: .handle_fasteoi_irq: pre-action 260 4100
* <idle>-0 [000] 1097.616284: .ehea_recv_irq_handler: ENTER c0000000e8980700
<idle>-0 [000] 1097.616285: .ehea_recv_irq_handler: napi_schedule ... c0000000e8980700
<idle>-0 [000] 1097.616286: .ehea_recv_irq_handler: napi_schedule NOT Calling __napi_schedule... c0000000e8980700
<idle>-0 [000] 1097.616286: .ehea_recv_irq_handler: EXIT c0000000e8980700
<idle>-0 [000] 1097.616287: .handle_fasteoi_irq: post-action 260 4100
<idle>-0 [000] 1097.616289: .handle_fasteoi_irq: EXIT. 260 4000
<idle>-0 [000] 1097.616299: .napi_complete: napi_complete: ENTER c0000000e8980700
<idle>-0 [000] 1097.616300: .napi_complete: napi_complete: EXIT c0000000e8980700
<idle>-0 [000] 1097.616302: .ehea_poll: napi_reschedule COMpleted c0000000e8980700
<idle>-0 [000] 1097.616303: .napi_complete: napi_complete: ENTER c0000000e8980700
<idle>-0 [000] 1097.616304: .napi_complete: napi_complete: EXIT c0000000e8980700
* <idle>-0 [000] 1097.616306: .ehea_poll: EXIT !cqe rx(4) c0000000e8980700
Let me know if you want/need more or a variation, etc..
Thanks,
-Will
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 1:28 ` Michael Ellerman
@ 2010-05-21 9:02 ` Milton Miller
0 siblings, 0 replies; 23+ messages in thread
From: Milton Miller @ 2010-05-21 9:02 UTC (permalink / raw)
To: Michael Ellerman
Cc: Darren Hart, Jan-Bernd Themann, dvhltc, linux-kernel,
Milton Miller, Will Schmidt, Brian King, niv, Thomas Gleixner,
Doug Maxey, linuxppc-dev
On Thu May 20 at 11:28:36 EST in 2010, Michael Ellerman wrote:
> On Wed, 2010-05-19 at 07:16 -0700, Darren Hart wrote:
> > On 05/18/2010 06:25 PM, Michael Ellerman wrote:
> > > On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote:
> > > > On 05/18/2010 02:52 PM, Brian King wrote:
> > > > > Is IRQF_NODELAY something specific to the RT kernel?
> > > > > I don't see it in mainline...
> > > > Yes, it basically says "don't make this handler threaded".
> > >
> > > That is a good fix for EHEA, but the threaded handling is still broken
> > > for anything else that is edge triggered isn't it?
> >
> > No, I don't believe so. Edge triggered interrupts that are reported as
> > edge triggered interrupts will use the edge handler (which was the
> > approach Sebastien took to make this work back in 2008). Since XICS
> > presents all interrupts as Level Triggered, they use the fasteoi path.
>
> But that's the point, no interrupts on XICS are reported as edge, even
> if they are actually edge somewhere deep in the hardware. I don't think
> we have any reliable way to determine what is what.
>
The platform doesn't tell us this information. The driver might know,
but we don't need it.
> > > The result of the discussion about two years ago on this was that we
> > > needed a custom flow handler for XICS on RT.
> >
> > I'm still not clear on why the ultimate solution wasn't to have XICS
> > report edge triggered as edge triggered. Probably some complexity of the
> > entire power stack that I am ignorant of.
>
> I'm not really sure either, but I think it's a case of a leaky
> abstraction on the part of the hypervisor. Edge interrupts behave as
> level as long as you handle the irq before EOI, but if you mask they
> don't. But Milton's the expert on that.
>
More like the hardware actually converts them. They are all handled
with the same presentation.
The XICS interrupt system is highly scalable and distributed in
implementation, with multiple priority delivery and unlimited nesting.
First, a few features and description:
The hardware has two bits of storage for every LSI interrupt source in the
system to say that interrupt is idle, pending, or was rejected and will
be retried later. The hardware also stores a destination and delivery
priority, settable by software. The destination can be a specific cpu
thread, or a global distribution queue of all (online) threads (in the
partition). While the hardware used to have 256 priority levels available
(255 usable, one for cpu not interrupted), some bits have been stolen
and today we only guarantee 16 levels are avalabile to the OS (15 for
delivery and one for source disabled / cpu not processing any interrupt).
[The current linux kernel delivers all device interrupts at one level
but IPIs at a higher level. To avoid overflowing the irq stack we don't
allow device interrupts while processing any external interrupt.]
The interrupt presentation layer likewise scales, with a separate instance
for each cpu thread in the system. A single IPI source per thread is
part of this instance; when a cpu wants to interrupt another it writes
the priority of the IPI to that cpu's presentation logic.
When an interrupt is signaled, the hardware checks the state of that
interrupt, and if previously idle it sends an interrupt request with its
source number and priority towards the programmed destination, either a
specific cpu thread or the global queue of all processors in the system.
If that cpu is already handling an interrupt of the same or higher (lower
valued) priority, either the incoming interrupt will be passed to the next
cpu (if the destination was global) or it will be rejected and the ISU will
update its state and try again later. If the cpu had a prior interrupt
pending at a lower priority then the old interrupt will be rejected back
to its ISU instead.
The normal behavior is that a load from a presentation logic register causes
the interrupt source number and previous priority of the cpu to be
delivered to the cpu and the cpu priority to be raised to that of the
incoming interrupt. The external interrupt indication to the cpu is
removed. At this point the presentation hardware forgets all history
of this interrupt. A store to the same register resets the priority of
the cpu (which would naturally be the level before it was interrupted
if it stores the value loaded) and sends an EOI (end of interrupt) to
the interrupt source specified in the write. This resets the two bits
from pending to idle.
The software is allowed to reset the cpu priority to allow other
interrupts of equal (or even lower) priority to be presented independently
of creating the EOI for this source. However, until software creates an
EOI for a specific source, that source will not be presented again (short
of a machine reset). The only rule is you can't raise your priority (which
might have to reject a pending interrupt) when you create (write) the EOI.
A cpu can also change its priority to tell the hardware to reject
this interrupt (possibly re-presenting it to another cpu) if it was really
working at a higher priority and just didn't do the MMIO store to the
interrupt controller (which is slow compared to memory). There is also
a polling register through which you can see what interrupt would be
presented, but it's racy, as a new interrupt could come in, displace that
one, and the first one might be re-presented to another cpu.
To avoid overloading any single cpu, interrupts targeting the global
queue are distributed fairly. Through POWER5 the hardware remembers
the cpu that accepted the previous interrupt and starts considering
the next online cpu. Starting with POWER6, the presentation layer
was distributed to the processor chips (for natural scaling) and the
global queue replaced with a forwarding list. The ISU is told (by the
hypervisor) to start its next presentation search with the next cpu in
the list when it accepts the interrupt from the presentation logic.
When MSI interrupts were added, logic was needed to handle receiving the
trigger store, presenting it, and re-presenting the rejected edge
interrupts when cpus were busy with prior or higher priority interrupts.
So the same state was created for each possible MSI, distributed to the
PCI host bridge logic or other io device like the HEA. These state bits
per MSI convert the incoming store edge trigger into a replayable level,
which will be presented to cpus until one consumes it with the load.
If it gets rejected, it will try again. But unlike an LSI which is still
present from the device, if it gets EOId it waits for a new trigger.
Actually, there is one additional bit in the ISU hardware for MSI sources
that keeps track that an MSI trigger was seen while it is in the pending
state, because the path of the EOI from the interrupt presentation logic to
the ISU is not ordered with the MMIOs from the processor to the PCI bus.
However, if the interrupt is disabled, the hardware will not set this bit
or otherwise remember that it was triggered. The disable is done by setting
the priority to least favored (FF), as that level could never be higher
than any cpu's.
In addition, the OS is not aware of where or how the priority, destination,
and enable are stored. This is hidden via the Run Time Abstraction
Services (RTAS), which is a firmware supplied library for infrequent
calls and is called under a global lock. The platform is not designed
for this to be fast, and the hypervisor couldn't securely give access
to the registers even if the OS knew where they were. (The interrupt
presentation layer is accessed with a fast hypervisor call.)
So, with this description, it should be clear that XICS threaded delivery
in the realtime kernel should use the hardware's implicit masking per
source and never play games disabling the interrupt at the ISU, which
will be racy for edge sources and pure overhead for true level sources.
This was proposed here: http://lkml.org/lkml/2008/9/24/226 .
The threaded interrupt services in mainline assume the initial interrupt
handler will disable the interrupt at the device and therefore does not
call the irq mask and unmask functions.
> > > Apart from the issue of losing interrupts there is also the fact that
> > > masking on the XICS requires an RTAS call which takes a global lock.
> >
> > Right, one of many reasons why we felt this was the right fix. The other
> > is that there is no real additional overhead in running this as
> > non-threaded since the receive handler is so short (just napi_schedule()).
>
> True. It's not a fix in general though. I'm worried that we're going to
> see the exact same bug for MSI(-X) interrupts.
>
> cheers
>
>
and hca and ...
milton
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY), Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-20 8:21 ` Thomas Gleixner
@ 2010-05-21 9:18 ` Milton Miller
2010-09-20 14:26 ` Jan-Bernd Themann
0 siblings, 1 reply; 23+ messages in thread
From: Milton Miller @ 2010-05-21 9:18 UTC (permalink / raw)
To: Thomas Gleixner, Jan-Bernd Themann
Cc: Darren Hart, dvhltc, linux-kernel, Milton Miller, Will Schmidt,
Brian King, niv, Doug Maxey, linuxppc-dev
On Thu, 20 May 2010 at 10:21:36 +0200 (CEST) Thomas Gleixner wrote:
> On Thu, 20 May 2010, Michael Ellerman wrote:
> > On Wed, 2010-05-19 at 16:38 +0200, Thomas Gleixner wrote:
> > > On Wed, 19 May 2010, Darren Hart wrote:
> > >
> > > > On 05/18/2010 06:25 PM, Michael Ellerman wrote:
> > > > > On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote:
> >
> > > > > The result of the discussion about two years ago on this was that we
> > > > > needed a custom flow handler for XICS on RT.
> > > >
> > > > I'm still not clear on why the ultimate solution wasn't to have XICS report
> > > > edge triggered as edge triggered. Probably some complexity of the entire power
> > > > stack that I am ignorant of.
> > > >
> > > > > > Apart from the issue of losing interrupts there is also
> > > > > masking on the XICS requires an RTAS call which takes a global lock.
> > >
> > > Right, I'd love to avoid that but with real level interrupts we'd run
> > > into an interrupt storm. Though another solution would be to issue the
> > > EOI after the threaded handler finished, that'd work as well, but
> > > needs testing.
> >
> > Yeah I think that was the idea for the custom flow handler. We'd reset
> > the processor priority so we can take other interrupts (which the EOI
> > usually does for you), then do the actual EOI after the handler
> > finished.
>
> That only works when the card does not issue new interrupts until the
> EOI happens. If the EOI is only relevant for the interrupt controller,
> then you are going to lose any edge which comes in before the EOI as
> well.
Well, the real MSIs have an extra bit to allow the eoi to dally behind
the mmio on another path and that should cover this race when the irq
is left enabled.
Jan-Bernd: HEA has that change, right?
milton
* Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
2010-05-21 9:18 ` [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY), " Milton Miller
@ 2010-09-20 14:26 ` Jan-Bernd Themann
0 siblings, 0 replies; 23+ messages in thread
From: Jan-Bernd Themann @ 2010-09-20 14:26 UTC (permalink / raw)
To: Milton Miller
Cc: linux-kernel-owner, Darren Hart, dvhltc, linux-kernel,
Milton Miller, Will Schmidt, Brian King, niv, Thomas Gleixner,
Doug Maxey, linuxppc-dev
Hi Milton,
sorry for the delayed answer, was on vacation.
linux-kernel-owner@vger.kernel.org wrote on 21.05.2010 11:18:01:
> On Thu, 20 May 2010 at 10:21:36 +0200 (CEST) Thomas Gleixner wrote:
> > On Thu, 20 May 2010, Michael Ellerman wrote:
> > > On Wed, 2010-05-19 at 16:38 +0200, Thomas Gleixner wrote:
> > > > On Wed, 19 May 2010, Darren Hart wrote:
> > > >
> > > > > On 05/18/2010 06:25 PM, Michael Ellerman wrote:
> > > > > > On Tue, 2010-05-18 at 15:22 -0700, Darren Hart wrote:
> > >
> > > > > > The result of the discussion about two years ago on this was that
> > > > > > we needed a custom flow handler for XICS on RT.
> > > > >
> > > > > I'm still not clear on why the ultimate solution wasn't to have
> > > > > XICS report edge triggered as edge triggered. Probably some
> > > > > complexity of the entire power stack that I am ignorant of.
> > > > >
> > > > > > Apart from the issue of losing interrupts there is also the
> > > > > > fact that masking on the XICS requires an RTAS call which
> > > > > > takes a global lock.
> > > >
> > > > Right, I'd love to avoid that but with real level interrupts we'd
> > > > run into an interrupt storm. Though another solution would be to
> > > > issue the EOI after the threaded handler finished, that'd work as
> > > > well, but needs testing.
> > >
> > > Yeah I think that was the idea for the custom flow handler. We'd
> > > reset the processor priority so we can take other interrupts (which
> > > the EOI usually does for you), then do the actual EOI after the
> > > handler finished.
> >
> > That only works when the card does not issue new interrupts until the
> > EOI happens. If the EOI is only relevant for the interrupt controller,
> > then you are going to lose any edge which comes in before the EOI as
> > well.
>
> Well, the real MSIs have an extra bit to allow the eoi to dally behind
> the mmio on another path and that should cover this race when the irq
> is left enabled.
>
> Jan-Bernd, HEA has that change, right?
I don't know. We never hit problems, so we did not look very deeply into
this area. We probably have to talk to the HEA HW developers to be sure.
Regards,
Jan-Bernd
Thread overview: 23+ messages
[not found] <4BF30793.5070300@us.ibm.com>
2010-05-18 21:52 ` [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) Brian King
2010-05-18 22:19 ` Nivedita Singhvi
2010-05-18 22:22 ` Darren Hart
2010-05-19 1:25 ` Michael Ellerman
2010-05-19 14:16 ` Darren Hart
2010-05-19 14:38 ` Thomas Gleixner
2010-05-19 21:08 ` Thomas Gleixner
2010-05-20 1:34 ` Michael Ellerman
2010-05-20 7:37 ` Jan-Bernd Themann
2010-05-20 8:14 ` Thomas Gleixner
2010-05-20 9:05 ` Jan-Bernd Themann
2010-05-20 9:19 ` Thomas Gleixner
2010-05-20 14:26 ` Nivedita Singhvi
2010-05-20 14:53 ` Will Schmidt
2010-05-20 14:39 ` Darren Hart
2010-05-20 14:45 ` Thomas Gleixner
2010-05-20 21:44 ` Will Schmidt
2010-05-20 1:32 ` Michael Ellerman
2010-05-20 8:21 ` Thomas Gleixner
2010-05-21 9:18 ` [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY), " Milton Miller
2010-09-20 14:26 ` Jan-Bernd Themann
2010-05-20 1:28 ` Michael Ellerman
2010-05-21 9:02 ` Milton Miller