* Re: [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits EEH
[not found] ` <90777da90abe02c87d30968bfedc9168@linux.ibm.com>
@ 2025-03-20 8:23 ` Thomas Gleixner
2025-03-20 8:48 ` Thomas Gleixner
0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2025-03-20 8:23 UTC (permalink / raw)
To: Wen Xiong; +Cc: linux-kernel, gjoyce, linux-pci, Bjorn Helgaas, linux-scsi
On Wed, Mar 19 2025 at 21:58, Wen Xiong wrote:
>> The real problem has nothing to do with a remove/add operation. The
>> problem is solely in the probe function.
>
> I don't think we have problems in probe function since this driver has
> been in productions for many many years.
Seriously?
It does not matter at all whether you had it many years in production or
not. Fact is that the driver is operational and after that a device
reset happens, which wipes the config space. That _IS_ the problem.
> Also we didn't see the issue before the "MSI domain" patchset dropping
> into linux interrupt code(no issue in rhel92 release).
That's completely irrelevant. See above.
> Device reset is not called in probe function.
Right. The reset is part of PCI error handling, which happens _AFTER_
the driver has set up interrupts.
> We don't see the issue without dynamically remove/add operation.
> There is a small window which irqbalance daemon kicks in during device
> reset. So it took about over 6 hours to recreate the issue when doing
> remove/add loop operation.
Sure. You need a loop to hit the window. And it does not matter whether
it's the probe or the remove which triggers it. Fact is that the reset
wipes out the config space, which means that any read from the config
space between reset and restore will return garbage. That problem is not
restricted to the interrupt code. It's a general problem.
> We can't find the good way to fix the issue in both of device drivers.
> So we look for some help in interrupt code.
No. This is _NOT_ a interrupt specific problem. You are observing the
symptom related to interrupts, but any other code which reads from
config space during the reset window has exactly the same problem.
The PCI error handling resets the device asynchronously to any other
operation which might access the config space. Yes, set_affinity() is
one possible way to hit that due to the implementation detail of
pseries_msi_compose_msg(), which reads the MSI message composed by the
underlying hypervisor back from config space. But even if it would not
read back and compose the message itself then set_affinity() would
create inconsistent state because:
reset()
compose()
write()
restore()
I.e. the reset machinery overwrites the new message, which means this
ends up with inconsistent state.
So this is a general problem with PCI error handling and _not_ a problem
of the interrupt subsystem. I have no idea what to do about that, but
this needs to be looked at from the PCI error handling side and not
papered over at the messenger.
Thanks,
tglx
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits EEH
2025-03-20 8:23 ` [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits EEH Thomas Gleixner
@ 2025-03-20 8:48 ` Thomas Gleixner
2025-03-27 21:36 ` Wen Xiong
0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2025-03-20 8:48 UTC (permalink / raw)
To: Wen Xiong; +Cc: linux-kernel, gjoyce, linux-pci, Bjorn Helgaas, linux-scsi
On Thu, Mar 20 2025 at 09:23, Thomas Gleixner wrote:
> On Wed, Mar 19 2025 at 21:58, Wen Xiong wrote:
>> We don't see the issue without dynamically remove/add operation.
>> There is a small window which irqbalance daemon kicks in during device
>> reset. So it took about over 6 hours to recreate the issue when doing
>> remove/add loop operation.
>
> Sure. You need a loop to hit the window. And it does not matter whether
> it's the probe or the remove which triggers it. Fact is that the reset
> wipes out the config space, which means that any read from the config
> space between reset and restore will return garbage. That problem is not
> restricted to the interrupt code. It's a general problem.
After looking at the code again, it's a problem in the remove()
function:
__ipr_remove()
ipr_initiate_ioa_bringdown()
// resets device
restore_config_space()
....
ipr_free_all_resources()
free_irqs()
So yes, it's not probe(). But the question is pretty much the same.
Why is a reset issued while the driver is fully operational and
resources are still in use?
Don't even think about telling me that this is a problem of the MSI
interrupt rework. It is not. It's been broken forever.
You _cannot_ pull the rung under a fully operational driver and expect
that all involved parts will just magically handle this gracefully.
What about tearing down resources first and then issuing the reset?
Thanks,
tglx
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits EEH
2025-03-20 8:48 ` Thomas Gleixner
@ 2025-03-27 21:36 ` Wen Xiong
2025-03-28 11:27 ` Thomas Gleixner
0 siblings, 1 reply; 6+ messages in thread
From: Wen Xiong @ 2025-03-27 21:36 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-kernel, gjoyce, linux-pci, Bjorn Helgaas, linux-scsi
> What about tearing down resources first and then issuing the reset?
This SAS adapter supports dual controller configuration. Normally we
have two adapters in a system.
We config one of them as Primary adapter and another one as Secondary
adapter.
When doing remove operation on primary adapter, the Secondary adapter is
going to be failover and config as primary by adapter firmware. During
failover process, adapter firmware requests the secondary adapter reset,
then sets it as primary adapter.
Secondary adapter failover triggers adapter
reset(ipr_reset_get_unit_check_job()).
[ 940.742698] ipr 0206:a0:00.0: 9070: IOA requested reset -> FW
requested
[ 940.742733] ipr 0206:a0:00.0: Adapter to Adapter Link Failed Due to
SAS Fabric Change [PRC: 17101C25]
[ 940.742768] ipr 0206:a0:00.0: Remote IOA VPID/SN: IBM 57B4001SISIOA
00458021
When secondary adapter doing a reset, we use the same code path as
removing operation. We can’t free irqs for Secondary adapter since
kernel has assigned the irqs for Secondary adapter.
Actually we discussed about "calling pci_free_irq_vectors()" before
doing bist reset when we trying to fix in device driver.
That might cause other problems. It is also not what a user would
expect. For example, if they disabled irq balance and manually setup irq
binding and affinity, if we go and free and reallocate the interrupts
across a reset, this would wipe out those changes, which would not be
expected.
Thanks,
Wendy
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits EEH
2025-03-27 21:36 ` Wen Xiong
@ 2025-03-28 11:27 ` Thomas Gleixner
2025-04-01 20:14 ` Wen Xiong
0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2025-03-28 11:27 UTC (permalink / raw)
To: Wen Xiong; +Cc: linux-kernel, gjoyce, linux-pci, Bjorn Helgaas, linux-scsi
On Thu, Mar 27 2025 at 16:36, Wen Xiong wrote:
> When secondary adapter doing a reset, we use the same code path as
> removing operation. We can’t free irqs for Secondary adapter since
> kernel has assigned the irqs for Secondary adapter.
>
> Actually we discussed about "calling pci_free_irq_vectors()" before
> doing bist reset when we trying to fix in device driver.
> That might cause other problems. It is also not what a user would
> expect. For example, if they disabled irq balance and manually setup irq
> binding and affinity, if we go and free and reallocate the interrupts
> across a reset, this would wipe out those changes, which would not be
> expected.
You are completely missing the point. This is not a problem restricted
to PCI/MSI interrupts.
You have interrupts, work, queues and whatever in fully operational
state. Then you tell the firmware to reset the adapter and swap the
secondary adapter in. This reset operation brings the hardware temporary
into an undefined state as you have observed. How is any part of the
kernel, which can access and operate on this adapter, supposed to deal
with that correctly?
You are now focussing solely on papering over the symptom in PCI/MSI
because that's where you actually observed the fallout.
But as you know curing the symptom is not fixing anything because the
underlying root cause still persists and will roar its ugly head at some
other place sooner than later.
So no, instead of sprinkling horrible band-aids all over the place, you
have to sit down and think about a properly orchestrated mechanism for
this failover scenario.
I understand that you want to preserve the state of the secondary
adapter including interrupt numbers and related settings, but for that
you have to properly freeze _all_ related kernel resources first, then
let the firmware do the swap and once that's complete unfreeze
everything again.
That will need support in other subsystems, like the interrupt layer,
but that needs to be properly designed and integrated so that it works
with all possible PCI/MSI backend implementations and under all
circumstances. Setting affinity is not the only way to make this go
wrong, there are other operations which access the underlying hardware
and config space. You just haven't triggered them yet, but they exist.
And it's not a problem restricted to the pSeries MSI backend
implementation either. That's just one way to expose it.
Even if I put the secondary swap problem aside and just look at a single
instance, __ipr_remove() is already broken in itself. As I pointed out
to you before:
__ipr_remove()
ipr_initiate_ioa_bringdown()
// resets device -> undefined state
restore_config_space()
....
ipr_free_all_resources()
free_irqs()
Up to the point where interrupts are freed, they are fully operational
and exposed to user space. The related functionality can hit the
undefined state not only via set_affinity() independent of the swap
problem.
The adapter swap is just a worse variant of the above because it affects
the secondary side behind its back.
This needs quite some work to resolve properly, but that's the only way
to fix it once and forever. Proliferating the 'Plug and Pray' approach
with a lot of handwaving is just helping you to close your ticket, but
it's not a solution.
Thanks,
tglx
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits EEH
2025-03-28 11:27 ` Thomas Gleixner
@ 2025-04-01 20:14 ` Wen Xiong
2025-04-02 8:33 ` Thomas Gleixner
0 siblings, 1 reply; 6+ messages in thread
From: Wen Xiong @ 2025-04-01 20:14 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-kernel, gjoyce, linux-pci, Bjorn Helgaas, linux-scsi
On 2025-03-28 06:27, Thomas Gleixner wrote:
> You are completely missing the point. This is not a problem restricted
> to PCI/MSI interrupts.
>
Thanks for all suggestions! I will investigate more and check if we can
fix it in device driver.
Looks several device drivers in kernel use two kernel APIs to set/clean
IRQ_NO_BALACING status for a given irq.
irq_set_status_flags(irq, IRQ_NO_BALANCING);
irq_clear_status_flags(irq, IRQ_NO_BALANCING);
With these two APIs, device driver can decide if using kernel irq
balance or setting irq affinity by driver itself?
So we can set/clear IRQ_NO_BALANCING during reset.
Thanks,
Wendy
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits EEH
2025-04-01 20:14 ` Wen Xiong
@ 2025-04-02 8:33 ` Thomas Gleixner
0 siblings, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2025-04-02 8:33 UTC (permalink / raw)
To: Wen Xiong; +Cc: linux-kernel, gjoyce, linux-pci, Bjorn Helgaas, linux-scsi
On Tue, Apr 01 2025 at 15:14, Wen Xiong wrote:
> On 2025-03-28 06:27, Thomas Gleixner wrote:
>
>> You are completely missing the point. This is not a problem restricted
>> to PCI/MSI interrupts.
>>
> Thanks for all suggestions! I will investigate more and check if we can
> fix it in device driver.
>
> Looks several device drivers in kernel use two kernel APIs to set/clean
> IRQ_NO_BALACING status for a given irq.
>
> irq_set_status_flags(irq, IRQ_NO_BALANCING);
> irq_clear_status_flags(irq, IRQ_NO_BALANCING);
>
> With these two APIs, device driver can decide if using kernel irq
> balance or setting irq affinity by driver itself?
>
> So we can set/clear IRQ_NO_BALANCING during reset.
That's not sufficient. You are papering over the symptom. I explained
you that affinity setting is not the only way to wreckage this.
But feel free to ignore me and apply your bandaid fix as you see fit.
Thanks,
tglx
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-04-02 8:33 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1742386474-13717-1-git-send-email-wenxiong@linux.ibm.com>
[not found] ` <87a59h3sk9.ffs@tglx>
[not found] ` <90777da90abe02c87d30968bfedc9168@linux.ibm.com>
2025-03-20 8:23 ` [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits EEH Thomas Gleixner
2025-03-20 8:48 ` Thomas Gleixner
2025-03-27 21:36 ` Wen Xiong
2025-03-28 11:27 ` Thomas Gleixner
2025-04-01 20:14 ` Wen Xiong
2025-04-02 8:33 ` Thomas Gleixner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).