From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: John Garry <john.g.garry@oracle.com>,
John Meneghini <jmeneghi@redhat.com>,
"Martin K . Petersen" <martin.petersen@oracle.com>,
Sasha Levin <sashal@kernel.org>,
aacraid@microsemi.com, linux-scsi@vger.kernel.org
Subject: [PATCH AUTOSEL 6.16-5.4] scsi: aacraid: Stop using PCI_IRQ_AFFINITY
Date: Sat, 9 Aug 2025 20:20:53 -0400 [thread overview]
Message-ID: <20250810002104.1545396-4-sashal@kernel.org> (raw)
In-Reply-To: <20250810002104.1545396-1-sashal@kernel.org>
From: John Garry <john.g.garry@oracle.com>
[ Upstream commit dafeaf2c03e71255438ffe5a341d94d180e6c88e ]
When PCI_IRQ_AFFINITY is set for calling pci_alloc_irq_vectors(), it
means interrupts are spread around the available CPUs. It also means that
the interrupts become managed, which means that an interrupt is shutdown
when all the CPUs in the interrupt affinity mask go offline.
Using managed interrupts in this way means that we should ensure that
completions should not occur on HW queues where the associated interrupt
is shutdown. This is typically achieved by ensuring only CPUs which are
online can generate IO completion traffic to the HW queue which they are
mapped to (so that they can also serve completion interrupts for that HW
queue).
The problem in the driver is that a CPU can generate completions to a HW
queue whose interrupt may be shutdown, as the CPUs in the HW queue
interrupt affinity mask may be offline. This can cause IOs to never
complete and hang the system. The driver maintains its own CPU <-> HW
queue mapping for submissions, see aac_fib_vector_assign(), but this does
not reflect the CPU <-> HW queue interrupt affinity mapping.
Commit 9dc704dcc09e ("scsi: aacraid: Reply queue mapping to CPUs based on
IRQ affinity") tried to remedy this issue may mapping CPUs properly to HW
queue interrupts. However this was later reverted in commit c5becf57dd56
("Revert "scsi: aacraid: Reply queue mapping to CPUs based on IRQ
affinity") - it seems that there were other reports of hangs. I guess
that this was due to some implementation issue in the original commit or
maybe a HW issue.
Fix the very original hang by just not using managed interrupts by not
setting PCI_IRQ_AFFINITY. In this way, all CPUs will be in each HW queue
affinity mask, so should not create completion problems if any CPUs go
offline.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250715111535.499853-1-john.g.garry@oracle.com
Closes: https://lore.kernel.org/linux-scsi/20250618192427.3845724-1-jmeneghi@redhat.com/
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Tested-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Based on my analysis, here is my determination:
**Backport Status: YES**
This commit should be backported to stable kernel trees for the
following reasons:
1. **Fixes a Critical Bug**: The commit fixes a system hang issue that
occurs when CPUs go offline. The code change shows it removes
`PCI_IRQ_AFFINITY` flag from `pci_alloc_irq_vectors()` call (line 485
changing from `PCI_IRQ_MSIX | PCI_IRQ_AFFINITY` to just
`PCI_IRQ_MSIX`). This prevents managed interrupts from being shut
down when CPUs go offline, which was causing I/O operations to never
complete and hang the system.
2. **Long-Standing Issue with History**: The commit message reveals this
is addressing a long-standing problem that has been attempted to be
fixed before:
- Commit 9dc704dcc09e tried to fix it by mapping CPUs properly to HW
queue interrupts
- That fix was reverted in commit c5becf57dd56 due to other hang
reports
- The revert commit (c5becf57dd56) was even marked with `Cc:
<stable@vger.kernel.org>`, indicating the severity of the issue
3. **Small and Contained Change**: The actual code change is minimal -
just removing the `PCI_IRQ_AFFINITY` flag from a single function
call. This is a one-line change that doesn't introduce new features
or architectural changes.
4. **Well-Tested Fix**: The commit has both "Reviewed-by" and "Tested-
by" tags from John Meneghini, indicating it has been properly tested
and validated.
5. **No Side Effects Beyond the Fix**: The change simply reverts to non-
managed interrupts, allowing all CPUs to be in each HW queue affinity
mask. This is a conservative approach that avoids the complexity of
trying to properly coordinate CPU-to-queue mappings with interrupt
affinities.
6. **Affects Production Systems**: The linked bug report
(https://lore.kernel.org/linux-
scsi/20250618192427.3845724-1-jmeneghi@redhat.com/) and the previous
kernel bugzilla entry (#217599) indicate this affects real production
systems experiencing hangs.
7. **Driver-Specific Fix**: The change is confined to the aacraid driver
subsystem and doesn't affect other kernel components, minimizing the
risk of regression in other areas.
The fix follows stable tree rules perfectly: it fixes an important bug
(system hangs), is minimal in scope (one-line change), has low
regression risk (reverting to simpler interrupt handling), and is
confined to a specific driver subsystem.
drivers/scsi/aacraid/comminit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
index 28cf18955a08..726c8531b7d3 100644
--- a/drivers/scsi/aacraid/comminit.c
+++ b/drivers/scsi/aacraid/comminit.c
@@ -481,8 +481,7 @@ void aac_define_int_mode(struct aac_dev *dev)
pci_find_capability(dev->pdev, PCI_CAP_ID_MSIX)) {
min_msix = 2;
i = pci_alloc_irq_vectors(dev->pdev,
- min_msix, msi_count,
- PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
+ min_msix, msi_count, PCI_IRQ_MSIX);
if (i > 0) {
dev->msi_enabled = 1;
msi_count = i;
--
2.39.5
next prev parent reply other threads:[~2025-08-10 0:21 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-10 0:20 [PATCH AUTOSEL 6.16-5.4] kconfig: gconf: avoid hardcoding model2 in on_treeview2_cursor_changed() Sasha Levin
2025-08-10 0:20 ` [PATCH AUTOSEL 6.16-6.15] kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minute Sasha Levin
2025-08-10 0:20 ` [PATCH AUTOSEL 6.16-5.4] kconfig: lxdialog: fix 'space' to (de)select options Sasha Levin
2025-08-10 0:20 ` Sasha Levin [this message]
2025-08-10 0:20 ` [PATCH AUTOSEL 6.16-5.4] kconfig: gconf: fix potential memory leak in renderer_edited() Sasha Levin
2025-08-10 0:20 ` [PATCH AUTOSEL 6.16-5.15] scsi: target: core: Generate correct identifiers for PR OUT transport IDs Sasha Levin
2025-08-10 0:20 ` [PATCH AUTOSEL 6.16-5.4] ipmi: Fix strcpy source and destination the same Sasha Levin
2025-08-10 0:20 ` [PATCH AUTOSEL 6.16-5.4] scsi: Fix sas_user_scan() to handle wildcard and multi-channel scans Sasha Levin
2025-08-10 0:20 ` [PATCH AUTOSEL 6.16-6.1] vfio/mlx5: fix possible overflow in tracking max message size Sasha Levin
2025-08-10 0:20 ` [PATCH AUTOSEL 6.16-5.4] kconfig: nconf: Ensure null termination where strncpy is used Sasha Levin
2025-08-10 0:21 ` [PATCH AUTOSEL 6.16-5.4] kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c Sasha Levin
2025-08-10 0:21 ` [PATCH AUTOSEL 6.16-5.15] vfio/type1: conditional rescheduling while pinning Sasha Levin
2025-08-10 0:21 ` [PATCH AUTOSEL 6.16-5.4] ipmi: Use dev_warn_ratelimited() for incorrect message warnings Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250810002104.1545396-4-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=aacraid@microsemi.com \
--cc=jmeneghi@redhat.com \
--cc=john.g.garry@oracle.com \
--cc=linux-scsi@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=patches@lists.linux.dev \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox