* Re: [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY
2025-07-15 11:15 [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY John Garry
@ 2025-07-15 19:18 ` John Meneghini
2025-07-24 17:23 ` John Meneghini
` (3 subsequent siblings)
4 siblings, 0 replies; 7+ messages in thread
From: John Meneghini @ 2025-07-15 19:18 UTC (permalink / raw)
To: John Garry, jejb, martin.petersen, sagar.biradar
Cc: linux-scsi, hare, ming.lei
OK John. Thanks for the patch.
We will test this out on our test bed here at Red Hat and let you know if this solves the problem.
/John
On 7/15/25 7:15 AM, John Garry wrote:
> When PCI_IRQ_AFFINITY is passed to pci_alloc_irq_vectors(), it
> means interrupts are spread across the available CPUs. It also means that
> the interrupts become managed, which means that an interrupt is shut down
> when all the CPUs in the interrupt affinity mask go offline.
>
> Using managed interrupts in this way means that we should ensure that
> completions do not occur on HW queues whose associated interrupt
> is shut down. This is typically achieved by ensuring that only CPUs which
> are online can generate IO completion traffic to the HW queue which they
> are mapped to (so that they can also serve completion interrupts for that
> HW queue).
>
> The problem in the driver is that a CPU can generate completions to
> a HW queue whose interrupt may be shut down, as the CPUs in the HW queue
> interrupt affinity mask may be offline. This can cause IOs to never
> complete and hang the system. The driver maintains its own CPU <-> HW
> queue mapping for submissions, see aac_fib_vector_assign(), but this
> does not reflect the CPU <-> HW queue interrupt affinity mapping.
>
> Commit 9dc704dcc09e ("scsi: aacraid: Reply queue mapping to CPUs based on
> IRQ affinity") tried to remedy this issue by mapping CPUs properly to
> HW queue interrupts. However, this was later reverted in commit c5becf57dd56
> ("Revert "scsi: aacraid: Reply queue mapping to CPUs based on IRQ
> affinity"") - it seems that there were other reports of hangs. I guess that
> this was due to some implementation issue in the original commit or
> maybe a HW issue.
>
> Fix the original hang by simply not using managed interrupts, i.e. by not
> setting PCI_IRQ_AFFINITY. In this way, all CPUs will be in each HW
> queue affinity mask, so offlining any CPUs should not create completion
> problems.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> build tested only
>
> diff --git a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
> index 28cf18955a08..726c8531b7d3 100644
> --- a/drivers/scsi/aacraid/comminit.c
> +++ b/drivers/scsi/aacraid/comminit.c
> @@ -481,8 +481,7 @@ void aac_define_int_mode(struct aac_dev *dev)
> pci_find_capability(dev->pdev, PCI_CAP_ID_MSIX)) {
> min_msix = 2;
> i = pci_alloc_irq_vectors(dev->pdev,
> - min_msix, msi_count,
> - PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
> + min_msix, msi_count, PCI_IRQ_MSIX);
> if (i > 0) {
> dev->msi_enabled = 1;
> msi_count = i;
^ permalink raw reply [flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY
2025-07-15 11:15 [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY John Garry
2025-07-15 19:18 ` John Meneghini
@ 2025-07-24 17:23 ` John Meneghini
2025-07-25 8:24 ` John Garry
2025-07-24 17:29 ` John Meneghini
` (2 subsequent siblings)
4 siblings, 1 reply; 7+ messages in thread
From: John Meneghini @ 2025-07-24 17:23 UTC (permalink / raw)
To: John Garry, jejb, martin.petersen, sagar.biradar
Cc: linux-scsi, hare, ming.lei, Marco Patalano
Sorry it has taken so long to get this patch tested.
The good news is: this patch fixes the offline CPU problem.
Here are the test results:
This is with 6.16.0-rc6:
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: Test
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
fio-3.36-5.el10.x86_64
:: [ 16:06:57 ] :: [ LOG ] :: INFO: Executing FIO_Test() with on: /home/test1G.img
:: [ 16:06:57 ] :: [ BEGIN ] :: Running 'fio -filename=/home/test1G.img -iodepth=64 -thread -rw=randwrite -ioengine=libaio -bs=4K -direct=1 -runtime=1200 -time_based -size=1G -group_reporting -name=mytest -numjobs=4 &'
:: [ 16:06:57 ] :: [ PASS ] :: Command 'fio -filename=/home/test1G.img -iodepth=64 -thread -rw=randwrite -ioengine=libaio -bs=4K -direct=1 -runtime=1200 -time_based -size=1G -group_reporting -name=mytest -numjobs=4 &' (Expected 0, got 0)
:: [ 16:06:57 ] :: [ LOG ] :: 16 CPUs on this system and we will offline 14
mytest: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
...
fio-3.36
Starting 4 threads
:: [ 16:07:07 ] :: [ LOG ] :: INFO: Offline/Online CPUs - iteration 1
Offline CPU1
:: [ 16:07:07 ] :: [ BEGIN ] :: Running 'echo 0 > /sys/devices/system/cpu/cpu1/online'
:: [ 16:07:08 ] :: [ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu1/online' (Expected 0, got 0)
Offline CPU2
:: [ 16:07:08 ] :: [ BEGIN ] :: Running 'echo 0 > /sys/devices/system/cpu/cpu2/online'
:: [ 16:07:08 ] :: [ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu2/online' (Expected 0, got 0)
Offline CPU3
:: [ 16:07:08 ] :: [ BEGIN ] :: Running 'echo 0 > /sys/devices/system/cpu/cpu3/online'
:: [ 16:07:08 ] :: [ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu3/online' (Expected 0, got 0)
Offline CPU4
:: [ 16:07:08 ] :: [ BEGIN ] :: Running 'echo 0 > /sys/devices/system/cpu/cpu4/online'
:: [ 16:07:08 ] :: [ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu4/online' (Expected 0, got 0)
Offline CPU5
:: [ 16:07:08 ] :: [ BEGIN ] :: Running 'echo 0 > /sys/devices/system/cpu/cpu5/online'
:: [ 16:07:08 ] :: [ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu5/online' (Expected 0, got 0)
Offline CPU6
:: [ 16:07:08 ] :: [ BEGIN ] :: Running 'echo 0 > /sys/devices/system/cpu/cpu6/online'
:: [ 16:07:08 ] :: [ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu6/online' (Expected 0, got 0)
Almost immediately after offlining the CPUs, IO hangs and I see this:
dmesg -Tw
[Wed Jul 23 16:07:08 2025] smpboot: CPU 1 is now offline
[Wed Jul 23 16:07:08 2025] smpboot: CPU 2 is now offline
[Wed Jul 23 16:07:08 2025] smpboot: CPU 3 is now offline
[Wed Jul 23 16:07:08 2025] smpboot: CPU 4 is now offline
[Wed Jul 23 16:07:08 2025] smpboot: CPU 5 is now offline
[Wed Jul 23 16:07:08 2025] smpboot: CPU 6 is now offline
[Wed Jul 23 16:08:10 2025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (1,1,3,0):
[Wed Jul 23 16:08:10 2025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (1,1,3,0):
[Wed Jul 23 16:08:10 2025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (1,1,3,0):
[Wed Jul 23 16:08:10 2025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (1,1,3,0):
[Wed Jul 23 16:08:10 2025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (1,1,3,0):
[Wed Jul 23 16:08:10 2025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (1,1,3,0):
[Wed Jul 23 16:08:10 2025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (1,1,3,0):
[Wed Jul 23 16:08:10 2025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (1,1,3,0):
...
[Wed Jul 23 16:08:11 2025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (1,1,3,0):
[Wed Jul 23 16:08:11 2025] aacraid: Host adapter abort request.
aacraid: Outstanding commands on (1,1,3,0):
[Wed Jul 23 16:08:11 2025] aacraid: Host bus reset request. SCSI hang ?
[Wed Jul 23 16:08:11 2025] aacraid 0000:84:00.0: outstanding cmd: midlevel-0
[Wed Jul 23 16:08:11 2025] aacraid 0000:84:00.0: outstanding cmd: lowlevel-0
[Wed Jul 23 16:08:11 2025] aacraid 0000:84:00.0: outstanding cmd: error handler-0
[Wed Jul 23 16:08:11 2025] aacraid 0000:84:00.0: outstanding cmd: firmware-64
[Wed Jul 23 16:08:11 2025] aacraid 0000:84:00.0: outstanding cmd: kernel-0
[Wed Jul 23 16:08:11 2025] aacraid 0000:84:00.0: Controller reset type is 3
[Wed Jul 23 16:08:11 2025] aacraid 0000:84:00.0: Issuing IOP reset
[Wed Jul 23 16:09:28 2025] INFO: task kworker/10:1:162 blocked for more than 121 seconds.
[Wed Jul 23 16:09:28 2025] Not tainted 6.16.0-rc6 #1
[Wed Jul 23 16:09:28 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jul 23 16:09:28 2025] task:kworker/10:1 state:D stack:0 pid:162 tgid:162 ppid:2 task_flags:0x4208060 flags:0x00004000
[Wed Jul 23 16:09:28 2025] Workqueue: events_freezable_pwr_efficient disk_events_workfn
[Wed Jul 23 16:09:28 2025] Call Trace:
[Wed Jul 23 16:09:28 2025] <TASK>
[Wed Jul 23 16:09:28 2025] __schedule+0x2c8/0x730
[Wed Jul 23 16:09:28 2025] schedule+0x27/0x80
[Wed Jul 23 16:09:28 2025] io_schedule+0x46/0x70
[Wed Jul 23 16:09:28 2025] blk_mq_get_tag+0x122/0x290
[Wed Jul 23 16:09:28 2025] ? __pfx_autoremove_wake_function+0x10/0x10
[Wed Jul 23 16:09:28 2025] __blk_mq_alloc_requests+0xb5/0x240
[Wed Jul 23 16:09:28 2025] blk_mq_alloc_request+0x1e8/0x280
[Wed Jul 23 16:09:28 2025] scsi_execute_cmd+0xbf/0x2a0
[Wed Jul 23 16:09:28 2025] ? dl_server_stop+0x2f/0x40
[Wed Jul 23 16:09:28 2025] ? srso_return_thunk+0x5/0x5f
[Wed Jul 23 16:09:28 2025] scsi_test_unit_ready+0x74/0x100
[Wed Jul 23 16:09:28 2025] sd_check_events+0xfa/0x1a0 [sd_mod]
[Wed Jul 23 16:09:28 2025] disk_check_events+0x3a/0x100
[Wed Jul 23 16:09:28 2025] ? __schedule+0x2d0/0x730
[Wed Jul 23 16:09:28 2025] process_one_work+0x18b/0x340
[Wed Jul 23 16:09:28 2025] worker_thread+0x256/0x3a0
[Wed Jul 23 16:09:28 2025] ? __pfx_worker_thread+0x10/0x10
[Wed Jul 23 16:09:28 2025] kthread+0xfc/0x240
[Wed Jul 23 16:09:28 2025] ? __pfx_kthread+0x10/0x10
[Wed Jul 23 16:09:28 2025] ? __pfx_kthread+0x10/0x10
[Wed Jul 23 16:09:28 2025] ret_from_fork+0xf0/0x110
[Wed Jul 23 16:09:28 2025] ? __pfx_kthread+0x10/0x10
[Wed Jul 23 16:09:28 2025] ret_from_fork_asm+0x1a/0x30
[Wed Jul 23 16:09:28 2025] </TASK>
[Wed Jul 23 16:09:28 2025] INFO: task main.sh:1849 blocked for more than 121 seconds.
[Wed Jul 23 16:09:28 2025] Not tainted 6.16.0-rc6 #1
[Wed Jul 23 16:09:28 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jul 23 16:09:28 2025] task:main.sh state:D stack:0 pid:1849 tgid:1849 ppid:1339 task_flags:0x400040 flags:0x00004002
...
[Wed Jul 23 16:09:32 2025] aacraid 0000:84:00.0: IOP reset succeeded
[Wed Jul 23 16:09:32 2025] aacraid: Comm Interface type2 enabled
The machine is hung at this point and has to be power-cycled to recover.
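For anyone who wants to try this outside of the beakerlib harness, the failing sequence above boils down to roughly the following sketch. The file path, CPU range, and the DRYRUN guard are my assumptions rather than part of the original test; the script defaults to only printing the commands it would run.

```shell
#!/bin/sh
# Condensed reproducer sketch: start a random-write fio load on a file on
# the aacraid-backed volume, then offline most CPUs while IO is in flight.
# DRYRUN defaults to 1 so the script only echoes the commands; run with
# DRYRUN=0 (as root, on a disposable test machine) to execute them.
DRYRUN="${DRYRUN:-1}"

run() {
    if [ "$DRYRUN" = "1" ]; then
        echo "$*"
    else
        eval "$*"
    fi
}

# Background write load, matching the fio invocation in the log above.
run "fio -filename=/home/test1G.img -iodepth=64 -thread -rw=randwrite \
-ioengine=libaio -bs=4K -direct=1 -runtime=1200 -time_based -size=1G \
-group_reporting -name=mytest -numjobs=4 &"

# Offline CPUs 1-14 of the 16 CPUs on the test system.
for cpu in $(seq 1 14); do
    run "echo 0 > /sys/devices/system/cpu/cpu${cpu}/online"
done
```

On an unpatched kernel this hangs IO as shown in the dmesg output above; with the patch applied the load keeps running.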
With your patch, all tests pass:
storageqe-34:linux(aacraid_072225) > git log --oneline -2
6ef5eabf114c (HEAD -> aacraid_072225, johnm/aacraid_072225) scsi: aacraid: Stop using PCI_IRQ_AFFINITY
347e9f5043c8 (tag: v6.16-rc6, branch_v6.16-rc6) Linux 6.16-rc6
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: unknown
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [ 17:44:30 ] :: [ LOG ] :: Phases fingerprint: L5rLAvqh
:: [ 17:44:30 ] :: [ LOG ] :: Asserts fingerprint: Jdim1mp9
:: [ 17:44:30 ] :: [ LOG ] :: JOURNAL XML: /var/tmp/beakerlib-h5opPB5/journal.xml
:: [ 17:44:30 ] :: [ LOG ] :: JOURNAL TXT: /var/tmp/beakerlib-h5opPB5/journal.txt
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: Duration: 1262s
:: Phases: 1 good, 0 bad
:: OVERALL RESULT: PASS (unknown)
:: [ 17:44:02 ] :: [ PASS ] :: Command 'echo 1 > /sys/devices/system/cpu/cpu11/online' (Expected 0, got 0)
:: [ 17:44:02 ] :: [ PASS ] :: Command 'echo 1 > /sys/devices/system/cpu/cpu12/online' (Expected 0, got 0)
:: [ 17:44:02 ] :: [ PASS ] :: Command 'echo 1 > /sys/devices/system/cpu/cpu13/online' (Expected 0, got 0)
:: [ 17:44:02 ] :: [ PASS ] :: Command 'echo 1 > /sys/devices/system/cpu/cpu14/online' (Expected 0, got 0)
:: [ 17:44:07 ] :: [ PASS ] :: Command 'sleep 5' (Expected 0, got 0)
:: [ 17:44:07 ] :: [ LOG ] :: INFO: wait for fio operation to complete
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: Duration: 1239s
:: Assertions: 3001 good, 0 bad
:: RESULT: PASS (Test)
John A. Meneghini
Senior Principal Platform Storage Engineer
RHEL SST - Platform Storage Group
jmeneghi@redhat.com
On 7/15/25 7:15 AM, John Garry wrote:
> When PCI_IRQ_AFFINITY is passed to pci_alloc_irq_vectors(), it
> means interrupts are spread across the available CPUs. It also means that
> the interrupts become managed, which means that an interrupt is shut down
> when all the CPUs in the interrupt affinity mask go offline.
>
> Using managed interrupts in this way means that we should ensure that
> completions do not occur on HW queues whose associated interrupt
> is shut down. This is typically achieved by ensuring that only CPUs which
> are online can generate IO completion traffic to the HW queue which they
> are mapped to (so that they can also serve completion interrupts for that
> HW queue).
>
> The problem in the driver is that a CPU can generate completions to
> a HW queue whose interrupt may be shut down, as the CPUs in the HW queue
> interrupt affinity mask may be offline. This can cause IOs to never
> complete and hang the system. The driver maintains its own CPU <-> HW
> queue mapping for submissions, see aac_fib_vector_assign(), but this
> does not reflect the CPU <-> HW queue interrupt affinity mapping.
>
> Commit 9dc704dcc09e ("scsi: aacraid: Reply queue mapping to CPUs based on
> IRQ affinity") tried to remedy this issue by mapping CPUs properly to
> HW queue interrupts. However, this was later reverted in commit c5becf57dd56
> ("Revert "scsi: aacraid: Reply queue mapping to CPUs based on IRQ
> affinity"") - it seems that there were other reports of hangs. I guess that
> this was due to some implementation issue in the original commit or
> maybe a HW issue.
>
> Fix the original hang by simply not using managed interrupts, i.e. by not
> setting PCI_IRQ_AFFINITY. In this way, all CPUs will be in each HW
> queue affinity mask, so offlining any CPUs should not create completion
> problems.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> build tested only
>
> diff --git a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
> index 28cf18955a08..726c8531b7d3 100644
> --- a/drivers/scsi/aacraid/comminit.c
> +++ b/drivers/scsi/aacraid/comminit.c
> @@ -481,8 +481,7 @@ void aac_define_int_mode(struct aac_dev *dev)
> pci_find_capability(dev->pdev, PCI_CAP_ID_MSIX)) {
> min_msix = 2;
> i = pci_alloc_irq_vectors(dev->pdev,
> - min_msix, msi_count,
> - PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
> + min_msix, msi_count, PCI_IRQ_MSIX);
> if (i > 0) {
> dev->msi_enabled = 1;
> msi_count = i;
* Re: [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY
2025-07-24 17:23 ` John Meneghini
@ 2025-07-25 8:24 ` John Garry
0 siblings, 0 replies; 7+ messages in thread
From: John Garry @ 2025-07-25 8:24 UTC (permalink / raw)
To: John Meneghini, jejb, martin.petersen, sagar.biradar
Cc: linux-scsi, hare, ming.lei, Marco Patalano
On 24/07/2025 18:23, John Meneghini wrote:
> Sorry it has taken so long to get this patch tested.
>
> The good news is: this patch fixes the offline CPU problem.
Thanks for testing.
If you check /proc/irq/<IRQ NUMBER>/smp_affinity_list for the relevant
IRQs, you should notice that it now covers all CPUs.
Hopefully we can restore using PCI_IRQ_AFFINITY and setting .host_tagset
in future, but that depends on someone figuring out the problem with
using .host_tagset for this driver.
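The check John describes could be scripted as below. This is just a sketch: the helper name is mine, and the PCI address 0000:84:00.0 is taken from the test log earlier in the thread, so substitute your own device. The second argument exists only so the helper can be exercised against a fake directory tree; it defaults to the real /proc/irq.

```shell
#!/bin/sh
# Print the CPU affinity of each MSI/MSI-X IRQ of a PCI device by walking
# /sys/bus/pci/devices/<dev>/msi_irqs (one entry per allocated IRQ number)
# and reading /proc/irq/<n>/smp_affinity_list for each.
show_irq_affinity() {
    msi_dir="$1"
    irq_root="${2:-/proc/irq}"
    for irq in $(ls "$msi_dir" 2>/dev/null); do
        printf 'IRQ %s -> CPUs %s\n' "$irq" \
            "$(cat "$irq_root/$irq/smp_affinity_list")"
    done
}

# Typical use; with this patch applied, every line should list all online
# CPUs rather than a per-queue subset:
#   show_irq_affinity /sys/bus/pci/devices/0000:84:00.0/msi_irqs
```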
* Re: [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY
2025-07-15 11:15 [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY John Garry
2025-07-15 19:18 ` John Meneghini
2025-07-24 17:23 ` John Meneghini
@ 2025-07-24 17:29 ` John Meneghini
2025-07-25 1:18 ` Martin K. Petersen
2025-07-31 4:44 ` Martin K. Petersen
4 siblings, 0 replies; 7+ messages in thread
From: John Meneghini @ 2025-07-24 17:29 UTC (permalink / raw)
To: John Garry, jejb, martin.petersen, sagar.biradar
Cc: linux-scsi, hare, ming.lei, Marco Patalano
Closes: https://lore.kernel.org/linux-scsi/20250618192427.3845724-1-jmeneghi@redhat.com/
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Tested-by: John Meneghini <jmeneghi@redhat.com>
Martin, please merge this patch. This fixes the aacraid driver issue we discussed at LSF/MM.
/John
On 7/15/25 7:15 AM, John Garry wrote:
> When PCI_IRQ_AFFINITY is passed to pci_alloc_irq_vectors(), it
> means interrupts are spread across the available CPUs. It also means that
> the interrupts become managed, which means that an interrupt is shut down
> when all the CPUs in the interrupt affinity mask go offline.
>
> Using managed interrupts in this way means that we should ensure that
> completions do not occur on HW queues whose associated interrupt
> is shut down. This is typically achieved by ensuring that only CPUs which
> are online can generate IO completion traffic to the HW queue which they
> are mapped to (so that they can also serve completion interrupts for that
> HW queue).
>
> The problem in the driver is that a CPU can generate completions to
> a HW queue whose interrupt may be shut down, as the CPUs in the HW queue
> interrupt affinity mask may be offline. This can cause IOs to never
> complete and hang the system. The driver maintains its own CPU <-> HW
> queue mapping for submissions, see aac_fib_vector_assign(), but this
> does not reflect the CPU <-> HW queue interrupt affinity mapping.
>
> Commit 9dc704dcc09e ("scsi: aacraid: Reply queue mapping to CPUs based on
> IRQ affinity") tried to remedy this issue by mapping CPUs properly to
> HW queue interrupts. However, this was later reverted in commit c5becf57dd56
> ("Revert "scsi: aacraid: Reply queue mapping to CPUs based on IRQ
> affinity"") - it seems that there were other reports of hangs. I guess that
> this was due to some implementation issue in the original commit or
> maybe a HW issue.
>
> Fix the original hang by simply not using managed interrupts, i.e. by not
> setting PCI_IRQ_AFFINITY. In this way, all CPUs will be in each HW
> queue affinity mask, so offlining any CPUs should not create completion
> problems.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> build tested only
>
> diff --git a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
> index 28cf18955a08..726c8531b7d3 100644
> --- a/drivers/scsi/aacraid/comminit.c
> +++ b/drivers/scsi/aacraid/comminit.c
> @@ -481,8 +481,7 @@ void aac_define_int_mode(struct aac_dev *dev)
> pci_find_capability(dev->pdev, PCI_CAP_ID_MSIX)) {
> min_msix = 2;
> i = pci_alloc_irq_vectors(dev->pdev,
> - min_msix, msi_count,
> - PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
> + min_msix, msi_count, PCI_IRQ_MSIX);
> if (i > 0) {
> dev->msi_enabled = 1;
> msi_count = i;
* Re: [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY
2025-07-15 11:15 [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY John Garry
` (2 preceding siblings ...)
2025-07-24 17:29 ` John Meneghini
@ 2025-07-25 1:18 ` Martin K. Petersen
2025-07-31 4:44 ` Martin K. Petersen
4 siblings, 0 replies; 7+ messages in thread
From: Martin K. Petersen @ 2025-07-25 1:18 UTC (permalink / raw)
To: John Garry
Cc: jejb, martin.petersen, sagar.biradar, linux-scsi, jmeneghi, hare,
ming.lei
John,
> Fix the very original hang by just not using managed interrupts by not
> setting PCI_IRQ_AFFINITY. In this way, all CPUs will be in each HW
> queue affinity mask, so should not create completion problems if any
> CPUs go offline.
Applied to 6.17/scsi-staging, thanks!
--
Martin K. Petersen
* Re: [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY
2025-07-15 11:15 [PATCH] scsi: aacraid: Stop using PCI_IRQ_AFFINITY John Garry
` (3 preceding siblings ...)
2025-07-25 1:18 ` Martin K. Petersen
@ 2025-07-31 4:44 ` Martin K. Petersen
4 siblings, 0 replies; 7+ messages in thread
From: Martin K. Petersen @ 2025-07-31 4:44 UTC (permalink / raw)
To: jejb, sagar.biradar, John Garry
Cc: Martin K . Petersen, linux-scsi, jmeneghi, hare, ming.lei
On Tue, 15 Jul 2025 11:15:35 +0000, John Garry wrote:
> When PCI_IRQ_AFFINITY is passed to pci_alloc_irq_vectors(), it
> means interrupts are spread across the available CPUs. It also means that
> the interrupts become managed, which means that an interrupt is shut down
> when all the CPUs in the interrupt affinity mask go offline.
>
> Using managed interrupts in this way means that we should ensure that
> completions do not occur on HW queues whose associated interrupt
> is shut down. This is typically achieved by ensuring that only CPUs which
> are online can generate IO completion traffic to the HW queue which they
> are mapped to (so that they can also serve completion interrupts for that
> HW queue).
>
> [...]
Applied to 6.17/scsi-queue, thanks!
[1/1] scsi: aacraid: Stop using PCI_IRQ_AFFINITY
https://git.kernel.org/mkp/scsi/c/dafeaf2c03e7
--
Martin K. Petersen Oracle Linux Engineering