* [PATCH 0/3] scsi/block: NUMA-local allocations and false-sharing fixes
@ 2026-04-02 7:46 Sumit Saxena
2026-04-02 7:46 ` [PATCH 1/3] scsi: use NUMA-local allocation for sdev and starget Sumit Saxena
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Sumit Saxena @ 2026-04-02 7:46 UTC
To: martin.petersen, axboe
Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, Sumit Saxena
This series contains three performance improvements targeting the SCSI
and block layers on multi-socket NUMA systems.
On such systems we observed extreme I/O throughput variance of 50-60%
between runs. This series identifies and fixes two root causes:
cross-node memory accesses due to NUMA-unaware allocations in the scan
path, and false sharing between hot atomic counters in
struct request_queue and struct scsi_device.
The first patch makes the SCSI scan path allocate scsi_device and
scsi_target on the NUMA node of the host adapter.
The second patch addresses false sharing in struct request_queue.
Because it touches include/linux/blkdev.h, it needs review from
linux-block; an Acked-by from the block maintainer is requested before
merging via the SCSI tree.
The third patch addresses a false-sharing problem in struct
scsi_device.
Performance notes:
Tested on a dual-socket NUMA system with an mpi3mr HBA, running fio
(random read, 4K, QD 64, 16 jobs, 60s, direct I/O).
IOPS figures are in KIOPS (thousands of IOPS):
Configuration                   Avg KIOPS  Range (KIOPS)  Spread
Baseline                        6,255      4,200 - 6,700  ~37%
Baseline + patches 2-3 (align)  6,653      6,000 - 7,000  ~15%
Baseline + all patches (1-3)    6,649      6,400 - 7,000  ~9%
Key findings:
- Cacheline alignment patches (2-3) raise average IOPS by ~6% and
  cut throughput spread from ~37% to ~15%.
- Adding the NUMA allocation patch (1) further tightens the spread
  to ~9% with negligible impact on average throughput.
- The combined effect reduces the observed 50-60% run-to-run variance
  to under 10%, significantly improving workload predictability.
No functional regressions observed.
This patch series is based on Martin's for-next tree.
James Rizzo (3):
  scsi: use NUMA-local allocation for sdev and starget
  block: align nr_active_requests_shared_tags to avoid cache line
    contention
  scsi: align scsi_device iodone_cnt to avoid cache line contention

 drivers/scsi/scsi_scan.c   | 9 ++++++---
 include/linux/blkdev.h     | 4 +++-
 include/scsi/scsi_device.h | 4 +++-
 3 files changed, 12 insertions(+), 5 deletions(-)
--
2.43.7
* [PATCH 1/3] scsi: use NUMA-local allocation for sdev and starget
From: Sumit Saxena @ 2026-04-02 7:46 UTC
To: martin.petersen, axboe
Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, James Rizzo,
Sumit Saxena
From: James Rizzo <james.rizzo@broadcom.com>
Allocate scsi_device and scsi_target on the same NUMA node as the host
adapter's DMA device to improve memory locality and reduce cross-node
traffic.
Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
drivers/scsi/scsi_scan.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index efcaf85ff699..b98c5b7d8018 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -34,6 +34,7 @@
 #include <linux/kthread.h>
 #include <linux/spinlock.h>
 #include <linux/async.h>
+#include <linux/topology.h>
 #include <linux/slab.h>
 #include <linux/unaligned.h>

@@ -286,9 +287,10 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	int display_failure_msg = 1, ret;
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
 	struct queue_limits lim;
+	int node = dev_to_node(shost->dma_dev);

-	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
-		       GFP_KERNEL);
+	sdev = kzalloc_node(sizeof(*sdev) + shost->transportt->device_size,
+			    GFP_KERNEL, node);
 	if (!sdev)
 		goto out;

@@ -504,8 +506,9 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
 	struct scsi_target *starget;
 	struct scsi_target *found_target;
 	int error, ref_got;
+	int node = dev_to_node(shost->dma_dev);

-	starget = kzalloc(size, GFP_KERNEL);
+	starget = kzalloc_node(size, GFP_KERNEL, node);
 	if (!starget) {
 		printk(KERN_ERR "%s: allocation failure\n", __func__);
 		return NULL;
--
2.43.7
* [PATCH 2/3] block: align nr_active_requests_shared_tags to avoid cache line contention
From: Sumit Saxena @ 2026-04-02 7:46 UTC
To: martin.petersen, axboe
Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, James Rizzo,
Sumit Saxena
From: James Rizzo <james.rizzo@broadcom.com>
Place nr_active_requests_shared_tags on its own cache line so it does not
share a cache line with nr_requests and other hot fields, avoiding
significant performance hits from false sharing on some CPU architectures.
Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
include/linux/blkdev.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d463b9b5a0a5..7ed566c81c1b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -561,7 +561,9 @@ struct request_queue {
 	struct timer_list timeout;
 	struct work_struct timeout_work;

-	atomic_t nr_active_requests_shared_tags;
+	/* ensure nr_active_requests_shared_tags and nr_requests are on different cache lines
+	   to avoid significant performance hits on cache line contention on some CPU architectures */
+	atomic_t nr_active_requests_shared_tags ____cacheline_aligned_in_smp;

 	struct blk_mq_tags *sched_shared_tags;
--
2.43.7
* Re: [PATCH 2/3] block: align nr_active_requests_shared_tags to avoid cache line contention
From: Bart Van Assche @ 2026-04-02 15:54 UTC
To: Sumit Saxena, martin.petersen, axboe
Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, James Rizzo
On 4/2/26 12:46 AM, Sumit Saxena wrote:
> From: James Rizzo <james.rizzo@broadcom.com>
>
> Place nr_active_requests_shared_tags on its own cache line so it does not
> share a cache line with nr_requests and other hot fields, avoiding
> significant performance hits from false sharing on some CPU architectures.
>
> Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
> ---
> include/linux/blkdev.h | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index d463b9b5a0a5..7ed566c81c1b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -561,7 +561,9 @@ struct request_queue {
>  	struct timer_list timeout;
>  	struct work_struct timeout_work;
>
> -	atomic_t nr_active_requests_shared_tags;
> +	/* ensure nr_active_requests_shared_tags and nr_requests are on different cache lines
> +	   to avoid significant performance hits on cache line contention on some CPU architectures */
> +	atomic_t nr_active_requests_shared_tags ____cacheline_aligned_in_smp;
>
>  	struct blk_mq_tags *sched_shared_tags;
>
A possible alternative is this patch that removes
nr_active_requests_shared_tags:
https://lore.kernel.org/linux-block/20240529213921.3166462-1-bvanassche@acm.org/
Thanks,
Bart.
* [PATCH 3/3] scsi: align scsi_device iodone_cnt to avoid cache line contention
From: Sumit Saxena @ 2026-04-02 7:46 UTC
To: martin.petersen, axboe
Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, James Rizzo,
Sumit Saxena
From: James Rizzo <james.rizzo@broadcom.com>
Place iodone_cnt on its own cache line so it does not share a cache line
with iorequest_cnt, avoiding significant performance hits from false
sharing when request and completion paths update these counters on some
CPU architectures.
Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
include/scsi/scsi_device.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 9c2a7bbe5891..86c2a3a6b206 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -272,7 +272,9 @@ struct scsi_device {
 #define SCSI_DEFAULT_DEVICE_BLOCKED	3

 	atomic_t iorequest_cnt;
-	atomic_t iodone_cnt;
+	/* ensure iorequest_cnt and iodone_cnt are on different cache lines to avoid significant
+	   performance hits on cache line contention on some CPU architectures */
+	atomic_t iodone_cnt ____cacheline_aligned_in_smp;
 	atomic_t ioerr_cnt;
 	atomic_t iotmo_cnt;
--
2.43.7
* Re: [PATCH 3/3] scsi: align scsi_device iodone_cnt to avoid cache line contention
From: Bart Van Assche @ 2026-04-02 15:58 UTC
To: Sumit Saxena, martin.petersen, axboe
Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, James Rizzo
On 4/2/26 12:46 AM, Sumit Saxena wrote:
> From: James Rizzo <james.rizzo@broadcom.com>
>
> Place iodone_cnt on its own cache line so it does not share a cache line
> with iorequest_cnt, avoiding significant performance hits from false
> sharing when request and completion paths update these counters on some
> CPU architectures.
>
> Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
> ---
> include/scsi/scsi_device.h | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 9c2a7bbe5891..86c2a3a6b206 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -272,7 +272,9 @@ struct scsi_device {
>  #define SCSI_DEFAULT_DEVICE_BLOCKED	3
>
>  	atomic_t iorequest_cnt;
> -	atomic_t iodone_cnt;
> +	/* ensure iorequest_cnt and iodone_cnt are on different cache lines to avoid significant
> +	   performance hits on cache line contention on some CPU architectures */
> +	atomic_t iodone_cnt ____cacheline_aligned_in_smp;
>  	atomic_t ioerr_cnt;
>  	atomic_t iotmo_cnt;
Has it been considered to change both iorequest_cnt and iodone_cnt into
per-cpu counters?
Thanks,
Bart.
* Re: [PATCH 3/3] scsi: align scsi_device iodone_cnt to avoid cache line contention
From: Sumit Saxena @ 2026-04-09 6:17 UTC
To: Bart Van Assche
Cc: martin.petersen, axboe, linux-scsi, linux-block,
mpi3mr-linuxdrv.pdl, James Rizzo
> Has it been considered to change both iorequest_cnt and iodone_cnt into
> per-cpu counters?
We're testing with per-cpu counters, and initial results look good. Once
the testing is complete, I will post the next version.
Thanks,
Sumit
Thread overview: 8+ messages
2026-04-02 7:46 [PATCH 0/3] scsi/block: NUMA-local allocations and false-sharing fixes Sumit Saxena
2026-04-02 7:46 ` [PATCH 1/3] scsi: use NUMA-local allocation for sdev and starget Sumit Saxena
2026-04-02 7:46 ` [PATCH 2/3] block: align nr_active_requests_shared_tags to avoid cache line contention Sumit Saxena
2026-04-02 15:54 ` Bart Van Assche
2026-04-09 6:13 ` Sumit Saxena
2026-04-02 7:46 ` [PATCH 3/3] scsi: align scsi_device iodone_cnt " Sumit Saxena
2026-04-02 15:58 ` Bart Van Assche
2026-04-09 6:17 ` Sumit Saxena