public inbox for linux-block@vger.kernel.org
* [PATCH 0/3] scsi/block: NUMA-local allocations and false-sharing fixes
@ 2026-04-02  7:46 Sumit Saxena
  2026-04-02  7:46 ` [PATCH 1/3] scsi: use NUMA-local allocation for sdev and starget Sumit Saxena
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Sumit Saxena @ 2026-04-02  7:46 UTC (permalink / raw)
  To: martin.petersen, axboe
  Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, Sumit Saxena

This series contains three performance improvements targeting the SCSI
and block layers on multi-socket NUMA systems.

On multi-socket NUMA systems we observed extreme I/O throughput variance
of 50-60% between runs.  This series identifies and fixes two root causes:
cross-node memory accesses due to NUMA-unaware allocations in the scan
path, and false sharing between hot atomic counters in
struct request_queue and struct scsi_device.

The first patch makes the SCSI scan path allocate scsi_device and
scsi_target on the NUMA node of the host adapter.

The second patch addresses false sharing in struct request_queue.
Because it touches include/linux/blkdev.h, it needs review from
linux-block. An Acked-by from the block maintainer is requested before
merging via the SCSI tree.

The third patch addresses a false-sharing problem in struct
scsi_device.
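
As an illustration of the layout change behind patches 2 and 3, here is a
minimal userspace sketch. It is not kernel code: the struct and field names
are hypothetical stand-ins, the 64-byte line size is an assumption (the
kernel uses an architecture-specific value), and C11 alignas plays the role
of ____cacheline_aligned_in_smp.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache line size; real hardware and kernel configs may differ. */
#define CACHE_LINE 64

/* Hypothetical stand-in for two adjacent hot counters. */
struct counters_packed {
	int submitted;	/* written on the submission path */
	int completed;	/* written on the completion path */
};

/* Same fields, with the second counter moved to the start of a new line. */
struct counters_aligned {
	int submitted;
	alignas(CACHE_LINE) int completed;
};

/*
 * Cache line index of a byte offset, counted from the start of the struct.
 * This is only meaningful when the struct itself starts on a line boundary.
 */
static size_t line_of(size_t offset)
{
	return offset / CACHE_LINE;
}

/* Compile-time check: the aligned counter begins exactly one line in. */
static_assert(offsetof(struct counters_aligned, completed) == CACHE_LINE,
	      "completed should start on its own cache line");
```

In the packed layout both counters fall on line 0, so a CPU updating one
invalidates the line for a CPU updating the other. Note that the alignment
attribute only fixes where the member starts; members declared after it can
still land on the same line as the aligned counter.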

Performance notes:

Tested on a dual-socket NUMA system with an mpi3mr HBA, running fio
(random read, 4K, QD 64, 16 jobs, 60s, direct I/O).
Throughput figures are in KIOPS (thousands of IOPS):

  Configuration                    Avg KIOPS   Range (KIOPS)   Spread
  Baseline                         6,255       4,200 - 6,700   ~37%
  Baseline + patches 2-3 (align)   6,653       6,000 - 7,000   ~15%
  Baseline + all patches (1-3)     6,649       6,400 - 7,000    ~9%

Key findings:
  - Cacheline alignment patches (2-3) raise average IOPS by ~6% and
    cut throughput spread from ~37% to ~15%.
  - Adding the NUMA allocation patch (1) further tightens the spread
    to ~9% with negligible impact on average throughput.
  - The combined effect reduces the observed 50-60% run-to-run variance
    to under 10%, significantly improving workload predictability.
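
The table does not say how the Spread column was derived. As a rough
cross-check only, the sketch below assumes spread = (max - min) / max,
which is an assumption and not a formula confirmed in this thread. The
helper names are hypothetical.

```c
#include <assert.h>

/* Assumed definition: range of the runs relative to the best run. */
static double spread_over_max(double min_kiops, double max_kiops)
{
	assert(max_kiops > 0.0);
	return (max_kiops - min_kiops) / max_kiops;
}

/* Relative change of a patched average against the baseline average. */
static double relative_gain(double baseline_avg, double patched_avg)
{
	assert(baseline_avg > 0.0);
	return (patched_avg - baseline_avg) / baseline_avg;
}
```

Under that assumption, the baseline row (4,200 - 6,700) gives about 37% and
the all-patches row (6,400 - 7,000) gives about 9%, in line with the table.
The patches 2-3 row (6,000 - 7,000) gives about 14%, slightly below the
reported ~15%, so the table may use a different denominator for that row.
relative_gain(6255, 6653) is about 6%, matching the first key finding.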

No functional regressions observed.

This patch series is based on Martin's for-next tree.

James Rizzo (3):
  scsi: use NUMA-local allocation for sdev and starget
  block: align nr_active_requests_shared_tags to avoid cache line
    contention
  scsi: align scsi_device iodone_cnt to avoid cache line contention

 drivers/scsi/scsi_scan.c   | 9 ++++++---
 include/linux/blkdev.h     | 4 +++-
 include/scsi/scsi_device.h | 4 +++-
 3 files changed, 12 insertions(+), 5 deletions(-)

-- 
2.43.7


* [PATCH 1/3] scsi: use NUMA-local allocation for sdev and starget
  2026-04-02  7:46 [PATCH 0/3] scsi/block: NUMA-local allocations and false-sharing fixes Sumit Saxena
@ 2026-04-02  7:46 ` Sumit Saxena
  2026-04-02  7:46 ` [PATCH 2/3] block: align nr_active_requests_shared_tags to avoid cache line contention Sumit Saxena
  2026-04-02  7:46 ` [PATCH 3/3] scsi: align scsi_device iodone_cnt " Sumit Saxena
  2 siblings, 0 replies; 8+ messages in thread
From: Sumit Saxena @ 2026-04-02  7:46 UTC (permalink / raw)
  To: martin.petersen, axboe
  Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, James Rizzo,
	Sumit Saxena

From: James Rizzo <james.rizzo@broadcom.com>

Allocate scsi_device and scsi_target on the same NUMA node as the host
adapter's DMA device to improve memory locality and reduce cross-node
traffic.

Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
 drivers/scsi/scsi_scan.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index efcaf85ff699..b98c5b7d8018 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -34,6 +34,7 @@
 #include <linux/kthread.h>
 #include <linux/spinlock.h>
 #include <linux/async.h>
+#include <linux/topology.h>
 #include <linux/slab.h>
 #include <linux/unaligned.h>
 
@@ -286,9 +287,10 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	int display_failure_msg = 1, ret;
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
 	struct queue_limits lim;
+	int node = dev_to_node(shost->dma_dev);
 
-	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
-		       GFP_KERNEL);
+	sdev = kzalloc_node(sizeof(*sdev) + shost->transportt->device_size,
+		       GFP_KERNEL, node);
 	if (!sdev)
 		goto out;
 
@@ -504,8 +506,9 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
 	struct scsi_target *starget;
 	struct scsi_target *found_target;
 	int error, ref_got;
+	int node = dev_to_node(shost->dma_dev);
 
-	starget = kzalloc(size, GFP_KERNEL);
+	starget = kzalloc_node(size, GFP_KERNEL, node);
 	if (!starget) {
 		printk(KERN_ERR "%s: allocation failure\n", __func__);
 		return NULL;
-- 
2.43.7



* [PATCH 2/3] block: align nr_active_requests_shared_tags to avoid cache line contention
  2026-04-02  7:46 [PATCH 0/3] scsi/block: NUMA-local allocations and false-sharing fixes Sumit Saxena
  2026-04-02  7:46 ` [PATCH 1/3] scsi: use NUMA-local allocation for sdev and starget Sumit Saxena
@ 2026-04-02  7:46 ` Sumit Saxena
  2026-04-02 15:54   ` Bart Van Assche
  2026-04-02  7:46 ` [PATCH 3/3] scsi: align scsi_device iodone_cnt " Sumit Saxena
  2 siblings, 1 reply; 8+ messages in thread
From: Sumit Saxena @ 2026-04-02  7:46 UTC (permalink / raw)
  To: martin.petersen, axboe
  Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, James Rizzo,
	Sumit Saxena

From: James Rizzo <james.rizzo@broadcom.com>

Place nr_active_requests_shared_tags on its own cache line so it does not
share a cache line with nr_requests and other hot fields, avoiding
significant performance hits from false sharing on some CPU architectures.

Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
 include/linux/blkdev.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d463b9b5a0a5..7ed566c81c1b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -561,7 +561,9 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 
-	atomic_t		nr_active_requests_shared_tags;
+	/* ensure nr_active_requests_shared_tags and nr_requests are on different cache lines
+	   to avoid significant performance hits on cache line contention on some CPU architectures */
+	atomic_t		nr_active_requests_shared_tags ____cacheline_aligned_in_smp;
 
 	struct blk_mq_tags	*sched_shared_tags;
 
-- 
2.43.7



* [PATCH 3/3] scsi: align scsi_device iodone_cnt to avoid cache line contention
  2026-04-02  7:46 [PATCH 0/3] scsi/block: NUMA-local allocations and false-sharing fixes Sumit Saxena
  2026-04-02  7:46 ` [PATCH 1/3] scsi: use NUMA-local allocation for sdev and starget Sumit Saxena
  2026-04-02  7:46 ` [PATCH 2/3] block: align nr_active_requests_shared_tags to avoid cache line contention Sumit Saxena
@ 2026-04-02  7:46 ` Sumit Saxena
  2026-04-02 15:58   ` Bart Van Assche
  2 siblings, 1 reply; 8+ messages in thread
From: Sumit Saxena @ 2026-04-02  7:46 UTC (permalink / raw)
  To: martin.petersen, axboe
  Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, James Rizzo,
	Sumit Saxena

From: James Rizzo <james.rizzo@broadcom.com>

Place iodone_cnt on its own cache line so it does not share a cache line
with iorequest_cnt, avoiding significant performance hits from false
sharing when request and completion paths update these counters on some
CPU architectures.

Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
 include/scsi/scsi_device.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 9c2a7bbe5891..86c2a3a6b206 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -272,7 +272,9 @@ struct scsi_device {
 #define SCSI_DEFAULT_DEVICE_BLOCKED	3
 
 	atomic_t iorequest_cnt;
-	atomic_t iodone_cnt;
+	/* ensure iorequest_cnt and iodone_cnt are on different cache lines to avoid significant
+	   performance hits on cache line contention on some CPU architectures */
+	atomic_t iodone_cnt ____cacheline_aligned_in_smp;
 	atomic_t ioerr_cnt;
 	atomic_t iotmo_cnt;
 
-- 
2.43.7



* Re: [PATCH 2/3] block: align nr_active_requests_shared_tags to avoid cache line contention
  2026-04-02  7:46 ` [PATCH 2/3] block: align nr_active_requests_shared_tags to avoid cache line contention Sumit Saxena
@ 2026-04-02 15:54   ` Bart Van Assche
  2026-04-09  6:13     ` Sumit Saxena
  0 siblings, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2026-04-02 15:54 UTC (permalink / raw)
  To: Sumit Saxena, martin.petersen, axboe
  Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, James Rizzo

On 4/2/26 12:46 AM, Sumit Saxena wrote:
> From: James Rizzo <james.rizzo@broadcom.com>
> 
> Place nr_active_requests_shared_tags on its own cache line so it does not
> share a cache line with nr_requests and other hot fields, avoiding
> significant performance hits from false sharing on some CPU architectures.
> 
> Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
> ---
>   include/linux/blkdev.h | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index d463b9b5a0a5..7ed566c81c1b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -561,7 +561,9 @@ struct request_queue {
>   	struct timer_list	timeout;
>   	struct work_struct	timeout_work;
>   
> -	atomic_t		nr_active_requests_shared_tags;
> +	/* ensure nr_active_requests_shared_tags and nr_requests are on different cache lines
> +	   to avoid significant performance hits on cache line contention on some CPU architectures */
> +	atomic_t		nr_active_requests_shared_tags ____cacheline_aligned_in_smp;
>   
>   	struct blk_mq_tags	*sched_shared_tags;
>   

A possible alternative is this patch that removes
nr_active_requests_shared_tags:

https://lore.kernel.org/linux-block/20240529213921.3166462-1-bvanassche@acm.org/

Thanks,

Bart.


* Re: [PATCH 3/3] scsi: align scsi_device iodone_cnt to avoid cache line contention
  2026-04-02  7:46 ` [PATCH 3/3] scsi: align scsi_device iodone_cnt " Sumit Saxena
@ 2026-04-02 15:58   ` Bart Van Assche
  2026-04-09  6:17     ` Sumit Saxena
  0 siblings, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2026-04-02 15:58 UTC (permalink / raw)
  To: Sumit Saxena, martin.petersen, axboe
  Cc: linux-scsi, linux-block, mpi3mr-linuxdrv.pdl, James Rizzo

On 4/2/26 12:46 AM, Sumit Saxena wrote:
> From: James Rizzo <james.rizzo@broadcom.com>
> 
> Place iodone_cnt on its own cache line so it does not share a cache line
> with iorequest_cnt, avoiding significant performance hits from false
> sharing when request and completion paths update these counters on some
> CPU architectures.
> 
> Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
> ---
>   include/scsi/scsi_device.h | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 9c2a7bbe5891..86c2a3a6b206 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -272,7 +272,9 @@ struct scsi_device {
>   #define SCSI_DEFAULT_DEVICE_BLOCKED	3
>   
>   	atomic_t iorequest_cnt;
> -	atomic_t iodone_cnt;
> +	/* ensure iorequest_cnt and iodone_cnt are on different cache lines to avoid significant
> +	   performance hits on cache line contention on some CPU architectures */
> +	atomic_t iodone_cnt ____cacheline_aligned_in_smp;
>   	atomic_t ioerr_cnt;
>   	atomic_t iotmo_cnt;

Has it been considered to change both iorequest_cnt and iodone_cnt into
per-cpu counters?

Thanks,

Bart.
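
For readers unfamiliar with the per-cpu counter idea raised above, here is a
minimal userspace sketch of the general technique. It is not the kernel's
percpu_counter API and not code from this series: the type and function
names, the 64-byte line size, and the fixed shard count standing in for the
number of CPUs are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>

#define CACHE_LINE 64	/* assumed line size */
#define NR_SHARDS 4	/* hypothetical stand-in for the number of CPUs */

/* One counter slot per CPU, each starting on its own cache line. */
struct counter_shard {
	alignas(CACHE_LINE) atomic_long value;
};

struct sharded_counter {
	struct counter_shard shard[NR_SHARDS];
};

/* Writers touch only the slot for their own CPU, so they do not contend. */
static void sharded_inc(struct sharded_counter *c, unsigned int cpu)
{
	assert(cpu < NR_SHARDS);
	atomic_fetch_add_explicit(&c->shard[cpu].value, 1,
				  memory_order_relaxed);
}

/* Readers pay the cost instead: the total is the sum over every slot. */
static long sharded_sum(struct sharded_counter *c)
{
	long total = 0;

	for (unsigned int i = 0; i < NR_SHARDS; i++)
		total += atomic_load_explicit(&c->shard[i].value,
					      memory_order_relaxed);
	return total;
}
```

The trade-off is memory and read cost: each counter occupies one line per
CPU, and a sum taken while writers are active is not an exact snapshot.
That tends to suit statistics counters that are updated per I/O but read
rarely.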


* Re: [PATCH 2/3] block: align nr_active_requests_shared_tags to avoid cache line contention
  2026-04-02 15:54   ` Bart Van Assche
@ 2026-04-09  6:13     ` Sumit Saxena
  0 siblings, 0 replies; 8+ messages in thread
From: Sumit Saxena @ 2026-04-09  6:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: martin.petersen, axboe, linux-scsi, linux-block,
	mpi3mr-linuxdrv.pdl, James Rizzo

> A possible alternative is this patch that removes
> nr_active_requests_shared_tags:
>
> https://lore.kernel.org/linux-block/20240529213921.3166462-1-bvanassche@acm.org/

Sorry for the late reply. Let me test with your patch.

Thanks,
Sumit



* Re: [PATCH 3/3] scsi: align scsi_device iodone_cnt to avoid cache line contention
  2026-04-02 15:58   ` Bart Van Assche
@ 2026-04-09  6:17     ` Sumit Saxena
  0 siblings, 0 replies; 8+ messages in thread
From: Sumit Saxena @ 2026-04-09  6:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: martin.petersen, axboe, linux-scsi, linux-block,
	mpi3mr-linuxdrv.pdl, James Rizzo

> Has it been considered to change both iorequest_cnt and iodone_cnt into
> per-cpu counters?

We're testing with per-cpu counters, and initial results look good.
Once the testing is complete, I will post the next version.

Thanks,
Sumit



end of thread, other threads:[~2026-04-09  6:17 UTC | newest]

Thread overview: 8+ messages
2026-04-02  7:46 [PATCH 0/3] scsi/block: NUMA-local allocations and false-sharing fixes Sumit Saxena
2026-04-02  7:46 ` [PATCH 1/3] scsi: use NUMA-local allocation for sdev and starget Sumit Saxena
2026-04-02  7:46 ` [PATCH 2/3] block: align nr_active_requests_shared_tags to avoid cache line contention Sumit Saxena
2026-04-02 15:54   ` Bart Van Assche
2026-04-09  6:13     ` Sumit Saxena
2026-04-02  7:46 ` [PATCH 3/3] scsi: align scsi_device iodone_cnt " Sumit Saxena
2026-04-02 15:58   ` Bart Van Assche
2026-04-09  6:17     ` Sumit Saxena
