* [PATCH v3 0/2] nvme/pci: PRP list DMA pool partitioning

From: Caleb Sander Mateos @ 2025-04-21 16:55 UTC
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: Kanchan Joshi, linux-nvme, linux-kernel, Caleb Sander Mateos

NVMe commands with more than 4 KB of data allocate PRP list pages from
the per-nvme_device dma_pool prp_page_pool or prp_small_pool. Each call
to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock.
These device-global spinlocks are a significant source of contention
when many CPUs are submitting to the same NVMe devices. On a workload
issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2 NUMA nodes
to 23 NVMe devices, we observed 2.4% of CPU time spent in
_raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.

Ideally, the dma_pools would be per-hctx to minimize contention. But
that could impose considerable resource costs in a system with many
NVMe devices and CPUs.

As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each
nvme_queue to the set of DMA pools corresponding to its device and its
hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by
about half, to 1.2%. Preventing the sharing of PRP list pages across
NUMA nodes also makes them cheaper to initialize.

Caleb Sander Mateos (2):
  nvme/pci: factor out nvme_init_hctx() helper
  nvme/pci: make PRP list DMA pools per-NUMA-node

 drivers/nvme/host/pci.c | 171 +++++++++++++++++++++++-----------------
 1 file changed, 98 insertions(+), 73 deletions(-)

v3: simplify nvme_release_prp_pools() (Keith)
v2:
- Initialize admin nvme_queue's nvme_prp_dma_pools (Kanchan)
- Shrink nvme_dev's prp_pools array from MAX_NUMNODES to nr_node_ids
  (Kanchan)

-- 
2.45.2
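For a sense of scale, an illustrative calculation (not from the patch
set) of how many DMA pools each partitioning scheme implies on the test
system above, assuming one I/O queue per CPU and two pools (one large,
one small) per partition:

	per-device:     23 devices x 2 pools        =  46 pools (status quo)
	per-hctx:       23 devices x 16 queues x 2  = 736 pools (ideal, costly)
	per-NUMA-node:  23 devices x 2 nodes x 2    =  92 pools (this series)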
* [PATCH v3 1/2] nvme/pci: factor out nvme_init_hctx() helper

From: Caleb Sander Mateos @ 2025-04-21 16:55 UTC
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: Kanchan Joshi, linux-nvme, linux-kernel, Caleb Sander Mateos

nvme_init_hctx() and nvme_admin_init_hctx() are very similar. In
preparation for adding more logic, factor out an nvme_init_hctx()
helper. Rename the old nvme_init_hctx() to nvme_io_init_hctx().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
---
 drivers/nvme/host/pci.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b178d52eac1b..642890ddada5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -395,32 +395,33 @@ static int nvme_pci_npages_prp(void)
 	unsigned max_bytes = (NVME_MAX_KB_SZ * 1024) + NVME_CTRL_PAGE_SIZE;
 	unsigned nprps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
 	return DIV_ROUND_UP(8 * nprps, NVME_CTRL_PAGE_SIZE - 8);
 }
 
-static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
-		unsigned int hctx_idx)
+static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data, unsigned qid)
 {
 	struct nvme_dev *dev = to_nvme_dev(data);
-	struct nvme_queue *nvmeq = &dev->queues[0];
-
-	WARN_ON(hctx_idx != 0);
-	WARN_ON(dev->admin_tagset.tags[0] != hctx->tags);
+	struct nvme_queue *nvmeq = &dev->queues[qid];
+	struct blk_mq_tags *tags;
 
+	tags = qid ? dev->tagset.tags[qid - 1] : dev->admin_tagset.tags[0];
+	WARN_ON(tags != hctx->tags);
 	hctx->driver_data = nvmeq;
 	return 0;
 }
 
-static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
-		unsigned int hctx_idx)
+static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+		unsigned int hctx_idx)
 {
-	struct nvme_dev *dev = to_nvme_dev(data);
-	struct nvme_queue *nvmeq = &dev->queues[hctx_idx + 1];
+	WARN_ON(hctx_idx != 0);
+	return nvme_init_hctx(hctx, data, 0);
+}
 
-	WARN_ON(dev->tagset.tags[hctx_idx] != hctx->tags);
-	hctx->driver_data = nvmeq;
-	return 0;
+static int nvme_io_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+		unsigned int hctx_idx)
+{
+	return nvme_init_hctx(hctx, data, hctx_idx + 1);
 }
 
 static int nvme_pci_init_request(struct blk_mq_tag_set *set,
 		struct request *req, unsigned int hctx_idx, unsigned int numa_node)
@@ -1813,11 +1814,11 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.queue_rqs	= nvme_queue_rqs,
 	.complete	= nvme_pci_complete_rq,
 	.commit_rqs	= nvme_commit_rqs,
-	.init_hctx	= nvme_init_hctx,
+	.init_hctx	= nvme_io_init_hctx,
 	.init_request	= nvme_pci_init_request,
 	.map_queues	= nvme_pci_map_queues,
 	.timeout	= nvme_timeout,
 	.poll		= nvme_poll,
 };
-- 
2.45.2
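To make the queue-ID convention above explicit, an illustrative summary
(not part of the patch) of how the helper maps blk-mq hardware context
indices to nvme_queue IDs and tag sets:

	/*
	 * qid 0 is the admin queue; I/O hctx index i maps to qid i + 1:
	 *
	 *   nvme_admin_init_hctx: hctx_idx 0 -> qid 0     -> dev->admin_tagset.tags[0]
	 *   nvme_io_init_hctx:    hctx_idx i -> qid i + 1 -> dev->tagset.tags[i]
	 */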
* Re: [PATCH v3 1/2] nvme/pci: factor out nvme_init_hctx() helper

From: Kanchan Joshi @ 2025-04-22 7:38 UTC
To: Caleb Sander Mateos, Keith Busch, Jens Axboe, Christoph Hellwig,
    Sagi Grimberg
Cc: linux-nvme, linux-kernel

Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
* Re: [PATCH v3 1/2] nvme/pci: factor out nvme_init_hctx() helper

From: Sagi Grimberg @ 2025-04-22 11:45 UTC
To: Caleb Sander Mateos, Keith Busch, Jens Axboe, Christoph Hellwig
Cc: Kanchan Joshi, linux-nvme, linux-kernel

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* [PATCH v3 2/2] nvme/pci: make PRP list DMA pools per-NUMA-node

From: Caleb Sander Mateos @ 2025-04-21 16:55 UTC
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: Kanchan Joshi, linux-nvme, linux-kernel, Caleb Sander Mateos

NVMe commands with more than 4 KB of data allocate PRP list pages from
the per-nvme_device dma_pool prp_page_pool or prp_small_pool. Each call
to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock.
These device-global spinlocks are a significant source of contention
when many CPUs are submitting to the same NVMe devices. On a workload
issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2 NUMA nodes
to 23 NVMe devices, we observed 2.4% of CPU time spent in
_raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.

Ideally, the dma_pools would be per-hctx to minimize contention. But
that could impose considerable resource costs in a system with many
NVMe devices and CPUs.

As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each
nvme_queue to the set of DMA pools corresponding to its device and its
hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by
about half, to 1.2%. Preventing the sharing of PRP list pages across
NUMA nodes also makes them cheaper to initialize.

Link: https://lore.kernel.org/linux-nvme/CADUfDZqa=OOTtTTznXRDmBQo1WrFcDw1hBA7XwM7hzJ-hpckcA@mail.gmail.com/T/#u
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
---
 drivers/nvme/host/pci.c | 144 +++++++++++++++++++++++-----------------
 1 file changed, 84 insertions(+), 60 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 642890ddada5..7d86d1ec989a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -16,10 +16,11 @@
 #include <linux/kstrtox.h>
 #include <linux/memremap.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/nodemask.h>
 #include <linux/once.h>
 #include <linux/pci.h>
 #include <linux/suspend.h>
 #include <linux/t10-pi.h>
 #include <linux/types.h>
@@ -110,21 +111,24 @@ struct nvme_queue;
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static void nvme_delete_io_queues(struct nvme_dev *dev);
 static void nvme_update_attrs(struct nvme_dev *dev);
 
+struct nvme_prp_dma_pools {
+	struct dma_pool *large;
+	struct dma_pool *small;
+};
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
 struct nvme_dev {
 	struct nvme_queue *queues;
 	struct blk_mq_tag_set tagset;
 	struct blk_mq_tag_set admin_tagset;
 	u32 __iomem *dbs;
 	struct device *dev;
-	struct dma_pool *prp_page_pool;
-	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
 	unsigned max_qid;
 	unsigned io_queues[HCTX_MAX_TYPES];
 	unsigned int num_vecs;
 	u32 q_depth;
@@ -160,10 +164,11 @@ struct nvme_dev {
 	struct nvme_host_mem_buf_desc *host_mem_descs;
 	void **host_mem_desc_bufs;
 	unsigned int nr_allocated_queues;
 	unsigned int nr_write_queues;
 	unsigned int nr_poll_queues;
+	struct nvme_prp_dma_pools prp_pools[];
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
 {
 	return param_set_uint_minmax(val, kp, NVME_PCI_MIN_QUEUE_SIZE,
@@ -189,10 +194,11 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
  * An NVM Express queue.  Each device has at least two (one for admin
  * commands and one for I/O commands).
  */
 struct nvme_queue {
 	struct nvme_dev *dev;
+	struct nvme_prp_dma_pools prp_pools;
 	spinlock_t sq_lock;
 	void *sq_cmds;
 	 /* only used for poll queues: */
 	spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
 	struct nvme_completion *cqes;
@@ -395,18 +401,67 @@ static int nvme_pci_npages_prp(void)
 	unsigned max_bytes = (NVME_MAX_KB_SZ * 1024) + NVME_CTRL_PAGE_SIZE;
 	unsigned nprps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
 	return DIV_ROUND_UP(8 * nprps, NVME_CTRL_PAGE_SIZE - 8);
 }
 
+static struct nvme_prp_dma_pools *
+nvme_setup_prp_pools(struct nvme_dev *dev, unsigned numa_node)
+{
+	struct nvme_prp_dma_pools *prp_pools;
+	size_t small_align = 256;
+
+	prp_pools = &dev->prp_pools[numa_node < nr_node_ids ? numa_node : 0];
+	if (prp_pools->small)
+		return prp_pools; /* already initialized */
+
+	prp_pools->large = dma_pool_create("prp list page", dev->dev,
+					   NVME_CTRL_PAGE_SIZE,
+					   NVME_CTRL_PAGE_SIZE, 0);
+	if (!prp_pools->large)
+		return ERR_PTR(-ENOMEM);
+
+	if (dev->ctrl.quirks & NVME_QUIRK_DMAPOOL_ALIGN_512)
+		small_align = 512;
+
+	/* Optimisation for I/Os between 4k and 128k */
+	prp_pools->small = dma_pool_create("prp list 256", dev->dev,
+					   256, small_align, 0);
+	if (!prp_pools->small) {
+		dma_pool_destroy(prp_pools->large);
+		prp_pools->large = NULL;
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return prp_pools;
+}
+
+static void nvme_release_prp_pools(struct nvme_dev *dev)
+{
+	unsigned i;
+
+	for (i = 0; i < nr_node_ids; i++) {
+		struct nvme_prp_dma_pools *prp_pools = &dev->prp_pools[i];
+
+		dma_pool_destroy(prp_pools->large);
+		dma_pool_destroy(prp_pools->small);
+	}
+}
+
 static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data, unsigned qid)
 {
 	struct nvme_dev *dev = to_nvme_dev(data);
 	struct nvme_queue *nvmeq = &dev->queues[qid];
+	struct nvme_prp_dma_pools *prp_pools;
 	struct blk_mq_tags *tags;
 
 	tags = qid ? dev->tagset.tags[qid - 1] : dev->admin_tagset.tags[0];
 	WARN_ON(tags != hctx->tags);
+	prp_pools = nvme_setup_prp_pools(dev, hctx->numa_node);
+	if (IS_ERR(prp_pools))
+		return PTR_ERR(prp_pools);
+
+	nvmeq->prp_pools = *prp_pools;
 	hctx->driver_data = nvmeq;
 	return 0;
 }
 
 static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
@@ -536,27 +591,28 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
 	if (!sgl_threshold || avg_seg_size < sgl_threshold)
 		return nvme_req(req)->flags & NVME_REQ_USERCMD;
 	return true;
 }
 
-static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
+static void nvme_free_prps(struct nvme_queue *nvmeq, struct request *req)
 {
 	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	dma_addr_t dma_addr = iod->first_dma;
 	int i;
 
 	for (i = 0; i < iod->nr_allocations; i++) {
 		__le64 *prp_list = iod->list[i].prp_list;
 		dma_addr_t next_dma_addr = le64_to_cpu(prp_list[last_prp]);
 
-		dma_pool_free(dev->prp_page_pool, prp_list, dma_addr);
+		dma_pool_free(nvmeq->prp_pools.large, prp_list, dma_addr);
 		dma_addr = next_dma_addr;
 	}
 }
 
-static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
+static void nvme_unmap_data(struct nvme_dev *dev, struct nvme_queue *nvmeq,
+			    struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
 	if (iod->dma_len) {
 		dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
@@ -567,17 +623,17 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 
 	WARN_ON_ONCE(!iod->sgt.nents);
 
 	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
 	if (iod->nr_allocations == 0)
-		dma_pool_free(dev->prp_small_pool, iod->list[0].sg_list,
+		dma_pool_free(nvmeq->prp_pools.small, iod->list[0].sg_list,
 			      iod->first_dma);
 	else if (iod->nr_allocations == 1)
-		dma_pool_free(dev->prp_page_pool, iod->list[0].sg_list,
+		dma_pool_free(nvmeq->prp_pools.large, iod->list[0].sg_list,
 			      iod->first_dma);
 	else
-		nvme_free_prps(dev, req);
+		nvme_free_prps(nvmeq, req);
 	mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
 {
@@ -591,11 +647,11 @@ static void nvme_print_sgl(struct scatterlist *sgl, int nents)
 			i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
 			sg_dma_len(sg));
 	}
 }
 
-static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
+static blk_status_t nvme_pci_setup_prps(struct nvme_queue *nvmeq,
 		struct request *req, struct nvme_rw_command *cmnd)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct dma_pool *pool;
 	int length = blk_rq_payload_bytes(req);
@@ -627,14 +683,14 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 		goto done;
 	}
 
 	nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
 	if (nprps <= (256 / 8)) {
-		pool = dev->prp_small_pool;
+		pool = nvmeq->prp_pools.small;
 		iod->nr_allocations = 0;
 	} else {
-		pool = dev->prp_page_pool;
+		pool = nvmeq->prp_pools.large;
 		iod->nr_allocations = 1;
 	}
 
 	prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
 	if (!prp_list) {
@@ -672,11 +728,11 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 done:
 	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
 	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
 	return BLK_STS_OK;
 free_prps:
-	nvme_free_prps(dev, req);
+	nvme_free_prps(nvmeq, req);
 	return BLK_STS_RESOURCE;
 bad_sgl:
 	WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
 			"Invalid SGL for payload:%d nents:%d\n",
 			blk_rq_payload_bytes(req), iod->sgt.nents);
@@ -697,11 +753,11 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 	sge->addr = cpu_to_le64(dma_addr);
 	sge->length = cpu_to_le32(entries * sizeof(*sge));
 	sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
 }
 
-static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
+static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
 		struct request *req, struct nvme_rw_command *cmd)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct dma_pool *pool;
 	struct nvme_sgl_desc *sg_list;
@@ -717,14 +773,14 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 		nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
 		return BLK_STS_OK;
 	}
 
 	if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
-		pool = dev->prp_small_pool;
+		pool = nvmeq->prp_pools.small;
 		iod->nr_allocations = 0;
 	} else {
-		pool = dev->prp_page_pool;
+		pool = nvmeq->prp_pools.large;
 		iod->nr_allocations = 1;
 	}
 
 	sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
 	if (!sg_list) {
@@ -784,16 +840,16 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
 }
 
 static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		struct nvme_command *cmnd)
 {
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int rc;
 
 	if (blk_rq_nr_phys_segments(req) == 1) {
-		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
 			if (!nvme_pci_metadata_use_sgls(dev, req) &&
 			    (bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) +
@@ -824,13 +880,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		ret = BLK_STS_TARGET;
 		goto out_free_sg;
 	}
 
 	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
-		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
+		ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);
 	else
-		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
+		ret = nvme_pci_setup_prps(nvmeq, req, &cmnd->rw);
 	if (ret != BLK_STS_OK)
 		goto out_unmap_sg;
 	return BLK_STS_OK;
 
 out_unmap_sg:
@@ -841,10 +897,11 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 }
 
 static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
 					     struct request *req)
 {
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct nvme_rw_command *cmnd = &iod->cmd.rw;
 	struct nvme_sgl_desc *sg_list;
 	struct scatterlist *sgl, *sg;
 	unsigned int entries;
@@ -864,11 +921,11 @@ static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
 	rc = dma_map_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req),
 			     DMA_ATTR_NO_WARN);
 	if (rc)
 		goto out_free_sg;
 
-	sg_list = dma_pool_alloc(dev->prp_small_pool, GFP_ATOMIC, &sgl_dma);
+	sg_list = dma_pool_alloc(nvmeq->prp_pools.small, GFP_ATOMIC, &sgl_dma);
 	if (!sg_list)
 		goto out_unmap_sg;
 
 	entries = iod->meta_sgt.nents;
 	iod->meta_list.sg_list = sg_list;
@@ -946,11 +1003,11 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	nvme_start_request(req);
 	return BLK_STS_OK;
 out_unmap_data:
 	if (blk_rq_nr_phys_segments(req))
-		nvme_unmap_data(dev, req);
+		nvme_unmap_data(dev, req->mq_hctx->driver_data, req);
 out_free_cmd:
 	nvme_cleanup_cmd(req);
 	return ret;
 }
 
@@ -1036,10 +1093,11 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
 	nvme_submit_cmds(nvmeq, &submit_list);
 	*rqlist = requeue_list;
 }
 
 static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
+						struct nvme_queue *nvmeq,
 						struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
 	if (!iod->meta_sgt.nents) {
@@ -1047,11 +1105,11 @@ static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
 			       rq_integrity_vec(req).bv_len,
 			       rq_dma_dir(req));
 		return;
 	}
 
-	dma_pool_free(dev->prp_small_pool, iod->meta_list.sg_list,
+	dma_pool_free(nvmeq->prp_pools.small, iod->meta_list.sg_list,
 		      iod->meta_dma);
 	dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0);
 	mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool);
 }
 
@@ -1059,14 +1117,14 @@ static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_dev *dev = nvmeq->dev;
 
 	if (blk_integrity_rq(req))
-		nvme_unmap_metadata(dev, req);
+		nvme_unmap_metadata(dev, nvmeq, req);
 
 	if (blk_rq_nr_phys_segments(req))
-		nvme_unmap_data(dev, req);
+		nvme_unmap_data(dev, nvmeq, req);
 }
 
 static void nvme_pci_complete_rq(struct request *req)
 {
 	nvme_pci_unmap_rq(req);
@@ -2839,39 +2897,10 @@ static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown)
 		return -EBUSY;
 	nvme_dev_disable(dev, shutdown);
 	return 0;
 }
 
-static int nvme_setup_prp_pools(struct nvme_dev *dev)
-{
-	size_t small_align = 256;
-
-	dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
-					     NVME_CTRL_PAGE_SIZE,
-					     NVME_CTRL_PAGE_SIZE, 0);
-	if (!dev->prp_page_pool)
-		return -ENOMEM;
-
-	if (dev->ctrl.quirks & NVME_QUIRK_DMAPOOL_ALIGN_512)
-		small_align = 512;
-
-	/* Optimisation for I/Os between 4k and 128k */
-	dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
-					      256, small_align, 0);
-	if (!dev->prp_small_pool) {
-		dma_pool_destroy(dev->prp_page_pool);
-		return -ENOMEM;
-	}
-	return 0;
-}
-
-static void nvme_release_prp_pools(struct nvme_dev *dev)
-{
-	dma_pool_destroy(dev->prp_page_pool);
-	dma_pool_destroy(dev->prp_small_pool);
-}
-
 static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
 {
 	size_t meta_size = sizeof(struct scatterlist) * (NVME_MAX_META_SEGS + 1);
 	size_t alloc_size = sizeof(struct scatterlist) * NVME_MAX_SEGS;
 
@@ -3182,11 +3211,12 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	unsigned long quirks = id->driver_data;
 	int node = dev_to_node(&pdev->dev);
 	struct nvme_dev *dev;
 	int ret = -ENOMEM;
 
-	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
+	dev = kzalloc_node(sizeof(*dev) + nr_node_ids * sizeof(*dev->prp_pools),
+			   GFP_KERNEL, node);
 	if (!dev)
 		return ERR_PTR(-ENOMEM);
 	INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
 	mutex_init(&dev->shutdown_lock);
 
@@ -3257,17 +3287,13 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	result = nvme_dev_map(dev);
 	if (result)
 		goto out_uninit_ctrl;
 
-	result = nvme_setup_prp_pools(dev);
-	if (result)
-		goto out_dev_unmap;
-
 	result = nvme_pci_alloc_iod_mempool(dev);
 	if (result)
-		goto out_release_prp_pools;
+		goto out_dev_unmap;
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
 	result = nvme_pci_enable(dev);
 	if (result)
@@ -3339,12 +3365,10 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	nvme_dbbuf_dma_free(dev);
 	nvme_free_queues(dev, 0);
 out_release_iod_mempool:
 	mempool_destroy(dev->iod_mempool);
 	mempool_destroy(dev->iod_meta_mempool);
-out_release_prp_pools:
-	nvme_release_prp_pools(dev);
out_dev_unmap:
 	nvme_dev_unmap(dev);
 out_uninit_ctrl:
 	nvme_uninit_ctrl(&dev->ctrl);
 out_put_ctrl:
-- 
2.45.2
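Both pools hand out fixed-size DMA-coherent blocks, and every
allocation and free serializes on the pool's internal spinlock. A
minimal sketch of the per-request hot path (illustrative only,
simplified from the code above):

	dma_addr_t prp_dma;
	__le64 *prp_list;

	/* dma_pool_alloc() takes the pool's spinlock internally */
	prp_list = dma_pool_alloc(nvmeq->prp_pools.large, GFP_ATOMIC, &prp_dma);
	if (!prp_list)
		return BLK_STS_RESOURCE;
	/* ... fill in PRP entries and issue the command ... */

	/* on completion, dma_pool_free() takes the same spinlock again */
	dma_pool_free(nvmeq->prp_pools.large, prp_list, prp_dma);

With per-device pools, every queue on every CPU funnels through the
same two locks; with per-node pools, only queues on the same NUMA node
share them.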
* Re: [PATCH v3 2/2] nvme/pci: make PRP list DMA pools per-NUMA-node

From: Kanchan Joshi @ 2025-04-22 7:38 UTC
To: Caleb Sander Mateos, Keith Busch, Jens Axboe, Christoph Hellwig,
    Sagi Grimberg
Cc: linux-nvme, linux-kernel

On 4/21/2025 10:25 PM, Caleb Sander Mateos wrote:
> +	prp_pools = &dev->prp_pools[numa_node < nr_node_ids ? numa_node : 0];

NUMA node values are always in the range [0, nr_node_ids - 1], so is
the condition "numa_node < nr_node_ids" really required?

The patch looks good to me though.

Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
* Re: [PATCH v3 2/2] nvme/pci: make PRP list DMA pools per-NUMA-node

From: Sagi Grimberg @ 2025-04-22 11:48 UTC
To: Caleb Sander Mateos, Keith Busch, Jens Axboe, Christoph Hellwig
Cc: Kanchan Joshi, linux-nvme, linux-kernel

On 21/04/2025 19:55, Caleb Sander Mateos wrote:
> NVMe commands with more than 4 KB of data allocate PRP list pages from
> the per-nvme_device dma_pool prp_page_pool or prp_small_pool. Each call
> to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock.
> [...]
> +static struct nvme_prp_dma_pools *
> +nvme_setup_prp_pools(struct nvme_dev *dev, unsigned numa_node)
> +{
> +	struct nvme_prp_dma_pools *prp_pools;
> +	size_t small_align = 256;
> +
> +	prp_pools = &dev->prp_pools[numa_node < nr_node_ids ? numa_node : 0];

I'm assuming you are checking numa_node == NUMA_NO_NODE?
Perhaps it is better to check that explicitly?

Otherwise, looks good.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* Re: [PATCH v3 2/2] nvme/pci: make PRP list DMA pools per-NUMA-node

From: Caleb Sander Mateos @ 2025-04-22 14:59 UTC
To: Sagi Grimberg
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Kanchan Joshi,
    linux-nvme, linux-kernel

On Tue, Apr 22, 2025 at 4:48 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> On 21/04/2025 19:55, Caleb Sander Mateos wrote:
> > [...]
> > +	prp_pools = &dev->prp_pools[numa_node < nr_node_ids ? numa_node : 0];
>
> I'm assuming you are checking numa_node == NUMA_NO_NODE?
> Perhaps it is better to check that explicitly?

Yeah, I wasn't sure whether that was a possible value of
hctx->numa_node. The field is defined as an unsigned and NUMA_NO_NODE
is defined as -1, so I guess it shouldn't ever be set to NUMA_NO_NODE?
I can replace this with just numa_node.

Thanks,
Caleb
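For illustration, a minimal standalone sketch (not from the thread) of
why the existing guard is harmless even if NUMA_NO_NODE ever did reach
the unsigned field, assuming the kernel's NUMA_NO_NODE == -1 definition:

	#include <linux/numa.h>		/* NUMA_NO_NODE == -1 */
	#include <linux/nodemask.h>	/* nr_node_ids */

	unsigned numa_node = NUMA_NO_NODE;	/* -1 wraps to UINT_MAX */

	/* UINT_MAX < nr_node_ids is always false, so the index falls
	 * back to 0 instead of reading past the end of the array: */
	unsigned idx = numa_node < nr_node_ids ? numa_node : 0;	/* idx == 0 */

So the ternary selects node 0 for any out-of-range value, and dropping
it is safe only because hctx->numa_node is always a valid node ID in
practice.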