* [PATCH v3 0/2] nvme/pci: PRP list DMA pool partitioning

From: Caleb Sander Mateos @ 2025-04-21 16:55 UTC
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: Kanchan Joshi, linux-nvme, linux-kernel, Caleb Sander Mateos

NVMe commands with more than 4 KB of data allocate PRP list pages from
the per-nvme_device dma_pool prp_page_pool or prp_small_pool. Each call
to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock.
These device-global spinlocks are a significant source of contention
when many CPUs are submitting to the same NVMe devices. On a workload
issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2 NUMA nodes
to 23 NVMe devices, we observed 2.4% of CPU time spent in
_raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.

Ideally, the dma_pools would be per-hctx to minimize contention. But
that could impose considerable resource costs in a system with many
NVMe devices and CPUs.

As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each
nvme_queue to the set of DMA pools corresponding to its device and its
hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by
about half, to 1.2%. Preventing the sharing of PRP list pages across
NUMA nodes also makes them cheaper to initialize.

Caleb Sander Mateos (2):
  nvme/pci: factor out nvme_init_hctx() helper
  nvme/pci: make PRP list DMA pools per-NUMA-node

 drivers/nvme/host/pci.c | 171 +++++++++++++++++++++++-----------------
 1 file changed, 98 insertions(+), 73 deletions(-)

v3: simplify nvme_release_prp_pools() (Keith)
v2:
- Initialize admin nvme_queue's nvme_prp_dma_pools (Kanchan)
- Shrink nvme_dev's prp_pools array from MAX_NUMNODES to nr_node_ids
  (Kanchan)

-- 
2.45.2
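For a sense of scale, an illustrative calculation (not from the patch
set) of how many DMA pools each partitioning scheme implies on the test
system above, assuming one I/O queue per CPU and two pools (one large,
one small) per partition:

	per-device:     23 devices x 2 pools        =  46 pools (status quo)
	per-hctx:       23 devices x 16 queues x 2  = 736 pools (ideal, costly)
	per-NUMA-node:  23 devices x 2 nodes x 2    =  92 pools (this series)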
* [PATCH v3 1/2] nvme/pci: factor out nvme_init_hctx() helper

From: Caleb Sander Mateos @ 2025-04-21 16:55 UTC
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: Kanchan Joshi, linux-nvme, linux-kernel, Caleb Sander Mateos

nvme_init_hctx() and nvme_admin_init_hctx() are very similar. In
preparation for adding more logic, factor out an nvme_init_hctx()
helper. Rename the old nvme_init_hctx() to nvme_io_init_hctx().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
---
 drivers/nvme/host/pci.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b178d52eac1b..642890ddada5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -395,32 +395,33 @@ static int nvme_pci_npages_prp(void)
 	unsigned max_bytes = (NVME_MAX_KB_SZ * 1024) + NVME_CTRL_PAGE_SIZE;
 	unsigned nprps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
 	return DIV_ROUND_UP(8 * nprps, NVME_CTRL_PAGE_SIZE - 8);
 }
 
-static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
-		unsigned int hctx_idx)
+static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data, unsigned qid)
 {
 	struct nvme_dev *dev = to_nvme_dev(data);
-	struct nvme_queue *nvmeq = &dev->queues[0];
-
-	WARN_ON(hctx_idx != 0);
-	WARN_ON(dev->admin_tagset.tags[0] != hctx->tags);
+	struct nvme_queue *nvmeq = &dev->queues[qid];
+	struct blk_mq_tags *tags;
 
+	tags = qid ? dev->tagset.tags[qid - 1] : dev->admin_tagset.tags[0];
+	WARN_ON(tags != hctx->tags);
 	hctx->driver_data = nvmeq;
 	return 0;
 }
 
-static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
-		unsigned int hctx_idx)
+static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+		unsigned int hctx_idx)
 {
-	struct nvme_dev *dev = to_nvme_dev(data);
-	struct nvme_queue *nvmeq = &dev->queues[hctx_idx + 1];
+	WARN_ON(hctx_idx != 0);
+	return nvme_init_hctx(hctx, data, 0);
+}
 
-	WARN_ON(dev->tagset.tags[hctx_idx] != hctx->tags);
-	hctx->driver_data = nvmeq;
-	return 0;
+static int nvme_io_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+		unsigned int hctx_idx)
+{
+	return nvme_init_hctx(hctx, data, hctx_idx + 1);
 }
 
 static int nvme_pci_init_request(struct blk_mq_tag_set *set,
 		struct request *req, unsigned int hctx_idx, unsigned int numa_node)
@@ -1813,11 +1814,11 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.queue_rqs	= nvme_queue_rqs,
 	.complete	= nvme_pci_complete_rq,
 	.commit_rqs	= nvme_commit_rqs,
-	.init_hctx	= nvme_init_hctx,
+	.init_hctx	= nvme_io_init_hctx,
 	.init_request	= nvme_pci_init_request,
 	.map_queues	= nvme_pci_map_queues,
 	.timeout	= nvme_timeout,
 	.poll		= nvme_poll,
 };
-- 
2.45.2
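To make the queue-ID convention above explicit, an illustrative summary
(not part of the patch) of how the helper maps blk-mq hardware context
indices to nvme_queue IDs and tag sets:

	/*
	 * qid 0 is the admin queue; I/O hctx index i maps to qid i + 1:
	 *
	 *   nvme_admin_init_hctx: hctx_idx 0 -> qid 0     -> dev->admin_tagset.tags[0]
	 *   nvme_io_init_hctx:    hctx_idx i -> qid i + 1 -> dev->tagset.tags[i]
	 */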
* Re: [PATCH v3 1/2] nvme/pci: factor out nvme_init_hctx() helper

From: Kanchan Joshi @ 2025-04-22 7:38 UTC
To: Caleb Sander Mateos, Keith Busch, Jens Axboe, Christoph Hellwig,
    Sagi Grimberg
Cc: linux-nvme, linux-kernel

Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
* Re: [PATCH v3 1/2] nvme/pci: factor out nvme_init_hctx() helper

From: Sagi Grimberg @ 2025-04-22 11:45 UTC
To: Caleb Sander Mateos, Keith Busch, Jens Axboe, Christoph Hellwig
Cc: Kanchan Joshi, linux-nvme, linux-kernel

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* [PATCH v3 2/2] nvme/pci: make PRP list DMA pools per-NUMA-node

From: Caleb Sander Mateos @ 2025-04-21 16:55 UTC
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: Kanchan Joshi, linux-nvme, linux-kernel, Caleb Sander Mateos

NVMe commands with more than 4 KB of data allocate PRP list pages from
the per-nvme_device dma_pool prp_page_pool or prp_small_pool. Each call
to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock.
These device-global spinlocks are a significant source of contention
when many CPUs are submitting to the same NVMe devices. On a workload
issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2 NUMA nodes
to 23 NVMe devices, we observed 2.4% of CPU time spent in
_raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.

Ideally, the dma_pools would be per-hctx to minimize contention. But
that could impose considerable resource costs in a system with many
NVMe devices and CPUs.

As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each
nvme_queue to the set of DMA pools corresponding to its device and its
hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by
about half, to 1.2%. Preventing the sharing of PRP list pages across
NUMA nodes also makes them cheaper to initialize.

Link: https://lore.kernel.org/linux-nvme/CADUfDZqa=OOTtTTznXRDmBQo1WrFcDw1hBA7XwM7hzJ-hpckcA@mail.gmail.com/T/#u
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
---
 drivers/nvme/host/pci.c | 144 +++++++++++++++++++++++-----------------
 1 file changed, 84 insertions(+), 60 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 642890ddada5..7d86d1ec989a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -16,10 +16,11 @@
 #include <linux/kstrtox.h>
 #include <linux/memremap.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/nodemask.h>
 #include <linux/once.h>
 #include <linux/pci.h>
 #include <linux/suspend.h>
 #include <linux/t10-pi.h>
 #include <linux/types.h>
@@ -110,21 +111,24 @@ struct nvme_queue;
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static void nvme_delete_io_queues(struct nvme_dev *dev);
 static void nvme_update_attrs(struct nvme_dev *dev);
 
+struct nvme_prp_dma_pools {
+	struct dma_pool *large;
+	struct dma_pool *small;
+};
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
 struct nvme_dev {
 	struct nvme_queue *queues;
 	struct blk_mq_tag_set tagset;
 	struct blk_mq_tag_set admin_tagset;
 	u32 __iomem *dbs;
 	struct device *dev;
-	struct dma_pool *prp_page_pool;
-	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
 	unsigned max_qid;
 	unsigned io_queues[HCTX_MAX_TYPES];
 	unsigned int num_vecs;
 	u32 q_depth;
@@ -160,10 +164,11 @@ struct nvme_dev {
 	struct nvme_host_mem_buf_desc *host_mem_descs;
 	void **host_mem_desc_bufs;
 	unsigned int nr_allocated_queues;
 	unsigned int nr_write_queues;
 	unsigned int nr_poll_queues;
+	struct nvme_prp_dma_pools prp_pools[];
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
 {
 	return param_set_uint_minmax(val, kp, NVME_PCI_MIN_QUEUE_SIZE,
@@ -189,10 +194,11 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
  * An NVM Express queue.  Each device has at least two (one for admin
  * commands and one for I/O commands).
  */
 struct nvme_queue {
 	struct nvme_dev *dev;
+	struct nvme_prp_dma_pools prp_pools;
 	spinlock_t sq_lock;
 	void *sq_cmds;
 	 /* only used for poll queues: */
 	spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
 	struct nvme_completion *cqes;
@@ -395,18 +401,67 @@ static int nvme_pci_npages_prp(void)
 	unsigned max_bytes = (NVME_MAX_KB_SZ * 1024) + NVME_CTRL_PAGE_SIZE;
 	unsigned nprps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
 	return DIV_ROUND_UP(8 * nprps, NVME_CTRL_PAGE_SIZE - 8);
 }
 
+static struct nvme_prp_dma_pools *
+nvme_setup_prp_pools(struct nvme_dev *dev, unsigned numa_node)
+{
+	struct nvme_prp_dma_pools *prp_pools;
+	size_t small_align = 256;
+
+	prp_pools = &dev->prp_pools[numa_node < nr_node_ids ? numa_node : 0];
+	if (prp_pools->small)
+		return prp_pools; /* already initialized */
+
+	prp_pools->large = dma_pool_create("prp list page", dev->dev,
+					   NVME_CTRL_PAGE_SIZE,
+					   NVME_CTRL_PAGE_SIZE, 0);
+	if (!prp_pools->large)
+		return ERR_PTR(-ENOMEM);
+
+	if (dev->ctrl.quirks & NVME_QUIRK_DMAPOOL_ALIGN_512)
+		small_align = 512;
+
+	/* Optimisation for I/Os between 4k and 128k */
+	prp_pools->small = dma_pool_create("prp list 256", dev->dev,
+					   256, small_align, 0);
+	if (!prp_pools->small) {
+		dma_pool_destroy(prp_pools->large);
+		prp_pools->large = NULL;
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return prp_pools;
+}
+
+static void nvme_release_prp_pools(struct nvme_dev *dev)
+{
+	unsigned i;
+
+	for (i = 0; i < nr_node_ids; i++) {
+		struct nvme_prp_dma_pools *prp_pools = &dev->prp_pools[i];
+
+		dma_pool_destroy(prp_pools->large);
+		dma_pool_destroy(prp_pools->small);
+	}
+}
+
 static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data, unsigned qid)
 {
 	struct nvme_dev *dev = to_nvme_dev(data);
 	struct nvme_queue *nvmeq = &dev->queues[qid];
+	struct nvme_prp_dma_pools *prp_pools;
 	struct blk_mq_tags *tags;
 
 	tags = qid ? dev->tagset.tags[qid - 1] : dev->admin_tagset.tags[0];
 	WARN_ON(tags != hctx->tags);
+	prp_pools = nvme_setup_prp_pools(dev, hctx->numa_node);
+	if (IS_ERR(prp_pools))
+		return PTR_ERR(prp_pools);
+
+	nvmeq->prp_pools = *prp_pools;
 	hctx->driver_data = nvmeq;
 	return 0;
 }
 
 static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
@@ -536,27 +591,28 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
 	if (!sgl_threshold || avg_seg_size < sgl_threshold)
 		return nvme_req(req)->flags & NVME_REQ_USERCMD;
 	return true;
 }
 
-static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
+static void nvme_free_prps(struct nvme_queue *nvmeq, struct request *req)
 {
 	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	dma_addr_t dma_addr = iod->first_dma;
 	int i;
 
 	for (i = 0; i < iod->nr_allocations; i++) {
 		__le64 *prp_list = iod->list[i].prp_list;
 		dma_addr_t next_dma_addr = le64_to_cpu(prp_list[last_prp]);
 
-		dma_pool_free(dev->prp_page_pool, prp_list, dma_addr);
+		dma_pool_free(nvmeq->prp_pools.large, prp_list, dma_addr);
 		dma_addr = next_dma_addr;
 	}
 }
 
-static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
+static void nvme_unmap_data(struct nvme_dev *dev, struct nvme_queue *nvmeq,
+			    struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
 	if (iod->dma_len) {
 		dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
@@ -567,17 +623,17 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 
 	WARN_ON_ONCE(!iod->sgt.nents);
 
 	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
 	if (iod->nr_allocations == 0)
-		dma_pool_free(dev->prp_small_pool, iod->list[0].sg_list,
+		dma_pool_free(nvmeq->prp_pools.small, iod->list[0].sg_list,
 			      iod->first_dma);
 	else if (iod->nr_allocations == 1)
-		dma_pool_free(dev->prp_page_pool, iod->list[0].sg_list,
+		dma_pool_free(nvmeq->prp_pools.large, iod->list[0].sg_list,
 			      iod->first_dma);
 	else
-		nvme_free_prps(dev, req);
+		nvme_free_prps(nvmeq, req);
 	mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
 {
@@ -591,11 +647,11 @@ static void nvme_print_sgl(struct scatterlist *sgl, int nents)
 			i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
 			sg_dma_len(sg));
 	}
 }
 
-static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
+static blk_status_t nvme_pci_setup_prps(struct nvme_queue *nvmeq,
 		struct request *req, struct nvme_rw_command *cmnd)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct dma_pool *pool;
 	int length = blk_rq_payload_bytes(req);
@@ -627,14 +683,14 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 		goto done;
 	}
 
 	nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
 	if (nprps <= (256 / 8)) {
-		pool = dev->prp_small_pool;
+		pool = nvmeq->prp_pools.small;
 		iod->nr_allocations = 0;
 	} else {
-		pool = dev->prp_page_pool;
+		pool = nvmeq->prp_pools.large;
 		iod->nr_allocations = 1;
 	}
 
 	prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
 	if (!prp_list) {
@@ -672,11 +728,11 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 done:
 	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
 	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
 	return BLK_STS_OK;
 free_prps:
-	nvme_free_prps(dev, req);
+	nvme_free_prps(nvmeq, req);
 	return BLK_STS_RESOURCE;
 bad_sgl:
 	WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
 			"Invalid SGL for payload:%d nents:%d\n",
 			blk_rq_payload_bytes(req), iod->sgt.nents);
@@ -697,11 +753,11 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 	sge->addr = cpu_to_le64(dma_addr);
 	sge->length = cpu_to_le32(entries * sizeof(*sge));
 	sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
 }
 
-static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
+static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
 		struct request *req, struct nvme_rw_command *cmd)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct dma_pool *pool;
 	struct nvme_sgl_desc *sg_list;
@@ -717,14 +773,14 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 		nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
 		return BLK_STS_OK;
 	}
 
 	if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
-		pool = dev->prp_small_pool;
+		pool = nvmeq->prp_pools.small;
 		iod->nr_allocations = 0;
 	} else {
-		pool = dev->prp_page_pool;
+		pool = nvmeq->prp_pools.large;
 		iod->nr_allocations = 1;
 	}
 
 	sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
 	if (!sg_list) {
@@ -784,16 +840,16 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
 }
 
 static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		struct nvme_command *cmnd)
 {
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int rc;
 
 	if (blk_rq_nr_phys_segments(req) == 1) {
-		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
 			if (!nvme_pci_metadata_use_sgls(dev, req) &&
 			    (bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) +
@@ -824,13 +880,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		ret = BLK_STS_TARGET;
 		goto out_free_sg;
 	}
 
 	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
-		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
+		ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);
 	else
-		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
+		ret = nvme_pci_setup_prps(nvmeq, req, &cmnd->rw);
 	if (ret != BLK_STS_OK)
 		goto out_unmap_sg;
 	return BLK_STS_OK;
 
 out_unmap_sg:
@@ -841,10 +897,11 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 }
 
 static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
 					     struct request *req)
 {
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct nvme_rw_command *cmnd = &iod->cmd.rw;
 	struct nvme_sgl_desc *sg_list;
 	struct scatterlist *sgl, *sg;
 	unsigned int entries;
@@ -864,11 +921,11 @@ static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
 	rc = dma_map_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req),
 			     DMA_ATTR_NO_WARN);
 	if (rc)
 		goto out_free_sg;
 
-	sg_list = dma_pool_alloc(dev->prp_small_pool, GFP_ATOMIC, &sgl_dma);
+	sg_list = dma_pool_alloc(nvmeq->prp_pools.small, GFP_ATOMIC, &sgl_dma);
 	if (!sg_list)
 		goto out_unmap_sg;
 
 	entries = iod->meta_sgt.nents;
 	iod->meta_list.sg_list = sg_list;
@@ -946,11 +1003,11 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	nvme_start_request(req);
 	return BLK_STS_OK;
 out_unmap_data:
 	if (blk_rq_nr_phys_segments(req))
-		nvme_unmap_data(dev, req);
+		nvme_unmap_data(dev, req->mq_hctx->driver_data, req);
 out_free_cmd:
 	nvme_cleanup_cmd(req);
 	return ret;
 }
 
@@ -1036,10 +1093,11 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
 	nvme_submit_cmds(nvmeq, &submit_list);
 	*rqlist = requeue_list;
 }
 
 static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
+						struct nvme_queue *nvmeq,
 						struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
 	if (!iod->meta_sgt.nents) {
@@ -1047,11 +1105,11 @@ static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
 			       rq_integrity_vec(req).bv_len,
 			       rq_dma_dir(req));
 		return;
 	}
 
-	dma_pool_free(dev->prp_small_pool, iod->meta_list.sg_list,
+	dma_pool_free(nvmeq->prp_pools.small, iod->meta_list.sg_list,
 		      iod->meta_dma);
 	dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0);
 	mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool);
 }
 
@@ -1059,14 +1117,14 @@ static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_dev *dev = nvmeq->dev;
 
 	if (blk_integrity_rq(req))
-		nvme_unmap_metadata(dev, req);
+		nvme_unmap_metadata(dev, nvmeq, req);
 
 	if (blk_rq_nr_phys_segments(req))
-		nvme_unmap_data(dev, req);
+		nvme_unmap_data(dev, nvmeq, req);
 }
 
 static void nvme_pci_complete_rq(struct request *req)
 {
 	nvme_pci_unmap_rq(req);
@@ -2839,39 +2897,10 @@ static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown)
 		return -EBUSY;
 	nvme_dev_disable(dev, shutdown);
 	return 0;
 }
 
-static int nvme_setup_prp_pools(struct nvme_dev *dev)
-{
-	size_t small_align = 256;
-
-	dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
-					     NVME_CTRL_PAGE_SIZE,
-					     NVME_CTRL_PAGE_SIZE, 0);
-	if (!dev->prp_page_pool)
-		return -ENOMEM;
-
-	if (dev->ctrl.quirks & NVME_QUIRK_DMAPOOL_ALIGN_512)
-		small_align = 512;
-
-	/* Optimisation for I/Os between 4k and 128k */
-	dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
-					      256, small_align, 0);
-	if (!dev->prp_small_pool) {
-		dma_pool_destroy(dev->prp_page_pool);
-		return -ENOMEM;
-	}
-	return 0;
-}
-
-static void nvme_release_prp_pools(struct nvme_dev *dev)
-{
-	dma_pool_destroy(dev->prp_page_pool);
-	dma_pool_destroy(dev->prp_small_pool);
-}
-
 static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
 {
 	size_t meta_size = sizeof(struct scatterlist) * (NVME_MAX_META_SEGS + 1);
 	size_t alloc_size = sizeof(struct scatterlist) * NVME_MAX_SEGS;
 
@@ -3182,11 +3211,12 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	unsigned long quirks = id->driver_data;
 	int node = dev_to_node(&pdev->dev);
 	struct nvme_dev *dev;
 	int ret = -ENOMEM;
 
-	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
+	dev = kzalloc_node(sizeof(*dev) + nr_node_ids * sizeof(*dev->prp_pools),
+			   GFP_KERNEL, node);
 	if (!dev)
 		return ERR_PTR(-ENOMEM);
 	INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
 	mutex_init(&dev->shutdown_lock);
 
@@ -3257,17 +3287,13 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	result = nvme_dev_map(dev);
 	if (result)
 		goto out_uninit_ctrl;
 
-	result = nvme_setup_prp_pools(dev);
-	if (result)
-		goto out_dev_unmap;
-
 	result = nvme_pci_alloc_iod_mempool(dev);
 	if (result)
-		goto out_release_prp_pools;
+		goto out_dev_unmap;
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
 	result = nvme_pci_enable(dev);
 	if (result)
@@ -3339,12 +3365,10 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	nvme_dbbuf_dma_free(dev);
 	nvme_free_queues(dev, 0);
 out_release_iod_mempool:
 	mempool_destroy(dev->iod_mempool);
 	mempool_destroy(dev->iod_meta_mempool);
-out_release_prp_pools:
-	nvme_release_prp_pools(dev);
out_dev_unmap:
 	nvme_dev_unmap(dev);
 out_uninit_ctrl:
 	nvme_uninit_ctrl(&dev->ctrl);
 out_put_ctrl:
-- 
2.45.2
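Both pools hand out fixed-size DMA-coherent blocks, and every
allocation and free serializes on the pool's internal spinlock. A
minimal sketch of the per-request hot path (illustrative only,
simplified from the code above):

	dma_addr_t prp_dma;
	__le64 *prp_list;

	/* dma_pool_alloc() takes the pool's spinlock internally */
	prp_list = dma_pool_alloc(nvmeq->prp_pools.large, GFP_ATOMIC, &prp_dma);
	if (!prp_list)
		return BLK_STS_RESOURCE;
	/* ... fill in PRP entries and issue the command ... */

	/* on completion, dma_pool_free() takes the same spinlock again */
	dma_pool_free(nvmeq->prp_pools.large, prp_list, prp_dma);

With per-device pools, every queue on every CPU funnels through the
same two locks; with per-node pools, only queues on the same NUMA node
share them.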
* Re: [PATCH v3 2/2] nvme/pci: make PRP list DMA pools per-NUMA-node

From: Kanchan Joshi @ 2025-04-22 7:38 UTC
To: Caleb Sander Mateos, Keith Busch, Jens Axboe, Christoph Hellwig,
    Sagi Grimberg
Cc: linux-nvme, linux-kernel

On 4/21/2025 10:25 PM, Caleb Sander Mateos wrote:
> +	prp_pools = &dev->prp_pools[numa_node < nr_node_ids ? numa_node : 0];

NUMA node values are always in the range [0, nr_node_ids - 1], so is
the condition "numa_node < nr_node_ids" really required?

The patch looks good to me though.

Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
* Re: [PATCH v3 2/2] nvme/pci: make PRP list DMA pools per-NUMA-node

From: Sagi Grimberg @ 2025-04-22 11:48 UTC
To: Caleb Sander Mateos, Keith Busch, Jens Axboe, Christoph Hellwig
Cc: Kanchan Joshi, linux-nvme, linux-kernel

On 21/04/2025 19:55, Caleb Sander Mateos wrote:
> NVMe commands with more than 4 KB of data allocate PRP list pages from
> the per-nvme_device dma_pool prp_page_pool or prp_small_pool. Each call
> to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock.
> [...]
> +static struct nvme_prp_dma_pools *
> +nvme_setup_prp_pools(struct nvme_dev *dev, unsigned numa_node)
> +{
> +	struct nvme_prp_dma_pools *prp_pools;
> +	size_t small_align = 256;
> +
> +	prp_pools = &dev->prp_pools[numa_node < nr_node_ids ? numa_node : 0];

I'm assuming you are checking numa_node == NUMA_NO_NODE?
Perhaps it is better to check that explicitly?

Otherwise, looks good.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* Re: [PATCH v3 2/2] nvme/pci: make PRP list DMA pools per-NUMA-node

From: Caleb Sander Mateos @ 2025-04-22 14:59 UTC
To: Sagi Grimberg
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Kanchan Joshi,
    linux-nvme, linux-kernel

On Tue, Apr 22, 2025 at 4:48 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> On 21/04/2025 19:55, Caleb Sander Mateos wrote:
> > [...]
> > +	prp_pools = &dev->prp_pools[numa_node < nr_node_ids ? numa_node : 0];
>
> I'm assuming you are checking numa_node == NUMA_NO_NODE?
> Perhaps it is better to check that explicitly?

Yeah, I wasn't sure whether that was a possible value of
hctx->numa_node. The field is defined as an unsigned and NUMA_NO_NODE
is defined as -1, so I guess it shouldn't ever be set to NUMA_NO_NODE?
I can replace this with just numa_node.

Thanks,
Caleb
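For illustration, a minimal standalone sketch (not from the thread) of
why the existing guard is harmless even if NUMA_NO_NODE ever did reach
the unsigned field, assuming the kernel's NUMA_NO_NODE == -1 definition:

	#include <linux/numa.h>		/* NUMA_NO_NODE == -1 */
	#include <linux/nodemask.h>	/* nr_node_ids */

	unsigned numa_node = NUMA_NO_NODE;	/* -1 wraps to UINT_MAX */

	/* UINT_MAX < nr_node_ids is always false, so the index falls
	 * back to 0 instead of reading past the end of the array: */
	unsigned idx = numa_node < nr_node_ids ? numa_node : 0;	/* idx == 0 */

So the ternary selects node 0 for any out-of-range value, and dropping
it is safe only because hctx->numa_node is always a valid node ID in
practice.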