Linux-NVME Archive on lore.kernel.org
* [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough
@ 2024-09-04 18:38 Keith Busch
  2024-09-04 18:38 ` [PATCH-part-2 1/9] nvme-pci: use sgl capable helper function Keith Busch
                   ` (9 more replies)
  0 siblings, 10 replies; 15+ messages in thread
From: Keith Busch @ 2024-09-04 18:38 UTC (permalink / raw)
  To: hch, sagi, linux-nvme; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

NVMe's implicit memory lengths are an invitation to buffer overflows. It's
been known for a long time that this vulnerability exists.

The nvme protocol provides sgl with explicit lengths, so if the hardware
is capable, let's use that.
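
For reference, an NVMe SGL data block descriptor carries an explicit byte
count alongside the address, whereas a PRP entry is just a page address
with the transfer length implied by the command (struct nvme_sgl_desc,
from include/linux/nvme.h):

	struct nvme_sgl_desc {
		__le64	addr;
		__le32	length;		/* explicit transfer length in bytes */
		__u8	rsvd[3];
		__u8	type;
	};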

This patch series sets up the driver to always prefer SGL
representations on user passthrough requests. An added bonus to using
SGL for MPTR is that we can support multi-segment integrity buffers,
allowing merging once again. Request merging with metadata, though, is
kind of broken, so that functionality depends on this block patchset
(hence the "part-2" subject prefix):

  https://lore.kernel.org/linux-nvme/20240904152605.4055570-1-kbusch@meta.com/

I currently don't have real hardware that supports sgl mptr, but I
believe that's coming to me soon. In the meantime, if you're like
me, you can use the emulated device. Support for MPTR SGL is provided in
this currently unmerged (but very simple) patch to qemu:

  https://lists.nongnu.org/archive/html/qemu-block/2024-08/msg00332.html

Keith Busch (9):
  nvme-pci: use sgl capable helper function
  nvme-pci: provide prp selection helper
  nvme-pci: split out the simple dma mapping
  nvme-pci: remove "dma_len" from nvme_iod
  nvme-pci: simplify io setup function parameters
  nvme-pci: common dma pool alloc helper
  nvme-pci: provide a sgl mapping helper
  nvme-pci: add support for sgl metadata
  nvme: force sgls on user passthrough if possible

 drivers/nvme/host/core.c  |   4 +-
 drivers/nvme/host/ioctl.c |  17 ++
 drivers/nvme/host/nvme.h  |   7 +
 drivers/nvme/host/pci.c   | 319 ++++++++++++++++++++++++++------------
 include/linux/nvme.h      |   1 +
 5 files changed, 246 insertions(+), 102 deletions(-)

-- 
2.43.5




* [PATCH-part-2 1/9] nvme-pci: use sgl capable helper function
  2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
@ 2024-09-04 18:38 ` Keith Busch
  2024-09-04 18:38 ` [PATCH-part-2 2/9] nvme-pci: provide prp selection helper Keith Busch
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2024-09-04 18:38 UTC (permalink / raw)
  To: hch, sagi, linux-nvme; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

This check is done twice and is complex enough to warrant a helper
function, so introduce one and use it from both places.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6cd9395ba9ec3..21f6aeb236a6f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -504,21 +504,28 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 	spin_unlock(&nvmeq->sq_lock);
 }
 
-static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
-				     int nseg)
+static inline bool nvme_pci_sgl_capable(struct nvme_dev *dev,
+					struct request *req)
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
-	unsigned int avg_seg_size;
-
-	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
 
 	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
 		return false;
 	if (!nvmeq->qid)
 		return false;
-	if (!sgl_threshold || avg_seg_size < sgl_threshold)
+	return sgl_threshold;
+}
+
+static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
+				     int nseg)
+{
+	unsigned int avg_seg_size;
+
+	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
+
+	if (!nvme_pci_sgl_capable(dev, req))
 		return false;
-	return true;
+	return avg_seg_size >= sgl_threshold;
 }
 
 static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
@@ -774,7 +781,6 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	int rc;
 
 	if (blk_rq_nr_phys_segments(req) == 1) {
-		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
@@ -783,8 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 				return nvme_setup_prp_simple(dev, req,
 							     &cmnd->rw, &bv);
 
-			if (nvmeq->qid && sgl_threshold &&
-			    nvme_ctrl_sgl_supported(&dev->ctrl))
+			if (nvme_pci_sgl_capable(dev, req))
 				return nvme_setup_sgl_simple(dev, req,
 							     &cmnd->rw, &bv);
 		}
-- 
2.43.5




* [PATCH-part-2 2/9] nvme-pci: provide prp selection helper
  2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
  2024-09-04 18:38 ` [PATCH-part-2 1/9] nvme-pci: use sgl capable helper function Keith Busch
@ 2024-09-04 18:38 ` Keith Busch
  2024-09-06 11:22   ` Anuj Gupta
  2024-09-04 18:38 ` [PATCH-part-2 3/9] nvme-pci: split out the simple dma mapping Keith Busch
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2024-09-04 18:38 UTC (permalink / raw)
  To: hch, sagi, linux-nvme; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

The prp selection criteria are going to get a little more complicated, so
introduce a helper to make the code more readable.
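
For example, with NVME_CTRL_PAGE_SIZE == 4096, a single bvec at offset 512
with length 7680 gives off = 512 and off + len = 8192 <= 2 * 4096, so it
still qualifies for the simple prp1/prp2 path; one byte more and the
request needs a full PRP list (or an SGL) instead.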

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 21f6aeb236a6f..1763ee1886a7f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -528,6 +528,12 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
 	return avg_seg_size >= sgl_threshold;
 }
 
+static inline bool nvme_pci_use_prps(struct bio_vec *bv)
+{
+	unsigned int off = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
+	return  off + bv->bv_len <= NVME_CTRL_PAGE_SIZE * 2;
+}
+
 static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
 {
 	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
@@ -784,8 +790,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
-			if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) +
-			     bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
+			if (nvme_pci_use_prps(&bv))
 				return nvme_setup_prp_simple(dev, req,
 							     &cmnd->rw, &bv);
 
-- 
2.43.5




* [PATCH-part-2 3/9] nvme-pci: split out the simple dma mapping
  2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
  2024-09-04 18:38 ` [PATCH-part-2 1/9] nvme-pci: use sgl capable helper function Keith Busch
  2024-09-04 18:38 ` [PATCH-part-2 2/9] nvme-pci: provide prp selection helper Keith Busch
@ 2024-09-04 18:38 ` Keith Busch
  2024-09-04 18:38 ` [PATCH-part-2 4/9] nvme-pci: remove "dma_len" from nvme_iod Keith Busch
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2024-09-04 18:38 UTC (permalink / raw)
  To: hch, sagi, linux-nvme; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

To help improve code readability.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1763ee1886a7f..77d42eaada742 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -779,27 +779,13 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
 	return BLK_STS_OK;
 }
 
-static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
+static blk_status_t __nvme_map_data(struct nvme_dev *dev, struct request *req,
 		struct nvme_command *cmnd)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int rc;
 
-	if (blk_rq_nr_phys_segments(req) == 1) {
-		struct bio_vec bv = req_bvec(req);
-
-		if (!is_pci_p2pdma_page(bv.bv_page)) {
-			if (nvme_pci_use_prps(&bv))
-				return nvme_setup_prp_simple(dev, req,
-							     &cmnd->rw, &bv);
-
-			if (nvme_pci_sgl_capable(dev, req))
-				return nvme_setup_sgl_simple(dev, req,
-							     &cmnd->rw, &bv);
-		}
-	}
-
 	iod->dma_len = 0;
 	iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
 	if (!iod->sgt.sgl)
@@ -832,6 +818,25 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	return ret;
 }
 
+static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
+		struct nvme_command *cmnd)
+{
+	if (blk_rq_nr_phys_segments(req) == 1) {
+		struct bio_vec bv = req_bvec(req);
+
+		if (!is_pci_p2pdma_page(bv.bv_page)) {
+			if (nvme_pci_use_prps(&bv))
+				return nvme_setup_prp_simple(dev, req,
+							     &cmnd->rw, &bv);
+			if (nvme_pci_sgl_capable(dev, req))
+				return nvme_setup_sgl_simple(dev, req,
+							     &cmnd->rw, &bv);
+		}
+	}
+
+	return __nvme_map_data(dev, req, cmnd);
+}
+
 static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
 		struct nvme_command *cmnd)
 {
-- 
2.43.5




* [PATCH-part-2 4/9] nvme-pci: remove "dma_len" from nvme_iod
  2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
                   ` (2 preceding siblings ...)
  2024-09-04 18:38 ` [PATCH-part-2 3/9] nvme-pci: split out the simple dma mapping Keith Busch
@ 2024-09-04 18:38 ` Keith Busch
  2024-09-04 18:38 ` [PATCH-part-2 5/9] nvme-pci: simplify io setup function parameters Keith Busch
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2024-09-04 18:38 UTC (permalink / raw)
  To: hch, sagi, linux-nvme; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

This is redundant data that we can get from sgl nents and request data
length. Remove it to make the struct smaller.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 77d42eaada742..3af4a226497c9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -233,7 +233,6 @@ struct nvme_iod {
 	bool aborted;
 	s8 nr_allocations;	/* PRP list pool allocations. 0 means small
 				   pool in use */
-	unsigned int dma_len;	/* length of single DMA segment mapping */
 	dma_addr_t first_dma;
 	dma_addr_t meta_dma;
 	struct sg_table sgt;
@@ -554,14 +553,12 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
-	if (iod->dma_len) {
-		dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
+	if (!iod->sgt.nents) {
+		dma_unmap_page(dev->dev, iod->first_dma, blk_rq_bytes(req),
 			       rq_dma_dir(req));
 		return;
 	}
 
-	WARN_ON_ONCE(!iod->sgt.nents);
-
 	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
 	if (iod->nr_allocations == 0)
@@ -751,7 +748,6 @@ static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
 	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
 	if (dma_mapping_error(dev->dev, iod->first_dma))
 		return BLK_STS_RESOURCE;
-	iod->dma_len = bv->bv_len;
 
 	cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma);
 	if (bv->bv_len > first_prp_len)
@@ -770,11 +766,10 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
 	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
 	if (dma_mapping_error(dev->dev, iod->first_dma))
 		return BLK_STS_RESOURCE;
-	iod->dma_len = bv->bv_len;
 
 	cmnd->flags = NVME_CMD_SGL_METABUF;
 	cmnd->dptr.sgl.addr = cpu_to_le64(iod->first_dma);
-	cmnd->dptr.sgl.length = cpu_to_le32(iod->dma_len);
+	cmnd->dptr.sgl.length = cpu_to_le32(bv->bv_len);
 	cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;
 	return BLK_STS_OK;
 }
@@ -786,7 +781,6 @@ static blk_status_t __nvme_map_data(struct nvme_dev *dev, struct request *req,
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int rc;
 
-	iod->dma_len = 0;
 	iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
 	if (!iod->sgt.sgl)
 		return BLK_STS_RESOURCE;
-- 
2.43.5




* [PATCH-part-2 5/9] nvme-pci: simplify io setup function parameters
  2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
                   ` (3 preceding siblings ...)
  2024-09-04 18:38 ` [PATCH-part-2 4/9] nvme-pci: remove "dma_len" from nvme_iod Keith Busch
@ 2024-09-04 18:38 ` Keith Busch
  2024-09-06 11:24   ` Anuj Gupta
  2024-09-04 18:38 ` [PATCH-part-2 6/9] nvme-pci: common dma pool alloc helper Keith Busch
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2024-09-04 18:38 UTC (permalink / raw)
  To: hch, sagi, linux-nvme; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

We're threading a reference to the nvme_command through all functions
before it's needed. It's also redundant since it's a part of the request
that has to be provided anway. Remove the extra reference and only get
it in functions that actually need it.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 36 +++++++++++++++++-------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3af4a226497c9..cdba1f8e0bba6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -587,9 +587,10 @@ static void nvme_print_sgl(struct scatterlist *sgl, int nents)
 }
 
 static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmnd)
+		struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_rw_command *cmnd = &iod->cmd.rw;
 	struct dma_pool *pool;
 	int length = blk_rq_payload_bytes(req);
 	struct scatterlist *sg = iod->sgt.sgl;
@@ -693,9 +694,10 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmd)
+		struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_rw_command *cmd = &iod->cmd.rw;
 	struct dma_pool *pool;
 	struct nvme_sgl_desc *sg_list;
 	struct scatterlist *sg = iod->sgt.sgl;
@@ -738,12 +740,12 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 }
 
 static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmnd,
-		struct bio_vec *bv)
+		struct request *req, struct bio_vec *bv)
 {
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	unsigned int offset = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
 	unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_rw_command *cmnd = &iod->cmd.rw;
 
 	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
 	if (dma_mapping_error(dev->dev, iod->first_dma))
@@ -758,10 +760,10 @@ static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
 }
 
 static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmnd,
-		struct bio_vec *bv)
+		struct request *req, struct bio_vec *bv)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_rw_command *cmnd = &iod->cmd.rw;
 
 	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
 	if (dma_mapping_error(dev->dev, iod->first_dma))
@@ -774,8 +776,7 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
 	return BLK_STS_OK;
 }
 
-static blk_status_t __nvme_map_data(struct nvme_dev *dev, struct request *req,
-		struct nvme_command *cmnd)
+static blk_status_t __nvme_map_data(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	blk_status_t ret = BLK_STS_RESOURCE;
@@ -798,9 +799,9 @@ static blk_status_t __nvme_map_data(struct nvme_dev *dev, struct request *req,
 	}
 
 	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
-		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
+		ret = nvme_pci_setup_sgls(dev, req);
 	else
-		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
+		ret = nvme_pci_setup_prps(dev, req);
 	if (ret != BLK_STS_OK)
 		goto out_unmap_sg;
 	return BLK_STS_OK;
@@ -812,23 +813,20 @@ static blk_status_t __nvme_map_data(struct nvme_dev *dev, struct request *req,
 	return ret;
 }
 
-static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
-		struct nvme_command *cmnd)
+static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req)
 {
 	if (blk_rq_nr_phys_segments(req) == 1) {
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
 			if (nvme_pci_use_prps(&bv))
-				return nvme_setup_prp_simple(dev, req,
-							     &cmnd->rw, &bv);
+				return nvme_setup_prp_simple(dev, req, &bv);
 			if (nvme_pci_sgl_capable(dev, req))
-				return nvme_setup_sgl_simple(dev, req,
-							     &cmnd->rw, &bv);
+				return nvme_setup_sgl_simple(dev, req, &bv);
 		}
 	}
 
-	return __nvme_map_data(dev, req, cmnd);
+	return __nvme_map_data(dev, req);
 }
 
 static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
@@ -858,7 +856,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 		return ret;
 
 	if (blk_rq_nr_phys_segments(req)) {
-		ret = nvme_map_data(dev, req, &iod->cmd);
+		ret = nvme_map_data(dev, req);
 		if (ret)
 			goto out_free_cmd;
 	}
-- 
2.43.5




* [PATCH-part-2 6/9] nvme-pci: common dma pool alloc helper
  2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
                   ` (4 preceding siblings ...)
  2024-09-04 18:38 ` [PATCH-part-2 5/9] nvme-pci: simplify io setup function parameters Keith Busch
@ 2024-09-04 18:38 ` Keith Busch
  2024-09-04 18:38 ` [PATCH-part-2 7/9] nvme-pci: provide a sgl mapping helper Keith Busch
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2024-09-04 18:38 UTC (permalink / raw)
  To: hch, sagi, linux-nvme; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

The more complicated sgl and prp setup paths do the same thing to select a
pool, then allocate and initialize the iod dma fields. Provide a common
helper for this.
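
For example, the small pool allocations are 256 bytes, so it covers up to
256 / 8 = 32 PRP entries or 256 / sizeof(struct nvme_sgl_desc) = 16 SGL
descriptors; anything larger has to come from the page-sized pool. That is
the "nents <= (256 / desc_size)" check the new helper centralizes.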

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 69 ++++++++++++++++++++---------------------
 1 file changed, 34 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cdba1f8e0bba6..4980dde804a0e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -217,6 +217,7 @@ struct nvme_queue {
 };
 
 union nvme_descriptor {
+	void			*list;
 	struct nvme_sgl_desc	*sg_list;
 	__le64			*prp_list;
 };
@@ -586,6 +587,29 @@ static void nvme_print_sgl(struct scatterlist *sgl, int nents)
 	}
 }
 
+static struct dma_pool *nvme_pci_pool_alloc(struct nvme_dev *dev,
+					    unsigned nents, size_t desc_size,
+					    struct nvme_iod *iod)
+{
+	struct dma_pool *pool;
+
+	if (nents <= (256 / desc_size)) {
+		pool = dev->prp_small_pool;
+		iod->nr_allocations = 0;
+	} else {
+		pool = dev->prp_page_pool;
+		iod->nr_allocations = 1;
+	}
+
+	iod->list[0].list = dma_pool_alloc(pool, GFP_ATOMIC, &iod->first_dma);
+	if (!iod->list[0].list) {
+		iod->nr_allocations = -1;
+		return NULL;
+	}
+
+	return pool;
+}
+
 static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 		struct request *req)
 {
@@ -599,7 +623,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 	int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
 	__le64 *prp_list;
 	dma_addr_t prp_dma;
-	int nprps, i;
+	int nprps, i = 0;
 
 	length -= (NVME_CTRL_PAGE_SIZE - offset);
 	if (length <= 0) {
@@ -622,22 +646,11 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 	}
 
 	nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
-	if (nprps <= (256 / 8)) {
-		pool = dev->prp_small_pool;
-		iod->nr_allocations = 0;
-	} else {
-		pool = dev->prp_page_pool;
-		iod->nr_allocations = 1;
-	}
-
-	prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
-	if (!prp_list) {
-		iod->nr_allocations = -1;
+	pool = nvme_pci_pool_alloc(dev, nprps, 8, iod);
+	if (!pool)
 		return BLK_STS_RESOURCE;
-	}
-	iod->list[0].prp_list = prp_list;
-	iod->first_dma = prp_dma;
-	i = 0;
+
+	prp_list = iod->list[0].prp_list;
 	for (;;) {
 		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
 			__le64 *old_prp_list = prp_list;
@@ -697,12 +710,10 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 		struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	struct nvme_rw_command *cmd = &iod->cmd.rw;
-	struct dma_pool *pool;
 	struct nvme_sgl_desc *sg_list;
+	struct nvme_rw_command *cmd = &iod->cmd.rw;
 	struct scatterlist *sg = iod->sgt.sgl;
 	unsigned int entries = iod->sgt.nents;
-	dma_addr_t sgl_dma;
 	int i = 0;
 
 	/* setting the transfer type as SGL */
@@ -713,24 +724,12 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 		return BLK_STS_OK;
 	}
 
-	if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
-		pool = dev->prp_small_pool;
-		iod->nr_allocations = 0;
-	} else {
-		pool = dev->prp_page_pool;
-		iod->nr_allocations = 1;
-	}
-
-	sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
-	if (!sg_list) {
-		iod->nr_allocations = -1;
+	if (!nvme_pci_pool_alloc(dev, entries, sizeof(struct nvme_sgl_desc),
+				 iod))
 		return BLK_STS_RESOURCE;
-	}
-
-	iod->list[0].sg_list = sg_list;
-	iod->first_dma = sgl_dma;
 
-	nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
+	sg_list = iod->list[0].sg_list;
+	nvme_pci_sgl_set_seg(&cmd->dptr.sgl, iod->first_dma, entries);
 	do {
 		nvme_pci_sgl_set_data(&sg_list[i++], sg);
 		sg = sg_next(sg);
-- 
2.43.5




* [PATCH-part-2 7/9] nvme-pci: provide a sgl mapping helper
  2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
                   ` (5 preceding siblings ...)
  2024-09-04 18:38 ` [PATCH-part-2 6/9] nvme-pci: common dma pool alloc helper Keith Busch
@ 2024-09-04 18:38 ` Keith Busch
  2024-09-04 18:38 ` [PATCH-part-2 8/9] nvme-pci: add support for sgl metadata Keith Busch
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2024-09-04 18:38 UTC (permalink / raw)
  To: hch, sagi, linux-nvme; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

We're about to add another sgl user, so create a helper for the generic
list creation code.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4980dde804a0e..db0144a6bc5db 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -706,15 +706,26 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 	sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
 }
 
+static void nvme_pci_sgl_set_list(struct nvme_sgl_desc *sg_list,
+				  struct nvme_sgl_desc *seg,
+				  struct scatterlist *sgl, int nents,
+				  dma_addr_t dma)
+{
+	struct scatterlist *sg;
+	int i;
+
+	nvme_pci_sgl_set_seg(seg, dma, nents);
+	for_each_sg(sgl, sg, nents, i)
+		nvme_pci_sgl_set_data(&sg_list[i], sg);
+}
+
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 		struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	struct nvme_sgl_desc *sg_list;
 	struct nvme_rw_command *cmd = &iod->cmd.rw;
 	struct scatterlist *sg = iod->sgt.sgl;
 	unsigned int entries = iod->sgt.nents;
-	int i = 0;
 
 	/* setting the transfer type as SGL */
 	cmd->flags = NVME_CMD_SGL_METABUF;
@@ -728,13 +739,8 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 				 iod))
 		return BLK_STS_RESOURCE;
 
-	sg_list = iod->list[0].sg_list;
-	nvme_pci_sgl_set_seg(&cmd->dptr.sgl, iod->first_dma, entries);
-	do {
-		nvme_pci_sgl_set_data(&sg_list[i++], sg);
-		sg = sg_next(sg);
-	} while (--entries > 0);
-
+	nvme_pci_sgl_set_list(iod->list[0].sg_list, &cmd->dptr.sgl, sg, entries,
+			      iod->first_dma);
 	return BLK_STS_OK;
 }
 
-- 
2.43.5




* [PATCH-part-2 8/9] nvme-pci: add support for sgl metadata
  2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
                   ` (6 preceding siblings ...)
  2024-09-04 18:38 ` [PATCH-part-2 7/9] nvme-pci: provide a sgl mapping helper Keith Busch
@ 2024-09-04 18:38 ` Keith Busch
  2024-09-04 19:22   ` Keith Busch
  2024-09-04 18:38 ` [PATCH-part-2 9/9] nvme: force sgls on user passthrough if possible Keith Busch
  2024-09-06 17:29 ` [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
  9 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2024-09-04 18:38 UTC (permalink / raw)
  To: hch, sagi, linux-nvme; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

Supporting this mode allows merging requests with metadata that wouldn't
be possible otherwise.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/core.c |   4 +-
 drivers/nvme/host/nvme.h |   5 ++
 drivers/nvme/host/pci.c  | 129 ++++++++++++++++++++++++++++++++++-----
 include/linux/nvme.h     |   1 +
 4 files changed, 122 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 053d5b4909cda..6a8411ef45f26 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1988,7 +1988,9 @@ static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl,
 	lim->max_hw_sectors = ctrl->max_hw_sectors;
 	lim->max_segments = min_t(u32, USHRT_MAX,
 		min_not_zero(nvme_max_drv_segments(ctrl), ctrl->max_segments));
-	lim->max_integrity_segments = ctrl->max_integrity_segments;
+	if (lim->max_integrity_segments > 1 &&
+	    !nvme_ctrl_meta_sgl_supported(ctrl))
+		lim->max_integrity_segments = 1;
 	lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
 	lim->max_segment_size = UINT_MAX;
 	lim->dma_alignment = 3;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f900e44243aef..699cc36e596fa 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -1121,6 +1121,11 @@ static inline bool nvme_ctrl_sgl_supported(struct nvme_ctrl *ctrl)
 	return ctrl->sgls & ((1 << 0) | (1 << 1));
 }
 
+static inline bool nvme_ctrl_meta_sgl_supported(struct nvme_ctrl *ctrl)
+{
+	return ctrl->sgls & NVME_CTRL_SGLS_MPTR;
+}
+
 #ifdef CONFIG_NVME_HOST_AUTH
 int __init nvme_init_auth(void);
 void __exit nvme_exit_auth(void);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index db0144a6bc5db..a0a10451d7da8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -43,6 +43,7 @@
  */
 #define NVME_MAX_KB_SZ	8192
 #define NVME_MAX_SEGS	128
+#define NVME_MAX_META_SEGS 15
 #define NVME_MAX_NR_ALLOCATIONS	5
 
 static int use_threaded_interrupts;
@@ -143,6 +144,7 @@ struct nvme_dev {
 	bool hmb;
 
 	mempool_t *iod_mempool;
+	mempool_t *iod_meta_mempool;
 
 	/* shadow doorbell buffer support: */
 	__le32 *dbbuf_dbs;
@@ -237,6 +239,8 @@ struct nvme_iod {
 	dma_addr_t first_dma;
 	dma_addr_t meta_dma;
 	struct sg_table sgt;
+	struct sg_table meta_sgt;
+	union nvme_descriptor meta_list;
 	union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
 };
 
@@ -516,6 +520,11 @@ static inline bool nvme_pci_sgl_capable(struct nvme_dev *dev,
 	return sgl_threshold;
 }
 
+static inline bool nvme_pci_metadata_use_sgls(struct request *req)
+{
+	return blk_rq_integrity_segments(req) > 1;
+}
+
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
 				     int nseg)
 {
@@ -525,6 +534,8 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
 
 	if (!nvme_pci_sgl_capable(dev, req))
 		return false;
+	if (nvme_pci_metadata_use_sgls(req))
+		return true;
 	return avg_seg_size >= sgl_threshold;
 }
 
@@ -834,19 +845,81 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req)
 	return __nvme_map_data(dev, req);
 }
 
-static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
-		struct nvme_command *cmnd)
+static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
+					     struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_rw_command *cmnd = &iod->cmd.rw;
+	struct nvme_sgl_desc *sg_list;
+	struct scatterlist *sg;
+	unsigned int entries;
+	dma_addr_t sgl_dma;
+	int rc;
+
+	iod->meta_sgt.sgl = mempool_alloc(dev->iod_meta_mempool, GFP_ATOMIC);
+	if (!iod->meta_sgt.sgl)
+		return BLK_STS_RESOURCE;
+
+	sg_init_table(iod->meta_sgt.sgl, req->nr_integrity_segments);
+	iod->meta_sgt.orig_nents = blk_rq_map_integrity_sg(req->bio,
+							   iod->meta_sgt.sgl);
+	if (!iod->meta_sgt.orig_nents)
+		goto out_free_sg;
+
+	rc = dma_map_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req),
+			     DMA_ATTR_NO_WARN);
+	if (rc)
+		goto out_free_sg;
+
+	sg_list = dma_pool_alloc(dev->prp_small_pool, GFP_ATOMIC, &sgl_dma);
+	if (!sg_list)
+		goto out_unmap_sg;
+
+	entries = iod->meta_sgt.nents;
+	iod->meta_list.sg_list = sg_list;
+	iod->meta_dma = sgl_dma;
+
+	cmnd->flags = NVME_CMD_SGL_METASEG;
+	cmnd->metadata = cpu_to_le64(sgl_dma);
+
+	sg = iod->meta_sgt.sgl;
+	if (entries == 1) {
+		nvme_pci_sgl_set_data(sg_list, sg);
+		return BLK_STS_OK;
+	}
+
+	sgl_dma += sizeof(sizeof(*sg_list));
+	nvme_pci_sgl_set_list(&sg_list[1], sg_list, sg, entries, sgl_dma);
+	return BLK_STS_OK;
+
+out_unmap_sg:
+	dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0);
+out_free_sg:
+	mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool);
+	return BLK_STS_RESOURCE;
+}
+
+static blk_status_t nvme_pci_setup_meta_mptr(struct nvme_dev *dev,
+					     struct request *req)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_rw_command *cmnd = &iod->cmd.rw;
 	struct bio_vec bv = rq_integrity_vec(req);
 
 	iod->meta_dma = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0);
 	if (dma_mapping_error(dev->dev, iod->meta_dma))
 		return BLK_STS_IOERR;
-	cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
+	cmnd->metadata = cpu_to_le64(iod->meta_dma);
 	return BLK_STS_OK;
 }
 
+static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req)
+{
+	if (nvme_pci_metadata_use_sgls(req))
+		return nvme_pci_setup_meta_sgls(dev, req);
+	return nvme_pci_setup_meta_mptr(dev, req);
+}
+
 static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -855,6 +928,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	iod->aborted = false;
 	iod->nr_allocations = -1;
 	iod->sgt.nents = 0;
+	iod->meta_sgt.nents = 0;
 
 	ret = nvme_setup_cmd(req->q->queuedata, req);
 	if (ret)
@@ -867,7 +941,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	}
 
 	if (blk_integrity_rq(req)) {
-		ret = nvme_map_metadata(dev, req, &iod->cmd);
+		ret = nvme_map_metadata(dev, req);
 		if (ret)
 			goto out_unmap_data;
 	}
@@ -971,17 +1045,31 @@ static void nvme_queue_rqs(struct request **rqlist)
 	*rqlist = requeue_list;
 }
 
+static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
+						struct request *req)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+	if (!iod->meta_sgt.nents) {
+		dma_unmap_page(dev->dev, iod->meta_dma,
+			       rq_integrity_vec(req).bv_len,
+			       rq_dma_dir(req));
+		return;
+	}
+
+	dma_pool_free(dev->prp_small_pool, iod->meta_list.sg_list,
+		      iod->meta_dma);
+	dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0);
+	mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool);
+}
+
 static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_dev *dev = nvmeq->dev;
 
-	if (blk_integrity_rq(req)) {
-	        struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-		dma_unmap_page(dev->dev, iod->meta_dma,
-			       rq_integrity_vec(req).bv_len, rq_dma_dir(req));
-	}
+	if (blk_integrity_rq(req))
+		nvme_unmap_metadata(dev, req);
 
 	if (blk_rq_nr_phys_segments(req))
 		nvme_unmap_data(dev, req);
@@ -2718,6 +2806,7 @@ static void nvme_release_prp_pools(struct nvme_dev *dev)
 
 static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
 {
+	size_t meta_size = sizeof(struct scatterlist) * NVME_MAX_META_SEGS + 1;
 	size_t alloc_size = sizeof(struct scatterlist) * NVME_MAX_SEGS;
 
 	dev->iod_mempool = mempool_create_node(1,
@@ -2726,7 +2815,18 @@ static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
 			dev_to_node(dev->dev));
 	if (!dev->iod_mempool)
 		return -ENOMEM;
+
+	dev->iod_meta_mempool = mempool_create_node(1,
+			mempool_kmalloc, mempool_kfree,
+			(void *)meta_size, GFP_KERNEL,
+			dev_to_node(dev->dev));
+	if (!dev->iod_meta_mempool)
+		goto free;
+
 	return 0;
+free:
+	mempool_destroy(dev->iod_mempool);
+	return -ENOMEM;
 }
 
 static void nvme_free_tagset(struct nvme_dev *dev)
@@ -3046,12 +3146,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	dev->ctrl.max_hw_sectors = min_t(u32,
 		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
 	dev->ctrl.max_segments = NVME_MAX_SEGS;
-
-	/*
-	 * There is no support for SGLs for metadata (yet), so we are limited to
-	 * a single integrity segment for the separate metadata pointer.
-	 */
-	dev->ctrl.max_integrity_segments = 1;
+	dev->ctrl.max_integrity_segments = NVME_MAX_META_SEGS;
 	return dev;
 
 out_put_device:
@@ -3155,6 +3250,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	nvme_free_queues(dev, 0);
 out_release_iod_mempool:
 	mempool_destroy(dev->iod_mempool);
+	mempool_destroy(dev->iod_meta_mempool);
 out_release_prp_pools:
 	nvme_release_prp_pools(dev);
 out_dev_unmap:
@@ -3220,6 +3316,7 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_dbbuf_dma_free(dev);
 	nvme_free_queues(dev, 0);
 	mempool_destroy(dev->iod_mempool);
+	mempool_destroy(dev->iod_meta_mempool);
 	nvme_release_prp_pools(dev);
 	nvme_dev_unmap(dev);
 	nvme_uninit_ctrl(&dev->ctrl);
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 7b2ae2e435447..87ea1673c60b8 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -388,6 +388,7 @@ enum {
 	NVME_CTRL_CTRATT_PREDICTABLE_LAT	= 1 << 5,
 	NVME_CTRL_CTRATT_NAMESPACE_GRANULARITY	= 1 << 7,
 	NVME_CTRL_CTRATT_UUID_LIST		= 1 << 9,
+	NVME_CTRL_SGLS_MPTR                     = 1 << 18,
 };
 
 struct nvme_lbaf {
-- 
2.43.5




* [PATCH-part-2 9/9] nvme: force sgls on user passthrough if possible
  2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
                   ` (7 preceding siblings ...)
  2024-09-04 18:38 ` [PATCH-part-2 8/9] nvme-pci: add support for sgl metadata Keith Busch
@ 2024-09-04 18:38 ` Keith Busch
  2024-09-04 19:23   ` Keith Busch
  2024-09-06 17:29 ` [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
  9 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2024-09-04 18:38 UTC (permalink / raw)
  To: hch, sagi, linux-nvme; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

With capable hardware, this is one way to guard against short buffers.
Getting the interface wrong can cause corruption and kernel crashes, and
utilizing the mptr sgl feature leverages the protocol
to catch such incorrect usage. See CVE-2023-6238.

To emphasize the danger of using this interface, the kernel will be
tainted if the user accesses this interface without hardware capable of
guaranteeing transfer lengths.
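
As an illustration of the failure mode (field names from struct
nvme_passthru_cmd in linux/nvme_ioctl.h; the buffer and values are made
up): a passthrough read that asks for more blocks than the supplied buffer
covers would, with PRPs, let the device DMA past the end of the buffer,
while an SGL's explicit length lets the controller fail the command
instead:

	struct nvme_passthru_cmd cmd = {
		.opcode   = 0x02,			/* NVMe read */
		.nsid     = 1,
		.addr     = (__u64)(uintptr_t)buf,	/* 512-byte user buffer */
		.data_len = 512,
		.cdw12    = 7,	/* NLB = 7 -> 8 blocks = 4096 bytes with 512B LBAs */
	};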

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/ioctl.c | 17 +++++++++++++++++
 drivers/nvme/host/nvme.h  |  2 ++
 drivers/nvme/host/pci.c   | 20 ++++++++++++++------
 3 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index f1d58e70933f5..cf889a0e79338 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -116,12 +116,22 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
 		unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
 		u32 meta_seed, struct io_uring_cmd *ioucmd, unsigned int flags)
 {
+	struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
 	struct request_queue *q = req->q;
 	struct nvme_ns *ns = q->queuedata;
 	struct block_device *bdev = ns ? ns->disk->part0 : NULL;
 	struct bio *bio = NULL;
 	int ret;
 
+	if (bdev) {
+		if (nvme_ctrl_sgl_supported(ctrl)) {
+			nvme_req(req)->flags |= NVME_REQ_USE_SGLS;
+		} else {
+			dev_warn_once(ctrl->device, "using unchecked buffer\n");
+			add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+		}
+	}
+
 	if (ioucmd && (ioucmd->flags & IORING_URING_CMD_FIXED)) {
 		struct iov_iter iter;
 
@@ -146,6 +156,13 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
 	if (bdev) {
 		bio_set_dev(bio, bdev);
 		if (meta_buffer && meta_len) {
+			if (nvme_ctrl_meta_sgl_supported(ctrl)) {
+				nvme_req(req)->flags |= NVME_REQ_USE_META_SGLS;
+			} else {
+				dev_warn_once(ctrl->device,
+					      "using unchecked meta buffer\n");
+				add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+			}
 			ret = bio_integrity_map_user(bio, meta_buffer, meta_len,
 						     meta_seed);
 			if (ret)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 699cc36e596fa..3c27486acecdc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -197,6 +197,8 @@ enum {
 	NVME_REQ_USERCMD		= (1 << 1),
 	NVME_MPATH_IO_STATS		= (1 << 2),
 	NVME_MPATH_CNT_ACTIVE		= (1 << 3),
+	NVME_REQ_USE_SGLS		= (1 << 4),
+	NVME_REQ_USE_META_SGLS		= (1 << 5),
 };
 
 static inline struct nvme_request *nvme_req(struct request *req)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a0a10451d7da8..eb2ac47f2bd54 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -522,7 +522,8 @@ static inline bool nvme_pci_sgl_capable(struct nvme_dev *dev,
 
 static inline bool nvme_pci_metadata_use_sgls(struct request *req)
 {
-	return blk_rq_integrity_segments(req) > 1;
+	return blk_rq_integrity_segments(req) > 1 ||
+		nvme_req(req)->flags & NVME_REQ_USE_META_SGLS;
 }
 
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
@@ -536,13 +537,20 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
 		return false;
 	if (nvme_pci_metadata_use_sgls(req))
 		return true;
-	return avg_seg_size >= sgl_threshold;
+	if (avg_seg_size < sgl_threshold)
+		return nvme_req(req)->flags & NVME_REQ_USE_SGLS;
+	return true;
 }
 
-static inline bool nvme_pci_use_prps(struct bio_vec *bv)
+static inline bool nvme_pci_use_prps(struct request *req, struct bio_vec *bv)
 {
-	unsigned int off = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
-	return  off + bv->bv_len <= NVME_CTRL_PAGE_SIZE * 2;
+	unsigned int off;
+
+	if (nvme_pci_metadata_use_sgls(req))
+		return false;
+
+	off = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
+	return off + bv->bv_len <= NVME_CTRL_PAGE_SIZE * 2;
 }
 
 static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
@@ -835,7 +843,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req)
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
-			if (nvme_pci_use_prps(&bv))
+			if (nvme_pci_use_prps(req, &bv))
 				return nvme_setup_prp_simple(dev, req, &bv);
 			if (nvme_pci_sgl_capable(dev, req))
 				return nvme_setup_sgl_simple(dev, req, &bv);
-- 
2.43.5




* Re: [PATCH-part-2 8/9] nvme-pci: add support for sgl metadata
  2024-09-04 18:38 ` [PATCH-part-2 8/9] nvme-pci: add support for sgl metadata Keith Busch
@ 2024-09-04 19:22   ` Keith Busch
  0 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2024-09-04 19:22 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, sagi, linux-nvme

On Wed, Sep 04, 2024 at 11:38:16AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Supporting this mode allows merging requests with metadata that wouldn't
> be possible otherwise.

A couple minor issues from splitting/merging in-progress patches.

> @@ -1988,7 +1988,9 @@ static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl,
>  	lim->max_hw_sectors = ctrl->max_hw_sectors;
>  	lim->max_segments = min_t(u32, USHRT_MAX,
>  		min_not_zero(nvme_max_drv_segments(ctrl), ctrl->max_segments));
> -	lim->max_integrity_segments = ctrl->max_integrity_segments;

This line was not supposed to be deleted. It should remain there as
before.

> +	if (lim->max_integrity_segments > 1 &&
> +	    !nvme_ctrl_meta_sgl_supported(ctrl))
> +		lim->max_integrity_segments = 1;
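
So presumably the hunk should end up looking like:

	lim->max_integrity_segments = ctrl->max_integrity_segments;
	if (lim->max_integrity_segments > 1 &&
	    !nvme_ctrl_meta_sgl_supported(ctrl))
		lim->max_integrity_segments = 1;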

[snip]

> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 7b2ae2e435447..87ea1673c60b8 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -388,6 +388,7 @@ enum {
>  	NVME_CTRL_CTRATT_PREDICTABLE_LAT	= 1 << 5,
>  	NVME_CTRL_CTRATT_NAMESPACE_GRANULARITY	= 1 << 7,
>  	NVME_CTRL_CTRATT_UUID_LIST		= 1 << 9,
> +	NVME_CTRL_SGLS_MPTR                     = 1 << 18,

This shift should be 19, not 18.
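
i.e. the entry should read:

	NVME_CTRL_SGLS_MPTR                     = 1 << 19,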



* Re: [PATCH-part-2 9/9] nvme: force sgls on user passthrough if possible
  2024-09-04 18:38 ` [PATCH-part-2 9/9] nvme: force sgls on user passthrough if possible Keith Busch
@ 2024-09-04 19:23   ` Keith Busch
  0 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2024-09-04 19:23 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, sagi, linux-nvme

On Wed, Sep 04, 2024 at 11:38:17AM -0700, Keith Busch wrote:
> @@ -146,6 +156,13 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
>  	if (bdev) {
>  		bio_set_dev(bio, bdev);
>  		if (meta_buffer && meta_len) {
> +			if (nvme_ctrl_meta_sgl_supported(ctrl)) {
> +				nvme_req(req)->flags |= NVME_REQ_USE_META_SGLS;
> +			} else {
> +				dev_warn_once(ctrl->device,
> +					      "using unchecked meta buffer\n");
> +				add_taint(TAINT_USER, LOCKDEP_STILL_OK);
> +			}
>  			ret = bio_integrity_map_user(bio, meta_buffer, meta_len,
>  						     meta_seed);
>  			if (ret)

This actually needs to add this line on success:

	req->nr_integrity_segments = bio_integrity(bio)->bip_vcnt;

But I think it would be even better to introduce a request helper,
blk_integrity_map_user(), that takes a struct request instead of a bio so
it can handle updating the request state appropriately.
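
A rough sketch of what that could look like (just illustrative, details to
be sorted out):

	int blk_integrity_map_user(struct request *rq, void __user *ubuf,
				   ssize_t bytes, u32 seed)
	{
		int ret = bio_integrity_map_user(rq->bio, ubuf, bytes, seed);

		if (ret)
			return ret;
		rq->nr_integrity_segments = bio_integrity(rq->bio)->bip_vcnt;
		return 0;
	}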



* Re: [PATCH-part-2 2/9] nvme-pci: provide prp selection helper
  2024-09-04 18:38 ` [PATCH-part-2 2/9] nvme-pci: provide prp selection helper Keith Busch
@ 2024-09-06 11:22   ` Anuj Gupta
  0 siblings, 0 replies; 15+ messages in thread
From: Anuj Gupta @ 2024-09-06 11:22 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, sagi, linux-nvme, Keith Busch


On 04/09/24 11:38AM, Keith Busch wrote:
>From: Keith Busch <kbusch@kernel.org>
>
>+static inline bool nvme_pci_use_prps(struct bio_vec *bv)
>+{
>+	unsigned int off = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);

Nit: please add a new line after variable declaration





* Re: [PATCH-part-2 5/9] nvme-pci: simplify io setup function parameters
  2024-09-04 18:38 ` [PATCH-part-2 5/9] nvme-pci: simplify io setup function parameters Keith Busch
@ 2024-09-06 11:24   ` Anuj Gupta
  0 siblings, 0 replies; 15+ messages in thread
From: Anuj Gupta @ 2024-09-06 11:24 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, sagi, linux-nvme, Keith Busch


On 04/09/24 11:38AM, Keith Busch wrote:
>From: Keith Busch <kbusch@kernel.org>
>
>We're threading a reference to the nvme_command through all functions
>before it's needed. It's also redundant since it's a part of the request
>that has to be provided anway. Remove the extra reference and only get

Nit: s/anway/anyway





* Re: [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough
  2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
                   ` (8 preceding siblings ...)
  2024-09-04 18:38 ` [PATCH-part-2 9/9] nvme: force sgls on user passthrough if possible Keith Busch
@ 2024-09-06 17:29 ` Keith Busch
  9 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2024-09-06 17:29 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, sagi, linux-nvme

On Wed, Sep 04, 2024 at 11:38:08AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> NVMe's implicit memory lengths are an invitation to buffer overflows. It's
> been known for a long time that this vulnerability exists.
> 
> The nvme protocol provides sgl with explicit lengths, so if the hardware
> is capable, let's use that.

Made some other fixes, minor modifications, and adjusted the patch order
to prevent any bisect hazards. Rather than resend the whole thing out
right now, I've set up a tested branch here:

  https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/log/?h=nvme-meta-sgl-2024-09-06



end of thread, other threads:[~2024-09-06 17:29 UTC | newest]

Thread overview: 15+ messages
2024-09-04 18:38 [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
2024-09-04 18:38 ` [PATCH-part-2 1/9] nvme-pci: use sgl capable helper function Keith Busch
2024-09-04 18:38 ` [PATCH-part-2 2/9] nvme-pci: provide prp selection helper Keith Busch
2024-09-06 11:22   ` Anuj Gupta
2024-09-04 18:38 ` [PATCH-part-2 3/9] nvme-pci: split out the simple dma mapping Keith Busch
2024-09-04 18:38 ` [PATCH-part-2 4/9] nvme-pci: remove "dma_len" from nvme_iod Keith Busch
2024-09-04 18:38 ` [PATCH-part-2 5/9] nvme-pci: simplify io setup function parameters Keith Busch
2024-09-06 11:24   ` Anuj Gupta
2024-09-04 18:38 ` [PATCH-part-2 6/9] nvme-pci: common dma pool alloc helper Keith Busch
2024-09-04 18:38 ` [PATCH-part-2 7/9] nvme-pci: provide a sgl mapping helper Keith Busch
2024-09-04 18:38 ` [PATCH-part-2 8/9] nvme-pci: add support for sgl metadata Keith Busch
2024-09-04 19:22   ` Keith Busch
2024-09-04 18:38 ` [PATCH-part-2 9/9] nvme: force sgls on user passthrough if possible Keith Busch
2024-09-04 19:23   ` Keith Busch
2024-09-06 17:29 ` [PATCH-part-2 0/9] nvme support for sgl mptr, safe passthrough Keith Busch
