[RFC 0/4] nvme-pci: breaking the 512 KiB max IO boundary

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC 0/4] nvme-pci: breaking the 512 KiB max IO boundary
@ 2025-03-20 11:13 Luis Chamberlain
  2025-03-20 11:13 ` [RFC 1/4] iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page Luis Chamberlain
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Luis Chamberlain @ 2025-03-20 11:13 UTC (permalink / raw)
  To: leon, hch, kbusch, sagi, axboe, joro, brauner, hare, willy, david,
	djwong
  Cc: john.g.garry, ritesh.list, linux-fsdevel, linux-block, linux-mm,
	gost.dev, p.raghav, da.gomez, kernel, mcgrof

Now that we have bs > ps for block device sector sizes on linux-next the next
eye sore is why our max sector size is stuck at 64k while we should be able to
go up to in theory to the max supported by the page cache. On x86_64 that's 2
MiB.

The reason we didn't jump to 2 MiB is because testing with a higher limit than
64k proved to have issues. While we've looked into them a glaring issue was
scatter list limitation on the NVMe PCI driver. While we could adopt scatter
list chaining, the work Christoph and Leon have been working on with the two
step DMA API seems to be the way to go since the scatter lists are tied to
PAGE_SIZE restrictions, and the scatter list chaining is just a mess.

So it begs the question, with the new two step DMA API, does the problem
get easier? The answer is yes, and for those that want to experiment this
will let you do just that.

With this we can enable 2 MiB LBA format on NVMe and we can issue single IOs
up to 8 MiB for both buffered IO and direct IO. The last two patches are not
really intended for upstream, but rather experimental code to let folks muck
around with large sector sizes.

Daniel Gomez has taken Leon Romanovsky's new two step DMA API [0] and
Christoph Hellwig's "Block and NMMe PCI use of new DMA mapping API" [1].
We then used this to apply on top the 64k sector size patches now merged on
linux-next and backported them to v6.14-rc5. The patches on this RFC
are the patches on top of all that so to demonstrate the minimal changes
needed to enable up to 8 MiB IOs on NVMe leveraging a 2 MiB max block
sector size on x86_64 after the two-step DMA API and the NVMe cleanup.

If you want a git tree to play with you can use our large-block-buffer-heads-2m
linux branch from kdevops.

[0] https://lore.kernel.org/all/20250302085717.GO53094@unreal/ 
[1] https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org/
[2] https://github.com/linux-kdevops/linux/tree/large-block-buffer-heads-2m

Luis Chamberlain (4):
  iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
  blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  nvme-pci: bump segments to what the device can use
  nvme-pci: add quirk for qemu with bogus NOWS

 drivers/nvme/host/core.c |   2 +
 drivers/nvme/host/nvme.h |   5 ++
 drivers/nvme/host/pci.c  | 167 ++-------------------------------------
 fs/iomap/direct-io.c     |   2 +-
 include/linux/blkdev.h   |   7 +-
 5 files changed, 15 insertions(+), 168 deletions(-)

-- 
2.47.2

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC 1/4] iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
  2025-03-20 11:13 [RFC 0/4] nvme-pci: breaking the 512 KiB max IO boundary Luis Chamberlain
@ 2025-03-20 11:13 ` Luis Chamberlain
  2025-03-20 11:13 ` [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit Luis Chamberlain
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Luis Chamberlain @ 2025-03-20 11:13 UTC (permalink / raw)
  To: leon, hch, kbusch, sagi, axboe, joro, brauner, hare, willy, david,
	djwong
  Cc: john.g.garry, ritesh.list, linux-fsdevel, linux-block, linux-mm,
	gost.dev, p.raghav, da.gomez, kernel, mcgrof

There is no point in modifying two locations now that we have
a sensible BLK_MAX_BLOCK_SIZE defined which is only lifted once
we have tested and validated things.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 0e47da82b0c2..aa109f4ee491 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -30,7 +30,7 @@
 /*
  * Used for sub block zeroing in iomap_dio_zero()
  */
-#define IOMAP_ZERO_PAGE_SIZE (SZ_64K)
+#define IOMAP_ZERO_PAGE_SIZE (BLK_MAX_BLOCK_SIZE)
 #define IOMAP_ZERO_PAGE_ORDER (get_order(IOMAP_ZERO_PAGE_SIZE))
 static struct page *zero_page;
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  2025-03-20 11:13 [RFC 0/4] nvme-pci: breaking the 512 KiB max IO boundary Luis Chamberlain
  2025-03-20 11:13 ` [RFC 1/4] iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page Luis Chamberlain
@ 2025-03-20 11:13 ` Luis Chamberlain
  2025-03-20 16:01   ` Bart Van Assche
  2025-03-20 11:13 ` [RFC 3/4] nvme-pci: bump segments to what the device can use Luis Chamberlain
  2025-03-20 11:13 ` [RFC 4/4] nvme-pci: add quirk for qemu with bogus NOWS Luis Chamberlain
  3 siblings, 1 reply; 13+ messages in thread
From: Luis Chamberlain @ 2025-03-20 11:13 UTC (permalink / raw)
  To: leon, hch, kbusch, sagi, axboe, joro, brauner, hare, willy, david,
	djwong
  Cc: john.g.garry, ritesh.list, linux-fsdevel, linux-block, linux-mm,
	gost.dev, p.raghav, da.gomez, kernel, mcgrof

It's a brave new world. This is now part of the validation to get
to at least the page cache limit.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/blkdev.h | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1c0cf6af392c..9e1b3e7526d9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -26,6 +26,7 @@
 #include <linux/xarray.h>
 #include <linux/file.h>
 #include <linux/lockdep.h>
+#include <linux/pagemap.h>
 
 struct module;
 struct request_queue;
@@ -268,11 +269,7 @@ static inline dev_t disk_devt(struct gendisk *disk)
 	return MKDEV(disk->major, disk->first_minor);
 }
 
-/*
- * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
- * however we constrain this to what we can validate and test.
- */
-#define BLK_MAX_BLOCK_SIZE      SZ_64K
+#define BLK_MAX_BLOCK_SIZE      1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
 
 /* blk_validate_limits() validates bsize, so drivers don't usually need to */
 static inline int blk_validate_block_size(unsigned long bsize)
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 3/4] nvme-pci: bump segments to what the device can use
  2025-03-20 11:13 [RFC 0/4] nvme-pci: breaking the 512 KiB max IO boundary Luis Chamberlain
  2025-03-20 11:13 ` [RFC 1/4] iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page Luis Chamberlain
  2025-03-20 11:13 ` [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit Luis Chamberlain
@ 2025-03-20 11:13 ` Luis Chamberlain
  2025-03-20 11:13 ` [RFC 4/4] nvme-pci: add quirk for qemu with bogus NOWS Luis Chamberlain
  3 siblings, 0 replies; 13+ messages in thread
From: Luis Chamberlain @ 2025-03-20 11:13 UTC (permalink / raw)
  To: leon, hch, kbusch, sagi, axboe, joro, brauner, hare, willy, david,
	djwong
  Cc: john.g.garry, ritesh.list, linux-fsdevel, linux-block, linux-mm,
	gost.dev, p.raghav, da.gomez, kernel, mcgrof

Now that we're not scatter list bound, just use the device limits.
The blk integrity stuff needs to be converted to the new dma API
first, so to enable large IO experimentation just remove it for now.

The iod pools are not used anymore so just nuke them.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/nvme/host/pci.c | 164 +---------------------------------------
 1 file changed, 3 insertions(+), 161 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1ca9ab2b8ec5..27b830072c14 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -40,8 +40,6 @@
  * require an sg allocation that needs more than a page of data.
  */
 #define NVME_MAX_KB_SZ	8192
-#define NVME_MAX_SEGS	128
-#define NVME_MAX_META_SEGS 15
 #define NVME_MAX_NR_DESCRIPTORS	5
 #define NVME_SMALL_DESCRIPTOR_SIZE	256
 
@@ -143,9 +141,6 @@ struct nvme_dev {
 	bool hmb;
 	struct sg_table *hmb_sgt;
 
-	mempool_t *iod_mempool;
-	mempool_t *iod_meta_mempool;
-
 	/* shadow doorbell buffer support: */
 	__le32 *dbbuf_dbs;
 	dma_addr_t dbbuf_dbs_dma_addr;
@@ -788,14 +783,6 @@ static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
 	sge->type = NVME_SGL_FMT_DATA_DESC << 4;
 }
 
-static void nvme_pci_sgl_set_data_legacy(struct nvme_sgl_desc *sge,
-		struct scatterlist *sg)
-{
-	sge->addr = cpu_to_le64(sg_dma_address(sg));
-	sge->length = cpu_to_le32(sg_dma_len(sg));
-	sge->type = NVME_SGL_FMT_DATA_DESC << 4;
-}
-
 static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 		dma_addr_t dma_addr, int entries)
 {
@@ -859,84 +846,6 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	return nvme_pci_setup_prps(dev, req, &cmnd->rw);
 }
 
-static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
-					     struct request *req)
-{
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	struct nvme_rw_command *cmnd = &iod->cmd.rw;
-	struct nvme_sgl_desc *sg_list;
-	struct scatterlist *sgl, *sg;
-	unsigned int entries;
-	dma_addr_t sgl_dma;
-	int rc, i;
-
-	iod->meta_sgt.sgl = mempool_alloc(dev->iod_meta_mempool, GFP_ATOMIC);
-	if (!iod->meta_sgt.sgl)
-		return BLK_STS_RESOURCE;
-
-	sg_init_table(iod->meta_sgt.sgl, req->nr_integrity_segments);
-	iod->meta_sgt.orig_nents = blk_rq_map_integrity_sg(req,
-							   iod->meta_sgt.sgl);
-	if (!iod->meta_sgt.orig_nents)
-		goto out_free_sg;
-
-	rc = dma_map_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req),
-			     DMA_ATTR_NO_WARN);
-	if (rc)
-		goto out_free_sg;
-
-	sg_list = dma_pool_alloc(dev->prp_small_pool, GFP_ATOMIC, &sgl_dma);
-	if (!sg_list)
-		goto out_unmap_sg;
-
-	entries = iod->meta_sgt.nents;
-	iod->meta_descriptors[0] = sg_list;
-	iod->meta_dma = sgl_dma;
-
-	cmnd->flags = NVME_CMD_SGL_METASEG;
-	cmnd->metadata = cpu_to_le64(sgl_dma);
-
-	sgl = iod->meta_sgt.sgl;
-	if (entries == 1) {
-		nvme_pci_sgl_set_data_legacy(sg_list, sgl);
-		return BLK_STS_OK;
-	}
-
-	sgl_dma += sizeof(*sg_list);
-	nvme_pci_sgl_set_seg(sg_list, sgl_dma, entries);
-	for_each_sg(sgl, sg, entries, i)
-		nvme_pci_sgl_set_data_legacy(&sg_list[i + 1], sg);
-
-	return BLK_STS_OK;
-
-out_unmap_sg:
-	dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0);
-out_free_sg:
-	mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool);
-	return BLK_STS_RESOURCE;
-}
-
-static blk_status_t nvme_pci_setup_meta_mptr(struct nvme_dev *dev,
-					     struct request *req)
-{
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	struct bio_vec bv = rq_integrity_vec(req);
-	struct nvme_command *cmnd = &iod->cmd;
-
-	iod->meta_dma = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->meta_dma))
-		return BLK_STS_IOERR;
-	cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
-	return BLK_STS_OK;
-}
-
-static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req)
-{
-	if (nvme_pci_metadata_use_sgls(dev, req))
-		return nvme_pci_setup_meta_sgls(dev, req);
-	return nvme_pci_setup_meta_mptr(dev, req);
-}
-
 static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -958,17 +867,8 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 			goto out_free_cmd;
 	}
 
-	if (blk_integrity_rq(req)) {
-		ret = nvme_map_metadata(dev, req);
-		if (ret)
-			goto out_unmap_data;
-	}
-
 	nvme_start_request(req);
 	return BLK_STS_OK;
-out_unmap_data:
-	if (blk_rq_nr_phys_segments(req))
-		nvme_unmap_data(dev, req);
 out_free_cmd:
 	nvme_cleanup_cmd(req);
 	return ret;
@@ -1057,32 +957,11 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
 	*rqlist = requeue_list;
 }
 
-static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
-						struct request *req)
-{
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-	if (!iod->meta_sgt.nents) {
-		dma_unmap_page(dev->dev, iod->meta_dma,
-			       rq_integrity_vec(req).bv_len,
-			       rq_dma_dir(req));
-		return;
-	}
-
-	dma_pool_free(dev->prp_small_pool, iod->meta_descriptors[0],
-		      iod->meta_dma);
-	dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0);
-	mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool);
-}
-
 static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_dev *dev = nvmeq->dev;
 
-	if (blk_integrity_rq(req))
-		nvme_unmap_metadata(dev, req);
-
 	if (blk_rq_nr_phys_segments(req))
 		nvme_unmap_data(dev, req);
 }
@@ -2874,31 +2753,6 @@ static void nvme_release_prp_pools(struct nvme_dev *dev)
 	dma_pool_destroy(dev->prp_small_pool);
 }
 
-static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
-{
-	size_t meta_size = sizeof(struct scatterlist) * (NVME_MAX_META_SEGS + 1);
-	size_t alloc_size = sizeof(struct scatterlist) * NVME_MAX_SEGS;
-
-	dev->iod_mempool = mempool_create_node(1,
-			mempool_kmalloc, mempool_kfree,
-			(void *)alloc_size, GFP_KERNEL,
-			dev_to_node(dev->dev));
-	if (!dev->iod_mempool)
-		return -ENOMEM;
-
-	dev->iod_meta_mempool = mempool_create_node(1,
-			mempool_kmalloc, mempool_kfree,
-			(void *)meta_size, GFP_KERNEL,
-			dev_to_node(dev->dev));
-	if (!dev->iod_meta_mempool)
-		goto free;
-
-	return 0;
-free:
-	mempool_destroy(dev->iod_mempool);
-	return -ENOMEM;
-}
-
 static void nvme_free_tagset(struct nvme_dev *dev)
 {
 	if (dev->tagset.tags)
@@ -2962,7 +2816,7 @@ static void nvme_reset_work(struct work_struct *work)
 		goto out;
 
 	if (nvme_ctrl_meta_sgl_supported(&dev->ctrl))
-		dev->ctrl.max_integrity_segments = NVME_MAX_META_SEGS;
+		dev->ctrl.max_integrity_segments = 0;
 	else
 		dev->ctrl.max_integrity_segments = 1;
 
@@ -3234,7 +3088,6 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	 */
 	dev->ctrl.max_hw_sectors = min_t(u32,
 		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
-	dev->ctrl.max_segments = NVME_MAX_SEGS;
 	dev->ctrl.max_integrity_segments = 1;
 	return dev;
 
@@ -3267,15 +3120,11 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto out_dev_unmap;
 
-	result = nvme_pci_alloc_iod_mempool(dev);
-	if (result)
-		goto out_release_prp_pools;
-
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
 	result = nvme_pci_enable(dev);
 	if (result)
-		goto out_release_iod_mempool;
+		goto out_release_prp_pools;
 
 	result = nvme_alloc_admin_tag_set(&dev->ctrl, &dev->admin_tagset,
 				&nvme_mq_admin_ops, sizeof(struct nvme_iod));
@@ -3298,7 +3147,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto out_disable;
 
 	if (nvme_ctrl_meta_sgl_supported(&dev->ctrl))
-		dev->ctrl.max_integrity_segments = NVME_MAX_META_SEGS;
+		dev->ctrl.max_integrity_segments = 0;
 	else
 		dev->ctrl.max_integrity_segments = 1;
 
@@ -3342,9 +3191,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	nvme_dev_remove_admin(dev);
 	nvme_dbbuf_dma_free(dev);
 	nvme_free_queues(dev, 0);
-out_release_iod_mempool:
-	mempool_destroy(dev->iod_mempool);
-	mempool_destroy(dev->iod_meta_mempool);
 out_release_prp_pools:
 	nvme_release_prp_pools(dev);
 out_dev_unmap:
@@ -3409,8 +3255,6 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_dev_remove_admin(dev);
 	nvme_dbbuf_dma_free(dev);
 	nvme_free_queues(dev, 0);
-	mempool_destroy(dev->iod_mempool);
-	mempool_destroy(dev->iod_meta_mempool);
 	nvme_release_prp_pools(dev);
 	nvme_dev_unmap(dev);
 	nvme_uninit_ctrl(&dev->ctrl);
@@ -3804,8 +3648,6 @@ static int __init nvme_init(void)
 	BUILD_BUG_ON(sizeof(struct nvme_create_sq) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_delete_queue) != 64);
 	BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS < 2);
-	BUILD_BUG_ON(NVME_MAX_SEGS > SGES_PER_PAGE);
-	BUILD_BUG_ON(sizeof(struct scatterlist) * NVME_MAX_SEGS > PAGE_SIZE);
 	BUILD_BUG_ON(nvme_pci_npages_prp() > NVME_MAX_NR_DESCRIPTORS);
 
 	return pci_register_driver(&nvme_driver);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 4/4] nvme-pci: add quirk for qemu with bogus NOWS
  2025-03-20 11:13 [RFC 0/4] nvme-pci: breaking the 512 KiB max IO boundary Luis Chamberlain
                   ` (2 preceding siblings ...)
  2025-03-20 11:13 ` [RFC 3/4] nvme-pci: bump segments to what the device can use Luis Chamberlain
@ 2025-03-20 11:13 ` Luis Chamberlain
  3 siblings, 0 replies; 13+ messages in thread
From: Luis Chamberlain @ 2025-03-20 11:13 UTC (permalink / raw)
  To: leon, hch, kbusch, sagi, axboe, joro, brauner, hare, willy, david,
	djwong
  Cc: john.g.garry, ritesh.list, linux-fsdevel, linux-block, linux-mm,
	gost.dev, p.raghav, da.gomez, kernel, mcgrof

The NOWS value for qemu is bogus but that means we need
to be mucking with userspace when testing large IO, so just
add a quirk to use sensible max limits, in this case just use
MDTS as these drives are virtualized.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/nvme/host/core.c | 2 ++
 drivers/nvme/host/nvme.h | 5 +++++
 drivers/nvme/host/pci.c  | 3 ++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f028913e2e62..8f516de16281 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2070,6 +2070,8 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
 		/* NOWS = Namespace Optimal Write Size */
 		if (id->nows)
 			io_opt = bs * (1 + le16_to_cpu(id->nows));
+		else if (ns->ctrl->quirks & NVME_QUIRK_BOGUS_NOWS)
+			io_opt = lim->max_hw_sectors << SECTOR_SHIFT;
 	}
 
 	/*
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7be92d07430e..c63a804db462 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -178,6 +178,11 @@ enum nvme_quirks {
 	 * Align dma pool segment size to 512 bytes
 	 */
 	NVME_QUIRK_DMAPOOL_ALIGN_512		= (1 << 22),
+
+	/*
+	 * Reports a NOWS of 0 which is 1 logical block size which is bogus
+	 */
+	NVME_QUIRK_BOGUS_NOWS			= (1 << 23),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 27b830072c14..577d8f909139 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3469,7 +3469,8 @@ static const struct pci_device_id nvme_id_table[] = {
 				NVME_QUIRK_DISABLE_WRITE_ZEROES |
 				NVME_QUIRK_BOGUS_NID, },
 	{ PCI_VDEVICE(REDHAT, 0x0010),	/* Qemu emulated controller */
-		.driver_data = NVME_QUIRK_BOGUS_NID, },
+		.driver_data = NVME_QUIRK_BOGUS_NID |
+				NVME_QUIRK_BOGUS_NOWS, },
 	{ PCI_DEVICE(0x1217, 0x8760), /* O2 Micro 64GB Steam Deck */
 		.driver_data = NVME_QUIRK_DMAPOOL_ALIGN_512, },
 	{ PCI_DEVICE(0x126f, 0x2262),	/* Silicon Motion generic */
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  2025-03-20 11:13 ` [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit Luis Chamberlain
@ 2025-03-20 16:01   ` Bart Van Assche
  2025-03-20 16:06     ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2025-03-20 16:01 UTC (permalink / raw)
  To: Luis Chamberlain, leon, hch, kbusch, sagi, axboe, joro, brauner,
	hare, willy, david, djwong
  Cc: john.g.garry, ritesh.list, linux-fsdevel, linux-block, linux-mm,
	gost.dev, p.raghav, da.gomez, kernel

On 3/20/25 4:13 AM, Luis Chamberlain wrote:
> -/*
> - * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
> - * however we constrain this to what we can validate and test.
> - */
> -#define BLK_MAX_BLOCK_SIZE      SZ_64K
> +#define BLK_MAX_BLOCK_SIZE      1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
>   
>   /* blk_validate_limits() validates bsize, so drivers don't usually need to */
>   static inline int blk_validate_block_size(unsigned long bsize)

All logical block sizes above 4 KiB trigger write amplification if there
are applications that write 4 KiB at a time, isn't it? Isn't that issue
even worse for logical block sizes above 64 KiB?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  2025-03-20 16:01   ` Bart Van Assche
@ 2025-03-20 16:06     ` Matthew Wilcox
  2025-03-20 16:15       ` Bart Van Assche
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2025-03-20 16:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Luis Chamberlain, leon, hch, kbusch, sagi, axboe, joro, brauner,
	hare, david, djwong, john.g.garry, ritesh.list, linux-fsdevel,
	linux-block, linux-mm, gost.dev, p.raghav, da.gomez, kernel

On Thu, Mar 20, 2025 at 09:01:46AM -0700, Bart Van Assche wrote:
> On 3/20/25 4:13 AM, Luis Chamberlain wrote:
> > -/*
> > - * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
> > - * however we constrain this to what we can validate and test.
> > - */
> > -#define BLK_MAX_BLOCK_SIZE      SZ_64K
> > +#define BLK_MAX_BLOCK_SIZE      1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
> >   /* blk_validate_limits() validates bsize, so drivers don't usually need to */
> >   static inline int blk_validate_block_size(unsigned long bsize)
> 
> All logical block sizes above 4 KiB trigger write amplification if there
> are applications that write 4 KiB at a time, isn't it? Isn't that issue
> even worse for logical block sizes above 64 KiB?

I think everybody knows this Bart.  You've raised it before, we've
talked about it, and you're not bringing anything new to the discussion
this time.  Why bring it up again?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  2025-03-20 16:06     ` Matthew Wilcox
@ 2025-03-20 16:15       ` Bart Van Assche
  2025-03-20 16:27         ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2025-03-20 16:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, leon, hch, kbusch, sagi, axboe, joro, brauner,
	hare, david, djwong, john.g.garry, ritesh.list, linux-fsdevel,
	linux-block, linux-mm, gost.dev, p.raghav, da.gomez, kernel

On 3/20/25 9:06 AM, Matthew Wilcox wrote:
> On Thu, Mar 20, 2025 at 09:01:46AM -0700, Bart Van Assche wrote:
>> On 3/20/25 4:13 AM, Luis Chamberlain wrote:
>>> -/*
>>> - * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
>>> - * however we constrain this to what we can validate and test.
>>> - */
>>> -#define BLK_MAX_BLOCK_SIZE      SZ_64K
>>> +#define BLK_MAX_BLOCK_SIZE      1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
>>>    /* blk_validate_limits() validates bsize, so drivers don't usually need to */
>>>    static inline int blk_validate_block_size(unsigned long bsize)
>>
>> All logical block sizes above 4 KiB trigger write amplification if there
>> are applications that write 4 KiB at a time, isn't it? Isn't that issue
>> even worse for logical block sizes above 64 KiB?
> 
> I think everybody knows this Bart.  You've raised it before, we've
> talked about it, and you're not bringing anything new to the discussion
> this time.  Why bring it up again?

The patch description mentions what has been changed but does not
mention why. Shouldn't the description of this patch explain why this
change has been made? Shouldn't the description of this patch explain
for which applications this change is useful?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  2025-03-20 16:15       ` Bart Van Assche
@ 2025-03-20 16:27         ` Matthew Wilcox
  2025-03-20 16:34           ` Bart Van Assche
  2025-03-24 10:58           ` Bart Van Assche
  0 siblings, 2 replies; 13+ messages in thread
From: Matthew Wilcox @ 2025-03-20 16:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Luis Chamberlain, leon, hch, kbusch, sagi, axboe, joro, brauner,
	hare, david, djwong, john.g.garry, ritesh.list, linux-fsdevel,
	linux-block, linux-mm, gost.dev, p.raghav, da.gomez, kernel

On Thu, Mar 20, 2025 at 09:15:23AM -0700, Bart Van Assche wrote:
> The patch description mentions what has been changed but does not
> mention why. Shouldn't the description of this patch explain why this
> change has been made? Shouldn't the description of this patch explain
> for which applications this change is useful?

The manufacturer chooses the block size.  If they've made a bad decision,
their device will presumably not sell well.  We don't need to justify
their decision in the commit message.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  2025-03-20 16:27         ` Matthew Wilcox
@ 2025-03-20 16:34           ` Bart Van Assche
  2025-03-20 16:44             ` Christoph Hellwig
  2025-03-24 10:58           ` Bart Van Assche
  1 sibling, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2025-03-20 16:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, leon, hch, kbusch, sagi, axboe, joro, brauner,
	hare, david, djwong, john.g.garry, ritesh.list, linux-fsdevel,
	linux-block, linux-mm, gost.dev, p.raghav, da.gomez, kernel

On 3/20/25 9:27 AM, Matthew Wilcox wrote:
> On Thu, Mar 20, 2025 at 09:15:23AM -0700, Bart Van Assche wrote:
>> The patch description mentions what has been changed but does not
>> mention why. Shouldn't the description of this patch explain why this
>> change has been made? Shouldn't the description of this patch explain
>> for which applications this change is useful?
> 
> The manufacturer chooses the block size.  If they've made a bad decision,
> their device will presumably not sell well.  We don't need to justify
> their decision in the commit message.

The fact that this change is proposed because there are device
manufacturers that want to produce devices with block sizes larger than
64 KiB would be useful information for the commit message.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  2025-03-20 16:34           ` Bart Van Assche
@ 2025-03-20 16:44             ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2025-03-20 16:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Matthew Wilcox, Luis Chamberlain, leon, hch, kbusch, sagi, axboe,
	joro, brauner, hare, david, djwong, john.g.garry, ritesh.list,
	linux-fsdevel, linux-block, linux-mm, gost.dev, p.raghav,
	da.gomez, kernel

On Thu, Mar 20, 2025 at 09:34:50AM -0700, Bart Van Assche wrote:
> The fact that this change is proposed because there are device
> manufacturers that want to produce devices with block sizes larger than
> 64 KiB would be useful information for the commit message.

I think it's just the proposal that is very confused.  Keith pointed
out that number of segments mostly matters when you have 4k
non-contiguous pages and that's where the quoted limit comes from.

The LBA size is a lower bound to the folio size in the page cache,
so the limit automatically increases with that, although in practice
the file system will usually allocate larger folios even with a smaller
block size, and I assume most systems (or at least the ones that care
about performance) actually use transparent huge pages/folios for
anonymous memory as well.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  2025-03-20 16:27         ` Matthew Wilcox
  2025-03-20 16:34           ` Bart Van Assche
@ 2025-03-24 10:58           ` Bart Van Assche
  2025-03-24 15:02             ` Matthew Wilcox
  1 sibling, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2025-03-24 10:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, leon, hch, kbusch, sagi, axboe, joro, brauner,
	hare, david, djwong, john.g.garry, ritesh.list, linux-fsdevel,
	linux-block, linux-mm, gost.dev, p.raghav, da.gomez, kernel

On 3/20/25 12:27 PM, Matthew Wilcox wrote:
> On Thu, Mar 20, 2025 at 09:15:23AM -0700, Bart Van Assche wrote:
>> The patch description mentions what has been changed but does not
>> mention why. Shouldn't the description of this patch explain why this
>> change has been made? Shouldn't the description of this patch explain
>> for which applications this change is useful?
> 
> The manufacturer chooses the block size.  If they've made a bad decision,
> their device will presumably not sell well.  We don't need to justify
> their decision in the commit message.

 From a 2023 presentation by Luis 
(https://lpc.events/event/17/contributions/1508/attachments/1298/2608/LBS_LPC2023.pdf):
- SSD manufacturers want to increase the indirection unit (IU) size.
- Increasing the IU size reduces SSD DRAM costs.
- LBS is not suitable for all workloads because smaller IOs with LBS can
   cause write amplification (WAF) due to read modify writes.
- Some database software benefits of a 16 KiB logical block size.

If the goal is to reduce DRAM costs then I recommend SSD manufacturers
to implement zoned storage (ZNS) instead of only increasing the logical
block size. A big advantage of zoned storage is that the DRAM cost is
reduced significantly even if the block size is not increased.

Are there any applications that benefit from a block size larger than
64 KiB? If not, why to increase BLK_MAX_BLOCK_SIZE further? Do you agree
that this question should be answered in the patch description?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  2025-03-24 10:58           ` Bart Van Assche
@ 2025-03-24 15:02             ` Matthew Wilcox
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2025-03-24 15:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Luis Chamberlain, leon, hch, kbusch, sagi, axboe, joro, brauner,
	hare, david, djwong, john.g.garry, ritesh.list, linux-fsdevel,
	linux-block, linux-mm, gost.dev, p.raghav, da.gomez, kernel

On Mon, Mar 24, 2025 at 06:58:26AM -0400, Bart Van Assche wrote:
> If the goal is to reduce DRAM costs then I recommend SSD manufacturers
> to implement zoned storage (ZNS) instead of only increasing the logical
> block size. A big advantage of zoned storage is that the DRAM cost is
> reduced significantly even if the block size is not increased.
> 
> Are there any applications that benefit from a block size larger than
> 64 KiB? If not, why to increase BLK_MAX_BLOCK_SIZE further? Do you agree
> that this question should be answered in the patch description?

Do I agree that we should use the commit message to enter into a
philosophical debate about whether ZNS or large block sizes are better?
No, I do not.  I don't even think we should have this discussion
any more on this mailing list; I think everyone is aware that both
alternatives exist.  You don't like it, and that's your prerogative.
But at some point you have to stop being an awkward cuss about it.

I think CXL is an abomination; I've made this point often enough that
everybody is aware of it.  I don't make it any more.  All I do is NACK
the inclusion of patches that are only for the benefit of CXL until
CXL has actually demonstrated its utility.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-03-24 15:02 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-20 11:13 [RFC 0/4] nvme-pci: breaking the 512 KiB max IO boundary Luis Chamberlain
2025-03-20 11:13 ` [RFC 1/4] iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page Luis Chamberlain
2025-03-20 11:13 ` [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit Luis Chamberlain
2025-03-20 16:01   ` Bart Van Assche
2025-03-20 16:06     ` Matthew Wilcox
2025-03-20 16:15       ` Bart Van Assche
2025-03-20 16:27         ` Matthew Wilcox
2025-03-20 16:34           ` Bart Van Assche
2025-03-20 16:44             ` Christoph Hellwig
2025-03-24 10:58           ` Bart Van Assche
2025-03-24 15:02             ` Matthew Wilcox
2025-03-20 11:13 ` [RFC 3/4] nvme-pci: bump segments to what the device can use Luis Chamberlain
2025-03-20 11:13 ` [RFC 4/4] nvme-pci: add quirk for qemu with bogus NOWS Luis Chamberlain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).