linux-fsdevel.vger.kernel.org archive mirror
* [RFC 0/4] nvme-pci: breaking the 512 KiB max IO boundary
@ 2025-03-20 11:13 Luis Chamberlain
From: Luis Chamberlain @ 2025-03-20 11:13 UTC (permalink / raw)
  To: leon, hch, kbusch, sagi, axboe, joro, brauner, hare, willy, david,
	djwong
  Cc: john.g.garry, ritesh.list, linux-fsdevel, linux-block, linux-mm,
	gost.dev, p.raghav, da.gomez, kernel, mcgrof

Now that we have bs > ps (block size larger than page size) for block device
sector sizes on linux-next, the next eyesore is why our max sector size is
stuck at 64k when in theory we should be able to go up to the max supported
by the page cache. On x86_64 that's 2 MiB.
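
As a rough illustration of where the 2 MiB figure comes from, the sketch below
computes the largest block size a single page cache folio could back. The
helper name max_block_size() is hypothetical, and the order-9 maximum is an
assumption chosen because it reproduces 2 MiB with 4 KiB pages; it is not
taken from the patches in this series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the largest block size one folio could back is
 * the page size shifted left by the highest supported folio order.
 */
static inline uint64_t max_block_size(uint64_t page_size, unsigned int max_order)
{
	/* Page sizes are non-zero powers of two. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return page_size << max_order;
}
```

With 4 KiB pages, max_block_size(4096, 4) gives the current 64k limit and
max_block_size(4096, 9) gives 2 MiB, under the order-9 assumption above.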

The reason we didn't jump to 2 MiB is that testing with a limit higher than
64k turned up issues. While we've looked into them, a glaring one was the
scatter list limitation in the NVMe PCI driver. While we could adopt scatter
list chaining, the work Christoph and Leon have been doing on the two step
DMA API seems to be the way to go, since the scatter lists are tied to
PAGE_SIZE restrictions and scatter list chaining is just a mess.
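
To make the PAGE_SIZE tie concrete, here is a minimal sketch of the worst-case
arithmetic when every scatter list segment covers exactly one page. Both
helper names are hypothetical, and the 128-segment cap is an assumption picked
because, with 4 KiB pages, it reproduces the 512 KiB boundary named in the
subject line; treat it as illustrative rather than as the driver's definition.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: guaranteed IO size when each segment maps one
 * page, which is the worst case for a physically fragmented buffer.
 */
static inline uint64_t worst_case_io_bytes(uint32_t max_segs, uint64_t page_size)
{
	return (uint64_t)max_segs * page_size;
}

/*
 * Hypothetical helper: number of page-sized segments needed to cover
 * io_bytes in that same worst case (rounds up).
 */
static inline uint64_t segs_needed(uint64_t io_bytes, uint64_t page_size)
{
	assert(page_size != 0);
	return (io_bytes + page_size - 1) / page_size;
}
```

Under these assumptions, worst_case_io_bytes(128, 4096) is 512 KiB, while an
8 MiB IO over 4 KiB pages would need segs_needed(8 << 20, 4096) = 2048
segments if no pages are contiguous.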

This raises the question: with the new two step DMA API, does the problem
get easier? The answer is yes, and this series lets those who want to
experiment do just that.

With this we can enable a 2 MiB LBA format on NVMe and we can issue single IOs
up to 8 MiB for both buffered IO and direct IO. The last two patches are not
really intended for upstream, but rather experimental code to let folks muck
around with large sector sizes.

Daniel Gomez has taken Leon Romanovsky's new two step DMA API [0] and
Christoph Hellwig's "Block and NVMe PCI use of new DMA mapping API" [1].
We then applied the 64k sector size patches now merged on linux-next on top
of these and backported them to v6.14-rc5. The patches in this RFC sit on
top of all that, to demonstrate the minimal changes needed to enable up to
8 MiB IOs on NVMe, leveraging a 2 MiB max block sector size on x86_64, after
the two step DMA API and the NVMe cleanup.

If you want a git tree to play with, you can use our
large-block-buffer-heads-2m linux branch from kdevops [2].

[0] https://lore.kernel.org/all/20250302085717.GO53094@unreal/ 
[1] https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org/
[2] https://github.com/linux-kdevops/linux/tree/large-block-buffer-heads-2m

Luis Chamberlain (4):
  iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
  blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
  nvme-pci: bump segments to what the device can use
  nvme-pci: add quirk for qemu with bogus NOWS

 drivers/nvme/host/core.c |   2 +
 drivers/nvme/host/nvme.h |   5 ++
 drivers/nvme/host/pci.c  | 167 ++-------------------------------------
 fs/iomap/direct-io.c     |   2 +-
 include/linux/blkdev.h   |   7 +-
 5 files changed, 15 insertions(+), 168 deletions(-)

-- 
2.47.2



Thread overview: 13+ messages
2025-03-20 11:13 [RFC 0/4] nvme-pci: breaking the 512 KiB max IO boundary Luis Chamberlain
2025-03-20 11:13 ` [RFC 1/4] iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page Luis Chamberlain
2025-03-20 11:13 ` [RFC 2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit Luis Chamberlain
2025-03-20 16:01   ` Bart Van Assche
2025-03-20 16:06     ` Matthew Wilcox
2025-03-20 16:15       ` Bart Van Assche
2025-03-20 16:27         ` Matthew Wilcox
2025-03-20 16:34           ` Bart Van Assche
2025-03-20 16:44             ` Christoph Hellwig
2025-03-24 10:58           ` Bart Van Assche
2025-03-24 15:02             ` Matthew Wilcox
2025-03-20 11:13 ` [RFC 3/4] nvme-pci: bump segments to what the device can use Luis Chamberlain
2025-03-20 11:13 ` [RFC 4/4] nvme-pci: add quirk for qemu with bogus NOWS Luis Chamberlain
