linux-block.vger.kernel.org archive mirror
* [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
@ 2023-09-15 21:32 Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 01/10] bdev: rename iomap aops Luis Chamberlain
                   ` (11 more replies)
  0 siblings, 12 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

Christoph added CONFIG_BUFFER_HEAD in v6.1, enabling a world where we can
live without buffer-heads. When we opt into that world we also end up
using iomap for the address space operations of the block device cache.
Since iomap supports higher order folios, block devices which use the
iomap aops can end up having a logical block size or physical block size
greater than PAGE_SIZE. We refer to these as LBS devices. This in turn
allows filesystems which support bs > 4k to be enabled on a 4k PAGE_SIZE
world on LBS block devices. LBS devices can then take advantage of the
recently posted work to enable LBS support for filesystems [0].
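
As an aside, whether a given kernel opted into that world can usually be
checked from its build config, for example (config file path may vary by
distribution):

  $ grep CONFIG_BUFFER_HEAD /boot/config-$(uname -r)
  CONFIG_BUFFER_HEAD=y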

However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not
viable for many Linux distributions, since it also means disabling support
for most filesystems other than btrfs and XFS. So we either support larger
order folios on buffer-heads, or we draw up a solution to enable
co-existence. Since at LSFMM 2023 it was decided we would not support
larger order folios on buffer-heads, this series takes the approach of
supporting co-existence between buffer-head based filesystems and the
iomap aops, switching to the latter dynamically when needed on block
devices. Folks who want to enhance buffer-heads to support larger order
folios can do so in parallel to this effort. This just allows iomap based
filesystems to support LBS block devices today, while also allowing
buffer-head based filesystems to be used on them, as long as the LBS
device also supports a logical block size of PAGE_SIZE or smaller.

LBS devices can come in a few flavors. Adopting a larger LBA format
enables logical block sizes > 4k. Other block device mechanisms can also
increase a block device's physical block size beyond 4k while still
retaining backward compatibility with a logical block size of 4k. An
example is NVMe drives with an LBA format of 4k and an npwg and
awupf | nawupf of 16k. With this you get a logical block size of 4k and a
physical block size of 16k. Drives which do this may become a reality as
larger Indirection Units (IUs) on QLC adopt IUs larger than 4k so as to
better support larger capacities. At least OCP 2.0 does require you to
expose the IU through npwg. Whether or not a device supports the same
value for awupf | nawupf (a large atomic) is device specific. Devices
which *do* sport an LBA format of 4k and npwg & awupf | nawupf >= 16k can
essentially benefit from LBS support such as with XFS [1].
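
For illustration, assuming such a drive shows up as /dev/nvme0n1 (the
device name and values here are only illustrative), the resulting block
sizes can be verified through sysfs:

  $ cat /sys/block/nvme0n1/queue/logical_block_size
  4096
  $ cat /sys/block/nvme0n1/queue/physical_block_size
  16384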

While LBS devices come to market, folks can also experiment with NVMe
today to get a sense of what some of this might be like, by ignoring power
failure and faking the larger atomic up to the NVMe awun. When the NVMe
awun is 0xffff it means awun is MDTS, so for those drives, in practice you
can experiment today with LBS up to MDTS on real drives. This is
documented in the kdevops documentation for LBS, using something like
nvme_core.debug_large_atomics=16384 [2].
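
As a rough sketch of how to set that (the parameter only exists with the
debug patches from this series applied), one could make it persistent
with something like:

  $ echo "options nvme_core debug_large_atomics=16384" | \
      sudo tee /etc/modprobe.d/nvme-lbs.conf

or simply pass nvme_core.debug_large_atomics=16384 on the kernel command
line.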

To experiment with larger LBA formats you can also use kdevops and enable
CONFIG_QEMU_ENABLE_EXTRA_DRIVE_LARGEIO. That enables a ton of drives with
logical and physical block sizes >= 4k up to a desirable max target for
experimentation. Since filesystems today only support up to 32k sector
sizes, in practice you may only want to experiment with up to 32k physical
/ logical block sizes.
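
For example, assuming one of those emulated drives shows up as
/dev/nvme1n1 (device name illustrative) and a kernel with the LBS
filesystem patches [1] is in use, a quick sanity check with a 16k block
size XFS filesystem might look like:

  $ mkfs.xfs -f -b size=16k /dev/nvme1n1
  $ mount /dev/nvme1n1 /mnt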

Support for 64k sector sizes requires an XFS format change, which is
something Daniel Gomez has experimental patches for, in case folks are
interested in experimenting with those as well.

Patch 6 could probably be squashed with patch 5, but I wanted to be
explicit about this, as this should be decided with the community.

There might be a better way to do this than to deal with switching the
aops dynamically, ideas welcomed!

The NVMe debug patches are posted for pure experimentation purposes, but
the AWUN patch might be welcomed upstream. Unless folks really want it,
the goal here is *not* to upstream support for the debug module parameter
nvme_core.debug_large_atomics=16384. On a v3 RFC I can drop those patches
and just aim to simplify the logic to support LBA formats > ps.

Changes since v1 RFC:

  o Modified the approach to accept we can have different aops per super block
    on the block device cache as suggested by Christoph
  o Try to allow dynamically switching between iomap aops and buffer-head aops
  o Use the iomap aops if the target block size order or min order > ps
  o Loosen restrictions on set_blocksize() so as to just check for the
    iomap aops for now, making it easy to allow buffer-heads to later add
    support for bs > ps as well
  o Use buffer-head aops first always unless the min order is > ps so as
    to always support buffer-head based filesystems
  o Adopt a flag for buffer-head filesystems so we can check for support;
    for instance, buffer-head based filesystems cannot be used while an
    LBS filesystem is using the block device
  o Allow iomap aops to be used when the physical block size is > ps as well
  o NVMe changes to experiment with LBS devices without power failure
  o NVMe changes to support LBS devices

[0] https://lkml.kernel.org/r/20230608032404.1887046-1-mcgrof@kernel.org
[1] https://lkml.kernel.org/r/20230915183848.1018717-1-kernel@pankajraghav.com
[2] https://github.com/linux-kdevops/kdevops/blob/master/docs/lbs.md

Luis Chamberlain (10):
  bdev: rename iomap aops
  bdev: dynamically set aops to enable LBS support
  bdev: increase bdev max blocksize depending on the aops used
  filesystems: add filesystem buffer-head flag
  bdev: allow to switch between bdev aops
  bdev: simplify coexistence
  nvme: enhance max supported LBA format check
  nvme: add awun / nawun sanity check
  nvme: add nvme_core.debug_large_atomics to force high awun as phys_bs
  nvme: enable LBS support

 block/bdev.c             | 73 ++++++++++++++++++++++++++++++++++++++--
 block/blk.h              |  1 +
 block/fops.c             | 14 ++++----
 drivers/nvme/host/core.c | 70 +++++++++++++++++++++++++++++++++++---
 drivers/nvme/host/nvme.h |  1 +
 fs/adfs/super.c          |  2 +-
 fs/affs/super.c          |  2 +-
 fs/befs/linuxvfs.c       |  2 +-
 fs/bfs/inode.c           |  2 +-
 fs/efs/super.c           |  2 +-
 fs/exfat/super.c         |  2 +-
 fs/ext2/super.c          |  2 +-
 fs/ext4/super.c          |  7 ++--
 fs/f2fs/super.c          |  2 +-
 fs/fat/namei_msdos.c     |  2 +-
 fs/fat/namei_vfat.c      |  2 +-
 fs/freevxfs/vxfs_super.c |  2 +-
 fs/gfs2/ops_fstype.c     |  4 +--
 fs/hfs/super.c           |  2 +-
 fs/hfsplus/super.c       |  2 +-
 fs/isofs/inode.c         |  2 +-
 fs/jfs/super.c           |  2 +-
 fs/minix/inode.c         |  2 +-
 fs/nilfs2/super.c        |  2 +-
 fs/ntfs/super.c          |  2 +-
 fs/ntfs3/super.c         |  2 +-
 fs/ocfs2/super.c         |  2 +-
 fs/omfs/inode.c          |  2 +-
 fs/qnx4/inode.c          |  2 +-
 fs/qnx6/inode.c          |  2 +-
 fs/reiserfs/super.c      |  2 +-
 fs/super.c               |  3 +-
 fs/sysv/super.c          |  4 +--
 fs/udf/super.c           |  2 +-
 fs/ufs/super.c           |  2 +-
 include/linux/blkdev.h   |  7 ++++
 include/linux/fs.h       |  1 +
 37 files changed, 189 insertions(+), 48 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC v2 01/10] bdev: rename iomap aops
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
@ 2023-09-15 21:32 ` Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 02/10] bdev: dynamically set aops to enable LBS support Luis Chamberlain
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

Allow the buffer-head and iomap aops to co-exist in a build. Right
now the iomap aops can only be used if you disable buffer-heads.
In the near future we should be able to dynamically select at runtime
the intended aops based on the nature of the filesystem and device
requirements.

So rename the iomap aops, and use the new name when buffer-heads
are disabled. This introduces no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/bdev.c |  4 ++++
 block/blk.h  |  1 +
 block/fops.c | 14 +++++++-------
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index f3b13aa1b7d4..6e62d8a992e6 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -392,7 +392,11 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 		return NULL;
 	inode->i_mode = S_IFBLK;
 	inode->i_rdev = 0;
+#ifdef CONFIG_BUFFER_HEAD
 	inode->i_data.a_ops = &def_blk_aops;
+#else
+	inode->i_data.a_ops = &def_blk_aops_iomap;
+#endif
 	mapping_set_gfp_mask(&inode->i_data, GFP_USER);
 
 	bdev = I_BDEV(inode);
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..75e8deb9f458 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -473,6 +473,7 @@ long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 
 extern const struct address_space_operations def_blk_aops;
+extern const struct address_space_operations def_blk_aops_iomap;
 
 int disk_register_independent_access_ranges(struct gendisk *disk);
 void disk_unregister_independent_access_ranges(struct gendisk *disk);
diff --git a/block/fops.c b/block/fops.c
index acff3d5d22d4..80a8430bcd69 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -455,13 +455,14 @@ const struct address_space_operations def_blk_aops = {
 	.migrate_folio	= buffer_migrate_folio_norefs,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
 };
-#else /* CONFIG_BUFFER_HEAD */
-static int blkdev_read_folio(struct file *file, struct folio *folio)
+
+#endif /* CONFIG_BUFFER_HEAD */
+static int blkdev_read_folio_iomap(struct file *file, struct folio *folio)
 {
 	return iomap_read_folio(folio, &blkdev_iomap_ops);
 }
 
-static void blkdev_readahead(struct readahead_control *rac)
+static void blkdev_readahead_iomap(struct readahead_control *rac)
 {
 	iomap_readahead(rac, &blkdev_iomap_ops);
 }
@@ -492,18 +493,17 @@ static int blkdev_writepages(struct address_space *mapping,
 	return iomap_writepages(mapping, wbc, &wpc, &blkdev_writeback_ops);
 }
 
-const struct address_space_operations def_blk_aops = {
+const struct address_space_operations def_blk_aops_iomap = {
 	.dirty_folio	= filemap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
-	.read_folio		= blkdev_read_folio,
-	.readahead		= blkdev_readahead,
+	.read_folio		= blkdev_read_folio_iomap,
+	.readahead		= blkdev_readahead_iomap,
 	.writepages		= blkdev_writepages,
 	.is_partially_uptodate  = iomap_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.migrate_folio		= filemap_migrate_folio,
 };
-#endif /* CONFIG_BUFFER_HEAD */
 
 /*
  * for a block special file file_inode(file)->i_size is zero
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 02/10] bdev: dynamically set aops to enable LBS support
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 01/10] bdev: rename iomap aops Luis Chamberlain
@ 2023-09-15 21:32 ` Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 03/10] bdev: increase bdev max blocksize depending on the aops used Luis Chamberlain
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

In order to support large block devices where block size > page size
we must be able to use an aops which supports blocks > ps, and the
block layer needs this for its address space operations. We have two
sets of aops, and only one supports bs > ps right now: the iomap aops
used for the block device cache.

If the min order has not yet been set and the target filesystem does
require bs > ps, allow the inode for the block device cache to use
the iomap aops.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/bdev.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/block/bdev.c b/block/bdev.c
index 6e62d8a992e6..63b4d7dd8075 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -126,6 +126,7 @@ static void set_init_blocksize(struct block_device *bdev)
 {
 	unsigned int bsize = bdev_logical_block_size(bdev);
 	loff_t size = i_size_read(bdev->bd_inode);
+	int order, folio_order;
 
 	while (bsize < PAGE_SIZE) {
 		if (size & bsize)
@@ -133,6 +134,13 @@ static void set_init_blocksize(struct block_device *bdev)
 		bsize <<= 1;
 	}
 	bdev->bd_inode->i_blkbits = blksize_bits(bsize);
+	order = bdev->bd_inode->i_blkbits - PAGE_SHIFT;
+	folio_order = mapping_min_folio_order(bdev->bd_inode->i_mapping);
+	if (order > 0 && folio_order == 0) {
+		mapping_set_folio_orders(bdev->bd_inode->i_mapping, order,
+					 MAX_PAGECACHE_ORDER);
+		bdev->bd_inode->i_data.a_ops = &def_blk_aops_iomap;
+	}
 }
 
 int set_blocksize(struct block_device *bdev, int size)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 03/10] bdev: increase bdev max blocksize depending on the aops used
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 01/10] bdev: rename iomap aops Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 02/10] bdev: dynamically set aops to enable LBS support Luis Chamberlain
@ 2023-09-15 21:32 ` Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 04/10] filesystems: add filesystem buffer-head flag Luis Chamberlain
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

While buffer-heads are stuck at order 0, i.e. PAGE_SIZE, iomap support
is currently capped at MAX_PAGECACHE_ORDER as that is also the cap set
on readahead. Match that limit for the max allowed block size when
using iomap.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/bdev.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 63b4d7dd8075..0d685270cd34 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -143,10 +143,17 @@ static void set_init_blocksize(struct block_device *bdev)
 	}
 }
 
+static int bdev_bsize_limit(struct block_device *bdev)
+{
+	if (bdev->bd_inode->i_data.a_ops == &def_blk_aops_iomap)
+		return 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
+	return PAGE_SIZE;
+}
+
 int set_blocksize(struct block_device *bdev, int size)
 {
-	/* Size must be a power of two, and between 512 and PAGE_SIZE */
-	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
+	/* Size must be a power of two, and between 512 and supported order */
+	if (size > bdev_bsize_limit(bdev) || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
 
 	/* Size cannot be smaller than the size supported by the device */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 04/10] filesystems: add filesystem buffer-head flag
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
                   ` (2 preceding siblings ...)
  2023-09-15 21:32 ` [RFC v2 03/10] bdev: increase bdev max blocksize depending on the aops used Luis Chamberlain
@ 2023-09-15 21:32 ` Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 05/10] bdev: allow to switch between bdev aops Luis Chamberlain
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

Add a new FS_BUFFER_HEADS fs_flags bit so filesystems which still
require buffer-heads can be annotated as such, and set it on all the
block filesystems which currently depend on buffer-heads. This lets
the block layer later decide which bdev aops a given filesystem can
be used with.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/adfs/super.c          | 2 +-
 fs/affs/super.c          | 2 +-
 fs/befs/linuxvfs.c       | 2 +-
 fs/bfs/inode.c           | 2 +-
 fs/efs/super.c           | 2 +-
 fs/exfat/super.c         | 2 +-
 fs/ext2/super.c          | 2 +-
 fs/ext4/super.c          | 7 ++++---
 fs/f2fs/super.c          | 2 +-
 fs/fat/namei_msdos.c     | 2 +-
 fs/fat/namei_vfat.c      | 2 +-
 fs/freevxfs/vxfs_super.c | 2 +-
 fs/gfs2/ops_fstype.c     | 4 ++--
 fs/hfs/super.c           | 2 +-
 fs/hfsplus/super.c       | 2 +-
 fs/isofs/inode.c         | 2 +-
 fs/jfs/super.c           | 2 +-
 fs/minix/inode.c         | 2 +-
 fs/nilfs2/super.c        | 2 +-
 fs/ntfs/super.c          | 2 +-
 fs/ntfs3/super.c         | 2 +-
 fs/ocfs2/super.c         | 2 +-
 fs/omfs/inode.c          | 2 +-
 fs/qnx4/inode.c          | 2 +-
 fs/qnx6/inode.c          | 2 +-
 fs/reiserfs/super.c      | 2 +-
 fs/sysv/super.c          | 4 ++--
 fs/udf/super.c           | 2 +-
 fs/ufs/super.c           | 2 +-
 include/linux/fs.h       | 1 +
 30 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index e8bfc38239cd..7c57fff29bb4 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -464,7 +464,7 @@ static struct file_system_type adfs_fs_type = {
 	.name		= "adfs",
 	.mount		= adfs_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("adfs");
 
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 58b391446ae1..2dc200010740 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -649,7 +649,7 @@ static struct file_system_type affs_fs_type = {
 	.name		= "affs",
 	.mount		= affs_mount,
 	.kill_sb	= affs_kill_sb,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("affs");
 
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 9a16a51fbb88..64715f554034 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -982,7 +982,7 @@ static struct file_system_type befs_fs_type = {
 	.name		= "befs",
 	.mount		= befs_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("befs");
 
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index e6a76ae9eb44..e94d7c92b591 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -459,7 +459,7 @@ static struct file_system_type bfs_fs_type = {
 	.name		= "bfs",
 	.mount		= bfs_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("bfs");
 
diff --git a/fs/efs/super.c b/fs/efs/super.c
index b287f47c165b..35891a596267 100644
--- a/fs/efs/super.c
+++ b/fs/efs/super.c
@@ -40,7 +40,7 @@ static struct file_system_type efs_fs_type = {
 	.name		= "efs",
 	.mount		= efs_mount,
 	.kill_sb	= efs_kill_sb,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("efs");
 
diff --git a/fs/exfat/super.c b/fs/exfat/super.c
index 17100b13dcdc..fbc2aa45d291 100644
--- a/fs/exfat/super.c
+++ b/fs/exfat/super.c
@@ -786,7 +786,7 @@ static struct file_system_type exfat_fs_type = {
 	.init_fs_context	= exfat_init_fs_context,
 	.parameters		= exfat_parameters,
 	.kill_sb		= exfat_kill_sb,
-	.fs_flags		= FS_REQUIRES_DEV,
+	.fs_flags		= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 
 static void exfat_inode_init_once(void *foo)
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index aaf3e3e88cb2..ca2c44ca2c29 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1629,7 +1629,7 @@ static struct file_system_type ext2_fs_type = {
 	.name		= "ext2",
 	.mount		= ext2_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("ext2");
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 38217422f938..b7fab39cd1ca 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -137,7 +137,7 @@ static struct file_system_type ext2_fs_type = {
 	.init_fs_context	= ext4_init_fs_context,
 	.parameters		= ext4_param_specs,
 	.kill_sb		= ext4_kill_sb,
-	.fs_flags		= FS_REQUIRES_DEV,
+	.fs_flags		= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("ext2");
 MODULE_ALIAS("ext2");
@@ -153,7 +153,7 @@ static struct file_system_type ext3_fs_type = {
 	.init_fs_context	= ext4_init_fs_context,
 	.parameters		= ext4_param_specs,
 	.kill_sb		= ext4_kill_sb,
-	.fs_flags		= FS_REQUIRES_DEV,
+	.fs_flags		= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("ext3");
 MODULE_ALIAS("ext3");
@@ -7314,7 +7314,8 @@ static struct file_system_type ext4_fs_type = {
 	.init_fs_context	= ext4_init_fs_context,
 	.parameters		= ext4_param_specs,
 	.kill_sb		= ext4_kill_sb,
-	.fs_flags		= FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
+	.fs_flags		= FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME |
+				  FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("ext4");
 
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index fe25ff9cebbe..6ffc7a4d57d8 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -4903,7 +4903,7 @@ static struct file_system_type f2fs_fs_type = {
 	.name		= "f2fs",
 	.mount		= f2fs_mount,
 	.kill_sb	= kill_f2fs_super,
-	.fs_flags	= FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+	.fs_flags	= FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("f2fs");
 
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 2116c486843b..785ee85cf77d 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -667,7 +667,7 @@ static struct file_system_type msdos_fs_type = {
 	.name		= "msdos",
 	.mount		= msdos_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+	.fs_flags	= FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("msdos");
 
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index c4d00999a433..4fe85c569543 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -1212,7 +1212,7 @@ static struct file_system_type vfat_fs_type = {
 	.name		= "vfat",
 	.mount		= vfat_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+	.fs_flags	= FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("vfat");
 
diff --git a/fs/freevxfs/vxfs_super.c b/fs/freevxfs/vxfs_super.c
index 310d73e254df..d1e042784694 100644
--- a/fs/freevxfs/vxfs_super.c
+++ b/fs/freevxfs/vxfs_super.c
@@ -293,7 +293,7 @@ static struct file_system_type vxfs_fs_type = {
 	.name		= "vxfs",
 	.mount		= vxfs_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("vxfs"); /* makes mount -t vxfs autoload the module */
 MODULE_ALIAS("vxfs");
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index ecf789b7168c..f97d8480e665 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1810,7 +1810,7 @@ static void gfs2_kill_sb(struct super_block *sb)
 
 struct file_system_type gfs2_fs_type = {
 	.name = "gfs2",
-	.fs_flags = FS_REQUIRES_DEV,
+	.fs_flags = FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 	.init_fs_context = gfs2_init_fs_context,
 	.parameters = gfs2_fs_parameters,
 	.kill_sb = gfs2_kill_sb,
@@ -1820,7 +1820,7 @@ MODULE_ALIAS_FS("gfs2");
 
 struct file_system_type gfs2meta_fs_type = {
 	.name = "gfs2meta",
-	.fs_flags = FS_REQUIRES_DEV,
+	.fs_flags = FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 	.init_fs_context = gfs2_meta_init_fs_context,
 	.owner = THIS_MODULE,
 };
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index 6764afa98a6f..1b31b8e2b7ef 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -461,7 +461,7 @@ static struct file_system_type hfs_fs_type = {
 	.name		= "hfs",
 	.mount		= hfs_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("hfs");
 
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 1986b4f18a90..efc12ff05e0f 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -646,7 +646,7 @@ static struct file_system_type hfsplus_fs_type = {
 	.name		= "hfsplus",
 	.mount		= hfsplus_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("hfsplus");
 
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index 2ee21286ac8f..96d5b4dfec12 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -1564,7 +1564,7 @@ static struct file_system_type iso9660_fs_type = {
 	.name		= "iso9660",
 	.mount		= isofs_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("iso9660");
 MODULE_ALIAS("iso9660");
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 2e2f7f6d36a0..052c277eab9f 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -906,7 +906,7 @@ static struct file_system_type jfs_fs_type = {
 	.name		= "jfs",
 	.mount		= jfs_do_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("jfs");
 
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index df575473c1cc..ae2bbb610603 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -689,7 +689,7 @@ static struct file_system_type minix_fs_type = {
 	.name		= "minix",
 	.mount		= minix_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("minix");
 
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index a5d1fa4e7552..8e3737a3302a 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1371,7 +1371,7 @@ struct file_system_type nilfs_fs_type = {
 	.name     = "nilfs2",
 	.mount    = nilfs_mount,
 	.kill_sb  = kill_block_super,
-	.fs_flags = FS_REQUIRES_DEV,
+	.fs_flags = FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("nilfs2");
 
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index 56a7d5bd33e4..eb9e434b6541 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -3062,7 +3062,7 @@ static struct file_system_type ntfs_fs_type = {
 	.name		= "ntfs",
 	.mount		= ntfs_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("ntfs");
 
diff --git a/fs/ntfs3/super.c b/fs/ntfs3/super.c
index cfec5e0c7f66..40f09d5159d2 100644
--- a/fs/ntfs3/super.c
+++ b/fs/ntfs3/super.c
@@ -1738,7 +1738,7 @@ static struct file_system_type ntfs_fs_type = {
 	.init_fs_context	= ntfs_init_fs_context,
 	.parameters		= ntfs_fs_parameters,
 	.kill_sb		= ntfs3_kill_sb,
-	.fs_flags		= FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+	.fs_flags		= FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_BUFFER_HEADS,
 };
 // clang-format on
 
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 6b906424902b..50d5be9fb28f 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1191,7 +1191,7 @@ static struct file_system_type ocfs2_fs_type = {
 	.name           = "ocfs2",
 	.mount          = ocfs2_mount,
 	.kill_sb        = kill_block_super,
-	.fs_flags       = FS_REQUIRES_DEV|FS_RENAME_DOES_D_MOVE,
+	.fs_flags       = FS_REQUIRES_DEV|FS_RENAME_DOES_D_MOVE | FS_BUFFER_HEADS,
 	.next           = NULL
 };
 MODULE_ALIAS_FS("ocfs2");
diff --git a/fs/omfs/inode.c b/fs/omfs/inode.c
index 2f8c1882f45c..e95b7d91fe35 100644
--- a/fs/omfs/inode.c
+++ b/fs/omfs/inode.c
@@ -607,7 +607,7 @@ static struct file_system_type omfs_fs_type = {
 	.name = "omfs",
 	.mount = omfs_mount,
 	.kill_sb = kill_block_super,
-	.fs_flags = FS_REQUIRES_DEV,
+	.fs_flags = FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("omfs");
 
diff --git a/fs/qnx4/inode.c b/fs/qnx4/inode.c
index a7171f5532a1..4f66b93397d7 100644
--- a/fs/qnx4/inode.c
+++ b/fs/qnx4/inode.c
@@ -389,7 +389,7 @@ static struct file_system_type qnx4_fs_type = {
 	.name		= "qnx4",
 	.mount		= qnx4_mount,
 	.kill_sb	= qnx4_kill_sb,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("qnx4");
 
diff --git a/fs/qnx6/inode.c b/fs/qnx6/inode.c
index 21f90d519f1a..68f650d67d3b 100644
--- a/fs/qnx6/inode.c
+++ b/fs/qnx6/inode.c
@@ -645,7 +645,7 @@ static struct file_system_type qnx6_fs_type = {
 	.name		= "qnx6",
 	.mount		= qnx6_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("qnx6");
 
diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index 7eaf36b3de12..838046f43d11 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -2635,7 +2635,7 @@ struct file_system_type reiserfs_fs_type = {
 	.name = "reiserfs",
 	.mount = get_super_block,
 	.kill_sb = reiserfs_kill_sb,
-	.fs_flags = FS_REQUIRES_DEV,
+	.fs_flags = FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("reiserfs");
 
diff --git a/fs/sysv/super.c b/fs/sysv/super.c
index 3365a30dc1e0..92f08cb24a28 100644
--- a/fs/sysv/super.c
+++ b/fs/sysv/super.c
@@ -545,7 +545,7 @@ static struct file_system_type sysv_fs_type = {
 	.name		= "sysv",
 	.mount		= sysv_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("sysv");
 
@@ -554,7 +554,7 @@ static struct file_system_type v7_fs_type = {
 	.name		= "v7",
 	.mount		= v7_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("v7");
 MODULE_ALIAS("v7");
diff --git a/fs/udf/super.c b/fs/udf/super.c
index 928a04d9d9e0..9f160973cbe8 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -130,7 +130,7 @@ static struct file_system_type udf_fstype = {
 	.name		= "udf",
 	.mount		= udf_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("udf");
 
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 23377c1baed9..7829c325d011 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -1513,7 +1513,7 @@ static struct file_system_type ufs_fs_type = {
 	.name		= "ufs",
 	.mount		= ufs_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BUFFER_HEADS,
 };
 MODULE_ALIAS_FS("ufs");
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ebc7b8ac5008..35b661b48a49 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2336,6 +2336,7 @@ struct file_system_type {
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
 #define FS_ALLOW_IDMAP         32      /* FS has been updated to handle vfs idmappings. */
 #define FS_MGTIME		64	/* FS uses multigrain timestamps */
+#define FS_BUFFER_HEADS		128	/* filesystem requires buffer-heads */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
 	const struct fs_parameter_spec *parameters;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 05/10] bdev: allow to switch between bdev aops
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
                   ` (3 preceding siblings ...)
  2023-09-15 21:32 ` [RFC v2 04/10] filesystems: add filesystem buffer-head flag Luis Chamberlain
@ 2023-09-15 21:32 ` Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 06/10] bdev: simplify coexistence Luis Chamberlain
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

Now that we have annotations for filesystems which require buffer-heads
we can use that flag to verify whether a filesystem can be used on a
target block device which requires higher order folios. A filesystem
which requires buffer-heads cannot be used on block devices which have a
logical block size greater than PAGE_SIZE. We also want to allow using a
buffer-head filesystem on a block device, then at a later time unmounting
it and switching to a filesystem which supports bs > PAGE_SIZE, even if
the logical block size of the block device is PAGE_SIZE; this requires
iomap. Provide helpers to do all these checks and reset the aops to iomap
when needed.

Leaving iomap in place after an unmount would make such block devices
unusable for buffer-head filesystems, so we must also reset the aops to
buffer-heads on unmount.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/bdev.c           | 55 ++++++++++++++++++++++++++++++++++++++++++
 fs/super.c             |  3 ++-
 include/linux/blkdev.h |  7 ++++++
 3 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/block/bdev.c b/block/bdev.c
index 0d685270cd34..bf3cfc02aaf9 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -150,6 +150,59 @@ static int bdev_bsize_limit(struct block_device *bdev)
 	return PAGE_SIZE;
 }
 
+#ifdef CONFIG_BUFFER_HEAD
+static void bdev_aops_set(struct block_device *bdev,
+			  const struct address_space_operations *aops)
+{
+	kill_bdev(bdev);
+	bdev->bd_inode->i_data.a_ops = aops;
+}
+
+static void bdev_aops_sync(struct super_block *sb, struct block_device *bdev,
+			   const struct address_space_operations *aops)
+{
+	sync_blockdev(bdev);
+	bdev_aops_set(bdev, aops);
+	kill_bdev(bdev);
+	bdev->bd_inode->i_data.a_ops = aops;
+}
+
+void bdev_aops_reset(struct block_device *bdev)
+{
+	bdev_aops_set(bdev, &def_blk_aops);
+}
+
+static int sb_bdev_aops_set(struct super_block *sb)
+{
+	struct block_device *bdev = sb->s_bdev;
+
+	if (mapping_min_folio_order(bdev->bd_inode->i_mapping) != 0 &&
+	    sb->s_type->fs_flags & FS_BUFFER_HEADS) {
+			pr_warn_ratelimited(
+"block device logical block size > PAGE_SIZE, buffer-head filesystem cannot be used.\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * We can switch back and forth, but we need to use buffer-heads
+	 * first, otherwise a filesystem created which only uses iomap
+	 * will have it sticky and we can't detect buffer-head filesystems
+	 * on mount.
+	 */
+	bdev_aops_sync(sb, bdev, &def_blk_aops);
+	if (sb->s_type->fs_flags & FS_BUFFER_HEADS)
+		return 0;
+
+	bdev_aops_sync(sb, bdev, &def_blk_aops_iomap);
+	return 0;
+}
+#else
+static int sb_bdev_aops_set(struct super_block *sb)
+{
+	return 0;
+}
+#endif
+
 int set_blocksize(struct block_device *bdev, int size)
 {
 	/* Size must be a power of two, and between 512 and supported order */
@@ -173,6 +226,8 @@ EXPORT_SYMBOL(set_blocksize);
 
 int sb_set_blocksize(struct super_block *sb, int size)
 {
+	if (sb_bdev_aops_set(sb))
+		return 0;
 	if (set_blocksize(sb->s_bdev, size))
 		return 0;
 	/* If we get here, we know size is power of two
diff --git a/fs/super.c b/fs/super.c
index 816a22a5cad1..eb269c9489cb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1649,12 +1649,13 @@ void kill_block_super(struct super_block *sb)
 	generic_shutdown_super(sb);
 	if (bdev) {
 		sync_blockdev(bdev);
+		bdev_aops_reset(bdev);
 		blkdev_put(bdev, sb);
 	}
 }
 
 EXPORT_SYMBOL(kill_block_super);
-#endif
+#endif /* CONFIG_BLOCK */
 
 struct dentry *mount_nodev(struct file_system_type *fs_type,
 	int flags, void *data,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index eef450f25982..738a879a0786 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1503,6 +1503,13 @@ void sync_bdevs(bool wait);
 void bdev_statx_dioalign(struct inode *inode, struct kstat *stat);
 void printk_all_partitions(void);
 int __init early_lookup_bdev(const char *pathname, dev_t *dev);
+#ifdef CONFIG_BUFFER_HEAD
+void bdev_aops_reset(struct block_device *bdev);
+#else
+static inline void bdev_aops_reset(struct block_device *bdev)
+{
+}
+#endif
 #else
 static inline void invalidate_bdev(struct block_device *bdev)
 {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 06/10] bdev: simplify coexistence
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
                   ` (4 preceding siblings ...)
  2023-09-15 21:32 ` [RFC v2 05/10] bdev: allow to switch between bdev aops Luis Chamberlain
@ 2023-09-15 21:32 ` Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 07/10] nvme: enhance max supported LBA format check Luis Chamberlain
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

Now that the bdev inode aops can easily be switched between buffer-heads
and iomap, we can simplify our blocksize limit check. This also enables
usage of XFS on devices which, for example, have a physical block size
greater than the page size while the logical block size is only 4k or
lower.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/bdev.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index bf3cfc02aaf9..d9236eb149a4 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -143,13 +143,6 @@ static void set_init_blocksize(struct block_device *bdev)
 	}
 }
 
-static int bdev_bsize_limit(struct block_device *bdev)
-{
-	if (bdev->bd_inode->i_data.a_ops == &def_blk_aops_iomap)
-		return 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
-	return PAGE_SIZE;
-}
-
 #ifdef CONFIG_BUFFER_HEAD
 static void bdev_aops_set(struct block_device *bdev,
 			  const struct address_space_operations *aops)
@@ -205,8 +198,10 @@ static int sb_bdev_aops_set(struct super_block *sb)
 
 int set_blocksize(struct block_device *bdev, int size)
 {
+	unsigned int bdev_size_limit = 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
+
 	/* Size must be a power of two, and between 512 and supported order */
-	if (size > bdev_bsize_limit(bdev) || size < 512 || !is_power_of_2(size))
+	if (size > bdev_size_limit || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
 
 	/* Size cannot be smaller than the size supported by the device */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 07/10] nvme: enhance max supported LBA format check
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
                   ` (5 preceding siblings ...)
  2023-09-15 21:32 ` [RFC v2 06/10] bdev: simplify coexistence Luis Chamberlain
@ 2023-09-15 21:32 ` Luis Chamberlain
  2023-09-15 22:20   ` Matthew Wilcox
  2023-09-15 21:32 ` [RFC v2 08/10] nvme: add awun / nawun sanity check Luis Chamberlain
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

Only pure-iomap configurations, i.e. systems where CONFIG_BUFFER_HEAD
is disabled, can enable NVMe devices with LBA formats with a blocksize
larger than PAGE_SIZE.

Systems with buffer-heads enabled cannot currently make use of these
devices, but this will eventually get fixed. We cap the max supported
LBA shift at 19 (512 KiB), as support for a 1 MiB LBA format still
needs some work.

Also, add a debug module parameter nvme_core.debug_large_lbas so folks
can shoot themselves in the foot if they want to test and expand
support beyond what is supported; it is only to be used on pure-iomap
configurations.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/core.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f3a01b79148c..0365f260c514 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -88,6 +88,10 @@ module_param(apst_secondary_latency_tol_us, ulong, 0644);
 MODULE_PARM_DESC(apst_secondary_latency_tol_us,
 	"secondary APST latency tolerance in us");
 
+static bool debug_large_lbas;
+module_param(debug_large_lbas, bool, 0644);
+MODULE_PARM_DESC(debug_large_lbas, "allow LBAs > PAGE_SIZE");
+
 /*
  * nvme_wq - hosts nvme related works that are not reset or delete
  * nvme_reset_wq - hosts nvme reset works
@@ -1878,6 +1882,29 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 	blk_queue_write_cache(q, vwc, vwc);
 }
 
+/* XXX: shift 20 (1 MiB LBA) crashes on pure-iomap */
+#define NVME_MAX_SHIFT_SUPPORTED 19
+
+static bool nvme_lba_shift_supported(struct nvme_ns *ns)
+{
+	if (ns->lba_shift <= PAGE_SHIFT)
+		return true;
+
+	if (IS_ENABLED(CONFIG_BUFFER_HEAD))
+		return false;
+
+	if (ns->lba_shift <= NVME_MAX_SHIFT_SUPPORTED)
+		return true;
+
+	if (debug_large_lbas) {
+		dev_warn(ns->ctrl->device,
+			"forcibly allowing LBAS > 1 MiB due to nvme_core.debug_large_lbas -- use at your own risk\n");
+		return true;
+	}
+
+	return false;
+}
+
 static void nvme_update_disk_info(struct gendisk *disk,
 		struct nvme_ns *ns, struct nvme_id_ns *id)
 {
@@ -1885,13 +1912,10 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	u32 bs = 1U << ns->lba_shift;
 	u32 atomic_bs, phys_bs, io_opt = 0;
 
-	/*
-	 * The block layer can't support LBA sizes larger than the page size
-	 * yet, so catch this early and don't allow block I/O.
-	 */
-	if (ns->lba_shift > PAGE_SHIFT) {
+	if (!nvme_lba_shift_supported(ns)) {
 		capacity = 0;
 		bs = (1 << 9);
+		dev_warn(ns->ctrl->device, "I'm sorry dave, I'm afraid I can't do that\n");
 	}
 
 	blk_integrity_unregister(disk);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 08/10] nvme: add awun / nawun sanity check
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
                   ` (6 preceding siblings ...)
  2023-09-15 21:32 ` [RFC v2 07/10] nvme: enhance max supported LBA format check Luis Chamberlain
@ 2023-09-15 21:32 ` Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 09/10] nvme: add nvme_core.debug_large_atomics to force high awun as phys_bs Luis Chamberlain
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

AWUN/NAWUN control the atomicity of command execution in relation to
other commands. They impose inter-command serialization of writing of
blocks of data to the NVM and prevent blocks of data ending up on the
NVM containing partial data from one new command and partial data from
one or more other new commands.

Parse awun / nawun to verify that at least the exposed physical block
size is not greater than this value.

The special case of awun / nawun == 0xffff tells us we can ramp
up to mdts.
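
For reference, the values a given drive reports can be inspected with
nvme-cli (device paths illustrative); awun / awupf come from id-ctrl,
while nawun / nawupf / npwg come from id-ns:

  $ nvme id-ctrl /dev/nvme0 | grep -E "awun|awupf"
  $ nvme id-ns /dev/nvme0n1 | grep -E "nawun|nawupf|npwg"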

Suggested-by: Dan Helmick <dan.helmick@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/nvme/host/core.c | 21 +++++++++++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0365f260c514..7a3c51ac13bd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1911,6 +1911,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	sector_t capacity = nvme_lba_to_sect(ns, le64_to_cpu(id->nsze));
 	u32 bs = 1U << ns->lba_shift;
 	u32 atomic_bs, phys_bs, io_opt = 0;
+	u32 awun = 0, awun_bs = 0;
 
 	if (!nvme_lba_shift_supported(ns)) {
 		capacity = 0;
@@ -1931,6 +1932,15 @@ static void nvme_update_disk_info(struct gendisk *disk,
 			atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
 		else
 			atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;
+		if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawun)
+			awun = (1 + le16_to_cpu(id->nawun));
+		else
+			awun = (1 + ns->ctrl->subsys->awun);
+		/* Indicates MDTS can be used */
+		if (awun == 0xffff)
+			awun_bs = ns->ctrl->max_hw_sectors << SECTOR_SHIFT;
+		else
+			awun_bs = awun * bs;
 	}
 
 	if (id->nsfeat & NVME_NS_FEAT_IO_OPT) {
@@ -1940,6 +1950,16 @@ static void nvme_update_disk_info(struct gendisk *disk,
 		io_opt = bs * (1 + le16_to_cpu(id->nows));
 	}
 
+	if (awun) {
+		phys_bs = min(awun_bs, phys_bs);
+
+		/*
+		 * npwg and nows could be > awun, in such cases users should
+		 * be aware of out of order reads/writes as npwg and nows
+		 * are purely performance optimizations.
+		 */
+	}
+
 	blk_queue_logical_block_size(disk->queue, bs);
 	/*
 	 * Linux filesystems assume writing a single physical block is
@@ -2785,6 +2805,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 		kfree(subsys);
 		return -EINVAL;
 	}
+	subsys->awun = le16_to_cpu(id->awun);
 	subsys->awupf = le16_to_cpu(id->awupf);
 	nvme_mpath_default_iopolicy(subsys);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f35647c470af..071ec52d83ea 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -410,6 +410,7 @@ struct nvme_subsystem {
 	u8			cmic;
 	enum nvme_subsys_type	subtype;
 	u16			vendor_id;
+	u16			awun;
 	u16			awupf;	/* 0's based awupf value. */
 	struct ida		ns_ida;
 #ifdef CONFIG_NVME_MULTIPATH
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 09/10] nvme: add nvme_core.debug_large_atomics to force high awun as phys_bs
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
                   ` (7 preceding siblings ...)
  2023-09-15 21:32 ` [RFC v2 08/10] nvme: add awun / nawun sanity check Luis Chamberlain
@ 2023-09-15 21:32 ` Luis Chamberlain
  2023-09-15 21:32 ` [RFC v2 10/10] nvme: enable LBS support Luis Chamberlain
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

A drive with atomic write support should have awun / nawun defined; for
these drives it should be possible to safely play with and experiment
with LBS support up to the awun / nawun settings, if you are completely
ignoring power failure situations. Add support to experiment with this.
The rationale for limiting this to awun / nawun is to avoid races with
other in-flight commands which could otherwise cause unexpected
results.

This also means this debug module parameter feature is not supported
if your drive does not support atomics / awun / nawun.

Suggested-by: Dan Helmick <dan.helmick@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/nvme/host/core.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7a3c51ac13bd..c1f9d8e3ea93 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -92,6 +92,10 @@ static bool debug_large_lbas;
 module_param(debug_large_lbas, bool, 0644);
 MODULE_PARM_DESC(debug_large_lbas, "allow LBAs > PAGE_SIZE");
 
+static unsigned int debug_large_atomics;
+module_param(debug_large_atomics, uint, 0644);
+MODULE_PARM_DESC(debug_large_atomics, "allow large atomics <= awun or nawun <= mdts");
+
 /*
  * nvme_wq - hosts nvme related works that are not reset or delete
  * nvme_reset_wq - hosts nvme reset works
@@ -1958,6 +1962,20 @@ static void nvme_update_disk_info(struct gendisk *disk,
 		 * be aware of out of order reads/writes as npwg and nows
 		 * are purely performance optimizations.
 		 */
+
+		/*
+		 * If you're not concerned about power failure, in theory,
+		 * you should be able to experiment up to awun rather safely.
+		 *
+		 * Ignore qemu awun value of 1.
+		 */
+		if (debug_large_atomics && awun != 1) {
+			debug_large_atomics = min(awun_bs, debug_large_atomics);
+			phys_bs = atomic_bs = debug_large_atomics;
+			dev_info(ns->ctrl->device,
+				 "Forcing large atomic: %u (awun_bs: %u awun: %u)\n",
+				 debug_large_atomics, awun_bs, awun);
+		}
 	}
 
 	blk_queue_logical_block_size(disk->queue, bs);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 10/10] nvme: enable LBS support
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
                   ` (8 preceding siblings ...)
  2023-09-15 21:32 ` [RFC v2 09/10] nvme: add nvme_core.debug_large_atomics to force high awun as phys_bs Luis Chamberlain
@ 2023-09-15 21:32 ` Luis Chamberlain
  2023-09-15 21:51 ` [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Matthew Wilcox
  2023-09-17 22:38 ` Dave Chinner
  11 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 21:32 UTC (permalink / raw)
  To: hch, djwong, dchinner, kbusch, sagi, axboe
  Cc: willy, brauner, hare, ritesh.list, rgoldwyn, jack, ziy,
	ryan.roberts, patches, linux-xfs, linux-fsdevel, linux-block,
	p.raghav, da.gomez, dan.helmick, mcgrof

Now that the block device cache allows LBS, has buffered-io support, and
supports both the buffer-head and iomap aops dynamically through its
inode, enable support for LBA formats > PAGE_SIZE.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/nvme/host/core.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c1f9d8e3ea93..724d3c342344 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1894,9 +1894,6 @@ static bool nvme_lba_shift_supported(struct nvme_ns *ns)
 	if (ns->lba_shift <= PAGE_SHIFT)
 		return true;
 
-	if (IS_ENABLED(CONFIG_BUFFER_HEAD))
-		return false;
-
 	if (ns->lba_shift <= NVME_MAX_SHIFT_SUPPORTED)
 		return true;
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
                   ` (9 preceding siblings ...)
  2023-09-15 21:32 ` [RFC v2 10/10] nvme: enable LBS support Luis Chamberlain
@ 2023-09-15 21:51 ` Matthew Wilcox
  2023-09-15 22:26   ` Luis Chamberlain
                     ` (2 more replies)
  2023-09-17 22:38 ` Dave Chinner
  11 siblings, 3 replies; 26+ messages in thread
From: Matthew Wilcox @ 2023-09-15 21:51 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hch, djwong, dchinner, kbusch, sagi, axboe, brauner, hare,
	ritesh.list, rgoldwyn, jack, ziy, ryan.roberts, patches,
	linux-xfs, linux-fsdevel, linux-block, p.raghav, da.gomez,
	dan.helmick

On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> for many Linux distributions since it also means disabling support for most
> filesystems other than btrfs and XFS. So we either support larger order folios
> on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> 2023 it was decided we would not support larger order folios on buffer-heads,

Um, I didn't agree to that.  If block size is equal to folio size, there
are no problems supporting one buffer head per folio.  In fact, we could
probably go up to 8 buffer heads per folio without any trouble (but I'm
not signing up to do that work).


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 07/10] nvme: enhance max supported LBA format check
  2023-09-15 21:32 ` [RFC v2 07/10] nvme: enhance max supported LBA format check Luis Chamberlain
@ 2023-09-15 22:20   ` Matthew Wilcox
  2023-09-15 22:27     ` Luis Chamberlain
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2023-09-15 22:20 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hch, djwong, dchinner, kbusch, sagi, axboe, brauner, hare,
	ritesh.list, rgoldwyn, jack, ziy, ryan.roberts, patches,
	linux-xfs, linux-fsdevel, linux-block, p.raghav, da.gomez,
	dan.helmick

On Fri, Sep 15, 2023 at 02:32:51PM -0700, Luis Chamberlain wrote:
> +/* XXX: shift 20 (1 MiB LBA) crashes on pure-iomap */
> +#define NVME_MAX_SHIFT_SUPPORTED 19

I imagine somewhere we do a PAGE_SIZE << 20 and it overflows an int to 0.
I've been trying to use size_t everywhere, but we probably missed
something.

Does 19 work on a machine with 64kB pages?  My guess is no.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-15 21:51 ` [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Matthew Wilcox
@ 2023-09-15 22:26   ` Luis Chamberlain
  2023-09-17 11:50   ` Hannes Reinecke
  2023-09-18 17:12   ` Luis Chamberlain
  2 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 22:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: hch, djwong, dchinner, kbusch, sagi, axboe, brauner, hare,
	ritesh.list, rgoldwyn, jack, ziy, ryan.roberts, patches,
	linux-xfs, linux-fsdevel, linux-block, p.raghav, da.gomez,
	dan.helmick

On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> > for many Linux distributions since it also means disabling support for most
> > filesystems other than btrfs and XFS. So we either support larger order folios
> > on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> > 2023 it was decided we would not support larger order folios on buffer-heads,
> 
> Um, I didn't agree to that.  If block size is equal to folio size, there
> are no problems supporting one buffer head per folio.  In fact, we could
> probably go up to 8 buffer heads per folio without any trouble (but I'm
> not signing up to do that work).

Patches welcomed for buffer-head larger order folio support :)

But in the meantime...

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 07/10] nvme: enhance max supported LBA format check
  2023-09-15 22:20   ` Matthew Wilcox
@ 2023-09-15 22:27     ` Luis Chamberlain
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-15 22:27 UTC (permalink / raw)
  To: Matthew Wilcox, Jeff Layton
  Cc: hch, djwong, dchinner, kbusch, sagi, axboe, brauner, hare,
	ritesh.list, rgoldwyn, jack, ziy, ryan.roberts, patches,
	linux-xfs, linux-fsdevel, linux-block, p.raghav, da.gomez,
	dan.helmick

On Fri, Sep 15, 2023 at 11:20:39PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 02:32:51PM -0700, Luis Chamberlain wrote:
> > +/* XXX: shift 20 (1 MiB LBA) crashes on pure-iomap */
> > +#define NVME_MAX_SHIFT_SUPPORTED 19
> 
> I imagine somewhere we do a PAGE_SIZE << 20 and it overflows an int to 0.
> I've been trying to use size_t everywhere, but we probably missed
> something.
> 
> Does 19 work on a machine with 64kB pages?  My guess is no.

Let's ask someone who *might* have access to a 64k PAGE_SIZE system or
might know someone who does.

I definitely don't!

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-15 21:51 ` [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Matthew Wilcox
  2023-09-15 22:26   ` Luis Chamberlain
@ 2023-09-17 11:50   ` Hannes Reinecke
  2023-09-18 17:12   ` Luis Chamberlain
  2 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2023-09-17 11:50 UTC (permalink / raw)
  To: Matthew Wilcox, Luis Chamberlain
  Cc: hch, djwong, dchinner, kbusch, sagi, axboe, brauner, ritesh.list,
	rgoldwyn, jack, ziy, ryan.roberts, patches, linux-xfs,
	linux-fsdevel, linux-block, p.raghav, da.gomez, dan.helmick

On 9/15/23 23:51, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
>> However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
>> for many Linux distributions since it also means disabling support for most
>> filesystems other than btrfs and XFS. So we either support larger order folios
>> on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
>> 2023 it was decided we would not support larger order folios on buffer-heads,
> 
> Um, I didn't agree to that.  If block size is equal to folio size, there
> are no problems supporting one buffer head per folio.  In fact, we could
> probably go up to 8 buffer heads per folio without any trouble (but I'm
> not signing up to do that work).
> 
Entirely correct.
I have a patchset ready for doing just that (ie having one buffer head 
per folio); hope I'll find some time next week to get it cleaned up and 
posted.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
                   ` (10 preceding siblings ...)
  2023-09-15 21:51 ` [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Matthew Wilcox
@ 2023-09-17 22:38 ` Dave Chinner
  2023-09-17 23:14   ` Matthew Wilcox
  11 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2023-09-17 22:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hch, djwong, dchinner, kbusch, sagi, axboe, willy, brauner, hare,
	ritesh.list, rgoldwyn, jack, ziy, ryan.roberts, patches,
	linux-xfs, linux-fsdevel, linux-block, p.raghav, da.gomez,
	dan.helmick

On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> Christoph added CONFIG_BUFFER_HEAD on v6.1 enabling a world where we can
> live without buffer-heads. When we opt into that world we end up also
> using the address space operations of the block device cache using
> iomap. Since iomap supports higher order folios it means then that block
> devices which do use the iomap aops can end up having a logical block
> size or physical block size greater than PAGE_SIZE. We refer to these as
> LBS devices. This in turn allows filesystems which support bs > 4k to be
> enabled on a 4k PAGE_SIZE world on LBS block devices. This alows LBS
> device then to take advantage of the recenlty posted work today to enable
> LBS support for filesystems [0].

Why do we need LBS devices to support bs > ps in XFS?

As long as the filesystem block size is >= logical sector size of
the device, XFS just doesn't care what the "block size" of the
underlying block device is. i.e. XFS will allow minimum IO sizes of
the logical sector size of the device for select metadata and
sub-block direct IO, but otherwise all other IO will be aligned to
filesystem block size and so the underlying device block sizes are
completely irrelevant...

> To experiment with larger LBA formtas you can also use kdevops and enable
> CONFIG_QEMU_ENABLE_EXTRA_DRIVE_LARGEIO. That enables a ton of drives with
> logical and physical block sizes >= 4k up to a desriable max target for
> experimentation. Since filesystems today only support up to 32k sector sizes,
> in practice you may only want to experiment up to 32k physical / logical.
> 
> Support for 64k sector sizes requires an XFS format change, which is something
> Daniel Gomez has experimental patches for, in case folks are interested in
> messing with.

Please don't do this without first talking to the upstream XFS
developers about the intended use cases and design of the new
format. Especially when the problem involves requiring a whole new
journal header and indexing format to be implemented...

As a general rule, nobody should *ever* be writing patches to change
the on-disk format of a -any- filesystem without first engaging the
people who maintain that filesystem. Architecture and design come
first, not implementation. The last thing we want is to have someone
spend weeks or months on something that takes the experts half a
minute to NACK because it is so obviously flawed....

> Patch 6 could probably be squashed with patch 5, but I wanted to be
> explicit about this, as this should be decided with the community.
> 
> There might be a better way to do this than do deal with the switching
> of the aops dynamically, ideas welcomed!

Is it even safe to switch aops dynamically? We know there are
inherent race conditions in doing this w.r.t. mmap and page faults,
as the write fault part of the processing is directly dependent
on the page being correctly initialised during the initial
population of the page data (the "read fault" side of the write
fault).

Hence it's not generally considered safe to change aops from one
mechanism to another dynamically. Block devices can be mmap()d, but
I don't see anything in this patch set that ensures there are no
other users of the block device when the swaps are done. What am I
missing?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-17 22:38 ` Dave Chinner
@ 2023-09-17 23:14   ` Matthew Wilcox
  2023-09-18  0:59     ` Dave Chinner
  2023-09-18 11:34     ` Hannes Reinecke
  0 siblings, 2 replies; 26+ messages in thread
From: Matthew Wilcox @ 2023-09-17 23:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Luis Chamberlain, hch, djwong, dchinner, kbusch, sagi, axboe,
	brauner, hare, ritesh.list, rgoldwyn, jack, ziy, ryan.roberts,
	patches, linux-xfs, linux-fsdevel, linux-block, p.raghav,
	da.gomez, dan.helmick

On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > enabled on a 4k PAGE_SIZE world on LBS block devices. This alows LBS
> > device then to take advantage of the recenlty posted work today to enable
> > LBS support for filesystems [0].
> 
> Why do we need LBS devices to support bs > ps in XFS?

It's the other way round -- we need the support in the page cache to
reject sub-block-size folios (which is in the other patches) before we
can sensibly talk about enabling any filesystems on top of LBS devices.
Even XFS, or for that matter ext2 which support 16k block sizes on
CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
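
As a rough sketch of what "reject sub-block-size folios" means for the
bdev mapping (illustrative only -- the wrapper below is made up, and
mapping_set_folio_min_order() is the helper later kernels grew for
this, not something in the series under discussion):

/* Never let the page cache hand back a folio smaller than one block. */
static void bdev_cache_set_min_order(struct address_space *mapping,
				     unsigned int blkbits)
{
	if (blkbits > PAGE_SHIFT)
		mapping_set_folio_min_order(mapping, blkbits - PAGE_SHIFT);
}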

[snipping the parts I agree with; this should not be the first you're
hearing about a format change to XFS]

> > There might be a better way to do this than do deal with the switching
> > of the aops dynamically, ideas welcomed!
> 
> Is it even safe to switch aops dynamically? We know there are
> inherent race conditions in doing this w.r.t. mmap and page faults,
> as the write fault part of the processing is directly dependent
> on the page being correctly initialised during the initial
> population of the page data (the "read fault" side of the write
> fault).
> 
> Hence it's not generally considered safe to change aops from one
> mechanism to another dynamically. Block devices can be mmap()d, but
> I don't see anything in this patch set that ensures there are no
> other users of the block device when the swaps are done. What am I
> missing?

We need to evict all pages from the page cache before switching aops to
prevent misinterpretation of folio->private.  If switching aops is even
the right thing to do.  I don't see the problem with allowing buffer heads
on block devices, but I haven't been involved with the discussion here.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-17 23:14   ` Matthew Wilcox
@ 2023-09-18  0:59     ` Dave Chinner
  2023-09-18  1:13       ` Luis Chamberlain
  2023-09-18 11:34     ` Hannes Reinecke
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2023-09-18  0:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, hch, djwong, dchinner, kbusch, sagi, axboe,
	brauner, hare, ritesh.list, rgoldwyn, jack, ziy, ryan.roberts,
	patches, linux-xfs, linux-fsdevel, linux-block, p.raghav,
	da.gomez, dan.helmick

On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > enabled on a 4k PAGE_SIZE world on LBS block devices. This alows LBS
> > > device then to take advantage of the recenlty posted work today to enable
> > > LBS support for filesystems [0].
> > 
> > Why do we need LBS devices to support bs > ps in XFS?
> 
> It's the other way round -- we need the support in the page cache to
> reject sub-block-size folios (which is in the other patches) before we
> can sensibly talk about enabling any filesystems on top of LBS devices.
>
> Even XFS, or for that matter ext2 which support 16k block sizes on
> CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.

Well, yes, I know that. But the statement above implies that we
can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
systems. If it's meant to mean the exact opposite, then it is
extremely poorly worded....

> > > There might be a better way to do this than do deal with the switching
> > > of the aops dynamically, ideas welcomed!
> > 
> > Is it even safe to switch aops dynamically? We know there are
> > inherent race conditions in doing this w.r.t. mmap and page faults,
> > as the write fault part of the processing is directly dependent
> > on the page being correctly initialised during the initial
> > population of the page data (the "read fault" side of the write
> > fault).
> > 
> > Hence it's not generally considered safe to change aops from one
> > mechanism to another dynamically. Block devices can be mmap()d, but
> > I don't see anything in this patch set that ensures there are no
> > other users of the block device when the swaps are done. What am I
> > missing?
> 
> We need to evict all pages from the page cache before switching aops to
> prevent misinterpretation of folio->private. 

Yes, but if the device is mapped, even after an invalidation, we can
still race with a new fault instantiating a page whilst the aops are
being swapped, right? That was the problem that sunk dynamic
swapping of the aops when turning DAX on and off on an inode, right?

> If switching aops is even
> the right thing to do.  I don't see the problem with allowing buffer heads
> on block devices, but I haven't been involved with the discussion here.

iomap supports bufferheads as a transitional thing (e.g. for gfs2).

Hence I suspect that a better solution is to always use iomap and
the same aops, but just switch from iomap page state to buffer heads
in the bdev mapping interface via a synchronised invalidation +
setting/clearing IOMAP_F_BUFFER_HEAD in all new mapping requests...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-18  0:59     ` Dave Chinner
@ 2023-09-18  1:13       ` Luis Chamberlain
  2023-09-18  2:49         ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-18  1:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, hch, djwong, dchinner, kbusch, sagi, axboe,
	brauner, hare, ritesh.list, rgoldwyn, jack, ziy, ryan.roberts,
	patches, linux-xfs, linux-fsdevel, linux-block, p.raghav,
	da.gomez, dan.helmick

On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This alows LBS
> > > > device then to take advantage of the recenlty posted work today to enable
> > > > LBS support for filesystems [0].
> > > 
> > > Why do we need LBS devices to support bs > ps in XFS?
> > 
> > It's the other way round -- we need the support in the page cache to
> > reject sub-block-size folios (which is in the other patches) before we
> > can sensibly talk about enabling any filesystems on top of LBS devices.
> >
> > Even XFS, or for that matter ext2 which support 16k block sizes on
> > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> 
> Well, yes, I know that. But the statement above implies that we
> can't use bs > ps filesytems without LBS support on 4kB PAGE_SIZE
> systems. If it's meant to mean the exact opposite, then it is
> extremely poorly worded....

Let's ignore the above statement for a second just to clarify the goal here.
The patches posted by Pankaj enable bs > ps even on 4k sector drives.

This patch series by definition is suggesting that an LBS device is one
where the minimum sector size is > ps. In practice I'm talking about
devices where the logical block size is > 4k, or where the physical
block size can be > 4k. There are two situations in the NVMe world where
this can happen. One is where the LBA format used is > 4k, the other is
where the npwg & awupf | nawupf is > 4k. The first makes the logical
block size > 4k, the latter allows the physical block size to be > 4k.
The first *needs* an aops which groks > ps on the block device cache.
The second lets you remain backward compatible with a 4k sector size,
but if you want to use a larger sector size you can too; that also
requires a block device cache which groks > ps. Devices using > ps for
the logical block size or the physical block size are what I am
suggesting we should call LBS devices.

I'm happy for us to use other names but it just seemed like a natural
progression to me.
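
To make the npwg / nawupf arithmetic concrete, a small standalone
sketch (the helper name and the simplified handling of the
identify-namespace fields are mine, not the driver's) of how a 4k LBA
format plus npwg / nawupf can yield a 16k physical block size:

#include <stdint.h>
#include <stdio.h>

/* npwg and nawupf are 0's based counts in units of logical blocks */
static uint32_t nvme_phys_bs(uint32_t lba_size, uint16_t npwg, uint16_t nawupf)
{
	uint32_t phys_bs = lba_size * (1 + npwg);
	uint32_t atomic_bs = lba_size * (1 + nawupf);

	/* only advertise what is both preferred and power-fail atomic */
	return phys_bs < atomic_bs ? phys_bs : atomic_bs;
}

int main(void)
{
	/* 4k LBA format, npwg = nawupf = 3 -> 16k physical block size */
	printf("%u\n", nvme_phys_bs(4096, 3, 3));
	return 0;
}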

> > > > There might be a better way to do this than do deal with the switching
> > > > of the aops dynamically, ideas welcomed!
> > > 
> > > Is it even safe to switch aops dynamically? We know there are
> > > inherent race conditions in doing this w.r.t. mmap and page faults,
> > > as the write fault part of the processing is directly dependent
> > > on the page being correctly initialised during the initial
> > > population of the page data (the "read fault" side of the write
> > > fault).
> > > 
> > > Hence it's not generally considered safe to change aops from one
> > > mechanism to another dynamically. Block devices can be mmap()d, but
> > > I don't see anything in this patch set that ensures there are no
> > > other users of the block device when the swaps are done. What am I
> > > missing?
> > 
> > We need to evict all pages from the page cache before switching aops to
> > prevent misinterpretation of folio->private. 
> 
> Yes, but if the device is mapped, even after an invalidation, we can
> still race with a new fault instantiating a page whilst the aops are
> being swapped, right? That was the problem that sunk dynamic
> swapping of the aops when turning DAX on and off on an inode, right?

I was not aware of that, thanks!

> > If switching aops is even
> > the right thing to do.  I don't see the problem with allowing buffer heads
> > on block devices, but I haven't been involved with the discussion here.
> 
> iomap supports bufferheads as a transitional thing (e.g. for gfs2).

But not for the block device cache.

> Hence I suspect that a better solution is to always use iomap and
> the same aops, but just switch from iomap page state to buffer heads
> in the bdev mapping 

Not sure what this means, any chance I can trouble you to clarify a bit more?

> interface via a synchronised invalidation +
> setting/clearing IOMAP_F_BUFFER_HEAD in all new mapping requests...

I'm happy to use whatever makes more sense and is safer, just keep in
mind that we can't default to the iomap aops, otherwise buffer-head
based filesystems won't work. I tried that. The first aops which the
block device cache needs, if we want co-existence, is buffer-heads.

If buffer-heads get large folio support then this is a non-issue as
far as I can tell, but only because XFS does not use buffer-heads for
metadata. If and when we do have a filesystem which does use
buffer-heads for metadata to support LBS, it could be a problem I think.

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-18  1:13       ` Luis Chamberlain
@ 2023-09-18  2:49         ` Dave Chinner
  2023-09-18 17:51           ` Luis Chamberlain
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2023-09-18  2:49 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox, hch, djwong, dchinner, kbusch, sagi, axboe,
	brauner, hare, ritesh.list, rgoldwyn, jack, ziy, ryan.roberts,
	patches, linux-xfs, linux-fsdevel, linux-block, p.raghav,
	da.gomez, dan.helmick

On Sun, Sep 17, 2023 at 06:13:40PM -0700, Luis Chamberlain wrote:
> On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> > On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This alows LBS
> > > > > device then to take advantage of the recenlty posted work today to enable
> > > > > LBS support for filesystems [0].
> > > > 
> > > > Why do we need LBS devices to support bs > ps in XFS?
> > > 
> > > It's the other way round -- we need the support in the page cache to
> > > reject sub-block-size folios (which is in the other patches) before we
> > > can sensibly talk about enabling any filesystems on top of LBS devices.
> > >
> > > Even XFS, or for that matter ext2 which support 16k block sizes on
> > > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> > 
> > Well, yes, I know that. But the statement above implies that we
> > can't use bs > ps filesytems without LBS support on 4kB PAGE_SIZE
> > systems. If it's meant to mean the exact opposite, then it is
> > extremely poorly worded....
> 
> Let's ignore the above statement of a second just to clarify the goal here.
> The patches posted by Pankaj enable bs > ps even on 4k sector drives.

Yes. They also enable XFS to support bs > ps on devices up to 32kB
sector sizes, too. All the sector size does is define the minimum
filesystem block size that can be supported by the filesystem on
that device and apart from that we just don't care what the sector
size on the underlying device is.

> This patch series by definition is suggesting that an LBS device is one
> where the minimum sector size is > ps, in practice I'm talking about
> devices where the logical block size is > 4k, or where the physical
> block size can be > 4k.

Which XFS with bs > ps just doesn't care about. As long as the
logical sector size is a power of 2 between 512 bytes and 32kB, it
will just work.

> There are two situations in the NVMe world where
> this can happen. One is where the LBA format used is > 4k, the other is
> where the npwg & awupf | nawupf is > 4k. The first makes the logical

FWIW, I have no idea what these acronyms actually mean....

> block size > 4k, the later allows the physical block size to be > 4k.
> The first *needs* an aops which groks > ps on the block device cache.
> The second let's you remain backward compatible with 4k sector size, but
> if you want to use a larger sector size you can too, but that also
> requires a block device cache which groks > ps. When using > ps for
> logical block size of physical block size is what I am suggesting we
> should call LBS devices.

Sure. LBS means sector size > page size for the block device. That
much has always been clear.

But telling me that again doesn't explain what LBS support has to do
with the filesystem implementation. mkfs.xfs doesn't require LBS
support to make a new XFS filesystem with a 32kB sector size and
64kB filesystem block size.  For the mounted filesystem that
supports bs > ps, it also doesn't care if the device sector size
is smaller than what mkfs told it to use. e.g. this is how we run 4kB
sector size filesystem testing on 512 byte sector devices today....
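
(As an illustration only, not a command taken from this thread:
mkfs.xfs -s size=32k -b size=64k <device> is accepted by mkfs.xfs
today; whether the resulting filesystem can then be mounted is exactly
what the rest of this discussion is about.)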

What I'm asking is why LBS support even mentions filesystems? It's
not necessary for filesystems to support bs > ps for LBS to be
implemented, and it's not necessary for LBS to be supported for
filesystems to implement bs > ps. Conflating them seems a recipe
for confusion....

> > > > > There might be a better way to do this than do deal with the switching
> > > > > of the aops dynamically, ideas welcomed!
> > > > 
> > > > Is it even safe to switch aops dynamically? We know there are
> > > > inherent race conditions in doing this w.r.t. mmap and page faults,
> > > > as the write fault part of the processing is directly dependent
> > > > on the page being correctly initialised during the initial
> > > > population of the page data (the "read fault" side of the write
> > > > fault).
> > > > 
> > > > Hence it's not generally considered safe to change aops from one
> > > > mechanism to another dynamically. Block devices can be mmap()d, but
> > > > I don't see anything in this patch set that ensures there are no
> > > > other users of the block device when the swaps are done. What am I
> > > > missing?
> > > 
> > > We need to evict all pages from the page cache before switching aops to
> > > prevent misinterpretation of folio->private. 
> > 
> > Yes, but if the device is mapped, even after an invalidation, we can
> > still race with a new fault instantiating a page whilst the aops are
> > being swapped, right? That was the problem that sunk dynamic
> > swapping of the aops when turning DAX on and off on an inode, right?
> 
> I was not aware of that, thanks!
> 
> > > If switching aops is even
> > > the right thing to do.  I don't see the problem with allowing buffer heads
> > > on block devices, but I haven't been involved with the discussion here.
> > 
> > iomap supports bufferheads as a transitional thing (e.g. for gfs2).
> 
> But not for the block device cache.

That's why I'm suggesting that you implement support for bufferheads
through the existing iomap infrastructure instead of trying to
dynamically switch aops structures....

> > Hence I suspect that a better solution is to always use iomap and
> > the same aops, but just switch from iomap page state to buffer heads
> > in the bdev mapping 
> 
> Not sure this means, any chance I can trouble you to clarify a bit more?

bdev_use_buggerheads()
{
	/*
	 * set the bufferhead bit atomically with invalidation emptying the
	 * page cache to prevent repopulation races. 
	 */
	filemap_invalidate_lock()
	invalidate_bdev()
	if (turn_on_bufferheads)
		set_bit(bdev->state, BDEV_USE_BUFFERHEADS);
	else
		clear_bit(bdev->state, BDEV_USE_BUFFERHEADS);
	filemap_invalidate_unlock()
}

bdev_iomap_begin()
{
	.....
	if (test_bit(bdev->state, BDEV_USE_BUFFERHEADS))
		iomap->flags |= IOMAP_F_BUFFER_HEAD;
}

/*
 * If an indexing switch happened while processing the iomap, make
 * sure to get the iomap marked stale to force a new mapping to be
 * looked up.
 */
bdev_iomap_valid()
{
	bool need_bhs = iomap->flags & IOMAP_F_BUFFER_HEAD;
	bool use_bhs = test_bit(bdev->state, BDEV_USE_BUGGERHEADS);

	return need_bhs == use_bhs;
}

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-17 23:14   ` Matthew Wilcox
  2023-09-18  0:59     ` Dave Chinner
@ 2023-09-18 11:34     ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:34 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Chinner
  Cc: Luis Chamberlain, hch, djwong, dchinner, kbusch, sagi, axboe,
	brauner, ritesh.list, rgoldwyn, jack, ziy, ryan.roberts, patches,
	linux-xfs, linux-fsdevel, linux-block, p.raghav, da.gomez,
	dan.helmick

On 9/18/23 01:14, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
>> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
>>> LBS devices. This in turn allows filesystems which support bs > 4k to be
>>> enabled on a 4k PAGE_SIZE world on LBS block devices. This alows LBS
>>> device then to take advantage of the recenlty posted work today to enable
>>> LBS support for filesystems [0].
>>
>> Why do we need LBS devices to support bs > ps in XFS?
> 
> It's the other way round -- we need the support in the page cache to
> reject sub-block-size folios (which is in the other patches) before we
> can sensibly talk about enabling any filesystems on top of LBS devices.
> Even XFS, or for that matter ext2 which support 16k block sizes on
> CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> 
> [snipping the parts I agree with; this should not be the first you're
> hearing about a format change to XFS]
> 
>>> There might be a better way to do this than do deal with the switching
>>> of the aops dynamically, ideas welcomed!
>>
>> Is it even safe to switch aops dynamically? We know there are
>> inherent race conditions in doing this w.r.t. mmap and page faults,
>> as the write fault part of the processing is directly dependent
>> on the page being correctly initialised during the initial
>> population of the page data (the "read fault" side of the write
>> fault).
>>
>> Hence it's not generally considered safe to change aops from one
>> mechanism to another dynamically. Block devices can be mmap()d, but
>> I don't see anything in this patch set that ensures there are no
>> other users of the block device when the swaps are done. What am I
>> missing?
> 
> We need to evict all pages from the page cache before switching aops to
> prevent misinterpretation of folio->private.  If switching aops is even
> the right thing to do.  I don't see the problem with allowing buffer heads
> on block devices, but I haven't been involved with the discussion here.

Did we even have that conversation?
That's one of the first things I stumbled across when doing my
patchset, and I found the implications too horrible to consider.

Not a big fan, plus I don't think we need that.
Cf my patchset :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-15 21:51 ` [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Matthew Wilcox
  2023-09-15 22:26   ` Luis Chamberlain
  2023-09-17 11:50   ` Hannes Reinecke
@ 2023-09-18 17:12   ` Luis Chamberlain
  2023-09-18 18:15     ` Matthew Wilcox
  2 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-18 17:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: hch, djwong, dchinner, kbusch, sagi, axboe, brauner, hare,
	ritesh.list, rgoldwyn, jack, ziy, ryan.roberts, patches,
	linux-xfs, linux-fsdevel, linux-block, p.raghav, da.gomez,
	dan.helmick

On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> > for many Linux distributions since it also means disabling support for most
> > filesystems other than btrfs and XFS. So we either support larger order folios
> > on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> > 2023 it was decided we would not support larger order folios on buffer-heads,
> 
> Um, I didn't agree to that.

Coverage of the sunsetting buffer-heads talk by LWN:

https://lwn.net/Articles/931809/

"the apparent conclusion from the session: the buffer-head layer will be
converted to use folios internally while minimizing changes visible to
the filesystems using it. Only single-page folios will be used within
this new buffer-head layer. Any other desires, he said, can be addressed
later after this problem has been solved."

And so I think we're at the "later" part of this. I'm happy to see
efforts to support larger order folios in buffer-heads, and so at this
point I think it would be wise to look for commonalities and collaborate
on what things could / should be shared.

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-18  2:49         ` Dave Chinner
@ 2023-09-18 17:51           ` Luis Chamberlain
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2023-09-18 17:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, hch, djwong, dchinner, kbusch, sagi, axboe,
	brauner, hare, ritesh.list, rgoldwyn, jack, ziy, ryan.roberts,
	patches, linux-xfs, linux-fsdevel, linux-block, p.raghav,
	da.gomez, dan.helmick

On Mon, Sep 18, 2023 at 12:49:36PM +1000, Dave Chinner wrote:
> On Sun, Sep 17, 2023 at 06:13:40PM -0700, Luis Chamberlain wrote:
> > On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> > > On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > > > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This alows LBS
> > > > > > device then to take advantage of the recenlty posted work today to enable
> > > > > > LBS support for filesystems [0].
> > > > > 
> > > > > Why do we need LBS devices to support bs > ps in XFS?
> > > > 
> > > > It's the other way round -- we need the support in the page cache to
> > > > reject sub-block-size folios (which is in the other patches) before we
> > > > can sensibly talk about enabling any filesystems on top of LBS devices.
> > > >
> > > > Even XFS, or for that matter ext2 which support 16k block sizes on
> > > > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> > > 
> > > Well, yes, I know that. But the statement above implies that we
> > > can't use bs > ps filesytems without LBS support on 4kB PAGE_SIZE
> > > systems. If it's meant to mean the exact opposite, then it is
> > > extremely poorly worded....
> > 
> > Let's ignore the above statement of a second just to clarify the goal here.
> > The patches posted by Pankaj enable bs > ps even on 4k sector drives.
> 
> Yes. They also enable XFS to support bs > ps on devices up to 32kB
> sector sizes, too. All the sector size does is define the minimum
> filesystem block size that can be supported by the filesystem on
> that device and apart from that we just don't care what the sector
> size on the underlying device is.

Ah yes, but on a system with 4k page size, my point is that you still
can't use a 32k sector size drive.

> > This patch series by definition is suggesting that an LBS device is one
> > where the minimum sector size is > ps, in practice I'm talking about
> > devices where the logical block size is > 4k, or where the physical
> > block size can be > 4k.
> 
> Which XFS with bs > ps just doesn't care about. As long as the
> logical sector size is a power of 2 between 512 bytes and 32kB, it
> will just work.
> 
> > There are two situations in the NVMe world where
> > this can happen. One is where the LBA format used is > 4k, the other is
> > where the npwg & awupf | nawupf is > 4k. The first makes the logical
> 
> FWIW, I have no idea what these acronyms actually mean....

Sorry, I've described them here now:

https://kernelnewbies.org/KernelProjects/large-block-size#LBS_NVMe_support

What matters though is that in practice it is a combination of two
values which today also allows the nvme driver to report a physical
block size higher than 4k.

> > block size > 4k, the later allows the physical block size to be > 4k.
> > The first *needs* an aops which groks > ps on the block device cache.
> > The second let's you remain backward compatible with 4k sector size, but
> > if you want to use a larger sector size you can too, but that also
> > requires a block device cache which groks > ps. When using > ps for
> > logical block size of physical block size is what I am suggesting we
> > should call LBS devices.
> 
> Sure. LBS means sector size > page size for the block device. That
> much has always been clear.
> 
> But telling me that again doesn't explain what LBS support has to do
> with the filesystem implementation. mkfs.xfs doesn't require LBS
> support to make a new XFS filesystem with a 32kB sector size and
> 64kB filessytem block size.  For the mounted filesystem that
> supports bs > ps, it also doesn't care about the device sector size
> is smaller than what mkfs told it to us. e.g. this is how run 4kB
> sector size filesystem testing on 512 byte sector devices today....
> 
> What I'm asking is why LBS support even mentions filesystems?

It's just that without this patchset you can mkfs filesystems with
bs > ps, but you will only be able to mount them if the sector size
matches the page size. This patchset allows them to be mounted and used
where the sector size is > ps. The main issue there is just the block
device cache aops and the current size limitation on set_blocksize().
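
For reference, the limitation I mean is the very first check in
set_blocksize(), roughly (paraphrased from block/bdev.c, so treat it as
a sketch rather than the exact code of any given kernel):

int set_blocksize(struct block_device *bdev, int size)
{
	/* Size must be a power of two, and between 512 and PAGE_SIZE */
	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
		return -EINVAL;

	/* Size cannot be smaller than the size supported by the device */
	if (size < bdev_logical_block_size(bdev))
		return -EINVAL;

	/* ... sync and drop the old page cache contents, then ... */
	return 0;
}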

> It's
> not necessary for filesystems to support bs > ps for LBS to be
> implemented, and it's not necessary for LBS to be supported for
> filesytsems to implement bs > ps. Conflating them seems a recipe
> for confusion....

I see, so we should just limit the scope to enabling LBS devices on
the block device cache?

> > > iomap supports bufferheads as a transitional thing (e.g. for gfs2).
> > 
> > But not for the block device cache.
> 
> That's why I'm suggesting that you implement support for bufferheads
> through the existing iomap infrastructure instead of trying to
> dynamically switch aops structures....
> 
> > > Hence I suspect that a better solution is to always use iomap and
> > > the same aops, but just switch from iomap page state to buffer heads
> > > in the bdev mapping 
> > 
> > Not sure this means, any chance I can trouble you to clarify a bit more?
> 
> bdev_use_buggerheads()
> {
> 	/*
> 	 * set the bufferhead bit atomically with invalidation emptying the
> 	 * page cache to prevent repopulation races. 
> 	 */
> 	filemap_invalidate_lock()
> 	invalidate_bdev()
> 	if (turn_on_bufferheads)
> 		set_bit(bdev->state, BDEV_USE_BUFFERHEADS);
> 	else
> 		clear_bit(bdev->state, BDEV_USE_BUFFERHEADS);
> 	filemap_invalidate_unlock()
> }
> 
> bdev_iomap_begin()
> {
> 	.....
> 	if (test_bit(bdev->state, BDEV_USE_BUFFERHEADS))
> 		iomap->flags |= IOMAP_F_BUFFER_HEAD;
> }
> 
> /*
>  * If an indexing switch happened while processing the iomap, make
>  * sure to get the iomap marked stale to force a new mapping to be
>  * looked up.
>  */
> bdev_iomap_valid()
> {
> 	bool need_bhs = iomap->flags & IOMAP_F_BUFFER_HEAD;
> 	bool use_bhs = test_bit(bdev->state, BDEV_USE_BUGGERHEADS);
> 
> 	return need_bhs == use_bhs;
> }

Will give this a shot, thanks!

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-18 17:12   ` Luis Chamberlain
@ 2023-09-18 18:15     ` Matthew Wilcox
  2023-09-18 18:42       ` Hannes Reinecke
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2023-09-18 18:15 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hch, djwong, dchinner, kbusch, sagi, axboe, brauner, hare,
	ritesh.list, rgoldwyn, jack, ziy, ryan.roberts, patches,
	linux-xfs, linux-fsdevel, linux-block, p.raghav, da.gomez,
	dan.helmick

On Mon, Sep 18, 2023 at 10:12:04AM -0700, Luis Chamberlain wrote:
> On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
> > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
> > > for many Linux distributions since it also means disabling support for most
> > > filesystems other than btrfs and XFS. So we either support larger order folios
> > > on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
> > > 2023 it was decided we would not support larger order folios on buffer-heads,
> > 
> > Um, I didn't agree to that.
> 
> Coverage on sunsetting buffer-heads talk by LWN:
> 
> https://lwn.net/Articles/931809/
> 
> "the apparent conclusion from the session: the buffer-head layer will be
> converted to use folios internally while minimizing changes visible to
> the filesystems using it. Only single-page folios will be used within
> this new buffer-head layer. Any other desires, he said, can be addressed
> later after this problem has been solved."

Other people said that.  Not me.  I said it was fine for single
buffer_head per folio.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
  2023-09-18 18:15     ` Matthew Wilcox
@ 2023-09-18 18:42       ` Hannes Reinecke
  0 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2023-09-18 18:42 UTC (permalink / raw)
  To: Matthew Wilcox, Luis Chamberlain
  Cc: hch, djwong, dchinner, kbusch, sagi, axboe, brauner, ritesh.list,
	rgoldwyn, jack, ziy, ryan.roberts, patches, linux-xfs,
	linux-fsdevel, linux-block, p.raghav, da.gomez, dan.helmick

On 9/18/23 20:15, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 10:12:04AM -0700, Luis Chamberlain wrote:
>> On Fri, Sep 15, 2023 at 10:51:12PM +0100, Matthew Wilcox wrote:
>>> On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
>>>> However, an issue is that disabling CONFIG_BUFFER_HEAD in practice is not viable
>>>> for many Linux distributions since it also means disabling support for most
>>>> filesystems other than btrfs and XFS. So we either support larger order folios
>>>> on buffer-heads, or we draw up a solution to enable co-existence. Since at LSFMM
>>>> 2023 it was decided we would not support larger order folios on buffer-heads,
>>>
>>> Um, I didn't agree to that.
>>
>> Coverage on sunsetting buffer-heads talk by LWN:
>>
>> https://lwn.net/Articles/931809/
>>
>> "the apparent conclusion from the session: the buffer-head layer will be
>> converted to use folios internally while minimizing changes visible to
>> the filesystems using it. Only single-page folios will be used within
>> this new buffer-head layer. Any other desires, he said, can be addressed
>> later after this problem has been solved."
> 
> Other people said that.  Not me.  I said it was fine for single
> buffer_head per folio.

And my patchset proves that to be the case.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-09-18 18:42 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-15 21:32 [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Luis Chamberlain
2023-09-15 21:32 ` [RFC v2 01/10] bdev: rename iomap aops Luis Chamberlain
2023-09-15 21:32 ` [RFC v2 02/10] bdev: dynamically set aops to enable LBS support Luis Chamberlain
2023-09-15 21:32 ` [RFC v2 03/10] bdev: increase bdev max blocksize depending on the aops used Luis Chamberlain
2023-09-15 21:32 ` [RFC v2 04/10] filesystems: add filesytem buffer-head flag Luis Chamberlain
2023-09-15 21:32 ` [RFC v2 05/10] bdev: allow to switch between bdev aops Luis Chamberlain
2023-09-15 21:32 ` [RFC v2 06/10] bdev: simplify coexistance Luis Chamberlain
2023-09-15 21:32 ` [RFC v2 07/10] nvme: enhance max supported LBA format check Luis Chamberlain
2023-09-15 22:20   ` Matthew Wilcox
2023-09-15 22:27     ` Luis Chamberlain
2023-09-15 21:32 ` [RFC v2 08/10] nvme: add awun / nawun sanity check Luis Chamberlain
2023-09-15 21:32 ` [RFC v2 09/10] nvme: add nvme_core.debug_large_atomics to force high awun as phys_bs Luis Chamberlain
2023-09-15 21:32 ` [RFC v2 10/10] nvme: enable LBS support Luis Chamberlain
2023-09-15 21:51 ` [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads Matthew Wilcox
2023-09-15 22:26   ` Luis Chamberlain
2023-09-17 11:50   ` Hannes Reinecke
2023-09-18 17:12   ` Luis Chamberlain
2023-09-18 18:15     ` Matthew Wilcox
2023-09-18 18:42       ` Hannes Reinecke
2023-09-17 22:38 ` Dave Chinner
2023-09-17 23:14   ` Matthew Wilcox
2023-09-18  0:59     ` Dave Chinner
2023-09-18  1:13       ` Luis Chamberlain
2023-09-18  2:49         ` Dave Chinner
2023-09-18 17:51           ` Luis Chamberlain
2023-09-18 11:34     ` Hannes Reinecke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).