* [PATCH] mkfs: use stx_blksize for dev block size by default
@ 2025-02-06 19:00 da.gomez
2025-02-06 22:27 ` Darrick J. Wong
2025-02-07 4:30 ` Christoph Hellwig
0 siblings, 2 replies; 17+ messages in thread
From: da.gomez @ 2025-02-06 19:00 UTC (permalink / raw)
To: linux-xfs; +Cc: Luis Chamberlain, Daniel Gomez, Pankaj Raghav, gost.dev
From: Daniel Gomez <da.gomez@samsung.com>
In patch [1] ("bdev: use bdev_io_min() for statx block size"), block
devices now report their preferred minimum I/O size for optimal
performance in the stx_blksize field of the statx data structure. This
change updates the current default 4 KiB block size for all devices
reporting a minimum I/O larger than 4 KiB, opting instead to use the
device's advertised minimum I/O value from the statx data struct.
[1]:
https://lore.kernel.org/all/20250204231209.429356-9-mcgrof@kernel.org/
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
---
Set MIN-IO from statx as the default filesystem fundamental block size.
This ensures that, for devices reporting values within the supported
XFS block size range, we do not incur RMW. If the MIN-IO reported
value is lower than the current default of 4 KiB, then 4 KiB will be
used instead.
---
mkfs/xfs_mkfs.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index bbd0dbb6c80ab63ebf9edbe0a9a304149770f89d..e17389622682bc23f9b20c207db7e22181e2fe6f 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -2178,6 +2178,26 @@ _("block size %d cannot be smaller than sector size %d\n"),
}
}
+void
+get_dev_blocksize(
+ struct cli_params *cli,
+ struct mkfs_default_params *dft)
+{
+ struct statx stx;
+ int ret;
+
+ if (!cli->xi->data.name)
+ return;
+
+ ret = statx(AT_FDCWD, cli->xi->data.name,
+ AT_STATX_DONT_SYNC | AT_NO_AUTOMOUNT,
+ STATX_DIOALIGN | STATX_TYPE | STATX_MODE | STATX_INO, &stx);
+ if (!ret)
+ dft->blocksize =
+ min(max(1 << XFS_DFL_BLOCKSIZE_LOG, stx.stx_blksize),
+ XFS_MAX_BLOCKSIZE);
+}
+
static void
validate_blocksize(
struct mkfs_params *cfg,
@@ -2189,6 +2209,7 @@ validate_blocksize(
* For RAID4/5/6 we want to align sector size and block size,
* so we need to start with the device geometry extraction too.
*/
+ get_dev_blocksize(cli, dft);
if (!cli->blocksize)
cfg->blocksize = dft->blocksize;
else
---
base-commit: 90d6da68ee54e6d4ef99eca4a82cac6036a34b00
change-id: 20250206-min-io-default-blocksize-13334f54899e
Best regards,
--
Daniel Gomez <da.gomez@samsung.com>
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-06 19:00 [PATCH] mkfs: use stx_blksize for dev block size by default da.gomez
@ 2025-02-06 22:27 ` Darrick J. Wong
2025-02-06 22:50 ` Luis Chamberlain
` (2 more replies)
2025-02-07 4:30 ` Christoph Hellwig
1 sibling, 3 replies; 17+ messages in thread
From: Darrick J. Wong @ 2025-02-06 22:27 UTC (permalink / raw)
To: da.gomez; +Cc: linux-xfs, Luis Chamberlain, Daniel Gomez, Pankaj Raghav,
gost.dev
On Thu, Feb 06, 2025 at 07:00:55PM +0000, da.gomez@kernel.org wrote:
> From: Daniel Gomez <da.gomez@samsung.com>
>
> In patch [1] ("bdev: use bdev_io_min() for statx block size"), block
> devices will now report their preferred minimum I/O size for optimal
> performance in the stx_blksize field of the statx data structure. This
> change updates the current default 4 KiB block size for all devices
> reporting a minimum I/O larger than 4 KiB, opting instead to query for
> its advertised minimum I/O value in the statx data struct.
>
> [1]:
> https://lore.kernel.org/all/20250204231209.429356-9-mcgrof@kernel.org/
This isn't even upstream yet...
> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> ---
> Set MIN-IO from statx as the default filesystem fundamental block size.
> This ensures that, for devices reporting values within the supported
> XFS block size range, we do not incur RMW. If the MIN-IO reported
> value is lower than the current default of 4 KiB, then 4 KiB will be
> used instead.
I don't think this is a good idea -- assuming you mean the same MIN-IO
as what lsblk puts out:
NAME MIN-IO
sda 512
├─sda1 512
├─sda2 512
│ └─node0.boot 512
├─sda3 512
│ └─node0.swap 512
└─sda4 512
└─node0.lvm 512
└─node0-root 512
sdb 4096
└─sdb1 4096
nvme1n1 512
└─md0 524288
└─node0.raid 524288
└─node0_raid-storage 524288
nvme0n1 512
└─md0 524288
└─node0.raid 524288
└─node0_raid-storage 524288
nvme2n1 512
└─md0 524288
└─node0.raid 524288
└─node0_raid-storage 524288
nvme3n1 512
└─md0 524288
└─node0.raid 524288
└─node0_raid-storage 524288
Now you've decreased the default blocksize to 512 on sda, and md0 gets
an impossible 512k blocksize. Also, disrupting the default 4k blocksize
will introduce portability problems with distros that aren't yet
shipping 6.12.
--D
> ---
> mkfs/xfs_mkfs.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index bbd0dbb6c80ab63ebf9edbe0a9a304149770f89d..e17389622682bc23f9b20c207db7e22181e2fe6f 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -2178,6 +2178,26 @@ _("block size %d cannot be smaller than sector size %d\n"),
> }
> }
>
> +void
> +get_dev_blocksize(
> + struct cli_params *cli,
> + struct mkfs_default_params *dft)
> +{
> + struct statx stx;
> + int ret;
> +
> + if (!cli->xi->data.name)
> + return;
> +
> + ret = statx(AT_FDCWD, cli->xi->data.name,
> + AT_STATX_DONT_SYNC | AT_NO_AUTOMOUNT,
> + STATX_DIOALIGN | STATX_TYPE | STATX_MODE | STATX_INO, &stx);
> + if (!ret)
> + dft->blocksize =
> + min(max(1 << XFS_DFL_BLOCKSIZE_LOG, stx.stx_blksize),
> + XFS_MAX_BLOCKSIZE);
> +}
> +
> static void
> validate_blocksize(
> struct mkfs_params *cfg,
> @@ -2189,6 +2209,7 @@ validate_blocksize(
> * For RAID4/5/6 we want to align sector size and block size,
> * so we need to start with the device geometry extraction too.
> */
> + get_dev_blocksize(cli, dft);
> if (!cli->blocksize)
> cfg->blocksize = dft->blocksize;
> else
>
> ---
> base-commit: 90d6da68ee54e6d4ef99eca4a82cac6036a34b00
> change-id: 20250206-min-io-default-blocksize-13334f54899e
>
> Best regards,
> --
> Daniel Gomez <da.gomez@samsung.com>
>
>
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-06 22:27 ` Darrick J. Wong
@ 2025-02-06 22:50 ` Luis Chamberlain
2025-02-06 23:07 ` Darrick J. Wong
2025-02-07 9:12 ` Daniel Gomez
2025-02-07 9:39 ` Daniel Gomez
2025-02-07 19:26 ` Luis Chamberlain
2 siblings, 2 replies; 17+ messages in thread
From: Luis Chamberlain @ 2025-02-06 22:50 UTC (permalink / raw)
To: Darrick J. Wong
Cc: da.gomez, linux-xfs, Daniel Gomez, Pankaj Raghav, gost.dev
On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> Now you've decreased the default blocksize to 512 on sda, and md0 gets
> an impossible 512k blocksize. Also, disrupting the default 4k blocksize
> will introduce portability problems with distros that aren't yet
> shipping 6.12.
Our default should be 4k, and to address the latter we should sanity
check and use an upper limit of what XFS supports, 64k.
Thoughts?
Luis
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-06 22:50 ` Luis Chamberlain
@ 2025-02-06 23:07 ` Darrick J. Wong
2025-02-07 9:12 ` Daniel Gomez
1 sibling, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2025-02-06 23:07 UTC (permalink / raw)
To: Luis Chamberlain
Cc: da.gomez, linux-xfs, Daniel Gomez, Pankaj Raghav, gost.dev
On Thu, Feb 06, 2025 at 02:50:28PM -0800, Luis Chamberlain wrote:
> On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> > Now you've decreased the default blocksize to 512 on sda, and md0 gets
> > an impossible 512k blocksize. Also, disrupting the default 4k blocksize
> > will introduce portability problems with distros that aren't yet
> > shipping 6.12.
>
> Our default should be 4k, and to address the latter we should sanity
> check and use an upper limit of what XFS supports, 64k.
I don't think it's a good idea to boost the default fsblock size beyond
4k until we get further into the era where the major distros are
shipping 6.12 kernels. I wouldn't want to deal with people accidentally
ending up with an 8k fsblock filesystem that they can't mount on fairly
new things like RHEL9/Debian12/etc.
--D
> Thoughts?
>
> Luis
>
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-06 19:00 [PATCH] mkfs: use stx_blksize for dev block size by default da.gomez
2025-02-06 22:27 ` Darrick J. Wong
@ 2025-02-07 4:30 ` Christoph Hellwig
2025-02-07 10:04 ` Daniel Gomez
1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2025-02-07 4:30 UTC (permalink / raw)
To: da.gomez; +Cc: linux-xfs, Luis Chamberlain, Daniel Gomez, Pankaj Raghav,
gost.dev
On Thu, Feb 06, 2025 at 07:00:55PM +0000, da.gomez@kernel.org wrote:
> From: Daniel Gomez <da.gomez@samsung.com>
>
> In patch [1] ("bdev: use bdev_io_min() for statx block size"), block
> devices will now report their preferred minimum I/O size for optimal
> performance in the stx_blksize field of the statx data structure. This
> change updates the current default 4 KiB block size for all devices
> reporting a minimum I/O larger than 4 KiB, opting instead to query for
> its advertised minimum I/O value in the statx data struct.
UUuh, no. Larger block sizes have their use cases, but this will
regress performance for a lot of (most?) common setups. A lot of
devices report fairly high values there, but say increasing the
directory and bmap btree block size unconditionally using the current
algorithms will dramatically increase write amplification. Similarly
for small buffered writes.
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-06 22:50 ` Luis Chamberlain
2025-02-06 23:07 ` Darrick J. Wong
@ 2025-02-07 9:12 ` Daniel Gomez
2025-02-07 19:16 ` Luis Chamberlain
1 sibling, 1 reply; 17+ messages in thread
From: Daniel Gomez @ 2025-02-07 9:12 UTC (permalink / raw)
To: Darrick J. Wong, Luis Chamberlain
Cc: linux-xfs, Daniel Gomez, Pankaj Raghav, gost.dev
On Thu, Feb 06, 2025 at 02:50:28PM +0100, Luis Chamberlain wrote:
> On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> > Now you've decreased the default blocksize to 512 on sda, and md0 gets
> > an impossible 512k blocksize. Also, disrupting the default 4k blocksize
> > will introduce portability problems with distros that aren't yet
> > shipping 6.12.
>
> Our default should be 4k, and to address the latter we should sanity
> check and use an upper limit of what XFS supports, 64k.
To clarify, the patch addresses the cases described above. It sets mkfs.xfs's
default block size to the device’s reported minimum I/O size (via stx_blksize),
clamping the value between 4k and 64k: if the reported size is less than 4k, 4k
is used, and if it exceeds 64k (XFS_MAX_BLOCKSIZE), 64k is applied.
>
> Thoughts?
>
> Luis
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-06 22:27 ` Darrick J. Wong
2025-02-06 22:50 ` Luis Chamberlain
@ 2025-02-07 9:39 ` Daniel Gomez
2025-02-07 19:26 ` Luis Chamberlain
2 siblings, 0 replies; 17+ messages in thread
From: Daniel Gomez @ 2025-02-07 9:39 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-xfs, Luis Chamberlain, Daniel Gomez, Pankaj Raghav,
gost.dev
On Thu, Feb 06, 2025 at 02:27:16PM +0100, Darrick J. Wong wrote:
> On Thu, Feb 06, 2025 at 07:00:55PM +0000, da.gomez@kernel.org wrote:
> > From: Daniel Gomez <da.gomez@samsung.com>
> >
> > In patch [1] ("bdev: use bdev_io_min() for statx block size"), block
> > devices will now report their preferred minimum I/O size for optimal
> > performance in the stx_blksize field of the statx data structure. This
> > change updates the current default 4 KiB block size for all devices
> > reporting a minimum I/O larger than 4 KiB, opting instead to query for
> > its advertised minimum I/O value in the statx data struct.
> >
> > [1]:
> > https://lore.kernel.org/all/20250204231209.429356-9-mcgrof@kernel.org/
>
> This isn't even upstream yet...
>
> > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > ---
> > Set MIN-IO from statx as the default filesystem fundamental block size.
> > This ensures that, for devices reporting values within the supported
> > XFS block size range, we do not incur RMW. If the MIN-IO reported
> > value is lower than the current default of 4 KiB, then 4 KiB will be
> > used instead.
>
> I don't think this is a good idea -- assuming you mean the same MIN-IO
> as what lsblk puts out:
This is just about matching the values in code and documentation across all
layers (to guarantee writes do not incur RMW when possible and supported by
the fs): minimum_io_size (block layer) -> stx_blksize (statx) -> lsblk MIN-IO
(minimum I/O size) -> filesystem fundamental block size (mkfs.xfs -b size).
* MIN-IO is the minimum I/O size in lsblk [1] which should be the queue-sysfs
minimum_io_size [2] [3] ("This is the smallest preferred IO size reported by the
device").
* From statx [4] manual (and kernel statx data struct description), stx_blksize is
'The "preferred" block size for efficient filesystem I/O (Writing to a file in
smaller chunks may cause an inefficient read-modify-rewrite.)'
[1] https://github.com/util-linux/util-linux/blob/master/misc-utils/lsblk.c#L199
[2] minimum_io_size: https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt
[3] https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-block
What: /sys/block/<disk>/queue/minimum_io_size
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RO] Storage devices may report a granularity or preferred
minimum I/O size which is the smallest request the device can
perform without incurring a performance penalty. For disk
drives this is often the physical block size. For RAID arrays
it is often the stripe chunk size. A properly aligned multiple
of minimum_io_size is the preferred request size for workloads
where a high number of I/O operations is desired.
[4] https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man/man2/statx.2?id=master#n369
kernel: __u32 stx_blksize; /* Preferred general I/O size [uncond] */
> nvme1n1 512
> └─md0 524288
> └─node0.raid 524288
> └─node0_raid-storage 524288
Is the MIN-IO correctly reported in RAID arrays here? I guess it should match
the stripe chunk size as per description above?
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-07 4:30 ` Christoph Hellwig
@ 2025-02-07 10:04 ` Daniel Gomez
2025-02-13 4:10 ` Christoph Hellwig
0 siblings, 1 reply; 17+ messages in thread
From: Daniel Gomez @ 2025-02-07 10:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-xfs, Luis Chamberlain, Daniel Gomez, Pankaj Raghav,
gost.dev
On Thu, Feb 06, 2025 at 08:30:22PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 06, 2025 at 07:00:55PM +0000, da.gomez@kernel.org wrote:
> > From: Daniel Gomez <da.gomez@samsung.com>
> >
> > In patch [1] ("bdev: use bdev_io_min() for statx block size"), block
> > devices will now report their preferred minimum I/O size for optimal
> > performance in the stx_blksize field of the statx data structure. This
> > change updates the current default 4 KiB block size for all devices
> > reporting a minimum I/O larger than 4 KiB, opting instead to query for
> > its advertised minimum I/O value in the statx data struct.
>
> UUuh, no. Larger block sizes have their use cases, but this will
> regress performance for a lot of (most?) common setups. A lot of
> devices report fairly high values there, but say increasing the
Are these devices reporting the correct value? As I mentioned in my discussion
with Darrick, matching the minimum_io_size with the "fs fundamental blocksize"
actually allows us to avoid RMW operations (when using the default path in
mkfs.xfs and the value reported is within boundaries).
> directory and bmap btree block size unconditionally using the current
> algorithms will dramatically increase write amplification. Similarly
I agree, but it seems to be a consequence of using such a large minimum_io_size.
> for small buffered writes.
>
Exactly. Even though write amplification happens with a 512-byte minimum_io_size
and a 4k default block size, it doesn't incur a performance penalty. But we will
incur one when minimum_io_size exceeds 4k. So, this solves the issue but comes
at the cost of write amplification.
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-07 9:12 ` Daniel Gomez
@ 2025-02-07 19:16 ` Luis Chamberlain
0 siblings, 0 replies; 17+ messages in thread
From: Luis Chamberlain @ 2025-02-07 19:16 UTC (permalink / raw)
To: Daniel Gomez
Cc: Darrick J. Wong, linux-xfs, Daniel Gomez, Pankaj Raghav, gost.dev
On Fri, Feb 07, 2025 at 10:12:07AM +0100, Daniel Gomez wrote:
> On Thu, Feb 06, 2025 at 02:50:28PM +0100, Luis Chamberlain wrote:
> > On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> > > Now you've decreased the default blocksize to 512 on sda, and md0 gets
> > > an impossible 512k blocksize. Also, disrupting the default 4k blocksize
> > > will introduce portability problems with distros that aren't yet
> > > shipping 6.12.
> >
> > Our default should be 4k, and to address the later we should sanity
> > check and user an upper limit of what XFS supports, 64k.
>
> To clarify, the patch addresses the cases described above. It sets mkfs.xfs's
> default block size to the device’s reported minimum I/O size (via stx_blksize),
> clamping the value between 4k and 64k: if the reported size is less than 4k, 4k
> is used, and if it exceeds 64k (XFS_MAX_BLOCKSIZE), 64k is applied.
Ah, neat.
Then the only other concern expressed by Darrick is using this for
distros not yet shipping v6.12. Let me address that next in a reply
to another part of this thread.
Luis
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-06 22:27 ` Darrick J. Wong
2025-02-06 22:50 ` Luis Chamberlain
2025-02-07 9:39 ` Daniel Gomez
@ 2025-02-07 19:26 ` Luis Chamberlain
2025-02-07 19:31 ` Darrick J. Wong
2 siblings, 1 reply; 17+ messages in thread
From: Luis Chamberlain @ 2025-02-07 19:26 UTC (permalink / raw)
To: Darrick J. Wong
Cc: da.gomez, linux-xfs, Daniel Gomez, Pankaj Raghav, gost.dev
On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> NAME MIN-IO
> sda 512
> ├─sda1 512
> ├─sda2 512
> │ └─node0.boot 512
> ├─sda3 512
> │ └─node0.swap 512
> └─sda4 512
> └─node0.lvm 512
> └─node0-root 512
> sdb 4096
> └─sdb1 4096
> nvme1n1 512
> └─md0 524288
> └─node0.raid 524288
> └─node0_raid-storage 524288
> nvme0n1 512
> └─md0 524288
> └─node0.raid 524288
> └─node0_raid-storage 524288
> nvme2n1 512
> └─md0 524288
> └─node0.raid 524288
> └─node0_raid-storage 524288
> nvme3n1 512
> └─md0 524288
> └─node0.raid 524288
> └─node0_raid-storage 524288
Can you try this for each of these:
stat --print=%o
I believe that without that new patch I posted [0] you will get 4 KiB
here. Then the blocksize passed won't be the min-io until that patch
gets applied.
The above is:
statx(AT_FDCWD, "/dev/nvme0n1", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, 0,
{stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0,
stx_mode=S_IFBLK|0660, stx_size=0, ...}) = 0
So if we use this instead at mkfs, then even older kernels will get 4
KiB, and if distros want to automatically lift the value at mkfs, they
could cherry pick that simple patch.
[0] https://lkml.kernel.org/r/20250204231209.429356-9-mcgrof@kernel.org
Luis
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-07 19:26 ` Luis Chamberlain
@ 2025-02-07 19:31 ` Darrick J. Wong
2025-02-07 19:44 ` Luis Chamberlain
0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2025-02-07 19:31 UTC (permalink / raw)
To: Luis Chamberlain
Cc: da.gomez, linux-xfs, Daniel Gomez, Pankaj Raghav, gost.dev
On Fri, Feb 07, 2025 at 11:26:20AM -0800, Luis Chamberlain wrote:
> On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> > NAME MIN-IO
> > sda 512
> > ├─sda1 512
> > ├─sda2 512
> > │ └─node0.boot 512
> > ├─sda3 512
> > │ └─node0.swap 512
> > └─sda4 512
> > └─node0.lvm 512
> > └─node0-root 512
> > sdb 4096
> > └─sdb1 4096
> > nvme1n1 512
> > └─md0 524288
> > └─node0.raid 524288
> > └─node0_raid-storage 524288
> > nvme0n1 512
> > └─md0 524288
> > └─node0.raid 524288
> > └─node0_raid-storage 524288
> > nvme2n1 512
> > └─md0 524288
> > └─node0.raid 524288
> > └─node0_raid-storage 524288
> > nvme3n1 512
> > └─md0 524288
> > └─node0.raid 524288
> > └─node0_raid-storage 524288
>
> Can you try this for each of these:
>
> stat --print=%o
>
> I believe that without that new patch I posted [0] you will get 4 KiB
> here. Then the blocksize passed won't be the min-io until that patch
> gets applied.
Yes, that returns 4K on 6.13.0 for every device in the list. I think
you're saying that stat will start returning 512K for the blocksize if
your patch is merged?
> The above is:
>
> statx(AT_FDCWD, "/dev/nvme0n1", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, 0,
> {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0,
> stx_mode=S_IFBLK|0660, stx_size=0, ...}) = 0
>
> So if we use this instead at mkfs, then even older kernels will get 4
> KiB, and if distros want to automatically lift the value at mkfs, they
> could cherry pick that simple patch.
How well does that work if the gold master image creator machine has a
new kernel and a RAID setup, but the kernel written into the gold master
image is something older than a 6.12 kernel?
--D
>
> [0] https://lkml.kernel.org/r/20250204231209.429356-9-mcgrof@kernel.org
>
> Luis
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-07 19:31 ` Darrick J. Wong
@ 2025-02-07 19:44 ` Luis Chamberlain
0 siblings, 0 replies; 17+ messages in thread
From: Luis Chamberlain @ 2025-02-07 19:44 UTC (permalink / raw)
To: Darrick J. Wong
Cc: da.gomez, linux-xfs, Daniel Gomez, Pankaj Raghav, gost.dev
On Fri, Feb 07, 2025 at 11:31:17AM -0800, Darrick J. Wong wrote:
> On Fri, Feb 07, 2025 at 11:26:20AM -0800, Luis Chamberlain wrote:
> > On Thu, Feb 06, 2025 at 02:27:16PM -0800, Darrick J. Wong wrote:
> > > NAME MIN-IO
> > > sda 512
> > > ├─sda1 512
> > > ├─sda2 512
> > > │ └─node0.boot 512
> > > ├─sda3 512
> > > │ └─node0.swap 512
> > > └─sda4 512
> > > └─node0.lvm 512
> > > └─node0-root 512
> > > sdb 4096
> > > └─sdb1 4096
> > > nvme1n1 512
> > > └─md0 524288
> > > └─node0.raid 524288
> > > └─node0_raid-storage 524288
> > > nvme0n1 512
> > > └─md0 524288
> > > └─node0.raid 524288
> > > └─node0_raid-storage 524288
> > > nvme2n1 512
> > > └─md0 524288
> > > └─node0.raid 524288
> > > └─node0_raid-storage 524288
> > > nvme3n1 512
> > > └─md0 524288
> > > └─node0.raid 524288
> > > └─node0_raid-storage 524288
> >
> > Can you try this for each of these:
> >
> > stat --print=%o
> >
> > I believe that without that new patch I posted [0] you will get 4 KiB
> > here. Then the blocksize passed won't be the min-io until that patch
> > gets applied.
>
> Yes, that returns 4K on 6.13.0 for every device in the list. I think
> you're saying that stat will start returning 512K for the blocksize if
> your patch is merged?
Yes, as that *is* min-io for block devices.
> > The above is:
> >
> > statx(AT_FDCWD, "/dev/nvme0n1", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, 0,
> > {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0,
> > stx_mode=S_IFBLK|0660, stx_size=0, ...}) = 0
> >
> > So if we use this instead at mkfs, then even older kernels will get 4
> > KiB, and if distros want to automatically lift the value at mkfs, they
> > could cherry pick that simple patch.
>
> How well does that work if the gold master image creator machine has a
> new kernel and a RAID setup, but the kernel written into the gold master
> image is something older than a 6.12 kernel?
I think you're asking about an image creator for an old release and that
old release uses an old kernel. Isn't that the same concern as what if
the same new kernel creates a filesystem on an image for an old release?
If so the risk is the new kernel defaults take precedence.
And the only thing I can think of for this case is an option to ignore this
heuristic for blocksize, which could be used in configs to override
defaults for older target kernels.
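One possible shape for such an override, sketched as an mkfs.xfs-style
config fragment (hypothetical — the file path and key here are
illustrative only, not an existing mkfs.xfs configuration schema):

```ini
# /etc/mkfs.xfs.d/old-target.conf (illustrative)
[block]
# Pin the fsblock size; skip the statx-derived heuristic so images
# built on new kernels stay mountable on pre-6.12 target kernels.
size=4096
```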
Luis
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-07 10:04 ` Daniel Gomez
@ 2025-02-13 4:10 ` Christoph Hellwig
2025-02-13 13:26 ` Daniel Gomez
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2025-02-13 4:10 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, linux-xfs, Luis Chamberlain, Daniel Gomez,
Pankaj Raghav, gost.dev
On Fri, Feb 07, 2025 at 11:04:06AM +0100, Daniel Gomez wrote:
> > > performance in the stx_blksize field of the statx data structure. This
> > > change updates the current default 4 KiB block size for all devices
> > > reporting a minimum I/O larger than 4 KiB, opting instead to query for
> > > its advertised minimum I/O value in the statx data struct.
> >
> > UUuh, no. Larger block sizes have their use cases, but this will
> > regress performance for a lot of (most?) common setups. A lot of
> > devices report fairly high values there, but say increasing the
>
> Are these devices reporting the correct value?
Who defines what "correct" means to start with?
> As I mentioned in my
> discussion with Darrick, matching the minimum_io_size with the "fs
> fundamental blocksize" actually allows to avoid RMW operations (when
> using the default path in mkfs.xfs and the value reported is within
> boundaries).
At least for buffered I/O it does not remove RMW operations at all,
it just moves them up.
And for plenty devices the min_io size might be set to larger than
LBA size, but the device is still reasonably efficient at handling
smaller I/O (e.g. because it has power loss protection and stages
writes in powerfail protected memory).
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-13 4:10 ` Christoph Hellwig
@ 2025-02-13 13:26 ` Daniel Gomez
2025-02-18 8:30 ` Christoph Hellwig
0 siblings, 1 reply; 17+ messages in thread
From: Daniel Gomez @ 2025-02-13 13:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-xfs, Luis Chamberlain, Daniel Gomez, Pankaj Raghav,
gost.dev
On Wed, Feb 12, 2025 at 08:10:04PM +0100, Christoph Hellwig wrote:
> On Fri, Feb 07, 2025 at 11:04:06AM +0100, Daniel Gomez wrote:
> > > > performance in the stx_blksize field of the statx data structure. This
> > > > change updates the current default 4 KiB block size for all devices
> > > > reporting a minimum I/O larger than 4 KiB, opting instead to query for
> > > > its advertised minimum I/O value in the statx data struct.
> > >
> > > UUuh, no. Larger block sizes have their use cases, but this will
> > > regress performance for a lot of (most?) common setups. A lot of
> > > devices report fairly high values there, but say increasing the
> >
> > Are these devices reporting the correct value?
>
> Who defines what "correct" means to start with?
That's a good question. The stx_blksize field description indicates the value
should be referring to the fs block size that avoids RMW.
* From statx manual:
DESCRIPTION
struct statx {
__u32 stx_blksize; /* Block size for filesystem I/O */
{...}
The returned information
stx_blksize
The "preferred" block size for efficient filesystem I/O. (Writing
to a file in smaller chunks may cause an inefficient read-modify-rewrite.)
* From include/uapi/linux/stat.h:
struct statx {
/* Preferred general I/O size [uncond] */
__u32 stx_blksize;
So I think, if devices report high values in stx_blksize, it is either because
smaller values than the reported one cause RMW or they are incorrectly reporting
a value in the wrong statx field.
The commit I refer in the commit message maps the minimum_io_size reported by
the block layer with stx_blksize.
* Documentation/ABI/stable/sysfs-block
What: /sys/block/<disk>/queue/minimum_io_size
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RO] Storage devices may report a granularity or preferred
minimum I/O size which is the smallest request the device can
perform without incurring a performance penalty. For disk
drives this is often the physical block size. For RAID arrays
it is often the stripe chunk size. A properly aligned multiple
of minimum_io_size is the preferred request size for workloads
where a high number of I/O operations is desired.
I guess the correct approach is to ensure the mapping only occurs for "disk
drives" and we avoid mapping the RAID stripe chunk size to stx_blksize.
In addition, we also have the optimal I/O size:
What: /sys/block/<disk>/queue/optimal_io_size
Date: April 2009
Contact: Martin K. Petersen <martin.petersen@oracle.com>
Description:
[RO] Storage devices may report an optimal I/O size, which is
the device's preferred unit for sustained I/O. This is rarely
reported for disk drives. For RAID arrays it is usually the
stripe width or the internal track size. A properly aligned
multiple of optimal_io_size is the preferred request size for
workloads where sustained throughput is desired. If no optimal
I/O size is reported this file contains 0.
Optimal I/O is not preferred I/O (minimum_io_size). However, I think xfs mount
option mixes both values:
* From Documentation/admin-guide/xfs.rst
Mount Options
=============
largeio or nolargeio (default)
If ``nolargeio`` is specified, the optimal I/O reported in
``st_blksize`` by **stat(2)** will be as small as possible to allow
user applications to avoid inefficient read/modify/write
I/O. This is typically the page size of the machine, as
this is the granularity of the page cache.
But it must be referring to the minimum_io_size as this is the limit "to
avoid inefficient read/modify/write I/O" as per sysfs-block minimum_io_size
description, right?
>
> > As I mentioned in my
> > discussion with Darrick, matching the minimum_io_size with the "fs
> > fundamental blocksize" actually allows to avoid RMW operations (when
> > using the default path in mkfs.xfs and the value reported is within
> > boundaries).
>
> At least for buffered I/O it does not remove RMW operations at all,
> it just moves them up.
But only if the RMW boundary is smaller than the fs bs. If they are matched,
it should not move them up.
>
> And for plenty devices the min_io size might be set to larger than
> LBA size,
Yes, I agree.
> but the device is still reasonably efficient at handling
> smaller I/O (e.g. because it has power loss protection and stages
That is device specific and covered in the stx_blksize and minimum_io_size
descriptions.
> writes in powerfail protected memory).
Assuming min_io is lsblk MIN-IO reported field, this is the block
minimum_io_size.
* From util-linux code:
case COL_MINIO:
ul_path_read_string(dev->sysfs, &str, "queue/minimum_io_size");
if (rawdata)
str2u64(str, rawdata);
break;
I think this might be a bit trickier. The value should report the disk
"preferred I/O minimum granularity" and "the stripe chunk size" for RAID arrays.
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-13 13:26 ` Daniel Gomez
@ 2025-02-18 8:30 ` Christoph Hellwig
2025-05-09 14:27 ` Luis Chamberlain
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2025-02-18 8:30 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, linux-xfs, Luis Chamberlain, Daniel Gomez,
Pankaj Raghav, gost.dev
On Thu, Feb 13, 2025 at 02:26:37PM +0100, Daniel Gomez wrote:
> That's a good question. The stx_blksize field description indicates the value
> should be referring to the fs block size that avoids RMW.
One that is optimal, the RMW is an example. This what Posix says:
blksize_t st_blksize A file system-specific preferred I/O block size
for this object. In some file system types, this
may vary from file to file.
> So I think, if devices report high values in stx_blksize, it is either because
> smaller values than the reported one cause RMW or they are incorrectly reporting
> a value in the wrong statx field.
Or it is just more efficient. E.g. on NFS or XFS you'll get fairly
big values.
> The commit I refer in the commit message maps the minimum_io_size reported by
> the block layer with stx_blksize.
Yes, and that was my question again - minimum_io_size for block
devices is specified in even more handwavy terms.
In other words I'm really sceptical this is going to be a net win.
To win me over you'll need to show an improvement over a wide variety
of devices and workloads.
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-02-18 8:30 ` Christoph Hellwig
@ 2025-05-09 14:27 ` Luis Chamberlain
2025-05-13 5:33 ` Christoph Hellwig
0 siblings, 1 reply; 17+ messages in thread
From: Luis Chamberlain @ 2025-05-09 14:27 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Daniel Gomez, linux-xfs, Daniel Gomez, Pankaj Raghav, gost.dev
On Tue, Feb 18, 2025 at 12:30:57AM -0800, Christoph Hellwig wrote:
> On Thu, Feb 13, 2025 at 02:26:37PM +0100, Daniel Gomez wrote:
> > That's a good question. The stx_blksize field description indicates the value
> > should be referring to the fs block size that avoids RMW.
>
> One that is optimal, the RMW is an example. This what Posix says:
>
> blksize_t st_blksize A file system-specific preferred I/O block size
> for this object. In some file system types, this
> may vary from file to file.
>
So this should just be respinned with just the stat blocksize?
Luis
* Re: [PATCH] mkfs: use stx_blksize for dev block size by default
2025-05-09 14:27 ` Luis Chamberlain
@ 2025-05-13 5:33 ` Christoph Hellwig
0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2025-05-13 5:33 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Christoph Hellwig, Daniel Gomez, linux-xfs, Daniel Gomez,
Pankaj Raghav, gost.dev
On Fri, May 09, 2025 at 07:27:24AM -0700, Luis Chamberlain wrote:
> So this should just be repsined with this just stat blocksize?
No, as mentioned before the entire approach is incorrect.
end of thread, other threads:[~2025-05-13 5:33 UTC | newest]
Thread overview: 17+ messages
2025-02-06 19:00 [PATCH] mkfs: use stx_blksize for dev block size by default da.gomez
2025-02-06 22:27 ` Darrick J. Wong
2025-02-06 22:50 ` Luis Chamberlain
2025-02-06 23:07 ` Darrick J. Wong
2025-02-07 9:12 ` Daniel Gomez
2025-02-07 19:16 ` Luis Chamberlain
2025-02-07 9:39 ` Daniel Gomez
2025-02-07 19:26 ` Luis Chamberlain
2025-02-07 19:31 ` Darrick J. Wong
2025-02-07 19:44 ` Luis Chamberlain
2025-02-07 4:30 ` Christoph Hellwig
2025-02-07 10:04 ` Daniel Gomez
2025-02-13 4:10 ` Christoph Hellwig
2025-02-13 13:26 ` Daniel Gomez
2025-02-18 8:30 ` Christoph Hellwig
2025-05-09 14:27 ` Luis Chamberlain
2025-05-13 5:33 ` Christoph Hellwig