* [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size
@ 2025-07-11  8:09 John Garry
  2025-07-11  8:09 ` [PATCH v6 1/6] ilog2: add max_pow_of_two_factor() John Garry
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: John Garry @ 2025-07-11  8:09 UTC (permalink / raw)
  To: agk, snitzer, mpatocka, song, yukuai3, hch, nilay, axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong, John Garry

Currently the stripe size, held in the io_min limit, is used to configure
the atomic write limits for a stacked device. The idea is that the atomic
write unit max is a power-of-2 factor of the stripe size, and the stripe
size is available in io_min.
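
As a rough standalone illustration (values are hypothetical, not taken
from any real device), the clamping works like this:

#include <stdio.h>

int main(void)
{
	unsigned int stripe_bytes = 24 * 1024;	/* 24K stripe, not a power-of-2 */
	unsigned int unit_max = 16 * 1024;	/* bottom device atomic unit max */
	/* largest power-of-2 factor of the stripe size: 8K here */
	unsigned int factor = stripe_bytes & -stripe_bytes;

	if (unit_max > factor)
		unit_max = factor;
	printf("clamped atomic write unit max: %u\n", unit_max);	/* 8192 */
	return 0;
}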

Using io_min for this causes issues, as:
a. it may be mutated by the block stacking code
b. the check on io_min being set, used to determine whether we are dealing
with a striped device, is hard to get right, as reported in [0].

This series instead sets the chunk_sectors limit to hold the stripe size.

[0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781

Based on 8b428f42f3edf nbd: fix lockdep deadlock warning

This series fixes issues present in v6.16, but it is probably better to
queue it for v6.17.

Differences to v5:
- Neaten code in blk_validate_atomic_write_limits() (Jens)

Differences to v4:
- Use check_shl_overflow() (Nilay)
- Use long long for chunk bytes in 2/6
- Add tags from Nilay (thanks!)

Differences to v3:
- Relocate max_pow_of_two_factor() to common header and rework (Mikulas)
- Cater for overflow from chunk sectors (Mikulas)

John Garry (6):
  ilog2: add max_pow_of_two_factor()
  block: sanitize chunk_sectors for atomic write limits
  md/raid0: set chunk_sectors limit
  md/raid10: set chunk_sectors limit
  dm-stripe: limit chunk_sectors to the stripe size
  block: use chunk_sectors when evaluating stacked atomic write limits

 block/blk-settings.c   | 61 ++++++++++++++++++++++++++----------------
 drivers/md/dm-stripe.c |  1 +
 drivers/md/raid0.c     |  1 +
 drivers/md/raid10.c    |  1 +
 fs/xfs/xfs_mount.c     |  5 ----
 include/linux/log2.h   | 14 ++++++++++
 6 files changed, 55 insertions(+), 28 deletions(-)

-- 
2.43.5



* [PATCH v6 1/6] ilog2: add max_pow_of_two_factor()
  2025-07-11  8:09 [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size John Garry
@ 2025-07-11  8:09 ` John Garry
  2025-07-11  8:09 ` [PATCH v6 2/6] block: sanitize chunk_sectors for atomic write limits John Garry
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: John Garry @ 2025-07-11  8:09 UTC (permalink / raw)
  To: agk, snitzer, mpatocka, song, yukuai3, hch, nilay, axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong, John Garry

Relocate the function max_pow_of_two_factor() from the xfs code to the
common ilog2.h header, as it will be used elsewhere.

Also simplify the function, as advised by Mikulas Patocka.
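
For non-zero inputs the simplified form is equivalent to the old
1 << (ffs(nr) - 1); a throwaway userspace check of that (illustrative
only, not part of the patch):

#include <assert.h>
#include <strings.h>	/* ffs() */

int main(void)
{
	/* n == 0 is excluded: the old form would shift by -1 there */
	for (unsigned int n = 1; n < 1000000; n++)
		assert((1u << (ffs(n) - 1)) == (n & -n));
	return 0;
}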

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_mount.c   |  5 -----
 include/linux/log2.h | 14 ++++++++++++++
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 29276fe60df9c..6c669ae082d4d 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -672,11 +672,6 @@ static inline xfs_extlen_t xfs_calc_atomic_write_max(struct xfs_mount *mp)
 	return rounddown_pow_of_two(XFS_B_TO_FSB(mp, MAX_RW_COUNT));
 }
 
-static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
-{
-	return 1 << (ffs(nr) - 1);
-}
-
 /*
  * If the data device advertises atomic write support, limit the size of data
  * device atomic writes to the greatest power-of-two factor of the AG size so
diff --git a/include/linux/log2.h b/include/linux/log2.h
index 1366cb688a6d9..2eac3fc9303d6 100644
--- a/include/linux/log2.h
+++ b/include/linux/log2.h
@@ -255,4 +255,18 @@ int __bits_per(unsigned long n)
 	) :					\
 	__bits_per(n)				\
 )
+
+/**
+ * max_pow_of_two_factor - return highest power-of-2 factor
+ * @n: parameter
+ *
+ * Find the highest power-of-2 which evenly divides n.
+ * 0 is returned for n == 0.
+ */
+static inline __attribute__((const))
+unsigned int max_pow_of_two_factor(unsigned int n)
+{
+	return n & -n;
+}
+
 #endif /* _LINUX_LOG2_H */
-- 
2.43.5



* [PATCH v6 2/6] block: sanitize chunk_sectors for atomic write limits
  2025-07-11  8:09 [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size John Garry
  2025-07-11  8:09 ` [PATCH v6 1/6] ilog2: add max_pow_of_two_factor() John Garry
@ 2025-07-11  8:09 ` John Garry
  2025-07-11  8:42   ` Damien Le Moal
  2025-07-11  8:09 ` [PATCH v6 3/6] md/raid0: set chunk_sectors limit John Garry
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: John Garry @ 2025-07-11  8:09 UTC (permalink / raw)
  To: agk, snitzer, mpatocka, song, yukuai3, hch, nilay, axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong, John Garry

Currently we just ensure that a non-zero value in chunk_sectors aligns
with any atomic write boundary, as the block layer boundary functionality
uses both of these values.

However, it is also improper to have atomic write unit max > chunk_sectors
(for non-zero chunk_sectors), as this would lead to splitting of atomic
write bios, which is disallowed. As a hypothetical example, chunk_sectors
= 16 gives an 8KiB chunk, so an atomic write unit max of 16KiB would mean
that even a chunk-aligned 16KiB atomic write spans two chunks and would
have to be split.

Sanitize atomic write unit max against chunk_sectors to avoid any
potential problems.

Fixes: d00eea91deaf3 ("block: Add extra checks in blk_validate_atomic_write_limits()")
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-settings.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index a000daafbfb48..a2c089167174e 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -180,6 +180,7 @@ static void blk_atomic_writes_update_limits(struct queue_limits *lim)
 
 static void blk_validate_atomic_write_limits(struct queue_limits *lim)
 {
+	unsigned long long chunk_bytes = lim->chunk_sectors << SECTOR_SHIFT;
 	unsigned int boundary_sectors;
 
 	if (!(lim->features & BLK_FEAT_ATOMIC_WRITES))
@@ -202,6 +203,10 @@ static void blk_validate_atomic_write_limits(struct queue_limits *lim)
 			 lim->atomic_write_hw_max))
 		goto unsupported;
 
+	if (WARN_ON_ONCE(chunk_bytes &&
+			lim->atomic_write_hw_unit_max > chunk_bytes))
+		goto unsupported;
+
 	boundary_sectors = lim->atomic_write_hw_boundary >> SECTOR_SHIFT;
 
 	if (boundary_sectors) {
-- 
2.43.5



* [PATCH v6 3/6] md/raid0: set chunk_sectors limit
  2025-07-11  8:09 [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size John Garry
  2025-07-11  8:09 ` [PATCH v6 1/6] ilog2: add max_pow_of_two_factor() John Garry
  2025-07-11  8:09 ` [PATCH v6 2/6] block: sanitize chunk_sectors for atomic write limits John Garry
@ 2025-07-11  8:09 ` John Garry
  2025-07-11  8:09 ` [PATCH v6 4/6] md/raid10: " John Garry
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: John Garry @ 2025-07-11  8:09 UTC (permalink / raw)
  To: agk, snitzer, mpatocka, song, yukuai3, hch, nilay, axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong, John Garry

Currently we use the io_min limit as the chunk size when deciding on the
atomic write size limits - see blk_stack_atomic_writes_head().

The io_min limit is not a reliable place to store the chunk size, as it
may be mutated by the block stacking code. One example is when io_min is
less than the physical block size: io_min is then raised to the physical
block size - see blk_stack_limits().

The block stacking limits will rely on chunk_sectors in future, so set
that limit to the chunk size.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/md/raid0.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index d8f639f4ae123..cbe2a9054cb91 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -384,6 +384,7 @@ static int raid0_set_limits(struct mddev *mddev)
 	lim.max_write_zeroes_sectors = mddev->chunk_sectors;
 	lim.io_min = mddev->chunk_sectors << 9;
 	lim.io_opt = lim.io_min * mddev->raid_disks;
+	lim.chunk_sectors = mddev->chunk_sectors;
 	lim.features |= BLK_FEAT_ATOMIC_WRITES;
 	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
 	if (err)
-- 
2.43.5



* [PATCH v6 4/6] md/raid10: set chunk_sectors limit
  2025-07-11  8:09 [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size John Garry
                   ` (2 preceding siblings ...)
  2025-07-11  8:09 ` [PATCH v6 3/6] md/raid0: set chunk_sectors limit John Garry
@ 2025-07-11  8:09 ` John Garry
  2025-07-11  8:09 ` [PATCH v6 5/6] dm-stripe: limit chunk_sectors to the stripe size John Garry
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: John Garry @ 2025-07-11  8:09 UTC (permalink / raw)
  To: agk, snitzer, mpatocka, song, yukuai3, hch, nilay, axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong, John Garry

As done for raid0, set the chunk_sectors limit to appropriately set the
atomic write size limit.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/md/raid10.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b74780af4c220..97065bb26f430 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4004,6 +4004,7 @@ static int raid10_set_queue_limits(struct mddev *mddev)
 	md_init_stacking_limits(&lim);
 	lim.max_write_zeroes_sectors = 0;
 	lim.io_min = mddev->chunk_sectors << 9;
+	lim.chunk_sectors = mddev->chunk_sectors;
 	lim.io_opt = lim.io_min * raid10_nr_stripes(conf);
 	lim.features |= BLK_FEAT_ATOMIC_WRITES;
 	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
-- 
2.43.5



* [PATCH v6 5/6] dm-stripe: limit chunk_sectors to the stripe size
  2025-07-11  8:09 [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size John Garry
                   ` (3 preceding siblings ...)
  2025-07-11  8:09 ` [PATCH v6 4/6] md/raid10: " John Garry
@ 2025-07-11  8:09 ` John Garry
  2025-07-11  8:09 ` [PATCH v6 6/6] block: use chunk_sectors when evaluating stacked atomic write limits John Garry
  2025-07-11  8:44 ` [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size Damien Le Moal
  6 siblings, 0 replies; 17+ messages in thread
From: John Garry @ 2025-07-11  8:09 UTC (permalink / raw)
  To: agk, snitzer, mpatocka, song, yukuai3, hch, nilay, axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong, John Garry

As done for raid0, set the chunk_sectors limit to appropriately set the
atomic write size limit.

Setting the chunk_sectors limit in this way overrides the stacked limit
already calculated from the bottom device limits. This is OK: when any
bios are sent to the bottom devices, the block layer will still respect
each bottom device's chunk_sectors.
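
As a simplified sketch of that splitting behaviour (assuming a power-of-2
chunk_sectors, and using an illustrative helper name rather than the real
block layer code), the number of sectors a bio may carry before hitting a
bottom device's chunk boundary is roughly:

/* illustrative only; offset and the result are in sectors */
static unsigned int sectors_to_chunk_end(unsigned long long offset,
					 unsigned int chunk_sectors)
{
	return chunk_sectors - (offset & (chunk_sectors - 1));
}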

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/md/dm-stripe.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index a7dc04bd55e5c..5bbbdf8fc1bde 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -458,6 +458,7 @@ static void stripe_io_hints(struct dm_target *ti,
 	struct stripe_c *sc = ti->private;
 	unsigned int chunk_size = sc->chunk_size << SECTOR_SHIFT;
 
+	limits->chunk_sectors = sc->chunk_size;
 	limits->io_min = chunk_size;
 	limits->io_opt = chunk_size * sc->stripes;
 }
-- 
2.43.5



* [PATCH v6 6/6] block: use chunk_sectors when evaluating stacked atomic write limits
  2025-07-11  8:09 [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size John Garry
                   ` (4 preceding siblings ...)
  2025-07-11  8:09 ` [PATCH v6 5/6] dm-stripe: limit chunk_sectors to the stripe size John Garry
@ 2025-07-11  8:09 ` John Garry
  2025-07-11  8:44 ` [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size Damien Le Moal
  6 siblings, 0 replies; 17+ messages in thread
From: John Garry @ 2025-07-11  8:09 UTC (permalink / raw)
  To: agk, snitzer, mpatocka, song, yukuai3, hch, nilay, axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong, John Garry

The atomic write unit max value is limited by any stacked device stripe
size.

It is required that the atomic write unit is a power-of-2 factor of the
stripe size.

Currently we use the io_min limit to hold the stripe size, and check for
io_min <= SECTOR_SIZE when deciding if we have a striped stacked device.

Nilay reports that this causes a problem when the physical block size is
greater than SECTOR_SIZE [0].

Furthermore, io_min may be mutated when stacking devices, which makes it
a poor candidate to hold the stripe size. One example of when io_min may
change is when it is less than the physical block size.

Use chunk_sectors to hold the stripe size, which is more appropriate.

[0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781
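
One detail handled in the code below is a chunk_sectors value whose byte
count would not fit in 32 bits; a rough userspace sketch of that fallback,
with check_shl_overflow() open-coded and the helper name invented for
illustration:

#define SECTOR_SHIFT	9

static unsigned int chunk_size_bytes(unsigned int chunk_sectors)
{
	/* would chunk_sectors << SECTOR_SHIFT overflow 32 bits? */
	if (chunk_sectors > (~0u >> SECTOR_SHIFT))
		return chunk_sectors;	/* atomic writes this large are unsupported anyway */
	return chunk_sectors << SECTOR_SHIFT;
}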

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-settings.c | 56 ++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 23 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index a2c089167174e..2e519458e7b1a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -594,41 +594,50 @@ static bool blk_stack_atomic_writes_boundary_head(struct queue_limits *t,
 	return true;
 }
 
-
-/* Check stacking of first bottom device */
-static bool blk_stack_atomic_writes_head(struct queue_limits *t,
-				struct queue_limits *b)
+static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
 {
-	if (b->atomic_write_hw_boundary &&
-	    !blk_stack_atomic_writes_boundary_head(t, b))
-		return false;
+	unsigned int chunk_bytes;
 
-	if (t->io_min <= SECTOR_SIZE) {
-		/* No chunk sectors, so use bottom device values directly */
-		t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
-		t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
-		t->atomic_write_hw_max = b->atomic_write_hw_max;
-		return true;
-	}
+	if (!t->chunk_sectors)
+		return;
+
+	/*
+	 * If chunk sectors is so large that its value in bytes overflows
+	 * UINT_MAX, then just shift it down so it definitely will fit.
+	 * We don't support atomic writes of such a large size anyway.
+	 */
+	if (check_shl_overflow(t->chunk_sectors, SECTOR_SHIFT, &chunk_bytes))
+		chunk_bytes = t->chunk_sectors;
 
 	/*
 	 * Find values for limits which work for chunk size.
 	 * b->atomic_write_hw_unit_{min, max} may not be aligned with chunk
-	 * size (t->io_min), as chunk size is not restricted to a power-of-2.
+	 * size, as the chunk size is not restricted to a power-of-2.
 	 * So we need to find highest power-of-2 which works for the chunk
 	 * size.
-	 * As an example scenario, we could have b->unit_max = 16K and
-	 * t->io_min = 24K. For this case, reduce t->unit_max to a value
-	 * aligned with both limits, i.e. 8K in this example.
+	 * As an example scenario, we could have t->unit_max = 16K and
+	 * a chunk size of 24K. For this case, reduce t->unit_max to a
+	 * value aligned with both limits, i.e. 8K in this example.
 	 */
-	t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
-	while (t->io_min % t->atomic_write_hw_unit_max)
-		t->atomic_write_hw_unit_max /= 2;
+	t->atomic_write_hw_unit_max = min(t->atomic_write_hw_unit_max,
+					max_pow_of_two_factor(chunk_bytes));
 
-	t->atomic_write_hw_unit_min = min(b->atomic_write_hw_unit_min,
+	t->atomic_write_hw_unit_min = min(t->atomic_write_hw_unit_min,
 					  t->atomic_write_hw_unit_max);
-	t->atomic_write_hw_max = min(b->atomic_write_hw_max, t->io_min);
+	t->atomic_write_hw_max = min(t->atomic_write_hw_max, chunk_bytes);
+}
+
+/* Check stacking of first bottom device */
+static bool blk_stack_atomic_writes_head(struct queue_limits *t,
+				struct queue_limits *b)
+{
+	if (b->atomic_write_hw_boundary &&
+	    !blk_stack_atomic_writes_boundary_head(t, b))
+		return false;
 
+	t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
+	t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
+	t->atomic_write_hw_max = b->atomic_write_hw_max;
 	return true;
 }
 
@@ -656,6 +665,7 @@ static void blk_stack_atomic_writes_limits(struct queue_limits *t,
 
 	if (!blk_stack_atomic_writes_head(t, b))
 		goto unsupported;
+	blk_stack_atomic_writes_chunk_sectors(t);
 	return;
 
 unsupported:
-- 
2.43.5



* Re: [PATCH v6 2/6] block: sanitize chunk_sectors for atomic write limits
  2025-07-11  8:09 ` [PATCH v6 2/6] block: sanitize chunk_sectors for atomic write limits John Garry
@ 2025-07-11  8:42   ` Damien Le Moal
  2025-07-11  9:22     ` John Garry
  0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2025-07-11  8:42 UTC (permalink / raw)
  To: John Garry, agk, snitzer, mpatocka, song, yukuai3, hch, nilay,
	axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong

On 7/11/25 5:09 PM, John Garry wrote:
> Currently we just ensure that a non-zero value in chunk_sectors aligns
> with any atomic write boundary, as the blk boundary functionality uses
> both these values.
> 
> However it is also improper to have atomic write unit max > chunk_sectors
> (for non-zero chunk_sectors), as this would lead to splitting of atomic
> write bios (which is disallowed).
> 
> Sanitize atomic write unit max against chunk_sectors to avoid any
> potential problems.
> 
> Fixes: d00eea91deaf3 ("block: Add extra checks in blk_validate_atomic_write_limits()")
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  block/blk-settings.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index a000daafbfb48..a2c089167174e 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -180,6 +180,7 @@ static void blk_atomic_writes_update_limits(struct queue_limits *lim)
>  
>  static void blk_validate_atomic_write_limits(struct queue_limits *lim)
>  {
> +	unsigned long long chunk_bytes = lim->chunk_sectors << SECTOR_SHIFT;

Don't you need to cast "lim->chunk_sectors" to 64 bits here?

>  	unsigned int boundary_sectors;
>  
>  	if (!(lim->features & BLK_FEAT_ATOMIC_WRITES))
> @@ -202,6 +203,10 @@ static void blk_validate_atomic_write_limits(struct queue_limits *lim)
>  			 lim->atomic_write_hw_max))
>  		goto unsupported;
>  
> +	if (WARN_ON_ONCE(chunk_bytes &&
> +			lim->atomic_write_hw_unit_max > chunk_bytes))
> +		goto unsupported;
> +
>  	boundary_sectors = lim->atomic_write_hw_boundary >> SECTOR_SHIFT;
>  
>  	if (boundary_sectors) {


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size
  2025-07-11  8:09 [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size John Garry
                   ` (5 preceding siblings ...)
  2025-07-11  8:09 ` [PATCH v6 6/6] block: use chunk_sectors when evaluating stacked atomic write limits John Garry
@ 2025-07-11  8:44 ` Damien Le Moal
  2025-07-11  9:16   ` John Garry
  2025-07-14  5:53   ` Christoph Hellwig
  6 siblings, 2 replies; 17+ messages in thread
From: Damien Le Moal @ 2025-07-11  8:44 UTC (permalink / raw)
  To: John Garry, agk, snitzer, mpatocka, song, yukuai3, hch, nilay,
	axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong

On 7/11/25 5:09 PM, John Garry wrote:
> This value in io_min is used to configure any atomic write limit for the
> stacked device. The idea is that the atomic write unit max is a
> power-of-2 factor of the stripe size, and the stripe size is available
> in io_min.
> 
> Using io_min causes issues, as:
> a. it may be mutated
> b. the check for io_min being set for determining if we are dealing with
> a striped device is hard to get right, as reported in [0].
> 
> This series now sets chunk_sectors limit to share stripe size.

Hmm... chunk_sectors for a zoned device is the zone size. So is this all safe
if we are dealing with a zoned block device that also supports atomic writes ?
Not that I know of any such device, but better be safe, so maybe for now do not
enable atomic write support on zoned devices ?



-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size
  2025-07-11  8:44 ` [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size Damien Le Moal
@ 2025-07-11  9:16   ` John Garry
  2025-07-14  5:53   ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: John Garry @ 2025-07-11  9:16 UTC (permalink / raw)
  To: Damien Le Moal, agk, snitzer, mpatocka, song, yukuai3, hch, nilay,
	axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong

On 11/07/2025 09:44, Damien Le Moal wrote:
>> This series now sets chunk_sectors limit to share stripe size.
> Hmm... chunk_sectors for a zoned device is the zone size. So is this all safe
> if we are dealing with a zoned block device that also supports atomic writes ?
> Not that I know of any such device, but better be safe, so maybe for now do not
> enable atomic write support on zoned devices ?

I don't think that we need to do anything specific there.

Patch 2/6 catches if the chunk size is less than the atomic write max size.

Having said that, if a zoned device did support atomic writes then it 
would be very odd to have its atomic write max size > zone size anyway.

Thanks,
John


* Re: [PATCH v6 2/6] block: sanitize chunk_sectors for atomic write limits
  2025-07-11  8:42   ` Damien Le Moal
@ 2025-07-11  9:22     ` John Garry
  0 siblings, 0 replies; 17+ messages in thread
From: John Garry @ 2025-07-11  9:22 UTC (permalink / raw)
  To: Damien Le Moal, agk, snitzer, mpatocka, song, yukuai3, hch, nilay,
	axboe, cem
  Cc: dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong

On 11/07/2025 09:42, Damien Le Moal wrote:
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index a000daafbfb48..a2c089167174e 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -180,6 +180,7 @@ static void blk_atomic_writes_update_limits(struct queue_limits *lim)
>>   
>>   static void blk_validate_atomic_write_limits(struct queue_limits *lim)
>>   {
>> +	unsigned long long chunk_bytes = lim->chunk_sectors << SECTOR_SHIFT;
> Don't you need to cast "lim->chunk_sectors" to 64 bits here?

I thought that we automatically convert lim->chunk_sectors to unsigned 
long long, but I think that you are right...

At this point I think that it's easier to just convert
atomic_write_hw_max to sectors and do that comparison, roughly along
these lines (sketch only, not an actual patch):
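
/* sketch only; hw_max_sectors is just an illustrative local */
unsigned int hw_max_sectors = lim->atomic_write_hw_max >> SECTOR_SHIFT;

if (WARN_ON_ONCE(lim->chunk_sectors &&
		 hw_max_sectors > lim->chunk_sectors))
	goto unsupported;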

Thanks,
John



* Re: [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size
  2025-07-11  8:44 ` [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size Damien Le Moal
  2025-07-11  9:16   ` John Garry
@ 2025-07-14  5:53   ` Christoph Hellwig
  2025-07-14  6:00     ` Damien Le Moal
  2025-07-14  7:52     ` John Garry
  1 sibling, 2 replies; 17+ messages in thread
From: Christoph Hellwig @ 2025-07-14  5:53 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: John Garry, agk, snitzer, mpatocka, song, yukuai3, hch, nilay,
	axboe, cem, dm-devel, linux-kernel, linux-raid, linux-block,
	ojaswin, martin.petersen, akpm, linux-xfs, djwong

On Fri, Jul 11, 2025 at 05:44:26PM +0900, Damien Le Moal wrote:
> On 7/11/25 5:09 PM, John Garry wrote:
> > This value in io_min is used to configure any atomic write limit for the
> > stacked device. The idea is that the atomic write unit max is a
> > power-of-2 factor of the stripe size, and the stripe size is available
> > in io_min.
> > 
> > Using io_min causes issues, as:
> > a. it may be mutated
> > b. the check for io_min being set for determining if we are dealing with
> > a striped device is hard to get right, as reported in [0].
> > 
> > This series now sets chunk_sectors limit to share stripe size.
> 
> Hmm... chunk_sectors for a zoned device is the zone size. So is this all safe
> if we are dealing with a zoned block device that also supports atomic writes ?

Btw, I wonder if it's time to decouple the zone size from the chunk
size eventually.  It seems like a nice little hack, but with things
like parity raid for zoned devices now showing up at least in academia,
and nvme devices reporting chunk sizes the overload might not be that
good any more.

> Not that I know of any such device, but better be safe, so maybe for now
> do not enable atomic write support on zoned devices ?

How would atomic writes make sense for zoned devices?  Because all writes
up to the reported write pointer must be valid, the usual checks for
partial updates are lacking, so the only use would be to figure out if a
write got truncated.  At least for file systems we detect this using the
fs metadata that must be written on I/O completion anyway, so the only
user would be an application with some sort of speculative writes that
can't detect partial writes. Which sounds rather fringe and dangerous.

Now we should be able to implement the software atomic writes pretty
easily for zoned XFS, and funnily they might actually be slightly faster
than normal writes due to the transaction batching.  Now that we're
getting reasonable test coverage we should be able to give it a spin, but
I have a few too many things on my plate at the moment.


* Re: [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size
  2025-07-14  5:53   ` Christoph Hellwig
@ 2025-07-14  6:00     ` Damien Le Moal
  2025-07-14  6:13       ` Christoph Hellwig
  2025-07-14  7:52     ` John Garry
  1 sibling, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2025-07-14  6:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Garry, agk, snitzer, mpatocka, song, yukuai3, nilay, axboe,
	cem, dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong

On 2025/07/14 14:53, Christoph Hellwig wrote:
> On Fri, Jul 11, 2025 at 05:44:26PM +0900, Damien Le Moal wrote:
>> On 7/11/25 5:09 PM, John Garry wrote:
>>> This value in io_min is used to configure any atomic write limit for the
>>> stacked device. The idea is that the atomic write unit max is a
>>> power-of-2 factor of the stripe size, and the stripe size is available
>>> in io_min.
>>>
>>> Using io_min causes issues, as:
>>> a. it may be mutated
>>> b. the check for io_min being set for determining if we are dealing with
>>> a striped device is hard to get right, as reported in [0].
>>>
>>> This series now sets chunk_sectors limit to share stripe size.
>>
>> Hmm... chunk_sectors for a zoned device is the zone size. So is this all safe
>> if we are dealing with a zoned block device that also supports atomic writes ?
> 
> Btw, I wonder if it's time to decouple the zone size from the chunk
> size eventually.  It seems like a nice little hack, but with things
> like parity raid for zoned devices now showing up at least in academia,
> and nvme devices reporting chunk sizes the overload might not be that
> good any more.

Agreed, it would be nice to clean that up. BUT, the chunk_sectors sysfs
attribute file is reporting the zone size today. Changing that may break
applications. So I am not sure if we can actually do that, unless the sysfs
interface is considered as "unstable" ?

> 
>> Not that I know of any such device, but better be safe, so maybe for now
>> do not enable atomic write support on zoned devices ?
> 
> How would atomic writes make sense for zone devices?  Because all writes
> up to the reported write pointer must be valid, there usual checks for
> partial updates a lacking, so the only use would be to figure out if a
> write got truncated.  At least for file systems we detects this using the
> fs metadata that must be written on I/O completion anyway, so the only
> user would be an application with some sort of speculative writes that
> can't detect partial writes. Which sounds rather fringe and dangerous.

The only thing I can think of which would make sense is to avoid torn writes
with SAS drives. But in itself, that is extremely niche.

> 
> Now we should be able to implement the software atomic writes pretty
> easily for zoned XFS, and funnily they might actually be slightly faster
> than normal writes due to the transaction batching.  Now that we're
> getting reasonable test coverage we should be able to give it a spin, but
> I have a few too many things on my plate at the moment.


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size
  2025-07-14  6:00     ` Damien Le Moal
@ 2025-07-14  6:13       ` Christoph Hellwig
  2025-07-15 15:45         ` Hannes Reinecke
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2025-07-14  6:13 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, John Garry, agk, snitzer, mpatocka, song,
	yukuai3, nilay, axboe, cem, dm-devel, linux-kernel, linux-raid,
	linux-block, ojaswin, martin.petersen, akpm, linux-xfs, djwong

On Mon, Jul 14, 2025 at 03:00:57PM +0900, Damien Le Moal wrote:
> Agreed, it would be nice to clean that up. BUT, the chunk_sectors sysfs
> attribute file is reporting the zone size today. Changing that may break
> applications. So I am not sure if we can actually do that, unless the sysfs
> interface is considered as "unstable" ?

Good point.  I don't think it is considered unstable.



* Re: [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size
  2025-07-14  5:53   ` Christoph Hellwig
  2025-07-14  6:00     ` Damien Le Moal
@ 2025-07-14  7:52     ` John Garry
  2025-07-14 10:46       ` Christoph Hellwig
  1 sibling, 1 reply; 17+ messages in thread
From: John Garry @ 2025-07-14  7:52 UTC (permalink / raw)
  To: Christoph Hellwig, Damien Le Moal
  Cc: agk, snitzer, mpatocka, song, yukuai3, nilay, axboe, cem,
	dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong

On 14/07/2025 06:53, Christoph Hellwig wrote:
> Now we should be able to implement the software atomic writes pretty
> easily for zoned XFS, and funnily they might actually be slightly faster
> than normal writes due to the transaction batching.  Now that we're
> getting reasonable test coverage we should be able to give it a spin, but
> I have a few too many things on my plate at the moment.

Isn't reflink currently incompatible with zoned xfs?

I don't think that there is even anything else needed to automatically 
get software-based atomics support for zoned xfs.


* Re: [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size
  2025-07-14  7:52     ` John Garry
@ 2025-07-14 10:46       ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2025-07-14 10:46 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Damien Le Moal, agk, snitzer, mpatocka, song,
	yukuai3, nilay, axboe, cem, dm-devel, linux-kernel, linux-raid,
	linux-block, ojaswin, martin.petersen, akpm, linux-xfs, djwong

On Mon, Jul 14, 2025 at 08:52:39AM +0100, John Garry wrote:
> On 14/07/2025 06:53, Christoph Hellwig wrote:
>> Now we should be able to implement the software atomic writes pretty
>> easily for zoned XFS, and funnily they might actually be slightly faster
>> than normal writes due to the transaction batching.  Now that we're
>> getting reasonable test coverage we should be able to give it a spin, but
>> I have a few too many things on my plate at the moment.
>
> Isn't reflink currently incompatible with zoned xfs?

reflink itself yes due to the garbage collection algorithm that is not
reflink aware.  But all I/O on zoned file RT device uses the same I/O
path design as writes that unshare reflinks because it always has to
write out of place.



* Re: [PATCH v6 0/6] block/md/dm: set chunk_sectors from stacked dev stripe size
  2025-07-14  6:13       ` Christoph Hellwig
@ 2025-07-15 15:45         ` Hannes Reinecke
  0 siblings, 0 replies; 17+ messages in thread
From: Hannes Reinecke @ 2025-07-15 15:45 UTC (permalink / raw)
  To: Christoph Hellwig, Damien Le Moal
  Cc: John Garry, agk, snitzer, mpatocka, song, yukuai3, nilay, axboe,
	cem, dm-devel, linux-kernel, linux-raid, linux-block, ojaswin,
	martin.petersen, akpm, linux-xfs, djwong

On 7/14/25 08:13, Christoph Hellwig wrote:
> On Mon, Jul 14, 2025 at 03:00:57PM +0900, Damien Le Moal wrote:
>> Agreed, it would be nice to clean that up. BUT, the chunk_sectors sysfs
>> attribute file is reporting the zone size today. Changing that may break
>> applications. So I am not sure if we can actually do that, unless the sysfs
>> interface is considered as "unstable" ?
> 
> Good point.  I don't think it is considered unstable.
>
Hmm. It does, but really the meaning of 'chunk_sectors' (i.e. a boundary
which I/O requests may not cross) hasn't changed. And that's also
the original use-case for the mapping of zone size to chunk_sectors,
namely to ensure that the block layer generates valid I/O.
So from that standpoint I guess we can change it; in the end, there may
(and will) be setups where 'chunk_sectors' is smaller than the zone
size.
We would need to have another attribute for the zone size, though :-)
But arguably we should have that even if we don't follow the above
reasoning.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

