* [PATCH 1/8] block: fix zone write plug removal
2026-02-21 0:44 [PATCH 0/8] Improve zoned (SMR) HDD write throughput Damien Le Moal
@ 2026-02-21 0:44 ` Damien Le Moal
2026-02-23 11:56 ` Hannes Reinecke
2026-02-21 0:44 ` [PATCH 2/8] block: remove BLK_ZONE_WPLUG_UNHASHED Damien Le Moal
` (7 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2026-02-21 0:44 UTC (permalink / raw)
To: Jens Axboe, linux-block
Commit 7b295187287e ("block: Do not remove zone write plugs still in
use") modified disk_should_remove_zone_wplug() to add a check on the
reference count of a zone write plug to prevent removing zone write
plugs from a disk hash table when the plugs are still being referenced
by BIOs in-flight. However, this check does not take into account that
a BIO completion may happen right after its submission by a zone write
plug BIO work, and before the zone write plug BIO work releases the zone
write plug reference count. This situation leads to
disk_should_remove_zone_wplug() returning false. If the BIO that
completes in such a manner transitioned the zone to the FULL condition,
the zone write plug for the FULL zone will remain in the disk hash
table.
Solve this by moving the check for removing a zone write plug from its
disk hash table from disk_zone_wplug_unplug_bio() into a new function
disk_check_and_put_zone_wplug() so that we can always check if a zone
write plug should be removed when the plug references are being
released. This new function calls disk_should_remove_zone_wplug() and
disk_remove_zone_wplug() with the zone write plug lock held when the
zone write plug reference count is equal to the minimum value (2)
indicating that only the caller context is referencing the zone write
plug. This is followed by a call to disk_put_zone_wplug(), which may
cause the zone write plug to be freed if the plug was removed from the
disk hash table and the plug reference count dropped to 0.
The simpler disk_put_zone_wplug() function is used whenever we
know that we have multiple references on a zone write plug, that is,
when we know that checking for the zone removal is not necessary.
Of note is that the calls to disk_should_remove_zone_wplug() and
disk_remove_zone_wplug() are maintained in
disk_zone_wplug_set_wp_offset() because this function is called
directly while walking the disk zone write plug hash table without
taking references on zone write plugs when processing zone reset all
operations.
Fixes: 7b295187287e ("block: Do not remove zone write plugs still in use")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 52 +++++++++++++++++++++++++++++++----------------
1 file changed, 34 insertions(+), 18 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 846b8844f04a..9c38497e7258 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -655,6 +655,29 @@ static void disk_remove_zone_wplug(struct gendisk *disk,
disk_put_zone_wplug(zwplug);
}
+static inline void disk_check_and_put_zone_wplug(struct blk_zone_wplug *zwplug)
+{
+ lockdep_assert_not_held(&zwplug->lock);
+
+ /*
+ * Check for the need to remove the zone write plug from the hash list
+ * when we see that only the caller is referencing the zone write plug.
+ * Since this is racy without holding the zone write plug lock, this
+ * check is done again in disk_should_remove_zone_wplug().
+ */
+ if (refcount_read(&zwplug->ref) <= 2) {
+ struct gendisk *disk = zwplug->disk;
+ unsigned long flags;
+
+ spin_lock_irqsave(&zwplug->lock, flags);
+ if (disk_should_remove_zone_wplug(disk, zwplug))
+ disk_remove_zone_wplug(disk, zwplug);
+ spin_unlock_irqrestore(&zwplug->lock, flags);
+ }
+
+ disk_put_zone_wplug(zwplug);
+}
+
static void blk_zone_wplug_bio_work(struct work_struct *work);
/*
@@ -835,7 +858,7 @@ static unsigned int disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE)
disk_zone_wplug_set_wp_offset(disk, zwplug, wp_offset);
spin_unlock_irqrestore(&zwplug->lock, flags);
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
}
return wp_offset;
@@ -1017,14 +1040,14 @@ int blkdev_get_zone_info(struct block_device *bdev, sector_t sector,
spin_lock_irqsave(&zwplug->lock, flags);
if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE) {
spin_unlock_irqrestore(&zwplug->lock, flags);
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
return blkdev_report_zone_fallback(bdev, sector, zone);
}
zone->cond = zwplug->cond;
zone->wp = sector + zwplug->wp_offset;
spin_unlock_irqrestore(&zwplug->lock, flags);
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
return 0;
}
@@ -1107,7 +1130,7 @@ static void blk_zone_reset_bio_endio(struct bio *bio)
spin_lock_irqsave(&zwplug->lock, flags);
disk_zone_wplug_set_wp_offset(disk, zwplug, 0);
spin_unlock_irqrestore(&zwplug->lock, flags);
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
} else {
disk_zone_set_cond(disk, sector, BLK_ZONE_COND_EMPTY);
}
@@ -1165,7 +1188,7 @@ static void blk_zone_finish_bio_endio(struct bio *bio)
disk_zone_wplug_set_wp_offset(disk, zwplug,
bdev_zone_sectors(bdev));
spin_unlock_irqrestore(&zwplug->lock, flags);
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
} else {
disk_zone_set_cond(disk, sector, BLK_ZONE_COND_FULL);
}
@@ -1530,7 +1553,7 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
disk_remove_zone_wplug(disk, zwplug);
spin_unlock_irqrestore(&zwplug->lock, flags);
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
}
static bool blk_zone_wplug_handle_zone_mgmt(struct bio *bio)
@@ -1631,13 +1654,6 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
- /*
- * If the zone is full (it was fully written or finished, or empty
- * (it was reset), remove its zone write plug from the hash table.
- */
- if (disk_should_remove_zone_wplug(disk, zwplug))
- disk_remove_zone_wplug(disk, zwplug);
-
spin_unlock_irqrestore(&zwplug->lock, flags);
}
@@ -1701,7 +1717,7 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
disk_zone_wplug_unplug_bio(disk, zwplug);
/* Drop the reference we took when entering this function. */
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
}
void blk_zone_write_plug_finish_request(struct request *req)
@@ -1724,7 +1740,7 @@ void blk_zone_write_plug_finish_request(struct request *req)
disk_zone_wplug_unplug_bio(disk, zwplug);
/* Drop the reference we took when entering this function. */
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
}
static void blk_zone_wplug_bio_work(struct work_struct *work)
@@ -1777,7 +1793,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
put_zwplug:
/* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
}
void disk_init_zone_resources(struct gendisk *disk)
@@ -1850,7 +1866,7 @@ static void disk_destroy_zone_wplugs_hash_table(struct gendisk *disk)
struct blk_zone_wplug, node);
refcount_inc(&zwplug->ref);
disk_remove_zone_wplug(disk, zwplug);
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
}
}
@@ -2112,7 +2128,7 @@ static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx,
if (!zwplug)
return -ENOMEM;
spin_unlock_irqrestore(&zwplug->lock, flags);
- disk_put_zone_wplug(zwplug);
+ disk_check_and_put_zone_wplug(zwplug);
return 0;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH 1/8] block: fix zone write plug removal
2026-02-21 0:44 ` [PATCH 1/8] block: fix zone write plug removal Damien Le Moal
@ 2026-02-23 11:56 ` Hannes Reinecke
2026-02-23 19:30 ` Bart Van Assche
0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2026-02-23 11:56 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block
On 2/21/26 01:44, Damien Le Moal wrote:
> Commit 7b295187287e ("block: Do not remove zone write plugs still in
> use") modified disk_should_remove_zone_wplug() to add a check on the
> reference count of a zone write plug to prevent removing zone write
> plugs from a disk hash table when the plugs are still being referenced
> by BIOs in-flight. However, this check does not take into account that
> a BIO completion may happen right after its submission by a zone write
> plug BIO work, and before the zone write plug BIO work release the zone
> write plug reference count. This situation leads to
> disk_should_remove_zone_wplug() returning false. If the BIO that
> completes in such manner transitioned the zone to the FULL condition,
> the zone write plug for the FULL zone will remain in the disk hash
> table.
>
> Solve this by moving the check for removing a zone write plug from its
> disk hash table from disk_zone_wplug_unplug_bio() into a new function
> disk_check_and_put_zone_wplug() so that we can always check if a zone
> write plug should be removed when the plug references are being
> released. This new function calls disk_should_remove_zone_wplug() and
> disk_remove_zone_wplug() with the zone write plug lock held when the
> zone write plug reference count is equal to the minimum value (2)
> indicating that only the caller context is referencing the zone write
> plug. This is followed with a call to disk_put_zone_wplug() which may
> cause the zone write plug to be freed if the plug was removed from the
> disk hash table and the plug reference count dropped to 0.
>
> The simpler disk_put_zone_write_plug() function is used whenever we
> know that we have multiple references on a zone write plug, that is,
> when we know that checking for the zone removal is not necessary.
>
> Of note is that the calls to disk_should_remove_zone_wplug() and
> disk_remove_zone_zone_wplug() are maintained in
> disk_zone_wplug_set_wp_offset() because this function is called
> directly while walking the disk zone write plug hash table without
> taking references on zone write plugs when processing zone reset all
> operations.
>
> Fixes: 7b295187287e ("block: Do not remove zone write plugs still in use")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 52 +++++++++++++++++++++++++++++++----------------
> 1 file changed, 34 insertions(+), 18 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 846b8844f04a..9c38497e7258 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -655,6 +655,29 @@ static void disk_remove_zone_wplug(struct gendisk *disk,
> disk_put_zone_wplug(zwplug);
> }
>
> +static inline void disk_check_and_put_zone_wplug(struct blk_zone_wplug *zwplug)
> +{
> + lockdep_assert_not_held(&zwplug->lock);
> +
> + /*
> + * Check for the need to remove the zone write plug from the hash list
> + * when we see that only the caller is referencing the zone write plug.
> + * Since this is racy without holding the zone write plug lock, this
> + * check is done again in disk_should_remove_zone_wplug().
> + */
> + if (refcount_read(&zwplug->ref) <= 2) {
> + struct gendisk *disk = zwplug->disk;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&zwplug->lock, flags);
> + if (disk_should_remove_zone_wplug(disk, zwplug))
> + disk_remove_zone_wplug(disk, zwplug);
> + spin_unlock_irqrestore(&zwplug->lock, flags);
> + }
> +
> + disk_put_zone_wplug(zwplug);
> +}
> +
> static void blk_zone_wplug_bio_work(struct work_struct *work);
>
> /*
This looks slightly odd; a simple 'refcount_read()' always triggers alarm
bells with me as it inevitably races with 'refcount_dec()' /
'refcount_inc()'.
And similarly, the check 'refcount <= 2' also looks odd; typically one
would check against '< 1' and let the refcount destructor handle
things. But anyway, that's how the code does it, so:
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 1/8] block: fix zone write plug removal
2026-02-23 11:56 ` Hannes Reinecke
@ 2026-02-23 19:30 ` Bart Van Assche
2026-02-23 20:21 ` Bart Van Assche
0 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2026-02-23 19:30 UTC (permalink / raw)
To: Hannes Reinecke, Damien Le Moal, Jens Axboe, linux-block
On 2/23/26 3:56 AM, Hannes Reinecke wrote:
> On 2/21/26 01:44, Damien Le Moal wrote:
>> +static inline void disk_check_and_put_zone_wplug(struct
>> blk_zone_wplug *zwplug)
>> +{
>> + lockdep_assert_not_held(&zwplug->lock);
>> +
>> + /*
>> + * Check for the need to remove the zone write plug from the hash
>> list
>> + * when we see that only the caller is referencing the zone write
>> plug.
>> + * Since this is racy without holding the zone write plug lock, this
>> + * check is done again in disk_should_remove_zone_wplug().
>> + */
>> + if (refcount_read(&zwplug->ref) <= 2) {
>> + struct gendisk *disk = zwplug->disk;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&zwplug->lock, flags);
>> + if (disk_should_remove_zone_wplug(disk, zwplug))
>> + disk_remove_zone_wplug(disk, zwplug);
>> + spin_unlock_irqrestore(&zwplug->lock, flags);
>> + }
>> +
>> + disk_put_zone_wplug(zwplug);
>> +}
>> +
>> static void blk_zone_wplug_bio_work(struct work_struct *work);
>> /*
>
> This looks slightly odd; a simple 'refcount_read()' always trigger alarm
> bells with me as it inevitably races with 'refcount_dec()' /
> 'refcount_inc()'.
> And similar the check 'refcount <= 2' also looks odd; typically one
> would check against '< 1' and let the refcount destructor handle
> things.
Hmm ... I think that the above code is racy. Here is an example:
thread 1                            thread 2
---------------------------------   ---------------------------------
refcount_read() returns 3           refcount_read() returns 3

disk_should_remove_zone_wplug()     disk_should_remove_zone_wplug()
is not called                       is not called

disk_put_zone_wplug() decreases     disk_put_zone_wplug() decreases
zwplug->ref from 3 to 2             zwplug->ref from 2 to 1
Result: zwplug->ref became less than 2 without
disk_should_remove_zone_wplug() having been called.
Thanks,
Bart.
* Re: [PATCH 1/8] block: fix zone write plug removal
2026-02-23 19:30 ` Bart Van Assche
@ 2026-02-23 20:21 ` Bart Van Assche
2026-02-24 1:57 ` Damien Le Moal
0 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2026-02-23 20:21 UTC (permalink / raw)
To: Hannes Reinecke, Damien Le Moal, Jens Axboe, linux-block
On 2/23/26 11:30 AM, Bart Van Assche wrote:
> On 2/23/26 3:56 AM, Hannes Reinecke wrote:
>> On 2/21/26 01:44, Damien Le Moal wrote:
>>> +static inline void disk_check_and_put_zone_wplug(struct
>>> blk_zone_wplug *zwplug)
>>> +{
>>> + lockdep_assert_not_held(&zwplug->lock);
>>> +
>>> + /*
>>> + * Check for the need to remove the zone write plug from the
>>> hash list
>>> + * when we see that only the caller is referencing the zone
>>> write plug.
>>> + * Since this is racy without holding the zone write plug lock,
>>> this
>>> + * check is done again in disk_should_remove_zone_wplug().
>>> + */
>>> + if (refcount_read(&zwplug->ref) <= 2) {
>>> + struct gendisk *disk = zwplug->disk;
>>> + unsigned long flags;
>>> +
>>> + spin_lock_irqsave(&zwplug->lock, flags);
>>> + if (disk_should_remove_zone_wplug(disk, zwplug))
>>> + disk_remove_zone_wplug(disk, zwplug);
>>> + spin_unlock_irqrestore(&zwplug->lock, flags);
>>> + }
>>> +
>>> + disk_put_zone_wplug(zwplug);
>>> +}
>>> +
>>> static void blk_zone_wplug_bio_work(struct work_struct *work);
>>> /*
>>
>> This looks slightly odd; a simple 'refcount_read()' always trigger alarm
>> bells with me as it inevitably races with 'refcount_dec()' /
>> 'refcount_inc()'.
>> And similar the check 'refcount <= 2' also looks odd; typically one
>> would check against '< 1' and let the refcount destructor handle
>> things.
> Hmm ... I think that the above code is racy. Here is an example:
>
> thread 1                            thread 2
> ---------------------------------   ---------------------------------
> refcount_read() returns 3           refcount_read() returns 3
>
> disk_should_remove_zone_wplug()     disk_should_remove_zone_wplug()
> is not called                       is not called
>
> disk_put_zone_wplug() decreases     disk_put_zone_wplug() decreases
> zwplug->ref from 3 to 2             zwplug->ref from 2 to 1
>
>
> Result: zwplug->ref became less than 2 without
> disk_should_remove_zone_wplug() having been called.
A correction to the above: three threads are required before it becomes
possible for the zwplug reference count to drop from 3 to zero without
disk_should_remove_zone_wplug() having been called.
Thanks,
Bart.
* Re: [PATCH 1/8] block: fix zone write plug removal
2026-02-23 20:21 ` Bart Van Assche
@ 2026-02-24 1:57 ` Damien Le Moal
0 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2026-02-24 1:57 UTC (permalink / raw)
To: Bart Van Assche, Hannes Reinecke, Jens Axboe, linux-block
On 2/24/26 5:21 AM, Bart Van Assche wrote:
> On 2/23/26 11:30 AM, Bart Van Assche wrote:
>> On 2/23/26 3:56 AM, Hannes Reinecke wrote:
>>> On 2/21/26 01:44, Damien Le Moal wrote:
>>>> +static inline void disk_check_and_put_zone_wplug(struct blk_zone_wplug
>>>> *zwplug)
>>>> +{
>>>> + lockdep_assert_not_held(&zwplug->lock);
>>>> +
>>>> + /*
>>>> + * Check for the need to remove the zone write plug from the hash list
>>>> + * when we see that only the caller is referencing the zone write plug.
>>>> + * Since this is racy without holding the zone write plug lock, this
>>>> + * check is done again in disk_should_remove_zone_wplug().
>>>> + */
>>>> + if (refcount_read(&zwplug->ref) <= 2) {
>>>> + struct gendisk *disk = zwplug->disk;
>>>> + unsigned long flags;
>>>> +
>>>> + spin_lock_irqsave(&zwplug->lock, flags);
>>>> + if (disk_should_remove_zone_wplug(disk, zwplug))
>>>> + disk_remove_zone_wplug(disk, zwplug);
>>>> + spin_unlock_irqrestore(&zwplug->lock, flags);
>>>> + }
>>>> +
>>>> + disk_put_zone_wplug(zwplug);
>>>> +}
>>>> +
>>>> static void blk_zone_wplug_bio_work(struct work_struct *work);
>>>> /*
>>>
>>> This looks slightly odd; a simple 'refcount_read()' always trigger alarm
>>> bells with me as it inevitably races with 'refcount_dec()' /
>>> 'refcount_inc()'.
>>> And similar the check 'refcount <= 2' also looks odd; typically one
>>> would check against '< 1' and let the refcount destructor handle
>>> things.
>> Hmm ... I think that the above code is racy. Here is an example:
>>
>> thread 1                            thread 2
>> ---------------------------------   ---------------------------------
>> refcount_read() returns 3           refcount_read() returns 3
>>
>> disk_should_remove_zone_wplug()     disk_should_remove_zone_wplug()
>> is not called                       is not called
>>
>> disk_put_zone_wplug() decreases     disk_put_zone_wplug() decreases
>> zwplug->ref from 3 to 2             zwplug->ref from 2 to 1
>>
>>
>> Result: zwplug->ref became less than 2 without
>> disk_should_remove_zone_wplug() having been called.
>
> A correction to the above: three threads are required before it becomes
> possible for the zwplug reference count to drop from 3 to zero without
> disk_should_remove_zone_wplug() having been called.
I do not see how your example is possible... If the refcount reads are from
the BIO work context, we only have one such work per zone, so there is no
such concurrency.
The additional contexts can be from zone reset or zone finish, but these
unconditionally call disk_should_remove_zone_wplug(), so there is no such race.
>
> Thanks,
>
> Bart.
--
Damien Le Moal
Western Digital Research
* [PATCH 2/8] block: remove BLK_ZONE_WPLUG_UNHASHED
2026-02-21 0:44 [PATCH 0/8] Improve zoned (SMR) HDD write throughput Damien Le Moal
2026-02-21 0:44 ` [PATCH 1/8] block: fix zone write plug removal Damien Le Moal
@ 2026-02-21 0:44 ` Damien Le Moal
2026-02-23 11:48 ` Hannes Reinecke
2026-02-21 0:44 ` [PATCH 3/8] block: remove disk_zone_is_full() Damien Le Moal
` (6 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2026-02-21 0:44 UTC (permalink / raw)
To: Jens Axboe, linux-block
Detecting that a zone write plug has been removed from a disk hash table
can be done using hlist_unhashed(), so there is no need for the zone
write plug flag BLK_ZONE_WPLUG_UNHASHED. Remove this flag and convert
all tests using it to use hlist_unhashed().
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 30 ++++++++++--------------------
1 file changed, 10 insertions(+), 20 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 9c38497e7258..3dd50b2ba2bb 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -99,17 +99,9 @@ static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
* being executed or the zone write plug bio list is not empty.
* - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
* write pointer offset and need to update it.
- * - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
- * from the disk hash table and that the initial reference to the zone
- * write plug set when the plug was first added to the hash table has been
- * dropped. This flag is set when a zone is reset, finished or become full,
- * to prevent new references to the zone write plug to be taken for
- * newly incoming BIOs. A zone write plug flagged with this flag will be
- * freed once all remaining references from BIOs or functions are dropped.
*/
#define BLK_ZONE_WPLUG_PLUGGED (1U << 0)
#define BLK_ZONE_WPLUG_NEED_WP_UPDATE (1U << 1)
-#define BLK_ZONE_WPLUG_UNHASHED (1U << 2)
/**
* blk_zone_cond_str - Return a zone condition name string
@@ -592,7 +584,7 @@ static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
if (refcount_dec_and_test(&zwplug->ref)) {
WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
WARN_ON_ONCE(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
- WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_UNHASHED));
+ WARN_ON_ONCE(!hlist_unhashed_lockless(&zwplug->node));
call_rcu(&zwplug->rcu_head, disk_free_zone_wplug_rcu);
}
@@ -603,14 +595,14 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
{
lockdep_assert_held(&zwplug->lock);
- /* If the zone write plug was already removed, we are done. */
- if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
- return false;
-
/* If the zone write plug is still plugged, it cannot be removed. */
if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
return false;
+ /* If the zone write plug was already removed, we have nothing to do. */
+ if (hlist_unhashed(&zwplug->node))
+ return false;
+
/*
* Completions of BIOs with blk_zone_write_plug_bio_endio() may
* happen after handling a request completion with
@@ -635,16 +627,14 @@ static void disk_remove_zone_wplug(struct gendisk *disk,
unsigned long flags;
/* If the zone write plug was already removed, we have nothing to do. */
- if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
+ if (hlist_unhashed(&zwplug->node))
return;
/*
- * Mark the zone write plug as unhashed and drop the extra reference we
- * took when the plug was inserted in the hash table. Also update the
- * disk zone condition array with the current condition of the zone
- * write plug.
+ * Update the disk zone condition array with the current condition of
+ * the zone write plug and drop the extra reference we took when the
+ * plug was inserted in the hash table.
*/
- zwplug->flags |= BLK_ZONE_WPLUG_UNHASHED;
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
blk_zone_set_cond(rcu_dereference_check(disk->zones_cond,
lockdep_is_held(&disk->zone_wplugs_lock)),
@@ -702,7 +692,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
* we need to get a new plug so start over from the beginning.
*/
spin_lock_irqsave(&zwplug->lock, *flags);
- if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED) {
+ if (hlist_unhashed(&zwplug->node)) {
spin_unlock_irqrestore(&zwplug->lock, *flags);
disk_put_zone_wplug(zwplug);
goto again;
--
2.53.0
* Re: [PATCH 2/8] block: remove BLK_ZONE_WPLUG_UNHASHED
2026-02-21 0:44 ` [PATCH 2/8] block: remove BLK_ZONE_WPLUG_UNHASHED Damien Le Moal
@ 2026-02-23 11:48 ` Hannes Reinecke
2026-02-24 2:04 ` Damien Le Moal
0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2026-02-23 11:48 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block
On 2/21/26 01:44, Damien Le Moal wrote:
> Detecting that a zone write plug has been removed from a disk hash table
> can be done using hlist_unhashed(), so there is no need for the zone
> write plug flag BLK_ZONE_WPLUG_UNHASHED. Remove this flag and convert
> all tests using it to use hlist_unhashed().
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 30 ++++++++++--------------------
> 1 file changed, 10 insertions(+), 20 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 9c38497e7258..3dd50b2ba2bb 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -99,17 +99,9 @@ static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
> * being executed or the zone write plug bio list is not empty.
> * - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
> * write pointer offset and need to update it.
> - * - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
> - * from the disk hash table and that the initial reference to the zone
> - * write plug set when the plug was first added to the hash table has been
> - * dropped. This flag is set when a zone is reset, finished or become full,
> - * to prevent new references to the zone write plug to be taken for
> - * newly incoming BIOs. A zone write plug flagged with this flag will be
> - * freed once all remaining references from BIOs or functions are dropped.
> */
> #define BLK_ZONE_WPLUG_PLUGGED (1U << 0)
> #define BLK_ZONE_WPLUG_NEED_WP_UPDATE (1U << 1)
> -#define BLK_ZONE_WPLUG_UNHASHED (1U << 2)
>
> /**
> * blk_zone_cond_str - Return a zone condition name string
> @@ -592,7 +584,7 @@ static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
> if (refcount_dec_and_test(&zwplug->ref)) {
> WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
> WARN_ON_ONCE(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
> - WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_UNHASHED));
> + WARN_ON_ONCE(!hlist_unhashed_lockless(&zwplug->node));
>
> call_rcu(&zwplug->rcu_head, disk_free_zone_wplug_rcu);
> }
> @@ -603,14 +595,14 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
> {
> lockdep_assert_held(&zwplug->lock);
>
> - /* If the zone write plug was already removed, we are done. */
> - if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
> - return false;
> -
> /* If the zone write plug is still plugged, it cannot be removed. */
> if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
> return false;
>
> + /* If the zone write plug was already removed, we have nothing to do. */
> + if (hlist_unhashed(&zwplug->node))
> + return false;
> +
> /*
> * Completions of BIOs with blk_zone_write_plug_bio_endio() may
> * happen after handling a request completion with
The order of the checks has changed here; not sure if that matters,
but as it stands it's not a 1:1 conversion.
Maybe move this check up.
> @@ -635,16 +627,14 @@ static void disk_remove_zone_wplug(struct gendisk *disk,
> unsigned long flags;
>
> /* If the zone write plug was already removed, we have nothing to do. */
> - if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
> + if (hlist_unhashed(&zwplug->node))
> return;
>
> /*
> - * Mark the zone write plug as unhashed and drop the extra reference we
> - * took when the plug was inserted in the hash table. Also update the
> - * disk zone condition array with the current condition of the zone
> - * write plug.
> + * Update the disk zone condition array with the current condition of
> + * the zone write plug and drop the extra reference we took when the
> + * plug was inserted in the hash table.
> */
> - zwplug->flags |= BLK_ZONE_WPLUG_UNHASHED;
> spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> blk_zone_set_cond(rcu_dereference_check(disk->zones_cond,
> lockdep_is_held(&disk->zone_wplugs_lock)),
Especially as the checks here are in the same order.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 2/8] block: remove BLK_ZONE_WPLUG_UNHASHED
2026-02-23 11:48 ` Hannes Reinecke
@ 2026-02-24 2:04 ` Damien Le Moal
0 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2026-02-24 2:04 UTC (permalink / raw)
To: Hannes Reinecke, Jens Axboe, linux-block
On 2/23/26 8:48 PM, Hannes Reinecke wrote:
>> @@ -603,14 +595,14 @@ static inline bool disk_should_remove_zone_wplug(struct
>> gendisk *disk,
>> {
>> lockdep_assert_held(&zwplug->lock);
>> - /* If the zone write plug was already removed, we are done. */
>> - if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
>> - return false;
>> -
>> /* If the zone write plug is still plugged, it cannot be removed. */
>> if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
>> return false;
>> + /* If the zone write plug was already removed, we have nothing to do. */
>> + if (hlist_unhashed(&zwplug->node))
>> + return false;
>> +
>> /*
>> * Completions of BIOs with blk_zone_write_plug_bio_endio() may
>> * happen after handling a request completion with
> The order of the checks have changed here; not sure if that matters,
> but as it stands it's not a 1:1 conversion.
> Maybe move this check up.
Will do.
--
Damien Le Moal
Western Digital Research
* [PATCH 3/8] block: remove disk_zone_is_full()
2026-02-21 0:44 [PATCH 0/8] Improve zoned (SMR) HDD write throughput Damien Le Moal
2026-02-21 0:44 ` [PATCH 1/8] block: fix zone write plug removal Damien Le Moal
2026-02-21 0:44 ` [PATCH 2/8] block: remove BLK_ZONE_WPLUG_UNHASHED Damien Le Moal
@ 2026-02-21 0:44 ` Damien Le Moal
2026-02-23 11:56 ` Hannes Reinecke
2026-02-24 13:15 ` Johannes Thumshirn
2026-02-21 0:44 ` [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work() Damien Le Moal
` (5 subsequent siblings)
8 siblings, 2 replies; 31+ messages in thread
From: Damien Le Moal @ 2026-02-21 0:44 UTC (permalink / raw)
To: Jens Axboe, linux-block
The helper function disk_zone_is_full() is only used in
disk_zone_wplug_is_full(). So remove it and open code it directly in
this single caller.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 3dd50b2ba2bb..c5c91f927a71 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -484,18 +484,12 @@ static bool disk_zone_is_last(struct gendisk *disk, struct blk_zone *zone)
return zone->start + zone->len >= get_capacity(disk);
}
-static bool disk_zone_is_full(struct gendisk *disk,
- unsigned int zno, unsigned int offset_in_zone)
-{
- if (zno < disk->nr_zones - 1)
- return offset_in_zone >= disk->zone_capacity;
- return offset_in_zone >= disk->last_zone_capacity;
-}
-
static bool disk_zone_wplug_is_full(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
- return disk_zone_is_full(disk, zwplug->zone_no, zwplug->wp_offset);
+ if (zwplug->zone_no < disk->nr_zones - 1)
+ return zwplug->wp_offset >= disk->zone_capacity;
+ return zwplug->wp_offset >= disk->last_zone_capacity;
}
static bool disk_insert_zone_wplug(struct gendisk *disk,
--
2.53.0
* Re: [PATCH 3/8] block: remove disk_zone_is_full()
2026-02-21 0:44 ` [PATCH 3/8] block: remove disk_zone_is_full() Damien Le Moal
@ 2026-02-23 11:56 ` Hannes Reinecke
2026-02-24 13:15 ` Johannes Thumshirn
1 sibling, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2026-02-23 11:56 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block
On 2/21/26 01:44, Damien Le Moal wrote:
> The helper function disk_zone_is_full() is only used in
> disk_zone_wplug_is_full(). So remove it and open code it directly in
> this single caller.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 12 +++---------
> 1 file changed, 3 insertions(+), 9 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 3/8] block: remove disk_zone_is_full()
2026-02-21 0:44 ` [PATCH 3/8] block: remove disk_zone_is_full() Damien Le Moal
2026-02-23 11:56 ` Hannes Reinecke
@ 2026-02-24 13:15 ` Johannes Thumshirn
1 sibling, 0 replies; 31+ messages in thread
From: Johannes Thumshirn @ 2026-02-24 13:15 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block@vger.kernel.org
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work()
2026-02-21 0:44 [PATCH 0/8] Improve zoned (SMR) HDD write throughput Damien Le Moal
` (2 preceding siblings ...)
2026-02-21 0:44 ` [PATCH 3/8] block: remove disk_zone_is_full() Damien Le Moal
@ 2026-02-21 0:44 ` Damien Le Moal
2026-02-23 11:59 ` Hannes Reinecke
2026-02-24 13:18 ` Johannes Thumshirn
2026-02-21 0:44 ` [PATCH 5/8] block: rename struct gendisk zone_wplugs_lock field Damien Le Moal
` (4 subsequent siblings)
8 siblings, 2 replies; 31+ messages in thread
From: Damien Le Moal @ 2026-02-21 0:44 UTC (permalink / raw)
To: Jens Axboe, linux-block
The function disk_zone_wplug_schedule_bio_work() always takes a
reference on the zone write plug of the BIO work being scheduled. This
ensures that the zone write plug cannot be freed while the BIO work is
being scheduled but has not run yet. However, this unconditional
reference taking is fragile since the reference taken is released by the
BIO work function blk_zone_wplug_bio_work(), which implies that there
must always be a 1:1 relation between the work being scheduled and the
work running.
Make this less fragile by taking the reference on the zone write plug if
and only if the BIO work has not already been scheduled.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index c5c91f927a71..4b388ae1acaa 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1205,13 +1205,15 @@ static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
lockdep_assert_held(&zwplug->lock);
/*
- * Take a reference on the zone write plug and schedule the submission
- * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
- * reference we take here.
+ * Schedule the submission of the next plugged BIO. If the zone write
+ * plug BIO work is not already scheduled, take a reference on the zone
+ * write plug to ensure that it does not go away while the work is being
+ * scheduled but has not run yet. blk_zone_wplug_bio_work() will release
+ * the reference we take here.
*/
WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
- refcount_inc(&zwplug->ref);
- queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
+ if (queue_work(disk->zone_wplugs_wq, &zwplug->bio_work))
+ refcount_inc(&zwplug->ref);
}
static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
--
2.53.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work()
2026-02-21 0:44 ` [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work() Damien Le Moal
@ 2026-02-23 11:59 ` Hannes Reinecke
2026-02-23 18:56 ` Bart Van Assche
2026-02-24 2:03 ` Damien Le Moal
2026-02-24 13:18 ` Johannes Thumshirn
1 sibling, 2 replies; 31+ messages in thread
From: Hannes Reinecke @ 2026-02-23 11:59 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block
On 2/21/26 01:44, Damien Le Moal wrote:
> The function disk_zone_wplug_schedule_bio_work() always takes a
> reference on the zone write plug of the BIO work being scheduled. This
> ensure that the zone write plug cannot be freed while the BIO work is
> being scheduled but has not run yet. However, this unconditional
> reference taking is fragile since the reference taken is realized by the
> BIO work blk_zone_wplug_bio_work() function, which implies that there
> always must be a 1:1 relation between the work being scheduled and the
> work running.
>
> Make this less fragile by taking the reference on the BIO work if and
> only if the BIO work has not been already scheduled.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index c5c91f927a71..4b388ae1acaa 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -1205,13 +1205,15 @@ static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
> lockdep_assert_held(&zwplug->lock);
>
> /*
> - * Take a reference on the zone write plug and schedule the submission
> - * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
> - * reference we take here.
> + * Schedule the submission of the next plugged BIO. If the zone write
> + * plug BIO work is not already scheduled, take a reference on the zone
> + * write plug to ensure that it does not go away while the work is being
> + * scheduled but has not run yet. blk_zone_wplug_bio_work() will release
> + * the reference we take here.
> */
> WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
> - refcount_inc(&zwplug->ref);
> - queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
> + if (queue_work(disk->zone_wplugs_wq, &zwplug->bio_work))
> + refcount_inc(&zwplug->ref);
> }
>
> static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
Urgh. Don't.
There is a race condition; 'bio_work' might be scheduled directly and
complete before 'refcount_inc()' is called.
You have to invert the statement by keeping the 'refcount_inc()'
before calling 'queue_work()', and then drop the reference again
if queue_work failed.
Ugly, I know.
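The inverted pattern described above can be modeled in plain userspace C. All names are illustrative stand-ins for the kernel API, not real kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace model of the suggested inversion: take the zone write plug
 * reference unconditionally before queue_work(), then drop it again if
 * the work item was already pending. This is a sketch only; the names
 * mirror the patch but none of this is real kernel code.
 */
struct zwplug_model {
	atomic_int ref;
	bool work_pending;	/* stands in for WORK_STRUCT_PENDING */
};

/* Models queue_work(): returns false if the work was already queued. */
static bool model_queue_work(struct zwplug_model *z)
{
	if (z->work_pending)
		return false;
	z->work_pending = true;
	return true;
}

static int model_schedule_bio_work(struct zwplug_model *z)
{
	atomic_fetch_add(&z->ref, 1);		/* ref taken first, no window */
	if (!model_queue_work(z))
		atomic_fetch_sub(&z->ref, 1);	/* already queued: undo */
	return atomic_load(&z->ref);
}
```

With this ordering there is no window in which a queued work could run and drop a reference that the scheduler has not taken yet.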
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work()
2026-02-23 11:59 ` Hannes Reinecke
@ 2026-02-23 18:56 ` Bart Van Assche
2026-02-24 2:03 ` Damien Le Moal
1 sibling, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2026-02-23 18:56 UTC (permalink / raw)
To: Hannes Reinecke, Damien Le Moal, Jens Axboe, linux-block
On 2/23/26 3:59 AM, Hannes Reinecke wrote:
> On 2/21/26 01:44, Damien Le Moal wrote:
>> WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
>> - refcount_inc(&zwplug->ref);
>> - queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
>> + if (queue_work(disk->zone_wplugs_wq, &zwplug->bio_work))
>> + refcount_inc(&zwplug->ref);
>> }
>> static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
>
> Urgh. Don't.
> There is a race condition; 'bio_work' might be scheduled directly and
> complete before 'refcount_inc()' is called.
> You have to invert the statement by keeping the 'refcount_inc()'
> before calling 'queue_work()', and then drop the reference again
> if queue_work failed.
Hi Hannes,
I think that the above code is fine because the caller holds
zwplug->lock. blk_zone_wplug_bio_work() only decrements zwplug->ref
after it has locked and unlocked the same lock.
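This locking argument can be modeled in userspace, with a pthread mutex standing in for zwplug->lock and a thread standing in for the queued work. All names are illustrative:

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>

/*
 * Userspace model of the argument above: the scheduling context holds
 * the plug lock across queue_work() and refcount_inc(), and the work
 * function takes the same lock before it ever drops its reference, so
 * the increment always lands before the matching decrement.
 */
static pthread_mutex_t plug_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int plug_ref = 2;	/* 2 is the minimum refcount per patch 1 */
static atomic_int min_ref_seen = INT_MAX;

static void *model_bio_work(void *arg)
{
	(void)arg;
	/* The work serializes on the plug lock before touching the ref. */
	pthread_mutex_lock(&plug_lock);
	pthread_mutex_unlock(&plug_lock);
	int v = atomic_fetch_sub(&plug_ref, 1) - 1;
	if (v < atomic_load(&min_ref_seen))
		atomic_store(&min_ref_seen, v);
	return NULL;
}

/* Schedule the work and take the reference, all under the plug lock. */
static int model_schedule(void)
{
	pthread_t worker;

	pthread_mutex_lock(&plug_lock);
	pthread_create(&worker, NULL, model_bio_work, NULL); /* queue_work() */
	atomic_fetch_add(&plug_ref, 1);	/* inc after queueing: still safe */
	pthread_mutex_unlock(&plug_lock);
	pthread_join(worker, NULL);
	return atomic_load(&plug_ref);
}
```

The worker cannot decrement before the scheduler's increment because it blocks on the lock first, so the reference count never dips below its starting value.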
Bart.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work()
2026-02-23 11:59 ` Hannes Reinecke
2026-02-23 18:56 ` Bart Van Assche
@ 2026-02-24 2:03 ` Damien Le Moal
2026-02-24 15:00 ` Hannes Reinecke
1 sibling, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2026-02-24 2:03 UTC (permalink / raw)
To: Hannes Reinecke, Jens Axboe, linux-block
On 2/23/26 8:59 PM, Hannes Reinecke wrote:
> On 2/21/26 01:44, Damien Le Moal wrote:
>> The function disk_zone_wplug_schedule_bio_work() always takes a
>> reference on the zone write plug of the BIO work being scheduled. This
>> ensure that the zone write plug cannot be freed while the BIO work is
>> being scheduled but has not run yet. However, this unconditional
>> reference taking is fragile since the reference taken is realized by the
>> BIO work blk_zone_wplug_bio_work() function, which implies that there
>> always must be a 1:1 relation between the work being scheduled and the
>> work running.
>>
>> Make this less fragile by taking the reference on the BIO work if and
>> only if the BIO work has not been already scheduled.
>>
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>> block/blk-zoned.c | 12 +++++++-----
>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index c5c91f927a71..4b388ae1acaa 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -1205,13 +1205,15 @@ static void disk_zone_wplug_schedule_bio_work(struct
>> gendisk *disk,
>> lockdep_assert_held(&zwplug->lock);
>> /*
>> - * Take a reference on the zone write plug and schedule the submission
>> - * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
>> - * reference we take here.
>> + * Schedule the submission of the next plugged BIO. If the zone write
>> + * plug BIO work is not already scheduled, take a reference on the zone
>> + * write plug to ensure that it does not go away while the work is being
>> + * scheduled but has not run yet. blk_zone_wplug_bio_work() will release
>> + * the reference we take here.
>> */
>> WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
>> - refcount_inc(&zwplug->ref);
>> - queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
>> + if (queue_work(disk->zone_wplugs_wq, &zwplug->bio_work))
>> + refcount_inc(&zwplug->ref);
>> }
>> static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
>
> Urgh. Don't.
> There is a race condition; 'bio_work' might be scheduled directly and
> complete before 'refcount_inc()' is called.
> You have to invert the statement by keeping the 'refcount_inc()'
> before calling 'queue_work()', and then drop the reference again
> if queue_work failed.
> Ugly, I know.
As Bart pointed out, since this is done with the zone write plug spinlock held
and the work takes that lock first when it runs, there is no chance that the
refcount gets bad. That is indeed not super clear though, so I will improve the
comment to describe that.
Note: the zone write plug refcounting is a bit of a mess and needs cleaning up,
which I was planning to do on top of all this.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work()
2026-02-24 2:03 ` Damien Le Moal
@ 2026-02-24 15:00 ` Hannes Reinecke
2026-02-24 15:08 ` Christoph Hellwig
0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2026-02-24 15:00 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block
On 2/24/26 03:03, Damien Le Moal wrote:
> On 2/23/26 8:59 PM, Hannes Reinecke wrote:
>> On 2/21/26 01:44, Damien Le Moal wrote:
>>> The function disk_zone_wplug_schedule_bio_work() always takes a
>>> reference on the zone write plug of the BIO work being scheduled. This
>>> ensure that the zone write plug cannot be freed while the BIO work is
>>> being scheduled but has not run yet. However, this unconditional
>>> reference taking is fragile since the reference taken is realized by the
>>> BIO work blk_zone_wplug_bio_work() function, which implies that there
>>> always must be a 1:1 relation between the work being scheduled and the
>>> work running.
>>>
>>> Make this less fragile by taking the reference on the BIO work if and
>>> only if the BIO work has not been already scheduled.
>>>
>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>> ---
>>> block/blk-zoned.c | 12 +++++++-----
>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>> index c5c91f927a71..4b388ae1acaa 100644
>>> --- a/block/blk-zoned.c
>>> +++ b/block/blk-zoned.c
>>> @@ -1205,13 +1205,15 @@ static void disk_zone_wplug_schedule_bio_work(struct
>>> gendisk *disk,
>>> lockdep_assert_held(&zwplug->lock);
>>> /*
>>> - * Take a reference on the zone write plug and schedule the submission
>>> - * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
>>> - * reference we take here.
>>> + * Schedule the submission of the next plugged BIO. If the zone write
>>> + * plug BIO work is not already scheduled, take a reference on the zone
>>> + * write plug to ensure that it does not go away while the work is being
>>> + * scheduled but has not run yet. blk_zone_wplug_bio_work() will release
>>> + * the reference we take here.
>>> */
>>> WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
>>> - refcount_inc(&zwplug->ref);
>>> - queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
>>> + if (queue_work(disk->zone_wplugs_wq, &zwplug->bio_work))
>>> + refcount_inc(&zwplug->ref);
>>> }
>>> static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
>>
>> Urgh. Don't.
>> There is a race condition; 'bio_work' might be scheduled directly and
>> complete before 'refcount_inc()' is called.
>> You have to invert the statement by keeping the 'refcount_inc()'
>> before calling 'queue_work()', and then drop the reference again
>> if queue_work failed.
>> Ugly, I know.
>
> As Bart pointed out, since this is done with the zone write plug spinlock held
> and the work takes that lock first when it runs, there is no chance that the
> refcount gets bad. That is indeed not super clear though, so I will imrove the
> comment to describe that.
>
But then we deadlock.
If a lock is held when calling queue_work() and the workqueue function
is calling spin_lock() on the _same_ lock _and_ we are scheduled on
the _same_ CPU (remember single-core systems ?) we deadlock.
Don't.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work()
2026-02-24 15:00 ` Hannes Reinecke
@ 2026-02-24 15:08 ` Christoph Hellwig
0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2026-02-24 15:08 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Damien Le Moal, Jens Axboe, linux-block
On Tue, Feb 24, 2026 at 04:00:17PM +0100, Hannes Reinecke wrote:
> But then we deadlock.
> If a lock is held when calling queue_work() and the workqueue function
> is calling spin_lock() on the _same_ lock _and_ we are scheduled on
> the _same_ CPU (remember single-core systems ?) we deadlock.
You very obviously can't context switch to a separate thread when
holding a spinlock.
Note that I don't particularly like this pattern. I'd much rather just
increment the reference and drop it when we did not schedule the
work; this is not a super hot path that needs crazy optimizations,
and that would have avoided this bike shedding discussion.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work()
2026-02-21 0:44 ` [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work() Damien Le Moal
2026-02-23 11:59 ` Hannes Reinecke
@ 2026-02-24 13:18 ` Johannes Thumshirn
1 sibling, 0 replies; 31+ messages in thread
From: Johannes Thumshirn @ 2026-02-24 13:18 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block@vger.kernel.org
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 5/8] block: rename struct gendisk zone_wplugs_lock field
2026-02-21 0:44 [PATCH 0/8] Improve zoned (SMR) HDD write throughput Damien Le Moal
` (3 preceding siblings ...)
2026-02-21 0:44 ` [PATCH 4/8] block: improve disk_zone_wplug_schedule_bio_work() Damien Le Moal
@ 2026-02-21 0:44 ` Damien Le Moal
2026-02-23 12:00 ` Hannes Reinecke
2026-02-24 13:19 ` Johannes Thumshirn
2026-02-21 0:44 ` [PATCH 6/8] block: allow submitting all zone writes from a single context Damien Le Moal
` (3 subsequent siblings)
8 siblings, 2 replies; 31+ messages in thread
From: Damien Le Moal @ 2026-02-21 0:44 UTC (permalink / raw)
To: Jens Axboe, linux-block
Rename the struct gendisk zone_wplugs_lock field to zone_wplugs_hash_lock to
clearly indicate that this is the spinlock used for manipulating the
hash table of zone write plugs.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 23 ++++++++++++-----------
include/linux/blkdev.h | 2 +-
2 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 4b388ae1acaa..ee77330e14e2 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -506,10 +506,11 @@ static bool disk_insert_zone_wplug(struct gendisk *disk,
* are racing with other submission context, so we may already have a
* zone write plug for the same zone.
*/
- spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+ spin_lock_irqsave(&disk->zone_wplugs_hash_lock, flags);
hlist_for_each_entry_rcu(zwplg, &disk->zone_wplugs_hash[idx], node) {
if (zwplg->zone_no == zwplug->zone_no) {
- spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+ spin_unlock_irqrestore(&disk->zone_wplugs_hash_lock,
+ flags);
return false;
}
}
@@ -521,7 +522,7 @@ static bool disk_insert_zone_wplug(struct gendisk *disk,
* necessarilly in the active condition.
*/
zones_cond = rcu_dereference_check(disk->zones_cond,
- lockdep_is_held(&disk->zone_wplugs_lock));
+ lockdep_is_held(&disk->zone_wplugs_hash_lock));
if (zones_cond)
zwplug->cond = zones_cond[zwplug->zone_no];
else
@@ -529,7 +530,7 @@ static bool disk_insert_zone_wplug(struct gendisk *disk,
hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
atomic_inc(&disk->nr_zone_wplugs);
- spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+ spin_unlock_irqrestore(&disk->zone_wplugs_hash_lock, flags);
return true;
}
@@ -629,13 +630,13 @@ static void disk_remove_zone_wplug(struct gendisk *disk,
* the zone write plug and drop the extra reference we took when the
* plug was inserted in the hash table.
*/
- spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+ spin_lock_irqsave(&disk->zone_wplugs_hash_lock, flags);
blk_zone_set_cond(rcu_dereference_check(disk->zones_cond,
- lockdep_is_held(&disk->zone_wplugs_lock)),
+ lockdep_is_held(&disk->zone_wplugs_hash_lock)),
zwplug->zone_no, zwplug->cond);
hlist_del_init_rcu(&zwplug->node);
atomic_dec(&disk->nr_zone_wplugs);
- spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+ spin_unlock_irqrestore(&disk->zone_wplugs_hash_lock, flags);
disk_put_zone_wplug(zwplug);
}
@@ -1784,7 +1785,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
void disk_init_zone_resources(struct gendisk *disk)
{
- spin_lock_init(&disk->zone_wplugs_lock);
+ spin_lock_init(&disk->zone_wplugs_hash_lock);
}
/*
@@ -1874,10 +1875,10 @@ static void disk_set_zones_cond_array(struct gendisk *disk, u8 *zones_cond)
{
unsigned long flags;
- spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+ spin_lock_irqsave(&disk->zone_wplugs_hash_lock, flags);
zones_cond = rcu_replace_pointer(disk->zones_cond, zones_cond,
- lockdep_is_held(&disk->zone_wplugs_lock));
- spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+ lockdep_is_held(&disk->zone_wplugs_hash_lock));
+ spin_unlock_irqrestore(&disk->zone_wplugs_hash_lock, flags);
kfree_rcu_mightsleep(zones_cond);
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d463b9b5a0a5..689090023770 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -200,7 +200,7 @@ struct gendisk {
u8 __rcu *zones_cond;
unsigned int zone_wplugs_hash_bits;
atomic_t nr_zone_wplugs;
- spinlock_t zone_wplugs_lock;
+ spinlock_t zone_wplugs_hash_lock;
struct mempool *zone_wplugs_pool;
struct hlist_head *zone_wplugs_hash;
struct workqueue_struct *zone_wplugs_wq;
--
2.53.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH 5/8] block: rename struct gendisk zone_wplugs_lock field
2026-02-21 0:44 ` [PATCH 5/8] block: rename struct gendisk zone_wplugs_lock field Damien Le Moal
@ 2026-02-23 12:00 ` Hannes Reinecke
2026-02-24 13:19 ` Johannes Thumshirn
1 sibling, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2026-02-23 12:00 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block
On 2/21/26 01:44, Damien Le Moal wrote:
> Rename struct gendisk zone_wplugs_lock field to zone_wplugs_hash_lock to
> clearly indicates that this is the spinlock used for manipulating the
> hash table of zone write plugs.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 23 ++++++++++++-----------
> include/linux/blkdev.h | 2 +-
> 2 files changed, 13 insertions(+), 12 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 5/8] block: rename struct gendisk zone_wplugs_lock field
2026-02-21 0:44 ` [PATCH 5/8] block: rename struct gendisk zone_wplugs_lock field Damien Le Moal
2026-02-23 12:00 ` Hannes Reinecke
@ 2026-02-24 13:19 ` Johannes Thumshirn
1 sibling, 0 replies; 31+ messages in thread
From: Johannes Thumshirn @ 2026-02-24 13:19 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block@vger.kernel.org
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 6/8] block: allow submitting all zone writes from a single context
2026-02-21 0:44 [PATCH 0/8] Improve zoned (SMR) HDD write throughput Damien Le Moal
` (4 preceding siblings ...)
2026-02-21 0:44 ` [PATCH 5/8] block: rename struct gendisk zone_wplugs_lock field Damien Le Moal
@ 2026-02-21 0:44 ` Damien Le Moal
2026-02-23 12:07 ` Hannes Reinecke
2026-02-21 0:44 ` [PATCH 7/8] block: default to QD=1 writes for blk-mq rotational zoned devices Damien Le Moal
` (2 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2026-02-21 0:44 UTC (permalink / raw)
To: Jens Axboe, linux-block
In order to maintain sequential write patterns per zone with zoned block
devices, zone write plugging issues only a single write BIO per zone at
any time. This works well but has the side effect that when large
sequential write streams issued by the user cross zone boundaries, the
device ends up receiving a discontiguous set of write commands for
different zones. The same also happens when a user simultaneously writes
to multiple zones at a high queue depth: the device does not see all the
sequential writes to a zone and instead receives discontiguous writes to
different zones. While this does not affect the performance of solid
state zoned block devices, when using an SMR HDD, this change from
sequential writes to discontiguous writes to different zones
significantly increases head seeking, which results in degraded write
throughput.
In order to reduce this seek overhead for rotational media devices,
introduce a per-disk zone write plugs kernel thread to issue all write
BIOs to zones. This single zone write issuing context is enabled for
any zoned block device that has a request queue flagged with the new
QUEUE_FLAG_ZONED_QD1_WRITES flag.
The flag QUEUE_FLAG_ZONED_QD1_WRITES is visible as the sysfs queue
attribute zoned_qd1_writes. For a regular block device, this attribute
cannot be changed and is always 0. For zoned block devices, a user can
set this attribute to force a global maximum write queue depth of 1, or
clear it to fall back to the default behavior of zone write plugging,
which limits writes to QD=1 per sequential zone.
Writing to a zoned block device flagged with QUEUE_FLAG_ZONED_QD1_WRITES
is implemented using a list of the zone write plugs that have a
non-empty BIO list. Listed zone write plugs are processed by the disk
zone write plugs worker kthread in FIFO order, and all BIOs of a zone
write plug are processed before switching to the next listed zone write
plug. A newly submitted BIO for a non-FULL zone write plug that is not
yet listed causes the zone write plug to be added at the end of the disk
list of zone write plugs.
Since the write BIOs queued in a zone write plug BIO list are
necessarily sequential, for rotational media, using the single zone
write plugs kthread to issue all BIOs maintains a sequential write
pattern and thus reduces seek overhead and improves write throughput.
This processing essentially results in always writing to HDDs at QD=1,
which is not an issue for HDDs operating with write caching enabled.
Performance with write cache disabled is also not degraded thanks to
the efficient write handling of modern SMR HDDs.
A disk list of zone write plugs is defined using the new struct gendisk
zone_wplugs_list, and accesses to this list are protected using the
zone_wplugs_list_lock spinlock. The per-disk kthread
(zone_wplugs_worker) code is implemented by the function
disk_zone_wplugs_worker(). A reference on listed zone write plugs is
always held until all BIOs of the zone write plug are processed by the
worker kthread. BIO issuing at QD=1 is driven using a completion
structure (zone_wplugs_worker_bio_done) and calls to blk_io_wait().
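The completion-driven QD=1 issuing described above can be modeled with a minimal single-threaded sketch. The names below are illustrative stand-ins for the completion and wait calls named in the commit message:

```c
#include <assert.h>

/*
 * Model of the QD=1 issuing loop: submit one write, then wait for its
 * completion (the zone_wplugs_worker_bio_done completion plus
 * blk_io_wait() in the patch) before submitting the next write.
 */
static int in_flight, max_in_flight, completed;

static void model_submit_bio(void)
{
	in_flight++;
	if (in_flight > max_in_flight)
		max_in_flight = in_flight;
}

/* Models waiting on the completion that the bio endio handler signals. */
static void model_wait_bio_done(void)
{
	in_flight--;
	completed++;
}

/* Issue nr_bios writes at QD=1; returns the peak queue depth observed. */
static int model_issue_qd1(int nr_bios)
{
	for (int i = 0; i < nr_bios; i++) {
		model_submit_bio();
		model_wait_bio_done();	/* wait before the next submission */
	}
	return max_in_flight;
}
```

The wait between submissions is what bounds the device-visible write queue depth at 1 regardless of how many BIOs user space has queued.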
With this change, performance when sequentially writing the zones of a
30 TB SMR SATA HDD connected to an AHCI adapter changes as follows
(1MiB direct I/Os, results in MB/s unit):
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Sequential write | Baseline | Patched |
| Queue Depth | 6.19-rc8 | |
+------------------+----------+---------+
| 1 | 244 | 245 |
| 2 | 244 | 245 |
| 4 | 245 | 245 |
| 8 | 242 | 245 |
| 16 | 222 | 246 |
| 32 | 211 | 245 |
| 64 | 193 | 244 |
| 128 | 112 | 246 |
+------------------+----------+---------+
With the current code (baseline), as the sequential write stream crosses
a zone boundary, higher queue depth creates a gap between the
last IO to the previous zone and the first IOs to the following zones,
causing head seeks and degrading performance. Using the disk zone
write plugs worker thread, this pattern disappears and the maximum
throughput of the drive is maintained, leading to over 100%
improvement in throughput for high queue depth writes.
Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1
MiB direct IOs, write throughput also increases significantly.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Random write | Baseline | Patched |
| Number of zones | 6.19-rc7 | |
+------------------+----------+---------+
| 1 | 191 | 192 |
| 2 | 101 | 128 |
| 4 | 115 | 123 |
| 8 | 90 | 120 |
| 16 | 64 | 115 |
| 32 | 58 | 105 |
| 64 | 56 | 101 |
| 128 | 55 | 99 |
+------------------+----------+---------+
Tests using XFS show that buffered write speed with 8 jobs writing
files increases by 12% to 35% depending on the workload.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Workload | Baseline | Patched |
| | 6.19-rc7 | |
+------------------+----------+---------+
| 256MiB file size | 212 | 238 |
+------------------+----------+---------+
| 4MiB .. 128 MiB | 213 | 243 |
| random file size | | |
+------------------+----------+---------+
| 2MiB .. 8 MiB | 179 | 242 |
| random file size | | |
+------------------+----------+---------+
Performance gains are even more significant when using an HBA that
limits the maximum size of commands to a small value, e.g. HBAs
controlled by the mpi3mr driver limit commands to a maximum of 1 MiB.
In such a case, the write throughput gains are over 40%.
+--------------------+
| Write BW (MB/s) |
+------------------+----------+---------+
| Workload | Baseline | Patched |
| | 6.19-rc7 | |
+------------------+----------+---------+
| 256MiB file size | 175 | 245 |
+------------------+----------+---------+
| 4MiB .. 128 MiB | 174 | 244 |
| random file size | | |
+------------------+----------+---------+
| 2MiB .. 8 MiB | 171 | 243 |
| random file size | | |
+------------------+----------+---------+
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-mq-debugfs.c | 1 +
block/blk-sysfs.c | 35 ++++++++
block/blk-zoned.c | 197 +++++++++++++++++++++++++++++++++++------
include/linux/blkdev.h | 8 ++
4 files changed, 215 insertions(+), 26 deletions(-)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 28167c9baa55..047ec887456b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -97,6 +97,7 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(NO_ELV_SWITCH),
QUEUE_FLAG_NAME(QOS_ENABLED),
QUEUE_FLAG_NAME(BIO_ISSUE_TIME),
+ QUEUE_FLAG_NAME(ZONED_QD1_WRITES),
};
#undef QUEUE_FLAG_NAME
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f3b1968c80ce..789802286d95 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -384,6 +384,39 @@ static ssize_t queue_nr_zones_show(struct gendisk *disk, char *page)
return queue_var_show(disk_nr_zones(disk), page);
}
+static ssize_t queue_zoned_qd1_writes_show(struct gendisk *disk, char *page)
+{
+ return queue_var_show(!!blk_queue_zoned_qd1_writes(disk->queue),
+ page);
+}
+
+static ssize_t queue_zoned_qd1_writes_store(struct gendisk *disk,
+ const char *page, size_t count)
+{
+ struct request_queue *q = disk->queue;
+ unsigned long qd1_writes;
+ unsigned int memflags;
+ ssize_t ret;
+
+ if (!blk_queue_is_zoned(q))
+ return -EOPNOTSUPP;
+
+ ret = queue_var_store(&qd1_writes, page, count);
+ if (ret < 0)
+ return ret;
+
+ memflags = blk_mq_freeze_queue(q);
+ blk_mq_quiesce_queue(q);
+ if (qd1_writes)
+ blk_queue_flag_set(QUEUE_FLAG_ZONED_QD1_WRITES, q);
+ else
+ blk_queue_flag_clear(QUEUE_FLAG_ZONED_QD1_WRITES, q);
+ blk_mq_unquiesce_queue(q);
+ blk_mq_unfreeze_queue(q, memflags);
+
+ return count;
+}
+
static ssize_t queue_iostats_passthrough_show(struct gendisk *disk, char *page)
{
return queue_var_show(!!blk_queue_passthrough_stat(disk->queue), page);
@@ -611,6 +644,7 @@ QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
QUEUE_LIM_RO_ENTRY(queue_zoned, "zoned");
+QUEUE_RW_ENTRY(queue_zoned_qd1_writes, "zoned_qd1_writes");
QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
QUEUE_LIM_RO_ENTRY(queue_max_open_zones, "max_open_zones");
QUEUE_LIM_RO_ENTRY(queue_max_active_zones, "max_active_zones");
@@ -748,6 +782,7 @@ static struct attribute *queue_attrs[] = {
&queue_nomerges_entry.attr,
&queue_poll_entry.attr,
&queue_poll_delay_entry.attr,
+ &queue_zoned_qd1_writes_entry.attr,
NULL,
};
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index ee77330e14e2..59b5c837b17a 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -16,6 +16,8 @@
#include <linux/spinlock.h>
#include <linux/refcount.h>
#include <linux/mempool.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
#include <trace/events/block.h>
@@ -40,6 +42,8 @@ static const char *const zone_cond_name[] = {
/*
* Per-zone write plug.
* @node: hlist_node structure for managing the plug using a hash table.
+ * @entry: list_head structure for listing the plug in the disk list of active
+ * zone write plugs.
* @bio_list: The list of BIOs that are currently plugged.
* @bio_work: Work struct to handle issuing of plugged BIOs
* @rcu_head: RCU head to free zone write plugs with an RCU grace period.
@@ -62,6 +66,7 @@ static const char *const zone_cond_name[] = {
*/
struct blk_zone_wplug {
struct hlist_node node;
+ struct list_head entry;
struct bio_list bio_list;
struct work_struct bio_work;
struct rcu_head rcu_head;
@@ -663,7 +668,19 @@ static inline void disk_check_and_put_zone_wplug(struct blk_zone_wplug *zwplug)
disk_put_zone_wplug(zwplug);
}
-static void blk_zone_wplug_bio_work(struct work_struct *work);
+static bool disk_zone_wplug_submit_bio(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug);
+
+static void blk_zone_wplug_bio_work(struct work_struct *work)
+{
+ struct blk_zone_wplug *zwplug =
+ container_of(work, struct blk_zone_wplug, bio_work);
+
+ disk_zone_wplug_submit_bio(zwplug->disk, zwplug);
+
+ /* Drop the reference we took in disk_zone_wplug_schedule_work(). */
+ disk_put_zone_wplug(zwplug);
+}
/*
* Get a reference on the write plug for the zone containing @sector.
@@ -712,6 +729,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
zwplug->wp_offset = bdev_offset_from_zone_start(disk->part0, sector);
bio_list_init(&zwplug->bio_list);
INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
+ INIT_LIST_HEAD(&zwplug->entry);
zwplug->disk = disk;
spin_lock_irqsave(&zwplug->lock, *flags);
@@ -747,6 +765,7 @@ static inline void blk_zone_wplug_bio_io_error(struct blk_zone_wplug *zwplug,
*/
static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
{
+ struct gendisk *disk = zwplug->disk;
struct bio *bio;
lockdep_assert_held(&zwplug->lock);
@@ -760,6 +779,20 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
blk_zone_wplug_bio_io_error(zwplug, bio);
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+
+ /*
+ * If we are using the per disk zone write plugs worker thread, remove
+ * the zone write plug from the work list and drop the reference we
+ * took when the zone write plug was added to that list.
+ */
+ if (blk_queue_zoned_qd1_writes(disk->queue)) {
+ spin_lock(&disk->zone_wplugs_list_lock);
+ if (!list_empty(&zwplug->entry)) {
+ list_del_init(&zwplug->entry);
+ disk_put_zone_wplug(zwplug);
+ }
+ spin_unlock(&disk->zone_wplugs_list_lock);
+ }
}
/*
@@ -1200,19 +1233,19 @@ void blk_zone_mgmt_bio_endio(struct bio *bio)
}
}
-static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
- struct blk_zone_wplug *zwplug)
+static void disk_zone_wplug_schedule_work(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug)
{
lockdep_assert_held(&zwplug->lock);
/*
- * Schedule the submission of the next plugged BIO. If the zone write
- * plug BIO work is not already scheduled, take a reference on the zone
- * write plug to ensure that it does not go away while the work is being
- * scheduled but has not run yet. blk_zone_wplug_bio_work() will release
- * the reference we take here.
+ * If the zone write plug BIO work is not already scheduled, take a
+ * reference on the zone write plug to ensure that it does not go away
+ * while the work is being scheduled but has not run yet.
+ * blk_zone_wplug_bio_work() will release the reference we take here.
*/
WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
+ WARN_ON_ONCE(blk_queue_zoned_qd1_writes(disk->queue));
if (queue_work(disk->zone_wplugs_wq, &zwplug->bio_work))
refcount_inc(&zwplug->ref);
}
@@ -1251,6 +1284,22 @@ static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
bio_list_add(&zwplug->bio_list, bio);
trace_disk_zone_wplug_add_bio(zwplug->disk->queue, zwplug->zone_no,
bio->bi_iter.bi_sector, bio_sectors(bio));
+
+ /*
+ * If we are using the disk zone write plugs worker instead of the per
+ * zone write plug BIO work, add the zone write plug to the work list
+ * if it is not already there. Make sure to also get an extra reference
+ * on the zone write plug so that it does not go away until it is
+ * removed from the work list.
+ */
+ if (blk_queue_zoned_qd1_writes(disk->queue)) {
+ spin_lock(&disk->zone_wplugs_list_lock);
+ if (list_empty(&zwplug->entry)) {
+ list_add_tail(&zwplug->entry, &disk->zone_wplugs_list);
+ refcount_inc(&zwplug->ref);
+ }
+ spin_unlock(&disk->zone_wplugs_list_lock);
+ }
}
/*
@@ -1469,6 +1518,13 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
goto queue_bio;
}
+ /*
+ * For rotational devices, we will use the gendisk zone write plugs
+ * worker instead of the per zone write plug BIO work, so queue the BIO.
+ */
+ if (blk_queue_zoned_qd1_writes(disk->queue))
+ goto queue_bio;
+
/* If the zone is already plugged, add the BIO to the BIO plug list. */
if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
goto queue_bio;
@@ -1491,7 +1547,10 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
if (!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)) {
zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
- disk_zone_wplug_schedule_bio_work(disk, zwplug);
+ if (blk_queue_zoned_qd1_writes(disk->queue))
+ wake_up_process(disk->zone_wplugs_worker);
+ else
+ disk_zone_wplug_schedule_work(disk, zwplug);
}
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1632,14 +1691,18 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
spin_lock_irqsave(&zwplug->lock, flags);
- /* Schedule submission of the next plugged BIO if we have one. */
- if (!bio_list_empty(&zwplug->bio_list)) {
- disk_zone_wplug_schedule_bio_work(disk, zwplug);
- spin_unlock_irqrestore(&zwplug->lock, flags);
- return;
- }
+ /*
+ * For rotational devices, signal the BIO completion to the zone write
+ * plug worker. Otherwise, schedule submission of the next plugged BIO
+ * if we have one.
+ */
+ if (bio_list_empty(&zwplug->bio_list))
+ zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
- zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+ if (blk_queue_zoned_qd1_writes(disk->queue))
+ complete(&disk->zone_wplugs_worker_bio_done);
+ else if (!bio_list_empty(&zwplug->bio_list))
+ disk_zone_wplug_schedule_work(disk, zwplug);
spin_unlock_irqrestore(&zwplug->lock, flags);
}
@@ -1730,10 +1793,9 @@ void blk_zone_write_plug_finish_request(struct request *req)
disk_check_and_put_zone_wplug(zwplug);
}
-static void blk_zone_wplug_bio_work(struct work_struct *work)
+static bool disk_zone_wplug_submit_bio(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug)
{
- struct blk_zone_wplug *zwplug =
- container_of(work, struct blk_zone_wplug, bio_work);
struct block_device *bdev;
unsigned long flags;
struct bio *bio;
@@ -1749,7 +1811,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
if (!bio) {
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
spin_unlock_irqrestore(&zwplug->lock, flags);
- goto put_zwplug;
+ return false;
}
trace_blk_zone_wplug_bio(zwplug->disk->queue, zwplug->zone_no,
@@ -1763,14 +1825,15 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
goto again;
}
- bdev = bio->bi_bdev;
-
/*
* blk-mq devices will reuse the extra reference on the request queue
* usage counter we took when the BIO was plugged, but the submission
* path for BIO-based devices will not do that. So drop this extra
* reference here.
*/
+ if (blk_queue_zoned_qd1_writes(disk->queue))
+ reinit_completion(&disk->zone_wplugs_worker_bio_done);
+ bdev = bio->bi_bdev;
if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
bdev->bd_disk->fops->submit_bio(bio);
blk_queue_exit(bdev->bd_disk->queue);
@@ -1778,14 +1841,78 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
blk_mq_submit_bio(bio);
}
-put_zwplug:
- /* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
- disk_check_and_put_zone_wplug(zwplug);
+ return true;
+}
+
+static struct blk_zone_wplug *disk_get_zone_wplugs_work(struct gendisk *disk)
+{
+ struct blk_zone_wplug *zwplug;
+
+ spin_lock_irq(&disk->zone_wplugs_list_lock);
+ zwplug = list_first_entry_or_null(&disk->zone_wplugs_list,
+ struct blk_zone_wplug, entry);
+ if (zwplug)
+ list_del_init(&zwplug->entry);
+ spin_unlock_irq(&disk->zone_wplugs_list_lock);
+
+ return zwplug;
+}
+
+static int disk_zone_wplugs_worker(void *data)
+{
+ struct gendisk *disk = data;
+ struct blk_zone_wplug *zwplug;
+ unsigned int noio_flag;
+
+ noio_flag = memalloc_noio_save();
+ set_user_nice(current, MIN_NICE);
+ set_freezable();
+
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
+
+ zwplug = disk_get_zone_wplugs_work(disk);
+ if (zwplug) {
+ /*
+ * Process all BIOs of this zone write plug and then
+ * drop the reference we took when adding the zone write
+ * plug to the active list.
+ */
+ set_current_state(TASK_RUNNING);
+ while (disk_zone_wplug_submit_bio(disk, zwplug))
+ blk_wait_io(&disk->zone_wplugs_worker_bio_done);
+ disk_check_and_put_zone_wplug(zwplug);
+ continue;
+ }
+
+ /*
+ * Only sleep if nothing sets the state to running. Else check
+ * for zone write plugs work again as a newly submitted BIO
+ * might have added a zone write plug to the work list.
+ */
+ if (get_current_state() == TASK_RUNNING) {
+ try_to_freeze();
+ } else {
+ if (kthread_should_stop()) {
+ set_current_state(TASK_RUNNING);
+ break;
+ }
+ schedule();
+ }
+ }
+
+ WARN_ON_ONCE(!list_empty(&disk->zone_wplugs_list));
+ memalloc_noio_restore(noio_flag);
+
+ return 0;
}
void disk_init_zone_resources(struct gendisk *disk)
{
spin_lock_init(&disk->zone_wplugs_hash_lock);
+ spin_lock_init(&disk->zone_wplugs_list_lock);
+ INIT_LIST_HEAD(&disk->zone_wplugs_list);
+ init_completion(&disk->zone_wplugs_worker_bio_done);
}
/*
@@ -1801,6 +1928,7 @@ static int disk_alloc_zone_resources(struct gendisk *disk,
unsigned int pool_size)
{
unsigned int i;
+ int ret = -ENOMEM;
atomic_set(&disk->nr_zone_wplugs, 0);
disk->zone_wplugs_hash_bits =
@@ -1826,8 +1954,21 @@ static int disk_alloc_zone_resources(struct gendisk *disk,
if (!disk->zone_wplugs_wq)
goto destroy_pool;
+ disk->zone_wplugs_worker =
+ kthread_create(disk_zone_wplugs_worker, disk,
+ "%s_zwplugs_worker", disk->disk_name);
+ if (IS_ERR(disk->zone_wplugs_worker)) {
+ ret = PTR_ERR(disk->zone_wplugs_worker);
+ disk->zone_wplugs_worker = NULL;
+ goto destroy_wq;
+ }
+ wake_up_process(disk->zone_wplugs_worker);
+
return 0;
+destroy_wq:
+ destroy_workqueue(disk->zone_wplugs_wq);
+ disk->zone_wplugs_wq = NULL;
destroy_pool:
mempool_destroy(disk->zone_wplugs_pool);
disk->zone_wplugs_pool = NULL;
@@ -1835,7 +1976,7 @@ static int disk_alloc_zone_resources(struct gendisk *disk,
kfree(disk->zone_wplugs_hash);
disk->zone_wplugs_hash = NULL;
disk->zone_wplugs_hash_bits = 0;
- return -ENOMEM;
+ return ret;
}
static void disk_destroy_zone_wplugs_hash_table(struct gendisk *disk)
@@ -1885,6 +2026,10 @@ static void disk_set_zones_cond_array(struct gendisk *disk, u8 *zones_cond)
void disk_free_zone_resources(struct gendisk *disk)
{
+ if (disk->zone_wplugs_worker)
+ kthread_stop(disk->zone_wplugs_worker);
+ WARN_ON_ONCE(!list_empty(&disk->zone_wplugs_list));
+
if (disk->zone_wplugs_wq) {
destroy_workqueue(disk->zone_wplugs_wq);
disk->zone_wplugs_wq = NULL;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 689090023770..ac899cd0cd70 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -13,6 +13,7 @@
#include <linux/minmax.h>
#include <linux/timer.h>
#include <linux/workqueue.h>
+#include <linux/completion.h>
#include <linux/wait.h>
#include <linux/bio.h>
#include <linux/gfp.h>
@@ -204,6 +205,10 @@ struct gendisk {
struct mempool *zone_wplugs_pool;
struct hlist_head *zone_wplugs_hash;
struct workqueue_struct *zone_wplugs_wq;
+ spinlock_t zone_wplugs_list_lock;
+ struct list_head zone_wplugs_list;
+ struct task_struct *zone_wplugs_worker;
+ struct completion zone_wplugs_worker_bio_done;
#endif /* CONFIG_BLK_DEV_ZONED */
#if IS_ENABLED(CONFIG_CDROM)
@@ -668,6 +673,7 @@ enum {
QUEUE_FLAG_NO_ELV_SWITCH, /* can't switch elevator any more */
QUEUE_FLAG_QOS_ENABLED, /* qos is enabled */
QUEUE_FLAG_BIO_ISSUE_TIME, /* record bio->issue_time_ns */
+ QUEUE_FLAG_ZONED_QD1_WRITES, /* Limit zoned devices writes to QD=1 */
QUEUE_FLAG_MAX
};
@@ -707,6 +713,8 @@ void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
test_bit(QUEUE_FLAG_DISABLE_WBT_DEF, &(q)->queue_flags)
#define blk_queue_no_elv_switch(q) \
test_bit(QUEUE_FLAG_NO_ELV_SWITCH, &(q)->queue_flags)
+#define blk_queue_zoned_qd1_writes(q) \
+ test_bit(QUEUE_FLAG_ZONED_QD1_WRITES, &(q)->queue_flags)
extern void blk_set_pm_only(struct request_queue *q);
extern void blk_clear_pm_only(struct request_queue *q);
--
2.53.0
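The FIFO draining behavior this patch implements (a plug with newly queued writes is listed once on the per-disk list, and the worker submits every queued BIO of one listed plug before moving to the next) can be illustrated with a small single-threaded userspace model. All names below are illustrative stand-ins, not kernel API:

```c
/* Toy single-threaded model of the per-disk QD=1 zone write worker.
 * This is NOT kernel code. Each "plug" holds queued write sectors for
 * one zone; plugs with pending writes are listed once on a per-disk
 * FIFO, and the worker fully drains one plug before moving to the
 * next, preserving per-zone sequential order at the "drive". */
#include <assert.h>
#include <string.h>

#define MAX_BIOS  8
#define MAX_ZONES 4

struct toy_plug {
	int sectors[MAX_BIOS];	/* queued write sectors, per-zone sequential */
	int head, tail;
	int listed;		/* already on the disk FIFO? */
};

struct toy_disk {
	struct toy_plug plugs[MAX_ZONES];
	int fifo[MAX_ZONES];	/* zone numbers, in listing (FIFO) order */
	int fifo_head, fifo_tail;
};

/* Queue a write and list the plug on the disk FIFO only if it is not
 * listed yet (the list_empty() check in disk_zone_wplug_add_bio()). */
static void toy_queue_write(struct toy_disk *d, int zone, int sector)
{
	struct toy_plug *p = &d->plugs[zone];

	p->sectors[p->tail++] = sector;
	if (!p->listed) {
		p->listed = 1;
		d->fifo[d->fifo_tail++] = zone;
	}
}

/* Worker loop: pop plugs in FIFO order and submit all of a plug's
 * writes before switching zones. Records the submission order in
 * out[] and returns the number of writes "submitted". */
static int toy_drain(struct toy_disk *d, int *out)
{
	int n = 0;

	while (d->fifo_head < d->fifo_tail) {
		struct toy_plug *p = &d->plugs[d->fifo[d->fifo_head++]];

		while (p->head < p->tail)
			out[n++] = p->sectors[p->head++];
		p->listed = 0;
	}
	return n;
}
```

Even when submissions to two zones are interleaved, the drive sees two fully sequential runs rather than alternating writes, which is the seek-avoidance property the commit message measures.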
* Re: [PATCH 6/8] block: allow submitting all zone writes from a single context
2026-02-21 0:44 ` [PATCH 6/8] block: allow submitting all zone writes from a single context Damien Le Moal
@ 2026-02-23 12:07 ` Hannes Reinecke
2026-02-24 2:00 ` Damien Le Moal
0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2026-02-23 12:07 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block
On 2/21/26 01:44, Damien Le Moal wrote:
> In order to maintain sequential write patterns per zone with zoned block
> devices, zone write plugging issues only a single write BIO per zone at
> any time. This works well but has the side effect that when large
> sequential write streams are issued by the user and these streams cross
> zone boundaries, the device ends up receiving a discontiguous set of
> write commands for different zones. The same also happens when a user
> writes simultaneously at high queue depth multiple zones: the device
> does not see all sequential writes per zone and receives discontiguous
> writes to different zones. While this does not affect the performance of
> solid state zoned block devices, when using an SMR HDD, this pattern
> change from sequential writes to discontiguous writes to different zones
> significantly increases head seek which results in degraded write
> throughput.
>
> In order to reduce this seek overhead for rotational media devices,
> introduce a per disk zone write plugs kernel thread to issue all write
> BIOs to zones. This single zone write issuing context is enabled for
> any zoned block device that has a request queue flagged with the new
> QUEUE_FLAG_ZONED_QD1_WRITES flag.
>
> The flag QUEUE_FLAG_ZONED_QD1_WRITES is visible as the sysfs queue
> attribute zoned_qd1_writes. For a regular block device, this attribute
> cannot be changed and is always 0. For zoned block devices, a user can
> set this attribute to force a global maximum write queue depth of 1 for
> the device, or clear it to fall back to the default zone write plugging
> behavior, which limits writes to QD=1 per sequential zone.
>
> Writing to a zoned block device flagged with QUEUE_FLAG_ZONED_QD1_WRITES is
> implemented using a list of zone write plugs that have a non-empty BIO
> list. Listed zone write plugs are processed by the disk zone write plugs
> worker kthread in FIFO order, and all BIOs of a zone write plug are
> processed before switching to the next listed zone write plug. A newly
> submitted BIO for a non-FULL zone write plug that is not yet listed
> causes the addition of the zone write plug at the end of the disk list
> of zone write plugs.
>
> Since the write BIOs queued in a zone write plug BIO list are
> necessarily sequential, for rotational media, using the single zone
> write plugs kthread to issue all BIOs maintains a sequential write
> pattern and thus reduces seek overhead and improves write throughput.
> This processing essentially results in always writing to HDDs at QD=1,
> which is not an issue for HDDs operating with write caching enabled.
> Performance with write cache disabled is also not degraded thanks to
> the efficient write handling of modern SMR HDDs.
>
> A disk list of zone write plugs is defined using the new struct gendisk
> zone_wplugs_list, and accesses to this list are protected using the
> zone_wplugs_list_lock spinlock. The per disk kthread
> (zone_wplugs_worker) code is implemented by the function
> disk_zone_wplugs_worker(). A reference on listed zone write plugs is
> always held until all BIOs of the zone write plug are processed by the
> worker kthread. BIO issuing at QD=1 is driven using a completion
> structure (zone_wplugs_worker_bio_done) and calls to blk_wait_io().
>
> With this change, performance when sequentially writing the zones of a
> 30 TB SMR SATA HDD connected to an AHCI adapter changes as follows
> (1MiB direct I/Os, results in MB/s unit):
>
> +--------------------+
> | Write BW (MB/s) |
> +------------------+----------+---------+
> | Sequential write | Baseline | Patched |
> | Queue Depth | 6.19-rc8 | |
> +------------------+----------+---------+
> | 1 | 244 | 245 |
> | 2 | 244 | 245 |
> | 4 | 245 | 245 |
> | 8 | 242 | 245 |
> | 16 | 222 | 246 |
> | 32 | 211 | 245 |
> | 64 | 193 | 244 |
> | 128 | 112 | 246 |
> +------------------+----------+---------+
>
> With the current code (baseline), as the sequential write stream crosses
> a zone boundary, higher queue depth creates a gap between the
> last IO to the previous zone and the first IOs to the following zones,
> causing head seeks and degrading performance. Using the disk zone
> write plugs worker thread, this pattern disappears and the maximum
> throughput of the drive is maintained, leading to over 100%
> improvements in throughput for high queue depth writes.
>
> Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1
> MiB direct IOs, write throughput also increases significantly.
>
> +--------------------+
> | Write BW (MB/s) |
> +------------------+----------+---------+
> | Random write | Baseline | Patched |
> | Number of zones | 6.19-rc7 | |
> +------------------+----------+---------+
> | 1 | 191 | 192 |
> | 2 | 101 | 128 |
> | 4 | 115 | 123 |
> | 8 | 90 | 120 |
> | 16 | 64 | 115 |
> | 32 | 58 | 105 |
> | 64 | 56 | 101 |
> | 128 | 55 | 99 |
> +------------------+----------+---------+
>
> Tests using XFS show that buffered write speed with 8 jobs writing
> files increases by 12% to 35% depending on the workload.
>
> +--------------------+
> | Write BW (MB/s) |
> +------------------+----------+---------+
> | Workload | Baseline | Patched |
> | | 6.19-rc7 | |
> +------------------+----------+---------+
> | 256MiB file size | 212 | 238 |
> +------------------+----------+---------+
> | 4MiB .. 128 MiB | 213 | 243 |
> | random file size | | |
> +------------------+----------+---------+
> | 2MiB .. 8 MiB | 179 | 242 |
> | random file size | | |
> +------------------+----------+---------+
>
> Performance gains are even more significant when using an HBA that
> limits the maximum size of commands to a small value, e.g. HBAs
> controlled with the mpi3mr driver limit commands to a maximum of 1 MiB.
> In such case, the write throughput gains are over 40%.
>
> +--------------------+
> | Write BW (MB/s) |
> +------------------+----------+---------+
> | Workload | Baseline | Patched |
> | | 6.19-rc7 | |
> +------------------+----------+---------+
> | 256MiB file size | 175 | 245 |
> +------------------+----------+---------+
> | 4MiB .. 128 MiB | 174 | 244 |
> | random file size | | |
> +------------------+----------+---------+
> | 2MiB .. 8 MiB | 171 | 243 |
> | random file size | | |
> +------------------+----------+---------+
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-mq-debugfs.c | 1 +
> block/blk-sysfs.c | 35 ++++++++
> block/blk-zoned.c | 197 +++++++++++++++++++++++++++++++++++------
> include/linux/blkdev.h | 8 ++
> 4 files changed, 215 insertions(+), 26 deletions(-)
>
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 28167c9baa55..047ec887456b 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -97,6 +97,7 @@ static const char *const blk_queue_flag_name[] = {
> QUEUE_FLAG_NAME(NO_ELV_SWITCH),
> QUEUE_FLAG_NAME(QOS_ENABLED),
> QUEUE_FLAG_NAME(BIO_ISSUE_TIME),
> + QUEUE_FLAG_NAME(ZONED_QD1_WRITES),
> };
> #undef QUEUE_FLAG_NAME
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index f3b1968c80ce..789802286d95 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -384,6 +384,39 @@ static ssize_t queue_nr_zones_show(struct gendisk *disk, char *page)
> return queue_var_show(disk_nr_zones(disk), page);
> }
>
> +static ssize_t queue_zoned_qd1_writes_show(struct gendisk *disk, char *page)
> +{
> + return queue_var_show(!!blk_queue_zoned_qd1_writes(disk->queue),
> + page);
> +}
> +
> +static ssize_t queue_zoned_qd1_writes_store(struct gendisk *disk,
> + const char *page, size_t count)
> +{
> + struct request_queue *q = disk->queue;
> + unsigned long qd1_writes;
> + unsigned int memflags;
> + ssize_t ret;
> +
> + if (!blk_queue_is_zoned(q))
> + return -EOPNOTSUPP;
> +
> + ret = queue_var_store(&qd1_writes, page, count);
> + if (ret < 0)
> + return ret;
> +
> + memflags = blk_mq_freeze_queue(q);
> + blk_mq_quiesce_queue(q);
> + if (qd1_writes)
> + blk_queue_flag_set(QUEUE_FLAG_ZONED_QD1_WRITES, q);
> + else
> + blk_queue_flag_clear(QUEUE_FLAG_ZONED_QD1_WRITES, q);
> + blk_mq_unquiesce_queue(q);
> + blk_mq_unfreeze_queue(q, memflags);
> +
> + return count;
> +}
> +
> static ssize_t queue_iostats_passthrough_show(struct gendisk *disk, char *page)
> {
> return queue_var_show(!!blk_queue_passthrough_stat(disk->queue), page);
> @@ -611,6 +644,7 @@ QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
> QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
>
> QUEUE_LIM_RO_ENTRY(queue_zoned, "zoned");
> +QUEUE_RW_ENTRY(queue_zoned_qd1_writes, "zoned_qd1_writes");
> QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
> QUEUE_LIM_RO_ENTRY(queue_max_open_zones, "max_open_zones");
> QUEUE_LIM_RO_ENTRY(queue_max_active_zones, "max_active_zones");
> @@ -748,6 +782,7 @@ static struct attribute *queue_attrs[] = {
> &queue_nomerges_entry.attr,
> &queue_poll_entry.attr,
> &queue_poll_delay_entry.attr,
> + &queue_zoned_qd1_writes_entry.attr,
>
> NULL,
> };
Can't you plug it into the 'queue_attr_visible()' function, too,
such that it doesn't even show up for non-zoned drives?
(After all, it's not that you can change between zoned and
non-zoned drives.)
(One hopes :-)
And I really had hoped that we wouldn't need to introduce new
kthreads; after all, workqueues are the new kthreads.
Any reason why you couldn't use workqueues here?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 6/8] block: allow submitting all zone writes from a single context
2026-02-23 12:07 ` Hannes Reinecke
@ 2026-02-24 2:00 ` Damien Le Moal
0 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2026-02-24 2:00 UTC (permalink / raw)
To: Hannes Reinecke, Jens Axboe, linux-block
On 2/23/26 9:07 PM, Hannes Reinecke wrote:
> On 2/21/26 01:44, Damien Le Moal wrote:
>> In order to maintain sequential write patterns per zone with zoned block
>> devices, zone write plugging issues only a single write BIO per zone at
>> any time. This works well but has the side effect that when large
>> sequential write streams are issued by the user and these streams cross
>> zone boundaries, the device ends up receiving a discontiguous set of
>> write commands for different zones. The same also happens when a user
>> simultaneously writes to multiple zones at high queue depth: the device
>> does not see all sequential writes per zone and receives discontiguous
>> writes to different zones. While this does not affect the performance of
>> solid state zoned block devices, when using an SMR HDD, this pattern
>> change from sequential writes to discontiguous writes to different zones
>> significantly increases head seeking, which results in degraded write
>> throughput.
>>
>> In order to reduce this seek overhead for rotational media devices,
>> introduce a per disk zone write plugs kernel thread to issue all write
>> BIOs to zones. This single zone write issuing context is enabled for
>> any zoned block device that has a request queue flagged with the new
>> QUEUE_FLAG_ZONED_QD1_WRITES flag.
>>
>> The flag QUEUE_FLAG_ZONED_QD1_WRITES is visible as the sysfs queue
>> attribute zoned_qd1_writes. For a regular block device, this attribute
>> cannot be changed and is always 0. For zoned block devices, a user can
>> set this attribute to force a global maximum write queue depth of 1 for
>> the device, or clear it to fall back to the default zone write plugging
>> behavior, which limits writes to QD=1 per sequential zone.
>>
>> Writing to a zoned block device flagged with QUEUE_FLAG_ZONED_QD1_WRITES is
>> implemented using a list of zone write plugs that have a non-empty BIO
>> list. Listed zone write plugs are processed by the disk zone write plugs
>> worker kthread in FIFO order, and all BIOs of a zone write plug are
>> processed before switching to the next listed zone write plug. A newly
>> submitted BIO for a non-FULL zone write plug that is not yet listed
>> causes the addition of the zone write plug at the end of the disk list
>> of zone write plugs.
>>
>> Since the write BIOs queued in a zone write plug BIO list are
>> necessarily sequential, for rotational media, using the single zone
>> write plugs kthread to issue all BIOs maintains a sequential write
>> pattern and thus reduces seek overhead and improves write throughput.
>> This processing essentially results in always writing to HDDs at QD=1,
>> which is not an issue for HDDs operating with write caching enabled.
>> Performance with write cache disabled is also not degraded thanks to
>> the efficient write handling of modern SMR HDDs.
>>
>> A disk list of zone write plugs is defined using the new struct gendisk
>> zone_wplugs_list, and accesses to this list are protected using the
>> zone_wplugs_list_lock spinlock. The per disk kthread
>> (zone_wplugs_worker) code is implemented by the function
>> disk_zone_wplugs_worker(). A reference on listed zone write plugs is
>> always held until all BIOs of the zone write plug are processed by the
>> worker kthread. BIO issuing at QD=1 is driven using a completion
>> structure (zone_wplugs_worker_bio_done) and calls to blk_wait_io().
>>
>> With this change, performance when sequentially writing the zones of a
>> 30 TB SMR SATA HDD connected to an AHCI adapter changes as follows
>> (1MiB direct I/Os, results in MB/s unit):
>>
>> +--------------------+
>> | Write BW (MB/s) |
>> +------------------+----------+---------+
>> | Sequential write | Baseline | Patched |
>> | Queue Depth | 6.19-rc8 | |
>> +------------------+----------+---------+
>> | 1 | 244 | 245 |
>> | 2 | 244 | 245 |
>> | 4 | 245 | 245 |
>> | 8 | 242 | 245 |
>> | 16 | 222 | 246 |
>> | 32 | 211 | 245 |
>> | 64 | 193 | 244 |
>> | 128 | 112 | 246 |
>> +------------------+----------+---------+
>>
>> With the current code (baseline), as the sequential write stream crosses
>> a zone boundary, higher queue depth creates a gap between the
>> last IO to the previous zone and the first IOs to the following zones,
>> causing head seeks and degrading performance. Using the disk zone
>> write plugs worker thread, this pattern disappears and the maximum
>> throughput of the drive is maintained, leading to over 100%
>> improvements in throughput for high queue depth writes.
>>
>> Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1
>> MiB direct IOs, write throughput also increases significantly.
>>
>> +--------------------+
>> | Write BW (MB/s) |
>> +------------------+----------+---------+
>> | Random write | Baseline | Patched |
>> | Number of zones | 6.19-rc7 | |
>> +------------------+----------+---------+
>> | 1 | 191 | 192 |
>> | 2 | 101 | 128 |
>> | 4 | 115 | 123 |
>> | 8 | 90 | 120 |
>> | 16 | 64 | 115 |
>> | 32 | 58 | 105 |
>> | 64 | 56 | 101 |
>> | 128 | 55 | 99 |
>> +------------------+----------+---------+
>>
>> Tests using XFS show that buffered write speed with 8 jobs writing
>> files increases by 12% to 35% depending on the workload.
>>
>> +--------------------+
>> | Write BW (MB/s) |
>> +------------------+----------+---------+
>> | Workload | Baseline | Patched |
>> | | 6.19-rc7 | |
>> +------------------+----------+---------+
>> | 256MiB file size | 212 | 238 |
>> +------------------+----------+---------+
>> | 4MiB .. 128 MiB | 213 | 243 |
>> | random file size | | |
>> +------------------+----------+---------+
>> | 2MiB .. 8 MiB | 179 | 242 |
>> | random file size | | |
>> +------------------+----------+---------+
>>
>> Performance gains are even more significant when using an HBA that
>> limits the maximum size of commands to a small value, e.g. HBAs
>> controlled with the mpi3mr driver limit commands to a maximum of 1 MiB.
>> In such case, the write throughput gains are over 40%.
>>
>> +--------------------+
>> | Write BW (MB/s) |
>> +------------------+----------+---------+
>> | Workload | Baseline | Patched |
>> | | 6.19-rc7 | |
>> +------------------+----------+---------+
>> | 256MiB file size | 175 | 245 |
>> +------------------+----------+---------+
>> | 4MiB .. 128 MiB | 174 | 244 |
>> | random file size | | |
>> +------------------+----------+---------+
>> | 2MiB .. 8 MiB | 171 | 243 |
>> | random file size | | |
>> +------------------+----------+---------+
>>
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>> block/blk-mq-debugfs.c | 1 +
>> block/blk-sysfs.c | 35 ++++++++
>> block/blk-zoned.c | 197 +++++++++++++++++++++++++++++++++++------
>> include/linux/blkdev.h | 8 ++
>> 4 files changed, 215 insertions(+), 26 deletions(-)
>>
>> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
>> index 28167c9baa55..047ec887456b 100644
>> --- a/block/blk-mq-debugfs.c
>> +++ b/block/blk-mq-debugfs.c
>> @@ -97,6 +97,7 @@ static const char *const blk_queue_flag_name[] = {
>> QUEUE_FLAG_NAME(NO_ELV_SWITCH),
>> QUEUE_FLAG_NAME(QOS_ENABLED),
>> QUEUE_FLAG_NAME(BIO_ISSUE_TIME),
>> + QUEUE_FLAG_NAME(ZONED_QD1_WRITES),
>> };
>> #undef QUEUE_FLAG_NAME
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index f3b1968c80ce..789802286d95 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -384,6 +384,39 @@ static ssize_t queue_nr_zones_show(struct gendisk *disk,
>> char *page)
>> return queue_var_show(disk_nr_zones(disk), page);
>> }
>> +static ssize_t queue_zoned_qd1_writes_show(struct gendisk *disk, char *page)
>> +{
>> + return queue_var_show(!!blk_queue_zoned_qd1_writes(disk->queue),
>> + page);
>> +}
>> +
>> +static ssize_t queue_zoned_qd1_writes_store(struct gendisk *disk,
>> + const char *page, size_t count)
>> +{
>> + struct request_queue *q = disk->queue;
>> + unsigned long qd1_writes;
>> + unsigned int memflags;
>> + ssize_t ret;
>> +
>> + if (!blk_queue_is_zoned(q))
>> + return -EOPNOTSUPP;
>> +
>> + ret = queue_var_store(&qd1_writes, page, count);
>> + if (ret < 0)
>> + return ret;
>> +
>> + memflags = blk_mq_freeze_queue(q);
>> + blk_mq_quiesce_queue(q);
>> + if (qd1_writes)
>> + blk_queue_flag_set(QUEUE_FLAG_ZONED_QD1_WRITES, q);
>> + else
>> + blk_queue_flag_clear(QUEUE_FLAG_ZONED_QD1_WRITES, q);
>> + blk_mq_unquiesce_queue(q);
>> + blk_mq_unfreeze_queue(q, memflags);
>> +
>> + return count;
>> +}
>> +
>> static ssize_t queue_iostats_passthrough_show(struct gendisk *disk, char
>> *page)
>> {
>> return queue_var_show(!!blk_queue_passthrough_stat(disk->queue), page);
>> @@ -611,6 +644,7 @@ QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors,
>> "zone_append_max_bytes");
>> QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
>> QUEUE_LIM_RO_ENTRY(queue_zoned, "zoned");
>> +QUEUE_RW_ENTRY(queue_zoned_qd1_writes, "zoned_qd1_writes");
>> QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
>> QUEUE_LIM_RO_ENTRY(queue_max_open_zones, "max_open_zones");
>> QUEUE_LIM_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>> @@ -748,6 +782,7 @@ static struct attribute *queue_attrs[] = {
>> &queue_nomerges_entry.attr,
>> &queue_poll_entry.attr,
>> &queue_poll_delay_entry.attr,
>> + &queue_zoned_qd1_writes_entry.attr,
>> NULL,
>> };
>
> Can't you plug it into the 'queue_attr_visible()' function, too,
> such that it doesn't even show up for non-zoned drives?
> (After all, it's not that you can change between zoned and
> non-zoned drives.)
> (One hopes :-)
Good point. Will look into that.
> And I really had hoped that we wouldn't need to introduce new
> kthreads, rather that workqueue are the new kthreads.
> Any reasoning why you couldn't use workqueues here?
I can use a single work item instead of a kthread. I had that initially, but
the code ended up cleaner with the kthread. I can try again with the work item
if you insist...
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 7/8] block: default to QD=1 writes for blk-mq rotational zoned devices
2026-02-21 0:44 [PATCH 0/8] Improve zoned (SMR) HDD write throughput Damien Le Moal
` (5 preceding siblings ...)
2026-02-21 0:44 ` [PATCH 6/8] block: allow submitting all zone writes from a single context Damien Le Moal
@ 2026-02-21 0:44 ` Damien Le Moal
2026-02-23 12:07 ` Hannes Reinecke
2026-02-21 0:44 ` [PATCH 8/8] Documentation: ABI: stable: document the zoned_qd1_writes attribute Damien Le Moal
2026-02-23 17:03 ` [PATCH 0/8] Improve zoned (SMR) HDD write throughput Bart Van Assche
8 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2026-02-21 0:44 UTC (permalink / raw)
To: Jens Axboe, linux-block
For blk-mq rotational zoned block devices (e.g. SMR HDDs), default to
having zone write plugging limit write operations to a maximum queue
depth of 1 for all zones. This significantly reduces write seek overhead
and improves SMR HDD write throughput.
For remotely connected disks with a very high network latency this
feature might not be useful. However, remotely connected zoned devices
are rare at the moment, and we cannot know the round trip latency to
pick a good default for network attached devices. System administrators
can however disable this feature in that case.
For BIO based (non blk-mq) rotational zoned block devices, the device
driver (e.g. a DM target driver) can directly set an appropriate
default.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-sysfs.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 789802286d95..0e552b8f5bbd 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -963,6 +963,14 @@ int blk_register_queue(struct gendisk *disk)
blk_mq_debugfs_register(q);
blk_debugfs_unlock(q, memflags);
+ /*
+ * For blk-mq rotational zoned devices, default to using QD=1
+ * writes. For non-mq rotational zoned devices, the device driver can
+ * set an appropriate default.
+ */
+ if (queue_is_mq(q) && blk_queue_rot(q) && blk_queue_is_zoned(q))
+ blk_queue_flag_set(QUEUE_FLAG_ZONED_QD1_WRITES, q);
+
ret = disk_register_independent_access_ranges(disk);
if (ret)
goto out_debugfs_remove;
--
2.53.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH 7/8] block: default to QD=1 writes for blk-mq rotational zoned devices
2026-02-21 0:44 ` [PATCH 7/8] block: default to QD=1 writes for blk-mq rotational zoned devices Damien Le Moal
@ 2026-02-23 12:07 ` Hannes Reinecke
0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2026-02-23 12:07 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block
On 2/21/26 01:44, Damien Le Moal wrote:
> For blk-mq rotational zoned block devices (e.g. SMR HDDs), default to
> having zone write plugging limit write operations to a maximum queue
> depth of 1 for all zones. This significantly reduces write seek overhead
> and improves SMR HDD write throughput.
>
> For remotely connected disks with a very high network latency this
> feature might not be useful. However, remotely connected zoned devices
> are rare at the moment, and we cannot know the round trip latency to
> pick a good default for network attached devices. System administrators
> can however disable this feature in that case.
>
> For BIO based (non blk-mq) rotational zoned block devices, the device
> driver (e.g. a DM target driver) can directly set an appropriate
> default.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-sysfs.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 8/8] Documentation: ABI: stable: document the zoned_qd1_writes attribute
2026-02-21 0:44 [PATCH 0/8] Improve zoned (SMR) HDD write throughput Damien Le Moal
` (6 preceding siblings ...)
2026-02-21 0:44 ` [PATCH 7/8] block: default to QD=1 writes for blk-mq rotational zoned devices Damien Le Moal
@ 2026-02-21 0:44 ` Damien Le Moal
2026-02-23 12:07 ` Hannes Reinecke
2026-02-23 17:03 ` [PATCH 0/8] Improve zoned (SMR) HDD write throughput Bart Van Assche
8 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2026-02-21 0:44 UTC (permalink / raw)
To: Jens Axboe, linux-block
Update the documentation file Documentation/ABI/stable/sysfs-block to
describe the zoned_qd1_writes sysfs queue attribute file.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
Documentation/ABI/stable/sysfs-block | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 09a9d4aca0fd..bc3056af5eb2 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -886,6 +886,21 @@ Description:
zone commands, they will be treated as regular block devices and
zoned will report "none".
+What: /sys/block/<disk>/queue/zoned_qd1_writes
+Date: January 2026
+Contact: Damien Le Moal <dlemoal@kernel.org>
+Description:
+ [RW] zoned_qd1_writes indicates if write operations to a zoned
+ block device are being handled using a single issuer context (a
+ kernel thread) operating at a maximum queue depth of 1. A value
+ of 0 is always reported for regular block devices and for zoned
+ block devices that are not rotational devices (e.g. ZNS SSDs or
+ zoned UFS devices). For rotational zoned block devices (e.g. SMR
+ HDDs) the default value is 1. Since this default may not be
+ appropriate for some devices, e.g. remotely connected devices
+ over high latency networks, the user can disable this feature by
+ setting this attribute to 0.
+
What: /sys/block/<disk>/hidden
Date: March 2023
--
2.53.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH 8/8] Documentation: ABI: stable: document the zoned_qd1_writes attribute
2026-02-21 0:44 ` [PATCH 8/8] Documentation: ABI: stable: document the zoned_qd1_writes attribute Damien Le Moal
@ 2026-02-23 12:07 ` Hannes Reinecke
0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2026-02-23 12:07 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block
On 2/21/26 01:44, Damien Le Moal wrote:
> Update the documentation file Documentation/ABI/stable/sysfs-block to
> describe the zoned_qd1_writes sysfs queue attribute file.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> Documentation/ABI/stable/sysfs-block | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 0/8] Improve zoned (SMR) HDD write throughput
2026-02-21 0:44 [PATCH 0/8] Improve zoned (SMR) HDD write throughput Damien Le Moal
` (7 preceding siblings ...)
2026-02-21 0:44 ` [PATCH 8/8] Documentation: ABI: stable: document the zoned_qd1_writes attribute Damien Le Moal
@ 2026-02-23 17:03 ` Bart Van Assche
2026-02-24 1:07 ` Damien Le Moal
8 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2026-02-23 17:03 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block
On 2/20/26 4:44 PM, Damien Le Moal wrote:
> This patch series cleans up the zone write plugging code and introduces
> the ability to issue all write BIOs from a single context (a kthread)
> instead of allowing multiple zones to be written at the same time using
> a per zone work. As shown in patch 6, raw block device tests and XFS
> tests with an SMR HDD show that this can significantly increase write
> throughput (up to 40% over the current zone write plugging).
Hi Damien,
Is a new kthread necessary? Has the following approach been considered?
* Make the dm drivers that support rotational zoned storage devices
request-based instead of bio-based.
* Modify blk_mq_get_tag() such that only one tag can be allocated at a
time for zoned write requests.
I think that would be sufficient to serialize zoned writes.
Additionally, this approach doesn't increase request processing latency
by forcing a context switch to a kthread.
For reference, from drivers/md/dm.h:
/*
* To check whether the target type is request-based or not (bio-based).
*/
#define dm_target_request_based(t) ((t)->type->clone_and_map_rq != NULL)
Thanks,
Bart.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 0/8] Improve zoned (SMR) HDD write throughput
2026-02-23 17:03 ` [PATCH 0/8] Improve zoned (SMR) HDD write throughput Bart Van Assche
@ 2026-02-24 1:07 ` Damien Le Moal
0 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2026-02-24 1:07 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block
On 2/24/26 2:03 AM, Bart Van Assche wrote:
> On 2/20/26 4:44 PM, Damien Le Moal wrote:
>> This patch series cleans up the zone write plugging code and introduces
>> the ability to issue all write BIOs from a single context (a kthread)
>> instead of allowing multiple zones to be written at the same time using
>> a per zone work. As shown in patch 6, raw block device tests and XFS
>> tests with an SMR HDD show that this can significantly increase write
>> throughput (up to 40% over the current zone write plugging).
> Hi Damien,
>
> Is a new kthread necessary? Has the following approach been considered?
> * Make the dm drivers that support rotational zoned storage devices
> request-based instead of bio-based.
What you are suggesting would be an enormous amount of work (dm-linear,
dm-flakey, dm-error, dm-crypt) to change generic code into DM targets that
would be very specialized for just SMR HDDs. I do not understand why you think
that would be a good idea.
> * Modify blk_mq_get_tag() such that only one tag can be allocated at a
> time for zoned write requests.
Sure, doing that would limit the number of write requests to zones to 1 at most
at any time, but that would also result in a total loss of control over which
zone write BIO work gets that single tag, meaning that the writes would in the
end be mostly random again, like they are now. So with this solution, I can say
goodbye to the +40% write throughput increase that I am seeing with the kthread.
Also note that this idea of limiting write tags combined with your idea of
using req based DM targets would likely negatively impact dm-crypt performance
as we would lose the ability to encrypt multiple writes in parallel on
different CPUs.
> I think that would be sufficient to serialize zoned writes. Additionally, this
> approach doesn't increase request processing latency
> by forcing a context switch to a kthread.
This point is in my opinion moot because we currently use work items to issue
the write commands. We are scheduling the zone write plug BIO works in the
submission and completion path, so the context switch overhead is already
there. And I would argue that using work items is potentially even more overhead
than using a fixed kthread since the work items need to be assigned to CPUs and
worker threads.
Granted, your point is valid for a QD=1 workload. In this case, I am indeed
introducing a context switch where there is none now. But that is not really
the use case we are looking at here. File system writeback does not happen at
QD=1 per zone. Also, this added overhead does not really matter for HDDs
anyway, and if that is really an issue, the user can enable the legacy zone
write plugging behavior with "echo 0 > /sys/block/sdX/queue/zoned_qd1_writes".
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 31+ messages in thread