From: Damien Le Moal <dlemoal@kernel.org>
To: Hannes Reinecke <hare@suse.de>, Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org
Subject: Re: [PATCH 6/8] block: allow submitting all zone writes from a single context
Date: Tue, 24 Feb 2026 11:00:44 +0900 [thread overview]
Message-ID: <5ced7cb1-49a2-4ed3-aa93-3784f853cc38@kernel.org> (raw)
In-Reply-To: <f61591f3-1b0a-44c2-b53c-9b10fbf4a432@suse.de>
On 2/23/26 9:07 PM, Hannes Reinecke wrote:
> On 2/21/26 01:44, Damien Le Moal wrote:
>> In order to maintain sequential write patterns per zone with zoned block
>> devices, zone write plugging issues only a single write BIO per zone at
>> any time. This works well but has the side effect that when large
>> sequential write streams are issued by the user and these streams cross
>> zone boundaries, the device ends up receiving a discontiguous set of
>> write commands for different zones. The same also happens when a user
>> simultaneously writes to multiple zones at a high queue depth: the
>> device does not see the sequential writes to each zone and instead
>> receives discontiguous writes to different zones. While this does not
>> affect the performance of solid state zoned block devices, on an SMR
>> HDD this change from sequential writes to discontiguous writes to
>> different zones significantly increases head seeks, which results in
>> degraded write throughput.
>>
>> In order to reduce this seek overhead for rotational media devices,
>> introduce a per disk zone write plugs kernel thread to issue all write
>> BIOs to zones. This single zone write issuing context is enabled for
>> any zoned block device that has a request queue flagged with the new
>> QUEUE_ZONED_QD1_WRITES flag.
>>
>> The flag QUEUE_ZONED_QD1_WRITES is visible as the sysfs queue attribute
>> zoned_qd1_writes. For a regular block device, this attribute cannot be
>> changed and is always 0. For zoned block devices, a user can set this
>> attribute to force a global maximum write queue depth of 1 for the
>> device, or clear it to fall back to the default zone write plugging
>> behavior, which limits writes to QD=1 per sequential zone.
>>
>> Writing to a zoned block device flagged with QUEUE_ZONED_QD1_WRITES is
>> implemented using a list of zone write plugs that have a non-empty BIO
>> list. Listed zone write plugs are processed by the disk zone write plug
>> worker kthread in FIFO order, and all BIOs of a zone write plug are
>> processed before switching to the next listed zone write plug. A newly
>> submitted BIO for a non-FULL zone write plug that is not yet listed
>> causes the zone write plug to be added at the end of the disk list of
>> zone write plugs.
>>
>> Since the write BIOs queued in a zone write plug BIO list are
>> necessarily sequential, for rotational media, using the single zone
>> write plugs kthread to issue all BIOs maintains a sequential write
>> pattern and thus reduces seek overhead and improves write throughput.
>> This processing essentially results in always writing to HDDs at QD=1,
>> which is not an issue for HDDs operating with write caching enabled.
>> Performance with the write cache disabled is also not degraded, thanks
>> to the efficient write handling of modern SMR HDDs.
>>
>> A disk list of zone write plugs is defined using the new struct gendisk
>> zone_wplugs_list field, and accesses to this list are protected using
>> the zone_wplugs_list_lock spinlock. The per disk kthread
>> (zone_wplugs_worker) code is implemented by the function
>> disk_zone_wplugs_worker(). A reference on listed zone write plugs is
>> always held until all BIOs of the zone write plug are processed by the
>> worker kthread. BIO issuing at QD=1 is driven using a completion
>> structure (zone_wplugs_worker_bio_done) and calls to blk_io_wait().
>>
>> With this change, performance when sequentially writing the zones of a
>> 30 TB SMR SATA HDD connected to an AHCI adapter changes as follows
>> (1MiB direct I/Os, results in MB/s unit):
>>
>> +---------------------------------------+
>> |            Write BW (MB/s)            |
>> +------------------+----------+---------+
>> | Sequential write | Baseline | Patched |
>> | Queue Depth | 6.19-rc8 | |
>> +------------------+----------+---------+
>> | 1 | 244 | 245 |
>> | 2 | 244 | 245 |
>> | 4 | 245 | 245 |
>> | 8 | 242 | 245 |
>> | 16 | 222 | 246 |
>> | 32 | 211 | 245 |
>> | 64 | 193 | 244 |
>> | 128 | 112 | 246 |
>> +------------------+----------+---------+
>>
>> With the current code (baseline), as the sequential write stream crosses
>> a zone boundary, higher queue depth creates a gap between the
>> last IO to the previous zone and the first IOs to the following zones,
>> causing head seeks and degrading performance. Using the disk zone
>> write plugs worker thread, this pattern disappears and the maximum
>> throughput of the drive is maintained, leading to over 100%
>> improvement in throughput for high queue depth writes.
>>
>> Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1
>> MiB direct IOs, write throughput also increases significantly.
>>
>> +---------------------------------------+
>> |            Write BW (MB/s)            |
>> +------------------+----------+---------+
>> | Random write | Baseline | Patched |
>> | Number of zones | 6.19-rc7 | |
>> +------------------+----------+---------+
>> | 1 | 191 | 192 |
>> | 2 | 101 | 128 |
>> | 4 | 115 | 123 |
>> | 8 | 90 | 120 |
>> | 16 | 64 | 115 |
>> | 32 | 58 | 105 |
>> | 64 | 56 | 101 |
>> | 128 | 55 | 99 |
>> +------------------+----------+---------+
>>
>> Tests using XFS show that buffered write speed with 8 jobs writing
>> files increases by 12% to 35% depending on the workload.
>>
>> +---------------------------------------+
>> |            Write BW (MB/s)            |
>> +------------------+----------+---------+
>> | Workload | Baseline | Patched |
>> | | 6.19-rc7 | |
>> +------------------+----------+---------+
>> | 256MiB file size | 212 | 238 |
>> +------------------+----------+---------+
>> | 4MiB .. 128 MiB | 213 | 243 |
>> | random file size | | |
>> +------------------+----------+---------+
>> | 2MiB .. 8 MiB | 179 | 242 |
>> | random file size | | |
>> +------------------+----------+---------+
>>
>> Performance gains are even more significant when using an HBA that
>> limits the maximum command size to a small value, e.g. HBAs controlled
>> by the mpi3mr driver limit commands to a maximum of 1 MiB. In such
>> cases, the write throughput gains are over 40%.
>>
>> +---------------------------------------+
>> |            Write BW (MB/s)            |
>> +------------------+----------+---------+
>> | Workload | Baseline | Patched |
>> | | 6.19-rc7 | |
>> +------------------+----------+---------+
>> | 256MiB file size | 175 | 245 |
>> +------------------+----------+---------+
>> | 4MiB .. 128 MiB | 174 | 244 |
>> | random file size | | |
>> +------------------+----------+---------+
>> | 2MiB .. 8 MiB | 171 | 243 |
>> | random file size | | |
>> +------------------+----------+---------+
>>
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>> block/blk-mq-debugfs.c | 1 +
>> block/blk-sysfs.c | 35 ++++++++
>> block/blk-zoned.c | 197 +++++++++++++++++++++++++++++++++++------
>> include/linux/blkdev.h | 8 ++
>> 4 files changed, 215 insertions(+), 26 deletions(-)
>>
>> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
>> index 28167c9baa55..047ec887456b 100644
>> --- a/block/blk-mq-debugfs.c
>> +++ b/block/blk-mq-debugfs.c
>> @@ -97,6 +97,7 @@ static const char *const blk_queue_flag_name[] = {
>> QUEUE_FLAG_NAME(NO_ELV_SWITCH),
>> QUEUE_FLAG_NAME(QOS_ENABLED),
>> QUEUE_FLAG_NAME(BIO_ISSUE_TIME),
>> + QUEUE_FLAG_NAME(ZONED_QD1_WRITES),
>> };
>> #undef QUEUE_FLAG_NAME
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index f3b1968c80ce..789802286d95 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -384,6 +384,39 @@ static ssize_t queue_nr_zones_show(struct gendisk *disk, char *page)
>> return queue_var_show(disk_nr_zones(disk), page);
>> }
>> +static ssize_t queue_zoned_qd1_writes_show(struct gendisk *disk, char *page)
>> +{
>> + return queue_var_show(!!blk_queue_zoned_qd1_writes(disk->queue),
>> + page);
>> +}
>> +
>> +static ssize_t queue_zoned_qd1_writes_store(struct gendisk *disk,
>> + const char *page, size_t count)
>> +{
>> + struct request_queue *q = disk->queue;
>> + unsigned long qd1_writes;
>> + unsigned int memflags;
>> + ssize_t ret;
>> +
>> + if (!blk_queue_is_zoned(q))
>> + return -EOPNOTSUPP;
>> +
>> + ret = queue_var_store(&qd1_writes, page, count);
>> + if (ret < 0)
>> + return ret;
>> +
>> + memflags = blk_mq_freeze_queue(q);
>> + blk_mq_quiesce_queue(q);
>> + if (qd1_writes)
>> + blk_queue_flag_set(QUEUE_FLAG_ZONED_QD1_WRITES, q);
>> + else
>> + blk_queue_flag_clear(QUEUE_FLAG_ZONED_QD1_WRITES, q);
>> + blk_mq_unquiesce_queue(q);
>> + blk_mq_unfreeze_queue(q, memflags);
>> +
>> + return count;
>> +}
>> +
>> static ssize_t queue_iostats_passthrough_show(struct gendisk *disk, char *page)
>> {
>> return queue_var_show(!!blk_queue_passthrough_stat(disk->queue), page);
>> @@ -611,6 +644,7 @@ QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
>> QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
>> QUEUE_LIM_RO_ENTRY(queue_zoned, "zoned");
>> +QUEUE_RW_ENTRY(queue_zoned_qd1_writes, "zoned_qd1_writes");
>> QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
>> QUEUE_LIM_RO_ENTRY(queue_max_open_zones, "max_open_zones");
>> QUEUE_LIM_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>> @@ -748,6 +782,7 @@ static struct attribute *queue_attrs[] = {
>> &queue_nomerges_entry.attr,
>> &queue_poll_entry.attr,
>> &queue_poll_delay_entry.attr,
>> + &queue_zoned_qd1_writes_entry.attr,
>> NULL,
>> };
>
> Can't you plug it into the 'queue_attr_visible()' function, too,
> such that it doesn't even show up for non-zoned drives?
> (After all, it's not that you can change between zoned and
> non-zoned drives.)
> (One hopes :-)
Good point. Will look into that.
> And I really had hoped that we wouldn't need to introduce new
> kthreads, rather that workqueue are the new kthreads.
> Any reasoning why you couldn't use workqueues here?
I can use a single work item instead of a kthread. I had that initially, but
the code ended up cleaner with the kthread. I can try again with the work item
if you insist...
--
Damien Le Moal
Western Digital Research