From: Damien Le Moal <dlemoal@kernel.org>
To: Hannes Reinecke <hare@suse.de>, Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org
Subject: Re: [PATCH 6/8] block: allow submitting all zone writes from a single context
Date: Tue, 24 Feb 2026 11:00:44 +0900 [thread overview]
Message-ID: <5ced7cb1-49a2-4ed3-aa93-3784f853cc38@kernel.org> (raw)
In-Reply-To: <f61591f3-1b0a-44c2-b53c-9b10fbf4a432@suse.de>
On 2/23/26 9:07 PM, Hannes Reinecke wrote:
> On 2/21/26 01:44, Damien Le Moal wrote:
>> In order to maintain sequential write patterns per zone with zoned block
>> devices, zone write plugging issues only a single write BIO per zone at
>> any time. This works well but has the side effect that when large
>> sequential write streams are issued by the user and these streams cross
>> zone boundaries, the device ends up receiving a discontiguous set of
>> write commands for different zones. The same also happens when a user
>> simultaneously writes to multiple zones at a high queue depth: the
>> device does not see the sequential writes to each zone and instead
>> receives discontiguous writes to different zones. While this does not
>> affect the performance of solid state zoned block devices, on an SMR
>> HDD this change from sequential writes to discontiguous writes to
>> different zones significantly increases head seeks, which results in
>> degraded write throughput.
>>
>> In order to reduce this seek overhead for rotational media devices,
>> introduce a per disk zone write plugs kernel thread to issue all write
>> BIOs to zones. This single zone write issuing context is enabled for
>> any zoned block device that has a request queue flagged with the new
>> QUEUE_ZONED_QD1_WRITES flag.
>>
>> The flag QUEUE_ZONED_QD1_WRITES is visible as the sysfs queue attribute
>> zoned_qd1_writes. For a regular block device, this attribute cannot be
>> changed and is always 0. For zoned block devices, a user can set this
>> attribute to force a global maximum write queue depth of 1 for the
>> device, or clear it to fall back to the default zone write plugging
>> behavior, which limits writes to QD=1 per sequential zone.
>>
>> Writing to a zoned block device flagged with QUEUE_ZONED_QD1_WRITES is
>> implemented using a list of zone write plugs that have a non-empty BIO
>> list. Listed zone write plugs are processed by the disk zone write plug
>> worker kthread in FIFO order, and all BIOs of a zone write plug are
>> processed before switching to the next listed zone write plug. A newly
>> submitted BIO for a non-FULL zone write plug that is not yet listed
>> causes the zone write plug to be added at the end of the disk list of
>> zone write plugs.
>>
>> Since the write BIOs queued in a zone write plug BIO list are
>> necessarily sequential, for rotational media, using the single zone
>> write plugs kthread to issue all BIOs maintains a sequential write
>> pattern and thus reduces seek overhead and improves write throughput.
>> This processing essentially results in always writing to HDDs at QD=1,
>> which is not an issue for HDDs operating with write caching enabled.
>> Performance with the write cache disabled is also not degraded, thanks
>> to the efficient write handling of modern SMR HDDs.
>>
>> A disk list of zone write plugs is defined using the new struct gendisk
>> zone_wplugs_list field, and accesses to this list are protected using
>> the zone_wplugs_list_lock spinlock. The per disk kthread
>> (zone_wplugs_worker) code is implemented by the function
>> disk_zone_wplugs_worker(). A reference on listed zone write plugs is
>> always held until all BIOs of the zone write plug are processed by the
>> worker kthread. BIO issuing at QD=1 is driven using a completion
>> structure (zone_wplugs_worker_bio_done) and calls to blk_io_wait().
>>
>> With this change, performance when sequentially writing the zones of a
>> 30 TB SMR SATA HDD connected to an AHCI adapter changes as follows
>> (1MiB direct I/Os, results in MB/s unit):
>>
>> +---------------------------------------+
>> |            Write BW (MB/s)            |
>> +------------------+----------+---------+
>> | Sequential write | Baseline | Patched |
>> | Queue Depth | 6.19-rc8 | |
>> +------------------+----------+---------+
>> | 1 | 244 | 245 |
>> | 2 | 244 | 245 |
>> | 4 | 245 | 245 |
>> | 8 | 242 | 245 |
>> | 16 | 222 | 246 |
>> | 32 | 211 | 245 |
>> | 64 | 193 | 244 |
>> | 128 | 112 | 246 |
>> +------------------+----------+---------+
>>
>> With the current code (baseline), as the sequential write stream crosses
>> a zone boundary, higher queue depth creates a gap between the
>> last IO to the previous zone and the first IOs to the following zones,
>> causing head seeks and degrading performance. Using the disk zone
>> write plugs worker thread, this pattern disappears and the maximum
>> throughput of the drive is maintained, leading to over 100%
>> improvement in throughput for high queue depth writes.
>>
>> Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1
>> MiB direct IOs, write throughput also increases significantly.
>>
>> +---------------------------------------+
>> |            Write BW (MB/s)            |
>> +------------------+----------+---------+
>> | Random write | Baseline | Patched |
>> | Number of zones | 6.19-rc7 | |
>> +------------------+----------+---------+
>> | 1 | 191 | 192 |
>> | 2 | 101 | 128 |
>> | 4 | 115 | 123 |
>> | 8 | 90 | 120 |
>> | 16 | 64 | 115 |
>> | 32 | 58 | 105 |
>> | 64 | 56 | 101 |
>> | 128 | 55 | 99 |
>> +------------------+----------+---------+
>>
>> Tests using XFS show that buffered write speed with 8 jobs writing
>> files increases by 12% to 35% depending on the workload.
>>
>> +---------------------------------------+
>> |            Write BW (MB/s)            |
>> +------------------+----------+---------+
>> | Workload | Baseline | Patched |
>> | | 6.19-rc7 | |
>> +------------------+----------+---------+
>> | 256MiB file size | 212 | 238 |
>> +------------------+----------+---------+
>> | 4MiB .. 128 MiB | 213 | 243 |
>> | random file size | | |
>> +------------------+----------+---------+
>> | 2MiB .. 8 MiB | 179 | 242 |
>> | random file size | | |
>> +------------------+----------+---------+
>>
>> Performance gains are even more significant when using an HBA that
>> limits the maximum command size to a small value, e.g. HBAs controlled
>> by the mpi3mr driver limit commands to a maximum of 1 MiB. In such
>> cases, the write throughput gains are over 40%.
>>
>> +---------------------------------------+
>> |            Write BW (MB/s)            |
>> +------------------+----------+---------+
>> | Workload | Baseline | Patched |
>> | | 6.19-rc7 | |
>> +------------------+----------+---------+
>> | 256MiB file size | 175 | 245 |
>> +------------------+----------+---------+
>> | 4MiB .. 128 MiB | 174 | 244 |
>> | random file size | | |
>> +------------------+----------+---------+
>> | 2MiB .. 8 MiB | 171 | 243 |
>> | random file size | | |
>> +------------------+----------+---------+
>>
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>> block/blk-mq-debugfs.c | 1 +
>> block/blk-sysfs.c | 35 ++++++++
>> block/blk-zoned.c | 197 +++++++++++++++++++++++++++++++++++------
>> include/linux/blkdev.h | 8 ++
>> 4 files changed, 215 insertions(+), 26 deletions(-)
>>
>> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
>> index 28167c9baa55..047ec887456b 100644
>> --- a/block/blk-mq-debugfs.c
>> +++ b/block/blk-mq-debugfs.c
>> @@ -97,6 +97,7 @@ static const char *const blk_queue_flag_name[] = {
>> QUEUE_FLAG_NAME(NO_ELV_SWITCH),
>> QUEUE_FLAG_NAME(QOS_ENABLED),
>> QUEUE_FLAG_NAME(BIO_ISSUE_TIME),
>> + QUEUE_FLAG_NAME(ZONED_QD1_WRITES),
>> };
>> #undef QUEUE_FLAG_NAME
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index f3b1968c80ce..789802286d95 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -384,6 +384,39 @@ static ssize_t queue_nr_zones_show(struct gendisk *disk, char *page)
>> return queue_var_show(disk_nr_zones(disk), page);
>> }
>> +static ssize_t queue_zoned_qd1_writes_show(struct gendisk *disk, char *page)
>> +{
>> + return queue_var_show(!!blk_queue_zoned_qd1_writes(disk->queue),
>> + page);
>> +}
>> +
>> +static ssize_t queue_zoned_qd1_writes_store(struct gendisk *disk,
>> + const char *page, size_t count)
>> +{
>> + struct request_queue *q = disk->queue;
>> + unsigned long qd1_writes;
>> + unsigned int memflags;
>> + ssize_t ret;
>> +
>> + if (!blk_queue_is_zoned(q))
>> + return -EOPNOTSUPP;
>> +
>> + ret = queue_var_store(&qd1_writes, page, count);
>> + if (ret < 0)
>> + return ret;
>> +
>> + memflags = blk_mq_freeze_queue(q);
>> + blk_mq_quiesce_queue(q);
>> + if (qd1_writes)
>> + blk_queue_flag_set(QUEUE_FLAG_ZONED_QD1_WRITES, q);
>> + else
>> + blk_queue_flag_clear(QUEUE_FLAG_ZONED_QD1_WRITES, q);
>> + blk_mq_unquiesce_queue(q);
>> + blk_mq_unfreeze_queue(q, memflags);
>> +
>> + return count;
>> +}
>> +
>> static ssize_t queue_iostats_passthrough_show(struct gendisk *disk, char *page)
>> {
>> return queue_var_show(!!blk_queue_passthrough_stat(disk->queue), page);
>> @@ -611,6 +644,7 @@ QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
>> QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
>> QUEUE_LIM_RO_ENTRY(queue_zoned, "zoned");
>> +QUEUE_RW_ENTRY(queue_zoned_qd1_writes, "zoned_qd1_writes");
>> QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
>> QUEUE_LIM_RO_ENTRY(queue_max_open_zones, "max_open_zones");
>> QUEUE_LIM_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>> @@ -748,6 +782,7 @@ static struct attribute *queue_attrs[] = {
>> &queue_nomerges_entry.attr,
>> &queue_poll_entry.attr,
>> &queue_poll_delay_entry.attr,
>> + &queue_zoned_qd1_writes_entry.attr,
>> NULL,
>> };
>
> Can't you plug it into the 'queue_attr_visible()' function, too,
> such that it doesn't even show up for non-zoned drives?
> (After all, it's not that you can change between zoned and
> non-zoned drives.)
> (One hopes :-)
Good point. Will look into that.
> And I really had hoped that we wouldn't need to introduce new
> kthreads, rather that workqueue are the new kthreads.
> Any reasoning why you couldn't use workqueues here?
I can use a single work item instead of a kthread. I had that initially, but
the code ended up cleaner with the kthread. I can try again with the work item
if you insist...
--
Damien Le Moal
Western Digital Research