From: Damien Le Moal <dlemoal@kernel.org>
To: Bart Van Assche <bvanassche@acm.org>, Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v21 00/12] Improve write performance for zoned UFS devices
Date: Tue, 22 Jul 2025 10:36:08 +0900 [thread overview]
Message-ID: <15e63230-6e18-4581-a60a-a77bc3b57721@kernel.org> (raw)
In-Reply-To: <754540df-0039-47b5-ab60-44d6c4f7ac5a@acm.org>
On 7/19/25 3:30 AM, Bart Van Assche wrote:
> On 7/18/25 12:08 AM, Damien Le Moal wrote:
>> How did you test this ?
>
> Hi Damien,
>
> This patch series has been tested as follows:
> - In an x86-64 VM:
> - By running blktests.
> - By running the attached two scripts. test-pipelining-zoned-writes
> submits small writes sequentially and has been used to compare IOPS
> with and without write pipelining. test-pipelining-and-requeuing
> submits sequential or random writes. This script has
> been used to verify that the HOST BUSY and UNALIGNED WRITE
> conditions are handled correctly for both I/O patterns.
> - On an ARM development board with a ZUFS device, by running a multitude
> of I/O patterns on top of F2FS and a ZUFS device with data
> verification enabled.
>
>> I do not have a zoned UFS drive, so I used an NVMe ZNS drive, which should be
>> fine since the commands in the submission queues of a PCI controller are always
>> handled in order. So I added:
>>
>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>> index cce4c5b55aa9..36d16b8d3f37 100644
>> --- a/drivers/nvme/host/zns.c
>> +++ b/drivers/nvme/host/zns.c
>> @@ -108,7 +108,7 @@ int nvme_query_zone_info(struct nvme_ns *ns, unsigned lbaf,
>> void nvme_update_zone_info(struct nvme_ns *ns, struct queue_limits *lim,
>> struct nvme_zone_info *zi)
>> {
>> - lim->features |= BLK_FEAT_ZONED;
>> + lim->features |= BLK_FEAT_ZONED | BLK_FEAT_ORDERED_HWQ;
>> lim->max_open_zones = zi->max_open_zones;
>> lim->max_active_zones = zi->max_active_zones;
>> lim->max_hw_zone_append_sectors = ns->ctrl->max_zone_append;
>>
>> And ran this:
>>
>> fio --name=test --filename=/dev/nvme1n2 --ioengine=io_uring --iodepth=128 \
>> --direct=1 --bs=4096 --zonemode=zbd --rw=randwrite \
>> --numjobs=1
>>
>> And I get unaligned write errors 100% of the time. Looking at your patches
>> again, you are not handling REQ_NOWAIT case in blk_zone_wplug_handle_write(). If
>> you get REQ_NOWAIT BIO, which io_uring will issue, the code goes directly to
>> plugging the BIO, thus bypassing your from_cpu handling.
>
> Didn't Jens recommend libaio instead of io_uring for zoned storage? See
> also https://lore.kernel.org/linux-block/8c0f9d28-d68f-4800-
> b94f-1905079d4007@kernel.dk/T/#mb61b6d1294da76a9f1be38edf6dceaf703112335. I ran
> all my tests with
> libaio instead of io_uring.
My bad, yes, io_uring does not work reliably for zoned writes because of its
no-wait handling of BIOs and punting to a worker thread for blocking BIOs. But
as I said, tests with libaio did not go well either.
>> But the same fio command with libaio (no REQ_NOWAIT in that case) also fails.
>
> While this patch series addresses most potential causes of reordering by
> the block layer, it does not address all possible causes of reordering.
> An example of a potential cause of reordering that has not been
> addressed by this patch series can be found in blk_mq_insert_requests().
> That function either inserts requests in a software or a hardware queue.
> Bypassing the software queue for some requests can cause reordering.
> Another example can be found in blk_mq_dispatch_rq_list(). If the block
> driver responds with BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
> requests that have not been accepted by the block driver are added to
> the &hctx->dispatch list. If these requests came from a software queue,
> adding these to hctx->dispatch_list instead of putting them back in
> their original position in the software queue can cause reordering.
>
> Patches 8 and 9 work around this by retrying writes in the unlikely case
> that reordering happens. I think this is a more pragmatic solution than
> making more changes in the block layer to make it fully preserve the
> request order. In the traces that I gathered and that I inspected, I
> did not see any UNALIGNED WRITE errors being reported by ZUFS devices.
So the end result of your patches is that the submission path can still
generates reordering and cause unaligned write errors. Not great to say the
least. I would really prefer something that does not cause such submission
errors to be sure that if we see an error, it is due to the a user bug (user
sending unaligned writes).
--
Damien Le Moal
Western Digital Research
next prev parent reply other threads:[~2025-07-22 1:38 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 01/12] block: Support block devices that preserve the order of write requests Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 02/12] blk-mq: Restore the zone write order when requeuing Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 03/12] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
2025-07-18 7:13 ` Damien Le Moal
2025-07-18 15:54 ` Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 04/12] blk-zoned: Split an if-statement Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 05/12] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller Bart Van Assche
2025-07-18 7:15 ` Damien Le Moal
2025-07-17 20:58 ` [PATCH v21 06/12] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work() Bart Van Assche
2025-07-18 7:17 ` Damien Le Moal
2025-07-17 20:58 ` [PATCH v21 07/12] blk-zoned: Support pipelining of zoned writes Bart Van Assche
2025-07-18 7:38 ` Damien Le Moal
2025-07-18 16:29 ` Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 08/12] scsi: core: Retry unaligned " Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 09/12] scsi: sd: Increase retry count for " Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 10/12] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 11/12] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 12/12] ufs: core: Inform the block layer about write ordering Bart Van Assche
2025-07-18 7:08 ` [PATCH v21 00/12] Improve write performance for zoned UFS devices Damien Le Moal
2025-07-18 18:30 ` Bart Van Assche
2025-07-22 1:36 ` Damien Le Moal [this message]
2025-07-22 18:24 ` Bart Van Assche
2025-07-18 7:39 ` Damien Le Moal
2025-07-18 16:32 ` Bart Van Assche
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=15e63230-6e18-4581-a60a-a77bc3b57721@kernel.org \
--to=dlemoal@kernel.org \
--cc=axboe@kernel.dk \
--cc=bvanassche@acm.org \
--cc=hch@lst.de \
--cc=linux-block@vger.kernel.org \
--cc=linux-scsi@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox