From: Andreas Hindborg <andreas.hindborg@wdc.com>
To: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>,
Andreas Hindborg <andreas.hindborg@wdc.com>,
Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org
Subject: Re: Reordering of ublk IO requests
Date: Sat, 19 Nov 2022 08:36:38 +0100
Message-ID: <87cz9j75l5.fsf@wdc.com>
In-Reply-To: <be6940cf-7b23-4b11-1f6f-f3d4853d9a34@opensource.wdc.com>
Damien Le Moal <damien.lemoal@opensource.wdc.com> writes:
> On 11/18/22 21:47, Ming Lei wrote:
>> On Fri, Nov 18, 2022 at 12:49:15PM +0100, Andreas Hindborg wrote:
>>>
>>> Ming Lei <ming.lei@redhat.com> writes:
>>>
>>>> CAUTION: This email originated from outside of Western Digital. Do not click on
>>>> links or open attachments unless you recognize the sender and know that the
>>>> content is safe.
>>>>
>>>>
>>>> On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote:
>>>>>
>>>>> Ming Lei <ming.lei@redhat.com> writes:
>>>>>
>>>>>>
>>>>>> On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote:
>>>>>>> On 11/18/22 13:12, Ming Lei wrote:
>>>>>>> [...]
>>>>>>>>>> You can only assign it to zoned write request, but you still have to check
>>>>>>>>>> the sequence inside each zone, right? Then why not just check LBAs in
>>>>>>>>>> each zone simply?
>>>>>>>>>
>>>>>>>>> We would need to know the zone map, which is not otherwise required.
>>>>>>>>> Then we would need to track the write pointer for each open zone for
>>>>>>>>> each queue, so that we can stall writes that are not issued at the write
>>>>>>>>> pointer. This is in effect all zones, because we cannot track when zones
>>>>>>>>> are implicitly closed. Then, if different queues are issuing writes to
>>>>>>>>
>>>>>>>> Can you explain "implicitly closed" state a bit?
>>>>>>>>
>>>>>>>> From https://zonedstorage.io/docs/introduction/zoned-storage, only the
>>>>>>>> following words are mentioned about closed state:
>>>>>>>>
>>>>>>>> ```Conversely, implicitly or explicitly opened zoned can be transitioned to the
>>>>>>>> closed state using the CLOSE ZONE command.```
>>>>>>>
>>>>>>> When a write is issued to an empty or closed zone, the drive will
>>>>>>> automatically transition the zone into the implicit open state. This is
>>>>>>> called implicit open because the host did not (explicitly) issue an open
>>>>>>> zone command.
>>>>>>>
>>>>>>> When there are too many implicitly open zones, the drive may choose to
>>>>>>> close one of the implicitly opened zones to implicitly open the zone that
>>>>>>> is a target for a write command.
>>>>>>>
>>>>>>> Simple in a nutshell. This is done so that the drive can work with a
>>>>>>> limited set of resources needed to handle open zones, that is, zones that
>>>>>>> are being written. There are some more nasty details to all this with
>>>>>>> limits on the number of open zones and active zones that a zoned drive may
>>>>>>> have.
>>>>>>
>>>>>> OK, thanks for the clarification about implicitly closed, but I
>>>>>> understand this close can't change the zone's write pointer.
>>>>>
>>>>> You are right, it does not matter if the zone is implicitly closed, I
>>>>> was mistaken. But we still have to track the write pointer of every zone
>>>>> in open or active state, otherwise we cannot know if a write that arrives
>>>>> to a zone with no outstanding IO is actually at the write pointer, or
>>>>> whether we need to hold it.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Zone info can be cached in a mapping (a hash table with the zone start sector
>>>>>>>> as the key and the zone info as the value), which can be managed LRU style.
>>>>>>>> On a cache miss, ioctl(BLKREPORTZONE) can be called to obtain the zone
>>>>>>>> info.
>>>>>>>>
>>>>>>>>> the same zone, we need to sync across queues. Userspace may have
>>>>>>>>> synchronization in place to issue writes with multiple threads while
>>>>>>>>> still hitting the write pointer.
>>>>>>>>
>>>>>>>> You can trust mq-deadline, which guarantees that write IO is sent to ->queue_rq()
>>>>>>>> in order, whether MQ or SQ.
>>>>>>>>
>>>>>>>> Yes, writes could be issued from multiple queues for ublksrv, which doesn't sync
>>>>>>>> among multiple queues.
>>>>>>>>
>>>>>>>> But per-zone re-order still can solve the issue, just need one lock
>>>>>>>> for each zone to cover the MQ re-order.
>>>>>>>
>>>>>>> That lock is already there and using it, mq-deadline will never dispatch
>>>>>>> more than one write per zone at any time. This is to avoid write
>>>>>>> reordering. So multi queue or not, for any zone, there is no possibility
>>>>>>> of having writes reordered.
>>>>>>
>>>>>> Oops, I missed the point about a queue depth of one per zone, so ublk won't
>>>>>> break zoned writes at all. I agree the order of batched IOs is a problem, but
>>>>>> it is not hard to solve.
>>>>>
>>>>> The current implementation _does_ break zoned write because it reverses
>>>>> batched writes. But if it is an easy fix, that is cool :)
>>>>
>>>> Please look at Damien's comment:
>>>>
>>>>>> That lock is already there and using it, mq-deadline will never dispatch
>>>>>> more than one write per zone at any time. This is to avoid write
>>>>>> reordering. So multi queue or not, for any zone, there is no possibility
>>>>>> of having writes reordered.
>>>>
>>>> For zoned writes, mq-deadline is used to limit each zone to at most one
>>>> in-flight write.
>>>>
>>>> So can you explain a bit how the current implementation breaks zoned
>>>> write?
>>>
>>> Like Damien wrote in another email, mq-deadline will only impose
>>> ordering for requests submitted in batch. The flow we have is the
>>> following:
>>>
>>> - Userspace sends requests to ublk gendisk
>>> - Requests go through the block layer and are _not_ reordered when using
>>> mq-deadline. They may be split.
>>> - Requests hit ublk_drv and ublk_drv will reverse order of _all_
>>> batched up requests (including split requests).
>>
>> For ublk-zone, the ublk driver eventually needs to be exposed as a zoned
>> device by calling disk_set_zoned(), which definitely isn't supported now.
>> Once ublk-zone is supported, mq-deadline sends at most one write IO for
>> each zone, see blk_req_can_dispatch_to_zone().
>>
>>> - ublk_drv sends request to ublksrv in _reverse_ order.
>>> - ublksrv sends requests _not_ batched up to target device.
>>> - Requests that enter mq-deadline at the same time are reordered in LBA
>>> order, that is all good.
>>> - Requests that enter the kernel in different batches are not reordered
>>> in LBA order and end up missing the write pointer. This is bad.
>>
>> Again, please read Damien's comment:
>>
>>>> That lock is already there and using it, mq-deadline will never dispatch
>>>> more than one write per zone at any time.
>>
>> At any time there is at most one write IO in flight for each zone, so how
>> can that single write IO be reordered?
>
> If the user issues writes one at a time out of order (not aligned to the
> write pointer), mq-deadline will not help at all. The zone write locking
> will still limit write dispatching to one per zone, but the writes will fail.
>
> mq-deadline will reorder write commands in the correct lba order only if:
> - the commands are inserted as a batch (more than one request passed to
> ->insert_requests)
> - commands are inserted individually when the target zone is locked (a
> write is already being executed)
>
> These have been the semantics from the start: the block layer provides no
> guarantees about the correct ordering of writes to zoned drives. What is
> guaranteed is that (1) if the user issues writes in order AND (2)
> mq-deadline is used, then writes will be dispatched in the same order to
> the device.
>
> I have not looked at the details of ublk, but from the thread, I think (1)
> is not done and (2) is missing-ish as the ublk device is not marked as zoned.
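
To make the semantics above concrete, here is a toy Python model. It is not
kernel code, and the names Zone, dispatch_one_at_a_time and dispatch_as_batch
are made up for illustration. It assumes only what is described above: a zone
rejects any write that does not start at its write pointer, and writes that
are inserted together as a batch are sorted by LBA before dispatch.

```python
class Zone:
    """Toy sequential-write-required zone (illustrative only)."""

    def __init__(self, start):
        self.start = start
        self.wp = start  # write pointer, in sectors

    def write(self, lba, nr_sectors):
        # A write that does not start at the write pointer fails.
        if lba != self.wp:
            return False
        self.wp += nr_sectors
        return True


def dispatch_one_at_a_time(zone, writes):
    """Each write is inserted alone, so there is nothing to sort it against."""
    return [zone.write(lba, n) for lba, n in writes]


def dispatch_as_batch(zone, writes):
    """Writes inserted together are sorted by LBA, then dispatched one by one."""
    return [zone.write(lba, n) for lba, n in sorted(writes)]


# Three 8-sector writes to one zone, arriving in reverse LBA order.
reversed_writes = [(16, 8), (8, 8), (0, 8)]
print(dispatch_one_at_a_time(Zone(0), reversed_writes))  # [False, False, True]
print(dispatch_as_batch(Zone(0), reversed_writes))       # [True, True, True]
```

In this model, limiting dispatch to one write per zone does not help when
the writes arrive one at a time in the wrong order. Only the batch case
gets sorted back onto the write pointer.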
I have a patch in the works that adds zoned storage support to ublk. It
sets up the ublk device as a zoned device. It is very much a work in
progress, but it lives here [1] for now.

I am pretty sure that I saw large writes to a zoned ublk device being
split and issued to the device (same zone) with multiple outstanding
requests at the same time. I'll verify on Monday and provide a test case
if that is the case. Maybe I configured the ublk device wrong? I set
it up as host managed zoned and set up zone size, max active, max open.

Best regards,
Andreas

[1] https://github.com/metaspace/linux/tree/ublk-zoned
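
One way to check how a block device ended up being configured is to read the
zoned-related queue attributes from sysfs. The sketch below is a hypothetical
helper, not part of the patch. It assumes the standard attributes under
/sys/block/<disk>/queue/ (zoned, chunk_sectors, nr_zones, max_open_zones,
max_active_zones), and the disk name "ublkb0" in the usage line is only an
example.

```python
from pathlib import Path

# Standard block-layer queue attributes that describe a zoned device.
ZONED_ATTRS = (
    "zoned",             # "none", "host-aware" or "host-managed"
    "chunk_sectors",     # zone size in 512-byte sectors for zoned devices
    "nr_zones",
    "max_open_zones",
    "max_active_zones",
)


def read_zoned_limits(disk, sysfs_root="/sys/block"):
    """Return the zoned queue attributes of `disk` as strings.

    Attributes that the running kernel does not expose are reported as None.
    """
    queue = Path(sysfs_root) / disk / "queue"
    limits = {}
    for name in ZONED_ATTRS:
        attr = queue / name
        limits[name] = attr.read_text().strip() if attr.exists() else None
    return limits


if __name__ == "__main__":
    # "ublkb0" is an assumed device name; substitute the actual disk.
    for key, value in read_zoned_limits("ublkb0").items():
        print(f"{key}: {value}")
```

If "zoned" does not read "host-managed" here, the zone write locking in
mq-deadline would not be expected to apply to the device.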