From: Andreas Hindborg <andreas.hindborg@wdc.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Andreas Hindborg <andreas.hindborg@wdc.com>,
Damien Le Moal <damien.lemoal@opensource.wdc.com>,
Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org
Subject: Re: Reordering of ublk IO requests
Date: Fri, 18 Nov 2022 12:49:15 +0100 [thread overview]
Message-ID: <87v8nc79cv.fsf@wdc.com> (raw)
In-Reply-To: <Y3dscKle5oqLjSNT@T590>
Ming Lei <ming.lei@redhat.com> writes:
> CAUTION: This email originated from outside of Western Digital. Do not click on
> links or open attachments unless you recognize the sender and know that the
> content is safe.
>
>
> On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote:
>>
>> Ming Lei <ming.lei@redhat.com> writes:
>>
>> > CAUTION: This email originated from outside of Western Digital. Do not click on
>> > links or open attachments unless you recognize the sender and know that the
>> > content is safe.
>> >
>> >
>> > On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote:
>> >> On 11/18/22 13:12, Ming Lei wrote:
>> >> [...]
>> >> >>> You can only assign it to zoned write request, but you still have to check
>> >> >>> the sequence inside each zone, right? Then why not just check LBAs in
>> >> >>> each zone simply?
>> >> >>
>> >> >> We would need to know the zone map, which is not otherwise required.
>> >> >> Then we would need to track the write pointer for each open zone for
>> >> >> each queue, so that we can stall writes that are not issued at the write
>> >> >> pointer. This is in effect all zones, because we cannot track when zones
>> >> >> are implicitly closed. Then, if different queues are issuing writes to
>> >> >
>> >> > Can you explain "implicitly closed" state a bit?
>> >> >
>> >> > From https://zonedstorage.io/docs/introduction/zoned-storage, only the
>> >> > following words are mentioned about closed state:
>> >> >
>> >> > ```Conversely, implicitly or explicitly opened zoned can be transitioned to the
>> >> > closed state using the CLOSE ZONE command.```
>> >>
>> >> When a write is issued to an empty or closed zone, the drive will
>> >> automatically transition the zone into the implicit open state. This is
>> >> called implicit open because the host did not (explicitly) issue an open
>> >> zone command.
>> >>
>> >> When there are too many implicitly open zones, the drive may choose to
>> >> close one of the implicitly opened zone to implicitly open the zone that
>> >> is a target for a write command.
>> >>
>> >> Simple in a nutshell. This is done so that the drive can work with a
>> >> limited set of resources needed to handle open zones, that is, zones that
>> >> are being written. There are some more nasty details to all this with
>> >> limits on the number of open zones and active zones that a zoned drive may
>> >> have.
>> >
>> > OK, thanks for the clarification about implicitly closed, but I
>> > understand this close can't change the zone's write pointer.
>>
>> You are right, it does not matter if the zone is implicitly closed, I
>> was mistaken. But we still have to track the write pointer of every zone
>> in open or active state, otherwise we cannot know if a write that arrive
>> to a zone with no outstanding IO is actually at the write pointer, or
>> whether we need to hold it.
>>
>> >
>> >>
>> >> >
>> >> > zone info can be cached in the mapping(hash table)(zone sector is the key, and zone
>> >> > info is the value), which can be implemented as one LRU style. If any zone
>> >> > info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for
>> >> > obtaining the zone info.
>> >> >
>> >> >> the same zone, we need to sync across queues. Userspace may have
>> >> >> synchronization in place to issue writes with multiple threads while
>> >> >> still hitting the write pointer.
>> >> >
>> >> > You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq()
>> >> > in order, no matter MQ or SQ.
>> >> >
>> >> > Yes, it could be issue from multiple queues for ublksrv, which doesn't sync
>> >> > among multiple queues.
>> >> >
>> >> > But per-zone re-order still can solve the issue, just need one lock
>> >> > for each zone to cover the MQ re-order.
>> >>
>> >> That lock is already there and using it, mq-deadline will never dispatch
>> >> more than one write per zone at any time. This is to avoid write
>> >> reordering. So multi queue or not, for any zone, there is no possibility
>> >> of having writes reordered.
>> >
>> > oops, I miss the single queue depth point per zone, so ublk won't break
>> > zoned write at all, and I agree order of batch IOs is one problem, but
>> > not hard to solve.
>>
>> The current implementation _does_ break zoned write because it reverses
>> batched writes. But if it is an easy fix, that is cool :)
>
> Please look at Damien's comment:
>
>>> That lock is already there and using it, mq-deadline will never dispatch
>>> more than one write per zone at any time. This is to avoid write
>>> reordering. So multi queue or not, for any zone, there is no possibility
>>> of having writes reordered.
>
> For zoned write, mq-deadline is used to limit at most one inflight write
> for each zone.
>
> So can you explain a bit how the current implementation breaks zoned
> write?
Like Damien wrote in another email, mq-deadline will only impose
ordering for requests submitted in batch. The flow we have is the
following:
- Userspace sends requests to ublk gendisk
- Requests go through block layer and is _not_ reordered when using
mq-deadline. They may be split.
- Requests hit ublk_drv and ublk_drv will reverse order of _all_
batched up requests (including split requests).
- ublk_drv sends request to ublksrv in _reverse_ order.
- ublksrv sends requests _not_ batched up to target device.
- Requests that enter mq-deadline at the same time are reordered in LBA
order, that is all good.
- Requests that enter the kernel in different batches are not reordered
in LBA order and end up missing the write pointer. This is bad.
So, ublk_drv is not functional for zoned storage as is. Either we have
to fix up the ordering in userspace in ublksrv, and that _will_ have a
performance impact. Or we fix the bug in ublk_drv that causes batched
requests to be _reversed_.
Thanks,
Andreas
next prev parent reply other threads:[~2022-11-18 12:09 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-11-16 15:00 Reordering of ublk IO requests Andreas Hindborg
2022-11-17 2:18 ` Ming Lei
2022-11-17 8:05 ` Andreas Hindborg
2022-11-17 8:52 ` Ming Lei
2022-11-17 9:07 ` Andreas Hindborg
2022-11-17 11:47 ` Ming Lei
2022-11-17 11:59 ` Andreas Hindborg
2022-11-17 13:11 ` Damien Le Moal
2022-11-17 13:31 ` Andreas Hindborg
2022-11-18 1:51 ` Damien Le Moal
2022-11-18 9:29 ` Andreas Hindborg
2022-11-18 4:12 ` Ming Lei
2022-11-18 4:35 ` Damien Le Moal
2022-11-18 6:07 ` Ming Lei
2022-11-18 9:41 ` Andreas Hindborg
2022-11-18 11:28 ` Ming Lei
2022-11-18 11:49 ` Andreas Hindborg [this message]
2022-11-18 12:46 ` Andreas Hindborg
2022-11-18 12:47 ` Ming Lei
2022-11-19 0:24 ` Damien Le Moal
2022-11-19 7:36 ` Andreas Hindborg
2022-11-21 10:15 ` Andreas Hindborg
2022-11-20 14:37 ` Ming Lei
2022-11-21 1:25 ` Damien Le Moal
2022-11-21 8:03 ` Christoph Hellwig
2022-11-21 8:13 ` Ming Lei
2022-11-17 13:00 ` Damien Le Moal
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87v8nc79cv.fsf@wdc.com \
--to=andreas.hindborg@wdc.com \
--cc=axboe@kernel.dk \
--cc=damien.lemoal@opensource.wdc.com \
--cc=linux-block@vger.kernel.org \
--cc=ming.lei@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).