linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Andreas Hindborg <andreas.hindborg@wdc.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>,
	Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, ming.lei@redhat.com
Subject: Re: Reordering of ublk IO requests
Date: Fri, 18 Nov 2022 20:47:43 +0800	[thread overview]
Message-ID: <Y3d+78qMOusdYUAG@T590> (raw)
In-Reply-To: <87v8nc79cv.fsf@wdc.com>

On Fri, Nov 18, 2022 at 12:49:15PM +0100, Andreas Hindborg wrote:
> 
> Ming Lei <ming.lei@redhat.com> writes:
> 
> > CAUTION: This email originated from outside of Western Digital. Do not click on
> > links or open attachments unless you recognize the sender and know that the
> > content is safe.
> >
> >
> > On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote:
> >>
> >> Ming Lei <ming.lei@redhat.com> writes:
> >>
> >> > CAUTION: This email originated from outside of Western Digital. Do not click on
> >> > links or open attachments unless you recognize the sender and know that the
> >> > content is safe.
> >> >
> >> >
> >> > On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote:
> >> >> On 11/18/22 13:12, Ming Lei wrote:
> >> >> [...]
> >> >> >>> You can only assign it to zoned write request, but you still have to check
> >> >> >>> the sequence inside each zone, right? Then why not just check LBAs in
> >> >> >>> each zone simply?
> >> >> >>
> >> >> >> We would need to know the zone map, which is not otherwise required.
> >> >> >> Then we would need to track the write pointer for each open zone for
> >> >> >> each queue, so that we can stall writes that are not issued at the write
> >> >> >> pointer. This is in effect all zones, because we cannot track when zones
> >> >> >> are implicitly closed. Then, if different queues are issuing writes to
> >> >> >
> >> >> > Can you explain "implicitly closed" state a bit?
> >> >> >
> >> >> > From https://zonedstorage.io/docs/introduction/zoned-storage, only the
> >> >> > following words are mentioned about closed state:
> >> >> >
> >> >> >     ```Conversely, implicitly or explicitly opened zoned can be transitioned to the
> >> >> >     closed state using the CLOSE ZONE command.```
> >> >>
> >> >> When a write is issued to an empty or closed zone, the drive will
> >> >> automatically transition the zone into the implicit open state. This is
> >> >> called implicit open because the host did not (explicitly) issue an open
> >> >> zone command.
> >> >>
> >> >> When there are too many implicitly open zones, the drive may choose to
> >> >> close one of the implicitly opened zone to implicitly open the zone that
> >> >> is a target for a write command.
> >> >>
> >> >> Simple in a nutshell. This is done so that the drive can work with a
> >> >> limited set of resources needed to handle open zones, that is, zones that
> >> >> are being written. There are some more nasty details to all this with
> >> >> limits on the number of open zones and active zones that a zoned drive may
> >> >> have.
> >> >
> >> > OK, thanks for the clarification about implicitly closed, but I
> >> > understand this close can't change the zone's write pointer.
> >>
> >> You are right, it does not matter if the zone is implicitly closed, I
> >> was mistaken. But we still have to track the write pointer of every zone
> >> in open or active state, otherwise we cannot know if a write that arrive
> >> to a zone with no outstanding IO is actually at the write pointer, or
> >> whether we need to hold it.
> >>
> >> >
> >> >>
> >> >> >
> >> >> > zone info can be cached in the mapping(hash table)(zone sector is the key, and zone
> >> >> > info is the value), which can be implemented as one LRU style. If any zone
> >> >> > info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for
> >> >> > obtaining the zone info.
> >> >> >
> >> >> >> the same zone, we need to sync across queues. Userspace may have
> >> >> >> synchronization in place to issue writes with multiple threads while
> >> >> >> still hitting the write pointer.
> >> >> >
> >> >> > You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq()
> >> >> > in order, no matter MQ or SQ.
> >> >> >
> >> >> > Yes, it could be issue from multiple queues for ublksrv, which doesn't sync
> >> >> > among multiple queues.
> >> >> >
> >> >> > But per-zone re-order still can solve the issue, just need one lock
> >> >> > for each zone to cover the MQ re-order.
> >> >>
> >> >> That lock is already there and using it, mq-deadline will never dispatch
> >> >> more than one write per zone at any time. This is to avoid write
> >> >> reordering. So multi queue or not, for any zone, there is no possibility
> >> >> of having writes reordered.
> >> >
> >> > oops, I miss the single queue depth point per zone, so ublk won't break
> >> > zoned write at all, and I agree order of batch IOs is one problem, but
> >> > not hard to solve.
> >>
> >> The current implementation _does_ break zoned write because it reverses
> >> batched writes. But if it is an easy fix, that is cool :)
> >
> > Please look at Damien's comment:
> >
> >>> That lock is already there and using it, mq-deadline will never dispatch
> >>> more than one write per zone at any time. This is to avoid write
> >>> reordering. So multi queue or not, for any zone, there is no possibility
> >>> of having writes reordered.
> >
> > For zoned write, mq-deadline is used to limit at most one inflight write
> > for each zone.
> >
> > So can you explain a bit how the current implementation breaks zoned
> > write?
> 
> Like Damien wrote in another email, mq-deadline will only impose
> ordering for requests submitted in batch. The flow we have is the
> following:
> 
>  - Userspace sends requests to ublk gendisk
>  - Requests go through block layer and is _not_ reordered when using
>    mq-deadline. They may be split.
>  - Requests hit ublk_drv and ublk_drv will reverse order of _all_
>    batched up requests (including split requests).

For ublk-zone, ublk driver needs to be exposed as zoned device by
calling disk_set_zoned() finally, which definitely isn't supported now,
so mq-deadline at most sends one write IO for each zone after ublk-zone
is supported, see blk_req_can_dispatch_to_zone().

>  - ublk_drv sends request to ublksrv in _reverse_ order.
>  - ublksrv sends requests _not_ batched up to target device.
>  - Requests that enter mq-deadline at the same time are reordered in LBA
>    order, that is all good.
>  - Requests that enter the kernel in different batches are not reordered
>    in LBA order and end up missing the write pointer. This is bad.

Again, please read Damien's comment:

>> That lock is already there and using it, mq-deadline will never dispatch
>> more than one write per zone at any time.

Anytime, there is at most one write IO for each zone, how can the single
write IO be re-order?


Thanks,
Ming


  parent reply	other threads:[~2022-11-18 12:48 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-16 15:00 Reordering of ublk IO requests Andreas Hindborg
2022-11-17  2:18 ` Ming Lei
2022-11-17  8:05   ` Andreas Hindborg
2022-11-17  8:52     ` Ming Lei
2022-11-17  9:07       ` Andreas Hindborg
2022-11-17 11:47         ` Ming Lei
2022-11-17 11:59           ` Andreas Hindborg
2022-11-17 13:11             ` Damien Le Moal
2022-11-17 13:31               ` Andreas Hindborg
2022-11-18  1:51                 ` Damien Le Moal
2022-11-18  9:29                   ` Andreas Hindborg
2022-11-18  4:12             ` Ming Lei
2022-11-18  4:35               ` Damien Le Moal
2022-11-18  6:07                 ` Ming Lei
2022-11-18  9:41                   ` Andreas Hindborg
2022-11-18 11:28                     ` Ming Lei
2022-11-18 11:49                       ` Andreas Hindborg
2022-11-18 12:46                         ` Andreas Hindborg
2022-11-18 12:47                         ` Ming Lei [this message]
2022-11-19  0:24                           ` Damien Le Moal
2022-11-19  7:36                             ` Andreas Hindborg
2022-11-21 10:15                               ` Andreas Hindborg
2022-11-20 14:37                             ` Ming Lei
2022-11-21  1:25                               ` Damien Le Moal
2022-11-21  8:03                             ` Christoph Hellwig
2022-11-21  8:13                               ` Ming Lei
2022-11-17 13:00         ` Damien Le Moal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y3d+78qMOusdYUAG@T590 \
    --to=ming.lei@redhat.com \
    --cc=andreas.hindborg@wdc.com \
    --cc=axboe@kernel.dk \
    --cc=damien.lemoal@opensource.wdc.com \
    --cc=linux-block@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).