From: Ming Lei <ming.lei@redhat.com>
To: Andreas Hindborg <andreas.hindborg@wdc.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, ming.lei@redhat.com
Subject: Re: Reordering of ublk IO requests
Date: Thu, 17 Nov 2022 16:52:03 +0800
Message-ID: <Y3X2M3CSULigQr4f@T590>
In-Reply-To: <87o7t67zzv.fsf@wdc.com>

On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote:
> 
> Ming Lei <ming.lei@redhat.com> writes:
> 
> > Hi Andreas,
> >
> > On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote:
> >>
> >> Hi Ming,
> >>
> >> I have a question regarding ublk. For context, I am working on adding
> >> zoned storage support to ublk, and you have been very kind to help me on
> >> GitHub [1].
> >>
> >> I have a problem with ordering of requests after they are issued to the
> >> ublk driver. Basically ublk will reverse the ordering of requests that are
> >> batched. The behavior can be observed with the following fio workload:
> >>
> >> > fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1
> >> > --size=4m --filename=/dev/ublkb0
> >>
> >> For a loopback ublk target I get the following from blktrace:
> >>
> >> > 259,2    0     3469   286.337681303   724  D   R 0 + 1024 [fio]
> >> > 259,2    0     3470   286.337691313   724  D   R 1024 + 1024 [fio]
> >> > 259,2    0     3471   286.337694423   724  D   R 2048 + 1024 [fio]
> >> > 259,2    0     3472   286.337696583   724  D   R 3072 + 1024 [fio]
> >> > 259,2    0     3473   286.337698433   724  D   R 4096 + 1024 [fio]
> >> > 259,2    0     3474   286.337700213   724  D   R 5120 + 1024 [fio]
> >> > 259,2    0     3475   286.337702723   724  D   R 6144 + 1024 [fio]
> >> > 259,2    0     3476   286.337704323   724  D   R 7168 + 1024 [fio]
> >> > 259,1    0     1794   286.337794934   390  D   R 6144 + 2048 [ublk]
> >> > 259,1    0     1795   286.337805504   390  D   R 4096 + 2048 [ublk]
> >> > 259,1    0     1796   286.337816274   390  D   R 2048 + 2048 [ublk]
> >> > 259,1    0     1797   286.337821744   390  D   R 0 + 2048 [ublk]
> >>
> >> And enabling debug prints in ublk shows that the reversal happens when
> >> ublk defers work to the io_uring context:
> >>
> >> > kernel: ublk_queue_rq: qid 0, tag 180, sect 0
> >> > kernel: ublk_queue_rq: qid 0, tag 181, sect 1024
> >> > kernel: ublk_queue_rq: qid 0, tag 182, sect 2048
> >> > kernel: ublk_queue_rq: qid 0, tag 183, sect 3072
> >> > kernel: ublk_queue_rq: qid 0, tag 184, sect 4096
> >> > kernel: ublk_queue_rq: qid 0, tag 185, sect 5120
> >> > kernel: ublk_queue_rq: qid 0, tag 186, sect 6144
> >> > kernel: ublk_queue_rq: qid 0, tag 187, sect 7168
> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168
> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144
> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120
> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096
> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072
> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048
> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024
> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0
> >>
> >> The problem seems to be that the method used to defer work to the
> >> io_uring context, task_work_add(), is using a stack to queue the
> >> callbacks.
> >
> > Is the observation done on zoned ublk or plain ublk-loop?
> 
> I collected this trace on my fork, but since the behavior comes from
> task_work.c using a stack to queue work, it would be present on upstream
> ublk as well. For completeness, I have verified that this is the case.
> 
> > If it is plain ublk-loop, the reverse order can start from
> > blk_add_rq_to_plug(), but that won't happen for zoned write requests,
> > which aren't queued to the plug list.
> 
> I forgot to mention that I set the deadline scheduler for both ublkb0
> and the loopback target. No reordering should happen in mq with the
> deadline scheduler, as far as I understand.

I meant you need to set up a zoned ublk-loop to observe write request
order; the block layer never guarantees request order for other kinds of
devices.

> 
> >
> > Otherwise, can you observe zoned write req reorder from ublksrv side?
> >
> 
> Yes, the reverse order is observable in the ublk server. The reordering
> happens in the ublk kernel driver.
> 
> >>
> >> As you probably are aware, writes to zoned storage must
> >> happen at the write pointer, so when order is lost there is a problem.
> >> But I would assume that this behavior is also undesirable in other
> >> circumstances.
> >>
> >> I am not sure what the best approach to changing this behavior is. Maybe
> >> having a separate queue for the requests and then only scheduling work
> >> if a worker is not already processing the queue?
> >>
> >> If you like, I can try to implement a fix?
> >
> > Yeah, I think zoned write requests could be forwarded to ublk server out of order.
> 
> In reverse order for requests that are issued without a context switch,
> as far as I can tell.
> 
> >
> > Is it possible to re-order the writes on the ublksrv side? I guess it
> > is doable:
> >
> > 1) zoned write requests are always sent to ublk_queue_rq() in order
> >
> > 2) when the ublksrv zone target/backend code gets a zoned write request
> > for a zone, it can wait until the next expected write arrives, then
> > handle the write and advance the write pointer, then repeat the whole
> > process.
> >
> > 3) because of 1), the next expected zoned write req is guaranteed to
> > come without much delay, and the per-zone queue won't be long.
> 
> I think it is not feasible to have per zone data structures. Instead, I

If one mapping data structure is used for ordering the write requests
within each zone, it could be pretty easy, such as a C++ map or hash
table. It won't take much memory, given that the number of in-flight IOs
is strictly limited and the zone mapping entries can be reused across
zones, and it is quite easy to implement.

Also, most of the time the per-zone ordering can be completed within the
current batch (the requests completed from a single io_uring_enter()), so
the extra cost should be close to zero. You can simply run the per-zone
ordering in the ->handle_io_background() callback, by which point all the
requests for each zone will usually have arrived.
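
Roughly, I mean something like the sketch below (plain C++, only to show
the idea; the type and field names are illustrative, not the actual
ublksrv API, and a real target would take the write pointer from its own
zone state):

#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

/* One pending zoned write as seen by the ublk server. */
struct pending_write {
	int tag;		/* ublk request tag */
	uint64_t sector;	/* starting sector of the write */
	uint32_t sectors;	/* length in sectors */
};

class zone_writer {
public:
	/*
	 * Queue one write for its zone and return the writes that are
	 * now contiguous with the write pointer, in order, ready to
	 * submit. current_wp is the zone's write pointer and is only
	 * consulted the first time a zone shows up here.
	 */
	std::vector<pending_write> queue(const pending_write &w,
					 uint64_t zone_start,
					 uint64_t current_wp)
	{
		auto it = zones_.find(zone_start);
		if (it == zones_.end())
			it = zones_.emplace(zone_start,
					zone_state{current_wp, {}}).first;

		zone_state &z = it->second;
		z.pending.emplace(w.sector, w);	/* sorted by start sector */

		std::vector<pending_write> ready;
		while (!z.pending.empty() &&
		       z.pending.begin()->first == z.wp) {
			const pending_write &next = z.pending.begin()->second;
			z.wp += next.sectors;	/* advance write pointer */
			ready.push_back(next);
			z.pending.erase(z.pending.begin());
		}
		if (z.pending.empty())
			zones_.erase(it);	/* entry reused across zones */
		return ready;
	}

private:
	struct zone_state {
		uint64_t wp;	/* next expected sector */
		std::map<uint64_t, pending_write> pending;
	};
	/* Bounded by queue depth: at most qdepth writes are in flight. */
	std::unordered_map<uint64_t, zone_state> zones_;
};

Calling queue() for each write as it is taken from the completion batch
in ->handle_io_background(), and submitting whatever it returns, keeps
the backing IO in write-pointer order; since the whole batch usually
arrives together, the map drains immediately and stays tiny.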

> considered adding a sequence number to each request, and then queue up
> requests if there is a gap in the numbering.

IMO, that won't be doable, given that you don't know when the sequence
ends.

> 
> But really, the issue should be resolved by changing the way
> ublk_queue_rq() sends work to the io_uring context with task_work_add().

Not sure it can be solved easily, given the way llist works: new entries
are pushed onto the head of the list, so a batch is consumed last-in,
first-out.
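
For reference, the reversal falls out of the push-to-head list itself; a
minimal userspace illustration of the pattern (not the kernel llist code,
which does the push with an atomic cmpxchg):

#include <cstdio>

/* A push-to-head singly linked list, the shape llist/task_work use. */
struct node {
	int tag;
	struct node *next;
};

static struct node *head;

static void push(struct node *n)	/* analogue of llist_add() */
{
	n->next = head;
	head = n;
}

int main(void)
{
	struct node nodes[] = {
		{180, nullptr}, {181, nullptr},
		{182, nullptr}, {183, nullptr},
	};

	for (auto &n : nodes)
		push(&n);	/* queued in order 180..183 */

	/*
	 * The consumer walks from the head, so it sees 183..180:
	 * the same reversal as the __ublk_rq_task_work() log above.
	 */
	for (struct node *n = head; n; n = n->next)
		std::printf("tag %d\n", n->tag);
	return 0;
}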

> Fixing it in ublksrv is a bit of a hack and will carry a performance
> penalty.

As I mentioned, ordering the zoned write requests within each zone takes
only a little memory (a map or hash table whose maximum size is the queue
depth) and basically zero CPU cost, so I don't see any extra performance
penalty being involved.

> 
> Also, it cannot be good for any storage device to have sequential
> requests delivered in reverse order, can it? Fixing this would benefit
> all targets.

Only the zoned target has a strict ordering requirement, and meeting it
does take some cost; the block layer never guarantees request order. As I
mentioned, blk_add_rq_to_plug() can reverse request order too.

Thanks,
Ming

