From: Ming Lei <ming.lei@redhat.com>
To: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Cc: linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH] ublk_drv: don't call task_work_add for queueing io commands
Date: Wed, 26 Oct 2022 19:29:27 +0800 [thread overview]
Message-ID: <Y1kaFyan927Qnnnh@T590> (raw)
In-Reply-To: <85bf5d34-7bbc-7da1-54fb-b0cec68a610b@linux.alibaba.com>
On Wed, Oct 26, 2022 at 06:32:46PM +0800, Ziyang Zhang wrote:
> On 2022/10/25 23:17, Ming Lei wrote:
> > On Tue, Oct 25, 2022 at 04:43:56PM +0800, Ziyang Zhang wrote:
> >> On 2022/10/25 15:46, Ziyang Zhang wrote:
> >>> On 2022/10/25 15:19, Ming Lei wrote:
> >>>> On Tue, Oct 25, 2022 at 11:15:57AM +0800, Ziyang Zhang wrote:
> >>>>> On 2022/10/24 21:20, Ming Lei wrote:
> >>>>>> Hello Ziyang,
> >>>>>>
> >>>>>> On Mon, Oct 24, 2022 at 05:48:51PM +0800, Ziyang Zhang wrote:
> >>>>>>> On 2022/10/23 17:38, Ming Lei wrote:
> >>>>>>>> task_work_add() is used for waking ubq daemon task with one batch
> >>>>>>>> of io requests/commands queued. However, task_work_add() isn't
> >>>>>>>> exported for module code, and it is still debatable if the symbol
> >>>>>>>> should be exported.
> >>>>>>>>
> >>>>>>>> Fortunately we still have io_uring_cmd_complete_in_task() which just
> >>>>>>>> can't handle batched wakeup for us.
> >>>>>>>>
> >>>>>>>> Add one one llist into ublk_queue and call io_uring_cmd_complete_in_task()
> >>>>>>>> via current command for running them via task work.
> >>>>>>>>
> >>>>>>>> This way cleans up current code a lot, meantime allow us to wakeup
> >>>>>>>> ubq daemon task after queueing batched requests/io commands.
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi, Ming
> >>>>>>>
> >>>>>>> This patch works and I have run some tests to compare current version(ucmd)
> >>>>>>> with your patch(ucmd-batch).
> >>>>>>>
> >>>>>>> iodepth=128 numjobs=1 direct=1 bs=4k
> >>>>>>>
> >>>>>>> --------------------------------------------
> >>>>>>> ublk loop target, the backend is a file.
> >>>>>>> IOPS(k)
> >>>>>>>
> >>>>>>> type ucmd ucmd-batch
> >>>>>>> seq-read 54.7 54.2
> >>>>>>> rand-read 52.8 52.0
> >>>>>>>
> >>>>>>> --------------------------------------------
> >>>>>>> ublk null target
> >>>>>>> IOPS(k)
> >>>>>>>
> >>>>>>> type ucmd ucmd-batch
> >>>>>>> seq-read 257 257
> >>>>>>> rand-read 252 253
> >>>>>>>
> >>>>>>>
> >>>>>>> I find that io_req_task_work_add() puts task_work node into a llist
> >>>>>>> first, then it may call task_work_add() to run batched task_works. So do we really
> >>>>>>> need such llist in ublk_drv? I think io_uring has already considered task_work batch
> >>>>>>> optimization.
> >>>>>>>
> >>>>>>> BTW, task_work_add() in ublk_drv achieves
> >>>>>>> higher IOPS(about 5-10% on my machine) than io_uring_cmd_complete_in_task()
> >>>>>>> in ublk_drv.
> >>>>>>
> >>>>>> Yeah, that is same with my observation, and motivation of this patch is
> >>>>>> to get same performance with task_work_add by building ublk_drv as
> >>>>>> module. One win of task_work_add() is that we get exact batching info
> >>>>>> meantime only send TWA_SIGNAL_NO_IPI for whole batch, that is basically
> >>>>>> what the patch is doing, but needs help of the following ublksrv patch:
> >>>>>>
> >>>>>> https://github.com/ming1/ubdsrv/commit/dce6d1d222023c1641292713b311ced01e6dc548
> >>>>>>
> >>>>>> which sets IORING_SETUP_COOP_TASKRUN for ublksrv's uring, then
> >>>>>> io_uring_cmd_complete_in_task will notify via TWA_SIGNAL_NO_IPI, and 5+%
> >>>>>> IOPS boost is observed on loop/001 by putting image on SSD in my test
> >>>>>> VM.
> >>>>>
> >>>>> Hi Ming,
> >>>>>
> >>>>> I have added this ublksrv patch and run the above test again.
> >>>>> I have also run ublksrv test: loop/001. Please check them.
> >>>>>
> >>>>> Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz 16 cores
> >>>>> 64GB MEM, CentOS 8, kernel 6.0+
> >>>>>
> >>>>> --------
> >>>>> fio test
> >>>>>
> >>>>> iodepth=128 numjobs=1 direct=1 bs=4k
> >>>>>
> >>>>> ucmd: without your kernel patch. Run io_uring_cmd_complete_in_task()
> >>>>> for each blk-mq rq.
> >>>>>
> >>>>> ucmd-batch: with your kernel patch. Run io_uring_cmd_complete_in_task()
> >>>>> for the last blk-mq rq.
> >>>>>
> >>>>> --------------------------------------------
> >>>>> ublk loop target, the backend is a file.
> >>>>>
> >>>>> IOPS(k)
> >>>>>
> >>>>> type ucmd ucmd-batch
> >>>>> seq-read 54.1 53.7
> >>>>> rand-read 52.0 52.0
> >>>>>
> >>>>> --------------------------------------------
> >>>>> ublk null target
> >>>>> IOPS(k)
> >>>>>
> >>>>> type ucmd ucmd-batch
> >>>>> seq-read 272 265
> >>>>> rand-read 262 260
> >>>>>
> >>>>> ------------
> >>>>> ublksrv test
> >>>>>
> >>>>> -------------
> >>>>> ucmd
> >>>>>
> >>>>> running loop/001
> >>>>> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_BZ85U), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> >>>>> randwrite: jobs 1, iops 66737
> >>>>> randread: jobs 1, iops 64935
> >>>>> randrw: jobs 1, iops read 32694 write 32710
> >>>>> rw(512k): jobs 1, iops read 772 write 819
> >>>>>
> >>>>> -------------
> >>>>> ucmd-batch
> >>>>>
> >>>>> running loop/001
> >>>>> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_F56a3), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> >>>>> randwrite: jobs 1, iops 66720
> >>>>> randread: jobs 1, iops 65015
> >>>>> randrw: jobs 1, iops read 32743 write 32759
> >>>>> rw(512k): jobs 1, iops read 771 write 817
> >>>>>
> >>>>>
> >>>>> It seems that manually putting rqs into llist and calling
> >>>>> io_uring_cmd_complete_in_task() while handling the last rq does
> >>>>> not improve IOPS much.
> >>>>>
> >>>>> io_req_task_work_add() puts task_work node into a internal llist
> >>>>> first, then it may call task_work_add() to run batched task_works.
> >>>>> IMO, io_uring has already done such batch optimization and ublk_drv
> >>>>> does not need to add such llist.
> >>>>
> >>>> The difference is just how batching is handled, looks blk-mq's batch info
> >>>> doesn't matter any more. In my test, looks the perf improvement is mainly
> >>>> made by enabling IORING_SETUP_COOP_TASKRUN in ublksrv.
> >>>
> >>> I guess only IORING_SETUP_COOP_TASKRUN helps improve IOPS. The llist in
> >>> ublk_drv does not improve IOPS.
> >>>
> >>>>
> >>>> Can you check if enabling IORING_SETUP_COOP_TASKRUN only can reach
> >>>> same perf with task_work_add()(ublk_drv is builtin) when building
> >>>> ublk_drv as module?
> >>>>
> >>>
> >>> OK.
> >>>
> >>
> >> Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz 16 cores
> >> 64GB MEM, CentOS 8, kernel 6.0+
> >> with IORING_SETUP_COOP_TASKRUN, without this kernel patch
> >>
> >> ucmd: io_uring_cmd_complete_in_task(), ublk_drv is a module
> >> tw: task_work_add(), ublk is built-in.
> >>
> >>
> >> --------
> >> fio test
> >>
> >> iodepth=128 numjobs=1 direct=1 bs=4k
> >>
> >> --------------------------------------------
> >> ublk loop target, the backend is a file.
> >>
> >> IOPS(k)
> >>
> >> type ucmd tw
> >> seq-read 54.1 53.8
> >> rand-read 52.0 52.0
> >>
> >> --------------------------------------------
> >> ublk null target
> >> IOPS(k)
> >>
> >> type ucmd tw
> >> seq-read 272 286
> >> rand-read 262 278
> >>
> >>
> >> ------------
> >> ublksrv test
> >>
> >> -------------
> >> ucmd
> >>
> >> running loop/001
> >> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_BZ85U), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> >> randwrite: jobs 1, iops 66737
> >> randread: jobs 1, iops 64935
> >> randrw: jobs 1, iops read 32694 write 32710
> >> rw(512k): jobs 1, iops read 772 write 819
> >>
> >> running null/001
> >> fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> >> randwrite: jobs 1, iops 715863
> >> randread: jobs 1, iops 758449
> >> randrw: jobs 1, iops read 357407 write 357183
> >> rw(512k): jobs 1, iops read 5895 write 5875
> >>
> >> -------------
> >> tw
> >>
> >> running loop/001
> >> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_pvLTL), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> >> randwrite: jobs 1, iops 66856
> >> randread: jobs 1, iops 65015
> >> randrw: jobs 1, iops read 32751 write 32767
> >> rw(512k): jobs 1, iops read 776 write 823
> >>
> >> running null/001
> >> fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> >> randwrite: jobs 1, iops 739450
> >> randread: jobs 1, iops 787500
> >> randrw: jobs 1, iops read 372956 write 372831
> >> rw(512k): jobs 1, iops read 5798 write 5777
> >
> > Looks the gap isn't big between ucmd and tw when running null/001, in
> > which the fio io process should saturate the CPU. Probably we
> > should avoid to touch 'cmd'/'pdu'/'io' in ublk_queue_rq() since these
> > data should be cold at that time.
> >
> > Can you apply the following delta patch against the current patch(
> > "ublk_drv: don't call task_work_add for queueing io commands") and
> > compare with task_work_add()?
> >
> > From ecbbf6d10dbc63e279ce0b55c45da6721947f18d Mon Sep 17 00:00:00 2001
> > From: Ming Lei <ming.lei@redhat.com>
> > Date: Tue, 25 Oct 2022 11:01:25 -0400
> > Subject: [PATCH] ublk: follow up change
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > drivers/block/ublk_drv.c | 36 ++++++++++++++++++++++--------------
> > 1 file changed, 22 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 7963fba66dd1..18db337094c1 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -56,9 +56,12 @@
> > /* All UBLK_PARAM_TYPE_* should be included here */
> > #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)
> >
> > +struct ublk_rq_data {
> > + struct llist_node node;
> > +};
> > +
> > struct ublk_uring_cmd_pdu {
> > struct request *req;
> > - struct llist_node node;
> > };
> >
> > /*
> > @@ -693,7 +696,8 @@ static inline void __ublk_rq_task_work(struct request *req)
> > *
> > * (2) current->flags & PF_EXITING.
> > */
> > - if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
> > + if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING
> > + || (io->flags & UBLK_IO_FLAG_ABORTED))) {
> > __ublk_abort_rq(ubq, req);
> > return;
>
> We cannot check io->flags & UBLK_IO_FLAG_ABORTED in io_uring task_work.
> We have to check io->flags & UBLK_IO_FLAG_ABORTED in ublk_queue_rq() because
> io_uring ctx may be freed at that time and ioucmd task_work is invalid(freed).
> Please See comment around it for more detail.
The comment is obsolete, because all inflight requests are drained
before io_uring is dead, please see __ublk_quiesce_dev().
>
> > }
> > @@ -757,11 +761,12 @@ static void ublk_rqs_task_work_cb(struct io_uring_cmd *cmd)
> > struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > struct ublk_queue *ubq = pdu->req->mq_hctx->driver_data;
> > struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds);
> > + struct ublk_rq_data *data;
> >
> > __ublk_rq_task_work(pdu->req);
> >
> > - llist_for_each_entry(pdu, io_cmds, node)
> > - __ublk_rq_task_work(pdu->req);
> > + llist_for_each_entry(data, io_cmds, node)
> > + __ublk_rq_task_work(blk_mq_rq_from_pdu(data));
> > }
> >
> > static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> > @@ -769,9 +774,6 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> > {
> > struct ublk_queue *ubq = hctx->driver_data;
> > struct request *rq = bd->rq;
> > - struct ublk_io *io = &ubq->ios[rq->tag];
> > - struct io_uring_cmd *cmd = io->cmd;
> > - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > blk_status_t res;
> >
> > /* fill iod to slot in io cmd buffer */
> > @@ -805,14 +807,11 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> > * Since releasing/allocating a tag implies smp_mb(), finding UBLK_IO_FLAG_ABORTED
> > * guarantees that here is a re-issued request aborted previously.
> > */
> > - if (unlikely(ubq_daemon_is_dying(ubq) ||
> > - (io->flags & UBLK_IO_FLAG_ABORTED))) {
> > + if (unlikely(ubq_daemon_is_dying(ubq))) {
> > __ublk_abort_rq(ubq, rq);
> > return BLK_STS_OK;
> > }
> >
> > - pdu->req = rq;
> > -
> > /*
> > * Typical multiple producers and single consumer, it is just fine
> > * to use llist_add() in producer side and llist_del_all() in
> > @@ -821,10 +820,18 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> > * The last command can't be added into list, otherwise it could
> > * be handled twice
> > */
> > - if (bd->last)
> > + if (bd->last) {
> > + struct ublk_io *io = &ubq->ios[rq->tag];
> > + struct io_uring_cmd *cmd = io->cmd;
> > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > +
> > + pdu->req = rq;
> > io_uring_cmd_complete_in_task(cmd, ublk_rqs_task_work_cb);
> > - else
> > - llist_add(&pdu->node, &ubq->io_cmds);
> > + } else {
> > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> > +
> > + llist_add(&data->node, &ubq->io_cmds);
> > + }
> >
> > return BLK_STS_OK;
> > }
> > @@ -1426,6 +1433,7 @@ static int ublk_add_tag_set(struct ublk_device *ub)
> > ub->tag_set.queue_depth = ub->dev_info.queue_depth;
> > ub->tag_set.numa_node = NUMA_NO_NODE;
> > ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> > + ub->tag_set.cmd_size = sizeof(struct ublk_rq_data);
> > ub->tag_set.driver_data = ub;
> > return blk_mq_alloc_tag_set(&ub->tag_set);
> > }
>
>
> Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz 16 cores
> 64GB MEM, CentOS 8, kernel 6.0+
> with IORING_SETUP_COOP_TASKRUN, without this kernel patch
>
> ucmd: io_uring_cmd_complete_in_task(), ublk_drv is a module
>
> ucmd-not-touch-pdu: use llist && do not touch 'cmd'/'pdu'/'io' in ublk_queue_rq()
>
> tw: task_work_add(), ublk is built-in.
>
>
> --------
> fio test
>
> iodepth=128 numjobs=1 direct=1 bs=4k
>
> --------------------------------------------
> ublk loop target, the backend is a file.
>
> IOPS(k)
>
> type ucmd tw ucmd-not-touch-pdu
> seq-read 54.1 53.8 53.6
> rand-read 52.0 52.0 52.0
>
> --------------------------------------------
> ublk null target
> IOPS(k)
>
> type ucmd tw ucmd-not-touch-pdu
> seq-read 272 286 275
> rand-read 262 278 269
>
>
> ------------
> ublksrv test
>
> -------------
> ucmd
>
> running loop/001
> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_BZ85U), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> randwrite: jobs 1, iops 66737
> randread: jobs 1, iops 64935
> randrw: jobs 1, iops read 32694 write 32710
> rw(512k): jobs 1, iops read 772 write 819
>
> running null/001
> fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> randwrite: jobs 1, iops 715863
> randread: jobs 1, iops 758449
> randrw: jobs 1, iops read 357407 write 357183
> rw(512k): jobs 1, iops read 5895 write 5875
>
> -------------
> tw
>
> running loop/001
> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_pvLTL), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> randwrite: jobs 1, iops 66856
> randread: jobs 1, iops 65015
> randrw: jobs 1, iops read 32751 write 32767
> rw(512k): jobs 1, iops read 776 write 823
>
> running null/001
> fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> randwrite: jobs 1, iops 739450
> randread: jobs 1, iops 787500
> randrw: jobs 1, iops read 372956 write 372831
> rw(512k): jobs 1, iops read 5798 write 5777
>
> -------------
> ucmd-not-touch-pdu
>
> running loop/001
> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_oH0eG), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> randwrite: jobs 1, iops 66754
> randread: jobs 1, iops 65032
> randrw: jobs 1, iops read 32776 write 32792
> rw(512k): jobs 1, iops read 772 write 818
>
> running null/001
> fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> randwrite: jobs 1, iops 725334
> randread: jobs 1, iops 741105
> randrw: jobs 1, iops read 360285 write 360047
> rw(512k): jobs 1, iops read 5770 write 5748
>
> Not touching cmd/pdu/io in ublk_queue_rq() improves IOPS.
> But it is worse than using task_work_add().
Thanks for the test! It is better to not share ucmd between
ublk blk-mq io context and ubq daemon context, and we can
improve it for using io_uring_cmd_complete_in_task(), and I
have one patch by using the batch handing approach in
io_uring_cmd_complete_in_task().
Another reason could be the extra __set_notify_signal() in
__io_req_task_work_add() via task_work_add(). When task_work_add()
is available, we just need to call __set_notify_signal() once
for the whole batch, but it can't be done in case of using
io_uring_cmd_complete_in_task().
Also the patch of 'use llist' is actually wrong since we have to
call io_uring_cmd_complete_in_task() once in ->commit_rqs(), but
that couldn't be easy because ucmd isn't available at that time.
I think we may have to live with task_work_add() until the perf
number is improved to same basically with io_uring_cmd_complete_in_task().
thanks,
Ming
next prev parent reply other threads:[~2022-10-26 11:29 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-23 9:38 [PATCH] ublk_drv: don't call task_work_add for queueing io commands Ming Lei
2022-10-24 9:48 ` Ziyang Zhang
2022-10-24 13:20 ` Ming Lei
2022-10-25 3:15 ` Ziyang Zhang
2022-10-25 7:19 ` Ming Lei
2022-10-25 7:46 ` Ziyang Zhang
2022-10-25 8:43 ` Ziyang Zhang
2022-10-25 15:17 ` Ming Lei
2022-10-26 10:32 ` Ziyang Zhang
2022-10-26 11:29 ` Ming Lei [this message]
2022-10-27 3:00 ` Ziyang Zhang
2022-10-27 15:38 ` Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y1kaFyan927Qnnnh@T590 \
--to=ming.lei@redhat.com \
--cc=ZiyangZhang@linux.alibaba.com \
--cc=axboe@kernel.dk \
--cc=linux-block@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).