From: Ming Lei <ming.lei@redhat.com>
To: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Cc: linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH] ublk_drv: don't call task_work_add for queueing io commands
Date: Tue, 25 Oct 2022 23:17:53 +0800 [thread overview]
Message-ID: <Y1f+IeFQZhZRmZda@T590> (raw)
In-Reply-To: <925c299f-0d2f-2970-9c85-2f67834dd2bf@linux.alibaba.com>
On Tue, Oct 25, 2022 at 04:43:56PM +0800, Ziyang Zhang wrote:
> On 2022/10/25 15:46, Ziyang Zhang wrote:
> > On 2022/10/25 15:19, Ming Lei wrote:
> >> On Tue, Oct 25, 2022 at 11:15:57AM +0800, Ziyang Zhang wrote:
> >>> On 2022/10/24 21:20, Ming Lei wrote:
> >>>> Hello Ziyang,
> >>>>
> >>>> On Mon, Oct 24, 2022 at 05:48:51PM +0800, Ziyang Zhang wrote:
> >>>>> On 2022/10/23 17:38, Ming Lei wrote:
> >>>>>> task_work_add() is used for waking ubq daemon task with one batch
> >>>>>> of io requests/commands queued. However, task_work_add() isn't
> >>>>>> exported for module code, and it is still debatable if the symbol
> >>>>>> should be exported.
> >>>>>>
> >>>>>> Fortunately we still have io_uring_cmd_complete_in_task() which just
> >>>>>> can't handle batched wakeup for us.
> >>>>>>
> >>>>>> Add one one llist into ublk_queue and call io_uring_cmd_complete_in_task()
> >>>>>> via current command for running them via task work.
> >>>>>>
> >>>>>> This way cleans up current code a lot, meantime allow us to wakeup
> >>>>>> ubq daemon task after queueing batched requests/io commands.
> >>>>>>
> >>>>>
> >>>>>
> >>>>> Hi, Ming
> >>>>>
> >>>>> This patch works and I have run some tests to compare current version(ucmd)
> >>>>> with your patch(ucmd-batch).
> >>>>>
> >>>>> iodepth=128 numjobs=1 direct=1 bs=4k
> >>>>>
> >>>>> --------------------------------------------
> >>>>> ublk loop target, the backend is a file.
> >>>>> IOPS(k)
> >>>>>
> >>>>> type ucmd ucmd-batch
> >>>>> seq-read 54.7 54.2
> >>>>> rand-read 52.8 52.0
> >>>>>
> >>>>> --------------------------------------------
> >>>>> ublk null target
> >>>>> IOPS(k)
> >>>>>
> >>>>> type ucmd ucmd-batch
> >>>>> seq-read 257 257
> >>>>> rand-read 252 253
> >>>>>
> >>>>>
> >>>>> I find that io_req_task_work_add() puts task_work node into a llist
> >>>>> first, then it may call task_work_add() to run batched task_works. So do we really
> >>>>> need such llist in ublk_drv? I think io_uring has already considered task_work batch
> >>>>> optimization.
> >>>>>
> >>>>> BTW, task_work_add() in ublk_drv achieves
> >>>>> higher IOPS(about 5-10% on my machine) than io_uring_cmd_complete_in_task()
> >>>>> in ublk_drv.
> >>>>
> >>>> Yeah, that is same with my observation, and motivation of this patch is
> >>>> to get same performance with task_work_add by building ublk_drv as
> >>>> module. One win of task_work_add() is that we get exact batching info
> >>>> meantime only send TWA_SIGNAL_NO_IPI for whole batch, that is basically
> >>>> what the patch is doing, but needs help of the following ublksrv patch:
> >>>>
> >>>> https://github.com/ming1/ubdsrv/commit/dce6d1d222023c1641292713b311ced01e6dc548
> >>>>
> >>>> which sets IORING_SETUP_COOP_TASKRUN for ublksrv's uring, then
> >>>> io_uring_cmd_complete_in_task will notify via TWA_SIGNAL_NO_IPI, and 5+%
> >>>> IOPS boost is observed on loop/001 by putting image on SSD in my test
> >>>> VM.
> >>>
> >>> Hi Ming,
> >>>
> >>> I have added this ublksrv patch and run the above test again.
> >>> I have also run ublksrv test: loop/001. Please check them.
> >>>
> >>> Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz 16 cores
> >>> 64GB MEM, CentOS 8, kernel 6.0+
> >>>
> >>> --------
> >>> fio test
> >>>
> >>> iodepth=128 numjobs=1 direct=1 bs=4k
> >>>
> >>> ucmd: without your kernel patch. Run io_uring_cmd_complete_in_task()
> >>> for each blk-mq rq.
> >>>
> >>> ucmd-batch: with your kernel patch. Run io_uring_cmd_complete_in_task()
> >>> for the last blk-mq rq.
> >>>
> >>> --------------------------------------------
> >>> ublk loop target, the backend is a file.
> >>>
> >>> IOPS(k)
> >>>
> >>> type ucmd ucmd-batch
> >>> seq-read 54.1 53.7
> >>> rand-read 52.0 52.0
> >>>
> >>> --------------------------------------------
> >>> ublk null target
> >>> IOPS(k)
> >>>
> >>> type ucmd ucmd-batch
> >>> seq-read 272 265
> >>> rand-read 262 260
> >>>
> >>> ------------
> >>> ublksrv test
> >>>
> >>> -------------
> >>> ucmd
> >>>
> >>> running loop/001
> >>> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_BZ85U), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> >>> randwrite: jobs 1, iops 66737
> >>> randread: jobs 1, iops 64935
> >>> randrw: jobs 1, iops read 32694 write 32710
> >>> rw(512k): jobs 1, iops read 772 write 819
> >>>
> >>> -------------
> >>> ucmd-batch
> >>>
> >>> running loop/001
> >>> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_F56a3), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> >>> randwrite: jobs 1, iops 66720
> >>> randread: jobs 1, iops 65015
> >>> randrw: jobs 1, iops read 32743 write 32759
> >>> rw(512k): jobs 1, iops read 771 write 817
> >>>
> >>>
> >>> It seems that manually putting rqs into llist and calling
> >>> io_uring_cmd_complete_in_task() while handling the last rq does
> >>> not improve IOPS much.
> >>>
> >>> io_req_task_work_add() puts task_work node into a internal llist
> >>> first, then it may call task_work_add() to run batched task_works.
> >>> IMO, io_uring has already done such batch optimization and ublk_drv
> >>> does not need to add such llist.
> >>
> >> The difference is just how batching is handled, looks blk-mq's batch info
> >> doesn't matter any more. In my test, looks the perf improvement is mainly
> >> made by enabling IORING_SETUP_COOP_TASKRUN in ublksrv.
> >
> > I guess only IORING_SETUP_COOP_TASKRUN helps improve IOPS. The llist in
> > ublk_drv does not improve IOPS.
> >
> >>
> >> Can you check if enabling IORING_SETUP_COOP_TASKRUN only can reach
> >> same perf with task_work_add()(ublk_drv is builtin) when building
> >> ublk_drv as module?
> >>
> >
> > OK.
> >
>
> Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz 16 cores
> 64GB MEM, CentOS 8, kernel 6.0+
> with IORING_SETUP_COOP_TASKRUN, without this kernel patch
>
> ucmd: io_uring_cmd_complete_in_task(), ublk_drv is a module
> tw: task_work_add(), ublk is built-in.
>
>
> --------
> fio test
>
> iodepth=128 numjobs=1 direct=1 bs=4k
>
> --------------------------------------------
> ublk loop target, the backend is a file.
>
> IOPS(k)
>
> type ucmd tw
> seq-read 54.1 53.8
> rand-read 52.0 52.0
>
> --------------------------------------------
> ublk null target
> IOPS(k)
>
> type ucmd tw
> seq-read 272 286
> rand-read 262 278
>
>
> ------------
> ublksrv test
>
> -------------
> ucmd
>
> running loop/001
> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_BZ85U), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> randwrite: jobs 1, iops 66737
> randread: jobs 1, iops 64935
> randrw: jobs 1, iops read 32694 write 32710
> rw(512k): jobs 1, iops read 772 write 819
>
> running null/001
> fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> randwrite: jobs 1, iops 715863
> randread: jobs 1, iops 758449
> randrw: jobs 1, iops read 357407 write 357183
> rw(512k): jobs 1, iops read 5895 write 5875
>
> -------------
> tw
>
> running loop/001
> fio (ublk/loop( -f /root/work/ubdsrv/tests/tmp/ublk_loop_1G_pvLTL), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> randwrite: jobs 1, iops 66856
> randread: jobs 1, iops 65015
> randrw: jobs 1, iops read 32751 write 32767
> rw(512k): jobs 1, iops read 776 write 823
>
> running null/001
> fio (ublk/null(), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0)...
> randwrite: jobs 1, iops 739450
> randread: jobs 1, iops 787500
> randrw: jobs 1, iops read 372956 write 372831
> rw(512k): jobs 1, iops read 5798 write 5777
Looks the gap isn't big between ucmd and tw when running null/001, in
which the fio io process should saturate the CPU. Probably we
should avoid to touch 'cmd'/'pdu'/'io' in ublk_queue_rq() since these
data should be cold at that time.
Can you apply the following delta patch against the current patch(
"ublk_drv: don't call task_work_add for queueing io commands") and
compare with task_work_add()?
From ecbbf6d10dbc63e279ce0b55c45da6721947f18d Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 25 Oct 2022 11:01:25 -0400
Subject: [PATCH] ublk: follow up change
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/ublk_drv.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 7963fba66dd1..18db337094c1 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -56,9 +56,12 @@
/* All UBLK_PARAM_TYPE_* should be included here */
#define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)
+struct ublk_rq_data {
+ struct llist_node node;
+};
+
struct ublk_uring_cmd_pdu {
struct request *req;
- struct llist_node node;
};
/*
@@ -693,7 +696,8 @@ static inline void __ublk_rq_task_work(struct request *req)
*
* (2) current->flags & PF_EXITING.
*/
- if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
+ if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING
+ || (io->flags & UBLK_IO_FLAG_ABORTED))) {
__ublk_abort_rq(ubq, req);
return;
}
@@ -757,11 +761,12 @@ static void ublk_rqs_task_work_cb(struct io_uring_cmd *cmd)
struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
struct ublk_queue *ubq = pdu->req->mq_hctx->driver_data;
struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds);
+ struct ublk_rq_data *data;
__ublk_rq_task_work(pdu->req);
- llist_for_each_entry(pdu, io_cmds, node)
- __ublk_rq_task_work(pdu->req);
+ llist_for_each_entry(data, io_cmds, node)
+ __ublk_rq_task_work(blk_mq_rq_from_pdu(data));
}
static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -769,9 +774,6 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
{
struct ublk_queue *ubq = hctx->driver_data;
struct request *rq = bd->rq;
- struct ublk_io *io = &ubq->ios[rq->tag];
- struct io_uring_cmd *cmd = io->cmd;
- struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
blk_status_t res;
/* fill iod to slot in io cmd buffer */
@@ -805,14 +807,11 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
* Since releasing/allocating a tag implies smp_mb(), finding UBLK_IO_FLAG_ABORTED
* guarantees that here is a re-issued request aborted previously.
*/
- if (unlikely(ubq_daemon_is_dying(ubq) ||
- (io->flags & UBLK_IO_FLAG_ABORTED))) {
+ if (unlikely(ubq_daemon_is_dying(ubq))) {
__ublk_abort_rq(ubq, rq);
return BLK_STS_OK;
}
- pdu->req = rq;
-
/*
* Typical multiple producers and single consumer, it is just fine
* to use llist_add() in producer side and llist_del_all() in
@@ -821,10 +820,18 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
* The last command can't be added into list, otherwise it could
* be handled twice
*/
- if (bd->last)
+ if (bd->last) {
+ struct ublk_io *io = &ubq->ios[rq->tag];
+ struct io_uring_cmd *cmd = io->cmd;
+ struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+
+ pdu->req = rq;
io_uring_cmd_complete_in_task(cmd, ublk_rqs_task_work_cb);
- else
- llist_add(&pdu->node, &ubq->io_cmds);
+ } else {
+ struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+
+ llist_add(&data->node, &ubq->io_cmds);
+ }
return BLK_STS_OK;
}
@@ -1426,6 +1433,7 @@ static int ublk_add_tag_set(struct ublk_device *ub)
ub->tag_set.queue_depth = ub->dev_info.queue_depth;
ub->tag_set.numa_node = NUMA_NO_NODE;
ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+ ub->tag_set.cmd_size = sizeof(struct ublk_rq_data);
ub->tag_set.driver_data = ub;
return blk_mq_alloc_tag_set(&ub->tag_set);
}
--
2.31.1
Thanks,
Ming
next prev parent reply other threads:[~2022-10-25 15:18 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-23 9:38 [PATCH] ublk_drv: don't call task_work_add for queueing io commands Ming Lei
2022-10-24 9:48 ` Ziyang Zhang
2022-10-24 13:20 ` Ming Lei
2022-10-25 3:15 ` Ziyang Zhang
2022-10-25 7:19 ` Ming Lei
2022-10-25 7:46 ` Ziyang Zhang
2022-10-25 8:43 ` Ziyang Zhang
2022-10-25 15:17 ` Ming Lei [this message]
2022-10-26 10:32 ` Ziyang Zhang
2022-10-26 11:29 ` Ming Lei
2022-10-27 3:00 ` Ziyang Zhang
2022-10-27 15:38 ` Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y1f+IeFQZhZRmZda@T590 \
--to=ming.lei@redhat.com \
--cc=ZiyangZhang@linux.alibaba.com \
--cc=axboe@kernel.dk \
--cc=linux-block@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.