From: Ming Lei <ming.lei@redhat.com>
To: Uday Shankar <ushankar@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Subject: Re: [PATCH] ublk: decouple hctx and ublk server threads
Date: Tue, 8 Oct 2024 09:47:39 +0800
Message-ID: <ZwSPO4b6rccVVx-H@fedora>
In-Reply-To: <ZwQ7cQPSiUlmEGZc@dev-ushankar.dev.purestorage.com>
On Mon, Oct 07, 2024 at 01:50:09PM -0600, Uday Shankar wrote:
> On Sun, Oct 06, 2024 at 05:20:05PM +0800, Ming Lei wrote:
> > On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote:
> > > Currently, ublk_drv associates to each hardware queue (hctx) a unique
> > > task (called the queue's ubq_daemon) which is allowed to issue
> > > COMMIT_AND_FETCH commands against the hctx. If any other task attempts
> > > to do so, the command fails immediately with EINVAL. When considered
> > > together with the block layer architecture, the result is that for each
> > > CPU C on the system, there is a unique ublk server thread which is
> > > allowed to handle I/O submitted on CPU C. This can lead to suboptimal
> > > performance under imbalanced load generation. For an extreme example,
> > > suppose all the load is generated on CPUs mapping to a single ublk
> > > server thread. Then that thread may be fully utilized and become the
> > > bottleneck in the system, while other ublk server threads are totally
> > > idle.
> >
> > I am wondering why the problem can't be avoided by setting the ublk
> > server's thread affinity manually.
>
> I don't think the ublk server thread CPU affinity has any effect here.
> Assuming that the ublk server threads do not pass I/Os between
> themselves to balance the load, each ublk server thread must handle all
> the I/O issued to its associated hctx, and each thread is limited by how
> much CPU it can get. Since threads are the unit of parallelism, one
> thread can make use of at most one CPU, regardless of the affinity of
> the thread. And this can become a bottleneck.
If the ublk server may become saturated, there are at least two choices:
- increase nr_hw_queues, so each ublk server thread handles IOs from
fewer CPUs
- let the ublk server thread focus on submitting UBLK_IO_COMMIT_AND_FETCH_REQ
uring_cmds, and move the actual IO handling into a new worker thread when
the server becomes saturated; the communication can be done with an eventfd
(rough sketch below), please see the example in:
https://github.com/ublk-org/ublksrv/blob/master/demo_event.c
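
Roughly, the handoff looks like this (a minimal userspace sketch only,
not the ublksrv API; push_tag()/pop_tag()/handle_io() and the demo_queue
layout are hypothetical):

#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

struct demo_queue {
	int work_efd;	/* queue thread -> worker doorbell, eventfd(0, 0) */
	int done_efd;	/* worker -> queue thread doorbell, eventfd(0, 0) */
	/* plus a tag ring, per-io state, ... */
};

/* hypothetical helpers, e.g. around an SPSC tag ring;
 * pop_tag() returns 0 on success, nonzero when the ring is empty */
int push_tag(struct demo_queue *q, int tag);
int pop_tag(struct demo_queue *q, int *tag);
void handle_io(struct demo_queue *q, int tag);

/*
 * Queue thread: hand a fetched io off to the worker, then go straight
 * back to submitting FETCH/COMMIT_AND_FETCH uring_cmds.
 */
static void offload_io(struct demo_queue *q, int tag)
{
	uint64_t one = 1;

	push_tag(q, tag);
	write(q->work_efd, &one, sizeof(one));	/* wake the worker */
}

/* Worker thread: block on the eventfd, drain tags, do the slow work. */
static void *worker_fn(void *arg)
{
	struct demo_queue *q = arg;
	uint64_t cnt;
	int tag;

	while (read(q->work_efd, &cnt, sizeof(cnt)) == sizeof(cnt)) {
		while (!pop_tag(q, &tag))
			handle_io(q, tag);
		/* tell the queue thread it can COMMIT now */
		write(q->done_efd, &cnt, sizeof(cnt));
	}
	return NULL;
}

The queue thread can register a plain read on done_efd in its io_uring,
so worker completions and uring_cmd CQEs are handled in one event loop,
which is roughly what demo_event.c does.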
>
> > > be balanced across all ublk server threads by having the threads fetch
> > > I/Os for the same QID in a round robin manner. For example, in a system
> > > with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads
> > > could issue fetch requests as follows (where each entry is of the form
> > > qid, tag):
> > >
> > > poller thread:  T0   T1   T2   T3
> > >                 0,0  0,1  0,2  0,3
> > >                 1,3  1,0  1,1  1,2
> >
> > How many ublk devices are there? If it is 1, I am just wondering why
> > you use 4 threads. Usually one thread is enough to drive one queue, and
> > the actual io command handling can be moved to a new worker thread if
> > the queue thread is saturated.
>
> This is just a small example to demonstrate the idea, not necessarily a
> realistic one.
OK, but I'd suggest sharing examples closer to reality, so we can
focus on problems in real cases.
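
For the record, the round-robin assignment in the toy example above
boils down to a one-liner (hypothetical helper, not in the patch):

/*
 * (qid, tag) is fetched by thread (qid + tag) % nr_threads, so
 * consecutive tags of one queue, and the same tag across queues,
 * land on different threads, matching the table above.
 */
static inline int io_to_thread(int qid, int tag, int nr_threads)
{
	return (qid + tag) % nr_threads;
}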
>
> > > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq,
> > > - unsigned issue_flags)
> > > -{
> > > - struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds);
> > > - struct ublk_rq_data *data, *tmp;
> > > -
> > > - io_cmds = llist_reverse_order(io_cmds);
> > > - llist_for_each_entry_safe(data, tmp, io_cmds, node)
> > > - __ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags);
> > > -}
> > > -
> > > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags)
> > > -{
> > > - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > - struct ublk_queue *ubq = pdu->ubq;
> > > -
> > > - ublk_forward_io_cmds(ubq, issue_flags);
> > > -}
> > > -
> > > static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> > > {
> > > - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> > > -
> > > - if (llist_add(&data->node, &ubq->io_cmds)) {
> > > - struct ublk_io *io = &ubq->ios[rq->tag];
> > > -
> > > - io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb);
> > > - }
> > > + struct ublk_io *io = &ubq->ios[rq->tag];
> > > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd);
> > > + pdu->req = rq;
> > > + io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work);
> > > }
> >
> > It should be fine to convert to io_uring_cmd_complete_in_task() since
> > the callback list is re-ordered in io_uring.
>
> Yes, I noticed that task_work has (lockless) internal queueing, so
> there shouldn't be a need to maintain our own queue of commands in
> ublk_drv. I can factor this change out into its own patch if that is
> useful.
Yeah, please go ahead, since it does simplify things.
>
> >
> > >
> > > static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> > > {
> > > struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > > + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> > > unsigned int nr_inflight = 0;
> > > int i;
> > >
> > > if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
> > > - if (!ubq->timeout) {
> > > - send_sig(SIGKILL, ubq->ubq_daemon, 0);
> > > - ubq->timeout = true;
> > > - }
> > > -
> > > + send_sig(SIGKILL, data->task, 0);
> > > return BLK_EH_DONE;
> > > }
> > >
> > > - if (!ubq_daemon_is_dying(ubq))
> > > + if (!(data->task->flags & PF_EXITING))
> > > return BLK_EH_RESET_TIMER;
> >
> > ->task is only used for error handling, but it may not work any more,
> > since we can't know which task is actually handling the io command.
>
> Yes, you are right - this part is the only reason we need to save/take
> a reference to the task. I have a couple of alternative ideas:
>
> 1. Don't kill anything if a timeout happens. Instead, synchronize
> against the "normal" completion path (i.e. commit_and_fetch), and if
> timeout happens first, normal completion gets an error. If normal
> completion happens first, timeout does nothing.
But how would you synchronize? It looks like the only weapon could be RCU.
Also, one server thread may have a bug and run into a dead loop. A rough
sketch of the claim-based resolution that #1 describes is below.
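
Something like this, assuming a new per-io atomic state field (purely
hypothetical, not in today's ublk_drv):

#include <linux/atomic.h>

/*
 * Whoever flips the state away from INFLIGHT first owns the
 * completion; the loser backs off.
 */
enum { UBLK_IO_INFLIGHT, UBLK_IO_DONE, UBLK_IO_TIMED_OUT };

static bool ublk_claim_io(struct ublk_io *io, int new_state)
{
	/* io->state would be a new atomic_t in struct ublk_io */
	return atomic_cmpxchg(&io->state, UBLK_IO_INFLIGHT, new_state)
			== UBLK_IO_INFLIGHT;
}

The timeout path would claim UBLK_IO_TIMED_OUT and fail the request
itself; UBLK_IO_COMMIT_AND_FETCH_REQ would claim UBLK_IO_DONE and
return an error to the server if the timeout won. But that still does
nothing about a server thread stuck in a dead loop.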
> 2. Require that all threads handling I/O are threads of the same process
> (in the kernel, I think this means their task_struct::group_leader is
> the same?)
So far we only allow a single process to open /dev/ublkcN, so all threads
have to belong to the same process.
And that can be thought of as another limit of the ublk implementation.
> In the normal completion path, we replace the check that
> exists today (check equality with ubq_daemon) with ensuring that the
> current task is within the process. In the timeout path, we send
> SIGKILL to the top-level process, which should propagate to the
> threads as well.
It should be enough to kill the only process which opens '/dev/ublkcN'.
>
> Does either of those sound okay?
Looks like #2 is more doable. A minimal sketch of the check is below.
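
Assuming the task recorded at FETCH time is kept in a per-io field
(io->task here is hypothetical), the commit-side check could use the
existing same_thread_group() helper:

#include <linux/sched/signal.h>

static int ublk_check_commit_task(struct ublk_io *io)
{
	/*
	 * Replaces the old "current == ubq->ubq_daemon" check: any
	 * thread of the server process may commit, nobody else.
	 */
	if (!same_thread_group(current, io->task))
		return -EINVAL;
	return 0;
}

And the timeout path can keep sending SIGKILL to one recorded task:
SIGKILL is group-fatal, so it takes down every thread of the server
process.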
Thanks,
Ming