Re: [PATCH] ublk: decouple hctx and ublk server threads

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ming Lei <ming.lei@redhat.com>
To: Uday Shankar <ushankar@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Subject: Re: [PATCH] ublk: decouple hctx and ublk server threads
Date: Tue, 8 Oct 2024 09:47:39 +0800	[thread overview]
Message-ID: <ZwSPO4b6rccVVx-H@fedora> (raw)
In-Reply-To: <ZwQ7cQPSiUlmEGZc@dev-ushankar.dev.purestorage.com>

On Mon, Oct 07, 2024 at 01:50:09PM -0600, Uday Shankar wrote:
> On Sun, Oct 06, 2024 at 05:20:05PM +0800, Ming Lei wrote:
> > On Wed, Oct 02, 2024 at 04:44:37PM -0600, Uday Shankar wrote:
> > > Currently, ublk_drv associates to each hardware queue (hctx) a unique
> > > task (called the queue's ubq_daemon) which is allowed to issue
> > > COMMIT_AND_FETCH commands against the hctx. If any other task attempts
> > > to do so, the command fails immediately with EINVAL. When considered
> > > together with the block layer architecture, the result is that for each
> > > CPU C on the system, there is a unique ublk server thread which is
> > > allowed to handle I/O submitted on CPU C. This can lead to suboptimal
> > > performance under imbalanced load generation. For an extreme example,
> > > suppose all the load is generated on CPUs mapping to a single ublk
> > > server thread. Then that thread may be fully utilized and become the
> > > bottleneck in the system, while other ublk server threads are totally
> > > idle.
> > 
> > I am wondering why the problem can't be avoided by setting ublk server's
> > thread affinity manually.
> 
> I don't think the ublk server thread CPU affinity has any effect here.
> Assuming that the ublk server threads do not pass I/Os between
> themselves to balance the load, each ublk server thread must handle all
> the I/O issued to its associated hctx, and each thread is limited by how
> much CPU it can get. Since threads are the unit of parallelism, one
> thread can make use of at most one CPU, regardless of the affinity of
> the thread. And this can become a bottleneck.

If ublk server may be saturated, there is at least two choices:

- increase nr_hw_queues, so each ublk server thread can handle IOs from
  less CPUs

- let ublk server focus on submitting UBLK_IO_COMMIT_AND_FETCH_REQ
uring_cmd, and moving actual IO handling into new worker thread if ublk
server becomes saturated, and the communication can be done with eventfd,
please see example in:

https://github.com/ublk-org/ublksrv/blob/master/demo_event.c

> 
> > > be balanced across all ublk server threads by having the threads fetch
> > > I/Os for the same QID in a round robin manner. For example, in a system
> > > with 4 ublk server threads, 2 hctxs, and a queue depth of 4, the threads
> > > could issue fetch requests as follows (where each entry is of the form
> > > qid, tag):
> > > 
> > > poller thread:	T0	T1	T2	T3
> > > 		0,0	0,1	0,2	0,3
> > > 		1,3	1,0	1,1	1,2
> > 
> > How many ublk devices there are? If it is 1, just wondering why you use
> > 4 threads? Usually one thread is enough to drive one queue, and the
> > actually io command handling can be moved to new work thread if the queue
> > thread is saturated.
> 
> This is just a small example to demonstrate the idea, not necessarily a
> realistic one.

OK, but I'd suggest to share examples closing to reality, then we can
just focus on problems in real cases.

> 
> > > -static inline void ublk_forward_io_cmds(struct ublk_queue *ubq,
> > > -					unsigned issue_flags)
> > > -{
> > > -	struct llist_node *io_cmds = llist_del_all(&ubq->io_cmds);
> > > -	struct ublk_rq_data *data, *tmp;
> > > -
> > > -	io_cmds = llist_reverse_order(io_cmds);
> > > -	llist_for_each_entry_safe(data, tmp, io_cmds, node)
> > > -		__ublk_rq_task_work(blk_mq_rq_from_pdu(data), issue_flags);
> > > -}
> > > -
> > > -static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags)
> > > -{
> > > -	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > -	struct ublk_queue *ubq = pdu->ubq;
> > > -
> > > -	ublk_forward_io_cmds(ubq, issue_flags);
> > > -}
> > > -
> > >  static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> > >  {
> > > -	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> > > -
> > > -	if (llist_add(&data->node, &ubq->io_cmds)) {
> > > -		struct ublk_io *io = &ubq->ios[rq->tag];
> > > -
> > > -		io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb);
> > > -	}
> > > +	struct ublk_io *io = &ubq->ios[rq->tag];
> > > +	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(io->cmd);
> > > +	pdu->req = rq;
> > > +	io_uring_cmd_complete_in_task(io->cmd, __ublk_rq_task_work);
> > >  }
> > 
> > It should be fine to convert to io_uring_cmd_complete_in_task() since
> > the callback list is re-ordered in io_uring.
> 
> Yes, I noticed that task_work has (lockless) internal queueing, so
> there shouldn't be a need to maintain our own queue of commands in
> ublk_drv. I can factor this change out into its own patch if that is
> useful.

Yeah, please go ahead, since it does simplify things.

> 
> > 
> > >  
> > >  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> > >  {
> > >  	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > > +	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> > >  	unsigned int nr_inflight = 0;
> > >  	int i;
> > >  
> > >  	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
> > > -		if (!ubq->timeout) {
> > > -			send_sig(SIGKILL, ubq->ubq_daemon, 0);
> > > -			ubq->timeout = true;
> > > -		}
> > > -
> > > +		send_sig(SIGKILL, data->task, 0);
> > >  		return BLK_EH_DONE;
> > >  	}
> > >  
> > > -	if (!ubq_daemon_is_dying(ubq))
> > > +	if (!(data->task->flags & PF_EXITING))
> > >  		return BLK_EH_RESET_TIMER;
> > 
> > ->task is only for error handling, but it may not work any more since
> > who knows which task is for handling the io command actually.
> 
> Yes, you are right - this part right here is the only reason we need to
> save/take a reference to the task. I have a couple alternative ideas:
> 
> 1. Don't kill anything if a timeout happens. Instead, synchronize
>    against the "normal" completion path (i.e. commit_and_fetch), and if
>    timeout happens first, normal completion gets an error. If normal
>    completion happens first, timeout does nothing.

But how to synchronize? Looks the only weapon could be RCU.

Also one server thread may have bug and run into dead loop.

> 2. Require that all threads handling I/O are threads of the same process
>    (in the kernel, I think this means their task_struct::group_leader is
>    the same?)

So far we only allow single process to open /dev/ublkcN, so all threads
have to belong to same process.

And that can be thought as another limit of ublk implementation.

> In the normal completion path, we replace the check that
>    exists today (check equality with ubq_daemon) with ensuring that the
>    current task is within the process. In the timeout path, we send
>    SIGKILL to the top-level process, which should propagate to the
>    threads as well.

It should be enough to kill the only process which opens '/dev/ublkcN'.

> 
> Does either of those sound okay?

Looks #2 is more doable.


Thanks,
Ming

next prev parent reply	other threads:[~2024-10-08  1:47 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-02 22:44 [PATCH] ublk: decouple hctx and ublk server threads Uday Shankar
2024-10-03 23:40 ` Uday Shankar
2024-10-05 21:24 ` kernel test robot
2024-10-06  9:20 ` Ming Lei
2024-10-07 19:50   ` Uday Shankar
2024-10-08  1:47     ` Ming Lei [this message]
2024-10-08  2:11       ` Ming Lei
2024-10-09 21:11         ` Uday Shankar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZwSPO4b6rccVVx-H@fedora \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=linux-block@vger.kernel.org \
    --cc=ushankar@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.