Re: [PATCH V3 5/9] io_uring: support SQE group

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ming Lei <ming.lei@redhat.com>
To: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	io-uring@vger.kernel.org, linux-block@vger.kernel.org,
	Kevin Wolf <kwolf@redhat.com>,
	ming.lei@redhat.com
Subject: Re: [PATCH V3 5/9] io_uring: support SQE group
Date: Mon, 17 Jun 2024 11:54:34 +0800	[thread overview]
Message-ID: <Zm+zemW1Bi5ZF9DE@fedora> (raw)
In-Reply-To: <f29e662c-5959-4b90-845a-0e3c81a174e6@gmail.com>

On Sun, Jun 16, 2024 at 08:13:26PM +0100, Pavel Begunkov wrote:
> On 6/13/24 02:45, Ming Lei wrote:
> > On Mon, Jun 10, 2024 at 03:53:51AM +0100, Pavel Begunkov wrote:
> > > On 5/11/24 01:12, Ming Lei wrote:
> > > > SQE group is defined as one chain of SQEs starting with the first SQE that
> > > > has IOSQE_SQE_GROUP set, and ending with the first subsequent SQE that
> > > > doesn't have it set, and it is similar with chain of linked SQEs.
> > > 
> > > The main concern stays same, it adds overhead nearly to every
> > > single hot function I can think of, as well as lots of
> > > complexity.
> > 
> > Almost every sqe group change is covered by REQ_F_SQE_GROUP, so I am
> > not clear what the added overhead is.
> 
> Yes, and there is a dozen of such in the hot path.

req->flags is supposed to be L1-cached in all these hot paths, and the
check is basically zero cost, so SQE_GROUP shouldn't add extra cost for
existed io_uring core code path.

> 
> > > Another minor issue is REQ_F_INFLIGHT, as explained before,
> > > cancellation has to be able to find all REQ_F_INFLIGHT
> > > requests. Requests you add to a group can have that flag
> > > but are not discoverable by core io_uring code.
> > 
> > OK, we can deal with it by setting leader as REQ_F_INFLIGHT if the
> > flag is set for any member, since all members are guaranteed to
> > be drained when leader is completed. Will do it in V4.
> 
> Or fail if see one, that's also fine. REQ_F_INFLIGHT is
> only set for POLL requests polling another io_uring.

It is set for read-write/tee/splice op with normal file too, so looks
not safe to fail.

> 
> > > Another note, I'll be looking deeper into this patch, there
> > > is too much of random tossing around of requests / refcounting
> > > and other dependencies, as well as odd intertwinings with
> > > other parts.
> > 
> > The only thing wrt. request refcount is for io-wq, since request
> > reference is grabbed when the req is handled in io-wq context, and
> > group leader need to be completed after all members are done. That
> > is all special change wrt. request refcounting.
> 
> I rather mean refcounting the group leader, even if it's not
> atomic.

If you mean reusing req->refs for refcounting the group leader, it may
not work, cause member can complete from io-wq, but leader may not.
Meantime using dedicated ->grp_refs actually simplifies things a lot.

> 
> > > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > > > index 7a6b190c7da7..62311b0f0e0b 100644
> > > > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > > > index c184c9a312df..b87c5452de43 100644
> > > > --- a/io_uring/io_uring.c
> > > > +++ b/io_uring/io_uring.c
> ...
> > > >    	}
> > > >    }
> > > > +static inline bool need_queue_group_members(struct io_kiocb *req)
> > > > +{
> > > > +	return req_is_group_leader(req) && req->grp_link;
> > > > +}
> > > > +
> > > > +/* Can only be called after this request is issued */
> > > > +static inline struct io_kiocb *get_group_leader(struct io_kiocb *req)
> > > > +{
> > > > +	if (req->flags & REQ_F_SQE_GROUP) {
> > > > +		if (req_is_group_leader(req))
> > > > +			return req;
> > > > +		return req->grp_link;
> > > 
> > > I'm missing something, it seems io_group_sqe() adding all
> > > requests of a group into a singly linked list via ->grp_link,
> > > but here we return it as a leader. Confused.
> > 
> > ->grp_link stores the singly linked list for group leader, and
> > the same field stores the group leader pointer for group member requests.
> > For later, we can add one union field to make code more readable.
> > Will do that in V4.
> 
> So you're repurposing it in io_queue_group_members(). Since
> it has different meaning at different stages of execution,
> it warrants a comment (unless there is one I missed).

OK, either adding comment or another union field for it.

> 
> > > > +	}
> > > > +	return NULL;
> > > > +}
> > > > +
> > > > +void io_cancel_group_members(struct io_kiocb *req, bool ignore_cqes)
> > > > +{
> > > > +	struct io_kiocb *member = req->grp_link;
> > > > +
> > > > +	while (member) {
> > > > +		struct io_kiocb *next = member->grp_link;
> > > > +
> > > > +		if (ignore_cqes)
> > > > +			member->flags |= REQ_F_CQE_SKIP;
> > > > +		if (!(member->flags & REQ_F_FAIL)) {
> > > > +			req_set_fail(member);
> > > > +			io_req_set_res(member, -ECANCELED, 0);
> > > > +		}
> > > > +		member = next;
> > > > +	}
> > > > +}
> > > > +
> > > > +void io_queue_group_members(struct io_kiocb *req, bool async)
> > > > +{
> > > > +	struct io_kiocb *member = req->grp_link;
> > > > +
> > > > +	if (!member)
> > > > +		return;
> > > > +
> > > > +	while (member) {
> > > > +		struct io_kiocb *next = member->grp_link;
> > > > +
> > > > +		member->grp_link = req;
> > > > +		if (async)
> > > > +			member->flags |= REQ_F_FORCE_ASYNC;
> > > > +
> > > > +		if (unlikely(member->flags & REQ_F_FAIL)) {
> > > > +			io_req_task_queue_fail(member, member->cqe.res);
> > > > +		} else if (member->flags & REQ_F_FORCE_ASYNC) {
> > > > +			io_req_task_queue(member);
> > > > +		} else {
> > > > +			io_queue_sqe(member);
> 
> io_req_queue_tw_complete() please, just like links deal
> with it, so it's executed in a well known context without
> jumping ahead of other requests.

members needn't to be queued until leader is completed for plain
SQE_GROUP, otherwise perf can drop.

> 
> > > > +		}
> > > > +		member = next;
> > > > +	}
> > > > +	req->grp_link = NULL;
> > > > +}
> > > > +
> > > > +static inline bool __io_complete_group_req(struct io_kiocb *req,
> > > > +			     struct io_kiocb *lead)
> > > > +{
> > > > +	WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP));
> > > > +
> > > > +	if (WARN_ON_ONCE(lead->grp_refs <= 0))
> > > > +		return false;
> > > > +
> > > > +	/*
> > > > +	 * Set linked leader as failed if any member is failed, so
> > > > +	 * the remained link chain can be terminated
> > > > +	 */
> > > > +	if (unlikely((req->flags & REQ_F_FAIL) &&
> > > > +		     ((lead->flags & IO_REQ_LINK_FLAGS) && lead->link)))
> > > > +		req_set_fail(lead);
> > > > +	return !--lead->grp_refs;
> > > > +}
> > > > +
> > > > +/* Complete group request and collect completed leader for freeing */
> > > > +static inline void io_complete_group_req(struct io_kiocb *req,
> > > > +		struct io_wq_work_list *grp_list)
> > > > +{
> > > > +	struct io_kiocb *lead = get_group_leader(req);
> > > > +
> > > > +	if (__io_complete_group_req(req, lead)) {
> > > > +		req->flags &= ~REQ_F_SQE_GROUP;
> > > > +		lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
> > > > +		if (!(lead->flags & REQ_F_CQE_SKIP))
> > > > +			io_req_commit_cqe(lead, lead->ctx->lockless_cq);
> > > > +
> > > > +		if (req != lead) {
> > > > +			/*
> > > > +			 * Add leader to free list if it isn't there
> > > > +			 * otherwise clearing group flag for freeing it
> > > > +			 * in current batch
> > > > +			 */
> > > > +			if (!(lead->flags & REQ_F_SQE_GROUP))
> > > > +				wq_list_add_tail(&lead->comp_list, grp_list);
> > > > +			else
> > > > +				lead->flags &= ~REQ_F_SQE_GROUP;
> > > > +		}
> > > > +	} else if (req != lead) {
> > > > +		req->flags &= ~REQ_F_SQE_GROUP;
> > > > +	} else {
> > > > +		/*
> > > > +		 * Leader's group flag clearing is delayed until it is
> > > > +		 * removed from free list
> > > > +		 */
> > > > +	}
> > > > +}
> > > > +
> > > >    static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
> > > >    {
> > > >    	struct io_ring_ctx *ctx = req->ctx;
> > > > @@ -1427,6 +1545,17 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
> > > >    						    comp_list);
> > > >    		if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
> > > > +			/*
> > > > +			 * Group leader may be removed twice, don't free it
> > > > +			 * if group flag isn't cleared, when some members
> > > > +			 * aren't completed yet
> > > > +			 */
> > > > +			if (req->flags & REQ_F_SQE_GROUP) {
> > > > +				node = req->comp_list.next;
> > > > +				req->flags &= ~REQ_F_SQE_GROUP;
> > > > +				continue;
> > > > +			}
> > > > +
> > > >    			if (req->flags & REQ_F_REFCOUNT) {
> > > >    				node = req->comp_list.next;
> > > >    				if (!req_ref_put_and_test(req))
> > > > @@ -1459,6 +1588,7 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
> > > >    	__must_hold(&ctx->uring_lock)
> > > >    {
> > > >    	struct io_submit_state *state = &ctx->submit_state;
> > > > +	struct io_wq_work_list grp_list = {NULL};
> > > >    	struct io_wq_work_node *node;
> > > >    	__io_cq_lock(ctx);
> > > > @@ -1468,9 +1598,15 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
> > > >    		if (!(req->flags & REQ_F_CQE_SKIP))
> > > >    			io_req_commit_cqe(req, ctx->lockless_cq);
> > > > +
> > > > +		if (req->flags & REQ_F_SQE_GROUP)
> > > 
> > > Same note about hot path
> > > 
> > > > +			io_complete_group_req(req, &grp_list);
> > > >    	}
> > > >    	__io_cq_unlock_post(ctx);
> > > > +	if (!wq_list_empty(&grp_list))
> > > > +		__wq_list_splice(&grp_list, state->compl_reqs.first);
> > > 
> > > What's the point of splicing it here insted of doing all
> > > that under REQ_F_SQE_GROUP above?
> > 
> > As mentioned, group leader can't be completed until all members are
> > done, so any leaders in the current list have to be moved to this
> > local list for deferred completion. That should be the only tricky
> > part of the whole sqe group implementation.
> > 
> > > 
> > > > +
> > > >    	if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
> > > >    		io_free_batch_list(ctx, state->compl_reqs.first);
> > > >    		INIT_WQ_LIST(&state->compl_reqs);
> ...
> > > > @@ -1863,6 +2012,8 @@ void io_wq_submit_work(struct io_wq_work *work)
> > > >    		}
> > > >    	}
> > > > +	if (need_queue_group_members(req))
> > > > +		io_queue_group_members(req, true);
> > > >    	do {
> > > >    		ret = io_issue_sqe(req, issue_flags);
> > > >    		if (ret != -EAGAIN)
> > > > @@ -1977,6 +2128,9 @@ static inline void io_queue_sqe(struct io_kiocb *req)
> > > >    	 */
> > > >    	if (unlikely(ret))
> > > >    		io_queue_async(req, ret);
> > > > +
> > > > +	if (need_queue_group_members(req))
> > > > +		io_queue_group_members(req, false);
> > > 
> > > Request ownership is considered to be handed further at this
> > > point and requests should not be touched. Only ret==0 from
> > > io_issue_sqe it's still ours, but again it's handed somewhere
> > > by io_queue_async().
> > 
> > Yes, you are right.
> > 
> > And it has been fixed in my local tree:
> > 
> > @@ -2154,8 +2154,7 @@ static inline void io_queue_sqe(struct io_kiocb *req)
> >           */
> >          if (unlikely(ret))
> >                  io_queue_async(req, ret);
> > -
> > -       if (need_queue_group_members(req))
> > +       else if (need_queue_group_members(req))
> >                  io_queue_group_members(req, false);
> >   }
> 
> In the else branch you don't own the request anymore
> and shouldn't be poking into it.

In theory, it is yes, but now all requests won't be freed unless
returning from io_queue_sqe(), and it needs to be commented
carefully.

> 
> It looks like you're trying to do io_queue_group_members()
> when previously the request would get completed. It's not

It is only true for REQ_F_SQE_GROUP_DEP, and there isn't such
dependency for plain SQE_GROUP.

> the right place, and apart from whack'a'moled
> io_wq_submit_work() there is also io_poll_issue() missed.
> 
> Seems __io_submit_flush_completions() / io_free_batch_list()
> would be more appropriate, and you already have a chunk with
> GROUP check in there handling the leader appearing in there
> twice.

As mentioned, we need to queue members with leader together
if there isn't dependency among them.

> 
> 
> > > >    }
> > > >    static void io_queue_sqe_fallback(struct io_kiocb *req)
> ...
> > > > @@ -2232,7 +2443,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> > > >    			 const struct io_uring_sqe *sqe)
> > > >    	__must_hold(&ctx->uring_lock)
> > > >    {
> > > > -	struct io_submit_link *link = &ctx->submit_state.link;
> > > > +	struct io_submit_state *state = &ctx->submit_state;
> > > >    	int ret;
> > > >    	ret = io_init_req(ctx, req, sqe);
> > > > @@ -2241,9 +2452,17 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> > > >    	trace_io_uring_submit_req(req);
> > > > -	if (unlikely(link->head || (req->flags & (IO_REQ_LINK_FLAGS |
> > > > -				    REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
> > > > -		req = io_link_sqe(link, req);
> > > > +	if (unlikely(state->group.head ||
> > > 
> > > A note rather to myself and for the future, all theese checks
> > > including links and groups can be folded under one common if.
> > 
> > Sorry, I may not get the idea, can you provide one example?
> 
> To be clear, not suggesting you doing it.
> 
> Simplifying:
> 
> init_req() {
> 	if (req->flags & GROUP|LINK) {
> 		ctx->assembling;
> 	}
> }
> 
> io_submit_sqe() {
> 	init_req();
> 
> 	if (ctx->assembling) {
> 		check_groups/links();
> 		if (done);
> 			ctx->assembling = false;
> 	}
> }

OK, I can work toward this way, and it is just to replace check over
group.head/link.head & link/group flag with ->assembling, meantime
with cost of setting ctx->assembling.



Thanks,
Ming

next prev parent reply	other threads:[~2024-06-17  3:54 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-11  0:12 [PATCH V3 0/9] io_uring: support sqe group and provide group kbuf Ming Lei
2024-05-11  0:12 ` [PATCH V3 1/9] io_uring: add io_link_req() helper Ming Lei
2024-05-11  0:12 ` [PATCH V3 2/9] io_uring: add io_submit_fail_link() helper Ming Lei
2024-05-11  0:12 ` [PATCH V3 3/9] io_uring: add helper of io_req_commit_cqe() Ming Lei
2024-06-10  1:18   ` Pavel Begunkov
2024-06-11 13:21     ` Ming Lei
2024-05-11  0:12 ` [PATCH V3 4/9] io_uring: move marking REQ_F_CQE_SKIP out of io_free_req() Ming Lei
2024-06-10  1:23   ` Pavel Begunkov
2024-06-11 13:28     ` Ming Lei
2024-06-16 18:08       ` Pavel Begunkov
2024-05-11  0:12 ` [PATCH V3 5/9] io_uring: support SQE group Ming Lei
2024-05-21  2:58   ` Ming Lei
2024-06-10  1:55     ` Pavel Begunkov
2024-06-11 13:32       ` Ming Lei
2024-06-16 18:14         ` Pavel Begunkov
2024-06-17  1:42           ` Ming Lei
2024-06-10  2:53   ` Pavel Begunkov
2024-06-13  1:45     ` Ming Lei
2024-06-16 19:13       ` Pavel Begunkov
2024-06-17  3:54         ` Ming Lei [this message]
2024-05-11  0:12 ` [PATCH V3 6/9] io_uring: support sqe group with members depending on leader Ming Lei
2024-05-11  0:12 ` [PATCH V3 7/9] io_uring: support providing sqe group buffer Ming Lei
2024-06-10  2:00   ` Pavel Begunkov
2024-06-12  0:22     ` Ming Lei
2024-05-11  0:12 ` [PATCH V3 8/9] io_uring/uring_cmd: support provide group kernel buffer Ming Lei
2024-05-11  0:12 ` [PATCH V3 9/9] ublk: support provide io buffer Ming Lei
2024-06-03  0:05 ` [PATCH V3 0/9] io_uring: support sqe group and provide group kbuf Ming Lei
2024-06-07 12:32   ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zm+zemW1Bi5ZF9DE@fedora \
    --to=ming.lei@redhat.com \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=io-uring@vger.kernel.org \
    --cc=kwolf@redhat.com \
    --cc=linux-block@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.