Re: [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jens Axboe <axboe@kernel.dk>
To: Pavel Begunkov <asml.silence@gmail.com>, io-uring@vger.kernel.org
Subject: Re: [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list
Date: Wed, 27 Mar 2024 09:45:40 -0600	[thread overview]
Message-ID: <7662a22e-caeb-4ffd-a4ee-482ff809e628@kernel.dk> (raw)
In-Reply-To: <22f87633-9efa-4cd2-ab5d-e6d225b28ad5@gmail.com>

On 3/27/24 7:24 AM, Pavel Begunkov wrote:
> On 3/26/24 18:42, Jens Axboe wrote:
>> Lockless lists may be handy for some things, but they mean that items
>> are in the reverse order as we can only add to the head of the list.
>> That in turn means that iterating items on the list needs to reverse it
>> first, if it's sensitive to ordering between items on the list.
>>
>> Switch the DEFER_TASKRUN work list from an llist to a normal
>> io_wq_work_list, and protect it with a lock. Then we can get rid of the
>> manual reversing of the list when running it, which takes considerable
>> cycles particularly for bursty task_work additions.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>   include/linux/io_uring_types.h |  11 ++--
>>   io_uring/io_uring.c            | 117 ++++++++++++---------------------
>>   io_uring/io_uring.h            |   4 +-
>>   3 files changed, 51 insertions(+), 81 deletions(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index aeb4639785b5..e51bf15196e4 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -329,7 +329,9 @@ struct io_ring_ctx {
>>        * regularly bounce b/w CPUs.
> 
> ...
> 
>> -static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>> +static inline void io_req_local_work_add(struct io_kiocb *req, unsigned tw_flags)
>>   {
>>       struct io_ring_ctx *ctx = req->ctx;
>> -    unsigned nr_wait, nr_tw, nr_tw_prev;
>> -    struct llist_node *head;
>> +    unsigned nr_wait, nr_tw;
>> +    unsigned long flags;
>>         /* See comment above IO_CQ_WAKE_INIT */
>>       BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
>>         /*
>> -     * We don't know how many reuqests is there in the link and whether
>> +     * We don't know how many requests is there in the link and whether
>>        * they can even be queued lazily, fall back to non-lazy.
>>        */
>>       if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
>> -        flags &= ~IOU_F_TWQ_LAZY_WAKE;
>> -
>> -    head = READ_ONCE(ctx->work_llist.first);
>> -    do {
>> -        nr_tw_prev = 0;
>> -        if (head) {
>> -            struct io_kiocb *first_req = container_of(head,
>> -                            struct io_kiocb,
>> -                            io_task_work.node);
>> -            /*
>> -             * Might be executed at any moment, rely on
>> -             * SLAB_TYPESAFE_BY_RCU to keep it alive.
>> -             */
>> -            nr_tw_prev = READ_ONCE(first_req->nr_tw);
>> -        }
>> -
>> -        /*
>> -         * Theoretically, it can overflow, but that's fine as one of
>> -         * previous adds should've tried to wake the task.
>> -         */
>> -        nr_tw = nr_tw_prev + 1;
>> -        if (!(flags & IOU_F_TWQ_LAZY_WAKE))
>> -            nr_tw = IO_CQ_WAKE_FORCE;
> 
> Aren't you just killing the entire IOU_F_TWQ_LAZY_WAKE handling?
> It's assigned to IO_CQ_WAKE_FORCE so that it passes the check
> before wake_up below.

Yeah I messed that one up, did fix that one yesterday before sending it
out.

>> +        tw_flags &= ~IOU_F_TWQ_LAZY_WAKE;
>>   -        req->nr_tw = nr_tw;
>> -        req->io_task_work.node.next = head;
>> -    } while (!try_cmpxchg(&ctx->work_llist.first, &head,
>> -                  &req->io_task_work.node));
>> +    spin_lock_irqsave(&ctx->work_lock, flags);
>> +    wq_list_add_tail(&req->io_task_work.node, &ctx->work_list);
>> +    nr_tw = ++ctx->work_items;
>> +    spin_unlock_irqrestore(&ctx->work_lock, flags);
> 
> smp_mb(), see the comment below, and fwiw "_after_atomic" would not
> work.

For this one, I think all we need to do is have the wq_list_empty()
check be fully stable. If we read:

nr_wait = atomic_read(&ctx->cq_wait_nr);

right before a waiter does:

atomic_set(&ctx->cq_wait_nr, foo);
set_current_state(TASK_INTERRUPTIBLE);

then we need to ensure that the "I have work" check in
io_cqring_wait_schedule() sees the work. The spin_unlock() has release
semantics, and the current READ_ONCE() for work check sbould be enough,
no?

>>       /*
>>        * cmpxchg implies a full barrier, which pairs with the barrier
>> @@ -1289,7 +1254,7 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>>        * is similar to the wait/wawke task state sync.
>>        */
>>   -    if (!head) {
>> +    if (nr_tw == 1) {
>>           if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>>               atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>>           if (ctx->has_evfd)
>> @@ -1297,13 +1262,8 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>>       }
>>         nr_wait = atomic_read(&ctx->cq_wait_nr);
>> -    /* not enough or no one is waiting */
>> -    if (nr_tw < nr_wait)
>> -        return;
>> -    /* the previous add has already woken it up */
>> -    if (nr_tw_prev >= nr_wait)
>> -        return;
>> -    wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
>> +    if (nr_tw >= nr_wait)
>> +        wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
> 
> IIUC, you're removing a very important optimisation, and I
> don't understand why'd you do that. It was waking up only when
> it's changing from "don't need to wake" to "have to be woken up",
> just once per splicing the list on the waiting side.
> 
> It seems like this one will keep pounding with wake_up_state for
> every single tw queued after @nr_wait, which quite often just 1.

Agree, that was just a silly oversight. I brought that back now.
Apparently doesn't hit anything here, at least not to the extent that I
saw it in testing. But it is a good idea and we should keep that, so
only the first one over the threshold attempts the wake.

-- 
Jens Axboe

next prev parent reply	other threads:[~2024-03-27 15:45 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-26 18:42 [PATCHSET 0/4] Use io_wq_work_list for task_work Jens Axboe
2024-03-26 18:42 ` [PATCH 1/4] io_uring: use the right type for work_llist empty check Jens Axboe
2024-03-26 18:42 ` [PATCH 2/4] io_uring: switch deferred task_work to an io_wq_work_list Jens Axboe
2024-03-27 13:24   ` Pavel Begunkov
2024-03-27 15:45     ` Jens Axboe [this message]
2024-03-27 16:37       ` Jens Axboe
2024-03-27 17:28         ` Pavel Begunkov
2024-03-27 17:34           ` Jens Axboe
2024-03-26 18:42 ` [PATCH 3/4] io_uring: switch fallback work to io_wq_work_list Jens Axboe
2024-03-26 18:42 ` [PATCH 4/4] io_uring: switch normal task_work " Jens Axboe
2024-03-27 13:33 ` [PATCHSET 0/4] Use io_wq_work_list for task_work Pavel Begunkov
2024-03-27 16:36   ` Jens Axboe
2024-03-27 17:05     ` Jens Axboe
2024-03-27 18:04     ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7662a22e-caeb-4ffd-a4ee-482ff809e628@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=asml.silence@gmail.com \
    --cc=io-uring@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.