From: Jens Axboe <axboe@kernel.dk>
To: Andres Freund <andres@anarazel.de>, io-uring@vger.kernel.org
Cc: Oleg Nesterov <oleg@redhat.com>
Subject: Re: signals not reliably interrupting io_uring_enter anymore
Date: Sat, 4 Jul 2020 08:55:39 -0600 [thread overview]
Message-ID: <a82e680a-7db6-3569-2b53-e43e087ef464@kernel.dk> (raw)
In-Reply-To: <624c0af9-886e-0c5f-0c35-dd245496b365@kernel.dk>
On 7/3/20 8:56 PM, Jens Axboe wrote:
> On 7/3/20 8:08 PM, Jens Axboe wrote:
>> On 7/3/20 7:52 PM, Jens Axboe wrote:
>>> On 7/3/20 7:13 PM, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> On July 3, 2020 5:48:21 PM PDT, Jens Axboe <axboe@kernel.dk> wrote:
>>>>> On 7/3/20 6:15 PM, Andres Freund wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 2020-07-03 17:00:49 -0700, Andres Freund wrote:
>>>>>>> I haven't yet fully analyzed the problem, but after updating to
>>>>>>> cdd3bb54332f82295ed90cd0c09c78cd0c0ee822 io_uring using postgres
>>>>> does
>>>>>>> not work reliably anymore.
>>>>>>>
>>>>>>> The symptom is that io_uring_enter(IORING_ENTER_GETEVENTS) isn't
>>>>>>> interrupted by signals anymore. The signal handler executes, but
>>>>>>> afterwards the syscall is restarted. Previously io_uring_enter
>>>>> reliably
>>>>>>> returned EINTR in that case.
>>>>>>>
>>>>>>> Currently postgres relies on signals interrupting io_uring_enter().
>>>>> We
>>>>>>> probably can find a way to not do so, but it'd not be entirely
>>>>> trivial.
>>>>>>>
>>>>>>> I suspect the issue is
>>>>>>>
>>>>>>> commit ce593a6c480a22acba08795be313c0c6d49dd35d (tag:
>>>>> io_uring-5.8-2020-07-01, linux-block/io_uring-5.8)
>>>>>>> Author: Jens Axboe <axboe@kernel.dk>
>>>>>>> Date: 2020-06-30 12:39:05 -0600
>>>>>>>
>>>>>>> io_uring: use signal based task_work running
>>>>>>>
>>>>>>> as that appears to have changed the error returned by
>>>>>>> io_uring_enter(GETEVENTS) after having been interrupted by a signal
>>>>> from
>>>>>>> EINTR to ERESTARTSYS.
>>>>>>>
>>>>>>>
>>>>>>> I'll check to make sure that the issue doesn't exist before the
>>>>> above
>>>>>>> commit.
>>>>>>
>>>>>> Indeed, on cd77006e01b3198c75fb7819b3d0ff89709539bb the PG issue
>>>>> doesn't
>>>>>> exist, which pretty much confirms that the above commit is the issue.
>>>>>>
>>>>>> What was the reason for changing EINTR to ERESTARTSYS in the above
>>>>>> commit? I assume trying to avoid returning spurious EINTRs to
>>>>> userland?
>>>>>
>>>>> Yeah, for when it's running task_work. I wonder if something like the
>>>>> below will do the trick?
>>>>>
>>>>>
>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>> index 700644a016a7..0efa73d78451 100644
>>>>> --- a/fs/io_uring.c
>>>>> +++ b/fs/io_uring.c
>>>>> @@ -6197,11 +6197,11 @@ static int io_cqring_wait(struct io_ring_ctx
>>>>> *ctx, int min_events,
>>>>> do {
>>>>> prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
>>>>> TASK_INTERRUPTIBLE);
>>>>> - /* make sure we run task_work before checking for signals */
>>>>> - if (current->task_works)
>>>>> - task_work_run();
>>>>> if (signal_pending(current)) {
>>>>> - ret = -ERESTARTSYS;
>>>>> + if (current->task_works)
>>>>> + ret = -ERESTARTSYS;
>>>>> + else
>>>>> + ret = -EINTR;
>>>>> break;
>>>>> }
>>>>> if (io_should_wake(&iowq, false))
>>>>> @@ -6210,7 +6210,7 @@ static int io_cqring_wait(struct io_ring_ctx
>>>>> *ctx, int min_events,
>>>>> } while (1);
>>>>> finish_wait(&ctx->wait, &iowq.wq);
>>>>>
>>>>> - restore_saved_sigmask_unless(ret == -ERESTARTSYS);
>>>>> + restore_saved_sigmask_unless(ret == -EINTR);
>>>>>
>>>>> return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret :
>>>>> 0;
>>>>> }
>>>>
>>>> I'll try in a bit. Suspect however that there'd be trouble if there
>>>> were both an actual signal and task work pending?
>>>
>>> Yes, I have that worry too. We'd really need to check if we have an
>>> actual signal pending - if we do, we still do -EINTR. If not, then we
>>> just do -ERESTARTSYS and restart the system call after task_work has
>>> been completed. Half-assed approach below, I suspect this won't _really_
>>> work without appropriate locking. Which would be unfortunate.
>>>
>>> Either that, or we'd need to know if an actual signal was delivered when
>>> we get re-entered due to returning -ERESTARTSYS. If it was just
>>> task_work being run, then we're fine. But if an actual signal was
>>> pending, then we'd need to return -EINTR.
>>>
>>> CC'ing Oleg to see if he has any good ideas here.
>>
>> This might be simpler:
>
> Or... That's it for today, I'll check in after the weekend.
This tests out fine for me, and it avoids TWA_SIGNAL if we're not using
an eventfd.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 700644a016a7..d37d7ea5ebe5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4072,14 +4072,22 @@ struct io_poll_table {
int error;
};
-static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb,
- int notify)
+static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb)
{
struct task_struct *tsk = req->task;
- int ret;
+ struct io_ring_ctx *ctx = req->ctx;
+ int ret, notify = TWA_RESUME;
- if (req->ctx->flags & IORING_SETUP_SQPOLL)
+ /*
+ * SQPOLL kernel thread doesn't need notification, just a wakeup.
+ * If we're not using an eventfd, then TWA_RESUME is always fine,
+ * as we won't have dependencies between request completions for
+ * other kernel wait conditions.
+ */
+ if (ctx->flags & IORING_SETUP_SQPOLL)
notify = 0;
+ else if (ctx->cq_ev_fd)
+ notify = TWA_SIGNAL;
ret = task_work_add(tsk, cb, notify);
if (!ret)
@@ -4110,7 +4118,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
* of executing it. We can't safely execute it anyway, as we may not
* have the needed state needed for it anyway.
*/
- ret = io_req_task_work_add(req, &req->task_work, TWA_SIGNAL);
+ ret = io_req_task_work_add(req, &req->task_work);
if (unlikely(ret)) {
WRITE_ONCE(poll->canceled, true);
tsk = io_wq_get_task(req->ctx->io_wq);
@@ -6201,7 +6209,14 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
if (current->task_works)
task_work_run();
if (signal_pending(current)) {
- ret = -ERESTARTSYS;
+ if (current->jobctl & JOBCTL_TASK_WORK) {
+ spin_lock_irq(¤t->sighand->siglock);
+ current->jobctl &= ~JOBCTL_TASK_WORK;
+ recalc_sigpending();
+ spin_unlock_irq(¤t->sighand->siglock);
+ continue;
+ }
+ ret = -EINTR;
break;
}
if (io_should_wake(&iowq, false))
@@ -6210,7 +6225,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
} while (1);
finish_wait(&ctx->wait, &iowq.wq);
- restore_saved_sigmask_unless(ret == -ERESTARTSYS);
+ restore_saved_sigmask_unless(ret == -EINTR);
return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret : 0;
}
--
Jens Axboe
next prev parent reply other threads:[~2020-07-04 14:55 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-07-04 0:00 signals not reliably interrupting io_uring_enter anymore Andres Freund
2020-07-04 0:15 ` Andres Freund
2020-07-04 0:48 ` Jens Axboe
2020-07-04 1:13 ` Andres Freund
2020-07-04 1:52 ` Jens Axboe
2020-07-04 2:08 ` Jens Axboe
2020-07-04 2:56 ` Jens Axboe
2020-07-04 14:55 ` Jens Axboe [this message]
2020-07-04 19:11 ` Andres Freund
2020-07-04 19:45 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=a82e680a-7db6-3569-2b53-e43e087ef464@kernel.dk \
--to=axboe@kernel.dk \
--cc=andres@anarazel.de \
--cc=io-uring@vger.kernel.org \
--cc=oleg@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.