From: Nam Cao <namcao@linutronix.de>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christian Brauner <brauner@kernel.org>,
Xi Ruoyao <xry111@xry111.site>,
Frederic Weisbecker <frederic@kernel.org>,
Valentin Schneider <vschneid@redhat.com>,
Alexander Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.cz>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
John Ogness <john.ogness@linutronix.de>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Clark Williams <clrkwllms@kernel.org>,
Steven Rostedt <rostedt@goodmis.org>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rt-devel@lists.linux.dev, linux-rt-users@vger.kernel.org,
Joe Damato <jdamato@fastly.com>,
Martin Karsten <mkarsten@uwaterloo.ca>,
Jens Axboe <axboe@kernel.dk>,
stable@vger.kernel.org
Subject: Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
Date: Wed, 16 Jul 2025 09:41:10 +0200
Message-ID: <20250716074110.AY8PBtBi@linutronix.de>
In-Reply-To: <CAHk-=wheHZRWPyNNoqXB8+ygw2PqEYjbyKQfSbYpirecg5K1Nw@mail.gmail.com>
On Tue, Jul 15, 2025 at 09:42:52AM -0700, Linus Torvalds wrote:
> On Tue, 15 Jul 2025 at 05:47, Nam Cao <namcao@linutronix.de> wrote:
> >
> > fs/eventpoll.c | 139 +++++++++----------------------------------------
> > 1 file changed, 26 insertions(+), 113 deletions(-)
>
> Yeah, this is more like the kind of diffstat I like to see for
> eventpoll. Plus it makes things fundamentally simpler.
>
> It might be worth looking at ep_poll_callback() - the only case that
> had read_lock_irqsave() - and seeing if perhaps some of the tests
> inside the lock might be done optimistically, or delayed to after the
> lock.
>
> For example, the whole wakeup sequence looks like it should be done
> *after* the ep->lock has been released, because it uses its own lock
> (ie the
>
> if (sync)
> wake_up_sync(&ep->wq);
> else
> wake_up(&ep->wq);
>
> thing uses the wq lock, and there is nothing that ep->lock protects
> there as far as I can tell.
>
> So I think this has some potential for _simple_ optimizations, but I'm
> not sure how worth it it is.
Actually, ep->lock does protect this part, because ep_poll() touches ep->wq
without holding the waitqueue's lock.
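
If the wakeup were moved outside ep->lock while ep_poll() stays as it is,
the two sides would touch ep->wq under different locks. A simplified sketch
of the current ep_poll() arming path versus a callback that no longer holds
ep->lock (not verbatim code, just the relevant lines):

	/* ep_poll(): the waitqueue is modified under ep->lock only */
	spin_lock_irq(&ep->lock);
	__set_current_state(TASK_INTERRUPTIBLE);
	eavail = ep_events_available(ep);
	if (!eavail)
		__add_wait_queue_exclusive(&ep->wq, &wait);
	spin_unlock_irq(&ep->lock);

	/* ep_poll_callback(), if the wakeup is done after dropping ep->lock */
	if (waitqueue_active(&ep->wq))	/* lockless check */
		wake_up(&ep->wq);	/* takes ep->wq.lock and walks the list */

wake_up() would then walk ep->wq under ep->wq.lock while ep_poll() adds and
removes entries under ep->lock, and the barrierless __set_current_state() /
waitqueue_active() pairing would no longer be ordered by a common lock.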
We could do something like the diff below. But things are not so simple
anymore.
Best regards,
Nam
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a171f7e7dacc..5e6c7da606e7 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1252,8 +1252,6 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
unsigned long flags;
int ewake = 0;
- spin_lock_irqsave(&ep->lock, flags);
-
ep_set_busy_poll_napi_id(epi);
/*
@@ -1263,7 +1261,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
* until the next EPOLL_CTL_MOD will be issued.
*/
if (!(epi->event.events & ~EP_PRIVATE_BITS))
- goto out_unlock;
+ goto out;
/*
* Check the events coming with the callback. At this stage, not
@@ -1272,7 +1270,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
* test for "key" != NULL before the event match test.
*/
if (pollflags && !(pollflags & epi->event.events))
- goto out_unlock;
+ goto out;
/*
* If we are transferring events to userspace, we can hold no locks
@@ -1280,6 +1278,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
* semantics). All the events that happen during that period of time are
* chained in ep->ovflist and requeued later on.
*/
+ spin_lock_irqsave(&ep->lock, flags);
+
if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
if (epi->next == EP_UNACTIVE_PTR) {
epi->next = READ_ONCE(ep->ovflist);
@@ -1292,9 +1292,14 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
ep_pm_stay_awake_rcu(epi);
}
+ spin_unlock_irqrestore(&ep->lock, flags);
+
+
/*
* Wake up ( if active ) both the eventpoll wait list and the ->poll()
* wait list.
+ *
+ * Memory barrier for waitqueue_active() from spin_unlock_irqrestore().
*/
if (waitqueue_active(&ep->wq)) {
if ((epi->event.events & EPOLLEXCLUSIVE) &&
@@ -1321,9 +1326,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
if (waitqueue_active(&ep->poll_wait))
pwake++;
-out_unlock:
- spin_unlock_irqrestore(&ep->lock, flags);
-
+out:
/* We have to call this outside the lock */
if (pwake)
ep_poll_safewake(ep, epi, pollflags & EPOLL_URING_WAKE);
@@ -2001,13 +2004,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
init_wait(&wait);
wait.func = ep_autoremove_wake_function;
- spin_lock_irq(&ep->lock);
- /*
- * Barrierless variant, waitqueue_active() is called under
- * the same lock on wakeup ep_poll_callback() side, so it
- * is safe to avoid an explicit barrier.
- */
- __set_current_state(TASK_INTERRUPTIBLE);
+ prepare_to_wait_exclusive(&ep->wq, &wait, TASK_INTERRUPTIBLE);
/*
* Do the final check under the lock. ep_start/done_scan()
@@ -2016,10 +2013,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
* period of time although events are pending, so lock is
* important.
*/
+ spin_lock_irq(&ep->lock);
eavail = ep_events_available(ep);
- if (!eavail)
- __add_wait_queue_exclusive(&ep->wq, &wait);
-
spin_unlock_irq(&ep->lock);
if (!eavail)
@@ -2036,7 +2031,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
eavail = 1;
if (!list_empty_careful(&wait.entry)) {
- spin_lock_irq(&ep->lock);
+ spin_lock_irq(&ep->wq.lock);
/*
* If the thread timed out and is not on the wait queue,
* it means that the thread was woken up after its
@@ -2047,7 +2042,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
if (timed_out)
eavail = list_empty(&wait.entry);
__remove_wait_queue(&ep->wq, &wait);
- spin_unlock_irq(&ep->lock);
+ spin_unlock_irq(&ep->wq.lock);
}
}
}
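
For context, prepare_to_wait_exclusive() does the queueing under the
waitqueue's own lock, roughly like this (simplified from memory, see
kernel/sched/wait.c for the real thing):

	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&wq_head->lock, flags);
	if (list_empty(&wq_entry->entry))
		__add_wait_queue_entry_tail(wq_head, wq_entry);
	set_current_state(state);
	spin_unlock_irqrestore(&wq_head->lock, flags);

So with the diff above, both ep_poll() and the wake-up path serialize on
ep->wq.lock, and the callback's lockless waitqueue_active() check relies on
the ordering provided by spin_unlock_irqrestore(&ep->lock, flags), as noted
in the added comment.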