linux-rt-devel.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Nam Cao <namcao@linutronix.de>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christian Brauner <brauner@kernel.org>,
	Xi Ruoyao <xry111@xry111.site>,
	Frederic Weisbecker <frederic@kernel.org>,
	Valentin Schneider <vschneid@redhat.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.cz>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	John Ogness <john.ogness@linutronix.de>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Clark Williams <clrkwllms@kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rt-devel@lists.linux.dev, linux-rt-users@vger.kernel.org,
	Joe Damato <jdamato@fastly.com>,
	Martin Karsten <mkarsten@uwaterloo.ca>,
	Jens Axboe <axboe@kernel.dk>,
	stable@vger.kernel.org
Subject: Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
Date: Wed, 16 Jul 2025 09:41:10 +0200	[thread overview]
Message-ID: <20250716074110.AY8PBtBi@linutronix.de> (raw)
In-Reply-To: <CAHk-=wheHZRWPyNNoqXB8+ygw2PqEYjbyKQfSbYpirecg5K1Nw@mail.gmail.com>

On Tue, Jul 15, 2025 at 09:42:52AM -0700, Linus Torvalds wrote:
> On Tue, 15 Jul 2025 at 05:47, Nam Cao <namcao@linutronix.de> wrote:
> >
> >  fs/eventpoll.c | 139 +++++++++----------------------------------------
> >  1 file changed, 26 insertions(+), 113 deletions(-)
> 
> Yeah, this is more like the kind of diffstat I like to see for
> eventpoll. Plus it makes things fundamentally simpler.
> 
> It might be worth looking at ep_poll_callback() - the only case that
> had read_lock_irqsave() - and seeing if perhaps some of the tests
> inside the lock might be done optimistically, or delayed to after the
> lock.
> 
> For example, the whole wakeup sequence looks like it should be done
> *after* the ep->lock has been released, because it uses its own lock
> (ie the
> 
>                 if (sync)
>                         wake_up_sync(&ep->wq);
>                 else
>                         wake_up(&ep->wq);
> 
> thing uses the wq lock, and there is nothing that ep->lock protects
> there as far as I can tell.
> 
> So I think this has some potential for _simple_ optimizations, but I'm
> not sure how worth it it is.

Actually ep->lock does protect this part. Because ep_poll() touches ep->wq
without holding the waitqueue's lock.

We could do something like the diff below. But things are not so simple
anymore.

Best regards,
Nam

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a171f7e7dacc..5e6c7da606e7 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1252,8 +1252,6 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	unsigned long flags;
 	int ewake = 0;
 
-	spin_lock_irqsave(&ep->lock, flags);
-
 	ep_set_busy_poll_napi_id(epi);
 
 	/*
@@ -1263,7 +1261,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * until the next EPOLL_CTL_MOD will be issued.
 	 */
 	if (!(epi->event.events & ~EP_PRIVATE_BITS))
-		goto out_unlock;
+		goto out;
 
 	/*
 	 * Check the events coming with the callback. At this stage, not
@@ -1272,7 +1270,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * test for "key" != NULL before the event match test.
 	 */
 	if (pollflags && !(pollflags & epi->event.events))
-		goto out_unlock;
+		goto out;
 
 	/*
 	 * If we are transferring events to userspace, we can hold no locks
@@ -1280,6 +1278,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * semantics). All the events that happen during that period of time are
 	 * chained in ep->ovflist and requeued later on.
 	 */
+	spin_lock_irqsave(&ep->lock, flags);
+
 	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
 		if (epi->next == EP_UNACTIVE_PTR) {
 			epi->next = READ_ONCE(ep->ovflist);
@@ -1292,9 +1292,14 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 		ep_pm_stay_awake_rcu(epi);
 	}
 
+	spin_unlock_irqrestore(&ep->lock, flags);
+
+
 	/*
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
+	 *
+	 * Memory barrier for waitqueue_active() from spin_unlock_irqrestore().
 	 */
 	if (waitqueue_active(&ep->wq)) {
 		if ((epi->event.events & EPOLLEXCLUSIVE) &&
@@ -1321,9 +1326,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
-out_unlock:
-	spin_unlock_irqrestore(&ep->lock, flags);
-
+out:
 	/* We have to call this outside the lock */
 	if (pwake)
 		ep_poll_safewake(ep, epi, pollflags & EPOLL_URING_WAKE);
@@ -2001,13 +2004,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		init_wait(&wait);
 		wait.func = ep_autoremove_wake_function;
 
-		spin_lock_irq(&ep->lock);
-		/*
-		 * Barrierless variant, waitqueue_active() is called under
-		 * the same lock on wakeup ep_poll_callback() side, so it
-		 * is safe to avoid an explicit barrier.
-		 */
-		__set_current_state(TASK_INTERRUPTIBLE);
+		prepare_to_wait_exclusive(&ep->wq, &wait, TASK_INTERRUPTIBLE);
 
 		/*
 		 * Do the final check under the lock. ep_start/done_scan()
@@ -2016,10 +2013,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * period of time although events are pending, so lock is
 		 * important.
 		 */
+		spin_lock_irq(&ep->lock);
 		eavail = ep_events_available(ep);
-		if (!eavail)
-			__add_wait_queue_exclusive(&ep->wq, &wait);
-
 		spin_unlock_irq(&ep->lock);
 
 		if (!eavail)
@@ -2036,7 +2031,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		eavail = 1;
 
 		if (!list_empty_careful(&wait.entry)) {
-			spin_lock_irq(&ep->lock);
+			spin_lock_irq(&ep->wq.lock);
 			/*
 			 * If the thread timed out and is not on the wait queue,
 			 * it means that the thread was woken up after its
@@ -2047,7 +2042,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			if (timed_out)
 				eavail = list_empty(&wait.entry);
 			__remove_wait_queue(&ep->wq, &wait);
-			spin_unlock_irq(&ep->lock);
+			spin_unlock_irq(&ep->wq.lock);
 		}
 	}
 }

  reply	other threads:[~2025-07-16  7:41 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-15 12:46 [PATCH v4 0/1] eventpoll: Fix priority inversion problem Nam Cao
2025-07-15 12:46 ` [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock Nam Cao
2025-07-15 12:58   ` Nam Cao
2025-07-15 16:42   ` Linus Torvalds
2025-07-16  7:41     ` Nam Cao [this message]
2025-07-16  8:34   ` K Prateek Nayak
2025-08-26  8:43   ` Nam Cao
2025-09-03  8:40     ` Sebastian Andrzej Siewior
2025-09-05 13:52       ` Christian Brauner
2025-09-05 13:52 ` [PATCH v4 0/1] eventpoll: Fix priority inversion problem Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250716074110.AY8PBtBi@linutronix.de \
    --to=namcao@linutronix.de \
    --cc=axboe@kernel.dk \
    --cc=bigeasy@linutronix.de \
    --cc=brauner@kernel.org \
    --cc=clrkwllms@kernel.org \
    --cc=frederic@kernel.org \
    --cc=jack@suse.cz \
    --cc=jdamato@fastly.com \
    --cc=john.ogness@linutronix.de \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rt-devel@lists.linux.dev \
    --cc=linux-rt-users@vger.kernel.org \
    --cc=mkarsten@uwaterloo.ca \
    --cc=rostedt@goodmis.org \
    --cc=stable@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=vschneid@redhat.com \
    --cc=xry111@xry111.site \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).