Linux filesystem development
 help / color / mirror / Atom feed
* [PATCH v2 0/2] eventpoll: Fix epoll_wait() report false negative
@ 2026-05-30  9:37 Nam Cao
  2026-05-30  9:37 ` [PATCH v2 1/2] selftests/eventpoll: Add test for multiple waiters Nam Cao
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Nam Cao @ 2026-05-30  9:37 UTC (permalink / raw)
  To: Christian Brauner, Alexander Viro, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Mateusz Guzik,
	David Laight, linux-fsdevel, linux-kernel, linux-kselftest
  Cc: Nam Cao

Hi,

While staring at epoll, I noticed ep_events_available() looks wrong. I
wrote a small program to confirm, and yes it is definitely wrong.

This series adds a reproducer to kselftest, and fix the bug.

v2: Switch to seqlock solution

Nam Cao (2):
  selftests/eventpoll: Add test for multiple waiters
  eventpoll: Fix epoll_wait() report false negative

 fs/eventpoll.c                                | 20 ++++++++-
 .../filesystems/epoll/epoll_wakeup_test.c     | 45 +++++++++++++++++++
 2 files changed, 64 insertions(+), 1 deletion(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v2 1/2] selftests/eventpoll: Add test for multiple waiters
  2026-05-30  9:37 [PATCH v2 0/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
@ 2026-05-30  9:37 ` Nam Cao
  2026-05-30  9:37 ` [PATCH v2 2/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
  2026-06-02 13:20 ` [PATCH v2 0/2] " Christian Brauner
  2 siblings, 0 replies; 4+ messages in thread
From: Nam Cao @ 2026-05-30  9:37 UTC (permalink / raw)
  To: Christian Brauner, Alexander Viro, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Mateusz Guzik,
	David Laight, linux-fsdevel, linux-kernel, linux-kselftest
  Cc: Nam Cao

Add a test whichs creates 64 threads who all epoll_wait() on the same
eventpoll. The source eventfd is written but never read, therefore all the
threads should always see an EPOLLIN event.

This test fails because of a kernel bug, which will be fixed by a follow-up
commit.

Signed-off-by: Nam Cao <namcao@linutronix.de>
---
 .../filesystems/epoll/epoll_wakeup_test.c     | 45 +++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
index 8bc57a2ef966..f6f1a7ff01b0 100644
--- a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
+++ b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
@@ -3493,4 +3493,49 @@ TEST(epoll64)
 	close(ctx.sfd[1]);
 }
 
+static void *epoll65_wait(void *ctx_)
+{
+	struct epoll_mtcontext *ctx = ctx_;
+	struct epoll_event event;
+
+	for (int i = 0; i < 100000; ++i) {
+		if (!epoll_wait(ctx->efd[0], &event, 1, 0))
+			return (void *)ENODATA;
+	}
+
+	return (void *)0;
+}
+
+TEST(epoll65)
+{
+	struct epoll_mtcontext ctx;
+	struct epoll_event event;
+	int64_t dummy_data = 99;
+	pthread_t threads[64];
+	uintptr_t ret;
+	int i, err;
+
+	ctx.efd[0] = epoll_create(1);
+	ASSERT_GE(ctx.efd[0], 0);
+	ctx.efd[1] = eventfd(0, 0);
+	ASSERT_GE(ctx.efd[1], 0);
+
+	event.events = EPOLLIN;
+	err = epoll_ctl(ctx.efd[0], EPOLL_CTL_ADD, ctx.efd[1], &event);
+	ASSERT_EQ(err, 0);
+
+	write(ctx.efd[1], &dummy_data, sizeof(dummy_data));
+
+	for (i = 0; i < ARRAY_SIZE(threads); ++i)
+		ASSERT_EQ(pthread_create(&threads[i], NULL, epoll65_wait, &ctx), 0);
+
+	for (i = 0; i < ARRAY_SIZE(threads); ++i) {
+		ASSERT_EQ(pthread_join(threads[i], (void **)&ret), 0);
+		ASSERT_EQ(ret, 0);
+	}
+
+	close(ctx.efd[0]);
+	close(ctx.efd[1]);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH v2 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-05-30  9:37 [PATCH v2 0/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
  2026-05-30  9:37 ` [PATCH v2 1/2] selftests/eventpoll: Add test for multiple waiters Nam Cao
@ 2026-05-30  9:37 ` Nam Cao
  2026-06-02 13:20 ` [PATCH v2 0/2] " Christian Brauner
  2 siblings, 0 replies; 4+ messages in thread
From: Nam Cao @ 2026-05-30  9:37 UTC (permalink / raw)
  To: Christian Brauner, Alexander Viro, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Mateusz Guzik,
	David Laight, linux-fsdevel, linux-kernel, linux-kselftest
  Cc: Nam Cao, stable

ep_events_available() checks for available events by looking at ep->rdllist
and ep->ovflist. However, this is done without a lock and can report false
negative if rdllist and ovflist are changed in ep_start_scan() or
ep_done_scan() by another task. For example:
____________________________________________________________________________________
                                           |ep_start_scan()
                                           |  list_splice_init(&ep->rdllist, txlist)
ep_events_available()                      |
  !list_empty_careful(&ep->rdllist) ||     |
  READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR|
	                                   |  WRITE_ONCE(ep->ovflist, NULL)
___________________________________________|________________________________________

Another example:
____________________________________________________________________________________
ep_events_available()                      |
                                           |ep_start_scan()
                                           |  list_splice_init(&ep->rdllist, txlist);
	                                   |  WRITE_ONCE(ep->ovflist, NULL);
  !list_empty_careful(&ep->rdllist) ||     |
                                           |ep_done_scan()
                                           |  WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
                                           |  list_splice(txlist, &ep->rdllist);
  READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR|
___________________________________________|________________________________________

In the above examples, ep_events_available() sees no event from both
rdllist and ovflist despite event being available.

Introduce a sequence lock to resolve this issue.

Measuring the time consumption of 10 million loop iterations doing
epoll_wait(), the following performance drop is observed:

   timeout  #event  before    after    diff
     0ms      0     3727ms   3974ms   +6.6%
     0ms      1     8099ms   9134ms    +13%
     1ms      1    13525ms  13586ms  +0.45%

Considering the use case of epoll_wait() (wait for events, do something
with the events, repeat), it should only contribute to a small portion of
user's CPU consumption. Therefore this performance drop is not alarming.

Fixes: c5a282e9635e ("fs/epoll: reduce the scope of wq lock in epoll_wait()")
Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
---
 fs/eventpoll.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a3090b446af1..58248862e5ee 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -38,6 +38,7 @@
 #include <linux/compat.h>
 #include <linux/rculist.h>
 #include <linux/capability.h>
+#include <linux/seqlock.h>
 #include <net/busy_poll.h>
 
 /*
@@ -190,6 +191,9 @@ struct eventpoll {
 	/* Lock which protects rdllist and ovflist */
 	spinlock_t lock;
 
+	/* Protect switching between rdllist and ovflist */
+	seqcount_spinlock_t seq;
+
 	/* RB tree root used to store monitored fd structs */
 	struct rb_root_cached rbr;
 
@@ -382,8 +386,11 @@ static inline struct epitem *ep_item_from_wait(wait_queue_entry_t *p)
  */
 static inline int ep_events_available(struct eventpoll *ep)
 {
+	unsigned int seq = read_seqcount_begin(&ep->seq);
+
 	return !list_empty_careful(&ep->rdllist) ||
-		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR ||
+		read_seqcount_retry(&ep->seq, seq);
 }
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -735,8 +742,12 @@ static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
 	 */
 	lockdep_assert_irqs_enabled();
 	spin_lock_irq(&ep->lock);
+	write_seqcount_begin(&ep->seq);
+
 	list_splice_init(&ep->rdllist, txlist);
 	WRITE_ONCE(ep->ovflist, NULL);
+
+	write_seqcount_end(&ep->seq);
 	spin_unlock_irq(&ep->lock);
 }
 
@@ -768,6 +779,9 @@ static void ep_done_scan(struct eventpoll *ep,
 			ep_pm_stay_awake(epi);
 		}
 	}
+
+	write_seqcount_begin(&ep->seq);
+
 	/*
 	 * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
 	 * releasing the lock, events will be queued in the normal way inside
@@ -779,6 +793,9 @@ static void ep_done_scan(struct eventpoll *ep,
 	 * Quickly re-inject items left on "txlist".
 	 */
 	list_splice(txlist, &ep->rdllist);
+
+	write_seqcount_end(&ep->seq);
+
 	__pm_relax(ep->ws);
 
 	if (!list_empty(&ep->rdllist)) {
@@ -1155,6 +1172,7 @@ static int ep_alloc(struct eventpoll **pep)
 
 	mutex_init(&ep->mtx);
 	spin_lock_init(&ep->lock);
+	seqcount_spinlock_init(&ep->seq, &ep->lock);
 	init_waitqueue_head(&ep->wq);
 	init_waitqueue_head(&ep->poll_wait);
 	INIT_LIST_HEAD(&ep->rdllist);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2 0/2] eventpoll: Fix epoll_wait() report false negative
  2026-05-30  9:37 [PATCH v2 0/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
  2026-05-30  9:37 ` [PATCH v2 1/2] selftests/eventpoll: Add test for multiple waiters Nam Cao
  2026-05-30  9:37 ` [PATCH v2 2/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
@ 2026-06-02 13:20 ` Christian Brauner
  2 siblings, 0 replies; 4+ messages in thread
From: Christian Brauner @ 2026-06-02 13:20 UTC (permalink / raw)
  To: Nam Cao
  Cc: Alexander Viro, Jan Kara, Shuah Khan, Davidlohr Bueso,
	Soheil Hassas Yeganeh, Mateusz Guzik, David Laight, linux-fsdevel,
	linux-kernel, linux-kselftest

On Sat, May 30, 2026 at 11:37:30AM +0200, Nam Cao wrote:
> Hi,
> 
> While staring at epoll, I noticed ep_events_available() looks wrong. I
> wrote a small program to confirm, and yes it is definitely wrong.
> 
> This series adds a reproducer to kselftest, and fix the bug.

Please resend on top of vfs-7.2.eventpoll... :)

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-06-02 13:20 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-30  9:37 [PATCH v2 0/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
2026-05-30  9:37 ` [PATCH v2 1/2] selftests/eventpoll: Add test for multiple waiters Nam Cao
2026-05-30  9:37 ` [PATCH v2 2/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
2026-06-02 13:20 ` [PATCH v2 0/2] " Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox