[PATCH 0/2] eventpoll: Fix epoll_wait() report false negative

Linux filesystem development
 help / color / mirror / Atom feed

* [PATCH 0/2] eventpoll: Fix epoll_wait() report false negative
@ 2025-07-18  7:52 Nam Cao
  2025-07-18  7:52 ` [PATCH 1/2] selftests/eventpoll: Add test for multiple waiters Nam Cao
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Nam Cao @ 2025-07-18  7:52 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest
  Cc: Nam Cao

Hi,

While staring at epoll, I noticed ep_events_available() looks wrong. I
wrote a small program to confirm, and yes it is definitely wrong.

This series adds a reproducer to kselftest, and fix the bug.

Nam Cao (2):
  selftests/eventpoll: Add test for multiple waiters
  eventpoll: Fix epoll_wait() report false negative

 fs/eventpoll.c                                | 16 +------
 .../filesystems/epoll/epoll_wakeup_test.c     | 45 +++++++++++++++++++
 2 files changed, 47 insertions(+), 14 deletions(-)

-- 
2.39.5


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/2] selftests/eventpoll: Add test for multiple waiters
  2025-07-18  7:52 [PATCH 0/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
@ 2025-07-18  7:52 ` Nam Cao
  2025-07-18  7:52 ` [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
  2025-09-17  7:27 ` [PATCH 0/2] " Nam Cao
  2 siblings, 0 replies; 23+ messages in thread
From: Nam Cao @ 2025-07-18  7:52 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest
  Cc: Nam Cao

Add a test whichs creates 64 threads who all epoll_wait() on the same
eventpoll. The source eventfd is written but never read, therefore all the
threads should always see an EPOLLIN event.

This test fails because of a kernel bug, which will be fixed by a follow-up
commit.

Signed-off-by: Nam Cao <namcao@linutronix.de>
---
 .../filesystems/epoll/epoll_wakeup_test.c     | 45 +++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
index 65ede506305c..0852c68d0461 100644
--- a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
+++ b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
@@ -3493,4 +3493,49 @@ TEST(epoll64)
 	close(ctx.sfd[1]);
 }
 
+static void *epoll65_wait(void *ctx_)
+{
+	struct epoll_mtcontext *ctx = ctx_;
+	struct epoll_event event;
+
+	for (int i = 0; i < 100000; ++i) {
+		if (!epoll_wait(ctx->efd[0], &event, 1, 0))
+			return (void *)ENODATA;
+	}
+
+	return (void *)0;
+}
+
+TEST(epoll65)
+{
+	struct epoll_mtcontext ctx;
+	struct epoll_event event;
+	int64_t dummy_data = 99;
+	pthread_t threads[64];
+	uintptr_t ret;
+	int i, err;
+
+	ctx.efd[0] = epoll_create(1);
+	ASSERT_GE(ctx.efd[0], 0);
+	ctx.efd[1] = eventfd(0, 0);
+	ASSERT_GE(ctx.efd[1], 0);
+
+	event.events = EPOLLIN;
+	err = epoll_ctl(ctx.efd[0], EPOLL_CTL_ADD, ctx.efd[1], &event);
+	ASSERT_EQ(err, 0);
+
+	write(ctx.efd[1], &dummy_data, sizeof(dummy_data));
+
+	for (i = 0; i < ARRAY_SIZE(threads); ++i)
+		ASSERT_EQ(pthread_create(&threads[i], NULL, epoll65_wait, &ctx), 0);
+
+	for (i = 0; i < ARRAY_SIZE(threads); ++i) {
+		ASSERT_EQ(pthread_join(threads[i], (void **)&ret), 0);
+		ASSERT_EQ(ret, 0);
+	}
+
+	close(ctx.efd[0]);
+	close(ctx.efd[1]);
+}
+
 TEST_HARNESS_MAIN
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-07-18  7:52 [PATCH 0/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
  2025-07-18  7:52 ` [PATCH 1/2] selftests/eventpoll: Add test for multiple waiters Nam Cao
@ 2025-07-18  7:52 ` Nam Cao
  2025-07-18  8:38   ` Soheil Hassas Yeganeh
  2025-09-17 12:49   ` Mateusz Guzik
  2025-09-17  7:27 ` [PATCH 0/2] " Nam Cao
  2 siblings, 2 replies; 23+ messages in thread
From: Nam Cao @ 2025-07-18  7:52 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest
  Cc: Nam Cao, stable

ep_events_available() checks for available events by looking at ep->rdllist
and ep->ovflist. However, this is done without a lock, therefore the
returned value is not reliable. Because it is possible that both checks on
ep->rdllist and ep->ovflist are false while ep_start_scan() or
ep_done_scan() is being executed on other CPUs, despite events are
available.

This bug can be observed by:

  1. Create an eventpoll with at least one ready level-triggered event

  2. Create multiple threads who do epoll_wait() with zero timeout. The
     threads do not consume the events, therefore all epoll_wait() should
     return at least one event.

If one thread is executing ep_events_available() while another thread is
executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
return no event for the former thread.

This reproducer is implemented as TEST(epoll65) in
tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c

Fix it by skipping ep_events_available(), just call ep_try_send_events()
directly.

epoll_sendevents() (io_uring) suffers the same problem, fix that as well.

There is still ep_busy_loop() who uses ep_events_available() without lock,
but it is probably okay (?) for busy-polling.

Fixes: c5a282e9635e ("fs/epoll: reduce the scope of wq lock in epoll_wait()")
Fixes: e59d3c64cba6 ("epoll: eliminate unnecessary lock for zero timeout")
Fixes: ae3a4f1fdc2c ("eventpoll: add epoll_sendevents() helper")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
---
 fs/eventpoll.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 0fbf5dfedb24..541481eafc20 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2022,7 +2022,7 @@ static int ep_schedule_timeout(ktime_t *to)
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		   int maxevents, struct timespec64 *timeout)
 {
-	int res, eavail, timed_out = 0;
+	int res, eavail = 1, timed_out = 0;
 	u64 slack = 0;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;
@@ -2041,16 +2041,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		timed_out = 1;
 	}
 
-	/*
-	 * This call is racy: We may or may not see events that are being added
-	 * to the ready list under the lock (e.g., in IRQ callbacks). For cases
-	 * with a non-zero timeout, this thread will check the ready list under
-	 * lock and will add to the wait queue.  For cases with a zero
-	 * timeout, the user by definition should not care and will have to
-	 * recheck again.
-	 */
-	eavail = ep_events_available(ep);
-
 	while (1) {
 		if (eavail) {
 			res = ep_try_send_events(ep, events, maxevents);
@@ -2496,9 +2486,7 @@ int epoll_sendevents(struct file *file, struct epoll_event __user *events,
 	 * Racy call, but that's ok - it should get retried based on
 	 * poll readiness anyway.
 	 */
-	if (ep_events_available(ep))
-		return ep_try_send_events(ep, events, maxevents);
-	return 0;
+	return ep_try_send_events(ep, events, maxevents);
 }
 
 /*
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-07-18  7:52 ` [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
@ 2025-07-18  8:38   ` Soheil Hassas Yeganeh
  2025-07-18  8:59     ` Nam Cao
  2025-09-17 12:49   ` Mateusz Guzik
  1 sibling, 1 reply; 23+ messages in thread
From: Soheil Hassas Yeganeh @ 2025-07-18  8:38 UTC (permalink / raw)
  To: Nam Cao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Khazhismel Kumykov, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <namcao@linutronix.de> wrote:
>
> ep_events_available() checks for available events by looking at ep->rdllist
> and ep->ovflist. However, this is done without a lock, therefore the
> returned value is not reliable. Because it is possible that both checks on
> ep->rdllist and ep->ovflist are false while ep_start_scan() or
> ep_done_scan() is being executed on other CPUs, despite events are
> available.
>
> This bug can be observed by:
>
>   1. Create an eventpoll with at least one ready level-triggered event
>
>   2. Create multiple threads who do epoll_wait() with zero timeout. The
>      threads do not consume the events, therefore all epoll_wait() should
>      return at least one event.
>
> If one thread is executing ep_events_available() while another thread is
> executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> return no event for the former thread.

That is the whole point of epoll_wait with a zero timeout.  We would want to
opportunistically poll without much overhead, which will have more
false positives.
A caller that calls with a zero timeout should retry later, and will
at some point
observe the event.

I'm not sure if we would want to add much more overheads, for higher precision.

Thanks,
Soheil

> This reproducer is implemented as TEST(epoll65) in
> tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
>
> Fix it by skipping ep_events_available(), just call ep_try_send_events()
> directly.
>
> epoll_sendevents() (io_uring) suffers the same problem, fix that as well.
>
> There is still ep_busy_loop() who uses ep_events_available() without lock,
> but it is probably okay (?) for busy-polling.
>
> Fixes: c5a282e9635e ("fs/epoll: reduce the scope of wq lock in epoll_wait()")
> Fixes: e59d3c64cba6 ("epoll: eliminate unnecessary lock for zero timeout")
> Fixes: ae3a4f1fdc2c ("eventpoll: add epoll_sendevents() helper")
> Signed-off-by: Nam Cao <namcao@linutronix.de>
> Cc: stable@vger.kernel.org
> ---
>  fs/eventpoll.c | 16 ++--------------
>  1 file changed, 2 insertions(+), 14 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 0fbf5dfedb24..541481eafc20 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2022,7 +2022,7 @@ static int ep_schedule_timeout(ktime_t *to)
>  static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>                    int maxevents, struct timespec64 *timeout)
>  {
> -       int res, eavail, timed_out = 0;
> +       int res, eavail = 1, timed_out = 0;
>         u64 slack = 0;
>         wait_queue_entry_t wait;
>         ktime_t expires, *to = NULL;
> @@ -2041,16 +2041,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>                 timed_out = 1;
>         }
>
> -       /*
> -        * This call is racy: We may or may not see events that are being added
> -        * to the ready list under the lock (e.g., in IRQ callbacks). For cases
> -        * with a non-zero timeout, this thread will check the ready list under
> -        * lock and will add to the wait queue.  For cases with a zero
> -        * timeout, the user by definition should not care and will have to
> -        * recheck again.
> -        */
> -       eavail = ep_events_available(ep);
> -
>         while (1) {
>                 if (eavail) {
>                         res = ep_try_send_events(ep, events, maxevents);
> @@ -2496,9 +2486,7 @@ int epoll_sendevents(struct file *file, struct epoll_event __user *events,
>          * Racy call, but that's ok - it should get retried based on
>          * poll readiness anyway.
>          */
> -       if (ep_events_available(ep))
> -               return ep_try_send_events(ep, events, maxevents);
> -       return 0;
> +       return ep_try_send_events(ep, events, maxevents);
>  }
>
>  /*
> --
> 2.39.5
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-07-18  8:38   ` Soheil Hassas Yeganeh
@ 2025-07-18  8:59     ` Nam Cao
  2026-04-29  6:54       ` Christian Brauner
  0 siblings, 1 reply; 23+ messages in thread
From: Nam Cao @ 2025-07-18  8:59 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Khazhismel Kumykov, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

On Fri, Jul 18, 2025 at 09:38:27AM +0100, Soheil Hassas Yeganeh wrote:
> On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <namcao@linutronix.de> wrote:
> >
> > ep_events_available() checks for available events by looking at ep->rdllist
> > and ep->ovflist. However, this is done without a lock, therefore the
> > returned value is not reliable. Because it is possible that both checks on
> > ep->rdllist and ep->ovflist are false while ep_start_scan() or
> > ep_done_scan() is being executed on other CPUs, despite events are
> > available.
> >
> > This bug can be observed by:
> >
> >   1. Create an eventpoll with at least one ready level-triggered event
> >
> >   2. Create multiple threads who do epoll_wait() with zero timeout. The
> >      threads do not consume the events, therefore all epoll_wait() should
> >      return at least one event.
> >
> > If one thread is executing ep_events_available() while another thread is
> > executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> > return no event for the former thread.
> 
> That is the whole point of epoll_wait with a zero timeout. We would want to
> opportunistically poll without much overhead, which will have more
> false positives.
> A caller that calls with a zero timeout should retry later, and will
> at some point observe the event.

Is this a documented behavior that users expect? I do not see this in the
man page.

It sounds completely broken to me that an event has been sitting there for
some time, but epoll_wait() says there is nothing.

> I'm not sure if we would want to add much more overheads, for higher precision.

Correctness before performance please. And I'm not sure what you mean by
"much more overheads". While this patch definitely adds some cycles in case
there is no event, epoll_wait() still returns "instantly".

Nam

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/2] eventpoll: Fix epoll_wait() report false negative
  2025-07-18  7:52 [PATCH 0/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
  2025-07-18  7:52 ` [PATCH 1/2] selftests/eventpoll: Add test for multiple waiters Nam Cao
  2025-07-18  7:52 ` [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
@ 2025-09-17  7:27 ` Nam Cao
  2 siblings, 0 replies; 23+ messages in thread
From: Nam Cao @ 2025-09-17  7:27 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest

Nam Cao <namcao@linutronix.de> writes:
> While staring at epoll, I noticed ep_events_available() looks wrong. I
> wrote a small program to confirm, and yes it is definitely wrong.
>
> This series adds a reproducer to kselftest, and fix the bug.

Friendly reminder that this exists.

This probably only appears once in a blue moon, and the impact is
probably not significant. But it still should be fixed.

Nam

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-07-18  7:52 ` [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
  2025-07-18  8:38   ` Soheil Hassas Yeganeh
@ 2025-09-17 12:49   ` Mateusz Guzik
  2025-09-17 13:41     ` Nam Cao
  1 sibling, 1 reply; 23+ messages in thread
From: Mateusz Guzik @ 2025-09-17 12:49 UTC (permalink / raw)
  To: Nam Cao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

On Fri, Jul 18, 2025 at 09:52:29AM +0200, Nam Cao wrote:
> ep_events_available() checks for available events by looking at ep->rdllist
> and ep->ovflist. However, this is done without a lock, therefore the
> returned value is not reliable. Because it is possible that both checks on
> ep->rdllist and ep->ovflist are false while ep_start_scan() or
> ep_done_scan() is being executed on other CPUs, despite events are
> available.
> 
> This bug can be observed by:
> 
>   1. Create an eventpoll with at least one ready level-triggered event
> 
>   2. Create multiple threads who do epoll_wait() with zero timeout. The
>      threads do not consume the events, therefore all epoll_wait() should
>      return at least one event.
> 
> If one thread is executing ep_events_available() while another thread is
> executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> return no event for the former thread.
> 
> This reproducer is implemented as TEST(epoll65) in
> tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
> 
> Fix it by skipping ep_events_available(), just call ep_try_send_events()
> directly.
> 
> epoll_sendevents() (io_uring) suffers the same problem, fix that as well.
> 
> There is still ep_busy_loop() who uses ep_events_available() without lock,
> but it is probably okay (?) for busy-polling.
> 

I'll say upfront I'm not an epoll person, just looked here out of
curiosity.

I can agree there is a bug. The event is generated before any of the
threads even exist and they only poll for it, none of them consume it.

However, the commit message fails to explain why the change fixes
anything and I think your patch de facto reverts e59d3c64cba6 ("epoll:
eliminate unnecessary lock for zero timeout"). Looking at that diff
the point was to avoid the expensive lock trip if timeout == 0 and there
are no events.

As for the bug is, from my reading the ep_start_scan()/ep_done_scan()
pair transiently disturbs the state checked by ep_events_available(),
resulting in false-negatives. Then the locked check works because by the
time you acquire it, the damage is undone.

Given the commits referenced in Fixes:, I suspect the real fix would be
to stop destroying that state of course.

But if that's not feasible, I would check if a sequence counter around
this would do the trick -- then the racy ep_events_available(ep) upfront
would become safe with smaller overhead than with your proposal for the
no-event case, but with higher overhead when there is something.

My proposal is trivial to implement, I have no idea if it will get a
buy-in though.

> Fixes: c5a282e9635e ("fs/epoll: reduce the scope of wq lock in
> epoll_wait()") Fixes: e59d3c64cba6 ("epoll: eliminate unnecessary lock
> for zero timeout") Fixes: ae3a4f1fdc2c ("eventpoll: add
> epoll_sendevents() helper") Signed-off-by: Nam Cao
> <namcao@linutronix.de>
> Cc: stable@vger.kernel.org
> ---
>  fs/eventpoll.c | 16 ++--------------
>  1 file changed, 2 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 0fbf5dfedb24..541481eafc20 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2022,7 +2022,7 @@ static int ep_schedule_timeout(ktime_t *to)
>  static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  		   int maxevents, struct timespec64 *timeout)
>  {
> -	int res, eavail, timed_out = 0;
> +	int res, eavail = 1, timed_out = 0;
>  	u64 slack = 0;
>  	wait_queue_entry_t wait;
>  	ktime_t expires, *to = NULL;
> @@ -2041,16 +2041,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  		timed_out = 1;
>  	}
>  
> -	/*
> -	 * This call is racy: We may or may not see events that are being added
> -	 * to the ready list under the lock (e.g., in IRQ callbacks). For cases
> -	 * with a non-zero timeout, this thread will check the ready list under
> -	 * lock and will add to the wait queue.  For cases with a zero
> -	 * timeout, the user by definition should not care and will have to
> -	 * recheck again.
> -	 */
> -	eavail = ep_events_available(ep);
> -
>  	while (1) {
>  		if (eavail) {
>  			res = ep_try_send_events(ep, events, maxevents);
> @@ -2496,9 +2486,7 @@ int epoll_sendevents(struct file *file, struct epoll_event __user *events,
>  	 * Racy call, but that's ok - it should get retried based on
>  	 * poll readiness anyway.
>  	 */
> -	if (ep_events_available(ep))
> -		return ep_try_send_events(ep, events, maxevents);
> -	return 0;
> +	return ep_try_send_events(ep, events, maxevents);
>  }
>  
>  /*
> -- 
> 2.39.5
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-09-17 12:49   ` Mateusz Guzik
@ 2025-09-17 13:41     ` Nam Cao
  2025-09-17 16:05       ` Mateusz Guzik
  0 siblings, 1 reply; 23+ messages in thread
From: Nam Cao @ 2025-09-17 13:41 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

Mateusz Guzik <mjguzik@gmail.com> writes:
> I'll say upfront I'm not an epoll person, just looked here out of
> curiosity.

Feedbacks always welcomed.

> I can agree there is a bug. The event is generated before any of the
> threads even exist and they only poll for it, none of them consume it.
>
> However, the commit message fails to explain why the change fixes
> anything and I think your patch de facto reverts e59d3c64cba6 ("epoll:
> eliminate unnecessary lock for zero timeout"). Looking at that diff
> the point was to avoid the expensive lock trip if timeout == 0 and there
> are no events.
>
> As for the bug is, from my reading the ep_start_scan()/ep_done_scan()
> pair transiently disturbs the state checked by ep_events_available(),
> resulting in false-negatives. Then the locked check works because by the
> time you acquire it, the damage is undone.

Exactly so. I can add this into the description.

> Given the commits referenced in Fixes:, I suspect the real fix would be
> to stop destroying that state of course.
>
> But if that's not feasible, I would check if a sequence counter around
> this would do the trick -- then the racy ep_events_available(ep) upfront
> would become safe with smaller overhead than with your proposal for the
> no-event case, but with higher overhead when there is something.
>
> My proposal is trivial to implement, I have no idea if it will get a
> buy-in though.

My question is whether the performance of epoll_wait() with zero
timeout is really that important that we have to complicate
things. If epoll_wait() with zero timeout is called repeatedly in a loop
but there is no event, I'm sure there will be measurabled performance
drop. But sane user would just use timeout in that case.

epoll's data is protected by a lock. Therefore I think the most
straightforward solution is just taking the lock before reading the
data.

Lockless is hard to get right and may cause hard-to-debug problems. So
unless this performance drop somehow bothers someone, I would prefer
"keep it simple, stupid".

Best regards,
Nam

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-09-17 13:41     ` Nam Cao
@ 2025-09-17 16:05       ` Mateusz Guzik
  2025-09-17 16:08         ` Mateusz Guzik
  2025-09-20 14:42         ` David Laight
  0 siblings, 2 replies; 23+ messages in thread
From: Mateusz Guzik @ 2025-09-17 16:05 UTC (permalink / raw)
  To: Nam Cao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

On Wed, Sep 17, 2025 at 3:41 PM Nam Cao <namcao@linutronix.de> wrote:
> My question is whether the performance of epoll_wait() with zero
> timeout is really that important that we have to complicate
> things. If epoll_wait() with zero timeout is called repeatedly in a loop
> but there is no event, I'm sure there will be measurabled performance
> drop. But sane user would just use timeout in that case.
>
> epoll's data is protected by a lock. Therefore I think the most
> straightforward solution is just taking the lock before reading the
> data.
>

I have no idea what the original use case is. I see the author of the
patch is cc'ed, so hopefully they will answer.

> Lockless is hard to get right and may cause hard-to-debug problems. So
> unless this performance drop somehow bothers someone, I would prefer
> "keep it simple, stupid".
>

Well epoll is known to suffer from lock contention, so I would like to
think the lockless games were motivated by a real-world need, but I'm
not going peruse the history to find out.

I can agree the current state concerning ep_events_available() is
avoidably error prone and something(tm) should be done. fwiw the
refcount thing is almost free on amd64, I have no idea how this pans
out on arm64.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-09-17 16:05       ` Mateusz Guzik
@ 2025-09-17 16:08         ` Mateusz Guzik
  2025-09-17 18:03           ` Khazhy Kumykov
  2025-09-20 14:42         ` David Laight
  1 sibling, 1 reply; 23+ messages in thread
From: Mateusz Guzik @ 2025-09-17 16:08 UTC (permalink / raw)
  To: Nam Cao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

On Wed, Sep 17, 2025 at 6:05 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Wed, Sep 17, 2025 at 3:41 PM Nam Cao <namcao@linutronix.de> wrote:
> > My question is whether the performance of epoll_wait() with zero
> > timeout is really that important that we have to complicate
> > things. If epoll_wait() with zero timeout is called repeatedly in a loop
> > but there is no event, I'm sure there will be measurabled performance
> > drop. But sane user would just use timeout in that case.
> >
> > epoll's data is protected by a lock. Therefore I think the most
> > straightforward solution is just taking the lock before reading the
> > data.
> >
>
> I have no idea what the original use case is. I see the author of the
> patch is cc'ed, so hopefully they will answer.
>
> > Lockless is hard to get right and may cause hard-to-debug problems. So
> > unless this performance drop somehow bothers someone, I would prefer
> > "keep it simple, stupid".
> >
>
> Well epoll is known to suffer from lock contention, so I would like to
> think the lockless games were motivated by a real-world need, but I'm
> not going peruse the history to find out.
>
> I can agree the current state concerning ep_events_available() is
> avoidably error prone and something(tm) should be done. fwiw the
> refcount thing is almost free on amd64, I have no idea how this pans
> out on arm64.

erm, seqcount

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-09-17 16:08         ` Mateusz Guzik
@ 2025-09-17 18:03           ` Khazhy Kumykov
  2025-09-17 22:28             ` Khazhy Kumykov
  0 siblings, 1 reply; 23+ messages in thread
From: Khazhy Kumykov @ 2025-09-17 18:03 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Nam Cao, Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

I think the justification for the original comment is: epoll_wait
returns either when events are available, or the timeout is hit -> and
if the timeout is hit, "event is available" is undefined (or another
way: it would be incorrect to interpret a timeout being hit as "no
events available"). So one could justify this missed event that way,
but it does feel against the spirit of the API, especially since the
event may have existed for an infinite amount of time, and still be
missed.

Glancing again at this code, ep_events_available() should return true
if rdllist is not empty, is actively being modified, or if ovflist is
not EP_UNACTIVE_PTR.

One ordering thing that sticks out to me is ep_start_scan first
splices out rdllist, *then* clears ovflist (from EP_UNACTIVE_PTR ->
NULL). This creates a short condition where rdllist is empty, not
being modified, but ovflist is still EP_UNACTIVE_PTR -> which we
interpret as "no events available" - even though a local txlist may
have some events. It seems like, for this lockless check to remain
accurate, we should need to reverse the order of these two operations,
*and* ensure the order remains observable. (and for users using the
lock, there should be no observable difference with this change)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-09-17 18:03           ` Khazhy Kumykov
@ 2025-09-17 22:28             ` Khazhy Kumykov
  2025-09-17 22:38               ` Mateusz Guzik
  0 siblings, 1 reply; 23+ messages in thread
From: Khazhy Kumykov @ 2025-09-17 22:28 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Nam Cao, Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

On Wed, Sep 17, 2025 at 11:03 AM Khazhy Kumykov <khazhy@chromium.org> wrote:
>
> One ordering thing that sticks out to me is
(ordering can't help here since we'll rapidly be flapping between
ovflist in use and inactive... right)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-09-17 22:28             ` Khazhy Kumykov
@ 2025-09-17 22:38               ` Mateusz Guzik
  2025-09-22  6:26                 ` Nam Cao
  0 siblings, 1 reply; 23+ messages in thread
From: Mateusz Guzik @ 2025-09-17 22:38 UTC (permalink / raw)
  To: Khazhy Kumykov
  Cc: Nam Cao, Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

On Thu, Sep 18, 2025 at 12:34 AM Khazhy Kumykov <khazhy@chromium.org> wrote:
>
> On Wed, Sep 17, 2025 at 11:03 AM Khazhy Kumykov <khazhy@chromium.org> wrote:
> >
> > One ordering thing that sticks out to me is
> (ordering can't help here since we'll rapidly be flapping between
> ovflist in use and inactive... right)

a sequence counter around shenanigans will sort it out, but I don't
know if it is worth it and don't really want to investigate myself.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-09-17 16:05       ` Mateusz Guzik
  2025-09-17 16:08         ` Mateusz Guzik
@ 2025-09-20 14:42         ` David Laight
  2025-09-20 14:45           ` Mateusz Guzik
  1 sibling, 1 reply; 23+ messages in thread
From: David Laight @ 2025-09-20 14:42 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Nam Cao, Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

On Wed, 17 Sep 2025 18:05:45 +0200
Mateusz Guzik <mjguzik@gmail.com> wrote:

> On Wed, Sep 17, 2025 at 3:41 PM Nam Cao <namcao@linutronix.de> wrote:
> > My question is whether the performance of epoll_wait() with zero
> > timeout is really that important that we have to complicate
> > things. If epoll_wait() with zero timeout is called repeatedly in a loop
> > but there is no event, I'm sure there will be measurabled performance
> > drop. But sane user would just use timeout in that case.
> >
> > epoll's data is protected by a lock. Therefore I think the most
> > straightforward solution is just taking the lock before reading the
> > data.
> >  
> 
> I have no idea what the original use case is. I see the author of the
> patch is cc'ed, so hopefully they will answer.
> 
> > Lockless is hard to get right and may cause hard-to-debug problems. So
> > unless this performance drop somehow bothers someone, I would prefer
> > "keep it simple, stupid".
> >  
> 
> Well epoll is known to suffer from lock contention, so I would like to
> think the lockless games were motivated by a real-world need, but I'm
> not going peruse the history to find out.
> 
> I can agree the current state concerning ep_events_available() is
> avoidably error prone and something(tm) should be done. fwiw the
> refcount thing is almost free on amd64, I have no idea how this pans
> out on arm64.

Atomic operations are anything but free....
They are likely to be a similar cost to an uncontested spinlock entry.

	David


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-09-20 14:42         ` David Laight
@ 2025-09-20 14:45           ` Mateusz Guzik
  0 siblings, 0 replies; 23+ messages in thread
From: Mateusz Guzik @ 2025-09-20 14:45 UTC (permalink / raw)
  To: David Laight
  Cc: Nam Cao, Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

On Sat, Sep 20, 2025 at 4:42 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Wed, 17 Sep 2025 18:05:45 +0200
> Mateusz Guzik <mjguzik@gmail.com> wrote:
> > I can agree the current state concerning ep_events_available() is
> > avoidably error prone and something(tm) should be done. fwiw the
> > refcount thing is almost free on amd64, I have no idea how this pans
> > out on arm64.
>
> Atomic operations are anything but free....
> They are likely to be a similar cost to an uncontested spinlock entry.
>

In this context it was supposed to be s/refcount/seqcount/ and on
amd64 that's loading the same var twice + a branch for the read thing.
Not *free* but not in the same galaxy comped to acquiring a spinlock
(even assuming it is uncontested).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-09-17 22:38               ` Mateusz Guzik
@ 2025-09-22  6:26                 ` Nam Cao
  0 siblings, 0 replies; 23+ messages in thread
From: Nam Cao @ 2025-09-22  6:26 UTC (permalink / raw)
  To: Mateusz Guzik, Khazhy Kumykov
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Soheil Hassas Yeganeh, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

Mateusz Guzik <mjguzik@gmail.com> writes:
> a sequence counter around shenanigans will sort it out, but I don't
> know if it is worth it and don't really want to investigate myself.

The original commit did mention "1% CPU/RPC reduction in RPC benchmarks".
I'm not sure what "RPC" stands for and which benchmark it is. But if it
is really important, we must have heard by now.

Nam

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2025-07-18  8:59     ` Nam Cao
@ 2026-04-29  6:54       ` Christian Brauner
  2026-04-29  7:27         ` Nam Cao
  2026-05-04 12:00         ` David Laight
  0 siblings, 2 replies; 23+ messages in thread
From: Christian Brauner @ 2026-04-29  6:54 UTC (permalink / raw)
  To: Nam Cao
  Cc: Soheil Hassas Yeganeh, Alexander Viro, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Khazhismel Kumykov, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

On Fri, Jul 18, 2025 at 10:59:48AM +0200, Nam Cao wrote:
> On Fri, Jul 18, 2025 at 09:38:27AM +0100, Soheil Hassas Yeganeh wrote:
> > On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <namcao@linutronix.de> wrote:
> > >
> > > ep_events_available() checks for available events by looking at ep->rdllist
> > > and ep->ovflist. However, this is done without a lock, therefore the
> > > returned value is not reliable. Because it is possible that both checks on
> > > ep->rdllist and ep->ovflist are false while ep_start_scan() or
> > > ep_done_scan() is being executed on other CPUs, despite events are
> > > available.
> > >
> > > This bug can be observed by:
> > >
> > >   1. Create an eventpoll with at least one ready level-triggered event
> > >
> > >   2. Create multiple threads who do epoll_wait() with zero timeout. The
> > >      threads do not consume the events, therefore all epoll_wait() should
> > >      return at least one event.
> > >
> > > If one thread is executing ep_events_available() while another thread is
> > > executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> > > return no event for the former thread.
> > 
> > That is the whole point of epoll_wait with a zero timeout. We would want to
> > opportunistically poll without much overhead, which will have more
> > false positives.
> > A caller that calls with a zero timeout should retry later, and will
> > at some point observe the event.
> 
> Is this a documented behavior that users expect? I do not see this in the
> man page.

The selftests rely on this behavior that timeout=0 sees events from a
concurrently running producer. They would fail at a very higher rate
after this change - believe me I had a similar patch that changed
something in this area. I would explore the seqcount that Mateusz
suggested tbh.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-04-29  6:54       ` Christian Brauner
@ 2026-04-29  7:27         ` Nam Cao
  2026-04-29 15:34           ` Mateusz Guzik
  2026-05-04 12:00         ` David Laight
  1 sibling, 1 reply; 23+ messages in thread
From: Nam Cao @ 2026-04-29  7:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Soheil Hassas Yeganeh, Alexander Viro, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Khazhismel Kumykov, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

Christian Brauner <brauner@kernel.org> writes:
> The selftests rely on this behavior that timeout=0 sees events from a
> concurrently running producer. They would fail at a very higher rate
> after this change - believe me I had a similar patch that changed
> something in this area.

Huh, that's interesting. Do you still remember which selftest cases rely
on this behavior? I would like to study them further.

> I would explore the seqcount that Mateusz suggested tbh.

I will investigate that.

Nam

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-04-29  7:27         ` Nam Cao
@ 2026-04-29 15:34           ` Mateusz Guzik
  2026-05-03 13:24             ` Nam Cao
  0 siblings, 1 reply; 23+ messages in thread
From: Mateusz Guzik @ 2026-04-29 15:34 UTC (permalink / raw)
  To: Nam Cao
  Cc: Christian Brauner, Soheil Hassas Yeganeh, Alexander Viro,
	Jan Kara, Shuah Khan, Davidlohr Bueso, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

On Wed, Apr 29, 2026 at 09:27:59AM +0200, Nam Cao wrote:
> Christian Brauner <brauner@kernel.org> writes:
> > The selftests rely on this behavior that timeout=0 sees events from a
> > concurrently running producer. They would fail at a very higher rate
> > after this change - believe me I had a similar patch that changed
> > something in this area.
> 
> Huh, that's interesting. Do you still remember which selftest cases rely
> on this behavior? I would like to study them further.
> 
> > I would explore the seqcount that Mateusz suggested tbh.
> 
> I will investigate that.
> 

In the meantime I grew fond of another approach: have the write side
re-calc the conditon the unlocked side checks for.

While the seqc thing solves the scalabilty problem, it still requires
fences which are not free on arm.

the goal would be to make it so that this:
static inline int ep_events_available(struct eventpoll *ep)
{  
        return !list_empty_careful(&ep->rdllist) ||
                READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;    
}

can be converted into:
static inline int ep_events_available(struct eventpoll *ep)
{  
	return ep->has_events;
}

Which in turn means that any codepath which messes with either rdllist
or ovflist will need to recalc before it ends up unlocking.

Strictly speaking more error prone than the seq approach, but should be
faster on weaker-ordered archs thanks to avoided fences.

I'm definitely not going to protest the seqc route.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-04-29 15:34           ` Mateusz Guzik
@ 2026-05-03 13:24             ` Nam Cao
  2026-05-21 12:38               ` Mateusz Guzik
  0 siblings, 1 reply; 23+ messages in thread
From: Nam Cao @ 2026-05-03 13:24 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Christian Brauner, Soheil Hassas Yeganeh, Alexander Viro,
	Jan Kara, Shuah Khan, Davidlohr Bueso, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

Mateusz Guzik <mjguzik@gmail.com> writes:
> Strictly speaking more error prone than the seq approach, but should be
> faster on weaker-ordered archs thanks to avoided fences.
>
> I'm definitely not going to protest the seqc route.

Linus probably wouldn't be thrilled if I break epoll again, so let's
stay with the simpler seqcount route.

Nam

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a3090b446af1..22c3f0186476 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -38,6 +38,7 @@
 #include <linux/compat.h>
 #include <linux/rculist.h>
 #include <linux/capability.h>
+#include <linux/seqlock.h>
 #include <net/busy_poll.h>
 
 /*
@@ -190,6 +191,9 @@ struct eventpoll {
 	/* Lock which protects rdllist and ovflist */
 	spinlock_t lock;
 
+	/* Protect switching between rdllist and ovflist */
+	seqcount_spinlock_t seq;
+
 	/* RB tree root used to store monitored fd structs */
 	struct rb_root_cached rbr;
 
@@ -382,8 +386,17 @@ static inline struct epitem *ep_item_from_wait(wait_queue_entry_t *p)
  */
 static inline int ep_events_available(struct eventpoll *ep)
 {
-	return !list_empty_careful(&ep->rdllist) ||
-		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+	bool events_available;
+	unsigned int seq;
+
+	do {
+		seq = read_seqcount_begin(&ep->seq);
+
+		events_available = !list_empty_careful(&ep->rdllist) ||
+				   READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+	} while (read_seqcount_retry(&ep->seq, seq));
+
+	return events_available;
 }
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -735,8 +748,12 @@ static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
 	 */
 	lockdep_assert_irqs_enabled();
 	spin_lock_irq(&ep->lock);
+	write_seqcount_begin(&ep->seq);
+
 	list_splice_init(&ep->rdllist, txlist);
 	WRITE_ONCE(ep->ovflist, NULL);
+
+	write_seqcount_end(&ep->seq);
 	spin_unlock_irq(&ep->lock);
 }
 
@@ -768,6 +785,9 @@ static void ep_done_scan(struct eventpoll *ep,
 			ep_pm_stay_awake(epi);
 		}
 	}
+
+	write_seqcount_begin(&ep->seq);
+
 	/*
 	 * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
 	 * releasing the lock, events will be queued in the normal way inside
@@ -779,6 +799,9 @@ static void ep_done_scan(struct eventpoll *ep,
 	 * Quickly re-inject items left on "txlist".
 	 */
 	list_splice(txlist, &ep->rdllist);
+
+	write_seqcount_end(&ep->seq);
+
 	__pm_relax(ep->ws);
 
 	if (!list_empty(&ep->rdllist)) {
@@ -1155,6 +1178,7 @@ static int ep_alloc(struct eventpoll **pep)
 
 	mutex_init(&ep->mtx);
 	spin_lock_init(&ep->lock);
+	seqcount_spinlock_init(&ep->seq, &ep->lock);
 	init_waitqueue_head(&ep->wq);
 	init_waitqueue_head(&ep->poll_wait);
 	INIT_LIST_HEAD(&ep->rdllist);

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-04-29  6:54       ` Christian Brauner
  2026-04-29  7:27         ` Nam Cao
@ 2026-05-04 12:00         ` David Laight
  1 sibling, 0 replies; 23+ messages in thread
From: David Laight @ 2026-05-04 12:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Nam Cao, Soheil Hassas Yeganeh, Alexander Viro, Jan Kara,
	Shuah Khan, Davidlohr Bueso, Khazhismel Kumykov, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

On Wed, 29 Apr 2026 08:54:06 +0200
Christian Brauner <brauner@kernel.org> wrote:

> On Fri, Jul 18, 2025 at 10:59:48AM +0200, Nam Cao wrote:
> > On Fri, Jul 18, 2025 at 09:38:27AM +0100, Soheil Hassas Yeganeh wrote:  
> > > On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <namcao@linutronix.de> wrote:  
> > > >
> > > > ep_events_available() checks for available events by looking at ep->rdllist
> > > > and ep->ovflist. However, this is done without a lock, therefore the
> > > > returned value is not reliable. Because it is possible that both checks on
> > > > ep->rdllist and ep->ovflist are false while ep_start_scan() or
> > > > ep_done_scan() is being executed on other CPUs, despite events are
> > > > available.
> > > >
> > > > This bug can be observed by:
> > > >
> > > >   1. Create an eventpoll with at least one ready level-triggered event
> > > >
> > > >   2. Create multiple threads who do epoll_wait() with zero timeout. The
> > > >      threads do not consume the events, therefore all epoll_wait() should
> > > >      return at least one event.
> > > >
> > > > If one thread is executing ep_events_available() while another thread is
> > > > executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> > > > return no event for the former thread.  
> > > 
> > > That is the whole point of epoll_wait with a zero timeout. We would want to
> > > opportunistically poll without much overhead, which will have more
> > > false positives.
> > > A caller that calls with a zero timeout should retry later, and will
> > > at some point observe the event.  
> > 
> > Is this a documented behavior that users expect? I do not see this in the
> > man page.  
> 
> The selftests rely on this behavior that timeout=0 sees events from a
> concurrently running producer. They would fail at a very higher rate
> after this change - believe me I had a similar patch that changed
> something in this area. I would explore the seqcount that Mateusz
> suggested tbh.
> 

Does this scenario really affect any real programs?
It doesn't make sense to have multiple threads looking for level-triggered
events on a single epoll fd.
When epoll returns an event you really need to do a (usually) read on
the associated file descriptor before calling epoll again.

To split the epoll processing between multiple threads you need lots of
epoll fd with the underlying fd distributed between them and get the
threads to process the epoll fd sequentially (eg by putting the fd in an
array and using an atomic increment of a global array index to get the
next epoll fd to process).

-- David

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-05-03 13:24             ` Nam Cao
@ 2026-05-21 12:38               ` Mateusz Guzik
  2026-05-25 12:19                 ` Nam Cao
  0 siblings, 1 reply; 23+ messages in thread
From: Mateusz Guzik @ 2026-05-21 12:38 UTC (permalink / raw)
  To: Nam Cao
  Cc: Christian Brauner, Soheil Hassas Yeganeh, Alexander Viro,
	Jan Kara, Shuah Khan, Davidlohr Bueso, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

On Sun, May 3, 2026 at 3:24 PM Nam Cao <namcao@linutronix.de> wrote:
>
> Mateusz Guzik <mjguzik@gmail.com> writes:
> > Strictly speaking more error prone than the seq approach, but should be
> > faster on weaker-ordered archs thanks to avoided fences.
> >
> > I'm definitely not going to protest the seqc route.
>
> Linus probably wouldn't be thrilled if I break epoll again, so let's
> stay with the simpler seqcount route.
>
> Nam
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index a3090b446af1..22c3f0186476 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -38,6 +38,7 @@
>  #include <linux/compat.h>
>  #include <linux/rculist.h>
>  #include <linux/capability.h>
> +#include <linux/seqlock.h>
>  #include <net/busy_poll.h>
>
>  /*
> @@ -190,6 +191,9 @@ struct eventpoll {
>         /* Lock which protects rdllist and ovflist */
>         spinlock_t lock;
>
> +       /* Protect switching between rdllist and ovflist */
> +       seqcount_spinlock_t seq;
> +
>         /* RB tree root used to store monitored fd structs */
>         struct rb_root_cached rbr;
>
> @@ -382,8 +386,17 @@ static inline struct epitem *ep_item_from_wait(wait_queue_entry_t *p)
>   */
>  static inline int ep_events_available(struct eventpoll *ep)
>  {
> -       return !list_empty_careful(&ep->rdllist) ||
> -               READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
> +       bool events_available;
> +       unsigned int seq;
> +
> +       do {
> +               seq = read_seqcount_begin(&ep->seq);
> +
> +               events_available = !list_empty_careful(&ep->rdllist) ||
> +                                  READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
> +       } while (read_seqcount_retry(&ep->seq, seq));
> +
> +       return events_available;
>  }
>
>  #ifdef CONFIG_NET_RX_BUSY_POLL
> @@ -735,8 +748,12 @@ static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
>          */
>         lockdep_assert_irqs_enabled();
>         spin_lock_irq(&ep->lock);
> +       write_seqcount_begin(&ep->seq);
> +
>         list_splice_init(&ep->rdllist, txlist);
>         WRITE_ONCE(ep->ovflist, NULL);
> +
> +       write_seqcount_end(&ep->seq);
>         spin_unlock_irq(&ep->lock);
>  }
>
> @@ -768,6 +785,9 @@ static void ep_done_scan(struct eventpoll *ep,
>                         ep_pm_stay_awake(epi);
>                 }
>         }
> +
> +       write_seqcount_begin(&ep->seq);
> +
>         /*
>          * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
>          * releasing the lock, events will be queued in the normal way inside
> @@ -779,6 +799,9 @@ static void ep_done_scan(struct eventpoll *ep,
>          * Quickly re-inject items left on "txlist".
>          */
>         list_splice(txlist, &ep->rdllist);
> +
> +       write_seqcount_end(&ep->seq);
> +
>         __pm_relax(ep->ws);
>
>         if (!list_empty(&ep->rdllist)) {
> @@ -1155,6 +1178,7 @@ static int ep_alloc(struct eventpoll **pep)
>
>         mutex_init(&ep->mtx);
>         spin_lock_init(&ep->lock);
> +       seqcount_spinlock_init(&ep->seq, &ep->lock);
>         init_waitqueue_head(&ep->wq);
>         init_waitqueue_head(&ep->poll_wait);
>         INIT_LIST_HEAD(&ep->rdllist);

Apologies for late reply.

The diff reads ok to me, but I would consider not looping in case of
seq mismatch.

So does it solve the problem?

I think a somewhat of a blocker here would be to bench the thing -- I
would expect some slowdown compared to stock kernel, but should be
still be faster than the previously proposed patch.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-05-21 12:38               ` Mateusz Guzik
@ 2026-05-25 12:19                 ` Nam Cao
  0 siblings, 0 replies; 23+ messages in thread
From: Nam Cao @ 2026-05-25 12:19 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Christian Brauner, Soheil Hassas Yeganeh, Alexander Viro,
	Jan Kara, Shuah Khan, Davidlohr Bueso, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

Mateusz Guzik <mjguzik@gmail.com> writes:
> Apologies for late reply.

No problem. Your comment is welcomed, but I was actually waiting to see
if my testing comes back with anything suspicious.

> The diff reads ok to me, but I would consider not looping in case of
> seq mismatch.

You mean something like below?

static inline int ep_events_available(struct eventpoll *ep)
{
        unsigned int seq = read_seqcount_begin(&ep->seq);

        return !list_empty_careful(&ep->rdllist) ||
               READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR ||
               read_seqcount_retry(&ep->seq, seq);
}

Looping almost never happens, so I don't think it makes any
difference. But sure, it is a bit shorter.

> So does it solve the problem?

Yes it does.

> I think a somewhat of a blocker here would be to bench the thing -- I
> would expect some slowdown compared to stock kernel but should be
> still be faster than the previously proposed patch.

I expect the slowdown to be in the noise, but I will do a benchmark.

Nam

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2026-05-25 12:19 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-18  7:52 [PATCH 0/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
2025-07-18  7:52 ` [PATCH 1/2] selftests/eventpoll: Add test for multiple waiters Nam Cao
2025-07-18  7:52 ` [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative Nam Cao
2025-07-18  8:38   ` Soheil Hassas Yeganeh
2025-07-18  8:59     ` Nam Cao
2026-04-29  6:54       ` Christian Brauner
2026-04-29  7:27         ` Nam Cao
2026-04-29 15:34           ` Mateusz Guzik
2026-05-03 13:24             ` Nam Cao
2026-05-21 12:38               ` Mateusz Guzik
2026-05-25 12:19                 ` Nam Cao
2026-05-04 12:00         ` David Laight
2025-09-17 12:49   ` Mateusz Guzik
2025-09-17 13:41     ` Nam Cao
2025-09-17 16:05       ` Mateusz Guzik
2025-09-17 16:08         ` Mateusz Guzik
2025-09-17 18:03           ` Khazhy Kumykov
2025-09-17 22:28             ` Khazhy Kumykov
2025-09-17 22:38               ` Mateusz Guzik
2025-09-22  6:26                 ` Nam Cao
2025-09-20 14:42         ` David Laight
2025-09-20 14:45           ` Mateusz Guzik
2025-09-17  7:27 ` [PATCH 0/2] " Nam Cao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox