* [PATCH 0/2] eventpoll: Fix epoll_wait() report false negative
From: Nam Cao @ 2025-07-18  7:52 UTC
To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Davidlohr Bueso,
    Soheil Hassas Yeganeh, Khazhismel Kumykov, Willem de Bruijn, Eric Dumazet,
    Jens Axboe, linux-fsdevel, linux-kernel, linux-kselftest
Cc: Nam Cao

Hi,

While staring at epoll, I noticed ep_events_available() looks wrong. I wrote
a small program to confirm, and yes, it is definitely wrong.

This series adds a reproducer to kselftest and fixes the bug.

Nam Cao (2):
  selftests/eventpoll: Add test for multiple waiters
  eventpoll: Fix epoll_wait() report false negative

 fs/eventpoll.c                                 | 16 +------
 .../filesystems/epoll/epoll_wakeup_test.c      | 45 +++++++++++++++++++
 2 files changed, 47 insertions(+), 14 deletions(-)

-- 
2.39.5
* [PATCH 1/2] selftests/eventpoll: Add test for multiple waiters
From: Nam Cao @ 2025-07-18  7:52 UTC
To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Davidlohr Bueso,
    Soheil Hassas Yeganeh, Khazhismel Kumykov, Willem de Bruijn, Eric Dumazet,
    Jens Axboe, linux-fsdevel, linux-kernel, linux-kselftest
Cc: Nam Cao

Add a test which creates 64 threads that all epoll_wait() on the same
eventpoll. The source eventfd is written but never read, therefore all the
threads should always see an EPOLLIN event.

This test fails because of a kernel bug, which will be fixed by a follow-up
commit.

Signed-off-by: Nam Cao <namcao@linutronix.de>
---
 .../filesystems/epoll/epoll_wakeup_test.c | 45 +++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
index 65ede506305c..0852c68d0461 100644
--- a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
+++ b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
@@ -3493,4 +3493,49 @@ TEST(epoll64)
 	close(ctx.sfd[1]);
 }
 
+static void *epoll65_wait(void *ctx_)
+{
+	struct epoll_mtcontext *ctx = ctx_;
+	struct epoll_event event;
+
+	for (int i = 0; i < 100000; ++i) {
+		if (!epoll_wait(ctx->efd[0], &event, 1, 0))
+			return (void *)ENODATA;
+	}
+
+	return (void *)0;
+}
+
+TEST(epoll65)
+{
+	struct epoll_mtcontext ctx;
+	struct epoll_event event;
+	int64_t dummy_data = 99;
+	pthread_t threads[64];
+	uintptr_t ret;
+	int i, err;
+
+	ctx.efd[0] = epoll_create(1);
+	ASSERT_GE(ctx.efd[0], 0);
+	ctx.efd[1] = eventfd(0, 0);
+	ASSERT_GE(ctx.efd[1], 0);
+
+	event.events = EPOLLIN;
+	err = epoll_ctl(ctx.efd[0], EPOLL_CTL_ADD, ctx.efd[1], &event);
+	ASSERT_EQ(err, 0);
+
+	write(ctx.efd[1], &dummy_data, sizeof(dummy_data));
+
+	for (i = 0; i < ARRAY_SIZE(threads); ++i)
+		ASSERT_EQ(pthread_create(&threads[i], NULL, epoll65_wait, &ctx), 0);
+
+	for (i = 0; i < ARRAY_SIZE(threads); ++i) {
+		ASSERT_EQ(pthread_join(threads[i], (void **)&ret), 0);
+		ASSERT_EQ(ret, 0);
+	}
+
+	close(ctx.efd[0]);
+	close(ctx.efd[1]);
+}
+
 TEST_HARNESS_MAIN
-- 
2.39.5
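[Editorial note: for readers who want to try this outside the kselftest
harness, here is a minimal standalone sketch of the same reproducer. It
mirrors the test above but uses plain pthreads instead of the harness
macros; the hypothetical file name repro.c, the 64 threads, the 100000
iterations, and the error handling are arbitrary illustration choices, not
anything the patch mandates.]

/*
 * Standalone sketch of the epoll65 reproducer (hypothetical repro.c).
 * Build with: gcc -pthread repro.c
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int epfd;

static void *waiter(void *arg)
{
	struct epoll_event ev;

	(void)arg;
	for (int i = 0; i < 100000; i++) {
		/*
		 * Zero timeout; the level-triggered eventfd is never read,
		 * so a return value of 0 ("nothing ready") is a false
		 * negative.
		 */
		if (epoll_wait(epfd, &ev, 1, 0) == 0) {
			fprintf(stderr, "false negative at iteration %d\n", i);
			exit(1);
		}
	}
	return NULL;
}

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	pthread_t threads[64];
	uint64_t val = 99;
	int evfd;

	epfd = epoll_create1(0);
	evfd = eventfd(0, 0);
	if (epfd < 0 || evfd < 0 ||
	    epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev) < 0)
		return 1;

	/* Arm the eventfd once; nobody reads it, so it stays ready. */
	if (write(evfd, &val, sizeof(val)) != sizeof(val))
		return 1;

	for (unsigned i = 0; i < sizeof(threads) / sizeof(threads[0]); i++)
		pthread_create(&threads[i], NULL, waiter, NULL);
	for (unsigned i = 0; i < sizeof(threads) / sizeof(threads[0]); i++)
		pthread_join(threads[i], NULL);

	puts("no false negative observed");
	return 0;
}

[On an affected kernel, a waiter that races with another thread's scan may
exit with the false-negative message; on a fixed kernel the program should
always print the success line.]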
* [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
From: Nam Cao @ 2025-07-18  7:52 UTC
To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Davidlohr Bueso,
    Soheil Hassas Yeganeh, Khazhismel Kumykov, Willem de Bruijn, Eric Dumazet,
    Jens Axboe, linux-fsdevel, linux-kernel, linux-kselftest
Cc: Nam Cao, stable

ep_events_available() checks for available events by looking at ep->rdllist
and ep->ovflist. However, this is done without a lock, therefore the
returned value is not reliable: it is possible that both checks on
ep->rdllist and ep->ovflist are false while ep_start_scan() or
ep_done_scan() is being executed on other CPUs, even though events are
available.

This bug can be observed by:

1. Create an eventpoll with at least one ready level-triggered event.

2. Create multiple threads which do epoll_wait() with zero timeout. The
   threads do not consume the events, therefore all epoll_wait() calls
   should return at least one event.

If one thread is executing ep_events_available() while another thread is
executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
return no event for the former thread.

This reproducer is implemented as TEST(epoll65) in
tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c

Fix it by skipping ep_events_available() and calling ep_try_send_events()
directly.

epoll_sendevents() (io_uring) suffers from the same problem; fix that as
well.

There is still ep_busy_loop(), which uses ep_events_available() without the
lock, but that is probably okay (?) for busy-polling.

Fixes: c5a282e9635e ("fs/epoll: reduce the scope of wq lock in epoll_wait()")
Fixes: e59d3c64cba6 ("epoll: eliminate unnecessary lock for zero timeout")
Fixes: ae3a4f1fdc2c ("eventpoll: add epoll_sendevents() helper")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
---
 fs/eventpoll.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 0fbf5dfedb24..541481eafc20 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2022,7 +2022,7 @@ static int ep_schedule_timeout(ktime_t *to)
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		   int maxevents, struct timespec64 *timeout)
 {
-	int res, eavail, timed_out = 0;
+	int res, eavail = 1, timed_out = 0;
 	u64 slack = 0;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;
@@ -2041,16 +2041,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		timed_out = 1;
 	}
 
-	/*
-	 * This call is racy: We may or may not see events that are being added
-	 * to the ready list under the lock (e.g., in IRQ callbacks). For cases
-	 * with a non-zero timeout, this thread will check the ready list under
-	 * lock and will add to the wait queue. For cases with a zero
-	 * timeout, the user by definition should not care and will have to
-	 * recheck again.
-	 */
-	eavail = ep_events_available(ep);
-
 	while (1) {
 		if (eavail) {
 			res = ep_try_send_events(ep, events, maxevents);
@@ -2496,9 +2486,7 @@ int epoll_sendevents(struct file *file, struct epoll_event __user *events,
 	 * Racy call, but that's ok - it should get retried based on
 	 * poll readiness anyway.
 	 */
-	if (ep_events_available(ep))
-		return ep_try_send_events(ep, events, maxevents);
-	return 0;
+	return ep_try_send_events(ep, events, maxevents);
 }
 
 /*
-- 
2.39.5
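[Editorial note: to make the interleaving easier to picture, here is a toy,
single-threaded model of the window the commit message describes. It is not
the real fs/eventpoll.c code: the struct and its fields (ready, txlist,
overflow_active) are invented simplifications of ep->rdllist, the scanner's
txlist, and the ovflist state, and in the kernel the scanner does its
splicing under ep->lock while only the availability check is lockless.]

/* Toy model of the false-negative window -- not actual kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct ep_model {
	int  ready;           /* stands in for ep->rdllist                 */
	int  txlist;          /* items a scanner has temporarily moved out */
	bool overflow_active; /* stands in for the ovflist state           */
};

/* The lockless check: it can observe a scanner's transient state. */
static bool events_available(const struct ep_model *ep)
{
	return ep->ready > 0 || ep->overflow_active;
}

int main(void)
{
	/* One ready, level-triggered event that nobody consumes. */
	struct ep_model ep = { .ready = 1 };

	/* A scanner starts: the ready list is spliced onto its txlist. */
	ep.txlist = ep.ready;
	ep.ready = 0;
	ep.overflow_active = true;

	/* ... the scan callback runs; the item stays on txlist ... */

	/* The scan ends: overflow collection stops, splice-back still pending. */
	ep.overflow_active = false;

	/* A zero-timeout epoll_wait() on another CPU checks right here: */
	printf("during the scan window: %s\n",
	       events_available(&ep) ? "available" : "false negative");

	/* Splice-back completes; the event is visible again. */
	ep.ready += ep.txlist;
	ep.txlist = 0;
	printf("after the scan:         %s\n",
	       events_available(&ep) ? "available" : "false negative");
	return 0;
}

[Because the send path re-checks the ready list under the eventpoll locks,
calling ep_try_send_events() unconditionally, as the patch does, avoids
trusting this transient snapshot at the cost of a little extra work when
nothing is ready.]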
* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
From: Soheil Hassas Yeganeh @ 2025-07-18  8:38 UTC
To: Nam Cao
Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Davidlohr Bueso,
    Khazhismel Kumykov, Willem de Bruijn, Eric Dumazet, Jens Axboe,
    linux-fsdevel, linux-kernel, linux-kselftest, stable

On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <namcao@linutronix.de> wrote:
>
> ep_events_available() checks for available events by looking at ep->rdllist
> and ep->ovflist. However, this is done without a lock, therefore the
> returned value is not reliable: it is possible that both checks on
> ep->rdllist and ep->ovflist are false while ep_start_scan() or
> ep_done_scan() is being executed on other CPUs, even though events are
> available.
>
> This bug can be observed by:
>
> 1. Create an eventpoll with at least one ready level-triggered event.
>
> 2. Create multiple threads which do epoll_wait() with zero timeout. The
>    threads do not consume the events, therefore all epoll_wait() calls
>    should return at least one event.
>
> If one thread is executing ep_events_available() while another thread is
> executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> return no event for the former thread.

That is the whole point of epoll_wait() with a zero timeout. We want to
opportunistically poll without much overhead, which will have more false
positives. A caller that uses a zero timeout should retry later and will at
some point observe the event.

I'm not sure we would want to add much more overhead for higher precision.

Thanks,
Soheil
* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
From: Nam Cao @ 2025-07-18  8:59 UTC
To: Soheil Hassas Yeganeh
Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Davidlohr Bueso,
    Khazhismel Kumykov, Willem de Bruijn, Eric Dumazet, Jens Axboe,
    linux-fsdevel, linux-kernel, linux-kselftest, stable

On Fri, Jul 18, 2025 at 09:38:27AM +0100, Soheil Hassas Yeganeh wrote:
> On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <namcao@linutronix.de> wrote:
> >
> > If one thread is executing ep_events_available() while another thread is
> > executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> > return no event for the former thread.
>
> That is the whole point of epoll_wait() with a zero timeout. We want to
> opportunistically poll without much overhead, which will have more false
> positives. A caller that uses a zero timeout should retry later and will at
> some point observe the event.

Is this a documented behavior that users expect? I do not see it in the man
page. It sounds completely broken to me that an event has been sitting there
for some time, yet epoll_wait() says there is nothing.

> I'm not sure we would want to add much more overhead for higher precision.

Correctness before performance, please. And I am not sure what you mean by
"much more overhead": while this patch does add some cycles when there is no
event, epoll_wait() still returns "instantly".

Nam
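[Editorial note: a rough way to sanity-check the "still returns instantly"
claim is to time zero-timeout epoll_wait() calls on an eventpoll with
nothing ready. The sketch below is not part of the series, the result is
machine-dependent, and it only indicates an order of magnitude.]

/* Hypothetical micro-benchmark: average cost of a zero-timeout
 * epoll_wait() on an eventpoll with no ready events. */
#include <stdio.h>
#include <sys/epoll.h>
#include <time.h>

int main(void)
{
	const long iters = 1000000;
	struct epoll_event ev;
	struct timespec a, b;
	int epfd = epoll_create1(0);

	if (epfd < 0)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (long i = 0; i < iters; i++)
		epoll_wait(epfd, &ev, 1, 0);	/* nothing ready, zero timeout */
	clock_gettime(CLOCK_MONOTONIC, &b);

	long long ns = (long long)(b.tv_sec - a.tv_sec) * 1000000000LL +
		       (b.tv_nsec - a.tv_nsec);
	printf("avg zero-timeout epoll_wait(): %lld ns\n", ns / iters);
	return 0;
}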