Linux filesystem development
* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
       [not found]     ` <20250718085948.3xXGcxeQ@linutronix.de>
@ 2026-04-29  6:54       ` Christian Brauner
  2026-04-29  7:27         ` Nam Cao
  2026-05-04 12:00         ` David Laight
  0 siblings, 2 replies; 5+ messages in thread
From: Christian Brauner @ 2026-04-29  6:54 UTC (permalink / raw)
  To: Nam Cao
  Cc: Soheil Hassas Yeganeh, Alexander Viro, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Khazhismel Kumykov, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

On Fri, Jul 18, 2025 at 10:59:48AM +0200, Nam Cao wrote:
> On Fri, Jul 18, 2025 at 09:38:27AM +0100, Soheil Hassas Yeganeh wrote:
> > On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <namcao@linutronix.de> wrote:
> > >
> > > ep_events_available() checks for available events by looking at ep->rdllist
> > > and ep->ovflist. However, this is done without a lock, so the returned
> > > value is not reliable: both checks on ep->rdllist and ep->ovflist can be
> > > false while ep_start_scan() or ep_done_scan() is executing on another
> > > CPU, even though events are available.
> > >
> > > This bug can be observed by:
> > >
> > >   1. Create an eventpoll with at least one ready level-triggered event
> > >
> > >   2. Create multiple threads that call epoll_wait() with a zero timeout.
> > >      The threads do not consume the events, so every epoll_wait() should
> > >      return at least one event.
> > >
> > > If one thread is executing ep_events_available() while another thread is
> > > executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> > > return no event for the former thread.
> > 
> > That is the whole point of epoll_wait() with a zero timeout: we want to
> > opportunistically poll without much overhead, at the cost of more false
> > negatives. A caller that polls with a zero timeout should retry later,
> > and will at some point observe the event.
> 
> Is this a documented behavior that users expect? I do not see this in the
> man page.

The selftests rely on the behavior that timeout=0 sees events from a
concurrently running producer. They would fail at a much higher rate
after this change - believe me, I had a similar patch that changed
something in this area. I would explore the seqcount approach that
Mateusz suggested, tbh.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-04-29  6:54       ` [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative Christian Brauner
@ 2026-04-29  7:27         ` Nam Cao
  2026-04-29 15:34           ` Mateusz Guzik
  2026-05-04 12:00         ` David Laight
  1 sibling, 1 reply; 5+ messages in thread
From: Nam Cao @ 2026-04-29  7:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Soheil Hassas Yeganeh, Alexander Viro, Jan Kara, Shuah Khan,
	Davidlohr Bueso, Khazhismel Kumykov, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

Christian Brauner <brauner@kernel.org> writes:
> The selftests rely on the behavior that timeout=0 sees events from a
> concurrently running producer. They would fail at a much higher rate
> after this change - believe me, I had a similar patch that changed
> something in this area.

Huh, that's interesting. Do you still remember which selftest cases rely
on this behavior? I would like to study them further.

> I would explore the seqcount that Mateusz suggested tbh.

I will investigate that.

Nam


* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-04-29  7:27         ` Nam Cao
@ 2026-04-29 15:34           ` Mateusz Guzik
  2026-05-03 13:24             ` Nam Cao
  0 siblings, 1 reply; 5+ messages in thread
From: Mateusz Guzik @ 2026-04-29 15:34 UTC (permalink / raw)
  To: Nam Cao
  Cc: Christian Brauner, Soheil Hassas Yeganeh, Alexander Viro,
	Jan Kara, Shuah Khan, Davidlohr Bueso, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

On Wed, Apr 29, 2026 at 09:27:59AM +0200, Nam Cao wrote:
> Christian Brauner <brauner@kernel.org> writes:
> > The selftests rely on the behavior that timeout=0 sees events from a
> > concurrently running producer. They would fail at a much higher rate
> > after this change - believe me, I had a similar patch that changed
> > something in this area.
> 
> Huh, that's interesting. Do you still remember which selftest cases rely
> on this behavior? I would like to study them further.
> 
> > I would explore the seqcount that Mateusz suggested tbh.
> 
> I will investigate that.
> 

In the meantime I grew fond of another approach: have the write side
re-calculate the condition that the unlocked side checks for.

While the seqcount thing solves the scalability problem, it still
requires fences, which are not free on arm.

The goal would be to make it so that this:

static inline int ep_events_available(struct eventpoll *ep)
{
	return !list_empty_careful(&ep->rdllist) ||
		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
}

can be converted into:

static inline int ep_events_available(struct eventpoll *ep)
{
	return ep->has_events;
}

This in turn means that any code path that touches either rdllist or
ovflist needs to re-calculate the flag before it ends up unlocking.

Strictly speaking this is more error-prone than the seqcount approach,
but it should be faster on weakly-ordered architectures thanks to the
avoided fences.

I'm definitely not going to protest the seqcount route.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-04-29 15:34           ` Mateusz Guzik
@ 2026-05-03 13:24             ` Nam Cao
  0 siblings, 0 replies; 5+ messages in thread
From: Nam Cao @ 2026-05-03 13:24 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Christian Brauner, Soheil Hassas Yeganeh, Alexander Viro,
	Jan Kara, Shuah Khan, Davidlohr Bueso, Khazhismel Kumykov,
	Willem de Bruijn, Eric Dumazet, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-kselftest, stable

Mateusz Guzik <mjguzik@gmail.com> writes:
> Strictly speaking this is more error-prone than the seqcount approach,
> but it should be faster on weakly-ordered architectures thanks to the
> avoided fences.
>
> I'm definitely not going to protest the seqcount route.

Linus probably wouldn't be thrilled if I broke epoll again, so let's
stick with the simpler seqcount route.

Nam

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a3090b446af1..22c3f0186476 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -38,6 +38,7 @@
 #include <linux/compat.h>
 #include <linux/rculist.h>
 #include <linux/capability.h>
+#include <linux/seqlock.h>
 #include <net/busy_poll.h>
 
 /*
@@ -190,6 +191,9 @@ struct eventpoll {
 	/* Lock which protects rdllist and ovflist */
 	spinlock_t lock;
 
+	/* Protect switching between rdllist and ovflist */
+	seqcount_spinlock_t seq;
+
 	/* RB tree root used to store monitored fd structs */
 	struct rb_root_cached rbr;
 
@@ -382,8 +386,17 @@ static inline struct epitem *ep_item_from_wait(wait_queue_entry_t *p)
  */
 static inline int ep_events_available(struct eventpoll *ep)
 {
-	return !list_empty_careful(&ep->rdllist) ||
-		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+	bool events_available;
+	unsigned int seq;
+
+	do {
+		seq = read_seqcount_begin(&ep->seq);
+
+		events_available = !list_empty_careful(&ep->rdllist) ||
+				   READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+	} while (read_seqcount_retry(&ep->seq, seq));
+
+	return events_available;
 }
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -735,8 +748,12 @@ static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
 	 */
 	lockdep_assert_irqs_enabled();
 	spin_lock_irq(&ep->lock);
+	write_seqcount_begin(&ep->seq);
+
 	list_splice_init(&ep->rdllist, txlist);
 	WRITE_ONCE(ep->ovflist, NULL);
+
+	write_seqcount_end(&ep->seq);
 	spin_unlock_irq(&ep->lock);
 }
 
@@ -768,6 +785,9 @@ static void ep_done_scan(struct eventpoll *ep,
 			ep_pm_stay_awake(epi);
 		}
 	}
+
+	write_seqcount_begin(&ep->seq);
+
 	/*
 	 * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
 	 * releasing the lock, events will be queued in the normal way inside
@@ -779,6 +799,9 @@ static void ep_done_scan(struct eventpoll *ep,
 	 * Quickly re-inject items left on "txlist".
 	 */
 	list_splice(txlist, &ep->rdllist);
+
+	write_seqcount_end(&ep->seq);
+
 	__pm_relax(ep->ws);
 
 	if (!list_empty(&ep->rdllist)) {
@@ -1155,6 +1178,7 @@ static int ep_alloc(struct eventpoll **pep)
 
 	mutex_init(&ep->mtx);
 	spin_lock_init(&ep->lock);
+	seqcount_spinlock_init(&ep->seq, &ep->lock);
 	init_waitqueue_head(&ep->wq);
 	init_waitqueue_head(&ep->poll_wait);
 	INIT_LIST_HEAD(&ep->rdllist);


* Re: [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative
  2026-04-29  6:54       ` [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative Christian Brauner
  2026-04-29  7:27         ` Nam Cao
@ 2026-05-04 12:00         ` David Laight
  1 sibling, 0 replies; 5+ messages in thread
From: David Laight @ 2026-05-04 12:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Nam Cao, Soheil Hassas Yeganeh, Alexander Viro, Jan Kara,
	Shuah Khan, Davidlohr Bueso, Khazhismel Kumykov, Willem de Bruijn,
	Eric Dumazet, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-kselftest, stable

On Wed, 29 Apr 2026 08:54:06 +0200
Christian Brauner <brauner@kernel.org> wrote:

> On Fri, Jul 18, 2025 at 10:59:48AM +0200, Nam Cao wrote:
> > On Fri, Jul 18, 2025 at 09:38:27AM +0100, Soheil Hassas Yeganeh wrote:  
> > > On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <namcao@linutronix.de> wrote:  
> > > >
> > > > ep_events_available() checks for available events by looking at ep->rdllist
> > > > and ep->ovflist. However, this is done without a lock, so the returned
> > > > value is not reliable: both checks on ep->rdllist and ep->ovflist can be
> > > > false while ep_start_scan() or ep_done_scan() is executing on another
> > > > CPU, even though events are available.
> > > >
> > > > This bug can be observed by:
> > > >
> > > >   1. Create an eventpoll with at least one ready level-triggered event
> > > >
> > > >   2. Create multiple threads that call epoll_wait() with a zero timeout.
> > > >      The threads do not consume the events, so every epoll_wait() should
> > > >      return at least one event.
> > > >
> > > > If one thread is executing ep_events_available() while another thread is
> > > > executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> > > > return no event for the former thread.  
> > > 
> > > That is the whole point of epoll_wait() with a zero timeout: we want to
> > > opportunistically poll without much overhead, at the cost of more false
> > > negatives. A caller that polls with a zero timeout should retry later,
> > > and will at some point observe the event.
> > 
> > Is this a documented behavior that users expect? I do not see this in the
> > man page.  
> 
> The selftests rely on the behavior that timeout=0 sees events from a
> concurrently running producer. They would fail at a much higher rate
> after this change - believe me, I had a similar patch that changed
> something in this area. I would explore the seqcount approach that
> Mateusz suggested, tbh.
> 

Does this scenario really affect any real programs?
It doesn't make sense to have multiple threads looking for level-triggered
events on a single epoll fd.
When epoll returns an event you usually need to do a read (or similar) on
the associated file descriptor before calling epoll again.

To split the epoll processing between multiple threads you need lots of
epoll fds with the underlying fds distributed between them, and you have
the threads process the epoll fds sequentially (e.g. by putting the fds
in an array and using an atomic increment of a global array index to
pick the next epoll fd to process).

-- David


end of thread, other threads:[~2026-05-04 12:00 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1752824628.git.namcao@linutronix.de>
     [not found] ` <43d64ad765e2c47e958f01246320359b11379466.1752824628.git.namcao@linutronix.de>
     [not found]   ` <CACSApvZT5F4F36jLWEc5v_AbqZVQpmH1W7UK21tB9nPu=OtmBA@mail.gmail.com>
     [not found]     ` <20250718085948.3xXGcxeQ@linutronix.de>
2026-04-29  6:54       ` [PATCH 2/2] eventpoll: Fix epoll_wait() report false negative Christian Brauner
2026-04-29  7:27         ` Nam Cao
2026-04-29 15:34           ` Mateusz Guzik
2026-05-03 13:24             ` Nam Cao
2026-05-04 12:00         ` David Laight
