* [PATCH v4 0/1] eventpoll: Fix priority inversion problem
@ 2025-07-15 12:46 Nam Cao
  2025-07-15 12:46 ` [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock Nam Cao
  2025-09-05 13:52 ` [PATCH v4 0/1] eventpoll: Fix priority inversion problem Christian Brauner
  0 siblings, 2 replies; 10+ messages in thread
From: Nam Cao @ 2025-07-15 12:46 UTC (permalink / raw)
  To: Christian Brauner, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe
  Cc: Nam Cao

Hi,

This v4 is the follow-up to v3 at:

  https://lore.kernel.org/linux-fsdevel/20250527090836.1290532-1-namcao@linutronix.de/

which resolves a priority inversion problem. The v3 patch was merged, but
was then reverted due to a regression.

The direction of v3 was wrong in the first place: it made eventpoll's
event list lockless, which made the code harder to read. I stared at the
patch again, but still couldn't figure out what the bug was. The
performance numbers with the lockless list were indeed impressive, but
they come from a benchmark, and it is unclear whether they reflect any
real workload.

This v4 takes a completely different approach: it converts the rwlock to
a spinlock. Unfortunately, unlike an rwlock, a spinlock does not allow
concurrent readers, so this patch reduces the benchmark numbers.

I have some optimization tricks to reduce spinlock contention and bring
the numbers back, but Linus appeared and declared that epoll's
performance shouldn't be the priority, so I decided not to post those
optimization patches. For those who are curious, they are at:

  git@github.com:covanam/linux.git epoll_optimize

Be warned that they have not been well tested.

The regression with v3 has made me paranoid, so I made this v4 as obvious
as I could.
Nam Cao (1):
  eventpoll: Replace rwlock with spinlock

 fs/eventpoll.c | 139 +++++++++---------------------------------------
 1 file changed, 26 insertions(+), 113 deletions(-)

-- 
2.39.5
* [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-15 12:46 [PATCH v4 0/1] eventpoll: Fix priority inversion problem Nam Cao
@ 2025-07-15 12:46 ` Nam Cao
  2025-07-15 12:58   ` Nam Cao
                     ` (3 more replies)
  2025-09-05 13:52 ` [PATCH v4 0/1] eventpoll: Fix priority inversion problem Christian Brauner
  1 sibling, 4 replies; 10+ messages in thread
From: Nam Cao @ 2025-07-15 12:46 UTC (permalink / raw)
  To: Christian Brauner, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe
  Cc: Nam Cao, stable

The ready event list of an epoll object is protected by a read-write lock
(rwlock_t):

  - The consumer (waiter) acquires the write lock and takes items.
  - The producer (waker) takes the read lock and adds items.

The point of this design is to let epoll scale well with a large number
of producers, as multiple producers can hold the read lock at the same
time.

Unfortunately, this implementation can cause a scheduling priority
inversion. Suppose the consumer has higher scheduling priority than the
producer. The consumer needs to acquire the write lock, but may be
blocked by a producer holding the read lock. Since rwlocks do not support
priority boosting for readers (even with CONFIG_PREEMPT_RT=y), we have a
priority inversion: a higher-priority consumer is blocked by a
lower-priority producer. This problem was reported in [1].

Furthermore, this can also cause a stall, as described in [2].

Fix this problem by replacing the rwlock with a spinlock.

This reduces the event bandwidth, as the producers now have to contend
with each other for the spinlock.
According to the benchmark from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c:

On 12 x86 CPUs:
          Before     After      Diff
  threads events/ms  events/ms
        8      7162       4956   -31%
       16      8733       5383   -38%
       32      7968       5572   -30%
       64     10652       5739   -46%
      128     11236       5931   -47%

On 4 riscv CPUs:
          Before     After      Diff
  threads events/ms  events/ms
        8      2958       2833    -4%
       16      3323       3097    -7%
       32      3451       3240    -6%
       64      3554       3178   -11%
      128      3601       3235   -10%

Although the numbers look bad, it should be noted that this benchmark
creates multiple threads that do nothing except constantly generate new
epoll events, so contention on the spinlock is high. For real workloads,
the event rate is likely much lower, and the performance drop is not as
bad.

Using another benchmark (perf bench epoll wait) where spinlock contention
is lower, an improvement is even observed on x86:

On 12 x86 CPUs:
  Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8
  After:  Averaged 114577 operations/sec (+- 2.25%), total secs = 8

On 4 riscv CPUs:
  Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8
  After:  Averaged 167396 operations/sec (+- 0.23%), total secs = 8

In conclusion, no one is likely to be upset over this change. After all,
a spinlock was used originally for years, and the commit which converted
to the rwlock didn't mention a real workload, just that the benchmark
numbers are nice.

This patch is not exactly a revert of commit a218cc491420 ("epoll: use
rwlock in order to reduce ep_poll_callback() contention"), because a git
revert produces conflicts in some places whose resolution is not obvious.
This patch is intended to be backported, so it goes with the obvious
approach:

  - Replace rwlock_t with spinlock_t one to one.

  - Delete list_add_tail_lockless() and chain_epi_lockless(). These were
    introduced to allow producers to concurrently add items to the list.
    But now that the spinlock no longer allows producers to touch the
    event list concurrently, these two functions are not necessary
    anymore.
Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
---
 fs/eventpoll.c | 139 +++++++++---------------------------------------
 1 file changed, 26 insertions(+), 113 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 0fbf5dfedb24..a171f7e7dacc 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -46,10 +46,10 @@
  *
  * 1) epnested_mutex (mutex)
  * 2) ep->mtx (mutex)
- * 3) ep->lock (rwlock)
+ * 3) ep->lock (spinlock)
  *
  * The acquire order is the one listed above, from 1 to 3.
- * We need a rwlock (ep->lock) because we manipulate objects
+ * We need a spinlock (ep->lock) because we manipulate objects
  * from inside the poll callback, that might be triggered from
  * a wake_up() that in turn might be called from IRQ context.
  * So we can't sleep inside the poll callback and hence we need
@@ -195,7 +195,7 @@ struct eventpoll {
 	struct list_head rdllist;
 
 	/* Lock which protects rdllist and ovflist */
-	rwlock_t lock;
+	spinlock_t lock;
 
 	/* RB tree root used to store monitored fd structs */
 	struct rb_root_cached rbr;
@@ -740,10 +740,10 @@ static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
 	 * in a lockless way.
 	 */
 	lockdep_assert_irqs_enabled();
-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
 	list_splice_init(&ep->rdllist, txlist);
 	WRITE_ONCE(ep->ovflist, NULL);
-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);
 }
 
 static void ep_done_scan(struct eventpoll *ep,
@@ -751,7 +751,7 @@ static void ep_done_scan(struct eventpoll *ep,
 {
 	struct epitem *epi, *nepi;
 
-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
 	/*
 	 * During the time we spent inside the "sproc" callback, some
 	 * other events might have been queued by the poll callback.
@@ -792,7 +792,7 @@ static void ep_done_scan(struct eventpoll *ep,
 		wake_up(&ep->wq);
 	}
 
-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);
 }
 
 static void ep_get(struct eventpoll *ep)
@@ -867,10 +867,10 @@ static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
 
 	rb_erase_cached(&epi->rbn, &ep->rbr);
 
-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
 	if (ep_is_linked(epi))
 		list_del_init(&epi->rdllink);
-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);
 
 	wakeup_source_unregister(ep_wakeup_source(epi));
 	/*
@@ -1151,7 +1151,7 @@ static int ep_alloc(struct eventpoll **pep)
 		return -ENOMEM;
 
 	mutex_init(&ep->mtx);
-	rwlock_init(&ep->lock);
+	spin_lock_init(&ep->lock);
 	init_waitqueue_head(&ep->wq);
 	init_waitqueue_head(&ep->poll_wait);
 	INIT_LIST_HEAD(&ep->rdllist);
@@ -1238,100 +1238,10 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
 }
 #endif /* CONFIG_KCMP */
 
-/*
- * Adds a new entry to the tail of the list in a lockless way, i.e.
- * multiple CPUs are allowed to call this function concurrently.
- *
- * Beware: it is necessary to prevent any other modifications of the
- * existing list until all changes are completed, in other words
- * concurrent list_add_tail_lockless() calls should be protected
- * with a read lock, where write lock acts as a barrier which
- * makes sure all list_add_tail_lockless() calls are fully
- * completed.
- *
- * Also an element can be locklessly added to the list only in one
- * direction i.e. either to the tail or to the head, otherwise
- * concurrent access will corrupt the list.
- *
- * Return: %false if element has been already added to the list, %true
- * otherwise.
- */
-static inline bool list_add_tail_lockless(struct list_head *new,
-					  struct list_head *head)
-{
-	struct list_head *prev;
-
-	/*
-	 * This is simple 'new->next = head' operation, but cmpxchg()
-	 * is used in order to detect that same element has been just
-	 * added to the list from another CPU: the winner observes
-	 * new->next == new.
-	 */
-	if (!try_cmpxchg(&new->next, &new, head))
-		return false;
-
-	/*
-	 * Initially ->next of a new element must be updated with the head
-	 * (we are inserting to the tail) and only then pointers are atomically
-	 * exchanged. XCHG guarantees memory ordering, thus ->next should be
-	 * updated before pointers are actually swapped and pointers are
-	 * swapped before prev->next is updated.
-	 */
-
-	prev = xchg(&head->prev, new);
-
-	/*
-	 * It is safe to modify prev->next and new->prev, because a new element
-	 * is added only to the tail and new->next is updated before XCHG.
-	 */
-
-	prev->next = new;
-	new->prev = prev;
-
-	return true;
-}
-
-/*
- * Chains a new epi entry to the tail of the ep->ovflist in a lockless way,
- * i.e. multiple CPUs are allowed to call this function concurrently.
- *
- * Return: %false if epi element has been already chained, %true otherwise.
- */
-static inline bool chain_epi_lockless(struct epitem *epi)
-{
-	struct eventpoll *ep = epi->ep;
-
-	/* Fast preliminary check */
-	if (epi->next != EP_UNACTIVE_PTR)
-		return false;
-
-	/* Check that the same epi has not been just chained from another CPU */
-	if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
-		return false;
-
-	/* Atomically exchange tail */
-	epi->next = xchg(&ep->ovflist, epi);
-
-	return true;
-}
-
 /*
  * This is the callback that is passed to the wait queue wakeup
  * mechanism. It is called by the stored file descriptors when they
  * have events to report.
- *
- * This callback takes a read lock in order not to contend with concurrent
- * events from another file descriptor, thus all modifications to ->rdllist
- * or ->ovflist are lockless. Read lock is paired with the write lock from
- * ep_start/done_scan(), which stops all list modifications and guarantees
- * that lists state is seen correctly.
- *
- * Another thing worth to mention is that ep_poll_callback() can be called
- * concurrently for the same @epi from different CPUs if poll table was inited
- * with several wait queues entries. Plural wakeup from different CPUs of a
- * single wait queue is serialized by wq.lock, but the case when multiple wait
- * queues are used should be detected accordingly. This is detected using
- * cmpxchg() operation.
  */
 static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 {
@@ -1342,7 +1252,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	unsigned long flags;
 	int ewake = 0;
 
-	read_lock_irqsave(&ep->lock, flags);
+	spin_lock_irqsave(&ep->lock, flags);
 
 	ep_set_busy_poll_napi_id(epi);
 
@@ -1371,12 +1281,15 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * chained in ep->ovflist and requeued later on.
 	 */
 	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
-		if (chain_epi_lockless(epi))
+		if (epi->next == EP_UNACTIVE_PTR) {
+			epi->next = READ_ONCE(ep->ovflist);
+			WRITE_ONCE(ep->ovflist, epi);
 			ep_pm_stay_awake_rcu(epi);
+		}
 	} else if (!ep_is_linked(epi)) {
 		/* In the usual case, add event to ready list. */
-		if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
-			ep_pm_stay_awake_rcu(epi);
+		list_add_tail(&epi->rdllink, &ep->rdllist);
+		ep_pm_stay_awake_rcu(epi);
 	}
 
 	/*
@@ -1409,7 +1322,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 		pwake++;
 
 out_unlock:
-	read_unlock_irqrestore(&ep->lock, flags);
+	spin_unlock_irqrestore(&ep->lock, flags);
 
 	/* We have to call this outside the lock */
 	if (pwake)
@@ -1744,7 +1657,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	}
 
 	/* We have to drop the new item inside our item list to keep track of it */
-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
 
 	/* record NAPI ID of new item if present */
 	ep_set_busy_poll_napi_id(epi);
@@ -1761,7 +1674,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 		pwake++;
 	}
 
-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);
 
 	/* We have to call this outside the lock */
 	if (pwake)
@@ -1825,7 +1738,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 * list, push it inside.
 	 */
 	if (ep_item_poll(epi, &pt, 1)) {
-		write_lock_irq(&ep->lock);
+		spin_lock_irq(&ep->lock);
 		if (!ep_is_linked(epi)) {
 			list_add_tail(&epi->rdllink, &ep->rdllist);
 			ep_pm_stay_awake(epi);
@@ -1836,7 +1749,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 			if (waitqueue_active(&ep->poll_wait))
 				pwake++;
 		}
-		write_unlock_irq(&ep->lock);
+		spin_unlock_irq(&ep->lock);
 	}
 
 	/* We have to call this outside the lock */
@@ -2088,7 +2001,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		init_wait(&wait);
 		wait.func = ep_autoremove_wake_function;
 
-		write_lock_irq(&ep->lock);
+		spin_lock_irq(&ep->lock);
 		/*
 		 * Barrierless variant, waitqueue_active() is called under
 		 * the same lock on wakeup ep_poll_callback() side, so it
@@ -2107,7 +2020,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		if (!eavail)
 			__add_wait_queue_exclusive(&ep->wq, &wait);
 
-		write_unlock_irq(&ep->lock);
+		spin_unlock_irq(&ep->lock);
 
 		if (!eavail)
 			timed_out = !ep_schedule_timeout(to) ||
@@ -2123,7 +2036,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			eavail = 1;
 
 		if (!list_empty_careful(&wait.entry)) {
-			write_lock_irq(&ep->lock);
+			spin_lock_irq(&ep->lock);
 			/*
 			 * If the thread timed out and is not on the wait queue,
 			 * it means that the thread was woken up after its
@@ -2134,7 +2047,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			if (timed_out)
 				eavail = list_empty(&wait.entry);
 			__remove_wait_queue(&ep->wq, &wait);
-			write_unlock_irq(&ep->lock);
+			spin_unlock_irq(&ep->lock);
 		}
 	}
 }
-- 
2.39.5
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-15 12:58 ` Nam Cao
From: Nam Cao @ 2025-07-15 12:58 UTC (permalink / raw)
  To: Christian Brauner, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe
  Cc: stable

On Tue, Jul 15, 2025 at 02:46:34PM +0200, Nam Cao wrote:
> The ready event list of an epoll object is protected by read-write
> semaphore:
[...]
> Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
> Signed-off-by: Nam Cao <namcao@linutronix.de>
> Cc: stable@vger.kernel.org

I forgot to add:

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider <vschneid@redhat.com>
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2]

Christian, do you mind adding those for me, if/when you apply the patch?

Nam
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-15 16:42 ` Linus Torvalds
From: Linus Torvalds @ 2025-07-15 16:42 UTC (permalink / raw)
  To: Nam Cao
  Cc: Christian Brauner, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe, stable

On Tue, 15 Jul 2025 at 05:47, Nam Cao <namcao@linutronix.de> wrote:
>
>  fs/eventpoll.c | 139 +++++++++---------------------------------
>  1 file changed, 26 insertions(+), 113 deletions(-)

Yeah, this is more like the kind of diffstat I like to see for
eventpoll. Plus it makes things fundamentally simpler.

It might be worth looking at ep_poll_callback() - the only case that had
read_lock_irqsave() - and seeing if perhaps some of the tests inside the
lock might be done optimistically, or delayed to after the lock.

For example, the whole wakeup sequence looks like it should be done
*after* the ep->lock has been released, because it uses its own lock (ie
the

        if (sync)
                wake_up_sync(&ep->wq);
        else
                wake_up(&ep->wq);

thing uses the wq lock, and there is nothing that ep->lock protects
there as far as I can tell.

So I think this has some potential for _simple_ optimizations, but I'm
not sure how worth it it is.

           Thanks,
               Linus
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-16  7:41 ` Nam Cao
From: Nam Cao @ 2025-07-16 7:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe, stable

On Tue, Jul 15, 2025 at 09:42:52AM -0700, Linus Torvalds wrote:
[...]
> For example, the whole wakeup sequence looks like it should be done
> *after* the ep->lock has been released, because it uses its own lock
> (ie the
>
>         if (sync)
>                 wake_up_sync(&ep->wq);
>         else
>                 wake_up(&ep->wq);
>
> thing uses the wq lock, and there is nothing that ep->lock protects
> there as far as I can tell.
>
> So I think this has some potential for _simple_ optimizations, but I'm
> not sure how worth it it is.

Actually, ep->lock does protect this part, because ep_poll() touches
ep->wq without holding the waitqueue's lock.

We could do something like the diff below. But things are not so simple
anymore.
Best regards,
Nam

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a171f7e7dacc..5e6c7da606e7 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1252,8 +1252,6 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	unsigned long flags;
 	int ewake = 0;
 
-	spin_lock_irqsave(&ep->lock, flags);
-
 	ep_set_busy_poll_napi_id(epi);
 
 	/*
@@ -1263,7 +1261,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * until the next EPOLL_CTL_MOD will be issued.
 	 */
 	if (!(epi->event.events & ~EP_PRIVATE_BITS))
-		goto out_unlock;
+		goto out;
 
 	/*
 	 * Check the events coming with the callback. At this stage, not
@@ -1272,7 +1270,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * test for "key" != NULL before the event match test.
 	 */
 	if (pollflags && !(pollflags & epi->event.events))
-		goto out_unlock;
+		goto out;
 
 	/*
 	 * If we are transferring events to userspace, we can hold no locks
@@ -1280,6 +1278,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * semantics). All the events that happen during that period of time are
 	 * chained in ep->ovflist and requeued later on.
 	 */
+	spin_lock_irqsave(&ep->lock, flags);
+
 	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
 		if (epi->next == EP_UNACTIVE_PTR) {
 			epi->next = READ_ONCE(ep->ovflist);
@@ -1292,9 +1292,14 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 		ep_pm_stay_awake_rcu(epi);
 	}
 
+	spin_unlock_irqrestore(&ep->lock, flags);
+
+
 	/*
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
+	 *
+	 * Memory barrier for waitqueue_active() from spin_unlock_irqrestore().
 	 */
 	if (waitqueue_active(&ep->wq)) {
 		if ((epi->event.events & EPOLLEXCLUSIVE) &&
@@ -1321,9 +1326,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
-out_unlock:
-	spin_unlock_irqrestore(&ep->lock, flags);
-
+out:
 	/* We have to call this outside the lock */
 	if (pwake)
 		ep_poll_safewake(ep, epi, pollflags & EPOLL_URING_WAKE);
@@ -2001,13 +2004,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		init_wait(&wait);
 		wait.func = ep_autoremove_wake_function;
 
-		spin_lock_irq(&ep->lock);
-		/*
-		 * Barrierless variant, waitqueue_active() is called under
-		 * the same lock on wakeup ep_poll_callback() side, so it
-		 * is safe to avoid an explicit barrier.
-		 */
-		__set_current_state(TASK_INTERRUPTIBLE);
+		prepare_to_wait_exclusive(&ep->wq, &wait, TASK_INTERRUPTIBLE);
 
 		/*
 		 * Do the final check under the lock. ep_start/done_scan()
@@ -2016,10 +2013,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * period of time although events are pending, so lock is
 		 * important.
 		 */
+		spin_lock_irq(&ep->lock);
 		eavail = ep_events_available(ep);
-		if (!eavail)
-			__add_wait_queue_exclusive(&ep->wq, &wait);
-
 		spin_unlock_irq(&ep->lock);
 
 		if (!eavail)
@@ -2036,7 +2031,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			eavail = 1;
 
 		if (!list_empty_careful(&wait.entry)) {
-			spin_lock_irq(&ep->lock);
+			spin_lock_irq(&ep->wq.lock);
 			/*
 			 * If the thread timed out and is not on the wait queue,
 			 * it means that the thread was woken up after its
@@ -2047,7 +2042,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			if (timed_out)
 				eavail = list_empty(&wait.entry);
 			__remove_wait_queue(&ep->wq, &wait);
-			spin_unlock_irq(&ep->lock);
+			spin_unlock_irq(&ep->wq.lock);
 		}
 	}
 }
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-16  8:34 ` K Prateek Nayak
From: K Prateek Nayak @ 2025-07-16 8:34 UTC (permalink / raw)
  To: Nam Cao, Christian Brauner, Linus Torvalds, Xi Ruoyao,
	Frederic Weisbecker, Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, Clark Williams,
	Steven Rostedt, linux-fsdevel, linux-kernel, linux-rt-devel,
	linux-rt-users, Joe Damato, Martin Karsten, Jens Axboe
  Cc: stable

Hello Nam,

On 7/15/2025 6:16 PM, Nam Cao wrote:
> The ready event list of an epoll object is protected by read-write
> semaphore:
[...]
> Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
> Signed-off-by: Nam Cao <namcao@linutronix.de>
> Cc: stable@vger.kernel.org

Tested this version with the reproducer that Jan shared in [1] on top of
tip:sched/core (PREEMPT_RT) and I didn't run into any RCU stalls with
your patch applied on top (the VM has been running the repro for over an
hour now and is still responsive). Feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

[1] https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/

> ---
>  fs/eventpoll.c | 139 +++++++++---------------------------------
>  1 file changed, 26 insertions(+), 113 deletions(-)

--
Thanks and Regards,
Prateek
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-15 12:46 ` [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock Nam Cao
                     ` (2 preceding siblings ...)
  2025-07-16  8:34 ` K Prateek Nayak
@ 2025-08-26  8:43 ` Nam Cao
  2025-09-03  8:40   ` Sebastian Andrzej Siewior
  3 siblings, 1 reply; 10+ messages in thread
From: Nam Cao @ 2025-08-26  8:43 UTC (permalink / raw)
  To: Christian Brauner, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe
  Cc: stable

On Tue, Jul 15, 2025 at 02:46:34PM +0200, Nam Cao wrote:
> The ready event list of an epoll object is protected by a read-write
> semaphore:
>
> - The consumer (waiter) acquires the write lock and takes items.
> - The producer (waker) takes the read lock and adds items.
>
> The point of this design is to let epoll scale well with a large number
> of producers, as multiple producers can hold the read lock at the same
> time.
>
> Unfortunately, this implementation may cause a scheduling priority
> inversion problem. Suppose the consumer has higher scheduling priority
> than the producer. The consumer needs to acquire the write lock, but may
> be blocked by the producer holding the read lock. Since the read-write
> semaphore does not support priority boosting for readers (even with
> CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a
> higher-priority consumer is blocked by a lower-priority producer. This
> problem was reported in [1].
>
> Furthermore, this could also cause a stall problem, as described in [2].
>
> Fix this problem by replacing the rwlock with a spinlock.

Hi Christian,

May I know your plan with this patch? Are you still waiting for something?

You may still understandably be paranoid about epoll due to the last
regression. But it's been weeks, and this patch is quite simple, so I'm
starting to wonder whether it has been forgotten.

Nam
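[Editor's note] The locking scheme described in the patch message above can be sketched in userspace. This is an illustrative analogy, not the kernel code: a pthread mutex stands in for the kernel spinlock (in-kernel, ep_poll_callback() runs from wakeup context, so spin_lock_irqsave() is used), and all names below (run_demo, ready_item, etc.) are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

/*
 * Sketch of the scheme the patch moves to: one lock serializes both
 * producers (wakers) adding to the ready list and the consumer (waiter)
 * draining it. Producers can no longer run concurrently, which is the
 * trade-off the patch accepts in exchange for PI-friendly locking.
 */
struct ready_item {
	struct ready_item *next;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct ready_item *ready_head;

/* Producer side: conceptually what the wakeup callback does. */
static void *producer(void *arg)
{
	long n = (long)arg;

	for (long i = 0; i < n; i++) {
		struct ready_item *item = malloc(sizeof(*item));

		pthread_mutex_lock(&list_lock);	/* spin_lock_irqsave() in-kernel */
		item->next = ready_head;
		ready_head = item;
		pthread_mutex_unlock(&list_lock);
	}
	return NULL;
}

/* Consumer side: splice the whole ready list out under the same lock. */
static long consume_all(void)
{
	struct ready_item *head, *next;
	long drained = 0;

	pthread_mutex_lock(&list_lock);
	head = ready_head;
	ready_head = NULL;
	pthread_mutex_unlock(&list_lock);

	for (; head; head = next) {
		next = head->next;
		free(head);
		drained++;
	}
	return drained;
}

/* Run 4 producers adding 100 items each, then drain; returns items drained. */
long run_demo(void)
{
	pthread_t threads[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&threads[i], NULL, producer, (void *)100L);
	for (int i = 0; i < 4; i++)
		pthread_join(threads[i], NULL);

	return consume_all();
}
```

Under the old rwlock scheme, producers ran concurrently under the read lock but a higher-priority consumer could block behind them with no reader priority boosting; with a single lock, priority inheritance on PREEMPT_RT behaves as expected, at the cost of producer concurrency.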
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-08-26  8:43 ` Nam Cao
@ 2025-09-03  8:40   ` Sebastian Andrzej Siewior
  2025-09-05 13:52     ` Christian Brauner
  0 siblings, 1 reply; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-03  8:40 UTC (permalink / raw)
  To: Nam Cao, Christian Brauner
  Cc: Linus Torvalds, Xi Ruoyao, Frederic Weisbecker, Valentin Schneider,
	Alexander Viro, Jan Kara, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe, stable

On 2025-08-26 10:43:20 [+0200], Nam Cao wrote:
> On Tue, Jul 15, 2025 at 02:46:34PM +0200, Nam Cao wrote:
> > The ready event list of an epoll object is protected by a read-write
> > semaphore:
> >
> > - The consumer (waiter) acquires the write lock and takes items.
> > - The producer (waker) takes the read lock and adds items.
> >
> > The point of this design is to let epoll scale well with a large number
> > of producers, as multiple producers can hold the read lock at the same
> > time.
> >
> > Unfortunately, this implementation may cause a scheduling priority
> > inversion problem. Suppose the consumer has higher scheduling priority
> > than the producer. The consumer needs to acquire the write lock, but may
> > be blocked by the producer holding the read lock. Since the read-write
> > semaphore does not support priority boosting for readers (even with
> > CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a
> > higher-priority consumer is blocked by a lower-priority producer. This
> > problem was reported in [1].
> >
> > Furthermore, this could also cause a stall problem, as described in [2].
> >
> > Fix this problem by replacing the rwlock with a spinlock.
>
> Hi Christian,
>
> May I know your plan with this patch? Are you still waiting for something?
>
> You may still understandably be paranoid about epoll due to the last
> regression. But it's been weeks, and this patch is quite simple, so I'm
> starting to wonder whether it has been forgotten.

A friendly reminder.

> Nam

Sebastian
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-09-03  8:40 ` Sebastian Andrzej Siewior
@ 2025-09-05 13:52   ` Christian Brauner
  0 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2025-09-05 13:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Nam Cao, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara, John Ogness,
	K Prateek Nayak, Clark Williams, Steven Rostedt, linux-fsdevel,
	linux-kernel, linux-rt-devel, linux-rt-users, Joe Damato,
	Martin Karsten, Jens Axboe, stable

On Wed, Sep 03, 2025 at 10:40:12AM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-08-26 10:43:20 [+0200], Nam Cao wrote:
> > On Tue, Jul 15, 2025 at 02:46:34PM +0200, Nam Cao wrote:
> > > The ready event list of an epoll object is protected by a read-write
> > > semaphore:
> > >
> > > - The consumer (waiter) acquires the write lock and takes items.
> > > - The producer (waker) takes the read lock and adds items.
> > >
> > > The point of this design is to let epoll scale well with a large number
> > > of producers, as multiple producers can hold the read lock at the same
> > > time.
> > >
> > > Unfortunately, this implementation may cause a scheduling priority
> > > inversion problem. Suppose the consumer has higher scheduling priority
> > > than the producer. The consumer needs to acquire the write lock, but may
> > > be blocked by the producer holding the read lock. Since the read-write
> > > semaphore does not support priority boosting for readers (even with
> > > CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a
> > > higher-priority consumer is blocked by a lower-priority producer. This
> > > problem was reported in [1].
> > >
> > > Furthermore, this could also cause a stall problem, as described in [2].
> > >
> > > Fix this problem by replacing the rwlock with a spinlock.
> >
> > Hi Christian,
> >
> > May I know your plan with this patch? Are you still waiting for something?
> >
> > You may still understandably be paranoid about epoll due to the last
> > regression. But it's been weeks, and this patch is quite simple, so I'm
> > starting to wonder whether it has been forgotten.
>
> A friendly reminder.

Sorry, this apparently fell through the cracks. Taken care of now!
* Re: [PATCH v4 0/1] eventpoll: Fix priority inversion problem
  2025-07-15 12:46 [PATCH v4 0/1] eventpoll: Fix priority inversion problem Nam Cao
  2025-07-15 12:46 ` [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock Nam Cao
@ 2025-09-05 13:52 ` Christian Brauner
  1 sibling, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2025-09-05 13:52 UTC (permalink / raw)
  To: Nam Cao
  Cc: Christian Brauner, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe

On Tue, 15 Jul 2025 14:46:33 +0200, Nam Cao wrote:
> This v4 is the follow-up to v3 at:
> https://lore.kernel.org/linux-fsdevel/20250527090836.1290532-1-namcao@linutronix.de/
> which resolves a priority inversion problem.
>
> The v3 patch was merged, but then got reverted due to a regression.
>
> The direction of v3 was wrong in the first place. It changed
> eventpoll's event list to be lockless, making the code harder to read.
> I stared at the patch again, but still couldn't figure out what the
> bug is.
>
> [...]

Applied to the vfs-6.18.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to
rebase, trailer updates or similar. If in doubt, please check the
listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.misc

[1/1] eventpoll: Replace rwlock with spinlock
      https://git.kernel.org/vfs/vfs/c/0c43094f8cc9
end of thread, other threads: [~2025-09-05 13:52 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-15 12:46 [PATCH v4 0/1] eventpoll: Fix priority inversion problem Nam Cao
2025-07-15 12:46 ` [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock Nam Cao
2025-07-15 12:58   ` Nam Cao
2025-07-15 16:42   ` Linus Torvalds
2025-07-16  7:41     ` Nam Cao
2025-07-16  8:34   ` K Prateek Nayak
2025-08-26  8:43   ` Nam Cao
2025-09-03  8:40     ` Sebastian Andrzej Siewior
2025-09-05 13:52       ` Christian Brauner
2025-09-05 13:52 ` [PATCH v4 0/1] eventpoll: Fix priority inversion problem Christian Brauner