* [PATCH v4 0/1] eventpoll: Fix priority inversion problem
@ 2025-07-15 12:46 Nam Cao
  2025-07-15 12:46 ` [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock Nam Cao
  2025-09-05 13:52 ` [PATCH v4 0/1] eventpoll: Fix priority inversion problem Christian Brauner
  0 siblings, 2 replies; 10+ messages in thread
From: Nam Cao @ 2025-07-15 12:46 UTC (permalink / raw)
  To: Christian Brauner, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe
  Cc: Nam Cao

Hi,

This v4 is the follow-up to v3 at:

  https://lore.kernel.org/linux-fsdevel/20250527090836.1290532-1-namcao@linutronix.de/

which resolves a priority inversion problem. The v3 patch was merged, but
was then reverted due to a regression.

The direction of v3 was wrong in the first place: it made eventpoll's
event list lockless, which made the code harder to read. I stared at the
patch again, but still couldn't figure out what the bug was. The
performance numbers with the lockless list were indeed impressive, but
they come from a benchmark, and it is unclear whether they reflect any
real workload.

This v4 takes a completely different approach: it converts the rwlock to
a spinlock. Unfortunately, unlike an rwlock, a spinlock does not allow
concurrent readers, so this patch reduces the benchmark numbers.

I have some optimization tricks to reduce spinlock contention and bring
the numbers back, but Linus appeared and declared that epoll's
performance shouldn't be the priority, so I decided not to post those
optimization patches. For those who are curious, they are at:

  git@github.com:covanam/linux.git epoll_optimize

Be warned that they have not been well tested.

The regression with v3 has made me paranoid, so I made this v4 as obvious
as I could.
Nam Cao (1):
  eventpoll: Replace rwlock with spinlock

 fs/eventpoll.c | 139 +++++++++---------------------------------------
 1 file changed, 26 insertions(+), 113 deletions(-)

-- 
2.39.5
* [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-15 12:46 [PATCH v4 0/1] eventpoll: Fix priority inversion problem Nam Cao
@ 2025-07-15 12:46 ` Nam Cao
  2025-07-15 12:58   ` Nam Cao
                     ` (3 more replies)
  2025-09-05 13:52 ` [PATCH v4 0/1] eventpoll: Fix priority inversion problem Christian Brauner
  1 sibling, 4 replies; 10+ messages in thread
From: Nam Cao @ 2025-07-15 12:46 UTC (permalink / raw)
  To: Christian Brauner, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe
  Cc: Nam Cao, stable

The ready event list of an epoll object is protected by a read-write lock
(rwlock_t):

  - The consumer (waiter) acquires the write lock and takes items.
  - The producer (waker) takes the read lock and adds items.

The point of this design is to let epoll scale well with a large number
of producers, as multiple producers can hold the read lock at the same
time.

Unfortunately, this implementation can cause a scheduling priority
inversion. Suppose the consumer has higher scheduling priority than the
producer. The consumer needs to acquire the write lock, but may be
blocked by a producer holding the read lock. Since rwlocks do not support
priority boosting for readers (even with CONFIG_PREEMPT_RT=y), we have a
priority inversion: a higher-priority consumer is blocked by a
lower-priority producer. This problem was reported in [1].

Furthermore, this can also cause a stall, as described in [2].

Fix this problem by replacing the rwlock with a spinlock.

This reduces the event bandwidth, as the producers now have to contend
with each other for the spinlock.
According to the benchmark from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c:

On 12 x86 CPUs:
          Before     After      Diff
  threads events/ms  events/ms
        8      7162       4956   -31%
       16      8733       5383   -38%
       32      7968       5572   -30%
       64     10652       5739   -46%
      128     11236       5931   -47%

On 4 riscv CPUs:
          Before     After      Diff
  threads events/ms  events/ms
        8      2958       2833    -4%
       16      3323       3097    -7%
       32      3451       3240    -6%
       64      3554       3178   -11%
      128      3601       3235   -10%

Although the numbers look bad, it should be noted that this benchmark
creates multiple threads that do nothing except constantly generate new
epoll events, so contention on the spinlock is high. For real workloads,
the event rate is likely much lower, and the performance drop is not as
bad.

Using another benchmark (perf bench epoll wait) where spinlock contention
is lower, an improvement is even observed on x86:

On 12 x86 CPUs:
  Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8
  After:  Averaged 114577 operations/sec (+- 2.25%), total secs = 8

On 4 riscv CPUs:
  Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8
  After:  Averaged 167396 operations/sec (+- 0.23%), total secs = 8

In conclusion, no one is likely to be upset over this change. After all,
a spinlock was used originally for years, and the commit which converted
to the rwlock didn't mention a real workload, just that the benchmark
numbers are nice.

This patch is not exactly a revert of commit a218cc491420 ("epoll: use
rwlock in order to reduce ep_poll_callback() contention"), because a git
revert produces conflicts in some places whose resolution is not obvious.
This patch is intended to be backported, so it goes with the obvious
approach:

  - Replace rwlock_t with spinlock_t one to one.

  - Delete list_add_tail_lockless() and chain_epi_lockless(). These were
    introduced to allow producers to concurrently add items to the list.
    But now that the spinlock no longer allows producers to touch the
    event list concurrently, these two functions are not necessary
    anymore.
Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
---
 fs/eventpoll.c | 139 +++++++++---------------------------------------
 1 file changed, 26 insertions(+), 113 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 0fbf5dfedb24..a171f7e7dacc 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -46,10 +46,10 @@
  *
  * 1) epnested_mutex (mutex)
  * 2) ep->mtx (mutex)
- * 3) ep->lock (rwlock)
+ * 3) ep->lock (spinlock)
  *
  * The acquire order is the one listed above, from 1 to 3.
- * We need a rwlock (ep->lock) because we manipulate objects
+ * We need a spinlock (ep->lock) because we manipulate objects
  * from inside the poll callback, that might be triggered from
  * a wake_up() that in turn might be called from IRQ context.
  * So we can't sleep inside the poll callback and hence we need
@@ -195,7 +195,7 @@ struct eventpoll {
 	struct list_head rdllist;
 
 	/* Lock which protects rdllist and ovflist */
-	rwlock_t lock;
+	spinlock_t lock;
 
 	/* RB tree root used to store monitored fd structs */
 	struct rb_root_cached rbr;
@@ -740,10 +740,10 @@ static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
 	 * in a lockless way.
 	 */
 	lockdep_assert_irqs_enabled();
-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
 	list_splice_init(&ep->rdllist, txlist);
 	WRITE_ONCE(ep->ovflist, NULL);
-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);
 }
 
 static void ep_done_scan(struct eventpoll *ep,
@@ -751,7 +751,7 @@ static void ep_done_scan(struct eventpoll *ep,
 {
 	struct epitem *epi, *nepi;
 
-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
 	/*
 	 * During the time we spent inside the "sproc" callback, some
 	 * other events might have been queued by the poll callback.
@@ -792,7 +792,7 @@ static void ep_done_scan(struct eventpoll *ep,
 		wake_up(&ep->wq);
 	}
 
-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);
 }
 
 static void ep_get(struct eventpoll *ep)
@@ -867,10 +867,10 @@ static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
 
 	rb_erase_cached(&epi->rbn, &ep->rbr);
 
-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
 	if (ep_is_linked(epi))
 		list_del_init(&epi->rdllink);
-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);
 
 	wakeup_source_unregister(ep_wakeup_source(epi));
 	/*
@@ -1151,7 +1151,7 @@ static int ep_alloc(struct eventpoll **pep)
 		return -ENOMEM;
 
 	mutex_init(&ep->mtx);
-	rwlock_init(&ep->lock);
+	spin_lock_init(&ep->lock);
 	init_waitqueue_head(&ep->wq);
 	init_waitqueue_head(&ep->poll_wait);
 	INIT_LIST_HEAD(&ep->rdllist);
@@ -1238,100 +1238,10 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
 }
 #endif /* CONFIG_KCMP */
 
-/*
- * Adds a new entry to the tail of the list in a lockless way, i.e.
- * multiple CPUs are allowed to call this function concurrently.
- *
- * Beware: it is necessary to prevent any other modifications of the
- * existing list until all changes are completed, in other words
- * concurrent list_add_tail_lockless() calls should be protected
- * with a read lock, where write lock acts as a barrier which
- * makes sure all list_add_tail_lockless() calls are fully
- * completed.
- *
- * Also an element can be locklessly added to the list only in one
- * direction i.e. either to the tail or to the head, otherwise
- * concurrent access will corrupt the list.
- *
- * Return: %false if element has been already added to the list, %true
- * otherwise.
- */
-static inline bool list_add_tail_lockless(struct list_head *new,
-					  struct list_head *head)
-{
-	struct list_head *prev;
-
-	/*
-	 * This is simple 'new->next = head' operation, but cmpxchg()
-	 * is used in order to detect that same element has been just
-	 * added to the list from another CPU: the winner observes
-	 * new->next == new.
-	 */
-	if (!try_cmpxchg(&new->next, &new, head))
-		return false;
-
-	/*
-	 * Initially ->next of a new element must be updated with the head
-	 * (we are inserting to the tail) and only then pointers are atomically
-	 * exchanged. XCHG guarantees memory ordering, thus ->next should be
-	 * updated before pointers are actually swapped and pointers are
-	 * swapped before prev->next is updated.
-	 */
-
-	prev = xchg(&head->prev, new);
-
-	/*
-	 * It is safe to modify prev->next and new->prev, because a new element
-	 * is added only to the tail and new->next is updated before XCHG.
-	 */
-
-	prev->next = new;
-	new->prev = prev;
-
-	return true;
-}
-
-/*
- * Chains a new epi entry to the tail of the ep->ovflist in a lockless way,
- * i.e. multiple CPUs are allowed to call this function concurrently.
- *
- * Return: %false if epi element has been already chained, %true otherwise.
- */
-static inline bool chain_epi_lockless(struct epitem *epi)
-{
-	struct eventpoll *ep = epi->ep;
-
-	/* Fast preliminary check */
-	if (epi->next != EP_UNACTIVE_PTR)
-		return false;
-
-	/* Check that the same epi has not been just chained from another CPU */
-	if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
-		return false;
-
-	/* Atomically exchange tail */
-	epi->next = xchg(&ep->ovflist, epi);
-
-	return true;
-}
-
 /*
  * This is the callback that is passed to the wait queue wakeup
  * mechanism. It is called by the stored file descriptors when they
  * have events to report.
- *
- * This callback takes a read lock in order not to contend with concurrent
- * events from another file descriptor, thus all modifications to ->rdllist
- * or ->ovflist are lockless. Read lock is paired with the write lock from
- * ep_start/done_scan(), which stops all list modifications and guarantees
- * that lists state is seen correctly.
- *
- * Another thing worth to mention is that ep_poll_callback() can be called
- * concurrently for the same @epi from different CPUs if poll table was inited
- * with several wait queues entries. Plural wakeup from different CPUs of a
- * single wait queue is serialized by wq.lock, but the case when multiple wait
- * queues are used should be detected accordingly. This is detected using
- * cmpxchg() operation.
  */
 static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 {
@@ -1342,7 +1252,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	unsigned long flags;
 	int ewake = 0;
 
-	read_lock_irqsave(&ep->lock, flags);
+	spin_lock_irqsave(&ep->lock, flags);
 
 	ep_set_busy_poll_napi_id(epi);
 
@@ -1371,12 +1281,15 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * chained in ep->ovflist and requeued later on.
 	 */
 	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
-		if (chain_epi_lockless(epi))
+		if (epi->next == EP_UNACTIVE_PTR) {
+			epi->next = READ_ONCE(ep->ovflist);
+			WRITE_ONCE(ep->ovflist, epi);
 			ep_pm_stay_awake_rcu(epi);
+		}
 	} else if (!ep_is_linked(epi)) {
 		/* In the usual case, add event to ready list. */
-		if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
-			ep_pm_stay_awake_rcu(epi);
+		list_add_tail(&epi->rdllink, &ep->rdllist);
+		ep_pm_stay_awake_rcu(epi);
 	}
 
 	/*
@@ -1409,7 +1322,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 		pwake++;
 
 out_unlock:
-	read_unlock_irqrestore(&ep->lock, flags);
+	spin_unlock_irqrestore(&ep->lock, flags);
 
 	/* We have to call this outside the lock */
 	if (pwake)
@@ -1744,7 +1657,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	}
 
 	/* We have to drop the new item inside our item list to keep track of it */
-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
 
 	/* record NAPI ID of new item if present */
 	ep_set_busy_poll_napi_id(epi);
@@ -1761,7 +1674,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 		pwake++;
 	}
 
-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);
 
 	/* We have to call this outside the lock */
 	if (pwake)
@@ -1825,7 +1738,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 * list, push it inside.
 	 */
 	if (ep_item_poll(epi, &pt, 1)) {
-		write_lock_irq(&ep->lock);
+		spin_lock_irq(&ep->lock);
 		if (!ep_is_linked(epi)) {
 			list_add_tail(&epi->rdllink, &ep->rdllist);
 			ep_pm_stay_awake(epi);
@@ -1836,7 +1749,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 			if (waitqueue_active(&ep->poll_wait))
 				pwake++;
 		}
-		write_unlock_irq(&ep->lock);
+		spin_unlock_irq(&ep->lock);
 	}
 
 	/* We have to call this outside the lock */
@@ -2088,7 +2001,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		init_wait(&wait);
 		wait.func = ep_autoremove_wake_function;
 
-		write_lock_irq(&ep->lock);
+		spin_lock_irq(&ep->lock);
 		/*
 		 * Barrierless variant, waitqueue_active() is called under
 		 * the same lock on wakeup ep_poll_callback() side, so it
@@ -2107,7 +2020,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		if (!eavail)
 			__add_wait_queue_exclusive(&ep->wq, &wait);
 
-		write_unlock_irq(&ep->lock);
+		spin_unlock_irq(&ep->lock);
 
 		if (!eavail)
 			timed_out = !ep_schedule_timeout(to) ||
@@ -2123,7 +2036,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			eavail = 1;
 
 		if (!list_empty_careful(&wait.entry)) {
-			write_lock_irq(&ep->lock);
+			spin_lock_irq(&ep->lock);
 			/*
 			 * If the thread timed out and is not on the wait queue,
 			 * it means that the thread was woken up after its
@@ -2134,7 +2047,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			if (timed_out)
 				eavail = list_empty(&wait.entry);
 			__remove_wait_queue(&ep->wq, &wait);
-			write_unlock_irq(&ep->lock);
+			spin_unlock_irq(&ep->lock);
 		}
 	}
 }
-- 
2.39.5
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-15 12:58 ` Nam Cao
From: Nam Cao @ 2025-07-15 12:58 UTC (permalink / raw)
  To: Christian Brauner, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe
  Cc: stable

On Tue, Jul 15, 2025 at 02:46:34PM +0200, Nam Cao wrote:
> The ready event list of an epoll object is protected by read-write
> semaphore:
[...]
> Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
> Signed-off-by: Nam Cao <namcao@linutronix.de>
> Cc: stable@vger.kernel.org

I forgot to add:

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider <vschneid@redhat.com>
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2]

Christian, do you mind adding those for me, if/when you apply the patch?

Nam
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-15 16:42 ` Linus Torvalds
From: Linus Torvalds @ 2025-07-15 16:42 UTC (permalink / raw)
  To: Nam Cao
  Cc: Christian Brauner, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe, stable

On Tue, 15 Jul 2025 at 05:47, Nam Cao <namcao@linutronix.de> wrote:
>
>  fs/eventpoll.c | 139 +++++++++---------------------------------
>  1 file changed, 26 insertions(+), 113 deletions(-)

Yeah, this is more like the kind of diffstat I like to see for
eventpoll. Plus it makes things fundamentally simpler.

It might be worth looking at ep_poll_callback() - the only case that had
read_lock_irqsave() - and seeing if perhaps some of the tests inside the
lock might be done optimistically, or delayed to after the lock.

For example, the whole wakeup sequence looks like it should be done
*after* the ep->lock has been released, because it uses its own lock (ie
the

        if (sync)
                wake_up_sync(&ep->wq);
        else
                wake_up(&ep->wq);

thing uses the wq lock, and there is nothing that ep->lock protects
there as far as I can tell.

So I think this has some potential for _simple_ optimizations, but I'm
not sure how worth it it is.

           Thanks,
               Linus
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-16  7:41 ` Nam Cao
From: Nam Cao @ 2025-07-16 7:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe, stable

On Tue, Jul 15, 2025 at 09:42:52AM -0700, Linus Torvalds wrote:
[...]
> For example, the whole wakeup sequence looks like it should be done
> *after* the ep->lock has been released, because it uses its own lock
> (ie the
>
>         if (sync)
>                 wake_up_sync(&ep->wq);
>         else
>                 wake_up(&ep->wq);
>
> thing uses the wq lock, and there is nothing that ep->lock protects
> there as far as I can tell.
>
> So I think this has some potential for _simple_ optimizations, but I'm
> not sure how worth it it is.

Actually, ep->lock does protect this part, because ep_poll() touches
ep->wq without holding the waitqueue's lock.

We could do something like the diff below. But things are not so simple
anymore.
Best regards,
Nam

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a171f7e7dacc..5e6c7da606e7 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1252,8 +1252,6 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	unsigned long flags;
 	int ewake = 0;
 
-	spin_lock_irqsave(&ep->lock, flags);
-
 	ep_set_busy_poll_napi_id(epi);
 
 	/*
@@ -1263,7 +1261,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * until the next EPOLL_CTL_MOD will be issued.
 	 */
 	if (!(epi->event.events & ~EP_PRIVATE_BITS))
-		goto out_unlock;
+		goto out;
 
 	/*
 	 * Check the events coming with the callback. At this stage, not
@@ -1272,7 +1270,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * test for "key" != NULL before the event match test.
 	 */
 	if (pollflags && !(pollflags & epi->event.events))
-		goto out_unlock;
+		goto out;
 
 	/*
 	 * If we are transferring events to userspace, we can hold no locks
@@ -1280,6 +1278,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * semantics). All the events that happen during that period of time are
 	 * chained in ep->ovflist and requeued later on.
 	 */
+	spin_lock_irqsave(&ep->lock, flags);
+
 	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
 		if (epi->next == EP_UNACTIVE_PTR) {
 			epi->next = READ_ONCE(ep->ovflist);
@@ -1292,9 +1292,14 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 		ep_pm_stay_awake_rcu(epi);
 	}
 
+	spin_unlock_irqrestore(&ep->lock, flags);
+
+
 	/*
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
+	 *
+	 * Memory barrier for waitqueue_active() from spin_unlock_irqrestore().
 	 */
 	if (waitqueue_active(&ep->wq)) {
 		if ((epi->event.events & EPOLLEXCLUSIVE) &&
@@ -1321,9 +1326,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
-out_unlock:
-	spin_unlock_irqrestore(&ep->lock, flags);
-
+out:
 	/* We have to call this outside the lock */
 	if (pwake)
 		ep_poll_safewake(ep, epi, pollflags & EPOLL_URING_WAKE);
@@ -2001,13 +2004,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		init_wait(&wait);
 		wait.func = ep_autoremove_wake_function;
 
-		spin_lock_irq(&ep->lock);
-		/*
-		 * Barrierless variant, waitqueue_active() is called under
-		 * the same lock on wakeup ep_poll_callback() side, so it
-		 * is safe to avoid an explicit barrier.
-		 */
-		__set_current_state(TASK_INTERRUPTIBLE);
+		prepare_to_wait_exclusive(&ep->wq, &wait, TASK_INTERRUPTIBLE);
 
 		/*
 		 * Do the final check under the lock. ep_start/done_scan()
@@ -2016,10 +2013,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * period of time although events are pending, so lock is
 		 * important.
 		 */
+		spin_lock_irq(&ep->lock);
 		eavail = ep_events_available(ep);
-		if (!eavail)
-			__add_wait_queue_exclusive(&ep->wq, &wait);
-
 		spin_unlock_irq(&ep->lock);
 
 		if (!eavail)
@@ -2036,7 +2031,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			eavail = 1;
 
 		if (!list_empty_careful(&wait.entry)) {
-			spin_lock_irq(&ep->lock);
+			spin_lock_irq(&ep->wq.lock);
 			/*
 			 * If the thread timed out and is not on the wait queue,
 			 * it means that the thread was woken up after its
@@ -2047,7 +2042,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			if (timed_out)
 				eavail = list_empty(&wait.entry);
 			__remove_wait_queue(&ep->wq, &wait);
-			spin_unlock_irq(&ep->lock);
+			spin_unlock_irq(&ep->wq.lock);
 		}
 	}
 }
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-16  8:34 ` K Prateek Nayak
From: K Prateek Nayak @ 2025-07-16 8:34 UTC (permalink / raw)
  To: Nam Cao, Christian Brauner, Linus Torvalds, Xi Ruoyao,
	Frederic Weisbecker, Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, Clark Williams,
	Steven Rostedt, linux-fsdevel, linux-kernel, linux-rt-devel,
	linux-rt-users, Joe Damato, Martin Karsten, Jens Axboe
  Cc: stable

Hello Nam,

On 7/15/2025 6:16 PM, Nam Cao wrote:
> The ready event list of an epoll object is protected by read-write
> semaphore:
[...]
> Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
> Signed-off-by: Nam Cao <namcao@linutronix.de>
> Cc: stable@vger.kernel.org

Tested this version with the reproducer that Jan shared in [1] on top of
tip:sched/core (PREEMPT_RT) and I didn't run into any RCU stalls with
your patch applied on top (the VM has been running the repro for over an
hour now and is still responsive). Feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

[1] https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/

> ---
>  fs/eventpoll.c | 139 +++++++++---------------------------------
>  1 file changed, 26 insertions(+), 113 deletions(-)

--
Thanks and Regards,
Prateek
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-07-15 12:46 ` [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock Nam Cao
                     ` (2 preceding siblings ...)
  2025-07-16  8:34 ` K Prateek Nayak
@ 2025-08-26  8:43 ` Nam Cao
  2025-09-03  8:40   ` Sebastian Andrzej Siewior
  3 siblings, 1 reply; 10+ messages in thread
From: Nam Cao @ 2025-08-26  8:43 UTC (permalink / raw)
  To: Christian Brauner, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe
  Cc: stable

On Tue, Jul 15, 2025 at 02:46:34PM +0200, Nam Cao wrote:
> The ready event list of an epoll object is protected by a read-write
> semaphore:
>
> - The consumer (waiter) acquires the write lock and takes items.
> - The producer (waker) takes the read lock and adds items.
>
> The point of this design is to let epoll scale well with a large number
> of producers, as multiple producers can hold the read lock at the same
> time.
>
> Unfortunately, this implementation may cause a scheduling priority
> inversion problem. Suppose the consumer has higher scheduling priority
> than the producer. The consumer needs to acquire the write lock, but may
> be blocked by the producer holding the read lock. Since the read-write
> semaphore does not support priority boosting for readers (even with
> CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a
> higher-priority consumer is blocked by a lower-priority producer. This
> problem was reported in [1].
>
> Furthermore, this could also cause a stall problem, as described in [2].
>
> Fix this problem by replacing the rwlock with a spinlock.

Hi Christian,

May I know your plan with this patch? Are you still waiting for something?

You may still understandably be paranoid about epoll due to the last
regression. But it's been weeks, and this patch is quite simple, so I'm
starting to wonder whether it has been forgotten.

Nam
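[Editor's note] The locking scheme described in the patch message above can be sketched in userspace. This is an illustrative analogy, not the kernel code: a pthread mutex stands in for the kernel spinlock (in-kernel, ep_poll_callback() runs from wakeup context, so spin_lock_irqsave() is used), and all names below (run_demo, ready_item, etc.) are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

/*
 * Sketch of the scheme the patch moves to: one lock serializes both
 * producers (wakers) adding to the ready list and the consumer (waiter)
 * draining it. Producers can no longer run concurrently, which is the
 * trade-off the patch accepts in exchange for PI-friendly locking.
 */
struct ready_item {
	struct ready_item *next;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct ready_item *ready_head;

/* Producer side: conceptually what the wakeup callback does. */
static void *producer(void *arg)
{
	long n = (long)arg;

	for (long i = 0; i < n; i++) {
		struct ready_item *item = malloc(sizeof(*item));

		pthread_mutex_lock(&list_lock);	/* spin_lock_irqsave() in-kernel */
		item->next = ready_head;
		ready_head = item;
		pthread_mutex_unlock(&list_lock);
	}
	return NULL;
}

/* Consumer side: splice the whole ready list out under the same lock. */
static long consume_all(void)
{
	struct ready_item *head, *next;
	long drained = 0;

	pthread_mutex_lock(&list_lock);
	head = ready_head;
	ready_head = NULL;
	pthread_mutex_unlock(&list_lock);

	for (; head; head = next) {
		next = head->next;
		free(head);
		drained++;
	}
	return drained;
}

/* Run 4 producers adding 100 items each, then drain; returns items drained. */
long run_demo(void)
{
	pthread_t threads[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&threads[i], NULL, producer, (void *)100L);
	for (int i = 0; i < 4; i++)
		pthread_join(threads[i], NULL);

	return consume_all();
}
```

Under the old rwlock scheme, producers ran concurrently under the read lock but a higher-priority consumer could block behind them with no reader priority boosting; with a single lock, priority inheritance on PREEMPT_RT behaves as expected, at the cost of producer concurrency.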
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-08-26  8:43 ` Nam Cao
@ 2025-09-03  8:40   ` Sebastian Andrzej Siewior
  2025-09-05 13:52     ` Christian Brauner
  0 siblings, 1 reply; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-03  8:40 UTC (permalink / raw)
  To: Nam Cao, Christian Brauner
  Cc: Linus Torvalds, Xi Ruoyao, Frederic Weisbecker, Valentin Schneider,
	Alexander Viro, Jan Kara, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe, stable

On 2025-08-26 10:43:20 [+0200], Nam Cao wrote:
> On Tue, Jul 15, 2025 at 02:46:34PM +0200, Nam Cao wrote:
> > The ready event list of an epoll object is protected by a read-write
> > semaphore:
> >
> > - The consumer (waiter) acquires the write lock and takes items.
> > - The producer (waker) takes the read lock and adds items.
> >
> > The point of this design is to let epoll scale well with a large number
> > of producers, as multiple producers can hold the read lock at the same
> > time.
> >
> > Unfortunately, this implementation may cause a scheduling priority
> > inversion problem. Suppose the consumer has higher scheduling priority
> > than the producer. The consumer needs to acquire the write lock, but may
> > be blocked by the producer holding the read lock. Since the read-write
> > semaphore does not support priority boosting for readers (even with
> > CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a
> > higher-priority consumer is blocked by a lower-priority producer. This
> > problem was reported in [1].
> >
> > Furthermore, this could also cause a stall problem, as described in [2].
> >
> > Fix this problem by replacing the rwlock with a spinlock.
>
> Hi Christian,
>
> May I know your plan with this patch? Are you still waiting for something?
>
> You may still understandably be paranoid about epoll due to the last
> regression. But it's been weeks, and this patch is quite simple, so I'm
> starting to wonder whether it has been forgotten.

A friendly reminder.

> Nam

Sebastian
* Re: [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock
  2025-09-03  8:40 ` Sebastian Andrzej Siewior
@ 2025-09-05 13:52   ` Christian Brauner
  0 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2025-09-05 13:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Nam Cao, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara, John Ogness,
	K Prateek Nayak, Clark Williams, Steven Rostedt, linux-fsdevel,
	linux-kernel, linux-rt-devel, linux-rt-users, Joe Damato,
	Martin Karsten, Jens Axboe, stable

On Wed, Sep 03, 2025 at 10:40:12AM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-08-26 10:43:20 [+0200], Nam Cao wrote:
> > On Tue, Jul 15, 2025 at 02:46:34PM +0200, Nam Cao wrote:
> > > The ready event list of an epoll object is protected by a read-write
> > > semaphore:
> > >
> > > - The consumer (waiter) acquires the write lock and takes items.
> > > - The producer (waker) takes the read lock and adds items.
> > >
> > > The point of this design is to let epoll scale well with a large number
> > > of producers, as multiple producers can hold the read lock at the same
> > > time.
> > >
> > > Unfortunately, this implementation may cause a scheduling priority
> > > inversion problem. Suppose the consumer has higher scheduling priority
> > > than the producer. The consumer needs to acquire the write lock, but may
> > > be blocked by the producer holding the read lock. Since the read-write
> > > semaphore does not support priority boosting for readers (even with
> > > CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a
> > > higher-priority consumer is blocked by a lower-priority producer. This
> > > problem was reported in [1].
> > >
> > > Furthermore, this could also cause a stall problem, as described in [2].
> > >
> > > Fix this problem by replacing the rwlock with a spinlock.
> >
> > Hi Christian,
> >
> > May I know your plan with this patch? Are you still waiting for something?
> >
> > You may still understandably be paranoid about epoll due to the last
> > regression. But it's been weeks, and this patch is quite simple, so I'm
> > starting to wonder whether it has been forgotten.
>
> A friendly reminder.

Sorry, this apparently fell through the cracks. Taken care of now!
* Re: [PATCH v4 0/1] eventpoll: Fix priority inversion problem
  2025-07-15 12:46 [PATCH v4 0/1] eventpoll: Fix priority inversion problem Nam Cao
  2025-07-15 12:46 ` [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock Nam Cao
@ 2025-09-05 13:52 ` Christian Brauner
  1 sibling, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2025-09-05 13:52 UTC (permalink / raw)
  To: Nam Cao
  Cc: Christian Brauner, Linus Torvalds, Xi Ruoyao, Frederic Weisbecker,
	Valentin Schneider, Alexander Viro, Jan Kara,
	Sebastian Andrzej Siewior, John Ogness, K Prateek Nayak,
	Clark Williams, Steven Rostedt, linux-fsdevel, linux-kernel,
	linux-rt-devel, linux-rt-users, Joe Damato, Martin Karsten,
	Jens Axboe

On Tue, 15 Jul 2025 14:46:33 +0200, Nam Cao wrote:
> This v4 is the follow-up to v3 at:
> https://lore.kernel.org/linux-fsdevel/20250527090836.1290532-1-namcao@linutronix.de/
> which resolves a priority inversion problem.
>
> The v3 patch was merged, but then got reverted due to a regression.
>
> The direction of v3 was wrong in the first place. It changed
> eventpoll's event list to be lockless, making the code harder to read.
> I stared at the patch again, but still couldn't figure out what the
> bug is.
>
> [...]

Applied to the vfs-6.18.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to
rebase, trailer updates or similar. If in doubt, please check the
listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.misc

[1/1] eventpoll: Replace rwlock with spinlock
      https://git.kernel.org/vfs/vfs/c/0c43094f8cc9
end of thread, other threads: [~2025-09-05 13:52 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-15 12:46 [PATCH v4 0/1] eventpoll: Fix priority inversion problem Nam Cao
2025-07-15 12:46 ` [PATCH v4 1/1] eventpoll: Replace rwlock with spinlock Nam Cao
2025-07-15 12:58   ` Nam Cao
2025-07-15 16:42   ` Linus Torvalds
2025-07-16  7:41     ` Nam Cao
2025-07-16  8:34   ` K Prateek Nayak
2025-08-26  8:43   ` Nam Cao
2025-09-03  8:40     ` Sebastian Andrzej Siewior
2025-09-05 13:52       ` Christian Brauner
2025-09-05 13:52 ` [PATCH v4 0/1] eventpoll: Fix priority inversion problem Christian Brauner