[PATCH 00/17] eventpoll: clarity refactor

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 00/17] eventpoll: clarity refactor
@ 2026-04-24 13:46 Christian Brauner
  2026-04-24 13:46 ` [PATCH 01/17] eventpoll: expand top-of-file overview / locking doc Christian Brauner
                   ` (17 more replies)
  0 siblings, 18 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

The recent UAF series (a6dc643c6931 and follow-ups) rode on
invariants in fs/eventpoll.c that were nowhere documented and had
to be reverse-engineered from the code: the lifetime relationships
between struct eventpoll, struct epitem, and struct file, the three
removal paths coordinating via epi_fget() pins and ep->mtx, the
ovflist sentinel-encoded scan state machine, the POLLFREE
release/acquire handshake, and the loop / path check globals
serialized by epnested_mutex. The fix was correct but the next
person to touch this code will hit the same learning curve.

This adds a bunch of documentation (a bunch of swearwords were removed
by having an llm go over it) and refactors. The end goal is hopefully a
bit more pallatable than what this is right now. No functional changes
intended yet.

This series codifies those invariants in source and tightens the
surrounding structure.

First there are a couple of pure documentation changes. A top-of-file
overview with field-protection tables for struct eventpoll and struct
epitem, a section gathering the loop-check / path-check globals next to
their declarations, labelled comments on the two sides of the POLLFREE
handshake, refreshed comments on epi_fget() and ep_remove_file() (whose
contract the UAF fix re-shaped), and a docblock on
ep_clear_and_put() that names its two-pass structure as load-bearing.

Next are a couple of mechanical naming cleanups.
ep_refcount_dec_and_test() -> ep_put() to pair with ep_get(); the unused
depth argument dropped from epoll_mutex_lock() (all three callers passed
zero); attach_epitem() -> ep_attach_file() for ep_remove_file()
symmetry; and the CONFIG_KCMP block relocated next to CONFIG_COMPAT so
the hot-path code is contiguous.

Next are a couple of changes that extract long bodies into named
helpers. ep_insert() splits into ep_alloc_epitem() and
ep_register_epitem(); ep_clear_and_put()'s two passes become
ep_drain_pollwaits() and ep_drain_tree() so the ordering invariant is
enforced by the call sequence rather than convention; the per-event
delivery loop body extracts from ep_send_events() as ep_deliver_event();
and the ep->mtx + epnested_mutex acquisition dance lifts out of
do_epoll_ctl() into ep_ctl_lock() / ep_ctl_unlock(), with a return value
that doubles as the @full_check argument to ep_insert().

Next are a couple of changes that address sentinel and predicate sprawl.
The EP_UNACTIVE_PTR overload (meaning "no scan in progress" on
ep->ovflist and "epi not on ovflist" on epi->next) is hidden behind
named helpers (ep_is_scanning, epi_on_ovflist, ...); epi->next is
renamed to epi->ovflist_next and the local txlist to scan_batch; and
is_file_epoll(), ep_is_linked(), ep_events_available() are converted to
return bool to match their already-boolean bodies.

And last we move the per-CTL_ADD scratch state (tfile_check_list,
path_count[], inserting_into) from file-scope globals into a
stack-allocated struct ep_ctl_ctx plumbed through the loop / path check
chain. loop_check_gen stays at file scope because the stamp it leaves on
ep->gen across calls must not collide with a future walk.

The load-bearing invariants the UAF series closed are preserved
verbatim: the epi_fget() pin in ep_remove(), the ordering of
ep_unregister_pollwait() before ep_remove_file() / ep_remove_epi()
in all three removal paths, kfree_rcu(epi) and kfree_rcu(ep), the
POLLFREE smp_store_release / smp_load_acquire pair on pwq->whead,
ep->lock IRQ-safety, the mutex_lock_nested() subclass arithmetic
in ep_insert (subclass 0 outer, 1 for tep) and __ep_eventpoll_poll
/ ep_loop_check_proc (depth-based), and the WARN_ON_ONCE contract
on ep_put() in ep_remove().

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
Christian Brauner (17):
      eventpoll: expand top-of-file overview / locking doc
      eventpoll: document loop-check / path-check globals
      eventpoll: clarify POLLFREE handshake comments
      eventpoll: refresh epi_fget() / ep_remove_file() comments
      eventpoll: document ep_clear_and_put() two-pass pattern
      eventpoll: rename ep_refcount_dec_and_test() to ep_put()
      eventpoll: drop unused depth argument from epoll_mutex_lock()
      eventpoll: rename attach_epitem() to ep_attach_file()
      eventpoll: relocate KCMP helpers near compat syscalls
      eventpoll: split ep_insert() into alloc + register stages
      eventpoll: split ep_clear_and_put() into drain helpers
      eventpoll: extract ep_deliver_event() from ep_send_events()
      eventpoll: extract lock dance from do_epoll_ctl() into ep_ctl_lock()
      eventpoll: wrap EP_UNACTIVE_PTR in typed sentinel helpers
      eventpoll: rename epi->next and txlist for clarity
      eventpoll: use bool for predicate helpers
      eventpoll: hoist CTL_ADD scratch state into struct ep_ctl_ctx

 fs/eventpoll.c | 1183 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 778 insertions(+), 405 deletions(-)
---
base-commit: dd6c438c3e64a5ff0b5d7e78f7f9be547803ef1b
change-id: 20260424-work-epoll-rework-a02330741d24

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 01/17] eventpoll: expand top-of-file overview / locking doc
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 02/17] eventpoll: document loop-check / path-check globals Christian Brauner
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

The existing ~40-line "LOCKING:" banner covered the three-level lock
hierarchy (epnested_mutex > ep->mtx > ep->lock) but nothing else.
Lifetime rules, the ready-list state machine, the three removal paths,
and the POLLFREE contract are implicit in the code. The recent UAF
series (a6dc643c6931, 07712db80857, 8c2e52ebbe88, f2e467a48287) rode
on invariants that were only implicit.

Codify them at the top of the file: the subsystem overview, the lock
hierarchy and its mutex_lock_nested() subclass convention (reworded
from the old banner), a field-protection table for struct eventpoll
and struct epitem that names the two faces of the rbn/rcu union (rbn
under ep->mtx while linked into ep->rbr; rcu touched only by
kfree_rcu(epi) on the free path), the ovflist sentinel encoding and
scan-flip invariants, the three removal paths (A ep_remove, B
ep_clear_and_put, C eventpoll_release_file) and the epi_fget() pin
that orchestrates A vs C, and the POLLFREE store-release /
load-acquire handshake.

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/eventpoll.c | 199 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 162 insertions(+), 37 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a3090b446af1..5896f705a3ac 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -41,45 +41,170 @@
 #include <net/busy_poll.h>
 
 /*
- * LOCKING:
- * There are three level of locking required by epoll :
+ * fs/eventpoll.c - Efficient event polling ("epoll") kernel implementation.
  *
- * 1) epnested_mutex (mutex)
- * 2) ep->mtx (mutex)
- * 3) ep->lock (spinlock)
  *
- * The acquire order is the one listed above, from 1 to 3.
- * We need a spinlock (ep->lock) because we manipulate objects
- * from inside the poll callback, that might be triggered from
- * a wake_up() that in turn might be called from IRQ context.
- * So we can't sleep inside the poll callback and hence we need
- * a spinlock. During the event transfer loop (from kernel to
- * user space) we could end up sleeping due a copy_to_user(), so
- * we need a lock that will allow us to sleep. This lock is a
- * mutex (ep->mtx). It is acquired during the event transfer loop,
- * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
- * The epnested_mutex is acquired when inserting an epoll fd onto another
- * epoll fd. We do this so that we walk the epoll tree and ensure that this
- * insertion does not create a cycle of epoll file descriptors, which
- * could lead to deadlock. We need a global mutex to prevent two
- * simultaneous inserts (A into B and B into A) from racing and
- * constructing a cycle without either insert observing that it is
- * going to.
- * It is necessary to acquire multiple "ep->mtx"es at once in the
- * case when one epoll fd is added to another. In this case, we
- * always acquire the locks in the order of nesting (i.e. after
- * epoll_ctl(e1, EPOLL_CTL_ADD, e2), e1->mtx will always be acquired
- * before e2->mtx). Since we disallow cycles of epoll file
- * descriptors, this ensures that the mutexes are well-ordered. In
- * order to communicate this nesting to lockdep, when walking a tree
- * of epoll file descriptors, we use the current recursion depth as
- * the lockdep subkey.
- * It is possible to drop the "ep->mtx" and to use the global
- * mutex "epnested_mutex" (together with "ep->lock") to have it working,
- * but having "ep->mtx" will make the interface more scalable.
- * Events that require holding "epnested_mutex" are very rare, while for
- * normal operations the epoll private "ep->mtx" will guarantee
- * a better scalability.
+ * Overview
+ * --------
+ *
+ * Each epoll_create(2) returns an anonymous [eventpoll] file whose
+ * ->private_data is a struct eventpoll. Each EPOLL_CTL_ADD installs
+ * a struct epitem linking one (watched file, fd) pair back to that
+ * eventpoll via the watched file's f_op->poll() wait queue(s). When
+ * the watched file signals readiness, ep_poll_callback() fires and
+ * marks the epitem ready. epoll_wait(2) drains the ready list under
+ * ep->mtx, re-queueing items in level-triggered mode.
+ *
+ * epoll instances can watch other epoll instances up to EP_MAX_NESTS
+ * deep; cycles are forbidden and detected at EPOLL_CTL_ADD time.
+ *
+ *
+ * Locking
+ * -------
+ *
+ * Three levels, acquired from outer to inner:
+ *
+ *   epnested_mutex   (global; rare; taken only for EPOLL_CTL_ADD
+ *                     loop / path checks)
+ *     > ep->mtx     (per-eventpoll; sleepable; serializes most ops)
+ *       > ep->lock  (per-eventpoll; IRQ-safe spinlock)
+ *
+ *   file->f_lock    (per-file; NOT IRQ-safe; guards f_ep hlist ops;
+ *                    nested inside ep->mtx, outside ep->lock)
+ *
+ * Rationale:
+ *   - ep->lock is a spinlock because ep_poll_callback() is called from
+ *     wake_up() which may run in hard-IRQ context. All ep->lock
+ *     critical sections use spin_lock_irqsave().
+ *   - ep->mtx is a sleepable mutex because the event delivery loop
+ *     calls copy_to_user(), and ep_insert() may sleep in
+ *     kmem_cache_alloc() and f_op->poll().
+ *   - epnested_mutex is global because cycle detection needs a global
+ *     view of the epoll topology; a per-object scheme would let two
+ *     concurrent inserts (A into B, B into A) construct a cycle
+ *     without either observer seeing it.
+ *   - Per-ep ep->mtx is preferred for scalability elsewhere. Events
+ *     that require epnested_mutex are rare.
+ *
+ * When EPOLL_CTL_ADD nests one eventpoll inside another we acquire
+ * ep->mtx on both: outer first, target second. Since cycles are
+ * forbidden the set of live ep->mtx holds is always a strict chain,
+ * communicated to lockdep via mutex_lock_nested() subclasses derived
+ * from the current recursion depth.
+ *
+ *
+ * Field protection
+ * ----------------
+ *
+ * struct eventpoll:
+ *   mtx              - self
+ *   rbr              - ep->mtx
+ *   ovflist, rdllist - ep->lock (IRQ-safe)
+ *   wq               - ep->lock for queue mutation
+ *   poll_wait        - internal waitqueue spinlock
+ *   refs             - file->f_lock for adds; ep->mtx for removes;
+ *                      RCU for readers (hlist_del_rcu + kfree_rcu(ep))
+ *   ws               - ep->mtx
+ *   gen, loop_check_depth - epnested_mutex
+ *   file, user       - immutable after setup
+ *   refcount         - atomic (refcount_t)
+ *   napi_*           - READ_ONCE / WRITE_ONCE
+ *
+ * struct epitem:
+ *   rbn / rcu union  - rbn: ep->mtx (while epi is linked in ep->rbr).
+ *                      rcu: written only by kfree_rcu(epi) on the free
+ *                      path; otherwise untouched by epoll code.
+ *   rdllink, next    - ep->lock
+ *   ffd, ep          - immutable after ep_insert()
+ *   pwqlist          - ep->mtx for writes; POLLFREE clears pwq->whead
+ *                      via smp_store_release(), see below
+ *   fllink           - file->f_lock for mutation; hlist_del_rcu +
+ *                      kfree_rcu(epi) for safe RCU readers
+ *   ws               - RCU (rcu_assign_pointer /
+ *                      rcu_dereference_check(mtx))
+ *   event            - ep->mtx for writes; lockless read in
+ *                      ep_poll_callback pairs with smp_mb() in
+ *                      ep_modify()
+ *
+ *
+ * Ready-list state machine
+ * ------------------------
+ *
+ * Readiness is tracked in two lists under ep->lock:
+ *
+ *   rdllist   - doubly-linked FIFO; the "current" ready list.
+ *   ovflist   - singly-linked LIFO; used during a scan to catch
+ *               events that arrive while rdllist is being iterated
+ *               without ep->lock.
+ *
+ * Encoded in ep->ovflist:
+ *   EP_UNACTIVE_PTR - no scan active; callback appends to rdllist.
+ *   NULL            - scan active, no spill yet.
+ *   pointer to epi  - scan active with spilled items (LIFO).
+ *
+ * Encoded in epi->next:
+ *   EP_UNACTIVE_PTR - epi is not on ovflist.
+ *   otherwise       - next epi on ovflist (NULL at tail).
+ *
+ * ep_start_scan() flips "not scanning" to "scanning" and splices
+ * rdllist into a caller-local txlist. ep_done_scan() drains ovflist
+ * back to rdllist (list_add head-insert reverses LIFO to FIFO),
+ * flips back to "not scanning", and re-splices any items the caller
+ * left in txlist (e.g., level-triggered re-queues).
+ *
+ *
+ * Removal paths
+ * -------------
+ *
+ * Three paths dispose of epitems and/or eventpolls:
+ *
+ *   A. ep_remove()              - EPOLL_CTL_DEL and ep_insert()
+ *                                 rollback. Caller holds ep->mtx.
+ *   B. ep_clear_and_put()       - close of the epoll fd itself
+ *                                 (ep_eventpoll_release).
+ *   C. eventpoll_release_file() - close of a watched file, invoked
+ *                                 from __fput().
+ *
+ * Coordination:
+ *   A and C exclude each other via the watched file's refcount.
+ *   A pins the file with epi_fget() before touching file->f_ep or
+ *   file->f_lock; if the pin fails, __fput() is in flight and C
+ *   will clean this epi up. See the epi_fget() block comment.
+ *   A and B both hold ep->mtx serially. B walks the rbtree with
+ *   rb_next() captured before ep_remove() erases the current node.
+ *   B and C both take ep->mtx; the loser sees fewer entries or an
+ *   empty file->f_ep.
+ *
+ * Within every path the internal order is strict:
+ *   ep_unregister_pollwait()  - drain pwqlist; synchronizes with any
+ *                                in-flight ep_poll_callback via the
+ *                                watched wait-queue head's lock.
+ *   ep_remove_file()          - hlist_del_rcu of epi->fllink and,
+ *                                if last watcher, clear file->f_ep,
+ *                                under file->f_lock.
+ *   ep_remove_epi()           - rb_erase, rdllist unlink (ep->lock),
+ *                                wakeup_source_unregister,
+ *                                kfree_rcu(epi).
+ *
+ * kfree_rcu(epi) defers the free past RCU readers in
+ * reverse_path_check_proc(); kfree_rcu(ep) defers past readers in
+ * ep_get_upwards_depth_proc().
+ *
+ *
+ * POLLFREE handshake
+ * ------------------
+ *
+ * When a subsystem tears down a wait-queue head that an epitem is
+ * registered on (binder, signalfd, ...), it wakes the callback with
+ * POLLFREE and must RCU-defer the head's free. The store/load pair:
+ *
+ *   ep_poll_callback() POLLFREE branch:
+ *     smp_store_release(&pwq->whead, NULL)
+ *
+ *   ep_remove_wait_queue():
+ *     smp_load_acquire(&pwq->whead)
+ *
+ * See those sites for the full argument.
  */
 
 /* Epoll private bits inside the event mask */

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 02/17] eventpoll: document loop-check / path-check globals
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
  2026-04-24 13:46 ` [PATCH 01/17] eventpoll: expand top-of-file overview / locking doc Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 03/17] eventpoll: clarify POLLFREE handshake comments Christian Brauner
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

The globals that support EPOLL_CTL_ADD's cycle and path-length checks
are scattered: epnested_mutex, loop_check_gen, inserting_into, and
tfile_check_list sit at the top of the file; path_count[] and
path_limits[] are declared inline with the path-check code further
down. Their interaction -- the "ep->gen == loop_check_gen" trigger in
do_epoll_ctl(), the two loop_check_gen++ bumps that sandwich a check,
the EP_UNACTIVE_PTR sentinel on tfile_check_list, the -ELOOP back-edge
detection via inserting_into -- is not documented anywhere.

The area has had three recent fixes (CVE-2025-38349, the unbounded
recursion fix, and the overflow fix) whose logic depends on these
invariants. Collect the description in one block alongside the
declarations, cross-reference the path_count[] declaration that lives
with the path-check code, and name the fix commits so future readers
can find the context.

Also add a short comment on struct epitems_head describing its
dual use (wrapper for non-epoll file->f_ep versus pointing into
&ep->refs for the epoll-watches-epoll case), which the old comment
on tfile_check_list had accidentally attached to the struct.

Comment-only; no functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 50 insertions(+), 6 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 5896f705a3ac..477fcbc8e95e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -372,12 +372,54 @@ struct ep_pqueue {
 /* Maximum number of epoll watched descriptors, per user */
 static long max_user_watches __read_mostly;
 
-/* Used for cycles detection */
+/*
+ * Cycle and path-length checks at EPOLL_CTL_ADD
+ * ---------------------------------------------
+ *
+ * When EPOLL_CTL_ADD creates a link that either targets an eventpoll
+ * file or extends an existing chain of eventpolls, two checks run:
+ *
+ *   1. no cycle is being formed -- ep_loop_check() walks downward
+ *      from the candidate target, and ep_get_upwards_depth_proc()
+ *      walks upward from the outer ep, both bounded by EP_MAX_NESTS.
+ *   2. no file accumulates more than path_limits[depth] wakeup paths
+ *      of a given length -- reverse_path_check().
+ *
+ * Both need a global view of the epoll topology and must be atomic
+ * with the insertion, so the scratch state below is all serialized by
+ * one global mutex, epnested_mutex. Non-nested inserts skip this
+ * machinery entirely and take only ep->mtx.
+ *
+ *   epnested_mutex     Serializes the whole check; also protects every
+ *                      other variable in this block plus path_count[]
+ *                      (declared with the path-check code further
+ *                      down).
+ *   loop_check_gen     Monotonic stamp, bumped once at the start of a
+ *                      check and once at the end. ep->gen caches the
+ *                      value under which ep was last visited by
+ *                      ep_loop_check_proc() or
+ *                      ep_get_upwards_depth_proc(); the post-check
+ *                      bump ensures those cached stamps can no longer
+ *                      equal loop_check_gen, so the
+ *                      "ep->gen == loop_check_gen" trigger in
+ *                      do_epoll_ctl() only fires while another check
+ *                      is in flight.
+ *   inserting_into     Outer eventpoll pointer for the lifetime of one
+ *                      ep_loop_check(); ep_loop_check_proc() fails
+ *                      with -ELOOP if the downward walk reaches it.
+ *   tfile_check_list   Singly-linked list of epitems_head objects
+ *                      collected by ep_loop_check_proc() during the
+ *                      walk, consumed by reverse_path_check()
+ *                      afterwards. Sentinel EP_UNACTIVE_PTR means no
+ *                      check is in flight.
+ *
+ * Commits fdcfce93073d ("eventpoll: Fix integer overflow in
+ * ep_loop_check_proc()") and f2e467a48287 ("eventpoll: Fix
+ * semi-unbounded recursion") hardened the walk; any refactor must
+ * preserve both bail-outs.
+ */
 static DEFINE_MUTEX(epnested_mutex);
-
 static u64 loop_check_gen = 0;
-
-/* Used to check for epoll file descriptor inclusion loops */
 static struct eventpoll *inserting_into;
 
 /* Slab cache used to allocate "struct epitem" */
@@ -387,8 +429,10 @@ static struct kmem_cache *epi_cache __ro_after_init;
 static struct kmem_cache *pwq_cache __ro_after_init;
 
 /*
- * List of files with newly added links, where we may need to limit the number
- * of emanating paths. Protected by the epnested_mutex.
+ * Wrapper anchor for file->f_ep when the watched file is not itself an
+ * eventpoll; for the epoll-watches-epoll case, file->f_ep points at
+ * &watched_ep->refs directly. The ->next field threads
+ * tfile_check_list during one EPOLL_CTL_ADD path check.
  */
 struct epitems_head {
 	struct hlist_head epitems;

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 03/17] eventpoll: clarify POLLFREE handshake comments
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
  2026-04-24 13:46 ` [PATCH 01/17] eventpoll: expand top-of-file overview / locking doc Christian Brauner
  2026-04-24 13:46 ` [PATCH 02/17] eventpoll: document loop-check / path-check globals Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 04/17] eventpoll: refresh epi_fget() / ep_remove_file() comments Christian Brauner
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

ep_remove_wait_queue() and the POLLFREE branch of ep_poll_callback()
are the two halves of a release/acquire handshake that lets a
subsystem (binder, signalfd, ...) tear down a wait-queue head from
under a registered epitem. The existing local comments documented the
race but did not name the protocol or refer readers from one side to
the other. After the previous commit added a "POLLFREE handshake"
section to the top-of-file banner, these sites can point at the
banner and at each other.

Rework the two comment blocks so that each side is labelled
"acquire side" or "release side", references the banner, and
explains its role in the protocol. On the release side fuse the two
former comments into one narrative: list_del_init() tolerates a
second delete from a racing ep_remove_wait_queue(), and the
smp_store_release() is what lets that racing remover discover the
teardown.

Comment-only; no functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 477fcbc8e95e..1d1fd6464c38 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -828,10 +828,15 @@ static void ep_remove_wait_queue(struct eppoll_entry *pwq)
 
 	rcu_read_lock();
 	/*
-	 * If it is cleared by POLLFREE, it should be rcu-safe.
-	 * If we read NULL we need a barrier paired with
-	 * smp_store_release() in ep_poll_callback(), otherwise
-	 * we rely on whead->lock.
+	 * POLLFREE handshake, acquire side; see "POLLFREE handshake"
+	 * at the top of this file.
+	 *
+	 * A NULL load is paired with the smp_store_release(&whead, NULL)
+	 * in ep_poll_callback()'s POLLFREE branch: the teardown is
+	 * complete and we must not touch whead again. On a non-NULL load
+	 * rcu_read_lock() keeps the waitqueue memory alive (POLLFREE
+	 * firers RCU-defer the free) and whead->lock inside
+	 * remove_wait_queue() serializes us against the store side.
 	 */
 	whead = smp_load_acquire(&pwq->whead);
 	if (whead)
@@ -1505,17 +1510,24 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 
 	if (pollflags & POLLFREE) {
 		/*
-		 * If we race with ep_remove_wait_queue() it can miss
-		 * ->whead = NULL and do another remove_wait_queue() after
-		 * us, so we can't use __remove_wait_queue().
+		 * POLLFREE handshake, release side; see "POLLFREE handshake"
+		 * at the top of this file.
+		 *
+		 * Unlink our wait entry with list_del_init rather than
+		 * __remove_wait_queue: a concurrent ep_remove_wait_queue()
+		 * that already loaded a non-NULL whead may still call
+		 * remove_wait_queue() after us, and list_del_init() tolerates
+		 * the second delete.
+		 *
+		 * smp_store_release(&whead, NULL) publishes the teardown to
+		 * ep_remove_wait_queue()'s smp_load_acquire(). Before this
+		 * store, a racing ep_clear_and_put() / ep_remove() reaches
+		 * ep_remove_wait_queue() which sees whead != NULL and takes
+		 * whead->lock -- the same lock held by our caller, so it
+		 * serializes behind us. Once whead is zeroed, nothing else
+		 * protects ep / epi / wait.
 		 */
 		list_del_init(&wait->entry);
-		/*
-		 * ->whead != NULL protects us from the race with
-		 * ep_clear_and_put() or ep_remove(), ep_remove_wait_queue()
-		 * takes whead->lock held by the caller. Once we nullify it,
-		 * nothing protects ep/epi or even wait.
-		 */
 		smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
 	}
 

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 04/17] eventpoll: refresh epi_fget() / ep_remove_file() comments
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (2 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 03/17] eventpoll: clarify POLLFREE handshake comments Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 05/17] eventpoll: document ep_clear_and_put() two-pass pattern Christian Brauner
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

Two comments drifted from the code they sit on.

epi_fget()'s block comment still referenced atomic_long_inc_not_zero,
which has been file_ref_get() for a while, and described only one of
the function's two roles: safe dereference of epi->ffd.file under
ep->mtx. Since commit a6dc643c6931 ("eventpoll: fix ep_remove struct
eventpoll / struct file UAF") the refcount bump also serves as a pin
that blocks __fput() from starting, which is what lets ep_remove()
touch file->f_lock and file->f_ep without racing
eventpoll_release_file(). Update the block to name both roles and the
commit that introduced the pin role.

ep_remove_file()'s one-line "See eventpoll_release() for details"
pointed at an inline in include/linux/eventpoll.h but said nothing
about what those details were. Replace it with a short explanation:
we publish NULL so the eventpoll_release() fastpath can skip the slow
path, and this is safe because every f_ep writer either holds a pin
via epi_fget() or is __fput() itself.

Comment-only; no functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1d1fd6464c38..1039d9737ce9 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -991,22 +991,23 @@ static void ep_free(struct eventpoll *ep)
 }
 
 /*
- * The ffd.file pointer may be in the process of being torn down due to
- * being closed, but we may not have finished eventpoll_release() yet.
+ * Pin @epi->ffd.file for operations that require both safe dereference
+ * and exclusion from __fput().
  *
- * Normally, even with the atomic_long_inc_not_zero, the file may have
- * been free'd and then gotten re-allocated to something else (since
- * files are not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU).
+ * struct file uses SLAB_TYPESAFE_BY_RCU, so a freed slot can be
+ * reassigned at any time. The bare load of epi->ffd.file is safe here
+ * because the caller holds ep->mtx and eventpoll_release_file() blocks
+ * on that mutex while tearing down the epi, so the backing file
+ * allocation cannot be freed and reused under us. An rcu_read_lock()
+ * is therefore unnecessary for the load.
  *
- * But for epoll, users hold the ep->mtx mutex, and as such any file in
- * the process of being free'd will block in eventpoll_release_file()
- * and thus the underlying file allocation will not be free'd, and the
- * file re-use cannot happen.
- *
- * For the same reason we can avoid a rcu_read_lock() around the
- * operation - 'ffd.file' cannot go away even if the refcount has
- * reached zero (but we must still not call out to ->poll() functions
- * etc).
+ * A successful file_ref_get() additionally blocks __fput() from
+ * starting on this file: once the refcount has reached zero it cannot
+ * come back. ep_remove() relies on that to touch file->f_lock and
+ * file->f_ep without racing eventpoll_release_file() (see commit
+ * a6dc643c6931). A NULL return means __fput() is already in flight;
+ * the caller must bail without touching the file, and
+ * eventpoll_release_file() will clean the epi up from its side.
  */
 static struct file *epi_fget(const struct epitem *epi)
 {
@@ -1032,7 +1033,13 @@ static void ep_remove_file(struct eventpoll *ep, struct epitem *epi,
 	spin_lock(&file->f_lock);
 	head = file->f_ep;
 	if (hlist_is_singular_node(&epi->fllink, head)) {
-		/* See eventpoll_release() for details. */
+		/*
+		 * Last watcher: publish NULL so the eventpoll_release()
+		 * fastpath in include/linux/eventpoll.h can skip the slow
+		 * path on a future __fput(). Safe because every f_ep writer
+		 * either holds a pin on @file via epi_fget() or is __fput()
+		 * itself -- see the comment in eventpoll_release().
+		 */
 		WRITE_ONCE(file->f_ep, NULL);
 		if (!is_file_epoll(file)) {
 			struct epitems_head *v;

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 05/17] eventpoll: document ep_clear_and_put() two-pass pattern
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (3 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 04/17] eventpoll: refresh epi_fget() / ep_remove_file() comments Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 06/17] eventpoll: rename ep_refcount_dec_and_test() to ep_put() Christian Brauner
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

ep_clear_and_put() walks the rbtree twice: once to drain each epi's
pwqlist, then again to ep_remove() each entry. The split is
load-bearing -- fusing the passes into one loop would let a poll
callback still queued on some epi_i fire after epi_{i+k} has already
been freed -- but the previous comments described each pass in
isolation and did not explain the ordering invariant or the
cooperation with removal path C (eventpoll_release_file).

Add a function-level docblock that labels this as path B from the
top-of-file "Removal paths" section, names the two passes and the
ordering invariant, explains the pwqlist drain as synchronization
with in-flight ep_poll_callback() via whead->lock, describes the
C-path hand-off when epi_fget() returns NULL, and states the
ep->refcount invariant that keeps ep_remove()'s WARN_ON_ONCE safe
across the loop.

Also tighten the per-pass comments to one line each and fix the
minor grammar bug in the poll_wait release comment ("these file" ->
"poll-on-ep").

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/eventpoll.c | 44 ++++++++++++++++++++++++++++++++------------
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1039d9737ce9..b6a14c69c482 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1103,20 +1103,47 @@ static void ep_remove(struct eventpoll *ep, struct epitem *epi)
 	WARN_ON_ONCE(ep_refcount_dec_and_test(ep));
 }
 
+/*
+ * Removal path B (see "Removal paths" in the top-of-file banner):
+ * close of the epoll fd itself, reached via ep_eventpoll_release().
+ *
+ * Under ep->mtx we walk the rbtree twice:
+ *
+ *   Pass 1 drains each epi's pwqlist via ep_unregister_pollwait().
+ *          This takes each watched waitqueue head's lock and so
+ *          synchronizes with any in-flight ep_poll_callback(), so
+ *          after the pass ends no callback can still be holding or
+ *          about to dereference any epi on this ep.
+ *
+ *   Pass 2 runs ep_remove() on each epi. The per-epi pwqlist is
+ *          already empty, but the rest of ep_remove() still runs
+ *          (epi_fget() pin, f_ep clear under f_lock, rbtree erase,
+ *          rdllist unlink, kfree_rcu).
+ *
+ * Pass 1 must strictly precede Pass 2: fusing them would let a
+ * callback queued on epi_i still fire after epi_{i+k} was freed.
+ *
+ * A concurrent eventpoll_release_file() (path C) serializes against
+ * us on ep->mtx; in Pass 2, ep_remove() transparently hands off any
+ * epi whose watched file is in __fput() by bailing when epi_fget()
+ * returns NULL, and C will clean that epi up on its side.
+ *
+ * ep->refcount is held > 0 throughout by the ep file's own share;
+ * we drop that share after the walk and free the eventpoll if we
+ * were last.
+ */
 static void ep_clear_and_put(struct eventpoll *ep)
 {
 	struct rb_node *rbp, *next;
 	struct epitem *epi;
 
-	/* We need to release all tasks waiting for these file */
+	/* Release any threads blocked in poll-on-ep. */
 	if (waitqueue_active(&ep->poll_wait))
 		ep_poll_safewake(ep, NULL, 0);
 
 	mutex_lock(&ep->mtx);
 
-	/*
-	 * Walks through the whole tree by unregistering poll callbacks.
-	 */
+	/* Pass 1: drain pwqlists; synchronizes with in-flight callbacks. */
 	for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = rb_next(rbp)) {
 		epi = rb_entry(rbp, struct epitem, rbn);
 
@@ -1124,14 +1151,7 @@ static void ep_clear_and_put(struct eventpoll *ep)
 		cond_resched();
 	}
 
-	/*
-	 * Walks through the whole tree and try to free each "struct epitem".
-	 * Note that ep_remove() will not remove the epitem in case of a
-	 * racing eventpoll_release_file(); the latter will do the removal.
-	 * At this point we are sure no poll callbacks will be lingering around.
-	 * Since we still own a reference to the eventpoll struct, the loop can't
-	 * dispose it.
-	 */
+	/* Pass 2: remove each epi. rb_next() is captured before erase. */
 	for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = next) {
 		next = rb_next(rbp);
 		epi = rb_entry(rbp, struct epitem, rbn);

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 06/17] eventpoll: rename ep_refcount_dec_and_test() to ep_put()
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (4 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 05/17] eventpoll: document ep_clear_and_put() two-pass pattern Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 07/17] eventpoll: drop unused depth argument from epoll_mutex_lock() Christian Brauner
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

ep_refcount_dec_and_test() mirrors refcount_dec_and_test() verbatim,
which reads fine at a call site like

  if (ep_refcount_dec_and_test(ep))
      ep_free(ep);

but awkward at

  WARN_ON_ONCE(ep_refcount_dec_and_test(ep));

and does not pair cleanly with ep_get(). Rename to the idiomatic
ep_put() and reword the kerneldoc to spell out the return-value
contract (caller is responsible for ep_free() iff the return is
true). Leave ep_put() as a bool-returning wrapper -- we cannot fold
ep_free() into it because ep_remove() calls it under ep->mtx and the
mutex would still be held when ep_free()'s mutex_destroy() ran (see
commit 8c2e52ebbe88 "eventpoll: don't decrement ep refcount while
still holding the ep mutex").

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index b6a14c69c482..da31a3ac6057 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -969,9 +969,10 @@ static void ep_get(struct eventpoll *ep)
 }
 
 /*
- * Returns true if the event poll can be disposed
+ * Drop a reference to @ep; returns true iff it was the last, in which
+ * case the caller is responsible for ep_free().
  */
-static bool ep_refcount_dec_and_test(struct eventpoll *ep)
+static bool ep_put(struct eventpoll *ep)
 {
 	if (!refcount_dec_and_test(&ep->refcount))
 		return false;
@@ -1100,7 +1101,7 @@ static void ep_remove(struct eventpoll *ep, struct epitem *epi)
 
 	ep_remove_file(ep, epi, file);
 	ep_remove_epi(ep, epi);
-	WARN_ON_ONCE(ep_refcount_dec_and_test(ep));
+	WARN_ON_ONCE(ep_put(ep));
 }
 
 /*
@@ -1160,7 +1161,7 @@ static void ep_clear_and_put(struct eventpoll *ep)
 	}
 
 	mutex_unlock(&ep->mtx);
-	if (ep_refcount_dec_and_test(ep))
+	if (ep_put(ep))
 		ep_free(ep);
 }
 
@@ -1339,7 +1340,7 @@ void eventpoll_release_file(struct file *file)
 
 		mutex_unlock(&ep->mtx);
 
-		if (ep_refcount_dec_and_test(ep))
+		if (ep_put(ep))
 			ep_free(ep);
 		goto again;
 	}

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 07/17] eventpoll: drop unused depth argument from epoll_mutex_lock()
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (5 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 06/17] eventpoll: rename ep_refcount_dec_and_test() to ep_put() Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 08/17] eventpoll: rename attach_epitem() to ep_attach_file() Christian Brauner
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

epoll_mutex_lock() has three callers, all in do_epoll_ctl(), and every
one passes depth == 0. The argument has been dead since the helper was
introduced. Drop it. Because a zero subclass makes mutex_lock_nested()
equivalent to mutex_lock(), switch the blocking path to the simpler
primitive as well.

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index da31a3ac6057..ba1017c72167 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2432,16 +2432,13 @@ static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
 }
 #endif
 
-static inline int epoll_mutex_lock(struct mutex *mutex, int depth,
-				   bool nonblock)
+static inline int epoll_mutex_lock(struct mutex *mutex, bool nonblock)
 {
 	if (!nonblock) {
-		mutex_lock_nested(mutex, depth);
+		mutex_lock(mutex);
 		return 0;
 	}
-	if (mutex_trylock(mutex))
-		return 0;
-	return -EAGAIN;
+	return mutex_trylock(mutex) ? 0 : -EAGAIN;
 }
 
 int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
@@ -2513,14 +2510,14 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 	 * deep wakeup paths from forming in parallel through multiple
 	 * EPOLL_CTL_ADD operations.
 	 */
-	error = epoll_mutex_lock(&ep->mtx, 0, nonblock);
+	error = epoll_mutex_lock(&ep->mtx, nonblock);
 	if (error)
 		goto error_tgt_fput;
 	if (op == EPOLL_CTL_ADD) {
 		if (READ_ONCE(fd_file(f)->f_ep) || ep->gen == loop_check_gen ||
 		    is_file_epoll(fd_file(tf))) {
 			mutex_unlock(&ep->mtx);
-			error = epoll_mutex_lock(&epnested_mutex, 0, nonblock);
+			error = epoll_mutex_lock(&epnested_mutex, nonblock);
 			if (error)
 				goto error_tgt_fput;
 			loop_check_gen++;
@@ -2531,7 +2528,7 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 				if (ep_loop_check(ep, tep) != 0)
 					goto error_tgt_fput;
 			}
-			error = epoll_mutex_lock(&ep->mtx, 0, nonblock);
+			error = epoll_mutex_lock(&ep->mtx, nonblock);
 			if (error)
 				goto error_tgt_fput;
 		}

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 08/17] eventpoll: rename attach_epitem() to ep_attach_file()
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (6 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 07/17] eventpoll: drop unused depth argument from epoll_mutex_lock() Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 09/17] eventpoll: relocate KCMP helpers near compat syscalls Christian Brauner
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

ep_remove_file() tears down the f_ep linkage that attach_epitem()
establishes, so the pair should look like one. Rename to
ep_attach_file() for the "ep_*" + subject symmetry and to match the
naming used elsewhere in the file (ep_insert, ep_modify, ep_remove,
ep_remove_file, ep_remove_epi, ep_unregister_pollwait).

Pure rename; no functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ba1017c72167..1fe9f1772a28 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1735,7 +1735,7 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi)
 	wakeup_source_unregister(ws);
 }
 
-static int attach_epitem(struct file *file, struct epitem *epi)
+static int ep_attach_file(struct file *file, struct epitem *epi)
 {
 	struct epitems_head *to_free = NULL;
 	struct hlist_head *head = NULL;
@@ -1806,7 +1806,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	if (tep)
 		mutex_lock_nested(&tep->mtx, 1);
 	/* Add the current item to the list of active epoll hook for this file */
-	if (unlikely(attach_epitem(tfile, epi) < 0)) {
+	if (unlikely(ep_attach_file(tfile, epi) < 0)) {
 		if (tep)
 			mutex_unlock(&tep->mtx);
 		kmem_cache_free(epi_cache, epi);

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 09/17] eventpoll: relocate KCMP helpers near compat syscalls
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (7 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 08/17] eventpoll: rename attach_epitem() to ep_attach_file() Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 10/17] eventpoll: split ep_insert() into alloc + register stages Christian Brauner
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

ep_find_tfd() and get_epoll_tfile_raw_ptr() are only used when
CONFIG_KCMP=y. They implement the lookup side of the kcmp(2)
KCMP_EPOLL_TFD query. The helpers currently live between ep_find()
and ep_poll_callback(), interrupting the run of hot-path code
(callback, wait-queue setup, path check, insert, modify, send_events,
poll) with a feature-gated block.

Move the #ifdef CONFIG_KCMP block next to #ifdef CONFIG_COMPAT, which
is also a peripheral ABI extension. Hot-path code becomes a
contiguous span, and the userspace-adjacent extensions cluster at the
end of the file just before eventpoll_init().

Pure code movement; diff is 44 removed and 44 added, all within one
block. No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 88 +++++++++++++++++++++++++++++-----------------------------
 1 file changed, 44 insertions(+), 44 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1fe9f1772a28..fde2396342b6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1399,50 +1399,6 @@ static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
 	return epir;
 }
 
-#ifdef CONFIG_KCMP
-static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long toff)
-{
-	struct rb_node *rbp;
-	struct epitem *epi;
-
-	for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = rb_next(rbp)) {
-		epi = rb_entry(rbp, struct epitem, rbn);
-		if (epi->ffd.fd == tfd) {
-			if (toff == 0)
-				return epi;
-			else
-				toff--;
-		}
-		cond_resched();
-	}
-
-	return NULL;
-}
-
-struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
-				     unsigned long toff)
-{
-	struct file *file_raw;
-	struct eventpoll *ep;
-	struct epitem *epi;
-
-	if (!is_file_epoll(file))
-		return ERR_PTR(-EINVAL);
-
-	ep = file->private_data;
-
-	mutex_lock(&ep->mtx);
-	epi = ep_find_tfd(ep, tfd, toff);
-	if (epi)
-		file_raw = epi->ffd.file;
-	else
-		file_raw = ERR_PTR(-ENOENT);
-	mutex_unlock(&ep->mtx);
-
-	return file_raw;
-}
-#endif /* CONFIG_KCMP */
-
 /*
  * This is the callback that is passed to the wait queue wakeup
  * mechanism. It is called by the stored file descriptors when they
@@ -2733,6 +2689,50 @@ SYSCALL_DEFINE6(epoll_pwait2, int, epfd, struct epoll_event __user *, events,
 			      sigmask, sigsetsize);
 }
 
+#ifdef CONFIG_KCMP
+static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long toff)
+{
+	struct rb_node *rbp;
+	struct epitem *epi;
+
+	for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = rb_next(rbp)) {
+		epi = rb_entry(rbp, struct epitem, rbn);
+		if (epi->ffd.fd == tfd) {
+			if (toff == 0)
+				return epi;
+			else
+				toff--;
+		}
+		cond_resched();
+	}
+
+	return NULL;
+}
+
+struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
+				     unsigned long toff)
+{
+	struct file *file_raw;
+	struct eventpoll *ep;
+	struct epitem *epi;
+
+	if (!is_file_epoll(file))
+		return ERR_PTR(-EINVAL);
+
+	ep = file->private_data;
+
+	mutex_lock(&ep->mtx);
+	epi = ep_find_tfd(ep, tfd, toff);
+	if (epi)
+		file_raw = epi->ffd.file;
+	else
+		file_raw = ERR_PTR(-ENOENT);
+	mutex_unlock(&ep->mtx);
+
+	return file_raw;
+}
+#endif /* CONFIG_KCMP */
+
 #ifdef CONFIG_COMPAT
 static int do_compat_epoll_pwait(int epfd, struct epoll_event __user *events,
 				 int maxevents, struct timespec64 *timeout,

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 10/17] eventpoll: split ep_insert() into alloc + register stages
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (8 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 09/17] eventpoll: relocate KCMP helpers near compat syscalls Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 11/17] eventpoll: split ep_clear_and_put() into drain helpers Christian Brauner
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

ep_insert() was 130 lines and mixed four concerns in one body: user
quota charge and epitem allocation, attach-into-file-hlist plus
rbtree insert plus target-ep locking, reverse-path + EPOLLWAKEUP +
poll-queue install with rollback, and ready-list publication.
Factor the first two concerns into named helpers so the body reduces
to orchestration.

ep_alloc_epitem() charges the user's epoll_watches quota, allocates
a fresh epitem, and initializes its fields. On failure it returns
ERR_PTR(-ENOSPC) or ERR_PTR(-ENOMEM); on success the epi is not yet
linked into anything.

ep_register_epitem() installs @epi into @tfile's f_ep hlist and
@ep's rbtree, optionally chains @tfile onto tfile_check_list for the
path check, takes the tep->mtx nested lock for the epoll-watches-
epoll case, and finally takes the ep_get() reference that pairs
with ep_remove()'s ep_put() in ep_insert()'s error paths. On failure
it frees the epi and decrements epoll_watches to match
ep_alloc_epitem().

ep_insert()'s remaining body is the rollback-via-ep_remove() chain
(reverse_path_check, EPOLLWAKEUP source creation, ep_ptable_queue_proc
allocation) and the ready-list / wake publication. Remove a few
stale comments that duplicated function-level documentation or
described obvious code.

No functional change; rollback boundaries unchanged -- every error
path after ep_register_epitem() still calls ep_remove(), preserving
the ep->refcount invariant that keeps ep_remove()'s WARN_ON_ONCE safe.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 111 ++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 74 insertions(+), 37 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index fde2396342b6..e4a4e92d329f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1726,68 +1726,112 @@ static int ep_attach_file(struct file *file, struct epitem *epi)
 }
 
 /*
- * Must be called with "mtx" held.
+ * Charge the user's epoll_watches quota, allocate a fresh epitem for
+ * @tfile/@fd, and initialize its fields. The returned item is not yet
+ * linked into any data structure; the caller must install it via
+ * ep_register_epitem() (which takes over on success) or kmem_cache_free()
+ * it and decrement epoll_watches on its own.
+ *
+ * Returns ERR_PTR(-ENOSPC) if the quota is exceeded, ERR_PTR(-ENOMEM)
+ * if the slab allocation fails.
  */
-static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
-		     struct file *tfile, int fd, int full_check)
+static struct epitem *ep_alloc_epitem(struct eventpoll *ep,
+				      const struct epoll_event *event,
+				      struct file *tfile, int fd)
 {
-	int error, pwake = 0;
-	__poll_t revents;
 	struct epitem *epi;
-	struct ep_pqueue epq;
-	struct eventpoll *tep = NULL;
-
-	if (is_file_epoll(tfile))
-		tep = tfile->private_data;
-
-	lockdep_assert_irqs_enabled();
 
 	if (unlikely(percpu_counter_compare(&ep->user->epoll_watches,
 					    max_user_watches) >= 0))
-		return -ENOSPC;
+		return ERR_PTR(-ENOSPC);
 	percpu_counter_inc(&ep->user->epoll_watches);
 
-	if (!(epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL))) {
+	epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL);
+	if (unlikely(!epi)) {
 		percpu_counter_dec(&ep->user->epoll_watches);
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 	}
 
-	/* Item initialization follow here ... */
 	INIT_LIST_HEAD(&epi->rdllink);
 	epi->ep = ep;
 	ep_set_ffd(&epi->ffd, tfile, fd);
 	epi->event = *event;
 	epi->next = EP_UNACTIVE_PTR;
 
+	return epi;
+}
+
+/*
+ * Install @epi into its target file's f_ep hlist and into @ep's rbtree,
+ * taking one additional reference on @ep for the lifetime of the item.
+ *
+ * If @tep is non-NULL, the target file is itself an eventpoll; we hold
+ * tep->mtx at subclass 1 across the attach + rbtree insert to serialize
+ * with the target side. RB tree ops are protected by @ep->mtx, which
+ * the caller already holds.
+ *
+ * On failure the epi is freed and the epoll_watches counter decremented,
+ * matching ep_alloc_epitem()'s allocation. After this returns
+ * successfully, ep_insert()'s later error paths use ep_remove() for
+ * unwind; that cannot drop @ep's refcount to zero because the ep file
+ * itself still holds the original reference.
+ */
+static int ep_register_epitem(struct eventpoll *ep, struct epitem *epi,
+			      struct eventpoll *tep, int full_check)
+{
+	struct file *tfile = epi->ffd.file;
+	int error;
+
 	if (tep)
 		mutex_lock_nested(&tep->mtx, 1);
-	/* Add the current item to the list of active epoll hook for this file */
-	if (unlikely(ep_attach_file(tfile, epi) < 0)) {
+
+	error = ep_attach_file(tfile, epi);
+	if (unlikely(error)) {
 		if (tep)
 			mutex_unlock(&tep->mtx);
 		kmem_cache_free(epi_cache, epi);
 		percpu_counter_dec(&ep->user->epoll_watches);
-		return -ENOMEM;
+		return error;
 	}
 
 	if (full_check && !tep)
 		list_file(tfile);
 
-	/*
-	 * Add the current item to the RB tree. All RB tree operations are
-	 * protected by "mtx", and ep_insert() is called with "mtx" held.
-	 */
 	ep_rbtree_insert(ep, epi);
+
 	if (tep)
 		mutex_unlock(&tep->mtx);
 
-	/*
-	 * ep_remove() calls in the later error paths can't lead to
-	 * ep_free() as the ep file itself still holds an ep reference.
-	 */
 	ep_get(ep);
+	return 0;
+}
+
+/*
+ * Must be called with "mtx" held.
+ */
+static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
+		     struct file *tfile, int fd, int full_check)
+{
+	int error, pwake = 0;
+	__poll_t revents;
+	struct epitem *epi;
+	struct ep_pqueue epq;
+	struct eventpoll *tep = NULL;
+
+	if (is_file_epoll(tfile))
+		tep = tfile->private_data;
+
+	lockdep_assert_irqs_enabled();
 
-	/* now check if we've created too many backpaths */
+	epi = ep_alloc_epitem(ep, event, tfile, fd);
+	if (IS_ERR(epi))
+		return PTR_ERR(epi);
+
+	error = ep_register_epitem(ep, epi, tep, full_check);
+	if (error)
+		return error;
+
+	/* Reject the insert if the new link would create too many back-paths. */
 	if (unlikely(full_check && reverse_path_check())) {
 		ep_remove(ep, epi);
 		return -EINVAL;
@@ -1814,28 +1858,21 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	 */
 	revents = ep_item_poll(epi, &epq.pt, 1);
 
-	/*
-	 * We have to check if something went wrong during the poll wait queue
-	 * install process. Namely an allocation for a wait queue failed due
-	 * high memory pressure.
-	 */
+	/* ep_ptable_queue_proc() signals allocation failure by clearing epq.epi. */
 	if (unlikely(!epq.epi)) {
 		ep_remove(ep, epi);
 		return -ENOMEM;
 	}
 
-	/* We have to drop the new item inside our item list to keep track of it */
+	/* Drop the new item onto the ready list if it is already ready. */
 	spin_lock_irq(&ep->lock);
 
-	/* record NAPI ID of new item if present */
 	ep_set_busy_poll_napi_id(epi);
 
-	/* If the file is already "ready" we drop it inside the ready list */
 	if (revents && !ep_is_linked(epi)) {
 		list_add_tail(&epi->rdllink, &ep->rdllist);
 		ep_pm_stay_awake(epi);
 
-		/* Notify waiting tasks that events are available */
 		if (waitqueue_active(&ep->wq))
 			wake_up(&ep->wq);
 		if (waitqueue_active(&ep->poll_wait))

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 11/17] eventpoll: split ep_clear_and_put() into drain helpers
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (9 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 10/17] eventpoll: split ep_insert() into alloc + register stages Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 12/17] eventpoll: extract ep_deliver_event() from ep_send_events() Christian Brauner
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

ep_clear_and_put()'s two-pass walk is the main way an ep file close
tears down its state, and the ordering between the passes is
load-bearing (see previous commit's docblock). Give each pass its
own function so the ordering is enforced by the call sequence in
ep_clear_and_put() rather than by convention inside one body.

ep_drain_pollwaits() carries out Pass 1: walk the rbtree and
ep_unregister_pollwait() each epi. The function-level comment names
it as Pass 1 and spells out the synchronization contract with
ep_poll_callback().

ep_drain_tree() carries out Pass 2: walk the rbtree and ep_remove()
each epi, capturing rb_next() before each erase. The comment names
it as Pass 2 and documents the hand-off with a concurrent
eventpoll_release_file() (removal path C).

ep_clear_and_put() keeps the poll-on-ep wakeup, ep->mtx bracketing,
and ep_put() + conditional ep_free(), and its docblock shrinks to
the high-level summary; the per-pass detail moved into the helpers.

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 87 ++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 51 insertions(+), 36 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index e4a4e92d329f..eeddd05ba529 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1105,62 +1105,77 @@ static void ep_remove(struct eventpoll *ep, struct epitem *epi)
 }
 
 /*
- * Removal path B (see "Removal paths" in the top-of-file banner):
- * close of the epoll fd itself, reached via ep_eventpoll_release().
- *
- * Under ep->mtx we walk the rbtree twice:
- *
- *   Pass 1 drains each epi's pwqlist via ep_unregister_pollwait().
- *          This takes each watched waitqueue head's lock and so
- *          synchronizes with any in-flight ep_poll_callback(), so
- *          after the pass ends no callback can still be holding or
- *          about to dereference any epi on this ep.
- *
- *   Pass 2 runs ep_remove() on each epi. The per-epi pwqlist is
- *          already empty, but the rest of ep_remove() still runs
- *          (epi_fget() pin, f_ep clear under f_lock, rbtree erase,
- *          rdllist unlink, kfree_rcu).
- *
- * Pass 1 must strictly precede Pass 2: fusing them would let a
- * callback queued on epi_i still fire after epi_{i+k} was freed.
- *
- * A concurrent eventpoll_release_file() (path C) serializes against
- * us on ep->mtx; in Pass 2, ep_remove() transparently hands off any
- * epi whose watched file is in __fput() by bailing when epi_fget()
- * returns NULL, and C will clean that epi up on its side.
- *
- * ep->refcount is held > 0 throughout by the ep file's own share;
- * we drop that share after the walk and free the eventpoll if we
- * were last.
+ * Pass 1 of ep_clear_and_put(): drain every epi's pwqlist.
+ * ep_unregister_pollwait() takes each watched wait-queue head's lock,
+ * which synchronizes with any in-flight ep_poll_callback(); after
+ * this returns no callback can still be about to dereference an epi
+ * on this ep. Must strictly precede ep_drain_tree() -- fusing the
+ * two walks would let a callback queued on epi_i still fire after
+ * epi_{i+k} had already been freed.
  */
-static void ep_clear_and_put(struct eventpoll *ep)
+static void ep_drain_pollwaits(struct eventpoll *ep)
 {
-	struct rb_node *rbp, *next;
+	struct rb_node *rbp;
 	struct epitem *epi;
 
-	/* Release any threads blocked in poll-on-ep. */
-	if (waitqueue_active(&ep->poll_wait))
-		ep_poll_safewake(ep, NULL, 0);
-
-	mutex_lock(&ep->mtx);
+	lockdep_assert_held(&ep->mtx);
 
-	/* Pass 1: drain pwqlists; synchronizes with in-flight callbacks. */
 	for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = rb_next(rbp)) {
 		epi = rb_entry(rbp, struct epitem, rbn);
 
 		ep_unregister_pollwait(ep, epi);
 		cond_resched();
 	}
+}
+
+/*
+ * Pass 2 of ep_clear_and_put(): ep_remove() every epi. The per-epi
+ * pwqlist is already empty (ep_drain_pollwaits ran), but the rest of
+ * ep_remove() still runs: epi_fget() pin, f_ep clear under f_lock,
+ * rbtree erase, rdllist unlink, kfree_rcu(epi). rb_next() is captured
+ * before each erase so the iteration is stable.
+ *
+ * A concurrent eventpoll_release_file() (removal path C) on a watched
+ * file serializes with us via ep->mtx; ep_remove() transparently
+ * hands off any epi whose file is in __fput() by bailing when
+ * epi_fget() returns NULL, and path C will clean that epi up.
+ */
+static void ep_drain_tree(struct eventpoll *ep)
+{
+	struct rb_node *rbp, *next;
+	struct epitem *epi;
+
+	lockdep_assert_held(&ep->mtx);
 
-	/* Pass 2: remove each epi. rb_next() is captured before erase. */
 	for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = next) {
 		next = rb_next(rbp);
 		epi = rb_entry(rbp, struct epitem, rbn);
 		ep_remove(ep, epi);
 		cond_resched();
 	}
+}
+
+/*
+ * Removal path B (see "Removal paths" in the top-of-file banner):
+ * close of the epoll fd itself, reached via ep_eventpoll_release().
+ *
+ * Two passes under ep->mtx: first ep_drain_pollwaits() quiesces
+ * in-flight callbacks, then ep_drain_tree() frees the epis. The
+ * ep->refcount is kept > 0 across the walk by the ep file's own
+ * share, which we drop below; ep_free() runs iff we were the last
+ * holder after the tree drained.
+ */
+static void ep_clear_and_put(struct eventpoll *ep)
+{
+	/* Release any threads blocked in poll-on-ep. */
+	if (waitqueue_active(&ep->poll_wait))
+		ep_poll_safewake(ep, NULL, 0);
 
+	mutex_lock(&ep->mtx);
+	ep_drain_pollwaits(ep);
+	ep_drain_tree(ep);
 	mutex_unlock(&ep->mtx);
+
 	if (ep_put(ep))
 		ep_free(ep);
 }

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 12/17] eventpoll: extract ep_deliver_event() from ep_send_events()
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (10 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 11/17] eventpoll: split ep_clear_and_put() into drain helpers Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 13/17] eventpoll: extract lock dance from do_epoll_ctl() into ep_ctl_lock() Christian Brauner
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

ep_send_events()'s body covered two concerns: per-item work (PM
wakeup-source bookkeeping, re-poll, copy_to_user, level-trigger
re-queue, EPOLLONESHOT mask clear) and the scan-level accumulator
(maxevents cap, EFAULT preservation, txlist/rdllist splice).

Extract the per-item work as ep_deliver_event(), which returns a
tri-state int:

  1       one event was delivered; caller advances the counter,
  0       re-poll produced no caller-requested events (item drops
          out of the ready list; a future callback will re-queue),
 -EFAULT  copy_to_user() faulted; item is already re-inserted at
          the head of the txlist so ep_done_scan() splices it back
          to rdllist.

The per-item comments (PM ordering, the "sole writer to rdllist"
invariant for the LT re-queue, the EFAULT semantics) move into
ep_deliver_event(). ep_send_events() reduces to the fatal-signal
short-circuit, scan bracket, and a short txlist walk that accumulates
the deliveries and preserves the "first error wins" EFAULT contract
(res = delivered only if no event was previously delivered; otherwise
the success count is returned and -EFAULT is reported on the next
call).

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 138 +++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 84 insertions(+), 54 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index eeddd05ba529..6d4167a347ab 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1979,6 +1979,82 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	return 0;
 }
 
+/*
+ * Attempt to deliver one event for @epi into @*uevents.
+ *
+ * Returns 1 if an event was delivered (with *uevents advanced to the
+ * next slot), 0 if the re-poll reported no caller-requested events
+ * (@epi drops out of the ready list; a future callback will re-add
+ * it), or -EFAULT if copy_to_user() faulted (in which case @epi is
+ * re-inserted at the head of @txlist so ep_done_scan() merges it
+ * back to rdllist for the next attempt).
+ *
+ * PM bookkeeping and level-triggered re-queue are handled here.
+ * Caller holds ep->mtx and the scan is active.
+ */
+static int ep_deliver_event(struct eventpoll *ep, struct epitem *epi,
+			    poll_table *pt,
+			    struct epoll_event __user **uevents,
+			    struct list_head *txlist)
+{
+	struct epoll_event __user *next;
+	struct wakeup_source *ws;
+	__poll_t revents;
+
+	/*
+	 * Activate ep->ws before deactivating epi->ws to prevent
+	 * triggering auto-suspend here (in case we reactivate epi->ws
+	 * below).  Rearranging to delay the deactivation would let
+	 * epi->ws drift out of sync with ep_is_linked().
+	 */
+	ws = ep_wakeup_source(epi);
+	if (ws) {
+		if (ws->active)
+			__pm_stay_awake(ep->ws);
+		__pm_relax(ws);
+	}
+
+	list_del_init(&epi->rdllink);
+
+	/*
+	 * Re-poll under ep->mtx so userspace cannot change the item
+	 * out from under us. If no caller-requested events remain,
+	 * @epi stays off the ready list; the poll callback will
+	 * re-queue it when events next appear.
+	 */
+	revents = ep_item_poll(epi, pt, 1);
+	if (!revents)
+		return 0;
+
+	next = epoll_put_uevent(revents, epi->event.data, *uevents);
+	if (!next) {
+		/*
+		 * copy_to_user() faulted: put the item back so
+		 * ep_done_scan() splices it onto rdllist for the next
+		 * attempt.
+		 */
+		list_add(&epi->rdllink, txlist);
+		ep_pm_stay_awake(epi);
+		return -EFAULT;
+	}
+	*uevents = next;
+
+	if (epi->event.events & EPOLLONESHOT) {
+		epi->event.events &= EP_PRIVATE_BITS;
+	} else if (!(epi->event.events & EPOLLET)) {
+		/*
+		 * Level-triggered: re-queue so the next epoll_wait()
+		 * rechecks availability. We are the sole writer to
+		 * rdllist here -- epoll_ctl() callers are locked out
+		 * by ep->mtx, and the poll callback queues to ovflist
+		 * during scans.
+		 */
+		list_add_tail(&epi->rdllink, &ep->rdllist);
+		ep_pm_stay_awake(epi);
+	}
+	return 1;
+}
+
 static int ep_send_events(struct eventpoll *ep,
 			  struct epoll_event __user *events, int maxevents)
 {
@@ -2001,70 +2077,24 @@ static int ep_send_events(struct eventpoll *ep,
 	ep_start_scan(ep, &txlist);
 
 	/*
-	 * We can loop without lock because we are passed a task private list.
-	 * Items cannot vanish during the loop we are holding ep->mtx.
+	 * We can loop without lock because we are passed a task-private
+	 * txlist; items cannot vanish while we hold ep->mtx.
 	 */
 	list_for_each_entry_safe(epi, tmp, &txlist, rdllink) {
-		struct wakeup_source *ws;
-		__poll_t revents;
+		int delivered;
 
 		if (res >= maxevents)
 			break;
 
-		/*
-		 * Activate ep->ws before deactivating epi->ws to prevent
-		 * triggering auto-suspend here (in case we reactive epi->ws
-		 * below).
-		 *
-		 * This could be rearranged to delay the deactivation of epi->ws
-		 * instead, but then epi->ws would temporarily be out of sync
-		 * with ep_is_linked().
-		 */
-		ws = ep_wakeup_source(epi);
-		if (ws) {
-			if (ws->active)
-				__pm_stay_awake(ep->ws);
-			__pm_relax(ws);
-		}
-
-		list_del_init(&epi->rdllink);
-
-		/*
-		 * If the event mask intersect the caller-requested one,
-		 * deliver the event to userspace. Again, we are holding ep->mtx,
-		 * so no operations coming from userspace can change the item.
-		 */
-		revents = ep_item_poll(epi, &pt, 1);
-		if (!revents)
-			continue;
-
-		events = epoll_put_uevent(revents, epi->event.data, events);
-		if (!events) {
-			list_add(&epi->rdllink, &txlist);
-			ep_pm_stay_awake(epi);
+		delivered = ep_deliver_event(ep, epi, &pt, &events, &txlist);
+		if (delivered < 0) {
 			if (!res)
-				res = -EFAULT;
+				res = delivered;
 			break;
 		}
-		res++;
-		if (epi->event.events & EPOLLONESHOT)
-			epi->event.events &= EP_PRIVATE_BITS;
-		else if (!(epi->event.events & EPOLLET)) {
-			/*
-			 * If this file has been added with Level
-			 * Trigger mode, we need to insert back inside
-			 * the ready list, so that the next call to
-			 * epoll_wait() will check again the events
-			 * availability. At this point, no one can insert
-			 * into ep->rdllist besides us. The epoll_ctl()
-			 * callers are locked out by
-			 * ep_send_events() holding "mtx" and the
-			 * poll callback will queue them in ep->ovflist.
-			 */
-			list_add_tail(&epi->rdllink, &ep->rdllist);
-			ep_pm_stay_awake(epi);
-		}
+		res += delivered;
 	}
+
 	ep_done_scan(ep, &txlist);
 	mutex_unlock(&ep->mtx);
 

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 13/17] eventpoll: extract lock dance from do_epoll_ctl() into ep_ctl_lock()
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (11 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 12/17] eventpoll: extract ep_deliver_event() from ep_send_events() Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 14/17] eventpoll: wrap EP_UNACTIVE_PTR in typed sentinel helpers Christian Brauner
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

do_epoll_ctl() interleaved three concerns in one body: input
validation, the ep->mtx + epnested_mutex acquisition dance for
EPOLL_CTL_ADD on potentially-nested topologies, and the op dispatch
with final unlock. The middle concern is the error-prone one; the
error_tgt_fput label existed mainly to orchestrate it.

Extract the acquisition as ep_ctl_lock() and the release as
ep_ctl_unlock(). ep_ctl_lock() always takes ep->mtx and, for
EPOLL_CTL_ADD on a topology that can change, additionally runs the
loop / path check under epnested_mutex. The return value is a
ternary:

   0        ep->mtx held.
   1        ep->mtx AND epnested_mutex held (full-check mode).
   -errno   failure, no locks held.

The non-negative value doubles as the @full_check argument to
ep_insert() and as the argument to ep_ctl_unlock(), so the caller
neither needs an out-parameter nor a separate boolean:

   full_check = ep_ctl_lock(ep, op, epfile, tfile, nonblock);
   if (full_check < 0)
       return full_check;
   ...
   ep_ctl_unlock(ep, full_check);

ep_ctl_unlock() drops ep->mtx and, if full_check == 1, clears
tfile_check_list, bumps loop_check_gen, and drops epnested_mutex --
mirroring the old error_tgt_fput block.

With that in place do_epoll_ctl()'s preconditions become direct
returns (no locks held, nothing to clean up), the acquisition is a
single call, the op dispatch is unchanged, and the epilogue is a
single ep_ctl_unlock() before return. The error_tgt_fput label goes
away.

The two loop_check_gen bumps (one at the start of the full check,
one after) are preserved inside ep_ctl_lock() / ep_ctl_unlock(),
keeping the invariant that ep->gen stamps left on per-eventpoll
caches never equal loop_check_gen after the check completes.

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 152 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 91 insertions(+), 61 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6d4167a347ab..d49457dc8c7f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2479,14 +2479,92 @@ static inline int epoll_mutex_lock(struct mutex *mutex, bool nonblock)
 	return mutex_trylock(mutex) ? 0 : -EAGAIN;
 }
 
+/*
+ * Acquire the locks required for do_epoll_ctl() on @ep for @op.
+ *
+ * Always takes ep->mtx. For EPOLL_CTL_ADD, additionally runs the
+ * loop / path check under epnested_mutex when the topology can
+ * change: @ep is already watched (epfile->f_ep non-NULL), @ep was
+ * recently loop-checked (ep->gen == loop_check_gen), or @tfile is
+ * itself an eventpoll.
+ *
+ * Return value encodes both outcome and lock state:
+ *
+ *   0        success; ep->mtx held.
+ *   1        success; ep->mtx held AND the full check ran under
+ *            epnested_mutex (which is also still held). The value
+ *            doubles as the @full_check argument to ep_insert().
+ *   -errno   failure; no locks held.
+ *
+ * The caller releases what was taken with ep_ctl_unlock(ep, ret).
+ *
+ * Holding epnested_mutex on add is what prevents two racing
+ * EPOLL_CTL_ADDs on different eps from building a cycle without
+ * either walker observing it.
+ */
+static int ep_ctl_lock(struct eventpoll *ep, int op,
+		       struct file *epfile, struct file *tfile,
+		       bool nonblock)
+{
+	struct eventpoll *tep;
+	int error;
+
+	error = epoll_mutex_lock(&ep->mtx, nonblock);
+	if (error)
+		return error;
+
+	if (op != EPOLL_CTL_ADD)
+		return 0;
+	if (!READ_ONCE(epfile->f_ep) && ep->gen != loop_check_gen &&
+	    !is_file_epoll(tfile))
+		return 0;
+
+	/* Full check needed: drop ep->mtx so we can take epnested_mutex. */
+	mutex_unlock(&ep->mtx);
+	error = epoll_mutex_lock(&epnested_mutex, nonblock);
+	if (error)
+		return error;
+
+	loop_check_gen++;
+
+	if (is_file_epoll(tfile)) {
+		tep = tfile->private_data;
+		if (ep_loop_check(ep, tep) != 0) {
+			error = -ELOOP;
+			goto err_unlock_nested;
+		}
+	}
+
+	error = epoll_mutex_lock(&ep->mtx, nonblock);
+	if (error)
+		goto err_unlock_nested;
+
+	return 1;
+
+err_unlock_nested:
+	clear_tfile_check_list();
+	loop_check_gen++;
+	mutex_unlock(&epnested_mutex);
+	return error;
+}
+
+static void ep_ctl_unlock(struct eventpoll *ep, int full_check)
+{
+	mutex_unlock(&ep->mtx);
+	if (full_check) {
+		clear_tfile_check_list();
+		loop_check_gen++;
+		mutex_unlock(&epnested_mutex);
+	}
+}
+
 int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 		 bool nonblock)
 {
 	int error;
-	int full_check = 0;
+	int full_check;
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct eventpoll *tep = NULL;
 
 	CLASS(fd, f)(epfd);
 	if (fd_empty(f))
@@ -2506,76 +2584,34 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 		ep_take_care_of_epollwakeup(epds);
 
 	/*
-	 * We have to check that the file structure underneath the file descriptor
-	 * the user passed to us _is_ an eventpoll file. And also we do not permit
+	 * The @epfd file must itself be an eventpoll, and we do not permit
 	 * adding an epoll file descriptor inside itself.
 	 */
-	error = -EINVAL;
 	if (fd_file(f) == fd_file(tf) || !is_file_epoll(fd_file(f)))
-		goto error_tgt_fput;
+		return -EINVAL;
 
 	/*
 	 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
 	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
-	 * Also, we do not currently supported nested exclusive wakeups.
+	 * Also, nested exclusive wakeups are not supported.
 	 */
 	if (ep_op_has_event(op) && (epds->events & EPOLLEXCLUSIVE)) {
 		if (op == EPOLL_CTL_MOD)
-			goto error_tgt_fput;
+			return -EINVAL;
 		if (op == EPOLL_CTL_ADD && (is_file_epoll(fd_file(tf)) ||
 				(epds->events & ~EPOLLEXCLUSIVE_OK_BITS)))
-			goto error_tgt_fput;
+			return -EINVAL;
 	}
 
-	/*
-	 * At this point it is safe to assume that the "private_data" contains
-	 * our own data structure.
-	 */
 	ep = fd_file(f)->private_data;
 
-	/*
-	 * When we insert an epoll file descriptor inside another epoll file
-	 * descriptor, there is the chance of creating closed loops, which are
-	 * better be handled here, than in more critical paths. While we are
-	 * checking for loops we also determine the list of files reachable
-	 * and hang them on the tfile_check_list, so we can check that we
-	 * haven't created too many possible wakeup paths.
-	 *
-	 * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when
-	 * the epoll file descriptor is attaching directly to a wakeup source,
-	 * unless the epoll file descriptor is nested. The purpose of taking the
-	 * 'epnested_mutex' on add is to prevent complex toplogies such as loops and
-	 * deep wakeup paths from forming in parallel through multiple
-	 * EPOLL_CTL_ADD operations.
-	 */
-	error = epoll_mutex_lock(&ep->mtx, nonblock);
-	if (error)
-		goto error_tgt_fput;
-	if (op == EPOLL_CTL_ADD) {
-		if (READ_ONCE(fd_file(f)->f_ep) || ep->gen == loop_check_gen ||
-		    is_file_epoll(fd_file(tf))) {
-			mutex_unlock(&ep->mtx);
-			error = epoll_mutex_lock(&epnested_mutex, nonblock);
-			if (error)
-				goto error_tgt_fput;
-			loop_check_gen++;
-			full_check = 1;
-			if (is_file_epoll(fd_file(tf))) {
-				tep = fd_file(tf)->private_data;
-				error = -ELOOP;
-				if (ep_loop_check(ep, tep) != 0)
-					goto error_tgt_fput;
-			}
-			error = epoll_mutex_lock(&ep->mtx, nonblock);
-			if (error)
-				goto error_tgt_fput;
-		}
-	}
+	full_check = ep_ctl_lock(ep, op, fd_file(f), fd_file(tf), nonblock);
+	if (full_check < 0)
+		return full_check;
 
 	/*
-	 * Try to lookup the file inside our RB tree. Since we grabbed "mtx"
-	 * above, we can be sure to be able to use the item looked up by
-	 * ep_find() till we release the mutex.
+	 * Look the target up in ep's RB tree. We hold ep->mtx, so the
+	 * item stays valid until we release.
 	 */
 	epi = ep_find(ep, fd_file(tf), fd);
 
@@ -2610,14 +2646,8 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 			error = -ENOENT;
 		break;
 	}
-	mutex_unlock(&ep->mtx);
 
-error_tgt_fput:
-	if (full_check) {
-		clear_tfile_check_list();
-		loop_check_gen++;
-		mutex_unlock(&epnested_mutex);
-	}
+	ep_ctl_unlock(ep, full_check);
 	return error;
 }
 

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 14/17] eventpoll: wrap EP_UNACTIVE_PTR in typed sentinel helpers
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (12 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 13/17] eventpoll: extract lock dance from do_epoll_ctl() into ep_ctl_lock() Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 15/17] eventpoll: rename epi->next and txlist for clarity Christian Brauner
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

ep->ovflist and epi->next both use EP_UNACTIVE_PTR (a cast to
(void *)-1) as a sentinel, with distinct meanings at each site:

  ep->ovflist == EP_UNACTIVE_PTR         no scan in progress
  epi->next   == EP_UNACTIVE_PTR         epi not on ovflist

Call sites had to know the sentinel's value and, by convention, what
it meant in each context. Hide both behind inline helpers:

  ep_is_scanning(ep)       predicate for "scan in progress"
  ep_enter_scan(ep)        WRITE_ONCE flip to NULL (scan start)
  ep_exit_scan(ep)         WRITE_ONCE flip to sentinel (scan end)
  epi_on_ovflist(epi)      predicate for "epi is on ovflist"
  epi_clear_ovflist(epi)   clear epi's ovflist link slot

Convert ep_events_available(), ep_start_scan(), ep_done_scan(),
ep_poll_callback(), and ep_alloc_epitem() to use the wrappers. The
ovflist state-machine transitions are now named, not encoded in
sentinel comparisons, and the top-of-file "Ready-list state machine"
section is the single place that spells out the sentinel's meaning.

ep_alloc() keeps the raw "ep->ovflist = EP_UNACTIVE_PTR" init (no
concurrent access at that point) with an inline "not scanning"
comment, and the tfile_check_list sentinel is left alone -- it will
disappear entirely when the loop-check globals move into a
stack-allocated ep_ctl_ctx in a later commit.

Also rework ep_done_scan()'s for-loop: the combined initializer +
update clause that advanced nepi AND cleared epi->next in one step
was clever but hard to read; splitting the update into two
statements inside the body makes the epi_clear_ovflist() call
visible.

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 73 +++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 21 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d49457dc8c7f..4199ef8e42e5 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -541,6 +541,43 @@ static inline struct epitem *ep_item_from_wait(wait_queue_entry_t *p)
 	return container_of(p, struct eppoll_entry, wait)->base;
 }
 
+/*
+ * Ready-list / ovflist state (see "Ready-list state machine" in the
+ * top-of-file banner for the full state machine). EP_UNACTIVE_PTR is
+ * the sentinel; these wrappers name each transition and each test so
+ * call sites do not need to know the sentinel's value.
+ */
+
+/* True iff @ep is between ep_enter_scan() and ep_exit_scan(). */
+static inline bool ep_is_scanning(struct eventpoll *ep)
+{
+	return READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+}
+
+/* Called by ep_start_scan(): divert ep_poll_callback() to ovflist. */
+static inline void ep_enter_scan(struct eventpoll *ep)
+{
+	WRITE_ONCE(ep->ovflist, NULL);
+}
+
+/* Called by ep_done_scan(): redirect ep_poll_callback() back to rdllist. */
+static inline void ep_exit_scan(struct eventpoll *ep)
+{
+	WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
+}
+
+/* True iff @epi is currently linked on its ep's ovflist. */
+static inline bool epi_on_ovflist(const struct epitem *epi)
+{
+	return epi->next != EP_UNACTIVE_PTR;
+}
+
+/* Mark @epi as not on any ovflist (init and post-drain). */
+static inline void epi_clear_ovflist(struct epitem *epi)
+{
+	epi->next = EP_UNACTIVE_PTR;
+}
+
 /**
  * ep_events_available - Checks if ready events might be available.
  *
@@ -551,8 +588,7 @@ static inline struct epitem *ep_item_from_wait(wait_queue_entry_t *p)
  */
 static inline int ep_events_available(struct eventpoll *ep)
 {
-	return !list_empty_careful(&ep->rdllist) ||
-		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+	return !list_empty_careful(&ep->rdllist) || ep_is_scanning(ep);
 }
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -910,7 +946,7 @@ static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
 	lockdep_assert_irqs_enabled();
 	spin_lock_irq(&ep->lock);
 	list_splice_init(&ep->rdllist, txlist);
-	WRITE_ONCE(ep->ovflist, NULL);
+	ep_enter_scan(ep);
 	spin_unlock_irq(&ep->lock);
 }
 
@@ -925,29 +961,24 @@ static void ep_done_scan(struct eventpoll *ep,
 	 * other events might have been queued by the poll callback.
 	 * We re-insert them inside the main ready-list here.
 	 */
-	for (nepi = READ_ONCE(ep->ovflist); (epi = nepi) != NULL;
-	     nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
+	for (nepi = READ_ONCE(ep->ovflist); (epi = nepi) != NULL; ) {
+		nepi = epi->next;
+		epi_clear_ovflist(epi);
 		/*
-		 * We need to check if the item is already in the list.
-		 * During the "sproc" callback execution time, items are
-		 * queued into ->ovflist but the "txlist" might already
-		 * contain them, and the list_splice() below takes care of them.
+		 * Skip items that the caller already returned via @txlist
+		 * -- the list_splice() below takes care of those.
 		 */
 		if (!ep_is_linked(epi)) {
 			/*
-			 * ->ovflist is LIFO, so we have to reverse it in order
-			 * to keep in FIFO.
+			 * ovflist is LIFO; list_add() head-insert here
+			 * reverses the iteration order into FIFO.
 			 */
 			list_add(&epi->rdllink, &ep->rdllist);
 			ep_pm_stay_awake(epi);
 		}
 	}
-	/*
-	 * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
-	 * releasing the lock, events will be queued in the normal way inside
-	 * ep->rdllist.
-	 */
-	WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
+	/* Back out of scan mode; callbacks target ep->rdllist again. */
+	ep_exit_scan(ep);
 
 	/*
 	 * Quickly re-inject items left on "txlist".
@@ -1376,7 +1407,7 @@ static int ep_alloc(struct eventpoll **pep)
 	init_waitqueue_head(&ep->poll_wait);
 	INIT_LIST_HEAD(&ep->rdllist);
 	ep->rbr = RB_ROOT_CACHED;
-	ep->ovflist = EP_UNACTIVE_PTR;
+	ep->ovflist = EP_UNACTIVE_PTR;	/* not scanning */
 	ep->user = get_current_user();
 	refcount_set(&ep->refcount, 1);
 
@@ -1456,8 +1487,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 * semantics). All the events that happen during that period of time are
 	 * chained in ep->ovflist and requeued later on.
 	 */
-	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
-		if (epi->next == EP_UNACTIVE_PTR) {
+	if (ep_is_scanning(ep)) {
+		if (!epi_on_ovflist(epi)) {
 			epi->next = READ_ONCE(ep->ovflist);
 			WRITE_ONCE(ep->ovflist, epi);
 			ep_pm_stay_awake_rcu(epi);
@@ -1771,7 +1802,7 @@ static struct epitem *ep_alloc_epitem(struct eventpoll *ep,
 	epi->ep = ep;
 	ep_set_ffd(&epi->ffd, tfile, fd);
 	epi->event = *event;
-	epi->next = EP_UNACTIVE_PTR;
+	epi_clear_ovflist(epi);
 
 	return epi;
 }

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 15/17] eventpoll: rename epi->next and txlist for clarity
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (13 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 14/17] eventpoll: wrap EP_UNACTIVE_PTR in typed sentinel helpers Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 16:06   ` Linus Torvalds
  2026-04-24 13:46 ` [PATCH 16/17] eventpoll: use bool for predicate helpers Christian Brauner
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

Two list-related names were confusing in isolation:

  struct epitem::next
    A singly-linked link slot used only when an epi is queued on
    ep->ovflist during an ep_start_scan/ep_done_scan window. The
    bare name "next" suggests a generic list link and doesn't say
    which list it belongs to.

  txlist
    The caller-local list_head used by ep_send_events() and
    __ep_eventpoll_poll() to hold the batch of items stolen from
    ep->rdllist for the current scan. "txlist" ("transmission
    list") is abbreviated and overloaded: it doesn't distinguish
    itself from ep->rdllist or ep->ovflist at a glance.

Rename for what each actually is:

  struct epitem::next   -> struct epitem::ovflist_next
  local txlist          -> scan_batch

With these in place:
  - epi->ovflist_next reads as "this is the ep->ovflist link slot",
    matching the rdllink pattern above it.
  - scan_batch reads as "the batch currently being scanned", clearly
    distinct from rdllist (canonical ready list) and ovflist
    (scan-window overflow).

ep->rdllist and ep->ovflist struct field names are preserved -- they
are long-standing interface-facing identifiers, and the new inline
helpers (ep_is_scanning, epi_on_ovflist, ...) already hide the
sentinel semantics at call sites.

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 62 ++++++++++++++++++++++++++++++----------------------------
 1 file changed, 32 insertions(+), 30 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4199ef8e42e5..7ed4b47279ff 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -142,15 +142,15 @@
  *   NULL            - scan active, no spill yet.
  *   pointer to epi  - scan active with spilled items (LIFO).
  *
- * Encoded in epi->next:
+ * Encoded in epi->ovflist_next:
  *   EP_UNACTIVE_PTR - epi is not on ovflist.
  *   otherwise       - next epi on ovflist (NULL at tail).
  *
  * ep_start_scan() flips "not scanning" to "scanning" and splices
- * rdllist into a caller-local txlist. ep_done_scan() drains ovflist
+ * rdllist into a caller-local scan_batch. ep_done_scan() drains ovflist
  * back to rdllist (list_add head-insert reverses LIFO to FIFO),
  * flips back to "not scanning", and re-splices any items the caller
- * left in txlist (e.g., level-triggered re-queues).
+ * left in scan_batch (e.g., level-triggered re-queues).
  *
  *
  * Removal paths
@@ -261,14 +261,16 @@ struct epitem {
 		struct rcu_head rcu;
 	};
 
-	/* List header used to link this structure to the eventpoll ready list */
+	/* Link on the owning eventpoll's ready list (ep->rdllist). */
 	struct list_head rdllink;
 
 	/*
-	 * Works together "struct eventpoll"->ovflist in keeping the
-	 * single linked chain of items.
+	 * Link on the owning eventpoll's scan-overflow list (ep->ovflist),
+	 * EP_UNACTIVE_PTR when not linked. See epi_on_ovflist() /
+	 * epi_clear_ovflist() and the "Ready-list state machine" section
+	 * in the top-of-file banner.
 	 */
-	struct epitem *next;
+	struct epitem *ovflist_next;
 
 	/* The file descriptor information this item refers to */
 	struct epoll_filefd ffd;
@@ -569,13 +571,13 @@ static inline void ep_exit_scan(struct eventpoll *ep)
 /* True iff @epi is currently linked on its ep's ovflist. */
 static inline bool epi_on_ovflist(const struct epitem *epi)
 {
-	return epi->next != EP_UNACTIVE_PTR;
+	return epi->ovflist_next != EP_UNACTIVE_PTR;
 }
 
 /* Mark @epi as not on any ovflist (init and post-drain). */
 static inline void epi_clear_ovflist(struct epitem *epi)
 {
-	epi->next = EP_UNACTIVE_PTR;
+	epi->ovflist_next = EP_UNACTIVE_PTR;
 }
 
 /**
@@ -933,7 +935,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
  * ep->mutex needs to be held because we could be hit by
  * eventpoll_release_file() and epoll_ctl().
  */
-static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
+static void ep_start_scan(struct eventpoll *ep, struct list_head *scan_batch)
 {
 	/*
 	 * Steal the ready list, and re-init the original one to the
@@ -945,13 +947,13 @@ static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
 	 */
 	lockdep_assert_irqs_enabled();
 	spin_lock_irq(&ep->lock);
-	list_splice_init(&ep->rdllist, txlist);
+	list_splice_init(&ep->rdllist, scan_batch);
 	ep_enter_scan(ep);
 	spin_unlock_irq(&ep->lock);
 }
 
 static void ep_done_scan(struct eventpoll *ep,
-			 struct list_head *txlist)
+			 struct list_head *scan_batch)
 {
 	struct epitem *epi, *nepi;
 
@@ -962,10 +964,10 @@ static void ep_done_scan(struct eventpoll *ep,
 	 * We re-insert them inside the main ready-list here.
 	 */
 	for (nepi = READ_ONCE(ep->ovflist); (epi = nepi) != NULL; ) {
-		nepi = epi->next;
+		nepi = epi->ovflist_next;
 		epi_clear_ovflist(epi);
 		/*
-		 * Skip items that the caller already returned via @txlist
+		 * Skip items that the caller already returned via @scan_batch
 		 * -- the list_splice() below takes care of those.
 		 */
 		if (!ep_is_linked(epi)) {
@@ -981,9 +983,9 @@ static void ep_done_scan(struct eventpoll *ep,
 	ep_exit_scan(ep);
 
 	/*
-	 * Quickly re-inject items left on "txlist".
+	 * Quickly re-inject items left on "scan_batch".
 	 */
-	list_splice(txlist, &ep->rdllist);
+	list_splice(scan_batch, &ep->rdllist);
 	__pm_relax(ep->ws);
 
 	if (!list_empty(&ep->rdllist)) {
@@ -1247,7 +1249,7 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, int depth
 static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int depth)
 {
 	struct eventpoll *ep = file->private_data;
-	LIST_HEAD(txlist);
+	LIST_HEAD(scan_batch);
 	struct epitem *epi, *tmp;
 	poll_table pt;
 	__poll_t res = 0;
@@ -1262,8 +1264,8 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep
 	 * the ready list.
 	 */
 	mutex_lock_nested(&ep->mtx, depth);
-	ep_start_scan(ep, &txlist);
-	list_for_each_entry_safe(epi, tmp, &txlist, rdllink) {
+	ep_start_scan(ep, &scan_batch);
+	list_for_each_entry_safe(epi, tmp, &scan_batch, rdllink) {
 		if (ep_item_poll(epi, &pt, depth + 1)) {
 			res = EPOLLIN | EPOLLRDNORM;
 			break;
@@ -1277,7 +1279,7 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep
 			list_del_init(&epi->rdllink);
 		}
 	}
-	ep_done_scan(ep, &txlist);
+	ep_done_scan(ep, &scan_batch);
 	mutex_unlock(&ep->mtx);
 	return res;
 }
@@ -1489,7 +1491,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 */
 	if (ep_is_scanning(ep)) {
 		if (!epi_on_ovflist(epi)) {
-			epi->next = READ_ONCE(ep->ovflist);
+			epi->ovflist_next = READ_ONCE(ep->ovflist);
 			WRITE_ONCE(ep->ovflist, epi);
 			ep_pm_stay_awake_rcu(epi);
 		}
@@ -2017,7 +2019,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
  * next slot), 0 if the re-poll reported no caller-requested events
  * (@epi drops out of the ready list; a future callback will re-add
  * it), or -EFAULT if copy_to_user() faulted (in which case @epi is
- * re-inserted at the head of @txlist so ep_done_scan() merges it
+ * re-inserted at the head of @scan_batch so ep_done_scan() merges it
  * back to rdllist for the next attempt).
  *
  * PM bookkeeping and level-triggered re-queue are handled here.
@@ -2026,7 +2028,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 static int ep_deliver_event(struct eventpoll *ep, struct epitem *epi,
 			    poll_table *pt,
 			    struct epoll_event __user **uevents,
-			    struct list_head *txlist)
+			    struct list_head *scan_batch)
 {
 	struct epoll_event __user *next;
 	struct wakeup_source *ws;
@@ -2064,7 +2066,7 @@ static int ep_deliver_event(struct eventpoll *ep, struct epitem *epi,
 		 * ep_done_scan() splices it onto rdllist for the next
 		 * attempt.
 		 */
-		list_add(&epi->rdllink, txlist);
+		list_add(&epi->rdllink, scan_batch);
 		ep_pm_stay_awake(epi);
 		return -EFAULT;
 	}
@@ -2090,7 +2092,7 @@ static int ep_send_events(struct eventpoll *ep,
 			  struct epoll_event __user *events, int maxevents)
 {
 	struct epitem *epi, *tmp;
-	LIST_HEAD(txlist);
+	LIST_HEAD(scan_batch);
 	poll_table pt;
 	int res = 0;
 
@@ -2105,19 +2107,19 @@ static int ep_send_events(struct eventpoll *ep,
 	init_poll_funcptr(&pt, NULL);
 
 	mutex_lock(&ep->mtx);
-	ep_start_scan(ep, &txlist);
+	ep_start_scan(ep, &scan_batch);
 
 	/*
 	 * We can loop without lock because we are passed a task-private
-	 * txlist; items cannot vanish while we hold ep->mtx.
+	 * scan_batch; items cannot vanish while we hold ep->mtx.
 	 */
-	list_for_each_entry_safe(epi, tmp, &txlist, rdllink) {
+	list_for_each_entry_safe(epi, tmp, &scan_batch, rdllink) {
 		int delivered;
 
 		if (res >= maxevents)
 			break;
 
-		delivered = ep_deliver_event(ep, epi, &pt, &events, &txlist);
+		delivered = ep_deliver_event(ep, epi, &pt, &events, &scan_batch);
 		if (delivered < 0) {
 			if (!res)
 				res = delivered;
@@ -2126,7 +2128,7 @@ static int ep_send_events(struct eventpoll *ep,
 		res += delivered;
 	}
 
-	ep_done_scan(ep, &txlist);
+	ep_done_scan(ep, &scan_batch);
 	mutex_unlock(&ep->mtx);
 
 	return res;

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 16/17] eventpoll: use bool for predicate helpers
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (14 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 15/17] eventpoll: rename epi->next and txlist for clarity Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 13:46 ` [PATCH 17/17] eventpoll: hoist CTL_ADD scratch state into struct ep_ctl_ctx Christian Brauner
  2026-04-24 15:33 ` [PATCH 00/17] eventpoll: clarity refactor Linus Torvalds
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

Three inline predicates -- is_file_epoll(), ep_is_linked(),
ep_events_available() -- were declared to return int even though
their only use is as a truthy test and their bodies are already
boolean expressions. ep_has_wakeup_source(), in the same file,
returns bool, so the convention was already inconsistent.

Convert all three to return bool. Rewrite ep_events_available()'s
verbose kerneldoc to the same one-line style the rest of the
predicates use now.

ep_poll()'s local eavail variable stores the result of
ep_events_available() (already boolean), ep_busy_loop() (returns
bool), and list_empty() (int but tested as boolean). Split it out
of the combined int declaration and give it bool type; replace the
one "eavail = 1" after a wakeup with "eavail = true" to match.

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7ed4b47279ff..201e688304b3 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -505,7 +505,7 @@ static void __init epoll_sysctls_init(void)
 
 static const struct file_operations eventpoll_fops;
 
-static inline int is_file_epoll(struct file *f)
+static inline bool is_file_epoll(struct file *f)
 {
 	return f->f_op == &eventpoll_fops;
 }
@@ -526,8 +526,8 @@ static inline int ep_cmp_ffd(struct epoll_filefd *p1,
 	        (p1->file < p2->file ? -1 : p1->fd - p2->fd));
 }
 
-/* Tells us if the item is currently linked */
-static inline int ep_is_linked(struct epitem *epi)
+/* True iff @epi is on its owning ep's ready list. */
+static inline bool ep_is_linked(struct epitem *epi)
 {
 	return !list_empty(&epi->rdllink);
 }
@@ -580,15 +580,8 @@ static inline void epi_clear_ovflist(struct epitem *epi)
 	epi->ovflist_next = EP_UNACTIVE_PTR;
 }
 
-/**
- * ep_events_available - Checks if ready events might be available.
- *
- * @ep: Pointer to the eventpoll context.
- *
- * Return: a value different than %zero if ready events are available,
- *          or %zero otherwise.
- */
-static inline int ep_events_available(struct eventpoll *ep)
+/* True iff @ep has ready events that epoll_wait() might harvest. */
+static inline bool ep_events_available(struct eventpoll *ep)
 {
 	return !list_empty_careful(&ep->rdllist) || ep_is_scanning(ep);
 }
@@ -2218,7 +2211,8 @@ static int ep_schedule_timeout(ktime_t *to)
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		   int maxevents, struct timespec64 *timeout)
 {
-	int res, eavail, timed_out = 0;
+	int res, timed_out = 0;
+	bool eavail;
 	u64 slack = 0;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;
@@ -2316,7 +2310,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * If timed out and still on the wait queue, recheck eavail
 		 * carefully under lock, below.
 		 */
-		eavail = 1;
+		eavail = true;
 
 		if (!list_empty_careful(&wait.entry)) {
 			spin_lock_irq(&ep->lock);

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 17/17] eventpoll: hoist CTL_ADD scratch state into struct ep_ctl_ctx
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (15 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 16/17] eventpoll: use bool for predicate helpers Christian Brauner
@ 2026-04-24 13:46 ` Christian Brauner
  2026-04-24 15:33 ` [PATCH 00/17] eventpoll: clarity refactor Linus Torvalds
  17 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2026-04-24 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
	Christian Brauner (Amutable)

Three globals were shared between the loop check and the path check
paths: tfile_check_list (chain of epitems_head to walk afterwards),
path_count[] (per-depth wakeup-path tally) and inserting_into
(cycle-detection sentinel). All three are scratch state used only
during a single EPOLL_CTL_ADD full_check, yet they sit at file
scope and rely on epnested_mutex for exclusion.

The area has had three bugs in the last year -- CVE-2025-38349,
f2e467a48287 ("eventpoll: Fix semi-unbounded recursion"), and
fdcfce93073d ("eventpoll: Fix integer overflow in
ep_loop_check_proc()") -- all rooted in the shared-mutable-global
pattern being hard to reason about.

Collect the three into a stack-allocated struct ep_ctl_ctx:

    struct ep_ctl_ctx {
        struct eventpoll   *inserting_into;
        struct epitems_head *tfile_check_list;
        int                 path_count[PATH_ARR_SIZE];
    };

do_epoll_ctl() zero-initializes one on its stack and plumbs it
through ep_ctl_lock() / ep_ctl_unlock() / ep_insert() /
ep_register_epitem() / list_file() / ep_loop_check() /
ep_loop_check_proc() / reverse_path_check() /
reverse_path_check_proc() / path_count_inc() / path_count_init() /
clear_tfile_check_list(). Non-nested inserts leave the ctx zeroed
and skip the machinery entirely.

With the scratch state in ctx:
  - tfile_check_list no longer has an EP_UNACTIVE_PTR sentinel --
    NULL is the obvious "empty" value and the zero-init handles it
    for free;
  - path_count[] is no longer an array global that could be touched
    in unexpected orderings;
  - inserting_into is scoped to the exact call that set it.

loop_check_gen stays as a file-scope monotonic counter, because the
stamp left on ep->gen by a completed walk must not equal the stamp
of a future walk -- something a stack-local value cannot guarantee
across calls. It remains protected by epnested_mutex for the bump
and read lockless for the "do we need a full check" trigger in
ep_ctl_lock().

Every bail-out that existed before (the ELOOP on cycle, the path
limit check, the unbounded-recursion cap, the +1 overflow guard) is
preserved verbatim; only the data they operate on moved from file
scope to the stack ctx.

No functional change.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 206 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 117 insertions(+), 89 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 201e688304b3..b839cc02eb0e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -388,32 +388,25 @@ static long max_user_watches __read_mostly;
  *      of a given length -- reverse_path_check().
  *
  * Both need a global view of the epoll topology and must be atomic
- * with the insertion, so the scratch state below is all serialized by
- * one global mutex, epnested_mutex. Non-nested inserts skip this
- * machinery entirely and take only ep->mtx.
- *
- *   epnested_mutex     Serializes the whole check; also protects every
- *                      other variable in this block plus path_count[]
- *                      (declared with the path-check code further
- *                      down).
- *   loop_check_gen     Monotonic stamp, bumped once at the start of a
- *                      check and once at the end. ep->gen caches the
- *                      value under which ep was last visited by
+ * with the insertion, so the check is serialized by epnested_mutex
+ * and carries its scratch state on a stack-allocated struct
+ * ep_ctl_ctx scoped to one do_epoll_ctl() call. Non-nested inserts
+ * skip this machinery entirely and take only ep->mtx.
+ *
+ *   epnested_mutex     Serializes the whole check.
+ *   loop_check_gen     Global monotonic stamp, bumped at the start of
+ *                      a check and again at the end. ep->gen caches
+ *                      the value under which ep was last visited by
  *                      ep_loop_check_proc() or
  *                      ep_get_upwards_depth_proc(); the post-check
  *                      bump ensures those cached stamps can no longer
  *                      equal loop_check_gen, so the
  *                      "ep->gen == loop_check_gen" trigger in
- *                      do_epoll_ctl() only fires while another check
+ *                      ep_ctl_lock() only fires while another check
  *                      is in flight.
- *   inserting_into     Outer eventpoll pointer for the lifetime of one
- *                      ep_loop_check(); ep_loop_check_proc() fails
- *                      with -ELOOP if the downward walk reaches it.
- *   tfile_check_list   Singly-linked list of epitems_head objects
- *                      collected by ep_loop_check_proc() during the
- *                      walk, consumed by reverse_path_check()
- *                      afterwards. Sentinel EP_UNACTIVE_PTR means no
- *                      check is in flight.
+ *
+ * struct ep_ctl_ctx carries the rest (inserting_into, tfile_check_list,
+ * path_count[]) through the walk; see its declaration below.
  *
  * Commits fdcfce93073d ("eventpoll: Fix integer overflow in
  * ep_loop_check_proc()") and f2e467a48287 ("eventpoll: Fix
@@ -422,7 +415,36 @@ static long max_user_watches __read_mostly;
  */
 static DEFINE_MUTEX(epnested_mutex);
 static u64 loop_check_gen = 0;
-static struct eventpoll *inserting_into;
+
+#define PATH_ARR_SIZE 5
+
+/*
+ * Per-do_epoll_ctl() scratch for the loop / path checks. Allocated on
+ * the caller's stack; populated by ep_ctl_lock() and the downward
+ * walk; consumed by reverse_path_check(); released by ep_ctl_unlock().
+ * Only valid while the caller holds epnested_mutex.
+ */
+struct ep_ctl_ctx {
+	/*
+	 * Outer eventpoll for one ep_loop_check(); if the downward walk
+	 * reaches it the insert would form a cycle.
+	 */
+	struct eventpoll *inserting_into;
+
+	/*
+	 * Singly-linked list of epitems_head objects collected during
+	 * ep_loop_check_proc(), then walked by reverse_path_check().
+	 * NULL means empty.
+	 */
+	struct epitems_head *tfile_check_list;
+
+	/*
+	 * Per-depth wakeup-path tally used by reverse_path_check_proc();
+	 * reinitialized to zero at the start of each reverse_path_check()
+	 * iteration.
+	 */
+	int path_count[PATH_ARR_SIZE];
+};
 
 /* Slab cache used to allocate "struct epitem" */
 static struct kmem_cache *epi_cache __ro_after_init;
@@ -434,13 +456,12 @@ static struct kmem_cache *pwq_cache __ro_after_init;
  * Wrapper anchor for file->f_ep when the watched file is not itself an
  * eventpoll; for the epoll-watches-epoll case, file->f_ep points at
  * &watched_ep->refs directly. The ->next field threads
- * tfile_check_list during one EPOLL_CTL_ADD path check.
+ * ctx->tfile_check_list during one EPOLL_CTL_ADD path check.
  */
 struct epitems_head {
 	struct hlist_head epitems;
 	struct epitems_head *next;
 };
-static struct epitems_head *tfile_check_list = EP_UNACTIVE_PTR;
 
 static struct kmem_cache *ephead_cache __ro_after_init;
 
@@ -450,14 +471,14 @@ static inline void free_ephead(struct epitems_head *head)
 		kmem_cache_free(ephead_cache, head);
 }
 
-static void list_file(struct file *file)
+static void list_file(struct file *file, struct ep_ctl_ctx *ctx)
 {
 	struct epitems_head *head;
 
 	head = container_of(file->f_ep, struct epitems_head, epitems);
 	if (!head->next) {
-		head->next = tfile_check_list;
-		tfile_check_list = head;
+		head->next = ctx->tfile_check_list;
+		ctx->tfile_check_list = head;
 	}
 }
 
@@ -1613,41 +1634,40 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
 
 
 
-#define PATH_ARR_SIZE 5
 /*
- * These are the number paths of length 1 to 5, that we are allowing to emanate
- * from a single file of interest. For example, we allow 1000 paths of length
- * 1, to emanate from each file of interest. This essentially represents the
- * potential wakeup paths, which need to be limited in order to avoid massive
- * uncontrolled wakeup storms. The common use case should be a single ep which
- * is connected to n file sources. In this case each file source has 1 path
- * of length 1. Thus, the numbers below should be more than sufficient. These
- * path limits are enforced during an EPOLL_CTL_ADD operation, since a modify
- * and delete can't add additional paths. Protected by the epnested_mutex.
+ * Upper bound on wakeup paths emanating from any one watched file,
+ * indexed by path depth (1..PATH_ARR_SIZE). For example, we allow
+ * 1000 paths of length 1 from each watched file. These caps limit
+ * the wakeup amplification that can be built from epoll-watches-
+ * epoll topologies without rejecting reasonable usage.
+ *
+ * Enforced at EPOLL_CTL_ADD; CTL_MOD and CTL_DEL cannot add paths.
+ * The running tallies live in ctx->path_count[] and are protected by
+ * epnested_mutex.
  */
 static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
-static int path_count[PATH_ARR_SIZE];
 
-static int path_count_inc(int nests)
+static int path_count_inc(struct ep_ctl_ctx *ctx, int nests)
 {
 	/* Allow an arbitrary number of depth 1 paths */
 	if (nests == 0)
 		return 0;
 
-	if (++path_count[nests] > path_limits[nests])
+	if (++ctx->path_count[nests] > path_limits[nests])
 		return -1;
 	return 0;
 }
 
-static void path_count_init(void)
+static void path_count_init(struct ep_ctl_ctx *ctx)
 {
 	int i;
 
 	for (i = 0; i < PATH_ARR_SIZE; i++)
-		path_count[i] = 0;
+		ctx->path_count[i] = 0;
 }
 
-static int reverse_path_check_proc(struct hlist_head *refs, int depth)
+static int reverse_path_check_proc(struct ep_ctl_ctx *ctx,
+				   struct hlist_head *refs, int depth)
 {
 	int error = 0;
 	struct epitem *epi;
@@ -1659,9 +1679,9 @@ static int reverse_path_check_proc(struct hlist_head *refs, int depth)
 	hlist_for_each_entry_rcu(epi, refs, fllink) {
 		struct hlist_head *refs = &epi->ep->refs;
 		if (hlist_empty(refs))
-			error = path_count_inc(depth);
+			error = path_count_inc(ctx, depth);
 		else
-			error = reverse_path_check_proc(refs, depth + 1);
+			error = reverse_path_check_proc(ctx, refs, depth + 1);
 		if (error != 0)
 			break;
 	}
@@ -1669,24 +1689,23 @@ static int reverse_path_check_proc(struct hlist_head *refs, int depth)
 }
 
 /**
- * reverse_path_check - The tfile_check_list is list of epitem_head, which have
- *                      links that are proposed to be newly added. We need to
- *                      make sure that those added links don't add too many
- *                      paths such that we will spend all our time waking up
- *                      eventpoll objects.
+ * reverse_path_check - ctx->tfile_check_list is a list of epitems_head
+ *                      anchoring files with newly proposed links; make
+ *                      sure those links don't push any path-length bucket
+ *                      over its limit in path_limits[].
  *
  * Return: %zero if the proposed links don't create too many paths,
  *	    %-1 otherwise.
  */
-static int reverse_path_check(void)
+static int reverse_path_check(struct ep_ctl_ctx *ctx)
 {
 	struct epitems_head *p;
 
-	for (p = tfile_check_list; p != EP_UNACTIVE_PTR; p = p->next) {
+	for (p = ctx->tfile_check_list; p; p = p->next) {
 		int error;
-		path_count_init();
+		path_count_init(ctx);
 		rcu_read_lock();
-		error = reverse_path_check_proc(&p->epitems, 0);
+		error = reverse_path_check_proc(ctx, &p->epitems, 0);
 		rcu_read_unlock();
 		if (error)
 			return error;
@@ -1817,8 +1836,9 @@ static struct epitem *ep_alloc_epitem(struct eventpoll *ep,
  * unwind; that cannot drop @ep's refcount to zero because the ep file
  * itself still holds the original reference.
  */
-static int ep_register_epitem(struct eventpoll *ep, struct epitem *epi,
-			      struct eventpoll *tep, int full_check)
+static int ep_register_epitem(struct ep_ctl_ctx *ctx, struct eventpoll *ep,
+			      struct epitem *epi, struct eventpoll *tep,
+			      int full_check)
 {
 	struct file *tfile = epi->ffd.file;
 	int error;
@@ -1836,7 +1856,7 @@ static int ep_register_epitem(struct eventpoll *ep, struct epitem *epi,
 	}
 
 	if (full_check && !tep)
-		list_file(tfile);
+		list_file(tfile, ctx);
 
 	ep_rbtree_insert(ep, epi);
 
@@ -1850,8 +1870,9 @@ static int ep_register_epitem(struct eventpoll *ep, struct epitem *epi,
 /*
  * Must be called with "mtx" held.
  */
-static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
-		     struct file *tfile, int fd, int full_check)
+static int ep_insert(struct ep_ctl_ctx *ctx, struct eventpoll *ep,
+		     const struct epoll_event *event, struct file *tfile,
+		     int fd, int full_check)
 {
 	int error, pwake = 0;
 	__poll_t revents;
@@ -1868,12 +1889,12 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	if (IS_ERR(epi))
 		return PTR_ERR(epi);
 
-	error = ep_register_epitem(ep, epi, tep, full_check);
+	error = ep_register_epitem(ctx, ep, epi, tep, full_check);
 	if (error)
 		return error;
 
 	/* Reject the insert if the new link would create too many back-paths. */
-	if (unlikely(full_check && reverse_path_check())) {
+	if (unlikely(full_check && reverse_path_check(ctx))) {
 		ep_remove(ep, epi);
 		return -EINVAL;
 	}
@@ -2340,7 +2361,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
  * Return: depth of the subtree, or a value bigger than EP_MAX_NESTS if we found
  * a loop or went too deep.
  */
-static int ep_loop_check_proc(struct eventpoll *ep, int depth)
+static int ep_loop_check_proc(struct ep_ctl_ctx *ctx,
+			      struct eventpoll *ep, int depth)
 {
 	int result = 0;
 	struct rb_node *rbp;
@@ -2356,22 +2378,23 @@ static int ep_loop_check_proc(struct eventpoll *ep, int depth)
 		if (unlikely(is_file_epoll(epi->ffd.file))) {
 			struct eventpoll *ep_tovisit;
 			ep_tovisit = epi->ffd.file->private_data;
-			if (ep_tovisit == inserting_into || depth > EP_MAX_NESTS)
+			if (ep_tovisit == ctx->inserting_into ||
+			    depth > EP_MAX_NESTS)
 				result = EP_MAX_NESTS+1;
 			else
-				result = max(result, ep_loop_check_proc(ep_tovisit, depth + 1) + 1);
+				result = max(result,
+					     ep_loop_check_proc(ctx, ep_tovisit,
+								depth + 1) + 1);
 			if (result > EP_MAX_NESTS)
 				break;
 		} else {
 			/*
-			 * If we've reached a file that is not associated with
-			 * an ep, then we need to check if the newly added
-			 * links are going to add too many wakeup paths. We do
-			 * this by adding it to the tfile_check_list, if it's
-			 * not already there, and calling reverse_path_check()
-			 * during ep_insert().
+			 * A non-epoll leaf. Queue it for the companion
+			 * reverse_path_check() that runs after this walk so
+			 * any new links we propose don't add too many wakeup
+			 * paths.
 			 */
-			list_file(epi->ffd.file);
+			list_file(epi->ffd.file, ctx);
 		}
 	}
 	ep->loop_check_depth = result;
@@ -2400,22 +2423,24 @@ static int ep_get_upwards_depth_proc(struct eventpoll *ep, int depth)
  *                 into another epoll file (represented by @ep) does not create
  *                 closed loops or too deep chains.
  *
- * @ep: Pointer to the epoll we are inserting into.
- * @to: Pointer to the epoll to be inserted.
+ * @ctx: Per-CTL_ADD scratch context.
+ * @ep:  Pointer to the epoll we are inserting into.
+ * @to:  Pointer to the epoll to be inserted.
  *
  * Return: %zero if adding the epoll @to inside the epoll @from
  * does not violate the constraints, or %-1 otherwise.
  */
-static int ep_loop_check(struct eventpoll *ep, struct eventpoll *to)
+static int ep_loop_check(struct ep_ctl_ctx *ctx, struct eventpoll *ep,
+			 struct eventpoll *to)
 {
 	int depth, upwards_depth;
 
-	inserting_into = ep;
+	ctx->inserting_into = ep;
 	/*
 	 * Check how deep down we can get from @to, and whether it is possible
 	 * to loop up to @ep.
 	 */
-	depth = ep_loop_check_proc(to, 0);
+	depth = ep_loop_check_proc(ctx, to, 0);
 	if (depth > EP_MAX_NESTS)
 		return -1;
 	/* Check how far up we can go from @ep. */
@@ -2426,12 +2451,12 @@ static int ep_loop_check(struct eventpoll *ep, struct eventpoll *to)
 	return (depth+1+upwards_depth > EP_MAX_NESTS) ? -1 : 0;
 }
 
-static void clear_tfile_check_list(void)
+static void clear_tfile_check_list(struct ep_ctl_ctx *ctx)
 {
 	rcu_read_lock();
-	while (tfile_check_list != EP_UNACTIVE_PTR) {
-		struct epitems_head *head = tfile_check_list;
-		tfile_check_list = head->next;
+	while (ctx->tfile_check_list) {
+		struct epitems_head *head = ctx->tfile_check_list;
+		ctx->tfile_check_list = head->next;
 		unlist_file(head);
 	}
 	rcu_read_unlock();
@@ -2529,9 +2554,8 @@ static inline int epoll_mutex_lock(struct mutex *mutex, bool nonblock)
  * EPOLL_CTL_ADDs on different eps from building a cycle without
  * either walker observing it.
  */
-static int ep_ctl_lock(struct eventpoll *ep, int op,
-		       struct file *epfile, struct file *tfile,
-		       bool nonblock)
+static int ep_ctl_lock(struct ep_ctl_ctx *ctx, struct eventpoll *ep, int op,
+		       struct file *epfile, struct file *tfile, bool nonblock)
 {
 	struct eventpoll *tep;
 	int error;
@@ -2556,7 +2580,7 @@ static int ep_ctl_lock(struct eventpoll *ep, int op,
 
 	if (is_file_epoll(tfile)) {
 		tep = tfile->private_data;
-		if (ep_loop_check(ep, tep) != 0) {
+		if (ep_loop_check(ctx, ep, tep) != 0) {
 			error = -ELOOP;
 			goto err_unlock_nested;
 		}
@@ -2569,17 +2593,18 @@ static int ep_ctl_lock(struct eventpoll *ep, int op,
 	return 1;
 
 err_unlock_nested:
-	clear_tfile_check_list();
+	clear_tfile_check_list(ctx);
 	loop_check_gen++;
 	mutex_unlock(&epnested_mutex);
 	return error;
 }
 
-static void ep_ctl_unlock(struct eventpoll *ep, int full_check)
+static void ep_ctl_unlock(struct ep_ctl_ctx *ctx, struct eventpoll *ep,
+			  int full_check)
 {
 	mutex_unlock(&ep->mtx);
 	if (full_check) {
-		clear_tfile_check_list();
+		clear_tfile_check_list(ctx);
 		loop_check_gen++;
 		mutex_unlock(&epnested_mutex);
 	}
@@ -2592,6 +2617,7 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 	int full_check;
 	struct eventpoll *ep;
 	struct epitem *epi;
+	struct ep_ctl_ctx ctx = { };
 
 	CLASS(fd, f)(epfd);
 	if (fd_empty(f))
@@ -2632,7 +2658,8 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 
 	ep = fd_file(f)->private_data;
 
-	full_check = ep_ctl_lock(ep, op, fd_file(f), fd_file(tf), nonblock);
+	full_check = ep_ctl_lock(&ctx, ep, op, fd_file(f), fd_file(tf),
+				 nonblock);
 	if (full_check < 0)
 		return full_check;
 
@@ -2647,7 +2674,8 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 	case EPOLL_CTL_ADD:
 		if (!epi) {
 			epds->events |= EPOLLERR | EPOLLHUP;
-			error = ep_insert(ep, epds, fd_file(tf), fd, full_check);
+			error = ep_insert(&ctx, ep, epds, fd_file(tf), fd,
+					  full_check);
 		} else
 			error = -EEXIST;
 		break;
@@ -2674,7 +2702,7 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 		break;
 	}
 
-	ep_ctl_unlock(ep, full_check);
+	ep_ctl_unlock(&ctx, ep, full_check);
 	return error;
 }
 

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 00/17] eventpoll: clarity refactor
  2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
                   ` (16 preceding siblings ...)
  2026-04-24 13:46 ` [PATCH 17/17] eventpoll: hoist CTL_ADD scratch state into struct ep_ctl_ctx Christian Brauner
@ 2026-04-24 15:33 ` Linus Torvalds
  17 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2026-04-24 15:33 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Alexander Viro, Jan Kara, Jens Axboe

On Fri, 24 Apr 2026 at 06:46, Christian Brauner <brauner@kernel.org> wrote:
>
> This adds a bunch of documentation (a bunch of swearwords were removed
> by having an llm go over it) and refactors.

I like it. Since we're probably stuck with this forever, and my
fever-dream of some day getting rid of it is just that, maybe the
right path forward is to just decrapify this whole thing.

And this all looks sane to me. I'll comment on some things individually.

              Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 15/17] eventpoll: rename epi->next and txlist for clarity
  2026-04-24 13:46 ` [PATCH 15/17] eventpoll: rename epi->next and txlist for clarity Christian Brauner
@ 2026-04-24 16:06   ` Linus Torvalds
  0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2026-04-24 16:06 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Alexander Viro, Jan Kara, Jens Axboe

On Fri, 24 Apr 2026 at 06:47, Christian Brauner <brauner@kernel.org> wrote:
>
> Two list-related names were confusing in isolation:

I think many more of these names are confusing. I'd love to have more
random letter jumbles renamed. "rdllink" isn't exactly obvious _and_
it's visually very close to "rdllist". They are related, yes, so you
want them to be similar, but when it's a letter jumble, similarity
ends up really being something you have to think about.

And "ovflist" is a horrible name too, and has that EP_UNACTIVE_PTR
thing. "unactive"? I guess technically it's a valid word, but...

So I'd like even more renaming.

           Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2026-04-24 16:06 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-24 13:46 [PATCH 00/17] eventpoll: clarity refactor Christian Brauner
2026-04-24 13:46 ` [PATCH 01/17] eventpoll: expand top-of-file overview / locking doc Christian Brauner
2026-04-24 13:46 ` [PATCH 02/17] eventpoll: document loop-check / path-check globals Christian Brauner
2026-04-24 13:46 ` [PATCH 03/17] eventpoll: clarify POLLFREE handshake comments Christian Brauner
2026-04-24 13:46 ` [PATCH 04/17] eventpoll: refresh epi_fget() / ep_remove_file() comments Christian Brauner
2026-04-24 13:46 ` [PATCH 05/17] eventpoll: document ep_clear_and_put() two-pass pattern Christian Brauner
2026-04-24 13:46 ` [PATCH 06/17] eventpoll: rename ep_refcount_dec_and_test() to ep_put() Christian Brauner
2026-04-24 13:46 ` [PATCH 07/17] eventpoll: drop unused depth argument from epoll_mutex_lock() Christian Brauner
2026-04-24 13:46 ` [PATCH 08/17] eventpoll: rename attach_epitem() to ep_attach_file() Christian Brauner
2026-04-24 13:46 ` [PATCH 09/17] eventpoll: relocate KCMP helpers near compat syscalls Christian Brauner
2026-04-24 13:46 ` [PATCH 10/17] eventpoll: split ep_insert() into alloc + register stages Christian Brauner
2026-04-24 13:46 ` [PATCH 11/17] eventpoll: split ep_clear_and_put() into drain helpers Christian Brauner
2026-04-24 13:46 ` [PATCH 12/17] eventpoll: extract ep_deliver_event() from ep_send_events() Christian Brauner
2026-04-24 13:46 ` [PATCH 13/17] eventpoll: extract lock dance from do_epoll_ctl() into ep_ctl_lock() Christian Brauner
2026-04-24 13:46 ` [PATCH 14/17] eventpoll: wrap EP_UNACTIVE_PTR in typed sentinel helpers Christian Brauner
2026-04-24 13:46 ` [PATCH 15/17] eventpoll: rename epi->next and txlist for clarity Christian Brauner
2026-04-24 16:06   ` Linus Torvalds
2026-04-24 13:46 ` [PATCH 16/17] eventpoll: use bool for predicate helpers Christian Brauner
2026-04-24 13:46 ` [PATCH 17/17] eventpoll: hoist CTL_ADD scratch state into struct ep_ctl_ctx Christian Brauner
2026-04-24 15:33 ` [PATCH 00/17] eventpoll: clarity refactor Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox