From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christian Brauner <brauner@kernel.org>
Date: Fri, 24 Apr 2026 15:46:41 +0200
Subject: [PATCH 10/17] eventpoll: split ep_insert() into alloc + register stages
X-Mailing-List: linux-fsdevel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Message-Id: <20260424-work-epoll-rework-v1-10-249ed00a20f3@kernel.org>
References: <20260424-work-epoll-rework-v1-0-249ed00a20f3@kernel.org>
In-Reply-To: <20260424-work-epoll-rework-v1-0-249ed00a20f3@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jens Axboe,
 "Christian Brauner (Amutable)"
X-Mailer: b4 0.16-dev

ep_insert() was 130 lines and mixed four concerns in one body:

 1. user quota charge and epitem allocation;
 2. attach into the target file's f_ep hlist, rbtree insert, and
    target-ep locking;
 3. reverse-path check, EPOLLWAKEUP source, and poll-queue install,
    each with rollback;
 4. ready-list publication.

Factor the first two concerns into named helpers so the body reduces
to orchestration.
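Condensed from the diff below (rollback branches summarized in the
trailing comment), the reworked ep_insert() body now reads:

        epi = ep_alloc_epitem(ep, event, tfile, fd);
        if (IS_ERR(epi))
                return PTR_ERR(epi);

        error = ep_register_epitem(ep, epi, tep, full_check);
        if (error)
                return error;

        /* then: reverse-path check, EPOLLWAKEUP source, poll-queue
         * install -- each unwinding via ep_remove(ep, epi) on failure --
         * and finally the ready-list / wake publication under ep->lock */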
ep_alloc_epitem() charges the user's epoll_watches quota, allocates a
fresh epitem, and initializes its fields. On failure it returns
ERR_PTR(-ENOSPC) or ERR_PTR(-ENOMEM); on success the epi is not yet
linked into anything.

ep_register_epitem() installs @epi into @tfile's f_ep hlist and @ep's
rbtree, optionally chains @tfile onto tfile_check_list for the path
check, takes the tep->mtx nested lock for the epoll-watches-epoll
case, and finally takes the ep_get() reference that pairs with
ep_remove()'s ep_put() in ep_insert()'s error paths. On failure it
frees the epi and decrements epoll_watches, undoing ep_alloc_epitem().

ep_insert()'s remaining body is the rollback-via-ep_remove() chain
(reverse_path_check(), EPOLLWAKEUP source creation,
ep_ptable_queue_proc() allocation) and the ready-list / wake
publication.

Remove a few stale comments that duplicated function-level
documentation or described obvious code.

No functional change; rollback boundaries are unchanged -- every error
path after ep_register_epitem() still calls ep_remove(), preserving
the ep->refcount invariant that keeps ep_remove()'s WARN_ON_ONCE safe.
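To make the unchanged rollback boundary concrete, here is the first
post-registration error path, adapted from the diff below (the comment
is added here for emphasis and is not part of the patch):

        if (unlikely(full_check && reverse_path_check())) {
                /* drops the reference taken by ep_register_epitem()'s
                 * ep_get(); cannot free @ep -- the ep file still holds
                 * the original reference */
                ep_remove(ep, epi);
                return -EINVAL;
        }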
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
 fs/eventpoll.c | 111 ++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 74 insertions(+), 37 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index fde2396342b6..e4a4e92d329f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1726,68 +1726,112 @@ static int ep_attach_file(struct file *file, struct epitem *epi)
 }
 
 /*
- * Must be called with "mtx" held.
+ * Charge the user's epoll_watches quota, allocate a fresh epitem for
+ * @tfile/@fd, and initialize its fields. The returned item is not yet
+ * linked into any data structure; the caller must install it via
+ * ep_register_epitem() (which takes over on success) or kmem_cache_free()
+ * it and decrement epoll_watches on its own.
+ *
+ * Returns ERR_PTR(-ENOSPC) if the quota is exceeded, ERR_PTR(-ENOMEM)
+ * if the slab allocation fails.
  */
-static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
-                     struct file *tfile, int fd, int full_check)
+static struct epitem *ep_alloc_epitem(struct eventpoll *ep,
+                                      const struct epoll_event *event,
+                                      struct file *tfile, int fd)
 {
-        int error, pwake = 0;
-        __poll_t revents;
         struct epitem *epi;
-        struct ep_pqueue epq;
-        struct eventpoll *tep = NULL;
-
-        if (is_file_epoll(tfile))
-                tep = tfile->private_data;
-
-        lockdep_assert_irqs_enabled();
 
         if (unlikely(percpu_counter_compare(&ep->user->epoll_watches,
                                             max_user_watches) >= 0))
-                return -ENOSPC;
+                return ERR_PTR(-ENOSPC);
         percpu_counter_inc(&ep->user->epoll_watches);
 
-        if (!(epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL))) {
+        epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL);
+        if (unlikely(!epi)) {
                 percpu_counter_dec(&ep->user->epoll_watches);
-                return -ENOMEM;
+                return ERR_PTR(-ENOMEM);
         }
 
-        /* Item initialization follow here ... */
         INIT_LIST_HEAD(&epi->rdllink);
         epi->ep = ep;
         ep_set_ffd(&epi->ffd, tfile, fd);
         epi->event = *event;
         epi->next = EP_UNACTIVE_PTR;
 
+        return epi;
+}
+
+/*
+ * Install @epi into its target file's f_ep hlist and into @ep's rbtree,
+ * taking one additional reference on @ep for the lifetime of the item.
+ *
+ * If @tep is non-NULL, the target file is itself an eventpoll; we hold
+ * tep->mtx at subclass 1 across the attach + rbtree insert to serialize
+ * with the target side. RB tree ops are protected by @ep->mtx, which
+ * the caller already holds.
+ *
+ * On failure the epi is freed and the epoll_watches counter decremented,
+ * matching ep_alloc_epitem()'s allocation. After this returns
+ * successfully, ep_insert()'s later error paths use ep_remove() for
+ * unwind; that cannot drop @ep's refcount to zero because the ep file
+ * itself still holds the original reference.
+ */
+static int ep_register_epitem(struct eventpoll *ep, struct epitem *epi,
+                              struct eventpoll *tep, int full_check)
+{
+        struct file *tfile = epi->ffd.file;
+        int error;
+
         if (tep)
                 mutex_lock_nested(&tep->mtx, 1);
-        /* Add the current item to the list of active epoll hook for this file */
-        if (unlikely(ep_attach_file(tfile, epi) < 0)) {
+
+        error = ep_attach_file(tfile, epi);
+        if (unlikely(error)) {
                 if (tep)
                         mutex_unlock(&tep->mtx);
                 kmem_cache_free(epi_cache, epi);
                 percpu_counter_dec(&ep->user->epoll_watches);
-                return -ENOMEM;
+                return error;
         }
 
         if (full_check && !tep)
                 list_file(tfile);
 
-        /*
-         * Add the current item to the RB tree. All RB tree operations are
-         * protected by "mtx", and ep_insert() is called with "mtx" held.
-         */
         ep_rbtree_insert(ep, epi);
+
         if (tep)
                 mutex_unlock(&tep->mtx);
 
-        /*
-         * ep_remove() calls in the later error paths can't lead to
-         * ep_free() as the ep file itself still holds an ep reference.
-         */
         ep_get(ep);
 
+        return 0;
+}
+
+/*
+ * Must be called with "mtx" held.
+ */
+static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
+                     struct file *tfile, int fd, int full_check)
+{
+        int error, pwake = 0;
+        __poll_t revents;
+        struct epitem *epi;
+        struct ep_pqueue epq;
+        struct eventpoll *tep = NULL;
+
+        if (is_file_epoll(tfile))
+                tep = tfile->private_data;
+
+        lockdep_assert_irqs_enabled();
 
-        /* now check if we've created too many backpaths */
+        epi = ep_alloc_epitem(ep, event, tfile, fd);
+        if (IS_ERR(epi))
+                return PTR_ERR(epi);
+
+        error = ep_register_epitem(ep, epi, tep, full_check);
+        if (error)
+                return error;
+
+        /* Reject the insert if the new link would create too many back-paths. */
         if (unlikely(full_check && reverse_path_check())) {
                 ep_remove(ep, epi);
                 return -EINVAL;
@@ -1814,28 +1858,21 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
          */
         revents = ep_item_poll(epi, &epq.pt, 1);
 
-        /*
-         * We have to check if something went wrong during the poll wait queue
-         * install process. Namely an allocation for a wait queue failed due
-         * high memory pressure.
-         */
+        /* ep_ptable_queue_proc() signals allocation failure by clearing epq.epi. */
         if (unlikely(!epq.epi)) {
                 ep_remove(ep, epi);
                 return -ENOMEM;
         }
 
-        /* We have to drop the new item inside our item list to keep track of it */
+        /* Drop the new item onto the ready list if it is already ready. */
         spin_lock_irq(&ep->lock);
 
-        /* record NAPI ID of new item if present */
         ep_set_busy_poll_napi_id(epi);
 
-        /* If the file is already "ready" we drop it inside the ready list */
         if (revents && !ep_is_linked(epi)) {
                 list_add_tail(&epi->rdllink, &ep->rdllist);
                 ep_pm_stay_awake(epi);
 
-                /* Notify waiting tasks that events are available */
                 if (waitqueue_active(&ep->wq))
                         wake_up(&ep->wq);
                 if (waitqueue_active(&ep->poll_wait))

-- 
2.47.3