From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3AF3423393C; Mon, 1 Jun 2026 18:58:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780340284; cv=none; b=qCbIkNSHlHJU9w41YRMDKIzwBJVKw9EMoj762zlFOU0/jND2RtvGLjbjV5CrzpWcEbQEna+I/jtBEydoADyvTsBQZTDF1HWGyY+urWu+8PEzPzmPhNWY4NwM5zhmDhNjgIJ0eZaeArNjefSEYMK3hopcqAWg1IRnuGG0y/BWjew= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780340284; c=relaxed/simple; bh=yRMay8ztBq1oRZs/VDDM24K1p4t3B+LtKCuSuuTycEM=; h=From:Subject:To:Cc:In-Reply-To:References:Content-Type:Date: Message-Id; b=iL4t+v5tE7ahjfomsT5ohkTLBcF2F0FQ0RrIdQ+amyGpglGE1U6wOi/yO8oILn5lK5SBFBmKXCQmtk1rgvo9Gu5vUUUlHZL4qbWM8kxTfYmSA0bqyxHU8e4/Ef1slIs/NiMhQAQZgcZC12R5/fZ/nkkrPBxcFesZ/rORo+GOMiY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=MlEBJZjp; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="MlEBJZjp" Received: by smtp.kernel.org (Postfix) with ESMTPSA id B22171F00893; Mon, 1 Jun 2026 18:58:02 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1780340282; bh=3N5P2gpKxerXP3VehMb0w4n+9h/Z0zVwrIMwJTm545U=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date; b=MlEBJZjp777Ugtrbn2mMkQEOjHErM9H5DRvRIIBjDvDrIESyeIG3+nBsIudpDoSor 9fbwtManuH9Hcj5VIF4CodOHT7Cpb5k9vennYsdjO1dJ81D8R7zagA0va1JAX+IdAo 8EnVpDVjYWDPBmacfZ60kSQlA1uq6GGWFWL4REldfd0D33ggR5PUl++XafkT11YfCX Xt4yATn6mXLnmaq4UPNonej8ZotGQWACRXDeAo/3vHJIbmJWl0l86F8V8daBgCiEP/ 1TAOnzE6Fk+e7X59s50htOcX0SyvOxVIdf1v9iO2IZqwKREFS9j5D0bb/6xXtQVqnp E9fFi8m/UafJQ== From: sashiko-bot@kernel.org Subject: Re: [PATCH bpf-next] bpf: Replace scratch PTE atomically when allocating arena pages Reply-To: sashiko-reviews@lists.linux.dev To: "Tejun Heo" Cc: sched-ext@lists.linux.dev, bpf@vger.kernel.org In-Reply-To: <20260601183728.1800490-1-tj@kernel.org> References: <20260601183728.1800490-1-tj@kernel.org> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Mon, 01 Jun 2026 18:58:02 +0000 Message-Id: <20260601185802.B22171F00893@smtp.kernel.org> Precedence: bulk X-Mailing-List: sched-ext@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: Thank you for your contribution! Sashiko AI review found 3 potential issue(= s) to consider: New issues: - [Critical] Replacing the populated scratch page PTE with a newly allocate= d page lacks a TLB flush on architectures where `flush_tlb_before_set()` is= a no-op, causing silent data corruption. - [High] Calling `pte_page(old)` without checking `pte_present(old)` can ca= use an invalid pointer dereference or kernel panic. Pre-existing issues: - [High] Calling memory allocation and page table walking functions with in= terrupts disabled causes lock nesting violations and scheduling while atomi= c bugs. -- commit b6f9fe34365955853d165d4a39abcf4bf12f7b70 Author: Tejun Heo bpf: Replace scratch PTE atomically when allocating arena pages This commit modifies the BPF arena allocation to replace the shared scratch page with a newly allocated page. To support architectures like arm64, it uses ptep_try_set() to do a valid->invalid->valid transition, providing a new flush_tlb_before_set() hook for the break-before-make sequence. For architectures without ptep_try_set(), it falls back to checking pte_none() and setting the PTE directly. > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index b5739bb99fc15..4c6c4081ef715 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h [ ... ] > @@ -1061,6 +1061,24 @@ static inline bool ptep_try_set(pte_t *ptep, pte_t= new_pte) > } > #endif > =20 > +#ifndef flush_tlb_before_set > +/** > + * flush_tlb_before_set - invalidate a kernel PTE's TLB before re-settin= g it > + * @addr: kernel virtual address whose PTE was just cleared > + * > + * Some architectures (e.g. arm64) do not allow a live page-table entry = to be > + * repointed at a different page in one step. The old entry must first b= e made > + * invalid and its translation flushed from every TLB, and only then may= the new > + * entry be written. > + * > + * This is only for the lockless atomic kernel-PTE installers (ptep_try_= set()). > + * It must be callable with interrupts disabled. > + */ > +static inline void flush_tlb_before_set(unsigned long addr) > +{ > +} [Severity: Critical] Since the generic implementation of flush_tlb_before_set() is an empty function, does this leave a stale TLB entry on architectures like x86? When the scratch page is replaced, ptep_get_and_clear() clears the PTE, and then ptep_try_set() installs the newly allocated page. If the architecture uses the generic fallback for flush_tlb_before_set(), there is no TLB flush before the new entry is written, and the allocation loop doesn't seem to perform a deferred flush either. Could this result in BPF programs continuing to read and write the shared scratch page instead of the new memory? > +#endif > + > #ifndef wrprotect_ptes > /** > * wrprotect_ptes - Write-protect PTEs that map consecutive pages of the= same > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c > index a641b19255aad..799fb49768e69 100644 > --- a/kernel/bpf/arena.c > +++ b/kernel/bpf/arena.c [ ... ] > @@ -154,19 +155,44 @@ static int apply_range_set_cb(pte_t *pte, unsigned = long addr, void *data) > { > struct apply_range_data *d =3D data; > struct page *page; > + pte_t pteval; > =20 > if (!data) > return 0; > =20 > page =3D d->pages[d->i]; > /* paranoia, similar to vmap_pages_pte_range() */ > if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page)))) > return -EINVAL; > =20 > - set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL)); > + pteval =3D mk_pte(page, PAGE_KERNEL); > +#ifdef ptep_try_set > + /* > + * Kernel-fault recovery may have installed the scratch page here, and > + * some architectures (arm64) prohibit valid->valid PTE transitions. > + * Install atomically into a none slot. If scratch is present, clear it > + * and flush_tlb_before_set() (break-before-make) before retrying. > + */ > + while (!ptep_try_set(pte, pteval)) { > + pte_t old =3D ptep_get(pte); > + > + if (pte_none(old)) > + continue; > + if (WARN_ON_ONCE(pte_page(old) !=3D d->scratch_page)) [Severity: High] Can calling pte_page(old) here without first checking pte_present(old) cause an invalid pointer dereference? The function pte_none() only guarantees the PTE is not completely empty, but it doesn't ensure it is present. If the PTE contains a non-present entry like a swap entry or a poison marker, pte_page() will interpret the swap metadata as a physical page frame number. Additionally, on architectures like x86 with CONFIG_DEBUG_VM enabled, pte_pfn() asserts pte_present(), which would trigger a kernel panic. Should this check pte_present(old) before extracting the page? > + return -EBUSY; > + ptep_get_and_clear(&init_mm, addr, pte); > + flush_tlb_before_set(addr); > + } [ ... ] > @@ -475,7 +501,8 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) > if (ret) > goto out_sigsegv_memcg; > =20 > - struct apply_range_data data =3D { .pages =3D &page, .i =3D 0 }; > + struct apply_range_data data =3D { .pages =3D &page, .i =3D 0, > + .scratch_page =3D arena->scratch_page }; > /* Account into memcg of the process that created bpf_arena */ > ret =3D bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page); [Severity: High] This is a pre-existing issue, but does calling bpf_map_alloc_pages() here with interrupts disabled cause lock nesting violations or scheduling while atomic bugs? Looking at arena_vm_fault(), raw_res_spin_lock_irqsave() is called earlier, disabling interrupts. Then bpf_map_alloc_pages() is called, followed by apply_to_page_range(). The apply_to_page_range() function can dynamically allocate page tables using GFP_KERNEL (which can sleep) and acquires the page table lock, which is a sleepable rt_mutex on PREEMPT_RT kernels. A similar pattern exists in arena_alloc_pages(), where raw_res_spin_lock_irqsave() is taken before the while loop that calls apply_to_page_range(). Is it safe to call these sleepable functions inside an IRQ-disabled critical section? > if (ret) { > range_tree_set(&arena->rt, vmf->pgoff, 1); > goto out_sigsegv_memcg; > } --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260601183728.1800= 490-1-tj@kernel.org?part=3D1