* [PATCH 0/3] PCI: endpoint: Add PCI DMA endpoint function (part 3/3)
From: Koichiro Den @ 2026-05-21 6:36 UTC (permalink / raw)
To: Manivannan Sadhasivam, Krzysztof Wilczyński,
Kishon Vijay Abraham I, Bjorn Helgaas, Jonathan Corbet,
Shuah Khan, Vinod Koul, Frank Li, Arnd Bergmann, Damien Le Moal,
Niklas Cassel
Cc: Marek Vasut, Yoshihiro Shimoda, linux-pci, linux-doc,
linux-kernel, dmaengine
Hi,
This is part 3 of three series for PCI endpoint DMA.
The three series are:
* part 1: dmaengine: dw-edma: Prepare for PCI EP DMA
* part 2: PCI: endpoint: Expose endpoint DMA resources
* part 3: PCI: endpoint: Add PCI DMA endpoint function
This series adds the host-side metadata parser, the pci-epf-dma endpoint
function driver, and documentation.
The endpoint function exposes selected endpoint-integrated DMA channels as
a separate PCI DMA controller function. The host-side dw-edma-pcie driver
discovers the BAR metadata, requests the final layout, and registers the
exposed channels with DMAengine. Host clients then submit transfers through
the regular DMAengine API. The endpoint function keeps the metadata BAR
stable and uses a separate DMA window BAR for resources that need dynamic
subrange mappings.
No fixed PCI ID is assigned by this series. Users provide the PCI
vendor/device ID through configfs and bind dw-edma-pcie explicitly, for
example with driver_override.
Dependencies
============
This series depends on parts 1 and 2, applied on top of pci/endpoint:
[PATCH 00/12] dmaengine: dw-edma: Prepare for PCI EP DMA (part 1/3)
https://lore.kernel.org/all/20260521063115.2842238-1-den@valinux.co.jp/
[PATCH 0/3] PCI: endpoint: Expose endpoint DMA resources (part 2/3)
https://lore.kernel.org/all/20260521063405.2842644-1-den@valinux.co.jp/
Note
====
This series touches both dmaengine and PCI endpoint code. I kept the
dw-edma-pcie metadata parser together with the endpoint function so the
metadata producer and consumer can be reviewed in one place.
If the general direction looks acceptable, the dw-edma-pcie patch may need
a dmaengine Ack if this series is routed through the PCI endpoint tree.
Tested on
=========
The RC-to-EP data path was tested with a small out-of-tree DMAengine
client. The host submits a DMA_MEM_TO_DEV transfer through dw-edma-pcie,
which uses a DesignWare eDMA read channel to copy host memory into
endpoint memory.
Tested with:
* R-Car S4 as endpoint and R-Car S4 as root complex
* RK3588 as endpoint and CD8180 as root complex
Best regards,
Koichiro
Koichiro Den (3):
dmaengine: dw-edma-pcie: Discover endpoint DMA metadata
PCI: endpoint: Add DMA endpoint function
Documentation: PCI: Add PCI DMA endpoint function documentation
Documentation/PCI/endpoint/index.rst | 2 +
.../PCI/endpoint/pci-dma-function.rst | 182 +++
Documentation/PCI/endpoint/pci-dma-howto.rst | 200 +++
drivers/dma/dw-edma/dw-edma-pcie.c | 369 ++++-
drivers/pci/endpoint/functions/Kconfig | 14 +
drivers/pci/endpoint/functions/Makefile | 1 +
drivers/pci/endpoint/functions/pci-epf-dma.c | 1361 +++++++++++++++++
7 files changed, 2128 insertions(+), 1 deletion(-)
create mode 100644 Documentation/PCI/endpoint/pci-dma-function.rst
create mode 100644 Documentation/PCI/endpoint/pci-dma-howto.rst
create mode 100644 drivers/pci/endpoint/functions/pci-epf-dma.c
--
2.51.0
^ permalink raw reply
* Re: [PATCH bpf-next v11 3/8] bpf: allow non-owning list-node args via __nonown_allowed
From: Kaitao Cheng @ 2026-05-21 6:29 UTC (permalink / raw)
To: bot+bpf-ci
Cc: bpf, ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
chengkaitao, skhan, memxor, linux-kernel, linux-doc, vmalik,
linux-kselftest, martin.lau, clm, ihor.solodrai
In-Reply-To: <96cd744d0a5a8756cb0c1e31e95fbe682956b97226e2159120785feb01013e53@mail.kernel.org>
在 2026/5/21 12:08, bot+bpf-ci@kernel.org 写道:
>> diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
>> index 75e6c078e0e7..3a9db1108b95 100644
>> --- a/Documentation/bpf/kfuncs.rst
>> +++ b/Documentation/bpf/kfuncs.rst
>
> [ ... ]
>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 8dd79b735a69..f3cf8d85bea0 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>
> [ ... ]
>
>> @@ -12244,6 +12249,13 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>> return ret;
>> break;
>> case KF_ARG_PTR_TO_LIST_NODE:
>> + if (is_kfunc_arg_nonown_allowed(btf, &args[i]) &&
>> + type_is_non_owning_ref(reg->type) && !reg->ref_obj_id) {
>> + /* Allow bpf_list_front/back return value for
>> + * __nonown_allowed list-node arguments.
>> + */
>> + goto check_ok;
>> + }
>> if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>> verbose(env, "%s expected pointer to allocated object\n",
>> reg_arg_name(env, argno));
>> @@ -12253,6 +12265,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>> verbose(env, "allocated object must be referenced\n");
>> return -EINVAL;
>> }
>> +check_ok:
>> ret = process_kf_arg_ptr_to_list_node(env, reg, argno, meta);
>> if (ret < 0)
>> return ret;
>
> This isn't a bug, but the multi-line comment doesn't match the BPF
> subsystem style. The BPF guide requires multi-line comments to have the
> opening /* on its own line for files under kernel/bpf/. Should this be:
>
> /*
> * Allow bpf_list_front/back return value for
> * __nonown_allowed list-node arguments.
> */
This file uses this comment style extensively, so I followed the
existing convention here.
> Eduard Zingerman raised a related nit in v10 RESEND: the check
> '!reg->ref_obj_id' appears redundant. Would type_is_non_owning_ref() be
> sufficient here?
>
> Reference: https://lore.kernel.org/bpf/0419643c9a04bf0824066742e52e3f197b43909b.camel@gmail.com/
>
Sorry, I missed that. If needed, I can fix it in the next version.
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26204125015
--
Thanks
Kaitao Cheng
^ permalink raw reply
* Re: [PATCH bpf-next] bpf: Add kernel-doc for arena page kfuncs
From: Alexei Starovoitov @ 2026-05-21 5:28 UTC (permalink / raw)
To: Dhiraj Shah
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard, Kumar Kartikeya Dwivedi, Song Liu,
Yonghong Song, Jiri Olsa, Jonathan Corbet, Shuah Khan,
open list:DOCUMENTATION, LKML
In-Reply-To: <20260521043553.199781-1-find.dhiraj@gmail.com>
On Thu, May 21, 2026 at 6:36 AM Dhiraj Shah <find.dhiraj@gmail.com> wrote:
>
> The page-management kfuncs exposed by BPF arena -
> bpf_arena_alloc_pages(), bpf_arena_free_pages() and
> bpf_arena_reserve_pages() - are part of the BPF kfunc ABI but lack
> rendered documentation. Their contracts (valid argument ranges,
> sleepable-only context, and the set of error returns) are today only
> discoverable by reading kernel/bpf/arena.c.
>
> Add a kernel-doc comment block above each of the three kfuncs and
> render them under a new "BPF arena kfuncs" subsection in
> Documentation/bpf/kfuncs.rst, alongside the existing core kfunc
> subsections.
>
> No functional change.
>
> Signed-off-by: Dhiraj Shah <find.dhiraj@gmail.com>
> ---
> Documentation/bpf/kfuncs.rst | 27 +++++++++++++++
> kernel/bpf/arena.c | 64 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 91 insertions(+)
>
> diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
> index 75e6c078e0e7..fe0df1e16453 100644
> --- a/Documentation/bpf/kfuncs.rst
> +++ b/Documentation/bpf/kfuncs.rst
> @@ -732,3 +732,30 @@ the verifier. bpf_cgroup_ancestor() can be used as follows:
> BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
> destroy struct cpumask * objects. Please refer to :ref:`cpumasks-header-label`
> for more details.
> +
> +4.4 BPF arena kfuncs
> +--------------------
> +
> +A BPF arena (``BPF_MAP_TYPE_ARENA``) is a sparsely-populated shared memory
> +region that a BPF program and a user-space process can both address. The
> +following kfuncs allow a sleepable BPF program to allocate, free, and reserve
> +pages within an arena:
> +
> +.. kernel-doc:: kernel/bpf/arena.c
> + :identifiers: bpf_arena_alloc_pages bpf_arena_free_pages bpf_arena_reserve_pages
> +
> +A typical pattern is to allocate one or more pages, write to them from BPF,
> +and let user space observe the same memory after a page fault populates its
> +VMA:
> +
> +.. code-block:: c
> +
> + void __arena *page;
> +
> + page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
> + if (!page)
> + return -ENOMEM;
> +
> + /* ... use the page from BPF; user space sees the same bytes ... */
> +
> + bpf_arena_free_pages(&arena, page, 1);
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 49a8f7b1beef..b8ec2953dee6 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -870,6 +870,33 @@ static void arena_free_irq(struct irq_work *iw)
>
> __bpf_kfunc_start_defs();
>
> +/**
> + * bpf_arena_alloc_pages() - Allocate pages within a BPF arena.
> + * @p__map: Pointer to a ``BPF_MAP_TYPE_ARENA`` map.
> + * @addr__ign: Page-aligned user-space address within the arena at which to
> + * place the allocation, or %NULL to let the kernel choose. When
> + * non-NULL the address must fall inside the arena's user VMA
> + * range; otherwise the allocation fails.
> + * @page_cnt: Number of pages to allocate. Must be non-zero and no greater
> + * than the arena's configured size in pages.
> + * @node_id: NUMA node hint for the backing pages, or %NUMA_NO_NODE.
> + * @flags: Reserved for future use; must be 0.
> + *
> + * Allocates @page_cnt physically-backed pages and inserts them into the
> + * arena's kernel VMA at the offset corresponding to @addr__ign (or at an
> + * arbitrary free offset when @addr__ign is %NULL). A subsequent user-space
> + * page fault on the matching user address populates the user VMA with the
> + * same pages, giving BPF and user space a shared view of the region.
> + *
> + * The underlying allocator may sleep, so this kfunc is only callable from
> + * sleepable BPF programs.
what?
> + * Return:
> + * * Kernel pointer to the start of the allocated region on success.
what?
so much slop nowadays :(
pw-bot: cr
^ permalink raw reply
* Re: [PATCH bpf-next] bpf: Add kernel-doc for arena page kfuncs
From: bot+bpf-ci @ 2026-05-21 5:15 UTC (permalink / raw)
To: find.dhiraj, bpf
Cc: ast, daniel, andrii, martin.lau, eddyz87, memxor, song,
yonghong.song, jolsa, corbet, skhan, linux-doc, linux-kernel, ast,
andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
In-Reply-To: <20260521043553.199781-1-find.dhiraj@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 3933 bytes --]
> diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
> index 75e6c078e0e7..fe0df1e16453 100644
> --- a/Documentation/bpf/kfuncs.rst
> +++ b/Documentation/bpf/kfuncs.rst
[ ... ]
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 49a8f7b1beef..b8ec2953dee6 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -870,6 +870,33 @@ static void arena_free_irq(struct irq_work *iw)
>
> __bpf_kfunc_start_defs();
>
> +/**
> + * bpf_arena_alloc_pages() - Allocate pages within a BPF arena.
> + * @p__map: Pointer to a ``BPF_MAP_TYPE_ARENA`` map.
> + * @addr__ign: Page-aligned user-space address within the arena at which to
> + * place the allocation, or %NULL to let the kernel choose. When
> + * non-NULL the address must fall inside the arena's user VMA
> + * range; otherwise the allocation fails.
> + * @page_cnt: Number of pages to allocate. Must be non-zero and no greater
> + * than the arena's configured size in pages.
> + * @node_id: NUMA node hint for the backing pages, or %NUMA_NO_NODE.
> + * @flags: Reserved for future use; must be 0.
> + *
> + * Allocates @page_cnt physically-backed pages and inserts them into the
> + * arena's kernel VMA at the offset corresponding to @addr__ign (or at an
> + * arbitrary free offset when @addr__ign is %NULL). A subsequent user-space
> + * page fault on the matching user address populates the user VMA with the
> + * same pages, giving BPF and user space a shared view of the region.
> + *
> + * The underlying allocator may sleep, so this kfunc is only callable from
> + * sleepable BPF programs.
> + *
> + * Return:
> + * * Kernel pointer to the start of the allocated region on success.
Does the return value description match the implementation? Looking at
arena_alloc_pages() in kernel/bpf/arena.c, the function returns:
return clear_lo32(arena->user_vm_start) + uaddr32;
which is a user-space virtual address derived from the arena's
user_vm_start. This matches the description for bpf_arena_free_pages()
later in this patch which notes the address is "in user-space form" when
previously returned by bpf_arena_alloc_pages().
Should the documentation state it returns a user-space pointer rather than
"Kernel pointer"?
> + * * %NULL if @p__map is not an arena, @flags is non-zero, @page_cnt is zero
> + * or exceeds the arena size, @addr__ign is misaligned or outside the
> + * arena, @node_id is invalid, or the kernel is out of memory.
> + */
> __bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_cnt,
> int node_id, u64 flags)
> {
[ ... ]
> @@ -913,6 +957,26 @@ void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_c
> arena_free_pages(arena, (long)ptr__ign, page_cnt, false);
> }
>
> +/**
> + * bpf_arena_free_pages() - Free a range of pages within a BPF arena.
> + * @p__map: Pointer to a ``BPF_MAP_TYPE_ARENA`` map.
> + * @ptr__ign: User-space virtual address of the first page to free, as used
> + * to address the arena from BPF and user space. Typically the
> + * same address that was previously returned (in user-space form)
> + * by bpf_arena_alloc_pages().
> + * @page_cnt: Number of pages to free.
> + *
> + * Releases the backing pages, unmapping them from the arena's kernel VMA
> + * and from any user-space VMA that previously faulted them in. May sleep,
> + * so the kfunc is callable only from sleepable BPF programs.
> + *
> + * The call is a no-op when @p__map is not an arena, when @page_cnt is zero,
> + * or when @ptr__ign is %NULL.
> + */
> __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt)
> {
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26206146239
^ permalink raw reply
* Re: [PATCH v2] docs: submitting-patches: Clarify that "reviewer" is a person
From: Mauro Carvalho Chehab @ 2026-05-21 5:12 UTC (permalink / raw)
To: Krzysztof Kozlowski
Cc: Jonathan Corbet, Shuah Khan, workflows, linux-doc, linux-kernel,
Greg Kroah-Hartman, Vlastimil Babka, Andrew Morton,
David Hildenbrand, Linus Torvalds, Randy Dunlap, Mark Brown
In-Reply-To: <20260520154846.162170-2-krzysztof.kozlowski@oss.qualcomm.com>
On Wed, 20 May 2026 17:48:47 +0200
Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> wrote:
> Common understanding of word "Reviewer" is: a person performing a review
> work [1]. Tools are not persons, thus cannot be reviewers in this term.
> Also tools cannot make statements and cannot take responsibility for the
> review.
>
> Our docs already clearly mark that "Reviewed-by" must come from a
> person:
>
> - "By offering my Reviewed-by: tag, I state that:"
>
> Usage of first person "I" and word "state"
>
> - "A Reviewed-by tag is *a statement of opinion* that the patch is an
> appropriate modification of the kernel without any remaining serious"
>
> Only a person can make a statement of opinion.
>
> - "Any interested reviewer (who has done the work) can offer a
> Reviewed-by"
>
> A person can offer a tag thus above does not grant the tool
> permission to offer a tag.
>
> However this might not be enough, so let's clarify that only a person
> with a known identity can state the "Reviewer's statement of oversight".
>
> Link: https://en.wiktionary.org/wiki/reviewer [1]
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> Acked-by: Randy Dunlap <rdunlap@infradead.org>
> Reviewed-by: Mark Brown <broonie@kernel.org>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Makes sense to me.
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
>
> Changes in v2:
> 1. Add tags
> 2. Rephrase/simplify a bit commit msg. Rephrase title - drop "in
> English".
> 3. Add "with known identity", suggested by David Hildenbrand. I retained
> previous tags, assuming this change is within spirit of previous
> version and there were no objections on the list.
> ---
> Documentation/process/submitting-patches.rst | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
> index d7290e208e72..cc6a1f73d7f2 100644
> --- a/Documentation/process/submitting-patches.rst
> +++ b/Documentation/process/submitting-patches.rst
> @@ -581,12 +581,12 @@ By offering my Reviewed-by: tag, I state that:
>
> A Reviewed-by tag is a statement of opinion that the patch is an
> appropriate modification of the kernel without any remaining serious
> -technical issues. Any interested reviewer (who has done the work) can
> -offer a Reviewed-by tag for a patch. This tag serves to give credit to
> -reviewers and to inform maintainers of the degree of review which has been
> -done on the patch. Reviewed-by: tags, when supplied by reviewers known to
> -understand the subject area and to perform thorough reviews, will normally
> -increase the likelihood of your patch getting into the kernel.
> +technical issues. Any interested reviewer (who has done the work and is a
> +person with known identity) can offer a Reviewed-by tag for a patch. This tag
> +serves to give credit to reviewers and to inform maintainers of the degree of
> +review which has been done on the patch. Reviewed-by: tags, when supplied by
> +reviewers known to understand the subject area and to perform thorough reviews,
> +will normally increase the likelihood of your patch getting into the kernel.
>
> Both Tested-by and Reviewed-by tags, once received on mailing list from tester
> or reviewer, should be added by author to the applicable patches when sending
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Vernon Yang @ 2026-05-21 5:11 UTC (permalink / raw)
To: Wei Yang
Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260521024654.2a7teoe665porz76@master>
On Thu, May 21, 2026 at 02:46:54AM +0000, Wei Yang wrote:
> On Thu, May 21, 2026 at 10:36:15AM +0800, Vernon Yang wrote:
> >On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
> >> Enable khugepaged to collapse to mTHP orders. This patch implements the
> >> main scanning logic using a bitmap to track occupied pages and a stack
> >> structure that allows us to find optimal collapse sizes.
> >>
> >> Previous to this patch, PMD collapse had 3 main phases, a light weight
> >> scanning phase (mmap_read_lock) that determines a potential PMD
> >> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> >> phase (mmap_write_lock).
> >>
> >> To enabled mTHP collapse we make the following changes:
> >>
> >> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> >> orders are enabled, we remove the restriction of max_ptes_none during the
> >> scan phase to avoid missing potential mTHP collapse candidates. Once we
> >> have scanned the full PMD range and updated the bitmap to track occupied
> >> pages, we use the bitmap to find the optimal mTHP size.
> >>
> >> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> >> and determine the best eligible order for the collapse. A stack structure
> >> is used instead of traditional recursion to manage the search. This also
> >> prevents a traditional recursive approach when the kernel stack struct is
> >> limited. The algorithm recursively splits the bitmap into smaller chunks to
> >> find the highest order mTHPs that satisfy the collapse criteria. We start
> >> by attempting the PMD order, then moved on the consecutively lower orders
> >> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> >> indicating the number of PTEs from the start of the PMD, and the order of
> >> the potential collapse candidate.
> >>
> >> The algorithm for consuming the bitmap works as such:
> >> 1) push (0, HPAGE_PMD_ORDER) onto the stack
> >> 2) pop the stack
> >> 3) check if the number of set bits in that (offset,order) pair
> >> statisfy the max_ptes_none threshold for that order
> >> 4) if yes, attempt collapse
> >> 5) if no (or collapse fails), push two new stack items representing
> >> the left and right halves of the current bitmap range, at the
> >> next lower order
> >> 6) repeat at step (2) until stack is empty.
> >>
> >> Below is a diagram representing the algorithm and stack items:
> >>
> >> offset mid_offset
> >> | |
> >> | |
> >> v v
> >> ____________________________________
> >> | PTE Page Table |
> >> --------------------------------------
> >> <-------><------->
> >> order-1 order-1
> >>
> >> mTHP collapses reject regions containing swapped out or shared pages.
> >> This is because adding new entries can lead to new none pages, and these
> >> may lead to constant promotion into a higher order mTHP. A similar
> >> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> >> introducing at least 2x the number of pages, and on a future scan will
> >> satisfy the promotion condition once again. This issue is prevented via
> >> the collapse_max_ptes_none() function which imposes the max_ptes_none
> >> restrictions above.
> >>
> >> We currently only support mTHP collapse for max_ptes_none values of 0
> >> and HPAGE_PMD_NR - 1. resulting in the following behavior:
> >>
> >> - max_ptes_none=0: Never introduce new empty pages during collapse
> >> - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >> available mTHP order
> >>
> >> Any other max_ptes_none value will emit a warning and skip mTHP collapse
> >> attempts. There should be no behavior change for PMD collapse.
> >>
> >> Once we determine what mTHP sizes fits best in that PMD range a collapse
> >> is attempted. A minimum collapse order of 2 is used as this is the lowest
> >> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >>
> >> Currently madv_collapse is not supported and will only attempt PMD
> >> collapse.
> >>
> >> We can also remove the check for is_khugepaged inside the PMD scan as
> >> the collapse_max_ptes_none() function handles this logic now.
> >>
> >> Signed-off-by: Nico Pache <npache@redhat.com>
> >> ---
> >> mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
> >> 1 file changed, 174 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 3492b135d667..39bf7ea8a6e8 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >>
> >> static struct kmem_cache *mm_slot_cache __ro_after_init;
> >>
> >> +#define KHUGEPAGED_MIN_MTHP_ORDER 2
> >> +/*
> >> + * mthp_collapse() does an iterative DFS over a binary tree, from
> >> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> >> + * size needed for a DFS on a binary tree is height + 1, where
> >> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> >> + *
> >> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> >> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> >> + */
> >> +#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> >> +
> >> +/*
> >> + * Defines a range of PTE entries in a PTE page table which are being
> >> + * considered for mTHP collapse.
> >> + *
> >> + * @offset: the offset of the first PTE entry in a PMD range.
> >> + * @order: the order of the PTE entries being considered for collapse.
> >> + */
> >> +struct mthp_range {
> >> + u16 offset;
> >> + u8 order;
> >> +};
> >> +
> >> struct collapse_control {
> >> bool is_khugepaged;
> >>
> >> @@ -111,6 +135,12 @@ struct collapse_control {
> >>
> >> /* nodemask for allocation fallback */
> >> nodemask_t alloc_nmask;
> >> +
> >> + /* Each bit represents a single occupied (!none/zero) page. */
> >> + DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> >> + /* A mask of the current range being considered for mTHP collapse. */
> >> + DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> >> + struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> >> };
> >>
> >> /**
> >> @@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >> return result;
> >> }
> >>
> >> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> >> + u16 offset, u8 order)
> >> +{
> >> + const int size = *stack_size;
> >> + struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> >> +
> >> + VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> >> + stack->order = order;
> >> + stack->offset = offset;
> >> + (*stack_size)++;
> >> +}
> >> +
> >> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> >> + int *stack_size)
> >> +{
> >> + const int size = *stack_size;
> >> +
> >> + VM_WARN_ON_ONCE(size <= 0);
> >> + (*stack_size)--;
> >> + return cc->mthp_bitmap_stack[size - 1];
> >> +}
> >> +
> >> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> >> + u16 offset, unsigned int nr_ptes)
> >> +{
> >> + bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> >> + bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> >> + return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> >> +}
> >> +
> >> +/*
> >> + * mthp_collapse() consumes the bitmap that is generated during
> >> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> >> + *
> >> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> >> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> >> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> >> + * variables (offset, order), indicating the number of PTEs from the start of
> >> + * the PMD, and the order of the potential collapse candidate respectively. We
> >> + * start at the PMD order and check if it is eligible for collapse; if not, we
> >> + * add two entries to the stack at a lower order to represent the left and right
> >> + * halves of the PTE page table we are examining.
> >> + *
> >> + * offset mid_offset
> >> + * | |
> >> + * | |
> >> + * v v
> >> + * --------------------------------------
> >> + * | cc->mthp_bitmap |
> >> + * --------------------------------------
> >> + * <-------><------->
> >> + * order-1 order-1
> >> + *
> >> + * For each of these, we determine how many PTE entries are occupied in the
> >> + * range of PTE entries we propose to collapse, then we compare this to a
> >> + * threshold number of PTE entries which would need to be occupied for a
> >> + * collapse to be permitted at that order (accounting for max_ptes_none).
> >> + *
> >> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> >> + * mTHP.
> >> + */
> >> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >> + int referenced, int unmapped, struct collapse_control *cc,
> >> + unsigned long enabled_orders)
> >> +{
> >> + unsigned int nr_occupied_ptes, nr_ptes;
> >> + int max_ptes_none, collapsed = 0, stack_size = 0;
> >> + unsigned long collapse_address;
> >> + struct mthp_range range;
> >> + u16 offset;
> >> + u8 order;
> >> +
> >> + collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> >> +
> >> + while (stack_size) {
> >> + range = collapse_mthp_stack_pop(cc, &stack_size);
> >> + order = range.order;
> >> + offset = range.offset;
> >> + nr_ptes = 1UL << order;
> >> +
> >> + if (!test_bit(order, &enabled_orders))
> >> + goto next_order;
> >> +
> >> + max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> >> +
> >> + if (max_ptes_none < 0)
> >> + return collapsed;
> >> +
> >> + nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> >> + nr_ptes);
> >> +
> >> + if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >> + int ret;
> >> +
> >> + collapse_address = address + offset * PAGE_SIZE;
> >> + ret = collapse_huge_page(mm, collapse_address, referenced,
> >> + unmapped, cc, order);
> >> + if (ret == SCAN_SUCCEED) {
> >> + collapsed += nr_ptes;
> >> + continue;
> >> + }
> >> + }
> >> +
> >> +next_order:
> >> + if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> >
> >Hi Nico, thank you very much for your contributions to this series.
> >
> >I found a minor issue, for MADV_COLLAPSE, if collapse_huge_page() fails
> >for some reason (e.g. allocate folio), it goes to next_order and
> >continues splitting to the next small order. However, enabled_orders
> >only supports HPAGE_PMD_ORDER, so it keeps runing the split operations
> >without any effective work until KHUGEPAGED_MIN_MTHP_ORDER is reached
> >before exiting. For khugepaged, e.g. setting only 2MB to always, also
> >same phenomenon.
>
> Yes, but it does no actual work since it is checked after pop up.
>
> >
> >This does not affect the overall functionality of mthp collapse, just
> >redundant.
> >
> >The redundant operations can be easily skipped with the following
> >modification. If I miss some thing, please let me know. Thanks!
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index 1a25af3d6d0f..fa407cce525c 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
> >@@ -1574,7 +1574,7 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> > }
> >
> > next_order:
> >- if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> >+ if ((BIT(order) - 1) & enabled_orders) {
> > const u8 next_order = order - 1;
> > const u16 mid_offset = offset + (nr_ptes / 2);
> >
>
> This would stop the iteration if there are other lower enabled order, right?
^^^^ ^^^^^^^^^^^^^^^^^^^
NO :)
For more details, please refer to the following information.
| Scenario | Old Behavior (order > 2) | New Behavior ((BIT(order)-1) & enabled_orders) |
|-------------------------------------|--------------------------|------------------------------------------------|
| MADV_COLLAPSE | Splits 9,8,7,...,3 | No split |
| khugepaged, only 2MB enabled | Splits 9,8,7,...,3 | No split |
| khugepaged, only 2MB + 64KB enabled | Splits 9,8,7,...,3 | Splits 9,8,7,...,5 |
| khugepaged, only 32KB enabled | Splits 9,8,7,...,3 | Splits 9,8,7,...,4 |
| khugepaged, only 16KB enabled | Splits 9,8,7,...,3 | Splits 9,8,7,...,3 |
| khugepaged, all mTHP enabled | Splits 9,8,7,...,3 | Splits 9,8,7,...,3 |
--
Cheers,
Vernon
^ permalink raw reply
* Re: [PATCH bpf-next] bpf: Add kernel-doc for arena page kfuncs
From: Emil Tsalapatis @ 2026-05-21 4:56 UTC (permalink / raw)
To: Dhiraj Shah, bpf
Cc: ast, daniel, andrii, martin.lau, eddyz87, memxor, song,
yonghong.song, jolsa, corbet, skhan, linux-doc, linux-kernel
In-Reply-To: <20260521043553.199781-1-find.dhiraj@gmail.com>
On Thu May 21, 2026 at 12:35 AM EDT, Dhiraj Shah wrote:
> The page-management kfuncs exposed by BPF arena -
> bpf_arena_alloc_pages(), bpf_arena_free_pages() and
> bpf_arena_reserve_pages() - are part of the BPF kfunc ABI but lack
> rendered documentation. Their contracts (valid argument ranges,
> sleepable-only context, and the set of error returns) are today only
> discoverable by reading kernel/bpf/arena.c.
>
> Add a kernel-doc comment block above each of the three kfuncs and
> render them under a new "BPF arena kfuncs" subsection in
> Documentation/bpf/kfuncs.rst, alongside the existing core kfunc
> subsections.
>
> No functional change.
>
> Signed-off-by: Dhiraj Shah <find.dhiraj@gmail.com>
> ---
> Documentation/bpf/kfuncs.rst | 27 +++++++++++++++
> kernel/bpf/arena.c | 64 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 91 insertions(+)
>
> diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
> index 75e6c078e0e7..fe0df1e16453 100644
> --- a/Documentation/bpf/kfuncs.rst
> +++ b/Documentation/bpf/kfuncs.rst
> @@ -732,3 +732,30 @@ the verifier. bpf_cgroup_ancestor() can be used as follows:
> BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
> destroy struct cpumask * objects. Please refer to :ref:`cpumasks-header-label`
> for more details.
> +
> +4.4 BPF arena kfuncs
> +--------------------
> +
> +A BPF arena (``BPF_MAP_TYPE_ARENA``) is a sparsely-populated shared memory
> +region that a BPF program and a user-space process can both address. The
> +following kfuncs allow a sleepable BPF program to allocate, free, and reserve
> +pages within an arena:
> +
> +.. kernel-doc:: kernel/bpf/arena.c
> + :identifiers: bpf_arena_alloc_pages bpf_arena_free_pages bpf_arena_reserve_pages
> +
> +A typical pattern is to allocate one or more pages, write to them from BPF,
> +and let user space observe the same memory after a page fault populates its
> +VMA:
Maybe slight rephrase? This description is a bit dense. E.g.,
"...and let user space access the pages through a mapping in its address space."
> +
> +.. code-block:: c
> +
> + void __arena *page;
> +
> + page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
> + if (!page)
> + return -ENOMEM;
> +
> + /* ... use the page from BPF; user space sees the same bytes ... */
> +
> + bpf_arena_free_pages(&arena, page, 1);
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 49a8f7b1beef..b8ec2953dee6 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -870,6 +870,33 @@ static void arena_free_irq(struct irq_work *iw)
>
> __bpf_kfunc_start_defs();
>
> +/**
> + * bpf_arena_alloc_pages() - Allocate pages within a BPF arena.
> + * @p__map: Pointer to a ``BPF_MAP_TYPE_ARENA`` map.
> + * @addr__ign: Page-aligned user-space address within the arena at which to
> + * place the allocation, or %NULL to let the kernel choose. When
> + * non-NULL the address must fall inside the arena's user VMA
> + * range; otherwise the allocation fails.
> + * @page_cnt: Number of pages to allocate. Must be non-zero and no greater
> + * than the arena's configured size in pages.
> + * @node_id: NUMA node hint for the backing pages, or %NUMA_NO_NODE.
> + * @flags: Reserved for future use; must be 0.
> + *
> + * Allocates @page_cnt physically-backed pages and inserts them into the
> + * arena's kernel VMA at the offset corresponding to @addr__ign (or at an
> + * arbitrary free offset when @addr__ign is %NULL). A subsequent user-space
> + * page fault on the matching user address populates the user VMA with the
> + * same pages, giving BPF and user space a shared view of the region.
> + *
> + * The underlying allocator may sleep, so this kfunc is only callable from
> + * sleepable BPF programs.
I think this is half the story, since the verifier adjusts the call to
the function to the non-sleepable version when necessary. So the kfunc
is technically only callable from sleepable BPF programs but it never
will be thanks to the verifier.
> + *
> + * Return:
> + * * Kernel pointer to the start of the allocated region on success.
> + * * %NULL if @p__map is not an arena, @flags is non-zero, @page_cnt is zero
> + * or exceeds the arena size, @addr__ign is misaligned or outside the
> + * arena, @node_id is invalid, or the kernel is out of memory.
> + */
> __bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_cnt,
> int node_id, u64 flags)
> {
> @@ -893,6 +920,23 @@ void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 pag
>
> return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id, false);
> }
> +
> +/**
> + * bpf_arena_free_pages() - Free a range of pages within a BPF arena.
> + * @p__map: Pointer to a ``BPF_MAP_TYPE_ARENA`` map.
> + * @ptr__ign: User-space virtual address of the first page to free, as used
> + * to address the arena from BPF and user space. Typically the
> + * same address that was previously returned (in user-space form)
> + * by bpf_arena_alloc_pages().
> + * @page_cnt: Number of pages to free.
> + *
> + * Releases the backing pages, unmapping them from the arena's kernel VMA
> + * and from any user-space VMA that previously faulted them in. May sleep,
> + * so the kfunc is callable only from sleepable BPF programs.
Same here.
> + *
> + * The call is a no-op when @p__map is not an arena, when @page_cnt is zero,
> + * or when @ptr__ign is %NULL.
> + */
> __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt)
> {
> struct bpf_map *map = p__map;
> @@ -913,6 +957,26 @@ void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_c
> arena_free_pages(arena, (long)ptr__ign, page_cnt, false);
> }
>
> +/**
> + * bpf_arena_reserve_pages() - Reserve a page range within a BPF arena.
> + * @p__map: Pointer to a ``BPF_MAP_TYPE_ARENA`` map.
> + * @ptr__ign: Page-aligned user-space virtual address of the start of the
> + * range to reserve.
> + * @page_cnt: Number of pages to reserve. Zero is permitted and is a no-op.
> + *
> + * Marks @page_cnt pages starting at @ptr__ign as reserved so that subsequent
> + * bpf_arena_alloc_pages() calls will not place allocations in that range.
> + * No physical pages are allocated by this kfunc; the range is simply
> + * excluded from the arena's free space.
> + *
> + * Return:
> + * * 0 on success, or when @page_cnt is zero.
> + * * -EINVAL if @p__map is not an arena or the requested range falls outside
> + * the arena's user VMA.
> + * * -EBUSY if any page in the requested range is already allocated, or if
> + * contention on the arena's internal spinlock prevents the operation from
> + * completing.
> + */
> __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_cnt)
> {
> struct bpf_map *map = p__map;
^ permalink raw reply
* [PATCH bpf-next] bpf: Add kernel-doc for arena page kfuncs
From: Dhiraj Shah @ 2026-05-21 4:35 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, eddyz87, memxor, song,
yonghong.song, jolsa, corbet, skhan, linux-doc, linux-kernel
The page-management kfuncs exposed by BPF arena -
bpf_arena_alloc_pages(), bpf_arena_free_pages() and
bpf_arena_reserve_pages() - are part of the BPF kfunc ABI but lack
rendered documentation. Their contracts (valid argument ranges,
sleepable-only context, and the set of error returns) are today only
discoverable by reading kernel/bpf/arena.c.
Add a kernel-doc comment block above each of the three kfuncs and
render them under a new "BPF arena kfuncs" subsection in
Documentation/bpf/kfuncs.rst, alongside the existing core kfunc
subsections.
No functional change.
Signed-off-by: Dhiraj Shah <find.dhiraj@gmail.com>
---
Documentation/bpf/kfuncs.rst | 27 +++++++++++++++
kernel/bpf/arena.c | 64 ++++++++++++++++++++++++++++++++++++
2 files changed, 91 insertions(+)
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
index 75e6c078e0e7..fe0df1e16453 100644
--- a/Documentation/bpf/kfuncs.rst
+++ b/Documentation/bpf/kfuncs.rst
@@ -732,3 +732,30 @@ the verifier. bpf_cgroup_ancestor() can be used as follows:
BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
destroy struct cpumask * objects. Please refer to :ref:`cpumasks-header-label`
for more details.
+
+4.4 BPF arena kfuncs
+--------------------
+
+A BPF arena (``BPF_MAP_TYPE_ARENA``) is a sparsely-populated shared memory
+region that a BPF program and a user-space process can both address. The
+following kfuncs allow a sleepable BPF program to allocate, free, and reserve
+pages within an arena:
+
+.. kernel-doc:: kernel/bpf/arena.c
+ :identifiers: bpf_arena_alloc_pages bpf_arena_free_pages bpf_arena_reserve_pages
+
+A typical pattern is to allocate one or more pages, write to them from BPF,
+and let user space observe the same memory after a page fault populates its
+VMA:
+
+.. code-block:: c
+
+ void __arena *page;
+
+ page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+ if (!page)
+ return -ENOMEM;
+
+ /* ... use the page from BPF; user space sees the same bytes ... */
+
+ bpf_arena_free_pages(&arena, page, 1);
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 49a8f7b1beef..b8ec2953dee6 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -870,6 +870,33 @@ static void arena_free_irq(struct irq_work *iw)
__bpf_kfunc_start_defs();
+/**
+ * bpf_arena_alloc_pages() - Allocate pages within a BPF arena.
+ * @p__map: Pointer to a ``BPF_MAP_TYPE_ARENA`` map.
+ * @addr__ign: Page-aligned user-space address within the arena at which to
+ * place the allocation, or %NULL to let the kernel choose. When
+ * non-NULL the address must fall inside the arena's user VMA
+ * range; otherwise the allocation fails.
+ * @page_cnt: Number of pages to allocate. Must be non-zero and no greater
+ * than the arena's configured size in pages.
+ * @node_id: NUMA node hint for the backing pages, or %NUMA_NO_NODE.
+ * @flags: Reserved for future use; must be 0.
+ *
+ * Allocates @page_cnt physically-backed pages and inserts them into the
+ * arena's kernel VMA at the offset corresponding to @addr__ign (or at an
+ * arbitrary free offset when @addr__ign is %NULL). A subsequent user-space
+ * page fault on the matching user address populates the user VMA with the
+ * same pages, giving BPF and user space a shared view of the region.
+ *
+ * The underlying allocator may sleep, so this kfunc is only callable from
+ * sleepable BPF programs.
+ *
+ * Return:
+ * * Kernel pointer to the start of the allocated region on success.
+ * * %NULL if @p__map is not an arena, @flags is non-zero, @page_cnt is zero
+ * or exceeds the arena size, @addr__ign is misaligned or outside the
+ * arena, @node_id is invalid, or the kernel is out of memory.
+ */
__bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_cnt,
int node_id, u64 flags)
{
@@ -893,6 +920,23 @@ void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 pag
return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id, false);
}
+
+/**
+ * bpf_arena_free_pages() - Free a range of pages within a BPF arena.
+ * @p__map: Pointer to a ``BPF_MAP_TYPE_ARENA`` map.
+ * @ptr__ign: User-space virtual address of the first page to free, as used
+ * to address the arena from BPF and user space. Typically the
+ * same address that was previously returned (in user-space form)
+ * by bpf_arena_alloc_pages().
+ * @page_cnt: Number of pages to free.
+ *
+ * Releases the backing pages, unmapping them from the arena's kernel VMA
+ * and from any user-space VMA that previously faulted them in. May sleep,
+ * so the kfunc is callable only from sleepable BPF programs.
+ *
+ * The call is a no-op when @p__map is not an arena, when @page_cnt is zero,
+ * or when @ptr__ign is %NULL.
+ */
__bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt)
{
struct bpf_map *map = p__map;
@@ -913,6 +957,26 @@ void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_c
arena_free_pages(arena, (long)ptr__ign, page_cnt, false);
}
+/**
+ * bpf_arena_reserve_pages() - Reserve a page range within a BPF arena.
+ * @p__map: Pointer to a ``BPF_MAP_TYPE_ARENA`` map.
+ * @ptr__ign: Page-aligned user-space virtual address of the start of the
+ * range to reserve.
+ * @page_cnt: Number of pages to reserve. Zero is permitted and is a no-op.
+ *
+ * Marks @page_cnt pages starting at @ptr__ign as reserved so that subsequent
+ * bpf_arena_alloc_pages() calls will not place allocations in that range.
+ * No physical pages are allocated by this kfunc; the range is simply
+ * excluded from the arena's free space.
+ *
+ * Return:
+ * * 0 on success, or when @page_cnt is zero.
+ * * -EINVAL if @p__map is not an arena or the requested range falls outside
+ * the arena's user VMA.
+ * * -EBUSY if any page in the requested range is already allocated, or if
+ * contention on the arena's internal spinlock prevents the operation from
+ * completing.
+ */
__bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_cnt)
{
struct bpf_map *map = p__map;
--
2.43.0
^ permalink raw reply related
* Re: [PATCH bpf-next v11 6/8] bpf: Add bpf_list_add to insert node after a given list node
From: bot+bpf-ci @ 2026-05-21 4:08 UTC (permalink / raw)
To: kaitao.cheng, ast, corbet, martin.lau, daniel, andrii, eddyz87,
song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
shuah, chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest, ast,
andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
In-Reply-To: <20260521032306.97118-7-kaitao.cheng@linux.dev>
[-- Attachment #1: Type: text/plain, Size: 3884 bytes --]
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 1c69476c8a09..89579165ef4d 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -2577,6 +2577,16 @@ __bpf_kfunc int bpf_list_push_back_impl(struct bpf_list_head *head,
> return bpf_list_push_back(head, node, meta__ign, off);
> }
>
> +__bpf_kfunc int bpf_list_add(struct bpf_list_head *head, struct bpf_list_node *new,
> + struct bpf_list_node *prev__nonown_allowed,
> + struct btf_struct_meta *meta, u64 off)
> +{
> + struct bpf_list_node_kern *n = (void *)new, *p = (void *)prev__nonown_allowed;
> + struct list_head *prev_ptr = &p->list_head;
> +
> + return __bpf_list_add(n, head, &prev_ptr, meta ? meta->record : NULL, off);
> +}
If __bpf_list_add() fails at runtime (e.g., because prev is not in the
list indicated by head), it drops the 'new' node via __bpf_obj_drop_impl
in the fail path.
However, the verifier unconditionally converts 'new' into a non-owning
reference during check_kfunc_call(). If __bpf_list_add() fails, the BPF
program retains this non-owning reference to a freed node.
Could this allow the program to trigger a use-after-free within the lock
critical section?
(sashiko-bot@kernel.org, v10 RESEND)
https://lore.kernel.org/bpf/20260513204420.2BB22C19425@smtp.kernel.org/
>
> static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head,
> struct list_head *n)
[ ... ]
> @@ -4756,6 +4766,7 @@ BTF_ID_FLAGS(func, bpf_list_push_front, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, bpf_list_push_front_impl)
> BTF_ID_FLAGS(func, bpf_list_push_back, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, bpf_list_push_back_impl)
> +BTF_ID_FLAGS(func, bpf_list_add, KF_IMPLICIT_ARGS)
Since bpf_list_add is marked with KF_IMPLICIT_ARGS, won't the verifier
look for a bpf_list_add_impl counterpart during program load?
Without defining and registering a bpf_list_add_impl function,
find_kfunc_impl_proto() will fail to find the required _impl suffix in
the vmlinux BTF.
Will this cause the verifier to unconditionally reject the kfunc?
(sashiko-bot@kernel.org, v10 RESEND)
https://lore.kernel.org/bpf/20260513204420.2BB22C19425@smtp.kernel.org/
> BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
> BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
> BTF_ID_FLAGS(func, bpf_list_del, KF_ACQUIRE | KF_RET_NULL)
[ ... ]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 35eebb5e7769..662ad7312697 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
[ ... ]
> @@ -19524,8 +19527,11 @@ int bpf_fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> int struct_meta_reg = BPF_REG_3;
> int node_offset_reg = BPF_REG_4;
>
> - /* rbtree_add has extra 'less' arg, so args-to-fixup are in diff regs */
> - if (is_bpf_rbtree_add_kfunc(desc->func_id)) {
> + /* list_add/rbtree_add have an extra arg (prev/less),
> + * so args-to-fixup are in diff regs.
> + */
This isn't a bug, but the multi-line comment format doesn't follow the
BPF subsystem style. The preferred form in kernel/bpf/ has the opening
/* on its own line.
Additionally, consider adding an is_bpf_list_add_kfunc() helper instead
of direct comparison with special_kfunc_list[KF_bpf_list_add].
(bot+bpf-ci@kernel.org and Eduard Zingerman, v10 RESEND)
https://lore.kernel.org/bpf/f573f7e0a5140fe6254d2ddfd3f69add455fa25c67c020c0312e2c2ceeca7c69@mail.kernel.org/
> + if (desc->func_id == special_kfunc_list[KF_bpf_list_add] ||
> + is_bpf_rbtree_add_kfunc(desc->func_id)) {
> struct_meta_reg = BPF_REG_4;
> node_offset_reg = BPF_REG_5;
> }
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26204125015
^ permalink raw reply
* Re: [PATCH bpf-next v11 4/8] bpf: Introduce the bpf_list_del kfunc.
From: bot+bpf-ci @ 2026-05-21 4:08 UTC (permalink / raw)
To: kaitao.cheng, ast, corbet, martin.lau, daniel, andrii, eddyz87,
song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
shuah, chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest, ast,
andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
In-Reply-To: <20260521032306.97118-5-kaitao.cheng@linux.dev>
[-- Attachment #1: Type: text/plain, Size: 1269 bytes --]
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 59855b434f0b..804c201c28f3 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
[ ... ]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f3cf8d85bea0..35eebb5e7769 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
[ ... ]
> @@ -11671,7 +11674,8 @@ static bool check_kfunc_is_graph_node_api(struct bpf_verifier_env *env,
>
> switch (node_field_type) {
> case BPF_LIST_NODE:
> - ret = is_bpf_list_push_kfunc(kfunc_btf_id);
> + ret = is_bpf_list_push_kfunc(kfunc_btf_id) ||
> + kfunc_btf_id == special_kfunc_list[KF_bpf_list_del];
This isn't a bug, but Eduard Zingerman suggested in v10 adding a
is_bpf_list_del_kfunc() helper for consistency with the existing code style
(similar to is_bpf_list_push_kfunc()). This would maintain the pattern used
throughout the codebase for checking kfunc types.
Reference: https://lore.kernel.org/bpf/ca14a6fd8a1f3ce1e64c70b6d9ea2baf8df913ae.camel@gmail.com/
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26204125015
^ permalink raw reply
* Re: [PATCH bpf-next v11 2/8] bpf: clear list node owner and unlink before drop
From: bot+bpf-ci @ 2026-05-21 4:08 UTC (permalink / raw)
To: kaitao.cheng, ast, corbet, martin.lau, daniel, andrii, eddyz87,
song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
shuah, chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest, ast,
andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
In-Reply-To: <20260521032306.97118-3-kaitao.cheng@linux.dev>
[-- Attachment #1: Type: text/plain, Size: 2927 bytes --]
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 094457c3e6d3..59855b434f0b 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -2247,10 +2247,11 @@ EXPORT_SYMBOL_GPL(bpf_base_func_proto);
> void bpf_list_head_free(const struct btf_field *field, void *list_head,
> struct bpf_spin_lock *spin_lock)
> {
> - struct list_head *head = list_head, *orig_head = list_head;
> + struct list_head *head = list_head, drain, *pos, *n;
>
> BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct bpf_list_head));
> BUILD_BUG_ON(__alignof__(struct list_head) > __alignof__(struct bpf_list_head));
> + INIT_LIST_HEAD(&drain);
>
> /* Do the actual list draining outside the lock to not hold the lock for
> * too long, and also prevent deadlocks if tracing programs end up
> @@ -2261,20 +2262,30 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> __bpf_spin_lock_irqsave(spin_lock);
> if (!head->next || list_empty(head))
> goto unlock;
> - head = head->next;
> + list_for_each_safe(pos, n, head) {
> + struct bpf_list_node_kern *node;
> +
> + node = container_of(pos, struct bpf_list_node_kern, list_head);
> + WRITE_ONCE(node->owner, BPF_PTR_POISON);
> + list_move_tail(pos, &drain);
> + }
Does this introduce an unbounded delay with interrupts disabled?
The original code detached the head in O(1) time. Now there is an O(N)
loop inside __bpf_spin_lock_irqsave(). Since BPF lists can grow quite
large, could this cause soft lockups?
This also appears to contradict the comment just above the lock which
explicitly mentions doing the actual list draining outside the lock to
not hold it for too long.
Lore reference:
https://lore.kernel.org/bpf/20260513060221.DB2C7C2BCB7@smtp.kernel.org/
Reviewer: sashiko-bot@kernel.org
Date: Wed, 13 May 2026 06:02:21 +0000
> unlock:
> - INIT_LIST_HEAD(orig_head);
> + INIT_LIST_HEAD(head);
> __bpf_spin_unlock_irqrestore(spin_lock);
>
> - while (head != orig_head) {
> - void *obj = head;
> + while (!list_empty(&drain)) {
> + struct bpf_list_node_kern *node;
>
> - obj -= field->graph_root.node_offset;
> - head = head->next;
> + pos = drain.next;
> + node = container_of(pos, struct bpf_list_node_kern, list_head);
> + list_del_init(pos);
> + /* Ensure __bpf_list_add() sees the node as unlinked. */
> + smp_store_release(&node->owner, NULL);
> /* The contained type can also have resources, including a
> * bpf_list_head which needs to be freed.
> */
> - __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
> + __bpf_obj_drop_impl((char *)pos - field->graph_root.node_offset,
> + field->graph_root.value_rec, false);
> }
> }
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26204125015
^ permalink raw reply
* Re: [PATCH bpf-next v11 3/8] bpf: allow non-owning list-node args via __nonown_allowed
From: bot+bpf-ci @ 2026-05-21 4:08 UTC (permalink / raw)
To: kaitao.cheng, ast, corbet, martin.lau, daniel, andrii, eddyz87,
song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
shuah, chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest, ast,
andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
In-Reply-To: <20260521032306.97118-4-kaitao.cheng@linux.dev>
[-- Attachment #1: Type: text/plain, Size: 2076 bytes --]
> diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
> index 75e6c078e0e7..3a9db1108b95 100644
> --- a/Documentation/bpf/kfuncs.rst
> +++ b/Documentation/bpf/kfuncs.rst
[ ... ]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 8dd79b735a69..f3cf8d85bea0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
[ ... ]
> @@ -12244,6 +12249,13 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> return ret;
> break;
> case KF_ARG_PTR_TO_LIST_NODE:
> + if (is_kfunc_arg_nonown_allowed(btf, &args[i]) &&
> + type_is_non_owning_ref(reg->type) && !reg->ref_obj_id) {
> + /* Allow bpf_list_front/back return value for
> + * __nonown_allowed list-node arguments.
> + */
> + goto check_ok;
> + }
> if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> verbose(env, "%s expected pointer to allocated object\n",
> reg_arg_name(env, argno));
> @@ -12253,6 +12265,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> verbose(env, "allocated object must be referenced\n");
> return -EINVAL;
> }
> +check_ok:
> ret = process_kf_arg_ptr_to_list_node(env, reg, argno, meta);
> if (ret < 0)
> return ret;
This isn't a bug, but the multi-line comment doesn't match the BPF
subsystem style. The BPF guide requires multi-line comments to have the
opening /* on its own line for files under kernel/bpf/. Should this be:
/*
* Allow bpf_list_front/back return value for
* __nonown_allowed list-node arguments.
*/
Eduard Zingerman raised a related nit in v10 RESEND: the check
'!reg->ref_obj_id' appears redundant. Would type_is_non_owning_ref() be
sufficient here?
Reference: https://lore.kernel.org/bpf/0419643c9a04bf0824066742e52e3f197b43909b.camel@gmail.com/
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26204125015
^ permalink raw reply
* Re: [PATCH bpf-next v11 8/8] selftests/bpf: Add test cases for bpf_list_del/add/is_first/is_last/empty
From: bot+bpf-ci @ 2026-05-21 4:08 UTC (permalink / raw)
To: kaitao.cheng, ast, corbet, martin.lau, daniel, andrii, eddyz87,
song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
shuah, chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest, ast,
andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
In-Reply-To: <20260521032306.97118-9-kaitao.cheng@linux.dev>
[-- Attachment #1: Type: text/plain, Size: 1758 bytes --]
commit 75130d2969168e50252cc473cfaa880a9795c0ef
Author: Kaitao Cheng <chengkaitao@kylinos.cn>
selftests/bpf: Add test cases for bpf_list_del/add/is_first/is_last/empty
Extend refcounted_kptr with tests for bpf_list_add (including prev from
bpf_list_front and bpf_refcount_acquire), bpf_list_del (including node
from bpf_list_front, bpf_rbtree_remove and bpf_refcount_acquire),
bpf_list_empty, bpf_list_is_first/last, and push_back on uninit head.
To verify the validity of bpf_list_del/add, the test also expects the
verifier to reject calls to bpf_list_del/add made without holding the
spin_lock.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
The commit message mentions adding test cases for bpf_list_is_first,
bpf_list_is_last, and bpf_list_empty, but these test functions already
exist in the file and are not added by this patch. The existing tests
are list_empty_test, list_is_edge_test1, list_is_edge_test2, and
list_is_edge_test3.
This patch actually adds:
- list_del_test3 (new test for bpf_list_del with list_front return value)
- list_add_test1 (new test for bpf_list_add after prev node)
- list_add_test2 (new test for bpf_list_add with list_front as prev)
- list_push_back_uninit_head (new test for push_back on uninitialized head)
- list_del_without_lock_fail (negative test for del without lock)
- list_add_without_lock_fail (negative test for add without lock)
Should the commit message be adjusted to accurately reflect which tests
are being added?
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26204125015
^ permalink raw reply
* [PATCH bpf-next v11 8/8] selftests/bpf: Add test cases for bpf_list_del/add/is_first/is_last/empty
From: Kaitao Cheng @ 2026-05-21 3:23 UTC (permalink / raw)
To: ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest
In-Reply-To: <20260521032306.97118-1-kaitao.cheng@linux.dev>
From: Kaitao Cheng <chengkaitao@kylinos.cn>
Extend refcounted_kptr with tests for bpf_list_add (including prev from
bpf_list_front and bpf_refcount_acquire), bpf_list_del (including node
from bpf_list_front, bpf_rbtree_remove and bpf_refcount_acquire),
bpf_list_empty, bpf_list_is_first/last, and push_back on uninit head.
To verify the validity of bpf_list_del/add, the test also expects the
verifier to reject calls to bpf_list_del/add made without holding the
spin_lock.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
.../selftests/bpf/progs/refcounted_kptr.c | 421 ++++++++++++++++++
1 file changed, 421 insertions(+)
diff --git a/tools/testing/selftests/bpf/progs/refcounted_kptr.c b/tools/testing/selftests/bpf/progs/refcounted_kptr.c
index c847398837cc..13de169ad68f 100644
--- a/tools/testing/selftests/bpf/progs/refcounted_kptr.c
+++ b/tools/testing/selftests/bpf/progs/refcounted_kptr.c
@@ -367,6 +367,427 @@ long insert_rbtree_and_stash__del_tree_##rem_tree(void *ctx) \
INSERT_STASH_READ(true, "insert_stash_read: remove from tree");
INSERT_STASH_READ(false, "insert_stash_read: don't remove from tree");
+SEC("tc")
+__description("list_empty_test: list empty before add, non-empty after add")
+__success __retval(0)
+int list_empty_test(void *ctx)
+{
+ struct node_data *node_new;
+
+ bpf_spin_lock(&lock);
+ if (!bpf_list_empty(&head)) {
+ bpf_spin_unlock(&lock);
+ return -1;
+ }
+ bpf_spin_unlock(&lock);
+
+ node_new = bpf_obj_new(typeof(*node_new));
+ if (!node_new)
+ return -2;
+
+ bpf_spin_lock(&lock);
+ bpf_list_push_front(&head, &node_new->l);
+
+ if (bpf_list_empty(&head)) {
+ bpf_spin_unlock(&lock);
+ return -3;
+ }
+ bpf_spin_unlock(&lock);
+ return 0;
+}
+
+static struct node_data *__add_in_list(struct bpf_list_head *head,
+ struct bpf_spin_lock *lock)
+{
+ struct node_data *node_new, *node_ref;
+
+ node_new = bpf_obj_new(typeof(*node_new));
+ if (!node_new)
+ return NULL;
+
+ node_ref = bpf_refcount_acquire(node_new);
+
+ bpf_spin_lock(lock);
+ bpf_list_push_front(head, &node_new->l);
+ bpf_spin_unlock(lock);
+ return node_ref;
+}
+
+SEC("tc")
+__description("list_is_edge_test1: is_first on first node, is_last on last node")
+__success __retval(0)
+int list_is_edge_test1(void *ctx)
+{
+ struct node_data *node_first, *node_last;
+ int err = 0;
+
+ node_last = __add_in_list(&head, &lock);
+ if (!node_last)
+ return -1;
+
+ node_first = __add_in_list(&head, &lock);
+ if (!node_first) {
+ bpf_obj_drop(node_last);
+ return -2;
+ }
+
+ bpf_spin_lock(&lock);
+ if (!bpf_list_is_first(&head, &node_first->l)) {
+ err = -3;
+ goto fail;
+ }
+ if (!bpf_list_is_last(&head, &node_last->l))
+ err = -4;
+
+fail:
+ bpf_spin_unlock(&lock);
+ bpf_obj_drop(node_first);
+ bpf_obj_drop(node_last);
+ return err;
+}
+
+SEC("tc")
+__description("list_is_edge_test2: accept list_front/list_back return value")
+__success __retval(0)
+int list_is_edge_test2(void *ctx)
+{
+ struct bpf_list_node *front, *back;
+ struct node_data *a, *b;
+ long err = 0;
+
+ a = __add_in_list(&head, &lock);
+ if (!a)
+ return -1;
+
+ b = __add_in_list(&head, &lock);
+ if (!b) {
+ bpf_obj_drop(a);
+ return -2;
+ }
+
+ bpf_spin_lock(&lock);
+ front = bpf_list_front(&head);
+ back = bpf_list_back(&head);
+ if (!front || !back) {
+ err = -3;
+ goto out_unlock;
+ }
+
+ if (!bpf_list_is_first(&head, front) || bpf_list_is_last(&head, front)) {
+ err = -4;
+ goto out_unlock;
+ }
+
+ if (!bpf_list_is_last(&head, back) || bpf_list_is_first(&head, back)) {
+ err = -5;
+ goto out_unlock;
+ }
+
+out_unlock:
+ bpf_spin_unlock(&lock);
+ bpf_obj_drop(a);
+ bpf_obj_drop(b);
+ return err;
+}
+
+SEC("tc")
+__description("list_is_edge_test3: single node is both first and last")
+__success __retval(0)
+int list_is_edge_test3(void *ctx)
+{
+ struct node_data *tmp;
+ struct bpf_list_node *node;
+ long err = 0;
+
+ tmp = __add_in_list(&head, &lock);
+ if (!tmp)
+ return -1;
+
+ bpf_spin_lock(&lock);
+ node = bpf_list_front(&head);
+ if (!node) {
+ bpf_spin_unlock(&lock);
+ bpf_obj_drop(tmp);
+ return -2;
+ }
+
+ if (!bpf_list_is_first(&head, node) || !bpf_list_is_last(&head, node))
+ err = -3;
+ bpf_spin_unlock(&lock);
+
+ bpf_obj_drop(tmp);
+ return err;
+}
+
+SEC("tc")
+__description("list_del_test1: del returns removed nodes")
+__success __retval(0)
+int list_del_test1(void *ctx)
+{
+ struct node_data *node_first, *node_last;
+ struct bpf_list_node *bpf_node_first, *bpf_node_last;
+ int err = 0;
+
+ node_last = __add_in_list(&head, &lock);
+ if (!node_last)
+ return -1;
+
+ node_first = __add_in_list(&head, &lock);
+ if (!node_first) {
+ bpf_obj_drop(node_last);
+ return -2;
+ }
+
+ bpf_spin_lock(&lock);
+ bpf_node_last = bpf_list_del(&head, &node_last->l);
+ bpf_node_first = bpf_list_del(&head, &node_first->l);
+ bpf_spin_unlock(&lock);
+
+ if (bpf_node_first)
+ bpf_obj_drop(container_of(bpf_node_first, struct node_data, l));
+ else
+ err = -3;
+
+ if (bpf_node_last)
+ bpf_obj_drop(container_of(bpf_node_last, struct node_data, l));
+ else
+ err = -4;
+
+ bpf_obj_drop(node_first);
+ bpf_obj_drop(node_last);
+ return err;
+}
+
+SEC("tc")
+__description("list_del_test2: remove an arbitrary node from the list")
+__success __retval(0)
+int list_del_test2(void *ctx)
+{
+ struct bpf_rb_node *rb;
+ struct bpf_list_node *l;
+ struct node_data *n;
+ long err;
+
+ err = __insert_in_tree_and_list(&head, &root, &lock);
+ if (err)
+ return err;
+
+ bpf_spin_lock(&lock);
+ rb = bpf_rbtree_first(&root);
+ if (!rb) {
+ bpf_spin_unlock(&lock);
+ return -4;
+ }
+
+ rb = bpf_rbtree_remove(&root, rb);
+ if (!rb) {
+ bpf_spin_unlock(&lock);
+ return -5;
+ }
+
+ n = container_of(rb, struct node_data, r);
+ l = bpf_list_del(&head, &n->l);
+ bpf_spin_unlock(&lock);
+ bpf_obj_drop(n);
+ if (!l)
+ return -6;
+
+ bpf_obj_drop(container_of(l, struct node_data, l));
+ return 0;
+}
+
+SEC("tc")
+__description("list_del_test3: list_del accepts list_front return value as node")
+__success __retval(0)
+int list_del_test3(void *ctx)
+{
+ struct node_data *tmp;
+ struct bpf_list_node *bpf_node, *l;
+ long err = 0;
+
+ tmp = __add_in_list(&head, &lock);
+ if (!tmp)
+ return -1;
+
+ bpf_spin_lock(&lock);
+ bpf_node = bpf_list_front(&head);
+ if (!bpf_node) {
+ bpf_spin_unlock(&lock);
+ err = -2;
+ goto fail;
+ }
+
+ l = bpf_list_del(&head, bpf_node);
+ bpf_spin_unlock(&lock);
+ if (!l) {
+ err = -3;
+ goto fail;
+ }
+
+ bpf_obj_drop(container_of(l, struct node_data, l));
+ bpf_obj_drop(tmp);
+ return 0;
+
+fail:
+ bpf_obj_drop(tmp);
+ return err;
+}
+
+SEC("tc")
+__description("list_add_test1: insert new node after prev")
+__success __retval(0)
+int list_add_test1(void *ctx)
+{
+ struct node_data *node_first;
+ struct node_data *new_node;
+ long err = 0;
+
+ node_first = __add_in_list(&head, &lock);
+ if (!node_first)
+ return -1;
+
+ new_node = bpf_obj_new(typeof(*new_node));
+ if (!new_node) {
+ err = -2;
+ goto fail;
+ }
+
+ bpf_spin_lock(&lock);
+ err = bpf_list_add(&head, &new_node->l, &node_first->l);
+ bpf_spin_unlock(&lock);
+ if (err) {
+ err = -3;
+ goto fail;
+ }
+
+fail:
+ bpf_obj_drop(node_first);
+ return err;
+}
+
+SEC("tc")
+__description("list_add_test2: list_add accepts list_front return value as prev")
+__success __retval(0)
+int list_add_test2(void *ctx)
+{
+ struct node_data *new_node, *tmp;
+ struct bpf_list_node *bpf_node;
+ long err = 0;
+
+ tmp = __add_in_list(&head, &lock);
+ if (!tmp)
+ return -1;
+
+ new_node = bpf_obj_new(typeof(*new_node));
+ if (!new_node) {
+ err = -2;
+ goto fail;
+ }
+
+ bpf_spin_lock(&lock);
+ bpf_node = bpf_list_front(&head);
+ if (!bpf_node) {
+ bpf_spin_unlock(&lock);
+ bpf_obj_drop(new_node);
+ err = -3;
+ goto fail;
+ }
+
+ err = bpf_list_add(&head, &new_node->l, bpf_node);
+ bpf_spin_unlock(&lock);
+ if (err) {
+ err = -4;
+ goto fail;
+ }
+
+fail:
+ bpf_obj_drop(tmp);
+ return err;
+}
+
+struct uninit_head_val {
+ struct bpf_spin_lock lock;
+ struct bpf_list_head head __contains(node_data, l);
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct uninit_head_val);
+ __uint(max_entries, 1);
+} uninit_head_map SEC(".maps");
+
+SEC("tc")
+__description("list_push_back_uninit_head: push_back on 0-initialized list head")
+__success __retval(0)
+int list_push_back_uninit_head(void *ctx)
+{
+ struct uninit_head_val *st;
+ struct node_data *node;
+ int ret = -1, key = 0;
+
+ st = bpf_map_lookup_elem(&uninit_head_map, &key);
+ if (!st)
+ return -1;
+
+ node = bpf_obj_new(typeof(*node));
+ if (!node)
+ return -1;
+
+ bpf_spin_lock(&st->lock);
+ ret = bpf_list_push_back(&st->head, &node->l);
+ bpf_spin_unlock(&st->lock);
+
+ return ret;
+}
+
+SEC("?tc")
+__failure __msg("bpf_spin_lock at off=32 must be held for bpf_list_head")
+long list_del_without_lock_fail(void *ctx)
+{
+ struct node_data *n;
+ struct bpf_list_node *l;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return -1;
+
+ /* Error case: delete list node without holding lock */
+ l = bpf_list_del(&head, &n->l);
+ bpf_obj_drop(n);
+ if (!l)
+ return -2;
+ bpf_obj_drop(container_of(l, struct node_data, l));
+
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("bpf_spin_lock at off=32 must be held for bpf_list_head")
+long list_add_without_lock_fail(void *ctx)
+{
+ struct node_data *n, *prev;
+ long err;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return -1;
+
+ prev = bpf_obj_new(typeof(*prev));
+ if (!prev) {
+ bpf_obj_drop(n);
+ return -1;
+ }
+
+ /* Error case: add list node without holding lock */
+ err = bpf_list_add(&head, &n->l, &prev->l);
+ bpf_obj_drop(prev);
+ if (err)
+ return -2;
+
+ return 0;
+}
+
SEC("tc")
__success
long rbtree_refcounted_node_ref_escapes(void *ctx)
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH bpf-next v11 7/8] bpf: add bpf_list_is_first/last/empty kfuncs
From: Kaitao Cheng @ 2026-05-21 3:23 UTC (permalink / raw)
To: ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest,
Emil Tsalapatis
In-Reply-To: <20260521032306.97118-1-kaitao.cheng@linux.dev>
From: Kaitao Cheng <chengkaitao@kylinos.cn>
Add three kfuncs for BPF linked list queries:
- bpf_list_is_first(head, node): true if node is the first in the list.
- bpf_list_is_last(head, node): true if node is the last in the list.
- bpf_list_empty(head): true if the list has no entries.
Currently, without these kfuncs, to implement the above functionality
it is necessary to first call bpf_list_pop_front/back to retrieve the
first or last node before checking whether the passed-in node was the
first or last one. After the check, the node had to be pushed back into
the list using bpf_list_push_front/back, which was very inefficient.
Now, with the bpf_list_is_first/last/empty kfuncs, we can directly
check whether a node is the first, last, or whether the list is empty,
without having to first retrieve the node.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
---
kernel/bpf/helpers.c | 40 ++++++++++++++++++++++++++++++++++++++++
kernel/bpf/verifier.c | 15 +++++++++++++--
2 files changed, 53 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 89579165ef4d..b6c3d02d5593 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2656,6 +2656,43 @@ __bpf_kfunc struct bpf_list_node *bpf_list_back(struct bpf_list_head *head)
return (struct bpf_list_node *)h->prev;
}
+__bpf_kfunc bool bpf_list_is_first(struct bpf_list_head *head,
+ struct bpf_list_node *node__nonown_allowed)
+{
+ struct list_head *h = (struct list_head *)head;
+ struct bpf_list_node_kern *kn = (struct bpf_list_node_kern *)node__nonown_allowed;
+
+ if (READ_ONCE(kn->owner) != head)
+ return false;
+
+ return list_is_first(&kn->list_head, h);
+}
+
+__bpf_kfunc bool bpf_list_is_last(struct bpf_list_head *head,
+ struct bpf_list_node *node__nonown_allowed)
+{
+ struct list_head *h = (struct list_head *)head;
+ struct bpf_list_node_kern *kn = (struct bpf_list_node_kern *)node__nonown_allowed;
+
+ if (READ_ONCE(kn->owner) != head)
+ return false;
+
+ return list_is_last(&kn->list_head, h);
+}
+
+__bpf_kfunc bool bpf_list_empty(struct bpf_list_head *head)
+{
+ struct list_head *h = (struct list_head *)head;
+
+ /* If list_head was 0-initialized by map, bpf_obj_init_field wasn't
+ * called on its fields, so init here
+ */
+ if (unlikely(!h->next))
+ INIT_LIST_HEAD(h);
+
+ return list_empty(h);
+}
+
__bpf_kfunc struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
struct bpf_rb_node *node)
{
@@ -4772,6 +4809,9 @@ BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_list_del, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_list_front, KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_list_back, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_list_is_first)
+BTF_ID_FLAGS(func, bpf_list_is_last)
+BTF_ID_FLAGS(func, bpf_list_empty)
BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
BTF_ID_FLAGS(func, bpf_rbtree_remove, KF_ACQUIRE | KF_RET_NULL)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 662ad7312697..d9bdc3b32c05 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10965,6 +10965,9 @@ enum special_kfunc_type {
KF_bpf_list_del,
KF_bpf_list_front,
KF_bpf_list_back,
+ KF_bpf_list_is_first,
+ KF_bpf_list_is_last,
+ KF_bpf_list_empty,
KF_bpf_cast_to_kern_ctx,
KF_bpf_rdonly_cast,
KF_bpf_rcu_read_lock,
@@ -11035,6 +11038,9 @@ BTF_ID(func, bpf_list_pop_back)
BTF_ID(func, bpf_list_del)
BTF_ID(func, bpf_list_front)
BTF_ID(func, bpf_list_back)
+BTF_ID(func, bpf_list_is_first)
+BTF_ID(func, bpf_list_is_last)
+BTF_ID(func, bpf_list_empty)
BTF_ID(func, bpf_cast_to_kern_ctx)
BTF_ID(func, bpf_rdonly_cast)
BTF_ID(func, bpf_rcu_read_lock)
@@ -11556,7 +11562,10 @@ static bool is_bpf_list_api_kfunc(u32 btf_id)
btf_id == special_kfunc_list[KF_bpf_list_pop_back] ||
btf_id == special_kfunc_list[KF_bpf_list_del] ||
btf_id == special_kfunc_list[KF_bpf_list_front] ||
- btf_id == special_kfunc_list[KF_bpf_list_back];
+ btf_id == special_kfunc_list[KF_bpf_list_back] ||
+ btf_id == special_kfunc_list[KF_bpf_list_is_first] ||
+ btf_id == special_kfunc_list[KF_bpf_list_is_last] ||
+ btf_id == special_kfunc_list[KF_bpf_list_empty];
}
static bool is_bpf_rbtree_api_kfunc(u32 btf_id)
@@ -11678,7 +11687,9 @@ static bool check_kfunc_is_graph_node_api(struct bpf_verifier_env *env,
switch (node_field_type) {
case BPF_LIST_NODE:
ret = is_bpf_list_push_kfunc(kfunc_btf_id) ||
- kfunc_btf_id == special_kfunc_list[KF_bpf_list_del];
+ kfunc_btf_id == special_kfunc_list[KF_bpf_list_del] ||
+ kfunc_btf_id == special_kfunc_list[KF_bpf_list_is_first] ||
+ kfunc_btf_id == special_kfunc_list[KF_bpf_list_is_last];
break;
case BPF_RB_NODE:
ret = (is_bpf_rbtree_add_kfunc(kfunc_btf_id) ||
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH bpf-next v11 6/8] bpf: Add bpf_list_add to insert node after a given list node
From: Kaitao Cheng @ 2026-05-21 3:23 UTC (permalink / raw)
To: ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest
In-Reply-To: <20260521032306.97118-1-kaitao.cheng@linux.dev>
From: Kaitao Cheng <chengkaitao@kylinos.cn>
Add a new kfunc bpf_list_add(head, new, prev, meta, off) that
inserts 'new' after 'prev' in the BPF linked list. Both must be in
the same list; 'prev' must already be in the list. The new node must
be an owning reference (e.g. from bpf_obj_new); the kfunc consumes
that reference and the node becomes non-owning once inserted.
We have added an additional parameter bpf_list_head *head to
bpf_list_add, as the verifier requires the head parameter to
check whether the lock is being held.
Returns 0 on success, -EINVAL if 'prev' is not in a list or 'new'
is already in a list (or duplicate insertion). On failure, the
kernel drops the passed-in node.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
---
kernel/bpf/helpers.c | 11 +++++++++++
kernel/bpf/verifier.c | 12 +++++++++---
2 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1c69476c8a09..89579165ef4d 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2577,6 +2577,16 @@ __bpf_kfunc int bpf_list_push_back_impl(struct bpf_list_head *head,
return bpf_list_push_back(head, node, meta__ign, off);
}
+__bpf_kfunc int bpf_list_add(struct bpf_list_head *head, struct bpf_list_node *new,
+ struct bpf_list_node *prev__nonown_allowed,
+ struct btf_struct_meta *meta, u64 off)
+{
+ struct bpf_list_node_kern *n = (void *)new, *p = (void *)prev__nonown_allowed;
+ struct list_head *prev_ptr = &p->list_head;
+
+ return __bpf_list_add(n, head, &prev_ptr, meta ? meta->record : NULL, off);
+}
+
static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head,
struct list_head *n)
{
@@ -4756,6 +4766,7 @@ BTF_ID_FLAGS(func, bpf_list_push_front, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, bpf_list_push_front_impl)
BTF_ID_FLAGS(func, bpf_list_push_back, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, bpf_list_push_back_impl)
+BTF_ID_FLAGS(func, bpf_list_add, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_list_del, KF_ACQUIRE | KF_RET_NULL)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 35eebb5e7769..662ad7312697 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10959,6 +10959,7 @@ enum special_kfunc_type {
KF_bpf_list_push_front,
KF_bpf_list_push_back_impl,
KF_bpf_list_push_back,
+ KF_bpf_list_add,
KF_bpf_list_pop_front,
KF_bpf_list_pop_back,
KF_bpf_list_del,
@@ -11028,6 +11029,7 @@ BTF_ID(func, bpf_list_push_front_impl)
BTF_ID(func, bpf_list_push_front)
BTF_ID(func, bpf_list_push_back_impl)
BTF_ID(func, bpf_list_push_back)
+BTF_ID(func, bpf_list_add)
BTF_ID(func, bpf_list_pop_front)
BTF_ID(func, bpf_list_pop_back)
BTF_ID(func, bpf_list_del)
@@ -11140,7 +11142,8 @@ static bool is_bpf_list_push_kfunc(u32 func_id)
return func_id == special_kfunc_list[KF_bpf_list_push_front] ||
func_id == special_kfunc_list[KF_bpf_list_push_front_impl] ||
func_id == special_kfunc_list[KF_bpf_list_push_back] ||
- func_id == special_kfunc_list[KF_bpf_list_push_back_impl];
+ func_id == special_kfunc_list[KF_bpf_list_push_back_impl] ||
+ func_id == special_kfunc_list[KF_bpf_list_add];
}
static bool is_bpf_rbtree_add_kfunc(u32 func_id)
@@ -19524,8 +19527,11 @@ int bpf_fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
int struct_meta_reg = BPF_REG_3;
int node_offset_reg = BPF_REG_4;
- /* rbtree_add has extra 'less' arg, so args-to-fixup are in diff regs */
- if (is_bpf_rbtree_add_kfunc(desc->func_id)) {
+ /* list_add/rbtree_add have an extra arg (prev/less),
+ * so args-to-fixup are in diff regs.
+ */
+ if (desc->func_id == special_kfunc_list[KF_bpf_list_add] ||
+ is_bpf_rbtree_add_kfunc(desc->func_id)) {
struct_meta_reg = BPF_REG_4;
node_offset_reg = BPF_REG_5;
}
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH bpf-next v11 5/8] bpf: refactor __bpf_list_add to take insertion point via **prev_ptr
From: Kaitao Cheng @ 2026-05-21 3:23 UTC (permalink / raw)
To: ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest
In-Reply-To: <20260521032306.97118-1-kaitao.cheng@linux.dev>
From: Kaitao Cheng <chengkaitao@kylinos.cn>
Refactor __bpf_list_add to accept (node, head, struct list_head **prev_ptr,
..) instead of (node, head, bool tail, ..). Load prev from *prev_ptr after
INIT_LIST_HEAD(h), so we never dereference an uninitialized h->prev when
head was 0-initialized (e.g. push_back passes &h->prev).
When prev is not the list head, validate that prev is in the list via
its owner.
Prepares for bpf_list_add(head, new, prev, ..) to insert after a given
list node.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
---
kernel/bpf/helpers.c | 36 ++++++++++++++++++++++++++----------
1 file changed, 26 insertions(+), 10 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 804c201c28f3..1c69476c8a09 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2478,9 +2478,11 @@ __bpf_kfunc void *bpf_refcount_acquire_impl(void *p__refcounted_kptr, void *meta
static int __bpf_list_add(struct bpf_list_node_kern *node,
struct bpf_list_head *head,
- bool tail, struct btf_record *rec, u64 off)
+ struct list_head **prev_ptr,
+ struct btf_record *rec, u64 off)
{
struct list_head *n = &node->list_head, *h = (void *)head;
+ struct list_head *prev;
/* If list_head was 0-initialized by map, bpf_obj_init_field wasn't
* called on its fields, so init here
@@ -2488,19 +2490,31 @@ static int __bpf_list_add(struct bpf_list_node_kern *node,
if (unlikely(!h->next))
INIT_LIST_HEAD(h);
+ prev = *prev_ptr;
+
+ /* When prev is not the list head, it must be a node in this list. */
+ if (prev != h) {
+ struct bpf_list_node_kern *prev_kn =
+ container_of(prev, struct bpf_list_node_kern, list_head);
+
+ if (unlikely(READ_ONCE(prev_kn->owner) != head))
+ goto fail;
+ }
+
/* node->owner != NULL implies !list_empty(n), no need to separately
* check the latter
*/
- if (cmpxchg(&node->owner, NULL, BPF_PTR_POISON)) {
- /* Only called from BPF prog, no need to migrate_disable */
- __bpf_obj_drop_impl((void *)n - off, rec, false);
- return -EINVAL;
- }
+ if (cmpxchg(&node->owner, NULL, BPF_PTR_POISON))
+ goto fail;
- tail ? list_add_tail(n, h) : list_add(n, h);
+ list_add(n, prev);
WRITE_ONCE(node->owner, head);
-
return 0;
+
+fail:
+ /* Only called from BPF prog, no need to migrate_disable */
+ __bpf_obj_drop_impl((void *)n - off, rec, false);
+ return -EINVAL;
}
/**
@@ -2521,8 +2535,9 @@ __bpf_kfunc int bpf_list_push_front(struct bpf_list_head *head,
u64 off)
{
struct bpf_list_node_kern *n = (void *)node;
+ struct list_head *h = (void *)head;
- return __bpf_list_add(n, head, false, meta ? meta->record : NULL, off);
+ return __bpf_list_add(n, head, &h, meta ? meta->record : NULL, off);
}
__bpf_kfunc int bpf_list_push_front_impl(struct bpf_list_head *head,
@@ -2550,8 +2565,9 @@ __bpf_kfunc int bpf_list_push_back(struct bpf_list_head *head,
u64 off)
{
struct bpf_list_node_kern *n = (void *)node;
+ struct list_head *h = (void *)head;
- return __bpf_list_add(n, head, true, meta ? meta->record : NULL, off);
+ return __bpf_list_add(n, head, &h->prev, meta ? meta->record : NULL, off);
}
__bpf_kfunc int bpf_list_push_back_impl(struct bpf_list_head *head,
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH bpf-next v11 4/8] bpf: Introduce the bpf_list_del kfunc.
From: Kaitao Cheng @ 2026-05-21 3:23 UTC (permalink / raw)
To: ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest
In-Reply-To: <20260521032306.97118-1-kaitao.cheng@linux.dev>
From: Kaitao Cheng <chengkaitao@kylinos.cn>
Allow users to remove any node from a linked list.
We have added an additional parameter bpf_list_head *head to
bpf_list_del, as the verifier requires the head parameter to
check whether the lock is being held.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
---
kernel/bpf/helpers.c | 10 ++++++++++
kernel/bpf/verifier.c | 6 +++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 59855b434f0b..804c201c28f3 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2601,6 +2601,15 @@ __bpf_kfunc struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
return __bpf_list_del(head, h->prev);
}
+__bpf_kfunc struct bpf_list_node *bpf_list_del(struct bpf_list_head *head,
+ struct bpf_list_node *node__nonown_allowed)
+{
+ struct bpf_list_node_kern *kn = (void *)node__nonown_allowed;
+
+ /* verifier guarantees node is a list node rather than list head */
+ return __bpf_list_del(head, &kn->list_head);
+}
+
__bpf_kfunc struct bpf_list_node *bpf_list_front(struct bpf_list_head *head)
{
struct list_head *h = (struct list_head *)head;
@@ -4733,6 +4742,7 @@ BTF_ID_FLAGS(func, bpf_list_push_back, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, bpf_list_push_back_impl)
BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_list_del, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_list_front, KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_list_back, KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f3cf8d85bea0..35eebb5e7769 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10961,6 +10961,7 @@ enum special_kfunc_type {
KF_bpf_list_push_back,
KF_bpf_list_pop_front,
KF_bpf_list_pop_back,
+ KF_bpf_list_del,
KF_bpf_list_front,
KF_bpf_list_back,
KF_bpf_cast_to_kern_ctx,
@@ -11029,6 +11030,7 @@ BTF_ID(func, bpf_list_push_back_impl)
BTF_ID(func, bpf_list_push_back)
BTF_ID(func, bpf_list_pop_front)
BTF_ID(func, bpf_list_pop_back)
+BTF_ID(func, bpf_list_del)
BTF_ID(func, bpf_list_front)
BTF_ID(func, bpf_list_back)
BTF_ID(func, bpf_cast_to_kern_ctx)
@@ -11549,6 +11551,7 @@ static bool is_bpf_list_api_kfunc(u32 btf_id)
return is_bpf_list_push_kfunc(btf_id) ||
btf_id == special_kfunc_list[KF_bpf_list_pop_front] ||
btf_id == special_kfunc_list[KF_bpf_list_pop_back] ||
+ btf_id == special_kfunc_list[KF_bpf_list_del] ||
btf_id == special_kfunc_list[KF_bpf_list_front] ||
btf_id == special_kfunc_list[KF_bpf_list_back];
}
@@ -11671,7 +11674,8 @@ static bool check_kfunc_is_graph_node_api(struct bpf_verifier_env *env,
switch (node_field_type) {
case BPF_LIST_NODE:
- ret = is_bpf_list_push_kfunc(kfunc_btf_id);
+ ret = is_bpf_list_push_kfunc(kfunc_btf_id) ||
+ kfunc_btf_id == special_kfunc_list[KF_bpf_list_del];
break;
case BPF_RB_NODE:
ret = (is_bpf_rbtree_add_kfunc(kfunc_btf_id) ||
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH bpf-next v11 3/8] bpf: allow non-owning list-node args via __nonown_allowed
From: Kaitao Cheng @ 2026-05-21 3:23 UTC (permalink / raw)
To: ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest
In-Reply-To: <20260521032306.97118-1-kaitao.cheng@linux.dev>
From: Kaitao Cheng <chengkaitao@kylinos.cn>
KF_ARG_PTR_TO_LIST_NODE normally requires an owning reference
(PTR_TO_BTF_ID | MEM_ALLOC with ref_obj_id). Introduce the
__nonown_allowed annotation on selected list-node arguments so
non-owning references with ref_obj_id==0 are accepted as well.
This patch only adds the generic verifier support and documents the
annotation. Later patches in the series will apply it to bpf_list_add
/del(), and bpf_list_is_first/last(), allowing bpf_list_front/back()
results to be used as the insertion point, deletion target, or query
target for those kfuncs.
Verifier keeps existing owning-ref checks by default; only arguments
annotated with __nonown_allowed bypass MEM_ALLOC/ref_obj_id checks
and then follow the same list-node validation path.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
---
Documentation/bpf/kfuncs.rst | 22 ++++++++++++++++++++--
kernel/bpf/verifier.c | 13 +++++++++++++
2 files changed, 33 insertions(+), 2 deletions(-)
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
index 75e6c078e0e7..3a9db1108b95 100644
--- a/Documentation/bpf/kfuncs.rst
+++ b/Documentation/bpf/kfuncs.rst
@@ -207,8 +207,26 @@ Here, the buffer may be NULL. If the buffer is not NULL, it must be at least
buffer__szk bytes in size. The kfunc is responsible for checking if the buffer
is NULL before using it.
-2.3.5 __str Annotation
-----------------------------
+2.3.5 __nonown_allowed Annotation
+---------------------------------
+
+This annotation is used to indicate that the parameter may be a non-owning reference.
+
+An example is given below::
+
+ __bpf_kfunc int bpf_list_add(..., struct bpf_list_node
+ *prev__nonown_allowed, ...)
+ {
+ ...
+ }
+
+For the ``prev__nonown_allowed`` parameter (resolved as ``KF_ARG_PTR_TO_LIST_NODE``),
+suffix ``__nonown_allowed`` retains the usual owning-pointer rules and also
+permits a non-owning reference with no ref_obj_id (e.g. the return value of
+bpf_list_front() / bpf_list_back()).
+
+2.3.6 __str Annotation
+----------------------
This annotation is used to indicate that the argument is a constant string.
An example is given below::
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8dd79b735a69..f3cf8d85bea0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10714,6 +10714,11 @@ static bool is_kfunc_arg_nullable(const struct btf *btf, const struct btf_param
return btf_param_match_suffix(btf, arg, "__nullable");
}
+static bool is_kfunc_arg_nonown_allowed(const struct btf *btf, const struct btf_param *arg)
+{
+ return btf_param_match_suffix(btf, arg, "__nonown_allowed");
+}
+
static bool is_kfunc_arg_const_str(const struct btf *btf, const struct btf_param *arg)
{
return btf_param_match_suffix(btf, arg, "__str");
@@ -12244,6 +12249,13 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
return ret;
break;
case KF_ARG_PTR_TO_LIST_NODE:
+ if (is_kfunc_arg_nonown_allowed(btf, &args[i]) &&
+ type_is_non_owning_ref(reg->type) && !reg->ref_obj_id) {
+ /* Allow bpf_list_front/back return value for
+ * __nonown_allowed list-node arguments.
+ */
+ goto check_ok;
+ }
if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
verbose(env, "%s expected pointer to allocated object\n",
reg_arg_name(env, argno));
@@ -12253,6 +12265,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
verbose(env, "allocated object must be referenced\n");
return -EINVAL;
}
+check_ok:
ret = process_kf_arg_ptr_to_list_node(env, reg, argno, meta);
if (ret < 0)
return ret;
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH bpf-next v11 2/8] bpf: clear list node owner and unlink before drop
From: Kaitao Cheng @ 2026-05-21 3:23 UTC (permalink / raw)
To: ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest
In-Reply-To: <20260521032306.97118-1-kaitao.cheng@linux.dev>
From: Kaitao Cheng <chengkaitao@kylinos.cn>
The issue only becomes exposed once bpf_list_del() is available: callers
can pass an arbitrary bpf_list_head and bpf_list_node pair, including
nodes that are not actually linked to the supplied head, or nodes that
outlive their original head after refcount-based retention. This was
not practically reachable for callers restricted to pop-style helpers
alone; bpf_list_del() widens the API surface.
A failure mode appears when bpf_list_head_free() runs while a program
still holds an independent refcount on a node (for example via
bpf_refcount_acquire()). The list head value embedded in map memory can
go away while the node object survives. If node->owner is left pointing
at the old head address until drop completes, that pointer becomes stale.
If a new bpf_list_head is later allocated at the same address and the
stale node is passed to bpf_list_del(), the owner comparison can succeed
even though the node is not really linked to the new head, and
list_del_init() will follow bogus next/prev pointers with the risk of
memory corruption.
When draining a bpf_list_head, mark each node owner with BPF_PTR_POISON
under the map spinlock while moving it to a private drain list, then
list_del_init() the node and clear owner to NULL before calling
__bpf_obj_drop_impl(). Concurrent readers therefore never observe a
node that appears linked to a head while its list_head is inconsistent,
and surviving refcounted nodes never retain a stale non-NULL owner.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
kernel/bpf/helpers.c | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 094457c3e6d3..59855b434f0b 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2247,10 +2247,11 @@ EXPORT_SYMBOL_GPL(bpf_base_func_proto);
void bpf_list_head_free(const struct btf_field *field, void *list_head,
struct bpf_spin_lock *spin_lock)
{
- struct list_head *head = list_head, *orig_head = list_head;
+ struct list_head *head = list_head, drain, *pos, *n;
BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct bpf_list_head));
BUILD_BUG_ON(__alignof__(struct list_head) > __alignof__(struct bpf_list_head));
+ INIT_LIST_HEAD(&drain);
/* Do the actual list draining outside the lock to not hold the lock for
* too long, and also prevent deadlocks if tracing programs end up
@@ -2261,20 +2262,30 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
__bpf_spin_lock_irqsave(spin_lock);
if (!head->next || list_empty(head))
goto unlock;
- head = head->next;
+ list_for_each_safe(pos, n, head) {
+ struct bpf_list_node_kern *node;
+
+ node = container_of(pos, struct bpf_list_node_kern, list_head);
+ WRITE_ONCE(node->owner, BPF_PTR_POISON);
+ list_move_tail(pos, &drain);
+ }
unlock:
- INIT_LIST_HEAD(orig_head);
+ INIT_LIST_HEAD(head);
__bpf_spin_unlock_irqrestore(spin_lock);
- while (head != orig_head) {
- void *obj = head;
+ while (!list_empty(&drain)) {
+ struct bpf_list_node_kern *node;
- obj -= field->graph_root.node_offset;
- head = head->next;
+ pos = drain.next;
+ node = container_of(pos, struct bpf_list_node_kern, list_head);
+ list_del_init(pos);
+ /* Ensure __bpf_list_add() sees the node as unlinked. */
+ smp_store_release(&node->owner, NULL);
/* The contained type can also have resources, including a
* bpf_list_head which needs to be freed.
*/
- __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
+ __bpf_obj_drop_impl((char *)pos - field->graph_root.node_offset,
+ field->graph_root.value_rec, false);
}
}
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH bpf-next v11 1/8] bpf: refactor __bpf_list_del to take list node pointer
From: Kaitao Cheng @ 2026-05-21 3:22 UTC (permalink / raw)
To: ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest
In-Reply-To: <20260521032306.97118-1-kaitao.cheng@linux.dev>
From: Kaitao Cheng <chengkaitao@kylinos.cn>
Refactor __bpf_list_del to accept (head, struct list_head *n) instead of
(head, bool tail). The caller now passes the specific node to remove:
bpf_list_pop_front passes h->next, bpf_list_pop_back passes h->prev.
Prepares for introducing bpf_list_del(head, node) kfunc to remove an
arbitrary node when the user holds ownership.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
---
kernel/bpf/helpers.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 07de26e7314c..094457c3e6d3 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2550,37 +2550,44 @@ __bpf_kfunc int bpf_list_push_back_impl(struct bpf_list_head *head,
return bpf_list_push_back(head, node, meta__ign, off);
}
-static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head, bool tail)
+static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head,
+ struct list_head *n)
{
- struct list_head *n, *h = (void *)head;
+ struct list_head *h = (void *)head;
struct bpf_list_node_kern *node;
/* If list_head was 0-initialized by map, bpf_obj_init_field wasn't
* called on its fields, so init here
*/
- if (unlikely(!h->next))
+ if (unlikely(!h->next)) {
INIT_LIST_HEAD(h);
+ return NULL;
+ }
if (list_empty(h))
return NULL;
- n = tail ? h->prev : h->next;
node = container_of(n, struct bpf_list_node_kern, list_head);
- if (WARN_ON_ONCE(READ_ONCE(node->owner) != head))
+ if (unlikely(READ_ONCE(node->owner) != head))
return NULL;
list_del_init(n);
- WRITE_ONCE(node->owner, NULL);
+ /* Ensure __bpf_list_add() sees the node as unlinked. */
+ smp_store_release(&node->owner, NULL);
return (struct bpf_list_node *)n;
}
__bpf_kfunc struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head)
{
- return __bpf_list_del(head, false);
+ struct list_head *h = (void *)head;
+
+ return __bpf_list_del(head, h->next);
}
__bpf_kfunc struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
{
- return __bpf_list_del(head, true);
+ struct list_head *h = (void *)head;
+
+ return __bpf_list_del(head, h->prev);
}
__bpf_kfunc struct bpf_list_node *bpf_list_front(struct bpf_list_head *head)
--
2.50.1 (Apple Git-155)
^ permalink raw reply related
* [PATCH bpf-next v11 0/8] bpf: Extend the bpf_list family of APIs
From: Kaitao Cheng @ 2026-05-21 3:22 UTC (permalink / raw)
To: ast, corbet, martin.lau, daniel, andrii, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
chengkaitao, skhan, memxor
Cc: bpf, linux-kernel, linux-doc, vmalik, linux-kselftest,
Kaitao Cheng
In BPF, a list can only be used to implement a stack structure.
Due to an incomplete API set, only FIFO or LIFO operations are
supported. The patches enhance the BPF list API, making it more
list-like.
Five new kfuncs have been added:
bpf_list_del: remove a node from the list
bpf_list_add_impl: insert a node after a given list node
bpf_list_is_first: check if a node is the first in the list
bpf_list_is_last: check if a node is the last in the list
bpf_list_empty: check if the list is empty
And add test cases for the aforementioned kfuncs.
Changes in v11:
- Move [PATCH v10 7/8] earlier (Eduard Zingerman)
- Fix the synchronization issue in [PATCH v10 2/8] (Eduard Zingerman,
Alexei Starovoitov)
Changes in v10:
- Remove the table-driven approach (Ihor Solodrai)
- Use the __nonown_allowed suffix for bpf_list_del/front/back
- Add test cases for __nonown_allowed
Changes in v9:
- Expand table-driven approach coverage (Emil Tsalapatis)
- Clear list node owner and unlink before drop (Emil Tsalapatis)
- Remove warnings caused by WARN_ON_ONCE() (Emil Tsalapatis)
- Introduce the __nonown_allowed suffix (Alexei Starovoitov)
Changes in v8:
- Use [patch v7 5/5] as the start of the patch series (Leon Hwang)
- Introduce double pointer prev_ptr in __bpf_list_del
(Kumar Kartikeya Dwivedi)
- Extract refactored __bpf_list_del/add into separate patches (Leon Hwang)
- Allow bpf_list_front/back result as the prev argument of bpf_list_add
- Split test cases (Leon Hwang)
Changes in v7:
- Replace bpf_list_node_is_edge with bpf_list_is_first/is_last
- Reimplement __bpf_list_del and __bpf_list_add (Kumar Kartikeya Dwivedi)
- Simplify test cases (Mykyta Yatsenko)
Changes in v6:
- Merge [patch v5 (2,4,6)/6] into [patch v6 4/5] (Leon Hwang)
- If list_head was 0-initialized, init it
- refactor kfunc checks to table-driven approach (Leon Hwang)
Changes in v5:
- Fix bpf_obj leak on bpf_list_add_impl error
Changes in v4:
- [patch v3 1/6] Revert to version v1 (Alexei Starovoitov)
- Change the parameters of bpf_list_add_impl to (head, new, prev, ...)
Changes in v3:
- Add a new lock_rec member to struct bpf_reference_state for lock
holding detection.
- Add test cases to verify that the verifier correctly restricts calls
to bpf_list_del when the spin_lock is not held.
Changes in v2:
- Remove the head parameter from bpf_list_del (Alexei Starovoitov)
- Add bpf_list_add/is_first/is_last/empty to API and test cases
(Alexei Starovoitov)
Link to v10:
https://lore.kernel.org/all/20260512055919.95716-1-kaitao.cheng@linux.dev/
Link to v9:
https://lore.kernel.org/all/20260329140506.9595-1-pilgrimtao@gmail.com/
Link to v8:
https://lore.kernel.org/all/20260316112843.78657-1-pilgrimtao@gmail.com/
Link to v7:
https://lore.kernel.org/all/20260308134614.29711-1-pilgrimtao@gmail.com/
Link to v6:
https://lore.kernel.org/all/20260304143459.78059-1-pilgrimtao@gmail.com/
Link to v5:
https://lore.kernel.org/all/20260304031606.43884-1-pilgrimtao@gmail.com/
Link to v4:
https://lore.kernel.org/all/20260303135219.33726-1-pilgrimtao@gmail.com/
Link to v3:
https://lore.kernel.org/all/20260302124028.82420-1-pilgrimtao@gmail.com/
Link to v2:
https://lore.kernel.org/all/20260225092651.94689-1-pilgrimtao@gmail.com/
Link to v1:
https://lore.kernel.org/all/20260209025250.55750-1-pilgrimtao@gmail.com/
Kaitao Cheng (8):
bpf: refactor __bpf_list_del to take list node pointer
bpf: clear list node owner and unlink before drop
bpf: allow non-owning list-node args via __nonown_allowed
bpf: Introduce the bpf_list_del kfunc.
bpf: refactor __bpf_list_add to take insertion point via **prev_ptr
bpf: Add bpf_list_add to insert node after a given list node
bpf: add bpf_list_is_first/last/empty kfuncs
selftests/bpf: Add test cases for
bpf_list_del/add/is_first/is_last/empty
Documentation/bpf/kfuncs.rst | 22 +-
kernel/bpf/helpers.c | 147 ++++--
kernel/bpf/verifier.c | 44 +-
.../selftests/bpf/progs/refcounted_kptr.c | 421 ++++++++++++++++++
4 files changed, 601 insertions(+), 33 deletions(-)
--
2.50.1 (Apple Git-155)
^ permalink raw reply
* [soc:zx/soc 1/1] htmldocs: Documentation/arch/arm/zte/zx297520v3.rst:66: WARNING: Title underline too short.
From: kernel test robot @ 2026-05-21 2:57 UTC (permalink / raw)
To: Stefan Dösinger
Cc: oe-kbuild-all, linux-arm-kernel, arm, Linus Walleij,
Krzysztof Kozlowski, linux-doc
tree: https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git zx/soc
head: 220ae5d36dba278003d265aabd080ffa78553f5a
commit: 220ae5d36dba278003d265aabd080ffa78553f5a [1/1] ARM: zte: Add zx297520v3 platform support
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260521/202605210401.8D6jRbz8-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605210401.8D6jRbz8-lkp@intel.com/
All warnings (new ones prefixed by >>):
WARNING: Documentation/ABI/testing/sysfs-class-reboot-mode-reboot_modes:36: abi_sys_class_reboot_mode_driver_reboot_modes doesn't have a description
WARNING: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/os_mode is defined 2 times: Documentation/ABI/testing/sysfs-driver-hid-lenovo-go:364; Documentation/ABI/testing/sysfs-driver-hid-lenovo-go-s:234
WARNING: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/os_mode_index is defined 2 times: Documentation/ABI/testing/sysfs-driver-hid-lenovo-go:373; Documentation/ABI/testing/sysfs-driver-hid-lenovo-go-s:243
WARNING: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/touchpad/enabled is defined 2 times: Documentation/ABI/testing/sysfs-driver-hid-lenovo-go:636; Documentation/ABI/testing/sysfs-driver-hid-lenovo-go-s:252
WARNING: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/touchpad/enabled_index is defined 2 times: Documentation/ABI/testing/sysfs-driver-hid-lenovo-go:645; Documentation/ABI/testing/sysfs-driver-hid-lenovo-go-s:261
>> Documentation/arch/arm/zte/zx297520v3.rst:66: WARNING: Title underline too short.
--
3. Building for built-in U-Boot
--------------------------- [docutils]
>> Documentation/arch/arm/zte/zx297520v3.rst:90: WARNING: Enumerated list ends without a blank line; unexpected unindent. [docutils]
>> Documentation/arch/arm/zte/zx297520v3.rst:116: WARNING: Inline literal start-string without end-string. [docutils]
Documentation/arch/arm/zte/zx297520v3.rst:137: ERROR: Unexpected indentation. [docutils]
>> Documentation/arch/arm/zte/zx297520v3.rst:138: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
Documentation/arch/arm/zte/zx297520v3.rst:164: WARNING: Inline literal start-string without end-string. [docutils]
>> Documentation/arch/arm/zte/zx297520v3.rst:164: WARNING: Inline interpreted text or phrase reference start-string without end-string. [docutils]
>> Documentation/arch/arm/zte/zx297520v3.rst:7: WARNING: Document or section may not begin with a transition. [docutils]
Documentation/arch/riscv/zicfilp.rst:79: WARNING: Inline literal start-string without end-string. [docutils]
Documentation/core-api/kref:328: ./include/linux/kref.h:72: WARNING: Invalid C declaration: Expected end of definition. [error at 96]
int kref_put_mutex (struct kref *kref, void (*release)(struct kref *kref), struct mutex *mutex) __cond_acquires(true# mutex)
------------------------------------------------------------------------------------------------^
Documentation/core-api/kref:328: ./include/linux/kref.h:94: WARNING: Invalid C declaration: Expected end of definition. [error at 92]
vim +66 Documentation/arch/arm/zte/zx297520v3.rst
6
> 7 ...............................................................................
8
9 Author: Stefan Dösinger
10
11 Date : 27 Jan 2026
12
13 1. Hardware description
14 ---------------------------
15 Zx297520v3 SoCs use a 64 bit capable Cortex-A53 CPU and GICv3, although they
16 run in arm32 mode only. The CPU has support EL3, but no hypervisor (EL2) and
17 it seems to lack VFP and NEON.
18
19 The SoC is used in a number of cheap LTE to WiFi routers, both battery powered
20 MiFis and stationary CPEs. In addition to the CPU these devices usually have
21 64 MB Ram (although some is shared with the LTE chip), 128 MB NAND flash, an
22 SDIO connected RTL8192-type Wifi chip limited to 2.4 ghz operation, USB 2,
23 and buttons. Devices with as low as 32 MB or as high as 128 MB ram exist, as
24 do devices with 8 or 16 MB of NOR flash.
25
26 Some devices, especially the stationary ones, have 100 mbit Ethernet and an
27 Ethernet switch.
28
29 Usually the devices have LEDs for status indication, although some have SPI or
30 I2C connected displays
31
32 Some have an SD card slot. If it exists, it is a better choice for the root
33 file system because it easily outperforms the built-in NAND.
34
35 The LTE interface runs on a separate DSP called ZSP880. It is probably derived
36 from LSI ZSPs and has an undocumented instruction set. The ZSP communicates
37 with the main CPU via SRAM and DRAM and a mailbox hardware that can generate
38 IRQs on either ends.
39
40 There is also a Cortex M0 CPU, which is responsible for early HW initialization
41 and starting the Cortex A53 CPU. It does not have any essential purpose once
42 U-Boot is started. A SRAM-Based handover protocol exists to run custom code on
43 this CPU.
44
45 2. Booting via USB
46 ---------------------------
47
48 The Boot ROM has support for booting custom code via USB. This mode can be
49 entered by connecting a Boot PIN to GND or by modifying the third byte on NAND
50 (set it to anything other than 0x5A aka 'Z'). A free software tool to start
51 custom U-Boot and kernels can be found here:
52
53 https://github.com/zx297520v3-mainline/zx297520v3-loader
54
55 If USB download mode is entered but no boot commands are sent through USB, the
56 device will proceed to boot normally after a few seconds. It is therefore
57 possible to enable USB boot permanently and still leave the default boot files
58 in place.
59
60 https://github.com/zx297520v3-mainline/u-boot-mainline
61
62 Contains an U-Boot version that can be used with the USB loader and sets up the
63 CPU and interrupt controller to comply with Linux's booting requirements.
64
65 3. Building for built-in U-Boot
> 66 ---------------------------
67 The devices come with an ancient U-Boot that loads legacy uImages from NAND and
68 boots them without a chance for the user to interrupt. The images are stored in
69 files ap_cpuap.bin and ap_recovery.bin on a jffs2 partition named imagefs,
70 usually mtd4. A file named "fotaflag" switches between the two modes.
71
72 In addition to the uImage header, those files have a 384 byte signature header,
73 which is used for authenticating the images on some devices. Most devices have
74 this authentication disabled and it is enough to pad the uImage files with 384
75 zero bytes.
76
77 Builtin U-Boot also poorly sets up the CPU. Read the next section for details
78 on this. It has no support for loading DTBs, so CONFIG_ARM_APPENDED_DTB is
79 needed.
80
81 So to build an image that boots from NAND the following steps are necessary:
82
83 1) Patch the assembly code from section 3 into arch/arm/kernel/head.S.
84 2) make zx29_defconfig
85 3) make [-j x]
86 4) cat arch/arm/boot/zImage arch/arm/boot/dts/zte/[device].dtb > kernel+dtb
87 5) mkimage -A arm -O linux -T kernel -C none -a 0x20008000 -d kernel+dtb uimg
88 6) dd if=/dev/zero bs=1 count=384 of=ap_recovery.bin
89 7) cat uimg >> ap_recovery.bin
> 90 8) Place this file onto imagefs on the device. Delete ap_cpuap.bin if the
91 free space is not enough.
92 9) Create the file fotaflag: echo -n FOTA-RECOVERY > fotaflag
93
94 For development, booting ap_recovery.bin is recommended because the normal boot
95 mode arms the watchdog before starting the kernel.
96
97 4. CPU and GIC Setup
98 ---------------------------
99
100 Generally CPU and GICv3 need to be set up according to the requirements spelled
101 out in Documentation/arch/arm64/booting.rst. For zx297520v3 this means:
102
103 1. GICD_CTLR.DS=1 to disable GIC security
104 2. Enable access to ICC_SRE
105 3. Disable trapping IRQs into monitor mode
106 4. Configure EL2 and below to run in insecure mode.
107 5. Configure timer PPIs to active-low.
108
109 The kernel sources provided by ZTE do not boot either (interrupts do not work
110 at all). They are incomplete in other aspects too, so it is assumed that there
111 is some workaround similar to the one described in this document somewhere in
112 the binary blobs.
113
114 The assembly code below is given as an example of how to achieve this:
115
> 116 ```
117 #include <linux/irqchip/arm-gic-v3.h>
118 #include <asm/assembler.h>
119 #include <asm/cp15.h>
120
121 @ Detect sane bootloaders and skip the hack
122 ldr r3, =0xf2000000
123 ldr r3, [r3]
124 ldr r4, =(GICD_CTLR_ARE_NS | GICD_CTLR_DS)
125 cmp r3, r4
126 beq skip_zx_hack
127 @ This allows EL1 to handle ints hat are normally handled by EL2/3.
128 ldr r3, =0xf2000000
129 str r4, [r3]
130
131 cps #MON_MODE
132
133 @ Work in non-secure physical address space: SCR_EL3.NS = 1. At least the UART
134 @ seems to respond only to non-secure addresses. I have taken insipiration from
135 @ Raspberry pi's armstub7.S here.
136 mov r3, #0x131 @ non-secure, Make F, A bits in CPSR writeable
137 @ Allow hypervisor call.
> 138 mcr p15, 0, r3, c1, c1, 0
139
140 @ AP_PPI_MODE_REG: Configure timer PPIs (10, 11, 13, 14) to active-low.
141 ldr r3, =0xF22020a8
142 ldr r4, =0x50
143 str r4, [r3]
144 ldr r3, =0xF22020ac
145 ldr r4, =0x14
146 str r4, [r3]
147
148 @ Enable EL2 access to ICC_SRE (bit 3, ICC_SRE_EL3.Enable). Enable system reg
149 @ access to GICv3 registers (bit 0, ICC_SRE_EL3.SRE) for EL1 and EL3.
150 mrc p15, 6, r3, c12, c12, 5 @ ICC_SRE_EL3
151 orr r3, #0x9 @ FIXME: No defines for SRE_EL3 values?
152 mcr p15, 6, r3, c12, c12, 5
153 mrc p15, 0, r3, c12, c12, 5 @ ICC_SRE_EL1
154 orr r3, #(ICC_SRE_EL1_SRE)
155 mcr p15, 0, r3, c12, c12, 5
156
157 @ Like ICC_SRE_EL3, enable EL1 access to ICC_SRE and system register access
158 @ for EL2.
159 mrc p15, 4, r3, c12, c9, 5 @ ICC_SRE_EL2 aka ICC_HSRE
160 orr r3, r3, #(ICC_SRE_EL2_ENABLE | ICC_SRE_EL2_SRE)
161 mcr p15, 4, r3, c12, c9, 5
162 isb
163
> 164 @ Back to SVC mode
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Wei Yang @ 2026-05-21 2:46 UTC (permalink / raw)
To: Vernon Yang
Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <8f9834db-8981-4eb1-ae46-94908943da3d@gmail.com>
On Thu, May 21, 2026 at 10:36:15AM +0800, Vernon Yang wrote:
>On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
>> Enable khugepaged to collapse to mTHP orders. This patch implements the
>> main scanning logic using a bitmap to track occupied pages and a stack
>> structure that allows us to find optimal collapse sizes.
>>
>> Previous to this patch, PMD collapse had 3 main phases, a light weight
>> scanning phase (mmap_read_lock) that determines a potential PMD
>> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
>> phase (mmap_write_lock).
>>
>> To enabled mTHP collapse we make the following changes:
>>
>> During PMD scan phase, track occupied pages in a bitmap. When mTHP
>> orders are enabled, we remove the restriction of max_ptes_none during the
>> scan phase to avoid missing potential mTHP collapse candidates. Once we
>> have scanned the full PMD range and updated the bitmap to track occupied
>> pages, we use the bitmap to find the optimal mTHP size.
>>
>> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
>> and determine the best eligible order for the collapse. A stack structure
>> is used instead of traditional recursion to manage the search. This also
>> prevents a traditional recursive approach when the kernel stack struct is
>> limited. The algorithm recursively splits the bitmap into smaller chunks to
>> find the highest order mTHPs that satisfy the collapse criteria. We start
>> by attempting the PMD order, then moved on the consecutively lower orders
>> (mTHP collapse). The stack maintains a pair of variables (offset, order),
>> indicating the number of PTEs from the start of the PMD, and the order of
>> the potential collapse candidate.
>>
>> The algorithm for consuming the bitmap works as such:
>> 1) push (0, HPAGE_PMD_ORDER) onto the stack
>> 2) pop the stack
>> 3) check if the number of set bits in that (offset,order) pair
>> statisfy the max_ptes_none threshold for that order
>> 4) if yes, attempt collapse
>> 5) if no (or collapse fails), push two new stack items representing
>> the left and right halves of the current bitmap range, at the
>> next lower order
>> 6) repeat at step (2) until stack is empty.
>>
>> Below is a diagram representing the algorithm and stack items:
>>
>> offset mid_offset
>> | |
>> | |
>> v v
>> ____________________________________
>> | PTE Page Table |
>> --------------------------------------
>> <-------><------->
>> order-1 order-1
>>
>> mTHP collapses reject regions containing swapped out or shared pages.
>> This is because adding new entries can lead to new none pages, and these
>> may lead to constant promotion into a higher order mTHP. A similar
>> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>> introducing at least 2x the number of pages, and on a future scan will
>> satisfy the promotion condition once again. This issue is prevented via
>> the collapse_max_ptes_none() function which imposes the max_ptes_none
>> restrictions above.
>>
>> We currently only support mTHP collapse for max_ptes_none values of 0
>> and HPAGE_PMD_NR - 1. resulting in the following behavior:
>>
>> - max_ptes_none=0: Never introduce new empty pages during collapse
>> - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>> available mTHP order
>>
>> Any other max_ptes_none value will emit a warning and skip mTHP collapse
>> attempts. There should be no behavior change for PMD collapse.
>>
>> Once we determine what mTHP sizes fits best in that PMD range a collapse
>> is attempted. A minimum collapse order of 2 is used as this is the lowest
>> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>>
>> Currently madv_collapse is not supported and will only attempt PMD
>> collapse.
>>
>> We can also remove the check for is_khugepaged inside the PMD scan as
>> the collapse_max_ptes_none() function handles this logic now.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 174 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 3492b135d667..39bf7ea8a6e8 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>
>> static struct kmem_cache *mm_slot_cache __ro_after_init;
>>
>> +#define KHUGEPAGED_MIN_MTHP_ORDER 2
>> +/*
>> + * mthp_collapse() does an iterative DFS over a binary tree, from
>> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
>> + * size needed for a DFS on a binary tree is height + 1, where
>> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
>> + *
>> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
>> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
>> + */
>> +#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
>> +
>> +/*
>> + * Defines a range of PTE entries in a PTE page table which are being
>> + * considered for mTHP collapse.
>> + *
>> + * @offset: the offset of the first PTE entry in a PMD range.
>> + * @order: the order of the PTE entries being considered for collapse.
>> + */
>> +struct mthp_range {
>> + u16 offset;
>> + u8 order;
>> +};
>> +
>> struct collapse_control {
>> bool is_khugepaged;
>>
>> @@ -111,6 +135,12 @@ struct collapse_control {
>>
>> /* nodemask for allocation fallback */
>> nodemask_t alloc_nmask;
>> +
>> + /* Each bit represents a single occupied (!none/zero) page. */
>> + DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
>> + /* A mask of the current range being considered for mTHP collapse. */
>> + DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> + struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>> };
>>
>> /**
>> @@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>> return result;
>> }
>>
>> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
>> + u16 offset, u8 order)
>> +{
>> + const int size = *stack_size;
>> + struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
>> +
>> + VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
>> + stack->order = order;
>> + stack->offset = offset;
>> + (*stack_size)++;
>> +}
>> +
>> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
>> + int *stack_size)
>> +{
>> + const int size = *stack_size;
>> +
>> + VM_WARN_ON_ONCE(size <= 0);
>> + (*stack_size)--;
>> + return cc->mthp_bitmap_stack[size - 1];
>> +}
>> +
>> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
>> + u16 offset, unsigned int nr_ptes)
>> +{
>> + bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> + bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
>> + return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> +}
>> +
>> +/*
>> + * mthp_collapse() consumes the bitmap that is generated during
>> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
>> + *
>> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
>> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
>> + * of the bitmap for collapse eligibility. The stack maintains a pair of
>> + * variables (offset, order), indicating the number of PTEs from the start of
>> + * the PMD, and the order of the potential collapse candidate respectively. We
>> + * start at the PMD order and check if it is eligible for collapse; if not, we
>> + * add two entries to the stack at a lower order to represent the left and right
>> + * halves of the PTE page table we are examining.
>> + *
>> + * offset mid_offset
>> + * | |
>> + * | |
>> + * v v
>> + * --------------------------------------
>> + * | cc->mthp_bitmap |
>> + * --------------------------------------
>> + * <-------><------->
>> + * order-1 order-1
>> + *
>> + * For each of these, we determine how many PTE entries are occupied in the
>> + * range of PTE entries we propose to collapse, then we compare this to a
>> + * threshold number of PTE entries which would need to be occupied for a
>> + * collapse to be permitted at that order (accounting for max_ptes_none).
>> + *
>> + * If a collapse is permitted, we attempt to collapse the PTE range into a
>> + * mTHP.
>> + */
>> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>> + int referenced, int unmapped, struct collapse_control *cc,
>> + unsigned long enabled_orders)
>> +{
>> + unsigned int nr_occupied_ptes, nr_ptes;
>> + int max_ptes_none, collapsed = 0, stack_size = 0;
>> + unsigned long collapse_address;
>> + struct mthp_range range;
>> + u16 offset;
>> + u8 order;
>> +
>> + collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>> +
>> + while (stack_size) {
>> + range = collapse_mthp_stack_pop(cc, &stack_size);
>> + order = range.order;
>> + offset = range.offset;
>> + nr_ptes = 1UL << order;
>> +
>> + if (!test_bit(order, &enabled_orders))
>> + goto next_order;
>> +
>> + max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>> +
>> + if (max_ptes_none < 0)
>> + return collapsed;
>> +
>> + nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
>> + nr_ptes);
>> +
>> + if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>> + int ret;
>> +
>> + collapse_address = address + offset * PAGE_SIZE;
>> + ret = collapse_huge_page(mm, collapse_address, referenced,
>> + unmapped, cc, order);
>> + if (ret == SCAN_SUCCEED) {
>> + collapsed += nr_ptes;
>> + continue;
>> + }
>> + }
>> +
>> +next_order:
>> + if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>
>Hi Nico, thank you very much for your contributions to this series.
>
>I found a minor issue, for MADV_COLLAPSE, if collapse_huge_page() fails
>for some reason (e.g. allocate folio), it goes to next_order and
>continues splitting to the next small order. However, enabled_orders
>only supports HPAGE_PMD_ORDER, so it keeps runing the split operations
>without any effective work until KHUGEPAGED_MIN_MTHP_ORDER is reached
>before exiting. For khugepaged, e.g. setting only 2MB to always, also
>same phenomenon.
Yes, but it does no actual work since it is checked after pop up.
>
>This does not affect the overall functionality of mthp collapse, just
>redundant.
>
>The redundant operations can be easily skipped with the following
>modification. If I miss some thing, please let me know. Thanks!
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 1a25af3d6d0f..fa407cce525c 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1574,7 +1574,7 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> }
>
> next_order:
>- if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>+ if ((BIT(order) - 1) & enabled_orders) {
> const u8 next_order = order - 1;
> const u16 mid_offset = offset + (nr_ptes / 2);
>
This would stop the iteration if there are other lower enabled order, right?
>Cheers,
>Vernon
--
Wei Yang
Help you, Help me
^ permalink raw reply
* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Vernon Yang @ 2026-05-21 2:36 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260511185817.686831-12-npache@redhat.com>
On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
>
> Previous to this patch, PMD collapse had 3 main phases, a light weight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> phase (mmap_write_lock).
>
> To enabled mTHP collapse we make the following changes:
>
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
>
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search. This also
> prevents a traditional recursive approach when the kernel stack struct is
> limited. The algorithm recursively splits the bitmap into smaller chunks to
> find the highest order mTHPs that satisfy the collapse criteria. We start
> by attempting the PMD order, then moved on the consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
>
> The algorithm for consuming the bitmap works as such:
> 1) push (0, HPAGE_PMD_ORDER) onto the stack
> 2) pop the stack
> 3) check if the number of set bits in that (offset,order) pair
> statisfy the max_ptes_none threshold for that order
> 4) if yes, attempt collapse
> 5) if no (or collapse fails), push two new stack items representing
> the left and right halves of the current bitmap range, at the
> next lower order
> 6) repeat at step (2) until stack is empty.
>
> Below is a diagram representing the algorithm and stack items:
>
> offset mid_offset
> | |
> | |
> v v
> ____________________________________
> | PTE Page Table |
> --------------------------------------
> <-------><------->
> order-1 order-1
>
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order mTHP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, and on a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
>
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1. resulting in the following behavior:
>
> - max_ptes_none=0: Never introduce new empty pages during collapse
> - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> available mTHP order
>
> Any other max_ptes_none value will emit a warning and skip mTHP collapse
> attempts. There should be no behavior change for PMD collapse.
>
> Once we determine what mTHP sizes fits best in that PMD range a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
>
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 174 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 3492b135d667..39bf7ea8a6e8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
> static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> +#define KHUGEPAGED_MIN_MTHP_ORDER 2
> +/*
> + * mthp_collapse() does an iterative DFS over a binary tree, from
> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> + * size needed for a DFS on a binary tree is height + 1, where
> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> + *
> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> + */
> +#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for mTHP collapse.
> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> + u16 offset;
> + u8 order;
> +};
> +
> struct collapse_control {
> bool is_khugepaged;
>
> @@ -111,6 +135,12 @@ struct collapse_control {
>
> /* nodemask for allocation fallback */
> nodemask_t alloc_nmask;
> +
> + /* Each bit represents a single occupied (!none/zero) page. */
> + DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> + /* A mask of the current range being considered for mTHP collapse. */
> + DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> + struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> };
>
> /**
> @@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> return result;
> }
>
> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> + u16 offset, u8 order)
> +{
> + const int size = *stack_size;
> + struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> + VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> + stack->order = order;
> + stack->offset = offset;
> + (*stack_size)++;
> +}
> +
> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> + int *stack_size)
> +{
> + const int size = *stack_size;
> +
> + VM_WARN_ON_ONCE(size <= 0);
> + (*stack_size)--;
> + return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> + u16 offset, unsigned int nr_ptes)
> +{
> + bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> + bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> + return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + * offset mid_offset
> + * | |
> + * | |
> + * v v
> + * --------------------------------------
> + * | cc->mthp_bitmap |
> + * --------------------------------------
> + * <-------><------->
> + * order-1 order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> + int referenced, int unmapped, struct collapse_control *cc,
> + unsigned long enabled_orders)
> +{
> + unsigned int nr_occupied_ptes, nr_ptes;
> + int max_ptes_none, collapsed = 0, stack_size = 0;
> + unsigned long collapse_address;
> + struct mthp_range range;
> + u16 offset;
> + u8 order;
> +
> + collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> + while (stack_size) {
> + range = collapse_mthp_stack_pop(cc, &stack_size);
> + order = range.order;
> + offset = range.offset;
> + nr_ptes = 1UL << order;
> +
> + if (!test_bit(order, &enabled_orders))
> + goto next_order;
> +
> + max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> +
> + if (max_ptes_none < 0)
> + return collapsed;
> +
> + nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> + nr_ptes);
> +
> + if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> + int ret;
> +
> + collapse_address = address + offset * PAGE_SIZE;
> + ret = collapse_huge_page(mm, collapse_address, referenced,
> + unmapped, cc, order);
> + if (ret == SCAN_SUCCEED) {
> + collapsed += nr_ptes;
> + continue;
> + }
> + }
> +
> +next_order:
> + if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
Hi Nico, thank you very much for your contributions to this series.
I found a minor issue, for MADV_COLLAPSE, if collapse_huge_page() fails
for some reason (e.g. allocate folio), it goes to next_order and
continues splitting to the next small order. However, enabled_orders
only supports HPAGE_PMD_ORDER, so it keeps runing the split operations
without any effective work until KHUGEPAGED_MIN_MTHP_ORDER is reached
before exiting. For khugepaged, e.g. setting only 2MB to always, also
same phenomenon.
This does not affect the overall functionality of mthp collapse, just
redundant.
The redundant operations can be easily skipped with the following
modification. If I miss some thing, please let me know. Thanks!
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1a25af3d6d0f..fa407cce525c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1574,7 +1574,7 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
}
next_order:
- if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
+ if ((BIT(order) - 1) & enabled_orders) {
const u8 next_order = order - 1;
const u16 mid_offset = offset + (nr_ptes / 2);
--
Cheers,
Vernon
> + const u8 next_order = order - 1;
> + const u16 mid_offset = offset + (nr_ptes / 2);
> +
> + collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> + next_order);
> + collapse_mthp_stack_push(cc, &stack_size, offset,
> + next_order);
> + }
> + }
> + return collapsed;
> +}
> +
> static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start_addr,
> bool *lock_dropped, struct collapse_control *cc)
> {
> - const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> + int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> + enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> pmd_t *pmd;
> - pte_t *pte, *_pte;
> - int none_or_zero = 0, shared = 0, referenced = 0;
> + pte_t *pte, *_pte, pteval;
> + int i;
> + int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr;
> + unsigned long enabled_orders;
> spinlock_t *ptl;
> int node = NUMA_NO_NODE, unmapped = 0;
>
> @@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> goto out;
> }
>
> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
> +
> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> +
> + /*
> + * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> + * scan all pages to populate the bitmap for mTHP collapse.
> + */
> + if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> + max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> +
> pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> if (!pte) {
> cc->progress++;
> @@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> goto out;
> }
>
> - for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> - _pte++, addr += PAGE_SIZE) {
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
> + _pte = pte + i;
> + addr = start_addr + i * PAGE_SIZE;
> + pteval = ptep_get(_pte);
> +
> cc->progress++;
>
> - pte_t pteval = ptep_get(_pte);
> if (pte_none_or_zero(pteval)) {
> if (++none_or_zero > max_ptes_none) {
> result = SCAN_EXCEED_NONE_PTE;
> @@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> }
> }
>
> + /* Set bit for occupied pages */
> + __set_bit(i, cc->mthp_bitmap);
> /*
> * Record which node the original page is from and save this
> * information to cc->node_load[].
> @@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> if (result == SCAN_SUCCEED) {
> /* collapse_huge_page expects the lock to be dropped before calling */
> mmap_read_unlock(mm);
> - result = collapse_huge_page(mm, start_addr, referenced,
> - unmapped, cc, HPAGE_PMD_ORDER);
> + nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
> + cc, enabled_orders);
> /* collapse_huge_page will return with the mmap_lock released */
> *lock_dropped = true;
> + result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> }
> out:
> trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> --
> 2.54.0
>
>
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox