Linux-mm Archive on lore.kernel.org
From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Thomas Gleixner <tglx@kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	sched-ext@lists.linux.dev, bpf@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 7/8] sched_ext: Sub-allocator over kernel-claimed BPF arena pages
Date: Thu, 21 May 2026 19:54:01 +0200
Message-ID: <ag9GueVem_OAkdMQ@gpd4>
In-Reply-To: <dd5b3702a826666242b6eb6e805bf83f@kernel.org>

On Thu, May 21, 2026 at 07:37:46AM -1000, Tejun Heo wrote:
> Build a per-scheduler sub-allocator on top of pages claimed from the BPF
> arena registered in the previous patch. Subsequent kernel-managed
> arena-resident structures (e.g. per-CPU set_cmask cmask) carve their storage
> from this pool.
> 
> scx_arena_pool_init() creates a gen_pool. scx_arena_alloc() returns the
> kernel VA. On exhaustion, the pool grows by claiming more pages via
> bpf_arena_alloc_pages_sleepable(). Chunks are added at the kernel-side
> mapping address; callers translate to the BPF-arena form themselves if
> needed.
> 
> Allocations sleep (GFP_KERNEL) - they may grow the pool through vzalloc and
> arena page allocation. All current consumers run from the enable path (after
> ops.init() and the kernel-side arena auto-discovery, before validate_ops()),
> where sleeping is fine.
> 
> scx_arena_pool_destroy() walks each chunk, returns outstanding ranges to the
> gen_pool with gen_pool_free() and then calls gen_pool_destroy(). The
> underlying arena pages are released when the arena map itself is torn down,
> so the pool destroy doesn't free them explicitly.
> 
> v2: Switch scx_arena_alloc() to a loop. (Andrea)
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Andrea Righi <arighi@nvidia.com>
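
For readers following along, the grow-on-exhaustion behaviour described in
the changelog can be modelled in userspace roughly as below. This is only a
sketch under stated assumptions, not the kernel code: the toy_* names and
struct toy_pool are hypothetical, 4 KiB pages are assumed, and the toy tracks
only its newest chunk whereas a real gen_pool keeps every chunk.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT	12			/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE	((size_t)1 << TOY_PAGE_SHIFT)
#define TOY_GROW_PAGES	4			/* mirrors SCX_ARENA_GROW_PAGES */

/* Hypothetical stand-in for the gen_pool plus its arena backing. */
struct toy_pool {
	size_t free_bytes;	/* bytes left in the newest chunk */
	size_t pages_left;	/* arena pages still available to claim */
	unsigned int grows;	/* number of successful growths */
};

/* Pages to claim per growth: the fixed step, or enough to fit @size. */
static size_t toy_grow_page_cnt(size_t size)
{
	size_t need = (size + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;

	return need > TOY_GROW_PAGES ? need : TOY_GROW_PAGES;
}

/* Claim @page_cnt pages as a new chunk. Returns 0, or -1 when exhausted. */
static int toy_grow(struct toy_pool *pool, size_t page_cnt)
{
	if (page_cnt > pool->pages_left)
		return -1;
	pool->pages_left -= page_cnt;
	pool->free_bytes = page_cnt * TOY_PAGE_SIZE;
	pool->grows++;
	return 0;
}

/* Try the pool first; on exhaustion grow and retry. Returns 0 on success. */
static int toy_alloc(struct toy_pool *pool, size_t size)
{
	assert(pool && size > 0);

	for (;;) {
		if (pool->free_bytes >= size) {
			pool->free_bytes -= size;
			return 0;
		}
		if (toy_grow(pool, toy_grow_page_cnt(size)))
			return -1;
	}
}
```

The loop ends after at most one growth per call in this model, because each
growth claims at least enough pages to hold the request.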

Looks good to me.

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  kernel/sched/build_policy.c |    4 +
>  kernel/sched/ext.c          |   11 +++
>  kernel/sched/ext_arena.c    |  126 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/ext_arena.h    |   18 ++++++
>  kernel/sched/ext_internal.h |    5 +
>  5 files changed, 164 insertions(+)
> 
> --- a/kernel/sched/build_policy.c
> +++ b/kernel/sched/build_policy.c
> @@ -59,12 +59,16 @@
>  
>  #ifdef CONFIG_SCHED_CLASS_EXT
>  # include <linux/btf_ids.h>
> +# include <linux/find.h>
> +# include <linux/genalloc.h>
>  # include "ext_types.h"
>  # include "ext_internal.h"
>  # include "ext_cid.h"
> +# include "ext_arena.h"
>  # include "ext_idle.h"
>  # include "ext.c"
>  # include "ext_cid.c"
> +# include "ext_arena.c"
>  # include "ext_idle.c"
>  #endif
>  
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5003,6 +5003,7 @@ static void scx_sched_free_rcu_work(stru
>  
>  	rhashtable_free_and_destroy(&sch->dsq_hash, NULL, NULL);
>  	free_exit_info(sch->exit_info);
> +	scx_arena_pool_destroy(sch);
>  	if (sch->arena_map)
>  		bpf_map_put(sch->arena_map);
>  	kfree(sch);
> @@ -7155,6 +7156,12 @@ static void scx_root_enable_workfn(struc
>  		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
>  	}
>  
> +	ret = scx_arena_pool_init(sch);
> +	if (ret) {
> +		cpus_read_unlock();
> +		goto err_disable;
> +	}
> +
>  	for (i = SCX_OPI_CPU_HOTPLUG_BEGIN; i < SCX_OPI_CPU_HOTPLUG_END; i++)
>  		if (((void (**)(void))ops)[i])
>  			set_bit(i, sch->has_op);
> @@ -7473,6 +7480,10 @@ static void scx_sub_enable_workfn(struct
>  		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
>  	}
>  
> +	ret = scx_arena_pool_init(sch);
> +	if (ret)
> +		goto err_disable;
> +
>  	if (validate_ops(sch, ops))
>  		goto err_disable;
>  
> --- /dev/null
> +++ b/kernel/sched/ext_arena.c
> @@ -0,0 +1,126 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
> + *
> + * scx_arena_pool: kernel-side sub-allocator over BPF-arena pages.
> + *
> + * Each chunk added to @sch->arena_pool comes from one
> + * bpf_arena_alloc_pages_sleepable() call and is registered at the
> + * kernel-side mapping address. Callers translate to the BPF-arena form
> + * themselves if needed.
> + *
> + * Allocations grow the pool on demand. Underlying arena pages are released
> + * when the arena map itself is torn down.
> + *
> + * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
> + */
> +
> +enum scx_arena_consts {
> +	SCX_ARENA_MIN_ORDER		= 3,	/* 8-byte minimum sub-allocation */
> +	SCX_ARENA_GROW_PAGES		= 4,	/* minimum pages claimed per growth */
> +};
> +
> +s32 scx_arena_pool_init(struct scx_sched *sch)
> +{
> +	if (!sch->arena_map)
> +		return 0;
> +
> +	sch->arena_pool = gen_pool_create(SCX_ARENA_MIN_ORDER, NUMA_NO_NODE);
> +	if (!sch->arena_pool)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> +static void scx_arena_clear_chunk(struct gen_pool *pool, struct gen_pool_chunk *chunk,
> +				  void *data)
> +{
> +	int order = pool->min_alloc_order;
> +	size_t chunk_sz = chunk->end_addr - chunk->start_addr + 1;
> +	unsigned long end_bit = chunk_sz >> order;
> +	unsigned long b, e;
> +
> +	for_each_set_bitrange(b, e, chunk->bits, end_bit)
> +		gen_pool_free(pool, chunk->start_addr + (b << order),
> +			      (e - b) << order);
> +}
> +
> +/*
> + * Tear down the pool. Outstanding gen_pool allocations are freed via
> + * scx_arena_clear_chunk() so gen_pool_destroy() doesn't BUG. The underlying
> + * arena pages are released when the arena map itself is torn down.
> + */
> +void scx_arena_pool_destroy(struct scx_sched *sch)
> +{
> +	if (!sch->arena_pool)
> +		return;
> +	gen_pool_for_each_chunk(sch->arena_pool, scx_arena_clear_chunk, NULL);
> +	gen_pool_destroy(sch->arena_pool);
> +	sch->arena_pool = NULL;
> +}
> +
> +/*
> + * Grow the pool by @page_cnt pages. bpf_arena_alloc_pages_sleepable() and
> + * gen_pool_add() (which calls vzalloc(GFP_KERNEL)) require a sleepable
> + * context.
> + */
> +static int scx_arena_grow(struct scx_sched *sch, u32 page_cnt)
> +{
> +	u64 kern_vm_start;
> +	u32 uaddr32;
> +	void *p;
> +	int ret;
> +
> +	if (!sch->arena_map || !sch->arena_pool)
> +		return -EINVAL;
> +
> +	p = bpf_arena_alloc_pages_sleepable(sch->arena_map, NULL,
> +					    page_cnt, NUMA_NO_NODE, 0);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	uaddr32 = (u32)(unsigned long)p;
> +	kern_vm_start = bpf_arena_map_kern_vm_start(sch->arena_map);
> +
> +	ret = gen_pool_add(sch->arena_pool, kern_vm_start + uaddr32,
> +			   page_cnt * PAGE_SIZE, NUMA_NO_NODE);
> +	if (ret) {
> +		bpf_arena_free_pages_non_sleepable(sch->arena_map, p, page_cnt);
> +		return ret;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Allocate @size bytes from the arena pool. Returns kernel VA on success, NULL
> + * on failure. May grow the pool via scx_arena_grow() which sleeps. Caller must
> + * be in a GFP_KERNEL context.
> + */
> +void *scx_arena_alloc(struct scx_sched *sch, size_t size)
> +{
> +	unsigned long kern_va;
> +	u32 page_cnt;
> +
> +	might_sleep();
> +
> +	if (!sch->arena_pool)
> +		return NULL;
> +
> +	while (true) {
> +		kern_va = gen_pool_alloc(sch->arena_pool, size);
> +		if (kern_va)
> +			break;
> +		page_cnt = max_t(u32, SCX_ARENA_GROW_PAGES,
> +				 (size + PAGE_SIZE - 1) >> PAGE_SHIFT);
> +		if (scx_arena_grow(sch, page_cnt))
> +			return NULL;
> +	}
> +
> +	return (void *)kern_va;
> +}
> +
> +void scx_arena_free(struct scx_sched *sch, void *kern_va, size_t size)
> +{
> +	if (sch->arena_pool && kern_va)
> +		gen_pool_free(sch->arena_pool, (unsigned long)kern_va, size);
> +}
> --- /dev/null
> +++ b/kernel/sched/ext_arena.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
> + *
> + * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2025 Tejun Heo <tj@kernel.org>
> + */
> +#ifndef _KERNEL_SCHED_EXT_ARENA_H
> +#define _KERNEL_SCHED_EXT_ARENA_H
> +
> +struct scx_sched;
> +
> +s32 scx_arena_pool_init(struct scx_sched *sch);
> +void scx_arena_pool_destroy(struct scx_sched *sch);
> +void *scx_arena_alloc(struct scx_sched *sch, size_t size);
> +void scx_arena_free(struct scx_sched *sch, void *kern_va, size_t size);
> +
> +#endif /* _KERNEL_SCHED_EXT_ARENA_H */
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -1116,8 +1116,13 @@ struct scx_sched {
>  	 * Arena map auto-discovered from member progs at struct_ops attach.
>  	 * cid-form schedulers must use exactly one arena across all member
>  	 * progs. NULL on cpu-form.
> +	 *
> +	 * @arena_pool sub-allocates @arena_map. Each gen_pool chunk is added
> +	 * at the kernel-side mapping address. Grows on demand and pages are
> +	 * not released until sched destroy.
>  	 */
>  	struct bpf_map		*arena_map;
> +	struct gen_pool		*arena_pool;
>  
>  	DECLARE_BITMAP(has_op, SCX_OPI_END);
>  
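
As an illustration of the teardown walk in scx_arena_clear_chunk(), here is a
userspace sketch of turning runs of set bits in an allocation bitmap into
(address, length) ranges. It is a simplified model rather than the kernel
helper: toy_busy_ranges() and struct toy_range are hypothetical names, the
bitmap is a single 64-bit word, and each bit is assumed to cover one
8-byte granule as with SCX_ARENA_MIN_ORDER.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MIN_ORDER	3	/* mirrors SCX_ARENA_MIN_ORDER: 8-byte granules */

struct toy_range {
	uintptr_t addr;		/* start address of an in-use range */
	size_t len;		/* length of the range in bytes */
};

/*
 * Walk the runs of set bits in the low @nbits (at most 64) of @bits and
 * record each run as an address range based at @start. Returns the number
 * of ranges written, never more than @max.
 */
static size_t toy_busy_ranges(uint64_t bits, unsigned int nbits,
			      uintptr_t start, struct toy_range *out,
			      size_t max)
{
	size_t n = 0;
	unsigned int b = 0;

	while (b < nbits && n < max) {
		unsigned int e;

		/* Skip clear bits to find the start of the next run. */
		while (b < nbits && !((bits >> b) & 1))
			b++;
		if (b == nbits)
			break;

		/* Extend to the first clear bit after the run. */
		e = b;
		while (e < nbits && ((bits >> e) & 1))
			e++;

		out[n].addr = start + ((uintptr_t)b << TOY_MIN_ORDER);
		out[n].len = (size_t)(e - b) << TOY_MIN_ORDER;
		n++;
		b = e;
	}
	return n;
}
```

Each recorded range corresponds to one gen_pool_free() call in the patch,
with the bit index and run length both scaled by the minimum allocation
order.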


Thread overview: 24+ messages
2026-05-20 23:50 [PATCHSET v3 sched_ext/for-7.2] bpf/arena: Direct kernel-side access Tejun Heo
2026-05-20 23:50 ` [PATCH 1/8] mm: Add ptep_try_set() for lockless empty-slot installs Tejun Heo
2026-05-21  7:00   ` Andrea Righi
2026-05-21 17:37   ` [PATCH v3 " Tejun Heo
2026-05-20 23:50 ` [PATCH 2/8] bpf: Recover arena kernel faults with scratch page Tejun Heo
2026-05-21  3:16   ` Emil Tsalapatis
2026-05-21  9:42   ` Alexei Starovoitov
2026-05-21 17:39     ` Tejun Heo
2026-05-20 23:50 ` [PATCH 3/8] bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers Tejun Heo
2026-05-21  3:17   ` Emil Tsalapatis
2026-05-20 23:50 ` [PATCH 4/8] bpf: Add bpf_struct_ops_for_each_prog() Tejun Heo
2026-05-21  4:07   ` Emil Tsalapatis
2026-05-20 23:50 ` [PATCH 5/8] bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena() Tejun Heo
2026-05-21  4:08   ` Emil Tsalapatis
2026-05-20 23:50 ` [PATCH 6/8] sched_ext: Require an arena for cid-form schedulers Tejun Heo
2026-05-21  4:15   ` Emil Tsalapatis
2026-05-20 23:50 ` [PATCH 7/8] sched_ext: Sub-allocator over kernel-claimed BPF arena pages Tejun Heo
2026-05-21  7:56   ` Andrea Righi
2026-05-21 17:22     ` Tejun Heo
2026-05-21 17:37   ` [PATCH v2 " Tejun Heo
2026-05-21 17:54     ` Andrea Righi [this message]
2026-05-20 23:50 ` [PATCH 8/8] sched_ext: Convert ops.set_cmask() to arena-resident cmask Tejun Heo
2026-05-21  4:19   ` Emil Tsalapatis
2026-05-22  1:59 ` [PATCH v3 2/8] bpf: Recover arena kernel faults with scratch page Tejun Heo
