From: Rohan Kakulawaram <rohanka@google.com>
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>,
Jiri Olsa <jolsa@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Tejun Heo <tj@kernel.org>,
Matt Bobrowski <mattbobrowski@google.com>,
Josh Don <joshdon@google.com>,
rohanka@google.com
Subject: [RFC PATCH bpf-next] bpf: ephemeral cgroup BPF control programs
Date: Tue, 3 Feb 2026 10:20:55 +0000 [thread overview]
Message-ID: <20260203102058.41030-1-rohanka@google.com> (raw)
Extended Berkeley Packet Filter (eBPF) programs are loadable
programs that can hook into various contexts within the kernel:
kernel functions, tracepoints, etc. Since these programs are
decoupled from the main kernel binary and can be loaded without a
machine reboot, there is a desire to outsource some of the kernel’s
responsibilities to eBPF for increased flexibility. Furthermore,
eBPF can serve as a bridge between userspace and the kernel by
facilitating access to the kernel’s internal state.
One of the main gaps in achieving these ends, however, is that
there is no infrastructure for exposing per-cgroup data via file
paths that mirror the cgroupfs hierarchy. Such infrastructure would
provide a unified source of truth for the various streams of
cgroup-related data. Importantly, the lifetime of these ephemeral
files should be tied to the cgroup tree itself: files appear when a
cgroup is created and disappear when it is removed. Cgroup
iterators can mimic some of this functionality through bpffs pins
but lack the dynamism of the approach outlined here.
To illustrate the value of this infrastructure, we note that it
would be instrumental in aiding some of our efforts at Google. For
instance, Borglet, the daemon that manages workloads on each
production machine, repeatedly has to parse raw memory cgroup files
from cgroupfs. If we could directly expose these stats at each
level of the cgroup hierarchy via bpf, we could forgo some of the
expensive parsing associated with the current approach. Moreover,
in the context of upstream efforts, this infrastructure could allow
bpf-based schedulers to expose customized cgroup controls to user
space. Overall, this paradigm of exposing specialized stat files
through the file system can be incredibly valuable in providing
robust kernel visibility.
Approach 1: BPFFS Centric
------------------------
This approach introduces a new BPF program type: BPF_PROG_TYPE_CGROUP_STAT.
When such a program is loaded and linked, per-cgroup files are made
available, mirrored under the BPFFS filesystem. For example, a
program named "hist_oncpu" would expose data at paths like
/sys/fs/bpf/cgroup/...<cgroup tree>.../hist_oncpu.
To accommodate cgroup-v1 hierarchies, a separate directory, such as
/sys/fs/bpf/cgroup-v1/<controller>/...<memcg_tree>.../memcg_histo,
would be populated for each v1 controller.
Sample read program:

  SEC("cgroup/stat")
  int histo_on_cpu(struct bpf_iter__cgroup *ctx)
  {
  	struct seq_file *seq = ctx->meta->seq;
  	struct cgroup *cgrp = ctx->cgroup;

  	if (cgrp)
  		expose_buckets(cgrp, seq);
  	return 0;
  }
Approach 2: Cgroupfs Centric (Preferred)
-----------------------------------------
This alternative exposes ephemeral files directly within cgroupfs. During
the initial cgroup traversal executed when link_create is called, the
__kernfs_create_file function would be used for every cgroup directory
encountered. The filenames would adhere to a bpf.stat.<program_name>
convention to clearly distinguish them as BPF-managed ephemeral files.
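To make the convention concrete, a helper along these lines could
build the file name from the program name; make_stat_name and the
length cap are assumptions for illustration, not an existing kernel
API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define STAT_PREFIX "bpf.stat."
#define STAT_NAME_MAX 64 /* assumed cap; kernfs names are bounded by NAME_MAX */

/* Build "bpf.stat.<prog_name>". Returns 0 on success, -1 if the
 * program name is empty or the result would not fit. */
int make_stat_name(char *buf, size_t sz, const char *prog_name)
{
	if (!prog_name || !*prog_name)
		return -1;
	if (snprintf(buf, sz, STAT_PREFIX "%s", prog_name) >= (int)sz)
		return -1;
	return 0;
}
```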
The Case for Approach 2
------------------------
Approach 2 is preferred because it avoids complexities inherent in the
BPFFS-centric approach, such as:
1. Syncing the directory structure between cgroupfs and bpffs
2. Handling the distinct cgroup-v1 and cgroup-v2 hierarchies within bpffs
BPF Syscall Story
-------------------
The primary syscall of interest is link_create, which is invoked
after the user has loaded the program. BPF links are traditionally
used to manage the lifecycle of bpf programs; in this context, that
means we can swap out the program associated with a link and
thereby alter the content of its ephemeral files. When the link is
created, it is "attached" to the cgroup tree, analogous to how
other link types attach to a target. The user passes in the cgroup
whose descendants will expose the ephemeral stat file:
attr->link_create.cgroup_root. At this point, the underlying file
the program backs is exposed for every cgroup in that subtree. We
will add
the file metadata, including the program link, to a list referenced by
this root cgroup. As discussed in the "Evolving Cgroup Tree" section,
this list will be utilized by cgroup_mkdir and cgroup_rmdir to manage the
lifecycle of ephemeral files within the directories of descendant cgroups.
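To sketch this bookkeeping, here is a userspace mock of the
per-root file list; struct stat_file_entry, register_stat_link, and
the mock_* stubs are hypothetical names standing in for the real
cgroup and bpf_link structures:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stub standing in for a bpf_link holding the program. */
struct mock_link {
	int refcnt;
};

/* Metadata for one ephemeral file, kept on a list owned by the
 * root cgroup passed via attr->link_create.cgroup_root. */
struct stat_file_entry {
	char name[64];                  /* "bpf.stat.<prog>" */
	struct mock_link *link;
	struct stat_file_entry *next;
};

/* Stub root cgroup: just the head of its ephemeral-file list. */
struct mock_cgroup_root {
	struct stat_file_entry *files;
};

/* Mimics the link_create step: record the file metadata on the
 * root so cgroup_mkdir/cgroup_rmdir can later consult it. */
int register_stat_link(struct mock_cgroup_root *root, const char *name,
		       struct mock_link *link)
{
	struct stat_file_entry *e = calloc(1, sizeof(*e));

	if (!e)
		return -1;
	snprintf(e->name, sizeof(e->name), "%s", name);
	e->link = link;
	link->refcnt++;                 /* the list holds a reference */
	e->next = root->files;
	root->files = e;
	return 0;
}

struct stat_file_entry *find_stat_file(struct mock_cgroup_root *root,
				       const char *name)
{
	for (struct stat_file_entry *e = root->files; e; e = e->next)
		if (!strcmp(e->name, name))
			return e;
	return NULL;
}
```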
At a high level, here are the file/seq operations we wish to define:

  open    -> prepare a seq file with seq->private containing the
             metadata the program needs (i.e. the cgroup)
  read    -> invoke the seq_show operation on the seq file in
             file->private
  release -> free the seq file and drop the bpf program's refcount
seq_show for this file should be relatively simple: we set up the program ctx
to take in the cgroup pointer, as well as the seq file, and then run
the program.
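The show path can be sketched as a userspace mock, with stub types
standing in for seq_file, cgroup, and the program context; all
mock_* names and the sample program body are assumptions for
illustration:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Minimal stand-in for the kernel's seq_file. */
struct mock_seq_file {
	char buf[256];
	size_t len;
	void *private;          /* would hold the cgroup from open() */
};

struct mock_cgroup {
	unsigned long long id;
};

/* Context handed to the program, mirroring bpf_iter__cgroup. */
struct mock_ctx {
	struct mock_seq_file *seq;
	struct mock_cgroup *cgroup;
};

void mock_seq_printf(struct mock_seq_file *seq, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	seq->len += vsnprintf(seq->buf + seq->len,
			      sizeof(seq->buf) - seq->len, fmt, ap);
	va_end(ap);
}

/* Sample "program" body: dump one line for the cgroup. */
int histo_on_cpu_mock(struct mock_ctx *ctx)
{
	if (ctx->cgroup)
		mock_seq_printf(ctx->seq, "cgroup %llu\n", ctx->cgroup->id);
	return 0;
}

/* What seq_show would do: build the ctx and run the program. */
int mock_seq_show(struct mock_seq_file *seq)
{
	struct mock_ctx ctx = {
		.seq = seq,
		.cgroup = seq->private,
	};

	return histo_on_cpu_mock(&ctx);
}
```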
It is important to note that we might need to extend this program
type to handle writes, which is a prerequisite for providing
sched_ext cgroup controls. See the "Potential Feature: File Writes"
section for more information.
Cgroup Traversal
------------------------
With cgroup_mutex in hand, we will traverse the cgroup tree(s). For each
iteration in approach 1, we check the corresponding level of the bpffs
cgroup tree and see if there is an entry corresponding to our cgroup. To
facilitate this, we can store the dentry of the corresponding bpffs dir
within the cgroup struct. Thus, when we reach a particular cgroup we
invoke lookup_one_qstr_excl using its parent dentry as the base.
Essentially, we want to emulate filename_create() without doing a
full path resolution for every cgroup we come across. Once we
create this new directory, or confirm that it already exists from a
previous traversal, we can add the file using vfs_mkobj with the
new set of file operations described in the previous section.
Note that this complexity is not present in the cgroupfs-centric
approach, as the kernfs_node backing a cgroup's directory is
referenced by the cgroup itself.
To support both v1 and v2 hierarchies simultaneously, the traversal will go
as follows: we first traverse the default cgroup root to construct the v2
hierarchy, then iterate through all cgroup subsystems to identify those
belonging to the v1 hierarchy and create corresponding subdirectories under
/sys/fs/bpf/cgroup-v1 for each controller. Once again, with the cgroupfs-
centric approach, we do not need to deal with the complexities of these
distinct hierarchies as the cgroup dir, which is accessible in each step
of this traversal, is all we need to create the file.
If we fail during this traversal, we must remove the associated ephemeral
file in bpffs (or cgroupfs) for each visited cgroup. This is handled by
re-walking the hierarchy (in post-order for each root). In the
bpffs approach, if no other ephemeral files remain in the system on
failure, the directories associated with each cgroup must also be
removed during this re-walk.
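The failure handling above can be sketched with a mock tree;
create_files/remove_files and the failure-injection budget are
illustrative stand-ins for the real creation path and its error
(e.g. -ENOMEM):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Stub cgroup node: tracks whether its ephemeral file was created. */
struct mock_node {
	struct mock_node *child[MAX_CHILDREN];
	int nr_children;
	int file_created;
};

/* Pre-order walk creating the file at each node; *budget simulates
 * a failure once it reaches zero. Returns 0 or -1. */
int create_files(struct mock_node *n, int *budget)
{
	if (*budget == 0)
		return -1;
	(*budget)--;
	n->file_created = 1;
	for (int i = 0; i < n->nr_children; i++)
		if (create_files(n->child[i], budget) < 0)
			return -1;
	return 0;
}

/* Post-order re-walk removing whatever the failed pass created. */
void remove_files(struct mock_node *n)
{
	for (int i = 0; i < n->nr_children; i++)
		remove_files(n->child[i]);
	n->file_created = 0;
}
```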
Evolving Cgroup Tree
------------------------
We also wish to ensure that the ephemeral BPF file hierarchy evolves
alongside the cgroup tree. In the bpffs-centric model, cgroup_mkdir
creates a new directory using the parent bpffs dentry as a base,
populating it with required ephemeral files. Conversely, the cgroupfs-
centric model creates the ephemeral file within the current cgroup
directory. During cgroup_mkdir, we walk each ancestor of the new
cgroup, iterating through its associated file list and adding the
files to the appropriate directory in the appropriate filesystem.
Additionally, these operations must manage link reference counts
precisely to keep the underlying links alive. Accordingly,
cgroup_rmdir must perform an equivalent traversal of the ephemeral
files, decrementing the reference count of each associated link as
it is removed.
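A minimal mock of the mkdir-side bookkeeping, assuming hypothetical
mock_* types in place of the real cgroup and file-list structures:

```c
#include <assert.h>
#include <string.h>

#define MAX_FILES 8

/* Stub cgroup: a parent pointer, the ephemeral files registered at
 * this node as a link root, and the files present in its dir. */
struct mock_cgroup {
	struct mock_cgroup *parent;
	const char *owned[MAX_FILES];   /* files registered at this root */
	int nr_owned;
	const char *present[MAX_FILES]; /* files created in this dir */
	int nr_present;
};

/* What cgroup_mkdir would do: walk every ancestor, copying each
 * ancestor's registered ephemeral files into the new directory. */
void mock_cgroup_mkdir(struct mock_cgroup *new, struct mock_cgroup *parent)
{
	new->parent = parent;
	for (struct mock_cgroup *a = parent; a; a = a->parent)
		for (int i = 0; i < a->nr_owned; i++)
			new->present[new->nr_present++] = a->owned[i];
}

int mock_has_file(struct mock_cgroup *cg, const char *name)
{
	for (int i = 0; i < cg->nr_present; i++)
		if (!strcmp(cg->present[i], name))
			return 1;
	return 0;
}
```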
Potential Feature: File Writes
--------------------------------
It is possible that we could use this mechanism to enable bpf based
schedulers to expose cgroup controls to userspace. Thus, it is worth
considering allowing for writes via this interface so that user space can
turn these knobs. From an implementation standpoint, we could
potentially use the same program to handle both reads and writes.
In that case, the read/write handlers must provide a program
context that tells the program which mode it is operating in. For
instance, ctx->meta.buffer could be NULL when the program is in
read mode; in write mode, it would be populated with the
user-supplied data and used by the program to update some internal
state.
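The mode dispatch could look roughly like the following userspace
mock; mock_ctx and mock_stat_prog are hypothetical stand-ins for
the proposed context and program:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stub context: buffer is NULL in read mode; in write mode it
 * holds the user-supplied data (mirroring the proposed
 * ctx->meta.buffer). */
struct mock_ctx {
	const char *buffer;     /* NULL => read mode */
	char out[64];           /* stands in for the seq_file */
	long state;             /* internal state a write can update */
};

/* One program handling both modes, dispatching on ctx->buffer. */
int mock_stat_prog(struct mock_ctx *ctx)
{
	if (!ctx->buffer) {
		/* read mode: emit the current state */
		snprintf(ctx->out, sizeof(ctx->out), "state %ld\n",
			 ctx->state);
	} else {
		/* write mode: parse the user's input into state */
		ctx->state = strtol(ctx->buffer, NULL, 10);
	}
	return 0;
}
```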
Potential Feature: Lazy File Creation
--------------------------------------
Rather than engaging in the complex operation of traversing the
cgroup hierarchy, which risks becoming a system bottleneck because
it must hold cgroup_mutex, we could create the ephemeral files
lazily, when a task first attempts to read them. This would
potentially involve modifying the lookup operation in
kernfs_dir_iops to invoke a custom handler after the function
attempts to find a file using kernfs_find_ns. In the case of
cgroupfs, this custom handler could search a red-black tree
containing the ephemeral files' metadata. If we find the
associated file in this tree, we create the necessary file
structures (e.g. inode, kernfs_node, etc.) to support the file and
link it to the containing cgroup's kernfs_node.
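The lazy lookup path can be mocked with a flat table standing in
for the red-black tree; lazy_lookup and its types are assumptions
for illustration:

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 8

/* Registered ephemeral-file metadata, standing in for the
 * red-black tree the lookup handler would consult. */
struct lazy_entry {
	const char *name;
	int instantiated;       /* inode/kernfs_node created yet? */
};

struct lazy_table {
	struct lazy_entry entries[MAX_ENTRIES];
	int nr;
};

/* What the custom lookup handler would do after kernfs_find_ns
 * misses: search the metadata, instantiating the file on first
 * use. */
struct lazy_entry *lazy_lookup(struct lazy_table *t, const char *name)
{
	for (int i = 0; i < t->nr; i++) {
		if (!strcmp(t->entries[i].name, name)) {
			t->entries[i].instantiated = 1;
			return &t->entries[i];
		}
	}
	return NULL;    /* not an ephemeral file; fall back to -ENOENT */
}
```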