BPF List
* [RFC PATCH bpf-next] bpf: ephemeral cgroup BPF control programs
@ 2026-02-03 10:20 Rohan Kakulawaram
  2026-02-03 20:26 ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Rohan Kakulawaram @ 2026-02-03 10:20 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa,
	Roman Gushchin, Tejun Heo, Matt Bobrowski, Josh Don, rohanka

From: rohanka@google.com

Extended Berkeley Packet Filter (eBPF) programs are loadable modules
that can hook onto various contexts within the kernel: kernel
functions, tracepoints, etc. Since these programs are decoupled
from the main kernel binary and can be loaded without a machine
reboot, there is a desire to outsource some of the kernel’s
responsibilities to eBPF for increased flexibility. Furthermore,
eBPF can serve as a bridge between userspace and the kernel by
facilitating access to the kernel’s internal state.

One of the main gaps in achieving these ends, however, is that
there is no infrastructure that supports exposing per cgroup data
via file paths that mirror the cgroupfs hierarchy. This would
allow for a unified source of truth as it relates to accessing
various streams of data related to cgroups. It is important to
note that we would want the fate of these ephemeral files to be
tied to the manipulation of the cgroup tree, such as exposing/
deleting files when creating/removing cgroups respectively.
Cgroup iterators can mimic some of this functionality through
bpffs pins but lack the dynamism of the listed approach.

To elucidate the value of this infrastructure, we note that it would be
instrumental in aiding some of our efforts at Google. For instance,
Borglet, the daemon that manages workloads on each production machine,
repeatedly has to manually parse raw memory cgroup files from cgroupfs.
If we had the capability to directly expose these stats at each level of
the cgroup hierarchy via BPF, we could forgo some of the expensive
parsing associated with the current approach. Moreover, in the context of
upstream efforts, this infrastructure could allow BPF-based schedulers to
expose customized cgroup controls to user space. Overall, this paradigm
of exposing specialized stat files through the filesystem can be
incredibly valuable in providing robust kernel visibility.

Approach 1: BPFFS Centric
------------------------
This approach introduces a new BPF program type: BPF_PROG_TYPE_CGROUP_STAT.
When such a program is loaded and linked, per-cgroup files are made
available, mirrored under bpffs. For example, a program named
"histo_on_cpu" would expose data at paths like
/sys/fs/bpf/cgroup/...<cgroup tree>.../histo_on_cpu.

To accommodate cgroup-v1 hierarchies, a separate directory, such as
/sys/fs/bpf/cgroup-v1/<controller>/...<memcg_tree>.../memcg_histo,
would be populated for each v1 controller.

Sample read program:

SEC("cgroup/stat")
int histo_on_cpu(struct bpf_iter__cgroup *ctx) {
    struct seq_file *seq = ctx->meta->seq;
    struct cgroup *cgrp = ctx->cgroup;
    if (cgrp)
        expose_buckets(cgrp, seq);
    return 0;
}

Approach 2: Cgroupfs Centric (Preferred)
-----------------------------------------
This alternative exposes ephemeral files directly within cgroupfs. During
the initial cgroup traversal executed when link_create is called, the
__kernfs_create_file function would be used for every cgroup directory
encountered. The filenames would adhere to a bpf.stat.<program_name>
convention to clearly distinguish them as BPF-managed ephemeral files.
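
For illustration, creating one such file could look roughly like the
sketch below. The eph_file metadata struct and the eph_stat_kf_ops table
are hypothetical (sketched in the "BPF Syscall Story" section); none of
these helper names exist today:

static int eph_expose_file(struct cgroup *cgrp, struct eph_file *ef,
			   const char *prog_name)
{
	char name[NAME_MAX];

	/* bpf.stat.<program_name> naming convention */
	snprintf(name, sizeof(name), "bpf.stat.%s", prog_name);

	/* create the file under the cgroup's kernfs directory */
	return PTR_ERR_OR_ZERO(__kernfs_create_file(cgrp->kn, name, 0444,
						    GLOBAL_ROOT_UID,
						    GLOBAL_ROOT_GID, 0,
						    &eph_stat_kf_ops, ef,
						    NULL, NULL));
}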

The Case for Approach 2
------------------------
Approach 2 is preferred because it avoids complexities inherent in the
BPFFS-centric approach, such as:

1. Syncing the directory structure between cgroupfs and bpffs
2. Handling the distinct cgroup-v1 and cgroup-v2 hierarchies within bpffs

BPF Syscall Story
-------------------
The primary syscall of interest is link_create, which is invoked after
the user has loaded the program. Traditionally, bpf links are used to
manage the lifecycle of bpf programs; thus, in this context, we would be
able to swap out the program associated with a link and thereby alter the
content of its ephemeral file. Essentially, when the link is created, it
is "attached" to the cgroup tree, analogous to how links are attached to
other kinds of targets. The user passes in the cgroup whose descendants
will expose this ephemeral stat file (attr->link_create.cgroup_root), and
at that point the file the program backs is exposed for every cgroup in
that subtree. We will add the file metadata, including the program link,
to a list referenced by this root cgroup. As discussed in the "Evolving
Cgroup Tree" section, this list will be utilized by cgroup_mkdir and
cgroup_rmdir to manage the lifecycle of ephemeral files within the
directories of descendant cgroups.
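
For illustration, the userspace side could look roughly like the
following libbpf flow; the BPF_CGROUP_STAT attach type is part of this
proposal and does not exist in today's UAPI, and error handling is
mostly elided:

#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

int attach_stat_prog(const char *obj_path, const char *cgrp_path)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int cg_fd, link_fd;

	obj = bpf_object__open_file(obj_path, NULL);
	if (!obj || bpf_object__load(obj))
		return -1;

	prog = bpf_object__find_program_by_name(obj, "histo_on_cpu");

	/* root of the subtree whose descendants expose the file */
	cg_fd = open(cgrp_path, O_RDONLY);

	/* proposed attach type; not in today's UAPI */
	link_fd = bpf_link_create(bpf_program__fd(prog), cg_fd,
				  BPF_CGROUP_STAT, NULL);
	close(cg_fd);
	return link_fd;
}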

At a high level, here are the file/seq operations we wish to define:

open -> prepare a seq file with seq->private containing the
necessary metadata for the program (i.e. the cgroup)
read -> invoke the seq_show operation on the seq file in file->private
release -> free the seq file and update the bpf program's refcount

seq_show for this file should be relatively simple: we set up the program
ctx to take in the cgroup pointer as well as the seq file, and then run
the program.
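
A minimal sketch of this wiring in the cgroupfs-centric model, assuming a
hypothetical per-file struct eph_file (carrying the cgroup and the link)
is passed to __kernfs_create_file() as priv; refcounting and the usual
protection around running the program are elided:

/* hypothetical per-file metadata handed to kernfs as priv */
struct eph_file {
	struct cgroup *cgrp;
	struct bpf_link *link;	/* pins the program backing this file */
};

static int eph_stat_seq_show(struct seq_file *seq, void *v)
{
	struct kernfs_open_file *of = seq->private;
	struct eph_file *ef = of->kn->priv;
	struct bpf_iter_meta meta = { .seq = seq };
	struct bpf_iter__cgroup ctx = {
		.meta = &meta,
		.cgroup = ef->cgrp,
	};

	/* hand the program the cgroup and the seq file, then run it */
	bpf_prog_run(ef->link->prog, &ctx);
	return 0;
}

static const struct kernfs_ops eph_stat_kf_ops = {
	.seq_show = eph_stat_seq_show,
};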

It is important to note that we might need to extend this program type to
handle writes, which is a prerequisite for its utility in providing
sched_ext cgroup controls. Please refer to the "Potential Feature: File
Writes" section for more information.

Cgroup Traversal
------------------------
With cgroup_mutex in hand, we will traverse the cgroup tree(s). In
approach 1, for each cgroup visited, we check the corresponding level of
the bpffs cgroup tree to see if there is an entry for our cgroup. To
facilitate this, we can store the dentry of the corresponding bpffs dir
within the cgroup struct. Thus, when we reach a particular cgroup, we
invoke lookup_one_qstr_excl() using its parent dentry as the base.
Essentially, we want to emulate filename_create() without doing a path
resolution for every cgroup we come across. Once we create this new
directory or confirm that it already exists from a previous traversal, we
can add the file using vfs_mkobj() with the new set of file operations
mentioned in the previous section.

Note that this complexity is not present in the cgroupfs-centric approach,
as the kernfs_node linked to a cgroup's directory is referenced by the
cgroup itself.

To support both v1 and v2 hierarchies simultaneously, the traversal will go
as follows: we first traverse the default cgroup root to construct the v2
hierarchy, then iterate through all cgroup subsystems to identify those
belonging to the v1 hierarchy and create corresponding subdirectories under
/sys/fs/bpf/cgroup-v1 for each controller. Once again, with the cgroupfs-
centric approach, we do not need to deal with the complexities of these
distinct hierarchies as the cgroup dir, which is accessible in each step
of this traversal, is all we need to create the file.

If we fail during this traversal, we must remove the associated ephemeral
file in bpffs (or cgroupfs) for each visited cgroup. This is handled by
re-walking the hierarchy (in post order for each root). In the bpffs
approach, if no ephemeral files exist in the system on failure, the
directories associated with each cgroup must be removed during this
re-walk.
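
To make the traversal concrete, here is a sketch using the existing
css_for_each_descendant_{pre,post}() iterators; holding cgroup_mutex
keeps the tree stable, and eph_create_file()/eph_remove_file() are
hypothetical helpers wrapping the file creation and teardown described
above:

static int eph_expose_subtree(struct cgroup *root, struct eph_stat *stat)
{
	struct cgroup_subsys_state *pos;
	int ret = 0;

	lockdep_assert_held(&cgroup_mutex);

	css_for_each_descendant_pre(pos, &root->self) {
		ret = eph_create_file(pos->cgroup, stat);
		if (ret)
			break;
	}

	if (ret) {
		/* unwind in post order, removing whatever was created */
		css_for_each_descendant_post(pos, &root->self)
			eph_remove_file(pos->cgroup, stat);
	}
	return ret;
}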

Evolving Cgroup Tree
------------------------
We also wish to ensure that the ephemeral BPF file hierarchy evolves
alongside the cgroup tree. In the bpffs-centric model, cgroup_mkdir
creates a new directory using the parent bpffs dentry as a base,
populating it with the required ephemeral files. Conversely, the
cgroupfs-centric model creates the ephemeral file within the new cgroup's
own directory. During cgroup_mkdir, we walk each ancestor of the new
cgroup, iterating through its associated file list and adding the files
to the appropriate directory in the appropriate filesystem. Additionally,
these operations must manage link reference counts precisely to maintain
the persistence of the underlying links. Accordingly, cgroup_rmdir must
perform an equivalent traversal of the ephemeral files, decrementing the
reference count of each associated link as the files are removed.
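
As a sketch of the cgroup_mkdir() side, assuming a hypothetical
eph_stats list_head on struct cgroup holding the registered file
metadata (partial-failure unwinding elided):

static int eph_populate_new_cgroup(struct cgroup *cgrp)
{
	struct cgroup *anc;
	struct eph_stat *stat;
	int ret;

	/* walk every ancestor and materialize its registered files */
	for (anc = cgroup_parent(cgrp); anc; anc = cgroup_parent(anc)) {
		list_for_each_entry(stat, &anc->eph_stats, node) {
			/* pin the link so it outlives this file */
			bpf_link_inc(stat->link);
			ret = eph_create_file(cgrp, stat);
			if (ret)
				return ret;
		}
	}
	return 0;
}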

Potential Feature: File Writes
--------------------------------
It is possible that we could use this mechanism to enable bpf based
schedulers to expose cgroup controls to userspace. Thus, it is worth
considering allowing for writes via this interface so that user space can
turn these knobs. From an implementation standpoint, we could potentially
use the same program to handle both reads and writes. In that case, the
read/write handlers must provide a program context such that the program
knows which mode it ought to be operating in. For instance, ctx->meta.buffer
can be NULL when the program is invoked for a read, while in write mode it
will be populated with the user-supplied data, which the program can use to
update some internal state.
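
A sketch of such a dual-mode program, assuming the hypothetical buffer
field described above (NULL on reads, populated with the user-supplied
bytes on writes); get_knob(), set_knob() and parse_u64() are illustrative
helpers, not existing API:

SEC("cgroup/stat")
int tuning_knob(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;

	if (!cgrp)
		return 0;

	if (!ctx->meta->buffer) {
		/* read mode: dump the current knob value */
		BPF_SEQ_PRINTF(ctx->meta->seq, "%llu\n", get_knob(cgrp));
	} else {
		/* write mode: parse user input and update state */
		set_knob(cgrp, parse_u64(ctx->meta->buffer));
	}
	return 0;
}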

Potential Feature: Lazy File Creation
--------------------------------------
Rather than engaging in the complex operation of traversing the cgroup
hierarchy, which carries the intrinsic risk of becoming a system bottleneck
due to the necessity of acquiring cgroup_mutex, we could create each
ephemeral file when a task first attempts to read it. This would
potentially involve modifying the lookup operation in kernfs_dir_iops to
invoke a custom handler after the function attempts to find a file using
kernfs_find_ns. In the case of cgroupfs, this handler could search a
red-black tree containing the ephemeral files' metadata. If we find the
associated file in this tree, we create the necessary structures (e.g.
inode, kernfs_node) to back the file and link it to the containing
cgroup's kernfs_node.
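
A sketch of what this lazy hook could look like, assuming a hypothetical
per-cgroup rbtree (eph_files) of registered file metadata;
eph_rbtree_find() and eph_materialize() are illustrative helpers:

static struct kernfs_node *eph_lazy_lookup(struct kernfs_node *parent,
					   const char *name)
{
	/* for a cgroupfs directory, kn->priv is the cgroup itself */
	struct cgroup *cgrp = parent->priv;
	struct eph_stat *stat;

	/* only called after kernfs_find_ns() fails to find the file */
	stat = eph_rbtree_find(&cgrp->eph_files, name);
	if (!stat)
		return NULL;

	/* wraps __kernfs_create_file() plus inode setup */
	return eph_materialize(parent, name, stat);
}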

* Re: [RFC PATCH bpf-next] bpf: ephemeral cgroup BPF control programs
  2026-02-03 10:20 [RFC PATCH bpf-next] bpf: ephemeral cgroup BPF control programs Rohan Kakulawaram
@ 2026-02-03 20:26 ` Tejun Heo
  2026-02-04  1:04   ` Josh Don
  0 siblings, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2026-02-03 20:26 UTC (permalink / raw)
  To: Rohan Kakulawaram
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa,
	Roman Gushchin, Matt Bobrowski, Josh Don

Hello, Rohan.

On Tue, Feb 03, 2026 at 10:20:55AM +0000, Rohan Kakulawaram wrote:
...
> One of the main gaps in achieving these ends, however, is that
> there is no infrastructure that supports exposing per cgroup data
> via file paths that mirror the cgroupfs hierarchy. This would
> allow for a unified source of truth as it relates to accessing
> various streams of data related to cgroups. It is important to
> note that we would want the fate of these ephemeral files to be
> tied to the manipulation of the cgroup tree, such as exposing/
> deleting files when creating/removing cgroups respectively.
> Cgroup iterators can mimic some of this functionality through
> bpffs pins but lack the dynamism of the listed approach.

On one hand, I think why not, but at the same time, I'm having a hard time
seeing why this would need to be on some file system. After all, we're
talking about BPF; there are numerous far lower-overhead ways to do
bi-directional communication - shared pinned maps, BPF upcalls, BPF seqfile
iterators, ring buffer based interfaces and so on.

Can you elaborate why this *needs* to be a separate file interface? Note
that this doesn't really expand what BPF progs can do with cgroups. The only
thing being added is a different and not-particularly-efficient way to
communicate with BPF progs.

Thanks.

-- 
tejun

* Re: [RFC PATCH bpf-next] bpf: ephemeral cgroup BPF control programs
  2026-02-03 20:26 ` Tejun Heo
@ 2026-02-04  1:04   ` Josh Don
  2026-02-04 20:25     ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Josh Don @ 2026-02-04  1:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rohan Kakulawaram, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Jiri Olsa, Roman Gushchin, Matt Bobrowski

Hi Tejun,

On Tue, Feb 3, 2026 at 12:26 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Rohan.
>
> On Tue, Feb 03, 2026 at 10:20:55AM +0000, Rohan Kakulawaram wrote:
> ...
> > One of the main gaps in achieving these ends, however, is that
> > there is no infrastructure that supports exposing per cgroup data
> > via file paths that mirror the cgroupfs hierarchy. This would
> > allow for a unified source of truth as it relates to accessing
> > various streams of data related to cgroups. It is important to
> > note that we would want the fate of these ephemeral files to be
> > tied to the manipulation of the cgroup tree, such as exposing/
> > deleting files when creating/removing cgroups respectively.
> > Cgroup iterators can mimic some of this functionality through
> > bpffs pins but lack the dynamism of the listed approach.
>
> On one hand, I think why not, but at the same time, I'm having a hard time
> seeing why this would need to be on some file system. After all, we're
> talking about BPF; there are numerous far lower-overhead ways to do
> bi-directional communication - shared pinned maps, BPF upcalls, BPF seqfile
> iterators, ring buffer based interfaces and so on.
>
> Can you elaborate why this *needs* to be a separate file interface? Note
> that this doesn't really expand what BPF progs can do with cgroups. The only
> thing being added is a different and not-particularly-efficient way to
> communicate with BPF progs.

Each of those existing communication mechanisms has advantages and
disadvantages, and my take is that none are really optimal for the use
case described/implied here.

For starters, I think it is important to have the interface be
synchronous. Stat collection and reporting for example makes much more
sense to do on a read() edge rather than arbitrarily dumping info
continuously into a map or ring buffer or something.

For the BPF iterators we already have, you could in theory pin and
unpin as cgroups are created and destroyed but that feels like a bit
of a hack; at that point you don't really care about it being an
iterator program, you're just piggy-backing off the fact that it
exposes a seqfile interface. Add to that the trickiness of keeping
everything in sync as the cgroup tree is modified, plus there will
always be latency between cgroups getting created and userspace
going to pin an iterator (especially if the jobs creating the cgroups
are not the ones caring to pin the program).

I also find the file-based interface incredibly convenient. You don't
need code that deals with making BPF upcalls or read()s from an
iterator fd; instead you can use traditional file-based APIs. A
file-based interface also lets scripts and manual
observation/manipulation work easily, as you can cat/grep/etc just as
with any other file. I have to imagine that allowing file-based
pinning of iterators was driven by similar motivations.

Typically cgroupfs interfaces are low bandwidth communication
mechanisms to occasionally set/get resource limits and stats. So, in
contrast to the APIs you describe, this is also about offering a more
flexible and convenient solution without needing to worry as much
about efficiency.

I also think this pairs pretty nicely with sched_ext as schedulers can
define custom tuning knobs that will be automatically exposed for
manipulation on a per-job (cgroup) basis.

Best,
Josh

* Re: [RFC PATCH bpf-next] bpf: ephemeral cgroup BPF control programs
  2026-02-04  1:04   ` Josh Don
@ 2026-02-04 20:25     ` Tejun Heo
  0 siblings, 0 replies; 4+ messages in thread
From: Tejun Heo @ 2026-02-04 20:25 UTC (permalink / raw)
  To: Josh Don
  Cc: Rohan Kakulawaram, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Jiri Olsa, Roman Gushchin, Matt Bobrowski

Hello, Josh.

On Tue, Feb 03, 2026 at 05:04:09PM -0800, Josh Don wrote:
> > Can you elaborate why this *needs* to be a separate file interface? Note
> > that this doesn't really expand what BPF progs can do with cgroups. The only
> > thing being added is a different and not-particularly-efficient way to
> > communicate with BPF progs.
> 
> Each of those existing communication mechanisms has advantages and
> disadvantages, and my take is that none are really optimal for the use
> case described/implied here.
> 
> For starters, I think it is important to have the interface be
> synchronous. Stat collection and reporting for example makes much more
> sense to do on a read() edge rather than arbitrarily dumping info
> continuously into a map or ring buffer or something.
> 
> For the BPF iterators we already have, you could in theory pin and
> unpin as cgroups are created and destroyed but that feels like a bit
> of a hack; at that point you don't really care about it being an
> iterator program, you're just piggy-backing off the fact that it
> exposes a seqfile interface. Add to that the trickiness of keeping
> everything in sync as the cgroup tree is modified, plus there will
> always be latency between cgroups getting created and userspace
> going to pin an iterator (especially if the jobs creating the cgroups
> are not the ones caring to pin the program).

Wouldn't a pinned BPF_PROG_RUN program fit the bill? It can serve as a
generic entry point with arbitrary input and output data. It can take the
cgroup ID along with other params, do whatever operations are necessary and
then return output in whatever format. Users don't have to know much
either; they just need to know the name of the pinned program and the
input/output formats and then do bpf_prog_test_run_opts(). It's not a whole
lot different from doing an ioctl call.

> I also find the file-based interface incredibly convenient. You don't
> need code that deals with making BPF upcalls or read()s from an
> iterator fd; instead you can use traditional file-based APIs. A
> file-based interface also lets scripts and manual
> observation/manipulation work easily, as you can cat/grep/etc just as
> with any other file. I have to imagine that allowing file-based
> pinning of iterators was driven by similar motivations.

AFAICS, this is the only actual benefit, right? Having text files as the
interface.

> Typically cgroupfs interfaces are low bandwidth communication
> mechanisms to occasionally set/get resource limits and stats. So, in
> contrast to the APIs you describe, this is also about offering a more
> flexible and convenient solution without needing to worry as much
> about efficiency.
> 
> I also think this pairs pretty nicely with sched_ext as schedulers can
> define custom tuning knobs that will be automatically exposed for
> manipulation on a per-job (cgroup) basis.

Maybe, but for cgroup-level low-freq hinting, being able to read xattrs on
cgroupfs should be enough. For anything high-volume/high-freq or needing
finer granularity, the cgroupfs file interface is far from ideal.

So, I don't know. I'm not dead against it, but unless I'm misunderstanding
something, the rationale seems pretty weak.

Thanks.

-- 
tejun
