public inbox for linux-mm@kvack.org
From: Tal Zussman <tz2294@columbia.edu>
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, bpf@vger.kernel.org,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Josh Don <joshdon@google.com>, Greg Thelen <gthelen@google.com>,
	david@kernel.org
Subject: [LSF/MM/BPF TOPIC] Upstreaming cache_ext: Custom Page Cache Eviction with eBPF
Date: Wed, 25 Mar 2026 17:15:37 -0400
Message-ID: <0b1293ca-7a1f-4358-bc20-15784452238d@columbia.edu>

Hi all,

This proposal is late, but I've been encouraged by a few people to submit
it, so hopefully it's not *too* late...

I would like to propose a session to discuss cache_ext, a framework that
allows applications to customize page cache eviction policies using eBPF.
This work was published at SOSP'25 [1] and presented at LPC in the eBPF
track [2]. The preliminary code is available at [3].

This topic spans both MM and BPF, so it might fit best as part of the joint
BPF+MM session that Roman and Shakeel have mentioned [4].

Background
----------

The kernel offers two built-in page cache eviction options: Active/Inactive
LRU and MGLRU. Neither is optimal for all workloads, and the existing
customization interfaces (fadvise, madvise, sysctl) are limited and often do
not behave as expected. Applications that need better caching behavior today
are stuck either living with the default policy or implementing their own
userspace caches, which are hard to share across processes and often still
rely on the page cache as a second tier.

What cache_ext does
-------------------

cache_ext uses eBPF struct_ops to let applications define custom eviction
policies that run in the kernel. Inspired by sched_ext, it provides:

  - Six policy function hooks: init, evict_folios, folio_added,
    folio_accessed, folio_removed, and admit_folio (admission filtering).

  - An eviction list API (kfuncs) for creating and manipulating
    variable-sized linked lists of folios. Policies can use multiple
    lists.

  - A batched eviction candidate interface: policies propose up to 32 folios
    per eviction request; the kernel validates and evicts them.

  - Per-cgroup isolation: per-cgroup struct_ops programs let each cgroup
    run its own policy without interfering with others.
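
To make the shape of the hook set and the batched candidate interface
concrete, here is a userspace mock in plain C. It is a sketch only and is
NOT the cache_ext API: the mock_* and fifo_* types, the callback signatures,
and the MAX_TRACKED capacity are all hypothetical. Only the six hook names
and the 32-folio batch limit come from the list above. The mock policy is
FIFO, and the mock "kernel" side evicts whatever the policy proposes.

```c
/* Userspace mock of a struct_ops-style eviction policy.
 * Hypothetical names and signatures throughout; not the cache_ext API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CANDIDATES 32   /* batch limit named in the proposal */
#define MAX_TRACKED    128  /* arbitrary capacity for this mock */

struct mock_folio { unsigned long index; };

/* One callback per policy hook named in the list above. */
struct mock_policy_ops {
    void   (*init)(void);
    size_t (*evict_folios)(struct mock_folio **out, size_t max);
    void   (*folio_added)(struct mock_folio *f);
    void   (*folio_accessed)(struct mock_folio *f);
    void   (*folio_removed)(struct mock_folio *f);
    bool   (*admit_folio)(struct mock_folio *f);
};

/* FIFO state: an array of tracked folios, oldest at slot 0. */
static struct mock_folio *fifo[MAX_TRACKED];
static size_t fifo_len;

static void fifo_init(void) { fifo_len = 0; }

static void fifo_added(struct mock_folio *f)
{
    assert(fifo_len < MAX_TRACKED);
    fifo[fifo_len++] = f;
}

/* FIFO ignores accesses; an LRU-like policy would reorder here. */
static void fifo_accessed(struct mock_folio *f) { (void)f; }

/* Drop f from the list, keeping the remaining folios in order. */
static void fifo_removed(struct mock_folio *f)
{
    for (size_t i = 0; i < fifo_len; i++) {
        if (fifo[i] != f)
            continue;
        for (size_t j = i; j + 1 < fifo_len; j++)
            fifo[j] = fifo[j + 1];
        fifo_len--;
        return;
    }
}

/* Admission filter: this mock admits everything. */
static bool fifo_admit(struct mock_folio *f) { (void)f; return true; }

/* Propose (but do not remove) up to max of the oldest folios. */
static size_t fifo_evict(struct mock_folio **out, size_t max)
{
    size_t n = 0;

    if (max > MAX_CANDIDATES)
        max = MAX_CANDIDATES;
    while (n < max && n < fifo_len) {
        out[n] = fifo[n];
        n++;
    }
    return n;
}

static const struct mock_policy_ops fifo_ops = {
    .init = fifo_init,
    .evict_folios = fifo_evict,
    .folio_added = fifo_added,
    .folio_accessed = fifo_accessed,
    .folio_removed = fifo_removed,
    .admit_folio = fifo_admit,
};

/* Mock "kernel" side: request one batch, then evict each candidate. */
static size_t mock_reclaim(const struct mock_policy_ops *ops, size_t want)
{
    struct mock_folio *cand[MAX_CANDIDATES];
    size_t n = ops->evict_folios(cand, want < MAX_CANDIDATES ? want : MAX_CANDIDATES);

    for (size_t i = 0; i < n; i++)
        ops->folio_removed(cand[i]);
    return n;
}

/* Add 40 folios, then ask for 40 evictions; the batch is capped at 32. */
static size_t demo_fifo_batch(void)
{
    static struct mock_folio folios[40];

    fifo_ops.init();
    for (size_t i = 0; i < 40; i++) {
        folios[i].index = i;
        if (fifo_ops.admit_folio(&folios[i]))
            fifo_ops.folio_added(&folios[i]);
    }
    return mock_reclaim(&fifo_ops, 40);
}
```

Calling demo_fifo_batch() from a main() evicts the 32 oldest folios in one
request and leaves the 8 newest ones tracked. A real policy would run as a
BPF program and use the eviction list kfuncs rather than a static array.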

We have implemented eight policies on cache_ext, from simple (FIFO, LFU,
MRU) to sophisticated (S3-FIFO, LHD, MGLRU), as well as application-informed
policies. Our evaluation shows that matching the policy to the workload can
improve throughput by up to 1.7x and reduce P99 latency by up to 58%, and
that, in general, no single policy is best for all workloads.

The kernel changes in our prototype are roughly 2000 lines total, of which
only about 210 lines modify core page cache code, 80 lines touch the
verifier, and 80 lines touch cgroup code. The rest is self-contained
cache_ext functionality (eviction list kfuncs and registry operations),
but much of this can and will be simplified.

Discussion Topics
-----------------

1. Interface design

   Right now the page cache is not modularized. cache_ext adds hooks into
   the page cache in an ad hoc fashion, inserting struct_ops callbacks at
   six points in the page cache. Is this the right abstraction? Are there
   page cache events we are missing? A longer-term goal could be a more
   systematic modularization of the page cache to make it amenable to
   extensibility, but that is a much larger effort -- we would like to
   discuss what a practical first step looks like.

2. Relationship with MGLRU

   cache_ext is currently built on top of the active/inactive list
   infrastructure. Can we instead make use of MGLRU's infrastructure (e.g.,
   the access bit scanning)? This also raises the question of whether we can
   split MGLRU into reusable infrastructure and policy, so that policies
   could build on MGLRU's infrastructure while replacing its policy logic.

3. Eviction list data structures

   cache_ext implements eviction lists as kernel-managed linked lists
   exposed via kfuncs. Could we use BPF arenas instead, as sched_ext does?
   And how would arenas affect the ability to fall back to the kernel's
   default policy when a BPF policy misbehaves or fails to propose enough
   eviction candidates? Are the eviction interface and data structures
   powerful enough as-is?
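
   The fallback behavior asked about above can be modeled in userspace as
   follows. This is a sketch under stated assumptions, not cache_ext code:
   select_victims(), candidate_valid(), the propose_fn signature, and the
   use of integer ids in place of folios are all hypothetical, and the
   sketch says nothing about how arenas would change the picture. It only
   shows the control flow of "validate what the policy proposes, then let
   the default policy fill any shortfall".

```c
/* Userspace model of falling back to a default policy.
 * Hypothetical names and signatures; not the cache_ext implementation. */
#include <assert.h>
#include <stddef.h>

#define BATCH 32  /* per-request candidate limit named in the proposal */

/* A proposer writes up to max candidate ids into out and returns the count. */
typedef size_t (*propose_fn)(unsigned long *out, size_t max);

/* Stand-in for the kernel's validation; here id 0 is treated as invalid. */
static int candidate_valid(unsigned long id) { return id != 0; }

/* Take valid candidates from policy, then top up from fallback.
 * Duplicates across the two proposers are not handled in this model. */
static size_t select_victims(propose_fn policy, propose_fn fallback,
                             unsigned long *victims, size_t want)
{
    unsigned long cand[BATCH];
    size_t n = 0, got;

    assert(fallback != NULL);
    if (want > BATCH)
        want = BATCH;

    got = policy ? policy(cand, want) : 0;
    if (got > want)
        got = want;               /* never trust the returned count */
    for (size_t i = 0; i < got; i++)
        if (candidate_valid(cand[i]))
            victims[n++] = cand[i];

    if (n < want) {               /* shortfall: ask the default policy */
        size_t need = want - n;

        got = fallback(cand, need);
        if (got > need)
            got = need;
        for (size_t i = 0; i < got; i++)
            victims[n++] = cand[i];
    }
    return n;
}

/* A policy that proposes too little: one valid id and one invalid id. */
static size_t lazy_policy(unsigned long *out, size_t max)
{
    size_t n = 0;

    if (n < max)
        out[n++] = 7;
    if (n < max)
        out[n++] = 0;
    return n;
}

/* Stand-in for the built-in policy: always fills the request. */
static size_t default_policy(unsigned long *out, size_t max)
{
    for (size_t i = 0; i < max; i++)
        out[i] = 100 + i;
    return max;
}

static unsigned long demo_victims[BATCH];

/* Request 4 victims from a policy that can only supply 1 valid one. */
static size_t demo_fallback(void)
{
    return select_victims(lazy_policy, default_policy, demo_victims, 4);
}
```

   In this model the fallback needs no knowledge of the policy's private
   data structures, only a way to produce candidates itself. Whether that
   property survives a move to arenas is the open question.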

4. Path to upstreaming

   cache_ext was developed on Linux v6.6. We are currently working on
   rebasing to the latest kernel and should have more progress in the next
   month. There are a few other issues we plan to fix and clean up along the
   way, but in general, what does the path towards upstreaming cache_ext
   look like?

5. Future extensions

   Beyond file-backed page eviction, there are natural next steps that could
   be explored down the line. Prefetching customization has been looked at
   before (FetchBPF [5]). Extending cache_ext to cover anonymous memory and
   swap decisions has also been mentioned as a natural extension. This
   could also have interesting interactions with Shakeel's memcg_ext
   proposal [6].

Links
-----

[1] https://doi.org/10.1145/3731569.3764820 (SOSP'25 paper)
[2] https://lpc.events/event/19/contributions/2165/ (LPC talk)
[3] https://github.com/cache-ext/cache_ext
[4] https://lore.kernel.org/lkml/aa9SB6OzocfwL9kO@linux.dev/
[5] https://www.usenix.org/conference/atc24/presentation/cao
[6] https://lore.kernel.org/lkml/20260307182424.2889780-1-shakeel.butt@linux.dev/

Thanks,
Tal

