* [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2
@ 2026-06-30 11:23 Usama Arif
2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
0 siblings, 2 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 11:23 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
Usama Arif
The vmpressure subsystem has two distinct consumers, gated by the
@tree argument:

  tree=false : in-kernel socket pressure, consumed by TCP/SCTP. This
               is cgroup v2 only; v1 sockets read memcg->tcpmem_pressure
               instead.

  tree=true  : cgroup v1 userspace eventfd notifications via the
               memory.pressure_level / cgroup.event_control interface.
               v2 has no equivalent (userspace gets reclaim signals
               through memory.pressure / PSI, which doesn't touch
               vmpressure).

So of the four (hierarchy, tree) combinations, only two carry data
that anyone reads. The existing early return in vmpressure() covered
v1 + tree=false; the symmetric v2 + tree=true case was falling through
and doing the full lock / accumulate / schedule_work / parent-walk
dance, even though the events list it eventually iterates is empty
on cgroup v2 (vmpressure_register_event() is wired up only through the
v1 cftype "memory.pressure_level" and can't be reached from a v2
memcg).
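
The four (hierarchy, tree) combinations described above can be laid out as a
small truth table. This is only a userspace sketch, not kernel code: on_v2
stands in for cgroup_subsys_on_dfl(memory_cgrp_subsys), and the helper name
vmpressure_has_consumer() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the consumer matrix described above.
 * on_v2 stands in for cgroup_subsys_on_dfl(memory_cgrp_subsys);
 * the function name is hypothetical and does not exist in the kernel.
 */
static bool vmpressure_has_consumer(bool on_v2, bool tree)
{
	/* v2 + tree=false: in-kernel socket pressure (TCP/SCTP). */
	if (on_v2 && !tree)
		return true;
	/* v1 + tree=true: userspace eventfds (memory.pressure_level). */
	if (!on_v2 && tree)
		return true;
	/* v1 + tree=false and v2 + tree=true: nothing reads the result. */
	return false;
}
```

Put differently, a consumer exists exactly when on_v2 != tree, so the two
remaining combinations are safe to skip.
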
Patch 1 extends the existing early return to also skip v2 + tree=true.
On a v2-only host this eliminates a contended path where reclaimers
can serialize on a single global sr_lock. bpftrace on a 176-core
production host (cgroup v2, 285 memcgs, sustained reclaim) showed
~16,200 tree=true calls per minute.
Patch 2 follows up with a cleanup: it moves the v1 userspace eventfd
interface (struct vmpressure_event, the events list and its mutex, the
work_struct and its handler, the parent walk,
vmpressure_register_event / unregister_event, and vmpressure_prio)
into mm/memcontrol-v1.c, which is built only when CONFIG_MEMCG_V1=y,
behind small no-op stubs in the header. mm/vmpressure.c keeps the
shared bits and the tree=false socket-pressure path. vmpressure.c
shrinks to roughly half its size and the code becomes much simpler.
Apart from the prototype/stub selection in the header, the only
remaining #ifdef CONFIG_MEMCG_V1 is around the v1-only fields inside
struct vmpressure itself. Memory savings on CONFIG_MEMCG_V1=n:

  struct vmpressure : 112B -> 24B
  struct mem_cgroup : 1664B -> 1536B

This split is the first step toward eventually making vmpressure
CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
(tree=false) cannot be removed immediately: PSI is not an exact
replacement for vmpressure, and switching networking socket-buffer
back-off to PSI may regress networking performance or increase memory
pressure in workloads that today rely on vmpressure's hysteresis. The
medium-term plan is to introduce a PSI-based socket-pressure path, keep
vmpressure available for v2 behind a defconfig as an opt-out for several
releases, and only then drop the tree=false path entirely, at which point
the vmpressure block in mm/memcontrol-v1.c is the whole subsystem.
---
v2 -> v3: https://lore.kernel.org/all/20260629130042.2649505-1-usama.arif@linux.dev/
- Move the cgroup v1 code into memcontrol-v1.c instead of creating a new
file (Johannes)
v1 -> v2: https://lore.kernel.org/all/20260606114158.3126210-1-usama.arif@linux.dev/
- Add more in commit message about future plans of vmpressure for cgroup v2
(Shakeel)
- Remove unnecessary return statement in vmpressure for v1 only tree path
(Michal)
- Rebased onto latest mm-new
Usama Arif (2):
mm/vmpressure: skip tree=true accounting on cgroup v2
mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
include/linux/vmpressure.h | 46 +++++-
mm/memcontrol-v1.c | 292 +++++++++++++++++++++++++++++++++++
mm/vmpressure.c | 302 ++-----------------------------------
3 files changed, 349 insertions(+), 291 deletions(-)
--
2.53.0-Meta
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
2026-06-30 11:23 [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif
@ 2026-06-30 11:23 ` Usama Arif
2026-06-30 16:07 ` Johannes Weiner
2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
1 sibling, 1 reply; 7+ messages in thread
From: Usama Arif @ 2026-06-30 11:23 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
Usama Arif
vmpressure() has two outputs gated by the @tree argument:

  @tree=false drives in-kernel socket pressure
              (mem_cgroup_set_socket_pressure), consumed by TCP/SCTP.
              This only applies on cgroup v2; on v1 socket memory is
              charged separately via tcpmem and the consumer reads
              memcg->tcpmem_pressure instead.

  @tree=true  drives userspace eventfd notifications via the v1
              memory.pressure_level / cgroup.event_control interface.
              v2 has no equivalent: userspace gets reclaim signals
              through memory.pressure (PSI), which does not touch
              vmpressure.

The existing early return covered v1 + @tree=false. The symmetric
v2 + @tree=true case was falling through and doing the full lock /
accumulate / schedule_work / parent-walk dance for an events list
that can never be populated. bpftrace on a 176-core production host
(cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
~16,200 @tree=true vmpressure() calls per minute. Extend the early
return to also skip cgroup v2 + @tree=true, which avoids all of this
work. On a v2-only host this also eliminates a lock contention path
that can serialise reclaimers on a single global sr_lock.
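
For readers unfamiliar with the path being skipped, the accumulate /
schedule_work steps can be modeled in userspace roughly as follows. This is a
single-threaded sketch under stated assumptions: the model_* names are
hypothetical, the sr_lock is omitted, the parent walk is not modeled, and the
512-page window is taken from the window-size comment in mm/vmpressure.c.

```c
#include <assert.h>
#include <stdbool.h>

/* 512 pages, per the window-size comment in mm/vmpressure.c. */
#define MODEL_VMPRESSURE_WIN 512UL

/* Hypothetical stand-in for the tree_* fields of struct vmpressure. */
struct model_vmpr {
	unsigned long tree_scanned;
	unsigned long tree_reclaimed;
	unsigned int work_scheduled;	/* counts would-be schedule_work() calls */
};

/*
 * Model of the tree=true accumulate step. The kernel does this under
 * sr_lock; locking is omitted here because the model is single-threaded.
 */
static void model_account_tree(struct model_vmpr *v,
			       unsigned long scanned,
			       unsigned long reclaimed)
{
	v->tree_scanned += scanned;
	v->tree_reclaimed += reclaimed;

	/* Below one window: keep accumulating, no work queued. */
	if (v->tree_scanned < MODEL_VMPRESSURE_WIN)
		return;
	v->work_scheduled++;
}

/* Model of the work item draining the counters before signalling. */
static bool model_work_fn(struct model_vmpr *v)
{
	if (!v->tree_scanned)
		return false;	/* counters already drained by an earlier run */
	v->tree_scanned = 0;
	v->tree_reclaimed = 0;
	return true;
}
```

On cgroup v2 every step of this model still runs in the unpatched kernel, but
the events list that the work item would then iterate is always empty.
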
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/vmpressure.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index f053554e5826..c82cee1ab43b 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
return;
/*
- * The in-kernel users only care about the reclaim efficiency
- * for this @memcg rather than the whole subtree, and there
- * isn't and won't be any in-kernel user in a legacy cgroup.
+ * Only two combinations have a consumer:
+ * cgroup v2 + tree=false -> in-kernel socket pressure
+ * cgroup v1 + tree=true -> userspace eventfds (memory.pressure_level)
+ * Skip the other two: nothing consumes the result.
*/
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
+ if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
+ (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
return;
vmpr = memcg_to_vmpressure(memcg);
--
2.53.0-Meta
* [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
2026-06-30 11:23 [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif
2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
@ 2026-06-30 11:23 ` Usama Arif
2026-06-30 12:32 ` Usama Arif
2026-06-30 14:21 ` Shakeel Butt
1 sibling, 2 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 11:23 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
Usama Arif
Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
interface from the shared and v2 in-kernel code.
Currently, almost half of mm/vmpressure.c exists to serve tree=true:
struct vmpressure_event, the events list and its mutex, the work_struct
and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
parent walk, vmpressure_event(), vmpressure_register_event(),
vmpressure_unregister_event(), and vmpressure_prio() (which always
calls vmpressure() with tree=true).
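
As a worked example of the vmpressure_prio() threshold, the sketch below
recomputes ilog2(100 / 10) in userspace. The model_* helpers are hypothetical
stand-ins; only the expression ilog2(100 / 10) and the "prio > threshold"
check come from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* floor(log2(v)) for v > 0; a userspace stand-in for the kernel's ilog2(). */
static unsigned int model_ilog2(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Mirrors the check in vmpressure_prio(): only prio <= threshold reports. */
static bool model_prio_is_critical(int prio)
{
	int critical_prio = (int)model_ilog2(100 / 10);	/* ilog2(10) == 3 */

	return prio <= critical_prio;
}
```

With a threshold of 3, the reclaimer is scanning lru_size >> 3 pages, i.e. one
eighth (12.5%) of the LRU, by the time vmpressure_prio() reports critical
pressure.
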
Move it all into mm/memcontrol-v1.c (built only when CONFIG_MEMCG_V1=y)
as a single contiguous block, following the per-component layout already
used by that file. Keeping the v1 vmpressure code with the rest of the
deprecated cgroup v1 memory controller makes the full footprint of the
CONFIG_MEMCG_V1 option easy to see in one place, which matters more
than component-level file separation for code that has no active
development.
vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
the runtime hierarchy check, the tree=false body, init/cleanup
plumbing) and calls into three small v1 hooks for the tree=true
accumulator and the v1 portions of init/cleanup. The hooks have
static-inline no-op stubs in include/linux/vmpressure.h for the
!MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
the same treatment, which means vmscan.c's call site disappears at
compile time on v2-only kernels.
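
The stub arrangement can be illustrated with a small userspace sketch.
Everything here is hypothetical naming: STUB_MEMCG_V1 stands in for
CONFIG_MEMCG_V1, and the stub_* helpers only mimic the shape of the
vmpressure_v1_*() hooks, not their real bodies.

```c
#include <assert.h>

/* Hypothetical stand-in for CONFIG_MEMCG_V1; define it to enable "v1". */
/* #define STUB_MEMCG_V1 */

struct stub_vmpr {
	unsigned long scanned;
#ifdef STUB_MEMCG_V1
	unsigned long tree_scanned;	/* v1-only state */
#endif
};

#ifdef STUB_MEMCG_V1
/* Real hook: would live in the v1-only translation unit. */
static inline void stub_v1_account_tree(struct stub_vmpr *v,
					unsigned long scanned)
{
	v->tree_scanned += scanned;
}
#else
/* No-op stub: the call compiles away and callers need no #ifdef. */
static inline void stub_v1_account_tree(struct stub_vmpr *v,
					unsigned long scanned) {}
#endif

/* Shared caller, written once for both configurations. */
static void stub_account(struct stub_vmpr *v, int tree,
			 unsigned long scanned)
{
	if (tree)
		stub_v1_account_tree(v, scanned);
	else
		v->scanned += scanned;
}
```

With STUB_MEMCG_V1 left undefined, the tree branch does nothing and the struct
carries no v1 state, which is the effect the header stubs aim for on
CONFIG_MEMCG_V1=n builds.
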
Apart from the prototype/stub selection in the header, the only
remaining #ifdef CONFIG_MEMCG_V1 is around the v1-only fields inside
struct vmpressure itself.

Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):

  struct vmpressure : 112B -> 24B
  struct mem_cgroup : 1664B -> 1536B

This split is the first step toward eventually making vmpressure
CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
(tree=false) cannot be removed immediately: PSI is not an exact
replacement for vmpressure, and switching networking socket-buffer
back-off to PSI may regress networking performance or increase memory
pressure in workloads that today rely on vmpressure's hysteresis. The
medium-term plan is to introduce a PSI-based socket-pressure path, keep
vmpressure available for v2 behind a defconfig as an opt-out for several
releases, and only then drop the tree=false path entirely, at which point
the vmpressure block in mm/memcontrol-v1.c is the whole subsystem.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/vmpressure.h | 46 +++++-
mm/memcontrol-v1.c | 292 +++++++++++++++++++++++++++++++++++++
mm/vmpressure.c | 292 ++-----------------------------------
3 files changed, 343 insertions(+), 287 deletions(-)
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index faecd5522401..b4d13457bc2a 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -13,18 +13,31 @@
struct vmpressure {
unsigned long scanned;
unsigned long reclaimed;
+ /* The lock is used to keep the scanned/reclaimed in sync. */
+ spinlock_t sr_lock;
+#ifdef CONFIG_MEMCG_V1
+ /*
+ * tree=true accumulators feed the v1 userspace eventfd interface
+ * (memory.pressure_level). Drained by @work. v2 has no equivalent
+ * interface, so this state is omitted on CONFIG_MEMCG_V1=n builds.
+ */
unsigned long tree_scanned;
unsigned long tree_reclaimed;
- /* The lock is used to keep the scanned/reclaimed above in sync. */
- spinlock_t sr_lock;
-
/* The list of vmpressure_event structs. */
struct list_head events;
/* Have to grab the lock on events traversal or modifications. */
struct mutex events_lock;
struct work_struct work;
+#endif
+};
+
+enum vmpressure_levels {
+ VMPRESSURE_LOW = 0,
+ VMPRESSURE_MEDIUM,
+ VMPRESSURE_CRITICAL,
+ VMPRESSURE_NUM_LEVELS,
};
struct mem_cgroup;
@@ -32,18 +45,41 @@ struct mem_cgroup;
#ifdef CONFIG_MEMCG
void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
unsigned long scanned, unsigned long reclaimed);
-extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
-
extern void vmpressure_init(struct vmpressure *vmpr);
extern void vmpressure_cleanup(struct vmpressure *vmpr);
extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg);
extern struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr);
+
+/* Shared with the v1 vmpressure block in mm/memcontrol-v1.c. */
+extern const unsigned long vmpressure_win;
+extern enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
+ unsigned long reclaimed);
+
+#ifdef CONFIG_MEMCG_V1
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
extern int vmpressure_register_event(struct mem_cgroup *memcg,
struct eventfd_ctx *eventfd,
const char *args);
extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
struct eventfd_ctx *eventfd);
+
+/* v1 hooks called from mm/vmpressure.c; no-ops below when !MEMCG_V1. */
+extern void vmpressure_v1_init(struct vmpressure *vmpr);
+extern void vmpressure_v1_cleanup(struct vmpressure *vmpr);
+extern void vmpressure_v1_account_tree(struct vmpressure *vmpr,
+ unsigned long scanned,
+ unsigned long reclaimed);
#else
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+ int prio) {}
+static inline void vmpressure_v1_init(struct vmpressure *vmpr) {}
+static inline void vmpressure_v1_cleanup(struct vmpressure *vmpr) {}
+static inline void vmpressure_v1_account_tree(struct vmpressure *vmpr,
+ unsigned long scanned,
+ unsigned long reclaimed) {}
+#endif /* CONFIG_MEMCG_V1 */
+
+#else /* !CONFIG_MEMCG */
static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg,
bool tree, unsigned long scanned,
unsigned long reclaimed) {}
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 765069211567..135622b6172b 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -6,6 +6,7 @@
#include <linux/pagewalk.h>
#include <linux/backing-dev.h>
#include <linux/eventfd.h>
+#include <linux/log2.h>
#include <linux/poll.h>
#include <linux/sort.h>
#include <linux/file.h>
@@ -1476,6 +1477,297 @@ void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
mem_cgroup_oom_unlock(memcg);
}
+/*
+ * cgroup v1 userspace vmpressure interface (memory.pressure_level /
+ * cgroup.event_control). Kept here so v2-only kernels (CONFIG_MEMCG_V1=n)
+ * drop the whole eventfd accumulator, its work item, and the per-memcg
+ * state it requires.
+ *
+ * When there are too little pages left to scan, vmpressure() may miss the
+ * critical pressure as number of pages will be less than "window size".
+ * However, in that case the vmscan priority will raise fast as the
+ * reclaimer will try to scan LRUs more deeply.
+ *
+ * The vmscan logic considers these special priorities:
+ *
+ * prio == DEF_PRIORITY (12): reclaimer starts with that value
+ * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
+ * prio == 0 : close to OOM, kernel scans every page in an lru
+ *
+ * Any value in this range is acceptable for this tunable (i.e. from 12 to
+ * 0). Current value for the vmpressure_level_critical_prio is chosen
+ * empirically, but the number, in essence, means that we consider
+ * critical level when scanning depth is ~10% of the lru size (vmscan
+ * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
+ * eights).
+ */
+static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
+
+enum vmpressure_modes {
+ VMPRESSURE_NO_PASSTHROUGH = 0,
+ VMPRESSURE_HIERARCHY,
+ VMPRESSURE_LOCAL,
+ VMPRESSURE_NUM_MODES,
+};
+
+static const char * const vmpressure_str_levels[] = {
+ [VMPRESSURE_LOW] = "low",
+ [VMPRESSURE_MEDIUM] = "medium",
+ [VMPRESSURE_CRITICAL] = "critical",
+};
+
+static const char * const vmpressure_str_modes[] = {
+ [VMPRESSURE_NO_PASSTHROUGH] = "default",
+ [VMPRESSURE_HIERARCHY] = "hierarchy",
+ [VMPRESSURE_LOCAL] = "local",
+};
+
+struct vmpressure_event {
+ struct eventfd_ctx *efd;
+ enum vmpressure_levels level;
+ enum vmpressure_modes mode;
+ struct list_head node;
+};
+
+static struct vmpressure *work_to_vmpressure(struct work_struct *work)
+{
+ return container_of(work, struct vmpressure, work);
+}
+
+static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
+{
+ struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
+
+ memcg = parent_mem_cgroup(memcg);
+ if (!memcg)
+ return NULL;
+ return memcg_to_vmpressure(memcg);
+}
+
+static bool vmpressure_event(struct vmpressure *vmpr,
+ const enum vmpressure_levels level,
+ bool ancestor, bool signalled)
+{
+ struct vmpressure_event *ev;
+ bool ret = false;
+
+ mutex_lock(&vmpr->events_lock);
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (ancestor && ev->mode == VMPRESSURE_LOCAL)
+ continue;
+ if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
+ continue;
+ if (level < ev->level)
+ continue;
+ eventfd_signal(ev->efd);
+ ret = true;
+ }
+ mutex_unlock(&vmpr->events_lock);
+
+ return ret;
+}
+
+static void vmpressure_work_fn(struct work_struct *work)
+{
+ struct vmpressure *vmpr = work_to_vmpressure(work);
+ unsigned long scanned;
+ unsigned long reclaimed;
+ enum vmpressure_levels level;
+ bool ancestor = false;
+ bool signalled = false;
+
+ spin_lock(&vmpr->sr_lock);
+ /*
+ * Several contexts might be calling vmpressure(), so it is
+ * possible that the work was rescheduled again before the old
+ * work context cleared the counters. In that case we will run
+ * just after the old work returns, but then scanned might be zero
+ * here. No need for any locks here since we don't care if
+ * vmpr->reclaimed is in sync.
+ */
+ scanned = vmpr->tree_scanned;
+ if (!scanned) {
+ spin_unlock(&vmpr->sr_lock);
+ return;
+ }
+
+ reclaimed = vmpr->tree_reclaimed;
+ vmpr->tree_scanned = 0;
+ vmpr->tree_reclaimed = 0;
+ spin_unlock(&vmpr->sr_lock);
+
+ level = vmpressure_calc_level(scanned, reclaimed);
+
+ do {
+ if (vmpressure_event(vmpr, level, ancestor, signalled))
+ signalled = true;
+ ancestor = true;
+ } while ((vmpr = vmpressure_parent(vmpr)));
+}
+
+/*
+ * Tree-mode accumulator: accumulate per-memcg scanned/reclaimed and
+ * schedule the work that walks the parent chain and signals registered
+ * eventfd listeners once we cross the window threshold.
+ */
+void vmpressure_v1_account_tree(struct vmpressure *vmpr,
+ unsigned long scanned,
+ unsigned long reclaimed)
+{
+ spin_lock(&vmpr->sr_lock);
+ scanned = vmpr->tree_scanned += scanned;
+ vmpr->tree_reclaimed += reclaimed;
+ spin_unlock(&vmpr->sr_lock);
+
+ if (scanned < vmpressure_win)
+ return;
+ schedule_work(&vmpr->work);
+}
+
+void vmpressure_v1_init(struct vmpressure *vmpr)
+{
+ mutex_init(&vmpr->events_lock);
+ INIT_LIST_HEAD(&vmpr->events);
+ INIT_WORK(&vmpr->work, vmpressure_work_fn);
+}
+
+void vmpressure_v1_cleanup(struct vmpressure *vmpr)
+{
+ /*
+ * Make sure there is no pending work before eventfd infrastructure
+ * goes away.
+ */
+ flush_work(&vmpr->work);
+}
+
+/**
+ * vmpressure_prio() - Account memory pressure through reclaimer priority level
+ * @gfp: reclaimer's gfp mask
+ * @memcg: cgroup memory controller handle
+ * @prio: reclaimer's priority
+ *
+ * This function should be called from the reclaim path every time when
+ * the vmscan's reclaiming priority (scanning depth) changes.
+ *
+ * This function does not return any value.
+ */
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
+{
+ /*
+ * We only use prio for accounting critical level. For more info
+ * see comment for vmpressure_level_critical_prio variable above.
+ */
+ if (prio > vmpressure_level_critical_prio)
+ return;
+
+ /*
+ * OK, the prio is below the threshold, updating vmpressure
+ * information before shrinker dives into long shrinking of long
+ * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
+ * to the vmpressure() basically means that we signal 'critical'
+ * level.
+ */
+ vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
+}
+
+#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
+
+/**
+ * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
+ * @memcg: memcg that is interested in vmpressure notifications
+ * @eventfd: eventfd context to link notifications with
+ * @args: event arguments (pressure level threshold, optional mode)
+ *
+ * This function associates eventfd context with the vmpressure
+ * infrastructure, so that the notifications will be delivered to the
+ * @eventfd. The @args parameter is a comma-delimited string that denotes a
+ * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
+ * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
+ * "hierarchy" or "local").
+ *
+ * To be used as memcg event method.
+ *
+ * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
+ * not be parsed.
+ */
+int vmpressure_register_event(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd, const char *args)
+{
+ struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
+ struct vmpressure_event *ev;
+ enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
+ enum vmpressure_levels level;
+ char *spec, *spec_orig;
+ char *token;
+ int ret = 0;
+
+ spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
+ if (!spec)
+ return -ENOMEM;
+
+ /* Find required level */
+ token = strsep(&spec, ",");
+ ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
+ if (ret < 0)
+ goto out;
+ level = ret;
+
+ /* Find optional mode */
+ token = strsep(&spec, ",");
+ if (token) {
+ ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
+ if (ret < 0)
+ goto out;
+ mode = ret;
+ }
+
+ ev = kzalloc_obj(*ev);
+ if (!ev) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ev->efd = eventfd;
+ ev->level = level;
+ ev->mode = mode;
+
+ mutex_lock(&vmpr->events_lock);
+ list_add(&ev->node, &vmpr->events);
+ mutex_unlock(&vmpr->events_lock);
+ ret = 0;
+out:
+ kfree(spec_orig);
+ return ret;
+}
+
+/**
+ * vmpressure_unregister_event() - Unbind eventfd from vmpressure
+ * @memcg: memcg handle
+ * @eventfd: eventfd context that was used to link vmpressure with the @cg
+ *
+ * This function does internal manipulations to detach the @eventfd from
+ * the vmpressure notifications, and then frees internal resources
+ * associated with the @eventfd (but the @eventfd itself is not freed).
+ *
+ * To be used as memcg event method.
+ */
+void vmpressure_unregister_event(struct mem_cgroup *memcg,
+ struct eventfd_ctx *eventfd)
+{
+ struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
+ struct vmpressure_event *ev;
+
+ mutex_lock(&vmpr->events_lock);
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (ev->efd != eventfd)
+ continue;
+ list_del(&ev->node);
+ kfree(ev);
+ break;
+ }
+ mutex_unlock(&vmpr->events_lock);
+}
+
static DEFINE_MUTEX(memcg_max_mutex);
static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index c82cee1ab43b..14470141bbe6 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -7,16 +7,15 @@
*
* Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
* Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * Tree-mode (cgroup v1 userspace eventfd) bookkeeping lives in
+ * mm/memcontrol-v1.c; this file holds the shared code and the in-kernel
+ * (tree=false) socket-pressure path that runs on cgroup v2.
*/
#include <linux/cgroup.h>
-#include <linux/fs.h>
#include <linux/log2.h>
-#include <linux/sched.h>
#include <linux/mm.h>
-#include <linux/vmstat.h>
-#include <linux/eventfd.h>
-#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/printk.h>
#include <linux/vmpressure.h>
@@ -35,7 +34,7 @@
* TODO: Make the window size depend on machine size, as we do for vmstat
* thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
*/
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
/*
* These thresholds are used when we account memory pressure through
@@ -46,68 +45,6 @@ static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
static const unsigned int vmpressure_level_med = 60;
static const unsigned int vmpressure_level_critical = 95;
-/*
- * When there are too little pages left to scan, vmpressure() may miss the
- * critical pressure as number of pages will be less than "window size".
- * However, in that case the vmscan priority will raise fast as the
- * reclaimer will try to scan LRUs more deeply.
- *
- * The vmscan logic considers these special priorities:
- *
- * prio == DEF_PRIORITY (12): reclaimer starts with that value
- * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
- * prio == 0 : close to OOM, kernel scans every page in an lru
- *
- * Any value in this range is acceptable for this tunable (i.e. from 12 to
- * 0). Current value for the vmpressure_level_critical_prio is chosen
- * empirically, but the number, in essence, means that we consider
- * critical level when scanning depth is ~10% of the lru size (vmscan
- * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
- * eights).
- */
-static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
-
-static struct vmpressure *work_to_vmpressure(struct work_struct *work)
-{
- return container_of(work, struct vmpressure, work);
-}
-
-static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
-{
- struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
-
- memcg = parent_mem_cgroup(memcg);
- if (!memcg)
- return NULL;
- return memcg_to_vmpressure(memcg);
-}
-
-enum vmpressure_levels {
- VMPRESSURE_LOW = 0,
- VMPRESSURE_MEDIUM,
- VMPRESSURE_CRITICAL,
- VMPRESSURE_NUM_LEVELS,
-};
-
-enum vmpressure_modes {
- VMPRESSURE_NO_PASSTHROUGH = 0,
- VMPRESSURE_HIERARCHY,
- VMPRESSURE_LOCAL,
- VMPRESSURE_NUM_MODES,
-};
-
-static const char * const vmpressure_str_levels[] = {
- [VMPRESSURE_LOW] = "low",
- [VMPRESSURE_MEDIUM] = "medium",
- [VMPRESSURE_CRITICAL] = "critical",
-};
-
-static const char * const vmpressure_str_modes[] = {
- [VMPRESSURE_NO_PASSTHROUGH] = "default",
- [VMPRESSURE_HIERARCHY] = "hierarchy",
- [VMPRESSURE_LOCAL] = "local",
-};
-
static enum vmpressure_levels vmpressure_level(unsigned long pressure)
{
if (pressure >= vmpressure_level_critical)
@@ -117,8 +54,8 @@ static enum vmpressure_levels vmpressure_level(unsigned long pressure)
return VMPRESSURE_LOW;
}
-static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
- unsigned long reclaimed)
+enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
+ unsigned long reclaimed)
{
unsigned long scale = scanned + reclaimed;
unsigned long pressure = 0;
@@ -147,74 +84,6 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
return vmpressure_level(pressure);
}
-struct vmpressure_event {
- struct eventfd_ctx *efd;
- enum vmpressure_levels level;
- enum vmpressure_modes mode;
- struct list_head node;
-};
-
-static bool vmpressure_event(struct vmpressure *vmpr,
- const enum vmpressure_levels level,
- bool ancestor, bool signalled)
-{
- struct vmpressure_event *ev;
- bool ret = false;
-
- mutex_lock(&vmpr->events_lock);
- list_for_each_entry(ev, &vmpr->events, node) {
- if (ancestor && ev->mode == VMPRESSURE_LOCAL)
- continue;
- if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
- continue;
- if (level < ev->level)
- continue;
- eventfd_signal(ev->efd);
- ret = true;
- }
- mutex_unlock(&vmpr->events_lock);
-
- return ret;
-}
-
-static void vmpressure_work_fn(struct work_struct *work)
-{
- struct vmpressure *vmpr = work_to_vmpressure(work);
- unsigned long scanned;
- unsigned long reclaimed;
- enum vmpressure_levels level;
- bool ancestor = false;
- bool signalled = false;
-
- spin_lock(&vmpr->sr_lock);
- /*
- * Several contexts might be calling vmpressure(), so it is
- * possible that the work was rescheduled again before the old
- * work context cleared the counters. In that case we will run
- * just after the old work returns, but then scanned might be zero
- * here. No need for any locks here since we don't care if
- * vmpr->reclaimed is in sync.
- */
- scanned = vmpr->tree_scanned;
- if (!scanned) {
- spin_unlock(&vmpr->sr_lock);
- return;
- }
-
- reclaimed = vmpr->tree_reclaimed;
- vmpr->tree_scanned = 0;
- vmpr->tree_reclaimed = 0;
- spin_unlock(&vmpr->sr_lock);
-
- level = vmpressure_calc_level(scanned, reclaimed);
-
- do {
- if (vmpressure_event(vmpr, level, ancestor, signalled))
- signalled = true;
- ancestor = true;
- } while ((vmpr = vmpressure_parent(vmpr)));
-}
-
/**
* vmpressure() - Account memory pressure through scanned/reclaimed ratio
* @gfp: reclaimer's gfp mask
@@ -283,14 +152,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
return;
if (tree) {
- spin_lock(&vmpr->sr_lock);
- scanned = vmpr->tree_scanned += scanned;
- vmpr->tree_reclaimed += reclaimed;
- spin_unlock(&vmpr->sr_lock);
-
- if (scanned < vmpressure_win)
- return;
- schedule_work(&vmpr->work);
+ vmpressure_v1_account_tree(vmpr, scanned, reclaimed);
} else {
enum vmpressure_levels level;
@@ -332,134 +194,6 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
}
}
-/**
- * vmpressure_prio() - Account memory pressure through reclaimer priority level
- * @gfp: reclaimer's gfp mask
- * @memcg: cgroup memory controller handle
- * @prio: reclaimer's priority
- *
- * This function should be called from the reclaim path every time when
- * the vmscan's reclaiming priority (scanning depth) changes.
- *
- * This function does not return any value.
- */
-void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
-{
- /*
- * We only use prio for accounting critical level. For more info
- * see comment for vmpressure_level_critical_prio variable above.
- */
- if (prio > vmpressure_level_critical_prio)
- return;
-
- /*
- * OK, the prio is below the threshold, updating vmpressure
- * information before shrinker dives into long shrinking of long
- * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
- * to the vmpressure() basically means that we signal 'critical'
- * level.
- */
- vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
-}
-
-#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
-
-/**
- * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
- * @memcg: memcg that is interested in vmpressure notifications
- * @eventfd: eventfd context to link notifications with
- * @args: event arguments (pressure level threshold, optional mode)
- *
- * This function associates eventfd context with the vmpressure
- * infrastructure, so that the notifications will be delivered to the
- * @eventfd. The @args parameter is a comma-delimited string that denotes a
- * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
- * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
- * "hierarchy" or "local").
- *
- * To be used as memcg event method.
- *
- * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
- * not be parsed.
- */
-int vmpressure_register_event(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd, const char *args)
-{
- struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
- struct vmpressure_event *ev;
- enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
- enum vmpressure_levels level;
- char *spec, *spec_orig;
- char *token;
- int ret = 0;
-
- spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
- if (!spec)
- return -ENOMEM;
-
- /* Find required level */
- token = strsep(&spec, ",");
- ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
- if (ret < 0)
- goto out;
- level = ret;
-
- /* Find optional mode */
- token = strsep(&spec, ",");
- if (token) {
- ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
- if (ret < 0)
- goto out;
- mode = ret;
- }
-
- ev = kzalloc_obj(*ev);
- if (!ev) {
- ret = -ENOMEM;
- goto out;
- }
-
- ev->efd = eventfd;
- ev->level = level;
- ev->mode = mode;
-
- mutex_lock(&vmpr->events_lock);
- list_add(&ev->node, &vmpr->events);
- mutex_unlock(&vmpr->events_lock);
- ret = 0;
-out:
- kfree(spec_orig);
- return ret;
-}
-
-/**
- * vmpressure_unregister_event() - Unbind eventfd from vmpressure
- * @memcg: memcg handle
- * @eventfd: eventfd context that was used to link vmpressure with the @cg
- *
- * This function does internal manipulations to detach the @eventfd from
- * the vmpressure notifications, and then frees internal resources
- * associated with the @eventfd (but the @eventfd itself is not freed).
- *
- * To be used as memcg event method.
- */
-void vmpressure_unregister_event(struct mem_cgroup *memcg,
- struct eventfd_ctx *eventfd)
-{
- struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
- struct vmpressure_event *ev;
-
- mutex_lock(&vmpr->events_lock);
- list_for_each_entry(ev, &vmpr->events, node) {
- if (ev->efd != eventfd)
- continue;
- list_del(&ev->node);
- kfree(ev);
- break;
- }
- mutex_unlock(&vmpr->events_lock);
-}
-
/**
* vmpressure_init() - Initialize vmpressure control structure
* @vmpr: Structure to be initialized
@@ -470,9 +204,7 @@ void vmpressure_unregister_event(struct mem_cgroup *memcg,
void vmpressure_init(struct vmpressure *vmpr)
{
spin_lock_init(&vmpr->sr_lock);
- mutex_init(&vmpr->events_lock);
- INIT_LIST_HEAD(&vmpr->events);
- INIT_WORK(&vmpr->work, vmpressure_work_fn);
+ vmpressure_v1_init(vmpr);
}
/**
@@ -484,9 +216,5 @@ void vmpressure_init(struct vmpressure *vmpr)
*/
void vmpressure_cleanup(struct vmpressure *vmpr)
{
- /*
- * Make sure there is no pending work before eventfd infrastructure
- * goes away.
- */
- flush_work(&vmpr->work);
+ vmpressure_v1_cleanup(vmpr);
}
--
2.53.0-Meta
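As an aside on the register path moved by this patch: the kernel-doc above describes @args as a comma-delimited "level[,mode]" string. The following is a minimal userspace sketch of that parsing, not the kernel code. It assumes plain libc string functions in place of kstrndup()/strsep()/match_string(), it omits the MAX_VMPRESSURE_ARGS_LEN truncation, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <string.h>

enum demo_level { DEMO_LOW, DEMO_MEDIUM, DEMO_CRITICAL, DEMO_NUM_LEVELS };
enum demo_mode { DEMO_NO_PASSTHROUGH, DEMO_HIERARCHY, DEMO_LOCAL, DEMO_NUM_MODES };

/* Same strings as vmpressure_str_levels / vmpressure_str_modes in the patch. */
static const char *const demo_levels[] = { "low", "medium", "critical" };
static const char *const demo_modes[] = { "default", "hierarchy", "local" };

/* Hypothetical stand-in for match_string(): table index, or -1 if absent. */
static int demo_match(const char *const *tbl, int n, const char *s, size_t len)
{
	for (int i = 0; i < n; i++)
		if (strlen(tbl[i]) == len && memcmp(tbl[i], s, len) == 0)
			return i;
	return -1;
}

/* Parse "level[,mode]". Returns 0 on success, -1 on an unknown token. */
static int demo_parse_args(const char *args, enum demo_level *level,
			   enum demo_mode *mode)
{
	const char *comma;
	size_t len;
	int ret;

	assert(args && level && mode);

	/* Required level: everything up to the first comma. */
	comma = strchr(args, ',');
	len = comma ? (size_t)(comma - args) : strlen(args);
	ret = demo_match(demo_levels, DEMO_NUM_LEVELS, args, len);
	if (ret < 0)
		return -1;
	*level = (enum demo_level)ret;

	/* Optional mode: the token after the first comma, if any. */
	*mode = DEMO_NO_PASSTHROUGH;
	if (comma) {
		const char *tok = comma + 1;
		const char *end = strchr(tok, ',');

		len = end ? (size_t)(end - tok) : strlen(tok);
		ret = demo_match(demo_modes, DEMO_NUM_MODES, tok, len);
		if (ret < 0)
			return -1;
		*mode = (enum demo_mode)ret;
	}
	return 0;
}
```

With this sketch, "critical,hierarchy" yields DEMO_CRITICAL and DEMO_HIERARCHY, a bare "medium" keeps the no-passthrough default, and "low,bogus" is rejected. Because the mode table contains "default", that string is accepted as a mode as well.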
* Re: [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
@ 2026-06-30 12:32 ` Usama Arif
2026-06-30 14:21 ` Shakeel Butt
1 sibling, 0 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 12:32 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, tj, mkoutny, shakeel.butt,
roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
vbabka, kernel-team
On Tue, 30 Jun 2026 04:23:33 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
> interface from the shared and v2 in-kernel code.
>
> Currently, almost half of mm/vmpressure.c exists to serve tree=true:
> struct vmpressure_event, the events list and its mutex, the work_struct
> and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
> parent walk, vmpressure_event(), vmpressure_register_event(),
> vmpressure_unregister_event(), and vmpressure_prio() (which always
> calls vmpressure() with tree=true).
>
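The parent walk and the per-listener filtering listed above can be sketched in userspace roughly as follows. This is an illustration under assumptions, not the kernel implementation: eventfd_signal() is replaced by a counter, locking is omitted, and the demo_* types are hypothetical stand-ins for struct vmpressure and struct vmpressure_event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_level { DEMO_LOW, DEMO_MEDIUM, DEMO_CRITICAL };
enum demo_mode { DEMO_NO_PASSTHROUGH, DEMO_HIERARCHY, DEMO_LOCAL };

struct demo_listener {
	enum demo_level level;	/* minimum level the listener cares about */
	enum demo_mode mode;
	int signals;		/* stands in for eventfd_signal() */
};

struct demo_group {
	struct demo_group *parent;
	struct demo_listener *listeners;
	size_t nr_listeners;
};

/* Same three filters, in the same order, as vmpressure_event(). */
static bool demo_event(struct demo_group *g, enum demo_level level,
		       bool ancestor, bool signalled)
{
	bool ret = false;

	for (size_t i = 0; i < g->nr_listeners; i++) {
		struct demo_listener *ev = &g->listeners[i];

		if (ancestor && ev->mode == DEMO_LOCAL)
			continue;	/* "local" ignores events from descendants */
		if (signalled && ev->mode == DEMO_NO_PASSTHROUGH)
			continue;	/* default mode: a lower group already handled it */
		if (level < ev->level)
			continue;	/* below this listener's threshold */
		ev->signals++;
		ret = true;
	}
	return ret;
}

/* Walk from the group that saw the pressure up to the root. */
static void demo_walk(struct demo_group *g, enum demo_level level)
{
	bool ancestor = false;
	bool signalled = false;

	assert(g != NULL);
	do {
		if (demo_event(g, level, ancestor, signalled))
			signalled = true;
		ancestor = true;
	} while ((g = g->parent));
}
```

In this model a "hierarchy" listener on a parent is always notified when the level is high enough, a default-mode listener on the parent is notified only if nothing below it was signalled, and a "local" listener never sees events that originate in a child.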
> Move it all into mm/memcontrol-v1.c (built only when CONFIG_MEMCG_V1=y)
> as a single contiguous block, following the per-component layout already
> used by that file. Keeping the v1 vmpressure code with the rest of the
> deprecated cgroup v1 memory controller makes the full footprint of the
> CONFIG_MEMCG_V1 option easy to see in one place, which matters more
> than component-level file separation for code that has no active
> development.
>
> vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
> the runtime hierarchy check, the tree=false body, init/cleanup
> plumbing) and calls into three small v1 hooks for the tree=true
> accumulator and the v1 portions of init/cleanup. The hooks have
> static-inline no-op stubs in include/linux/vmpressure.h for the
> !MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
> the same treatment, which means vmscan.c's call site disappears at
> compile time on v2-only kernels.
>
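The stub arrangement described above can be illustrated with a small standalone sketch. It assumes a hypothetical DEMO_MEMCG_V1 macro in place of CONFIG_MEMCG_V1 and hypothetical demo_* names. It only shows the pattern and is not the header from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for CONFIG_MEMCG_V1; build with -DDEMO_MEMCG_V1=1 to enable. */
#ifndef DEMO_MEMCG_V1
#define DEMO_MEMCG_V1 0
#endif

struct demo_vmpressure {
	unsigned long scanned;		/* shared, tree=false state */
#if DEMO_MEMCG_V1
	unsigned long tree_scanned;	/* v1-only state, compiled out otherwise */
#endif
};

#if DEMO_MEMCG_V1
static void demo_v1_account_tree(struct demo_vmpressure *v,
				 unsigned long scanned)
{
	v->tree_scanned += scanned;
}
#else
/* No-op stub: the caller below needs no #if of its own. */
static inline void demo_v1_account_tree(struct demo_vmpressure *v,
					unsigned long scanned)
{
	(void)v;
	(void)scanned;
}
#endif

static void demo_vmpressure(struct demo_vmpressure *v, int tree,
			    unsigned long scanned)
{
	assert(v != NULL);
	if (tree)
		demo_v1_account_tree(v, scanned);	/* empty when v1 is off */
	else
		v->scanned += scanned;
}
```

With DEMO_MEMCG_V1 left at 0, the tree branch compiles to nothing and the struct carries only the shared field, which mirrors in miniature the struct shrink reported in the commit message.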
> The only #ifdef CONFIG_MEMCG_V1 in source remains around the v1-only
> fields inside struct vmpressure itself.
>
> Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):
>
> struct vmpressure : 112B -> 24B
> struct mem_cgroup : 1664B -> 1536B
>
> This split is the first step toward eventually making vmpressure
> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
> (tree=false) cannot be removed immediately: PSI is not an
> exact replacement for vmpressure, and switching networking socket-buffer
> back-off to PSI may regress networking performance or increase memory
> pressure in workloads that today rely on vmpressure's hysteresis. The
> medium-term plan is to introduce a PSI-based socket-pressure path, keep
> vmpressure available for v2 behind a defconfig as an opt-out for several
> releases, and only then drop the tree=false path entirely, at which point
> everything that remains of the vmpressure block in mm/memcontrol-v1.c is
> the whole subsystem.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Shakeel had acked the previous version, but I forgot to carry it over,
sorry about that!
> ---
> include/linux/vmpressure.h | 46 +++++-
> mm/memcontrol-v1.c | 292 +++++++++++++++++++++++++++++++++++++
> mm/vmpressure.c | 292 ++-----------------------------------
> 3 files changed, 343 insertions(+), 287 deletions(-)
>
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index faecd5522401..b4d13457bc2a 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -13,18 +13,31 @@
> struct vmpressure {
> unsigned long scanned;
> unsigned long reclaimed;
> + /* The lock is used to keep the scanned/reclaimed in sync. */
> + spinlock_t sr_lock;
>
> +#ifdef CONFIG_MEMCG_V1
> + /*
> + * tree=true accumulators feed the v1 userspace eventfd interface
> + * (memory.pressure_level). Drained by @work. v2 has no equivalent
> + * interface, so this state is omitted on CONFIG_MEMCG_V1=n builds.
> + */
> unsigned long tree_scanned;
> unsigned long tree_reclaimed;
> - /* The lock is used to keep the scanned/reclaimed above in sync. */
> - spinlock_t sr_lock;
> -
> /* The list of vmpressure_event structs. */
> struct list_head events;
> /* Have to grab the lock on events traversal or modifications. */
> struct mutex events_lock;
>
> struct work_struct work;
> +#endif
> +};
> +
> +enum vmpressure_levels {
> + VMPRESSURE_LOW = 0,
> + VMPRESSURE_MEDIUM,
> + VMPRESSURE_CRITICAL,
> + VMPRESSURE_NUM_LEVELS,
> };
>
> struct mem_cgroup;
> @@ -32,18 +45,41 @@ struct mem_cgroup;
> #ifdef CONFIG_MEMCG
> void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
> unsigned long scanned, unsigned long reclaimed);
> -extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
> -
> extern void vmpressure_init(struct vmpressure *vmpr);
> extern void vmpressure_cleanup(struct vmpressure *vmpr);
> extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg);
> extern struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr);
> +
> +/* Shared with the v1 vmpressure block in mm/memcontrol-v1.c. */
> +extern const unsigned long vmpressure_win;
> +extern enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> + unsigned long reclaimed);
> +
> +#ifdef CONFIG_MEMCG_V1
> +extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
> extern int vmpressure_register_event(struct mem_cgroup *memcg,
> struct eventfd_ctx *eventfd,
> const char *args);
> extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
> struct eventfd_ctx *eventfd);
> +
> +/* v1 hooks called from mm/vmpressure.c; no-ops below when !MEMCG_V1. */
> +extern void vmpressure_v1_init(struct vmpressure *vmpr);
> +extern void vmpressure_v1_cleanup(struct vmpressure *vmpr);
> +extern void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> + unsigned long scanned,
> + unsigned long reclaimed);
> #else
> +static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
> + int prio) {}
> +static inline void vmpressure_v1_init(struct vmpressure *vmpr) {}
> +static inline void vmpressure_v1_cleanup(struct vmpressure *vmpr) {}
> +static inline void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> + unsigned long scanned,
> + unsigned long reclaimed) {}
> +#endif /* CONFIG_MEMCG_V1 */
> +
> +#else /* !CONFIG_MEMCG */
> static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg,
> bool tree, unsigned long scanned,
> unsigned long reclaimed) {}
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 765069211567..135622b6172b 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -6,6 +6,7 @@
> #include <linux/pagewalk.h>
> #include <linux/backing-dev.h>
> #include <linux/eventfd.h>
> +#include <linux/log2.h>
> #include <linux/poll.h>
> #include <linux/sort.h>
> #include <linux/file.h>
> @@ -1476,6 +1477,297 @@ void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
> mem_cgroup_oom_unlock(memcg);
> }
>
> +/*
> + * cgroup v1 userspace vmpressure interface (memory.pressure_level /
> + * cgroup.event_control). Kept here so v2-only kernels (CONFIG_MEMCG_V1=n)
> + * drop the whole eventfd accumulator, its work item, and the per-memcg
> + * state it requires.
> + *
> + * When there are too few pages left to scan, vmpressure() may miss the
> + * critical pressure as number of pages will be less than "window size".
> + * However, in that case the vmscan priority will raise fast as the
> + * reclaimer will try to scan LRUs more deeply.
> + *
> + * The vmscan logic considers these special priorities:
> + *
> + * prio == DEF_PRIORITY (12): reclaimer starts with that value
> + * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
> + * prio == 0 : close to OOM, kernel scans every page in an lru
> + *
> + * Any value in this range is acceptable for this tunable (i.e. from 12 to
> + * 0). Current value for the vmpressure_level_critical_prio is chosen
> + * empirically, but the number, in essence, means that we consider
> + * critical level when scanning depth is ~10% of the lru size (vmscan
> + * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
> + * eighth).
> + */
> +static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
> +
> +enum vmpressure_modes {
> + VMPRESSURE_NO_PASSTHROUGH = 0,
> + VMPRESSURE_HIERARCHY,
> + VMPRESSURE_LOCAL,
> + VMPRESSURE_NUM_MODES,
> +};
> +
> +static const char * const vmpressure_str_levels[] = {
> + [VMPRESSURE_LOW] = "low",
> + [VMPRESSURE_MEDIUM] = "medium",
> + [VMPRESSURE_CRITICAL] = "critical",
> +};
> +
> +static const char * const vmpressure_str_modes[] = {
> + [VMPRESSURE_NO_PASSTHROUGH] = "default",
> + [VMPRESSURE_HIERARCHY] = "hierarchy",
> + [VMPRESSURE_LOCAL] = "local",
> +};
> +
> +struct vmpressure_event {
> + struct eventfd_ctx *efd;
> + enum vmpressure_levels level;
> + enum vmpressure_modes mode;
> + struct list_head node;
> +};
> +
> +static struct vmpressure *work_to_vmpressure(struct work_struct *work)
> +{
> + return container_of(work, struct vmpressure, work);
> +}
> +
> +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> +{
> + struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
> +
> + memcg = parent_mem_cgroup(memcg);
> + if (!memcg)
> + return NULL;
> + return memcg_to_vmpressure(memcg);
> +}
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> + const enum vmpressure_levels level,
> + bool ancestor, bool signalled)
> +{
> + struct vmpressure_event *ev;
> + bool ret = false;
> +
> + mutex_lock(&vmpr->events_lock);
> + list_for_each_entry(ev, &vmpr->events, node) {
> + if (ancestor && ev->mode == VMPRESSURE_LOCAL)
> + continue;
> + if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
> + continue;
> + if (level < ev->level)
> + continue;
> + eventfd_signal(ev->efd);
> + ret = true;
> + }
> + mutex_unlock(&vmpr->events_lock);
> +
> + return ret;
> +}
> +
> +static void vmpressure_work_fn(struct work_struct *work)
> +{
> + struct vmpressure *vmpr = work_to_vmpressure(work);
> + unsigned long scanned;
> + unsigned long reclaimed;
> + enum vmpressure_levels level;
> + bool ancestor = false;
> + bool signalled = false;
> +
> + spin_lock(&vmpr->sr_lock);
> + /*
> + * Several contexts might be calling vmpressure(), so it is
> + * possible that the work was rescheduled again before the old
> + * work context cleared the counters. In that case we will run
> + * just after the old work returns, but then scanned might be zero
> + * here. No need for any locks here since we don't care if
> + * vmpr->reclaimed is in sync.
> + */
> + scanned = vmpr->tree_scanned;
> + if (!scanned) {
> + spin_unlock(&vmpr->sr_lock);
> + return;
> + }
> +
> + reclaimed = vmpr->tree_reclaimed;
> + vmpr->tree_scanned = 0;
> + vmpr->tree_reclaimed = 0;
> + spin_unlock(&vmpr->sr_lock);
> +
> + level = vmpressure_calc_level(scanned, reclaimed);
> +
> + do {
> + if (vmpressure_event(vmpr, level, ancestor, signalled))
> + signalled = true;
> + ancestor = true;
> + } while ((vmpr = vmpressure_parent(vmpr)));
> +}
> +
> +/*
> + * Tree-mode accumulator: accumulate per-memcg scanned/reclaimed and
> + * schedule the work that walks the parent chain and signals registered
> + * eventfd listeners once we cross the window threshold.
> + */
> +void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> + unsigned long scanned,
> + unsigned long reclaimed)
> +{
> + spin_lock(&vmpr->sr_lock);
> + scanned = vmpr->tree_scanned += scanned;
> + vmpr->tree_reclaimed += reclaimed;
> + spin_unlock(&vmpr->sr_lock);
> +
> + if (scanned < vmpressure_win)
> + return;
> + schedule_work(&vmpr->work);
> +}
> +
> +void vmpressure_v1_init(struct vmpressure *vmpr)
> +{
> + mutex_init(&vmpr->events_lock);
> + INIT_LIST_HEAD(&vmpr->events);
> + INIT_WORK(&vmpr->work, vmpressure_work_fn);
> +}
> +
> +void vmpressure_v1_cleanup(struct vmpressure *vmpr)
> +{
> + /*
> + * Make sure there is no pending work before eventfd infrastructure
> + * goes away.
> + */
> + flush_work(&vmpr->work);
> +}
> +
> +/**
> + * vmpressure_prio() - Account memory pressure through reclaimer priority level
> + * @gfp: reclaimer's gfp mask
> + * @memcg: cgroup memory controller handle
> + * @prio: reclaimer's priority
> + *
> + * This function should be called from the reclaim path every time when
> + * the vmscan's reclaiming priority (scanning depth) changes.
> + *
> + * This function does not return any value.
> + */
> +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> +{
> + /*
> + * We only use prio for accounting critical level. For more info
> + * see comment for vmpressure_level_critical_prio variable above.
> + */
> + if (prio > vmpressure_level_critical_prio)
> + return;
> +
> + /*
> + * OK, the prio is below the threshold, updating vmpressure
> + * information before shrinker dives into long shrinking of long
> + * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
> + * to the vmpressure() basically means that we signal 'critical'
> + * level.
> + */
> + vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
> +}
> +
> +#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
> +
> +/**
> + * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
> + * @memcg: memcg that is interested in vmpressure notifications
> + * @eventfd: eventfd context to link notifications with
> + * @args: event arguments (pressure level threshold, optional mode)
> + *
> + * This function associates eventfd context with the vmpressure
> + * infrastructure, so that the notifications will be delivered to the
> + * @eventfd. The @args parameter is a comma-delimited string that denotes a
> + * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
> + * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
> + * "hierarchy" or "local").
> + *
> + * To be used as memcg event method.
> + *
> + * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
> + * not be parsed.
> + */
> +int vmpressure_register_event(struct mem_cgroup *memcg,
> + struct eventfd_ctx *eventfd, const char *args)
> +{
> + struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> + struct vmpressure_event *ev;
> + enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
> + enum vmpressure_levels level;
> + char *spec, *spec_orig;
> + char *token;
> + int ret = 0;
> +
> + spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
> + if (!spec)
> + return -ENOMEM;
> +
> + /* Find required level */
> + token = strsep(&spec, ",");
> + ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
> + if (ret < 0)
> + goto out;
> + level = ret;
> +
> + /* Find optional mode */
> + token = strsep(&spec, ",");
> + if (token) {
> + ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
> + if (ret < 0)
> + goto out;
> + mode = ret;
> + }
> +
> + ev = kzalloc_obj(*ev);
> + if (!ev) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ev->efd = eventfd;
> + ev->level = level;
> + ev->mode = mode;
> +
> + mutex_lock(&vmpr->events_lock);
> + list_add(&ev->node, &vmpr->events);
> + mutex_unlock(&vmpr->events_lock);
> + ret = 0;
> +out:
> + kfree(spec_orig);
> + return ret;
> +}
> +
> +/**
> + * vmpressure_unregister_event() - Unbind eventfd from vmpressure
> + * @memcg: memcg handle
> + * @eventfd: eventfd context that was used to link vmpressure with the @cg
> + *
> + * This function does internal manipulations to detach the @eventfd from
> + * the vmpressure notifications, and then frees internal resources
> + * associated with the @eventfd (but the @eventfd itself is not freed).
> + *
> + * To be used as memcg event method.
> + */
> +void vmpressure_unregister_event(struct mem_cgroup *memcg,
> + struct eventfd_ctx *eventfd)
> +{
> + struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> + struct vmpressure_event *ev;
> +
> + mutex_lock(&vmpr->events_lock);
> + list_for_each_entry(ev, &vmpr->events, node) {
> + if (ev->efd != eventfd)
> + continue;
> + list_del(&ev->node);
> + kfree(ev);
> + break;
> + }
> + mutex_unlock(&vmpr->events_lock);
> +}
> +
> static DEFINE_MUTEX(memcg_max_mutex);
>
> static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index c82cee1ab43b..14470141bbe6 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -7,16 +7,15 @@
> *
> * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * Tree-mode (cgroup v1 userspace eventfd) bookkeeping lives in
> + * mm/memcontrol-v1.c; this file holds the shared code and the in-kernel
> + * (tree=false) socket-pressure path that runs on cgroup v2.
> */
>
> #include <linux/cgroup.h>
> -#include <linux/fs.h>
> #include <linux/log2.h>
> -#include <linux/sched.h>
> #include <linux/mm.h>
> -#include <linux/vmstat.h>
> -#include <linux/eventfd.h>
> -#include <linux/slab.h>
> #include <linux/swap.h>
> #include <linux/printk.h>
> #include <linux/vmpressure.h>
> @@ -35,7 +34,7 @@
> * TODO: Make the window size depend on machine size, as we do for vmstat
> * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
> */
> -static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
>
> /*
> * These thresholds are used when we account memory pressure through
> @@ -46,68 +45,6 @@ static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
> static const unsigned int vmpressure_level_med = 60;
> static const unsigned int vmpressure_level_critical = 95;
>
> -/*
> - * When there are too little pages left to scan, vmpressure() may miss the
> - * critical pressure as number of pages will be less than "window size".
> - * However, in that case the vmscan priority will raise fast as the
> - * reclaimer will try to scan LRUs more deeply.
> - *
> - * The vmscan logic considers these special priorities:
> - *
> - * prio == DEF_PRIORITY (12): reclaimer starts with that value
> - * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
> - * prio == 0 : close to OOM, kernel scans every page in an lru
> - *
> - * Any value in this range is acceptable for this tunable (i.e. from 12 to
> - * 0). Current value for the vmpressure_level_critical_prio is chosen
> - * empirically, but the number, in essence, means that we consider
> - * critical level when scanning depth is ~10% of the lru size (vmscan
> - * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
> - * eights).
> - */
> -static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
> -
> -static struct vmpressure *work_to_vmpressure(struct work_struct *work)
> -{
> - return container_of(work, struct vmpressure, work);
> -}
> -
> -static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> -{
> - struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
> -
> - memcg = parent_mem_cgroup(memcg);
> - if (!memcg)
> - return NULL;
> - return memcg_to_vmpressure(memcg);
> -}
> -
> -enum vmpressure_levels {
> - VMPRESSURE_LOW = 0,
> - VMPRESSURE_MEDIUM,
> - VMPRESSURE_CRITICAL,
> - VMPRESSURE_NUM_LEVELS,
> -};
> -
> -enum vmpressure_modes {
> - VMPRESSURE_NO_PASSTHROUGH = 0,
> - VMPRESSURE_HIERARCHY,
> - VMPRESSURE_LOCAL,
> - VMPRESSURE_NUM_MODES,
> -};
> -
> -static const char * const vmpressure_str_levels[] = {
> - [VMPRESSURE_LOW] = "low",
> - [VMPRESSURE_MEDIUM] = "medium",
> - [VMPRESSURE_CRITICAL] = "critical",
> -};
> -
> -static const char * const vmpressure_str_modes[] = {
> - [VMPRESSURE_NO_PASSTHROUGH] = "default",
> - [VMPRESSURE_HIERARCHY] = "hierarchy",
> - [VMPRESSURE_LOCAL] = "local",
> -};
> -
> static enum vmpressure_levels vmpressure_level(unsigned long pressure)
> {
> if (pressure >= vmpressure_level_critical)
> @@ -117,8 +54,8 @@ static enum vmpressure_levels vmpressure_level(unsigned long pressure)
> return VMPRESSURE_LOW;
> }
>
> -static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> - unsigned long reclaimed)
> +enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> + unsigned long reclaimed)
> {
> unsigned long scale = scanned + reclaimed;
> unsigned long pressure = 0;
> @@ -147,74 +84,6 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> return vmpressure_level(pressure);
> }
>
> -struct vmpressure_event {
> - struct eventfd_ctx *efd;
> - enum vmpressure_levels level;
> - enum vmpressure_modes mode;
> - struct list_head node;
> -};
> -
> -static bool vmpressure_event(struct vmpressure *vmpr,
> - const enum vmpressure_levels level,
> - bool ancestor, bool signalled)
> -{
> - struct vmpressure_event *ev;
> - bool ret = false;
> -
> - mutex_lock(&vmpr->events_lock);
> - list_for_each_entry(ev, &vmpr->events, node) {
> - if (ancestor && ev->mode == VMPRESSURE_LOCAL)
> - continue;
> - if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
> - continue;
> - if (level < ev->level)
> - continue;
> - eventfd_signal(ev->efd);
> - ret = true;
> - }
> - mutex_unlock(&vmpr->events_lock);
> -
> - return ret;
> -}
> -
> -static void vmpressure_work_fn(struct work_struct *work)
> -{
> - struct vmpressure *vmpr = work_to_vmpressure(work);
> - unsigned long scanned;
> - unsigned long reclaimed;
> - enum vmpressure_levels level;
> - bool ancestor = false;
> - bool signalled = false;
> -
> - spin_lock(&vmpr->sr_lock);
> - /*
> - * Several contexts might be calling vmpressure(), so it is
> - * possible that the work was rescheduled again before the old
> - * work context cleared the counters. In that case we will run
> - * just after the old work returns, but then scanned might be zero
> - * here. No need for any locks here since we don't care if
> - * vmpr->reclaimed is in sync.
> - */
> - scanned = vmpr->tree_scanned;
> - if (!scanned) {
> - spin_unlock(&vmpr->sr_lock);
> - return;
> - }
> -
> - reclaimed = vmpr->tree_reclaimed;
> - vmpr->tree_scanned = 0;
> - vmpr->tree_reclaimed = 0;
> - spin_unlock(&vmpr->sr_lock);
> -
> - level = vmpressure_calc_level(scanned, reclaimed);
> -
> - do {
> - if (vmpressure_event(vmpr, level, ancestor, signalled))
> - signalled = true;
> - ancestor = true;
> - } while ((vmpr = vmpressure_parent(vmpr)));
> -}
> -
> /**
> * vmpressure() - Account memory pressure through scanned/reclaimed ratio
> * @gfp: reclaimer's gfp mask
> @@ -283,14 +152,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
> return;
>
> if (tree) {
> - spin_lock(&vmpr->sr_lock);
> - scanned = vmpr->tree_scanned += scanned;
> - vmpr->tree_reclaimed += reclaimed;
> - spin_unlock(&vmpr->sr_lock);
> -
> - if (scanned < vmpressure_win)
> - return;
> - schedule_work(&vmpr->work);
> + vmpressure_v1_account_tree(vmpr, scanned, reclaimed);
> } else {
> enum vmpressure_levels level;
>
> @@ -332,134 +194,6 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
> }
> }
>
> -/**
> - * vmpressure_prio() - Account memory pressure through reclaimer priority level
> - * @gfp: reclaimer's gfp mask
> - * @memcg: cgroup memory controller handle
> - * @prio: reclaimer's priority
> - *
> - * This function should be called from the reclaim path every time when
> - * the vmscan's reclaiming priority (scanning depth) changes.
> - *
> - * This function does not return any value.
> - */
> -void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> -{
> - /*
> - * We only use prio for accounting critical level. For more info
> - * see comment for vmpressure_level_critical_prio variable above.
> - */
> - if (prio > vmpressure_level_critical_prio)
> - return;
> -
> - /*
> - * OK, the prio is below the threshold, updating vmpressure
> - * information before shrinker dives into long shrinking of long
> - * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
> - * to the vmpressure() basically means that we signal 'critical'
> - * level.
> - */
> - vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
> -}
> -
> -#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
> -
> -/**
> - * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
> - * @memcg: memcg that is interested in vmpressure notifications
> - * @eventfd: eventfd context to link notifications with
> - * @args: event arguments (pressure level threshold, optional mode)
> - *
> - * This function associates eventfd context with the vmpressure
> - * infrastructure, so that the notifications will be delivered to the
> - * @eventfd. The @args parameter is a comma-delimited string that denotes a
> - * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
> - * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
> - * "hierarchy" or "local").
> - *
> - * To be used as memcg event method.
> - *
> - * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
> - * not be parsed.
> - */
> -int vmpressure_register_event(struct mem_cgroup *memcg,
> - struct eventfd_ctx *eventfd, const char *args)
> -{
> - struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> - struct vmpressure_event *ev;
> - enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
> - enum vmpressure_levels level;
> - char *spec, *spec_orig;
> - char *token;
> - int ret = 0;
> -
> - spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
> - if (!spec)
> - return -ENOMEM;
> -
> - /* Find required level */
> - token = strsep(&spec, ",");
> - ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
> - if (ret < 0)
> - goto out;
> - level = ret;
> -
> - /* Find optional mode */
> - token = strsep(&spec, ",");
> - if (token) {
> - ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
> - if (ret < 0)
> - goto out;
> - mode = ret;
> - }
> -
> - ev = kzalloc_obj(*ev);
> - if (!ev) {
> - ret = -ENOMEM;
> - goto out;
> - }
> -
> - ev->efd = eventfd;
> - ev->level = level;
> - ev->mode = mode;
> -
> - mutex_lock(&vmpr->events_lock);
> - list_add(&ev->node, &vmpr->events);
> - mutex_unlock(&vmpr->events_lock);
> - ret = 0;
> -out:
> - kfree(spec_orig);
> - return ret;
> -}
> -
> -/**
> - * vmpressure_unregister_event() - Unbind eventfd from vmpressure
> - * @memcg: memcg handle
> - * @eventfd: eventfd context that was used to link vmpressure with the @cg
> - *
> - * This function does internal manipulations to detach the @eventfd from
> - * the vmpressure notifications, and then frees internal resources
> - * associated with the @eventfd (but the @eventfd itself is not freed).
> - *
> - * To be used as memcg event method.
> - */
> -void vmpressure_unregister_event(struct mem_cgroup *memcg,
> - struct eventfd_ctx *eventfd)
> -{
> - struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> - struct vmpressure_event *ev;
> -
> - mutex_lock(&vmpr->events_lock);
> - list_for_each_entry(ev, &vmpr->events, node) {
> - if (ev->efd != eventfd)
> - continue;
> - list_del(&ev->node);
> - kfree(ev);
> - break;
> - }
> - mutex_unlock(&vmpr->events_lock);
> -}
> -
> /**
> * vmpressure_init() - Initialize vmpressure control structure
> * @vmpr: Structure to be initialized
> @@ -470,9 +204,7 @@ void vmpressure_unregister_event(struct mem_cgroup *memcg,
> void vmpressure_init(struct vmpressure *vmpr)
> {
> spin_lock_init(&vmpr->sr_lock);
> - mutex_init(&vmpr->events_lock);
> - INIT_LIST_HEAD(&vmpr->events);
> - INIT_WORK(&vmpr->work, vmpressure_work_fn);
> + vmpressure_v1_init(vmpr);
> }
>
> /**
> @@ -484,9 +216,5 @@ void vmpressure_init(struct vmpressure *vmpr)
> */
> void vmpressure_cleanup(struct vmpressure *vmpr)
> {
> - /*
> - * Make sure there is no pending work before eventfd infrastructure
> - * goes away.
> - */
> - flush_work(&vmpr->work);
> + vmpressure_v1_cleanup(vmpr);
> }
> --
> 2.53.0-Meta
>
>
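For readers checking the numbers in the moved comments: the window is SWAP_CLUSTER_MAX * 16 = 512 pages, and ilog2(100 / 10) is 3, so the critical priority corresponds to scanning lru_size >> 3, i.e. one eighth of the LRU. The sketch below works through that arithmetic and shows why vmpressure_prio() always crosses the window threshold. It is a userspace model under assumptions: SWAP_CLUSTER_MAX is taken as 32 (implied by the "512 pages" comment), locking and the gfp checks in vmpressure() are skipped, and the demo_* names are hypothetical.

```c
#include <assert.h>

#define DEMO_SWAP_CLUSTER_MAX 32UL	/* assumed; 32 * 16 = 512 pages */
static const unsigned long demo_win = DEMO_SWAP_CLUSTER_MAX * 16;

/* Hypothetical stand-in for ilog2(): floor(log2(v)) for v > 0. */
static unsigned int demo_ilog2(unsigned long v)
{
	unsigned int r = 0;

	assert(v > 0);
	while (v >>= 1)
		r++;
	return r;
}

struct demo_acc {
	unsigned long tree_scanned;
	unsigned long tree_reclaimed;
	unsigned int work_scheduled;	/* stands in for schedule_work() */
};

/* Same shape as vmpressure_v1_account_tree(), minus the spinlock. */
static void demo_account_tree(struct demo_acc *a, unsigned long scanned,
			      unsigned long reclaimed)
{
	a->tree_scanned += scanned;
	a->tree_reclaimed += reclaimed;
	if (a->tree_scanned < demo_win)
		return;			/* window not filled yet */
	a->work_scheduled++;
}

/* Same shape as vmpressure_prio(): only priorities 0..3 count. */
static void demo_prio(struct demo_acc *a, int prio)
{
	if (prio > (int)demo_ilog2(100 / 10))
		return;
	/* scanned = window, reclaimed = 0: fills the window in one call. */
	demo_account_tree(a, demo_win, 0);
}
```

Passing a full window with zero reclaimed pages means the accumulator reaches the threshold on that call regardless of what was accumulated before, which is how the priority path reports the critical level.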
* Re: [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
2026-06-30 12:32 ` Usama Arif
@ 2026-06-30 14:21 ` Shakeel Butt
1 sibling, 0 replies; 7+ messages in thread
From: Shakeel Butt @ 2026-06-30 14:21 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, tj, mkoutny,
roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
vbabka, kernel-team
On Tue, Jun 30, 2026 at 04:23:33AM -0700, Usama Arif wrote:
> Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
> interface from the shared and v2 in-kernel code.
>
> Currently, almost half of mm/vmpressure.c exists to serve tree=true:
> struct vmpressure_event, the events list and its mutex, the work_struct
> and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
> parent walk, vmpressure_event(), vmpressure_register_event(),
> vmpressure_unregister_event(), and vmpressure_prio() (which always
> calls vmpressure() with tree=true).
>
> Move it all into mm/memcontrol-v1.c (built only when CONFIG_MEMCG_V1=y)
> as a single contiguous block, following the per-component layout already
> used by that file. Keeping the v1 vmpressure code with the rest of the
> deprecated cgroup v1 memory controller makes the full footprint of the
> CONFIG_MEMCG_V1 option easy to see in one place, which matters more
> than component-level file separation for code that has no active
> development.
>
> vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
> the runtime hierarchy check, the tree=false body, init/cleanup
> plumbing) and calls into three small v1 hooks for the tree=true
> accumulator and the v1 portions of init/cleanup. The hooks have
> static-inline no-op stubs in include/linux/vmpressure.h for the
> !MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
> the same treatment, which means vmscan.c's call site disappears at
> compile time on v2-only kernels.
>
> The only #ifdef CONFIG_MEMCG_V1 in source remains around the v1-only
> fields inside struct vmpressure itself.
>
> Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):
>
> struct vmpressure : 112B -> 24B
> struct mem_cgroup : 1664B -> 1536B
>
> This split is the first step toward eventually making vmpressure
> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
> (tree=false) cannot be removed immediately: PSI is not an
> exact replacement for vmpressure, and switching networking socket-buffer
> back-off to PSI may regress networking performance or increase memory
> pressure in workloads that today rely on vmpressure's hysteresis. The
> medium-term plan is to introduce a PSI-based socket-pressure path, keep
> vmpressure available for v2 behind a defconfig as an opt-out for several
> releases, and only then drop the tree=false path entirely, at which point
> everything that remains of the vmpressure block in mm/memcontrol-v1.c is
> the whole subsystem.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
* Re: [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
@ 2026-06-30 16:07 ` Johannes Weiner
2026-06-30 16:30 ` Usama Arif
0 siblings, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2026-06-30 16:07 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, tj, mkoutny, shakeel.butt,
roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
vbabka, kernel-team
On Tue, Jun 30, 2026 at 04:23:32AM -0700, Usama Arif wrote:
> vmpressure() has two outputs gated by the @tree argument:
>
> @tree=false drives in-kernel socket pressure
> (mem_cgroup_set_socket_pressure), consumed by TCP/SCTP. This only
> applies on cgroup v2; on v1 socket memory is charged
> separately via tcpmem and the consumer reads
> memcg->tcpmem_pressure instead.
>
> @tree=true drives userspace eventfd notifications via the v1
> memory.pressure_level / cgroup.event_control interface.
> v2 has no equivalent: userspace gets reclaim signals
> through memory.pressure (PSI), which does not touch
> vmpressure.
>
> The existing early return covered v1 + @tree=false. The symmetric
> v2 + @tree=true case was falling through and doing the full lock /
> accumulate / schedule_work / parent-walk dance for an events list
> that can never be populated. bpftrace on a 176-core production host
> (cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
> ~16,200 @tree=true vmpressure() calls per minute. Add an early return
> for cgroup v2 + @tree=true so that none of this work is done.
> On a v2-only host this also eliminates a lock contention path that can
> serialise reclaimers on a single global sr_lock.
>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
> mm/vmpressure.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index f053554e5826..c82cee1ab43b 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
> return;
>
> /*
> - * The in-kernel users only care about the reclaim efficiency
> - * for this @memcg rather than the whole subtree, and there
> - * isn't and won't be any in-kernel user in a legacy cgroup.
> + * Only two combinations have a consumer:
> + * cgroup v2 + tree=false -> in-kernel socket pressure
> + * cgroup v1 + tree=true -> userspace eventfds (memory.pressure_level)
> + * Skip the other two: nothing consumes the result.
> */
> - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
> + if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
> + (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
> return;
I had already acked this one, with a half serious suggestion to make
this
if (cgroup_subsys_on_dfl(memory_cgrp_subsys) == tree)
return;
Anyway, no strong feelings. If nobody agrees,
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
2026-06-30 16:07 ` Johannes Weiner
@ 2026-06-30 16:30 ` Usama Arif
0 siblings, 0 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 16:30 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, david, linux-mm, tj, mkoutny, shakeel.butt,
roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
vbabka, kernel-team
On 30/06/2026 17:07, Johannes Weiner wrote:
> On Tue, Jun 30, 2026 at 04:23:32AM -0700, Usama Arif wrote:
>> vmpressure() has two outputs gated by the @tree argument:
>>
>> @tree=false drives in-kernel socket pressure
>> (mem_cgroup_set_socket_pressure), consumed by TCP/SCTP. This only
>> applies on cgroup v2; on v1 socket memory is charged
>> separately via tcpmem and the consumer reads
>> memcg->tcpmem_pressure instead.
>>
>> @tree=true drives userspace eventfd notifications via the v1
>> memory.pressure_level / cgroup.event_control interface.
>> v2 has no equivalent: userspace gets reclaim signals
>> through memory.pressure (PSI), which does not touch
>> vmpressure.
>>
>> The existing early return covered v1 + @tree=false. The symmetric
>> v2 + @tree=true case was falling through and doing the full lock /
>> accumulate / schedule_work / parent-walk dance for an events list
>> that can never be populated. bpftrace on a 176-core production host
>> (cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
>> ~16,200 @tree=true vmpressure() calls per minute. Add an early return
>> for cgroup v2 + @tree=true so that none of this work is done.
>> On a v2-only host this also eliminates a lock contention path that can
>> serialise reclaimers on a single global sr_lock.
>>
>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>> mm/vmpressure.c | 10 ++++++----
>> 1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index f053554e5826..c82cee1ab43b 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
>> @@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>> return;
>>
>> /*
>> - * The in-kernel users only care about the reclaim efficiency
>> - * for this @memcg rather than the whole subtree, and there
>> - * isn't and won't be any in-kernel user in a legacy cgroup.
>> + * Only two combinations have a consumer:
>> + * cgroup v2 + tree=false -> in-kernel socket pressure
>> + * cgroup v1 + tree=true -> userspace eventfds (memory.pressure_level)
>> + * Skip the other two: nothing consumes the result.
>> */
>> - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
>> + if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
>> + (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
>> return;
>
> I had already acked this one, with a half serious suggestion to make
> this
>
> if (cgroup_subsys_on_dfl(memory_cgrp_subsys) == tree)
> return;
>
> Anyway, no strong feelings. If nobody agrees,
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Yeah, sorry about this! I just amended my last patch to move code
from vmpressure-v1.c to memcontrol-v1.c and sent it without any
other changes. I forgot Shakeel's ack on v2 as well :(
Andrew, would you mind applying the fixlet below? I can also respin
if it's easier. Thanks!!
From 969c19da782bbcd77ae4b9e94d3a9e1d78c198d7 Mon Sep 17 00:00:00 2001
From: Usama Arif <usama.arif@linux.dev>
Date: Tue, 30 Jun 2026 09:25:05 -0700
Subject: [fixlet] mm/vmpressure: skip tree=true accounting on cgroup v2
Simplify the guard. Both cgroup_subsys_on_dfl() and tree are bool, so
the two combinations that have no consumer (v1 + tree=false, v2 +
tree=true) are exactly the cases where dfl == tree.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/vmpressure.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 14470141bbe6..9629240d77ad 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -120,8 +120,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
* cgroup v1 + tree=true -> userspace eventfds (memory.pressure_level)
* Skip the other two: nothing consumes the result.
*/
- if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
- (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
+ if (cgroup_subsys_on_dfl(memory_cgrp_subsys) == tree)
return;
vmpr = memcg_to_vmpressure(memcg);
--
2.53.0-Meta