[PATCH v2 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2

The Linux Kernel Mailing List

* [PATCH v2 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2
@ 2026-06-29 12:59 Usama Arif
  2026-06-29 12:59 ` [PATCH v2 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
  2026-06-29 12:59 ` [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c Usama Arif
  0 siblings, 2 replies; 14+ messages in thread
From: Usama Arif @ 2026-06-29 12:59 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
	linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
	Usama Arif

The vmpressure subsystem has two distinct consumers, gated by the
@tree argument:

  tree=false : in-kernel socket pressure, consumed by TCP/SCTP. This
               is cgroup v2 only; v1 sockets read memcg->tcpmem_pressure
               instead.
  tree=true  : cgroup v1 userspace eventfd notifications via the
               memory.pressure_level / cgroup.event_control interface.
               v2 has no equivalent (userspace gets reclaim signals
               through memory.pressure / PSI, which doesn't touch
               vmpressure).

So of the four (hierarchy, tree) combinations, only two carry data
that anyone reads. The existing early return in vmpressure() covered
v1 + tree=false; the symmetric v2 + tree=true case was falling through
and doing the full lock / accumulate / schedule_work / parent-walk
dance, even though the events list it eventually iterates is empty
on cgroup v2 (vmpressure_register_event() is wired up only through the
v1 cftype "memory.pressure_level" and can't be reached from a v2
memcg).

Patch 1 extends the existing early return to also skip v2 + tree=true.
On a v2-only host this eliminates a contended path where reclaimers
can serialize on a single global sr_lock. bpftrace on a 176-core
production host (cgroup v2, 285 memcgs, sustained reclaim) showed
~16,200 tree=true calls per minute.
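
As a rough illustration, the extended early return can be modelled in
userspace as a predicate over the two inputs. This is only a sketch:
on_dfl stands in for cgroup_subsys_on_dfl(memory_cgrp_subsys), and
has_consumer(), has_consumer_compact() and check_consumer_matrix() are
hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the early-return test in vmpressure().
 * on_dfl stands in for cgroup_subsys_on_dfl(memory_cgrp_subsys).
 * has_consumer() is a hypothetical helper, not a kernel function.
 */
static bool has_consumer(bool on_dfl, bool tree)
{
	/* Skip v1 + tree=false and v2 + tree=true, as in patch 1. */
	if ((!on_dfl && !tree) || (on_dfl && tree))
		return false;
	return true;
}

/* The same predicate written as a single comparison. */
static bool has_consumer_compact(bool on_dfl, bool tree)
{
	return on_dfl != tree;
}

/* Check that both forms agree on all four combinations. */
static void check_consumer_matrix(void)
{
	for (int d = 0; d <= 1; d++)
		for (int t = 0; t <= 1; t++)
			assert(has_consumer(d, t) == has_consumer_compact(d, t));
}
```

The compact form only shows that the two skipped cases are exactly the
ones where the hierarchy flag and @tree agree; the patch itself keeps
the explicit two-clause condition.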

Patch 2 follows up with a cleanup: it splits the v1 userspace eventfd
interface (struct vmpressure_event, the events list and its mutex, the
work_struct and its handler, the parent walk,
vmpressure_register_event / unregister_event, and vmpressure_prio)
into a new mm/vmpressure-v1.c built only when CONFIG_MEMCG_V1=y,
behind small no-op stubs in the header. mm/vmpressure.c keeps the
shared bits and the tree=false socket-pressure path. vmpressure.c
shrinks to about half its size and becomes much simpler. Apart from
the header stubs, the only remaining #ifdef CONFIG_MEMCG_V1 is around
the v1-only fields inside struct vmpressure itself. Memory savings on
CONFIG_MEMCG_V1=n:
  struct vmpressure :  112B  ->  24B
  struct mem_cgroup : 1664B  -> 1536B

This split is the first step toward eventually making vmpressure
CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
(tree=false) cannot be removed immediately: PSI is not an
exact replacement for vmpressure, and switching networking socket-buffer
back-off to PSI may regress networking performance or increase memory
pressure in workloads that today rely on vmpressure's hysteresis. The
medium-term plan is to introduce a PSI-based socket-pressure path, keep
vmpressure available for v2 behind a defconfig as an opt-out for several
releases, and only then drop the tree=false path entirely, at which point
mm/vmpressure-v1.c will contain the whole subsystem.
---
v1 -> v2: https://lore.kernel.org/all/20260606114158.3126210-1-usama.arif@linux.dev/
- Add more in commit message about future plans of vmpressure for cgroup v2
  (Shakeel)
- Remove unnecessary return statement in vmpressure for v1 only tree path
  (Michal)
- Rebased onto latest mm-new

Usama Arif (2):
  mm/vmpressure: skip tree=true accounting on cgroup v2
  mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c

 include/linux/vmpressure.h |  46 +++++-
 mm/Makefile                |   2 +-
 mm/vmpressure-v1.c         | 305 +++++++++++++++++++++++++++++++++++++
 mm/vmpressure.c            | 302 ++----------------------------------
 4 files changed, 363 insertions(+), 292 deletions(-)
 create mode 100644 mm/vmpressure-v1.c

-- 
2.53.0-Meta


* [PATCH v2 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
  2026-06-29 12:59 [PATCH v2 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif
@ 2026-06-29 12:59 ` Usama Arif
  2026-06-29 16:46   ` Johannes Weiner
  2026-06-29 12:59 ` [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c Usama Arif
  1 sibling, 1 reply; 14+ messages in thread
From: Usama Arif @ 2026-06-29 12:59 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
	linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
	Usama Arif

vmpressure() has two outputs gated by the @tree argument:

  @tree=false drives in-kernel socket pressure
              (mem_cgroup_set_socket_pressure), consumed by TCP/SCTP.
              This only applies on cgroup v2; on v1 socket memory is
              charged separately via tcpmem and the consumer reads
              memcg->tcpmem_pressure instead.

  @tree=true  drives userspace eventfd notifications via the v1
              memory.pressure_level / cgroup.event_control interface.
              v2 has no equivalent: userspace gets reclaim signals
              through memory.pressure (PSI), which does not touch
              vmpressure.

The existing early return covered v1 + @tree=false. The symmetric
v2 + @tree=true case was falling through and doing the full lock /
accumulate / schedule_work / parent-walk dance for an events list
that can never be populated. bpftrace on a 176-core production host
(cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
~16,200 @tree=true vmpressure() calls per minute. Add an early return
for cgroup v2 + @tree=true to avoid all of this work. On a v2-only host
this also eliminates a lock contention path that can serialise
reclaimers on a single global sr_lock.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/vmpressure.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index f053554e5826..c82cee1ab43b 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 		return;
 
 	/*
-	 * The in-kernel users only care about the reclaim efficiency
-	 * for this @memcg rather than the whole subtree, and there
-	 * isn't and won't be any in-kernel user in a legacy cgroup.
+	 * Only two combinations have a consumer:
+	 *   cgroup v2 + tree=false -> in-kernel socket pressure
+	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
+	 * Skip the other two: nothing consumes the result.
 	 */
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
+	if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
+	    (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
 		return;
 
 	vmpr = memcg_to_vmpressure(memcg);
-- 
2.53.0-Meta



* [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 12:59 [PATCH v2 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif
  2026-06-29 12:59 ` [PATCH v2 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
@ 2026-06-29 12:59 ` Usama Arif
  2026-06-29 13:34   ` Michal Koutný
                     ` (2 more replies)
  1 sibling, 3 replies; 14+ messages in thread
From: Usama Arif @ 2026-06-29 12:59 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
	linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
	Usama Arif

Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
interface from the shared and v2 in-kernel code.

Currently, almost half of mm/vmpressure.c exists to serve tree=true:
struct vmpressure_event, the events list and its mutex, the work_struct
and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
parent walk, vmpressure_event(), vmpressure_register_event(),
vmpressure_unregister_event(), and vmpressure_prio() (which always
calls vmpressure() with tree=true).
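
For readers unfamiliar with the tree=true path, the accumulate-then-drain
pattern it implements can be sketched in userspace as below. This is a
simplified, single-threaded model: there is no sr_lock, no work item and
no parent walk, and DEMO_WIN, struct demo_acc and the demo_* functions
are hypothetical names. The 512-page window is taken from the
vmpressure_win comment in mm/vmpressure.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed window: SWAP_CLUSTER_MAX * 16 = 512 pages, per mm/vmpressure.c. */
#define DEMO_WIN 512UL

/* Hypothetical stand-in for the tree_* fields of struct vmpressure. */
struct demo_acc {
	unsigned long tree_scanned;
	unsigned long tree_reclaimed;
};

/* Accumulate one reclaim sample; true means "schedule the work item". */
static bool demo_account_tree(struct demo_acc *acc,
			      unsigned long scanned, unsigned long reclaimed)
{
	acc->tree_scanned += scanned;
	acc->tree_reclaimed += reclaimed;
	return acc->tree_scanned >= DEMO_WIN;
}

/* Drain: hand the totals to the caller and reset the counters. */
static bool demo_drain(struct demo_acc *acc,
		       unsigned long *scanned, unsigned long *reclaimed)
{
	if (!acc->tree_scanned)
		return false;	/* an earlier run already drained the counters */
	*scanned = acc->tree_scanned;
	*reclaimed = acc->tree_reclaimed;
	acc->tree_scanned = 0;
	acc->tree_reclaimed = 0;
	return true;
}

/* Usage: two 256-page samples cross the window, then one drain empties it. */
static void demo_window_example(void)
{
	struct demo_acc acc = { 0, 0 };
	unsigned long s = 0, r = 0;

	assert(!demo_account_tree(&acc, 256, 100));	/* below the window */
	assert(demo_account_tree(&acc, 256, 100));	/* 512 pages reached */
	assert(demo_drain(&acc, &s, &r) && s == 512 && r == 200);
	assert(!demo_drain(&acc, &s, &r));		/* nothing left */
}
```

In the kernel the drained totals are then turned into a pressure level
and signalled to registered eventfds up the parent chain; that part is
omitted here.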

Move it all into a new mm/vmpressure-v1.c built only when
CONFIG_MEMCG_V1=y (following the existing memcontrol-v1.o pattern).

vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
the runtime hierarchy check, the tree=false body, init/cleanup
plumbing) and calls into three small v1 hooks for the tree=true
accumulator and the v1 portions of init/cleanup. The hooks have
static-inline no-op stubs in include/linux/vmpressure.h for the
!MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
the same treatment, which means vmscan.c's call site disappears at
compile time on v2-only kernels.
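
The stub pattern described above can be sketched in plain C as follows.
This is an illustration under stated assumptions, not the patch itself:
DEMO_MEMCG_V1 is a hypothetical stand-in for CONFIG_MEMCG_V1, and
struct demo_vmpr and the demo_* functions are invented for the example.

```c
#include <assert.h>

/* DEMO_MEMCG_V1 is a hypothetical stand-in for CONFIG_MEMCG_V1. */
struct demo_vmpr {
	unsigned long scanned;
#ifdef DEMO_MEMCG_V1
	unsigned long tree_scanned;	/* v1-only state */
#endif
};

#ifdef DEMO_MEMCG_V1
/* Real hook: in the patch this would live in the v1-only file. */
static inline void demo_v1_account_tree(struct demo_vmpr *v, unsigned long n)
{
	v->tree_scanned += n;
}
#else
/* No-op stub: the compiler can drop the call entirely. */
static inline void demo_v1_account_tree(struct demo_vmpr *v, unsigned long n) {}
#endif

/* Shared caller: no #ifdef needed at the call site. */
static void demo_account(struct demo_vmpr *v, int tree, unsigned long n)
{
	if (tree)
		demo_v1_account_tree(v, n);
	else
		v->scanned += n;
}

/* Behaves the same for the shared counter with or without DEMO_MEMCG_V1. */
static void demo_check(void)
{
	struct demo_vmpr v = { 0 };

	demo_account(&v, 0, 32);
	demo_account(&v, 1, 32);
	assert(v.scanned == 32);
}
```

Built without -DDEMO_MEMCG_V1, the struct carries only the shared field
and the tree branch compiles to nothing, which mirrors how the
vmpressure_prio() call site goes away on v2-only kernels.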

Apart from the header stubs, the only remaining #ifdef CONFIG_MEMCG_V1
is around the v1-only fields inside struct vmpressure itself.

Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):

  struct vmpressure :  112B ->   24B
  struct mem_cgroup : 1664B -> 1536B

This split is the first step toward eventually making vmpressure
CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
(tree=false) cannot be removed immediately: PSI is not an
exact replacement for vmpressure, and switching networking socket-buffer
back-off to PSI may regress networking performance or increase memory
pressure in workloads that today rely on vmpressure's hysteresis. The
medium-term plan is to introduce a PSI-based socket-pressure path, keep
vmpressure available for v2 behind a defconfig as an opt-out for several
releases, and only then drop the tree=false path entirely, at which point
mm/vmpressure-v1.c will contain the whole subsystem.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/vmpressure.h |  46 +++++-
 mm/Makefile                |   2 +-
 mm/vmpressure-v1.c         | 305 +++++++++++++++++++++++++++++++++++++
 mm/vmpressure.c            | 292 ++---------------------------------
 4 files changed, 357 insertions(+), 288 deletions(-)
 create mode 100644 mm/vmpressure-v1.c

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index faecd5522401..e5e6b68d0dc4 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -13,18 +13,31 @@
 struct vmpressure {
 	unsigned long scanned;
 	unsigned long reclaimed;
+	/* The lock is used to keep the scanned/reclaimed in sync. */
+	spinlock_t sr_lock;
 
+#ifdef CONFIG_MEMCG_V1
+	/*
+	 * tree=true accumulators feed the v1 userspace eventfd interface
+	 * (memory.pressure_level). Drained by @work. v2 has no equivalent
+	 * interface, so this state is omitted on CONFIG_MEMCG_V1=n builds.
+	 */
 	unsigned long tree_scanned;
 	unsigned long tree_reclaimed;
-	/* The lock is used to keep the scanned/reclaimed above in sync. */
-	spinlock_t sr_lock;
-
 	/* The list of vmpressure_event structs. */
 	struct list_head events;
 	/* Have to grab the lock on events traversal or modifications. */
 	struct mutex events_lock;
 
 	struct work_struct work;
+#endif
+};
+
+enum vmpressure_levels {
+	VMPRESSURE_LOW = 0,
+	VMPRESSURE_MEDIUM,
+	VMPRESSURE_CRITICAL,
+	VMPRESSURE_NUM_LEVELS,
 };
 
 struct mem_cgroup;
@@ -32,18 +45,41 @@ struct mem_cgroup;
 #ifdef CONFIG_MEMCG
 void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 		unsigned long scanned, unsigned long reclaimed);
-extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
-
 extern void vmpressure_init(struct vmpressure *vmpr);
 extern void vmpressure_cleanup(struct vmpressure *vmpr);
 extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg);
 extern struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr);
+
+/* Shared with mm/vmpressure-v1.c. */
+extern const unsigned long vmpressure_win;
+extern enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
+						    unsigned long reclaimed);
+
+#ifdef CONFIG_MEMCG_V1
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
 extern int vmpressure_register_event(struct mem_cgroup *memcg,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
 extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
 					struct eventfd_ctx *eventfd);
+
+/* v1 hooks called from mm/vmpressure.c; no-ops below when !MEMCG_V1. */
+extern void vmpressure_v1_init(struct vmpressure *vmpr);
+extern void vmpressure_v1_cleanup(struct vmpressure *vmpr);
+extern void vmpressure_v1_account_tree(struct vmpressure *vmpr,
+				       unsigned long scanned,
+				       unsigned long reclaimed);
 #else
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+				   int prio) {}
+static inline void vmpressure_v1_init(struct vmpressure *vmpr) {}
+static inline void vmpressure_v1_cleanup(struct vmpressure *vmpr) {}
+static inline void vmpressure_v1_account_tree(struct vmpressure *vmpr,
+					      unsigned long scanned,
+					      unsigned long reclaimed) {}
+#endif /* CONFIG_MEMCG_V1 */
+
+#else /* !CONFIG_MEMCG */
 static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg,
 			      bool tree, unsigned long scanned,
 			      unsigned long reclaimed) {}
diff --git a/mm/Makefile b/mm/Makefile
index 4fc713867b9b..de991630c96a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -101,7 +101,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_LIVEUPDATE_MEMFD) += memfd_luo.o
-obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
+obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o vmpressure-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_BPF_SYSCALL
 obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
diff --git a/mm/vmpressure-v1.c b/mm/vmpressure-v1.c
new file mode 100644
index 000000000000..fd813cba0544
--- /dev/null
+++ b/mm/vmpressure-v1.c
@@ -0,0 +1,305 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * cgroup v1 userspace vmpressure interface (memory.pressure_level /
+ * cgroup.event_control). Split out of mm/vmpressure.c so that v2-only
+ * kernels (CONFIG_MEMCG_V1=n) drop the whole eventfd accumulator,
+ * its work item, and the per-memcg state it requires.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/eventfd.h>
+#include <linux/list.h>
+#include <linux/log2.h>
+#include <linux/memcontrol.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/swap.h>
+#include <linux/vmpressure.h>
+#include <linux/workqueue.h>
+
+/*
+ * When there are too little pages left to scan, vmpressure() may miss the
+ * critical pressure as number of pages will be less than "window size".
+ * However, in that case the vmscan priority will raise fast as the
+ * reclaimer will try to scan LRUs more deeply.
+ *
+ * The vmscan logic considers these special priorities:
+ *
+ * prio == DEF_PRIORITY (12): reclaimer starts with that value
+ * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
+ * prio == 0                : close to OOM, kernel scans every page in an lru
+ *
+ * Any value in this range is acceptable for this tunable (i.e. from 12 to
+ * 0). Current value for the vmpressure_level_critical_prio is chosen
+ * empirically, but the number, in essence, means that we consider
+ * critical level when scanning depth is ~10% of the lru size (vmscan
+ * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
+ * eights).
+ */
+static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
+
+enum vmpressure_modes {
+	VMPRESSURE_NO_PASSTHROUGH = 0,
+	VMPRESSURE_HIERARCHY,
+	VMPRESSURE_LOCAL,
+	VMPRESSURE_NUM_MODES,
+};
+
+static const char * const vmpressure_str_levels[] = {
+	[VMPRESSURE_LOW] = "low",
+	[VMPRESSURE_MEDIUM] = "medium",
+	[VMPRESSURE_CRITICAL] = "critical",
+};
+
+static const char * const vmpressure_str_modes[] = {
+	[VMPRESSURE_NO_PASSTHROUGH] = "default",
+	[VMPRESSURE_HIERARCHY] = "hierarchy",
+	[VMPRESSURE_LOCAL] = "local",
+};
+
+struct vmpressure_event {
+	struct eventfd_ctx *efd;
+	enum vmpressure_levels level;
+	enum vmpressure_modes mode;
+	struct list_head node;
+};
+
+static struct vmpressure *work_to_vmpressure(struct work_struct *work)
+{
+	return container_of(work, struct vmpressure, work);
+}
+
+static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
+{
+	struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
+
+	memcg = parent_mem_cgroup(memcg);
+	if (!memcg)
+		return NULL;
+	return memcg_to_vmpressure(memcg);
+}
+
+static bool vmpressure_event(struct vmpressure *vmpr,
+			     const enum vmpressure_levels level,
+			     bool ancestor, bool signalled)
+{
+	struct vmpressure_event *ev;
+	bool ret = false;
+
+	mutex_lock(&vmpr->events_lock);
+	list_for_each_entry(ev, &vmpr->events, node) {
+		if (ancestor && ev->mode == VMPRESSURE_LOCAL)
+			continue;
+		if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
+			continue;
+		if (level < ev->level)
+			continue;
+		eventfd_signal(ev->efd);
+		ret = true;
+	}
+	mutex_unlock(&vmpr->events_lock);
+
+	return ret;
+}
+
+static void vmpressure_work_fn(struct work_struct *work)
+{
+	struct vmpressure *vmpr = work_to_vmpressure(work);
+	unsigned long scanned;
+	unsigned long reclaimed;
+	enum vmpressure_levels level;
+	bool ancestor = false;
+	bool signalled = false;
+
+	spin_lock(&vmpr->sr_lock);
+	/*
+	 * Several contexts might be calling vmpressure(), so it is
+	 * possible that the work was rescheduled again before the old
+	 * work context cleared the counters. In that case we will run
+	 * just after the old work returns, but then scanned might be zero
+	 * here. No need for any locks here since we don't care if
+	 * vmpr->reclaimed is in sync.
+	 */
+	scanned = vmpr->tree_scanned;
+	if (!scanned) {
+		spin_unlock(&vmpr->sr_lock);
+		return;
+	}
+
+	reclaimed = vmpr->tree_reclaimed;
+	vmpr->tree_scanned = 0;
+	vmpr->tree_reclaimed = 0;
+	spin_unlock(&vmpr->sr_lock);
+
+	level = vmpressure_calc_level(scanned, reclaimed);
+
+	do {
+		if (vmpressure_event(vmpr, level, ancestor, signalled))
+			signalled = true;
+		ancestor = true;
+	} while ((vmpr = vmpressure_parent(vmpr)));
+}
+
+/*
+ * Tree-mode accumulator: accumulate per-memcg scanned/reclaimed and
+ * schedule the work that walks the parent chain and signals registered
+ * eventfd listeners once we cross the window threshold.
+ */
+void vmpressure_v1_account_tree(struct vmpressure *vmpr,
+				unsigned long scanned,
+				unsigned long reclaimed)
+{
+	spin_lock(&vmpr->sr_lock);
+	scanned = vmpr->tree_scanned += scanned;
+	vmpr->tree_reclaimed += reclaimed;
+	spin_unlock(&vmpr->sr_lock);
+
+	if (scanned < vmpressure_win)
+		return;
+	schedule_work(&vmpr->work);
+}
+
+void vmpressure_v1_init(struct vmpressure *vmpr)
+{
+	mutex_init(&vmpr->events_lock);
+	INIT_LIST_HEAD(&vmpr->events);
+	INIT_WORK(&vmpr->work, vmpressure_work_fn);
+}
+
+void vmpressure_v1_cleanup(struct vmpressure *vmpr)
+{
+	/*
+	 * Make sure there is no pending work before eventfd infrastructure
+	 * goes away.
+	 */
+	flush_work(&vmpr->work);
+}
+
+/**
+ * vmpressure_prio() - Account memory pressure through reclaimer priority level
+ * @gfp:	reclaimer's gfp mask
+ * @memcg:	cgroup memory controller handle
+ * @prio:	reclaimer's priority
+ *
+ * This function should be called from the reclaim path every time when
+ * the vmscan's reclaiming priority (scanning depth) changes.
+ *
+ * This function does not return any value.
+ */
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
+{
+	/*
+	 * We only use prio for accounting critical level. For more info
+	 * see comment for vmpressure_level_critical_prio variable above.
+	 */
+	if (prio > vmpressure_level_critical_prio)
+		return;
+
+	/*
+	 * OK, the prio is below the threshold, updating vmpressure
+	 * information before shrinker dives into long shrinking of long
+	 * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
+	 * to the vmpressure() basically means that we signal 'critical'
+	 * level.
+	 */
+	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
+}
+
+#define MAX_VMPRESSURE_ARGS_LEN	(strlen("critical") + strlen("hierarchy") + 2)
+
+/**
+ * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
+ * @memcg:	memcg that is interested in vmpressure notifications
+ * @eventfd:	eventfd context to link notifications with
+ * @args:	event arguments (pressure level threshold, optional mode)
+ *
+ * This function associates eventfd context with the vmpressure
+ * infrastructure, so that the notifications will be delivered to the
+ * @eventfd. The @args parameter is a comma-delimited string that denotes a
+ * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
+ * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
+ * "hierarchy" or "local").
+ *
+ * To be used as memcg event method.
+ *
+ * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
+ * not be parsed.
+ */
+int vmpressure_register_event(struct mem_cgroup *memcg,
+			      struct eventfd_ctx *eventfd, const char *args)
+{
+	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
+	struct vmpressure_event *ev;
+	enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
+	enum vmpressure_levels level;
+	char *spec, *spec_orig;
+	char *token;
+	int ret = 0;
+
+	spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
+	if (!spec)
+		return -ENOMEM;
+
+	/* Find required level */
+	token = strsep(&spec, ",");
+	ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
+	if (ret < 0)
+		goto out;
+	level = ret;
+
+	/* Find optional mode */
+	token = strsep(&spec, ",");
+	if (token) {
+		ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
+		if (ret < 0)
+			goto out;
+		mode = ret;
+	}
+
+	ev = kzalloc_obj(*ev);
+	if (!ev) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ev->efd = eventfd;
+	ev->level = level;
+	ev->mode = mode;
+
+	mutex_lock(&vmpr->events_lock);
+	list_add(&ev->node, &vmpr->events);
+	mutex_unlock(&vmpr->events_lock);
+	ret = 0;
+out:
+	kfree(spec_orig);
+	return ret;
+}
+
+/**
+ * vmpressure_unregister_event() - Unbind eventfd from vmpressure
+ * @memcg:	memcg handle
+ * @eventfd:	eventfd context that was used to link vmpressure with the @cg
+ *
+ * This function does internal manipulations to detach the @eventfd from
+ * the vmpressure notifications, and then frees internal resources
+ * associated with the @eventfd (but the @eventfd itself is not freed).
+ *
+ * To be used as memcg event method.
+ */
+void vmpressure_unregister_event(struct mem_cgroup *memcg,
+				 struct eventfd_ctx *eventfd)
+{
+	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
+	struct vmpressure_event *ev;
+
+	mutex_lock(&vmpr->events_lock);
+	list_for_each_entry(ev, &vmpr->events, node) {
+		if (ev->efd != eventfd)
+			continue;
+		list_del(&ev->node);
+		kfree(ev);
+		break;
+	}
+	mutex_unlock(&vmpr->events_lock);
+}
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index c82cee1ab43b..bcfa4bd8ffc5 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -7,16 +7,15 @@
  *
  * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
  * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * Tree-mode (cgroup v1 userspace eventfd) bookkeeping lives in
+ * mm/vmpressure-v1.c; this file holds the shared code and the in-kernel
+ * (tree=false) socket-pressure path that runs on cgroup v2.
  */
 
 #include <linux/cgroup.h>
-#include <linux/fs.h>
 #include <linux/log2.h>
-#include <linux/sched.h>
 #include <linux/mm.h>
-#include <linux/vmstat.h>
-#include <linux/eventfd.h>
-#include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/printk.h>
 #include <linux/vmpressure.h>
@@ -35,7 +34,7 @@
  * TODO: Make the window size depend on machine size, as we do for vmstat
  * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
  */
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
 
 /*
  * These thresholds are used when we account memory pressure through
@@ -46,68 +45,6 @@ static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
 static const unsigned int vmpressure_level_med = 60;
 static const unsigned int vmpressure_level_critical = 95;
 
-/*
- * When there are too little pages left to scan, vmpressure() may miss the
- * critical pressure as number of pages will be less than "window size".
- * However, in that case the vmscan priority will raise fast as the
- * reclaimer will try to scan LRUs more deeply.
- *
- * The vmscan logic considers these special priorities:
- *
- * prio == DEF_PRIORITY (12): reclaimer starts with that value
- * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
- * prio == 0                : close to OOM, kernel scans every page in an lru
- *
- * Any value in this range is acceptable for this tunable (i.e. from 12 to
- * 0). Current value for the vmpressure_level_critical_prio is chosen
- * empirically, but the number, in essence, means that we consider
- * critical level when scanning depth is ~10% of the lru size (vmscan
- * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
- * eights).
- */
-static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
-
-static struct vmpressure *work_to_vmpressure(struct work_struct *work)
-{
-	return container_of(work, struct vmpressure, work);
-}
-
-static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
-{
-	struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
-
-	memcg = parent_mem_cgroup(memcg);
-	if (!memcg)
-		return NULL;
-	return memcg_to_vmpressure(memcg);
-}
-
-enum vmpressure_levels {
-	VMPRESSURE_LOW = 0,
-	VMPRESSURE_MEDIUM,
-	VMPRESSURE_CRITICAL,
-	VMPRESSURE_NUM_LEVELS,
-};
-
-enum vmpressure_modes {
-	VMPRESSURE_NO_PASSTHROUGH = 0,
-	VMPRESSURE_HIERARCHY,
-	VMPRESSURE_LOCAL,
-	VMPRESSURE_NUM_MODES,
-};
-
-static const char * const vmpressure_str_levels[] = {
-	[VMPRESSURE_LOW] = "low",
-	[VMPRESSURE_MEDIUM] = "medium",
-	[VMPRESSURE_CRITICAL] = "critical",
-};
-
-static const char * const vmpressure_str_modes[] = {
-	[VMPRESSURE_NO_PASSTHROUGH] = "default",
-	[VMPRESSURE_HIERARCHY] = "hierarchy",
-	[VMPRESSURE_LOCAL] = "local",
-};
-
 static enum vmpressure_levels vmpressure_level(unsigned long pressure)
 {
 	if (pressure >= vmpressure_level_critical)
@@ -117,8 +54,8 @@ static enum vmpressure_levels vmpressure_level(unsigned long pressure)
 	return VMPRESSURE_LOW;
 }
 
-static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
-						    unsigned long reclaimed)
+enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
+					     unsigned long reclaimed)
 {
 	unsigned long scale = scanned + reclaimed;
 	unsigned long pressure = 0;
@@ -147,74 +84,6 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 	return vmpressure_level(pressure);
 }
 
-struct vmpressure_event {
-	struct eventfd_ctx *efd;
-	enum vmpressure_levels level;
-	enum vmpressure_modes mode;
-	struct list_head node;
-};
-
-static bool vmpressure_event(struct vmpressure *vmpr,
-			     const enum vmpressure_levels level,
-			     bool ancestor, bool signalled)
-{
-	struct vmpressure_event *ev;
-	bool ret = false;
-
-	mutex_lock(&vmpr->events_lock);
-	list_for_each_entry(ev, &vmpr->events, node) {
-		if (ancestor && ev->mode == VMPRESSURE_LOCAL)
-			continue;
-		if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
-			continue;
-		if (level < ev->level)
-			continue;
-		eventfd_signal(ev->efd);
-		ret = true;
-	}
-	mutex_unlock(&vmpr->events_lock);
-
-	return ret;
-}
-
-static void vmpressure_work_fn(struct work_struct *work)
-{
-	struct vmpressure *vmpr = work_to_vmpressure(work);
-	unsigned long scanned;
-	unsigned long reclaimed;
-	enum vmpressure_levels level;
-	bool ancestor = false;
-	bool signalled = false;
-
-	spin_lock(&vmpr->sr_lock);
-	/*
-	 * Several contexts might be calling vmpressure(), so it is
-	 * possible that the work was rescheduled again before the old
-	 * work context cleared the counters. In that case we will run
-	 * just after the old work returns, but then scanned might be zero
-	 * here. No need for any locks here since we don't care if
-	 * vmpr->reclaimed is in sync.
-	 */
-	scanned = vmpr->tree_scanned;
-	if (!scanned) {
-		spin_unlock(&vmpr->sr_lock);
-		return;
-	}
-
-	reclaimed = vmpr->tree_reclaimed;
-	vmpr->tree_scanned = 0;
-	vmpr->tree_reclaimed = 0;
-	spin_unlock(&vmpr->sr_lock);
-
-	level = vmpressure_calc_level(scanned, reclaimed);
-
-	do {
-		if (vmpressure_event(vmpr, level, ancestor, signalled))
-			signalled = true;
-		ancestor = true;
-	} while ((vmpr = vmpressure_parent(vmpr)));
-}
-
 /**
  * vmpressure() - Account memory pressure through scanned/reclaimed ratio
  * @gfp:	reclaimer's gfp mask
@@ -283,14 +152,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 		return;
 
 	if (tree) {
-		spin_lock(&vmpr->sr_lock);
-		scanned = vmpr->tree_scanned += scanned;
-		vmpr->tree_reclaimed += reclaimed;
-		spin_unlock(&vmpr->sr_lock);
-
-		if (scanned < vmpressure_win)
-			return;
-		schedule_work(&vmpr->work);
+		vmpressure_v1_account_tree(vmpr, scanned, reclaimed);
 	} else {
 		enum vmpressure_levels level;
 
@@ -332,134 +194,6 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 	}
 }
 
-/**
- * vmpressure_prio() - Account memory pressure through reclaimer priority level
- * @gfp:	reclaimer's gfp mask
- * @memcg:	cgroup memory controller handle
- * @prio:	reclaimer's priority
- *
- * This function should be called from the reclaim path every time when
- * the vmscan's reclaiming priority (scanning depth) changes.
- *
- * This function does not return any value.
- */
-void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
-{
-	/*
-	 * We only use prio for accounting critical level. For more info
-	 * see comment for vmpressure_level_critical_prio variable above.
-	 */
-	if (prio > vmpressure_level_critical_prio)
-		return;
-
-	/*
-	 * OK, the prio is below the threshold, updating vmpressure
-	 * information before shrinker dives into long shrinking of long
-	 * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
-	 * to the vmpressure() basically means that we signal 'critical'
-	 * level.
-	 */
-	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
-}
-
-#define MAX_VMPRESSURE_ARGS_LEN	(strlen("critical") + strlen("hierarchy") + 2)
-
-/**
- * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
- * @memcg:	memcg that is interested in vmpressure notifications
- * @eventfd:	eventfd context to link notifications with
- * @args:	event arguments (pressure level threshold, optional mode)
- *
- * This function associates eventfd context with the vmpressure
- * infrastructure, so that the notifications will be delivered to the
- * @eventfd. The @args parameter is a comma-delimited string that denotes a
- * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
- * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
- * "hierarchy" or "local").
- *
- * To be used as memcg event method.
- *
- * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
- * not be parsed.
- */
-int vmpressure_register_event(struct mem_cgroup *memcg,
-			      struct eventfd_ctx *eventfd, const char *args)
-{
-	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
-	struct vmpressure_event *ev;
-	enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
-	enum vmpressure_levels level;
-	char *spec, *spec_orig;
-	char *token;
-	int ret = 0;
-
-	spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
-	if (!spec)
-		return -ENOMEM;
-
-	/* Find required level */
-	token = strsep(&spec, ",");
-	ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
-	if (ret < 0)
-		goto out;
-	level = ret;
-
-	/* Find optional mode */
-	token = strsep(&spec, ",");
-	if (token) {
-		ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
-		if (ret < 0)
-			goto out;
-		mode = ret;
-	}
-
-	ev = kzalloc_obj(*ev);
-	if (!ev) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	ev->efd = eventfd;
-	ev->level = level;
-	ev->mode = mode;
-
-	mutex_lock(&vmpr->events_lock);
-	list_add(&ev->node, &vmpr->events);
-	mutex_unlock(&vmpr->events_lock);
-	ret = 0;
-out:
-	kfree(spec_orig);
-	return ret;
-}
-
-/**
- * vmpressure_unregister_event() - Unbind eventfd from vmpressure
- * @memcg:	memcg handle
- * @eventfd:	eventfd context that was used to link vmpressure with the @cg
- *
- * This function does internal manipulations to detach the @eventfd from
- * the vmpressure notifications, and then frees internal resources
- * associated with the @eventfd (but the @eventfd itself is not freed).
- *
- * To be used as memcg event method.
- */
-void vmpressure_unregister_event(struct mem_cgroup *memcg,
-				 struct eventfd_ctx *eventfd)
-{
-	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
-	struct vmpressure_event *ev;
-
-	mutex_lock(&vmpr->events_lock);
-	list_for_each_entry(ev, &vmpr->events, node) {
-		if (ev->efd != eventfd)
-			continue;
-		list_del(&ev->node);
-		kfree(ev);
-		break;
-	}
-	mutex_unlock(&vmpr->events_lock);
-}
-
 /**
  * vmpressure_init() - Initialize vmpressure control structure
  * @vmpr:	Structure to be initialized
@@ -470,9 +204,7 @@ void vmpressure_unregister_event(struct mem_cgroup *memcg,
 void vmpressure_init(struct vmpressure *vmpr)
 {
 	spin_lock_init(&vmpr->sr_lock);
-	mutex_init(&vmpr->events_lock);
-	INIT_LIST_HEAD(&vmpr->events);
-	INIT_WORK(&vmpr->work, vmpressure_work_fn);
+	vmpressure_v1_init(vmpr);
 }
 
 /**
@@ -484,9 +216,5 @@ void vmpressure_init(struct vmpressure *vmpr)
  */
 void vmpressure_cleanup(struct vmpressure *vmpr)
 {
-	/*
-	 * Make sure there is no pending work before eventfd infrastructure
-	 * goes away.
-	 */
-	flush_work(&vmpr->work);
+	vmpressure_v1_cleanup(vmpr);
 }
-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 12:59 ` [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c Usama Arif
@ 2026-06-29 13:34   ` Michal Koutný
  2026-06-29 13:55     ` Usama Arif
  2026-06-29 15:57   ` Shakeel Butt
  2026-06-29 16:48   ` Johannes Weiner
  2 siblings, 1 reply; 14+ messages in thread
From: Michal Koutný @ 2026-06-29 13:34 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, tj, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Mon, Jun 29, 2026 at 05:59:37AM -0700, Usama Arif <usama.arif@linux.dev> wrote:
> This split is the first step toward eventually making vmpressure
> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
> (tree=false) cannot be removed today immediately: PSI is not an
> exact replacement for vmpressure, and switching networking socket-buffer
> back-off to PSI

(Here I understand PSI is a different and differently scoped metric) but
what does it mean when you write that tree=false cannot be removed but
the other patch bails out from vmpressure() (i.e. nothing is updated
anyway)?

Thanks,
Michal

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 13:34   ` Michal Koutný
@ 2026-06-29 13:55     ` Usama Arif
  2026-06-29 14:29       ` Michal Koutný
  0 siblings, 1 reply; 14+ messages in thread
From: Usama Arif @ 2026-06-29 13:55 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, david, linux-mm, hannes, tj, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team



On 29/06/2026 14:34, Michal Koutný wrote:
> On Mon, Jun 29, 2026 at 05:59:37AM -0700, Usama Arif <usama.arif@linux.dev> wrote:
>> This split is the first step toward eventually making vmpressure
>> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
>> (tree=false) cannot be removed today immediately: PSI is not an
>> exact replacement for vmpressure, and switching networking socket-buffer
>> back-off to PSI
> 
> (Here I understand PSI is a different and differently scoped metric) but
> what does it mean when you write that tree=false cannot be removed but
> the other patch bails out from vmpressure() (i.e. nothing is updated
> anyway)?
So the first patch bails out for cgroup v2 for tree = true only.
For tree = false, it doesn't bail out, and is still used for networking
socket-buffer back-off. I think switching that to PSI is a whole other
scope of work. Hope that makes sense?

Thanks,
Usama

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 13:55     ` Usama Arif
@ 2026-06-29 14:29       ` Michal Koutný
  2026-06-29 15:20         ` Usama Arif
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Koutný @ 2026-06-29 14:29 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, tj, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Mon, Jun 29, 2026 at 02:55:57PM +0100, Usama Arif <usama.arif@linux.dev> wrote:
> On 29/06/2026 14:34, Michal Koutný wrote:
> > On Mon, Jun 29, 2026 at 05:59:37AM -0700, Usama Arif <usama.arif@linux.dev> wrote:
> >> This split is the first step toward eventually making vmpressure
> >> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
> >> (tree=false) cannot be removed today immediately: PSI is not an
> >> exact replacement for vmpressure, and switching networking socket-buffer
> >> back-off to PSI
> > 
> > (Here I understand PSI is a different and differently scoped metric) but
> > what does it mean when you write that tree=false cannot be removed but
> > the other patch bails out from vmpressure() (i.e. nothing is updated
> > anyway)?
> So the first patch bails out for cgroup v2 for tree = true only.
> For tree = false, it doesn't bail out, and is still used for networking
> socket-buffer back-off. I think switching that to PSI is a whole other
> scope of work. Hope that makes sense?

I've mixed multiple things together, sorry. I actually wanted to ask
about your response:

| I realized when trying to swap the order that the splitting off v1
| commit will end up doing more than what I think it should do (just
| splitting off v1 specific code), as the tree = true code will not get
| compiled in at all for cgroup v2, and it then ends up changing more
| behaviour.

tree=true won't get compiled, but v2 doesn't care about it, so the effect
of patch 2/2 should still be the same (regardless of whether it comes 1st
or 2nd).
Are you referring to the invocation of vmpressure_v1_account_tree() that
is affected by this?

Thanks,
Michal

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 14:29       ` Michal Koutný
@ 2026-06-29 15:20         ` Usama Arif
  2026-06-29 17:13           ` Michal Koutný
  0 siblings, 1 reply; 14+ messages in thread
From: Usama Arif @ 2026-06-29 15:20 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, david, linux-mm, hannes, tj, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team



On 29/06/2026 15:29, Michal Koutný wrote:
> On Mon, Jun 29, 2026 at 02:55:57PM +0100, Usama Arif <usama.arif@linux.dev> wrote:
>> On 29/06/2026 14:34, Michal Koutný wrote:
>>> On Mon, Jun 29, 2026 at 05:59:37AM -0700, Usama Arif <usama.arif@linux.dev> wrote:
>>>> This split is the first step toward eventually making vmpressure
>>>> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
>>>> (tree=false) cannot be removed today immediately: PSI is not an
>>>> exact replacement for vmpressure, and switching networking socket-buffer
>>>> back-off to PSI
>>>
>>> (Here I understand PSI is a different and differently scoped metric) but
>>> what does it mean when you write that tree=false cannot be removed but
>>> the other patch bails out from vmpressure() (i.e. nothing is updated
>>> anyway)?
>> So the first patch bails out for cgroup v2 for tree = true only.
>> For tree = false, it doesn't bail out, and is still used for networking
>> socket-buffer back-off. I think switching that to PSI is a whole other
>> scope of work. Hope that makes sense?
> 
> I've mixed multiple things together, sorry. I actually wanted to ask
> about your response:
> 
> | I realized when trying to swap the order that the splitting off v1
> | commit will end up doing more than what I think it should do (just
> | splitting off v1 specific code), as the tree = true code will not get
> | compiled in at all for cgroup v2, and it then ends up changing more
> | behaviour.
> 
> tree=true won't get compiled, but v2 doesn't care about it, so the effect
> of patch 2/2 should still be the same (regardless of whether it comes 1st
> or 2nd).
> Are you referring to the invocation of vmpressure_v1_account_tree() that
> is affected by this?

What I mean is that I want the patch that splits off the v1 code to do
just that, and not also carry the optimization of skipping vmpressure
for tree=true on cgroup v2.

In the split patch, vmpressure_v1_account_tree() compiles to an empty
function on CONFIG_MEMCG_V1=n builds, so if the split came first, that
commit would not only move the v1 code but also introduce the
optimization of not running tree=true accounting on cgroup v2.

I hope it makes sense?
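
The ordering argument above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: MODEL_MEMCG_V1,
struct vmpressure_model, model_v1_account_tree() and model_vmpressure()
are hypothetical stand-ins, and the real signature of
vmpressure_v1_account_tree() is not shown in this thread. With the
stand-in config switched off, the hook is an empty static inline, so the
tree=true path does no accounting even without an explicit early return.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for CONFIG_MEMCG_V1; build with
 * -DMODEL_MEMCG_V1=1 to model a kernel that still carries v1 support.
 */
#ifndef MODEL_MEMCG_V1
#define MODEL_MEMCG_V1 0
#endif
static_assert(MODEL_MEMCG_V1 == 0 || MODEL_MEMCG_V1 == 1,
	      "MODEL_MEMCG_V1 must be 0 or 1");

/* Simplified model of the tree=true accumulator fields. */
struct vmpressure_model {
	unsigned long tree_scanned;
	unsigned long tree_reclaimed;
};

#if MODEL_MEMCG_V1
/* v1 build: the hook really accumulates. */
static inline void model_v1_account_tree(struct vmpressure_model *v,
					 unsigned long scanned,
					 unsigned long reclaimed)
{
	v->tree_scanned += scanned;
	v->tree_reclaimed += reclaimed;
}
#else
/* v2-only build: no-op stub, so callers need no ifdefs. */
static inline void model_v1_account_tree(struct vmpressure_model *v,
					 unsigned long scanned,
					 unsigned long reclaimed)
{
	(void)v;
	(void)scanned;
	(void)reclaimed;
}
#endif

/* Model of the caller: tree=true is routed to the v1 hook. */
static void model_vmpressure(struct vmpressure_model *v, bool tree,
			     unsigned long scanned, unsigned long reclaimed)
{
	if (tree) {
		model_v1_account_tree(v, scanned, reclaimed);
		return;
	}
	/* tree=false (socket pressure) handling is out of scope here. */
}
```

In this model the behaviour change comes purely from the stub, which is
why a split-only commit placed first would also change what runs on a
v2-only build.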

> 
> Thanks,
> Michal


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 12:59 ` [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c Usama Arif
  2026-06-29 13:34   ` Michal Koutný
@ 2026-06-29 15:57   ` Shakeel Butt
  2026-06-29 16:48   ` Johannes Weiner
  2 siblings, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2026-06-29 15:57 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, tj, mkoutny,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Mon, Jun 29, 2026 at 05:59:37AM -0700, Usama Arif wrote:
> Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
> interface from the shared and v2 in-kernel code.
> 
> Currently, almost half of mm/vmpressure.c exists to serve tree=true:
> struct vmpressure_event, the events list and its mutex, the work_struct
> and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
> parent walk, vmpressure_event(), vmpressure_register_event(),
> vmpressure_unregister_event(), and vmpressure_prio() (which always
> calls vmpressure() with tree=true).
> 
> Move it all into a new mm/vmpressure-v1.c built only when
> CONFIG_MEMCG_V1=y (following the existing memcontrol-v1.o pattern).
> 
> vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
> the runtime hierarchy check, the tree=false body, init/cleanup
> plumbing) and calls into three small v1 hooks for the tree=true
> accumulator and the v1 portions of init/cleanup. The hooks have
> static-inline no-op stubs in include/linux/vmpressure.h for the
> !MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
> the same treatment, which means vmscan.c's call site disappears at
> compile time on v2-only kernels.
> 
> The only #ifdef CONFIG_MEMCG_V1 in source remains around the v1-only
> fields inside struct vmpressure itself.
> 
> Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):
> 
>   struct vmpressure :  112B ->   24B
>   struct mem_cgroup : 1664B -> 1536B
> 
> This split is the first step toward eventually making vmpressure
> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
> (tree=false) cannot be removed today immediately: PSI is not an
> exact replacement for vmpressure, and switching networking socket-buffer
> back-off to PSI may regress networking performance or increase memory
> pressure in workloads that today rely on vmpressure's hysteresis. The
> medium-term plan is to introduce a PSI-based socket-pressure path, keep
> vmpressure available for v2 behind a defconfig as an opt-out for several
> releases, and only then drop the tree=false path entirely, at which point
> everything that remains in mm/vmpressure-v1.c is the whole subsystem.
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
  2026-06-29 12:59 ` [PATCH v2 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
@ 2026-06-29 16:46   ` Johannes Weiner
  0 siblings, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2026-06-29 16:46 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, tj, mkoutny, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Mon, Jun 29, 2026 at 05:59:36AM -0700, Usama Arif wrote:
> vmpressure() has two outputs gated by the @tree argument:
> 
>   @tree=false drives in-kernel socket pressure (mem_cgroup_set_
>               socket_pressure), consumed by TCP/SCTP. This only
>               applies on cgroup v2; on v1 socket memory is charged
>               separately via tcpmem and the consumer reads
>               memcg->tcpmem_pressure instead.
> 
>   @tree=true  drives userspace eventfd notifications via the v1
>               memory.pressure_level / cgroup.event_control interface.
>               v2 has no equivalent: userspace gets reclaim signals
>               through memory.pressure (PSI), which does not touch
>               vmpressure.
> 
> The existing early return covered v1 + @tree=false. The symmetric
> v2 + @tree=true case was falling through and doing the full lock /
> accumulate / schedule_work / parent-walk dance for an events list
> that can never be populated. bpftrace on a 176-core production host
> (cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
> ~16,200 @tree=true vmpressure() calls per minute. Add an early return
> that skips cgroup v2 + tree = true which avoids us doing all this work.
> On a v2-only host this also eliminates a lock contention path that can
> serialise reclaimers on a single global sr_lock.
> 
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> ---
>  mm/vmpressure.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index f053554e5826..c82cee1ab43b 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  		return;
>  
>  	/*
> -	 * The in-kernel users only care about the reclaim efficiency
> -	 * for this @memcg rather than the whole subtree, and there
> -	 * isn't and won't be any in-kernel user in a legacy cgroup.
> +	 * Only two combinations have a consumer:
> +	 *   cgroup v2 + tree=false -> in-kernel socket pressure
> +	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
> +	 * Skip the other two: nothing consumes the result.
>  	 */
> -	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
> +	if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
> +	    (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
>  		return;

	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) == tree)
		return;
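
A quick way to see that the single comparison matches the two-clause
condition in the patch is to enumerate the four combinations. The sketch
below is a userspace check, not kernel code: on_dfl is a plain bool
standing in for cgroup_subsys_on_dfl(memory_cgrp_subsys), and the helper
names are made up for illustration. It assumes both operands are strictly
0 or 1, which is what makes the == form safe.

```c
#include <assert.h>
#include <stdbool.h>

/* Early-return condition as posted in the patch. */
static bool skip_as_posted(bool on_dfl, bool tree)
{
	return (!on_dfl && !tree) || (on_dfl && tree);
}

/* Early-return condition in the shorter suggested form. */
static bool skip_as_suggested(bool on_dfl, bool tree)
{
	return on_dfl == tree;
}

/* Walk all four (hierarchy, tree) combinations and compare both forms. */
static void check_skip_forms_agree(void)
{
	for (int d = 0; d <= 1; d++)
		for (int t = 0; t <= 1; t++)
			assert(skip_as_posted(d, t) == skip_as_suggested(d, t));
}
```

Both forms skip v1 + tree=false and v2 + tree=true, and let the other two
combinations through.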

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 12:59 ` [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c Usama Arif
  2026-06-29 13:34   ` Michal Koutný
  2026-06-29 15:57   ` Shakeel Butt
@ 2026-06-29 16:48   ` Johannes Weiner
  2026-06-29 17:23     ` Usama Arif
  2 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2026-06-29 16:48 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, tj, mkoutny, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Mon, Jun 29, 2026 at 05:59:37AM -0700, Usama Arif wrote:
> @@ -101,7 +101,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>  obj-$(CONFIG_LIVEUPDATE_MEMFD) += memfd_luo.o
> -obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> +obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o vmpressure-v1.o

Might as well move the interface part to memcontrol-v1.c instead of
creating a new file.

Otherwise looks good to me.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 15:20         ` Usama Arif
@ 2026-06-29 17:13           ` Michal Koutný
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Koutný @ 2026-06-29 17:13 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, tj, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Mon, Jun 29, 2026 at 04:20:10PM +0100, Usama Arif <usama.arif@linux.dev> wrote:
> What I mean is that I want the patch that splits off the v1 code to do
> just that, and not also carry the optimization of skipping vmpressure
> for tree=true on cgroup v2.
> 
> In the split patch, vmpressure_v1_account_tree() compiles to an empty
> function on CONFIG_MEMCG_V1=n builds, so if the split came first, that
> commit would not only move the v1 code but also introduce the
> optimization of not running tree=true accounting on cgroup v2.
> 
> I hope it makes sense?

All clear now, it has clicked for me which part is the optimization and
which is the cleanup.

Thanks,
Michal

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 16:48   ` Johannes Weiner
@ 2026-06-29 17:23     ` Usama Arif
  2026-06-29 18:12       ` Johannes Weiner
  0 siblings, 1 reply; 14+ messages in thread
From: Usama Arif @ 2026-06-29 17:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, david, linux-mm, tj, mkoutny, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team



On 29/06/2026 17:48, Johannes Weiner wrote:
> On Mon, Jun 29, 2026 at 05:59:37AM -0700, Usama Arif wrote:
>> @@ -101,7 +101,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>  obj-$(CONFIG_LIVEUPDATE_MEMFD) += memfd_luo.o
>> -obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>> +obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o vmpressure-v1.o
> 
> Might as well move the interface part to memcontrol-v1.c instead of
> creating a new file.
> 

I think it would make it easier once we move cgroup v2 off of
vmpressure. Then we can rename vmpressure-v1.c to just vmpressure.c
and gate it to CONFIG_MEMCG_V1. The other option would be everything
living in memcontrol-v1.c? I think it's nice to keep it separate as
memcontrol-v1.c is already 2K+ lines and this is a standalone feature
that can sit in a separate file.

> Otherwise looks good to me.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 17:23     ` Usama Arif
@ 2026-06-29 18:12       ` Johannes Weiner
  2026-06-29 18:28         ` Shakeel Butt
  0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2026-06-29 18:12 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, tj, mkoutny, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Mon, Jun 29, 2026 at 06:23:58PM +0100, Usama Arif wrote:
> 
> 
> On 29/06/2026 17:48, Johannes Weiner wrote:
> > On Mon, Jun 29, 2026 at 05:59:37AM -0700, Usama Arif wrote:
> >> @@ -101,7 +101,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> >>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> >>  obj-$(CONFIG_LIVEUPDATE_MEMFD) += memfd_luo.o
> >> -obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> >> +obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o vmpressure-v1.o
> > 
> > Might as well move the interface part to memcontrol-v1.c instead of
> > creating a new file.
> > 
> 
> I think it would make it easier once we move cgroup v2 off of
> vmpressure. Then we can rename vmpressure-v1.c to just vmpressure.c
> and gate it to CONFIG_MEMCG_V1. The other option would be everything
> living in memcontrol-v1.c?

Hm? I just mean move whatever you move into that new vmpressure-v1.c
into memcontrol-v1.c instead.

> I think it's nice to keep it separate as
> memcontrol-v1.c is already 2K+ lines and this is a standalone feature
> that can sit in a separate file.

It's all deprecated code with no active development. Containing it and
making it easy to see the footprint of the cgroup1 option is the
higher priority, more than maintaining neat component separation.

No need to work it naturally into the file either - decls/variables up
top, functions somewhere further down etc. Just paste it into one
contiguous vmpressure block under memcg1_oom_finish(). That's how that
file is already structured with the other components in there.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c
  2026-06-29 18:12       ` Johannes Weiner
@ 2026-06-29 18:28         ` Shakeel Butt
  0 siblings, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2026-06-29 18:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Usama Arif, Andrew Morton, david, linux-mm, tj, mkoutny,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Mon, Jun 29, 2026 at 02:12:38PM -0400, Johannes Weiner wrote:
> On Mon, Jun 29, 2026 at 06:23:58PM +0100, Usama Arif wrote:
> > 
> > 
> > On 29/06/2026 17:48, Johannes Weiner wrote:
> > > On Mon, Jun 29, 2026 at 05:59:37AM -0700, Usama Arif wrote:
> > >> @@ -101,7 +101,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > >>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > >>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > >>  obj-$(CONFIG_LIVEUPDATE_MEMFD) += memfd_luo.o
> > >> -obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> > >> +obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o vmpressure-v1.o
> > > 
> > > Might as well move the interface part to memcontrol-v1.c instead of
> > > creating a new file.
> > > 
> > 
> > I think it would make it easier once we move cgroup v2 off of
> > vmpressure. Then we can rename vmpressure-v1.c to just vmpressure.c
> > and gate it to CONFIG_MEMCG_V1. The other option would be everything
> > living in memcontrol-v1.c?
> 
> Hm? I just mean move whatever you move into that new vmpressure-v1.c
> into memcontrol-v1.c instead.
> 
> > I think it's nice to keep it separate as
> > memcontrol-v1.c is already 2K+ lines and this is a standalone feature
> > that can sit in a separate file.
> 
> It's all deprecated code with no active development. Containing it and
> making it easy to see the footprint of the cgroup1 option is the
> higher priority, more than maintaining neat component separation.
> 
> No need to work it naturally into the file either - decls/variables up
> top, functions somewhere further down etc. Just paste it into one
> contiguous vmpressure block under memcg1_oom_finish(). That's how that
> file is already structured with the other components in there.

Yup, this makes sense.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2026-06-29 18:29 UTC | newest]

Thread overview: 14+ messages
2026-06-29 12:59 [PATCH v2 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif
2026-06-29 12:59 ` [PATCH v2 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
2026-06-29 16:46   ` Johannes Weiner
2026-06-29 12:59 ` [PATCH v2 2/2] mm/vmpressure: split v1 userspace eventfd code into vmpressure-v1.c Usama Arif
2026-06-29 13:34   ` Michal Koutný
2026-06-29 13:55     ` Usama Arif
2026-06-29 14:29       ` Michal Koutný
2026-06-29 15:20         ` Usama Arif
2026-06-29 17:13           ` Michal Koutný
2026-06-29 15:57   ` Shakeel Butt
2026-06-29 16:48   ` Johannes Weiner
2026-06-29 17:23     ` Usama Arif
2026-06-29 18:12       ` Johannes Weiner
2026-06-29 18:28         ` Shakeel Butt
