[PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2

* [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2
@ 2026-06-30 11:23 Usama Arif
  2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
  2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
  0 siblings, 2 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
	linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
	Usama Arif

The vmpressure subsystem has two distinct consumers, gated by the
@tree argument:

  tree=false : in-kernel socket pressure, consumed by TCP/SCTP. This
               is cgroup v2 only; v1 sockets read memcg->tcpmem_pressure
               instead.
  tree=true  : cgroup v1 userspace eventfd notifications via the
               memory.pressure_level / cgroup.event_control interface.
               v2 has no equivalent (userspace gets reclaim signals
               through memory.pressure / PSI, which doesn't touch
               vmpressure).

So of the four (hierarchy, tree) combinations, only two carry data
that anyone reads. The existing early return in vmpressure() covered
v1 + tree=false; the symmetric v2 + tree=true case was falling through
and doing the full lock / accumulate / schedule_work / parent-walk
dance, even though the events list it eventually iterates is empty
on cgroup v2 (vmpressure_register_event() is wired up only through the
v1 cftype "memory.pressure_level" and can't be reached from a v2
memcg).

Patch 1 extends the existing early return to also skip v2 + tree=true.
On a v2-only host this eliminates a contended path where reclaimers
can serialize on a single global sr_lock. bpftrace on a 176-core production
host (cgroup v2, 285 memcgs, sustained reclaim) showed ~16,200
vmpressure() calls per minute with tree=true.

Patch 2 follows up with a cleanup: it splits the v1 userspace eventfd
interface (struct vmpressure_event, the events list and its mutex, the
work_struct and its handler, the parent walk,
vmpressure_register_event / unregister_event, and vmpressure_prio)
into a new mm/memcontrol-v1.c built only when CONFIG_MEMCG_V1=y,
behind small no-op stubs in the header. mm/vmpressure.c keeps the
shared bits and the tree=false socket-pressure path. vmpressure.c
shrinks to roughly half its size and the code becomes much simpler.
The only #ifdef CONFIG_MEMCG_V1 remaining in source is around the
v1-only fields inside struct vmpressure itself. Memory savings on
CONFIG_MEMCG_V1=n:
  struct vmpressure :  112B  ->  24B
  struct mem_cgroup : 1664B  -> 1536B

This split is the first step toward eventually making vmpressure
CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
(tree=false) cannot be removed immediately: PSI is not an
exact replacement for vmpressure, and switching networking socket-buffer
back-off to PSI may regress networking performance or increase memory
pressure in workloads that today rely on vmpressure's hysteresis. The
medium-term plan is to introduce a PSI-based socket-pressure path, keep
vmpressure available for v2 behind a defconfig as an opt-out for several
releases, and only then drop the tree=false path entirely, at which point
the vmpressure block in mm/memcontrol-v1.c is all that remains of the
subsystem.
---
v2 -> v3: https://lore.kernel.org/all/20260629130042.2649505-1-usama.arif@linux.dev/
- Move the cgroup v1 code into memcontrol-v1.c instead of creating a new
  file (Johannes)

v1 -> v2: https://lore.kernel.org/all/20260606114158.3126210-1-usama.arif@linux.dev/
- Add more in commit message about future plans of vmpressure for cgroup v2
  (Shakeel)
- Remove unnecessary return statement in vmpressure for v1 only tree path
  (Michal)
- Rebased onto latest mm-new

Usama Arif (2):
  mm/vmpressure: skip tree=true accounting on cgroup v2
  mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c

 include/linux/vmpressure.h |  46 +++++-
 mm/memcontrol-v1.c         | 292 +++++++++++++++++++++++++++++++++++
 mm/vmpressure.c            | 302 ++-----------------------------------
 3 files changed, 349 insertions(+), 291 deletions(-)

-- 
2.53.0-Meta

* [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
  2026-06-30 11:23 [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif
@ 2026-06-30 11:23 ` Usama Arif
  2026-06-30 16:07   ` Johannes Weiner
  2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
  1 sibling, 1 reply; 7+ messages in thread
From: Usama Arif @ 2026-06-30 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
	linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
	Usama Arif

vmpressure() has two outputs gated by the @tree argument:

  @tree=false drives in-kernel socket pressure (mem_cgroup_set_
              socket_pressure), consumed by TCP/SCTP. This only
              applies on cgroup v2; on v1 socket memory is charged
              separately via tcpmem and the consumer reads
              memcg->tcpmem_pressure instead.

  @tree=true  drives userspace eventfd notifications via the v1
              memory.pressure_level / cgroup.event_control interface.
              v2 has no equivalent: userspace gets reclaim signals
              through memory.pressure (PSI), which does not touch
              vmpressure.

The existing early return covered v1 + @tree=false. The symmetric
v2 + @tree=true case was falling through and doing the full lock /
accumulate / schedule_work / parent-walk dance for an events list
that can never be populated. bpftrace on a 176-core production host
(cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
~16,200 @tree=true vmpressure() calls per minute. Add an early return
for cgroup v2 + @tree=true so that none of this work is done.
On a v2-only host this also eliminates a lock contention path that can
serialise reclaimers on a single global sr_lock.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/vmpressure.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index f053554e5826..c82cee1ab43b 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 		return;
 
 	/*
-	 * The in-kernel users only care about the reclaim efficiency
-	 * for this @memcg rather than the whole subtree, and there
-	 * isn't and won't be any in-kernel user in a legacy cgroup.
+	 * Only two combinations have a consumer:
+	 *   cgroup v2 + tree=false -> in-kernel socket pressure
+	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
+	 * Skip the other two: nothing consumes the result.
 	 */
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
+	if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
+	    (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
 		return;
 
 	vmpr = memcg_to_vmpressure(memcg);
-- 
2.53.0-Meta



* [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
  2026-06-30 11:23 [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif
  2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
@ 2026-06-30 11:23 ` Usama Arif
  2026-06-30 12:32   ` Usama Arif
  2026-06-30 14:21   ` Shakeel Butt
  1 sibling, 2 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
	linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
	Usama Arif

Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
interface from the shared and v2 in-kernel code.

Currently, almost half of mm/vmpressure.c exists to serve tree=true:
struct vmpressure_event, the events list and its mutex, the work_struct
and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
parent walk, vmpressure_event(), vmpressure_register_event(),
vmpressure_unregister_event(), and vmpressure_prio() (which always
calls vmpressure() with tree=true).

Move it all into mm/memcontrol-v1.c (built only when CONFIG_MEMCG_V1=y)
as a single contiguous block, following the per-component layout already
used by that file. Keeping the v1 vmpressure code with the rest of the
deprecated cgroup v1 memory controller makes the full footprint of the
CONFIG_MEMCG_V1 option easy to see in one place, which matters more
than component-level file separation for code that has no active
development.

vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
the runtime hierarchy check, the tree=false body, init/cleanup
plumbing) and calls into three small v1 hooks for the tree=true
accumulator and the v1 portions of init/cleanup. The hooks have
static-inline no-op stubs in include/linux/vmpressure.h for the
!MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
the same treatment, which means vmscan.c's call site disappears at
compile time on v2-only kernels.

The only remaining #ifdef CONFIG_MEMCG_V1 in the source is around the
v1-only fields inside struct vmpressure itself.

Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):

  struct vmpressure :  112B ->   24B
  struct mem_cgroup : 1664B -> 1536B

This split is the first step toward eventually making vmpressure
CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
(tree=false) cannot be removed immediately: PSI is not an
exact replacement for vmpressure, and switching networking socket-buffer
back-off to PSI may regress networking performance or increase memory
pressure in workloads that today rely on vmpressure's hysteresis. The
medium-term plan is to introduce a PSI-based socket-pressure path, keep
vmpressure available for v2 behind a defconfig as an opt-out for several
releases, and only then drop the tree=false path entirely, at which point
the vmpressure block in mm/memcontrol-v1.c is all that remains of the
subsystem.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/vmpressure.h |  46 +++++-
 mm/memcontrol-v1.c         | 292 +++++++++++++++++++++++++++++++++++++
 mm/vmpressure.c            | 292 ++-----------------------------------
 3 files changed, 343 insertions(+), 287 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index faecd5522401..b4d13457bc2a 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -13,18 +13,31 @@
 struct vmpressure {
 	unsigned long scanned;
 	unsigned long reclaimed;
+	/* The lock is used to keep the scanned/reclaimed in sync. */
+	spinlock_t sr_lock;
 
+#ifdef CONFIG_MEMCG_V1
+	/*
+	 * tree=true accumulators feed the v1 userspace eventfd interface
+	 * (memory.pressure_level). Drained by @work. v2 has no equivalent
+	 * interface, so this state is omitted on CONFIG_MEMCG_V1=n builds.
+	 */
 	unsigned long tree_scanned;
 	unsigned long tree_reclaimed;
-	/* The lock is used to keep the scanned/reclaimed above in sync. */
-	spinlock_t sr_lock;
-
 	/* The list of vmpressure_event structs. */
 	struct list_head events;
 	/* Have to grab the lock on events traversal or modifications. */
 	struct mutex events_lock;
 
 	struct work_struct work;
+#endif
+};
+
+enum vmpressure_levels {
+	VMPRESSURE_LOW = 0,
+	VMPRESSURE_MEDIUM,
+	VMPRESSURE_CRITICAL,
+	VMPRESSURE_NUM_LEVELS,
 };
 
 struct mem_cgroup;
@@ -32,18 +45,41 @@ struct mem_cgroup;
 #ifdef CONFIG_MEMCG
 void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 		unsigned long scanned, unsigned long reclaimed);
-extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
-
 extern void vmpressure_init(struct vmpressure *vmpr);
 extern void vmpressure_cleanup(struct vmpressure *vmpr);
 extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg);
 extern struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr);
+
+/* Shared with the v1 vmpressure block in mm/memcontrol-v1.c. */
+extern const unsigned long vmpressure_win;
+extern enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
+						    unsigned long reclaimed);
+
+#ifdef CONFIG_MEMCG_V1
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
 extern int vmpressure_register_event(struct mem_cgroup *memcg,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
 extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
 					struct eventfd_ctx *eventfd);
+
+/* v1 hooks called from mm/vmpressure.c; no-ops below when !MEMCG_V1. */
+extern void vmpressure_v1_init(struct vmpressure *vmpr);
+extern void vmpressure_v1_cleanup(struct vmpressure *vmpr);
+extern void vmpressure_v1_account_tree(struct vmpressure *vmpr,
+				       unsigned long scanned,
+				       unsigned long reclaimed);
 #else
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+				   int prio) {}
+static inline void vmpressure_v1_init(struct vmpressure *vmpr) {}
+static inline void vmpressure_v1_cleanup(struct vmpressure *vmpr) {}
+static inline void vmpressure_v1_account_tree(struct vmpressure *vmpr,
+					      unsigned long scanned,
+					      unsigned long reclaimed) {}
+#endif /* CONFIG_MEMCG_V1 */
+
+#else /* !CONFIG_MEMCG */
 static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg,
 			      bool tree, unsigned long scanned,
 			      unsigned long reclaimed) {}
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 765069211567..135622b6172b 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -6,6 +6,7 @@
 #include <linux/pagewalk.h>
 #include <linux/backing-dev.h>
 #include <linux/eventfd.h>
+#include <linux/log2.h>
 #include <linux/poll.h>
 #include <linux/sort.h>
 #include <linux/file.h>
@@ -1476,6 +1477,297 @@ void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
 		mem_cgroup_oom_unlock(memcg);
 }
 
+/*
+ * cgroup v1 userspace vmpressure interface (memory.pressure_level /
+ * cgroup.event_control). Kept here so v2-only kernels (CONFIG_MEMCG_V1=n)
+ * drop the whole eventfd accumulator, its work item, and the per-memcg
+ * state it requires.
+ *
+ * When there are too little pages left to scan, vmpressure() may miss the
+ * critical pressure as number of pages will be less than "window size".
+ * However, in that case the vmscan priority will raise fast as the
+ * reclaimer will try to scan LRUs more deeply.
+ *
+ * The vmscan logic considers these special priorities:
+ *
+ * prio == DEF_PRIORITY (12): reclaimer starts with that value
+ * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
+ * prio == 0                : close to OOM, kernel scans every page in an lru
+ *
+ * Any value in this range is acceptable for this tunable (i.e. from 12 to
+ * 0). Current value for the vmpressure_level_critical_prio is chosen
+ * empirically, but the number, in essence, means that we consider
+ * critical level when scanning depth is ~10% of the lru size (vmscan
+ * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
+ * eights).
+ */
+static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
+
+enum vmpressure_modes {
+	VMPRESSURE_NO_PASSTHROUGH = 0,
+	VMPRESSURE_HIERARCHY,
+	VMPRESSURE_LOCAL,
+	VMPRESSURE_NUM_MODES,
+};
+
+static const char * const vmpressure_str_levels[] = {
+	[VMPRESSURE_LOW] = "low",
+	[VMPRESSURE_MEDIUM] = "medium",
+	[VMPRESSURE_CRITICAL] = "critical",
+};
+
+static const char * const vmpressure_str_modes[] = {
+	[VMPRESSURE_NO_PASSTHROUGH] = "default",
+	[VMPRESSURE_HIERARCHY] = "hierarchy",
+	[VMPRESSURE_LOCAL] = "local",
+};
+
+struct vmpressure_event {
+	struct eventfd_ctx *efd;
+	enum vmpressure_levels level;
+	enum vmpressure_modes mode;
+	struct list_head node;
+};
+
+static struct vmpressure *work_to_vmpressure(struct work_struct *work)
+{
+	return container_of(work, struct vmpressure, work);
+}
+
+static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
+{
+	struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
+
+	memcg = parent_mem_cgroup(memcg);
+	if (!memcg)
+		return NULL;
+	return memcg_to_vmpressure(memcg);
+}
+
+static bool vmpressure_event(struct vmpressure *vmpr,
+			     const enum vmpressure_levels level,
+			     bool ancestor, bool signalled)
+{
+	struct vmpressure_event *ev;
+	bool ret = false;
+
+	mutex_lock(&vmpr->events_lock);
+	list_for_each_entry(ev, &vmpr->events, node) {
+		if (ancestor && ev->mode == VMPRESSURE_LOCAL)
+			continue;
+		if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
+			continue;
+		if (level < ev->level)
+			continue;
+		eventfd_signal(ev->efd);
+		ret = true;
+	}
+	mutex_unlock(&vmpr->events_lock);
+
+	return ret;
+}
+
+static void vmpressure_work_fn(struct work_struct *work)
+{
+	struct vmpressure *vmpr = work_to_vmpressure(work);
+	unsigned long scanned;
+	unsigned long reclaimed;
+	enum vmpressure_levels level;
+	bool ancestor = false;
+	bool signalled = false;
+
+	spin_lock(&vmpr->sr_lock);
+	/*
+	 * Several contexts might be calling vmpressure(), so it is
+	 * possible that the work was rescheduled again before the old
+	 * work context cleared the counters. In that case we will run
+	 * just after the old work returns, but then scanned might be zero
+	 * here. No need for any locks here since we don't care if
+	 * vmpr->reclaimed is in sync.
+	 */
+	scanned = vmpr->tree_scanned;
+	if (!scanned) {
+		spin_unlock(&vmpr->sr_lock);
+		return;
+	}
+
+	reclaimed = vmpr->tree_reclaimed;
+	vmpr->tree_scanned = 0;
+	vmpr->tree_reclaimed = 0;
+	spin_unlock(&vmpr->sr_lock);
+
+	level = vmpressure_calc_level(scanned, reclaimed);
+
+	do {
+		if (vmpressure_event(vmpr, level, ancestor, signalled))
+			signalled = true;
+		ancestor = true;
+	} while ((vmpr = vmpressure_parent(vmpr)));
+}
+
+/*
+ * Tree-mode accumulator: accumulate per-memcg scanned/reclaimed and
+ * schedule the work that walks the parent chain and signals registered
+ * eventfd listeners once we cross the window threshold.
+ */
+void vmpressure_v1_account_tree(struct vmpressure *vmpr,
+				unsigned long scanned,
+				unsigned long reclaimed)
+{
+	spin_lock(&vmpr->sr_lock);
+	scanned = vmpr->tree_scanned += scanned;
+	vmpr->tree_reclaimed += reclaimed;
+	spin_unlock(&vmpr->sr_lock);
+
+	if (scanned < vmpressure_win)
+		return;
+	schedule_work(&vmpr->work);
+}
+
+void vmpressure_v1_init(struct vmpressure *vmpr)
+{
+	mutex_init(&vmpr->events_lock);
+	INIT_LIST_HEAD(&vmpr->events);
+	INIT_WORK(&vmpr->work, vmpressure_work_fn);
+}
+
+void vmpressure_v1_cleanup(struct vmpressure *vmpr)
+{
+	/*
+	 * Make sure there is no pending work before eventfd infrastructure
+	 * goes away.
+	 */
+	flush_work(&vmpr->work);
+}
+
+/**
+ * vmpressure_prio() - Account memory pressure through reclaimer priority level
+ * @gfp:	reclaimer's gfp mask
+ * @memcg:	cgroup memory controller handle
+ * @prio:	reclaimer's priority
+ *
+ * This function should be called from the reclaim path every time when
+ * the vmscan's reclaiming priority (scanning depth) changes.
+ *
+ * This function does not return any value.
+ */
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
+{
+	/*
+	 * We only use prio for accounting critical level. For more info
+	 * see comment for vmpressure_level_critical_prio variable above.
+	 */
+	if (prio > vmpressure_level_critical_prio)
+		return;
+
+	/*
+	 * OK, the prio is below the threshold, updating vmpressure
+	 * information before shrinker dives into long shrinking of long
+	 * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
+	 * to the vmpressure() basically means that we signal 'critical'
+	 * level.
+	 */
+	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
+}
+
+#define MAX_VMPRESSURE_ARGS_LEN	(strlen("critical") + strlen("hierarchy") + 2)
+
+/**
+ * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
+ * @memcg:	memcg that is interested in vmpressure notifications
+ * @eventfd:	eventfd context to link notifications with
+ * @args:	event arguments (pressure level threshold, optional mode)
+ *
+ * This function associates eventfd context with the vmpressure
+ * infrastructure, so that the notifications will be delivered to the
+ * @eventfd. The @args parameter is a comma-delimited string that denotes a
+ * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
+ * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
+ * "hierarchy" or "local").
+ *
+ * To be used as memcg event method.
+ *
+ * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
+ * not be parsed.
+ */
+int vmpressure_register_event(struct mem_cgroup *memcg,
+			      struct eventfd_ctx *eventfd, const char *args)
+{
+	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
+	struct vmpressure_event *ev;
+	enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
+	enum vmpressure_levels level;
+	char *spec, *spec_orig;
+	char *token;
+	int ret = 0;
+
+	spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
+	if (!spec)
+		return -ENOMEM;
+
+	/* Find required level */
+	token = strsep(&spec, ",");
+	ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
+	if (ret < 0)
+		goto out;
+	level = ret;
+
+	/* Find optional mode */
+	token = strsep(&spec, ",");
+	if (token) {
+		ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
+		if (ret < 0)
+			goto out;
+		mode = ret;
+	}
+
+	ev = kzalloc_obj(*ev);
+	if (!ev) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ev->efd = eventfd;
+	ev->level = level;
+	ev->mode = mode;
+
+	mutex_lock(&vmpr->events_lock);
+	list_add(&ev->node, &vmpr->events);
+	mutex_unlock(&vmpr->events_lock);
+	ret = 0;
+out:
+	kfree(spec_orig);
+	return ret;
+}
+
+/**
+ * vmpressure_unregister_event() - Unbind eventfd from vmpressure
+ * @memcg:	memcg handle
+ * @eventfd:	eventfd context that was used to link vmpressure with the @cg
+ *
+ * This function does internal manipulations to detach the @eventfd from
+ * the vmpressure notifications, and then frees internal resources
+ * associated with the @eventfd (but the @eventfd itself is not freed).
+ *
+ * To be used as memcg event method.
+ */
+void vmpressure_unregister_event(struct mem_cgroup *memcg,
+				 struct eventfd_ctx *eventfd)
+{
+	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
+	struct vmpressure_event *ev;
+
+	mutex_lock(&vmpr->events_lock);
+	list_for_each_entry(ev, &vmpr->events, node) {
+		if (ev->efd != eventfd)
+			continue;
+		list_del(&ev->node);
+		kfree(ev);
+		break;
+	}
+	mutex_unlock(&vmpr->events_lock);
+}
+
 static DEFINE_MUTEX(memcg_max_mutex);
 
 static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index c82cee1ab43b..14470141bbe6 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -7,16 +7,15 @@
  *
  * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
  * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * Tree-mode (cgroup v1 userspace eventfd) bookkeeping lives in
+ * mm/memcontrol-v1.c; this file holds the shared code and the in-kernel
+ * (tree=false) socket-pressure path that runs on cgroup v2.
  */
 
 #include <linux/cgroup.h>
-#include <linux/fs.h>
 #include <linux/log2.h>
-#include <linux/sched.h>
 #include <linux/mm.h>
-#include <linux/vmstat.h>
-#include <linux/eventfd.h>
-#include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/printk.h>
 #include <linux/vmpressure.h>
@@ -35,7 +34,7 @@
  * TODO: Make the window size depend on machine size, as we do for vmstat
  * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
  */
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
 
 /*
  * These thresholds are used when we account memory pressure through
@@ -46,68 +45,6 @@ static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
 static const unsigned int vmpressure_level_med = 60;
 static const unsigned int vmpressure_level_critical = 95;
 
-/*
- * When there are too little pages left to scan, vmpressure() may miss the
- * critical pressure as number of pages will be less than "window size".
- * However, in that case the vmscan priority will raise fast as the
- * reclaimer will try to scan LRUs more deeply.
- *
- * The vmscan logic considers these special priorities:
- *
- * prio == DEF_PRIORITY (12): reclaimer starts with that value
- * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
- * prio == 0                : close to OOM, kernel scans every page in an lru
- *
- * Any value in this range is acceptable for this tunable (i.e. from 12 to
- * 0). Current value for the vmpressure_level_critical_prio is chosen
- * empirically, but the number, in essence, means that we consider
- * critical level when scanning depth is ~10% of the lru size (vmscan
- * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
- * eights).
- */
-static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
-
-static struct vmpressure *work_to_vmpressure(struct work_struct *work)
-{
-	return container_of(work, struct vmpressure, work);
-}
-
-static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
-{
-	struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
-
-	memcg = parent_mem_cgroup(memcg);
-	if (!memcg)
-		return NULL;
-	return memcg_to_vmpressure(memcg);
-}
-
-enum vmpressure_levels {
-	VMPRESSURE_LOW = 0,
-	VMPRESSURE_MEDIUM,
-	VMPRESSURE_CRITICAL,
-	VMPRESSURE_NUM_LEVELS,
-};
-
-enum vmpressure_modes {
-	VMPRESSURE_NO_PASSTHROUGH = 0,
-	VMPRESSURE_HIERARCHY,
-	VMPRESSURE_LOCAL,
-	VMPRESSURE_NUM_MODES,
-};
-
-static const char * const vmpressure_str_levels[] = {
-	[VMPRESSURE_LOW] = "low",
-	[VMPRESSURE_MEDIUM] = "medium",
-	[VMPRESSURE_CRITICAL] = "critical",
-};
-
-static const char * const vmpressure_str_modes[] = {
-	[VMPRESSURE_NO_PASSTHROUGH] = "default",
-	[VMPRESSURE_HIERARCHY] = "hierarchy",
-	[VMPRESSURE_LOCAL] = "local",
-};
-
 static enum vmpressure_levels vmpressure_level(unsigned long pressure)
 {
 	if (pressure >= vmpressure_level_critical)
@@ -117,8 +54,8 @@ static enum vmpressure_levels vmpressure_level(unsigned long pressure)
 	return VMPRESSURE_LOW;
 }
 
-static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
-						    unsigned long reclaimed)
+enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
+					     unsigned long reclaimed)
 {
 	unsigned long scale = scanned + reclaimed;
 	unsigned long pressure = 0;
@@ -147,74 +84,6 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 	return vmpressure_level(pressure);
 }
 
-struct vmpressure_event {
-	struct eventfd_ctx *efd;
-	enum vmpressure_levels level;
-	enum vmpressure_modes mode;
-	struct list_head node;
-};
-
-static bool vmpressure_event(struct vmpressure *vmpr,
-			     const enum vmpressure_levels level,
-			     bool ancestor, bool signalled)
-{
-	struct vmpressure_event *ev;
-	bool ret = false;
-
-	mutex_lock(&vmpr->events_lock);
-	list_for_each_entry(ev, &vmpr->events, node) {
-		if (ancestor && ev->mode == VMPRESSURE_LOCAL)
-			continue;
-		if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
-			continue;
-		if (level < ev->level)
-			continue;
-		eventfd_signal(ev->efd);
-		ret = true;
-	}
-	mutex_unlock(&vmpr->events_lock);
-
-	return ret;
-}
-
-static void vmpressure_work_fn(struct work_struct *work)
-{
-	struct vmpressure *vmpr = work_to_vmpressure(work);
-	unsigned long scanned;
-	unsigned long reclaimed;
-	enum vmpressure_levels level;
-	bool ancestor = false;
-	bool signalled = false;
-
-	spin_lock(&vmpr->sr_lock);
-	/*
-	 * Several contexts might be calling vmpressure(), so it is
-	 * possible that the work was rescheduled again before the old
-	 * work context cleared the counters. In that case we will run
-	 * just after the old work returns, but then scanned might be zero
-	 * here. No need for any locks here since we don't care if
-	 * vmpr->reclaimed is in sync.
-	 */
-	scanned = vmpr->tree_scanned;
-	if (!scanned) {
-		spin_unlock(&vmpr->sr_lock);
-		return;
-	}
-
-	reclaimed = vmpr->tree_reclaimed;
-	vmpr->tree_scanned = 0;
-	vmpr->tree_reclaimed = 0;
-	spin_unlock(&vmpr->sr_lock);
-
-	level = vmpressure_calc_level(scanned, reclaimed);
-
-	do {
-		if (vmpressure_event(vmpr, level, ancestor, signalled))
-			signalled = true;
-		ancestor = true;
-	} while ((vmpr = vmpressure_parent(vmpr)));
-}
-
 /**
  * vmpressure() - Account memory pressure through scanned/reclaimed ratio
  * @gfp:	reclaimer's gfp mask
@@ -283,14 +152,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 		return;
 
 	if (tree) {
-		spin_lock(&vmpr->sr_lock);
-		scanned = vmpr->tree_scanned += scanned;
-		vmpr->tree_reclaimed += reclaimed;
-		spin_unlock(&vmpr->sr_lock);
-
-		if (scanned < vmpressure_win)
-			return;
-		schedule_work(&vmpr->work);
+		vmpressure_v1_account_tree(vmpr, scanned, reclaimed);
 	} else {
 		enum vmpressure_levels level;
 
@@ -332,134 +194,6 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 	}
 }
 
-/**
- * vmpressure_prio() - Account memory pressure through reclaimer priority level
- * @gfp:	reclaimer's gfp mask
- * @memcg:	cgroup memory controller handle
- * @prio:	reclaimer's priority
- *
- * This function should be called from the reclaim path every time when
- * the vmscan's reclaiming priority (scanning depth) changes.
- *
- * This function does not return any value.
- */
-void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
-{
-	/*
-	 * We only use prio for accounting critical level. For more info
-	 * see comment for vmpressure_level_critical_prio variable above.
-	 */
-	if (prio > vmpressure_level_critical_prio)
-		return;
-
-	/*
-	 * OK, the prio is below the threshold, updating vmpressure
-	 * information before shrinker dives into long shrinking of long
-	 * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
-	 * to the vmpressure() basically means that we signal 'critical'
-	 * level.
-	 */
-	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
-}
-
-#define MAX_VMPRESSURE_ARGS_LEN	(strlen("critical") + strlen("hierarchy") + 2)
-
-/**
- * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
- * @memcg:	memcg that is interested in vmpressure notifications
- * @eventfd:	eventfd context to link notifications with
- * @args:	event arguments (pressure level threshold, optional mode)
- *
- * This function associates eventfd context with the vmpressure
- * infrastructure, so that the notifications will be delivered to the
- * @eventfd. The @args parameter is a comma-delimited string that denotes a
- * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
- * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
- * "hierarchy" or "local").
- *
- * To be used as memcg event method.
- *
- * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
- * not be parsed.
- */
-int vmpressure_register_event(struct mem_cgroup *memcg,
-			      struct eventfd_ctx *eventfd, const char *args)
-{
-	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
-	struct vmpressure_event *ev;
-	enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
-	enum vmpressure_levels level;
-	char *spec, *spec_orig;
-	char *token;
-	int ret = 0;
-
-	spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
-	if (!spec)
-		return -ENOMEM;
-
-	/* Find required level */
-	token = strsep(&spec, ",");
-	ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
-	if (ret < 0)
-		goto out;
-	level = ret;
-
-	/* Find optional mode */
-	token = strsep(&spec, ",");
-	if (token) {
-		ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
-		if (ret < 0)
-			goto out;
-		mode = ret;
-	}
-
-	ev = kzalloc_obj(*ev);
-	if (!ev) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	ev->efd = eventfd;
-	ev->level = level;
-	ev->mode = mode;
-
-	mutex_lock(&vmpr->events_lock);
-	list_add(&ev->node, &vmpr->events);
-	mutex_unlock(&vmpr->events_lock);
-	ret = 0;
-out:
-	kfree(spec_orig);
-	return ret;
-}
-
-/**
- * vmpressure_unregister_event() - Unbind eventfd from vmpressure
- * @memcg:	memcg handle
- * @eventfd:	eventfd context that was used to link vmpressure with the @cg
- *
- * This function does internal manipulations to detach the @eventfd from
- * the vmpressure notifications, and then frees internal resources
- * associated with the @eventfd (but the @eventfd itself is not freed).
- *
- * To be used as memcg event method.
- */
-void vmpressure_unregister_event(struct mem_cgroup *memcg,
-				 struct eventfd_ctx *eventfd)
-{
-	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
-	struct vmpressure_event *ev;
-
-	mutex_lock(&vmpr->events_lock);
-	list_for_each_entry(ev, &vmpr->events, node) {
-		if (ev->efd != eventfd)
-			continue;
-		list_del(&ev->node);
-		kfree(ev);
-		break;
-	}
-	mutex_unlock(&vmpr->events_lock);
-}
-
 /**
  * vmpressure_init() - Initialize vmpressure control structure
  * @vmpr:	Structure to be initialized
@@ -470,9 +204,7 @@ void vmpressure_unregister_event(struct mem_cgroup *memcg,
 void vmpressure_init(struct vmpressure *vmpr)
 {
 	spin_lock_init(&vmpr->sr_lock);
-	mutex_init(&vmpr->events_lock);
-	INIT_LIST_HEAD(&vmpr->events);
-	INIT_WORK(&vmpr->work, vmpressure_work_fn);
+	vmpressure_v1_init(vmpr);
 }
 
 /**
@@ -484,9 +216,5 @@ void vmpressure_init(struct vmpressure *vmpr)
  */
 void vmpressure_cleanup(struct vmpressure *vmpr)
 {
-	/*
-	 * Make sure there is no pending work before eventfd infrastructure
-	 * goes away.
-	 */
-	flush_work(&vmpr->work);
+	vmpressure_v1_cleanup(vmpr);
 }
-- 
2.53.0-Meta




* Re: [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
  2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
@ 2026-06-30 12:32   ` Usama Arif
  2026-06-30 14:21   ` Shakeel Butt
  1 sibling, 0 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 12:32 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, tj, mkoutny, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Tue, 30 Jun 2026 04:23:33 -0700 Usama Arif <usama.arif@linux.dev> wrote:

> Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
> interface from the shared and v2 in-kernel code.
> 
> Currently, almost half of mm/vmpressure.c exists to serve tree=true:
> struct vmpressure_event, the events list and its mutex, the work_struct
> and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
> parent walk, vmpressure_event(), vmpressure_register_event(),
> vmpressure_unregister_event(), and vmpressure_prio() (which always
> calls vmpressure() with tree=true).
> 
> Move it all into mm/memcontrol-v1.c (built only when CONFIG_MEMCG_V1=y)
> as a single contiguous block, following the per-component layout already
> used by that file. Keeping the v1 vmpressure code with the rest of the
> deprecated cgroup v1 memory controller makes the full footprint of the
> CONFIG_MEMCG_V1 option easy to see in one place, which matters more
> than component-level file separation for code that has no active
> development.
> 
> vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
> the runtime hierarchy check, the tree=false body, init/cleanup
> plumbing) and calls into three small v1 hooks for the tree=true
> accumulator and the v1 portions of init/cleanup. The hooks have
> static-inline no-op stubs in include/linux/vmpressure.h for the
> !MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
> the same treatment, which means vmscan.c's call site disappears at
> compile time on v2-only kernels.
> 
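
The stub arrangement described in the paragraph above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the code from the patch: DEMO_MEMCG_V1 stands in for CONFIG_MEMCG_V1, and demo_vmpressure, demo_v1_account_tree() and demo_account() are hypothetical names. Locking, the window threshold and the work item are left out.

```c
/*
 * Hypothetical stand-in for CONFIG_MEMCG_V1. Leave it undefined to get the
 * "v2-only" build, where the hook below collapses to an empty inline stub.
 */
/* #define DEMO_MEMCG_V1 1 */

struct demo_vmpressure {
	unsigned long scanned;
	unsigned long reclaimed;
#ifdef DEMO_MEMCG_V1
	/* v1-only state, omitted entirely from the v2-only build. */
	unsigned long tree_scanned;
	unsigned long tree_reclaimed;
#endif
};

#ifdef DEMO_MEMCG_V1
/* Real hook: accumulate into the v1-only counters. */
static inline void demo_v1_account_tree(struct demo_vmpressure *vmpr,
					unsigned long scanned,
					unsigned long reclaimed)
{
	vmpr->tree_scanned += scanned;
	vmpr->tree_reclaimed += reclaimed;
}
#else
/* No-op stub: same signature, empty body, so the call compiles away. */
static inline void demo_v1_account_tree(struct demo_vmpressure *vmpr,
					unsigned long scanned,
					unsigned long reclaimed)
{
	(void)vmpr;
	(void)scanned;
	(void)reclaimed;
}
#endif

/* The caller needs no #ifdef; it calls the hook unconditionally. */
static inline void demo_account(struct demo_vmpressure *vmpr, int tree,
				unsigned long scanned,
				unsigned long reclaimed)
{
	if (tree) {
		demo_v1_account_tree(vmpr, scanned, reclaimed);
	} else {
		vmpr->scanned += scanned;
		vmpr->reclaimed += reclaimed;
	}
}
```

With DEMO_MEMCG_V1 undefined, a call such as demo_account(&v, 1, 512, 0) leaves v untouched, which mirrors how the tree=true path is meant to cost nothing on a v2-only build.
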
> The only #ifdef CONFIG_MEMCG_V1 in source remains around the v1-only
> fields inside struct vmpressure itself.
> 
> Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):
> 
>   struct vmpressure :  112B ->   24B
>   struct mem_cgroup : 1664B -> 1536B
> 
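
The 24B figure for struct vmpressure can be sanity-checked with a userspace mock of the three fields that remain on CONFIG_MEMCG_V1=n. This is a sketch under assumptions, not pahole output: it assumes an LP64 target (8-byte unsigned long, modeled here as uint64_t) and a 4-byte spinlock_t, which is typical only when lock debugging is disabled. demo_spinlock_t and demo_vmpressure_v2 are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: spinlock_t occupies 4 bytes (no lock debugging enabled). */
typedef struct {
	uint32_t raw;
} demo_spinlock_t;

/* Mock of the fields left in struct vmpressure when the v1 block is gone. */
struct demo_vmpressure_v2 {
	uint64_t scanned;		/* offset 0 */
	uint64_t reclaimed;		/* offset 8 */
	demo_spinlock_t sr_lock;	/* offset 16, then 4 bytes of tail padding */
};

/* 8 + 8 + 4 = 20 bytes of data, rounded up to the 8-byte alignment: 24. */
_Static_assert(offsetof(struct demo_vmpressure_v2, sr_lock) == 16,
	       "sr_lock is expected right after the two counters");
_Static_assert(sizeof(struct demo_vmpressure_v2) == 24,
	       "expected 24 bytes on an LP64 target");
```

The mock does not attempt to reproduce the 112B figure, since that depends on the sizes of struct mutex and struct work_struct in the kernel being built.
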
> This split is the first step toward eventually making vmpressure
> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
> (tree=false) cannot be removed immediately: PSI is not an
> exact replacement for vmpressure, and switching networking socket-buffer
> back-off to PSI may regress networking performance or increase memory
> pressure in workloads that today rely on vmpressure's hysteresis. The
> medium-term plan is to introduce a PSI-based socket-pressure path, keep
> vmpressure available for v2 behind a defconfig as an opt-out for several
> releases, and only then drop the tree=false path entirely, at which point
> everything that remains of the vmpressure block in mm/memcontrol-v1.c is
> the whole subsystem.
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>

Shakeel had acked the previous version, but I forgot to carry it over,
sorry about that!

> ---
>  include/linux/vmpressure.h |  46 +++++-
>  mm/memcontrol-v1.c         | 292 +++++++++++++++++++++++++++++++++++++
>  mm/vmpressure.c            | 292 ++-----------------------------------
>  3 files changed, 343 insertions(+), 287 deletions(-)
> 
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index faecd5522401..b4d13457bc2a 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -13,18 +13,31 @@
>  struct vmpressure {
>  	unsigned long scanned;
>  	unsigned long reclaimed;
> +	/* The lock is used to keep the scanned/reclaimed in sync. */
> +	spinlock_t sr_lock;
>  
> +#ifdef CONFIG_MEMCG_V1
> +	/*
> +	 * tree=true accumulators feed the v1 userspace eventfd interface
> +	 * (memory.pressure_level). Drained by @work. v2 has no equivalent
> +	 * interface, so this state is omitted on CONFIG_MEMCG_V1=n builds.
> +	 */
>  	unsigned long tree_scanned;
>  	unsigned long tree_reclaimed;
> -	/* The lock is used to keep the scanned/reclaimed above in sync. */
> -	spinlock_t sr_lock;
> -
>  	/* The list of vmpressure_event structs. */
>  	struct list_head events;
>  	/* Have to grab the lock on events traversal or modifications. */
>  	struct mutex events_lock;
>  
>  	struct work_struct work;
> +#endif
> +};
> +
> +enum vmpressure_levels {
> +	VMPRESSURE_LOW = 0,
> +	VMPRESSURE_MEDIUM,
> +	VMPRESSURE_CRITICAL,
> +	VMPRESSURE_NUM_LEVELS,
>  };
>  
>  struct mem_cgroup;
> @@ -32,18 +45,41 @@ struct mem_cgroup;
>  #ifdef CONFIG_MEMCG
>  void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  		unsigned long scanned, unsigned long reclaimed);
> -extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
> -
>  extern void vmpressure_init(struct vmpressure *vmpr);
>  extern void vmpressure_cleanup(struct vmpressure *vmpr);
>  extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg);
>  extern struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr);
> +
> +/* Shared with the v1 vmpressure block in mm/memcontrol-v1.c. */
> +extern const unsigned long vmpressure_win;
> +extern enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> +						    unsigned long reclaimed);
> +
> +#ifdef CONFIG_MEMCG_V1
> +extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
>  extern int vmpressure_register_event(struct mem_cgroup *memcg,
>  				     struct eventfd_ctx *eventfd,
>  				     const char *args);
>  extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
>  					struct eventfd_ctx *eventfd);
> +
> +/* v1 hooks called from mm/vmpressure.c; no-ops below when !MEMCG_V1. */
> +extern void vmpressure_v1_init(struct vmpressure *vmpr);
> +extern void vmpressure_v1_cleanup(struct vmpressure *vmpr);
> +extern void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> +				       unsigned long scanned,
> +				       unsigned long reclaimed);
>  #else
> +static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
> +				   int prio) {}
> +static inline void vmpressure_v1_init(struct vmpressure *vmpr) {}
> +static inline void vmpressure_v1_cleanup(struct vmpressure *vmpr) {}
> +static inline void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> +					      unsigned long scanned,
> +					      unsigned long reclaimed) {}
> +#endif /* CONFIG_MEMCG_V1 */
> +
> +#else /* !CONFIG_MEMCG */
>  static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg,
>  			      bool tree, unsigned long scanned,
>  			      unsigned long reclaimed) {}
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 765069211567..135622b6172b 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -6,6 +6,7 @@
>  #include <linux/pagewalk.h>
>  #include <linux/backing-dev.h>
>  #include <linux/eventfd.h>
> +#include <linux/log2.h>
>  #include <linux/poll.h>
>  #include <linux/sort.h>
>  #include <linux/file.h>
> @@ -1476,6 +1477,297 @@ void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
>  		mem_cgroup_oom_unlock(memcg);
>  }
>  
> +/*
> + * cgroup v1 userspace vmpressure interface (memory.pressure_level /
> + * cgroup.event_control). Kept here so v2-only kernels (CONFIG_MEMCG_V1=n)
> + * drop the whole eventfd accumulator, its work item, and the per-memcg
> + * state it requires.
> + *
> + * When there are too little pages left to scan, vmpressure() may miss the
> + * critical pressure as number of pages will be less than "window size".
> + * However, in that case the vmscan priority will raise fast as the
> + * reclaimer will try to scan LRUs more deeply.
> + *
> + * The vmscan logic considers these special priorities:
> + *
> + * prio == DEF_PRIORITY (12): reclaimer starts with that value
> + * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
> + * prio == 0                : close to OOM, kernel scans every page in an lru
> + *
> + * Any value in this range is acceptable for this tunable (i.e. from 12 to
> + * 0). Current value for the vmpressure_level_critical_prio is chosen
> + * empirically, but the number, in essence, means that we consider
> + * critical level when scanning depth is ~10% of the lru size (vmscan
> + * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
> + * eights).
> + */
> +static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
> +
> +enum vmpressure_modes {
> +	VMPRESSURE_NO_PASSTHROUGH = 0,
> +	VMPRESSURE_HIERARCHY,
> +	VMPRESSURE_LOCAL,
> +	VMPRESSURE_NUM_MODES,
> +};
> +
> +static const char * const vmpressure_str_levels[] = {
> +	[VMPRESSURE_LOW] = "low",
> +	[VMPRESSURE_MEDIUM] = "medium",
> +	[VMPRESSURE_CRITICAL] = "critical",
> +};
> +
> +static const char * const vmpressure_str_modes[] = {
> +	[VMPRESSURE_NO_PASSTHROUGH] = "default",
> +	[VMPRESSURE_HIERARCHY] = "hierarchy",
> +	[VMPRESSURE_LOCAL] = "local",
> +};
> +
> +struct vmpressure_event {
> +	struct eventfd_ctx *efd;
> +	enum vmpressure_levels level;
> +	enum vmpressure_modes mode;
> +	struct list_head node;
> +};
> +
> +static struct vmpressure *work_to_vmpressure(struct work_struct *work)
> +{
> +	return container_of(work, struct vmpressure, work);
> +}
> +
> +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> +{
> +	struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
> +
> +	memcg = parent_mem_cgroup(memcg);
> +	if (!memcg)
> +		return NULL;
> +	return memcg_to_vmpressure(memcg);
> +}
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> +			     const enum vmpressure_levels level,
> +			     bool ancestor, bool signalled)
> +{
> +	struct vmpressure_event *ev;
> +	bool ret = false;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (ancestor && ev->mode == VMPRESSURE_LOCAL)
> +			continue;
> +		if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
> +			continue;
> +		if (level < ev->level)
> +			continue;
> +		eventfd_signal(ev->efd);
> +		ret = true;
> +	}
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return ret;
> +}
> +
> +static void vmpressure_work_fn(struct work_struct *work)
> +{
> +	struct vmpressure *vmpr = work_to_vmpressure(work);
> +	unsigned long scanned;
> +	unsigned long reclaimed;
> +	enum vmpressure_levels level;
> +	bool ancestor = false;
> +	bool signalled = false;
> +
> +	spin_lock(&vmpr->sr_lock);
> +	/*
> +	 * Several contexts might be calling vmpressure(), so it is
> +	 * possible that the work was rescheduled again before the old
> +	 * work context cleared the counters. In that case we will run
> +	 * just after the old work returns, but then scanned might be zero
> +	 * here. No need for any locks here since we don't care if
> +	 * vmpr->reclaimed is in sync.
> +	 */
> +	scanned = vmpr->tree_scanned;
> +	if (!scanned) {
> +		spin_unlock(&vmpr->sr_lock);
> +		return;
> +	}
> +
> +	reclaimed = vmpr->tree_reclaimed;
> +	vmpr->tree_scanned = 0;
> +	vmpr->tree_reclaimed = 0;
> +	spin_unlock(&vmpr->sr_lock);
> +
> +	level = vmpressure_calc_level(scanned, reclaimed);
> +
> +	do {
> +		if (vmpressure_event(vmpr, level, ancestor, signalled))
> +			signalled = true;
> +		ancestor = true;
> +	} while ((vmpr = vmpressure_parent(vmpr)));
> +}
> +
> +/*
> + * Tree-mode accumulator: accumulate per-memcg scanned/reclaimed and
> + * schedule the work that walks the parent chain and signals registered
> + * eventfd listeners once we cross the window threshold.
> + */
> +void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> +				unsigned long scanned,
> +				unsigned long reclaimed)
> +{
> +	spin_lock(&vmpr->sr_lock);
> +	scanned = vmpr->tree_scanned += scanned;
> +	vmpr->tree_reclaimed += reclaimed;
> +	spin_unlock(&vmpr->sr_lock);
> +
> +	if (scanned < vmpressure_win)
> +		return;
> +	schedule_work(&vmpr->work);
> +}
> +
> +void vmpressure_v1_init(struct vmpressure *vmpr)
> +{
> +	mutex_init(&vmpr->events_lock);
> +	INIT_LIST_HEAD(&vmpr->events);
> +	INIT_WORK(&vmpr->work, vmpressure_work_fn);
> +}
> +
> +void vmpressure_v1_cleanup(struct vmpressure *vmpr)
> +{
> +	/*
> +	 * Make sure there is no pending work before eventfd infrastructure
> +	 * goes away.
> +	 */
> +	flush_work(&vmpr->work);
> +}
> +
> +/**
> + * vmpressure_prio() - Account memory pressure through reclaimer priority level
> + * @gfp:	reclaimer's gfp mask
> + * @memcg:	cgroup memory controller handle
> + * @prio:	reclaimer's priority
> + *
> + * This function should be called from the reclaim path every time when
> + * the vmscan's reclaiming priority (scanning depth) changes.
> + *
> + * This function does not return any value.
> + */
> +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> +{
> +	/*
> +	 * We only use prio for accounting critical level. For more info
> +	 * see comment for vmpressure_level_critical_prio variable above.
> +	 */
> +	if (prio > vmpressure_level_critical_prio)
> +		return;
> +
> +	/*
> +	 * OK, the prio is below the threshold, updating vmpressure
> +	 * information before shrinker dives into long shrinking of long
> +	 * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
> +	 * to the vmpressure() basically means that we signal 'critical'
> +	 * level.
> +	 */
> +	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
> +}
> +
> +#define MAX_VMPRESSURE_ARGS_LEN	(strlen("critical") + strlen("hierarchy") + 2)
> +
> +/**
> + * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
> + * @memcg:	memcg that is interested in vmpressure notifications
> + * @eventfd:	eventfd context to link notifications with
> + * @args:	event arguments (pressure level threshold, optional mode)
> + *
> + * This function associates eventfd context with the vmpressure
> + * infrastructure, so that the notifications will be delivered to the
> + * @eventfd. The @args parameter is a comma-delimited string that denotes a
> + * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
> + * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
> + * "hierarchy" or "local").
> + *
> + * To be used as memcg event method.
> + *
> + * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
> + * not be parsed.
> + */
> +int vmpressure_register_event(struct mem_cgroup *memcg,
> +			      struct eventfd_ctx *eventfd, const char *args)
> +{
> +	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> +	struct vmpressure_event *ev;
> +	enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
> +	enum vmpressure_levels level;
> +	char *spec, *spec_orig;
> +	char *token;
> +	int ret = 0;
> +
> +	spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
> +	if (!spec)
> +		return -ENOMEM;
> +
> +	/* Find required level */
> +	token = strsep(&spec, ",");
> +	ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
> +	if (ret < 0)
> +		goto out;
> +	level = ret;
> +
> +	/* Find optional mode */
> +	token = strsep(&spec, ",");
> +	if (token) {
> +		ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
> +		if (ret < 0)
> +			goto out;
> +		mode = ret;
> +	}
> +
> +	ev = kzalloc_obj(*ev);
> +	if (!ev) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ev->efd = eventfd;
> +	ev->level = level;
> +	ev->mode = mode;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_add(&ev->node, &vmpr->events);
> +	mutex_unlock(&vmpr->events_lock);
> +	ret = 0;
> +out:
> +	kfree(spec_orig);
> +	return ret;
> +}
> +
> +/**
> + * vmpressure_unregister_event() - Unbind eventfd from vmpressure
> + * @memcg:	memcg handle
> + * @eventfd:	eventfd context that was used to link vmpressure with the @cg
> + *
> + * This function does internal manipulations to detach the @eventfd from
> + * the vmpressure notifications, and then frees internal resources
> + * associated with the @eventfd (but the @eventfd itself is not freed).
> + *
> + * To be used as memcg event method.
> + */
> +void vmpressure_unregister_event(struct mem_cgroup *memcg,
> +				 struct eventfd_ctx *eventfd)
> +{
> +	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> +	struct vmpressure_event *ev;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (ev->efd != eventfd)
> +			continue;
> +		list_del(&ev->node);
> +		kfree(ev);
> +		break;
> +	}
> +	mutex_unlock(&vmpr->events_lock);
> +}
> +
>  static DEFINE_MUTEX(memcg_max_mutex);
>  
>  static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index c82cee1ab43b..14470141bbe6 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -7,16 +7,15 @@
>   *
>   * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
>   * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * Tree-mode (cgroup v1 userspace eventfd) bookkeeping lives in
> + * mm/memcontrol-v1.c; this file holds the shared code and the in-kernel
> + * (tree=false) socket-pressure path that runs on cgroup v2.
>   */
>  
>  #include <linux/cgroup.h>
> -#include <linux/fs.h>
>  #include <linux/log2.h>
> -#include <linux/sched.h>
>  #include <linux/mm.h>
> -#include <linux/vmstat.h>
> -#include <linux/eventfd.h>
> -#include <linux/slab.h>
>  #include <linux/swap.h>
>  #include <linux/printk.h>
>  #include <linux/vmpressure.h>
> @@ -35,7 +34,7 @@
>   * TODO: Make the window size depend on machine size, as we do for vmstat
>   * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
>   */
> -static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
>  
>  /*
>   * These thresholds are used when we account memory pressure through
> @@ -46,68 +45,6 @@ static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
>  static const unsigned int vmpressure_level_med = 60;
>  static const unsigned int vmpressure_level_critical = 95;
>  
> -/*
> - * When there are too little pages left to scan, vmpressure() may miss the
> - * critical pressure as number of pages will be less than "window size".
> - * However, in that case the vmscan priority will raise fast as the
> - * reclaimer will try to scan LRUs more deeply.
> - *
> - * The vmscan logic considers these special priorities:
> - *
> - * prio == DEF_PRIORITY (12): reclaimer starts with that value
> - * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
> - * prio == 0                : close to OOM, kernel scans every page in an lru
> - *
> - * Any value in this range is acceptable for this tunable (i.e. from 12 to
> - * 0). Current value for the vmpressure_level_critical_prio is chosen
> - * empirically, but the number, in essence, means that we consider
> - * critical level when scanning depth is ~10% of the lru size (vmscan
> - * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
> - * eights).
> - */
> -static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
> -
> -static struct vmpressure *work_to_vmpressure(struct work_struct *work)
> -{
> -	return container_of(work, struct vmpressure, work);
> -}
> -
> -static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> -{
> -	struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
> -
> -	memcg = parent_mem_cgroup(memcg);
> -	if (!memcg)
> -		return NULL;
> -	return memcg_to_vmpressure(memcg);
> -}
> -
> -enum vmpressure_levels {
> -	VMPRESSURE_LOW = 0,
> -	VMPRESSURE_MEDIUM,
> -	VMPRESSURE_CRITICAL,
> -	VMPRESSURE_NUM_LEVELS,
> -};
> -
> -enum vmpressure_modes {
> -	VMPRESSURE_NO_PASSTHROUGH = 0,
> -	VMPRESSURE_HIERARCHY,
> -	VMPRESSURE_LOCAL,
> -	VMPRESSURE_NUM_MODES,
> -};
> -
> -static const char * const vmpressure_str_levels[] = {
> -	[VMPRESSURE_LOW] = "low",
> -	[VMPRESSURE_MEDIUM] = "medium",
> -	[VMPRESSURE_CRITICAL] = "critical",
> -};
> -
> -static const char * const vmpressure_str_modes[] = {
> -	[VMPRESSURE_NO_PASSTHROUGH] = "default",
> -	[VMPRESSURE_HIERARCHY] = "hierarchy",
> -	[VMPRESSURE_LOCAL] = "local",
> -};
> -
>  static enum vmpressure_levels vmpressure_level(unsigned long pressure)
>  {
>  	if (pressure >= vmpressure_level_critical)
> @@ -117,8 +54,8 @@ static enum vmpressure_levels vmpressure_level(unsigned long pressure)
>  	return VMPRESSURE_LOW;
>  }
>  
> -static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> -						    unsigned long reclaimed)
> +enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> +					     unsigned long reclaimed)
>  {
>  	unsigned long scale = scanned + reclaimed;
>  	unsigned long pressure = 0;
> @@ -147,74 +84,6 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>  	return vmpressure_level(pressure);
>  }
>  
> -struct vmpressure_event {
> -	struct eventfd_ctx *efd;
> -	enum vmpressure_levels level;
> -	enum vmpressure_modes mode;
> -	struct list_head node;
> -};
> -
> -static bool vmpressure_event(struct vmpressure *vmpr,
> -			     const enum vmpressure_levels level,
> -			     bool ancestor, bool signalled)
> -{
> -	struct vmpressure_event *ev;
> -	bool ret = false;
> -
> -	mutex_lock(&vmpr->events_lock);
> -	list_for_each_entry(ev, &vmpr->events, node) {
> -		if (ancestor && ev->mode == VMPRESSURE_LOCAL)
> -			continue;
> -		if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
> -			continue;
> -		if (level < ev->level)
> -			continue;
> -		eventfd_signal(ev->efd);
> -		ret = true;
> -	}
> -	mutex_unlock(&vmpr->events_lock);
> -
> -	return ret;
> -}
> -
> -static void vmpressure_work_fn(struct work_struct *work)
> -{
> -	struct vmpressure *vmpr = work_to_vmpressure(work);
> -	unsigned long scanned;
> -	unsigned long reclaimed;
> -	enum vmpressure_levels level;
> -	bool ancestor = false;
> -	bool signalled = false;
> -
> -	spin_lock(&vmpr->sr_lock);
> -	/*
> -	 * Several contexts might be calling vmpressure(), so it is
> -	 * possible that the work was rescheduled again before the old
> -	 * work context cleared the counters. In that case we will run
> -	 * just after the old work returns, but then scanned might be zero
> -	 * here. No need for any locks here since we don't care if
> -	 * vmpr->reclaimed is in sync.
> -	 */
> -	scanned = vmpr->tree_scanned;
> -	if (!scanned) {
> -		spin_unlock(&vmpr->sr_lock);
> -		return;
> -	}
> -
> -	reclaimed = vmpr->tree_reclaimed;
> -	vmpr->tree_scanned = 0;
> -	vmpr->tree_reclaimed = 0;
> -	spin_unlock(&vmpr->sr_lock);
> -
> -	level = vmpressure_calc_level(scanned, reclaimed);
> -
> -	do {
> -		if (vmpressure_event(vmpr, level, ancestor, signalled))
> -			signalled = true;
> -		ancestor = true;
> -	} while ((vmpr = vmpressure_parent(vmpr)));
> -}
> -
>  /**
>   * vmpressure() - Account memory pressure through scanned/reclaimed ratio
>   * @gfp:	reclaimer's gfp mask
> @@ -283,14 +152,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  		return;
>  
>  	if (tree) {
> -		spin_lock(&vmpr->sr_lock);
> -		scanned = vmpr->tree_scanned += scanned;
> -		vmpr->tree_reclaimed += reclaimed;
> -		spin_unlock(&vmpr->sr_lock);
> -
> -		if (scanned < vmpressure_win)
> -			return;
> -		schedule_work(&vmpr->work);
> +		vmpressure_v1_account_tree(vmpr, scanned, reclaimed);
>  	} else {
>  		enum vmpressure_levels level;
>  
> @@ -332,134 +194,6 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  	}
>  }
>  
> -/**
> - * vmpressure_prio() - Account memory pressure through reclaimer priority level
> - * @gfp:	reclaimer's gfp mask
> - * @memcg:	cgroup memory controller handle
> - * @prio:	reclaimer's priority
> - *
> - * This function should be called from the reclaim path every time when
> - * the vmscan's reclaiming priority (scanning depth) changes.
> - *
> - * This function does not return any value.
> - */
> -void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> -{
> -	/*
> -	 * We only use prio for accounting critical level. For more info
> -	 * see comment for vmpressure_level_critical_prio variable above.
> -	 */
> -	if (prio > vmpressure_level_critical_prio)
> -		return;
> -
> -	/*
> -	 * OK, the prio is below the threshold, updating vmpressure
> -	 * information before shrinker dives into long shrinking of long
> -	 * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
> -	 * to the vmpressure() basically means that we signal 'critical'
> -	 * level.
> -	 */
> -	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
> -}
> -
> -#define MAX_VMPRESSURE_ARGS_LEN	(strlen("critical") + strlen("hierarchy") + 2)
> -
> -/**
> - * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
> - * @memcg:	memcg that is interested in vmpressure notifications
> - * @eventfd:	eventfd context to link notifications with
> - * @args:	event arguments (pressure level threshold, optional mode)
> - *
> - * This function associates eventfd context with the vmpressure
> - * infrastructure, so that the notifications will be delivered to the
> - * @eventfd. The @args parameter is a comma-delimited string that denotes a
> - * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
> - * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
> - * "hierarchy" or "local").
> - *
> - * To be used as memcg event method.
> - *
> - * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
> - * not be parsed.
> - */
> -int vmpressure_register_event(struct mem_cgroup *memcg,
> -			      struct eventfd_ctx *eventfd, const char *args)
> -{
> -	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> -	struct vmpressure_event *ev;
> -	enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
> -	enum vmpressure_levels level;
> -	char *spec, *spec_orig;
> -	char *token;
> -	int ret = 0;
> -
> -	spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
> -	if (!spec)
> -		return -ENOMEM;
> -
> -	/* Find required level */
> -	token = strsep(&spec, ",");
> -	ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
> -	if (ret < 0)
> -		goto out;
> -	level = ret;
> -
> -	/* Find optional mode */
> -	token = strsep(&spec, ",");
> -	if (token) {
> -		ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
> -		if (ret < 0)
> -			goto out;
> -		mode = ret;
> -	}
> -
> -	ev = kzalloc_obj(*ev);
> -	if (!ev) {
> -		ret = -ENOMEM;
> -		goto out;
> -	}
> -
> -	ev->efd = eventfd;
> -	ev->level = level;
> -	ev->mode = mode;
> -
> -	mutex_lock(&vmpr->events_lock);
> -	list_add(&ev->node, &vmpr->events);
> -	mutex_unlock(&vmpr->events_lock);
> -	ret = 0;
> -out:
> -	kfree(spec_orig);
> -	return ret;
> -}
> -
> -/**
> - * vmpressure_unregister_event() - Unbind eventfd from vmpressure
> - * @memcg:	memcg handle
> - * @eventfd:	eventfd context that was used to link vmpressure with the @cg
> - *
> - * This function does internal manipulations to detach the @eventfd from
> - * the vmpressure notifications, and then frees internal resources
> - * associated with the @eventfd (but the @eventfd itself is not freed).
> - *
> - * To be used as memcg event method.
> - */
> -void vmpressure_unregister_event(struct mem_cgroup *memcg,
> -				 struct eventfd_ctx *eventfd)
> -{
> -	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> -	struct vmpressure_event *ev;
> -
> -	mutex_lock(&vmpr->events_lock);
> -	list_for_each_entry(ev, &vmpr->events, node) {
> -		if (ev->efd != eventfd)
> -			continue;
> -		list_del(&ev->node);
> -		kfree(ev);
> -		break;
> -	}
> -	mutex_unlock(&vmpr->events_lock);
> -}
> -
>  /**
>   * vmpressure_init() - Initialize vmpressure control structure
>   * @vmpr:	Structure to be initialized
> @@ -470,9 +204,7 @@ void vmpressure_unregister_event(struct mem_cgroup *memcg,
>  void vmpressure_init(struct vmpressure *vmpr)
>  {
>  	spin_lock_init(&vmpr->sr_lock);
> -	mutex_init(&vmpr->events_lock);
> -	INIT_LIST_HEAD(&vmpr->events);
> -	INIT_WORK(&vmpr->work, vmpressure_work_fn);
> +	vmpressure_v1_init(vmpr);
>  }
>  
>  /**
> @@ -484,9 +216,5 @@ void vmpressure_init(struct vmpressure *vmpr)
>   */
>  void vmpressure_cleanup(struct vmpressure *vmpr)
>  {
> -	/*
> -	 * Make sure there is no pending work before eventfd infrastructure
> -	 * goes away.
> -	 */
> -	flush_work(&vmpr->work);
> +	vmpressure_v1_cleanup(vmpr);
>  }
> -- 
> 2.53.0-Meta
> 
> 



* Re: [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
  2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
  2026-06-30 12:32   ` Usama Arif
@ 2026-06-30 14:21   ` Shakeel Butt
  1 sibling, 0 replies; 7+ messages in thread
From: Shakeel Butt @ 2026-06-30 14:21 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, tj, mkoutny,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Tue, Jun 30, 2026 at 04:23:33AM -0700, Usama Arif wrote:
> Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
> interface from the shared and v2 in-kernel code.
> 
> Currently, almost half of mm/vmpressure.c exists to serve tree=true:
> struct vmpressure_event, the events list and its mutex, the work_struct
> and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
> parent walk, vmpressure_event(), vmpressure_register_event(),
> vmpressure_unregister_event(), and vmpressure_prio() (which always
> calls vmpressure() with tree=true).
> 
> Move it all into mm/memcontrol-v1.c (built only when CONFIG_MEMCG_V1=y)
> as a single contiguous block, following the per-component layout already
> used by that file. Keeping the v1 vmpressure code with the rest of the
> deprecated cgroup v1 memory controller makes the full footprint of the
> CONFIG_MEMCG_V1 option easy to see in one place, which matters more
> than component-level file separation for code that has no active
> development.
> 
> vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
> the runtime hierarchy check, the tree=false body, init/cleanup
> plumbing) and calls into three small v1 hooks for the tree=true
> accumulator and the v1 portions of init/cleanup. The hooks have
> static-inline no-op stubs in include/linux/vmpressure.h for the
> !MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
> the same treatment, which means vmscan.c's call site disappears at
> compile time on v2-only kernels.
> 
> The only #ifdef CONFIG_MEMCG_V1 in source remains around the v1-only
> fields inside struct vmpressure itself.
> 
> Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):
> 
>   struct vmpressure :  112B ->   24B
>   struct mem_cgroup : 1664B -> 1536B
> 
> This split is the first step toward eventually making vmpressure
> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
> (tree=false) cannot be removed today immediately: PSI is not an
> exact replacement for vmpressure, and switching networking socket-buffer
> back-off to PSI may regress networking performance or increase memory
> pressure in workloads that today rely on vmpressure's hysteresis. The
> medium-term plan is to introduce a PSI-based socket-pressure path, keep
> vmpressure available for v2 behind a defconfig as an opt-out for several
> releases, and only then drop the tree=false path entirely, at which point
> everything that remains of the vmpressure block in mm/memcontrol-v1.c is
> the whole subsystem.
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
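
For readers following along, the stub arrangement described in the quoted
commit message can be sketched in plain userspace C roughly as below. This
is only an illustration under stated assumptions: DEMO_MEMCG_V1 stands in
for CONFIG_MEMCG_V1, and every demo_* name and field is hypothetical, not
taken from the patch. The point is that the caller invokes the hook
unconditionally, and the hook collapses to an empty inline function when
the v1 option is off.

```c
#include <assert.h>

/* Hypothetical stand-in for CONFIG_MEMCG_V1 (assumption, not the real Kconfig). */
#ifndef DEMO_MEMCG_V1
#define DEMO_MEMCG_V1 0
#endif

/* Hypothetical structure: only the v1-only field sits behind the #if. */
struct demo_vmpressure {
	int sr_lock_ready;	/* shared state, always present */
#if DEMO_MEMCG_V1
	int events_ready;	/* v1-only state, compiled out otherwise */
#endif
};

#if DEMO_MEMCG_V1
/* Real hook: would live next to the rest of the v1-only code. */
void demo_v1_init(struct demo_vmpressure *vmpr)
{
	vmpr->events_ready = 1;
}
#else
/* No-op stub: lets callers stay free of #ifdefs. */
static inline void demo_v1_init(struct demo_vmpressure *vmpr)
{
	(void)vmpr;
}
#endif

/* Shared init path: calls the hook without knowing how it was built. */
void demo_vmpressure_init(struct demo_vmpressure *vmpr)
{
	vmpr->sr_lock_ready = 1;
	demo_v1_init(vmpr);
}

/* Small self-check that the shared part is initialised in either build. */
void demo_selftest(void)
{
	struct demo_vmpressure v = { 0 };

	demo_vmpressure_init(&v);
	assert(v.sr_lock_ready == 1);
}
```

Building the same file with -DDEMO_MEMCG_V1=1 switches to the real hook
and makes the structure larger, which is the same mechanism behind the
pahole numbers quoted above.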




* Re: [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
  2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
@ 2026-06-30 16:07   ` Johannes Weiner
  2026-06-30 16:30     ` Usama Arif
  0 siblings, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2026-06-30 16:07 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, tj, mkoutny, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team

On Tue, Jun 30, 2026 at 04:23:32AM -0700, Usama Arif wrote:
> vmpressure() has two outputs gated by the @tree argument:
> 
>   @tree=false drives in-kernel socket pressure (mem_cgroup_set_
>               socket_pressure), consumed by TCP/SCTP. This only
>               applies on cgroup v2; on v1 socket memory is charged
>               separately via tcpmem and the consumer reads
>               memcg->tcpmem_pressure instead.
> 
>   @tree=true  drives userspace eventfd notifications via the v1
>               memory.pressure_level / cgroup.event_control interface.
>               v2 has no equivalent: userspace gets reclaim signals
>               through memory.pressure (PSI), which does not touch
>               vmpressure.
> 
> The existing early return covered v1 + @tree=false. The symmetric
> v2 + @tree=true case was falling through and doing the full lock /
> accumulate / schedule_work / parent-walk dance for an events list
> that can never be populated. bpftrace on a 176-core production host
> (cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
> ~16,200 @tree=true vmpressure() calls per minute. Add an early return
> that skips cgroup v2 + tree = true which avoids us doing all this work.
> On a v2-only host this also eliminates a lock contention path that can
> serialise reclaimers on a single global sr_lock.
> 
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>  mm/vmpressure.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index f053554e5826..c82cee1ab43b 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  		return;
>  
>  	/*
> -	 * The in-kernel users only care about the reclaim efficiency
> -	 * for this @memcg rather than the whole subtree, and there
> -	 * isn't and won't be any in-kernel user in a legacy cgroup.
> +	 * Only two combinations have a consumer:
> +	 *   cgroup v2 + tree=false -> in-kernel socket pressure
> +	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
> +	 * Skip the other two: nothing consumes the result.
>  	 */
> -	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
> +	if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
> +	    (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
>  		return;

I had already acked this one, with a half serious suggestion to make
this

	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) == tree)
		return;

Anyway, no strong feelings. If nobody agrees,

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
  2026-06-30 16:07   ` Johannes Weiner
@ 2026-06-30 16:30     ` Usama Arif
  0 siblings, 0 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 16:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, david, linux-mm, tj, mkoutny, shakeel.butt,
	roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
	vbabka, kernel-team



On 30/06/2026 17:07, Johannes Weiner wrote:
> On Tue, Jun 30, 2026 at 04:23:32AM -0700, Usama Arif wrote:
>> vmpressure() has two outputs gated by the @tree argument:
>>
>>   @tree=false drives in-kernel socket pressure (mem_cgroup_set_
>>               socket_pressure), consumed by TCP/SCTP. This only
>>               applies on cgroup v2; on v1 socket memory is charged
>>               separately via tcpmem and the consumer reads
>>               memcg->tcpmem_pressure instead.
>>
>>   @tree=true  drives userspace eventfd notifications via the v1
>>               memory.pressure_level / cgroup.event_control interface.
>>               v2 has no equivalent: userspace gets reclaim signals
>>               through memory.pressure (PSI), which does not touch
>>               vmpressure.
>>
>> The existing early return covered v1 + @tree=false. The symmetric
>> v2 + @tree=true case was falling through and doing the full lock /
>> accumulate / schedule_work / parent-walk dance for an events list
>> that can never be populated. bpftrace on a 176-core production host
>> (cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
>> ~16,200 @tree=true vmpressure() calls per minute. Add an early return
>> that skips cgroup v2 + tree = true which avoids us doing all this work.
>> On a v2-only host this also eliminates a lock contention path that can
>> serialise reclaimers on a single global sr_lock.
>>
>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>  mm/vmpressure.c | 10 ++++++----
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index f053554e5826..c82cee1ab43b 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
>> @@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>>  		return;
>>  
>>  	/*
>> -	 * The in-kernel users only care about the reclaim efficiency
>> -	 * for this @memcg rather than the whole subtree, and there
>> -	 * isn't and won't be any in-kernel user in a legacy cgroup.
>> +	 * Only two combinations have a consumer:
>> +	 *   cgroup v2 + tree=false -> in-kernel socket pressure
>> +	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
>> +	 * Skip the other two: nothing consumes the result.
>>  	 */
>> -	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
>> +	if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
>> +	    (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
>>  		return;
> 
> I had already acked this one, with a half serious suggestion to make
> this
> 
> 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) == tree)
> 		return;
> 
> Anyway, no strong feelings. If nobody agrees,
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Yeah sorry about this! I just amended my last patch to move code
from vmpressure-v1.c to memcontrol-v1.c and just sent it, without
other changes. Forgot Shakeel's ack on v2 as well :(

Andrew, would you mind applying the below fixlet? I can also respin
if it's easier. Thanks!!

From 969c19da782bbcd77ae4b9e94d3a9e1d78c198d7 Mon Sep 17 00:00:00 2001
From: Usama Arif <usama.arif@linux.dev>
Date: Tue, 30 Jun 2026 09:25:05 -0700
Subject: [fixlet] mm/vmpressure: skip tree=true accounting on cgroup v2

Simplify the guard. Both cgroup_subsys_on_dfl() and tree are bool, so
the two combinations that have no consumer (v1 + tree=false, v2 +
tree=true) are exactly the cases where dfl == tree.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/vmpressure.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 14470141bbe6..9629240d77ad 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -120,8 +120,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
 	 * Skip the other two: nothing consumes the result.
 	 */
-	if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
-	    (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) == tree)
 		return;
 
 	vmpr = memcg_to_vmpressure(memcg);
-- 
2.53.0-Meta
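
The equivalence the fixlet relies on can be checked exhaustively, since
there are only four input combinations. The following is a minimal
userspace sketch, not kernel code: on_dfl stands in for the result of
cgroup_subsys_on_dfl(memory_cgrp_subsys), and the function names are
hypothetical. It assumes, as the fixlet's commit message states, that
both operands are bool.

```c
#include <assert.h>
#include <stdbool.h>

/* v3 guard: skip (v1, tree=false) and (v2, tree=true). */
bool skip_long_form(bool on_dfl, bool tree)
{
	return (!on_dfl && !tree) || (on_dfl && tree);
}

/* Fixlet guard: the same two cases are exactly those where the flags match. */
bool skip_short_form(bool on_dfl, bool tree)
{
	return on_dfl == tree;
}

/* Compare both forms over all four (on_dfl, tree) combinations. */
void verify_guards(void)
{
	for (int d = 0; d <= 1; d++)
		for (int t = 0; t <= 1; t++)
			assert(skip_long_form(d, t) == skip_short_form(d, t));
}
```

Note that the comparison is only safe because both sides are genuinely
bool; with plain int flags, two different non-zero values would compare
unequal even though both mean "true".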








