* [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2
@ 2026-06-30 11:23 Usama Arif
2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
0 siblings, 2 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 11:23 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
Usama Arif
The vmpressure subsystem has two distinct consumers, gated by the
@tree argument:
  tree=false : in-kernel socket pressure, consumed by TCP/SCTP. This
               is cgroup v2 only; v1 sockets read memcg->tcpmem_pressure
               instead.
  tree=true  : cgroup v1 userspace eventfd notifications via the
               memory.pressure_level / cgroup.event_control interface.
               v2 has no equivalent (userspace gets reclaim signals
               through memory.pressure / PSI, which doesn't touch
               vmpressure).
So of the four (hierarchy, tree) combinations, only two carry data
that anyone reads. The existing early return in vmpressure() covered
v1 + tree=false; the symmetric v2 + tree=true case was falling through
and doing the full lock / accumulate / schedule_work / parent-walk
dance, even though the events list it eventually iterates is empty
on cgroup v2 (vmpressure_register_event() is wired up only through the
v1 cftype "memory.pressure_level" and can't be reached from a v2
memcg).
Patch 1 extends the existing early return to also skip v2 + tree=true.
On a v2-only host this eliminates a contended path where reclaimers
can serialize on a single global sr_lock. bpftrace on a 176-core production
host (cgroup v2, 285 memcgs, sustained reclaim) showed ~16,200 such calls
per minute with tree=true.
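For reference, the guard can be modelled in a few lines of userspace C.
This is only a sketch: on_dfl stands in for
cgroup_subsys_on_dfl(memory_cgrp_subsys), and the function names are made
up for illustration and do not exist in the kernel. It checks that the
two-clause guard posted in patch 1 and the compact equality form
suggested in review skip the same combinations, and that exactly two of
the four combinations have a consumer.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * on_dfl models cgroup_subsys_on_dfl(memory_cgrp_subsys): true on cgroup v2,
 * false on cgroup v1. All names here are hypothetical.
 */

/* The two-clause guard as posted in patch 1. */
bool skip_two_clause(bool on_dfl, bool tree)
{
	return (!on_dfl && !tree) || (on_dfl && tree);
}

/* Compact form: both operands are bool, so equality covers both clauses. */
bool skip_compact(bool on_dfl, bool tree)
{
	return on_dfl == tree;
}

/* Count the (hierarchy, tree) combinations that are not skipped. */
int count_consumed(void)
{
	int consumed = 0;

	for (int dfl = 0; dfl <= 1; dfl++) {
		for (int tree = 0; tree <= 1; tree++) {
			/* Both guards must agree on every combination. */
			assert(skip_two_clause(dfl, tree) ==
			       skip_compact(dfl, tree));
			if (!skip_compact(dfl, tree))
				consumed++;
		}
	}
	return consumed;
}
```

The combinations that survive the guard are v2 + tree=false (socket
pressure) and v1 + tree=true (eventfds); everything else returns early.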
Patch 2 follows up with a cleanup: it splits the v1 userspace eventfd
interface (struct vmpressure_event, the events list and its mutex, the
work_struct and its handler, the parent walk,
vmpressure_register_event / unregister_event, and vmpressure_prio)
into a new mm/memcontrol-v1.c built only when CONFIG_MEMCG_V1=y,
behind small no-op stubs in the header. mm/vmpressure.c keeps the
shared bits and the tree=false socket-pressure path. vmpressure.c
shrinks to roughly half its size and the code becomes much simpler.
The only #ifdef CONFIG_MEMCG_V1 remaining in source is around the
v1-only fields inside struct vmpressure itself. Memory savings on
CONFIG_MEMCG_V1=n:
struct vmpressure : 112B -> 24B
struct mem_cgroup : 1664B -> 1536B
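The stub pattern behind those numbers can be sketched as follows. This is
a simplified userspace model, not the kernel code: SKETCH_MEMCG_V1 is a
hypothetical stand-in for CONFIG_MEMCG_V1, the struct carries only the
counters, and locking and the work item are left out. With the macro
undefined, the v1-only fields vanish from the struct and the hook
compiles to an empty inline, so the caller needs no #ifdef.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Define SKETCH_MEMCG_V1 to model a CONFIG_MEMCG_V1=y build (hypothetical). */
/* #define SKETCH_MEMCG_V1 1 */

struct sketch_vmpressure {
	unsigned long scanned;
	unsigned long reclaimed;
#ifdef SKETCH_MEMCG_V1
	/* tree=true accumulators, only present on v1-capable builds. */
	unsigned long tree_scanned;
	unsigned long tree_reclaimed;
#endif
};

#ifdef SKETCH_MEMCG_V1
static inline void sketch_v1_account_tree(struct sketch_vmpressure *vmpr,
					  unsigned long scanned,
					  unsigned long reclaimed)
{
	vmpr->tree_scanned += scanned;
	vmpr->tree_reclaimed += reclaimed;
}
#else
/* No-op stub: the call site below disappears at compile time. */
static inline void sketch_v1_account_tree(struct sketch_vmpressure *vmpr,
					  unsigned long scanned,
					  unsigned long reclaimed) {}
#endif

/* Shared caller: no #ifdef needed thanks to the stub above. */
void sketch_account(struct sketch_vmpressure *vmpr, bool tree,
		    unsigned long scanned, unsigned long reclaimed)
{
	assert(vmpr != NULL);

	if (tree) {
		sketch_v1_account_tree(vmpr, scanned, reclaimed);
		return;
	}
	vmpr->scanned += scanned;
	vmpr->reclaimed += reclaimed;
}
```

The real struct also drops the events list, its mutex and the
work_struct, which is where most of the saving comes from.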
This split is the first step toward eventually making vmpressure
CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
(tree=false) cannot be removed immediately: PSI is not an
exact replacement for vmpressure, and switching networking socket-buffer
back-off to PSI may regress networking performance or increase memory
pressure in workloads that today rely on vmpressure's hysteresis. The
medium-term plan is to introduce a PSI-based socket-pressure path, keep
vmpressure available for v2 behind a defconfig as an opt-out for several
releases, and only then drop the tree=false path entirely, at which point
everything that remains in mm/memcontrol-v1.c is the whole subsystem.
---
v2 -> v3: https://lore.kernel.org/all/20260629130042.2649505-1-usama.arif@linux.dev/
- Move the cgroup v1 code into memcontrol-v1.c instead of creating a new
file (Johannes)
v1 -> v2: https://lore.kernel.org/all/20260606114158.3126210-1-usama.arif@linux.dev/
- Add more in commit message about future plans of vmpressure for cgroup v2
(Shakeel)
- Remove unnecessary return statement in vmpressure for v1 only tree path
(Michal)
- Rebased onto latest mm-new
Usama Arif (2):
mm/vmpressure: skip tree=true accounting on cgroup v2
mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
include/linux/vmpressure.h | 46 +++++-
mm/memcontrol-v1.c | 292 +++++++++++++++++++++++++++++++++++
mm/vmpressure.c | 302 ++-----------------------------------
3 files changed, 349 insertions(+), 291 deletions(-)
--
2.53.0-Meta
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
2026-06-30 11:23 [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif
@ 2026-06-30 11:23 ` Usama Arif
2026-06-30 16:07 ` Johannes Weiner
2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
1 sibling, 1 reply; 7+ messages in thread
From: Usama Arif @ 2026-06-30 11:23 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam,
linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team,
Usama Arif

vmpressure() has two outputs gated by the @tree argument:

  @tree=false  drives in-kernel socket pressure
               (mem_cgroup_set_socket_pressure), consumed by TCP/SCTP.
               This only applies on cgroup v2; on v1 socket memory is
               charged separately via tcpmem and the consumer reads
               memcg->tcpmem_pressure instead.

  @tree=true   drives userspace eventfd notifications via the v1
               memory.pressure_level / cgroup.event_control interface.
               v2 has no equivalent: userspace gets reclaim signals
               through memory.pressure (PSI), which does not touch
               vmpressure.

The existing early return covered v1 + @tree=false. The symmetric
v2 + @tree=true case was falling through and doing the full lock /
accumulate / schedule_work / parent-walk dance for an events list
that can never be populated. bpftrace on a 176-core production host
(cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
~16,200 @tree=true vmpressure() calls per minute. Add an early return
that skips cgroup v2 + tree = true which avoids us doing all this work.
On a v2-only host this also eliminates a lock contention path that can
serialise reclaimers on a single global sr_lock.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/vmpressure.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index f053554e5826..c82cee1ab43b 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 		return;
 
 	/*
-	 * The in-kernel users only care about the reclaim efficiency
-	 * for this @memcg rather than the whole subtree, and there
-	 * isn't and won't be any in-kernel user in a legacy cgroup.
+	 * Only two combinations have a consumer:
+	 *   cgroup v2 + tree=false -> in-kernel socket pressure
+	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
+	 * Skip the other two: nothing consumes the result.
 	 */
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
+	if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
+	    (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
 		return;
 
 	vmpr = memcg_to_vmpressure(memcg);
-- 
2.53.0-Meta
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
@ 2026-06-30 16:07 ` Johannes Weiner
2026-06-30 16:30 ` Usama Arif
0 siblings, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2026-06-30 16:07 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, tj, mkoutny, shakeel.butt,
roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb, vbabka,
kernel-team

On Tue, Jun 30, 2026 at 04:23:32AM -0700, Usama Arif wrote:
> vmpressure() has two outputs gated by the @tree argument:
>
>   @tree=false  drives in-kernel socket pressure
>                (mem_cgroup_set_socket_pressure), consumed by TCP/SCTP.
>                This only applies on cgroup v2; on v1 socket memory is
>                charged separately via tcpmem and the consumer reads
>                memcg->tcpmem_pressure instead.
>
>   @tree=true   drives userspace eventfd notifications via the v1
>                memory.pressure_level / cgroup.event_control interface.
>                v2 has no equivalent: userspace gets reclaim signals
>                through memory.pressure (PSI), which does not touch
>                vmpressure.
>
> The existing early return covered v1 + @tree=false. The symmetric
> v2 + @tree=true case was falling through and doing the full lock /
> accumulate / schedule_work / parent-walk dance for an events list
> that can never be populated. bpftrace on a 176-core production host
> (cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
> ~16,200 @tree=true vmpressure() calls per minute. Add an early return
> that skips cgroup v2 + tree = true which avoids us doing all this work.
> On a v2-only host this also eliminates a lock contention path that can
> serialise reclaimers on a single global sr_lock.
>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>  mm/vmpressure.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index f053554e5826..c82cee1ab43b 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  		return;
>  
>  	/*
> -	 * The in-kernel users only care about the reclaim efficiency
> -	 * for this @memcg rather than the whole subtree, and there
> -	 * isn't and won't be any in-kernel user in a legacy cgroup.
> +	 * Only two combinations have a consumer:
> +	 *   cgroup v2 + tree=false -> in-kernel socket pressure
> +	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
> +	 * Skip the other two: nothing consumes the result.
>  	 */
> -	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
> +	if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
> +	    (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
>  		return;

I had already acked this one, with a half serious suggestion to make
this

	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) == tree)
		return;

Anyway, no strong feelings. If nobody agrees,

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting on cgroup v2
2026-06-30 16:07 ` Johannes Weiner
@ 2026-06-30 16:30 ` Usama Arif
0 siblings, 0 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 16:30 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, david, linux-mm, tj, mkoutny, shakeel.butt,
roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb, vbabka,
kernel-team

On 30/06/2026 17:07, Johannes Weiner wrote:
> On Tue, Jun 30, 2026 at 04:23:32AM -0700, Usama Arif wrote:
>> vmpressure() has two outputs gated by the @tree argument:
>>
>>   @tree=false  drives in-kernel socket pressure
>>                (mem_cgroup_set_socket_pressure), consumed by TCP/SCTP.
>>                This only applies on cgroup v2; on v1 socket memory is
>>                charged separately via tcpmem and the consumer reads
>>                memcg->tcpmem_pressure instead.
>>
>>   @tree=true   drives userspace eventfd notifications via the v1
>>                memory.pressure_level / cgroup.event_control interface.
>>                v2 has no equivalent: userspace gets reclaim signals
>>                through memory.pressure (PSI), which does not touch
>>                vmpressure.
>>
>> The existing early return covered v1 + @tree=false. The symmetric
>> v2 + @tree=true case was falling through and doing the full lock /
>> accumulate / schedule_work / parent-walk dance for an events list
>> that can never be populated. bpftrace on a 176-core production host
>> (cgroup v2, CONFIG_MEMCG_V1=n, 285 memcgs, sustained reclaim) showed
>> ~16,200 @tree=true vmpressure() calls per minute. Add an early return
>> that skips cgroup v2 + tree = true which avoids us doing all this work.
>> On a v2-only host this also eliminates a lock contention path that can
>> serialise reclaimers on a single global sr_lock.
>>
>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>  mm/vmpressure.c | 10 ++++++----
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index f053554e5826..c82cee1ab43b 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
>> @@ -246,11 +246,13 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>>  		return;
>>  
>>  	/*
>> -	 * The in-kernel users only care about the reclaim efficiency
>> -	 * for this @memcg rather than the whole subtree, and there
>> -	 * isn't and won't be any in-kernel user in a legacy cgroup.
>> +	 * Only two combinations have a consumer:
>> +	 *   cgroup v2 + tree=false -> in-kernel socket pressure
>> +	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
>> +	 * Skip the other two: nothing consumes the result.
>>  	 */
>> -	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree)
>> +	if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
>> +	    (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
>>  		return;
>
> I had already acked this one, with a half serious suggestion to make
> this
>
> 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) == tree)
> 		return;
>
> Anyway, no strong feelings. If nobody agrees,
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Yeah sorry about this! I just amended my last patch to move code
from vmpressure-v1.c to memcontrol-v1.c and just sent it, without
other changes. Forgot Shakeel's ack on v2 as well :(

Andrew would you mind applying the below fixlet? I can also respin
if it's easier. Thanks!!

From 969c19da782bbcd77ae4b9e94d3a9e1d78c198d7 Mon Sep 17 00:00:00 2001
From: Usama Arif <usama.arif@linux.dev>
Date: Tue, 30 Jun 2026 09:25:05 -0700
Subject: [fixlet] mm/vmpressure: skip tree=true accounting on cgroup v2

Simplify the guard. Both cgroup_subsys_on_dfl() and tree are bool, so
the two combinations that have no consumer (v1 + tree=false,
v2 + tree=true) are exactly the cases where dfl == tree.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/vmpressure.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 14470141bbe6..9629240d77ad 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -120,8 +120,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 	 *   cgroup v1 + tree=true  -> userspace eventfds (memory.pressure_level)
 	 * Skip the other two: nothing consumes the result.
 	 */
-	if ((!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !tree) ||
-	    (cgroup_subsys_on_dfl(memory_cgrp_subsys) && tree))
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) == tree)
 		return;
 
 	vmpr = memcg_to_vmpressure(memcg);
-- 
2.53.0-Meta
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c 2026-06-30 11:23 [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif 2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif @ 2026-06-30 11:23 ` Usama Arif 2026-06-30 12:32 ` Usama Arif 2026-06-30 14:21 ` Shakeel Butt 1 sibling, 2 replies; 7+ messages in thread From: Usama Arif @ 2026-06-30 11:23 UTC (permalink / raw) To: Andrew Morton, david, linux-mm Cc: hannes, tj, mkoutny, shakeel.butt, roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb, vbabka, kernel-team, Usama Arif Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd interface from the shared and v2 in-kernel code. Currently, almost half of mm/vmpressure.c exists to serve tree=true: struct vmpressure_event, the events list and its mutex, the work_struct and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the parent walk, vmpressure_event(), vmpressure_register_event(), vmpressure_unregister_event(), and vmpressure_prio() (which always calls vmpressure() with tree=true). Move it all into mm/memcontrol-v1.c (built only when CONFIG_MEMCG_V1=y) as a single contiguous block, following the per-component layout already used by that file. Keeping the v1 vmpressure code with the rest of the deprecated cgroup v1 memory controller makes the full footprint of the CONFIG_MEMCG_V1 option easy to see in one place, which matters more than component-level file separation for code that has no active development. vmpressure.c keeps the shared bits (constants, vmpressure_calc_level, the runtime hierarchy check, the tree=false body, init/cleanup plumbing) and calls into three small v1 hooks for the tree=true accumulator and the v1 portions of init/cleanup. The hooks have static-inline no-op stubs in include/linux/vmpressure.h for the !MEMCG_V1 case, so callers don't need ifdefs. 
vmpressure_prio() gets the same treatment, which means vmscan.c's call site disappears at compile time on v2-only kernels. The only #ifdef CONFIG_MEMCG_V1 in source remains around the v1-only fields inside struct vmpressure itself. Memory savings on CONFIG_MEMCG_V1=n (measured with pahole): struct vmpressure : 112B -> 24B struct mem_cgroup : 1664B -> 1536B This split is the first step toward eventually making vmpressure CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path (tree=false) cannot be removed today immediately: PSI is not an exact replacement for vmpressure, and switching networking socket-buffer back-off to PSI may regress networking performance or increase memory pressure in workloads that today rely on vmpressure's hysteresis. The medium-term plan is to introduce a PSI-based socket-pressure path, keep vmpressure available for v2 behind a defconfig as an opt-out for several releases, and only then drop the tree=false path entirely, at which point everything that remains of the vmpressure block in mm/memcontrol-v1.c is the whole subsystem. Signed-off-by: Usama Arif <usama.arif@linux.dev> --- include/linux/vmpressure.h | 46 +++++- mm/memcontrol-v1.c | 292 +++++++++++++++++++++++++++++++++++++ mm/vmpressure.c | 292 ++----------------------------------- 3 files changed, 343 insertions(+), 287 deletions(-) diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h index faecd5522401..b4d13457bc2a 100644 --- a/include/linux/vmpressure.h +++ b/include/linux/vmpressure.h @@ -13,18 +13,31 @@ struct vmpressure { unsigned long scanned; unsigned long reclaimed; + /* The lock is used to keep the scanned/reclaimed in sync. */ + spinlock_t sr_lock; +#ifdef CONFIG_MEMCG_V1 + /* + * tree=true accumulators feed the v1 userspace eventfd interface + * (memory.pressure_level). Drained by @work. v2 has no equivalent + * interface, so this state is omitted on CONFIG_MEMCG_V1=n builds. 
+ */ unsigned long tree_scanned; unsigned long tree_reclaimed; - /* The lock is used to keep the scanned/reclaimed above in sync. */ - spinlock_t sr_lock; - /* The list of vmpressure_event structs. */ struct list_head events; /* Have to grab the lock on events traversal or modifications. */ struct mutex events_lock; struct work_struct work; +#endif +}; + +enum vmpressure_levels { + VMPRESSURE_LOW = 0, + VMPRESSURE_MEDIUM, + VMPRESSURE_CRITICAL, + VMPRESSURE_NUM_LEVELS, }; struct mem_cgroup; @@ -32,18 +45,41 @@ struct mem_cgroup; #ifdef CONFIG_MEMCG void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree, unsigned long scanned, unsigned long reclaimed); -extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio); - extern void vmpressure_init(struct vmpressure *vmpr); extern void vmpressure_cleanup(struct vmpressure *vmpr); extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg); extern struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr); + +/* Shared with the v1 vmpressure block in mm/memcontrol-v1.c. */ +extern const unsigned long vmpressure_win; +extern enum vmpressure_levels vmpressure_calc_level(unsigned long scanned, + unsigned long reclaimed); + +#ifdef CONFIG_MEMCG_V1 +extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio); extern int vmpressure_register_event(struct mem_cgroup *memcg, struct eventfd_ctx *eventfd, const char *args); extern void vmpressure_unregister_event(struct mem_cgroup *memcg, struct eventfd_ctx *eventfd); + +/* v1 hooks called from mm/vmpressure.c; no-ops below when !MEMCG_V1. 
*/ +extern void vmpressure_v1_init(struct vmpressure *vmpr); +extern void vmpressure_v1_cleanup(struct vmpressure *vmpr); +extern void vmpressure_v1_account_tree(struct vmpressure *vmpr, + unsigned long scanned, + unsigned long reclaimed); #else +static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, + int prio) {} +static inline void vmpressure_v1_init(struct vmpressure *vmpr) {} +static inline void vmpressure_v1_cleanup(struct vmpressure *vmpr) {} +static inline void vmpressure_v1_account_tree(struct vmpressure *vmpr, + unsigned long scanned, + unsigned long reclaimed) {} +#endif /* CONFIG_MEMCG_V1 */ + +#else /* !CONFIG_MEMCG */ static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree, unsigned long scanned, unsigned long reclaimed) {} diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 765069211567..135622b6172b 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -6,6 +6,7 @@ #include <linux/pagewalk.h> #include <linux/backing-dev.h> #include <linux/eventfd.h> +#include <linux/log2.h> #include <linux/poll.h> #include <linux/sort.h> #include <linux/file.h> @@ -1476,6 +1477,297 @@ void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked) mem_cgroup_oom_unlock(memcg); } +/* + * cgroup v1 userspace vmpressure interface (memory.pressure_level / + * cgroup.event_control). Kept here so v2-only kernels (CONFIG_MEMCG_V1=n) + * drop the whole eventfd accumulator, its work item, and the per-memcg + * state it requires. + * + * When there are too little pages left to scan, vmpressure() may miss the + * critical pressure as number of pages will be less than "window size". + * However, in that case the vmscan priority will raise fast as the + * reclaimer will try to scan LRUs more deeply. 
+ * + * The vmscan logic considers these special priorities: + * + * prio == DEF_PRIORITY (12): reclaimer starts with that value + * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed + * prio == 0 : close to OOM, kernel scans every page in an lru + * + * Any value in this range is acceptable for this tunable (i.e. from 12 to + * 0). Current value for the vmpressure_level_critical_prio is chosen + * empirically, but the number, in essence, means that we consider + * critical level when scanning depth is ~10% of the lru size (vmscan + * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one + * eights). + */ +static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10); + +enum vmpressure_modes { + VMPRESSURE_NO_PASSTHROUGH = 0, + VMPRESSURE_HIERARCHY, + VMPRESSURE_LOCAL, + VMPRESSURE_NUM_MODES, +}; + +static const char * const vmpressure_str_levels[] = { + [VMPRESSURE_LOW] = "low", + [VMPRESSURE_MEDIUM] = "medium", + [VMPRESSURE_CRITICAL] = "critical", +}; + +static const char * const vmpressure_str_modes[] = { + [VMPRESSURE_NO_PASSTHROUGH] = "default", + [VMPRESSURE_HIERARCHY] = "hierarchy", + [VMPRESSURE_LOCAL] = "local", +}; + +struct vmpressure_event { + struct eventfd_ctx *efd; + enum vmpressure_levels level; + enum vmpressure_modes mode; + struct list_head node; +}; + +static struct vmpressure *work_to_vmpressure(struct work_struct *work) +{ + return container_of(work, struct vmpressure, work); +} + +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr) +{ + struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr); + + memcg = parent_mem_cgroup(memcg); + if (!memcg) + return NULL; + return memcg_to_vmpressure(memcg); +} + +static bool vmpressure_event(struct vmpressure *vmpr, + const enum vmpressure_levels level, + bool ancestor, bool signalled) +{ + struct vmpressure_event *ev; + bool ret = false; + + mutex_lock(&vmpr->events_lock); + list_for_each_entry(ev, &vmpr->events, node) { + if (ancestor && ev->mode == 
VMPRESSURE_LOCAL) + continue; + if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH) + continue; + if (level < ev->level) + continue; + eventfd_signal(ev->efd); + ret = true; + } + mutex_unlock(&vmpr->events_lock); + + return ret; +} + +static void vmpressure_work_fn(struct work_struct *work) +{ + struct vmpressure *vmpr = work_to_vmpressure(work); + unsigned long scanned; + unsigned long reclaimed; + enum vmpressure_levels level; + bool ancestor = false; + bool signalled = false; + + spin_lock(&vmpr->sr_lock); + /* + * Several contexts might be calling vmpressure(), so it is + * possible that the work was rescheduled again before the old + * work context cleared the counters. In that case we will run + * just after the old work returns, but then scanned might be zero + * here. No need for any locks here since we don't care if + * vmpr->reclaimed is in sync. + */ + scanned = vmpr->tree_scanned; + if (!scanned) { + spin_unlock(&vmpr->sr_lock); + return; + } + + reclaimed = vmpr->tree_reclaimed; + vmpr->tree_scanned = 0; + vmpr->tree_reclaimed = 0; + spin_unlock(&vmpr->sr_lock); + + level = vmpressure_calc_level(scanned, reclaimed); + + do { + if (vmpressure_event(vmpr, level, ancestor, signalled)) + signalled = true; + ancestor = true; + } while ((vmpr = vmpressure_parent(vmpr))); +} + +/* + * Tree-mode accumulator: accumulate per-memcg scanned/reclaimed and + * schedule the work that walks the parent chain and signals registered + * eventfd listeners once we cross the window threshold. 
+ */ +void vmpressure_v1_account_tree(struct vmpressure *vmpr, + unsigned long scanned, + unsigned long reclaimed) +{ + spin_lock(&vmpr->sr_lock); + scanned = vmpr->tree_scanned += scanned; + vmpr->tree_reclaimed += reclaimed; + spin_unlock(&vmpr->sr_lock); + + if (scanned < vmpressure_win) + return; + schedule_work(&vmpr->work); +} + +void vmpressure_v1_init(struct vmpressure *vmpr) +{ + mutex_init(&vmpr->events_lock); + INIT_LIST_HEAD(&vmpr->events); + INIT_WORK(&vmpr->work, vmpressure_work_fn); +} + +void vmpressure_v1_cleanup(struct vmpressure *vmpr) +{ + /* + * Make sure there is no pending work before eventfd infrastructure + * goes away. + */ + flush_work(&vmpr->work); +} + +/** + * vmpressure_prio() - Account memory pressure through reclaimer priority level + * @gfp: reclaimer's gfp mask + * @memcg: cgroup memory controller handle + * @prio: reclaimer's priority + * + * This function should be called from the reclaim path every time when + * the vmscan's reclaiming priority (scanning depth) changes. + * + * This function does not return any value. + */ +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio) +{ + /* + * We only use prio for accounting critical level. For more info + * see comment for vmpressure_level_critical_prio variable above. + */ + if (prio > vmpressure_level_critical_prio) + return; + + /* + * OK, the prio is below the threshold, updating vmpressure + * information before shrinker dives into long shrinking of long + * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0 + * to the vmpressure() basically means that we signal 'critical' + * level. 
+ */ + vmpressure(gfp, 0, memcg, true, vmpressure_win, 0); +} + +#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2) + +/** + * vmpressure_register_event() - Bind vmpressure notifications to an eventfd + * @memcg: memcg that is interested in vmpressure notifications + * @eventfd: eventfd context to link notifications with + * @args: event arguments (pressure level threshold, optional mode) + * + * This function associates eventfd context with the vmpressure + * infrastructure, so that the notifications will be delivered to the + * @eventfd. The @args parameter is a comma-delimited string that denotes a + * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium", + * or "critical") and an optional mode (one of vmpressure_str_modes, i.e. + * "hierarchy" or "local"). + * + * To be used as memcg event method. + * + * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could + * not be parsed. + */ +int vmpressure_register_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd, const char *args) +{ + struct vmpressure *vmpr = memcg_to_vmpressure(memcg); + struct vmpressure_event *ev; + enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH; + enum vmpressure_levels level; + char *spec, *spec_orig; + char *token; + int ret = 0; + + spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL); + if (!spec) + return -ENOMEM; + + /* Find required level */ + token = strsep(&spec, ","); + ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token); + if (ret < 0) + goto out; + level = ret; + + /* Find optional mode */ + token = strsep(&spec, ","); + if (token) { + ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token); + if (ret < 0) + goto out; + mode = ret; + } + + ev = kzalloc_obj(*ev); + if (!ev) { + ret = -ENOMEM; + goto out; + } + + ev->efd = eventfd; + ev->level = level; + ev->mode = mode; + + mutex_lock(&vmpr->events_lock); + list_add(&ev->node, &vmpr->events); 
+ mutex_unlock(&vmpr->events_lock); + ret = 0; +out: + kfree(spec_orig); + return ret; +} + +/** + * vmpressure_unregister_event() - Unbind eventfd from vmpressure + * @memcg: memcg handle + * @eventfd: eventfd context that was used to link vmpressure with the @cg + * + * This function does internal manipulations to detach the @eventfd from + * the vmpressure notifications, and then frees internal resources + * associated with the @eventfd (but the @eventfd itself is not freed). + * + * To be used as memcg event method. + */ +void vmpressure_unregister_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd) +{ + struct vmpressure *vmpr = memcg_to_vmpressure(memcg); + struct vmpressure_event *ev; + + mutex_lock(&vmpr->events_lock); + list_for_each_entry(ev, &vmpr->events, node) { + if (ev->efd != eventfd) + continue; + list_del(&ev->node); + kfree(ev); + break; + } + mutex_unlock(&vmpr->events_lock); +} + static DEFINE_MUTEX(memcg_max_mutex); static int mem_cgroup_resize_max(struct mem_cgroup *memcg, diff --git a/mm/vmpressure.c b/mm/vmpressure.c index c82cee1ab43b..14470141bbe6 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -7,16 +7,15 @@ * * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro, * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg. + * + * Tree-mode (cgroup v1 userspace eventfd) bookkeeping lives in + * mm/memcontrol-v1.c; this file holds the shared code and the in-kernel + * (tree=false) socket-pressure path that runs on cgroup v2. */ #include <linux/cgroup.h> -#include <linux/fs.h> #include <linux/log2.h> -#include <linux/sched.h> #include <linux/mm.h> -#include <linux/vmstat.h> -#include <linux/eventfd.h> -#include <linux/slab.h> #include <linux/swap.h> #include <linux/printk.h> #include <linux/vmpressure.h> @@ -35,7 +34,7 @@ * TODO: Make the window size depend on machine size, as we do for vmstat * thresholds. Currently we set it to 512 pages (2MB for 4KB pages). 
*/ -static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16; +const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16; /* * These thresholds are used when we account memory pressure through @@ -46,68 +45,6 @@ static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16; static const unsigned int vmpressure_level_med = 60; static const unsigned int vmpressure_level_critical = 95; -/* - * When there are too little pages left to scan, vmpressure() may miss the - * critical pressure as number of pages will be less than "window size". - * However, in that case the vmscan priority will raise fast as the - * reclaimer will try to scan LRUs more deeply. - * - * The vmscan logic considers these special priorities: - * - * prio == DEF_PRIORITY (12): reclaimer starts with that value - * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed - * prio == 0 : close to OOM, kernel scans every page in an lru - * - * Any value in this range is acceptable for this tunable (i.e. from 12 to - * 0). Current value for the vmpressure_level_critical_prio is chosen - * empirically, but the number, in essence, means that we consider - * critical level when scanning depth is ~10% of the lru size (vmscan - * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one - * eights). 
- */
-static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
-
-static struct vmpressure *work_to_vmpressure(struct work_struct *work)
-{
-	return container_of(work, struct vmpressure, work);
-}
-
-static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
-{
-	struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
-
-	memcg = parent_mem_cgroup(memcg);
-	if (!memcg)
-		return NULL;
-	return memcg_to_vmpressure(memcg);
-}
-
-enum vmpressure_levels {
-	VMPRESSURE_LOW = 0,
-	VMPRESSURE_MEDIUM,
-	VMPRESSURE_CRITICAL,
-	VMPRESSURE_NUM_LEVELS,
-};
-
-enum vmpressure_modes {
-	VMPRESSURE_NO_PASSTHROUGH = 0,
-	VMPRESSURE_HIERARCHY,
-	VMPRESSURE_LOCAL,
-	VMPRESSURE_NUM_MODES,
-};
-
-static const char * const vmpressure_str_levels[] = {
-	[VMPRESSURE_LOW] = "low",
-	[VMPRESSURE_MEDIUM] = "medium",
-	[VMPRESSURE_CRITICAL] = "critical",
-};
-
-static const char * const vmpressure_str_modes[] = {
-	[VMPRESSURE_NO_PASSTHROUGH] = "default",
-	[VMPRESSURE_HIERARCHY] = "hierarchy",
-	[VMPRESSURE_LOCAL] = "local",
-};
-
 static enum vmpressure_levels vmpressure_level(unsigned long pressure)
 {
 	if (pressure >= vmpressure_level_critical)
@@ -117,8 +54,8 @@ static enum vmpressure_levels vmpressure_level(unsigned long pressure)
 	return VMPRESSURE_LOW;
 }
 
-static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
-						    unsigned long reclaimed)
+enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
+					     unsigned long reclaimed)
 {
 	unsigned long scale = scanned + reclaimed;
 	unsigned long pressure = 0;
@@ -147,74 +84,6 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 	return vmpressure_level(pressure);
 }
 
-struct vmpressure_event {
-	struct eventfd_ctx *efd;
-	enum vmpressure_levels level;
-	enum vmpressure_modes mode;
-	struct list_head node;
-};
-
-static bool vmpressure_event(struct vmpressure *vmpr,
-			     const enum vmpressure_levels level,
-			     bool ancestor, bool signalled)
-{
-	struct vmpressure_event *ev;
-	bool ret = false;
-
-	mutex_lock(&vmpr->events_lock);
-	list_for_each_entry(ev, &vmpr->events, node) {
-		if (ancestor && ev->mode == VMPRESSURE_LOCAL)
-			continue;
-		if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
-			continue;
-		if (level < ev->level)
-			continue;
-		eventfd_signal(ev->efd);
-		ret = true;
-	}
-	mutex_unlock(&vmpr->events_lock);
-
-	return ret;
-}
-
-static void vmpressure_work_fn(struct work_struct *work)
-{
-	struct vmpressure *vmpr = work_to_vmpressure(work);
-	unsigned long scanned;
-	unsigned long reclaimed;
-	enum vmpressure_levels level;
-	bool ancestor = false;
-	bool signalled = false;
-
-	spin_lock(&vmpr->sr_lock);
-	/*
-	 * Several contexts might be calling vmpressure(), so it is
-	 * possible that the work was rescheduled again before the old
-	 * work context cleared the counters. In that case we will run
-	 * just after the old work returns, but then scanned might be zero
-	 * here. No need for any locks here since we don't care if
-	 * vmpr->reclaimed is in sync.
-	 */
-	scanned = vmpr->tree_scanned;
-	if (!scanned) {
-		spin_unlock(&vmpr->sr_lock);
-		return;
-	}
-
-	reclaimed = vmpr->tree_reclaimed;
-	vmpr->tree_scanned = 0;
-	vmpr->tree_reclaimed = 0;
-	spin_unlock(&vmpr->sr_lock);
-
-	level = vmpressure_calc_level(scanned, reclaimed);
-
-	do {
-		if (vmpressure_event(vmpr, level, ancestor, signalled))
-			signalled = true;
-		ancestor = true;
-	} while ((vmpr = vmpressure_parent(vmpr)));
-}
-
 /**
  * vmpressure() - Account memory pressure through scanned/reclaimed ratio
  * @gfp: reclaimer's gfp mask
@@ -283,14 +152,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 		return;
 
 	if (tree) {
-		spin_lock(&vmpr->sr_lock);
-		scanned = vmpr->tree_scanned += scanned;
-		vmpr->tree_reclaimed += reclaimed;
-		spin_unlock(&vmpr->sr_lock);
-
-		if (scanned < vmpressure_win)
-			return;
-		schedule_work(&vmpr->work);
+		vmpressure_v1_account_tree(vmpr, scanned, reclaimed);
 	} else {
 		enum vmpressure_levels level;
 
@@ -332,134 +194,6 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
 	}
 }
 
-/**
- * vmpressure_prio() - Account memory pressure through reclaimer priority level
- * @gfp: reclaimer's gfp mask
- * @memcg: cgroup memory controller handle
- * @prio: reclaimer's priority
- *
- * This function should be called from the reclaim path every time when
- * the vmscan's reclaiming priority (scanning depth) changes.
- *
- * This function does not return any value.
- */
-void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
-{
-	/*
-	 * We only use prio for accounting critical level. For more info
-	 * see comment for vmpressure_level_critical_prio variable above.
-	 */
-	if (prio > vmpressure_level_critical_prio)
-		return;
-
-	/*
-	 * OK, the prio is below the threshold, updating vmpressure
-	 * information before shrinker dives into long shrinking of long
-	 * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
-	 * to the vmpressure() basically means that we signal 'critical'
-	 * level.
-	 */
-	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
-}
-
-#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
-
-/**
- * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
- * @memcg: memcg that is interested in vmpressure notifications
- * @eventfd: eventfd context to link notifications with
- * @args: event arguments (pressure level threshold, optional mode)
- *
- * This function associates eventfd context with the vmpressure
- * infrastructure, so that the notifications will be delivered to the
- * @eventfd. The @args parameter is a comma-delimited string that denotes a
- * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
- * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
- * "hierarchy" or "local").
- *
- * To be used as memcg event method.
- *
- * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
- * not be parsed.
- */
-int vmpressure_register_event(struct mem_cgroup *memcg,
-			      struct eventfd_ctx *eventfd, const char *args)
-{
-	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
-	struct vmpressure_event *ev;
-	enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
-	enum vmpressure_levels level;
-	char *spec, *spec_orig;
-	char *token;
-	int ret = 0;
-
-	spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
-	if (!spec)
-		return -ENOMEM;
-
-	/* Find required level */
-	token = strsep(&spec, ",");
-	ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
-	if (ret < 0)
-		goto out;
-	level = ret;
-
-	/* Find optional mode */
-	token = strsep(&spec, ",");
-	if (token) {
-		ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
-		if (ret < 0)
-			goto out;
-		mode = ret;
-	}
-
-	ev = kzalloc_obj(*ev);
-	if (!ev) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	ev->efd = eventfd;
-	ev->level = level;
-	ev->mode = mode;
-
-	mutex_lock(&vmpr->events_lock);
-	list_add(&ev->node, &vmpr->events);
-	mutex_unlock(&vmpr->events_lock);
-	ret = 0;
-out:
-	kfree(spec_orig);
-	return ret;
-}
-
-/**
- * vmpressure_unregister_event() - Unbind eventfd from vmpressure
- * @memcg: memcg handle
- * @eventfd: eventfd context that was used to link vmpressure with the @cg
- *
- * This function does internal manipulations to detach the @eventfd from
- * the vmpressure notifications, and then frees internal resources
- * associated with the @eventfd (but the @eventfd itself is not freed).
- *
- * To be used as memcg event method.
- */
-void vmpressure_unregister_event(struct mem_cgroup *memcg,
-				 struct eventfd_ctx *eventfd)
-{
-	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
-	struct vmpressure_event *ev;
-
-	mutex_lock(&vmpr->events_lock);
-	list_for_each_entry(ev, &vmpr->events, node) {
-		if (ev->efd != eventfd)
-			continue;
-		list_del(&ev->node);
-		kfree(ev);
-		break;
-	}
-	mutex_unlock(&vmpr->events_lock);
-}
-
 /**
  * vmpressure_init() - Initialize vmpressure control structure
  * @vmpr: Structure to be initialized
@@ -470,9 +204,7 @@ void vmpressure_unregister_event(struct mem_cgroup *memcg,
 void vmpressure_init(struct vmpressure *vmpr)
 {
 	spin_lock_init(&vmpr->sr_lock);
-	mutex_init(&vmpr->events_lock);
-	INIT_LIST_HEAD(&vmpr->events);
-	INIT_WORK(&vmpr->work, vmpressure_work_fn);
+	vmpressure_v1_init(vmpr);
 }
 
 /**
@@ -484,9 +216,5 @@ void vmpressure_init(struct vmpressure *vmpr)
  */
 void vmpressure_cleanup(struct vmpressure *vmpr)
 {
-	/*
-	 * Make sure there is no pending work before eventfd infrastructure
-	 * goes away.
-	 */
-	flush_work(&vmpr->work);
+	vmpressure_v1_cleanup(vmpr);
 }
-- 
2.53.0-Meta
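
The series compiles out the v1-only state and gives the shared code no-op stubs to call when CONFIG_MEMCG_V1 is off. The following is a minimal userspace sketch of that pattern only, not code from the patch. DEMO_MEMCG_V1 and every demo_* name are hypothetical stand-ins, and the locking, window check and work scheduling of the real accumulator are left out.

```c
#include <assert.h>
#include <stddef.h>

/* DEMO_MEMCG_V1 is a hypothetical stand-in for CONFIG_MEMCG_V1. */
struct demo_vmpressure {
	unsigned long scanned;
	unsigned long reclaimed;
#ifdef DEMO_MEMCG_V1
	/* v1-only accumulators, compiled out when the option is off. */
	unsigned long tree_scanned;
	unsigned long tree_reclaimed;
#endif
};

#ifdef DEMO_MEMCG_V1
static void demo_v1_account_tree(struct demo_vmpressure *v,
				 unsigned long scanned,
				 unsigned long reclaimed)
{
	v->tree_scanned += scanned;
	v->tree_reclaimed += reclaimed;
}
#else
/* No-op stub: the caller below needs no #ifdef of its own. */
static inline void demo_v1_account_tree(struct demo_vmpressure *v,
					unsigned long scanned,
					unsigned long reclaimed)
{
	(void)v;
	(void)scanned;
	(void)reclaimed;
}
#endif

/* Shared entry point: tree != 0 takes the v1 hook, else the local path. */
static void demo_account(struct demo_vmpressure *v, int tree,
			 unsigned long scanned, unsigned long reclaimed)
{
	assert(v != NULL);
	if (tree) {
		demo_v1_account_tree(v, scanned, reclaimed);
	} else {
		v->scanned += scanned;
		v->reclaimed += reclaimed;
	}
}
```

Built without -DDEMO_MEMCG_V1, the struct holds only the two shared counters and a tree call changes nothing. Built with it, the same call site feeds the extra accumulators.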
* Re: [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
@ 2026-06-30 12:32 ` Usama Arif
2026-06-30 14:21 ` Shakeel Butt
1 sibling, 0 replies; 7+ messages in thread
From: Usama Arif @ 2026-06-30 12:32 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, tj, mkoutny, shakeel.butt,
roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
vbabka, kernel-team
On Tue, 30 Jun 2026 04:23:33 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
> interface from the shared and v2 in-kernel code.
>
> Currently, almost half of mm/vmpressure.c exists to serve tree=true:
> struct vmpressure_event, the events list and its mutex, the work_struct
> and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
> parent walk, vmpressure_event(), vmpressure_register_event(),
> vmpressure_unregister_event(), and vmpressure_prio() (which always
> calls vmpressure() with tree=true).
>
> Move it all into mm/memcontrol-v1.c (built only when CONFIG_MEMCG_V1=y)
> as a single contiguous block, following the per-component layout already
> used by that file. Keeping the v1 vmpressure code with the rest of the
> deprecated cgroup v1 memory controller makes the full footprint of the
> CONFIG_MEMCG_V1 option easy to see in one place, which matters more
> than component-level file separation for code that has no active
> development.
>
> vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
> the runtime hierarchy check, the tree=false body, init/cleanup
> plumbing) and calls into three small v1 hooks for the tree=true
> accumulator and the v1 portions of init/cleanup. The hooks have
> static-inline no-op stubs in include/linux/vmpressure.h for the
> !MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
> the same treatment, which means vmscan.c's call site disappears at
> compile time on v2-only kernels.
>
> The only #ifdef CONFIG_MEMCG_V1 in source remains around the v1-only
> fields inside struct vmpressure itself.
>
> Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):
>
> struct vmpressure : 112B -> 24B
> struct mem_cgroup : 1664B -> 1536B
>
> This split is the first step toward eventually making vmpressure
> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
> (tree=false) cannot be removed today immediately: PSI is not an
> exact replacement for vmpressure, and switching networking socket-buffer
> back-off to PSI may regress networking performance or increase memory
> pressure in workloads that today rely on vmpressure's hysteresis. The
> medium-term plan is to introduce a PSI-based socket-pressure path, keep
> vmpressure available for v2 behind a defconfig as an opt-out for several
> releases, and only then drop the tree=false path entirely, at which point
> everything that remains of the vmpressure block in mm/memcontrol-v1.c is
> the whole subsystem.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>

Shakeel had acked the previous version, but I forgot to carry it over,
sorry about that!

> ---
>  include/linux/vmpressure.h |  46 +++++-
>  mm/memcontrol-v1.c         | 292 +++++++++++++++++++++++++++++++++++++
>  mm/vmpressure.c            | 292 ++-----------------------------------
>  3 files changed, 343 insertions(+), 287 deletions(-)
>
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index faecd5522401..b4d13457bc2a 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -13,18 +13,31 @@
>  struct vmpressure {
>  	unsigned long scanned;
>  	unsigned long reclaimed;
> +	/* The lock is used to keep the scanned/reclaimed in sync. */
> +	spinlock_t sr_lock;
>
> +#ifdef CONFIG_MEMCG_V1
> +	/*
> +	 * tree=true accumulators feed the v1 userspace eventfd interface
> +	 * (memory.pressure_level). Drained by @work. v2 has no equivalent
> +	 * interface, so this state is omitted on CONFIG_MEMCG_V1=n builds.
> +	 */
>  	unsigned long tree_scanned;
>  	unsigned long tree_reclaimed;
> -	/* The lock is used to keep the scanned/reclaimed above in sync. */
> -	spinlock_t sr_lock;
> -
>  	/* The list of vmpressure_event structs. */
>  	struct list_head events;
>  	/* Have to grab the lock on events traversal or modifications. */
>  	struct mutex events_lock;
>
>  	struct work_struct work;
> +#endif
> +};
> +
> +enum vmpressure_levels {
> +	VMPRESSURE_LOW = 0,
> +	VMPRESSURE_MEDIUM,
> +	VMPRESSURE_CRITICAL,
> +	VMPRESSURE_NUM_LEVELS,
>  };
>
>  struct mem_cgroup;
> @@ -32,18 +45,41 @@ struct mem_cgroup;
>  #ifdef CONFIG_MEMCG
>  void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  		unsigned long scanned, unsigned long reclaimed);
> -extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
> -
>  extern void vmpressure_init(struct vmpressure *vmpr);
>  extern void vmpressure_cleanup(struct vmpressure *vmpr);
>  extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg);
>  extern struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr);
> +
> +/* Shared with the v1 vmpressure block in mm/memcontrol-v1.c. */
> +extern const unsigned long vmpressure_win;
> +extern enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> +						    unsigned long reclaimed);
> +
> +#ifdef CONFIG_MEMCG_V1
> +extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
>  extern int vmpressure_register_event(struct mem_cgroup *memcg,
>  				     struct eventfd_ctx *eventfd,
>  				     const char *args);
>  extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
>  					struct eventfd_ctx *eventfd);
> +
> +/* v1 hooks called from mm/vmpressure.c; no-ops below when !MEMCG_V1. */
> +extern void vmpressure_v1_init(struct vmpressure *vmpr);
> +extern void vmpressure_v1_cleanup(struct vmpressure *vmpr);
> +extern void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> +				       unsigned long scanned,
> +				       unsigned long reclaimed);
>  #else
> +static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
> +				   int prio) {}
> +static inline void vmpressure_v1_init(struct vmpressure *vmpr) {}
> +static inline void vmpressure_v1_cleanup(struct vmpressure *vmpr) {}
> +static inline void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> +					      unsigned long scanned,
> +					      unsigned long reclaimed) {}
> +#endif /* CONFIG_MEMCG_V1 */
> +
> +#else /* !CONFIG_MEMCG */
>  static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg,
>  			      bool tree, unsigned long scanned,
>  			      unsigned long reclaimed) {}
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 765069211567..135622b6172b 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -6,6 +6,7 @@
>  #include <linux/pagewalk.h>
>  #include <linux/backing-dev.h>
>  #include <linux/eventfd.h>
> +#include <linux/log2.h>
>  #include <linux/poll.h>
>  #include <linux/sort.h>
>  #include <linux/file.h>
> @@ -1476,6 +1477,297 @@ void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
>  		mem_cgroup_oom_unlock(memcg);
>  }
>
> +/*
> + * cgroup v1 userspace vmpressure interface (memory.pressure_level /
> + * cgroup.event_control). Kept here so v2-only kernels (CONFIG_MEMCG_V1=n)
> + * drop the whole eventfd accumulator, its work item, and the per-memcg
> + * state it requires.
> + *
> + * When there are too little pages left to scan, vmpressure() may miss the
> + * critical pressure as number of pages will be less than "window size".
> + * However, in that case the vmscan priority will raise fast as the
> + * reclaimer will try to scan LRUs more deeply.
> + *
> + * The vmscan logic considers these special priorities:
> + *
> + *   prio == DEF_PRIORITY (12): reclaimer starts with that value
> + *   prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
> + *   prio == 0                : close to OOM, kernel scans every page in an lru
> + *
> + * Any value in this range is acceptable for this tunable (i.e. from 12 to
> + * 0). Current value for the vmpressure_level_critical_prio is chosen
> + * empirically, but the number, in essence, means that we consider
> + * critical level when scanning depth is ~10% of the lru size (vmscan
> + * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
> + * eights).
> + */
> +static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
> +
> +enum vmpressure_modes {
> +	VMPRESSURE_NO_PASSTHROUGH = 0,
> +	VMPRESSURE_HIERARCHY,
> +	VMPRESSURE_LOCAL,
> +	VMPRESSURE_NUM_MODES,
> +};
> +
> +static const char * const vmpressure_str_levels[] = {
> +	[VMPRESSURE_LOW] = "low",
> +	[VMPRESSURE_MEDIUM] = "medium",
> +	[VMPRESSURE_CRITICAL] = "critical",
> +};
> +
> +static const char * const vmpressure_str_modes[] = {
> +	[VMPRESSURE_NO_PASSTHROUGH] = "default",
> +	[VMPRESSURE_HIERARCHY] = "hierarchy",
> +	[VMPRESSURE_LOCAL] = "local",
> +};
> +
> +struct vmpressure_event {
> +	struct eventfd_ctx *efd;
> +	enum vmpressure_levels level;
> +	enum vmpressure_modes mode;
> +	struct list_head node;
> +};
> +
> +static struct vmpressure *work_to_vmpressure(struct work_struct *work)
> +{
> +	return container_of(work, struct vmpressure, work);
> +}
> +
> +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> +{
> +	struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
> +
> +	memcg = parent_mem_cgroup(memcg);
> +	if (!memcg)
> +		return NULL;
> +	return memcg_to_vmpressure(memcg);
> +}
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> +			     const enum vmpressure_levels level,
> +			     bool ancestor, bool signalled)
> +{
> +	struct vmpressure_event *ev;
> +	bool ret = false;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (ancestor && ev->mode == VMPRESSURE_LOCAL)
> +			continue;
> +		if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
> +			continue;
> +		if (level < ev->level)
> +			continue;
> +		eventfd_signal(ev->efd);
> +		ret = true;
> +	}
> +	mutex_unlock(&vmpr->events_lock);
> +
> +	return ret;
> +}
> +
> +static void vmpressure_work_fn(struct work_struct *work)
> +{
> +	struct vmpressure *vmpr = work_to_vmpressure(work);
> +	unsigned long scanned;
> +	unsigned long reclaimed;
> +	enum vmpressure_levels level;
> +	bool ancestor = false;
> +	bool signalled = false;
> +
> +	spin_lock(&vmpr->sr_lock);
> +	/*
> +	 * Several contexts might be calling vmpressure(), so it is
> +	 * possible that the work was rescheduled again before the old
> +	 * work context cleared the counters. In that case we will run
> +	 * just after the old work returns, but then scanned might be zero
> +	 * here. No need for any locks here since we don't care if
> +	 * vmpr->reclaimed is in sync.
> +	 */
> +	scanned = vmpr->tree_scanned;
> +	if (!scanned) {
> +		spin_unlock(&vmpr->sr_lock);
> +		return;
> +	}
> +
> +	reclaimed = vmpr->tree_reclaimed;
> +	vmpr->tree_scanned = 0;
> +	vmpr->tree_reclaimed = 0;
> +	spin_unlock(&vmpr->sr_lock);
> +
> +	level = vmpressure_calc_level(scanned, reclaimed);
> +
> +	do {
> +		if (vmpressure_event(vmpr, level, ancestor, signalled))
> +			signalled = true;
> +		ancestor = true;
> +	} while ((vmpr = vmpressure_parent(vmpr)));
> +}
> +
> +/*
> + * Tree-mode accumulator: accumulate per-memcg scanned/reclaimed and
> + * schedule the work that walks the parent chain and signals registered
> + * eventfd listeners once we cross the window threshold.
> + */
> +void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> +				unsigned long scanned,
> +				unsigned long reclaimed)
> +{
> +	spin_lock(&vmpr->sr_lock);
> +	scanned = vmpr->tree_scanned += scanned;
> +	vmpr->tree_reclaimed += reclaimed;
> +	spin_unlock(&vmpr->sr_lock);
> +
> +	if (scanned < vmpressure_win)
> +		return;
> +	schedule_work(&vmpr->work);
> +}
> +
> +void vmpressure_v1_init(struct vmpressure *vmpr)
> +{
> +	mutex_init(&vmpr->events_lock);
> +	INIT_LIST_HEAD(&vmpr->events);
> +	INIT_WORK(&vmpr->work, vmpressure_work_fn);
> +}
> +
> +void vmpressure_v1_cleanup(struct vmpressure *vmpr)
> +{
> +	/*
> +	 * Make sure there is no pending work before eventfd infrastructure
> +	 * goes away.
> +	 */
> +	flush_work(&vmpr->work);
> +}
> +
> +/**
> + * vmpressure_prio() - Account memory pressure through reclaimer priority level
> + * @gfp: reclaimer's gfp mask
> + * @memcg: cgroup memory controller handle
> + * @prio: reclaimer's priority
> + *
> + * This function should be called from the reclaim path every time when
> + * the vmscan's reclaiming priority (scanning depth) changes.
> + *
> + * This function does not return any value.
> + */
> +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> +{
> +	/*
> +	 * We only use prio for accounting critical level. For more info
> +	 * see comment for vmpressure_level_critical_prio variable above.
> +	 */
> +	if (prio > vmpressure_level_critical_prio)
> +		return;
> +
> +	/*
> +	 * OK, the prio is below the threshold, updating vmpressure
> +	 * information before shrinker dives into long shrinking of long
> +	 * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
> +	 * to the vmpressure() basically means that we signal 'critical'
> +	 * level.
> +	 */
> +	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
> +}
> +
> +#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
> +
> +/**
> + * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
> + * @memcg: memcg that is interested in vmpressure notifications
> + * @eventfd: eventfd context to link notifications with
> + * @args: event arguments (pressure level threshold, optional mode)
> + *
> + * This function associates eventfd context with the vmpressure
> + * infrastructure, so that the notifications will be delivered to the
> + * @eventfd. The @args parameter is a comma-delimited string that denotes a
> + * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
> + * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
> + * "hierarchy" or "local").
> + *
> + * To be used as memcg event method.
> + *
> + * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
> + * not be parsed.
> + */
> +int vmpressure_register_event(struct mem_cgroup *memcg,
> +			      struct eventfd_ctx *eventfd, const char *args)
> +{
> +	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> +	struct vmpressure_event *ev;
> +	enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
> +	enum vmpressure_levels level;
> +	char *spec, *spec_orig;
> +	char *token;
> +	int ret = 0;
> +
> +	spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
> +	if (!spec)
> +		return -ENOMEM;
> +
> +	/* Find required level */
> +	token = strsep(&spec, ",");
> +	ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
> +	if (ret < 0)
> +		goto out;
> +	level = ret;
> +
> +	/* Find optional mode */
> +	token = strsep(&spec, ",");
> +	if (token) {
> +		ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
> +		if (ret < 0)
> +			goto out;
> +		mode = ret;
> +	}
> +
> +	ev = kzalloc_obj(*ev);
> +	if (!ev) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ev->efd = eventfd;
> +	ev->level = level;
> +	ev->mode = mode;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_add(&ev->node, &vmpr->events);
> +	mutex_unlock(&vmpr->events_lock);
> +	ret = 0;
> +out:
> +	kfree(spec_orig);
> +	return ret;
> +}
> +
> +/**
> + * vmpressure_unregister_event() - Unbind eventfd from vmpressure
> + * @memcg: memcg handle
> + * @eventfd: eventfd context that was used to link vmpressure with the @cg
> + *
> + * This function does internal manipulations to detach the @eventfd from
> + * the vmpressure notifications, and then frees internal resources
> + * associated with the @eventfd (but the @eventfd itself is not freed).
> + *
> + * To be used as memcg event method.
> + */
> +void vmpressure_unregister_event(struct mem_cgroup *memcg,
> +				 struct eventfd_ctx *eventfd)
> +{
> +	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> +	struct vmpressure_event *ev;
> +
> +	mutex_lock(&vmpr->events_lock);
> +	list_for_each_entry(ev, &vmpr->events, node) {
> +		if (ev->efd != eventfd)
> +			continue;
> +		list_del(&ev->node);
> +		kfree(ev);
> +		break;
> +	}
> +	mutex_unlock(&vmpr->events_lock);
> +}
> +
>  static DEFINE_MUTEX(memcg_max_mutex);
>
>  static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index c82cee1ab43b..14470141bbe6 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -7,16 +7,15 @@
>   *
>   * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
>   * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * Tree-mode (cgroup v1 userspace eventfd) bookkeeping lives in
> + * mm/memcontrol-v1.c; this file holds the shared code and the in-kernel
> + * (tree=false) socket-pressure path that runs on cgroup v2.
>   */
>
>  #include <linux/cgroup.h>
> -#include <linux/fs.h>
>  #include <linux/log2.h>
> -#include <linux/sched.h>
>  #include <linux/mm.h>
> -#include <linux/vmstat.h>
> -#include <linux/eventfd.h>
> -#include <linux/slab.h>
>  #include <linux/swap.h>
>  #include <linux/printk.h>
>  #include <linux/vmpressure.h>
> @@ -35,7 +34,7 @@
>   * TODO: Make the window size depend on machine size, as we do for vmstat
>   * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
>   */
> -static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
>
>  /*
>   * These thresholds are used when we account memory pressure through
> @@ -46,68 +45,6 @@ static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
>  static const unsigned int vmpressure_level_med = 60;
>  static const unsigned int vmpressure_level_critical = 95;
>
> -/*
> - * When there are too little pages left to scan, vmpressure() may miss the
> - * critical pressure as number of pages will be less than "window size".
> - * However, in that case the vmscan priority will raise fast as the
> - * reclaimer will try to scan LRUs more deeply.
> - *
> - * The vmscan logic considers these special priorities:
> - *
> - *   prio == DEF_PRIORITY (12): reclaimer starts with that value
> - *   prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
> - *   prio == 0                : close to OOM, kernel scans every page in an lru
> - *
> - * Any value in this range is acceptable for this tunable (i.e. from 12 to
> - * 0). Current value for the vmpressure_level_critical_prio is chosen
> - * empirically, but the number, in essence, means that we consider
> - * critical level when scanning depth is ~10% of the lru size (vmscan
> - * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
> - * eights).
> - */
> -static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
> -
> -static struct vmpressure *work_to_vmpressure(struct work_struct *work)
> -{
> -	return container_of(work, struct vmpressure, work);
> -}
> -
> -static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> -{
> -	struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
> -
> -	memcg = parent_mem_cgroup(memcg);
> -	if (!memcg)
> -		return NULL;
> -	return memcg_to_vmpressure(memcg);
> -}
> -
> -enum vmpressure_levels {
> -	VMPRESSURE_LOW = 0,
> -	VMPRESSURE_MEDIUM,
> -	VMPRESSURE_CRITICAL,
> -	VMPRESSURE_NUM_LEVELS,
> -};
> -
> -enum vmpressure_modes {
> -	VMPRESSURE_NO_PASSTHROUGH = 0,
> -	VMPRESSURE_HIERARCHY,
> -	VMPRESSURE_LOCAL,
> -	VMPRESSURE_NUM_MODES,
> -};
> -
> -static const char * const vmpressure_str_levels[] = {
> -	[VMPRESSURE_LOW] = "low",
> -	[VMPRESSURE_MEDIUM] = "medium",
> -	[VMPRESSURE_CRITICAL] = "critical",
> -};
> -
> -static const char * const vmpressure_str_modes[] = {
> -	[VMPRESSURE_NO_PASSTHROUGH] = "default",
> -	[VMPRESSURE_HIERARCHY] = "hierarchy",
> -	[VMPRESSURE_LOCAL] = "local",
> -};
> -
>  static enum vmpressure_levels vmpressure_level(unsigned long pressure)
>  {
>  	if (pressure >= vmpressure_level_critical)
> @@ -117,8 +54,8 @@ static enum vmpressure_levels vmpressure_level(unsigned long pressure)
>  	return VMPRESSURE_LOW;
>  }
>
> -static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> -						    unsigned long reclaimed)
> +enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> +					     unsigned long reclaimed)
>  {
>  	unsigned long scale = scanned + reclaimed;
>  	unsigned long pressure = 0;
> @@ -147,74 +84,6 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>  	return vmpressure_level(pressure);
>  }
>
> -struct vmpressure_event {
> -	struct eventfd_ctx *efd;
> -	enum vmpressure_levels level;
> -	enum vmpressure_modes mode;
> -	struct list_head node;
> -};
> -
> -static bool vmpressure_event(struct vmpressure *vmpr,
> -			     const enum vmpressure_levels level,
> -			     bool ancestor, bool signalled)
> -{
> -	struct vmpressure_event *ev;
> -	bool ret = false;
> -
> -	mutex_lock(&vmpr->events_lock);
> -	list_for_each_entry(ev, &vmpr->events, node) {
> -		if (ancestor && ev->mode == VMPRESSURE_LOCAL)
> -			continue;
> -		if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
> -			continue;
> -		if (level < ev->level)
> -			continue;
> -		eventfd_signal(ev->efd);
> -		ret = true;
> -	}
> -	mutex_unlock(&vmpr->events_lock);
> -
> -	return ret;
> -}
> -
> -static void vmpressure_work_fn(struct work_struct *work)
> -{
> -	struct vmpressure *vmpr = work_to_vmpressure(work);
> -	unsigned long scanned;
> -	unsigned long reclaimed;
> -	enum vmpressure_levels level;
> -	bool ancestor = false;
> -	bool signalled = false;
> -
> -	spin_lock(&vmpr->sr_lock);
> -	/*
> -	 * Several contexts might be calling vmpressure(), so it is
> -	 * possible that the work was rescheduled again before the old
> -	 * work context cleared the counters. In that case we will run
> -	 * just after the old work returns, but then scanned might be zero
> -	 * here. No need for any locks here since we don't care if
> -	 * vmpr->reclaimed is in sync.
> -	 */
> -	scanned = vmpr->tree_scanned;
> -	if (!scanned) {
> -		spin_unlock(&vmpr->sr_lock);
> -		return;
> -	}
> -
> -	reclaimed = vmpr->tree_reclaimed;
> -	vmpr->tree_scanned = 0;
> -	vmpr->tree_reclaimed = 0;
> -	spin_unlock(&vmpr->sr_lock);
> -
> -	level = vmpressure_calc_level(scanned, reclaimed);
> -
> -	do {
> -		if (vmpressure_event(vmpr, level, ancestor, signalled))
> -			signalled = true;
> -		ancestor = true;
> -	} while ((vmpr = vmpressure_parent(vmpr)));
> -}
> -
>  /**
>   * vmpressure() - Account memory pressure through scanned/reclaimed ratio
>   * @gfp: reclaimer's gfp mask
> @@ -283,14 +152,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  		return;
>
>  	if (tree) {
> -		spin_lock(&vmpr->sr_lock);
> -		scanned = vmpr->tree_scanned += scanned;
> -		vmpr->tree_reclaimed += reclaimed;
> -		spin_unlock(&vmpr->sr_lock);
> -
> -		if (scanned < vmpressure_win)
> -			return;
> -		schedule_work(&vmpr->work);
> +		vmpressure_v1_account_tree(vmpr, scanned, reclaimed);
>  	} else {
>  		enum vmpressure_levels level;
>
> @@ -332,134 +194,6 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
>  	}
>  }
>
> -/**
> - * vmpressure_prio() - Account memory pressure through reclaimer priority level
> - * @gfp: reclaimer's gfp mask
> - * @memcg: cgroup memory controller handle
> - * @prio: reclaimer's priority
> - *
> - * This function should be called from the reclaim path every time when
> - * the vmscan's reclaiming priority (scanning depth) changes.
> - *
> - * This function does not return any value.
> - */
> -void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> -{
> -	/*
> -	 * We only use prio for accounting critical level. For more info
> -	 * see comment for vmpressure_level_critical_prio variable above.
> -	 */
> -	if (prio > vmpressure_level_critical_prio)
> -		return;
> -
> -	/*
> -	 * OK, the prio is below the threshold, updating vmpressure
> -	 * information before shrinker dives into long shrinking of long
> -	 * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
> -	 * to the vmpressure() basically means that we signal 'critical'
> -	 * level.
> -	 */
> -	vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
> -}
> -
> -#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
> -
> -/**
> - * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
> - * @memcg: memcg that is interested in vmpressure notifications
> - * @eventfd: eventfd context to link notifications with
> - * @args: event arguments (pressure level threshold, optional mode)
> - *
> - * This function associates eventfd context with the vmpressure
> - * infrastructure, so that the notifications will be delivered to the
> - * @eventfd. The @args parameter is a comma-delimited string that denotes a
> - * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
> - * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
> - * "hierarchy" or "local").
> - *
> - * To be used as memcg event method.
> - *
> - * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
> - * not be parsed.
> - */
> -int vmpressure_register_event(struct mem_cgroup *memcg,
> -			      struct eventfd_ctx *eventfd, const char *args)
> -{
> -	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> -	struct vmpressure_event *ev;
> -	enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
> -	enum vmpressure_levels level;
> -	char *spec, *spec_orig;
> -	char *token;
> -	int ret = 0;
> -
> -	spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
> -	if (!spec)
> -		return -ENOMEM;
> -
> -	/* Find required level */
> -	token = strsep(&spec, ",");
> -	ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
> -	if (ret < 0)
> -		goto out;
> -	level = ret;
> -
> -	/* Find optional mode */
> -	token = strsep(&spec, ",");
> -	if (token) {
> -		ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
> -		if (ret < 0)
> -			goto out;
> -		mode = ret;
> -	}
> -
> -	ev = kzalloc_obj(*ev);
> -	if (!ev) {
> -		ret = -ENOMEM;
> -		goto out;
> -	}
> -
> -	ev->efd = eventfd;
> -	ev->level = level;
> -	ev->mode = mode;
> -
> -	mutex_lock(&vmpr->events_lock);
> -	list_add(&ev->node, &vmpr->events);
> -	mutex_unlock(&vmpr->events_lock);
> -	ret = 0;
> -out:
> -	kfree(spec_orig);
> -	return ret;
> -}
> -
> -/**
> - * vmpressure_unregister_event() - Unbind eventfd from vmpressure
> - * @memcg: memcg handle
> - * @eventfd: eventfd context that was used to link vmpressure with the @cg
> - *
> - * This function does internal manipulations to detach the @eventfd from
> - * the vmpressure notifications, and then frees internal resources
> - * associated with the @eventfd (but the @eventfd itself is not freed).
> - *
> - * To be used as memcg event method.
> - */
> -void vmpressure_unregister_event(struct mem_cgroup *memcg,
> -				 struct eventfd_ctx *eventfd)
> -{
> -	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> -	struct vmpressure_event *ev;
> -
> -	mutex_lock(&vmpr->events_lock);
> -	list_for_each_entry(ev, &vmpr->events, node) {
> -		if (ev->efd != eventfd)
> -			continue;
> -		list_del(&ev->node);
> -		kfree(ev);
> -		break;
> -	}
> -	mutex_unlock(&vmpr->events_lock);
> -}
> -
>  /**
>   * vmpressure_init() - Initialize vmpressure control structure
>   * @vmpr: Structure to be initialized
> @@ -470,9 +204,7 @@ void vmpressure_unregister_event(struct mem_cgroup *memcg,
>  void vmpressure_init(struct vmpressure *vmpr)
>  {
>  	spin_lock_init(&vmpr->sr_lock);
> -	mutex_init(&vmpr->events_lock);
> -	INIT_LIST_HEAD(&vmpr->events);
> -	INIT_WORK(&vmpr->work, vmpressure_work_fn);
> +	vmpressure_v1_init(vmpr);
>  }
>
>  /**
> @@ -484,9 +216,5 @@ void vmpressure_init(struct vmpressure *vmpr)
>   */
>  void vmpressure_cleanup(struct vmpressure *vmpr)
>  {
> -	/*
> -	 * Make sure there is no pending work before eventfd infrastructure
> -	 * goes away.
> -	 */
> -	flush_work(&vmpr->work);
> +	vmpressure_v1_cleanup(vmpr);
>  }
> --
> 2.53.0-Meta
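
For readers following the move: the kernel-doc for vmpressure_register_event() quoted above describes @args as a comma-delimited "level[,mode]" string. A minimal userspace sketch of that format follows. It is not the kernel code: every demo_* name is hypothetical, it uses strchr() rather than strsep()/match_string(), it accepts only the two modes named in the kernel-doc, and unlike the kernel it rejects anything after the mode token. The default mode mirrors the VMPRESSURE_NO_PASSTHROUGH initialiser in the quoted function.

```c
#include <string.h>

/* Hypothetical userspace mirrors of the kernel enums (illustration only). */
enum demo_level { DEMO_LOW, DEMO_MEDIUM, DEMO_CRITICAL, DEMO_NUM_LEVELS };
enum demo_mode { DEMO_NO_PASSTHROUGH, DEMO_HIERARCHY, DEMO_LOCAL };

static const char *const demo_levels[DEMO_NUM_LEVELS] = {
	"low", "medium", "critical",
};

/* Return the index of the len-byte token s in table, or -1 if absent. */
static int demo_match(const char *const *table, int n,
		      const char *s, size_t len)
{
	for (int i = 0; i < n; i++)
		if (strlen(table[i]) == len && strncmp(table[i], s, len) == 0)
			return i;
	return -1;
}

/*
 * Parse "level[,mode]". Returns 0 on success and -1 on a malformed spec
 * (the kernel function returns -EINVAL in that case).
 */
static int demo_parse_args(const char *args, enum demo_level *level,
			   enum demo_mode *mode)
{
	const char *comma = strchr(args, ',');
	size_t level_len = comma ? (size_t)(comma - args) : strlen(args);
	int idx = demo_match(demo_levels, DEMO_NUM_LEVELS, args, level_len);

	if (idx < 0)
		return -1;
	*level = (enum demo_level)idx;

	/* The mode is optional; without it the default applies. */
	*mode = DEMO_NO_PASSTHROUGH;
	if (!comma)
		return 0;

	if (strcmp(comma + 1, "hierarchy") == 0)
		*mode = DEMO_HIERARCHY;
	else if (strcmp(comma + 1, "local") == 0)
		*mode = DEMO_LOCAL;
	else
		return -1;
	return 0;
}
```

With this sketch, "critical,hierarchy" yields DEMO_CRITICAL with DEMO_HIERARCHY, a bare "low" keeps the default mode, and an unknown level such as "bogus" is rejected.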
* Re: [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
2026-06-30 12:32 ` Usama Arif
@ 2026-06-30 14:21 ` Shakeel Butt
1 sibling, 0 replies; 7+ messages in thread
From: Shakeel Butt @ 2026-06-30 14:21 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, tj, mkoutny,
roman.gushchin, liam, linux-kernel, ljs, mhocko, rppt, surenb,
vbabka, kernel-team

On Tue, Jun 30, 2026 at 04:23:33AM -0700, Usama Arif wrote:
> Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
> interface from the shared and v2 in-kernel code.
>
> Currently, almost half of mm/vmpressure.c exists to serve tree=true:
> struct vmpressure_event, the events list and its mutex, the work_struct
> and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
> parent walk, vmpressure_event(), vmpressure_register_event(),
> vmpressure_unregister_event(), and vmpressure_prio() (which always
> calls vmpressure() with tree=true).
>
> Move it all into mm/memcontrol-v1.c (built only when CONFIG_MEMCG_V1=y)
> as a single contiguous block, following the per-component layout already
> used by that file. Keeping the v1 vmpressure code with the rest of the
> deprecated cgroup v1 memory controller makes the full footprint of the
> CONFIG_MEMCG_V1 option easy to see in one place, which matters more
> than component-level file separation for code that has no active
> development.
>
> vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
> the runtime hierarchy check, the tree=false body, init/cleanup
> plumbing) and calls into three small v1 hooks for the tree=true
> accumulator and the v1 portions of init/cleanup. The hooks have
> static-inline no-op stubs in include/linux/vmpressure.h for the
> !MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
> the same treatment, which means vmscan.c's call site disappears at
> compile time on v2-only kernels.
>
> The only #ifdef CONFIG_MEMCG_V1 in source remains around the v1-only
> fields inside struct vmpressure itself.
>
> Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):
>
>   struct vmpressure :  112B ->   24B
>   struct mem_cgroup : 1664B -> 1536B
>
> This split is the first step toward eventually making vmpressure
> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
> (tree=false) cannot be removed today immediately: PSI is not an
> exact replacement for vmpressure, and switching networking socket-buffer
> back-off to PSI may regress networking performance or increase memory
> pressure in workloads that today rely on vmpressure's hysteresis. The
> medium-term plan is to introduce a PSI-based socket-pressure path, keep
> vmpressure available for v2 behind a defconfig as an opt-out for several
> releases, and only then drop the tree=false path entirely, at which point
> everything that remains of the vmpressure block in mm/memcontrol-v1.c is
> the whole subsystem.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
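
The commit message's "static-inline no-op stubs ... so callers don't need ifdefs" pattern can be sketched outside the kernel roughly as below. This is an illustration under stated assumptions, not the contents of include/linux/vmpressure.h: DEMO_MEMCG_V1 stands in for CONFIG_MEMCG_V1, all demo_* names are hypothetical, and locking, the vmpressure_win threshold and the work item are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for CONFIG_MEMCG_V1; build with -DDEMO_MEMCG_V1=1 to enable. */
#ifndef DEMO_MEMCG_V1
#define DEMO_MEMCG_V1 0
#endif

struct demo_vmpressure {
	unsigned long scanned;
	unsigned long reclaimed;
#if DEMO_MEMCG_V1
	/* v1-only fields: the one place where the config check stays visible. */
	unsigned long tree_scanned;
	unsigned long tree_reclaimed;
#endif
};

#if DEMO_MEMCG_V1
/* Real hook: accumulate into the v1-only counters. */
static inline void demo_v1_account_tree(struct demo_vmpressure *vmpr,
					unsigned long scanned,
					unsigned long reclaimed)
{
	vmpr->tree_scanned += scanned;
	vmpr->tree_reclaimed += reclaimed;
}
#else
/* No-op stub: same signature, so the call site compiles away. */
static inline void demo_v1_account_tree(struct demo_vmpressure *vmpr,
					unsigned long scanned,
					unsigned long reclaimed)
{
	(void)vmpr;
	(void)scanned;
	(void)reclaimed;
}
#endif

/* The shared caller needs no ifdef, whichever variant is built. */
static void demo_account(struct demo_vmpressure *vmpr, bool tree,
			 unsigned long scanned, unsigned long reclaimed)
{
	if (tree) {
		demo_v1_account_tree(vmpr, scanned, reclaimed);
	} else {
		vmpr->scanned += scanned;
		vmpr->reclaimed += reclaimed;
	}
}
```

With DEMO_MEMCG_V1 left at 0 the struct loses both tree_* fields and the tree=true branch does nothing, which is the same kind of shrinkage the pahole figures above report for the real struct vmpressure.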
end of thread, other threads:[~2026-06-30 16:31 UTC | newest]

Thread overview: 7+ messages
2026-06-30 11:23 [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif
2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
2026-06-30 16:07 ` Johannes Weiner
2026-06-30 16:30 ` Usama Arif
2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
2026-06-30 12:32 ` Usama Arif
2026-06-30 14:21 ` Shakeel Butt