From: Usama Arif <usama.arif@linux.dev>
To: Usama Arif <usama.arif@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org,
tj@kernel.org, mkoutny@suse.com, shakeel.butt@linux.dev,
roman.gushchin@linux.dev, liam@infradead.org,
linux-kernel@vger.kernel.org, ljs@kernel.org, mhocko@suse.com,
rppt@kernel.org, surenb@google.com, vbabka@kernel.org,
kernel-team@meta.com
Subject: Re: [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c
Date: Tue, 30 Jun 2026 05:32:27 -0700
Message-ID: <20260630123228.4052656-1-usama.arif@linux.dev>
In-Reply-To: <20260630112617.1198623-3-usama.arif@linux.dev>
On Tue, 30 Jun 2026 04:23:33 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> Clean up mm/vmpressure.c by separating the cgroup v1 userspace eventfd
> interface from the shared and v2 in-kernel code.
>
> Currently, almost half of mm/vmpressure.c exists to serve tree=true:
> struct vmpressure_event, the events list and its mutex, the work_struct
> and vmpressure_work_fn that drains tree_scanned/tree_reclaimed, the
> parent walk, vmpressure_event(), vmpressure_register_event(),
> vmpressure_unregister_event(), and vmpressure_prio() (which always
> calls vmpressure() with tree=true).
>
> Move it all into mm/memcontrol-v1.c (built only when CONFIG_MEMCG_V1=y)
> as a single contiguous block, following the per-component layout already
> used by that file. Keeping the v1 vmpressure code with the rest of the
> deprecated cgroup v1 memory controller makes the full footprint of the
> CONFIG_MEMCG_V1 option easy to see in one place, which matters more
> than component-level file separation for code that has no active
> development.
>
> vmpressure.c keeps the shared bits (constants, vmpressure_calc_level,
> the runtime hierarchy check, the tree=false body, init/cleanup
> plumbing) and calls into three small v1 hooks for the tree=true
> accumulator and the v1 portions of init/cleanup. The hooks have
> static-inline no-op stubs in include/linux/vmpressure.h for the
> !MEMCG_V1 case, so callers don't need ifdefs. vmpressure_prio() gets
> the same treatment, which means vmscan.c's call site disappears at
> compile time on v2-only kernels.
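
For anyone less familiar with the stub pattern described above, here is a
minimal userspace sketch. HAVE_V1, struct pressure, v1_account_tree() and
account() are hypothetical stand-ins for CONFIG_MEMCG_V1, struct vmpressure,
vmpressure_v1_account_tree() and vmpressure(); this is an illustration of the
idea, not the kernel code.

```c
#include <stdbool.h>

/* HAVE_V1 is a hypothetical stand-in for CONFIG_MEMCG_V1. */
struct pressure {
	unsigned long scanned;
	unsigned long reclaimed;
#ifdef HAVE_V1
	/* v1-only state: compiled out entirely when HAVE_V1 is not set. */
	unsigned long tree_scanned;
	unsigned long tree_reclaimed;
#endif
};

#ifdef HAVE_V1
static inline void v1_account_tree(struct pressure *p,
				   unsigned long scanned,
				   unsigned long reclaimed)
{
	p->tree_scanned += scanned;
	p->tree_reclaimed += reclaimed;
}
#else
/* No-op stub with the same signature, so callers need no #ifdef. */
static inline void v1_account_tree(struct pressure *p,
				   unsigned long scanned,
				   unsigned long reclaimed) {}
#endif

/* Shared caller: the source is identical with or without HAVE_V1. */
static void account(struct pressure *p, bool tree,
		    unsigned long scanned, unsigned long reclaimed)
{
	if (tree) {
		v1_account_tree(p, scanned, reclaimed);
	} else {
		p->scanned += scanned;
		p->reclaimed += reclaimed;
	}
}
```

Built without HAVE_V1, the tree branch does nothing and the compiler is free
to drop the call, which is the effect the paragraph above describes for
vmpressure_prio() in vmscan.c.
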
>
> The only #ifdef CONFIG_MEMCG_V1 in source remains around the v1-only
> fields inside struct vmpressure itself.
>
> Memory savings on CONFIG_MEMCG_V1=n (measured with pahole):
>
> struct vmpressure : 112B -> 24B
> struct mem_cgroup : 1664B -> 1536B
>
> This split is the first step toward eventually making vmpressure
> CONFIG_MEMCG_V1 only. The v2 in-kernel socket pressure path
> (tree=false) cannot be removed immediately: PSI is not an
> exact replacement for vmpressure, and switching networking socket-buffer
> back-off to PSI may regress networking performance or increase memory
> pressure in workloads that today rely on vmpressure's hysteresis. The
> medium-term plan is to introduce a PSI-based socket-pressure path, keep
> vmpressure available for v2 behind a defconfig as an opt-out for several
> releases, and only then drop the tree=false path entirely, at which point
> the vmpressure block in mm/memcontrol-v1.c is all that remains of the
> subsystem.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Shakeel had acked the previous version, but I forgot to carry it over,
sorry about that!
> ---
> include/linux/vmpressure.h | 46 +++++-
> mm/memcontrol-v1.c | 292 +++++++++++++++++++++++++++++++++++++
> mm/vmpressure.c | 292 ++-----------------------------------
> 3 files changed, 343 insertions(+), 287 deletions(-)
>
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index faecd5522401..b4d13457bc2a 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -13,18 +13,31 @@
> struct vmpressure {
> unsigned long scanned;
> unsigned long reclaimed;
> + /* The lock is used to keep the scanned/reclaimed in sync. */
> + spinlock_t sr_lock;
>
> +#ifdef CONFIG_MEMCG_V1
> + /*
> + * tree=true accumulators feed the v1 userspace eventfd interface
> + * (memory.pressure_level). Drained by @work. v2 has no equivalent
> + * interface, so this state is omitted on CONFIG_MEMCG_V1=n builds.
> + */
> unsigned long tree_scanned;
> unsigned long tree_reclaimed;
> - /* The lock is used to keep the scanned/reclaimed above in sync. */
> - spinlock_t sr_lock;
> -
> /* The list of vmpressure_event structs. */
> struct list_head events;
> /* Have to grab the lock on events traversal or modifications. */
> struct mutex events_lock;
>
> struct work_struct work;
> +#endif
> +};
> +
> +enum vmpressure_levels {
> + VMPRESSURE_LOW = 0,
> + VMPRESSURE_MEDIUM,
> + VMPRESSURE_CRITICAL,
> + VMPRESSURE_NUM_LEVELS,
> };
>
> struct mem_cgroup;
> @@ -32,18 +45,41 @@ struct mem_cgroup;
> #ifdef CONFIG_MEMCG
> void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
> unsigned long scanned, unsigned long reclaimed);
> -extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
> -
> extern void vmpressure_init(struct vmpressure *vmpr);
> extern void vmpressure_cleanup(struct vmpressure *vmpr);
> extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg);
> extern struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr);
> +
> +/* Shared with the v1 vmpressure block in mm/memcontrol-v1.c. */
> +extern const unsigned long vmpressure_win;
> +extern enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> + unsigned long reclaimed);
> +
> +#ifdef CONFIG_MEMCG_V1
> +extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
> extern int vmpressure_register_event(struct mem_cgroup *memcg,
> struct eventfd_ctx *eventfd,
> const char *args);
> extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
> struct eventfd_ctx *eventfd);
> +
> +/* v1 hooks called from mm/vmpressure.c; no-ops below when !MEMCG_V1. */
> +extern void vmpressure_v1_init(struct vmpressure *vmpr);
> +extern void vmpressure_v1_cleanup(struct vmpressure *vmpr);
> +extern void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> + unsigned long scanned,
> + unsigned long reclaimed);
> #else
> +static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
> + int prio) {}
> +static inline void vmpressure_v1_init(struct vmpressure *vmpr) {}
> +static inline void vmpressure_v1_cleanup(struct vmpressure *vmpr) {}
> +static inline void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> + unsigned long scanned,
> + unsigned long reclaimed) {}
> +#endif /* CONFIG_MEMCG_V1 */
> +
> +#else /* !CONFIG_MEMCG */
> static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg,
> bool tree, unsigned long scanned,
> unsigned long reclaimed) {}
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 765069211567..135622b6172b 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -6,6 +6,7 @@
> #include <linux/pagewalk.h>
> #include <linux/backing-dev.h>
> #include <linux/eventfd.h>
> +#include <linux/log2.h>
> #include <linux/poll.h>
> #include <linux/sort.h>
> #include <linux/file.h>
> @@ -1476,6 +1477,297 @@ void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
> mem_cgroup_oom_unlock(memcg);
> }
>
> +/*
> + * cgroup v1 userspace vmpressure interface (memory.pressure_level /
> + * cgroup.event_control). Kept here so v2-only kernels (CONFIG_MEMCG_V1=n)
> + * drop the whole eventfd accumulator, its work item, and the per-memcg
> + * state it requires.
> + *
> + * When there are too little pages left to scan, vmpressure() may miss the
> + * critical pressure as number of pages will be less than "window size".
> + * However, in that case the vmscan priority will raise fast as the
> + * reclaimer will try to scan LRUs more deeply.
> + *
> + * The vmscan logic considers these special priorities:
> + *
> + * prio == DEF_PRIORITY (12): reclaimer starts with that value
> + * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
> + * prio == 0 : close to OOM, kernel scans every page in an lru
> + *
> + * Any value in this range is acceptable for this tunable (i.e. from 12 to
> + * 0). Current value for the vmpressure_level_critical_prio is chosen
> + * empirically, but the number, in essence, means that we consider
> + * critical level when scanning depth is ~10% of the lru size (vmscan
> + * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
> + * eights).
> + */
> +static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
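
To spell out the arithmetic in the comment above: ilog2(100 / 10) is
ilog2(10), which is 3, and a scan depth of 'lru_size >> 3' is one eighth of
the LRU. A small userspace sketch follows; ilog2_sketch() and scan_depth()
are hypothetical helpers, not the kernel's ilog2() macro or vmscan code.

```c
/* Userspace stand-in for ilog2(): floor(log2(n)) for n > 0. */
static unsigned int ilog2_sketch(unsigned long n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Scan depth at a given priority, per the comment: lru_size >> prio. */
static unsigned long scan_depth(unsigned long lru_size, unsigned int prio)
{
	return lru_size >> prio;
}
```

With an 8000-page LRU, scan_depth(8000, ilog2_sketch(100 / 10)) is 1000
pages, i.e. 12.5% rather than the nominal 10%.
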
> +
> +enum vmpressure_modes {
> + VMPRESSURE_NO_PASSTHROUGH = 0,
> + VMPRESSURE_HIERARCHY,
> + VMPRESSURE_LOCAL,
> + VMPRESSURE_NUM_MODES,
> +};
> +
> +static const char * const vmpressure_str_levels[] = {
> + [VMPRESSURE_LOW] = "low",
> + [VMPRESSURE_MEDIUM] = "medium",
> + [VMPRESSURE_CRITICAL] = "critical",
> +};
> +
> +static const char * const vmpressure_str_modes[] = {
> + [VMPRESSURE_NO_PASSTHROUGH] = "default",
> + [VMPRESSURE_HIERARCHY] = "hierarchy",
> + [VMPRESSURE_LOCAL] = "local",
> +};
> +
> +struct vmpressure_event {
> + struct eventfd_ctx *efd;
> + enum vmpressure_levels level;
> + enum vmpressure_modes mode;
> + struct list_head node;
> +};
> +
> +static struct vmpressure *work_to_vmpressure(struct work_struct *work)
> +{
> + return container_of(work, struct vmpressure, work);
> +}
> +
> +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> +{
> + struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
> +
> + memcg = parent_mem_cgroup(memcg);
> + if (!memcg)
> + return NULL;
> + return memcg_to_vmpressure(memcg);
> +}
> +
> +static bool vmpressure_event(struct vmpressure *vmpr,
> + const enum vmpressure_levels level,
> + bool ancestor, bool signalled)
> +{
> + struct vmpressure_event *ev;
> + bool ret = false;
> +
> + mutex_lock(&vmpr->events_lock);
> + list_for_each_entry(ev, &vmpr->events, node) {
> + if (ancestor && ev->mode == VMPRESSURE_LOCAL)
> + continue;
> + if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
> + continue;
> + if (level < ev->level)
> + continue;
> + eventfd_signal(ev->efd);
> + ret = true;
> + }
> + mutex_unlock(&vmpr->events_lock);
> +
> + return ret;
> +}
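
The three 'continue' checks in vmpressure_event() decide which listeners get
signalled as the work function walks up the parent chain. A sketch of that
predicate as a pure function, assuming the same level and mode ordering as
the enums in this patch; should_signal() and the shortened enum names are
hypothetical.

```c
#include <stdbool.h>

enum level { LOW = 0, MEDIUM, CRITICAL };
enum mode { NO_PASSTHROUGH = 0, HIERARCHY, LOCAL };

/*
 * ancestor:  we are visiting a parent of the memcg that saw the pressure.
 * signalled: some listener lower in the hierarchy was already notified.
 */
static bool should_signal(enum level level, enum level ev_level,
			  enum mode ev_mode, bool ancestor, bool signalled)
{
	/* "local" listeners only care about their own memcg. */
	if (ancestor && ev_mode == LOCAL)
		return false;
	/* Default mode stops propagating once anyone has been notified. */
	if (signalled && ev_mode == NO_PASSTHROUGH)
		return false;
	/* The listener asked for a higher level than what was measured. */
	if (level < ev_level)
		return false;
	return true;
}
```

So a "hierarchy" listener on an ancestor is notified even after a child
listener fired, while a default-mode listener on the same ancestor is not.
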
> +
> +static void vmpressure_work_fn(struct work_struct *work)
> +{
> + struct vmpressure *vmpr = work_to_vmpressure(work);
> + unsigned long scanned;
> + unsigned long reclaimed;
> + enum vmpressure_levels level;
> + bool ancestor = false;
> + bool signalled = false;
> +
> + spin_lock(&vmpr->sr_lock);
> + /*
> + * Several contexts might be calling vmpressure(), so it is
> + * possible that the work was rescheduled again before the old
> + * work context cleared the counters. In that case we will run
> + * just after the old work returns, but then scanned might be zero
> + * here. No need for any locks here since we don't care if
> + * vmpr->reclaimed is in sync.
> + */
> + scanned = vmpr->tree_scanned;
> + if (!scanned) {
> + spin_unlock(&vmpr->sr_lock);
> + return;
> + }
> +
> + reclaimed = vmpr->tree_reclaimed;
> + vmpr->tree_scanned = 0;
> + vmpr->tree_reclaimed = 0;
> + spin_unlock(&vmpr->sr_lock);
> +
> + level = vmpressure_calc_level(scanned, reclaimed);
> +
> + do {
> + if (vmpressure_event(vmpr, level, ancestor, signalled))
> + signalled = true;
> + ancestor = true;
> + } while ((vmpr = vmpressure_parent(vmpr)));
> +}
> +
> +/*
> + * Tree-mode accumulator: accumulate per-memcg scanned/reclaimed and
> + * schedule the work that walks the parent chain and signals registered
> + * eventfd listeners once we cross the window threshold.
> + */
> +void vmpressure_v1_account_tree(struct vmpressure *vmpr,
> + unsigned long scanned,
> + unsigned long reclaimed)
> +{
> + spin_lock(&vmpr->sr_lock);
> + scanned = vmpr->tree_scanned += scanned;
> + vmpr->tree_reclaimed += reclaimed;
> + spin_unlock(&vmpr->sr_lock);
> +
> + if (scanned < vmpressure_win)
> + return;
> + schedule_work(&vmpr->work);
> +}
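
The accumulator above and vmpressure_work_fn() form an accumulate-then-drain
pair: the producer adds to the counters and schedules work once the window is
reached, and the work function snapshots and resets them. A single-threaded
sketch of that split, with the spinlock omitted; struct tree_acc, acc_add()
and acc_drain() are hypothetical, and WINDOW uses the 512-page value from the
comment in mm/vmpressure.c.

```c
#include <stdbool.h>

#define WINDOW 512UL	/* SWAP_CLUSTER_MAX * 16, per the vmpressure.c comment */

struct tree_acc {
	unsigned long scanned;
	unsigned long reclaimed;
};

/* Producer side: returns true when the caller should schedule the drain. */
static bool acc_add(struct tree_acc *a, unsigned long scanned,
		    unsigned long reclaimed)
{
	a->scanned += scanned;
	a->reclaimed += reclaimed;
	return a->scanned >= WINDOW;
}

/*
 * Consumer side: snapshot and reset. Returns false if an earlier drain
 * already consumed the counters, mirroring the !scanned early return.
 */
static bool acc_drain(struct tree_acc *a, unsigned long *scanned,
		      unsigned long *reclaimed)
{
	if (!a->scanned)
		return false;
	*scanned = a->scanned;
	*reclaimed = a->reclaimed;
	a->scanned = 0;
	a->reclaimed = 0;
	return true;
}
```

The second drain returning false corresponds to the case where the work was
queued twice but the first run already cleared the counters.
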
> +
> +void vmpressure_v1_init(struct vmpressure *vmpr)
> +{
> + mutex_init(&vmpr->events_lock);
> + INIT_LIST_HEAD(&vmpr->events);
> + INIT_WORK(&vmpr->work, vmpressure_work_fn);
> +}
> +
> +void vmpressure_v1_cleanup(struct vmpressure *vmpr)
> +{
> + /*
> + * Make sure there is no pending work before eventfd infrastructure
> + * goes away.
> + */
> + flush_work(&vmpr->work);
> +}
> +
> +/**
> + * vmpressure_prio() - Account memory pressure through reclaimer priority level
> + * @gfp: reclaimer's gfp mask
> + * @memcg: cgroup memory controller handle
> + * @prio: reclaimer's priority
> + *
> + * This function should be called from the reclaim path every time when
> + * the vmscan's reclaiming priority (scanning depth) changes.
> + *
> + * This function does not return any value.
> + */
> +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> +{
> + /*
> + * We only use prio for accounting critical level. For more info
> + * see comment for vmpressure_level_critical_prio variable above.
> + */
> + if (prio > vmpressure_level_critical_prio)
> + return;
> +
> + /*
> + * OK, the prio is below the threshold, updating vmpressure
> + * information before shrinker dives into long shrinking of long
> + * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
> + * to the vmpressure() basically means that we signal 'critical'
> + * level.
> + */
> + vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
> +}
> +
> +#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
> +
> +/**
> + * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
> + * @memcg: memcg that is interested in vmpressure notifications
> + * @eventfd: eventfd context to link notifications with
> + * @args: event arguments (pressure level threshold, optional mode)
> + *
> + * This function associates eventfd context with the vmpressure
> + * infrastructure, so that the notifications will be delivered to the
> + * @eventfd. The @args parameter is a comma-delimited string that denotes a
> + * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
> + * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
> + * "hierarchy" or "local").
> + *
> + * To be used as memcg event method.
> + *
> + * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
> + * not be parsed.
> + */
> +int vmpressure_register_event(struct mem_cgroup *memcg,
> + struct eventfd_ctx *eventfd, const char *args)
> +{
> + struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> + struct vmpressure_event *ev;
> + enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
> + enum vmpressure_levels level;
> + char *spec, *spec_orig;
> + char *token;
> + int ret = 0;
> +
> + spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
> + if (!spec)
> + return -ENOMEM;
> +
> + /* Find required level */
> + token = strsep(&spec, ",");
> + ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
> + if (ret < 0)
> + goto out;
> + level = ret;
> +
> + /* Find optional mode */
> + token = strsep(&spec, ",");
> + if (token) {
> + ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
> + if (ret < 0)
> + goto out;
> + mode = ret;
> + }
> +
> + ev = kzalloc_obj(*ev);
> + if (!ev) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ev->efd = eventfd;
> + ev->level = level;
> + ev->mode = mode;
> +
> + mutex_lock(&vmpr->events_lock);
> + list_add(&ev->node, &vmpr->events);
> + mutex_unlock(&vmpr->events_lock);
> + ret = 0;
> +out:
> + kfree(spec_orig);
> + return ret;
> +}
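
For reference, the "level[,mode]" parsing above can be sketched in userspace
as follows. match(), next_token() and parse_args() are hypothetical
stand-ins for match_string(), strsep() and the parsing half of
vmpressure_register_event(); the sketch returns -1 where the kernel returns
-EINVAL, and it modifies the buffer in place just as strsep() does.

```c
#include <stddef.h>
#include <string.h>

static const char * const levels[] = { "low", "medium", "critical" };
static const char * const modes[] = { "default", "hierarchy", "local" };

/* Index of the exact match in tbl, or -1 if there is none. */
static int match(const char * const *tbl, size_t n, const char *s)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(tbl[i], s) == 0)
			return (int)i;
	return -1;
}

/* strsep()-like tokenizer for a single ',' delimiter. */
static char *next_token(char **cursor)
{
	char *tok = *cursor;
	char *comma;

	if (!tok)
		return NULL;
	comma = strchr(tok, ',');
	if (comma) {
		*comma = '\0';
		*cursor = comma + 1;
	} else {
		*cursor = NULL;
	}
	return tok;
}

/* Parse "level[,mode]"; mode defaults to 0 ("default"). 0 on success. */
static int parse_args(char *buf, int *level, int *mode)
{
	char *cursor = buf;
	char *tok = next_token(&cursor);
	int ret;

	if (!tok)
		return -1;
	ret = match(levels, 3, tok);
	if (ret < 0)
		return -1;
	*level = ret;

	*mode = 0;
	tok = next_token(&cursor);
	if (tok) {
		ret = match(modes, 3, tok);
		if (ret < 0)
			return -1;
		*mode = ret;
	}
	return 0;
}
```

Given a writable buffer holding "critical,hierarchy", parse_args() yields
level 2 and mode 1; a bare "low" yields level 0 with the default mode.
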
> +
> +/**
> + * vmpressure_unregister_event() - Unbind eventfd from vmpressure
> + * @memcg: memcg handle
> + * @eventfd: eventfd context that was used to link vmpressure with the @cg
> + *
> + * This function does internal manipulations to detach the @eventfd from
> + * the vmpressure notifications, and then frees internal resources
> + * associated with the @eventfd (but the @eventfd itself is not freed).
> + *
> + * To be used as memcg event method.
> + */
> +void vmpressure_unregister_event(struct mem_cgroup *memcg,
> + struct eventfd_ctx *eventfd)
> +{
> + struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> + struct vmpressure_event *ev;
> +
> + mutex_lock(&vmpr->events_lock);
> + list_for_each_entry(ev, &vmpr->events, node) {
> + if (ev->efd != eventfd)
> + continue;
> + list_del(&ev->node);
> + kfree(ev);
> + break;
> + }
> + mutex_unlock(&vmpr->events_lock);
> +}
> +
> static DEFINE_MUTEX(memcg_max_mutex);
>
> static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index c82cee1ab43b..14470141bbe6 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -7,16 +7,15 @@
> *
> * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * Tree-mode (cgroup v1 userspace eventfd) bookkeeping lives in
> + * mm/memcontrol-v1.c; this file holds the shared code and the in-kernel
> + * (tree=false) socket-pressure path that runs on cgroup v2.
> */
>
> #include <linux/cgroup.h>
> -#include <linux/fs.h>
> #include <linux/log2.h>
> -#include <linux/sched.h>
> #include <linux/mm.h>
> -#include <linux/vmstat.h>
> -#include <linux/eventfd.h>
> -#include <linux/slab.h>
> #include <linux/swap.h>
> #include <linux/printk.h>
> #include <linux/vmpressure.h>
> @@ -35,7 +34,7 @@
> * TODO: Make the window size depend on machine size, as we do for vmstat
> * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
> */
> -static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
>
> /*
> * These thresholds are used when we account memory pressure through
> @@ -46,68 +45,6 @@ static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
> static const unsigned int vmpressure_level_med = 60;
> static const unsigned int vmpressure_level_critical = 95;
>
> -/*
> - * When there are too little pages left to scan, vmpressure() may miss the
> - * critical pressure as number of pages will be less than "window size".
> - * However, in that case the vmscan priority will raise fast as the
> - * reclaimer will try to scan LRUs more deeply.
> - *
> - * The vmscan logic considers these special priorities:
> - *
> - * prio == DEF_PRIORITY (12): reclaimer starts with that value
> - * prio <= DEF_PRIORITY - 2 : kswapd becomes somewhat overwhelmed
> - * prio == 0 : close to OOM, kernel scans every page in an lru
> - *
> - * Any value in this range is acceptable for this tunable (i.e. from 12 to
> - * 0). Current value for the vmpressure_level_critical_prio is chosen
> - * empirically, but the number, in essence, means that we consider
> - * critical level when scanning depth is ~10% of the lru size (vmscan
> - * scans 'lru_size >> prio' pages, so it is actually 12.5%, or one
> - * eights).
> - */
> -static const unsigned int vmpressure_level_critical_prio = ilog2(100 / 10);
> -
> -static struct vmpressure *work_to_vmpressure(struct work_struct *work)
> -{
> - return container_of(work, struct vmpressure, work);
> -}
> -
> -static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
> -{
> - struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr);
> -
> - memcg = parent_mem_cgroup(memcg);
> - if (!memcg)
> - return NULL;
> - return memcg_to_vmpressure(memcg);
> -}
> -
> -enum vmpressure_levels {
> - VMPRESSURE_LOW = 0,
> - VMPRESSURE_MEDIUM,
> - VMPRESSURE_CRITICAL,
> - VMPRESSURE_NUM_LEVELS,
> -};
> -
> -enum vmpressure_modes {
> - VMPRESSURE_NO_PASSTHROUGH = 0,
> - VMPRESSURE_HIERARCHY,
> - VMPRESSURE_LOCAL,
> - VMPRESSURE_NUM_MODES,
> -};
> -
> -static const char * const vmpressure_str_levels[] = {
> - [VMPRESSURE_LOW] = "low",
> - [VMPRESSURE_MEDIUM] = "medium",
> - [VMPRESSURE_CRITICAL] = "critical",
> -};
> -
> -static const char * const vmpressure_str_modes[] = {
> - [VMPRESSURE_NO_PASSTHROUGH] = "default",
> - [VMPRESSURE_HIERARCHY] = "hierarchy",
> - [VMPRESSURE_LOCAL] = "local",
> -};
> -
> static enum vmpressure_levels vmpressure_level(unsigned long pressure)
> {
> if (pressure >= vmpressure_level_critical)
> @@ -117,8 +54,8 @@ static enum vmpressure_levels vmpressure_level(unsigned long pressure)
> return VMPRESSURE_LOW;
> }
>
> -static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> - unsigned long reclaimed)
> +enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> + unsigned long reclaimed)
> {
> unsigned long scale = scanned + reclaimed;
> unsigned long pressure = 0;
> @@ -147,74 +84,6 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> return vmpressure_level(pressure);
> }
>
> -struct vmpressure_event {
> - struct eventfd_ctx *efd;
> - enum vmpressure_levels level;
> - enum vmpressure_modes mode;
> - struct list_head node;
> -};
> -
> -static bool vmpressure_event(struct vmpressure *vmpr,
> - const enum vmpressure_levels level,
> - bool ancestor, bool signalled)
> -{
> - struct vmpressure_event *ev;
> - bool ret = false;
> -
> - mutex_lock(&vmpr->events_lock);
> - list_for_each_entry(ev, &vmpr->events, node) {
> - if (ancestor && ev->mode == VMPRESSURE_LOCAL)
> - continue;
> - if (signalled && ev->mode == VMPRESSURE_NO_PASSTHROUGH)
> - continue;
> - if (level < ev->level)
> - continue;
> - eventfd_signal(ev->efd);
> - ret = true;
> - }
> - mutex_unlock(&vmpr->events_lock);
> -
> - return ret;
> -}
> -
> -static void vmpressure_work_fn(struct work_struct *work)
> -{
> - struct vmpressure *vmpr = work_to_vmpressure(work);
> - unsigned long scanned;
> - unsigned long reclaimed;
> - enum vmpressure_levels level;
> - bool ancestor = false;
> - bool signalled = false;
> -
> - spin_lock(&vmpr->sr_lock);
> - /*
> - * Several contexts might be calling vmpressure(), so it is
> - * possible that the work was rescheduled again before the old
> - * work context cleared the counters. In that case we will run
> - * just after the old work returns, but then scanned might be zero
> - * here. No need for any locks here since we don't care if
> - * vmpr->reclaimed is in sync.
> - */
> - scanned = vmpr->tree_scanned;
> - if (!scanned) {
> - spin_unlock(&vmpr->sr_lock);
> - return;
> - }
> -
> - reclaimed = vmpr->tree_reclaimed;
> - vmpr->tree_scanned = 0;
> - vmpr->tree_reclaimed = 0;
> - spin_unlock(&vmpr->sr_lock);
> -
> - level = vmpressure_calc_level(scanned, reclaimed);
> -
> - do {
> - if (vmpressure_event(vmpr, level, ancestor, signalled))
> - signalled = true;
> - ancestor = true;
> - } while ((vmpr = vmpressure_parent(vmpr)));
> -}
> -
> /**
> * vmpressure() - Account memory pressure through scanned/reclaimed ratio
> * @gfp: reclaimer's gfp mask
> @@ -283,14 +152,7 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
> return;
>
> if (tree) {
> - spin_lock(&vmpr->sr_lock);
> - scanned = vmpr->tree_scanned += scanned;
> - vmpr->tree_reclaimed += reclaimed;
> - spin_unlock(&vmpr->sr_lock);
> -
> - if (scanned < vmpressure_win)
> - return;
> - schedule_work(&vmpr->work);
> + vmpressure_v1_account_tree(vmpr, scanned, reclaimed);
> } else {
> enum vmpressure_levels level;
>
> @@ -332,134 +194,6 @@ void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree,
> }
> }
>
> -/**
> - * vmpressure_prio() - Account memory pressure through reclaimer priority level
> - * @gfp: reclaimer's gfp mask
> - * @memcg: cgroup memory controller handle
> - * @prio: reclaimer's priority
> - *
> - * This function should be called from the reclaim path every time when
> - * the vmscan's reclaiming priority (scanning depth) changes.
> - *
> - * This function does not return any value.
> - */
> -void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> -{
> - /*
> - * We only use prio for accounting critical level. For more info
> - * see comment for vmpressure_level_critical_prio variable above.
> - */
> - if (prio > vmpressure_level_critical_prio)
> - return;
> -
> - /*
> - * OK, the prio is below the threshold, updating vmpressure
> - * information before shrinker dives into long shrinking of long
> - * range vmscan. Passing scanned = vmpressure_win, reclaimed = 0
> - * to the vmpressure() basically means that we signal 'critical'
> - * level.
> - */
> - vmpressure(gfp, 0, memcg, true, vmpressure_win, 0);
> -}
> -
> -#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
> -
> -/**
> - * vmpressure_register_event() - Bind vmpressure notifications to an eventfd
> - * @memcg: memcg that is interested in vmpressure notifications
> - * @eventfd: eventfd context to link notifications with
> - * @args: event arguments (pressure level threshold, optional mode)
> - *
> - * This function associates eventfd context with the vmpressure
> - * infrastructure, so that the notifications will be delivered to the
> - * @eventfd. The @args parameter is a comma-delimited string that denotes a
> - * pressure level threshold (one of vmpressure_str_levels, i.e. "low", "medium",
> - * or "critical") and an optional mode (one of vmpressure_str_modes, i.e.
> - * "hierarchy" or "local").
> - *
> - * To be used as memcg event method.
> - *
> - * Return: 0 on success, -ENOMEM on memory failure or -EINVAL if @args could
> - * not be parsed.
> - */
> -int vmpressure_register_event(struct mem_cgroup *memcg,
> - struct eventfd_ctx *eventfd, const char *args)
> -{
> - struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> - struct vmpressure_event *ev;
> - enum vmpressure_modes mode = VMPRESSURE_NO_PASSTHROUGH;
> - enum vmpressure_levels level;
> - char *spec, *spec_orig;
> - char *token;
> - int ret = 0;
> -
> - spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
> - if (!spec)
> - return -ENOMEM;
> -
> - /* Find required level */
> - token = strsep(&spec, ",");
> - ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
> - if (ret < 0)
> - goto out;
> - level = ret;
> -
> - /* Find optional mode */
> - token = strsep(&spec, ",");
> - if (token) {
> - ret = match_string(vmpressure_str_modes, VMPRESSURE_NUM_MODES, token);
> - if (ret < 0)
> - goto out;
> - mode = ret;
> - }
> -
> - ev = kzalloc_obj(*ev);
> - if (!ev) {
> - ret = -ENOMEM;
> - goto out;
> - }
> -
> - ev->efd = eventfd;
> - ev->level = level;
> - ev->mode = mode;
> -
> - mutex_lock(&vmpr->events_lock);
> - list_add(&ev->node, &vmpr->events);
> - mutex_unlock(&vmpr->events_lock);
> - ret = 0;
> -out:
> - kfree(spec_orig);
> - return ret;
> -}
> -
> -/**
> - * vmpressure_unregister_event() - Unbind eventfd from vmpressure
> - * @memcg: memcg handle
> - * @eventfd: eventfd context that was used to link vmpressure with the @cg
> - *
> - * This function does internal manipulations to detach the @eventfd from
> - * the vmpressure notifications, and then frees internal resources
> - * associated with the @eventfd (but the @eventfd itself is not freed).
> - *
> - * To be used as memcg event method.
> - */
> -void vmpressure_unregister_event(struct mem_cgroup *memcg,
> - struct eventfd_ctx *eventfd)
> -{
> - struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
> - struct vmpressure_event *ev;
> -
> - mutex_lock(&vmpr->events_lock);
> - list_for_each_entry(ev, &vmpr->events, node) {
> - if (ev->efd != eventfd)
> - continue;
> - list_del(&ev->node);
> - kfree(ev);
> - break;
> - }
> - mutex_unlock(&vmpr->events_lock);
> -}
> -
> /**
> * vmpressure_init() - Initialize vmpressure control structure
> * @vmpr: Structure to be initialized
> @@ -470,9 +204,7 @@ void vmpressure_unregister_event(struct mem_cgroup *memcg,
> void vmpressure_init(struct vmpressure *vmpr)
> {
> spin_lock_init(&vmpr->sr_lock);
> - mutex_init(&vmpr->events_lock);
> - INIT_LIST_HEAD(&vmpr->events);
> - INIT_WORK(&vmpr->work, vmpressure_work_fn);
> + vmpressure_v1_init(vmpr);
> }
>
> /**
> @@ -484,9 +216,5 @@ void vmpressure_init(struct vmpressure *vmpr)
> */
> void vmpressure_cleanup(struct vmpressure *vmpr)
> {
> - /*
> - * Make sure there is no pending work before eventfd infrastructure
> - * goes away.
> - */
> - flush_work(&vmpr->work);
> + vmpressure_v1_cleanup(vmpr);
> }
> --
> 2.53.0-Meta
>
>
Thread overview: 7+ messages
2026-06-30 11:23 [PATCH v3 0/2] mm/vmpressure: reduce CPU, memory and code overhead on cgroup v2 Usama Arif
2026-06-30 11:23 ` [PATCH v3 1/2] mm/vmpressure: skip tree=true accounting " Usama Arif
2026-06-30 16:07 ` Johannes Weiner
2026-06-30 16:30 ` Usama Arif
2026-06-30 11:23 ` [PATCH v3 2/2] mm/vmpressure: move v1 userspace eventfd code into memcontrol-v1.c Usama Arif
2026-06-30 12:32 ` Usama Arif [this message]
2026-06-30 14:21 ` Shakeel Butt