From: Youngjun Park <her0gyugyu@gmail.com>
To: akpm@linux-foundation.org
Cc: chrisl@kernel.org, youngjun.park@lge.com, linux-mm@kvack.org,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
kasong@tencent.com, hannes@cmpxchg.org, mhocko@kernel.org,
roman.gushchin@linux.dev, shakeel.butt@linux.dev,
muchun.song@linux.dev, shikemeng@huaweicloud.com,
nphamcs@gmail.com, baoquan.he@linux.dev, baohua@kernel.org,
yosry@kernel.org, gunho.lee@lge.com, taejoon.song@lge.com,
hyungjun.cho@lge.com, mkoutny@suse.com, baver.bae@lge.com,
matia.kim@lge.com
Subject: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
Date: Sun, 21 Jun 2026 03:16:28 +0900 [thread overview]
Message-ID: <20260620181635.299364-4-youngjun.park@lge.com> (raw)
In-Reply-To: <20260620181635.299364-1-youngjun.park@lge.com>
Introduce memory.swap.tiers.max, a flat-keyed file listing each
tier defined in /sys/kernel/mm/swap/tiers with its state, "max"
(allowed, the default) or "0" (disabled). A tier is one bit in the
cgroup's tier mask, so writing "<tier> max" or "<tier> 0" sets or
clears that bit.
Since the current use case lacks amount control, it only supports
"max" (on) and "0" (off). Therefore, it does not track per-tier swap
usage, relying instead on a fast runtime bitmask check.
We maintain both `mask` and `effective_mask`. The `effective_mask` is
strictly bounded by the parent (e.g., if a parent is "0", the child's
effective state is "0" even if its `mask` is "max"). Maintaining this
separately avoids costly cgroup tree traversals to check ancestors at
runtime.
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Suggested-by: Yosry Ahmed <yosry@kernel.org>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
Documentation/admin-guide/cgroup-v2.rst | 20 +++++
Documentation/mm/swap-tier.rst | 9 +++
include/linux/memcontrol.h | 5 ++
mm/memcontrol.c | 67 ++++++++++++++++
mm/swap_state.c | 5 +-
mm/swap_tier.c | 102 +++++++++++++++++++++++-
mm/swap_tier.h | 57 +++++++++++--
7 files changed, 255 insertions(+), 10 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..4843ffcfd110 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1850,6 +1850,26 @@ The following nested keys are defined.
Swap usage hard limit. If a cgroup's swap usage reaches this
limit, anonymous memory of the cgroup will not be swapped out.
+ memory.swap.tiers.max
+ A read-write flat-keyed file which exists on non-root
+ cgroups. The default is "max" for every tier.
+
+ Limits the swap tiers this cgroup may swap to. Tiers are
+ defined globally in /sys/kernel/mm/swap/tiers and listed here,
+ one per line. When read, the values are displayed in descending
+ order of the tiers (highest tier first)::
+
+ <tier_1> max
+ <tier_2> 0
+ ...
+
+ Currently, only "max" and "0" are supported. "max" allows the
+ tier, "0" disables it. Each write sets a single "<tier> max"
+ or "<tier> 0" pair.
+
+ A child may only narrow what its parent allows. A tier an
+ ancestor disabled stays disabled regardless of the value here.
+
memory.swap.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
diff --git a/Documentation/mm/swap-tier.rst b/Documentation/mm/swap-tier.rst
index 0fb4a1153a67..addbc495de8c 100644
--- a/Documentation/mm/swap-tier.rst
+++ b/Documentation/mm/swap-tier.rst
@@ -15,6 +15,15 @@ speed to fully utilize this feature. While the current implementation is
integrated with cgroups, the concept is designed to be extensible for other
subsystems in the future.
+Use case
+---------
+
+Users can perform selective swapping by choosing a swap tier assigned according
+to speed within a cgroup.
+
+For more information on cgroup v2, please refer to
+``Documentation/admin-guide/cgroup-v2.rst``.
+
Priority Range
--------------
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1f46a0016fc..d53826c68562 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -283,6 +283,11 @@ struct mem_cgroup {
struct lru_gen_mm_list mm_list;
#endif
+#ifdef CONFIG_SWAP
+ int tier_mask;
+ int tier_effective_mask;
+#endif
+
#ifdef CONFIG_MEMCG_V1
/* Legacy consumer-oriented counters */
struct page_counter kmem; /* v1 only */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56cd4af08232..63259576792a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -68,6 +68,7 @@
#include <net/ip.h>
#include "slab.h"
#include "memcontrol-v1.h"
+#include "swap_tier.h"
#include <linux/uaccess.h>
@@ -4244,6 +4245,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
refcount_set(&memcg->id.ref, 1);
css_get(css);
+ swap_tiers_memcg_inherit_mask(memcg);
+
/*
* Ensure mem_cgroup_from_private_id() works once we're fully online.
*
@@ -5785,6 +5788,64 @@ static int swap_events_show(struct seq_file *m, void *v)
return 0;
}
+static int swap_tier_max_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ swap_tiers_mask_show(m, memcg);
+ return 0;
+}
+
+static ssize_t swap_tier_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ char *pos, *name, *val;
+ bool enable;
+ int mask;
+ int ret = 0;
+
+ pos = strstrip(buf);
+ name = strsep(&pos, " \t\n");
+ if (!name || !*name)
+ return -EINVAL;
+ if (pos)
+ pos = skip_spaces(pos);
+ val = strsep(&pos, " \t\n");
+ if (!val || !*val)
+ return -EINVAL;
+ if (pos && *skip_spaces(pos))
+ return -EINVAL;
+
+ if (!strcmp(val, "max"))
+ enable = true;
+ else if (!strcmp(val, "0"))
+ enable = false;
+ else
+ return -EINVAL;
+
+ spin_lock(&swap_tier_lock);
+ mask = swap_tiers_mask_lookup(name);
+ if (!mask) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * tier_mask is set per memcg here; the effective mask is clamped
+ * to the parent's in swap_tiers_memcg_sync_mask().
+ */
+ if (enable)
+ WRITE_ONCE(memcg->tier_mask, memcg->tier_mask | mask);
+ else
+ WRITE_ONCE(memcg->tier_mask, memcg->tier_mask & ~mask);
+
+ swap_tiers_memcg_sync_mask(memcg);
+out:
+ spin_unlock(&swap_tier_lock);
+ return ret ? ret : nbytes;
+}
+
static struct cftype swap_files[] = {
{
.name = "swap.current",
@@ -5817,6 +5878,12 @@ static struct cftype swap_files[] = {
.file_offset = offsetof(struct mem_cgroup, swap_events_file),
.seq_show = swap_events_show,
},
+ {
+ .name = "swap.tiers.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = swap_tier_max_show,
+ .write = swap_tier_max_write,
+ },
{ } /* terminate */
};
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2f382d4dcbdc..712b225509cc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -1021,6 +1021,7 @@ static ssize_t tiers_store(struct kobject *kobj,
char *p, *token, *name, *tmp;
int ret = 0;
short prio;
+ int mask = 0;
tmp = kstrdup(buf, GFP_KERNEL);
if (!tmp)
@@ -1053,7 +1054,7 @@ static ssize_t tiers_store(struct kobject *kobj,
goto restore;
break;
case '-':
- ret = swap_tiers_remove(token + 1);
+ ret = swap_tiers_remove(token + 1, &mask);
if (ret)
goto restore;
break;
@@ -1063,7 +1064,7 @@ static ssize_t tiers_store(struct kobject *kobj,
}
}
- if (!swap_tiers_update()) {
+ if (!swap_tiers_update(mask)) {
ret = -EINVAL;
goto restore;
}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index 6b57cadb3e95..98bfee760b8d 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -253,7 +253,7 @@ int swap_tiers_add(const char *name, int prio)
return ret;
}
-int swap_tiers_remove(const char *name)
+int swap_tiers_remove(const char *name, int *mask)
{
int ret = 0;
struct swap_tier *tier;
@@ -276,6 +276,7 @@ int swap_tiers_remove(const char *name)
list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
swap_tier_inactivate(tier);
+ *mask |= TIER_MASK(tier);
return ret;
}
@@ -344,7 +345,24 @@ void swap_tiers_assign_dev(struct swap_info_struct *swp)
swp->tier_mask = TIER_DEFAULT_MASK;
}
-bool swap_tiers_update(void)
+#ifdef CONFIG_MEMCG
+static void swap_tier_memcg_propagate(int mask)
+{
+ struct mem_cgroup *child;
+
+ for_each_mem_cgroup_tree(child, root_mem_cgroup) {
+ WRITE_ONCE(child->tier_mask, child->tier_mask | mask);
+ WRITE_ONCE(child->tier_effective_mask,
+ child->tier_effective_mask | mask);
+ }
+}
+#else
+static void swap_tier_memcg_propagate(int mask)
+{
+}
+#endif
+
+bool swap_tiers_update(int mask)
{
struct swap_tier *tier;
struct swap_info_struct *swp;
@@ -375,5 +393,85 @@ bool swap_tiers_update(void)
swap_tiers_assign_dev(swp);
}
+ /*
+ * When a tier is removed, its index (bit position in the mask) becomes
+ * free for reassignment to a future tier. If a memcg had previously
+ * disabled this tier (cleared the bit in its swap.tiers.max file), the
+ * effective mask would keep that bit clear -- meaning the new tier at
+ * the same index would be silently unavailable, an invisible cgroup
+ * constraint left behind by a tier that no longer exists.
+ *
+ * To prevent this, OR the removed tier's mask bit into every memcg's
+ * tier_mask and tier_effective_mask. This resets the bit so the new
+ * tier is accessible by default; users who want to restrict it must
+ * explicitly disable it after the tier is re-created.
+ */
+ if (mask)
+ swap_tier_memcg_propagate(mask);
+
return true;
}
+
+#ifdef CONFIG_MEMCG
+void swap_tiers_mask_show(struct seq_file *m, struct mem_cgroup *memcg)
+{
+ struct swap_tier *tier;
+ int mask;
+
+ spin_lock(&swap_tier_lock);
+ mask = READ_ONCE(memcg->tier_mask);
+
+ for_each_active_tier(tier)
+ seq_printf(m, "%s %s\n", tier->name,
+ (mask & TIER_MASK(tier)) ? "max" : "0");
+ spin_unlock(&swap_tier_lock);
+}
+
+int swap_tiers_mask_lookup(const char *name)
+{
+ struct swap_tier *tier;
+
+ lockdep_assert_held(&swap_tier_lock);
+
+ for_each_active_tier(tier) {
+ if (!strcmp(name, tier->name))
+ return TIER_MASK(tier);
+ }
+
+ return 0;
+}
+
+static void __swap_tier_memcg_inherit_mask(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ int parent_mask = parent
+ ? READ_ONCE(parent->tier_effective_mask)
+ : TIER_ALL_MASK;
+
+ WRITE_ONCE(memcg->tier_effective_mask,
+ parent_mask & READ_ONCE(memcg->tier_mask));
+}
+
+/* Computes the initial effective mask from the parent's effective mask. */
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg)
+{
+ spin_lock(&swap_tier_lock);
+ memcg->tier_mask = TIER_ALL_MASK;
+ __swap_tier_memcg_inherit_mask(memcg, parent_mem_cgroup(memcg));
+ spin_unlock(&swap_tier_lock);
+}
+
+/*
+ * Called when a memcg's tier_mask is modified. Walks the subtree
+ * and recomputes each descendant's effective mask against its parent.
+ */
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *child;
+
+ lockdep_assert_held(&swap_tier_lock);
+
+ for_each_mem_cgroup_tree(child, memcg)
+ __swap_tier_memcg_inherit_mask(child, parent_mem_cgroup(child));
+}
+#endif
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index 3e355f857363..e2f0cf32035b 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -10,22 +10,67 @@ struct swap_info_struct;
extern spinlock_t swap_tier_lock;
-#define TIER_ALL_MASK (~0)
-#define TIER_DEFAULT_IDX (31)
-#define TIER_DEFAULT_MASK (1U << TIER_DEFAULT_IDX)
-
/* Initialization and application */
void swap_tiers_init(void);
ssize_t swap_tiers_sysfs_show(char *buf);
int swap_tiers_add(const char *name, int prio);
-int swap_tiers_remove(const char *name);
+int swap_tiers_remove(const char *name, int *mask);
void swap_tiers_snapshot(void);
void swap_tiers_snapshot_restore(void);
-bool swap_tiers_update(void);
+bool swap_tiers_update(int mask);
/* Tier assignment */
void swap_tiers_assign_dev(struct swap_info_struct *swp);
+#define TIER_ALL_MASK (~0)
+#define TIER_DEFAULT_IDX (31)
+#define TIER_DEFAULT_MASK (1U << TIER_DEFAULT_IDX)
+
+#if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG)
+/* Memcg related functions */
+void swap_tiers_mask_show(struct seq_file *m, struct mem_cgroup *memcg);
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg);
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg);
+int swap_tiers_mask_lookup(const char *name);
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+ struct mem_cgroup *memcg;
+ int mask = TIER_ALL_MASK;
+
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
+ if (memcg)
+ mask = READ_ONCE(memcg->tier_effective_mask);
+ rcu_read_unlock();
+
+ return mask;
+}
+#else
+static inline void swap_tiers_mask_show(struct seq_file *m,
+ struct mem_cgroup *memcg) {}
+static inline void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg) {}
+static inline void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg) {}
+static inline int swap_tiers_mask_lookup(const char *name)
+{
+ return 0;
+}
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+ return TIER_ALL_MASK;
+}
+#endif
+
+/**
+ * swap_tiers_mask_test - Check if the tier mask is valid
+ * @tier_mask: The tier mask to check
+ * @mask: The mask to compare against
+ *
+ * Return: true if condition matches, false otherwise
+ */
+static inline bool swap_tiers_mask_test(int tier_mask, int mask)
+{
+ return tier_mask & mask;
+}
#endif /* _SWAP_TIER_H */
--
2.48.1
next prev parent reply other threads:[~2026-06-20 18:17 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-20 18:16 [PATCH v9 0/6] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
2026-06-20 18:16 ` [PATCH v9 1/6] mm: swap: introduce swap tier infrastructure Youngjun Park
2026-06-20 18:16 ` [PATCH v9 2/6] mm: swap: associate swap devices with tiers Youngjun Park
2026-06-20 18:16 ` Youngjun Park [this message]
2026-06-22 5:03 ` [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection Youngjun Park
2026-06-22 21:21 ` Yosry Ahmed
2026-06-22 22:10 ` Joshua Hahn
2026-06-22 22:26 ` Yosry Ahmed
2026-06-22 23:19 ` Joshua Hahn
2026-06-22 23:46 ` Yosry Ahmed
2026-06-23 0:40 ` Joshua Hahn
2026-06-20 18:16 ` [PATCH v9 4/6] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
2026-06-20 18:16 ` [PATCH v9 5/6] selftests/mm: add a swap tier configuration test Youngjun Park
2026-06-20 18:16 ` [PATCH v9 6/6] selftests/cgroup: add a swap tier routing test Youngjun Park
2026-06-22 21:23 ` [PATCH v9 0/6] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Yosry Ahmed
2026-06-23 1:29 ` Youngjun Park
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260620181635.299364-4-youngjun.park@lge.com \
--to=her0gyugyu@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=baoquan.he@linux.dev \
--cc=baver.bae@lge.com \
--cc=cgroups@vger.kernel.org \
--cc=chrisl@kernel.org \
--cc=gunho.lee@lge.com \
--cc=hannes@cmpxchg.org \
--cc=hyungjun.cho@lge.com \
--cc=kasong@tencent.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=matia.kim@lge.com \
--cc=mhocko@kernel.org \
--cc=mkoutny@suse.com \
--cc=muchun.song@linux.dev \
--cc=nphamcs@gmail.com \
--cc=roman.gushchin@linux.dev \
--cc=shakeel.butt@linux.dev \
--cc=shikemeng@huaweicloud.com \
--cc=taejoon.song@lge.com \
--cc=yosry@kernel.org \
--cc=youngjun.park@lge.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.