* [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure
@ 2026-05-27  6:22 Youngjun Park
  2026-05-27  6:22 ` [PATCH v7 1/4] " Youngjun Park
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Youngjun Park @ 2026-05-27  6:22 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, youngjun.park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim

This is v7 of the swap tier series addressing review feedback.
The cover letter has been simplified.

I revisited the design (see Design Rationale). Since our use case
fits best with a memcg-based model, the implementation remains
within memcg and preserves its resource accounting semantics.

Alternatives considered:

1. A separate sysfs interface under swap. (Workable, but it would still
   need to reference memcg paths, and fully decoupling it would add
   swap-layer logic to manage memcgs, making it a secondary option.)

2. Making the feature non-default.

Other interfaces were also reviewed. Aside from sysfs and BPF,
the options involve trade-offs and are largely design choices.
BPF was excluded because it may be disabled on our embedded
platform, though future extension remains possible.

Overview
========

Swap Tiers group swap devices into performance classes (e.g. NVMe,
HDD, Network) and allow per-memcg selection of which tiers to use.
This mechanism was suggested by Chris Li.

Design Rationale
================

Swap tier selection is attached to memcg. A child cgroup may select a
subset of the parent's allowed tiers.

This approach:
- Preserves cgroup inheritance semantics (boundary at parent,
  refinement at child).
- Reuses memcg, which already groups processes and enforces
  hierarchical memory limits.
- Aligns with existing memcg swap controls (e.g. swap.max, zswap.writeback).
- Avoids introducing a parallel swap control hierarchy.
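
As a rough illustration of "boundary at parent, refinement at child",
the effective selection can be pictured as a bitwise AND of the masks
along the path to the root. This is only a userspace sketch: struct
tier_node and tier_effective_mask() are hypothetical names for this
example, not the memcg code in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cgroup node; not the kernel's struct mem_cgroup. */
struct tier_node {
	const struct tier_node *parent;	/* NULL for the root */
	uint32_t tier_mask;		/* tiers selected here, one bit per tier */
};

/*
 * Walk towards the root, AND-ing the masks, so that a child can only
 * narrow what its ancestors allow and never widen it.
 */
static uint32_t tier_effective_mask(const struct tier_node *node)
{
	uint32_t mask = UINT32_MAX;

	assert(node != NULL);
	for (; node; node = node->parent)
		mask &= node->tier_mask;
	return mask;
}
```

With a parent allowing tiers {0, 1} and a child asking for {1, 2}, the
sketch yields {1}: the request for tier 2 is dropped because the parent
does not allow it.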

Placing tier control outside memcg (e.g., via BPF, syscalls, or
madvise) would allow swap preference to diverge from the memcg
hierarchy. Integrating it into memcg keeps the swap policy
consistent with existing memory ownership semantics. There are
also real use cases built around memcg.

In the future, this can be extended to other interfaces to cover
additional use cases.

I believe a memcg-based swap control is a good starting point
before such extensions.

Use Cases
=========

#1: Latency separation (our primary deployment scenario)
  [ / ]
     |
     +-- latency-sensitive workload  (fast tier)
     +-- background workload         (slow tier)

The parent defines the memory boundary.
Each workload selects a swap tier via memory.swap.tiers according to
latency requirements.

This prevents latency-sensitive workloads from being swapped to
slow devices used by background workloads.

#2: Per-VM swap selection (Chris Li's deployment scenario)
  [ / ]
     |
     +-- [ Job on VM ]              (tiers: zswap, SSD)
            |
            +-- [ VMM guest memory ]  (tiers: SSD)

The parent (job) has access to both zswap and SSD tiers.
The child (VMM guest memory) selects SSD as its swap tier via
memory.swap.tiers. In this deployment, swap device selection
happens at the child level from the parent's available set.

#3: Tier isolation for reduced contention (hypothetical)
  [ / ]                    (tiers: A, B)
     |
     +-- workload X        (tiers: A)
     +-- workload Y        (tiers: B)

Each child uses a different tier. Since swap paths are separated
per tier, synchronization overhead between the two workloads is
reduced.

Future extension
================

#1: Intra-tier distribution policy:
  Currently, swap devices with the same priority are allocated in a
  round-robin fashion. Per-tier policy files under
  /sys/kernel/mm/swap/tiers/ could control how devices within a tier
  are selected (e.g. round-robin, weighted).

#2: Inter-tier promotion and demotion:
  Promotion and demotion apply between tiers, not within a single
  tier. The current interface defines only tier assignment; it does
  not yet define when or how pages move between tiers. Two triggering
  models are possible:

  (a) User-triggered: userspace explicitly initiates migration between
      tiers (e.g. via a new interface or existing move_pages semantics).
  (b) Kernel-triggered: the kernel moves pages between tiers at
      appropriate points such as reclaim or refault.

#3: Per-VMA, per-process swap and BPF:
  Beyond memcg-based selection, tiers could be extended to per-VMA or
  per-process swap, or be driven from a BPF program.

Experimentation
===============

Tested on our internal platform using NBD as a separate swap tier.
This is a simple use case from our first production deployment.

Without tiers:
- No selective control over flash wear
- Cannot selectively assign NBD to specific applications

Cold launch improvement (preloaded vs. baseline):
- App A: 13.17s -> 4.18s (68%)
- App B: 5.60s -> 1.12s (80%)
- App C: 10.25s -> 2.00s (80%)

Performance impact with no tiers configured:
<1% regression in kernel build and vm-scalability benchmarks

Change log
===========

v7
- Collect Baoquan's review tag
- Fix an improper comment per Baoquan's feedback
- Minor code adjustments per Baoquan's feedback.
- Rebase on recent mm-new
- v6 link: https://lore.kernel.org/linux-mm/20260421055323.940344-1-youngjun.park@lge.com/

v6
- Sashiko AI review fixes
 - Fix batch parsing error path to restore snapshot before exit
 - Reject overlong tier names to prevent truncated duplicates
 - Avoid restoring raw list_head via memcpy (stale pointer risk)
 - Ensure early parse errors do not skip DEF_SWAP_PRIO validation
 - Use (1U << TIER_DEFAULT_IDX) to avoid signed shift UB
 - Defer tier mask inheritance to css_online() to close race window
 - Add READ_ONCE()/WRITE_ONCE() for tier mask accesses
- Other fixes
 - Fix build error reintroduced by a missing v5 change (sorry about that)
 - Fix WARNING in folio_tier_effective_mask by adding rcu_read_lock() (syzbot CI fix)
 - Reduce the maximum number of swap tiers from 32 to 31 to reserve the last bit
 - Refine commit messages
 - Rebase on recent mm-new
- v5 link: https://lore.kernel.org/linux-mm/20260325175453.2523280-1-youngjun.park@lge.com/

v5
- Fixed build errors reported in v4
- Rebased on up-to-date mm-new
- Minor cleanups
- Added design docs with validation (from the discussion with Shakeel Butt)
- v4 link: https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/

v4
- Simplified control flow and indentation
- Added CONFIG option for MAX_SWAPTIER (default: 4)
- Added memory.swap.tiers.effective interface
- Reworked save/restore logic into snapshot/rollback model
- Removed tier priority modification support (deferred)
- Improved validation and fixed edge cases
- Rebased onto latest mm-new
- RFC v3 link: https://lore.kernel.org/linux-mm/20260131125454.3187546-1-youngjun.park@lge.com/

RFC v1 ~ v3
- Changed direction after discussion with Chris Li
- Applied some LPC feedback
- RFC v2 - https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
- RFC v1 - https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/

Earlier Approach (per cgroup swap priority)
- v1: https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
- RFC: https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d

Youngjun Park (4):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interfaces for swap tier selection
  mm: swap: filter swap allocation by memcg tier mask

 Documentation/admin-guide/cgroup-v2.rst |  29 ++
 Documentation/mm/index.rst              |   1 +
 Documentation/mm/swap-tier.rst          | 159 ++++++++
 MAINTAINERS                             |   3 +
 include/linux/memcontrol.h              |   5 +
 include/linux/swap.h                    |   1 +
 mm/Kconfig                              |  12 +
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  96 +++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  75 ++++
 mm/swap_tier.c                          | 482 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  75 ++++
 mm/swapfile.c                           |  20 +-
 14 files changed, 959 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 938bf00744a1b82cefd551f848a927cc24d5fb2f
-- 
2.34.1



* [PATCH v7 1/4] mm: swap: introduce swap tier infrastructure
  2026-05-27  6:22 [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure Youngjun Park
@ 2026-05-27  6:22 ` Youngjun Park
  2026-05-27  6:22 ` [PATCH v7 2/4] mm: swap: associate swap devices with tiers Youngjun Park
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Youngjun Park @ 2026-05-27  6:22 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, youngjun.park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim

This patch introduces the "Swap tier" concept, which serves as an
abstraction layer for managing swap devices based on their performance
characteristics (e.g., NVMe, HDD, Network swap).

Swap tiers are user-named groups representing priority ranges.
Tier names must consist of alphanumeric characters and underscores.
These tiers collectively cover the entire priority space from -1
(`DEF_SWAP_PRIO`) to `SHRT_MAX`.

To configure tiers, a new sysfs interface is exposed at
/sys/kernel/mm/swap/tiers. The input parser evaluates commands from
left to right and supports batch input, allowing users to add or remove
multiple tiers in a single write operation.
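
To make the batch syntax concrete, here is a minimal userspace sketch
of left-to-right parsing for input such as "+HDD:50, +NET:-1". It is
not the kernel's tiers_store(): struct tier_cmd and parse_tier_batch()
are hypothetical names, character validation of tier names is left
out, and a real caller would roll back on the first error.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of one parsed command; not a kernel type. */
struct tier_cmd {
	char op;		/* '+' to add, '-' to remove */
	char name[16];		/* tier name, NUL-terminated */
	short prio;		/* start priority, meaningful for '+' only */
};

/*
 * Parse a batch from left to right. Returns the number of commands
 * stored in out[], or -EINVAL on the first malformed token.
 */
static int parse_tier_batch(const char *input, struct tier_cmd *out, int max)
{
	static const char delims[] = ", \t\n";
	const char *p = input;
	int n = 0;

	assert(input != NULL && out != NULL);

	for (;;) {
		const char *colon;
		size_t len, nlen;
		char *end;
		long val;

		p += strspn(p, delims);		/* skip separators */
		if (!*p)
			return n;
		len = strcspn(p, delims);	/* length of this token */
		if (n == max || len < 2)
			return -EINVAL;

		out[n].op = p[0];
		if (p[0] == '+') {
			/* "+<name>:<start_priority>" */
			colon = memchr(p + 1, ':', len - 1);
			if (!colon || colon == p + 1)
				return -EINVAL;
			nlen = (size_t)(colon - (p + 1));
			errno = 0;
			val = strtol(colon + 1, &end, 10);
			if (end == colon + 1 || end != p + len || errno ||
			    val < SHRT_MIN || val > SHRT_MAX)
				return -EINVAL;
			out[n].prio = (short)val;
		} else if (p[0] == '-') {
			/* "-<name>" */
			nlen = len - 1;
			out[n].prio = 0;
		} else {
			return -EINVAL;
		}

		/* Reject overlong names instead of truncating them. */
		if (nlen >= sizeof(out[n].name))
			return -EINVAL;
		memcpy(out[n].name, p + 1, nlen);
		out[n].name[nlen] = '\0';
		n++;
		p += len;
	}
}
```

Commas, spaces, tabs and newlines are all treated as separators, so
"+HDD:50,+NET:-1" and "+HDD:50 +NET:-1" parse to the same two commands.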

Tier management enforces continuous priority ranges anchored by start
priorities. Operations trigger range splitting or merging, but overwriting
start priorities is forbidden. Merging expands lower tiers upwards to
preserve configured start priorities, except when removing `DEF_SWAP_PRIO`,
which merges downwards.
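
The range rules can be sketched in userspace with a plain array of
start priorities kept in descending order. This is an illustration
only: tier_end_prio() and tier_remove() are hypothetical helpers that
mirror the behaviour described above, not the list-based code in
mm/swap_tier.c.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEF_SWAP_PRIO (-1)

/*
 * starts[] holds tier start priorities sorted in descending order.
 * The end of tier i is one below the start of the next higher tier,
 * or SHRT_MAX for the top tier.
 */
static short tier_end_prio(const short *starts, size_t n, size_t i)
{
	assert(starts != NULL && i < n);
	return i == 0 ? SHRT_MAX : (short)(starts[i - 1] - 1);
}

/*
 * Remove tier i in place and return the new count. Dropping an entry
 * lets the tier below absorb the freed range (upward merge). If the
 * removed tier was the lowest one, the tier above takes over
 * DEF_SWAP_PRIO so the full range stays covered (downward merge).
 */
static size_t tier_remove(short *starts, size_t n, size_t i)
{
	size_t j;

	assert(starts != NULL && i < n);
	if (i == n - 1 && n > 1 && starts[i] == DEF_SWAP_PRIO)
		starts[i - 1] = DEF_SWAP_PRIO;
	for (j = i; j + 1 < n; j++)
		starts[j] = starts[j + 1];
	return n - 1;
}
```

Starting from start priorities {100, 50, -1}, removing the top tier
leaves {50, -1} with the 50 tier now ending at SHRT_MAX; removing the
-1 tier afterwards leaves a single tier that starts at -1.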

Suggested-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 MAINTAINERS     |   2 +
 mm/Kconfig      |  12 ++
 mm/Makefile     |   2 +-
 mm/swap.h       |   4 +
 mm/swap_state.c |  74 ++++++++++++
 mm/swap_tier.c  | 302 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_tier.h  |  20 ++++
 mm/swapfile.c   |   8 +-
 8 files changed, 420 insertions(+), 4 deletions(-)
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e3ee97f5474e..e3cbfedbaa5f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17049,6 +17049,8 @@ F:	mm/swap.c
 F:	mm/swap.h
 F:	mm/swap_table.h
 F:	mm/swap_state.c
+F:	mm/swap_tier.c
+F:	mm/swap_tier.h
 F:	mm/swapfile.c
 
 MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
diff --git a/mm/Kconfig b/mm/Kconfig
index 776b67c66e82..5343937f3da9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,6 +19,18 @@ menuconfig SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config NR_SWAP_TIERS
+        int "Number of swap device tiers"
+        depends on SWAP
+        default 4
+        range 1 31
+        help
+          Sets the number of swap device tiers. Swap devices are
+          grouped into tiers based on their priority, allowing the
+          system to prefer faster devices over slower ones.
+
+          If unsure, say 4.
+
 config ZSWAP
 	bool "Compressed cache for swap pages"
 	depends on SWAP
diff --git a/mm/Makefile b/mm/Makefile
index eff9f9e7e061..29cb1e778285 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,7 +75,7 @@ ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
 endif
 
-obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o swap_tier.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
diff --git a/mm/swap.h b/mm/swap.h
index 900a539c63f0..067f0afd7a3e 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -34,6 +34,10 @@ extern int page_cluster;
 #define swap_entry_order(order)	0
 #endif
 
+#define DEF_SWAP_PRIO  -1
+
+extern spinlock_t swap_lock;
+extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
 
 /*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 04f5ce992401..e609bbdf7e13 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -25,6 +25,7 @@
 #include "internal.h"
 #include "swap_table.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -997,8 +998,81 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
 }
 static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
 
+static ssize_t tiers_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return swap_tiers_sysfs_show(buf);
+}
+
+static ssize_t tiers_store(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *p, *token, *name, *tmp;
+	int ret = 0;
+	short prio;
+
+	tmp = kstrdup(buf, GFP_KERNEL);
+	if (!tmp)
+		return -ENOMEM;
+
+	spin_lock(&swap_lock);
+	spin_lock(&swap_tier_lock);
+	swap_tiers_snapshot();
+
+	p = tmp;
+	while ((token = strsep(&p, ", \t\n")) != NULL) {
+		if (!*token)
+			continue;
+
+		switch (token[0]) {
+		case '+':
+			name = token + 1;
+			token = strchr(name, ':');
+			if (!token) {
+				ret = -EINVAL;
+				goto restore;
+			}
+			*token++ = '\0';
+			if (kstrtos16(token, 10, &prio)) {
+				ret = -EINVAL;
+				goto restore;
+			}
+			ret = swap_tiers_add(name, prio);
+			if (ret)
+				goto restore;
+			break;
+		case '-':
+			ret = swap_tiers_remove(token + 1);
+			if (ret)
+				goto restore;
+			break;
+		default:
+			ret = -EINVAL;
+			goto restore;
+		}
+	}
+
+	if (!swap_tiers_validate()) {
+		ret = -EINVAL;
+		goto restore;
+	}
+	goto out;
+
+restore:
+	swap_tiers_snapshot_restore();
+out:
+	spin_unlock(&swap_tier_lock);
+	spin_unlock(&swap_lock);
+	kfree(tmp);
+	return ret ? ret : count;
+}
+
+static struct kobj_attribute tier_attr = __ATTR_RW(tiers);
+
 static struct attribute *swap_attrs[] = {
 	&vma_ra_enabled_attr.attr,
+	&tier_attr.attr,
 	NULL,
 };
 
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
new file mode 100644
index 000000000000..ac7a3c2a48cb
--- /dev/null
+++ b/mm/swap_tier.c
@@ -0,0 +1,302 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/swap.h>
+#include <linux/memcontrol.h>
+#include "memcontrol-v1.h"
+#include <linux/sysfs.h>
+#include <linux/plist.h>
+
+#include "swap.h"
+#include "swap_tier.h"
+
+#define MAX_SWAPTIER	CONFIG_NR_SWAP_TIERS
+#define MAX_TIERNAME	16
+
+/*
+ * struct swap_tier - structure representing a swap tier.
+ *
+ * @name: name of the swap_tier.
+ * @prio: starting value of priority.
+ * @list: linked list of tiers.
+ */
+static struct swap_tier {
+	char name[MAX_TIERNAME];
+	short prio;
+	struct list_head list;
+} swap_tiers[MAX_SWAPTIER];
+
+DEFINE_SPINLOCK(swap_tier_lock);
+/* active swap priority list, sorted in descending order */
+static LIST_HEAD(swap_tier_active_list);
+/* unused swap_tier object */
+static LIST_HEAD(swap_tier_inactive_list);
+
+#define TIER_IDX(tier)	((tier) - swap_tiers)
+#define TIER_MASK(tier)	(1U << TIER_IDX(tier))
+#define TIER_INACTIVE_PRIO (DEF_SWAP_PRIO - 1)
+#define TIER_IS_ACTIVE(tier) ((tier->prio) !=  TIER_INACTIVE_PRIO)
+#define TIER_END_PRIO(tier) \
+	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
+	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
+
+#define for_each_tier(tier, idx) \
+	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
+		idx++, tier = &swap_tiers[idx])
+
+#define for_each_active_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_active_list, list)
+
+#define for_each_inactive_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_inactive_list, list)
+
+/*
+ * Naming Convention:
+ *   swap_tiers_*() - Public/exported functions
+ *   swap_tier_*()  - Private/internal functions
+ */
+
+static bool swap_tier_is_active(void)
+{
+	return !list_empty(&swap_tier_active_list);
+}
+
+static struct swap_tier *swap_tier_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (!strcmp(tier->name, name))
+			return tier;
+	}
+
+	return NULL;
+}
+
+/* Insert new tier into the active list sorted by priority. */
+static void swap_tier_activate(struct swap_tier *new)
+{
+	struct list_head *pos = &swap_tier_active_list;
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier->prio <= new->prio) {
+			pos = &tier->list;
+			break;
+		}
+	}
+
+	list_add_tail(&new->list, pos);
+}
+
+static void swap_tier_inactivate(struct swap_tier *tier)
+{
+	list_move(&tier->list, &swap_tier_inactive_list);
+	tier->prio = TIER_INACTIVE_PRIO;
+}
+
+void swap_tiers_init(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+
+	for_each_tier(tier, idx) {
+		INIT_LIST_HEAD(&tier->list);
+		swap_tier_inactivate(tier);
+	}
+}
+
+ssize_t swap_tiers_sysfs_show(char *buf)
+{
+	struct swap_tier *tier;
+	ssize_t len = 0;
+
+	len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
+			 "Name", "Idx", "PrioStart", "PrioEnd");
+
+	spin_lock(&swap_tier_lock);
+	for_each_active_tier(tier) {
+		len += sysfs_emit_at(buf, len, "%-16s %-5td %-11d %-11d\n",
+				     tier->name,
+				     TIER_IDX(tier),
+				     tier->prio,
+				     TIER_END_PRIO(tier));
+	}
+	spin_unlock(&swap_tier_lock);
+
+	return len;
+}
+
+static struct swap_tier *swap_tier_prepare(const char *name, short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	if (prio < DEF_SWAP_PRIO)
+		return ERR_PTR(-EINVAL);
+
+	if (list_empty(&swap_tier_inactive_list))
+		return ERR_PTR(-ENOSPC);
+
+	tier = list_first_entry(&swap_tier_inactive_list,
+		struct swap_tier, list);
+
+	list_del_init(&tier->list);
+	strscpy(tier->name, name, MAX_TIERNAME);
+	tier->prio = prio;
+
+	return tier;
+}
+
+static int swap_tier_check_range(short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		/* No overwrite */
+		if (tier->prio == prio)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static bool swap_tier_validate_name(const char *name)
+{
+	int len;
+
+	if (!name || !*name)
+		return false;
+
+	len = strlen(name);
+	if (len >= MAX_TIERNAME)
+		return false;
+
+	while (*name) {
+		if (!isalnum(*name) && *name != '_')
+			return false;
+		name++;
+	}
+	return true;
+}
+
+int swap_tiers_add(const char *name, int prio)
+{
+	int ret;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	/* Duplicate check */
+	if (swap_tier_lookup(name))
+		return -EEXIST;
+
+	if (!swap_tier_validate_name(name))
+		return -EINVAL;
+
+	ret = swap_tier_check_range(prio);
+	if (ret)
+		return ret;
+
+	tier = swap_tier_prepare(name, prio);
+	if (IS_ERR(tier)) {
+		ret = PTR_ERR(tier);
+		return ret;
+	}
+
+	swap_tier_activate(tier);
+
+	return ret;
+}
+
+int swap_tiers_remove(const char *name)
+{
+	int ret = 0;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	tier = swap_tier_lookup(name);
+	if (!tier)
+		return -EINVAL;
+
+	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
+	if (!list_is_singular(&swap_tier_active_list)
+		&& tier->prio == DEF_SWAP_PRIO)
+		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
+
+	swap_tier_inactivate(tier);
+
+	return ret;
+}
+
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+/*
+ * XXX: When multiple operations (adds and removes) are submitted in a
+ * single write, reverting each individually on failure is complex and
+ * error-prone. Instead, snapshot the entire state beforehand and
+ * restore it wholesale if any operation fails.
+ */
+void swap_tiers_snapshot(void)
+{
+	BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	memcpy(swap_tiers_snap, swap_tiers, sizeof(swap_tiers));
+}
+
+void swap_tiers_snapshot_restore(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	memcpy(swap_tiers, swap_tiers_snap, sizeof(swap_tiers));
+
+	INIT_LIST_HEAD(&swap_tier_active_list);
+	INIT_LIST_HEAD(&swap_tier_inactive_list);
+
+	/*
+	 * memcpy copied snapshot-time list pointers into each tier's
+	 * list_head.  Those references are stale, so re-init every
+	 * tier before re-linking into the freshly initialised global
+	 * lists below.
+	 */
+	for_each_tier(tier, idx) {
+		INIT_LIST_HEAD(&tier->list);
+
+		if (TIER_IS_ACTIVE(tier))
+			swap_tier_activate(tier);
+		else
+			swap_tier_inactivate(tier);
+	}
+}
+
+bool swap_tiers_validate(void)
+{
+	struct swap_tier *tier;
+
+	/*
+	 * Initial setting might not cover DEF_SWAP_PRIO.
+	 * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
+	 */
+	if (swap_tier_is_active()) {
+		tier = list_last_entry(&swap_tier_active_list,
+			struct swap_tier, list);
+
+		if (tier->prio != DEF_SWAP_PRIO)
+			return false;
+	}
+
+	return true;
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
new file mode 100644
index 000000000000..a1395ec02c24
--- /dev/null
+++ b/mm/swap_tier.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SWAP_TIER_H
+#define _SWAP_TIER_H
+
+#include <linux/types.h>
+#include <linux/spinlock.h>
+
+extern spinlock_t swap_tier_lock;
+
+/* Initialization and application */
+void swap_tiers_init(void);
+ssize_t swap_tiers_sysfs_show(char *buf);
+
+int swap_tiers_add(const char *name, int prio);
+int swap_tiers_remove(const char *name);
+
+void swap_tiers_snapshot(void);
+void swap_tiers_snapshot_restore(void);
+bool swap_tiers_validate(void);
+#endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3d126602a1e..3f7225dbc6cd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -48,6 +48,7 @@
 #include "swap_table.h"
 #include "internal.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
@@ -63,7 +64,8 @@ static void move_cluster(struct swap_info_struct *si,
  *
  * Also protects swap_active_head total_swap_pages, and the SWP_WRITEOK flag.
  */
-static DEFINE_SPINLOCK(swap_lock);
+DEFINE_SPINLOCK(swap_lock);
+
 static unsigned int nr_swapfiles;
 atomic_long_t nr_swap_pages;
 /*
@@ -74,7 +76,6 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-#define DEF_SWAP_PRIO  -1
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ -87,7 +88,7 @@ static const char Bad_offset[] = "Bad swap offset entry ";
  * all active swap_info_structs
  * protected with swap_lock, and ordered by priority.
  */
-static PLIST_HEAD(swap_active_head);
+PLIST_HEAD(swap_active_head);
 
 /*
  * all available (active, not full) swap_info_structs
@@ -3988,6 +3989,7 @@ static int __init swapfile_init(void)
 		swap_migration_ad_supported = true;
 #endif	/* CONFIG_MIGRATION */
 
+	swap_tiers_init();
 	return 0;
 }
 subsys_initcall(swapfile_init);
-- 
2.34.1



* [PATCH v7 2/4] mm: swap: associate swap devices with tiers
  2026-05-27  6:22 [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure Youngjun Park
  2026-05-27  6:22 ` [PATCH v7 1/4] " Youngjun Park
@ 2026-05-27  6:22 ` Youngjun Park
  2026-05-27  6:22 ` [PATCH v7 3/4] mm: memcontrol: add interfaces for swap tier selection Youngjun Park
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Youngjun Park @ 2026-05-27  6:22 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, youngjun.park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim

This patch connects swap devices to the swap tier infrastructure,
ensuring that devices are correctly assigned to tiers based on their
priority.

A `tier_mask` is added to identify the tier membership of swap devices.
Although tier-based allocation logic is not yet implemented, this
mapping is necessary to track which tier a device belongs to. Upon
activation, the device is assigned to a tier by matching its priority
against the configured tier ranges.
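
As a rough userspace sketch of that matching step: with tiers sorted
by start priority in descending order, the first tier whose start is
at or below the device priority is the one that contains it. struct
tier_range and tier_mask_for_prio() are hypothetical names used only
for this example; the mask layout (one bit per tier index) follows
TIER_MASK() from patch 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened view of one active tier; not the kernel struct. */
struct tier_range {
	int idx;	/* slot index, i.e. the bit position in a tier mask */
	short start;	/* inclusive start priority */
};

/*
 * tiers[] is sorted by start priority in descending order. Return the
 * single-bit mask of the tier whose range contains prio, or 0 if no
 * tier covers it (for example when no tiers are configured).
 */
static unsigned int tier_mask_for_prio(const struct tier_range *tiers,
				       size_t n, short prio)
{
	size_t i;

	assert(tiers != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/*
		 * Every earlier tier starts above prio, so the first
		 * tier with start <= prio is the containing one.
		 */
		if (tiers[i].start <= prio)
			return 1U << tiers[i].idx;
	}
	return 0;
}
```

For tiers SSD (idx 2, start 100), HDD (idx 0, start 50) and NET
(idx 1, start -1), a device at priority 60 maps to the HDD bit.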

The infrastructure allows dynamic modification of tiers, such as
splitting or merging ranges. These operations are permitted provided
that the tier assignment of already configured swap devices remains
unchanged.

This patch also adds the documentation for the swap tier feature,
covering the core concepts, sysfs interface usage, and configuration
details.

Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/mm/index.rst     |   1 +
 Documentation/mm/swap-tier.rst | 159 +++++++++++++++++++++++++++++++++
 MAINTAINERS                    |   1 +
 include/linux/swap.h           |   1 +
 mm/swap_state.c                |   2 +-
 mm/swap_tier.c                 | 101 ++++++++++++++++++---
 mm/swap_tier.h                 |  13 ++-
 mm/swapfile.c                  |   2 +
 8 files changed, 266 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst

diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index 7aa2a8886908..a0d1447c5569 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -21,6 +21,7 @@ see the :doc:`admin guide <../admin-guide/mm/index>`.
    page_reclaim
    swap
    swap-table
+   swap-tier
    page_cache
    shmfs
    oom
diff --git a/Documentation/mm/swap-tier.rst b/Documentation/mm/swap-tier.rst
new file mode 100644
index 000000000000..addbc495de8c
--- /dev/null
+++ b/Documentation/mm/swap-tier.rst
@@ -0,0 +1,159 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li <chrisl@kernel.org> Youngjun Park <youngjun.park@lge.com>
+
+==========
+Swap Tier
+==========
+
+Swap tier is a collection of user-named groups classified by priority ranges.
+It acts as a facilitation layer, allowing users to manage swap devices based
+on their speeds.
+
+Users are encouraged to assign swap device priorities according to device
+speed to fully utilize this feature. While the current implementation is
+integrated with cgroups, the concept is designed to be extensible for other
+subsystems in the future.
+
+Use case
+---------
+
+Users can perform selective swapping by choosing a swap tier assigned according
+to speed within a cgroup.
+
+For more information on cgroup v2, please refer to
+``Documentation/admin-guide/cgroup-v2.rst``.
+
+Priority Range
+--------------
+
+The specified tiers must cover the entire priority range from -1
+(DEF_SWAP_PRIO) to SHRT_MAX.
+
+Consistency
+-----------
+
+Tier consistency is guaranteed with a focus on maximizing flexibility. When a
+swap device is activated within a tier range, the tier covering that device's
+priority is guaranteed not to disappear or change while the device remains
+active. Adding a new tier may split the range of an existing tier, but the
+active device's tier assignment remains unchanged.
+
+However, specifying a tier in a cgroup does not guarantee the tier's existence.
+Consequently, the corresponding tier can disappear at any time.
+
+Configuration Interface
+-----------------------
+
+The swap tiers can be configured via the following interface:
+
+/sys/kernel/mm/swap/tiers
+
+Operations can be performed using the following syntax:
+
+* Add:    ``+"<tiername>":"<start_priority>"``
+* Remove: ``-"<tiername>"``
+
+Tier names must consist of alphanumeric characters and underscores. Multiple
+operations can be provided in a single write, separated by commas (",") or
+whitespace (spaces, tabs, newlines).
+
+When configuring tiers, the specified value represents the **start priority**
+of that tier. The end priority is automatically determined by the start
+priority of the next higher tier. Consequently, adding a tier
+automatically adjusts the ranges of adjacent tiers to ensure continuity.
+
+Examples
+--------
+
+**1. Initialization**
+
+A tier starting at -1 is mandatory to cover the entire priority range up to
+SHRT_MAX. In this example, 'HDD' starts at 50, and 'NET' covers the remaining
+lower range starting from -1.
+
+::
+
+    # echo "+HDD:50, +NET:-1" > /sys/kernel/mm/swap/tiers
+    # cat /sys/kernel/mm/swap/tiers
+    Name             Idx   PrioStart   PrioEnd
+    HDD              0     50          32767
+    NET              1     -1          49
+
+**2. Adding a New Tier (split)**
+
+A new tier 'SSD' is added at priority 100, splitting the existing 'HDD' tier.
+The ranges are automatically recalculated:
+
+* 'SSD' takes the top range (100 to SHRT_MAX).
+* 'HDD' is adjusted to the range between 'NET' and 'SSD' (50 to 99).
+* 'NET' remains unchanged (-1 to 49).
+
+::
+
+    # echo "+SSD:100" > /sys/kernel/mm/swap/tiers
+    # cat /sys/kernel/mm/swap/tiers
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              0     50          99
+    NET              1     -1          49
+
+**3. Removal (merge)**
+
+Tiers can be removed using the '-' prefix.
+::
+
+    # echo "-SSD" > /sys/kernel/mm/swap/tiers
+
+When a tier is removed, its priority range is merged into the adjacent
+tier. The merge direction is always upward (the tier below expands),
+except when the lowest tier is removed — in that case the tier above
+shifts its starting priority down to -1 to maintain full range coverage.
+
+::
+
+    Initial state:
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              1     50          99
+    NET              0     -1          49
+
+    # echo "-SSD" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              1     50          32767       <- merged with SSD's range
+    NET              0     -1          49
+
+    # echo "-NET" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              1     -1          32767       <- shifted down to -1
+
+**4. Interaction with Active Swap Devices**
+
+If a swap device is active (swapon), the tier covering that device's
+priority cannot be removed. Splitting the active tier's range is only
+allowed above the device's priority.
+
+Assume a swap device is active at priority 60 (inside 'HDD' tier).
+
+::
+
+    # swapon -p 60 /dev/zram0
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              0     50          32767
+    NET              1     -1          49
+
+    # echo "-HDD" > /sys/kernel/mm/swap/tiers
+    -bash: echo: write error: Device or resource busy
+
+    # echo "+SSD:60" > /sys/kernel/mm/swap/tiers
+    -bash: echo: write error: Device or resource busy
+
+    # echo "+SSD:100" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              0     50          99          <- device (prio 60) stays here
+    NET              1     -1          49
diff --git a/MAINTAINERS b/MAINTAINERS
index e3cbfedbaa5f..c008014663b3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17041,6 +17041,7 @@ R:	Youngjun Park <youngjun.park@lge.com>
 L:	linux-mm@kvack.org
 S:	Maintained
 F:	Documentation/mm/swap-table.rst
+F:	Documentation/mm/swap-tier.rst
 F:	include/linux/swap.h
 F:	include/linux/swapfile.h
 F:	include/linux/swapops.h
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6d72778e6cc3..21286945770a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,6 +250,7 @@ struct swap_info_struct {
 	struct percpu_ref users;	/* indicate and keep swap device valid. */
 	unsigned long	flags;		/* SWP_USED etc: see above */
 	signed short	prio;		/* swap priority of this type */
+	int tier_mask;			/* swap tier mask */
 	struct plist_node list;		/* entry in swap_active_head */
 	signed char	type;		/* strange name for an index */
 	unsigned int	max;		/* size of this swap device */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e609bbdf7e13..de285b36e31c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -1053,7 +1053,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 		}
 	}
 
-	if (!swap_tiers_validate()) {
+	if (!swap_tiers_update()) {
 		ret = -EINVAL;
 		goto restore;
 	}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index ac7a3c2a48cb..6b57cadb3e95 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -38,6 +38,8 @@ static LIST_HEAD(swap_tier_inactive_list);
 	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
 	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
 
+#define MASK_TO_TIER(mask) (&swap_tiers[__ffs((mask))])
+
 #define for_each_tier(tier, idx) \
 	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
 		idx++, tier = &swap_tiers[idx])
@@ -59,6 +61,26 @@ static bool swap_tier_is_active(void)
 	return !list_empty(&swap_tier_active_list);
 }
 
+static bool swap_tier_prio_in_range(struct swap_tier *tier, short prio)
+{
+	if (tier->prio <= prio && TIER_END_PRIO(tier) >= prio)
+		return true;
+
+	return false;
+}
+
+static bool swap_tier_prio_is_used(short prio)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier->prio == prio)
+			return true;
+	}
+
+	return false;
+}
+
 static struct swap_tier *swap_tier_lookup(const char *name)
 {
 	struct swap_tier *tier;
@@ -99,6 +121,7 @@ void swap_tiers_init(void)
 	int idx;
 
 	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+	BUILD_BUG_ON(MAX_SWAPTIER > TIER_DEFAULT_IDX);
 
 	for_each_tier(tier, idx) {
 		INIT_LIST_HEAD(&tier->list);
@@ -149,17 +172,29 @@ static struct swap_tier *swap_tier_prepare(const char *name, short prio)
 	return tier;
 }
 
-static int swap_tier_check_range(short prio)
+static int swap_tier_can_split_range(short new_prio)
 {
+	struct swap_info_struct *p;
 	struct swap_tier *tier;
 
 	lockdep_assert_held(&swap_lock);
 	lockdep_assert_held(&swap_tier_lock);
 
-	for_each_active_tier(tier) {
-		/* No overwrite */
-		if (tier->prio == prio)
-			return -EINVAL;
+	plist_for_each_entry(p, &swap_active_head, list) {
+		if (p->tier_mask == TIER_DEFAULT_MASK)
+			continue;
+
+		tier = MASK_TO_TIER(p->tier_mask);
+		if (!swap_tier_prio_in_range(tier, new_prio))
+			continue;
+
+		/*
+		 * Device sits in a tier that spans new_prio;
+		 * splitting here would reassign it to a
+		 * different tier.
+		 */
+		if (p->prio >= new_prio)
+			return -EBUSY;
 	}
 
 	return 0;
@@ -199,7 +234,11 @@ int swap_tiers_add(const char *name, int prio)
 	if (!swap_tier_validate_name(name))
 		return -EINVAL;
 
-	ret = swap_tier_check_range(prio);
+	/* No overwrite */
+	if (swap_tier_prio_is_used(prio))
+		return -EBUSY;
+
+	ret = swap_tier_can_split_range(prio);
 	if (ret)
 		return ret;
 
@@ -226,6 +265,11 @@ int swap_tiers_remove(const char *name)
 	if (!tier)
 		return -EINVAL;
 
+	/* Simulate adding a tier to check for conflicts */
+	ret = swap_tier_can_split_range(tier->prio);
+	if (ret)
+		return ret;
+
 	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
 	if (!list_is_singular(&swap_tier_active_list)
 		&& tier->prio == DEF_SWAP_PRIO)
@@ -236,13 +280,15 @@ int swap_tiers_remove(const char *name)
 	return ret;
 }
 
-static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
 /*
- * XXX: When multiple operations (adds and removes) are submitted in a
- * single write, reverting each individually on failure is complex and
- * error-prone. Instead, snapshot the entire state beforehand and
- * restore it wholesale if any operation fails.
+ * XXX: Static global snapshot buffer for batch operations. Small
+ * and used once per write, so a static global is not bad.
+ * When multiple adds/removes are submitted in a single write,
+ * reverting each individually on failure is error-prone. Instead,
+ * snapshot beforehand and restore wholesale if any operation fails.
  */
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+
 void swap_tiers_snapshot(void)
 {
 	BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
@@ -282,10 +328,30 @@ void swap_tiers_snapshot_restore(void)
 	}
 }
 
-bool swap_tiers_validate(void)
+void swap_tiers_assign_dev(struct swap_info_struct *swp)
 {
 	struct swap_tier *tier;
 
+	lockdep_assert_held(&swap_lock);
+
+	for_each_active_tier(tier) {
+		if (swap_tier_prio_in_range(tier, swp->prio)) {
+			swp->tier_mask = TIER_MASK(tier);
+			return;
+		}
+	}
+
+	swp->tier_mask = TIER_DEFAULT_MASK;
+}
+
+bool swap_tiers_update(void)
+{
+	struct swap_tier *tier;
+	struct swap_info_struct *swp;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
 	/*
 	 * Initial setting might not cover DEF_SWAP_PRIO.
 	 * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
@@ -298,5 +364,16 @@ bool swap_tiers_validate(void)
 			return false;
 	}
 
+	/*
+	 * If applied initially, the swap tier_mask may change
+	 * from the default value.
+	 */
+	plist_for_each_entry(swp, &swap_active_head, list) {
+		/* Tier is already configured */
+		if (swp->tier_mask != TIER_DEFAULT_MASK)
+			break;
+		swap_tiers_assign_dev(swp);
+	}
+
 	return true;
 }
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index a1395ec02c24..3e355f857363 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -5,8 +5,15 @@
 #include <linux/types.h>
 #include <linux/spinlock.h>
 
+/* Forward declarations */
+struct swap_info_struct;
+
 extern spinlock_t swap_tier_lock;
 
+#define TIER_ALL_MASK		(~0)
+#define TIER_DEFAULT_IDX	(31)
+#define TIER_DEFAULT_MASK	(1U << TIER_DEFAULT_IDX)
+
 /* Initialization and application */
 void swap_tiers_init(void);
 ssize_t swap_tiers_sysfs_show(char *buf);
@@ -16,5 +23,9 @@ int swap_tiers_remove(const char *name);
 
 void swap_tiers_snapshot(void);
 void swap_tiers_snapshot_restore(void);
-bool swap_tiers_validate(void);
+bool swap_tiers_update(void);
+
+/* Tier assignment */
+void swap_tiers_assign_dev(struct swap_info_struct *swp);
+
 #endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3f7225dbc6cd..9a86ebe992f4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3036,6 +3036,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 
 	/* Add back to available list */
 	add_to_avail_list(si, true);
+
+	swap_tiers_assign_dev(si);
 }
 
 /*
-- 
2.34.1


* [PATCH v7 3/4] mm: memcontrol: add interfaces for swap tier selection
  2026-05-27  6:22 [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure Youngjun Park
  2026-05-27  6:22 ` [PATCH v7 1/4] " Youngjun Park
  2026-05-27  6:22 ` [PATCH v7 2/4] mm: swap: associate swap devices with tiers Youngjun Park
@ 2026-05-27  6:22 ` Youngjun Park
  2026-05-27  6:22 ` [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Youngjun Park @ 2026-05-27  6:22 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, youngjun.park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim

Integrate swap tier infrastructure with cgroup to allow selecting
specific swap devices per cgroup.

Introduce memory.swap.tiers for configuring allowed tiers, and
memory.swap.tiers.effective for exposing the effective tiers.
The effective tiers are the intersection of the configured tiers
and the parent's effective tiers.
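
As a rough illustration of the intersection rule above, here is a
minimal userspace sketch. It is not the kernel code: struct cg, cg_sync()
and ALL_TIERS are hypothetical stand-ins for the memcg fields and helpers
in this patch, and the masks are unsigned here for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; only the two masks matter. */
struct cg {
	struct cg *parent;		/* NULL for the root */
	unsigned int tier_mask;		/* configured via memory.swap.tiers */
	unsigned int effective;		/* shown by memory.swap.tiers.effective */
};

#define ALL_TIERS (~0u)

/* effective = parent's effective mask AND this cgroup's configured mask */
static void cg_sync(struct cg *c)
{
	unsigned int parent_mask = c->parent ? c->parent->effective : ALL_TIERS;

	c->effective = parent_mask & c->tier_mask;
}

/* Worked example with assumed bits: SSD = bit 2, HDD = bit 1, NET = bit 0. */
static void cg_example(void)
{
	struct cg root = { NULL, ALL_TIERS, ALL_TIERS };
	struct cg parent = { &root, ALL_TIERS & ~(1u << 2), 0 };	/* "-SSD" */
	struct cg child = { &parent, ALL_TIERS, 0 };			/* default */

	/* Parents must be synced before children, as in a pre-order walk. */
	cg_sync(&parent);
	cg_sync(&child);

	/* The child never disabled SSD itself, yet SSD is not effective. */
	assert(!(child.effective & (1u << 2)));
	assert(child.effective & (1u << 1));
	assert(child.effective & (1u << 0));
}
```

The point of keeping two masks is that a child's own configuration is
preserved even while the parent restricts it, so re-enabling the tier in
the parent only requires recomputing the effective masks.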

Note that cgroups do not pin swap tiers. As with cpuset and CPU
hotplug, tiers can be reconfigured regardless of cgroup usage.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  29 +++++++
 include/linux/memcontrol.h              |   5 ++
 mm/memcontrol.c                         |  96 +++++++++++++++++++++
 mm/swap_state.c                         |   5 +-
 mm/swap_tier.c                          | 107 +++++++++++++++++++++++-
 mm/swap_tier.h                          |  56 +++++++++++--
 6 files changed, 288 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..08253072a252 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1850,6 +1850,35 @@ The following nested keys are defined.
 	Swap usage hard limit.  If a cgroup's swap usage reaches this
 	limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.swap.tiers
+        A read-write file which exists on non-root cgroups.
+        Format is similar to cgroup.subtree_control.
+
+        Controls which swap tiers this cgroup is allowed to swap
+        out to. All tiers are enabled by default.
+
+        ::
+
+            (-|+)TIER [(-|+)TIER ...]
+
+        "-" disables a tier, "+" re-enables it.
+        Entries are whitespace-delimited.
+
+        Changes here are combined with parent restrictions to
+        compute memory.swap.tiers.effective.
+
+        If a tier is removed from /sys/kernel/mm/swap/tiers,
+        any earlier "-" for that tier is discarded in every cgroup.
+
+  memory.swap.tiers.effective
+        A read-only file which exists on non-root cgroups.
+
+        Shows the tiers this cgroup can actually swap out to.
+        This is the intersection of the parent's effective tiers
+        and this cgroup's own memory.swap.tiers configuration.
+        A child cannot enable a tier that is disabled in its
+        parent.
+
   memory.swap.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bf1a6e131eca..eb33c8e30c9e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -287,6 +287,11 @@ struct mem_cgroup {
 	struct lru_gen_mm_list mm_list;
 #endif
 
+#ifdef CONFIG_SWAP
+	int tier_mask;
+	int tier_effective_mask;
+#endif
+
 #ifdef CONFIG_MEMCG_V1
 	/* Legacy consumer-oriented counters */
 	struct page_counter kmem;		/* v1 only */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e24114a4493a..cbc7a519a24d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -68,6 +68,7 @@
 #include <net/ip.h>
 #include "slab.h"
 #include "memcontrol-v1.h"
+#include "swap_tier.h"
 
 #include <linux/uaccess.h>
 
@@ -4249,6 +4250,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
 
+	swap_tiers_memcg_inherit_mask(memcg);
+
 	/*
 	 * Ensure mem_cgroup_from_private_id() works once we're fully online.
 	 *
@@ -5791,6 +5794,88 @@ static int swap_events_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int swap_tier_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	swap_tiers_mask_show(m, READ_ONCE(memcg->tier_mask));
+	return 0;
+}
+
+static ssize_t swap_tier_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	char *pos, *token;
+	int ret = 0;
+	int original_mask = 0;
+
+	pos = strstrip(buf);
+
+	spin_lock(&swap_tier_lock);
+	if (!*pos) {
+		WRITE_ONCE(memcg->tier_mask, TIER_ALL_MASK);
+		goto sync;
+	}
+
+	original_mask = memcg->tier_mask;
+
+	while ((token = strsep(&pos, " \t\n")) != NULL) {
+		int mask;
+
+		if (!*token)
+			continue;
+
+		if (token[0] != '-' && token[0] != '+') {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		mask = swap_tiers_mask_lookup(token+1);
+		if (!mask) {
+			ret = -EINVAL;
+			goto err;
+		}
+		/*
+		 * tier_mask can be modified independently at each memcg.
+		 * However, the effective mask is restricted to a subset of
+		 * the parent's mask in swap_tiers_memcg_sync_mask().
+		 */
+		switch (token[0]) {
+		case '-':
+			WRITE_ONCE(memcg->tier_mask,
+				   memcg->tier_mask & ~mask);
+			break;
+		case '+':
+			WRITE_ONCE(memcg->tier_mask,
+				   memcg->tier_mask | mask);
+			break;
+		default:
+			ret = -EINVAL;
+			break;
+		}
+
+		if (ret)
+			goto err;
+	}
+
+sync:
+	swap_tiers_memcg_sync_mask(memcg);
+err:
+	if (ret)
+		WRITE_ONCE(memcg->tier_mask, original_mask);
+	spin_unlock(&swap_tier_lock);
+	return ret ? ret : nbytes;
+}
+
+static int swap_tier_effective_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	swap_tiers_mask_show(m, READ_ONCE(memcg->tier_effective_mask));
+	return 0;
+}
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -5823,6 +5908,17 @@ static struct cftype swap_files[] = {
 		.file_offset = offsetof(struct mem_cgroup, swap_events_file),
 		.seq_show = swap_events_show,
 	},
+	{
+		.name = "swap.tiers",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_tier_show,
+		.write = swap_tier_write,
+	},
+	{
+		.name = "swap.tiers.effective",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_tier_effective_show,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index de285b36e31c..2fda6b61e2de 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -1011,6 +1011,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 	char *p, *token, *name, *tmp;
 	int ret = 0;
 	short prio;
+	int mask = 0;
 
 	tmp = kstrdup(buf, GFP_KERNEL);
 	if (!tmp)
@@ -1043,7 +1044,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 				goto restore;
 			break;
 		case '-':
-			ret = swap_tiers_remove(token + 1);
+			ret = swap_tiers_remove(token + 1, &mask);
 			if (ret)
 				goto restore;
 			break;
@@ -1053,7 +1054,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 		}
 	}
 
-	if (!swap_tiers_update()) {
+	if (!swap_tiers_update(mask)) {
 		ret = -EINVAL;
 		goto restore;
 	}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index 6b57cadb3e95..9c180f55a4e9 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -253,7 +253,7 @@ int swap_tiers_add(const char *name, int prio)
 	return ret;
 }
 
-int swap_tiers_remove(const char *name)
+int swap_tiers_remove(const char *name, int *mask)
 {
 	int ret = 0;
 	struct swap_tier *tier;
@@ -276,6 +276,7 @@ int swap_tiers_remove(const char *name)
 		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
 
 	swap_tier_inactivate(tier);
+	*mask |= TIER_MASK(tier);
 
 	return ret;
 }
@@ -344,7 +345,26 @@ void swap_tiers_assign_dev(struct swap_info_struct *swp)
 	swp->tier_mask = TIER_DEFAULT_MASK;
 }
 
-bool swap_tiers_update(void)
+#ifdef CONFIG_MEMCG
+static void swap_tier_memcg_propagate(int mask)
+{
+	struct mem_cgroup *child;
+
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(child, root_mem_cgroup) {
+		WRITE_ONCE(child->tier_mask, child->tier_mask | mask);
+		WRITE_ONCE(child->tier_effective_mask,
+			   child->tier_effective_mask | mask);
+	}
+	rcu_read_unlock();
+}
+#else
+static void swap_tier_memcg_propagate(int mask)
+{
+}
+#endif
+
+bool swap_tiers_update(int mask)
 {
 	struct swap_tier *tier;
 	struct swap_info_struct *swp;
@@ -375,5 +395,88 @@ bool swap_tiers_update(void)
 		swap_tiers_assign_dev(swp);
 	}
 
+	/*
+	 * When a tier is removed, its index (bit position in the mask) becomes
+	 * free for reassignment to a future tier. If a memcg had previously
+	 * disabled this tier (cleared the bit in its swap.tiers file), the
+	 * effective mask would keep that bit clear -- meaning the new tier at
+	 * the same index would be silently unavailable, an invisible cgroup
+	 * constraint left behind by a tier that no longer exists.
+	 *
+	 * To prevent this, OR the removed tier's mask bit into every memcg's
+	 * tier_mask and tier_effective_mask. This resets the bit so the new
+	 * tier is accessible by default; users who want to restrict it must
+	 * explicitly disable it after the tier is re-created.
+	 */
+	if (mask)
+		swap_tier_memcg_propagate(mask);
+
 	return true;
 }
+
+#ifdef CONFIG_MEMCG
+void swap_tiers_mask_show(struct seq_file *m, int mask)
+{
+	struct swap_tier *tier;
+
+	spin_lock(&swap_tier_lock);
+	for_each_active_tier(tier) {
+		if (mask & TIER_MASK(tier))
+			seq_printf(m, "%s ", tier->name);
+	}
+	spin_unlock(&swap_tier_lock);
+	seq_puts(m, "\n");
+}
+
+int swap_tiers_mask_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		if (!strcmp(name, tier->name))
+			return TIER_MASK(tier);
+	}
+
+	return 0;
+}
+
+static void __swap_tier_memcg_inherit_mask(struct mem_cgroup *memcg,
+	struct mem_cgroup *parent)
+{
+	int parent_mask = parent
+		? READ_ONCE(parent->tier_effective_mask)
+		: TIER_ALL_MASK;
+
+	WRITE_ONCE(memcg->tier_effective_mask,
+		   parent_mask & READ_ONCE(memcg->tier_mask));
+}
+
+/* Computes the initial effective mask from the parent's effective mask. */
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg)
+{
+	spin_lock(&swap_tier_lock);
+	rcu_read_lock();
+	memcg->tier_mask = TIER_ALL_MASK;
+	__swap_tier_memcg_inherit_mask(memcg, parent_mem_cgroup(memcg));
+	rcu_read_unlock();
+	spin_unlock(&swap_tier_lock);
+}
+
+/*
+ * Called when a memcg's tier_mask is modified. Walks the subtree
+ * and recomputes each descendant's effective mask against its parent.
+ */
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *child;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(child, memcg)
+		__swap_tier_memcg_inherit_mask(child, parent_mem_cgroup(child));
+	rcu_read_unlock();
+}
+#endif
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index 3e355f857363..49433dcaa1ce 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -10,22 +10,66 @@ struct swap_info_struct;
 
 extern spinlock_t swap_tier_lock;
 
-#define TIER_ALL_MASK		(~0)
-#define TIER_DEFAULT_IDX	(31)
-#define TIER_DEFAULT_MASK	(1U << TIER_DEFAULT_IDX)
-
 /* Initialization and application */
 void swap_tiers_init(void);
 ssize_t swap_tiers_sysfs_show(char *buf);
 
 int swap_tiers_add(const char *name, int prio);
-int swap_tiers_remove(const char *name);
+int swap_tiers_remove(const char *name, int *mask);
 
 void swap_tiers_snapshot(void);
 void swap_tiers_snapshot_restore(void);
-bool swap_tiers_update(void);
+bool swap_tiers_update(int mask);
 
 /* Tier assignment */
 void swap_tiers_assign_dev(struct swap_info_struct *swp);
 
+#define TIER_ALL_MASK		(~0)
+#define TIER_DEFAULT_IDX	(31)
+#define TIER_DEFAULT_MASK	(1U << TIER_DEFAULT_IDX)
+
+#if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG)
+/* Memcg related functions */
+void swap_tiers_mask_show(struct seq_file *m, int mask);
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg);
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg);
+int swap_tiers_mask_lookup(const char *name);
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+	struct mem_cgroup *memcg;
+	int mask = TIER_ALL_MASK;
+
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	if (memcg)
+		mask = READ_ONCE(memcg->tier_effective_mask);
+	rcu_read_unlock();
+
+	return mask;
+}
+#else
+static inline void swap_tiers_mask_show(struct seq_file *m, int mask) {}
+static inline void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg) {}
+static inline void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg) {}
+static inline int swap_tiers_mask_lookup(const char *name)
+{
+	return 0;
+}
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+	return TIER_ALL_MASK;
+}
+#endif
+
+/**
+ * swap_tiers_mask_test - Check whether two tier masks overlap
+ * @tier_mask: The tier mask to check
+ * @mask: The mask to compare against
+ *
+ * Return: true if the masks share at least one tier bit, false otherwise
+ */
+static inline bool swap_tiers_mask_test(int tier_mask, int mask)
+{
+	return tier_mask & mask;
+}
 #endif /* _SWAP_TIER_H */
-- 
2.34.1


* [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask
  2026-05-27  6:22 [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure Youngjun Park
                   ` (2 preceding siblings ...)
  2026-05-27  6:22 ` [PATCH v7 3/4] mm: memcontrol: add interfaces for swap tier selection Youngjun Park
@ 2026-05-27  6:22 ` Youngjun Park
  2026-05-27  7:32   ` Baoquan He
                     ` (2 more replies)
  2026-05-27 20:36 ` [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure Andrew Morton
  2026-05-30 18:02 ` Nhat Pham
  5 siblings, 3 replies; 15+ messages in thread
From: Youngjun Park @ 2026-05-27  6:22 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, youngjun.park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim

Apply memcg tier effective mask during swap slot allocation to
enforce per-cgroup swap tier restrictions.

In the fast path, check the percpu cached swap_info's tier_mask
against the folio's effective mask. If it does not match, fall
through to the slow path. In the slow path, skip swap devices
whose tier_mask is not covered by the folio's effective mask.
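
The slow-path filtering described above can be pictured with the
following userspace sketch. It is only an approximation under stated
assumptions: struct dev, mask_test() and pick_dev() are hypothetical
names, the device array stands in for the priority-ordered available
list, and locking, rotation and cluster handling are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct swap_info_struct. */
struct dev {
	const char *name;
	int prio;
	unsigned int tier_mask;	/* single bit: the tier covering prio */
};

/* Same idea as swap_tiers_mask_test(): any overlap means "allowed". */
static int mask_test(unsigned int tier_mask, unsigned int allowed)
{
	return (tier_mask & allowed) != 0;
}

/* devs[] is assumed to be sorted by descending priority. */
static const struct dev *pick_dev(const struct dev *devs, size_t n,
				  unsigned int allowed)
{
	for (size_t i = 0; i < n; i++) {
		if (!mask_test(devs[i].tier_mask, allowed))
			continue;	/* tier not allowed for this cgroup */
		return &devs[i];
	}
	return NULL;			/* no eligible device */
}

/* Assumed layout: SSD = bit 2, HDD = bit 1, NET = bit 0. */
static void pick_dev_example(void)
{
	const struct dev devs[] = {
		{ "ssd", 100, 1u << 2 },
		{ "hdd", 60, 1u << 1 },
		{ "net", -1, 1u << 0 },
	};

	/* A cgroup that disabled SSD skips it and lands on HDD. */
	assert(pick_dev(devs, 3, ~0u & ~(1u << 2)) == &devs[1]);
}
```

A device is skipped rather than treated as an error, so a cgroup limited
to lower tiers simply falls through to the first device it may use.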

This works correctly when there is only one non-rotational
device in the system and no devices share the same priority.
However, there are known limitations:

 - When non-rotational devices are distributed across multiple
   tiers, and different memcgs are configured to use those
   distinct tiers, they may constantly overwrite the shared
   percpu swap cache. This cache thrashing leads to frequent
   fast path misses.

 - Combined with the above issue, if same-priority devices exist
   among them, a percpu cache miss (overwritten by another memcg)
   forces the allocator to round-robin to the next device
   prematurely, even if the current cluster is not fully
   exhausted.

These edge cases do not affect the primary use case of
directing swap traffic per cgroup. Further optimization is
planned for future work.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 mm/swapfile.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9a86ebe992f4..1a2d29735b71 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1365,14 +1365,18 @@ static bool swap_alloc_fast(struct folio *folio)
 	struct swap_cluster_info *ci;
 	struct swap_info_struct *si;
 	unsigned int offset;
+	int mask = folio_tier_effective_mask(folio);
 
 	/*
 	 * Once allocated, swap_info_struct will never be completely freed,
 	 * so checking it's liveness by get_swap_device_info is enough.
 	 */
 	si = this_cpu_read(percpu_swap_cluster.si[order]);
+	if (!si || !swap_tiers_mask_test(si->tier_mask, mask))
+		return false;
+
 	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
-	if (!si || !offset || !get_swap_device_info(si))
+	if (!offset || !get_swap_device_info(si))
 		return false;
 
 	ci = swap_cluster_lock(si, offset);
@@ -1392,10 +1396,14 @@ static bool swap_alloc_fast(struct folio *folio)
 static void swap_alloc_slow(struct folio *folio)
 {
 	struct swap_info_struct *si, *next;
+	int mask = folio_tier_effective_mask(folio);
 
 	spin_lock(&swap_avail_lock);
 start_over:
 	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
+		if (!swap_tiers_mask_test(si->tier_mask, mask))
+			continue;
+
 		/* Rotate the device and switch to a new cluster */
 		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
-- 
2.34.1


* Re: [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask
  2026-05-27  6:22 ` [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
@ 2026-05-27  7:32   ` Baoquan He
  2026-05-27 17:50   ` Kairui Song
  2026-05-30 17:51   ` Nhat Pham
  2 siblings, 0 replies; 15+ messages in thread
From: Baoquan He @ 2026-05-27  7:32 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	nphamcs, baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny,
	baver.bae, matia.kim

On 05/27/26 at 03:22pm, Youngjun Park wrote:
> Apply memcg tier effective mask during swap slot allocation to
> enforce per-cgroup swap tier restrictions.
> 
> In the fast path, check the percpu cached swap_info's tier_mask
> against the folio's effective mask. If it does not match, fall
> through to the slow path. In the slow path, skip swap devices
> whose tier_mask is not covered by the folio's effective mask.
> 
> This works correctly when there is only one non-rotational
> device in the system and no devices share the same priority.
> However, there are known limitations:
> 
>  - When non-rotational devices are distributed across multiple
>    tiers, and different memcgs are configured to use those
>    distinct tiers, they may constantly overwrite the shared
>    percpu swap cache. This cache thrashing leads to frequent
>    fast path misses.
> 
>  - Combined with the above issue, if same-priority devices exist
>    among them, a percpu cache miss (overwritten by another memcg)
>    forces the allocator to round-robin to the next device
>    prematurely, even if the current cluster is not fully
>    exhausted.
> 
> These edge cases do not affect the primary use case of
> directing swap traffic per cgroup. Further optimization is
> planned for future work.
> 
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
>  mm/swapfile.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)

LGTM,

Reviewed-by: Baoquan He <baoquan.he@linux.dev>

> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9a86ebe992f4..1a2d29735b71 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1365,14 +1365,18 @@ static bool swap_alloc_fast(struct folio *folio)
>  	struct swap_cluster_info *ci;
>  	struct swap_info_struct *si;
>  	unsigned int offset;
> +	int mask = folio_tier_effective_mask(folio);
>  
>  	/*
>  	 * Once allocated, swap_info_struct will never be completely freed,
>  	 * so checking it's liveness by get_swap_device_info is enough.
>  	 */
>  	si = this_cpu_read(percpu_swap_cluster.si[order]);
> +	if (!si || !swap_tiers_mask_test(si->tier_mask, mask))
> +		return false;
> +
>  	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
> -	if (!si || !offset || !get_swap_device_info(si))
> +	if (!offset || !get_swap_device_info(si))
>  		return false;
>  
>  	ci = swap_cluster_lock(si, offset);
> @@ -1392,10 +1396,14 @@ static bool swap_alloc_fast(struct folio *folio)
>  static void swap_alloc_slow(struct folio *folio)
>  {
>  	struct swap_info_struct *si, *next;
> +	int mask = folio_tier_effective_mask(folio);
>  
>  	spin_lock(&swap_avail_lock);
>  start_over:
>  	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
> +		if (!swap_tiers_mask_test(si->tier_mask, mask))
> +			continue;
> +
>  		/* Rotate the device and switch to a new cluster */
>  		plist_requeue(&si->avail_list, &swap_avail_head);
>  		spin_unlock(&swap_avail_lock);
> -- 
> 2.34.1
> 
> 


* Re: [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask
  2026-05-27  6:22 ` [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
  2026-05-27  7:32   ` Baoquan He
@ 2026-05-27 17:50   ` Kairui Song
  2026-05-30 17:51   ` Nhat Pham
  2 siblings, 0 replies; 15+ messages in thread
From: Kairui Song @ 2026-05-27 17:50 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim

On Wed, May 27, 2026 at 03:22:47PM +0800, Youngjun Park wrote:
> Apply memcg tier effective mask during swap slot allocation to
> enforce per-cgroup swap tier restrictions.
> 
> In the fast path, check the percpu cached swap_info's tier_mask
> against the folio's effective mask. If it does not match, fall
> through to the slow path. In the slow path, skip swap devices
> whose tier_mask is not covered by the folio's effective mask.
> 
> This works correctly when there is only one non-rotational
> device in the system and no devices share the same priority.
> However, there are known limitations:
> 
>  - When non-rotational devices are distributed across multiple
>    tiers, and different memcgs are configured to use those
>    distinct tiers, they may constantly overwrite the shared
>    percpu swap cache. This cache thrashing leads to frequent
>    fast path misses.
> 
>  - Combined with the above issue, if same-priority devices exist
>    among them, a percpu cache miss (overwritten by another memcg)
>    forces the allocator to round-robin to the next device
>    prematurely, even if the current cluster is not fully
>    exhausted.
> 
> These edge cases do not affect the primary use case of
> directing swap traffic per cgroup. Further optimization is
> planned for future work.
> 
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
>  mm/swapfile.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9a86ebe992f4..1a2d29735b71 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1365,14 +1365,18 @@ static bool swap_alloc_fast(struct folio *folio)
>  	struct swap_cluster_info *ci;
>  	struct swap_info_struct *si;
>  	unsigned int offset;
> +	int mask = folio_tier_effective_mask(folio);
>  
>  	/*
>  	 * Once allocated, swap_info_struct will never be completely freed,
>  	 * so checking it's liveness by get_swap_device_info is enough.
>  	 */
>  	si = this_cpu_read(percpu_swap_cluster.si[order]);
> +	if (!si || !swap_tiers_mask_test(si->tier_mask, mask))
> +		return false;
> +
>  	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
> -	if (!si || !offset || !get_swap_device_info(si))
> +	if (!offset || !get_swap_device_info(si))
>  		return false;
>  
>  	ci = swap_cluster_lock(si, offset);
> @@ -1392,10 +1396,14 @@ static bool swap_alloc_fast(struct folio *folio)
>  static void swap_alloc_slow(struct folio *folio)
>  {
>  	struct swap_info_struct *si, *next;
> +	int mask = folio_tier_effective_mask(folio);
>  
>  	spin_lock(&swap_avail_lock);
>  start_over:
>  	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
> +		if (!swap_tiers_mask_test(si->tier_mask, mask))
> +			continue;
> +
>  		/* Rotate the device and switch to a new cluster */
>  		plist_requeue(&si->avail_list, &swap_avail_head);
>  		spin_unlock(&swap_avail_lock);
> -- 
> 2.34.1

This part looks good to me. The known limitations are not regressions
and only affect tiering, so they can be improved later. We do have a plan
to refine the priority / rotation / pcp cluster handling so they align well.

Reviewed-by: Kairui Song <kasong@tencent.com>


* Re: [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure
  2026-05-27  6:22 [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure Youngjun Park
                   ` (3 preceding siblings ...)
  2026-05-27  6:22 ` [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
@ 2026-05-27 20:36 ` Andrew Morton
  2026-05-27 23:52   ` Yosry Ahmed
  2026-06-01  4:00   ` YoungJun Park
  2026-05-30 18:02 ` Nhat Pham
  5 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2026-05-27 20:36 UTC (permalink / raw)
  To: Youngjun Park
  Cc: chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim

On Wed, 27 May 2026 15:22:43 +0900 Youngjun Park <youngjun.park@lge.com> wrote:

> This is v7 of the swap tier series addressing review feedback.
> The cover letter has been simplified.

One question from Sashiko.   Minor, but easy to address.
	https://sashiko.dev/#/patchset/20260527062247.3440692-1-youngjun.park@lge.com

I'm reluctant to add a new feature patchset at this time - we have a lot
already and we're at -rc5.   What do others think?


* Re: [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure
  2026-05-27 20:36 ` [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure Andrew Morton
@ 2026-05-27 23:52   ` Yosry Ahmed
  2026-06-01  4:00   ` YoungJun Park
  1 sibling, 0 replies; 15+ messages in thread
From: Yosry Ahmed @ 2026-05-27 23:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Youngjun Park, chrisl, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim

On Wed, May 27, 2026 at 1:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 27 May 2026 15:22:43 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
>
> > This is v7 of the swap tier series addressing review feedback.
> > The cover letter has been simplified.
>
> One question from Sashiko.   Minor, but easy to address.
>         https://sashiko.dev/#/patchset/20260527062247.3440692-1-youngjun.park@lge.com
>
> I'm reluctant to add a new feature patchset at this time - we have a lot
> already and we're at -rc5.   What do others think?

This adds new user-visible interfaces and I think we didn't reach an
agreement on them. I specifically recall Shakeel (and perhaps other
memcg folks) having questions about the memcg interface, and I don't
see any Acks on that patch. I don't think this should be included.



* Re: [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask
  2026-05-27  6:22 ` [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
  2026-05-27  7:32   ` Baoquan He
  2026-05-27 17:50   ` Kairui Song
@ 2026-05-30 17:51   ` Nhat Pham
  2026-05-30 18:21     ` Nhat Pham
  2 siblings, 1 reply; 15+ messages in thread
From: Nhat Pham @ 2026-05-30 17:51 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim

On Tue, May 26, 2026 at 11:23 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> Apply memcg tier effective mask during swap slot allocation to
> enforce per-cgroup swap tier restrictions.
>
> In the fast path, check the percpu cached swap_info's tier_mask
> against the folio's effective mask. If it does not match, fall
> through to the slow path. In the slow path, skip swap devices
> whose tier_mask is not covered by the folio's effective mask.
>
> This works correctly when there is only one non-rotational
> device in the system and no devices share the same priority.
> However, there are known limitations:
>
>  - When non-rotational devices are distributed across multiple
>    tiers, and different memcgs are configured to use those
>    distinct tiers, they may constantly overwrite the shared
>    percpu swap cache. This cache thrashing leads to frequent
>    fast path misses.
>
>  - Combined with the above issue, if same-priority devices exist
>    among them, a percpu cache miss (overwritten by another memcg)
>    forces the allocator to round-robin to the next device
>    prematurely, even if the current cluster is not fully
>    exhausted.

I had very similar issues when I tried hacking vswap on top of the swap
table too... It's even worse over there because it's not just about
performance - vswap needs special handling in certain cases, and in
some places cannot be used at all (e.g. in zswap writeback). I
ended up having to add separate caching for the vswap device:

https://lore.kernel.org/all/20260528212955.1912856-1-nphamcs@gmail.com/

How expensive is it to add per-cpu caching for each device :(

Anyway, as a first step, this LGTM. Reviewing from swap's mechanism
perspective, and leaving the cgroup side to memcg folks:

Reviewed-by: Nhat Pham <nphamcs@gmail.com>



* Re: [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure
  2026-05-27  6:22 [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure Youngjun Park
                   ` (4 preceding siblings ...)
  2026-05-27 20:36 ` [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure Andrew Morton
@ 2026-05-30 18:02 ` Nhat Pham
  2026-06-01  3:42   ` YoungJun Park
  5 siblings, 1 reply; 15+ messages in thread
From: Nhat Pham @ 2026-05-30 18:02 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim

On Tue, May 26, 2026 at 11:23 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> This is v7 of the swap tier series addressing review feedback.
> The cover letter has been simplified.
>
> I revisited the design (see Design Rationale). Since our use case
> fits best with a memcg-based model, the implementation remains
> within memcg and preserves its resource accounting semantics.
>
> Alternatives considered:
>
> 1. A separate sysfs interface under swap. (Workable. But, it would still
>    need to reference memcg paths, and fully decoupling it would add
>    swap-layer logic to manage memcgs, making it secondary option.)
>
> 2. Making the feature non-default.
>
> Other interfaces were also reviewed. Aside from sysfs and BPF,
> the options involve trade-offs and are largely design choices.
> BPF was excluded due to possible disablement on our embedded
> platform, though future extension remains possible.
>
> Overview
> ========
>
> Swap Tiers group swap devices into performance classes (e.g. NVMe,
> HDD, Network) and allow per-memcg selection of which tiers to use.
> This mechanism was suggested by Chris Li.
>
> Design Rationale
> ================
>
> Swap tier selection is attached to memcg. A child cgroup may select a
> subset of the parent's allowed tiers.
>
> This
> - Preserves cgroup inheritance semantics (boundary at parent,
>   refinement at child).
> - Reuses memcg, which already groups processes and enforces
>   hierarchical memory limits.
> - Aligns with existing memcg swap controls (e.g. swap.max, zswap.writeback)
> - Avoids introducing a parallel swap control hierarchy.
>
> Placing tier control outside memcg (e.g., via BPF, syscalls, or
> madvise) would allow swap preference to diverge from the memcg
> hierarchy. Integrating it into memcg keeps the swap policy
> consistent with existing memory ownership semantics. There are
> also real use cases built around memcg.
>
> In the future, this can be extended to other interfaces to cover
> additional use cases.
>
> I believe a memcg-based swap control is a good starting point
> before such extensions.
>
> Use Cases
> =========
>
> #1: Latency separation (our primary deployment scenario)
>   [ / ]
>      |
>      +-- latency-sensitive workload  (fast tier)
>      +-- background workload         (slow tier)
>
> The parent defines the memory boundary.
> Each workload selects a swap tier via memory.swap.tiers according to
> latency requirements.
>
> This prevents latency-sensitive workloads from being swapped to
> slow devices used by background workloads.
>
> #2: Per-VM swap selection (Chris Li's deployment scenario)
>   [ / ]
>      |
>      +-- [ Job on VM ]              (tiers: zswap, SSD)
>             |
>             +-- [ VMM guest memory ]  (tiers: SSD)
>
> The parent (job) has access to both zswap and SSD tiers.
> The child (VMM guest memory) selects SSD as its swap tier via
> memory.swap.tiers. In this deployment, swap device selection
> happens at the child level from the parent's available set.
>
> #3: Tier isolation for reduced contention (hypothetical)
>   [ / ]                    (tiers: A, B)
>      |
>      +-- workload X        (tiers: A)
>      +-- workload Y        (tiers: B)
>
> Each child uses a different tier. Since swap paths are separated
> per tier, synchronization overhead between the two workloads is
> reduced.
>
> Future extension
> ================
>
> #1: Intra-tier distribution policy:
>   Currently, swap devices with the same priority are allocated in a
>   round-robin fashion. Per-tier policy files under
>   /sys/kernel/mm/swap/tiers/ can control how devices within a tier
>   are selected (e.g. round-robin, weighted).
>
> #2: Inter-tier promotion and demotion:
>   Promotion and demotion apply between tiers, not within a single
>   tier. The current interface defines only tier assignment; it does
>   not yet define when or how pages move between tiers. Two triggering
>   models are possible:
>
>   (a) User-triggered: userspace explicitly initiates migration between
>       tiers (e.g. via a new interface or existing move_pages semantics).
>   (b) Kernel-triggered: the kernel moves pages between tiers at
>       appropriate points such as reclaim or refault.
>
> #3: Per-VMA, per-process swap and BPF:
>   Not just for memcg based swap, possible to extend Per-VMA or per-process swap.
>   Or we can use it as BPF program.
>
> Experimentation
> ===============
>
> Tested on our internal platform using NBD as a separate swap tier.
> Our first production's simple usecase.
>
> Without tiers:
> - No selective control over flash wear
> - Cannot selectively assign NBD to specific applications
>
> Cold launch improvement (preloaded vs. baseline):
> - App A: 13.17s -> 4.18s (68%)
> - App B: 5.60s -> 1.12s (80%)
> - App C: 10.25s -> 2.00s (80%)
>
> Performance impact with no tiers configured:
> <1% regression in kernel build and vm-scalability benchmarks
>

Bit late to the party - working on my review backlog right now :)

I see some parallels between this and the memory tiering work being done. One
future line of work could be considering how to ensure fairness when
multiple cgroups share the same tiers:

https://lwn.net/Articles/1073400/

I can imagine a scenario where one noisy neighbor eagerly swaps first
and occupies all the space in the faster tier(s), pushing the other
colocated tenants to the slower tier(s). We might need to figure out a
way to ensure fairness here (while letting cgroups occupy fast swap
backends opportunistically if there is no resource scarcity).


* Re: [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask
  2026-05-30 17:51   ` Nhat Pham
@ 2026-05-30 18:21     ` Nhat Pham
  2026-06-01  3:50       ` YoungJun Park
  0 siblings, 1 reply; 15+ messages in thread
From: Nhat Pham @ 2026-05-30 18:21 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim

On Sat, May 30, 2026 at 10:51 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
>
> How expensive is it to add per-cpu caching for each device :(

to clarify - a percpu_swap_cluster per si for every si.

>

... or for each tier (assuming devices in each tier share the same
performance characteristics, and could be used interchangeably?).

Basically:

struct percpu_swap_cluster {
    struct swap_info_struct *si[MAX_SWAPTIER][SWAP_NR_ORDERS];
    unsigned long offset[MAX_SWAPTIER][SWAP_NR_ORDERS];
    local_lock_t lock;
};

Seems like 4 is the default number of tiers, right? So the extra
overhead is just (nr cpu) * 10 * 3 * (sizeof(unsigned long) +
sizeof(struct swap_info_struct *)) or whatever?
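
Spelling that estimate out as a small userspace sketch: it reads the 10 as
SWAP_NR_ORDERS and the 3 as the tiers added beyond the single row the cache
has today, which is an interpretation rather than something taken from the
series. The local_lock_t is ignored since it does not grow with the number of
tiers. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes used per CPU by the si[] and offset[] arrays for nr_tiers tiers. */
static size_t cache_bytes_per_cpu(size_t nr_tiers, size_t nr_orders)
{
	/* one pointer (si) plus one unsigned long (offset) per tier and order */
	return nr_tiers * nr_orders * (sizeof(void *) + sizeof(unsigned long));
}

/* Extra bytes per CPU compared with a cache that has a single row. */
static size_t extra_bytes_per_cpu(size_t nr_tiers, size_t nr_orders)
{
	assert(nr_tiers >= 1);
	return cache_bytes_per_cpu(nr_tiers, nr_orders) -
	       cache_bytes_per_cpu(1, nr_orders);
}

/* Total extra bytes across all CPUs. */
static size_t extra_bytes_total(size_t nr_cpus, size_t nr_tiers, size_t nr_orders)
{
	return nr_cpus * extra_bytes_per_cpu(nr_tiers, nr_orders);
}
```

On a 64-bit build where a pointer and an unsigned long are 8 bytes each,
extra_bytes_per_cpu(4, 10) is 3 * 10 * 16 = 480 bytes, so 64 CPUs would cost
about 30 KiB in total.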



* Re: [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure
  2026-05-30 18:02 ` Nhat Pham
@ 2026-06-01  3:42   ` YoungJun Park
  0 siblings, 0 replies; 15+ messages in thread
From: YoungJun Park @ 2026-06-01  3:42 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim

On Sat, May 30, 2026 at 11:02:03AM -0700, Nhat Pham wrote:
> Bit late to the party - working on my review backlog right now :)
> 
> I see some parallels between this and the memory tiering work being done. One
> future line of work could be considering how to ensure fairness when
> multiple cgroups share the same tiers:
> 
> https://lwn.net/Articles/1073400/

Hi Nhat,

Thanks for bringing this up. I took a quick look at the link, and I agree
with your point. We could probably use a mechanism similar to the one
proposed there (e.g., setting min/max limits per tier) to handle promotion
and demotion in the future.

This also suggests that keeping the swap tier limits inside the memcg
interface is a reasonable approach. It aligns well with such future work.
If it were implemented in other layers (e.g., prctl, sysfs, or BPF), we
would likely have to revisit its integration with memcg someday anyway.
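
As a rough illustration of why the hierarchy matters, the cover letter's rule
that a child may select a subset of the parent's allowed tiers could be
modeled as below. This is a userspace sketch, not the code in this series: it
assumes the effective mask is the AND of a cgroup's own selection with every
ancestor's selection, and memcg_sketch / effective_mask_sketch are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a memcg node; not the kernel's struct mem_cgroup. */
struct memcg_sketch {
	const struct memcg_sketch *parent;	/* NULL for the root */
	unsigned int selected;			/* tiers this cgroup asked for */
};

/* Assumption: effective mask = own selection AND every ancestor's selection. */
static unsigned int effective_mask_sketch(const struct memcg_sketch *cg)
{
	unsigned int mask = ~0u;

	assert(cg != NULL);
	for (; cg; cg = cg->parent)
		mask &= cg->selected;
	return mask;
}
```

With zswap as bit 0 and SSD as bit 1, a job that allows both tiers and a VMM
child that selects only SSD ends up with an effective mask of SSD only, and a
descendant that asks for more than its parent is still clamped to SSD.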

> I can imagine a scenario where one noisy neighbor eagerly swaps first
> and occupies all the space in the faster tier(s), pushing the other
> colocated tenants to the slower tier(s). We might need to figure out a
> way to ensure fairness here (while letting cgroups occupy fast swap
> backends opportunistically if there is no resource scarcity).

Agreed. Ensuring fairness will be essential when we eventually expand
the promotion and demotion mechanisms across swap tiers.

Thanks,
Youngjun


* Re: [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask
  2026-05-30 18:21     ` Nhat Pham
@ 2026-06-01  3:50       ` YoungJun Park
  0 siblings, 0 replies; 15+ messages in thread
From: YoungJun Park @ 2026-06-01  3:50 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim

On Sat, May 30, 2026 at 11:21:12AM -0700, Nhat Pham wrote:
> On Sat, May 30, 2026 at 10:51 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> >
> > How expensive is it to add per-cpu caching for each device :(
> 
> to clarify - a percpu_swap_cluster per si for every si.
> 
> >
> 
> ... or for each tier (assuming devices in each tier share the same
> performance characteristics, and could be used interchangeably?).
> 
> Basically:
> 
> struct percpu_swap_cluster {
>     struct swap_info_struct *si[MAX_SWAPTIER][SWAP_NR_ORDERS];
>     unsigned long offset[MAX_SWAPTIER][SWAP_NR_ORDERS];
>     local_lock_t lock;
> };
> 
> Seems like 4 is the default number of tiers, right? So the extra
> overhead is just (nr cpu) * 10 * 3 * (sizeof(unsigned long) +
> sizeof(struct swap_info_struct *)) or whatever?

I agree. I actually considered the idea of a tier-centric cache as well.

You might remember that in the previous "per cgroup swap priority"
patchset, I implemented per-cpu caches per priority, which is essentially
the same as having them per tier.

However, as you agreed in another thread, including this optimization
right now might be a bit premature. If the core swap tier idea gets
merged, I plan to explore this optimization further as follow-up work.

Thanks
Youngjun Park


* Re: [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure
  2026-05-27 20:36 ` [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure Andrew Morton
  2026-05-27 23:52   ` Yosry Ahmed
@ 2026-06-01  4:00   ` YoungJun Park
  1 sibling, 0 replies; 15+ messages in thread
From: YoungJun Park @ 2026-06-01  4:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim

On Wed, May 27, 2026 at 01:36:51PM -0700, Andrew Morton wrote:
> On Wed, 27 May 2026 15:22:43 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
> 
> > This is v7 of the swap tier series addressing review feedback.
> > The cover letter has been simplified.
> 
> One question from Sashiko.   Minor, but easy to address.
> 	https://sashiko.dev/#/patchset/20260527062247.3440692-1-youngjun.park@lge.com

Thanks, Andrew. That is a valid concern and definitely needs to be fixed.
I will address it in the next version.

> I'm reluctant to add a new feature patchset at this time - we have a lot
> already and we're at -rc5.   What do others think?

I will wait to hear others' thoughts on this.

Thanks,
Youngjun Park
