public inbox for cgroups@vger.kernel.org
* [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
@ 2026-03-25 17:54 Youngjun Park
  2026-03-25 17:54 ` [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure Youngjun Park
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Youngjun Park @ 2026-03-25 17:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Li, Youngjun Park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny

This is v5 of the "Swap Tiers" series.
For clarity, this cover letter is structured in two parts:

  Part 1 describes the patch series itself (what is implemented in v5).
  Part 2 consolidates the design rationale and use case discussion,
  including clarification around the memcg-integrated model and
  comparison with BPF-based approaches.

This separation is intentional so reviewers can clearly distinguish
the patch introduction from the design discussion, in response to
Shakeel's ongoing feedback.

v4:
  https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/

Earlier RFC versions:
  v3: https://lore.kernel.org/linux-mm/20260131125454.3187546-1-youngjun.park@lge.com/
  v2: https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
  v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/

Earlier Approach (per cgroup swap priority)
  RFC: https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d
  v1: https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
======================================================================
Part 1: Patch Series Summary
======================================================================

Overview
========
Swap Tiers group swap devices into performance classes (e.g. NVMe,
HDD, Network) and allow per-memcg selection of which tiers to use.
This mechanism was suggested by Chris Li.

This series introduces:

- Core tier infrastructure
- Per-memcg tier assignment (subset of parent)
- memory.swap.tiers and memory.swap.tiers.effective interfaces

Changes in v5
=============
- Fixed build errors reported in v4
- Rebased onto the latest mm-new
- Minor cleanups
- Added design documentation with validation (following discussion
  with Shakeel Butt)

Changes in v4 (summary)
=======================
- Simplified control flow and indentation
- Added CONFIG option for MAX_SWAPTIER (default: 4)
- Added memory.swap.tiers.effective interface
- Reworked save/restore logic into snapshot/rollback model
- Removed tier priority modification support (deferred)
- Improved validation and fixed edge cases
- Rebased onto latest mm-new

Deferred / Future Work
======================
- Per-tier swap_active_head to reduce contention (Suggested by Chris Li)
- Fast-path and slow-path allocation improvements
  (to be introduced after Kairui's work lands)

Real-world Results
==================
Tested on our internal platform using NBD as a separate swap tier;
this is our first simple production use case.

Without tiers:
- No selective control over flash wear
- Cannot selectively assign NBD to specific applications

Cold launch improvement (preloaded vs. baseline):
- App A: 13.17s -> 4.18s (68%)
- App B: 5.60s -> 1.12s (80%)
- App C: 10.25s -> 2.00s (80%)

Performance impact with no tiers configured:
<1% regression in kernel build and vm-scalability benchmarks
(measured in RFC v2).

======================================================================
Part 2: Design Rationale and Use Cases
======================================================================

Design Rationale
================
Swap tier selection is attached to memcg. A child cgroup may select a
subset of the parent's allowed tiers.

This:
- Preserves cgroup inheritance semantics (boundary at parent,
  refinement at child).
- Reuses memcg, which already groups processes and enforces
  hierarchical memory limits.
- Aligns with existing memcg swap controls (e.g. swap.max,
  zswap.writeback).
- Avoids introducing a parallel swap control hierarchy.

Placing tier control outside memcg (e.g. via BPF, a syscall, or
madvise) would allow swap preference to diverge from the memcg
hierarchy. Integrating it into memcg keeps swap policy consistent
with existing memory ownership semantics.
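The subset-of-parent rule can be sketched as a bitmask intersection
walking down the hierarchy. This is an illustrative Python model, not
the kernel code; the helper name and mask layout are assumptions:

```python
# Hypothetical model: a memcg's effective tier set is its requested
# set intersected with its parent's effective set, so a child can only
# narrow, never widen, what the parent allows.

def effective_tiers(requested_mask: int, parent_effective: int) -> int:
    """Resolve a child's effective tier mask against its parent."""
    return requested_mask & parent_effective

ALL_TIERS = 0b1111                        # e.g. 4 configured tiers
root = ALL_TIERS
job = effective_tiers(0b0011, root)       # job: zswap + SSD tiers
vmm = effective_tiers(0b0010, job)        # child narrows to SSD only
```

In use case #2 below, this is how the VMM child ends up with the SSD
tier alone even though the parent job also has zswap available.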

Use case #1: Latency separation (our primary deployment scenario)
=================================================================
  [ / ]
     |
     +-- latency-sensitive workload  (fast tier)
     +-- background workload         (slow tier)

The parent defines the memory boundary.
Each workload selects a swap tier via memory.swap.tiers according to
latency requirements.

This prevents latency-sensitive workloads from being swapped to
slow devices used by background workloads.

Use case #2: Per-VM swap selection (Chris Li's deployment scenario)
==================================================================
  [ / ]
     |
     +-- [ Job on VM ]              (tiers: zswap, SSD)
            |
            +-- [ VMM guest memory ]  (tiers: SSD)

The parent (job) has access to both zswap and SSD tiers.
The child (VMM guest memory) selects SSD as its swap tier via
memory.swap.tiers. In this deployment, swap device selection
happens at the child level from the parent's available set.


Use case #3: Tier isolation for reduced contention (hypothetical)
=================================================================
  [ / ]                    (tiers: A, B)
     |
     +-- workload X        (tiers: A)
     +-- workload Y        (tiers: B)

Each child uses a different tier. Since swap paths are separated
per tier, synchronization overhead between the two workloads is
reduced.

How the Current Interface Supports Future Extensions
====================================================

- Intra-tier distribution policy:
  Currently, swap devices with the same priority are allocated in a
  round-robin fashion. Per-tier policy files under
  /sys/kernel/mm/swap/tiers/ could control how devices within a tier
  are selected (e.g. round-robin, weighted).

- Inter-tier promotion and demotion:
  Promotion and demotion apply between tiers, not within a single
  tier. The current interface defines only tier assignment; it does
  not yet define when or how pages move between tiers. Two triggering
  models are possible:

  (a) User-triggered: userspace explicitly initiates migration between
      tiers (e.g. via a new interface or existing move_pages semantics).
  (b) Kernel-triggered: the kernel moves pages between tiers at
      appropriate points such as reclaim or refault.

  From the memcg perspective, inter-tier movement is bounded by
  memory.swap.tiers.effective -- pages can only be promoted or demoted
  to tiers within the memcg's effective set. The specific policy and
  triggering mechanism require further discussion and are not part of
  this series.

- Per-VMA or per-process swap hints:
  A future madvise-style hint (e.g. MADV_SWAP_TIER) could reference
  the tier indices in /sys/kernel/mm/swap/tiers/. At reclaim time,
  the kernel would check the VMA hint against the memcg's effective
  tier set to pick the swap-out target.
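The intra-tier round-robin behavior described above can be sketched
with a small model (hypothetical class and method names; a weighted
policy would replace only the selection step):

```python
from itertools import cycle

# Hypothetical model: devices sharing a tier are picked round-robin.
# A per-tier policy file would switch out this selection strategy.

class Tier:
    def __init__(self, devices):
        self._rr = cycle(devices)   # endless round-robin iterator

    def pick(self):
        return next(self._rr)

t = Tier(["ssd0", "ssd1"])
picks = [t.pick() for _ in range(4)]    # alternates between devices
```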

BPF Comparison
==============
The use cases described above already rely on memcg for swap tier
control, and real deployments are built around this model.
A BPF-based approach has additional considerations:

- Hierarchy consistency: BPF programs operate outside the memcg
  tree. Without explicit constraints, a BPF selector could
  contradict parent tier restrictions. Edge cases such as zombie
  memcgs make the resolution less clear.
- Deployment scope: requiring BPF for core swap behavior may not
  be suitable for constrained or embedded configurations.

BPF could still work as an extension on top of the tier model
in the future.

Youngjun Park (4):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interfaces for swap tier selection
  mm: swap: filter swap allocation by memcg tier mask

 Documentation/admin-guide/cgroup-v2.rst |  27 ++
 Documentation/mm/swap-tier.rst          | 159 +++++++++
 MAINTAINERS                             |   3 +
 include/linux/memcontrol.h              |   3 +-
 include/linux/swap.h                    |   1 +
 mm/Kconfig                              |  12 +
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  95 +++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  75 ++++
 mm/swap_tier.c                          | 451 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  74 ++++
 mm/swapfile.c                           |  23 +-
 13 files changed, 923 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 6381a729fa7dda43574d93ab9c61cec516dd885b 
-- 
2.34.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure
  2026-03-25 17:54 [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
@ 2026-03-25 17:54 ` Youngjun Park
  2026-03-29 10:49   ` kernel test robot
  2026-03-29 13:46   ` kernel test robot
  2026-03-25 17:54 ` [PATCH v5 2/4] mm: swap: associate swap devices with tiers Youngjun Park
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 13+ messages in thread
From: Youngjun Park @ 2026-03-25 17:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Li, Youngjun Park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny

This patch introduces the "Swap tier" concept, which serves as an
abstraction layer for managing swap devices based on their performance
characteristics (e.g., NVMe, HDD, Network swap).

Swap tiers are user-named groups representing priority ranges.
Tier names must consist of alphanumeric characters and underscores.
These tiers collectively cover the entire priority space from -1
(`DEF_SWAP_PRIO`) to `SHRT_MAX`.

To configure tiers, a new sysfs interface is exposed at
/sys/kernel/mm/swap/tiers. The input parser evaluates commands from
left to right and supports batch input, allowing users to add or remove
multiple tiers in a single write operation.

Tier management enforces continuous priority ranges anchored by start
priorities. Operations trigger range splitting or merging, but overwriting
start priorities is forbidden. Merging expands lower tiers upwards to
preserve configured start priorities, except when removing `DEF_SWAP_PRIO`,
which merges downwards.
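The range rules above can be modeled as follows (an illustrative
Python sketch, not the kernel code): each tier is anchored by its
start priority, tiers are kept sorted in descending order, and a
tier's end is one below the next-higher tier's start, with the top
tier ending at SHRT_MAX.

```python
SHRT_MAX = 32767
DEF_SWAP_PRIO = -1

def ranges(starts):
    """Compute (start, end) per tier from start priorities alone."""
    ordered = sorted(starts.items(), key=lambda kv: -kv[1])
    out = {}
    for i, (name, start) in enumerate(ordered):
        # Top tier runs to SHRT_MAX; others end just below the
        # next-higher tier's start priority.
        end = SHRT_MAX if i == 0 else ordered[i - 1][1] - 1
        out[name] = (start, end)
    return out

tiers = ranges({"HDD": 50, "NET": DEF_SWAP_PRIO})
# Adding SSD at 100 splits HDD's range: HDD becomes (50, 99).
tiers = ranges({"SSD": 100, "HDD": 50, "NET": DEF_SWAP_PRIO})
```

This mirrors the documentation examples in patch 2, where adding
SSD:100 splits the HDD tier without moving any configured start
priority.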

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 MAINTAINERS     |   2 +
 mm/Kconfig      |  12 ++
 mm/Makefile     |   2 +-
 mm/swap.h       |   4 +
 mm/swap_state.c |  74 +++++++++++++
 mm/swap_tier.c  | 285 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_tier.h  |  20 ++++
 mm/swapfile.c   |   8 +-
 8 files changed, 403 insertions(+), 4 deletions(-)
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 76431aa5efbe..f3b07f1fa38a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16916,6 +16916,8 @@ F:	mm/swap.c
 F:	mm/swap.h
 F:	mm/swap_table.h
 F:	mm/swap_state.c
+F:	mm/swap_tier.c
+F:	mm/swap_tier.h
 F:	mm/swapfile.c
 
 MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
diff --git a/mm/Kconfig b/mm/Kconfig
index bd283958d675..b645e9430af5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,6 +19,18 @@ menuconfig SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config NR_SWAP_TIERS
+        int "Number of swap device tiers"
+        depends on SWAP
+        default 4
+        range 1 32
+        help
+          Sets the number of swap device tiers. Swap devices are
+          grouped into tiers based on their priority, allowing the
+          system to prefer faster devices over slower ones.
+
+          If unsure, say 4.
+
 config ZSWAP
 	bool "Compressed cache for swap pages"
 	depends on SWAP
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244e..db6449f84991 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,7 +75,7 @@ ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
 endif
 
-obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o swap_tier.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
diff --git a/mm/swap.h b/mm/swap.h
index a77016f2423b..fda8363bee73 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -16,6 +16,10 @@ extern int page_cluster;
 #define swap_entry_order(order)	0
 #endif
 
+#define DEF_SWAP_PRIO  -1
+
+extern spinlock_t swap_lock;
+extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
 
 /*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1415a5c54a43..bfdc0208e081 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -25,6 +25,7 @@
 #include "internal.h"
 #include "swap_table.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -924,8 +925,81 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
 }
 static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
 
+static ssize_t tiers_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return swap_tiers_sysfs_show(buf);
+}
+
+static ssize_t tiers_store(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *p, *token, *name, *tmp;
+	int ret = 0;
+	short prio;
+
+	tmp = kstrdup(buf, GFP_KERNEL);
+	if (!tmp)
+		return -ENOMEM;
+
+	spin_lock(&swap_lock);
+	spin_lock(&swap_tier_lock);
+	swap_tiers_snapshot();
+
+	p = tmp;
+	while ((token = strsep(&p, ", \t\n")) != NULL) {
+		if (!*token)
+			continue;
+
+		switch (token[0]) {
+		case '+':
+			name = token + 1;
+			token = strchr(name, ':');
+			if (!token) {
+				ret = -EINVAL;
+				goto out;
+			}
+			*token++ = '\0';
+			if (kstrtos16(token, 10, &prio)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			ret = swap_tiers_add(name, prio);
+			if (ret)
+				goto restore;
+			break;
+		case '-':
+			ret = swap_tiers_remove(token + 1);
+			if (ret)
+				goto restore;
+			break;
+		default:
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (!swap_tiers_validate()) {
+		ret = -EINVAL;
+		goto restore;
+	}
+	goto out;
+
+restore:
+	swap_tiers_snapshot_restore();
+out:
+	spin_unlock(&swap_tier_lock);
+	spin_unlock(&swap_lock);
+	kfree(tmp);
+	return ret ? ret : count;
+}
+
+static struct kobj_attribute tier_attr = __ATTR_RW(tiers);
+
 static struct attribute *swap_attrs[] = {
 	&vma_ra_enabled_attr.attr,
+	&tier_attr.attr,
 	NULL,
 };
 
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
new file mode 100644
index 000000000000..62b60fa8d3b7
--- /dev/null
+++ b/mm/swap_tier.c
@@ -0,0 +1,285 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/swap.h>
+#include <linux/memcontrol.h>
+#include "memcontrol-v1.h"
+#include <linux/sysfs.h>
+#include <linux/plist.h>
+
+#include "swap.h"
+#include "swap_tier.h"
+
+#define MAX_SWAPTIER	CONFIG_NR_SWAP_TIERS
+#define MAX_TIERNAME	16
+
+/*
+ * struct swap_tier - structure representing a swap tier.
+ *
+ * @name: name of the swap_tier.
+ * @prio: starting value of priority.
+ * @list: linked list of tiers.
+ */
+static struct swap_tier {
+	char name[MAX_TIERNAME];
+	short prio;
+	struct list_head list;
+} swap_tiers[MAX_SWAPTIER];
+
+DEFINE_SPINLOCK(swap_tier_lock);
+/* active swap priority list, sorted in descending order */
+static LIST_HEAD(swap_tier_active_list);
+/* unused swap_tier object */
+static LIST_HEAD(swap_tier_inactive_list);
+
+#define TIER_IDX(tier)	((tier) - swap_tiers)
+#define TIER_MASK(tier)	(1 << TIER_IDX(tier))
+#define TIER_INACTIVE_PRIO (DEF_SWAP_PRIO - 1)
+#define TIER_IS_ACTIVE(tier) ((tier->prio) !=  TIER_INACTIVE_PRIO)
+#define TIER_END_PRIO(tier) \
+	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
+	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
+
+#define for_each_tier(tier, idx) \
+	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
+		idx++, tier = &swap_tiers[idx])
+
+#define for_each_active_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_active_list, list)
+
+#define for_each_inactive_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_inactive_list, list)
+
+/*
+ * Naming Convention:
+ *   swap_tiers_*() - Public/exported functions
+ *   swap_tier_*()  - Private/internal functions
+ */
+
+static bool swap_tier_is_active(void)
+{
+	return !list_empty(&swap_tier_active_list) ? true : false;
+}
+
+static struct swap_tier *swap_tier_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (!strcmp(tier->name, name))
+			return tier;
+	}
+
+	return NULL;
+}
+
+/* Insert new tier into the active list sorted by priority. */
+static void swap_tier_activate(struct swap_tier *new)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier->prio <= new->prio)
+			break;
+	}
+
+	list_add_tail(&new->list, &tier->list);
+}
+
+static void swap_tier_inactivate(struct swap_tier *tier)
+{
+	list_move(&tier->list, &swap_tier_inactive_list);
+	tier->prio = TIER_INACTIVE_PRIO;
+}
+
+void swap_tiers_init(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+
+	for_each_tier(tier, idx) {
+		INIT_LIST_HEAD(&tier->list);
+		swap_tier_inactivate(tier);
+	}
+}
+
+ssize_t swap_tiers_sysfs_show(char *buf)
+{
+	struct swap_tier *tier;
+	ssize_t len = 0;
+
+	len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
+			 "Name", "Idx", "PrioStart", "PrioEnd");
+
+	spin_lock(&swap_tier_lock);
+	for_each_active_tier(tier) {
+		len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
+				     tier->name,
+				     TIER_IDX(tier),
+				     tier->prio,
+				     TIER_END_PRIO(tier));
+	}
+	spin_unlock(&swap_tier_lock);
+
+	return len;
+}
+
+static struct swap_tier *swap_tier_prepare(const char *name, short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	if (prio < DEF_SWAP_PRIO)
+		return ERR_PTR(-EINVAL);
+
+	if (list_empty(&swap_tier_inactive_list))
+		return ERR_PTR(-ENOSPC);
+
+	tier = list_first_entry(&swap_tier_inactive_list,
+		struct swap_tier, list);
+
+	list_del_init(&tier->list);
+	strscpy(tier->name, name, MAX_TIERNAME);
+	tier->prio = prio;
+
+	return tier;
+}
+
+static int swap_tier_check_range(short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		/* No overwrite */
+		if (tier->prio == prio)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static bool swap_tier_validate_name(const char *name)
+{
+	if (!name || !*name)
+		return false;
+
+	while (*name) {
+		if (!isalnum(*name) && *name != '_')
+			return false;
+		name++;
+	}
+	return true;
+}
+
+int swap_tiers_add(const char *name, int prio)
+{
+	int ret;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	/* Duplicate check */
+	if (swap_tier_lookup(name))
+		return -EEXIST;
+
+	if (!swap_tier_validate_name(name))
+		return -EINVAL;
+
+	ret = swap_tier_check_range(prio);
+	if (ret)
+		return ret;
+
+	tier = swap_tier_prepare(name, prio);
+	if (IS_ERR(tier)) {
+		ret = PTR_ERR(tier);
+		return ret;
+	}
+
+	swap_tier_activate(tier);
+
+	return ret;
+}
+
+int swap_tiers_remove(const char *name)
+{
+	int ret = 0;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	tier = swap_tier_lookup(name);
+	if (!tier)
+		return -EINVAL;
+
+	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
+	if (!list_is_singular(&swap_tier_active_list)
+		&& tier->prio == DEF_SWAP_PRIO)
+		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
+
+	swap_tier_inactivate(tier);
+
+	return ret;
+}
+
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+/*
+ * XXX: When multiple operations (adds and removes) are submitted in a
+ * single write, reverting each individually on failure is complex and
+ * error-prone. Instead, snapshot the entire state beforehand and
+ * restore it wholesale if any operation fails.
+ */
+void swap_tiers_snapshot(void)
+{
+	BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	memcpy(swap_tiers_snap, swap_tiers, sizeof(swap_tiers));
+}
+
+void swap_tiers_snapshot_restore(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	memcpy(swap_tiers, swap_tiers_snap, sizeof(swap_tiers));
+
+	INIT_LIST_HEAD(&swap_tier_active_list);
+	INIT_LIST_HEAD(&swap_tier_inactive_list);
+
+	for_each_tier(tier, idx) {
+		if (TIER_IS_ACTIVE(tier))
+			swap_tier_activate(tier);
+		else
+			swap_tier_inactivate(tier);
+	}
+}
+
+bool swap_tiers_validate(void)
+{
+	struct swap_tier *tier;
+
+	/*
+	 * Initial setting might not cover DEF_SWAP_PRIO.
+	 * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
+	 */
+	if (swap_tier_is_active()) {
+		tier = list_last_entry(&swap_tier_active_list,
+			struct swap_tier, list);
+
+		if (tier->prio != DEF_SWAP_PRIO)
+			return false;
+	}
+
+	return true;
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
new file mode 100644
index 000000000000..a1395ec02c24
--- /dev/null
+++ b/mm/swap_tier.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SWAP_TIER_H
+#define _SWAP_TIER_H
+
+#include <linux/types.h>
+#include <linux/spinlock.h>
+
+extern spinlock_t swap_tier_lock;
+
+/* Initialization and application */
+void swap_tiers_init(void);
+ssize_t swap_tiers_sysfs_show(char *buf);
+
+int swap_tiers_add(const char *name, int prio);
+int swap_tiers_remove(const char *name);
+
+void swap_tiers_snapshot(void);
+void swap_tiers_snapshot_restore(void);
+bool swap_tiers_validate(void);
+#endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ff315b752afd..03bf2a0a42ac 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -49,6 +49,7 @@
 #include "swap_table.h"
 #include "internal.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
@@ -64,7 +65,8 @@ static void move_cluster(struct swap_info_struct *si,
  *
  * Also protects swap_active_head total_swap_pages, and the SWP_WRITEOK flag.
  */
-static DEFINE_SPINLOCK(swap_lock);
+DEFINE_SPINLOCK(swap_lock);
+
 static unsigned int nr_swapfiles;
 atomic_long_t nr_swap_pages;
 /*
@@ -75,7 +77,6 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-#define DEF_SWAP_PRIO  -1
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ -88,7 +89,7 @@ static const char Bad_offset[] = "Bad swap offset entry ";
  * all active swap_info_structs
  * protected with swap_lock, and ordered by priority.
  */
-static PLIST_HEAD(swap_active_head);
+PLIST_HEAD(swap_active_head);
 
 /*
  * all available (active, not full) swap_info_structs
@@ -3890,6 +3891,7 @@ static int __init swapfile_init(void)
 		swap_migration_ad_supported = true;
 #endif	/* CONFIG_MIGRATION */
 
+	swap_tiers_init();
 	return 0;
 }
 subsys_initcall(swapfile_init);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v5 2/4] mm: swap: associate swap devices with tiers
  2026-03-25 17:54 [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
  2026-03-25 17:54 ` [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure Youngjun Park
@ 2026-03-25 17:54 ` Youngjun Park
  2026-03-27 19:06   ` kernel test robot
  2026-03-25 17:54 ` [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection Youngjun Park
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Youngjun Park @ 2026-03-25 17:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Li, Youngjun Park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny

This patch connects swap devices to the swap tier infrastructure,
ensuring that devices are correctly assigned to tiers based on their
priority.

A `tier_mask` is added to identify the tier membership of swap devices.
Although tier-based allocation logic is not yet implemented, this
mapping is necessary to track which tier a device belongs to. Upon
activation, the device is assigned to a tier by matching its priority
against the configured tier ranges.

The infrastructure allows dynamic modification of tiers, such as
splitting or merging ranges. These operations are permitted provided
that the tier assignment of already configured swap devices remains
unchanged.

This patch also adds the documentation for the swap tier feature,
covering the core concepts, sysfs interface usage, and configuration
details.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/mm/swap-tier.rst | 159 +++++++++++++++++++++++++++++++++
 MAINTAINERS                    |   1 +
 include/linux/swap.h           |   1 +
 mm/swap_state.c                |   2 +-
 mm/swap_tier.c                 | 101 ++++++++++++++++++---
 mm/swap_tier.h                 |  12 ++-
 mm/swapfile.c                  |   2 +
 7 files changed, 264 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst

diff --git a/Documentation/mm/swap-tier.rst b/Documentation/mm/swap-tier.rst
new file mode 100644
index 000000000000..7b29b0e4e414
--- /dev/null
+++ b/Documentation/mm/swap-tier.rst
@@ -0,0 +1,159 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li <chrisl@kernel.org> Youngjun Park <youngjun.park@lge.com>
+
+==========
+Swap Tier
+==========
+
+Swap tier is a collection of user-named groups classified by priority ranges.
+It acts as a facilitation layer, allowing users to manage swap devices based
+on their speeds.
+
+Users are encouraged to assign swap device priorities according to device
+speed to fully utilize this feature. While the current implementation is
+integrated with cgroups, the concept is designed to be extensible for other
+subsystems in the future.
+
+Use case
+-------
+
+Users can perform selective swapping by choosing a swap tier assigned according
+to speed within a cgroup.
+
+For more information on cgroup v2, please refer to
+``Documentation/admin-guide/cgroup-v2.rst``.
+
+Priority Range
+--------------
+
+The specified tiers must cover the entire priority range from -1
+(DEF_SWAP_PRIO) to SHRT_MAX.
+
+Consistency
+-----------
+
+Tier consistency is guaranteed with a focus on maximizing flexibility. When a
+swap device is activated within a tier range, the tier covering that device's
+priority is guaranteed not to disappear or change while the device remains
+active. Adding a new tier may split the range of an existing tier, but the
+active device's tier assignment remains unchanged.
+
+However, specifying a tier in a cgroup does not guarantee the tier's existence.
+Consequently, the corresponding tier can disappear at any time.
+
+Configuration Interface
+-----------------------
+
+The swap tiers can be configured via the following interface:
+
+/sys/kernel/mm/swap/tiers
+
+Operations can be performed using the following syntax:
+
+* Add:    ``+"<tiername>":"<start_priority>"``
+* Remove: ``-"<tiername>"``
+
+Tier names must consist of alphanumeric characters and underscores. Multiple
+operations can be provided in a single write, separated by commas (",") or
+whitespace (spaces, tabs, newlines).
+
+When configuring tiers, the specified value represents the **start priority**
+of that tier. The end priority is automatically determined by the start
+priority of the next higher tier. Consequently, adding a tier
+automatically adjusts the ranges of adjacent tiers to ensure continuity.
+
+Examples
+--------
+
+**1. Initialization**
+
+A tier starting at -1 is mandatory to cover the entire priority range up to
+SHRT_MAX. In this example, 'HDD' starts at 50, and 'NET' covers the remaining
+lower range starting from -1.
+
+::
+
+    # echo "+HDD:50, +NET:-1" > /sys/kernel/mm/swap/tiers
+    # cat /sys/kernel/mm/swap/tiers
+    Name             Idx   PrioStart   PrioEnd
+    HDD              0     50          32767
+    NET              1     -1          49
+
+**2. Adding a New Tier (split)**
+
+A new tier 'SSD' is added at priority 100, splitting the existing 'HDD' tier.
+The ranges are automatically recalculated:
+
+* 'SSD' takes the top range (100 to SHRT_MAX).
+* 'HDD' is adjusted to the range between 'NET' and 'SSD' (50 to 99).
+* 'NET' remains unchanged (-1 to 49).
+
+::
+
+    # echo "+SSD:100" > /sys/kernel/mm/swap/tiers
+    # cat /sys/kernel/mm/swap/tiers
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              0     50          99
+    NET              1     -1          49
+
+**3. Removal (merge)**
+
+Tiers can be removed using the '-' prefix.
+::
+
+    # echo "-SSD" > /sys/kernel/mm/swap/tiers
+
+When a tier is removed, its priority range is merged into the adjacent
+tier. The merge direction is always upward (the tier below expands),
+except when the lowest tier is removed — in that case the tier above
+shifts its starting priority down to -1 to maintain full range coverage.
+
+::
+
+    Initial state:
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              1     50          99
+    NET              0     -1          49
+
+    # echo "-SSD" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              1     50          32767       <- merged with SSD's range
+    NET              0     -1          49
+
+    # echo "-NET" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              1     -1          32767       <- shifted down to -1
+
+**4. Interaction with Active Swap Devices**
+
+If a swap device is active (swapon), the tier covering that device's
+priority cannot be removed. Splitting the active tier's range is only
+allowed above the device's priority.
+
+Assume a swap device is active at priority 60 (inside 'HDD' tier).
+
+::
+
+    # swapon -p 60 /dev/zram0
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              0     50          32767
+    NET              1     -1          49
+
+    # echo "-HDD" > /sys/kernel/mm/swap/tiers
+    -bash: echo: write error: Device or resource busy
+
+    # echo "+SSD:60" > /sys/kernel/mm/swap/tiers
+    -bash: echo: write error: Device or resource busy
+
+    # echo "+SSD:100" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              0     50          99          <- device (prio 60) stays here
+    NET              1     -1          49
diff --git a/MAINTAINERS b/MAINTAINERS
index f3b07f1fa38a..62a177983799 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16908,6 +16908,7 @@ R:	Youngjun Park <youngjun.park@lge.com>
 L:	linux-mm@kvack.org
 S:	Maintained
 F:	Documentation/mm/swap-table.rst
+F:	Documentation/mm/swap-tier.rst
 F:	include/linux/swap.h
 F:	include/linux/swapfile.h
 F:	include/linux/swapops.h
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1930f81e6be4..3bc06a1a4a17 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,6 +250,7 @@ struct swap_info_struct {
 	struct percpu_ref users;	/* indicate and keep swap device valid. */
 	unsigned long	flags;		/* SWP_USED etc: see above */
 	signed short	prio;		/* swap priority of this type */
+	int tier_mask;			/* swap tier mask */
 	struct plist_node list;		/* entry in swap_active_head */
 	signed char	type;		/* strange name for an index */
 	unsigned int	max;		/* size of this swap device */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index bfdc0208e081..847096e2f3e5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -980,7 +980,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 		}
 	}
 
-	if (!swap_tiers_validate()) {
+	if (!swap_tiers_update()) {
 		ret = -EINVAL;
 		goto restore;
 	}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index 62b60fa8d3b7..91aac55d3a8b 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -38,6 +38,8 @@ static LIST_HEAD(swap_tier_inactive_list);
 	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
 	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
 
+#define MASK_TO_TIER(mask) (&swap_tiers[__ffs((mask))])
+
 #define for_each_tier(tier, idx) \
 	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
 		idx++, tier = &swap_tiers[idx])
@@ -59,6 +61,26 @@ static bool swap_tier_is_active(void)
 	return !list_empty(&swap_tier_active_list) ? true : false;
 }
 
+static bool swap_tier_prio_in_range(struct swap_tier *tier, short prio)
+{
+	if (tier->prio <= prio && TIER_END_PRIO(tier) >= prio)
+		return true;
+
+	return false;
+}
+
+static bool swap_tier_prio_is_used(short prio)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier->prio == prio)
+			return true;
+	}
+
+	return false;
+}
+
 static struct swap_tier *swap_tier_lookup(const char *name)
 {
 	struct swap_tier *tier;
@@ -96,6 +118,7 @@ void swap_tiers_init(void)
 	int idx;
 
 	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+	BUILD_BUG_ON(MAX_SWAPTIER > TIER_DEFAULT_IDX);
 
 	for_each_tier(tier, idx) {
 		INIT_LIST_HEAD(&tier->list);
@@ -146,17 +169,29 @@ static struct swap_tier *swap_tier_prepare(const char *name, short prio)
 	return tier;
 }
 
-static int swap_tier_check_range(short prio)
+static int swap_tier_can_split_range(short new_prio)
 {
+	struct swap_info_struct *p;
 	struct swap_tier *tier;
 
 	lockdep_assert_held(&swap_lock);
 	lockdep_assert_held(&swap_tier_lock);
 
-	for_each_active_tier(tier) {
-		/* No overwrite */
-		if (tier->prio == prio)
-			return -EINVAL;
+	plist_for_each_entry(p, &swap_active_head, list) {
+		if (p->tier_mask == TIER_DEFAULT_MASK)
+			continue;
+
+		tier = MASK_TO_TIER(p->tier_mask);
+		if (!swap_tier_prio_in_range(tier, new_prio))
+			continue;
+
+		/*
+		 * Device sits in a tier that spans new_prio;
+		 * splitting here would reassign it to a
+		 * different tier.
+		 */
+		if (p->prio >= new_prio)
+			return -EBUSY;
 	}
 
 	return 0;
@@ -190,7 +225,11 @@ int swap_tiers_add(const char *name, int prio)
 	if (!swap_tier_validate_name(name))
 		return -EINVAL;
 
-	ret = swap_tier_check_range(prio);
+	/* No overwrite */
+	if (swap_tier_prio_is_used(prio))
+		return -EBUSY;
+
+	ret = swap_tier_can_split_range(prio);
 	if (ret)
 		return ret;
 
@@ -217,6 +256,11 @@ int swap_tiers_remove(const char *name)
 	if (!tier)
 		return -EINVAL;
 
+	/* Simulate adding a tier to check for conflicts */
+	ret = swap_tier_can_split_range(tier->prio);
+	if (ret)
+		return ret;
+
 	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
 	if (!list_is_singular(&swap_tier_active_list)
 		&& tier->prio == DEF_SWAP_PRIO)
@@ -227,13 +271,15 @@ int swap_tiers_remove(const char *name)
 	return ret;
 }
 
-static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
 /*
- * XXX: When multiple operations (adds and removes) are submitted in a
- * single write, reverting each individually on failure is complex and
- * error-prone. Instead, snapshot the entire state beforehand and
- * restore it wholesale if any operation fails.
+ * XXX: Static global snapshot buffer for batch operations. It is
+ * small and written at most once per store, so a static global is
+ * acceptable. When multiple adds/removes are submitted in a single
+ * write, reverting each individually on failure is error-prone;
+ * snapshot beforehand and restore wholesale if any operation fails.
  */
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+
 void swap_tiers_snapshot(void)
 {
 	BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
@@ -265,9 +311,29 @@ void swap_tiers_snapshot_restore(void)
 	}
 }
 
-bool swap_tiers_validate(void)
+void swap_tiers_assign_dev(struct swap_info_struct *swp)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+
+	for_each_active_tier(tier) {
+		if (swap_tier_prio_in_range(tier, swp->prio)) {
+			swp->tier_mask = TIER_MASK(tier);
+			return;
+		}
+	}
+
+	swp->tier_mask = TIER_DEFAULT_MASK;
+}
+
+bool swap_tiers_update(void)
 {
 	struct swap_tier *tier;
+	struct swap_info_struct *swp;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
 
 	/*
 	 * Initial setting might not cover DEF_SWAP_PRIO.
@@ -281,5 +347,16 @@ bool swap_tiers_validate(void)
 			return false;
 	}
 
+	/*
+	 * On the first tier configuration, active devices may still
+	 * carry the default tier_mask; assign them to their tier.
+	 */
+	plist_for_each_entry(swp, &swap_active_head, list) {
+		/* Tier is already configured */
+		if (swp->tier_mask != TIER_DEFAULT_MASK)
+			break;
+		swap_tiers_assign_dev(swp);
+	}
+
 	return true;
 }
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index a1395ec02c24..6f281e95ed81 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -5,8 +5,15 @@
 #include <linux/types.h>
 #include <linux/spinlock.h>
 
+/* Forward declarations */
+struct swap_info_struct;
+
 extern spinlock_t swap_tier_lock;
 
+#define TIER_ALL_MASK		(~0)
+#define TIER_DEFAULT_IDX	(31)
+#define TIER_DEFAULT_MASK	(1 << TIER_DEFAULT_IDX)
+
 /* Initialization and application */
 void swap_tiers_init(void);
 ssize_t swap_tiers_sysfs_show(char *buf);
@@ -16,5 +23,8 @@ int swap_tiers_remove(const char *name);
 
 void swap_tiers_snapshot(void);
 void swap_tiers_snapshot_restore(void);
-bool swap_tiers_validate(void);
+bool swap_tiers_update(void);
+
+/* Tier assignment */
+void swap_tiers_assign_dev(struct swap_info_struct *swp);
 #endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 03bf2a0a42ac..645e10c3af28 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2914,6 +2914,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 
 	/* Add back to available list */
 	add_to_avail_list(si, true);
+
+	swap_tiers_assign_dev(si);
 }
 
 /*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection
  2026-03-25 17:54 [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
  2026-03-25 17:54 ` [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure Youngjun Park
  2026-03-25 17:54 ` [PATCH v5 2/4] mm: swap: associate swap devices with tiers Youngjun Park
@ 2026-03-25 17:54 ` Youngjun Park
  2026-03-27 23:50   ` kernel test robot
  2026-03-29 11:10   ` kernel test robot
  2026-03-25 17:54 ` [PATCH v5 4/4] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 13+ messages in thread
From: Youngjun Park @ 2026-03-25 17:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Li, Youngjun Park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny

Integrate the swap tier infrastructure with memcg so that each cgroup
can restrict which swap tiers, and thus swap devices, it may use.

Introduce `memory.swap.tiers` for configuring allowed tiers, and
`memory.swap.tiers.effective` for exposing the effective tiers.
The effective tiers are the intersection of the configured tiers and
the parent's effective tiers.

Note that cgroups do not pin swap tiers: as with `cpuset` and CPU
hotplug, the tier configuration may change regardless of current usage.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 27 +++++++
 include/linux/memcontrol.h              |  3 +-
 mm/memcontrol.c                         | 95 +++++++++++++++++++++++++
 mm/swap_state.c                         |  5 +-
 mm/swap_tier.c                          | 93 +++++++++++++++++++++++-
 mm/swap_tier.h                          | 56 +++++++++++++--
 6 files changed, 268 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8ad0b2781317..6effe1bfe74d 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1850,6 +1850,33 @@ The following nested keys are defined.
 	Swap usage hard limit.  If a cgroup's swap usage reaches this
 	limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.swap.tiers
+        A read-write file which exists on non-root cgroups.
+        Format is similar to cgroup.subtree_control.
+
+        Controls which swap tiers this cgroup is allowed to swap
+        out to. All tiers are enabled by default.
+
+          (-|+)TIER [(-|+)TIER ...]
+
+        "-" disables a tier, "+" re-enables it.
+        Entries are whitespace-delimited.
+
+        Changes here are combined with parent restrictions to
+        compute memory.swap.tiers.effective.
+
+        If a tier is removed from /sys/kernel/mm/swap/tiers,
+        any prior disable for that tier is invalidated.
+
+  memory.swap.tiers.effective
+        A read-only file which exists on non-root cgroups.
+
+        Shows the tiers this cgroup can actually swap out to.
+        This is the intersection of the parent's effective tiers
+        and this cgroup's own memory.swap.tiers configuration.
+        A child cannot enable a tier that is disabled in its
+        parent.
+
   memory.swap.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0782c72a1997..5603d6ce905f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -281,7 +281,8 @@ struct mem_cgroup {
 	/* per-memcg mm_struct list */
 	struct lru_gen_mm_list mm_list;
 #endif
-
+	int tier_mask;
+	int tier_effective_mask;
 #ifdef CONFIG_MEMCG_V1
 	/* Legacy consumer-oriented counters */
 	struct page_counter kmem;		/* v1 only */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac7b46c4d67e..5d7036b3926f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -68,6 +68,7 @@
 #include <net/ip.h>
 #include "slab.h"
 #include "memcontrol-v1.h"
+#include "swap_tier.h"
 
 #include <linux/uaccess.h>
 
@@ -4086,6 +4087,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	WRITE_ONCE(memcg->zswap_writeback, true);
 #endif
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
+	memcg->tier_mask = TIER_ALL_MASK;
+	swap_tiers_memcg_inherit_mask(memcg, parent);
+
 	if (parent) {
 		WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
 
@@ -5694,6 +5698,86 @@ static int swap_events_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int swap_tier_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	swap_tiers_mask_show(m, memcg->tier_mask);
+	return 0;
+}
+
+static ssize_t swap_tier_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	char *pos, *token;
+	int ret = 0;
+	int original_mask;
+
+	pos = strstrip(buf);
+
+	spin_lock(&swap_tier_lock);
+	if (!*pos) {
+		memcg->tier_mask = TIER_ALL_MASK;
+		goto sync;
+	}
+
+	original_mask = memcg->tier_mask;
+
+	while ((token = strsep(&pos, " \t\n")) != NULL) {
+		int mask;
+
+		if (!*token)
+			continue;
+
+		if (token[0] != '-' && token[0] != '+') {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		mask = swap_tiers_mask_lookup(token+1);
+		if (!mask) {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		/*
+		 * A child may toggle any tier here; the effective mask is
+		 * clamped by the parent's, so it can never exceed the parent.
+		 */
+		switch (token[0]) {
+		case '-':
+			memcg->tier_mask &= ~mask;
+			break;
+		case '+':
+			memcg->tier_mask |= mask;
+			break;
+		default:
+			ret = -EINVAL;
+			break;
+		}
+
+		if (ret)
+			goto err;
+	}
+
+sync:
+	swap_tiers_memcg_sync_mask(memcg);
+err:
+	if (ret)
+		memcg->tier_mask = original_mask;
+	spin_unlock(&swap_tier_lock);
+	return ret ? ret : nbytes;
+}
+
+static int swap_tier_effective_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	swap_tiers_mask_show(m, memcg->tier_effective_mask);
+	return 0;
+}
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -5726,6 +5810,17 @@ static struct cftype swap_files[] = {
 		.file_offset = offsetof(struct mem_cgroup, swap_events_file),
 		.seq_show = swap_events_show,
 	},
+	{
+		.name = "swap.tiers",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_tier_show,
+		.write = swap_tier_write,
+	},
+	{
+		.name = "swap.tiers.effective",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_tier_effective_show,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 847096e2f3e5..2d1bc6bc09d3 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -938,6 +938,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 	char *p, *token, *name, *tmp;
 	int ret = 0;
 	short prio;
+	int mask = 0;
 
 	tmp = kstrdup(buf, GFP_KERNEL);
 	if (!tmp)
@@ -970,7 +971,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 				goto restore;
 			break;
 		case '-':
-			ret = swap_tiers_remove(token + 1);
+			ret = swap_tiers_remove(token + 1, &mask);
 			if (ret)
 				goto restore;
 			break;
@@ -980,7 +981,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 		}
 	}
 
-	if (!swap_tiers_update()) {
+	if (!swap_tiers_update(mask)) {
 		ret = -EINVAL;
 		goto restore;
 	}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index 91aac55d3a8b..64365569b970 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -244,7 +244,7 @@ int swap_tiers_add(const char *name, int prio)
 	return ret;
 }
 
-int swap_tiers_remove(const char *name)
+int swap_tiers_remove(const char *name, int *mask)
 {
 	int ret = 0;
 	struct swap_tier *tier;
@@ -267,6 +267,7 @@ int swap_tiers_remove(const char *name)
 		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
 
 	swap_tier_inactivate(tier);
+	*mask |= TIER_MASK(tier);
 
 	return ret;
 }
@@ -327,7 +328,24 @@ void swap_tiers_assign_dev(struct swap_info_struct *swp)
 	swp->tier_mask = TIER_DEFAULT_MASK;
 }
 
-bool swap_tiers_update(void)
+/*
+ * When a tier is removed, set its bit in every memcg's tier_mask and
+ * tier_effective_mask. This prevents stale tier indices from being
+ * silently filtered out if the same index is reused later.
+ */
+static void swap_tier_memcg_propagate(int mask)
+{
+	struct mem_cgroup *child;
+
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(child, root_mem_cgroup) {
+		child->tier_mask |= mask;
+		child->tier_effective_mask |= mask;
+	}
+	rcu_read_unlock();
+}
+
+bool swap_tiers_update(int mask)
 {
 	struct swap_tier *tier;
 	struct swap_info_struct *swp;
@@ -357,6 +375,77 @@ bool swap_tiers_update(void)
 			break;
 		swap_tiers_assign_dev(swp);
 	}
+	/*
+	 * XXX: Re-enable a removed tier's bit in every memcg so a tier
+	 * reusing the same index later starts enabled, not stale-disabled.
+	 * (Tier refs would also work; this keeps cgroup config independent.)
+	 */
+	if (mask)
+		swap_tier_memcg_propagate(mask);
 
 	return true;
 }
+
+void swap_tiers_mask_show(struct seq_file *m, int mask)
+{
+	struct swap_tier *tier;
+
+	spin_lock(&swap_tier_lock);
+	for_each_active_tier(tier) {
+		if (mask & TIER_MASK(tier))
+			seq_printf(m, "%s ", tier->name);
+	}
+	spin_unlock(&swap_tier_lock);
+	seq_puts(m, "\n");
+}
+
+int swap_tiers_mask_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		if (!strcmp(name, tier->name))
+			return TIER_MASK(tier);
+	}
+
+	return 0;
+}
+
+static void __swap_tier_memcg_inherit_mask(struct mem_cgroup *memcg,
+	struct mem_cgroup *parent)
+{
+	int effective_mask
+		= parent ? parent->tier_effective_mask : TIER_ALL_MASK;
+
+	memcg->tier_effective_mask
+		= effective_mask & memcg->tier_mask;
+}
+
+/* Computes the initial effective mask from the parent's effective mask. */
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg,
+	struct mem_cgroup *parent)
+{
+	spin_lock(&swap_tier_lock);
+	rcu_read_lock();
+	__swap_tier_memcg_inherit_mask(memcg, parent);
+	rcu_read_unlock();
+	spin_unlock(&swap_tier_lock);
+}
+
+/*
+ * Called when a memcg's tier_mask is modified. Walks the subtree
+ * and recomputes each descendant's effective mask against its parent.
+ */
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *child;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(child, memcg)
+		__swap_tier_memcg_inherit_mask(child, parent_mem_cgroup(child));
+	rcu_read_unlock();
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index 6f281e95ed81..329c6a4f375f 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -10,21 +10,65 @@ struct swap_info_struct;
 
 extern spinlock_t swap_tier_lock;
 
-#define TIER_ALL_MASK		(~0)
-#define TIER_DEFAULT_IDX	(31)
-#define TIER_DEFAULT_MASK	(1 << TIER_DEFAULT_IDX)
-
 /* Initialization and application */
 void swap_tiers_init(void);
 ssize_t swap_tiers_sysfs_show(char *buf);
 
 int swap_tiers_add(const char *name, int prio);
-int swap_tiers_remove(const char *name);
+int swap_tiers_remove(const char *name, int *mask);
 
 void swap_tiers_snapshot(void);
 void swap_tiers_snapshot_restore(void);
-bool swap_tiers_update(void);
+bool swap_tiers_update(int mask);
 
 /* Tier assignment */
 void swap_tiers_assign_dev(struct swap_info_struct *swp);
+
+#ifdef CONFIG_SWAP
+/* Memcg related functions */
+void swap_tiers_mask_show(struct seq_file *m, int mask);
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg,
+	struct mem_cgroup *parent);
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg);
+#else
+static inline void swap_tiers_mask_show(struct seq_file *m, int mask) {}
+static inline void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg,
+	struct mem_cgroup *parent) {}
+static inline void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg) {}
+static inline void __swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg) {}
+#endif
+
+/* Mask and tier lookup */
+int swap_tiers_mask_lookup(const char *name);
+
+/**
+ * swap_tiers_mask_test - Check whether two tier masks intersect
+ * @tier_mask: tier mask of a swap device
+ * @mask: effective tier mask to test against (e.g. a memcg's)
+ *
+ * Return: true if the masks share at least one tier, false otherwise
+ */
+static inline bool swap_tiers_mask_test(int tier_mask, int mask)
+{
+	return tier_mask & mask;
+}
+
+#define TIER_ALL_MASK		(~0)
+#define TIER_DEFAULT_IDX	(31)
+#define TIER_DEFAULT_MASK	(1 << TIER_DEFAULT_IDX)
+
+#ifdef CONFIG_MEMCG
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+	struct mem_cgroup *memcg = folio_memcg(folio);
+
+	return memcg ? memcg->tier_effective_mask : TIER_ALL_MASK;
+}
+#else
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+	return TIER_ALL_MASK;
+}
+#endif
+
 #endif /* _SWAP_TIER_H */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v5 4/4] mm: swap: filter swap allocation by memcg tier mask
  2026-03-25 17:54 [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
                   ` (2 preceding siblings ...)
  2026-03-25 17:54 ` [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection Youngjun Park
@ 2026-03-25 17:54 ` Youngjun Park
  2026-03-25 23:20 ` [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Andrew Morton
  2026-03-26  7:41 ` [syzbot ci] " syzbot ci
  5 siblings, 0 replies; 13+ messages in thread
From: Youngjun Park @ 2026-03-25 17:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Li, Youngjun Park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny

Apply memcg tier effective mask during swap slot allocation to
enforce per-cgroup swap tier restrictions.

In the fast path, check the percpu cached swap_info's tier_mask
against the folio's effective mask. If it does not match, fall
through to the slow path. In the slow path, skip swap devices
whose tier_mask is not covered by the folio's effective mask.

This works correctly when there is only one non-rotational
device in the system and no devices share the same priority.
However, there are known limitations:

 - When multiple non-rotational devices exist, percpu swap
   caches from different memcg contexts may reference
   mismatched tiers, causing unnecessary fast path misses.

 - When multiple non-rotational devices are assigned to
   different tiers and same-priority devices exist among
   them, cluster-based rotation may not work correctly.

These edge cases do not affect the primary use case of
directing swap traffic per cgroup. Further optimization is
planned for future work.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 mm/swapfile.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 645e10c3af28..627b09e57c1d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1352,15 +1352,22 @@ static bool swap_alloc_fast(struct folio *folio)
 	struct swap_cluster_info *ci;
 	struct swap_info_struct *si;
 	unsigned int offset;
+	int mask = folio_tier_effective_mask(folio);
 
 	/*
 	 * Once allocated, swap_info_struct will never be completely freed,
 	 * so checking it's liveness by get_swap_device_info is enough.
 	 */
 	si = this_cpu_read(percpu_swap_cluster.si[order]);
+	if (!si || !swap_tiers_mask_test(si->tier_mask, mask) ||
+		!get_swap_device_info(si))
+		return false;
+
 	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
-	if (!si || !offset || !get_swap_device_info(si))
+	if (!offset) {
+		put_swap_device(si);
 		return false;
+	}
 
 	ci = swap_cluster_lock(si, offset);
 	if (cluster_is_usable(ci, order)) {
@@ -1379,10 +1386,14 @@ static bool swap_alloc_fast(struct folio *folio)
 static void swap_alloc_slow(struct folio *folio)
 {
 	struct swap_info_struct *si, *next;
+	int mask = folio_tier_effective_mask(folio);
 
 	spin_lock(&swap_avail_lock);
 start_over:
 	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
+		if (!swap_tiers_mask_test(si->tier_mask, mask))
+			continue;
+
 		/* Rotate the device and switch to a new cluster */
 		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-03-25 17:54 [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
                   ` (3 preceding siblings ...)
  2026-03-25 17:54 ` [PATCH v5 4/4] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
@ 2026-03-25 23:20 ` Andrew Morton
  2026-03-26 14:04   ` YoungJun Park
  2026-03-26  7:41 ` [syzbot ci] " syzbot ci
  5 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2026-03-25 23:20 UTC (permalink / raw)
  To: Youngjun Park
  Cc: Chris Li, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny

On Thu, 26 Mar 2026 02:54:49 +0900 Youngjun Park <youngjun.park@lge.com> wrote:

> This is v5 of the "Swap Tiers" series.

Thanks.  I'd prefer to hold off until the next cycle, please.  As I
mentioned in 

https://lkml.kernel.org/r/20260323202941.08ddf2b0411501cae801ab4c@linux-foundation.org

Also, AI review had a lot to say, Please take a look.  Should you do
so, I'm interested in learning how much of that material was useful. 
Thanks.

https://sashiko.dev/#/patchset/20260325175453.2523280-1-youngjun.park%40lge.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [syzbot ci] Re: mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-03-25 17:54 [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
                   ` (4 preceding siblings ...)
  2026-03-25 23:20 ` [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Andrew Morton
@ 2026-03-26  7:41 ` syzbot ci
  5 siblings, 0 replies; 13+ messages in thread
From: syzbot ci @ 2026-03-26  7:41 UTC (permalink / raw)
  To: akpm, baohua, bhe, cgroups, chrisl, gunho.lee, hannes,
	hyungjun.cho, kasong, linux-kernel, linux-mm, mhocko, mkoutny,
	muchun.song, nphamcs, roman.gushchin, shakeel.butt, shikemeng,
	taejoon.song, youngjun.park
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
https://lore.kernel.org/all/20260325175453.2523280-1-youngjun.park@lge.com
* [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure
* [PATCH v5 2/4] mm: swap: associate swap devices with tiers
* [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection
* [PATCH v5 4/4] mm: swap: filter swap allocation by memcg tier mask

and found the following issue:
WARNING in folio_tier_effective_mask

Full report is available here:
https://ci.syzbot.org/series/6ed50ca2-a106-41e9-aa4d-7c46869e0011

***

WARNING in folio_tier_effective_mask

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      6381a729fa7dda43574d93ab9c61cec516dd885b
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/e5c66fa8-a7fd-4809-9564-448847b5f230/config
C repro:   https://ci.syzbot.org/findings/d64cc6fa-636a-40a0-b131-d02ce1129494/c_repro
syz repro: https://ci.syzbot.org/findings/d64cc6fa-636a-40a0-b131-d02ce1129494/syz_repro

------------[ cut here ]------------
debug_locks && !(rcu_read_lock_held() || lock_is_held(&(&cgroup_mutex)->dep_map))
WARNING: ./include/linux/memcontrol.h:377 at obj_cgroup_memcg include/linux/memcontrol.h:377 [inline], CPU#1: syz.0.17/5955
WARNING: ./include/linux/memcontrol.h:377 at folio_memcg include/linux/memcontrol.h:431 [inline], CPU#1: syz.0.17/5955
WARNING: ./include/linux/memcontrol.h:377 at folio_tier_effective_mask+0x175/0x210 mm/swap_tier.h:63, CPU#1: syz.0.17/5955
Modules linked in:
CPU: 1 UID: 0 PID: 5955 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:obj_cgroup_memcg include/linux/memcontrol.h:377 [inline]
RIP: 0010:folio_memcg include/linux/memcontrol.h:431 [inline]
RIP: 0010:folio_tier_effective_mask+0x175/0x210 mm/swap_tier.h:63
Code: 0f b6 04 20 84 c0 75 6b 8b 03 eb 0a e8 04 b8 9e ff b8 ff ff ff ff 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc cc e8 ec b7 9e ff 90 <0f> 0b 90 eb 9b 44 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c c2 fe ff ff
RSP: 0018:ffffc90004bee6d0 EFLAGS: 00010293
RAX: ffffffff8226dd04 RBX: ffff888113589280 RCX: ffff8881727b8000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffea0006c62207 R09: 1ffffd4000d8c440
R10: dffffc0000000000 R11: fffff94000d8c441 R12: dffffc0000000000
R13: ffffea0006c62208 R14: ffffea0006c62200 R15: ffffea0006c62230
FS:  00005555771cb500(0000) GS:ffff8882a9462000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2ed63fff CR3: 0000000112d86000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 swap_alloc_fast mm/swapfile.c:1355 [inline]
 folio_alloc_swap+0x392/0x13a0 mm/swapfile.c:1735
 shrink_folio_list+0x26a7/0x5250 mm/vmscan.c:1281
 reclaim_folio_list+0x100/0x460 mm/vmscan.c:2171
 reclaim_pages+0x45b/0x530 mm/vmscan.c:2208
 madvise_cold_or_pageout_pte_range+0x1ef5/0x2220 mm/madvise.c:563
 walk_pmd_range mm/pagewalk.c:142 [inline]
 walk_pud_range mm/pagewalk.c:233 [inline]
 walk_p4d_range mm/pagewalk.c:275 [inline]
 walk_pgd_range+0xfdc/0x1d90 mm/pagewalk.c:316
 __walk_page_range+0x14c/0x710 mm/pagewalk.c:424
 walk_page_range_vma_unsafe+0x309/0x410 mm/pagewalk.c:728
 madvise_pageout_page_range mm/madvise.c:622 [inline]
 madvise_pageout mm/madvise.c:647 [inline]
 madvise_vma_behavior+0x28b9/0x42c0 mm/madvise.c:1358
 madvise_walk_vmas+0x573/0xae0 mm/madvise.c:1713
 madvise_do_behavior+0x386/0x540 mm/madvise.c:1929
 do_madvise+0x1fa/0x2e0 mm/madvise.c:2022
 __do_sys_madvise mm/madvise.c:2031 [inline]
 __se_sys_madvise mm/madvise.c:2029 [inline]
 __x64_sys_madvise+0xa6/0xc0 mm/madvise.c:2029
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5e5af9c799
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff0c8e2708 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f5e5b215fa0 RCX: 00007f5e5af9c799
RDX: 0000000000000015 RSI: 0000000000600000 RDI: 0000200000000000
RBP: 00007f5e5b032c99 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5e5b215fac R14: 00007f5e5b215fa0 R15: 00007f5e5b215fa0
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2026-03-25 23:20 ` [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Andrew Morton
@ 2026-03-26 14:04   ` YoungJun Park
  0 siblings, 0 replies; 13+ messages in thread
From: YoungJun Park @ 2026-03-26 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chris Li, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny

On Wed, Mar 25, 2026 at 04:20:03PM -0700, Andrew Morton wrote:
> On Thu, 26 Mar 2026 02:54:49 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
> 
> > This is v5 of the "Swap Tiers" series.
> 
> Thanks.  I'd prefer to hold off until the next cycle, please.  As I
> mentioned in 
> 
> https://lkml.kernel.org/r/20260323202941.08ddf2b0411501cae801ab4c@linux-foundation.org
> 
> Also, AI review had a lot to say, Please take a look.  Should you do
> so, I'm interested in learning how much of that material was useful. 
> Thanks.
> 
> https://sashiko.dev/#/patchset/20260325175453.2523280-1-youngjun.park%40lge.com

Hi Andrew, Understood. 
I'll address the AI review comments and run syzbot CI, 
then resubmit for the next cycle.

Thanks,
Youngjun Park

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 2/4] mm: swap: associate swap devices with tiers
  2026-03-25 17:54 ` [PATCH v5 2/4] mm: swap: associate swap devices with tiers Youngjun Park
@ 2026-03-27 19:06   ` kernel test robot
  0 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2026-03-27 19:06 UTC (permalink / raw)
  To: Youngjun Park, Andrew Morton
  Cc: oe-kbuild-all, Linux Memory Management List, Chris Li,
	Youngjun Park, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny

Hi Youngjun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 6381a729fa7dda43574d93ab9c61cec516dd885b]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-introduce-swap-tier-infrastructure/20260327-203639
base:   6381a729fa7dda43574d93ab9c61cec516dd885b
patch link:    https://lore.kernel.org/r/20260325175453.2523280-3-youngjun.park%40lge.com
patch subject: [PATCH v5 2/4] mm: swap: associate swap devices with tiers
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260327/202603271922.UNxxB12b-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603271922.UNxxB12b-lkp@intel.com/

All warnings (new ones prefixed by >>):

   Non-Preserved Properties
   ======================== [docutils]
>> Documentation/mm/swap-tier.rst:19: WARNING: Title underline too short.
--
   Documentation/userspace-api/landlock:526: ./include/uapi/linux/landlock.h:45: ERROR: Unknown target name: "network flags". [docutils]
   Documentation/userspace-api/landlock:526: ./include/uapi/linux/landlock.h:50: ERROR: Unknown target name: "scope flags". [docutils]
   Documentation/userspace-api/landlock:526: ./include/uapi/linux/landlock.h:24: ERROR: Unknown target name: "filesystem flags". [docutils]
   Documentation/userspace-api/landlock:535: ./include/uapi/linux/landlock.h:166: ERROR: Unknown target name: "filesystem flags". [docutils]
   Documentation/userspace-api/landlock:535: ./include/uapi/linux/landlock.h:189: ERROR: Unknown target name: "network flags". [docutils]
>> Documentation/mm/swap-tier.rst: WARNING: document isn't included in any toctree [toc.not_included]
   Documentation/networking/skbuff:36: ./include/linux/skbuff.h:181: WARNING: Failed to create a cross reference. A title or caption not found: 'crc' [ref.ref]


vim +19 Documentation/mm/swap-tier.rst

    17	
    18	Use case
  > 19	-------
    20	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection
  2026-03-25 17:54 ` [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection Youngjun Park
@ 2026-03-27 23:50   ` kernel test robot
  2026-03-29 11:10   ` kernel test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2026-03-27 23:50 UTC (permalink / raw)
  To: Youngjun Park, Andrew Morton
  Cc: oe-kbuild-all, Linux Memory Management List, Chris Li,
	Youngjun Park, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny

Hi Youngjun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 6381a729fa7dda43574d93ab9c61cec516dd885b]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-introduce-swap-tier-infrastructure/20260327-203639
base:   6381a729fa7dda43574d93ab9c61cec516dd885b
patch link:    https://lore.kernel.org/r/20260325175453.2523280-4-youngjun.park%40lge.com
patch subject: [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260328/202603280046.d4u6S8W9-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603280046.d4u6S8W9-lkp@intel.com/

All warnings (new ones prefixed by >>):

   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: m,\b(\S*)(Documentation/[A-Za-z0-9
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: Documentation/devicetree/dt-object-internal.txt
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: m,^Documentation/scheduler/sched-pelt
   Warning: tools/docs/documentation-file-ref-check references a file that doesn't exist: m,(Documentation/translations/[
   Using alabaster theme
>> Documentation/admin-guide/cgroup-v2.rst:1860: WARNING: Inline substitution_reference start-string without end-string. [docutils]
>> Documentation/admin-guide/cgroup-v2.rst:1860: WARNING: Inline substitution_reference start-string without end-string. [docutils]
   Documentation/core-api/kref:328: ./include/linux/kref.h:72: WARNING: Invalid C declaration: Expected end of definition. [error at 96]
   int kref_put_mutex (struct kref *kref, void (*release)(struct kref *kref), struct mutex *mutex) __cond_acquires(true# mutex)
   ------------------------------------------------------------------------------------------------^
   Documentation/core-api/kref:328: ./include/linux/kref.h:94: WARNING: Invalid C declaration: Expected end of definition. [error at 92]
   int kref_put_lock (struct kref *kref, void (*release)(struct kref *kref), spinlock_t *lock) __cond_acquires(true# lock)


vim +1860 Documentation/admin-guide/cgroup-v2.rst

  1427	
  1428		  ==========            ================================
  1429		  swappiness            Swappiness value to reclaim with
  1430		  ==========            ================================
  1431	
  1432		Specifying a swappiness value instructs the kernel to perform
  1433		the reclaim with that swappiness value. Note that this has the
  1434		same semantics as vm.swappiness applied to memcg reclaim with
  1435		all the existing limitations and potential future extensions.
  1436	
  1437		The valid range for swappiness is [0-200, max]; setting
  1438		swappiness=max exclusively reclaims anonymous memory.
  1439	
  1440	  memory.peak
  1441		A read-write single value file which exists on non-root cgroups.
  1442	
  1443		The max memory usage recorded for the cgroup and its descendants since
  1444		either the creation of the cgroup or the most recent reset for that FD.
  1445	
  1446		A write of any non-empty string to this file resets it to the
  1447		current memory usage for subsequent reads through the same
  1448		file descriptor.
  1449	
  1450	  memory.oom.group
  1451		A read-write single value file which exists on non-root
  1452		cgroups.  The default value is "0".
  1453	
  1454		Determines whether the cgroup should be treated as
  1455		an indivisible workload by the OOM killer. If set,
  1456		all tasks belonging to the cgroup or to its descendants
  1457		(if the memory cgroup is not a leaf cgroup) are killed
  1458		together or not at all. This can be used to avoid
  1459		partial kills to guarantee workload integrity.
  1460	
  1461		Tasks with the OOM protection (oom_score_adj set to -1000)
  1462		are treated as an exception and are never killed.
  1463	
  1464		If the OOM killer is invoked in a cgroup, it's not going
  1465		to kill any tasks outside of this cgroup, regardless of
  1466		the memory.oom.group values of ancestor cgroups.
  1467	
  1468	  memory.events
  1469		A read-only flat-keyed file which exists on non-root cgroups.
  1470		The following entries are defined.  Unless specified
  1471		otherwise, a value change in this file generates a file
  1472		modified event.
  1473	
  1474		Note that all fields in this file are hierarchical and the
  1475		file modified event can be generated due to an event down the
  1476		hierarchy. For the local events at the cgroup level see
  1477		memory.events.local.
  1478	
  1479		  low
  1480			The number of times the cgroup is reclaimed due to
  1481			high memory pressure even though its usage is under
  1482			the low boundary.  This usually indicates that the low
  1483			boundary is over-committed.
  1484	
  1485		  high
  1486			The number of times processes of the cgroup are
  1487			throttled and routed to perform direct memory reclaim
  1488			because the high memory boundary was exceeded.  For a
  1489			cgroup whose memory usage is capped by the high limit
  1490			rather than global memory pressure, this event's
  1491			occurrences are expected.
  1492	
  1493		  max
  1494			The number of times the cgroup's memory usage was
  1495			about to go over the max boundary.  If direct reclaim
  1496			fails to bring it down, the cgroup goes to OOM state.
  1497	
  1498		  oom
  1499		The number of times the cgroup's memory usage
  1500		reached the limit and allocation was about to fail.
  1501	
  1502			This event is not raised if the OOM killer is not
  1503			considered as an option, e.g. for failed high-order
  1504			allocations or if caller asked to not retry attempts.
  1505	
  1506		  oom_kill
  1507			The number of processes belonging to this cgroup
  1508			killed by any kind of OOM killer.
  1509	
  1510	          oom_group_kill
  1511	                The number of times a group OOM has occurred.
  1512	
  1513	          sock_throttled
  1514	                The number of times network sockets associated with
  1515	                this cgroup are throttled.
  1516	
  1517	  memory.events.local
  1518		Similar to memory.events but the fields in the file are local
  1519		to the cgroup i.e. not hierarchical. The file modified event
  1520		generated on this file reflects only the local events.
  1521	
  1522	  memory.stat
  1523		A read-only flat-keyed file which exists on non-root cgroups.
  1524	
  1525		This breaks down the cgroup's memory footprint into different
  1526		types of memory, type-specific details, and other information
  1527		on the state and past events of the memory management system.
  1528	
  1529		All memory amounts are in bytes.
  1530	
  1531		The entries are ordered to be human readable, and new entries
  1532		can show up in the middle. Don't rely on items remaining in a
  1533		fixed position; use the keys to look up specific values!
  1534	
  1535		If an entry has no per-node counter (i.e. does not show
  1536		in memory.numa_stat), it carries the 'npn' (non-per-node)
  1537		tag to indicate that it will not show in memory.numa_stat.
  1538	
  1539		  anon
  1540			Amount of memory used in anonymous mappings such as
  1541			brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that
  1542			some kernel configurations might account complete larger
  1543			allocations (e.g., THP) if only some, but not all the
  1544			memory of such an allocation is mapped anymore.
  1545	
  1546		  file
  1547			Amount of memory used to cache filesystem data,
  1548			including tmpfs and shared memory.
  1549	
  1550		  kernel (npn)
  1551			Amount of total kernel memory, including
  1552			(kernel_stack, pagetables, percpu, vmalloc, slab) in
  1553			addition to other kernel memory use cases.
  1554	
  1555		  kernel_stack
  1556			Amount of memory allocated to kernel stacks.
  1557	
  1558		  pagetables
  1559	                Amount of memory allocated for page tables.
  1560	
  1561		  sec_pagetables
  1562			Amount of memory allocated for secondary page tables,
  1563			this currently includes KVM mmu allocations on x86
  1564			and arm64 and IOMMU page tables.
  1565	
  1566		  percpu (npn)
  1567			Amount of memory used for storing per-cpu kernel
  1568			data structures.
  1569	
  1570		  sock (npn)
  1571			Amount of memory used in network transmission buffers
  1572	
  1573		  vmalloc (npn)
  1574			Amount of memory used for vmap backed memory.
  1575	
  1576		  shmem
  1577			Amount of cached filesystem data that is swap-backed,
  1578			such as tmpfs, shm segments, shared anonymous mmap()s
  1579	
  1580		  zswap
  1581			Amount of memory consumed by the zswap compression backend.
  1582	
  1583		  zswapped
  1584			Amount of application memory swapped out to zswap.
  1585	
  1586		  file_mapped
  1587			Amount of cached filesystem data mapped with mmap(). Note
  1588			that some kernel configurations might account complete
  1589		larger allocations (e.g., THP) if only some, but
  1590		not all, the memory of such an allocation is mapped.
  1591	
  1592		  file_dirty
  1593			Amount of cached filesystem data that was modified but
  1594			not yet written back to disk
  1595	
  1596		  file_writeback
  1597			Amount of cached filesystem data that was modified and
  1598			is currently being written back to disk
  1599	
  1600		  swapcached
  1601			Amount of swap cached in memory. The swapcache is accounted
  1602			against both memory and swap usage.
  1603	
  1604		  anon_thp
  1605			Amount of memory used in anonymous mappings backed by
  1606			transparent hugepages
  1607	
  1608		  file_thp
  1609			Amount of cached filesystem data backed by transparent
  1610			hugepages
  1611	
  1612		  shmem_thp
  1613			Amount of shm, tmpfs, shared anonymous mmap()s backed by
  1614			transparent hugepages
  1615	
  1616		  inactive_anon, active_anon, inactive_file, active_file, unevictable
  1617			Amount of memory, swap-backed and filesystem-backed,
  1618			on the internal memory management lists used by the
  1619			page reclaim algorithm.
  1620	
  1621		As these represent internal list state (e.g. shmem pages are on anon
  1622			memory management lists), inactive_foo + active_foo may not be equal to
  1623			the value for the foo counter, since the foo counter is type-based, not
  1624			list-based.
  1625	
  1626		  slab_reclaimable
  1627			Part of "slab" that might be reclaimed, such as
  1628			dentries and inodes.
  1629	
  1630		  slab_unreclaimable
  1631			Part of "slab" that cannot be reclaimed on memory
  1632			pressure.
  1633	
  1634		  slab (npn)
  1635			Amount of memory used for storing in-kernel data
  1636			structures.
  1637	
  1638		  workingset_refault_anon
  1639			Number of refaults of previously evicted anonymous pages.
  1640	
  1641		  workingset_refault_file
  1642			Number of refaults of previously evicted file pages.
  1643	
  1644		  workingset_activate_anon
  1645			Number of refaulted anonymous pages that were immediately
  1646			activated.
  1647	
  1648		  workingset_activate_file
  1649			Number of refaulted file pages that were immediately activated.
  1650	
  1651		  workingset_restore_anon
  1652			Number of restored anonymous pages which have been detected as
  1653			an active workingset before they got reclaimed.
  1654	
  1655		  workingset_restore_file
  1656			Number of restored file pages which have been detected as an
  1657			active workingset before they got reclaimed.
  1658	
  1659		  workingset_nodereclaim
  1660			Number of times a shadow node has been reclaimed
  1661	
  1662		  pswpin (npn)
  1663			Number of pages swapped into memory
  1664	
  1665		  pswpout (npn)
  1666			Number of pages swapped out of memory
  1667	
  1668		  pgscan (npn)
  1669			Amount of scanned pages (in an inactive LRU list)
  1670	
  1671		  pgsteal (npn)
  1672			Amount of reclaimed pages
  1673	
  1674		  pgscan_kswapd (npn)
  1675			Amount of scanned pages by kswapd (in an inactive LRU list)
  1676	
  1677		  pgscan_direct (npn)
  1678		Amount of scanned pages directly (in an inactive LRU list)
  1679	
  1680		  pgscan_khugepaged (npn)
  1681		Amount of scanned pages by khugepaged (in an inactive LRU list)
  1682	
  1683		  pgscan_proactive (npn)
  1684			Amount of scanned pages proactively (in an inactive LRU list)
  1685	
  1686		  pgsteal_kswapd (npn)
  1687			Amount of reclaimed pages by kswapd
  1688	
  1689		  pgsteal_direct (npn)
  1690			Amount of reclaimed pages directly
  1691	
  1692		  pgsteal_khugepaged (npn)
  1693			Amount of reclaimed pages by khugepaged
  1694	
  1695		  pgsteal_proactive (npn)
  1696			Amount of reclaimed pages proactively
  1697	
  1698		  pgfault (npn)
  1699			Total number of page faults incurred
  1700	
  1701		  pgmajfault (npn)
  1702			Number of major page faults incurred
  1703	
  1704		  pgrefill (npn)
  1705			Amount of scanned pages (in an active LRU list)
  1706	
  1707		  pgactivate (npn)
  1708			Amount of pages moved to the active LRU list
  1709	
  1710		  pgdeactivate (npn)
  1711			Amount of pages moved to the inactive LRU list
  1712	
  1713		  pglazyfree (npn)
  1714			Amount of pages postponed to be freed under memory pressure
  1715	
  1716		  pglazyfreed (npn)
  1717			Amount of reclaimed lazyfree pages
  1718	
  1719		  swpin_zero
  1720			Number of pages swapped into memory and filled with zero, where I/O
  1721			was optimized out because the page content was detected to be zero
  1722			during swapout.
  1723	
  1724		  swpout_zero
  1725			Number of zero-filled pages swapped out with I/O skipped due to the
  1726			content being detected as zero.
  1727	
  1728		  zswpin
  1729			Number of pages moved in to memory from zswap.
  1730	
  1731		  zswpout
  1732			Number of pages moved out of memory to zswap.
  1733	
  1734		  zswpwb
  1735			Number of pages written from zswap to swap.
  1736	
  1737		  zswap_incomp
  1738			Number of incompressible pages currently stored in zswap
  1739			without compression. These pages could not be compressed to
  1740			a size smaller than PAGE_SIZE, so they are stored as-is.
  1741	
  1742		  thp_fault_alloc (npn)
  1743			Number of transparent hugepages which were allocated to satisfy
  1744			a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE
  1745	                is not set.
  1746	
  1747		  thp_collapse_alloc (npn)
  1748			Number of transparent hugepages which were allocated to allow
  1749			collapsing an existing range of pages. This counter is not
  1750			present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
  1751	
  1752		  thp_swpout (npn)
  1753		Number of transparent hugepages which were swapped out in
  1754		one piece without splitting.
  1755	
  1756		  thp_swpout_fallback (npn)
  1757			Number of transparent hugepages which were split before swapout.
  1758		Usually this is because the kernel failed to allocate
  1759		contiguous swap space for the huge page.
  1760	
  1761		  numa_pages_migrated (npn)
  1762			Number of pages migrated by NUMA balancing.
  1763	
  1764		  numa_pte_updates (npn)
  1765			Number of pages whose page table entries are modified by
  1766			NUMA balancing to produce NUMA hinting faults on access.
  1767	
  1768		  numa_hint_faults (npn)
  1769			Number of NUMA hinting faults.
  1770	
  1771		  pgdemote_kswapd
  1772			Number of pages demoted by kswapd.
  1773	
  1774		  pgdemote_direct
  1775			Number of pages demoted directly.
  1776	
  1777		  pgdemote_khugepaged
  1778			Number of pages demoted by khugepaged.
  1779	
  1780		  pgdemote_proactive
  1781		Number of pages demoted proactively.
  1782	
  1783		  hugetlb
  1784			Amount of memory used by hugetlb pages. This metric only shows
  1785			up if hugetlb usage is accounted for in memory.current (i.e.
  1786			cgroup is mounted with the memory_hugetlb_accounting option).
  1787	
  1788	  memory.numa_stat
  1789		A read-only nested-keyed file which exists on non-root cgroups.
  1790	
  1791		This breaks down the cgroup's memory footprint into different
  1792		types of memory, type-specific details, and other information
  1793		per node on the state of the memory management system.
  1794	
  1795		This is useful for providing visibility into the NUMA locality
  1796		information within a memcg since the pages are allowed to be
  1797		allocated from any physical node. One use case is evaluating
  1798		application performance by combining this information with the
  1799		application's CPU allocation.
  1800	
  1801		All memory amounts are in bytes.
  1802	
  1803		The output format of memory.numa_stat is::
  1804	
  1805		  type N0=<bytes in node 0> N1=<bytes in node 1> ...
  1806	
  1807		The entries are ordered to be human readable, and new entries
  1808		can show up in the middle. Don't rely on items remaining in a
  1809		fixed position; use the keys to look up specific values!
  1810	
  1811		The entries can refer to the memory.stat.
  1812	
  1813	  memory.swap.current
  1814		A read-only single value file which exists on non-root
  1815		cgroups.
  1816	
  1817		The total amount of swap currently being used by the cgroup
  1818		and its descendants.
  1819	
  1820	  memory.swap.high
  1821		A read-write single value file which exists on non-root
  1822		cgroups.  The default is "max".
  1823	
  1824		Swap usage throttle limit.  If a cgroup's swap usage exceeds
  1825		this limit, all its further allocations will be throttled to
  1826		allow userspace to implement custom out-of-memory procedures.
  1827	
  1828		This limit marks a point of no return for the cgroup. It is NOT
  1829		designed to manage the amount of swapping a workload does
  1830		during regular operation. Compare to memory.swap.max, which
  1831		prohibits swapping past a set amount, but lets the cgroup
  1832		continue unimpeded as long as other memory can be reclaimed.
  1833	
  1834		Healthy workloads are not expected to reach this limit.
  1835	
  1836	  memory.swap.peak
  1837		A read-write single value file which exists on non-root cgroups.
  1838	
  1839		The max swap usage recorded for the cgroup and its descendants since
  1840		the creation of the cgroup or the most recent reset for that FD.
  1841	
  1842		A write of any non-empty string to this file resets it to the
  1843		current swap usage for subsequent reads through the same
  1844		file descriptor.
  1845	
  1846	  memory.swap.max
  1847		A read-write single value file which exists on non-root
  1848		cgroups.  The default is "max".
  1849	
  1850		Swap usage hard limit.  If a cgroup's swap usage reaches this
  1851		limit, anonymous memory of the cgroup will not be swapped out.
  1852	
  1853	  memory.swap.tiers
  1854	        A read-write file which exists on non-root cgroups.
  1855	        Format is similar to cgroup.subtree_control.
  1856	
  1857	        Controls which swap tiers this cgroup is allowed to swap
  1858	        out to. All tiers are enabled by default.
  1859	
> 1860	          (-|+)TIER [(-|+)TIER ...]
  1861	
  1862	        "-" disables a tier, "+" re-enables it.
  1863	        Entries are whitespace-delimited.
  1864	
  1865	        Changes here are combined with parent restrictions to
  1866	        compute memory.swap.tiers.effective.
  1867	
  1868	        If a tier is removed from /sys/kernel/mm/swap/tiers,
  1869	        any prior disable for that tier is invalidated.
  1870	
  1871	  memory.swap.tiers.effective
  1872	        A read-only file which exists on non-root cgroups.
  1873	
  1874	        Shows the tiers this cgroup can actually swap out to.
  1875	        This is the intersection of the parent's effective tiers
  1876	        and this cgroup's own memory.swap.tiers configuration.
  1877	        A child cannot enable a tier that is disabled in its
  1878	        parent.
  1879	
  1880	  memory.swap.events
  1881		A read-only flat-keyed file which exists on non-root cgroups.
  1882		The following entries are defined.  Unless specified
  1883		otherwise, a value change in this file generates a file
  1884		modified event.
  1885	
  1886		  high
  1887			The number of times the cgroup's swap usage was over
  1888			the high threshold.
  1889	
  1890		  max
  1891			The number of times the cgroup's swap usage was about
  1892			to go over the max boundary and swap allocation
  1893			failed.
  1894	
  1895		  fail
  1896			The number of times swap allocation failed either
  1897			because of running out of swap system-wide or max
  1898			limit.
  1899	
  1900		When reduced under the current usage, the existing swap
  1901		entries are reclaimed gradually and the swap usage may stay
  1902		higher than the limit for an extended period of time.  This
  1903		reduces the impact on the workload and memory management.
  1904	
  1905	  memory.zswap.current
  1906		A read-only single value file which exists on non-root
  1907		cgroups.
  1908	
  1909		The total amount of memory consumed by the zswap compression
  1910		backend.
  1911	
  1912	  memory.zswap.max
  1913		A read-write single value file which exists on non-root
  1914		cgroups.  The default is "max".
  1915	
  1916		Zswap usage hard limit. If a cgroup's zswap pool reaches this
  1917		limit, it will refuse to take any more stores before existing
  1918		entries fault back in or are written out to disk.
  1919	
  1920	  memory.zswap.writeback
  1921		A read-write single value file. The default value is "1".
  1922		Note that this setting is hierarchical, i.e. the writeback would be
  1923		implicitly disabled for child cgroups if the upper hierarchy
  1924		does so.
  1925	
  1926		When this is set to 0, all swapping attempts to swap devices
  1927		are disabled. This includes both zswap writebacks and swapping due
  1928		to zswap store failures. If the zswap store failures are recurring
  1929		(e.g. if the pages are incompressible), users can observe
  1930		reclaim inefficiency after disabling writeback (because the same
  1931		pages might be rejected again and again).
  1932	
  1933		Note that this is subtly different from setting memory.swap.max to
  1934		0, as it still allows for pages to be written to the zswap pool.
  1935		This setting has no effect if zswap is disabled, and swapping
  1936		is allowed unless memory.swap.max is set to 0.
  1937	
  1938	  memory.pressure
  1939		A read-only nested-keyed file.
  1940	
  1941		Shows pressure stall information for memory. See
  1942		:ref:`Documentation/accounting/psi.rst <psi>` for details.
  1943	
  1944	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread
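The memory.swap.tiers syntax quoted in the documentation excerpt above can be exercised from userspace roughly as follows. This is an illustrative sketch only: the tier names ("fast", "slow") and the cgroup path are hypothetical, and the files exist only with this patch series applied.

```shell
# Hypothetical sketch: tier names and the cgroup path below are
# illustrative, not taken from the patch set.
CG=/sys/fs/cgroup/workload

# List the tiers configured system-wide (interface added by this series).
cat /sys/kernel/mm/swap/tiers

# Disallow the (hypothetical) "slow" tier for this cgroup only;
# entries are whitespace-delimited, "-" disables and "+" re-enables.
echo "-slow" > "$CG/memory.swap.tiers"
echo "+slow +fast" > "$CG/memory.swap.tiers"

# The effective set is the intersection of the parent's effective
# tiers and this cgroup's own configuration; a child cannot enable
# a tier its parent has disabled.
cat "$CG/memory.swap.tiers.effective"
```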

* Re: [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure
  2026-03-25 17:54 ` [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure Youngjun Park
@ 2026-03-29 10:49   ` kernel test robot
  2026-03-29 13:46   ` kernel test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2026-03-29 10:49 UTC (permalink / raw)
  To: Youngjun Park, Andrew Morton
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, Chris Li,
	Youngjun Park, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny

Hi Youngjun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 6381a729fa7dda43574d93ab9c61cec516dd885b]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-introduce-swap-tier-infrastructure/20260327-203639
base:   6381a729fa7dda43574d93ab9c61cec516dd885b
patch link:    https://lore.kernel.org/r/20260325175453.2523280-2-youngjun.park%40lge.com
patch subject: [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure
config: hexagon-randconfig-002-20260329 (https://download.01.org/0day-ci/archive/20260329/202603291831.wZLe8bqg-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 054e11d1a17e5ba88bb1a8ef32fad3346e80b186)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260329/202603291831.wZLe8bqg-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603291831.wZLe8bqg-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/swap_tier.c:118:10: warning: format specifies type 'long' but the argument has type '__ptrdiff_t' (aka 'int') [-Wformat]
     116 |                 len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
         |                                                       ~~~~~
         |                                                       %-5td
     117 |                                      tier->name,
     118 |                                      TIER_IDX(tier),
         |                                      ^~~~~~~~~~~~~~
   mm/swap_tier.c:33:24: note: expanded from macro 'TIER_IDX'
      33 | #define TIER_IDX(tier)  ((tier) - swap_tiers)
         |                         ^~~~~~~~~~~~~~~~~~~~~
   1 warning generated.


vim +118 mm/swap_tier.c

   105	
   106	ssize_t swap_tiers_sysfs_show(char *buf)
   107	{
   108		struct swap_tier *tier;
   109		ssize_t len = 0;
   110	
   111		len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
   112				 "Name", "Idx", "PrioStart", "PrioEnd");
   113	
   114		spin_lock(&swap_tier_lock);
   115		for_each_active_tier(tier) {
   116			len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
   117					     tier->name,
 > 118					     TIER_IDX(tier),
   119					     tier->prio,
   120					     TIER_END_PRIO(tier));
   121		}
   122		spin_unlock(&swap_tier_lock);
   123	
   124		return len;
   125	}
   126	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection
  2026-03-25 17:54 ` [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection Youngjun Park
  2026-03-27 23:50   ` kernel test robot
@ 2026-03-29 11:10   ` kernel test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2026-03-29 11:10 UTC (permalink / raw)
  To: Youngjun Park, Andrew Morton
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, Chris Li,
	Youngjun Park, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny

Hi Youngjun,

kernel test robot noticed the following build errors:

[auto build test ERROR on 6381a729fa7dda43574d93ab9c61cec516dd885b]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-introduce-swap-tier-infrastructure/20260327-203639
base:   6381a729fa7dda43574d93ab9c61cec516dd885b
patch link:    https://lore.kernel.org/r/20260325175453.2523280-4-youngjun.park%40lge.com
patch subject: [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection
config: hexagon-randconfig-002-20260329 (https://download.01.org/0day-ci/archive/20260329/202603291945.9q4pyvON-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 054e11d1a17e5ba88bb1a8ef32fad3346e80b186)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260329/202603291945.9q4pyvON-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603291945.9q4pyvON-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/swap_tier.c:141:10: warning: format specifies type 'long' but the argument has type '__ptrdiff_t' (aka 'int') [-Wformat]
     139 |                 len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
         |                                                       ~~~~~
         |                                                       %-5td
     140 |                                      tier->name,
     141 |                                      TIER_IDX(tier),
         |                                      ^~~~~~~~~~~~~~
   mm/swap_tier.c:33:24: note: expanded from macro 'TIER_IDX'
      33 | #define TIER_IDX(tier)  ((tier) - swap_tiers)
         |                         ^~~~~~~~~~~~~~~~~~~~~
>> mm/swap_tier.c:342:8: error: incomplete definition of type 'struct mem_cgroup'
     342 |                 child->tier_mask |= mask;
         |                 ~~~~~^
   include/linux/mm_types.h:36:8: note: forward declaration of 'struct mem_cgroup'
      36 | struct mem_cgroup;
         |        ^
   mm/swap_tier.c:343:8: error: incomplete definition of type 'struct mem_cgroup'
     343 |                 child->tier_effective_mask |= mask;
         |                 ~~~~~^
   include/linux/mm_types.h:36:8: note: forward declaration of 'struct mem_cgroup'
      36 | struct mem_cgroup;
         |        ^
   mm/swap_tier.c:420:20: error: incomplete definition of type 'struct mem_cgroup'
     420 |                 = parent ? parent->tier_effective_mask : TIER_ALL_MASK;
         |                            ~~~~~~^
   include/linux/mm_types.h:36:8: note: forward declaration of 'struct mem_cgroup'
      36 | struct mem_cgroup;
         |        ^
   mm/swap_tier.c:422:7: error: incomplete definition of type 'struct mem_cgroup'
     422 |         memcg->tier_effective_mask
         |         ~~~~~^
   include/linux/mm_types.h:36:8: note: forward declaration of 'struct mem_cgroup'
      36 | struct mem_cgroup;
         |        ^
   mm/swap_tier.c:423:27: error: incomplete definition of type 'struct mem_cgroup'
     423 |                 = effective_mask & memcg->tier_mask;
         |                                    ~~~~~^
   include/linux/mm_types.h:36:8: note: forward declaration of 'struct mem_cgroup'
      36 | struct mem_cgroup;
         |        ^
   1 warning and 5 errors generated.


vim +342 mm/swap_tier.c

   330	
   331	/*
   332	 * When a tier is removed, set its bit in every memcg's tier_mask and
   333	 * tier_effective_mask. This prevents stale tier indices from being
   334	 * silently filtered out if the same index is reused later.
   335	 */
   336	static void swap_tier_memcg_propagate(int mask)
   337	{
   338		struct mem_cgroup *child;
   339	
   340		rcu_read_lock();
   341		for_each_mem_cgroup_tree(child, root_mem_cgroup) {
 > 342			child->tier_mask |= mask;
   343			child->tier_effective_mask |= mask;
   344		}
   345		rcu_read_unlock();
   346	}
   347	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure
  2026-03-25 17:54 ` [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure Youngjun Park
  2026-03-29 10:49   ` kernel test robot
@ 2026-03-29 13:46   ` kernel test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2026-03-29 13:46 UTC (permalink / raw)
  To: Youngjun Park, Andrew Morton
  Cc: oe-kbuild-all, Linux Memory Management List, Chris Li,
	Youngjun Park, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny

Hi Youngjun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 6381a729fa7dda43574d93ab9c61cec516dd885b]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-introduce-swap-tier-infrastructure/20260327-203639
base:   6381a729fa7dda43574d93ab9c61cec516dd885b
patch link:    https://lore.kernel.org/r/20260325175453.2523280-2-youngjun.park%40lge.com
patch subject: [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure
config: arc-allyesconfig (https://download.01.org/0day-ci/archive/20260329/202603292156.2V2nVz19-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260329/202603292156.2V2nVz19-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603292156.2V2nVz19-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/swap_tier.c: In function 'swap_tiers_sysfs_show':
>> mm/swap_tier.c:116:59: warning: format '%ld' expects argument of type 'long int', but argument 5 has type 'int' [-Wformat=]
     116 |                 len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
         |                                                       ~~~~^
         |                                                           |
         |                                                           long int
         |                                                       %-5d


vim +116 mm/swap_tier.c

   105	
   106	ssize_t swap_tiers_sysfs_show(char *buf)
   107	{
   108		struct swap_tier *tier;
   109		ssize_t len = 0;
   110	
   111		len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
   112				 "Name", "Idx", "PrioStart", "PrioEnd");
   113	
   114		spin_lock(&swap_tier_lock);
   115		for_each_active_tier(tier) {
 > 116			len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
   117					     tier->name,
   118					     TIER_IDX(tier),
   119					     tier->prio,
   120					     TIER_END_PRIO(tier));
   121		}
   122		spin_unlock(&swap_tier_lock);
   123	
   124		return len;
   125	}
   126	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-03-29 13:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-25 17:54 [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
2026-03-25 17:54 ` [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure Youngjun Park
2026-03-29 10:49   ` kernel test robot
2026-03-29 13:46   ` kernel test robot
2026-03-25 17:54 ` [PATCH v5 2/4] mm: swap: associate swap devices with tiers Youngjun Park
2026-03-27 19:06   ` kernel test robot
2026-03-25 17:54 ` [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection Youngjun Park
2026-03-27 23:50   ` kernel test robot
2026-03-29 11:10   ` kernel test robot
2026-03-25 17:54 ` [PATCH v5 4/4] mm: swap: filter swap allocation by memcg tier mask Youngjun Park
2026-03-25 23:20 ` [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Andrew Morton
2026-03-26 14:04   ` YoungJun Park
2026-03-26  7:41 ` [syzbot ci] " syzbot ci

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox