[RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization
@ 2025-06-12 10:37 youngjun.park
  2025-06-12 10:37 ` [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control youngjun.park
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: youngjun.park @ 2025-06-12 10:37 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeel.butt, cgroups,
	linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua, chrisl,
	muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee,
	Youngjun Park

From: Youngjun Park <youngjun.park@lge.com>

Introduction
============
I am a kernel developer working on platforms deployed on commercial consumer devices.
Due to real-world product requirements, needed to modify the Linux kernel to support
a new swap management mechanism. The proposed mechanism allows assigning different swap
priorities to swap devices per cgroup.
I believe this mechanism can be generally useful for similar constrained-device scenarios
and would like to propose it for upstream inclusion and solicit feedback from the community.

Motivation
==========
Core requirement was to improve application responsiveness and loading time, especially
for latency critical applications, without increasing RAM or storage hardware resources.
Device constraints:
  - Linux-based embedded platform
  - Limited system RAM
  - Small local swap
  - No option to expand RAM or local swap
To mitigate this, we explored utilizing idle RAM and storage from nearby devices as remote
swap space. To maximize its effectiveness, we needed the ability to control which swap devices
were used by different cgroups:
  - Assign faster local swap devices to latency critical apps
  - Assign remote swap devices to background apps
However, current Linux kernel swap infrastructure does not support per-cgroup swap device
assignment.
To solve this, I propose a mechanism to allow each cgroup to specify its own swap device
priorities.

Evaluated Alternatives
======================
1. **Per-cgroup dedicated swap devices**
   - Previously proposed upstream [1]
   - Challenges in managing global vs per-cgroup swap state
   - Difficult to integrate with existing memory.limit / swap.max semantics
2. **Multi-backend swap device with cgroup-aware routing**
   - Considered sort of layering violation (block device cgroup awareness)
   - Swap devices are commonly meant to be physical block devices.
   - Similar idea mentioned in [2]
3. **Per-cgroup swap device enable/disable with swap usage contorl**
   - Expand swap.max with zswap.writeback usage
   - Discussed in context of zswap writeback [3]
   - Cannot express arbitrary priority orderings 
    (e.g. swap priority A-B-C on cgroup C-A-B impossible)
   - Less flexible than per-device priority approach
4. **Per-namespace swap priority configuration**
   - In short, make swap namespace for swap device priority
   - Overly complex for our use case
   - Cgroups are the natural scope for this mechanism

Based on these findings, we chose to prototype per-cgroup swap priority configuration
as the most natural, least invasive extension of the existing kernel mechanisms.

Design and Semantics
====================
- Each swap device gets a unique ID at `swapon` time
- Each cgroup has a `memory.swap.priority` interface:
  - Show unique ID by memory.swap.priority interface
  - Format: `unique_id:priority,unique_id:priority,...`
  - All currently-active swap devices must be listed
  - Priorities follow existing swap infrastructure semantics
- The interface is writeable and updatable at runtime
- A priority configuration can be reset via `echo "" > memory.swap.priority`
- Swap on/off events propagate to all cgroups with priority configurations

Example Usage
-------------
# swap device on
$ swapon
NAME      TYPE      SIZE USED PRIO
/dev/sdb  partition 300M  0B   10
/dev/sdc  partition 300M  0B    5

# assign custom priorities in a cgroup
$ echo "1:5,2:10" > memory.swap.priority
$ cat memory.swap.priority
Active
/dev/sdb  unique:1  prio:5
/dev/sdc  unique:2  prio:10

# adding new swap device later
$ swapon /dev/sdd --priority -1
$ cat memory.swap.priority
Active
/dev/sdb  unique:1  prio:5
/dev/sdc  unique:2  prio:10
/dev/sdd  unique:3  prio:-2 

# reset cgroup priority
$ echo "" > memory.swap.priority
$ cat memory.swap.priority
Inactive
/dev/sdb  unique:1  prio:10
/dev/sdc  unique:2  prio:5
/dev/sdd  unique:3  prio:-2

Implementation Notes
====================
The items mentioned below are to be considered during the next patch work.

- Workaround using per swap cpu cluster as before 
- Priority propgation of child cgroup
- And other TODO, XXX
- Refactoring for reviewability and maintainability, comprehensive testing
  and performance evaluation

Future Work
===========
These are items that would benefit from further consideration 
and potential implementation.

- Support for per-process or anything else swap prioritization
- Optional usage limits per swap device (e.g., ratio, max bytes)
- Generalizing the interface beyond cgroups

References
==========
[1] https://lkml.iu.edu/hypermail/linux/kernel/1404.0/02530.html
[2] https://lore.kernel.org/linux-mm/CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com
[3] https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com

All comments and feedback are greatly appreciated.
Patch will follow.

Sincerely,
Youngjun Park

youngjun.park (2):
  mm/swap, memcg: basic structure and logic for per cgroup swap priority
    control
  mm: swap: apply per cgroup swap priority mechansim on swap layer

 include/linux/memcontrol.h |   3 +
 include/linux/swap.h       |  11 ++
 mm/Kconfig                 |   7 +
 mm/memcontrol.c            |  55 ++++++
 mm/swap.h                  |  18 ++
 mm/swap_cgroup_priority.c  | 335 +++++++++++++++++++++++++++++++++++++
 mm/swapfile.c              | 129 ++++++++++----
 7 files changed, 523 insertions(+), 35 deletions(-)
 create mode 100644 mm/swap_cgroup_priority.c

base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
-- 
2.34.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
  2025-06-12 10:37 [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization youngjun.park
@ 2025-06-12 10:37 ` youngjun.park
  2025-06-17 12:23   ` Michal Koutný
  2025-06-12 10:37 ` [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer youngjun.park
  2025-06-12 12:24 ` [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization Kairui Song
  2 siblings, 1 reply; 25+ messages in thread
From: youngjun.park @ 2025-06-12 10:37 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeel.butt, cgroups,
	linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua, chrisl,
	muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee,
	youngjun.park

From: "youngjun.park" <youngjun.park@lge.com>

We are working in a constrained environment where devices often
operate under limited resources. To improve overall system responsiveness,
especially under memory pressure, we aim to utilize idle devices as swap
targets over the network.

In this context, we propose a mechanism to control swap priorities on a
per-cgroup basis.
By assigning different swap priorities to each cgroup, we can ensure that
ciritical applications maintain higher responsiveness and stability,
while less important workloads experience deferred swap activity.

The following is detailed explanation of the implementation.

1. Object Description

- swap_cgroup_priority
This object manages an array of swap_cgroup_priority_pnode
that points to swap devices and their associated priorities.

- swap_cgroup_priority_pnode
This object points to a swap device and contains priority information
that can be allocated through an interface.

2. Object Lifecycle

- The swap_cgroup_priority and swap_cgroup_priority_pnode share the same
lifetime.

- Object is dealt with memory.swap.priority interface.
Each swap device is assigned a unique ID at swapon time,
which can be queried via the memory.swap.priority interface.

Example:
cat memory.swap.priority
Inactive
/dev/sdb	unique:1	 prio:10
/dev/sdc	unique:2	 prio:5

- Creation
echo  "unique id of swapdev 1: priority, unique id of swapdev 2: priority ..."
> memory.swap.priority

- Destruction
Reset through the memory.swap.priority interface.
Example: echo "" > memory.swap.priority

And also be destroyed when the mem_cgroup is removed.

3. Priority Mechanism

- Follows the original concept of swap priority.
(This includes automatic binding of swap devices to NUMA nodes.)

- Swap On/Off Propagation
When swapon is executed, the settings are propagated.
Also when swapoff is executed, the settings are removed.

The implementation of swap on/off propagation and the mechanism
for iterating through the configured swap cgroup priorities
are available in the next patch.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/memcontrol.h |   3 +
 include/linux/swap.h       |   3 +
 mm/Kconfig                 |   7 ++
 mm/memcontrol.c            |  55 ++++++++++
 mm/swap.h                  |  10 ++
 mm/swap_cgroup_priority.c  | 202 +++++++++++++++++++++++++++++++++++++
 mm/swapfile.c              |   6 ++
 7 files changed, 286 insertions(+)
 create mode 100644 mm/swap_cgroup_priority.c

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..625e59f9ecd2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -218,6 +218,9 @@ struct mem_cgroup {
 	bool zswap_writeback;
 #endif
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	struct swap_cgroup_priority *swap_priority;
+#endif
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0e1c275fc0..49b73911c1bd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -339,6 +339,9 @@ struct swap_info_struct {
 	struct work_struct discard_work; /* discard worker */
 	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	int unique_id;
+#endif
 	struct plist_node avail_lists[]; /*
 					   * entries in swap_avail_heads, one
 					   * entry per node.
diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..ff4b0ef867f4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -190,6 +190,13 @@ config ZSMALLOC_CHAIN_SIZE
 
 	  For more information, see zsmalloc documentation.
 
+config SWAP_CGROUP_PRIORITY
+	bool "Use swap cgroup priority"
+	default false
+	depends on SWAP && CGROUPS
+	help
+	  This option sets per cgroup swap device priority.
+
 menu "Slab allocator options"
 
 config SLUB
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..628ffb048489 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -69,6 +69,7 @@
 #include <net/ip.h>
 #include "slab.h"
 #include "memcontrol-v1.h"
+#include "swap.h"
 
 #include <linux/uaccess.h>
 
@@ -3702,6 +3703,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	lru_gen_exit_memcg(memcg);
 	memcg_wb_domain_exit(memcg);
+	delete_swap_cgroup_priority(memcg);
 	__mem_cgroup_free(memcg);
 }
 
@@ -5403,6 +5405,51 @@ static int swap_events_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+static ssize_t swap_cgroup_priority_write(struct kernfs_open_file *of,
+			      char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int ret;
+	int unique[MAX_SWAPFILES] = {0, };
+	int prios[MAX_SWAPFILES] = {0,};
+	int idx = 0;
+	char *token;
+
+	buf = strstrip(buf);
+	if (strlen(buf) == 0) {
+		delete_swap_cgroup_priority(memcg);
+		return nbytes;
+	}
+
+	while ((token = strsep(&buf, ",")) != NULL) {
+		char *token2 = token;
+		char *token3;
+
+		token3 = strsep(&token2, ":");
+		if (!token2 || !token3)
+			return -EINVAL;
+
+		if (kstrtoint(token3, 10, &unique[idx]) ||
+			kstrtoint(token2, 10, &prios[idx]))
+			return -EINVAL;
+
+		idx++;
+	}
+
+	if ((ret = create_swap_cgroup_priority(memcg, unique, prios, idx)))
+		return ret;
+
+	return nbytes;
+}
+
+static int swap_cgroup_priority_show(struct seq_file *m, void *v)
+{
+	show_swap_device_unique_id(m);
+	return 0;
+}
+#endif
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -5435,6 +5482,14 @@ static struct cftype swap_files[] = {
 		.file_offset = offsetof(struct mem_cgroup, swap_events_file),
 		.seq_show = swap_events_show,
 	},
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	{
+		.name = "swap.priority",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_cgroup_priority_show,
+		.write = swap_cgroup_priority_write,
+	},
+#endif
 	{ }	/* terminate */
 };
 
diff --git a/mm/swap.h b/mm/swap.h
index 2269eb9df0af..cd2649c632ed 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -106,6 +106,16 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
 		return find_next_bit(sis->zeromap, end, start) - start;
 }
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+int create_swap_cgroup_priority(struct mem_cgroup *memcg,
+				int unique[], int prio[], int nr);
+void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
+void show_swap_device_unique_id(struct seq_file *m);
+#else
+static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg) {}
+static inline void get_swap_unique_id(struct swap_info_struct *si) {}
+#endif
+
 #else /* CONFIG_SWAP */
 struct swap_iocb;
 static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
new file mode 100644
index 000000000000..b3e20b676680
--- /dev/null
+++ b/mm/swap_cgroup_priority.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* per mem_cgroup */
+struct swap_cgroup_priority {
+	struct list_head link;
+	/* XXX: to flatten memory is hard. variable array is our enemy */
+	struct swap_cgroup_priority_pnode *pnode[MAX_SWAPFILES];
+	struct plist_head plist[];
+};
+
+/* per mem_cgroup & per swap device node */
+struct swap_cgroup_priority_pnode {
+	struct swap_info_struct *swap;
+	int prio;
+	struct plist_node avail_lists[];
+};
+
+/* per swap device unique id counter */
+static atomic_t swap_unique_id_counter;
+
+/* active swap_cgroup_priority list */
+static LIST_HEAD(swap_cgroup_priority_list);
+
+/* XXX: Not want memcontrol to know swap_cgroup_priority internal. */
+void show_swap_device_unique_id(struct seq_file *m)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	spin_lock(&swap_lock);
+	/* XXX: what is beautiful visibility? */
+	seq_printf(m, "%s\n", memcg->swap_priority ? "Active" : "Inactive");
+	for (int i = 0; i < nr_swapfiles; i++) {
+		struct swap_info_struct *si = swap_info[i];
+
+		if (!(si->flags & SWP_USED))
+			continue;
+
+		seq_file_path(m, si->swap_file, "\t\n\\");
+		seq_printf(m,  "\tunique:%d\t", si->unique_id);
+
+		if (!memcg->swap_priority) {
+			seq_printf(m, " prio:%d\n", si->prio);
+			continue;
+		}
+
+		seq_printf(m,  "prio:%d\n",
+			memcg->swap_priority->pnode[i]->prio);
+	}
+	spin_unlock(&swap_lock);
+}
+
+static void get_swap_unique_id(struct swap_info_struct *si)
+{
+	si->unique_id = atomic_add_return(1, &swap_unique_id_counter);
+}
+
+int create_swap_cgroup_priority(struct mem_cgroup *memcg,
+		int unique[], int prio[], int nr)
+{
+	bool b_found = false;
+	struct swap_cgroup_priority *swap_priority, *old_swap_priority = NULL;
+	int nid;
+
+	/* Fast check */
+	if (nr != nr_swapfiles)
+		return -EINVAL;
+
+	/*
+	* XXX: always make newly object and exchange it.
+	* possible to give object reusability if it is simple and better.
+	*/
+	swap_priority = kvmalloc(struct_size(swap_priority, plist, nr_node_ids),
+			GFP_KERNEL);
+
+	if (!swap_priority)
+		return -ENOMEM;
+
+	/* XXX: use pre allocate. think swapon time allocate is better? */
+	for (int i = 0; i < MAX_SWAPFILES; i++) {
+		swap_priority->pnode[i] =
+			kvmalloc(struct_size(swap_priority->pnode[0],
+				avail_lists, nr_node_ids),
+				GFP_KERNEL);
+
+		if (!swap_priority->pnode[i]) {
+			for (int j = 0; j < i; j++)
+				kvfree(swap_priority->pnode[i]);
+
+			kvfree(swap_priority);
+			return -ENOMEM;
+		}
+	}
+
+	INIT_LIST_HEAD(&swap_priority->link);
+	for_each_node(nid)
+		plist_head_init(&swap_priority->plist[nid]);
+
+	spin_lock(&swap_lock);
+	spin_lock(&swap_avail_lock);
+
+	/* swap on/off under us. */
+	if (nr != nr_swapfiles)
+		goto error;
+
+	/* TODO: naive search. make it fast.*/
+	for (int i = 0; i < nr; i++) {
+		b_found = false;
+		for (int j = 0; j < nr_swapfiles; j++) {
+			struct swap_info_struct *si = swap_info[j];
+			struct swap_cgroup_priority_pnode *pnode
+					= swap_priority->pnode[j];
+
+			if (si->unique_id != unique[i])
+				continue;
+
+			/* swap off under us */
+			if (!(si->flags & SWP_USED))
+				goto error;
+
+			int k;
+			for_each_node(k) {
+				if (prio[i] >= 0) {
+					pnode->prio = prio[i];
+					plist_node_init(&pnode->avail_lists[k],
+						-pnode->prio);
+				} else {
+					pnode->prio = si->prio;
+					if (swap_node(si) == k)
+						plist_node_init(
+							&pnode->avail_lists[k],
+							1);
+					else
+						plist_node_init(
+							&pnode->avail_lists[k],
+							-pnode->prio);
+				}
+
+				plist_add(&pnode->avail_lists[k],
+					&swap_priority->plist[k]);
+			}
+
+			pnode->swap = si;
+			b_found = true;
+			break;
+		}
+
+		/* cannot find unique id pair */
+		if (!b_found)
+			goto error;
+	}
+
+	if (memcg->swap_priority) {
+		old_swap_priority = memcg->swap_priority;
+		list_del(&old_swap_priority->link);
+	}
+
+	list_add(&swap_priority->link, &swap_cgroup_priority_list);
+
+	memcg->swap_priority = swap_priority;
+	spin_unlock(&swap_avail_lock);
+	spin_unlock(&swap_lock);
+
+	if (old_swap_priority) {
+		for (int i = 0; i < MAX_SWAPFILES; i++)
+			kvfree(old_swap_priority->pnode[i]);
+		kvfree(old_swap_priority);
+	}
+
+	return 0;
+
+error:
+	spin_unlock(&swap_avail_lock);
+	spin_unlock(&swap_lock);
+
+	for (int i = 0; i < MAX_SWAPFILES; i++)
+		kvfree(swap_priority->pnode[i]);
+	kvfree(swap_priority);
+
+	return -EINVAL;
+}
+
+void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
+{
+	struct swap_cgroup_priority *swap_priority;
+
+	spin_lock(&swap_avail_lock);
+	swap_priority = memcg->swap_priority;
+	if (!swap_priority) {
+		spin_unlock(&swap_avail_lock);
+		return;
+	}
+	memcg->swap_priority = NULL;
+	list_del(&swap_priority->link);
+	spin_unlock(&swap_avail_lock);
+
+	/* wait show_swap_device_unique_id */
+	synchronize_rcu();
+
+	for (int i = 0; i < MAX_SWAPFILES; i++)
+		kvfree(swap_priority->pnode[i]);
+	kvfree(swap_priority);
+}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 68ce283e84be..f8e48dd2381e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -126,6 +126,10 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
 	.offset = { SWAP_ENTRY_INVALID },
 	.lock = INIT_LOCAL_LOCK(),
 };
+/* TODO: better choice? */
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+#include "swap_cgroup_priority.c"
+#endif
 
 static struct swap_info_struct *swap_type_to_swap_info(int type)
 {
@@ -3462,6 +3466,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto free_swap_zswap;
 	}
 
+	get_swap_unique_id(si);
+
 	mutex_lock(&swapon_mutex);
 	prio = -1;
 	if (swap_flags & SWAP_FLAG_PREFER)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-12 10:37 [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization youngjun.park
  2025-06-12 10:37 ` [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control youngjun.park
@ 2025-06-12 10:37 ` youngjun.park
  2025-06-12 11:14   ` Kairui Song
  2025-06-12 12:24 ` [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization Kairui Song
  2 siblings, 1 reply; 25+ messages in thread
From: youngjun.park @ 2025-06-12 10:37 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeel.butt, cgroups,
	linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua, chrisl,
	muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee,
	youngjun.park

From: "youngjun.park" <youngjun.park@lge.com>

This patch implements swap device selection and swap on/off propagation
when a cgroup-specific swap priority is set.

There is one workaround to this implementation as follows.
Current per-cpu swap cluster enforces swap device selection based solely
on CPU locality, overriding the swap cgroup's configured priorities.
Therefore, when a swap cgroup priority is assigned, we fall back to
using per-CPU clusters per swap device, similar to the previous behavior.

A proper fix for this workaround will be evaluated in the next patch.

Signed-off-by: Youngjun park <youngjun.park@lge.com>
---
 include/linux/swap.h      |   8 +++
 mm/swap.h                 |   8 +++
 mm/swap_cgroup_priority.c | 133 ++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c             | 125 ++++++++++++++++++++++++-----------
 4 files changed, 238 insertions(+), 36 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 49b73911c1bd..d158b0d5c997 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -283,6 +283,13 @@ enum swap_cluster_flags {
 #define SWAP_NR_ORDERS		1
 #endif
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+struct percpu_cluster {
+	local_lock_t lock; /* Protect the percpu_cluster above */
+	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
+};
+#endif
+
 /*
  * We keep using same cluster for rotational device so IO will be sequential.
  * The purpose is to optimize SWAP throughput on these device.
@@ -341,6 +348,7 @@ struct swap_info_struct {
 	struct list_head discard_clusters; /* discard clusters list */
 #ifdef CONFIG_SWAP_CGROUP_PRIORITY
 	int unique_id;
+	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 #endif
 	struct plist_node avail_lists[]; /*
 					   * entries in swap_avail_heads, one
diff --git a/mm/swap.h b/mm/swap.h
index cd2649c632ed..cb6d653fe3f1 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -113,7 +113,15 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
 void show_swap_device_unique_id(struct seq_file *m);
 #else
 static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg) {}
+static inline void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapon) {}
+static inline void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapoff){}
 static inline void get_swap_unique_id(struct swap_info_struct *si) {}
+static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
+				swp_entry_t *entry, int order)
+{
+	return false;
+}
+
 #endif
 
 #else /* CONFIG_SWAP */
diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
index b3e20b676680..bb18cb251f60 100644
--- a/mm/swap_cgroup_priority.c
+++ b/mm/swap_cgroup_priority.c
@@ -54,6 +54,132 @@ static void get_swap_unique_id(struct swap_info_struct *si)
 	si->unique_id = atomic_add_return(1, &swap_unique_id_counter);
 }
 
+static bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
+				swp_entry_t *entry, int order)
+{
+	struct swap_cgroup_priority *swap_priority;
+	struct swap_cgroup_priority_pnode *pnode, *next;
+	unsigned long offset;
+	int node;
+
+	if (!memcg)
+		return false;
+
+	spin_lock(&swap_avail_lock);
+priority_check:
+	swap_priority = memcg->swap_priority;
+	if (!swap_priority) {
+		spin_unlock(&swap_avail_lock);
+		return false;
+	}
+
+	node = numa_node_id();
+start_over:
+	plist_for_each_entry_safe(pnode, next, &swap_priority->plist[node],
+					avail_lists[node]) {
+		struct swap_info_struct *si = pnode->swap;
+		plist_requeue(&pnode->avail_lists[node],
+			&swap_priority->plist[node]);
+		spin_unlock(&swap_avail_lock);
+
+		if (get_swap_device_info(si)) {
+			offset = cluster_alloc_swap_entry(si,
+					order, SWAP_HAS_CACHE, true);
+			put_swap_device(si);
+			if (offset) {
+				*entry = swp_entry(si->type, offset);
+				return true;
+			}
+			if (order)
+				return false;
+		}
+
+		spin_lock(&swap_avail_lock);
+
+		/* swap_priority is remove or changed under us. */
+		if (swap_priority != memcg->swap_priority)
+			goto priority_check;
+
+		if (plist_node_empty(&next->avail_lists[node]))
+			goto start_over;
+	}
+	spin_unlock(&swap_avail_lock);
+
+	return false;
+}
+
+/* add_to_avail_list (swapon / swapusage > 0) */
+static void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
+			bool swapon)
+{
+	struct swap_cgroup_priority *swap_priority;
+	int i;
+
+	list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
+		struct swap_cgroup_priority_pnode *pnode
+			= swap_priority->pnode[swp->type];
+
+		if (swapon) {
+			pnode->swap = swp;
+			pnode->prio = swp->prio;
+		}
+
+		/* NUMA priority handling */
+		for_each_node(i) {
+			if (swapon) {
+				if (swap_node(swp) == i) {
+					plist_node_init(
+						&pnode->avail_lists[i],
+						1);
+				} else {
+					plist_node_init(
+						&pnode->avail_lists[i],
+						-pnode->prio);
+				}
+			}
+
+			plist_add(&pnode->avail_lists[i],
+				&swap_priority->plist[i]);
+		}
+	}
+}
+
+/* del_from_avail_list (swapoff / swap usage <= 0) */
+static void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
+		bool swapoff)
+{
+	struct swap_cgroup_priority *swap_priority;
+	int nid, i;
+
+	list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
+		struct swap_cgroup_priority_pnode *pnode;
+
+		if (swapoff && swp->prio < 0) {
+			/*
+			* NUMA priority handling
+			* mimic swapoff prio adjustment without plist
+			*/
+			for (int i = 0; i < MAX_SWAPFILES; i++) {
+				pnode = swap_priority->pnode[i];
+				if (pnode->prio > swp->prio ||
+					pnode->swap == swp)
+					continue;
+
+				pnode->prio++;
+				for_each_node(nid) {
+					if (pnode->avail_lists[nid].prio != 1)
+						pnode->avail_lists[nid].prio--;
+				}
+			}
+		}
+
+		pnode = swap_priority->pnode[swp->type];
+		for_each_node(i)
+			plist_del(&pnode->avail_lists[i],
+				&swap_priority->plist[i]);
+	}
+}
+
 int create_swap_cgroup_priority(struct mem_cgroup *memcg,
 		int unique[], int prio[], int nr)
 {
@@ -183,6 +309,12 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 {
 	struct swap_cgroup_priority *swap_priority;
 
+	/*
+	* XXX: Possible RCU wait? No. Cannot protect priority list addition.
+	* swap_avail_lock gives protection.
+	* Think about other object protection mechanism
+	* might be solve it and better. (e.g object reference)
+	*/
 	spin_lock(&swap_avail_lock);
 	swap_priority = memcg->swap_priority;
 	if (!swap_priority) {
@@ -198,5 +330,6 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 
 	for (int i = 0; i < MAX_SWAPFILES; i++)
 		kvfree(swap_priority->pnode[i]);
+
 	kvfree(swap_priority);
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f8e48dd2381e..28afe4ec0504 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -126,8 +126,12 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
 	.offset = { SWAP_ENTRY_INVALID },
 	.lock = INIT_LOCAL_LOCK(),
 };
-/* TODO: better choice? */
+/* TODO: better arrangement */
 #ifdef CONFIG_SWAP_CGROUP_PRIORITY
+static bool get_swap_device_info(struct swap_info_struct *si);
+static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+					      unsigned char usage, bool is_cgroup_priority);
+static int swap_node(struct swap_info_struct *si);
 #include "swap_cgroup_priority.c"
 #endif
 
@@ -776,7 +780,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 					    struct swap_cluster_info *ci,
 					    unsigned long offset,
 					    unsigned int order,
-					    unsigned char usage)
+					    unsigned char usage,
+					    bool is_cgroup_priority)
 {
 	unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
 	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
@@ -820,12 +825,19 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
+
 	if (si->flags & SWP_SOLIDSTATE) {
-		this_cpu_write(percpu_swap_cluster.offset[order], next);
-		this_cpu_write(percpu_swap_cluster.si[order], si);
-	} else {
+		if (!is_cgroup_priority) {
+			this_cpu_write(percpu_swap_cluster.offset[order], next);
+			this_cpu_write(percpu_swap_cluster.si[order], si);
+		} else {
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+			__this_cpu_write(si->percpu_cluster->next[order], next);
+#endif
+		}
+	} else
 		si->global_cluster->next[order] = next;
-	}
+
 	return found;
 }
 
@@ -883,7 +895,7 @@ static void swap_reclaim_work(struct work_struct *work)
  * cluster for current CPU too.
  */
 static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
-					      unsigned char usage)
+					      unsigned char usage, bool is_cgroup_priority)
 {
 	struct swap_cluster_info *ci;
 	unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
@@ -895,32 +907,38 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	if (order && !(si->flags & SWP_BLKDEV))
 		return 0;
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+                local_lock(&si->percpu_cluster->lock);
+                offset = __this_cpu_read(si->percpu_cluster->next[order]);
+#endif
+	} else {
 		/* Serialize HDD SWAP allocation for each device. */
 		spin_lock(&si->global_cluster_lock);
 		offset = si->global_cluster->next[order];
-		if (offset == SWAP_ENTRY_INVALID)
-			goto new_cluster;
+	}
 
-		ci = lock_cluster(si, offset);
-		/* Cluster could have been used by another order */
-		if (cluster_is_usable(ci, order)) {
-			if (cluster_is_empty(ci))
-				offset = cluster_offset(si, ci);
-			found = alloc_swap_scan_cluster(si, ci, offset,
-							order, usage);
-		} else {
-			unlock_cluster(ci);
-		}
-		if (found)
-			goto done;
+	if (offset == SWAP_ENTRY_INVALID)
+		goto new_cluster;
+
+	ci = lock_cluster(si, offset);
+	/* Cluster could have been used by another order */
+	if (cluster_is_usable(ci, order)) {
+		if (cluster_is_empty(ci))
+			offset = cluster_offset(si, ci);
+		found = alloc_swap_scan_cluster(si, ci, offset,
+						order, usage, is_cgroup_priority);
+	} else {
+		unlock_cluster(ci);
 	}
+	if (found)
+		goto done;
 
 new_cluster:
 	ci = isolate_lock_cluster(si, &si->free_clusters);
 	if (ci) {
 		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						order, usage);
+						order, usage, is_cgroup_priority);
 		if (found)
 			goto done;
 	}
@@ -934,7 +952,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-							order, usage);
+							order, usage, is_cgroup_priority);
 			if (found)
 				goto done;
 			/* Clusters failed to allocate are moved to frag_clusters */
@@ -952,7 +970,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			 * reclaimable (eg. lazy-freed swap cache) slots.
 			 */
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-							order, usage);
+							order, usage, is_cgroup_priority);
 			if (found)
 				goto done;
 			frags++;
@@ -979,21 +997,27 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
 			atomic_long_dec(&si->frag_cluster_nr[o]);
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-							0, usage);
+							0, usage, is_cgroup_priority);
 			if (found)
 				goto done;
 		}
 
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[o]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-							0, usage);
+							0, usage, is_cgroup_priority);
 			if (found)
 				goto done;
 		}
 	}
 done:
-	if (!(si->flags & SWP_SOLIDSTATE))
+	if (si->flags & SWP_SOLIDSTATE) {
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+		local_unlock(&si->percpu_cluster->lock);
+#endif
+	} else {
 		spin_unlock(&si->global_cluster_lock);
+	}
+
 	return found;
 }
 
@@ -1032,6 +1056,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 	for_each_node(nid)
 		plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
 
+	deactivate_swap_cgroup_priority_pnode(si, swapoff);
 skip:
 	spin_unlock(&swap_avail_lock);
 }
@@ -1075,6 +1100,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 	for_each_node(nid)
 		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
 
+	activate_swap_cgroup_priority_pnode(si, swapon);
 skip:
 	spin_unlock(&swap_avail_lock);
 }
@@ -1200,7 +1226,8 @@ static bool swap_alloc_fast(swp_entry_t *entry,
 	if (cluster_is_usable(ci, order)) {
 		if (cluster_is_empty(ci))
 			offset = cluster_offset(si, ci);
-		found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
+		found = alloc_swap_scan_cluster(si, ci, offset, order,
+				SWAP_HAS_CACHE, false);
 		if (found)
 			*entry = swp_entry(si->type, found);
 	} else {
@@ -1227,7 +1254,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
-			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
+			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE, false);
 			put_swap_device(si);
 			if (offset) {
 				*entry = swp_entry(si->type, offset);
@@ -1294,10 +1321,12 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
 		}
 	}
 
-	local_lock(&percpu_swap_cluster.lock);
-	if (!swap_alloc_fast(&entry, order))
-		swap_alloc_slow(&entry, order);
-	local_unlock(&percpu_swap_cluster.lock);
+	if (!swap_alloc_cgroup_priority(folio_memcg(folio), &entry, order)) {
+		local_lock(&percpu_swap_cluster.lock);
+		if (!swap_alloc_fast(&entry, order))
+			swap_alloc_slow(&entry, order);
+		local_unlock(&percpu_swap_cluster.lock);
+	}
 
 	/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
 	if (mem_cgroup_try_charge_swap(folio, entry))
@@ -1870,7 +1899,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	/* This is called for allocating swap entry, not cache */
 	if (get_swap_device_info(si)) {
 		if (si->flags & SWP_WRITEOK) {
-			offset = cluster_alloc_swap_entry(si, 0, 1);
+			offset = cluster_alloc_swap_entry(si, 0, 1, false);
 			if (offset) {
 				entry = swp_entry(si->type, offset);
 				atomic_long_dec(&nr_swap_pages);
@@ -2800,6 +2829,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	free_percpu(p->percpu_cluster);
+	p->percpu_cluster = NULL;
+#endif
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
 	vfree(swap_map);
@@ -3207,7 +3240,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+		if (!si->percpu_cluster)
+			goto err_free;
+
+		int cpu;
+		for_each_possible_cpu(cpu) {
+			struct percpu_cluster *cluster;
+
+			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_ENTRY_INVALID;
+			local_lock_init(&cluster->lock);
+		}
+#endif
+	} else {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
 				     GFP_KERNEL);
 		if (!si->global_cluster)
@@ -3495,6 +3544,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	free_percpu(si->percpu_cluster);
+	si->percpu_cluster = NULL;
+#endif
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-12 10:37 ` [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer youngjun.park
@ 2025-06-12 11:14   ` Kairui Song
  2025-06-12 11:16     ` Kairui Song
                       ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Kairui Song @ 2025-06-12 11:14 UTC (permalink / raw)
  To: youngjun.park
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, nphamcs, bhe, baohua, chrisl,
	muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
>
> From: "youngjun.park" <youngjun.park@lge.com>
>

Hi, Youngjun,

Thanks for sharing this series.

> This patch implements swap device selection and swap on/off propagation
> when a cgroup-specific swap priority is set.
>
> There is one workaround to this implementation as follows.
> Current per-cpu swap cluster enforces swap device selection based solely
> on CPU locality, overriding the swap cgroup's configured priorities.

I've been thinking about this, we can switch to a per-cgroup-per-cpu
next cluster selector, the problem with current code is that swap
allocator is not designed with folio / cgroup in mind at all, so it's
really ugly to implement, which is why I have following two patches in
the swap table series:

https://lore.kernel.org/linux-mm/20250514201729.48420-18-ryncsn@gmail.com/
https://lore.kernel.org/linux-mm/20250514201729.48420-22-ryncsn@gmail.com/

The first one makes all swap allocation starts with a folio, the
second one makes the allocator always folio aware. So you can know
which cgroup is doing the allocation at anytime inside the allocator
(and it reduced the number of argument, also improving performance :)
)

So the allocator can just use cgroup's swap info if available, plist,
percpu cluster, and fallback to global locality in a very natural way.

> Therefore, when a swap cgroup priority is assigned, we fall back to
> using per-CPU clusters per swap device, similar to the previous behavior.
>
> A proper fix for this workaround will be evaluated in the next patch.

Hmm, but this is already the last patch in the series?

>
> Signed-off-by: Youngjun park <youngjun.park@lge.com>
> ---
>  include/linux/swap.h      |   8 +++
>  mm/swap.h                 |   8 +++
>  mm/swap_cgroup_priority.c | 133 ++++++++++++++++++++++++++++++++++++++
>  mm/swapfile.c             | 125 ++++++++++++++++++++++++-----------
>  4 files changed, 238 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 49b73911c1bd..d158b0d5c997 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -283,6 +283,13 @@ enum swap_cluster_flags {
>  #define SWAP_NR_ORDERS         1
>  #endif
>
> +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> +struct percpu_cluster {
> +       local_lock_t lock; /* Protect the percpu_cluster above */
> +       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> +};
> +#endif
> +
>  /*
>   * We keep using same cluster for rotational device so IO will be sequential.
>   * The purpose is to optimize SWAP throughput on these device.
> @@ -341,6 +348,7 @@ struct swap_info_struct {
>         struct list_head discard_clusters; /* discard clusters list */
>  #ifdef CONFIG_SWAP_CGROUP_PRIORITY
>         int unique_id;
> +       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
>  #endif
>         struct plist_node avail_lists[]; /*
>                                            * entries in swap_avail_heads, one
> diff --git a/mm/swap.h b/mm/swap.h
> index cd2649c632ed..cb6d653fe3f1 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -113,7 +113,15 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
>  void show_swap_device_unique_id(struct seq_file *m);
>  #else
>  static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg) {}
> +static inline void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapon) {}
> +static inline void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapoff){}
>  static inline void get_swap_unique_id(struct swap_info_struct *si) {}
> +static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
> +                               swp_entry_t *entry, int order)
> +{
> +       return false;
> +}
> +
>  #endif
>
>  #else /* CONFIG_SWAP */
> diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
> index b3e20b676680..bb18cb251f60 100644
> --- a/mm/swap_cgroup_priority.c
> +++ b/mm/swap_cgroup_priority.c
> @@ -54,6 +54,132 @@ static void get_swap_unique_id(struct swap_info_struct *si)
>         si->unique_id = atomic_add_return(1, &swap_unique_id_counter);
>  }
>
> +static bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
> +                               swp_entry_t *entry, int order)
> +{
> +       struct swap_cgroup_priority *swap_priority;
> +       struct swap_cgroup_priority_pnode *pnode, *next;
> +       unsigned long offset;
> +       int node;
> +
> +       if (!memcg)
> +               return false;
> +
> +       spin_lock(&swap_avail_lock);
> +priority_check:
> +       swap_priority = memcg->swap_priority;
> +       if (!swap_priority) {
> +               spin_unlock(&swap_avail_lock);
> +               return false;
> +       }
> +
> +       node = numa_node_id();
> +start_over:
> +       plist_for_each_entry_safe(pnode, next, &swap_priority->plist[node],
> +                                       avail_lists[node]) {
> +               struct swap_info_struct *si = pnode->swap;
> +               plist_requeue(&pnode->avail_lists[node],
> +                       &swap_priority->plist[node]);
> +               spin_unlock(&swap_avail_lock);
> +
> +               if (get_swap_device_info(si)) {
> +                       offset = cluster_alloc_swap_entry(si,
> +                                       order, SWAP_HAS_CACHE, true);
> +                       put_swap_device(si);
> +                       if (offset) {
> +                               *entry = swp_entry(si->type, offset);
> +                               return true;
> +                       }
> +                       if (order)
> +                               return false;
> +               }
> +
> +               spin_lock(&swap_avail_lock);
> +
> +               /* swap_priority is remove or changed under us. */
> +               if (swap_priority != memcg->swap_priority)
> +                       goto priority_check;
> +
> +               if (plist_node_empty(&next->avail_lists[node]))
> +                       goto start_over;
> +       }
> +       spin_unlock(&swap_avail_lock);
> +
> +       return false;
> +}
> +
> +/* add_to_avail_list (swapon / swapusage > 0) */
> +static void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
> +                       bool swapon)
> +{
> +       struct swap_cgroup_priority *swap_priority;
> +       int i;
> +
> +       list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
> +               struct swap_cgroup_priority_pnode *pnode
> +                       = swap_priority->pnode[swp->type];
> +
> +               if (swapon) {
> +                       pnode->swap = swp;
> +                       pnode->prio = swp->prio;
> +               }
> +
> +               /* NUMA priority handling */
> +               for_each_node(i) {
> +                       if (swapon) {
> +                               if (swap_node(swp) == i) {
> +                                       plist_node_init(
> +                                               &pnode->avail_lists[i],
> +                                               1);
> +                               } else {
> +                                       plist_node_init(
> +                                               &pnode->avail_lists[i],
> +                                               -pnode->prio);
> +                               }
> +                       }
> +
> +                       plist_add(&pnode->avail_lists[i],
> +                               &swap_priority->plist[i]);
> +               }
> +       }
> +}
> +
> +/* del_from_avail_list (swapoff / swap usage <= 0) */
> +static void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
> +               bool swapoff)
> +{
> +       struct swap_cgroup_priority *swap_priority;
> +       int nid, i;
> +
> +       list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
> +               struct swap_cgroup_priority_pnode *pnode;
> +
> +               if (swapoff && swp->prio < 0) {
> +                       /*
> +                       * NUMA priority handling
> +                       * mimic swapoff prio adjustment without plist
> +                       */
> +                       for (int i = 0; i < MAX_SWAPFILES; i++) {
> +                               pnode = swap_priority->pnode[i];
> +                               if (pnode->prio > swp->prio ||
> +                                       pnode->swap == swp)
> +                                       continue;
> +
> +                               pnode->prio++;
> +                               for_each_node(nid) {
> +                                       if (pnode->avail_lists[nid].prio != 1)
> +                                               pnode->avail_lists[nid].prio--;
> +                               }
> +                       }
> +               }
> +
> +               pnode = swap_priority->pnode[swp->type];
> +               for_each_node(i)
> +                       plist_del(&pnode->avail_lists[i],
> +                               &swap_priority->plist[i]);
> +       }
> +}
> +
>  int create_swap_cgroup_priority(struct mem_cgroup *memcg,
>                 int unique[], int prio[], int nr)
>  {
> @@ -183,6 +309,12 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
>  {
>         struct swap_cgroup_priority *swap_priority;
>
> +       /*
> +       * XXX: Possible RCU wait? No. Cannot protect priority list addition.
> +       * swap_avail_lock gives protection.
> +       * Think about other object protection mechanism
> +       * might be solve it and better. (e.g object reference)
> +       */
>         spin_lock(&swap_avail_lock);
>         swap_priority = memcg->swap_priority;
>         if (!swap_priority) {
> @@ -198,5 +330,6 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
>
>         for (int i = 0; i < MAX_SWAPFILES; i++)
>                 kvfree(swap_priority->pnode[i]);
> +
>         kvfree(swap_priority);
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f8e48dd2381e..28afe4ec0504 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -126,8 +126,12 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
>         .offset = { SWAP_ENTRY_INVALID },
>         .lock = INIT_LOCAL_LOCK(),
>  };
> -/* TODO: better choice? */
> +/* TODO: better arrangement */
>  #ifdef CONFIG_SWAP_CGROUP_PRIORITY
> +static bool get_swap_device_info(struct swap_info_struct *si);
> +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> +                                             unsigned char usage, bool is_cgroup_priority);
> +static int swap_node(struct swap_info_struct *si);
>  #include "swap_cgroup_priority.c"
>  #endif
>
> @@ -776,7 +780,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>                                             struct swap_cluster_info *ci,
>                                             unsigned long offset,
>                                             unsigned int order,
> -                                           unsigned char usage)
> +                                           unsigned char usage,
> +                                           bool is_cgroup_priority)
>  {
>         unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
>         unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> @@ -820,12 +825,19 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>  out:
>         relocate_cluster(si, ci);
>         unlock_cluster(ci);
> +
>         if (si->flags & SWP_SOLIDSTATE) {
> -               this_cpu_write(percpu_swap_cluster.offset[order], next);
> -               this_cpu_write(percpu_swap_cluster.si[order], si);
> -       } else {
> +               if (!is_cgroup_priority) {
> +                       this_cpu_write(percpu_swap_cluster.offset[order], next);
> +                       this_cpu_write(percpu_swap_cluster.si[order], si);
> +               } else {
> +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> +                       __this_cpu_write(si->percpu_cluster->next[order], next);
> +#endif
> +               }
> +       } else
>                 si->global_cluster->next[order] = next;
> -       }
> +
>         return found;
>  }
>
> @@ -883,7 +895,7 @@ static void swap_reclaim_work(struct work_struct *work)
>   * cluster for current CPU too.
>   */
>  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> -                                             unsigned char usage)
> +                                             unsigned char usage, bool is_cgroup_priority)
>  {
>         struct swap_cluster_info *ci;
>         unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
> @@ -895,32 +907,38 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>         if (order && !(si->flags & SWP_BLKDEV))
>                 return 0;
>
> -       if (!(si->flags & SWP_SOLIDSTATE)) {
> +       if (si->flags & SWP_SOLIDSTATE) {
> +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> +                local_lock(&si->percpu_cluster->lock);
> +                offset = __this_cpu_read(si->percpu_cluster->next[order]);
> +#endif
> +       } else {
>                 /* Serialize HDD SWAP allocation for each device. */
>                 spin_lock(&si->global_cluster_lock);
>                 offset = si->global_cluster->next[order];
> -               if (offset == SWAP_ENTRY_INVALID)
> -                       goto new_cluster;
> +       }
>
> -               ci = lock_cluster(si, offset);
> -               /* Cluster could have been used by another order */
> -               if (cluster_is_usable(ci, order)) {
> -                       if (cluster_is_empty(ci))
> -                               offset = cluster_offset(si, ci);
> -                       found = alloc_swap_scan_cluster(si, ci, offset,
> -                                                       order, usage);
> -               } else {
> -                       unlock_cluster(ci);
> -               }
> -               if (found)
> -                       goto done;
> +       if (offset == SWAP_ENTRY_INVALID)
> +               goto new_cluster;
> +
> +       ci = lock_cluster(si, offset);
> +       /* Cluster could have been used by another order */
> +       if (cluster_is_usable(ci, order)) {
> +               if (cluster_is_empty(ci))
> +                       offset = cluster_offset(si, ci);
> +               found = alloc_swap_scan_cluster(si, ci, offset,
> +                                               order, usage, is_cgroup_priority);
> +       } else {
> +               unlock_cluster(ci);
>         }
> +       if (found)
> +               goto done;
>
>  new_cluster:
>         ci = isolate_lock_cluster(si, &si->free_clusters);
>         if (ci) {
>                 found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> -                                               order, usage);
> +                                               order, usage, is_cgroup_priority);
>                 if (found)
>                         goto done;
>         }
> @@ -934,7 +952,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>
>                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> -                                                       order, usage);
> +                                                       order, usage, is_cgroup_priority);
>                         if (found)
>                                 goto done;
>                         /* Clusters failed to allocate are moved to frag_clusters */
> @@ -952,7 +970,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                          * reclaimable (eg. lazy-freed swap cache) slots.
>                          */
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> -                                                       order, usage);
> +                                                       order, usage, is_cgroup_priority);
>                         if (found)
>                                 goto done;
>                         frags++;
> @@ -979,21 +997,27 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                 while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
>                         atomic_long_dec(&si->frag_cluster_nr[o]);
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> -                                                       0, usage);
> +                                                       0, usage, is_cgroup_priority);
>                         if (found)
>                                 goto done;
>                 }
>
>                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[o]))) {
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> -                                                       0, usage);
> +                                                       0, usage, is_cgroup_priority);
>                         if (found)
>                                 goto done;
>                 }
>         }
>  done:
> -       if (!(si->flags & SWP_SOLIDSTATE))
> +       if (si->flags & SWP_SOLIDSTATE) {
> +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> +               local_unlock(&si->percpu_cluster->lock);
> +#endif
> +       } else {
>                 spin_unlock(&si->global_cluster_lock);
> +       }
> +
>         return found;
>  }
>
> @@ -1032,6 +1056,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>         for_each_node(nid)
>                 plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
>
> +       deactivate_swap_cgroup_priority_pnode(si, swapoff);
>  skip:
>         spin_unlock(&swap_avail_lock);
>  }
> @@ -1075,6 +1100,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
>         for_each_node(nid)
>                 plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
>
> +       activate_swap_cgroup_priority_pnode(si, swapon);
>  skip:
>         spin_unlock(&swap_avail_lock);
>  }
> @@ -1200,7 +1226,8 @@ static bool swap_alloc_fast(swp_entry_t *entry,
>         if (cluster_is_usable(ci, order)) {
>                 if (cluster_is_empty(ci))
>                         offset = cluster_offset(si, ci);
> -               found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
> +               found = alloc_swap_scan_cluster(si, ci, offset, order,
> +                               SWAP_HAS_CACHE, false);
>                 if (found)
>                         *entry = swp_entry(si->type, found);
>         } else {
> @@ -1227,7 +1254,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
>                 plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
>                 spin_unlock(&swap_avail_lock);
>                 if (get_swap_device_info(si)) {
> -                       offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> +                       offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE, false);
>                         put_swap_device(si);
>                         if (offset) {
>                                 *entry = swp_entry(si->type, offset);
> @@ -1294,10 +1321,12 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
>                 }
>         }
>
> -       local_lock(&percpu_swap_cluster.lock);
> -       if (!swap_alloc_fast(&entry, order))
> -               swap_alloc_slow(&entry, order);
> -       local_unlock(&percpu_swap_cluster.lock);
> +       if (!swap_alloc_cgroup_priority(folio_memcg(folio), &entry, order)) {
> +               local_lock(&percpu_swap_cluster.lock);
> +               if (!swap_alloc_fast(&entry, order))
> +                       swap_alloc_slow(&entry, order);
> +               local_unlock(&percpu_swap_cluster.lock);
> +       }
>
>         /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
>         if (mem_cgroup_try_charge_swap(folio, entry))
> @@ -1870,7 +1899,7 @@ swp_entry_t get_swap_page_of_type(int type)
>         /* This is called for allocating swap entry, not cache */
>         if (get_swap_device_info(si)) {
>                 if (si->flags & SWP_WRITEOK) {
> -                       offset = cluster_alloc_swap_entry(si, 0, 1);
> +                       offset = cluster_alloc_swap_entry(si, 0, 1, false);
>                         if (offset) {
>                                 entry = swp_entry(si->type, offset);
>                                 atomic_long_dec(&nr_swap_pages);
> @@ -2800,6 +2829,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         arch_swap_invalidate_area(p->type);
>         zswap_swapoff(p->type);
>         mutex_unlock(&swapon_mutex);
> +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> +       free_percpu(p->percpu_cluster);
> +       p->percpu_cluster = NULL;
> +#endif
>         kfree(p->global_cluster);
>         p->global_cluster = NULL;
>         vfree(swap_map);
> @@ -3207,7 +3240,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>         for (i = 0; i < nr_clusters; i++)
>                 spin_lock_init(&cluster_info[i].lock);
>
> -       if (!(si->flags & SWP_SOLIDSTATE)) {
> +       if (si->flags & SWP_SOLIDSTATE) {
> +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> +               si->percpu_cluster = alloc_percpu(struct percpu_cluster);
> +               if (!si->percpu_cluster)
> +                       goto err_free;
> +
> +               int cpu;
> +               for_each_possible_cpu(cpu) {
> +                       struct percpu_cluster *cluster;
> +
> +                       cluster = per_cpu_ptr(si->percpu_cluster, cpu);
> +                       for (i = 0; i < SWAP_NR_ORDERS; i++)
> +                               cluster->next[i] = SWAP_ENTRY_INVALID;
> +                       local_lock_init(&cluster->lock);
> +               }
> +#endif
> +       } else {
>                 si->global_cluster = kmalloc(sizeof(*si->global_cluster),
>                                      GFP_KERNEL);
>                 if (!si->global_cluster)
> @@ -3495,6 +3544,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  bad_swap_unlock_inode:
>         inode_unlock(inode);
>  bad_swap:
> +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> +       free_percpu(si->percpu_cluster);
> +       si->percpu_cluster = NULL;
> +#endif
>         kfree(si->global_cluster);
>         si->global_cluster = NULL;
>         inode = NULL;
> --
> 2.34.1
>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-12 11:14   ` Kairui Song
@ 2025-06-12 11:16     ` Kairui Song
  2025-06-12 17:28     ` Nhat Pham
  2025-06-13  6:49     ` YoungJun Park
  2 siblings, 0 replies; 25+ messages in thread
From: Kairui Song @ 2025-06-12 11:16 UTC (permalink / raw)
  To: youngjun.park
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, nphamcs, bhe, baohua, chrisl,
	muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Thu, Jun 12, 2025 at 7:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> >
> > From: "youngjun.park" <youngjun.park@lge.com>
> >
>
> Hi, Youngjun,
>
> Thanks for sharing this series.
>
> > This patch implements swap device selection and swap on/off propagation
> > when a cgroup-specific swap priority is set.
> >
> > There is one workaround to this implementation as follows.
> > Current per-cpu swap cluster enforces swap device selection based solely
> > on CPU locality, overriding the swap cgroup's configured priorities.
>
> I've been thinking about this, we can switch to a per-cgroup-per-cpu
> next cluster selector, the problem with current code is that swap
> allocator is not designed with folio / cgroup in mind at all, so it's
> really ugly to implement, which is why I have following two patches in
> the swap table series:
>
> https://lore.kernel.org/linux-mm/20250514201729.48420-18-ryncsn@gmail.com/
> https://lore.kernel.org/linux-mm/20250514201729.48420-22-ryncsn@gmail.com/

And BTW this is not the only reason, these two are also quite critical
to get rid of the swap_cgroup_ctrl later, and maybe switch to use
folio lock for more swap operations, etc..

> The first one makes all swap allocation starts with a folio, the
> second one makes the allocator always folio aware. So you can know
> which cgroup is doing the allocation at anytime inside the allocator
> (and it reduced the number of argument, also improving performance :)
> )
>
> So the allocator can just use cgroup's swap info if available, plist,
> percpu cluster, and fallback to global locality in a very natural way.
>
>
> > Therefore, when a swap cgroup priority is assigned, we fall back to
> > using per-CPU clusters per swap device, similar to the previous behavior.
> >
> > A proper fix for this workaround will be evaluated in the next patch.
>
> Hmm, but this is already the last patch in the series?
>
> >
> > Signed-off-by: Youngjun park <youngjun.park@lge.com>
> > ---
> >  include/linux/swap.h      |   8 +++
> >  mm/swap.h                 |   8 +++
> >  mm/swap_cgroup_priority.c | 133 ++++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c             | 125 ++++++++++++++++++++++++-----------
> >  4 files changed, 238 insertions(+), 36 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 49b73911c1bd..d158b0d5c997 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -283,6 +283,13 @@ enum swap_cluster_flags {
> >  #define SWAP_NR_ORDERS         1
> >  #endif
> >
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +struct percpu_cluster {
> > +       local_lock_t lock; /* Protect the percpu_cluster above */
> > +       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > +};
> > +#endif
> > +
> >  /*
> >   * We keep using same cluster for rotational device so IO will be sequential.
> >   * The purpose is to optimize SWAP throughput on these device.
> > @@ -341,6 +348,7 @@ struct swap_info_struct {
> >         struct list_head discard_clusters; /* discard clusters list */
> >  #ifdef CONFIG_SWAP_CGROUP_PRIORITY
> >         int unique_id;
> > +       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> >  #endif
> >         struct plist_node avail_lists[]; /*
> >                                            * entries in swap_avail_heads, one
> > diff --git a/mm/swap.h b/mm/swap.h
> > index cd2649c632ed..cb6d653fe3f1 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -113,7 +113,15 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
> >  void show_swap_device_unique_id(struct seq_file *m);
> >  #else
> >  static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg) {}
> > +static inline void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapon) {}
> > +static inline void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapoff){}
> >  static inline void get_swap_unique_id(struct swap_info_struct *si) {}
> > +static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
> > +                               swp_entry_t *entry, int order)
> > +{
> > +       return false;
> > +}
> > +
> >  #endif
> >
> >  #else /* CONFIG_SWAP */
> > diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
> > index b3e20b676680..bb18cb251f60 100644
> > --- a/mm/swap_cgroup_priority.c
> > +++ b/mm/swap_cgroup_priority.c
> > @@ -54,6 +54,132 @@ static void get_swap_unique_id(struct swap_info_struct *si)
> >         si->unique_id = atomic_add_return(1, &swap_unique_id_counter);
> >  }
> >
> > +static bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
> > +                               swp_entry_t *entry, int order)
> > +{
> > +       struct swap_cgroup_priority *swap_priority;
> > +       struct swap_cgroup_priority_pnode *pnode, *next;
> > +       unsigned long offset;
> > +       int node;
> > +
> > +       if (!memcg)
> > +               return false;
> > +
> > +       spin_lock(&swap_avail_lock);
> > +priority_check:
> > +       swap_priority = memcg->swap_priority;
> > +       if (!swap_priority) {
> > +               spin_unlock(&swap_avail_lock);
> > +               return false;
> > +       }
> > +
> > +       node = numa_node_id();
> > +start_over:
> > +       plist_for_each_entry_safe(pnode, next, &swap_priority->plist[node],
> > +                                       avail_lists[node]) {
> > +               struct swap_info_struct *si = pnode->swap;
> > +               plist_requeue(&pnode->avail_lists[node],
> > +                       &swap_priority->plist[node]);
> > +               spin_unlock(&swap_avail_lock);
> > +
> > +               if (get_swap_device_info(si)) {
> > +                       offset = cluster_alloc_swap_entry(si,
> > +                                       order, SWAP_HAS_CACHE, true);
> > +                       put_swap_device(si);
> > +                       if (offset) {
> > +                               *entry = swp_entry(si->type, offset);
> > +                               return true;
> > +                       }
> > +                       if (order)
> > +                               return false;
> > +               }
> > +
> > +               spin_lock(&swap_avail_lock);
> > +
> > +               /* swap_priority is remove or changed under us. */
> > +               if (swap_priority != memcg->swap_priority)
> > +                       goto priority_check;
> > +
> > +               if (plist_node_empty(&next->avail_lists[node]))
> > +                       goto start_over;
> > +       }
> > +       spin_unlock(&swap_avail_lock);
> > +
> > +       return false;
> > +}
> > +
> > +/* add_to_avail_list (swapon / swapusage > 0) */
> > +static void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
> > +                       bool swapon)
> > +{
> > +       struct swap_cgroup_priority *swap_priority;
> > +       int i;
> > +
> > +       list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
> > +               struct swap_cgroup_priority_pnode *pnode
> > +                       = swap_priority->pnode[swp->type];
> > +
> > +               if (swapon) {
> > +                       pnode->swap = swp;
> > +                       pnode->prio = swp->prio;
> > +               }
> > +
> > +               /* NUMA priority handling */
> > +               for_each_node(i) {
> > +                       if (swapon) {
> > +                               if (swap_node(swp) == i) {
> > +                                       plist_node_init(
> > +                                               &pnode->avail_lists[i],
> > +                                               1);
> > +                               } else {
> > +                                       plist_node_init(
> > +                                               &pnode->avail_lists[i],
> > +                                               -pnode->prio);
> > +                               }
> > +                       }
> > +
> > +                       plist_add(&pnode->avail_lists[i],
> > +                               &swap_priority->plist[i]);
> > +               }
> > +       }
> > +}
> > +
> > +/* del_from_avail_list (swapoff / swap usage <= 0) */
> > +static void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
> > +               bool swapoff)
> > +{
> > +       struct swap_cgroup_priority *swap_priority;
> > +       int nid, i;
> > +
> > +       list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
> > +               struct swap_cgroup_priority_pnode *pnode;
> > +
> > +               if (swapoff && swp->prio < 0) {
> > +                       /*
> > +                       * NUMA priority handling
> > +                       * mimic swapoff prio adjustment without plist
> > +                       */
> > +                       for (int i = 0; i < MAX_SWAPFILES; i++) {
> > +                               pnode = swap_priority->pnode[i];
> > +                               if (pnode->prio > swp->prio ||
> > +                                       pnode->swap == swp)
> > +                                       continue;
> > +
> > +                               pnode->prio++;
> > +                               for_each_node(nid) {
> > +                                       if (pnode->avail_lists[nid].prio != 1)
> > +                                               pnode->avail_lists[nid].prio--;
> > +                               }
> > +                       }
> > +               }
> > +
> > +               pnode = swap_priority->pnode[swp->type];
> > +               for_each_node(i)
> > +                       plist_del(&pnode->avail_lists[i],
> > +                               &swap_priority->plist[i]);
> > +       }
> > +}
> > +
> >  int create_swap_cgroup_priority(struct mem_cgroup *memcg,
> >                 int unique[], int prio[], int nr)
> >  {
> > @@ -183,6 +309,12 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
> >  {
> >         struct swap_cgroup_priority *swap_priority;
> >
> > +       /*
> > +       * XXX: Possible RCU wait? No. Cannot protect priority list addition.
> > +       * swap_avail_lock gives protection.
> > +       * Think about other object protection mechanism
> > +       * might be solve it and better. (e.g object reference)
> > +       */
> >         spin_lock(&swap_avail_lock);
> >         swap_priority = memcg->swap_priority;
> >         if (!swap_priority) {
> > @@ -198,5 +330,6 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
> >
> >         for (int i = 0; i < MAX_SWAPFILES; i++)
> >                 kvfree(swap_priority->pnode[i]);
> > +
> >         kvfree(swap_priority);
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f8e48dd2381e..28afe4ec0504 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -126,8 +126,12 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
> >         .offset = { SWAP_ENTRY_INVALID },
> >         .lock = INIT_LOCAL_LOCK(),
> >  };
> > -/* TODO: better choice? */
> > +/* TODO: better arrangement */
> >  #ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +static bool get_swap_device_info(struct swap_info_struct *si);
> > +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > +                                             unsigned char usage, bool is_cgroup_priority);
> > +static int swap_node(struct swap_info_struct *si);
> >  #include "swap_cgroup_priority.c"
> >  #endif
> >
> > @@ -776,7 +780,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> >                                             struct swap_cluster_info *ci,
> >                                             unsigned long offset,
> >                                             unsigned int order,
> > -                                           unsigned char usage)
> > +                                           unsigned char usage,
> > +                                           bool is_cgroup_priority)
> >  {
> >         unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
> >         unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> > @@ -820,12 +825,19 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> >  out:
> >         relocate_cluster(si, ci);
> >         unlock_cluster(ci);
> > +
> >         if (si->flags & SWP_SOLIDSTATE) {
> > -               this_cpu_write(percpu_swap_cluster.offset[order], next);
> > -               this_cpu_write(percpu_swap_cluster.si[order], si);
> > -       } else {
> > +               if (!is_cgroup_priority) {
> > +                       this_cpu_write(percpu_swap_cluster.offset[order], next);
> > +                       this_cpu_write(percpu_swap_cluster.si[order], si);
> > +               } else {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +                       __this_cpu_write(si->percpu_cluster->next[order], next);
> > +#endif
> > +               }
> > +       } else
> >                 si->global_cluster->next[order] = next;
> > -       }
> > +
> >         return found;
> >  }
> >
> > @@ -883,7 +895,7 @@ static void swap_reclaim_work(struct work_struct *work)
> >   * cluster for current CPU too.
> >   */
> >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > -                                             unsigned char usage)
> > +                                             unsigned char usage, bool is_cgroup_priority)
> >  {
> >         struct swap_cluster_info *ci;
> >         unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
> > @@ -895,32 +907,38 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >         if (order && !(si->flags & SWP_BLKDEV))
> >                 return 0;
> >
> > -       if (!(si->flags & SWP_SOLIDSTATE)) {
> > +       if (si->flags & SWP_SOLIDSTATE) {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +                local_lock(&si->percpu_cluster->lock);
> > +                offset = __this_cpu_read(si->percpu_cluster->next[order]);
> > +#endif
> > +       } else {
> >                 /* Serialize HDD SWAP allocation for each device. */
> >                 spin_lock(&si->global_cluster_lock);
> >                 offset = si->global_cluster->next[order];
> > -               if (offset == SWAP_ENTRY_INVALID)
> > -                       goto new_cluster;
> > +       }
> >
> > -               ci = lock_cluster(si, offset);
> > -               /* Cluster could have been used by another order */
> > -               if (cluster_is_usable(ci, order)) {
> > -                       if (cluster_is_empty(ci))
> > -                               offset = cluster_offset(si, ci);
> > -                       found = alloc_swap_scan_cluster(si, ci, offset,
> > -                                                       order, usage);
> > -               } else {
> > -                       unlock_cluster(ci);
> > -               }
> > -               if (found)
> > -                       goto done;
> > +       if (offset == SWAP_ENTRY_INVALID)
> > +               goto new_cluster;
> > +
> > +       ci = lock_cluster(si, offset);
> > +       /* Cluster could have been used by another order */
> > +       if (cluster_is_usable(ci, order)) {
> > +               if (cluster_is_empty(ci))
> > +                       offset = cluster_offset(si, ci);
> > +               found = alloc_swap_scan_cluster(si, ci, offset,
> > +                                               order, usage, is_cgroup_priority);
> > +       } else {
> > +               unlock_cluster(ci);
> >         }
> > +       if (found)
> > +               goto done;
> >
> >  new_cluster:
> >         ci = isolate_lock_cluster(si, &si->free_clusters);
> >         if (ci) {
> >                 found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                               order, usage);
> > +                                               order, usage, is_cgroup_priority);
> >                 if (found)
> >                         goto done;
> >         }
> > @@ -934,7 +952,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >
> >                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
> >                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                       order, usage);
> > +                                                       order, usage, is_cgroup_priority);
> >                         if (found)
> >                                 goto done;
> >                         /* Clusters failed to allocate are moved to frag_clusters */
> > @@ -952,7 +970,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                          * reclaimable (eg. lazy-freed swap cache) slots.
> >                          */
> >                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                       order, usage);
> > +                                                       order, usage, is_cgroup_priority);
> >                         if (found)
> >                                 goto done;
> >                         frags++;
> > @@ -979,21 +997,27 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                 while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
> >                         atomic_long_dec(&si->frag_cluster_nr[o]);
> >                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                       0, usage);
> > +                                                       0, usage, is_cgroup_priority);
> >                         if (found)
> >                                 goto done;
> >                 }
> >
> >                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[o]))) {
> >                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                       0, usage);
> > +                                                       0, usage, is_cgroup_priority);
> >                         if (found)
> >                                 goto done;
> >                 }
> >         }
> >  done:
> > -       if (!(si->flags & SWP_SOLIDSTATE))
> > +       if (si->flags & SWP_SOLIDSTATE) {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +               local_unlock(&si->percpu_cluster->lock);
> > +#endif
> > +       } else {
> >                 spin_unlock(&si->global_cluster_lock);
> > +       }
> > +
> >         return found;
> >  }
> >
> > @@ -1032,6 +1056,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> >         for_each_node(nid)
> >                 plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
> >
> > +       deactivate_swap_cgroup_priority_pnode(si, swapoff);
> >  skip:
> >         spin_unlock(&swap_avail_lock);
> >  }
> > @@ -1075,6 +1100,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
> >         for_each_node(nid)
> >                 plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
> >
> > +       activate_swap_cgroup_priority_pnode(si, swapon);
> >  skip:
> >         spin_unlock(&swap_avail_lock);
> >  }
> > @@ -1200,7 +1226,8 @@ static bool swap_alloc_fast(swp_entry_t *entry,
> >         if (cluster_is_usable(ci, order)) {
> >                 if (cluster_is_empty(ci))
> >                         offset = cluster_offset(si, ci);
> > -               found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
> > +               found = alloc_swap_scan_cluster(si, ci, offset, order,
> > +                               SWAP_HAS_CACHE, false);
> >                 if (found)
> >                         *entry = swp_entry(si->type, found);
> >         } else {
> > @@ -1227,7 +1254,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
> >                 plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> >                 spin_unlock(&swap_avail_lock);
> >                 if (get_swap_device_info(si)) {
> > -                       offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> > +                       offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE, false);
> >                         put_swap_device(si);
> >                         if (offset) {
> >                                 *entry = swp_entry(si->type, offset);
> > @@ -1294,10 +1321,12 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> >                 }
> >         }
> >
> > -       local_lock(&percpu_swap_cluster.lock);
> > -       if (!swap_alloc_fast(&entry, order))
> > -               swap_alloc_slow(&entry, order);
> > -       local_unlock(&percpu_swap_cluster.lock);
> > +       if (!swap_alloc_cgroup_priority(folio_memcg(folio), &entry, order)) {
> > +               local_lock(&percpu_swap_cluster.lock);
> > +               if (!swap_alloc_fast(&entry, order))
> > +                       swap_alloc_slow(&entry, order);
> > +               local_unlock(&percpu_swap_cluster.lock);
> > +       }
> >
> >         /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
> >         if (mem_cgroup_try_charge_swap(folio, entry))
> > @@ -1870,7 +1899,7 @@ swp_entry_t get_swap_page_of_type(int type)
> >         /* This is called for allocating swap entry, not cache */
> >         if (get_swap_device_info(si)) {
> >                 if (si->flags & SWP_WRITEOK) {
> > -                       offset = cluster_alloc_swap_entry(si, 0, 1);
> > +                       offset = cluster_alloc_swap_entry(si, 0, 1, false);
> >                         if (offset) {
> >                                 entry = swp_entry(si->type, offset);
> >                                 atomic_long_dec(&nr_swap_pages);
> > @@ -2800,6 +2829,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >         arch_swap_invalidate_area(p->type);
> >         zswap_swapoff(p->type);
> >         mutex_unlock(&swapon_mutex);
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +       free_percpu(p->percpu_cluster);
> > +       p->percpu_cluster = NULL;
> > +#endif
> >         kfree(p->global_cluster);
> >         p->global_cluster = NULL;
> >         vfree(swap_map);
> > @@ -3207,7 +3240,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> >         for (i = 0; i < nr_clusters; i++)
> >                 spin_lock_init(&cluster_info[i].lock);
> >
> > -       if (!(si->flags & SWP_SOLIDSTATE)) {
> > +       if (si->flags & SWP_SOLIDSTATE) {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +               si->percpu_cluster = alloc_percpu(struct percpu_cluster);
> > +               if (!si->percpu_cluster)
> > +                       goto err_free;
> > +
> > +               int cpu;
> > +               for_each_possible_cpu(cpu) {
> > +                       struct percpu_cluster *cluster;
> > +
> > +                       cluster = per_cpu_ptr(si->percpu_cluster, cpu);
> > +                       for (i = 0; i < SWAP_NR_ORDERS; i++)
> > +                               cluster->next[i] = SWAP_ENTRY_INVALID;
> > +                       local_lock_init(&cluster->lock);
> > +               }
> > +#endif
> > +       } else {
> >                 si->global_cluster = kmalloc(sizeof(*si->global_cluster),
> >                                      GFP_KERNEL);
> >                 if (!si->global_cluster)
> > @@ -3495,6 +3544,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> >  bad_swap_unlock_inode:
> >         inode_unlock(inode);
> >  bad_swap:
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +       free_percpu(si->percpu_cluster);
> > +       si->percpu_cluster = NULL;
> > +#endif
> >         kfree(si->global_cluster);
> >         si->global_cluster = NULL;
> >         inode = NULL;
> > --
> > 2.34.1
> >
> >


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization
  2025-06-12 10:37 [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization youngjun.park
  2025-06-12 10:37 ` [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control youngjun.park
  2025-06-12 10:37 ` [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer youngjun.park
@ 2025-06-12 12:24 ` Kairui Song
  2025-06-12 21:32   ` Nhat Pham
  2025-06-13  6:56   ` YoungJun Park
  2 siblings, 2 replies; 25+ messages in thread
From: Kairui Song @ 2025-06-12 12:24 UTC (permalink / raw)
  To: youngjun.park
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, nphamcs, bhe, baohua, chrisl,
	muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Thu, Jun 12, 2025 at 6:38 PM <youngjun.park@lge.com> wrote:
>
> From: Youngjun Park <youngjun.park@lge.com>
>
> Introduction
> ============
> I am a kernel developer working on platforms deployed on commercial consumer devices.
> Due to real-world product requirements, needed to modify the Linux kernel to support
> a new swap management mechanism. The proposed mechanism allows assigning different swap
> priorities to swap devices per cgroup.
> I believe this mechanism can be generally useful for similar constrained-device scenarios
> and would like to propose it for upstream inclusion and solicit feedback from the community.
>
> Motivation
> ==========
> Core requirement was to improve application responsiveness and loading time, especially
> for latency critical applications, without increasing RAM or storage hardware resources.
> Device constraints:
>   - Linux-based embedded platform
>   - Limited system RAM
>   - Small local swap
>   - No option to expand RAM or local swap
> To mitigate this, we explored utilizing idle RAM and storage from nearby devices as remote
> swap space. To maximize its effectiveness, we needed the ability to control which swap devices
> were used by different cgroups:
>   - Assign faster local swap devices to latency critical apps
>   - Assign remote swap devices to background apps
> However, current Linux kernel swap infrastructure does not support per-cgroup swap device
> assignment.
> To solve this, I propose a mechanism to allow each cgroup to specify its own swap device
> priorities.
>
> Evaluated Alternatives
> ======================
> 1. **Per-cgroup dedicated swap devices**
>    - Previously proposed upstream [1]
>    - Challenges in managing global vs per-cgroup swap state
>    - Difficult to integrate with existing memory.limit / swap.max semantics
> 2. **Multi-backend swap device with cgroup-aware routing**
>    - Considered sort of layering violation (block device cgroup awareness)
>    - Swap devices are commonly meant to be physical block devices.
>    - Similar idea mentioned in [2]
> 3. **Per-cgroup swap device enable/disable with swap usage contorl**
>    - Expand swap.max with zswap.writeback usage
>    - Discussed in context of zswap writeback [3]
>    - Cannot express arbitrary priority orderings
>     (e.g. swap priority A-B-C on cgroup C-A-B impossible)
>    - Less flexible than per-device priority approach
> 4. **Per-namespace swap priority configuration**
>    - In short, make swap namespace for swap device priority
>    - Overly complex for our use case
>    - Cgroups are the natural scope for this mechanism
>
> Based on these findings, we chose to prototype per-cgroup swap priority configuration
> as the most natural, least invasive extension of the existing kernel mechanisms.
>
> Design and Semantics
> ====================
> - Each swap device gets a unique ID at `swapon` time
> - Each cgroup has a `memory.swap.priority` interface:
>   - Show unique ID by memory.swap.priority interface
>   - Format: `unique_id:priority,unique_id:priority,...`
>   - All currently-active swap devices must be listed
>   - Priorities follow existing swap infrastructure semantics
> - The interface is writeable and updatable at runtime
> - A priority configuration can be reset via `echo "" > memory.swap.priority`
> - Swap on/off events propagate to all cgroups with priority configurations
>
> Example Usage
> -------------
> # swap device on
> $ swapon
> NAME      TYPE      SIZE USED PRIO
> /dev/sdb  partition 300M  0B   10
> /dev/sdc  partition 300M  0B    5
>
> # assign custom priorities in a cgroup
> $ echo "1:5,2:10" > memory.swap.priority
> $ cat memory.swap.priority
> Active
> /dev/sdb  unique:1  prio:5
> /dev/sdc  unique:2  prio:10
>
> # adding new swap device later
> $ swapon /dev/sdd --priority -1
> $ cat memory.swap.priority
> Active
> /dev/sdb  unique:1  prio:5
> /dev/sdc  unique:2  prio:10
> /dev/sdd  unique:3  prio:-2
>
> # reset cgroup priority
> $ echo "" > memory.swap.priority
> $ cat memory.swap.priority
> Inactive
> /dev/sdb  unique:1  prio:10
> /dev/sdc  unique:2  prio:5
> /dev/sdd  unique:3  prio:-2
>
> Implementation Notes
> ====================
> The items mentioned below are to be considered during the next patch work.
>
> - Workaround using per swap cpu cluster as before
> - Priority propgation of child cgroup
> - And other TODO, XXX
> - Refactoring for reviewability and maintainability, comprehensive testing
>   and performance evaluation

Hi Youngjun,

Interesting idea. For your current approach, I think all we need is
per-cgroup swap meta info structures (and infrastures for maintaining
and manipulating them).

So we have a global version and a cgroup version of "plist, next
cluster list, and maybe something else", right? And then
once the allocator is folio aware it can just prefer the cgroup ones
(as I mentioned in another reply) reusing all the same other
routines. Changes are minimal, the cgroup swap meta infos
and control plane are separately maintained.

It seems aligned quite well with what I wanted to do, and can be done
in a clean and easy to maintain way.

Meanwhile with virtual swap, things could be even more flexible, not
only changing the priority at swapout time, it will also provide
capabilities to migrate and balance devices adaptively, and solve long
term issues like mTHP fragmentation and min-order swapout etc..

Maybe they can be combined, like maybe cgroup can be limited to use
the virtual device or physical ones depending on priority. Seems all
solvable. Just some ideas here.

Vswap can cover the priority part too. I think we might want to avoid
duplicated interfaces.

So I'm just imagining things now, will it be good if we have something
like (following your design):

$ cat memcg1/memory.swap.priority
Active
/dev/vswap:(zram/zswap? with compression params?) unique:0 prio:5

$ cat memcg2/memory.swap.priority
Active
/dev/vswap:/dev/nvme1  unique:1  prio:5
/dev/vswap:/dev/nvme2  unique:2  prio:10
/dev/vswap:/dev/vda  unique:3  prio:15
/dev/sda  unique:4  prio:20

$ cat memcg3/memory.swap.priority
Active
/dev/vda  unique:3  prio:5
/dev/sda  unique:4  prio:15

Meaning memcg1 (high priority) is allowed to use compressed memory
only through vswap, and memcg2 (mid priority) uses disks through vswap
and fallback to HDD. memcg3 (low prio) is only allowed to use slow
devices.

Global fallback just uses everything the system has. It might be over
complex though?


>
> Future Work
> ===========
> These are items that would benefit from further consideration
> and potential implementation.
>
> - Support for per-process or anything else swap prioritization
> - Optional usage limits per swap device (e.g., ratio, max bytes)
> - Generalizing the interface beyond cgroups
>
> References
> ==========
> [1] https://lkml.iu.edu/hypermail/linux/kernel/1404.0/02530.html
> [2] https://lore.kernel.org/linux-mm/CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com
> [3] https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com
>
> All comments and feedback are greatly appreciated.
> Patch will follow.
>
> Sincerely,
> Youngjun Park
>
> youngjun.park (2):
>   mm/swap, memcg: basic structure and logic for per cgroup swap priority
>     control
>   mm: swap: apply per cgroup swap priority mechansim on swap layer
>
>  include/linux/memcontrol.h |   3 +
>  include/linux/swap.h       |  11 ++
>  mm/Kconfig                 |   7 +
>  mm/memcontrol.c            |  55 ++++++
>  mm/swap.h                  |  18 ++
>  mm/swap_cgroup_priority.c  | 335 +++++++++++++++++++++++++++++++++++++
>  mm/swapfile.c              | 129 ++++++++++----
>  7 files changed, 523 insertions(+), 35 deletions(-)
>  create mode 100644 mm/swap_cgroup_priority.c
>
> base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
> --
> 2.34.1
>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-12 11:14   ` Kairui Song
  2025-06-12 11:16     ` Kairui Song
@ 2025-06-12 17:28     ` Nhat Pham
  2025-06-12 18:20       ` Kairui Song
  2025-06-13  6:49     ` YoungJun Park
  2 siblings, 1 reply; 25+ messages in thread
From: Nhat Pham @ 2025-06-12 17:28 UTC (permalink / raw)
  To: Kairui Song
  Cc: youngjun.park, linux-mm, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, cgroups, linux-kernel, shikemeng, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Thu, Jun 12, 2025 at 4:14 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> >
> > From: "youngjun.park" <youngjun.park@lge.com>
> >
>
> Hi, Youngjun,
>
> Thanks for sharing this series.
>
> > This patch implements swap device selection and swap on/off propagation
> > when a cgroup-specific swap priority is set.
> >
> > There is one workaround to this implementation as follows.
> > Current per-cpu swap cluster enforces swap device selection based solely
> > on CPU locality, overriding the swap cgroup's configured priorities.
>
> I've been thinking about this, we can switch to a per-cgroup-per-cpu
> next cluster selector, the problem with current code is that swap

What about per-cpu-per-order-per-swap-device :-? Number of swap
devices is gonna be smaller than number of cgroups, right?

At swap slot allocation time, we check the folio's swap device
priority list, then pump that all the way to the swap allocator.

swap allocator, given a priority list, for each priority level, try to
allocate from that level first. It will get a cluster (either locally
cached or a new one) from swap devices in that priority level, before
moving on to the next priority level.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-12 17:28     ` Nhat Pham
@ 2025-06-12 18:20       ` Kairui Song
  2025-06-12 20:08         ` Nhat Pham
  0 siblings, 1 reply; 25+ messages in thread
From: Kairui Song @ 2025-06-12 18:20 UTC (permalink / raw)
  To: Nhat Pham
  Cc: youngjun.park, linux-mm, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, cgroups, linux-kernel, shikemeng, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Jun 12, 2025 at 4:14 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> > >
> > > From: "youngjun.park" <youngjun.park@lge.com>
> > >
> >
> > Hi, Youngjun,
> >
> > Thanks for sharing this series.
> >
> > > This patch implements swap device selection and swap on/off propagation
> > > when a cgroup-specific swap priority is set.
> > >
> > > There is one workaround to this implementation as follows.
> > > Current per-cpu swap cluster enforces swap device selection based solely
> > > on CPU locality, overriding the swap cgroup's configured priorities.
> >
> > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > next cluster selector, the problem with current code is that swap
>
> What about per-cpu-per-order-per-swap-device :-? Number of swap
> devices is gonna be smaller than number of cgroups, right?

Hi Nhat,

The problem is per cgroup makes more sense (I was suggested to use
cgroup level locality at the very beginning of the implementation of
the allocator in the mail list, but it was hard to do so at that
time), for container environments, a cgroup is a container that runs
one type of workload, so it has its own locality. Things like systemd
also organize different desktop workloads into cgroups. The whole
point is about cgroup.

There could be a lot of cgroups indeed, but not every one of them is
going to enable a cgroup level swap configuration. Youngjun used a
pointer in mem_cgroup, so disabled cgroups have no overhead.

We had a per-device-per-cpu-per-order table previously (before
1b7e90020eb77). It works. Only minor problem is allocation has to
iterate the plist first, then use the si->percpu, and usually there
are only a few swap devices, much less flexible than cgroups.

>
> At swap slot allocation time, we check the folio's swap device
> priority list, then pump that all the way to the swap allocator.
>
> swap allocator, given a priority list, for each priority level, try to
> allocate from that level first. It will get a cluster (either locally
> cached or a new one) from swap devices in that priority level, before
> moving on to the next priority level.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-12 18:20       ` Kairui Song
@ 2025-06-12 20:08         ` Nhat Pham
  2025-06-13  7:11           ` YoungJun Park
  0 siblings, 1 reply; 25+ messages in thread
From: Nhat Pham @ 2025-06-12 20:08 UTC (permalink / raw)
  To: Kairui Song
  Cc: youngjun.park, linux-mm, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, cgroups, linux-kernel, shikemeng, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Thu, Jun 12, 2025 at 11:20 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Thu, Jun 12, 2025 at 4:14 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> > > >
> > > > From: "youngjun.park" <youngjun.park@lge.com>
> > > >
> > >
> > > Hi, Youngjun,
> > >
> > > Thanks for sharing this series.
> > >
> > > > This patch implements swap device selection and swap on/off propagation
> > > > when a cgroup-specific swap priority is set.
> > > >
> > > > There is one workaround to this implementation as follows.
> > > > Current per-cpu swap cluster enforces swap device selection based solely
> > > > on CPU locality, overriding the swap cgroup's configured priorities.
> > >
> > > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > > next cluster selector, the problem with current code is that swap
> >
> > What about per-cpu-per-order-per-swap-device :-? Number of swap
> > devices is gonna be smaller than number of cgroups, right?
>
> Hi Nhat,
>
> The problem is per cgroup makes more sense (I was suggested to use
> cgroup level locality at the very beginning of the implementation of
> the allocator in the mail list, but it was hard to do so at that
> time), for container environments, a cgroup is a container that runs
> one type of workload, so it has its own locality. Things like systemd
> also organize different desktop workloads into cgroups. The whole
> point is about cgroup.

Yeah I know what cgroup represents. Which is why I mentioned in the
next paragraph that are still making decisions based per-cgroup - we
just organize the per-cpu cache based on swap devices. This way, two
cgroups with similar/same priority list can share the clusters, for
each swapfile, in each CPU. There will be a lot less duplication and
overhead. And two cgroups with different priority lists won't
interfere with each other, since they'll target different swapfiles.

Unless we want to nudge the swapfiles/clusters to be self-partitioned
among the cgroups? :) IOW, each cluster contains pages mostly from a
single cgroup (with some stranglers mixed in). I suppose that will be
very useful for swap on rotational drives where read contiguity is
imperative, but not sure about other backends :-?

Anyway, no strong opinions to be completely honest :) Was just
throwing out some ideas. Per-cgroup-per-cpu-per-order sounds good to
me too, if it's easy to do.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization
  2025-06-12 12:24 ` [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization Kairui Song
@ 2025-06-12 21:32   ` Nhat Pham
  2025-06-13  6:56   ` YoungJun Park
  1 sibling, 0 replies; 25+ messages in thread
From: Nhat Pham @ 2025-06-12 21:32 UTC (permalink / raw)
  To: Kairui Song
  Cc: youngjun.park, linux-mm, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, cgroups, linux-kernel, shikemeng, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Thu, Jun 12, 2025 at 5:24 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Jun 12, 2025 at 6:38 PM <youngjun.park@lge.com> wrote:
> >
> > From: Youngjun Park <youngjun.park@lge.com>
> >
> > Introduction
> > ============
> > I am a kernel developer working on platforms deployed on commercial consumer devices.
> > Due to real-world product requirements, needed to modify the Linux kernel to support
> > a new swap management mechanism. The proposed mechanism allows assigning different swap
> > priorities to swap devices per cgroup.
> > I believe this mechanism can be generally useful for similar constrained-device scenarios
> > and would like to propose it for upstream inclusion and solicit feedback from the community.

We're mostly just using zswap and disk swap, for now, so I don't have
too much input for this.

Kairui, would this design satisfy your zram use case as well?

> >
> > Motivation
> > ==========
> > Core requirement was to improve application responsiveness and loading time, especially
> > for latency critical applications, without increasing RAM or storage hardware resources.
> > Device constraints:
> >   - Linux-based embedded platform
> >   - Limited system RAM
> >   - Small local swap
> >   - No option to expand RAM or local swap
> > To mitigate this, we explored utilizing idle RAM and storage from nearby devices as remote
> > swap space. To maximize its effectiveness, we needed the ability to control which swap devices
> > were used by different cgroups:
> >   - Assign faster local swap devices to latency critical apps
> >   - Assign remote swap devices to background apps
> > However, current Linux kernel swap infrastructure does not support per-cgroup swap device
> > assignment.
> > To solve this, I propose a mechanism to allow each cgroup to specify its own swap device
> > priorities.
> >
> > Evaluated Alternatives
> > ======================
> > 1. **Per-cgroup dedicated swap devices**
> >    - Previously proposed upstream [1]
> >    - Challenges in managing global vs per-cgroup swap state
> >    - Difficult to integrate with existing memory.limit / swap.max semantics
> > 2. **Multi-backend swap device with cgroup-aware routing**
> >    - Considered sort of layering violation (block device cgroup awareness)
> >    - Swap devices are commonly meant to be physical block devices.
> >    - Similar idea mentioned in [2]
> > 3. **Per-cgroup swap device enable/disable with swap usage contorl**
> >    - Expand swap.max with zswap.writeback usage
> >    - Discussed in context of zswap writeback [3]
> >    - Cannot express arbitrary priority orderings
> >     (e.g. swap priority A-B-C on cgroup C-A-B impossible)
> >    - Less flexible than per-device priority approach
> > 4. **Per-namespace swap priority configuration**
> >    - In short, make swap namespace for swap device priority
> >    - Overly complex for our use case
> >    - Cgroups are the natural scope for this mechanism
> >
> > Based on these findings, we chose to prototype per-cgroup swap priority configuration
> > as the most natural, least invasive extension of the existing kernel mechanisms.
> >
> > Design and Semantics
> > ====================
> > - Each swap device gets a unique ID at `swapon` time
> > - Each cgroup has a `memory.swap.priority` interface:
> >   - Show unique ID by memory.swap.priority interface
> >   - Format: `unique_id:priority,unique_id:priority,...`
> >   - All currently-active swap devices must be listed
> >   - Priorities follow existing swap infrastructure semantics
> > - The interface is writeable and updatable at runtime
> > - A priority configuration can be reset via `echo "" > memory.swap.priority`
> > - Swap on/off events propagate to all cgroups with priority configurations
> >
> > Example Usage
> > -------------
> > # swap device on
> > $ swapon
> > NAME      TYPE      SIZE USED PRIO
> > /dev/sdb  partition 300M  0B   10
> > /dev/sdc  partition 300M  0B    5
> >
> > # assign custom priorities in a cgroup
> > $ echo "1:5,2:10" > memory.swap.priority
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb  unique:1  prio:5
> > /dev/sdc  unique:2  prio:10
> >
> > # adding new swap device later
> > $ swapon /dev/sdd --priority -1
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb  unique:1  prio:5
> > /dev/sdc  unique:2  prio:10
> > /dev/sdd  unique:3  prio:-2
> >
> > # reset cgroup priority
> > $ echo "" > memory.swap.priority
> > $ cat memory.swap.priority
> > Inactive
> > /dev/sdb  unique:1  prio:10
> > /dev/sdc  unique:2  prio:5
> > /dev/sdd  unique:3  prio:-2
> >
> > Implementation Notes
> > ====================
> > The items mentioned below are to be considered during the next patch work.
> >
> > - Workaround using per swap cpu cluster as before
> > - Priority propgation of child cgroup
> > - And other TODO, XXX
> > - Refactoring for reviewability and maintainability, comprehensive testing
> >   and performance evaluation
>
> Hi Youngjun,
>
> Interesting idea. For your current approach, I think all we need is
> per-cgroup swap meta info structures (and infrastures for maintaining
> and manipulating them).

Agreed.

>
> So we have a global version and a cgroup version of "plist, next
> cluster list, and maybe something else", right? And then
> once the allocator is folio aware it can just prefer the cgroup ones
> (as I mentioned in another reply) reusing all the same other
> routines. Changes are minimal, the cgroup swap meta infos
> and control plane are separately maintained.
>
> It seems aligned quite well with what I wanted to do, and can be done
> in a clean and easy to maintain way.
>
> Meanwhile with virtual swap, things could be even more flexible, not
> only changing the priority at swapout time, it will also provide
> capabilities to migrate and balance devices adaptively, and solve long
> term issues like mTHP fragmentation and min-order swapout etc..

Agreed.

>
> Maybe they can be combined, like maybe cgroup can be limited to use
> the virtual device or physical ones depending on priority. Seems all
> solvable. Just some ideas here.

100%

>
> Vswap can cover the priority part too. I think we might want to avoid
> duplicated interfaces.

Yeah as long as we have a reasonable cgroup interface, we can always
change the implementation later. We can move things to virtual swap,
etc. at a latter time.

>
> So I'm just imagining things now, will it be good if we have something
> like (following your design):
>
> $ cat memcg1/memory.swap.priority
> Active
> /dev/vswap:(zram/zswap? with compression params?) unique:0 prio:5
>
> $ cat memcg2/memory.swap.priority
> Active
> /dev/vswap:/dev/nvme1  unique:1  prio:5
> /dev/vswap:/dev/nvme2  unique:2  prio:10
> /dev/vswap:/dev/vda  unique:3  prio:15
> /dev/sda  unique:4  prio:20
>
> $ cat memcg3/memory.swap.priority
> Active
> /dev/vda  unique:3  prio:5
> /dev/sda  unique:4  prio:15
>
> Meaning memcg1 (high priority) is allowed to use compressed memory
> only through vswap, and memcg2 (mid priority) uses disks through vswap
> and fallback to HDD. memcg3 (low prio) is only allowed to use slow
> devices.
>
> Global fallback just uses everything the system has. It might be over
> complex though?

Sounds good to me.

>
>
> >
> > Future Work
> > ===========
> > These are items that would benefit from further consideration
> > and potential implementation.
> >
> > - Support for per-process or anything else swap prioritization

This might be too granular.


> > - Optional usage limits per swap device (e.g., ratio, max bytes)
> > - Generalizing the interface beyond cgroups
> >
> > References
> > ==========
> > [1] https://lkml.iu.edu/hypermail/linux/kernel/1404.0/02530.html
> > [2] https://lore.kernel.org/linux-mm/CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com
> > [3] https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com
> >
> > All comments and feedback are greatly appreciated.
> > Patch will follow.
> >
> > Sincerely,
> > Youngjun Park
> >
> > youngjun.park (2):
> >   mm/swap, memcg: basic structure and logic for per cgroup swap priority
> >     control
> >   mm: swap: apply per cgroup swap priority mechansim on swap layer
> >
> >  include/linux/memcontrol.h |   3 +
> >  include/linux/swap.h       |  11 ++
> >  mm/Kconfig                 |   7 +
> >  mm/memcontrol.c            |  55 ++++++
> >  mm/swap.h                  |  18 ++
> >  mm/swap_cgroup_priority.c  | 335 +++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c              | 129 ++++++++++----
> >  7 files changed, 523 insertions(+), 35 deletions(-)
> >  create mode 100644 mm/swap_cgroup_priority.c
> >
> > base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
> > --
> > 2.34.1
> >
> >


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-12 11:14   ` Kairui Song
  2025-06-12 11:16     ` Kairui Song
  2025-06-12 17:28     ` Nhat Pham
@ 2025-06-13  6:49     ` YoungJun Park
  2 siblings, 0 replies; 25+ messages in thread
From: YoungJun Park @ 2025-06-13  6:49 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, nphamcs, bhe, baohua, chrisl,
	muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Thu, Jun 12, 2025 at 07:14:20PM +0800, Kairui Song wrote:
> On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> >
> > From: "youngjun.park" <youngjun.park@lge.com>
> >
> 
> Hi, Youngjun,
> 
> Thanks for sharing this series.
> 
> > This patch implements swap device selection and swap on/off propagation
> > when a cgroup-specific swap priority is set.
> >
> > There is one workaround to this implementation as follows.
> > Current per-cpu swap cluster enforces swap device selection based solely
> > on CPU locality, overriding the swap cgroup's configured priorities.
> 
> I've been thinking about this, we can switch to a per-cgroup-per-cpu
> next cluster selector, the problem with current code is that swap
> allocator is not designed with folio / cgroup in mind at all, so it's
> really ugly to implement, which is why I have following two patches in
> the swap table series:
 
This seems to be the suitable alternative for upstream at the moment.
I think there are still a few things that need to be considered, though.         
(Nhat pointed it out well. I've share my thoughts on that context. )  

> https://lore.kernel.org/linux-mm/20250514201729.48420-18-ryncsn@gmail.com/
> https://lore.kernel.org/linux-mm/20250514201729.48420-22-ryncsn@gmail.com/
> 
> The first one makes all swap allocation starts with a folio, the
> second one makes the allocator always folio aware. So you can know
> which cgroup is doing the allocation at anytime inside the allocator
> (and it reduced the number of argument, also improving performance :)
> )
> So the allocator can just use cgroup's swap info if available, plist,
> percpu cluster, and fallback to global locality in a very natural way.
> 

Wow! This is exactly the situation I needed. 
I thought it was uncomfortable to pass memcg parameter.
If memcg can be naturally identified within the allocation, as you mentioned,    
It would be good both performance-wise and design-wise. 

> > Therefore, when a swap cgroup priority is assigned, we fall back to
> > using per-CPU clusters per swap device, similar to the previous behavior.
> >
> > A proper fix for this workaround will be evaluated in the next patch.
> 
> Hmm, but this is already the last patch in the series?

Ah! The next patch series refers to the one.
I'm still evaluating this part and wasn't confident enough to include it
in the current version.
At first, I wanted to get feedback on the core part, I'm currently pursuing.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization
  2025-06-12 12:24 ` [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization Kairui Song
  2025-06-12 21:32   ` Nhat Pham
@ 2025-06-13  6:56   ` YoungJun Park
  1 sibling, 0 replies; 25+ messages in thread
From: YoungJun Park @ 2025-06-13  6:56 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, nphamcs, bhe, baohua, chrisl,
	muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Thu, Jun 12, 2025 at 08:24:08PM +0800, Kairui Song wrote:
> On Thu, Jun 12, 2025 at 6:38 PM <youngjun.park@lge.com> wrote:
> >
> > From: Youngjun Park <youngjun.park@lge.com>
> >
> > Introduction
> > ============
> > I am a kernel developer working on platforms deployed on commercial consumer devices.
> > Due to real-world product requirements, needed to modify the Linux kernel to support
> > a new swap management mechanism. The proposed mechanism allows assigning different swap
> > priorities to swap devices per cgroup.
> > I believe this mechanism can be generally useful for similar constrained-device scenarios
> > and would like to propose it for upstream inclusion and solicit feedback from the community.
> >
> > Motivation
> > ==========
> > Core requirement was to improve application responsiveness and loading time, especially
> > for latency critical applications, without increasing RAM or storage hardware resources.
> > Device constraints:
> >   - Linux-based embedded platform
> >   - Limited system RAM
> >   - Small local swap
> >   - No option to expand RAM or local swap
> > To mitigate this, we explored utilizing idle RAM and storage from nearby devices as remote
> > swap space. To maximize its effectiveness, we needed the ability to control which swap devices
> > were used by different cgroups:
> >   - Assign faster local swap devices to latency critical apps
> >   - Assign remote swap devices to background apps
> > However, current Linux kernel swap infrastructure does not support per-cgroup swap device
> > assignment.
> > To solve this, I propose a mechanism to allow each cgroup to specify its own swap device
> > priorities.
> >
> > Evaluated Alternatives
> > ======================
> > 1. **Per-cgroup dedicated swap devices**
> >    - Previously proposed upstream [1]
> >    - Challenges in managing global vs per-cgroup swap state
> >    - Difficult to integrate with existing memory.limit / swap.max semantics
> > 2. **Multi-backend swap device with cgroup-aware routing**
> >    - Considered sort of layering violation (block device cgroup awareness)
> >    - Swap devices are commonly meant to be physical block devices.
> >    - Similar idea mentioned in [2]
> > 3. **Per-cgroup swap device enable/disable with swap usage contorl**
> >    - Expand swap.max with zswap.writeback usage
> >    - Discussed in context of zswap writeback [3]
> >    - Cannot express arbitrary priority orderings
> >     (e.g. swap priority A-B-C on cgroup C-A-B impossible)
> >    - Less flexible than per-device priority approach
> > 4. **Per-namespace swap priority configuration**
> >    - In short, make swap namespace for swap device priority
> >    - Overly complex for our use case
> >    - Cgroups are the natural scope for this mechanism
> >
> > Based on these findings, we chose to prototype per-cgroup swap priority configuration
> > as the most natural, least invasive extension of the existing kernel mechanisms.
> >
> > Design and Semantics
> > ====================
> > - Each swap device gets a unique ID at `swapon` time
> > - Each cgroup has a `memory.swap.priority` interface:
> >   - Show unique ID by memory.swap.priority interface
> >   - Format: `unique_id:priority,unique_id:priority,...`
> >   - All currently-active swap devices must be listed
> >   - Priorities follow existing swap infrastructure semantics
> > - The interface is writeable and updatable at runtime
> > - A priority configuration can be reset via `echo "" > memory.swap.priority`
> > - Swap on/off events propagate to all cgroups with priority configurations
> >
> > Example Usage
> > -------------
> > # swap device on
> > $ swapon
> > NAME      TYPE      SIZE USED PRIO
> > /dev/sdb  partition 300M  0B   10
> > /dev/sdc  partition 300M  0B    5
> >
> > # assign custom priorities in a cgroup
> > $ echo "1:5,2:10" > memory.swap.priority
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb  unique:1  prio:5
> > /dev/sdc  unique:2  prio:10
> >
> > # adding new swap device later
> > $ swapon /dev/sdd --priority -1
> > $ cat memory.swap.priority
> > Active
> > /dev/sdb  unique:1  prio:5
> > /dev/sdc  unique:2  prio:10
> > /dev/sdd  unique:3  prio:-2
> >
> > # reset cgroup priority
> > $ echo "" > memory.swap.priority
> > $ cat memory.swap.priority
> > Inactive
> > /dev/sdb  unique:1  prio:10
> > /dev/sdc  unique:2  prio:5
> > /dev/sdd  unique:3  prio:-2
> >
> > Implementation Notes
> > ====================
> > The items mentioned below are to be considered during the next patch work.
> >
> > - Workaround using per swap cpu cluster as before
> > - Priority propgation of child cgroup
> > - And other TODO, XXX
> > - Refactoring for reviewability and maintainability, comprehensive testing
> >   and performance evaluation
> 
> Hi Youngjun,
> 
> Interesting idea. For your current approach, I think all we need is
> per-cgroup swap meta info structures (and infrastures for maintaining
> and manipulating them).
> 
> So we have a global version and a cgroup version of "plist, next
> cluster list, and maybe something else", right? And then
> once the allocator is folio aware it can just prefer the cgroup ones
> (as I mentioned in another reply) reusing all the same other
> routines. Changes are minimal, the cgroup swap meta infos
> and control plane are separately maintained.
> 
> It seems aligned quite well with what I wanted to do, and can be done
> in a clean and easy to maintain way.
> 
> Meanwhile with virtual swap, things could be even more flexible, not
> only changing the priority at swapout time, it will also provide
> capabilities to migrate and balance devices adaptively, and solve long
> term issues like mTHP fragmentation and min-order swapout etc..
> 
> Maybe they can be combined, like maybe cgroup can be limited to use
> the virtual device or physical ones depending on priority. Seems all
> solvable. Just some ideas here.

I had been thinking about the work related to vswap and alignment,
so I'm glad to hear that they can harmonize.

> Vswap can cover the priority part too. I think we might want to avoid
> duplicated interfaces.
> 
> So I'm just imagining things now, will it be good if we have something
> like (following your design):
> 
> $ cat memcg1/memory.swap.priority
> Active
> /dev/vswap:(zram/zswap? with compression params?) unique:0 prio:5
> 
> $ cat memcg2/memory.swap.priority
> Active
> /dev/vswap:/dev/nvme1  unique:1  prio:5
> /dev/vswap:/dev/nvme2  unique:2  prio:10
> /dev/vswap:/dev/vda  unique:3  prio:15
> /dev/sda  unique:4  prio:20
> 
> $ cat memcg3/memory.swap.priority
> Active
> /dev/vda  unique:3  prio:5
> /dev/sda  unique:4  prio:15
> 
> Meaning memcg1 (high priority) is allowed to use compressed memory
> only through vswap, and memcg2 (mid priority) uses disks through vswap
> and fallback to HDD. memcg3 (low prio) is only allowed to use slow
> devices.
> 
> Global fallback just uses everything the system has. It might be over
> complex though?
 
Just looking at the example usage which you mention, 
it seems flexible and good.
I will think more about this in relation to it.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-12 20:08         ` Nhat Pham
@ 2025-06-13  7:11           ` YoungJun Park
  2025-06-13  7:36             ` Kairui Song
  0 siblings, 1 reply; 25+ messages in thread
From: YoungJun Park @ 2025-06-13  7:11 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Kairui Song, linux-mm, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, cgroups, linux-kernel, shikemeng, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Thu, Jun 12, 2025 at 01:08:08PM -0700, Nhat Pham wrote:
> On Thu, Jun 12, 2025 at 11:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Thu, Jun 12, 2025 at 4:14 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> > > > >
> > > > > From: "youngjun.park" <youngjun.park@lge.com>
> > > > >
> > > >
> > > > Hi, Youngjun,
> > > >
> > > > Thanks for sharing this series.
> > > >
> > > > > This patch implements swap device selection and swap on/off propagation
> > > > > when a cgroup-specific swap priority is set.
> > > > >
> > > > > There is one workaround to this implementation as follows.
> > > > > Current per-cpu swap cluster enforces swap device selection based solely
> > > > > on CPU locality, overriding the swap cgroup's configured priorities.
> > > >
> > > > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > > > next cluster selector, the problem with current code is that swap
> > >
> > > What about per-cpu-per-order-per-swap-device :-? Number of swap
> > > devices is gonna be smaller than number of cgroups, right?
> >
> > Hi Nhat,
> >
> > The problem is per cgroup makes more sense (I was suggested to use
> > cgroup level locality at the very beginning of the implementation of
> > the allocator in the mail list, but it was hard to do so at that
> > time), for container environments, a cgroup is a container that runs
> > one type of workload, so it has its own locality. Things like systemd
> > also organize different desktop workloads into cgroups. The whole
> > point is about cgroup.
> 
> Yeah I know what cgroup represents. Which is why I mentioned in the
> next paragraph that are still making decisions based per-cgroup - we
> just organize the per-cpu cache based on swap devices. This way, two
> cgroups with similar/same priority list can share the clusters, for
> each swapfile, in each CPU. There will be a lot less duplication and
> overhead. And two cgroups with different priority lists won't
> interfere with each other, since they'll target different swapfiles.
> 
> Unless we want to nudge the swapfiles/clusters to be self-partitioned
> among the cgroups? :) IOW, each cluster contains pages mostly from a
> single cgroup (with some stranglers mixed in). I suppose that will be
> very useful for swap on rotational drives where read contiguity is
> imperative, but not sure about other backends :-? 
> Anyway, no strong opinions to be completely honest :) Was just
> throwing out some ideas. Per-cgroup-per-cpu-per-order sounds good to
> me too, if it's easy to do.

Good point!
I agree with the mention that self-partitioned clusters and duplicated priority.
One concern is the cost of synchronization.
Specifically the one incurred when accessing the prioritized swap device
From a simple performance perspective, a per-cgroup-per-CPU implementation
seems favorable - in line with the current swap allocation fastpath.

It seems most reasonable to carefully compare the pros and cons of the           
tow approaches.

To summaraize,

Option 1. per-cgroup-per-cpu
Pros: upstream fit. performance. 
Cons: duplicate priority(some memory structure consumtion cost), 
self partioned cluster 

Option 2. per-cpu-per-order(per-device)
Pros: Cons of Option1
Cons: Pros of Option1

It's not easy to draw a definitive conclusion right away, 
I should also evaluate other pros and cons that may arise during actual 
implementation.
so I'd like to take some time to review things in more detail 
and share my thoughs and conclusions in the next patch series.

What do you think, Nhat and Kairui?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-13  7:11           ` YoungJun Park
@ 2025-06-13  7:36             ` Kairui Song
  2025-06-13  7:38               ` Kairui Song
  0 siblings, 1 reply; 25+ messages in thread
From: Kairui Song @ 2025-06-13  7:36 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Nhat Pham, linux-mm, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, cgroups, linux-kernel, shikemeng, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Fri, Jun 13, 2025 at 3:11 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Thu, Jun 12, 2025 at 01:08:08PM -0700, Nhat Pham wrote:
> > On Thu, Jun 12, 2025 at 11:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 12, 2025 at 4:14 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> > > > > >
> > > > > > From: "youngjun.park" <youngjun.park@lge.com>
> > > > > >
> > > > >
> > > > > Hi, Youngjun,
> > > > >
> > > > > Thanks for sharing this series.
> > > > >
> > > > > > This patch implements swap device selection and swap on/off propagation
> > > > > > when a cgroup-specific swap priority is set.
> > > > > >
> > > > > > There is one workaround to this implementation as follows.
> > > > > > Current per-cpu swap cluster enforces swap device selection based solely
> > > > > > on CPU locality, overriding the swap cgroup's configured priorities.
> > > > >
> > > > > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > > > > next cluster selector, the problem with current code is that swap
> > > >
> > > > What about per-cpu-per-order-per-swap-device :-? Number of swap
> > > > devices is gonna be smaller than number of cgroups, right?
> > >
> > > Hi Nhat,
> > >
> > > The problem is per cgroup makes more sense (I was suggested to use
> > > cgroup level locality at the very beginning of the implementation of
> > > the allocator in the mail list, but it was hard to do so at that
> > > time), for container environments, a cgroup is a container that runs
> > > one type of workload, so it has its own locality. Things like systemd
> > > also organize different desktop workloads into cgroups. The whole
> > > point is about cgroup.
> >
> > Yeah I know what cgroup represents. Which is why I mentioned in the
> > next paragraph that are still making decisions based per-cgroup - we
> > just organize the per-cpu cache based on swap devices. This way, two
> > cgroups with similar/same priority list can share the clusters, for
> > each swapfile, in each CPU. There will be a lot less duplication and
> > overhead. And two cgroups with different priority lists won't
> > interfere with each other, since they'll target different swapfiles.
> >
> > Unless we want to nudge the swapfiles/clusters to be self-partitioned
> > among the cgroups? :) IOW, each cluster contains pages mostly from a
> > single cgroup (with some stranglers mixed in). I suppose that will be
> > very useful for swap on rotational drives where read contiguity is
> > imperative, but not sure about other backends :-?
> > Anyway, no strong opinions to be completely honest :) Was just
> > throwing out some ideas. Per-cgroup-per-cpu-per-order sounds good to
> > me too, if it's easy to do.
>
> Good point!
> I agree with the mention that self-partitioned clusters and duplicated priority.
> One concern is the cost of synchronization.
> Specifically the one incurred when accessing the prioritized swap device
> From a simple performance perspective, a per-cgroup-per-CPU implementation
> seems favorable - in line with the current swap allocation fastpath.
>
> It seems most reasonable to carefully compare the pros and cons of the
> tow approaches.
>
> To summaraize,
>
> Option 1. per-cgroup-per-cpu
> Pros: upstream fit. performance.
> Cons: duplicate priority(some memory structure consumtion cost),
> self partioned cluster
>
> Option 2. per-cpu-per-order(per-device)
> Pros: Cons of Option1
> Cons: Pros of Option1
>
> It's not easy to draw a definitive conclusion right away,
> I should also evaluate other pros and cons that may arise during actual
> implementation.
> so I'd like to take some time to review things in more detail
> and share my thoughs and conclusions in the next patch series.
>
> What do you think, Nhat and Kairui?

Ah, I think what might be best fits here is, each cgroup have a pcp
device list,  and each device have a pcp cluster list:

folio -> mem_cgroup -> swap_priority (maybe a more generic name is
better?) -> swap_device_pcp (recording only the *si per order)
swap_device_info -> swap_cluster_pcp (cluster offset per order)

And if mem_cgroup -> swap_priority is NULL, fallback to a global
swap_device_pcp.

This seems to fit what Nhat suggested, and easy to implement, since
both si and folio->memcg are accessible easily.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-13  7:36             ` Kairui Song
@ 2025-06-13  7:38               ` Kairui Song
  2025-06-13 10:45                 ` YoungJun Park
  0 siblings, 1 reply; 25+ messages in thread
From: Kairui Song @ 2025-06-13  7:38 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Nhat Pham, linux-mm, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, cgroups, linux-kernel, shikemeng, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Fri, Jun 13, 2025 at 3:36 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, Jun 13, 2025 at 3:11 PM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Thu, Jun 12, 2025 at 01:08:08PM -0700, Nhat Pham wrote:
> > > On Thu, Jun 12, 2025 at 11:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jun 12, 2025 at 4:14 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> > > > > > >
> > > > > > > From: "youngjun.park" <youngjun.park@lge.com>
> > > > > > >
> > > > > >
> > > > > > Hi, Youngjun,
> > > > > >
> > > > > > Thanks for sharing this series.
> > > > > >
> > > > > > > This patch implements swap device selection and swap on/off propagation
> > > > > > > when a cgroup-specific swap priority is set.
> > > > > > >
> > > > > > > There is one workaround to this implementation as follows.
> > > > > > > Current per-cpu swap cluster enforces swap device selection based solely
> > > > > > > on CPU locality, overriding the swap cgroup's configured priorities.
> > > > > >
> > > > > > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > > > > > next cluster selector, the problem with current code is that swap
> > > > >
> > > > > What about per-cpu-per-order-per-swap-device :-? Number of swap
> > > > > devices is gonna be smaller than number of cgroups, right?
> > > >
> > > > Hi Nhat,
> > > >
> > > > The problem is per cgroup makes more sense (I was suggested to use
> > > > cgroup level locality at the very beginning of the implementation of
> > > > the allocator in the mail list, but it was hard to do so at that
> > > > time), for container environments, a cgroup is a container that runs
> > > > one type of workload, so it has its own locality. Things like systemd
> > > > also organize different desktop workloads into cgroups. The whole
> > > > point is about cgroup.
> > >
> > > Yeah I know what cgroup represents. Which is why I mentioned in the
> > > next paragraph that are still making decisions based per-cgroup - we
> > > just organize the per-cpu cache based on swap devices. This way, two
> > > cgroups with similar/same priority list can share the clusters, for
> > > each swapfile, in each CPU. There will be a lot less duplication and
> > > overhead. And two cgroups with different priority lists won't
> > > interfere with each other, since they'll target different swapfiles.
> > >
> > > Unless we want to nudge the swapfiles/clusters to be self-partitioned
> > > among the cgroups? :) IOW, each cluster contains pages mostly from a
> > > single cgroup (with some stranglers mixed in). I suppose that will be
> > > very useful for swap on rotational drives where read contiguity is
> > > imperative, but not sure about other backends :-?
> > > Anyway, no strong opinions to be completely honest :) Was just
> > > throwing out some ideas. Per-cgroup-per-cpu-per-order sounds good to
> > > me too, if it's easy to do.
> >
> > Good point!
> > I agree with the mention that self-partitioned clusters and duplicated priority.
> > One concern is the cost of synchronization.
> > Specifically the one incurred when accessing the prioritized swap device
> > From a simple performance perspective, a per-cgroup-per-CPU implementation
> > seems favorable - in line with the current swap allocation fastpath.
> >
> > It seems most reasonable to carefully compare the pros and cons of the
> > tow approaches.
> >
> > To summaraize,
> >
> > Option 1. per-cgroup-per-cpu
> > Pros: upstream fit. performance.
> > Cons: duplicate priority(some memory structure consumtion cost),
> > self partioned cluster
> >
> > Option 2. per-cpu-per-order(per-device)
> > Pros: Cons of Option1
> > Cons: Pros of Option1
> >
> > It's not easy to draw a definitive conclusion right away,
> > I should also evaluate other pros and cons that may arise during actual
> > implementation.
> > so I'd like to take some time to review things in more detail
> > and share my thoughs and conclusions in the next patch series.
> >
> > What do you think, Nhat and Kairui?
>
> Ah, I think what might be best fits here is, each cgroup have a pcp
> device list,  and each device have a pcp cluster list:
>
> folio -> mem_cgroup -> swap_priority (maybe a more generic name is
> better?) -> swap_device_pcp (recording only the *si per order)
> swap_device_info -> swap_cluster_pcp (cluster offset per order)

Sorry the truncate made this hard to read, let me try again:

folio ->
  mem_cgroup ->
    swap_priority (maybe a more generic name is better?) ->
      swap_device_pcp (recording only the *si per order)

And:
swap_device_info ->
  swap_cluster_pcp (cluster offset per order)

And if mem_cgroup -> swap_priority is NULL,
fallback to a global swap_device_pcp.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
  2025-06-13  7:38               ` Kairui Song
@ 2025-06-13 10:45                 ` YoungJun Park
  0 siblings, 0 replies; 25+ messages in thread
From: YoungJun Park @ 2025-06-13 10:45 UTC (permalink / raw)
  To: Kairui Song
  Cc: Nhat Pham, linux-mm, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, cgroups, linux-kernel, shikemeng, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Fri, Jun 13, 2025 at 03:38:37PM +0800, Kairui Song wrote:
> On Fri, Jun 13, 2025 at 3:36 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Fri, Jun 13, 2025 at 3:11 PM YoungJun Park <youngjun.park@lge.com> wrote:
> > >
> > > On Thu, Jun 12, 2025 at 01:08:08PM -0700, Nhat Pham wrote:
> > > > On Thu, Jun 12, 2025 at 11:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Jun 12, 2025 at 4:14 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> > > > > > > >
> > > > > > > > From: "youngjun.park" <youngjun.park@lge.com>
> > > > > > > >
> > > > > > >
> > > > > > > Hi, Youngjun,
> > > > > > >
> > > > > > > Thanks for sharing this series.
> > > > > > >
> > > > > > > > This patch implements swap device selection and swap on/off propagation
> > > > > > > > when a cgroup-specific swap priority is set.
> > > > > > > >
> > > > > > > > There is one workaround to this implementation as follows.
> > > > > > > > Current per-cpu swap cluster enforces swap device selection based solely
> > > > > > > > on CPU locality, overriding the swap cgroup's configured priorities.
> > > > > > >
> > > > > > > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > > > > > > next cluster selector, the problem with current code is that swap
> > > > > >
> > > > > > What about per-cpu-per-order-per-swap-device :-? Number of swap
> > > > > > devices is gonna be smaller than number of cgroups, right?
> > > > >
> > > > > Hi Nhat,
> > > > >
> > > > > The problem is per cgroup makes more sense (I was suggested to use
> > > > > cgroup level locality at the very beginning of the implementation of
> > > > > the allocator in the mail list, but it was hard to do so at that
> > > > > time), for container environments, a cgroup is a container that runs
> > > > > one type of workload, so it has its own locality. Things like systemd
> > > > > also organize different desktop workloads into cgroups. The whole
> > > > > point is about cgroup.
> > > >
> > > > Yeah I know what cgroup represents. Which is why I mentioned in the
> > > > next paragraph that are still making decisions based per-cgroup - we
> > > > just organize the per-cpu cache based on swap devices. This way, two
> > > > cgroups with similar/same priority list can share the clusters, for
> > > > each swapfile, in each CPU. There will be a lot less duplication and
> > > > overhead. And two cgroups with different priority lists won't
> > > > interfere with each other, since they'll target different swapfiles.
> > > >
> > > > Unless we want to nudge the swapfiles/clusters to be self-partitioned
> > > > among the cgroups? :) IOW, each cluster contains pages mostly from a
> > > > single cgroup (with some stranglers mixed in). I suppose that will be
> > > > very useful for swap on rotational drives where read contiguity is
> > > > imperative, but not sure about other backends :-?
> > > > Anyway, no strong opinions to be completely honest :) Was just
> > > > throwing out some ideas. Per-cgroup-per-cpu-per-order sounds good to
> > > > me too, if it's easy to do.
> > >
> > > Good point!
> > > I agree with the mention that self-partitioned clusters and duplicated priority.
> > > One concern is the cost of synchronization.
> > > Specifically the one incurred when accessing the prioritized swap device
> > > From a simple performance perspective, a per-cgroup-per-CPU implementation
> > > seems favorable - in line with the current swap allocation fastpath.
> > >
> > > It seems most reasonable to carefully compare the pros and cons of the
> > > tow approaches.
> > >
> > > To summaraize,
> > >
> > > Option 1. per-cgroup-per-cpu
> > > Pros: upstream fit. performance.
> > > Cons: duplicate priority(some memory structure consumtion cost),
> > > self partioned cluster
> > >
> > > Option 2. per-cpu-per-order(per-device)
> > > Pros: Cons of Option1
> > > Cons: Pros of Option1
> > >
> > > It's not easy to draw a definitive conclusion right away,
> > > I should also evaluate other pros and cons that may arise during actual
> > > implementation.
> > > so I'd like to take some time to review things in more detail
> > > and share my thoughs and conclusions in the next patch series.
> > >
> > > What do you think, Nhat and Kairui?
> >
> > Ah, I think what might be best fits here is, each cgroup have a pcp
> > device list,  and each device have a pcp cluster list:
> >
> > folio -> mem_cgroup -> swap_priority (maybe a more generic name is
> > better?) -> swap_device_pcp (recording only the *si per order)
> > swap_device_info -> swap_cluster_pcp (cluster offset per order)
> 
> Sorry the truncate made this hard to read, let me try again:
> 
> folio ->
>   mem_cgroup ->
>     swap_priority (maybe a more generic name is better?) ->
>       swap_device_pcp (recording only the *si per order)
> 
> And:
> swap_device_info ->
>   swap_cluster_pcp (cluster offset per order)
> 
> And if mem_cgroup -> swap_priority is NULL,
> fallback to a global swap_device_pcp.

Thank you for quick and kind feedback. This is a really good idea :)
On my workaround proposal, I just need to add the swap_device_pcp part 
along with some refactoring.

And the naming swap_cgroup_priority...
I adopted the term "swap_cgorup_priority" based on the
perspective of the functionality I'm aiming to implement.
Here are some words that immediately come to mind.
(Like I said, just come to mind)
* swap_tier, swap_order, swap_selection, swap_cgroup_tier, swap_cgroup_order,
swap_cgroup_selection....

I'll try to come up with a more suitable conceptual name as I continue working
on the patch.
In the meantime, I'd appreciate any suggestions or feedback you may have.

Thanks again your feedback and suggestions.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
  2025-06-12 10:37 ` [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control youngjun.park
@ 2025-06-17 12:23   ` Michal Koutný
  2025-06-18  0:32     ` YoungJun Park
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Koutný @ 2025-06-17 12:23 UTC (permalink / raw)
  To: youngjun.park
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

[-- Attachment #1: Type: text/plain, Size: 1215 bytes --]

Hello.

On Thu, Jun 12, 2025 at 07:37:43PM +0900, youngjun.park@lge.com wrote:
> Example:
> cat memory.swap.priority
> Inactive
> /dev/sdb	unique:1	 prio:10
> /dev/sdc	unique:2	 prio:5
> 
> - Creation
> echo  "unique id of swapdev 1: priority, unique id of swapdev 2: priority ..."
> > memory.swap.priority
> 
> - Destruction
> Reset through the memory.swap.priority interface.
> Example: echo "" > memory.swap.priority
> 
> And also be destroyed when the mem_cgroup is removed.
> 
> 3. Priority Mechanism
> 
> - Follows the original concept of swap priority.
> (This includes automatic binding of swap devices to NUMA nodes.)

How is this supposed to work
cg1     /dev/sda	prio:10
        /dev/sdb	prio:5
` cg3     /dev/sda	  prio:5
   	  /dev/sdb	  prio:10
cg2     /dev/sda	prio:5
        /dev/sdb	prio:10
` cg4     /dev/sda	  prio:10
   	  /dev/sdb	  prio:5

when there are competitors from cg3 and cg4? Which device should be
preferred by each cgroup?

Interface note -- try to make it "Nested keyed" or "Flat keyed" as
described in Documentation/admin-guide/cgroup-v2.rst (like io.max or
io.weight), so that it is consistent with other cgroup v2 APIs.


HTH,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
  2025-06-17 12:23   ` Michal Koutný
@ 2025-06-18  0:32     ` YoungJun Park
  2025-06-18  9:11       ` Michal Koutný
  0 siblings, 1 reply; 25+ messages in thread
From: YoungJun Park @ 2025-06-18  0:32 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Tue, Jun 17, 2025 at 02:23:07PM +0200, Michal Koutný wrote:
> Hello.
> 
> On Thu, Jun 12, 2025 at 07:37:43PM +0900, youngjun.park@lge.com wrote:
> > Example:
> > cat memory.swap.priority
> > Inactive
> > /dev/sdb	unique:1	 prio:10
> > /dev/sdc	unique:2	 prio:5
> > 
> > - Creation
> > echo  "unique id of swapdev 1: priority, unique id of swapdev 2: priority ..."
> > > memory.swap.priority
> > 
> > - Destruction
> > Reset through the memory.swap.priority interface.
> > Example: echo "" > memory.swap.priority
> > 
> > And also be destroyed when the mem_cgroup is removed.
> > 
> > 3. Priority Mechanism
> > 
> > - Follows the original concept of swap priority.
> > (This includes automatic binding of swap devices to NUMA nodes.)
> 
> How is this supposed to work
> cg1     /dev/sda	prio:10
>         /dev/sdb	prio:5
> ` cg3     /dev/sda	  prio:5
>    	  /dev/sdb	  prio:10
> cg2     /dev/sda	prio:5
>         /dev/sdb	prio:10
> ` cg4     /dev/sda	  prio:10
>    	  /dev/sdb	  prio:5
> 
> when there are competitors from cg3 and cg4? Which device should be
> preferred by each cgroup?

Hello Michal.

What issue is the question assuming the existence of competitors in two
cgroups trying to address? Could you explain it a bit more specifically?

To answer your question for now,
Each cgroup just prefers devices according to their priority values.
until swap device is exhausted.

cg1 prefer /dev/sda than /dev/sdb.
cg2 prefer /dev/sdb than /dev/sda.
cg3 prefer /dev/sdb than /dev/sda.
cg4 prefer /dev/sda than /dev/sdb.

> Interface note -- try to make it "Nested keyed" or "Flat keyed" as
> described in Documentation/admin-guide/cgroup-v2.rst (like io.max or
> io.weight), so that it is consistent with other cgroup v2 APIs.

Yes, it looks like the API format should be adjusted as you suggested.
Thanks for the review.

Regards,
Youngjun Park


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
  2025-06-18  0:32     ` YoungJun Park
@ 2025-06-18  9:11       ` Michal Koutný
  2025-06-18 12:07         ` YoungJun Park
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Koutný @ 2025-06-18  9:11 UTC (permalink / raw)
  To: YoungJun Park
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

[-- Attachment #1: Type: text/plain, Size: 966 bytes --]

On Wed, Jun 18, 2025 at 09:32:13AM +0900, YoungJun Park <youngjun.park@lge.com> wrote:
> What issue is the question assuming the existence of competitors in two
> cgroups trying to address? Could you explain it a bit more specifically?

I'm after how this mechanism is supposed to honor hierarchical
structure. (I thought the numeric example was the most specific.)

> 
> To answer your question for now,
> Each cgroup just prefers devices according to their priority values.
> until swap device is exhausted.
> 
> cg1 prefer /dev/sda than /dev/sdb.
> cg2 prefer /dev/sdb than /dev/sda.
> cg3 prefer /dev/sdb than /dev/sda.
> cg4 prefer /dev/sda than /dev/sdb.

Hm, than means the settigs from cg1 (or cg2) don't apply to descendant
cg3 (or cg4) :-/

When referring to that document
(Documentation/admin-guide/cgroup-v2.rst) again, which of the "Resource
Distribution Models" do you find the most fitting for this scenario?

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
  2025-06-18  9:11       ` Michal Koutný
@ 2025-06-18 12:07         ` YoungJun Park
  2025-06-30 17:39           ` Michal Koutný
  0 siblings, 1 reply; 25+ messages in thread
From: YoungJun Park @ 2025-06-18 12:07 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Wed, Jun 18, 2025 at 11:11:32AM +0200, Michal Koutný wrote:
> On Wed, Jun 18, 2025 at 09:32:13AM +0900, YoungJun Park <youngjun.park@lge.com> wrote:
> > What issue is the question assuming the existence of competitors in two
> > cgroups trying to address? Could you explain it a bit more specifically?
> 
> I'm after how this mechanism is supposed to honor hierarchical
> structure. (I thought the numeric example was the most specific.)
> 
> > 
> > To answer your question for now,
> > Each cgroup just prefers devices according to their priority values.
> > until swap device is exhausted.
> > 
> > cg1 prefer /dev/sda than /dev/sdb.
> > cg2 prefer /dev/sdb than /dev/sda.
> > cg3 prefer /dev/sdb than /dev/sda.
> > cg4 prefer /dev/sda than /dev/sdb.
> 
> Hm, than means the settigs from cg1 (or cg2) don't apply to descendant
> cg3 (or cg4) :-/

I've been thinking about whether the use case I suggested aligns with the
philosophy of cgroups, and I believe there are two feasible directions
could take (This still needs some detailed refinement.)

Bascially on two strategies, child inherits parent setting.

1. Preserve the order of priorities and type of swap devices
when a child cgroup inherits values from 
its parent. the inherited order must be strictly maintained

e.g 

1.1 possible case.
1.1.1
cgroupA (swapA-swapB-swapC)
	' cgroupB (swapA-swapC)

1.1.2 
cgroupA (swapA-swapB-swapC)
	' cgroupB (swapA-swapC)

after time, modify it (swapD add on cgroupA)

cgroupA (swapA-swapB-swapC-swapD)
	' cgroupB (swapA-swapC)

1.2.impossible case.

1.2.1 violate the order of priorities rule.
cgroupA (swapA-swapB-swapC)
	' cgroupB  (swapC-swapA-swapB)

1.2.2 violate the type of swap devices rule.
cgroupA (swapA-swapB-swapC)
	' cgroupB  (swapD)

2. Restrict child cgroups to only use values inherited from the parent,
without allowing them to define their own setting.

e.g
cgroupA (swapA-swapB-swapC)
	' cgroupB (swapA-swapB-swapC)

after time, modify it (swapD add on cgroupA)

cgroupA (swapA-swapB-swapC-swapD)
	' cgroupB (swapA-swapB-swapC-swapD)

it is different from 1.1.2 case swapD propagated. 
(because child and parent must be same)

> When referring to that document
> (Documentation/admin-guide/cgroup-v2.rst) again, which of the "Resource
> Distribution Models" do you find the most fitting for this scenario?

I initially submitted the RFC from the perspective that each in-use
swap device must explicitly have a priority assigned, including propagation
at swapon time. (for avoiding swap-fail by using this mechanism)

However, condisering the resource distribution model you mentioned, 
I now see that not requiring all swap devices to have an explicitly defined 
priority aligns better with the broader cgroup "limit distribution" philosophy,
particularly in terms of limiting and distributing resources.

This is because cgroups can still restrict swap device usage and control 
device order without requiring explicit priorities for all devices.
In this view, the cgroup interface serves more as a limit or preference 
mechanism across the full set of available swap devices, rather than
requiring full enumeration and configuration.

Regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
  2025-06-18 12:07         ` YoungJun Park
@ 2025-06-30 17:39           ` Michal Koutný
  2025-07-01 13:08             ` YoungJun Park
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Koutný @ 2025-06-30 17:39 UTC (permalink / raw)
  To: YoungJun Park
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

[-- Attachment #1: Type: text/plain, Size: 1151 bytes --]

On Wed, Jun 18, 2025 at 09:07:51PM +0900, YoungJun Park <youngjun.park@lge.com> wrote:
> This is because cgroups can still restrict swap device usage and control 
> device order without requiring explicit priorities for all devices.
> In this view, the cgroup interface serves more as a limit or preference 
> mechanism across the full set of available swap devices, rather than
> requiring full enumeration and configuration.

I was wondering whether your use cases would be catered by having
memory.swap.max limit per device (essentially disable swap to undesired
device(s) for given group). The disadvantage is that memory.swap.max is
already existing as scalar. Alternatively, remapping priorities to
memory.swap.weight -- with sibling vs sibling competition and children
treated with weight of parent when approached from the top. I find this
weight semantics little weird as it'd clash with other .weight which are
dual to this (cgroups compete over one device vs cgroup is choosing
between multiple devices).

Please try to take the existing distribution models into account not to
make something overly unidiomatic,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
  2025-06-30 17:39           ` Michal Koutný
@ 2025-07-01 13:08             ` YoungJun Park
  2025-07-07  9:59               ` Michal Koutný
  0 siblings, 1 reply; 25+ messages in thread
From: YoungJun Park @ 2025-07-01 13:08 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Mon, Jun 30, 2025 at 07:39:47PM +0200, Michal Koutný wrote:
> On Wed, Jun 18, 2025 at 09:07:51PM +0900, YoungJun Park <youngjun.park@lge.com> wrote:
> > This is because cgroups can still restrict swap device usage and control 
> > device order without requiring explicit priorities for all devices.
> > In this view, the cgroup interface serves more as a limit or preference 
> > mechanism across the full set of available swap devices, rather than
> > requiring full enumeration and configuration.

Hello Michal,

Thank you very much for your thoughtful review and for sharing your
insights.

I’d like to share my thoughts and the reasoning behind my current
direction, including some points I considered in relation to your
suggestions.

> I was wondering whether your use cases would be catered by having
> memory.swap.max limit per device (essentially disable swap to undesired
> device(s) for given group). The disadvantage is that memory.swap.max is
> already existing as scalar. Alternatively, remapping priorities to

I did consider implementing this kind of control.
In that design, it would work similarly to memory.swap.max but per
device: the implementation would iterate through the swap devices in
priority order and maintain per-cgroup counters for each device’s usage.
It would also need to handle proper counter cleanup after use, and
ensure that usage checks also happen on the fastpath where per-CPU
caches for swap device clusters come into play.

From a runtime behavior perspective, the priority-based approach seemed
preferable, as it allows more flexible control: the configured cgroup
can strongly prefer the desired device and benefit from faster selection
at allocation time.

I also considered how this would coexist with the existing swap.max
interface, but given the additional implementation and runtime overhead
this would introduce, I decided to hold it back and chose a priority-based
approach instead.

> already existing as scalar. Alternatively, remapping priorities to
> memory.swap.weight -- with sibling vs sibling competition and children
> treated with weight of parent when approached from the top. I find this
> weight semantics little weird as it'd clash with other .weight which are
> dual to this (cgroups compete over one device vs cgroup is choosing
> between multiple devices).

Your point about the semantic mismatch is very valid. I agree that
reusing .weight semantics here could be confusing: .weight usually
expresses competition among siblings for a shared resource, whereas
here, the goal is to steer selection among multiple devices within a
single cgroup’s scope.

The swap priority concept already exists as an
independent mechanism, so mapping it into a .weight field might not
align well in practice.

> Please try to take the existing distribution models into account not to
> make something overly unidiomatic,

I also thought about possible alignment with existing mechanisms like
zswap.writeback. One alternative could be to adopt an on/off style mechanism
similar to zswap.writeback including propagation strategy. 
On implementation-wise, this could be handled by including
or excluding devices from the cgroup’s swap device priority list.
(The direction I suggested on)

However, this approach also has limitations in certain use cases. For
example, if we want to enforce a different ordering than the global
system swap priority, an on/off switch alone is not sufficient.

One possible example would be:
(Some cgroup use the slowest available swap device but with a larger capacity
avoiding swap failure.)

Global swap: A (fast) -> B (slower) -> C (slowest)
Cgroup swap: C (slowest) -> B (slower) -> A (fast)

This kind of configuration cannot be achieved only with an on/off
switch.

I think that priority approach might not map perfectly to the existing
major distribution models (like limit, weight, etc.),
I cautiously see this as an extension of the resource control interfaces, 
building on the solid foundation that the cgroup mechanism already provides.

I am working to ensure that the proposed interface and propagation
behavior integrate properly with parent cgroups and follow the same
interface style. Here is the current version I am working on now.
(It turned out a bit long, but I felt it might be useful to share it with you.)

  memory.swap.priority
        A read-write flat-keyed file which exists on non-root cgroups.

        Example: (after swapon)
          $ swapon
          NAME     TYPE      SIZE  USED PRIO
          /dev/sdb partition 300M   0B   10
          /dev/sdc partition 300M   0B    5
          /dev/sdd partition 300M   0B   -2

        To assign priorities to swap devices in the current cgroup,
        write one or more lines in the following format:

          <swap_device_unique_id> <priority>

        Example: (writing priorities)
          $ echo "1 4" > memory.swap.priority
          $ echo "2 -2" > memory.swap.priority
          $ echo "3 -1" > memory.swap.priority

        Example: (reading after write)
          $ cat memory.swap.priority
          1 4
          2 -2
          3 -1

        The priority semantics are consistent with the global swap
        system:

          - Higher values indicate higher preference.
          - See Documentation/admin-guide/mm/swap_numa.rst for swap numa
            autobinding.

        Note:
          A special value of -1 means the swap device is completely
          excluded from use by this cgroup. Unlike the global swap
          priority, where negative values simply lower the priority,
          setting -1 here disables allocation from that device for the
          current cgroup only.

        If any ancestor cgroup has set a swap priority configuration, it
        is inherited by all descendants. In that case, the child’s own
        configuration is ignored and the topmost configured ancestor
        determines the effective priority ordering.

  memory.swap.priority.effective
        A read-only file showing the effective swap priority ordering
        actually applied to this cgroup, after resolving inheritance
        from ancestors.

        If there is no configuration in the current cgroup and its
        ancestors, this file shows the global swap device priority from
        `swapon`, in the form of unique_id priority pairs.

        Example: (global only)
          $ swapon
          NAME     TYPE      SIZE  USED PRIO
          /dev/sdb partition 300M   0B   10
          /dev/sdc partition 300M   0B    5
          /dev/sdd partition 300M   0B   -2

          $ cat /sys/fs/cgroup/parent/child/memory.swap.priority.effective
          1 10
          2 5
          3 -2

        Example: (with parent override)
          # Parent cgroup configuration
          $ cat /sys/fs/cgroup/parent/memory.swap.priority
          1 4
          2 -2

          # Child cgroup configuration (ignored because parent overrides)
          $ cat /sys/fs/cgroup/parent/child/memory.swap.priority
          1 8
          2 5

          # Effective priority seen by the child
          $ cat /sys/fs/cgroup/parent/child/memory.swap.priority.effective
          1 4
          2 -2

        In this case:
          - If no cgroup sets any configuration, the output matches the
            global `swapon` priority.
          - If an ancestor has a configuration, the child inherits it
            and ignores its own setting.

I hope my explanation clarifies my intention, 
and I would truly appreciate your positive consideration 
and any further thoughts you might have.

Best regards, 
Youngjun Park

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
  2025-07-01 13:08             ` YoungJun Park
@ 2025-07-07  9:59               ` Michal Koutný
  2025-07-07 14:45                 ` YoungJun Park
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Koutný @ 2025-07-07  9:59 UTC (permalink / raw)
  To: YoungJun Park
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

[-- Attachment #1: Type: text/plain, Size: 1731 bytes --]

Hello.

On Tue, Jul 01, 2025 at 10:08:46PM +0900, YoungJun Park <youngjun.park@lge.com> wrote:
>   memory.swap.priority
...

>         To assign priorities to swap devices in the current cgroup,
>         write one or more lines in the following format:
> 
>           <swap_device_unique_id> <priority>

How would the user know this unique_id? (I don't see it in /proc/swaps.)

>         Note:
>           A special value of -1 means the swap device is completely
>           excluded from use by this cgroup. Unlike the global swap
>           priority, where negative values simply lower the priority,
>           setting -1 here disables allocation from that device for the
>           current cgroup only.

The divergence from the global semantics is little bit confusing.
There should better be a special value (like 'disabled') in the interface.
And possible second special value like 'none' that denotes the default
(for new (unconfigured) cgroups or when a new swap device is activated).

>   memory.swap.priority.effective
>         A read-only file showing the effective swap priority ordering
>         actually applied to this cgroup, after resolving inheritance
>         from ancestors.

Yes, this'd be definitely useful for troubleshooting and understanding
the configurations.

...
>         In this case:
>           - If no cgroup sets any configuration, the output matches the
>             global `swapon` priority.
>           - If an ancestor has a configuration, the child inherits it
>             and ignores its own setting.

The child's priority could be capped by ancestors' instead of wholy
overwritten? (So that remains some effect both.)

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
  2025-07-07  9:59               ` Michal Koutný
@ 2025-07-07 14:45                 ` YoungJun Park
  2025-07-07 14:57                   ` YoungJun Park
  0 siblings, 1 reply; 25+ messages in thread
From: YoungJun Park @ 2025-07-07 14:45 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Mon, Jul 07, 2025 at 11:59:49AM +0200, Michal Koutný wrote:
> Hello.
> 
> On Tue, Jul 01, 2025 at 10:08:46PM +0900, YoungJun Park <youngjun.park@lge.com> wrote:
> >   memory.swap.priority
> ...
> 
> >         To assign priorities to swap devices in the current cgroup,
> >         write one or more lines in the following format:
> > 
> >           <swap_device_unique_id> <priority>
> 
> How would the user know this unique_id? (I don't see it in /proc/swaps.)

The unique_id is a new concept I introduced to refer to assigned
swap devices. It's allocated whenever a swap device is turned on. I did
explore other key identifiers like the swap device path, but I
determined that providing a separate unique_id is more suitable for
this context. Initially, I proposed printing it directly from
memory.swap.priority to facilitate usage like:

$ swapon
NAME       TYPE      SIZE USED PRIO
/dev/sdb   partition 300M  0B   10
/dev/sdc   partition 300M  0B    5

$ cat memory.swap.priority
Active
/dev/sdb   unique:1  prio:10
/dev/sdc   unique:2  prio:5
Following your suggestion, I've deprecated this initial proposal and
considered four alternatives. I'm currently leaning towards
options 2 and 4, and I plan to propose option 4 as the primary
approach:

1. /proc/swaps with ID: We've rejected this due to potential ABI
changes.

2. New /proc interface: This could be /proc/swaps with the ID,
or a dedicated swapdevice file with the ID. While viable, I prefer
not to add new /proc interfaces if we can avoid it.

3. /sys/kernel/mm/swap/ location: (Similar to vma_ra_enabled)
This was rejected because sysfs typically shows configured values,
not dynamic identifiers, which would be inconsistent with existing
conventions.

4. Align memory.swap.priority.effective with /proc/swaps:
Aligning the order of id prio pairs in
memory.swap.priority.effective with the output order of
/proc/swaps would allow users to infer which swap device
corresponds to which ID. For example:

$ swapon
NAME       TYPE      SIZE USED PRIO
/dev/sdb   partition 300M  0B   10
/dev/sdc   partition 300M  0B    5

$ cat memory.swap.priority.effective
Active
1 10     // this is /dev/sdb
2 5      // this is /dev/sdc

> >         Note:
> >           A special value of -1 means the swap device is completely
> >           excluded from use by this cgroup. Unlike the global swap
> >           priority, where negative values simply lower the priority,
> >           setting -1 here disables allocation from that device for the
> >           current cgroup only.
> 
> The divergence from the global semantics is little bit confusing.
> There should better be a special value (like 'disabled') in the interface.
> And possible second special value like 'none' that denotes the default
> (for new (unconfigured) cgroups or when a new swap device is activated).
> 

Thank you for your insightful comments and suggestions regarding the
default values. I was initially focused on providing numerical values
for these settings. However, using keywords like "none" and
"disabled" for default values makes the semantics much more natural
and user-friendly.

Based on your feedback and the cgroup-v2.html documentation on default
values, I propose the following semantics:

none: This applies priority based on the global swap
priority. It's important to note that for negative priorities,
this implies following NUMA auto-binding rules, rather than a direct
application of the negative value itself.

disabled: This keyword explicitly excludes the swap device
from use by this cgroup.

Here's how these semantics would translate into usage:

echo "default none" > memory.swap.priority or
echo "none" > memory.swap.priority:
* When swapon is active, the cgroup's swap device priority will
follow the global swap priority.

echo "default disabled" > memory.swap.priority or
echo "default" > memory.swap.priority:
* When swapon is active, the swap device will be excluded from
allocation within this cgroup.

echo "<id> none" > memory.swap.priority:
* The specified swap device will follow its global swap priority.

echo "<id> disabled" > memory.swap.priority:
* The specified swap device will be excluded from allocation for
this cgroup.

echo "<id> <prio>" > memory.swap.priority:
* This sets a specific priority for the specified swap device.

> ...
> >         In this case:
> >           - If no cgroup sets any configuration, the output matches the
> >             global `swapon` priority.
> >           - If an ancestor has a configuration, the child inherits it
> >             and ignores its own setting.
> 
> The child's priority could be capped by ancestors' instead of wholy
> overwritten? (So that remains some effect both.)

Regarding the child's priority being capped or refined by ancestors'
settings, I've considered allowing the child's priority to resolve its
own settings when the sorted priority order is consistent and the
child's swap devices are a subset of the parent's. Here's a visual
representation of how that might work:

+-----------------+
| Parent cgroup   |
| (Swaps: A, B, C)|
+--------+--------+
         |
         | (Child applies settings to its own children)
         v
+--------+--------+
| Child cgroup    |
| (Swaps: B, C)   |
| (B & C resolved by child's settings)
+--------+--------+
         |
         +-------------------+
         |                   |
         v                   v
+--------+--------+   +--------+--------+
| Grandchild cgroup |   | Grandchild 2 cgroup |
| (Swaps: C)        |   | (Swaps: A)        |
| (C resolved by    |   | (A not in B,C;    |
|  grandchild's     |   |  resolved by      |
|  child's settings)|   |  child's settings)|
+-------------------+   +-------------------+

However, this feature isn't currently required for our immediate use
case, and it adds notable complexity to the implementation. I suggest
we consider this as a next step if the current feature is integrated
into the kernel and sees widespread adoption or
any further use cases or requirements.

Best regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control
  2025-07-07 14:45                 ` YoungJun Park
@ 2025-07-07 14:57                   ` YoungJun Park
  0 siblings, 0 replies; 25+ messages in thread
From: YoungJun Park @ 2025-07-07 14:57 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-mm, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	cgroups, linux-kernel, shikemeng, kasong, nphamcs, bhe, baohua,
	chrisl, muchun.song, iamjoonsoo.kim, taejoon.song, gunho.lee

On Mon, Jul 07, 2025 at 11:45:25PM +0900, YoungJun Park wrote:
 
> $ cat memory.swap.priority.effective
> Active
> 1 10     // this is /dev/sdb
> 2 5      // this is /dev/sdc
Please disregard the "Active" line. 
I apologize; I mistakenly included incorrect output.

$cat memory.swap.priority.effective
1 10     // this is /dev/sdb
2 5      // this is /dev/sdc
 
Best regards,
Youngjun Park


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-07-07 14:57 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-12 10:37 [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization youngjun.park
2025-06-12 10:37 ` [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control youngjun.park
2025-06-17 12:23   ` Michal Koutný
2025-06-18  0:32     ` YoungJun Park
2025-06-18  9:11       ` Michal Koutný
2025-06-18 12:07         ` YoungJun Park
2025-06-30 17:39           ` Michal Koutný
2025-07-01 13:08             ` YoungJun Park
2025-07-07  9:59               ` Michal Koutný
2025-07-07 14:45                 ` YoungJun Park
2025-07-07 14:57                   ` YoungJun Park
2025-06-12 10:37 ` [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer youngjun.park
2025-06-12 11:14   ` Kairui Song
2025-06-12 11:16     ` Kairui Song
2025-06-12 17:28     ` Nhat Pham
2025-06-12 18:20       ` Kairui Song
2025-06-12 20:08         ` Nhat Pham
2025-06-13  7:11           ` YoungJun Park
2025-06-13  7:36             ` Kairui Song
2025-06-13  7:38               ` Kairui Song
2025-06-13 10:45                 ` YoungJun Park
2025-06-13  6:49     ` YoungJun Park
2025-06-12 12:24 ` [RFC PATCH 0/2] mm/swap, memcg: Support per-cgroup swap device prioritization Kairui Song
2025-06-12 21:32   ` Nhat Pham
2025-06-13  6:56   ` YoungJun Park

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).