cgroups.vger.kernel.org archive mirror
* [PATCH 0/4] mm/swap, memcg: Support per-cgroup swap device priorities
@ 2025-07-16 20:20 Youngjun Park
  2025-07-16 20:20 ` [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority Youngjun Park
                   ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Youngjun Park @ 2025-07-16 20:20 UTC (permalink / raw)
  To: akpm, hannes
  Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	kasong, nphamcs, bhe, baohua, chrisl, cgroups, linux-mm,
	linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song,
	Youngjun Park

This patchset introduces a mechanism to assign swap device priorities
per cgroup.

It is an evolution of a previously submitted RFC [1], with revised
semantics, interfaces, and implementation based on community feedback.

======================================================================
I. MOTIVATION
======================================================================

The core requirement was to improve application responsiveness and loading
times, especially for latency-critical applications, without adding RAM or
storage hardware.

Device constraints:
  - Linux-based embedded platform
  - Limited system RAM
  - Small local swap
  - No option to expand RAM or local swap

To work within these constraints, we explored using idle RAM and storage on
nearby devices as remote swap space. To make this effective, we needed
per-cgroup control over swap device selection:

  - Assign faster local swap devices to latency-critical apps
  - Assign remote swap devices to background apps

However, the current kernel swap infrastructure does not support per-cgroup
swap device assignment.

======================================================================
II. EVALUATED ALTERNATIVES
======================================================================

**II-1. Per-cgroup Dedicated Swap Devices**

- Proposed upstream [2]
- Difficult to maintain consistent global vs per-cgroup swap state
- Hard to reconcile with memory.max and swap.max semantics

**II-2. Multi-backend Swap Device with Cgroup-aware Routing**

- Breaks layering abstraction (block device cgroup awareness)
- Swap devices treated as physical storage
- Related ideas discussed in [3]

**II-3. Per-cgroup Swap Enable/Disable with Usage Control**

- Could expand swap.max via zswap writeback [4]
- Cannot express flexible device orderings
- Less expressive than per-device priorities

**Conclusion:** Per-cgroup swap priority configuration is the most natural and
least invasive extension to existing kernel mechanisms.

======================================================================
III. DESIGN OVERVIEW
======================================================================

**III-1. Per-Cgroup Swap Priority**

Semantics:
- Configure swap priorities per device via the `memory.swap.priority` interface.
- If a value is specified, it overrides the global priority of that device
  for this cgroup.
- Priority semantics follow the global swap behavior:
  - Higher numeric values are preferred
  - Devices with equal priority are used round-robin
  - Negative priorities follow NUMA-aware fallback [5]
- If no value is given, the global swap priority is used.
- The `default` setting controls how devices are propagated to this cgroup
  on swapon/swapoff events.
- At `swapon`, it determines whether and how a newly added device is
  included for the cgroup.

Each cgroup exposes a readable and writable file:

  memory.swap.priority

This file accepts one `<id> <priority>` pair per line, where `<id>` is the
numeric ID of a swap device as shown in `/proc/swaps`:

  Filename       Type        Size   Used  Priority  Id
  /dev/sda2      partition   ...    ...   20        1
  /dev/sdb2      partition   ...    ...   -2        2
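
When scripting against this interface, the id column can be extracted from
`/proc/swaps`. A minimal sketch, assuming the extended format shown above
(the Id column is only present with this series applied):

  awk '$1 == "/dev/sda2" { print $NF }' /proc/swaps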

The following defaults can be set:

- `default none`:
  Use global priority (implicit default)

- `default disabled`:
  Exclude swap devices from use in this cgroup

These defaults determine how new devices are handled at `swapon` time.
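
For example, a cgroup can be told to opt out of swap entirely unless a
device is explicitly configured for it. The following shell sketch is
illustrative:

  # Exclude all current and future swap devices for this cgroup
  echo "default disabled" > memory.swap.priority

  # A device activated later via swapon is also excluded, until an
  # explicit "<id> <priority>" entry is written for it

  # Revert to following the global priorities for unconfigured devices
  echo "default none" > memory.swap.priority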

Special keywords can also be specified per device:
- `<id> none`: use global priority (clears override)
- `<id> disabled`: exclude the device from this cgroup's swap allocation

Reading this file shows the current configuration. Devices not explicitly set
may still appear if their effective priority differs from the global value due
to NUMA fallback or internal normalization.

**Example**

  echo "1 -2" > memory.swap.priority

May result in:

  1 -2
  2 -3

To revert both devices to global priority:

  echo "1 none" > memory.swap.priority
  echo "2 none" > memory.swap.priority

To disable device 1 while allowing device 2:

  echo "1 disabled" > memory.swap.priority

**III-2. Inheritance**

Inheritance semantics:

- Each cgroup inherits from its **topmost** configured ancestor (closest to
  the root)
- Intermediate ancestors are ignored
- If no ancestor is configured, the local setting is used
- If the inherited ancestor configuration is removed or absent, the cgroup
  falls back to its local setting. If none exists, the global priority is used.

The effective configuration after inheritance is visible via:

  memory.swap.priority.effective

If `default disabled` is active, it is shown explicitly.  
If `default none` is used, it is applied silently and not shown.
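
As an illustrative sketch (the cgroup names and the priority value are made
up for this example), the inheritance rules can be observed through the
effective file:

  # Configure an override on a parent cgroup
  echo "1 100" > /sys/fs/cgroup/parent/memory.swap.priority

  # A descendant reports the topmost configured ancestor's settings here;
  # its own memory.swap.priority, if any, is ignored while an ancestor
  # remains configured
  cat /sys/fs/cgroup/parent/child/memory.swap.priority.effective

  # With no configured ancestor, the same file shows the global priorities
  # established at swapon time, as <id> <priority> pairs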

======================================================================
IV. TESTING
======================================================================

This patchset was tested on x86_64 under QEMU using `stress-ng` to generate
swap I/O while toggling swap devices and updating `memory.swap.priority`.

The kernel was instrumented with KASAN, lockdep, and other
`CONFIG_DEBUG_*` options to increase debugging coverage and help identify
potential issues under stress.
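
A rough sketch of the kind of test loop used is shown below; device paths,
cgroup names, and the stress-ng parameters are illustrative rather than the
exact values used during testing:

  mkdir /sys/fs/cgroup/test
  echo "1 10" > /sys/fs/cgroup/test/memory.swap.priority

  # Generate swap pressure from within the cgroup
  sh -c 'echo $$ > /sys/fs/cgroup/test/cgroup.procs;
         exec stress-ng --vm 2 --vm-bytes 90% --timeout 120s' &

  # Meanwhile, toggle a swap device and rewrite the priorities
  swapoff /dev/sdb2
  swapon -p 5 /dev/sdb2
  echo "default disabled" > /sys/fs/cgroup/test/memory.swap.priority
  echo "1 none" > /sys/fs/cgroup/test/memory.swap.priority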

======================================================================
V. CHANGE HISTORY
======================================================================

== RFC → v1 ==

[1] Changed the interface from a single-line `1:10,2:-1` list to a
    line-based flat-keyed format, following cgroup v2 interface conventions
    where each swap device is configured independently.
    - Suggested by: Michal Koutný

[2] Added `memory.swap.priority.effective` to expose the final applied
    priority, reflecting cgroup inheritance rules.

[3] Clarified default semantics: `default none`, `default disabled`
    - Suggested by: Michal Koutný

[4] Implemented per-cgroup percpu swap device cache and used per-device
    shared clusters to avoid scalability issues
    - Suggested by: Kairui Song

[5] Exposed swap device id via /proc/swaps for introspection

[6] Introduced swap_cgroup_priority.h to define the main interface and declare
    symbols shared with swapfile.c.

[7] Aligned the number of swap_cgroup_priority_pnode instances with nr_swapfiles
    to ensure consistency during swap device changes.

[8] Removed the explicit delete interface; deletion is now handled implicitly
    by dynamic tracking.

======================================================================
VI. REFERENCES
======================================================================

[1] RFC: Per-cgroup swap device prioritization  
    https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d  
[2] Cgroup-specific swap devices (2014)  
    https://lkml.iu.edu/hypermail/linux/kernel/1404.0/02530.html  
[3] Swap redirection and zswap writeback discussions  
    https://lore.kernel.org/linux-mm/CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com  
[4] Per-cgroup zswap writeback  
    https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com  
[5] Swap NUMA fallback  
    https://docs.kernel.org/vm/swap_numa.html
---

This feature is marked **EXPERIMENTAL** in Kconfig, as it has not yet undergone
extensive real-world testing. The implementation is functional and reflects
feedback from prior RFC discussions, but further testing and review are welcome.
I’m happy to iterate based on community feedback.

Thanks,
Youngjun Park

Youngjun Park (4):
  mm/swap, memcg: Introduce infrastructure for cgroup-based swap
    priority
  mm: swap: Apply per-cgroup swap priority mechanism to swap layer
  mm: memcg: Add swap cgroup priority inheritance mechanism
  mm: swap: Per-cgroup per-CPU swap device cache with shared clusters

 Documentation/admin-guide/cgroup-v2.rst |   76 ++
 MAINTAINERS                             |    2 +
 include/linux/memcontrol.h              |    3 +
 include/linux/swap.h                    |   10 +
 mm/Kconfig                              |   14 +
 mm/Makefile                             |    1 +
 mm/memcontrol.c                         |  105 ++-
 mm/swap_cgroup_priority.c               | 1036 +++++++++++++++++++++++
 mm/swap_cgroup_priority.h               |  128 +++
 mm/swapfile.c                           |  108 ++-
 10 files changed, 1456 insertions(+), 27 deletions(-)
 create mode 100644 mm/swap_cgroup_priority.c
 create mode 100644 mm/swap_cgroup_priority.h

base-commit: 347e9f5043c89695b01e66b3ed111755afcf1911
-- 
2.34.1



* [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-16 20:20 [PATCH 0/4] mm/swap, memcg: Support per-cgroup swap device priorities Youngjun Park
@ 2025-07-16 20:20 ` Youngjun Park
  2025-07-17 11:20   ` kernel test robot
                     ` (3 more replies)
  2025-07-16 20:20 ` [PATCH 2/4] mm: swap: Apply per-cgroup swap priority mechanism to swap layer Youngjun Park
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 39+ messages in thread
From: Youngjun Park @ 2025-07-16 20:20 UTC (permalink / raw)
  To: akpm, hannes
  Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	kasong, nphamcs, bhe, baohua, chrisl, cgroups, linux-mm,
	linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song,
	Youngjun Park, Michal Koutný

In resource-constrained environments with limited RAM and storage, it is
often desirable to utilize remote or heterogeneous storage devices as swap
targets. To maximize responsiveness under memory pressure, particularly for
latency-critical applications, it is important to control which cgroups use
which swap devices.

This patch introduces a mechanism for assigning swap device priorities on a
per-cgroup basis. By allowing cgroups to customize the relative priority of
available swap devices, faster local swap can be reserved for critical
workloads, while background tasks can be directed to slower or remote swap.

This commit provides the base infrastructure for priority tracking:

- Introduces `memory.swap.priority`, a new cgroup2 interface that allows
  setting per-device priorities using `<id> <priority>` pairs. The swap
  device ID corresponds to the identifier in `/proc/swaps`.

- Internally, priorities are tracked with `struct swap_cgroup_priority`,
  which holds dynamically allocated pnode structures (`struct
  swap_cgroup_priority_pnode`) per device.

- Objects are created on-demand when the cgroup interface is written to,
  and automatically freed when:
    - The configured priorities match the global system defaults
    - The memory cgroup is removed

- Swapon and swapoff propagation is supported:
    - When a new swap device is activated, default values (e.g.,
      `default none`, `default disabled`) determine how the cgroup treats
      that device
    - When a swap device is removed via `swapoff`, it is cleared from all
      affected cgroups

- Priority semantics follow the global swap rules:
    - Higher values are preferred
    - Equal values round-robin
    - Negative values follow NUMA-aware fallback

The default value mechanism (`default none`, `default disabled`) was proposed
by Michal Koutný and integrated into the design to better support swapon
propagation and reduce configuration overhead.

The general design, including how to track priorities and manage per-cgroup
objects, was refined through internal discussions with Joonsoo Kim.

Enforcement logic within the swap allocator is introduced in the next patch.

Suggested-by: Michal Koutný <mkoutny@suse.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  62 ++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   3 +
 include/linux/swap.h                    |   3 +
 mm/Kconfig                              |  14 +
 mm/Makefile                             |   1 +
 mm/memcontrol.c                         |  91 ++-
 mm/swap_cgroup_priority.c               | 739 ++++++++++++++++++++++++
 mm/swap_cgroup_priority.h               |  86 +++
 mm/swapfile.c                           |  17 +-
 10 files changed, 1009 insertions(+), 9 deletions(-)
 create mode 100644 mm/swap_cgroup_priority.c
 create mode 100644 mm/swap_cgroup_priority.h

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index bd98ea3175ec..35fb9677f0d6 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1839,6 +1839,68 @@ The following nested keys are defined.
 	higher than the limit for an extended period of time.  This
 	reduces the impact on the workload and memory management.
 
+  memory.swap.priority
+    A read-write flat-keyed file which exists on non-root cgroups.
+    This interface allows you to set per-swap-device priorities for the current
+    cgroup and to define how they differ from the global swap system.
+
+    To assign priorities or define specific behaviors for swap devices
+    in the current cgroup, write one or more lines in the following
+    formats:
+
+     - <swap_device_id> <priority>
+     - <swap_device_id> disabled
+     - <swap_device_id> none
+     - default none
+     - default disabled
+
+    Each <swap_device_id> refers to a unique swap device registered
+    in the system. You can check the ID, device path, and current
+    priority of active swap devices through the `/proc/swaps` file.
+    This provides a clear mapping between swap devices and the IDs
+    used in this interface.
+
+    The 'default' keyword sets the fallback priority behavior rule for
+    this cgroup. If no specific entry matches a swap device, this default
+    applies.
+
+    * 'default none': This is the default if no configuration
+      is explicitly written. Swap devices follow the system-wide
+      swap priorities.
+
+    * 'default disabled': All swap devices are excluded from this cgroup’s
+      swap priority list and will not be used by this cgroup.
+
+    The priority semantics are consistent with the global swap system:
+
+      - Higher numerical values indicate higher preference.
+      - See Documentation/admin-guide/mm/swap_numa.rst for details on
+        swap NUMA autobinding and negative priority rules.
+
+    The handling of negative priorities in this cgroup interface
+    has specific behaviors for assignment and restoration:
+
+    * Negative Priority Assignment
+      This interface allows you to explicitly override priorities with negative
+      values. When you do so, the total number of negative slots and their order
+      may shift depending on how the new value compares to existing ones:
+
+      - If you override an existing priority (whether originally positive or negative)
+        with a smaller (more negative) number, it may push other negative priorities
+        upward (toward zero).
+
+      - If you override an existing negative priority with a larger
+        (less negative) number, it may push other negative priorities
+        downward (more negative).
+
+    * Negative Priority Restoration with 'none'
+      When restoring a device’s priority to its global value using 'none',
+      if the original priority was negative, it might not revert to the exact
+      same global negative value if the total number of negative priorities
+      in the cgroup has decreased. In such cases, you may need to adjust
+      other negative priorities to restore the same ordering as the global
+      swap configuration.
+
   memory.zswap.current
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/MAINTAINERS b/MAINTAINERS
index 60bba48f5479..d51ddc2272a7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6169,6 +6169,8 @@ F:	mm/memcontrol.c
 F:	mm/memcontrol-v1.c
 F:	mm/memcontrol-v1.h
 F:	mm/swap_cgroup.c
+F:	mm/swap_cgroup_priority.c
+F:	mm/swap_cgroup_priority.h
 F:	samples/cgroup/*
 F:	tools/testing/selftests/cgroup/memcg_protection.m
 F:	tools/testing/selftests/cgroup/test_hugetlb_memcg.c
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..625e59f9ecd2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -218,6 +218,9 @@ struct mem_cgroup {
 	bool zswap_writeback;
 #endif
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	struct swap_cgroup_priority *swap_priority;
+#endif
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0e1c275fc0..bfddbec2ee28 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -339,6 +339,9 @@ struct swap_info_struct {
 	struct work_struct discard_work; /* discard worker */
 	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	u64 id;
+#endif
 	struct plist_node avail_lists[]; /*
 					   * entries in swap_avail_heads, one
 					   * entry per node.
diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..43751e8d0bc4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -190,6 +190,20 @@ config ZSMALLOC_CHAIN_SIZE
 
 	  For more information, see zsmalloc documentation.
 
+config SWAP_CGROUP_PRIORITY
+	bool "Per cgroup swap priority (EXPERIMENTAL)"
+	depends on SWAP && CGROUPS
+	default n
+	help
+          Enable per-cgroup swap device priority control.
+
+          This option allows configuring swap device priorities on a
+          per-cgroup basis, and makes it possible to exclude specific swap
+          devices from use by a cgroup.
+
+          If no configuration is set for a cgroup, it falls back to the
+          system-wide swap device priorities defined at swapon time.
+
 menu "Slab allocator options"
 
 config SLUB
diff --git a/mm/Makefile b/mm/Makefile
index 1a7a11d4933d..dde27ee58a8d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -76,6 +76,7 @@ ifdef CONFIG_MMU
 endif
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_SWAP_CGROUP_PRIORITY) += swap_cgroup_priority.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 70fdeda1120b..ea207d498ad6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -69,6 +69,8 @@
 #include <net/ip.h>
 #include "slab.h"
 #include "memcontrol-v1.h"
+#include "swap.h"
+#include "swap_cgroup_priority.h"
 
 #include <linux/uaccess.h>
 
@@ -3700,6 +3702,9 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	lru_gen_exit_memcg(memcg);
 	memcg_wb_domain_exit(memcg);
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	delete_swap_cgroup_priority(memcg);
+#endif
 	__mem_cgroup_free(memcg);
 }
 
@@ -3793,6 +3798,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg1_soft_limit_reset(memcg);
+
 #ifdef CONFIG_ZSWAP
 	memcg->zswap_max = PAGE_COUNTER_MAX;
 	WRITE_ONCE(memcg->zswap_writeback, true);
@@ -3800,7 +3806,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	if (parent) {
 		WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
-
 		page_counter_init(&memcg->memory, &parent->memory, memcg_on_dfl);
 		page_counter_init(&memcg->swap, &parent->swap, false);
 #ifdef CONFIG_MEMCG_V1
@@ -5401,6 +5406,82 @@ static int swap_events_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+static ssize_t swap_cgroup_priority_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	u64 id;
+	int prio;
+	int ret;
+	char first_token[32];
+	char second_token[32];
+	char dummy[2];
+	char *stripped_buf;
+	int num_parsed;
+
+	stripped_buf = strstrip(buf);
+	num_parsed = sscanf(stripped_buf, "%31s %31s %1s", first_token,
+			    second_token, dummy);
+	if (num_parsed == 2) {
+		if (strcmp(first_token, "default") == 0) {
+			if (strcmp(second_token, "none") == 0)
+				ret = apply_swap_cgroup_priority(
+					memcg, DEFAULT_ID, SWAP_PRIORITY_GLOBAL);
+			else if (strcmp(second_token, "disabled") == 0)
+				ret = apply_swap_cgroup_priority(
+					memcg, DEFAULT_ID, SWAP_PRIORITY_DISABLE);
+			else
+				ret = -EINVAL;
+		} else {
+			ret = kstrtoull(first_token, 10, &id);
+			if (ret)
+				return -EINVAL;
+
+			if (strcmp(second_token, "none") == 0) {
+				ret = apply_swap_cgroup_priority(
+					memcg, id, SWAP_PRIORITY_GLOBAL);
+			} else if (strcmp(second_token, "disabled") == 0) {
+				ret = apply_swap_cgroup_priority(
+					memcg, id, SWAP_PRIORITY_DISABLE);
+			} else {
+				ret = kstrtoint(second_token, 10, &prio);
+				if (ret)
+					return -EINVAL;
+				if (prio == -1)
+					return -EINVAL;
+				else if (prio > SHRT_MAX || prio < SHRT_MIN)
+					return -EINVAL;
+				ret = apply_swap_cgroup_priority(memcg, id,
+								 prio);
+			}
+		}
+	} else if (num_parsed == 1) {
+		if (strcmp(first_token, "none") == 0)
+			ret = apply_swap_cgroup_priority(
+				memcg, DEFAULT_ID, SWAP_PRIORITY_GLOBAL);
+		else if (strcmp(first_token, "disabled") == 0)
+			ret = apply_swap_cgroup_priority(
+				memcg, DEFAULT_ID, SWAP_PRIORITY_DISABLE);
+		else
+			ret = -EINVAL;
+	} else {
+		return -EINVAL;
+	}
+
+	if (ret)
+		return ret;
+
+	return nbytes;
+}
+
+static int swap_cgroup_priority_show(struct seq_file *m, void *v)
+{
+	show_swap_cgroup_priority(m);
+	return 0;
+}
+#endif
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -5433,6 +5514,14 @@ static struct cftype swap_files[] = {
 		.file_offset = offsetof(struct mem_cgroup, swap_events_file),
 		.seq_show = swap_events_show,
 	},
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+	{
+		.name = "swap.priority",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_cgroup_priority_show,
+		.write = swap_cgroup_priority_write,
+	},
+#endif
 	{ }	/* terminate */
 };
 
diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
new file mode 100644
index 000000000000..abbefa6de63a
--- /dev/null
+++ b/mm/swap_cgroup_priority.c
@@ -0,0 +1,739 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2025 LG Electronics Inc.
+ *
+ * This file is part of the Linux kernel and implements per-cgroup
+ * swap device priority control.
+ *
+ * This feature allows configuring the preference and exclusion of
+ * swap devices on a per-cgroup basis.
+ *
+ * If no configuration is set, the system-wide swap priorities
+ * assigned at swapon time will apply.
+ *
+ * Author: Youngjun Park <youngjun.park@lge.com>
+ */
+#include <linux/swap.h>
+#include <linux/rcupdate.h>
+#include <linux/memcontrol.h>
+#include <linux/plist.h>
+#include "swap.h"
+#include "swap_cgroup_priority.h"
+#include "memcontrol-v1.h"
+
+static LIST_HEAD(swap_cgroup_priority_list);
+
+/*
+ * struct swap_cgroup_priority
+ *
+ * This structure is RCU protected. Its lifecycle is determined by its
+ * owning memcg or when its 'distance' reaches zero. The 'distance' field
+ * tracks priority differences from global swap. If zero, and its default_prio
+ * follows global swap priority (SWAP_PRIORITY_GLOBAL), the object is destroyed.
+ *
+ * pnode - Array of pointers to swap device priority nodes.
+ * owner - The owning memory cgroup.
+ * rcu - RCU free callback.
+ * link - Global linked list entry.
+ * least_priority - Current lowest priority.
+ * distance - Priority differences from global swap priority.
+ * default_prio - Default priority for this cgroup.
+ * plist - Priority list head.
+ */
+struct swap_cgroup_priority {
+	struct swap_cgroup_priority_pnode *pnode[MAX_SWAPFILES];
+	struct mem_cgroup *owner;
+
+	union {
+		struct rcu_head rcu;
+		struct list_head link;
+	};
+
+	int least_priority;
+	s8 distance;
+	int default_prio;
+	struct plist_head plist[];
+};
+
+/*
+ * struct swap_cgroup_priority_pnode
+ *
+ * This structure represents a priority node for a specific swap device
+ * within a cgroup.
+ *
+ * swap - Pointer to the associated swap device.
+ * id - Unique identifier for the swap device.
+ * prio - Configured priority for this device.
+ * avail_lists - Connections to various priority lists.
+ */
+struct swap_cgroup_priority_pnode {
+	struct swap_info_struct *swap;
+	u64 id;
+	signed short prio;
+	struct plist_node avail_lists[];
+};
+
+/*
+ * Even with a zero distance, a swap device isn't assigned if it doesn't
+ * meet global swap priority conditions; thus, we don't clear it.
+ */
+static bool should_clear_swap_cgroup_priority(
+	struct swap_cgroup_priority *swap_priority)
+{
+	WARN_ON_ONCE(swap_priority->distance < 0 ||
+		swap_priority->distance > MAX_SWAPFILES);
+
+	if (swap_priority->distance == 0 &&
+	    swap_priority->default_prio == SWAP_PRIORITY_GLOBAL)
+		return true;
+
+	return false;
+}
+
+/*
+ * swapdev_id
+ *
+ * A unique identifier for a swap device.
+ *
+ * This ID ensures stable identification for users and crucial synchronization
+ * for swap cgroup priority settings. It provides a reliable reference even if
+ * device paths or numbers change.
+ */
+static atomic64_t swapdev_id_counter;
+
+void get_swapdev_id(struct swap_info_struct *si)
+{
+	si->id = atomic64_inc_return(&swapdev_id_counter);
+}
+
+static struct swap_cgroup_priority *get_swap_cgroup_priority(
+	struct mem_cgroup *memcg)
+{
+	if (!memcg)
+		return NULL;
+
+	return rcu_dereference(memcg->swap_priority);
+}
+
+static struct swap_cgroup_priority_pnode *alloc_swap_cgroup_priority_pnode(
+	gfp_t gfp)
+{
+	struct swap_cgroup_priority_pnode *pnode;
+	pnode = kvzalloc(struct_size(pnode, avail_lists, nr_node_ids),
+			 gfp);
+
+	return pnode;
+}
+
+static void free_swap_cgroup_priority_pnode(
+	struct swap_cgroup_priority_pnode *pnode)
+{
+	if (pnode)
+		kvfree(pnode);
+}
+
+static void free_swap_cgroup_priority(
+	struct swap_cgroup_priority *swap_priority)
+{
+	for (int i = 0; i < MAX_SWAPFILES; i++)
+		free_swap_cgroup_priority_pnode(swap_priority->pnode[i]);
+
+	kvfree(swap_priority);
+}
+
+static struct swap_cgroup_priority *alloc_swap_cgroup_priority(void)
+{
+	struct swap_cgroup_priority *swap_priority;
+
+	swap_priority = kvzalloc(struct_size(swap_priority, plist, nr_node_ids),
+				 GFP_KERNEL);
+	if (!swap_priority)
+		return NULL;
+
+	/*
+	 * Pre-allocates pnode array up to nr_swapfiles at init.
+	 * Individual pnodes are assigned on swapon, but not freed
+	 * on swapoff. This avoids complex ref-counting, simplifying
+	 * the structure for swap cgroup priority management.
+	 */
+	for (int i = 0; i < nr_swapfiles; i++) {
+		swap_priority->pnode[i] = alloc_swap_cgroup_priority_pnode(
+						GFP_KERNEL);
+		if (!swap_priority->pnode[i]) {
+			free_swap_cgroup_priority(swap_priority);
+			return NULL;
+		}
+
+	}
+
+	return swap_priority;
+}
+
+static void rcu_free_swap_cgroup_priority(struct rcu_head *rcu)
+{
+	struct swap_cgroup_priority *swap_priority
+		= container_of(rcu, struct swap_cgroup_priority, rcu);
+
+	free_swap_cgroup_priority(swap_priority);
+}
+
+void show_swap_cgroup_priority(struct seq_file *m)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+	struct swap_cgroup_priority *swap_priority;
+
+	spin_lock(&swap_lock);
+	swap_priority = memcg->swap_priority;
+	if (!swap_priority || swap_priority->owner != memcg) {
+		spin_unlock(&swap_lock);
+		return;
+	}
+
+	if (swap_priority->default_prio != SWAP_PRIORITY_GLOBAL)
+		seq_printf(m,  "default disabled\n");
+
+	for (int i = 0; i < nr_swapfiles; i++) {
+		struct swap_info_struct *si = swap_info[i];
+		struct swap_cgroup_priority_pnode *pnode;
+		signed short prio;
+
+		if (!(si->flags & SWP_USED) || !(si->flags & SWP_WRITEOK))
+			continue;
+
+		pnode = swap_priority->pnode[i];
+
+		if (WARN_ON_ONCE(!pnode))
+			continue;
+
+		prio = pnode->prio;
+		if (prio == si->prio)
+			continue;
+
+		seq_printf(m,  "%lld", si->id);
+		if (prio != SWAP_PRIORITY_DISABLE)
+			seq_printf(m,  " %d\n", prio);
+		else
+			seq_printf(m,  " disabled\n");
+	}
+
+	spin_unlock(&swap_lock);
+}
+
+static void __delete_swap_cgroup_priority(struct mem_cgroup *memcg);
+void purge_swap_cgroup_priority(void)
+{
+	struct swap_cgroup_priority *swap_priority, *tmp;
+
+	spin_lock(&swap_avail_lock);
+	list_for_each_entry_safe(swap_priority, tmp, &swap_cgroup_priority_list,
+				 link) {
+
+		if (should_clear_swap_cgroup_priority(swap_priority))
+			__delete_swap_cgroup_priority(swap_priority->owner);
+	}
+	spin_unlock(&swap_avail_lock);
+}
+
+bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
+				swp_entry_t *entry, int order)
+{
+	struct swap_cgroup_priority *swap_priority;
+	struct swap_cgroup_priority_pnode *pnode, *next;
+	struct swap_info_struct *si;
+	unsigned long offset;
+	int node;
+
+	/* TODO
+	 * Per-cpu swapdev cache can't be used directly as cgroup-specific
+	 * priorities may select different devices.
+	 */
+	spin_lock(&swap_avail_lock);
+	node = numa_node_id();
+
+	swap_priority = get_swap_cgroup_priority(memcg);
+swap_priority_check:
+	if (!swap_priority) {
+		spin_unlock(&swap_avail_lock);
+		return false;
+	}
+
+start_over:
+	plist_for_each_entry_safe(pnode, next, &swap_priority->plist[node],
+				  avail_lists[node]) {
+		si = pnode->swap;
+		plist_requeue(&pnode->avail_lists[node],
+			&swap_priority->plist[node]);
+		spin_unlock(&swap_avail_lock);
+		if (get_swap_device_info(si)) {
+			offset = cluster_alloc_swap_entry(si, order,
+							  SWAP_HAS_CACHE);
+			put_swap_device(si);
+			if (offset) {
+				*entry = swp_entry(si->type, offset);
+				return true;
+			}
+
+			if (order)
+				return false;
+		}
+
+		spin_lock(&swap_avail_lock);
+		/*
+		 * If 'swap_cgroup_priority' changes while we're holding a lock,
+		 * we must verify its state to ensure memory validness.
+		 */
+		if (memcg->swap_priority != swap_priority)
+			goto swap_priority_check;
+
+		if (plist_node_empty(&next->avail_lists[node]))
+			goto start_over;
+	}
+	spin_unlock(&swap_avail_lock);
+
+	return false;
+}
+
+/* add_to_avail_list (swapon / swapusage > 0) */
+void activate_swap_cgroup_priority(struct swap_info_struct *swp,
+				   bool swapon)
+{
+	struct swap_cgroup_priority *swap_priority;
+	int i;
+
+	list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
+		struct swap_cgroup_priority_pnode *pnode =
+			swap_priority->pnode[swp->type];
+
+		if (WARN_ON_ONCE(!pnode))
+			continue;
+
+		/* Exclude reinsert */
+		if (swapon && pnode->id != swp->id) {
+			pnode->swap = swp;
+			if (swap_priority->default_prio == SWAP_PRIORITY_GLOBAL) {
+				if (swp->prio >= 0)
+					pnode->prio = swp->prio;
+				else
+					pnode->prio =
+						--swap_priority->least_priority;
+			} else {
+				pnode->prio = SWAP_PRIORITY_DISABLE;
+				swap_priority->distance++;
+			}
+		}
+
+		/* NUMA priority handling */
+		for_each_node(i) {
+			if (swapon) {
+				if (pnode->prio < 0 && swap_node(swp) == i) {
+					plist_node_init(
+						&pnode->avail_lists[i],
+						1);
+				} else {
+					plist_node_init(
+						&pnode->avail_lists[i],
+						-pnode->prio);
+				}
+			}
+
+			if (pnode->prio != SWAP_PRIORITY_DISABLE)
+				plist_add(&pnode->avail_lists[i],
+					  &swap_priority->plist[i]);
+		}
+	}
+}
+
+/* del_from_avail_list (swapoff / swap usage <= 0) */
+void deactivate_swap_cgroup_priority(struct swap_info_struct *swp,
+				     bool swapoff)
+{
+	struct swap_cgroup_priority *swap_priority, *tmp;
+	int nid, i;
+
+	list_for_each_entry_safe(swap_priority, tmp, &swap_cgroup_priority_list,
+				 link) {
+		struct swap_cgroup_priority_pnode *pnode =
+			swap_priority->pnode[swp->type];
+
+		if (WARN_ON_ONCE(!pnode))
+			continue;
+
+		if (swapoff) {
+			if (pnode->prio != swp->prio)
+				swap_priority->distance--;
+		}
+
+		if (pnode->prio == SWAP_PRIORITY_DISABLE)
+			continue;
+
+		if (swapoff && pnode->prio < 0) {
+			struct swap_cgroup_priority_pnode *tmp;
+			/*
+			 * NUMA priority handling
+			 * mimic swapoff prio adjustment without plist
+			 */
+			for (int i = 0; i < nr_swapfiles; i++) {
+				tmp = swap_priority->pnode[i];
+				if (!tmp || tmp->prio > pnode->prio ||
+				    tmp->swap == swp)
+					continue;
+				tmp->prio++;
+				for_each_node(nid) {
+					if (tmp->avail_lists[nid].prio != 1)
+						tmp->avail_lists[nid].prio--;
+				}
+			}
+
+			swap_priority->least_priority++;
+		}
+
+		for_each_node(i)
+			plist_del(&pnode->avail_lists[i],
+				&swap_priority->plist[i]);
+	}
+}
+
+static void apply_swap_cgroup_priority_pnode(
+	struct swap_cgroup_priority *swap_priority,
+	struct swap_cgroup_priority_pnode *pnode,
+	int prio,
+	bool clear)
+{
+	int nid;
+
+	if (clear && pnode->prio != SWAP_PRIORITY_DISABLE) {
+		for_each_node(nid) {
+			plist_del(&pnode->avail_lists[nid],
+				&swap_priority->plist[nid]);
+		}
+	}
+
+	if (pnode->swap->prio != prio && pnode->swap->prio == pnode->prio)
+		swap_priority->distance++;
+	else if (pnode->swap->prio == prio && pnode->swap->prio != pnode->prio)
+		swap_priority->distance--;
+
+	pnode->prio = prio;
+	for_each_node(nid) {
+		if (pnode->prio >= 0) {
+			plist_node_init(&pnode->avail_lists[nid],
+				-pnode->prio);
+		} else {
+			if (swap_node(pnode->swap) == nid)
+				plist_node_init(
+					&pnode->avail_lists[nid],
+					1);
+			else
+				plist_node_init(
+					&pnode->avail_lists[nid],
+					-pnode->prio);
+		}
+
+		/*
+		 * Check SWP_WRITEOK for skipping
+		 * 1. reinsert case when swapoff fails
+		 * 2. on-going swapon before adding avail list
+		 */
+		if (pnode->prio != SWAP_PRIORITY_DISABLE &&
+		    (pnode->swap->flags & SWP_WRITEOK))
+			plist_add(&pnode->avail_lists[nid],
+				&swap_priority->plist[nid]);
+	}
+}
+
+static int __apply_swap_cgroup_priority(
+	struct swap_cgroup_priority *swap_priority, u64 id, int prio, bool new)
+{
+	struct swap_cgroup_priority_pnode *pnode;
+	struct swap_info_struct *si;
+	int old_prio;
+	int type;
+
+	if (new)
+		swap_priority->least_priority = least_priority;
+
+	if (id == DEFAULT_ID) {
+		swap_priority->default_prio = prio;
+		if (new)
+			goto assign_prio;
+
+		goto out;
+	}
+
+	for (type = 0; type < nr_swapfiles; type++) {
+		si = swap_info[type];
+		if (id == si->id)
+			break;
+		si = NULL;
+	}
+
+	if (!si)
+		return -EIO;
+
+	if (!(si->flags & SWP_USED) || !(si->flags & SWP_WRITEOK))
+		return -EFAULT;
+
+	if (si->id != id)
+		return -EINVAL;
+
+	if (prio == SWAP_PRIORITY_GLOBAL)
+		prio = si->prio;
+
+	pnode = swap_priority->pnode[type];
+	/* Assigning the same priority has no effect. */
+	if (!new && pnode && pnode->prio == prio)
+		return 0;
+	else if (new && si->prio == prio)
+		return 0;
+
+	if (new) {
+		pnode->id = id;
+		pnode->swap = si;
+		pnode->prio = si->prio;
+	}
+	old_prio = pnode->prio;
+
+	/*
+	 * When a new negative priority is added, least_priority decreases.
+	 * When a negative priority is deleted, least_priority increases.
+	 */
+	if (prio < SWAP_PRIORITY_DISABLE && old_prio >= SWAP_PRIORITY_DISABLE)
+		swap_priority->least_priority--;
+	else if (prio >= SWAP_PRIORITY_DISABLE &&
+		 old_prio < SWAP_PRIORITY_DISABLE)
+		swap_priority->least_priority++;
+
+	if (prio < swap_priority->least_priority)
+		prio = swap_priority->least_priority;
+
+	apply_swap_cgroup_priority_pnode(swap_priority, pnode, prio, !new);
+
+	/*
+	 * This logic adjusts priorities according to global swap on/off rule.
+	 * Priorities at or above SWAP_PRIORITY_DISABLE don't affect other swap
+	 * device priorities. However, negative priorities below this threshold
+	 * influence each other based on their values. Adjustments are made if a
+	 * swap device's priority becomes negative and starts influencing others,
+	 * or if it moves out of the negative range and stops influencing them.
+	 */
+assign_prio:
+	for (int i = 0; i < nr_swapfiles; i++) {
+		int changed_prio;
+		si = swap_info[i];
+		/*
+		 * nr_swapfiles may have increased after initial alloc
+		 * due to missing swap_lock
+		 */
+		if (!(pnode = swap_priority->pnode[si->type])) {
+			pnode = alloc_swap_cgroup_priority_pnode(GFP_ATOMIC);
+			if (!pnode)
+				return -ENOMEM;
+			swap_priority->pnode[si->type] = pnode;
+		}
+
+		/*
+		 * Does not check SWP_WRITEOK. device could be reinserted.
+		 * Ensure si->map is valid before proceeding.
+		 * This prevents missing swapon failures where SWP_USED
+		 * state persists unexpectedly.
+		 */
+		if (!(si->flags & SWP_USED) || !si->swap_map)
+			continue;
+
+		if (si->id == id)
+			continue;
+
+		if (si->id != pnode->id) {
+			pnode->id = si->id;
+			pnode->prio = si->prio;
+			pnode->swap = si;
+		}
+
+		changed_prio = pnode->prio;
+
+		/*
+		 * A new negative value is added,
+		 * so all values lower than it are shifted backward by one.
+		 */
+		if (old_prio >= SWAP_PRIORITY_DISABLE &&
+		    prio < SWAP_PRIORITY_DISABLE &&
+		    (pnode->prio < SWAP_PRIORITY_DISABLE &&
+		    pnode->prio <= prio)) {
+			changed_prio--;
+		/*
+		 * One negative value is removed,
+		 * so all higher values are shifted forward by one.
+		 */
+		} else if (old_prio < SWAP_PRIORITY_DISABLE &&
+			   prio >= SWAP_PRIORITY_DISABLE &&
+			   (pnode->prio < SWAP_PRIORITY_DISABLE &&
+			   pnode->prio <= old_prio)) {
+			changed_prio++;
+		} else if (old_prio < SWAP_PRIORITY_DISABLE &&
+			   prio < SWAP_PRIORITY_DISABLE) {
+			/*
+			 * If it was negative already but becomes smaller,
+			 * shift all values in range backward by one.
+			 */
+			if (old_prio > prio &&
+			    (prio <= pnode->prio && old_prio >= pnode->prio)) {
+				changed_prio++;
+			/*
+			 * If it was negative already but becomes larger,
+			 * shift all values in range forward by one.
+			 */
+			} else if (old_prio < prio &&
+				   (prio >= pnode->prio &&
+				   old_prio <= pnode->prio)) {
+				changed_prio--;
+			}
+		}
+
+		if (!new && changed_prio == pnode->prio)
+			continue;
+
+		apply_swap_cgroup_priority_pnode(
+			swap_priority, pnode, changed_prio, !new);
+	}
+
+out:
+	if (should_clear_swap_cgroup_priority(swap_priority))
+		return 1;
+
+	return 0;
+}
+
+int prepare_swap_cgroup_priority(int type)
+{
+	struct swap_cgroup_priority *swap_priority;
+	int err = 0;
+
+	spin_lock(&swap_avail_lock);
+	list_for_each_entry_rcu(swap_priority,
+				&swap_cgroup_priority_list, link) {
+		if (!swap_priority->pnode[type]) {
+			swap_priority->pnode[type] =
+				alloc_swap_cgroup_priority_pnode(GFP_ATOMIC);
+
+			if (!swap_priority->pnode[type]) {
+				err = -ENOMEM;
+				break;
+			}
+		}
+
+	}
+	spin_unlock(&swap_avail_lock);
+
+	return err;
+}
+
+int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
+{
+	struct swap_cgroup_priority *swap_priority;
+	int nid;
+	bool new = false;
+	int err = 0;
+
+	rcu_read_lock();
+	swap_priority = rcu_dereference(memcg->swap_priority);
+	if (swap_priority && swap_priority->owner == memcg) {
+		rcu_read_unlock();
+		goto prio_set;
+	}
+	rcu_read_unlock();
+	new = true;
+
+	/* No need to define "global swap priority" for a new cgroup. */
+	if (new && prio == SWAP_PRIORITY_GLOBAL)
+		return 0;
+
+	swap_priority = alloc_swap_cgroup_priority();
+	if (!swap_priority)
+		return -ENOMEM;
+
+	/* Just initialize; may be changed in __apply_swap_cgroup_priority(). */
+	swap_priority->default_prio = SWAP_PRIORITY_GLOBAL;
+	INIT_LIST_HEAD(&swap_priority->link);
+	for_each_node(nid)
+		plist_head_init(&swap_priority->plist[nid]);
+
+prio_set:
+	spin_lock(&swap_lock);
+	spin_lock(&swap_avail_lock);
+
+	/* Simultaneous calls to the same interface.*/
+	if (new && memcg->swap_priority &&
+	    memcg->swap_priority->owner == memcg) {
+		new = false;
+		free_swap_cgroup_priority(swap_priority);
+		swap_priority = memcg->swap_priority;
+	}
+
+	err = __apply_swap_cgroup_priority(swap_priority, id, prio, new);
+	if (err) {
+		/*
+		 * The difference with the global swap priority is now zero.
+		 * Remove the swap priority.
+		 */
+		if (err == 1) {
+			err = 0;
+			__delete_swap_cgroup_priority(memcg);
+		}
+
+		goto error_locked;
+	}
+
+	if (new) {
+		swap_priority->owner = memcg;
+		list_add_rcu(&swap_priority->link, &swap_cgroup_priority_list);
+		memcg->swap_priority = swap_priority;
+
+		for (int i = 0; i < nr_swapfiles; i++) {
+			if (!swap_priority->pnode[i]->swap) {
+				free_swap_cgroup_priority_pnode(
+					swap_priority->pnode[i]);
+				swap_priority->pnode[i] = NULL;
+			}
+		}
+	}
+
+	spin_unlock(&swap_avail_lock);
+	spin_unlock(&swap_lock);
+
+	return 0;
+
+error_locked:
+	spin_unlock(&swap_avail_lock);
+	spin_unlock(&swap_lock);
+
+	if (!new)
+		return err;
+
+	free_swap_cgroup_priority(swap_priority);
+	return err;
+}
+
+static void __delete_swap_cgroup_priority(struct mem_cgroup *memcg)
+{
+	struct swap_cgroup_priority *swap_priority = memcg->swap_priority;
+
+	lockdep_assert_held(&swap_avail_lock);
+
+	if (!swap_priority)
+		return;
+
+	/* If using a cached swap_priority, there is no need to remove it. */
+	if (swap_priority->owner != memcg)
+		return;
+
+	rcu_assign_pointer(memcg->swap_priority, NULL);
+	list_del_rcu(&swap_priority->link);
+	call_rcu(&swap_priority->rcu, rcu_free_swap_cgroup_priority);
+}
+
+void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
+{
+	spin_lock(&swap_avail_lock);
+	__delete_swap_cgroup_priority(memcg);
+	spin_unlock(&swap_avail_lock);
+}
diff --git a/mm/swap_cgroup_priority.h b/mm/swap_cgroup_priority.h
new file mode 100644
index 000000000000..253e95623270
--- /dev/null
+++ b/mm/swap_cgroup_priority.h
@@ -0,0 +1,86 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SWAP_CGROUP_PRIORITY_H
+#define _SWAP_CGROUP_PRIORITY_H
+#include <linux/swap.h>
+
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+#include <linux/limits.h>
+/*
+ * A priority of -1 is not assigned to global swap entries,
+ * based on the kernel's specific negative priority assignment rules.
+ */
+#define SWAP_PRIORITY_DISABLE	-1
+/*
+ * (SHRT_MAX + 1) exceeds the maximum 'prio' value for signed short.
+ * This marks it as an invalid or special priority state, not for standard use.
+ */
+#define SWAP_PRIORITY_GLOBAL	(SHRT_MAX + 1)
+/*
+ * ID 0 is reserved/unused in kernel swap management, allowing its use
+ * for special internal states or flags, as swap IDs typically start from 1.
+ */
+#define DEFAULT_ID		0
+
+/* linux/mm/swapfile.c */
+extern spinlock_t swap_lock;
+extern int least_priority;
+extern unsigned int nr_swapfiles;
+extern spinlock_t swap_avail_lock;
+extern struct swap_info_struct *swap_info[MAX_SWAPFILES];
+int swap_node(struct swap_info_struct *si);
+unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+				       unsigned char usage);
+bool get_swap_device_info(struct swap_info_struct *si);
+
+/* linux/mm/swap_cgroup_priority.c */
+int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio);
+void activate_swap_cgroup_priority(struct swap_info_struct *swp, bool swapon);
+void deactivate_swap_cgroup_priority(struct swap_info_struct *swp,
+				     bool swapoff);
+int prepare_swap_cgroup_priority(int type);
+void show_swap_cgroup_priority(struct seq_file *m);
+void get_swapdev_id(struct swap_info_struct *si);
+void purge_swap_cgroup_priority(void);
+bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg, swp_entry_t *entry,
+				int order);
+void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
+#else
+int swap_node(struct swap_info_struct *si);
+unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+				       unsigned char usage);
+bool get_swap_device_info(struct swap_info_struct *si);
+
+static inline int apply_swap_cgroup_priority(struct mem_cgroup *memcg, int id,
+					     int prio)
+{
+	return 0;
+}
+static inline void activate_swap_cgroup_priority(struct swap_info_struct *swp,
+						 bool swapon)
+{
+}
+static inline void deactivate_swap_cgroup_priority(struct swap_info_struct *swp, 
+						   bool swapoff)
+{
+}
+static inline int prepare_swap_cgroup_priority(int type)
+{
+	return 0;
+}
+
+static inline void get_swapdev_id(struct swap_info_struct *si)
+{
+}
+static inline void purge_swap_cgroup_priority(void)
+{
+}
+static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
+					      swp_entry_t *entry, int order)
+{
+	return false;
+}
+static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
+{
+}
+#endif
+#endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 68ce283e84be..4b56f117b2b0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -48,6 +48,7 @@
 #include <linux/swap_cgroup.h>
 #include "internal.h"
 #include "swap.h"
+#include "swap_cgroup_priority.h"
 
 static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 				 unsigned char);
@@ -62,8 +63,8 @@ static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 					      unsigned long offset);
 static inline void unlock_cluster(struct swap_cluster_info *ci);
 
-static DEFINE_SPINLOCK(swap_lock);
-static unsigned int nr_swapfiles;
+DEFINE_SPINLOCK(swap_lock);
+unsigned int nr_swapfiles;
 atomic_long_t nr_swap_pages;
 /*
  * Some modules use swappable objects and may try to swap them out under
@@ -73,7 +74,7 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-static int least_priority = -1;
+int least_priority = -1;
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ -103,9 +104,9 @@ static PLIST_HEAD(swap_active_head);
  * before any swap_info_struct->lock.
  */
 static struct plist_head *swap_avail_heads;
-static DEFINE_SPINLOCK(swap_avail_lock);
+DEFINE_SPINLOCK(swap_avail_lock);
 
-static struct swap_info_struct *swap_info[MAX_SWAPFILES];
+struct swap_info_struct *swap_info[MAX_SWAPFILES];
 
 static DEFINE_MUTEX(swapon_mutex);
 
@@ -878,7 +879,7 @@ static void swap_reclaim_work(struct work_struct *work)
  * Try to allocate swap entries with specified order and try set a new
  * cluster for current CPU too.
  */
-static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 					      unsigned char usage)
 {
 	struct swap_cluster_info *ci;
@@ -1156,7 +1157,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	swap_usage_sub(si, nr_entries);
 }
 
-static bool get_swap_device_info(struct swap_info_struct *si)
+bool get_swap_device_info(struct swap_info_struct *si)
 {
 	if (!percpu_ref_tryget_live(&si->users))
 		return false;
@@ -2536,7 +2537,7 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 	return generic_swapfile_activate(sis, swap_file, span);
 }
 
-static int swap_node(struct swap_info_struct *si)
+int swap_node(struct swap_info_struct *si)
 {
 	struct block_device *bdev;
 
-- 
2.34.1



* [PATCH 2/4] mm: swap: Apply per-cgroup swap priority mechanism to swap layer
  2025-07-16 20:20 [PATCH 0/4] mm/swap, memcg: Support per-cgroup swap device priorities Youngjun Park
  2025-07-16 20:20 ` [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority Youngjun Park
@ 2025-07-16 20:20 ` Youngjun Park
  2025-07-16 20:20 ` [PATCH 3/4] mm: memcg: Add swap cgroup priority inheritance mechanism Youngjun Park
  2025-07-16 20:20 ` [PATCH 4/4] mm: swap: Per-cgroup per-CPU swap device cache with shared clusters Youngjun Park
  3 siblings, 0 replies; 39+ messages in thread
From: Youngjun Park @ 2025-07-16 20:20 UTC (permalink / raw)
  To: akpm, hannes
  Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	kasong, nphamcs, bhe, baohua, chrisl, cgroups, linux-mm,
	linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song,
	Youngjun Park

This patch applies the per-cgroup swap priority mechanism to the swap layer.

It implements:
- Swap device ID assignment based on the cgroup's effective priority
- Swap device selection respecting cgroup-specific priorities
- Swap on/off propagation logic that updates per-cgroup settings accordingly

Currently, the per-CPU swap cluster cache is bypassed, since different
cgroups may select different devices based on their configured priorities.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 mm/swap_cgroup_priority.c |  6 ++---
 mm/swapfile.c             | 46 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
index abbefa6de63a..979bc18d2eed 100644
--- a/mm/swap_cgroup_priority.c
+++ b/mm/swap_cgroup_priority.c
@@ -243,9 +243,9 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
 	unsigned long offset;
 	int node;
 
-	/* TODO
-	 * Per-cpu swapdev cache can't be used directly as cgroup-specific
-	 * priorities may select different devices.
+	/*
+	 * TODO: Per-cpu swap cluster cache can't be used directly
+	 * as cgroup-specific priorities may select different devices.
 	 */
 	spin_lock(&swap_avail_lock);
 	node = numa_node_id();
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4b56f117b2b0..bfd0532ad250 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1029,6 +1029,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 	for_each_node(nid)
 		plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
 
+	deactivate_swap_cgroup_priority(si, swapoff);
 skip:
 	spin_unlock(&swap_avail_lock);
 }
@@ -1072,6 +1073,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 	for_each_node(nid)
 		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
 
+	activate_swap_cgroup_priority(si, swapon);
 skip:
 	spin_unlock(&swap_avail_lock);
 }
@@ -1292,8 +1294,10 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
 	}
 
 	local_lock(&percpu_swap_cluster.lock);
-	if (!swap_alloc_fast(&entry, order))
-		swap_alloc_slow(&entry, order);
+	if (!swap_alloc_cgroup_priority(folio_memcg(folio), &entry, order)) {
+		if (!swap_alloc_fast(&entry, order))
+			swap_alloc_slow(&entry, order);
+	}
 	local_unlock(&percpu_swap_cluster.lock);
 
 	/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
@@ -2778,6 +2782,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (!p->bdev || !bdev_nonrot(p->bdev))
 		atomic_dec(&nr_rotate_swap);
 
+	purge_swap_cgroup_priority();
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
@@ -2895,6 +2900,8 @@ static void swap_stop(struct seq_file *swap, void *v)
 	mutex_unlock(&swapon_mutex);
 }
 
+
+#ifndef CONFIG_SWAP_CGROUP_PRIORITY
 static int swap_show(struct seq_file *swap, void *v)
 {
 	struct swap_info_struct *si = v;
@@ -2921,6 +2928,34 @@ static int swap_show(struct seq_file *swap, void *v)
 			si->prio);
 	return 0;
 }
+#else
+static int swap_show(struct seq_file *swap, void *v)
+{
+	struct swap_info_struct *si = v;
+	struct file *file;
+	int len;
+	unsigned long bytes, inuse;
+
+	if (si == SEQ_START_TOKEN) {
+		seq_puts(swap, "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\t\tId\n");
+		return 0;
+	}
+
+	bytes = K(si->pages);
+	inuse = K(swap_usage_in_pages(si));
+
+	file = si->swap_file;
+	len = seq_file_path(swap, file, " \t\n\\");
+	seq_printf(swap, "%*s%s\t%lu\t%s%lu\t%s%d\t\t\t%llu\n",
+			len < 40 ? 40 - len : 1, " ",
+			S_ISBLK(file_inode(file)->i_mode) ?
+				"partition" : "file\t",
+			bytes, bytes < 10000000 ? "\t" : "",
+			inuse, inuse < 10000000 ? "\t" : "",
+			si->prio, si->id);
+	return 0;
+}
+#endif
 
 static const struct seq_operations swaps_op = {
 	.start =	swap_start,
@@ -3463,6 +3498,13 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto free_swap_zswap;
 	}
 
+	error = prepare_swap_cgroup_priority(si->type);
+	if (error) {
+		inode->i_flags &= ~S_SWAPFILE;
+		goto free_swap_zswap;
+	}
+	get_swapdev_id(si);
+
 	mutex_lock(&swapon_mutex);
 	prio = -1;
 	if (swap_flags & SWAP_FLAG_PREFER)
-- 
2.34.1



* [PATCH 3/4] mm: memcg: Add swap cgroup priority inheritance mechanism
  2025-07-16 20:20 [PATCH 0/4] mm/swap, memcg: Support per-cgroup swap device priorities Youngjun Park
  2025-07-16 20:20 ` [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority Youngjun Park
  2025-07-16 20:20 ` [PATCH 2/4] mm: swap: Apply per-cgroup swap priority mechanism to swap layer Youngjun Park
@ 2025-07-16 20:20 ` Youngjun Park
  2025-07-16 20:20 ` [PATCH 4/4] mm: swap: Per-cgroup per-CPU swap device cache with shared clusters Youngjun Park
  3 siblings, 0 replies; 39+ messages in thread
From: Youngjun Park @ 2025-07-16 20:20 UTC (permalink / raw)
  To: akpm, hannes
  Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	kasong, nphamcs, bhe, baohua, chrisl, cgroups, linux-mm,
	linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song,
	Youngjun Park, Michal Koutný

This patch introduces inheritance semantics for swap cgroup priorities.

Each cgroup can configure its own swap priority via the
memory.swap.priority interface. However, the effective priority is
determined by walking up the cgroup hierarchy and applying the highest
ancestor's configured value.

If no ancestor has a configured value, the cgroup's own setting is used.
If neither is present, it falls back to the global swap configuration.

To make inheritance visible to userspace, this patch introduces the
memory.swap.priority.effective interface.

Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  14 ++
 mm/memcontrol.c                         |  14 ++
 mm/swap_cgroup_priority.c               | 203 ++++++++++++++++++++----
 mm/swap_cgroup_priority.h               |   3 +
 4 files changed, 207 insertions(+), 27 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 35fb9677f0d6..ae6a0c809db4 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1901,6 +1901,20 @@ The following nested keys are defined.
       other negative priorities to restore the same ordering as the global
       swap configuration.
 
+  memory.swap.priority.effective
+        A read-only file showing the effective swap priority ordering
+        actually applied to this cgroup, after resolving inheritance
+        from ancestors. The effective swap priority for a cgroup is
+        also influenced by its position within the cgroup hierarchy. If any
+        ancestor cgroup has set a swap priority configuration, it is
+        propagated and inherited by all descendants. In that case, the
+        child’s own configuration is ignored and the topmost configured
+        ancestor determines the effective priority ordering.
+
+        If there is no configuration in the current cgroup and its
+        ancestors, this file shows the global swap device priority from
+        `swapon`, in the form of id and priority pairs.
+
   memory.zswap.current
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ea207d498ad6..4a0762060f99 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3806,6 +3806,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	if (parent) {
 		WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+		memcg->swap_priority = inherit_swap_cgroup_priority(parent);
+#endif
 		page_counter_init(&memcg->memory, &parent->memory, memcg_on_dfl);
 		page_counter_init(&memcg->swap, &parent->swap, false);
 #ifdef CONFIG_MEMCG_V1
@@ -5480,6 +5483,12 @@ static int swap_cgroup_priority_show(struct seq_file *m, void *v)
 	show_swap_cgroup_priority(m);
 	return 0;
 }
+
+static int swap_cgroup_priority_effective_show(struct seq_file *m, void *v)
+{
+	show_swap_cgroup_priority_effective(m);
+	return 0;
+}
 #endif
 
 static struct cftype swap_files[] = {
@@ -5521,6 +5530,11 @@ static struct cftype swap_files[] = {
 		.seq_show = swap_cgroup_priority_show,
 		.write = swap_cgroup_priority_write,
 	},
+	{
+		.name = "swap.priority.effective",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_cgroup_priority_effective_show,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
index 979bc18d2eed..84e876b77f01 100644
--- a/mm/swap_cgroup_priority.c
+++ b/mm/swap_cgroup_priority.c
@@ -21,6 +21,7 @@
 #include "swap_cgroup_priority.h"
 #include "memcontrol-v1.h"
 
+static DEFINE_MUTEX(swap_cgroup_priority_inherit_lck);
 static LIST_HEAD(swap_cgroup_priority_list);
 
 /*
@@ -31,6 +32,16 @@ static LIST_HEAD(swap_cgroup_priority_list);
  * tracks priority differences from global swap. If zero, and its default_prio
  * follows global swap priority(SWAP_PRIORITY_GLOBAL), the object is destroyed.
  *
+ * Child cgroups hold direct pointers to this object for fast access.
+ * No reference counting is needed, as the owner's teardown or zero
+ * distance directly implies this object's destruction.
+ *
+ * A cgroup that has its own swap_cgroup_priority uses the 'effective'
+ * field to point at the object actually in effect: its own, or that of
+ * the topmost configured ancestor it inherits from. Changes to an
+ * ancestor's swap priority are propagated down accordingly.
+ *
+ * effective - Actual effective swap cgroup priority.
  * pnode - Array of pointers to swap device priority nodes.
  * owner - The owning memory cgroup.
  * rcu - RCU free callback.
@@ -41,6 +52,7 @@ static LIST_HEAD(swap_cgroup_priority_list);
  * plist - Priority list head.
  */
 struct swap_cgroup_priority {
+	struct swap_cgroup_priority *effective;
 	struct swap_cgroup_priority_pnode *pnode[MAX_SWAPFILES];
 	struct mem_cgroup *owner;
 
@@ -106,13 +118,38 @@ void get_swapdev_id(struct swap_info_struct *si)
 	si->id = atomic64_inc_return(&swapdev_id_counter);
 }
 
-static struct swap_cgroup_priority *get_swap_cgroup_priority(
+static struct swap_cgroup_priority *get_effective_swap_cgroup_priority(
 	struct mem_cgroup *memcg)
 {
+	struct swap_cgroup_priority *swap_priority;
 	if (!memcg)
 		return NULL;
 
-	return rcu_dereference(memcg->swap_priority);
+	swap_priority = memcg->swap_priority;
+	if (!swap_priority)
+		return NULL;
+
+	return swap_priority->effective;
+}
+
+static bool validate_effective_swap_cgroup_priority(
+	struct mem_cgroup *memcg,
+	struct swap_cgroup_priority **swap_priority)
+{
+	struct swap_cgroup_priority *target = memcg->swap_priority;
+
+	if (!target) {
+		*swap_priority = NULL;
+		return false;
+	}
+
+	target = target->effective;
+	if (target != *swap_priority) {
+		*swap_priority = target;
+		return false;
+	}
+
+	return true;
 }
 
 static struct swap_cgroup_priority_pnode *alloc_swap_cgroup_priority_pnode(
@@ -182,10 +219,13 @@ void show_swap_cgroup_priority(struct seq_file *m)
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 	struct swap_cgroup_priority *swap_priority;
 
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
 	spin_lock(&swap_lock);
+
 	swap_priority = memcg->swap_priority;
 	if (!swap_priority || swap_priority->owner != memcg) {
 		spin_unlock(&swap_lock);
+		mutex_unlock(&swap_cgroup_priority_inherit_lck);
 		return;
 	}
 
@@ -217,6 +257,47 @@ void show_swap_cgroup_priority(struct seq_file *m)
 	}
 
 	spin_unlock(&swap_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
+}
+
+void show_swap_cgroup_priority_effective(struct seq_file *m)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+	struct swap_cgroup_priority *swap_priority;
+
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
+	spin_lock(&swap_lock);
+
+	swap_priority = get_effective_swap_cgroup_priority(memcg);
+	if (swap_priority && swap_priority->default_prio != SWAP_PRIORITY_GLOBAL)
+		seq_printf(m, "default disabled\n");
+
+	for (int i = 0; i < nr_swapfiles; i++) {
+		struct swap_info_struct *si = swap_info[i];
+		struct swap_cgroup_priority_pnode *pnode;
+		signed short prio;
+
+		if (!(si->flags & SWP_USED) || !(si->flags & SWP_WRITEOK))
+			continue;
+
+		seq_printf(m, "%lld", si->id);
+		if (!swap_priority) {
+			seq_printf(m, " %d\n", si->prio);
+			continue;
+		}
+
+		pnode = swap_priority->pnode[i];
+		if (WARN_ON(!pnode))
+			continue;
+
+		prio = pnode->prio;
+		if (prio != SWAP_PRIORITY_DISABLE)
+			seq_printf(m, " %d\n", prio);
+		else
+			seq_printf(m, " disabled\n");
+	}
+	spin_unlock(&swap_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 }
 
 static void __delete_swap_cgroup_priority(struct mem_cgroup *memcg);
@@ -224,6 +305,7 @@ void purge_swap_cgroup_priority(void)
 {
 	struct swap_cgroup_priority *swap_priority, *tmp;
 
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
 	spin_lock(&swap_avail_lock);
 	list_for_each_entry_safe(swap_priority, tmp, &swap_cgroup_priority_list,
 				 link) {
@@ -232,6 +314,7 @@ void purge_swap_cgroup_priority(void)
 			__delete_swap_cgroup_priority(swap_priority->owner);
 	}
 	spin_unlock(&swap_avail_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 }
 
 bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
@@ -250,7 +333,7 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
 	spin_lock(&swap_avail_lock);
 	node = numa_node_id();
 
-	swap_priority = get_swap_cgroup_priority(memcg);
+	swap_priority = get_effective_swap_cgroup_priority(memcg);
 swap_priority_check:
 	if (!swap_priority) {
 		spin_unlock(&swap_avail_lock);
@@ -282,7 +365,8 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
 		 * If 'swap_cgroup_priority' changes while we're holding a lock,
 		 * we must verify its state to ensure memory validness.
 		 */
-		if (memcg->swap_priority != swap_priority)
+		if (!validate_effective_swap_cgroup_priority(memcg,
+							     &swap_priority))
 			goto swap_priority_check;
 
 		if (plist_node_empty(&next->avail_lists[node]))
@@ -350,7 +434,7 @@ void deactivate_swap_cgroup_priority(struct swap_info_struct *swp,
 	struct swap_cgroup_priority *swap_priority, *tmp;
 	int nid, i;
 
-	list_for_each_entry_safe(swap_priority, tmp, &swap_cgroup_priority_list,
+	list_for_each_entry_safe(swap_priority, tmp, &swap_cgroup_priority_list, 
 				 link) {
 		struct swap_cgroup_priority_pnode *pnode =
 			swap_priority->pnode[swp->type];
@@ -603,17 +687,57 @@ static int __apply_swap_cgroup_priority(
 	return 0;
 }
 
+/*
+ * If this is the top-level swap_cgroup_priority, propagation is needed.
+ * We traverse the 'mem_cgroup_tree' using 'for_each_mem_cgroup_tree'.
+ * Due to its pre-order traversal, after propagating changes in the parent,
+ * subsequent child nodes can correctly retrieve the parent's effective
+ * swap_cgroup_priority, ensuring proper propagation.
+ */
+static void propagate_swap_cgroup_priority(
+	struct mem_cgroup *memcg,
+	struct swap_cgroup_priority *swap_priority)
+{
+	struct mem_cgroup *iter;
+
+	iter = parent_mem_cgroup(memcg);
+	while (iter) {
+		if (iter->swap_priority)
+			return;
+		iter = parent_mem_cgroup(iter);
+	}
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		if (iter == memcg)
+			continue;
+
+		if (iter->swap_priority &&
+			iter->swap_priority->owner == iter) {
+			rcu_assign_pointer(iter->swap_priority->effective,
+					   swap_priority ?
+					   swap_priority : iter->swap_priority);
+		} else {
+			struct swap_cgroup_priority *effective =
+				get_effective_swap_cgroup_priority(
+					parent_mem_cgroup(iter));
+			iter->swap_priority = effective;
+		}
+	}
+
+	return;
+}
+
 int prepare_swap_cgroup_priority(int type)
 {
 	struct swap_cgroup_priority *swap_priority;
 	int err = 0;
 
-	spin_lock(&swap_avail_lock);
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
 	list_for_each_entry_rcu(swap_priority,
 				&swap_cgroup_priority_list, link) {
 		if (!swap_priority->pnode[type]) {
 			swap_priority->pnode[type] =
-				alloc_swap_cgroup_priority_pnode(GFP_ATOMIC);
+				alloc_swap_cgroup_priority_pnode(GFP_KERNEL);
 
 			if (!swap_priority->pnode[type]) {
 				err = -ENOMEM;
@@ -622,11 +746,23 @@ int prepare_swap_cgroup_priority(int type)
 		}
 
 	}
-	spin_unlock(&swap_avail_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 
 	return err;
 }
 
+struct swap_cgroup_priority *inherit_swap_cgroup_priority(
+	struct mem_cgroup *parent)
+{
+	struct swap_cgroup_priority *swap_priority;
+
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
+	swap_priority = get_effective_swap_cgroup_priority(parent);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
+
+	return swap_priority;
+}
+
 int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 {
 	struct swap_cgroup_priority *swap_priority;
@@ -634,22 +770,24 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 	bool new = false;
 	int err = 0;
 
-	rcu_read_lock();
-	swap_priority = rcu_dereference(memcg->swap_priority);
-	if (swap_priority && swap_priority->owner == memcg) {
-		rcu_read_unlock();
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
+	swap_priority = memcg->swap_priority;
+	if (swap_priority && swap_priority->owner == memcg)
 		goto prio_set;
-	}
-	rcu_read_unlock();
+
 	new = true;
 
 	/* No need to define "global swap priority" for a new cgroup. */
-	if (new && prio == SWAP_PRIORITY_GLOBAL)
+	if (new && prio == SWAP_PRIORITY_GLOBAL) {
+		mutex_unlock(&swap_cgroup_priority_inherit_lck);
 		return 0;
+	}
 
 	swap_priority = alloc_swap_cgroup_priority();
-	if (!swap_priority)
+	if (!swap_priority) {
+		mutex_unlock(&swap_cgroup_priority_inherit_lck);
 		return -ENOMEM;
+	}
 
 	/* Just initialize. may changed on __apply_swap_cgroup_priority */
 	swap_priority->default_prio = SWAP_PRIORITY_GLOBAL;
@@ -661,23 +799,17 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 	spin_lock(&swap_lock);
 	spin_lock(&swap_avail_lock);
 
-	/* Simultaneous calls to the same interface.*/
-	if (new && memcg->swap_priority &&
-	    memcg->swap_priority->owner == memcg) {
-		new = false;
-		free_swap_cgroup_priority(swap_priority);
-		swap_priority = memcg->swap_priority;
-	}
-
 	err = __apply_swap_cgroup_priority(swap_priority, id, prio, new);
 	if (err) {
 		/*
 		 * The difference with the global swap priority is now zero.
-		 * Remove the swap priority.
+		 * Remove the swap priority, and propagate if needed.
 		 */
 		if (err == 1) {
 			err = 0;
 			__delete_swap_cgroup_priority(memcg);
+			if (swap_priority != swap_priority->effective)
+				memcg->swap_priority = swap_priority->effective;
 		}
 
 		goto error_locked;
@@ -686,7 +818,19 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 	if (new) {
 		swap_priority->owner = memcg;
 		list_add_rcu(&swap_priority->link, &swap_cgroup_priority_list);
-		memcg->swap_priority = swap_priority;
+		/* If there was an inherited swap priority, update effective. */
+		if (memcg->swap_priority) {
+			swap_priority->effective = memcg->swap_priority;
+			memcg->swap_priority = swap_priority;
+		} else {
+			swap_priority->effective = swap_priority;
+			memcg->swap_priority = swap_priority;
+			/*
+			 * Might be a top-level parent memcg,
+			 * so propagate effective priority.
+			 */
+			propagate_swap_cgroup_priority(memcg, swap_priority);
+		}
 
 		for (int i = 0; i < nr_swapfiles; i++) {
 			if (!swap_priority->pnode[i]->swap) {
@@ -699,12 +843,13 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 
 	spin_unlock(&swap_avail_lock);
 	spin_unlock(&swap_lock);
-
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 	return 0;
 
 error_locked:
 	spin_unlock(&swap_avail_lock);
 	spin_unlock(&swap_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 
 	if (!new)
 		return err;
@@ -717,6 +862,7 @@ static void __delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 {
 	struct swap_cgroup_priority *swap_priority = memcg->swap_priority;
 
+	lockdep_assert_held(&swap_cgroup_priority_inherit_lck);
 	lockdep_assert_held(&swap_avail_lock);
 
 	if (!swap_priority)
@@ -727,13 +873,16 @@ static void __delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 		return;
 
 	rcu_assign_pointer(memcg->swap_priority, NULL);
+	propagate_swap_cgroup_priority(memcg, NULL);
 	list_del_rcu(&swap_priority->link);
 	call_rcu(&swap_priority->rcu, rcu_free_swap_cgroup_priority);
 }
 
 void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 {
+	mutex_lock(&swap_cgroup_priority_inherit_lck);
 	spin_lock(&swap_avail_lock);
 	__delete_swap_cgroup_priority(memcg);
 	spin_unlock(&swap_avail_lock);
+	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 }
diff --git a/mm/swap_cgroup_priority.h b/mm/swap_cgroup_priority.h
index 253e95623270..5d16b63d12e0 100644
--- a/mm/swap_cgroup_priority.h
+++ b/mm/swap_cgroup_priority.h
@@ -39,8 +39,11 @@ void deactivate_swap_cgroup_priority(struct swap_info_struct *swp,
 				     bool swapoff);
 int prepare_swap_cgroup_priority(int type);
 void show_swap_cgroup_priority(struct seq_file *m);
+void show_swap_cgroup_priority_effective(struct seq_file *m);
 void get_swapdev_id(struct swap_info_struct *si);
 void purge_swap_cgroup_priority(void);
+struct swap_cgroup_priority *inherit_swap_cgroup_priority(
+	struct mem_cgroup *parent);
 bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg, swp_entry_t *entry,
 				int order);
 void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/4] mm: swap: Per-cgroup per-CPU swap device cache with shared clusters
  2025-07-16 20:20 [PATCH 0/4] mm/swap, memcg: Support per-cgroup swap device priorities Youngjun Park
                   ` (2 preceding siblings ...)
  2025-07-16 20:20 ` [PATCH 3/4] mm: memcg: Add swap cgroup priority inheritance mechanism Youngjun Park
@ 2025-07-16 20:20 ` Youngjun Park
  2025-07-22 17:44   ` Kairui Song
  3 siblings, 1 reply; 39+ messages in thread
From: Youngjun Park @ 2025-07-16 20:20 UTC (permalink / raw)
  To: akpm, hannes
  Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	kasong, nphamcs, bhe, baohua, chrisl, cgroups, linux-mm,
	linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song,
	Youngjun Park

This patch introduces a new swap allocation mechanism that supports
per-cgroup per-CPU swap device caches, combined with per-device per-CPU
cluster management.

The existing global swap allocator uses a per-CPU device cache and
cluster, shared by all cgroups. Under this model, per-cgroup swap
priorities cannot be effectively honored on the fast path, as allocations
do not distinguish between cgroups.

To address this, we introduce per-cgroup per-CPU swap device caches.
This allows fast-path swap allocations to respect each cgroup’s
individual priority settings.

To avoid an explosion of cluster structures proportional to the number
of cgroups, clusters remain per-device and are shared across cgroups.
This strikes a balance between performance and memory overhead.
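
As a rough sketch of the two caching levels described above (the
structure names below match the ones added by this patch, but this is
only the shape of the design, not the complete code):

	/* Per cgroup, per CPU: the swap device last used per order. */
	struct percpu_swap_device {
		struct swap_info_struct *si[SWAP_NR_ORDERS];
	};

	/* Per swap device, per CPU: the likely next cluster offset. */
	struct percpu_cluster {
		unsigned int next[SWAP_NR_ORDERS];
	};

The fast path first consults the cgroup's cached device for the
requested order and, if allocation from it succeeds, reuses it; only
on a miss does it fall back to scanning the cgroup's priority list.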

Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 include/linux/swap.h      |   7 ++
 mm/swap_cgroup_priority.c | 156 +++++++++++++++++++++++++++++++++++++-
 mm/swap_cgroup_priority.h |  39 ++++++++++
 mm/swapfile.c             |  47 +++++++-----
 4 files changed, 228 insertions(+), 21 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index bfddbec2ee28..ab15f4c103a1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -283,6 +283,12 @@ enum swap_cluster_flags {
 #define SWAP_NR_ORDERS		1
 #endif
 
+#ifdef CONFIG_SWAP_CGROUP_PRIORITY
+struct percpu_cluster {
+	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
+};
+#endif
+
 /*
  * We keep using same cluster for rotational device so IO will be sequential.
  * The purpose is to optimize SWAP throughput on these device.
@@ -341,6 +347,7 @@ struct swap_info_struct {
 	struct list_head discard_clusters; /* discard clusters list */
 #ifdef CONFIG_SWAP_CGROUP_PRIORITY
 	u64 id;
+	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 #endif
 	struct plist_node avail_lists[]; /*
 					   * entries in swap_avail_heads, one
diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
index 84e876b77f01..f960c3dcab48 100644
--- a/mm/swap_cgroup_priority.c
+++ b/mm/swap_cgroup_priority.c
@@ -21,6 +21,17 @@
 #include "swap_cgroup_priority.h"
 #include "memcontrol-v1.h"
 
+/*
+ * The swap-device cache is maintained per cgroup, per CPU. The
+ * underlying cluster cache, however, stays per swap device and is
+ * shared by all cgroups, so individual swap_cgroup_priority entries
+ * do not carry their own cluster data and memory use does not grow
+ * with the number of such entries.
+ */
+struct percpu_swap_device {
+	struct swap_info_struct *si[SWAP_NR_ORDERS];
+};
+
 static DEFINE_MUTEX(swap_cgroup_priority_inherit_lck);
 static LIST_HEAD(swap_cgroup_priority_list);
 
@@ -49,6 +60,7 @@ static LIST_HEAD(swap_cgroup_priority_list);
  * least_priority - Current lowest priority.
  * distance - Priority differences from global swap priority.
  * default_prio - Default priority for this cgroup.
+ * pcpu_swapdev - Per-CPU swap device.
  * plist - Priority list head.
  */
 struct swap_cgroup_priority {
@@ -64,6 +76,7 @@ struct swap_cgroup_priority {
 	int least_priority;
 	s8 distance;
 	int default_prio;
+	struct percpu_swap_device __percpu *pcpu_swapdev;
 	struct plist_head plist[];
 };
 
@@ -132,6 +145,21 @@ static struct swap_cgroup_priority *get_effective_swap_cgroup_priority(
 	return swap_priority->effective;
 }
 
+static struct swap_cgroup_priority *get_effective_swap_cgroup_priority_rcu(
+	struct mem_cgroup *memcg)
+{
+	struct swap_cgroup_priority *swap_priority;
+
+	if (!memcg)
+		return NULL;
+
+	swap_priority = rcu_dereference(memcg->swap_priority);
+	if (!swap_priority)
+		return NULL;
+
+	return rcu_dereference(swap_priority->effective);
+}
+
 static bool validate_effective_swap_cgroup_priority(
 	struct mem_cgroup *memcg,
 	struct swap_cgroup_priority **swap_priority)
@@ -172,6 +200,9 @@ static void free_swap_cgroup_priority_pnode(
 static void free_swap_cgroup_priority(
 	struct swap_cgroup_priority *swap_priority)
 {
+	if (swap_priority->pcpu_swapdev)
+		free_percpu(swap_priority->pcpu_swapdev);
+
 	for (int i = 0; i < MAX_SWAPFILES; i++)
 		free_swap_cgroup_priority_pnode(swap_priority->pnode[i]);
 
@@ -187,6 +218,12 @@ static struct swap_cgroup_priority *alloc_swap_cgroup_priority(void)
 	if (!swap_priority)
 		return NULL;
 
+	swap_priority->pcpu_swapdev = alloc_percpu(struct percpu_swap_device);
+	if (!swap_priority->pcpu_swapdev) {
+		kvfree(swap_priority);
+		return NULL;
+	}
+
 	/*
 	 * Pre-allocates pnode array up to nr_swapfiles at init.
 	 * Individual pnodes are assigned on swapon, but not freed
@@ -326,10 +363,34 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
 	unsigned long offset;
 	int node;
 
-	/*
-	 * TODO: Per-cpu swap cluster cache can't be used directly
-	 * as cgroup-specific priorities may select different devices.
-	 */
+	rcu_read_lock();
+	if (!(swap_priority = get_effective_swap_cgroup_priority_rcu(memcg))) {
+		rcu_read_unlock();
+		return false;
+	}
+
+	/* Fast path */
+	si = this_cpu_read(swap_priority->pcpu_swapdev->si[order]);
+	if (si && get_swap_device_info(si)) {
+		offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
+		if (offset) {
+			*entry = swp_entry(si->type, offset);
+			/*
+			 * Protected by 'percpu_swap_cluster' local_lock;
+			 * CPU migration is disabled during this operation.
+			 */
+			this_cpu_write(swap_priority->pcpu_swapdev->si[order],
+				       si);
+			put_swap_device(si);
+			rcu_read_unlock();
+
+			return true;
+		}
+		put_swap_device(si);
+	}
+	rcu_read_unlock();
+
+	/* Slow path */
 	spin_lock(&swap_avail_lock);
 	node = numa_node_id();
 
@@ -350,6 +411,14 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
 		if (get_swap_device_info(si)) {
 			offset = cluster_alloc_swap_entry(si, order,
 							  SWAP_HAS_CACHE);
+			/*
+			 * Protected by 'percpu_swap_cluster' local_lock;
+			 * CPU migration is disabled during this operation.
+			 */
+			if (memcg->swap_priority == swap_priority)
+				this_cpu_write(
+					swap_priority->pcpu_swapdev->si[order],
+					si);
 			put_swap_device(si);
 			if (offset) {
 				*entry = swp_entry(si->type, offset);
@@ -687,6 +756,21 @@ static int __apply_swap_cgroup_priority(
 	return 0;
 }
 
+static int init_swap_cgroup_priority_pcpu_swapdev_cache(
+	struct swap_cgroup_priority *swap_priority)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct percpu_swap_device *pcp_swap_dev =
+			per_cpu_ptr(swap_priority->pcpu_swapdev, cpu);
+		for (int i = 0; i < SWAP_NR_ORDERS; i++)
+			pcp_swap_dev->si[i] = NULL;
+	}
+
+	return 0;
+}
+
 /*
  * If this is the top-level swap_cgroup_priority, propagation is needed.
  * We traverse the 'mem_cgroup_tree' using 'for_each_mem_cgroup_tree'.
@@ -795,6 +879,8 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 	for_each_node(nid)
 		plist_head_init(&swap_priority->plist[nid]);
 
+	init_swap_cgroup_priority_pcpu_swapdev_cache(swap_priority);
+
 prio_set:
 	spin_lock(&swap_lock);
 	spin_lock(&swap_avail_lock);
@@ -843,6 +929,23 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
 
 	spin_unlock(&swap_avail_lock);
 	spin_unlock(&swap_lock);
+	/*
+	 * XXX: We cannot fully synchronize with swap_alloc_cgroup_priority
+	 * when it updates the cached next si on other CPUs.
+	 * Still, make sure the per-CPU entries cached inside swap_priority
+	 * are flushed as reliably as possible.
+	 */
+	if (id != DEFAULT_ID &&
+	    swap_priority == swap_priority->effective && !new) {
+		int cpu;
+		struct swap_info_struct **pcp_si;
+		for_each_possible_cpu(cpu) {
+			pcp_si = per_cpu_ptr(
+				swap_priority->pcpu_swapdev->si, cpu);
+			for (int i = 0; i < SWAP_NR_ORDERS; i++)
+				pcp_si[i] = NULL;
+		}
+	}
 	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 	return 0;
 
@@ -886,3 +989,48 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 	spin_unlock(&swap_avail_lock);
 	mutex_unlock(&swap_cgroup_priority_inherit_lck);
 }
+
+void flush_swap_cgroup_priority_percpu_swapdev(struct swap_info_struct *si)
+{
+	int cpu, i;
+	struct swap_info_struct **pcp_si;
+	struct swap_cgroup_priority *swap_priority;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(swap_priority,
+				&swap_cgroup_priority_list, link) {
+		for_each_possible_cpu(cpu) {
+			pcp_si = per_cpu_ptr(
+					swap_priority->pcpu_swapdev->si, cpu);
+
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cmpxchg(&pcp_si[i], si, NULL);
+		}
+	}
+	rcu_read_unlock();
+}
+
+bool alloc_percpu_swap_cluster(struct swap_info_struct *si)
+{
+	int cpu, i;
+
+	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+	if (!si->percpu_cluster)
+		return false;
+
+	for_each_possible_cpu(cpu) {
+		struct percpu_cluster *cluster;
+
+		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+		for (i = 0; i < SWAP_NR_ORDERS; i++)
+			cluster->next[i] = SWAP_ENTRY_INVALID;
+	}
+
+	return true;
+}
+
+void free_percpu_swap_cluster(struct swap_info_struct *si)
+{
+	free_percpu(si->percpu_cluster);
+	si->percpu_cluster = NULL;
+}
diff --git a/mm/swap_cgroup_priority.h b/mm/swap_cgroup_priority.h
index 5d16b63d12e0..815822ebd0d1 100644
--- a/mm/swap_cgroup_priority.h
+++ b/mm/swap_cgroup_priority.h
@@ -47,6 +47,22 @@ struct swap_cgroup_priority *inherit_swap_cgroup_priority(
 bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg, swp_entry_t *entry,
 				int order);
 void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
+void flush_swap_cgroup_priority_percpu_swapdev(struct swap_info_struct *si);
+
+bool alloc_percpu_swap_cluster(struct swap_info_struct *si);
+void free_percpu_swap_cluster(struct swap_info_struct *si);
+static inline void write_percpu_swap_cluster_next(struct swap_info_struct *si,
+						  int order,
+						  unsigned int next)
+{
+	this_cpu_write(si->percpu_cluster->next[order], next);
+}
+
+static inline unsigned int read_percpu_swap_cluster_next(
+	struct swap_info_struct *si, int order)
+{
+	return __this_cpu_read(si->percpu_cluster->next[order]);
+}
 #else
 int swap_node(struct swap_info_struct *si);
 unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
@@ -85,5 +101,28 @@ static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
 static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
 {
 }
+static inline void flush_swap_cgroup_priority_percpu_swapdev(
+	struct swap_info_struct *si)
+{
+}
+static inline bool alloc_percpu_swap_cluster(struct swap_info_struct *si)
+{
+	return true;
+}
+static inline void free_percpu_swap_cluster(struct swap_info_struct *si)
+{
+}
+static inline void write_percpu_swap_cluster_next(struct swap_info_struct *si,
+						  int order,
+						  unsigned int next)
+{
+	return;
+}
+
+static inline unsigned int read_percpu_swap_cluster_next(
+	struct swap_info_struct *si, int order)
+{
+	return SWAP_ENTRY_INVALID;
+}
 #endif
 #endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bfd0532ad250..6a5ac9962e9f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -817,12 +817,15 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
+
 	if (si->flags & SWP_SOLIDSTATE) {
 		this_cpu_write(percpu_swap_cluster.offset[order], next);
 		this_cpu_write(percpu_swap_cluster.si[order], si);
+		write_percpu_swap_cluster_next(si, order, next);
 	} else {
 		si->global_cluster->next[order] = next;
 	}
+
 	return found;
 }
 
@@ -892,26 +895,29 @@ unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 	if (order && !(si->flags & SWP_BLKDEV))
 		return 0;
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+		offset = read_percpu_swap_cluster_next(si, order);
+	} else {
 		/* Serialize HDD SWAP allocation for each device. */
 		spin_lock(&si->global_cluster_lock);
 		offset = si->global_cluster->next[order];
-		if (offset == SWAP_ENTRY_INVALID)
-			goto new_cluster;
+	}
 
-		ci = lock_cluster(si, offset);
-		/* Cluster could have been used by another order */
-		if (cluster_is_usable(ci, order)) {
-			if (cluster_is_empty(ci))
-				offset = cluster_offset(si, ci);
-			found = alloc_swap_scan_cluster(si, ci, offset,
-							order, usage);
-		} else {
-			unlock_cluster(ci);
-		}
-		if (found)
-			goto done;
+	if (offset == SWAP_ENTRY_INVALID)
+		goto new_cluster;
+
+	ci = lock_cluster(si, offset);
+	/* Cluster could have been used by another order */
+	if (cluster_is_usable(ci, order)) {
+		if (cluster_is_empty(ci))
+			offset = cluster_offset(si, ci);
+		found = alloc_swap_scan_cluster(si, ci, offset,
+						order, usage);
+	} else {
+		unlock_cluster(ci);
 	}
+	if (found)
+		goto done;
 
 new_cluster:
 	ci = isolate_lock_cluster(si, &si->free_clusters);
@@ -991,6 +997,7 @@ unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 done:
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster_lock);
+
 	return found;
 }
 
@@ -2674,6 +2681,8 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si)
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
 			cmpxchg(&pcp_si[i], si, NULL);
 	}
+
+	flush_swap_cgroup_priority_percpu_swapdev(si);
 }
 
 
@@ -2802,6 +2811,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
+	free_percpu_swap_cluster(p);
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
 	vfree(swap_map);
@@ -2900,7 +2910,6 @@ static void swap_stop(struct seq_file *swap, void *v)
 	mutex_unlock(&swapon_mutex);
 }
 
-
 #ifndef CONFIG_SWAP_CGROUP_PRIORITY
 static int swap_show(struct seq_file *swap, void *v)
 {
@@ -3239,7 +3248,10 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+		if (!alloc_percpu_swap_cluster(si))
+			goto err_free;
+	} else {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
 				     GFP_KERNEL);
 		if (!si->global_cluster)
@@ -3532,6 +3544,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
+	free_percpu_swap_cluster(si);
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-16 20:20 ` [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority Youngjun Park
@ 2025-07-17 11:20   ` kernel test robot
  2025-07-22 14:09     ` YoungJun Park
  2025-07-18 17:08   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 39+ messages in thread
From: kernel test robot @ 2025-07-17 11:20 UTC (permalink / raw)
  To: Youngjun Park, akpm, hannes
  Cc: oe-kbuild-all, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, cgroups,
	linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song,
	Youngjun Park, Michal Koutný

Hi Youngjun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 347e9f5043c89695b01e66b3ed111755afcf1911]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-memcg-Introduce-infrastructure-for-cgroup-based-swap-priority/20250717-042648
base:   347e9f5043c89695b01e66b3ed111755afcf1911
patch link:    https://lore.kernel.org/r/20250716202006.3640584-2-youngjun.park%40lge.com
patch subject: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20250717/202507171936.fGW4muEc-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 16534d19bf50bde879a83f0ae62875e2c5120e64)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250717/202507171936.fGW4muEc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507171936.fGW4muEc-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/memcontrol.c:5462:12: warning: variable 'id' is uninitialized when used here [-Wuninitialized]
    5462 |                                 memcg, id, SWAP_PRIORITY_GLOBAL);
         |                                        ^~
   mm/memcontrol.c:5414:8: note: initialize the variable 'id' to silence this warning
    5414 |         u64 id;
         |               ^
         |                = 0
   1 warning generated.


vim +/id +5462 mm/memcontrol.c

  5408	
  5409	#ifdef CONFIG_SWAP_CGROUP_PRIORITY
  5410	static ssize_t swap_cgroup_priority_write(struct kernfs_open_file *of,
  5411						  char *buf, size_t nbytes, loff_t off)
  5412	{
  5413		struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
  5414		u64 id;
  5415		int prio;
  5416		int ret;
  5417		char first_token[32];
  5418		char second_token[32];
  5419		char dummy[2];
  5420		char *stripped_buf;
  5421		int num_parsed;
  5422	
  5423		stripped_buf = strstrip(buf);
  5424		num_parsed = sscanf(stripped_buf, "%31s %31s %1s", first_token,
  5425				    second_token, dummy);
  5426		if (num_parsed == 2) {
  5427			if (strcmp(first_token, "default") == 0) {
  5428				if (strcmp(second_token, "none") == 0)
  5429					ret = apply_swap_cgroup_priority(
  5430						memcg, DEFAULT_ID, SWAP_PRIORITY_GLOBAL);
  5431				else if (strcmp(second_token, "disabled") == 0)
  5432					ret = apply_swap_cgroup_priority(
  5433						memcg, DEFAULT_ID, SWAP_PRIORITY_DISABLE);
  5434				else
  5435					ret = -EINVAL;
  5436			} else {
  5437				ret = kstrtoull(first_token, 10, &id);
  5438				if (ret)
  5439					return -EINVAL;
  5440	
  5441				if (strcmp(second_token, "none") == 0) {
  5442					ret = apply_swap_cgroup_priority(
  5443						memcg, id, SWAP_PRIORITY_GLOBAL);
  5444				} else if (strcmp(second_token, "disabled") == 0) {
  5445					ret = apply_swap_cgroup_priority(
  5446						memcg, id, SWAP_PRIORITY_DISABLE);
  5447				} else {
  5448					ret = kstrtoint(second_token, 10, &prio);
  5449					if (ret)
  5450						return -EINVAL;
  5451					if (prio == -1)
  5452						return -EINVAL;
  5453					else if (prio > SHRT_MAX || prio < SHRT_MIN)
  5454						return -EINVAL;
  5455					ret = apply_swap_cgroup_priority(memcg, id,
  5456									 prio);
  5457				}
  5458			}
  5459		} else if (num_parsed == 1) {
  5460			if (strcmp(first_token, "none") == 0)
  5461				ret = apply_swap_cgroup_priority(
> 5462					memcg, id, SWAP_PRIORITY_GLOBAL);
  5463			else if (strcmp(first_token, "disabled") == 0)
  5464				ret = apply_swap_cgroup_priority(
  5465					memcg, id, SWAP_PRIORITY_DISABLE);
  5466			else
  5467				ret = -EINVAL;
  5468		} else {
  5469			return -EINVAL;
  5470		}
  5471	
  5472		if (ret)
  5473			return ret;
  5474	
  5475		return nbytes;
  5476	}
  5477	
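
A possible fix for the warning (an assumption about the intended
semantics: the single-token "none"/"disabled" forms look like they
should act on the cgroup default, as the two-token "default" branch
does, rather than use the uninitialized 'id'):

	} else if (num_parsed == 1) {
		/* No per-device id in this form; use DEFAULT_ID. */
		if (strcmp(first_token, "none") == 0)
			ret = apply_swap_cgroup_priority(
				memcg, DEFAULT_ID, SWAP_PRIORITY_GLOBAL);
		else if (strcmp(first_token, "disabled") == 0)
			ret = apply_swap_cgroup_priority(
				memcg, DEFAULT_ID, SWAP_PRIORITY_DISABLE);
		else
			ret = -EINVAL;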

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-16 20:20 ` [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority Youngjun Park
  2025-07-17 11:20   ` kernel test robot
@ 2025-07-18 17:08   ` kernel test robot
  2025-07-22 14:11     ` YoungJun Park
  2025-07-21 15:13   ` kernel test robot
  2025-07-22  8:41   ` Michal Koutný
  3 siblings, 1 reply; 39+ messages in thread
From: kernel test robot @ 2025-07-18 17:08 UTC (permalink / raw)
  To: Youngjun Park, akpm, hannes
  Cc: oe-kbuild-all, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, cgroups,
	linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song,
	Youngjun Park, Michal Koutný

Hi Youngjun,

kernel test robot noticed the following build errors:

[auto build test ERROR on 347e9f5043c89695b01e66b3ed111755afcf1911]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-memcg-Introduce-infrastructure-for-cgroup-based-swap-priority/20250717-042648
base:   347e9f5043c89695b01e66b3ed111755afcf1911
patch link:    https://lore.kernel.org/r/20250716202006.3640584-2-youngjun.park%40lge.com
patch subject: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
config: sparc64-randconfig-r054-20250718 (https://download.01.org/0day-ci/archive/20250719/202507190037.RCDNmMsJ-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250719/202507190037.RCDNmMsJ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507190037.RCDNmMsJ-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/rbtree.h:24,
                    from include/linux/mm_types.h:11,
                    from include/linux/mmzone.h:22,
                    from include/linux/swap.h:7,
                    from mm/swap_cgroup_priority.c:16:
   mm/swap_cgroup_priority.c: In function 'get_swap_cgroup_priority':
>> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                                     ^~
   include/linux/rcupdate.h:532:17: note: in definition of macro '__rcu_dereference_check'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                 ^
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                ^~~~~~~~~~~~~~~
>> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                                     ^~
   include/linux/rcupdate.h:532:38: note: in definition of macro '__rcu_dereference_check'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                      ^
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                ^~~~~~~~~~~~~~~
   In file included from <command-line>:
>> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                                     ^~
   include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
     548 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
     568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                  ^~~~~~~~~
   include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
     680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                ^~~~~~~~~~~~~~~
>> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                                     ^~
   include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
     548 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
     568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                  ^~~~~~~~~
   include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
     680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                ^~~~~~~~~~~~~~~
>> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                                     ^~
   include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
     548 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
     568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                  ^~~~~~~~~
   include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
     680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                ^~~~~~~~~~~~~~~
>> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                                     ^~
   include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
     548 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
     568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                  ^~~~~~~~~
   include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
     680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                ^~~~~~~~~~~~~~~
>> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                                     ^~
   include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
     548 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
     568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                  ^~~~~~~~~
   include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
     680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                ^~~~~~~~~~~~~~~
>> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                                     ^~
   include/linux/compiler_types.h:518:27: note: in definition of macro '__unqual_scalar_typeof'
     518 |                 _Generic((x),                                           \
         |                           ^
   include/asm-generic/rwonce.h:50:9: note: in expansion of macro '__READ_ONCE'
      50 |         __READ_ONCE(x);                                                 \
         |         ^~~~~~~~~~~
   include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                  ^~~~~~~~~
   include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
     680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                ^~~~~~~~~~~~~~~
   In file included from ./arch/sparc/include/generated/asm/rwonce.h:1,
                    from include/linux/compiler.h:390,
                    from include/linux/export.h:5,
                    from include/linux/linkage.h:7,
                    from include/linux/preempt.h:10,
                    from include/linux/spinlock.h:56,
                    from include/linux/swap.h:5:
>> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                                     ^~
   include/asm-generic/rwonce.h:44:73: note: in definition of macro '__READ_ONCE'
      44 | #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
         |                                                                         ^
   include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                  ^~~~~~~~~
   include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
     680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                ^~~~~~~~~~~~~~~
>> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                                     ^~
   include/linux/rcupdate.h:535:19: note: in definition of macro '__rcu_dereference_check'
     535 |         ((typeof(*p) __force __kernel *)(local)); \
         |                   ^
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
     115 |         return rcu_dereference(memcg->swap_priority);
         |                ^~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c: In function 'show_swap_cgroup_priority':
   mm/swap_cgroup_priority.c:186:30: error: invalid use of undefined type 'struct mem_cgroup'
     186 |         swap_priority = memcg->swap_priority;
         |                              ^~
   mm/swap_cgroup_priority.c: In function 'swap_alloc_cgroup_priority':
   mm/swap_cgroup_priority.c:285:26: error: invalid use of undefined type 'struct mem_cgroup'
     285 |                 if (memcg->swap_priority != swap_priority)
         |                          ^~
   mm/swap_cgroup_priority.c: In function 'apply_swap_cgroup_priority':
   mm/swap_cgroup_priority.c:638:46: error: invalid use of undefined type 'struct mem_cgroup'
     638 |         swap_priority = rcu_dereference(memcg->swap_priority);
         |                                              ^~
   include/linux/rcupdate.h:532:17: note: in definition of macro '__rcu_dereference_check'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                 ^
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:638:25: note: in expansion of macro 'rcu_dereference'
     638 |         swap_priority = rcu_dereference(memcg->swap_priority);
         |                         ^~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:638:46: error: invalid use of undefined type 'struct mem_cgroup'
     638 |         swap_priority = rcu_dereference(memcg->swap_priority);
         |                                              ^~
   include/linux/rcupdate.h:532:38: note: in definition of macro '__rcu_dereference_check'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                      ^
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:638:25: note: in expansion of macro 'rcu_dereference'
     638 |         swap_priority = rcu_dereference(memcg->swap_priority);
         |                         ^~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:638:46: error: invalid use of undefined type 'struct mem_cgroup'
     638 |         swap_priority = rcu_dereference(memcg->swap_priority);
         |                                              ^~
   include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
     548 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
     568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                  ^~~~~~~~~
   include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
     680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
         |                            ^~~~~~~~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:638:25: note: in expansion of macro 'rcu_dereference'
     638 |         swap_priority = rcu_dereference(memcg->swap_priority);
         |                         ^~~~~~~~~~~~~~~
   mm/swap_cgroup_priority.c:638:46: error: invalid use of undefined type 'struct mem_cgroup'
     638 |         swap_priority = rcu_dereference(memcg->swap_priority);
         |                                              ^~
   include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
     548 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
     568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
     532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
         |                                                  ^~~~~~~~~
   include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
     680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
     752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)


vim +115 mm/swap_cgroup_priority.c

  > 16	#include <linux/swap.h>
    17	#include <linux/rcupdate.h>
    18	#include <linux/memcontrol.h>
    19	#include <linux/plist.h>
    20	#include "swap.h"
    21	#include "swap_cgroup_priority.h"
    22	#include "memcontrol-v1.h"
    23	
    24	static LIST_HEAD(swap_cgroup_priority_list);
    25	
    26	/*
    27	 * struct swap_cgroup_priority
    28	 *
    29	 * This structure is RCU protected. Its lifecycle is determined by its
    30	 * owning memcg or when its 'distance' reaches zero. The 'distance' field
    31	 * tracks priority differences from global swap. If zero, and its default_prio
    32	 * follows global swap priority(SWAP_PRIORITY_GLOBAL), the object is destroyed.
    33	 *
    34	 * pnode - Array of pointers to swap device priority nodes.
    35	 * owner - The owning memory cgroup.
    36	 * rcu - RCU free callback.
    37	 * link - Global linked list entry.
    38	 * least_priority - Current lowest priority.
    39	 * distance - Priority differences from global swap priority.
    40	 * default_prio - Default priority for this cgroup.
    41	 * plist - Priority list head.
    42	 */
    43	struct swap_cgroup_priority {
    44		struct swap_cgroup_priority_pnode *pnode[MAX_SWAPFILES];
    45		struct mem_cgroup *owner;
    46	
    47		union {
    48			struct rcu_head rcu;
    49			struct list_head link;
    50		};
    51	
    52		int least_priority;
    53		s8 distance;
    54		int default_prio;
    55		struct plist_head plist[];
    56	};
    57	
    58	/*
    59	 * struct swap_cgroup_priority_pnode
    60	 *
    61	 * This structure represents a priority node for a specific swap device
    62	 * within a cgroup.
    63	 *
    64	 * swap - Pointer to the associated swap device.
    65	 * id - Unique identifier for the swap device.
    66	 * prio - Configured priority for this device.
    67	 * avail_lists - Connections to various priority lists.
    68	 */
    69	struct swap_cgroup_priority_pnode {
    70		struct swap_info_struct *swap;
    71		u64 id;
    72		signed short prio;
    73		struct plist_node avail_lists[];
    74	};
    75	
    76	/*
    77	 * Even with a zero distance, a swap device isn't assigned if it doesn't
    78	 * meet global swap priority conditions; thus, we don't clear it.
    79	 */
    80	static bool should_clear_swap_cgroup_priority(
    81		struct swap_cgroup_priority *swap_priority)
    82	{
    83		WARN_ON_ONCE(swap_priority->distance < 0 ||
    84			swap_priority->distance > MAX_SWAPFILES);
    85	
    86		if (swap_priority->distance == 0 &&
    87		    swap_priority->default_prio == SWAP_PRIORITY_GLOBAL)
    88			return true;
    89	
    90		return false;
    91	}
    92	
    93	/*
    94	 * swapdev_id
    95	 *
    96	 * A unique identifier for a swap device.
    97	 *
    98	 * This ID ensures stable identification for users and crucial synchronization
    99	 * for swap cgroup priority settings. It provides a reliable reference even if
   100	 * device paths or numbers change.
   101	 */
   102	static atomic64_t swapdev_id_counter;
   103	
   104	void get_swapdev_id(struct swap_info_struct *si)
   105	{
   106		si->id = atomic64_inc_return(&swapdev_id_counter);
   107	}
   108	
   109	static struct swap_cgroup_priority *get_swap_cgroup_priority(
   110		struct mem_cgroup *memcg)
   111	{
   112		if (!memcg)
   113			return NULL;
   114	
 > 115		return rcu_dereference(memcg->swap_priority);
   116	}
   117	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-16 20:20 ` [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority Youngjun Park
  2025-07-17 11:20   ` kernel test robot
  2025-07-18 17:08   ` kernel test robot
@ 2025-07-21 15:13   ` kernel test robot
  2025-07-22 14:14     ` YoungJun Park
  2025-07-22  8:41   ` Michal Koutný
  3 siblings, 1 reply; 39+ messages in thread
From: kernel test robot @ 2025-07-21 15:13 UTC (permalink / raw)
  To: Youngjun Park, akpm, hannes
  Cc: oe-kbuild-all, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, cgroups,
	linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song,
	Youngjun Park, Michal Koutný

Hi Youngjun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 347e9f5043c89695b01e66b3ed111755afcf1911]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-memcg-Introduce-infrastructure-for-cgroup-based-swap-priority/20250717-042648
base:   347e9f5043c89695b01e66b3ed111755afcf1911
patch link:    https://lore.kernel.org/r/20250716202006.3640584-2-youngjun.park%40lge.com
patch subject: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
config: loongarch-randconfig-r123-20250721 (https://download.01.org/0day-ci/archive/20250721/202507212243.Lf8fSo0T-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce: (https://download.01.org/0day-ci/archive/20250721/202507212243.Lf8fSo0T-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507212243.Lf8fSo0T-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> mm/swap_cgroup_priority.c:115:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/swap_cgroup_priority.c:115:16: sparse:    struct swap_cgroup_priority [noderef] __rcu *
   mm/swap_cgroup_priority.c:115:16: sparse:    struct swap_cgroup_priority *
   mm/swap_cgroup_priority.c:729:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/swap_cgroup_priority.c:729:9: sparse:    struct swap_cgroup_priority [noderef] __rcu *
   mm/swap_cgroup_priority.c:729:9: sparse:    struct swap_cgroup_priority *
   mm/swap_cgroup_priority.c:638:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/swap_cgroup_priority.c:638:25: sparse:    struct swap_cgroup_priority [noderef] __rcu *
   mm/swap_cgroup_priority.c:638:25: sparse:    struct swap_cgroup_priority *

vim +115 mm/swap_cgroup_priority.c

   108	
   109	static struct swap_cgroup_priority *get_swap_cgroup_priority(
   110		struct mem_cgroup *memcg)
   111	{
   112		if (!memcg)
   113			return NULL;
   114	
 > 115		return rcu_dereference(memcg->swap_priority);
   116	}
   117	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-16 20:20 ` [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority Youngjun Park
                     ` (2 preceding siblings ...)
  2025-07-21 15:13   ` kernel test robot
@ 2025-07-22  8:41   ` Michal Koutný
  2025-07-22 14:05     ` YoungJun Park
  3 siblings, 1 reply; 39+ messages in thread
From: Michal Koutný @ 2025-07-22  8:41 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, cgroups,
	linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song

[-- Attachment #1: Type: text/plain, Size: 2392 bytes --]

On Thu, Jul 17, 2025 at 05:20:03AM +0900, Youngjun Park <youngjun.park@lge.com> wrote:
> +  memory.swap.priority
> +    A read-write flat-keyed file which exists on non-root cgroups.
> +    This interface allows you to set per-swap-device priorities for the current
> +    cgroup and to define how they differ from the global swap system.
> +
> +    To assign priorities or define specific behaviors for swap devices
> +    in the current cgroup, write one or more lines in the following
> +    formats:
> +
> +     - <swap_device_id> <priority>
> +     - <swap_device_id> disabled
> +     - <swap_device_id> none
> +     - default none
> +     - default disabled
> +
> +    Each <swap_device_id> refers to a unique swap device registered
> +    in the system. You can check the ID, device path, and current
> +    priority of active swap devices through the `/proc/swaps` file.

Do you mean row number as the ID? Or does this depend on some other
patches or API?


> +    This provides a clear mapping between swap devices and the IDs
> +    used in this interface.
> +
> +    The 'default' keyword sets the fallback priority behavior rule for
> +    this cgroup. If no specific entry matches a swap device, this default
> +    applies.
> +
> +    * 'default none': This is the default if no configuration
> +      is explicitly written. Swap devices follow the system-wide
> +      swap priorities.
> +
> +    * 'default disabled': All swap devices are excluded from this cgroup’s
> +      swap priority list and will not be used by this cgroup.

This duplicates memory.swap.max=0. I'm not sure it's thus necessary.
At the same time you don't accept 'default <priority>' (that's sane).


> +
> +    The priority semantics are consistent with the global swap system:
> +
> +      - Higher numerical values indicate higher preference.
> +      - See Documentation/admin-guide/mm/swap_numa.rst for details on
> +        swap NUMA autobinding and negative priority rules.
> +
> +    The handling of negative priorities in this cgroup interface
> +    has specific behaviors for assignment and restoration:
> +
> +    * Negative Priority Assignment

Even in Documentation/admin-guide/mm/swap_numa.rst it's part of "Implementation details".
I admit I'm daunted by these paragraphs. Is it important for this interface?


Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-22  8:41   ` Michal Koutný
@ 2025-07-22 14:05     ` YoungJun Park
  2025-07-22 18:41       ` YoungJun Park
  0 siblings, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-07-22 14:05 UTC (permalink / raw)
  To: Michal Koutný
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, cgroups,
	linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song

On Tue, Jul 22, 2025 at 10:41:20AM +0200, Michal Koutný wrote:
> On Thu, Jul 17, 2025 at 05:20:03AM +0900, Youngjun Park <youngjun.park@lge.com> wrote:
> > +  memory.swap.priority
> > +    A read-write flat-keyed file which exists on non-root cgroups.
> > +    This interface allows you to set per-swap-device priorities for the current
> > +    cgroup and to define how they differ from the global swap system.
> > +
> > +    To assign priorities or define specific behaviors for swap devices
> > +    in the current cgroup, write one or more lines in the following
> > +    formats:
> > +
> > +     - <swap_device_id> <priority>
> > +     - <swap_device_id> disabled
> > +     - <swap_device_id> none
> > +     - default none
> > +     - default disabled
> > +
> > +    Each <swap_device_id> refers to a unique swap device registered
> > +    in the system. You can check the ID, device path, and current
> > +    priority of active swap devices through the `/proc/swaps` file.
> 
> Do you mean row number as the ID? Or does this depend on some other
> patches or API?

You're right to ask for clarification. The `<swap_device_id>` refers
to a unique identifier added to each swap device entry in `/proc/swaps`.
I will revise the documentation to make this clearer.

As a side note, I initially had concerns about breaking the existing ABI.
However, the additional ID column does not significantly change the
current output format and is gated behind `CONFIG_SWAP_CGROUP_PRIORITY`,
so it should be safe and intuitive to expose it through `/proc/swaps`.
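
For illustration only (the column layout and values below are hypothetical,
not necessarily what the current patch emits), the extended output could
look something like:

  Filename        Type        Size     Used  Priority  Id
  /dev/zram0      partition   1048572  0     100       1
  /dev/sdb2       partition   2097148  0     -2        2

where the trailing Id column carries the <swap_device_id> accepted by
memory.swap.priority.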

> > +    This provides a clear mapping between swap devices and the IDs
> > +    used in this interface.
> > +
> > +    The 'default' keyword sets the fallback priority behavior rule for
> > +    this cgroup. If no specific entry matches a swap device, this default
> > +    applies.
> > +
> > +    * 'default none': This is the default if no configuration
> > +      is explicitly written. Swap devices follow the system-wide
> > +      swap priorities.
> > +
> > +    * 'default disabled': All swap devices are excluded from this cgroup’s
> > +      swap priority list and will not be used by this cgroup.
> 
> This duplicates memory.swap.max=0. I'm not sure it's thus necessary.
> At the same time you don't accept 'default <priority>' (that's sane).

That's a valid observation. While `memory.swap.max=0` limits the total
amount of swap a cgroup may use, `default disabled` only excludes the
devices that have no explicit per-device entry, so a cgroup can be
restricted to a chosen subset of swap devices. The motivation was to
offer more granular control over device selection rather than over
total swap usage.
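
For illustration, assuming hypothetical device IDs 1 and 2 from
/proc/swaps, and assuming each write adds or updates a single entry, a
cgroup could be restricted to one device like this:

  # exclude all devices by default, then re-enable only device 2
  echo "default disabled" > memory.swap.priority
  echo "2 100" > memory.swap.priority

whereas memory.swap.max=0 would stop the cgroup from swapping at all.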

> > +
> > +    The priority semantics are consistent with the global swap system:
> > +
> > +      - Higher numerical values indicate higher preference.
> > +      - See Documentation/admin-guide/mm/swap_numa.rst for details on
> > +        swap NUMA autobinding and negative priority rules.
> > +
> > +    The handling of negative priorities in this cgroup interface
> > +    has specific behaviors for assignment and restoration:
> > +
> > +    * Negative Priority Assignment
> 
> Even in Documentation/admin-guide/mm/swap_numa.rst it's part of "Implementation details".
> I admit I'm daunted by these paragraphs. Is it important for this interface?

Thank you for pointing this out. My original philosophy was to preserve
as much of the existing swap functionality as possible, including
NUMA-aware behaviors.

However, I agree that the explanation is complex and may not be
necessary for my proposed usage. After some reflection, I believe the
implementation (and documentation) will be clearer and simpler without
supporting negative priorities here.

Unless further objections arise, I plan to drop this behavior in the next
version of the patch, as you suggested. If compelling use cases emerge in
the future, we can consider reintroducing the support at that time.

Thanks again for your helpful review!

Best regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-17 11:20   ` kernel test robot
@ 2025-07-22 14:09     ` YoungJun Park
  0 siblings, 0 replies; 39+ messages in thread
From: YoungJun Park @ 2025-07-22 14:09 UTC (permalink / raw)
  To: kernel test robot
  Cc: akpm, hannes, oe-kbuild-all, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, shikemeng, kasong, nphamcs, bhe, baohua, chrisl,
	cgroups, linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim,
	taejoon.song, Michal Koutný

On Thu, Jul 17, 2025 at 07:20:58PM +0800, kernel test robot wrote:
> Hi Youngjun,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on 347e9f5043c89695b01e66b3ed111755afcf1911]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-memcg-Introduce-infrastructure-for-cgroup-based-swap-priority/20250717-042648
> base:   347e9f5043c89695b01e66b3ed111755afcf1911
> patch link:    https://lore.kernel.org/r/20250716202006.3640584-2-youngjun.park%40lge.com
> patch subject: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
> config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20250717/202507171936.fGW4muEc-lkp@intel.com/config)
> compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 16534d19bf50bde879a83f0ae62875e2c5120e64)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250717/202507171936.fGW4muEc-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202507171936.fGW4muEc-lkp@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
> >> mm/memcontrol.c:5462:12: warning: variable 'id' is uninitialized when used here [-Wuninitialized]
>     5462 |                                 memcg, id, SWAP_PRIORITY_GLOBAL);
>          |                                        ^~
>    mm/memcontrol.c:5414:8: note: initialize the variable 'id' to silence this warning
>     5414 |         u64 id;
>          |               ^
>          |                = 0
>    1 warning generated.
> 
> 
> vim +/id +5462 mm/memcontrol.c
> 
>   5408	
>   5409	#ifdef CONFIG_SWAP_CGROUP_PRIORITY
>   5410	static ssize_t swap_cgroup_priority_write(struct kernfs_open_file *of,
>   5411						  char *buf, size_t nbytes, loff_t off)
>   5412	{
>   5413		struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>   5414		u64 id;
>   5415		int prio;
>   5416		int ret;
>   5417		char first_token[32];
>   5418		char second_token[32];
>   5419		char dummy[2];
>   5420		char *stripped_buf;
>   5421		int num_parsed;
>   5422	
>   5423		stripped_buf = strstrip(buf);
>   5424		num_parsed = sscanf(stripped_buf, "%31s %31s %1s", first_token,
>   5425				    second_token, dummy);
>   5426		if (num_parsed == 2) {
>   5427			if (strcmp(first_token, "default") == 0) {
>   5428				if (strcmp(second_token, "none") == 0)
>   5429					ret = apply_swap_cgroup_priority(
>   5430						memcg, DEFAULT_ID, SWAP_PRIORITY_GLOBAL);
>   5431				else if (strcmp(second_token, "disabled") == 0)
>   5432					ret = apply_swap_cgroup_priority(
>   5433						memcg, DEFAULT_ID, SWAP_PRIORITY_DISABLE);
>   5434				else
>   5435					ret = -EINVAL;
>   5436			} else {
>   5437				ret = kstrtoull(first_token, 10, &id);
>   5438				if (ret)
>   5439					return -EINVAL;
>   5440	
>   5441				if (strcmp(second_token, "none") == 0) {
>   5442					ret = apply_swap_cgroup_priority(
>   5443						memcg, id, SWAP_PRIORITY_GLOBAL);
>   5444				} else if (strcmp(second_token, "disabled") == 0) {
>   5445					ret = apply_swap_cgroup_priority(
>   5446						memcg, id, SWAP_PRIORITY_DISABLE);
>   5447				} else {
>   5448					ret = kstrtoint(second_token, 10, &prio);
>   5449					if (ret)
>   5450						return -EINVAL;
>   5451					if (prio == -1)
>   5452						return -EINVAL;
>   5453					else if (prio > SHRT_MAX || prio < SHRT_MIN)
>   5454						return -EINVAL;
>   5455					ret = apply_swap_cgroup_priority(memcg, id,
>   5456									 prio);
>   5457				}
>   5458			}
>   5459		} else if (num_parsed == 1) {
>   5460			if (strcmp(first_token, "none") == 0)
>   5461				ret = apply_swap_cgroup_priority(
> > 5462					memcg, id, SWAP_PRIORITY_GLOBAL);
>   5463			else if (strcmp(first_token, "disabled") == 0)
>   5464				ret = apply_swap_cgroup_priority(
>   5465					memcg, id, SWAP_PRIORITY_DISABLE);
>   5466			else
>   5467				ret = -EINVAL;
>   5468		} else {
>   5469			return -EINVAL;
>   5470		}
>   5471	
>   5472		if (ret)
>   5473			return ret;
>   5474	
>   5475		return nbytes;
>   5476	}
>   5477	

This is an initialization bug: in the single-token case `id` is used
uninitialized, so the intended "default" handling is not applied
correctly for writes such as:

  echo none > memory.swap.priority

I should have checked this more carefully. I will fix the issue and add
a test case in the next patch revision.
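
A minimal sketch of the fix I have in mind, assuming the bare
"none"/"disabled" forms are meant to act on the cgroup default (i.e. the
same DEFAULT_ID path as the explicit "default ..." forms):

	} else if (num_parsed == 1) {
		/* Bare "none"/"disabled" act on the cgroup default, not on 'id'. */
		if (strcmp(first_token, "none") == 0)
			ret = apply_swap_cgroup_priority(
				memcg, DEFAULT_ID, SWAP_PRIORITY_GLOBAL);
		else if (strcmp(first_token, "disabled") == 0)
			ret = apply_swap_cgroup_priority(
				memcg, DEFAULT_ID, SWAP_PRIORITY_DISABLE);
		else
			ret = -EINVAL;
	}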

Best regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-18 17:08   ` kernel test robot
@ 2025-07-22 14:11     ` YoungJun Park
  0 siblings, 0 replies; 39+ messages in thread
From: YoungJun Park @ 2025-07-22 14:11 UTC (permalink / raw)
  To: kernel test robot
  Cc: akpm, hannes, oe-kbuild-all, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, shikemeng, kasong, nphamcs, bhe, baohua, chrisl,
	cgroups, linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim,
	taejoon.song, Michal Koutný

On Sat, Jul 19, 2025 at 01:08:46AM +0800, kernel test robot wrote:
> Hi Youngjun,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on 347e9f5043c89695b01e66b3ed111755afcf1911]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-memcg-Introduce-infrastructure-for-cgroup-based-swap-priority/20250717-042648
> base:   347e9f5043c89695b01e66b3ed111755afcf1911
> patch link:    https://lore.kernel.org/r/20250716202006.3640584-2-youngjun.park%40lge.com
> patch subject: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
> config: sparc64-randconfig-r054-20250718 (https://download.01.org/0day-ci/archive/20250719/202507190037.RCDNmMsJ-lkp@intel.com/config)
> compiler: sparc64-linux-gcc (GCC) 15.1.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250719/202507190037.RCDNmMsJ-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202507190037.RCDNmMsJ-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
>    In file included from include/linux/rbtree.h:24,
>                     from include/linux/mm_types.h:11,
>                     from include/linux/mmzone.h:22,
>                     from include/linux/swap.h:7,
>                     from mm/swap_cgroup_priority.c:16:
>    mm/swap_cgroup_priority.c: In function 'get_swap_cgroup_priority':
> >> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                                     ^~
>    include/linux/rcupdate.h:532:17: note: in definition of macro '__rcu_dereference_check'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                 ^
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                ^~~~~~~~~~~~~~~
> >> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                                     ^~
>    include/linux/rcupdate.h:532:38: note: in definition of macro '__rcu_dereference_check'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                      ^
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                ^~~~~~~~~~~~~~~
>    In file included from <command-line>:
> >> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                                     ^~
>    include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
>      548 |                 if (!(condition))                                       \
>          |                       ^~~~~~~~~
>    include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
>      568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |         ^~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |                            ^~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
>       49 |         compiletime_assert_rwonce_type(x);                              \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                                  ^~~~~~~~~
>    include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
>      680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                ^~~~~~~~~~~~~~~
> >> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                                     ^~
>    include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
>      548 |                 if (!(condition))                                       \
>          |                       ^~~~~~~~~
>    include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
>      568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |         ^~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |                            ^~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
>       49 |         compiletime_assert_rwonce_type(x);                              \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                                  ^~~~~~~~~
>    include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
>      680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                ^~~~~~~~~~~~~~~
> >> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                                     ^~
>    include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
>      548 |                 if (!(condition))                                       \
>          |                       ^~~~~~~~~
>    include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
>      568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |         ^~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |                            ^~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
>       49 |         compiletime_assert_rwonce_type(x);                              \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                                  ^~~~~~~~~
>    include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
>      680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                ^~~~~~~~~~~~~~~
> >> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                                     ^~
>    include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
>      548 |                 if (!(condition))                                       \
>          |                       ^~~~~~~~~
>    include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
>      568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |         ^~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |                            ^~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
>       49 |         compiletime_assert_rwonce_type(x);                              \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                                  ^~~~~~~~~
>    include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
>      680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                ^~~~~~~~~~~~~~~
> >> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                                     ^~
>    include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
>      548 |                 if (!(condition))                                       \
>          |                       ^~~~~~~~~
>    include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
>      568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |         ^~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
>       49 |         compiletime_assert_rwonce_type(x);                              \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                                  ^~~~~~~~~
>    include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
>      680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                ^~~~~~~~~~~~~~~
> >> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                                     ^~
>    include/linux/compiler_types.h:518:27: note: in definition of macro '__unqual_scalar_typeof'
>      518 |                 _Generic((x),                                           \
>          |                           ^
>    include/asm-generic/rwonce.h:50:9: note: in expansion of macro '__READ_ONCE'
>       50 |         __READ_ONCE(x);                                                 \
>          |         ^~~~~~~~~~~
>    include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                                  ^~~~~~~~~
>    include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
>      680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                ^~~~~~~~~~~~~~~
>    In file included from ./arch/sparc/include/generated/asm/rwonce.h:1,
>                     from include/linux/compiler.h:390,
>                     from include/linux/export.h:5,
>                     from include/linux/linkage.h:7,
>                     from include/linux/preempt.h:10,
>                     from include/linux/spinlock.h:56,
>                     from include/linux/swap.h:5:
> >> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                                     ^~
>    include/asm-generic/rwonce.h:44:73: note: in definition of macro '__READ_ONCE'
>       44 | #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
>          |                                                                         ^
>    include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                                  ^~~~~~~~~
>    include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
>      680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                ^~~~~~~~~~~~~~~
> >> mm/swap_cgroup_priority.c:115:37: error: invalid use of undefined type 'struct mem_cgroup'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                                     ^~
>    include/linux/rcupdate.h:535:19: note: in definition of macro '__rcu_dereference_check'
>      535 |         ((typeof(*p) __force __kernel *)(local)); \
>          |                   ^
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:115:16: note: in expansion of macro 'rcu_dereference'
>      115 |         return rcu_dereference(memcg->swap_priority);
>          |                ^~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c: In function 'show_swap_cgroup_priority':
>    mm/swap_cgroup_priority.c:186:30: error: invalid use of undefined type 'struct mem_cgroup'
>      186 |         swap_priority = memcg->swap_priority;
>          |                              ^~
>    mm/swap_cgroup_priority.c: In function 'swap_alloc_cgroup_priority':
>    mm/swap_cgroup_priority.c:285:26: error: invalid use of undefined type 'struct mem_cgroup'
>      285 |                 if (memcg->swap_priority != swap_priority)
>          |                          ^~
>    mm/swap_cgroup_priority.c: In function 'apply_swap_cgroup_priority':
>    mm/swap_cgroup_priority.c:638:46: error: invalid use of undefined type 'struct mem_cgroup'
>      638 |         swap_priority = rcu_dereference(memcg->swap_priority);
>          |                                              ^~
>    include/linux/rcupdate.h:532:17: note: in definition of macro '__rcu_dereference_check'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                 ^
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:638:25: note: in expansion of macro 'rcu_dereference'
>      638 |         swap_priority = rcu_dereference(memcg->swap_priority);
>          |                         ^~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:638:46: error: invalid use of undefined type 'struct mem_cgroup'
>      638 |         swap_priority = rcu_dereference(memcg->swap_priority);
>          |                                              ^~
>    include/linux/rcupdate.h:532:38: note: in definition of macro '__rcu_dereference_check'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                      ^
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:638:25: note: in expansion of macro 'rcu_dereference'
>      638 |         swap_priority = rcu_dereference(memcg->swap_priority);
>          |                         ^~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:638:46: error: invalid use of undefined type 'struct mem_cgroup'
>      638 |         swap_priority = rcu_dereference(memcg->swap_priority);
>          |                                              ^~
>    include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
>      548 |                 if (!(condition))                                       \
>          |                       ^~~~~~~~~
>    include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
>      568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |         ^~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |                            ^~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
>       49 |         compiletime_assert_rwonce_type(x);                              \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                                  ^~~~~~~~~
>    include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
>      680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
>          |                            ^~~~~~~~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:638:25: note: in expansion of macro 'rcu_dereference'
>      638 |         swap_priority = rcu_dereference(memcg->swap_priority);
>          |                         ^~~~~~~~~~~~~~~
>    mm/swap_cgroup_priority.c:638:46: error: invalid use of undefined type 'struct mem_cgroup'
>      638 |         swap_priority = rcu_dereference(memcg->swap_priority);
>          |                                              ^~
>    include/linux/compiler_types.h:548:23: note: in definition of macro '__compiletime_assert'
>      548 |                 if (!(condition))                                       \
>          |                       ^~~~~~~~~
>    include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
>      568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |         ^~~~~~~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
>       36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
>          |                            ^~~~~~~~~~~~~
>    include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
>       49 |         compiletime_assert_rwonce_type(x);                              \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:532:50: note: in expansion of macro 'READ_ONCE'
>      532 |         typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>          |                                                  ^~~~~~~~~
>    include/linux/rcupdate.h:680:9: note: in expansion of macro '__rcu_dereference_check'
>      680 |         __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>          |         ^~~~~~~~~~~~~~~~~~~~~~~
>    include/linux/rcupdate.h:752:28: note: in expansion of macro 'rcu_dereference_check'
>      752 | #define rcu_dereference(p) rcu_dereference_check(p, 0)
> 
> 
> vim +115 mm/swap_cgroup_priority.c
> 
>   > 16	#include <linux/swap.h>
>     17	#include <linux/rcupdate.h>
>     18	#include <linux/memcontrol.h>
>     19	#include <linux/plist.h>
>     20	#include "swap.h"
>     21	#include "swap_cgroup_priority.h"
>     22	#include "memcontrol-v1.h"
>     23	
>     24	static LIST_HEAD(swap_cgroup_priority_list);
>     25	
>     26	/*
>     27	 * struct swap_cgroup_priority
>     28	 *
>     29	 * This structure is RCU protected. Its lifecycle is determined by its
>     30	 * owning memcg or when its 'distance' reaches zero. The 'distance' field
>     31	 * tracks priority differences from global swap. If zero, and its default_prio
>     32	 * follows global swap priority(SWAP_PRIORITY_GLOBAL), the object is destroyed.
>     33	 *
>     34	 * pnode - Array of pointers to swap device priority nodes.
>     35	 * owner - The owning memory cgroup.
>     36	 * rcu - RCU free callback.
>     37	 * link - Global linked list entry.
>     38	 * least_priority - Current lowest priority.
>     39	 * distance - Priority differences from global swap priority.
>     40	 * default_prio - Default priority for this cgroup.
>     41	 * plist - Priority list head.
>     42	 */
>     43	struct swap_cgroup_priority {
>     44		struct swap_cgroup_priority_pnode *pnode[MAX_SWAPFILES];
>     45		struct mem_cgroup *owner;
>     46	
>     47		union {
>     48			struct rcu_head rcu;
>     49			struct list_head link;
>     50		};
>     51	
>     52		int least_priority;
>     53		s8 distance;
>     54		int default_prio;
>     55		struct plist_head plist[];
>     56	};
>     57	
>     58	/*
>     59	 * struct swap_cgroup_priority_pnode
>     60	 *
>     61	 * This structure represents a priority node for a specific swap device
>     62	 * within a cgroup.
>     63	 *
>     64	 * swap - Pointer to the associated swap device.
>     65	 * id - Unique identifier for the swap device.
>     66	 * prio - Configured priority for this device.
>     67	 * avail_lists - Connections to various priority lists.
>     68	 */
>     69	struct swap_cgroup_priority_pnode {
>     70		struct swap_info_struct *swap;
>     71		u64 id;
>     72		signed short prio;
>     73		struct plist_node avail_lists[];
>     74	};
>     75	
>     76	/*
>     77	 * Even with a zero distance, a swap device isn't assigned if it doesn't
>     78	 * meet global swap priority conditions; thus, we don't clear it.
>     79	 */
>     80	static bool should_clear_swap_cgroup_priority(
>     81		struct swap_cgroup_priority *swap_priority)
>     82	{
>     83		WARN_ON_ONCE(swap_priority->distance < 0 ||
>     84			swap_priority->distance > MAX_SWAPFILES);
>     85	
>     86		if (swap_priority->distance == 0 &&
>     87		    swap_priority->default_prio == SWAP_PRIORITY_GLOBAL)
>     88			return true;
>     89	
>     90		return false;
>     91	}
>     92	
>     93	/*
>     94	 * swapdev_id
>     95	 *
>     96	 * A unique identifier for a swap device.
>     97	 *
>     98	 * This ID ensures stable identification for users and crucial synchronization
>     99	 * for swap cgroup priority settings. It provides a reliable reference even if
>    100	 * device paths or numbers change.
>    101	 */
>    102	static atomic64_t swapdev_id_counter;
>    103	
>    104	void get_swapdev_id(struct swap_info_struct *si)
>    105	{
>    106		si->id = atomic64_inc_return(&swapdev_id_counter);
>    107	}
>    108	
>    109	static struct swap_cgroup_priority *get_swap_cgroup_priority(
>    110		struct mem_cgroup *memcg)
>    111	{
>    112		if (!memcg)
>    113			return NULL;
>    114	
>  > 115		return rcu_dereference(memcg->swap_priority);
>    116	}
>    117	
 
The build dependency should have been CONFIG_MEMCG instead of CONFIG_CGROUP:
struct mem_cgroup is only defined when CONFIG_MEMCG is enabled, which is
why this config fails to build. Apologies for overlooking this. I will
update the dependency and verify it in the next patch version.
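
For reference, a sketch of the corrected dependency (the exact Kconfig
text is still to be settled; only the option name is taken from this
series):

  config SWAP_CGROUP_PRIORITY
	bool "Per-cgroup swap device priorities"
	depends on MEMCG && SWAP
	help
	  Allow a memory cgroup to override the global swap device
	  priorities through memory.swap.priority.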

Best regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-21 15:13   ` kernel test robot
@ 2025-07-22 14:14     ` YoungJun Park
  0 siblings, 0 replies; 39+ messages in thread
From: YoungJun Park @ 2025-07-22 14:14 UTC (permalink / raw)
  To: kernel test robot
  Cc: akpm, hannes, oe-kbuild-all, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, shikemeng, kasong, nphamcs, bhe, baohua, chrisl,
	cgroups, linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim,
	taejoon.song, Michal Koutný

On Mon, Jul 21, 2025 at 11:13:24PM +0800, kernel test robot wrote:
> Hi Youngjun,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on 347e9f5043c89695b01e66b3ed111755afcf1911]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-memcg-Introduce-infrastructure-for-cgroup-based-swap-priority/20250717-042648
> base:   347e9f5043c89695b01e66b3ed111755afcf1911
> patch link:    https://lore.kernel.org/r/20250716202006.3640584-2-youngjun.park%40lge.com
> patch subject: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
> config: loongarch-randconfig-r123-20250721 (https://download.01.org/0day-ci/archive/20250721/202507212243.Lf8fSo0T-lkp@intel.com/config)
> compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
> reproduce: (https://download.01.org/0day-ci/archive/20250721/202507212243.Lf8fSo0T-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202507212243.Lf8fSo0T-lkp@intel.com/
> 
> sparse warnings: (new ones prefixed by >>)
> >> mm/swap_cgroup_priority.c:115:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
>    mm/swap_cgroup_priority.c:115:16: sparse:    struct swap_cgroup_priority [noderef] __rcu *
>    mm/swap_cgroup_priority.c:115:16: sparse:    struct swap_cgroup_priority *
>    mm/swap_cgroup_priority.c:729:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
>    mm/swap_cgroup_priority.c:729:9: sparse:    struct swap_cgroup_priority [noderef] __rcu *
>    mm/swap_cgroup_priority.c:729:9: sparse:    struct swap_cgroup_priority *
>    mm/swap_cgroup_priority.c:638:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
>    mm/swap_cgroup_priority.c:638:25: sparse:    struct swap_cgroup_priority [noderef] __rcu *
>    mm/swap_cgroup_priority.c:638:25: sparse:    struct swap_cgroup_priority *
> 
> vim +115 mm/swap_cgroup_priority.c
> 
>    108	
>    109	static struct swap_cgroup_priority *get_swap_cgroup_priority(
>    110		struct mem_cgroup *memcg)
>    111	{
>    112		if (!memcg)
>    113			return NULL;
>    114	
>  > 115		return rcu_dereference(memcg->swap_priority);
>    116	}
>    117	
> 

This part of the code, which retrieves the swap_priority object, is
expected to be properly updated in a subsequent patch series.
Therefore, I believe it's reasonable to leave it as-is for now.
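
For completeness, these warnings usually indicate that the swap_priority
member is declared without the __rcu annotation. The usual pattern sparse
expects looks like the sketch below (not necessarily how the final rework
will be structured):

  /* in struct mem_cgroup */
  struct swap_cgroup_priority __rcu *swap_priority;

  /* readers, under rcu_read_lock() */
  swap_priority = rcu_dereference(memcg->swap_priority);

  /* updaters, under the appropriate lock */
  rcu_assign_pointer(memcg->swap_priority, new);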

Best regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] mm: swap: Per-cgroup per-CPU swap device cache with shared clusters
  2025-07-16 20:20 ` [PATCH 4/4] mm: swap: Per-cgroup per-CPU swap device cache with shared clusters Youngjun Park
@ 2025-07-22 17:44   ` Kairui Song
  2025-07-22 18:30     ` YoungJun Park
  0 siblings, 1 reply; 39+ messages in thread
From: Kairui Song @ 2025-07-22 17:44 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, chrisl, cgroups, linux-mm,
	linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song

On Thu, Jul 17, 2025 at 4:21 AM Youngjun Park <youngjun.park@lge.com> wrote:
>
> This patch introduces a new swap allocation mechanism that supports
> per-cgroup per-CPU swap device caches, combined with per-device per-CPU
> cluster management.
>
> The existing global swap allocator uses a per-CPU device cache and
> cluster, shared by all cgroups. Under this model, per-cgroup swap
> priorities cannot be effectively honored on the fast path, as allocations
> do not distinguish between cgroups.
>
> To address this, we introduce per-cgroup per-CPU swap device caches.
> This allows fast-path swap allocations to respect each cgroup’s
> individual priority settings.
>
> To avoid an explosion of cluster structures proportional to the number
> of cgroups, clusters remain per-device and are shared across cgroups.
> This strikes a balance between performance and memory overhead.
>
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Suggested-by: Kairui Song <kasong@tencent.com>
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
>  include/linux/swap.h      |   7 ++
>  mm/swap_cgroup_priority.c | 156 +++++++++++++++++++++++++++++++++++++-
>  mm/swap_cgroup_priority.h |  39 ++++++++++
>  mm/swapfile.c             |  47 +++++++-----
>  4 files changed, 228 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index bfddbec2ee28..ab15f4c103a1 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -283,6 +283,12 @@ enum swap_cluster_flags {
>  #define SWAP_NR_ORDERS         1
>  #endif
>
> +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> +struct percpu_cluster {
> +       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> +};
> +#endif
> +
>  /*
>   * We keep using same cluster for rotational device so IO will be sequential.
>   * The purpose is to optimize SWAP throughput on these device.
> @@ -341,6 +347,7 @@ struct swap_info_struct {
>         struct list_head discard_clusters; /* discard clusters list */
>  #ifdef CONFIG_SWAP_CGROUP_PRIORITY
>         u64 id;
> +       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
>  #endif
>         struct plist_node avail_lists[]; /*
>                                            * entries in swap_avail_heads, one
> diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
> index 84e876b77f01..f960c3dcab48 100644
> --- a/mm/swap_cgroup_priority.c
> +++ b/mm/swap_cgroup_priority.c
> @@ -21,6 +21,17 @@
>  #include "swap_cgroup_priority.h"
>  #include "memcontrol-v1.h"
>
> +/*
> + * We do maintain a cache on a per-cgroup-per-swap-device basis.
> + * However, the underlying cluster cache itself is managed
> + * per-swap-device. This design prevents each individual
> + * swap_cgroup_priority entry from caching its own cluster data,
> + * even as the number of such entries increases.
> + */
> +struct percpu_swap_device {
> +       struct swap_info_struct *si[SWAP_NR_ORDERS];
> +};
> +
>  static DEFINE_MUTEX(swap_cgroup_priority_inherit_lck);
>  static LIST_HEAD(swap_cgroup_priority_list);
>
> @@ -49,6 +60,7 @@ static LIST_HEAD(swap_cgroup_priority_list);
>   * least_priority - Current lowest priority.
>   * distance - Priority differences from global swap priority.
>   * default_prio - Default priority for this cgroup.
> + * pcpu_swapdev - Per-CPU swap device.
>   * plist - Priority list head.
>   */
>  struct swap_cgroup_priority {
> @@ -64,6 +76,7 @@ struct swap_cgroup_priority {
>         int least_priority;
>         s8 distance;
>         int default_prio;
> +       struct percpu_swap_device __percpu *pcpu_swapdev;
>         struct plist_head plist[];
>  };
>
> @@ -132,6 +145,21 @@ static struct swap_cgroup_priority *get_effective_swap_cgroup_priority(
>         return swap_priority->effective;
>  }
>
> +static struct swap_cgroup_priority *get_effective_swap_cgroup_priority_rcu(
> +       struct mem_cgroup *memcg)
> +{
> +       struct swap_cgroup_priority *swap_priority;
> +
> +       if (!memcg)
> +               return NULL;
> +
> +       swap_priority = rcu_dereference(memcg->swap_priority);
> +       if (!swap_priority)
> +               return NULL;
> +
> +       return rcu_dereference(swap_priority->effective);
> +}
> +
>  static bool validate_effective_swap_cgroup_priority(
>         struct mem_cgroup *memcg,
>         struct swap_cgroup_priority **swap_priority)
> @@ -172,6 +200,9 @@ static void free_swap_cgroup_priority_pnode(
>  static void free_swap_cgroup_priority(
>         struct swap_cgroup_priority *swap_priority)
>  {
> +       if (swap_priority->pcpu_swapdev)
> +               free_percpu(swap_priority->pcpu_swapdev);
> +
>         for (int i = 0; i < MAX_SWAPFILES; i++)
>                 free_swap_cgroup_priority_pnode(swap_priority->pnode[i]);
>
> @@ -187,6 +218,12 @@ static struct swap_cgroup_priority *alloc_swap_cgroup_priority(void)
>         if (!swap_priority)
>                 return NULL;
>
> +       swap_priority->pcpu_swapdev = alloc_percpu(struct percpu_swap_device);
> +       if (!swap_priority->pcpu_swapdev) {
> +               kvfree(swap_priority);
> +               return NULL;
> +       }
> +
>         /*
>          * Pre-allocates pnode array up to nr_swapfiles at init.
>          * Individual pnodes are assigned on swapon, but not freed
> @@ -326,10 +363,34 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
>         unsigned long offset;
>         int node;
>
> -       /*
> -        * TODO: Per-cpu swap cluster cache can't be used directly
> -        * as cgroup-specific priorities may select different devices.
> -        */
> +       rcu_read_lock();
> +       if (!(swap_priority = get_effective_swap_cgroup_priority_rcu(memcg))) {
> +               rcu_read_unlock();
> +               return false;
> +       }
> +
> +       /* Fast path */
> +       si = this_cpu_read(swap_priority->pcpu_swapdev->si[order]);
> +       if (si && get_swap_device_info(si)) {
> +               offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> +               if (offset) {
> +                       *entry = swp_entry(si->type, offset);
> +                       /*
> +                        * Protected by 'percpu_swap_cluster' local_lock;
> +                        * CPU migration is disabled during this operation.
> +                        */
> +                       this_cpu_write(swap_priority->pcpu_swapdev->si[order],
> +                                      si);
> +                       put_swap_device(si);
> +                       rcu_read_unlock();
> +
> +                       return true;
> +               }
> +               put_swap_device(si);
> +       }
> +       rcu_read_unlock();
> +
> +       /* Slow path */

Hi Youngjun

One thing I noticed after a quick glance is that this
swap_alloc_cgroup_priority is bloated and it is doing similar things
as folio_alloc_swap.

I imagined that we can just have a struct (e.g. let's call it struct
swap_percpu_info / pi) as a closure of what the allocator needs; it
contains the plist and the fast path device.

With slight changes to folio_alloc_swap, it can respect either the
cgroup's pi or global pi. (might be a horrible name though, feel free
to change it)

For example, the first thing swap_alloc_fast does will be:

`struct swap_percpu_info *pi = folio_swap_percpu_info(folio);`

folio_swap_percpu_info returns the cgroup's swap_percpu_info or the global one.

swap_alloc_slow can do a similar thing; it can then just use pi->plist
and pi->pcpu_swapdev (cluster info will be in si), ignoring all the
cgroup differences.
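
Just to illustrate the shape I have in mind (everything below is a
made-up sketch for discussion, not code from this series; "pi",
"global_swap_pi" and the embedding inside swap_cgroup_priority are
all hypothetical names):

struct swap_percpu_info {
        /* fast path: per-cpu, per-order device cache */
        struct percpu_swap_device __percpu *pcpu_swapdev;
        /* slow path: the plist the allocator walks */
        struct plist_head plist[MAX_NUMNODES];
};

/* caller holds rcu_read_lock() */
static struct swap_percpu_info *folio_swap_percpu_info(struct folio *folio)
{
        struct mem_cgroup *memcg = folio_memcg(folio);
        struct swap_cgroup_priority *prio;

        prio = memcg ? rcu_dereference(memcg->swap_priority) : NULL;
        /* cgroup has its own priorities: use its pi, else the global one */
        return prio ? &prio->pi : &global_swap_pi;
}

Then swap_alloc_fast and swap_alloc_slow only ever see a pi and don't
need to care whether it came from a cgroup or from the global
allocator state.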

Also it is better to check your patches with ./scripts/checkpatch.pl,
I'm seeing some styling issues.

I'll check your other patches too later this week, thanks for the
update on this idea.

>         spin_lock(&swap_avail_lock);
>         node = numa_node_id();
>
> @@ -350,6 +411,14 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
>                 if (get_swap_device_info(si)) {
>                         offset = cluster_alloc_swap_entry(si, order,
>                                                           SWAP_HAS_CACHE);
> +                       /*
> +                        * Protected by 'percpu_swap_cluster' local_lock;
> +                        * CPU migration is disabled during this operation.
> +                        */
> +                       if (memcg->swap_priority == swap_priority)
> +                               this_cpu_write(
> +                                       swap_priority->pcpu_swapdev->si[order],
> +                                       si);
>                         put_swap_device(si);
>                         if (offset) {
>                                 *entry = swp_entry(si->type, offset);
> @@ -687,6 +756,21 @@ static int __apply_swap_cgroup_priority(
>         return 0;
>  }
>
> +static int init_swap_cgroup_priority_pcpu_swapdev_cache(
> +       struct swap_cgroup_priority *swap_priority)
> +{
> +       int cpu;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct percpu_swap_device *pcp_swap_dev =
> +                       per_cpu_ptr(swap_priority->pcpu_swapdev, cpu);
> +               for (int i = 0; i < SWAP_NR_ORDERS; i++)
> +                       pcp_swap_dev->si[i] = NULL;
> +       }
> +
> +       return 0;
> +}
> +
>  /*
>   * If this is the top-level swap_cgroup_priority, propagation is needed.
>   * We traverse the 'mem_cgroup_tree' using 'for_each_mem_cgroup_tree'.
> @@ -795,6 +879,8 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
>         for_each_node(nid)
>                 plist_head_init(&swap_priority->plist[nid]);
>
> +       init_swap_cgroup_priority_pcpu_swapdev_cache(swap_priority);
> +
>  prio_set:
>         spin_lock(&swap_lock);
>         spin_lock(&swap_avail_lock);
> @@ -843,6 +929,23 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
>
>         spin_unlock(&swap_avail_lock);
>         spin_unlock(&swap_lock);
> +       /*
> +        * XXX: We cannot fully synchronize with swap_alloc_cgroup_priority
> +        * when updating the next si.
> +        * Still, we ensure that flush operations inside swap_priority
> +        * are performed as reliably as possible.
> +        */
> +       if (id != DEFAULT_ID &&
> +           swap_priority == swap_priority->effective && !new) {
> +               int cpu;
> +               struct swap_info_struct **pcp_si;
> +               for_each_possible_cpu(cpu) {
> +                       pcp_si = per_cpu_ptr(
> +                               swap_priority->pcpu_swapdev->si, cpu);
> +                       for (int i = 0; i < SWAP_NR_ORDERS; i++)
> +                               pcp_si[i] = NULL;
> +               }
> +       }
>         mutex_unlock(&swap_cgroup_priority_inherit_lck);
>         return 0;
>
> @@ -886,3 +989,48 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
>         spin_unlock(&swap_avail_lock);
>         mutex_unlock(&swap_cgroup_priority_inherit_lck);
>  }
> +
> +void flush_swap_cgroup_priority_percpu_swapdev(struct swap_info_struct *si)
> +{
> +       int cpu, i;
> +       struct swap_info_struct **pcp_si;
> +       struct swap_cgroup_priority *swap_priority;
> +
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(swap_priority,
> +                               &swap_cgroup_priority_list, link) {
> +               for_each_possible_cpu(cpu) {
> +                       pcp_si = per_cpu_ptr(
> +                                       swap_priority->pcpu_swapdev->si, cpu);
> +
> +                       for (i = 0; i < SWAP_NR_ORDERS; i++)
> +                               cmpxchg(&pcp_si[i], si, NULL);
> +               }
> +       }
> +       rcu_read_unlock();
> +}
> +
> +bool alloc_percpu_swap_cluster(struct swap_info_struct *si)
> +{
> +       si->percpu_cluster = alloc_percpu(struct percpu_cluster);
> +       if (!si->percpu_cluster)
> +               return false;
> +
> +       int cpu;
> +       int i;
> +       for_each_possible_cpu(cpu) {
> +               struct percpu_cluster *cluster;
> +
> +               cluster = per_cpu_ptr(si->percpu_cluster, cpu);
> +               for (i = 0; i < SWAP_NR_ORDERS; i++)
> +                       cluster->next[i] = SWAP_ENTRY_INVALID;
> +       }
> +
> +       return true;
> +}
> +
> +void free_percpu_swap_cluster(struct swap_info_struct *si)
> +{
> +       free_percpu(si->percpu_cluster);
> +       si->percpu_cluster = NULL;
> +}
> diff --git a/mm/swap_cgroup_priority.h b/mm/swap_cgroup_priority.h
> index 5d16b63d12e0..815822ebd0d1 100644
> --- a/mm/swap_cgroup_priority.h
> +++ b/mm/swap_cgroup_priority.h
> @@ -47,6 +47,22 @@ struct swap_cgroup_priority *inherit_swap_cgroup_priority(
>  bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg, swp_entry_t *entry,
>                                 int order);
>  void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
> +void flush_swap_cgroup_priority_percpu_swapdev(struct swap_info_struct *si);
> +
> +bool alloc_percpu_swap_cluster(struct swap_info_struct *si);
> +void free_percpu_swap_cluster(struct swap_info_struct *si);
> +static inline void write_percpu_swap_cluster_next(struct swap_info_struct *si,
> +                                                 int order,
> +                                                 unsigned int next)
> +{
> +       this_cpu_write(si->percpu_cluster->next[order], next);
> +}
> +
> +static inline unsigned int read_percpu_swap_cluster_next(
> +       struct swap_info_struct *si, int order)
> +{
> +        return __this_cpu_read(si->percpu_cluster->next[order]);
> +}
>  #else
>  int swap_node(struct swap_info_struct *si);
>  unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> @@ -85,5 +101,28 @@ static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
>  static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
>  {
>  }
> +static inline void flush_swap_cgroup_priority_percpu_swapdev(
> +       struct swap_info_struct *si)
> +{
> +}
> +static inline bool alloc_percpu_swap_cluster(struct swap_info_struct *si)
> +{
> +       return true;
> +}
> +static inline void free_percpu_swap_cluster(struct swap_info_struct *si)
> +{
> +}
> +static inline void write_percpu_swap_cluster_next(struct swap_info_struct *si,
> +                                                 int order,
> +                                                 unsigned int next)
> +{
> +       return;
> +}
> +
> +static inline unsigned int read_percpu_swap_cluster_next(
> +       struct swap_info_struct *si, int order)
> +{
> +       return SWAP_ENTRY_INVALID;
> +}
>  #endif
>  #endif
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index bfd0532ad250..6a5ac9962e9f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -817,12 +817,15 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>  out:
>         relocate_cluster(si, ci);
>         unlock_cluster(ci);
> +
>         if (si->flags & SWP_SOLIDSTATE) {
>                 this_cpu_write(percpu_swap_cluster.offset[order], next);

Why not just remove the `percpu_swap_cluster.offset` and just share
si->percpu_cluster among all cgroups (including root cgroup)?

Otherwise, e.g. if rootcg's pcpu cluster and one cgroup's pcpu
cluster are pointing to the same cluster, they might be in
contention on allocations of different orders, or even for the same
order the performance might not be good, as multiple CPUs will race
with each other.

It will be easier to implement too.
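
i.e. the tail of alloc_swap_scan_cluster() could become roughly the
following (untested, just to show the idea):

        if (si->flags & SWP_SOLIDSTATE) {
                /* no percpu_swap_cluster.offset any more: every cgroup,
                 * root included, shares the per-device per-cpu cache */
                this_cpu_write(percpu_swap_cluster.si[order], si);
                write_percpu_swap_cluster_next(si, order, next);
        } else {
                si->global_cluster->next[order] = next;
        }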




>                 this_cpu_write(percpu_swap_cluster.si[order], si);
> +               write_percpu_swap_cluster_next(si, order, next);
>         } else {
>                 si->global_cluster->next[order] = next;
>         }
> +
>         return found;
>  }
>
> @@ -892,26 +895,29 @@ unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
>         if (order && !(si->flags & SWP_BLKDEV))
>                 return 0;
>
> -       if (!(si->flags & SWP_SOLIDSTATE)) {
> +       if (si->flags & SWP_SOLIDSTATE) {
> +               offset = read_percpu_swap_cluster_next(si, order);
> +       } else {
>                 /* Serialize HDD SWAP allocation for each device. */
>                 spin_lock(&si->global_cluster_lock);
>                 offset = si->global_cluster->next[order];
> -               if (offset == SWAP_ENTRY_INVALID)
> -                       goto new_cluster;
> +       }
>
> -               ci = lock_cluster(si, offset);
> -               /* Cluster could have been used by another order */
> -               if (cluster_is_usable(ci, order)) {
> -                       if (cluster_is_empty(ci))
> -                               offset = cluster_offset(si, ci);
> -                       found = alloc_swap_scan_cluster(si, ci, offset,
> -                                                       order, usage);
> -               } else {
> -                       unlock_cluster(ci);
> -               }
> -               if (found)
> -                       goto done;
> +       if (offset == SWAP_ENTRY_INVALID)
> +               goto new_cluster;
> +
> +       ci = lock_cluster(si, offset);
> +       /* Cluster could have been used by another order */
> +       if (cluster_is_usable(ci, order)) {
> +               if (cluster_is_empty(ci))
> +                       offset = cluster_offset(si, ci);
> +               found = alloc_swap_scan_cluster(si, ci, offset,
> +                                               order, usage);
> +       } else {
> +               unlock_cluster(ci);
>         }
> +       if (found)
> +               goto done;
>
>  new_cluster:
>         ci = isolate_lock_cluster(si, &si->free_clusters);
> @@ -991,6 +997,7 @@ unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
>  done:
>         if (!(si->flags & SWP_SOLIDSTATE))
>                 spin_unlock(&si->global_cluster_lock);
> +
>         return found;
>  }
>
> @@ -2674,6 +2681,8 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si)
>                 for (i = 0; i < SWAP_NR_ORDERS; i++)
>                         cmpxchg(&pcp_si[i], si, NULL);
>         }
> +
> +       flush_swap_cgroup_priority_percpu_swapdev(si);
>  }
>
>
> @@ -2802,6 +2811,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         arch_swap_invalidate_area(p->type);
>         zswap_swapoff(p->type);
>         mutex_unlock(&swapon_mutex);
> +       free_percpu_swap_cluster(p);
>         kfree(p->global_cluster);
>         p->global_cluster = NULL;
>         vfree(swap_map);
> @@ -2900,7 +2910,6 @@ static void swap_stop(struct seq_file *swap, void *v)
>         mutex_unlock(&swapon_mutex);
>  }
>
> -
>  #ifndef CONFIG_SWAP_CGROUP_PRIORITY
>  static int swap_show(struct seq_file *swap, void *v)
>  {
> @@ -3239,7 +3248,10 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>         for (i = 0; i < nr_clusters; i++)
>                 spin_lock_init(&cluster_info[i].lock);
>
> -       if (!(si->flags & SWP_SOLIDSTATE)) {
> +       if (si->flags & SWP_SOLIDSTATE) {
> +               if (!alloc_percpu_swap_cluster(si))
> +                       goto err_free;
> +       } else {
>                 si->global_cluster = kmalloc(sizeof(*si->global_cluster),
>                                      GFP_KERNEL);
>                 if (!si->global_cluster)
> @@ -3532,6 +3544,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  bad_swap_unlock_inode:
>         inode_unlock(inode);
>  bad_swap:
> +       free_percpu_swap_cluster(si);
>         kfree(si->global_cluster);
>         si->global_cluster = NULL;
>         inode = NULL;
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/4] mm: swap: Per-cgroup per-CPU swap device cache with shared clusters
  2025-07-22 17:44   ` Kairui Song
@ 2025-07-22 18:30     ` YoungJun Park
  0 siblings, 0 replies; 39+ messages in thread
From: YoungJun Park @ 2025-07-22 18:30 UTC (permalink / raw)
  To: Kairui Song
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, chrisl, cgroups, linux-mm,
	linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song

On Wed, Jul 23, 2025 at 01:44:49AM +0800, Kairui Song wrote:
> On Thu, Jul 17, 2025 at 4:21 AM Youngjun Park <youngjun.park@lge.com> wrote:
> 
> Hi Youngjun
> 
> One thing I noticed after a quick glance is that this
> swap_alloc_cgroup_priority is bloated and it is doing similar things
> as folio_alloc_swap.
> 
> I imagined that we can just have a struct (e.g. let's call it struct
> swap_percpu_info / pi) as a closure of what the allocator needs; it
> contains the plist and the fast path device.
> 
> With slight changes to folio_alloc_swap, it can respect either the
> cgroup's pi or global pi. (might be a horrible name though, feel free
> to change it)
> 
> For example, the first thing swap_alloc_fast does will be:
> 
> `struct swap_percpu_info *pi = folio_swap_percpu_info(folio);`
> 
> folio_swap_percpu_info returns the cgroup's swap_percpu_info or the global one.
> 
> swap_alloc_slow can do a similar thing; it can then just use pi->plist
> and pi->pcpu_swapdev (cluster info will be in si), ignoring all the
> cgroup differences.

I was also considering whether the priority handling (like `plist`) could be  
abstracted to unify the allocation logic across paths.  

At the time, I leaned toward keeping the existing allocator logic intact as    
much as possible, which is why I avoided introducing a new struct and instead  
duplicated some logic.  

Your suggestion with `swap_percpu_info` makes the design clearer and aligns    
well with what I had in mind — I’ll review this direction more closely. If my  
thoughts change during the process, I’ll make sure to share the update on the  
mailing list.  

Thanks again for the helpful input!

> Also it is better to check your patches with ./scripts/checkpatch.pl,
> I'm seeing some styling issues.

I should have paid more attention to this.  
I’ll be sure to run `./scripts/checkpatch.pl` more carefully and address those 
issues in the next version of the patch. Thanks for the reminder!

> I'll check your other patches too later this week, thanks for the
> update on this idea.

Thanks again for the great idea, and I really appreciate you taking the time to
review this in the middle of your busy schedule.

> 
> Why not just remove the `percpu_swap_cluster.offset` and just share
> si->percpu_cluster among all cgroups (including root cgroup)?
> 
> Otherwise, e.g. if rootcg's pcpu cluster and one cgroup's pcpu
> cluster are pointing to the same cluster, they might be in
> contention on allocations of different orders, or even for the same
> order the performance might not be good, as multiple CPUs will race
> with each other.
> 
> It will be easier to implement too.

I originally kept `percpu_swap_cluster.offset` around to
preserve compatibility when swap cgroup priority is not enabled, and to        
minimize disruption to the existing fast path.  

But after reviewing your suggestion, I agree it makes more sense to unify this 
path and always rely on `si->percpu_cluster`, even for the root cgroup.  

This simplifies the implementation, and as you pointed out, avoids potential   
contention and complexity that could arise from sharing per-cgroup clusters    
across CPUs.  

Thanks again for the clear and helpful insight.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-22 14:05     ` YoungJun Park
@ 2025-07-22 18:41       ` YoungJun Park
  2025-08-14 14:03         ` Michal Koutný
  0 siblings, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-07-22 18:41 UTC (permalink / raw)
  To: Michal Koutný
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, cgroups,
	linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song

On Tue, Jul 22, 2025 at 11:05:24PM +0900, YoungJun Park wrote:
> On Tue, Jul 22, 2025 at 10:41:20AM +0200, Michal Koutný wrote:
> > On Thu, Jul 17, 2025 at 05:20:03AM +0900, Youngjun Park <youngjun.park@lge.com> wrote:
> > > +  memory.swap.priority
> > > +    A read-write flat-keyed file which exists on non-root cgroups.
> > > +    This interface allows you to set per-swap-device priorities for the current
> > > +    cgroup and to define how they differ from the global swap system.
> > > +
> > > +    To assign priorities or define specific behaviors for swap devices
> > > +    in the current cgroup, write one or more lines in the following
> > > +    formats:
> > > +
> > > +     - <swap_device_id> <priority>
> > > +     - <swap_device_id> disabled
> > > +     - <swap_device_id> none
> > > +     - default none
> > > +     - default disabled
> > > +
> > > +    Each <swap_device_id> refers to a unique swap device registered
> > > +    in the system. You can check the ID, device path, and current
> > > +    priority of active swap devices through the `/proc/swaps` file.
> > 
> > Do you mean row number as the ID? Or does this depend on some other
> > patches or API?
> 
> You're right to ask for clarification. The `<swap_device_id>` refers
> to a unique identifier added to each swap device entry in `/proc/swaps`.
> I will revise the documentation to make this clearer.
> 
> As a side note, I initially had concerns about breaking the existing ABI.
> However, the additional ID column does not significantly change the
> current output format and is gated behind `CONFIG_SWAP_CGROUP_PRIORITY`,
> so it should be safe and intuitive to expose it through `/proc/swaps`.
> 
> > > +    This provides a clear mapping between swap devices and the IDs
> > > +    used in this interface.
> > > +
> > > +    The 'default' keyword sets the fallback priority behavior rule for
> > > +    this cgroup. If no specific entry matches a swap device, this default
> > > +    applies.
> > > +
> > > +    * 'default none': This is the default if no configuration
> > > +      is explicitly written. Swap devices follow the system-wide
> > > +      swap priorities.
> > > +
> > > +    * 'default disabled': All swap devices are excluded from this cgroup’s
> > > +      swap priority list and will not be used by this cgroup.
> > 
> > This duplicates memory.swap.max=0. I'm not sure it's thus necessary.
> > At the same time you don't accept 'default <priority>' (that's sane).
> 
> That's a valid observation. While `memory.swap.max=0` controls the overall
> swap usage limit, the `default disabled` entry is intended to disable
> specific swap devices within the scope of this cgroup interface. The
> motivation was to offer more granular control over device selection
> rather than total swap usage.
> 
> > > +
> > > +    The priority semantics are consistent with the global swap system:
> > > +
> > > +      - Higher numerical values indicate higher preference.
> > > +      - See Documentation/admin-guide/mm/swap_numa.rst for details on
> > > +        swap NUMA autobinding and negative priority rules.
> > > +
> > > +    The handling of negative priorities in this cgroup interface
> > > +    has specific behaviors for assignment and restoration:
> > > +
> > > +    * Negative Priority Assignment
> > 
> > Even in Documentation/admin-guide/mm/swap_numa.rst it's part of "Implementation details".
> > I admit I'm daunted by this paragraphs. Is it important for this interface?
> 
> Thank you for pointing this out. My original philosophy was to preserve
> as much of the existing swap functionality as possible, including
> NUMA-aware behaviors.
> 
> However, I agree that the explanation is complex and also may not be
> necessary for my proposed usage. After some reflection, I believe the
> implementation (and documentation) will be clearer and simpler without
> supporting negative priorities here. 
> 
> Unless further objections arise, I plan to drop this behavior in the next
> version of the patch, as you suggested. If compelling use cases emerge in
> the future, we can consider reintroducing the support at that time.
> 
> Thanks again for your helpful review!

I'd like to revisit the NUMA-aware swap priority behavior based on
further implementation consideration. After refining the idea, I
realized there are potential issues if we fully remove the NUMA
autobind behavior when cgroup priorities are set.

For example, suppose the global swap device priorities are configured
as:

  swapA -2
  swapB -3
  swapC -4

If we update the per-cgroup priority of swapA to a positive value, it
feels natural that only swapA should be affected, and swapB/swapC
should remain subject to NUMA autobind as configured globally. In other
words, the presence of one overridden device shouldn't disable autobind
entirely.

Thus, it seems that we may still need to retain some internal structure
for honoring NUMA autobind even when swap cgroup priority is enabled,
at least for the devices not explicitly overridden.

This leaves us with a few design options:

1. Treat negative values as valid priorities. Once any device is
   assigned via `memory.swap.priority`, the NUMA autobind logic is
   entirely disabled.
   - Pros: Simplifies implementation; avoids exposing NUMA autobind via
     cgroup interface.
   - Cons: Overrides autobind for all devices even if only one is set.

2. Continue to treat negative values as NUMA autobind weights, without
   implicit shifting. If a user assigns `-3`, it is stored and used
   exactly as `-3`, and does not affect other devices.
   - Pros: Simple and intuitive; matches current implementation
     semantics.
   - Cons: Autobind semantics still need to be reasoned about when
     using the interface.

3. Disallow setting negative values via `memory.swap.priority`.
   Existing NUMA autobind config is preserved, but no new autobind
   configuration is possible from cgroup interface.
   - Pros: Keeps cgroup interface simple; no autobind manipulation.
   - Cons: Autobind infra remains partially active, increasing code
     complexity.

4. Keep the current design: allow setting negative values to express
   NUMA autobind weights explicitly. Devices without overridden values
   continue to follow NUMA-based dynamic selection.
   - Pros: Preserves current flexibility; gives users control per device.
   - Cons: Slightly more complex semantics; NUMA autobind remains a
     visible part of the interface.

After thinking through these tradeoffs, I'm inclined to think that
preserving the NUMA autobind option might be the better path forward.
What are your thoughts on this?

Thank you again for your helpful feedback.

Best regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-07-22 18:41       ` YoungJun Park
@ 2025-08-14 14:03         ` Michal Koutný
  2025-08-15 15:10           ` Chris Li
  2025-08-16 16:41           ` YoungJun Park
  0 siblings, 2 replies; 39+ messages in thread
From: Michal Koutný @ 2025-08-14 14:03 UTC (permalink / raw)
  To: YoungJun Park
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, cgroups,
	linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song


On Wed, Jul 23, 2025 at 03:41:47AM +0900, YoungJun Park <youngjun.park@lge.com> wrote:
> This leaves us with a few design options:
> 
> 1. Treat negative values as valid priorities. Once any device is
>    assigned via `memory.swap.priority`, the NUMA autobind logic is
>    entirely disabled.
>    - Pros: Simplifies implementation; avoids exposing NUMA autobind via
>      cgroup interface.
>    - Cons: Overrides autobind for all devices even if only one is set.
> 
> 2. Continue to treat negative values as NUMA autobind weights, without
>    implicit shifting. If a user assigns `-3`, it is stored and used
>    exactly as `-3`, and does not affect other devices.
>    - Pros: Simple and intuitive; matches current implementation
>      semantics.
>    - Cons: Autobind semantics still need to be reasoned about when
>      using the interface.
> 
> 3. Disallow setting negative values via `memory.swap.priority`.
>    Existing NUMA autobind config is preserved, but no new autobind
>    configuration is possible from cgroup interface.
>    - Pros: Keeps cgroup interface simple; no autobind manipulation.
>    - Cons: Autobind infra remains partially active, increasing code
>      complexity.
> 
> 4. Keep the current design: allow setting negative values to express
>    NUMA autobind weights explicitly. Devices without overridden values
>    continue to follow NUMA-based dynamic selection.
>    - Pros: Preserves current flexibility; gives users control per device.
>    - Cons: Slightly more complex semantics; NUMA autobind remains a
>      visible part of the interface.
> 
> After thinking through these tradeoffs, I'm inclined to think that
> preserving the NUMA autobind option might be the better path forward.
> What are your thoughts on this?
> 
> Thank you again for your helpful feedback.

Let me share my mental model in order to help forming the design.

I find these per-cgroup swap priorities similar to cpuset -- instead of
having a configured cpumask (bitmask) for each cgroup, you have
weight-mask for individual swap devices (or distribution over the
devices, I hope it's not too big deviation from priority ranking).
Then you have the hierarchy, so you need a method how to combine
child+parent masks (or global/root) to obtain effective weight-mask (and
effective ranking) for each cgroup.

Furthermore, there's the NUMA autobinding which adds another weight-mask
to the game but this time it's not configured but it depends on "who is
asking". (Tasks running on node N would have autobind shifted towards
devices associated to node N. Is that how autobinding works?)

From the hierarchy point of view, you have to compound weight-masks in
top-down preference (so that higher cgroups can override lower) and
autobind weight-mask that is only conceivable at the very bottom
(not a cgroup but depending on the task's NUMA placement).

There I see conflict between the ends a tad. I think the attempted
reconciliation was to allow emptiness of a single slot in the
weight-mask but it may not be practical for the compounding (that's why
you came up with the four variants). So another option would be to allow
whole weight-mask being empty (or uniform) so that it'd be identity in
the compounding operation.
The conflict exists also in the current non-percg priorities -- there
are the global priorities and autobind priorities. IIUC, the global
level either defines a weight (user prio) or it is empty (defer to NUMA
autobinding).

[I leveled rankings and weight-masks of devices but I left a loophole of
how the empty slots in the latter would be converted to (and from)
rankings. This e-mail is already too long.]


A very different alternative that comes to my mind together with
autobinding and leveraging that to your use case:
- define virtual NUMA nodes [1],
- associate separate swap devices to those nodes,
- utilize task (or actual (mem)cpuset) affinity to those virtual NUMA
  nodes based on each process's swap requirements,
- NUMA autobinding would then yield the device constraints you sought.


HTH,
Michal


[1] Not sure how close this is to the linked series [2] which is AFAIU
    a different kind of virtualization that isn't supposed to be exposed
    to userspace(?).
[2] https://lore.kernel.org/linux-mm/20250429233848.3093350-1-nphamcs@gmail.com/



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-14 14:03         ` Michal Koutný
@ 2025-08-15 15:10           ` Chris Li
  2025-08-16 17:21             ` YoungJun Park
  2025-08-16 16:41           ` YoungJun Park
  1 sibling, 1 reply; 39+ messages in thread
From: Chris Li @ 2025-08-15 15:10 UTC (permalink / raw)
  To: Michal Koutný
  Cc: YoungJun Park, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, shikemeng, kasong, nphamcs, bhe, baohua, cgroups,
	linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song,
	Matthew Wilcox, David Hildenbrand, Kairui Song

Hi Michal and YoungJun,

I am sorry for the late reply. I have briefly read through the patch
series; the overall impression:
1)  Priority is not the best way to select which swap file to use per cgroup.
The priority is assigned to one device; it is a per swap file local
change. The effect you want to see is actually a global one: how this
swap device compares to other devices. You actually want a list as
the end result. Adjusting per swap file priority is backwards. A lot
of unnecessary usage complexity and code complexity comes from that.
2)  This series is too complicated for what it does.

I have a similar idea, "swap.tiers," first mentioned earlier here:
https://lore.kernel.org/linux-mm/CAF8kJuNFtejEtjQHg5UBGduvFNn3AaGn4ffyoOrEnXfHpx6Ubg@mail.gmail.com/

I will outline the line in more detail in the last part of my reply.

BTW, YoungJun and Michal, do you have the per cgroup swap file control
proposal for this year's LPC? If you want to, I am happy to work with
you on the swap tiers topic as a secondary. I probably don't have the
time to do it as a primary.

On Thu, Aug 14, 2025 at 7:03 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Wed, Jul 23, 2025 at 03:41:47AM +0900, YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > After thinking through these tradeoffs, I'm inclined to think that
> > preserving the NUMA autobind option might be the better path forward.
> > What are your thoughts on this?

The swap allocator has gone through a complete rewrite. We need to
revisit whether the NUMA autobinding thing is still beneficial in the
new swap allocator. We need more data points. Personally I would like
to decouple the NUMA to the swap device. If the swap device needs more
sharding, we can do more sharding without NUMA nodes. Using NUMA nodes
is just one way of sharding. Should not be the only way to do
sharding. Coupling the swap device with NUMA nodes makes things really
complicated. It would need a lot of performance difference to justify
that kind of complexity.

> > Thank you again for your helpful feedback.
>
> Let me share my mental model in order to help forming the design.
>
> I find these per-cgroup swap priorities similar to cpuset -- instead of
> having a configured cpumask (bitmask) for each cgroup, you have
> weight-mask for individual swap devices (or distribution over the
> devices, I hope it's not too big deviation from priority ranking).

+1. The swap tiers I have in mind is very close to what you describe

> Then you have the hierarchy, so you need a method how to combine
> child+parent masks (or global/root) to obtain effective weight-mask (and
> effective ranking) for each cgroup.

Yes, swap tiers has a hierarchy module story as well. Will talk about
that in a later part of the email.

>
> Furthermore, there's the NUMA autobinding which adds another weight-mask
> to the game but this time it's not configured but it depends on "who is
> asking". (Tasks running on node N would have autobind shifted towards
> devices associated to node N. Is that how autobinding works?)

Again, I really wish the swap file selection decouples from the NUMA nodes.

> From the hierarchy point of view, you have to compound weight-masks in
> top-down preference (so that higher cgroups can override lower) and
> autobind weight-mask that is only conceivable at the very bottom
> (not a cgroup but depending on the task's NUMA placement).

I want to abandon weight adjusting and focus on opt in or out.

> There I see conflict between the ends a tad. I think the attempted
> reconciliation was to allow emptiness of a single slot in the

I think adjusting a single swap file to impact the relative order is backwards.

> weight-mask but it may not be practical for the compounding (that's why
> you came up with the four variants). So another option would be to allow
> whole weight-mask being empty (or uniform) so that it'd be identity in
> the compounding operation.
> The conflict exists also in the current non-percg priorities -- there
> are the global priorities and autobind priorities. IIUC, the global
> level either defines a weight (user prio) or it is empty (defer to NUMA
> autobinding).
>
> [I leveled rankings and weight-masks of devices but I left a loophole of
> how the empty slots in the latter would be converted to (and from)
> rankings. This e-mail is already too long.]

OK. I want to abandon the weight-adjustment approach. Here I outline
the swap tiers idea as follows. I can probably start a new thread for
that later.

1) No per cgroup swap priority adjustment. The swap file priority is
global to the system.
Per cgroup swap file ordering adjustment is bad from the LRU point of
view. We should make the swap file ordering match the swap device
service performance. The fast swap tiers (zram, zswap) store hotter
data, the slower tier (hard drive) stores colder data, and SSD sits
in between. It is important to keep the fast/slow tier ordering
matched to the hot/cold LRU ordering.

2) There is a simple mapping of global swap tier names into priority ranges.
The names themselves are customizable.
e.g. 100+ is the "compress_ram" tier, 50-99 is the "SSD" tier, 0-49 is
the "hdd" tier.
The detailed mechanism and API are TBD.
The end result is that a simple tier name lookup will get the priority range.
By default all swap tiers are available for global usage without
cgroup. That matches the current global swapon behavior.
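
To make this concrete, the lookup could be as trivial as something
like the following (purely illustrative; the names, ranges and API
are all TBD):

struct swap_tier {
        const char *name;
        int prio_min;   /* lowest global priority in this tier */
        int prio_max;   /* highest global priority in this tier */
};

static const struct swap_tier swap_tiers[] = {
        { "compress_ram", 100, SHRT_MAX },
        { "ssd",           50,       99 },
        { "hdd",            0,       49 },
};

/* map a swap device's global priority to its tier */
static int swap_tier_index(int prio)
{
        for (int i = 0; i < (int)ARRAY_SIZE(swap_tiers); i++)
                if (prio >= swap_tiers[i].prio_min &&
                    prio <= swap_tiers[i].prio_max)
                        return i;
        return -1;
}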

3) Each cgroup will have "swap.tiers" (name TBD) to opt in/out of the tier.
It is a list of tiers including the default tier who shall not be named.

Here are a few examples:
e.g. consider the following cgroup hierarchy a/b/c/d, a as the first
level cgroup.
a/swap.tiers: "- +compress_ram"
it means who shall not be named is set to opt out, opting in to
compress_ram only; no ssd, no hdd.
Who shall not be named, if specified, has to be the first one listed
in the "swap.tiers".

a/b/swap.tiers: "+ssd"
For the b cgroup, who shall not be named is not specified, so the tier is
appended to the parent "a/swap.tiers". The effective "a/b/swap.tiers"
becomes "- +compress_ram +ssd".
a/b can use both zswap and ssd.

Every time the who shall not be named is changed, it can drop the
parent swap.tiers chain, starting from scratch.

a/b/c/swap.tiers: "-"

For c, it turns off all swap. The effective "a/b/c/swap.tiers" becomes
"- +compress_ram +ssd -", which simplifies to "-", because the second "-"
overwrites all previous opt-in/opt-out results.
In other words, if the current cgroup does not specify the who shall
not be named, it will walk the parent chain until it does. The global
"/" for non cgroup is on.

a/b/c/d/swap.tiers: "- +hdd"
For d, only hdd swap, nothing else.

More example:
 "- +ssd +hdd -ssd" will simplify to: "- +hdd", which means hdd only.
 "+ -hdd": No hdd for you! Use everything else.

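The compounding itself is then just a walk up the tree. In pseudo-C
(the fields and the apply_tier_tokens() helper are invented for this
sketch, the real data structure is TBD):

/* bit i of the returned mask means tier i is allowed for this cgroup */
static unsigned long effective_swap_tiers(struct mem_cgroup *memcg)
{
        unsigned long mask;

        if (!memcg)
                return ~0UL;    /* global "/": every tier is on */

        if (memcg->swap_tiers_has_default)      /* a bare "+" or "-" was written */
                mask = memcg->swap_tiers_default_on ? ~0UL : 0;
        else    /* no default here: start from the parent's effective set */
                mask = effective_swap_tiers(parent_mem_cgroup(memcg));

        /* replay this cgroup's own "+tier"/"-tier" entries on top */
        return apply_tier_tokens(memcg, mask);
}

The "simplification" in the examples above is nothing more than this
replay collapsing redundant entries.
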
Let me know what you think about the above "swap.tiers"(name TBD) proposal.

Chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-14 14:03         ` Michal Koutný
  2025-08-15 15:10           ` Chris Li
@ 2025-08-16 16:41           ` YoungJun Park
  1 sibling, 0 replies; 39+ messages in thread
From: YoungJun Park @ 2025-08-16 16:41 UTC (permalink / raw)
  To: Michal Koutný
  Cc: akpm, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, kasong, nphamcs, bhe, baohua, chrisl, cgroups,
	linux-mm, linux-kernel, gunho.lee, iamjoonsoo.kim, taejoon.song

On Thu, Aug 14, 2025 at 04:03:36PM +0200, Michal Koutný wrote:
> On Wed, Jul 23, 2025 at 03:41:47AM +0900, YoungJun Park <youngjun.park@lge.com> wrote:

> Let me share my mental model in order to help forming the design.

First of all, thank you very much for your detailed reply. As Friday was a
public holiday in Korea and I had some personal commitments over the weekend,
I only got to check your email late — I hope you can kindly excuse the delayed
response.

For the points that require deeper consideration, I will provide detailed
answers later. For now, let me share some quick feedback on the parts I can
respond to right away.

> I find these per-cgroup swap priorities similar to cpuset -- instead of
> having a configured cpumask (bitmask) for each cgroup, you have
> weight-mask for individual swap devices (or distribution over the
> devices, I hope it's not too big deviation from priority ranking).
> Then you have the hierarchy, so you need a method how to combine
> child+parent masks (or global/root) to obtain effective weight-mask (and
> effective ranking) for each cgroup.
> 
> Furthermore, there's the NUMA autobinding which adds another weight-mask
> to the game but this time it's not configured but it depends on "who is
> asking". (Tasks running on node N would have autobind shifted towards
> devices associated to node N. Is that how autobinding works?)

Yes, your description indeed captures the core concept of how autobinding
works.
 
> From the hierarchy point of view, you have to compound weight-masks in
> top-down preference (so that higher cgroups can override lower) and
> autobind weight-mask that is only conceivable at the very bottom
> (not a cgroup but depending on the task's NUMA placement).
> 
> There I see conflict between the ends a tad. I think the attempted
> reconciliation was to allow emptiness of a single slot in the
> weight-mask but it may not be practical for the compounding (that's why
> you came up with the four variants). So another option would be to allow
> whole weight-mask being empty (or uniform) so that it'd be identity in
> the compounding operation.
> The conflict exists also in the current non-percg priorities -- there
> are the global priorities and autobind priorities. IIUC, the global
> level either defines a weight (user prio) or it is empty (defer to NUMA
> autobinding).
> 
> [I leveled rankings and weight-masks of devices but I left a loophole of
> how the empty slots in the latter would be converted to (and from)
> rankings. This e-mail is already too long.]

Yes. A single slot's emptiness is the enemy.
The problem arises from two aspects: (1) allowing per-device priorities
inherently leads to the possibility of single-slot emptiness, and (2)
depending on swapon configuration, empty slots may be inevitable. That’s
why the compounding rules ended up allowing this complexity. I’ll review
your suggestions carefully and share soon how we might simplify this
direction.

> 
> A very different alternative that comes to my mind together with
> autobinding and leveraging that to your use case:
> - define virtual NUMA nodes [1],
> - associate separate swap devices to those nodes,
> - utilize task (or actual (mem)cpuset) affinity to those virtual NUMA
>   nodes based on each process's swap requirements,
> - NUMA autobinding would then yield the device constraints you sought.

Creative. I have understood the overall concept for now.

Thank you as always for your valuable insights.

Best regards,  
YoungJun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-15 15:10           ` Chris Li
@ 2025-08-16 17:21             ` YoungJun Park
  2025-08-16 19:15               ` Chris Li
  0 siblings, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-08-16 17:21 UTC (permalink / raw)
  To: Chris Li
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Fri, Aug 15, 2025 at 08:10:09AM -0700, Chris Li wrote:
> Hi Michal and YoungJun,

First of all, thank you for sharing your thoughts. I really appreciate the
detailed feedback. I have many points I would like to think through and
discuss as well. For now, let me give some quick feedback, and I will follow
up with more detailed responses after I have had more time to reflect.

> I am sorry for the late reply. I have briefly read through the patch
> series; the overall impression:
> 1)  Priority is not the best way to select which swap file to use per cgroup.
> The priority is assigned to one device; it is a per swap file local
> change. The effect you want to see is actually a global one: how this
> swap device compares to other devices. You actually want a list as
> the end result. Adjusting per swap file priority is backwards. A lot
> of unnecessary usage complexity and code complexity comes from that.
> 2)  This series is too complicated for what it does.

You mentioned that the series is overly complex and does more than what is
really needed. I understand your concern. I have spent quite a lot of time
thinking about this topic, and the reason I chose the priority approach is
that it gives more flexibility and extensibility by reusing an existing
concept.

Where you see unnecessary functionality, I tend to view it as providing more
degrees of freedom and flexibility. In my view, the swap tier concept can be
expressed as a subset of the per-cgroup priority model.

> I have a similar idea, "swap.tiers," first mentioned earlier here:
> https://lore.kernel.org/linux-mm/CAF8kJuNFtejEtjQHg5UBGduvFNn3AaGn4ffyoOrEnXfHpx6Ubg@mail.gmail.com/
> 
> I will outline the line in more detail in the last part of my reply.
> 
> BTW, YoungJun and Michal, do you have the per cgroup swap file control
> proposal for this year's LPC? If you want to, I am happy to work with
> you on the swap tiers topic as a secondary. I probably don't have the
> time to do it as a primary.

I have not submitted an LPC proposal. If it turns out to be necessary,
I agree it could be a good idea, and I truly appreciate your offer to
work together on it. From my understanding, though, the community has
so far received this patchset positively, so I hope the discussion can
continue within this context and eventually be accepted there.
 
> OK. I want to abandon the weight-adjustment approach. Here I outline
> the swap tiers idea as follows. I can probably start a new thread for
> that later.
> 
> 1) No per cgroup swap priority adjustment. The swap file priority is
> global to the system.
> Per cgroup swap file ordering adjustment is bad from the LRU point of
> view. We should make the swap file ordering match the swap device
> service performance. The fast swap tiers (zram, zswap) store hotter
> data, the slower tier (hard drive) stores colder data, and SSD sits
> in between. It is important to keep the fast/slow tier ordering
> matched to the hot/cold LRU ordering.

Regarding your first point about swap tiers: I would like to study this part
a bit more carefully. If you could share some additional explanation, that
would be very helpful for me.
 
> 2) There is a simple mapping of global swap tier names into priority ranges.
> The names themselves are customizable.
> e.g. 100+ is the "compress_ram" tier, 50-99 is the "SSD" tier, 0-49 is
> the "hdd" tier.
> The detailed mechanism and API are TBD.
> The end result is that a simple tier name lookup will get the priority range.
> By default all swap tiers are available for global usage without
> cgroup. That matches the current global swapon behavior.
> 
> 3) Each cgroup will have "swap.tiers" (name TBD) to opt in/out of the tier.
> It is a list of tiers including the default tier who shall not be named.
> 
> Here are a few examples:
> e.g. consider the following cgroup hierarchy a/b/c/d, a as the first
> level cgroup.
> a/swap.tiers: "- +compress_ram"
> it means who shall not be named is set to opt out, opting in to
> compress_ram only; no ssd, no hdd.
> Who shall not be named, if specified, has to be the first one listed
> in the "swap.tiers".
> 
> a/b/swap.tiers: "+ssd"
> For the b cgroup, who shall not be named is not specified, so the tier is
> appended to the parent "a/swap.tiers". The effective "a/b/swap.tiers"
> becomes "- +compress_ram +ssd".
> a/b can use both zswap and ssd.
> 
> Every time the who shall not be named is changed, it can drop the
> parent swap.tiers chain, starting from scratch.
> 
> a/b/c/swap.tiers: "-"
> 
> For c, it turns off all swap. The effective "a/b/c/swap.tiers" becomes
> "- +compress_ram +ssd -", which simplifies to "-", because the second "-"
> overwrites all previous opt-in/opt-out results.
> In other words, if the current cgroup does not specify the who shall
> not be named, it will walk the parent chain until it does. The global
> "/" for non cgroup is on.
> 
> a/b/c/d/swap.tiers: "- +hdd"
> For d, only hdd swap, nothing else.
> 
> More example:
>  "- +ssd +hdd -ssd" will simplify to: "- +hdd", which means hdd only.
>  "+ -hdd": No hdd for you! Use everything else.
> 
> Let me know what you think about the above "swap.tiers"(name TBD) proposal.

Thank you very much for the detailed description of the "swap.tiers" idea.
As I understand it, the main idea is to separate swap devices by speed,
assign a suitable priority range for each, and then make it easy for users to
include or exclude tiers. I believe I have understood the concept clearly.

I agree that operating with tiers is important. At the same time, as I
mentioned earlier, I believe that managing priorities in a way that reflects
tiers can also achieve the intended effect.

I have also been thinking about a possible compromise. If the interface is
intended to make tiers visible to users in the way you describe, then mapping
priority ranges to tiers (as you propose) makes sense. Users would still have
the flexibility to define ordering, while internally we could maintain the
priority list model I suggested. I wonder what you think about such a hybrid
approach. 

Thank you as always for your valuable insights.

Best regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-16 17:21             ` YoungJun Park
@ 2025-08-16 19:15               ` Chris Li
  2025-08-19 10:12                 ` YoungJun Park
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Li @ 2025-08-16 19:15 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Sat, Aug 16, 2025 at 10:21 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Fri, Aug 15, 2025 at 08:10:09AM -0700, Chris Li wrote:
> > Hi Michal and YoungJun,
>
> First of all, thank you for sharing your thoughts. I really appreciate the
> detailed feedback. I have many points I would like to think through and
> discuss as well. For now, let me give some quick feedback, and I will follow
> up with more detailed responses after I have had more time to reflect.

Please do, that is part of the community feedback and review process.

> > I am sorry for the late reply. I have briefly read through the patch
> > series; the overall impression:
> > 1)  Priority is not the best way to select which swap file to use per cgroup.
> > The priority is assigned to one device; it is a per swap file local
> > change. The effect you want to see is actually a global one: how this
> > swap device compares to other devices. You actually want a list as
> > the end result. Adjusting per swap file priority is backwards. A lot
> > of unnecessary usage complexity and code complexity comes from that.
> > 2)  This series is too complicated for what it does.
>
> You mentioned that the series is overly complex and does more than what is
> really needed. I understand your concern. I have spent quite a lot of time
> thinking about this topic, and the reason I chose the priority approach is
> that it gives more flexibility and extensibility by reusing an existing
> concept.

I am not questioning that you can achieve your goal with this
approach. The real question is whether this is the best approach to
merge into the mainline Linux kernel. Merging into the mainline
kernel has a very high bar. How does it compare to the alternative
approaches in terms of technical merit and complexity trade-offs?

> Where you see unnecessary functionality, I tend to view it as providing more
> degrees of freedom and flexibility. In my view, the swap tier concept can be
> expressed as a subset of the per-cgroup priority model.

Why would I trade a cleaner, less complex approach for a more complex
approach with a technical deficiency it cannot address (inverting swap
entry LRU ordering)?

> > I have a similar idea, "swap.tiers," first mentioned earlier here:
> > https://lore.kernel.org/linux-mm/CAF8kJuNFtejEtjQHg5UBGduvFNn3AaGn4ffyoOrEnXfHpx6Ubg@mail.gmail.com/
> >
> > I will outline the line in more detail in the last part of my reply.
> >
> > BTW, YoungJun and Michal, do you have the per cgroup swap file control
> > proposal for this year's LPC? If you want to, I am happy to work with
> > you on the swap tiers topic as a secondary. I probably don't have the
> > time to do it as a primary.
>
> I have not submitted an LPC proposal. If it turns out to be necessary,
> I agree it could be a good idea, and I truly appreciate your offer to
> work together on it.

Let me clarify. LPC is not required to get your series merged. Giving
a talk in LPC usually is an honor. It does not guarantee your series
gets merged either. It certainly helps your idea get more exposure and
discussion. You might be able to meet some maintainers in person. For
me, it is nice to meet the person to whom I have been communicating by
email. I was making the suggestion because it can be a good topic for
LPC, and just in case you might enjoy LPC. It is totally for your
benefit. Up to your decision, please don't make it a burden. It is
not.

If after your consideration, you do want to submit a proposal in LPC,
you need to hurry though. The deadline is closing soon.

> From my understanding, though, the community has
> so far received this patchset positively, so I hope the discussion can
> continue within this context and eventually be accepted there.

Let me make it very clear.  As it is, it will not get my support for
the reason I have laid out in my last email.

> > OK. I want to abandon the weight-adjustment approach. Here I outline
> > the swap tiers idea as follows. I can probably start a new thread for
> > that later.
> >
> > 1) No per cgroup swap priority adjustment. The swap file priority is
> > global to the system.
> > Per cgroup swap file ordering adjustment is bad from the LRU point of
> > view. We should make the swap file ordering match the swap device
> > service performance. The fast swap tiers (zram, zswap) store hotter
> > data, the slower tier (hard drive) stores colder data, and SSD sits
> > in between. It is important to keep the fast/slow tier ordering
> > matched to the hot/cold LRU ordering.
>
> Regarding your first point about swap tiers: I would like to study this part
> a bit more carefully.

Please do.

> If you could share some additional explanation, that
> would be very helpful for me.

Feel free to ask, I will do my best to answer.

> > More example:
> >  "- +ssd +hdd -ssd" will simplify to: "- +hdd", which means hdd only.
> >  "+ -hdd": No hdd for you! Use everything else.
> >
> > Let me know what you think about the above "swap.tiers"(name TBD) proposal.
>
> Thank you very much for the detailed description of the "swap.tiers" idea.
> As I understand it, the main idea is to separate swap devices by speed,
> assign a suitable priority range for each, and then make it easy for users to
> include or exclude tiers. I believe I have understood the concept clearly.
>
> I agree that operating with tiers is important. At the same time, as I
> mentioned earlier, I believe that managing priorities in a way that reflects
> tiers can also achieve the intended effect.

The per-cgroup per-swap-file priorities have one Achilles heel you need
to address before you can make any further progress upstreaming it.
Putting the extra complexity aside, per-cgroup per-swap-file priorities
can invert the swap entry LRU order between the different orderings seen
by different cgroups. That violates the swap entry LRU order between
tiers.

From the swap file point of view, when it needs to flush some data to
the lower tiers, it is very hard, if possible at all, for a swap file
to maintain per-cgroup LRU order within that swap file. It is much
easier if all the swap entries in a swap file belong to the same LRU
order tier.

Inverting swap entry LRU order is a deal breaker for your per cgroup
per swap file priority approach.

> I have also been thinking about a possible compromise. If the interface is

The swap.tiers idea is not a compromise, it is a straight win. Can you
describe what per cgroup per swap file can do while swap.tiers can
not?

> intended to make tiers visible to users in the way you describe, then mapping
> priority ranges to tiers (as you propose) makes sense. Users would still have
> the flexibility to define ordering, while internally we could maintain the

Because I don't want to violate the swap entry LRU ordering between
tiers. Within that context, what usage case do you have in mind?
Within the same tier, the swap devices can have a finer-grained
priority order between them. The part I haven't understood, please
help me understand, is why you need per-cgroup per-swap-file ordering
rather than the tier order. The tier order is much easier from the
admin's point of view: this app needs to be fast and can't afford slow
swap, so give it the faster swap tiers.

> priority list model I suggested. I wonder what you think about such a hybrid
> approach.

It obviously will introduce new complexity. I want to understand the
reason to justify the additional complexity before I consider such an
approach.

> Thank you as always for your valuable insights.

My pleasure. Thanks for leading this per cgroup swap file effort.

Chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-16 19:15               ` Chris Li
@ 2025-08-19 10:12                 ` YoungJun Park
  2025-08-20  0:52                   ` Chris Li
  0 siblings, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-08-19 10:12 UTC (permalink / raw)
  To: Chris Li
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Sat, Aug 16, 2025 at 12:15:43PM -0700, Chris Li wrote:

First of all, thank you for the detailed and fast feedback!

> I have not questioned the approach you can achieve with your goal. The
> real question is, is this the best approach to consider to merge into

Yes, I believe this could be the best approach.
I have compared several possible approaches before making this proposal. These
are the alternatives I reviewed in the RFC:
(https://lore.kernel.org/linux-mm/20250612103743.3385842-1-youngjun.park@lge.com/)
The part I mentioned is quoted below:

> Evaluated Alternatives
> ======================
> 1. **Per-cgroup dedicated swap devices**
>    - Previously proposed upstream [1]
>    - Challenges in managing global vs per-cgroup swap state
>    - Difficult to integrate with existing memory.limit / swap.max semantics
> 2. **Multi-backend swap device with cgroup-aware routing**
>    - Considered sort of layering violation (block device cgroup awareness)
>    - Swap devices are commonly meant to be physical block devices.
>    - Similar idea mentioned in [2]
> > 3. **Per-cgroup swap device enable/disable with swap usage control**
>    - Expand swap.max with zswap.writeback usage
>    - Discussed in context of zswap writeback [3]
>    - Cannot express arbitrary priority orderings
>      (e.g. swap priority A-B-C on cgroup C-A-B impossible)
>    - Less flexible than per-device priority approach
> 4. **Per-namespace swap priority configuration**
>    - In short, make swap namespace for swap device priority
>    - Overly complex for our use case
>    - Cgroups are the natural scope for this mechanism

In my view, the `swap.tier` proposal aligns quite well with alternative (3) that
I reviewed. That approach keeps the global priority assignment while adding
inclusion/exclusion semantics at the cgroup level. The reason I decided not to
go with it is that it lacks flexibility: it cannot express arbitrary ordering.
As noted above, arbitrary orderings cannot be represented, which is why I chose
a per-device priority strategy instead.

> the main line Linux kernel. Merging into the main line kernel has a
> very high bar. How is it compared to other alternative approaches in
> terms of technical merit and complexity trade offs.

Since you seem most concerned about complexity, I have been thinking more about
this point.

1. **Conceptual complexity**  
   The idea is simply to add a swap priority list per cgroup. This is
   straightforward to understand. The more complicated part is NUMA priority
   handling — but if that turns out to be too complex, we can drop it entirely
   or adjust its semantics to reduce the complexity.

2. **Implementation complexity**  
   Could you clarify from which perspective you see implementation complexity as
   problematic? I would like to know more specifically what part worries you.

The `swap.tier` concept also requires mapping priorities to tiers, creating
per-cgroup tier objects, and so forth. That means a number of supporting
structures are needed as well. While I agree it is conceptually well-defined,
I don’t necessarily find it simpler than the per-device priority model.

> Why would I trade a cleaner less complex approach for a more complex
> approach with technical deficiency not able to address (inverting swap
> entry LRU ordering)?

Could you elaborate on what exactly you mean by “inverting swap entry LRU order”?
Do you mean that because of per-cgroup priority differences, entries on the
global swap LRU list could become inconsistent when viewed from different
cgroups? If that is the case, could you explain more concretely what problems
such inconsistencies would cause? That would help me understand the concern
better.

> Let me clarify. LPC is not required to get your series merged. Giving
> a talk in LPC usually is an honor. It does not guarantee your series
> gets merged either. It certainly helps your idea get more exposure and
> discussion. You might be able to meet some maintainers in person. For
> me, it is nice to meet the person to whom I have been communicating by
> email. I was making the suggestion because it can be a good topic for
> LPC, and just in case you might enjoy LPC. It is totally for your
> benefit. Up to your decision, please don't make it a burden. It is
> not.
>
> If after your consideration, you do want to submit a proposal in LPC,
> you need to hurry though. The deadline is closing soon.

I see, thank you for the suggestion. I also think having the chance to discuss
this at LPC would be very beneficial for me. I will not see it as a burden —
if I decide to go forward, I will let you know right away (within this week).

> From the swap file point of view, when it needs to flush some data to
> the lower tiers, it is very hard if possible for swap file to maintain
> per cgroup LRU order within a swap file.

Could you explain in more detail why the flush operation is difficult in that
case? I would like to understand what the concrete difficulty is.

> It is much easier if all the swap entries in a swap file are in the
> same LRU order tier.

This is related to the same question above — I would appreciate a more
detailed explanation because it is not yet clear to me. Why is it easier?

> The swap.tiers idea is not a compromise, it is a straight win. Can you
> describe what per cgroup per swap file can do while swap.tiers can
> not?

I mentioned already on this mail: what swap tiers cannot do is arbitrary
ordering. If ordering is fixed globally by tiers, some workloads that want to
consume slower swap devices first (and reserve faster devices as a safety
backend to minimize swap failures) cannot be expressed. This kind of policy
requires arbitrary ordering flexibility, which is possible with per-device
priorities but not with fixed tiers.

There is also the possible vswap usage: if we must consider vswap (assuming
we can select it like an individual swap device), where should it be mapped
in the tier model?
(see https://lore.kernel.org/linux-mm/CAMgjq7BA_2-5iCvS-vp9ZEoG=1DwHWYuVZOuH8DWH9wzdoC00g@mail.gmail.com/) 
In my opinion, it cannot be mapped purely by service speed. 
There are indeed situations where tiering by service speed is beneficial, 
but I also believe priority-based ordering can capture the same intention 
while also covering exceptional use cases.

So, I see the per-device priority approach as more general: it can represent
tier-based usage, but also more flexible policies that tiers alone cannot cover.

> It obviously will introduce new complexity. I want to understand the
> reason to justify the additional complexity before I consider such an
> approach.

I think that any new concept adds complexity, whether it is “swap.tier” or
per-device priority. If you could clarify more precisely what kind of
complexity you are most concerned about, I would be happy to give my detailed
thoughts in that direction.

Thank you again for your prompt and thoughtful feedback :). I will continue
thinking about this further while awaiting your reply.

Best regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-19 10:12                 ` YoungJun Park
@ 2025-08-20  0:52                   ` Chris Li
  2025-08-20 14:39                     ` YoungJun Park
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Li @ 2025-08-20  0:52 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Tue, Aug 19, 2025 at 3:13 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Sat, Aug 16, 2025 at 12:15:43PM -0700, Chris Li wrote:
>
> At first, Thank you for detailed and fast feedback!
>
> > I have not questioned the approach you can achieve with your goal. The
> > real question is, is this the best approach to consider to merge into
>
> Yes, I believe this could be the best approach.
> I have compared several possible approaches before making this proposal. These
> are the alternatives I reviewed in the RFC:
> (https://lore.kernel.org/linux-mm/20250612103743.3385842-1-youngjun.park@lge.com/)
> The part I mentions are as belows
>
> > Evaluated Alternatives
> > ======================
> > 1. **Per-cgroup dedicated swap devices**
> >    - Previously proposed upstream [1]
> >    - Challenges in managing global vs per-cgroup swap state
> >    - Difficult to integrate with existing memory.limit / swap.max semantics
> > 2. **Multi-backend swap device with cgroup-aware routing**
> >    - Considered sort of layering violation (block device cgroup awareness)
> >    - Swap devices are commonly meant to be physical block devices.
> >    - Similar idea mentioned in [2]
> > > 3. **Per-cgroup swap device enable/disable with swap usage control**
> >    - Expand swap.max with zswap.writeback usage
> >    - Discussed in context of zswap writeback [3]
> >    - Cannot express arbitrary priority orderings
> >      (e.g. swap priority A-B-C on cgroup C-A-B impossible)
> >    - Less flexible than per-device priority approach
> > 4. **Per-namespace swap priority configuration**
> >    - In short, make swap namespace for swap device priority
> >    - Overly complex for our use case
> >    - Cgroups are the natural scope for this mechanism
>
> In my view, the `swap.tier` proposal aligns quite well with alternative (3) that
> I reviewed. That approach keeps the global priority assignment while adding

Not the same as option 3. swap.tiers has one level of indirection for
the tier class. It does not directly operate on swap files. That level
of indirection allows swap files to rotate within the same tier. I
expect it to have very few tiers, so all the swap tiers can fit in a
simple bitmask, e.g. one 32-bit integer per cgroup is good enough.
Assume we allow 31 tiers; we can have fewer than 32 swap files, so 31
tiers should be more than enough.
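
To be concrete, the per-cgroup state I have in mind is tiny. Here is a
minimal userspace sketch (the tier names and helpers are made up for
illustration; this is not kernel code):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Up to 31 tiers fit comfortably; the indexes here are invented. */
enum { TIER_COMPRESS_RAM, TIER_SSD, TIER_HDD };

struct cgroup_swap_tiers {
        uint32_t enabled;               /* bit N set => tier N allowed */
};

static void tier_enable(struct cgroup_swap_tiers *t, unsigned int tier)
{
        t->enabled |= 1u << tier;
}

static void tier_disable(struct cgroup_swap_tiers *t, unsigned int tier)
{
        t->enabled &= ~(1u << tier);
}

static bool tier_allowed(const struct cgroup_swap_tiers *t, unsigned int tier)
{
        return t->enabled & (1u << tier);
}

int main(void)
{
        struct cgroup_swap_tiers cg = { 0 };

        tier_enable(&cg, TIER_COMPRESS_RAM);    /* "+compress_ram" */
        tier_enable(&cg, TIER_SSD);             /* "+ssd" */
        tier_disable(&cg, TIER_SSD);            /* "-ssd" */

        printf("compress_ram: %d, ssd: %d, hdd: %d\n",
               tier_allowed(&cg, TIER_COMPRESS_RAM),
               tier_allowed(&cg, TIER_SSD),
               tier_allowed(&cg, TIER_HDD));
        return 0;
}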

> inclusion/exclusion semantics at the cgroup level. The reason I decided not to
> go with it is because it lacks flexibility — it cannot express arbitrary
> ordering. As noted above, it is impossible to represent arbitrary orderings,
> which is why I chose a per-device priority strategy instead.

As I said, arbitrary orders violate the swap entry LRU order. You still
haven't given me a detailed technical reason why you need arbitrary
orders other than "I want a pony".

> > the main line Linux kernel. Merging into the main line kernel has a
> > very high bar. How is it compared to other alternative approaches in
> > terms of technical merit and complexity trade offs.
>
> Since you seem most concerned about complexity, I have been thinking more about
> this point.
>
> 1. **Conceptual complexity**
>    The idea is simply to add a swap priority list per cgroup. This is
>    straightforward to understand. The more complicated part is NUMA priority
>    handling — but if that turns out to be too complex, we can drop it entirely
>    or adjust its semantics to reduce the complexity.

The swap priority list is a list. The swap tiers are just a set of
fewer than 32 tiers in total, which can be expressed in one integer
bitmask.

> 2. **Implementation complexity**
>    Could you clarify from which perspective you see implementation complexity as
>    problematic? I would like to know more specifically what part worries you.

What is the total line count of your 4-patch series? I expect the swap
tiers version can be much shorter, because it does not deal with
arbitrary orders.

> The `swap.tier` concept also requires mapping priorities to tiers, creating
> per-cgroup tier objects, and so forth. That means a number of supporting
> structures are needed as well. While I agree it is conceptually well-defined,
> I don’t necessarily find it simpler than the per-device priority model.

You haven't embraced the swap.tiers idea to the full extent. I do see
it can be simpler if you follow my suggestion. You are imagining a
version that uses the swap file priority data structure to implement
the swap tiers. That is not what I have in mind. The tiers can be just
one integer representing the set of tiers the cgroup enrolls in, plus
the default. If you follow my suggestion and that design, you will
have a simpler series in the end.

> > Why would I trade a cleaner less complex approach for a more complex
> > approach with technical deficiency not able to address (inverting swap
> > entry LRU ordering)?
>
> Could you elaborate on what exactly you mean by “inverting swap entry LRU order”?
> Do you mean that because of per-cgroup priority differences, entries on the
> global swap LRU list could become inconsistent when viewed from different
> cgroups?

Exactly.

>If that is the case, could you explain more concretely what problems
> such inconsistencies would cause? That would help me understand the concern

The problem is that you pollute your fast tier with very cold swap
entry data. That is to your disadvantage, because you will need to
swap back more from the slower tier.

e.g. you have two pages. Swap entry A will get 2 swap faults and swap
entry B will get 20 swap faults in the next 2 hours, so B is hotter
than A. Let's say you have to store one of them in zswap and the other
on hdd. Which one should you store in the faster zswap? Obviously swap
entry B.

It will cause more problems when you flush the data to the lower tier.
You want to flush the coldest data first. Please read about the
history of zswap writeback and the LRU problem it encountered. The
most recent series about zswap storing incompressible pages on the
mailing list is driven precisely by the need to preserve swap entry
LRU order.

You really should consider the effect on swap entry LRU ordering
before you design the per cgroup swap priority.

> > From the swap file point of view, when it needs to flush some data to
> > the lower tiers, it is very hard if possible for swap file to maintain
> > per cgroup LRU order within a swap file.
>
> Could you explain in more detail why the flush operation is difficult in that
> case? I would like to understand what the concrete difficulty is.
>
> > It is much easier if all the swap entries in a swap file are in the
> > same LRU order tier.
>
> This is related to the same question above — I would appreciate a more
> detailed explanation because it is not yet clear to me. Why is it easy?

Because I don't need to alter the list ordering. When it enumerates
the same list of swap files, it just needs to check whether the
current swap file is excluded by the swap.tiers integer bitmask. Each
swap file can cache which tier it belongs to, for example.
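
Roughly, the check during enumeration could look like this userspace
sketch (the structures and names are invented; the real allocator is of
course more involved):

#include <stdio.h>
#include <stdint.h>

struct swap_file {
        const char *name;
        int prio;                /* global swap priority */
        unsigned int tier;       /* cached tier index of this swap file */
};

/* Global list, already sorted by priority (highest first). */
static const struct swap_file swap_files[] = {
        { "zram0",     100, 0 },
        { "nvme-swap",  50, 1 },
        { "hdd-swap",   10, 2 },
};

#define NR_SWAP_FILES (int)(sizeof(swap_files) / sizeof(swap_files[0]))

static const struct swap_file *pick_swap_file(uint32_t cgroup_tiers)
{
        for (int i = 0; i < NR_SWAP_FILES; i++) {
                /* Skip swap files whose tier this cgroup has excluded. */
                if (!(cgroup_tiers & (1u << swap_files[i].tier)))
                        continue;
                return &swap_files[i];  /* first allowed file in prio order */
        }
        return NULL;
}

int main(void)
{
        uint32_t tiers = (1u << 1) | (1u << 2);  /* ssd + hdd, no zram tier */
        const struct swap_file *sf = pick_swap_file(tiers);

        printf("picked: %s\n", sf ? sf->name : "(none)");
        return 0;
}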

>
> > The swap.tiers idea is not a compromise, it is a straight win. Can you
> > describe what per cgroup per swap file can do while swap.tiers can
> > not?
>
> I mentioned already on this mail: what swap tiers cannot do is arbitrary
> ordering. If ordering is fixed globally by tiers, some workloads that want to
> consume slower swap devices first (and reserve faster devices as a safety
> backend to minimize swap failures) cannot be expressed. This kind of policy
> requires arbitrary ordering flexibility, which is possible with per-device
> priorities but not with fixed tiers.

Let's say you have fast tier A and slow tier B.

Option 1) All swap entries go through the fast tier A first. As time
goes on, the colder swap entries will move to the end of tier A's LRU,
because no swap faults happen on those colder entries. If you run out
of space in A, you flush the end of A's LRU to B. If a swap fault does
happen within a relatively short period of time, it will be served by
the faster tier A.

That is a win compared to your proposal of going directly to B, where
more swap faults will be served by B than with option 1).

Option 2) Just disable fast tier A in the beginning and only use B
until B is full. Once B is full, you want to enable fast tier A. Then
it should move the head of B's LRU into A. That way it still maintains
the LRU order.

Option 1) seems better than 2) because it serves more swap faults from
the faster tier A.
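
To make option 1) concrete, here is a toy userspace model of the flush
order (it has nothing to do with the real swap code; it only illustrates
that the coldest entries of A end up in B first, so the LRU order across
tiers is preserved):

#include <stdio.h>

#define A_CAP 4

static int tier_a[A_CAP];       /* index 0 = hottest, last = coldest */
static int a_used;
static int tier_b[16];
static int b_used;

static void swap_out(int page)
{
        if (a_used == A_CAP) {
                /* A is full: flush the coldest entry of A down to B. */
                tier_b[b_used++] = tier_a[A_CAP - 1];
                a_used--;
        }
        /* The newly swapped-out page is the hottest; shift the rest down. */
        for (int i = a_used; i > 0; i--)
                tier_a[i] = tier_a[i - 1];
        tier_a[0] = page;
        a_used++;
}

int main(void)
{
        for (int page = 1; page <= 7; page++)
                swap_out(page);

        printf("tier A (hot -> cold): ");
        for (int i = 0; i < a_used; i++)
                printf("%d ", tier_a[i]);
        printf("\ntier B (flushed, coldest first): ");
        for (int i = 0; i < b_used; i++)
                printf("%d ", tier_b[i]);
        printf("\n");
        return 0;
}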

> And vswap possible usage: if we must consider vswap (assume we can select it
> like an individual swap device), where should it be mapped in the tier model?
> (see https://lore.kernel.org/linux-mm/CAMgjq7BA_2-5iCvS-vp9ZEoG=1DwHWYuVZOuH8DWH9wzdoC00g@mail.gmail.com/)

The swap tiers do not depend on vswap, so you don't need to worry about that now.

> In my opinion, it cannot be mapped purely by service speed.
> There are indeed situations where tiering by service speed is beneficial,
> but I also believe priority-based ordering can capture the same intention
> while also covering exceptional use cases.

The above two options should be able to cover what you want.

> So, I see the per-device priority approach as more general: it can represent
> tier-based usage, but also more flexible policies that tiers alone cannot cover.

It is not worthwhile to break the swap entry LRU order. We can do it
in a way that keeps the LRU order. You will be serving more swap
faults from the fast tier, which is an overall win.

> > It obviously will introduce new complexity. I want to understand the
> > reason to justify the additional complexity before I consider such an
> > approach.
>
> I think that any new concept adds complexity, whether it is “swap.tier” or
> per-device priority. If you could clarify more precisely what kind of
> complexity you are most concerned about, I would be happy to give my detailed
> thoughts in that direction.

I see no real justification to break the swap entry LRU order yet.
Will my solution 1) or 2) work for you in your example?

The per-cgroup swap tiers integer bitmask is simpler than maintaining
a per-cgroup ordered list. It might be the same complexity in your
mind, but I do see swap tiers as the simpler one.

Chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-20  0:52                   ` Chris Li
@ 2025-08-20 14:39                     ` YoungJun Park
  2025-08-21 20:39                       ` Chris Li
  0 siblings, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-08-20 14:39 UTC (permalink / raw)
  To: Chris Li
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

> > inclusion/exclusion semantics at the cgroup level. The reason I decided not to
> > go with it is because it lacks flexibility — it cannot express arbitrary     
> > ordering. As noted above, it is impossible to represent arbitrary orderings, 
> > which is why I chose a per-device priority strategy instead.                 
>                                                                                
> As said, arbitrary orders violate the swap entry LRU orders. You still         
> haven't given me a detailed technical reason why you need arbitrary            
> orders other than "I want a pony".

I believe the examples I provided for arbitrary ordering can be considered
a detailed technical reason.
(You responded with Option 1 and Option 2.)

> > The `swap.tier` concept also requires mapping priorities to tiers, creating  
> > per-cgroup tier objects, and so forth. That means a number of supporting     
> > structures are needed as well. While I agree it is conceptually well-defined,
> > I don’t necessarily find it simpler than the per-device priority model.      
>                                                                                
> You haven't embraced the swap.tiers ideas to the full extent. I do see         
> it can be simpler if you follow my suggestion. You are imaging a               
> version using swap file priority data struct to implement the swap             
> tiers. 

Thank you for the detailed explanation. I think I understood the core points
of this concept. What I wrote was simply my interpretation: that it can be
viewed as a well-defined extension of equal-priority grouping together with
inclusion/exclusion semantics. Nothing more and nothing less.

> That is not what I have in mind. The tiers can be just one             
> integer to represent the set of tiers it enrolls and the default. If           
> you follow my suggestion and the design you will have a simpler series         
> in the end.                                                                    

Through this discussion my intention is to arrive at the best solution,
and I appreciate that you pointed out areas I should reconsider. If you
and other reviewers (opinions from others would be helpful here)
generally conclude that the tier concept is the right path, I am clearly
willing to re-propose an RFC and patches based on your idea. In that
case, since arbitrary ordering would not be allowed, I fully agree that
the main swap selection logic would become simpler than my current
implementation.
                                                                    
> The problem is that you pollute your fast tier with very cold swap              
> entry data, that is to your disadvantage, because you will need to             
> swap back more from the slower tier.                                           
>                                                                                
> e.g. you have two pages. Swap entry A will get 2 swap faults, the swap         
> entry B will get 20 swap faults in the next 2 hours. B is hotter than          
> A. Let's say you have to store them one in zswap and the other in hdd.         
> Which one should you store in faster zswap? Obvious swap entry B.              
>                                                                                
> It will cause more problems when you flush the data to the lower tier.         
> You want to flush the coldest data first. Please read about the                
> history of zswap write back and what LRU problem it encountered. The           
> most recent zswap storing the incompressible pages series in the mail          
> list precisely driven by preserving the swap entry LRU order reason.           
>                                                                                
> You really should consider the effect on swap entry LRU ordering               
> before you design the per cgroup swap priority.                                

Then I would like to ask a fundamental question about priority. Priority is
a user interface, and the user has the choice. From the beginning, when the
user sets priorities, there could be a scenario where the slower swap is
given a higher priority and the faster swap is given a lower one. That is
possible. For example, if the faster device has a short lifetime, a real
use case might be to consume the slower swap first for endurance, and only
use the faster swap when unavoidable.

In this case, logically from the LRU perspective there is no inversion of
priority order, but in practice the slower device is filled first. That
looks like degradation from a performance perspective — but it is exactly
what the user intended.

The swap tier concept appears to map priority semantics directly to service
speed, so that higher priority always means faster service. This looks like
it enforces that choice on the user (though it remains open).

Even with swap tiers, under the semantics you suggested, it is possible for
a given cgroup to use only the slower tier. From that cgroup’s view there
is no LRU inversion, but since the fast swap exists and is left unused, it
could still be seen as an "inverse" in terms of usage.

In summary, what I struggle to understand is that if the major assumption
is that swap operation must always align with service speed, then even swap
tiers can contradict it (since users may deliberately prefer the lower
tier). In that case, wouldn’t the whole concept of letting users select swap
devices by priority itself also become a problem?

> > I mentioned already on this mail: what swap tiers cannot do is arbitrary     
> > ordering. If ordering is fixed globally by tiers, some workloads that want to
> > consume slower swap devices first (and reserve faster devices as a safety    
> > backend to minimize swap failures) cannot be expressed. This kind of policy  
> > requires arbitrary ordering flexibility, which is possible with per-device   
> > priorities but not with fixed tiers.                                         
>                                                                                
> Let's say you have fast tier A and slow tier B.                                
>                                                                                
> Option 1) All swap entries go through the fast tier A first. As time           
> goes on, the colder swap entry will move to the end of the tier A LRU,         
> because there is no swap fault happening to those colder entries. If           
> you run out of space of  A, then you flush the end of the A to B. If           
> the swap fault does happen in the relative short period of time, it            
> will serve by the faster tier of A.                                            
>                                                                                
> That is a win compared to your proposal you want directly to go to B,          
> with more swap faults will be served by B compared to option 1).               
>                                                                                
> option 2) Just disable fast tier A in the beginning, only use B until          
> B is full. At some point B is full, you want to enable fast tier A.            
> Then it should move the head LRU from B into A. That way it still              
> maintains the LRU order.                                                       
>                                                                                
> option 1) seems better than 2) because it serves more swap faults from         
> faster tier A.                                                                 

Option 1 does not really align with the usage scenario I had in mind,
since it starts from the fast swap. Option 2 fits partially, but requires
controlling when to enable the fast tier once full, and handling LRU
movement — which adds complexity.

Your final suggestion of Option 1 seems consistent with your original
objection: that the system design should fundamentally aim at performance
improvement by making use of the fast swap first.

> > And vswap possible usage: if we must consider vswap (assume we can select it 
> > like an individual swap device), where should it be mapped in the tier model?
> > (see https://lore.kernel.org/linux-mm/CAMgjq7BA_2-5iCvS-vp9ZEoG=1DwHWYuVZOuH8DWH9wzdoC00g@mail.gmail.com/)
>                                                                                
> The swap tires do not depend on vswap, you don't need to worry about that now. 

I initially understood that vswap could also be treated as an
individually selectable entity in the unified swap framework. If that
were the case, I thought it would be hard to map vswap into the tier
concept. Was that my misinterpretation?

> The per cgroup swap tiers integer bitmask is simpler than maintaining          
> a per cgroup order list. It might be the same complexity in your mind,         
> I do see swap tiers as the simpler one.                                        

I agree that from the perspective of implementing the main swap selection
logic, tiers are simpler. Since arbitrary ordering is not allowed, a large
part of the implementation complexity can indeed be reduced.

Once again, thank you for your thoughtful comments and constructive feedback.

Best Regards,
Youngjun Park 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-20 14:39                     ` YoungJun Park
@ 2025-08-21 20:39                       ` Chris Li
  2025-08-22  5:45                         ` YoungJun Park
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Li @ 2025-08-21 20:39 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Wed, Aug 20, 2025 at 7:39 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> > > inclusion/exclusion semantics at the cgroup level. The reason I decided not to
> > > go with it is because it lacks flexibility — it cannot express arbitrary
> > > ordering. As noted above, it is impossible to represent arbitrary orderings,
> > > which is why I chose a per-device priority strategy instead.
> >
> > As said, arbitrary orders violate the swap entry LRU orders. You still
> > haven't given me a detailed technical reason why you need arbitrary
> > orders other than "I want a pony".
>
> I believe the examples I provided for arbitrary ordering can be considered
> a detailed technical reason.
> (You responded with Option 1 and Option 2.)

You still have not provided the detailed reason for it. I understand
you want arbitrary per-cgroup ordering of swap devices, but that is a
solution, not the root cause. I want to go one level deeper: why do you
want to have per-cgroup swap device ordering? What is the consideration
for using a per-cgroup list of swap device order vs other approaches?
For example: I want to reserve the fast swap device mostly for jobs
requiring fast response, and I don't want to fill the fast swap device
with slow jobs' data. That is one of my guesses. Please provide the
background usage case and the thinking process that led you to that
conclusion. Right now I am just guessing in the dark. You jumped to the
conclusion of using arbitrary per-cgroup swap device order as the only
solution too quickly.

> > > The `swap.tier` concept also requires mapping priorities to tiers, creating
> > > per-cgroup tier objects, and so forth. That means a number of supporting
> > > structures are needed as well. While I agree it is conceptually well-defined,
> > > I don’t necessarily find it simpler than the per-device priority model.
> >
> > You haven't embraced the swap.tiers ideas to the full extent. I do see
> > it can be simpler if you follow my suggestion. You are imaging a
> > version using swap file priority data struct to implement the swap
> > tiers.
>
> Thank you for the detailed explanation. I think I understood the core points of this concept
> What I wrote was simply my interpretation — that it can be
> viewed as a well-defined extension of maintaining equal priority dependency
> together with inclusion/exclusion semantics. Nothing more and nothing less.

Good.


> > That is not what I have in mind. The tiers can be just one
> > integer to represent the set of tiers it enrolls and the default. If
> > you follow my suggestion and the design you will have a simpler series
> > in the end.
>
> Through this discussion my intention is to arrive at the best solution,

Ack.

> and I appreciate that you pointed out areas I should reconsider. If you,
> and other reviewers(If somebody gives opions of it, then it will be helpful)
> generally conclude that the tier concept is the right path,

That is why we should make it a more formal proposal, list out the
details to solicit feedback.

> I have a clear willingness to re-propose an RFC and patches
> based on your idea. In that case, since arbitrary ordering would not be
> allowed, I fully agree that the main swap selection logic would become
> simpler than my current implementation.

Thank you. If you can integrate the swap.tiers into your next series,
that would be great. I am worried that I might not have enough time to
implement it myself. I can certainly reason about it and point you in
the right direction as best as I can.

> > The problem is that you pollute your fast tier with very cold swap
> > entry data, that is to your disadvantage, because you will need to
> > swap back more from the slower tier.
> >
> > e.g. you have two pages. Swap entry A will get 2 swap faults, the swap
> > entry B will get 20 swap faults in the next 2 hours. B is hotter than
> > A. Let's say you have to store them one in zswap and the other in hdd.
> > Which one should you store in faster zswap? Obvious swap entry B.
> >
> > It will cause more problems when you flush the data to the lower tier.
> > You want to flush the coldest data first. Please read about the
> > history of zswap write back and what LRU problem it encountered. The
> > most recent zswap storing the incompressible pages series in the mail
> > list precisely driven by preserving the swap entry LRU order reason.
> >
> > You really should consider the effect on swap entry LRU ordering
> > before you design the per cgroup swap priority.
>
> Then I would like to ask a fundamental question about priority. Priority is
> a user interface, and the user has the choice. From the beginning, when the
> user sets priorities, there could be a scenario where the slower swap is

The priority is just the global swap file ordering. Higher priority
means that swap device is used first.

> given a higher priority and the faster swap is given a lower one. That is
> possible. For example, if the faster device has a short lifetime, a real
> use case might be to consume the slower swap first for endurance, and only
> use the faster swap when unavoidable.

The idea of matching the faster swap with higher priority is just a
strategy to get better performance. It does not mean that priority ==
device speed. If the user wants to choose another priority strategy,
maybe with slower performance, that is OK. They will get what they ask
for. We as kernel developers design the system as simply as possible
to achieve good performance, basically allowing the good strategy to
happen easily. I wouldn't go overboard changing the meaning of
priority.

> In this case, logically from the LRU perspective there is no inversion of
> priority order, but in practice the slower device is filled first. That
> looks like degradation from a performance perspective — but it is exactly
> what the user intended.

You touch on a very good point. How to mix the global order and the
per memcg order.

> The swap tier concept appears to map priority semantics directly to service
> speed, so that higher priority always means faster service. This looks like
> it enforces the choice on the user(but it is opend).

Yes and no. We should allow the better-performance strategy to happen
easily while keeping the code complexity low. That is what I am trying
to do here.

> Even with swap tiers, under the semantics you suggested, it is possible for
> a given cgroup to use only the slower tier. From that cgroup’s view there
> is no LRU inversion, but since the fast swap exists and is left unused, it
> could still be seen as an "inverse" in terms of usage.

Yes, if you put all of the fast tier in one group. It needs to be
discussed case by case. That is exactly what I am asking for: what
usage case do you have in mind that demands the per-cgroup priority?
We can analyze the usage case and come up with creative solutions
before we jump to a conclusion. You can, for example, divide the swap
space into two groups: A1 & A2 are both fast tiers, B1 & B2 are both
slow tiers. The cgroup that always follows the fill-A-then-B order
uses the A1 and B1 group. The cgroup that wants to fill B first, then
A, uses the A2 and B2 group. The 1 and 2 groups never mix. Then you
can still maintain LRU order: when B2 fills up and starts to use A2,
it will not upset the A1 LRU because they are different swap devices
in different groups.

If you give a more detailed usage situation and the challenge it
faces, I can give a more detailed solution using per-cgroup priority
vs swap.tiers. That is why your usage case and reasoning are
important.

> In summary, what I struggle to understand is that if the major assumption
> is that swap operation must always align with service speed, then even swap
> tiers can contradict it (since users may deliberately prefer the lower
> tier). In that case, wouldn’t the whole concept of letting users select swap
> devices by priority itself also become a problem?

Yes, if you keep them in one group and mix them. See the above 1 & 2
group option.

>
> > > I mentioned already on this mail: what swap tiers cannot do is arbitrary
> > > ordering. If ordering is fixed globally by tiers, some workloads that want to
> > > consume slower swap devices first (and reserve faster devices as a safety
> > > backend to minimize swap failures) cannot be expressed. This kind of policy
> > > requires arbitrary ordering flexibility, which is possible with per-device
> > > priorities but not with fixed tiers.
> >
> > Let's say you have fast tier A and slow tier B.
> >
> > Option 1) All swap entries go through the fast tier A first. As time
> > goes on, the colder swap entry will move to the end of the tier A LRU,
> > because there is no swap fault happening to those colder entries. If
> > you run out of space of  A, then you flush the end of the A to B. If
> > the swap fault does happen in the relative short period of time, it
> > will serve by the faster tier of A.
> >
> > That is a win compared to your proposal you want directly to go to B,
> > with more swap faults will be served by B compared to option 1).
> >
> > option 2) Just disable fast tier A in the beginning, only use B until
> > B is full. At some point B is full, you want to enable fast tier A.
> > Then it should move the head LRU from B into A. That way it still
> > maintains the LRU order.
> >
> > option 1) seems better than 2) because it serves more swap faults from
> > faster tier A.
>
> Option 1 does not really align with the usage scenario I had in mind,
> since it starts from the fast swap. Option 2 fits partially, but requires
> controlling when to enable the fast tier once full, and handling LRU
> movement — which adds complexity.

Why do you want to fill up the slower device first? You haven't
answered that question in detail. You are asking for a behavior
because you already decided you want this behavior. You need to go
deeper, to the root cause of why you want this behavior. What is your
ultimate goal? There might be other solutions that address your
ultimate goal without the behavior you chose.

> Your final suggestion of Option 1 seems consistent with your original
> objection: that the system design should fundamentally aim at performance
> improvement by making use of the fast swap first.

You did not give me a reason why option 1) violates your goal. I feel
that you are already fixated on the swap order. That is only the
outcome of your thought process; you haven't shown us how you came to
that conclusion.

> > > And vswap possible usage: if we must consider vswap (assume we can select it
> > > like an individual swap device), where should it be mapped in the tier model?
> > > (see https://lore.kernel.org/linux-mm/CAMgjq7BA_2-5iCvS-vp9ZEoG=1DwHWYuVZOuH8DWH9wzdoC00g@mail.gmail.com/)
> >
> > The swap tires do not depend on vswap, you don't need to worry about that now.
>
> I initially understood vswap could also be treated as an
> identity selectable in the unified swap framework. If that were the case, I
> thought it would be hard to map vswap into the tier concept. Was that my
> misinterpretation?

Your series, assuming it adopts swap.tiers, is likely to get in before
vswap does. If that is the case, that problem is for vswap to solve.
Let's work on this incrementally, one step at a time.

> > The per cgroup swap tiers integer bitmask is simpler than maintaining
> > a per cgroup order list. It might be the same complexity in your mind,
> > I do see swap tiers as the simpler one.
>
> I agree that from the perspective of implementing the main swap selection
> logic, tiers are simpler. Since arbitrary ordering is not allowed, a large
> part of the implementation complexity can indeed be reduced.

Exactly. We can start with this simple case and address the main
problem. If there is a special case where we need the other order, we
can add it later. It makes sense to have a simple and clean solution
address the majority of usage cases first. The most common usage I see
is: let latency-sensitive jobs use faster tiers, overflowing to a
slower tier if necessary. The latency-insensitive jobs just use the
slower tiers.

> Once again, thank you for your thoughtful comments and constructive feedback.

You are most welcome.


Chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-21 20:39                       ` Chris Li
@ 2025-08-22  5:45                         ` YoungJun Park
  2025-08-22 16:48                           ` Chris Li
  0 siblings, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-08-22  5:45 UTC (permalink / raw)
  To: Chris Li
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

I still believe that the priority based approach has more flexibility,
and can cover more usage scenarios. That opinion has not changed.

However, from this discussion I came to clearly understand and agree on
three points:

1. The swap.tier idea can be implemented in a much simpler way, and
2. It can cover the most important use cases I initially needed, as well
   as common performance scenarios, without causing LRU inversion.
3. A truly necessary usage scenario for arbitrary ordering does not exist;
   the usage scenario I suggested is imaginary (it is only a possibility).

I have also considered the situation where I might need to revisit my
original idea in the future. I believe this would still be manageable
within the swap.tier framework. For example:

* If, after swap.tier is merged, an arbitrary ordering use case arises
  (which you do not consider concrete), it could be solved by allowing
  cgroups to remap the tier order individually.

* If reviewers later decide to go back to the priority based direction,
  I think it will still be possible. By then, much of the work would
  already be done in patch v2, so switching back would not be
  impossible.

Also, since I highly respect your long-time contributions and deep
thinking in the swap layer, I decided to move the idea forward based
on swap.tier.

For now, I would like to share the first major direction change I am
considering, and get feedback on how to proceed. If you think this path
is promising, please advise whether I should continue as patch v2, or
send a new RFC series or new patch series.

-----------------------------------------------------------------------
1. Interface
-----------------------------------------------------------------------

In the initial thread you replied with the following examples:

> Here are a few examples:
> e.g. consider the following cgroup hierarchy a/b/c/d, a as the first
> level cgroup.
> a/swap.tiers: "- +compress_ram"
> it means who shall not be named is set to opt out, optin in
> compress_ram only, no ssd, no hard.
> Who shall not be named, if specified, has to be the first one listed
> in the "swap.tiers".
>
> a/b/swap.tiers: "+ssd"
> For b cgroup, who shall not be named is not specified, the tier is
> appended to the parent "a/swap.tiers". The effective "a/b/swap.tiers"
> become "- +compress_ram +ssd"
> a/b can use both zswap and ssd.
>
> Every time the who shall not be named is changed, it can drop the
> parent swap.tiers chain, starting from scratch.
>
> a/b/c/swap.tiers: "-"
>
> For c, it turns off all swap. The effective "a/b/c/swap.tiers" become
> "- +compress_ram +ssd -" which simplify as "-", because the second "-"
> overwrites all previous optin/optout results.
> In other words, if the current cgroup does not specify the who shall
> not be named, it will walk the parent chain until it does. The global
> "/" for non cgroup is on.
>
> a/b/c/d/swap.tiers: "- +hdd"
> For d, only hdd swap, nothing else.
>
> More example:
> "- +ssd +hdd -ssd" will simplify to: "- +hdd", which means hdd only.
> "+ -hdd": No hdd for you! Use everything else.
>
> Let me know what you think about the above "swap.tiers"(name TBD)
> proposal.

My opinion is that instead of mapping priority into named concepts, it
may be simpler to represent it as plain integers. 
(The integers are assigned in sequential order, as explained in the following reply.)
This would make the interface almost identical to the cpuset style suggested by Koutný.

For example:

  echo 1-8,9-10 > a/swap.tier   # parent allows tier range 1–8 and 9-10
  echo 1-4,9    > a/b/swap.tier # child uses tier 1-4 and 9 within parent's range
  echo 20   > a/b/swap.tier # invalid: parent only allowed 1-8 and 9-10

Named concepts can be handled by some userland software solution; the
kernel just provides the simple integer mapping. Userland software can
abstract it as a "named" tier for the user.
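
As a rough sketch of the parsing I have in mind (userspace illustration
only; error handling is simplified, and the parent-subset check from the
example above is included):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static uint32_t parse_tiers(const char *s)
{
        uint32_t mask = 0;
        char *end;

        while (*s) {
                long lo = strtol(s, &end, 10);
                long hi = lo;

                if (*end == '-')
                        hi = strtol(end + 1, &end, 10);
                for (long t = lo; t <= hi; t++)
                        if (t >= 1 && t <= 31)
                                mask |= 1u << t;
                if (*end != ',')
                        break;
                s = end + 1;
        }
        return mask;
}

int main(void)
{
        uint32_t parent = parse_tiers("1-8,9-10");
        uint32_t child  = parse_tiers("1-4,9");

        /* A child setting would only be valid as a subset of the parent's. */
        printf("child within parent: %s\n", (child & ~parent) ? "no" : "yes");
        return 0;
}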

Regarding the mapping of names to ranges, as you also mentioned:

> There is a simple mapping of global swap tier names into priority
> range
> The name itself is customizable.
> e.g. 100+ is the "compress_ram" tier. 50-99 is the "SSD" tier,
> 0-55 is the "hdd" tier.
> The detailed mechanization and API is TBD.
> The end result is a simple tier name lookup will get the priority
> range.
> By default all swap tiers are available for global usage without
> cgroup. That matches the current global swap on behavior.

One idea would be to provide a /proc/swaptier interface:

  echo "100 40" > /proc/swaptier

This would mean:
* >=100 : tier 1
* 40–99 : tier 2
* <40   : tier 3

How do you feel about this approach?

-----------------------------------------------------------------------
2. NUMA autobind
-----------------------------------------------------------------------

If NUMA autobind is in use, perhaps it is best to simply disallow
swaptier settings. I expect workloads depending on autobind would rely
on it globally, rather than per-cgroup. Therefore, when a negative
priority is present, tier grouping could reject the configuration.

-----------------------------------------------------------------------
3. Implementation
-----------------------------------------------------------------------

My initial thought is to implement a simple bitmask check. That is, in
the swap allocation slow path, check whether the cgroup has selected the
given tier. This is simple, but I worry it might lose the optimization of the
current priority list, where devices are dynamically tracked as they
become available or unavailable.

So perhaps a better design is to make swap tier an object, and have
each cgroup traverse only the priority list of the tiers it selected. I
would like feedback on whether this design makes sense.
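
Roughly, the shape I imagine is something like this userspace sketch
(all names are made up; a real implementation would track device
availability and rotation inside each tier):

#include <stdio.h>
#include <stdint.h>

struct swap_dev {
        const char *name;
};

struct swap_tier {
        const char *name;
        const struct swap_dev *devs;   /* devices in this tier, by priority */
        int ndevs;
};

static const struct swap_dev fast_devs[] = { { "zram0" } };
static const struct swap_dev slow_devs[] = { { "nvme-swap" }, { "hdd-swap" } };

static const struct swap_tier tiers[] = {
        { "compress_ram", fast_devs, 1 },
        { "disk",         slow_devs, 2 },
};

static const struct swap_dev *pick(uint32_t cgroup_mask)
{
        for (int t = 0; t < 2; t++) {
                if (!(cgroup_mask & (1u << t)))
                        continue;                  /* tier not selected */
                if (tiers[t].ndevs)
                        return &tiers[t].devs[0];  /* real code would rotate */
        }
        return NULL;
}

int main(void)
{
        const struct swap_dev *d = pick(1u << 1);  /* slower tier only */

        printf("picked: %s\n", d ? d->name : "(none)");
        return 0;
}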

-----------------------------------------------------------------------

Finally, I want to thank all reviewers for the constructive feedback.
Even if we move to the swap.tier approach, the reviews from Kairui, Nhat
Pham and Koutný are still valid and will remain relevant.

Kairui, Nhat Pham
* Regarding per-cgroup per-cluster feedback: this would likely need to
  be adapted to tier-based design.
* Regarding passing percpu info along the allocation path: since tier is
  selected per-cgroup, this may still be needed, depending on
  implementation.

Koutný
* Regarding NUMA autobind complexity: as explained above, I intend to
  design the mechanism so that autobind does not affect it. Parent-child
  semantics will remain essentially identical to cpuset. If the proposed
  interface is accepted, its usage would be like cpuset, which should be
  less controversial.

---

Thank you again for the suggestions. I will continue to review while
waiting for your feedback.

Best Regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-22  5:45                         ` YoungJun Park
@ 2025-08-22 16:48                           ` Chris Li
  2025-08-24 12:05                             ` YoungJun Park
  2025-08-24 14:19                             ` YoungJun Park
  0 siblings, 2 replies; 39+ messages in thread
From: Chris Li @ 2025-08-22 16:48 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Thu, Aug 21, 2025 at 10:45 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> I still believe that the priority based approach has more flexibility,
> and can cover more usage scenarios. That opinion has not changed.

I agree with you on that. It is more flexible that way, no question about it.

I am open to considering your usage scenarios and revisiting the
swap.tiers limitation. I just haven't seen a real usage scenario yet.

> However, from this discussion I came to clearly understand and agree on
> three points:
>
> 1. The swap.tier idea can be implemented in a much simpler way, and
> 2. It can cover the most important use cases I initially needed, as well
>    as common performance scenarios, without causing LRU inversion.
Glad we are aligned on this.

> 3. The really really needed usage scenario of arbitrary ordering does not exist.
> the usage scenario I suggest is imaginary.(just has possibility)
Wow, that is a surprise for me to see from you. I was expecting some
very complex or special usage case demanding the arbitrary ordering.
If it is just an imaginary usage scenario, I am very glad we did not
pay the price of extra complexity for an imaginary usage.

> I have also considered the situation where I might need to revisit my
> original idea in the future. I believe this would still be manageable
> within the swap.tier framework. For example:

Sure, having an incremental improvement is a good thing. We can always
come back and revisit if the reasoning for the previous decision is
still valid or not.

> * If after swap.tier is merged, an arbitrate ordering use case arises
>   (which you do not consider concrete), it could be solved by allowing
>   cgroups to remap the tier order individually.

Ack.

> * If reviewers later decide to go back to the priority based direction,
>   I think it will still be possible. By then, much of the work would
>   already be done in patch v2, so switching back would not be
>   impossible.

I really doubt that we need to get back to the pure priority approach.

> And also, since I highly respect you for long-time contributions and
> deep thinking in the swap layer, I decided to move the idea forward
> based on swap.tier.

Thank you. I really appreciate you taking the feedback with flexibility.

> For now, I would like to share the first major direction change I am
> considering, and get feedback on how to proceed. If you think this path
> is promising, please advise whether I should continue as patch v2, or
> send a new RFC series or new patch series.
>
> -----------------------------------------------------------------------
> 1. Interface
> -----------------------------------------------------------------------
>
> In the initial thread you replied with the following examples:
>
> > Here are a few examples:
> > e.g. consider the following cgroup hierarchy a/b/c/d, a as the first
> > level cgroup.
> > a/swap.tiers: "- +compress_ram"
> > it means who shall not be named is set to opt out, optin in
> > compress_ram only, no ssd, no hard.
> > Who shall not be named, if specified, has to be the first one listed
> > in the "swap.tiers".
> >
> > a/b/swap.tiers: "+ssd"
> > For b cgroup, who shall not be named is not specified, the tier is
> > appended to the parent "a/swap.tiers". The effective "a/b/swap.tiers"
> > become "- +compress_ram +ssd"
> > a/b can use both zswap and ssd.
> >
> > Every time the who shall not be named is changed, it can drop the
> > parent swap.tiers chain, starting from scratch.
> >
> > a/b/c/swap.tiers: "-"
> >
> > For c, it turns off all swap. The effective "a/b/c/swap.tiers" become
> > "- +compress_ram +ssd -" which simplify as "-", because the second "-"
> > overwrites all previous optin/optout results.
> > In other words, if the current cgroup does not specify the who shall
> > not be named, it will walk the parent chain until it does. The global
> > "/" for non cgroup is on.
> >
> > a/b/c/d/swap.tiers: "- +hdd"
> > For d, only hdd swap, nothing else.
> >
> > More example:
> > "- +ssd +hdd -ssd" will simplify to: "- +hdd", which means hdd only.
> > "+ -hdd": No hdd for you! Use everything else.
> >
> > Let me know what you think about the above "swap.tiers"(name TBD)
> > proposal.
>
> My opinion is that instead of mapping priority into named concepts, it
> may be simpler to represent it as plain integers.

In my mind, the tier name is just a lookup for a bit in the bitmask.
Giving it a name makes it easier to distinguish from other numbers,
e.g. the priority number.

> (The integers are assigned in sequential order, as explained in the following reply.)
> This would make the interface almost identical to the cpuset style suggested by Koutný.
>
> For example:
>
>   echo 1-8,9-10 > a/swap.tier   # parent allows tier range 1–8 and 9-10

"swap.tiers", since it can have more than one tier.

How do you express the default tier, the one who shall not be named? There
are actually 3 states associated with the default. It is not binary.
1) default not specified: look up parent chain for default.
2) default specified as on. Override parent default.
3) default specified as off. Override parent default.

e.g. "- +zswap +ssd" means default off, allow zswap and sdd tiers.

>   echo 1-4,9    > a/b/swap.tier # child uses tier 1-4 and 9 within parent's range
>   echo 20   > a/b/swap.tier # invalid: parent only allowed 1-8 and 9-10

How are you going to store the list of ranges? Just a bitmask integer
or a list?
I feel the tier name is more readable. The number-to-device mapping is
non-trivial for humans to track.
Adding a name to a tier object is trivial, and using the name is more
convenient.
We might be able to support both if we make up a rule that tier names
can't be pure numbers.

I want to add another usage case into consideration. The swap.tiers
does not have to be per cgroup. It can be per VMA. We can extend the
"madvise" syscall so user space can advise the kernel: I only want this
memory range, which contains my private key, to swap to zswap, not hdd.
That way, if there is an unexpected power-off event, my private key will
not end up on the hdd. In RAM or zswap is fine because they will be gone
when power is off.
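As a purely hypothetical illustration of that direction (no such advice
value exists today; MADV_SWAP_ZSWAP_ONLY is made up for this sketch):

  /* Hypothetical sketch: MADV_SWAP_ZSWAP_ONLY does not exist today. */
  #include <stdio.h>
  #include <sys/mman.h>

  #define MADV_SWAP_ZSWAP_ONLY 70   /* made-up advice value */

  static void protect_key_range(void *key_buf, size_t key_len)
  {
          /* ask the kernel: swap this range only to compressed RAM */
          if (madvise(key_buf, key_len, MADV_SWAP_ZSWAP_ONLY))
                  perror("madvise");
  }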
>
> named concepts can be dealt with by some userland-based software solution.
> the kernel just provides a simple integer mapping concept.
> userland software can abstract it as a "named" tier for the user.

The kernel will need to manage the tier object anyway, including which
range it covers, so having a name there is trivial. I consider it simply
a convenience for system admins. A pure tier number mapping to another
priority number is a bit cryptic.
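For illustration only, the tier object discussed here might carry
something like the following; the struct and field names are assumptions
for this thread, not existing kernel code:

  /* Sketch only; names are made up for discussion. */
  #include <string.h>

  struct swap_tier {
          char name[16];   /* e.g. "zswap", "ssd", "hdd" */
          int  bit;        /* bit position in the per-cgroup mask */
          int  prio_high;  /* highest swap priority in this tier */
          int  prio_low;   /* lowest swap priority in this tier */
  };

  /* the name is just a lookup for the bit */
  static int tier_bit_by_name(const struct swap_tier *tiers, int n,
                              const char *name)
  {
          for (int i = 0; i < n; i++)
                  if (!strcmp(tiers[i].name, name))
                          return tiers[i].bit;
          return -1;
  }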

> Regarding the mapping of names to ranges, as you also mentioned:
>
> > There is a simple mapping of global swap tier names into priority
> > range
> > The name itself is customizable.
> > e.g. 100+ is the "compress_ram" tier. 50-99 is the "SSD" tier,
> > 0-49 is the "hdd" tier.
> > The detailed mechanization and API is TBD.
> > The end result is a simple tier name lookup will get the priority
> > range.
> > By default all swap tiers are available for global usage without
> > cgroup. That matches the current global swap on behavior.
>
> One idea would be to provide a /proc/swaptier interface:

Maybe stay away from '/proc'. Maybe something like "/sys/kernel/mm/swap".
>
>   echo "100 40" > /proc/swaptier
>
> This would mean:
> * >=100 : tier 1
> * 40–99 : tier 2
> * <40   : tier 3
>
> How do you feel about this approach?
Sounds fine. Maybe we can have
"ssd:100 zswap:40 hdd" for the same thing but give a name to the tier
as well. You can still reference the tier by numbers.

>
> -----------------------------------------------------------------------
> 2. NUMA autobind
> -----------------------------------------------------------------------
>
> If NUMA autobind is in use, perhaps it is best to simply disallow
> swaptier settings. I expect workloads depending on autobind would rely
> on it globally, rather than per-cgroup. Therefore, when a negative
> priority is present, tier grouping could reject the configuration.

Can you elaborate on that? Just brainstorming: can we keep the
swap.tiers and assign the NUMA autobind range to a tier as well? It is
just negative ranges; we can assign the negative ranges to, say, a
"NUMA" tier. Then if the swap.tiers contains "ssd NUMA", it is as if
the system only configures ssd and NUMA globally. Frankly, I don't
think NUMA autobind swap matters any more with the new swap allocator.
We can also make up a rule that if swap.tiers is used, there is no
NUMA autobind for that cgroup.

>
> -----------------------------------------------------------------------
> 3. Implementation
> -----------------------------------------------------------------------
>
> My initial thought is to implement a simple bitmask check. That is, in
> the slow swap path, check whether the cgroup has selected the given
> tier. This is simple, but I worry it might lose the optimization of the
> current priority list, where devices are dynamically tracked as they
> become available or unavailable.
>
> So perhaps a better design is to make swap tier an object, and have
> each cgroup traverse only the priority list of the tiers it selected. I
> would like feedback on whether this design makes sense.

I feel that has the risk of premature optimization. I suggest just
going with the simplest bitmask check first, then optimizing as a
follow-up when needed. The bitmask check should still work with the
dynamic lists of swap devices, but I doubt how much of a difference
NUMA autobind makes now.
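As a rough sketch of that simplest bitmask check in the slow path (the
swap_tiers_mask and tier_bit fields are assumed names for this
discussion, not existing code):

  /* Sketch only: swap_tiers_mask and tier_bit are assumed fields. */
  static bool memcg_swap_tier_allowed(struct mem_cgroup *memcg,
                                      struct swap_info_struct *si)
  {
          unsigned int allowed = READ_ONCE(memcg->swap_tiers_mask);

          /* an empty mask means "no per-cgroup restriction" */
          if (!allowed)
                  return true;

          return allowed & BIT(si->tier_bit);
  }

  /*
   * slow path, for each si in global priority order:
   *      if (!memcg_swap_tier_allowed(memcg, si))
   *              continue;
   */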

Chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-22 16:48                           ` Chris Li
@ 2025-08-24 12:05                             ` YoungJun Park
  2025-08-26  8:19                               ` Chris Li
  2025-08-24 14:19                             ` YoungJun Park
  1 sibling, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-08-24 12:05 UTC (permalink / raw)
  To: Chris Li
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

> How do you express the default tier who shall not name? There are
> actually 3 states associated with default. It is not binary.
> 1) default not specified: look up parent chain for default.
> 2) default specified as on. Override parent default.
> 3) default specified as off. Override parent default.

As I understand, your intention is to define inheritance semantics depending
on the default value, and allow children to override this freely with `-` and
`+` semantics. Is that correct?

When I originally proposed the swap cgroup priority mechanism, Michal Koutný
commented that it is unnatural for cgroups if a parent attribute is not
inherited by its child:
(https://lore.kernel.org/linux-mm/rivwhhhkuqy7p4r6mmuhpheaj3c7vcw4w4kavp42avpz7es5vp@hbnvrmgzb5tr/)

Therefore, my current thinking is:
* The global swap setting itself is tier 1 (if nothing is configured).
* If a cgroup has no setting:
  - Top-level cgroups follow the global swap.
  - Child cgroups follow their parent’s setting.
* If a cgroup has its own setting, that setting is applied.
(child cgroups can only select tiers that the parent has allowed.)

This seems natural because most cgroup resource distribution mechanisms follow
a subset inheritance model.

Thus, in my concept, there is no notion of a “default” value that controls
inheritance.

> How are you going to store the list of ranges? Just a bitmask integer
> or a list?

They can be represented as increasing integers, up to 32, and stored as a
bitmask.
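As an implementation note, the kernel already has bitmap_parselist() for
exactly this "1-3,5"-style list syntax, so a write handler could roughly
reuse it (SWAP_MAX_TIERS and the destination mask are assumptions in
this sketch):

  /* Sketch: reuse the existing bitmap_parselist() helper. */
  #include <linux/bitmap.h>

  #define SWAP_MAX_TIERS 32

  static int swap_tiers_parse(const char *buf, unsigned long *mask)
  {
          return bitmap_parselist(buf, mask, SWAP_MAX_TIERS);
  }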

> I feel the tier name is more readable. The number to which actual
> device mapping is non trivial to track for humans.

Using increasing integers makes it simpler for the kernel to accept a
uniform interface format, is identical to the existing cpuset interface,
and, in my view, expresses the meaning of "tiers of swap by speed
hierarchy" more clearly.

However, my feeling is still that this approach is clearer both in terms
of implementation and conceptual expression. I would appreciate it if you
could reconsider it once more. If after reconsideration you still prefer
your direction, I will follow your decision.

> I want to add another usage case into consideration. The swap.tiers
> does not have to be per cgroup. It can be per VMA. [...]

I understand this as a potential extension use case for swap.tier.  
I will keep this in mind when implementing. If I have further ideas here, I
will share them for discussion.

> Sounds fine. Maybe we can have "ssd:100 zswap:40 hdd" [...]

Yes, this alignment looks good to me!

> Can you elaborate on that. Just brainstorming, can we keep the
> swap.tiers and assign NUMA autobind range to tier as well? [...]

That is actually the same idea I had in mind for the NUMA use case.  
However, I doubt if there is any real workload using this in practice, so I
thought it may be better to leave it out for now. If NUMA autobind is truly
needed later, it could be implemented then.

This point can also be revisited during review or patch writing, so I will
keep thinking about it.

> I feel that that has the risk of  premature optimization. I suggest
> just going with the simplest bitmask check first then optimize as
> follow up when needed. [...]

Yes, I agree with you. Starting with the bitmask implementation seems to be
the right approach.

By the way, while thinking about possible implementation, I would like to ask
your opinion on the following situation:

Suppose a tier has already been defined and cgroups are configured to use it.
Should we allow the tier definition itself to be modified afterwards?

There seem to be two possible choices:

1. Once a cgroup references a tier, modifying that tier should be disallowed.
2. Allow tier re-definition even if cgroups are already referencing it.

Personally, I prefer option (1), since it avoids unexpected changes for
cgroups that already rely on a particular tier definition.

What is your opinion on this?

Best Regards,
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-24 12:05                             ` YoungJun Park
@ 2025-08-26  8:19                               ` Chris Li
  2025-08-26 12:57                                 ` YoungJun Park
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Li @ 2025-08-26  8:19 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Sun, Aug 24, 2025 at 5:05 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> > How do you express the default tier who shall not name? There are
> > actually 3 states associated with default. It is not binary.
> > 1) default not specified: look up parent chain for default.
> > 2) default specified as on. Override parent default.
> > 3) default specified as off. Override parent default.
>
> As I understand, your intention is to define inheritance semantics depending
> on the default value, and allow children to override this freely with `-` and
> `+` semantics. Is that correct?

Right, the "+" and "-" need to place in the beginning without tier
name, then it is referring the default.

>
> When I originally proposed the swap cgroup priority mechanism, Michal Koutný
> commented that it is unnatural for cgroups if a parent attribute is not
> inherited by its child:
> (https://lore.kernel.org/linux-mm/rivwhhhkuqy7p4r6mmuhpheaj3c7vcw4w4kavp42avpz7es5vp@hbnvrmgzb5tr/)
Michal only said you need to provide a way for the child cgroup to
inherit from the parent.
The swap.tiers does provide such a mechanism: just don't override the
default. I would not go as far as banning the default overwrite. It is
useful; there is no need to list every swap tier.

BTW, Michal, I haven't heard any feedback from you since I started the
swap.tiers discussion. If you have any concerns, please do voice them.

> Therefore, my current thinking is:
> * The global swap setting itself is tier 1 (if nothing is configured).
> * If a cgroup has no setting:
>   - Top-level cgroups follow the global swap.
>   - Child cgroups follow their parent’s setting.
> * If a cgroup has its own setting, that setting is applied.
> (child cgroups can only select tiers that the parent has allowed.)

That is too restrictive. The most common case is that just the parent
cgroup matters and the child uses the exact same setting as the parent.
However, if you want the child to be different from the parent, there
are two cases depending on your intention. Both can make sense.
1) The parent is more latency sensitive than the child. In that case the
child will be more (slower) tiered than the parent. Using more tiers is
slower; that is the inverted relationship. Your proposal does not
allow this?
2) The parent is latency tolerant and the child is latency sensitive.
In this case, the child will remove some swap files from the parent.
This is also a valid case, e.g. the parent is just a wrapper daemon
invoking the real worker as a child. The wrapper just does log
rotation and restarts the child group with a watchdog; it does not
need to be very latency sensitive, let's say the watchdog interval is
1 hour. The child is the heavy lifter and requires fast response.

I think both cases are possible, and I don't see a strong reason to
limit the flexibility when there is no additional cost. I expect the
restrictive approach to have similar complexity.

> This seems natural because most cgroup resource distribution mechanisms follow
> a subset inheritance model.

I don't see a strong reason to impose this kind of restriction yet. It
can go both ways. Depending on your viewpoint, having more swap tiers
does not mean it is more powerful; it can be less powerful in the sense
that it can slow you down more.

> Thus, in my concept, there is no notion of a “default” value that controls
> inheritance.

Then you need to list all tiers in order to disable them all. That
would be error prone if your tier list is long.
>
> > How are you going to store the list of ranges? Just a bitmask integer
> > or a list?
>
> They can be represented as increasing integers, up to 32, and stored as a
> bitmask.

Great, that is what I have in mind as well.

> > I feel the tier name is more readable. The number to which actual
> > device mapping is non trivial to track for humans.
>
> Using increasing integers makes it simpler for the kernel to accept a uniform
> interface format, it is identical to the existing cpuset interface, and it
> expresses the meaning of “tiers of swap by speed hierarchy” more clearly in my
> view.

Same.

>
> However, my feeling is still that this approach is clearer both in terms of
> implementation and conceptual expression. I would appreciate it if you could
> reconsider it once more. If after reconsideration you still prefer your

Can you clarify what I need to reconsider? I have a very similar
bitmask idea to the one you describe now.
I am not a dictator; I just provide feedback on your usage case with
my reasoning.

> direction, I will follow your decision.
>
> > I want to add another usage case into consideration. The swap.tiers
> > does not have to be per cgroup. It can be per VMA. [...]
>
> I understand this as a potential extension use case for swap.tier.
> I will keep this in mind when implementing. If I have further ideas here, I
> will share them for discussion.

That means the tiers definition needs to be global, outside of the cgroup.

> > Sounds fine. Maybe we can have "ssd:100 zswap:40 hdd" [...]
>
> Yes, this alignment looks good to me!
>
> > Can you elaborate on that. Just brainstorming, can we keep the
> > swap.tiers and assign NUMA autobind range to tier as well? [...]
>
> That is actually the same idea I had in mind for the NUMA use case.
> However, I doubt if there is any real workload using this in practice, so I
> thought it may be better to leave it out for now. If NUMA autobind is truly
> needed later, it could be implemented then.

I do see a possibility of just removing the NUMA autobind thing if the
default swap behavior is close enough. The recent swap allocator
change has made huge improvements in terms of lock contention and
using smaller locks. The NUMA autobind might not justify the
complexity now. I wouldn't spend too much effort on NUMA for the MVP
of swap.tiers.

> This point can also be revisited during review or patch writing, so I will
> keep thinking about it.

Agree.

> > I feel that that has the risk of  premature optimization. I suggest
> > just going with the simplest bitmask check first then optimize as
> > follow up when needed. [...]
>
> Yes, I agree with you. Starting with the bitmask implementation seems to be
> the right approach.
>
> By the way, while thinking about possible implementation, I would like to ask
> your opinion on the following situation:
>
> Suppose a tier has already been defined and cgroups are configured to use it.
> Should we allow the tier definition itself to be modified afterwards?

If we can set it the first time, we should be able to set it a
second time. I don't recall an example of a kernel parameter that can
only be set once.


> There seem to be two possible choices:
>
> 1. Once a cgroup references a tier, modifying that tier should be disallowed.

Even modifying a tier to cover a larger priority range when no swap
device falls in that additional range yet?
I think we should make the change follow the swapon/swapoff
behavior. Once a swap device is swapped on, it can't change its tier
until it is swapped off again. When it is swapped off, there is no
cgroup on it. Notice that which tier a swap file belongs to is not the
same thing as the priority range of the tier. You can modify the range
and reorder swap tiers as long as it does not cause a swapped-on device
to jump to a different tier.

> 2. Allow tier re-definition even if cgroups are already referencing it.

You can still swap off even if a cgroup is still using it.

> Personally, I prefer option (1), since it avoids unexpected changes for
> cgroups that already rely on a particular tier definition.

Swapoff and swapon already have similar problems. We can't change the
priority when the swap device is already swapped on. We can go through
a swapoff to change it.

Chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-26  8:19                               ` Chris Li
@ 2025-08-26 12:57                                 ` YoungJun Park
  2025-08-26 14:30                                   ` Chris Li
  0 siblings, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-08-26 12:57 UTC (permalink / raw)
  To: Chris Li
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

> > Therefore, my current thinking is:
> > * The global swap setting itself is tier 1 (if nothing is configured).
> > * If a cgroup has no setting:
> >   - Top-level cgroups follow the global swap.
> >   - Child cgroups follow their parent’s setting.
> > * If a cgroup has its own setting, that setting is applied.
> > (child cgroups can only select tiers that the parent has allowed.)
>
> That is too restrictive. The most common case is just the parent
> cgroup matters, the child uses the exact same setting as the parent.
> However, if you want the child to be different from the parent, there
> are two cases depending on your intention. Both can make sense.
> 1) The parent is more latency sensitive than the child. That way the
> child will be more (slower) tiered than the parent. Using more tiers is
> slower, that is the inverted relationship. Your proposal does not
> allow this?
> 2) The parent is latency tolerant and the child is latency sensitive.
> In this case, the child will remove some swap files from the parent.
> This is also a valid case, e.g. the parent is just a wrapper daemon
> invoking the real worker as a child. The wrapper just does log
> rotation and restarting the child group with a watchdog, it does not
> need to be very latency sensitive, let say the watchdog is 1 hours.
> The child is the heavy lifter and requires fast response.
>
> I think both cases are possible, I don't see a strong reason to limit
> the flexibility when there is no additional cost. I expect the
> restriction approach having similar complexity.

In my use case, I think a restrictive inheritance model could
be sufficient. My argument was mainly based on the fact that most cgroup
resource distribution mechanisms usually follow a parent→child restrictive
pattern. Through the review, I came to the view that I should adhere to the
common behavior whenever possible.

Initially (in the RFC), I supported allowing parent/child inconsistency
for flexibility, so I actually agree with your view regarding flexibility.
For the examples you mentioned, I have no disagreement. I think my final
understanding is aligned with yours.

> Can you clarify what I need to reconsider? I have the very similar
> bitmask idea as you describe now.
> I am not a dictator. I just provide feedback to your usage case with
> my reasoning.
>

Oh! I think you are a good reviewer :D
Okay then, let me explain my preference for numeric tiers in more detail.
It seems we are aligned on the implementation strategy with a bitmask,
but I think our difference lies in the interface style: 'name' vs.
'numeric increase'.

1. A simple numeric interface makes the usage more straightforward.
   Instead of '+/-' semantics, directly listing the numeric range feels
   clearer and easier to use. For example:

     tier 1 (ram)
     tier 2 (ssd)
     tier 3 (hdd)
     tier 4 (network device)
     tier 5 (some device)
     tier 6 (some device2)

   cg1: echo 1-3  > memory.swap.tier (ram,ssd,hdd)
   cg1/cg2: 2-4,6  > memory.swap.tier (ssd,hdd,network device, somedevice 2, assuming non-subset is allowed)

   Tier specification can also be expressed simply as arrays of priority
   ranges, which feels easy to understand.

2. Since tiers are inherently ordered, numbering fits naturally and is
   easier for users to accept.
   In my view, assigning a name is mainly useful to distinguish between
   otherwise 'indistinguishable' groups, but in this case, there is already
   a clear distinction given by the different priorities, which can simply
   be characterized by an increasing number.

I understand your point that tier names may be more convenient for
administrators, and I see the value in that. That was why I used the word
"reconsider" — your feedback makes sense as well.

I do not have a strong preference. It would be good to align after
considering the pros and cons. I look forward to your thoughts.

> > There seem to be two possible choices:
> >
> > 1. Once a cgroup references a tier, modifying that tier should be
> >    disallowed.
>
> Even modify a tier to cover more priority range but no swap device
> falls in that additional range yet?
> I think we should make the change follow the swap on/swap off
> behavior. Once the swap device is swapped on, it can't change its tier
> until it is swapped off again. when it is swapped off, there is no
> cgroup on it. Notice the swap file belongs to which tier is not the
> same as the priority range of the tier. You can modify the range and
> reorder swap tiers as long as it is not causing swap on device jump to
> a different tier.
>
> > 2. Allow tier re-definition even if cgroups are already referencing
> >    it.
>
> You can still swap off even if cgroup is still using it.
>
> > Personally, I prefer option (1), since it avoids unexpected changes
> > for cgroups that already rely on a particular tier definition.
>
> Swap off and on already have similar problems. We can't change the
> priority when the swap device is swapon already. We can go through a
> swap off to change it.

I see your point. In practice, when tiers are already being referenced
by cgroups, swap devices may come and go within those tiers. I think
this can be considered a "natural" behavior, as swap management is
usually performed explicitly by the administrator.  

From that perspective, I expect that unintended behavior is very
unlikely to occur in real scenarios. So I am comfortable assuming this
implicit behavior when reasoning about tier modifications.  

Thanks again for the clarification. With this, the overall picture
feels much clearer. Once we reach alignment on the "named" vs. "numeric"
tier interface, I plan to move forward with the patch work.

Best Regards
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-26 12:57                                 ` YoungJun Park
@ 2025-08-26 14:30                                   ` Chris Li
  2025-08-30  4:05                                     ` YoungJun Park
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Li @ 2025-08-26 14:30 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Tue, Aug 26, 2025 at 5:57 AM YoungJun Park <youngjun.park@lge.com> wrote:
> > I think both cases are possible, I don't see a strong reason to limit
> > the flexibility when there is no additional cost. I expect the
> > restriction approach having similar complexity.
>
> In my use case, I think a restrictive inheritance model could
> be sufficient. My argument was mainly based on the fact that most cgroup
> resource distribution mechanisms usually follow a parent→child restrictive
> pattern. Through the review, I came to the view that I should adhere to the
> common behavior whenever possible.

I slept on it a bit, both literally and philosophically. I'd like to
point out that most cgroup controls are about resource constraints. For
example, if you set a memory limit on the top-level cgroup, none of the
children can go beyond that limit, so it does not make sense for child
usage to exceed parent usage. This is a strict mathematical
subset-containment relationship. That is the deeper reason behind the
parent-to-child more-restrictive pattern: mathematically it does not
make sense otherwise.

The swap file control is different. What we really want is not a
resource limit; we have swap.max for that. The swap.tiers is about
QoS control. From the QoS point of view, there is no such strict
subset-containment relationship. The QoS of the parent and child can
be independent. Therefore, it is justifiable to have an anti-pattern
here, because the root cause is that QoS is not a resource-limit type
of constraint. It is more like a policy.

We shouldn't adhere to the common behavior just because other cgroup
interfaces do it. Here I believe we have a justifiable reason to break
away from it, because it is a different type of control: QoS vs limit.

I think you touch on a very important question that might trigger a
big design change. Do we want to have a per-tier swap.max? It would
specify not only whether this cgroup enrolls in this tier or not, but
also how much swap this cgroup is allowed to do in that tier. The
swap.max would follow the strict containment relationship. I would
need to think more about the relationship between swap.max and
swap.tiers. My initial intuition is that we might end up with both: a
per-tier swap.max, which controls the resource limit and has the
subset-containment relationship, and at the same time swap.tiers, which
controls QoS and does not follow subset containment.

Need more sleep on that.

> Firstly(on RFC), I initially supported allowing parent/child inconsistency
> for flexibility, so I actually agree with your view regarding flexibility.
> For the examples you mentioned, I have no disagreement. I think my final
> understanding is aligned with yours.
>
> > Can you clarify what I need to reconsider? I have the very similar
> > bitmask idea as you describe now.
> > I am not a dictator. I just provide feedback to your usage case with
> > my reasoning.
> >
>
> Oh! I think you are a good reviewer :D
> Okay then, Let me explain my preference for numeric tiers in more detail.
> It seems we are aligned on the implementation strategy with bitmask,
> but I think our difference lies in the interface style — 'name' vs.
> 'numeric increase'."
>
> 1. A simple numeric interface makes the usage more straightforward.
>    Instead of '+/-' semantics, directly listing the numeric range feels
>    clearer and easier to use. For example:

I am not against it. There might be some small aspect of it here and
there to fine tune.

>      tier 1 (ram)
>      tier 2 (ssd)
>      tier 3 (hdd)
>      tier 4 (network device)
>      tier 5 (some device)
>      tier 6 (some device2)
>
>    cg1: echo 1-3  > memory.swap.tier (ram,ssd,hdd)

First of all, sorry about being pedantic: it should be "swap.tiers",
just to be consistent with the rest of the discussion.
Secondly, I just view names as aliases of the numbers. With "1-3" it is
hard to read what you want.
If we allow a name as the alias, we can also do:
echo zram-hdd > memory.swap.tiers

It is exactly the same thing but much more readable.

>    cg1/cg2: 2-4,6  > memory.swap.tier (ssd,hdd,network device, somedevice 2, assuming non-subset is allowed)

echo ssd-network_device,some_device2 > memory.swap.tiers

See, it is the same thing, but your intention is much more readable.

BTW, we should disallow spaces in tier names.

>
>    Tier specification can also be expressed simply as arrays of priority
>    ranges, which feels easy to understand.

The number-to-device mapping is just harder for humans to process. I
think the named alias makes sense. There is an advantage to using bash
to control it from sysfs rather than a dedicated user-space swap tiers
control tool. You can still write a user-space tool if you want; I
want the user-space tool to be optional.
It is the same thing under the hood anyway.

> 2. Since tiers are inherently ordered, numbering fits naturally and is
>    easier for users to accept.
>    In my view, assigning a name is mainly useful to distinguish between
>    otherwise 'indistinguishable' groups, but in this case, there is already
>    a clear distinction given by the different priorities, which can simply
>    be characterized by an increasing number.
>
> I understand your point that tier names may be more convenient for
> administrators, and I see the value in that. That was why I used the word
> "reconsider" — your feedback makes sense as well.

I still prefer to use the name myself. I am not against having numbers
if you prefer numbers more. You can configure it with numbers. I have
a small brain and I want to use names as aliases for configuration.

> I do not have a strong preference. It would be good to align after
> considering the pros and cons. I look forward to your thoughts.

The name is a huge usability improvement for mere mortals. I don't
want to maintain user-space tools just to adjust swap.tiers, IMHO. I am
not opposed to someone else having such tools; they just need to be
optional.

> > > There seem to be two possible choices:
> > >
> > > 1. Once a cgroup references a tier, modifying that tier should be
> > >    disallowed.
> >
> > Even modify a tier to cover more priority range but no swap device
> > falls in that additional range yet?
> > I think we should make the change follow the swap on/swap off
> > behavior. Once the swap device is swapped on, it can't change its tier
> > until it is swapped off again. when it is swapped off, there is no
> > cgroup on it. Notice the swap file belongs to which tier is not the
> > same as the priority range of the tier. You can modify the range and
> > reorder swap tiers as long as it is not causing swap on device jump to
> > a different tier.
> >
> > > 2. Allow tier re-definition even if cgroups are already referencing
> > >    it.
> >
> > You can still swap off even if cgroup is still using it.
> >
> > > Personally, I prefer option (1), since it avoids unexpected changes
> > > for cgroups that already rely on a particular tier definition.
> >
> > Swap off and on already have similar problems. We can't change the
> > priority when the swap device is swapon already. We can go through a
> > swap off to change it.
>
> I see your point. In practice, when tiers are already being referenced
> by cgroups, swap devices may come and go within those tiers. I think
> this can be considered a "natural" behavior, as swap management is
> usually performed explicitly by the administrator.
>
> From that perspective, I expect that unintended behavior is very
> unlikely to occur in real scenarios. So I am comfortable assuming this
> implicit behavior when reasoning about tier modifications.
>
> Thanks again for the clarification. With this, the overall picture
> feels much clearer. Once we reach alignment on the "named" vs. "numeric"
> tier interface, I plan to move forward with the patch work.

I consider that really trivial. Why can't we have both? The madvise
interface might only use numbers in the form of a bitmask, because that
is a C interface. For sysfs and administrative control, having a name
as an alias is so much better.

We do want to think about swap.tiers vs per-tier swap.max. One idea,
just brainstorming, is that we can have an array of
"swap.<tiername>.max".
It is likely we need to have both kinds of interface, because
"swap.<tiername>.max" specifies the inclusive child limit while
"swap.tiers" specifies this cgroup's swap usage QoS. I might not use
hdd in this cgroup A, but the child cgroup B does, so A's hdd max
can't be zero.

The other idea is to specify a percentage of swap.max for each tier in
"swap.tiers.max", in place of "swap.<tiername>.max":
zram:30  ssd:70
That means the zram max is "swap.max * 30%" and the ssd max is
"swap.max * 70%". The numbers do not need to add up to 100, and each
number can't be bigger than 100, but the sum can be bigger than 100.
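For example, assuming swap.max is 1G, a hypothetical "zram:30 ssd:70"
would cap the zram tier at roughly 300M and the ssd tier at roughly
700M for this cgroup; something like "zram:80 ssd:80" would also be
allowed, since only each individual number is capped at 100.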

Need more sleep on it.

Chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-26 14:30                                   ` Chris Li
@ 2025-08-30  4:05                                     ` YoungJun Park
  2025-08-30  7:13                                       ` Chris Li
  0 siblings, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-08-30  4:05 UTC (permalink / raw)
  To: Chris Li
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

Hi Chris,

Thanks for the detailed feedback, and sorry for the late reply.

> I think you touch on a very important question that might trigger a
> big design change. Do we want to have a per tier swap.max? It will
> specify not only whether this cgroup will enroll into this tier or
> not. It also controls how much swap it allows to do in this cgroup.
> The swap.max will follow the straight contain relationship. I would
> need to think more about the relationship between swap.max and
> swap.tiers. Initial intuition is that, we might end up with both per
> tier swap.max, which control resource limit, it has subset contain
> relationship. At the same time the swap.tiers which control QoS, it
> does not follow the subset contained.
>
> Need more sleep on that.

When I first ideated on this, I also considered per-device max values,
with 0 meaning exclusion, to implement cases like a cgroup using only
network swap. At that time the idea was to give each device its own
counter, so setting it to 0 would imply exclusion. But this approach
would effectively require maintaining per-device page counters similar
to the existing swap.max implementation, and the relationship between
these per-device counters and the global swap.max would need to be
carefully defined. That made the design significantly heavier than the
functionality I was aiming for, so I decided to drop it. I read your
point more as a QoS extension, and I see it as complementary rather
than a counter argument.

> First of all, sorry about the pedantic, it should be "swap.tiers" just
> to be consistent with the rest of the discussion.
> Secondly, I just view names as an alias of the number. 1-3 is hard to
> read what you want.
> If we allow name as the alias, we can also do:
> echo zram-hdd > memory.swap.tiers
>
> It is exactly the same thing but much more readable.
>
> >    cg1/cg2: 2-4,6  > memory.swap.tier (ssd,hdd,network device, somedevice 2, assuming non-subset is allowed)
>
> echo ssd-network_device,some_device2 > memory.swap.tiers
>
> See, same thing but much more readable what is your intention.
>
> BTW, we should disallow space in tier names.

Ack—those spaces were only in my example; the implementation will reject
spaces in tier names.

I like the interface format you proposed, and I’ll move forward with an
initial implementation using the name-based tier approach, dropping
the numeric format.

> We do want to think about swap.tiers vs per tier swap.max. One idea
> just brainstorming is that we can have an array of
> "swap.<tiername>.max".
> It is likely we need to have both kinds of interface. Because
> "swap.<tiername>.max" specifies the inclusive child limit.
> "swap.tiers" specifies this C group swap usage QoS. I might not use
> hdd in this cgroup A, but the child cgroup B does. So A's hdd max
> can't be zero.
>
> The other idea is to specify a percentage for each tier of the
> swap.max in "swap.tiers.max": zram:30  ssd:70
> That means zram max is "swap.max * 30%"   and ssd max is "swap.max *
> 70%". The number does not need to add up to 100, but can't be bigger
> than 100.
> The sum can be bigger than 100.
>
> Need more sleep on it.

I don’t have additional ideas beyond what you suggested for now. Since swap.max
is defined in terms of quantity, my intuition is that tier.max should
probably also be quantity-based, not percentage. As I mentioned earlier,
I had also considered per-device max in the early RFC stage. The design
was to introduce per-device counters, but that added substantial overhead
and complexity, especially in reconciling them with the global swap.max
semantics. For that reason I abandoned the idea, though I agree your
suggestion makes sense in the context of QoS extension.

At this point I feel the main directions are aligned, so I’ll proceed
with an initial patch version. My current summary is:

1. Global interface to group swap priority ranges into tiers by name
   (/sys/kernel/mm/swap/swaptier).
2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
   tier cluster caches.
3. Cgroup interface format modeled after cpuset.
4. No inheritance between parent and child cgroups, from a QoS perspective.
5. Runtime modification of tier settings allowed.
6. Keep extensibility and broader use cases in mind.

And some open points for further thought:

1. NUMA autobind
   - Forbid tier if NUMA priorities exist, and vice versa?
   - Should we create a dedicated NUMA tier?
   - Other options?
2. swap.tier.max
   - Percentage vs quantity, and clear use cases.
   - Sketch concrete real-world scenarios to clarify usage.
3. Possible future extensions to VMA-based tier usage.
4. Arbitrary ordering
   - Do we really need it?
   - If so, maybe provide a separate cgroup interface to reorder tiers.

Best Regards
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-30  4:05                                     ` YoungJun Park
@ 2025-08-30  7:13                                       ` Chris Li
  2025-08-31 13:53                                         ` YoungJun Park
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Li @ 2025-08-30  7:13 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Fri, Aug 29, 2025 at 9:05 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> Hi Chris,
>
> Thanks for the detailed feedback, and sorry for the late reply.

Not a problem at all. I have been pretty busy this week and don't have
much time for it either.

> > I think you touch on a very important question that might trigger a
> > big design change. Do we want to have a per tier swap.max? It will
> > specify not only whether this cgroup will enroll into this tier or
> > not. It also controls how much swap it allows to do in this cgroup.
> > The swap.max will follow the straight contain relationship. I would
> > need to think more about the relationship between swap.max and
> > swap.tiers. Initial intuition is that, we might end up with both per
> > tier swap.max, which control resource limit, it has subset contain
> > relationship. At the same time the swap.tiers which control QoS, it
> > does not follow the subset contained.
> >
> > Need more sleep on that.
>
> When I first ideated on this, I also considered per-device max values,
> with 0 meaning exclusion, to implement cases like a cgroup using only
> network swap. At that time the idea was to give each device its own
> counter, so setting it to 0 would imply exclusion. But this approach
> would effectively require maintaining per-device page counters similar
> to the existing swap.max implementation, and the relationship between
> these per-device counters and the global swap.max would need to be
> carefully defined. That made the design significantly heavier than the
> functionality I was aiming for, so I decided to drop it. I read your
> point more as a QoS extension, and I see it as complementary rather
> than a counter argument.

Yes, I slept on it for a few days. I reached a similar conclusion.
I am happy to share my thoughts:
1) FACT: We don't have any support for moving data from one swap device
to another swap device nowadays. It will not happen overnight. Talking
about percentage allocations and maintaining those percentages is
super complicated. I question whether I am getting ahead of myself on
this feature.
2) FACT: I don't know of any real customers who want this kind of
sub-cgroup per-tier swap max adjustment. We should not write imaginary
code for imaginary customers and should reserve the real coding for the
real-world customers. Most of the customers I know, including our
company, care most about the top-level cgroup swap assignment. There
are cases that enable/disable swap devices per sub-cgroup, in the QoS
sense, not the swap max usage sense.
I think this will be one good question on which to ask for feedback in
the LPC MC discussion. Does anyone care about per-tier max adjustment
in the cgroup? We should only consider that when we have real
customers.

So I would shelve this per-tier max adjustment and not spend any more
time on it.

> > First of all, sorry about the pedantic, it should be "swap.tiers" just
> > to be consistent with the rest of the discussion.
> > Secondly, I just view names as an alias of the number. 1-3 is hard to
> > read what you want.
> > If we allow name as the alias, we can also do:
> > echo zram-hdd > memory.swap.tiers
> >
> > It is exactly the same thing but much more readable.
> >
> > >    cg1/cg2: 2-4,6  > memory.swap.tier (ssd,hdd,network device, somedevice 2, assuming non-subset is allowed)
> >
> > echo ssd-network_device,some_device2 > memory.swap.tiers
> >
> > See, same thing but much more readable what is your intention.
> >
> > BTW, we should disallow space in tier names.
>
> Ack—those spaces were only in my example; the implementation will reject
> spaces in tier names.
>
> I like the interface format you proposed, and I’ll move forward with an
> initial implementation using the name-based tier approach, dropping
> the numeric format.

I am glad you like it.

> > We do want to think about swap.tiers vs per tier swap.max. One idea
> > just brainstorming is that we can have an array of
> > "swap.<tiername>.max".
> > It is likely we need to have both kinds of interface. Because
> > "swap.<tiername>.max" specifies the inclusive child limit.
> > "swap.tiers" specifies this C group swap usage QoS. I might not use
> > hdd in this cgroup A, but the child cgroup B does. So A's hdd max
> > can't be zero.
> >
> > The other idea is to specify a percentage for each tier of the
> > swap.max in "swap.tiers.max": zram:30  sdd:70
> > That means zram max is "swap.max * 30%"   and ssd max is "swap.max *
> > 70%". The number does not need to add up to 100, but can't be bigger
> > than 100.
> > The sum can be bigger than 100.
> >
> > Need more sleep on it.
>
> I don’t have additional ideas beyond what you suggested for now. Since swap.max
> is defined in terms of quantity, my intuition is that tier.max should
> probably also be quantity-based, not percentage. As I mentioned earlier,
> I had also considered per-device max in the early RFC stage. The design
> was to introduce per-device counters, but that added substantial overhead
> and complexity, especially in reconciling them with the global swap.max
> semantics. For that reason I abandoned the idea, though I agree your
> suggestion makes sense in the context of QoS extension.

We are in agreement here. We should not touch it until we have a real
customer ask for it.

> At this point I feel the main directions are aligned, so I’ll proceed
> with an initial patch version. My current summary is:
>
> 1. Global interface to group swap priority ranges into tiers by name
>    (/sys/kernel/mm/swap/swaptier).
I suggest "/sys/kernel/mm/swap/tiers" just to make the file name look
different from the "swap.tiers" in the cgroup interface.
This former defines all tiers, giving tiers a name and range. The
latter enroll a subset of the tiers.
 I think the tier bit location does not have to follow the priority
order. If we allow adding a new tier, the new tier will get the next
higher bit. But the priority it split can insert into the middle thus
splitting an existing tier range. We do need to expose the tier bits
into the user space. Because for madvise()  to set tiers for VMA, it
will use bitmasks. It needs to know the name of the bitmask mapping,
I was thinking the mm/swap/tiers read back as one tier a line. show:
name, bitmask bit, range low, range high


> 2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
>    tier cluster caches.
If the fast path fails, it will go through the slow path, so the slow
path is actually a catch-all.
> 3. Cgroup interface format modeled after cpuset.
I am not very familiar with the cpuset part of the interface. Maybe
you should explain that to the reader without using the cpuset cgroup
as a reference.
> 4. No inheritance between parent and child cgroup as a perspective of QoS
In my original proposal of "swap.tiers", if the default is not set on
this cgroup, it will look up the parent until the root memcg. There are
two different tier bitmasks.
One is the local tier bitmask. The other is the effective bitmask.
If the local tier bitmask sets the default, the effective tier bitmask ==
local tier bitmask.
If the local tier bitmask does not set the default, the effective tier
is the concatenation from the parent down to this memcg.

For example:
a/swap.tiers: - +ssd # ssd only
a/b/swap.tiers: ""  # effective "- +ssd", also ssd only.
a/b/c : + -hdd # effective "- +ssd + -hdd", simplified as "+ -hdd". The
'+' overwrites the default, so anything before it can be ignored.

That way, if you are not setting anything in "swap.tiers" in the child
cgroup, that is the default behavior when you create a new cgroup.
Changing the parent can change all the child cgroups at the same time.

> 5. Runtime modification of tier settings allowed.
Need to clarify which tier setting: "swap.tiers" or /sys/kernel/mm/swap/tiers?

> 6. Keep extensibility and broader use cases in mind.
>
> And some open points for further thought:
>
> 1. NUMA autobind
>    - Forbid tier if NUMA priorities exist, and vice versa?
>    - Should we create a dedicated NUMA tier?
>    - Other options?

I want to verify and remove the NUMA autobind from swap later. That
will make things simpler for swap. I think the reason NUMA swap was
introduced no longer exists.

> 2. swap.tier.max
>    - percentage vs quantity, and clear use cases.
>   -  sketch concrete real-world scenarios to clarify usage

Just don't do that. Ignore it until there is a real usage case request.

> 3. Possible future extensions to VMA-based tier usage.

madvise(). That can be introduced earlier. One usage case I know for
that is Android: Android does not set up every app as a cgroup. I
haven't checked for a while whether that is still true.

> 4. Arbitrary ordering
>    - Do we really need it?
>    - If so, maybe provide a separate cgroup interface to reorder tiers.

No for now. We would need to answer how to deal with the swap entry
LRU order inversion issue.

Chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-30  7:13                                       ` Chris Li
@ 2025-08-31 13:53                                         ` YoungJun Park
  2025-08-31 16:45                                           ` Chris Li
  0 siblings, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-08-31 13:53 UTC (permalink / raw)
  To: Chris Li
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

> Yes, I slept on it for a few days. I reached a similar conclusion.
> I am happy to share my thoughts:
> 1) FACT: We don't have any support to move data from swap device to
> another swap device nowadays. It will not happen overnight. Talking
> about those percentage allocation and maintaining those percentages is
> super complicated. I question myself getting ahead of myself on this
> feature.
> 2) FACT: I don't know if any real customers want this kind of
> sub-cgroup swap per tier max adjustment. We should not write imaginary
> code for imaginary customers and reserve the real coding for the real
> world customers. Most of the customers I know, including our company,
> care most about the top level CGroup swap assignment. There are cases
> that enable/disable per sub CGroup swap device, in the QoS sense not
> the swap max usage sense.
> I think this will be one good question to ask feedback in the LPC MC
> discussion.

Great—looking forward to it at the LPC MC.

> > At this point I feel the main directions are aligned, so I’ll proceed
> > with an initial patch version. My current summary is:
> >
> > 1. Global interface to group swap priority ranges into tiers by name
> >    (/sys/kernel/mm/swap/swaptier).
> I suggest "/sys/kernel/mm/swap/tiers" just to make the file name look

Yes, I also think "/sys/kernel/mm/swap/tiers" is a better fit.

> different from the "swap.tiers" in the cgroup interface.
> The former defines all tiers, giving each tier a name and a range. The
> latter enrolls a subset of the tiers.
> I think the tier bit location does not have to follow the priority
> order. If we allow adding a new tier, the new tier will get the next
> higher bit, but the priority it is split at can fall in the middle,
> splitting an existing tier range. We do need to expose the tier bits
> to user space, because madvise(), to set tiers for a VMA, will use
> bitmasks, so it needs to know the name-to-bitmask mapping.
> I was thinking the mm/swap/tiers read back shows one tier per line:
> name, bitmask bit, range low, range high

This part relates to my earlier point on runtime modification. My
intention was to only allow setting the tiers globally, and to align
bitmask with priority ranges. For example, an input like:

  ssd:100, hdd:50, network_swap

would translate into ranges as 100+ (bit0), 50–99 (bit1), and 0–49
(bit2).
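
To make the mapping concrete, here is a minimal userspace sketch of what
I mean (illustration only, not the patch code; the tier names, lower
bounds and helper are made up for this example). Each tier keeps a lower
priority bound, and a device's priority picks the highest-bound tier it
satisfies, which gives the bit:

#include <stdio.h>

/* Illustrative only: tiers sorted by descending lower bound, bit == index. */
struct tier { const char *name; int lower; };

static const struct tier tiers[] = {
	{ "ssd",          100 },	/* bit0: priority 100 and above */
	{ "hdd",           50 },	/* bit1: priority 50..99        */
	{ "network_swap",   0 },	/* bit2: priority 0..49         */
};

static int tier_bit_for_priority(int prio)
{
	for (unsigned int i = 0; i < sizeof(tiers) / sizeof(tiers[0]); i++)
		if (prio >= tiers[i].lower)
			return i;
	return -1;	/* e.g. negative (NUMA autobind) priorities */
}

int main(void)
{
	int prios[] = { 120, 100, 60, 10 };

	for (unsigned int i = 0; i < 4; i++)
		printf("prio %3d -> %s (bit%d)\n", prios[i],
		       tiers[tier_bit_for_priority(prios[i])].name,
		       tier_bit_for_priority(prios[i]));
	return 0;
}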

From your description, I understand you are considering allowing
additive updates, insertions and letting bitmask differ from the range priority. Is
that correct? In that case we probably need a way to distinguish
between “add” and “reset”. Personally, I feel supporting only reset
semantics would make the interface simpler, while still allowing add
semantics when the full set is provided again.

> > 2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
> >    tier cluster caches.
> If the fast path fails, it will go through the slow path. So the slow
> path is actually a catch-all.

Do you mean that if the cluster does not belong to the desired tier in
the fast path, it will skip and then fall back to the slow path? If so,
the slow path would need to avoid inserting that cluster back into the
cache, otherwise processes with a global swap view may end up using the
wrong tier's device (assuming the cached cluster is consulted first).
Also, a cgroup with a tier set would see performance degradation,
because it would likely have to allocate swap through the slow path
most of the time. Wouldn't this have performance implications?

I was thinking that maintaining per-tier per-cpu cluster caches would be
simpler. Then each tier manages its own cluster cache, and we only need
an array of per-cpu caches of size “max tiers”.
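
As a rough shape of the data structure I have in mind (userspace sketch
only; the names are invented and a real patch would use the kernel
per-cpu machinery rather than a plain array):

#include <stdio.h>

#define MAX_TIERS	8	/* illustrative cap on the number of tiers */
#define NR_CPUS		4	/* stand-in for the real CPU count */

struct swap_cluster;		/* opaque here; one cluster covers ~2M of swap */

/* One cached cluster slot per (cpu, tier) pair. */
struct tier_cluster_cache {
	struct swap_cluster *slot[NR_CPUS][MAX_TIERS];
};

int main(void)
{
	static struct tier_cluster_cache cache;	/* all slots start empty */

	printf("cache footprint: %zu pointers\n",
	       sizeof(cache.slot) / sizeof(struct swap_cluster *));
	return 0;
}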

> > 3. Cgroup interface format modeled after cpuset.
> I am not very familiar with the cpuset part of the interface. Maybe
> you should explain that to the reader without using cpuset cgroup as a
> reference.

The similarity with cpuset is only in the text format. Like cpuset.cpus
uses a comma-separated list and dash ranges (e.g. "0-4,6,8-10"), the
swap tier interface would use the same style but with tier names. For
example:
  echo ssd-network_device,some_device2 > swap.tiers
This makes it easy for users to read and modify at runtime, and keeps
the interface consistent with existing cgroup controls.
(Reference: https://docs.kernel.org/admin-guide/cgroup-v2.html, Cpuset Interface Files)

> > 4. No inheritance between parent and child cgroup as a perspective of QoS
> In my original proposal of "swap.tiers", if the default is not set on
> this tier, it will look up the parent until the root memcg. ...

My current thought is that it might be simpler to avoid inheritance
entirely. Since this is a QoS interface rather than a resource limit
mechanism, inheritance semantics may not be the best fit. I would prefer
to always override based on what is explicitly set, and otherwise fall
back to global swap. For example, input like:

  swap.tiers = ssd,network_device,some_device2

would always override the setting directly, without any parent lookup.

> > 5. Runtime modification of tier settings allowed.
> Need to clarify which tier setting? "swap.tiers" or /sys/kernel/mm/swap/tiers?

My earlier comment was about allowing runtime modifications
to the global /sys/kernel/mm/swap/tiers.

> > 6. Keep extensibility and broader use cases in mind.
> >
> > And some open points for further thought:
> >
> > 1. NUMA autobind
> >    - Forbid tier if NUMA priorities exist, and vice versa?
> >    - Should we create a dedicated NUMA tier?
> >    - Other options?
>
> I want to verify and remove the NUMA autobind from swap later. That
> will make things simpler for swap. I think the reason the NUMA swap
> was introduced does not exist any more.

Per your suggestion, the question of whether NUMA autobind
is still needed can be addressed in a dedicated discussion later.
I look forward to it. :)

For the NUMA autobind removal work, possible directions could be:

  - a runtime toggle (default off),
  - keep the default on but gradually flip it to default off,
    and eventually remove it entirely, or
  - remove it entirely right away.

Not a proposal, just a thought.

In my current patch, tier and NUMA priorities are made mutually
exclusive, so they cannot be set together.

> > 2. swap.tier.max
> >    - percentage vs quantity, and clear use cases.
> >   -  sketch concrete real-world scenarios to clarify usage
>
> Just don't do that. Ignore until there is a real usage case request.

Agreed. It is better to defer until we see a concrete use case.

> > 4. Arbitrary ordering
> >    - Do we really need it?
> >    - If so, maybe provide a separate cgroup interface to reorder tiers.
>
> No for now. Need to answer how to deal with swap entry LRU order
> inversion issue.

Right, if we want to support this usage, your point about LRU order must
definitely be addressed first.

Best Regards
Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-31 13:53                                         ` YoungJun Park
@ 2025-08-31 16:45                                           ` Chris Li
  2025-09-01 16:03                                             ` YoungJun Park
  2025-09-01 16:06                                             ` YoungJun Park
  0 siblings, 2 replies; 39+ messages in thread
From: Chris Li @ 2025-08-31 16:45 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Sun, Aug 31, 2025 at 6:53 AM YoungJun Park <youngjun.park@lge.com> wrote:
> > I think this will be one good question to ask feedback in the LPC MC
> > discussion.
>
> Great—looking forward to it at the LPC MC.

Ack

>
> > > At this point I feel the main directions are aligned, so I’ll proceed
> > > with an initial patch version. My current summary is:
> > >
> > > 1. Global interface to group swap priority ranges into tiers by name
> > >    (/sys/kernel/mm/swap/swaptier).
> > I suggest "/sys/kernel/mm/swap/tiers" just to make the file name look
>
> Yes, I also think "/sys/kernel/mm/swap/tiers" is a better fit.
>
> > different from the "swap.tiers" in the cgroup interface.
> > The former defines all tiers, giving each tier a name and a range. The
> > latter enrolls a subset of the tiers.
> > I think the tier bit location does not have to follow the priority
> > order. If we allow adding a new tier, the new tier will get the next
> > higher bit, but the priority it is split at can fall in the middle,
> > splitting an existing tier range. We do need to expose the tier bits
> > to user space, because madvise(), to set tiers for a VMA, will use
> > bitmasks, so it needs to know the name-to-bitmask mapping.
> > I was thinking the mm/swap/tiers read back shows one tier per line:
> > name, bitmask bit, range low, range high
>
> This part relates to my earlier point on runtime modification. My
> intention was to only allow setting the tiers globally, and to align
> bitmask with priority ranges. For example, an input like:
>
>   ssd:100, hdd:50, network_swap
>
> would translate into ranges as 100+ (bit0), 50–99 (bit1), and 0–49
> (bit2).
>
> From your description, I understand you are considering allowing
> additive updates, insertions and letting bitmask differ from the range priority. Is
> that correct? In that case we probably need a way to distinguish

That is right.

> between “add” and “reset”. Personally, I feel supporting only reset
> semantics would make the interface simpler, while still allowing add
> semantics when the full set is provided again.

The counterpart of "add" is "remove". There are two possible idea to explore:
1) only allow removing  a tier when all swap devices in that tier
range have been swapped off.
2) Remove the tier is removing a midpoint from the range. So the lower
tier automatically gets the range belonging to the tier that was
removed. Then optionally you can add another tier back in replacement
with different range boundaries. It effectively achieves replacement
as well. This approach does not require swap off the swap device. I
like it better. Because if you don't want the race window where the
swap device temporarily belongs to the lower tier, you can always swap
off the device in question before performing 2). so 2) can actually be
mixed with 1) as well.
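
To illustrate the range merge in option 2) with the running example
numbers (a rough userspace model, not code from the series):

#include <stdio.h>

struct tier { const char *name; int lower; int present; };

/* Running example: ssd >= 100, hdd 50..99, network_swap 0..49 */
static struct tier tiers[] = {
	{ "ssd",          100, 1 },
	{ "hdd",           50, 1 },
	{ "network_swap",   0, 1 },
};

static const char *tier_for_priority(int prio)
{
	for (unsigned int i = 0; i < 3; i++)
		if (tiers[i].present && prio >= tiers[i].lower)
			return tiers[i].name;
	return "none";
}

int main(void)
{
	printf("prio 60 -> %s\n", tier_for_priority(60));	/* hdd */

	/* Option 2): drop the "hdd" boundary, no swapoff required. */
	tiers[1].present = 0;

	/* Its 50..99 range is absorbed by the next lower tier. */
	printf("prio 60 -> %s\n", tier_for_priority(60));	/* network_swap */
	return 0;
}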

>
> > > 2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
> > >    tier cluster caches.
> > If the fast path fails, it will go through the slow path. So the slow
> > path is actually a catch-all.
>
> Do you mean that if the cluster does not belong to the desired tier in
> the fast path, it will skip and then fall back to the slow path? If so,

I am describing the existing swap cluster allocator behavior. In my
mind, we are using the existing cluster swap allocator code, with
constraints that only allow swap entry to be allocated from the
affected tier bitmask.
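
As a rough illustration of that constraint (a userspace model only; the
device list, tier bits and helper are invented for the example): the
allocator walks the priority-ordered device list as today, but skips any
device whose tier bit is not in the caller's allowed mask.

#include <stdio.h>

struct swap_dev { const char *name; int prio; unsigned int tier_bit; };

/* Priority-ordered, as the global device list would be. */
static const struct swap_dev devs[] = {
	{ "zram0", 100, 0 },	/* tier bit0 ("ssd")          */
	{ "nvme0",  60, 1 },	/* tier bit1 ("hdd")          */
	{ "nbd0",   10, 2 },	/* tier bit2 ("network_swap") */
};

static const struct swap_dev *pick_dev(unsigned int allowed_mask)
{
	for (unsigned int i = 0; i < sizeof(devs) / sizeof(devs[0]); i++)
		if (allowed_mask & (1u << devs[i].tier_bit))
			return &devs[i];
	return NULL;	/* nothing allowed: allocation fails for this cgroup */
}

int main(void)
{
	/* cgroup allowed only "hdd" and "network_swap" (bits 1 and 2) */
	const struct swap_dev *d = pick_dev((1u << 1) | (1u << 2));

	printf("allocate from %s\n", d ? d->name : "(none)");
	return 0;
}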

> the slow path would need to avoid inserting that cluster back into the
> cache, otherwise processes with a global swap view may end up using the
> wrong tier's device (assuming the cached cluster is consulted first).
> Also, a cgroup with a tier set would see performance degradation,
> because it would likely have to allocate swap through the slow path
> most of the time. Wouldn't this have performance implications?

I think we are mixing two different concepts. There are swap tiers
which decide which swap device to use. Then there is the swap
allocator to allocate a swap from the allowed list.

If we move to the swap tiers, the swap allocator needs to be swap
tiers aware. So it might move to per-cgroup cache list or disable the
cache for cgroups that haven't been allocating for a while. The
allocation logic should be in the allocator, not in the swap tier
layer.

> I was thinking that maintaining per-tier per-cpu cluster caches would be
> simpler. Then each tier manages its own cluster cache, and we only need
> an array of per-cpu caches of size “max tiers”.

Again, let's not jump to premature optimizations. Do it the simple way
first, then let the measurement numbers guide us.
It might be that each swap file has a cache, not necessarily one per CPU.
The per-cpu x per-tier combination is too big; I am worried about caching
too many swap clusters. Each cluster is 2M.

>
> > > 3. Cgroup interface format modeled after cpuset.
> > I am not very familiar with the cpuset part of the interface. Maybe
> > you should explain that to the reader without using cpuset cgroup as a
> > reference.
>
> The similarity with cpuset is only in the text format. Like cpuset.cpus
> uses a comma-separated list and dash ranges (e.g. "0-4,6,8-10"), the
> swap tier interface would use the same style but with tier names. For
> example:
>   echo ssd-network_device,some_device2 > swap.tiers
> This makes it easy for users to read and modify at runtime, and keeps
> the interface consistent with existing cgroup controls.
> (Reference: https://docs.kernel.org/admin-guide/cgroup-v2.html, Cpuset Interface Files)

Thanks for the explanation. That sounds fine to me.

>
> > > 4. No inheritance between parent and child cgroup as a perspective of QoS
> > In my original proposal of "swap.tiers", if the default is not set on
> > this tier, it will look up the parent until the root memcg. ...
>
> My current thought is that it might be simpler to avoid inheritance
> entirely. Since this is a QoS interface rather than a resource limit
> mechanism, inheritance semantics may not be the best fit. I would prefer
> to always override based on what is explicitly set, and otherwise fall
> back to global swap. For example, input like:
>
>   swap.tiers = ssd,network_device,some_device2
>
> would always override the setting directly, without any parent lookup.

We DO want some parent level control. That is a real customer
requirement. The con with your proposal is that, if you want to
change the whole set from top-level cgroup to child cgroups, you need
to traverse the hierarchical chain to set each child cgroup. While
walking the child tree, more sub-level cgroups may be added, and
you could miss newly created cgroups. It becomes a mess.

It is much cleaner if we allow the child cgroup to have the default
"swap.tiers" empty. Then you just need to set one value at the top-level
parent cgroup, and all child cgroups inherit it automatically. A child
can overwrite it if desired; by default it inherits from its parent.

The whole set of cgroups from top-level including children can map
into a Kubernetes pod. It is common to perform adjustments on the
whole set atomically. We should support it.

> > > 5. Runtime modification of tier settings allowed.
> > Need to clarify which tier setting? "swap.tiers" or /sys/kernel/mm/swap/tiers?
>
> My earlier comment was about allowing runtime modifications
> to the global /sys/kernel/mm/swap/tiers.

Ack.

> > > 6. Keep extensibility and broader use cases in mind.
> > >
> > > And some open points for further thought:
> > >
> > > 1. NUMA autobind
> > >    - Forbid tier if NUMA priorities exist, and vice versa?
> > >    - Should we create a dedicated NUMA tier?
> > >    - Other options?
> >
> > I want to verify and remove the NUMA autobind from swap later. That
> > will make things simpler for swap. I think the reason the NUMA swap
> > was introduced does not exist any more.
>
> Per your suggestion, the question of whether NUMA autobind
> is needed can be addressed in a dedicated discussion later.
> I look forward to it. :)

I was thinking of removing the NUMA autobind feature from the Linux
source code: deleting the code, if the measurement numbers show that
NUMA autobind does not make much of a difference any more. The
performance characteristics have changed dramatically with the new
cluster-based swap allocator. It is possible NUMA autobind no longer
makes enough sense to justify the complexity of keeping it in the
source code.

> The NUMA autobind removal work.. possible directions could be:
>
>   - runtime toggle (default off),
>   - keep default on but gradually flip to default off,
>     eventually remove entirely.
>   - remove it. entirely.
>
> Not a proposal —just a thought
>
> In my current patch,
> tier and NUMA priorities are made mutually exclusive so they cannot be set together.

That is one more reason to remove NUMA priorities completely.

> > > 2. swap.tier.max
> > >    - percentage vs quantity, and clear use cases.
> > >   -  sketch concrete real-world scenarios to clarify usage
> >
> > Just don't do that. Ignore until there is a real usage case request.
>
> Agreed. It is better to defer until we see a concrete use case.

Ack.

> > > 4. Arbitrary ordering
> > >    - Do we really need it?
> > >    - If so, maybe provide a separate cgroup interface to reorder tiers.
> >
> > No for now. Need to answer how to deal with swap entry LRU order
> > inversion issue.
>
> Right, if we want to support this usage, your point about LRU order must
> definitely be addressed first.

Ack.

I think we are aligned.

Thanks

Chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-31 16:45                                           ` Chris Li
@ 2025-09-01 16:03                                             ` YoungJun Park
  2025-09-01 16:06                                             ` YoungJun Park
  1 sibling, 0 replies; 39+ messages in thread
From: YoungJun Park @ 2025-09-01 16:03 UTC (permalink / raw)
  To: Chris Li
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

Overall, the alignment looks good. Among the three points you suggested,
I agree with (3) cgroup inheritance. I would like to continue the
discussion on (1) swap tier lifecycle and (2) allocation logic.

1. swap tier lifecycle
2. allocation logic
3. cgroup inheritance

> > This part relates to my earlier point on runtime modification. My
> > intention was to only allow setting the tiers globally, and to align
> > bitmask with priority ranges. For example, an input like:
> >
> >   ssd:100, hdd:50, network_swap
> >
> > would translate into ranges as 100+ (bit0), 50–99 (bit1), and 0–49
> > (bit2).
> >
> > From your description, I understand you are considering allowing
> > additive updates, insertions and letting bitmask differ from the range priority. Is
> > that correct? In that case we probably need a way to distinguish
>
> That is right.

Yes, I agree that add/remove semantics can be supported. Since it was
not fully clear whether there was agreement on the full-set format, I
wanted to state explicitly that my preference is to require the full-set
format for simplicity. That said, if staged insertion and removal are
considered useful, one possible approach is:

(Side note: my explanation of the interface was somewhat descriptive,
which may not have been fully clear. If this explanation is sufficient
to establish the general direction, I will aim to present it more
concretely in the patch series. Otherwise, I can provide a more
detailed explanation in a follow-up email.)

  echo "add ssd:100,hdd:50,network_swap" >/sys/kernel/mm/swap/tiers
  echo "add new:80"  > /sys/kernel/mm/swap/tiers
  echo "remove hdd" > /sys/kernel/mm/swap/tiers

Alternatively, separate files for add, remove, and show could be used to
represent staged operations:

  echo "ssd:100,hdd:50,network_swap" >/sys/kernel/mm/swap/tiers/add
  echo "new:80"  > /sys/kernel/mm/swap/tiers/add
  echo "hdd" > /sys/kernel/mm/swap/tiers/remove

When using the fullset approach:

  ssd:100(bit0), hdd:50(bit1), network_device(bit2)

If we remove the ssd layer and add a new tier:

  echo new:80,hdd:50,network_device >/sys/kernel/mm/swap/tiers

The show output could display staged state (imaginary output for understanding):

  ssd:100(bit0), new:80(bit3, in stage), hdd:50(bit1, removing), network_device(bit2)

After the hdd tier reference drops to zero:

  ssd:100(bit0), new:80(bit3),  network_device(bit2)

> > between “add” and “reset”. Personally, I feel supporting only reset
> > semantics would make the interface simpler, while still allowing add
> > semantics when the full set is provided again.
>
> The counterpart of "add" is "remove". There are two possible ideas to explore:
> 1) only allow removing a tier when all swap devices in that tier
> range have been swapped off.
> 2) Remove the tier by removing a midpoint from the range. The lower
> tier automatically gets the range belonging to the tier that was
> removed. Optionally, you can add another tier back in replacement
> with different range boundaries. This effectively achieves replacement
> as well. This approach does not require swapping off the swap device. I
> like it better. If you want to avoid the race window where the
> swap device temporarily belongs to the lower tier, you can always swap
> off the device before performing 2). So 2) can be mixed with 1) as well.

I have already explained this from the perspective of option 2 mixed
with option 1. Let me clarify one point:

If...
ssd:100, hdd:50, network_device.
Insertion above 100 becomes visible after ssd removal,
Insertion above 50 becomes visible after hdd removal,
Insertion above 0 becomes visible after network_device removal.

It means that as long as the tier exists, the referenced priority ranges
cannot be overridden.

Regarding the swap_tier object lifecycle:

A swap_tier should not be deleted until all devices in the tier are
swapped off (As you said, references are held). Therefore, cgroups that reference a
tier should also hold a reference. Silently dropping a tier is problematic
from a cgroup perspective.

If we allow this, I think the implementation should behave as follows:
If a swap_tier is removed, the cgroup’s tier configuration could be
marked invalid. This should trigger an event to the cgroup to notify
user space.

> >
> > > > 2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
> > > >    tier cluster caches.
> > > If the fast path fails, it will go through the slow path. So the slow
> > > path is actually a catch-all.
> >
> > Do you mean that if the cluster does not belong to the desired tier in
> > the fast path, it will skip and then fall back to the slow path? If so,
>
> I am describing the existing swap cluster allocator behavior. In my
> mind, we are using the existing cluster swap allocator code, with
> constraints that only allow swap entry to be allocated from the
> affected tier bitmask.
>
> > the slow path would need to avoid inserting that cluster back into the
> > cache, otherwise processes with a global swap view may end up using the
> > wrong tier's device (assuming the cached cluster is consulted first).
> > Also, a cgroup with a tier set would see performance degradation,
> > because it would likely have to allocate swap through the slow path
> > most of the time. Wouldn't this have performance implications?
>
> I think we are mixing two different concepts. There are swap tiers
> which decide which swap device to use. Then there is the swap
> allocator to allocate a swap from the allowed list.
>
> If we move to the swap tiers, the swap allocator needs to be swap
> tiers aware. So it might move to per-cgroup cache list or disable the
> cache for cgroups that haven't been allocating for a while. The
> allocation logic should be in the allocator, not in the swap tier
> layer.
>
> > I was thinking that maintaining per-tier per-cpu cluster caches would be
> > simpler. Then each tier manages its own cluster cache, and we only need
> > an array of per-cpu caches of size “max tiers”.
>
> Again, let's not jump to premature optimizations. Do it the simple way
> first, then let the measurement numbers guide us.
> It might be that each swap file has a cache, not necessarily one per CPU.
> The per-cpu x per-tier combination is too big; I am worried about caching
> too many swap clusters. Each cluster is 2M.

You suggested maintaining per-swap-device cluster caches. As an
alternative, I would like to suggest a per-device per-CPU cache
approach, which could be simpler from an integration perspective. It
would fit more naturally with the existing allocation logic, remove tier
awareness from the allocator, and should not introduce functional
differences in behavior. Moreover, since SSD devices are likely to be
concentrated in only a small number of tiers (with one being the "best"
tier), the number of clusters actually cached at any time would not be
large. I am not presenting this as the ultimate solution, but rather as
a simple and reasonably practical approach to consider. I agree that we
should revisit and evaluate this approach further.

> We DO want some parent level control. That is a real customer
> requirement. The cons with your proposal is that, if you want to
> change the whole set from top-level cgroup to child cgroups, you need
> to traverse the hierarchical chain to set each child cgroup. While
> walking the child tree, more sub-level cgroups may be added, and
> you could miss newly created cgroups. It becomes a mess.
>
> It is much cleaner if we allow the child cgroup to have the default
> "swap.tiers" empty. Then you just need to set one value at the top-level
> parent cgroup, and all child cgroups inherit it automatically. A child
> can overwrite it if desired; by default it inherits from its parent.
>
> The whole set of cgroups from top-level including children can map
> into a Kubernetes pod. It is common to perform adjustments on the
> whole set atomically. We should support it.

Okay I will adopt default inheritance for pod-level and similar use cases. A
child cgroup inherits the nearest ancestor’s mask upon creation. If it
later sets its own mask, that configuration will apply to itself.

> I think we are aligned.
>
> Thanks
>
> Chris

Many thanks for the detailed review. It helped clarify the implementation
direction, and I look forward to preparing the patch series accordingly.

Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-08-31 16:45                                           ` Chris Li
  2025-09-01 16:03                                             ` YoungJun Park
@ 2025-09-01 16:06                                             ` YoungJun Park
  2025-09-01 22:40                                               ` Chris Li
  1 sibling, 1 reply; 39+ messages in thread
From: YoungJun Park @ 2025-09-01 16:06 UTC (permalink / raw)
  To: Chris Li
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

Overall, the alignment looks good. Among the three points you suggested,
I agree with (3) cgroup inheritance. I would like to continue the
discussion on (1) swap tier lifecycle and (2) allocation logic.

1. swap tier lifecycle
2. allocation logic
3. cgroup inheritance

> > This part relates to my earlier point on runtime modification. My
> > intention was to only allow setting the tiers globally, and to align
> > bitmask with priority ranges. For example, an input like:
> >
> >   ssd:100, hdd:50, network_swap
> >
> > would translate into ranges as 100+ (bit0), 50–99 (bit1), and 0–49
> > (bit2).
> >
> > From your description, I understand you are considering allowing
> > additive updates, insertions and letting bitmask differ from the range priority. Is
> > that correct? In that case we probably need a way to distinguish
>
> That is right.

Yes, I agree that add/remove semantics can be supported. Since it was
not fully clear whether there was agreement on the full-set format, I
wanted to state explicitly that my preference is to require the full-set
format for simplicity. That said, if staged insertion and removal are
considered useful, one possible approach is:

(Side note: my explanation of the interface was somewhat descriptive,
which may not have been fully clear. If this explanation is sufficient
to establish the general direction, I will aim to present it more
concretely in the patch series. Otherwise, I can provide a more
detailed explanation in a follow-up email.)

  echo "add ssd:100,hdd:50,network_swap" >/sys/kernel/mm/swap/tiers
  echo "add new:80"  > /sys/kernel/mm/swap/tiers
  echo "remove hdd" > /sys/kernel/mm/swap/tiers

Alternatively, separate files for add, remove, and show could be used to
represent staged operations:

  echo "ssd:100,hdd:50,network_swap" >/sys/kernel/mm/swap/tiers/add
  echo "new:80"  > /sys/kernel/mm/swap/tiers/add
  echo "hdd" > /sys/kernel/mm/swap/tiers/remove

When using the fullset approach:

  ssd:100(bit0), hdd:50(bit1), network_device(bit2)

If we remove the ssd layer and add a new tier:

  echo new:80,hdd:50,network_device >/sys/kernel/mm/swap/tiers

The show output could display staged state (imaginary output for understanding):

  ssd:100(bit0), new:80(bit3, in stage), hdd:50(bit1, removing), network_device(bit2)

After the hdd tier reference drops to zero:

  ssd:100(bit0), new:80(bit3),  network_device(bit2)

> > between “add” and “reset”. Personally, I feel supporting only reset
> > semantics would make the interface simpler, while still allowing add
> > semantics when the full set is provided again.
>
> The counterpart of "add" is "remove". There are two possible ideas to explore:
> 1) only allow removing a tier when all swap devices in that tier
> range have been swapped off.
> 2) Remove the tier by removing a midpoint from the range. The lower
> tier automatically gets the range belonging to the tier that was
> removed. Optionally, you can add another tier back in replacement
> with different range boundaries. This effectively achieves replacement
> as well. This approach does not require swapping off the swap device. I
> like it better. If you want to avoid the race window where the
> swap device temporarily belongs to the lower tier, you can always swap
> off the device before performing 2). So 2) can be mixed with 1) as well.

I have already explained this from the perspective of option 2 mixed
with option 1. Let me clarify one point:

If...
ssd:100, hdd:50, network_device.
Insertion above 100 becomes visible after ssd removal,
Insertion above 50 becomes visible after hdd removal,
Insertion above 0 becomes visible after network_device removal.

It means that as long as the tier exists, the referenced priority ranges
cannot be overridden.

Regarding the swap_tier object lifecycle:

A swap_tier should not be deleted until all devices in the tier are
swapped off (As you said, references are held). Therefore, cgroups that reference a
tier should also hold a reference. Silently dropping a tier is problematic
from a cgroup perspective.

If we allow this, I think the implementation should behave as follows:
If a swap_tier is removed, the cgroup’s tier configuration could be
marked invalid. This should trigger an event to the cgroup to notify
user space.

> >
> > > > 2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
> > > >    tier cluster caches.
> > > If the fast path fails, it will go through the slow path. So the slow
> > > path is actually a catch-all.
> >
> > Do you mean that if the cluster does not belong to the desired tier in
> > the fast path, it will skip and then fall back to the slow path? If so,
>
> I am describing the existing swap cluster allocator behavior. In my
> mind, we are using the existing cluster swap allocator code, with
> constraints that only allow swap entry to be allocated from the
> affected tier bitmask.
>
> > the slow path would need to avoid inserting that cluster back into the
> > cache, otherwise processes with a global swap view may end up using the
> > wrong tier's device (assuming the cached cluster is consulted first).
> > Also, a cgroup with a tier set would see performance degradation,
> > because it would likely have to allocate swap through the slow path
> > most of the time. Wouldn't this have performance implications?
>
> I think we are mixing two different concepts. There are swap tiers
> which decide which swap device to use. Then there is the swap
> allocator to allocate a swap from the allowed list.
>
> If we move to the swap tiers, the swap allocator needs to be swap
> tiers aware. So it might move to per-cgroup cache list or disable the
> cache for cgroups that haven't been allocating for a while. The
> allocation logic should be in the allocator, not in the swap tier
> layer.
>
> > I was thinking that maintaining per-tier per-cpu cluster caches would be
> > simpler. Then each tier manages its own cluster cache, and we only need
> > an array of per-cpu caches of size “max tiers”.
>
> Again, let's not jump to premature optimizations. Do it the simple way
> first, then let the measurement numbers guide us.
> It might be that each swap file has a cache, not necessarily one per CPU.
> The per-cpu x per-tier combination is too big; I am worried about caching
> too many swap clusters. Each cluster is 2M.

You suggested maintaining per-swap-device cluster caches. As an
alternative, I would like to suggest a per-device per-CPU cache
approach, which could be simpler from an integration perspective. It
would fit more naturally with the existing allocation logic, remove tier
awareness from the allocator, and should not introduce functional
differences in behavior. Moreover, since SSD devices are likely to be
concentrated in only a small number of tiers (with one being the "best"
tier), the number of clusters actually cached at any time would not be
large. I am not presenting this as the ultimate solution, but rather as
a simple and reasonably practical approach to consider. I agree that we
should revisit and evaluate this approach further.

> We DO want some parent level control. That is a real customer
> requirement. The cons with your proposal is that, if you want to
> change the whole set from top-level cgroup to child cgroups, you need
> to traverse the hierarchical chain to set each child cgroup. While
> walking the child tree, more sub-level cgroups may be added, and
> you could miss newly created cgroups. It becomes a mess.
>
> It is much cleaner if we allow the child cgroup to have the default
> "swap.tiers" empty. Then you just need to set one value at the top-level
> parent cgroup, and all child cgroups inherit it automatically. A child
> can overwrite it if desired; by default it inherits from its parent.
>
> The whole set of cgroups from top-level including children can map
> into a Kubernetes pod. It is common to perform adjustments on the
> whole set atomically. We should support it.

Okay I will adopt default inheritance for pod-level and similar use cases. A
child cgroup inherits the nearest ancestor’s mask upon creation. If it
later sets its own mask, that configuration will apply to itself.

> I think we are aligned.
>
> Thanks
>
> Chris

Many thanks for the detailed review. It helped clarify the implementation
direction, and I look forward to preparing the patch series accordingly.

Youngjun Park

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
  2025-09-01 16:06                                             ` YoungJun Park
@ 2025-09-01 22:40                                               ` Chris Li
  0 siblings, 0 replies; 39+ messages in thread
From: Chris Li @ 2025-09-01 22:40 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Michal Koutný, akpm, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, kasong, nphamcs, bhe,
	baohua, cgroups, linux-mm, linux-kernel, gunho.lee,
	iamjoonsoo.kim, taejoon.song, Matthew Wilcox, David Hildenbrand,
	Kairui Song

On Mon, Sep 1, 2025 at 9:21 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> Overall, the alignment looks good. Among the three points you suggested,
> I agree with (3) cgroup inheritance. I would like to continue the
> discussion on (1) swap tier lifecycle and (2) allocation logic.

Sure.

>
> 1. swap tier lifecycle
> 2. allocation logic
> 3. cgroup inheritance
>
> > > This part relates to my earlier point on runtime modification. My
> > > intention was to only allow setting the tiers globally, and to align
> > > bitmask with priority ranges. For example, an input like:
> > >
> > >   ssd:100, hdd:50, network_swap
> > >
> > > would translate into ranges as 100+ (bit0), 50–99 (bit1), and 0–49
> > > (bit2).
> > >
> > > From your description, I understand you are considering allowing
> > > additive updates, insertions and letting bitmask differ from the range priority. Is
> > > that correct? In that case we probably need a way to distinguish
> >
> > That is right.
>
> Yes, I agree that add/remove semantics can be supported. Since it was
> not fully clear whether there was agreement on the full-set format, I
> wanted to state explicitly that my preference is to require the full-set
> format for simplicity. That said, if staged insertion and removal are
> considered useful, one possible approach is:
>
> (Side note: my explanation of the interface was somewhat descriptive,
> which may not have been fully clear. If this explanation is sufficient
> to establish the general direction, I will aim to present it more
> concretely in the patch series. Otherwise, I can provide a more
> detailed explanation in a follow-up email.)
>
>   echo "add ssd:100,hdd:50,network_swap" >/sys/kernel/mm/swap/tiers
Yes, that works. I would skip the "add" keyword.
Also I notice that we can allow " " in place of "," as a separator as well.
Let's call it option 1).

>   echo "add new:80"  > /sys/kernel/mm/swap/tiers
>   echo "remove hdd" > /sys/kernel/mm/swap/tiers

Maybe instead of "remove hdd", just "-hdd" which is similar to how to
operate on swap.tiers.

>
> Alternatively, separate files for add, remove, and show could be used to
> represent staged operations:
>
>   echo "ssd:100,hdd:50,network_swap" >/sys/kernel/mm/swap/tiers/add
>   echo "new:80"  > /sys/kernel/mm/swap/tiers/add
>   echo "hdd" > /sys/kernel/mm/swap/tiers/remove

Let's call it option 2).
I feel that we don't need to have both add and remove interface files.
Using just one file, as above, is simpler. I like option 1) with a
modification: skip the "add" keyword and use "-" to replace the "remove"
keyword. I can give more examples if needed.

I don't like option 2).
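
To make option 1) concrete, a rough sketch of how such a write could be
tokenized (the exact syntax is still open; this is only an illustration,
not the proposed parser): "name:prio" adds or updates a tier, a leading
"-" removes one, and "," or " " both work as separators.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one write to the (hypothetical) tiers file, e.g.
 * "ssd:100,hdd:50 network_swap -hdd". Illustration only. */
static void parse_tiers_input(char *buf)
{
	char *tok, *save = NULL;

	for (tok = strtok_r(buf, ", ", &save); tok;
	     tok = strtok_r(NULL, ", ", &save)) {
		if (tok[0] == '-') {
			printf("remove tier '%s'\n", tok + 1);
			continue;
		}
		char *colon = strchr(tok, ':');
		if (colon) {
			*colon = '\0';
			printf("add tier '%s' at priority %ld\n",
			       tok, strtol(colon + 1, NULL, 10));
		} else {
			printf("add tier '%s' (lowest range)\n", tok);
		}
	}
}

int main(void)
{
	char input[] = "ssd:100,hdd:50 network_swap -hdd";

	parse_tiers_input(input);
	return 0;
}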

> When using the fullset approach:
>
>   ssd:100(bit0), hdd:50(bit1), network_device(bit2)

Oh, by the fullset you mean you want to specify the bit for each tier name.
Why? The bit selection can happen automatically, thus reducing the
chance that users give an invalid bit value, e.g. a bit that is already
used. Please educate me on which usage case needs to specify the
bits explicitly, where auto bit selection is not good enough.

> If we remove the ssd layer and add a new tier:
>
>   echo new:80,hdd:50,network_device >/sys/kernel/mm/swap/tiers

Option 3), full set specification.

Oh, you mean the tier not listed in the above will be deleted.
I prefer the above option 1) then.

Notice there is a race when you remove things: a newly added tier
can be accidentally removed as well. Again, what is the usage
case you can't do with option 1)?

> The show output could display staged state (imaginary output for understanding):
>
>   ssd:100(bit0), new:80(bit3, in stage), hdd:50(bit1, removing), network_device(bit2)

I don't understand what this "removing" and "in stage" are; that is
some extra complexity.
What is it trying to solve?

>
> After the hdd tier reference drops to zero:

Drops to zero how? By swapoff, or by expecting the apps using the swap
to exit or fault in all swapped pages in that tier?

>
>   ssd:100(bit0), new:80(bit3),  network_device(bit2)

For display we can also make each tier take one line at a time.

>
> > > between “add” and “reset”. Personally, I feel supporting only reset
> > > semantics would make the interface simpler, while still allowing add
> > > semantics when the full set is provided again.
> >
> > The counterpart of "add" is "remove". There are two possible ideas to explore:
> > 1) only allow removing a tier when all swap devices in that tier
> > range have been swapped off.
> > 2) Remove the tier by removing a midpoint from the range. The lower
> > tier automatically gets the range belonging to the tier that was
> > removed. Optionally, you can add another tier back in replacement
> > with different range boundaries. This effectively achieves replacement
> > as well. This approach does not require swapping off the swap device. I
> > like it better. If you want to avoid the race window where the
> > swap device temporarily belongs to the lower tier, you can always swap
> > off the device before performing 2). So 2) can be mixed with 1) as well.
>
> I have already explained this from the perspective of option 2 mixed
> with option 1. Let me clarify one point:
>
> If...
> ssd:100, hdd:50, network_device.
> Insertion above 100 becomes visible after ssd removal,

What do you mean by "visible"? Previous discussions haven't defined
what is visible vs invisible. If you use a new term, please define it.
Does "visible" mean available for adding a new tier on?

> Insertion above 50 becomes visible after hdd removal,
> Insertion above 0 becomes visible after network_device removal.
>
> It means that as long as the tier exists, the referenced priority ranges
> cannot be overridden.
>
> Regarding the swap_tier object lifecycle:
>
> A swap_tier should not be deleted until all devices in the tier are
> swapped off (As you said, references are held). Therefore, cgroups that reference a
> tier should also hold a reference. Silently dropping a tier is problematic
> from a cgroup perspective.

Nope, that is too many references to track. Each swap device belongs to
only one tier at a time. The swap device will have a pointer or bit
mask of that tier and bump up that tier's reference count when the swap
device is swapped on.
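
A tiny model of that bookkeeping (userspace sketch, field names
invented): the tier's count only reflects attached swap devices, bumped
at swapon and dropped at swapoff, with no references held by cgroups.

#include <stdio.h>

struct swap_tier { const char *name; int nr_devices; };
struct swap_dev  { const char *name; struct swap_tier *tier; };

static void dev_swapon(struct swap_dev *dev, struct swap_tier *tier)
{
	dev->tier = tier;
	tier->nr_devices++;	/* only devices pin the tier */
}

static void dev_swapoff(struct swap_dev *dev)
{
	dev->tier->nr_devices--;
	dev->tier = NULL;
}

int main(void)
{
	struct swap_tier ssd = { "ssd", 0 };
	struct swap_dev zram = { "zram0", NULL };

	dev_swapon(&zram, &ssd);
	printf("%s: %d device(s)\n", ssd.name, ssd.nr_devices);
	dev_swapoff(&zram);
	printf("%s: %d device(s)\n", ssd.name, ssd.nr_devices);
	return 0;
}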

> If we allow this, I think the implementation should behave as follows:
> If a swap_tier is removed, the cgroup’s tier configuration could be
> marked invalid. This should trigger an event to the cgroup to notify
> user space.

Trigger an event to notify user space? Who consumes the event and what
can that user space tool do? That seems to be the extra complexity we
should avoid. Just stick with the swapoff behavior. If you remove the
swap tier, the range of that tier merges into the neighbouring tier. That
way you don't need to worry about a swap file already having swapped-out
entries in the tier being removed.

Please don't keep proposing new interfaces for the sake of it. Try to
get the feature done with an absolutely minimal interface introduced.

> > > > > 2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
> > > > >    tier cluster caches.
> > > > If the fast path fails, it will go through the slow path. So the slow
> > > > path is actually a catch-all.
> > >
> > > Do you mean that if the cluster does not belong to the desired tier in
> > > the fast path, it will skip and then fall back to the slow path? If so,
> >
> > I am describing the existing swap cluster allocator behavior. In my
> > mind, we are using the existing cluster swap allocator code, with
> > constraints that only allow swap entry to be allocated from the
> > affected tier bitmask.
> >
> > > the slow path would need to avoid inserting that cluster back into the
> > > cache, otherwise processes with a global swap view may end up using the
> > > wrong tier's device (assuming the cached cluster is consulted first).
> > > Also, a cgroup with a tier set would see performance degradation,
> > > because it would likely have to allocate swap through the slow path
> > > most of the time. Wouldn't this have performance implications?
> >
> > I think we are mixing two different concepts. There are swap tiers
> > which decide which swap device to use. Then there is the swap
> > allocator to allocate a swap from the allowed list.
> >
> > If we move to the swap tiers, the swap allocator needs to be swap
> > tiers aware. So it might move to per-cgroup cache list or disable the
> > cache for cgroups that haven't been allocating for a while. The
> > allocation logic should be in the allocator, not in the swap tier
> > layer.
> >
> > > I was thinking that maintaining per-tier per-cpu cluster caches would be
> > > simpler. Then each tier manages its own cluster cache, and we only need
> > > an array of per-cpu caches of size “max tiers”.
> >
> > Again, let's not jump to premature optimizations. Do it the simple way
> > first, then let the measurement numbers guide us.
> > It might be that each swap file has a cache, not necessarily one per CPU.
> > The per-cpu x per-tier combination is too big; I am worried about caching
> > too many swap clusters. Each cluster is 2M.
>
> You suggested maintaining per-swap-device cluster caches. As an
> alternative, I would like to suggest a per-device per-CPU cache
> approach, which could be simpler from an integration perspective. It

Each device belongs to one tier, so that is the same number as per-tier x
per-cpu, or even more. You might end up caching too many clusters.
Modern servers have a very high CPU count. I would stay away from per-CPU
if I can afford it or find other solutions. Per-cpu x another high number
is just a liability waiting for bad things to happen.

> would fit more naturally with the existing allocation logic, remove tier
> awareness from the allocator, and should not introduce functional

The current swap allocator is using a per-CPU cluster cache. That
cached cluster only belongs to one swap device. When the
swap tier is introduced, the swap allocator behavior needs to be
changed anyway, because we no longer have to "fill this device full"
before moving to the next device.

> differences in behavior. Moreover, since SSD devices are likely to be
> concentrated in only a small number of tiers (with one being the "best"
> tier), the number of clusters actually cached at any time would not be
> large. I am not presenting this as the ultimate solution, but rather as
> a simple and reasonably practical approach to consider. I agree that we
> should revisit and evaluate this approach further.

Let's revisit it when we have more detail. This is an internal
implementation detail anyway. The users of swap.tiers don't have to
know about it.

>
> > We DO want some parent level control. That is a real customer
> > requirement. The cons with your proposal is that, if you want to
> > change the whole set from top-level cgroup to child cgroups, you need
> > to traverse the hierarchical chain to set each child cgroup. While
> > walking the child tree, more sub-level cgroups may be added, and
> > you could miss newly created cgroups. It becomes a mess.
> >
> > It is much cleaner if we allow the child cgroup to have the default
> > "swap.tiers" empty. Then you just need to set one value at the top-level
> > parent cgroup, and all child cgroups inherit it automatically. A child
> > can overwrite it if desired; by default it inherits from its parent.
> >
> > The whole set of cgroups from top-level including children can map
> > into a Kubernetes pod. It is common to perform adjustments on the
> > whole set atomically. We should support it.
>
> Okay I will adopt default inheritance for pod-level and similar use cases. A
> child cgroup inherits the nearest ancestor’s mask upon creation. If it

You are missing the fact that you need to track two sets of tier masks:
the "swap.tiers" value is a local tier-name set you need to track, and
its default is empty.

The runtime tier set is computed from the local tier mask by walking up
to the parents, collecting all parent local bitmasks and aggregating
them into an effective set. The effective tier bitmask is what the swap
allocator uses.

What you propose still has the same problem:
1) parent "+ssd"
2) child  ""    # empty, same as parent

If we do what you propose, the child will have "ssd" in its bit mask
at creation.
When the parent later removes "ssd", the child will keep having "ssd".

In my original proposal, if a parent removes ssd then the child will
automatically pick up the removal as well.
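
A sketch of the resolution I have in mind (userspace model; whether the
walk stops at the nearest non-empty ancestor or aggregates all ancestor
masks is an implementation detail, the sketch uses the nearest non-empty
ancestor). Because a child with an empty local mask resolves through the
parent at lookup time, removing "ssd" at the parent is reflected in the
child automatically:

#include <stdio.h>

struct memcg {
	const char *name;
	struct memcg *parent;
	unsigned int local_tiers;	/* 0 == not set, inherit */
};

/* Effective mask: nearest ancestor with a non-empty local set. */
static unsigned int effective_tiers(const struct memcg *cg, unsigned int global)
{
	for (; cg; cg = cg->parent)
		if (cg->local_tiers)
			return cg->local_tiers;
	return global;			/* fall back to all tiers */
}

int main(void)
{
	unsigned int all = 0x7;				/* ssd|hdd|network */
	struct memcg parent = { "pod", NULL,    0x1 };	/* "+ssd"          */
	struct memcg child  = { "app", &parent, 0x0 };	/* empty, inherits */

	printf("child mask: 0x%x\n", effective_tiers(&child, all));	/* 0x1 */

	parent.local_tiers = 0x6;			/* parent drops ssd */
	printf("child mask: 0x%x\n", effective_tiers(&child, all));	/* 0x6 */
	return 0;
}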

I am not happy that you keep introducing changes to my proposal that
reintroduce the same buggy behavior again and again. Please respect my
time; I am spending my long weekend writing an email to you. Please
don't introduce the same bug again just for the sake of changing
behavior. Be careful and think through what you are proposing.

Chris

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2025-09-01 22:41 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-16 20:20 [PATCH 0/4] mm/swap, memcg: Support per-cgroup swap device priorities Youngjun Park
2025-07-16 20:20 ` [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority Youngjun Park
2025-07-17 11:20   ` kernel test robot
2025-07-22 14:09     ` YoungJun Park
2025-07-18 17:08   ` kernel test robot
2025-07-22 14:11     ` YoungJun Park
2025-07-21 15:13   ` kernel test robot
2025-07-22 14:14     ` YoungJun Park
2025-07-22  8:41   ` Michal Koutný
2025-07-22 14:05     ` YoungJun Park
2025-07-22 18:41       ` YoungJun Park
2025-08-14 14:03         ` Michal Koutný
2025-08-15 15:10           ` Chris Li
2025-08-16 17:21             ` YoungJun Park
2025-08-16 19:15               ` Chris Li
2025-08-19 10:12                 ` YoungJun Park
2025-08-20  0:52                   ` Chris Li
2025-08-20 14:39                     ` YoungJun Park
2025-08-21 20:39                       ` Chris Li
2025-08-22  5:45                         ` YoungJun Park
2025-08-22 16:48                           ` Chris Li
2025-08-24 12:05                             ` YoungJun Park
2025-08-26  8:19                               ` Chris Li
2025-08-26 12:57                                 ` YoungJun Park
2025-08-26 14:30                                   ` Chris Li
2025-08-30  4:05                                     ` YoungJun Park
2025-08-30  7:13                                       ` Chris Li
2025-08-31 13:53                                         ` YoungJun Park
2025-08-31 16:45                                           ` Chris Li
2025-09-01 16:03                                             ` YoungJun Park
2025-09-01 16:06                                             ` YoungJun Park
2025-09-01 22:40                                               ` Chris Li
2025-08-24 14:19                             ` YoungJun Park
2025-08-16 16:41           ` YoungJun Park
2025-07-16 20:20 ` [PATCH 2/4] mm: swap: Apply per-cgroup swap priority mechanism to swap layer Youngjun Park
2025-07-16 20:20 ` [PATCH 3/4] mm: memcg: Add swap cgroup priority inheritance mechanism Youngjun Park
2025-07-16 20:20 ` [PATCH 4/4] mm: swap: Per-cgroup per-CPU swap device cache with shared clusters Youngjun Park
2025-07-22 17:44   ` Kairui Song
2025-07-22 18:30     ` YoungJun Park

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).