From: Youngjun Park <youngjun.park@lge.com>
To: akpm@linux-foundation.org, hannes@cmpxchg.org
Cc: mhocko@kernel.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, muchun.song@linux.dev,
shikemeng@huaweicloud.com, kasong@tencent.com, nphamcs@gmail.com,
bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, gunho.lee@lge.com,
iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
Youngjun Park <youngjun.park@lge.com>
Subject: [PATCH 0/4] mm/swap, memcg: Support per-cgroup swap device priorities
Date: Thu, 17 Jul 2025 05:20:02 +0900
Message-ID: <20250716202006.3640584-1-youngjun.park@lge.com>
This patchset introduces a mechanism to assign swap device priorities
per cgroup.
It is an evolution of a previously submitted RFC [1], with revised
semantics, interfaces, and implementation based on community feedback.
======================================================================
I. MOTIVATION
======================================================================
The core requirement was to improve application responsiveness and loading
time, especially for latency-critical applications, without adding RAM or
storage hardware.
Device constraints:
- Linux-based embedded platform
- Limited system RAM
- Small local swap
- No option to expand RAM or local swap
To mitigate this, we explored utilizing idle RAM and storage from nearby
devices as remote swap space. To maximize its effectiveness, we needed
per-cgroup control over swap device selection:
- Assign faster local swap devices to latency-critical apps
- Assign remote swap devices to background apps
However, the current kernel swap infrastructure does not support per-cgroup
swap device assignment.
======================================================================
II. EVALUATED ALTERNATIVES
======================================================================
**II-1. Per-cgroup Dedicated Swap Devices**
- Proposed upstream [2]
- Difficult to maintain consistent global vs per-cgroup swap state
- Hard to reconcile with memory.max and swap.max semantics
**II-2. Multi-backend Swap Device with Cgroup-aware Routing**
- Breaks layering abstraction (block device cgroup awareness)
- Swap devices treated as physical storage
- Related ideas discussed in [3]
**II-3. Per-cgroup Swap Enable/Disable with Usage Control**
- Could expand swap.max via zswap writeback [4]
- Cannot express flexible device orderings
- Less expressive than per-device priorities
**Conclusion:** Per-cgroup swap priority configuration is the most natural and
least invasive extension to existing kernel mechanisms.
======================================================================
III. DESIGN OVERVIEW
======================================================================
**III-1. Per-Cgroup Swap Priority**
Semantics:
- Configure swap priorities per device via the `memory.swap.priority` interface.
- If a value is specified, it overrides the global priority for that cgroup.
- Priority semantics follow the global swap behavior:
- Higher numeric values are preferred
- Devices with equal priority are used round-robin
- Negative priorities follow NUMA-aware fallback [5]
- If no value is given, the global swap priority is used.
- Default settings influence swap device propagation on swapon/swapoff events.
- At `swapon`, these settings determine whether and how newly added devices
are included for the cgroup.
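As a rough illustration of the ordering rules above, the following user-space
sketch merges global priorities with per-cgroup overrides and lists devices
from most to least preferred. It is not the kernel implementation: the
function name `effective_order` is hypothetical, it assumes overrides apply
only to devices that already exist globally, and it ignores NUMA-aware
fallback and any internal normalization of negative priorities.

```sh
# effective_order GLOBAL OVERRIDES   (hypothetical helper, illustration only)
# GLOBAL:    newline-separated "<id> <priority>" pairs (system-wide values)
# OVERRIDES: newline-separated "<id> <priority>" pairs set for one cgroup
# Prints "<id> <priority>" lines, highest priority first.
effective_order() {
    { printf '%s\n' "$1"; printf '%s\n' '=='; printf '%s\n' "$2"; } |
    awk '
        $1 == "=="   { overrides = 1; next }  # switch to the override section
        NF != 2      { next }                 # skip blank or malformed lines
        !overrides   { prio[$1] = $2; next }  # remember the global priority
        ($1 in prio) { prio[$1] = $2 }        # assumption: known devices only
        END          { for (id in prio) print id, prio[id] }
    ' | sort -k2,2nr -k1,1n                   # higher priority first, then id
}
```

Devices that end up with the same priority appear next to each other in the
output; in the kernel such devices would be used round-robin.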
Each cgroup exposes a readable and writable file:
memory.swap.priority
This file accepts one `<id> <priority>` pair per line, where `<id>` is the
numeric ID of a swap device as shown in `/proc/swaps`:
    Filename    Type        Size  Used  Priority  Id
    /dev/sda2   partition   ...   ...   20        1
    /dev/sdb2   partition   ...   ...   -2        2
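To find the `<id>` for a given device, the Id column can be read with a small
helper like the one below. This is only a sketch: `swap_id_of` is a
hypothetical name, and it assumes that Id is the last column of the table, as
in the listing above. On a kernel without this patchset the last column would
be Priority, so the result would be wrong there.

```sh
# swap_id_of DEVICE [FILE]   (hypothetical helper, illustration only)
# Prints the Id column for DEVICE from a /proc/swaps-style table.
# Assumes the first line is the header and Id is the last column.
swap_id_of() {
    awk -v dev="$1" 'NR > 1 && $1 == dev { print $NF }' "${2:-/proc/swaps}"
}
```

With the listing above, `swap_id_of /dev/sdb2` would print `2`.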
The following defaults can be set:
- `default none`:
Use global priority (implicit default)
- `default disabled`:
Exclude swap devices from use in this cgroup
These defaults determine how new devices are handled at `swapon` time.
Special keywords can also be specified per device:
- `<id> none`: use global priority (clears override)
- `<id> disabled`: exclude the device from this cgroup's swap allocation
Reading this file shows the current configuration. Devices not explicitly set
may still appear if their effective priority differs from the global value due
to NUMA fallback or internal normalization.
**Example**
    echo "1 -2" > memory.swap.priority
May result in:
    1 -2
    2 -3
To revert both devices to global priority:
    echo "1 none" > memory.swap.priority
    echo "2 none" > memory.swap.priority
To disable device 1 while allowing device 2:
    echo "1 disabled" > memory.swap.priority
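Scripts that write to this file may want to check a line before sending it to
the kernel. The sketch below accepts only the forms described in this cover
letter (`default none`, `default disabled`, `<id> none`, `<id> disabled`, and
`<id> <priority>`). The function name `swap_prio_line_ok` is hypothetical, and
the kernel parser may accept or reject input differently, for example by
limiting the priority range.

```sh
# swap_prio_line_ok LINE   (hypothetical helper, illustration only)
# Returns 0 if LINE has one of the documented forms, non-zero otherwise:
#   default none | default disabled
#   <id> none | <id> disabled | <id> <priority>
swap_prio_line_ok() {
    printf '%s\n' "$1" | grep -Eq \
        '^(default (none|disabled)|[0-9]+ (none|disabled|-?[0-9]+))$'
}
```

A guarded write could then look like
`swap_prio_line_ok "$line" && echo "$line" > memory.swap.priority`.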
**III-2. Inheritance**
Inheritance semantics:
- Each cgroup inherits from the **highest** ancestor with a setting
- Intermediate ancestors are ignored
- If no ancestor is configured, the local setting is used
- If the inherited ancestor configuration is removed or absent, the cgroup
falls back to its local setting. If none exists, the global priority is used.
The effective configuration after inheritance is visible via:
memory.swap.priority.effective
If `default disabled` is active, it is shown explicitly.
If `default none` is used, it is applied silently and not shown.
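The lookup order described above can be sketched in user space as follows.
This only models which cgroup's setting wins and is not the kernel code: the
function name `effective_source`, the path notation relative to the cgroup
root, and the list of configured cgroups passed as an argument are all
assumptions made for illustration.

```sh
# effective_source CGROUP CONFIGURED   (hypothetical helper, illustration only)
# CGROUP:     path relative to the cgroup root, e.g. "apps/bg/job1"
# CONFIGURED: newline-separated paths of cgroups that have a setting
# Prints the cgroup whose setting applies, or "global" if there is none.
effective_source() {
    printf '%s\n' "$2" | awk -v cg="$1" '
        NF { configured[$0] = 1 }      # record each configured cgroup
        END {
            n = split(cg, part, "/")
            prefix = ""
            # Walk from the topmost ancestor down to the cgroup itself, so
            # the first hit is the highest configured ancestor. The local
            # setting is reached only if no ancestor has one.
            for (i = 1; i <= n; i++) {
                prefix = (i == 1) ? part[i] : prefix "/" part[i]
                if (prefix in configured) { print prefix; exit }
            }
            print "global"             # nothing configured on the path
        }'
}
```

For example, if both `apps` and `apps/bg` are configured, a lookup for
`apps/bg/job1` reports `apps`, because intermediate ancestors are ignored.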
======================================================================
IV. TESTING
======================================================================
This patchset was tested on x86_64 under QEMU using `stress-ng` to generate
swap I/O while toggling swap devices and updating `memory.swap.priority`.
The kernel was instrumented with KASAN, lockdep, and other
`CONFIG_DEBUG_*` options to increase debugging coverage and help identify
potential issues under stress.
======================================================================
V. CHANGE HISTORY
======================================================================
== RFC → v1 ==
[1] Changed interface from flat `1:10,2:-1` to line-based flat key format,
following cgroup v2 interface conventions where each swap device is
configured independently.
- Suggested by: Michal Koutný
[2] Added `memory.swap.priority.effective` to expose the final applied
priority, reflecting cgroup inheritance rules.
[3] Clarified default semantics: `default none`, `default disabled`
- Suggested by: Michal Koutný
[4] Implemented per-cgroup percpu swap device cache and used per-device
shared clusters to avoid scalability issues
- Suggested by: Kairui Song
[5] Exposed swap device id via /proc/swaps for introspection
[6] Introduced swap_cgroup_priority.h to define the main interface and declare
symbols shared with swapfile.c.
[7] Aligned the number of swap_cgroup_priority_pnode instances with nr_swapfiles
to ensure consistency during swap device changes.
[8] Removed the explicit delete interface, now handled implicitly by dynamic tracking.
======================================================================
VI. REFERENCES
======================================================================
[1] RFC: Per-cgroup swap device prioritization
https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d
[2] Cgroup-specific swap devices (2014)
https://lkml.iu.edu/hypermail/linux/kernel/1404.0/02530.html
[3] Swap redirection and zswap writeback discussions
https://lore.kernel.org/linux-mm/CAMgjq7DGMS5A4t6nOQmwyLy5Px96aoejBkiwFHgy9uMk-F8Y-w@mail.gmail.com
[4] Per-cgroup zswap writeback
https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com
[5] Swap NUMA fallback
https://docs.kernel.org/vm/swap_numa.html
---
This feature is marked **EXPERIMENTAL** in Kconfig, as it has not yet undergone
extensive real-world testing. The implementation is functional and reflects
feedback from prior RFC discussions, but further testing and review are welcome.
I’m happy to iterate based on community feedback.
Thanks,
Youngjun Park
Youngjun Park (4):
mm/swap, memcg: Introduce infrastructure for cgroup-based swap
priority
mm: swap: Apply per-cgroup swap priority mechanism to swap layer
mm: memcg: Add swap cgroup priority inheritance mechanism
mm: swap: Per-cgroup per-CPU swap device cache with shared clusters
Documentation/admin-guide/cgroup-v2.rst | 76 ++
MAINTAINERS | 2 +
include/linux/memcontrol.h | 3 +
include/linux/swap.h | 10 +
mm/Kconfig | 14 +
mm/Makefile | 1 +
mm/memcontrol.c | 105 ++-
mm/swap_cgroup_priority.c | 1036 +++++++++++++++++++++++
mm/swap_cgroup_priority.h | 128 +++
mm/swapfile.c | 108 ++-
10 files changed, 1456 insertions(+), 27 deletions(-)
create mode 100644 mm/swap_cgroup_priority.c
create mode 100644 mm/swap_cgroup_priority.h
base-commit: 347e9f5043c89695b01e66b3ed111755afcf1911
--
2.34.1