* [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
@ 2026-03-25 17:54 Youngjun Park
From: Youngjun Park @ 2026-03-25 17:54 UTC
  To: Andrew Morton
  Cc: Chris Li, Youngjun Park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny

This is v5 of the "Swap Tiers" series.
For clarity, this cover letter is structured in two parts:

  Part 1 describes the patch series itself (what is implemented in v5).
  Part 2 consolidates the design rationale and use case discussion,
  including clarification around the memcg-integrated model and
  comparison with BPF-based approaches.

This separation is intentional so reviewers can clearly distinguish
between patch introduction and design discussion (for Shakeel's
ongoing feedback).

v4:
  https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/

Earlier RFC versions:
  v3: https://lore.kernel.org/linux-mm/20260131125454.3187546-1-youngjun.park@lge.com/
  v2: https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
  v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/

Earlier approach (per-cgroup swap priority):
  RFC: https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d
  v1: https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
======================================================================
Part 1: Patch Series Summary
======================================================================

Overview
========
Swap Tiers group swap devices into performance classes (e.g. NVMe,
HDD, Network) and allow per-memcg selection of which tiers to use.
This mechanism was suggested by Chris Li.

This series introduces:

- Core tier infrastructure
- Per-memcg tier assignment (subset of parent)
- memory.swap.tiers and memory.swap.tiers.effective interfaces

Changes in v5
=============
- Fixed build errors reported in v4
- Rebased onto up-to-date mm-new
- Minor cleanups
- Design docs with validation (from the discussion with Shakeel Butt)

Changes in v4 (summary)
=======================
- Simplified control flow and indentation
- Added CONFIG option for MAX_SWAPTIER (default: 4)
- Added memory.swap.tiers.effective interface
- Reworked save/restore logic into snapshot/rollback model
- Removed tier priority modification support (deferred)
- Improved validation and fixed edge cases
- Rebased onto latest mm-new

Deferred / Future Work
======================
- Per-tier swap_active_head to reduce contention (suggested by Chris Li)
- Fast-path and slow-path allocation improvements
  (to be introduced after Kairui's work)

Real-world Results
==================
Tested on our internal platform using NBD as a separate swap tier.
This is the simple use case of our first production deployment.

Without tiers:
- No selective control over flash wear
- Cannot selectively assign NBD to specific applications

Cold launch time, baseline -> preloaded (reduction in parentheses):
- App A: 13.17s -> 4.18s (68%)
- App B: 5.60s -> 1.12s (80%)
- App C: 10.25s -> 2.00s (80%)
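
The percentages above read as reductions in launch time relative to
the baseline, i.e. (baseline - preloaded) / baseline, assuming the
first number in each pair is the baseline. A quick check of that
arithmetic (the helper name reduction_pct is only for illustration):

```python
def reduction_pct(baseline_s: float, preloaded_s: float) -> float:
    """Percent reduction in launch time relative to the baseline."""
    return (baseline_s - preloaded_s) / baseline_s * 100.0


# Launch times in seconds, copied from the list above: (baseline, preloaded).
launches = {
    "App A": (13.17, 4.18),
    "App B": (5.60, 1.12),
    "App C": (10.25, 2.00),
}

for app, (baseline, preloaded) in launches.items():
    print(f"{app}: {reduction_pct(baseline, preloaded):.0f}%")
```

Rounded to whole percent, this gives 68%, 80% and 80%, matching the
figures quoted above.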

Performance impact with no tiers configured:
<1% regression in kernel build and vm-scalability benchmarks
(measured in RFC v2).

======================================================================
Part 2: Design Rationale and Use Cases
======================================================================

Design Rationale
================
Swap tier selection is attached to memcg. A child cgroup may select a
subset of the parent's allowed tiers.

This:
- Preserves cgroup inheritance semantics (boundary at parent,
  refinement at child).
- Reuses memcg, which already groups processes and enforces
  hierarchical memory limits.
- Aligns with existing memcg swap controls (e.g. swap.max, zswap.writeback).
- Avoids introducing a parallel swap control hierarchy.

Placing tier control outside memcg (e.g. BPF, a syscall, madvise, etc.)
would allow swap preference to diverge from the memcg hierarchy.
Integrating it into memcg keeps swap policy consistent with
existing memory ownership semantics.
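
As a rough illustration of the "boundary at parent, refinement at
child" rule, the effective tier set of a cgroup can be modelled as
its own selection intersected with everything its ancestors allow.
The sketch below is a userspace toy model, not the kernel code: the
Memcg class, the tier names and the use of None for "no local
selection" are assumptions made for the example. It also does not
model whether a non-subset write is rejected or clamped; it only
shows the resulting set.

```python
class Memcg:
    """Toy model of a cgroup node; not the kernel's struct mem_cgroup."""

    def __init__(self, name, parent=None, tiers=None):
        self.name = name
        self.parent = parent
        # None models "no local selection": inherit what the parent allows.
        self.tiers = None if tiers is None else frozenset(tiers)

    def effective(self, all_tiers):
        """Tiers usable here: the local selection clamped by every ancestor."""
        if self.parent is None:
            allowed = frozenset(all_tiers)
        else:
            allowed = self.parent.effective(all_tiers)
        return allowed if self.tiers is None else allowed & self.tiers


# Hypothetical system-wide tiers and a small hierarchy.
all_tiers = {"zswap", "SSD", "HDD"}
root = Memcg("/")
job = Memcg("job", parent=root, tiers={"zswap", "SSD"})
guest = Memcg("guest", parent=job, tiers={"SSD"})
stray = Memcg("stray", parent=job, tiers={"HDD"})

print(sorted(job.effective(all_tiers)))
print(sorted(guest.effective(all_tiers)))
print(sorted(stray.effective(all_tiers)))  # empty: HDD is outside the parent's set
```

In this model a child can only narrow the set it inherits, so no
descendant ever ends up with a tier that an ancestor excluded.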

Use case #1: Latency separation (our primary deployment scenario)
=================================================================
  [ / ]
     |
     +-- latency-sensitive workload  (fast tier)
     +-- background workload         (slow tier)

The parent defines the memory boundary.
Each workload selects a swap tier via memory.swap.tiers according to
latency requirements.

This prevents latency-sensitive workloads from being swapped to
slow devices used by background workloads.
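
The filtering step behind this use case can be sketched as a bitmask
check at allocation time: each device belongs to one tier, each memcg
carries a mask of allowed tiers, and devices outside the mask are
skipped. This is a simplified userspace model under stated
assumptions, not the code in this series: SwapDevice, tier_bit and
pick_device are hypothetical names, MAX_SWAPTIER is assumed to be the
maximum number of tiers, and free-space handling is ignored.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

MAX_SWAPTIER = 4  # default of the CONFIG option listed in the v4 changes


@dataclass(frozen=True)
class SwapDevice:
    name: str
    priority: int  # higher priority is tried first
    tier: int      # hypothetical tier index in [0, MAX_SWAPTIER)


def tier_bit(tier: int) -> int:
    """Bit representing one tier in a memcg tier mask."""
    if not 0 <= tier < MAX_SWAPTIER:
        raise ValueError(f"tier index {tier} out of range")
    return 1 << tier


def pick_device(devices: Iterable[SwapDevice], tier_mask: int) -> Optional[SwapDevice]:
    """Return the highest-priority device whose tier is in the mask, else None."""
    for dev in sorted(devices, key=lambda d: d.priority, reverse=True):
        if tier_bit(dev.tier) & tier_mask:
            return dev
    return None


FAST, SLOW = 0, 1  # illustrative tier indices
devices = [
    SwapDevice("nvme0", priority=10, tier=FAST),
    SwapDevice("nbd0", priority=1, tier=SLOW),
]

latency_sensitive_mask = tier_bit(FAST)
background_mask = tier_bit(SLOW)

print(pick_device(devices, latency_sensitive_mask).name)  # nvme0
print(pick_device(devices, background_mask).name)         # nbd0
```

With a mask that excludes the slow tier, the latency-sensitive
workload never reaches nbd0 in this model, even if nvme0 were the only
other choice.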

Use case #2: Per-VM swap selection (Chris Li's deployment scenario)
==================================================================
  [ / ]
     |
     +-- [ Job on VM ]              (tiers: zswap, SSD)
            |
            +-- [ VMM guest memory ]  (tiers: SSD)

The parent (job) has access to both zswap and SSD tiers.
The child (VMM guest memory) selects SSD as its swap tier via
memory.swap.tiers. In this deployment, swap device selection
happens at the child level from the parent's available set.


Use case #3: Tier isolation for reduced contention (hypothetical)
=================================================================
  [ / ]                    (tiers: A, B)
     |
     +-- workload X        (tiers: A)
     +-- workload Y        (tiers: B)

Each child uses a different tier. Since swap paths are separated
per tier, synchronization overhead between the two workloads is
reduced.

How the Current Interface Supports Future Extensions
====================================================

- Intra-tier distribution policy:
  Currently, swap devices with the same priority are allocated in a
  round-robin fashion. In the future, per-tier policy files under
  /sys/kernel/mm/swap/tiers/ could control how devices within a tier
  are selected (e.g. round-robin, weighted).

- Inter-tier promotion and demotion:
  Promotion and demotion apply between tiers, not within a single
  tier. The current interface defines only tier assignment; it does
  not yet define when or how pages move between tiers. Two triggering
  models are possible:

  (a) User-triggered: userspace explicitly initiates migration between
      tiers (e.g. via a new interface or existing move_pages semantics).
  (b) Kernel-triggered: the kernel moves pages between tiers at
      appropriate points such as reclaim or refault.

  From the memcg perspective, inter-tier movement is bounded by
  memory.swap.tiers.effective -- pages can only be promoted or demoted
  to tiers within the memcg's effective set. The specific policy and
  triggering mechanism require further discussion and are not part of
  this series.

- Per-VMA or per-process swap hints:
  A future madvise-style hint (e.g. MADV_SWAP_TIER) could reference
  the tier indices in /sys/kernel/mm/swap/tiers/. At reclaim time,
  the kernel would check the VMA hint against the memcg's effective
  tier set to pick the swap-out target.
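
To make the last point concrete, the hint check could look roughly
like the sketch below: a hint is honoured only if it names a tier in
the memcg's effective set. Everything here is hypothetical, since no
such hint exists in this series; in particular the function name
resolve_swap_tier and the fallback to the first allowed tier in a
default order are assumptions made for the example.

```python
from typing import Iterable, Optional, Set


def resolve_swap_tier(
    vma_hint: Optional[str],
    effective: Set[str],
    default_order: Iterable[str],
) -> Optional[str]:
    """Pick a swap-out tier, never leaving the memcg's effective set."""
    # Honour the hint only when the memcg is allowed to use that tier.
    if vma_hint is not None and vma_hint in effective:
        return vma_hint
    # Assumed fallback: first tier in the default order that is allowed.
    for tier in default_order:
        if tier in effective:
            return tier
    return None  # nothing allowed, so the page is not swapped out


effective = {"SSD", "HDD"}
order = ["zswap", "SSD", "HDD"]

print(resolve_swap_tier("HDD", effective, order))    # HDD: hint is allowed
print(resolve_swap_tier("zswap", effective, order))  # SSD: hint is outside the set
print(resolve_swap_tier(None, effective, order))     # SSD: no hint given
```

The point of the model is only that a per-VMA hint refines the choice
within memory.swap.tiers.effective and cannot widen it.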

BPF Comparison
==============
The use cases described above already rely on memcg for swap tier
control, and real deployments are built around this model.
A BPF-based approach has additional considerations:

- Hierarchy consistency: BPF programs operate outside the memcg
  tree. Without explicit constraints, a BPF selector could
  contradict parent tier restrictions. Edge cases such as zombie
  memcgs make the resolution less clear.
- Deployment scope: requiring BPF for core swap behavior may not
  be suitable for constrained or embedded configurations.

BPF could still work as an extension on top of the tier model
in the future.

Youngjun Park (4):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interfaces for swap tier selection
  mm: swap: filter swap allocation by memcg tier mask

 Documentation/admin-guide/cgroup-v2.rst |  27 ++
 Documentation/mm/swap-tier.rst          | 159 +++++++++
 MAINTAINERS                             |   3 +
 include/linux/memcontrol.h              |   3 +-
 include/linux/swap.h                    |   1 +
 mm/Kconfig                              |  12 +
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  95 +++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  75 ++++
 mm/swap_tier.c                          | 451 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  74 ++++
 mm/swapfile.c                           |  23 +-
 13 files changed, 923 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 6381a729fa7dda43574d93ab9c61cec516dd885b 
-- 
2.34.1

