Linux cgroups development

* Re: [PATCH v7 7/8] selftest/cgroup: fix zswap attempt_writeback() on 64K pagesize system
From: Michal Koutný @ 2026-06-17 12:26 UTC (permalink / raw)
  To: Li Wang
  Cc: akpm, tj, longman, roman.gushchin, hannes, yosry, jiayuan.chen,
	nphamcs, chengming.zhou, shuah, linux-mm, cgroups,
	linux-kselftest, linux-kernel, Michal Hocko, Muchun Song,
	Shakeel Butt
In-Reply-To: <20260424040059.12940-8-li.wang@linux.dev>

On Fri, Apr 24, 2026 at 12:00:58PM +0800, Li Wang <li.wang@linux.dev> wrote:
> In attempt_writeback(), a memsize of 4M only covers 64 pages on 64K
> page size systems. When memory.reclaim is called, the kernel prefers
> reclaiming clean file pages (binary, libc, linker, etc.) over swapping
> anonymous pages. With only 64 pages of anonymous memory, the reclaim
> target can be largely or entirely satisfied by dropping file pages,
> resulting in very few or zero anonymous pages being pushed into zswap.
> 
> This causes zswap_usage to be extremely small or zero, making
> zswap_usage/4 insufficient to create meaningful writeback pressure.
> The test then fails because no writeback is triggered.
> 
> On 4K page size systems this is not an issue because 4M covers 1024
> pages, and file pages are a small fraction of the reclaim target.
> 
> Fix this by:
> - Always allocating 1024 pages regardless of page size. This ensures
>   enough anonymous pages to reliably populate zswap and trigger
>   writeback, while keeping the original 4M allocation on 4K systems.
> - Setting zswap.max to zswap_usage/4 instead of zswap_usage/2 to
>   create stronger writeback pressure, ensuring reclaim reliably
>   triggers writeback even on large page size systems.
> 
> === Error Log ===
>   # uname -rm
>   6.12.0-211.el10.ppc64le ppc64le
> 
>   # getconf PAGESIZE
>   65536
> 
>   # ./test_zswap
>   TAP version 13
>   1..7
>   ok 1 test_zswap_usage
>   ok 2 test_swapin_nozswap
>   ok 3 test_zswapin
>   not ok 4 test_zswap_writeback_enabled
>   ...
> 
> Signed-off-by: Li Wang <li.wang@linux.dev>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Nhat Pham <nphamcs@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: Yosry Ahmed <yosry@kernel.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  tools/testing/selftests/cgroup/test_zswap.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>
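
The page-count arithmetic in the quoted commit message can be sketched as below. This is a minimal illustration, not the code from the patch: WRITEBACK_TEST_PAGES and the helper names are hypothetical, and only the figures (4M, 1024 pages, 64 pages on 64K systems) come from the message.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical constant: the page count quoted in the commit message. */
#define WRITEBACK_TEST_PAGES 1024

/* Bytes to allocate so the workload always covers WRITEBACK_TEST_PAGES pages. */
static size_t writeback_memsize(long page_size)
{
	return (size_t)WRITEBACK_TEST_PAGES * (size_t)page_size;
}

/* Pages covered by a fixed byte budget, e.g. the old 4M allocation. */
static size_t pages_covered(size_t bytes, long page_size)
{
	return bytes / (size_t)page_size;
}

/* Same computation using the page size of the running system. */
static size_t writeback_memsize_runtime(void)
{
	return writeback_memsize(sysconf(_SC_PAGESIZE));
}
```

With a fixed 4M budget, pages_covered() gives 1024 pages at 4K but only 64 pages at 64K, which is the shortfall the patch describes; sizing by page count keeps 4M on 4K systems and grows to 64M on 64K systems.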

* Re: [PATCH v7 8/8] selftests/cgroup: test_zswap: wait for asynchronous writeback
From: Michal Koutný @ 2026-06-17 12:26 UTC (permalink / raw)
  To: Li Wang
  Cc: akpm, tj, longman, roman.gushchin, hannes, yosry, jiayuan.chen,
	nphamcs, chengming.zhou, shuah, linux-mm, cgroups,
	linux-kselftest, linux-kernel, Michal Hocko, Muchun Song,
	Shakeel Butt, Yosry Ahmed
In-Reply-To: <20260424040059.12940-9-li.wang@linux.dev>

On Fri, Apr 24, 2026 at 12:00:59PM +0800, Li Wang <li.wang@linux.dev> wrote:
> @@ -345,7 +366,10 @@ static int test_zswap_writeback_one(const char *cgroup, bool wb)
>  		return -1;
>  
>  	/* Verify that zswap writeback occurred only if writeback was enabled */
> -	zswpwb_after = get_cg_wb_count(cgroup);
> +	if (wb)
> +		zswpwb_after = wait_for_writeback(cgroup, 5000);

We should use something like
	cg_read_key_long_poll(cgroup,
			      "memory.stat",
			      "zswpwb",
			      0,
			      500,
			      DEFAULT_WAIT_INTERVAL_US);
for this.

Although this also needs a further change like the following (plus the respective adjustments):

diff --git a/tools/testing/selftests/cgroup/lib/cgroup_util.c b/tools/testing/selftests/cgroup/lib/cgroup_util.c
index a7b3380d88d77..c0511853db9c6 100644
--- a/tools/testing/selftests/cgroup/lib/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/lib/cgroup_util.c
@@ -188,8 +188,8 @@ long cg_read_key_long(const char *cgroup, const char *control, const char *key)
 }

 long cg_read_key_long_poll(const char *cgroup, const char *control,
-                          const char *key, long expected, int retries,
-                          useconds_t wait_interval_us)
+                          const char *key, enum exp_op expected_op, long expected,
+                          int retries, useconds_t wait_interval_us)
 {
        long val = -1;
        int i;
@@ -199,7 +199,9 @@ long cg_read_key_long_poll(const char *cgroup, const char *control,
                if (val < 0)
                        return val;

-               if (val == expected)
+               if (expected_op == EXP_EQUAL && val == expected)
+                       break;
+               if (expected_op == EXP_GT && val > expected)
                        break;

                usleep(wait_interval_us);
diff --git a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
index 567b1082974c5..3e9bfb66cf5a9 100644
--- a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
@@ -19,6 +19,11 @@

 #define DEFAULT_WAIT_INTERVAL_US (100 * 1000) /* 100 ms */

+enum exp_op {
+       EXP_EQUAL,
+       EXP_GT,
+};
+
 /*
  * Checks if two given values differ by less than err% of their sum.
  */
@@ -69,8 +74,8 @@ extern long cg_read_long(const char *cgroup, const char *control);
 extern long cg_read_long_fd(int fd);
 long cg_read_key_long(const char *cgroup, const char *control, const char *key);
 long cg_read_key_long_poll(const char *cgroup, const char *control,
-                          const char *key, long expected, int retries,
-                          useconds_t wait_interval_us);
+                          const char *key, enum exp_op expected_op, long expected,
+                          int retries, useconds_t wait_interval_us);
 extern long cg_read_lc(const char *cgroup, const char *control);
 extern int cg_write(const char *cgroup, const char *control, char *buf);
 extern int cg_open(const char *cgroup, const char *control, int flags);

* Re: [PATCH v7 2/8] selftests/cgroup: avoid OOM in test_swapin_nozswap
From: Michal Koutný @ 2026-06-17 12:26 UTC (permalink / raw)
  To: Li Wang
  Cc: akpm, tj, longman, roman.gushchin, hannes, yosry, jiayuan.chen,
	nphamcs, chengming.zhou, shuah, linux-mm, cgroups,
	linux-kselftest, linux-kernel, Michal Hocko, Muchun Song,
	Shakeel Butt
In-Reply-To: <20260424040059.12940-3-li.wang@linux.dev>

On Fri, Apr 24, 2026 at 12:00:53PM +0800, Li Wang <li.wang@linux.dev> wrote:
> test_swapin_nozswap can hit OOM before reaching its assertions on some
> setups.

Is it because of differences in available IO, or what else does it depend on?

> The test currently sets memory.max=8M and then allocates/reads
> 32M with memory.zswap.max=0, which may over-constrain reclaim and kill
> the workload process.
> 
> Replace hardcoded sizes with PAGE_SIZE-based values:
>   - control_allocation_size = PAGE_SIZE * 512
>   - memory.max = control_allocation_size * 3 / 4
>   - minimum expected swap = control_allocation_size / 4
> 
> This keeps the test pressure model intact (allocate/read beyond memory.max to
> force swap-in/out) while making it more robust across different environments.

I see you chose an allocation value that preserves the absolute values on 64k
systems, so the test is different on 4k ones. Any specific reason for that?


Thanks,
Michal
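
To make the question above concrete, the three PAGE_SIZE-based values from the quoted commit message can be worked out per page size. This is a small sketch, with struct swapin_sizes and swapin_sizes_for() as hypothetical names; the formulas are the ones listed in the message.

```c
#include <stddef.h>

/* Hypothetical container for the three values listed in the commit message. */
struct swapin_sizes {
	size_t alloc_size;	/* control_allocation_size */
	size_t memory_max;	/* value written to memory.max */
	size_t min_swap;	/* minimum expected swap */
};

/* Apply the formulas from the commit message for a given page size. */
static struct swapin_sizes swapin_sizes_for(size_t page_size)
{
	struct swapin_sizes s;

	s.alloc_size = page_size * 512;
	s.memory_max = s.alloc_size * 3 / 4;
	s.min_swap = s.alloc_size / 4;
	return s;
}
```

For 64K pages this yields 32M / 24M / 8M, so the allocation matches the old hardcoded 32M; for 4K pages it yields 2M / 1.5M / 512K, which is the difference being asked about.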

* Re: [PATCH v7 1/8] selftests/cgroup: skip test_zswap if zswap is globally disabled
From: Michal Koutný @ 2026-06-17 12:25 UTC (permalink / raw)
  To: Li Wang
  Cc: akpm, tj, longman, roman.gushchin, hannes, yosry, jiayuan.chen,
	nphamcs, chengming.zhou, shuah, linux-mm, cgroups,
	linux-kselftest, linux-kernel, Michal Hocko, Muchun Song,
	Shakeel Butt, Yosry Ahmed
In-Reply-To: <20260424040059.12940-2-li.wang@linux.dev>

On Fri, Apr 24, 2026 at 12:00:52PM +0800, Li Wang <li.wang@linux.dev> wrote:
> test_zswap currently only checks whether zswap is present by testing
> /sys/module/zswap. This misses the runtime global state exposed in
> /sys/module/zswap/parameters/enabled.
> 
> When zswap is built/loaded but globally disabled, the zswap cgroup
> selftests run in an invalid environment and may fail spuriously.
> 
> Check the runtime enabled state before running the tests:
>   - skip if zswap is not configured,
>   - fail if the enabled knob cannot be read,
>   - skip if zswap is globally disabled.
> 
> Also print a hint in the skip message on how to enable zswap.
> 
> Signed-off-by: Li Wang <li.wang@linux.dev>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Nhat Pham <nphamcs@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Yosry Ahmed <yosry@kernel.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  tools/testing/selftests/cgroup/test_zswap.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>
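
The three-way check described in the quoted commit message could look roughly like this. It is a sketch only, not the code from the patch: the enum, the function name, and the path parameters are hypothetical, and accepting 'y' and '1' besides 'Y' is an assumption made for leniency.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical result codes for the three outcomes in the commit message. */
enum zswap_check {
	ZSWAP_READY,		/* run the tests */
	ZSWAP_SKIP_ABSENT,	/* zswap not configured */
	ZSWAP_SKIP_DISABLED,	/* zswap globally disabled */
	ZSWAP_FAIL_UNREADABLE,	/* enabled knob cannot be read */
};

/*
 * module_dir would be /sys/module/zswap and enabled_path
 * /sys/module/zswap/parameters/enabled; both are parameters here so the
 * logic can be exercised against ordinary files.
 */
static enum zswap_check check_zswap(const char *module_dir,
				    const char *enabled_path)
{
	FILE *f;
	int c;

	if (access(module_dir, F_OK) != 0)
		return ZSWAP_SKIP_ABSENT;

	f = fopen(enabled_path, "r");
	if (!f)
		return ZSWAP_FAIL_UNREADABLE;

	c = fgetc(f);
	fclose(f);
	if (c == EOF)
		return ZSWAP_FAIL_UNREADABLE;

	/* Assumption: an enabled knob starts with 'Y' (or 'y' / '1'). */
	if (c == 'Y' || c == 'y' || c == '1')
		return ZSWAP_READY;
	return ZSWAP_SKIP_DISABLED;
}
```

A caller would map the two SKIP results to a skipped test (printing the enable hint for the disabled case) and the FAIL result to a test failure.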

* Re: [PATCH v7 3/8] selftests/cgroup: use runtime page size for zswpin check
From: Michal Koutný @ 2026-06-17 12:25 UTC (permalink / raw)
  To: Li Wang
  Cc: akpm, tj, longman, roman.gushchin, hannes, yosry, jiayuan.chen,
	nphamcs, chengming.zhou, shuah, linux-mm, cgroups,
	linux-kselftest, linux-kernel, Michal Hocko, Muchun Song,
	Shakeel Butt
In-Reply-To: <20260424040059.12940-4-li.wang@linux.dev>

On Fri, Apr 24, 2026 at 12:00:54PM +0800, Li Wang <li.wang@linux.dev> wrote:
> test_zswapin compares memory.stat:zswpin (counted in pages) against a
> byte threshold converted with PAGE_SIZE. In cgroup selftests, PAGE_SIZE
> is hardcoded to 4096, which makes the conversion wrong on systems with
> non-4K base pages (e.g. 64K).
> 
> As a result, the test requires too many pages to pass and fails
> spuriously even when zswap is working.
> 
> Use sysconf(_SC_PAGESIZE) for the zswpin threshold conversion so the
> check matches the actual system page size.
> 
> Signed-off-by: Li Wang <li.wang@linux.dev>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Nhat Pham <nphamcs@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Yosry Ahmed <yosry@kernel.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
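
The effect of the hardcoded page size on the zswpin threshold can be illustrated with a few lines of arithmetic. This is a sketch under assumptions: ZSWPIN_MIN_BYTES is a hypothetical 24M threshold chosen for the example and the helper names are made up; only the use of sysconf(_SC_PAGESIZE) comes from the commit message.

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical byte threshold, used only for this illustration. */
#define ZSWPIN_MIN_BYTES (24L * 1024 * 1024)

/* Convert a byte threshold into pages for a given page size. */
static long bytes_to_pages(long bytes, long page_size)
{
	return bytes / page_size;
}

/* zswpin is counted in pages, so compare against the threshold in pages. */
static bool zswpin_ok(long zswpin_pages, long page_size)
{
	return zswpin_pages >= bytes_to_pages(ZSWPIN_MIN_BYTES, page_size);
}

/* Variant using the page size of the running system. */
static bool zswpin_ok_runtime(long zswpin_pages)
{
	return zswpin_ok(zswpin_pages, sysconf(_SC_PAGESIZE));
}
```

With this threshold, 4096-byte pages require 6144 pages while 65536-byte pages require 384, so a 64K system reporting 400 pages passes with the runtime page size and fails with the hardcoded 4096.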

* Re: [PATCH v3 01/15] mm/slab: do not init any kfence objects on allocation
From: Marco Elver @ 2026-06-17 11:52 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Suren Baghdasaryan, Alexei Starovoitov,
	Andrew Morton, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Alexander Potapenko, Dmitry Vyukov, kasan-dev, linux-mm,
	linux-kernel, cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-1-ce1146d140fb@kernel.org>

On Mon, 15 Jun 2026 at 13:54, Vlastimil Babka (SUSE) <vbabka@kernel.org> wrote:
>
> When init (zeroing) on allocation is requested, for kmalloc() we
> generally have to zero the full object size even if a smaller size is
> requested, in order to provide krealloc()'s __GFP_ZERO guarantees.
>
> When we end up allocating a kfence object, kfence performs the zeroing
> on its own because it has its own redzone beyond the requested size.
> Thus slab_post_alloc_hook() has an 'init' parameter which has to be
> evaluated in all callers (via slab_want_init_on_alloc()) and should be
> false for kfence allocations.
>
> For kfence allocations in slab_alloc_node() this is achieved by subtly
> skipping over the slab_want_init_on_alloc() call. Other callers (i.e.
> kmem_cache_alloc_bulk_noprof()) however evaluate it unconditionally even
> if they do end up with a kfence allocation. This is only subtly not a
> problem, as those are not kmalloc allocations and thus the "requested
> size" equals s->object_size and thus it cannot interfere with kfence's
> redzone. There's just an unnecessary double zeroing (in both kfence and
> slab_post_alloc_hook()), but it's all very fragile and contradicts the
> comment in kfence_guarded_alloc().
>
> Remove this subtlety and simplify the code by eliminating the init
> parameter from slab_post_alloc_hook() and make it call
> slab_want_init_on_alloc() itself. Instead add an is_kfence_address()
> check before performing the memset, which will start doing the right
> thing for all callers of slab_post_alloc_hook().
>
> This potentially adds overhead of the is_kfence_address() check to
> allocation hotpath, but that one is designed to be as small as possible,
> and it's only evaluated if zeroing is about to happen. This means (aside
> from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
> zeroing itself comes with an overhead likely larger than the added
> check.
>
> While at it, refactor the handling of evaluating when KASAN does the
> init instead of SLUB, with no intended functional changes. A
> non-functional change is that we don't pass kasan_init as true to
> kasan_slab_alloc() if kasan has no integrated init, but then the value
> is ignored anyway, so it's theoretically more correct.
>
> Thanks to Harry Yoo for the initial refactoring attempt, and for updated
> comments that are used here.
>
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-2-7190909db118@kernel.org
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

Reviewed-by: Marco Elver <elver@google.com>

> ---
>  mm/kfence/core.c |  2 +-
>  mm/slub.c        | 60 ++++++++++++++++++++++++++------------------------------
>  2 files changed, 29 insertions(+), 33 deletions(-)
>
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index 655dc5ce3240..5e0b406924e9 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -500,7 +500,7 @@ static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t g
>
>         /*
>          * We check slab_want_init_on_alloc() ourselves, rather than letting
> -        * SL*B do the initialization, as otherwise we might overwrite KFENCE's
> +        * slab do the initialization, as otherwise it might overwrite KFENCE's
>          * redzone.
>          */
>         if (unlikely(slab_want_init_on_alloc(gfp, cache)))
> diff --git a/mm/slub.c b/mm/slub.c
> index e2ee8f1aaccf..d762cbe5d040 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4565,13 +4565,13 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
>
>  static __fastpath_inline
>  bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> -                         gfp_t flags, size_t size, void **p, bool init,
> +                         gfp_t flags, size_t size, void **p,
>                           unsigned int orig_size)
>  {
> +       bool init = slab_want_init_on_alloc(flags, s);
>         unsigned int zero_size = s->object_size;
> -       bool kasan_init = init;
> -       size_t i;
>         gfp_t init_flags = flags & gfp_allowed_mask;
> +       bool kasan_init = false;
>
>         /*
>          * For kmalloc object, the allocated size (object_size) can be larger
> @@ -4588,28 +4588,33 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>                 zero_size = orig_size;
>
>         /*
> -        * When slab_debug is enabled, avoid memory initialization integrated
> -        * into KASAN and instead zero out the memory via the memset below with
> -        * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
> -        * cause false-positive reports. This does not lead to a performance
> +        * ARM64 can set memory tags and zero the memory using a single
> +        * instruction. Since HW_TAGS KASAN uses that while tagging the object,
> +        * separate zeroing is unnecessary.
> +        *
> +        * However, KASAN never zeroes memory when slab_debug is enabled to
> +        * avoid overwriting SLUB redzones. This does not lead to a performance
>          * penalty on production builds, as slab_debug is not intended to be
>          * enabled there.
>          */
> -       if (__slub_debug_enabled())
> -               kasan_init = false;
> +       if (kasan_has_integrated_init() && !__slub_debug_enabled()) {
> +               kasan_init = init;
> +               init = false;
> +       }
>
> -       /*
> -        * As memory initialization might be integrated into KASAN,
> -        * kasan_slab_alloc and initialization memset must be
> -        * kept together to avoid discrepancies in behavior.
> -        *
> -        * As p[i] might get tagged, memset and kmemleak hook come after KASAN.
> -        */
> -       for (i = 0; i < size; i++) {
> +       for (size_t i = 0; i < size; i++) {
>                 p[i] = kasan_slab_alloc(s, p[i], init_flags, kasan_init);
> -               if (p[i] && init && (!kasan_init ||
> -                                    !kasan_has_integrated_init()))
> +
> +               /*
> +                * memset and hooks come after KASAN as p[i] might get tagged
> +                *
> +                * kfence zeroes the object instead of SLUB to avoid overwriting
> +                * its own redzone starting at orig_size, which could happen
> +                * with SLUB zeroing full s->object_size
> +                */
> +               if (init && p[i] && !is_kfence_address(p[i]))
>                         memset(p[i], 0, zero_size);
> +
>                 if (gfpflags_allow_spinning(flags))
>                         kmemleak_alloc_recursive(p[i], s->object_size, 1,
>                                                  s->flags, init_flags);
> @@ -4910,7 +4915,6 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>                 gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
>  {
>         void *object;
> -       bool init = false;
>
>         s = slab_pre_alloc_hook(s, gfpflags);
>         if (unlikely(!s))
> @@ -4926,16 +4930,13 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>                 object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
>
>         maybe_wipe_obj_freeptr(s, object);
> -       init = slab_want_init_on_alloc(gfpflags, s);
>
>  out:
>         /*
> -        * When init equals 'true', like for kzalloc() family, only
> -        * @orig_size bytes might be zeroed instead of s->object_size
>          * In case this fails due to memcg_slab_post_alloc_hook(),
>          * object is set to NULL
>          */
> -       slab_post_alloc_hook(s, lru, gfpflags, 1, &object, init, orig_size);
> +       slab_post_alloc_hook(s, lru, gfpflags, 1, &object, orig_size);
>
>         return object;
>  }
> @@ -5230,7 +5231,6 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
>                                    struct slab_sheaf *sheaf)
>  {
>         void *ret = NULL;
> -       bool init;
>
>         if (sheaf->size == 0)
>                 goto out;
> @@ -5240,10 +5240,8 @@ kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
>         if (likely(!ret))
>                 ret = sheaf->objects[--sheaf->size];
>
> -       init = slab_want_init_on_alloc(gfp, s);
> -
>         /* add __GFP_NOFAIL to force successful memcg charging */
> -       slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
> +       slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, s->object_size);
>  out:
>         trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
>
> @@ -5423,8 +5421,7 @@ void *_kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_flags, in
>
>  success:
>         maybe_wipe_obj_freeptr(s, ret);
> -       slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
> -                            slab_want_init_on_alloc(alloc_gfp, s), orig_size);
> +       slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret, orig_size);
>
>         ret = kasan_kmalloc(s, ret, orig_size, alloc_gfp);
>         return ret;
> @@ -7339,8 +7336,7 @@ bool kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags,
>
>  out:
>         /* memcg and kmem_cache debug support and memory initialization */
> -       return likely(slab_post_alloc_hook(s, NULL, flags, size, p,
> -                       slab_want_init_on_alloc(flags, s), s->object_size));
> +       return likely(slab_post_alloc_hook(s, NULL, flags, size, p, s->object_size));
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
>
>
> --
> 2.54.0
>
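
The decision logic that the patch above moves into slab_post_alloc_hook() can be modelled outside the kernel as a pure function. This is a user-space sketch, not kernel code: struct init_plan and plan_init() are hypothetical, the boolean parameters stand in for slab_want_init_on_alloc(), kasan_has_integrated_init(), __slub_debug_enabled() and is_kfence_address(), and the NULL check on p[i] is left out.

```c
#include <stdbool.h>

/* Hypothetical summary of who initializes an object. */
struct init_plan {
	bool kasan_init;	/* value passed to kasan_slab_alloc() */
	bool slab_memset;	/* whether the hook itself calls memset() */
};

/*
 * Mirrors the control flow in the patched hook:
 *  - KASAN zeroes only with integrated init and slab_debug disabled,
 *  - otherwise the hook zeroes, except for kfence objects, which kfence
 *    has already zeroed itself.
 */
static struct init_plan plan_init(bool want_init, bool kasan_integrated_init,
				  bool slub_debug, bool is_kfence_obj)
{
	struct init_plan plan = { .kasan_init = false, .slab_memset = false };
	bool init = want_init;

	if (kasan_integrated_init && !slub_debug) {
		plan.kasan_init = init;
		init = false;
	}

	plan.slab_memset = init && !is_kfence_obj;
	return plan;
}
```

In this model a kfence object is never zeroed by the hook regardless of the caller, which is the property the commit message wants for all callers rather than only for slab_alloc_node().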

* Re: [GIT PULL] cgroup changes for v7.2
From: pr-tracker-bot @ 2026-06-17 11:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, linux-kernel, cgroups, Johannes Weiner,
	Michal Koutný, Waiman Long
In-Reply-To: <b935100af77c4a118f8180ca31627e35@kernel.org>

The pull request you sent on Mon, 15 Jun 2026 12:01:08 -1000:

> https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git tags/cgroup-for-7.2

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/83476cc97bc635a3ff502bd194c79bfb1f1ae050

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

* Re: [PATCH v3 13/15] mm/slab: introduce kmalloc_flags()
From: Harry Yoo @ 2026-06-17 11:16 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-13-ce1146d140fb@kernel.org>


On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
> alloc flag that prevents kmalloc recursion. For that we need a version
> of kmalloc() that takes alloc_flags and use it in places that perform
> these potentially recursive kmalloc allocations (of sheaves or obj_ext
> arrays).
> 
> Add this function, named kmalloc_flags(). Right now it's only useful for
> these nested allocations, so it doesn't need to optimize build-time
> constant sizes like kmalloc() or kmalloc_buckets.
> 
> Since we need it to support both normal and non-spinning
> kmalloc_nolock() context through the SLAB_ALLOC_NOLOCK flag, split out
> most of the special _kmalloc_nolock_noprof() implementation to
> __kmalloc_nolock_noprof() that takes a slab_alloc_context, and make
> _kmalloc_nolock_noprof() a simple tail calling wrapper with the proper
> context.
> 
> kmalloc_flags() can thus determine whether to call
> __kmalloc_nolock_noprof() or __do_kmalloc_node(), based on the
> given alloc_flags.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-14-7190909db118@kernel.org
> Reviewed-by: Hao Li <hao.li@linux.dev>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon

* Re: [PATCH v3] security: Expand task_setscheduler LSM hook to include CPU affinity mask
From: Aaron Tomlin @ 2026-06-17 10:52 UTC (permalink / raw)
  To: Paul Moore, peterz
  Cc: peterz, tsbogend, jmorris, serge, mingo, juri.lelli,
	vincent.guittot, stephen.smalley.work, casey, longman, tj, hannes,
	mkoutny, chenridong, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, omosnace, kees, neelx, sean, chjohnst,
	steve, mproche, nick.lange, cgroups, linux-mips, linux-fsdevel,
	linux-security-module, selinux, linux-kernel
In-Reply-To: <CAHC9VhSROg6RGUN4_ZVBoEwYjRnKvyjnkbx2D88c09KiTgY3KQ@mail.gmail.com>

On Mon, Jun 15, 2026 at 06:03:06PM -0400, Paul Moore wrote:
> > Hi Paul,
> >
> > I am writing to politely follow up on the discussion above regarding the
> > proposed enhancement to the sched_setaffinity LSM hook.
> 
> Generally speaking I wait until all dependencies land in Linus' tree.
> I've lost a lot of time in the past sorting out issues only to have
> one of the dependencies rejected.
> 
> > As you will see from the thread, Peter Zijlstra and I have discussed the
> > architectural justification for this change. While the cpuset cgroup
> > controller effectively handles spatial enforcement, it silently truncates
> > requested affinity masks. Passing the raw in_mask to the LSM hook enables
> > security modules (such as the BPF LSM) to audit and mediate the actual
> > intent of the request before the kernel sanitises the input, a capability
> > that cgroups inherently lack.
> 
> The issue of resource control comes up from time to time within the
> context of LSMs, and my general comment is that we likely need to see
> a more comprehensive approach to what access control on resource
> limits would look like from a LSM perspective.  We've seen a lot of
> quick changes to solve very specific problems, but I have yet to see a
> good proposal of what it would look like for a more comprehensive
> approach.
> 
> There is also another issue to consider: none of the in-tree LSMs
> currently use these new parameters, raising questions about their
> purpose, maintainability, etc.  While this is not necessarily a deal
> breaker, it does go along with my comment above about taking a more
> holistic view of LSM resource controls.
> 
> To summarize, I haven't thought about this too much yet because there
> are other fires/patches that don't (currently) have the dependency
> issues of this patch.  I would also feel a lot better if there was an
> in-tree user of this parameter and some discussion of how this might
> fit into a more holistic approach to controlling resource limits in
> the LSM subsystem.

Hi Paul,

Thank you for the transparent feedback.

Your point regarding the need for a comprehensive, holistic approach to
resource limits within the LSM subsystem is well taken.

To clarify my intent, my primary motivation for this patch is actually
rooted in observability and auditing, though I view the capacity for
finer-grained resource control as a natural, downstream benefit of exposing
these raw parameters to the LSM.

As I mentioned to Peter Zijlstra, because the core kernel silently
truncates the requested affinity masks before the LSM hook evaluates them,
security modules are fundamentally blind to the original userspace intent.
If a process requests an overly broad or malformed mask, an auditing tool
attached via the BPF LSM logs the sanitised outcome rather than the
attempted action. Passing the raw 'in_mask' and 'sched_attr' ensures that
security modules can accurately observe and log exactly what was requested.

That said, I completely understand your reluctance to review this while the
dependency [1][2] remains out-of-tree. I will hold off on pushing this specific
patch further until this prerequisite has officially landed in mainline.

Regarding your request for an in-tree user: when the dependency is added
and I provide the next iteration, I will ensure it includes a concrete
demonstration of this auditing capability to justify the maintenance burden
of the expanded hook signature.

Thanks again for your time and guidance.

[1]: https://lore.kernel.org/lkml/ai_T_uRkojOsTE-Z@alpha.franken.de/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/mips/linux.git/commit/?id=98e37db4a34d3af3fb2f4648295c25b5e40b20e3

Kind regards,
-- 
Aaron Tomlin
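
The observability gap described in this message can be illustrated with plain bitmasks. This is a simplified sketch, not kernel code: it assumes the truncation amounts to intersecting the requested mask with the allowed CPUs, and the function names are hypothetical.

```c
#include <stdint.h>

/* Mask that remains after intersecting the request with the allowed CPUs. */
static uint64_t effective_mask(uint64_t requested, uint64_t allowed)
{
	return requested & allowed;
}

/* CPUs that were requested but silently dropped by the intersection. */
static uint64_t dropped_cpus(uint64_t requested, uint64_t allowed)
{
	return requested & ~allowed;
}
```

A request for CPUs 0-7 (0xFF) inside a cpuset allowing CPUs 0-3 (0x0F) ends up as 0x0F; a hook that only sees the effective mask cannot tell that CPUs 4-7 (0xF0) were also requested, which is the information the raw in_mask would carry.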

* Re: [PATCH v3 12/15] mm/slab: allow __GFP_NOMEMALLOC and __GFP_NOWARN for kmalloc_nolock()
From: Harry Yoo @ 2026-06-17  9:41 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-12-ce1146d140fb@kernel.org>


On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> The two flags are added internally so there's no point for warning if
> they are passed by the caller as well, so allow them. This will allow
> simplifying obj_ext allocation under kmalloc_nolock().
> 
> Also it's not necessary to have the extra alloc_gfp variable for adding
> the two flags. The original gfp_flags parameter is not used anywhere
> except for the warning. So remove alloc_gfp and directly modify and use
> gfp_flags everywhere.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-13-7190909db118@kernel.org
> Reviewed-by: Hao Li <hao.li@linux.dev>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon

* Re: [PATCH v3 11/15] mm/slab: pass slab_alloc_context to __do_kmalloc_node()
From: Harry Yoo @ 2026-06-17  9:36 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-11-ce1146d140fb@kernel.org>


On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an
> alloc flag that prevents kmalloc recursion. For that we need a version
> of kmalloc() that takes alloc_flags and use it in places that perform
> these potentially recursive kmalloc allocations (of sheaves or obj_ext
> arrays).
> 
> As a preparatory step, make __do_kmalloc_node() take a pointer to
> slab_alloc_context. This replaces the 'size' and 'caller' parameters and
> includes alloc_flags which we'll make use of.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-12-7190909db118@kernel.org
> Reviewed-by: Hao Li <hao.li@linux.dev>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v3 10/15] mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
From: Harry Yoo @ 2026-06-17  9:28 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-10-ce1146d140fb@kernel.org>


[-- Attachment #1.1: Type: text/plain, Size: 1135 bytes --]



On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> The last user of gfpflags_allow_spinning() in slab is
> alloc_from_pcs_bulk(), which is only called from
> kmem_cache_alloc_bulk().
> 
> It turns out that gfpflags_allow_spinning() is not necessary, because
> kmem_cache_alloc_bulk() is only expected to be called from context that
> does allow spinning, so simply replace it with 'true'. This means we can
> also drop the gfp parameter from alloc_from_pcs_bulk().
> 
> With that, we can remove the "@flags must allow spinning" part of the
> kernel doc, as there is no more connection to the gfp flags in the slab
> implementation.
> 
> Also remove a comment in alloc_slab_obj_exts() because there should be
> no more false positives possible due to gfp_allowed_mask during early
> boot.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-11-7190909db118@kernel.org
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v3 09/15] mm/slab: replace slab_alloc_node() parameters with slab_alloc_context
From: Harry Yoo @ 2026-06-17  9:24 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-9-ce1146d140fb@kernel.org>


[-- Attachment #1.1: Type: text/plain, Size: 899 bytes --]



On 6/15/26 8:54 PM, Vlastimil Babka (SUSE) wrote:
> The function takes all the parameters that exist as fields in
> slab_alloc_context, except alloc_flags. Replace them with a single
> pointer.
> 
> This moves slab_alloc_context initialization to a number of callers,
> which is more verbose, but arguably also more clear than a long list of
> parameters, and most do not use the 'lru' field.
> 
> This will also allow kmalloc_nolock() to call slab_alloc_node() and
> reduce the special open-coding it currently has.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-10-7190909db118@kernel.org
> Reviewed-by: Hao Li <hao.li@linux.dev>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v8 4/4] mm: swap: filter swap allocation by memcg tier mask
From: YoungJun Park @ 2026-06-17  6:24 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	baoquan.he, baohua, yosry, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim
In-Reply-To: <20260617053447.2831896-5-youngjun.park@lge.com>

https://sashiko.dev/#/patchset/20260617053447.2831896-1-youngjun.park@lge.com?part=4

Regarding the review from Sashiko.

The AI review pointed out two things: a pre-existing issue and a valid
issue regarding swap_sync_discard().

For the pre-existing issue, I believe it is out of scope to address it
together in this patchset.

However, the second point is a valid issue: swap_sync_discard() can
discard a non-selected tier, succeed, and then retry swap allocation.
I will fix this in the next version by checking the memcg mask within
swap_sync_discard().

^ permalink raw reply

* Re: [PATCH v8 3/4] mm: memcontrol: add interface for swap tier selection
From: YoungJun Park @ 2026-06-17  6:10 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	baoquan.he, baohua, yosry, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim
In-Reply-To: <20260617053447.2831896-4-youngjun.park@lge.com>

Sashiko review

https://sashiko.dev/#/message/20260617053447.2831896-4-youngjun.park%40lge.com

I think this concern might be overstated. In normal situations, cgroup
delegation is strictly under the administrator's control.

Furthermore, if a user already has a delegated root, there are numerous
other ways they could potentially cause a DoS. Therefore, I believe
the patch does not need to be modified for this.

^ permalink raw reply

* [PATCH v8 3/4] mm: memcontrol: add interface for swap tier selection
From: Youngjun Park @ 2026-06-17  5:34 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, youngjun.park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, yosry, gunho.lee,
	taejoon.song, hyungjun.cho, mkoutny, baver.bae, matia.kim
In-Reply-To: <20260617053447.2831896-1-youngjun.park@lge.com>

Introduce memory.swap.tiers.max, a flat-keyed file listing each
tier defined in /sys/kernel/mm/swap/tiers with its state, "max"
(allowed, the default) or "0" (disabled).  A tier is one bit in the
cgroup's tier mask, so writing "<tier> max" or "<tier> 0" sets or
clears that bit.
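
As a rough userspace illustration of the write format described above (not
the kernel code from this patch), the sketch below parses one
"<tier> max" or "<tier> 0" line and applies it to a bitmask. The names
demo_parse_tier_write() and demo_apply() are hypothetical, as are the
32-byte name buffer and the use of sscanf(); the kernel handler uses its
own string helpers.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace parser for one "<tier> max" / "<tier> 0" write.
 * Returns 1 to set the tier bit, 0 to clear it, -1 for invalid input.
 * The tier name (up to 31 characters) is copied into name[32].
 */
static int demo_parse_tier_write(const char *buf, char name[32])
{
	char val[8];
	char extra;
	int n;

	/* Exactly two whitespace-separated tokens are accepted. */
	n = sscanf(buf, "%31s %7s %c", name, val, &extra);
	if (n != 2)
		return -1;

	if (!strcmp(val, "max"))
		return 1;
	if (!strcmp(val, "0"))
		return 0;
	return -1;
}

/* Set or clear one tier bit in a mask; bit is the tier's mask bit. */
static int demo_apply(int mask, int bit, int enable)
{
	return enable ? (mask | bit) : (mask & ~bit);
}
```

Any value other than "max" or "0" is rejected here, mirroring the
on/off-only semantics stated above.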

Since the current use case lacks amount control, it only supports
"max" (on) and "0" (off). Therefore, it does not track per-tier swap
usage, relying instead on a fast runtime bitmask check.

We maintain both `mask` and `effective_mask`. The `effective_mask` is
strictly bounded by the parent (e.g., if a parent is "0", the child's
effective state is "0" even if its `mask` is "max"). Maintaining this
separately avoids costly cgroup tree traversals to check ancestors at
runtime.
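
To make the mask/effective_mask relationship concrete, here is a minimal
userspace sketch, assuming a simplified stand-in structure (struct
demo_cgroup) and a caller that visits cgroups parents-first. It is not
the kernel implementation; locking, RCU and the real memcg tree walk are
left out.

```c
#include <stddef.h>

#define DEMO_TIER_ALL_MASK (~0)

/* Hypothetical stand-in for struct mem_cgroup; only the mask fields. */
struct demo_cgroup {
	struct demo_cgroup *parent;	/* NULL for the root */
	int tier_mask;			/* what this cgroup requested */
	int tier_effective_mask;	/* requested AND allowed by ancestors */
};

/* Clamp a cgroup's own mask by its parent's effective mask. */
static void demo_inherit_mask(struct demo_cgroup *cg)
{
	int parent_mask = cg->parent ? cg->parent->tier_effective_mask
				     : DEMO_TIER_ALL_MASK;

	cg->tier_effective_mask = parent_mask & cg->tier_mask;
}

/*
 * Recompute effective masks for an array ordered parents-first, so each
 * parent's effective mask is up to date before its children read it.
 */
static void demo_sync(struct demo_cgroup *order[], int n)
{
	for (int i = 0; i < n; i++)
		demo_inherit_mask(order[i]);
}
```

With this layout a runtime check only reads tier_effective_mask of the
cgroup in question; the ancestor walk is paid once at write time.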

Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Suggested-by: Yosry Ahmed <yosry@kernel.org>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..4843ffcfd110 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1850,6 +1850,26 @@ The following nested keys are defined.
 	Swap usage hard limit.  If a cgroup's swap usage reaches this
 	limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.swap.tiers.max
+	A read-write flat-keyed file which exists on non-root
+	cgroups.  The default is "max" for every tier.
+
+	Limits the swap tiers this cgroup may swap to.  Tiers are
+	defined globally in /sys/kernel/mm/swap/tiers and listed here,
+	one per line. When read, the values are displayed in descending
+	order of the tiers (highest tier first)::
+
+	  <tier_1> max
+	  <tier_2> 0
+	  ...
+
+	Currently, only "max" and "0" are supported. "max" allows the
+	tier, "0" disables it.  Each write sets a single "<tier> max"
+	or "<tier> 0" pair.
+
+	A child may only narrow what its parent allows. A tier an
+	ancestor disabled stays disabled regardless of the value here.
+
   memory.swap.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
diff --git a/Documentation/mm/swap-tier.rst b/Documentation/mm/swap-tier.rst
index 0fb4a1153a67..addbc495de8c 100644
--- a/Documentation/mm/swap-tier.rst
+++ b/Documentation/mm/swap-tier.rst
@@ -15,6 +15,15 @@ speed to fully utilize this feature. While the current implementation is
 integrated with cgroups, the concept is designed to be extensible for other
 subsystems in the future.
 
+Use case
+---------
+
+Users can perform selective swapping by choosing a swap tier assigned according
+to speed within a cgroup.
+
+For more information on cgroup v2, please refer to
+``Documentation/admin-guide/cgroup-v2.rst``.
+
 Priority Range
 --------------
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1f46a0016fc..d53826c68562 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -283,6 +283,11 @@ struct mem_cgroup {
 	struct lru_gen_mm_list mm_list;
 #endif
 
+#ifdef CONFIG_SWAP
+	int tier_mask;
+	int tier_effective_mask;
+#endif
+
 #ifdef CONFIG_MEMCG_V1
 	/* Legacy consumer-oriented counters */
 	struct page_counter kmem;		/* v1 only */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56cd4af08232..63259576792a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -68,6 +68,7 @@
 #include <net/ip.h>
 #include "slab.h"
 #include "memcontrol-v1.h"
+#include "swap_tier.h"
 
 #include <linux/uaccess.h>
 
@@ -4244,6 +4245,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
 
+	swap_tiers_memcg_inherit_mask(memcg);
+
 	/*
 	 * Ensure mem_cgroup_from_private_id() works once we're fully online.
 	 *
@@ -5785,6 +5788,64 @@ static int swap_events_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int swap_tier_max_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	swap_tiers_mask_show(m, memcg);
+	return 0;
+}
+
+static ssize_t swap_tier_max_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	char *pos, *name, *val;
+	bool enable;
+	int mask;
+	int ret = 0;
+
+	pos = strstrip(buf);
+	name = strsep(&pos, " \t\n");
+	if (!name || !*name)
+		return -EINVAL;
+	if (pos)
+		pos = skip_spaces(pos);
+	val = strsep(&pos, " \t\n");
+	if (!val || !*val)
+		return -EINVAL;
+	if (pos && *skip_spaces(pos))
+		return -EINVAL;
+
+	if (!strcmp(val, "max"))
+		enable = true;
+	else if (!strcmp(val, "0"))
+		enable = false;
+	else
+		return -EINVAL;
+
+	spin_lock(&swap_tier_lock);
+	mask = swap_tiers_mask_lookup(name);
+	if (!mask) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * tier_mask is set per memcg here; the effective mask is clamped
+	 * to the parent's in swap_tiers_memcg_sync_mask().
+	 */
+	if (enable)
+		WRITE_ONCE(memcg->tier_mask, memcg->tier_mask | mask);
+	else
+		WRITE_ONCE(memcg->tier_mask, memcg->tier_mask & ~mask);
+
+	swap_tiers_memcg_sync_mask(memcg);
+out:
+	spin_unlock(&swap_tier_lock);
+	return ret ? ret : nbytes;
+}
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -5817,6 +5878,12 @@ static struct cftype swap_files[] = {
 		.file_offset = offsetof(struct mem_cgroup, swap_events_file),
 		.seq_show = swap_events_show,
 	},
+	{
+		.name = "swap.tiers.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_tier_max_show,
+		.write = swap_tier_max_write,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2f382d4dcbdc..712b225509cc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -1021,6 +1021,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 	char *p, *token, *name, *tmp;
 	int ret = 0;
 	short prio;
+	int mask = 0;
 
 	tmp = kstrdup(buf, GFP_KERNEL);
 	if (!tmp)
@@ -1053,7 +1054,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 				goto restore;
 			break;
 		case '-':
-			ret = swap_tiers_remove(token + 1);
+			ret = swap_tiers_remove(token + 1, &mask);
 			if (ret)
 				goto restore;
 			break;
@@ -1063,7 +1064,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 		}
 	}
 
-	if (!swap_tiers_update()) {
+	if (!swap_tiers_update(mask)) {
 		ret = -EINVAL;
 		goto restore;
 	}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index 6b57cadb3e95..d6f0e1588dd8 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -253,7 +253,7 @@ int swap_tiers_add(const char *name, int prio)
 	return ret;
 }
 
-int swap_tiers_remove(const char *name)
+int swap_tiers_remove(const char *name, int *mask)
 {
 	int ret = 0;
 	struct swap_tier *tier;
@@ -276,6 +276,7 @@ int swap_tiers_remove(const char *name)
 		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
 
 	swap_tier_inactivate(tier);
+	*mask |= TIER_MASK(tier);
 
 	return ret;
 }
@@ -344,7 +345,26 @@ void swap_tiers_assign_dev(struct swap_info_struct *swp)
 	swp->tier_mask = TIER_DEFAULT_MASK;
 }
 
-bool swap_tiers_update(void)
+#ifdef CONFIG_MEMCG
+static void swap_tier_memcg_propagate(int mask)
+{
+	struct mem_cgroup *child;
+
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(child, root_mem_cgroup) {
+		WRITE_ONCE(child->tier_mask, child->tier_mask | mask);
+		WRITE_ONCE(child->tier_effective_mask,
+			   child->tier_effective_mask | mask);
+	}
+	rcu_read_unlock();
+}
+#else
+static void swap_tier_memcg_propagate(int mask)
+{
+}
+#endif
+
+bool swap_tiers_update(int mask)
 {
 	struct swap_tier *tier;
 	struct swap_info_struct *swp;
@@ -375,5 +395,89 @@ bool swap_tiers_update(void)
 		swap_tiers_assign_dev(swp);
 	}
 
+	/*
+	 * When a tier is removed, its index (bit position in the mask) becomes
+	 * free for reassignment to a future tier. If a memcg had previously
+	 * disabled this tier (cleared the bit in its swap.tiers.max file), the
+	 * effective mask would keep that bit clear -- meaning the new tier at
+	 * the same index would be silently unavailable, an invisible cgroup
+	 * constraint left behind by a tier that no longer exists.
+	 *
+	 * To prevent this, OR the removed tier's mask bit into every memcg's
+	 * tier_mask and tier_effective_mask. This resets the bit so the new
+	 * tier is accessible by default; users who want to restrict it must
+	 * explicitly disable it after the tier is re-created.
+	 */
+	if (mask)
+		swap_tier_memcg_propagate(mask);
+
 	return true;
 }
+
+#ifdef CONFIG_MEMCG
+void swap_tiers_mask_show(struct seq_file *m, struct mem_cgroup *memcg)
+{
+	struct swap_tier *tier;
+	int mask;
+
+	spin_lock(&swap_tier_lock);
+	mask = READ_ONCE(memcg->tier_mask);
+
+	for_each_active_tier(tier)
+		seq_printf(m, "%s %s\n", tier->name,
+			   (mask & TIER_MASK(tier)) ? "max" : "0");
+	spin_unlock(&swap_tier_lock);
+}
+
+int swap_tiers_mask_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		if (!strcmp(name, tier->name))
+			return TIER_MASK(tier);
+	}
+
+	return 0;
+}
+
+static void __swap_tier_memcg_inherit_mask(struct mem_cgroup *memcg,
+	struct mem_cgroup *parent)
+{
+	int parent_mask = parent
+		? READ_ONCE(parent->tier_effective_mask)
+		: TIER_ALL_MASK;
+
+	WRITE_ONCE(memcg->tier_effective_mask,
+		   parent_mask & READ_ONCE(memcg->tier_mask));
+}
+
+/* Computes the initial effective mask from the parent's effective mask. */
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg)
+{
+	spin_lock(&swap_tier_lock);
+	rcu_read_lock();
+	memcg->tier_mask = TIER_ALL_MASK;
+	__swap_tier_memcg_inherit_mask(memcg, parent_mem_cgroup(memcg));
+	rcu_read_unlock();
+	spin_unlock(&swap_tier_lock);
+}
+
+/*
+ * Called when a memcg's tier_mask is modified. Walks the subtree
+ * and recomputes each descendant's effective mask against its parent.
+ */
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *child;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(child, memcg)
+		__swap_tier_memcg_inherit_mask(child, parent_mem_cgroup(child));
+	rcu_read_unlock();
+}
+#endif
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index 3e355f857363..e2f0cf32035b 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -10,22 +10,67 @@ struct swap_info_struct;
 
 extern spinlock_t swap_tier_lock;
 
-#define TIER_ALL_MASK		(~0)
-#define TIER_DEFAULT_IDX	(31)
-#define TIER_DEFAULT_MASK	(1U << TIER_DEFAULT_IDX)
-
 /* Initialization and application */
 void swap_tiers_init(void);
 ssize_t swap_tiers_sysfs_show(char *buf);
 
 int swap_tiers_add(const char *name, int prio);
-int swap_tiers_remove(const char *name);
+int swap_tiers_remove(const char *name, int *mask);
 
 void swap_tiers_snapshot(void);
 void swap_tiers_snapshot_restore(void);
-bool swap_tiers_update(void);
+bool swap_tiers_update(int mask);
 
 /* Tier assignment */
 void swap_tiers_assign_dev(struct swap_info_struct *swp);
 
+#define TIER_ALL_MASK		(~0)
+#define TIER_DEFAULT_IDX	(31)
+#define TIER_DEFAULT_MASK	(1U << TIER_DEFAULT_IDX)
+
+#if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG)
+/* Memcg related functions */
+void swap_tiers_mask_show(struct seq_file *m, struct mem_cgroup *memcg);
+void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg);
+void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg);
+int swap_tiers_mask_lookup(const char *name);
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+	struct mem_cgroup *memcg;
+	int mask = TIER_ALL_MASK;
+
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	if (memcg)
+		mask = READ_ONCE(memcg->tier_effective_mask);
+	rcu_read_unlock();
+
+	return mask;
+}
+#else
+static inline void swap_tiers_mask_show(struct seq_file *m,
+	struct mem_cgroup *memcg) {}
+static inline void swap_tiers_memcg_inherit_mask(struct mem_cgroup *memcg) {}
+static inline void swap_tiers_memcg_sync_mask(struct mem_cgroup *memcg) {}
+static inline int swap_tiers_mask_lookup(const char *name)
+{
+	return 0;
+}
+static inline int folio_tier_effective_mask(struct folio *folio)
+{
+	return TIER_ALL_MASK;
+}
+#endif
+
+/**
+ * swap_tiers_mask_test - Check if the tier mask is valid
+ * @tier_mask: The tier mask to check
+ * @mask: The mask to compare against
+ *
+ * Return: true if condition matches, false otherwise
+ */
+static inline bool swap_tiers_mask_test(int tier_mask, int mask)
+{
+	return tier_mask & mask;
+}
 #endif /* _SWAP_TIER_H */
-- 
2.34.1


^ permalink raw reply related

* [PATCH v8 2/4] mm: swap: associate swap devices with tiers
From: Youngjun Park @ 2026-06-17  5:34 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, youngjun.park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, yosry, gunho.lee,
	taejoon.song, hyungjun.cho, mkoutny, baver.bae, matia.kim
In-Reply-To: <20260617053447.2831896-1-youngjun.park@lge.com>

This patch connects swap devices to the swap tier infrastructure,
ensuring that devices are correctly assigned to tiers based on their
priority.

A `tier_mask` is added to identify the tier membership of swap devices.
Although tier-based allocation logic is not yet implemented, this
mapping is necessary to track which tier a device belongs to. Upon
activation, the device is assigned to a tier by matching its priority
against the configured tier ranges.
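
The priority matching can be pictured with the following userspace
sketch. It assumes tiers are kept in an array sorted by descending start
priority, so a tier's end priority is one below the start of the tier
before it (or SHRT_MAX for the first). struct demo_tier, demo_tier_end()
and demo_assign() are illustrative names, not the kernel's.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical tier record; tiers[] is sorted by descending prio. */
struct demo_tier {
	const char *name;
	short prio;		/* start priority of the tier */
};

/* End priority: SHRT_MAX for the top tier, else next-higher start - 1. */
static short demo_tier_end(const struct demo_tier *tiers, size_t i)
{
	return i == 0 ? SHRT_MAX : (short)(tiers[i - 1].prio - 1);
}

/*
 * Return the index of the tier whose range contains prio, or -1 when no
 * tier covers it (the device then stays in the default state).
 */
static int demo_assign(const struct demo_tier *tiers, size_t n, short prio)
{
	for (size_t i = 0; i < n; i++) {
		if (tiers[i].prio <= prio && prio <= demo_tier_end(tiers, i))
			return (int)i;
	}
	return -1;
}
```

For tiers SSD:100, HDD:50, NET:-1, a device activated at priority 60
lands in HDD under this scheme.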

The infrastructure allows dynamic modification of tiers, such as
splitting or merging ranges. These operations are permitted provided
that the tier assignment of already configured swap devices remains
unchanged.
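
A simplified model of that rule, for illustration only: a new tier start
is refused if it falls inside the range of a tier that holds an active
device and the device's priority is at or above the new start, since the
device would then move to the new tier. struct demo_dev and
demo_can_split() are hypothetical names; the tier range is stored on the
device here purely to keep the sketch self-contained.

```c
#include <errno.h>

/* Hypothetical view of an active swap device and the tier it sits in. */
struct demo_dev {
	short prio;		/* swap priority of the device */
	short tier_start;	/* start priority of its current tier */
	short tier_end;		/* end priority of its current tier */
};

/*
 * Return 0 if a tier starting at new_prio may be added, -EBUSY if it
 * would pull an active device out of its current tier.
 */
static int demo_can_split(const struct demo_dev *devs, int n, short new_prio)
{
	for (int i = 0; i < n; i++) {
		const struct demo_dev *d = &devs[i];

		/* new_prio does not split this device's tier */
		if (new_prio < d->tier_start || new_prio > d->tier_end)
			continue;

		/* device would end up in the newly created upper tier */
		if (d->prio >= new_prio)
			return -EBUSY;
	}
	return 0;
}
```

With one device at priority 60 in a tier spanning 50..32767, a new tier
at 60 is refused while a new tier at 100 is accepted.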

This patch also adds the documentation for the swap tier feature,
covering the core concepts, sysfs interface usage, and configuration
details.

Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>

diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index 7aa2a8886908..a0d1447c5569 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -21,6 +21,7 @@ see the :doc:`admin guide <../admin-guide/mm/index>`.
    page_reclaim
    swap
    swap-table
+   swap-tier
    page_cache
    shmfs
    oom
diff --git a/Documentation/mm/swap-tier.rst b/Documentation/mm/swap-tier.rst
new file mode 100644
index 000000000000..0fb4a1153a67
--- /dev/null
+++ b/Documentation/mm/swap-tier.rst
@@ -0,0 +1,150 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li <chrisl@kernel.org> Youngjun Park <youngjun.park@lge.com>
+
+==========
+Swap Tier
+==========
+
+Swap tier is a collection of user-named groups classified by priority ranges.
+It acts as a facilitation layer, allowing users to manage swap devices based
+on their speeds.
+
+Users are encouraged to assign swap device priorities according to device
+speed to fully utilize this feature. While the current implementation is
+integrated with cgroups, the concept is designed to be extensible for other
+subsystems in the future.
+
+Priority Range
+--------------
+
+The specified tiers must cover the entire priority range from -1
+(DEF_SWAP_PRIO) to SHRT_MAX.
+
+Consistency
+-----------
+
+Tier consistency is guaranteed with a focus on maximizing flexibility. When a
+swap device is activated within a tier range, the tier covering that device's
+priority is guaranteed not to disappear or change while the device remains
+active. Adding a new tier may split the range of an existing tier, but the
+active device's tier assignment remains unchanged.
+
+However, specifying a tier in a cgroup does not guarantee the tier's existence.
+Consequently, the corresponding tier can disappear at any time.
+
+Configuration Interface
+-----------------------
+
+The swap tiers can be configured via the following interface:
+
+/sys/kernel/mm/swap/tiers
+
+Operations can be performed using the following syntax:
+
+* Add:    ``+"<tiername>":"<start_priority>"``
+* Remove: ``-"<tiername>"``
+
+Tier names must consist of alphanumeric characters and underscores. Multiple
+operations can be provided in a single write, separated by commas (",") or
+whitespace (spaces, tabs, newlines).
+
+When configuring tiers, the specified value represents the **start priority**
+of that tier. The end priority is automatically determined by the start
+priority of the next higher tier. Consequently, adding a tier
+automatically adjusts the ranges of adjacent tiers to ensure continuity.
+
+Examples
+--------
+
+**1. Initialization**
+
+A tier starting at -1 is mandatory to cover the entire priority range up to
+SHRT_MAX. In this example, 'HDD' starts at 50, and 'NET' covers the remaining
+lower range starting from -1.
+
+::
+
+    # echo "+HDD:50, +NET:-1" > /sys/kernel/mm/swap/tiers
+    # cat /sys/kernel/mm/swap/tiers
+    Name             Idx   PrioStart   PrioEnd
+    HDD              0     50          32767
+    NET              1     -1          49
+
+**2. Adding a New Tier (split)**
+
+A new tier 'SSD' is added at priority 100, splitting the existing 'HDD' tier.
+The ranges are automatically recalculated:
+
+* 'SSD' takes the top range (100 to SHRT_MAX).
+* 'HDD' is adjusted to the range between 'NET' and 'SSD' (50 to 99).
+* 'NET' remains unchanged (-1 to 49).
+
+::
+
+    # echo "+SSD:100" > /sys/kernel/mm/swap/tiers
+    # cat /sys/kernel/mm/swap/tiers
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              0     50          99
+    NET              1     -1          49
+
+**3. Removal (merge)**
+
+Tiers can be removed using the '-' prefix.
+::
+
+    # echo "-SSD" > /sys/kernel/mm/swap/tiers
+
+When a tier is removed, its priority range is merged into the adjacent
+tier. The merge direction is always upward (the tier below expands),
+except when the lowest tier is removed — in that case the tier above
+shifts its starting priority down to -1 to maintain full range coverage.
+
+::
+
+    Initial state:
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              1     50          99
+    NET              0     -1          49
+
+    # echo "-SSD" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              1     50          32767       <- merged with SSD's range
+    NET              0     -1          49
+
+    # echo "-NET" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              1     -1          32767       <- shifted down to -1
+
+**4. Interaction with Active Swap Devices**
+
+If a swap device is active (swapon), the tier covering that device's
+priority cannot be removed. Splitting the active tier's range is only
+allowed above the device's priority.
+
+Assume a swap device is active at priority 60 (inside 'HDD' tier).
+
+::
+
+    # swapon -p 60 /dev/zram0
+
+    Name             Idx   PrioStart   PrioEnd
+    HDD              0     50          32767
+    NET              1     -1          49
+
+    # echo "-HDD" > /sys/kernel/mm/swap/tiers
+    -bash: echo: write error: Device or resource busy
+
+    # echo "+SSD:60" > /sys/kernel/mm/swap/tiers
+    -bash: echo: write error: Device or resource busy
+
+    # echo "+SSD:100" > /sys/kernel/mm/swap/tiers
+
+    Name             Idx   PrioStart   PrioEnd
+    SSD              2     100         32767
+    HDD              0     50          99          <- device (prio 60) stays here
+    NET              1     -1          49
diff --git a/MAINTAINERS b/MAINTAINERS
index d1bb3b4b1e1c..4293048be1ab 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17052,6 +17052,7 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	Documentation/ABI/testing/sysfs-kernel-mm-swap
 F:	Documentation/mm/swap-table.rst
+F:	Documentation/mm/swap-tier.rst
 F:	include/linux/swap.h
 F:	include/linux/swapfile.h
 F:	include/linux/swapops.h
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6d72778e6cc3..21286945770a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,6 +250,7 @@ struct swap_info_struct {
 	struct percpu_ref users;	/* indicate and keep swap device valid. */
 	unsigned long	flags;		/* SWP_USED etc: see above */
 	signed short	prio;		/* swap priority of this type */
+	int tier_mask;			/* swap tier mask */
 	struct plist_node list;		/* entry in swap_active_head */
 	signed char	type;		/* strange name for an index */
 	unsigned int	max;		/* size of this swap device */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 762d9ca6ad5a..2f382d4dcbdc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -1063,7 +1063,7 @@ static ssize_t tiers_store(struct kobject *kobj,
 		}
 	}
 
-	if (!swap_tiers_validate()) {
+	if (!swap_tiers_update()) {
 		ret = -EINVAL;
 		goto restore;
 	}
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
index ac7a3c2a48cb..6b57cadb3e95 100644
--- a/mm/swap_tier.c
+++ b/mm/swap_tier.c
@@ -38,6 +38,8 @@ static LIST_HEAD(swap_tier_inactive_list);
 	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
 	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
 
+#define MASK_TO_TIER(mask) (&swap_tiers[__ffs((mask))])
+
 #define for_each_tier(tier, idx) \
 	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
 		idx++, tier = &swap_tiers[idx])
@@ -59,6 +61,26 @@ static bool swap_tier_is_active(void)
 	return !list_empty(&swap_tier_active_list);
 }
 
+static bool swap_tier_prio_in_range(struct swap_tier *tier, short prio)
+{
+	if (tier->prio <= prio && TIER_END_PRIO(tier) >= prio)
+		return true;
+
+	return false;
+}
+
+static bool swap_tier_prio_is_used(short prio)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier->prio == prio)
+			return true;
+	}
+
+	return false;
+}
+
 static struct swap_tier *swap_tier_lookup(const char *name)
 {
 	struct swap_tier *tier;
@@ -99,6 +121,7 @@ void swap_tiers_init(void)
 	int idx;
 
 	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+	BUILD_BUG_ON(MAX_SWAPTIER > TIER_DEFAULT_IDX);
 
 	for_each_tier(tier, idx) {
 		INIT_LIST_HEAD(&tier->list);
@@ -149,17 +172,29 @@ static struct swap_tier *swap_tier_prepare(const char *name, short prio)
 	return tier;
 }
 
-static int swap_tier_check_range(short prio)
+static int swap_tier_can_split_range(short new_prio)
 {
+	struct swap_info_struct *p;
 	struct swap_tier *tier;
 
 	lockdep_assert_held(&swap_lock);
 	lockdep_assert_held(&swap_tier_lock);
 
-	for_each_active_tier(tier) {
-		/* No overwrite */
-		if (tier->prio == prio)
-			return -EINVAL;
+	plist_for_each_entry(p, &swap_active_head, list) {
+		if (p->tier_mask == TIER_DEFAULT_MASK)
+			continue;
+
+		tier = MASK_TO_TIER(p->tier_mask);
+		if (!swap_tier_prio_in_range(tier, new_prio))
+			continue;
+
+		/*
+		 * Device sits in a tier that spans new_prio;
+		 * splitting here would reassign it to a
+		 * different tier.
+		 */
+		if (p->prio >= new_prio)
+			return -EBUSY;
 	}
 
 	return 0;
@@ -199,7 +234,11 @@ int swap_tiers_add(const char *name, int prio)
 	if (!swap_tier_validate_name(name))
 		return -EINVAL;
 
-	ret = swap_tier_check_range(prio);
+	/* No overwrite */
+	if (swap_tier_prio_is_used(prio))
+		return -EBUSY;
+
+	ret = swap_tier_can_split_range(prio);
 	if (ret)
 		return ret;
 
@@ -226,6 +265,11 @@ int swap_tiers_remove(const char *name)
 	if (!tier)
 		return -EINVAL;
 
+	/* Simulate adding a tier to check for conflicts */
+	ret = swap_tier_can_split_range(tier->prio);
+	if (ret)
+		return ret;
+
 	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
 	if (!list_is_singular(&swap_tier_active_list)
 		&& tier->prio == DEF_SWAP_PRIO)
@@ -236,13 +280,15 @@ int swap_tiers_remove(const char *name)
 	return ret;
 }
 
-static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
 /*
- * XXX: When multiple operations (adds and removes) are submitted in a
- * single write, reverting each individually on failure is complex and
- * error-prone. Instead, snapshot the entire state beforehand and
- * restore it wholesale if any operation fails.
+ * XXX: Static global snapshot buffer for batch operations. Small
+ * and used once per write, so a static global is not bad.
+ * When multiple adds/removes are submitted in a single write,
+ * reverting each individually on failure is error-prone. Instead,
+ * snapshot beforehand and restore wholesale if any operation fails.
  */
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+
 void swap_tiers_snapshot(void)
 {
 	BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
@@ -282,10 +328,30 @@ void swap_tiers_snapshot_restore(void)
 	}
 }
 
-bool swap_tiers_validate(void)
+void swap_tiers_assign_dev(struct swap_info_struct *swp)
 {
 	struct swap_tier *tier;
 
+	lockdep_assert_held(&swap_lock);
+
+	for_each_active_tier(tier) {
+		if (swap_tier_prio_in_range(tier, swp->prio)) {
+			swp->tier_mask = TIER_MASK(tier);
+			return;
+		}
+	}
+
+	swp->tier_mask = TIER_DEFAULT_MASK;
+}
+
+bool swap_tiers_update(void)
+{
+	struct swap_tier *tier;
+	struct swap_info_struct *swp;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
 	/*
 	 * Initial setting might not cover DEF_SWAP_PRIO.
 	 * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
@@ -298,5 +364,16 @@ bool swap_tiers_validate(void)
 			return false;
 	}
 
+	/*
+	 * If applied initially, the swap tier_mask may change
+	 * from the default value.
+	 */
+	plist_for_each_entry(swp, &swap_active_head, list) {
+		/* Tier is already configured */
+		if (swp->tier_mask != TIER_DEFAULT_MASK)
+			break;
+		swap_tiers_assign_dev(swp);
+	}
+
 	return true;
 }
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
index a1395ec02c24..3e355f857363 100644
--- a/mm/swap_tier.h
+++ b/mm/swap_tier.h
@@ -5,8 +5,15 @@
 #include <linux/types.h>
 #include <linux/spinlock.h>
 
+/* Forward declarations */
+struct swap_info_struct;
+
 extern spinlock_t swap_tier_lock;
 
+#define TIER_ALL_MASK		(~0)
+#define TIER_DEFAULT_IDX	(31)
+#define TIER_DEFAULT_MASK	(1U << TIER_DEFAULT_IDX)
+
 /* Initialization and application */
 void swap_tiers_init(void);
 ssize_t swap_tiers_sysfs_show(char *buf);
@@ -16,5 +23,9 @@ int swap_tiers_remove(const char *name);
 
 void swap_tiers_snapshot(void);
 void swap_tiers_snapshot_restore(void);
-bool swap_tiers_validate(void);
+bool swap_tiers_update(void);
+
+/* Tier assignment */
+void swap_tiers_assign_dev(struct swap_info_struct *swp);
+
 #endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3f7225dbc6cd..9a86ebe992f4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3036,6 +3036,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 
 	/* Add back to available list */
 	add_to_avail_list(si, true);
+
+	swap_tiers_assign_dev(si);
 }
 
 /*
-- 
2.34.1



* [PATCH v8 1/4] mm: swap: introduce swap tier infrastructure
From: Youngjun Park @ 2026-06-17  5:34 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, youngjun.park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, yosry, gunho.lee,
	taejoon.song, hyungjun.cho, mkoutny, baver.bae, matia.kim
In-Reply-To: <20260617053447.2831896-1-youngjun.park@lge.com>

This patch introduces the "Swap tier" concept, which serves as an
abstraction layer for managing swap devices based on their performance
characteristics (e.g., NVMe, HDD, Network swap).

Swap tiers are user-named groups representing priority ranges.
Tier names must consist of alphanumeric characters and underscores.
These tiers collectively cover the entire priority space from -1
(`DEF_SWAP_PRIO`) to `SHRT_MAX`.

To configure tiers, a new sysfs interface is exposed at
/sys/kernel/mm/swap/tiers. The input parser evaluates commands from
left to right and supports batch input, allowing users to add or remove
multiple tiers in a single write operation.
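
As an illustration of the batch format, here is a minimal userspace
sketch of left-to-right parsing. It is not the kernel code: the names
struct tier_cmd and parse_tier_batch() are hypothetical, and it uses
strspn()/strcspn() where tiers_store() in this patch uses strsep(). The
token shapes ("+name:prio" to add, "-name" to remove, separated by
commas or whitespace) follow the parser in the diff below.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model of one parsed command. */
struct tier_cmd {
	char op;        /* '+' (add) or '-' (remove) */
	char name[16];  /* tier name, NUL-terminated */
	short prio;     /* start priority; only meaningful for '+' */
};

/*
 * Parse a batch such as "+ssd:100,+hdd:-1,-old" from left to right.
 * Returns the number of commands stored in out[], or -1 on a malformed
 * token or when more than max commands are given. The input buffer is
 * modified in place.
 */
int parse_tier_batch(char *input, struct tier_cmd *out, int max)
{
	static const char sep[] = ", \t\n";
	char *p = input;
	int n = 0;

	while (*p) {
		char *token, *colon, *end;
		long prio = 0;

		p += strspn(p, sep);            /* skip separators */
		if (!*p)
			break;
		token = p;
		p += strcspn(p, sep);           /* find end of token */
		if (*p)
			*p++ = '\0';

		if (n == max)
			return -1;

		if (token[0] == '+') {
			colon = strchr(token + 1, ':');
			if (!colon)
				return -1;      /* "+name" without ":prio" */
			*colon++ = '\0';
			prio = strtol(colon, &end, 10);
			if (end == colon || *end != '\0' ||
			    prio < SHRT_MIN || prio > SHRT_MAX)
				return -1;
		} else if (token[0] != '-') {
			return -1;              /* unknown operation */
		}

		if (token[1] == '\0' ||
		    strlen(token + 1) >= sizeof(out[n].name))
			return -1;              /* empty or overlong name */

		out[n].op = token[0];
		strcpy(out[n].name, token + 1);
		out[n].prio = (short)prio;
		n++;
	}
	return n;
}
```

With the input "+ssd:100, +hdd:-1\n-old" this sketch yields three
commands in order: add ssd at 100, add hdd at -1, remove old. Unlike the
kernel path, it does not snapshot or roll back any state on error.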

Tier management enforces continuous priority ranges anchored by start
priorities. Operations trigger range splitting or merging, but overwriting
start priorities is forbidden. Merging expands lower tiers upwards to
preserve configured start priorities, except when removing `DEF_SWAP_PRIO`,
which merges downwards.
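
The range rules above can be sketched with a flat array in place of the
kernel's linked list. This is a simplified model under stated
assumptions, not the patch's implementation: the helper names are
hypothetical, and start priorities are assumed to be sorted in
descending order, as the active list is in the patch. The end of each
range is derived the way TIER_END_PRIO() does it: one below the start of
the next higher tier, or SHRT_MAX for the highest tier.

```c
#include <limits.h>
#include <stddef.h>

#define DEF_SWAP_PRIO (-1)

/* End priority of tier idx; starts[] is sorted in descending order. */
static short tier_end_prio(const short *starts, size_t idx)
{
	return idx == 0 ? SHRT_MAX : (short)(starts[idx - 1] - 1);
}

/* Return the index of the tier whose range contains prio, or -1. */
int tier_lookup(const short *starts, size_t n, short prio)
{
	for (size_t i = 0; i < n; i++) {
		if (prio >= starts[i] && prio <= tier_end_prio(starts, i))
			return (int)i;
	}
	return -1;
}

/*
 * A non-empty configuration is complete only if the lowest tier
 * starts at DEF_SWAP_PRIO; an empty configuration is also accepted.
 */
int tiers_cover_full_range(const short *starts, size_t n)
{
	return n == 0 || starts[n - 1] == DEF_SWAP_PRIO;
}
```

With starts = { 100, 10, -1 }, priority 50 falls in the middle tier
(10 to 99). Dropping the middle entry leaves { 100, -1 }, and priority
50 now falls in the lowest tier, whose range has expanded upwards to 99
while its start priority is unchanged.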

Suggested-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>

diff --git a/MAINTAINERS b/MAINTAINERS
index 65bd4328fe05..d1bb3b4b1e1c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17060,6 +17060,8 @@ F:	mm/swap.c
 F:	mm/swap.h
 F:	mm/swap_table.h
 F:	mm/swap_state.c
+F:	mm/swap_tier.c
+F:	mm/swap_tier.h
 F:	mm/swapfile.c
 
 MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
diff --git a/mm/Kconfig b/mm/Kconfig
index 776b67c66e82..5343937f3da9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,6 +19,18 @@ menuconfig SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config NR_SWAP_TIERS
+        int "Number of swap device tiers"
+        depends on SWAP
+        default 4
+        range 1 31
+        help
+          Sets the number of swap device tiers. Swap devices are
+          grouped into tiers based on their priority, allowing the
+          system to prefer faster devices over slower ones.
+
+          If unsure, say 4.
+
 config ZSWAP
 	bool "Compressed cache for swap pages"
 	depends on SWAP
diff --git a/mm/Makefile b/mm/Makefile
index eff9f9e7e061..29cb1e778285 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,7 +75,7 @@ ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
 endif
 
-obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o swap_tier.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
diff --git a/mm/swap.h b/mm/swap.h
index 77d2d14eda42..d6c5f5d31f63 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -34,6 +34,10 @@ extern int page_cluster;
 #define swap_entry_order(order)	0
 #endif
 
+#define DEF_SWAP_PRIO  -1
+
+extern spinlock_t swap_lock;
+extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
 
 /*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 9c3a5cf99778..762d9ca6ad5a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -25,6 +25,7 @@
 #include "internal.h"
 #include "swap_table.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -1007,8 +1008,81 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
 }
 static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
 
+static ssize_t tiers_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return swap_tiers_sysfs_show(buf);
+}
+
+static ssize_t tiers_store(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *p, *token, *name, *tmp;
+	int ret = 0;
+	short prio;
+
+	tmp = kstrdup(buf, GFP_KERNEL);
+	if (!tmp)
+		return -ENOMEM;
+
+	spin_lock(&swap_lock);
+	spin_lock(&swap_tier_lock);
+	swap_tiers_snapshot();
+
+	p = tmp;
+	while ((token = strsep(&p, ", \t\n")) != NULL) {
+		if (!*token)
+			continue;
+
+		switch (token[0]) {
+		case '+':
+			name = token + 1;
+			token = strchr(name, ':');
+			if (!token) {
+				ret = -EINVAL;
+				goto restore;
+			}
+			*token++ = '\0';
+			if (kstrtos16(token, 10, &prio)) {
+				ret = -EINVAL;
+				goto restore;
+			}
+			ret = swap_tiers_add(name, prio);
+			if (ret)
+				goto restore;
+			break;
+		case '-':
+			ret = swap_tiers_remove(token + 1);
+			if (ret)
+				goto restore;
+			break;
+		default:
+			ret = -EINVAL;
+			goto restore;
+		}
+	}
+
+	if (!swap_tiers_validate()) {
+		ret = -EINVAL;
+		goto restore;
+	}
+	goto out;
+
+restore:
+	swap_tiers_snapshot_restore();
+out:
+	spin_unlock(&swap_tier_lock);
+	spin_unlock(&swap_lock);
+	kfree(tmp);
+	return ret ? ret : count;
+}
+
+static struct kobj_attribute tier_attr = __ATTR_RW(tiers);
+
 static struct attribute *swap_attrs[] = {
 	&vma_ra_enabled_attr.attr,
+	&tier_attr.attr,
 	NULL,
 };
 
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
new file mode 100644
index 000000000000..ac7a3c2a48cb
--- /dev/null
+++ b/mm/swap_tier.c
@@ -0,0 +1,302 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/swap.h>
+#include <linux/memcontrol.h>
+#include "memcontrol-v1.h"
+#include <linux/sysfs.h>
+#include <linux/plist.h>
+
+#include "swap.h"
+#include "swap_tier.h"
+
+#define MAX_SWAPTIER	CONFIG_NR_SWAP_TIERS
+#define MAX_TIERNAME	16
+
+/*
+ * struct swap_tier - structure representing a swap tier.
+ *
+ * @name: name of the swap_tier.
+ * @prio: starting value of priority.
+ * @list: linked list of tiers.
+ */
+static struct swap_tier {
+	char name[MAX_TIERNAME];
+	short prio;
+	struct list_head list;
+} swap_tiers[MAX_SWAPTIER];
+
+DEFINE_SPINLOCK(swap_tier_lock);
+/* active swap priority list, sorted in descending order */
+static LIST_HEAD(swap_tier_active_list);
+/* unused swap_tier object */
+static LIST_HEAD(swap_tier_inactive_list);
+
+#define TIER_IDX(tier)	((tier) - swap_tiers)
+#define TIER_MASK(tier)	(1U << TIER_IDX(tier))
+#define TIER_INACTIVE_PRIO (DEF_SWAP_PRIO - 1)
+#define TIER_IS_ACTIVE(tier) ((tier->prio) !=  TIER_INACTIVE_PRIO)
+#define TIER_END_PRIO(tier) \
+	(!list_is_first(&(tier)->list, &swap_tier_active_list) ? \
+	list_prev_entry((tier), list)->prio - 1 : SHRT_MAX)
+
+#define for_each_tier(tier, idx) \
+	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
+		idx++, tier = &swap_tiers[idx])
+
+#define for_each_active_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_active_list, list)
+
+#define for_each_inactive_tier(tier) \
+	list_for_each_entry(tier, &swap_tier_inactive_list, list)
+
+/*
+ * Naming Convention:
+ *   swap_tiers_*() - Public/exported functions
+ *   swap_tier_*()  - Private/internal functions
+ */
+
+static bool swap_tier_is_active(void)
+{
+	return !list_empty(&swap_tier_active_list);
+}
+
+static struct swap_tier *swap_tier_lookup(const char *name)
+{
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (!strcmp(tier->name, name))
+			return tier;
+	}
+
+	return NULL;
+}
+
+/* Insert new tier into the active list sorted by priority. */
+static void swap_tier_activate(struct swap_tier *new)
+{
+	struct list_head *pos = &swap_tier_active_list;
+	struct swap_tier *tier;
+
+	for_each_active_tier(tier) {
+		if (tier->prio <= new->prio) {
+			pos = &tier->list;
+			break;
+		}
+	}
+
+	list_add_tail(&new->list, pos);
+}
+
+static void swap_tier_inactivate(struct swap_tier *tier)
+{
+	list_move(&tier->list, &swap_tier_inactive_list);
+	tier->prio = TIER_INACTIVE_PRIO;
+}
+
+void swap_tiers_init(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER);
+
+	for_each_tier(tier, idx) {
+		INIT_LIST_HEAD(&tier->list);
+		swap_tier_inactivate(tier);
+	}
+}
+
+ssize_t swap_tiers_sysfs_show(char *buf)
+{
+	struct swap_tier *tier;
+	ssize_t len = 0;
+
+	len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
+			 "Name", "Idx", "PrioStart", "PrioEnd");
+
+	spin_lock(&swap_tier_lock);
+	for_each_active_tier(tier) {
+		len += sysfs_emit_at(buf, len, "%-16s %-5td %-11d %-11d\n",
+				     tier->name,
+				     TIER_IDX(tier),
+				     tier->prio,
+				     TIER_END_PRIO(tier));
+	}
+	spin_unlock(&swap_tier_lock);
+
+	return len;
+}
+
+static struct swap_tier *swap_tier_prepare(const char *name, short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	if (prio < DEF_SWAP_PRIO)
+		return ERR_PTR(-EINVAL);
+
+	if (list_empty(&swap_tier_inactive_list))
+		return ERR_PTR(-ENOSPC);
+
+	tier = list_first_entry(&swap_tier_inactive_list,
+		struct swap_tier, list);
+
+	list_del_init(&tier->list);
+	strscpy(tier->name, name, MAX_TIERNAME);
+	tier->prio = prio;
+
+	return tier;
+}
+
+static int swap_tier_check_range(short prio)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		/* No overwrite */
+		if (tier->prio == prio)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static bool swap_tier_validate_name(const char *name)
+{
+	int len;
+
+	if (!name || !*name)
+		return false;
+
+	len = strlen(name);
+	if (len >= MAX_TIERNAME)
+		return false;
+
+	while (*name) {
+		if (!isalnum(*name) && *name != '_')
+			return false;
+		name++;
+	}
+	return true;
+}
+
+int swap_tiers_add(const char *name, int prio)
+{
+	int ret;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	/* Duplicate check */
+	if (swap_tier_lookup(name))
+		return -EEXIST;
+
+	if (!swap_tier_validate_name(name))
+		return -EINVAL;
+
+	ret = swap_tier_check_range(prio);
+	if (ret)
+		return ret;
+
+	tier = swap_tier_prepare(name, prio);
+	if (IS_ERR(tier)) {
+		ret = PTR_ERR(tier);
+		return ret;
+	}
+
+	swap_tier_activate(tier);
+
+	return ret;
+}
+
+int swap_tiers_remove(const char *name)
+{
+	int ret = 0;
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	tier = swap_tier_lookup(name);
+	if (!tier)
+		return -EINVAL;
+
+	/* Removing DEF_SWAP_PRIO merges into the higher tier. */
+	if (!list_is_singular(&swap_tier_active_list)
+		&& tier->prio == DEF_SWAP_PRIO)
+		list_prev_entry(tier, list)->prio = DEF_SWAP_PRIO;
+
+	swap_tier_inactivate(tier);
+
+	return ret;
+}
+
+static struct swap_tier swap_tiers_snap[MAX_SWAPTIER];
+/*
+ * XXX: When multiple operations (adds and removes) are submitted in a
+ * single write, reverting each individually on failure is complex and
+ * error-prone. Instead, snapshot the entire state beforehand and
+ * restore it wholesale if any operation fails.
+ */
+void swap_tiers_snapshot(void)
+{
+	BUILD_BUG_ON(sizeof(swap_tiers_snap) != sizeof(swap_tiers));
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	memcpy(swap_tiers_snap, swap_tiers, sizeof(swap_tiers));
+}
+
+void swap_tiers_snapshot_restore(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	memcpy(swap_tiers, swap_tiers_snap, sizeof(swap_tiers));
+
+	INIT_LIST_HEAD(&swap_tier_active_list);
+	INIT_LIST_HEAD(&swap_tier_inactive_list);
+
+	/*
+	 * memcpy copied snapshot-time list pointers into each tier's
+	 * list_head.  Those references are stale, so re-init every
+	 * tier before re-linking into the freshly initialised global
+	 * lists below.
+	 */
+	for_each_tier(tier, idx) {
+		INIT_LIST_HEAD(&tier->list);
+
+		if (TIER_IS_ACTIVE(tier))
+			swap_tier_activate(tier);
+		else
+			swap_tier_inactivate(tier);
+	}
+}
+
+bool swap_tiers_validate(void)
+{
+	struct swap_tier *tier;
+
+	/*
+	 * Initial setting might not cover DEF_SWAP_PRIO.
+	 * Swap tier must cover the full range (DEF_SWAP_PRIO to SHRT_MAX).
+	 */
+	if (swap_tier_is_active()) {
+		tier = list_last_entry(&swap_tier_active_list,
+			struct swap_tier, list);
+
+		if (tier->prio != DEF_SWAP_PRIO)
+			return false;
+	}
+
+	return true;
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
new file mode 100644
index 000000000000..a1395ec02c24
--- /dev/null
+++ b/mm/swap_tier.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SWAP_TIER_H
+#define _SWAP_TIER_H
+
+#include <linux/types.h>
+#include <linux/spinlock.h>
+
+extern spinlock_t swap_tier_lock;
+
+/* Initialization and application */
+void swap_tiers_init(void);
+ssize_t swap_tiers_sysfs_show(char *buf);
+
+int swap_tiers_add(const char *name, int prio);
+int swap_tiers_remove(const char *name);
+
+void swap_tiers_snapshot(void);
+void swap_tiers_snapshot_restore(void);
+bool swap_tiers_validate(void);
+#endif /* _SWAP_TIER_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3d126602a1e..3f7225dbc6cd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -48,6 +48,7 @@
 #include "swap_table.h"
 #include "internal.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
@@ -63,7 +64,8 @@ static void move_cluster(struct swap_info_struct *si,
  *
  * Also protects swap_active_head total_swap_pages, and the SWP_WRITEOK flag.
  */
-static DEFINE_SPINLOCK(swap_lock);
+DEFINE_SPINLOCK(swap_lock);
+
 static unsigned int nr_swapfiles;
 atomic_long_t nr_swap_pages;
 /*
@@ -74,7 +76,6 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-#define DEF_SWAP_PRIO  -1
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ -87,7 +88,7 @@ static const char Bad_offset[] = "Bad swap offset entry ";
  * all active swap_info_structs
  * protected with swap_lock, and ordered by priority.
  */
-static PLIST_HEAD(swap_active_head);
+PLIST_HEAD(swap_active_head);
 
 /*
  * all available (active, not full) swap_info_structs
@@ -3988,6 +3989,7 @@ static int __init swapfile_init(void)
 		swap_migration_ad_supported = true;
 #endif	/* CONFIG_MIGRATION */
 
+	swap_tiers_init();
 	return 0;
 }
 subsys_initcall(swapfile_init);
-- 
2.34.1



* [PATCH v8 4/4] mm: swap: filter swap allocation by memcg tier mask
From: Youngjun Park @ 2026-06-17  5:34 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, youngjun.park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, yosry, gunho.lee,
	taejoon.song, hyungjun.cho, mkoutny, baver.bae, matia.kim
In-Reply-To: <20260617053447.2831896-1-youngjun.park@lge.com>

Apply memcg tier effective mask during swap slot allocation to
enforce per-cgroup swap tier restrictions.

In the fast path, check the percpu cached swap_info's tier_mask
against the folio's effective mask. If it does not match, fall
through to the slow path. In the slow path, skip swap devices
whose tier_mask is not covered by the folio's effective mask.
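
A minimal userspace sketch of the slow-path filtering, for
illustration only. The definition of swap_tiers_mask_test() is not part
of this patch, so tier_mask_allowed() below is a hypothetical stand-in
that assumes "covered" means every bit of the device mask is set in the
allowed mask. struct swap_dev and pick_swap_dev() are likewise
hypothetical, and the masks are unsigned here although the patch keeps
the effective mask in an int.

```c
#include <stddef.h>

/* Hypothetical stand-in for the fields of swap_info_struct used here. */
struct swap_dev {
	int prio;
	unsigned int tier_mask;  /* bit identifying the device's tier */
};

/* Assumed semantics: the device mask must be a subset of allowed. */
static int tier_mask_allowed(unsigned int dev_mask, unsigned int allowed)
{
	return (dev_mask & allowed) == dev_mask;
}

/*
 * devs[] is sorted by descending priority. Return the first device the
 * caller may use, skipping devices in disallowed tiers, or NULL.
 */
const struct swap_dev *pick_swap_dev(const struct swap_dev *devs, size_t n,
				     unsigned int allowed)
{
	for (size_t i = 0; i < n; i++) {
		if (!tier_mask_allowed(devs[i].tier_mask, allowed))
			continue;
		return &devs[i];
	}
	return NULL;
}
```

A caller restricted to the second tier skips the higher-priority device
in the first tier; a caller with an empty mask gets no device at all.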

This works correctly when there is only one non-rotational
device in the system and no devices share the same priority.
However, there are known limitations:

 - When non-rotational devices are distributed across multiple
   tiers, and different memcgs are configured to use those
   distinct tiers, they may constantly overwrite the shared
   percpu swap cache. This cache thrashing leads to frequent
   fast path misses.

 - Combined with the above issue, if same-priority devices exist
   among them, a percpu cache miss (overwritten by another memcg)
   forces the allocator to round-robin to the next device
   prematurely, even if the current cluster is not fully
   exhausted.

These edge cases do not affect the primary use case of
directing swap traffic per cgroup. Further optimization is
planned for future work.

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9a86ebe992f4..1a2d29735b71 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1365,14 +1365,18 @@ static bool swap_alloc_fast(struct folio *folio)
 	struct swap_cluster_info *ci;
 	struct swap_info_struct *si;
 	unsigned int offset;
+	int mask = folio_tier_effective_mask(folio);
 
 	/*
 	 * Once allocated, swap_info_struct will never be completely freed,
 	 * so checking it's liveness by get_swap_device_info is enough.
 	 */
 	si = this_cpu_read(percpu_swap_cluster.si[order]);
+	if (!si || !swap_tiers_mask_test(si->tier_mask, mask))
+		return false;
+
 	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
-	if (!si || !offset || !get_swap_device_info(si))
+	if (!offset || !get_swap_device_info(si))
 		return false;
 
 	ci = swap_cluster_lock(si, offset);
@@ -1392,10 +1396,14 @@ static bool swap_alloc_fast(struct folio *folio)
 static void swap_alloc_slow(struct folio *folio)
 {
 	struct swap_info_struct *si, *next;
+	int mask = folio_tier_effective_mask(folio);
 
 	spin_lock(&swap_avail_lock);
 start_over:
 	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
+		if (!swap_tiers_mask_test(si->tier_mask, mask))
+			continue;
+
 		/* Rotate the device and switch to a new cluster */
 		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
-- 
2.34.1



* [PATCH v8 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
From: Youngjun Park @ 2026-06-17  5:34 UTC (permalink / raw)
  To: akpm
  Cc: chrisl, youngjun.park, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, yosry, gunho.lee,
	taejoon.song, hyungjun.cho, mkoutny, baver.bae, matia.kim

This is the v8 series of the swap tier patchset.

Many thanks to Shakeel Butt and Yosry for the reviews and discussions [1].
The main change in this version is the interface change to use
memory.swap.tiers.max with '0' (disable) and 'max' (enable) values.
This mechanism was suggested by Shakeel and Yosry.

This change allows for future extensions to control swap
between tiers and aligns better with existing memcg interfaces.
Even with this memcg interface change, only patch #3 needed updates.
Internally, patch #3 still uses the existing mask processing method
(which is implementation-efficient), so only the user-facing interface
was modified.

We also discussed tier extensions. Thanks to Yosry, Nhat and Shakeel for their
valuable feedback.

Here is a brief summary of our tentative conclusions. Please correct me
if anything is misrepresented (details in references):

* Zswap tiering [2]:
  Tiering applies only to the vswap + zswap combo. Zswap itself will
  not be tiered, as the current architecture requires a physical device
  for zswap allocation.
* Vswap tiering [3]:
  Vswap should be handled transparently to the user. Vswap itself will
  not be tiered, but this could be supported later if a strong, real
  use case emerges.
* Relationship with zswap.writeback [4]:
  If zswap tiering is introduced, it could replace the zswap-only tier.
  However, since zswap cannot be tiered independently, it is still
  needed for non-vswap cases. Separately, the internal logic could
  potentially be integrated into the tiering logic.
* Tier demotion [5]:
  A separate interface like memory.swap.tiers.demotion might be needed.
  For now, we only support 0/max to enable/disable tiers. In the future,
  we could introduce an "auto" mode to automatically scale the limit
  based on swapfile size and memory.swap.max, similar to the direction
  memory tiering is heading in.

I plan to apply the swap tier infrastructure and the first use case
(cgroup-based swap control) first, and continue following up on the
discussions above.

Overview
========

Swap Tiers group swap devices into performance classes (e.g. NVMe,
HDD, Network) and allow per-memcg selection of which tiers to use.
This mechanism was suggested by Chris Li.

Design Rationale
================

Swap tier selection is attached to memcg. A child cgroup may select a
subset of the parent's allowed tiers.

This approach:
- Preserves cgroup inheritance semantics (boundary at parent,
  refinement at child).
- Reuses memcg, which already groups processes and enforces
  hierarchical memory limits.
- Aligns with existing memcg swap controls (e.g. swap.max, zswap.writeback).
- Avoids introducing a parallel swap control hierarchy.
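
To illustrate the inheritance rule (boundary at parent, refinement at
child), here is a small sketch of how an effective mask could be
derived. It is an assumption about the semantics, not the memcg code
from patch #3, which is not shown in this mail; struct cg and
cg_effective_tier_mask() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical model of a cgroup with a configured tier mask. */
struct cg {
	const struct cg *parent;  /* NULL for the root */
	unsigned int tier_mask;   /* tiers this cgroup selects */
};

/*
 * The effective mask is the intersection of the cgroup's own selection
 * with every ancestor's selection, so a child can narrow but never
 * widen what its parent allows.
 */
unsigned int cg_effective_tier_mask(const struct cg *cg)
{
	unsigned int mask = ~0U;

	for (; cg; cg = cg->parent)
		mask &= cg->tier_mask;
	return mask;
}
```

If the parent allows tiers 0 and 1 (mask 0x3) and a child asks for
tiers 1 and 2 (mask 0x6), the child's effective mask is 0x2: tier 2 is
dropped because the parent does not allow it.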

Placing tier control outside memcg (e.g., via BPF, syscalls, or
madvise) would allow swap preference to diverge from the memcg
hierarchy. Integrating it into memcg keeps the swap policy
consistent with existing memory ownership semantics. There are
also real use cases built around memcg.

In the future, this can be extended to other interfaces to cover
additional use cases.

I believe a memcg-based swap control is a good starting point
before such extensions.

Use Cases
=========

#1: Latency separation (our primary deployment scenario)
  [ / ]
     |
     +-- latency-sensitive workload  (fast tier)
     +-- background workload         (slow tier)

The parent defines the memory boundary.
Each workload selects a swap tier via memory.swap.tiers.max according to
latency requirements.

This prevents latency-sensitive workloads from being swapped to
slow devices used by background workloads.

#2: Per-VM swap selection (Chris Li's deployment scenario)
  [ / ]
     |
     +-- [ Job on VM ]              (tiers: zswap, SSD)
            |
            +-- [ VMM guest memory ]  (tiers: SSD)

The parent (job) has access to both zswap and SSD tiers.
The child (VMM guest memory) selects SSD as its swap tier via
memory.swap.tiers.max. In this deployment, swap device selection
happens at the child level from the parent's available set.

#3: Tier isolation for reduced contention (hypothetical)
  [ / ]                    (tiers: A, B)
     |
     +-- workload X        (tiers: A)
     +-- workload Y        (tiers: B)

Each child uses a different tier. Since swap paths are separated
per tier, synchronization overhead between the two workloads is
reduced.

Future extension
================

#1: Intra-tier distribution policy:
  Currently, swap devices with the same priority are allocated in a
  round-robin fashion. Per-tier policy files under
  /sys/kernel/mm/swap/tiers/ can control how devices within a tier
  are selected (e.g. round-robin, weighted).

#2: Inter-tier promotion and demotion:
  Promotion and demotion apply between tiers, not within a single
  tier. The current interface defines only tier assignment; it does
  not yet define when or how pages move between tiers. Two triggering
  models are possible:

  (a) User-triggered: userspace explicitly initiates migration between
      tiers (e.g. via a new interface or existing move_pages semantics).
  (b) Kernel-triggered: the kernel moves pages between tiers at
      appropriate points such as reclaim or refault.

#3: Per-VMA, per-process swap and BPF:
  Tiering need not stay memcg-only: it could be extended to per-VMA or
  per-process swap selection, or exposed to BPF programs.

#4: Zswap and vswap tiering:
  Tiering applies to the vswap + zswap combination.

#5: Vswap on/off control:
  Currently not supported. If a strong use case arises where vswap needs
  to be controlled by memcg, the tier interface could be used for it.

Experimentation
===============

Tested on our internal platform using NBD as a separate swap tier.
This is the simple use case of our first production deployment.

Without tiers:
- No selective control over flash wear
- Cannot selectively assign NBD to specific applications

Cold launch improvement (preloaded vs. baseline):
- App A: 13.17s -> 4.18s (68%)
- App B: 5.60s -> 1.12s (80%)
- App C: 10.25s -> 2.00s (80%)

Performance impact with no tiers configured:
<1% regression in kernel build and vm-scalability benchmarks

Change log
===========

v8
- Changed the memcg interface to memory.swap.tiers.max.
  Values are '0' (disable) and 'max' (enable). Default is 'max'.
- Addressed Sashiko's review: update the mask value atomically in one
  step and read it while holding the lock.
- Collected review tags from Kairui and Nhat.
- Rebase on recent mm-new
- v7 link: https://lore.kernel.org/linux-mm/20260527062247.3440692-1-youngjun.park@lge.com/

v7
- Collect Baoquan's review tag
- Baoquan's feedback on fixing improper comment
- Minor code adjustments per Baoquan's feedback.
- Rebase on recent mm-new
- v6 link: https://lore.kernel.org/linux-mm/20260421055323.940344-1-youngjun.park@lge.com/

v6
- Sashiko AI review fixes
 - Fix batch parsing error path to restore snapshot before exit
 - Reject overlong tier names to prevent truncated duplicates
 - Avoid restoring raw list_head via memcpy (stale pointer risk)
 - Ensure early parse errors do not skip DEF_SWAP_PRIO validation
 - Use (1U << TIER_DEFAULT_IDX) to avoid signed shift UB
 - Defer tier mask inheritance to css_online() to close race window
 - Add READ_ONCE()/WRITE_ONCE() for tier mask accesses
- Other fixes
 - Fix build error reintroduced due to missing v5 change
 - Fix WARNING in folio_tier_effective_mask by adding rcu_read_lock()
 - Lower the maximum number of swap tiers from 32 to 31 to reserve the last bit.
 - commit message refinement.
 - rebased on recent mm-new
- v5 link: https://lore.kernel.org/linux-mm/20260325175453.2523280-1-youngjun.park@lge.com/

v5
- Fixed build errors reported in v4
- rebased on up to date mm-new
- Minor cleanups
- Design docs with validation (by Shakeel Butt discussion)
- v4 link : https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/

v4
- Simplified control flow and indentation
- Added CONFIG option for MAX_SWAPTIER (default: 4)
- Added memory.swap.tiers.effective interface
- Reworked save/restore logic into snapshot/rollback model
- Removed tier priority modification support (deferred)
- Improved validation and fixed edge cases
- Rebased onto latest mm-new
- RFC v3 link: https://lore.kernel.org/linux-mm/20260131125454.3187546-1-youngjun.park@lge.com/

RFC v1 ~ v3
- Change the direction after discussion with Chris Li
- apply some LPC feedback.
- RFC v2 - https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
- RFC v1 - https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/

Earlier Approach (per cgroup swap priority)
- v1: https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
- RFC: https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d

Reference
=========

[1] https://lore.kernel.org/linux-doc/aiw2p5ANjsQUCIHA@linux.dev/
[2] https://lore.kernel.org/linux-mm/CAKEwX=Nz9SWcEVQGQjHN8P8OANJY4BG0w+iQOzoNOWuteoVjAg@mail.gmail.com/
[3] https://lore.kernel.org/cgroups/CAKEwX=O23a4iWBZoewKVb8QqODte6r3Xijckw3_oCJNoiO9M5A@mail.gmail.com/
[4] https://lore.kernel.org/linux-mm/CAO9r8zOg0OP1Ak1v7CRzSfQq0D8b4Dw+_T0Jui6YTM_KwQQNOA@mail.gmail.com/
[5] https://lore.kernel.org/linux-mm/CAO9r8zNi4-QC4sUi=xXWHt9WMeG39mbyoSf8kON9vLOZ=cbCmw@mail.gmail.com/

Youngjun Park (4):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interface for swap tier selection
  mm: swap: filter swap allocation by memcg tier mask

 Documentation/admin-guide/cgroup-v2.rst |  20 +
 Documentation/mm/index.rst              |   1 +
 Documentation/mm/swap-tier.rst          | 159 ++++++++
 MAINTAINERS                             |   3 +
 include/linux/memcontrol.h              |   5 +
 include/linux/swap.h                    |   1 +
 mm/Kconfig                              |  12 +
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  67 ++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  75 ++++
 mm/swap_tier.c                          | 483 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  76 ++++
 mm/swapfile.c                           |  20 +-
 14 files changed, 923 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 9d335aed8840f6bf83ba93309ae5e185de829c21
-- 
2.34.1


* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-17  4:02 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand (Arm), lsf-pc, linux-kernel, linux-cxl, cgroups,
	linux-mm, linux-trace-kernel, damon, kernel-team, gregkh, rafael,
	dakr, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <aimSzvoJDrpeQsmM@gourry-fedora-PF4VCD3F>

On Wed, Jun 10, 2026 at 12:37:34PM -0400, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> > On 6/10/26 12:41, Gregory Price wrote:
> > > On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > > 
> > > Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> > > which causes spillage into private nodes because slub allows private
> > > nodes in its mask.  I think this is fixable.
> > > 
> > > I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> > > code, etc), but it seems like fully dropping the FALLBACK entries and
> > > requiring __GFP_THISNODE might be sufficient.
> > 
> > Sorry, I haven't been able to follow up so far, and not sure if that's what you
> > are discussing here ...
> > 
> > After the LSF/MM session, I was wondering, whether if we focus on allowing only
> > folios allocations to end up on private memory nodes for now: could the
> > __GFP_THISNODE approach work there?
> > 
> > Essentially, disallow any allocations on non-folio paths, and allow folio
> > allocation only with __GFP_THISNODE set.
> > 
> > I have to find time to read the other mails in this thread, on my todo list.
> > 
> > So sorry if that is precisely what is being discussed here.
> > 
> 
> So, I remember this being asked, and I didn't fully grok the request.
> 
> I'm still not sure I fully understand the question, so apologies if I'm
> answering the wrong things here.
> 
> I understand this question in two ways:
> 
>   1) Can we disallow PAGE allocation and limit this to FOLIO allocation
>   2) Can we disallow [Feature] (i.e. slab) allocation targeting the node.
> 
> 
> 1) Can we disallow page allocation and limit this to folios?
> 
> No, I don't think so.
> 
> Folio allocations are written in terms of page allocations, we would
> have to rewrite folio allocation interfaces and introduce a bunch of
> boilerplate for the sake of this.
> 
> struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>                 int preferred_nid, nodemask_t *nodemask)
> {
>         struct page *page;
> 
>         page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
>         if (page)
>                 set_page_refcounted(page);
>         return page;
> }
> 
> struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
>                 nodemask_t *nodemask)
> {
>         struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
>                                         preferred_nid, nodemask);
> 	return page_rmappable_folio(page);
> }
> 
> At the end of the day, this all reduces to `get_page_from_freelist`,
> and at that level we don't really care about folio vs page.
> 
> __GFP_COMP is insufficient to differentiate between a non-folio compound
> page and a folio, and __GFP_COMP is passed into __alloc_pages_*
> interfaces all over the kernel.
> 
> Trying to detach these paths seems like a horrible rat's nest /
> not feasible / will create a lot of boilerplate for little value.
> 
> (I did not fully understand this request when it was asked, I do
>  not fully understand this request now, please let me know if I
>  have misunderstood what you were asking).
> 

I agree with this; any changes to folio-only allocation could then be
easily adapted for N_MEMORY_PRIVATE.

> 
> 
> 2) Can we disallow SLAB allocation.
> 
> Yeah, but I think a better question is whether there's a difference
> between alloc_pages_node() and kmalloc_node() when it all just sinks
> to the same fundamental code in mm/page_alloc.c
> 
> Maybe there's an argument for something like NP_OPT_KMALLOC (allow slab
> allocations on the private node w/ __GFP_THISNODE)
> 
> On my current set, I don't implement any explicit filtering at all in
> mm/page_alloc.c - the filtering is a function of the nodes not being
> present in the FALLBACK list and only having a NOFALLBACK list.
> 
> What __GFP_THISNODE actually does under the hood is just switch
> which zone list (FALLBACK vs NOFALLBACK) is used for the target node.
> 
> For isolation w/o __GFP_PRIVATE, we're removing N_MEMORY_PRIVATE nodes
> from *their own FALLBACK* list and only adding them to their NOFALLBACK
> list.  That means to reach a private node you MUST use __GFP_THISNODE.
> 
> I realize this is confusing, but essentially we don't have to modify
> mm/page_alloc.c to get the __GFP_THISNODE filtering, we get this from
> the fallback/nofallback list construction.
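
A minimal user-space sketch of the list construction described above, not
kernel code. Every name here (model_node, model_build_zonelists(),
model_alloc(), MODEL_GFP_THISNODE) is hypothetical, and the assumption that
a private node's FALLBACK list still contains the normal nodes is mine. It
only illustrates that the filtering falls out of which list gets walked:

```c
#include <stdbool.h>

#define MODEL_GFP_THISNODE 0x1u  /* stand-in for __GFP_THISNODE */
#define MODEL_MAX_NODES    8     /* callers must pass nr_nodes <= this */
#define MODEL_NO_NODE      (-1)

struct model_node {
	bool is_private;                    /* models an N_MEMORY_PRIVATE node */
	bool has_free;                      /* can this node satisfy a request? */
	int fallback[MODEL_MAX_NODES + 1];  /* MODEL_NO_NODE terminated */
	int nofallback[2];                  /* the node itself, then terminator */
};

static struct model_node model_nodes[MODEL_MAX_NODES];

/* Private nodes are left out of every FALLBACK list, including their own. */
void model_build_zonelists(const bool *is_private, int nr_nodes)
{
	for (int nid = 0; nid < nr_nodes; nid++) {
		struct model_node *node = &model_nodes[nid];
		int k = 0;

		node->is_private = is_private[nid];
		node->has_free = true;

		/* Local node first, unless it is private. */
		if (!is_private[nid])
			node->fallback[k++] = nid;
		for (int other = 0; other < nr_nodes; other++)
			if (other != nid && !is_private[other])
				node->fallback[k++] = other;
		node->fallback[k] = MODEL_NO_NODE;

		/* NOFALLBACK always holds exactly the node itself. */
		node->nofallback[0] = nid;
		node->nofallback[1] = MODEL_NO_NODE;
	}
}

void model_set_free(int nid, bool has_free)
{
	model_nodes[nid].has_free = has_free;
}

/* The "THISNODE" bit only selects which list is walked. */
int model_alloc(unsigned int gfp, int nid)
{
	const int *list = (gfp & MODEL_GFP_THISNODE) ?
			  model_nodes[nid].nofallback :
			  model_nodes[nid].fallback;

	for (int i = 0; list[i] != MODEL_NO_NODE; i++)
		if (model_nodes[list[i]].has_free)
			return list[i];
	return MODEL_NO_NODE;
}
```

With nodes 0 and 1 normal and node 2 private, model_alloc(0, 2) lands on a
normal node, while model_alloc(MODEL_GFP_THISNODE, 2) is the only way to
reach node 2, even when the normal nodes are exhausted.
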
> 
> 
> Ok, so how does this flush out in practice - and why do I call this
> filtering mechanism fragile?
> 
> Consider kmalloc_node() and ___slab_alloc():
> 
> kmalloc_node(...)
>   └─ ___slab_alloc()     mm/slub.c:4406   pc.flags |= __GFP_THISNODE
>       └─ new_slab(s, pc.flags, node)
>           └─ allocate_slab(s, flags, node)
>               └─ alloc_slab_page(flags, node, oo, …)
>                   └─ __alloc_frozen_pages(flags, order, node, NULL);
> 
> Slab silently upgrades the page allocator flags here to include
> __GFP_THISNODE - even if the user didn't request that behavior.
> 
> This is exactly the kind of "spillage" I said was hard to police at LSF.
> 
> Without __GFP_PRIVATE, we have to keep an eye on what around the kernel
> is using __GFP_THISNODE and how.
> 
> For mm/slub.c we can choose to do one of two things:
> 
>   1) 100% refuse slab allocations on private nodes, i.e.:
> 
>      kmalloc_node(..., private_nid, __GFP_THISNODE)
> 
>      will fail (return NULL).
> 

Doesn't this iterate through N_MEMORY only? N_MEMORY_PRIVATE should not
be in the regular for_each(...) loops

>   or
> 
>   2) Do not upgrade private-node slab requests w/ __GFP_THISNODE
>      
>      This allows kmalloc_node() to work the same as folio_alloc()
>      or alloc_pages() interfaces (__GFP_THISNODE is the key), with
>      the understanding that any __GFP_THISNODE user may get access.
> 
> We can opt these nodes into slab/kmalloc with a NP_OPT_SLAB
> if the owner wants kmalloc_node(), with the understanding that any
> caller using __GFP_THISNODE may get access.
> 
> That's the kind of fragility I was trying to avoid.
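
To make the slab choices above concrete, here is a rough user-space model,
not mm/slub.c. The enum, the function names and the "fall back to node 0"
behaviour are all hypothetical; the only thing taken from the thread is
that slab ORs in the THISNODE bit on the caller's behalf and that a private
node is reachable only with that bit set:

```c
#include <stdbool.h>

#define MODEL_GFP_THISNODE 0x1u  /* stand-in for __GFP_THISNODE */

enum model_slab_policy {
	MODEL_SLAB_UPGRADE_ALWAYS,     /* today's behaviour as described above */
	MODEL_SLAB_REFUSE_PRIVATE,     /* option 1: never serve private nodes */
	MODEL_SLAB_NO_UPGRADE_PRIVATE, /* option 2: caller's flag decides */
};

/*
 * Page allocator model: a private node is reachable only with the
 * THISNODE bit; otherwise the request falls back to node 0, which this
 * model assumes is a normal node.
 */
int model_page_alloc(unsigned int gfp, int nid, const bool *is_private)
{
	if (!is_private[nid])
		return nid;
	return (gfp & MODEL_GFP_THISNODE) ? nid : 0;
}

/* Returns the node that served the request, or -1 on failure. */
int model_kmalloc_node(unsigned int gfp, int nid, const bool *is_private,
		       enum model_slab_policy policy)
{
	if (is_private[nid]) {
		if (policy == MODEL_SLAB_REFUSE_PRIVATE)
			return -1;
		if (policy == MODEL_SLAB_NO_UPGRADE_PRIVATE)
			return model_page_alloc(gfp, nid, is_private);
	}
	/* Silent upgrade: the caller never asked for THISNODE. */
	return model_page_alloc(gfp | MODEL_GFP_THISNODE, nid, is_private);
}
```

Under MODEL_SLAB_UPGRADE_ALWAYS a plain request for a private nid is served
from that private node, which is the spillage. Option 1 fails the request
outright, and option 2 leaves the decision to whether the caller passed the
THISNODE bit.
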
> 
> 
> That said, in practice, I have found that basic kernel operations don't
> generally use kmalloc_node() w/ __GFP_THISNODE - there's just
> nothing to prevent anyone from doing so.
> 
> So this seems promising...
> And then there's arch/powerpc/platforms/powernv/memtrace.c
> 
> static u64 memtrace_alloc_node(u32 nid, u64 size)
> {
> 	... snip ...
>         page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE |
>                                   __GFP_NOWARN | __GFP_ZERO, nid, NULL);
> 	... snip ...
> }
> 
> static int memtrace_init_regions_runtime(u64 size)
> {
> 	... snip ...
>         for_each_online_node(nid) {
>                 m = memtrace_alloc_node(nid, size);
> 	... snip ...
> }
> 
> static int memtrace_enable_set(void *data, u64 val)
> {
> 	... snip ...
>         if (memtrace_init_regions_runtime(val))
>                 goto out_unlock;
> 	... snip ...
> }
> 
> This is the *exact* pattern I said would be hard to police - and it
> doesn't look like a bug, just not informed that private nodes exist.
> 
> This is why I'm concerned with trying to depend on __GFP_THISNODE as the
> filtering function.
> 
> That said, the number of __GFP_THISNODE users is very limited
> kernel-wide, so maybe that's an acceptable maintenance burden?
> 

Balbir

^ permalink raw reply

* Re: [PATCH v3 03/15] mm/slab: introduce slab_alloc_context
From: Hao Li @ 2026-06-17  2:52 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-3-ce1146d140fb@kernel.org>

On Mon, Jun 15, 2026 at 01:54:36PM +0200, Vlastimil Babka (SUSE) wrote:
> Similarly to page allocator's struct alloc_context, introduce a helper
> struct to hold a part of the allocation arguments. This will allow
> reducing the number of parameters in many functions of the
> implementation, and extend them easily if needed.
> 
> For now, make it hold the caller address and the originally requested
> allocation size.
> 
> Convert alloc_single_from_new_slab(), __slab_alloc_node() and
> ___slab_alloc(). No functional change intended.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-4-7190909db118@kernel.org
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao

^ permalink raw reply

* Re: [PATCH v3 01/15] mm/slab: do not init any kfence objects on allocation
From: Hao Li @ 2026-06-17  2:44 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Christoph Lameter, David Rientjes, Roman Gushchin,
	Suren Baghdasaryan, Alexei Starovoitov, Andrew Morton,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, kasan-dev, linux-mm, linux-kernel,
	cgroups
In-Reply-To: <20260615-slab_alloc_flags-v3-1-ce1146d140fb@kernel.org>

On Mon, Jun 15, 2026 at 01:54:34PM +0200, Vlastimil Babka (SUSE) wrote:
> When init (zeroing) on allocation is requested, for kmalloc() we
> generally have to zero the full object size even if a smaller size is
> requested, in order to provide krealloc()'s __GFP_ZERO guarantees.
> 
> When we end up allocating a kfence object, kfence performs the zeroing
> on its own because it has its own redzone beyond the requested size.
> Thus slab_post_alloc_hook() has an 'init' parameter which has to be
> evaluated in all callers (via slab_want_init_on_alloc()) and should be
> false for kfence allocations.
> 
> For kfence allocations in slab_alloc_node() this is achieved by subtly
> skipping over the slab_want_init_on_alloc() call. Other callers (i.e.
> kmem_cache_alloc_bulk_noprof()) however evaluate it unconditionally even
> if they do end up with a kfence allocation. This is only subtly not a
> problem, as those are not kmalloc allocations and thus the "requested
> size" equals s->object_size and thus it cannot interfere with kfence's
> redzone. There's just an unnecessary double zeroing (in both kfence and
> slab_post_alloc_hook()), but it's all very fragile and contradicts the
> comment in kfence_guarded_alloc().
> 
> Remove this subtlety and simplify the code by eliminating the init
> parameter from slab_post_alloc_hook() and make it call
> slab_want_init_on_alloc() itself. Instead add a is_kfence_address()
> check before performing the memset, which will start doing the right
> thing for all callers of slab_post_alloc_hook().
> 
> This potentially adds overhead of the is_kfence_address() check to
> allocation hotpath, but that one is designed to be as small as possible,
> and it's only evaluated if zeroing is about to happen. This means (aside
> from init_on_alloc hardening) only for __GFP_ZERO allocations, and the
> zeroing itself comes with an overhead likely larger than the added
> check.
> 
> While at it, refactor the handling of evaluating when KASAN does the
> init instead of SLUB, with no intended functional changes. A
> non-functional change is that we don't pass kasan_init as true to
> kasan_slab_alloc() if kasan has no integrated init, but then the value
> is ignored anyway, so it's theoretically more correct.
> 
> Thanks to Harry Yoo for the initial refactoring attempt, and for updated
> comments that are used here.
> 
> Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-2-7190909db118@kernel.org
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
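
For readers skimming the thread, a small user-space model of the control
flow the commit message describes: the hook decides about zeroing itself
and skips objects that belong to kfence. This is not the patch. The
model_* names, the fixed-size pool and MODEL_GFP_ZERO are hypothetical
stand-ins, and the range check is a simplification of whatever
is_kfence_address() really does:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_GFP_ZERO 0x1u  /* stand-in for __GFP_ZERO */

/* Stand-in for the kfence pool; kfence zeroes its own objects. */
static unsigned char model_kfence_pool[256];

bool model_is_kfence_address(const void *p)
{
	uintptr_t addr = (uintptr_t)p;
	uintptr_t base = (uintptr_t)model_kfence_pool;

	return addr >= base && addr < base + sizeof(model_kfence_pool);
}

bool model_want_init_on_alloc(unsigned int gfp)
{
	return (gfp & MODEL_GFP_ZERO) != 0;
}

/*
 * No 'init' parameter: the hook evaluates the zeroing decision itself,
 * then leaves kfence objects alone. Returns true if it did the memset.
 */
bool model_post_alloc_hook(void *obj, size_t size, unsigned int gfp)
{
	if (!obj || !model_want_init_on_alloc(gfp))
		return false;
	if (model_is_kfence_address(obj))
		return false;
	memset(obj, 0, size);
	return true;
}
```

The kfence check is only reached when zeroing was requested, which matches
the argument that its cost is small next to the memset it guards.
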

Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao

^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Yosry Ahmed @ 2026-06-16 20:26 UTC (permalink / raw)
  To: Nhat Pham
  Cc: YoungJun Park, Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj,
	mkoutny, roman.gushchin, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <CAKEwX=NyfxfXhHESTLyirAgdVA6QaYAcam792-vSZdmo0Pz+bA@mail.gmail.com>

On Tue, Jun 16, 2026 at 1:24 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Jun 16, 2026 at 4:10 PM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > On Tue, Jun 16, 2026 at 1:09 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Tue, Jun 16, 2026 at 3:54 PM Yosry Ahmed <yosry@kernel.org> wrote:
> > > >
> > > > On Tue, Jun 16, 2026 at 11:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > > >
> > > > > TBH, without vswap, we should not allow setting zswap as its own tier.
> > > > > It's meaningless. Maybe make it a no-op, and warn users that what they're
> > > > > setting is gibberish?
> > > >
> > > > Why? vswap is transparent to the user. Why can't zswap be its own tier?
> > >
> > > Without vswap, if you set zswap as its own tier, which phys swap
> > > device should we allocate from for the backing slot? :)
> >
> > Today we just allocate a swap slot in a swapfile during reclaim,
> > before swapout, and zswap will just writeback to that one. I assume
> > the same will work with swap tiering, except that maybe the way that
> > swap slot is allocated will respect the allowed swap tiers?
>
> Yep! So if we set zswap as the only tier, then it wouldn't be able to
> allocate a swap slot in a swapfile, right?

Ohh I thought you meant we shouldn't allow zswap to be a tier at all,
not the *only* tier.

> Or are you suggesting that if we set zswap as the only tier then we
> can allocate from any swapfile (since we're not doing any IO anyway)?

Hmm, technically having zswap as the only tier should be equivalent to
disabling writeback, but you're right that if zswap is the only tier
then the memcg is not allowed to use swap slots from any swapfile, so
zswap cannot be used. Very good point :)

In this case I think yes, we need vswap to be enabled to allow making
zswap the only tier. That's one gap between zswap being the only tier
and disabling zswap writeback, the former requires vswap while the
latter doesn't.
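
Spelled out as a toy predicate, just to pin down the cases. This assumes a
per-memcg bitmask of allowed tiers, which is purely illustrative; the
actual swap tier interface is still under discussion and none of these
names exist anywhere:

```c
#include <stdbool.h>

/* Hypothetical tier bits, for illustration only. */
#define MODEL_TIER_ZSWAP 0x1u
#define MODEL_TIER_SSD   0x2u
#define MODEL_TIER_HDD   0x4u

/* A backing slot needs at least one allowed swapfile tier. */
bool model_can_alloc_backing_slot(unsigned int allowed)
{
	return (allowed & ~MODEL_TIER_ZSWAP) != 0;
}

/*
 * Without vswap, zswap needs a swap slot in some swapfile before it can
 * store anything; with vswap it does not.
 */
bool model_zswap_usable(unsigned int allowed, bool vswap)
{
	if (!(allowed & MODEL_TIER_ZSWAP))
		return false;
	return vswap || model_can_alloc_backing_slot(allowed);
}

/* Writeback needs somewhere to write back to. */
bool model_zswap_writeback_possible(unsigned int allowed)
{
	return (allowed & MODEL_TIER_ZSWAP) &&
	       model_can_alloc_backing_slot(allowed);
}
```

So zswap as the only tier is unusable without vswap, and usable with vswap
but with no writeback, which is where it ends up looking like writeback
being disabled.
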

^ permalink raw reply

* Re: [swap tier discussion] Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Nhat Pham @ 2026-06-16 20:24 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: YoungJun Park, Shakeel Butt, Hao Jia, Johannes Weiner, mhocko, tj,
	mkoutny, roman.gushchin, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he, joshua.hahnjy
In-Reply-To: <CAO9r8zMHGFG_jcVeDPgowaQ2RNntp3KankwzQdgrJb9PrWu8_w@mail.gmail.com>

On Tue, Jun 16, 2026 at 4:10 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> On Tue, Jun 16, 2026 at 1:09 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Jun 16, 2026 at 3:54 PM Yosry Ahmed <yosry@kernel.org> wrote:
> > >
> > > On Tue, Jun 16, 2026 at 11:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > >
> > > > TBH, without vswap, we should not allow setting zswap as its own tier.
> > > > It's meaningless. Maybe make it a no-op, and warn users that what they're
> > > > setting is gibberish?
> > >
> > > Why? vswap is transparent to the user. Why can't zswap be its own tier?
> >
> > Without vswap, if you set zswap as its own tier, which phys swap
> > device should we allocate from for the backing slot? :)
>
> Today we just allocate a swap slot in a swapfile during reclaim,
> before swapout, and zswap will just writeback to that one. I assume
> the same will work with swap tiering, except that maybe the way that
> swap slot is allocated will respect the allowed swap tiers?

Yep! So if we set zswap as the only tier, then it wouldn't be able to
allocate a swap slot in a swapfile, right?

Or are you suggesting that if we set zswap as the only tier then we
can allocate from any swapfile (since we're not doing any IO anyway)?

^ permalink raw reply
