Linux cgroups development
* Re: [RFC PATCH v2 3/9] mm/zswap: support fully zswap-backed large folio loads
From: Fujunjie @ 2026-05-31 20:03 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <CAKEwX=Or6forBoArv1b=MZuhOuF+MTuLLZWPKgUmkBVaoBoYSQ@mail.gmail.com>



On 5/30/2026 2:25 AM, Nhat Pham wrote:
> On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>>
>> zswap currently refuses large swapcache folios. That is correct for mixed
>> backend ranges, but it also prevents the common swapin path from loading a
>> range that is still fully backed by zswap.
>>
>> Teach zswap_load() to fill a locked large swapcache folio by decompressing
>> each base-page entry into the matching folio offset, then flushing the
>> folio once. A missing entry after zswap data has been seen is reported as
>> -EAGAIN so the caller can drop the speculative large folio and retry
>> order-0.
>>
>> The large load keeps the zswap entries in place. It is a clean speculative
>> fill: until the swap slots are freed, zswap remains the backing copy if
>> reclaim drops the large folio before PTEs are installed.
>>
>> Signed-off-by: fujunjie <fujunjie1@qq.com>
>> ---
>>  mm/zswap.c | 105 ++++++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 87 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index da5297f7bd69..94ba112a2982 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -15,6 +15,8 @@
>>
>>  #include <linux/module.h>
>>  #include <linux/cpu.h>
>> +#include <linux/mm.h>
>> +#include <linux/huge_mm.h>
>>  #include <linux/highmem.h>
>>  #include <linux/slab.h>
>>  #include <linux/spinlock.h>
>> @@ -934,7 +936,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>>         return comp_ret == 0 && alloc_ret == 0;
>>  }
>>
>> -static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>> +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio,
>> +                            unsigned int page_idx, bool flush_dcache)
>>  {
>>         struct zswap_pool *pool = entry->pool;
>>         struct scatterlist input[2]; /* zsmalloc returns an SG list 1-2 entries */
>> @@ -952,14 +955,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>>
>>                 WARN_ON_ONCE(input->length != PAGE_SIZE);
>>
>> -               dst = kmap_local_folio(folio, 0);
>> +               dst = kmap_local_folio(folio, page_idx * PAGE_SIZE);
>>                 memcpy_from_sglist(dst, input, 0, PAGE_SIZE);
>>                 dlen = PAGE_SIZE;
>>                 kunmap_local(dst);
>> -               flush_dcache_folio(folio);
>> +               if (flush_dcache)
>> +                       flush_dcache_folio(folio);
>>         } else {
>>                 sg_init_table(&output, 1);
>> -               sg_set_folio(&output, folio, PAGE_SIZE, 0);
>> +               sg_set_folio(&output, folio, PAGE_SIZE, page_idx * PAGE_SIZE);
>>                 acomp_request_set_params(acomp_ctx->req, input, &output,
>>                                          entry->length, PAGE_SIZE);
>>                 ret = crypto_acomp_decompress(acomp_ctx->req);
>> @@ -1042,7 +1046,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>                 goto out;
>>         }
>>
>> -       if (!zswap_decompress(entry, folio)) {
>> +       if (!zswap_decompress(entry, folio, 0, true)) {
>>                 ret = -EIO;
>>                 goto out;
>>         }
>> @@ -1615,10 +1619,9 @@ enum zswap_range_state zswap_probe_range(swp_entry_t swp,
>>   *  NOT marked up-to-date, so that an IO error is emitted (e.g. do_swap_page()
>>   *  will SIGBUS).
>>   *
>> - *  -EINVAL: if the swapped out content was in zswap, but the page belongs
>> - *  to a large folio, which is not supported by zswap. The folio is unlocked,
>> - *  but NOT marked up-to-date, so that an IO error is emitted (e.g.
>> - *  do_swap_page() will SIGBUS).
>> + *  -EAGAIN: if the swapped out content belongs to a large folio, but the
>> + *  range is mixed or raced with writeback. The folio remains locked so the
>> + *  caller can drop the large swapcache folio and retry order-0.
>>   *
>>   *  -ENOENT: if the swapped out content was not in zswap. The folio remains
>>   *  locked on return.
>> @@ -1626,9 +1629,12 @@ enum zswap_range_state zswap_probe_range(swp_entry_t swp,
>>  int zswap_load(struct folio *folio)
>>  {
>>         swp_entry_t swp = folio->swap;
>> +       unsigned int nr_pages = folio_nr_pages(folio);
>> +       unsigned int type = swp_type(swp);
>>         pgoff_t offset = swp_offset(swp);
>> -       struct xarray *tree = swap_zswap_tree(swp);
>> +       struct xarray *tree;
>>         struct zswap_entry *entry;
>> +       unsigned int i;
>>
>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>> @@ -1636,21 +1642,84 @@ int zswap_load(struct folio *folio)
>>         if (zswap_never_enabled())
>>                 return -ENOENT;
>>
>> -       /*
>> -        * Large folios should not be swapped in while zswap is being used, as
>> -        * they are not properly handled. Zswap does not properly load large
>> -        * folios, and a large folio may only be partially in zswap.
>> -        */
>> -       if (WARN_ON_ONCE(folio_test_large(folio))) {
>> +       if (folio_test_large(folio)) {
>> +               struct obj_cgroup *first_objcg = NULL;
>> +               bool same_objcg = true;
>> +               bool saw_zswap = false;
>> +               bool saw_non_zswap = false;
>> +
>> +               /*
>> +                * The locked large swapcache folio now covers the range and
>> +                * conflicts with zswap writeback's order-0 swapcache allocation.
>> +                * If the range is mixed or an entry disappears, retry order-0.
>> +                */
>> +               for (i = 0; i < nr_pages; i++) {
>> +                       tree = swap_zswap_tree(swp_entry(type, offset + i));
>> +                       entry = xa_load(tree, offset + i);
>> +                       if (!entry) {
>> +                               if (saw_zswap)
>> +                                       return -EAGAIN;
>> +                               saw_non_zswap = true;
>> +                               continue;
>> +                       }
> 
> Can we use xas_load API here instead of traversing down the tree again
> and again?

I'll rework it to use xas_load(), while handling zswap tree boundaries correctly.

> 
>> +                       if (saw_non_zswap)
>> +                               return -EAGAIN;
>> +
>> +                       if (!saw_zswap)
>> +                               first_objcg = entry->objcg;
>> +                       else if (entry->objcg != first_objcg)
>> +                               same_objcg = false;
> 
> Can we get different objcg at this point?

The objcg pointers can be different in principle, for example if
the range is assembled from entries that came from different per-node objcgs
of the same memcg.

But for this accounting path, count_objcg_events() ultimately charges the
event to obj_cgroup_memcg(entry->objcg). Since the large swapcache allocation
has already checked compatible swap ownership for the range, the final memcg
accounting target should be the same even if the objcg pointers differ.

I will simplify this in v3 and avoid the extra objcg equality pass.
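
As a rough userspace model of the accounting argument above (not kernel code): struct memcg, struct objcg, count_zswpin() and the two account_*() helpers below are hypothetical stand-ins, with count_zswpin() playing the role of count_objcg_events(). The sketch assumes every entry in the range has an objcg and that all of them resolve to one memcg.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct memcg { unsigned long zswpin; };
struct objcg { struct memcg *memcg; };

/* Models count_objcg_events(): the event lands on the objcg's memcg. */
void count_zswpin(struct objcg *objcg, unsigned long nr)
{
	if (objcg)
		objcg->memcg->zswpin += nr;
}

/* Per-entry accounting, like the v2 fallback loop. */
void account_per_entry(struct objcg **entries, size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++)
		count_zswpin(entries[i], 1);
}

/* Bulk accounting against the first entry's objcg only. */
void account_bulk(struct objcg **entries, size_t nr_pages)
{
	assert(entries != NULL && nr_pages > 0);
	count_zswpin(entries[0], nr_pages);
}
```

When distinct objcg pointers share a memcg, both helpers leave the same total on that memcg, which is why the equality pass looks unnecessary. The model does not cover a range where only some entries carry an objcg.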

> 
>> +                       saw_zswap = true;
>> +               }
>> +               if (!saw_zswap)
>> +                       return -ENOENT;
>> +
>> +               for (i = 0; i < nr_pages; i++) {
>> +                       tree = swap_zswap_tree(swp_entry(type, offset + i));
>> +                       entry = xa_load(tree, offset + i);
>> +                       if (!entry)
>> +                               return -EAGAIN;
>> +
>> +                       if (!zswap_decompress(entry, folio, i, false)) {
>> +                               folio_unlock(folio);
>> +                               return -EIO;
>> +                       }
>> +               }
>> +
>> +               flush_dcache_folio(folio);
>> +               /*
>> +                * Keep zswap entries until swap slots are freed. This is a clean
>> +                * speculative fill; zswap remains the backing copy if reclaim
>> +                * drops the large folio before PTEs are installed.
>> +                */
>> +               folio_mark_uptodate(folio);
>> +               count_vm_events(ZSWPIN, nr_pages);
>> +               count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
>> +
>> +               if (same_objcg) {
>> +                       if (first_objcg)
>> +                               count_objcg_events(first_objcg, ZSWPIN, nr_pages);
>> +               } else {
>> +                       for (i = 0; i < nr_pages; i++) {
>> +                               tree = swap_zswap_tree(swp_entry(type, offset + i));
>> +                               entry = xa_load(tree, offset + i);
>> +                               if (WARN_ON_ONCE(!entry))
>> +                                       continue;
>> +                               if (entry->objcg)
>> +                                       count_objcg_events(entry->objcg, ZSWPIN, 1);
> 
> xas_load() here too?

Yes, the same issue applies here.

> 
> 
>> +                       }
>> +               }
>> +
>>                 folio_unlock(folio);
>> -               return -EINVAL;
>> +               return 0;
>>         }
> 
>>
>> +       tree = swap_zswap_tree(swp);
>>         entry = xa_load(tree, offset);
>>         if (!entry)
>>                 return -ENOENT;
>>
>> -       if (!zswap_decompress(entry, folio)) {
>> +       if (!zswap_decompress(entry, folio, 0, true)) {
>>                 folio_unlock(folio);
>>                 return -EIO;
>>         }
> 
> I wonder how much of these two paths (order 0 and larger order) can be
> unified...

I think more of this can be unified than this version does.

I split the paths this way because I treated the large-folio load as a
speculative fill and kept the zswap entries as the backing copy. But with
your point that an installed large swapcache folio should block zswap
writeback from turning the range mixed, I should revisit that completion rule
instead of baking it into a separate path.

For the v3 version I will try to collapse the common load path. If the large-folio
case still needs different entry lifetime rules, I will make that distinction
explicit.
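
To make the unification idea a bit more concrete, here is a userspace model of the return-code contract only. model_zswap_load() and the in_zswap[] array are hypothetical: the array stands in for "xa_load() found an entry for this slot", and decompression, locking and entry lifetime are left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns 0 if every slot is in zswap, -ENOENT if none is,
 * and -EAGAIN if the range is mixed (caller retries order-0).
 */
int model_zswap_load(const bool *in_zswap, size_t nr_pages)
{
	bool saw_zswap = false, saw_non_zswap = false;

	assert(in_zswap != NULL && nr_pages > 0);

	for (size_t i = 0; i < nr_pages; i++) {
		if (!in_zswap[i]) {
			if (saw_zswap)
				return -EAGAIN;
			saw_non_zswap = true;
			continue;
		}
		if (saw_non_zswap)
			return -EAGAIN;
		saw_zswap = true;
	}

	return saw_zswap ? 0 : -ENOENT;
}
```

With nr_pages == 1 the function can only return 0 or -ENOENT, which matches the existing order-0 contract, so in this model the order-0 load is just the single-slot case of the range walk.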

> 
>> --
>> 2.34.1
>>




* Re: [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
From: Bert Karwatzki @ 2026-05-31 18:45 UTC (permalink / raw)
  To: Mark Brown, Tejun Heo
  Cc: Johannes Weiner, spasswolf, Michal Koutný,
	Sebastian Andrzej Siewior, Petr Malat, kernel test robot,
	Martin Pitt, cgroups, linux-kernel, Aishwarya.TCV
In-Reply-To: <8b15e2465901b48ee63f4827c69a67ff6d0e6098.camel@web.de>

Am Sonntag, dem 31.05.2026 um 11:19 +0200 schrieb Bert Karwatzki:
> Am Freitag, dem 29.05.2026 um 22:08 +0100 schrieb Mark Brown:
> > On Fri, May 29, 2026 at 07:25:29AM -1000, Tejun Heo wrote:
> > > On Wed, May 27, 2026 at 11:45:54AM +0100, Mark Brown wrote:
> > > > On Mon, May 04, 2026 at 02:51:21PM -1000, Tejun Heo wrote:
> > 
> > > > with no further output and given that this is a cgroup locking change
> > > > this does seem like a plausible commit, though I didn't look into it in
> > > > detail.  Bisect log and the list of LTP tests we're running in our test
> > > > job below.  We are running multiple tests in parallel.
> > 
> > > Unfortunately, I can't reproduce this in my environment. Any chance you can
> > > try testing on x86 too and see whether it reproduces there?
> > 
> > Not readily sadly, I'll see if I can figure something out.  Our rootfs
> > images are based on Debian Trixie if that's relevant?
> 
> Using debian unstable (sid/forky) I can at least detect a timeout when running
> the ltp controller testsuite:
> 
> # LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
> Host information
>  Hostname: homer
>  Python: 3.13.12 (main, Feb 4 2026, 15:06:39) [GCC 15.2.0]
>  Directory: /tmp/kirk.root/tmp092in2yb
> 
> Connecting to SUT: default
> 
> Suite: controllers
> ──────────────────
> cgroup_core01: pass  (0.024s)
> cgroup_core02: pass  (0.004s)
> cgroup_core03: pass  (0.017s)
> cgroup: skip  (2m 41s)
> memcg_regression: skip  (3.414s)
> memcg_test_3: pass  (0.090s)
> memcg_failcnt: skip  (0.019s)
> memcg_force_empty: skip  (0.015s)
> memcg_limit_in_bytes: skip  (0.017s)
> memcg_stat_rss: skip  (0.015s)
> memcg_subgroup_charge: skip  (0.015s)
> memcg_max_usage_in_bytes: skip  (0.014s)
> memcg_move_charge_at_immigrate: skip  (0.014s)
> memcg_memsw_limit_in_bytes: skip  (0.015s)
> memcg_stat: skip  (0.015s)
> memcg_use_hierarchy: skip  (0.015s)
> memcg_usage_in_bytes: skip  (0.014s)
> memcg_stress: pass  (30m 4s)
> memcg_control: pass  (6.058s)
> memcontrol01: pass  (0.004s)
> memcontrol02: pass  (0.636s)
> memcontrol03: pass  (15.983s)
> memcontrol04: pass  (0.890s)
> cgroup_fj_function_debug: skip  (0.013s)
> cgroup_fj_function_cpuset: skip  (0.044s)
> cgroup_fj_function_cpu: skip  (0.050s)
> cgroup_fj_function_cpuacct: pass  (0.052s)
> cgroup_fj_function_memory: skip  (0.042s)
> cgroup_fj_function_freezer: pass  (0.044s)
> cgroup_fj_function_devices: pass  (0.066s)
> cgroup_fj_function_blkio: skip  (0.009s)
> cgroup_fj_function_net_cls: pass  (0.073s)
> cgroup_fj_function_perf_event: pass  (0.072s)
> 
> 
> Execution time: 33m 13s
> 
> Disconnecting from SUT: default
> 
> Target information
> ──────────────────
> Kernel:   Linux 7.1.0-rc5-next-20260528-master-dirty #480 SMP PREEMPT_RT Thu May 28 19:55:12 CEST 2026
> Cmdline:  BOOT_IMAGE=/boot/vmlinuz-7.1.0-rc5-next-20260528-master-dirty
>           root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
>           ro
>           quiet
> Machine:  unknown
> Arch:     x86_64
> RAM:      63439380 kB
> Swap:     78125052 kB
> Distro:   debian 
> 
> ────────────────────────
>       TEST SUMMARY
> ────────────────────────
> Suite:   controllers
> Runtime: 33m 13s
> Runs:    347
> 
> Results:
>     Passed:   181
>     Failed:   0
>     Broken:   0
>     Skipped:  350
>     Warnings: 0
> 
> Session stopped
> 
> In dmesg I get messages about task tst_cgctl hanging:
> 
> [ 2212.794669] [    T346] INFO: task tst_cgctl:317896 blocked for more than 122 seconds.
> [ 2212.794674] [    T346]       Not tainted 7.1.0-rc5-next-20260528-master-dirty #480
> [ 2212.794675] [    T346] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> [...] 
> 
> [ 3318.721344] [    T346] INFO: task tst_cgctl:317896 blocked for more than 1228 seconds.
> [ 3318.721349] [    T346]       Not tainted 7.1.0-rc5-next-20260528-master-dirty #480
> [ 3318.721351] [    T346] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> 
> 
> 
> 
> 
> On 6.19.14 the result of this test run is:
> 
> # LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
> 
> [...]
> 
> Target information
> ──────────────────
> Kernel:   Linux 6.19.14-stable #1238 SMP PREEMPT_RT Sat May 30 17:28:29 CEST 2026
> Cmdline:  BOOT_IMAGE=/boot/vmlinuz-6.19.14-stable
>           root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
>           ro
>           quiet
> Machine:  unknown
> Arch:     x86_64
> RAM:      63436188 kB
> Swap:     78125052 kB
> Distro:   debian 
> 
> ────────────────────────
>       TEST SUMMARY
> ────────────────────────
> Suite:   controllers
> Runtime: 36m 12s
> Runs:    347
> 
> Results:
>     Passed:   1742
>     Failed:   0
>     Broken:   0
>     Skipped:  97
>     Warnings: 0
> 
> Session stopped
> 
> With 6.19.14 I also get no hung tasks.
> 
> On 7.0.10 the tests also work:
> 
> root@homer:/mnt/data/linux-forest/kirk# LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
> Host information
> 	Hostname:   homer
> 	Python:     3.13.12 (main, Feb  4 2026, 15:06:39) [GCC 15.2.0]
> 	Directory:  /tmp/kirk.root/tmpq32b09g7
> 
> Connecting to SUT: default
> 
> Suite: controllers
> ──────────────────
> cgroup_core01: pass  (0.016s)
> 
> [...]
> 
> pids_9_100: pass  (0.107s)
> 
> Execution time: 36m 15s
> 
> Disconnecting from SUT: default
> 
> Target information
> ──────────────────
> Kernel:   Linux 7.0.10-stable #1239 SMP PREEMPT_RT Sun May 31 00:42:41 CEST 2026
> Cmdline:  BOOT_IMAGE=/boot/vmlinuz-7.0.10-stable
>           root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
>           ro
>           quiet
> Machine:  unknown
> Arch:     x86_64
> RAM:      63435940 kB
> Swap:     78125052 kB
> Distro:   debian 
> 
> ────────────────────────
>       TEST SUMMARY
> ────────────────────────
> Suite:   controllers
> Runtime: 36m 13s
> Runs:    347
> 
> Results:
>     Passed:   1742
>     Failed:   0
>     Broken:   0
>     Skipped:  97
>     Warnings: 0
> 
> Session stopped
> 
> 
> 
> I'm not sure if this is related to the problems on arm64, but I'll try bisecting this.
> 
> Bert Karwatzki

I finished my bisection (from v7.0.0 to next-20260528) and it shows

commit 1dffd95575eb ("cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()")

as first bad commit, too. During the bisection I had to apply this patch (when it's cleanly applicable)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 771fc31a69b8..712316a1e3e0 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -269,7 +269,7 @@ static __cold noinline int regen_filesystems_string(void)
 	hlist_for_each_entry_rcu(p, &file_systems, list) {
 		if (!(p->fs_flags & FS_REQUIRES_DEV))
 			newlen += strlen("nodev");
-		newlen += strlen("\t") + strlen(p->name) +  strlen("\n");
+		newlen += strlen("\t") + strlen(p->name) + strlen("\n");
 	}
 	spin_unlock(&file_systems_lock);
 
@@ -289,6 +289,7 @@ static __cold noinline int regen_filesystems_string(void)
 	 * Did someone beat us to it?
 	 */
 	if (old && old->gen == file_systems_gen) {
+		spin_unlock(&file_systems_lock);
 		kfree(new);
 		return 0;
 	}
@@ -297,6 +298,7 @@ static __cold noinline int regen_filesystems_string(void)
 	 * Did the list change in the meantime?
 	 */
 	if (gen != file_systems_gen) {
+		spin_unlock(&file_systems_lock);
 		kfree(new);
 		goto retry;
 	}
@@ -321,13 +323,12 @@ static __cold noinline int regen_filesystems_string(void)
 		 * generation above and messes it up.
 		 */
 		spin_unlock(&file_systems_lock);
-		if (old)
-			kfree_rcu(old, rcu);
+		kfree(new);
 		return -EINVAL;
 	}
 
 	/*
-	 * Paired with consume fence in READ_ONCE() in filesystems_proc_show()
+	 * Paired with consume fence in rcu_dereference() in filesystems_proc_show()
 	 */
 	smp_store_release(&file_systems_string, new);
 	spin_unlock(&file_systems_lock);


to take care of a locking issue in commit
36b3306779ea ("fs: cache the string generated by reading /proc/filesystems")
https://lore.kernel.org/all/20260520225245.2962-1-spasswolf@web.de/

The test that hangs when running
# LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
is always cgroup_fj_function_net_prio.
Also when bisecting this I disabled (i.e. commented out) the
memcg_stress test in ~/ltp-install/runtest/controllers as it takes a lot of
time (30min) and succeeds even in the version where hangs occur.

Bert Karwatzki


* Re: [PATCH v6] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
From: Tejun Heo @ 2026-05-31 17:06 UTC (permalink / raw)
  To: Qiliang Yuan, Christian Koenig, Huang Rui, Matthew Auld,
	Matthew Brost, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Johannes Weiner,
	Michal Koutný, Natalie Vock
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	dri-devel, linux-kernel, cgroups, linux-mm
In-Reply-To: <20260531-feature-dmem-high-v6-1-20563ecd6dc7@gmail.com>

Hello,

I don't think we want to define dmem.high (or dmem.max) in terms of a
specific reclaim mechanic. These interface files should express a
generic resource-distribution concept that stays valid regardless of
how the underlying reclaim works. As written, dmem.high comes down to
"evicted first in the high-priority eviction pass". It isn't consulted
on charge and dmem has no proactive reclaim, so the file does nothing
until a dmem.max hit elsewhere triggers eviction. That's an
implementation detail, not something I'd want to commit to in the
cgroup interface.

It also reads as a way to work around dmem's reclaim behavior rather
than a soft limit in its own right. A dmem.max hit doesn't just fail
today: the charge returns -EAGAIN and TTM already falls back to evicting
buffers and retrying before the allocation fails. So the question isn't
"max fails immediately, add reclaim via high" but which buffers reclaim
should target and when, which is a property of the max reclaim behavior.
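
A toy userspace model of the charge, evict and retry sequence described above may help frame the question. Everything here is hypothetical (struct pool, try_charge(), alloc_with_eviction(), and the choice of -ENOMEM as the final error); it is not the dmem or TTM code. The evictable[] array is consumed in order, so its ordering is exactly the "which buffers, and when" policy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical pool with a hard limit, standing in for a dmem.max region. */
struct pool {
	unsigned long usage;
	unsigned long max;
};

/* Charge attempt: -EAGAIN when the charge would exceed max. */
int try_charge(struct pool *p, unsigned long size)
{
	if (p->usage > p->max || size > p->max - p->usage)
		return -EAGAIN;
	p->usage += size;
	return 0;
}

/*
 * Retry the charge, evicting one buffer per failed attempt.
 * evictable[] holds buffer sizes in eviction order.
 * Fails with -ENOMEM (an assumption) once nothing is left to evict.
 */
int alloc_with_eviction(struct pool *p, unsigned long size,
			const unsigned long *evictable, size_t nr_evictable,
			size_t *nr_evicted)
{
	size_t next = 0;
	int ret;

	assert(p != NULL && nr_evicted != NULL);

	while ((ret = try_charge(p, size)) == -EAGAIN) {
		unsigned long freed;

		if (next == nr_evictable) {
			ret = -ENOMEM;
			break;
		}
		/* Evict the next buffer and uncharge it, clamped at zero. */
		freed = evictable[next++];
		p->usage -= freed < p->usage ? freed : p->usage;
	}

	*nr_evicted = next;
	return ret;
}
```

In this model a "high" knob has no role on the charge path at all; it could only change how evictable[] is ordered, which is the implementation detail in question.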

If we work around that with a high knob whose meaning is the current
eviction order, we bake an implementation detail into the ABI and make
it harder to give dmem.high proper soft-limit semantics later.

I'm not against a dmem soft limit. I'd rather improve the max reclaim
behavior so it makes sense in general, and then define high as a concept
on top of that, rather than the other way around.

The whole max-vs-high distinction and what a soft limit should mean has
had a lot of thought put into it on the memcg side, so adding the memcg
folks for their input.

Thanks.

--
tejun


* Re: [PATCH v1] docs: cgroup: Fix stale source file paths
From: Tejun Heo @ 2026-05-31 16:33 UTC (permalink / raw)
  To: Costa Shulyupin
  Cc: Johannes Weiner, Michal Koutný, Jonathan Corbet, Shuah Khan,
	Randy Dunlap, cgroups, linux-doc, linux-kernel
In-Reply-To: <20260531140045.4114289-1-costa.shul@redhat.com>

Hello,

Applied to cgroup/for-7.2.

Thanks.

--
tejun


* [PATCH v1] docs: cgroup: Fix stale source file paths
From: Costa Shulyupin @ 2026-05-31 14:00 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Randy Dunlap, cgroups, linux-doc, linux-kernel
  Cc: Costa Shulyupin

Update two references to files that were moved:
- kernel/cgroup.c -> kernel/cgroup/cgroup.c
- tools/cgroup/cgroup_event_listener.c ->
  samples/cgroup/cgroup_event_listener.c

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
 Documentation/admin-guide/cgroup-v1/cgroups.rst    | 2 +-
 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst
index 463f98453323..e501f45ea93f 100644
--- a/Documentation/admin-guide/cgroup-v1/cgroups.rst
+++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst
@@ -525,7 +525,7 @@ cgroup. It may also be taken to prevent cgroups from being
 modified, but more specific locks may be more appropriate in that
 situation.
 
-See kernel/cgroup.c for more details.
+See kernel/cgroup/cgroup.c for more details.
 
 Subsystems can take/release the cgroup_mutex via the functions
 cgroup_lock()/cgroup_unlock().
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 7c7cd457cf69..ebedbc3c3f9c 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -321,7 +321,7 @@ Under below explanation, we assume CONFIG_SWAP=y.
 ----------------------
 
 	Memory controller implements memory thresholds using cgroups notification
-	API. You can use tools/cgroup/cgroup_event_listener.c to test it.
+	API. You can use samples/cgroup/cgroup_event_listener.c to test it.
 
 	(Shell-A) Create cgroup and run event listener::
 
-- 
2.53.0



* Re: [RFC PATCH v2 1/9] mm/zswap: expose range state for swapin policy
From: Fujunjie @ 2026-05-31 13:47 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <CAKEwX=NUQb5b4T49dbRV0_41QYRRuLkQNUg+FVDpJiobCCCh7g@mail.gmail.com>



On 5/30/2026 2:35 AM, Nhat Pham wrote:
> On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>>
>> Large folio swapin needs to know whether a candidate swap range is fully
>> backed by zswap before it can choose an order. That decision should stay
>> in common swapin code, not inside zswap.
>>
>> Export two zswap facts for that caller: a lockless range occupancy snapshot
>> and the current zswap reclaim-pressure state. The range state is
>> advisory only. Writeback or invalidation can change the backend after the
>> snapshot, so users must recheck before issuing large-folio IO.
>>
>> Signed-off-by: fujunjie <fujunjie1@qq.com>
>> ---
>>  include/linux/zswap.h | 26 +++++++++++++++++++++++++
>>  mm/zswap.c            | 44 +++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 70 insertions(+)
>>
>> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
>> index 30c193a1207e..8f9aee97517c 100644
>> --- a/include/linux/zswap.h
>> +++ b/include/linux/zswap.h
>> @@ -9,6 +9,18 @@ struct lruvec;
>>
>>  extern atomic_long_t zswap_stored_pages;
>>
>> +/*
>> + * Advisory zswap occupancy snapshot for a swap range. This is not a complete
>> + * backend classifier; callers must recheck before depending on ALL_ZSWAP for
>> + * large-folio IO.
>> + */
>> +enum zswap_range_state {
>> +       ZSWAP_RANGE_NEVER_ENABLED,
>> +       ZSWAP_RANGE_NO_ZSWAP,
>> +       ZSWAP_RANGE_ALL_ZSWAP,
>> +       ZSWAP_RANGE_MIXED,
>> +};
>> +
>>  #ifdef CONFIG_ZSWAP
>>
>>  struct zswap_lruvec_state {
>> @@ -27,6 +39,9 @@ struct zswap_lruvec_state {
>>  unsigned long zswap_total_pages(void);
>>  bool zswap_store(struct folio *folio);
>>  int zswap_load(struct folio *folio);
>> +enum zswap_range_state zswap_probe_range(swp_entry_t swp,
>> +                                        unsigned int nr_pages);
>> +bool zswap_pool_reclaim_pressure(void);
>>  void zswap_invalidate(swp_entry_t swp);
>>  int zswap_swapon(int type, unsigned long nr_pages);
>>  void zswap_swapoff(int type);
>> @@ -49,6 +64,17 @@ static inline int zswap_load(struct folio *folio)
>>         return -ENOENT;
>>  }
>>
>> +static inline enum zswap_range_state zswap_probe_range(swp_entry_t swp,
>> +                                                      unsigned int nr_pages)
>> +{
>> +       return ZSWAP_RANGE_NEVER_ENABLED;
>> +}
>> +
>> +static inline bool zswap_pool_reclaim_pressure(void)
>> +{
>> +       return false;
>> +}
>> +
>>  static inline void zswap_invalidate(swp_entry_t swp) {}
>>  static inline int zswap_swapon(int type, unsigned long nr_pages)
>>  {
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 761cd699e0a3..da5297f7bd69 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -506,6 +506,19 @@ unsigned long zswap_total_pages(void)
>>         return total;
>>  }
>>
>> +/*
>> + * Expose whether zswap reclaim pressure is active. This is a backend fact:
>> + * zswap_check_limits() sets the state once the pool reaches the hard limit and
>> + * keeps it set until the pool falls below the accept threshold.
>> + */
>> +bool zswap_pool_reclaim_pressure(void)
>> +{
>> +       if (zswap_never_enabled())
>> +               return false;
>> +
>> +       return READ_ONCE(zswap_pool_reached_full);
>> +}
>> +
>>  static bool zswap_check_limits(void)
>>  {
>>         unsigned long cur_pages = zswap_total_pages();
>> @@ -1559,6 +1572,37 @@ bool zswap_store(struct folio *folio)
>>         return ret;
>>  }
>>
>> +enum zswap_range_state zswap_probe_range(swp_entry_t swp,
>> +                                        unsigned int nr_pages)
>> +{
>> +       unsigned int type = swp_type(swp);
>> +       pgoff_t offset = swp_offset(swp);
>> +       bool present = false, missing = false;
>> +       unsigned int i;
>> +
>> +       /*
>> +        * This is an advisory, lockless snapshot for common swapin admission.
>> +        * Callers must recheck before depending on an all-zswap range for IO:
>> +        * concurrent writeback or invalidation can change the backend state.
>> +        */
>> +       if (zswap_never_enabled())
>> +               return ZSWAP_RANGE_NEVER_ENABLED;
>> +
>> +       for (i = 0; i < nr_pages; i++) {
>> +               struct xarray *tree = swap_zswap_tree(swp_entry(type, offset + i));
>> +
>> +               if (xa_load(tree, offset + i))
>> +                       present = true;
>> +               else
>> +                       missing = true;
>> +
>> +               if (present && missing)
>> +                       return ZSWAP_RANGE_MIXED;
>> +       }
> 
> Can we use xas_load() to make this check more efficient? IIUC,
> xa_load() walks the tree every time.
> 
> (We used to use a bitmap here back in frontswap days. Good times....)

Thanks for your review.

I'll switch this to xas_load() in the v3 version.
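
One detail a cursor-based walk has to respect: the quoted loop looks up swap_zswap_tree() per slot, so a range may in principle span more than one tree, and a cursor would need to restart at each tree boundary. The userspace sketch below only shows that splitting step. TREE_SHIFT, struct chunk and split_by_tree() are hypothetical names, and the shift value of 14 is an assumption about the tree span, not something taken from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed span of one tree: 2^14 swap offsets (hypothetical constant). */
#define TREE_SHIFT 14UL
#define TREE_SPAN  (1UL << TREE_SHIFT)

/* One contiguous piece of the range that lives in a single tree. */
struct chunk {
	unsigned long tree_idx;	/* which tree the piece belongs to */
	unsigned long start;	/* first swap offset of the piece */
	unsigned long len;	/* number of slots in the piece */
};

/*
 * Split [offset, offset + nr_pages) at tree boundaries.
 * Returns the number of chunks written, at most max_chunks.
 */
size_t split_by_tree(unsigned long offset, unsigned long nr_pages,
		     struct chunk *out, size_t max_chunks)
{
	size_t n = 0;

	assert(out != NULL || max_chunks == 0);

	while (nr_pages > 0 && n < max_chunks) {
		/* First offset that belongs to the next tree. */
		unsigned long tree_end = (offset | (TREE_SPAN - 1)) + 1;
		unsigned long len = tree_end - offset;

		if (len > nr_pages)
			len = nr_pages;

		out[n].tree_idx = offset >> TREE_SHIFT;
		out[n].start = offset;
		out[n].len = len;
		n++;

		offset += len;
		nr_pages -= len;
	}

	return n;
}
```

Each chunk could then be walked with a single cursor, so the tree is descended once per chunk rather than once per slot.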



* Re: [RFC PATCH v2 9/9] docs: mm: update THP swapin counter descriptions
From: Fujunjie @ 2026-05-31 13:21 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <CAKEwX=NDfJud3FM4Y+Ek3RtTtwi2aXWeDCujNxh2ReUEq-m4oA@mail.gmail.com>



On 5/30/2026 2:37 AM, Nhat Pham wrote:
> On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>>
>> The THP swapin counter descriptions still describe large swapin as
>> coming only from non-zswap swap devices. Update them now that
>> zswap-backed large folio swapin can also increment swpin.
>>
>> Also describe policy and backend rejection as swpin_fallback cases,
>> since speculative zswap large swapin can intentionally fall back before
>> doing large IO.
>>
>> Signed-off-by: fujunjie <fujunjie1@qq.com>
>> ---
>>  Documentation/admin-guide/mm/transhuge.rst | 11 ++++++-----
>>  1 file changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index 23f8d13c2629..59b7a0d09243 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -667,13 +667,14 @@ zswpout
>>         piece without splitting.
>>
>>  swpin
>> -       is incremented every time a huge page is swapped in from a non-zswap
>> -       swap device in one piece.
>> +       is incremented every time a huge page is swapped in from swap or
>> +       zswap in one piece.
>>
>>  swpin_fallback
>> -       is incremented if swapin fails to allocate or charge a huge page
>> -       and instead falls back to using huge pages with lower orders or
>> -       small pages.
>> +       is incremented if swapin cannot use a huge page and instead falls
>> +       back to using huge pages with lower orders or small pages. This can
>> +       happen because allocation or charging fails, or because policy or
>> +       backend state rejects a speculative large swapin.
> 
> I think we should add separate zswpin and zswpin fallback counter for
> THP rather than overloading swpin. We already do that for zswpout vs
> swpout.

That makes sense.



* Re: [RFC PATCH v2 4/9] mm: admit large swapin by backend range in swapin_sync()
From: Fujunjie @ 2026-05-31 13:15 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <CAKEwX=PvcM1u1n8TTikCAaqJN=GtgfwvnXtU2wCf=Qjp6E_Zew@mail.gmail.com>



On 5/30/2026 2:34 AM, Nhat Pham wrote:
> On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>>
>> A large swapin can only read one folio when the whole range has compatible
>> backing. Mixed zswap/disk ranges must not reach large-folio IO, and zswap
>> range probes are only snapshots.
>>
>> Filter the orders passed to swap_cache_alloc_folio() in swapin_sync().
>> Uniform zeromap ranges and all-disk ranges keep the existing large swapin
>> path. Fully zswap-backed ranges may be tried. Mixed zswap/disk ranges fall
>> back before allocation.
>>
>> After a large swapcache folio is installed, recheck the zswap range and
>> drop the fresh folio if it became mixed. Also consume -EAGAIN from
>> swap_read_folio() the same way. Both cases retry order-0, where each slot
>> can resolve its current backend independently.
>>
>> Signed-off-by: fujunjie <fujunjie1@qq.com>
>> ---
>>  mm/memcontrol-v1.c |   8 ++-
>>  mm/memory.c        |  31 ++++++++-
>>  mm/swap_state.c    | 169 ++++++++++++++++++++++++++++++++++++++++++---
>>  3 files changed, 194 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
>> index 765069211567..5b11b8055c66 100644
>> --- a/mm/memcontrol-v1.c
>> +++ b/mm/memcontrol-v1.c
>> @@ -682,8 +682,8 @@ void __memcg1_swapout(struct folio *folio, struct swap_cluster_info *ci)
>>   * memcg1_swapin - uncharge swap slot on swapin
>>   * @folio: folio being swapped in
>>   *
>> - * Call this function after successfully adding the charged
>> - * folio to swapcache.
>> + * Call this after the charged folio has been added to swapcache and the caller
>> + * is no longer going to drop it back to swapped-out state.
>>   *
>>   * Context: The folio has to be in swap cache and locked.
>>   */
>> @@ -721,7 +721,9 @@ void memcg1_swapin(struct folio *folio)
>>         id = __swap_cgroup_clear(ci, swp_cluster_offset(folio->swap),
>>                                  nr_pages);
>>         swap_cluster_unlock(ci);
>> -       mem_cgroup_uncharge_swap(id, nr_pages);
>> +
>> +       if (id)
>> +               mem_cgroup_uncharge_swap(id, nr_pages);
>>  }
>>  #endif
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 5a365492a9a2..d73a19692dea 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4538,6 +4538,24 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
>>                 folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
>>  }
>>
>> +static void memcg1_swapin_retry_folio(struct folio *folio,
>> +                                     struct vm_fault *vmf)
>> +{
>> +       if (!folio_test_large(folio) || !folio_test_swapcache(folio))
>> +               return;
>> +
>> +       if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
>> +               if (!folio_trylock(folio))
>> +                       return;
>> +       } else {
>> +               folio_lock(folio);
>> +       }
>> +
>> +       if (folio_test_large(folio) && folio_test_swapcache(folio))
>> +               memcg1_swapin(folio);
>> +       folio_unlock(folio);
>> +}
>> +
>>  static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
>>  {
>>         vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>> @@ -4857,8 +4875,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>
>>         swapcache = folio;
>>         ret |= folio_lock_or_retry(folio, vmf);
>> -       if (ret & VM_FAULT_RETRY)
>> +       if (ret & VM_FAULT_RETRY) {
>> +               memcg1_swapin_retry_folio(folio, vmf);
>>                 goto out_release;
>> +       }
>>
>>         page = folio_file_page(folio, swp_offset(entry));
>>         /*
>> @@ -5067,6 +5087,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>         if (unlikely(folio != swapcache)) {
>>                 folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>>                 folio_add_lru_vma(folio, vma);
>> +               if (folio_test_large(swapcache))
>> +                       memcg1_swapin(swapcache);
>>                 folio_put_swap(swapcache, NULL);
>>         } else if (!folio_test_anon(folio)) {
>>                 /*
>> @@ -5076,6 +5098,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>                 VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
>>                 VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
>>                 folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>> +               if (folio_test_large(folio))
>> +                       memcg1_swapin(folio);
>>                 folio_put_swap(folio, NULL);
>>         } else {
>>                 VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
>> @@ -5132,8 +5156,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>         if (vmf->pte)
>>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  out_page:
>> -       if (folio_test_swapcache(folio))
>> +       if (folio_test_swapcache(folio)) {
>> +               if (folio_test_large(folio))
>> +                       memcg1_swapin(folio);
>>                 folio_free_swap(folio);
>> +       }
>>         folio_unlock(folio);
>>  out_release:
>>         folio_put(folio);
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index d37097913b30..f03ad4832f16 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -21,6 +21,7 @@
>>  #include <linux/migrate.h>
>>  #include <linux/vmalloc.h>
>>  #include <linux/huge_mm.h>
>> +#include <linux/zswap.h>
>>  #include <linux/shmem_fs.h>
>>  #include "internal.h"
>>  #include "swap_table.h"
>> @@ -403,7 +404,8 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
>>  static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
>>                                         swp_entry_t targ_entry, gfp_t gfp,
>>                                         unsigned int order, struct vm_fault *vmf,
>> -                                       struct mempolicy *mpol, pgoff_t ilx)
>> +                                       struct mempolicy *mpol, pgoff_t ilx,
>> +                                       bool defer_memcg1_swapin)
>>  {
>>         int err;
>>         swp_entry_t entry;
>> @@ -466,7 +468,8 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
>>         }
>>
>>         /* memsw uncharges swap when folio is added to swap cache */
>> -       memcg1_swapin(folio);
>> +       if (!defer_memcg1_swapin || !order)
>> +               memcg1_swapin(folio);
>>         if (shadow)
>>                 workingset_refault(folio, shadow);
>>
>> @@ -495,9 +498,12 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
>>   * Return: Returns the folio if allocation succeeded and folio is in the swap
>>   * cache. Returns error code if failed due to race, OOM or invalid arguments.
>>   */
>> -struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
>> -                                    unsigned long orders, struct vm_fault *vmf,
>> -                                    struct mempolicy *mpol, pgoff_t ilx)
>> +static struct folio *__swap_cache_alloc_folio(swp_entry_t targ_entry,
>> +                                             gfp_t gfp, unsigned long orders,
>> +                                             struct vm_fault *vmf,
>> +                                             struct mempolicy *mpol,
>> +                                             pgoff_t ilx,
>> +                                             bool defer_memcg1_swapin)
>>  {
>>         int order, err;
>>         struct folio *ret;
>> @@ -512,7 +518,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
>>
>>         do {
>>                 ret = __swap_cache_alloc(ci, targ_entry, gfp, order,
>> -                                        vmf, mpol, ilx);
>> +                                        vmf, mpol, ilx,
>> +                                        defer_memcg1_swapin);
>>                 if (!IS_ERR(ret))
>>                         break;
>>                 err = PTR_ERR(ret);
>> @@ -525,6 +532,124 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
>>         return ret;
>>  }
>>
>> +struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
>> +                                    unsigned long orders, struct vm_fault *vmf,
>> +                                    struct mempolicy *mpol, pgoff_t ilx)
>> +{
>> +       return __swap_cache_alloc_folio(targ_entry, gfp, orders, vmf,
>> +                                       mpol, ilx, false);
>> +}
>> +
>> +static struct folio *swap_cache_alloc_speculative_folio(swp_entry_t targ_entry,
>> +                                                       gfp_t gfp,
>> +                                                       unsigned long orders,
>> +                                                       struct vm_fault *vmf,
>> +                                                       struct mempolicy *mpol,
>> +                                                       pgoff_t ilx)
>> +{
>> +       /*
>> +        * Speculative large swapin may drop this fresh swapcache folio and
>> +        * retry order-0 after backend or page-table revalidation. Keep the
>> +        * cgroup v1 memsw swap owner until the caller commits the folio.
>> +        */
>> +       return __swap_cache_alloc_folio(targ_entry, gfp, orders, vmf,
>> +                                       mpol, ilx, true);
>> +}
>> +
>> +static bool swapin_zeromap_same(swp_entry_t entry, unsigned int nr_pages)
>> +{
>> +       unsigned int ci_start = swp_cluster_offset(entry);
>> +       struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
>> +       bool is_zero;
>> +       unsigned int i;
>> +
>> +       if (ci_start + nr_pages > SWAPFILE_CLUSTER) {
>> +               VM_WARN_ON_ONCE(1);
>> +               return false;
>> +       }
>> +
>> +       rcu_read_lock();
>> +       if (!rcu_dereference(ci->table)) {
>> +               rcu_read_unlock();
>> +               return true;
>> +       }
>> +
>> +       is_zero = __swap_table_test_zero(ci, ci_start);
>> +       for (i = 1; i < nr_pages; i++) {
>> +               if (is_zero != __swap_table_test_zero(ci, ci_start + i)) {
>> +                       rcu_read_unlock();
>> +                       return false;
>> +               }
>> +       }
>> +       rcu_read_unlock();
>> +
>> +       return true;
>> +}
>> +
>> +static unsigned long swapin_admit_orders(swp_entry_t entry,
>> +                                        unsigned long orders)
>> +{
>> +       unsigned long candidates = orders & ~BIT(0);
>> +       unsigned long admitted = orders & BIT(0);
>> +       int order;
>> +
>> +       if (!candidates)
>> +               return orders;
>> +
>> +       while (candidates) {
>> +               enum zswap_range_state state;
>> +               unsigned int nr_pages;
>> +               swp_entry_t range_entry;
>> +               bool admit = false;
>> +
>> +               order = fls_long(candidates) - 1;
>> +               if (order > MAX_PAGE_ORDER) {
>> +                       candidates &= ~BIT(order);
>> +                       continue;
>> +               }
>> +
>> +               nr_pages = 1U << order;
>> +               range_entry = swp_entry(swp_type(entry),
>> +                                       round_down(swp_offset(entry), nr_pages));
>> +               if (!swapin_zeromap_same(range_entry, nr_pages))
>> +                       goto next;
>> +
>> +               state = zswap_probe_range(range_entry, nr_pages);
>> +               switch (state) {
>> +               case ZSWAP_RANGE_MIXED:
>> +                       break;
>> +               case ZSWAP_RANGE_ALL_ZSWAP:
>> +               case ZSWAP_RANGE_NEVER_ENABLED:
>> +               case ZSWAP_RANGE_NO_ZSWAP:
>> +                       admit = true;
>> +                       break;
>> +               }
>> +
>> +next:
>> +               if (admit)
>> +                       admitted |= BIT(order);
>> +               else
>> +                       count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
>> +               candidates &= ~BIT(order);
>> +       }
>> +
>> +       return admitted ? admitted : BIT(0);
>> +}
>> +
>> +static bool zswap_needs_order0_retry(struct folio *folio)
>> +{
>> +       if (!folio_test_large(folio))
>> +               return false;
>> +
>> +       /*
>> +        * Admission sees only an advisory zswap snapshot. Recheck after the
>> +        * large swapcache folio is installed; if the range became mixed, drop
>> +        * the fresh folio before IO and let order-0 handle each slot.
>> +        */
>> +       return zswap_probe_range(folio->swap, folio_nr_pages(folio)) ==
>> +              ZSWAP_RANGE_MIXED;
>> +}
>> +
>>  /*
>>   * If we are the only user, then try to free up the swap cache.
>>   *
>> @@ -634,7 +759,8 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
>>                 folio = swap_cache_get_folio(entry);
>>                 if (folio)
>>                         return folio;
>> -               folio = swap_cache_alloc_folio(entry, gfp, BIT(0), NULL, mpol, ilx);
>> +               folio = swap_cache_alloc_folio(entry, gfp, BIT(0), NULL,
>> +                                              mpol, ilx);
>>         } while (PTR_ERR(folio) == -EEXIST);
>>
>>         if (IS_ERR_OR_NULL(folio))
>> @@ -677,18 +803,43 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, unsigned long orders,
>>         struct folio *folio;
>>         int ret;
>>
>> +       orders = swapin_admit_orders(entry, orders);
>> +again:
>>         do {
>>                 folio = swap_cache_get_folio(entry);
>>                 if (folio)
>>                         return folio;
>> -               folio = swap_cache_alloc_folio(entry, gfp, orders, vmf, mpol, ilx);
>> +               folio = swap_cache_alloc_speculative_folio(entry, gfp, orders,
>> +                                                          vmf, mpol, ilx);
>>         } while (PTR_ERR(folio) == -EEXIST);
>>
>>         if (IS_ERR(folio))
>>                 return folio;
>>
>> +       if (zswap_needs_order0_retry(folio)) {
>> +               count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN_FALLBACK);
>> +               /*
>> +                * The folio is newly allocated, locked, clean and not uptodate;
>> +                * no data has been read into it. Removing it only restores the
>> +                * swap table entries so order-0 swapin can resolve a backend
>> +                * race without attempting speculative large-folio zswapin.
>> +                */
>> +               swap_cache_del_folio(folio);
>> +               folio_unlock(folio);
>> +               folio_put(folio);
>> +               orders = BIT(0);
>> +               goto again;
>> +       }
>> +
>>         ret = swap_read_folio(folio, NULL);
>> -       VM_WARN_ON_ONCE(ret == -EAGAIN);
>> +       if (ret == -EAGAIN) {
> 
> Can this happen? After you add the entire swap range to swap cache,
> backend is locked. Zswap writeback bails out if it fails to add the
> page to swap cache.
> 
> I think you can just check (zswap_probe_range or whatever) before
> swap_read_folio(). If the range is still fully backed by zswap, you
> are good to go. Otherwise, bail here immediately.
> 
> Then you don't need all the complexity with extending swap_read_folio
> to handle mixed range errors (for now at least).

Yes, I think you are right.

I missed that property of zswap writeback. Once the whole range is covered by
the large swapcache folio, writeback should not be able to move a subslot to
disk because it has to allocate an order-0 swapcache folio first, and that
should fail.

Sorry for adding this extra complexity. I will rework this in a more unified
way for the next version.
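
A toy userspace model of that property, as a sketch only. This is not kernel
code: every toy_* name is made up for illustration, and it assumes swapcache
insertion is all-or-nothing for a range. Under that assumption, once the large
folio covers the range, the order-0 add that writeback needs for a subslot
fails and the slot stays in zswap:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_NR_SLOTS 64

/* Toy swap cache: owner 0 means the slot has no swapcache folio. */
struct toy_swap_cache {
	int owner[TOY_NR_SLOTS];
};

/*
 * Claim [start, start + nr) for @owner. Like a swapcache insertion in this
 * model, the add is all-or-nothing: one covered slot fails the whole range.
 */
static int toy_cache_add(struct toy_swap_cache *sc, size_t start, size_t nr,
			 int owner)
{
	size_t i;

	assert(owner != 0);
	if (start > TOY_NR_SLOTS || nr > TOY_NR_SLOTS - start)
		return -EINVAL;

	for (i = start; i < start + nr; i++)
		if (sc->owner[i])
			return -EEXIST;

	for (i = start; i < start + nr; i++)
		sc->owner[i] = owner;
	return 0;
}

/*
 * Writeback of one subslot must first own an order-0 swapcache folio for it.
 * If that add fails, writeback bails out and the slot stays in zswap.
 */
static int toy_zswap_writeback(struct toy_swap_cache *sc, size_t slot,
			       int owner)
{
	return toy_cache_add(sc, slot, 1, owner);
}
```

With a 16-slot large folio installed at slots 16..31, toy_zswap_writeback() on
slot 20 returns -EEXIST, while a slot outside the folio can still be claimed.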


* Re: [RFC PATCH v2 6/9] mm: provide anon locality evidence for zswap large swapin
From: Fujunjie @ 2026-05-31 13:11 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <CAKEwX=MDSwMoU-=h3NOG==-ru+qT3LeTi2_PADLWFXBB9aZZ+w@mail.gmail.com>



On 5/30/2026 3:22 AM, Nhat Pham wrote:
> On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>>
>> The common zswap large-swapin policy needs locality evidence from
>> callers before it can admit a large folio. For anonymous faults, provide
>> that evidence from existing VMA hints and from the PTE young state left
>> by earlier zswap-backed large swapins.
>>
>> Keep non-faulting PTEs old when mapping a speculative all-zswap large
>> folio. A later fault can then require a dense young previous range before
>> admitting another large swapin without adding VMA state.
> 
> Makes sense to me.
> 
>>
>> This also removes the old zswap-enabled guard from the THP swapin
>> candidate scan. The common swapin path now classifies the backend range
>> and falls back to order-0 for mixed zswap/disk ranges or races.
>>
>> Signed-off-by: fujunjie <fujunjie1@qq.com>
>> ---
>>  mm/memory.c     | 234 +++++++++++++++++++++++++++++++++++++++++++-----
>>  mm/swap.h       |   6 ++
>>  mm/swap_state.c |  15 ++++
>>  3 files changed, 235 insertions(+), 20 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 92a82008d583..7bbb89632000 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4556,6 +4556,35 @@ static void memcg1_swapin_retry_folio(struct folio *folio,
>>         folio_unlock(folio);
>>  }
>>
>> +static void set_swapin_ptes(struct vm_area_struct *vma,
>> +                           unsigned long address, pte_t *ptep, pte_t pte,
>> +                           unsigned int nr_pages, unsigned int fault_pte_idx,
>> +                           bool fault_only_young)
>> +{
>> +       struct mm_struct *mm = vma->vm_mm;
>> +       pte_t old_pte;
>> +
>> +       if (!fault_only_young || nr_pages == 1) {
>> +               set_ptes(mm, address, ptep, pte, nr_pages);
>> +               return;
>> +       }
>> +
>> +       old_pte = pte_mkold(pte);
>> +       if (fault_pte_idx)
>> +               set_ptes(mm, address, ptep, old_pte, fault_pte_idx);
>> +
>> +       set_pte_at(mm, address + fault_pte_idx * PAGE_SIZE,
>> +                  ptep + fault_pte_idx,
>> +                  pte_mkyoung(pte_advance_pfn(pte, fault_pte_idx)));
> 
> Hmm, does this mean that without THP swapin, the faulting PTE is not
> marked young, but it is marked young if there is a THP swapin? That's
> a behavioral change, right? Would this throw off other heuristics
> relying on this bit, or is there any justification that this is fine?

Thanks.

The intent was not to make the faulting PTE behave differently from the
normal swapin path. In do_swap_page() we first build the PTE with:

		pte = mk_pte(page, vma->vm_page_prot);

and on the common architectures I checked, the normal user pgprot already
contains the accessed/young bit. For example, arm64 PAGE_SHARED/PAGE_READONLY
are based on _PAGE_DEFAULT, which includes PTE_AF, and x86 user page
protections also include the accessed bit. So in practice the faulting PTE is
already young after mk_pte(), which means the existing order-0 path already
installs it young.

What I really wanted here is only to keep the speculative neighbouring PTEs
old. A large zswapin may install PTEs for pages that did not fault, and those
should not all look accessed just because mk_pte() produced a young PTE.

However, the explicit pte_mkyoung() on the faulting PTE makes it look as if
THP swapin adds new behavior. I will rework it so the intent is less
ambiguous.
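
One possible shape, purely as an illustration and not necessarily what the
next version will do: leave the faulting PTE exactly as it was built and only
clear the young bit on the neighbours. The sketch below is a userspace toy
model. The toy_* names, the bit layout, and the PFN shift are assumptions made
for the example, not kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PTE_YOUNG	(UINT64_C(1) << 0)	/* stand-in for the accessed bit */
#define TOY_PFN_SHIFT	12

typedef uint64_t toy_pte_t;

/* Build a toy PTE; prot_young models a pgprot that already carries the bit. */
static toy_pte_t toy_mk_pte(uint64_t pfn, bool prot_young)
{
	return (pfn << TOY_PFN_SHIFT) | (prot_young ? TOY_PTE_YOUNG : 0);
}

static bool toy_pte_young(toy_pte_t pte)
{
	return pte & TOY_PTE_YOUNG;
}

/*
 * Install nr consecutive PTEs starting from @pte. Only the neighbours are
 * forced old; the faulting entry keeps whatever young state @pte already had.
 */
static void toy_set_swapin_ptes(toy_pte_t *ptep, toy_pte_t pte,
				unsigned int nr, unsigned int fault_idx)
{
	unsigned int i;

	assert(fault_idx < nr);
	for (i = 0; i < nr; i++) {
		toy_pte_t cur = pte + ((toy_pte_t)i << TOY_PFN_SHIFT);

		if (i != fault_idx)
			cur &= ~TOY_PTE_YOUNG;
		ptep[i] = cur;
	}
}
```

In this model the faulting entry is young only if the PTE handed in was
already young, so the large path never adds a young bit that the order-0 path
would not have had.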



* Re: [RFC PATCH v2 4/9] mm: admit large swapin by backend range in swapin_sync()
From: Fujunjie @ 2026-05-31 12:34 UTC (permalink / raw)
  To: Kairui Song
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Usama Arif, Chris Li,
	Johannes Weiner, Yosry Ahmed, Nhat Pham, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <CAMgjq7AQwF6oNpnGTxxJWb=oyZ3dLfPL4oSNoS+eQxtuzZPgTQ@mail.gmail.com>



On 5/29/2026 10:45 PM, Kairui Song wrote:
> On Fri, May 29, 2026 at 10:43 PM Kairui Song <ryncsn@gmail.com> wrote:
>>
>> Hi Fujunjie,
>>
>> Thanks for the update, but this whole defer_memcg1_swapin thing is so
>> ugly I don't think this is the right way at all.
>>
>> If you really need this, maybe you can always defer the memcg1
> 
> Oh and I'm not saying I'm against this series or the idea, I'm just
> saying this particular design of this one patch needs some improvement
> :)

Thanks for your review! I will improve the implementation.


* Re: [RFC PATCH v2 0/9] mm: support zswap-backed large folio swapin
From: Fujunjie @ 2026-05-31 12:32 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <CAKEwX=PdQb2nDbFaZYuRa9_mYrMCnMEJHpxxABebKkVz+OgDHg@mail.gmail.com>



On 5/30/2026 2:06 AM, Nhat Pham wrote:
> On Fri, May 29, 2026 at 5:17 AM fujunjie <fujunjie1@qq.com> wrote:
>>
>> Hi,
>>
>> This RFC explores large-folio swapin for ranges that are still fully backed
>> by zswap.
>>
>> Large swapin is currently disabled once zswap is in the picture. Anonymous
>> faults stop considering large orders after zswap has ever been enabled,
>> shmem does the same, and zswap_load() refuses large swapcache folios. That
>> keeps mixed zswap/disk cases safe, but it also loses the dense case where
>> every slot in an aligned 64K range is still resident in zswap.
>>
>> The series keeps the policy in common swapin code:
>>
>>   - zswap reports backend facts and provides the large-folio load helper.
>>   - swapin_sync() filters candidate orders by backend range.
>>   - all-disk and zeromap ranges keep the existing Kairui large-swapin path.
>>   - mixed zswap/disk ranges stay order-0.
>>   - all-zswap ranges may use a 64K folio after locality admission.
>>   - anon provides locality evidence from VMA hints and PTE young density.
>>   - shmem starts with explicit VMA-hint evidence only.
>>   - swap readahead uses its existing VMA/cluster window as locality
>>     evidence; it does not also run the anon PTE-young rule.
>>
>> The backend range probe is only a snapshot. If the backend changes after a
>> fresh large swapcache folio is allocated, the common path drops that folio
>> and falls back to order-0. zswap_load() can also return -EAGAIN for the
>> same retry path. If a late fault retry keeps the large folio in swapcache
>> instead of deleting it, the cgroup v1 memsw swap owner is committed before
>> returning.
>>
>> This is mTHP/large-folio swapin. The mappings installed by do_swap_page()
>> are still PTE mappings, not PMD mappings. The expected win is fewer faults,
>> batched PTE/rmap work, and preserving the large folio across zswapin
>> instead of rebuilding the working set as order-0 pages.
>>
>> Prior art: Usama Arif posted a related RFC on 2024-10-18:
>>
>>   mm: zswap: add support for zswapin of large folios
>>   https://lore.kernel.org/linux-mm/20241018105026.2521366-1-usamaarif642@gmail.com/
>>
>> This RFC keeps the same broad goal, but moves admission into common swapin
>> code. zswap does not decide the policy. Mixed zswap/disk ranges are
>> rejected before large IO, and the first cap is 64K.
>>
>> This is a rewrite of the RFC posted on 2026-05-08:
>>
>>   [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
>>   https://lore.kernel.org/linux-mm/tencent_8B437BE4F586C162950BF71954316C1EDB05@qq.com/
>>
>> The v1 series was anonymous-only and kept too much of the policy near the
>> anon fault and zswap paths. This version is rebuilt on top of Kairui Song's
>> common swapin infrastructure. It keeps admission in common swapin code,
>> rejects mixed zswap/disk large ranges, and adds separate locality producers
>> for anon, shmem and swap readahead.
>>
>> Performance and behavior
>> ========================
>>
>> The A/B tables are 10-run measurements. Elapsed values are seconds,
>> shown as mean +/- sample standard deviation. "phase" or "refault" is the
>> measured refault subphase. "zswpin" counts zswap loads. "pswpin" counts
>> swap-ins from the real swap device; pswpin=0 means the refaults were served
>> by zswap even when a disk swap device was configured. "RFC 64K" is the mean
>> number of successful 64K swapins.
>>
>> The numbers below show where the large path is used and where it is
>> rejected.
>>
>> zram-backed zswap microbench, 64K mTHP, 8G guest:
>>
>> +-----------------+----------------+----------------+--------+--------+--------+----------+
>> | workload        | base elapsed   | RFC elapsed    | delta  | phase  | zswpin | RFC 64K  |
>> +-----------------+----------------+----------------+--------+--------+--------+----------+
>> | usama_1g        | 11.260+/-0.235 | 10.301+/-0.140 | -8.5%  | -22.2% | 1.000x | 16381.1  |
>> | nohint_seq64    |  4.398+/-0.085 |  4.025+/-0.022 | -8.5%  | -21.1% | 1.000x |  6221.1  |
>> | seqhint_seq64   |  4.283+/-0.060 |  3.948+/-0.062 | -7.8%  | -20.6% | 1.000x |  6223.5  |
>> | stride64_sparse |  3.095+/-0.051 |  3.086+/-0.025 | -0.3%  |  +5.8% | 1.002x |     1.0  |
>> | random64_sparse |  3.095+/-0.046 |  3.076+/-0.016 | -0.6%  |  +0.7% | 1.001x |     0.0  |
>> | random64_full   |  4.423+/-0.067 |  4.405+/-0.018 | -0.4%  |  +0.1% | 1.000x |     0.0  |
>> +-----------------+----------------+----------------+--------+--------+--------+----------+
>>
>> The usama_1g row follows the shape of the 2024 RFC benchmark: allocate 1G,
>> fill it with compressible per-page data, reclaim it through memory.reclaim,
>> then time the full integrity-check refault. The seq64 rows use a 512M
>> target and 768M of pressure. "sparse" touches one 4K page per 64K region, while
>> "full" touches every 4K page. "seqhint" uses MADV_SEQUENTIAL; "nohint" does
>> not.
>>
>> Virtio-block swap device present, zswap enabled:
>>
>> +-----------------+---------------+---------------+--------+---------+--------+--------+---------+
>> | workload        | base elapsed  | RFC elapsed   | delta  | refault | pswpin | zswpin | RFC 64K |
>> +-----------------+---------------+---------------+--------+---------+--------+--------+---------+
>> | seq64           | 4.399+/-0.100 | 4.279+/-0.216 | -2.7%  | -10.5%  | 0      | 1.000x | 3110.7  |
>> | stride64_sparse | 3.103+/-0.047 | 3.119+/-0.086 | +0.5%  |  +6.2%  | 0      | 0.999x |    0.0  |
>> | random64_sparse | 3.142+/-0.112 | 3.097+/-0.030 | -1.4%  |  -2.2%  | 0      | 0.999x |    0.1  |
>> | random64_full   | 4.473+/-0.147 | 4.445+/-0.088 | -0.6%  |  +0.9%  | 0      | 1.000x |    0.4  |
>> +-----------------+---------------+---------------+--------+---------+--------+--------+---------+
>>
>> This run uses a real block swap device, but the refaulted data stayed in
>> zswap. It covers the all-zswap hit path with disk swap configured, not disk
>> read IO.
>>
>> Virtio-block pressure/mixed run, zswap max_pool_percent=1,
>> low-compressibility full fill:
>>
>> +-------------------------------+---------------+---------------+--------+---------+-----------------+------------+---------+----------+
>> | workload                      | base elapsed  | RFC elapsed   | delta  | refault | pswpin base/RFC | RFC zswpin | RFC 64K | fallback |
>> +-------------------------------+---------------+---------------+--------+---------+-----------------+------------+---------+----------+
>> | seq64_full_pressure           | 5.908+/-0.195 | 5.790+/-0.235 | -2.0%  |  +3.0%  | 90258/99038     | 20327      |   0.0   | 3730     |
>> | random64_sparse_full_pressure | 5.104+/-0.069 | 5.068+/-0.090 | -0.7%  |  -9.1%  |  6201/6196      |  1297      |   0.0   |    0     |
>> +-------------------------------+---------------+---------------+--------+---------+-----------------+------------+---------+----------+
>>
>> This run reaches the disk-backed path: pswpin is non-zero in both base and
>> RFC. It is mainly fallback coverage. The RFC does not install 64K folios
>> for these disk/mixed-heavy ranges.
> 
> OK, these results above look good. Basically, if we don't have spatial
> locality in access patterns, we don't do THP zswapin. Nice.
> 
>>
>> Policy matrix, virtio-block swap device present:
>>
>> +------------------------------+----+------+--------+--------+-------+----------+
>> | case                         | pc | hint | pswpin | zswpin | zswpwb| 64K in   |
>> +------------------------------+----+------+--------+--------+-------+----------+
>> | pc0_seq                      | 0  | none | 0      | 99559  | 0     | 0        |
>> | pc3_seq                      | 3  | none | 0      | 99498  | 0     | 0        |
>> | pc4_seq                      | 4  | none | 0      | 99512  | 0     | 3109     |
>> | pc5_seq                      | 5  | none | 0      | 99657  | 0     | 3113     |
>> | hint_none_random_sparse      | 5  | none | 0      |  6265  | 0     | 0        |
>> | hint_random_seq              | 5  | rand | 0      | 99488  | 0     | 0        |
>> | mixed_seq_full               | 5  | none | 97725  | 20147  | 84    | 569      |
>> | mixed_random_sparse_full     | 5  | none |  6230  |  1302  | 0     | 0        |
>> +------------------------------+----+------+--------+--------+-------+----------+
>>
>> The pc rows show the readahead-window gate. The hint_random_seq row shows
>> the explicit random hint veto. The mixed rows use a small zswap pool to
>> force disk/zswap split backing; most mixed ranges are rejected, while any
>> remaining 64K successes were all-zswap at the time of the fault.
>>
>> Kbuild pressure, zram swap, 384M memcg:
>>
>> +----------------------+----------+----------+--------+--------+----------+
>> | setup                | base     | RFC      | delta  | zswpin | RFC 64K  |
>> +----------------------+----------+----------+--------+--------+----------+
>> | zram swap, 384M memcg| 2060.323 | 2047.516 | -0.6%  | 0.991x | 2797     |
>> +----------------------+----------+----------+--------+--------+----------+
>>
>> This is a single-run zram pressure smoke test. It did not show a Kbuild
>> regression, and the RFC run installed 64K zswap-backed folios. The result
>> should not be read as a tuned-performance claim.
>>
>> Kbuild pressure, virtio-block swap device, 512M memcg:
>>
>> +-------------------------+----------+----------+--------+--------+----------+
>> | setup                   | base     | RFC      | delta  | pswpin | RFC 64K  |
>> +-------------------------+----------+----------+--------+--------+----------+
>> | disk swap, 512M memcg   | 1420.671 | 1409.263 | -0.8%  | 0      | 7497     |
>> +-------------------------+----------+----------+--------+--------+----------+
>>
>> This is a single-run pressure smoke. The disk-swap Kbuild run also stayed
>> on the all-zswap hit path, so it is pressure coverage with a disk swap device
>> present rather than proof of disk-read large swapin.
> 
> Why a single-run?

I did run Kbuild a few times while debugging the series and did not see a
significant difference either way. Because of that I only kept one fresh run
with the final tree before sending the RFC, so this should be read only as a
smoke test, not as performance evidence.

For the next version I will rerun Kbuild properly with multiple fresh
iterations and report it, so it can be used as a more reliable
performance comparison instead of just smoke coverage.

> 
>>
>> Shmem smoke, tmpfs huge=always, 64K shmem mTHP:
>>
>> +----------------------------+---------------+---------+-------------+----------+
>> | case                       | refault hint  | touched | 64K shmem   | 64K in   |
>> +----------------------------+---------------+---------+-------------+----------+
>> | nohint_seq                 | none          | 65536   | 4096        | 0        |
>> | seq_refault_hint           | sequential    | 65536   | 4096        | 4096     |
>> | random_refault_hint_sparse | random        |  4096   | 4096        | 0        |
>> +----------------------------+---------------+---------+-------------+----------+
>>
>> That matches the current shmem producer: explicit sequential refault hints
>> allow large zswap swapin; no hint and random hints do not.
>>
>> What this RFC does not establish
>> ================================
>>
>> The 64K cap is deliberate, but it is not tuned. The anon PTE-young rule is
>> only anon evidence. Shmem has the framework and explicit VMA hints in this
>> RFC, not a page-cache locality producer. For larger orders, the anon
>> producer should probably use bounded sampling instead of walking every PTE
>> in a 1M or larger candidate range. The mixed-backend tests cover fallback
>> behavior, but this series does not add mixed zswap/disk large IO.
> 
> The mixed IO can be deferred, but I think we should figure out a rule
> to extend this hint to arbitrarily sized ranges, and preferably shmem
> too.

That makes sense.

The current 64K cap was intentionally conservative, but the locality rule is
too tied to that size. For v3 I will look at making the admission rule
order-independent, probably with bounded sampling rather than walking every
PTE for larger ranges.
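
To make the bounded-sampling idea concrete, here is a minimal userspace
sketch, not the v3 code. Everything in it is an assumption for illustration:
the SAMPLE_BUDGET value, the three-quarters threshold, and the
admit_range_sampled() name are hypothetical, and the young[] array only
models the per-slot PTE-young bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap on how many slots are inspected per candidate range. */
#define SAMPLE_BUDGET 16u

/*
 * Inspect at most SAMPLE_BUDGET evenly spaced slots of the range and admit
 * it if at least three quarters of the sampled slots are young.
 * young[i] models the PTE-young bit of slot i; it is not a kernel API.
 */
static bool admit_range_sampled(const bool *young, size_t nr_pages)
{
	size_t samples, stride, i, hits = 0;

	assert(young && nr_pages > 0);

	samples = nr_pages < SAMPLE_BUDGET ? nr_pages : SAMPLE_BUDGET;
	stride = nr_pages / samples;	/* >= 1 because samples <= nr_pages */

	for (i = 0; i < samples; i++)
		hits += young[i * stride];

	return hits * 4 >= samples * 3;
}
```

For a 64K range (16 base pages) this still looks at every slot, while a 1M
range (256 base pages) is sampled at a stride of 16, so the cost stays
constant as the order grows.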

For shmem, this RFC only uses explicit VMA hints, so it does not yet have a
real page-cache locality producer. I will think through how to add a shmem
producer with similar semantics, so the rule is not anon-only.

Thanks,
Fujunjie




* Re: [RFC PATCH v2 4/9] mm: admit large swapin by backend range in swapin_sync()
From: Fujunjie @ 2026-05-31 12:21 UTC (permalink / raw)
  To: Kairui Song
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Usama Arif, Chris Li,
	Johannes Weiner, Yosry Ahmed, Nhat Pham, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <CAMgjq7AA_1esgtA8VyxaBLWBBRM12bCBpxO2Jch5OESBZSg--A@mail.gmail.com>



On 5/29/2026 10:43 PM, Kairui Song wrote:
> On Fri, May 29, 2026 at 8:26 PM fujunjie <fujunjie1@qq.com> wrote:
>>
>> A large swapin can only read one folio when the whole range has compatible
>> backing. Mixed zswap/disk ranges must not reach large-folio IO, and zswap
>> range probes are only snapshots.
>>
>> Filter the orders passed to swap_cache_alloc_folio() in swapin_sync().
>> Uniform zeromap ranges and all-disk ranges keep the existing large swapin
>> path. Fully zswap-backed ranges may be tried. Mixed zswap/disk ranges fall
>> back before allocation.
>>
>> After a large swapcache folio is installed, recheck the zswap range and
>> drop the fresh folio if it became mixed. Also consume -EAGAIN from
>> swap_read_folio() the same way. Both cases retry order-0, where each slot
>> can resolve its current backend independently.
>>
>> Signed-off-by: fujunjie <fujunjie1@qq.com>
>> ---
>>  mm/memcontrol-v1.c |   8 ++-
>>  mm/memory.c        |  31 ++++++++-
>>  mm/swap_state.c    | 169 ++++++++++++++++++++++++++++++++++++++++++---
>>  3 files changed, 194 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
>> index 765069211567..5b11b8055c66 100644
>> --- a/mm/memcontrol-v1.c
>> +++ b/mm/memcontrol-v1.c
>> @@ -682,8 +682,8 @@ void __memcg1_swapout(struct folio *folio, struct swap_cluster_info *ci)
>>   * memcg1_swapin - uncharge swap slot on swapin
>>   * @folio: folio being swapped in
>>   *
>> - * Call this function after successfully adding the charged
>> - * folio to swapcache.
>> + * Call this after the charged folio has been added to swapcache and the caller
>> + * is no longer going to drop it back to swapped-out state.
>>   *
>>   * Context: The folio has to be in swap cache and locked.
>>   */
>> @@ -721,7 +721,9 @@ void memcg1_swapin(struct folio *folio)
>>         id = __swap_cgroup_clear(ci, swp_cluster_offset(folio->swap),
>>                                  nr_pages);
>>         swap_cluster_unlock(ci);
>> -       mem_cgroup_uncharge_swap(id, nr_pages);
>> +
>> +       if (id)
>> +               mem_cgroup_uncharge_swap(id, nr_pages);
>>  }
>>  #endif
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 5a365492a9a2..d73a19692dea 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4538,6 +4538,24 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
>>                 folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
>>  }
>>
>> +static void memcg1_swapin_retry_folio(struct folio *folio,
>> +                                     struct vm_fault *vmf)
>> +{
>> +       if (!folio_test_large(folio) || !folio_test_swapcache(folio))
>> +               return;
>> +
>> +       if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
>> +               if (!folio_trylock(folio))
>> +                       return;
>> +       } else {
>> +               folio_lock(folio);
>> +       }
>> +
>> +       if (folio_test_large(folio) && folio_test_swapcache(folio))
>> +               memcg1_swapin(folio);
>> +       folio_unlock(folio);
>> +}
>> +
>>  static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
>>  {
>>         vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>> @@ -4857,8 +4875,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>
>>         swapcache = folio;
>>         ret |= folio_lock_or_retry(folio, vmf);
>> -       if (ret & VM_FAULT_RETRY)
>> +       if (ret & VM_FAULT_RETRY) {
>> +               memcg1_swapin_retry_folio(folio, vmf);
>>                 goto out_release;
>> +       }
>>
>>         page = folio_file_page(folio, swp_offset(entry));
>>         /*
>> @@ -5067,6 +5087,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>         if (unlikely(folio != swapcache)) {
>>                 folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>>                 folio_add_lru_vma(folio, vma);
>> +               if (folio_test_large(swapcache))
>> +                       memcg1_swapin(swapcache);
>>                 folio_put_swap(swapcache, NULL);
>>         } else if (!folio_test_anon(folio)) {
>>                 /*
>> @@ -5076,6 +5098,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>                 VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
>>                 VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
>>                 folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>> +               if (folio_test_large(folio))
>> +                       memcg1_swapin(folio);
>>                 folio_put_swap(folio, NULL);
>>         } else {
>>                 VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
>> @@ -5132,8 +5156,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>         if (vmf->pte)
>>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  out_page:
>> -       if (folio_test_swapcache(folio))
>> +       if (folio_test_swapcache(folio)) {
>> +               if (folio_test_large(folio))
>> +                       memcg1_swapin(folio);
>>                 folio_free_swap(folio);
>> +       }
>>         folio_unlock(folio);
>>  out_release:
>>         folio_put(folio);
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index d37097913b30..f03ad4832f16 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -21,6 +21,7 @@
>>  #include <linux/migrate.h>
>>  #include <linux/vmalloc.h>
>>  #include <linux/huge_mm.h>
>> +#include <linux/zswap.h>
>>  #include <linux/shmem_fs.h>
>>  #include "internal.h"
>>  #include "swap_table.h"
>> @@ -403,7 +404,8 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
>>  static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
>>                                         swp_entry_t targ_entry, gfp_t gfp,
>>                                         unsigned int order, struct vm_fault *vmf,
>> -                                       struct mempolicy *mpol, pgoff_t ilx)
>> +                                       struct mempolicy *mpol, pgoff_t ilx,
>> +                                       bool defer_memcg1_swapin)
> 
> Hi Fujunjie,
> 
> Thanks for the update, but this whole defer_memcg1_swapin thing is so
> ugly I don't think this is the right way at all.
> 
> If you really need this, maybe you can always defer the memcg1
> uncharge, I don't see why we need to treat large folio differently.
> This charge doesn't affect the memory pressure, the reason we uncharge
> memcg1's swap counter is to avoid long pinning swap cache holding the
> swap cache of a cgroup so the cgroup will no longer be able to swap
> out more folios. Deferring it won't hurt.

Yes, I think you are right.

I added defer_memcg1_swapin because I was still treating the freshly
allocated large swapcache folio as something that might be dropped after it
was installed, so I tried to avoid clearing the cgroup v1 swap owner too
early.

Nhat Pham also pointed out that this is probably the wrong model. Once the whole
range is covered by the large swapcache folio, zswap writeback should not be
able to turn one subslot into disk-backed state, since it has to allocate an
order-0 swapcache folio first and that should fail.

So the deferred memcg1 handling is likely self-inflicted complexity. I'll
drop this flag and rework the mixed-backend check so we fail the current
order before we need this late abort path. If any memcg1 timing issue remains
after that, I'll try to handle it with a uniform rule rather than a
large-folio-specific flag.
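
A rough userspace model of what "fail the current order early" could look
like, under my own assumptions. The enum names, MAX_ORDER_MODEL,
probe_range() and pick_swapin_order() are all hypothetical stand-ins and do
not match the kernel helpers; the slots[] array models which backend
currently holds each swap slot.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER_MODEL 4u	/* hypothetical cap: 16 slots, i.e. 64K */

enum slot_backend { SLOT_DISK, SLOT_ZSWAP };
enum range_state { RANGE_NO_ZSWAP, RANGE_ALL_ZSWAP, RANGE_MIXED };

/* Classify [start, start + nr) by how many slots are held in zswap. */
static enum range_state probe_range(const enum slot_backend *slots,
				    size_t start, size_t nr)
{
	size_t i, in_zswap = 0;

	for (i = 0; i < nr; i++)
		in_zswap += slots[start + i] == SLOT_ZSWAP;

	if (in_zswap == 0)
		return RANGE_NO_ZSWAP;
	return in_zswap == nr ? RANGE_ALL_ZSWAP : RANGE_MIXED;
}

/*
 * Walk the allowed orders from largest to smallest and return the first
 * one whose naturally aligned range is not mixed. Order 0 always works
 * because a single slot has exactly one backend.
 */
static unsigned int pick_swapin_order(const enum slot_backend *slots,
				      size_t nr_slots, size_t offset,
				      unsigned long orders)
{
	unsigned int order;

	assert(slots && offset < nr_slots);

	for (order = MAX_ORDER_MODEL; order > 0; order--) {
		size_t nr = (size_t)1 << order;
		size_t start = offset & ~(nr - 1);

		if (!(orders & (1UL << order)))
			continue;
		if (start + nr > nr_slots)
			continue;
		if (probe_range(slots, start, nr) != RANGE_MIXED)
			return order;
	}
	return 0;
}
```

The point of the model is only that the mixed check happens while the order
is being chosen, so no large folio is installed and then dropped.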
> 
>>  {
>>         int err;
>>         swp_entry_t entry;
>> @@ -466,7 +468,8 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
>>         }
>>
>>         /* memsw uncharges swap when folio is added to swap cache */
>> -       memcg1_swapin(folio);
>> +       if (!defer_memcg1_swapin || !order)
>> +               memcg1_swapin(folio);
>>         if (shadow)
>>                 workingset_refault(folio, shadow);
>>
>> @@ -495,9 +498,12 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
>>   * Return: Returns the folio if allocation succeeded and folio is in the swap
>>   * cache. Returns error code if failed due to race, OOM or invalid arguments.
>>   */
>> -struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
>> -                                    unsigned long orders, struct vm_fault *vmf,
>> -                                    struct mempolicy *mpol, pgoff_t ilx)
>> +static struct folio *__swap_cache_alloc_folio(swp_entry_t targ_entry,
>> +                                             gfp_t gfp, unsigned long orders,
>> +                                             struct vm_fault *vmf,
>> +                                             struct mempolicy *mpol,
>> +                                             pgoff_t ilx,
>> +                                             bool defer_memcg1_swapin)
>>  {
>>         int order, err;
>>         struct folio *ret;
>> @@ -512,7 +518,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
>>
>>         do {
>>                 ret = __swap_cache_alloc(ci, targ_entry, gfp, order,
>> -                                        vmf, mpol, ilx);
>> +                                        vmf, mpol, ilx,
>> +                                        defer_memcg1_swapin);
>>                 if (!IS_ERR(ret))
>>                         break;
>>                 err = PTR_ERR(ret);
>> @@ -525,6 +532,124 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
>>         return ret;
>>  }
>>
>> +struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
>> +                                    unsigned long orders, struct vm_fault *vmf,
>> +                                    struct mempolicy *mpol, pgoff_t ilx)
>> +{
>> +       return __swap_cache_alloc_folio(targ_entry, gfp, orders, vmf,
>> +                                       mpol, ilx, false);
>> +}
>> +
>> +static struct folio *swap_cache_alloc_speculative_folio(swp_entry_t targ_entry,
>> +                                                       gfp_t gfp,
>> +                                                       unsigned long orders,
>> +                                                       struct vm_fault *vmf,
>> +                                                       struct mempolicy *mpol,
>> +                                                       pgoff_t ilx)
>> +{
>> +       /*
>> +        * Speculative large swapin may drop this fresh swapcache folio and
>> +        * retry order-0 after backend or page-table revalidation. Keep the
>> +        * cgroup v1 memsw swap owner until the caller commits the folio.
>> +        */
>> +       return __swap_cache_alloc_folio(targ_entry, gfp, orders, vmf,
>> +                                       mpol, ilx, true);
>> +}
>> +
>> +static bool swapin_zeromap_same(swp_entry_t entry, unsigned int nr_pages)
>> +{
>> +       unsigned int ci_start = swp_cluster_offset(entry);
>> +       struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
>> +       bool is_zero;
>> +       unsigned int i;
>> +
>> +       if (ci_start + nr_pages > SWAPFILE_CLUSTER) {
>> +               VM_WARN_ON_ONCE(1);
>> +               return false;
>> +       }
>> +
>> +       rcu_read_lock();
>> +       if (!rcu_dereference(ci->table)) {
>> +               rcu_read_unlock();
>> +               return true;
>> +       }
>> +
>> +       is_zero = __swap_table_test_zero(ci, ci_start);
>> +       for (i = 1; i < nr_pages; i++) {
>> +               if (is_zero != __swap_table_test_zero(ci, ci_start + i)) {
>> +                       rcu_read_unlock();
>> +                       return false;
>> +               }
>> +       }
>> +       rcu_read_unlock();
>> +
>> +       return true;
>> +}
>> +
>> +static unsigned long swapin_admit_orders(swp_entry_t entry,
>> +                                        unsigned long orders)
> 
> And this swapin_admit_orders chunk doesn't look good either...

Yes, this helper is doing too much.
I wanted to keep mixed zswap/disk ranges away from large-folio IO, but this
ended up mixing policy with range feasibility checks.

> 
>> +{
>> +       unsigned long candidates = orders & ~BIT(0);
>> +       unsigned long admitted = orders & BIT(0);
>> +       int order;
>> +
>> +       if (!candidates)
>> +               return orders;
>> +
>> +       while (candidates) {
>> +               enum zswap_range_state state;
>> +               unsigned int nr_pages;
>> +               swp_entry_t range_entry;
>> +               bool admit = false;
>> +
>> +               order = fls_long(candidates) - 1;
>> +               if (order > MAX_PAGE_ORDER) {
>> +                       candidates &= ~BIT(order);
>> +                       continue;
>> +               }
>> +
>> +               nr_pages = 1U << order;
>> +               range_entry = swp_entry(swp_type(entry),
>> +                                       round_down(swp_offset(entry), nr_pages));
>> +               if (!swapin_zeromap_same(range_entry, nr_pages))
>> +                       goto next;
> 
> I think you don't need to test zeromap at all? __swap_cache_alloc
> handles that already.

Sorry, I missed that this is already covered by __swap_cache_add_check().
I'll drop the explicit zeromap scan in v3.

> 
>> +
>> +               state = zswap_probe_range(range_entry, nr_pages);
> 
> If you just move the zswap_probe_range into __swap_cache_alloc and do
> fallback there (or maybe you can shrink the order faster), then these
> two new helpers are all redundant.
> 
>> +               switch (state) {
>> +               case ZSWAP_RANGE_MIXED:
>> +                       break;
>> +               case ZSWAP_RANGE_ALL_ZSWAP:
>> +               case ZSWAP_RANGE_NEVER_ENABLED:
>> +               case ZSWAP_RANGE_NO_ZSWAP:
>> +                       admit = true;
>> +                       break;
>> +               }
>> +
>> +next:
>> +               if (admit)
>> +                       admitted |= BIT(order);
>> +               else
>> +                       count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
>> +               candidates &= ~BIT(order);
>> +       }
>> +
>> +       return admitted ? admitted : BIT(0);
>> +}
>> +
>> +static bool zswap_needs_order0_retry(struct folio *folio)
>> +{
>> +       if (!folio_test_large(folio))
>> +               return false;
>> +
>> +       /*
>> +        * Admission sees only an advisory zswap snapshot. Recheck after the
>> +        * large swapcache folio is installed; if the range became mixed, drop
>> +        * the fresh folio before IO and let order-0 handle each slot.
>> +        */
>> +       return zswap_probe_range(folio->swap, folio_nr_pages(folio)) ==
>> +              ZSWAP_RANGE_MIXED;
>> +}
>> +
> 
> Again, I think you can just probe the suitable size in
> __swap_cache_alloc directly. That way, we avoid the divergence of sync /
> non-sync devices and avoid this whole chunk, making the code much
> simpler too, just like what we are already doing for the zero map in
> __swap_cache_alloc. Or am I oversimplifying it?

Thanks for your review! I will try it.



* [PATCH v6] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
From: Qiliang Yuan @ 2026-05-31  9:52 UTC (permalink / raw)
  To: Christian Koenig, Huang Rui, Matthew Auld, Matthew Brost,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Tejun Heo, Johannes Weiner, Michal Koutný,
	Natalie Vock
  Cc: dri-devel, linux-kernel, cgroups, Qiliang Yuan

The dmem cgroup v2 controller currently only provides a hard "max"
limit, which causes immediate allocation failures when a cgroup's
device memory usage reaches its quota.  GPU-bound AI workloads need
smoother over-subscription support: a soft limit that temporarily
allows excess usage while applying backpressure through reclaim
rather than outright failure.

Add dmem.high, a soft limit that penalizes over-limit cgroups by
evicting their buffer objects first when eviction is triggered (e.g.
due to a "max" limit hit).  Unlike the rejected v1 approach which
used sleep-on-allocation throttling, this version provides a
meaningful recovery action through prioritized reclaim.

Expose "high" as a new cgroupfs control file per region via
set_resource_high() and get_resource_high(), and initialize it to
PAGE_COUNTER_MAX in reset_all_resource_limits().  Like get_resource_max(),
get_resource_high() returns PAGE_COUNTER_MAX when the pool is NULL.

Extend dmem_cgroup_state_evict_valuable() with a "try_high"
parameter.  When set, the function walks the page_counter parent
chain to check whether any ancestor exceeds its high limit, and
verifies that the pool is above its effective minimum to respect
dmem.min protection.  For the limit-hitting cgroup's own BOs, the
ancestry check is skipped but the high threshold still applies.
When CONFIG_CGROUP_DMEM is disabled, the stub returns false in
try_high mode so the first pass has no effect.
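
For illustration only, the try_high decision described above can be modelled
in userspace roughly as follows. struct counter is a simplified stand-in for
struct page_counter, the emin field is assumed to be precomputed (the kernel
derives it via dmem_cgroup_calculate_protection()), and
evictable_in_high_pass() is a hypothetical name, not a function in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct page_counter. */
struct counter {
	unsigned long long usage;
	unsigned long long high;
	unsigned long long emin;	/* assumed precomputed effective min */
	const struct counter *parent;
};

/*
 * Model of the first-pass check. own_pool is true when the tested pool is
 * the one that hit its limit: the ancestry walk is skipped and only the
 * pool's own high limit applies. Otherwise the pool qualifies if it or any
 * ancestor is above high, and its usage is above the effective minimum.
 */
static bool evictable_in_high_pass(const struct counter *c, bool own_pool)
{
	const struct counter *p;

	assert(c);

	if (own_pool)
		return c->usage > c->high;

	for (p = c; p; p = p->parent)
		if (p->usage > p->high)
			break;
	if (!p)
		return false;	/* nothing in the chain is over high */

	return c->usage > c->emin;
}
```

In this model a child that is under its own high limit still qualifies when
its parent is over, unless the child's usage is at or below its effective
minimum.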

Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy.
Pass 1 targets only BOs whose cgroup exceeds dmem.high, using a
blocking lock when a ticket is available or trylock otherwise.
Pass 2 falls back to the standard above-elow trylock eviction.
Pass 3+ uses proper locking and repeats while making progress
with the existing low-watermark fallback.
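
The pass ordering can be sketched with a small userspace model. This is a
simplification under stated assumptions: struct bo_model, walk() and
evict_one() are hypothetical names, each walk evicts at most one buffer
object, and the "repeat while making progress" loop of Pass 3+ is reduced to
a single attempt plus the low-watermark retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a buffer object on the LRU. */
struct bo_model {
	bool over_high;		/* its cgroup chain exceeds dmem.high */
	bool protected_low;	/* usage is within the effective low */
	bool busy;		/* a trylock on it would fail */
	bool evicted;
};

/* One LRU walk; returns the index of the evicted BO or -1. */
static int walk(struct bo_model *bos, size_t n, bool try_high,
		bool try_low, bool trylock_only)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct bo_model *b = &bos[i];

		if (b->evicted)
			continue;
		if (try_high && !b->over_high)
			continue;
		if (!try_high && b->protected_low && !try_low)
			continue;
		if (b->busy && trylock_only)
			continue;
		b->evicted = true;
		return (int)i;
	}
	return -1;
}

static int evict_one(struct bo_model *bos, size_t n, bool have_ticket)
{
	int hit;

	/* Pass 1: over-high only; blocking lock only with a ticket. */
	hit = walk(bos, n, true, false, !have_ticket);
	if (hit >= 0)
		return hit;

	/* Pass 2: trylock, respecting low first, then ignoring it. */
	hit = walk(bos, n, false, false, true);
	if (hit < 0)
		hit = walk(bos, n, false, true, true);
	if (hit >= 0 || !have_ticket)
		return hit;

	/* Pass 3: proper locking, again with the low fallback. */
	hit = walk(bos, n, false, false, false);
	if (hit < 0)
		hit = walk(bos, n, false, true, false);
	return hit;
}
```

In this model a busy over-high BO is picked first when a ticket is
available, and without a ticket the walk falls through to Pass 2 and picks
an idle BO instead.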

Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
---
Introduce a "high" soft limit for the dmem cgroup v2 controller.
When a "max" limit is hit and eviction is triggered, buffer objects
belonging to cgroups that exceed their dmem.high limit are targeted
first, providing a meaningful recovery action through reclaim.

The dmem cgroup currently only supports hard "max" limits, which
cause immediate allocation failures for GPU-bound workloads. A soft
limit enables smoother over-subscription by penalizing over-limit
cgroups via prioritized eviction rather than outright rejection.

The implementation adds a "high" cgroupfs control file per region,
a try_high parameter to dmem_cgroup_state_evict_valuable() for
tier-1 eviction, and a 3-pass strategy in ttm_bo_evict_alloc().
---
V5 -> V6:
- Guard the try_high dereference of test_pool->cnt with a NULL check
  to prevent a kernel panic during global memory pressure eviction
  when a BO has no associated cgroup.
- Make the disabled-cgroup stub for dmem_cgroup_state_evict_valuable()
  return false in try_high mode so the stub does not incorrectly
  enable Pass 1 when CONFIG_CGROUP_DMEM=n.

V4 -> V5:
- Restore the original control flow in dmem_cgroup_state_evict_valuable():
  test_pool is no longer dereferenced before the ancestry checks, fixing
  a NULL pointer dereference on BOs without a cgroup.  The limit_pool
  NULL-to-root-cgroup resolution is now performed before the try_high
  block, fixing a panic during global memory pressure eviction.
- Keep the try_high check for limit_pool == test_pool inside the existing
  early-return branch to avoid bypassing the hierarchy constraint check
  that prevents cross-cgroup eviction.
- Use a blocking lock in Pass 1 only when a ticket is available
  (trylock otherwise), addressing the deadlock risk of blocking without
  a valid ww_acquire_ctx.
- Explicitly reset trylock_only to true before Pass 2 so it does not
  inherit Pass 1's blocking behavior.

V3 -> V4:
- Use a blocking lock in Pass 1 instead of trylock to ensure
  over-limit cgroups are penalized even when their BOs are actively
  in use, as requested by Maarten Lankhorst.
- Evaluate the try_high condition before the limit_pool == test_pool
  early-return so that the limit-hitting cgroup's own BOs are also
  filtered by dmem.high.
- Remove the high-priority compensation retry at the start of Pass 3,
  which is no longer needed now that Pass 1 uses a blocking lock.

V2 -> V3:
- Walk the page_counter parent chain in the try_high pass to prevent
  child cgroups from evading the penalty when a parent cgroup exceeds
  its dmem.high limit.
- Check dmem.min protection in the try_high pass to avoid evicting
  BOs below the effective minimum.
- Add a properly-locked high-priority retry at the beginning of Pass 3
  so that actively-used over-limit BOs (which failed trylock in Pass 1)
  are not skipped while innocent cgroups are evicted.
- Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX
  to match the behavior of get_resource_max().

V1 -> V2:
- Replace sleep-on-allocation throttling with prioritized eviction.
  When a "max" limit is hit, BOs from cgroups exceeding dmem.high are
  evicted first in a dedicated pass. No throttling or sleeping is
  performed in the charge path.
- Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
  resume_user_mode_work() integration) entirely.
- Add dmem.high cgroupfs control file per region.
- Extend dmem_cgroup_state_evict_valuable() with try_high parameter
  to target over-limit cgroups as tier-1 eviction.
- Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
  (1) trylock: evict only BOs exceeding dmem.high
  (2) trylock: above-elow
  (3) proper-lock: repeat with low fallback.
- Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().

v5: https://lore.kernel.org/r/20260531-feature-dmem-high-v5-1-1c6c532b26a9@gmail.com
v4: https://lore.kernel.org/r/20260530-feature-dmem-high-v4-1-ee7c6ec1c8da@gmail.com
v3: https://lore.kernel.org/r/20260528-feature-dmem-high-v3-1-c642b34bcb2f@gmail.com
v2: https://lore.kernel.org/r/20260522-feature-dmem-high-v2-1-1d7d4a0fa5da@gmail.com
v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@gmail.com
---
 drivers/gpu/drm/ttm/ttm_bo.c | 33 ++++++++++++++----
 include/linux/cgroup_dmem.h  |  6 ++--
 kernel/cgroup/dmem.c         | 79 +++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 105 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f02..21fe34fd43eec 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {
 
 	/** @limit_pool: Which pool limit we should test against */
 	struct dmem_cgroup_pool_state *limit_pool;
+	/** @try_high: Whether to only evict BO's above the high watermark (first pass) */
+	bool try_high;
 	/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
 	bool try_low;
 	/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
@@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 	s64 lret;
 
 	if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
-					      evict_walk->try_low, &evict_walk->hit_low))
+					      evict_walk->try_high, evict_walk->try_low,
+					      &evict_walk->hit_low))
 		return 0;
 
 	if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
@@ -577,31 +580,47 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
 	};
 	s64 lret;
 
-	evict_walk.walk.arg.trylock_only = true;
+	/*
+	 * Pass 1 (high-priority): Evict only BOs whose cgroup exceeds its
+	 * dmem.high soft limit.  A blocking lock is used when a ticket is
+	 * available to ensure over-limit cgroups are penalized even when
+	 * their BOs are actively in use; trylock otherwise.
+	 */
+	evict_walk.walk.arg.trylock_only = !ticket;
+	evict_walk.try_high = true;
 	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+	evict_walk.try_high = false;
+	if (lret)
+		goto out;
 
-	/* One more attempt if we hit low limit? */
+	/*
+	 * Pass 2 (trylock): Evict BOs above the effective low watermark.
+	 * Falls back to low-priority eviction if needed.
+	 */
+	evict_walk.walk.arg.trylock_only = true;
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	if (!lret && evict_walk.hit_low) {
 		evict_walk.try_low = true;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	}
+
 	if (lret || !ticket)
 		goto out;
 
-	/* Reset low limit */
+	/*
+	 * Pass 3+ (properly locked): Evict while making progress.
+	 * Reset flags and retry with try_low if we hit the low watermark.
+	 */
 	evict_walk.try_low = evict_walk.hit_low = false;
-	/* If ticket-locking, repeat while making progress. */
 	evict_walk.walk.arg.trylock_only = false;
 
 retry:
 	do {
-		/* The walk may clear the evict_walk.walk.ticket field */
 		evict_walk.walk.arg.ticket = ticket;
 		evict_walk.evicted = 0;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	} while (!lret && evict_walk.evicted);
 
-	/* We hit the low limit? Try once more */
 	if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
 		evict_walk.try_low = true;
 		goto retry;
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..3f7278cb290b3 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low);
+				      bool try_high, bool ignore_low, bool *ret_hit_low);
 
 void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
 #else
@@ -54,8 +54,10 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
 static inline
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
+	if (try_high)
+		return false;
 	return true;
 }
 
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..4267309e6b01d 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
 	page_counter_set_low(&pool->cnt, val);
 }
 
+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+	page_counter_set_high(&pool->cnt, val);
+}
+
 static void
 set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
 {
@@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
 	return pool ? READ_ONCE(pool->cnt.low) : 0;
 }
 
+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+	return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
+}
+
 static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
 {
 	return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
 	set_resource_min(rpool, 0);
 	set_resource_low(rpool, 0);
+	set_resource_high(rpool, PAGE_COUNTER_MAX);
 	set_resource_max(rpool, PAGE_COUNTER_MAX);
 }
 
@@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  * dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
  * @limit_pool: The pool for which we hit limits
  * @test_pool: The pool for which to test
+ * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
  * @ignore_low: Whether we have to respect low watermarks.
  * @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
  *
  * This function returns true if we can evict from @test_pool, false if not.
+ * When @try_high is set, only pools with usage above their high limit are
+ * evictable, enabling prioritized eviction of over-limit cgroups.
  * When returning false and @ignore_low is false, @ret_hit_low may
  * be set to true to indicate this function can be retried with @ignore_low
  * set to true.
@@ -301,15 +316,26 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  */
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	struct dmem_cgroup_pool_state *pool = test_pool;
 	struct page_counter *ctest;
 	u64 used, min, low;
 
-	/* Can always evict from current pool, despite limits */
-	if (limit_pool == test_pool)
+	/*
+	 * When the limit-hitting cgroup's own BOs are being considered
+	 * in try_high mode, only evict them if their pool exceeds its
+	 * own dmem.high limit.  For non-try_high mode, maintain the
+	 * existing behavior: always evict from the limit-hitting pool.
+	 */
+	if (limit_pool == test_pool) {
+		if (try_high && test_pool) {
+			ctest = &test_pool->cnt;
+			used = page_counter_read(ctest);
+			return used > READ_ONCE(ctest->high);
+		}
 		return true;
+	}
 
 	if (limit_pool) {
 		if (!parent_dmemcs(limit_pool->cs))
@@ -330,10 +356,38 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 	}
 
 	ctest = &test_pool->cnt;
+	used = page_counter_read(ctest);
+
+	if (try_high) {
+		struct page_counter *c;
+
+		/*
+		 * Walk the page_counter parent chain to check whether any
+		 * ancestor cgroup exceeds its dmem.high limit.  This prevents
+		 * child cgroups from evading the penalty when a parent cgroup
+		 * is over its high limit.
+		 */
+		if (used <= READ_ONCE(ctest->high)) {
+			for (c = ctest->parent; c; c = c->parent) {
+				if (page_counter_read(c) > READ_ONCE(c->high))
+					break;
+			}
+			if (!c)
+				return false;
+		}
+
+		/*
+		 * Respect dmem.min protection: do not evict BOs below the
+		 * effective minimum even during the high-priority pass.
+		 */
+		dmem_cgroup_calculate_protection(limit_pool, test_pool);
+		min = READ_ONCE(ctest->emin);
+
+		return used > min;
+	}
 
 	dmem_cgroup_calculate_protection(limit_pool, test_pool);
 
-	used = page_counter_read(ctest);
 	min = READ_ONCE(ctest->emin);
 
 	if (used <= min)
@@ -835,6 +889,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
 	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
 }
 
+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+	return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
 static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
 {
 	return dmemcg_limit_show(sf, v, get_resource_max);
@@ -868,6 +933,12 @@ static struct cftype files[] = {
 		.seq_show = dmem_cgroup_region_low_show,
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
+	{
+		.name = "high",
+		.write = dmem_cgroup_region_high_write,
+		.seq_show = dmem_cgroup_region_high_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 	{
 		.name = "max",
 		.write = dmem_cgroup_region_max_write,

---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38

Best regards,
-- 
Qiliang Yuan <realwujing@gmail.com>


^ permalink raw reply related

* Re: [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
From: Bert Karwatzki @ 2026-05-31  9:19 UTC (permalink / raw)
  To: Mark Brown, Tejun Heo
  Cc: Johannes Weiner, spasswolf@web.de, Michal Koutný,
	Sebastian Andrzej Siewior, Petr Malat, kernel test robot,
	Martin Pitt, cgroups, linux-kernel, Aishwarya.TCV
In-Reply-To: <fd72aa26-4fed-4fcb-b4b1-d7ce9d891fe4@sirena.org.uk>

Am Freitag, dem 29.05.2026 um 22:08 +0100 schrieb Mark Brown:
> On Fri, May 29, 2026 at 07:25:29AM -1000, Tejun Heo wrote:
> > On Wed, May 27, 2026 at 11:45:54AM +0100, Mark Brown wrote:
> > > On Mon, May 04, 2026 at 02:51:21PM -1000, Tejun Heo wrote:
> 
> > > with no further output and given that this is a cgroup locking change
> > > this does seem like a plausible commit, though I didn't look into it in
> > > detail.  Bisect log and the list of LTP tests we're running in our test
> > > job below.  We are running multiple tests in parallel.
> 
> > Unfortunately, I can't reproduce this in my environment. Any chance you can
> > try testing on x86 too and see whether it reproduces there?
> 
> Not readily sadly, I'll see if I can figure something out.  Our rootfs
> images are based on Debian Trixie if that's relevant?

Using debian unstable (sid/forky) I can at least detect a timeout when running
the ltp controller testsuite:

# LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
Host information
 Hostname: homer
 Python: 3.13.12 (main, Feb 4 2026, 15:06:39) [GCC 15.2.0]
 Directory: /tmp/kirk.root/tmp092in2yb

Connecting to SUT: default

Suite: controllers
──────────────────
cgroup_core01: pass  (0.024s)
cgroup_core02: pass  (0.004s)
cgroup_core03: pass  (0.017s)
cgroup: skip  (2m 41s)
memcg_regression: skip  (3.414s)
memcg_test_3: pass  (0.090s)
memcg_failcnt: skip  (0.019s)
memcg_force_empty: skip  (0.015s)
memcg_limit_in_bytes: skip  (0.017s)
memcg_stat_rss: skip  (0.015s)
memcg_subgroup_charge: skip  (0.015s)
memcg_max_usage_in_bytes: skip  (0.014s)
memcg_move_charge_at_immigrate: skip  (0.014s)
memcg_memsw_limit_in_bytes: skip  (0.015s)
memcg_stat: skip  (0.015s)
memcg_use_hierarchy: skip  (0.015s)
memcg_usage_in_bytes: skip  (0.014s)
memcg_stress: pass  (30m 4s)
memcg_control: pass  (6.058s)
memcontrol01: pass  (0.004s)
memcontrol02: pass  (0.636s)
memcontrol03: pass  (15.983s)
memcontrol04: pass  (0.890s)
cgroup_fj_function_debug: skip  (0.013s)
cgroup_fj_function_cpuset: skip  (0.044s)
cgroup_fj_function_cpu: skip  (0.050s)
cgroup_fj_function_cpuacct: pass  (0.052s)
cgroup_fj_function_memory: skip  (0.042s)
cgroup_fj_function_freezer: pass  (0.044s)
cgroup_fj_function_devices: pass  (0.066s)
cgroup_fj_function_blkio: skip  (0.009s)
cgroup_fj_function_net_cls: pass  (0.073s)
cgroup_fj_function_perf_event: pass  (0.072s)
cgroup_fj_function_net_prio: Suite 'controllers' timed out after 3600 seconds

Execution time: 1h 33m 13s

Disconnecting from SUT: default

Target information
──────────────────
Kernel:   Linux 7.1.0-rc5-next-20260528-master-dirty #480 SMP PREEMPT_RT Thu May 28 19:55:12 CEST 2026
Cmdline:  BOOT_IMAGE=/boot/vmlinuz-7.1.0-rc5-next-20260528-master-dirty
          root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
          ro
          quiet
Machine:  unknown
Arch:     x86_64
RAM:      63439380 kB
Swap:     78125052 kB
Distro:   debian 

────────────────────────
      TEST SUMMARY
────────────────────────
Suite:   controllers
Runtime: 33m 13s
Runs:    347

Results:
    Passed:   181
    Failed:   0
    Broken:   0
    Skipped:  350
    Warnings: 0

Session stopped

In dmesg I get messages about task tst_cgctl hanging:

[ 2212.794669] [    T346] INFO: task tst_cgctl:317896 blocked for more than 122 seconds.
[ 2212.794674] [    T346]       Not tainted 7.1.0-rc5-next-20260528-master-dirty #480
[ 2212.794675] [    T346] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[...] 

[ 3318.721344] [    T346] INFO: task tst_cgctl:317896 blocked for more than 1228 seconds.
[ 3318.721349] [    T346]       Not tainted 7.1.0-rc5-next-20260528-master-dirty #480
[ 3318.721351] [    T346] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.






On 6.19.14 the results of this test run are:

# LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers

[...]

Target information
──────────────────
Kernel:   Linux 6.19.14-stable #1238 SMP PREEMPT_RT Sat May 30 17:28:29 CEST 2026
Cmdline:  BOOT_IMAGE=/boot/vmlinuz-6.19.14-stable
          root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
          ro
          quiet
Machine:  unknown
Arch:     x86_64
RAM:      63436188 kB
Swap:     78125052 kB
Distro:   debian 

────────────────────────
      TEST SUMMARY
────────────────────────
Suite:   controllers
Runtime: 36m 12s
Runs:    347

Results:
    Passed:   1742
    Failed:   0
    Broken:   0
    Skipped:  97
    Warnings: 0

Session stopped

With 6.19.14 I also get no hung tasks.

On 7.0.10 the tests also work:

root@homer:/mnt/data/linux-forest/kirk# LTPROOT=/home/bert/ltp-install/ ./kirk --run-suite controllers
Host information
	Hostname:   homer
	Python:     3.13.12 (main, Feb  4 2026, 15:06:39) [GCC 15.2.0]
	Directory:  /tmp/kirk.root/tmpq32b09g7

Connecting to SUT: default

Suite: controllers
──────────────────
cgroup_core01: pass  (0.016s)

[...]

pids_9_100: pass  (0.107s)

Execution time: 36m 15s

Disconnecting from SUT: default

Target information
──────────────────
Kernel:   Linux 7.0.10-stable #1239 SMP PREEMPT_RT Sun May 31 00:42:41 CEST 2026
Cmdline:  BOOT_IMAGE=/boot/vmlinuz-7.0.10-stable
          root=UUID=3d5cdc5d-1902-40bf-9e16-ca819372d350
          ro
          quiet
Machine:  unknown
Arch:     x86_64
RAM:      63435940 kB
Swap:     78125052 kB
Distro:   debian 

────────────────────────
      TEST SUMMARY
────────────────────────
Suite:   controllers
Runtime: 36m 13s
Runs:    347

Results:
    Passed:   1742
    Failed:   0
    Broken:   0
    Skipped:  97
    Warnings: 0

Session stopped



I'm not sure if this is related to the problems on arm64, but I'll try bisecting this.

Bert Karwatzki

^ permalink raw reply

* [PATCH v5] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
From: Qiliang Yuan @ 2026-05-31  8:45 UTC (permalink / raw)
  To: Christian Koenig, Huang Rui, Matthew Auld, Matthew Brost,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Tejun Heo, Johannes Weiner, Michal Koutný,
	Natalie Vock
  Cc: dri-devel, linux-kernel, cgroups, Qiliang Yuan

The dmem cgroup v2 controller currently only provides a hard "max"
limit, which causes immediate allocation failures when a cgroup's
device memory usage reaches its quota.  GPU-bound AI workloads need
smoother over-subscription support: a soft limit that temporarily
allows excess usage while applying backpressure through reclaim
rather than outright failure.

Add dmem.high, a soft limit that penalizes over-limit cgroups by
evicting their buffer objects first when eviction is triggered (e.g.
due to a "max" limit hit).  Unlike the rejected v1 approach which
used sleep-on-allocation throttling, this version provides a
meaningful recovery action through prioritized reclaim.

Expose "high" as a new cgroupfs control file per region via
set_resource_high() and get_resource_high(), and initialize it to
PAGE_COUNTER_MAX in reset_all_resource_limits().  Like get_resource_max(),
get_resource_high() returns PAGE_COUNTER_MAX when the pool is NULL.

Extend dmem_cgroup_state_evict_valuable() with a "try_high"
parameter.  When set, the function walks the page_counter parent
chain to check whether any ancestor exceeds its high limit, and
verifies that the pool is above its effective minimum to respect
dmem.min protection.  For the limit-hitting cgroup's own BOs, the
ancestry check is skipped but the high threshold still applies.

Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy.
Pass 1 targets only BOs whose cgroup exceeds dmem.high, using a
blocking lock when a ticket is available or trylock otherwise.
Pass 2 falls back to the standard above-elow trylock eviction.
Pass 3+ uses proper locking and repeats while making progress
with the existing low-watermark fallback.

Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
---
Introduce a "high" soft limit for the dmem cgroup v2 controller.
When a "max" limit is hit and eviction is triggered, buffer objects
belonging to cgroups that exceed their dmem.high limit are targeted
first, providing a meaningful recovery action through reclaim.

The dmem cgroup currently only supports hard "max" limits, which
cause immediate allocation failures for GPU-bound workloads. A soft
limit enables smoother over-subscription by penalizing over-limit
cgroups via prioritized eviction rather than outright rejection.

The implementation adds a "high" cgroupfs control file per region,
a try_high parameter to dmem_cgroup_state_evict_valuable() for
tier-1 eviction, and a 3-pass strategy in ttm_bo_evict_alloc().
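
For readers skimming the thread, the try_high decision described above can
be modelled in plain userspace C. This is only a sketch and not part of the
patch: struct counter is a hypothetical stand-in for struct page_counter,
the emin field is assumed to already hold the effective minimum (the patch
computes it via dmem_cgroup_calculate_protection()), and the separate
limit_pool == test_pool branch is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page_counter; not the kernel type. */
struct counter {
	uint64_t usage;         /* current charge */
	uint64_t high;          /* soft limit, UINT64_MAX when unset */
	uint64_t emin;          /* effective min protection (assumed precomputed) */
	struct counter *parent; /* NULL at the root */
};

/* True if c itself or any ancestor is above its own high limit. */
static bool over_high_in_chain(const struct counter *c)
{
	for (; c; c = c->parent)
		if (c->usage > c->high)
			return true;
	return false;
}

/*
 * Models the try_high pass for a pool that is not the limit-hitting
 * pool: evictable only if something in the chain is over high and the
 * pool's usage is above its effective minimum.
 */
static bool evictable_try_high(const struct counter *test)
{
	assert(test != NULL);
	if (!over_high_in_chain(test))
		return false;
	return test->usage > test->emin;
}
```

A child that is under its own limit still qualifies when its parent is over
dmem.high, which is the evasion case the V2 -> V3 notes describe.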
---
V4 -> V5:
- Restore the original control flow in dmem_cgroup_state_evict_valuable():
  test_pool is no longer dereferenced before the ancestry checks, fixing
  a NULL pointer dereference on BOs without a cgroup.  The limit_pool
  NULL-to-root-cgroup resolution is now performed before the try_high
  block, fixing a panic during global memory pressure eviction.
- Keep the try_high check for limit_pool == test_pool inside the existing
  early-return branch to avoid bypassing the hierarchy constraint check
  that prevents cross-cgroup eviction.
- Use a blocking lock in Pass 1 only when a ticket is available
  (trylock otherwise), addressing the deadlock risk of blocking without
  a valid ww_acquire_ctx.
- Explicitly reset trylock_only to true before Pass 2 so it does not
  inherit Pass 1's blocking behavior.
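
The per-pass flag settings listed above can be summarised as a small table
in code. This is a sketch, not code from the patch: struct pass_cfg and
pass_config() are hypothetical names, and the early exit taken when a pass
returns a non-zero result is deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the flags each pass runs with. */
struct pass_cfg {
	bool run;          /* whether the pass is attempted at all */
	bool try_high;     /* only consider cgroups over dmem.high */
	bool trylock_only; /* trylock instead of a blocking (ticket) lock */
};

/* pass is 1, 2 or 3; have_ticket models "ticket != NULL". */
static struct pass_cfg pass_config(int pass, bool have_ticket)
{
	struct pass_cfg cfg = { .run = true, .try_high = false, .trylock_only = true };

	assert(pass >= 1 && pass <= 3);
	switch (pass) {
	case 1:
		/* Blocking lock only when a ticket is available. */
		cfg.try_high = true;
		cfg.trylock_only = !have_ticket;
		break;
	case 2:
		/* Explicitly back to trylock, regardless of Pass 1. */
		break;
	case 3:
		/* Properly locked pass needs a ticket. */
		cfg.run = have_ticket;
		cfg.trylock_only = false;
		break;
	}
	return cfg;
}
```

Without a ticket, passes 1 and 2 both use trylock and pass 3 is skipped,
which matches the "lret || !ticket" exit in ttm_bo_evict_alloc().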

V3 -> V4:
- Use a blocking lock in Pass 1 instead of trylock to ensure
  over-limit cgroups are penalized even when their BOs are actively
  in use, as requested by Maarten Lankhorst.
- Evaluate the try_high condition before the limit_pool == test_pool
  early-return so that the limit-hitting cgroup's own BOs are also
  filtered by dmem.high.
- Remove the high-priority compensation retry at the start of Pass 3,
  which is no longer needed now that Pass 1 uses a blocking lock.

V2 -> V3:
- Walk the page_counter parent chain in the try_high pass to prevent
  child cgroups from evading the penalty when a parent cgroup exceeds
  its dmem.high limit.
- Check dmem.min protection in the try_high pass to avoid evicting
  BOs below the effective minimum.
- Add a properly-locked high-priority retry at the beginning of Pass 3
  so that actively-used over-limit BOs (which failed trylock in Pass 1)
  are not skipped while innocent cgroups are evicted.
- Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX
  to match the behavior of get_resource_max().

V1 -> V2:
- Replace sleep-on-allocation throttling with prioritized eviction.
  When a "max" limit is hit, BOs from cgroups exceeding dmem.high are
  evicted first in a dedicated pass. No throttling or sleeping is
  performed in the charge path.
- Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
  resume_user_mode_work() integration) entirely.
- Add dmem.high cgroupfs control file per region.
- Extend dmem_cgroup_state_evict_valuable() with try_high parameter
  to target over-limit cgroups as tier-1 eviction.
- Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
  (1) trylock: evict only BOs exceeding dmem.high
  (2) trylock: above-elow
  (3) proper-lock: repeat with low fallback.
- Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().

v4: https://lore.kernel.org/r/20260530-feature-dmem-high-v4-1-ee7c6ec1c8da@gmail.com
v3: https://lore.kernel.org/r/20260528-feature-dmem-high-v3-1-c642b34bcb2f@gmail.com
v2: https://lore.kernel.org/r/20260522-feature-dmem-high-v2-1-1d7d4a0fa5da@gmail.com
v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@gmail.com
---
 drivers/gpu/drm/ttm/ttm_bo.c | 33 ++++++++++++++----
 include/linux/cgroup_dmem.h  |  4 +--
 kernel/cgroup/dmem.c         | 79 +++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 103 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f02..21fe34fd43eec 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {
 
 	/** @limit_pool: Which pool limit we should test against */
 	struct dmem_cgroup_pool_state *limit_pool;
+	/** @try_high: Whether to only evict BO's above the high watermark (first pass) */
+	bool try_high;
 	/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
 	bool try_low;
 	/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
@@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 	s64 lret;
 
 	if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
-					      evict_walk->try_low, &evict_walk->hit_low))
+					      evict_walk->try_high, evict_walk->try_low,
+					      &evict_walk->hit_low))
 		return 0;
 
 	if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
@@ -577,31 +580,47 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
 	};
 	s64 lret;
 
-	evict_walk.walk.arg.trylock_only = true;
+	/*
+	 * Pass 1 (high-priority): Evict only BOs whose cgroup exceeds its
+	 * dmem.high soft limit.  A blocking lock is used when a ticket is
+	 * available to ensure over-limit cgroups are penalized even when
+	 * their BOs are actively in use; trylock otherwise.
+	 */
+	evict_walk.walk.arg.trylock_only = !ticket;
+	evict_walk.try_high = true;
 	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+	evict_walk.try_high = false;
+	if (lret)
+		goto out;
 
-	/* One more attempt if we hit low limit? */
+	/*
+	 * Pass 2 (trylock): Evict BOs above the effective low watermark.
+	 * Falls back to low-priority eviction if needed.
+	 */
+	evict_walk.walk.arg.trylock_only = true;
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	if (!lret && evict_walk.hit_low) {
 		evict_walk.try_low = true;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	}
+
 	if (lret || !ticket)
 		goto out;
 
-	/* Reset low limit */
+	/*
+	 * Pass 3+ (properly locked): Evict while making progress.
+	 * Reset flags and retry with try_low if we hit the low watermark.
+	 */
 	evict_walk.try_low = evict_walk.hit_low = false;
-	/* If ticket-locking, repeat while making progress. */
 	evict_walk.walk.arg.trylock_only = false;
 
 retry:
 	do {
-		/* The walk may clear the evict_walk.walk.ticket field */
 		evict_walk.walk.arg.ticket = ticket;
 		evict_walk.evicted = 0;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	} while (!lret && evict_walk.evicted);
 
-	/* We hit the low limit? Try once more */
 	if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
 		evict_walk.try_low = true;
 		goto retry;
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..06115d35509b1 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low);
+				      bool try_high, bool ignore_low, bool *ret_hit_low);
 
 void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
 #else
@@ -54,7 +54,7 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
 static inline
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	return true;
 }
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..29f8c68e92f7f 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
 	page_counter_set_low(&pool->cnt, val);
 }
 
+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+	page_counter_set_high(&pool->cnt, val);
+}
+
 static void
 set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
 {
@@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
 	return pool ? READ_ONCE(pool->cnt.low) : 0;
 }
 
+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+	return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
+}
+
 static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
 {
 	return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
 	set_resource_min(rpool, 0);
 	set_resource_low(rpool, 0);
+	set_resource_high(rpool, PAGE_COUNTER_MAX);
 	set_resource_max(rpool, PAGE_COUNTER_MAX);
 }
 
@@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  * dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
  * @limit_pool: The pool for which we hit limits
  * @test_pool: The pool for which to test
+ * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
  * @ignore_low: Whether we have to respect low watermarks.
  * @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
  *
  * This function returns true if we can evict from @test_pool, false if not.
+ * When @try_high is set, only pools with usage above their high limit are
+ * evictable, enabling prioritized eviction of over-limit cgroups.
  * When returning false and @ignore_low is false, @ret_hit_low may
  * be set to true to indicate this function can be retried with @ignore_low
  * set to true.
@@ -301,15 +316,26 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  */
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	struct dmem_cgroup_pool_state *pool = test_pool;
 	struct page_counter *ctest;
 	u64 used, min, low;
 
-	/* Can always evict from current pool, despite limits */
-	if (limit_pool == test_pool)
+	/*
+	 * When the limit-hitting cgroup's own BOs are being considered
+	 * in try_high mode, only evict them if their pool exceeds its
+	 * own dmem.high limit.  For non-try_high mode, maintain the
+	 * existing behavior: always evict from the limit-hitting pool.
+	 */
+	if (limit_pool == test_pool) {
+		if (try_high) {
+			ctest = &test_pool->cnt;
+			used = page_counter_read(ctest);
+			return used > READ_ONCE(ctest->high);
+		}
 		return true;
+	}
 
 	if (limit_pool) {
 		if (!parent_dmemcs(limit_pool->cs))
@@ -330,10 +356,38 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 	}
 
 	ctest = &test_pool->cnt;
+	used = page_counter_read(ctest);
+
+	if (try_high) {
+		struct page_counter *c;
+
+		/*
+		 * Walk the page_counter parent chain to check whether any
+		 * ancestor cgroup exceeds its dmem.high limit.  This prevents
+		 * child cgroups from evading the penalty when a parent cgroup
+		 * is over its high limit.
+		 */
+		if (used <= READ_ONCE(ctest->high)) {
+			for (c = ctest->parent; c; c = c->parent) {
+				if (page_counter_read(c) > READ_ONCE(c->high))
+					break;
+			}
+			if (!c)
+				return false;
+		}
+
+		/*
+		 * Respect dmem.min protection: do not evict BOs below the
+		 * effective minimum even during the high-priority pass.
+		 */
+		dmem_cgroup_calculate_protection(limit_pool, test_pool);
+		min = READ_ONCE(ctest->emin);
+
+		return used > min;
+	}
 
 	dmem_cgroup_calculate_protection(limit_pool, test_pool);
 
-	used = page_counter_read(ctest);
 	min = READ_ONCE(ctest->emin);
 
 	if (used <= min)
@@ -835,6 +889,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
 	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
 }
 
+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+	return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
 static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
 {
 	return dmemcg_limit_show(sf, v, get_resource_max);
@@ -868,6 +933,12 @@ static struct cftype files[] = {
 		.seq_show = dmem_cgroup_region_low_show,
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
+	{
+		.name = "high",
+		.write = dmem_cgroup_region_high_write,
+		.seq_show = dmem_cgroup_region_high_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 	{
 		.name = "max",
 		.write = dmem_cgroup_region_max_write,

---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38

Best regards,
-- 
Qiliang Yuan <realwujing@gmail.com>


^ permalink raw reply related

* Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru
From: Wei Yang @ 2026-05-31  8:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Shakeel Butt,
	Michal Hocko, Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng,
	Yosry Ahmed, Zi Yan, Liam R . Howlett, Usama Arif,
	Kiryl Shutsemau, Vlastimil Babka, Kairui Song, Mikhail Zaslonko,
	Vasily Gorbik, Baolin Wang, Barry Song, Dev Jain, Lance Yang,
	Nico Pache, Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-10-hannes@cmpxchg.org>

On Wed, May 27, 2026 at 04:45:16PM -0400, Johannes Weiner wrote:
>The deferred split queue handles cgroups in a suboptimal fashion. The
>queue is per-NUMA node or per-cgroup, not the intersection. That means
>on a cgrouped system, a node-restricted allocation entering reclaim
>can end up splitting large pages on other nodes:
>
>        alloc/unmap
>          deferred_split_folio()
>            list_add_tail(memcg->split_queue)
>            set_shrinker_bit(memcg, node, deferred_shrinker_id)
>
>        for_each_zone_zonelist_nodemask(restricted_nodes)
>          mem_cgroup_iter()
>            shrink_slab(node, memcg)
>              shrink_slab_memcg(node, memcg)
>                if test_shrinker_bit(memcg, node, deferred_shrinker_id)
>                  deferred_split_scan()
>                    walks memcg->split_queue
>
>The shrinker bit adds an imperfect guard rail. As soon as the cgroup
>has a single large page on the node of interest, all large pages owned
>by that memcg, including those on other nodes, will be split.
>
>list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
>streamlines a lot of the list operations and reclaim walks. It's used
>widely by other major shrinkers already. Convert the deferred split
>queue as well.
>
>The list_lru per-memcg heads are instantiated on demand when the first
>object of interest is allocated for a cgroup, by calling
>folio_memcg_alloc_deferred(). Add calls to where splittable pages are
>created: anon faults, swapin faults, khugepaged collapse.
>
>These calls create all possible node heads for the cgroup at once, so
>the migration code (between nodes) doesn't need any special care.
>
>Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
>Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
>Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>---
> include/linux/huge_mm.h    |   7 +-
> include/linux/memcontrol.h |   4 -
> include/linux/mmzone.h     |  12 --
> mm/huge_memory.c           | 364 +++++++++++++------------------------
> mm/internal.h              |   2 +-
> mm/khugepaged.c            |   5 +
> mm/memcontrol.c            |  12 +-
> mm/memory.c                |   4 +
> mm/mm_init.c               |  15 --
> mm/swap_state.c            |  10 +
> 10 files changed, 150 insertions(+), 285 deletions(-)
>
[...]
>@@ -1379,6 +1285,14 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
> 		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> 		return NULL;
> 	}
>+
>+	if (folio_memcg_alloc_deferred(folio)) {
>+		folio_put(folio);
>+		count_vm_event(THP_FAULT_FALLBACK);
>+		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
>+		return NULL;
>+	}
>+

Nit: we have three possible failure points, and some duplicated
count_vm_event()/count_mthp_stat() calls.

Maybe we can have a follow-up cleanup for it.

Otherwise, looks good. Thanks.

-- 
Wei Yang
Help you, Help me

^ permalink raw reply

* Re: [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask
From: Nhat Pham @ 2026-05-30 18:21 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim
In-Reply-To: <CAKEwX=O-_OZ8x0UC96a_k+0eZfAE+mWMWDdn68uy1LHRq=JC0w@mail.gmail.com>

On Sat, May 30, 2026 at 10:51 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
>
> How expensive is it to add per-cpu caching for each device :(

To clarify: one percpu_swap_cluster per si.

>

... or for each tier (assuming devices in each tier share the same
performance characteristics, and could be used interchangeably?).

Basically:

struct percpu_swap_cluster {
    struct swap_info_struct *si[MAX_SWAPTIER][SWAP_NR_ORDERS];
    unsigned long offset[MAX_SWAPTIER][SWAP_NR_ORDERS];
    local_lock_t lock;
};

Seems like 4 is the default number of tiers, right? So the extra
overhead is just (nr cpu) * 10 * 3 * (sizeof(unsigned long) +
sizeof(*ptr)) or whatever?
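
To put a number on that estimate, here is a small userspace sketch. The
helper name extra_bytes() is made up for illustration; the factors 10
(taken here as the number of swap orders) and 3 (taken as the tiers added
beyond the one existing slot per order) are my reading of the formula
above, not values checked against the series.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Extra per-CPU cache bytes: one (pointer, offset) pair for every
 * additional tier and every order, multiplied by the CPU count.
 */
static size_t extra_bytes(size_t nr_cpus, size_t nr_orders, size_t extra_tiers)
{
	assert(nr_orders > 0);
	return nr_cpus * nr_orders * extra_tiers *
	       (sizeof(unsigned long) + sizeof(void *));
}
```

On a typical 64-bit build, where both unsigned long and a pointer are 8
bytes, extra_bytes(1, 10, 3) works out to 480 bytes per CPU.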

^ permalink raw reply

* Re: [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure
From: Nhat Pham @ 2026-05-30 18:02 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim
In-Reply-To: <20260527062247.3440692-1-youngjun.park@lge.com>

On Tue, May 26, 2026 at 11:23 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> This is v7 of the swap tier series addressing review feedback.
> The cover letter has been simplified.
>
> I revisited the design (see Design Rationale). Since our use case
> fits best with a memcg-based model, the implementation remains
> within memcg and preserves its resource accounting semantics.
>
> Alternatives considered:
>
> 1. A separate sysfs interface under swap. (Workable, but it would still
>    need to reference memcg paths, and fully decoupling it would add
>    swap-layer logic to manage memcgs, making it a secondary option.)
>
> 2. Making the feature non-default.
>
> Other interfaces were also reviewed. Aside from sysfs and BPF,
> the options involve trade-offs and are largely design choices.
> BPF was excluded due to possible disablement on our embedded
> platform, though future extension remains possible.
>
> Overview
> ========
>
> Swap Tiers group swap devices into performance classes (e.g. NVMe,
> HDD, Network) and allow per-memcg selection of which tiers to use.
> This mechanism was suggested by Chris Li.
>
> Design Rationale
> ================
>
> Swap tier selection is attached to memcg. A child cgroup may select a
> subset of the parent's allowed tiers.
>
> This
> - Preserves cgroup inheritance semantics (boundary at parent,
>   refinement at child).
> - Reuses memcg, which already groups processes and enforces
>   hierarchical memory limits.
> - Aligns with existing memcg swap controls (e.g. swap.max, zswap.writeback)
> - Avoids introducing a parallel swap control hierarchy.
>
> Placing tier control outside memcg (e.g., via BPF, syscalls, or
> madvise) would allow swap preference to diverge from the memcg
> hierarchy. Integrating it into memcg keeps the swap policy
> consistent with existing memory ownership semantics. There are
> also real use cases built around memcg.
>
> In the future, this can be extended to other interfaces to cover
> additional use cases.
>
> I believe a memcg-based swap control is a good starting point
> before such extensions.
>
> Use Cases
> =========
>
> #1: Latency separation (our primary deployment scenario)
>   [ / ]
>      |
>      +-- latency-sensitive workload  (fast tier)
>      +-- background workload         (slow tier)
>
> The parent defines the memory boundary.
> Each workload selects a swap tier via memory.swap.tiers according to
> latency requirements.
>
> This prevents latency-sensitive workloads from being swapped to
> slow devices used by background workloads.
>
> #2: Per-VM swap selection (Chris Li's deployment scenario)
>   [ / ]
>      |
>      +-- [ Job on VM ]              (tiers: zswap, SSD)
>             |
>             +-- [ VMM guest memory ]  (tiers: SSD)
>
> The parent (job) has access to both zswap and SSD tiers.
> The child (VMM guest memory) selects SSD as its swap tier via
> memory.swap.tiers. In this deployment, swap device selection
> happens at the child level from the parent's available set.
>
> #3: Tier isolation for reduced contention (hypothetical)
>   [ / ]                    (tiers: A, B)
>      |
>      +-- workload X        (tiers: A)
>      +-- workload Y        (tiers: B)
>
> Each child uses a different tier. Since swap paths are separated
> per tier, synchronization overhead between the two workloads is
> reduced.
>
> Future extension
> ================
>
> #1: Intra-tier distribution policy:
>   Currently, swap devices with the same priority are allocated in a
>   round-robin fashion. Per-tier policy files under
>   /sys/kernel/mm/swap/tiers/ can control how devices within a tier
>   are selected (e.g. round-robin, weighted).
>
> #2: Inter-tier promotion and demotion:
>   Promotion and demotion apply between tiers, not within a single
>   tier. The current interface defines only tier assignment; it does
>   not yet define when or how pages move between tiers. Two triggering
>   models are possible:
>
>   (a) User-triggered: userspace explicitly initiates migration between
>       tiers (e.g. via a new interface or existing move_pages semantics).
>   (b) Kernel-triggered: the kernel moves pages between tiers at
>       appropriate points such as reclaim or refault.
>
> #3: Per-VMA, per-process swap and BPF:
>   Beyond memcg-based swap, tier selection could be extended to per-VMA or
>   per-process swap, or driven by a BPF program.
>
> Experimentation
> ===============
>
> Tested on our internal platform using NBD as a separate swap tier.
> This is our first, simple production use case.
>
> Without tiers:
> - No selective control over flash wear
> - Cannot selectively assign NBD to specific applications
>
> Cold launch improvement (preloaded vs. baseline):
> - App A: 13.17s -> 4.18s (68%)
> - App B: 5.60s -> 1.12s (80%)
> - App C: 10.25s -> 2.00s (80%)
>
> Performance impact with no tiers configured:
> <1% regression in kernel build and vm-scalability benchmarks
>

Bit late to the party - working on my review backlog right now :)

I see some parallels between this and the memory tiering work being
done. One future line of work could be considering how to ensure
fairness when multiple cgroups share the same tiers:

https://lwn.net/Articles/1073400/

I can imagine a scenario where one noisy neighbor eagerly swaps first
and occupies all the space in the faster tier(s), pushing the other
colocated tenants to the slower tier(s). We might need to figure out a
way to ensure fairness here (while letting cgroups occupy fast swap
backends opportunistically if there is no resource scarcity).

* Re: [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask
From: Nhat Pham @ 2026-05-30 17:51 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim
In-Reply-To: <20260527062247.3440692-5-youngjun.park@lge.com>

On Tue, May 26, 2026 at 11:23 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> Apply memcg tier effective mask during swap slot allocation to
> enforce per-cgroup swap tier restrictions.
>
> In the fast path, check the percpu cached swap_info's tier_mask
> against the folio's effective mask. If it does not match, fall
> through to the slow path. In the slow path, skip swap devices
> whose tier_mask is not covered by the folio's effective mask.
>
> This works correctly when there is only one non-rotational
> device in the system and no devices share the same priority.
> However, there are known limitations:
>
>  - When non-rotational devices are distributed across multiple
>    tiers, and different memcgs are configured to use those
>    distinct tiers, they may constantly overwrite the shared
>    percpu swap cache. This cache thrashing leads to frequent
>    fast path misses.
>
>  - Combined with the above issue, if same-priority devices exist
>    among them, a percpu cache miss (overwritten by another memcg)
>    forces the allocator to round-robin to the next device
>    prematurely, even if the current cluster is not fully
>    exhausted.
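
The filtering described in the quoted text can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions:
struct swap_dev, tier_allowed() and pick_swap_dev() are hypothetical names
invented for illustration, "covered by" is assumed to mean that every tier
bit of the device is set in the effective mask, and the single cached index
stands in for the percpu cache. It is not the code from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a swap device; not the kernel's swap_info_struct. */
struct swap_dev {
	uint32_t tier_mask;	/* tier bit(s) this device belongs to */
	bool has_free_slots;	/* simplified "can allocate here" flag */
};

/* Assumption: a device is usable when all of its tier bits are allowed. */
static bool tier_allowed(uint32_t dev_tier_mask, uint32_t effective_mask)
{
	return (dev_tier_mask & ~effective_mask) == 0;
}

/*
 * Fast path: reuse the cached device if its tier is allowed.
 * Slow path: scan all devices, skipping those outside the mask.
 * Returns the chosen device index, or -1 if nothing is usable.
 */
static int pick_swap_dev(const struct swap_dev *devs, size_t ndevs,
			 int cached, uint32_t effective_mask)
{
	size_t i;

	if (cached >= 0 && (size_t)cached < ndevs &&
	    tier_allowed(devs[cached].tier_mask, effective_mask) &&
	    devs[cached].has_free_slots)
		return cached;

	for (i = 0; i < ndevs; i++) {
		if (!tier_allowed(devs[i].tier_mask, effective_mask))
			continue;
		if (devs[i].has_free_slots)
			return (int)i;
	}
	return -1;
}
```

In this model, two callers with disjoint masks that share one cached index
always miss the fast path after the other has run, which is the thrashing
the first limitation describes.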

I had very similar issues when I tried hacking vswap on top of the swap
table too... It's even worse over there because it's not just
performance - vswap needs special handling in certain cases, and in
some places cannot be used at all (e.g. in zswap writeback). I
ended up having to add separate caching for the vswap device:

https://lore.kernel.org/all/20260528212955.1912856-1-nphamcs@gmail.com/

How expensive is it to add per-cpu caching for each device? :(

Anyway, as a first step, this LGTM. Reviewing from swap's mechanism
perspective, and leaving the cgroup side to memcg folks:

Reviewed-by: Nhat Pham <nphamcs@gmail.com>

* Re: [PATCH v5 8/9] mm: memory: flatten alloc_anon_folio() retry loop
From: Dev Jain @ 2026-05-30  9:06 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Lance Yang, Nico Pache, Ryan Roberts,
	cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-9-hannes@cmpxchg.org>



On 28/05/26 2:15 am, Johannes Weiner wrote:
> alloc_anon_folio() uses a top-level if (folio) that buries the success
> path four levels deep. This makes for awkward long lines and wrapping.
> The next patch will add more code here, so flatten this now to keep
> things clean and simple.
> 
> The next label is already there, use it for !folio.
> 
> No functional change intended.
> 
> Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---

Reviewed-by: Dev Jain <dev.jain@arm.com>


>  mm/memory.c | 34 +++++++++++++++++-----------------
>  1 file changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7c020995eafc..135f5c0f57bd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5215,24 +5215,24 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  	while (orders) {
>  		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>  		folio = vma_alloc_folio(gfp, order, vma, addr);
> -		if (folio) {
> -			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> -				count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> -				folio_put(folio);
> -				goto next;
> -			}
> -			folio_throttle_swaprate(folio, gfp);
> -			/*
> -			 * When a folio is not zeroed during allocation
> -			 * (__GFP_ZERO not used) or user folios require special
> -			 * handling, folio_zero_user() is used to make sure
> -			 * that the page corresponding to the faulting address
> -			 * will be hot in the cache after zeroing.
> -			 */
> -			if (user_alloc_needs_zeroing())
> -				folio_zero_user(folio, vmf->address);
> -			return folio;
> +		if (!folio)
> +			goto next;
> +		if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> +			count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> +			folio_put(folio);
> +			goto next;
>  		}
> +		folio_throttle_swaprate(folio, gfp);
> +		/*
> +		 * When a folio is not zeroed during allocation
> +		 * (__GFP_ZERO not used) or user folios require special
> +		 * handling, folio_zero_user() is used to make sure
> +		 * that the page corresponding to the faulting address
> +		 * will be hot in the cache after zeroing.
> +		 */
> +		if (user_alloc_needs_zeroing())
> +			folio_zero_user(folio, vmf->address);
> +		return folio;
>  next:
>  		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
>  		order = next_order(&orders, order);


* [PATCH v4] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
From: Qiliang Yuan @ 2026-05-30  7:35 UTC (permalink / raw)
  To: Christian Koenig, Huang Rui, Matthew Auld, Matthew Brost,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Tejun Heo, Johannes Weiner, Michal Koutný,
	Natalie Vock
  Cc: dri-devel, linux-kernel, cgroups, Qiliang Yuan

The dmem cgroup v2 controller currently only provides a hard "max"
limit, which causes immediate allocation failures when a cgroup's
device memory usage reaches its quota.  GPU-bound AI workloads need
smoother over-subscription support: a soft limit that temporarily
allows excess usage while applying backpressure through reclaim
rather than outright failure.

Add dmem.high, a soft limit that penalizes over-limit cgroups by
evicting their buffer objects first when eviction is triggered (e.g.
due to a "max" limit hit).  Unlike the rejected v1 approach which
used sleep-on-allocation throttling, this version provides a
meaningful recovery action through prioritized reclaim.

Expose "high" as a new cgroupfs control file per region via
set_resource_high() and get_resource_high(), and initialize it to
PAGE_COUNTER_MAX in reset_all_resource_limits().  Like get_resource_max(),
get_resource_high() returns PAGE_COUNTER_MAX when the pool is NULL.

Extend dmem_cgroup_state_evict_valuable() with a "try_high"
parameter.  When set, the function evaluates the try_high condition
first (before the limit_pool == test_pool shortcut) so that even the
limit-hitting cgroup's own BOs are filtered by the high threshold.
It then walks the page_counter parent chain to check whether any
ancestor exceeds its high limit, and verifies that the pool is above
its effective minimum to respect dmem.min protection.

Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy.
Pass 1 uses a blocking lock and targets only BOs whose cgroup exceeds
dmem.high, ensuring over-limit cgroups are penalized even when their
BOs are actively in use.  Pass 2 falls back to the standard trylock
eviction of BOs above their effective low watermark (elow).  Pass 3+
uses proper locking and repeats while making progress, with the
existing low-watermark fallback.

Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
---
Introduce a "high" soft limit for the dmem cgroup v2 controller.
When a "max" limit is hit and eviction is triggered, buffer objects
belonging to cgroups that exceed their dmem.high limit are targeted
first, providing a meaningful recovery action through reclaim.

The dmem cgroup currently only supports hard "max" limits, which
cause immediate allocation failures for GPU-bound workloads. A soft
limit enables smoother over-subscription by penalizing over-limit
cgroups via prioritized eviction rather than outright rejection.

The implementation adds a "high" cgroupfs control file per region,
a try_high parameter to dmem_cgroup_state_evict_valuable() for
tier-1 eviction, and a 3-pass strategy in ttm_bo_evict_alloc().
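
For readers who want to reason about the try_high decision outside the
kernel, a minimal userspace model might look like the following. It is a
sketch, not the patch itself: struct counter, over_high() and
evict_valuable_try_high() are hypothetical names, the effective minimum
(emin) is assumed to be precomputed rather than derived by
dmem_cgroup_calculate_protection(), and locking and READ_ONCE() are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified counter; the kernel uses struct page_counter. */
struct counter {
	uint64_t usage;
	uint64_t high;	/* soft limit; UINT64_MAX models "no limit" */
	uint64_t emin;	/* effective minimum, assumed precomputed */
	const struct counter *parent;
};

static bool over_high(const struct counter *c)
{
	return c->usage > c->high;
}

/* Model of the high-priority pass: is @test worth evicting from? */
static bool evict_valuable_try_high(const struct counter *limit,
				    const struct counter *test)
{
	const struct counter *c;

	/* The limit-hitting pool itself: only its own high limit counts. */
	if (limit == test)
		return over_high(test);

	/* Otherwise the pool or any ancestor being over high qualifies. */
	for (c = test; c; c = c->parent)
		if (over_high(c))
			break;
	if (!c)
		return false;

	/* Never evict below the effective minimum protection. */
	return test->usage > test->emin;
}
```

With a parent over its high limit, a child that is itself under its own
limit still qualifies in this model, which mirrors the parent-chain walk
described above.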
---
V3 -> V4:
- Use a blocking lock in Pass 1 instead of trylock to ensure
  over-limit cgroups are penalized even when their BOs are actively
  in use, as requested by Maarten Lankhorst.
- Evaluate the try_high condition before the limit_pool == test_pool
  early-return so that the limit-hitting cgroup's own BOs are also
  filtered by dmem.high.
- Remove the high-priority compensation retry at the start of Pass 3,
  which is no longer needed now that Pass 1 uses a blocking lock.

V2 -> V3:
- Walk the page_counter parent chain in the try_high pass to prevent
  child cgroups from evading the penalty when a parent cgroup exceeds
  its dmem.high limit.
- Check dmem.min protection in the try_high pass to avoid evicting
  BOs below the effective minimum.
- Add a properly-locked high-priority retry at the beginning of Pass 3
  so that actively-used over-limit BOs (which failed trylock in Pass 1)
  are not skipped while innocent cgroups are evicted.
- Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX
  to match the behavior of get_resource_max().

V1 -> V2:
- Replace sleep-on-allocation throttling with prioritized eviction.
  When a "max" limit is hit, BOs from cgroups exceeding dmem.high are
  evicted first in a dedicated pass. No throttling or sleeping is
  performed in the charge path.
- Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
  resume_user_mode_work() integration) entirely.
- Add dmem.high cgroupfs control file per region.
- Extend dmem_cgroup_state_evict_valuable() with try_high parameter
  to target over-limit cgroups as tier-1 eviction.
- Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
  (1) trylock: evict only BOs exceeding dmem.high
  (2) trylock: above-elow
  (3) proper-lock: repeat with low fallback.
- Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().

v3: https://lore.kernel.org/r/20260528-feature-dmem-high-v3-1-c642b34bcb2f@gmail.com
v2: https://lore.kernel.org/r/20260522-feature-dmem-high-v2-1-1d7d4a0fa5da@gmail.com
v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@gmail.com
---
 drivers/gpu/drm/ttm/ttm_bo.c | 32 +++++++++++++----
 include/linux/cgroup_dmem.h  |  4 +--
 kernel/cgroup/dmem.c         | 81 +++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 104 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f02..bf06e9e4b18a3 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {
 
 	/** @limit_pool: Which pool limit we should test against */
 	struct dmem_cgroup_pool_state *limit_pool;
+	/** @try_high: Whether to only evict BO's above the high watermark (first pass) */
+	bool try_high;
 	/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
 	bool try_low;
 	/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
@@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 	s64 lret;
 
 	if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
-					      evict_walk->try_low, &evict_walk->hit_low))
+					      evict_walk->try_high, evict_walk->try_low,
+					      &evict_walk->hit_low))
 		return 0;
 
 	if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
@@ -577,31 +580,46 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
 	};
 	s64 lret;
 
-	evict_walk.walk.arg.trylock_only = true;
+	/*
+	 * Pass 1 (blocking, high-priority): Evict only BOs whose cgroup
+	 * exceeds its dmem.high soft limit.  A blocking lock is used to
+	 * ensure over-limit cgroups are penalized even when their BOs are
+	 * actively in use.
+	 */
+	evict_walk.walk.arg.trylock_only = false;
+	evict_walk.try_high = true;
 	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+	evict_walk.try_high = false;
+	if (lret)
+		goto out;
 
-	/* One more attempt if we hit low limit? */
+	/*
+	 * Pass 2 (trylock): Evict BOs above the effective low watermark.
+	 * Falls back to low-priority eviction if needed.
+	 */
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	if (!lret && evict_walk.hit_low) {
 		evict_walk.try_low = true;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	}
+
 	if (lret || !ticket)
 		goto out;
 
-	/* Reset low limit */
+	/*
+	 * Pass 3+ (properly locked): Evict while making progress.
+	 * Reset flags and retry with try_low if we hit the low watermark.
+	 */
 	evict_walk.try_low = evict_walk.hit_low = false;
-	/* If ticket-locking, repeat while making progress. */
 	evict_walk.walk.arg.trylock_only = false;
 
 retry:
 	do {
-		/* The walk may clear the evict_walk.walk.ticket field */
 		evict_walk.walk.arg.ticket = ticket;
 		evict_walk.evicted = 0;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	} while (!lret && evict_walk.evicted);
 
-	/* We hit the low limit? Try once more */
 	if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
 		evict_walk.try_low = true;
 		goto retry;
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..06115d35509b1 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low);
+				      bool try_high, bool ignore_low, bool *ret_hit_low);
 
 void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
 #else
@@ -54,7 +54,7 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
 static inline
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	return true;
 }
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..f81fbb538cf2f 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
 	page_counter_set_low(&pool->cnt, val);
 }
 
+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+	page_counter_set_high(&pool->cnt, val);
+}
+
 static void
 set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
 {
@@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
 	return pool ? READ_ONCE(pool->cnt.low) : 0;
 }
 
+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+	return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
+}
+
 static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
 {
 	return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
 	set_resource_min(rpool, 0);
 	set_resource_low(rpool, 0);
+	set_resource_high(rpool, PAGE_COUNTER_MAX);
 	set_resource_max(rpool, PAGE_COUNTER_MAX);
 }
 
@@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  * dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
  * @limit_pool: The pool for which we hit limits
  * @test_pool: The pool for which to test
+ * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
  * @ignore_low: Whether we have to respect low watermarks.
  * @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
  *
  * This function returns true if we can evict from @test_pool, false if not.
+ * When @try_high is set, only pools with usage above their high limit are
+ * evictable, enabling prioritized eviction of over-limit cgroups.
  * When returning false and @ignore_low is false, @ret_hit_low may
  * be set to true to indicate this function can be retried with @ignore_low
  * set to true.
@@ -301,12 +316,56 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  */
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	struct dmem_cgroup_pool_state *pool = test_pool;
 	struct page_counter *ctest;
 	u64 used, min, low;
 
+	ctest = &test_pool->cnt;
+	used = page_counter_read(ctest);
+
+	if (try_high) {
+		/*
+		 * When the limit-hitting cgroup's own BOs are being
+		 * considered, only evict them if their pool exceeds its
+		 * own dmem.high limit.  No ancestry check is needed
+		 * because the limit was triggered by this pool itself.
+		 */
+		if (limit_pool == test_pool)
+			return used > READ_ONCE(ctest->high);
+
+		{
+			struct page_counter *c;
+
+			/*
+			 * Walk the page_counter parent chain to check
+			 * whether any ancestor cgroup exceeds its
+			 * dmem.high limit.  This prevents child cgroups
+			 * from evading the penalty when a parent cgroup
+			 * is over its high limit.
+			 */
+			if (used <= READ_ONCE(ctest->high)) {
+				for (c = ctest->parent; c; c = c->parent) {
+					if (page_counter_read(c) >
+					    READ_ONCE(c->high))
+						break;
+				}
+				if (!c)
+					return false;
+			}
+		}
+
+		/*
+		 * Respect dmem.min protection: do not evict BOs below the
+		 * effective minimum even during the high-priority pass.
+		 */
+		dmem_cgroup_calculate_protection(limit_pool, test_pool);
+		min = READ_ONCE(ctest->emin);
+
+		return used > min;
+	}
+
 	/* Can always evict from current pool, despite limits */
 	if (limit_pool == test_pool)
 		return true;
@@ -329,11 +388,8 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 			{}
 	}
 
-	ctest = &test_pool->cnt;
-
 	dmem_cgroup_calculate_protection(limit_pool, test_pool);
 
-	used = page_counter_read(ctest);
 	min = READ_ONCE(ctest->emin);
 
 	if (used <= min)
@@ -835,6 +891,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
 	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
 }
 
+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+	return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
 static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
 {
 	return dmemcg_limit_show(sf, v, get_resource_max);
@@ -868,6 +935,12 @@ static struct cftype files[] = {
 		.seq_show = dmem_cgroup_region_low_show,
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
+	{
+		.name = "high",
+		.write = dmem_cgroup_region_high_write,
+		.seq_show = dmem_cgroup_region_high_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 	{
 		.name = "max",
 		.write = dmem_cgroup_region_max_write,

---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38

Best regards,
-- 
Qiliang Yuan <realwujing@gmail.com>


* [tj-cgroup:for-7.2] BUILD SUCCESS 390f2d73bc99a888469f789f274c162da33bafe5
From: kernel test robot @ 2026-05-30  6:57 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-7.2
branch HEAD: 390f2d73bc99a888469f789f274c162da33bafe5  cgroup/cpuset: Free sched domains on rebuild guard failure

elapsed time: 734m

configs tested: 164
configs skipped: 2

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig    gcc-15.2.0
alpha                            allyesconfig    gcc-15.2.0
alpha                               defconfig    gcc-15.2.0
arc                              allmodconfig    clang-16
arc                               allnoconfig    gcc-15.2.0
arc                              allyesconfig    clang-23
arc                                 defconfig    gcc-15.2.0
arc                   randconfig-001-20260530    gcc-14.3.0
arc                   randconfig-002-20260530    gcc-14.3.0
arm                               allnoconfig    gcc-15.2.0
arm                              allyesconfig    clang-16
arm                                 defconfig    gcc-15.2.0
arm                        multi_v7_defconfig    gcc-15.2.0
arm                   randconfig-001-20260530    gcc-14.3.0
arm                   randconfig-002-20260530    gcc-14.3.0
arm                   randconfig-003-20260530    gcc-14.3.0
arm                   randconfig-004-20260530    gcc-14.3.0
arm                             rpc_defconfig    clang-18
arm64                            allmodconfig    clang-23
arm64                             allnoconfig    gcc-15.2.0
arm64                               defconfig    gcc-15.2.0
arm64                 randconfig-001-20260530    gcc-8.5.0
arm64                 randconfig-002-20260530    gcc-8.5.0
arm64                 randconfig-003-20260530    gcc-8.5.0
arm64                 randconfig-004-20260530    gcc-8.5.0
csky                             allmodconfig    gcc-15.2.0
csky                              allnoconfig    gcc-15.2.0
csky                                defconfig    gcc-15.2.0
csky                  randconfig-001-20260530    gcc-8.5.0
csky                  randconfig-002-20260530    gcc-8.5.0
hexagon                          allmodconfig    gcc-15.2.0
hexagon                           allnoconfig    gcc-15.2.0
hexagon                             defconfig    gcc-15.2.0
hexagon               randconfig-001-20260530    clang-23
hexagon               randconfig-002-20260530    clang-23
i386                             allmodconfig    clang-20
i386                              allnoconfig    gcc-15.2.0
i386                             allyesconfig    clang-20
i386        buildonly-randconfig-001-20260530    clang-20
i386        buildonly-randconfig-002-20260530    clang-20
i386        buildonly-randconfig-003-20260530    clang-20
i386        buildonly-randconfig-004-20260530    clang-20
i386        buildonly-randconfig-005-20260530    clang-20
i386        buildonly-randconfig-006-20260530    clang-20
i386                                defconfig    gcc-15.2.0
i386                  randconfig-001-20260530    clang-20
i386                  randconfig-002-20260530    clang-20
i386                  randconfig-003-20260530    clang-20
i386                  randconfig-004-20260530    clang-20
i386                  randconfig-005-20260530    clang-20
i386                  randconfig-006-20260530    clang-20
i386                  randconfig-007-20260530    clang-20
i386                  randconfig-011-20260530    clang-20
i386                  randconfig-012-20260530    clang-20
i386                  randconfig-013-20260530    clang-20
i386                  randconfig-014-20260530    clang-20
i386                  randconfig-015-20260530    clang-20
i386                  randconfig-016-20260530    clang-20
i386                  randconfig-017-20260530    clang-20
loongarch                        allmodconfig    clang-23
loongarch                         allnoconfig    gcc-15.2.0
loongarch                           defconfig    clang-19
loongarch             randconfig-001-20260530    clang-23
loongarch             randconfig-002-20260530    clang-23
m68k                             allmodconfig    gcc-15.2.0
m68k                              allnoconfig    gcc-15.2.0
m68k                             allyesconfig    clang-16
m68k                                defconfig    clang-19
microblaze                        allnoconfig    gcc-15.2.0
microblaze                       allyesconfig    gcc-15.2.0
microblaze                          defconfig    clang-19
mips                             allmodconfig    gcc-15.2.0
mips                              allnoconfig    gcc-15.2.0
mips                             allyesconfig    gcc-15.2.0
mips                        qi_lb60_defconfig    clang-23
nios2                            allmodconfig    clang-23
nios2                             allnoconfig    clang-23
nios2                               defconfig    clang-19
nios2                 randconfig-001-20260530    clang-23
nios2                 randconfig-002-20260530    clang-23
openrisc                         allmodconfig    clang-23
openrisc                          allnoconfig    clang-23
openrisc                            defconfig    gcc-15.2.0
parisc                           allmodconfig    gcc-15.2.0
parisc                            allnoconfig    clang-23
parisc                           allyesconfig    clang-19
parisc                              defconfig    gcc-15.2.0
parisc                randconfig-001-20260530    gcc-8.5.0
parisc                randconfig-002-20260530    gcc-8.5.0
parisc64                            defconfig    clang-19
powerpc                          allmodconfig    gcc-15.2.0
powerpc                           allnoconfig    clang-23
powerpc               randconfig-001-20260530    gcc-8.5.0
powerpc               randconfig-002-20260530    gcc-8.5.0
powerpc64             randconfig-001-20260530    gcc-8.5.0
powerpc64             randconfig-002-20260530    gcc-8.5.0
riscv                            allmodconfig    clang-23
riscv                             allnoconfig    clang-23
riscv                            allyesconfig    clang-16
riscv                               defconfig    gcc-15.2.0
riscv                 randconfig-001-20260530    gcc-12.5.0
riscv                 randconfig-002-20260530    gcc-12.5.0
s390                             allmodconfig    clang-19
s390                              allnoconfig    clang-23
s390                             allyesconfig    gcc-15.2.0
s390                                defconfig    gcc-15.2.0
s390                  randconfig-001-20260530    gcc-12.5.0
s390                  randconfig-002-20260530    gcc-12.5.0
sh                               allmodconfig    gcc-15.2.0
sh                                allnoconfig    clang-23
sh                               allyesconfig    clang-19
sh                                  defconfig    gcc-14
sh                    randconfig-001-20260530    gcc-12.5.0
sh                    randconfig-002-20260530    gcc-12.5.0
sparc                             allnoconfig    clang-23
sparc                               defconfig    gcc-15.2.0
sparc                 randconfig-001-20260530    gcc-9.5.0
sparc                 randconfig-002-20260530    gcc-9.5.0
sparc64                          allmodconfig    clang-23
sparc64                             defconfig    gcc-14
sparc64               randconfig-001-20260530    gcc-9.5.0
sparc64               randconfig-002-20260530    gcc-9.5.0
um                               allmodconfig    clang-19
um                                allnoconfig    clang-23
um                               allyesconfig    gcc-15.2.0
um                                  defconfig    gcc-14
um                             i386_defconfig    gcc-14
um                    randconfig-001-20260530    gcc-9.5.0
um                    randconfig-002-20260530    gcc-9.5.0
um                           x86_64_defconfig    gcc-14
x86_64                           allmodconfig    clang-20
x86_64                            allnoconfig    clang-23
x86_64                           allyesconfig    clang-20
x86_64      buildonly-randconfig-001-20260530    gcc-14
x86_64      buildonly-randconfig-002-20260530    gcc-14
x86_64      buildonly-randconfig-003-20260530    gcc-14
x86_64      buildonly-randconfig-004-20260530    gcc-14
x86_64      buildonly-randconfig-005-20260530    gcc-14
x86_64      buildonly-randconfig-006-20260530    gcc-14
x86_64                              defconfig    gcc-14
x86_64                                  kexec    clang-20
x86_64                randconfig-011-20260530    gcc-14
x86_64                randconfig-012-20260530    gcc-14
x86_64                randconfig-013-20260530    gcc-14
x86_64                randconfig-014-20260530    gcc-14
x86_64                randconfig-015-20260530    gcc-14
x86_64                randconfig-016-20260530    gcc-14
x86_64                randconfig-071-20260530    gcc-14
x86_64                randconfig-072-20260530    gcc-14
x86_64                randconfig-073-20260530    gcc-14
x86_64                randconfig-074-20260530    gcc-14
x86_64                randconfig-075-20260530    gcc-14
x86_64                randconfig-076-20260530    gcc-14
x86_64                               rhel-9.4    clang-20
x86_64                           rhel-9.4-bpf    gcc-14
x86_64                          rhel-9.4-func    clang-20
x86_64                    rhel-9.4-kselftests    clang-20
x86_64                         rhel-9.4-kunit    gcc-14
x86_64                           rhel-9.4-ltp    gcc-14
x86_64                          rhel-9.4-rust    clang-20
xtensa                            allnoconfig    clang-23
xtensa                           allyesconfig    clang-23
xtensa                randconfig-001-20260530    gcc-9.5.0
xtensa                randconfig-002-20260530    gcc-9.5.0

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* [tj-cgroup:for-next] BUILD SUCCESS ebc50c66b365d3046c7741195224d2aa7809c9b5
From: kernel test robot @ 2026-05-30  6:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
branch HEAD: ebc50c66b365d3046c7741195224d2aa7809c9b5  Merge branch 'for-7.2' into for-next

elapsed time: 724m

configs tested: 171
configs skipped: 2

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig    gcc-15.2.0
alpha                            allyesconfig    gcc-15.2.0
alpha                               defconfig    gcc-15.2.0
arc                              allmodconfig    clang-16
arc                               allnoconfig    gcc-15.2.0
arc                              allyesconfig    clang-23
arc                                 defconfig    gcc-15.2.0
arc                   randconfig-001-20260530    gcc-14.3.0
arc                   randconfig-002-20260530    gcc-14.3.0
arm                               allnoconfig    gcc-15.2.0
arm                              allyesconfig    clang-16
arm                                 defconfig    gcc-15.2.0
arm                        multi_v7_defconfig    gcc-15.2.0
arm                   randconfig-001-20260530    gcc-14.3.0
arm                   randconfig-002-20260530    gcc-14.3.0
arm                   randconfig-003-20260530    gcc-14.3.0
arm                   randconfig-004-20260530    gcc-14.3.0
arm                             rpc_defconfig    clang-18
arm64                            allmodconfig    clang-23
arm64                             allnoconfig    gcc-15.2.0
arm64                               defconfig    gcc-15.2.0
arm64                 randconfig-001-20260530    gcc-8.5.0
arm64                 randconfig-002-20260530    gcc-8.5.0
arm64                 randconfig-003-20260530    gcc-8.5.0
arm64                 randconfig-004-20260530    gcc-8.5.0
csky                             allmodconfig    gcc-15.2.0
csky                              allnoconfig    gcc-15.2.0
csky                                defconfig    gcc-15.2.0
csky                  randconfig-001-20260530    gcc-8.5.0
csky                  randconfig-002-20260530    gcc-8.5.0
hexagon                          allmodconfig    gcc-15.2.0
hexagon                           allnoconfig    gcc-15.2.0
hexagon                             defconfig    gcc-15.2.0
hexagon               randconfig-001-20260530    clang-23
hexagon               randconfig-002-20260530    clang-23
i386                             allmodconfig    clang-20
i386                              allnoconfig    gcc-15.2.0
i386                             allyesconfig    clang-20
i386        buildonly-randconfig-001-20260530    clang-20
i386        buildonly-randconfig-002-20260530    clang-20
i386        buildonly-randconfig-003-20260530    clang-20
i386        buildonly-randconfig-004-20260530    clang-20
i386        buildonly-randconfig-005-20260530    clang-20
i386        buildonly-randconfig-006-20260530    clang-20
i386                                defconfig    gcc-15.2.0
i386                  randconfig-001-20260530    clang-20
i386                  randconfig-002-20260530    clang-20
i386                  randconfig-003-20260530    clang-20
i386                  randconfig-004-20260530    clang-20
i386                  randconfig-005-20260530    clang-20
i386                  randconfig-006-20260530    clang-20
i386                  randconfig-007-20260530    clang-20
i386                  randconfig-011-20260530    clang-20
i386                  randconfig-012-20260530    clang-20
i386                  randconfig-013-20260530    clang-20
i386                  randconfig-014-20260530    clang-20
i386                  randconfig-015-20260530    clang-20
i386                  randconfig-016-20260530    clang-20
i386                  randconfig-017-20260530    clang-20
loongarch                        allmodconfig    clang-23
loongarch                         allnoconfig    gcc-15.2.0
loongarch                           defconfig    clang-19
loongarch             randconfig-001-20260530    clang-23
loongarch             randconfig-002-20260530    clang-23
m68k                             allmodconfig    gcc-15.2.0
m68k                              allnoconfig    gcc-15.2.0
m68k                             allyesconfig    clang-16
m68k                                defconfig    clang-19
microblaze                        allnoconfig    gcc-15.2.0
microblaze                       allyesconfig    gcc-15.2.0
microblaze                          defconfig    clang-19
mips                             allmodconfig    gcc-15.2.0
mips                              allnoconfig    gcc-15.2.0
mips                             allyesconfig    gcc-15.2.0
mips                        qi_lb60_defconfig    clang-23
nios2                            allmodconfig    clang-23
nios2                             allnoconfig    clang-23
nios2                               defconfig    clang-19
nios2                 randconfig-001-20260530    clang-23
nios2                 randconfig-002-20260530    clang-23
openrisc                         allmodconfig    clang-23
openrisc                          allnoconfig    clang-23
openrisc                            defconfig    gcc-15.2.0
parisc                           allmodconfig    gcc-15.2.0
parisc                            allnoconfig    clang-23
parisc                           allyesconfig    clang-19
parisc                              defconfig    gcc-15.2.0
parisc                randconfig-001-20260530    gcc-8.5.0
parisc                randconfig-002-20260530    gcc-8.5.0
parisc64                            defconfig    clang-19
powerpc                          allmodconfig    gcc-15.2.0
powerpc                           allnoconfig    clang-23
powerpc               randconfig-001-20260530    gcc-8.5.0
powerpc               randconfig-002-20260530    gcc-8.5.0
powerpc                    socrates_defconfig    gcc-15.2.0
powerpc64             randconfig-001-20260530    gcc-8.5.0
powerpc64             randconfig-002-20260530    gcc-8.5.0
riscv                            allmodconfig    clang-23
riscv                             allnoconfig    clang-23
riscv                            allyesconfig    clang-16
riscv                               defconfig    gcc-15.2.0
riscv                 randconfig-001-20260530    gcc-12.5.0
riscv                 randconfig-002-20260530    gcc-12.5.0
s390                             allmodconfig    clang-19
s390                              allnoconfig    clang-23
s390                             allyesconfig    gcc-15.2.0
s390                                defconfig    gcc-15.2.0
s390                  randconfig-001-20260530    gcc-12.5.0
s390                  randconfig-002-20260530    gcc-12.5.0
sh                               allmodconfig    gcc-15.2.0
sh                                allnoconfig    clang-23
sh                               allyesconfig    clang-19
sh                                  defconfig    gcc-14
sh                    randconfig-001-20260530    gcc-12.5.0
sh                    randconfig-002-20260530    gcc-12.5.0
sparc                             allnoconfig    clang-23
sparc                               defconfig    gcc-15.2.0
sparc                 randconfig-001-20260530    gcc-9.5.0
sparc                 randconfig-002-20260530    gcc-9.5.0
sparc64                          allmodconfig    clang-23
sparc64                             defconfig    gcc-14
sparc64               randconfig-001-20260530    gcc-9.5.0
sparc64               randconfig-002-20260530    gcc-9.5.0
um                               allmodconfig    clang-19
um                                allnoconfig    clang-23
um                               allyesconfig    gcc-15.2.0
um                                  defconfig    gcc-14
um                             i386_defconfig    gcc-14
um                    randconfig-001-20260530    gcc-9.5.0
um                    randconfig-002-20260530    gcc-9.5.0
um                           x86_64_defconfig    gcc-14
x86_64                           allmodconfig    clang-20
x86_64                            allnoconfig    clang-23
x86_64                           allyesconfig    clang-20
x86_64      buildonly-randconfig-001-20260530    gcc-14
x86_64      buildonly-randconfig-002-20260530    gcc-14
x86_64      buildonly-randconfig-003-20260530    gcc-14
x86_64      buildonly-randconfig-004-20260530    gcc-14
x86_64      buildonly-randconfig-005-20260530    gcc-14
x86_64      buildonly-randconfig-006-20260530    gcc-14
x86_64                              defconfig    gcc-14
x86_64                                  kexec    clang-20
x86_64                randconfig-001-20260530    gcc-14
x86_64                randconfig-002-20260530    gcc-14
x86_64                randconfig-003-20260530    gcc-14
x86_64                randconfig-004-20260530    gcc-14
x86_64                randconfig-005-20260530    gcc-14
x86_64                randconfig-006-20260530    gcc-14
x86_64                randconfig-011-20260530    gcc-14
x86_64                randconfig-012-20260530    gcc-14
x86_64                randconfig-013-20260530    gcc-14
x86_64                randconfig-014-20260530    gcc-14
x86_64                randconfig-015-20260530    gcc-14
x86_64                randconfig-016-20260530    gcc-14
x86_64                randconfig-071-20260530    gcc-14
x86_64                randconfig-072-20260530    gcc-14
x86_64                randconfig-073-20260530    gcc-14
x86_64                randconfig-074-20260530    gcc-14
x86_64                randconfig-075-20260530    gcc-14
x86_64                randconfig-076-20260530    gcc-14
x86_64                               rhel-9.4    clang-20
x86_64                           rhel-9.4-bpf    gcc-14
x86_64                          rhel-9.4-func    clang-20
x86_64                    rhel-9.4-kselftests    clang-20
x86_64                         rhel-9.4-kunit    gcc-14
x86_64                           rhel-9.4-ltp    gcc-14
x86_64                          rhel-9.4-rust    clang-20
xtensa                            allnoconfig    clang-23
xtensa                           allyesconfig    clang-23
xtensa                randconfig-001-20260530    gcc-9.5.0
xtensa                randconfig-002-20260530    gcc-9.5.0

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH v5 1/9] mm: list_lru: fix set_shrinker_bit() call during race with cgroup deletion
From: Wei Yang @ 2026-05-30  2:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Shakeel Butt,
	Michal Hocko, Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng,
	Yosry Ahmed, Zi Yan, Liam R . Howlett, Usama Arif,
	Kiryl Shutsemau, Vlastimil Babka, Kairui Song, Mikhail Zaslonko,
	Vasily Gorbik, Baolin Wang, Barry Song, Dev Jain, Lance Yang,
	Nico Pache, Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-2-hannes@cmpxchg.org>

On Wed, May 27, 2026 at 04:45:08PM -0400, Johannes Weiner wrote:
>When list_lru_add() races with cgroup deletion, the shrinker bit is set
>on the wrong group and lost. This can cause a shrinker run to miss the
>cgroup that actually has the object.
>
>When the passed in memcg is dead, the function finds the first non-dead
>parent from the passed in memcg and adds the object there; but the
>shrinker bit is set on the memcg that was passed in.
>

This means we merely fail to reclaim some objects; the kernel won't crash.

>This bug is as old as the shrinker bitmap itself.
>
>Fix it by returning the "effective" memcg from the locking function, and
>have the caller use that.
>
>Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
>Reported-by: Usama Arif <usama.arif@linux.dev>
>Reported-by: Sashiko
>Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

So we don't want to cc stable, right?

The fix looks right, so

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

-- 
Wei Yang
Help you, Help me
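
Editorial sketch of the race described in the quoted commit message. This is a
toy userspace model, not the kernel code: struct toy_memcg and every toy_*
function are hypothetical names invented for illustration, and the real
list_lru locking is omitted. It only shows why setting the bit on the passed-in
group loses the object, and why returning the "effective" group fixes that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup; NOT the kernel's struct mem_cgroup. */
struct toy_memcg {
	struct toy_memcg *parent;	/* NULL for the root */
	bool dead;			/* set once the cgroup has been deleted */
	bool shrinker_bit;		/* "this group has objects worth shrinking" */
	int nr_items;			/* objects queued on this group's list */
};

/* Walk up to the first non-dead ancestor; the root is assumed never to die. */
static struct toy_memcg *toy_effective_memcg(struct toy_memcg *memcg)
{
	assert(memcg != NULL);
	while (memcg->dead && memcg->parent)
		memcg = memcg->parent;
	return memcg;
}

/* Buggy pattern: the object lands on the live ancestor, the bit on the argument. */
static void toy_add_buggy(struct toy_memcg *memcg)
{
	struct toy_memcg *eff = toy_effective_memcg(memcg);

	eff->nr_items++;
	memcg->shrinker_bit = true;	/* wrong group when memcg is dead */
}

/* Fixed pattern: the lookup result is returned and used for both updates. */
static struct toy_memcg *toy_add_fixed(struct toy_memcg *memcg)
{
	struct toy_memcg *eff = toy_effective_memcg(memcg);

	eff->nr_items++;
	eff->shrinker_bit = true;
	return eff;
}

/* A shrinker pass only visits groups whose bit is set; returns objects freed. */
static int toy_shrink(struct toy_memcg *memcg)
{
	int freed;

	if (!memcg->shrinker_bit)
		return 0;
	freed = memcg->nr_items;
	memcg->nr_items = 0;
	memcg->shrinker_bit = false;
	return freed;
}
```

In this model, calling toy_add_buggy() on a dead child leaves the object on the
parent with the parent's bit clear, so toy_shrink() on the parent skips it. The
object is not reclaimed, but nothing is corrupted, which matches the
observation above that the bug causes missed reclaim rather than a crash.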

* Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Yosry Ahmed @ 2026-05-30  1:40 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Hao Jia, akpm, tj, hannes, shakeel.butt, mhocko, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <CAKEwX=MQe_KFZe2vBXQYh0aa-x+E8AzNwmyjJGJk4tDoS9ML3A@mail.gmail.com>

On Fri, May 29, 2026 at 12:58:09PM -0700, Nhat Pham wrote:
> On Tue, May 26, 2026 at 4:46 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
> >
> > From: Hao Jia <jiahao1@lixiang.com>
> >
> > Zswap currently writes back pages to backing swap reactively, triggered
> > either by the shrinker or when the pool reaches its size limit. There is
> > no mechanism to control the amount of writeback for a specific memory
> > cgroup. However, users may want to proactively write back zswap pages,
> > e.g., to free up memory for other applications or to prepare for
> > memory-intensive workloads.
> >
> > Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
> > interface. When specified, this key bypasses standard memory reclaim
> > and exclusively performs proactive zswap writeback up to the requested
> > budget. If omitted, the default reclaim behavior remains unchanged.
> >
> > Example usage:
> >   # Write back 100MB of pages from zswap to the backing swap
> >   echo "100M zswap_writeback_only" > memory.reclaim
> 
> Hmmm, so this 100MB is the pre-compression size? i.e if this 100 MB
> compresses to 25 MB, then you're only freeing 25 MB?
> 
> I'm ok-ish with this, but can you document it?

That's a good point. I think pre-compressed size doesn't make sense to
be honest. We should care about how much memory we are actually trying
to save by doing writeback here.

The pre-compressed size is only useful in determining the blast radius,
how many actual pages are going to have slower page faults now. But
then, I don't think there's a reasonable way for userspace to decide
that.

I understand passing in the compressed size is tricky because we need to
keep track of the size of the compressed pages we end up writing back,
but it should be doable.

If we really want pre-compressed size here, then yes we need to make it
very clear, and I vote that we use a separate interface in this case
because memory.reclaim having different meanings for the amount of
memory written to it is extremely counter-intuitive.

> 
> The rest seems solid to me, FWIW. I'll defer to Johannes and Yosry for
> opinions on zswap-only proactive reclaim.
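
Editorial sketch of the arithmetic behind the two possible meanings of the
budget. The helper names and the fixed average compression ratio are
assumptions made for illustration; nothing here is taken from the patch, and
real zswap accounting also depends on per-entry sizes and allocator overhead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: memory actually freed when `uncompressed_bytes` worth
 * of pages are written back from zswap, assuming an average compression
 * ratio of ratio_num:ratio_den (for example 4:1).
 */
static uint64_t toy_freed_bytes(uint64_t uncompressed_bytes,
				uint64_t ratio_num, uint64_t ratio_den)
{
	assert(ratio_num > 0);
	return uncompressed_bytes * ratio_den / ratio_num;
}

/*
 * Hypothetical helper: number of base pages that will take slower page
 * faults after writeback (the "blast radius"), rounded up.
 */
static uint64_t toy_pages_affected(uint64_t uncompressed_bytes,
				   uint64_t page_size)
{
	assert(page_size > 0);
	return (uncompressed_bytes + page_size - 1) / page_size;
}
```

With a 100 MB pre-compression budget and an assumed 4:1 ratio,
toy_freed_bytes(100 << 20, 4, 1) gives 25 MB of memory actually freed, while
toy_pages_affected(100 << 20, 4096) gives 25600 pages that now fault from the
backing swap device. The two numbers answer different questions, which is the
reason for asking that the unit be documented or split into a separate
interface.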

