Linux cgroups development
* Re: [PATCH v3] security: Expand task_setscheduler LSM hook to include CPU affinity mask
From: Aaron Tomlin @ 2026-05-28  1:19 UTC
  To: Peter Zijlstra
  Cc: tsbogend, paul, jmorris, serge, mingo, juri.lelli,
	vincent.guittot, stephen.smalley.work, casey, longman, tj, hannes,
	mkoutny, chenridong, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, omosnace, kees, neelx, sean, chjohnst,
	steve, mproche, nick.lange, cgroups, linux-mips, linux-fsdevel,
	linux-security-module, selinux, linux-kernel
In-Reply-To: <20260527195858.GC3493090@noisy.programming.kicks-ass.net>

On Wed, May 27, 2026 at 09:58:58PM +0200, Peter Zijlstra wrote:
> On Wed, May 27, 2026 at 01:41:52PM -0400, Aaron Tomlin wrote:
> 
> > > > The actual use case here is multi-tenant workload isolation and visibility.
> > > > Passing the evaluated cpumask to the BPF LSM allows operators to write a
> > > > simple eBPF program to detect spatial boundary overlaps (e.g., logging an
> > > > event if a requested mask intersects with platform-reserved cores).
> 
> Why isn't cgroups good enough to enforce this? If you create a cgroup
> hierarchy per tenant, and constrain them using the cpuset controller,
> they should not be able to escape, rendering this event impossible.

Hi Peter,

You raise a very fair point. The cpuset cgroup controller is indeed the
kernel's primary vehicle for spatial enforcement, and under normal
circumstances, it successfully prevents a tenant from escaping their
designated cores.

The cpuset controller does govern resource limits, but does not audit
intent. When __sched_setaffinity() is invoked, the kernel compares the
requested in_mask against the task's allowed cpuset. If there is only a
partial intersection, the kernel silently truncates the requested mask to
fit the cpuset, without raising any alarm.

With this patch, the BPF LSM hook would instead receive the raw, untruncated
in_mask, affording operators the visibility to detect, audit, and even reject
these violations of intent before the kernel silently sanitises the input.

This patch does not seek to replace the cpuset controller, but rather to
complement it by providing auditing capabilities.
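
To illustrate the distinction, here is a minimal userspace sketch of the
clamping described above. It is a simplified model rather than kernel code:
the demo_* names and the 64-bit mask type are hypothetical, and treating an
empty intersection as -EINVAL is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t demo_cpumask_t;

/* Build a mask covering CPUs first..last, inclusive. */
static inline demo_cpumask_t demo_mask_range(unsigned int first,
					     unsigned int last)
{
	demo_cpumask_t mask = 0;

	assert(first <= last && last < 64);
	for (unsigned int cpu = first; cpu <= last; cpu++)
		mask |= (demo_cpumask_t)1 << cpu;
	return mask;
}

/*
 * Model of the clamping step: keep only the requested CPUs that the
 * cpuset allows. An empty result is modelled as -EINVAL; a partial
 * overlap succeeds and gives the caller no hint that bits were dropped.
 */
static inline int demo_apply_affinity(demo_cpumask_t requested,
				      demo_cpumask_t allowed,
				      demo_cpumask_t *effective)
{
	demo_cpumask_t result = requested & allowed;

	if (!result)
		return -EINVAL;
	*effective = result;
	return 0;
}

/* What an auditing hook could derive from the raw, unclamped request. */
static inline bool demo_request_exceeds(demo_cpumask_t requested,
					demo_cpumask_t allowed)
{
	return (requested & ~allowed) != 0;
}
```

In this model, a request for CPUs 2-5 against an allowed set of CPUs 0-3
succeeds with an effective mask of CPUs 2-3. Only demo_request_exceeds(),
which sees the original request, can tell that CPUs 4-5 were asked for.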

> > We are not creating a bespoke BPF hook here; rather, we are rectifying a
> > historical blind spot within the API. The existing LSM hook is invoked
> > during sched_setaffinity(), yet it presently receives only the task_struct
> > pointer. Consequently, the security module is essentially asked, "Should
> > Process A be permitted to alter Process B's affinity?" without being
> > informed of the proposed affinity itself. Providing in_mask simply
> > furnishes the existing hook with the requisite payload to make an informed
> > decision.
> 
> It occurs to me that this same argument would require to also pass in
> the new sched_attr, no? That way the LSM can inspect the new policy
> before it becomes effective.

I agree, the underlying logic does indeed extend perfectly to sched_attr.

Presently, the LSM is equally oblivious as to whether a process is
requesting a benign transition to SCHED_BATCH, or attempting to escalate
its privileges by requesting a real-time policy such as SCHED_FIFO with
maximum priority. Just as with the CPU mask, providing the sched_attr
payload would rectify this parallel blind spot, allowing BPF policies to
inspect and mediate scheduling attributes before they become effective.
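
As a rough sketch of the kind of decision this would enable, consider the
following userspace model. Everything named demo_* is hypothetical, including
the request structure (a stand-in for the relevant sched_attr fields) and the
tenant rule itself; only the policy numbers mirror the Linux UAPI values.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Policy numbers as defined in the Linux UAPI (uapi/linux/sched.h). */
enum demo_policy {
	DEMO_SCHED_NORMAL   = 0,
	DEMO_SCHED_FIFO     = 1,
	DEMO_SCHED_RR       = 2,
	DEMO_SCHED_BATCH    = 3,
	DEMO_SCHED_IDLE     = 5,
	DEMO_SCHED_DEADLINE = 6,
};

/* Hypothetical stand-in for the sched_attr fields a hook might inspect. */
struct demo_sched_request {
	unsigned int policy;
	unsigned int priority;	/* meaningful for FIFO/RR only */
};

static inline bool demo_is_realtime(unsigned int policy)
{
	return policy == DEMO_SCHED_FIFO || policy == DEMO_SCHED_RR;
}

/*
 * Hypothetical tenant rule: deadline scheduling is refused outright,
 * real-time policies are permitted up to max_rt_prio, and everything
 * else is allowed. Returns 0 to permit and -EPERM to deny.
 */
static inline int demo_mediate(const struct demo_sched_request *req,
			       unsigned int max_rt_prio)
{
	assert(req != NULL);

	if (req->policy == DEMO_SCHED_DEADLINE)
		return -EPERM;
	if (demo_is_realtime(req->policy) && req->priority > max_rt_prio)
		return -EPERM;
	return 0;
}
```

Without the requested attributes, a hook can only answer for the target task
as a whole; with them, a rule like the one above can permit the SCHED_BATCH
transition while refusing SCHED_FIFO at priority 99.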

If you are amenable, I should be more than happy to expand the scope of the
forthcoming patch to include this. Alternatively, we could address the
sched_attr expansion in a separate, subsequent patch. Personally, I would
favour the latter approach, but please do let me know your preference.

I very much look forward to hearing Paul's thoughts on whether this aligns
with the broader LSM vision.

Thank you.


Kind regards,
-- 
Aaron Tomlin


* [syzbot] [cgroups?] [mm?] INFO: rcu detected stall in clone3 (6)
From: syzbot @ 2026-05-28  1:12 UTC
  To: akpm, cgroups, hannes, jackmanb, linux-kernel, linux-mm, mhocko,
	surenb, syzkaller-bugs, vbabka, ziy

Hello,

syzbot found the following issue on:

HEAD commit:    e8c2f9fdadee Merge tag 'for-7.1/hpfs-fixes' of git://git.k..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=131dcf96580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8d24a1331e060dda
dashboard link: https://syzkaller.appspot.com/bug?extid=774c2dfaebdf78f984c5
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/7980daa950e4/disk-e8c2f9fd.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/8bdb257b9cb5/vmlinux-e8c2f9fd.xz
kernel image: https://storage.googleapis.com/syzbot-assets/827a38d4946b/bzImage-e8c2f9fd.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+774c2dfaebdf78f984c5@syzkaller.appspotmail.com

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-1): P17948/1:b..l
rcu: 	(detected by 0, t=10503 jiffies, g=351569, q=826004 ncpus=2)
task:syz.4.14533     state:R  running task     stack:25592 pid:17948 tgid:17948 ppid:10462  task_flags:0x400040 flags:0x00080002
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5388 [inline]
 __schedule+0x1821/0x5740 kernel/sched/core.c:7189
 preempt_schedule_irq+0x4d/0xa0 kernel/sched/core.c:7513
 irqentry_exit_to_kernel_mode include/linux/irq-entry-common.h:539 [inline]
 irqentry_exit+0x14f/0x760 kernel/entry/common.c:164
 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697
RIP: 0010:__rcu_read_unlock+0x0/0xe0 kernel/rcu/tree_plugin.h:431
Code: d9 80 e1 07 80 c1 03 38 c1 7c dc 48 89 df e8 07 77 85 00 eb d2 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <f3> 0f 1e fa 41 57 41 56 41 55 41 54 53 49 bf 00 00 00 00 00 fc ff
RSP: 0018:ffffc9000596f390 EFLAGS: 00000286
RAX: c3ba2f73bc45e400 RBX: 00007f67769da601 RCX: 0000000000000046
RDX: 0000000000000001 RSI: ffffffff8e220c4d RDI: ffffffff8c28b860
RBP: dffffc0000000000 R08: 0000000000000022 R09: ffffffff8e95cce0
R10: dffffc0000000000 R11: ffffffff81b0e040 R12: 00007fff7b795798
R13: ffffc90005968000 R14: ffffc9000596f468 R15: ffffffff8176e256
 rcu_read_unlock include/linux/rcupdate.h:871 [inline]
 class_rcu_destructor include/linux/rcupdate.h:1181 [inline]
 unwind_next_frame+0x1bbf/0x2550 arch/x86/kernel/unwind_orc.c:709
 arch_stack_walk+0x11b/0x150 arch/x86/kernel/stacktrace.c:25
 stack_trace_save+0xa9/0x100 kernel/stacktrace.c:122
 save_stack+0x122/0x230 mm/page_owner.c:165
 __reset_page_owner+0x71/0x1f0 mm/page_owner.c:320
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1402 [inline]
 __free_frozen_pages+0xbc7/0xd30 mm/page_alloc.c:2943
 __slab_free+0x274/0x2c0 mm/slub.c:5613
 qlink_free mm/kasan/quarantine.c:163 [inline]
 qlist_free_all+0x99/0x100 mm/kasan/quarantine.c:179
 kasan_quarantine_reduce+0x148/0x160 mm/kasan/quarantine.c:286
 __kasan_slab_alloc+0x22/0x80 mm/kasan/common.c:350
 kasan_slab_alloc include/linux/kasan.h:253 [inline]
 slab_post_alloc_hook mm/slub.c:4570 [inline]
 slab_alloc_node mm/slub.c:4899 [inline]
 kmem_cache_alloc_node_noprof+0x384/0x690 mm/slub.c:4951
 alloc_task_struct_node kernel/fork.c:187 [inline]
 dup_task_struct+0x52/0x840 kernel/fork.c:918
 copy_process+0x89b/0x4440 kernel/fork.c:2090
 kernel_clone+0x284/0x8f0 kernel/fork.c:2721
 __do_sys_clone3 kernel/fork.c:3025 [inline]
 __se_sys_clone3+0x33c/0x360 kernel/fork.c:3004
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f6775b9dc49
RSP: 002b:00007fff7b795798 EFLAGS: 00000202 ORIG_RAX: 00000000000001b3
RAX: ffffffffffffffda RBX: 00007f6775b591e0 RCX: 00007f6775b9dc49
RDX: 00007f6775b591e0 RSI: 0000000000000058 RDI: 00007fff7b7957f0
RBP: 00007f67769da6c0 R08: 00007f67769da6c0 R09: 00007fff7b7958d7
R10: 0000000000000008 R11: 0000000000000202 R12: ffffffffffffffe8
R13: 000000000000006e R14: 00007fff7b7957f0 R15: 00007fff7b7958d8
 </TASK>
rcu: rcu_preempt kthread starved for 332 jiffies! g351569 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt     state:R  running task     stack:27536 pid:16    tgid:16    ppid:2      task_flags:0x208040 flags:0x00080000
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5388 [inline]
 __schedule+0x1821/0x5740 kernel/sched/core.c:7189
 __schedule_loop kernel/sched/core.c:7268 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7283
 schedule_timeout+0x158/0x2c0 kernel/time/sleep_timeout.c:99
 rcu_gp_fqs_loop+0x312/0x11d0 kernel/rcu/tree.c:2095
 rcu_gp_kthread+0x9e/0x2b0 kernel/rcu/tree.c:2297
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
rcu: Stack dump where RCU GP kthread last ran:
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 UID: 0 PID: 9543 Comm: kworker/u8:15 Tainted: G             L      syzkaller #0 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Workqueue: wg-kex-wg1 wg_packet_handshake_send_worker
RIP: 0010:on_stack arch/x86/include/asm/stacktrace.h:58 [inline]
RIP: 0010:stack_access_ok arch/x86/kernel/unwind_orc.c:409 [inline]
RIP: 0010:deref_stack_reg arch/x86/kernel/unwind_orc.c:419 [inline]
RIP: 0010:unwind_next_frame+0xdd5/0x2550 arch/x86/kernel/unwind_orc.c:614
Code: 61 0b ba 00 48 89 5c 24 60 4c 89 64 24 18 49 8d 5c 24 f8 4d 8b 66 10 48 b8 00 00 00 00 00 fc ff df 48 8b 4c 24 20 0f b6 04 01 <84> c0 0f 85 2c 12 00 00 41 83 3e 00 0f 95 c0 49 39 df 0f 96 c1 20
RSP: 0018:ffffc90000a075f8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffc90000a07c28 RCX: 1ffff92000140ed9
RDX: ffffffff914bec6a RSI: 0000000000000002 RDI: ffffffff8c28b800
RBP: 1ffff92000140eda R08: 000000000000000b R09: ffffffff8e95cce0
R10: dffffc0000000000 R11: ffffffff81b0e040 R12: ffffc90000a09000
R13: 1ffff92000140edb R14: ffffc90000a076c8 R15: ffffc90000a01000
FS:  0000000000000000(0000) GS:ffff888125387000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0dcae0021c CR3: 0000000090206000 CR4: 0000000000350ef0
Call Trace:
 <IRQ>
 arch_stack_walk+0x11b/0x150 arch/x86/kernel/stacktrace.c:25
 stack_trace_save+0xa9/0x100 kernel/stacktrace.c:122
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
 poison_slab_object mm/kasan/common.c:253 [inline]
 __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
 kasan_slab_free include/linux/kasan.h:235 [inline]
 slab_free_hook mm/slub.c:2689 [inline]
 slab_free mm/slub.c:6251 [inline]
 kmem_cache_free+0x182/0x650 mm/slub.c:6378
 kfree_skb_reason include/linux/skbuff.h:1322 [inline]
 enqueue_to_backlog+0x69b/0xee0 net/core/dev.c:5421
 netif_rx_internal+0x120/0x560 net/core/dev.c:5719
 __netif_rx+0x78/0xc0 net/core/dev.c:5739
 loopback_xmit+0x43a/0x660 drivers/net/loopback.c:90
 __netdev_start_xmit include/linux/netdevice.h:5368 [inline]
 netdev_start_xmit include/linux/netdevice.h:5377 [inline]
 xmit_one net/core/dev.c:3888 [inline]
 dev_hard_start_xmit+0x2cd/0x830 net/core/dev.c:3904
 sch_direct_xmit+0x251/0x4c0 net/sched/sch_generic.c:372
 qdisc_restart net/sched/sch_generic.c:437 [inline]
 __qdisc_run+0xa83/0x1560 net/sched/sch_generic.c:445
 qdisc_run include/net/pkt_sched.h:120 [inline]
 __dev_xmit_skb net/core/dev.c:4292 [inline]
 __dev_queue_xmit+0x1d26/0x3950 net/core/dev.c:4831
 dev_queue_xmit include/linux/netdevice.h:3418 [inline]
 neigh_hh_output include/net/neighbour.h:544 [inline]
 neigh_output include/net/neighbour.h:558 [inline]
 ip_finish_output2+0xc68/0x1070 net/ipv4/ip_output.c:237
 NF_HOOK_COND include/linux/netfilter.h:307 [inline]
 ip_output+0x29f/0x450 net/ipv4/ip_output.c:438
 synproxy_send_client_synack+0x8c1/0xe30 net/netfilter/nf_synproxy_core.c:485
 nft_synproxy_eval_v4+0x34a/0x4e0 net/netfilter/nft_synproxy.c:60
 nft_synproxy_do_eval+0x305/0x580 net/netfilter/nft_synproxy.c:142
 expr_call_ops_eval net/netfilter/nf_tables_core.c:237 [inline]
 nft_do_chain+0x48d/0x1ae0 net/netfilter/nf_tables_core.c:285
 nft_do_chain_inet+0x360/0x4b0 net/netfilter/nft_chain_filter.c:162
 nf_hook_entry_hookfn include/linux/netfilter.h:158 [inline]
 nf_hook_slow+0xc5/0x220 net/netfilter/core.c:619
 nf_hook include/linux/netfilter.h:273 [inline]
 NF_HOOK+0x21f/0x3c0 include/linux/netfilter.h:316
 NF_HOOK+0x336/0x3c0 include/linux/netfilter.h:318
 __netif_receive_skb_one_core net/core/dev.c:6202 [inline]
 __netif_receive_skb net/core/dev.c:6315 [inline]
 process_backlog+0xaa3/0x1950 net/core/dev.c:6666
 __napi_poll+0xae/0x340 net/core/dev.c:7733
 napi_poll net/core/dev.c:7796 [inline]
 net_rx_action+0x627/0xf70 net/core/dev.c:7953
 handle_softirqs+0x22a/0x840 kernel/softirq.c:622
 do_softirq+0x76/0xd0 kernel/softirq.c:523
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0xf8/0x130 kernel/softirq.c:450
 blake2s_compress+0xf9/0x1eb0 lib/crypto/x86/blake2s.h:42
 blake2s_update+0x14b/0x450 lib/crypto/blake2s.c:119
 hmac+0x2d3/0x3b0 drivers/net/wireguard/noise.c:332
 kdf drivers/net/wireguard/noise.c:367 [inline]
 message_ephemeral+0x255/0x310 drivers/net/wireguard/noise.c:493
 wg_noise_handshake_create_initiation+0x257/0x830 drivers/net/wireguard/noise.c:545
 wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:34 [inline]
 wg_packet_handshake_send_worker+0x18d/0x350 drivers/net/wireguard/send.c:51
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

* Re: [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure
From: Yosry Ahmed @ 2026-05-27 23:52 UTC
  To: Andrew Morton
  Cc: Youngjun Park, chrisl, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim
In-Reply-To: <20260527133651.2ce806fa542a82eca5ff66d6@linux-foundation.org>

On Wed, May 27, 2026 at 1:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 27 May 2026 15:22:43 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
>
> > This is v7 of the swap tier series addressing review feedback.
> > The cover letter has been simplified.
>
> One question from Sashiko.   Minor, but easy to address.
>         https://sashiko.dev/#/patchset/20260527062247.3440692-1-youngjun.park@lge.com
>
> I'm reluctant to add a new feature patchset at this time - we have a lot
> already and we're at -rc5.   What do others think?

This adds new user-visible interfaces and I think we didn't reach an
agreement on them. I specifically recall Shakeel (and perhaps other
memcg folks) having questions about the memcg interface, and I don't
see any Acks on that patch. I don't think this should be included.

* [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru
From: Johannes Weiner @ 2026-05-27 20:45 UTC
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

The deferred split queue handles cgroups in a suboptimal fashion. The
queue is per-NUMA node or per-cgroup, not the intersection. That means
on a cgrouped system, a node-restricted allocation entering reclaim
can end up splitting large pages on other nodes:

        alloc/unmap
          deferred_split_folio()
            list_add_tail(memcg->split_queue)
            set_shrinker_bit(memcg, node, deferred_shrinker_id)

        for_each_zone_zonelist_nodemask(restricted_nodes)
          mem_cgroup_iter()
            shrink_slab(node, memcg)
              shrink_slab_memcg(node, memcg)
                if test_shrinker_bit(memcg, node, deferred_shrinker_id)
                  deferred_split_scan()
                    walks memcg->split_queue

The shrinker bit adds an imperfect guard rail. As soon as the cgroup
has a single large page on the node of interest, all large pages owned
by that memcg, including those on other nodes, will be split.

list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
streamlines a lot of the list operations and reclaim walks. It's used
widely by other major shrinkers already. Convert the deferred split
queue as well.
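
The difference between the two layouts can be pictured with a toy userspace
model. This is only a sketch: the demo_* names are hypothetical, the shrinker
bit is approximated as "set whenever the cgroup has a queued folio on that
node", and none of it reflects the real list_lru data structures.

```c
#include <assert.h>

#define DEMO_NR_NODES 2

/* Large folios a cgroup has queued for splitting, by home node. */
struct demo_memcg {
	unsigned int queued[DEMO_NR_NODES];
};

static inline unsigned int demo_total(const struct demo_memcg *memcg)
{
	unsigned int sum = 0;

	for (int nid = 0; nid < DEMO_NR_NODES; nid++)
		sum += memcg->queued[nid];
	return sum;
}

/*
 * Old layout: a single per-cgroup queue guarded by a per-node bit.
 * Once the bit for nid is set, the walk covers the whole queue, so
 * folios on other nodes are split as well.
 */
static inline unsigned int demo_scan_shared_queue(const struct demo_memcg *memcg,
						  int nid)
{
	assert(nid >= 0 && nid < DEMO_NR_NODES);

	if (!memcg->queued[nid])	/* modelled shrinker bit is clear */
		return 0;
	return demo_total(memcg);
}

/* Per-node, per-cgroup lists: only the folios on nid are visited. */
static inline unsigned int demo_scan_per_node_list(const struct demo_memcg *memcg,
						   int nid)
{
	assert(nid >= 0 && nid < DEMO_NR_NODES);

	return memcg->queued[nid];
}
```

With one folio queued on node 0 and seven on node 1, a node-0 scan in this
model touches all eight folios under the shared queue, but only the single
node-0 folio with per-node lists.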

The list_lru per-memcg heads are instantiated on demand when the first
object of interest is allocated for a cgroup, by calling
folio_memcg_alloc_deferred(). Add calls to where splittable pages are
created: anon faults, swapin faults, khugepaged collapse.

These calls create all possible node heads for the cgroup at once, so
the migration code (between nodes) doesn't need any special care.

Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/huge_mm.h    |   7 +-
 include/linux/memcontrol.h |   4 -
 include/linux/mmzone.h     |  12 --
 mm/huge_memory.c           | 364 +++++++++++++------------------------
 mm/internal.h              |   2 +-
 mm/khugepaged.c            |   5 +
 mm/memcontrol.c            |  12 +-
 mm/memory.c                |   4 +
 mm/mm_init.c               |  15 --
 mm/swap_state.c            |  10 +
 10 files changed, 150 insertions(+), 285 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index edece3e26985..f6c2531a27a3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -423,10 +423,10 @@ static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list_to_order(page, NULL, 0);
 }
+
+int folio_memcg_alloc_deferred(struct folio *folio);
+
 void deferred_split_folio(struct folio *folio, bool partially_mapped);
-#ifdef CONFIG_MEMCG
-void reparent_deferred_split_queue(struct mem_cgroup *memcg);
-#endif
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze);
@@ -664,7 +664,6 @@ static inline int folio_split(struct folio *folio, unsigned int new_order,
 }
 
 static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
-static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bf1a6e131eca..20404e59fb3b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -278,10 +278,6 @@ struct mem_cgroup {
 	struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT];
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	struct deferred_split deferred_split_queue;
-#endif
-
 #ifdef CONFIG_LRU_GEN_WALKS_MMU
 	/* per-memcg mm_struct list */
 	struct lru_gen_mm_list mm_list;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1331a7b93f33..8e449f524f26 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1431,14 +1431,6 @@ struct zonelist {
  */
 extern struct page *mem_map;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-struct deferred_split {
-	spinlock_t split_queue_lock;
-	struct list_head split_queue;
-	unsigned long split_queue_len;
-};
-#endif
-
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Per NUMA node memory failure handling statistics.
@@ -1564,10 +1556,6 @@ typedef struct pglist_data {
 	unsigned long first_deferred_pfn;
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	struct deferred_split deferred_split_queue;
-#endif
-
 #ifdef CONFIG_NUMA_BALANCING
 	/* start time in ms of current promote rate limit period */
 	unsigned int nbp_rl_start;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bf9b480bb3b0..72f6caf0fec6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -14,6 +14,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
+#include <linux/list_lru.h>
 #include <linux/shrinker.h>
 #include <linux/mm_inline.h>
 #include <linux/swapops.h>
@@ -67,6 +68,8 @@ unsigned long transparent_hugepage_flags __read_mostly =
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
 
+static struct lock_class_key deferred_split_key;
+static struct list_lru deferred_split_lru;
 static struct shrinker *deferred_split_shrinker;
 static unsigned long deferred_split_count(struct shrinker *shrink,
 					  struct shrink_control *sc);
@@ -943,6 +946,13 @@ static inline void hugepage_exit_sysfs(struct kobject *hugepage_kobj)
 }
 #endif /* CONFIG_SYSFS */
 
+int folio_memcg_alloc_deferred(struct folio *folio)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+	return folio_memcg_list_lru_alloc(folio, &deferred_split_lru, GFP_KERNEL);
+}
+
 static int __init thp_shrinker_init(void)
 {
 	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
@@ -952,6 +962,13 @@ static int __init thp_shrinker_init(void)
 	if (!deferred_split_shrinker)
 		return -ENOMEM;
 
+	if (list_lru_init_memcg_key(&deferred_split_lru,
+				    deferred_split_shrinker,
+				    &deferred_split_key)) {
+		shrinker_free(deferred_split_shrinker);
+		return -ENOMEM;
+	}
+
 	deferred_split_shrinker->count_objects = deferred_split_count;
 	deferred_split_shrinker->scan_objects = deferred_split_scan;
 	shrinker_register(deferred_split_shrinker);
@@ -973,6 +990,7 @@ static int __init thp_shrinker_init(void)
 	huge_zero_folio_shrinker = shrinker_alloc(0, "thp-zero");
 	if (!huge_zero_folio_shrinker) {
 		shrinker_free(deferred_split_shrinker);
+		list_lru_destroy(&deferred_split_lru);
 		return -ENOMEM;
 	}
 
@@ -987,6 +1005,7 @@ static void __init thp_shrinker_exit(void)
 {
 	shrinker_free(huge_zero_folio_shrinker);
 	shrinker_free(deferred_split_shrinker);
+	list_lru_destroy(&deferred_split_lru);
 }
 
 static int __init hugepage_init(void)
@@ -1166,119 +1185,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 	return pmd;
 }
 
-static struct deferred_split *split_queue_node(int nid)
-{
-	struct pglist_data *pgdata = NODE_DATA(nid);
-
-	return &pgdata->deferred_split_queue;
-}
-
-#ifdef CONFIG_MEMCG
-static inline
-struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
-					   struct deferred_split *queue)
-{
-	if (mem_cgroup_disabled())
-		return NULL;
-	if (split_queue_node(folio_nid(folio)) == queue)
-		return NULL;
-	return container_of(queue, struct mem_cgroup, deferred_split_queue);
-}
-
-static struct deferred_split *memcg_split_queue(int nid, struct mem_cgroup *memcg)
-{
-	return memcg ? &memcg->deferred_split_queue : split_queue_node(nid);
-}
-#else
-static inline
-struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
-					   struct deferred_split *queue)
-{
-	return NULL;
-}
-
-static struct deferred_split *memcg_split_queue(int nid, struct mem_cgroup *memcg)
-{
-	return split_queue_node(nid);
-}
-#endif
-
-static struct deferred_split *split_queue_lock(int nid, struct mem_cgroup *memcg)
-{
-	struct deferred_split *queue;
-
-retry:
-	queue = memcg_split_queue(nid, memcg);
-	spin_lock(&queue->split_queue_lock);
-	/*
-	 * There is a period between setting memcg to dying and reparenting
-	 * deferred split queue, and during this period the THPs in the deferred
-	 * split queue will be hidden from the shrinker side.
-	 */
-	if (unlikely(memcg_is_dying(memcg))) {
-		spin_unlock(&queue->split_queue_lock);
-		memcg = parent_mem_cgroup(memcg);
-		goto retry;
-	}
-
-	return queue;
-}
-
-static struct deferred_split *
-split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags)
-{
-	struct deferred_split *queue;
-
-retry:
-	queue = memcg_split_queue(nid, memcg);
-	spin_lock_irqsave(&queue->split_queue_lock, *flags);
-	if (unlikely(memcg_is_dying(memcg))) {
-		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
-		memcg = parent_mem_cgroup(memcg);
-		goto retry;
-	}
-
-	return queue;
-}
-
-static struct deferred_split *folio_split_queue_lock(struct folio *folio)
-{
-	struct deferred_split *queue;
-
-	rcu_read_lock();
-	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
-	/*
-	 * The memcg destruction path is acquiring the split queue lock for
-	 * reparenting. Once you have it locked, it's safe to drop the rcu lock.
-	 */
-	rcu_read_unlock();
-
-	return queue;
-}
-
-static struct deferred_split *
-folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
-{
-	struct deferred_split *queue;
-
-	rcu_read_lock();
-	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
-	rcu_read_unlock();
-
-	return queue;
-}
-
-static inline void split_queue_unlock(struct deferred_split *queue)
-{
-	spin_unlock(&queue->split_queue_lock);
-}
-
-static inline void split_queue_unlock_irqrestore(struct deferred_split *queue,
-						 unsigned long flags)
-{
-	spin_unlock_irqrestore(&queue->split_queue_lock, flags);
-}
-
 static inline bool is_transparent_hugepage(const struct folio *folio)
 {
 	if (!folio_test_large(folio))
@@ -1379,6 +1285,14 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
 		return NULL;
 	}
+
+	if (folio_memcg_alloc_deferred(folio)) {
+		folio_put(folio);
+		count_vm_event(THP_FAULT_FALLBACK);
+		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
+		return NULL;
+	}
+
 	folio_throttle_swaprate(folio, gfp);
 
        /*
@@ -3890,34 +3804,43 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 	struct folio *end_folio = folio_next(folio);
 	struct folio *new_folio, *next;
 	int old_order = folio_order(folio);
+	struct list_lru_one *lru;
+	bool dequeue_deferred;
 	int ret = 0;
-	struct deferred_split *ds_queue;
 
 	VM_WARN_ON_ONCE(!mapping && end);
-	/* Prevent deferred_split_scan() touching ->_refcount */
-	ds_queue = folio_split_queue_lock(folio);
+	/*
+	 * If this folio can be on the deferred split queue, lock out
+	 * the shrinker before freezing the ref. If the shrinker sees
+	 * a 0-ref folio, it assumes it beat folio_put() to the list
+	 * lock and must clean up the LRU state - the same dequeue we
+	 * will do below as part of the split.
+	 */
+	dequeue_deferred = folio_test_anon(folio) && old_order > 1;
+	if (dequeue_deferred) {
+		struct mem_cgroup *memcg;
+
+		rcu_read_lock();
+		memcg = folio_memcg(folio);
+		lru = list_lru_lock(&deferred_split_lru,
+				    folio_nid(folio), &memcg);
+	}
 	if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
 		struct swap_cluster_info *ci = NULL;
 		struct lruvec *lruvec;
 
-		if (old_order > 1) {
-			if (!list_empty(&folio->_deferred_list)) {
-				ds_queue->split_queue_len--;
-				/*
-				 * Reinitialize page_deferred_list after removing the
-				 * page from the split_queue, otherwise a subsequent
-				 * split will see list corruption when checking the
-				 * page_deferred_list.
-				 */
-				list_del_init(&folio->_deferred_list);
-			}
+		if (dequeue_deferred) {
+			__list_lru_del(&deferred_split_lru, lru,
+				       &folio->_deferred_list, folio_nid(folio));
 			if (folio_test_partially_mapped(folio)) {
 				folio_clear_partially_mapped(folio);
 				mod_mthp_stat(old_order,
 					MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 			}
+			list_lru_unlock(lru);
+			rcu_read_unlock();
 		}
-		split_queue_unlock(ds_queue);
+
 		if (mapping) {
 			int nr = folio_nr_pages(folio);
 
@@ -4017,7 +3940,10 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 		if (ci)
 			swap_cluster_unlock(ci);
 	} else {
-		split_queue_unlock(ds_queue);
+		if (dequeue_deferred) {
+			list_lru_unlock(lru);
+			rcu_read_unlock();
+		}
 		return -EAGAIN;
 	}
 
@@ -4383,33 +4309,37 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
  * queueing THP splits, and that list is (racily observed to be) non-empty.
  *
  * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
- * zero: because even when split_queue_lock is held, a non-empty _deferred_list
- * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ * zero: because even when the list_lru lock is held, a non-empty
+ * _deferred_list might be in use on deferred_split_scan()'s unlocked
+ * on-stack list.
  *
- * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
- * therefore important to unqueue deferred split before changing folio memcg.
+ * The list_lru sublist is determined by folio's memcg: it is therefore
+ * important to unqueue deferred split before changing folio memcg.
  */
 bool __folio_unqueue_deferred_split(struct folio *folio)
 {
-	struct deferred_split *ds_queue;
+	struct mem_cgroup *memcg;
+	struct list_lru_one *lru;
+	int nid = folio_nid(folio);
 	unsigned long flags;
 	bool unqueued = false;
 
 	WARN_ON_ONCE(folio_ref_count(folio));
 	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio));
 
-	ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
-	if (!list_empty(&folio->_deferred_list)) {
-		ds_queue->split_queue_len--;
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	lru = list_lru_lock_irqsave(&deferred_split_lru, nid, &memcg, &flags);
+	if (__list_lru_del(&deferred_split_lru, lru, &folio->_deferred_list, nid)) {
 		if (folio_test_partially_mapped(folio)) {
 			folio_clear_partially_mapped(folio);
 			mod_mthp_stat(folio_order(folio),
 				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 		}
-		list_del_init(&folio->_deferred_list);
 		unqueued = true;
 	}
-	split_queue_unlock_irqrestore(ds_queue, flags);
+	list_lru_unlock_irqrestore(lru, &flags);
+	rcu_read_unlock();
 
 	return unqueued;	/* useful for debug warnings */
 }
@@ -4417,7 +4347,9 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
 /* partially_mapped=false won't clear PG_partially_mapped folio flag */
 void deferred_split_folio(struct folio *folio, bool partially_mapped)
 {
-	struct deferred_split *ds_queue;
+	struct list_lru_one *lru;
+	int nid;
+	struct mem_cgroup *memcg;
 	unsigned long flags;
 
 	/*
@@ -4440,7 +4372,11 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 	if (folio_test_swapcache(folio))
 		return;
 
-	ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
+	nid = folio_nid(folio);
+
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	lru = list_lru_lock_irqsave(&deferred_split_lru, nid, &memcg, &flags);
 	if (partially_mapped) {
 		if (!folio_test_partially_mapped(folio)) {
 			folio_set_partially_mapped(folio);
@@ -4448,36 +4384,20 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 				count_vm_event(THP_DEFERRED_SPLIT_PAGE);
 			count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
 			mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, 1);
-
 		}
 	} else {
 		/* partially mapped folios cannot become non-partially mapped */
 		VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
 	}
-	if (list_empty(&folio->_deferred_list)) {
-		struct mem_cgroup *memcg;
-
-		memcg = folio_split_queue_memcg(folio, ds_queue);
-		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
-		ds_queue->split_queue_len++;
-		if (memcg)
-			set_shrinker_bit(memcg, folio_nid(folio),
-					 shrinker_id(deferred_split_shrinker));
-	}
-	split_queue_unlock_irqrestore(ds_queue, flags);
+	__list_lru_add(&deferred_split_lru, lru, &folio->_deferred_list, nid, memcg);
+	list_lru_unlock_irqrestore(lru, &flags);
+	rcu_read_unlock();
 }
 
 static unsigned long deferred_split_count(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
-	struct pglist_data *pgdata = NODE_DATA(sc->nid);
-	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
-
-#ifdef CONFIG_MEMCG
-	if (sc->memcg)
-		ds_queue = &sc->memcg->deferred_split_queue;
-#endif
-	return READ_ONCE(ds_queue->split_queue_len);
+	return list_lru_shrink_count(&deferred_split_lru, sc);
 }
 
 static bool thp_underused(struct folio *folio)
@@ -4507,45 +4427,49 @@ static bool thp_underused(struct folio *folio)
 	return false;
 }
 
+static enum lru_status deferred_split_isolate(struct list_head *item,
+					      struct list_lru_one *lru,
+					      void *cb_arg)
+{
+	struct folio *folio = container_of(item, struct folio, _deferred_list);
+	struct list_head *freeable = cb_arg;
+
+	if (folio_try_get(folio)) {
+		list_lru_isolate_move(lru, item, freeable);
+		return LRU_REMOVED;
+	}
+
+	/*
+	 * We lost the race with folio_put(). Read folio state before
+	 * the isolate: folio_unqueue_deferred_split() checks list_empty()
+	 * locklessly, so once removed the folio can be freed at any time.
+	 */
+	if (folio_test_partially_mapped(folio)) {
+		folio_clear_partially_mapped(folio);
+		mod_mthp_stat(folio_order(folio),
+			      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
+	}
+	list_lru_isolate(lru, item);
+	return LRU_REMOVED;
+}
+
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
-	struct deferred_split *ds_queue;
-	unsigned long flags;
+	LIST_HEAD(dispose);
 	struct folio *folio, *next;
-	int split = 0, i;
-	struct folio_batch fbatch;
-
-	folio_batch_init(&fbatch);
+	int split = 0;
+	unsigned long isolated;
 
-retry:
-	ds_queue = split_queue_lock_irqsave(sc->nid, sc->memcg, &flags);
-	/* Take pin on all head pages to avoid freeing them under us */
-	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
-							_deferred_list) {
-		if (folio_try_get(folio)) {
-			folio_batch_add(&fbatch, folio);
-		} else if (folio_test_partially_mapped(folio)) {
-			/* We lost race with folio_put() */
-			folio_clear_partially_mapped(folio);
-			mod_mthp_stat(folio_order(folio),
-				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
-		}
-		list_del_init(&folio->_deferred_list);
-		ds_queue->split_queue_len--;
-		if (!--sc->nr_to_scan)
-			break;
-		if (!folio_batch_space(&fbatch))
-			break;
-	}
-	split_queue_unlock_irqrestore(ds_queue, flags);
+	isolated = list_lru_shrink_walk_irq(&deferred_split_lru, sc,
+					    deferred_split_isolate, &dispose);
 
-	for (i = 0; i < folio_batch_count(&fbatch); i++) {
+	list_for_each_entry_safe(folio, next, &dispose, _deferred_list) {
 		bool did_split = false;
 		bool underused = false;
-		struct deferred_split *fqueue;
 
-		folio = fbatch.folios[i];
+		list_del_init(&folio->_deferred_list);
+
 		if (!folio_test_partially_mapped(folio)) {
 			/*
 			 * See try_to_map_unused_to_zeropage(): we cannot
@@ -4574,63 +4498,23 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		 * underused, then consider it used and don't add it back to
 		 * split_queue.
 		 */
-		if (did_split || !folio_test_partially_mapped(folio))
-			continue;
+		if (!did_split && folio_test_partially_mapped(folio)) {
 requeue:
-		/*
-		 * Add back partially mapped folios, or underused folios that
-		 * we could not lock this round.
-		 */
-		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
-		if (list_empty(&folio->_deferred_list)) {
-			list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
-			fqueue->split_queue_len++;
+			rcu_read_lock();
+			list_lru_add_irq(&deferred_split_lru,
+					 &folio->_deferred_list,
+					 folio_nid(folio),
+					 folio_memcg(folio));
+			rcu_read_unlock();
 		}
-		split_queue_unlock_irqrestore(fqueue, flags);
-	}
-	folios_put(&fbatch);
-
-	if (sc->nr_to_scan && !list_empty(&ds_queue->split_queue)) {
-		cond_resched();
-		goto retry;
+		folio_put(folio);
 	}
 
-	/*
-	 * Stop shrinker if we didn't split any page, but the queue is empty.
-	 * This can happen if pages were freed under us.
-	 */
-	if (!split && list_empty(&ds_queue->split_queue))
+	if (!split && !isolated)
 		return SHRINK_STOP;
 	return split;
 }
 
-#ifdef CONFIG_MEMCG
-void reparent_deferred_split_queue(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
-	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
-	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
-	int nid;
-
-	spin_lock_irq(&ds_queue->split_queue_lock);
-	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
-
-	if (!ds_queue->split_queue_len)
-		goto unlock;
-
-	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
-	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
-	ds_queue->split_queue_len = 0;
-
-	for_each_node(nid)
-		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
-
-unlock:
-	spin_unlock(&parent_ds_queue->split_queue_lock);
-	spin_unlock_irq(&ds_queue->split_queue_lock);
-}
-#endif
-
 #ifdef CONFIG_DEBUG_FS
 static void split_huge_pages_all(void)
 {
diff --git a/mm/internal.h b/mm/internal.h
index 5602393054f3..181e79f1d6a2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -852,7 +852,7 @@ static inline bool folio_unqueue_deferred_split(struct folio *folio)
 	/*
 	 * At this point, there is no one trying to add the folio to
 	 * deferred_list. If folio is not in deferred_list, it's safe
-	 * to check without acquiring the split_queue_lock.
+	 * to check without acquiring the list_lru lock.
 	 */
 	if (data_race(list_empty(&folio->_deferred_list)))
 		return false;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35a5f8c44c18..8ffb47f1e845 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1306,6 +1306,11 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
+	if (folio_memcg_alloc_deferred(folio)) {
+		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+		goto out_nolock;
+	}
+
 	mmap_read_lock(mm);
 	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
 					 &vma, cc, order);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 92269740eef1..d93564af82b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4035,11 +4035,6 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
 		memcg->cgwb_frn[i].done =
 			__WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq);
-#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	spin_lock_init(&memcg->deferred_split_queue.split_queue_lock);
-	INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue);
-	memcg->deferred_split_queue.split_queue_len = 0;
 #endif
 	lru_gen_init_memcg(memcg);
 	return memcg;
@@ -4191,11 +4186,10 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	zswap_memcg_offline_cleanup(memcg);
 
 	memcg_offline_kmem(memcg);
-	reparent_deferred_split_queue(memcg);
 	/*
-	 * The reparenting of objcg must be after the reparenting of the
-	 * list_lru and deferred_split_queue above, which ensures that they will
-	 * not mistakenly get the parent list_lru and deferred_split_queue.
+	 * The reparenting of objcg must be after the reparenting of
+	 * the list_lru in memcg_offline_kmem(), which ensures that
+	 * they will not mistakenly get the parent list_lru.
 	 */
 	memcg_reparent_objcgs(memcg);
 	reparent_shrinker_deferred(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index 135f5c0f57bd..f22e61d8c8de 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5222,6 +5222,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 			folio_put(folio);
 			goto next;
 		}
+		if (order > 1 && folio_memcg_alloc_deferred(folio)) {
+			folio_put(folio);
+			goto fallback;
+		}
 		folio_throttle_swaprate(folio, gfp);
 		/*
 		 * When a folio is not zeroed during allocation
diff --git a/mm/mm_init.c b/mm/mm_init.c
index db5568cf36e1..c0a7f1cf6fef 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1373,19 +1373,6 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
 	pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void pgdat_init_split_queue(struct pglist_data *pgdat)
-{
-	struct deferred_split *ds_queue = &pgdat->deferred_split_queue;
-
-	spin_lock_init(&ds_queue->split_queue_lock);
-	INIT_LIST_HEAD(&ds_queue->split_queue);
-	ds_queue->split_queue_len = 0;
-}
-#else
-static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
-#endif
-
 #ifdef CONFIG_COMPACTION
 static void pgdat_init_kcompactd(struct pglist_data *pgdat)
 {
@@ -1401,8 +1388,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_resize_init(pgdat);
 	pgdat_kswapd_lock_init(pgdat);
-
-	pgdat_init_split_queue(pgdat);
 	pgdat_init_kcompactd(pgdat);
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 04f5ce992401..9c3a5cf99778 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -465,6 +465,16 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
 		return ERR_PTR(-ENOMEM);
 	}
 
+	if (order > 1 && folio_memcg_alloc_deferred(folio)) {
+		spin_lock(&ci->lock);
+		__swap_cache_do_del_folio(ci, folio, entry, shadow);
+		spin_unlock(&ci->lock);
+		folio_unlock(folio);
+		/* nr_pages refs from swap cache, 1 from allocation */
+		folio_put_refs(folio, nr_pages + 1);
+		return ERR_PTR(-ENOMEM);
+	}
+
 	/* memsw uncharges swap when folio is added to swap cache */
 	memcg1_swapin(folio);
 	if (shadow)
-- 
2.54.0


* [PATCH v5 8/9] mm: memory: flatten alloc_anon_folio() retry loop
From: Johannes Weiner @ 2026-05-27 20:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

alloc_anon_folio() uses a top-level if (folio) that buries the success
path four levels deep. This makes for awkward long lines and wrapping.
The next patch will add more code here, so flatten this now to keep
things clean and simple.

The "next" label already exists, so jump to it when the allocation
fails (!folio).

No functional change intended.

Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
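Illustration only, not part of the patch: a minimal userspace sketch of
the same control-flow change, assuming hypothetical stand-ins
(struct obj, try_alloc(), try_charge()) for the folio allocation and
memcg charge. Both functions are meant to behave identically; only the
nesting differs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio; not a kernel type. */
struct obj { int order; };

static struct obj pool[3];

/* Assumption for the sketch: allocations above order 2 fail. */
static struct obj *try_alloc(int order)
{
	assert(order >= 0);
	if (order > 2)
		return NULL;
	pool[order].order = order;
	return &pool[order];
}

/* Assumption for the sketch: charging fails for order 2 (nonzero = error). */
static int try_charge(const struct obj *o)
{
	return o->order == 2;
}

/* Before: the success path is nested under if (o). */
static struct obj *alloc_nested(int order)
{
	struct obj *o;

	while (order >= 0) {
		o = try_alloc(order);
		if (o) {
			if (try_charge(o))
				goto next;
			return o;
		}
next:
		order--;
	}
	return NULL;
}

/* After: failures jump to next early, success stays at loop depth. */
static struct obj *alloc_flat(int order)
{
	struct obj *o;

	while (order >= 0) {
		o = try_alloc(order);
		if (!o)
			goto next;
		if (try_charge(o))
			goto next;
		return o;
next:
		order--;
	}
	return NULL;
}
```

Starting at order 3, both variants fall back past the failed allocation
and the failed charge and return the order-1 object.
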
 mm/memory.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7c020995eafc..135f5c0f57bd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5215,24 +5215,24 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	while (orders) {
 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 		folio = vma_alloc_folio(gfp, order, vma, addr);
-		if (folio) {
-			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
-				count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
-				folio_put(folio);
-				goto next;
-			}
-			folio_throttle_swaprate(folio, gfp);
-			/*
-			 * When a folio is not zeroed during allocation
-			 * (__GFP_ZERO not used) or user folios require special
-			 * handling, folio_zero_user() is used to make sure
-			 * that the page corresponding to the faulting address
-			 * will be hot in the cache after zeroing.
-			 */
-			if (user_alloc_needs_zeroing())
-				folio_zero_user(folio, vmf->address);
-			return folio;
+		if (!folio)
+			goto next;
+		if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
+			count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+			folio_put(folio);
+			goto next;
 		}
+		folio_throttle_swaprate(folio, gfp);
+		/*
+		 * When a folio is not zeroed during allocation
+		 * (__GFP_ZERO not used) or user folios require special
+		 * handling, folio_zero_user() is used to make sure
+		 * that the page corresponding to the faulting address
+		 * will be hot in the cache after zeroing.
+		 */
+		if (user_alloc_needs_zeroing())
+			folio_zero_user(folio, vmf->address);
+		return folio;
 next:
 		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
 		order = next_order(&orders, order);
-- 
2.54.0


* [PATCH v5 7/9] mm: list_lru: introduce folio_memcg_list_lru_alloc()
From: Johannes Weiner @ 2026-05-27 20:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

memcg_list_lru_alloc() is called every time an object that may end up
on the list_lru is created. It needs to quickly check if the list_lru
heads for the memcg already exist, and allocate them when they don't.

Doing this with folio objects is tricky: folio_memcg() is not stable
and requires either RCU protection or pinning the cgroup. But it's
desirable to make the existence check lightweight under RCU, and only
pin the memcg when we need to allocate list_lru heads and may block.

In preparation for switching the THP shrinker to list_lru, add a
helper function for allocating list_lru heads coming from a folio.

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
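Illustration only, not part of the patch: a single-threaded userspace
sketch of the fast-path/slow-path split described above. struct group,
struct item and the helpers are hypothetical; a plain counter stands in
for the css reference, and the RCU read section of the real fast path
is only noted in a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a memcg: refcount plus lazily allocated heads. */
struct group {
	int refs;
	int *heads;		/* NULL until first needed */
	int slow_allocs;	/* how often the slow path allocated */
};

/* Hypothetical stand-in for a folio that belongs to a group. */
struct item { struct group *grp; };

static void group_get(struct group *g)
{
	g->refs++;
}

static void group_put(struct group *g)
{
	assert(g->refs > 0);
	g->refs--;
}

/* Cheap existence check; the kernel does this under rcu_read_lock(). */
static bool heads_allocated(const struct group *g)
{
	return g->heads != NULL;
}

/* Returns 0 on success, -1 if the allocation failed. */
static int item_heads_alloc(struct item *it)
{
	struct group *g = it->grp;
	int ret = 0;

	/* Fast path: heads exist, so no reference is taken. */
	if (heads_allocated(g))
		return 0;

	/* Slow path may block: pin the group across the allocation. */
	group_get(g);
	if (!heads_allocated(g)) {
		g->heads = calloc(4, sizeof(*g->heads));
		if (g->heads)
			g->slow_allocs++;
		else
			ret = -1;
	}
	group_put(g);
	return ret;
}
```

Two items in the same group hit the slow path once; the second call
returns from the existence check without touching the refcount.
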
 include/linux/list_lru.h | 27 +++++++++++++++++++++++++++
 mm/list_lru.c            | 39 ++++++++++++++++++++++++++++++++++-----
 2 files changed, 61 insertions(+), 5 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 134cb3e5652a..a450fffe1550 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -81,6 +81,33 @@ static inline int list_lru_init_memcg_key(struct list_lru *lru, struct shrinker
 
 int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
 			 gfp_t gfp);
+
+#ifdef CONFIG_MEMCG
+/**
+ * folio_memcg_list_lru_alloc - allocate list_lru heads for shrinkable folio
+ * @folio: the newly allocated & charged folio
+ * @lru: the list_lru this might be queued on
+ * @gfp: gfp mask
+ *
+ * Allocate list_lru heads (per-memcg, per-node) needed to queue this
+ * particular folio down the line.
+ *
+ * This does memcg_list_lru_alloc(), but on the memcg that @folio is
+ * associated with. Handles folio_memcg() access rules in the fast
+ * path (list_lru heads allocated) and the allocation slowpath.
+ *
+ * Returns 0 on success, a negative error value otherwise.
+ */
+int folio_memcg_list_lru_alloc(struct folio *folio, struct list_lru *lru,
+			       gfp_t gfp);
+#else
+static inline int folio_memcg_list_lru_alloc(struct folio *folio,
+					     struct list_lru *lru, gfp_t gfp)
+{
+	return 0;
+}
+#endif
+
 void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent);
 
 /**
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 402bb028114d..41a811966063 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -568,17 +568,14 @@ static inline bool memcg_list_lru_allocated(struct mem_cgroup *memcg,
 	return idx < 0 || xa_load(&lru->xa, idx);
 }
 
-int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
-			 gfp_t gfp)
+static int __memcg_list_lru_alloc(struct mem_cgroup *memcg,
+				  struct list_lru *lru, gfp_t gfp)
 {
 	unsigned long flags;
 	struct list_lru_memcg *mlru = NULL;
 	struct mem_cgroup *pos, *parent;
 	XA_STATE(xas, &lru->xa, 0);
 
-	if (!list_lru_memcg_aware(lru) || memcg_list_lru_allocated(memcg, lru))
-		return 0;
-
 	gfp &= GFP_RECLAIM_MASK;
 	/*
 	 * Because the list_lru can be reparented to the parent cgroup's
@@ -619,6 +616,38 @@ int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
 
 	return xas_error(&xas);
 }
+
+int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
+			 gfp_t gfp)
+{
+	if (!list_lru_memcg_aware(lru) || memcg_list_lru_allocated(memcg, lru))
+		return 0;
+	return __memcg_list_lru_alloc(memcg, lru, gfp);
+}
+
+int folio_memcg_list_lru_alloc(struct folio *folio, struct list_lru *lru,
+			       gfp_t gfp)
+{
+	struct mem_cgroup *memcg;
+	int res;
+
+	if (!list_lru_memcg_aware(lru))
+		return 0;
+
+	/* Fast path when list_lru heads already exist */
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	res = memcg_list_lru_allocated(memcg, lru);
+	rcu_read_unlock();
+	if (likely(res))
+		return 0;
+
+	/* Allocation may block, pin the memcg */
+	memcg = get_mem_cgroup_from_folio(folio);
+	res = __memcg_list_lru_alloc(memcg, lru, gfp);
+	mem_cgroup_put(memcg);
+	return res;
+}
 #else
 static inline void memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 {
-- 
2.54.0


* [PATCH v5 6/9] mm: list_lru: introduce caller locking for additions and deletions
From: Johannes Weiner @ 2026-05-27 20:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

Locking is currently internal to the list_lru API. However, a caller
might want to keep auxiliary state synchronized with the LRU state.

For example, the THP shrinker uses the lock of its custom LRU to keep
PG_partially_mapped and vmstats consistent.

To allow the THP shrinker to switch to list_lru, provide normal and
irqsafe locking primitives as well as caller-locked variants of the
addition and deletion functions.

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liam R. Howlett (Oracle) <liam@infradead.org>
---
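Illustration only, not part of the patch: a userspace sketch of why
caller-held locking helps, assuming a hypothetical mini_lru with a
pthread mutex in place of the sublist spinlock. The *_locked helpers
play the role of the caller-locked variants (userspace code avoids the
reserved double-underscore prefix), and nr_flagged models auxiliary
state such as the partially-mapped statistic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical miniature of one sublist: a lock and an item count. */
struct mini_lru {
	pthread_mutex_t lock;
	long nr_items;
};

/* Per-object state the caller wants in sync with list membership. */
struct mini_obj {
	bool queued;
	bool flagged;
};

static long nr_flagged;	/* caller-owned statistic */

static void mini_lru_lock(struct mini_lru *l)
{
	pthread_mutex_lock(&l->lock);
}

static void mini_lru_unlock(struct mini_lru *l)
{
	pthread_mutex_unlock(&l->lock);
}

/* Caller-locked variants: the lock must already be held. */
static bool mini_lru_add_locked(struct mini_lru *l, struct mini_obj *o)
{
	if (o->queued)
		return false;
	o->queued = true;
	l->nr_items++;
	return true;
}

static bool mini_lru_del_locked(struct mini_lru *l, struct mini_obj *o)
{
	if (!o->queued)
		return false;
	o->queued = false;
	l->nr_items--;
	assert(l->nr_items >= 0);
	return true;
}

/* Self-locking wrapper, built from the pieces above. */
static bool mini_lru_add(struct mini_lru *l, struct mini_obj *o)
{
	bool ret;

	mini_lru_lock(l);
	ret = mini_lru_add_locked(l, o);
	mini_lru_unlock(l);
	return ret;
}

/* Caller: flag, statistic and membership change under one lock hold. */
static void queue_flagged(struct mini_lru *l, struct mini_obj *o)
{
	mini_lru_lock(l);
	if (!o->flagged) {
		o->flagged = true;
		nr_flagged++;
	}
	mini_lru_add_locked(l, o);
	mini_lru_unlock(l);
}

static bool unqueue(struct mini_lru *l, struct mini_obj *o)
{
	bool removed;

	mini_lru_lock(l);
	removed = mini_lru_del_locked(l, o);
	if (removed && o->flagged) {
		o->flagged = false;
		nr_flagged--;
	}
	mini_lru_unlock(l);
	return removed;
}
```

With purely internal locking, the flag update and the list update would
need two separate critical sections, and another thread could observe
one without the other.
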
 include/linux/list_lru.h |  43 +++++++++++++
 mm/list_lru.c            | 133 ++++++++++++++++++++++++++++++---------
 2 files changed, 145 insertions(+), 31 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index fe739d35a864..134cb3e5652a 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -83,6 +83,46 @@ int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
 			 gfp_t gfp);
 void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent);
 
+/**
+ * list_lru_lock: lock the sublist for the given node and memcg
+ * @lru: the lru pointer
+ * @nid: the node id of the sublist to lock.
+ * @memcg: pointer to the cgroup of the sublist to lock. On return,
+ *         updated to the cgroup whose sublist was actually locked,
+ *         which may be an ancestor if the original memcg was dying.
+ *
+ * The caller must release the returned sublist with list_lru_unlock()
+ * when done.
+ *
+ * You must ensure that the memcg is not freed during this call (e.g., with
+ * RCU or by taking a css refcnt).
+ *
+ * Return: the locked list_lru_one sublist, or NULL on failure
+ */
+struct list_lru_one *list_lru_lock(struct list_lru *lru, int nid,
+		struct mem_cgroup **memcg);
+
+/**
+ * list_lru_unlock: unlock a sublist locked by list_lru_lock()
+ * @l: the list_lru_one to unlock
+ */
+void list_lru_unlock(struct list_lru_one *l);
+
+struct list_lru_one *list_lru_lock_irq(struct list_lru *lru, int nid,
+		struct mem_cgroup **memcg);
+void list_lru_unlock_irq(struct list_lru_one *l);
+
+struct list_lru_one *list_lru_lock_irqsave(struct list_lru *lru, int nid,
+		struct mem_cgroup **memcg, unsigned long *irq_flags);
+void list_lru_unlock_irqrestore(struct list_lru_one *l,
+		unsigned long *irq_flags);
+
+/* Caller-locked variants, see list_lru_add() etc for documentation */
+bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
+		struct list_head *item, int nid, struct mem_cgroup *memcg);
+bool __list_lru_del(struct list_lru *lru, struct list_lru_one *l,
+		struct list_head *item, int nid);
+
 /**
  * list_lru_add: add an element to the lru list's tail
  * @lru: the lru pointer
@@ -115,6 +155,9 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
 bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
 		    struct mem_cgroup *memcg);
 
+bool list_lru_add_irq(struct list_lru *lru, struct list_head *item, int nid,
+		      struct mem_cgroup *memcg);
+
 /**
  * list_lru_add_obj: add an element to the lru list's tail
  * @lru: the lru pointer
diff --git a/mm/list_lru.c b/mm/list_lru.c
index fdb3fe2ea64f..402bb028114d 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -15,17 +15,23 @@
 #include "slab.h"
 #include "internal.h"
 
-static inline void lock_list_lru(struct list_lru_one *l, bool irq)
+static inline void lock_list_lru(struct list_lru_one *l, bool irq,
+				 unsigned long *irq_flags)
 {
-	if (irq)
+	if (irq_flags)
+		spin_lock_irqsave(&l->lock, *irq_flags);
+	else if (irq)
 		spin_lock_irq(&l->lock);
 	else
 		spin_lock(&l->lock);
 }
 
-static inline void unlock_list_lru(struct list_lru_one *l, bool irq_off)
+static inline void unlock_list_lru(struct list_lru_one *l, bool irq_off,
+				   unsigned long *irq_flags)
 {
-	if (irq_off)
+	if (irq_flags)
+		spin_unlock_irqrestore(&l->lock, *irq_flags);
+	else if (irq_off)
 		spin_unlock_irq(&l->lock);
 	else
 		spin_unlock(&l->lock);
@@ -78,7 +84,8 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
 
 static inline struct list_lru_one *
 lock_list_lru_of_memcg(struct list_lru *lru, int nid,
-		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
+		       struct mem_cgroup **memcg, bool irq,
+		       unsigned long *irq_flags, bool skip_empty)
 {
 	struct list_lru_one *l;
 
@@ -86,12 +93,12 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid,
 again:
 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(*memcg));
 	if (likely(l)) {
-		lock_list_lru(l, irq);
+		lock_list_lru(l, irq, irq_flags);
 		if (likely(READ_ONCE(l->nr_items) != LONG_MIN)) {
 			rcu_read_unlock();
 			return l;
 		}
-		unlock_list_lru(l, irq);
+		unlock_list_lru(l, irq, irq_flags);
 	}
 	/*
 	 * Caller may simply bail out if raced with reparenting or
@@ -132,24 +139,58 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
 
 static inline struct list_lru_one *
 lock_list_lru_of_memcg(struct list_lru *lru, int nid,
-		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
+		       struct mem_cgroup **memcg, bool irq,
+		       unsigned long *irq_flags, bool skip_empty)
 {
 	struct list_lru_one *l = &lru->node[nid].lru;
 
-	lock_list_lru(l, irq);
+	lock_list_lru(l, irq, irq_flags);
 
 	return l;
 }
 #endif /* CONFIG_MEMCG */
 
-/* The caller must ensure the memcg lifetime. */
-bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
-		  struct mem_cgroup *memcg)
+struct list_lru_one *list_lru_lock(struct list_lru *lru, int nid,
+				   struct mem_cgroup **memcg)
 {
-	struct list_lru_node *nlru = &lru->node[nid];
-	struct list_lru_one *l;
+	return lock_list_lru_of_memcg(lru, nid, memcg, /*irq=*/false,
+				      /*irq_flags=*/NULL, /*skip_empty=*/false);
+}
+
+void list_lru_unlock(struct list_lru_one *l)
+{
+	unlock_list_lru(l, /*irq_off=*/false, /*irq_flags=*/NULL);
+}
+
+struct list_lru_one *list_lru_lock_irq(struct list_lru *lru, int nid,
+				       struct mem_cgroup **memcg)
+{
+	return lock_list_lru_of_memcg(lru, nid, memcg, /*irq=*/true,
+				      /*irq_flags=*/NULL, /*skip_empty=*/false);
+}
+
+void list_lru_unlock_irq(struct list_lru_one *l)
+{
+	unlock_list_lru(l, /*irq_off=*/true, /*irq_flags=*/NULL);
+}
 
-	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
+struct list_lru_one *list_lru_lock_irqsave(struct list_lru *lru, int nid,
+					   struct mem_cgroup **memcg,
+					   unsigned long *flags)
+{
+	return lock_list_lru_of_memcg(lru, nid, memcg, /*irq=*/true,
+				      /*irq_flags=*/flags, /*skip_empty=*/false);
+}
+
+void list_lru_unlock_irqrestore(struct list_lru_one *l, unsigned long *flags)
+{
+	unlock_list_lru(l, /*irq_off=*/true, /*irq_flags=*/flags);
+}
+
+bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
+		    struct list_head *item, int nid,
+		    struct mem_cgroup *memcg)
+{
 	if (list_empty(item)) {
 		list_add_tail(item, &l->list);
 		/*
@@ -159,15 +200,50 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
 		 */
 		if (!l->nr_items++)
 			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
-		unlock_list_lru(l, false);
-		atomic_long_inc(&nlru->nr_items);
+		atomic_long_inc(&lru->node[nid].nr_items);
 		return true;
 	}
-	unlock_list_lru(l, false);
 	return false;
 }
 EXPORT_SYMBOL_GPL(list_lru_add);
 
+bool __list_lru_del(struct list_lru *lru, struct list_lru_one *l,
+		    struct list_head *item, int nid)
+{
+	if (!list_empty(item)) {
+		list_del_init(item);
+		l->nr_items--;
+		atomic_long_dec(&lru->node[nid].nr_items);
+		return true;
+	}
+	return false;
+}
+
+/* The caller must ensure the memcg lifetime. */
+bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
+		  struct mem_cgroup *memcg)
+{
+	struct list_lru_one *l;
+	bool ret;
+
+	l = list_lru_lock(lru, nid, &memcg);
+	ret = __list_lru_add(lru, l, item, nid, memcg);
+	list_lru_unlock(l);
+	return ret;
+}
+
+bool list_lru_add_irq(struct list_lru *lru, struct list_head *item,
+		      int nid, struct mem_cgroup *memcg)
+{
+	struct list_lru_one *l;
+	bool ret;
+
+	l = list_lru_lock_irq(lru, nid, &memcg);
+	ret = __list_lru_add(lru, l, item, nid, memcg);
+	list_lru_unlock_irq(l);
+	return ret;
+}
+
 bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
 {
 	bool ret;
@@ -189,19 +265,13 @@ EXPORT_SYMBOL_GPL(list_lru_add_obj);
 bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
 		  struct mem_cgroup *memcg)
 {
-	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
+	bool ret;
 
-	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
-	if (!list_empty(item)) {
-		list_del_init(item);
-		l->nr_items--;
-		unlock_list_lru(l, false);
-		atomic_long_dec(&nlru->nr_items);
-		return true;
-	}
-	unlock_list_lru(l, false);
-	return false;
+	l = list_lru_lock(lru, nid, &memcg);
+	ret = __list_lru_del(lru, l, item, nid);
+	list_lru_unlock(l);
+	return ret;
 }
 
 bool list_lru_del_obj(struct list_lru *lru, struct list_head *item)
@@ -274,7 +344,8 @@ __list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 	unsigned long isolated = 0;
 
 restart:
-	l = lock_list_lru_of_memcg(lru, nid, &memcg, irq_off, true);
+	l = lock_list_lru_of_memcg(lru, nid, &memcg, /*irq=*/irq_off,
+				   /*irq_flags=*/NULL, /*skip_empty=*/true);
 	if (!l)
 		return isolated;
 	list_for_each_safe(item, n, &l->list) {
@@ -315,7 +386,7 @@ __list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 			BUG();
 		}
 	}
-	unlock_list_lru(l, irq_off);
+	unlock_list_lru(l, irq_off, NULL);
 out:
 	return isolated;
 }
-- 
2.54.0


* [PATCH v5 5/9] mm: list_lru: deduplicate lock_list_lru()
From: Johannes Weiner @ 2026-05-27 20:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

The MEMCG and !MEMCG paths have the same pattern. Share the code.

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liam R. Howlett (Oracle) <liam@infradead.org>
---
 mm/list_lru.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 7d0523e44010..fdb3fe2ea64f 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -15,6 +15,14 @@
 #include "slab.h"
 #include "internal.h"
 
+static inline void lock_list_lru(struct list_lru_one *l, bool irq)
+{
+	if (irq)
+		spin_lock_irq(&l->lock);
+	else
+		spin_lock(&l->lock);
+}
+
 static inline void unlock_list_lru(struct list_lru_one *l, bool irq_off)
 {
 	if (irq_off)
@@ -68,14 +76,6 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
 	return &lru->node[nid].lru;
 }
 
-static inline void lock_list_lru(struct list_lru_one *l, bool irq)
-{
-	if (irq)
-		spin_lock_irq(&l->lock);
-	else
-		spin_lock(&l->lock);
-}
-
 static inline struct list_lru_one *
 lock_list_lru_of_memcg(struct list_lru *lru, int nid,
 		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
@@ -136,10 +136,7 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid,
 {
 	struct list_lru_one *l = &lru->node[nid].lru;
 
-	if (irq)
-		spin_lock_irq(&l->lock);
-	else
-		spin_lock(&l->lock);
+	lock_list_lru(l, irq);
 
 	return l;
 }
-- 
2.54.0


* [PATCH v5 4/9] mm: list_lru: move list dead check to lock_list_lru_of_memcg()
From: Johannes Weiner @ 2026-05-27 20:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

Only the MEMCG variant of lock_list_lru() needs to check if there is a
race with cgroup deletion and list reparenting. Move the check to the
caller, so that the next patch can unify the lock_list_lru() variants.

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liam R. Howlett (Oracle) <liam@infradead.org>
---
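A minimal user-space sketch of the resulting split, for readers who want
to poke at it outside the kernel. Everything named toy_* is hypothetical
and not part of this patch; an int counter stands in for the spinlock,
and only the LONG_MIN sentinel is taken from the code below. The lock
helper is unconditional and the caller does the dead-list check,
dropping the lock again when the list turns out to be dead.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct list_lru_one; an int replaces the spinlock. */
struct toy_lru_one {
	int lock_depth;		/* 1 while "held", 0 otherwise */
	long nr_items;		/* LONG_MIN marks a dead, reparented list */
};

/* After the patch: lock and unlock are plain, unconditional helpers. */
static void toy_lock(struct toy_lru_one *l)
{
	assert(l->lock_depth == 0);
	l->lock_depth++;
}

static void toy_unlock(struct toy_lru_one *l)
{
	assert(l->lock_depth == 1);
	l->lock_depth--;
}

/*
 * The caller owns the dead-list check. Returns the list with the toy
 * lock held, or NULL with the lock dropped again if the list is dead.
 */
static struct toy_lru_one *toy_lock_if_alive(struct toy_lru_one *l)
{
	toy_lock(l);
	if (l->nr_items != LONG_MIN)
		return l;
	toy_unlock(l);
	return NULL;
}
```

Calling toy_lock_if_alive() on a live list leaves lock_depth at 1, while
a dead list comes back as NULL with lock_depth at 0, so the lock stays
balanced on both paths.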
 mm/list_lru.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 5497034e80f3..7d0523e44010 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -68,17 +68,12 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
 	return &lru->node[nid].lru;
 }
 
-static inline bool lock_list_lru(struct list_lru_one *l, bool irq)
+static inline void lock_list_lru(struct list_lru_one *l, bool irq)
 {
 	if (irq)
 		spin_lock_irq(&l->lock);
 	else
 		spin_lock(&l->lock);
-	if (unlikely(READ_ONCE(l->nr_items) == LONG_MIN)) {
-		unlock_list_lru(l, irq);
-		return false;
-	}
-	return true;
 }
 
 static inline struct list_lru_one *
@@ -90,9 +85,13 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid,
 	rcu_read_lock();
 again:
 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(*memcg));
-	if (likely(l) && lock_list_lru(l, irq)) {
-		rcu_read_unlock();
-		return l;
+	if (likely(l)) {
+		lock_list_lru(l, irq);
+		if (likely(READ_ONCE(l->nr_items) != LONG_MIN)) {
+			rcu_read_unlock();
+			return l;
+		}
+		unlock_list_lru(l, irq);
 	}
 	/*
 	 * Caller may simply bail out if raced with reparenting or
-- 
2.54.0



* [PATCH v5 3/9] mm: list_lru: deduplicate unlock_list_lru()
From: Johannes Weiner @ 2026-05-27 20:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

The MEMCG and !MEMCG variants of unlock_list_lru() are identical, and
lock_list_lru() open-codes the same unlock pattern when bailing.
Consolidate into a common implementation.

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liam R. Howlett (Oracle) <liam@infradead.org>
---
 mm/list_lru.c | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 77999ed78fa5..5497034e80f3 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -15,6 +15,14 @@
 #include "slab.h"
 #include "internal.h"
 
+static inline void unlock_list_lru(struct list_lru_one *l, bool irq_off)
+{
+	if (irq_off)
+		spin_unlock_irq(&l->lock);
+	else
+		spin_unlock(&l->lock);
+}
+
 #ifdef CONFIG_MEMCG
 static LIST_HEAD(memcg_list_lrus);
 static DEFINE_MUTEX(list_lrus_mutex);
@@ -67,10 +75,7 @@ static inline bool lock_list_lru(struct list_lru_one *l, bool irq)
 	else
 		spin_lock(&l->lock);
 	if (unlikely(READ_ONCE(l->nr_items) == LONG_MIN)) {
-		if (irq)
-			spin_unlock_irq(&l->lock);
-		else
-			spin_unlock(&l->lock);
+		unlock_list_lru(l, irq);
 		return false;
 	}
 	return true;
@@ -101,14 +106,6 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid,
 	*memcg = parent_mem_cgroup(*memcg);
 	goto again;
 }
-
-static inline void unlock_list_lru(struct list_lru_one *l, bool irq_off)
-{
-	if (irq_off)
-		spin_unlock_irq(&l->lock);
-	else
-		spin_unlock(&l->lock);
-}
 #else
 static void list_lru_register(struct list_lru *lru)
 {
@@ -147,14 +144,6 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid,
 
 	return l;
 }
-
-static inline void unlock_list_lru(struct list_lru_one *l, bool irq_off)
-{
-	if (irq_off)
-		spin_unlock_irq(&l->lock);
-	else
-		spin_unlock(&l->lock);
-}
 #endif /* CONFIG_MEMCG */
 
 /* The caller must ensure the memcg lifetime. */
-- 
2.54.0



* [PATCH v5 2/9] mm: list_lru: lock_list_lru_of_memcg() cannot return NULL if !skip_empty
From: Johannes Weiner @ 2026-05-27 20:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

skip_empty is only for the shrinker to abort and skip a list that's
empty or whose cgroup is being deleted.

For list additions and deletions, the cgroup hierarchy is walked
upwards until a valid list_lru head is found, or it will fall back to
the node list. Acquiring the lock won't fail. Remove the NULL checks
in those callers.

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liam R. Howlett (Oracle) <liam@infradead.org>
---
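A rough user-space model of why the lookup cannot fail for additions and
deletions, under simplifying assumptions. All walk_* names are
hypothetical; a NULL group pointer stands in for the root, whose lookup
falls back to the node list, and locking is left out entirely. Only the
LONG_MIN sentinel and the skip_empty flag mirror the kernel code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

struct walk_list {
	long nr_items;			/* LONG_MIN marks a dead list */
};

struct walk_group {
	struct walk_group *parent;	/* NULL stands for the root group */
	struct walk_list *list;		/* NULL if no sublist is allocated */
};

/* Models the per-node list that the root always falls back to. */
static struct walk_list walk_node_list;

static struct walk_list *walk_lock(struct walk_group **g, bool skip_empty)
{
	for (;;) {
		/* A NULL group is the root, which uses the node list. */
		struct walk_list *l = *g ? (*g)->list : &walk_node_list;

		if (l && l->nr_items != LONG_MIN)
			return l;
		if (skip_empty)
			return NULL;	/* shrinker path: give up on this group */
		*g = (*g)->parent;	/* add/del path: retry one level up */
	}
}

/* Returns 1 if the add/del-style lookup succeeds where the shrinker-style one bails. */
static int walk_demo(void)
{
	struct walk_list dead = { .nr_items = LONG_MIN };
	struct walk_group top = { .parent = NULL, .list = NULL };
	struct walk_group leaf = { .parent = &top, .list = &dead };
	struct walk_group *g = &leaf;

	if (walk_lock(&g, true) != NULL)
		return 0;
	g = &leaf;
	return walk_lock(&g, false) == &walk_node_list && g == NULL;
}
```

In this model the node list is never marked dead, so the loop always
terminates with a valid list when skip_empty is false. That is the
property that makes the NULL checks in the add and delete paths
redundant.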
 mm/list_lru.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 45d1b97737ea..77999ed78fa5 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -165,8 +165,6 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
 	struct list_lru_one *l;
 
 	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
-	if (!l)
-		return false;
 	if (list_empty(item)) {
 		list_add_tail(item, &l->list);
 		/*
@@ -208,9 +206,8 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
 {
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
+
 	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
-	if (!l)
-		return false;
 	if (!list_empty(item)) {
 		list_del_init(item);
 		l->nr_items--;
-- 
2.54.0



* [PATCH v5 1/9] mm: list_lru: fix set_shrinker_bit() call during race with cgroup deletion
From: Johannes Weiner @ 2026-05-27 20:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

When list_lru_add() races with cgroup deletion, the shrinker bit is set
on the wrong group and lost. This can cause a shrinker run to miss the
cgroup that actually has the object.

When the passed-in memcg is dead, the locking function walks up to the
first non-dead ancestor and the object is added there; but the shrinker
bit is set on the memcg that was passed in.

This bug is as old as the shrinker bitmap itself.

Fix it by returning the "effective" memcg from the locking function, and
have the caller use that.

Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
Reported-by: Usama Arif <usama.arif@linux.dev>
Reported-by: Sashiko
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
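For illustration only, a small user-space model of the fix. The toy_*
names and the struct layout are hypothetical and are not kernel types;
locking and RCU are omitted. It shows the one idea the patch relies on:
the lookup takes the memcg by reference and updates it during the
walk-up, so the caller sets the bit on the group that actually received
the item.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; not the kernel type. */
struct toy_memcg {
	struct toy_memcg *parent;
	bool dying;		/* models a cgroup whose sublist was reparented */
	int nr_items;		/* models the per-memcg sublist item count */
	bool shrinker_bit;	/* models what set_shrinker_bit() would set */
};

/*
 * Models lock_list_lru_of_memcg() after the fix: walk up past dying
 * groups and report the group actually used through *memcg.
 */
static struct toy_memcg *toy_lock_of_memcg(struct toy_memcg **memcg)
{
	while ((*memcg)->dying && (*memcg)->parent)
		*memcg = (*memcg)->parent;
	return *memcg;
}

/* Models list_lru_add(): the bit goes on the effective group. */
static void toy_add(struct toy_memcg *memcg)
{
	struct toy_memcg *l = toy_lock_of_memcg(&memcg);

	if (!l->nr_items++)
		memcg->shrinker_bit = true;
}

/* Returns 1 if the bit landed on the parent and not on the dying child. */
static int toy_demo(void)
{
	struct toy_memcg parent = { .parent = NULL };
	struct toy_memcg child = { .parent = &parent, .dying = true };

	toy_add(&child);
	return parent.shrinker_bit && !child.shrinker_bit &&
	       parent.nr_items == 1;
}
```

If toy_lock_of_memcg() took the memcg by value instead, toy_add() would
still hold the dying child after the call and would set the bit there,
which is the behavior being fixed.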
 mm/list_lru.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index dd29bcf8eb5f..45d1b97737ea 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -77,14 +77,14 @@ static inline bool lock_list_lru(struct list_lru_one *l, bool irq)
 }
 
 static inline struct list_lru_one *
-lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
-		       bool irq, bool skip_empty)
+lock_list_lru_of_memcg(struct list_lru *lru, int nid,
+		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
 {
 	struct list_lru_one *l;
 
 	rcu_read_lock();
 again:
-	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
+	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(*memcg));
 	if (likely(l) && lock_list_lru(l, irq)) {
 		rcu_read_unlock();
 		return l;
@@ -97,8 +97,8 @@ lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 		rcu_read_unlock();
 		return NULL;
 	}
-	VM_WARN_ON(!css_is_dying(&memcg->css));
-	memcg = parent_mem_cgroup(memcg);
+	VM_WARN_ON(!css_is_dying(&(*memcg)->css));
+	*memcg = parent_mem_cgroup(*memcg);
 	goto again;
 }
 
@@ -135,8 +135,8 @@ list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
 }
 
 static inline struct list_lru_one *
-lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
-		       bool irq, bool skip_empty)
+lock_list_lru_of_memcg(struct list_lru *lru, int nid,
+		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
 {
 	struct list_lru_one *l = &lru->node[nid].lru;
 
@@ -164,12 +164,16 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
 
-	l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
+	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
 	if (!l)
 		return false;
 	if (list_empty(item)) {
 		list_add_tail(item, &l->list);
-		/* Set shrinker bit if the first element was added */
+		/*
+		 * Set shrinker bit on the memcg that owns the locked
+		 * sublist - lock_list_lru_of_memcg() may have walked up
+		 * past a dying memcg, and the bit must be set there.
+		 */
 		if (!l->nr_items++)
 			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
 		unlock_list_lru(l, false);
@@ -204,7 +208,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
 {
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
-	l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
+	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
 	if (!l)
 		return false;
 	if (!list_empty(item)) {
@@ -288,7 +292,7 @@ __list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 	unsigned long isolated = 0;
 
 restart:
-	l = lock_list_lru_of_memcg(lru, nid, memcg, irq_off, true);
+	l = lock_list_lru_of_memcg(lru, nid, &memcg, irq_off, true);
 	if (!l)
 		return isolated;
 	list_for_each_safe(item, n, &l->list) {
-- 
2.54.0



* [PATCH v5 0/9] mm: switch THP shrinker to list_lru
From: Johannes Weiner @ 2026-05-27 20:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Usama Arif, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel

This is version 5 of switching the THP shrinker to list_lru.

The core of the new version is the list_lru/set_shrinker_bit() fix up
front, which minimally affects later patches, and a rebase onto the
latest mm-unstable, which replaced alloc_swap_folio() with
__swap_cache_alloc().

The changes seemed small enough that *I chose to keep the review tags
from v4*. Please shout if you object to this!

Changes in v5:
- patch 1 is a new fix for a very old, pre-existing set_shrinker_bit()
  problem in list_lru, where the bit can be set on a dying child memcg
  instead of the ancestor that actually received the item. Pointed out
  by Usama Arif and Sashiko; fix it first to make it minimally
  backportable and so the conversion is safe.
- patches 6 and 9 adapt to that fix's new memcg-by-reference
  lock_list_lru_of_memcg() signature
- collapse_huge_page(): propagate folio_memcg_alloc_deferred() failure
  as SCAN_ALLOC_HUGE_PAGE_FAIL instead of leaking SCAN_SUCCEED and
  falsely reporting a successful MADV_COLLAPSE (Usama Arif, Sashiko)
- deferred_split_isolate(): fix a UAF by reading folio state before
  list_lru_isolate(); once removed, a racing folio_put() frees the
  folio via the lockless list_empty() check while we still touch its
  flags and stats (Sashiko)
- rebased to mm-unstable of 2026-05-27, which simplifies the flatten
  prep patch (now anon-only, as alloc_swap_folio() was folded into the
  new __swap_cache_alloc()) and moves the swap-side
  folio_memcg_alloc_deferred() hook into __swap_cache_alloc(). Kairui,
  I would appreciate an eyeball on that.

Changes in v4:
- guard folio_memcg_alloc_deferred() with mem_cgroup_disabled() to fix
  NULL deref in __memcg_list_lru_alloc() when booting with
  cgroup_disable=memory (e.g., kdump capture kernel) -- reported and
  tested by Mikhail Zaslonko on s390 and x86
- flatten if (folio) branches in alloc_swap_folio() and alloc_anon_folio()
  in a prep patch so the list_lru allocation additions are a clean minimal
  diff (Lorenzo)
- folio_memcg_alloc_deferred() moved out of alloc_charge_folio() into the
  anon-only collapse_huge_page() path; collapse_file() shares that helper
  but its pages don't go on the THP shrinker queue (David)
- guard folio_memcg_alloc_deferred() with order > 1; mTHPs below order-2
  can't be queued on the deferred split list (David)
- make deferred_split_lru static, hide behind folio_memcg_alloc_deferred()
  wrapper with GFP_KERNEL (Lorenzo)
- rename l -> lru throughout huge_memory.c (Lorenzo)
- kdoc for folio_memcg_list_lru_alloc() (Lorenzo)
- list_lru_lock_irq()/unlock_irq()/add_irq() irq-disabling variants;
  use list_lru_add_irq() in deferred_split_scan() (Lorenzo)
- reorder shrinker_free() before list_lru_destroy() (Lorenzo)

Changes in v3:
- dedicated lockdep_key for irqsafe deferred_split_lru.lock (syzbot)
- conditional list_lru ops in __folio_freeze_and_split_unmapped() (syzbot)
- annotate runs of inscrutable false, NULL, false function arguments (David)
- rename to folio_memcg_list_lru_alloc() (David)

Changes in v2:
- explicit rcu_read_lock() in __folio_freeze_and_split_unmapped() (Usama)
- split out list_lru prep bits (Dave)

The open-coded deferred split queue has issues. It's not NUMA-aware
when cgroups are enabled, and it complicates the callsites that
interact with it. Switching to list_lru fixes the NUMA problem and
streamlines things. It also simplifies planned shrinker work.

Patch 1 fixes a pre-existing list_lru bug where the shrinker bit is
set on the caller's memcg rather than the ancestor whose sublist the
item actually lands on after a walk-up. Standalone, backportable; the
rest of the series depends on it.

Patches 2-5 are cleanups and small refactors in list_lru code. They're
basically independent, but make the THP shrinker conversion easier.

Patch 6 extends the list_lru API to allow the caller to control the
locking scope. The THP shrinker has private state it needs to keep
synchronized with the LRU state.

Patch 7 extends the list_lru API with a convenience helper to do
list_lru head allocation (memcg_list_lru_alloc) when coming from a
folio. Anon THPs are instantiated in several places, and with the
folio reparenting patches pending, folio_memcg() access is now a more
delicate dance. This avoids having to replicate that dance everywhere.

Patch 8 flattens the alloc_anon_folio() retry loop so the next patch's
list_lru hook lands as a clean addition rather than nested deep inside
an if (folio) block.

Patch 9 finally switches the deferred_split_queue to list_lru.

Based on mm-unstable.

 include/linux/huge_mm.h    |   7 +-
 include/linux/list_lru.h   |  70 +++++++++
 include/linux/memcontrol.h |   4 -
 include/linux/mmzone.h     |  12 --
 mm/huge_memory.c           | 364 +++++++++++++++------------------------------
 mm/internal.h              |   2 +-
 mm/khugepaged.c            |   5 +
 mm/list_lru.c              | 238 +++++++++++++++++++----------
 mm/memcontrol.c            |  12 +-
 mm/memory.c                |  38 ++---
 mm/mm_init.c               |  15 --
 mm/swap_state.c            |  10 ++
 12 files changed, 399 insertions(+), 378 deletions(-)

The base moved substantially since v4 (the swap allocation rework in
particular reshuffled the alloc_swap_folio() landing spot), so the
patch-level diff between v4 and v5 is non-obvious from a tree diff
alone. For ease of review, here is the range-diff:

 -:  ------------ >  1:  f4f3933599b9 mm: list_lru: set shrinker bit on the memcg that owns the locked sublist
 1:  846dafe02e8b !  2:  e7b8f8bce2ec mm: list_lru: lock_list_lru_of_memcg() cannot return NULL if !skip_empty
    @@ mm/list_lru.c
     @@ mm/list_lru.c: bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
      	struct list_lru_one *l;
      
    - 	l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
    + 	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
     -	if (!l)
     -		return false;
      	if (list_empty(item)) {
      		list_add_tail(item, &l->list);
    - 		/* Set shrinker bit if the first element was added */
    + 		/*
     @@ mm/list_lru.c: bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
      {
      	struct list_lru_node *nlru = &lru->node[nid];
      	struct list_lru_one *l;
     +
    - 	l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
    + 	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
     -	if (!l)
     -		return false;
      	if (!list_empty(item)) {
 2:  afe28e645aff !  3:  f1e34640dff9 mm: list_lru: deduplicate unlock_list_lru()
    @@ mm/list_lru.c: static inline bool lock_list_lru(struct list_lru_one *l, bool irq
      		return false;
      	}
      	return true;
    -@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    - 	memcg = parent_mem_cgroup(memcg);
    +@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid,
    + 	*memcg = parent_mem_cgroup(*memcg);
      	goto again;
      }
     -
    @@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_
      #else
      static void list_lru_register(struct list_lru *lru)
      {
    -@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    +@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid,
      
      	return l;
      }
 3:  9e5499facfb1 !  4:  2612b71187ea mm: list_lru: move list dead check to lock_list_lru_of_memcg()
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
      }
      
      static inline struct list_lru_one *
    -@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    +@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid,
      	rcu_read_lock();
      again:
    - 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
    + 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(*memcg));
     -	if (likely(l) && lock_list_lru(l, irq)) {
     -		rcu_read_unlock();
     -		return l;
 4:  855b908bfb82 !  5:  cc2819362f07 mm: list_lru: deduplicate lock_list_lru()
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     -}
     -
      static inline struct list_lru_one *
    - lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    - 		       bool irq, bool skip_empty)
    -@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    + lock_list_lru_of_memcg(struct list_lru *lru, int nid,
    + 		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
    +@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid,
      {
      	struct list_lru_one *l = &lru->node[nid].lru;
      
 5:  b8a70f1016f3 !  6:  08c4561616df mm: list_lru: introduce caller locking for additions and deletions
    @@ include/linux/list_lru.h: int memcg_list_lru_alloc(struct mem_cgroup *memcg, str
     + * list_lru_lock: lock the sublist for the given node and memcg
     + * @lru: the lru pointer
     + * @nid: the node id of the sublist to lock.
    -+ * @memcg: the cgroup of the sublist to lock.
    ++ * @memcg: pointer to the cgroup of the sublist to lock. On return,
    ++ *         updated to the cgroup whose sublist was actually locked,
    ++ *         which may be an ancestor if the original memcg was dying.
     + *
     + * Returns the locked list_lru_one sublist. The caller must call
     + * list_lru_unlock() when done.
    @@ include/linux/list_lru.h: int memcg_list_lru_alloc(struct mem_cgroup *memcg, str
     + * Return: the locked list_lru_one, or NULL on failure
     + */
     +struct list_lru_one *list_lru_lock(struct list_lru *lru, int nid,
    -+		struct mem_cgroup *memcg);
    ++		struct mem_cgroup **memcg);
     +
     +/**
     + * list_lru_unlock: unlock a sublist locked by list_lru_lock()
    @@ include/linux/list_lru.h: int memcg_list_lru_alloc(struct mem_cgroup *memcg, str
     +void list_lru_unlock(struct list_lru_one *l);
     +
     +struct list_lru_one *list_lru_lock_irq(struct list_lru *lru, int nid,
    -+		struct mem_cgroup *memcg);
    ++		struct mem_cgroup **memcg);
     +void list_lru_unlock_irq(struct list_lru_one *l);
     +
     +struct list_lru_one *list_lru_lock_irqsave(struct list_lru *lru, int nid,
    -+		struct mem_cgroup *memcg, unsigned long *irq_flags);
    ++		struct mem_cgroup **memcg, unsigned long *irq_flags);
     +void list_lru_unlock_irqrestore(struct list_lru_one *l,
     +		unsigned long *irq_flags);
     +
    @@ mm/list_lru.c
     @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
      
      static inline struct list_lru_one *
    - lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    --		       bool irq, bool skip_empty)
    -+		       bool irq, unsigned long *irq_flags, bool skip_empty)
    + lock_list_lru_of_memcg(struct list_lru *lru, int nid,
    +-		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
    ++		       struct mem_cgroup **memcg, bool irq,
    ++		       unsigned long *irq_flags, bool skip_empty)
      {
      	struct list_lru_one *l;
      
    -@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    +@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid,
      again:
    - 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
    + 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(*memcg));
      	if (likely(l)) {
     -		lock_list_lru(l, irq);
     +		lock_list_lru(l, irq, irq_flags);
    @@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_
     @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
      
      static inline struct list_lru_one *
    - lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    --		       bool irq, bool skip_empty)
    -+		       bool irq, unsigned long *irq_flags, bool skip_empty)
    + lock_list_lru_of_memcg(struct list_lru *lru, int nid,
    +-		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
    ++		       struct mem_cgroup **memcg, bool irq,
    ++		       unsigned long *irq_flags, bool skip_empty)
      {
      	struct list_lru_one *l = &lru->node[nid].lru;
      
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     -bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
     -		  struct mem_cgroup *memcg)
     +struct list_lru_one *list_lru_lock(struct list_lru *lru, int nid,
    -+				   struct mem_cgroup *memcg)
    ++				   struct mem_cgroup **memcg)
      {
     -	struct list_lru_node *nlru = &lru->node[nid];
     -	struct list_lru_one *l;
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     +}
     +
     +struct list_lru_one *list_lru_lock_irq(struct list_lru *lru, int nid,
    -+				       struct mem_cgroup *memcg)
    ++				       struct mem_cgroup **memcg)
     +{
     +	return lock_list_lru_of_memcg(lru, nid, memcg, /*irq=*/true,
     +				      /*irq_flags=*/NULL, /*skip_empty=*/false);
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     +	unlock_list_lru(l, /*irq_off=*/true, /*irq_flags=*/NULL);
     +}
      
    --	l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
    +-	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
     +struct list_lru_one *list_lru_lock_irqsave(struct list_lru *lru, int nid,
    -+					   struct mem_cgroup *memcg,
    ++					   struct mem_cgroup **memcg,
     +					   unsigned long *flags)
     +{
     +	return lock_list_lru_of_memcg(lru, nid, memcg, /*irq=*/true,
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     +{
      	if (list_empty(item)) {
      		list_add_tail(item, &l->list);
    - 		/* Set shrinker bit if the first element was added */
    + 		/*
    +@@ mm/list_lru.c: bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
    + 		 */
      		if (!l->nr_items++)
      			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
     -		unlock_list_lru(l, false);
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     +	struct list_lru_one *l;
     +	bool ret;
     +
    -+	l = list_lru_lock(lru, nid, memcg);
    ++	l = list_lru_lock(lru, nid, &memcg);
     +	ret = __list_lru_add(lru, l, item, nid, memcg);
     +	list_lru_unlock(l);
     +	return ret;
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     +	struct list_lru_one *l;
     +	bool ret;
     +
    -+	l = list_lru_lock_irq(lru, nid, memcg);
    ++	l = list_lru_lock_irq(lru, nid, &memcg);
     +	ret = __list_lru_add(lru, l, item, nid, memcg);
     +	list_lru_unlock_irq(l);
     +	return ret;
    @@ mm/list_lru.c: EXPORT_SYMBOL_GPL(list_lru_add_obj);
      	struct list_lru_one *l;
     +	bool ret;
      
    --	l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
    +-	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
     -	if (!list_empty(item)) {
     -		list_del_init(item);
     -		l->nr_items--;
    @@ mm/list_lru.c: EXPORT_SYMBOL_GPL(list_lru_add_obj);
     -	}
     -	unlock_list_lru(l, false);
     -	return false;
    -+	l = list_lru_lock(lru, nid, memcg);
    ++	l = list_lru_lock(lru, nid, &memcg);
     +	ret = __list_lru_del(lru, l, item, nid);
     +	list_lru_unlock(l);
     +	return ret;
    @@ mm/list_lru.c: __list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgr
      	unsigned long isolated = 0;
      
      restart:
    --	l = lock_list_lru_of_memcg(lru, nid, memcg, irq_off, true);
    -+	l = lock_list_lru_of_memcg(lru, nid, memcg, /*irq=*/irq_off,
    +-	l = lock_list_lru_of_memcg(lru, nid, &memcg, irq_off, true);
    ++	l = lock_list_lru_of_memcg(lru, nid, &memcg, /*irq=*/irq_off,
     +				   /*irq_flags=*/NULL, /*skip_empty=*/true);
      	if (!l)
      		return isolated;
 6:  0bf8cd5bc205 =  7:  9b1b9ab5e749 mm: list_lru: introduce folio_memcg_list_lru_alloc()
 7:  a26656c1c0a5 !  8:  fd4e1d364dc2 mm: memory: flatten folio allocation retry loops
    @@ Metadata
     Author: Johannes Weiner <hannes@cmpxchg.org>
     
      ## Commit message ##
    -    mm: memory: flatten folio allocation retry loops
    +    mm: memory: flatten alloc_anon_folio() retry loop
     
    -    alloc_swap_folio() and alloc_anon_folio() use a top-level if (folio)
    -    that buries the success path four levels deep. This makes for awkward
    -    long lines and wrapping. The next patch will add more code here, so
    -    flatten this now to keep things clean and simple.
    +    alloc_anon_folio() uses a top-level if (folio) that buries the success
    +    path four levels deep. This makes for awkward long lines and wrapping.
    +    The next patch will add more code here, so flatten this now to keep
    +    things clean and simple.
     
    -    alloc_anon_folio() already has a next label, use it for !folio. Add
    -    the equivalent to alloc_swap_folio().
    +    The next label is already there, use it for !folio.
     
         No functional change intended.
     
    @@ Commit message
         Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
     
      ## mm/memory.c ##
    -@@ mm/memory.c: static struct folio *alloc_swap_folio(struct vm_fault *vmf)
    - 	while (orders) {
    - 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
    - 		folio = vma_alloc_folio(gfp, order, vma, addr);
    --		if (folio) {
    --			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
    --							    gfp, entry))
    --				return folio;
    -+		if (!folio)
    -+			goto next;
    -+		if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry)) {
    - 			count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE);
    - 			folio_put(folio);
    -+			goto next;
    - 		}
    -+		return folio;
    -+next:
    - 		count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
    - 		order = next_order(&orders, order);
    - 	}
     @@ mm/memory.c: static struct folio *alloc_anon_folio(struct vm_fault *vmf)
      	while (orders) {
      		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 8:  e454696ab1b7 !  9:  70fe768450de mm: switch deferred split shrinker to list_lru
    @@ mm/huge_memory.c: static int __folio_freeze_and_split_unmapped(struct folio *fol
     +	 */
     +	dequeue_deferred = folio_test_anon(folio) && old_order > 1;
     +	if (dequeue_deferred) {
    ++		struct mem_cgroup *memcg;
    ++
     +		rcu_read_lock();
    ++		memcg = folio_memcg(folio);
     +		lru = list_lru_lock(&deferred_split_lru,
    -+				    folio_nid(folio), folio_memcg(folio));
    ++				    folio_nid(folio), &memcg);
     +	}
      	if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
      		struct swap_cluster_info *ci = NULL;
    @@ mm/huge_memory.c: int split_folio_to_list(struct folio *folio, struct list_head
      bool __folio_unqueue_deferred_split(struct folio *folio)
      {
     -	struct deferred_split *ds_queue;
    ++	struct mem_cgroup *memcg;
     +	struct list_lru_one *lru;
     +	int nid = folio_nid(folio);
      	unsigned long flags;
    @@ mm/huge_memory.c: int split_folio_to_list(struct folio *folio, struct list_head
     -	if (!list_empty(&folio->_deferred_list)) {
     -		ds_queue->split_queue_len--;
     +	rcu_read_lock();
    -+	lru = list_lru_lock_irqsave(&deferred_split_lru, nid, folio_memcg(folio), &flags);
    ++	memcg = folio_memcg(folio);
    ++	lru = list_lru_lock_irqsave(&deferred_split_lru, nid, &memcg, &flags);
     +	if (__list_lru_del(&deferred_split_lru, lru, &folio->_deferred_list, nid)) {
      		if (folio_test_partially_mapped(folio)) {
      			folio_clear_partially_mapped(folio);
    @@ mm/huge_memory.c: void deferred_split_folio(struct folio *folio, bool partially_
     +
     +	rcu_read_lock();
     +	memcg = folio_memcg(folio);
    -+	lru = list_lru_lock_irqsave(&deferred_split_lru, nid, memcg, &flags);
    ++	lru = list_lru_lock_irqsave(&deferred_split_lru, nid, &memcg, &flags);
      	if (partially_mapped) {
      		if (!folio_test_partially_mapped(folio)) {
      			folio_set_partially_mapped(folio);
    @@ mm/huge_memory.c: static bool thp_underused(struct folio *folio)
     +		return LRU_REMOVED;
     +	}
     +
    -+	/* We lost race with folio_put() */
    -+	list_lru_isolate(lru, item);
    ++	/*
    ++	 * We lost race with folio_put(). Read folio state before the
    ++	 * isolate: folio_unqueue_deferred_split() checks list_empty()
    ++	 * locklessly, so once removed the folio can be freed any time.
    ++	 */
     +	if (folio_test_partially_mapped(folio)) {
     +		folio_clear_partially_mapped(folio);
     +		mod_mthp_stat(folio_order(folio),
     +			      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
     +	}
    ++	list_lru_isolate(lru, item);
     +	return LRU_REMOVED;
     +}
     +
    @@ mm/huge_memory.c: static bool thp_underused(struct folio *folio)
      	struct folio *folio, *next;
     -	int split = 0, i;
     -	struct folio_batch fbatch;
    +-
    +-	folio_batch_init(&fbatch);
     +	int split = 0;
     +	unsigned long isolated;
      
    --	folio_batch_init(&fbatch);
    -+	isolated = list_lru_shrink_walk_irq(&deferred_split_lru, sc,
    -+					    deferred_split_isolate, &dispose);
    - 
     -retry:
     -	ds_queue = split_queue_lock_irqsave(sc->nid, sc->memcg, &flags);
     -	/* Take pin on all head pages to avoid freeing them under us */
    @@ mm/huge_memory.c: static bool thp_underused(struct folio *folio)
     -			break;
     -	}
     -	split_queue_unlock_irqrestore(ds_queue, flags);
    --
    ++	isolated = list_lru_shrink_walk_irq(&deferred_split_lru, sc,
    ++					    deferred_split_isolate, &dispose);
    + 
     -	for (i = 0; i < folio_batch_count(&fbatch); i++) {
     +	list_for_each_entry_safe(folio, next, &dispose, _deferred_list) {
      		bool did_split = false;
    @@ mm/khugepaged.c: static enum scan_result collapse_huge_page(struct mm_struct *mm
      	if (result != SCAN_SUCCEED)
      		goto out_nolock;
      
    -+	if (folio_memcg_alloc_deferred(folio))
    ++	if (folio_memcg_alloc_deferred(folio)) {
    ++		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
     +		goto out_nolock;
    ++	}
     +
      	mmap_read_lock(mm);
      	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
    @@ mm/memcontrol.c: static void mem_cgroup_css_offline(struct cgroup_subsys_state *
      	reparent_shrinker_deferred(memcg);
     
      ## mm/memory.c ##
    -@@ mm/memory.c: static struct folio *alloc_swap_folio(struct vm_fault *vmf)
    - 			folio_put(folio);
    - 			goto next;
    - 		}
    -+		if (order > 1 && folio_memcg_alloc_deferred(folio)) {
    -+			folio_put(folio);
    -+			goto fallback;
    -+		}
    - 		return folio;
    - next:
    - 		count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
     @@ mm/memory.c: static struct folio *alloc_anon_folio(struct vm_fault *vmf)
      			folio_put(folio);
      			goto next;
    @@ mm/mm_init.c: static void __meminit pgdat_init_internals(struct pglist_data *pgd
      	pgdat_init_kcompactd(pgdat);
      
      	init_waitqueue_head(&pgdat->kswapd_wait);
    +
    + ## mm/swap_state.c ##
    +@@ mm/swap_state.c: static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
    + 		return ERR_PTR(-ENOMEM);
    + 	}
    + 
    ++	if (order > 1 && folio_memcg_alloc_deferred(folio)) {
    ++		spin_lock(&ci->lock);
    ++		__swap_cache_do_del_folio(ci, folio, entry, shadow);
    ++		spin_unlock(&ci->lock);
    ++		folio_unlock(folio);
    ++		/* nr_pages refs from swap cache, 1 from allocation */
    ++		folio_put_refs(folio, nr_pages + 1);
    ++		return ERR_PTR(-ENOMEM);
    ++	}
    ++
    + 	/* memsw uncharges swap when folio is added to swap cache */
    + 	memcg1_swapin(folio);
    + 	if (shadow)


^ permalink raw reply

* Re: [PATCH v7 0/4] mm: swap: introduce swap tier infrastructure
From: Andrew Morton @ 2026-05-27 20:36 UTC (permalink / raw)
  To: Youngjun Park
  Cc: chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	baoquan.he, baohua, gunho.lee, taejoon.song, hyungjun.cho,
	mkoutny, baver.bae, matia.kim
In-Reply-To: <20260527062247.3440692-1-youngjun.park@lge.com>

On Wed, 27 May 2026 15:22:43 +0900 Youngjun Park <youngjun.park@lge.com> wrote:

> This is v7 of the swap tier series addressing review feedback.
> The cover letter has been simplified.

One question from Sashiko.   Minor, but easy to address.
	https://sashiko.dev/#/patchset/20260527062247.3440692-1-youngjun.park@lge.com

I'm reluctant to add a new feature patchset at this time - we have a lot
already and we're at -rc5.   What do others think?

^ permalink raw reply

* Re: [PATCH v3] security: Expand task_setscheduler LSM hook to include CPU affinity mask
From: Peter Zijlstra @ 2026-05-27 19:58 UTC (permalink / raw)
  To: Aaron Tomlin
  Cc: tsbogend, paul, jmorris, serge, mingo, juri.lelli,
	vincent.guittot, stephen.smalley.work, casey, longman, tj, hannes,
	mkoutny, chenridong, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, omosnace, kees, neelx, sean, chjohnst,
	steve, mproche, nick.lange, cgroups, linux-mips, linux-fsdevel,
	linux-security-module, selinux, linux-kernel
In-Reply-To: <ov33cu2wosubbfufcmfyoinfatecskjgmkvqyit33komlcla2d@2qgj45724bql>

On Wed, May 27, 2026 at 01:41:52PM -0400, Aaron Tomlin wrote:

> > > The actual use case here is multi-tenant workload isolation and visibility.
> > > Passing the evaluated cpumask to the BPF LSM allows operators to write a
> > > simple eBPF program to detect spatial boundary overlaps (e.g., logging an
> > > event if a requested mask intersects with platform-reserved cores).

Why isn't cgroups good enough to enforce this? If you create a cgroup
hierarchy per tenant, and constrain them using the cpuset controller,
they should not be able to escape, rendering this event impossible.

> > > If this justification makes more sense, I will focus strictly on the
> > > seccomp pointer limitations and multi-tenant workload isolation.
> > 
> > I suppose it does, my only remaining question is if that is indeed
> > proper use of LSM -- I really don't know much about that.
> > 
> 
> We are not creating a bespoke BPF hook here; rather, we are rectifying a
> historical blind spot within the API. The existing LSM hook is invoked
> during sched_setaffinity(), yet it presently receives only the task_struct
> pointer. Consequently, the security module is essentially asked, "Should
> Process A be permitted to alter Process B's affinity?" without being
> informed of the proposed affinity itself. Providing in_mask simply
> furnishes the existing hook with the requisite payload to make an informed
> decision.

It occurs to me that this same argument would also require passing in
the new sched_attr, no? That way the LSM can inspect the new policy
before it becomes effective.

> Were the objective solely one of observability, a tracepoint would indeed
> be the most suitable mechanism. However, if the aim within multi-tenant
> environments is active enforcement (namely, safely returning -EPERM to deny
> the pinning request before the scheduler applies it), the LSM layer remains
> the standard, architecturally supported gateway for returning syscall
> errors in accordance with administrative policy.

Indeed; but being constrained in a cpuset cgroup would result in the
same, no?

> I shall defer to Paul Moore and the LSM maintainers for their final
> blessing on the LSM API semantics.

Yes, I think that this is an interesting test-case of the LSM purpose.

You seem to be mostly aiming at resource control, something that is
traditionally done elsewhere.

> Thank you once again for the thorough review and for keeping the
> architectural boundaries honest.

No problem, just trying to understand myself ;-)


^ permalink raw reply

* Re: [PATCH v2 1/2] mm/memcontrol: add dmem charge/uncharge functions
From: Eric Chanudet @ 2026-05-27 19:10 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, Maarten Lankhorst, Maxime Ripard, Natalie Vock,
	Tejun Heo, Michal Koutný, Jonathan Corbet, Shuah Khan,
	cgroups, linux-mm, linux-kernel, dri-devel, T.J. Mercier,
	Christian König, Maxime Ripard, Albert Esteve, Dave Airlie,
	linux-doc
In-Reply-To: <ahB7pCu_G4vuswc0@linux.dev>

On Fri, May 22, 2026 at 08:53:10AM -0700, Shakeel Butt wrote:
> On Tue, May 19, 2026 at 11:59:01AM -0400, Eric Chanudet wrote:
> > Add mem_cgroup_dmem_charge() and mem_cgroup_dmem_uncharge() to allow
> > dmem pool allocations to optionally be double-charged against the memory
> > controller. Take the struct cgroup from the dmem pool's css as there is
> > no convenient object exported to represent these allocations. These will
> > resolve the effective memory css from that cgroup and perform the
> > charge.
> > 
> > Introduce a MEMCG_DMEM stat counter to memory.stat to make the cgroup's
> > dmem charge visible.
> > 
> > Signed-off-by: Eric Chanudet <echanude@redhat.com>
> > ---
> >  include/linux/memcontrol.h | 16 ++++++++++++
> >  mm/memcontrol.c            | 65 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 81 insertions(+)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index dc3fa687759b45748b2acee6d7f43da325eb50c1..8e1d49b87fb64e6114f3eb920293e14920290fe7 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -39,6 +39,7 @@ enum memcg_stat_item {
> >  	MEMCG_ZSWAP_B,
> >  	MEMCG_ZSWAPPED,
> >  	MEMCG_ZSWAP_INCOMP,
> > +	MEMCG_DMEM,
> >  	MEMCG_NR_STAT,
> >  };
> >  
> > @@ -1872,6 +1873,21 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> >  }
> >  #endif
> >  
> > +#if defined(CONFIG_MEMCG) && defined(CONFIG_CGROUP_DMEM)
> > +bool mem_cgroup_dmem_charge(struct cgroup *cgrp, unsigned int nr_pages,
> > +			    gfp_t gfp_mask);
> > +void mem_cgroup_dmem_uncharge(struct cgroup *cgrp, unsigned int nr_pages);
> > +#else
> > +static inline bool mem_cgroup_dmem_charge(struct cgroup *cgrp,
> > +					  unsigned int nr_pages, gfp_t gfp_mask)
> 
> Please follow Johannes's request to pass the actual memory object instead of
> naked numbers.

Sorry, I misunderstood Johannes' comment. I am not sure what to use
here. Since these are called from dmem.c, they don't have access to what
was allocated.

Looking at zswap, it uses obj_cgroup. I thought of resolving the
obj_cgroup from dmem_cgroup_try_charge and keeping it in the
dmem_cgroup_pool_state, but that made me realize there is a catch with
this patch set, with something like:
A: +memory{max:32M}/+dmem
A/B: +memory{max:16M}

It gets the CSS from the dmem's cgroup with
  cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
  mem_cgroup_from_css(mem_css);

Which would resolve to A's memcg and not enforce the memory.max limit
set in B when dmem.memcg is set for that region.
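
For illustration only, here is a toy userspace model of the resolution
order described above. It is a sketch under assumptions, not kernel code:
struct toy_cgroup, the TOY_* flags and toy_e_css() are hypothetical
stand-ins, and "enabled" is assumed to mean "this cgroup has its own
state for the controller". The only behaviour borrowed from the kernel
is that an effective-css lookup falls back to the nearest ancestor that
has the controller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical controller flags for the toy model. */
enum { TOY_MEMORY = 1 << 0, TOY_DMEM = 1 << 1 };

struct toy_cgroup {
	const char *name;
	struct toy_cgroup *parent;
	unsigned int enabled;	/* controllers with their own state here */
};

/*
 * Toy effective-css lookup: walk towards the root until a cgroup
 * carries its own state for @ctrl. Returns NULL if none does.
 */
static struct toy_cgroup *toy_e_css(struct toy_cgroup *cg, unsigned int ctrl)
{
	assert(ctrl != 0);
	while (cg && !(cg->enabled & ctrl))
		cg = cg->parent;
	return cg;
}
```

With A carrying memory and dmem and A/B carrying only memory, a dmem
lookup from A/B lands on A, and a memory lookup that starts from that
dmem cgroup stays on A, whereas a memory lookup that starts from A/B
itself yields A/B. That is the mismatch described above.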

-- 
Eric Chanudet


^ permalink raw reply

* Re: [PATCH v2 0/2] cgroup/cpuset: Fix sibling CPU exclusion in partcmd_update
From: Tejun Heo @ 2026-05-27 19:05 UTC (permalink / raw)
  To: Sun Shaojie
  Cc: Waiman Long, Chen Ridong, Johannes Weiner, Michal Koutný,
	cgroups, linux-kernel, zhangguopeng
In-Reply-To: <20260527064329.640060-1-sunshaojie@kylinos.cn>

Hello,

On Wed, May 27, 2026 at 02:43:27PM +0800, Sun Shaojie wrote:
> Sun Shaojie (2):
>   cgroup/cpuset: Use effective_xcpus in partcmd_update add/del mask
>     calculation
>   cgroup/cpuset: Add test cases for sibling CPU exclusion on partition
>     update

Applied to cgroup/for-7.1-fixes with the following changes:

- Added Cc: stable@vger.kernel.org # v7.0+ to the fix since 2a3602030d80
  shipped in v7.0.
- Added Reviewed-by: Waiman Long <longman@redhat.com> to both patches.

Thanks.

--
tejun

^ permalink raw reply

* Re: [PATCH v2 2/2] cgroup/cpuset: Add test cases for sibling CPU exclusion on partition update
From: Waiman Long @ 2026-05-27 18:08 UTC (permalink / raw)
  To: Sun Shaojie, Chen Ridong, Tejun Heo, Johannes Weiner,
	Michal Koutný
  Cc: cgroups, linux-kernel, zhangguopeng
In-Reply-To: <20260527070509.648304-1-sunshaojie@kylinos.cn>

On 5/27/26 3:05 AM, Sun Shaojie wrote:
> When sibling CPU exclusion occurs, a partition's effective_xcpus may be
> a subset of its user_xcpus. The partcmd_update path must use
> effective_xcpus instead of user_xcpus when calculating CPUs to return
> to or request from the parent.
>
> Add two test cases to verify this behavior:
>
>    1) Narrowing cpuset.cpus to only the sibling-excluded CPUs should not
>       return CPUs to parent that the partition never actually owned.
>
>    2) Expanding cpuset.cpus after a sibling becomes a member should
>       correctly request the additional CPUs from parent.
>
> Co-developed-by: Zhang Guopeng <zhangguopeng@kylinos.cn>
> Signed-off-by: Zhang Guopeng <zhangguopeng@kylinos.cn>
> Signed-off-by: Sun Shaojie <sunshaojie@kylinos.cn>
> ---
>   tools/testing/selftests/cgroup/test_cpuset_prs.sh | 10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index a56f4153c64d..683b05062810 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -492,6 +492,16 @@ REMOTE_TEST_MATRIX=(
>   	"  C1-5:P1   .  C1-4:P1   C2-3     .       .  \
>   	      .      .     .       P1      .       .     p1:5|c11:1-4|c12:5 \
>   							 p1:P1|c11:P1|c12:P-1"
> +	# Narrowing cpuset.cpus to previously sibling-excluded CPUs should
> +	# not return CPUs that were never actually owned.
> +	"  C1-4:P1   .   C1-2:P1  C1-3:P2  .       .  \
> +	      .      .     .         C3    .       .     p1:4|c11:1-2|c12:3 \
> +							 p1:P1|c11:P1|c12:P2 3"
> +	# Expanding cpuset.cpus to include a previously sibling-excluded CPU
> +	# after the sibling has become a member should correctly request it.
> +	"  C1-4:P1   .   C1-2:P1  C1-3:P2  .       .  \
> +	      .      .      P0     C2-3    .       .     p1:1,4|c11:1|c12:2-3 \
> +							 p1:P1|c11:P0|c12:P2 2-3"
>   )
>   
>   #
Reviewed-by: Waiman Long <longman@redhat.com>


^ permalink raw reply

* Re: [PATCH v2 1/2] cgroup/cpuset: Use effective_xcpus in partcmd_update add/del mask calculation
From: Waiman Long @ 2026-05-27 18:01 UTC (permalink / raw)
  To: Sun Shaojie, Chen Ridong, Tejun Heo, Johannes Weiner,
	Michal Koutný
  Cc: cgroups, linux-kernel, zhangguopeng
In-Reply-To: <20260527064329.640060-2-sunshaojie@kylinos.cn>

On 5/27/26 2:43 AM, Sun Shaojie wrote:
> When sibling CPU exclusion occurs, a partition's user_xcpus may contain
> CPUs that were never actually granted to it. These CPUs are present in
> user_xcpus(cs) but not in cs->effective_xcpus.
>
> The partcmd_update path in update_parent_effective_cpumask() uses
> user_xcpus(cs) (via the local variable xcpus) to compute the addmask
> (CPUs to return to parent) and delmask (CPUs to request from parent).
> This is incorrect:
>
>   1) When newmask removes a CPU that was previously excluded by a
>      sibling, addmask incorrectly includes that CPU and tries to return
>      it to the parent even though the partition never actually owned it,
>      causing CPU overlap with sibling partitions and triggering warnings
>      in generate_sched_domains().
>
>   2) When newmask adds a previously excluded CPU that is now available,
>      delmask fails to request it from the parent because user_xcpus(cs)
>      already includes it.
>
> Fix this by using cs->effective_xcpus instead of user_xcpus(cs) in all
> partcmd_update paths that calculate addmask or delmask, including the
> PERR_NOCPUS error handling paths.
>
> Reproducers:
>
>    Example 1 - Removing a sibling-excluded CPU incorrectly returns it:
>
>      # cd /sys/fs/cgroup
>      # echo "0-1" > a1/cpuset.cpus
>      # echo "root" > a1/cpuset.cpus.partition
>      # echo "0-2" > b1/cpuset.cpus
>      # echo "root" > b1/cpuset.cpus.partition
>      # echo "2" > b1/cpuset.cpus
>      # cat cpuset.cpus.effective
>      # Actual: 0-1,3    Expected: 3
>
>    Example 2 - Expanding to a previously excluded CPU fails to request it:
>
>      # cd /sys/fs/cgroup
>      # echo "0-1" > a1/cpuset.cpus
>      # echo "root" > a1/cpuset.cpus.partition
>      # echo "0-2" > b1/cpuset.cpus
>      # echo "root" > b1/cpuset.cpus.partition
>      # echo "member" > a1/cpuset.cpus.partition
>      # echo "1-2" > b1/cpuset.cpus
>      # cat cpuset.cpus.effective
>      # Actual: 0-1,3    Expected: 0,3
>
> Fixes: 2a3602030d80 ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict")
> Suggested-by: Zhang Guopeng <zhangguopeng@kylinos.cn>
> Signed-off-by: Sun Shaojie <sunshaojie@kylinos.cn>
> ---
>   kernel/cgroup/cpuset.c | 13 +++++++------
>   1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 1335e437098e..2395c5aec871 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1807,9 +1807,9 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
>   		 * Compute add/delete mask to/from effective_cpus
>   		 *
>   		 * For valid partition:
> -		 *   addmask = exclusive_cpus & ~newmask
> +		 *   addmask = effective_xcpus & ~newmask
>   		 *			      & parent->effective_xcpus
> -		 *   delmask = newmask & ~exclusive_cpus
> +		 *   delmask = newmask & ~effective_xcpus
>   		 *		       & parent->effective_xcpus
>   		 *
>   		 * For invalid partition:
> @@ -1821,11 +1821,11 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
>   			deleting = cpumask_and(tmp->delmask,
>   					newmask, parent->effective_xcpus);
>   		} else {
> -			cpumask_andnot(tmp->addmask, xcpus, newmask);
> +			cpumask_andnot(tmp->addmask, cs->effective_xcpus, newmask);
>   			adding = cpumask_and(tmp->addmask, tmp->addmask,
>   					     parent->effective_xcpus);
>   
> -			cpumask_andnot(tmp->delmask, newmask, xcpus);
> +			cpumask_andnot(tmp->delmask, newmask, cs->effective_xcpus);
>   			deleting = cpumask_and(tmp->delmask, tmp->delmask,
>   					       parent->effective_xcpus);
>   		}
> @@ -1864,7 +1864,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
>   			part_error = PERR_NOCPUS;
>   			deleting = false;
>   			adding = cpumask_and(tmp->addmask,
> -					     xcpus, parent->effective_xcpus);
> +					     cs->effective_xcpus, parent->effective_xcpus);
>   		}
>   	} else {
>   		/*
> @@ -1886,7 +1886,8 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
>   			part_error = PERR_NOCPUS;
>   			if (is_partition_valid(cs))
>   				adding = cpumask_and(tmp->addmask,
> -						xcpus, parent->effective_xcpus);
> +						     cs->effective_xcpus,
> +						     parent->effective_xcpus);
>   		} else if (is_partition_invalid(cs) && !cpumask_empty(xcpus) &&
>   			   cpumask_subset(xcpus, parent->effective_xcpus)) {
>   			struct cgroup_subsys_state *css;
Reviewed-by: Waiman Long <longman@redhat.com>
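
For illustration only, the two reproducers can be replayed with plain
bitmasks. This is a toy sketch, not kernel code: the toy_* names are
hypothetical, one bit stands for one CPU, and it assumes a 4-CPU system
where the parent's effective_xcpus covers CPUs 0-3. "owned" stands for
user_xcpus(cs) in the old code and cs->effective_xcpus in the fix.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t toy_mask_t;

struct toy_update {
	toy_mask_t addmask;	/* CPUs handed back to the parent */
	toy_mask_t delmask;	/* CPUs requested from the parent */
};

/* Build a mask covering CPUs lo..hi inclusive. */
static toy_mask_t toy_cpus(int lo, int hi)
{
	toy_mask_t m = 0;

	assert(lo >= 0 && lo <= hi && hi < 32);
	for (int cpu = lo; cpu <= hi; cpu++)
		m |= (toy_mask_t)1 << cpu;
	return m;
}

/*
 * Same shape as the formulas in the patched comment:
 *   addmask = owned & ~newmask & parent_xcpus
 *   delmask = newmask & ~owned & parent_xcpus
 */
static struct toy_update toy_partcmd_update(toy_mask_t owned,
					    toy_mask_t newmask,
					    toy_mask_t parent_xcpus)
{
	struct toy_update u;

	u.addmask = owned & ~newmask & parent_xcpus;
	u.delmask = newmask & ~owned & parent_xcpus;
	return u;
}

/* Parent's effective CPUs once the update has been applied. */
static toy_mask_t toy_apply(toy_mask_t parent_effective, struct toy_update u)
{
	return (parent_effective | u.addmask) & ~u.delmask;
}
```

For Example 1, b1 has user_xcpus 0-2 but effective_xcpus 2, and the
parent holds CPU 3. With owned = 0-2 and newmask = 2 the toy model hands
CPUs 0-1 back and the parent ends at 0-1,3; with owned = 2 nothing moves
and the parent stays at 3. For Example 2 the parent starts at 0-1,3 and
newmask = 1-2: owned = 0-2 requests nothing, owned = 2 requests CPU 1 and
the parent ends at 0,3.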


^ permalink raw reply

* Re: [PATCH v7 4/4] mm: swap: filter swap allocation by memcg tier mask
From: Kairui Song @ 2026-05-27 17:50 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, chrisl, linux-mm, cgroups, linux-kernel, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim
In-Reply-To: <20260527062247.3440692-5-youngjun.park@lge.com>

On Wed, May 27, 2026 at 03:22:47PM +0800, Youngjun Park wrote:
> Apply memcg tier effective mask during swap slot allocation to
> enforce per-cgroup swap tier restrictions.
> 
> In the fast path, check the percpu cached swap_info's tier_mask
> against the folio's effective mask. If it does not match, fall
> through to the slow path. In the slow path, skip swap devices
> whose tier_mask is not covered by the folio's effective mask.
> 
> This works correctly when there is only one non-rotational
> device in the system and no devices share the same priority.
> However, there are known limitations:
> 
>  - When non-rotational devices are distributed across multiple
>    tiers, and different memcgs are configured to use those
>    distinct tiers, they may constantly overwrite the shared
>    percpu swap cache. This cache thrashing leads to frequent
>    fast path misses.
> 
>  - Combined with the above issue, if same-priority devices exist
>    among them, a percpu cache miss (overwritten by another memcg)
>    forces the allocator to round-robin to the next device
>    prematurely, even if the current cluster is not fully
>    exhausted.
> 
> These edge cases do not affect the primary use case of
> directing swap traffic per cgroup. Further optimization is
> planned for future work.
> 
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
>  mm/swapfile.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9a86ebe992f4..1a2d29735b71 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1365,14 +1365,18 @@ static bool swap_alloc_fast(struct folio *folio)
>  	struct swap_cluster_info *ci;
>  	struct swap_info_struct *si;
>  	unsigned int offset;
> +	int mask = folio_tier_effective_mask(folio);
>  
>  	/*
>  	 * Once allocated, swap_info_struct will never be completely freed,
>  	 * so checking it's liveness by get_swap_device_info is enough.
>  	 */
>  	si = this_cpu_read(percpu_swap_cluster.si[order]);
> +	if (!si || !swap_tiers_mask_test(si->tier_mask, mask))
> +		return false;
> +
>  	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
> -	if (!si || !offset || !get_swap_device_info(si))
> +	if (!offset || !get_swap_device_info(si))
>  		return false;
>  
>  	ci = swap_cluster_lock(si, offset);
> @@ -1392,10 +1396,14 @@ static bool swap_alloc_fast(struct folio *folio)
>  static void swap_alloc_slow(struct folio *folio)
>  {
>  	struct swap_info_struct *si, *next;
> +	int mask = folio_tier_effective_mask(folio);
>  
>  	spin_lock(&swap_avail_lock);
>  start_over:
>  	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
> +		if (!swap_tiers_mask_test(si->tier_mask, mask))
> +			continue;
> +
>  		/* Rotate the device and switch to a new cluster */
>  		plist_requeue(&si->avail_list, &swap_avail_head);
>  		spin_unlock(&swap_avail_lock);
> -- 
> 2.34.1

This part looks good to me. The known limitations are not regressions
and only affect tiering, so they can be improved later, and we do have a
plan to refine the priority / rotation / pcp cluster handling so they
align well.

Reviewed-by: Kairui Song <kasong@tencent.com>
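
For illustration only, the slow-path filtering can be pictured with a
toy list walk. This is a sketch under assumptions, not the patch's code:
the toy_* names are hypothetical, and the test "every tier bit of the
device is present in the allowed mask" is an assumed reading of "covered
by the folio's effective mask"; the real swap_tiers_mask_test() may
differ.

```c
#include <assert.h>
#include <stddef.h>

struct toy_swap_dev {
	const char *name;
	unsigned int tier_mask;	/* tier bit(s) this device belongs to */
};

/* Assumed semantics: all of the device's tier bits must be allowed. */
static int toy_tier_allowed(unsigned int dev_mask, unsigned int allowed)
{
	return (dev_mask & allowed) == dev_mask;
}

/*
 * Walk the devices in priority order and return the first one the
 * cgroup may use, or NULL if every device is filtered out.
 */
static const struct toy_swap_dev *
toy_pick_dev(const struct toy_swap_dev *devs, size_t n, unsigned int allowed)
{
	assert(devs || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!toy_tier_allowed(devs[i].tier_mask, allowed))
			continue;	/* same shape as the added "continue" */
		return &devs[i];
	}
	return NULL;
}
```

With three devices on tiers 0x1, 0x2 and 0x4 in priority order, a cgroup
allowed 0x6 skips the first device and gets the second, and a cgroup
with an empty mask gets none.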

^ permalink raw reply

* Re: [PATCH v3] security: Expand task_setscheduler LSM hook to include CPU affinity mask
From: Aaron Tomlin @ 2026-05-27 17:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tsbogend, paul, jmorris, serge, mingo, juri.lelli,
	vincent.guittot, stephen.smalley.work, casey, longman, tj, hannes,
	mkoutny, chenridong, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, omosnace, kees, neelx, sean, chjohnst,
	steve, mproche, nick.lange, cgroups, linux-mips, linux-fsdevel,
	linux-security-module, selinux, linux-kernel
In-Reply-To: <20260527155404.GV3126523@noisy.programming.kicks-ass.net>

On Wed, May 27, 2026 at 05:54:04PM +0200, Peter Zijlstra wrote:
> > The LSM hook is currently the only infrastructure positioned to do
> > this safely for eBPF-driven security policies.
> 
> But is that correct use of LSM? Or is that working around shortcomings
> elsewhere?

Hi Peter,

I am in complete agreement that we should avoid indiscriminately grafting
hooks onto the kernel simply to accommodate BPF. Nevertheless, I would
argue that this represents a textbook application of LSM.

> I realize that bpf people rarely care about things like this, they just
> want to hack their thing and will take any hook they can get. But I feel
> people *should* care.
> 
> > The actual use case here is multi-tenant workload isolation and visibility.
> > Passing the evaluated cpumask to the BPF LSM allows operators to write a
> > simple eBPF program to detect spatial boundary overlaps (e.g., logging an
> > event if a requested mask intersects with platform-reserved cores).
> > 
> > If this justification makes more sense, I will focus strictly on the
> > seccomp pointer limitations and multi-tenant workload isolation.
> 
> I suppose it does, my only remaining question is if that is indeed
> proper use of LSM -- I really don't know much about that.
> 

We are not creating a bespoke BPF hook here; rather, we are rectifying a
historical blind spot within the API. The existing LSM hook is invoked
during sched_setaffinity(), yet it presently receives only the task_struct
pointer. Consequently, the security module is essentially asked, "Should
Process A be permitted to alter Process B's affinity?" without being
informed of the proposed affinity itself. Providing in_mask simply
furnishes the existing hook with the requisite payload to make an informed
decision.

Were the objective solely one of observability, a tracepoint would indeed
be the most suitable mechanism. However, if the aim within multi-tenant
environments is active enforcement (namely, safely returning -EPERM to deny
the pinning request before the scheduler applies it), the LSM layer remains
the standard, architecturally supported gateway for returning syscall
errors in accordance with administrative policy.
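
For illustration only, the kind of decision such a policy would make can
be sketched in plain userspace C. This is a toy model, not the LSM hook
and not a BPF program: toy_check_affinity(), toy_cpu_bit() and the
reserved mask value are hypothetical, and one bit stands for one CPU.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical policy value: CPUs 0 and 1 are platform-reserved. */
static const uint64_t toy_reserved_cpus = 0x3;

/* Bit for a single CPU number. */
static uint64_t toy_cpu_bit(unsigned int cpu)
{
	assert(cpu < 64);
	return (uint64_t)1 << cpu;
}

/*
 * Toy stand-in for a hook that is handed the requested mask:
 * returns 0 to allow the request, -EPERM to deny it.
 */
static int toy_check_affinity(uint64_t requested_mask)
{
	if (requested_mask & toy_reserved_cpus)
		return -EPERM;
	return 0;
}
```

A request for CPUs 2-3 is allowed, while a request for CPUs 1-2 is
denied because it touches a reserved CPU. Without the mask, the same
function could only answer per task, not per requested CPU set.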

I shall defer to Paul Moore and the LSM maintainers for their final
blessing on the LSM API semantics.

Thank you once again for the thorough review and for keeping the
architectural boundaries honest.


Kind regards,
-- 
Aaron Tomlin


^ permalink raw reply

* Re: [PATCH v3] security: Expand task_setscheduler LSM hook to include CPU affinity mask
From: Peter Zijlstra @ 2026-05-27 15:54 UTC (permalink / raw)
  To: Aaron Tomlin
  Cc: tsbogend, paul, jmorris, serge, mingo, juri.lelli,
	vincent.guittot, stephen.smalley.work, casey, longman, tj, hannes,
	mkoutny, chenridong, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, omosnace, kees, neelx, sean, chjohnst,
	steve, mproche, nick.lange, cgroups, linux-mips, linux-fsdevel,
	linux-security-module, selinux, linux-kernel
In-Reply-To: <bgjagepcfb7gz6jawatu6kpfmecw46gwg5cvb6r7dl3dn7bt4l@rtymdaslx7ef>

On Wed, May 27, 2026 at 11:05:17AM -0400, Aaron Tomlin wrote:
> On Wed, May 27, 2026 at 10:52:21AM +0200, Peter Zijlstra wrote:
> > I'm not sure I really buy the Real-Time argument here; that really feels
> > like a straw man. Real-Time will need to account for the shared resource
> > usage inherent in using a single kernel image across the CPUs, affinity
> > alone does not Real-Time make in any way shape or form.
> > 
> > And the compromised task vs crypto thing feels like it wants sandboxing,
> > but wasn't that what seccomp is for, rather than lsm?
> > 
> > So while I don't think I object very much to the patch, I do find the
> > whole Changelog to be utterly questionable. Which makes me very
> > suspicious as to wtf this is actually for.
> 
> Hi Peter,
> 
> Thank you for the blunt and honest feedback.
> 
> You are completely right to call out the changelog. It obscured the actual
> practical use case. I will rewrite the commit message to drop those
> statements.
> 
> To answer your question regarding seccomp: seccomp-bpf is strictly limited
> to inspecting syscall arguments by value at the syscall entry boundary. For
> sched_setaffinity(), the mask is passed as a "__user" pointer. Seccomp
> cannot safely dereference this pointer to inspect the requested CPU bits.

There has been work to allow tracepoints, specifically syscall
tracepoints, to access the syscall arguments and to do exactly this
(deref user pointers). I *think* most of that work landed, but I might
be mistaken.

Would this then not also allow seccomp-bpf to access these?

(while writing this, I wonder if that would then not be subject to
TOCTOU)

> To actually evaluate which CPUs a task is trying to pin to, we must
> evaluate the mask after copy_from_user() has safely brought it into kernel
> memory.

Right, this.
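
For illustration only, the TOCTOU concern and the copy-first ordering
can be contrasted in a single-threaded toy. This is a sketch under
assumptions, not kernel code: the toy_* names are hypothetical, the
"racer" is called inline to stand in for a concurrent userspace writer,
and memcpy() stands in for copy_from_user().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for memory that userspace can rewrite at any time. */
struct toy_user_buf {
	uint64_t mask;
};

/* Hypothetical racing writer: swaps in a mask that names CPU 0. */
static void toy_racer(struct toy_user_buf *ubuf)
{
	ubuf->mask = 0x1;
}

/*
 * Racy shape: inspect the user copy, then fetch it again to act on it.
 * Returns the mask that would be applied, or 0 if the check denied it.
 */
static uint64_t toy_check_then_refetch(struct toy_user_buf *ubuf,
				       uint64_t reserved)
{
	assert(ubuf);
	if (ubuf->mask & reserved)
		return 0;
	toy_racer(ubuf);	/* window between check and use */
	return ubuf->mask;	/* second fetch sees the new value */
}

/*
 * Snapshot shape: copy once, then check and use only the private copy.
 */
static uint64_t toy_copy_then_check(struct toy_user_buf *ubuf,
				    uint64_t reserved)
{
	uint64_t kmask;

	assert(ubuf);
	memcpy(&kmask, &ubuf->mask, sizeof(kmask));
	toy_racer(ubuf);	/* later rewrites no longer matter */
	if (kmask & reserved)
		return 0;
	return kmask;
}
```

Starting from a mask of CPUs 2-3 with CPUs 0-1 reserved, the refetching
variant ends up applying the rewritten mask that names CPU 0, while the
snapshot variant applies the mask that was actually checked.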

> The LSM hook is currently the only infrastructure positioned to do
> this safely for eBPF-driven security policies.

But is that correct use of LSM? Or is that working around shortcomings
elsewhere?

I realize that bpf people rarely care about things like this, they just
want to hack their thing and will take any hook they can get. But I feel
people *should* care.

> The actual use case here is multi-tenant workload isolation and visibility.
> Passing the evaluated cpumask to the BPF LSM allows operators to write a
> simple eBPF program to detect spatial boundary overlaps (e.g., logging an
> event if a requested mask intersects with platform-reserved cores).
> 
> If this justification makes more sense, I will focus strictly on the
> seccomp pointer limitations and multi-tenant workload isolation.

I suppose it does, my only remaining question is if that is indeed
proper use of LSM -- I really don't know much about that.



^ permalink raw reply

* Re: [RFC PATCH bpf-next v7 04/11] libbpf: introduce bpf_map__attach_struct_ops_opts()
From: Yonghong Song @ 2026-05-27 15:43 UTC (permalink / raw)
  To: Hui Zhu, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Jiri Olsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	JP Kobryn, Andrew Morton, Shuah Khan, davem, Jakub Kicinski,
	Jesper Dangaard Brouer, Stanislav Fomichev, KP Singh, Tao Chen,
	Mykyta Yatsenko, Leon Hwang, Anton Protopopov, Amery Hung,
	Tobias Klauser, Eyal Birger, Rong Tao, Hao Luo, Peter Zijlstra,
	Miguel Ojeda, Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu,
	mkoutny, Jan Hendrik Farr, Christian Brauner, Randy Dunlap,
	Brian Gerst, Masahiro Yamada, Willem de Bruijn, Jason Xing,
	Paul Chaignon, Chen Ridong, Lance Yang, Jiayuan Chen,
	linux-kernel, bpf, cgroups, linux-mm, netdev, linux-kselftest
  Cc: geliang, baohua
In-Reply-To: <20bdaa33cc19364f5f10208c79ef94fe43bd5ac1.1779760876.git.zhuhui@kylinos.cn>



On 5/25/26 7:20 PM, Hui Zhu wrote:
> From: Roman Gushchin <roman.gushchin@linux.dev>
>
> Introduce bpf_map__attach_struct_ops_opts(), an extended version of
> bpf_map__attach_struct_ops(), which takes additional struct
> bpf_struct_ops_opts argument.
>
> This allows to pass a target_fd argument and the BPF_F_CGROUP_FD flag
> and attach the struct ops to a cgroup as a result.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>   tools/lib/bpf/libbpf.c   | 20 +++++++++++++++++---
>   tools/lib/bpf/libbpf.h   | 14 ++++++++++++++
>   tools/lib/bpf/libbpf.map |  1 +
>   3 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 1e8688975d16..a1b54da1ded2 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -13683,11 +13683,18 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
>   	return close(link->fd);
>   }
>   
> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> +struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
> +						 const struct bpf_struct_ops_opts *opts)
>   {
> +	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_opts);
>   	struct bpf_link_struct_ops *link;
> +	int err, fd, target_fd;
>   	__u32 zero = 0;
> -	int err, fd;
> +
> +	if (!OPTS_VALID(opts, bpf_struct_ops_opts)) {
> +		pr_warn("map '%s': invalid opts\n", map->name);
> +		return libbpf_err_ptr(-EINVAL);
> +	}
>   
>   	if (!bpf_map__is_struct_ops(map)) {
>   		pr_warn("map '%s': can't attach non-struct_ops map\n", map->name);
> @@ -13724,7 +13731,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>   		return &link->link;
>   	}
>   
> -	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
> +	link_opts.flags = OPTS_GET(opts, flags, 0);
> +	target_fd = OPTS_GET(opts, target_fd, 0);
> +	fd = bpf_link_create(map->fd, target_fd, BPF_STRUCT_OPS, &link_opts);
>   	if (fd < 0) {
>   		free(link);
>   		return libbpf_err_ptr(fd);
> @@ -13736,6 +13745,11 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>   	return &link->link;
>   }
>   
> +struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> +{
> +	return bpf_map__attach_struct_ops_opts(map, NULL);
> +}
> +
>   /*
>    * Swap the back struct_ops of a link with a new struct_ops map.
>    */
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index bba4e8464396..18af178547ad 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -945,6 +945,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
>   struct bpf_map;
>   
>   LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +
> +struct bpf_struct_ops_opts {
> +	/* size of this struct, for forward/backward compatibility */
> +	size_t sz;
> +	__u32 flags;
> +	__u32 target_fd;
> +	__u64 expected_revision;
> +	size_t :0;
> +};
> +#define bpf_struct_ops_opts__last_field expected_revision
> +
> +LIBBPF_API struct bpf_link *
> +bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
> +				const struct bpf_struct_ops_opts *opts);
>   LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map);
>   
>   struct bpf_iter_attach_opts {
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index dfed8d60af05..6105619b5ecf 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -454,6 +454,7 @@ LIBBPF_1.7.0 {
>   		bpf_prog_assoc_struct_ops;
>   		bpf_program__assoc_struct_ops;
>   		btf__permute;
> +		bpf_map__attach_struct_ops_opts;

bpf_map__attach_struct_ops_opts should be listed under
LIBBPF_1.8.0, not LIBBPF_1.7.0.

>   } LIBBPF_1.6.0;
>   
>   LIBBPF_1.8.0 {



* [PATCH-next v3 5/5] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
From: Waiman Long @ 2026-05-27 15:38 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra
  Cc: cgroups, linux-kernel, Aaron Tomlin, Waiman Long
In-Reply-To: <20260527153800.1557449-1-longman@redhat.com>

With cgroup v2, the cgroup_taskset structure passed into the cgroup
can_attach() and attach() methods can contain task migration data with
multiple destination or source cpusets when the cpuset controller is
enabled or disabled respectively.

Since cpuset is a threaded controller in both v1 and v2, another way to
cause a many-to-one migration is to move a whole process whose threads
are in different cpuset-enabled threaded cgroups into another
cpuset-enabled cgroup.

The current cpuset_can_attach() and cpuset_attach() functions still
expect task migration to be from one source cpuset to one destination
cpuset. This has been the case since cpuset was enabled for cgroup v2
in commit 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default
hierarchy").

This problem is less of an issue when enabling the cpuset controller, as
all the newly created child cpusets will have exactly the same set of CPUs
and memory nodes. The exception is when deadline tasks are involved in the
migration, as the deadline task accounting data can be off.

It is more problematic when the cpuset controller is disabled, as the
child cpusets' CPUs and memory nodes may differ from those of their parent,
or when a multi-threaded process is moved from different threaded cgroups.

Fix that by tracking the set of source (old) and destination cpusets
in singly linked lists and iterating them all to properly update the
internal data. Also keep the current cs and oldcs variables up-to-date
with the css and task iterators.
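
As a rough userspace illustration of the tracking scheme described above
(not the kernel code itself), the sketch below adds each cpuset to a singly
linked list at most once by guarding the insertion with a membership flag,
and later unlinks everything in one pass. The names cs_model, cs_list,
track_once(), tracked_count() and untrack_all() are hypothetical stand-ins;
the patch itself uses struct cpuset, llist_add() and clear_attach_data().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct cpuset; not the kernel type. */
struct cs_model {
	int id;
	bool in_list;          /* plays the role of attach_node_in_llist */
	struct cs_model *next; /* plays the role of attach_node */
};

/* Hypothetical list head, standing in for src_cs_head/dst_cs_head. */
struct cs_list {
	struct cs_model *first;
};

/* Add @cs to @list once; repeated calls for the same cpuset are no-ops. */
static void track_once(struct cs_list *list, struct cs_model *cs)
{
	assert(list != NULL && cs != NULL);
	if (cs->in_list)
		return;
	cs->next = list->first;
	list->first = cs;
	cs->in_list = true;
}

/* Count the cpusets currently tracked in @list. */
static int tracked_count(const struct cs_list *list)
{
	int n = 0;

	for (const struct cs_model *cs = list->first; cs; cs = cs->next)
		n++;
	return n;
}

/* Unlink every entry and clear its flag so the list can be reused. */
static void untrack_all(struct cs_list *list)
{
	struct cs_model *cs = list->first;

	while (cs) {
		struct cs_model *next = cs->next;

		cs->next = NULL;
		cs->in_list = false;
		cs = next;
	}
	list->first = NULL;
}
```

With this model, visiting the same destination cpuset for many tasks costs
one flag test per visit, and the cleanup pass touches each cpuset exactly once.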

To ensure proper DL task accounting, nr_migrate_dl_tasks is decremented
in the source cpuset and incremented in the destination cpuset, and the
resulting values are added to nr_deadline_tasks when the migration succeeds.
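
The accounting rule above can be modelled in userspace as a staged delta
that is either folded in or discarded. This is only a sketch under
simplifying assumptions: dl_model, dl_stage_move() and dl_finish() are
hypothetical names, and bandwidth reservation (dl_bw_alloc()/dl_bw_free())
is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-cpuset deadline task counters. */
struct dl_model {
	int nr_deadline_tasks;   /* committed count */
	int nr_migrate_dl_tasks; /* pending delta for the current migration */
};

/* Checking phase: stage one DL task moving from @src to @dst. */
static void dl_stage_move(struct dl_model *src, struct dl_model *dst)
{
	assert(src != NULL && dst != NULL);
	src->nr_migrate_dl_tasks--;
	dst->nr_migrate_dl_tasks++;
}

/* Final phase: fold the delta in on success, discard it on failure. */
static void dl_finish(struct dl_model *cs, bool success)
{
	assert(cs != NULL);
	if (success)
		cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
	cs->nr_migrate_dl_tasks = 0;
}
```

Because the source delta is negative and the destination delta is positive,
the same addition serves both sides, and a failed migration leaves the
committed counts untouched.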

Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset-internal.h |   6 +
 kernel/cgroup/cpuset.c          | 206 +++++++++++++++++++++++---------
 2 files changed, 157 insertions(+), 55 deletions(-)

diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index f7aaf01f7cd5..4c2772a7fd5e 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -161,6 +161,12 @@ struct cpuset {
 	 */
 	bool remote_partition;
 
+	/*
+	 * cpuset_can_attach() and cpuset_attach() specific data
+	 */
+	bool			attach_node_in_llist;
+	struct llist_node	attach_node;
+
 	/*
 	 * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
 	 * know when to rebuild associated root domain bandwidth information.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7100575927f6..98ee001ef950 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -37,6 +37,7 @@
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 #include <linux/task_work.h>
+#include <linux/llist.h>
 
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -2983,6 +2984,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
  * matter which child cpuset is selected as cpuset_attach_old_cs.
  */
 static struct cpuset *cpuset_attach_old_cs;
+static LLIST_HEAD(src_cs_head);
+static LLIST_HEAD(dst_cs_head);
 static bool attach_cpus_updated;
 static bool attach_mems_updated;
 
@@ -2995,9 +2998,10 @@ static bool attach_mems_updated;
  * Also set the boolean flag passed in by @psetsched depending on if
  * security_task_setscheduler() call is needed and @oldcs is not NULL.
  */
-static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
-				   bool *psetsched)
+static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs, bool *psetsched)
 {
+	bool cpu_match, mem_match;
+
 	if (cpumask_empty(cs->effective_cpus) ||
 	   (!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
 		return -ENOSPC;
@@ -3008,15 +3012,34 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
 	/*
 	 * Update attach specific data
 	 */
-	attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
-	attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+	if (!cs->attach_node_in_llist) {
+		llist_add(&cs->attach_node, &dst_cs_head);
+		cs->attach_node_in_llist = true;
+	}
+	if (!oldcs->attach_node_in_llist) {
+		llist_add(&oldcs->attach_node, &src_cs_head);
+		oldcs->attach_node_in_llist = true;
+	}
+
+	cpu_match = cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+	mem_match = nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+	/*
+	 * Set the updated flags whenever there is a mismatch in any of the
+	 * src/dst pairs.
+	 */
+	if (!attach_cpus_updated)
+		attach_cpus_updated = !cpu_match;
+
+	if (!attach_mems_updated)
+		attach_mems_updated = !mem_match;
 
 	/*
 	 * Skip rights over task setsched check in v2 when nothing changes,
 	 * migration permission derives from hierarchy ownership in
 	 * cgroup_procs_write_permission()).
 	 */
-	*psetsched = !cpuset_v2() || attach_cpus_updated || attach_mems_updated;
+	*psetsched = !cpuset_v2() || !cpu_match || !mem_match;
 
 	/*
 	 * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
@@ -3031,33 +3054,103 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
 	return 0;
 }
 
-static int cpuset_reserve_dl_bw(struct cpuset *cs)
+/*
+ * If reset_dl_bw is set, reset the previous dl_bw_alloc() call. Otherwise,
+ * update nr_deadline_tasks according to nr_migrate_dl_tasks in both source
+ * and destination cpusets.
+ */
+static void clear_attach_data(bool reset_dl_bw)
+{
+	struct cpuset *cs, *next;
+
+	llist_for_each_entry_safe(cs, next, src_cs_head.first, attach_node) {
+		cs->attach_node.next = NULL;
+		cs->attach_node_in_llist = false;
+		if (cs->nr_migrate_dl_tasks && !reset_dl_bw)
+			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+		cs->nr_migrate_dl_tasks = 0;
+	}
+
+	llist_for_each_entry_safe(cs, next, dst_cs_head.first, attach_node) {
+		cs->attach_node.next = NULL;
+		cs->attach_node_in_llist = false;
+		if (reset_dl_bw && cs->dl_bw_cpu >= 0)
+			dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
+		if (cs->nr_migrate_dl_tasks && !reset_dl_bw)
+			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+		cs->nr_migrate_dl_tasks = 0;
+		cs->sum_migrate_dl_bw = 0;
+		cs->dl_bw_cpu = -1;
+	}
+
+	src_cs_head.first = NULL;
+	dst_cs_head.first = NULL;
+	attach_cpus_updated = false;
+	attach_mems_updated = false;
+}
+
+static int cpuset_reserve_dl_bw(void)
 {
+	struct cpuset *cs;
 	int cpu, ret;
 
-	if (!cs->sum_migrate_dl_bw)
-		return 0;
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
+		if (!cs->sum_migrate_dl_bw)
+			continue;
 
-	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
-	if (unlikely(cpu >= nr_cpu_ids))
-		return -EINVAL;
+		cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+		if (unlikely(cpu >= nr_cpu_ids))
+			return -EINVAL;
 
-	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
-	if (ret)
-		return ret;
+		ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+		if (ret)
+			return ret;
 
-	cs->dl_bw_cpu = cpu;
+		cs->dl_bw_cpu = cpu;
+	}
 	return 0;
 }
 
-static void reset_migrate_dl_data(struct cpuset *cs)
+static void set_attach_in_progress(void)
 {
-	cs->nr_migrate_dl_tasks = 0;
-	cs->sum_migrate_dl_bw = 0;
-	cs->dl_bw_cpu = -1;
+	struct cpuset *cs;
+
+	/*
+	 * Mark attach is in progress.  This makes validate_change() fail
+	 * changes which zero cpus/mems_allowed.
+	 */
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+		cs->attach_in_progress++;
+}
+
+static void reset_attach_in_progress(void)
+{
+	struct cpuset *cs;
+
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+		dec_attach_in_progress_locked(cs);
 }
 
-/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
+/*
+ * Called by cgroups to determine if a cpuset is usable; cpuset_mutex held.
+ *
+ * With cgroup v2, enabling of cpuset controller in a cgroup subtree can
+ * cause @tset to contain task migration data from one parent cpuset to multiple
+ * child cpusets. Not much is needed to be done here other than tracking the
+ * number of DL tasks in each cpuset as the CPUs and memory nodes of the child
+ * cpusets are exactly the same as the parent.
+ *
+ * Conversely, disabling of cpuset controller can cause @tset to contain task
+ * migration data from multiple child cpusets to one parent cpuset. Here, the
+ * CPUs and memory nodes of the child cpusets may be different from the parent,
+ * but must be a subset of its parent.
+ *
+ * Another possible many-to-one migration is the moving of the whole
+ * multithreaded process with threads in different cpusets to another cpuset.
+ *
+ * For all other use cases, @tset task migration data should be from one source
+ * cpuset to one destination cpuset.
+ */
 static int cpuset_can_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
@@ -3079,6 +3172,16 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 		goto out_unlock;
 
 	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *newcs = css_cs(css);
+		struct cpuset *new_oldcs = task_cs(task);
+
+		if ((newcs != cs) || (new_oldcs != oldcs)) {
+			cs = newcs;
+			oldcs = new_oldcs;
+			ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
+			if (ret)
+				goto out_unlock;
+		}
 		ret = task_can_attach(task);
 		if (ret)
 			goto out_unlock;
@@ -3100,23 +3203,19 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 			 * contribute to sum_migrate_dl_bw.
 			 */
 			cs->nr_migrate_dl_tasks++;
+			oldcs->nr_migrate_dl_tasks--;
 			if (dl_task_needs_bw_move(task, cs->effective_cpus))
 				cs->sum_migrate_dl_bw += task->dl.dl_bw;
 		}
 	}
 
-	ret = cpuset_reserve_dl_bw(cs);
+	ret = cpuset_reserve_dl_bw();
 
 out_unlock:
-	if (ret) {
-		reset_migrate_dl_data(cs);
-	} else {
-		/*
-		 * Mark attach is in progress.  This makes validate_change() fail
-		 * changes which zero cpus/mems_allowed.
-		 */
-		cs->attach_in_progress++;
-	}
+	if (ret)
+		clear_attach_data(true);
+	else
+		set_attach_in_progress();
 
 	mutex_unlock(&cpuset_mutex);
 	return ret;
@@ -3131,14 +3230,8 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 	cs = css_cs(css);
 
 	mutex_lock(&cpuset_mutex);
-	dec_attach_in_progress_locked(cs);
-
-	if (cs->dl_bw_cpu >= 0)
-		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
-
-	if (cs->nr_migrate_dl_tasks)
-		reset_migrate_dl_data(cs);
-
+	reset_attach_in_progress();
+	clear_attach_data(true);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -3210,42 +3303,45 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	struct task_struct *task;
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
-	struct cpuset *oldcs = cpuset_attach_old_cs;
 
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
-
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
 	mutex_lock(&cpuset_mutex);
 	queue_task_work = false;
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
-	 * will trigger a number of cpuset_attach() calls with no change
-	 * in effective cpus and mems. In that case, we can optimize out
-	 * by skipping the task iteration and update.
+	 * will trigger a cpuset_attach() call with no change in effective cpus
+	 * and mems. In that case, we can optimize out by skipping the task
+	 * iteration and update, but the destination cpuset list is iterated to
+	 * set old_mems_allowed.
 	 */
-	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated)
+	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated) {
+		llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+			cs->old_mems_allowed = cs->effective_mems;
 		goto out;
+	}
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 
-	cgroup_taskset_for_each(task, css, tset)
+	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *newcs = css_cs(css);
+
+		if (newcs != cs) {
+			cs->old_mems_allowed = cs->effective_mems;
+			cs = newcs;
+			guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+		}
 		cpuset_attach_task(cs, task);
+	}
 
-out:
 	if (queue_task_work)
 		schedule_flush_migrate_mm();
 	cs->old_mems_allowed = cs->effective_mems;
-
-	if (cs->nr_migrate_dl_tasks) {
-		cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
-		oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
-		reset_migrate_dl_data(cs);
-	}
-
-	dec_attach_in_progress_locked(cs);
-
+out:
+	reset_attach_in_progress();
+	clear_attach_data(false);
 	mutex_unlock(&cpuset_mutex);
 }
 
-- 
2.54.0



* [PATCH-next v3 4/5] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task()
From: Waiman Long @ 2026-05-27 15:37 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra
  Cc: cgroups, linux-kernel, Aaron Tomlin, Waiman Long
In-Reply-To: <20260527153800.1557449-1-longman@redhat.com>

cpuset_attach_task() was introduced in commit 42a11bf5c543
("cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly")
to make the CLONE_INTO_CGROUP flag of clone(2) behave more like
moving a task from one cpuset into another. That commit didn't
move the mpol_rebind_mm() and cpuset_migrate_mm() calls for the group
leader into cpuset_attach_task().

When the CLONE_INTO_CGROUP flag is used without CLONE_THREAD, the new
task is its own group leader, so the operation is still not equivalent
to moving a task between cpusets. Make CLONE_INTO_CGROUP behave
more like cpuset_attach() by moving the mpol_rebind_mm() and
cpuset_migrate_mm() calls inside cpuset_attach_task(). As a result,
cpuset_attach_old_cs, attach_cpus_updated and attach_mems_updated
also need to be updated in cpuset_fork().
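
The restructuring can be pictured with a small userspace model in which a
single per-task helper does the common work for every task and the extra
mm-related step only for thread group leaders. This is a sketch, not the
kernel implementation: task_model, attach_task_model() and
attach_all_model() are hypothetical names, and the flags merely record
which steps ran.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct. */
struct task_model {
	struct task_model *group_leader;
	int cpus_updated; /* set by the per-task step */
	int mm_rebound;   /* set by the leader-only step */
};

/* One helper handles both the per-task and the leader-only work. */
static void attach_task_model(struct task_model *t)
{
	assert(t != NULL && t->group_leader != NULL);

	t->cpus_updated = 1; /* done for every task */

	if (t != t->group_leader)
		return;

	t->mm_rebound = 1; /* done once per thread group, via its leader */
}

/* A single loop replaces separate task and leader iterations. */
static void attach_all_model(struct task_model *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		attach_task_model(&tasks[i]);
}
```

A caller that handles just one task, as the fork path does, can invoke the
helper directly and still get the leader-only step when it applies.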

Besides, the original code uses cpuset_attach_nodemask_to both for
the nodemask returned by guarantee_online_mems(), which is used only by
cpuset_change_task_nodemask(), and for cs->effective_mems in all other
cases. Such dual use becomes impractical once the two task iteration
loops are merged into one. So keep cpuset_attach_nodemask_to for the
nodemask returned by guarantee_online_mems() and reference
cs->effective_mems directly in all the other cases.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 90 ++++++++++++++++++++++--------------------
 1 file changed, 47 insertions(+), 43 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b233a71f9b7c..7100575927f6 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3149,9 +3149,12 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
  */
 static cpumask_var_t cpus_attach;
 static nodemask_t cpuset_attach_nodemask_to;
+static bool queue_task_work;
 
 static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
 {
+	struct mm_struct *mm;
+
 	lockdep_assert_cpuset_lock_held();
 
 	if (cs != &top_cpuset)
@@ -3165,24 +3168,56 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
 	 */
 	WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
 
+	if (cpuset_v2() && !attach_mems_updated)
+		return;
+
 	cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
 	cpuset1_update_task_spread_flags(cs, task);
+
+	if (task != task->group_leader)
+		return;
+
+	/*
+	 * Change mm for threadgroup leader. This is expensive and may
+	 * sleep and should be moved outside migration path proper.
+	 */
+	mm = get_task_mm(task);
+	if (mm) {
+		struct cpuset *oldcs = cpuset_attach_old_cs;
+
+		mpol_rebind_mm(mm, &cs->effective_mems);
+
+		/*
+		 * old_mems_allowed is the same with mems_allowed
+		 * here, except if this task is being moved
+		 * automatically due to hotplug.  In that case
+		 * @mems_allowed has been updated and is empty, so
+		 * @old_mems_allowed is the right nodesets that we
+		 * migrate mm from.
+		 */
+		if (is_memory_migrate(cs)) {
+			cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
+					  &cs->effective_mems);
+			queue_task_work = true;
+		} else {
+			mmput(mm);
+		}
+	}
 }
 
 static void cpuset_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
-	struct task_struct *leader;
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
 	struct cpuset *oldcs = cpuset_attach_old_cs;
-	bool queue_task_work = false;
 
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
 
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
 	mutex_lock(&cpuset_mutex);
+	queue_task_work = false;
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
@@ -3190,53 +3225,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * in effective cpus and mems. In that case, we can optimize out
 	 * by skipping the task iteration and update.
 	 */
-	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated) {
-		cpuset_attach_nodemask_to = cs->effective_mems;
+	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated)
 		goto out;
-	}
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 
 	cgroup_taskset_for_each(task, css, tset)
 		cpuset_attach_task(cs, task);
 
-	/*
-	 * Change mm for all threadgroup leaders. This is expensive and may
-	 * sleep and should be moved outside migration path proper. Skip it
-	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
-	 * not set.
-	 */
-	cpuset_attach_nodemask_to = cs->effective_mems;
-	if (!is_memory_migrate(cs) && !attach_mems_updated)
-		goto out;
-
-	cgroup_taskset_for_each_leader(leader, css, tset) {
-		struct mm_struct *mm = get_task_mm(leader);
-
-		if (mm) {
-			mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
-
-			/*
-			 * old_mems_allowed is the same with mems_allowed
-			 * here, except if this task is being moved
-			 * automatically due to hotplug.  In that case
-			 * @mems_allowed has been updated and is empty, so
-			 * @old_mems_allowed is the right nodesets that we
-			 * migrate mm from.
-			 */
-			if (is_memory_migrate(cs)) {
-				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
-						  &cpuset_attach_nodemask_to);
-				queue_task_work = true;
-			} else
-				mmput(mm);
-		}
-	}
-
 out:
 	if (queue_task_work)
 		schedule_flush_migrate_mm();
-	cs->old_mems_allowed = cpuset_attach_nodemask_to;
+	cs->old_mems_allowed = cs->effective_mems;
 
 	if (cs->nr_migrate_dl_tasks) {
 		cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
@@ -3666,15 +3666,14 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
  */
 static void cpuset_fork(struct task_struct *task)
 {
-	struct cpuset *cs;
-	bool same_cs;
+	struct cpuset *cs, *oldcs;
 
 	rcu_read_lock();
 	cs = task_cs(task);
-	same_cs = (cs == task_cs(current));
+	oldcs = task_cs(current);
 	rcu_read_unlock();
 
-	if (same_cs) {
+	if (cs == oldcs) {
 		if (cs == &top_cpuset)
 			return;
 
@@ -3686,7 +3685,12 @@ static void cpuset_fork(struct task_struct *task)
 	/* CLONE_INTO_CGROUP */
 	mutex_lock(&cpuset_mutex);
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+	/* Assume CPUs and memory nodes are updated */
+	attach_cpus_updated = attach_mems_updated = true;
+	cpuset_attach_old_cs = oldcs;
+	oldcs->old_mems_allowed = oldcs->effective_mems;
 	cpuset_attach_task(cs, task);
+	attach_cpus_updated = attach_mems_updated = false;
 
 	dec_attach_in_progress_locked(cs);
 	mutex_unlock(&cpuset_mutex);
-- 
2.54.0

