From: Qi Zheng <qi.zheng@linux.dev>
To: Peiyang He <peiyang_he@smail.nju.edu.cn>,
akpm@linux-foundation.org, hannes@cmpxchg.org,
linux-mm@kvack.org
Cc: mhocko@kernel.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, muchun.song@linux.dev,
kasong@tencent.com, baohua@kernel.org, axelrasmussen@google.com,
yuanchu@google.com, weixugc@google.com, david@kernel.org,
ljs@kernel.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org, syzkaller@googlegroups.com
Subject: Re: [BUG] mm: mglru: stale aging batch triggers lru_gen_exit_memcg warning
Date: Mon, 22 Jun 2026 11:12:34 +0800
Message-ID: <6af11da2-affa-414c-8426-168224cd2f69@linux.dev>
In-Reply-To: <5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn>

Hi Peiyang,

Thanks for reporting this issue!

On 6/21/26 9:50 PM, Peiyang He wrote:
> Hello,
>
> I hit the following warning while fuzzing other kernel code with Syzkaller.
>
> The original Syzkaller report:
>
> WARNING: mm/vmscan.c:5867 at lru_gen_exit_memcg+0x26f/0x300 mm/vmscan.c:5867, CPU#0: kworker/0:0/9
> Modules linked in:
> CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 7.1.0 #2 PREEMPT(full)
> Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> Workqueue: cgroup_free css_free_rwork_fn
> RIP: 0010:lru_gen_exit_memcg+0x26f/0x300 mm/vmscan.c:5867
> Code: 89 de e8 d4 62 ba ff 49 83 fd 3f 0f 86 9c fe ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f e9 17 68 ba ff e8 12 68 ba ff 90 <0f> 0b 90 e9 b0 fe ff ff e8 04 68 ba ff 66 90 e8 fd 67 ba ff 90 0f
> RSP: 0018:ffffc900001afb78 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82049e88
> RDX: ffff888016f35c40 RSI: ffffffff8204a02e RDI: ffff88801d4103b8
> RBP: dffffc0000000000 R08: 0000000000000005 R09: 0000000000000040
> R10: 0000000000000000 R11: 0000000000002ba4 R12: ffff8880481f1600
> R13: ffff88801d410650 R14: ffff88801d410040 R15: dead000000000100
> FS: 0000000000000000(0000) GS:ffff888098d91000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055ac6490c1d8 CR3: 00000000249b0000 CR4: 0000000000350ef0
> Call Trace:
> <TASK>
> mem_cgroup_free mm/memcontrol.c:3972 [inline]
> mem_cgroup_css_free+0x76/0xb0 mm/memcontrol.c:4241
> css_free_rwork_fn+0x125/0x1260 kernel/cgroup/cgroup.c:5575
> process_one_work+0xa0d/0x1c30 kernel/workqueue.c:3314
> process_scheduled_works kernel/workqueue.c:3397 [inline]
> worker_thread+0x645/0xe80 kernel/workqueue.c:3478
> kthread+0x367/0x480 kernel/kthread.c:436
> ret_from_fork+0x72b/0xd50 arch/x86/kernel/process.c:158
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> </TASK>
>
> Kernel version: commit 8cd9520d35a6c38db6567e97dd93b1f11f185dc6 (tag v7.1)
>
> Relevant kernel config:
>
> CONFIG_MEMCG=y
> CONFIG_LRU_GEN=y
> CONFIG_LRU_GEN_ENABLED=y
> CONFIG_LRU_GEN_WALKS_MMU=y
> CONFIG_NUMA=y
>
> Root Cause:
>
> The bug is a race between two code paths that each hold
> `lruvec->lru_lock`, but at non-overlapping times.
>
> Component 1 - `reset_batch_size()`:
>
> During `walk_mm()`, `update_batch_size()` accumulates per-generation
> page deltas into `walk->nr_pages` WITHOUT holding `lruvec_lock`. After
> `mmap_read_unlock(mm)`, the walker reacquires `lruvec_lock` and
> `reset_batch_size()` writes those deltas UNCONDITIONALLY into
> `lrugen->nr_pages`.
>
> Component 2 - `lru_gen_reparent_memcg()`:
>
> When a memcg is offlined, `lru_gen_reparent_memcg()` moves all folios to
> the parent lruvec and zeros the child's `lrugen->nr_pages`, all under
> `lruvec_lock`.
>
> I have not bisected the issue. Based on code inspection, the important
> interaction appears to be the reparenting path that clears the child's
> `nr_pages` while `reset_batch_size()` can still commit a batch that was
> generated before the memcg went offline. This looks related to
> f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios").
>
> Race sequence:
>
> 1. The aging path enters walk_mm() for the child memcg lruvec.
>
> 2. walk_page_range() scans PTEs and update_batch_size() stores deltas
>    in walk->nr_pages. At this point the deltas have not been committed
>    to lruvec->lrugen.nr_pages yet.
>
> 3. walk_mm() drops mmap_read_lock(mm). Before it reaches
>    reset_batch_size(), the child memcg is killed and removed.
>
> 4. The memcg offline path runs lru_gen_reparent_memcg(). Under
>    lruvec_lock, it moves the child folios to the parent and clears the
>    child's lrugen.nr_pages.
>
> 5. The old aging walk resumes, takes lruvec_lock, and
>    reset_batch_size() writes the stale walk->nr_pages deltas back into
>    the original child lruvec.
>
> 6. Later, lru_gen_exit_memcg(child) checks the child's lrugen.nr_pages
>    with memchr_inv(...). Since the stale batch made some slots non-zero
>    again, VM_WARN_ON_ONCE() triggers.

It seems this race can actually happen.
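
To make the sequence concrete, here is a small userspace model of steps
1-6. It is only a sketch, not kernel code: the toy_* names are made up,
it tracks a single type/zone, and it assumes that reparenting folds the
child's counts into the parent, which the report does not spell out.

```c
#include <assert.h>

#define TOY_NR_GENS 4

/* Toy stand-ins for lrugen->nr_pages and walk->nr_pages (one type, one zone). */
struct toy_lruvec {
	long nr_pages[TOY_NR_GENS];
};

struct toy_walk {
	struct toy_lruvec *lruvec;
	long nr_pages[TOY_NR_GENS];
};

/* Models update_batch_size(): cache a move old_gen -> new_gen, no lock held. */
static void toy_update_batch(struct toy_walk *walk, int old_gen, int new_gen, long n)
{
	walk->nr_pages[old_gen] -= n;
	walk->nr_pages[new_gen] += n;
}

/*
 * Models lru_gen_reparent_memcg(): zero the child's counts.
 * Adding them to the parent is an assumption made for this sketch.
 */
static void toy_reparent(struct toy_lruvec *child, struct toy_lruvec *parent)
{
	for (int gen = 0; gen < TOY_NR_GENS; gen++) {
		parent->nr_pages[gen] += child->nr_pages[gen];
		child->nr_pages[gen] = 0;
	}
}

/* Models reset_batch_size(): commit the cached deltas unconditionally. */
static void toy_reset_batch(struct toy_walk *walk)
{
	for (int gen = 0; gen < TOY_NR_GENS; gen++) {
		walk->lruvec->nr_pages[gen] += walk->nr_pages[gen];
		walk->nr_pages[gen] = 0;
	}
}

/* Models the memchr_inv() check in lru_gen_exit_memcg(): count non-zero slots. */
static int toy_nonzero_slots(const struct toy_lruvec *lruvec)
{
	int n = 0;

	for (int gen = 0; gen < TOY_NR_GENS; gen++)
		n += lruvec->nr_pages[gen] != 0;
	return n;
}

/* Replays the race in order; returns the child's non-zero slots at exit. */
static int toy_replay_race(void)
{
	struct toy_lruvec parent = { { 0 } };
	struct toy_lruvec child = { { 8, 0, 0, 0 } };
	struct toy_walk walk = { .lruvec = &child };

	toy_update_batch(&walk, 0, 1, 3);	/* step 2: deltas cached in the walk */
	toy_reparent(&child, &parent);		/* step 4: child zeroed */
	assert(toy_nonzero_slots(&child) == 0);	/* child is clean right after reparenting */
	toy_reset_batch(&walk);			/* step 5: stale deltas hit the child */
	return toy_nonzero_slots(&child);	/* step 6: what the exit check sees */
}
```

In this model toy_replay_race() returns 2: the child ends up with -3 in
generation 0 and +3 in generation 1, which is the kind of non-zero state
the memchr_inv() check in lru_gen_exit_memcg() complains about.
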
>
> The two critical sections are serialized by `lruvec_lock`, but the batch
> accumulation in `walk->nr_pages` happens outside that lock, so there is
> no ordering between the accumulation and the reparenting zeroing.
>
> The relevant code path:
>
> mm/vmscan.c:
>   run_cmd('+')              selects the target memcg and child lruvec
>   try_to_inc_max_seq()      stores the child lruvec in walk->lruvec
>   update_batch_size()       accumulates deltas in walk->nr_pages
>   walk_mm()                 calls walk_page_range(), then later
>                             reset_batch_size()
>   reset_batch_size()        writes cached deltas into
>                             walk->lruvec->lrugen.nr_pages
>   lru_gen_reparent_memcg()  reparents child MGLRU state and clears
>                             child nr_pages
>   lru_gen_exit_memcg()      warns if the exiting memcg has non-zero
>                             nr_pages
>
> mm/memcontrol.c:
>   mem_cgroup_css_offline()  calls memcg_reparent_objcgs() and
>                             lru_gen_offline_memcg()
>   mem_cgroup_free()         calls lru_gen_exit_memcg()
>
> Reproducer:
>
> The C reproducer and the helper script for running it are provided in
> the attachments.
>
> The PoC creates a leaf memory cgroup, moves a victim process into it,
> and makes the victim fault and continuously touch file-backed pages so
> MGLRU aging can produce cached generation deltas for that memcg. A
> separate `lru_ager` thread repeatedly writes aging commands to
> `/sys/kernel/debug/lru_gen`; when the instrumentation reports that the
> ager is delayed just before `reset_batch_size()`, the PoC kills the
> victim and removes the leaf cgroup, forcing memcg offline/reparenting
> before the stale batch is committed.
>
> The helper script builds the PoC, creates a temporary qcow2 overlay,
> boots the instrumented kernel in QEMU with fake NUMA and SSH port
> forwarding, copies the PoC into the guest, runs it, and scans the serial
> console for `exit_nonzero`, `WARNING: mm/vmscan.c`, or `Kernel panic`.
> It writes the full serial console, extracted kernel events, and guest
> stdout/stderr under the chosen output directory.
>
> The example command:
>
>   ./repros/lru_gen_exit_memcg/run_poc_qemu.sh /tmp/lru_gen_poc_manual 10450 20 32
>
> The arguments are:
>
>   /tmp/lru_gen_poc_manual  output directory for the overlay, console log,
>                            extracted events and guest log
>   10450                    host TCP port forwarded to guest SSH
>   20                       number of PoC iterations to run
>   32                       file-backed working-set size in MiB per iteration
>
> The script uses default `KERNEL`, `IMAGE` and `SSH_KEY` paths, or they
> can be overridden with environment variables.
>
> Since this bug requires a specific race window, kernel instrumentation
> is needed to enlarge the race window in order to reproduce the bug more
> reliably. The instrumentation patch is also included in the attachments.
>
> The patch only instruments `mm/vmscan.c`: it delays the PoC aging task
> just before `reset_batch_size()`, logs when a stale batch is written
> into an already offlined and zeroed memcg lruvec, and dumps the non-zero
> `lrugen.nr_pages` slots before `lru_gen_exit_memcg()` triggers the
> warning.
>
> A successful run reports `status=repro_triggered`, and the extracted events
> include a warning like:
>
> WARNING: mm/vmscan.c:5943 at lru_gen_exit_memcg+0x420/0x520
>
> Proposed Fix:
>
> One possible fix direction is to make `reset_batch_size()` skip writing
> back the stale delta when the memcg is no longer online.
> `reset_batch_size()` is called under `lruvec_lock`, the same lock that
> `lru_gen_reparent_memcg()` holds when it zeroes `nr_pages`, so this
> should avoid committing a batch after reparenting has completed.
>
> Possible fix direction, not a tested patch:
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -... reset_batch_size() ...
>  static void reset_batch_size(struct lru_gen_mm_walk *walk)
>  {
>  	int gen, type, zone;
>  	struct lruvec *lruvec = walk->lruvec;
>  	struct lru_gen_folio *lrugen = &lruvec->lrugen;
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>
>  	walk->batched = 0;
>
>  	for_each_gen_type_zone(gen, type, zone) {
>  		enum lru_list lru = type * LRU_INACTIVE_FILE;
>  		int delta = walk->nr_pages[gen][type][zone];
>
>  		if (!delta)
>  			continue;
>
>  		walk->nr_pages[gen][type][zone] = 0;
> +
> +		/*
> +		 * If the memcg went offline while we were walking page tables,
> +		 * lru_gen_reparent_memcg() has already zeroed nr_pages and moved
> +		 * all folios to the parent. Writing our stale batch delta back
> +		 * would corrupt the offline child and trigger WARN_ON in
> +		 * lru_gen_exit_memcg(). Discard the delta; the parent lruvec
> +		 * already owns the pages and accounts for them correctly.
> +		 */
> +		if (memcg && !mem_cgroup_online(memcg))
> +			continue;

This check is insufficient, because offline_css() clears the CSS_ONLINE
flag only after ss->css_offline(css) has run. And we cannot simply drop
the delta.
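
A minimal userspace sketch of that ordering, with made-up toy_* names
standing in for the real cgroup code. It assumes that the reparenting
(and the zeroing of the child's nr_pages) happens inside the css_offline
callback, as the call path quoted above suggests:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a css; not the kernel structure. */
struct toy_css {
	bool online;		/* stands in for CSS_ONLINE in css->flags */
	bool reparented;	/* set once the child's nr_pages have been zeroed */
};

/* Stand-in for ss->css_offline(css): reparenting is assumed to happen here. */
static void toy_css_offline_cb(struct toy_css *css)
{
	css->reparented = true;
}

/* The proposed check: skip the stale delta only when the css is not online. */
static bool toy_would_skip(const struct toy_css *css)
{
	return !css->online;
}

/*
 * Models offline_css(): the callback runs first, the online flag is
 * cleared afterwards. Returns true if a walker that runs in between
 * would still commit its stale delta despite the proposed check.
 */
static bool toy_offline_css_has_window(void)
{
	struct toy_css css = { .online = true, .reparented = false };
	bool stale_commit;

	toy_css_offline_cb(&css);	/* child already reparented and zeroed */

	/* Window: reparented, but the css still looks online to the walker. */
	stale_commit = css.reparented && !toy_would_skip(&css);

	css.online = false;		/* flag cleared only now */
	assert(toy_would_skip(&css));	/* the check starts working too late */

	return stale_commit;
}
```

In this model toy_offline_css_has_window() returns true, i.e. a walker
reaching reset_batch_size() inside that window would pass the online
check and still write into the already-zeroed child.
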
Thanks,
Qi
> +
>  		WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
>  			   lrugen->nr_pages[gen][type][zone] + delta);
>
>  		if (lru_gen_is_active(lruvec, gen))
>  			lru += LRU_ACTIVE;
>  		__update_lru_size(lruvec, lru, zone, delta);
>  	}
>  }
>
> Thanks
Thread overview: 5+ messages
2026-06-21 13:50 [BUG] mm: mglru: stale aging batch triggers lru_gen_exit_memcg warning Peiyang He
2026-06-22 3:12 ` Qi Zheng [this message]
2026-06-22 7:37 ` [PATCH] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
2026-06-22 8:24 ` Peiyang He
2026-06-22 8:31 ` Qi Zheng