Re: [BUG] mm: mglru: stale aging batch triggers lru_gen_exit_memcg warning

From: Qi Zheng <qi.zheng@linux.dev>
To: Peiyang He <peiyang_he@smail.nju.edu.cn>,
	akpm@linux-foundation.org, hannes@cmpxchg.org,
	linux-mm@kvack.org
Cc: mhocko@kernel.org, roman.gushchin@linux.dev,
	shakeel.butt@linux.dev, muchun.song@linux.dev,
	kasong@tencent.com, baohua@kernel.org, axelrasmussen@google.com,
	yuanchu@google.com, weixugc@google.com, david@kernel.org,
	ljs@kernel.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, syzkaller@googlegroups.com
Subject: Re: [BUG] mm: mglru: stale aging batch triggers lru_gen_exit_memcg warning
Date: Mon, 22 Jun 2026 11:12:34 +0800
Message-ID: <6af11da2-affa-414c-8426-168224cd2f69@linux.dev>
In-Reply-To: <5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn>

Hi Peiyang,

Thanks for reporting this issue!

On 6/21/26 9:50 PM, Peiyang He wrote:
> Hello,
> 
> I hit the following warning while fuzzing other kernel code with Syzkaller.
> 
> The original Syzkaller report:
> 
> WARNING: mm/vmscan.c:5867 at lru_gen_exit_memcg+0x26f/0x300 mm/vmscan.c:5867, CPU#0: kworker/0:0/9
> Modules linked in:
> CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 7.1.0 #2 PREEMPT(full)
> Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> Workqueue: cgroup_free css_free_rwork_fn
> RIP: 0010:lru_gen_exit_memcg+0x26f/0x300 mm/vmscan.c:5867
> Code: 89 de e8 d4 62 ba ff 49 83 fd 3f 0f 86 9c fe ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f e9 17 68 ba ff e8 12 68 ba ff 90 <0f> 0b 90 e9 b0 fe ff ff e8 04 68 ba ff 66 90 e8 fd 67 ba ff 90 0f
> RSP: 0018:ffffc900001afb78 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82049e88
> RDX: ffff888016f35c40 RSI: ffffffff8204a02e RDI: ffff88801d4103b8
> RBP: dffffc0000000000 R08: 0000000000000005 R09: 0000000000000040
> R10: 0000000000000000 R11: 0000000000002ba4 R12: ffff8880481f1600
> R13: ffff88801d410650 R14: ffff88801d410040 R15: dead000000000100
> FS:  0000000000000000(0000) GS:ffff888098d91000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055ac6490c1d8 CR3: 00000000249b0000 CR4: 0000000000350ef0
> Call Trace:
>   <TASK>
>   mem_cgroup_free mm/memcontrol.c:3972 [inline]
>   mem_cgroup_css_free+0x76/0xb0 mm/memcontrol.c:4241
>   css_free_rwork_fn+0x125/0x1260 kernel/cgroup/cgroup.c:5575
>   process_one_work+0xa0d/0x1c30 kernel/workqueue.c:3314
>   process_scheduled_works kernel/workqueue.c:3397 [inline]
>   worker_thread+0x645/0xe80 kernel/workqueue.c:3478
>   kthread+0x367/0x480 kernel/kthread.c:436
>   ret_from_fork+0x72b/0xd50 arch/x86/kernel/process.c:158
>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>   </TASK>
> 
> Kernel version: commit 8cd9520d35a6c38db6567e97dd93b1f11f185dc6 (tag v7.1)
> 
> Relevant kernel config:
> 
>    CONFIG_MEMCG=y
>    CONFIG_LRU_GEN=y
>    CONFIG_LRU_GEN_ENABLED=y
>    CONFIG_LRU_GEN_WALKS_MMU=y
>    CONFIG_NUMA=y
> 
> Root Cause:
> 
> The bug is a race between two code paths that each hold
> `lruvec->lru_lock`, but at non-overlapping times.
> 
> Component 1 - `reset_batch_size()`:
> 
> During `walk_mm()`, `update_batch_size()` accumulates per-generation page
> deltas into `walk->nr_pages` WITHOUT holding `lruvec_lock`.  After
> `mmap_read_unlock(mm)`, the walker reacquires `lruvec_lock` and
> `reset_batch_size()` writes those deltas UNCONDITIONALLY into
> `lrugen->nr_pages`.
> 
> Component 2 - `lru_gen_reparent_memcg()`:
> 
> When a memcg is offlined, `lru_gen_reparent_memcg()` moves all folios to
> the parent lruvec and zeros the child's `lrugen->nr_pages`, all under
> `lruvec_lock`.
> 
> I have not bisected the issue.  Based on code inspection, the important
> interaction appears to be the reparenting path that clears the child's
> `nr_pages` while `reset_batch_size()` can still commit a batch that was
> generated before the memcg went offline.  This looks related to
> f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios").
> 
> Race sequence:
> 
>      1. The aging path enters walk_mm() for the child memcg lruvec.
> 
>      2. walk_page_range() scans PTEs and update_batch_size() stores deltas
>         in walk->nr_pages.  At this point the deltas have not been
>         committed to lruvec->lrugen.nr_pages yet.
> 
>      3. walk_mm() drops mmap_read_lock(mm).  Before it reaches
>         reset_batch_size(), the child memcg is killed and removed.
> 
>      4. The memcg offline path runs lru_gen_reparent_memcg().  Under
>         lruvec_lock, it moves the child folios to the parent and clears the
>         child's lrugen.nr_pages.
> 
>      5. The old aging walk resumes, takes lruvec_lock, and
>         reset_batch_size() writes the stale walk->nr_pages deltas back into
>         the original child lruvec.
> 
>      6. Later, lru_gen_exit_memcg(child) checks the child's lrugen.nr_pages
>         with memchr_inv(...).  Since the stale batch made some slots
>         non-zero again, VM_WARN_ON_ONCE() triggers.

It seems this race can actually happen.
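
To make the interleaving concrete, here is a minimal userspace sketch
that replays steps 1-6 in a single thread. It is not kernel code: every
toy_* name and the array sizes are hypothetical stand-ins, and
toy_reparent() simply adds the child's counters to the parent, which is
a simplification of whatever lru_gen_reparent_memcg() really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_GENS  4	/* stand-in for MAX_NR_GENS */
#define TOY_TYPES 2	/* 0 = anon, 1 = file */
#define TOY_ZONES 3	/* illustrative, not the kernel's MAX_NR_ZONES */

/* Toy stand-in for lrugen->nr_pages of one lruvec. */
struct toy_lrugen { long nr_pages[TOY_GENS][TOY_TYPES][TOY_ZONES]; };

/* Toy stand-in for the walker's private batch, walk->nr_pages. */
struct toy_walk { int nr_pages[TOY_GENS][TOY_TYPES][TOY_ZONES]; };

/* Step 2: record a move from old_gen to new_gen in the batch only. */
void toy_update_batch(struct toy_walk *walk, int old_gen, int new_gen,
		      int type, int zone, int nr)
{
	walk->nr_pages[old_gen][type][zone] -= nr;
	walk->nr_pages[new_gen][type][zone] += nr;
}

/* Step 4: hand everything to the parent and zero the child. */
void toy_reparent(struct toy_lrugen *child, struct toy_lrugen *parent)
{
	for (int g = 0; g < TOY_GENS; g++)
		for (int t = 0; t < TOY_TYPES; t++)
			for (int z = 0; z < TOY_ZONES; z++)
				parent->nr_pages[g][t][z] += child->nr_pages[g][t][z];
	memset(child->nr_pages, 0, sizeof(child->nr_pages));
}

/* Step 5: commit the batch unconditionally, as reset_batch_size() does. */
void toy_reset_batch(struct toy_walk *walk, struct toy_lrugen *lrugen)
{
	for (int g = 0; g < TOY_GENS; g++)
		for (int t = 0; t < TOY_TYPES; t++)
			for (int z = 0; z < TOY_ZONES; z++) {
				int delta = walk->nr_pages[g][t][z];

				if (!delta)
					continue;
				walk->nr_pages[g][t][z] = 0;
				lrugen->nr_pages[g][t][z] += delta;
			}
}

/* Step 6: userspace analogue of the memchr_inv() all-zero check. */
bool toy_all_zero(const struct toy_lrugen *lrugen)
{
	const unsigned char *p = (const unsigned char *)lrugen->nr_pages;

	for (size_t i = 0; i < sizeof(lrugen->nr_pages); i++)
		if (p[i])
			return false;
	return true;
}

/* Replays the reported order; returns true if the child ends up non-zero. */
bool toy_replay_race(void)
{
	struct toy_lrugen child = { 0 }, parent = { 0 };
	struct toy_walk walk = { 0 };

	child.nr_pages[0][1][0] = 8;		/* 8 file pages in gen 0 */
	toy_update_batch(&walk, 0, 1, 1, 0, 8);	/* walker batches gen 0 -> 1 */
	toy_reparent(&child, &parent);		/* memcg goes offline first */
	assert(parent.nr_pages[0][1][0] == 8);	/* parent now owns the pages */
	toy_reset_batch(&walk, &child);		/* stale batch hits the child */
	return !toy_all_zero(&child);		/* would trip the warning */
}
```

In this model the child is left with -8 in one slot and +8 in another,
so the all-zero check fails even though the deltas sum to zero. Swapping
the toy_reparent() and toy_reset_batch() calls leaves the child clean.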

> 
> The two critical sections are serialized by `lruvec_lock`, but the batch
> accumulation in `walk->nr_pages` happens outside that lock, so there is no
> ordering between the accumulation and the reparenting zeroing.
> 
> The relevant code path:
> 
>    mm/vmscan.c:
>      run_cmd('+')              selects the target memcg and child lruvec
>      try_to_inc_max_seq()      stores the child lruvec in walk->lruvec
>      update_batch_size()       accumulates deltas in walk->nr_pages
>      walk_mm()                 calls walk_page_range(), then later
>                                reset_batch_size()
>      reset_batch_size()        writes cached deltas into
>                                walk->lruvec->lrugen.nr_pages
>      lru_gen_reparent_memcg()  reparents child MGLRU state and clears
>                                child nr_pages
>      lru_gen_exit_memcg()      warns if the exiting memcg has non-zero
>                                nr_pages
> 
>    mm/memcontrol.c:
>      mem_cgroup_css_offline()  calls memcg_reparent_objcgs() and
>                                lru_gen_offline_memcg()
>      mem_cgroup_free()         calls lru_gen_exit_memcg()
> 
> Reproducer:
> 
> The C reproducer and the helper script for running it are provided in the
> attachments.
> 
> The PoC creates a leaf memory cgroup, moves a victim process into it, and
> makes the victim fault and continuously touch file-backed pages so MGLRU
> aging can produce cached generation deltas for that memcg. A separate
> `lru_ager` thread repeatedly writes aging commands to
> `/sys/kernel/debug/lru_gen`; when the instrumentation reports that the ager
> is delayed just before `reset_batch_size()`, the PoC kills the victim and
> removes the leaf cgroup, forcing memcg offline/reparenting before the stale
> batch is committed.
> 
> The helper script builds the PoC, creates a temporary qcow2 overlay, 
> boots the instrumented kernel in QEMU with fake NUMA and SSH port 
> forwarding, copies the PoC into the guest, runs it, and scans the serial 
> console for `exit_nonzero`, `WARNING: mm/vmscan.c`, or `Kernel panic`. 
> It writes the full serial console, extracted kernel events, and guest 
> stdout/stderr under the chosen output directory.
> 
> The example command:
> 
>    ./repros/lru_gen_exit_memcg/run_poc_qemu.sh /tmp/lru_gen_poc_manual 10450 20 32
> 
> The arguments are:
> 
>    /tmp/lru_gen_poc_manual  output directory for the overlay, console log,
>                             extracted events and guest log
>    10450                    host TCP port forwarded to guest SSH
>    20                       number of PoC iterations to run
>    32                       file-backed working-set size in MiB per
>                             iteration
> 
> The script uses default `KERNEL`, `IMAGE` and `SSH_KEY` paths, or they can
> be overridden with environment variables.
> 
> Since this bug requires a specific race window, kernel instrumentation is
> needed to enlarge the race window in order to reproduce the bug more
> reliably.  The instrumentation patch is also included in the attachments.
> 
> The patch only instruments `mm/vmscan.c`: it delays the PoC aging task just
> before `reset_batch_size()`, logs when a stale batch is written into an
> already offlined and zeroed memcg lruvec, and dumps the non-zero
> `lrugen.nr_pages` slots before `lru_gen_exit_memcg()` triggers the warning.
> 
> A successful run reports `status=repro_triggered`, and the extracted events
> include a warning like:
> 
>    WARNING: mm/vmscan.c:5943 at lru_gen_exit_memcg+0x420/0x520
> 
> Proposed Fix:
> 
> One possible fix direction is to make `reset_batch_size()` skip writing
> back the stale delta when the memcg is no longer online.
> `reset_batch_size()` is called under `lruvec_lock`, the same lock that
> `lru_gen_reparent_memcg()` holds when it zeroes `nr_pages`, so this should
> avoid committing a batch after reparenting has completed.
> 
> Possible fix direction, not a tested patch:
> 
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -... reset_batch_size() ...
>   static void reset_batch_size(struct lru_gen_mm_walk *walk)
>   {
>       int gen, type, zone;
>       struct lruvec *lruvec = walk->lruvec;
>       struct lru_gen_folio *lrugen = &lruvec->lrugen;
> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> 
>       walk->batched = 0;
> 
>       for_each_gen_type_zone(gen, type, zone) {
>           enum lru_list lru = type * LRU_INACTIVE_FILE;
>           int delta = walk->nr_pages[gen][type][zone];
> 
>           if (!delta)
>               continue;
> 
>           walk->nr_pages[gen][type][zone] = 0;
> +
> +        /*
> +         * If the memcg went offline while we were walking page tables,
> +         * lru_gen_reparent_memcg() has already zeroed nr_pages and moved
> +         * all folios to the parent.  Writing our stale batch delta back
> +         * would corrupt the offline child and trigger WARN_ON in
> +         * lru_gen_exit_memcg().  Discard the delta; the parent lruvec
> +         * already owns the pages and accounts for them correctly.
> +         */
> +        if (memcg && !mem_cgroup_online(memcg))
> +            continue;

This check is insufficient, because offline_css() clears CSS_ONLINE only
after ss->css_offline(css) has returned, so the memcg can still look
online after reparenting is done. And we cannot simply drop the delta.
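
A minimal userspace sketch of that ordering, to show the window. All
toy_* names are hypothetical stand-ins, not kernel code; the only thing
taken from the kernel is that offline_css() calls ss->css_offline(css)
before it clears CSS_ONLINE. The in_window callback is an assumption
that stands for a walker that takes the lock between those two steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CSS_ONLINE 0x1u	/* stand-in for the kernel's CSS_ONLINE bit */

struct toy_css {
	unsigned int flags;
	bool child_zeroed;		/* set once the offline callback has "reparented" */
	bool stale_commit_allowed;	/* set if the online check let a stale batch through */
};

bool toy_css_online(const struct toy_css *css)
{
	return (css->flags & TOY_CSS_ONLINE) != 0;
}

/* Stand-in for ss->css_offline(css): reparent and zero the child. */
void toy_css_offline_cb(struct toy_css *css)
{
	css->child_zeroed = true;
}

/* Walker side with the proposed check: discard the batch only if offline. */
void toy_walker_commit(struct toy_css *css)
{
	if (!toy_css_online(css))
		return;				/* batch discarded */
	if (css->child_zeroed)
		css->stale_commit_allowed = true;	/* check did not help */
}

/* Same ordering as offline_css(): callback first, flag cleared afterwards. */
void toy_offline_css(struct toy_css *css, void (*in_window)(struct toy_css *))
{
	toy_css_offline_cb(css);
	if (in_window)
		in_window(css);			/* walker runs in the window */
	css->flags &= ~TOY_CSS_ONLINE;
}

/* Returns true if a walker in the window still commits a stale batch. */
bool toy_online_check_is_racy(void)
{
	struct toy_css css = { .flags = TOY_CSS_ONLINE };

	toy_offline_css(&css, toy_walker_commit);
	assert(!toy_css_online(&css));	/* offline only after the window */
	return css.stale_commit_allowed;
}
```

A walker that runs after toy_offline_css() has returned is filtered out
by the check; one that runs inside the window is not, which is the case
the proposed check misses.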

Thanks,
Qi

> +
>           WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
>                  lrugen->nr_pages[gen][type][zone] + delta);
> 
>           if (lru_gen_is_active(lruvec, gen))
>               lru += LRU_ACTIVE;
>           __update_lru_size(lruvec, lru, zone, delta);
>       }
>   }
> 
> Thanks

Thread overview: 5+ messages
2026-06-21 13:50 [BUG] mm: mglru: stale aging batch triggers lru_gen_exit_memcg warning Peiyang He
2026-06-22  3:12 ` Qi Zheng [this message]
2026-06-22  7:37 ` [PATCH] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
2026-06-22  8:24   ` Peiyang He
2026-06-22  8:31     ` Qi Zheng
