[BUG] mm: mglru: stale aging batch triggers lru_gen_exit_memcg warning

Linux cgroups development

* [BUG] mm: mglru: stale aging batch triggers lru_gen_exit_memcg warning
@ 2026-06-21 13:50 Peiyang He
  2026-06-22  3:12 ` Qi Zheng
  0 siblings, 1 reply; 2+ messages in thread
From: Peiyang He @ 2026-06-21 13:50 UTC
  To: akpm, hannes, linux-mm
  Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, qi.zheng,
	kasong, baohua, axelrasmussen, yuanchu, weixugc, david, ljs,
	cgroups, linux-kernel, syzkaller

[-- Attachment #1: Type: text/plain, Size: 9372 bytes --]

Hello,

I hit the following warning while fuzzing other kernel code with Syzkaller.

The original Syzkaller report:

WARNING: mm/vmscan.c:5867 at lru_gen_exit_memcg+0x26f/0x300 mm/vmscan.c:5867, CPU#0: kworker/0:0/9
Modules linked in:
CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 7.1.0 #2 PREEMPT(full)
Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: cgroup_free css_free_rwork_fn
RIP: 0010:lru_gen_exit_memcg+0x26f/0x300 mm/vmscan.c:5867
Code: 89 de e8 d4 62 ba ff 49 83 fd 3f 0f 86 9c fe ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f e9 17 68 ba ff e8 12 68 ba ff 90 <0f> 0b 90 e9 b0 fe ff ff e8 04 68 ba ff 66 90 e8 fd 67 ba ff 90 0f
RSP: 0018:ffffc900001afb78 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82049e88
RDX: ffff888016f35c40 RSI: ffffffff8204a02e RDI: ffff88801d4103b8
RBP: dffffc0000000000 R08: 0000000000000005 R09: 0000000000000040
R10: 0000000000000000 R11: 0000000000002ba4 R12: ffff8880481f1600
R13: ffff88801d410650 R14: ffff88801d410040 R15: dead000000000100
FS:  0000000000000000(0000) GS:ffff888098d91000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055ac6490c1d8 CR3: 00000000249b0000 CR4: 0000000000350ef0
Call Trace:
  <TASK>
  mem_cgroup_free mm/memcontrol.c:3972 [inline]
  mem_cgroup_css_free+0x76/0xb0 mm/memcontrol.c:4241
  css_free_rwork_fn+0x125/0x1260 kernel/cgroup/cgroup.c:5575
  process_one_work+0xa0d/0x1c30 kernel/workqueue.c:3314
  process_scheduled_works kernel/workqueue.c:3397 [inline]
  worker_thread+0x645/0xe80 kernel/workqueue.c:3478
  kthread+0x367/0x480 kernel/kthread.c:436
  ret_from_fork+0x72b/0xd50 arch/x86/kernel/process.c:158
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
  </TASK>

Kernel version: commit 8cd9520d35a6c38db6567e97dd93b1f11f185dc6 (tag v7.1)

Relevant kernel config:

   CONFIG_MEMCG=y
   CONFIG_LRU_GEN=y
   CONFIG_LRU_GEN_ENABLED=y
   CONFIG_LRU_GEN_WALKS_MMU=y
   CONFIG_NUMA=y

Root Cause:

The bug is a race between two code paths that each hold
`lruvec->lru_lock`, but at non-overlapping times.

Component 1 - `reset_batch_size()`:

During `walk_mm()`, `update_batch_size()` accumulates per-generation page
deltas into `walk->nr_pages` WITHOUT holding `lruvec_lock`.  After
`mmap_read_unlock(mm)`, the walker reacquires `lruvec_lock` and
`reset_batch_size()` writes those deltas UNCONDITIONALLY into
`lrugen->nr_pages`.

Component 2 - `lru_gen_reparent_memcg()`:

When a memcg is offlined, `lru_gen_reparent_memcg()` moves all folios to
the parent lruvec and zeros the child's `lrugen->nr_pages`, all under
`lruvec_lock`.

I have not bisected the issue.  Based on code inspection, the important
interaction appears to be the reparenting path that clears the child's
`nr_pages` while `reset_batch_size()` can still commit a batch that was
generated before the memcg went offline.  This looks related to
f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios").

Race sequence:

     1. The aging path enters walk_mm() for the child memcg lruvec.

     2. walk_page_range() scans PTEs and update_batch_size() stores deltas
        in walk->nr_pages.  At this point the deltas have not been
        committed to lruvec->lrugen.nr_pages yet.

     3. walk_mm() drops mmap_read_lock(mm).  Before it reaches
        reset_batch_size(), the child memcg is killed and removed.

     4. The memcg offline path runs lru_gen_reparent_memcg().  Under
        lruvec_lock, it moves the child folios to the parent and clears the
        child's lrugen.nr_pages.

     5. The old aging walk resumes, takes lruvec_lock, and
        reset_batch_size() writes the stale walk->nr_pages deltas back into
        the original child lruvec.

     6. Later, lru_gen_exit_memcg(child) checks the child's lrugen.nr_pages
        with memchr_inv(...).  Since the stale batch made some slots
        non-zero again, VM_WARN_ON_ONCE() triggers.
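
The six steps above can be replayed deterministically in user space.  What
follows is a minimal single-threaded sketch, not kernel code: the
`lruvec_model` and `walk_model` structures, every `model_*` helper and the
array dimensions are hypothetical stand-ins, and locking is left out because
the steps are executed strictly in the order of the race.  Folding the child
counters into the parent is an assumption about what reparenting does to the
parent side; only the zeroing of the child matters for the warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative dimensions only; the kernel's real values depend on config. */
#define NR_GENS  4
#define NR_TYPES 2
#define NR_ZONES 3

/* Hypothetical stand-in for lruvec->lrugen.nr_pages. */
struct lruvec_model {
	long nr_pages[NR_GENS][NR_TYPES][NR_ZONES];
};

/* Hypothetical stand-in for the walker's private batch, walk->nr_pages. */
struct walk_model {
	int nr_pages[NR_GENS][NR_TYPES][NR_ZONES];
};

/* Step 2: record that nr pages moved from old_gen to new_gen (no lock). */
static void model_update_batch(struct walk_model *walk, int old_gen,
			       int new_gen, int type, int zone, int nr)
{
	assert(old_gen < NR_GENS && new_gen < NR_GENS);
	walk->nr_pages[old_gen][type][zone] -= nr;
	walk->nr_pages[new_gen][type][zone] += nr;
}

/* Step 5: commit the private batch into the lruvec counters. */
static void model_reset_batch(struct lruvec_model *lruvec,
			      struct walk_model *walk)
{
	for (int gen = 0; gen < NR_GENS; gen++)
		for (int type = 0; type < NR_TYPES; type++)
			for (int zone = 0; zone < NR_ZONES; zone++) {
				lruvec->nr_pages[gen][type][zone] +=
					walk->nr_pages[gen][type][zone];
				walk->nr_pages[gen][type][zone] = 0;
			}
}

/* Step 4: fold the child counters into the parent, then zero the child. */
static void model_reparent(struct lruvec_model *child,
			   struct lruvec_model *parent)
{
	for (int gen = 0; gen < NR_GENS; gen++)
		for (int type = 0; type < NR_TYPES; type++)
			for (int zone = 0; zone < NR_ZONES; zone++)
				parent->nr_pages[gen][type][zone] +=
					child->nr_pages[gen][type][zone];
	memset(child->nr_pages, 0, sizeof(child->nr_pages));
}

/* Step 6: same question memchr_inv() answers: is any slot non-zero? */
static bool model_has_nonzero(const struct lruvec_model *lruvec)
{
	for (int gen = 0; gen < NR_GENS; gen++)
		for (int type = 0; type < NR_TYPES; type++)
			for (int zone = 0; zone < NR_ZONES; zone++)
				if (lruvec->nr_pages[gen][type][zone])
					return true;
	return false;
}

/*
 * Replay the sequence.  Returns true if the exiting child would warn.
 * commit_before_reparent selects the benign ordering for comparison.
 */
static bool model_replay(bool commit_before_reparent)
{
	struct lruvec_model child = {0}, parent = {0};
	struct walk_model walk = {0};

	child.nr_pages[0][1][0] = 8;			/* 8 pages in gen 0 */
	model_update_batch(&walk, 0, 1, 1, 0, 8);	/* steps 1-2 */

	if (commit_before_reparent)
		model_reset_batch(&child, &walk);	/* benign order */
	model_reparent(&child, &parent);		/* steps 3-4 */
	if (!commit_before_reparent)
		model_reset_batch(&child, &walk);	/* step 5: stale */

	return model_has_nonzero(&child);		/* step 6 */
}
```

In this model `model_replay(false)` leaves the child with a negative count in
one generation slot and a positive count in another, which is the state the
exit-time check rejects, while `model_replay(true)` leaves the child clean.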

The two critical sections are serialized by `lruvec_lock`, but the batch
accumulation in `walk->nr_pages` happens outside that lock, so there is no
ordering between the accumulation and the reparenting zeroing.

The relevant code path:

   mm/vmscan.c:
     run_cmd('+')              selects the target memcg and child lruvec
     try_to_inc_max_seq()      stores the child lruvec in walk->lruvec
     update_batch_size()       accumulates deltas in walk->nr_pages
     walk_mm()                 calls walk_page_range(), then later
                               reset_batch_size()
     reset_batch_size()        writes cached deltas into
                               walk->lruvec->lrugen.nr_pages
     lru_gen_reparent_memcg()  reparents child MGLRU state and clears
                               child nr_pages
     lru_gen_exit_memcg()      warns if the exiting memcg has non-zero
                               nr_pages

   mm/memcontrol.c:
     mem_cgroup_css_offline()  calls memcg_reparent_objcgs() and
                               lru_gen_offline_memcg()
     mem_cgroup_free()         calls lru_gen_exit_memcg()

Reproducer:

The C reproducer and the helper script for running it are provided in the
attachments.

The PoC creates a leaf memory cgroup, moves a victim process into it, 
and makes the victim fault and continuously touch file-backed pages so 
MGLRU aging can produce cached generation deltas for that memcg. A 
separate `lru_ager` thread repeatedly writes aging commands to 
`/sys/kernel/debug/lru_gen`; when the instrumentation reports that the 
ager is delayed just before `reset_batch_size()`, the PoC kills the 
victim and removes the leaf cgroup, forcing memcg offline/reparenting 
before the stale batch is committed.
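
The aging commands mentioned above are single text lines written to the
debugfs file.  The sketch below mirrors the `snprintf()` format string used
by the attached PoC; `format_aging_cmd` is a hypothetical helper name.  The
two trailing `1` values are copied from the PoC as-is.  Reading them as the
optional swap and force-scan flags of the `+` command is an assumption based
on the MGLRU admin guide, not something this report states.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build one "+ memcg_id node_id max_seq 1 1" aging
 * command in buf.  Returns the command length in bytes, or -1 if buf is
 * too small to hold the whole line.
 */
static int format_aging_cmd(char *buf, size_t len, unsigned long memcg_id,
			    int node, unsigned long max_seq)
{
	int n = snprintf(buf, len, "+ %lu %d %lu 1 1\n",
			 memcg_id, node, max_seq);

	if (n < 0 || (size_t)n >= len)
		return -1;	/* output error or truncated command */
	return n;
}
```

Checking for truncation matters here because a cut-off command would be
parsed by the kernel as a different, shorter command rather than rejected
on the user-space side.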

The helper script builds the PoC, creates a temporary qcow2 overlay, 
boots the instrumented kernel in QEMU with fake NUMA and SSH port 
forwarding, copies the PoC into the guest, runs it, and scans the serial 
console for `exit_nonzero`, `WARNING: mm/vmscan.c`, or `Kernel panic`. 
It writes the full serial console, extracted kernel events, and guest 
stdout/stderr under the chosen output directory.
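
The console scan described above amounts to a substring match on three
markers.  The helper script itself is a shell script and is not reproduced
in this thread, so the following is only a C sketch of the same
classification; the enum, the function name and the priority order between
markers are assumptions made for illustration.

```c
#include <string.h>

/* Hypothetical event classes for one line of serial console output. */
enum console_event {
	EVENT_NONE,
	EVENT_EXIT_NONZERO,	/* instrumentation dump before the warning */
	EVENT_VMSCAN_WARNING,	/* the WARNING line itself */
	EVENT_KERNEL_PANIC,
};

/*
 * Classify one console line by the three markers named in the report.
 * A panic is checked first so that it is never hidden by another marker
 * on the same line; that ordering is a choice made for this sketch.
 */
static enum console_event classify_console_line(const char *line)
{
	if (strstr(line, "Kernel panic"))
		return EVENT_KERNEL_PANIC;
	if (strstr(line, "WARNING: mm/vmscan.c"))
		return EVENT_VMSCAN_WARNING;
	if (strstr(line, "exit_nonzero"))
		return EVENT_EXIT_NONZERO;
	return EVENT_NONE;
}
```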

The example command:

   ./repros/lru_gen_exit_memcg/run_poc_qemu.sh /tmp/lru_gen_poc_manual 10450 20 32

The arguments are:

   /tmp/lru_gen_poc_manual  output directory for the overlay, console log,
                            extracted events and guest log
   10450                    host TCP port forwarded to guest SSH
   20                       number of PoC iterations to run
   32                       file-backed working-set size in MiB per
                            iteration

The script uses default `KERNEL`, `IMAGE` and `SSH_KEY` paths, or they can
be overridden with environment variables.

Since this bug requires a specific race window, kernel instrumentation is
needed to enlarge the race window in order to reproduce the bug more
reliably.  The instrumentation patch is also included in the attachments.

The patch only instruments `mm/vmscan.c`: it delays the PoC aging task just
before `reset_batch_size()`, logs when a stale batch is written into an
already offlined and zeroed memcg lruvec, and dumps the non-zero
`lrugen.nr_pages` slots before `lru_gen_exit_memcg()` triggers the warning.

A successful run reports `status=repro_triggered`, and the extracted events
include a warning like:

   WARNING: mm/vmscan.c:5943 at lru_gen_exit_memcg+0x420/0x520

Proposed Fix:

One possible fix direction is to make `reset_batch_size()` skip writing
back the stale delta when the memcg is no longer online.
`reset_batch_size()` is called under `lruvec_lock`, the same lock that
`lru_gen_reparent_memcg()` holds when it zeroes `nr_pages`, so this should
avoid committing a batch after reparenting has completed.

Possible fix direction, not a tested patch:

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -... reset_batch_size() ...
  static void reset_batch_size(struct lru_gen_mm_walk *walk)
  {
      int gen, type, zone;
      struct lruvec *lruvec = walk->lruvec;
      struct lru_gen_folio *lrugen = &lruvec->lrugen;
+    struct mem_cgroup *memcg = lruvec_memcg(lruvec);

      walk->batched = 0;

      for_each_gen_type_zone(gen, type, zone) {
          enum lru_list lru = type * LRU_INACTIVE_FILE;
          int delta = walk->nr_pages[gen][type][zone];

          if (!delta)
              continue;

          walk->nr_pages[gen][type][zone] = 0;
+
+        /*
+         * If the memcg went offline while we were walking page tables,
+         * lru_gen_reparent_memcg() has already zeroed nr_pages and moved
+         * all folios to the parent.  Writing our stale batch delta back
+         * would corrupt the offline child and trigger WARN_ON in
+         * lru_gen_exit_memcg().  Discard the delta; the parent lruvec
+         * already owns the pages and accounts for them correctly.
+         */
+        if (memcg && !mem_cgroup_online(memcg))
+            continue;
+
          WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
                 lrugen->nr_pages[gen][type][zone] + delta);

          if (lru_gen_is_active(lruvec, gen))
              lru += LRU_ACTIVE;
          __update_lru_size(lruvec, lru, zone, delta);
      }
  }

Thanks

[-- Attachment #2: poc_lru_race.c --]
[-- Type: text/plain, Size: 10419 bytes --]

/*
 * Minimal MGLRU memcg reparent race PoC.
 *
 * This program expects the companion instrumentation patch to add a short
 * delay before reset_batch_size() for cgroups named /lru_gen_race_* and to log
 * "delay_before_reset".  The program waits for that log line, tears down the
 * target memcg, and lets the stale MGLRU batch commit into the offlined child.
 */

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define CGROUP_ROOT "/sys/fs/cgroup"
#define LRU_GEN_FILE "/sys/kernel/debug/lru_gen"
#define PAGE_BYTES 4096UL
#define MAX_NODES 8
#define DEFAULT_ITERS 20
#define DEFAULT_FILE_MIB 32
#define WINDOW_TIMEOUT_MS 90000
#define CONFIRM_TIMEOUT_MS 5000

enum {
	PHASE_IDLE = 0,
	PHASE_WINDOW = 1,
	PHASE_DONE = 2,
};

/* LruInfo stores the debugfs MGLRU ids needed to issue aging commands. */
struct lru_info {
	unsigned long memcg_id; /* Memcg id accepted by /sys/kernel/debug/lru_gen. */
	int nr_nodes; /* Number of NUMA nodes parsed for this memcg. */
	int nodes[MAX_NODES]; /* Node ids parsed from /sys/kernel/debug/lru_gen. */
	unsigned long max_seq[MAX_NODES]; /* Latest generation sequence for each node. */
};

/* RaceState is shared by the ager, kmsg reader, and main iteration. */
struct race_state {
	char leaf[1024]; /* Absolute cgroup path for the current leaf. */
	char leaf_rel[1024]; /* Cgroup path as printed by cgroup_path(). */
	char file_path[512]; /* File mapped by the victim process. */
	pid_t victim; /* Victim pid charged to the leaf memcg. */
	atomic_int phase; /* Current synchronization phase. */
	int iter; /* Iteration index used only for concise progress output. */
};

/* Die prints a syscall failure and exits the process. */
static void die(const char *what)
{
	perror(what);
	exit(1);
}

/* WriteFile writes a short string into a sysfs/cgroupfs control file. */
static int write_file(const char *path, const char *value)
{
	int fd = open(path, O_WRONLY | O_CLOEXEC);

	if (fd < 0)
		return -1;

	ssize_t ret = write(fd, value, strlen(value));
	int saved_errno = errno;

	close(fd);
	errno = saved_errno;
	return ret < 0 ? -1 : 0;
}

/* MkdirIfMissing creates a cgroup directory if it is not already present. */
static int mkdir_if_missing(const char *path)
{
	if (!mkdir(path, 0755) || errno == EEXIST)
		return 0;

	return -1;
}

/* EnableMemoryController enables memory accounting below a cgroup. */
static void enable_memory_controller(const char *cg)
{
	char path[640];

	snprintf(path, sizeof(path), "%s/cgroup.subtree_control", cg);
	(void)write_file(path, "+memory");
}

/* MovePid moves a process into the target cgroup. */
static int move_pid(const char *cg, pid_t pid)
{
	char path[640];
	char value[32];

	snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
	snprintf(value, sizeof(value), "%d", (int)pid);
	return write_file(path, value);
}

/* RmdirRetry removes a cgroup after css teardown has made it removable. */
static int rmdir_retry(const char *path)
{
	for (int i = 0; i < 600; i++) {
		if (!rmdir(path))
			return 0;
		if (errno != EBUSY && errno != EINVAL)
			return -1;
		usleep(5000);
	}

	return -1;
}

/* WaitPhase waits until the shared phase reaches the requested value. */
static bool wait_phase(struct race_state *st, int want, int timeout_ms)
{
	for (int i = 0; i < timeout_ms; i++) {
		if (atomic_load(&st->phase) >= want)
			return true;
		usleep(1000);
	}

	return false;
}

/* VictimMain faults file-backed pages after it has been moved into the leaf. */
static void victim_main(int start_fd, int ready_fd, const char *path, size_t bytes)
{
	char ch;

	if (read(start_fd, &ch, 1) != 1)
		_exit(10);
	close(start_fd);

	int fd = open(path, O_CREAT | O_TRUNC | O_RDWR | O_CLOEXEC, 0600);

	if (fd < 0)
		_exit(11);
	if (ftruncate(fd, (off_t)bytes))
		_exit(12);

	volatile uint8_t *mapping = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
					 MAP_SHARED, fd, 0);
	if (mapping == MAP_FAILED)
		_exit(13);
	close(fd);

	for (size_t off = 0; off < bytes; off += PAGE_BYTES)
		mapping[off] = (uint8_t)(off >> 12);

	ssize_t ready_ret = write(ready_fd, "R", 1);

	(void)ready_ret;
	close(ready_fd);

	for (uint8_t seed = 1;; seed++) {
		for (size_t off = 0; off < bytes; off += PAGE_BYTES)
			mapping[off] ^= seed;
	}
}

/* ReadLruInfo parses the target memcg section from /sys/kernel/debug/lru_gen. */
static int read_lru_info(const char *leaf_rel, struct lru_info *info)
{
	FILE *file = fopen(LRU_GEN_FILE, "r");
	char *line = NULL;
	size_t cap = 0;
	bool in_target = false;
	int current = -1;
	int ret = -1;

	if (!file)
		return -1;

	memset(info, 0, sizeof(*info));

	while (getline(&line, &cap, file) > 0) {
		unsigned long id;
		unsigned long seq;
		char path[1024];
		int node;

		if (sscanf(line, " memcg %lu %1023s", &id, path) == 2) {
			in_target = !strcmp(path, leaf_rel);
			current = -1;
			if (in_target) {
				info->memcg_id = id;
				ret = 0;
			}
			continue;
		}

		if (!in_target)
			continue;

		if (sscanf(line, " node %d", &node) == 1) {
			if (info->nr_nodes >= MAX_NODES)
				continue;
			current = info->nr_nodes++;
			info->nodes[current] = node;
			continue;
		}

		if (current >= 0 && sscanf(line, " %lu", &seq) == 1 &&
		    seq > info->max_seq[current])
			info->max_seq[current] = seq;
	}

	free(line);
	fclose(file);
	return ret;
}

/* AgerThread repeatedly asks MGLRU debugfs to age the target memcg. */
static void *ager_thread(void *arg)
{
	struct race_state *st = arg;
	int fd;

	prctl(PR_SET_NAME, "lru_ager", 0, 0, 0);

	fd = open(LRU_GEN_FILE, O_WRONLY | O_CLOEXEC);
	if (fd < 0)
		return NULL;

	while (atomic_load(&st->phase) < PHASE_WINDOW) {
		struct lru_info info;

		if (read_lru_info(st->leaf_rel, &info) || !info.memcg_id)
			break;

		for (int i = 0; i < info.nr_nodes; i++) {
			char cmd[128];

			if (atomic_load(&st->phase) >= PHASE_WINDOW)
				break;

			snprintf(cmd, sizeof(cmd), "+ %lu %d %lu 1 1\n",
				 info.memcg_id, info.nodes[i], info.max_seq[i]);
			ssize_t write_ret = write(fd, cmd, strlen(cmd));

			(void)write_ret;
		}
	}

	close(fd);
	return NULL;
}

/* KmsgThread watches for the instrumentation lines used for synchronization. */
static void *kmsg_thread(void *arg)
{
	struct race_state *st = arg;
	char buf[8192];
	int fd = open("/dev/kmsg", O_RDONLY | O_NONBLOCK | O_CLOEXEC);

	if (fd < 0)
		return NULL;

	lseek(fd, 0, SEEK_END);

	while (atomic_load(&st->phase) < PHASE_DONE) {
		ssize_t len = read(fd, buf, sizeof(buf) - 1);
		const char *msg;
		bool ours;

		if (len <= 0) {
			usleep(500);
			continue;
		}

		buf[len] = '\0';
		msg = strchr(buf, ';');
		if (msg)
			msg++;
		else
			msg = buf;

		ours = strstr(msg, st->leaf_rel) != NULL;
		if (ours && strstr(msg, "delay_before_reset")) {
			int idle = PHASE_IDLE;

			atomic_compare_exchange_strong(&st->phase, &idle, PHASE_WINDOW);
		}

		if ((ours && strstr(msg, "exit_nonzero")) ||
		    (atomic_load(&st->phase) >= PHASE_WINDOW &&
		     strstr(msg, "WARNING: mm/vmscan.c")))
			atomic_store(&st->phase, PHASE_DONE);
	}

	close(fd);
	return NULL;
}

/* RunIteration creates one child memcg and races its teardown against aging. */
static bool run_iteration(const char *base, int iter, size_t file_mib)
{
	struct race_state st;
	int start_pipe[2];
	int ready_pipe[2];
	pthread_t ager;
	pthread_t kmsg;
	bool got_window;
	bool confirmed = false;

	memset(&st, 0, sizeof(st));
	st.iter = iter;
	snprintf(st.leaf, sizeof(st.leaf), "%s/leaf_%03d", base, iter);
	snprintf(st.leaf_rel, sizeof(st.leaf_rel), "%s/leaf_%03d",
		 base + strlen(CGROUP_ROOT), iter);
	snprintf(st.file_path, sizeof(st.file_path), "/root/lru_race_%d_%03d.dat",
		 getpid(), iter);
	atomic_store(&st.phase, PHASE_IDLE);

	if (mkdir_if_missing(st.leaf))
		return false;
	if (pipe(start_pipe) || pipe(ready_pipe))
		die("pipe");

	st.victim = fork();
	if (st.victim < 0)
		die("fork");	/* never pass -1 to kill(): it signals every process */
	if (!st.victim) {
		close(start_pipe[1]);
		close(ready_pipe[0]);
		victim_main(start_pipe[0], ready_pipe[1], st.file_path,
			    file_mib << 20);
		_exit(0);
	}

	close(start_pipe[0]);
	close(ready_pipe[1]);

	if (move_pid(st.leaf, st.victim)) {
		ssize_t start_ret = write(start_pipe[1], "g", 1);

		(void)start_ret;
		close(start_pipe[1]);
		kill(st.victim, SIGKILL);
		waitpid(st.victim, NULL, 0);
		rmdir_retry(st.leaf);
		return false;
	}

	ssize_t start_ret = write(start_pipe[1], "g", 1);

	(void)start_ret;
	close(start_pipe[1]);

	char ready;
	ssize_t ready_ret = read(ready_pipe[0], &ready, 1);

	(void)ready_ret;
	close(ready_pipe[0]);

	for (int retry = 0; retry < 400; retry++) {
		struct lru_info info;

		if (!read_lru_info(st.leaf_rel, &info) && info.memcg_id &&
		    info.nr_nodes > 0)
			break;
		usleep(5000);
	}

	pthread_create(&kmsg, NULL, kmsg_thread, &st);
	pthread_create(&ager, NULL, ager_thread, &st);

	got_window = wait_phase(&st, PHASE_WINDOW, WINDOW_TIMEOUT_MS);
	if (got_window) {
		kill(st.victim, SIGKILL);
		waitpid(st.victim, NULL, 0);
		st.victim = 0;
		rmdir_retry(st.leaf);
		confirmed = wait_phase(&st, PHASE_DONE, CONFIRM_TIMEOUT_MS);
	}

	atomic_store(&st.phase, PHASE_DONE);
	pthread_join(ager, NULL);
	pthread_join(kmsg, NULL);

	if (st.victim) {
		kill(st.victim, SIGKILL);
		waitpid(st.victim, NULL, 0);
	}
	rmdir_retry(st.leaf);
	unlink(st.file_path);

	printf("iter %d: %s\n", iter, confirmed ? "confirmed" :
	       got_window ? "window-only" : "miss");
	return confirmed;
}

/* Main prepares cgroup/debugfs state and runs bounded race attempts. */
int main(int argc, char **argv)
{
	int iters = argc > 1 ? atoi(argv[1]) : DEFAULT_ITERS;
	size_t file_mib = argc > 2 ? strtoul(argv[2], NULL, 0) : DEFAULT_FILE_MIB;
	char base[512];
	int confirmed = 0;

	if (geteuid()) {
		fprintf(stderr, "must run as root\n");
		return 1;
	}

	if (mount("debugfs", "/sys/kernel/debug", "debugfs", 0, NULL) && errno != EBUSY)
		perror("mount debugfs");

	enable_memory_controller(CGROUP_ROOT);
	snprintf(base, sizeof(base), CGROUP_ROOT "/lru_gen_race_%d", getpid());
	if (mkdir_if_missing(base))
		die("mkdir base cgroup");
	enable_memory_controller(base);

	for (int i = 0; i < iters; i++) {
		if (run_iteration(base, i, file_mib))
			confirmed++;
	}

	rmdir_retry(base);
	printf("confirmed=%d/%d\n", confirmed, iters);
	return confirmed ? 0 : 1;
}

[-- Attachment #3: lru_gen_exit_memcg.patch --]
[-- Type: text/plain, Size: 4169 bytes --]

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..6206ce41de3b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3265,24 +3265,96 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
 	walk->nr_pages[new_gen][type][zone] += delta;
 }
 
+/* Return whether any MGLRU size slot is still charged. */
+static bool lru_gen_has_nr_pages(struct lruvec *lruvec)
+{
+	int gen, type, zone;
+	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+
+	for_each_gen_type_zone(gen, type, zone) {
+		if (READ_ONCE(lrugen->nr_pages[gen][type][zone]))
+			return true;
+	}
+
+	return false;
+}
+
+/* Dump nonzero MGLRU size slots for the target memcg. */
+static void lru_gen_dump_nr_pages(const char *tag, struct mem_cgroup *memcg,
+				  int nid, struct lruvec *lruvec, bool show_path)
+{
+	int gen, type, zone;
+	char path[256] = "";
+	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+
+	if (show_path && memcg)
+		cgroup_path(memcg->css.cgroup, path, sizeof(path));
+
+	pr_warn("lru_gen_debug: %s task=%s/%d memcg=%llu online=%d dying=%d nid=%d path=%s\n",
+		tag, current->comm, task_pid_nr(current), mem_cgroup_id(memcg),
+		memcg ? mem_cgroup_online(memcg) : 1, memcg_is_dying(memcg),
+		nid, path);
+
+	for_each_gen_type_zone(gen, type, zone) {
+		long nr_pages = READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+		if (!nr_pages)
+			continue;
+
+		pr_warn("lru_gen_debug: %s slot memcg=%llu nid=%d gen=%d type=%d zone=%d nr=%ld list_empty=%d\n",
+			tag, mem_cgroup_id(memcg), nid, gen, type, zone,
+			nr_pages, list_empty(&lrugen->folios[gen][type][zone]));
+	}
+}
+
+/* Delay the PoC aging task so memcg offline can race with batch reset. */
+static void lru_gen_delay_test_reset(struct lruvec *lruvec)
+{
+	char path[128] = "";
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	if (!memcg || strcmp(current->comm, "lru_ager"))
+		return;
+
+	cgroup_path(memcg->css.cgroup, path, sizeof(path));
+	if (!str_has_prefix(path, "/lru_gen_race_"))
+		return;
+
+	pr_warn("lru_gen_debug: delay_before_reset task=%s/%d memcg=%llu online=%d dying=%d path=%s\n",
+		current->comm, task_pid_nr(current), mem_cgroup_id(memcg),
+		mem_cgroup_online(memcg), memcg_is_dying(memcg), path);
+	msleep(3000);
+}
+
 static void reset_batch_size(struct lru_gen_mm_walk *walk)
 {
 	int gen, type, zone;
 	struct lruvec *lruvec = walk->lruvec;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	bool offline = memcg && (!mem_cgroup_online(memcg) || memcg_is_dying(memcg));
+	bool zeroed = offline && !lru_gen_has_nr_pages(lruvec);
 
 	walk->batched = 0;
 
 	for_each_gen_type_zone(gen, type, zone) {
 		enum lru_list lru = type * LRU_INACTIVE_FILE;
 		int delta = walk->nr_pages[gen][type][zone];
+		long old;
 
 		if (!delta)
 			continue;
 
 		walk->nr_pages[gen][type][zone] = 0;
-		WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
-			   lrugen->nr_pages[gen][type][zone] + delta);
+		old = READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+		WRITE_ONCE(lrugen->nr_pages[gen][type][zone], old + delta);
+
+		if (zeroed)
+			pr_warn("lru_gen_debug: reset_batch_to_zeroed_offline task=%s/%d memcg=%llu online=%d dying=%d nid=%d seq=%lu gen=%d type=%d zone=%d delta=%d old=%ld new=%ld\n",
+				current->comm, task_pid_nr(current), mem_cgroup_id(memcg),
+				mem_cgroup_online(memcg), memcg_is_dying(memcg),
+				lruvec_pgdat(lruvec)->node_id, walk->seq, gen, type,
+				zone, delta, old, old + delta);
 
 		if (lru_gen_is_active(lruvec, gen))
 			lru += LRU_ACTIVE;
@@ -3783,6 +3855,7 @@ static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
 		}
 
 		if (walk->batched) {
+			lru_gen_delay_test_reset(lruvec);
 			lruvec_lock_irq(lruvec);
 			reset_batch_size(walk);
 			lruvec_unlock_irq(lruvec);
@@ -5864,6 +5937,9 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg)
 		struct lruvec *lruvec = get_lruvec(memcg, nid);
 		struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
 
+		if (lru_gen_has_nr_pages(lruvec))
+			lru_gen_dump_nr_pages("exit_nonzero", memcg, nid, lruvec, true);
+
 		VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
 					   sizeof(lruvec->lrugen.nr_pages)));
 

[-- Attachment #4: run_poc_qemu.sh --]
[-- Type: application/x-sh, Size: 3160 bytes --]


* Re: [BUG] mm: mglru: stale aging batch triggers lru_gen_exit_memcg warning
  2026-06-21 13:50 [BUG] mm: mglru: stale aging batch triggers lru_gen_exit_memcg warning Peiyang He
@ 2026-06-22  3:12 ` Qi Zheng
  0 siblings, 0 replies; 2+ messages in thread
From: Qi Zheng @ 2026-06-22  3:12 UTC
  To: Peiyang He, akpm, hannes, linux-mm
  Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, kasong, baohua,
	axelrasmussen, yuanchu, weixugc, david, ljs, cgroups,
	linux-kernel, syzkaller

Hi Peiyang,

Thanks for reporting this issue!

On 6/21/26 9:50 PM, Peiyang He wrote:
> Hello,
> 
> I hit the following warning while fuzzing other kernel code with Syzkaller.
> 
> The original Syzkaller report:
> 
> WARNING: mm/vmscan.c:5867 at lru_gen_exit_memcg+0x26f/0x300 mm/vmscan.c:5867, CPU#0: kworker/0:0/9
> Modules linked in:
> CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 7.1.0 #2 PREEMPT(full)
> Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> Workqueue: cgroup_free css_free_rwork_fn
> RIP: 0010:lru_gen_exit_memcg+0x26f/0x300 mm/vmscan.c:5867
> Code: 89 de e8 d4 62 ba ff 49 83 fd 3f 0f 86 9c fe ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f e9 17 68 ba ff e8 12 68 ba ff 90 <0f> 0b 90 e9 b0 fe ff ff e8 04 68 ba ff 66 90 e8 fd 67 ba ff 90 0f
> RSP: 0018:ffffc900001afb78 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82049e88
> RDX: ffff888016f35c40 RSI: ffffffff8204a02e RDI: ffff88801d4103b8
> RBP: dffffc0000000000 R08: 0000000000000005 R09: 0000000000000040
> R10: 0000000000000000 R11: 0000000000002ba4 R12: ffff8880481f1600
> R13: ffff88801d410650 R14: ffff88801d410040 R15: dead000000000100
> FS:  0000000000000000(0000) GS:ffff888098d91000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055ac6490c1d8 CR3: 00000000249b0000 CR4: 0000000000350ef0
> Call Trace:
>   <TASK>
>   mem_cgroup_free mm/memcontrol.c:3972 [inline]
>   mem_cgroup_css_free+0x76/0xb0 mm/memcontrol.c:4241
>   css_free_rwork_fn+0x125/0x1260 kernel/cgroup/cgroup.c:5575
>   process_one_work+0xa0d/0x1c30 kernel/workqueue.c:3314
>   process_scheduled_works kernel/workqueue.c:3397 [inline]
>   worker_thread+0x645/0xe80 kernel/workqueue.c:3478
>   kthread+0x367/0x480 kernel/kthread.c:436
>   ret_from_fork+0x72b/0xd50 arch/x86/kernel/process.c:158
>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>   </TASK>
> 
> Kernel version: commit 8cd9520d35a6c38db6567e97dd93b1f11f185dc6 (tag v7.1)
> 
> Relevant kernel config:
> 
>    CONFIG_MEMCG=y
>    CONFIG_LRU_GEN=y
>    CONFIG_LRU_GEN_ENABLED=y
>    CONFIG_LRU_GEN_WALKS_MMU=y
>    CONFIG_NUMA=y
> 
> Root Cause:
> 
> The bug is a race between two code paths that each hold
> `lruvec->lru_lock`, but at non-overlapping times.
> 
> Component 1 - `reset_batch_size()`:
> 
> During `walk_mm()`, `update_batch_size()` accumulates per-generation page
> deltas into `walk->nr_pages` WITHOUT holding `lruvec_lock`.  After
> `mmap_read_unlock(mm)`, the walker reacquires `lruvec_lock` and
> `reset_batch_size()` writes those deltas UNCONDITIONALLY into
> `lrugen->nr_pages`.
> 
> Component 2 - `lru_gen_reparent_memcg()`:
> 
> When a memcg is offlined, `lru_gen_reparent_memcg()` moves all folios to
> the parent lruvec and zeros the child's `lrugen->nr_pages`, all under
> `lruvec_lock`.
> 
> I have not bisected the issue.  Based on code inspection, the important
> interaction appears to be the reparenting path that clears the child's
> `nr_pages` while `reset_batch_size()` can still commit a batch that was
> generated before the memcg went offline.  This looks related to
> f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios").
> 
> Race sequence:
> 
>      1. The aging path enters walk_mm() for the child memcg lruvec.
> 
>      2. walk_page_range() scans PTEs and update_batch_size() stores deltas
>         in walk->nr_pages.  At this point the deltas have not been
>         committed to lruvec->lrugen.nr_pages yet.
> 
>      3. walk_mm() drops mmap_read_lock(mm).  Before it reaches
>         reset_batch_size(), the child memcg is killed and removed.
> 
>      4. The memcg offline path runs lru_gen_reparent_memcg().  Under
>         lruvec_lock, it moves the child folios to the parent and clears the
>         child's lrugen.nr_pages.
> 
>      5. The old aging walk resumes, takes lruvec_lock, and
>         reset_batch_size() writes the stale walk->nr_pages deltas back into
>         the original child lruvec.
> 
>      6. Later, lru_gen_exit_memcg(child) checks the child's lrugen.nr_pages
>         with memchr_inv(...).  Since the stale batch made some slots
>         non-zero again, VM_WARN_ON_ONCE() triggers.

It seems this race can actually happen.

> 
> The two critical sections are serialized by `lruvec_lock`, but the batch
> accumulation in `walk->nr_pages` happens outside that lock, so there is no
> ordering between the accumulation and the reparenting zeroing.
> 
> The relevant code path:
> 
>    mm/vmscan.c:
>      run_cmd('+')              selects the target memcg and child lruvec
>      try_to_inc_max_seq()      stores the child lruvec in walk->lruvec
>      update_batch_size()       accumulates deltas in walk->nr_pages
>      walk_mm()                 calls walk_page_range(), then later
>                                reset_batch_size()
>      reset_batch_size()        writes cached deltas into
>                                walk->lruvec->lrugen.nr_pages
>      lru_gen_reparent_memcg()  reparents child MGLRU state and clears
>                                child nr_pages
>      lru_gen_exit_memcg()      warns if the exiting memcg has non-zero
>                                nr_pages
> 
>    mm/memcontrol.c:
>      mem_cgroup_css_offline()  calls memcg_reparent_objcgs() and
>                                lru_gen_offline_memcg()
>      mem_cgroup_free()         calls lru_gen_exit_memcg()
> 
> Reproducer:
> 
> The C reproducer and the helper script for running it are provided in the
> attachments.
> 
> The PoC creates a leaf memory cgroup, moves a victim process into it,
> and makes the victim fault and continuously touch file-backed pages so
> MGLRU aging can produce cached generation deltas for that memcg. A
> separate `lru_ager` thread repeatedly writes aging commands to
> `/sys/kernel/debug/lru_gen`; when the instrumentation reports that the
> ager is delayed just before `reset_batch_size()`, the PoC kills the
> victim and removes the leaf cgroup, forcing memcg offline/reparenting
> before the stale batch is committed.
> 
> The helper script builds the PoC, creates a temporary qcow2 overlay, 
> boots the instrumented kernel in QEMU with fake NUMA and SSH port 
> forwarding, copies the PoC into the guest, runs it, and scans the serial 
> console for `exit_nonzero`, `WARNING: mm/vmscan.c`, or `Kernel panic`. 
> It writes the full serial console, extracted kernel events, and guest 
> stdout/stderr under the chosen output directory.
> 
> The example command:
> 
>    ./repros/lru_gen_exit_memcg/run_poc_qemu.sh /tmp/lru_gen_poc_manual 10450 20 32
> 
> The arguments are:
> 
>    /tmp/lru_gen_poc_manual  output directory for the overlay, console log,
>                             extracted events and guest log
>    10450                    host TCP port forwarded to guest SSH
>    20                       number of PoC iterations to run
>    32                       file-backed working-set size in MiB per
>                             iteration
> 
> The script uses default `KERNEL`, `IMAGE` and `SSH_KEY` paths, or they can
> be overridden with environment variables.
> 
> Since this bug requires a specific race window, kernel instrumentation is
> needed to enlarge the race window in order to reproduce the bug more
> reliably.  The instrumentation patch is also included in the attachments.
> 
> The patch only instruments `mm/vmscan.c`: it delays the PoC aging task just
> before `reset_batch_size()`, logs when a stale batch is written into an
> already offlined and zeroed memcg lruvec, and dumps the non-zero
> `lrugen.nr_pages` slots before `lru_gen_exit_memcg()` triggers the warning.
> 
> A successful run reports `status=repro_triggered`, and the extracted events
> include a warning like:
> 
>    WARNING: mm/vmscan.c:5943 at lru_gen_exit_memcg+0x420/0x520
> 
> Proposed Fix:
> 
> One possible fix direction is to make `reset_batch_size()` skip writing
> back the stale delta when the memcg is no longer online.
> `reset_batch_size()` is called under `lruvec_lock`, the same lock that
> `lru_gen_reparent_memcg()` holds when it zeroes `nr_pages`, so this should
> avoid committing a batch after reparenting has completed.
> 
> Possible fix direction, not a tested patch:
> 
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -... reset_batch_size() ...
>   static void reset_batch_size(struct lru_gen_mm_walk *walk)
>   {
>       int gen, type, zone;
>       struct lruvec *lruvec = walk->lruvec;
>       struct lru_gen_folio *lrugen = &lruvec->lrugen;
> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> 
>       walk->batched = 0;
> 
>       for_each_gen_type_zone(gen, type, zone) {
>           enum lru_list lru = type * LRU_INACTIVE_FILE;
>           int delta = walk->nr_pages[gen][type][zone];
> 
>           if (!delta)
>               continue;
> 
>           walk->nr_pages[gen][type][zone] = 0;
> +
> +        /*
> +         * If the memcg went offline while we were walking page tables,
> +         * lru_gen_reparent_memcg() has already zeroed nr_pages and moved
> +         * all folios to the parent.  Writing our stale batch delta back
> +         * would corrupt the offline child and trigger WARN_ON in
> +         * lru_gen_exit_memcg().  Discard the delta; the parent lruvec
> +         * already owns the pages and accounts for them correctly.
> +         */
> +        if (memcg && !mem_cgroup_online(memcg))
> +            continue;

This check is insufficient, because offline_css() clears CSS_ONLINE only
after ss->css_offline(css) has returned. And we cannot simply drop the delta.

Thanks,
Qi

> +
>           WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
>                  lrugen->nr_pages[gen][type][zone] + delta);
> 
>           if (lru_gen_is_active(lruvec, gen))
>               lru += LRU_ACTIVE;
>           __update_lru_size(lruvec, lru, zone, delta);
>       }
>   }
> 
> Thanks
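
The ordering objection raised in this reply can be modeled in a few lines.
The sketch below is a user-space illustration under stated assumptions, not
kernel code: `css_model`, `MODEL_CSS_ONLINE` and the `model_*` functions are
hypothetical names, and the subsystem offline callback is reduced to a
single flag that stands for "reparenting has zeroed the child's nr_pages".
It shows that a walker which takes the lock after reparenting has finished,
but before the online flag is cleared, still passes an online check.

```c
#include <stdbool.h>

#define MODEL_CSS_ONLINE 0x1u	/* stand-in for the kernel's CSS_ONLINE bit */

/* Hypothetical, heavily reduced view of a css with its memcg state. */
struct css_model {
	unsigned int flags;
	bool nr_pages_zeroed;	/* set once reparenting has run */
};

/* Stand-in for an online check that tests the flag. */
static bool model_mem_cgroup_online(const struct css_model *css)
{
	return (css->flags & MODEL_CSS_ONLINE) != 0;
}

/*
 * Model of the offline ordering: the subsystem callback runs first and the
 * online flag is cleared afterwards.  *seen_online records what a walker
 * would observe if it ran between those two points.
 */
static void model_offline_css(struct css_model *css, bool *seen_online)
{
	css->nr_pages_zeroed = true;			/* callback: reparent */
	*seen_online = model_mem_cgroup_online(css);	/* walker commits here */
	css->flags &= ~MODEL_CSS_ONLINE;		/* cleared only now */
}

/* Returns true if the online check fails to catch the stale commit. */
static bool model_online_check_misses_window(void)
{
	struct css_model css = { .flags = MODEL_CSS_ONLINE };
	bool seen_online = false;

	model_offline_css(&css, &seen_online);
	return css.nr_pages_zeroed && seen_online &&
	       !model_mem_cgroup_online(&css);
}
```

The model only covers the first half of the objection, the window in which
the flag is still set.  It says nothing about where a delta that cannot be
dropped should be accounted instead.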

