From: Barry Song <21cnbao@gmail.com>
To: lenohou@gmail.com
Cc: 21cnbao@gmail.com, akpm@linux-foundation.org, axelrasmussen@google.com,
    laoar.shao@gmail.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    weixugc@google.com, wjl.linux@gmail.com, yuanchu@google.com,
    yuzhao@google.com
Subject: Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
Date: Sun, 1 Mar 2026 05:28:37 +0800
Message-Id: <20260228212837.59661-1-21cnbao@gmail.com>
In-Reply-To: <20260228161008.707-1-lenohou@gmail.com>
References: <20260228161008.707-1-lenohou@gmail.com>

On Sun, Mar 1, 2026 at 12:10 AM Leno Hou wrote:
>
> When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> condition exists between the state switching and the memory reclaim
> path. This can lead to unexpected cgroup OOM kills, even when plenty of
> reclaimable memory is available.
>
> *** Problem Description ***
>
> The issue arises from a "reclaim vacuum" during the transition:
>
> 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
>    false before the pages are drained from MGLRU lists back to
>    traditional LRU lists.
> 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
>    and skip the MGLRU path.
> 3. However, these pages might not have reached the traditional LRU lists
>    yet, or the changes are not yet visible to all CPUs due to a lack of
>    synchronization.
> 4. get_scan_count() subsequently finds traditional LRU lists empty,
>    concludes there is no reclaimable memory, and triggers an OOM kill.
>
> A similar race can occur during enablement, where the reclaimer sees
> the new state but the MGLRU lists haven't been populated via
> fill_evictable() yet.
>
> *** Solution ***
>
> Introduce a 'draining' state to bridge the gap during transitions:
>
> - Use smp_store_release() and smp_load_acquire() to ensure the visibility
>   of 'enabled' and 'draining' flags across CPUs.
> - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
>   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
>   lists first, and then fall through to traditional LRU lists instead
>   of returning early. This ensures that folios are visible to at least
>   one reclaim path at any given time.
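As an aside, purely for illustration and not part of this patch: the first
bullet relies on the usual release/acquire publication pattern, i.e. the
writer publishing a flag with smp_store_release() after its earlier stores,
paired with smp_load_acquire() in the reader so those earlier stores are
visible once the flag is observed. A minimal userspace C11 sketch of that
pairing, with made-up names ("payload", "flag"), might look roughly like
this (build with cc -pthread):

/*
 * Sketch only: "payload" stands in for the lruvec state written before
 * the flag is published, "flag" for something like lrugen.draining.
 * A reader that observes flag == true through an acquire load is also
 * guaranteed to observe the writer's earlier store to payload.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static int payload;
static _Atomic bool flag;

static void *writer(void *arg)
{
        (void)arg;
        payload = 42;                                   /* ordinary store */
        atomic_store_explicit(&flag, true,
                              memory_order_release);    /* ~ smp_store_release() */
        return NULL;
}

static void *reader(void *arg)
{
        (void)arg;
        while (!atomic_load_explicit(&flag,
                                     memory_order_acquire)) /* ~ smp_load_acquire() */
                ;                                       /* spin until published */
        printf("payload=%d\n", payload);                /* guaranteed to print 42 */
        return NULL;
}

int main(void)
{
        pthread_t w, r;

        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
}

In the patch, the "payload" would be the folios moved by
fill_evictable()/drain_evictable() and the per-lruvec state they update,
and the flags are lrugen.enabled and lrugen.draining.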
>
> *** Reproduction ***
>
> The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> a high-pressure memory cgroup (v1) environment.
>
> Reproduction steps:
> 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
>    and 8GB active anonymous memory.
> 2. Toggle MGLRU state while performing new memory allocations to force
>    direct reclaim.
>
> Reproduction script:
> ---
> #!/bin/bash
> # Fixed reproduction for memcg OOM during MGLRU toggle
> set -euo pipefail
>
> MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
>
> # Switch MGLRU dynamically in the background
> switch_mglru() {
>     local orig_val=$(cat "$MGLRU_FILE")
>     if [[ "$orig_val" != "0x0000" ]]; then
>         echo n > "$MGLRU_FILE" &
>     else
>         echo y > "$MGLRU_FILE" &
>     fi
> }
>
> # Setup 16G memcg
> mkdir -p "$CGROUP_PATH"
> echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> echo $$ > "$CGROUP_PATH/cgroup.procs"
>
> # 1. Build memory pressure (File + Anon)
> dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
>
> stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> sleep 5
>
> # 2. Trigger switch and concurrent allocation
> switch_mglru
> stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
>
> # Check OOM counter
> grep oom_kill "$CGROUP_PATH/memory.oom_control"
> ---
>
> Signed-off-by: Leno Hou
>
> ---
> To: linux-mm@kvack.org
> To: linux-kernel@vger.kernel.org
> Cc: Andrew Morton
> Cc: Axel Rasmussen
> Cc: Yuanchu Xie
> Cc: Wei Xu
> Cc: Barry Song <21cnbao@gmail.com>
> Cc: Jialing Wang
> Cc: Yafang Shao
> Cc: Yu Zhao
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/vmscan.c            | 14 +++++++++++---
>  2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7fb7331c5725..0648ce91dbc6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -509,6 +509,8 @@ struct lru_gen_folio {
>          atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>          /* whether the multi-gen LRU is enabled */
>          bool enabled;
> +        /* whether the multi-gen LRU is draining to LRU */
> +        bool draining;
>          /* the memcg generation this lru_gen_folio belongs to */
>          u8 gen;
>          /* the list segment this lru_gen_folio belongs to */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 06071995dacc..629a00681163 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
>                  VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
>                  VM_WARN_ON_ONCE(!state_is_valid(lruvec));
>
> -                lruvec->lrugen.enabled = enabled;
> +                smp_store_release(&lruvec->lrugen.enabled, enabled);
> +                smp_store_release(&lruvec->lrugen.draining, true);
>
>                  while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
>                          spin_unlock_irq(&lruvec->lru_lock);
> @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
>                          spin_lock_irq(&lruvec->lru_lock);
>                  }
>
> +                smp_store_release(&lruvec->lrugen.draining, false);
> +
>                  spin_unlock_irq(&lruvec->lru_lock);
>          }
>
> @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>          unsigned long nr_to_reclaim = sc->nr_to_reclaim;
>          bool proportional_reclaim;
>          struct blk_plug plug;
> +        bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> +        bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
>
> -        if (lru_gen_enabled() && !root_reclaim(sc)) {
> +        if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
>                  lru_gen_shrink_lruvec(lruvec, sc);
> -                return;

Is it possible to simply wait for draining to finish instead of
performing an lru_gen/lru shrink while lru_gen is being disabled or
enabled? Performing a shrink in an intermediate state may still involve
a lot of uncertainty, depending on how far the shrink has progressed
and how much remains on each side's LRU.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..ba306e986050 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,6 +509,8 @@ struct lru_gen_folio {
         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
         /* whether the multi-gen LRU is enabled */
         bool enabled;
+        /* whether the multi-gen LRU is switching from/to active/inactive LRU */
+        bool switching;
         /* the memcg generation this lru_gen_folio belongs to */
         u8 gen;
         /* the list segment this lru_gen_folio belongs to */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..60fc611067c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5196,6 +5196,7 @@ static void lru_gen_change_state(bool enabled)
                 VM_WARN_ON_ONCE(!state_is_valid(lruvec));

                 lruvec->lrugen.enabled = enabled;
+                smp_store_release(&lruvec->lrugen.switching, true);

                 while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
                         spin_unlock_irq(&lruvec->lru_lock);
@@ -5203,6 +5204,8 @@ static void lru_gen_change_state(bool enabled)
                         spin_lock_irq(&lruvec->lru_lock);
                 }

+                smp_store_release(&lruvec->lrugen.switching, false);
+
                 spin_unlock_irq(&lruvec->lru_lock);
         }

@@ -5780,6 +5783,10 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
         bool proportional_reclaim;
         struct blk_plug plug;

+#ifdef CONFIG_LRU_GEN
+        while (smp_load_acquire(&lruvec->lrugen.switching))
+                schedule_timeout_uninterruptible(HZ/100);
+#endif
         if (lru_gen_enabled() && !root_reclaim(sc)) {
                 lru_gen_shrink_lruvec(lruvec, sc);
                 return;
--

> +
> +                if (!lru_draining)
> +                        return;
> +
>          }
>
>          get_scan_count(lruvec, sc, nr);
> --
> 2.52.0
>

Thanks
Barry