From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pj1-f42.google.com (mail-pj1-f42.google.com [209.85.216.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D6C85314D0B for ; Thu, 19 Mar 2026 03:14:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.42 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1773890060; cv=none; b=AFBXCbNbgdYYLAk3Yft2MH+YCPISOq3lsxv0JSxTtQDdIQmdNGd3oDZecm1wkDNcGcIyUtxH8IYBa3eMfJ6L8dWf7/ko0VgaROS3nLG5RdMfWEJBG0YpKhIuhjHPKUC2YKJogQZhLWBShnImR5Bhdax+1WMqrGrRFPaRlTijkNI= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1773890060; c=relaxed/simple; bh=VSTgnvjlBfteu7EY0IoyKybnXo+dTMV+mWO4cDMabd4=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=ADxQp2FYhZOcjuyRz3/jhlNPHLGg48qILYPpfKiFgLAJDd/VLSZFFj9DQCaspcHsxVz0eI5U+rV7+vAPuCdubyrVk7/ZAGAef9gFfIuWC45KXV5X9Yi+Jw723miAy9u9la6RZdoMWgqQe8zMcH2yxuS+UxGlAJN18INAR7VCmJE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=EwVpk9mk; arc=none smtp.client-ip=209.85.216.42 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="EwVpk9mk" Received: by mail-pj1-f42.google.com with SMTP id 98e67ed59e1d1-35b905a05a8so118451a91.1 for ; Wed, 18 Mar 2026 20:14:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1773890058; x=1774494858; darn=vger.kernel.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=VKvMcKFaJqWhdEDJm7I22Ycbq4zxryzlgrB1A0RyvyY=; b=EwVpk9mkOYATB31qt57Eh5qZqFL6fy6F7MVyxqTZ1/MtVjWCL4/kXGQca3g7OfvjHE QIlvL9z1DA3g/19jCYXHKrV1LlXLXRbMPEbo02RpIIICy2F59J8n7dKvese3Ik2xKRpE 1LcFcaO0p1QBdjHv46wszcJSjr9sbZBjWtTTdXb9jfPxk8cO5D240jYYjihbFaekojJd aWKrYXYJ5aLNxyF6xJ2Wpg14iIXMV/EIGpjVHamYmN9ETHiQBXGW+MblrR1nKw78c/8/ AbQfrqf8B91XkqTM7QmGROkDhAAa1Ds3WH62reliYtx0QPS/O88nU6XOYerqss4fcAet 4LNw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1773890058; x=1774494858; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=VKvMcKFaJqWhdEDJm7I22Ycbq4zxryzlgrB1A0RyvyY=; b=KtWVk6Mcy4Jw3GcBiFqaP1QiowOWMXcIIYE6ukWEnfL0MFqC3mfuNjQqBcKCGC4uNi PLBN0Gr+Vtufd+lyQvjMlzbFwFasnrhooiIuYvHP/BfH0A1mRnykNvDPQKuhidl43spA OvFjq7KjojHAvyVhlvuRLndLjv1QAA+Se9H58jmpXkdWcnl1Qlt+BYfediX+Oh2ztAwT yK1iBMmjBZvaVsPww7u2VepETMQGFVqEOGuVALs5bSYnPqsC4Xnv3/ldQWX3GO+VPn9M kIpUFQ+sv4VVBi4SrmkPDn+P58xgFMCM8v5mcbDhP9CcJP9NC1qgkDjI+kqgyuSY9FDN Hs1g== X-Forwarded-Encrypted: i=1; AJvYcCXAZenyxjKvmQlWKBp8wRtm3Ne6TMqdy1GePFHHnwcJOWitpEdFM7afEztkqUmTBor+fmoANc8nFX3OEq8=@vger.kernel.org X-Gm-Message-State: AOJu0Yzc/r4uJizYuToOWDZhkYRnha9Z+fxYGqyX33Vy9SOls98SihVV XaU5j6EVpWNCuT5Ftec/g6kbbyubMwQunBtIE/vknbnhqjaxT+2SxWGw X-Gm-Gg: ATEYQzx4At+WYAUvnvEmIl7pNmWQEbp8Voj2GD+sXcCid6hc1YTp8JzPfToiUqQKvVI YwoKhLIrunuHI1rP28vVtjlbcQQvB62HpA0bw0CssczVw5SGbypiJG9blnEA0CB2cU8pGZlFF0j LQWFU+DvZQpUjwntJX9R0A063JyRrvifu2iU4OAsZ3aecOCjTFgahkzfkpOO4FXoV6b87PcXVAy Y2RIIa+xdd31oLx1A+mAdLYIvCr0QU1YTTTAoSRenpOUnY2txeCTgAywnaCsSVNYMwKI5jQGVuH i3mNpWDPYnWnaX4ex/SoIwexlzt16FoWXLiEAT8Er21RW7XDnF3JQyVvkkGcBliMcG765Iz7I1+ 9wOPd3D0li1QSX7YdCIlYqi3I4/5rtXzrmUrToZMiTfuHmMTsWOAu/UA5NJxihHtegP/d3MK2st D2P9kVT/xPc/VTCPI7zVEsTLXo878F/xecLBAZ+EjlWakTMhs9FmDieLI/xJYHpoGMRN6L+0vGb hMOam41TpPMuSTu51i24Z4xeuuFldB4w05o6qPpElAHLShcKQ== X-Received: by 2002:a17:90b:2d4f:b0:35b:96bb:47b5 with SMTP id 98e67ed59e1d1-35bb9e87d51mr5385781a91.15.1773890057825; Wed, 18 Mar 2026 20:14:17 -0700 (PDT) Received: from ?IPV6:2408:840d:1c00:46ad:3de1:fd36:5f98:c03? ([2408:840d:1c00:46ad:3de1:fd36:5f98:c03]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-35bc610c1efsm1113203a91.14.2026.03.18.20.14.13 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 18 Mar 2026 20:14:17 -0700 (PDT) Message-ID: Date: Thu, 19 Mar 2026 11:14:10 +0800 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching To: Barry Song <21cnbao@gmail.com> Cc: Andrew Morton , Axel Rasmussen , Yuanchu Xie , Wei Xu , Jialing Wang , Yafang Shao , Yu Zhao , Kairui Song , Bingfang Guo , linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <20260318-b4-switch-mglru-v2-v4-1-1b927c93659d@gmail.com> <8c01a707-f798-4649-8441-d82dd0dac7b9@gmail.com> <4807e460-054c-49ed-9792-f5000d7b3820@gmail.com> Content-Language: en-US From: Leno Hou In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit On 3/19/26 5:29 AM, Barry Song wrote: > On Wed, Mar 18, 2026 at 8:56 PM Leno Hou wrote: >> >> On 3/18/26 4:30 PM, Barry Song wrote: >>> On Wed, Mar 18, 2026 at 4:17 PM Leno Hou wrote: >> [...] >>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>>>>> index 33287ba4a500..88b9db06e331 100644 >>>>>> --- a/mm/vmscan.c >>>>>> +++ b/mm/vmscan.c >>>>>> @@ -886,7 +886,7 @@ static enum folio_references folio_check_references(struct folio *folio, >>>>>> if (referenced_ptes == -1) >>>>>> return FOLIOREF_KEEP; >>>>>> >>>>>> - if (lru_gen_enabled()) { >>>> >>>> documentation as following: >>>> >>>> /* >>>> * During the MGLRU state transition (lru_gen_switching), we force >>>> * folios to follow the traditional active/inactive reference checking. >>>> * >>>> * While MGLRU is switching,the generational state of folios is in flux. >>>> * Falling back to the traditional logic (which relies on PG_referenced/ >>>> * PG_active flags that are consistent across both mechanisms) provides >>>> * a stable, safe behavior for the folio until it is fully migrated back >>>> * to the traditional LRU lists. This avoids relying on potentially >>>> * inconsistent MGLRU generational metadata during the transition. >>>> */ >>>> >>>>>> + if (lru_gen_enabled() && !lru_gen_draining()) { >>>>> >>>>> I’m curious what prompted you to do this. >>>>> >>>>> This feels a bit odd. I assume this effectively makes >>>>> folios on MGLRU, as well as those on active/inactive >>>>> lists, always follow the active/inactive logic. >>>>> >>>>> It might be fine, but it needs thorough documentation here. >>>>> >>>>> another approach would be: >>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>>>> index 33287ba4a500..91b60664b652 100644 >>>>> --- a/mm/vmscan.c >>>>> +++ b/mm/vmscan.c >>>>> @@ -122,6 +122,9 @@ struct scan_control { >>>>> /* Proactive reclaim invoked by userspace */ >>>>> unsigned int proactive:1; >>>>> >>>>> + /* Are we reclaiming from MGLRU */ >>>>> + unsigned int lru_gen:1; >>>>> + >>>>> /* >>>>> * Cgroup memory below memory.low is protected as long as we >>>>> * don't threaten to OOM. If any cgroup is reclaimed at >>>>> @@ -886,7 +889,7 @@ static enum folio_references >>>>> folio_check_references(struct folio *folio, >>>>> if (referenced_ptes == -1) >>>>> return FOLIOREF_KEEP; >>>>> >>>>> - if (lru_gen_enabled()) { >>>>> + if (sc->lru_gen) { >>>>> if (!referenced_ptes) >>>>> return FOLIOREF_RECLAIM; >>>>> >>>>> This makes the logic perfectly correct (you know exactly >>>>> where your folios come from), but I’m not sure it’s worth it. >>>>> >>>>> Anyway, I’d like to understand why you always need to >>>>> use the active/inactive logic even for folios from MGLRU. >>>>> To me, it seems to work only by coincidence, which isn’t good. >>>>> >>>>> Thanks >>>>> Barry >>>> >>>> Hi Barry, >>>> >>>> I agree that using !lru_gen_draining() feels a bit like a fallback path. >>>> However, after considering your suggestion for sc->lru_gen, I’m >>>> concerned about the broad impact of modifying struct scan_control.Since >>>> lru_drain_core is a very transient state, I prefer a localized fix that >>>> doesn't propagate architectural changes throughout the entire reclaim stack. >>>> >>>> You mentioned that using the active/inactive logic feels like it works >>>> by 'coincidence'. To clarify, this is an intentional fallback: because >>>> the generational metadata in MGLRU becomes unreliable during draining, >>>> we intentionally downgrade these folios to the traditional logic. Since >>>> the PG_referenced and PG_active bits are maintained by the core VM and >>>> are consistent regardless of whether MGLRU is active, this fallback is >>>> technically sound and robust. >>>> >>>> I have added detailed documentation to the code to explain this design >>>> choice, clarifying that it's a deliberate transition strategy rather >>>> than a coincidence." >>> >>> Nope. You still haven’t explained why the active/inactive LRU >>> logic makes it work. MGLRU and active/inactive use different >>> methods to determine whether a folio is hot or cold. You’re >>> forcing active/inactive logic to decide hot/cold for an MGLRU >>> folio. It’s not that simple—PG_referenced isn’t maintained >>> by the core; it’s specific to active/inactive. See folio_mark_accessed(). >>> >>> Best Regards >>> Barry >> >> Hi Barry, >> >> Thank you for your patience and for pointing out the version-specific >> nuances. You are absolutely correct—my previous assumption that the >> traditional reference-checking logic would serve as a robust fallback >> was fundamentally flawed. >> >> After re-examining the code in v7.0 and comparing it with older versions >> (e.g., v6.1), I see the core issue you highlighted: >> >> 1. Evolution of PG_referenced: In older kernels, lru_gen_inc_refs() >> often interacted with the PG_referenced bit, which inadvertently >> provided a 'coincidental' hint for the legacy reclaim path. However, in >> v7.0+, lru_gen_inc_refs() has evolved to use set_mask_bits() on the >> LRU_REFS_MASK bitfield, and it no longer relies on or updates the legacy >> PG_referenced bit for MGLRU folios. >> >> 2. The Logic Flaw: When switching from MGLRU to the traditional LRU, >> these folios arrive at the legacy reclaim path with PG_referenced unset >> or stale. If I force them through the legacy folio_check_references() >> path, folio_test_clear_referenced(folio) predictably returns 0. The >> legacy path interprets this as a 'cold' folio, leading to premature >> reclamation. You are correct that forcing this active/inactive logic >> onto MGLRU folios is logically inconsistent. >> >> >> 3. My Revised Approach: Instead of attempting to patch >> folio_check_references() with a fallback logic, I have decided to keep >> the folio_check_references() logic unchanged. >> >> The system handles this transition safely through the kernel's existing >> reclaim loop and retry mechanisms: >> >> a) While MGLRU is draining, folios are moved back to the traditional >> LRU lists. Once migrated, these folios will naturally begin >> participating in the legacy reclaim path. >> >> b) Although some folios might be initially underestimated as 'cold' >> in the very first reclaim pass immediately after the switch, the >> kernel's reclaim loop will naturally re-evaluate them. As they are >> accessed, the standard legacy mechanism will correctly maintain the >> PG_referenced bit, and the system will converge to the correct state >> without needing an explicit fallback path or state-checking in >> folio_check_references(). >> >> >> This approach avoids the logical corruption caused by forcing >> incompatible evaluation methods and relies on the natural convergence of >> the existing reclaim loop. >> >> >> Does this alignment with the existing reclaim mechanism address your >> concerns about logical consistency? > > My gut feeling is that we probably don’t need to worry > too much about the accuracy of hot/cold evaluation during > switching, since the system is already in a volatile state > at that point. So as long as we avoid introducing unusual > logic—such as forcing active/inactive decisions onto MGLRU > folios—I’m fine with it. > > Ideally, we would add an sc->lru_gen boolean so we know > exactly where the folios come from, rather than relying on > folio_lru_gen(folio) != -1, which can be misleading. > However, if this doesn’t bring much improvement, it may > not be worth increasing the complexity. > Hi Barry, Thank you for the guidance~ To address your concerns regarding readability and maintainability: 1. Naming: I'll rename the transition state to lru_gen_switching (instead of draining) to better reflect its purpose. 2. Documentation: I'll add the documentation to explain why we disable look-around optimization when LRU is switching. This approach keeps the patch minimal and fix for the OOM issue during switching without introducing complexity. See next v5 patch later. --- Best regards, Leno Hou