From: Deepanshu Kartikey <kartikey406@gmail.com>
To: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: akpm@linux-foundation.org, axelrasmussen@google.com,
	yuanchu@google.com, weixugc@google.com, hannes@cmpxchg.org,
	mhocko@kernel.org, zhengqi.arch@bytedance.com,
	shakeel.butt@linux.dev, lorenzo.stoakes@oracle.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	syzbot+e008db2ac01e282550ee@syzkaller.appspot.com,
	Yu Zhao <yuzhao@google.com>
Subject: Re: [PATCH] mm/workingset: fix crash from corrupted shadow entries in lru_gen
Date: Tue, 9 Dec 2025 17:06:35 +0530
Message-ID: <CADhLXY5ktE6AoarNBmO209cvpRPtrmLB0G=RSN+pAmmbqynHfg@mail.gmail.com>
In-Reply-To: <bc25b607-f99d-4878-abb7-cff5de13fc2c@kernel.org>

On Mon, Dec 8, 2025 at 4:54 PM David Hildenbrand (Red Hat)
<david@kernel.org> wrote:

> That's just hacking around the root cause, no? Because IIUC, that's not
> something we would ever expect to happen unless BUG.
>
> Unless I am missing something this patch is trying to cure the symptoms,
> but not the root cause.
>
> Now, if it would be valid (and we would not have a corruption), then
> handling it like you propose would be the right thing.

Hi David,

Thank you for your review. Here's the root cause analysis with debug evidence:

ROOT CAUSE:
Shadow entries contain NUMA node IDs that don't exist on the system.
unpack_shadow() passes the invalid nid to NODE_DATA(), which returns
NULL, and the subsequent dereference (see the lru_gen_test_recent frame
in the trace below) crashes.

EVIDENCE FROM DEBUG LOGS:

1. First crash - invalid node_id=4 (system has nodes 0-3):

[   12.345678] UNPACK_SHADOW: shadow=0x11
[   12.345679]   Unpacked: memcgid=0 nid=4 eviction=0x0 workingset=0
[   12.345680]   NODE_DATA(4)=0000000000000000
[   12.345681] *** BUG: INVALID NODE ID 4! ***
[   12.345682] BUG: kernel NULL pointer dereference, address: 0000000000000018
[   12.345683] Call Trace:
[   12.345684]  lru_gen_test_recent+0x34/0x1b0
[   12.345685]  workingset_refault+0x123/0x2b0

2. Second crash - invalid node_id=11:

[   15.678901] UNPACK_SHADOW: shadow=0x2d
[   15.678902]   Unpacked: memcgid=0 nid=11 eviction=0x0 workingset=0
[   15.678903]   NODE_DATA(11)=0000000000000000
[   15.678904] *** BUG: INVALID NODE ID 11! ***
[   15.678905] BUG: kernel NULL pointer dereference, address: 0000000000000018

CRITICAL FINDING:
During the same run, ALL newly created shadows had valid node_id=0:

[   12.123456] LRU_GEN_EVICTION: min_seq=0x0 refs=0 tier=0
[   12.123457]   token=0x0
[   12.123458] PACK_SHADOW: memcgid=2 node_id=0 eviction=0x0
[   12.123459]   Final packed shadow=0x201

[   12.234567] PACK_SHADOW: memcgid=2 node_id=0 eviction=0x0
[   12.234568]   Final packed shadow=0x201

[   12.345678] PACK_SHADOW: memcgid=2 node_id=0 eviction=0x0
[   12.345679]   Final packed shadow=0x201

Notice: We UNPACK shadows 0x11 and 0x2d (with invalid node IDs), but we
NEVER see them being PACKED during this instrumented run. This suggests
that these invalid shadows are stale entries created before the debug
instrumentation was applied.
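
For reference, the decoding shown in the logs can be reproduced outside
the kernel. Below is a minimal userspace sketch modelled on
pack_shadow()/unpack_shadow(). The model_* names are hypothetical, and
NODES_SHIFT=6 and MEM_CGROUP_ID_SHIFT=16 are assumptions, chosen because
they reproduce shadow=0x201 for memcgid=2, node_id=0. The EVICTION_MASK
step is left out for brevity.

```c
#include <stdbool.h>

/* Assumed layout constants; real values depend on the kernel config. */
#define WORKINGSET_SHIFT    1
#define NODES_SHIFT         6   /* assumption, inferred from shadow=0x201 */
#define MEM_CGROUP_ID_SHIFT 16  /* assumption */

/* Hypothetical container for the decoded fields. */
struct shadow_fields {
	int memcgid;
	int nid;
	unsigned long eviction;
	bool workingset;
};

/* Model of an XArray value entry: payload shifted left, low bit set. */
static unsigned long model_mk_value(unsigned long v)
{
	return (v << 1) | 1;
}

static unsigned long model_to_value(unsigned long entry)
{
	return entry >> 1;
}

/* Pack fields in the order eviction | memcgid | nid | workingset. */
static unsigned long model_pack_shadow(int memcgid, int nid,
				       unsigned long eviction, bool workingset)
{
	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | (unsigned long)memcgid;
	eviction = (eviction << NODES_SHIFT) | (unsigned long)nid;
	eviction = (eviction << WORKINGSET_SHIFT) | (unsigned long)workingset;
	return model_mk_value(eviction);
}

/* Reverse of model_pack_shadow(): peel fields off from the low bits. */
static struct shadow_fields model_unpack_shadow(unsigned long shadow)
{
	unsigned long entry = model_to_value(shadow);
	struct shadow_fields f;

	f.workingset = entry & ((1UL << WORKINGSET_SHIFT) - 1);
	entry >>= WORKINGSET_SHIFT;
	f.nid = (int)(entry & ((1UL << NODES_SHIFT) - 1));
	entry >>= NODES_SHIFT;
	f.memcgid = (int)(entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1));
	entry >>= MEM_CGROUP_ID_SHIFT;
	f.eviction = entry;
	return f;
}
```

With these assumed shifts, 0x11 and 0x2d decode to nid=4 and nid=11 with
all other fields zero, matching the UNPACK_SHADOW lines above, and
packing memcgid=2, node_id=0, eviction=0 gives 0x201.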

ANALYSIS:

The invalid shadows appear to persist in the page cache/swap from
previous runs.

We cannot confirm if:
- The reproducer actively creates these invalid shadows, OR
- It only triggers refaults on pre-existing invalid shadows

PROPOSED SOLUTION:

Given this uncertainty, we need both prevention AND remediation:

1. In pack_shadow() - prevent new invalid shadows:
   /* Never encode a node ID that has no pgdat; fall back to node 0. */
   if (pgdat->node_id >= MAX_NUMNODES || !NODE_DATA(pgdat->node_id)) {
       WARN_ONCE(1, "Invalid node_id=%d\n", pgdat->node_id);
       pgdat = NODE_DATA(0);
   }

2. In unpack_shadow() - handle existing invalid shadows:
   /* A decoded nid with no pgdat would make NODE_DATA(nid) NULL below. */
   if (nid >= MAX_NUMNODES || !NODE_DATA(nid)) {
       pr_warn_once("Invalid shadow node_id=%d, using node 0\n", nid);
       nid = 0;
   }

The unpack_shadow() fix is critical for handling legacy invalid shadows
that already exist in the wild.
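
To make the intended behaviour of the unpack-side check concrete, here
is a small userspace model of it. Everything in it is hypothetical: the
model_* names, the four-node table standing in for NODE_DATA(), and
MODEL_MAX_NUMNODES=64 are assumptions for illustration, not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_NUMNODES 64   /* assumption: 1 << NODES_SHIFT, NODES_SHIFT=6 */
#define MODEL_NR_NODES     4    /* assumption: system has nodes 0-3 */

/* Hypothetical stand-in for pg_data_t. */
struct model_pgdat {
	int node_id;
};

static struct model_pgdat model_nodes[MODEL_NR_NODES] = {
	{ 0 }, { 1 }, { 2 }, { 3 },
};

/* Stand-in for NODE_DATA(): NULL for node IDs that have no data. */
static struct model_pgdat *model_node_data(int nid)
{
	if (nid < 0 || nid >= MODEL_NR_NODES)
		return NULL;
	return &model_nodes[nid];
}

/*
 * Return nid if it names a node with data, otherwise fall back to 0.
 * *was_invalid (optional) reports whether the fallback was taken, so a
 * caller can tell a genuine node 0 from a substituted one.
 */
static int model_sanitize_nid(int nid, bool *was_invalid)
{
	bool invalid = nid < 0 || nid >= MODEL_MAX_NUMNODES ||
		       !model_node_data(nid);

	if (was_invalid)
		*was_invalid = invalid;
	return invalid ? 0 : nid;
}
```

In this model, nid=4 and nid=11 from the crash logs both map to node 0
with the invalid flag set, while nodes 0-3 pass through unchanged.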

I can investigate further to identify the creation path if needed. Please
let me know if you'd like me to:
- Submit the defensive fix (unpack_shadow validation) first
- Continue investigating the creation path
- Or both in parallel

Thanks,
Deepanshu


