From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: "zhen.ni" <zhen.ni@easystack.cn>,
Andrew Morton <akpm@linux-foundation.org>
Cc: surenb@google.com, mhocko@suse.com, jackmanb@google.com,
hannes@cmpxchg.org, ziy@nvidia.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v9 0/4] mm/page_owner: add per-fd filter infrastructure for print_mode and NUMA filtering
Date: Tue, 23 Jun 2026 08:52:48 +0200 [thread overview]
Message-ID: <ef2db739-fa41-4203-91e5-61c0625ecbf4@kernel.org> (raw)
In-Reply-To: <444a6fd9-26dd-4d33-b836-987ef220b5f2@easystack.cn>
On 6/18/26 10:15, zhen.ni wrote:
>
>
> 在 2026/6/18 15:27, Vlastimil Babka (SUSE) 写道:
>> On 6/17/26 10:52, zhen.ni wrote:
>>> 在 2026/5/26 03:58, Andrew Morton 写道:
>>>> On Mon, 25 May 2026 16:16:48 +0800 Zhen Ni <zhen.ni@easystack.cn> wrote:
>>>>
>>>>> This patch series introduces per-file-descriptor filtering capabilities to the
>>>>> page_owner feature.
>>>>
>>>> Thanks again. AI review has found a bunch of new things to get worried
>>>> about:
>>>> https://sashiko.dev/#/patchset/20260525081652.2210206-1-zhen.ni@easystack.cn
>>>>
>>>>
>>> Hi,
>>>
>>> Can this lead to an out-of-bounds memory read?
>>>
>>> The NUMA filter in page_owner (mm/page_owner.c:790-798) bypasses
>>> PF_POISONED_CHECK() to avoid triggering VM_BUG_ON during concurrent page
>>> allocation/free:
>>>
>>> int page_nid = memdesc_nid(page->flags);
>>>
>>> When NODE_NOT_IN_PAGE_FLAGS is defined, memdesc_nid() performs unchecked
>>> array access:
>>>
>>> int memdesc_nid(memdesc_flags_t mdf)
>>> {
>>> return section_to_node_table[memdesc_section(mdf)];
>>> }
>>>
>>> If page->flags is poisoned, memdesc_section() can return a garbage
>>> section_nr that causes out-of-bounds access.
>>>
>>> ## Lockless Access Safety Principle
>>>
>>> The page_owner iterator runs without locks, meaning pages can be
>>> allocated or freed concurrently. The fundamental design principle should be:
>>>
>>> "It's acceptable to skip a small number of abnormal pages, but panics
>>> must be prevented."
>>>
>>> In lockless iteration, TOCTOU is unavoidable - even with reference
>>> counting or RCU, page->flags can still be modified concurrently during
>>> access. Zone locks prevent this but are prohibitively expensive.
>>>
>>> ## Proposed Solution: Add nid to struct page_owner
>>>
>>> Record nid at allocation time when page state is stable, eliminating the
>>> need to extract it from page->flags during iteration:
>>>
>>> ### 1. Modify struct page_owner
>>>
>>> struct page_owner {
>>> unsigned short order;
>>> short last_migrate_reason;
>>> ...
>>> pid_t tgid;
>>> pid_t free_pid;
>>> pid_t free_tgid;
>>> int nid; // NEW
>>> };
>>>
>>> ### 2. Record nid during allocation
>>>
>>> static inline void __update_page_owner_handle(struct page *page, ...)
>>> {
>>> int nid = page_to_nid(page); // Safe in allocation context
>>>
>>> for_each_page_ext(page, 1 << order, page_ext, iter) {
>>> page_owner = get_page_owner(page_ext);
>>> page_owner->nid = nid;
>>> // ... other fields ...
>>> }
>>> }
>>>
>>> ### 3. Use saved nid in NUMA filter
>>>
>>> if (state->nid_filter_enabled) {
>>> int page_nid = page_owner->nid; // Direct read, safe
>>>
>>> if (!node_isset(page_nid, state->nid_filter)) {
>>> spin_unlock_irqrestore(&state->lock, flags);
>>> goto ext_put_continue;
>>> }
>>> }
>>>
>>> ### 4. Update nid on page migration
>>>
>>> // In split_page_owner() when page migrates
>>> page_owner->nid = page_to_nid(&newfolio->page);
>>>
>>
>> This (presumably LLM) suggestion is a, let's say "lazy" solution to the
>> problem, leading to more memory usage. I'd be surprised if it's not possible
>> to read the nid in a way that avoids the hazards. If page_to_nid() can
>> trigger a VM_BUG_ON(), then I'd add a version without that VM_BUG_ON(),
>> handling the poisoned state gracefully - if it's poisoned, return e.g.
>> NUMA_NO_NODE and skip the page, or something.
>>
>>> The remaining two issues can also be improved. If there are no
>>> additional comments, I will proceed with sending v10.
>>>
>>>
>>> Thanks,
>>> Zhen
>>
>>
>>
>
> Thank you for the review.
>
> I'd like to clarify that the "nid" member approach was my own design
> after careful consideration of alternatives, not a suggestion from
> automated tools.:)
Alright, good :)
> In fact, LLMs suggested approaches like "check then access" which I had
> already implemented and rejected in earlier versions due to TOCTOU
> issues. The key insight is that page_owner serves as a buffering layer
> for struct page, eliminating lockless access inconsistency entirely.
Sure, but is the elimination necessary if it has a memory cost?
> Think about it this way:
> Even with extensive checks, page_owner->handle cannot guarantee the page
> won't be freed when printed. The time window between checking
> page->flags and accessing page->flags is unavoidable in lockless
> iteration.
Of course, but that's an argument saying that page owner as a whole can't be
a perfect snapshot anyway. Then it follows that "nid" snapshot doesn't need
to be perfect either.
> By recording nid at allocation time (when page->flags is stable),
> page_owner becomes a "consistent snapshot" that can be safely accessed
> without locks. This is the same principle that makes page_owner work as
> a debug feature in the first place - it accepts a small inconsistency
> window in exchange for lockless access.
And that means we can accept some more small inconsistency for nid, without
occupying more memory.
> Alternative approaches (adding poison checks, bounds checking) cannot
> fully eliminate TOCTOU in lockless code - they just reduce the
> probability. The buffering approach is the way to provide both
> safety and lockless performance.
Please explain how the following "page_to_nid_robust()" (or similar name)
code cannot eliminate the out of bounds accesses completely. The section/nid
in page flags is possibly the most stable part of the whole struct page, it
doesn't change as page is allocated, freed, or how it's used. The only
problem is the poison.
memdesc_flags_t flags = READ_ONCE(page->flags);
// our flags variable is no longer subject to TOCTOU
if (flags.f == PAGE_POISON_PATTERN)
return NUMA_NO_NODE;
return memdesc_nid(flags);
-
> Thanks,
> Zhen
prev parent reply other threads:[~2026-06-23 6:52 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-25 8:16 [PATCH v9 0/4] mm/page_owner: add per-fd filter infrastructure for print_mode and NUMA filtering Zhen Ni
2026-05-25 8:16 ` [PATCH v9 1/4] mm/page_owner: add print_mode filter Zhen Ni
2026-05-25 8:16 ` [PATCH v9 2/4] mm/page_owner: add NUMA node filter Zhen Ni
2026-05-25 8:16 ` [PATCH v9 3/4] tools/mm: add page_owner_filter userspace tool Zhen Ni
2026-05-25 8:16 ` [PATCH v9 4/4] mm/page_owner: document page_owner filter Zhen Ni
2026-05-25 19:58 ` [PATCH v9 0/4] mm/page_owner: add per-fd filter infrastructure for print_mode and NUMA filtering Andrew Morton
2026-06-17 8:52 ` zhen.ni
2026-06-18 7:27 ` Vlastimil Babka (SUSE)
2026-06-18 8:15 ` zhen.ni
2026-06-23 6:52 ` Vlastimil Babka (SUSE) [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ef2db739-fa41-4203-91e5-61c0625ecbf4@kernel.org \
--to=vbabka@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=jackmanb@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=surenb@google.com \
--cc=zhen.ni@easystack.cn \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox