Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "zhen.ni" <zhen.ni@easystack.cn>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: surenb@google.com, mhocko@suse.com, jackmanb@google.com,
	hannes@cmpxchg.org, ziy@nvidia.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v9 0/4] mm/page_owner: add per-fd filter infrastructure for print_mode and NUMA filtering
Date: Thu, 18 Jun 2026 16:15:23 +0800	[thread overview]
Message-ID: <444a6fd9-26dd-4d33-b836-987ef220b5f2@easystack.cn> (raw)
In-Reply-To: <7b2125bb-c2d6-4678-9ae3-6f4e93034391@kernel.org>



在 2026/6/18 15:27, Vlastimil Babka (SUSE) 写道:
> On 6/17/26 10:52, zhen.ni wrote:
>> 在 2026/5/26 03:58, Andrew Morton 写道:
>>> On Mon, 25 May 2026 16:16:48 +0800 Zhen Ni <zhen.ni@easystack.cn> wrote:
>>>
>>>> This patch series introduces per-file-descriptor filtering capabilities to the
>>>> page_owner feature.
>>>
>>> Thanks again.  AI review has found a bunch of new things to get worried
>>> about:
>>> 	https://sashiko.dev/#/patchset/20260525081652.2210206-1-zhen.ni@easystack.cn
>>>
>>>
>> Hi,
>>
>> Can this lead to an out-of-bounds memory read?
>>
>> The NUMA filter in page_owner (mm/page_owner.c:790-798) bypasses
>> PF_POISONED_CHECK() to avoid triggering VM_BUG_ON during concurrent page
>> allocation/free:
>>
>>       int page_nid = memdesc_nid(page->flags);
>>
>> When NODE_NOT_IN_PAGE_FLAGS is defined, memdesc_nid() performs unchecked
>> array access:
>>
>>       int memdesc_nid(memdesc_flags_t mdf)
>>       {
>>           return section_to_node_table[memdesc_section(mdf)];
>>       }
>>
>> If page->flags is poisoned, memdesc_section() can return a garbage
>> section_nr that causes out-of-bounds access.
>>
>> ## Lockless Access Safety Principle
>>
>> The page_owner iterator runs without locks, meaning pages can be
>> allocated or freed concurrently. The fundamental design principle should be:
>>
>> "It's acceptable to skip a small number of abnormal pages, but panics
>> must be prevented."
>>
>> In lockless iteration, TOCTOU is unavoidable - even with reference
>> counting or RCU, page->flags can still be modified concurrently during
>> access. Zone locks prevent this but are prohibitively expensive.
>>
>> ## Proposed Solution: Add nid to struct page_owner
>>
>> Record nid at allocation time when page state is stable, eliminating the
>> need to extract it from page->flags during iteration:
>>
>> ### 1. Modify struct page_owner
>>
>>       struct page_owner {
>>           unsigned short order;
>>           short last_migrate_reason;
>>           ...
>>           pid_t tgid;
>>           pid_t free_pid;
>>           pid_t free_tgid;
>>           int nid;  // NEW
>>       };
>>
>> ### 2. Record nid during allocation
>>
>>       static inline void __update_page_owner_handle(struct page *page, ...)
>>       {
>>           int nid = page_to_nid(page);  // Safe in allocation context
>>
>>           for_each_page_ext(page, 1 << order, page_ext, iter) {
>>               page_owner = get_page_owner(page_ext);
>>               page_owner->nid = nid;
>>               // ... other fields ...
>>           }
>>       }
>>
>> ### 3. Use saved nid in NUMA filter
>>
>>       if (state->nid_filter_enabled) {
>>           int page_nid = page_owner->nid;  // Direct read, safe
>>
>>           if (!node_isset(page_nid, state->nid_filter)) {
>>               spin_unlock_irqrestore(&state->lock, flags);
>>               goto ext_put_continue;
>>           }
>>       }
>>
>> ### 4. Update nid on page migration
>>
>>       // In split_page_owner() when page migrates
>>       page_owner->nid = page_to_nid(&newfolio->page);
>>
> 
> This (presumably LLM) suggestion is a, let's say "lazy" solution to the
> problem, leading to more memory usage. I'd be surprised if it's not possible
> to read the nid in a way that avoids the hazards. If page_to_nid() can
> trigger a VM_BUG_ON(), then I'd add a version without that VM_BUG_ON(),
> handling the poisoned state gracefully - if it's poisoned, return e.g.
> NUMA_NO_NODE and skip the page, or something.
> 
>> The remaining two issues can also be improved. If there are no
>> additional comments, I will proceed with sending v10.
>>
>>
>> Thanks,
>> Zhen
> 
> 
> 

Thank you for the review.

I'd like to clarify that the "nid" member approach was my own design
after careful consideration of alternatives, not a suggestion from
automated tools.:)

In fact, LLMs suggested approaches like "check then access" which I had
already implemented and rejected in earlier versions due to TOCTOU
issues. The key insight is that page_owner serves as a buffering layer
for struct page,  eliminating lockless access inconsistency entirely.

Think about it this way:
Even with extensive checks, page_owner->handle cannot guarantee the page
won't be freed when printed. The time window between checking
page->flags and accessing page->flags is unavoidable in lockless
iteration.

By recording nid at allocation time (when page->flags is stable),
page_owner becomes a "consistent snapshot" that can be safely accessed
without locks. This is the same principle that makes page_owner work as
a debug feature in the first place - it accepts a small inconsistency
window in exchange for lockless access.

Alternative approaches (adding poison checks, bounds checking) cannot
fully eliminate TOCTOU in lockless code - they just reduce the
probability. The buffering approach is the way to provide both
safety and lockless performance.

Thanks,
Zhen


  reply	other threads:[~2026-06-18  8:21 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-25  8:16 [PATCH v9 0/4] mm/page_owner: add per-fd filter infrastructure for print_mode and NUMA filtering Zhen Ni
2026-05-25  8:16 ` [PATCH v9 1/4] mm/page_owner: add print_mode filter Zhen Ni
2026-05-25  8:16 ` [PATCH v9 2/4] mm/page_owner: add NUMA node filter Zhen Ni
2026-05-25  8:16 ` [PATCH v9 3/4] tools/mm: add page_owner_filter userspace tool Zhen Ni
2026-05-25  8:16 ` [PATCH v9 4/4] mm/page_owner: document page_owner filter Zhen Ni
2026-05-25 19:58 ` [PATCH v9 0/4] mm/page_owner: add per-fd filter infrastructure for print_mode and NUMA filtering Andrew Morton
2026-06-17  8:52   ` zhen.ni
2026-06-18  7:27     ` Vlastimil Babka (SUSE)
2026-06-18  8:15       ` zhen.ni [this message]
2026-06-23  6:52         ` Vlastimil Babka (SUSE)
2026-06-23 12:43           ` zhen.ni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=444a6fd9-26dd-4d33-b836-987ef220b5f2@easystack.cn \
    --to=zhen.ni@easystack.cn \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=jackmanb@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox