All of lore.kernel.org
 help / color / mirror / Atom feed
From: Harry Yoo <harry.yoo@oracle.com>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: "Miaohe Lin" <linmiaohe@huawei.com>,
	"“William Roche" <william.roche@oracle.com>,
	"Ackerley Tng" <ackerleytng@google.com>,
	jgg@nvidia.com, akpm@linux-foundation.org, ankita@nvidia.com,
	dave.hansen@linux.intel.com, david@redhat.com,
	duenwen@google.com, jane.chu@oracle.com, jthoughton@google.com,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, muchun.song@linux.dev,
	nao.horiguchi@gmail.com, osalvador@suse.de, peterx@redhat.com,
	rientjes@google.com, sidhartha.kumar@oracle.com,
	tony.luck@intel.com, wangkefeng.wang@huawei.com,
	willy@infradead.org, vbabka@suse.cz, surenb@google.com,
	mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org,
	ziy@nvidia.com
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
Date: Mon, 3 Nov 2025 17:16:33 +0900	[thread overview]
Message-ID: <aQhk4WtDSaQmFFFo@harry> (raw)
In-Reply-To: <CACw3F51qaug5aWFNcjB54dVEc8yH+_A7zrkGcQyKXKJs6uVvgA@mail.gmail.com>

On Thu, Oct 30, 2025 at 10:28:48AM -0700, Jiaqi Yan wrote:
> On Thu, Oct 30, 2025 at 4:51 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >
> > On 2025/10/28 15:00, Harry Yoo wrote:
> > > On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> > >> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >>>
> > >>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > >>>> On Fri, Sep 19, 2025 at 8:58 AM “William Roche <william.roche@oracle.com> wrote:
> > >>>>>
> > >>>>> From: William Roche <william.roche@oracle.com>
> > >>>>>
> > >>>>> Hello,
> > >>>>>
> > >>>>> The possibility to keep a VM using large hugetlbfs pages running after a memory
> > >>>>> error is very important, and the possibility described here could be a good
> > >>>>> candidate to address this issue.
> > >>>>
> > >>>> Thanks for expressing interest, William, and sorry for getting back to
> > >>>> you so late.
> > >>>>
> > >>>>>
> > >>>>> So I would like to provide my feedback after testing this code with the
> > >>>>> introduction of persistent errors in the address space: My tests used a VM
> > >>>>> running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > >>>>> test program provided with this project. But instead of injecting the errors
> > >>>>> with madvise calls from this program, I get the guest physical address of a
> > >>>>> location and inject the error from the hypervisor into the VM, so that any
> > >>>>> subsequent access to the location is prevented directly from the hypervisor
> > >>>>> level.
> > >>>>
> > >>>> This is exactly what VMM should do: when it owns or manages the VM
> > >>>> memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> > >>>> isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> > >>>> such memory accesses.
> > >>>>
> > >>>>>
> > >>>>> Using this framework, I realized that the code provided here has a problem:
> > >>>>> When the error impacts a large folio, the release of this folio doesn't isolate
> > >>>>> the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > >>>>> a known poisoned page to get_page_from_freelist().
> > >>>>
> > >>>> Just curious, how exactly you can repro this leaking of a known poison
> > >>>> page? It may help me debug my patch.
> > >>>>
> > >>>>>
> > >>>>> This revealed some mm limitations, as I would have expected that the
> > >>>>> check_new_pages() mechanism used by the __rmqueue functions would filter these
> > >>>>> pages out, but I noticed that this has been disabled by default in 2023 with:
> > >>>>> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > >>>>> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz 
> > >>>>
> > >>>> Thanks for the reference. I did turned on CONFIG_DEBUG_VM=y during dev
> > >>>> and testing but didn't notice any WARNING on "bad page"; It is very
> > >>>> likely I was just lucky.
> > >>>>
> > >>>>>
> > >>>>>
> > >>>>> This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > >>>>> filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > >>>>> PageBuddy(page) is true first.
> > >>>>
> > >>>> Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> > >>>> shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> > >>>> not. take_page_off_buddy will check PageBuddy or not, on the page_head
> > >>>> of different page orders. So maybe somehow a known poisoned page is
> > >>>> not taken off from buddy allocator due to this?
> > >>>
> > >>> Maybe it's the case where the poisoned page is merged to a larger page,
> > >>> and the PGTY_buddy flag is set on its buddy of the poisoned page, so
> > >>> PageBuddy() returns false?:
> > >>>
> > >>>   [ free page A ][ free page B (poisoned) ]
> > >>>
> > >>> When these two are merged, then we set PGTY_buddy on page A but not on B.
> > >>
> > >> Thanks Harry!
> > >>
> > >> It is indeed this case. I validate by adding some debug prints in
> > >> take_page_off_buddy:
> > >>
> > >> [ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
> > >> [ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1
> > >>
> > >> In this case, page for 0x2800200 is hwpoisoned, and its buddy page is
> > >> 0x2800000 with order 10.
> > >
> > > Woohoo, I got it right!
> > >
> > >>> But even after fixing that we need to fix the race condition.
> > >>
> > >> What exactly is the race condition you are referring to?
> > >
> > > When you free a high-order page, the buddy allocator doesn't not check
> > > PageHWPoison() on the page and its subpages. It checks PageHWPoison()
> > > only when you free a base (order-0) page, see free_pages_prepare().
> >
> > I think we might could check PageHWPoison() for subpages as what free_page_is_bad()
> > does. If any subpage has HWPoisoned flag set, simply drop the folio. Even we could
>
> Agree, I think as a starter I could try to, for example, let
> free_pages_prepare scan HWPoison-ed subpages if the base page is high
> order. In the optimal case, HugeTLB does move PageHWPoison flag from
> head page to the raw error pages.

[+Cc page allocator folks]

AFAICT enabling page sanity check in page alloc/free path would be against
past efforts to reduce sanity check overhead.

[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz

I'd recommend to check hwpoison flag before freeing it to the buddy
when we know a memory error has occurred (I guess that's also what Miaohe
suggested).

> > do it better -- Split the folio and let healthy subpages join the buddy while reject
> > the hwpoisoned one.
> >
> > >
> > > AFAICT there is nothing that prevents the poisoned page to be
> > > allocated back to users because the buddy doesn't check PageHWPoison()
> > > on allocation as well (by default).
> > >
> > > So rather than freeing the high-order page as-is in
> > > dissolve_free_hugetlb_folio(), I think we have to split it to base pages
> > > and then free them one by one.
> >
> > It might not be worth to do that as this would significantly increase the overhead
> > of the function while memory failure event is really rare.
> 
> IIUC, Harry's idea is to do the split in dissolve_free_hugetlb_folio
> only if folio is HWPoison-ed, similar to what Miaohe suggested
> earlier.

Yes, and if we do the check before moving HWPoison flag to raw pages,
it'll be just a single folio_test_hwpoison() call.

> BTW, I believe this race condition already exists today when
> memory_failure handles HWPoison-ed free hugetlb page; it is not
> something introduced via this patchset. I will fix or improve this in
> a separate patchset.

That makes sense.

Thanks for working on this!

> > > That way, free_pages_prepare() will catch that it's poisoned and won't
> > > add it back to the freelist. Otherwise there will always be a window
> > > where the poisoned page can be allocated to users - before it's taken
> > > off from the buddy.
> > >
> >

-- 
Cheers,
Harry / Hyeonggon


  parent reply	other threads:[~2025-11-03  8:17 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-18 23:15 [RFC PATCH v1 0/3] Userspace MFR Policy via memfd Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 1/3] mm: memfd/hugetlb: introduce userspace memory failure recovery policy Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 2/3] selftests/mm: test userspace MFR for HugeTLB 1G hugepage Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 3/3] Documentation: add userspace MF recovery policy via memfd Jiaqi Yan
2025-01-20 17:26 ` [RFC PATCH v1 0/3] Userspace MFR Policy " Jason Gunthorpe
2025-01-21 21:45   ` Jiaqi Yan
2025-01-22 16:41 ` Zi Yan
2025-09-19 15:58 ` “William Roche
2025-10-13 22:14   ` Jiaqi Yan
2025-10-14 20:57     ` William Roche
2025-10-28  4:17       ` Jiaqi Yan
2025-10-22 13:09     ` Harry Yoo
2025-10-28  4:17       ` Jiaqi Yan
2025-10-28  7:00         ` Harry Yoo
2025-10-30 11:51           ` Miaohe Lin
2025-10-30 17:28             ` Jiaqi Yan
2025-10-30 21:28               ` Jiaqi Yan
2025-11-03  8:16               ` Harry Yoo [this message]
2025-11-03  8:53                 ` Harry Yoo
2025-11-03 16:57                   ` Jiaqi Yan
2025-11-04  3:44                     ` Miaohe Lin
2025-11-06  7:53                     ` Harry Yoo
2025-11-12  1:28                       ` Jiaqi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQhk4WtDSaQmFFFo@harry \
    --to=harry.yoo@oracle.com \
    --cc=ackerleytng@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=ankita@nvidia.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=duenwen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=jackmanb@google.com \
    --cc=jane.chu@oracle.com \
    --cc=jgg@nvidia.com \
    --cc=jiaqiyan@google.com \
    --cc=jthoughton@google.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=nao.horiguchi@gmail.com \
    --cc=osalvador@suse.de \
    --cc=peterx@redhat.com \
    --cc=rientjes@google.com \
    --cc=sidhartha.kumar@oracle.com \
    --cc=surenb@google.com \
    --cc=tony.luck@intel.com \
    --cc=vbabka@suse.cz \
    --cc=wangkefeng.wang@huawei.com \
    --cc=william.roche@oracle.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.