From: Jiaqi Yan <jiaqiyan@google.com>
To: William Roche <william.roche@oracle.com>
Cc: Ackerley Tng <ackerleytng@google.com>,
	jgg@nvidia.com, akpm@linux-foundation.org,  ankita@nvidia.com,
	dave.hansen@linux.intel.com, david@redhat.com,
	 duenwen@google.com, jane.chu@oracle.com, jthoughton@google.com,
	 linmiaohe@huawei.com, linux-fsdevel@vger.kernel.org,
	 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	muchun.song@linux.dev,  nao.horiguchi@gmail.com,
	osalvador@suse.de, peterx@redhat.com,  rientjes@google.com,
	sidhartha.kumar@oracle.com, tony.luck@intel.com,
	 wangkefeng.wang@huawei.com, willy@infradead.org,
	harry.yoo@oracle.com
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
Date: Mon, 27 Oct 2025 21:17:26 -0700	[thread overview]
Message-ID: <CACw3F53ycL5xDwHC2dYxi9RXBAF=nQjwT4HWnRWewQLaHFU_kw@mail.gmail.com> (raw)
In-Reply-To: <24367c05-ad1f-4546-b2ed-69e587113a54@oracle.com>

On Tue, Oct 14, 2025 at 1:57 PM William Roche <william.roche@oracle.com> wrote:
>
> On 10/14/25 00:14, Jiaqi Yan wrote:
>  > On Fri, Sep 19, 2025 at 8:58 AM William Roche wrote:
>  > [...]
>  >>
>  >> Using this framework, I realized that the code provided here has a
>  >> problem:
>  >> When the error impacts a large folio, the release of this folio
>  >> doesn't isolate the sub-page(s) actually impacted by the poison.
>  >> __rmqueue_pcplist() can return a known poisoned page to
>  >> get_page_from_freelist().
>  >
>  > Just curious, how exactly can you repro this leak of a known poisoned
>  > page? It may help me debug my patch.
>  >
>
> When the memfd segment impacted by a memory error is released, the
> poisoned sub-page is left on the freelist, and a subsequent memory
> allocation (large enough to increase the chance of getting this page)
> crashes the system with a stack trace like the following:
>
> [  479.572513] RIP: 0010:clear_page_erms+0xb/0x20
> [...]
> [  479.587565]  post_alloc_hook+0xbd/0xd0
> [  479.588371]  get_page_from_freelist+0x3a6/0x6d0
> [  479.589221]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  479.590122]  __alloc_frozen_pages_noprof+0x186/0x380
> [  479.591012]  alloc_pages_mpol+0x7b/0x180
> [  479.591787]  vma_alloc_folio_noprof+0x70/0xf0
> [  479.592609]  alloc_anon_folio+0x1a0/0x3a0
> [  479.593401]  do_anonymous_page+0x13f/0x4d0
> [  479.594174]  ? pte_offset_map_rw_nolock+0x1f/0xa0
> [  479.595035]  __handle_mm_fault+0x581/0x6c0
> [  479.595799]  handle_mm_fault+0xcf/0x2a0
> [  479.596539]  do_user_addr_fault+0x22b/0x6e0
> [  479.597349]  exc_page_fault+0x67/0x170
> [  479.598095]  asm_exc_page_fault+0x26/0x30
>
> The idea is to run the test program in the VM, but instead of using
> madvise to poison the location, I take the guest physical address of
> the location, translate it with QEMU's 'gpa2hpa' monitor command, and
> inject the error on the hypervisor with the hwpoison-inject module
> (for example).
> Let the test program finish, then run a memory allocator (trying to
> take as much memory as possible).
> You should end up with a panic of the VM.
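
In case it helps others reproduce this, the flow above is roughly the
following (a sketch: the addresses are placeholders, and the QEMU
monitor / hwpoison-inject details are my reading of William's setup):

  # In the guest: record the guest physical address of the target page.
  # On the host, translate it from the QEMU monitor:
  (qemu) gpa2hpa 0x123456000    # placeholder GPA; prints the host physical address

  # On the hypervisor: inject the error by PFN (host PA >> PAGE_SHIFT)
  # through the hwpoison-inject debugfs interface (CONFIG_HWPOISON_INJECT):
  modprobe hwpoison-inject
  echo 0x123456 > /sys/kernel/debug/hwpoison/corrupt-pfn

  # Back in the guest: let the test program exit, then run an allocator
  # that grabs as much memory as possible until the poisoned sub-page is
  # handed out again and the VM panics.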

Thanks William, I can even repro this with the hugetlb-mfr selftest without a VM.

>
>  >>
>  >> This revealed some mm limitations, as I would have expected that the
>  >> check_new_pages() mechanism used by the __rmqueue functions would
>  >> filter these pages out, but I noticed that this has been disabled by
>  >> default in 2023 with:
>  >> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
>  >> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>  >
>  > Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
>  > and testing but didn't notice any WARNING about "bad page"; it is very
>  > likely I was just lucky.
>  >
>  >>
>  >>
>  >> This problem seems to be avoided if we call take_page_off_buddy(page)
>  >> in the filemap_offline_hwpoison_folio_hugetlb() function without
>  >> testing if PageBuddy(page) is true first.
>  >
>  > Oh, I think you are right: filemap_offline_hwpoison_folio_hugetlb
>  > shouldn't call take_page_off_buddy(page) depending on whether
>  > PageBuddy(page) is set. take_page_off_buddy itself checks PageBuddy
>  > on the page_head at the different page orders. So maybe a known
>  > poisoned page is not taken off the buddy allocator because of this?
>  >
>  > Let me try to fix it in v2, by the end of the week. If you could test
>  > with your way of repro as well, that will be very helpful!
>
>
> Of course, I'll run the test on your v2 version and let you know how it
> goes.

Sorry it took longer than I expected to prepare v2. I want to get rid
of populate_memfd_hwp_folios and instead insert
filemap_offline_hwpoison_folio into remove_inode_single_folio, so that
everything can be done on the fly in remove_inode_hugepages's while
loop. This refactor isn't as trivial as I thought.
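
On the PageBuddy() point above, the direction I have in mind looks
roughly like this (a hypothetical sketch, not the v2 code; the helper
name and the exact call site are made up):

/*
 * Hypothetical helper, called once the hugetlb folio has been dissolved
 * back to the buddy allocator: walk the base pages and pull every
 * hwpoisoned one off the freelists.  take_page_off_buddy() is called
 * unconditionally; it does its own PageBuddy() checks on the buddy
 * page_head at each order, so gating on PageBuddy(page) here could miss
 * a poisoned page that has been merged into a larger buddy block.
 */
static void isolate_hwpoisoned_subpages(struct page *head, unsigned int order)
{
        unsigned long i;

        for (i = 0; i < (1UL << order); i++) {
                struct page *page = head + i;

                if (PageHWPoison(page))
                        take_page_off_buddy(page);
        }
}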

I struggled with the page refcount for some time, for a couple of reasons:
1. filemap_offline_hwpoison_folio has to put one refcount on the
hwpoison-ed folio so it can be dissolved. But I immediately got a "BUG:
Bad page state in process" due to "page: refcount:-1".
2. It turns out that remove_inode_hugepages also puts the folios'
refcount via folio_batch_release. I avoided this double put for the
hwpoison-ed folio by removing it from the fbatch (see the sketch below).
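
The way the double put is avoided is roughly the following (a sketch
with a made-up helper name; in v2 the equivalent logic sits inline in
remove_inode_hugepages's loop):

/*
 * Hypothetical sketch: before folio_batch_release() drops the batch's
 * references, compact the batch so it no longer holds any hwpoisoned
 * folio whose reference the offline path has already put.
 */
static void folio_batch_skip_hwpoisoned(struct folio_batch *fbatch)
{
        unsigned int i, j = 0;

        for (i = 0; i < folio_batch_count(fbatch); i++) {
                struct folio *folio = fbatch->folios[i];

                /* The offline path already consumed this reference. */
                if (folio_test_hwpoison(folio))
                        continue;
                fbatch->folios[j++] = folio;
        }
        fbatch->nr = j;
}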

I have just tested v2 with the hugetlb-mfr selftest and didn't see any
"BUG: Bad page" for either a nonzero refcount or hwpoison after several
hours of runtime/uptime. Meanwhile, I will send v2 as a draft to you
for more test coverage.

>
>
>  >> But as I see it, this leaves a (small) race condition where a new
>  >> page allocation could get a poisoned sub-page between the dissolve
>  >> phase and the attempt to remove it from the buddy allocator.
>
> I still think that the way we recycle the impacted large page leaves a
> (much smaller) race window where a memory allocation can get the
> poisoned page, since we no longer have the checks that would filter the
> poisoned page out of the freelist.
> I'm not sure we have a way to recycle the page without there being a
> moment when the poisoned page sits on the freelist.
> (I'd be happy to be proven wrong ;) )
>
>
>  >> If performance requires using HugeTLB pages, then maybe we could
>  >> accept losing a huge page after a memory-error-impacted
>  >> MFD_MF_KEEP_UE_MAPPED memfd segment is released? If that can easily
>  >> avoid some other corruption.
>
> What I meant is: if we don't have a reliable way to recycle an impacted
> large page, we could start with a version of the code where we don't
> recycle it, just to avoid the risk...
>
>
>  >
>  > There is also another possible path if the VMM can change to back VM
>  > memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
>  > work [1], guest_memfd can split the 1G page for conversions. If we
>  > re-use that splitting for memory failure recovery, we can probably
>  > achieve something similar to THP's memory failure recovery: split the
>  > 1G page into 2M and 4k chunks, then unmap only the poisoned 4k page.
>  > We still lose the 1G TLB reach, so the VM may pay some performance
>  > penalty.
>  >
>  > [1]
> https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
>
>
> Thanks for the pointer.
> I personally think that splitting the large page into base pages is
> just fine.
> The main benefit I see in this project is to significantly increase
> the probability of surviving a memory error in VMs backed by large
> pages.
>
> HTH.
>
> Thanks a lot,
> William.
