* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
[not found] <20250118231549.1652825-1-jiaqiyan@google.com>
@ 2025-09-19 15:58 ` William Roche
2025-10-13 22:14 ` Jiaqi Yan
0 siblings, 1 reply; 16+ messages in thread
From: William Roche @ 2025-09-19 15:58 UTC (permalink / raw)
To: jiaqiyan, jgg
Cc: akpm, ankita, dave.hansen, david, duenwen, jane.chu, jthoughton,
linmiaohe, linux-fsdevel, linux-kernel, linux-mm, muchun.song,
nao.horiguchi, osalvador, peterx, rientjes, sidhartha.kumar,
tony.luck, wangkefeng.wang, willy, harry.yoo
From: William Roche <william.roche@oracle.com>
Hello,
The ability to keep a VM using large hugetlbfs pages running after a memory
error is very important, and the mechanism described here could be a good
candidate to address this issue.
So I would like to provide my feedback after testing this code with the
introduction of persistent errors in the address space: my tests used a VM
running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
test program provided with this project. But instead of injecting the errors
with madvise calls from this program, I take the guest physical address of a
location and inject the error from the hypervisor into the VM, so that any
subsequent access to the location is blocked directly at the hypervisor
level.
Using this framework, I realized that the code provided here has a problem:
When the error impacts a large folio, the release of this folio doesn't isolate
the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
a known poisoned page to get_page_from_freelist().
This revealed some mm limitations, as I would have expected that the
check_new_pages() mechanism used by the __rmqueue functions would filter these
pages out, but I noticed that this has been disabled by default in 2023 with:
[PATCH] mm, page_alloc: reduce page alloc/free sanity checks
https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
This problem seems to be avoided if we call take_page_off_buddy(page) in the
filemap_offline_hwpoison_folio_hugetlb() function without testing if
PageBuddy(page) is true first.
But in my opinion it leaves a (small) race window where a new page
allocation could get a poisoned sub-page between the dissolve phase and the
attempt to remove it from the buddy allocator.
I do have the impression that a correct behavior (isolating an impacted
sub-page and remapping the valid memory content) using large pages is
currently only achieved with Transparent Huge Pages.
If performance requires using Hugetlb pages, then maybe we could accept to
lose a huge page after a memory impacted MFD_MF_KEEP_UE_MAPPED memfd segment
is released? If it can easily avoid some other corruption.
I'm very interested in finding an appropriate way to deal with memory errors on
hugetlbfs pages, and willing to help build a valid solution. This project
showed a real possibility to do so, even in cases where pinned memory is used -
with VFIO for example.
I would really be interested in knowing your feedback about this project, and
if another solution is considered better suited to deal with errors on hugetlbfs
pages, please let us know.
Thanks in advance for your answers.
William.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-09-19 15:58 ` [RFC PATCH v1 0/3] Userspace MFR Policy via memfd William Roche
@ 2025-10-13 22:14 ` Jiaqi Yan
2025-10-14 20:57 ` William Roche
2025-10-22 13:09 ` Harry Yoo
0 siblings, 2 replies; 16+ messages in thread
From: Jiaqi Yan @ 2025-10-13 22:14 UTC (permalink / raw)
To: William Roche, Ackerley Tng
Cc: jgg, akpm, ankita, dave.hansen, david, duenwen, jane.chu,
jthoughton, linmiaohe, linux-fsdevel, linux-kernel, linux-mm,
muchun.song, nao.horiguchi, osalvador, peterx, rientjes,
sidhartha.kumar, tony.luck, wangkefeng.wang, willy, harry.yoo
On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
>
> From: William Roche <william.roche@oracle.com>
>
> Hello,
>
> The possibility to keep a VM using large hugetlbfs pages running after a memory
> error is very important, and the possibility described here could be a good
> candidate to address this issue.
Thanks for expressing interest, William, and sorry for getting back to
you so late.
>
> So I would like to provide my feedback after testing this code with the
> introduction of persistent errors in the address space: My tests used a VM
> running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> test program provided with this project. But instead of injecting the errors
> with madvise calls from this program, I get the guest physical address of a
> location and inject the error from the hypervisor into the VM, so that any
> subsequent access to the location is prevented directly from the hypervisor
> level.
This is exactly what the VMM should do: when it owns or manages the VM
memory with MFD_MF_KEEP_UE_MAPPED, it is then the VMM's responsibility to
isolate the guest/VCPUs from poisoned memory pages, e.g. by intercepting
such memory accesses.
>
> Using this framework, I realized that the code provided here has a problem:
> When the error impacts a large folio, the release of this folio doesn't isolate
> the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> a known poisoned page to get_page_from_freelist().
Just curious, how exactly can you repro this leaking of a known poisoned
page? It may help me debug my patch.
>
> This revealed some mm limitations, as I would have expected that the
> check_new_pages() mechanism used by the __rmqueue functions would filter these
> pages out, but I noticed that this has been disabled by default in 2023 with:
> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
and testing but didn't notice any WARNING on "bad page"; it is very
likely I was just lucky.
>
>
> This problem seems to be avoided if we call take_page_off_buddy(page) in the
> filemap_offline_hwpoison_folio_hugetlb() function without testing if
> PageBuddy(page) is true first.
Oh, I think you are right: filemap_offline_hwpoison_folio_hugetlb
shouldn't make the call to take_page_off_buddy(page) depend on whether
PageBuddy(page) is true. take_page_off_buddy itself checks PageBuddy on
the page_head at each page order. So maybe somehow a known poisoned page
is not taken off the buddy allocator due to this?
Let me try to fix it in v2, by the end of the week. If you could test
with your way of repro as well, that will be very helpful!
> But according to me it leaves a (small) race condition where a new page
> allocation could get a poisoned sub-page between the dissolve phase and the
> attempt to remove it from the buddy allocator.
>
> I do have the impression that a correct behavior (isolating an impacted
> sub-page and remapping the valid memory content) using large pages is
> currently only achieved with Transparent Huge Pages.
> If performance requires using Hugetlb pages, then maybe we could
> lose a huge page after a memory impacted MFD_MF_KEEP_UE_MAPPED memfd segment
> is released? If it can easily avoid some other corruption.
>
> I'm very interested in finding an appropriate way to deal with memory errors on
> hugetlbfs pages, and willing to help to build a valid solution. This project
> showed a real possibility to do so, even in cases where pinned memory is used -
> with VFIO for example.
>
> I would really be interested in knowing your feedback about this project, and
> if another solution is considered more adapted to deal with errors on hugetlbfs
> pages, please let us know.
There is also another possible path if the VMM can change to backing VM
memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
work [1], guest_memfd can split the 1G page for conversions. If we
re-use the splitting for memory failure recovery, we can probably
achieve something generally similar to THP's memory failure recovery:
split the 1G page into 2M and 4k chunks, then unmap only the 4k poisoned
page. We still lose the 1G TLB reach, so the VM may be subject to some
performance sacrifice.
[1] https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
>
> Thanks in advance for your answers.
> William.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-13 22:14 ` Jiaqi Yan
@ 2025-10-14 20:57 ` William Roche
2025-10-28 4:17 ` Jiaqi Yan
2025-10-22 13:09 ` Harry Yoo
1 sibling, 1 reply; 16+ messages in thread
From: William Roche @ 2025-10-14 20:57 UTC (permalink / raw)
To: Jiaqi Yan, Ackerley Tng
Cc: jgg, akpm, ankita, dave.hansen, david, duenwen, jane.chu,
jthoughton, linmiaohe, linux-fsdevel, linux-kernel, linux-mm,
muchun.song, nao.horiguchi, osalvador, peterx, rientjes,
sidhartha.kumar, tony.luck, wangkefeng.wang, willy, harry.yoo
On 10/14/25 00:14, Jiaqi Yan wrote:
> On Fri, Sep 19, 2025 at 8:58 AM William Roche wrote:
> [...]
>>
>> Using this framework, I realized that the code provided here has a
>> problem:
>> When the error impacts a large folio, the release of this folio
>> doesn't isolate the sub-page(s) actually impacted by the poison.
>> __rmqueue_pcplist() can return a known poisoned page to
>> get_page_from_freelist().
>
> Just curious, how exactly you can repro this leaking of a known poison
> page? It may help me debug my patch.
>
When the memfd segment impacted by a memory error is released, the
poisoned sub-page is not removed from the freelist, and an allocation of
memory (large enough to increase the chance of getting this page) crashes
the system with the following stack trace (for example):
[ 479.572513] RIP: 0010:clear_page_erms+0xb/0x20
[...]
[ 479.587565] post_alloc_hook+0xbd/0xd0
[ 479.588371] get_page_from_freelist+0x3a6/0x6d0
[ 479.589221] ? srso_alias_return_thunk+0x5/0xfbef5
[ 479.590122] __alloc_frozen_pages_noprof+0x186/0x380
[ 479.591012] alloc_pages_mpol+0x7b/0x180
[ 479.591787] vma_alloc_folio_noprof+0x70/0xf0
[ 479.592609] alloc_anon_folio+0x1a0/0x3a0
[ 479.593401] do_anonymous_page+0x13f/0x4d0
[ 479.594174] ? pte_offset_map_rw_nolock+0x1f/0xa0
[ 479.595035] __handle_mm_fault+0x581/0x6c0
[ 479.595799] handle_mm_fault+0xcf/0x2a0
[ 479.596539] do_user_addr_fault+0x22b/0x6e0
[ 479.597349] exc_page_fault+0x67/0x170
[ 479.598095] asm_exc_page_fault+0x26/0x30
The idea is to run the test program in the VM, but instead of using
madvise to poison the location, I take the guest physical address of the
location and translate it with QEMU's 'gpa2hpa' monitor command, so that
I can inject the error on the hypervisor with the hwpoison-inject module
(for example).
Let the test program finish, then run a memory allocator (trying to take
as much memory as possible): you should end up with a panic of the VM.
>>
>> This revealed some mm limitations, as I would have expected that the
>> check_new_pages() mechanism used by the __rmqueue functions would
>> filter these pages out, but I noticed that this has been disabled by
>> default in 2023 with:
>> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
>> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>
> Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
> and testing but didn't notice any WARNING on "bad page"; it is very
> likely I was just lucky.
>
>>
>>
>> This problem seems to be avoided if we call take_page_off_buddy(page)
>> in the filemap_offline_hwpoison_folio_hugetlb() function without
>> testing if PageBuddy(page) is true first.
>
> Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> not. take_page_off_buddy will check PageBuddy or not, on the page_head
> of different page orders. So maybe somehow a known poisoned page is
> not taken off from buddy allocator due to this?
>
> Let me try to fix it in v2, by the end of the week. If you could test
> with your way of repro as well, that will be very helpful!
Of course, I'll run the test on your v2 version and let you know how it
goes.
>> But according to me it leaves a (small) race condition where a new
>> page allocation could get a poisoned sub-page between the dissolve
>> phase and the attempt to remove it from the buddy allocator.
I still think that the way we recycle the impacted large page has a
(much smaller) race window where a memory allocation can get the
poisoned page, as we don't have the checks to filter the poisoned page
out of the freelist.
I'm not sure we have a way to recycle the page without having a moment
when the poisoned page is in the freelist.
(I'd be happy to be proven wrong ;) )
>> If performance requires using Hugetlb pages, then maybe we could
>> accept to lose a huge page after a memory impacted
>> MFD_MF_KEEP_UE_MAPPED memfd segment is released? If it can easily
>> avoid some other corruption.
What I meant is: if we don't have a reliable way to recycle an impacted
large page, we could start with a version of the code where we don't
recycle it, just to avoid the risk...
>
> There is also another possible path if VMM can change to back VM
> memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
> work [1], guest_memfd can split the 1G page for conversions. If we
> re-use the splitting for memory failure recovery, we can probably
> achieve something generally similar to THP's memory failure recovery:
> split 1G to 2M and 4k chunks, then unmap only 4k of poisoned page. We
> still lose the 1G TLB size so VM may be subject to some performance
> sacrifice.
>
> [1]
https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
Thanks for the pointer.
I personally think that splitting the large page into base pages is
just fine.
The main possibility I see in this project is to significantly increase
the probability of surviving a memory error on VMs backed by large pages.
HTH.
Thanks a lot,
William.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-13 22:14 ` Jiaqi Yan
2025-10-14 20:57 ` William Roche
@ 2025-10-22 13:09 ` Harry Yoo
2025-10-28 4:17 ` Jiaqi Yan
1 sibling, 1 reply; 16+ messages in thread
From: Harry Yoo @ 2025-10-22 13:09 UTC (permalink / raw)
To: Jiaqi Yan
Cc: William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linmiaohe,
linux-fsdevel, linux-kernel, linux-mm, muchun.song, nao.horiguchi,
osalvador, peterx, rientjes, sidhartha.kumar, tony.luck,
wangkefeng.wang, willy
On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
> >
> > From: William Roche <william.roche@oracle.com>
> >
> > Hello,
> >
> > The possibility to keep a VM using large hugetlbfs pages running after a memory
> > error is very important, and the possibility described here could be a good
> > candidate to address this issue.
>
> Thanks for expressing interest, William, and sorry for getting back to
> you so late.
>
> >
> > So I would like to provide my feedback after testing this code with the
> > introduction of persistent errors in the address space: My tests used a VM
> > running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > test program provided with this project. But instead of injecting the errors
> > with madvise calls from this program, I get the guest physical address of a
> > location and inject the error from the hypervisor into the VM, so that any
> > subsequent access to the location is prevented directly from the hypervisor
> > level.
>
> This is exactly what VMM should do: when it owns or manages the VM
> memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> such memory accesses.
>
> >
> > Using this framework, I realized that the code provided here has a problem:
> > When the error impacts a large folio, the release of this folio doesn't isolate
> > the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > a known poisoned page to get_page_from_freelist().
>
> Just curious, how exactly you can repro this leaking of a known poison
> page? It may help me debug my patch.
>
> >
> > This revealed some mm limitations, as I would have expected that the
> > check_new_pages() mechanism used by the __rmqueue functions would filter these
> > pages out, but I noticed that this has been disabled by default in 2023 with:
> > [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>
> > Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
> > and testing but didn't notice any WARNING on "bad page"; it is very
> > likely I was just lucky.
>
> >
> >
> > This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > PageBuddy(page) is true first.
>
> Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> not. take_page_off_buddy will check PageBuddy or not, on the page_head
> of different page orders. So maybe somehow a known poisoned page is
> not taken off from buddy allocator due to this?
Maybe it's the case where the poisoned page is merged into a larger page,
and the PGTY_buddy flag is set on the buddy of the poisoned page, so
PageBuddy() returns false:
[ free page A ][ free page B (poisoned) ]
When these two are merged, then we set PGTY_buddy on page A but not on B.
But even after fixing that we need to fix the race condition.
> Let me try to fix it in v2, by the end of the week. If you could test
> with your way of repro as well, that will be very helpful!
>
> > But according to me it leaves a (small) race condition where a new page
> > allocation could get a poisoned sub-page between the dissolve phase and the
> > attempt to remove it from the buddy allocator.
> >
> > I do have the impression that a correct behavior (isolating an impacted
> > sub-page and remapping the valid memory content) using large pages is
> > currently only achieved with Transparent Huge Pages.
> > If performance requires using Hugetlb pages, then maybe we could accept to
> > lose a huge page after a memory impacted MFD_MF_KEEP_UE_MAPPED memfd segment
> > is released? If it can easily avoid some other corruption.
> >
> > I'm very interested in finding an appropriate way to deal with memory errors on
> > hugetlbfs pages, and willing to help to build a valid solution. This project
> > showed a real possibility to do so, even in cases where pinned memory is used -
> > with VFIO for example.
> >
> > I would really be interested in knowing your feedback about this project, and
> > if another solution is considered more adapted to deal with errors on hugetlbfs
> > pages, please let us know.
>
> There is also another possible path if VMM can change to back VM
> memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
> work [1], guest_memfd can split the 1G page for conversions. If we
> re-use the splitting for memory failure recovery, we can probably
> achieve something generally similar to THP's memory failure recovery:
> split 1G to 2M and 4k chunks, then unmap only 4k of poisoned page. We
> still lose the 1G TLB size so VM may be subject to some performance
> sacrifice.
> [1] https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
I want to take a closer look at the actual patches but either way sounds
good to me.
By the way, please Cc me in future revisions :)
Thanks!
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-14 20:57 ` William Roche
@ 2025-10-28 4:17 ` Jiaqi Yan
0 siblings, 0 replies; 16+ messages in thread
From: Jiaqi Yan @ 2025-10-28 4:17 UTC (permalink / raw)
To: William Roche
Cc: Ackerley Tng, jgg, akpm, ankita, dave.hansen, david, duenwen,
jane.chu, jthoughton, linmiaohe, linux-fsdevel, linux-kernel,
linux-mm, muchun.song, nao.horiguchi, osalvador, peterx, rientjes,
sidhartha.kumar, tony.luck, wangkefeng.wang, willy, harry.yoo
On Tue, Oct 14, 2025 at 1:57 PM William Roche <william.roche@oracle.com> wrote:
>
> On 10/14/25 00:14, Jiaqi Yan wrote:
> > On Fri, Sep 19, 2025 at 8:58 AM William Roche wrote:
> > [...]
> >>
> >> Using this framework, I realized that the code provided here has a
> >> problem:
> >> When the error impacts a large folio, the release of this folio
> >> doesn't isolate the sub-page(s) actually impacted by the poison.
> >> __rmqueue_pcplist() can return a known poisoned page to
> >> get_page_from_freelist().
> >
> > Just curious, how exactly you can repro this leaking of a known poison
> > page? It may help me debug my patch.
> >
>
> When the memfd segment impacted by a memory error is released, the
> sub-page impacted by a memory error is not removed from the freelist and
> an allocation of memory (large enough to increase the chance to get this
> page) crashes the system with the following stack trace (for example):
>
> [ 479.572513] RIP: 0010:clear_page_erms+0xb/0x20
> [...]
> [ 479.587565] post_alloc_hook+0xbd/0xd0
> [ 479.588371] get_page_from_freelist+0x3a6/0x6d0
> [ 479.589221] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 479.590122] __alloc_frozen_pages_noprof+0x186/0x380
> [ 479.591012] alloc_pages_mpol+0x7b/0x180
> [ 479.591787] vma_alloc_folio_noprof+0x70/0xf0
> [ 479.592609] alloc_anon_folio+0x1a0/0x3a0
> [ 479.593401] do_anonymous_page+0x13f/0x4d0
> [ 479.594174] ? pte_offset_map_rw_nolock+0x1f/0xa0
> [ 479.595035] __handle_mm_fault+0x581/0x6c0
> [ 479.595799] handle_mm_fault+0xcf/0x2a0
> [ 479.596539] do_user_addr_fault+0x22b/0x6e0
> [ 479.597349] exc_page_fault+0x67/0x170
> [ 479.598095] asm_exc_page_fault+0x26/0x30
>
> The idea is to run the test program in the VM and instead of using
> madvise to poison the location, I take the physical address of the
> location, and use Qemu 'gpa2hpa' address of the location,
> so that I can inject the error on the hypervisor with the
> hwpoison-inject module (for example).
> Let the test program finish and run a memory allocator (trying to take
> as much memory as possible)
> You should end up on a panic of the VM.
Thanks William, I can even repro with the hugetlb-mfr selftest without a VM.
>
> >>
> >> This revealed some mm limitations, as I would have expected that the
> >> check_new_pages() mechanism used by the __rmqueue functions would
> >> filter these pages out, but I noticed that this has been disabled by
> >> default in 2023 with:
> >> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> >> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
> > and testing but didn't notice any WARNING on "bad page"; it is very
> > likely I was just lucky.
> >
> >>
> >>
> >> This problem seems to be avoided if we call take_page_off_buddy(page)
> >> in the filemap_offline_hwpoison_folio_hugetlb() function without
> >> testing if PageBuddy(page) is true first.
> >
> > Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> > shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> > not. take_page_off_buddy will check PageBuddy or not, on the page_head
> > of different page orders. So maybe somehow a known poisoned page is
> > not taken off from buddy allocator due to this?
> >
> > Let me try to fix it in v2, by the end of the week. If you could test
> > with your way of repro as well, that will be very helpful!
>
>
> Of course, I'll run the test on your v2 version and let you know how it
> goes.
Sorry it took longer than I expected to prepare v2. I want to get rid of
populate_memfd_hwp_folios and insert
filemap_offline_hwpoison_folio into remove_inode_single_folio so that
everything can be done on the fly in remove_inode_hugepages's while
loop. This refactor isn't as trivial as I thought.
I struggled with page refcounts for some time, for a couple of reasons:
1. filemap_offline_hwpoison_folio has to put 1 refcount on the hwpoison-ed
folio so it can be dissolved. But I immediately got a "BUG: Bad page
state in process" due to "page: refcount:-1".
2. It turns out that remove_inode_hugepages also puts the folios'
refcount via folio_batch_release. I avoided this for the hwpoison-ed folio
by removing it from the fbatch.
I have just tested v2 with the hugetlb-mfr selftest and didn't see
"BUG: Bad page" for either nonzero refcount or hwpoison after some
hours of running/up time. Meanwhile, I will send v2 as a draft to you
for more test coverage.
>
>
> >> But according to me it leaves a (small) race condition where a new
> >> page allocation could get a poisoned sub-page between the dissolve
> >> phase and the attempt to remove it from the buddy allocator.
>
> I still think that the way we recycle the impacted large page still has
> a (much smaller) race condition where a memory allocation can get the
> poisoned page, as we don't have the checks to filter the poisoned page
> from the freelist.
> I'm not sure we have a way to recycle the page without having a moment
> when the poison page is in the freelist.
> (I'd be happy to be proven wrong ;) )
>
>
> >> If performance requires using Hugetlb pages, then maybe we could
> >> accept to lose a huge page after a memory impacted
> >> MFD_MF_KEEP_UE_MAPPED memfd segment is released? If it can easily
> >> avoid some other corruption.
>
> What I meant is: if we don't have a reliable way to recycle an impacted
> large page, we could start with a version of the code where we don't
> recycle it, just to avoid the risk...
>
>
> >
> > There is also another possible path if VMM can change to back VM
> > memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
> > work [1], guest_memfd can split the 1G page for conversions. If we
> > re-use the splitting for memory failure recovery, we can probably
> > achieve something generally similar to THP's memory failure recovery:
> > split 1G to 2M and 4k chunks, then unmap only 4k of poisoned page. We
> > still lose the 1G TLB size so VM may be subject to some performance
> > sacrifice.
> >
> > [1]
> https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
>
>
> Thanks for the pointer.
> I personally think that splitting the large page into base pages, is
> just fine.
> The main possibility I see in this project is to significantly increase
> the probability to survive a memory error on large pages backed VMs.
>
> HTH.
>
> Thanks a lot,
> William.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-22 13:09 ` Harry Yoo
@ 2025-10-28 4:17 ` Jiaqi Yan
2025-10-28 7:00 ` Harry Yoo
0 siblings, 1 reply; 16+ messages in thread
From: Jiaqi Yan @ 2025-10-28 4:17 UTC (permalink / raw)
To: Harry Yoo, William Roche
Cc: Ackerley Tng, jgg, akpm, ankita, dave.hansen, david, duenwen,
jane.chu, jthoughton, linmiaohe, linux-fsdevel, linux-kernel,
linux-mm, muchun.song, nao.horiguchi, osalvador, peterx, rientjes,
sidhartha.kumar, tony.luck, wangkefeng.wang, willy
On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
> > >
> > > From: William Roche <william.roche@oracle.com>
> > >
> > > Hello,
> > >
> > > The possibility to keep a VM using large hugetlbfs pages running after a memory
> > > error is very important, and the possibility described here could be a good
> > > candidate to address this issue.
> >
> > Thanks for expressing interest, William, and sorry for getting back to
> > you so late.
> >
> > >
> > > So I would like to provide my feedback after testing this code with the
> > > introduction of persistent errors in the address space: My tests used a VM
> > > running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > > test program provided with this project. But instead of injecting the errors
> > > with madvise calls from this program, I get the guest physical address of a
> > > location and inject the error from the hypervisor into the VM, so that any
> > > subsequent access to the location is prevented directly from the hypervisor
> > > level.
> >
> > This is exactly what VMM should do: when it owns or manages the VM
> > memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> > isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> > such memory accesses.
> >
> > >
> > > Using this framework, I realized that the code provided here has a problem:
> > > When the error impacts a large folio, the release of this folio doesn't isolate
> > > the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > > a known poisoned page to get_page_from_freelist().
> >
> > Just curious, how exactly you can repro this leaking of a known poison
> > page? It may help me debug my patch.
> >
> > >
> > > This revealed some mm limitations, as I would have expected that the
> > > check_new_pages() mechanism used by the __rmqueue functions would filter these
> > > pages out, but I noticed that this has been disabled by default in 2023 with:
> > > [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > > https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
> > and testing but didn't notice any WARNING on "bad page"; it is very
> > likely I was just lucky.
> >
> > >
> > >
> > > This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > > filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > > PageBuddy(page) is true first.
> >
> > Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> > shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> > not. take_page_off_buddy will check PageBuddy or not, on the page_head
> > of different page orders. So maybe somehow a known poisoned page is
> > not taken off from buddy allocator due to this?
>
> Maybe it's the case where the poisoned page is merged to a larger page,
> and the PGTY_buddy flag is set on its buddy of the poisoned page, so
> PageBuddy() returns false?:
>
> [ free page A ][ free page B (poisoned) ]
>
> When these two are merged, then we set PGTY_buddy on page A but not on B.
Thanks Harry!
It is indeed this case. I validated it by adding some debug prints in
take_page_off_buddy:
[ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
[ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
[ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
[ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
[ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
[ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
[ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
[ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
[ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
[ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
[ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
[ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1
In this case, page for 0x2800200 is hwpoisoned, and its buddy page is
0x2800000 with order 10.
>
> But even after fixing that we need to fix the race condition.
What exactly is the race condition you are referring to?
>
> > Let me try to fix it in v2, by the end of the week. If you could test
> > with your way of repro as well, that will be very helpful!
> >
> > > But in my view it leaves a (small) race condition where a new page
> > > allocation could get a poisoned sub-page between the dissolve phase and the
> > > attempt to remove it from the buddy allocator.
> > >
> > > I do have the impression that a correct behavior (isolating an impacted
> > > sub-page and remapping the valid memory content) using large pages is
> > > currently only achieved with Transparent Huge Pages.
> > > If performance requires using Hugetlb pages, then maybe we could accept
> > > losing a huge page after a memory-impacted MFD_MF_KEEP_UE_MAPPED memfd
> > > segment is released? If it can easily avoid some other corruption.
> > >
> > > I'm very interested in finding an appropriate way to deal with memory errors on
> > > hugetlbfs pages, and willing to help to build a valid solution. This project
> > > showed a real possibility to do so, even in cases where pinned memory is used -
> > > with VFIO for example.
> > >
> > > I would really be interested in knowing your feedback about this project, and
> > > if another solution is considered better suited to deal with errors on hugetlbfs
> > > pages, please let us know.
> >
> > There is also another possible path if the VMM can change to back VM
> > memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
> > work [1], guest_memfd can split the 1G page for conversions. If we
> > re-use the splitting for memory failure recovery, we can probably
> > achieve something generally similar to THP's memory failure recovery:
> > split 1G into 2M and 4k chunks, then unmap only the 4k poisoned page.
> > We still lose the 1G TLB size, so the VM may be subject to some
> > performance penalty.
> > [1] https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
>
> I want to take a closer look at the actual patches but either way sounds
> good to me.
>
> By the way, please Cc me in future revisions :)
For sure!
>
> Thanks!
>
> --
> Cheers,
> Harry / Hyeonggon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-28 4:17 ` Jiaqi Yan
@ 2025-10-28 7:00 ` Harry Yoo
2025-10-30 11:51 ` Miaohe Lin
0 siblings, 1 reply; 16+ messages in thread
From: Harry Yoo @ 2025-10-28 7:00 UTC (permalink / raw)
To: Jiaqi Yan
Cc: William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linmiaohe,
linux-fsdevel, linux-kernel, linux-mm, muchun.song, nao.horiguchi,
osalvador, peterx, rientjes, sidhartha.kumar, tony.luck,
wangkefeng.wang, willy
On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > > On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
> > > >
> > > > From: William Roche <william.roche@oracle.com>
> > > >
> > > > Hello,
> > > >
> > > > The possibility to keep a VM using large hugetlbfs pages running after a memory
> > > > error is very important, and the possibility described here could be a good
> > > > candidate to address this issue.
> > >
> > > Thanks for expressing interest, William, and sorry for getting back to
> > > you so late.
> > >
> > > >
> > > > So I would like to provide my feedback after testing this code with the
> > > > introduction of persistent errors in the address space: My tests used a VM
> > > > running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > > > test program provided with this project. But instead of injecting the errors
> > > > with madvise calls from this program, I get the guest physical address of a
> > > > location and inject the error from the hypervisor into the VM, so that any
> > > > subsequent access to the location is prevented directly from the hypervisor
> > > > level.
> > >
> > > This is exactly what VMM should do: when it owns or manages the VM
> > > memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> > > isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> > > such memory accesses.
> > >
> > > >
> > > > Using this framework, I realized that the code provided here has a problem:
> > > > When the error impacts a large folio, the release of this folio doesn't isolate
> > > > the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > > > a known poisoned page to get_page_from_freelist().
> > >
> > > Just curious, how exactly you can repro this leaking of a known poison
> > > page? It may help me debug my patch.
> > >
> > > >
> > > > This revealed some mm limitations, as I would have expected that the
> > > > check_new_pages() mechanism used by the __rmqueue functions would filter these
> > > > pages out, but I noticed that this has been disabled by default in 2023 with:
> > > > [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > > > https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > >
> > > Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
> > > and testing but didn't notice any WARNING on "bad page"; it is very
> > > likely I was just lucky.
> > >
> > > >
> > > >
> > > > This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > > > filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > > > PageBuddy(page) is true first.
> > >
> > > Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> > > shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> > > not. take_page_off_buddy will check PageBuddy or not, on the page_head
> > > of different page orders. So maybe somehow a known poisoned page is
> > > not taken off from buddy allocator due to this?
> >
> > Maybe it's the case where the poisoned page is merged to a larger page,
> > and the PGTY_buddy flag is set on its buddy of the poisoned page, so
> > PageBuddy() returns false?:
> >
> > [ free page A ][ free page B (poisoned) ]
> >
> > When these two are merged, then we set PGTY_buddy on page A but not on B.
>
> Thanks Harry!
>
> It is indeed this case. I validate by adding some debug prints in
> take_page_off_buddy:
>
> [ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
> [ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
> [ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
> [ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
> [ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
> [ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
> [ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
> [ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
> [ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
> [ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
> [ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
> [ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1
>
> In this case, page for 0x2800200 is hwpoisoned, and its buddy page is
> 0x2800000 with order 10.
Woohoo, I got it right!
> > But even after fixing that we need to fix the race condition.
>
> What exactly is the race condition you are referring to?
When you free a high-order page, the buddy allocator doesn't check
PageHWPoison() on the page and its subpages. It checks PageHWPoison()
only when you free a base (order-0) page, see free_pages_prepare().
AFAICT there is nothing that prevents the poisoned page from being
allocated back to users, because the buddy doesn't check PageHWPoison()
on allocation either (by default).
So rather than freeing the high-order page as-is in
dissolve_free_hugetlb_folio(), I think we have to split it into base
pages and then free them one by one.
That way, free_pages_prepare() will catch that it's poisoned and won't
add it back to the freelist. Otherwise there will always be a window
where the poisoned page can be allocated to users - before it's taken
off the buddy.
--
Cheers,
Harry / Hyeonggon
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-28 7:00 ` Harry Yoo
@ 2025-10-30 11:51 ` Miaohe Lin
2025-10-30 17:28 ` Jiaqi Yan
0 siblings, 1 reply; 16+ messages in thread
From: Miaohe Lin @ 2025-10-30 11:51 UTC (permalink / raw)
To: Harry Yoo, Jiaqi Yan
Cc: “William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linux-fsdevel,
linux-kernel, linux-mm, muchun.song, nao.horiguchi, osalvador,
peterx, rientjes, sidhartha.kumar, tony.luck, wangkefeng.wang,
willy
On 2025/10/28 15:00, Harry Yoo wrote:
> On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
>> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>>>
>>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
>>>> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
>>>>>
>>>>> From: William Roche <william.roche@oracle.com>
>>>>>
>>>>> Hello,
>>>>>
>>>>> The possibility to keep a VM using large hugetlbfs pages running after a memory
>>>>> error is very important, and the possibility described here could be a good
>>>>> candidate to address this issue.
>>>>
>>>> Thanks for expressing interest, William, and sorry for getting back to
>>>> you so late.
>>>>
>>>>>
>>>>> So I would like to provide my feedback after testing this code with the
>>>>> introduction of persistent errors in the address space: My tests used a VM
>>>>> running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
>>>>> test program provided with this project. But instead of injecting the errors
>>>>> with madvise calls from this program, I get the guest physical address of a
>>>>> location and inject the error from the hypervisor into the VM, so that any
>>>>> subsequent access to the location is prevented directly from the hypervisor
>>>>> level.
>>>>
>>>> This is exactly what VMM should do: when it owns or manages the VM
>>>> memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
>>>> isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
>>>> such memory accesses.
>>>>
>>>>>
>>>>> Using this framework, I realized that the code provided here has a problem:
>>>>> When the error impacts a large folio, the release of this folio doesn't isolate
>>>>> the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
>>>>> a known poisoned page to get_page_from_freelist().
>>>>
>>>> Just curious, how exactly you can repro this leaking of a known poison
>>>> page? It may help me debug my patch.
>>>>
>>>>>
>>>>> This revealed some mm limitations, as I would have expected that the
>>>>> check_new_pages() mechanism used by the __rmqueue functions would filter these
>>>>> pages out, but I noticed that this has been disabled by default in 2023 with:
>>>>> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
>>>>> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>>>>
>>>> Thanks for the reference. I did turned on CONFIG_DEBUG_VM=y during dev
>>>> and testing but didn't notice any WARNING on "bad page"; It is very
>>>> likely I was just lucky.
>>>>
>>>>>
>>>>>
>>>>> This problem seems to be avoided if we call take_page_off_buddy(page) in the
>>>>> filemap_offline_hwpoison_folio_hugetlb() function without testing if
>>>>> PageBuddy(page) is true first.
>>>>
>>>> Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
>>>> shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
>>>> not. take_page_off_buddy will check PageBuddy or not, on the page_head
>>>> of different page orders. So maybe somehow a known poisoned page is
>>>> not taken off from buddy allocator due to this?
>>>
>>> Maybe it's the case where the poisoned page is merged to a larger page,
>>> and the PGTY_buddy flag is set on its buddy of the poisoned page, so
>>> PageBuddy() returns false?:
>>>
>>> [ free page A ][ free page B (poisoned) ]
>>>
>>> When these two are merged, then we set PGTY_buddy on page A but not on B.
>>
>> Thanks Harry!
>>
>> It is indeed this case. I validate by adding some debug prints in
>> take_page_off_buddy:
>>
>> [ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
>> [ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
>> [ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
>> [ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
>> [ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
>> [ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
>> [ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
>> [ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
>> [ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
>> [ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
>> [ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
>> [ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1
>>
>> In this case, page for 0x2800200 is hwpoisoned, and its buddy page is
>> 0x2800000 with order 10.
>
> Woohoo, I got it right!
>
>>> But even after fixing that we need to fix the race condition.
>>
>> What exactly is the race condition you are referring to?
>
> When you free a high-order page, the buddy allocator doesn't not check
> PageHWPoison() on the page and its subpages. It checks PageHWPoison()
> only when you free a base (order-0) page, see free_pages_prepare().
I think we could check PageHWPoison() for subpages, as free_page_is_bad()
does. If any subpage has the HWPoison flag set, simply drop the folio. We
could even do better -- split the folio and let the healthy subpages join
the buddy while rejecting the hwpoisoned one.
>
> AFAICT there is nothing that prevents the poisoned page to be
> allocated back to users because the buddy doesn't check PageHWPoison()
> on allocation as well (by default).
>
> So rather than freeing the high-order page as-is in
> dissolve_free_hugetlb_folio(), I think we have to split it to base pages
> and then free them one by one.
It might not be worth doing that, as it would significantly increase the
overhead of the function while memory failure events are really rare.
Thanks both.
.
>
> That way, free_pages_prepare() will catch that it's poisoned and won't
> add it back to the freelist. Otherwise there will always be a window
> where the poisoned page can be allocated to users - before it's taken
> off from the buddy.
>
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-30 11:51 ` Miaohe Lin
@ 2025-10-30 17:28 ` Jiaqi Yan
2025-10-30 21:28 ` Jiaqi Yan
2025-11-03 8:16 ` Harry Yoo
0 siblings, 2 replies; 16+ messages in thread
From: Jiaqi Yan @ 2025-10-30 17:28 UTC (permalink / raw)
To: Miaohe Lin
Cc: Harry Yoo, William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linux-fsdevel,
linux-kernel, linux-mm, muchun.song, nao.horiguchi, osalvador,
peterx, rientjes, sidhartha.kumar, tony.luck, wangkefeng.wang,
willy
On Thu, Oct 30, 2025 at 4:51 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2025/10/28 15:00, Harry Yoo wrote:
> > On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> >> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >>>
> >>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> >>>> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
> >>>>>
> >>>>> From: William Roche <william.roche@oracle.com>
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> The possibility to keep a VM using large hugetlbfs pages running after a memory
> >>>>> error is very important, and the possibility described here could be a good
> >>>>> candidate to address this issue.
> >>>>
> >>>> Thanks for expressing interest, William, and sorry for getting back to
> >>>> you so late.
> >>>>
> >>>>>
> >>>>> So I would like to provide my feedback after testing this code with the
> >>>>> introduction of persistent errors in the address space: My tests used a VM
> >>>>> running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> >>>>> test program provided with this project. But instead of injecting the errors
> >>>>> with madvise calls from this program, I get the guest physical address of a
> >>>>> location and inject the error from the hypervisor into the VM, so that any
> >>>>> subsequent access to the location is prevented directly from the hypervisor
> >>>>> level.
> >>>>
> >>>> This is exactly what VMM should do: when it owns or manages the VM
> >>>> memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> >>>> isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> >>>> such memory accesses.
> >>>>
> >>>>>
> >>>>> Using this framework, I realized that the code provided here has a problem:
> >>>>> When the error impacts a large folio, the release of this folio doesn't isolate
> >>>>> the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> >>>>> a known poisoned page to get_page_from_freelist().
> >>>>
> >>>> Just curious, how exactly you can repro this leaking of a known poison
> >>>> page? It may help me debug my patch.
> >>>>
> >>>>>
> >>>>> This revealed some mm limitations, as I would have expected that the
> >>>>> check_new_pages() mechanism used by the __rmqueue functions would filter these
> >>>>> pages out, but I noticed that this has been disabled by default in 2023 with:
> >>>>> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> >>>>> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >>>>
> >>>> Thanks for the reference. I did turned on CONFIG_DEBUG_VM=y during dev
> >>>> and testing but didn't notice any WARNING on "bad page"; It is very
> >>>> likely I was just lucky.
> >>>>
> >>>>>
> >>>>>
> >>>>> This problem seems to be avoided if we call take_page_off_buddy(page) in the
> >>>>> filemap_offline_hwpoison_folio_hugetlb() function without testing if
> >>>>> PageBuddy(page) is true first.
> >>>>
> >>>> Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> >>>> shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> >>>> not. take_page_off_buddy will check PageBuddy or not, on the page_head
> >>>> of different page orders. So maybe somehow a known poisoned page is
> >>>> not taken off from buddy allocator due to this?
> >>>
> >>> Maybe it's the case where the poisoned page is merged to a larger page,
> >>> and the PGTY_buddy flag is set on its buddy of the poisoned page, so
> >>> PageBuddy() returns false?:
> >>>
> >>> [ free page A ][ free page B (poisoned) ]
> >>>
> >>> When these two are merged, then we set PGTY_buddy on page A but not on B.
> >>
> >> Thanks Harry!
> >>
> >> It is indeed this case. I validate by adding some debug prints in
> >> take_page_off_buddy:
> >>
> >> [ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
> >> [ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
> >> [ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
> >> [ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
> >> [ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
> >> [ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
> >> [ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
> >> [ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
> >> [ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
> >> [ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
> >> [ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
> >> [ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1
> >>
> >> In this case, page for 0x2800200 is hwpoisoned, and its buddy page is
> >> 0x2800000 with order 10.
> >
> > Woohoo, I got it right!
> >
> >>> But even after fixing that we need to fix the race condition.
> >>
> >> What exactly is the race condition you are referring to?
> >
> > When you free a high-order page, the buddy allocator doesn't not check
> > PageHWPoison() on the page and its subpages. It checks PageHWPoison()
> > only when you free a base (order-0) page, see free_pages_prepare().
>
> I think we might could check PageHWPoison() for subpages as what free_page_is_bad()
> does. If any subpage has HWPoisoned flag set, simply drop the folio. Even we could
Agreed. I think as a starter I could try to, for example, let
free_pages_prepare scan for HWPoison-ed subpages if the page being
freed is high order. In the optimal case, HugeTLB does move the
PageHWPoison flag from the head page to the raw error pages.
> do it better -- Split the folio and let healthy subpages join the buddy while reject
> the hwpoisoned one.
>
> >
> > AFAICT there is nothing that prevents the poisoned page to be
> > allocated back to users because the buddy doesn't check PageHWPoison()
> > on allocation as well (by default).
> >
> > So rather than freeing the high-order page as-is in
> > dissolve_free_hugetlb_folio(), I think we have to split it to base pages
> > and then free them one by one.
>
> It might not be worth to do that as this would significantly increase the overhead
> of the function while memory failure event is really rare.
IIUC, Harry's idea is to do the split in dissolve_free_hugetlb_folio
only if the folio is HWPoison-ed, similar to what Miaohe suggested
earlier.
BTW, I believe this race condition already exists today when
memory_failure handles a HWPoison-ed free hugetlb page; it is not
something introduced by this patchset. I will fix or improve this in
a separate patchset.
>
> Thanks both.
Thanks Harry and Miaohe!
> .
>
> >
> > That way, free_pages_prepare() will catch that it's poisoned and won't
> > add it back to the freelist. Otherwise there will always be a window
> > where the poisoned page can be allocated to users - before it's taken
> > off from the buddy.
> >
>
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-30 17:28 ` Jiaqi Yan
@ 2025-10-30 21:28 ` Jiaqi Yan
2025-11-03 8:16 ` Harry Yoo
1 sibling, 0 replies; 16+ messages in thread
From: Jiaqi Yan @ 2025-10-30 21:28 UTC (permalink / raw)
To: Miaohe Lin
Cc: Harry Yoo, William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linux-fsdevel,
linux-kernel, linux-mm, muchun.song, nao.horiguchi, osalvador,
peterx, rientjes, sidhartha.kumar, tony.luck, wangkefeng.wang,
willy
On Thu, Oct 30, 2025 at 10:28 AM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> On Thu, Oct 30, 2025 at 4:51 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >
> > On 2025/10/28 15:00, Harry Yoo wrote:
> > > On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> > >> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >>>
> > >>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > >>>> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
> > >>>>>
> > >>>>> From: William Roche <william.roche@oracle.com>
> > >>>>>
> > >>>>> Hello,
> > >>>>>
> > >>>>> The possibility to keep a VM using large hugetlbfs pages running after a memory
> > >>>>> error is very important, and the possibility described here could be a good
> > >>>>> candidate to address this issue.
> > >>>>
> > >>>> Thanks for expressing interest, William, and sorry for getting back to
> > >>>> you so late.
> > >>>>
> > >>>>>
> > >>>>> So I would like to provide my feedback after testing this code with the
> > >>>>> introduction of persistent errors in the address space: My tests used a VM
> > >>>>> running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > >>>>> test program provided with this project. But instead of injecting the errors
> > >>>>> with madvise calls from this program, I get the guest physical address of a
> > >>>>> location and inject the error from the hypervisor into the VM, so that any
> > >>>>> subsequent access to the location is prevented directly from the hypervisor
> > >>>>> level.
> > >>>>
> > >>>> This is exactly what VMM should do: when it owns or manages the VM
> > >>>> memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> > >>>> isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> > >>>> such memory accesses.
> > >>>>
> > >>>>>
> > >>>>> Using this framework, I realized that the code provided here has a problem:
> > >>>>> When the error impacts a large folio, the release of this folio doesn't isolate
> > >>>>> the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > >>>>> a known poisoned page to get_page_from_freelist().
> > >>>>
> > >>>> Just curious, how exactly you can repro this leaking of a known poison
> > >>>> page? It may help me debug my patch.
> > >>>>
> > >>>>>
> > >>>>> This revealed some mm limitations, as I would have expected that the
> > >>>>> check_new_pages() mechanism used by the __rmqueue functions would filter these
> > >>>>> pages out, but I noticed that this has been disabled by default in 2023 with:
> > >>>>> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > >>>>> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > >>>>
> > >>>> Thanks for the reference. I did turned on CONFIG_DEBUG_VM=y during dev
> > >>>> and testing but didn't notice any WARNING on "bad page"; It is very
> > >>>> likely I was just lucky.
> > >>>>
> > >>>>>
> > >>>>>
> > >>>>> This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > >>>>> filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > >>>>> PageBuddy(page) is true first.
> > >>>>
> > >>>> Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> > >>>> shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> > >>>> not. take_page_off_buddy will check PageBuddy or not, on the page_head
> > >>>> of different page orders. So maybe somehow a known poisoned page is
> > >>>> not taken off from buddy allocator due to this?
> > >>>
> > >>> Maybe it's the case where the poisoned page is merged to a larger page,
> > >>> and the PGTY_buddy flag is set on its buddy of the poisoned page, so
> > >>> PageBuddy() returns false?:
> > >>>
> > >>> [ free page A ][ free page B (poisoned) ]
> > >>>
> > >>> When these two are merged, then we set PGTY_buddy on page A but not on B.
> > >>
> > >> Thanks Harry!
> > >>
> > >> It is indeed this case. I validate by adding some debug prints in
> > >> take_page_off_buddy:
> > >>
> > >> [ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
> > >> [ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1
> > >>
> > >> In this case, page for 0x2800200 is hwpoisoned, and its buddy page is
> > >> 0x2800000 with order 10.
> > >
> > > Woohoo, I got it right!
> > >
> > >>> But even after fixing that we need to fix the race condition.
> > >>
> > >> What exactly is the race condition you are referring to?
> > >
> > > When you free a high-order page, the buddy allocator doesn't not check
> > > PageHWPoison() on the page and its subpages. It checks PageHWPoison()
> > > only when you free a base (order-0) page, see free_pages_prepare().
> >
> > I think we might could check PageHWPoison() for subpages as what free_page_is_bad()
> > does. If any subpage has HWPoisoned flag set, simply drop the folio. Even we could
>
> Agree, I think as a starter I could try to, for example, let
> free_pages_prepare scan HWPoison-ed subpages if the base page is high
> order. In the optimal case, HugeTLB does move PageHWPoison flag from
> head page to the raw error pages.
Another idea I came up with today and am trying out is:
1. let the buddy allocator reject the high-order folio first, based on
the HWPoison flag
2. memory_failure takes advantage of break_down_buddy_pages to add
free pages back to the freelist, while keeping the target hwpoison-ed
page off the freelist
>
> > do it better -- Split the folio and let healthy subpages join the buddy while reject
> > the hwpoisoned one.
> >
> > >
> > > AFAICT there is nothing that prevents the poisoned page to be
> > > allocated back to users because the buddy doesn't check PageHWPoison()
> > > on allocation as well (by default).
> > >
> > > So rather than freeing the high-order page as-is in
> > > dissolve_free_hugetlb_folio(), I think we have to split it to base pages
> > > and then free them one by one.
> >
> > It might not be worth to do that as this would significantly increase the overhead
> > of the function while memory failure event is really rare.
>
> IIUC, Harry's idea is to do the split in dissolve_free_hugetlb_folio
> only if folio is HWPoison-ed, similar to what Miaohe suggested
> earlier.
>
> BTW, I believe this race condition already exists today when
> memory_failure handles HWPoison-ed free hugetlb page; it is not
> something introduced via this patchset. I will fix or improve this in
> a separate patchset.
>
> >
> > Thanks both.
>
> Thanks Harry and Miaohe!
>
>
> > .
> >
> > >
> > > That way, free_pages_prepare() will catch that it's poisoned and won't
> > > add it back to the freelist. Otherwise there will always be a window
> > > where the poisoned page can be allocated to users - before it's taken
> > > off from the buddy.
> > >
> >
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-30 17:28 ` Jiaqi Yan
2025-10-30 21:28 ` Jiaqi Yan
@ 2025-11-03 8:16 ` Harry Yoo
2025-11-03 8:53 ` Harry Yoo
1 sibling, 1 reply; 16+ messages in thread
From: Harry Yoo @ 2025-11-03 8:16 UTC (permalink / raw)
To: Jiaqi Yan
Cc: Miaohe Lin, William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linux-fsdevel,
linux-kernel, linux-mm, muchun.song, nao.horiguchi, osalvador,
peterx, rientjes, sidhartha.kumar, tony.luck, wangkefeng.wang,
willy, vbabka, surenb, mhocko, jackmanb, hannes, ziy
On Thu, Oct 30, 2025 at 10:28:48AM -0700, Jiaqi Yan wrote:
> On Thu, Oct 30, 2025 at 4:51 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >
> > On 2025/10/28 15:00, Harry Yoo wrote:
> > > On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> > >> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >>>
> > >>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > >>>> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
> > >>>>>
> > >>>>> From: William Roche <william.roche@oracle.com>
> > >>>>>
> > >>>>> Hello,
> > >>>>>
> > >>>>> The possibility to keep a VM using large hugetlbfs pages running after a memory
> > >>>>> error is very important, and the possibility described here could be a good
> > >>>>> candidate to address this issue.
> > >>>>
> > >>>> Thanks for expressing interest, William, and sorry for getting back to
> > >>>> you so late.
> > >>>>
> > >>>>>
> > >>>>> So I would like to provide my feedback after testing this code with the
> > >>>>> introduction of persistent errors in the address space: My tests used a VM
> > >>>>> running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > >>>>> test program provided with this project. But instead of injecting the errors
> > >>>>> with madvise calls from this program, I get the guest physical address of a
> > >>>>> location and inject the error from the hypervisor into the VM, so that any
> > >>>>> subsequent access to the location is prevented directly from the hypervisor
> > >>>>> level.
> > >>>>
> > >>>> This is exactly what VMM should do: when it owns or manages the VM
> > >>>> memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> > >>>> isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> > >>>> such memory accesses.
> > >>>>
> > >>>>>
> > >>>>> Using this framework, I realized that the code provided here has a problem:
> > >>>>> When the error impacts a large folio, the release of this folio doesn't isolate
> > >>>>> the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > >>>>> a known poisoned page to get_page_from_freelist().
> > >>>>
> > >>>> Just curious, how exactly did you repro this leaking of a known poisoned
> > >>>> page? It may help me debug my patch.
> > >>>>
> > >>>>>
> > >>>>> This revealed some mm limitations, as I would have expected that the
> > >>>>> check_new_pages() mechanism used by the __rmqueue functions would filter these
> > >>>>> pages out, but I noticed that this has been disabled by default in 2023 with:
> > >>>>> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > >>>>> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > >>>>
> > >>>> Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
> > >>>> and testing but didn't notice any WARNING on "bad page"; it is very
> > >>>> likely I was just lucky.
> > >>>>
> > >>>>>
> > >>>>>
> > >>>>> This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > >>>>> filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > >>>>> PageBuddy(page) is true first.
> > >>>>
> > >>>> Oh, I think you are right: filemap_offline_hwpoison_folio_hugetlb
> > >>>> shouldn't call take_page_off_buddy(page) depending on whether
> > >>>> PageBuddy(page) is true. take_page_off_buddy itself checks PageBuddy
> > >>>> on the page_head at each page order. So maybe somehow a known poisoned
> > >>>> page is not taken off the buddy allocator because of this?
> > >>>
> > >>> Maybe it's the case where the poisoned page is merged into a larger page,
> > >>> and the PGTY_buddy flag is set on the buddy of the poisoned page, so
> > >>> PageBuddy() returns false:
> > >>>
> > >>> [ free page A ][ free page B (poisoned) ]
> > >>>
> > >>> When these two are merged, then we set PGTY_buddy on page A but not on B.
> > >>
> > >> Thanks Harry!
> > >>
> > >> It is indeed this case. I validated this by adding some debug prints in
> > >> take_page_off_buddy:
> > >>
> > >> [ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
> > >> [ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
> > >> [ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1
> > >>
> > >> In this case, page for 0x2800200 is hwpoisoned, and its buddy page is
> > >> 0x2800000 with order 10.
> > >
> > > Woohoo, I got it right!
> > >
> > >>> But even after fixing that we need to fix the race condition.
> > >>
> > >> What exactly is the race condition you are referring to?
> > >
> > > When you free a high-order page, the buddy allocator doesn't check
> > > PageHWPoison() on the page and its subpages. It checks PageHWPoison()
> > > only when you free a base (order-0) page, see free_pages_prepare().
> >
> > I think we could check PageHWPoison() for subpages the way free_page_is_bad()
> > does. If any subpage has the HWPoisoned flag set, simply drop the folio. We could even
>
> Agree, I think as a starter I could try to, for example, let
> free_pages_prepare scan HWPoison-ed subpages if the base page is high
> order. In the optimal case, HugeTLB does move PageHWPoison flag from
> head page to the raw error pages.
[+Cc page allocator folks]
AFAICT enabling page sanity checks in the page alloc/free path would go against
past efforts to reduce sanity check overhead.
[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
I'd recommend checking the hwpoison flag before freeing the page to the buddy
when we know a memory error has occurred (I guess that's also what Miaohe
suggested).
> > do it better -- Split the folio and let healthy subpages join the buddy while rejecting
> > the hwpoisoned one.
> >
> > >
> > > AFAICT there is nothing that prevents the poisoned page to be
> > > allocated back to users because the buddy doesn't check PageHWPoison()
> > > on allocation as well (by default).
> > >
> > > So rather than freeing the high-order page as-is in
> > > dissolve_free_hugetlb_folio(), I think we have to split it to base pages
> > > and then free them one by one.
> >
> > It might not be worth doing that as this would significantly increase the overhead
> > of the function, while memory failure events are really rare.
>
> IIUC, Harry's idea is to do the split in dissolve_free_hugetlb_folio
> only if folio is HWPoison-ed, similar to what Miaohe suggested
> earlier.
Yes, and if we do the check before moving HWPoison flag to raw pages,
it'll be just a single folio_test_hwpoison() call.
> BTW, I believe this race condition already exists today when
> memory_failure handles HWPoison-ed free hugetlb page; it is not
> something introduced via this patchset. I will fix or improve this in
> a separate patchset.
That makes sense.
Thanks for working on this!
> > > That way, free_pages_prepare() will catch that it's poisoned and won't
> > > add it back to the freelist. Otherwise there will always be a window
> > > where the poisoned page can be allocated to users - before it's taken
> > > off from the buddy.
> > >
> >
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-11-03 8:16 ` Harry Yoo
@ 2025-11-03 8:53 ` Harry Yoo
2025-11-03 16:57 ` Jiaqi Yan
0 siblings, 1 reply; 16+ messages in thread
From: Harry Yoo @ 2025-11-03 8:53 UTC (permalink / raw)
To: Jiaqi Yan
Cc: Miaohe Lin, William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linux-fsdevel,
linux-kernel, linux-mm, muchun.song, nao.horiguchi, osalvador,
peterx, rientjes, sidhartha.kumar, tony.luck, wangkefeng.wang,
willy, vbabka, surenb, mhocko, jackmanb, hannes, ziy
On Mon, Nov 03, 2025 at 05:16:33PM +0900, Harry Yoo wrote:
> On Thu, Oct 30, 2025 at 10:28:48AM -0700, Jiaqi Yan wrote:
> > On Thu, Oct 30, 2025 at 4:51 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> > > On 2025/10/28 15:00, Harry Yoo wrote:
> > > > On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> > > >> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > >>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > > >>>> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
> > > >>> But even after fixing that we need to fix the race condition.
> > > >>
> > > >> What exactly is the race condition you are referring to?
> > > >
> > > > When you free a high-order page, the buddy allocator doesn't check
> > > > PageHWPoison() on the page and its subpages. It checks PageHWPoison()
> > > > only when you free a base (order-0) page, see free_pages_prepare().
> > >
> > > I think we could check PageHWPoison() for subpages the way free_page_is_bad()
> > > does. If any subpage has the HWPoisoned flag set, simply drop the folio. We could even
> >
> > Agree, I think as a starter I could try to, for example, let
> > free_pages_prepare scan HWPoison-ed subpages if the base page is high
> > order. In the optimal case, HugeTLB does move PageHWPoison flag from
> > head page to the raw error pages.
>
> [+Cc page allocator folks]
>
> AFAICT enabling page sanity checks in the page alloc/free path would go against
> past efforts to reduce sanity check overhead.
>
> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>
> I'd recommend checking the hwpoison flag before freeing the page to the buddy
> when we know a memory error has occurred (I guess that's also what Miaohe
> suggested).
>
> > > do it better -- Split the folio and let healthy subpages join the buddy while rejecting
> > > the hwpoisoned one.
> > >
> > > >
> > > > AFAICT there is nothing that prevents the poisoned page to be
> > > > allocated back to users because the buddy doesn't check PageHWPoison()
> > > > on allocation as well (by default).
> > > >
> > > > So rather than freeing the high-order page as-is in
> > > > dissolve_free_hugetlb_folio(), I think we have to split it to base pages
> > > > and then free them one by one.
> > >
> > > It might not be worth doing that as this would significantly increase the overhead
> > > of the function, while memory failure events are really rare.
> >
> > IIUC, Harry's idea is to do the split in dissolve_free_hugetlb_folio
> > only if folio is HWPoison-ed, similar to what Miaohe suggested
> > earlier.
>
> Yes, and if we do the check before moving HWPoison flag to raw pages,
> it'll be just a single folio_test_hwpoison() call.
>
> > BTW, I believe this race condition already exists today when
> > memory_failure handles HWPoison-ed free hugetlb page; it is not
> > something introduced via this patchset. I will fix or improve this in
> > a separate patchset.
>
> That makes sense.
Wait, without this patchset, do we even free the hugetlb folio when
its subpage is hwpoisoned? I don't think we do, but I'm not an expert at MFR...
If we don't, the mainline kernel should not be affected by this yet?
> Thanks for working on this!
>
> > > > That way, free_pages_prepare() will catch that it's poisoned and won't
> > > > add it back to the freelist. Otherwise there will always be a window
> > > > where the poisoned page can be allocated to users - before it's taken
> > > > off from the buddy.
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-11-03 8:53 ` Harry Yoo
@ 2025-11-03 16:57 ` Jiaqi Yan
2025-11-04 3:44 ` Miaohe Lin
2025-11-06 7:53 ` Harry Yoo
0 siblings, 2 replies; 16+ messages in thread
From: Jiaqi Yan @ 2025-11-03 16:57 UTC (permalink / raw)
To: Harry Yoo
Cc: Miaohe Lin, William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linux-fsdevel,
linux-kernel, linux-mm, muchun.song, nao.horiguchi, osalvador,
peterx, rientjes, sidhartha.kumar, tony.luck, wangkefeng.wang,
willy, vbabka, surenb, mhocko, jackmanb, hannes, ziy
On Mon, Nov 3, 2025 at 12:53 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Mon, Nov 03, 2025 at 05:16:33PM +0900, Harry Yoo wrote:
> > On Thu, Oct 30, 2025 at 10:28:48AM -0700, Jiaqi Yan wrote:
> > > On Thu, Oct 30, 2025 at 4:51 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> > > > On 2025/10/28 15:00, Harry Yoo wrote:
> > > > > On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> > > > >> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > >>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > > > >>>> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
> > > > >>> But even after fixing that we need to fix the race condition.
> > > > >>
> > > > >> What exactly is the race condition you are referring to?
> > > > >
> > > > > When you free a high-order page, the buddy allocator doesn't check
> > > > > PageHWPoison() on the page and its subpages. It checks PageHWPoison()
> > > > > only when you free a base (order-0) page, see free_pages_prepare().
> > > >
> > > > I think we could check PageHWPoison() for subpages the way free_page_is_bad()
> > > > does. If any subpage has the HWPoisoned flag set, simply drop the folio. We could even
> > >
> > > Agree, I think as a starter I could try to, for example, let
> > > free_pages_prepare scan HWPoison-ed subpages if the base page is high
> > > order. In the optimal case, HugeTLB does move PageHWPoison flag from
> > > head page to the raw error pages.
> >
> > [+Cc page allocator folks]
> >
> > AFAICT enabling page sanity checks in the page alloc/free path would go against
> > past efforts to reduce sanity check overhead.
> >
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > I'd recommend checking the hwpoison flag before freeing the page to the buddy
> > when we know a memory error has occurred (I guess that's also what Miaohe
> > suggested).
> >
> > > > do it better -- Split the folio and let healthy subpages join the buddy while rejecting
> > > > the hwpoisoned one.
> > > >
> > > > >
> > > > > AFAICT there is nothing that prevents the poisoned page to be
> > > > > allocated back to users because the buddy doesn't check PageHWPoison()
> > > > > on allocation as well (by default).
> > > > >
> > > > > So rather than freeing the high-order page as-is in
> > > > > dissolve_free_hugetlb_folio(), I think we have to split it to base pages
> > > > > and then free them one by one.
> > > >
> > > > It might not be worth doing that as this would significantly increase the overhead
> > > > of the function, while memory failure events are really rare.
> > >
> > > IIUC, Harry's idea is to do the split in dissolve_free_hugetlb_folio
> > > only if folio is HWPoison-ed, similar to what Miaohe suggested
> > > earlier.
> >
> > Yes, and if we do the check before moving HWPoison flag to raw pages,
> > it'll be just a single folio_test_hwpoison() call.
> >
> > > BTW, I believe this race condition already exists today when
> > > memory_failure handles HWPoison-ed free hugetlb page; it is not
> > > something introduced via this patchset. I will fix or improve this in
> > > a separate patchset.
> >
> > That makes sense.
>
> Wait, without this patchset, do we even free the hugetlb folio when
> its subpage is hwpoisoned? I don't think we do, but I'm not an expert at MFR...
Based on my reading of try_memory_failure_hugetlb, me_huge_page, and
__page_handle_poison, I think the mainline kernel frees a dissolved hugetlb
folio to the buddy allocator in two cases:
1. it was a free hugetlb page at the moment of try_memory_failure_hugetlb
2. it was an anonymous hugetlb page
Let me know if my understanding is wrong.
>
> If we don't, the mainline kernel should not be affected by this yet?
>
> > Thanks for working on this!
> >
> > > > > That way, free_pages_prepare() will catch that it's poisoned and won't
> > > > > add it back to the freelist. Otherwise there will always be a window
> > > > > where the poisoned page can be allocated to users - before it's taken
> > > > > off from the buddy.
>
> --
> Cheers,
> Harry / Hyeonggon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-11-03 16:57 ` Jiaqi Yan
@ 2025-11-04 3:44 ` Miaohe Lin
2025-11-06 7:53 ` Harry Yoo
1 sibling, 0 replies; 16+ messages in thread
From: Miaohe Lin @ 2025-11-04 3:44 UTC (permalink / raw)
To: Jiaqi Yan, Harry Yoo
Cc: William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linux-fsdevel,
linux-kernel, linux-mm, muchun.song, nao.horiguchi, osalvador,
peterx, rientjes, sidhartha.kumar, tony.luck, wangkefeng.wang,
willy, vbabka, surenb, mhocko, jackmanb, hannes, ziy
On 2025/11/4 0:57, Jiaqi Yan wrote:
> On Mon, Nov 3, 2025 at 12:53 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>>
>> On Mon, Nov 03, 2025 at 05:16:33PM +0900, Harry Yoo wrote:
>>> On Thu, Oct 30, 2025 at 10:28:48AM -0700, Jiaqi Yan wrote:
>>>> On Thu, Oct 30, 2025 at 4:51 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>> On 2025/10/28 15:00, Harry Yoo wrote:
>>>>>> On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
>>>>>>> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>>>>>>>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
>>>>>>>>> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
>>>>>>>> But even after fixing that we need to fix the race condition.
>>>>>>>
>>>>>>> What exactly is the race condition you are referring to?
>>>>>>
>>>>>> When you free a high-order page, the buddy allocator doesn't check
>>>>>> PageHWPoison() on the page and its subpages. It checks PageHWPoison()
>>>>>> only when you free a base (order-0) page, see free_pages_prepare().
>>>>>
>>>>> I think we could check PageHWPoison() for subpages the way free_page_is_bad()
>>>>> does. If any subpage has the HWPoisoned flag set, simply drop the folio. We could even
>>>>
>>>> Agree, I think as a starter I could try to, for example, let
>>>> free_pages_prepare scan HWPoison-ed subpages if the base page is high
>>>> order. In the optimal case, HugeTLB does move PageHWPoison flag from
>>>> head page to the raw error pages.
>>>
>>> [+Cc page allocator folks]
>>>
>>> AFAICT enabling page sanity checks in the page alloc/free path would go against
>>> past efforts to reduce sanity check overhead.
>>>
>>> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
>>> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
>>> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>>>
>>> I'd recommend checking the hwpoison flag before freeing the page to the buddy
>>> when we know a memory error has occurred (I guess that's also what Miaohe
>>> suggested).
>>>
>>>>> do it better -- Split the folio and let healthy subpages join the buddy while rejecting
>>>>> the hwpoisoned one.
>>>>>
>>>>>>
>>>>>> AFAICT there is nothing that prevents the poisoned page to be
>>>>>> allocated back to users because the buddy doesn't check PageHWPoison()
>>>>>> on allocation as well (by default).
>>>>>>
>>>>>> So rather than freeing the high-order page as-is in
>>>>>> dissolve_free_hugetlb_folio(), I think we have to split it to base pages
>>>>>> and then free them one by one.
>>>>>
>>>>> It might not be worth doing that as this would significantly increase the overhead
>>>>> of the function, while memory failure events are really rare.
>>>>
>>>> IIUC, Harry's idea is to do the split in dissolve_free_hugetlb_folio
>>>> only if folio is HWPoison-ed, similar to what Miaohe suggested
>>>> earlier.
>>>
>>> Yes, and if we do the check before moving HWPoison flag to raw pages,
>>> it'll be just a single folio_test_hwpoison() call.
>>>
>>>> BTW, I believe this race condition already exists today when
>>>> memory_failure handles HWPoison-ed free hugetlb page; it is not
>>>> something introduced via this patchset. I will fix or improve this in
>>>> a separate patchset.
>>>
>>> That makes sense.
>>
>> Wait, without this patchset, do we even free the hugetlb folio when
>> its subpage is hwpoisoned? I don't think we do, but I'm not an expert at MFR...
>
> Based on my reading of try_memory_failure_hugetlb, me_huge_page, and
> __page_handle_poison, I think the mainline kernel frees a dissolved hugetlb
> folio to the buddy allocator in two cases:
> 1. it was a free hugetlb page at the moment of try_memory_failure_hugetlb
> 2. it was an anonymous hugetlb page
I think there are some corner cases that can lead to a hugetlb folio being freed while
some of its subpages are hwpoisoned. E.g. get_huge_page_for_hwpoison can return
-EHWPOISON when the hugetlb folio happens to be isolated. Later the hugetlb folio might
become free, and __update_and_free_hugetlb_folio will be used to free it into the buddy.
If the page sanity check is enabled, hwpoisoned subpages will slip into the buddy but they
won't be re-allocated later because check_new_page will drop them. But if the page sanity
check is disabled, I think there is still no way to stop hwpoisoned subpages
from being reused.
Let me know if I missed something.
Thanks both.
.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-11-03 16:57 ` Jiaqi Yan
2025-11-04 3:44 ` Miaohe Lin
@ 2025-11-06 7:53 ` Harry Yoo
2025-11-12 1:28 ` Jiaqi Yan
1 sibling, 1 reply; 16+ messages in thread
From: Harry Yoo @ 2025-11-06 7:53 UTC (permalink / raw)
To: Jiaqi Yan
Cc: Miaohe Lin, William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linux-fsdevel,
linux-kernel, linux-mm, muchun.song, nao.horiguchi, osalvador,
peterx, rientjes, sidhartha.kumar, tony.luck, wangkefeng.wang,
willy, vbabka, surenb, mhocko, jackmanb, hannes, ziy
On Mon, Nov 03, 2025 at 08:57:08AM -0800, Jiaqi Yan wrote:
> On Mon, Nov 3, 2025 at 12:53 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Mon, Nov 03, 2025 at 05:16:33PM +0900, Harry Yoo wrote:
> > > On Thu, Oct 30, 2025 at 10:28:48AM -0700, Jiaqi Yan wrote:
> > > > On Thu, Oct 30, 2025 at 4:51 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> > > > > On 2025/10/28 15:00, Harry Yoo wrote:
> > > > > > On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> > > > > >> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > > >>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > > > > >>>> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
> > > > > >>> But even after fixing that we need to fix the race condition.
> > > > > >>
> > > > > >> What exactly is the race condition you are referring to?
> > > > > >
> > > > > > When you free a high-order page, the buddy allocator doesn't check
> > > > > > PageHWPoison() on the page and its subpages. It checks PageHWPoison()
> > > > > > only when you free a base (order-0) page, see free_pages_prepare().
> > > > >
> > > > > I think we could check PageHWPoison() for subpages the way free_page_is_bad()
> > > > > does. If any subpage has the HWPoisoned flag set, simply drop the folio. We could even
> > > >
> > > > Agree, I think as a starter I could try to, for example, let
> > > > free_pages_prepare scan HWPoison-ed subpages if the base page is high
> > > > order. In the optimal case, HugeTLB does move PageHWPoison flag from
> > > > head page to the raw error pages.
> > >
> > > [+Cc page allocator folks]
> > >
> > > AFAICT enabling page sanity checks in the page alloc/free path would go against
> > > past efforts to reduce sanity check overhead.
> > >
> > > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > >
> > > I'd recommend checking the hwpoison flag before freeing the page to the buddy
> > > when we know a memory error has occurred (I guess that's also what Miaohe
> > > suggested).
> > >
> > > > > do it better -- Split the folio and let healthy subpages join the buddy while rejecting
> > > > > the hwpoisoned one.
> > > > >
> > > > > >
> > > > > > AFAICT there is nothing that prevents the poisoned page to be
> > > > > > allocated back to users because the buddy doesn't check PageHWPoison()
> > > > > > on allocation as well (by default).
> > > > > >
> > > > > > So rather than freeing the high-order page as-is in
> > > > > > dissolve_free_hugetlb_folio(), I think we have to split it to base pages
> > > > > > and then free them one by one.
> > > > >
> > > > > It might not be worth doing that as this would significantly increase the overhead
> > > > > of the function, while memory failure events are really rare.
> > > >
> > > > IIUC, Harry's idea is to do the split in dissolve_free_hugetlb_folio
> > > > only if folio is HWPoison-ed, similar to what Miaohe suggested
> > > > earlier.
> > >
> > > Yes, and if we do the check before moving HWPoison flag to raw pages,
> > > it'll be just a single folio_test_hwpoison() call.
> > >
> > > > BTW, I believe this race condition already exists today when
> > > > memory_failure handles HWPoison-ed free hugetlb page; it is not
> > > > something introduced via this patchset. I will fix or improve this in
> > > > a separate patchset.
> > >
> > > That makes sense.
> >
> > Wait, without this patchset, do we even free the hugetlb folio when
> > its subpage is hwpoisoned? I don't think we do, but I'm not an expert at MFR...
>
> Based on my reading of try_memory_failure_hugetlb, me_huge_page, and
> __page_handle_poison, I think the mainline kernel frees a dissolved hugetlb
> folio to the buddy allocator in two cases:
> 1. it was a free hugetlb page at the moment of try_memory_failure_hugetlb
Right.
> 2. it was an anonymous hugetlb page
Right.
Thanks. I think you're right that poisoned hugetlb folios can be freed
to the buddy even without this series (and poisoned pages allocated back to
users instead of being isolated due to missing PageHWPoison() checks on
alloc/free).
So the plan is to post RFC v2 of this series and the race condition fix
as a separate series, right? (that sounds good to me!)
I still think it'd be best to split the hugetlb folio to order-0 pages and
free them when we know the hugetlb folio is poisoned because:
- We don't have to implement a special version of __free_pages() that
knows how to handle freeing of a high-order page where one or more of its
sub-pages are poisoned.
- We can avoid re-enabling page sanity checks (and introducing overhead)
all the time.
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-11-06 7:53 ` Harry Yoo
@ 2025-11-12 1:28 ` Jiaqi Yan
0 siblings, 0 replies; 16+ messages in thread
From: Jiaqi Yan @ 2025-11-12 1:28 UTC (permalink / raw)
To: Harry Yoo, William Roche, Miaohe Lin
Cc: Ackerley Tng, jgg, akpm, ankita, dave.hansen, david, duenwen,
jane.chu, jthoughton, linux-fsdevel, linux-kernel, linux-mm,
muchun.song, nao.horiguchi, osalvador, peterx, rientjes,
sidhartha.kumar, tony.luck, wangkefeng.wang, willy, vbabka,
surenb, mhocko, jackmanb, hannes, ziy
On Wed, Nov 5, 2025 at 11:54 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Mon, Nov 03, 2025 at 08:57:08AM -0800, Jiaqi Yan wrote:
> > On Mon, Nov 3, 2025 at 12:53 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >
> > > On Mon, Nov 03, 2025 at 05:16:33PM +0900, Harry Yoo wrote:
> > > > On Thu, Oct 30, 2025 at 10:28:48AM -0700, Jiaqi Yan wrote:
> > > > > On Thu, Oct 30, 2025 at 4:51 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> > > > > > On 2025/10/28 15:00, Harry Yoo wrote:
> > > > > > > On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> > > > > > >> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > > > >>> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > > > > > >>>> On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
> > > > > > >>> But even after fixing that we need to fix the race condition.
> > > > > > >>
> > > > > > >> What exactly is the race condition you are referring to?
> > > > > > >
> > > > > > > When you free a high-order page, the buddy allocator doesn't check
> > > > > > > PageHWPoison() on the page and its subpages. It checks PageHWPoison()
> > > > > > > only when you free a base (order-0) page, see free_pages_prepare().
> > > > > >
> > > > > > I think we could check PageHWPoison() for subpages the way free_page_is_bad()
> > > > > > does. If any subpage has the HWPoisoned flag set, simply drop the folio. We could even
> > > > >
> > > > > Agree, I think as a starter I could try to, for example, let
> > > > > free_pages_prepare scan HWPoison-ed subpages if the base page is high
> > > > > order. In the optimal case, HugeTLB does move PageHWPoison flag from
> > > > > head page to the raw error pages.
> > > >
> > > > [+Cc page allocator folks]
> > > >
> > > > AFAICT enabling page sanity checks in the page alloc/free path would go against
> > > > past efforts to reduce sanity check overhead.
> > > >
> > > > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > > > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > > > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > > >
> > > > I'd recommend checking the hwpoison flag before freeing the page to the buddy
> > > > when we know a memory error has occurred (I guess that's also what Miaohe
> > > > suggested).
> > > >
> > > > > > do it better -- Split the folio and let healthy subpages join the buddy while rejecting
> > > > > > the hwpoisoned one.
> > > > > >
> > > > > > >
> > > > > > > AFAICT there is nothing that prevents the poisoned page to be
> > > > > > > allocated back to users because the buddy doesn't check PageHWPoison()
> > > > > > > on allocation as well (by default).
> > > > > > >
> > > > > > > So rather than freeing the high-order page as-is in
> > > > > > > dissolve_free_hugetlb_folio(), I think we have to split it to base pages
> > > > > > > and then free them one by one.
> > > > > >
> > > > > > It might not be worth doing that as this would significantly increase the overhead
> > > > > > of the function, while memory failure events are really rare.
> > > > >
> > > > > IIUC, Harry's idea is to do the split in dissolve_free_hugetlb_folio
> > > > > only if folio is HWPoison-ed, similar to what Miaohe suggested
> > > > > earlier.
> > > >
> > > > Yes, and if we do the check before moving HWPoison flag to raw pages,
> > > > it'll be just a single folio_test_hwpoison() call.
> > > >
> > > > > BTW, I believe this race condition already exists today when
> > > > > memory_failure handles HWPoison-ed free hugetlb page; it is not
> > > > > something introduced via this patchset. I will fix or improve this in
> > > > > a separate patchset.
> > > >
> > > > That makes sense.
> > >
> > > Wait, without this patchset, do we even free the hugetlb folio when
> > > its subpage is hwpoisoned? I don't think we do, but I'm not an expert at MFR...
> >
> > Based on my reading of try_memory_failure_hugetlb, me_huge_page, and
> > __page_handle_poison, I think mainline kernel frees dissolved hugetlb
> > folio to buddy allocator in two cases:
> > 1. it was a free hugetlb page at the moment of try_memory_failure_hugetlb
>
> Right.
>
> > 2. it was an anonymous hugetlb page
>
> Right.
>
> Thanks. I think you're right that poisoned hugetlb folios can be freed
> to the buddy even without this series (and poisoned pages allocated back to
> users instead of being isolated due to missing PageHWPoison() checks on
> alloc/free).
Fortunately, today at most one raw HWPoison-ed page, together with the
high-order folio containing it, will get freed to the buddy allocator.
But with my memfd MFR series, raw HWPoison-ed pages can accumulate
while userspace still holds the hugetlb folio, so I would like a
solution to this.
>
> So the plan is to post RFC v2 of this series and the race condition fix
> as a separate series, right? (that sounds good to me!)
Yes, I am preparing RFC v2 in the meantime.
>
> I still think it'd be best to split the hugetlb folio to order-0 pages and
> free them when we know the hugetlb folio is poisoned because:
>
> - We don't have to implement a special version of __free_pages() that
> knows how to handle freeing of a high-order page where its one or more
> sub-pages are poisoned.
>
> - We can avoid re-enabling page sanity checks (and introducing overhead)
> all the time.
Agreed. After trying a couple of alternative, unsuccessful approaches,
I now have a working prototype that does exactly what Harry suggested.
My code roughly works like this (in case you don't want to wade
through the prototype code attached at the end):
__update_and_free_hugetlb_folio()
  hugetlb_free_hwpoison_folio()   (new code, instead of hugetlb_free_folio)
    folios = __split_unmapped_folio()
    for folio in folios:
      free_frozen_pages(folio) if not HWPoison-ed
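To make the "free only healthy subpages" step concrete, here is a tiny
userspace analogy of that final loop (purely illustrative: fake_page,
free_healthy_subpages and everything else here are invented stand-ins,
not kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SUBPAGES 8

struct fake_page {
	bool hwpoison;	/* stand-in for PageHWPoison() */
	bool freed;	/* stand-in for the page reaching free_frozen_pages() */
};

/*
 * After the folio has been split to order-0, walk every subpage and
 * release only the healthy ones; HWPoison-ed subpages keep their
 * reference and never reach the buddy allocator.
 */
static size_t free_healthy_subpages(struct fake_page *pages, size_t n)
{
	size_t nr_freed = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].hwpoison)
			continue;	/* hold on to the poisoned page */
		pages[i].freed = true;
		nr_freed++;
	}
	return nr_freed;
}
```

The real work in the prototype is of course the split itself; this only
mirrors the filtering at the end.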
It took me some time to test my implementation with some test-only
code that checks the pcplists and freelists (i.e. check_zone_free_list
and page_count_in_pcplist), but I have validated with several tests
that, after freeing a high-order folio containing multiple HWPoison-ed
pages, only healthy pages go to the buddy allocator or the
per-cpu-pages lists:
1. some pages are still in zone->per_cpu_pageset because the pcp count
is not high enough
2. all the others are, after merging, on some order's
zone->free_area[order].free_list
For example:
- when hugepagesize=2M, the 512 - x order-0 pages (x = number of
HWPoison-ed ones) are all placed on the pcp list.
- when hugepagesize=1G, most pages are merged into buddy blocks of
order 0 to 10, with some left over on the pcp list.
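The merging outcome can be sanity-checked with a scaled-down userspace
simulation of buddy coalescing (illustrative only; count_buddy_blocks
and the other names are invented, and this models a 512-page range
rather than a full 1G folio): a fully healthy range merges into a
single top-order block, while a single poisoned subpage fragments it
into one block per lower order.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_ORDER 10	/* largest buddy block: 2^10 pages */
#define NR_PAGES  512	/* a 2^9-page range, e.g. one 2MB hugepage of 4KB pages */

/* free_map[i] is true when subpage i is healthy (freed), false when poisoned */
static bool free_map[NR_PAGES];

/*
 * Count the buddy blocks each order ends up with after coalescing:
 * at every order, two free buddies merge into one block of the next
 * order; a block whose buddy is missing (poisoned) stays at this order.
 */
static void count_buddy_blocks(int blocks[MAX_ORDER + 1])
{
	bool cur[NR_PAGES];
	int n = NR_PAGES;

	memcpy(cur, free_map, sizeof(cur));
	memset(blocks, 0, (MAX_ORDER + 1) * sizeof(int));

	for (int order = 0; order <= MAX_ORDER && n > 0; order++) {
		int next_n = n / 2;

		for (int i = 0; i + 1 < n; i += 2) {
			if (cur[i] && cur[i + 1]) {
				cur[i / 2] = true;	/* buddies merge upward */
			} else {
				blocks[order] += cur[i] + cur[i + 1];
				cur[i / 2] = false;
			}
		}
		if (n & 1)	/* unpaired tail is a block at this order */
			blocks[order] += cur[n - 1];
		n = next_n;
	}
}
```

One poisoned subpage leaves exactly one free block at each order below
the top, which is the fragmentation pattern visible in the freelists.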
I am in the middle of refining my working prototype (attached below)
and will then send it out as a separate patch.
The code below is just to illustrate my idea and check whether it is
correct in general; I am not asking for code review :).
commit d54cc323608d383ee0136ca95932b535fed55def
Author: Jiaqi Yan <jiaqiyan@google.com>
Date: Mon Nov 10 19:46:21 2025 +0000
mm: memory_failure: avoid freeing HWPoison pages when dissolving free hugetlb folio
1. expose __split_unmapped_folio
2. introduce hugetlb_free_hwpoison_folio
3. simplify filemap_offline_hwpoison_folio_hugetlb
4. introduce page_count_in_pcplist and check_zone_free_list for testing
Tested with page_count_in_pcplist and check_zone_free_list.
Change-Id: I7af5fc40851e3a26eaa37bb3191d319437202bc1
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f327d62fc9852..5619d8931c4bf 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -367,6 +367,9 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order);
+int __split_unmapped_folio(struct folio *folio, int new_order,
+ struct page *split_at, struct xa_state *xas,
+ struct address_space *mapping, bool uniform_split);
int min_order_for_split(struct folio *folio);
int split_folio_to_list(struct folio *folio, struct list_head *list);
bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -591,6 +594,14 @@ split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
VM_WARN_ON_ONCE_PAGE(1, page);
return -EINVAL;
}
+static inline int __split_unmapped_folio(struct folio *folio, int new_order,
+ struct page *split_at, struct xa_state *xas,
+ struct address_space *mapping,
+ bool uniform_split)
+{
+ VM_WARN_ON_ONCE_FOLIO(1, folio);
+ return -EINVAL;
+}
static inline int split_huge_page(struct page *page)
{
VM_WARN_ON_ONCE_PAGE(1, page);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b7733ef5ee917..fad53772c875c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -873,6 +873,7 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
struct address_space *mapping);
+extern void hugetlb_free_hwpoison_folio(struct folio *folio);
#else
static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
{
@@ -882,6 +883,9 @@ static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio
{
return false;
}
+static inline void hugetlb_free_hwpoison_folio(struct folio *folio)
+{
+}
#endif
#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225f..6ca70ec2fb7cd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3408,9 +3408,9 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
* For !uniform_split, when -ENOMEM is returned, the original folio might be
* split. The caller needs to check the input folio.
*/
-static int __split_unmapped_folio(struct folio *folio, int new_order,
- struct page *split_at, struct xa_state *xas,
- struct address_space *mapping, bool uniform_split)
+int __split_unmapped_folio(struct folio *folio, int new_order,
+ struct page *split_at, struct xa_state *xas,
+ struct address_space *mapping, bool uniform_split)
{
int order = folio_order(folio);
int start_order = uniform_split ? new_order : order - 1;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d499574aafe52..7e408d6ce91d7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1596,6 +1596,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
 struct folio *folio)
{
bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio);
+ bool has_hwpoison = folio_test_hwpoison(folio);
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
@@ -1638,12 +1639,15 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
* Move PageHWPoison flag from head page to the raw error pages,
* which makes any healthy subpages reusable.
*/
- if (unlikely(folio_test_hwpoison(folio)))
+ if (unlikely(has_hwpoison))
folio_clear_hugetlb_hwpoison(folio);
folio_ref_unfreeze(folio, 1);
- hugetlb_free_folio(folio);
+ if (has_hwpoison)
+ hugetlb_free_hwpoison_folio(folio);
+ else
+ hugetlb_free_folio(folio);
}
/*
diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b83..6ee56aea01a91 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -829,6 +829,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
#define __alloc_frozen_pages(...) \
alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
void free_frozen_pages(struct page *page, unsigned int order);
+int page_count_in_pcplist(struct zone *zone);
void free_unref_folios(struct folio_batch *fbatch);
#ifdef CONFIG_NUMA
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fa281461f38a6..7dd82c787cea7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2044,13 +2044,134 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
return ret;
}
+static unsigned long calculate_overlap(unsigned long s1, unsigned long e1,
+ unsigned long s2, unsigned long e2)
+{
+ /* Calculate the start and end of the potential overlap. */
+ unsigned long overlap_start = max(s1, s2);
+ unsigned long overlap_end = min(e1, e2);
+
+ if (overlap_start <= overlap_end)
+ return overlap_end - overlap_start + 1;
+ else
+ return 0;
+}
+
+static void check_zone_free_list(struct zone *zone,
+ enum migratetype migrate_type,
+ unsigned long target_start_pfn,
+ unsigned long target_end_pfn)
+{
+ int order;
+ struct list_head *list;
+ struct page *page;
+ unsigned long pages_in_block;
+ unsigned long nr_free;
+ unsigned long start_pfn, end_pfn;
+ unsigned long flags;
+ unsigned long nr_pages = target_end_pfn - target_start_pfn + 1;
+ unsigned long overlap;
+
+ pr_info("%s:%d: search 0~%d order free areas\n", __func__, __LINE__, NR_PAGE_ORDERS);
+
+ spin_lock_irqsave(&zone->lock, flags);
+ for (order = 0; order < NR_PAGE_ORDERS; ++order) {
+ pages_in_block = 1UL << order;
+ nr_free = zone->free_area[order].nr_free;
+
+ if (nr_free == 0) {
+ pr_info("%s:%d: empty free area for order=%d\n",
+ __func__, __LINE__, order);
+ continue;
+ }
+
+ pr_info("%s:%d: free area order=%d, nr_free=%lu blocks in total\n",
+ __func__, __LINE__, order, nr_free);
+ list = &zone->free_area[order].free_list[migrate_type];
+ list_for_each_entry(page, list, buddy_list) {
+ start_pfn = page_to_pfn(page);
+ end_pfn = start_pfn + pages_in_block - 1;
+ overlap = calculate_overlap(target_start_pfn,
+ target_end_pfn,
+ start_pfn,
+ end_pfn);
+ nr_pages -= overlap;
+ if (overlap > 0)
+ pr_warn("%s:%d: found [%#lx, %#lx] overlap %lu pages with [%#lx, %#lx]\n",
+ __func__, __LINE__,
+ target_start_pfn, target_end_pfn,
+ overlap, start_pfn, end_pfn);
+ }
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ pr_err("%s:%d: %lu pages not found in free list\n", __func__, __LINE__, nr_pages);
+}
+
+void hugetlb_free_hwpoison_folio(struct folio *folio)
+{
+ struct folio *curr, *next;
+ struct folio *end_folio = folio_next(folio);
+ int ret;
+ unsigned long start_pfn = folio_pfn(folio);
+ unsigned long end_pfn = start_pfn + folio_nr_pages(folio) - 1;
+ struct zone *zone = folio_zone(folio);
+ int migrate_type = folio_migratetype(folio);
+ int pcp_count_init, pcp_count;
+
+ pr_info("%s:%d: folio start_pfn=%#lx, end_pfn=%#lx\n", __func__, __LINE__, start_pfn, end_pfn);
+ /* Expect folio's refcount==1. */
+ drain_all_pages(folio_zone(folio));
+
+ pcp_count_init = page_count_in_pcplist(zone);
+
+ pr_warn("%#lx: %s:%d: split-to-zero folio: order=%d, refcount=%d, nid=%d, zone=%d, migratetype=%d\n",
+ folio_pfn(folio), __func__, __LINE__, folio_order(folio), folio_ref_count(folio),
+ folio_nid(folio), folio_zonenum(folio), folio_migratetype(folio));
+
+ ret = __split_unmapped_folio(folio, /*new_order=*/0,
+ /*split_at=*/&folio->page,
+ /*xas=*/NULL, /*mapping=*/NULL,
+ /*uniform_split=*/true);
+ if (ret) {
+ pr_err("%#lx: failed to split free %d-order folio with HWPoison-ed page(s): %d\n",
+ folio_pfn(folio), folio_order(folio), ret);
+ return;
+ }
+
+ /* Expect 1st folio's refcount==1, and other's refcount==0. */
+ for (curr = folio; curr != end_folio; curr = next) {
+ next = folio_next(curr);
+
+ VM_WARN_ON_FOLIO(folio_order(curr), curr);
+
+ if (PageHWPoison(&curr->page)) {
+ if (curr != folio)
+ folio_ref_inc(curr);
+
+ VM_WARN_ON_FOLIO(folio_ref_count(curr) != 1, curr);
+ pr_warn("%#lx: prevented freeing HWPoison page\n", folio_pfn(curr));
+ continue;
+ }
+
+ if (curr == folio)
+ folio_ref_dec(curr);
+
+ VM_WARN_ON_FOLIO(folio_ref_count(curr), curr);
+ free_frozen_pages(&curr->page, folio_order(curr));
+ }
+
+ pcp_count = page_count_in_pcplist(zone);
+ pr_err("%s:%d: delta pcp_count: %d - %d = %d\n",
+ __func__, __LINE__, pcp_count, pcp_count_init,
+ pcp_count - pcp_count_init);
+
+ check_zone_free_list(zone, migrate_type, start_pfn, end_pfn);
+}
+
static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
{
int ret;
struct llist_node *head;
struct raw_hwp_page *curr, *next;
struct page *page;
- unsigned long pfn;
/*
* Since folio is still in the folio_batch, drop the refcount
@@ -2063,38 +2184,20 @@ static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
* Release references hold by try_memory_failure_hugetlb, one per
* HWPoison-ed page in the raw hwp list.
*/
- llist_for_each_entry(curr, head, node)
- folio_put(folio);
-
- /* Refcount now should be zero and ready to dissolve folio. */
- ret = dissolve_free_hugetlb_folio(folio);
- if (ret) {
- pr_err("failed to dissolve hugetlb folio: %d\n", ret);
- llist_for_each_entry(curr, head, node) {
- page = curr->page;
- pfn = page_to_pfn(page);
- /*
- * Maybe we also need to roll back the count
- * incremented during inline handling, depending
- * on what me_huge_page returned.
- */
- update_per_node_mf_stats(pfn, MF_FAILED);
- }
- return;
- }
-
llist_for_each_entry_safe(curr, next, head, node) {
+ folio_put(folio);
page = curr->page;
- pfn = page_to_pfn(page);
- drain_all_pages(page_zone(page));
- if (!take_page_off_buddy(page))
- pr_warn("%#lx: unable to take off buddy allocator\n", pfn);
-
SetPageHWPoison(page);
- page_ref_inc(page);
+ pr_info("%#lx: %s:%d moved HWPoison flag\n", page_to_pfn(page), __func__, __LINE__);
kfree(curr);
- pr_info("%#lx: pending hard offline completed\n", pfn);
}
+
+ pr_info("%#lx: %s:%d before dissolve refcount=%d\n",
+ page_to_pfn(&folio->page), __func__, __LINE__, folio_ref_count(folio));
+ /* Refcount now should be zero and ready to dissolve folio. */
+ ret = dissolve_free_hugetlb_folio(folio);
+ if (ret)
+ pr_err("failed to dissolve HWPoison-ed hugetlb folio: %d\n", ret);
}
void filemap_offline_hwpoison_folio(struct address_space *mapping,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 600d9e981c23d..0b3507a1880ec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1333,6 +1333,7 @@ __always_inline bool free_pages_prepare(struct page *page,
}
if (unlikely(PageHWPoison(page)) && !order) {
+ VM_BUG_ON_PAGE(1, page);
/* Do not let hwpoison pages hit pcplists/buddy */
reset_page_owner(page, order);
page_table_check_free(page, order);
@@ -2939,6 +2940,24 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
pcp_trylock_finish(UP_flags);
}
+int page_count_in_pcplist(struct zone *zone)
+{
+ unsigned long __maybe_unused UP_flags;
+ struct per_cpu_pages *pcp;
+ int page_count = 0;
+
+ pcp_trylock_prepare(UP_flags);
+ pcp = pcp_spin_trylock(zone->per_cpu_pageset);
+ if (pcp) {
+ page_count = pcp->count;
+ pcp_spin_unlock(pcp);
+ }
+ pcp_trylock_finish(UP_flags);
+
+ pr_info("%s:%d: #pages in pcp list=%d\n", __func__, __LINE__, page_count);
+ return page_count;
+}
+
void free_frozen_pages(struct page *page, unsigned int order)
{
__free_frozen_pages(page, order, FPI_NONE);
>
> --
> Cheers,
> Harry / Hyeonggon
Thread overview: 16+ messages
[not found] <20250118231549.1652825-1-jiaqiyan@google.com>
2025-09-19 15:58 ` [RFC PATCH v1 0/3] Userspace MFR Policy via memfd “William Roche
2025-10-13 22:14 ` Jiaqi Yan
2025-10-14 20:57 ` William Roche
2025-10-28 4:17 ` Jiaqi Yan
2025-10-22 13:09 ` Harry Yoo
2025-10-28 4:17 ` Jiaqi Yan
2025-10-28 7:00 ` Harry Yoo
2025-10-30 11:51 ` Miaohe Lin
2025-10-30 17:28 ` Jiaqi Yan
2025-10-30 21:28 ` Jiaqi Yan
2025-11-03 8:16 ` Harry Yoo
2025-11-03 8:53 ` Harry Yoo
2025-11-03 16:57 ` Jiaqi Yan
2025-11-04 3:44 ` Miaohe Lin
2025-11-06 7:53 ` Harry Yoo
2025-11-12 1:28 ` Jiaqi Yan