* [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
@ 2025-01-18 23:15 Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 1/3] mm: memfd/hugetlb: introduce userspace memory failure recovery policy Jiaqi Yan
` (5 more replies)
0 siblings, 6 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-01-18 23:15 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe
Cc: tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, Jiaqi Yan
Background and Motivation
=========================
To recap [1] in short: in Cloud, HugeTLB and huge VM_PFNMAP serve
capacity- and performance-critical guest memory, but the current memory
failure recovery (MFR) behavior for both is not ideal:
* Once a byte of memory in a hugepage is hardware corrupted, the kernel
discards the whole hugepage, not only the corrupted bytes but also the
healthy portion, from the HugeTLB system, causing a great loss of
memory to the VM. We use 1GB HugeTLB for the vast majority of guest
memory in GCE.
* After MFR zaps the PUD (assuming the memory mapping is a huge
VM_PFNMAP [2]), there will be a huge hole in the EPT or stage-2 (S2)
page table, causing a lot of EPT or S2 violations that need to be fixed
up by the device driver or core MM, and a fragmented EPT or S2 after
fixup. There will be noticeable VM performance degradation (see
“MemCycler Benchmarking”).
Therefore keeping or discarding a large chunk of contiguous memory
mapped to userspace (particularly to serve guest memory) due to an
uncorrected memory error (UE, recoverable is implied) should be
controlled by userspace, e.g. the virtual machine monitor (VMM) in Cloud.
In the MM upstream alignment meeting [3], we were able to align with
folks from the Linux MM upstream community on “Why Control Needed” and
“What to Control”. However, neither of the two proposed approaches for
“How to Control” was well accepted, and we think it is worthwhile to
pursue the memfd-based userspace MFR idea brought up by Jason Gunthorpe.
MemCycler Benchmarking
======================
To follow up on the question by Dave Hansen, “If one motivation for this
is guest performance, then it would be great to have some data to back
that up, even if it is worst-case data”, we ran MemCycler in the guest
and compared its performance when there is an extremely large number of
memory errors.
The MemCycler benchmark cycles through memory with multiple threads. On
each iteration, the thread reads the current value, validates it, and
writes a counter value. The benchmark continuously outputs rates
indicating the speed at which it is reading and writing 64-bit integers,
and aggregates the reads and writes of the multiple threads across
multiple iterations into a single rate (unit: 64-bit integers per microsecond).
MemCycler runs inside a VM with 80 vCPUs and 640 GB of guest memory.
The hardware platform hosting the VM uses Intel Emerald Rapids CPUs
(120 physical cores in total) and 1.5 TB of DDR5 memory. MemCycler
allocates memory with 2M transparent hugepages in the guest. Our
in-house VMM backs the guest memory with 2M transparent hugepages on the
host. The final aggregate rate after 60 runtime is 17,204.69 and is
referred to as the baseline case.
In the experimental case, all the setups are identical to the baseline
case, however 25% of the guest memory is split from THP to 4K pages due
to the memory failure recovery triggered by MADV_HWPOISON. I made some
minor changes in the kernel so that the MADV_HWPOISON-ed pages are
unpoisoned, and afterwards the in-guest MemCycler is still able to read
and write its data. The final aggregate rate is 16,355.11, which is
decreased by 5.06% compared to the baseline case. When 5% of the guest
memory is split after MADV_HWPOISON, the final aggregate rate is
16,999.14, a drop of 1.20% compared to the baseline case.
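Summary of the MemCycler results (aggregate rate in 64-bit integers per
microsecond):

Guest memory split by MADV_HWPOISON | Aggregate rate | vs. baseline
------------------------------------+----------------+-------------
0% (baseline)                       |      17,204.69 |       --
5%                                  |      16,999.14 |   -1.20%
25%                                 |      16,355.11 |   -5.06%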
Design
======
A userspace process creates a memfd to get a file that lives in RAM, has
volatile backing storage, and whose backing memory has anonymous
semantics. Userspace can then modify, truncate, memory-map the file and
so on.
Per-memfd MFR Policy associates the userspace MFR policy with a memfd
instance. This approach is promising for the following reasons:
1. Keeping memory with UE mapped to a process has risks if the process
does not do its duty to prevent itself from repeatedly consuming the UE.
The MFR policy can be associated with a memfd to limit such risk to a
particular memory space owned by a particular process that opts in to
the policy. This is much preferable to the Global MFR Policy
proposed in the initial RFC, which provides no granularity
whatsoever.
2. Like the Per-VMA MFR Policy in the initial RFC, poisoning the folio
and keeping the mapping do not conflict in the Per-memfd MFR Policy; the
kernel can keep setting the HWPoison flag on the folios affected by
the UE while the folio remains mapped to userspace. This is an
advantage over the Global MFR Policy, which breaks the kernel’s HWPoison
behavior.
3. Although the MFR policy allows the userspace process to keep memory
with UE mapped, eventually these HWPoison-ed folios need to be dealt
with by the kernel (e.g. split into the smallest chunks and isolated
from future allocation). For a memfd, once all references to it are
dropped, it is automatically released from userspace, which is the
perfect time for the kernel to do its duties to HWPoison-ed folios, if
any. This is also a big advantage over the Global MFR Policy, which
breaks the kernel’s protection of HWPoison-ed folios.
4. Given memfd’s anonymous semantics, we don’t need to worry about
different threads having different and conflicting MFR policies. This
allows a simpler implementation than the Per-VMA MFR Policy in the
initial RFC [1].
Userspace can choose whether the memory backing the created file is
managed by HugeTLB (MFD_HUGETLB) or SHMEM. To userspace the Per-memfd
MFR Policy is independent of the memory management system, although the
implementations required in the kernel are different because the
existing MFR behavior already varies.
UAPI
====
The UAPI to control MFR policy via memfd is through the memfd_create
syscall with a new flag value:
#define MFD_MF_KEEP_UE_MAPPED 0x0020U
int memfd_create(const char *name, unsigned int flags);
When MFD_MF_KEEP_UE_MAPPED is set in flags, memory failure (MF) recovery
in the kernel doesn’t hard offline memory due to uncorrected error (UE)
until the created memfd is released. IOW, the HWPoison-ed memory remains
accessible via the returned memfd or the memory mapping created with
that memfd.
However, the affected memory will be immediately protected and isolated
from future use by both kernel and userspace once the owning memfd is
gone or the memory is truncated. By default MFD_MF_KEEP_UE_MAPPED is not
set, and kernel hard offlines memory having UEs. Kernel immediately
poisons the folios for both cases.
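For illustration, a minimal usage sketch (not part of the patches; it
mirrors the selftest in patch 2 and omits error handling):

#define _GNU_SOURCE
#include <linux/memfd.h>
#include <sys/mman.h>
#include <unistd.h>

int create_keep_ue_mapped_memfd(size_t len)
{
	/* Opt in to the userspace MFR policy at creation time. */
	int fd = memfd_create("guest_mem",
			      MFD_HUGETLB | MFD_HUGE_1GB | MFD_MF_KEEP_UE_MAPPED);

	/* Back the file with 1G HugeTLB pages and map them. */
	ftruncate(fd, len);
	mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/*
	 * If a UE later hits this memory, the kernel keeps the hugepage
	 * mapped; hard offlining happens only when fd is closed or the
	 * affected range is truncated.
	 */
	return fd;
}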
MFD_MF_KEEP_UE_MAPPED translates to a new flag value introduced to
address_space, around which the new code changes in MFR, the mm fault
handler, and the in-RAM file systems are added:
/* Bits in mapping->flags. */
enum mapping_flags {
	...
	/*
	 * Keeps folios belonging to the mapping mapped even if uncorrectable
	 * memory errors (UE) caused memory failure (MF) within the folio.
	 * Only at the end of the mapping's lifetime will its HWPoison-ed
	 * folios be dealt with.
	 */
	AS_MF_KEEP_UE_MAPPED = 9,
	...
};
Implementation
==============
Implementation is relatively straightforward with two major parts.
Part 1: When an AS_MF_KEEP_UE_MAPPED memfd is alive and one of its memory
pages is affected by a UE:
* MFR needs to defer the operations (e.g. unmapping, splitting,
dissolving) needed to hard offline the memory page. MFR still holds a
refcount for every raw HWPoison-ed page. MFR still sends SIGBUS to the
consuming thread, but si_addr_lsb is reduced to PAGE_SHIFT (see the
sketch after this list).
* If the memory was not faulted in yet, the fault handler also needs to
unblock faults to the HWPoison-ed folio.
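A sketch of what the reduced si_addr_lsb means for userspace (modeled on
the selftest's SIGBUS handler in patch 2; the recovery action here is
only illustrative):

#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

static void sigbus_handler(int signo, siginfo_t *info, void *ctx)
{
	if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
		/*
		 * With MFD_MF_KEEP_UE_MAPPED, si_addr_lsb is PAGE_SHIFT, so
		 * only the 4K raw page at si_addr is lost; the rest of the
		 * hugepage stays mapped and readable.
		 */
		void *poisoned = info->si_addr;
		size_t blast_radius = (size_t)1 << info->si_addr_lsb;

		/* e.g. mark [poisoned, poisoned + blast_radius) as lost. */
		(void)poisoned;
		(void)blast_radius;
	}
}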
Part 2: When an AS_MF_KEEP_UE_MAPPED memfd is about to be released, or
when the userspace process truncates a range of memory pages belonging
to an AS_MF_KEEP_UE_MAPPED memfd (a condensed sketch of this path follows
the list below):
* When the in-memory file system is evicting the inode corresponding to
the memfd, it needs to collect the HWPoison-ed folios, which are easily
identifiable by the PG_HWPOISON flag. This operation is implemented
by populate_memfd_hwp_folios and is exported to file systems.
* After the file system removes all the folios, there is nothing else
preventing MFR from dealing with the HWPoison-ed folios, so the file
system forwards them to MFR. This step is implemented by
offline_memfd_hwp_folios and is exported to file systems.
* MFR has been holding a refcount on each HWPoison-ed folio. After
dropping the refcounts, a HWPoison-ed folio becomes free and can
be disposed of. MFR naturally takes the responsibility for this,
implemented as filemap_offline_hwpoison_folio. How the folio is
disposed of depends on the type of the memory management system.
Taking HugeTLB as an example, a HugeTLB folio is dissolved into a set
of raw pages. The healthy raw pages can be reclaimed by the buddy
allocator, while the HWPoison-ed raw pages need to be taken offline and
prevented from future allocation, as implemented by
filemap_offline_hwpoison_folio_hugetlb.
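A condensed sketch of the HugeTLB eviction path (the real code is in
fs/hugetlbfs/inode.c and mm/memory-failure.c in patch 1; the wrapper
name here is illustrative only):

/* Illustrative condensation of remove_inode_hugepages() from patch 1. */
static void evict_keep_ue_mapped_range(struct address_space *mapping,
				       pgoff_t start, pgoff_t end)
{
	LIST_HEAD(hwp_folios);

	/* 1. Collect the HWPoison-ed folios before they leave the filemap. */
	populate_memfd_hwp_folios(mapping, start, end, &hwp_folios);

	/* 2. Existing code removes all folios from the filemap, as today. */

	/*
	 * 3. Hand the deferred folios to MFR: drop the deferred refcounts,
	 *    dissolve the hugepage, and take the raw poisoned pages off the
	 *    buddy allocator (filemap_offline_hwpoison_folio_hugetlb).
	 */
	offline_memfd_hwp_folios(mapping, &hwp_folios);
}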
This RFC includes the code patch to demonstrate the implementation for
HugeTLB.
In V2 I can probably offline each folio as it gets removed, instead of
doing this in a batch. The advantage is that we can get rid of
populate_memfd_hwp_folios and the linked list needed to store poisoned
folios. One way is to insert filemap_offline_hwpoison_folio
somewhere in folio_batch_release, or into each file system's free_folio
handler.
Extensibility: Guest memfd
==========================
Guest memfd is going to be the future API used by a virtual machine
monitor (VMM) to allocate and configure memory for the guest, but with
the better protections that are needed for confidential VMs. The current
MFR in guest memfd works as follows:
1. KVM unmaps all the GFNs that are backed by the HWPoison-ed folio from
the stage-2 page table and invalidates the range in the TLB. This
protects KVM / the VM from causing poison consumption at the hardware
level. On the other hand, if the folio backs a large number of GFNs,
e.g. 1G HugeTLB, it is likely that the majority of the GFNs are still
healthy but have been “offlined” together (a big hole in stage-2 and the
guest memory region).
2. In reaction to a later fault on any part of the HWPoison-ed folio,
guest memfd returns KVM_PFN_ERR_HWPOISON, and KVM sends SIGBUS to the
VMM. This is good enough for GFNs backed by the actually corrupted PFNs,
but not ideal for the healthy PFNs “offlined” together with the error
PFNs. The userspace MFR policy can be useful if the VMM wants KVM to:
   a. Keep these GFNs mapped in the stage-2 page table.
   b. In reaction to a later access to the actually corrupted part of
      the HWPoison-ed folio, let there be a (repeated) poison consumption
      event, and return KVM_PFN_ERR_HWPOISON for the actually poisoned
      PFN.
   c. In response to a later access to the still-healthy part of the
      HWPoison-ed folio, let the guest access the memory fast, as the
      healthy PFNs are still in the stage-2 page table.
This behavior is better from the PoV of capacity (if the folio contains
a large number of raw pages) and performance (if both the stage-1 and
stage-2 page table mappings are huge), however, at the cost of the risk
of recurring poison consumption. The cost can be mitigated by splitting
the stage-2 page table around the HWPoison-ed PFNs so that stage-2 and
the guest memory region only have smaller holes.
The UAPI for userspace MFR control via guest memfd can be through the
KVM_CREATE_GUEST_MEMFD IOCTL. It is easy to apply MFD_MF_KEEP_UE_MAPPED
to kvm_create_guest_memfd:
static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
{
	//...
	if (flags & MFD_MF_KEEP_UE_MAPPED)
		mapping_set_mf_keep_ue_mapped(inode->i_mapping);
	//...
}
Of course, a full implementation requires more code changes in guest
memfd, e.g. __kvm_gmem_get_pfn, kvm_gmem_error_folio, etc., especially
for VMs that are built on both guest memfd and HugeTLB. Feedback is
welcome before I put out an implementation.
Extensibility: VFIO Device Memory
=================================
In Cloud a significant amount of memory can be managed by certain VFIO
drivers, for example GPU device memory or EGM memory. As mentioned
before, the unmapping behavior in MFR becomes a concern if the VFIO
device driver supports device memory mapping via huge VM_PFNMAP.
This RFC [4] proposes an MFR framework for VFIO-device-managed userspace
memory (i.e. memory regions mapped by remap_pfn_range). The userspace
MFR policy can instruct the device driver to keep all PFNs mapped in a
VMA (i.e. not unmap_mapping_range).
Of course, the above memfd uAPI (MFD_MF_KEEP_UE_MAPPED + memfd_create)
doesn’t work with VFIO device kernel drivers (as of today I don’t think
userspace can create a memfd with the path name of a VFIO device). I
don’t have a satisfying uAPI design, but here is what I considered: a
VFIO device-specific IOCTL.
* IOCTL to the VFIO device file. The device driver usually exposes a
file-like uAPI to its managed device memory (e.g. a PCI MMIO BAR)
directly with the file to the VFIO device. AS_MF_KEEP_UE_MAPPED can be
placed in the address_space of the VFIO device file. The device
driver can implement a specific IOCTL on the VFIO device file for
userspace to set AS_MF_KEEP_UE_MAPPED (see the sketch after this list).
* IOCTL to the char device file. The device driver can create a char
device for its managed memory regions, then expose the file-like uAPI
(open, close, mmap, unlocked_ioctl) with the created char device using
cdev_init. AS_MF_KEEP_UE_MAPPED can be straightforwardly put into the
address_space of the char device file. The device driver can
implement a specific IOCTL on the char device file for userspace to
set AS_MF_KEEP_UE_MAPPED.
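As a hedged sketch of the first option (the command name and handler are
hypothetical, not an existing VFIO uAPI):

/* Hypothetical device-specific ioctl; MYDEV_SET_MF_KEEP_UE_MAPPED is invented. */
static long mydev_vfio_ioctl(struct file *filp, unsigned int cmd,
			     unsigned long arg)
{
	switch (cmd) {
	case MYDEV_SET_MF_KEEP_UE_MAPPED:
		/*
		 * A real driver would have to reject this after the first
		 * mmap of the device memory (see the timing concern below).
		 */
		mapping_set_mf_keep_ue_mapped(filp->f_mapping);
		return 0;
	default:
		return -ENOTTY;
	}
}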
What is common (and unsatisfactory) above is that every device driver
needs to add a device-specific IOCTL to support MFD_MF_KEEP_UE_MAPPED.
The timing of accepting the IOCTL also needs to be restricted to after
file descriptor creation (e.g. VFIO_GROUP_GET_DEVICE_FD) and before the
first mmap request. I am still considering how to integrate
MFD_MF_KEEP_UE_MAPPED into the VFIO framework’s uAPI.
Extensibility: THP SHMEM/TMPFS
==============================
The current MFR behavior for THP SHMEM/TMPFS is to split the hugepage
into raw pages and only offline the raw HWPoison-ed page. In most cases
THP is 2M and the raw page size is 4K, so userspace loses the “huge”
property of a 2M huge memory range, but the actual data loss is only 4K.
Using populate_memfd_hwp_folios and offline_memfd_hwp_folios, it is not
hard to implement AS_MF_KEEP_UE_MAPPED for THP so that the userspace
process retains the huge property of the hugepage when it is affected by
memory errors. However, this benefit is not as attractive as it is for
HugeTLB and it is not implemented for now.
[1] https://lwn.net/Articles/991513
[2] https://lore.kernel.org/kvm/20240826204353.2228736-1-peterx@redhat.com/
[3] https://docs.google.com/presentation/d/1tWqcuAqeCLhfd47uXXLdu2SzolKu7WYvM03vEkbhobc/edit#slide=id.g3014a65d24b_0_0
[4] https://lore.kernel.org/linux-mm/20231123003513.24292-2-ankita@nvidia.com/
Jiaqi Yan (3):
mm: memfd/hugetlb: introduce userspace memory failure recovery policy
selftests/mm: test userspace MFR for HugeTLB 1G hugepage
Documentation: add userspace MF recovery policy via memfd
Documentation/userspace-api/index.rst | 1 +
.../userspace-api/mfd_mfr_policy.rst | 55 ++++
fs/hugetlbfs/inode.c | 16 ++
include/linux/hugetlb.h | 7 +
include/linux/pagemap.h | 43 +++
include/uapi/linux/memfd.h | 1 +
mm/filemap.c | 78 +++++
mm/hugetlb.c | 20 +-
mm/memfd.c | 15 +-
mm/memory-failure.c | 119 +++++++-
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/hugetlb-mfr.c | 267 ++++++++++++++++++
13 files changed, 607 insertions(+), 17 deletions(-)
create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst
create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH v1 1/3] mm: memfd/hugetlb: introduce userspace memory failure recovery policy
2025-01-18 23:15 [RFC PATCH v1 0/3] Userspace MFR Policy via memfd Jiaqi Yan
@ 2025-01-18 23:15 ` Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 2/3] selftests/mm: test userspace MFR for HugeTLB 1G hugepage Jiaqi Yan
` (4 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-01-18 23:15 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe
Cc: tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, Jiaqi Yan
Sometimes immediately hard offlining a memory page having uncorrected
memory errors (UE) may not be the best option for capacity and/or
performance reasons. "Sometimes" even becomes "often times" in Cloud
scenarios. See the cover letter for descriptions of two such scenarios.
Therefore keeping or discarding a large chunk of contiguous memory
mapped to userspace (particularly to serve guest memory) due to a
UE (recoverable is implied) should be controllable by the userspace
process, e.g. the VMM in a Cloud environment.
Given the relevance of HugeTLB's non-ideal memory failure recovery
behavior, this commit uses HugeTLB as the "testbed" to demonstrate the
idea of memfd-based userspace memory failure policy.
MFD_MF_KEEP_UE_MAPPED is added to the possible values for flags in
memfd_create syscall. It is intended to be generic for any memfd,
not just HugeTLB, but the current implementation only covers HugeTLB.
When MFD_MF_KEEP_UE_MAPPED is set in flags, memory failure recovery
in the kernel doesn’t hard offline memory due to UE until the created
memfd is released or the affected memory region is truncated by
userspace. IOW, the HWPoison-ed memory remains accessible via
the returned memfd or the memory mapping created with that memfd.
However, the affected memory will be immediately protected and isolated
from future use by both kernel and userspace once the owning memfd is
gone or the memory is truncated. By default MFD_MF_KEEP_UE_MAPPED is
not set, and kernel hard offlines memory having UEs.
Tested with selftest in followup patch.
This commit should probably be split into smaller pieces, but for now
I will defer it until this RFC becomes PATCH.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
fs/hugetlbfs/inode.c | 16 +++++
include/linux/hugetlb.h | 7 +++
include/linux/pagemap.h | 43 ++++++++++++++
include/uapi/linux/memfd.h | 1 +
mm/filemap.c | 78 ++++++++++++++++++++++++
mm/hugetlb.c | 20 ++++++-
mm/memfd.c | 15 ++++-
mm/memory-failure.c | 119 +++++++++++++++++++++++++++++++++----
8 files changed, 282 insertions(+), 17 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 0fc179a598300..3c7812898717b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -576,6 +576,10 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
pgoff_t next, index;
int i, freed = 0;
bool truncate_op = (lend == LLONG_MAX);
+ LIST_HEAD(hwp_folios);
+
+ /* Needs to be done before removing folios from filemap. */
+ populate_memfd_hwp_folios(mapping, lstart >> PAGE_SHIFT, end, &hwp_folios);
folio_batch_init(&fbatch);
next = lstart >> PAGE_SHIFT;
@@ -605,6 +609,18 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
(void)hugetlb_unreserve_pages(inode,
lstart >> huge_page_shift(h),
LONG_MAX, freed);
+ /*
+ * hugetlbfs_error_remove_folio keeps the HWPoison-ed pages in
+ * page cache until mm wants to drop the folio at the end of the
+ * of the filemap. At this point, if memory failure was delayed
+ * by AS_MF_KEEP_UE_MAPPED in the past, we can now deal with it.
+ *
+ * TODO: in V2 we can probably get rid of populate_memfd_hwp_folios
+ * and hwp_folios, by inserting filemap_offline_hwpoison_folio
+ * into somewhere in folio_batch_release, or into per file system's
+ * free_folio handler.
+ */
+ offline_memfd_hwp_folios(mapping, &hwp_folios);
}
static void hugetlbfs_evict_inode(struct inode *inode)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ec8c0ccc8f959..07d2a31146728 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -836,10 +836,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
#ifdef CONFIG_MEMORY_FAILURE
extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
+extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+ struct address_space *mapping);
#else
static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
{
}
+static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+ struct address_space *mapping)
+{
+ return false;
+}
#endif
#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index fc2e1319c7bb5..fad7093d232a9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -210,6 +210,12 @@ enum mapping_flags {
AS_STABLE_WRITES = 7, /* must wait for writeback before modifying
folio contents */
AS_INACCESSIBLE = 8, /* Do not attempt direct R/W access to the mapping */
+ /*
+ * Keeps folios belonging to the mapping mapped even if uncorrectable memory
+ * errors (UE) caused memory failure (MF) within the folio. Only at the end
+ * of the mapping will its HWPoison-ed folios be dealt with.
+ */
+ AS_MF_KEEP_UE_MAPPED = 9,
/* Bits 16-25 are used for FOLIO_ORDER */
AS_FOLIO_ORDER_BITS = 5,
AS_FOLIO_ORDER_MIN = 16,
@@ -335,6 +341,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
return test_bit(AS_INACCESSIBLE, &mapping->flags);
}
+static inline bool mapping_mf_keep_ue_mapped(struct address_space *mapping)
+{
+ return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
+}
+
+static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
+{
+ set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
+}
+
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
return mapping->gfp_mask;
@@ -1298,6 +1314,33 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
void delete_from_page_cache_batch(struct address_space *mapping,
struct folio_batch *fbatch);
bool filemap_release_folio(struct folio *folio, gfp_t gfp);
+#ifdef CONFIG_MEMORY_FAILURE
+void populate_memfd_hwp_folios(struct address_space *mapping,
+ pgoff_t lstart, pgoff_t lend,
+ struct list_head *list);
+void offline_memfd_hwp_folios(struct address_space *mapping,
+ struct list_head *list);
+/*
+ * Provided by memory failure to offline HWPoison-ed folio for various memory
+ * management systems (hugetlb, THP etc).
+ */
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+ struct folio *folio);
+#else
+static inline void populate_memfd_hwp_folios(struct address_space *mapping,
+ pgoff_t lstart, pgoff_t lend,
+ struct list_head *list)
+{
+}
+static inline void offline_memfd_hwp_folios(struct address_space *mapping,
+ struct list_head *list)
+{
+}
+static inline void filemap_offline_hwpoison_folio(struct address_space *mapping,
+ struct folio *folio)
+{
+}
+#endif
loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
int whence);
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 273a4e15dfcff..eb7a4ffcae6b9 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -12,6 +12,7 @@
#define MFD_NOEXEC_SEAL 0x0008U
/* executable */
#define MFD_EXEC 0x0010U
+#define MFD_MF_KEEP_UE_MAPPED 0x0020U
/*
* Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/filemap.c b/mm/filemap.c
index b6494d2d3bc2a..5216889d12ecf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4427,3 +4427,81 @@ SYSCALL_DEFINE4(cachestat, unsigned int, fd,
return 0;
}
#endif /* CONFIG_CACHESTAT_SYSCALL */
+
+#ifdef CONFIG_MEMORY_FAILURE
+/**
+ * To remember the HWPoison-ed folios within a mapping before removing every
+ * folio, create a utility struct to link them in a list.
+ */
+struct memfd_hwp_folio {
+ struct list_head node;
+ struct folio *folio;
+};
+/**
+ * populate_memfd_hwp_folios - populates HWPoison-ed folios.
+ * @mapping: The address_space of a memfd the kernel is trying to remove or truncate.
+ * @start: The starting page index.
+ * @end: The final page index (inclusive).
+ * @list: Where the HWPoison-ed folios will be stored into.
+ *
+ * There may be pending HWPoison-ed folios when a memfd is being removed or
+ * part of it is being truncated. Stores them into a linked list to offline
+ * after the file system removes them.
+ */
+void populate_memfd_hwp_folios(struct address_space *mapping,
+ pgoff_t start, pgoff_t end,
+ struct list_head *list)
+{
+ int i;
+ struct folio *folio;
+ struct memfd_hwp_folio *to_add;
+ struct folio_batch fbatch;
+ pgoff_t next = start;
+
+ if (!mapping_mf_keep_ue_mapped(mapping))
+ return;
+
+ folio_batch_init(&fbatch);
+ while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
+ for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+ folio = fbatch.folios[i];
+ if (!folio_test_hwpoison(folio))
+ continue;
+
+ to_add = kmalloc(sizeof(*to_add), GFP_KERNEL);
+ if (!to_add)
+ continue;
+
+ to_add->folio = folio;
+ list_add_tail(&to_add->node, list);
+ }
+ folio_batch_release(&fbatch);
+ }
+}
+EXPORT_SYMBOL_GPL(populate_memfd_hwp_folios);
+
+/**
+ * offline_memfd_hwp_folios - hard offline HWPoison-ed folios.
+ * @mapping: The address_space of a memfd the kernel is trying to remove or truncate.
+ * @list: Where the HWPoison-ed folios are stored. It will become empty when
+ * offline_memfd_hwp_folios returns.
+ *
+ * After the file system has removed all the folios belonging to a memfd, the kernel
+ * now can hard offline all HWPoison-ed folios that are previously pending.
+ * Caller needs to exclusively own @list as no locking is provided here, and
+ * @list is entirely consumed here.
+ */
+void offline_memfd_hwp_folios(struct address_space *mapping,
+ struct list_head *list)
+{
+ struct memfd_hwp_folio *curr, *temp;
+
+ list_for_each_entry_safe(curr, temp, list, node) {
+ filemap_offline_hwpoison_folio(mapping, curr->folio);
+ list_del(&curr->node);
+ kfree(curr);
+ }
+}
+EXPORT_SYMBOL_GPL(offline_memfd_hwp_folios);
+
+#endif /* CONFIG_MEMORY_FAILURE */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87761b042ed04..35e88d7fc2793 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6091,6 +6091,18 @@ static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, unsigned
return same;
}
+bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+ struct address_space *mapping)
+{
+ if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
+ return false;
+
+ if (!mapping)
+ return false;
+
+ return mapping_mf_keep_ue_mapped(mapping);
+}
+
static vm_fault_t hugetlb_no_page(struct address_space *mapping,
struct vm_fault *vmf)
{
@@ -6214,9 +6226,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
* So we need to block hugepage fault by PG_hwpoison bit check.
*/
if (unlikely(folio_test_hwpoison(folio))) {
- ret = VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(hstate_index(h));
- goto backout_unlocked;
+ if (!mapping_mf_keep_ue_mapped(mapping)) {
+ ret = VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(h));
+ goto backout_unlocked;
+ }
}
/* Check for page in userfault range. */
diff --git a/mm/memfd.c b/mm/memfd.c
index 37f7be57c2f50..ddb9e988396c7 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -302,7 +302,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+ MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
static int check_sysctl_memfd_noexec(unsigned int *flags)
{
@@ -376,6 +377,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
if (!(flags & MFD_HUGETLB)) {
if (flags & ~(unsigned int)MFD_ALL_FLAGS)
return -EINVAL;
+ if (flags & MFD_MF_KEEP_UE_MAPPED)
+ return -EINVAL;
} else {
/* Allow huge page size encoding in flags. */
if (flags & ~(unsigned int)(MFD_ALL_FLAGS |
@@ -436,6 +439,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
file->f_flags |= O_LARGEFILE;
+ /*
+ * MFD_MF_KEEP_UE_MAPPED can only be specified in memfd_create; no API
+ * to update it once memfd is created. MFD_MF_KEEP_UE_MAPPED is not
+ * seal-able.
+ *
+ * TODO: MFD_MF_KEEP_UE_MAPPED is not supported by all file system yet.
+ */
+ if ((flags & MFD_HUGETLB) && (flags & MFD_MF_KEEP_UE_MAPPED))
+ mapping_set_mf_keep_ue_mapped(file->f_mapping);
+
if (flags & MFD_NOEXEC_SEAL) {
struct inode *inode = file_inode(file);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a7b8ccd29b6f5..f43607fb4310e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -445,11 +445,13 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
* Schedule a process for later kill.
* Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
*/
-static void __add_to_kill(struct task_struct *tsk, const struct page *p,
+static void __add_to_kill(struct task_struct *tsk, struct page *p,
struct vm_area_struct *vma, struct list_head *to_kill,
unsigned long addr)
{
struct to_kill *tk;
+ struct folio *folio;
+ struct address_space *mapping;
tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
if (!tk) {
@@ -460,8 +462,20 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
tk->addr = addr;
if (is_zone_device_page(p))
tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
- else
- tk->size_shift = folio_shift(page_folio(p));
+ else {
+ folio = page_folio(p);
+ mapping = folio_mapping(folio);
+ if (mapping && mapping_mf_keep_ue_mapped(mapping))
+ /*
+ * Let userspace know the blast radius of the hardware poison
+ * is the size of a raw page, and as long as it aborts
+ * loads within that scope, other pages inside the folio
+ * are still safe to access.
+ */
+ tk->size_shift = PAGE_SHIFT;
+ else
+ tk->size_shift = folio_shift(folio);
+ }
/*
* Send SIGKILL if "tk->addr == -EFAULT". Also, as
@@ -486,7 +500,7 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
list_add_tail(&tk->nd, to_kill);
}
-static void add_to_kill_anon_file(struct task_struct *tsk, const struct page *p,
+static void add_to_kill_anon_file(struct task_struct *tsk, struct page *p,
struct vm_area_struct *vma, struct list_head *to_kill,
unsigned long addr)
{
@@ -607,7 +621,7 @@ struct task_struct *task_early_kill(struct task_struct *tsk, int force_early)
* Collect processes when the error hit an anonymous page.
*/
static void collect_procs_anon(const struct folio *folio,
- const struct page *page, struct list_head *to_kill,
+ struct page *page, struct list_head *to_kill,
int force_early)
{
struct task_struct *tsk;
@@ -645,7 +659,7 @@ static void collect_procs_anon(const struct folio *folio,
* Collect processes when the error hit a file mapped page.
*/
static void collect_procs_file(const struct folio *folio,
- const struct page *page, struct list_head *to_kill,
+ struct page *page, struct list_head *to_kill,
int force_early)
{
struct vm_area_struct *vma;
@@ -727,7 +741,7 @@ static void collect_procs_fsdax(const struct page *page,
/*
* Collect the processes who have the corrupted page mapped to kill.
*/
-static void collect_procs(const struct folio *folio, const struct page *page,
+static void collect_procs(const struct folio *folio, struct page *page,
struct list_head *tokill, int force_early)
{
if (!folio->mapping)
@@ -1226,6 +1240,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
}
}
+ /*
+ * MF still needs to hold a refcount for the deferred actions in
+ * filemap_offline_hwpoison_folio.
+ */
+ if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+ return res;
+
if (has_extra_refcount(ps, p, extra_pins))
res = MF_FAILED;
@@ -1593,6 +1614,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
struct address_space *mapping;
LIST_HEAD(tokill);
bool unmap_success;
+ bool keep_mapped;
int forcekill;
bool mlocked = folio_test_mlocked(folio);
@@ -1643,10 +1665,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
*/
collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_poisoned_folio(folio, ttu);
+ keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, mapping);
+ if (!keep_mapped)
+ unmap_poisoned_folio(folio, ttu);
unmap_success = !folio_mapped(folio);
- if (!unmap_success)
+ if (!unmap_success && !keep_mapped)
pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
pfn, folio_mapcount(folio));
@@ -1671,7 +1695,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
!unmap_success;
kill_procs(&tokill, forcekill, pfn, flags);
- return unmap_success;
+ return unmap_success || keep_mapped;
}
static int identify_page_state(unsigned long pfn, struct page *p,
@@ -1911,6 +1935,9 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
unsigned long count = 0;
head = llist_del_all(raw_hwp_list_head(folio));
+ if (head == NULL)
+ return 0;
+
llist_for_each_entry_safe(p, next, head, node) {
if (move_flag)
SetPageHWPoison(p->page);
@@ -1927,7 +1954,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
struct llist_head *head;
struct raw_hwp_page *raw_hwp;
struct raw_hwp_page *p;
- int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
+ struct address_space *mapping = folio->mapping;
+ bool has_hwpoison = folio_test_set_hwpoison(folio);
/*
* Once the hwpoison hugepage has lost reliable raw error info,
@@ -1946,8 +1974,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
if (raw_hwp) {
raw_hwp->page = page;
llist_add(&raw_hwp->node, head);
+ if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+ /*
+ * A new raw HWPoison page. Don't return HWPOISON.
+ * Error event will be counted in action_result().
+ */
+ return 0;
+
/* the first error event will be counted in action_result(). */
- if (ret)
+ if (has_hwpoison)
num_poisoned_pages_inc(page_to_pfn(page));
} else {
/*
@@ -1962,7 +1997,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
*/
__folio_free_raw_hwp(folio, false);
}
- return ret;
+
+ return has_hwpoison ? -EHWPOISON : 0;
}
static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
@@ -2051,6 +2087,63 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
return ret;
}
+static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
+{
+ int ret;
+ struct llist_node *head;
+ struct raw_hwp_page *curr, *next;
+ struct page *page;
+ unsigned long pfn;
+
+ head = llist_del_all(raw_hwp_list_head(folio));
+
+ /*
+ * Release references held by try_memory_failure_hugetlb, one per
+ * HWPoison-ed page in the raw hwp list. This folio's refcount is
+ * expected to drop to zero after the below for-each loop.
+ */
+ llist_for_each_entry(curr, head, node)
+ folio_put(folio);
+
+ ret = dissolve_free_hugetlb_folio(folio);
+ if (ret) {
+ pr_err("failed to dissolve hugetlb folio: %d\n", ret);
+ llist_for_each_entry(curr, head, node) {
+ page = curr->page;
+ pfn = page_to_pfn(page);
+ /*
+ * TODO: roll back the count incremented during online
+ * handling, i.e. whatever me_huge_page returns.
+ */
+ update_per_node_mf_stats(pfn, MF_FAILED);
+ }
+ return;
+ }
+
+ llist_for_each_entry_safe(curr, next, head, node) {
+ page = curr->page;
+ pfn = page_to_pfn(page);
+ drain_all_pages(page_zone(page));
+ if (PageBuddy(page) && !take_page_off_buddy(page))
+ pr_warn("%#lx: unable to take off buddy allocator\n", pfn);
+
+ SetPageHWPoison(page);
+ page_ref_inc(page);
+ kfree(curr);
+ pr_info("%#lx: pending hard offline completed\n", pfn);
+ }
+}
+
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+ struct folio *folio)
+{
+ WARN_ON_ONCE(!mapping);
+
+ /* Pending MFR currently only exists for hugetlb. */
+ if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+ filemap_offline_hwpoison_folio_hugetlb(folio);
+}
+
/*
* Taking refcount of hugetlb pages needs extra care about race conditions
* with basic operations like hugepage allocation/free/demotion.
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC PATCH v1 2/3] selftests/mm: test userspace MFR for HugeTLB 1G hugepage
2025-01-18 23:15 [RFC PATCH v1 0/3] Userspace MFR Policy via memfd Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 1/3] mm: memfd/hugetlb: introduce userspace memory failure recovery policy Jiaqi Yan
@ 2025-01-18 23:15 ` Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 3/3] Documentation: add userspace MF recovery policy via memfd Jiaqi Yan
` (3 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-01-18 23:15 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe
Cc: tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, Jiaqi Yan
Tests the userspace memory failure recovery (MFR) policy for the HugeTLB
1G hugepage case:
1. Creates a memfd backed by 1G HugeTLB with MFD_MF_KEEP_UE_MAPPED set.
2. Allocates and maps a 1G hugepage into the process.
3. Creates sub-threads to MADV_HWPOISON inner addresses of the hugepage.
4. Checks if the process gets the correct SIGBUS for each poisoned raw page.
5. Checks if all memory is still accessible and its content valid.
6. Checks if the poisoned 1G hugepage is dealt with after the memfd is
released.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/hugetlb-mfr.c | 267 +++++++++++++++++++++++
3 files changed, 269 insertions(+)
create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 121000c28c105..e65a1fa43f868 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -5,6 +5,7 @@ hugepage-mremap
hugepage-shm
hugepage-vmemmap
hugetlb-madvise
+hugetlb-mfr
hugetlb-read-hwpoison
hugetlb-soft-offline
khugepaged
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 63ce39d024bb5..171a9e65ed357 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -62,6 +62,7 @@ TEST_GEN_FILES += hmm-tests
TEST_GEN_FILES += hugetlb-madvise
TEST_GEN_FILES += hugetlb-read-hwpoison
TEST_GEN_FILES += hugetlb-soft-offline
+TEST_GEN_FILES += hugetlb-mfr
TEST_GEN_FILES += hugepage-mmap
TEST_GEN_FILES += hugepage-mremap
TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/mm/hugetlb-mfr.c b/tools/testing/selftests/mm/hugetlb-mfr.c
new file mode 100644
index 0000000000000..cb20b81984f5e
--- /dev/null
+++ b/tools/testing/selftests/mm/hugetlb-mfr.c
@@ -0,0 +1,267 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Tests the userspace memory failure recovery (MFR) policy for HugeTLB 1G
+ * hugepage case:
+ * 1. Creates a memfd backed by 1G HugeTLB with the MFD_MF_KEEP_UE_MAPPED bit set.
+ * 2. Allocates and maps a 1G hugepage.
+ * 3. Creates sub-threads to MADV_HWPOISON inner addresses of the hugepage.
+ * 4. Checks if the sub-thread gets the correct SIGBUS for each poisoned raw page.
+ * 5. Checks if all memory is still accessible and the content still valid.
+ * 6. Checks if the poisoned 1G hugepage is dealt with after the memfd is released.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <linux/magic.h>
+#include <linux/memfd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/statfs.h>
+#include <sys/types.h>
+
+#include "../kselftest.h"
+#include "vm_util.h"
+
+#define EPREFIX " !!! "
+#define BYTE_LENTH_IN_1G 0x40000000
+#define HUGETLB_FILL 0xab
+
+static void *sigbus_addr;
+static int sigbus_addr_lsb;
+static bool expecting_sigbus;
+static bool got_sigbus;
+static bool was_mceerr;
+
+static int create_hugetlbfs_file(struct statfs *file_stat)
+{
+ int fd;
+ int flags = MFD_HUGETLB | MFD_HUGE_1GB | MFD_MF_KEEP_UE_MAPPED;
+
+ fd = memfd_create("hugetlb_tmp", flags);
+ if (fd < 0)
+ ksft_exit_fail_perror("Failed to memfd_create");
+
+ memset(file_stat, 0, sizeof(*file_stat));
+ if (fstatfs(fd, file_stat)) {
+ close(fd);
+ ksft_exit_fail_perror("Failed to fstatfs");
+ }
+ if (file_stat->f_type != HUGETLBFS_MAGIC) {
+ close(fd);
+ ksft_exit_fail_msg("Not hugetlbfs file");
+ }
+
+ ksft_print_msg("Created hugetlb_tmp file\n");
+ ksft_print_msg("hugepagesize=%#lx\n", file_stat->f_bsize);
+ if (file_stat->f_bsize != BYTE_LENTH_IN_1G)
+ ksft_exit_fail_msg("Hugepage size is not 1G");
+
+ return fd;
+}
+
+/*
+ * SIGBUS handler for "do_hwpoison" thread that mapped and MADV_HWPOISON
+ */
+static void sigbus_handler(int signo, siginfo_t *info, void *context)
+{
+ if (!expecting_sigbus)
+ ksft_exit_fail_msg("unexpected sigbus with addr=%p",
+ info->si_addr);
+
+ got_sigbus = true;
+ was_mceerr = (info->si_code == BUS_MCEERR_AO ||
+ info->si_code == BUS_MCEERR_AR);
+ sigbus_addr = info->si_addr;
+ sigbus_addr_lsb = info->si_addr_lsb;
+}
+
+static void *do_hwpoison(void *hwpoison_addr)
+{
+ int hwpoison_size = getpagesize();
+
+ ksft_print_msg("MADV_HWPOISON hwpoison_addr=%p, len=%d\n",
+ hwpoison_addr, hwpoison_size);
+ if (madvise(hwpoison_addr, hwpoison_size, MADV_HWPOISON) < 0)
+ ksft_exit_fail_perror("Failed to MADV_HWPOISON");
+
+ pthread_exit(NULL);
+}
+
+static void test_hwpoison_multiple_pages(unsigned char *start_addr)
+{
+ pthread_t pthread;
+ int ret;
+ unsigned char *hwpoison_addr;
+ unsigned long offsets[] = {0x200000, 0x400000, 0x800000};
+
+ for (size_t i = 0; i < ARRAY_SIZE(offsets); ++i) {
+ sigbus_addr = (void *)0xBADBADBAD;
+ sigbus_addr_lsb = 0;
+ was_mceerr = false;
+ got_sigbus = false;
+ expecting_sigbus = true;
+ hwpoison_addr = start_addr + offsets[i];
+
+ ret = pthread_create(&pthread, NULL, &do_hwpoison, hwpoison_addr);
+ if (ret)
+ ksft_exit_fail_perror("Failed to create hwpoison thread");
+
+ ksft_print_msg("Created thread to hwpoison and access hwpoison_addr=%p\n",
+ hwpoison_addr);
+
+ pthread_join(pthread, NULL);
+
+ if (!got_sigbus)
+ ksft_test_result_fail("Didn't get a SIGBUS\n");
+ if (!was_mceerr)
+ ksft_test_result_fail("Didn't get a BUS_MCEERR_A(R|O)\n");
+ if (sigbus_addr != hwpoison_addr)
+ ksft_test_result_fail("Incorrect address: got=%p, expected=%p\n",
+ sigbus_addr, hwpoison_addr);
+ if (sigbus_addr_lsb != pshift())
+ ksft_test_result_fail("Incorrect address LSB: got=%d, expected=%d\n",
+ sigbus_addr_lsb, pshift());
+
+ ksft_print_msg("Received expected and correct SIGBUS\n");
+ }
+}
+
+static int read_nr_hugepages(unsigned long hugepage_size,
+ unsigned long *nr_hugepages)
+{
+ char buffer[256] = {0};
+ char cmd[256] = {0};
+
+ sprintf(cmd, "cat /sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages",
+ hugepage_size);
+ FILE *cmdfile = popen(cmd, "r");
+
+ if (cmdfile == NULL) {
+ ksft_perror(EPREFIX "failed to popen nr_hugepages");
+ return -1;
+ }
+
+ if (!fgets(buffer, sizeof(buffer), cmdfile)) {
+ ksft_perror(EPREFIX "failed to read nr_hugepages");
+ pclose(cmdfile);
+ return -1;
+ }
+
+ *nr_hugepages = atoll(buffer);
+ pclose(cmdfile);
+ return 0;
+}
+
+/*
+ * Main thread that drives the test.
+ */
+static void test_main(int fd, size_t len)
+{
+ unsigned char *map, *iter;
+ struct sigaction new, old;
+ const unsigned long hugepagesize_kb = BYTE_LENTH_IN_1G / 1024;
+ unsigned long nr_hugepages_before = 0;
+ unsigned long nr_hugepages_after = 0;
+
+ if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_before) != 0) {
+ close(fd);
+ ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+ }
+ ksft_print_msg("NR hugepages before MADV_HWPOISON is %ld\n", nr_hugepages_before);
+
+ if (ftruncate(fd, len) < 0)
+ ksft_exit_fail_perror("Failed to ftruncate");
+
+ ksft_print_msg("Allocated %#lx bytes to HugeTLB file\n", len);
+
+ map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (map == MAP_FAILED)
+ ksft_exit_fail_msg("Failed to mmap");
+
+ ksft_print_msg("Created HugeTLB mapping: %p\n", map);
+
+ memset(map, HUGETLB_FILL, len);
+ ksft_print_msg("Memset every byte to 0xab\n");
+
+ new.sa_sigaction = &sigbus_handler;
+ new.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGBUS, &new, &old) < 0)
+ ksft_exit_fail_msg("Failed to setup SIGBUS handler");
+
+ ksft_print_msg("Setup SIGBUS handler successfully\n");
+
+ test_hwpoison_multiple_pages(map);
+
+ /*
+ * Since MADV_HWPOISON doesn't corrupt the memory in hardware, and
+ * MFD_MF_KEEP_UE_MAPPED keeps the hugepage mapped, every byte should
+ * remain accessible and hold original data.
+ */
+ expecting_sigbus = false;
+ for (iter = map; iter < map + len; ++iter) {
+ if (*iter != HUGETLB_FILL) {
+ ksft_print_msg("At addr=%p: got=%#x, expected=%#x\n",
+ iter, *iter, HUGETLB_FILL);
+ ksft_test_result_fail("Memory content corrupted\n");
+ break;
+ }
+ }
+ ksft_print_msg("Memory content all valid\n");
+
+ if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
+ close(fd);
+ ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+ }
+
+ /*
+ * After MADV_HWPOISON, hugepage should still be in HugeTLB pool.
+ */
+ ksft_print_msg("NR hugepages after MADV_HWPOISON is %ld\n", nr_hugepages_after);
+ if (nr_hugepages_before != nr_hugepages_after)
+ ksft_test_result_fail("NR hugepages reduced by %ld after MADV_HWPOISON\n",
+ nr_hugepages_before - nr_hugepages_after);
+
+ /* End of the lifetime of the created HugeTLB memfd. */
+ if (ftruncate(fd, 0) < 0)
+ ksft_exit_fail_perror("Failed to ftruncate to 0");
+ munmap(map, len);
+ close(fd);
+
+ /*
+ * After freed by userspace, MADV_HWPOISON-ed hugepage should be
+ * dissolved into raw pages and removed from HugeTLB pool.
+ */
+ if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
+ close(fd);
+ ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+ }
+ ksft_print_msg("NR hugepages after closure is %ld\n", nr_hugepages_after);
+ if (nr_hugepages_before != nr_hugepages_after + 1)
+ ksft_test_result_fail("NR hugepages is not reduced after memfd closure\n");
+
+ ksft_test_result_pass("All done\n");
+}
+
+int main(int argc, char **argv)
+{
+ int fd;
+ struct statfs file_stat;
+ size_t len = BYTE_LENTH_IN_1G;
+
+ ksft_print_header();
+ ksft_set_plan(1);
+
+ fd = create_hugetlbfs_file(&file_stat);
+ test_main(fd, len);
+
+ ksft_finished();
+}
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC PATCH v1 3/3] Documentation: add userspace MF recovery policy via memfd
2025-01-18 23:15 [RFC PATCH v1 0/3] Userspace MFR Policy via memfd Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 1/3] mm: memfd/hugetlb: introduce userspace memory failure recovery policy Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 2/3] selftests/mm: test userspace MFR for HugeTLB 1G hugepage Jiaqi Yan
@ 2025-01-18 23:15 ` Jiaqi Yan
2025-01-20 17:26 ` [RFC PATCH v1 0/3] Userspace MFR Policy " Jason Gunthorpe
` (2 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-01-18 23:15 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe
Cc: tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, Jiaqi Yan
Document its motivation and userspace API.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
Documentation/userspace-api/index.rst | 1 +
.../userspace-api/mfd_mfr_policy.rst | 55 +++++++++++++++++++
2 files changed, 56 insertions(+)
create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 274cc7546efc2..0f9783b8807ea 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -63,6 +63,7 @@ Everything else
vduse
futex2
perf_ring_buffer
+ mfd_mfr_policy
.. only:: subproject and html
diff --git a/Documentation/userspace-api/mfd_mfr_policy.rst b/Documentation/userspace-api/mfd_mfr_policy.rst
new file mode 100644
index 0000000000000..d4557693c2c40
--- /dev/null
+++ b/Documentation/userspace-api/mfd_mfr_policy.rst
@@ -0,0 +1,55 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Userspace Memory Failure Recovery Policy via memfd
+==================================================
+
+:Author:
+ Jiaqi Yan <jiaqiyan@google.com>
+
+
+Motivation
+==========
+
+When a userspace process is able to recover from memory failures (MF)
+caused by uncorrected memory errors (UE) in the DIMM, especially when it is
+able to avoid consuming known UEs, keeping the memory page mapped and
+accessible may be beneficial to the owning process for a couple of reasons:
+- The memory pages affected by UE have a large smallest granularity, for
+ example a 1G hugepage, but the actually corrupted portion of the page is
+ only several cachelines. Losing the entire hugepage of data is
+ unacceptable to the application.
+- In addition to keeping the data accessible, the application still wants
+ to access it with as large a page size as possible for the fastest
+ virtual-to-physical translations.
+
+Memory failure recovery for 1G or larger HugeTLB pages is a good example.
+With memfd, a userspace process can control whether the kernel hard
+offlines the memory (huge)pages that back the in-RAM file created by memfd.
+
+
+User API
+========
+
+``int memfd_create(const char *name, unsigned int flags)``
+
+``MFD_MF_KEEP_UE_MAPPED``
+ When the ``MFD_MF_KEEP_UE_MAPPED`` bit is set in ``flags``, MF recovery
+ in the kernel does not hard offline memory due to UE until the
+ returned ``memfd`` is released. IOW, the HWPoison-ed memory remains
+ accessible via the returned ``memfd`` or the memory mapping created
+ with the returned ``memfd``. Note that the affected memory will be
+ immediately protected and isolated from future use (by both kernel
+ and userspace) once the owning process is gone. By default
+ ``MFD_MF_KEEP_UE_MAPPED`` is not set, and the kernel hard offlines
+ memory having UEs.
+
+Notes about the behavior and limitations:
+- Even if the page affected by UE is kept, a portion of the (huge)page is
+ already lost due to hardware corruption, and the size of the portion
+ is the smallest page size that the kernel uses to manage memory on the
+ architecture, i.e. PAGESIZE. Accessing a virtual address within any of
+ these parts results in a SIGBUS; accessing virtual addresses outside
+ these parts is fine until they are corrupted by a new memory error.
+- ``MFD_MF_KEEP_UE_MAPPED`` currently only works for HugeTLB, so
+ ``MFD_HUGETLB`` must also be set when setting ``MFD_MF_KEEP_UE_MAPPED``.
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-01-18 23:15 [RFC PATCH v1 0/3] Userspace MFR Policy via memfd Jiaqi Yan
` (2 preceding siblings ...)
2025-01-18 23:15 ` [RFC PATCH v1 3/3] Documentation: add userspace MF recovery policy via memfd Jiaqi Yan
@ 2025-01-20 17:26 ` Jason Gunthorpe
2025-01-21 21:45 ` Jiaqi Yan
2025-01-22 16:41 ` Zi Yan
2025-09-19 15:58 ` “William Roche
5 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2025-01-20 17:26 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, willy,
jane.chu, akpm, osalvador, rientjes, duenwen, jthoughton, ankita,
peterx, sidhartha.kumar, david, dave.hansen, muchun.song,
linux-mm, linux-kernel, linux-fsdevel
On Sat, Jan 18, 2025 at 11:15:46PM +0000, Jiaqi Yan wrote:
> In the experimental case, all the setups are identical to the baseline
> case, however 25% of the guest memory is split from THP to 4K pages due
> to the memory failure recovery triggered by MADV_HWPOISON. I made some
> minor changes in the kernel so that the MADV_HWPOISON-ed pages are
> unpoisoned, and afterwards the in-guest MemCycle is still able to read
> and write its data. The final aggregate rate is 16,355.11, which is
> decreased by 5.06% compared to the baseline case. When 5% of the guest
> memory is split after MADV_HWPOISON, the final aggregate rate is
> 16,999.14, a drop of 1.20% compared to the baseline case.
I think it was mentioned in one of the calls, but this is good data on
the CPU side, but for VMs doing IO, the IO performance is impacted
also. IOTLB miss on (random) IO performance, especially with two
dimensional IO paging, tends to have a performance curve that drops
off a cliff once the IOTLB is too small for the workload.
Specifically, systems seem to be designed to require high IOTLB hit
rate to maintain their target performance and IOTLB miss is much more
expensive than CPU TLB miss.
So, I would view MemCycler as something of a best case workload that
is not as sensitive to TLB size. A worst case is a workload that just
fits inside the TLB and reducing the page sizes pushes it to no longer
fit.
> Per-memfd MFR Policy associates the userspace MFR policy with a memfd
> instance. This approach is promising for the following reasons:
> 1. Keeping memory with UE mapped to a process has risks if the process
> does not do its duty to prevent itself from repeatedly consuming UER.
> The MFR policy can be associated with a memfd to limit such risk to a
> particular memory space owned by a particular process that opts in
> the policy. This is much preferable than the Global MFR Policy
> proposed in the initial RFC, which provides no granularity
> whatsoever.
Yes, very much agree
> 3. Although MFR policy allows the userspace process to keep memory UE
> mapped, eventually these HWPoison-ed folios need to be dealt with by
> the kernel (e.g. split into smallest chunk and isolated from
> future allocation). For memfd once all references to it are dropped,
> it is automatically released from userspace, which is a perfect
> timing for the kernel to do its duties to HWPoison-ed folios if any.
> This is also a big advantage to the Global MFR Policy, which breaks
> kernel’s protection to HWPoison-ed folios.
iommufd will hold the memory pinned for the life of the VM, is that OK
for this plan?
> 4. Given memfd’s anonymous semantic, we don’t need to worry about that
> different threads can have different and conflicting MFR policies. It
> allows a simpler implementation than the Per-VMA MFR Policy in the
> initial RFC [1].
Your policy is per-memfd right?
> However, the affected memory will be immediately protected and isolated
> from future use by both kernel and userspace once the owning memfd is
> gone or the memory is truncated. By default MFD_MF_KEEP_UE_MAPPED is not
> set, and kernel hard offlines memory having UEs. Kernel immediately
> poisons the folios for both cases.
I'm reading this and thinking that today we don't have any callback
into the iommu to force offline the memory either, so a guest can
still do DMA to it.
> Part2: When a AS_MF_KEEP_UE_MAPPED memfd is about to be released, or
> when the userspace process truncates a range of memory pages belonging
> to a AS_MF_KEEP_UE_MAPPED memfd:
> * When the in-memory file system is evicting the inode corresponding to
> the memfd, it needs to prepare the HWPoison-ed folios that are easily
> identifiable with the PG_HWPOISON flag. This operation is implemented
> by populate_memfd_hwp_folios and is exported to file systems.
> * After the file system removes all the folios, there is nothing else
> preventing MFR from dealing with HWPoison-ed folios, so the file
> system forwards them to MFR. This step is implemented by
> offline_memfd_hwp_folios and is exported to file systems.
As above, iommu won't release its refcount after truncate or zap.
> * MFR has been holding refcount(s) of each HWPoison-ed folio. After
> dropping the refcounts, a HWPoison-ed folio should become free and can
> be disposed of.
So you have to deal with "should" being "won't" in cases where VFIO is
being used...
> In V2 I can probably offline each folio as they get remove, instead of
> doing this in batch. The advantage is we can get rid of
> populate_memfd_hwp_folios and the linked list needed to store poisoned
> folios. One way is to insert filemap_offline_hwpoison_folio into
> somewhere in folio_batch_release, or into per file system's free_folio
> handler.
That sounds more workable given the above, though we keep getting into
cases where people want to hook free_folio..
> 2. In react to later fault to any part of the HWPoison-ed folio, guest
> memfd returns KVM_PFN_ERR_HWPOISON, and KVM sends SIGBUS to VMM. This
> is good enough for actual hardware corrupted PFN backed GFNs, but not
> ideal for the healthy PFNs “offlined” together with the error PFNs.
> The userspace MFR policy can be useful if VMM wants KVM to 1. Keep
> these GFNs mapped in the stage-2 page table 2. In react to later
> access to the actual hardware corrupted part of the HWPoison-ed
> folio, there is going to be a (repeated) poison consumption event,
> and KVM returns KVM_PFN_ERR_HWPOISON for the actual poisoned PFN.
I feel like the guestmemfd version of this is not about userspace
mappings but about what is communicated to the secure world.
If normal memfd would leave these pages mapped to the VMA then I'd
think the guestmemfd version would be to leave the pages mapped to the
secure world?
Keep in mind that guestmemfd is more complex than kvm today as several
of the secure world implementations are sharing the stage2/ept
translation between CPU and IOMMU HW. So you can't just unmap 1G of
memory without completely breaking the guest.
> This RFC [4] proposes a MFR framework for VFIO device managed userspace
> memory (i.e. memory regions mapped by remap_pfn_region). The userspace
> MFR policy can instruct the device driver to keep all PFN mapped in a
> VMA (i.e. don’t unmap_mapping_range).
Ankit has some patches that cause the MFR framework to send the
poison events for non-struct page memory to the device driver that
owns the memory.
> * IOCTL to the VFIO Device File. The device driver usually expose a
> file-like uAPI to its managed device memory (e.g. PCI MMIO BAR)
> directly with the file to the VFIO device. AS_MF_KEEP_UE_MAPPED can be
> placed in the address_space of the file to the VFIO device. Device
> driver can implement a specific IOCTL to the VFIO device file for
> userspace to set AS_MF_KEEP_UE_MAPPED.
I don't think address spaces are involved in the MFR path after Ankit's
patch? The dispatch is done entirely on phys_addr_t.
What happens will be up to the driver that owns the memory.
You could have a VFIO feature that specifies one behavior or the
other, but perhaps VFIO just always keeps things mapped. I don't know.
Jason
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-01-20 17:26 ` [RFC PATCH v1 0/3] Userspace MFR Policy " Jason Gunthorpe
@ 2025-01-21 21:45 ` Jiaqi Yan
0 siblings, 0 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-01-21 21:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, willy,
jane.chu, akpm, osalvador, rientjes, duenwen, jthoughton, ankita,
peterx, sidhartha.kumar, david, dave.hansen, muchun.song,
linux-mm, linux-kernel, linux-fsdevel
Thanks Jason, your comments are very much appreciated. I replied to
some of them and need more time to think about the others.
On Mon, Jan 20, 2025 at 9:26 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Sat, Jan 18, 2025 at 11:15:46PM +0000, Jiaqi Yan wrote:
> > In the experimental case, all the setups are identical to the baseline
> > case, however 25% of the guest memory is split from THP to 4K pages due
> > to the memory failure recovery triggered by MADV_HWPOISON. I made some
> > minor changes in the kernel so that the MADV_HWPOISON-ed pages are
> > unpoisoned, and afterwards the in-guest MemCycle is still able to read
> > and write its data. The final aggregate rate is 16,355.11, which is
> > decreased by 5.06% compared to the baseline case. When 5% of the guest
> > memory is split after MADV_HWPOISON, the final aggregate rate is
> > 16,999.14, a drop of 1.20% compared to the baseline case.
>
> I think it was mentioned in one of the calls, but this is good data on
> the CPU side, but for VMs doing IO, the IO performance is impacted
> also. IOTLB miss on (random) IO performance, especially with two
> dimensional IO paging, tends to have a performance curve that drops
> off a cliff once the IOTLB is too small for the workload.
>
> Specifically, systems seem to be designed to require high IOTLB hit
> rate to maintain their target performance and IOTLB miss is much more
> expensive than CPU TLB miss.
>
> So, I would view MemCycle as something of a best case work load that
> is not as sensitive to TLB size. A worst case is a workload that just
> fits inside the TLB and reducing the page sizes pushes it to no longer
> fit.
I think a guest IO benchmark could be valuable (I wasn't able to run
one along with the MemCycler experiment due to resource constraints),
so it is added to my TODO list. If the numbers come out really bad,
that may affect the default behavior of MFR, or at least people's
preference.
Do you know of an existing benchmark like MemCycler that I can run in
a VM, and that exercises two-dimensional IO paging?
>
> > Per-memfd MFR Policy associates the userspace MFR policy with a memfd
> > instance. This approach is promising for the following reasons:
> > 1. Keeping memory with UE mapped to a process has risks if the process
> > does not do its duty to prevent itself from repeatedly consuming UER.
> > The MFR policy can be associated with a memfd to limit such risk to a
> > particular memory space owned by a particular process that opts in
> > the policy. This is much preferable than the Global MFR Policy
> > proposed in the initial RFC, which provides no granularity
> > whatsoever.
>
> Yes, very much agree
>
> > 3. Although MFR policy allows the userspace process to keep memory UE
> > mapped, eventually these HWPoison-ed folios need to be dealt with by
> > the kernel (e.g. split into smallest chunk and isolated from
> > future allocation). For memfd once all references to it are dropped,
> > it is automatically released from userspace, which is a perfect
> > timing for the kernel to do its duties to HWPoison-ed folios if any.
> > This is also a big advantage to the Global MFR Policy, which breaks
> > kernel’s protection to HWPoison-ed folios.
>
> iommufd will hold the memory pinned for the life of the VM, is that OK
> for this plan?
On the surface, pinned memory (i.e. folio_maybe_dma_pinned() == true) by
definition should not be offlined or reclaimed in many code paths,
including HWPoison handling at any stage.
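For illustration only (not code from this series), I would expect the
offline path to bail out on such folios and report that back:

	/*
	 * Sketch: refuse to offline a folio that may still be DMA-pinned
	 * (e.g. by iommufd/VFIO); the caller then knows the HWPoison-ed
	 * folio could not be disposed of yet.
	 */
	if (folio_maybe_dma_pinned(folio))
		return -EBUSY;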
>
> > 4. Given memfd’s anonymous semantic, we don’t need to worry about that
> > different threads can have different and conflicting MFR policies. It
> > allows a simpler implementation than the Per-VMA MFR Policy in the
> > initial RFC [1].
>
> Your policy is per-memfd right?
Yes, basically per-memfd, no VMA involved.
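For example, the VMM would opt in when it creates the memfd, roughly
like the snippet below (MFD_MF_KEEP_UE_MAPPED is the new flag proposed
in this series; the other flags are existing memfd_create() flags):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <linux/memfd.h>

static int create_guest_ram(void)
{
	/* opt this memfd into keeping UE-containing folios mapped */
	return memfd_create("guest-ram",
			    MFD_HUGETLB | MFD_HUGE_1GB |
			    MFD_MF_KEEP_UE_MAPPED);
}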
>
> > However, the affected memory will be immediately protected and isolated
> > from future use by both kernel and userspace once the owning memfd is
> > gone or the memory is truncated. By default MFD_MF_KEEP_UE_MAPPED is not
> > set, and kernel hard offlines memory having UEs. Kernel immediately
> > poisons the folios for both cases.
>
> I'm reading this and thinking that today we don't have any callback
> into the iommu to force offline the memory either, so a guest can
> still do DMA to it.
>
> > Part2: When a AS_MF_KEEP_UE_MAPPED memfd is about to be released, or
> > when the userspace process truncates a range of memory pages belonging
> > to a AS_MF_KEEP_UE_MAPPED memfd:
> > * When the in-memory file system is evicting the inode corresponding to
> > the memfd, it needs to prepare the HWPoison-ed folios that are easily
> > identifiable with the PG_HWPOISON flag. This operation is implemented
> > by populate_memfd_hwp_folios and is exported to file systems.
> > * After the file system removes all the folios, there is nothing else
> > preventing MFR from dealing with HWPoison-ed folios, so the file
> > system forwards them to MFR. This step is implemented by
> > offline_memfd_hwp_folios and is exported to file systems.
>
> As above, iommu won't release its refcount after truncate or zap.
Due to the pinning behavior, or something else? In that case I think not
offlining the HWPoison-ed folio (i.e. a no-op) after truncate is the
expected behavior, or offline_memfd_hwp_folios could return EBUSY.
Taking HugeTLB as an example, I used to think the folio would be
dissolved into 4k pages after truncate. However, it seems today the
huge folio is just isolated from being freed and reused[*], as if the
hugepage is "leaked" (after truncation, nr_hugepages is not decreased,
but free_hugepages is reduced, as if the exiting process or mm still
held that hugepage).
[*] https://lore.kernel.org/linux-mm/20250119180608.2132296-3-jiaqiyan@google.com/T/#m54be295de1144eead4ab73c3cc9077b6dd14050f
>
> > * MFR has been holding refcount(s) of each HWPoison-ed folio. After
> > dropping the refcounts, a HWPoison-ed folio should become free and can
> > be disposed of.
>
> So you have to deal with "should" being "won't" in cases where VFIO is
> being used...
>
> > In V2 I can probably offline each folio as they get remove, instead of
> > doing this in batch. The advantage is we can get rid of
> > populate_memfd_hwp_folios and the linked list needed to store poisoned
> > folios. One way is to insert filemap_offline_hwpoison_folio into
> > somewhere in folio_batch_release, or into per file system's free_folio
> > handler.
>
> That sounds more workable given the above, though we keep getting into
> cases where people want to hook free_folio..
>
> > 2. In react to later fault to any part of the HWPoison-ed folio, guest
> > memfd returns KVM_PFN_ERR_HWPOISON, and KVM sends SIGBUS to VMM. This
> > is good enough for actual hardware corrupted PFN backed GFNs, but not
> > ideal for the healthy PFNs “offlined” together with the error PFNs.
> > The userspace MFR policy can be useful if VMM wants KVM to 1. Keep
> > these GFNs mapped in the stage-2 page table 2. In react to later
> > access to the actual hardware corrupted part of the HWPoison-ed
> > folio, there is going to be a (repeated) poison consumption event,
> > and KVM returns KVM_PFN_ERR_HWPOISON for the actual poisoned PFN.
>
> I feel like the guestmemfd version of this is not about userspace
> mappings but about what is communicated to the secure world.
>
> If normal memfd would leave these pages mapped to the VMA then I'd
> think the guestmemfd version would be to leave the pages mapped to the
> secure world?
>
> Keep in mind that guestmemfd is more complex than kvm today as several
> of the secure world implementations are sharing the stage2/ept
> translation between CPU and IOMMU HW. So you can't just unmap 1G of
> memory without completely breaking the guest.
>
> > This RFC [4] proposes a MFR framework for VFIO device managed userspace
> > memory (i.e. memory regions mapped by remap_pfn_region). The userspace
> > MFR policy can instruct the device driver to keep all PFN mapped in a
> > VMA (i.e. don’t unmap_mapping_range).
>
> Ankit has some patches that cause the MFR framework to send the
> poison events for non-struct page memory to the device driver that
> owns the memory.
But it seems the driver itself is not yet in charge of the unmap-or-not
decision; the MFR framework makes that call. I think it is probably
fine for the MFR framework to keep making the call, just with one new
piece of info, mapping_mf_keep_ue_mapped/AS_MF_KEEP_UE_MAPPED (from RFC
PATCH 1/3), roughly as sketched below.
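(Sketch only; 'mapping', 'start' and 'size' are placeholders for
whatever the framework ends up holding for the affected range.)

	/* keep the decision in the MFR framework, just honor the policy bit */
	if (!mapping_mf_keep_ue_mapped(mapping))
		unmap_mapping_range(mapping, start, size, 0);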
>
> > * IOCTL to the VFIO Device File. The device driver usually expose a
> > file-like uAPI to its managed device memory (e.g. PCI MMIO BAR)
> > directly with the file to the VFIO device. AS_MF_KEEP_UE_MAPPED can be
> > placed in the address_space of the file to the VFIO device. Device
> > driver can implement a specific IOCTL to the VFIO device file for
> > userspace to set AS_MF_KEEP_UE_MAPPED.
>
> I don't think address spaces are involved in the MFR path after Ankit's
> patch? The dispatch is done entirely on phys_addr_t.
I think strictly speaking Ankit's patch is built around
pfn_address_space[*], and the driver does provide an address_space when
it registers with core MM. So if the driver wants to be the one making
the unmap-or-not call, it should still be able to access
AS_MF_KEEP_UE_MAPPED.
[*] https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/
>
> What happens will be up to the driver that owns the memory.
>
> You could have a VFIO feature that specifies one behavior or the
Do you mean a new one under the VFIO_DEVICE_FEATURE IOCTL?
> other, but perhaps VFIO just always keeps things mapped. I don't know.
I think a proper VFIO guest benchmark can help tell which behavior is better.
>
> Jason
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-01-18 23:15 [RFC PATCH v1 0/3] Userspace MFR Policy via memfd Jiaqi Yan
` (3 preceding siblings ...)
2025-01-20 17:26 ` [RFC PATCH v1 0/3] Userspace MFR Policy " Jason Gunthorpe
@ 2025-01-22 16:41 ` Zi Yan
2025-09-19 15:58 ` William Roche
5 siblings, 0 replies; 14+ messages in thread
From: Zi Yan @ 2025-01-22 16:41 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, willy,
jane.chu, akpm, osalvador, rientjes, duenwen, jthoughton, jgg,
ankita, peterx, sidhartha.kumar, david, dave.hansen, muchun.song,
linux-mm, linux-kernel, linux-fsdevel
On 18 Jan 2025, at 18:15, Jiaqi Yan wrote:
<snip>
> MemCycler Benchmarking
> ======================
>
> To follow up the question by Dave Hansen, “If one motivation for this is
> guest performance, then it would be great to have some data to back that
> up, even if it is worst-case data”, we run MemCycler in guest and
> compare its performance when there are an extremely large number of
> memory errors.
>
> The MemCycler benchmark cycles through memory with multiple threads. On
> each iteration, the thread reads the current value, validates it, and
> writes a counter value. The benchmark continuously outputs rates
> indicating the speed at which it is reading and writing 64-bit integers,
> and aggregates the reads and writes of the multiple threads across
> multiple iterations into a single rate (unit: 64-bit per microsecond).
>
> MemCycler is running inside a VM with 80 vCPUs and 640 GB guest memory.
> The hardware platform hosting the VM is using Intel Emerald Rapids CPUs
> (in total 120 physical cores) and 1.5 T DDR5 memory. MemCycler allocates
> memory with 2M transparent hugepage in the guest. Our in-house VMM backs
> the guest memory with 2M transparent hugepage on the host. The final
> aggregate rate after 60 runtime is 17,204.69 and referred to as the
> baseline case.
>
> In the experimental case, all the setups are identical to the baseline
> case, however 25% of the guest memory is split from THP to 4K pages due
> to the memory failure recovery triggered by MADV_HWPOISON. I made some
> minor changes in the kernel so that the MADV_HWPOISON-ed pages are
> unpoisoned, and afterwards the in-guest MemCycle is still able to read
> and write its data. The final aggregate rate is 16,355.11, which is
> decreased by 5.06% compared to the baseline case. When 5% of the guest
> memory is split after MADV_HWPOISON, the final aggregate rate is
> 16,999.14, a drop of 1.20% compared to the baseline case.
>
<snip>
>
> Extensibility: THP SHMEM/TMPFS
> ==============================
>
> The current MFR behavior for THP SHMEM/TMPFS is to split the hugepage
> into raw page and only offline the raw HWPoison-ed page. In most cases
> THP is 2M and raw page size is 4K, so userspace loses the “huge”
> property of a 2M huge memory, but the actual data loss is only 4K.
I wonder if the buddy-allocator-like split[1] could help here by splitting
the THP into 1MB, 512KB, 256KB, ..., two 4KB chunks, so you still have
some mTHPs at the end.
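For example, a 2MB THP with the error in its first 4KB page would end up
as 1MB + 512KB + 256KB + 128KB + 64KB + 32KB + 16KB + 8KB + 4KB healthy
chunks plus the single poisoned 4KB page, so only 4KB is lost and most
of the range stays mapped as mTHPs.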
[1] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@nvidia.com/
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-01-18 23:15 [RFC PATCH v1 0/3] Userspace MFR Policy via memfd Jiaqi Yan
` (4 preceding siblings ...)
2025-01-22 16:41 ` Zi Yan
@ 2025-09-19 15:58 ` William Roche
2025-10-13 22:14 ` Jiaqi Yan
5 siblings, 1 reply; 14+ messages in thread
From: William Roche @ 2025-09-19 15:58 UTC (permalink / raw)
To: jiaqiyan, jgg
Cc: akpm, ankita, dave.hansen, david, duenwen, jane.chu, jthoughton,
linmiaohe, linux-fsdevel, linux-kernel, linux-mm, muchun.song,
nao.horiguchi, osalvador, peterx, rientjes, sidhartha.kumar,
tony.luck, wangkefeng.wang, willy, harry.yoo
From: William Roche <william.roche@oracle.com>
Hello,
The possibility to keep a VM using large hugetlbfs pages running after a memory
error is very important, and the possibility described here could be a good
candidate to address this issue.
So I would like to provide my feedback after testing this code with the
introduction of persistent errors in the address space: My tests used a VM
running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
test program provided with this project. But instead of injecting the errors
with madvise calls from this program, I get the guest physical address of a
location and inject the error from the hypervisor into the VM, so that any
subsequent access to the location is prevented directly from the hypervisor
level.
Using this framework, I realized that the code provided here has a problem:
When the error impacts a large folio, the release of this folio doesn't isolate
the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
a known poisoned page to get_page_from_freelist().
This revealed some mm limitations, as I would have expected that the
check_new_pages() mechanism used by the __rmqueue functions would filter these
pages out, but I noticed that this has been disabled by default in 2023 with:
[PATCH] mm, page_alloc: reduce page alloc/free sanity checks
https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
This problem seems to be avoided if we call take_page_off_buddy(page) in the
filemap_offline_hwpoison_folio_hugetlb() function without testing if
PageBuddy(page) is true first.
But according to me it leaves a (small) race condition where a new page
allocation could get a poisoned sub-page between the dissolve phase and the
attempt to remove it from the buddy allocator.
I do have the impression that a correct behavior (isolating an impacted
sub-page and remapping the valid memory content) using large pages is
currently only achieved with Transparent Huge Pages.
If performance requires using HugeTLB pages, then maybe we could accept
losing a huge page after a memory-error-impacted MFD_MF_KEEP_UE_MAPPED
memfd segment is released? If that can easily avoid some other corruption.
I'm very interested in finding an appropriate way to deal with memory errors on
hugetlbfs pages, and willing to help to build a valid solution. This project
showed a real possibility to do so, even in cases where pinned memory is used -
with VFIO for example.
I would really be interested in knowing your feedback about this project, and
if another solution is considered more adapted to deal with errors on hugetlbfs
pages, please let us know.
Thanks in advance for your answers.
William.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-09-19 15:58 ` “William Roche
@ 2025-10-13 22:14 ` Jiaqi Yan
2025-10-14 20:57 ` William Roche
2025-10-22 13:09 ` Harry Yoo
0 siblings, 2 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-10-13 22:14 UTC (permalink / raw)
To: William Roche, Ackerley Tng
Cc: jgg, akpm, ankita, dave.hansen, david, duenwen, jane.chu,
jthoughton, linmiaohe, linux-fsdevel, linux-kernel, linux-mm,
muchun.song, nao.horiguchi, osalvador, peterx, rientjes,
sidhartha.kumar, tony.luck, wangkefeng.wang, willy, harry.yoo
On Fri, Sep 19, 2025 at 8:58 AM William Roche <william.roche@oracle.com> wrote:
>
> From: William Roche <william.roche@oracle.com>
>
> Hello,
>
> The possibility to keep a VM using large hugetlbfs pages running after a memory
> error is very important, and the possibility described here could be a good
> candidate to address this issue.
Thanks for expressing interest, William, and sorry for getting back to
you so late.
>
> So I would like to provide my feedback after testing this code with the
> introduction of persistent errors in the address space: My tests used a VM
> running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> test program provided with this project. But instead of injecting the errors
> with madvise calls from this program, I get the guest physical address of a
> location and inject the error from the hypervisor into the VM, so that any
> subsequent access to the location is prevented directly from the hypervisor
> level.
This is exactly what VMM should do: when it owns or manages the VM
memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
such memory accesses.
>
> Using this framework, I realized that the code provided here has a problem:
> When the error impacts a large folio, the release of this folio doesn't isolate
> the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> a known poisoned page to get_page_from_freelist().
Just curious, how exactly you can repro this leaking of a known poison
page? It may help me debug my patch.
>
> This revealed some mm limitations, as I would have expected that the
> check_new_pages() mechanism used by the __rmqueue functions would filter these
> pages out, but I noticed that this has been disabled by default in 2023 with:
> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
Thanks for the reference. I did turn on CONFIG_DEBUG_VM=y during dev
and testing but didn't notice any WARNING on "bad page"; it is very
likely I was just lucky.
>
>
> This problem seems to be avoided if we call take_page_off_buddy(page) in the
> filemap_offline_hwpoison_folio_hugetlb() function without testing if
> PageBuddy(page) is true first.
Oh, I think you are right: filemap_offline_hwpoison_folio_hugetlb
shouldn't make the take_page_off_buddy(page) call conditional on
PageBuddy(page). take_page_off_buddy already checks PageBuddy itself,
on the page_head at each page order. So maybe somehow a known poisoned
page is not taken off the buddy allocator because of this?
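Concretely, I think the call site should look roughly like this (sketch,
not the exact code from the patch):

	/*
	 * Don't gate on PageBuddy(page): take_page_off_buddy() walks each
	 * order and checks PageBuddy() on the candidate page_head itself,
	 * which also covers a poisoned page merged into a larger buddy.
	 */
	if (!take_page_off_buddy(page))
		pr_warn("%#lx: hwpoisoned page not taken off buddy\n",
			page_to_pfn(page));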
Let me try to fix it in v2, by the end of the week. If you could test
with your way of repro as well, that will be very helpful!
> But according to me it leaves a (small) race condition where a new page
> allocation could get a poisoned sub-page between the dissolve phase and the
> attempt to remove it from the buddy allocator.
>
> I do have the impression that a correct behavior (isolating an impacted
> sub-page and remapping the valid memory content) using large pages is
> currently only achieved with Transparent Huge Pages.
> If performance requires using Hugetlb pages, than maybe we could accept to
> loose a huge page after a memory impacted MFD_MF_KEEP_UE_MAPPED memfd segment
> is released ? If it can easily avoid some other corruption.
>
> I'm very interested in finding an appropriate way to deal with memory errors on
> hugetlbfs pages, and willing to help to build a valid solution. This project
> showed a real possibility to do so, even in cases where pinned memory is used -
> with VFIO for example.
>
> I would really be interested in knowing your feedback about this project, and
> if another solution is considered more adapted to deal with errors on hugetlbfs
> pages, please let us know.
There is also another possible path if the VMM can change to backing VM
memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
work [1], guest_memfd can split the 1G page for conversions. If we
re-use the splitting for memory failure recovery, we can probably
achieve something generally similar to THP's memory failure recovery:
split 1G into 2M and 4k chunks, then unmap only the poisoned 4k page.
We still lose the 1G TLB reach, so the VM may see some performance
sacrifice.
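For example, with one poisoned 4k page in a 1G folio, that would ideally
leave 511 healthy 2M chunks plus 511 healthy 4k pages mapped (1G = 512
x 2M, and the affected 2M = 512 x 4k), with only the single poisoned 4k
page unmapped.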
[1] https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
>
> Thanks in advance for your answers.
> William.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-13 22:14 ` Jiaqi Yan
@ 2025-10-14 20:57 ` William Roche
2025-10-28 4:17 ` Jiaqi Yan
2025-10-22 13:09 ` Harry Yoo
1 sibling, 1 reply; 14+ messages in thread
From: William Roche @ 2025-10-14 20:57 UTC (permalink / raw)
To: Jiaqi Yan, Ackerley Tng
Cc: jgg, akpm, ankita, dave.hansen, david, duenwen, jane.chu,
jthoughton, linmiaohe, linux-fsdevel, linux-kernel, linux-mm,
muchun.song, nao.horiguchi, osalvador, peterx, rientjes,
sidhartha.kumar, tony.luck, wangkefeng.wang, willy, harry.yoo
On 10/14/25 00:14, Jiaqi Yan wrote:
> On Fri, Sep 19, 2025 at 8:58 AM William Roche wrote:
> [...]
>>
>> Using this framework, I realized that the code provided here has a
>> problem:
>> When the error impacts a large folio, the release of this folio
>> doesn't isolate the sub-page(s) actually impacted by the poison.
>> __rmqueue_pcplist() can return a known poisoned page to
>> get_page_from_freelist().
>
> Just curious, how exactly you can repro this leaking of a known poison
> page? It may help me debug my patch.
>
When the memfd segment impacted by a memory error is released, the
poisoned sub-page is not removed from the freelist, and a memory
allocation (large enough to increase the chance of getting this page)
crashes the system with the following stack trace (for example):
[ 479.572513] RIP: 0010:clear_page_erms+0xb/0x20
[...]
[ 479.587565] post_alloc_hook+0xbd/0xd0
[ 479.588371] get_page_from_freelist+0x3a6/0x6d0
[ 479.589221] ? srso_alias_return_thunk+0x5/0xfbef5
[ 479.590122] __alloc_frozen_pages_noprof+0x186/0x380
[ 479.591012] alloc_pages_mpol+0x7b/0x180
[ 479.591787] vma_alloc_folio_noprof+0x70/0xf0
[ 479.592609] alloc_anon_folio+0x1a0/0x3a0
[ 479.593401] do_anonymous_page+0x13f/0x4d0
[ 479.594174] ? pte_offset_map_rw_nolock+0x1f/0xa0
[ 479.595035] __handle_mm_fault+0x581/0x6c0
[ 479.595799] handle_mm_fault+0xcf/0x2a0
[ 479.596539] do_user_addr_fault+0x22b/0x6e0
[ 479.597349] exc_page_fault+0x67/0x170
[ 479.598095] asm_exc_page_fault+0x26/0x30
The idea is to run the test program in the VM and, instead of using
madvise to poison the location, take the guest physical address of the
location and translate it with QEMU's 'gpa2hpa' command, so that I can
inject the error on the hypervisor side with the hwpoison-inject module
(for example).
Let the test program finish, then run a memory allocator (trying to take
as much memory as possible); you should end up with a panic of the VM.
>>
>> This revealed some mm limitations, as I would have expected that the
>> check_new_pages() mechanism used by the __rmqueue functions would
>> filter these pages out, but I noticed that this has been disabled by
>> default in 2023 with:
>> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
>> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>
> Thanks for the reference. I did turned on CONFIG_DEBUG_VM=y during dev
> and testing but didn't notice any WARNING on "bad page"; It is very
> likely I was just lucky.
>
>>
>>
>> This problem seems to be avoided if we call take_page_off_buddy(page)
>> in the filemap_offline_hwpoison_folio_hugetlb() function without
>> testing if PageBuddy(page) is true first.
>
> Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> not. take_page_off_buddy will check PageBuddy or not, on the page_head
> of different page orders. So maybe somehow a known poisoned page is
> not taken off from buddy allocator due to this?
>
> Let me try to fix it in v2, by the end of the week. If you could test
> with your way of repro as well, that will be very helpful!
Of course, I'll run the test on your v2 version and let you know how it
goes.
>> But according to me it leaves a (small) race condition where a new
>> page allocation could get a poisoned sub-page between the dissolve
>> phase and the attempt to remove it from the buddy allocator.
I still think that the way we recycle the impacted large page has a
(much smaller) race condition where a memory allocation can get the
poisoned page, as we don't have the checks to filter the poisoned page
out of the freelist.
I'm not sure we have a way to recycle the page without a moment when
the poisoned page is on the freelist.
(I'd be happy to be proven wrong ;) )
>> If performance requires using Hugetlb pages, than maybe we could
>> accept to loose a huge page after a memory impacted
>> MFD_MF_KEEP_UE_MAPPED memfd segment is released ? If it can easily
>> avoid some other corruption.
What I meant is: if we don't have a reliable way to recycle an impacted
large page, we could start with a version of the code where we don't
recycle it, just to avoid the risk...
>
> There is also another possible path if VMM can change to back VM
> memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
> work [1], guest_memfd can split the 1G page for conversions. If we
> re-use the splitting for memory failure recovery, we can probably
> achieve something generally similar to THP's memory failure recovery:
> split 1G to 2M and 4k chunks, then unmap only 4k of poisoned page. We
> still lose the 1G TLB size so VM may be subject to some performance
> sacrifice.
>
> [1]
https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
Thanks for the pointer.
I personally think that splitting the large page into base pages is
just fine.
The main value I see in this project is significantly increasing the
probability of surviving a memory error on large-page-backed VMs.
HTH.
Thanks a lot,
William.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-13 22:14 ` Jiaqi Yan
2025-10-14 20:57 ` William Roche
@ 2025-10-22 13:09 ` Harry Yoo
2025-10-28 4:17 ` Jiaqi Yan
1 sibling, 1 reply; 14+ messages in thread
From: Harry Yoo @ 2025-10-22 13:09 UTC (permalink / raw)
To: Jiaqi Yan
Cc: William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linmiaohe,
linux-fsdevel, linux-kernel, linux-mm, muchun.song, nao.horiguchi,
osalvador, peterx, rientjes, sidhartha.kumar, tony.luck,
wangkefeng.wang, willy
On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> On Fri, Sep 19, 2025 at 8:58 AM “William Roche <william.roche@oracle.com> wrote:
> >
> > From: William Roche <william.roche@oracle.com>
> >
> > Hello,
> >
> > The possibility to keep a VM using large hugetlbfs pages running after a memory
> > error is very important, and the possibility described here could be a good
> > candidate to address this issue.
>
> Thanks for expressing interest, William, and sorry for getting back to
> you so late.
>
> >
> > So I would like to provide my feedback after testing this code with the
> > introduction of persistent errors in the address space: My tests used a VM
> > running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > test program provided with this project. But instead of injecting the errors
> > with madvise calls from this program, I get the guest physical address of a
> > location and inject the error from the hypervisor into the VM, so that any
> > subsequent access to the location is prevented directly from the hypervisor
> > level.
>
> This is exactly what VMM should do: when it owns or manages the VM
> memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> such memory accesses.
>
> >
> > Using this framework, I realized that the code provided here has a problem:
> > When the error impacts a large folio, the release of this folio doesn't isolate
> > the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > a known poisoned page to get_page_from_freelist().
>
> Just curious, how exactly you can repro this leaking of a known poison
> page? It may help me debug my patch.
>
> >
> > This revealed some mm limitations, as I would have expected that the
> > check_new_pages() mechanism used by the __rmqueue functions would filter these
> > pages out, but I noticed that this has been disabled by default in 2023 with:
> > [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>
> Thanks for the reference. I did turned on CONFIG_DEBUG_VM=y during dev
> and testing but didn't notice any WARNING on "bad page"; It is very
> likely I was just lucky.
>
> >
> >
> > This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > PageBuddy(page) is true first.
>
> Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> not. take_page_off_buddy will check PageBuddy or not, on the page_head
> of different page orders. So maybe somehow a known poisoned page is
> not taken off from buddy allocator due to this?
Maybe it's the case where the poisoned page is merged into a larger free
page, and the PGTY_buddy flag is set on the buddy of the poisoned page,
so PageBuddy() on the poisoned page returns false:
[ free page A ][ free page B (poisoned) ]
When these two are merged, PGTY_buddy is set on page A but not on B.
But even after fixing that we need to fix the race condition.
> Let me try to fix it in v2, by the end of the week. If you could test
> with your way of repro as well, that will be very helpful!
>
> > But according to me it leaves a (small) race condition where a new page
> > allocation could get a poisoned sub-page between the dissolve phase and the
> > attempt to remove it from the buddy allocator.
> >
> > I do have the impression that a correct behavior (isolating an impacted
> > sub-page and remapping the valid memory content) using large pages is
> > currently only achieved with Transparent Huge Pages.
> > If performance requires using Hugetlb pages, than maybe we could accept to
> > loose a huge page after a memory impacted MFD_MF_KEEP_UE_MAPPED memfd segment
> > is released ? If it can easily avoid some other corruption.
> >
> > I'm very interested in finding an appropriate way to deal with memory errors on
> > hugetlbfs pages, and willing to help to build a valid solution. This project
> > showed a real possibility to do so, even in cases where pinned memory is used -
> > with VFIO for example.
> >
> > I would really be interested in knowing your feedback about this project, and
> > if another solution is considered more adapted to deal with errors on hugetlbfs
> > pages, please let us know.
>
> There is also another possible path if VMM can change to back VM
> memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
> work [1], guest_memfd can split the 1G page for conversions. If we
> re-use the splitting for memory failure recovery, we can probably
> achieve something generally similar to THP's memory failure recovery:
> split 1G to 2M and 4k chunks, then unmap only 4k of poisoned page. We
> still lose the 1G TLB size so VM may be subject to some performance
> sacrifice.
> [1] https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
I want to take a closer look at the actual patches but either way sounds
good to me.
By the way, please Cc me in future revisions :)
Thanks!
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-14 20:57 ` William Roche
@ 2025-10-28 4:17 ` Jiaqi Yan
0 siblings, 0 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-10-28 4:17 UTC (permalink / raw)
To: William Roche
Cc: Ackerley Tng, jgg, akpm, ankita, dave.hansen, david, duenwen,
jane.chu, jthoughton, linmiaohe, linux-fsdevel, linux-kernel,
linux-mm, muchun.song, nao.horiguchi, osalvador, peterx, rientjes,
sidhartha.kumar, tony.luck, wangkefeng.wang, willy, harry.yoo
On Tue, Oct 14, 2025 at 1:57 PM William Roche <william.roche@oracle.com> wrote:
>
> On 10/14/25 00:14, Jiaqi Yan wrote:
> > On Fri, Sep 19, 2025 at 8:58 AM William Roche wrote:
> > [...]
> >>
> >> Using this framework, I realized that the code provided here has a
> >> problem:
> >> When the error impacts a large folio, the release of this folio
> >> doesn't isolate the sub-page(s) actually impacted by the poison.
> >> __rmqueue_pcplist() can return a known poisoned page to
> >> get_page_from_freelist().
> >
> > Just curious, how exactly you can repro this leaking of a known poison
> > page? It may help me debug my patch.
> >
>
> When the memfd segment impacted by a memory error is released, the
> sub-page impacted by a memory error is not removed from the freelist and
> an allocation of memory (large enough to increase the chance to get this
> page) crashes the system with the following stack trace (for example):
>
> [ 479.572513] RIP: 0010:clear_page_erms+0xb/0x20
> [...]
> [ 479.587565] post_alloc_hook+0xbd/0xd0
> [ 479.588371] get_page_from_freelist+0x3a6/0x6d0
> [ 479.589221] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 479.590122] __alloc_frozen_pages_noprof+0x186/0x380
> [ 479.591012] alloc_pages_mpol+0x7b/0x180
> [ 479.591787] vma_alloc_folio_noprof+0x70/0xf0
> [ 479.592609] alloc_anon_folio+0x1a0/0x3a0
> [ 479.593401] do_anonymous_page+0x13f/0x4d0
> [ 479.594174] ? pte_offset_map_rw_nolock+0x1f/0xa0
> [ 479.595035] __handle_mm_fault+0x581/0x6c0
> [ 479.595799] handle_mm_fault+0xcf/0x2a0
> [ 479.596539] do_user_addr_fault+0x22b/0x6e0
> [ 479.597349] exc_page_fault+0x67/0x170
> [ 479.598095] asm_exc_page_fault+0x26/0x30
>
> The idea is to run the test program in the VM and instead of using
> madvise to poison the location, I take the physical address of the
> location, and use Qemu 'gpa2hpa' address of the location,
> so that I can inject the error on the hypervisor with the
> hwpoison-inject module (for example).
> Let the test program finish and run a memory allocator (trying to take
> as much memory as possible)
> You should end up on a panic of the VM.
Thanks William, I can even repro with the hugetlb-mfr selftest without a VM.
>
> >>
> >> This revealed some mm limitations, as I would have expected that the
> >> check_new_pages() mechanism used by the __rmqueue functions would
> >> filter these pages out, but I noticed that this has been disabled by
> >> default in 2023 with:
> >> [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> >> https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Thanks for the reference. I did turned on CONFIG_DEBUG_VM=y during dev
> > and testing but didn't notice any WARNING on "bad page"; It is very
> > likely I was just lucky.
> >
> >>
> >>
> >> This problem seems to be avoided if we call take_page_off_buddy(page)
> >> in the filemap_offline_hwpoison_folio_hugetlb() function without
> >> testing if PageBuddy(page) is true first.
> >
> > Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> > shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> > not. take_page_off_buddy will check PageBuddy or not, on the page_head
> > of different page orders. So maybe somehow a known poisoned page is
> > not taken off from buddy allocator due to this?
> >
> > Let me try to fix it in v2, by the end of the week. If you could test
> > with your way of repro as well, that will be very helpful!
>
>
> Of course, I'll run the test on your v2 version and let you know how it
> goes.
Sorry it took more time than I expected to prepare v2. I want to get rid
of populate_memfd_hwp_folios and insert filemap_offline_hwpoison_folio
into remove_inode_single_folio, so that everything can be done on the
fly in remove_inode_hugepages's while loop. This refactor isn't as
trivial as I thought.
I struggled with page refcounts for some time, for a couple of reasons:
1. filemap_offline_hwpoison_folio has to put 1 refcount on the
hwpoison-ed folio so it can be dissolved. But I immediately got a "BUG:
Bad page state in process" due to "page: refcount:-1".
2. It turns out that remove_inode_hugepages also puts the folios'
refcounts via folio_batch_release. I avoided this for hwpoison-ed folios
by removing them from the fbatch.
I have just tested v2 with the hugetlb-mfr selftest and didn't see
"BUG: Bad page" for either a nonzero refcount or hwpoison after some
hours of uptime. Meanwhile, I will send v2 as a draft to you for more
test coverage.
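The shape of the v2 change is roughly the below (simplified sketch; the
real signature of filemap_offline_hwpoison_folio may differ):

	struct folio_batch keep;
	unsigned int i;

	folio_batch_init(&keep);
	for (i = 0; i < folio_batch_count(&fbatch); i++) {
		struct folio *folio = fbatch.folios[i];

		if (folio_test_hwpoison(folio))
			/* offline now; this drops the refcount MFR was holding */
			filemap_offline_hwpoison_folio(folio);
		else
			folio_batch_add(&keep, folio);
	}
	/* only the healthy folios get their refcount put here */
	folio_batch_release(&keep);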
>
>
> >> But according to me it leaves a (small) race condition where a new
> >> page allocation could get a poisoned sub-page between the dissolve
> >> phase and the attempt to remove it from the buddy allocator.
>
> I still think that the way we recycle the impacted large page still has
> a (much smaller) race condition where a memory allocation can get the
> poisoned page, as we don't have the checks to filter the poisoned page
> from the freelist.
> I'm not sure we have a way to recycle the page without having a moment
> when the poison page is in the freelist.
> (I'd be happy to be proven wrong ;) )
>
>
> >> If performance requires using Hugetlb pages, than maybe we could
> >> accept to loose a huge page after a memory impacted
> >> MFD_MF_KEEP_UE_MAPPED memfd segment is released ? If it can easily
> >> avoid some other corruption.
>
> What I meant is: if we don't have a reliable way to recycle an impacted
> large page, we could start with a version of the code where we don't
> recycle it, just to avoid the risk...
>
>
> >
> > There is also another possible path if VMM can change to back VM
> > memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
> > work [1], guest_memfd can split the 1G page for conversions. If we
> > re-use the splitting for memory failure recovery, we can probably
> > achieve something generally similar to THP's memory failure recovery:
> > split 1G to 2M and 4k chunks, then unmap only 4k of poisoned page. We
> > still lose the 1G TLB size so VM may be subject to some performance
> > sacrifice.
> >
> > [1]
> https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
>
>
> Thanks for the pointer.
> I personally think that splitting the large page into base pages, is
> just fine.
> The main possibility I see in this project is to significantly increase
> the probability to survive a memory error on large pages backed VMs.
>
> HTH.
>
> Thanks a lot,
> William.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-22 13:09 ` Harry Yoo
@ 2025-10-28 4:17 ` Jiaqi Yan
2025-10-28 7:00 ` Harry Yoo
0 siblings, 1 reply; 14+ messages in thread
From: Jiaqi Yan @ 2025-10-28 4:17 UTC (permalink / raw)
To: Harry Yoo, William Roche
Cc: Ackerley Tng, jgg, akpm, ankita, dave.hansen, david, duenwen,
jane.chu, jthoughton, linmiaohe, linux-fsdevel, linux-kernel,
linux-mm, muchun.song, nao.horiguchi, osalvador, peterx, rientjes,
sidhartha.kumar, tony.luck, wangkefeng.wang, willy
On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > On Fri, Sep 19, 2025 at 8:58 AM “William Roche <william.roche@oracle.com> wrote:
> > >
> > > From: William Roche <william.roche@oracle.com>
> > >
> > > Hello,
> > >
> > > The possibility to keep a VM using large hugetlbfs pages running after a memory
> > > error is very important, and the possibility described here could be a good
> > > candidate to address this issue.
> >
> > Thanks for expressing interest, William, and sorry for getting back to
> > you so late.
> >
> > >
> > > So I would like to provide my feedback after testing this code with the
> > > introduction of persistent errors in the address space: My tests used a VM
> > > running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > > test program provided with this project. But instead of injecting the errors
> > > with madvise calls from this program, I get the guest physical address of a
> > > location and inject the error from the hypervisor into the VM, so that any
> > > subsequent access to the location is prevented directly from the hypervisor
> > > level.
> >
> > This is exactly what VMM should do: when it owns or manages the VM
> > memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> > isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> > such memory accesses.
> >
> > >
> > > Using this framework, I realized that the code provided here has a problem:
> > > When the error impacts a large folio, the release of this folio doesn't isolate
> > > the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > > a known poisoned page to get_page_from_freelist().
> >
> > Just curious, how exactly you can repro this leaking of a known poison
> > page? It may help me debug my patch.
> >
> > >
> > > This revealed some mm limitations, as I would have expected that the
> > > check_new_pages() mechanism used by the __rmqueue functions would filter these
> > > pages out, but I noticed that this has been disabled by default in 2023 with:
> > > [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > > https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Thanks for the reference. I did turned on CONFIG_DEBUG_VM=y during dev
> > and testing but didn't notice any WARNING on "bad page"; It is very
> > likely I was just lucky.
> >
> > >
> > >
> > > This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > > filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > > PageBuddy(page) is true first.
> >
> > Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> > shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> > not. take_page_off_buddy will check PageBuddy or not, on the page_head
> > of different page orders. So maybe somehow a known poisoned page is
> > not taken off from buddy allocator due to this?
>
> Maybe it's the case where the poisoned page is merged to a larger page,
> and the PGTY_buddy flag is set on its buddy of the poisoned page, so
> PageBuddy() returns false?:
>
> [ free page A ][ free page B (poisoned) ]
>
> When these two are merged, then we set PGTY_buddy on page A but not on B.
Thanks Harry!
It is indeed this case. I validated it by adding some debug prints in
take_page_off_buddy:
[ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
[ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
[ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
[ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
[ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
[ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
[ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
[ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
[ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
[ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
[ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
[ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1
In this case, the page at 0x2800200 is hwpoisoned, and its buddy
page_head is 0x2800000 with order 10.
>
> But even after fixing that we need to fix the race condition.
What exactly is the race condition you are referring to?
>
> > Let me try to fix it in v2, by the end of the week. If you could test
> > with your way of repro as well, that will be very helpful!
> >
> > > But according to me it leaves a (small) race condition where a new page
> > > allocation could get a poisoned sub-page between the dissolve phase and the
> > > attempt to remove it from the buddy allocator.
> > >
> > > I do have the impression that a correct behavior (isolating an impacted
> > > sub-page and remapping the valid memory content) using large pages is
> > > currently only achieved with Transparent Huge Pages.
> > > If performance requires using Hugetlb pages, than maybe we could accept to
> > > loose a huge page after a memory impacted MFD_MF_KEEP_UE_MAPPED memfd segment
> > > is released ? If it can easily avoid some other corruption.
> > >
> > > I'm very interested in finding an appropriate way to deal with memory errors on
> > > hugetlbfs pages, and willing to help to build a valid solution. This project
> > > showed a real possibility to do so, even in cases where pinned memory is used -
> > > with VFIO for example.
> > >
> > > I would really be interested in knowing your feedback about this project, and
> > > if another solution is considered more adapted to deal with errors on hugetlbfs
> > > pages, please let us know.
> >
> > There is also another possible path if VMM can change to back VM
> > memory with *1G guest_memfd*, which wraps 1G hugetlbfs. In Ackerley's
> > work [1], guest_memfd can split the 1G page for conversions. If we
> > re-use the splitting for memory failure recovery, we can probably
> > achieve something generally similar to THP's memory failure recovery:
> > split 1G to 2M and 4k chunks, then unmap only 4k of poisoned page. We
> > still lose the 1G TLB size so VM may be subject to some performance
> > sacrifice.
> > [1] https://lore.kernel.org/linux-mm/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com
>
> I want to take a closer look at the actual patches but either way sounds
> good to me.
>
> By the way, please Cc me in future revisions :)
For sure!
>
> Thanks!
>
> --
> Cheers,
> Harry / Hyeonggon
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd
2025-10-28 4:17 ` Jiaqi Yan
@ 2025-10-28 7:00 ` Harry Yoo
0 siblings, 0 replies; 14+ messages in thread
From: Harry Yoo @ 2025-10-28 7:00 UTC (permalink / raw)
To: Jiaqi Yan
Cc: William Roche, Ackerley Tng, jgg, akpm, ankita,
dave.hansen, david, duenwen, jane.chu, jthoughton, linmiaohe,
linux-fsdevel, linux-kernel, linux-mm, muchun.song, nao.horiguchi,
osalvador, peterx, rientjes, sidhartha.kumar, tony.luck,
wangkefeng.wang, willy
On Mon, Oct 27, 2025 at 09:17:31PM -0700, Jiaqi Yan wrote:
> On Wed, Oct 22, 2025 at 6:09 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 03:14:32PM -0700, Jiaqi Yan wrote:
> > > On Fri, Sep 19, 2025 at 8:58 AM “William Roche <william.roche@oracle.com> wrote:
> > > >
> > > > From: William Roche <william.roche@oracle.com>
> > > >
> > > > Hello,
> > > >
> > > > The possibility to keep a VM using large hugetlbfs pages running after a memory
> > > > error is very important, and the possibility described here could be a good
> > > > candidate to address this issue.
> > >
> > > Thanks for expressing interest, William, and sorry for getting back to
> > > you so late.
> > >
> > > >
> > > > So I would like to provide my feedback after testing this code with the
> > > > introduction of persistent errors in the address space: My tests used a VM
> > > > running a kernel able to provide MFD_MF_KEEP_UE_MAPPED memfd segments to the
> > > > test program provided with this project. But instead of injecting the errors
> > > > with madvise calls from this program, I get the guest physical address of a
> > > > location and inject the error from the hypervisor into the VM, so that any
> > > > subsequent access to the location is prevented directly from the hypervisor
> > > > level.
> > >
> > > This is exactly what VMM should do: when it owns or manages the VM
> > > memory with MFD_MF_KEEP_UE_MAPPED, it is then VMM's responsibility to
> > > isolate guest/VCPUs from poisoned memory pages, e.g. by intercepting
> > > such memory accesses.
> > >
> > > >
> > > > Using this framework, I realized that the code provided here has a problem:
> > > > When the error impacts a large folio, the release of this folio doesn't isolate
> > > > the sub-page(s) actually impacted by the poison. __rmqueue_pcplist() can return
> > > > a known poisoned page to get_page_from_freelist().
> > >
> > > Just curious, how exactly you can repro this leaking of a known poison
> > > page? It may help me debug my patch.
> > >
> > > >
> > > > This revealed some mm limitations, as I would have expected that the
> > > > check_new_pages() mechanism used by the __rmqueue functions would filter these
> > > > pages out, but I noticed that this has been disabled by default in 2023 with:
> > > > [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
> > > > https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > >
> > > Thanks for the reference. I did turned on CONFIG_DEBUG_VM=y during dev
> > > and testing but didn't notice any WARNING on "bad page"; It is very
> > > likely I was just lucky.
> > >
> > > >
> > > >
> > > > This problem seems to be avoided if we call take_page_off_buddy(page) in the
> > > > filemap_offline_hwpoison_folio_hugetlb() function without testing if
> > > > PageBuddy(page) is true first.
> > >
> > > Oh, I think you are right, filemap_offline_hwpoison_folio_hugetlb
> > > shouldn't call take_page_off_buddy(page) depend on PageBuddy(page) or
> > > not. take_page_off_buddy will check PageBuddy or not, on the page_head
> > > of different page orders. So maybe somehow a known poisoned page is
> > > not taken off from buddy allocator due to this?
> >
> > Maybe it's the case where the poisoned page is merged to a larger page,
> > and the PGTY_buddy flag is set on its buddy of the poisoned page, so
> > PageBuddy() returns false?:
> >
> > [ free page A ][ free page B (poisoned) ]
> >
> > When these two are merged, then we set PGTY_buddy on page A but not on B.
>
> Thanks Harry!
>
> It is indeed this case. I validate by adding some debug prints in
> take_page_off_buddy:
>
> [ 193.029423] Memory failure: 0x2800200: [yjq] PageBuddy=0 after drain_all_pages
> [ 193.029426] 0x2800200: [yjq] order=0, page_order=0, PageBuddy(page_head)=0
> [ 193.029428] 0x2800200: [yjq] order=1, page_order=0, PageBuddy(page_head)=0
> [ 193.029429] 0x2800200: [yjq] order=2, page_order=0, PageBuddy(page_head)=0
> [ 193.029430] 0x2800200: [yjq] order=3, page_order=0, PageBuddy(page_head)=0
> [ 193.029431] 0x2800200: [yjq] order=4, page_order=0, PageBuddy(page_head)=0
> [ 193.029432] 0x2800200: [yjq] order=5, page_order=0, PageBuddy(page_head)=0
> [ 193.029434] 0x2800200: [yjq] order=6, page_order=0, PageBuddy(page_head)=0
> [ 193.029435] 0x2800200: [yjq] order=7, page_order=0, PageBuddy(page_head)=0
> [ 193.029436] 0x2800200: [yjq] order=8, page_order=0, PageBuddy(page_head)=0
> [ 193.029437] 0x2800200: [yjq] order=9, page_order=0, PageBuddy(page_head)=0
> [ 193.029438] 0x2800200: [yjq] order=10, page_order=10, PageBuddy(page_head)=1
>
> In this case, page for 0x2800200 is hwpoisoned, and its buddy page is
> 0x2800000 with order 10.
Woohoo, I got it right!
> > But even after fixing that we need to fix the race condition.
>
> What exactly is the race condition you are referring to?
When you free a high-order page, the buddy allocator does not check
PageHWPoison() on the page and its subpages. It checks PageHWPoison()
only when you free a base (order-0) page; see free_pages_prepare().
AFAICT there is nothing that prevents the poisoned page from being
allocated back to users, because the buddy doesn't check PageHWPoison()
on allocation either (by default).
So rather than freeing the high-order page as-is in
dissolve_free_hugetlb_folio(), I think we have to split it into base
pages and then free them one by one.
That way, free_pages_prepare() will catch that a page is poisoned and
won't add it back to the freelist. Otherwise there will always be a
window where the poisoned page can be allocated to users - before it's
taken off the buddy.
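Something along these lines, maybe (untested sketch; it glosses over how
the compound metadata and per-page refcounts are torn down, which the
hugetlb dissolve path would still need to handle first):

static void free_dissolved_folio_as_base_pages(struct page *head,
					       unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		struct page *p = head + i;

		set_page_count(p, 1);	/* assumes compound state is gone */
		/*
		 * The order-0 free path runs free_pages_prepare(), which
		 * sees PG_hwpoison and keeps the page off the freelist;
		 * healthy pages go back to the buddy as usual.
		 */
		__free_page(p);
	}
}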
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 14+ messages in thread
Thread overview: 14+ messages
2025-01-18 23:15 [RFC PATCH v1 0/3] Userspace MFR Policy via memfd Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 1/3] mm: memfd/hugetlb: introduce userspace memory failure recovery policy Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 2/3] selftests/mm: test userspace MFR for HugeTLB 1G hugepage Jiaqi Yan
2025-01-18 23:15 ` [RFC PATCH v1 3/3] Documentation: add userspace MF recovery policy via memfd Jiaqi Yan
2025-01-20 17:26 ` [RFC PATCH v1 0/3] Userspace MFR Policy " Jason Gunthorpe
2025-01-21 21:45 ` Jiaqi Yan
2025-01-22 16:41 ` Zi Yan
2025-09-19 15:58 ` William Roche
2025-10-13 22:14 ` Jiaqi Yan
2025-10-14 20:57 ` William Roche
2025-10-28 4:17 ` Jiaqi Yan
2025-10-22 13:09 ` Harry Yoo
2025-10-28 4:17 ` Jiaqi Yan
2025-10-28 7:00 ` Harry Yoo