* [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery
@ 2024-09-24 4:39 Jiaqi Yan
2024-09-24 4:39 ` [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy Jiaqi Yan
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Jiaqi Yan @ 2024-09-24 4:39 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe
Cc: tony.luck, wangkefeng.wang, jane.chu, akpm, osalvador, rientjes,
duenwen, jthoughton, jgg, ankita, peterx, linux-mm, Jiaqi Yan
Introduction and Motivation
===========================
Userspace recently gained control over how the kernel handles memory with
corrected memory errors [1]. This RFC wants to extend that control to how the
kernel deals with uncorrectable memory errors, so that userspace can control
all aspects of memory failure recovery (MFR).
Why does userspace need to play a role in MFR? There are two major use cases,
both from the cloud provider's perspective.
The first use case is 1G HugeTLB, which provides a critical optimization for
Virtual Machines (VMs) where database-centric and data-intensive workloads
require both a large memory size (hundreds of GB or several TB [2]) and high
address-mapping performance. These VMs usually also require high availability,
so tolerating and recovering from inevitable uncorrectable memory errors is
usually provided by host RAS features, in order to keep VM uptime long
(the SLA is 99.95% Monthly Uptime [3]). Due to the 1GB granularity, once a
byte of memory in a hugepage is hardware corrupted, the kernel discards the
whole 1G hugepage, not only the corrupted bytes but also the healthy portion,
from the HugeTLB system. In a cloud environment this is a great loss of memory
to the VM, putting the VM in a dilemma: although the host is able to keep
serving the VM, the VM itself struggles to continue its data-intensive
workload with the unnecessary loss of ~1G of data. On the other hand, simply
terminating the VM greatly reduces its uptime, given how frequently
uncorrectable memory errors occur. There was an RFC [4] that utilized high
granularity mapping [5] to recover from HugeTLB memory failures more
efficiently. However, it faded away together with the high granularity
mapping upstream proposal.
The second use case comes from the discussion of MFR for huge VM_PFNMAP [6],
which greatly improves TLB hits for PCI MMIO BARs and for host primary memory
not managed by the kernel. These are most relevant for VMs that run Machine
Learning (ML) workloads, which also require reliable VM uptime (see [7] for
details). There is no official MFR support for VM_PFNMAP yet, but [8] has been
proposed; it unmaps non-struct-page memory without considering huge VM_PFNMAP.
Peter, Jason and I discussed what the MFR behavior for huge VM_PFNMAP should
be [9]: if the driver originally VM_PFNMAP-ed with a PUD, it must first zap
the PUD, and then intercept future page faults to either install a PTE/PMD
for clean PFNs, or return VM_FAULT_HWPOISON for poisoned PFNs. Zapping the PUD
means there will be a huge hole in the EPT or stage-2 (S2) page table, causing
a lot of EPT or S2 violations that need to be fixed up by the device driver.
There will be a noticeable VM performance downgrade, not only while refilling,
but also after the hole is refilled, as the EPT or S2 is already fragmented.
For both use cases, HARD_OFFLINE behavior in MFR arguably does more harm than
good to the VM. For the 1st case, if we simply leave the 1G HugeTLB hugepage
mapped, VM accesses to the clean PFNs within the poisoned 1G region still work
well; we just need to keep sending SIGBUS to userspace when poisoned PFNs are
re-accessed, to stop corrupted data from being consumed. For the 2nd case, if
zapping the PUD doesn't happen, there is no need for the driver to intercept
page faults to clean memory on HBM or EGM. In addition, in both cases there is
no EPT or S2 violation, so no performance cost for accessing clean guest pages
already mapped in the EPT or S2.
Whether to keep or discard a large chunk of memory in the face of a memory
error therefore sounds like something that userspace should be able to
control. The virtual machine monitor can then choose the better behavior for
the VM it manages.
Background and Terminology
==========================
First I want to set the scope of the userspace control within the kernel's
memory error containment and recovery process, which is drawn in [10]:
1. Interpret Hardware Fault: respond immediately to decode the exception
generated by the hardware / firmware. On X86 the kernel needs to interpret
machine check exceptions; on ARM the kernel needs to interpret the
Common Platform Error Records (CPER).
2. Classify Error: the kernel classifies memory errors, white boxes in [10].
3. Result: different memory errors end up with different results, gray boxes
in [10].
The scope of the control exposed to userspace is Memory Failure Recovery
(highlighted in red in [10]), the only part that is relevant to userspace
and needs improvement. The userspace policy defines what userspace wants the
kernel to do to the memory page in this stage. If a policy is specified by
userspace, the kernel performs actions conforming to it; otherwise the
behavior is exactly what the kernel does today. Once the recovery actions
that must be executed by the kernel are done, if any, the kernel signals the
relevant userspace threads, which must be notified to prevent data corruption
and must participate in MFR.
Depending on whether the memory error is corrected or uncorrectable, a memory
page is referred to as
- Error page/hugepage/folio: the raw/huge/(either raw or huge) page
consisting of the physical memory poisoned by the platform's RAS.
- Corrected page/hugepage/folio: the raw/huge/(either raw or huge) page
consisting of the physical memory corrected by the platform's RAS.
We propose two design options for the uAPI of the MFR policy.
Option 1. Global MFR Policy
===========================
The global MFR policy depicted in [11] is "global" in the sense that
1. It applies to the entire system-level memory managed by the kernel,
regardless of the underlying memory type.
2. It applies to all userspace threads, regardless of whether the physical
memory is currently backing any VMA (i.e. it is free memory) or what VMAs it
is backing.
3. It applies to PCI(e) device memory (e.g. HBM on a GPU) as well, on the
condition that its device driver deliberately wants to follow the kernel's
MFR instead of handling errors entirely on its own (e.g.
drivers/nvdimm/pmem.c).
Drawn in blue, the kernel initializes the MFR policy with the default policy
to ensure two things.
- There is always a policy available during memory failure recovery.
- Default MFR is compatible with the existing MFR behavior in the kernel today.
MFR Config, which can be any binary that runs in userspace with root
privilege, configures or modifies the MFR policy independently of MFR
execution. MFR Config is independent from, and does not interact with,
stakeholder threads at all. MFR Config establishes the source-of-truth policy
that is enacted when memory errors occur.
The userspace API to set/get the policy is via sysctl:
- /proc/sys/vm/enable_soft_offline: whether to SOFT_OFFLINE corrected folio
- /proc/sys/vm/enable_hard_offline: whether to HARD_OFFLINE error folio
SOFT_OFFLINE and HARD_OFFLINE are current upstream behavior and will be used
as default policy when initializing global MFR policy.
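As a usage sketch (assuming patch 1/2 is applied so that enable_hard_offline
exists; enable_soft_offline is already upstream), a root-privileged control
plane could flip the global policy with a small helper like the following:

  #include <fcntl.h>
  #include <unistd.h>

  /*
   * Write "0" or "1" into one of the MFR sysctls, e.g.
   * "/proc/sys/vm/enable_hard_offline". Returns 0 on success and -1 on
   * failure (e.g. the sysctl does not exist or the caller lacks privilege).
   */
  static int set_mfr_sysctl(const char *path, int enable)
  {
          int fd = open(path, O_WRONLY);
          ssize_t ret;

          if (fd < 0)
                  return -1;
          ret = write(fd, enable ? "1" : "0", 1);
          close(fd);
          return ret == 1 ? 0 : -1;
  }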
There is one important thing to point out when
/proc/sys/vm/enable_hard_offline = 0: the kernel does NOT set the HWPoison
flag in the struct page or struct folio. This has implications, because the
kernel no longer enforces isolation of the poisoned page to prevent both
userspace and the kernel from consuming the memory error and causing a
hardware fault again (the enforcement used to be "setting the HWPoison flag"):
1. Userspace already has sufficient capability to prevent itself from
consuming the memory error and causing a hardware fault: with the poisoned
virtual address delivered in SIGBUS, it can ask the kernel to remap the
poisoned page with data loss, or simply abort the memory load operation
(see the sketch after this list). That being said, there needs to be a
mechanism to detect and forcibly kill a malicious userspace thread if it
keeps ignoring SIGBUS and generating hardware faults repeatedly.
2. The kernel won't be able to forbid the reuse of free error pages in future
memory allocations. If an error page is allocated to the kernel, a kernel
panic is most likely to happen when the kernel consumes it. For userspace, it
is no longer guaranteed that newly allocated memory is free of memory errors.
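As a minimal sketch of the first point, here is roughly what a SIGBUS handler
that "remaps the poisoned page with data loss" could look like. It only uses
the standard BUS_MCEERR_* siginfo fields; the recovery choice (re-mmap an
anonymous zero-filled page over the poisoned range) is just one option, and
whether it is acceptable depends entirely on the workload:

  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/mman.h>

  /*
   * Sketch of a SIGBUS handler for a process that opted out of HARD_OFFLINE
   * and recovers on its own. si_addr_lsb tells how much memory around
   * si_addr is poisoned (e.g. 12 for a 4K page).
   */
  static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
  {
          (void)ucontext;

          if (info->si_code != BUS_MCEERR_AR && info->si_code != BUS_MCEERR_AO)
                  _exit(128 + sig);       /* not a memory error: give up */

          size_t len = (size_t)1 << info->si_addr_lsb;
          void *start = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(len - 1));

          /*
           * Replace the poisoned range with a fresh anonymous mapping: the
           * data is lost, but the virtual addresses stay usable.
           */
          if (mmap(start, len, PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED)
                  _exit(128 + sig);       /* recovery failed: terminate */
  }

  int main(void)
  {
          struct sigaction sa = { 0 };

          sa.sa_sigaction = sigbus_handler;
          sa.sa_flags = SA_SIGINFO;
          sigaction(SIGBUS, &sa, NULL);
          /* ... run the workload ... */
          return 0;
  }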
Option 2. Per-VMA MFR
=====================
This design provides a policy for every struct vm_area_struct (VMA). A VMA
describes a virtual memory area (a virtual address interval of contiguous
memory that all shares the same characteristics), and its granularity is per
VM area and per task. One major use of a VMA is to associate a virtual memory
area of a process with a special rule for the page fault handlers. Since the
recovery action is mainly about the page fault (PF), i.e. how the PF handler
behaves wrt a corrected or error page, attaching the MFR policy to the VMA
sounds like a natural fit.
The interface for a userspace thread to set the MFR policy for one of its own
virtual memory areas is
  int madvise(void *vaddr, size_t length, int MADV_MFR_XXX)
or, if we want some "MFR master" to be able to assign the policy to other
threads of interest:
  int process_madvise(int pidfd, const struct iovec iovec[.n], size_t n,
                      int MADV_MFR_XXX, unsigned int flags)
where MADV_MFR_XXX is:
MADV_MFR_HARD_OFFLINE: HARD_OFFLINE error folio.
MADV_MFR_SOFT_OFFLINE: SOFT_OFFLINE corrected folio.
MADV_MFR_KEEP_ERROR_MAPPED: keep error folio mapped to preserve the ability
for userspace to re-access the error folio.
MADV_MFR_KEEP_CORRECTED_MAPPED: keep corrected folio mapped to preserve the
ability for userspace to re-access the
corrected folio.
MADV_MFR_KEEP_ERROR_MAPPED is the inverse of MADV_MFR_HARD_OFFLINE.
MADV_MFR_KEEP_CORRECTED_MAPPED is the inverse of MADV_MFR_SOFT_OFFLINE.
{MADV_MFR_HARD_OFFLINE, MADV_MFR_KEEP_ERROR_MAPPED} are independent of
{MADV_MFR_SOFT_OFFLINE, MADV_MFR_KEEP_CORRECTED_MAPPED}.
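As a usage sketch (the MADV_MFR_* advice values below are hypothetical: they
exist only in this proposal, and the numeric value is a placeholder, not part
of any uapi header), a VMM could opt the VMA backing guest memory out of
HARD_OFFLINE like this:

  #include <stddef.h>
  #include <sys/mman.h>

  /* Hypothetical advice value; only a placeholder for this proposal. */
  #ifndef MADV_MFR_KEEP_ERROR_MAPPED
  #define MADV_MFR_KEEP_ERROR_MAPPED 0x100
  #endif

  /*
   * Ask the kernel to keep error folios mapped in
   * [guest_mem, guest_mem + guest_size); the VMM then relies on SIGBUS
   * (BUS_MCEERR_*) to learn which guest PFNs are poisoned.
   */
  static int opt_out_of_hard_offline(void *guest_mem, size_t guest_size)
  {
          return madvise(guest_mem, guest_size, MADV_MFR_KEEP_ERROR_MAPPED);
  }

The VMM would typically call this once per guest-memory VMA right after
mapping it, before any memory error can occur.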
The exact behaviors of SOFT_OFFLINE and HARD_OFFLINE are specific to the page
types. In general what will happen is
- Some pages, including pages not impacted by memory error, can be unmapped
from userspace. PF handler sends SIGBUS to userspace for every unmapped page.
- Some pages, including pages not impacted by memory error, can be migrated
to pages somewhere else.
- Transparent hugepages will be split into raw pages.
- HugeTLB hugepages will be dissolved into raw pages.
- PUD/PMD/PTE installed for (huge) VM_PFNMAP will be removed.
The default MFR policy is (MADV_MFR_SOFT_OFFLINE | MADV_MFR_HARD_OFFLINE) for
every VMA at its creation time, to be consistent with the current kernel’s
MFR behavior.
Unmap and Page Fault Behavior with Per-VMA Policy
=================================================
Before describing the in-kernel behavior under the proposed per-VMA MFR
policy, one thing that the per-VMA MFR policy does not change, and that is
well worth pointing out, is: regardless of the MFR policy configured on the
VMA, the kernel still sets the HWPoison flag on the struct page or struct
folio.
New MFR and page fault behavior with per-VMA policy is illustrated in [12],
and here is a walkthrough with some pseudocode.
If the PF handler hasn't yet seen that the error page is HWPoison, a load
from the corrupted physical memory kicks off the memory error containment
process depicted in [10], and for the scope we care about (corrected plus
recoverable uncorrected memory errors) the kernel gets into the MFR box:
  page = pfn_to_page(pfn)
  pgoff = page_to_offset(page)
  SetPageHWPoison(page)                                 // step 6 in [12]
  for thread in collect_procs_mapped(page):
      for vma in vma_interval_tree(thread, pgoff):
          if vma's MFR policy == MADV_MFR_HARD_OFFLINE:  // step 7 in [12]
              try_to_unmap(vma, page)                    // step 8 in [12]
              vaddr = page_address_in_vma(vma, page)
              kill_proc(thread, vaddr)                   // step 9 in [12]
There are two outcomes for the owning thread after the kernel sends SIGBUS:
- The thread is terminated and there is no possible access to the error page.
- The thread performs some recovery action and keeps running. In this case,
the thread is able to attempt to re-access the poisoned memory (step 1 in [12],
"re-access HWPOISON").
When all threads that share the HWPoison-flagged folio exit, this problematic
page will be isolated and prevented from future allocation. However, before
that happens, there could still be a surviving thread or a sharing thread that
attempts to access the HWPoison-flagged folio. For these threads, the PF
handler needs to handle the access to the HWPoison-flagged folio, either
because the folio was already unmapped somehow (e.g. by Memory Failure
Recovery), or because the folio is not yet mapped into the thread:
  if PageHWPoison(page):                                  // step 2 in [12]
      if vma's MFR policy != MADV_MFR_KEEP_ERROR_MAPPED:  // step 3 in [12]
          kill_proc(current, vaddr)                       // step 4 in [12]
          return VM_FAULT_HWPOISON
  resolve the PF successfully
  install VA => PFN in thread's page table
If the page fault is resolved and the mapping is installed in the page table
successfully, or there isn't a PF at all because the mapping is still present
in the thread's page table, there are two possible outcomes for a memory load
from a folio that contains raw error page(s):
- Sad case: The underlying physical memory mapped into vaddr has an
uncorrectable memory error. Memory error consumption kicks off the memory
error containment process depicted in [10].
- Happy case: The underlying physical memory mapped into vaddr is clean,
i.e. the thread is accessing a healthy raw page in the compound hugepage.
Thread gets the clean data it asked for.
The worst case is a thread that repeatedly hits the sad case, spreading
sadness to the entire system in the manner of a denial of service. What can we
do to mitigate this Loop of Sadness? Forcibly unmapping the poisoned folio
won't help; we must forcibly terminate the thread with SIGKILL.
When different VMAs backed by the same chunk of memory, or physical address
range (PAR), want conflicting MFR policies, how should the conflict be
resolved? [13] illustrates an example of this: VMA1 wants
MADV_MFR_KEEP_ERROR_MAPPED but VMA2 wants MADV_MFR_HARD_OFFLINE. When MFR
deals with an error hugepage in PAR2, after the hugepage is kept in VMA1 but
unmapped from VMA2 at the same time, should it dissolve the hugepage or leave
it as it is? I think we can treat MADV_MFR_XXX as a non-persistent property of
the VMA, like MADV_FREE or MADV_DONTNEED. After all, once a VMA is
established, a thread can already modify the underlying physical memory
content, so why can't it change the MFR policy on it? A simple resolution to
the conflicting MADV_MFR_XXX for PAR2 is that the effective policy is the one
set by the thread that mapped the memory and wrote the policy last.
Pros and Cons
=============
Part of option 1 (/proc/sys/vm/enable_soft_offline) has already been merged
upstream. It already provides userspace with the control necessary for
corrected memory errors. From the uncorrectable memory error's perspective,
the per-VMA MFR option and the global MFR option have their own pros and cons:
- /proc/sys/vm/enable_hard_offline looks like a natural extension of
/proc/sys/vm/enable_soft_offline, but the implications of not setting the
HWPoison flag introduce risks to everyone in the system. The per-VMA policy
can also introduce a risk of repeated hardware faults, but the risk is
contained in the VMA / thread and can be cleaned up by the kernel when the
life of the VMA / thread ends.
- /proc/sys/vm/enable_hard_offline is easy to implement, especially with the
recent change from Jane [14]. I attached the patch for enable_hard_offline.
On the other hand, the per-VMA policy requires more code, e.g. changes to the
page fault handler and code to handle conflicting MFR policies.
- Per-VMA MFR provides userspace with much better flexibility than global MFR.
- The global MFR policy requires high privilege, either root or a dedicated
user group. On the other hand, the per-VMA policy can naturally be set by any
userspace thread that owns the VMA.
No matter how the MFR policy is designed, userspace will eventually be
notified of poisoned memory via SIGBUS BUS_MCEERR_AR or BUS_MCEERR_AO.
So far I personally prefer the global MFR policy, but I am open to feedback on
both options, or to new ideas.
[1] https://lwn.net/Articles/978732
[2] https://cloud.google.com/compute/docs/memory-optimized-machines#m2_machine_types
[3] https://cloud.google.com/compute/sla
[4] https://lore.kernel.org/lkml/20230428004139.2899856-1-jiaqiyan@google.com
[5] https://lwn.net/Articles/912017
[6] https://lore.kernel.org/linux-mm/20240828234958.GE3773488@nvidia.com/T/#m34d054d967a72ad8a7c8120c19447b415fd12179
[7] The example for the MMIO BAR case is Nvidia's GB200. In passthrough mode it
supports VM access to nearly half of its 196GB HBM per card [7.1]. The example
for kernel-unmanaged host primary memory is Nvidia's extended GPU memory (EGM)
[7.2], so that ~400GB of LPDDR5 DIMMs per socket on the host can not only back
VM memory, but also be accessed by the GPU at high speed. Both HBM and EGM are
exposed to the VM via VM_PFNMAP under the hood, and MFR for both HBM and EGM
is important because ML workloads require long VM uptime.
[7.1] https://www.nvidia.com/en-us/data-center/gb200-nvl72/?ncid=pa-srch-goog-739865&_bt=709953060161&_bk=nvidia%20blackwell%20tensor%20core%20gpus&_bm=p&_bn=g&_bg=169122792888&gad_source=1&gclid=Cj0KCQjwz7C2BhDkARIsAA_SZKbHWgnjAA_0Ve8niwtx9FooW-bgzehdRkDnoke-zIKafDaVu9d75eEaAjc_EALw_wcB
[7.2] https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory
[8] https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t
[9] https://lore.kernel.org/linux-mm/20240828234958.GE3773488@nvidia.com/T/#m413a61acaf1fc60e65ee7968ab0ae3093f7b1ea3
[10] https://docs.google.com/drawings/d/1Dmx2sxUGyRWdA1-5-HVko6IpsFL6PYAYL0ZL8T8AhY4
[11] https://docs.google.com/drawings/d/1E4m5Zy6_JFLmsacM3Z8FU6LLxLiTPMxvbmf4gzZhN6c
[12] https://docs.google.com/drawings/d/1hEe2BuEEJAlnqE4cjiZc-eBLjrkUk4BwOPDyL7TDClw
[13] https://docs.google.com/drawings/d/1u4er__Bziwn7itijOwghXhfu-JrXMDhnfFVu62BTzr0
[14] https://lore.kernel.org/all/20240524215306.2705454-2-jane.chu@oracle.com/T/#mbd530effd89d50eef7e9dd9375b900e7e34803c1
Jiaqi Yan (2):
mm/memory-failure: introduce global MFR policy
docs: mm: add enable_hard_offline sysctl
Documentation/admin-guide/sysctl/vm.rst | 92 +++++++++++++++++++++++++
mm/memory-failure.c | 33 +++++++++
2 files changed, 125 insertions(+)
--
2.46.0.792.g87dc391469-goog
^ permalink raw reply [flat|nested] 21+ messages in thread
* [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-09-24 4:39 [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery Jiaqi Yan
@ 2024-09-24 4:39 ` Jiaqi Yan
2024-10-02 23:50 ` jane.chu
2024-09-24 4:39 ` [RFC PATCH v1 2/2] docs: mm: add enable_hard_offline sysctl Jiaqi Yan
2024-10-02 15:02 ` [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery Jason Gunthorpe
2 siblings, 1 reply; 21+ messages in thread
From: Jiaqi Yan @ 2024-09-24 4:39 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe
Cc: tony.luck, wangkefeng.wang, jane.chu, akpm, osalvador, rientjes,
duenwen, jthoughton, jgg, ankita, peterx, linux-mm, Jiaqi Yan
Give userspace the control to enable or disable HARD_OFFLINE of an error folio
(either a raw page or a hugepage). By default, HARD_OFFLINE is enabled, to be
consistent with the existing memory_failure behavior.
Userspace should be able to control whether to keep or discard a large chunk
of memory in the event of uncorrectable memory errors. There are two major
use cases in cloud environments.
The 1st case is a 1G HugeTLB-backed database workload. Compared to discarding
the hugepage when only a single PFN is impacted by an uncorrectable memory
error, if the kernel simply leaves the 1G hugepage mapped, access to the
majority of clean PFNs within the poisoned 1G region still works well for the
VM and its workload.
The 2nd case is MMIO device memory or EGM [1] mapped to userspace via huge
VM_PFNMAP [2]. If the kernel does not zap the PUD or PMD, there is no need for
the VFIO driver that manages the memory to intercept page faults for clean
PFNs and to reinstall PTEs.
In addition, in both cases there is no EPT or stage-2 (S2) violation, so no
performance cost for accessing clean guest pages already mapped in EPT or S2.
See the cover letter for more details on why userspace needs such control, and
on the implications when userspace chooses to disable HARD_OFFLINE.
If this RFC receives generally positive feedback, I will add a selftest in v2.
[1] https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory
[2] https://lore.kernel.org/linux-mm/20240828234958.GE3773488@nvidia.com/T/#m413a61acaf1fc60e65ee7968ab0ae3093f7b1ea3
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
mm/memory-failure.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7066fc84f351..a7b85b98d61e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -70,6 +70,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
static int sysctl_enable_soft_offline __read_mostly = 1;
+static int sysctl_enable_hard_offline __read_mostly = 1;
+
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
static bool hw_memory_failure __read_mostly = false;
@@ -151,6 +153,15 @@ static struct ctl_table memory_failure_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
+ },
+ {
+ .procname = "enable_hard_offline",
+ .data = &sysctl_enable_hard_offline,
+ .maxlen = sizeof(sysctl_enable_hard_offline),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
}
};
@@ -2223,6 +2234,14 @@ int memory_failure(unsigned long pfn, int flags)
p = pfn_to_online_page(pfn);
if (!p) {
+ /*
+ * For ZONE_DEVICE memory and memory on special architectures,
+ * assume they have opted out of the core kernel's MFR. Since
+ * such memory can still be mapped to userspace, let userspace
+ * know MFR doesn't apply.
+ */
+ pr_info_once("%#lx: can't apply global MFR policy\n", pfn);
+
res = arch_memory_failure(pfn, flags);
if (res == 0)
goto unlock_mutex;
@@ -2241,6 +2260,20 @@ int memory_failure(unsigned long pfn, int flags)
goto unlock_mutex;
}
+ /*
+ * On ARM64, if APEI fails to claim the SEA (e.g. the GHES driver doesn't
+ * register for SEA notifications from firmware), memory_failure will
+ * never be synchronous to the error-consuming thread. Notifying
+ * it via SIGBUS synchronously has to be done by either the core kernel in
+ * do_mem_abort, or KVM in kvm_handle_guest_abort.
+ */
+ if (!sysctl_enable_hard_offline) {
+ pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
+ kill_procs_now(p, pfn, flags, page_folio(p));
+ res = -EOPNOTSUPP;
+ goto unlock_mutex;
+ }
+
try_again:
res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
if (hugetlb)
--
2.46.0.792.g87dc391469-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-09-24 4:39 ` [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy Jiaqi Yan
@ 2024-10-02 23:50 ` jane.chu
2024-10-03 23:51 ` Jiaqi Yan
0 siblings, 1 reply; 21+ messages in thread
From: jane.chu @ 2024-10-02 23:50 UTC (permalink / raw)
To: Jiaqi Yan, nao.horiguchi, linmiaohe
Cc: tony.luck, wangkefeng.wang, akpm, osalvador, rientjes, duenwen,
jthoughton, jgg, ankita, peterx, linux-mm, jane.chu
Hi,
On 9/23/2024 9:39 PM, Jiaqi Yan wrote:
>
> + /*
> + * On ARM64, if APEI failed to claims SEA, (e.g. GHES driver doesn't
> + * register to SEA notifications from firmware), memory_failure will
> + * never be synchrounous to the error consumption thread. Notifying
> + * it via SIGBUS synchrnously has to be done by either core kernel in
> + * do_mem_abort, or KVM in kvm_handle_guest_abort.
> + */
> + if (!sysctl_enable_hard_offline) {
> + pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
> + kill_procs_now(p, pfn, flags, page_folio(p));
> + res = -EOPNOTSUPP;
> + goto unlock_mutex;
> + }
> +
I am curious why the SIGBUS is sent without setting PG_hwpoison in the
page. In 0/2 there seems to be indication about threads coordinate
with each other such that clean subpages in a poisoned hugetlb page
continue to be accessible, and at some point, (or perhaps I misread),
the poisoned page (sub- or huge-) will eventually be isolated, because,
it's unthinkable to let a poisoned page laying around and kernel treats
it like a clean page ? But I'm not sure how do you plan to handle it
without PG_hwpoison while hard_offline is disabled globally.
Another thing I'm curious at is whether you have tested with real
hardware UE - the one that triggers MCE. When a real UE is consumed by
the training process, the user process must longjmp out in order to
avoid getting stuck at the same instruction that fetched a UE memory.
Given a longjmp is needed (unless I am missing something), the training
process is already in a situation where it has to figure out things like
rewind, where-to-restart-from, does it even keep states? etc. On the
whole, whether the burden to ask user application to deal with what's
lacking in the kernel, namely the lack of splitting up a hugetlb page,
is worthwhile, is something that need to be weighed over.
Thanks,
-jane
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-02 23:50 ` jane.chu
@ 2024-10-03 23:51 ` Jiaqi Yan
2024-10-07 17:24 ` jane.chu
2024-10-11 7:04 ` Miaohe Lin
0 siblings, 2 replies; 21+ messages in thread
From: Jiaqi Yan @ 2024-10-03 23:51 UTC (permalink / raw)
To: jane.chu
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
linux-mm
Hi Jane,
On Wed, Oct 2, 2024 at 4:50 PM <jane.chu@oracle.com> wrote:
>
> Hi,
>
> On 9/23/2024 9:39 PM, Jiaqi Yan wrote:
> >
> > + /*
> > + * On ARM64, if APEI failed to claims SEA, (e.g. GHES driver doesn't
> > + * register to SEA notifications from firmware), memory_failure will
> > + * never be synchrounous to the error consumption thread. Notifying
> > + * it via SIGBUS synchrnously has to be done by either core kernel in
> > + * do_mem_abort, or KVM in kvm_handle_guest_abort.
> > + */
> > + if (!sysctl_enable_hard_offline) {
> > + pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
> > + kill_procs_now(p, pfn, flags, page_folio(p));
> > + res = -EOPNOTSUPP;
> > + goto unlock_mutex;
> > + }
> > +
>
> I am curious why the SIGBUS is sent without setting PG_hwpoison in the
> page. In 0/2 there seems to be indication about threads coordinate
> with each other such that clean subpages in a poisoned hugetlb page
> continue to be accessible, and at some point, (or perhaps I misread),
> the poisoned page (sub- or huge-) will eventually be isolated, because,
The code here is "global policy". The "per-VMA policy", proposed in
0/2 but code not sent, should be able to support isolation + offline
at some point (all VMAs are gone and page becomes free).
> it's unthinkable to let a poisoned page laying around and kernel treats
> it like a clean page ? But I'm not sure how do you plan to handle it
> without PG_hwpoison while hard_offline is disabled globally.
It will become the responsibility of a control plane running in
userspace. For example, the control plane immediately prevents starting
of any new workload/VM, but chooses to wait until memory errors exceed
a certain threshold, or hold on to the hosts until all workloads/VMs
are migrated and then repair the machine. Not setting PG_hwpoison is
indeed a big difference and risk, so it needs to be carefully handled
by userspace.
>
> Another thing I'm curious at is whether you have tested with real
> hardware UE - the one that triggers MCE. When a real UE is consumed by
Yes, with our workload. Can you share more about what is the "training
process"? Is it something to train memory or screen memory errors?
> the training process, the user process must longjmp out in order to
> avoid getting stuck at the same instruction that fetched a UE memory.
> Given a longjmp is needed (unless I am missing something), the training
> process is already in a situation where it has to figure out things like
> rewind, where-to-restart-from, does it even keep states? etc. On the
> whole, whether the burden to ask user application to deal with what's
> lacking in the kernel, namely the lack of splitting up a hugetlb page,
> is worthwhile, is something that need to be weighed over.
For sure, and that's why I put a lot of words in the cover letter
to talk about 2 use cases where "user application to deal with what's
lacking in the kernel" is worthwhile.
>
> Thanks,
>
> -jane
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-03 23:51 ` Jiaqi Yan
@ 2024-10-07 17:24 ` jane.chu
2024-10-10 23:21 ` Jiaqi Yan
2024-10-11 7:04 ` Miaohe Lin
1 sibling, 1 reply; 21+ messages in thread
From: jane.chu @ 2024-10-07 17:24 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
linux-mm
On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
> soned page (sub- or huge-) will eventually be isolated, because,
> The code here is "global policy". The "per-VMA policy", proposed in
> 0/2 but code not sent, should be able to support isolation + offline
> at some point (all VMAs are gone and page becomes free).
"per-VMA policy" sounds interesting.
>> Another thing I'm curious at is whether you have tested with real
>> hardware UE - the one that triggers MCE. When a real UE is consumed by
> Yes, with our workload. Can you share more about what is the "training
> process"? Is it something to train memory or screen memory errors?
The cover letter mentioned "Machine Learning (ML) workloads", so I used
it as an example.
-jane
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-07 17:24 ` jane.chu
@ 2024-10-10 23:21 ` Jiaqi Yan
2024-10-11 18:28 ` jane.chu
0 siblings, 1 reply; 21+ messages in thread
From: Jiaqi Yan @ 2024-10-10 23:21 UTC (permalink / raw)
To: jane.chu
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
linux-mm
On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@oracle.com> wrote:
>
> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
> > soned page (sub- or huge-) will eventually be isolated, because,
> > The code here is "global policy". The "per-VMA policy", proposed in
> > 0/2 but code not sent, should be able to support isolation + offline
> > at some point (all VMAs are gone and page becomes free).
> "per-VMA policy" sounds interesting.
> >> Another thing I'm curious at is whether you have tested with real
> >> hardware UE - the one that triggers MCE. When a real UE is consumed by
> > Yes, with our workload. Can you share more about what is the "training
> > process"? Is it something to train memory or screen memory errors?
>
> The cover letter mentioned "Machine Learning (ML) workloads", so I used
> it as an example.
Got you. In that case, if the ML workload (running in a VM) wants to
do what you described, wouldn't losing 1G hugetlb page due to kernel
offline make the VM/workload even harder to execute recover logic?
>
> -jane
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-10 23:21 ` Jiaqi Yan
@ 2024-10-11 18:28 ` jane.chu
2024-10-11 19:44 ` Luck, Tony
2024-10-15 23:45 ` Jiaqi Yan
0 siblings, 2 replies; 21+ messages in thread
From: jane.chu @ 2024-10-11 18:28 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
linux-mm
On 10/10/2024 4:21 PM, Jiaqi Yan wrote:
> On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@oracle.com> wrote:
>> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
>>> soned page (sub- or huge-) will eventually be isolated, because,
>>> The code here is "global policy". The "per-VMA policy", proposed in
>>> 0/2 but code not sent, should be able to support isolation + offline
>>> at some point (all VMAs are gone and page becomes free).
>> "per-VMA policy" sounds interesting.
>>>> Another thing I'm curious at is whether you have tested with real
>>>> hardware UE - the one that triggers MCE. When a real UE is consumed by
>>> Yes, with our workload. Can you share more about what is the "training
>>> process"? Is it something to train memory or screen memory errors?
>> The cover letter mentioned "Machine Learning (ML) workloads", so I used
>> it as an example.
> Got you. In that case, if the ML workload (running in a VM) wants to
> do what you described, wouldn't losing 1G hugetlb page due to kernel
> offline make the VM/workload even harder to execute recover logic?
Indeed.
As the user application got more sophisticated on recovering from
poison, what about making the kernel to do the heavy lifting?
Something like by way of userfaultfd, kernel provides a new/clean
hugetlb page, copied over good data from the clean subpages and then
present the clean hugetlb page to user process with indication that
subpage x is a substitute of the poisoned old subpage x, hence its data
might need a refill? I am not sure how exactly to pull this through as
the event is not a page fault, but just wondering whether something like
this is possible.
thanks,
-jane
>
>> -jane
>>
^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-11 18:28 ` jane.chu
@ 2024-10-11 19:44 ` Luck, Tony
2024-10-11 20:15 ` jane.chu
2024-10-15 23:45 ` Jiaqi Yan
1 sibling, 1 reply; 21+ messages in thread
From: Luck, Tony @ 2024-10-11 19:44 UTC (permalink / raw)
To: chu, jane, Jiaqi Yan
Cc: nao.horiguchi@gmail.com, linmiaohe@huawei.com,
wangkefeng.wang@huawei.com, akpm@linux-foundation.org,
osalvador@suse.de, rientjes@google.com, duenwen@google.com,
jthoughton@google.com, jgg@nvidia.com, ankita@nvidia.com,
peterx@redhat.com, linux-mm@kvack.org
> Something like by way of userfaultfd, kernel provides a new/clean
> hugetlb page, copied over good data from the clean subpages and then
> present the clean hugetlb page to user process with indication that
> subpage x is a substitute of the poisoned old subpage x, hence its data
> might need a refill? I am not sure how exactly to pull this through as
> the event is not a page fault, but just wondering whether something like
> this is possible.
This requires serious levels of sophistication from the application.
If some thread still accesses the "lost" data, there's no signal that
anything went wrong. It just reads whatever data the kernel filled the
poisoned area with. For some applications there might be some
data pattern that would help track this down. But no general answer.
On the plus side, the amount of "lost" data need not be a page.
On Intel the poison unit is a cache line (64 bytes). So more of the
original data can potentially be preserved. This might be useful
for applications using regular pages as well as those using huge pages.
When Linux first implemented recovery, we had hopes that applications
like databases would be able to implement their own recovery. Losing
a whole page turned out to be problematic as in some implementations
the metadata for a database entry was stored at the start of the memory
block. So the SIGBUS would provide the virtual address, and it wasn't
of any practical use to determine which data structure(s) were affected
without some massive restructure of the code to separate metadata
from data.
-Tony
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-11 19:44 ` Luck, Tony
@ 2024-10-11 20:15 ` jane.chu
0 siblings, 0 replies; 21+ messages in thread
From: jane.chu @ 2024-10-11 20:15 UTC (permalink / raw)
To: Luck, Tony, Jiaqi Yan
Cc: nao.horiguchi@gmail.com, linmiaohe@huawei.com,
wangkefeng.wang@huawei.com, akpm@linux-foundation.org,
osalvador@suse.de, rientjes@google.com, duenwen@google.com,
jthoughton@google.com, jgg@nvidia.com, ankita@nvidia.com,
peterx@redhat.com, linux-mm@kvack.org
On 10/11/2024 12:44 PM, Luck, Tony wrote:
>> Something like by way of userfaultfd, kernel provides a new/clean
>> hugetlb page, copied over good data from the clean subpages and then
>> present the clean hugetlb page to user process with indication that
>> subpage x is a substitute of the poisoned old subpage x, hence its data
>> might need a refill? I am not sure how exactly to pull this through as
>> the event is not a page fault, but just wondering whether something like
>> this is possible.
> This requires serious levels of sophistication from the application.
> If some thread still accesses the "lost" data, there's no signal that
> anything went wrong. It just reads whatever data the kernel filled the
> poisoned area with. For some applications there might be some
> data pattern that would help track this down. But no general answer.
Is it possible to rely on mf_mutex to hold off subsequent threads
accessing the poisoned spot until the 1st poison event has been handled
and page replaced by joint effort of the application and kernel? I mean
until the poisoned page is removed from the page table, other threads
accessing it would hit MCE, right?
>
> On the plus side, the amount of "lost" data need not be a page.
> On Intel the poison unit is a cache line (64 bytes). So more of the
> original data can potentially be preserved. This might be useful
> for applications using regular pages as well as those using huge pages.
That requires the kernel to provide finer grained SIGBUS payload such as
untrimmed vaddr and si_lsb=6.
>
> When Linux first implemented recovery, we had hopes that applications
> like databases would be able to implement their own recovery. Losing
> a whole page turned out to be problematic as in some implementations
> the metadata for a database entry was stored at the start of the memory
> block. So the SIGBUS would provide the virtual address, and it wasn't
> of any practical use to determine which data structure(s) were affected
> without some massive restructure of the code to separate metadata
> from data.
>
> -Tony
-jane
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-11 18:28 ` jane.chu
2024-10-11 19:44 ` Luck, Tony
@ 2024-10-15 23:45 ` Jiaqi Yan
2024-10-15 23:56 ` Luck, Tony
1 sibling, 1 reply; 21+ messages in thread
From: Jiaqi Yan @ 2024-10-15 23:45 UTC (permalink / raw)
To: jane.chu
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
linux-mm
On Fri, Oct 11, 2024 at 11:28 AM <jane.chu@oracle.com> wrote:
>
> On 10/10/2024 4:21 PM, Jiaqi Yan wrote:
>
> > On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@oracle.com> wrote:
> >> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
> >>> soned page (sub- or huge-) will eventually be isolated, because,
> >>> The code here is "global policy". The "per-VMA policy", proposed in
> >>> 0/2 but code not sent, should be able to support isolation + offline
> >>> at some point (all VMAs are gone and page becomes free).
> >> "per-VMA policy" sounds interesting.
> >>>> Another thing I'm curious at is whether you have tested with real
> >>>> hardware UE - the one that triggers MCE. When a real UE is consumed by
> >>> Yes, with our workload. Can you share more about what is the "training
> >>> process"? Is it something to train memory or screen memory errors?
> >> The cover letter mentioned "Machine Learning (ML) workloads", so I used
> >> it as an example.
> > Got you. In that case, if the ML workload (running in a VM) wants to
> > do what you described, wouldn't losing 1G hugetlb page due to kernel
> > offline make the VM/workload even harder to execute recover logic?
>
> Indeed.
>
> As the user application got more sophisticated on recovering from
> poison, what about making the kernel to do the heavy lifting?
I think there are two things.
First, if userspace claims it has enough or sophisticated recovery
ability (assume we trust it), can it take full control of what happens
to the hardware poisoned memory page it **owns**?
My answer to this question is yes. The reason is I believe the kernel
has a limited ability to do memory failure recovery (MFR) optimally
for all userspace. Current hard offline support in the kernel has also
made userspace recovery hard, so userspace deserves a position in MFR.
Second, what is the granularity of the control? This patch makes the
control applicable to every process. So what about making it
controllable only by the userspace process that owns the memory page?
The kernel can still do the heavy lifting (hard offline, set
HWPoison) **after** the owning userspace unclaims the control, or
exits.
Another way to "disable hardoffline but still set HWPoison" I can
think of is, make the HWPOISON flag apply at page_size level, instead
of always set at the compound head. At least from hugetlb's
perspective, is it a good idea?
>
> Something like by way of userfaultfd, kernel provides a new/clean
> hugetlb page, copied over good data from the clean subpages and then
> present the clean hugetlb page to user process with indication that
> subpage x is a substitute of the poisoned old subpage x, hence its data
> might need a refill? I am not sure how exactly to pull this through as
> the event is not a page fault, but just wondering whether something like
> this is possible.
>
> thanks,
>
> -jane
>
> >
> >> -jane
> >>
^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-15 23:45 ` Jiaqi Yan
@ 2024-10-15 23:56 ` Luck, Tony
2024-10-16 0:19 ` jane.chu
0 siblings, 1 reply; 21+ messages in thread
From: Luck, Tony @ 2024-10-15 23:56 UTC (permalink / raw)
To: Jiaqi Yan, chu, jane
Cc: nao.horiguchi@gmail.com, linmiaohe@huawei.com,
wangkefeng.wang@huawei.com, akpm@linux-foundation.org,
osalvador@suse.de, rientjes@google.com, duenwen@google.com,
jthoughton@google.com, jgg@nvidia.com, ankita@nvidia.com,
peterx@redhat.com, linux-mm@kvack.org
> Another way to "disable hardoffline but still set HWPoison" I can
> think of is, make the HWPOISON flag apply at page_size level, instead
> of always set at the compound head. At least from hugetlb's
> perspective, is it a good idea?
Many years ago someone looked at breaking up hugetlb pages
when a memory error occurred so that just 4K was lost instead
of the entire huge page. At that time the conclusion was that
doing so would require locks to be taken/released around all
hugetlb map/unmap operations. An unacceptable performance
issue for common operations to handle very rare memory error
events.
I don't know if that is still true. There's been a lot of restructure
to memory management code since then.
-Tony
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-15 23:56 ` Luck, Tony
@ 2024-10-16 0:19 ` jane.chu
0 siblings, 0 replies; 21+ messages in thread
From: jane.chu @ 2024-10-16 0:19 UTC (permalink / raw)
To: Luck, Tony, Jiaqi Yan
Cc: nao.horiguchi@gmail.com, linmiaohe@huawei.com,
wangkefeng.wang@huawei.com, akpm@linux-foundation.org,
osalvador@suse.de, rientjes@google.com, duenwen@google.com,
jthoughton@google.com, jgg@nvidia.com, ankita@nvidia.com,
peterx@redhat.com, linux-mm@kvack.org
On 10/15/2024 4:56 PM, Luck, Tony wrote:
>> Another way to "disable hardoffline but still set HWPoison" I can
>> think of is, make the HWPOISON flag apply at page_size level, instead
>> of always set at the compound head. At least from hugetlb's
>> perspective, is it a good idea?
> Many years ago someone looked at breaking up hugetlb pages
> when a memory error occurred so that just 4K was lost instead
> of the entire huge page. At that time the conclusion was that
> doing so would require locks to be taken/released around all
> hugetlb map/unmap operations. An unacceptable performance
> issue for common operations to handle very rare memory error
> events.
>
> I don't know if that is still true. There's been a lot of restructure
> to memory management code since then.
The HGM for hugetlbfs project attempted this as well:
https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey/
-jane
>
> -Tony
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-03 23:51 ` Jiaqi Yan
2024-10-07 17:24 ` jane.chu
@ 2024-10-11 7:04 ` Miaohe Lin
2024-10-15 23:58 ` Jiaqi Yan
1 sibling, 1 reply; 21+ messages in thread
From: Miaohe Lin @ 2024-10-11 7:04 UTC (permalink / raw)
To: Jiaqi Yan, jane.chu
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, akpm, osalvador,
rientjes, duenwen, jthoughton, jgg, ankita, peterx, linux-mm
On 2024/10/4 7:51, Jiaqi Yan wrote:
> Hi Jane,
>
> On Wed, Oct 2, 2024 at 4:50 PM <jane.chu@oracle.com> wrote:
>>
>> Hi,
>>
>> On 9/23/2024 9:39 PM, Jiaqi Yan wrote:
>>>
>>> + /*
>>> + * On ARM64, if APEI failed to claims SEA, (e.g. GHES driver doesn't
>>> + * register to SEA notifications from firmware), memory_failure will
>>> + * never be synchrounous to the error consumption thread. Notifying
>>> + * it via SIGBUS synchrnously has to be done by either core kernel in
>>> + * do_mem_abort, or KVM in kvm_handle_guest_abort.
>>> + */
>>> + if (!sysctl_enable_hard_offline) {
>>> + pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
>>> + kill_procs_now(p, pfn, flags, page_folio(p));
>>> + res = -EOPNOTSUPP;
>>> + goto unlock_mutex;
>>> + }
>>> +
>>
>> I am curious why the SIGBUS is sent without setting PG_hwpoison in the
>> page. In 0/2 there seems to be indication about threads coordinate
>> with each other such that clean subpages in a poisoned hugetlb page
>> continue to be accessible, and at some point, (or perhaps I misread),
>> the poisoned page (sub- or huge-) will eventually be isolated, because,
>
> The code here is "global policy". The "per-VMA policy", proposed in
> 0/2 but code not sent, should be able to support isolation + offline
> at some point (all VMAs are gone and page becomes free).
>
>> it's unthinkable to let a poisoned page laying around and kernel treats
>> it like a clean page ? But I'm not sure how do you plan to handle it
>> without PG_hwpoison while hard_offline is disabled globally.
>
> It will become the responsibility of a control plane running in
> userspace. For example, the control plane immediately prevents starting
> of any new workload/VM, but chooses to wait until memory errors exceed
> a certain threshold, or hold on to the hosts until all workloads/VMs
> are migrated and then repair the machine. Not setting PG_hwpoison is
> indeed a big difference and risk, so it needs to be carefully handled
> by userspace.
>
Could you explain why PG_hwpoison cannot be set in this case? It seems a control plane running in
userspace can work with PG_hwpoison set. PG_hwpoison makes sure hwpoisoned pages won't be re-used
by the kernel while the control plane prevents them from being re-accessed from userspace. Or am I missing something?
Thanks.
.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
2024-10-11 7:04 ` Miaohe Lin
@ 2024-10-15 23:58 ` Jiaqi Yan
0 siblings, 0 replies; 21+ messages in thread
From: Jiaqi Yan @ 2024-10-15 23:58 UTC (permalink / raw)
To: Miaohe Lin
Cc: jane.chu, nao.horiguchi, tony.luck, wangkefeng.wang, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
linux-mm
On Fri, Oct 11, 2024 at 12:05 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2024/10/4 7:51, Jiaqi Yan wrote:
> > Hi Jane,
> >
> > On Wed, Oct 2, 2024 at 4:50 PM <jane.chu@oracle.com> wrote:
> >>
> >> Hi,
> >>
> >> On 9/23/2024 9:39 PM, Jiaqi Yan wrote:
> >>>
> >>> + /*
> >>> + * On ARM64, if APEI failed to claims SEA, (e.g. GHES driver doesn't
> >>> + * register to SEA notifications from firmware), memory_failure will
> >>> + * never be synchrounous to the error consumption thread. Notifying
> >>> + * it via SIGBUS synchrnously has to be done by either core kernel in
> >>> + * do_mem_abort, or KVM in kvm_handle_guest_abort.
> >>> + */
> >>> + if (!sysctl_enable_hard_offline) {
> >>> + pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
> >>> + kill_procs_now(p, pfn, flags, page_folio(p));
> >>> + res = -EOPNOTSUPP;
> >>> + goto unlock_mutex;
> >>> + }
> >>> +
> >>
> >> I am curious why the SIGBUS is sent without setting PG_hwpoison in the
> >> page. In 0/2 there seems to be indication about threads coordinate
> >> with each other such that clean subpages in a poisoned hugetlb page
> >> continue to be accessible, and at some point, (or perhaps I misread),
> >> the poisoned page (sub- or huge-) will eventually be isolated, because,
> >
> > The code here is "global policy". The "per-VMA policy", proposed in
> > 0/2 but code not sent, should be able to support isolation + offline
> > at some point (all VMAs are gone and page becomes free).
> >
> >> it's unthinkable to let a poisoned page laying around and kernel treats
> >> it like a clean page ? But I'm not sure how do you plan to handle it
> >> without PG_hwpoison while hard_offline is disabled globally.
> >
> > It will become the responsibility of a control plane running in
> > userspace. For example, the control plane immediately prevents starting
> > of any new workload/VM, but chooses to wait until memory errors exceed
> > a certain threshold, or hold on to the hosts until all workloads/VMs
> > are migrated and then repair the machine. Not setting PG_hwpoison is
> > indeed a big difference and risk, so it needs to be carefully handled
> > by userspace.
> >
>
> Could you explain why PG_hwpoison cannot be set in this case? It seems a control plane running in
> userspace can work with PG_hwpoison set. PG_hwpoison makes sure hwpoisoned pages won't be re-used
> by the kernel while the control plane prevents them from being re-accessed from userspace. Or am I missing something?
>
[Resend to include more people and linux-mm]
Sorry I almost missed your comment/question.
I think for hugetlb and transparent hugepages, say we keep them mapped
but set the HWPoison flag, the flag will be set at the compound head and
a future userspace page fault on **any** part of the hugepage will
result in SIGBUS, meaning the hugepage is lost to userspace,
making "keep them mapped" a meaningless action.
> Thanks.
> .
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* [RFC PATCH v1 2/2] docs: mm: add enable_hard_offline sysctl
2024-09-24 4:39 [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery Jiaqi Yan
2024-09-24 4:39 ` [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy Jiaqi Yan
@ 2024-09-24 4:39 ` Jiaqi Yan
2024-10-02 15:02 ` [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery Jason Gunthorpe
2 siblings, 0 replies; 21+ messages in thread
From: Jiaqi Yan @ 2024-09-24 4:39 UTC (permalink / raw)
To: nao.horiguchi, linmiaohe
Cc: tony.luck, wangkefeng.wang, jane.chu, akpm, osalvador, rientjes,
duenwen, jthoughton, jgg, ankita, peterx, linux-mm, Jiaqi Yan
Add documentation for the userspace control of hard offlining memory
that has uncorrectable memory errors: where it will be useful and its
global implications.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
Documentation/admin-guide/sysctl/vm.rst | 92 +++++++++++++++++++++++++
1 file changed, 92 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f48eaa98d22d..a55a1d496b34 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -37,6 +37,7 @@ Currently, these files are in /proc/sys/vm:
- dirty_writeback_centisecs
- drop_caches
- enable_soft_offline
+- enable_hard_offline
- extfrag_threshold
- highmem_is_dirtyable
- hugetlb_shm_group
@@ -306,6 +307,97 @@ following requests to soft offline pages will not be performed:
- On PARISC, the request to soft offline pages from Page Deallocation Table.
+
+enable_hard_offline
+===================
+
+This parameter gives userspace control over whether the kernel should hard
+offline memory that has uncorrectable memory errors. When set to 1, the kernel
+attempts to hard offline the error folio whenever it deems it necessary. When
+set to 0, the kernel returns EOPNOTSUPP to the request to hard offline the
+pages. Its default value is 1.
+
+Where will `enable_hard_offline = 0` be useful?
+--------------------------------------------------
+
+There are two major use cases from the cloud provider's perspective.
+
+The first use case is 1G HugeTLB, which provides critical optimization for
+Virtual Machines (VM) where database-centric and data-intensive workloads have
+requirements of both large memory size (hundreds of GB or several TB),
+and high performance of address mapping. These VMs usually also require high
+availability, so tolerating and recovering from inevitable uncorrectable
+memory errors is usually provided by host RAS features for long VM uptime
+(SLA is 99.95% Monthly Uptime). Due to the 1GB granularity, once a byte
+of memory in a hugepage is hardware corrupted, the kernel discards the whole
+1G hugepage, not only the corrupted bytes but also the healthy portion, from
+HugeTLB system. In a cloud environment this is a great loss of memory to VM,
+putting VM in a dilemma: although the host is able to keep serving the VM,
+the VM itself struggles to continue its data-intensive workload with the
+unnecessary loss of ~1G data. On the other hand, simply terminating the VM
+greatly reduces its uptime given the frequency of uncorrectable memory errors
+occurrence.
+
+The second use case comes from the discussion of MFR for huge VM_PFNMAP,
+which is to greatly improve TLB hits for PCI MMIO bars and kernel unmanaged
+host primary memory. They are most relevant for VMs that run Machine Learning
+(ML) workloads, which also require reliable VM uptime. The MFR behavior for
+huge VM_PFNMAP is: if the driver originally VM_PFNMAP-ed with PUD, it must
+first zap the PUD, then intercept future page faults to either install PTE/PMD
+for clean PFNs, or return VM_FAULT_HWPOISON for poisoned PFNs. Zapping PUD
+means there will be a huge hole in the EPT or stage-2 (S2) page table,
+causing a lot of EPT or S2 violations that need to be fixed up by the device
+driver. There will be noticeable VM performance downgrades, not only during
+refilling EPT or S2, but also after the hole is refilled, as EPT or S2 is
+already fragmented.
+
+For both use cases, HARD_OFFLINE behavior in MFR arguably does more harm than
+good to the VM. For the 1st case, if we simply leave the 1G HugeTLB hugepage
+mapped, VM access to the clean PFNs within the poisoned 1G region still works
+well; we just need to still send SIGBUS to userspace in case of re-access
+poisoned PFNs to stop populating corrupted data. For the 2nd case, if zapping
+PUD doesn't happen there is no need for the driver to intercept page faults to
+clean memory on HBM or EGM. In addition, in both cases, there is no EPT or S2
+violation, so no performance cost for accessing clean guest pages already
+mapped in EPT and S2.
+
+It is Global
+------------
+
+This applies to the system **globally** in the sense that
+1. It applies to the entire *system-level memory managed by the kernel*, regardless
+ of the underlying memory type.
+2. It applies to *all userspace threads*, regardless of whether the physical memory is
+ currently backing any VMA (free memory) or what VMAs it is backing.
+3. It applies to *PCI(e) device memory* (e.g. HBM on GPU) as well, on the
+ condition that their device driver deliberately wants to follow the
+ kernel’s memory failure recovery, instead of being entirely taken care of
+ by device driver (e.g. drivers/nvdimm/pmem.c).
+
+Implications
+------------
+
+There is one important thing to point out when `enable_hard_offline` = 0.
+The kernel does NOT set HWPoison flag in the struct page or struct folio.
+This behavior has implications now that no enforcement is done by kernel to
+isolate poisoned page and prevent both userspace and kernel from consuming
+memory error and causing hardware fault again (which used to be 'setting the
+HWPoison flag'):
+
+- Userspace already has sufficient capability to prevent itself from
+  consuming the memory error and causing a hardware fault: given the poisoned
+  virtual address delivered in SIGBUS, it can ask the kernel to remap the
+  poisoned page (accepting the data loss), or simply abort the memory load
+  operation; a minimal handler sketch follows this list. That said, there is
+  a risk that a userspace thread keeps ignoring SIGBUS and generates hardware
+  faults repeatedly.
+
+- The kernel won't be able to forbid reuse of free error pages in future
+  memory allocations. If an error page is allocated to the kernel and the
+  kernel then consumes it, a kernel panic is most likely to follow. For
+  userspace, newly allocated memory is no longer guaranteed to be free of
+  memory errors.
+
+
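+A minimal sketch of such a SIGBUS handler follows. It assumes the mapping is
+private anonymous memory and that dropping the poisoned page with
+MADV_DONTNEED, so that the next access reads a fresh zero page, is an
+acceptable form of data loss; a VMM would instead track the poisoned guest
+physical address and emulate or inject the error into the guest::
+
+  #define _GNU_SOURCE
+  #include <signal.h>
+  #include <stdlib.h>
+  #include <sys/mman.h>
+  #include <unistd.h>
+
+  static long page_size;
+
+  static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
+  {
+          /* si_addr is the poisoned virtual address reported by the kernel. */
+          void *page = (void *)((unsigned long)info->si_addr &
+                                ~((unsigned long)page_size - 1));
+
+          if ((info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) &&
+              madvise(page, page_size, MADV_DONTNEED) == 0)
+                  return; /* retry the load; it now reads a clean zero page */
+
+          _exit(EXIT_FAILURE); /* can't recover: abort rather than loop on faults */
+  }
+
+  int main(void)
+  {
+          struct sigaction sa = { 0 };
+
+          page_size = sysconf(_SC_PAGESIZE);
+          sa.sa_sigaction = sigbus_handler;
+          sa.sa_flags = SA_SIGINFO;
+          sigemptyset(&sa.sa_mask);
+          sigaction(SIGBUS, &sa, NULL);
+          /* ... run the workload that may hit poisoned memory ... */
+          return 0;
+  }
+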
extfrag_threshold
=================
--
2.46.0.792.g87dc391469-goog
* Re: [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery
2024-09-24 4:39 [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery Jiaqi Yan
2024-09-24 4:39 ` [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy Jiaqi Yan
2024-09-24 4:39 ` [RFC PATCH v1 2/2] docs: mm: add enable_hard_offline sysctl Jiaqi Yan
@ 2024-10-02 15:02 ` Jason Gunthorpe
2024-10-03 22:45 ` Jiaqi Yan
2 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2024-10-02 15:02 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, jane.chu,
akpm, osalvador, rientjes, duenwen, jthoughton, ankita, peterx,
linux-mm
On Tue, Sep 24, 2024 at 04:39:18AM +0000, Jiaqi Yan wrote:
> So far I personally prefer the global MFR policy but am open to feedback on
> both options, or new ideas.
Why? It seems more natural that only processes that can handle the
SIGBUS semantics would opt into them?
Jason
* Re: [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery
2024-10-02 15:02 ` [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery Jason Gunthorpe
@ 2024-10-03 22:45 ` Jiaqi Yan
2024-10-03 22:58 ` Luck, Tony
2024-10-03 23:19 ` Jason Gunthorpe
0 siblings, 2 replies; 21+ messages in thread
From: Jiaqi Yan @ 2024-10-03 22:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, jane.chu,
akpm, osalvador, rientjes, duenwen, jthoughton, ankita, peterx,
linux-mm
Hi Jason,
On Wed, Oct 2, 2024 at 8:02 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Sep 24, 2024 at 04:39:18AM +0000, Jiaqi Yan wrote:
>
> > So far I personally prefer the global MFR policy but am open to feedback on
> > both options, or new ideas.
>
> Why? It seems more natural that only processes that can handle the
> SIGBUS semantics would opt into them?
Are you suggesting you prefer the per-VMA policy, or proposing a new
"per-process policy" added via prctl? By "per-process", I imagine the
policy to keep or offline the poisoned page will apply to all its
VMAs?
>
> Jason
Thanks,
Jiaqi
* RE: [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery
2024-10-03 22:45 ` Jiaqi Yan
@ 2024-10-03 22:58 ` Luck, Tony
2024-10-03 23:19 ` Jiaqi Yan
2024-10-03 23:19 ` Jason Gunthorpe
1 sibling, 1 reply; 21+ messages in thread
From: Luck, Tony @ 2024-10-03 22:58 UTC (permalink / raw)
To: Jiaqi Yan, Jason Gunthorpe
Cc: nao.horiguchi@gmail.com, linmiaohe@huawei.com,
wangkefeng.wang@huawei.com, chu, jane, akpm@linux-foundation.org,
osalvador@suse.de, rientjes@google.com, duenwen@google.com,
jthoughton@google.com, ankita@nvidia.com, peterx@redhat.com,
linux-mm@kvack.org
> Are you suggesting you prefer the per-VMA policy, or proposing a new
> "per-process policy" added via prctl? By "per-process", I imagine the
> policy to keep or offline the poisoned page will apply to all its
> VMAs?
A "per-process policy" using prctl already exists. See prctl(PR_MCE_KILL).
Currently used to choose whether to eagerly send SIGBUS to a process
when a memory error is discovered asynchronously by a h/w patrol scrubber.
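For reference, opting in is a one-liner with the existing prctl(2) interface
(nothing added by this series):
  #include <sys/prctl.h>
  /* Receive SIGBUS (BUS_MCEERR_AO) as soon as an error is found, rather
   * than only when the corrupted data is actually consumed. */
  prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);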
What is the use case for a per-VMA policy? Do you have some application
that would like to use this?
-Tony
* Re: [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery
2024-10-03 22:58 ` Luck, Tony
@ 2024-10-03 23:19 ` Jiaqi Yan
0 siblings, 0 replies; 21+ messages in thread
From: Jiaqi Yan @ 2024-10-03 23:19 UTC (permalink / raw)
To: Luck, Tony
Cc: Jason Gunthorpe, nao.horiguchi@gmail.com, linmiaohe@huawei.com,
wangkefeng.wang@huawei.com, chu, jane, akpm@linux-foundation.org,
osalvador@suse.de, rientjes@google.com, duenwen@google.com,
jthoughton@google.com, ankita@nvidia.com, peterx@redhat.com,
linux-mm@kvack.org
Hi Tony,
On Thu, Oct 3, 2024 at 3:58 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> > Are you suggesting you prefer the per-VMA policy, or proposing a new
> > "per-process policy" added via prctl? By "per-process", I imagine the
> > policy to keep or offline the poisoned page will apply to all its
> > VMAs?
>
> A "per-process policy" using prctl already exists. See prctl(PR_MCE_KILL).
The policy I want to have is not about "whether to send SIGBUS or not"
or "when to send SIGBUS", it is about whether to offline the error
[huge]page or keep it accessible by the process.
> Currently used to choose whether to eagerly send SIGBUS to a process
> when a memory error is discovered asynchronously by a h/w patrol scrubber.
>
> What is the use case for a per-VMA policy? Do you have some application
> that would like to use this?
Our main use case is the virtual machine monitor (VMM) and VM. The VMM can
track the *guest* physical addresses that are affected by the *host*
physical addresses that have errors. We'd like the VM to be able to
continue loading guest data from the error [huge]page. Loading the
clean portion should just work; loading the poisoned portion will be
intercepted by KVM + the VMM without going down to kernel / firmware /
hardware.
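To sketch the bookkeeping (illustrative only; every name below is made up, and
the memslot layout is assumed to be a simple array the VMM already maintains):
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>
  /* One guest memory slot: the HVA <-> GPA mapping the VMM already keeps. */
  struct vmm_memslot {
          uint64_t gpa;
          uint64_t hva;
          uint64_t size;
  };
  /* Called from the SIGBUS handler on the vCPU thread: translate the poisoned
   * host virtual address (siginfo.si_addr) into a guest physical address, so
   * the VMM can add it to its poison set and intercept later guest accesses
   * instead of letting them reach the bad host page again. */
  static bool poisoned_hva_to_gpa(const struct vmm_memslot *slots, size_t n,
                                  uint64_t hva, uint64_t *gpa)
  {
          for (size_t i = 0; i < n; i++) {
                  if (hva >= slots[i].hva && hva - slots[i].hva < slots[i].size) {
                          *gpa = slots[i].gpa + (hva - slots[i].hva);
                          return true;
                  }
          }
          return false;
  }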
>
> -Tony
* Re: [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery
2024-10-03 22:45 ` Jiaqi Yan
2024-10-03 22:58 ` Luck, Tony
@ 2024-10-03 23:19 ` Jason Gunthorpe
2024-10-04 18:32 ` Jiaqi Yan
1 sibling, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2024-10-03 23:19 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, jane.chu,
akpm, osalvador, rientjes, duenwen, jthoughton, ankita, peterx,
linux-mm
On Thu, Oct 03, 2024 at 03:45:09PM -0700, Jiaqi Yan wrote:
> Hi Jason,
>
> On Wed, Oct 2, 2024 at 8:02 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Sep 24, 2024 at 04:39:18AM +0000, Jiaqi Yan wrote:
> >
> > > So far I personally prefer the global MFR policy but am open to feedback on
> > > both options, or new ideas.
> >
> > Why? It seems more natural that only processes that can handle the
> > SIGBUS semantics would opt into them?
>
> Are you suggesting you prefer the per-VMA policy, or proposing a new
> "per-process policy" added via prctl? By "per-process", I imagine the
> policy to keep or offline the poisoned page will apply to all its
> VMAs?
I'm just asking why you "personally prefer" it, as the direction seems a
bit awkward.
Jason
* Re: [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery
2024-10-03 23:19 ` Jason Gunthorpe
@ 2024-10-04 18:32 ` Jiaqi Yan
0 siblings, 0 replies; 21+ messages in thread
From: Jiaqi Yan @ 2024-10-04 18:32 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: nao.horiguchi, linmiaohe, tony.luck, wangkefeng.wang, jane.chu,
akpm, osalvador, rientjes, duenwen, jthoughton, ankita, peterx,
linux-mm
On Thu, Oct 3, 2024 at 4:20 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 03, 2024 at 03:45:09PM -0700, Jiaqi Yan wrote:
> > Hi Jason,
> >
> > On Wed, Oct 2, 2024 at 8:02 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Sep 24, 2024 at 04:39:18AM +0000, Jiaqi Yan wrote:
> > >
> > > > So far I personally prefer the global MFR policy but am open to feedback on
> > > > both options, or new ideas.
> > >
> > > Why? It seems more natural that only processes that can handle the
> > > SIGBUS semantics would opt into them?
> >
> > Are you suggesting you prefer the per-VMA policy, or proposing a new
> > "per-process policy" added via prctl? By "per-process", I imagine the
> > policy to keep or offline the poisoned page will apply to all its
> > VMAs?
>
> I'm just asking why you "personally prefer" it, as the direction seems a
> bit awkward.
I assume the "awkward" comes from the concern about what userspace will
do if the kernel is configured to keep poisoned pages.
Admittedly this direction is the high return-on-investment one for me, as
we already have memory failure recovery and repair in userspace that works
well with poisoned pages that are not offlined until the hardware is
repaired. But I don't assume that is also the case for everyone else, so I
also want to propose alternatives (limited to just a VMA, or to memory owned
by a process, and limited to their lifetimes) that hopefully work for more
people.
>
> Jason
Thread overview: 21+ messages
2024-09-24 4:39 [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery Jiaqi Yan
2024-09-24 4:39 ` [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy Jiaqi Yan
2024-10-02 23:50 ` jane.chu
2024-10-03 23:51 ` Jiaqi Yan
2024-10-07 17:24 ` jane.chu
2024-10-10 23:21 ` Jiaqi Yan
2024-10-11 18:28 ` jane.chu
2024-10-11 19:44 ` Luck, Tony
2024-10-11 20:15 ` jane.chu
2024-10-15 23:45 ` Jiaqi Yan
2024-10-15 23:56 ` Luck, Tony
2024-10-16 0:19 ` jane.chu
2024-10-11 7:04 ` Miaohe Lin
2024-10-15 23:58 ` Jiaqi Yan
2024-09-24 4:39 ` [RFC PATCH v1 2/2] docs: mm: add enable_hard_offline sysctl Jiaqi Yan
2024-10-02 15:02 ` [RFC PATCH v1 0/2] Userspace Can Control Memory Failure Recovery Jason Gunthorpe
2024-10-03 22:45 ` Jiaqi Yan
2024-10-03 22:58 ` Luck, Tony
2024-10-03 23:19 ` Jiaqi Yan
2024-10-03 23:19 ` Jason Gunthorpe
2024-10-04 18:32 ` Jiaqi Yan