* Re: [PATCH v14 0/8] arm64: add ARCH_HAS_COPY_MC support
From: Kefeng Wang @ 2026-05-18 15:05 UTC
To: Ruidong Tian, catalin.marinas, will, rafael, tony.luck, guohanjun,
mchehab, xueshuai, tongtiangen, james.morse, robin.murphy,
andreyknvl, dvyukov, vincenzo.frascino, mpe, npiggin,
ryabinin.a.a, glider, christophe.leroy, aneesh.kumar,
naveen.n.rao, tglx, mingo
Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev
In-Reply-To: <20260518084956.2538442-1-tianruidong@linux.alibaba.com>
On 5/18/2026 4:49 PM, Ruidong Tian wrote:
> This series continues Tong Tiangen's work on arm64 ARCH_HAS_COPY_MC
> support. We encounter the same problem, and from a forward-looking
> perspective, large-memory ARM machines such as Grace and Vera will suffer
> more from this class of issues, which motivates us to push this feature
> upstream.
>
> Problem
> =========
> With the increase of memory capacity and density, the probability of memory
> error also increases. The increasing size and density of server RAM in data
> centers and clouds have shown increased uncorrectable memory errors.
>
> Currently, more and more scenarios can tolerate memory errors, such as
> COW[1,2], KSM copy[3], coredump copy[4], khugepaged[5,6], uaccess copy[7],
> etc.
We have encountered more scenarios and made further enhancements, e.g.:
658be46520ce mm: support poison recovery from copy_present_page()
aa549f923f5e mm: support poison recovery from do_cow_fault()
f00b295b9b61 fs: hugetlbfs: support poisoned recover from
hugetlbfs_migrate_folio()
060913999d7a mm: migrate: support poisoned recover from migrate folio
I hope the architecture-specific patches can get reviews and responses.
Thanks.
> Solution
> =========
>
> This patchset introduces a new processing framework on ARM64, which enables
> ARM64 to support error recovery in the above scenarios, and more scenarios
> can be expanded based on this in the future.
>
> On arm64, memory errors are handled in do_sea(), which distinguishes two cases:
> 1. If the user state consumed the memory errors, the solution is to kill
> the user process and isolate the error page.
> 2. If the kernel state consumed the memory errors, the solution is to
> panic.
>
> For case 2, an undifferentiated panic may not be the optimal choice, as some
> cases can be handled better. In some scenarios, such as uaccess, the panic
> can be avoided: if the uaccess fails due to a memory error, only the user
> process is affected, so returning an error to the caller and isolating the
> user page with the hardware memory error is a better choice.
>
> [1] commit d302c2398ba2 ("mm, hwpoison: when copy-on-write hits poison, take page offline")
> [2] commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults")
> [3] commit 6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")
> [4] commit 245f09226893 ("mm: hwpoison: coredump: support recovery from dump_user_range()")
> [5] commit 98c76c9f1ef7 ("mm/khugepaged: recover from poisoned anonymous memory")
> [6] commit 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory")
> [7] commit 278b917f8cb9 ("x86/mce: Add _ASM_EXTABLE_CPY for copy user access")
>
> ------------------
> Test result:
>
> Tested on Kunpeng 920.
>
> 1. copy_page(), copy_mc_page() basic function test pass, and the disassembly
> contents remain the same before and after the refactor.
>
> 2. copy_to/from_user() access kernel NULL pointer raise translation fault
> and dump error message then die(), test pass.
>
> 3. Test following scenarios: copy_from_user(), get_user(), COW.
>
> Before patched: trigger a hardware memory error then panic.
> After patched: trigger a hardware memory error without panic.
>
> Testing step:
> step1. start a user process.
> step2. poison(einj) the user-process's page.
> step3: user-process access the poison page in kernel mode, then trigger SEA.
> step4: the kernel will not panic, only the user process is killed, the poison
> page is isolated. (before patched, the kernel will panic in do_sea())
>
> The above tests can also be reproduced using ras-tools, which provides
> einj-based injection and validation for uaccess and COW scenarios.
> Example usage:
>
> einj_mem_uc futex # get_user
> einj_mem_uc copyin # copy_to_user
> einj_mem_uc copy-on-write # COW
>
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>
> ------------------
>
> Benefits
> =========
> According to Huawei's statistics from their storage products, memory errors
> triggered in kernel-mode by COW and page cache read (uaccess) scenarios
> account for more than 50%. With this patchset deployed, all kernel panics
> caused by COW and page cache memory errors are eliminated.
> Alibaba Cloud has also observed memory errors occurring in uaccess contexts.
>
> Since V13:
> 1. Changed MC-safe functions to return an error rather than kill the user
> process. When a user program invokes a syscall and the kernel encounters
> a memory error during uaccess, killing the process is unexpected; the
> syscall should return an error.
> 2. Added FEAT_MOPS support for the copy_page_mc paths.
> 3. Refactored copy_page() and memcpy() on top of the shared memcpy_template,
> reducing duplicated assembly code.
>
> Since v12:
> Thanks to the suggestions of Jonathan, Mark, and Mauro, the following modifications
> are made:
> 1. Rebase to latest kernel version.
> 2. Patch1, add Jonathan's and Mauro's review-by.
> 3. Patch2, modified do_apei_claim_sea() according to Mark's and Jonathan's suggestions,
> and optimized the commit message according to Mark's suggestions(Added description of
> the impact on regular copy_to_user()).
> 4. Patch3, optimized the commit message according to Mauro's suggestions and add Jonathan's
> review-by.
> 5. Patch4, modified copy_mc_user_highpage() and Optimized the commit message according to
> Jonathan's suggestions(no functional changes).
> 6. Patch5, optimized the commit message according to Mauro's suggestions.
> 7. Patch4/5, FEAT_MOPS is added to the code logic. Currently, the fixup is not performed
> on the MOPS instruction.
> 8. Remove patch6 in v12 according to Jonathan's suggestions.
>
> Since v11:
> 1. Rebase to latest kernel version 6.9-rc1.
> 2. Add patch 5, Since the problem described in "Since V10 Besides 3" has
> been solved in a50026bdb867 ('iov_iter: get rid of 'copy_mc' flag').
> 3. Add the benefit of applying the patch set to our company to the description of patch0.
>
> Since V10:
> According to Mark's suggestion:
> 1. Merge V10's patch2 and patch3 to V11's patch2.
> 2. Patch2(V11): use new fixup_type for ld* in copy_to_user(), fix fatal
> issues (NULL kernel pointer access) being fixed up incorrectly.
> 3. Patch2(V11): refactoring the logic of do_sea().
> 4. Patch4(V11): Remove duplicate assembly logic and remove do_mte().
>
> Besides:
> 1. Patch2(V11): remove st* insn's fixup, st* generally not trigger memory error.
> 2. Split a part of the logic of patch2(V11) to patch5(V11), for detail,
> see patch5(V11)'s commit msg.
> 3. Remove patch6(v10) “arm64: introduce copy_mc_to_kernel() implementation”.
> During modification, some problems that cannot be solved in a short
> period are found. The patch will be released after the problems are
> solved.
> 4. Add test result in this patch.
> 5. Modify patchset title, do not use machine check and remove "-next".
>
> Since V9:
> 1. Rebase to latest kernel version 6.8-rc2.
> 2. Add patch 6/6 to support copy_mc_to_kernel().
>
> Since V8:
> 1. Rebase to latest kernel version and fix typos in some of the patches.
> 2. According to the suggestion of Catalin, I attempted to modify the
> return value of function copy_mc_[user]_highpage() to bytes not copied.
> During the modification process, I found that it would be more
> reasonable to return -EFAULT when copy error occurs (referring to the
> newly added patch 4).
>
> For ARM64, the implementation of copy_mc_[user]_highpage() needs to
> consider MTE. Considering the scenario where data copying is successful
> but the MTE tag copying fails, it is also not reasonable to return
> bytes not copied.
> 3. Considering the recent addition of machine check safe support for
> multiple scenarios, modify commit message for patch 5 (patch 4 for V8).
>
> Since V7:
> Currently, there are patches supporting recover from poison
> consumption for the cow scenario[1]. Therefore, Supporting cow
> scenario under the arm64 architecture only needs to modify the relevant
> code under the arch/.
> [1]https://lore.kernel.org/lkml/20221031201029.102123-1-tony.luck@intel.com/
>
> Since V6:
> Resend patches that are not merged into the mainline in V6.
>
> Since V5:
> 1. Add patch2/3 to add uaccess assembly helpers.
> 2. Optimize the implementation logic of arm64_do_kernel_sea() in patch8.
> 3. Remove kernel access fixup in patch9.
> All suggestions are from Mark.
>
> Since V4:
> 1. According to Michael's suggestion, add patch5.
> 2. According to Mark's suggestion, do some restructuring to arm64
> extable, then a new adaptation of machine check safe support is made based
> on this.
> 3. According to Mark's suggestion, support machine check safe in do_mte() in
> cow scene.
> 4. In V4, two patches have been merged into -next, so V5 not send these
> two patches.
>
> Since V3:
> 1. According to Robin's suggestion, direct modify user_ldst and
> user_ldp in asm-uaccess.h and modify mte.S.
> 2. Add new macro USER_MC in asm-uaccess.h, used in copy_from_user.S
> and copy_to_user.S.
> 3. According to Robin's suggestion, using macros in copy_page_mc.S to
> simplify code.
> 4. According to KeFeng's suggestion, modify powerpc code in patch1.
> 5. According to KeFeng's suggestion, modify mm/extable.c and some code
> optimization.
>
> Since V2:
> 1. According to Mark's suggestion, all uaccess can be recovered due to
> memory error.
> 2. Scenario pagecache reading is also supported as part of uaccess
> (copy_to_user()) and duplication code problem is also solved.
> Thanks for Robin's suggestion.
> 3. According to Mark's suggestion, update commit message of patch 2/5.
> 4. According to Borislav's suggestion, update commit message of patch 1/5.
>
> Since V1:
> 1.Consistent with PPC/x86, Using CONFIG_ARCH_HAS_COPY_MC instead of
> ARM64_UCE_KERNEL_RECOVERY.
> 2.Add two new scenes, cow and pagecache reading.
> 3.Fix two small bug(the first two patch).
>
> V1 in here:
> https://lore.kernel.org/lkml/20220323033705.3966643-1-tongtiangen@huawei.com/
>
> Ruidong Tian (3):
> ACPI: APEI: GHES: use exception context to gate SIGBUS on poison
> consumption
> lib/test: memcpy_kunit: add copy_page() and copy_mc_page() tests
> lib/tests: memcpy_kunit: add memcpy_mc() and memcpy_mc_large() test
>
> Tong Tiangen (5):
> uaccess: add generic fallback version of copy_mc_to_user()
> arm64: add support for ARCH_HAS_COPY_MC
> mm/hwpoison: return -EFAULT when copy fail in
> copy_mc_[user]_highpage()
> arm64: support copy_mc_[user]_highpage()
> arm64: introduce copy_mc_to_kernel() implementation
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/asm-extable.h | 22 ++-
> arch/arm64/include/asm/asm-uaccess.h | 4 +
> arch/arm64/include/asm/extable.h | 1 +
> arch/arm64/include/asm/mte.h | 9 +
> arch/arm64/include/asm/page.h | 10 ++
> arch/arm64/include/asm/string.h | 5 +
> arch/arm64/include/asm/uaccess.h | 17 ++
> arch/arm64/kernel/acpi.c | 2 +-
> arch/arm64/lib/Makefile | 2 +
> arch/arm64/lib/copy_mc_page.S | 44 +++++
> arch/arm64/lib/copy_page.S | 62 +------
> arch/arm64/lib/copy_page_template.S | 71 ++++++++
> arch/arm64/lib/copy_to_user.S | 10 +-
> arch/arm64/lib/memcpy.S | 253 ++-------------------------
> arch/arm64/lib/memcpy_mc.S | 56 ++++++
> arch/arm64/lib/memcpy_template.S | 249 ++++++++++++++++++++++++++
> arch/arm64/lib/mte.S | 29 +++
> arch/arm64/mm/copypage.c | 75 ++++++++
> arch/arm64/mm/extable.c | 21 +++
> arch/arm64/mm/fault.c | 30 +++-
> arch/powerpc/include/asm/uaccess.h | 1 +
> arch/x86/include/asm/uaccess.h | 1 +
> drivers/acpi/apei/ghes.c | 36 ++--
> include/acpi/ghes.h | 6 +-
> include/linux/highmem.h | 16 +-
> include/linux/uaccess.h | 8 +
> lib/tests/memcpy_kunit.c | 178 ++++++++++++++++++-
> mm/kasan/shadow.c | 12 ++
> mm/khugepaged.c | 4 +-
> 30 files changed, 904 insertions(+), 331 deletions(-)
> create mode 100644 arch/arm64/lib/copy_mc_page.S
> create mode 100644 arch/arm64/lib/copy_page_template.S
> create mode 100644 arch/arm64/lib/memcpy_mc.S
> create mode 100644 arch/arm64/lib/memcpy_template.S
>
* [PATCH v8 4/5] PCI: qcom: Add support for resetting the Root Port due to link down event
From: Manivannan Sadhasivam via B4 Relay @ 2026-05-18 14:59 UTC
To: Bjorn Helgaas, Mahesh J Salgaonkar, Oliver O'Halloran,
Will Deacon, Lorenzo Pieralisi, Krzysztof Wilczyński,
Manivannan Sadhasivam, Rob Herring, Heiko Stuebner, Philipp Zabel
Cc: linux-pci, linux-kernel, linuxppc-dev, linux-arm-kernel,
linux-arm-msm, linux-rockchip, Niklas Cassel, Wilfred Mallawa,
Krishna Chaitanya Chundru, mani, Lukas Wunner, Richard Zhu,
Brian Norris, Wilson Ding, Manivannan Sadhasivam,
Manivannan Sadhasivam
In-Reply-To: <20260518-pci-port-reset-v8-0-eb5a7d331dfc@oss.qualcomm.com>
From: Manivannan Sadhasivam <mani@kernel.org>
The PCIe link can go down under circumstances such as a device firmware
crash, link instability, etc. When that happens, the PCIe Root Port needs
to be reset to make it operational again. Currently, the driver is not
handling the link down event, due to which the users have to restart the
machine to make PCIe link operational again. So fix it by detecting the
link down event and resetting the Root Port.
Since the Qcom PCIe controllers report the link down event through the
'global' IRQ, enable the link down event by setting PARF_INT_ALL_LINK_DOWN
bit in PARF_INT_ALL_MASK register.
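As a worked example of the value involved, the following userspace sketch recomputes the word that the diff in this patch writes to PARF_INT_ALL_MASK. The bit positions (bit 1 for link down, bits 30:23 for the MSI_DEV_0_7 group) are taken from the defines in the diff; the MODEL_* macros and helper names are hypothetical re-implementations for illustration, not the kernel's BIT()/GENMASK() themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's BIT() and GENMASK(), 32-bit only. */
#define MODEL_BIT(n)        (UINT32_C(1) << (n))
#define MODEL_GENMASK(h, l) ((UINT32_MAX >> (31 - (h))) & (UINT32_MAX << (l)))

/* Bit positions copied from the defines added by this patch. */
#define MODEL_INT_ALL_LINK_DOWN MODEL_BIT(1)
#define MODEL_INT_MSI_DEV_0_7   MODEL_GENMASK(30, 23)

/* Bits 30:23 form an 8-bit field shifted left by 23. */
static_assert(MODEL_INT_MSI_DEV_0_7 == (UINT32_C(0xFF) << 23),
	      "bits 30:23 should be eight contiguous bits");

/* Word written to the mask register to enable both event groups. */
static uint32_t model_global_irq_mask(void)
{
	return MODEL_INT_ALL_LINK_DOWN | MODEL_INT_MSI_DEV_0_7;
}

/* True when a status word read back from the controller reports link down. */
static int model_link_down_pending(uint32_t status)
{
	return (status & MODEL_INT_ALL_LINK_DOWN) != 0;
}
```

With these definitions the combined mask is 0x7F800002, and a status word of 0x2 is the one that would reach the link down branch of the IRQ thread.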
In the case of the event, iterate through the available Root Ports and call
pci_host_handle_link_down() API with Root Port 'pci_dev' to let the PCI
core handle the link down condition. Since Qcom PCIe controllers only
support one Root Port per controller instance, the API will be called only
once. But the looping is necessary as there is no PCI API available to
fetch the Root Port instance without the child 'pci_dev'.
The API will internally call the 'pci_host_bridge::reset_root_port()' callback
to reset the Root Port in a platform specific way. So implement the
callback to reset the Root Port by first resetting the PCIe core, followed
by reinitializing the resources and then finally starting the link again.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
Tested-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
---
drivers/pci/controller/dwc/pcie-qcom.c | 143 ++++++++++++++++++++++++++++++++-
1 file changed, 142 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/controller/dwc/pcie-qcom.c b/drivers/pci/controller/dwc/pcie-qcom.c
index af6bf5cce65b..feda8abf5f85 100644
--- a/drivers/pci/controller/dwc/pcie-qcom.c
+++ b/drivers/pci/controller/dwc/pcie-qcom.c
@@ -56,6 +56,10 @@
#define PARF_AXI_MSTR_WR_ADDR_HALT_V2 0x1a8
#define PARF_Q2A_FLUSH 0x1ac
#define PARF_LTSSM 0x1b0
+#define PARF_INT_ALL_STATUS 0x224
+#define PARF_INT_ALL_CLEAR 0x228
+#define PARF_INT_ALL_MASK 0x22c
+#define PARF_STATUS 0x230
#define PARF_SID_OFFSET 0x234
#define PARF_BDF_TRANSLATE_CFG 0x24c
#define PARF_DBI_BASE_ADDR_V2 0x350
@@ -131,6 +135,13 @@
/* PARF_LTSSM register fields */
#define LTSSM_EN BIT(8)
+#define SW_CLEAR_FLUSH_MODE BIT(10)
+#define FLUSH_MODE BIT(11)
+
+/* PARF_INT_ALL_{STATUS/CLEAR/MASK} register fields */
+#define INT_ALL_LINK_DOWN 1
+#define PARF_INT_ALL_LINK_DOWN BIT(INT_ALL_LINK_DOWN)
+#define PARF_INT_MSI_DEV_0_7 GENMASK(30, 23)
/* PARF_NO_SNOOP_OVERRIDE register fields */
#define WR_NO_SNOOP_OVERRIDE_EN BIT(1)
@@ -142,6 +153,9 @@
/* PARF_BDF_TO_SID_CFG fields */
#define BDF_TO_SID_BYPASS BIT(0)
+/* PARF_STATUS fields */
+#define FLUSH_COMPLETED BIT(8)
+
/* ELBI_SYS_CTRL register fields */
#define ELBI_SYS_CTRL_LT_ENABLE BIT(0)
@@ -166,6 +180,7 @@
PCIE_CAP_SLOT_POWER_LIMIT_SCALE)
#define PERST_DELAY_US 1000
+#define FLUSH_TIMEOUT_US 100
#define QCOM_PCIE_CRC8_POLYNOMIAL (BIT(2) | BIT(1) | BIT(0))
@@ -282,11 +297,14 @@ struct qcom_pcie {
const struct qcom_pcie_cfg *cfg;
struct dentry *debugfs;
struct list_head ports;
+ int global_irq;
bool suspended;
bool use_pm_opp;
};
#define to_qcom_pcie(x) dev_get_drvdata((x)->dev)
+static int qcom_pcie_reset_root_port(struct pci_host_bridge *bridge,
+ struct pci_dev *pdev);
static void __qcom_pcie_perst_assert(struct qcom_pcie *pcie, bool assert)
{
@@ -1330,6 +1348,8 @@ static int qcom_pcie_host_init(struct dw_pcie_rp *pp)
goto err_assert_reset;
}
+ pp->bridge->reset_root_port = qcom_pcie_reset_root_port;
+
return 0;
err_assert_reset:
@@ -1613,6 +1633,78 @@ static void qcom_pcie_icc_opp_update(struct qcom_pcie *pcie)
}
}
+/*
+ * Qcom PCIe controllers only support one Root Port per controller instance. So
+ * this function ignores the 'pci_dev' associated with the Root Port and just
+ * resets the host bridge, which in turn resets the Root Port also.
+ */
+static int qcom_pcie_reset_root_port(struct pci_host_bridge *bridge,
+ struct pci_dev *pdev)
+{
+ struct pci_bus *bus = bridge->bus;
+ struct dw_pcie_rp *pp = bus->sysdata;
+ struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
+ struct qcom_pcie *pcie = to_qcom_pcie(pci);
+ struct device *dev = pcie->pci->dev;
+ u32 val;
+ int ret;
+
+ /* Wait for the pending transactions to be completed */
+ ret = readl_relaxed_poll_timeout(pcie->parf + PARF_STATUS, val,
+ val & FLUSH_COMPLETED, 10,
+ FLUSH_TIMEOUT_US);
+ if (ret) {
+ dev_err(dev, "Flush completion failed: %d\n", ret);
+ goto err_host_deinit;
+ }
+
+ /* Clear the FLUSH_MODE to allow the core to be reset */
+ val = readl(pcie->parf + PARF_LTSSM);
+ val |= SW_CLEAR_FLUSH_MODE;
+ writel(val, pcie->parf + PARF_LTSSM);
+
+ /* Wait for the FLUSH_MODE to clear */
+ ret = readl_relaxed_poll_timeout(pcie->parf + PARF_LTSSM, val,
+ !(val & FLUSH_MODE), 10,
+ FLUSH_TIMEOUT_US);
+ if (ret) {
+ dev_err(dev, "Flush mode clear failed: %d\n", ret);
+ goto err_host_deinit;
+ }
+
+ qcom_pcie_host_deinit(pp);
+
+ ret = qcom_pcie_host_init(pp);
+ if (ret) {
+ dev_err(dev, "Host init failed\n");
+ return ret;
+ }
+
+ ret = dw_pcie_setup_rc(pp);
+ if (ret)
+ goto err_host_deinit;
+
+ /*
+ * Re-enable global IRQ events as the PARF_INT_ALL_MASK register is
+ * non-sticky.
+ */
+ if (pcie->global_irq)
+ writel_relaxed(PARF_INT_ALL_LINK_DOWN | PARF_INT_MSI_DEV_0_7,
+ pcie->parf + PARF_INT_ALL_MASK);
+
+ qcom_pcie_start_link(pci);
+ dw_pcie_wait_for_link(pci);
+
+ dev_dbg(dev, "Root Port reset completed\n");
+
+ return 0;
+
+err_host_deinit:
+ qcom_pcie_host_deinit(pp);
+
+ return ret;
+}
+
static int qcom_pcie_link_transition_count(struct seq_file *s, void *data)
{
struct qcom_pcie *pcie = (struct qcom_pcie *)dev_get_drvdata(s->private);
@@ -1650,6 +1742,27 @@ static void qcom_pcie_init_debugfs(struct qcom_pcie *pcie)
qcom_pcie_link_transition_count);
}
+static irqreturn_t qcom_pcie_global_irq_thread(int irq, void *data)
+{
+ struct qcom_pcie *pcie = data;
+ struct dw_pcie_rp *pp = &pcie->pci->pp;
+ struct device *dev = pcie->pci->dev;
+ struct pci_dev *port;
+ unsigned long status = readl_relaxed(pcie->parf + PARF_INT_ALL_STATUS);
+
+ writel_relaxed(status, pcie->parf + PARF_INT_ALL_CLEAR);
+
+ if (test_and_clear_bit(INT_ALL_LINK_DOWN, &status)) {
+ dev_dbg(dev, "Received Link down event\n");
+ for_each_pci_bridge(port, pp->bridge->bus) {
+ if (pci_pcie_type(port) == PCI_EXP_TYPE_ROOT_PORT)
+ pci_host_handle_link_down(port);
+ }
+ }
+
+ return IRQ_HANDLED;
+}
+
static void qcom_pci_free_msi(void *ptr)
{
struct dw_pcie_rp *pp = (struct dw_pcie_rp *)ptr;
@@ -1852,7 +1965,7 @@ static int qcom_pcie_probe(struct platform_device *pdev)
struct dw_pcie_rp *pp;
struct resource *res;
struct dw_pcie *pci;
- int ret;
+ int ret, irq;
pcie_cfg = of_device_get_match_data(dev);
if (!pcie_cfg) {
@@ -2009,6 +2122,32 @@ static int qcom_pcie_probe(struct platform_device *pdev)
goto err_phy_exit;
}
+ irq = platform_get_irq_byname_optional(pdev, "global");
+ if (irq > 0) {
+ const char *name;
+
+ name = devm_kasprintf(dev, GFP_KERNEL, "qcom_pcie_global_irq%d",
+ pci_domain_nr(pp->bridge->bus));
+ if (!name) {
+ ret = -ENOMEM;
+ goto err_host_deinit;
+ }
+
+ ret = devm_request_threaded_irq(&pdev->dev, irq, NULL,
+ qcom_pcie_global_irq_thread,
+ IRQF_ONESHOT, name, pcie);
+ if (ret) {
+ dev_err_probe(&pdev->dev, ret,
+ "Failed to request Global IRQ\n");
+ goto err_host_deinit;
+ }
+
+ writel_relaxed(PARF_INT_ALL_LINK_DOWN | PARF_INT_MSI_DEV_0_7,
+ pcie->parf + PARF_INT_ALL_MASK);
+
+ pcie->global_irq = irq;
+ }
+
qcom_pcie_icc_opp_update(pcie);
if (pcie->mhi)
@@ -2016,6 +2155,8 @@ static int qcom_pcie_probe(struct platform_device *pdev)
return 0;
+err_host_deinit:
+ dw_pcie_host_deinit(pp);
err_phy_exit:
list_for_each_entry_safe(port, tmp_port, &pcie->ports, list) {
list_for_each_entry_safe(perst, tmp_perst, &port->perst, list)
--
2.48.1
* [PATCH v8 3/5] PCI: host-common: Add link down handling for Root Ports
From: Manivannan Sadhasivam via B4 Relay @ 2026-05-18 14:59 UTC
To: Bjorn Helgaas, Mahesh J Salgaonkar, Oliver O'Halloran,
Will Deacon, Lorenzo Pieralisi, Krzysztof Wilczyński,
Manivannan Sadhasivam, Rob Herring, Heiko Stuebner, Philipp Zabel
Cc: linux-pci, linux-kernel, linuxppc-dev, linux-arm-kernel,
linux-arm-msm, linux-rockchip, Niklas Cassel, Wilfred Mallawa,
Krishna Chaitanya Chundru, mani, Lukas Wunner, Richard Zhu,
Brian Norris, Wilson Ding, Manivannan Sadhasivam, Frank Li,
Manivannan Sadhasivam
In-Reply-To: <20260518-pci-port-reset-v8-0-eb5a7d331dfc@oss.qualcomm.com>
From: Manivannan Sadhasivam <mani@kernel.org>
The PCI link, when down, needs to be recovered to bring it back. But on
some platforms, that cannot be done in a generic way as link recovery
procedure is platform specific. So add a new API
pci_host_handle_link_down() that could be called by the host bridge drivers
for a specific Root Port when the link goes down.
The API accepts the 'pci_dev' corresponding to the Root Port which observed
the link down event. If CONFIG_PCIEAER is enabled, the API calls
pcie_do_recovery() function with 'pci_channel_io_frozen' as the state. This
will result in the execution of the AER Fatal error handling code. Since
the link down recovery is pretty much the same as AER Fatal error handling,
pcie_do_recovery() helper is reused here. First, the AER error_detected()
callback will be triggered for the bridge and then for the downstream
devices. Finally, pci_host_reset_root_port() will be called for the Root
Port, which will reset the Root Port using 'reset_root_port' callback to
recover the link. Once that's done, a resume message will be broadcast to
the bridge and the downstream devices, indicating successful link recovery.
But if CONFIG_PCIEAER is not enabled in the kernel, only
pci_host_reset_root_port() API will be called, which will in turn call
pci_bus_error_reset() to just reset the Root Port as there is no way we
could inform the drivers about link recovery.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
Tested-by: Brian Norris <briannorris@chromium.org>
Tested-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Tested-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
---
drivers/pci/controller/pci-host-common.c | 35 ++++++++++++++++++++++++++++++++
drivers/pci/controller/pci-host-common.h | 1 +
drivers/pci/pci.c | 1 +
drivers/pci/pcie/err.c | 1 +
4 files changed, 38 insertions(+)
diff --git a/drivers/pci/controller/pci-host-common.c b/drivers/pci/controller/pci-host-common.c
index d6258c1cffe5..15ebff8a542a 100644
--- a/drivers/pci/controller/pci-host-common.c
+++ b/drivers/pci/controller/pci-host-common.c
@@ -12,9 +12,11 @@
#include <linux/of.h>
#include <linux/of_address.h>
#include <linux/of_pci.h>
+#include <linux/pci.h>
#include <linux/pci-ecam.h>
#include <linux/platform_device.h>
+#include "../pci.h"
#include "pci-host-common.h"
static void gen_pci_unmap_cfg(void *ptr)
@@ -106,5 +108,38 @@ void pci_host_common_remove(struct platform_device *pdev)
}
EXPORT_SYMBOL_GPL(pci_host_common_remove);
+static pci_ers_result_t pci_host_reset_root_port(struct pci_dev *dev)
+{
+ int ret;
+
+ pci_lock_rescan_remove();
+ ret = pci_bus_error_reset(dev);
+ pci_unlock_rescan_remove();
+ if (ret) {
+ pci_err(dev, "Failed to reset Root Port: %d\n", ret);
+ return PCI_ERS_RESULT_DISCONNECT;
+ }
+
+ pci_info(dev, "Root Port has been reset\n");
+
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
+static void pci_host_recover_root_port(struct pci_dev *port)
+{
+#if IS_ENABLED(CONFIG_PCIEAER)
+ pcie_do_recovery(port, pci_channel_io_frozen, pci_host_reset_root_port);
+#else
+ pci_host_reset_root_port(port);
+#endif
+}
+
+void pci_host_handle_link_down(struct pci_dev *port)
+{
+ pci_info(port, "Recovering Root Port due to Link Down\n");
+ pci_host_recover_root_port(port);
+}
+EXPORT_SYMBOL_GPL(pci_host_handle_link_down);
+
MODULE_DESCRIPTION("Common library for PCI host controller drivers");
MODULE_LICENSE("GPL v2");
diff --git a/drivers/pci/controller/pci-host-common.h b/drivers/pci/controller/pci-host-common.h
index b5075d4bd7eb..dd12dd1a1b23 100644
--- a/drivers/pci/controller/pci-host-common.h
+++ b/drivers/pci/controller/pci-host-common.h
@@ -17,6 +17,7 @@ int pci_host_common_init(struct platform_device *pdev,
struct pci_host_bridge *bridge,
const struct pci_ecam_ops *ops);
void pci_host_common_remove(struct platform_device *pdev);
+void pci_host_handle_link_down(struct pci_dev *port);
struct pci_config_window *pci_host_common_ecam_create(struct device *dev,
struct pci_host_bridge *bridge, const struct pci_ecam_ops *ops);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 651505b3bd60..35dc9f54a8ef 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5669,6 +5669,7 @@ int pci_bus_error_reset(struct pci_dev *bridge)
{
return pci_reset_bridge(bridge, PCI_RESET_NO_RESTORE);
}
+EXPORT_SYMBOL_GPL(pci_bus_error_reset);
int pci_try_reset_bridge(struct pci_dev *bridge)
{
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 13b9d9eb714f..d77403d8855b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -292,3 +292,4 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
return status;
}
+EXPORT_SYMBOL_GPL(pcie_do_recovery);
--
2.48.1
* [PATCH v8 1/5] PCI: dwc: ep: Clear MSI iATU mapping in dw_pcie_ep_cleanup()
From: Manivannan Sadhasivam via B4 Relay @ 2026-05-18 14:59 UTC
To: Bjorn Helgaas, Mahesh J Salgaonkar, Oliver O'Halloran,
Will Deacon, Lorenzo Pieralisi, Krzysztof Wilczyński,
Manivannan Sadhasivam, Rob Herring, Heiko Stuebner, Philipp Zabel
Cc: linux-pci, linux-kernel, linuxppc-dev, linux-arm-kernel,
linux-arm-msm, linux-rockchip, Niklas Cassel, Wilfred Mallawa,
Krishna Chaitanya Chundru, mani, Lukas Wunner, Richard Zhu,
Brian Norris, Wilson Ding, Manivannan Sadhasivam
In-Reply-To: <20260518-pci-port-reset-v8-0-eb5a7d331dfc@oss.qualcomm.com>
From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
The MSI iATU mapping is currently only cleared when the endpoint is
stopped via configfs or when the host updates the MSI address/size.
This avoids redundant iATU reconfiguration every time the endpoint
raises an MSI interrupt.
However, a fundamental reset triggered by PERST# assert/deassert
resets all iATU inbound/outbound registers without going through the
configfs stop path. If the host also retains the same MSI address/size
after PERST# deassert, the driver never clears the stale MSI iATU
mapping. It then continues using this stale mapping to raise the MSI
interrupts, which can cause IOMMU faults and MSI failures on the host.
Fix this by clearing the MSI iATU mapping inside dw_pcie_ep_cleanup(),
which is already called as part of the PERST# assert/deassert sequence.
This unmaps the MSI iATU region and sets the msi_iatu_mapped flag to
false, ensuring that dw_pcie_ep_raise_msi_irq() performs a fresh iATU
mapping on its next invocation, regardless of whether the host changed
the MSI address/size.
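The stale-mapping problem can be illustrated with a small userspace model. This is only a sketch under stated assumptions: 'struct model_ep', its fields other than msi_iatu_mapped, and all model_* functions are hypothetical and do not correspond to the real DesignWare endpoint code; only the idea of a cached "mapped" flag comes from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the endpoint state (not the real struct dw_pcie_ep). */
struct model_ep {
	bool msi_iatu_mapped;    /* driver's cached view of the mapping */
	bool hw_iatu_programmed; /* what the hardware really holds */
	uint64_t mapped_addr;    /* host MSI address the cache was built for */
	unsigned int map_count;  /* how often the iATU was (re)programmed */
};

/*
 * Raise an MSI. The iATU is reprogrammed only when the cache is invalid or
 * the host MSI address changed. Returns false when the MSI would be raised
 * through a mapping the hardware no longer has.
 */
static bool model_raise_msi(struct model_ep *ep, uint64_t host_msi_addr)
{
	assert(ep != NULL);

	if (!ep->msi_iatu_mapped || ep->mapped_addr != host_msi_addr) {
		ep->hw_iatu_programmed = true;
		ep->mapped_addr = host_msi_addr;
		ep->msi_iatu_mapped = true;
		ep->map_count++;
	}

	return ep->hw_iatu_programmed;
}

/*
 * PERST# assert/deassert: the hardware loses its iATU programming. The
 * cached flag is dropped only when cleanup_clears_cache is true, which
 * models the behaviour added by this patch.
 */
static void model_fundamental_reset(struct model_ep *ep,
				    bool cleanup_clears_cache)
{
	assert(ep != NULL);

	ep->hw_iatu_programmed = false;
	if (cleanup_clears_cache)
		ep->msi_iatu_mapped = false;
}
```

In this model, a reset with cleanup_clears_cache set to false followed by an MSI to the unchanged host address returns false (stale mapping), while the same sequence with cleanup_clears_cache set to true reprograms the mapping and returns true.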
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
---
drivers/pci/controller/dwc/pcie-designware-ep.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
index d4dc3b24da60..4ae0e1b55f39 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -1035,6 +1035,11 @@ void dw_pcie_ep_cleanup(struct dw_pcie_ep *ep)
{
struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+ if (ep->msi_iatu_mapped) {
+ dw_pcie_ep_unmap_addr(ep->epc, 0, 0, ep->msi_mem_phys);
+ ep->msi_iatu_mapped = false;
+ }
+
dwc_pcie_debugfs_deinit(pci);
dw_pcie_edma_remove(pci);
}
--
2.48.1
* [PATCH v8 0/5] PCI: Add support for resetting the Root Ports in a platform specific way
From: Manivannan Sadhasivam via B4 Relay @ 2026-05-18 14:59 UTC
To: Bjorn Helgaas, Mahesh J Salgaonkar, Oliver O'Halloran,
Will Deacon, Lorenzo Pieralisi, Krzysztof Wilczyński,
Manivannan Sadhasivam, Rob Herring, Heiko Stuebner, Philipp Zabel
Cc: linux-pci, linux-kernel, linuxppc-dev, linux-arm-kernel,
linux-arm-msm, linux-rockchip, Niklas Cassel, Wilfred Mallawa,
Krishna Chaitanya Chundru, mani, Lukas Wunner, Richard Zhu,
Brian Norris, Wilson Ding, Manivannan Sadhasivam, Frank Li,
Manivannan Sadhasivam
Hi,
Currently, in the event of AER/DPC, PCI core will try to reset the slot (Root
Port) and its subordinate devices by invoking bridge control reset and FLR. But
in some cases like AER Fatal error, it might be necessary to reset the Root
Ports using the PCI host bridge drivers in a platform specific way (as indicated
by the TODO in the pcie_do_recovery() function in drivers/pci/pcie/err.c).
Otherwise, the PCI link won't be recovered successfully.
So this series adds a new callback 'pci_host_bridge::reset_root_port' for the
host bridge drivers to reset the Root Port when a fatal error happens.
Also, this series allows the host bridge drivers to handle PCI link down event
by resetting the Root Ports and recovering the bus. This is accomplished with
the help of the new 'pci_host_handle_link_down()' API. Host bridge drivers are
expected to call this API (preferably from a threaded IRQ handler) with the
relevant Root Port 'pci_dev' when a link down event is detected for the port.
The API will reuse the pcie_do_recovery() function to recover the link if AER
support is enabled, otherwise it will directly call the reset_root_port()
callback of the host bridge driver (if exists).
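The two recovery paths described above can be sketched as a userspace model. Everything here is an assumption made for illustration: the model_* names, 'struct model_port' and the counters are hypothetical, and the real PCI core has generic reset paths and more recovery states that are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for pci_ers_result_t; only two outcomes modeled. */
enum model_ers_result { MODEL_ERS_RECOVERED, MODEL_ERS_DISCONNECT };

struct model_port {
	const char *name;
	/* Models 'pci_host_bridge::reset_root_port'; may be NULL. */
	int (*reset_root_port)(struct model_port *port);
	int notified; /* error_detected() style notifications sent */
	int resumed;  /* resume() style notifications sent */
};

/* Reset step: call the platform callback if one exists. */
static enum model_ers_result model_reset(struct model_port *port)
{
	if (port->reset_root_port && port->reset_root_port(port) != 0)
		return MODEL_ERS_DISCONNECT;

	return MODEL_ERS_RECOVERED;
}

/*
 * Link down entry point. With AER support the drivers are notified before
 * the reset and resumed after a successful one; without it only the reset
 * runs, because there is no way to inform the drivers.
 */
static enum model_ers_result model_handle_link_down(struct model_port *port,
						    bool aer_enabled)
{
	enum model_ers_result res;

	assert(port != NULL);

	if (!aer_enabled)
		return model_reset(port);

	port->notified++;
	res = model_reset(port);
	if (res == MODEL_ERS_RECOVERED)
		port->resumed++;

	return res;
}

/* Example platform callbacks: one that succeeds and one that fails. */
static int model_reset_ok(struct model_port *port)
{
	printf("%s: platform reset done\n", port->name);
	return 0;
}

static int model_reset_fail(struct model_port *port)
{
	printf("%s: platform reset failed\n", port->name);
	return -1;
}
```

A host bridge driver would correspond to the caller of model_handle_link_down() here, passing the port that saw the event; a failing callback leaves the drivers notified but never resumed.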
For reference, I've modified the pcie-qcom driver to call
pci_host_handle_link_down() API with Root Port 'pci_dev' after receiving the
LDn global_irq event and populated 'pci_host_bridge::reset_root_port()'
callback to reset the Root Ports.
Testing
-------
Tested on Qcom Lemans AU Ride platform with Host and EP SoCs connected over PCIe
link. Simulated the LDn by disabling LTSSM_EN on the EP and I could verify that
the link was getting recovered successfully.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
---
Changes in v8:
- Removed pci_save_state() for the Root Port during recovery as the PCI core now
saves the config space during enumeration
- Added save/restore in pci_endpoint_test.c driver to save the config space
after enabling BME and restoring it after reset
- Added a patch to unmap MSI address post LDn
- Rebased on top of v7.1-rc1
- Link to v7: https://lore.kernel.org/r/20260310-pci-port-reset-v7-0-9dd00ccc25ab@oss.qualcomm.com
Changes in v7:
- Dropped Rockchip Root port reset patch due to reported issues. But the series
works on other platforms as tested by others.
- Added pci_{lock/unlock}_rescan_remove() to guard pci_bus_error_reset() as the
device could be removed in-between due to Native hotplug interrupt.
- Rebased on top of v7.0-rc1
- Link to v6: https://lore.kernel.org/r/20250715-pci-port-reset-v6-0-6f9cce94e7bb@oss.qualcomm.com
Changes in v6:
- Incorporated the patch: https://lore.kernel.org/all/20250524185304.26698-2-manivannan.sadhasivam@linaro.org/
- Link to v5: https://lore.kernel.org/r/20250715-pci-port-reset-v5-0-26a5d278db40@oss.qualcomm.com
Changes in v5:
* Reworked the pci_host_handle_link_down() to accept Root Port instead of
resetting all Root Ports in the event of link down.
* Renamed 'reset_slot' to 'reset_root_port' to avoid confusion as both terms
were used interchangeably and the series is intended to reset Root Port only.
* Added the Rockchip driver change to this series.
* Dropped the applied patches and review/tested tags due to rework.
* Rebased on top of v6.16-rc1.
Changes in v4:
- Handled link down first in the irq handler
- Updated ICC & OPP bandwidth after link up in reset_slot() callback
- Link to v3: https://lore.kernel.org/r/20250417-pcie-reset-slot-v3-0-59a10811c962@linaro.org
Changes in v3:
- Made the pci-host-common driver a common library for host controller
drivers
- Moved the reset slot code to pci-host-common library
- Link to v2: https://lore.kernel.org/r/20250416-pcie-reset-slot-v2-0-efe76b278c10@linaro.org
Changes in v2:
- Moved calling reset_slot() callback from pcie_do_recovery() to pcibios_reset_secondary_bus()
- Link to v1: https://lore.kernel.org/r/20250404-pcie-reset-slot-v1-0-98952918bf90@linaro.org
---
Manivannan Sadhasivam (5):
PCI: dwc: ep: Clear MSI iATU mapping in dw_pcie_ep_cleanup()
PCI/ERR: Add support for resetting the Root Ports in a platform specific way
PCI: host-common: Add link down handling for Root Ports
PCI: qcom: Add support for resetting the Root Port due to link down event
misc: pci_endpoint_test: Add AER error handlers
drivers/misc/pci_endpoint_test.c | 23 ++++
drivers/pci/controller/dwc/pcie-designware-ep.c | 5 +
drivers/pci/controller/dwc/pcie-qcom.c | 143 +++++++++++++++++++++++-
drivers/pci/controller/pci-host-common.c | 35 ++++++
drivers/pci/controller/pci-host-common.h | 1 +
drivers/pci/pci.c | 14 +++
drivers/pci/pcie/err.c | 6 +-
include/linux/pci.h | 1 +
8 files changed, 222 insertions(+), 6 deletions(-)
---
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
change-id: 20250715-pci-port-reset-4d9519570123
Best regards,
--
Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
* [PATCH v8 2/5] PCI/ERR: Add support for resetting the Root Ports in a platform specific way
From: Manivannan Sadhasivam via B4 Relay @ 2026-05-18 14:59 UTC (permalink / raw)
To: Bjorn Helgaas, Mahesh J Salgaonkar, Oliver O'Halloran,
Will Deacon, Lorenzo Pieralisi, Krzysztof Wilczyński,
Manivannan Sadhasivam, Rob Herring, Heiko Stuebner, Philipp Zabel
Cc: linux-pci, linux-kernel, linuxppc-dev, linux-arm-kernel,
linux-arm-msm, linux-rockchip, Niklas Cassel, Wilfred Mallawa,
Krishna Chaitanya Chundru, mani, Lukas Wunner, Richard Zhu,
Brian Norris, Wilson Ding, Manivannan Sadhasivam, Frank Li,
Manivannan Sadhasivam
In-Reply-To: <20260518-pci-port-reset-v8-0-eb5a7d331dfc@oss.qualcomm.com>
From: Manivannan Sadhasivam <mani@kernel.org>
Some host bridge devices require resetting the Root Ports in a platform
specific way to recover them from error conditions such as Fatal AER
errors, Link Down, etc. So introduce the pci_host_bridge::reset_root_port()
callback and call it from pcibios_reset_secondary_bus() if available. Also,
restore the Root Port config space after a successful reset.
The 'reset_root_port' callback is responsible for resetting the given Root
Port referenced by the 'pci_dev' pointer in a platform specific way and
bring it back to the working state if possible. If any error occurs during
the reset operation, relevant errno should be returned.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
Tested-by: Brian Norris <briannorris@chromium.org>
Tested-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Tested-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
---
drivers/pci/pci.c | 13 +++++++++++++
drivers/pci/pcie/err.c | 5 -----
include/linux/pci.h | 1 +
3 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8f7cfcc00090..651505b3bd60 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4809,6 +4809,19 @@ void pci_reset_secondary_bus(struct pci_dev *dev)
void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)
{
+ struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+ int ret;
+
+ if (pci_is_root_bus(dev->bus) && host->reset_root_port) {
+ ret = host->reset_root_port(host, dev);
+ if (ret)
+ pci_err(dev, "Failed to reset Root Port: %d\n", ret);
+ else
+ pci_restore_state(dev);
+
+ return;
+ }
+
pci_reset_secondary_bus(dev);
}
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc111d7..13b9d9eb714f 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -256,11 +256,6 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
}
if (status == PCI_ERS_RESULT_NEED_RESET) {
- /*
- * TODO: Should call platform-specific
- * functions to reset slot before calling
- * drivers' slot_reset callbacks?
- */
status = PCI_ERS_RESULT_RECOVERED;
pci_dbg(bridge, "broadcast slot_reset message\n");
pci_walk_bridge(bridge, report_slot_reset, &status);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..439dbd0d9184 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -646,6 +646,7 @@ struct pci_host_bridge {
void (*release_fn)(struct pci_host_bridge *);
int (*enable_device)(struct pci_host_bridge *bridge, struct pci_dev *dev);
void (*disable_device)(struct pci_host_bridge *bridge, struct pci_dev *dev);
+ int (*reset_root_port)(struct pci_host_bridge *bridge, struct pci_dev *dev);
void *release_data;
unsigned int ignore_reset_delay:1; /* For entire hierarchy */
unsigned int no_ext_tags:1; /* No Extended Tags */
--
2.48.1
* [PATCH v8 5/5] misc: pci_endpoint_test: Add AER error handlers
From: Manivannan Sadhasivam via B4 Relay @ 2026-05-18 14:59 UTC (permalink / raw)
To: Bjorn Helgaas, Mahesh J Salgaonkar, Oliver O'Halloran,
Will Deacon, Lorenzo Pieralisi, Krzysztof Wilczyński,
Manivannan Sadhasivam, Rob Herring, Heiko Stuebner, Philipp Zabel
Cc: linux-pci, linux-kernel, linuxppc-dev, linux-arm-kernel,
linux-arm-msm, linux-rockchip, Niklas Cassel, Wilfred Mallawa,
Krishna Chaitanya Chundru, mani, Lukas Wunner, Richard Zhu,
Brian Norris, Wilson Ding, Manivannan Sadhasivam
In-Reply-To: <20260518-pci-port-reset-v8-0-eb5a7d331dfc@oss.qualcomm.com>
From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
This Endpoint test driver doesn't need to do anything fancy in its error
handlers, but just restore the config space that was saved during probe and
report the correct result. This helps in making sure that the AER recovery
succeeds.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
---
drivers/misc/pci_endpoint_test.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c
index dbd017cabbb9..3e89bd48c196 100644
--- a/drivers/misc/pci_endpoint_test.c
+++ b/drivers/misc/pci_endpoint_test.c
@@ -1327,6 +1327,8 @@ static int pci_endpoint_test_probe(struct pci_dev *pdev,
goto err_kfree_name;
}
+ pci_save_state(pdev);
+
return 0;
err_kfree_name:
@@ -1448,12 +1450,33 @@ static const struct pci_device_id pci_endpoint_test_tbl[] = {
};
MODULE_DEVICE_TABLE(pci, pci_endpoint_test_tbl);
+static pci_ers_result_t pci_endpoint_test_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
+{
+ if (state == pci_channel_io_perm_failure)
+ return PCI_ERS_RESULT_DISCONNECT;
+
+ return PCI_ERS_RESULT_NEED_RESET;
+}
+
+static pci_ers_result_t pci_endpoint_test_slot_reset(struct pci_dev *pdev)
+{
+ pci_restore_state(pdev);
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
+static const struct pci_error_handlers pci_endpoint_test_err_handler = {
+ .error_detected = pci_endpoint_test_error_detected,
+ .slot_reset = pci_endpoint_test_slot_reset,
+};
+
static struct pci_driver pci_endpoint_test_driver = {
.name = DRV_MODULE_NAME,
.id_table = pci_endpoint_test_tbl,
.probe = pci_endpoint_test_probe,
.remove = pci_endpoint_test_remove,
.sriov_configure = pci_sriov_configure_simple,
+ .err_handler = &pci_endpoint_test_err_handler,
};
module_pci_driver(pci_endpoint_test_driver);
--
2.48.1
* Re: [PATCH v2 0/6] fsl-mc: Move over to device MSI infrastructure
From: Marc Zyngier @ 2026-05-18 14:24 UTC (permalink / raw)
To: Christophe Leroy (CS GROUP)
Cc: Arnd Bergmann, Ioana Ciornei, Thomas Gleixner, Sascha Bischoff,
linux-kernel, linux-arm-kernel, linuxppc-dev
In-Reply-To: <4f34dc22-8dd7-42fb-8c16-4e359faf60d6@kernel.org>
On Mon, 18 May 2026 14:51:48 +0100,
"Christophe Leroy (CS GROUP)" <chleroy@kernel.org> wrote:
>
> > > Do I need to respin it?
>
> No, I'd like to avoid having to rebase again. If you have changes to
> the series please send followup patches.
No follow-up patches for that particular series, I just wanted to find
out whether I could start posting additional changes that do not
directly involve fsl-mc, but that are prevented by the current state
of the code (such as trying to move the ITS initialisation much later
in the boot process).
I'll postpone my changes to 7.3, and keep my fingers crossed for this
to hit 7.2.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v2 0/6] fsl-mc: Move over to device MSI infrastructure
From: Christophe Leroy (CS GROUP) @ 2026-05-18 13:51 UTC (permalink / raw)
To: Marc Zyngier, Arnd Bergmann
Cc: Ioana Ciornei, Thomas Gleixner, Sascha Bischoff, linux-kernel,
linux-arm-kernel, linuxppc-dev
In-Reply-To: <87v7cva080.wl-maz@kernel.org>
Hi Marc,
On 10/05/2026 at 15:00, Marc Zyngier wrote:
> On Thu, 26 Feb 2026 09:50:43 +0000,
> "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> wrote:
>>
>>
>> On Tue, 24 Feb 2026 10:09:30 +0000, Marc Zyngier wrote:
>>> This is the second drop of this cleanup series for the fsl-mc MSI
>>> infrastructure, initially posted at [1].
>>>
>>> * From v1 [1]:
>>>
>>> - Drop the now unused DOMAIN_BUS_FSL_MC_MSI bus token
>>>
>>> [...]
>>
>> Applied, thanks!
>>
>> [1/6] fsl-mc: Remove MSI domain propagation to sub-devices
>> commit: 1fb7392ee3408494d4d62c09a8c3e5f5934caba7
>> [2/6] fsl-mc: Add minimal infrastructure to use platform MSI
>> commit: 0c9f522f2d41c7e055a602a0d2c41dc7af01010b
>> [3/6] irqchip/gic-v3-its: Add fsl_mc device plumbing to the msi-parent handling
>> commit: cf3179b4e53f527aba9f0c6c3b921619c8adf761
>> [4/6] fsl-mc: Switch over to per-device platform MSI
>> commit: 4a958e47c246fa3fb8954f4303e0da15ab3d026d
>> [5/6] fsl-mc: Remove legacy MSI implementation
>> commit: 14b1cbcc6cec0b02298f4adf717646cd943b7ef6
>> [6/6] platform-msi: Remove stale comment
>> commit: f0a2eac6a597268034fd40d92c1469182438b53d
>
> Is there any particular reason why this didn't make it into 7.1?
I sent a pull request [1] as early as possible after the Easter break, but
it was apparently too late. Or maybe that was because I did a last-minute
rebase; I'm not sure what the real reason is. I tentatively sent an
alternative pull request [2] the day after, having rolled back the
rebase, but I got no feedback.
>
> I was really looking forward to some additional cleanups in the GICv3
> ITS code, and this gets in the way.
It is still in linux-next and, fingers crossed, will go into 7.2.
> > Do I need to respin it?
No, I'd like to avoid having to rebase again. If you have changes to the
series please send followup patches.
Christophe
[1]
https://patchwork.kernel.org/project/linux-soc/patch/69cdadd6-523c-4a79-8cb3-1deff5910699@kernel.org/
[2]
https://patchwork.kernel.org/project/linux-soc/patch/310b65ba-1d73-4ce0-b332-00cde70acc69@kernel.org/
* [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Yury Murashka @ 2026-05-18 13:23 UTC (permalink / raw)
To: bhelgaas, mahesh
Cc: oohall, corbet, skhan, linux-pci, linux-doc, linux-kernel,
linuxppc-dev, Yury Murashka
pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
If a new AER error is subsequently reported, the AER driver calls
find_source_device() to find the source of the error. It rescans the
whole bus and picks the first device reporting an AER error. Because the
previous error was never cleared, the error is attributed to the wrong
device and AER recovery is started for the wrong device.
Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
AER error status even when recovery fails, preventing stale errors from
causing incorrect device identification on subsequent AER events.
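To make the failure mode concrete, here is a minimal userspace model of the
scenario described above. It is a sketch, not kernel code: the model_* names
are hypothetical, a single flag stands in for the AER status registers, and
the scan simply returns the first device with the flag set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one flag per device stands in for AER status bits. */
struct model_dev {
	const char *name;
	bool aer_status_set;
};

/* Models the described scan: return the first device reporting an error. */
static const struct model_dev *model_find_source(const struct model_dev *devs,
						 size_t n)
{
	assert(devs);

	for (size_t i = 0; i < n; i++) {
		if (devs[i].aer_status_set)
			return &devs[i];
	}
	return NULL;
}

/* Models a failed recovery; clear_on_failure mirrors the proposed option. */
static void model_recovery_failed(struct model_dev *dev, bool clear_on_failure)
{
	if (clear_on_failure)
		dev->aer_status_set = false;
}

/*
 * Recovery for devs[0] fails, then devs[1] reports a new error.
 * Returns the index of the device the scan blames for the new error.
 */
int model_blamed_index(bool clear_on_failure)
{
	struct model_dev devs[2] = {
		{ "dev0", true },	/* first error, recovery will fail */
		{ "dev1", false },
	};
	const struct model_dev *src;

	model_recovery_failed(&devs[0], clear_on_failure);
	devs[1].aer_status_set = true;	/* a new, unrelated error */

	src = model_find_source(devs, 2);
	return src ? (int)(src - devs) : -1;
}
```

Without clearing, the stale flag on the first device wins the scan and the new
error is blamed on it; with clearing, the scan reaches the device that
actually reported the new error.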
Signed-off-by: Yury Murashka <yurypm@arista.com>
---
Documentation/admin-guide/kernel-parameters.txt | 5 +++++
drivers/pci/pci.c | 2 ++
drivers/pci/pci.h | 2 ++
drivers/pci/pcie/err.c | 13 +++++++++++++
4 files changed, 22 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4d0f545fb..5a9e266f5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5301,6 +5301,11 @@ Kernel parameters
nomio [S390] Do not use MIO instructions.
norid [S390] ignore the RID field and force use of
one PCI domain per PCI function
+ aer_clear_on_recovery_failure
+ [PCIE] If the PCIEAER kernel config parameter is
+ enabled, this kernel boot option can be used to
+ enable AER errors cleanup even if error recovery
+ failed.
notph [PCIE] If the PCIE_TPH kernel config parameter
is enabled, this kernel boot option can be used
to disable PCIe TLP Processing Hints support
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d34266651..701459c62 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6769,6 +6769,8 @@ static int __init pci_setup(char *str)
disable_acs_redir_param = str + 18;
} else if (!strncmp(str, "config_acs=", 11)) {
config_acs_param = str + 11;
+ } else if (!strncmp(str, "aer_clear_on_recovery_failure", 29)) {
+ pci_enable_aer_clear_on_recovery_failure();
} else {
pr_err("PCI: Unknown option `%s'\n", str);
}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4a14f88e5..093a7c896 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1292,6 +1292,7 @@ int pci_aer_clear_status(struct pci_dev *dev);
int pci_aer_raw_clear_status(struct pci_dev *dev);
void pci_save_aer_state(struct pci_dev *dev);
void pci_restore_aer_state(struct pci_dev *dev);
+void pci_enable_aer_clear_on_recovery_failure(void);
#else
static inline void pci_no_aer(void) { }
static inline void pci_aer_init(struct pci_dev *d) { }
@@ -1301,6 +1302,7 @@ static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
static inline void pci_save_aer_state(struct pci_dev *dev) { }
static inline void pci_restore_aer_state(struct pci_dev *dev) { }
+static inline void pci_enable_aer_clear_on_recovery_failure(void) { }
#endif
#ifdef CONFIG_ACPI
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc11..29d655a34 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -21,6 +21,13 @@
#include "portdrv.h"
#include "../pci.h"
+static int enable_aer_clear_on_recovery_failure;
+
+void pci_enable_aer_clear_on_recovery_failure(void)
+{
+ enable_aer_clear_on_recovery_failure = 1;
+}
+
static pci_ers_result_t merge_result(enum pci_ers_result orig,
enum pci_ers_result new)
{
@@ -289,6 +296,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
return status;
failed:
+ if (enable_aer_clear_on_recovery_failure &&
+ (host->native_aer || pcie_ports_native)) {
+ pcie_clear_device_status(dev);
+ pci_aer_clear_nonfatal_status(dev);
+ }
+
pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
pci_walk_bridge(bridge, report_perm_failure_detected, NULL);
--
2.51.0
* [PATCH v2] perf kvm stat: Add missing mappings for PPC kvm exit reasons
From: Gautam Menghani @ 2026-05-18 12:50 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, james.clark, atrajeev
Cc: Gautam Menghani, linuxppc-dev, linux-perf-users, linux-kernel
The macro kvm_trace_symbol_exit provides the mappings between the exit
trap vectors and their names. Add mappings for H_FAC_UNAVAIL and H_VIRT
so that these exit reasons are displayed as strings instead of vector
numbers when using perf kvm stat.
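For readers unfamiliar with the table, the lookup can be pictured with the
following userspace sketch. The model_* names are hypothetical, only the
entries visible in this patch are listed, and printing the raw vector in hex
for unmapped values is an assumption made for the example rather than a
statement about perf's exact output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of one vector-to-name entry. */
struct model_exit_reason {
	unsigned long vector;
	const char *name;
};

/* Only the entries visible in the patch context; the real table has more. */
static const struct model_exit_reason model_exit_reasons[] = {
	{ 0xe00, "H_DATA_STORAGE" },
	{ 0xe20, "H_INST_STORAGE" },
	{ 0xe40, "H_EMUL_ASSIST" },
	{ 0xea0, "H_VIRT" },
	{ 0xf00, "PERFMON" },
	{ 0xf20, "ALTIVEC" },
	{ 0xf40, "VSX" },
	{ 0xf80, "H_FAC_UNAVAIL" },
};

/* Returns the name for a vector, or NULL when there is no mapping. */
const char *model_exit_name(unsigned long vector)
{
	size_t n = sizeof(model_exit_reasons) / sizeof(model_exit_reasons[0]);

	for (size_t i = 0; i < n; i++) {
		if (model_exit_reasons[i].vector == vector)
			return model_exit_reasons[i].name;
	}
	return NULL;
}

/* Writes the name, or the raw vector when unmapped (assumed format). */
void model_format_exit(unsigned long vector, char *buf, size_t len)
{
	const char *name = model_exit_name(vector);

	assert(buf && len > 0);

	if (name)
		snprintf(buf, len, "%s", name);
	else
		snprintf(buf, len, "0x%lx", vector);
}
```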
Signed-off-by: Gautam Menghani <gautam@linux.ibm.com>
---
v1 -> v2:
1. Update the patch title and description to remove dependency on
another file trace_book3s.h
tools/perf/util/kvm-stat-arch/book3s_hv_exits.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/kvm-stat-arch/book3s_hv_exits.h b/tools/perf/util/kvm-stat-arch/book3s_hv_exits.h
index 2011376c7ab5..2688ca7d0399 100644
--- a/tools/perf/util/kvm-stat-arch/book3s_hv_exits.h
+++ b/tools/perf/util/kvm-stat-arch/book3s_hv_exits.h
@@ -26,8 +26,10 @@
{0xe00, "H_DATA_STORAGE"}, \
{0xe20, "H_INST_STORAGE"}, \
{0xe40, "H_EMUL_ASSIST"}, \
+ {0xea0, "H_VIRT"}, \
{0xf00, "PERFMON"}, \
{0xf20, "ALTIVEC"}, \
- {0xf40, "VSX"}
+ {0xf40, "VSX"}, \
+ {0xf80, "H_FAC_UNAVAIL"}
#endif
--
2.53.0
* Re: [PATCH] perf kvm stat: Update the exit reason mappings
From: Gautam Menghani @ 2026-05-18 12:21 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Ian Rogers, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, adrian.hunter, james.clark,
linux-perf-users, linux-kernel, linuxppc-dev, maddy
In-Reply-To: <qzngq7lt.ritesh.list@gmail.com>
On Wed, May 13, 2026 at 09:33:10AM +0530, Ritesh Harjani wrote:
>
> ++ linuxppc-dev
>
> Gautam Menghani <gautam@linux.ibm.com> writes:
>
> > On Tue, May 12, 2026 at 08:25:08AM -0700, Ian Rogers wrote:
> >> On Tue, May 12, 2026 at 5:04 AM Gautam Menghani <gautam@linux.ibm.com> wrote:
> >> >
> >> > Sync the exit reason mappings with the mappings in trace_book3s.h
> >>
> >> I see:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kvm/trace_book3s.h
> >> Would it make sense to have a copy in perf and use the check headers
> >> code to keep them in sync?
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/check-headers.sh
> >
> > I'll take a look at this, thanks
> >
> >>
> >> Could you add the commits that add the H_VIRT and H_FAC_UNAVAIL
> >> definitions? I don't see them in Linus' tree yet.
> >
> > I posted that patch earlier today - https://lore.kernel.org/linuxppc-dev/20260512115724.59299-1-gautam@linux.ibm.com/
> > should've pasted the link in the patch
> >
>
> For patches not yet merged and having such a dependency, this could cause
> confusion. What I generally tend to do in such case is, group this
> patch (changes in tools/perf/util/kvm-stat-arch/book3s_hv_exits.h) into
> the same series which adds H_FAC_UNAVAIL to trace_book3s.h [1].
> This way it is easier for everyone to keep track of the dependencies.
>
> [1]: https://lore.kernel.org/linuxppc-dev/20260512115724.59299-1-gautam@linux.ibm.com/
My patch description is incorrect; there actually isn't a dependency, and
both patches can go in independently. I'll send a v2 to make this clear.
>
> Note, that we should still cc the relevant mailing lists, reviewers and
> maintainers to get an Acked-by. Since the changes in this patch are
> largely powerpc specific, so IMO, it should be ok even if it goes via
> powerpc tree via a common series, as long as everyone agrees.
Yes noted, thanks.
- Gautam
* Re: [PATCH 4/5] x86/pci: Use official API to iterate over PCI buses
From: Gerd Bayer @ 2026-05-18 12:01 UTC (permalink / raw)
To: Dave Hansen, Richard Henderson, Matt Turner, Magnus Lindholm,
Russell King, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Bjorn Helgaas,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin
Cc: Yinghai Lu, linux-alpha, linux-kernel, linux-arm-kernel,
linuxppc-dev, linux-pci, Gerd Bayer
In-Reply-To: <553c703f-ba9c-4785-91ba-2cf62ceb9653@intel.com>
On Fri, 2026-05-15 at 08:13 -0700, Dave Hansen wrote:
> On 5/15/26 07:22, Gerd Bayer wrote:
> > static int __init pcibios_assign_resources(void)
> > {
> > - struct pci_bus *bus;
> > + struct pci_bus *bus = NULL;
> >
> > if (!(pci_probe & PCI_ASSIGN_ROMS))
> > - list_for_each_entry(bus, &pci_root_buses, node)
> > + while ((bus = pci_find_next_bus(bus)) != NULL)
> > pcibios_allocate_rom_resources(bus);
>
> What's with the 'bus = NULL'? I thought there was some crazy macro magic
> going on or something, but pci_find_next_bus() looks like a normal
> function that's just taking a pointer and not _modifying_ the pointer value.
Initializing 'bus = NULL' makes sure that pci_find_next_bus() starts
at the list head; list_for_each_entry() did that implicitly. I didn't
want to rely on implicit zero-init for local variables on all the various
architectures. But I'm fine with dropping it here, if you prefer.
>
> Also, wouldn't this be a more readable way of writing what you have?
>
> while (bus = pci_find_next_bus(bus))
Yeah, another occasion of me being (overly?) verbose.
arch/sparc/kernel/pci.c was my blueprint. Again, something that I'm ok
to drop.
>
> For that matter isn't the kernel idiom for these things:
>
> for_each_pci_bus(bus) {
> // do bus stuff
> }
>
> I'm kinda surprised there isn't one of those already.
Just guessing: There was too little use of pci_find_next_bus() to
warrant that short-cut. But I can make a proposal in the next
iteration.
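To illustrate the iteration contract discussed in this thread, here is a
small userspace model. It is only a sketch: the model_* names and the
for-each macro are hypothetical, not an existing kernel interface. The model
follows the "pass NULL to start a new search" convention, which is why the
open-coded while loop needs the cursor to start out as NULL, whereas a
for-each style macro can do that initialization itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a bus on a singly linked list of root buses. */
struct model_bus {
	int number;
	struct model_bus *next;
};

static struct model_bus *model_root_buses;	/* list head */

/* NULL starts a new search; otherwise return the successor (or NULL). */
static struct model_bus *model_find_next_bus(const struct model_bus *from)
{
	return from ? from->next : model_root_buses;
}

/* Hypothetical helper macro: starts the search itself, so no NULL init. */
#define model_for_each_bus(bus)					\
	for ((bus) = model_find_next_bus(NULL); (bus) != NULL;	\
	     (bus) = model_find_next_bus(bus))

#define MODEL_MAX_BUSES 8

/* Builds a list of n buses and counts them with the macro. */
int model_count_buses(int n)
{
	static struct model_bus buses[MODEL_MAX_BUSES];
	struct model_bus *bus;
	int count = 0;

	assert(n >= 0 && n <= MODEL_MAX_BUSES);

	for (int i = 0; i < n; i++) {
		buses[i].number = i;
		buses[i].next = (i + 1 < n) ? &buses[i + 1] : NULL;
	}
	model_root_buses = n ? &buses[0] : NULL;

	model_for_each_bus(bus)
		count++;

	return count;
}
```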
Thanks,
Gerd
* Re: [PATCH v2 0/2] LKDTM powerpc enhancements - Part2
From: Michael Ellerman @ 2026-05-18 11:58 UTC (permalink / raw)
To: Sayali Patil, linuxppc-dev, maddy
Cc: linux-kernel, Ritesh Harjani, Mahesh Salgaonkar, kees
In-Reply-To: <cover.1778975974.git.sayalip@linux.ibm.com>
On 18/5/2026 16:56, Sayali Patil wrote:
> Hi all,
>
> This series adds a new LKDTM trigger PPC_RADIX_TLBIEL, to validate
> machine check handling on radix MMU systems and improves reliability of
> the PPC_SLB_MULTIHIT test by adding isync instructions after slbmte
> operations.
>
> Please review the patches and provide any feedback or suggestions
> for improvement.
>
> Thanks,
> Sayali
>
> ---
>
> v1->v2
> - Split the patch series into two parts.
> - Updated "lkdtm/powerpc: add PPC_RADIX_TLBIEL test for radix MCE
> validation" as per review comments:
> Wrapped Hash-MMU specific functions with #ifdef CONFIG_PPC_64S_HASH_MMU.
> Guarded powerpc_crashtypes registration with #ifdef CONFIG_PPC_BOOK3S_64
> Updated comment explaining the MCE trigger condition for radix MMU.
>
> v1: https://lore.kernel.org/all/cover.1778057685.git.sayalip@linux.ibm.com/
> ---
>
> Sayali Patil (2):
> lkdtm/powerpc: add isync after slbmte to enforce SLB update ordering
> lkdtm/powerpc: add PPC_RADIX_TLBIEL test for radix MCE validation
>
> drivers/misc/lkdtm/Makefile | 2 +-
> drivers/misc/lkdtm/core.c | 2 +-
> drivers/misc/lkdtm/powerpc.c | 49 +++++++++++++++++++++++++
> tools/testing/selftests/lkdtm/tests.txt | 1 +
> 4 files changed, 52 insertions(+), 2 deletions(-)
Both changes look good to me.
You should send them to Kees, who maintains lkdtm. I've added him to Cc,
but that may not be sufficient to get his attention.
Reviewed-by: Michael Ellerman <mpe@kernel.org>
cheers
* Re: [PATCH v4 0/5] powerpc/bpf: Add support for verifier selftest
From: Christophe Leroy (CS GROUP) @ 2026-05-18 11:44 UTC (permalink / raw)
To: adubey, bpf
Cc: hbathini, linuxppc-dev, maddy, ast, andrii, daniel, shuah,
linux-kselftest, stable
In-Reply-To: <20260517214043.12975-1-adubey@linux.ibm.com>
On 17/05/2026 at 23:40, adubey@linux.ibm.com wrote:
> From: Abhishek Dubey <adubey@linux.ibm.com>
>
> The verifier selftest validates JITed instructions by matching expected
> disassembly output. The first two patches fix issues in powerpc instruction
> disassembly that were causing test flow failures. The fix is common for
> 64-bit & 32-bit powerpc. Add support for the powerpc-specific "__powerpc64"
> architecture tag in the third patch, enabling proper test filtering in
> verifier test files. Introduce verifier testcases for tailcalls on powerpc64
> in the final patch.
Build fails:
DESCEND objtool
INSTALL libsubcmd_headers
CC arch/powerpc/net/bpf_jit_comp32.o
arch/powerpc/net/bpf_jit_comp32.c:232:6: error: conflicting types for
'bpf_jit_build_epilogue'; have 'void(u32 *, struct codegen_context *)'
{aka 'void(unsigned int *, struct codegen_context *)'}
232 | void bpf_jit_build_epilogue(u32 *image, struct codegen_context
*ctx)
| ^~~~~~~~~~~~~~~~~~~~~~
In file included from arch/powerpc/net/bpf_jit_comp32.c:19:
arch/powerpc/net/bpf_jit.h:217:6: note: previous declaration of
'bpf_jit_build_epilogue' with type 'void(u32 *, u32 *, struct
codegen_context *)' {aka 'void(unsigned int *, unsigned int *, struct
codegen_context *)'}
217 | void bpf_jit_build_epilogue(u32 *image, u32 *fimage, struct
codegen_context *ctx);
| ^~~~~~~~~~~~~~~~~~~~~~
arch/powerpc/net/bpf_jit_comp32.c: In function 'bpf_jit_build_epilogue':
arch/powerpc/net/bpf_jit_comp32.c:240:43: error: passing argument 2 of
'bpf_jit_build_fentry_stubs' from incompatible pointer type
[-Wincompatible-pointer-types]
240 | bpf_jit_build_fentry_stubs(image, ctx);
| ^~~
| |
| struct codegen_context *
arch/powerpc/net/bpf_jit.h:218:50: note: expected 'u32 *' {aka 'unsigned
int *'} but argument is of type 'struct codegen_context *'
218 | void bpf_jit_build_fentry_stubs(u32 *image, u32 *fimage, struct
codegen_context *ctx);
| ~~~~~^~~~~~
arch/powerpc/net/bpf_jit_comp32.c:240:9: error: too few arguments to
function 'bpf_jit_build_fentry_stubs'; expected 3, have 2
240 | bpf_jit_build_fentry_stubs(image, ctx);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
arch/powerpc/net/bpf_jit.h:218:6: note: declared here
218 | void bpf_jit_build_fentry_stubs(u32 *image, u32 *fimage, struct
codegen_context *ctx);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
make[4]: *** [scripts/Makefile.build:289:
arch/powerpc/net/bpf_jit_comp32.o] Error 1
make[3]: *** [scripts/Makefile.build:548: arch/powerpc/net] Error 2
make[2]: *** [scripts/Makefile.build:548: arch/powerpc] Error 2
make[1]: *** [/home/chleroy/linux-powerpc/Makefile:2143: .] Error 2
make: *** [Makefile:248: __sub-make] Error 2
Christophe
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Barry Song @ 2026-05-18 11:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <agrWuDNGddNmvMFD@lucifer>
On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen". How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> > >
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> > page faults. I assume this is due to buggy user code
> > or poor synchronization between reads and unmap().
> > So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> > is blocked by page-fault I/O in some applications.
> > The blocking occurs in the `dup_mmap()` path during
> > fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
>
> Hm but did you observe this 'chained waiting'? And what were the latencies?
We have clearly observed that the `fork()` operations of many
popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
end up waiting on page-fault (PF) I/O when the VMA lock is
held during I/O operations. This has already become a
practical issue. I also believe this can lead to chained
waiting, since the global `mmap_lock` blocks all threads that
need to acquire it.
>
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
>
> Yeah I'm really not sure about that.
>
> Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> page faults, which is really what fb49c455323ff is about.
>
> So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> now you're saying 'let's change the forking behaviour that's been like that for
> forever'.
I am afraid not. Before we introduced the per-VMA lock, we
were not performing I/O while holding `mmap_lock`. A page fault
that needed I/O would drop the `mmap_lock` read lock and allow
`fork()` to proceed.
Now, you are suggesting performing I/O while holding the VMA
lock, which changes the requirements and introduces this
problem.
>
> I think you would _really_ have to be sure that's safe. And forking is a very
> dangerous time in terms of complexity and sensitivity and 'weird stuff'
> happening so I'd tread _very_ carefully here.
Yep. I think my original proposal did not require any changes
to `fork()`, since it simply preserved the current behavior of
dropping the VMA lock before performing I/O. In that model,
`fork()` would not end up waiting on I/O at all.
What you are suggesting now appears to be performing I/O while
holding the VMA lock, which in turn introduces the need to
change `fork()`.
>
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > *mm, struct mm_struct *oldmm)
> > for_each_vma(vmi, mpnt) {
> > struct file *file;
> >
> > - retval = vma_start_write_killable(mpnt);
> > + /*
> > + * For anonymous or writable private VMAs, prevent
> > + * concurrent CoW faults.
> > + */
>
> To nitpick, I think the comment's confusing, but it also tells you that you don't need
> the specific anon check - writable private is sufficient. And it's not really just
> CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
>
> > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > + (mpnt->vm_flags & VM_WRITE)))
> > + retval = vma_start_write_killable(mpnt);
>
> I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> it R/W.
>
> I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> likely PROT_NONE) is here, just do the second check?
>
> (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> vma_test(mpnt, VMA_MAYWRITE_BIT))
Yep, I can definitely refine the check further. But before
doing that, I'd first like to confirm that we are aligned on
the direction.
If you still intend to hold the VMA lock while performing I/O,
then I think we should fix `fork()` to avoid taking
`vma_start_write()`.
>
> > if (retval < 0)
> > goto loop_out;
> > if (mpnt->vm_flags & VM_DONTCOPY) {
> >
> > Based on the above, we may want to re-check whether fork()
> > can be blocked by page faults. At the same time, if Suren,
> > you, or anyone else has any comments, please feel free to
> > share them.
> >
> > Best Regards
> > Barry
>
> Technical commentary above is sort of 'just cos' :) because I really question
> doing this honestly.
I think we either need to fix `fork()`, or keep the current
behavior of dropping the VMA lock before performing I/O.
>
> I'd also like to get Suren's input, however.
Yes, of course.
>
> Thanks, Lorenzo
Best Regards
Barry
* Re: [PATCH v4 07/13] dma-direct: make dma_direct_map_phys() honor DMA_ATTR_CC_SHARED
From: Christian Borntraeger @ 2026-05-18 10:04 UTC
To: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
linux-coco
Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
Alexey Kardashevskiy, Dan Williams, Xu Yilun, linuxppc-dev,
linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik, Sven Schnelle,
x86, Halil Pasic, Matthew Rosato, Jaehoon Kim
In-Reply-To: <20260512090408.794195-8-aneesh.kumar@kernel.org>
cc Halil, Matt, Jaehoon.
Can you have a look at what this means for virtio on secure execution?
On 12.05.26 at 11:04, Aneesh Kumar K.V (Arm) wrote:
> Teach dma_direct_map_phys() to select the DMA address encoding based on
> DMA_ATTR_CC_SHARED.
>
> Use phys_to_dma_unencrypted() for decrypted mappings and
> phys_to_dma_encrypted() otherwise. If a device requires unencrypted DMA
> but the source physical address is still encrypted, force the mapping
> through swiotlb so the DMA address and backing memory attributes remain
> consistent.
>
> Update the arm64, x86, s390 and powerpc secure-guest setup to no longer use
> the swiotlb force option.
>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> Changes from v3:
> * Handle DMA_ATTR_MMIO
> ---
> arch/arm64/mm/init.c | 4 +--
> arch/powerpc/platforms/pseries/svm.c | 2 +-
> arch/s390/mm/init.c | 2 +-
> arch/x86/kernel/pci-dma.c | 4 +--
> kernel/dma/direct.c | 4 ++-
> kernel/dma/direct.h | 38 +++++++++++++---------------
> 6 files changed, 24 insertions(+), 30 deletions(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 97987f850a33..acf67c7064db 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -338,10 +338,8 @@ void __init arch_mm_preinit(void)
> unsigned int flags = SWIOTLB_VERBOSE;
> bool swiotlb = max_pfn > PFN_DOWN(arm64_dma_phys_limit);
>
> - if (is_realm_world()) {
> + if (is_realm_world())
> swiotlb = true;
> - flags |= SWIOTLB_FORCE;
> - }
>
> if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) && !swiotlb) {
> /*
> diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
> index 384c9dc1899a..7a403dbd35ee 100644
> --- a/arch/powerpc/platforms/pseries/svm.c
> +++ b/arch/powerpc/platforms/pseries/svm.c
> @@ -29,7 +29,7 @@ static int __init init_svm(void)
> * need to use the SWIOTLB buffer for DMA even if dma_capable() says
> * otherwise.
> */
> - ppc_swiotlb_flags |= SWIOTLB_ANY | SWIOTLB_FORCE;
> + ppc_swiotlb_flags |= SWIOTLB_ANY;
>
> /* Share the SWIOTLB buffer with the host. */
> swiotlb_update_mem_attributes();
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 1f72efc2a579..843dbd445124 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -149,7 +149,7 @@ static void __init pv_init(void)
> virtio_set_mem_acc_cb(virtio_require_restricted_mem_acc);
>
> /* make sure bounce buffers are shared */
> - swiotlb_init(true, SWIOTLB_FORCE | SWIOTLB_VERBOSE);
> + swiotlb_init(true, SWIOTLB_VERBOSE);
> swiotlb_update_mem_attributes();
> }
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 6267363e0189..75cf8f6ae8cd 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -59,10 +59,8 @@ static void __init pci_swiotlb_detect(void)
> * bounce buffers as the hypervisor can't access arbitrary VM memory
> * that is not explicitly shared with it.
> */
> - if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> + if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> x86_swiotlb_enable = true;
> - x86_swiotlb_flags |= SWIOTLB_FORCE;
> - }
> }
> #else
> static inline void __init pci_swiotlb_detect(void)
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index ac315dd046c4..5aaa813c5509 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -691,8 +691,10 @@ size_t dma_direct_max_mapping_size(struct device *dev)
> {
> /* If SWIOTLB is active, use its maximum mapping size */
> if (is_swiotlb_active(dev) &&
> - (dma_addressing_limited(dev) || is_swiotlb_force_bounce(dev)))
> + (dma_addressing_limited(dev) || is_swiotlb_force_bounce(dev) ||
> + force_dma_unencrypted(dev)))
> return swiotlb_max_mapping_size(dev);
> +
> return SIZE_MAX;
> }
>
> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
> index e05dc7649366..4e35264ab6f8 100644
> --- a/kernel/dma/direct.h
> +++ b/kernel/dma/direct.h
> @@ -89,36 +89,32 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> dma_addr_t dma_addr;
>
> if (is_swiotlb_force_bounce(dev)) {
> - if (!(attrs & DMA_ATTR_CC_SHARED)) {
> - if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
> - return DMA_MAPPING_ERROR;
> + if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
> + return DMA_MAPPING_ERROR;
>
> - return swiotlb_map(dev, phys, size, dir, attrs);
> - }
> - } else if (attrs & DMA_ATTR_CC_SHARED) {
> - return DMA_MAPPING_ERROR;
> + return swiotlb_map(dev, phys, size, dir, attrs);
> }
>
> - if (attrs & DMA_ATTR_MMIO) {
> - dma_addr = phys;
> - if (unlikely(!dma_capable(dev, dma_addr, size, false, attrs)))
> - goto err_overflow;
> - } else if (attrs & DMA_ATTR_CC_SHARED) {
> + if (attrs & DMA_ATTR_CC_SHARED)
> dma_addr = phys_to_dma_unencrypted(dev, phys);
> + else
> + dma_addr = phys_to_dma_encrypted(dev, phys);
> +
> + if (attrs & DMA_ATTR_MMIO) {
> if (unlikely(!dma_capable(dev, dma_addr, size, false, attrs)))
> goto err_overflow;
> - } else {
> - dma_addr = phys_to_dma(dev, phys);
> - if (unlikely(!dma_capable(dev, dma_addr, size, true, attrs)) ||
> - dma_kmalloc_needs_bounce(dev, size, dir)) {
> - if (is_swiotlb_active(dev) &&
> - !(attrs & DMA_ATTR_REQUIRE_COHERENT))
> - return swiotlb_map(dev, phys, size, dir, attrs);
> + goto dma_mapped;
> + }
>
> - goto err_overflow;
> - }
> + if (unlikely(!dma_capable(dev, dma_addr, size, true, attrs)) ||
> + dma_kmalloc_needs_bounce(dev, size, dir)) {
> + if (is_swiotlb_active(dev) &&
> + !(attrs & DMA_ATTR_REQUIRE_COHERENT))
> + return swiotlb_map(dev, phys, size, dir, attrs);
> + goto err_overflow;
> }
>
> +dma_mapped:
> if (!dev_is_dma_coherent(dev) &&
> !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
> arch_sync_dma_for_device(phys, size, dir);
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: David Hildenbrand (Arm) @ 2026-05-18 9:53 UTC
To: Barry Song, Matthew Wilcox, surenb
Cc: akpm, linux-mm, ljs, liam, vbabka, rppt, mhocko, jack, pfalcato,
wanglian, chentao, lianux.mm, kunwu.chan, liyangouwen1, chrisl,
kasong, shikemeng, nphamcs, bhe, youngjun.park, linux-arm-kernel,
linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
Nanzhe Zhao
In-Reply-To: <CAGsJ_4ysMcrmDLSOwBkf7qwCQrcDWeEMXkHDajTJFMLKUk0bSQ@mail.gmail.com>
On 5/17/26 10:45, Barry Song wrote:
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
>>>
>>> It doesn’t have to involve unmapping or applying mprotect to
>>> the entire VMA—just a portion of it is sufficient.
>>
>> Yes, but that still fails to answer "does this actually happen". How much
>> performance is all this complexity in the page fault handler buying us?
>> If you don't answer this question, I'm just going to go in and rip it
>> all out.
>>
>
> Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> waiting for answers),
>
> As promised during LSF/MM/BPF, we conducted thorough
> testing on Android phones to determine whether performing
> I/O in `filemap_fault()` can block `vma_start_write()`.
> I wanted to give a quick update on this question.
>
> Nanzhe at Xiaomi created tracing scripts and ran various
> applications on Android devices with I/O performed under
> the VMA lock in `filemap_fault()`. We found that:
>
> 1. There are very few cases where unmap() is blocked by
> page faults. I assume this is due to buggy user code
> or poor synchronization between reads and unmap().
> So I assume it is not a problem.
>
> 2. We observed many cases where `vma_start_write()`
> is blocked by page-fault I/O in some applications.
> The blocking occurs in the `dup_mmap()` path during
> fork().
>
> With Suren's commit fb49c455323ff ("fork: lock VMAs of
> the parent process when forking"), we now always hold
> `vma_write_lock()` for each VMA. Note that the
> `mmap_lock` write lock is also held, which could lead to
> chained waiting if page-fault I/O is performed without
> releasing the VMA lock.
>
> My gut feeling is that Suren's commit may be overshooting,
> so my rough idea is that we might want to do something like
> the following (we haven't tested it yet and it might be
> wrong):
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..5ddaf297f31a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> *mm, struct mm_struct *oldmm)
> for_each_vma(vmi, mpnt) {
> struct file *file;
>
> - retval = vma_start_write_killable(mpnt);
> + /*
> + * For anonymous or writable private VMAs, prevent
> + * concurrent CoW faults.
> + */
> + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> + (mpnt->vm_flags & VM_WRITE)))
> + retval = vma_start_write_killable(mpnt);
Likely is_cow_mapping() is what you would want to check to handle VMAs that
could have anonymous pages in them.
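To make the difference concrete, here is a minimal userspace sketch comparing the condition from the hunk above with is_cow_mapping(). It assumes the classic VM_* bit values and the usual is_cow_mapping() definition (private and VM_MAYWRITE); proposed_check() is a hypothetical name for the condition in the diff, not an existing kernel function.

```c
#include <stdbool.h>

/*
 * Assumed flag values, mirroring the classic definitions in
 * include/linux/mm.h; redefined here so the sketch builds in userspace.
 */
#define VM_WRITE	0x00000002UL
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/* Private mapping that is, or could later be made, writable. */
static bool is_cow_mapping(unsigned long flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/*
 * Hypothetical helper: the condition from the proposed dup_mmap() hunk.
 * has_file stands in for mpnt->vm_file being non-NULL.
 */
static bool proposed_check(bool has_file, unsigned long flags)
{
	return !has_file || (!(flags & VM_SHARED) && (flags & VM_WRITE));
}
```

Under these assumptions the two disagree for a private file mapping that is currently read-only but carries VM_MAYWRITE: proposed_check() skips it, while is_cow_mapping() still covers it, since such a VMA can be mprotect()ed to R/W and then hold anonymous pages.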
--
Cheers,
David
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Lorenzo Stoakes @ 2026-05-18 9:46 UTC
To: Barry Song
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4ysMcrmDLSOwBkf7qwCQrcDWeEMXkHDajTJFMLKUk0bSQ@mail.gmail.com>
On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen? I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen". How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
> >
>
> Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> waiting for answers),
>
> As promised during LSF/MM/BPF, we conducted thorough
> testing on Android phones to determine whether performing
> I/O in `filemap_fault()` can block `vma_start_write()`.
> I wanted to give a quick update on this question.
>
> Nanzhe at Xiaomi created tracing scripts and ran various
> applications on Android devices with I/O performed under
> the VMA lock in `filemap_fault()`. We found that:
>
> 1. There are very few cases where unmap() is blocked by
> page faults. I assume this is due to buggy user code
> or poor synchronization between reads and unmap().
> So I assume it is not a problem.
>
> 2. We observed many cases where `vma_start_write()`
> is blocked by page-fault I/O in some applications.
> The blocking occurs in the `dup_mmap()` path during
> fork().
>
> With Suren's commit fb49c455323ff ("fork: lock VMAs of
> the parent process when forking"), we now always hold
> `vma_write_lock()` for each VMA. Note that the
> `mmap_lock` write lock is also held, which could lead to
> chained waiting if page-fault I/O is performed without
> releasing the VMA lock.
Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> My gut feeling is that Suren's commit may be overshooting,
> so my rough idea is that we might want to do something like
> the following (we haven't tested it yet and it might be
> wrong):
Yeah I'm really not sure about that.
Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
page faults, which is really what fb49c455323ff is about.
So Suren's patch was essentially restoring the _existing_ forking behaviour, and
now you're saying 'let's change the forking behaviour that's been like that for
forever'.
I think you would _really_ have to be sure that's safe. And forking is a very
dangerous time in terms of complexity and sensitivity and 'weird stuff'
happening so I'd tread _very_ carefully here.
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..5ddaf297f31a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> *mm, struct mm_struct *oldmm)
> for_each_vma(vmi, mpnt) {
> struct file *file;
>
> - retval = vma_start_write_killable(mpnt);
> + /*
> + * For anonymous or writable private VMAs, prevent
> + * concurrent CoW faults.
> + */
To nitpick, I think the comment's confusing, but it also tells you that you don't need
the specific anon check - writable private is sufficient. And it's not really just
CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> + (mpnt->vm_flags & VM_WRITE)))
> + retval = vma_start_write_killable(mpnt);
I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
it R/W.
I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
likely PROT_NONE) is here, just do the second check?
(Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
vma_test(mpnt, VMA_MAYWRITE_BIT))
> if (retval < 0)
> goto loop_out;
> if (mpnt->vm_flags & VM_DONTCOPY) {
>
> Based on the above, we may want to re-check whether fork()
> can be blocked by page faults. At the same time, if Suren,
> you, or anyone else has any comments, please feel free to
> share them.
>
> Best Regards
> Barry
Technical commentary above is sort of 'just cos' :) because I really question
doing this honestly.
I'd also like to get Suren's input, however.
Thanks, Lorenzo
* [PATCH v14 6/8] lib/test: memcpy_kunit: add copy_page() and copy_mc_page() tests
From: Ruidong Tian @ 2026-05-18 8:49 UTC
To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
tianruidong
In-Reply-To: <20260518084956.2538442-1-tianruidong@linux.alibaba.com>
Add KUnit tests for copy_page() and copy_mc_page(), modeled after
the existing memcpy_test() style: a static page-aligned src is filled
with random bytes plus non-zero edges, a two-page dst is zeroed, and
the tests verify byte-for-byte equality and that the adjacent page is
untouched. The copy_mc_page() case additionally checks that the return
value is 0 on clean memory and is gated on CONFIG_ARCH_HAS_COPY_MC.
Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
lib/tests/memcpy_kunit.c | 67 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 66 insertions(+), 1 deletion(-)
diff --git a/lib/tests/memcpy_kunit.c b/lib/tests/memcpy_kunit.c
index d36933554e46..85df53ccfb0c 100644
--- a/lib/tests/memcpy_kunit.c
+++ b/lib/tests/memcpy_kunit.c
@@ -493,6 +493,67 @@ static void memmove_overlap_test(struct kunit *test)
}
}
+/* --- Page-sized copy tests --- */
+
+static u8 page_src[PAGE_SIZE] __aligned(PAGE_SIZE);
+static u8 page_dst[PAGE_SIZE * 2] __aligned(PAGE_SIZE);
+static const u8 page_zero[PAGE_SIZE] __aligned(PAGE_SIZE);
+
+static void init_page(struct kunit *test)
+{
+ /* Get many bit patterns. */
+ get_random_bytes(page_src, PAGE_SIZE);
+
+ /* Make sure we have non-zero edges. */
+ set_random_nonzero(test, &page_src[0]);
+ set_random_nonzero(test, &page_src[PAGE_SIZE - 1]);
+
+ /* Explicitly zero the entire destination. */
+ memset(page_dst, 0, ARRAY_SIZE(page_dst));
+}
+
+static void copy_page_test(struct kunit *test)
+{
+ init_page(test);
+
+ /* Copy. */
+ copy_page(page_dst, page_src);
+
+ /* Verify byte-for-byte exact. */
+ KUNIT_ASSERT_EQ_MSG(test,
+ memcmp(page_dst, page_src, PAGE_SIZE), 0,
+ "copy_page content mismatch with random data");
+
+ /* Verify no overflow into second page. */
+ KUNIT_ASSERT_EQ_MSG(test,
+ memcmp(page_dst + PAGE_SIZE, page_zero, PAGE_SIZE), 0,
+ "copy_page overflow into adjacent page");
+}
+
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+static void copy_mc_page_test(struct kunit *test)
+{
+ int ret;
+
+ init_page(test);
+
+ /* Copy and check return value. */
+ ret = copy_mc_page(page_dst, page_src);
+ KUNIT_ASSERT_EQ_MSG(test, ret, 0,
+ "copy_mc_page returned %d on clean memory", ret);
+
+ /* Verify byte-for-byte exact. */
+ KUNIT_ASSERT_EQ_MSG(test,
+ memcmp(page_dst, page_src, PAGE_SIZE), 0,
+ "copy_mc_page content mismatch with random data");
+
+ /* Verify no overflow into second page. */
+ KUNIT_ASSERT_EQ_MSG(test,
+ memcmp(page_dst + PAGE_SIZE, page_zero, PAGE_SIZE), 0,
+ "copy_mc_page overflow into adjacent page");
+}
+#endif /* CONFIG_ARCH_HAS_COPY_MC */
+
static struct kunit_case memcpy_test_cases[] = {
KUNIT_CASE(memset_test),
KUNIT_CASE(memcpy_test),
@@ -500,6 +561,10 @@ static struct kunit_case memcpy_test_cases[] = {
KUNIT_CASE_SLOW(memmove_test),
KUNIT_CASE_SLOW(memmove_large_test),
KUNIT_CASE_SLOW(memmove_overlap_test),
+ KUNIT_CASE(copy_page_test),
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+ KUNIT_CASE(copy_mc_page_test),
+#endif
{}
};
@@ -510,5 +575,5 @@ static struct kunit_suite memcpy_test_suite = {
kunit_test_suite(memcpy_test_suite);
-MODULE_DESCRIPTION("test cases for memcpy(), memmove(), and memset()");
+MODULE_DESCRIPTION("test cases for memcpy(), memmove(), memset() and copy_page()");
MODULE_LICENSE("GPL");
--
2.39.3
* [PATCH v14 4/8] mm/hwpoison: return -EFAULT when copy fail in copy_mc_[user]_highpage()
From: Ruidong Tian @ 2026-05-18 8:49 UTC
To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
tianruidong, Jonathan Cameron, Mauro Carvalho Chehab
In-Reply-To: <20260518084956.2538442-1-tianruidong@linux.alibaba.com>
From: Tong Tiangen <tongtiangen@huawei.com>
Currently, copy_mc_[user]_highpage() returns zero on success or, on
failure, the number of bytes that weren't copied.
While tracking the number of bytes not copied works fine for x86 and PPC,
doing the same on ARM64 is difficult because there is no caller-saved
register available in copy_page() (lib/copy_page.S) to hold "bytes not
copied", and the copy_mc_page() that follows will encounter the same
problem.
Callers of copy_mc_[user]_highpage() cannot do any processing on the
remaining data (the page has hardware errors); they only check whether
the copy succeeded. So make the interface more generic by returning an
error code (-EFAULT) when the copy fails and zero on success.
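The convention can be illustrated with a minimal userspace sketch. All names here (sketch_copy_mc(), sketch_copy_mc_page(), SKETCH_PAGE_SIZE, the poison_at parameter) are hypothetical and exist only for illustration; a real machine-check-safe copy is done by arch code, not by memcpy().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/*
 * Hypothetical stand-in for an arch copy routine that reports
 * "bytes not copied". poison_at simulates the offset of a hardware
 * error in the source; a negative value means the source is clean.
 */
static size_t sketch_copy_mc(void *dst, const void *src, size_t len,
			     long poison_at)
{
	size_t copied = len;

	if (poison_at >= 0 && (size_t)poison_at < len)
		copied = (size_t)poison_at;
	memcpy(dst, src, copied);
	return len - copied;
}

/*
 * Page-level wrapper following the convention described above: the
 * caller only learns whether the copy succeeded, so any non-zero
 * remainder collapses to -EFAULT.
 */
static int sketch_copy_mc_page(void *dst, const void *src, long poison_at)
{
	return sketch_copy_mc(dst, src, SKETCH_PAGE_SIZE, poison_at) ?
		-EFAULT : 0;
}
```

With this shape, callers test the result for non-zero rather than "> 0", which matches the khugepaged changes in this patch.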
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
include/linux/highmem.h | 8 ++++----
mm/khugepaged.c | 4 ++--
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index af03db851a1d..18dc4aca4aa1 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -427,8 +427,8 @@ static inline void copy_highpage(struct page *to, struct page *from)
/*
* If architecture supports machine check exception handling, define the
* #MC versions of copy_user_highpage and copy_highpage. They copy a memory
- * page with #MC in source page (@from) handled, and return the number
- * of bytes not copied if there was a #MC, otherwise 0 for success.
+ * page with #MC in source page (@from) handled, and return -EFAULT if there
+ * was a #MC, otherwise 0 for success.
*/
static inline int copy_mc_user_highpage(struct page *to, struct page *from,
unsigned long vaddr, struct vm_area_struct *vma)
@@ -447,7 +447,7 @@ static inline int copy_mc_user_highpage(struct page *to, struct page *from,
if (ret)
memory_failure_queue(page_to_pfn(from), 0);
- return ret;
+ return ret ? -EFAULT : 0;
}
static inline int copy_mc_highpage(struct page *to, struct page *from)
@@ -466,7 +466,7 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
if (ret)
memory_failure_queue(page_to_pfn(from), 0);
- return ret;
+ return ret ? -EFAULT : 0;
}
#else
static inline int copy_mc_user_highpage(struct page *to, struct page *from,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b8452dbdb043..cf1b78eed3c3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -810,7 +810,7 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
continue;
}
src_page = pte_page(pteval);
- if (copy_mc_user_highpage(page, src_page, src_addr, vma) > 0) {
+ if (copy_mc_user_highpage(page, src_page, src_addr, vma)) {
result = SCAN_COPY_MC;
break;
}
@@ -2143,7 +2143,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
}
for (i = 0; i < nr_pages; i++) {
- if (copy_mc_highpage(dst, folio_page(folio, i)) > 0) {
+ if (copy_mc_highpage(dst, folio_page(folio, i))) {
result = SCAN_COPY_MC;
goto rollback;
}
--
2.39.3
* [PATCH v14 7/8] arm64: introduce copy_mc_to_kernel() implementation
From: Ruidong Tian @ 2026-05-18 8:49 UTC
To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
tianruidong
In-Reply-To: <20260518084956.2538442-1-tianruidong@linux.alibaba.com>
From: Tong Tiangen <tongtiangen@huawei.com>
The copy_mc_to_kernel() helper is a memory copy implementation that handles
source exceptions. It can be used in memory copy scenarios that tolerate
hardware memory errors (e.g. pmem_read/dax_copy_to_iter).
Currently, only x86 and ppc support this helper. Add it for ARM64 as
well, if ARCH_HAS_COPY_MC is defined, by implementing the copy_mc_to_kernel()
and memcpy_mc() functions.
Because no caller-saved GPR is available for saving "bytes not
copied" in memcpy(), memcpy_mc() is modeled on the implementation
of copy_from_user(). In addition, the fixup of MOPS instructions is not
handled at present.
[Ruidong: refactor memcpy_mc on top of the new memcpy implementation.]
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
arch/arm64/include/asm/string.h | 5 +
arch/arm64/include/asm/uaccess.h | 17 +++
arch/arm64/lib/Makefile | 2 +-
arch/arm64/lib/memcpy.S | 253 +++----------------------------
arch/arm64/lib/memcpy_mc.S | 56 +++++++
arch/arm64/lib/memcpy_template.S | 249 ++++++++++++++++++++++++++++++
mm/kasan/shadow.c | 12 ++
7 files changed, 359 insertions(+), 235 deletions(-)
create mode 100644 arch/arm64/lib/memcpy_mc.S
create mode 100644 arch/arm64/lib/memcpy_template.S
diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
index 3a3264ff47b9..23eca4fb24fa 100644
--- a/arch/arm64/include/asm/string.h
+++ b/arch/arm64/include/asm/string.h
@@ -35,6 +35,10 @@ extern void *memchr(const void *, int, __kernel_size_t);
extern void *memcpy(void *, const void *, __kernel_size_t);
extern void *__memcpy(void *, const void *, __kernel_size_t);
+#define __HAVE_ARCH_MEMCPY_MC
+extern int memcpy_mc(void *, const void *, __kernel_size_t);
+extern int __memcpy_mc(void *, const void *, __kernel_size_t);
+
#define __HAVE_ARCH_MEMMOVE
extern void *memmove(void *, const void *, __kernel_size_t);
extern void *__memmove(void *, const void *, __kernel_size_t);
@@ -57,6 +61,7 @@ void memcpy_flushcache(void *dst, const void *src, size_t cnt);
*/
#define memcpy(dst, src, len) __memcpy(dst, src, len)
+#define memcpy_mc(dst, src, len) __memcpy_mc(dst, src, len)
#define memmove(dst, src, len) __memmove(dst, src, len)
#define memset(s, c, n) __memset(s, c, n)
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index b0c83a08dda9..93277eca2268 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -499,5 +499,22 @@ static inline size_t probe_subpage_writeable(const char __user *uaddr,
}
#endif /* CONFIG_ARCH_HAS_SUBPAGE_FAULTS */
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+/**
+ * copy_mc_to_kernel - memory copy that handles source exceptions
+ *
+ * @to: destination address
+ * @from: source address
+ * @size: number of bytes to copy
+ *
+ * Return 0 for success, or bytes not copied.
+ */
+static inline unsigned long __must_check
+copy_mc_to_kernel(void *to, const void *from, unsigned long size)
+{
+ return memcpy_mc(to, from, size);
+}
+#define copy_mc_to_kernel copy_mc_to_kernel
+#endif
#endif /* __ASM_UACCESS_H */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 1f4c3f743a20..a5820e6c33d4 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -7,7 +7,7 @@ lib-y := clear_user.o delay.o copy_from_user.o \
lib-$(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) += uaccess_flushcache.o
-lib-$(CONFIG_ARCH_HAS_COPY_MC) += copy_mc_page.o
+lib-$(CONFIG_ARCH_HAS_COPY_MC) += copy_mc_page.o memcpy_mc.o
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
index 9b99106fb95f..ef6aea2de9b4 100644
--- a/arch/arm64/lib/memcpy.S
+++ b/arch/arm64/lib/memcpy.S
@@ -15,247 +15,32 @@
*
*/
-#define L(label) .L ## label
+ .macro ldrb1 reg, addr:vararg
+ ldrb \reg, \addr
+ .endm
-#define dstin x0
-#define src x1
-#define count x2
-#define dst x3
-#define srcend x4
-#define dstend x5
-#define A_l x6
-#define A_lw w6
-#define A_h x7
-#define B_l x8
-#define B_lw w8
-#define B_h x9
-#define C_l x10
-#define C_lw w10
-#define C_h x11
-#define D_l x12
-#define D_h x13
-#define E_l x14
-#define E_h x15
-#define F_l x16
-#define F_h x17
-#define G_l count
-#define G_h dst
-#define H_l src
-#define H_h srcend
-#define tmp1 x14
+ .macro ldr1 reg, addr:vararg
+ ldr \reg, \addr
+ .endm
-/* This implementation handles overlaps and supports both memcpy and memmove
- from a single entry point. It uses unaligned accesses and branchless
- sequences to keep the code small, simple and improve performance.
+ .macro ldp1 reg1, reg2, addr:vararg
+ ldp \reg1, \reg2, \addr
+ .endm
- Copies are split into 3 main cases: small copies of up to 32 bytes, medium
- copies of up to 128 bytes, and large copies. The overhead of the overlap
- check is negligible since it is only required for large copies.
+ .macro ret1
+ ret
+ .endm
- Large copies use a software pipelined loop processing 64 bytes per iteration.
- The destination pointer is 16-byte aligned to minimize unaligned accesses.
- The loop tail is handled by always copying 64 bytes from the end.
-*/
+ .macro cpy1 dst, src, count
+ .arch_extension mops
+ cpyp [\dst]!, [\src]!, \count!
+ cpym [\dst]!, [\src]!, \count!
+ cpye [\dst]!, [\src]!, \count!
+ .endm
-SYM_FUNC_START_LOCAL(__pi_memcpy_generic)
- add srcend, src, count
- add dstend, dstin, count
- cmp count, 128
- b.hi L(copy_long)
- cmp count, 32
- b.hi L(copy32_128)
-
- /* Small copies: 0..32 bytes. */
- cmp count, 16
- b.lo L(copy16)
- ldp A_l, A_h, [src]
- ldp D_l, D_h, [srcend, -16]
- stp A_l, A_h, [dstin]
- stp D_l, D_h, [dstend, -16]
- ret
-
- /* Copy 8-15 bytes. */
-L(copy16):
- tbz count, 3, L(copy8)
- ldr A_l, [src]
- ldr A_h, [srcend, -8]
- str A_l, [dstin]
- str A_h, [dstend, -8]
- ret
-
- .p2align 3
- /* Copy 4-7 bytes. */
-L(copy8):
- tbz count, 2, L(copy4)
- ldr A_lw, [src]
- ldr B_lw, [srcend, -4]
- str A_lw, [dstin]
- str B_lw, [dstend, -4]
- ret
-
- /* Copy 0..3 bytes using a branchless sequence. */
-L(copy4):
- cbz count, L(copy0)
- lsr tmp1, count, 1
- ldrb A_lw, [src]
- ldrb C_lw, [srcend, -1]
- ldrb B_lw, [src, tmp1]
- strb A_lw, [dstin]
- strb B_lw, [dstin, tmp1]
- strb C_lw, [dstend, -1]
-L(copy0):
- ret
-
- .p2align 4
- /* Medium copies: 33..128 bytes. */
-L(copy32_128):
- ldp A_l, A_h, [src]
- ldp B_l, B_h, [src, 16]
- ldp C_l, C_h, [srcend, -32]
- ldp D_l, D_h, [srcend, -16]
- cmp count, 64
- b.hi L(copy128)
- stp A_l, A_h, [dstin]
- stp B_l, B_h, [dstin, 16]
- stp C_l, C_h, [dstend, -32]
- stp D_l, D_h, [dstend, -16]
- ret
-
- .p2align 4
- /* Copy 65..128 bytes. */
-L(copy128):
- ldp E_l, E_h, [src, 32]
- ldp F_l, F_h, [src, 48]
- cmp count, 96
- b.ls L(copy96)
- ldp G_l, G_h, [srcend, -64]
- ldp H_l, H_h, [srcend, -48]
- stp G_l, G_h, [dstend, -64]
- stp H_l, H_h, [dstend, -48]
-L(copy96):
- stp A_l, A_h, [dstin]
- stp B_l, B_h, [dstin, 16]
- stp E_l, E_h, [dstin, 32]
- stp F_l, F_h, [dstin, 48]
- stp C_l, C_h, [dstend, -32]
- stp D_l, D_h, [dstend, -16]
- ret
-
- .p2align 4
- /* Copy more than 128 bytes. */
-L(copy_long):
- /* Use backwards copy if there is an overlap. */
- sub tmp1, dstin, src
- cbz tmp1, L(copy0)
- cmp tmp1, count
- b.lo L(copy_long_backwards)
-
- /* Copy 16 bytes and then align dst to 16-byte alignment. */
-
- ldp D_l, D_h, [src]
- and tmp1, dstin, 15
- bic dst, dstin, 15
- sub src, src, tmp1
- add count, count, tmp1 /* Count is now 16 too large. */
- ldp A_l, A_h, [src, 16]
- stp D_l, D_h, [dstin]
- ldp B_l, B_h, [src, 32]
- ldp C_l, C_h, [src, 48]
- ldp D_l, D_h, [src, 64]!
- subs count, count, 128 + 16 /* Test and readjust count. */
- b.ls L(copy64_from_end)
-
-L(loop64):
- stp A_l, A_h, [dst, 16]
- ldp A_l, A_h, [src, 16]
- stp B_l, B_h, [dst, 32]
- ldp B_l, B_h, [src, 32]
- stp C_l, C_h, [dst, 48]
- ldp C_l, C_h, [src, 48]
- stp D_l, D_h, [dst, 64]!
- ldp D_l, D_h, [src, 64]!
- subs count, count, 64
- b.hi L(loop64)
-
- /* Write the last iteration and copy 64 bytes from the end. */
-L(copy64_from_end):
- ldp E_l, E_h, [srcend, -64]
- stp A_l, A_h, [dst, 16]
- ldp A_l, A_h, [srcend, -48]
- stp B_l, B_h, [dst, 32]
- ldp B_l, B_h, [srcend, -32]
- stp C_l, C_h, [dst, 48]
- ldp C_l, C_h, [srcend, -16]
- stp D_l, D_h, [dst, 64]
- stp E_l, E_h, [dstend, -64]
- stp A_l, A_h, [dstend, -48]
- stp B_l, B_h, [dstend, -32]
- stp C_l, C_h, [dstend, -16]
- ret
-
- .p2align 4
-
- /* Large backwards copy for overlapping copies.
- Copy 16 bytes and then align dst to 16-byte alignment. */
-L(copy_long_backwards):
- ldp D_l, D_h, [srcend, -16]
- and tmp1, dstend, 15
- sub srcend, srcend, tmp1
- sub count, count, tmp1
- ldp A_l, A_h, [srcend, -16]
- stp D_l, D_h, [dstend, -16]
- ldp B_l, B_h, [srcend, -32]
- ldp C_l, C_h, [srcend, -48]
- ldp D_l, D_h, [srcend, -64]!
- sub dstend, dstend, tmp1
- subs count, count, 128
- b.ls L(copy64_from_start)
-
-L(loop64_backwards):
- stp A_l, A_h, [dstend, -16]
- ldp A_l, A_h, [srcend, -16]
- stp B_l, B_h, [dstend, -32]
- ldp B_l, B_h, [srcend, -32]
- stp C_l, C_h, [dstend, -48]
- ldp C_l, C_h, [srcend, -48]
- stp D_l, D_h, [dstend, -64]!
- ldp D_l, D_h, [srcend, -64]!
- subs count, count, 64
- b.hi L(loop64_backwards)
-
- /* Write the last iteration and copy 64 bytes from the start. */
-L(copy64_from_start):
- ldp G_l, G_h, [src, 48]
- stp A_l, A_h, [dstend, -16]
- ldp A_l, A_h, [src, 32]
- stp B_l, B_h, [dstend, -32]
- ldp B_l, B_h, [src, 16]
- stp C_l, C_h, [dstend, -48]
- ldp C_l, C_h, [src]
- stp D_l, D_h, [dstend, -64]
- stp G_l, G_h, [dstin, 48]
- stp A_l, A_h, [dstin, 32]
- stp B_l, B_h, [dstin, 16]
- stp C_l, C_h, [dstin]
- ret
-SYM_FUNC_END(__pi_memcpy_generic)
-
-#ifdef CONFIG_AS_HAS_MOPS
- .arch_extension mops
SYM_FUNC_START(__pi_memcpy)
-alternative_if_not ARM64_HAS_MOPS
- b __pi_memcpy_generic
-alternative_else_nop_endif
-
- mov dst, dstin
- cpyp [dst]!, [src]!, count!
- cpym [dst]!, [src]!, count!
- cpye [dst]!, [src]!, count!
- ret
+#include "memcpy_template.S"
SYM_FUNC_END(__pi_memcpy)
-#else
-SYM_FUNC_ALIAS(__pi_memcpy, __pi_memcpy_generic)
-#endif
SYM_FUNC_ALIAS(__memcpy, __pi_memcpy)
EXPORT_SYMBOL(__memcpy)
diff --git a/arch/arm64/lib/memcpy_mc.S b/arch/arm64/lib/memcpy_mc.S
new file mode 100644
index 000000000000..90624d35af4b
--- /dev/null
+++ b/arch/arm64/lib/memcpy_mc.S
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2012-2021, Arm Limited.
+ *
+ * Adapted from the original at:
+ * https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include <asm/asm-uaccess.h>
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, unaligned accesses.
+ *
+ */
+
+ .macro ldrb1 reg, addr:vararg
+ KERNEL_MEM_ERR(9998f, ldrb \reg, \addr)
+ .endm
+
+ .macro ldr1 reg, addr:vararg
+ KERNEL_MEM_ERR(9998f, ldr \reg, \addr)
+ .endm
+
+ .macro ldp1 reg1, reg2, addr:vararg
+ KERNEL_MEM_ERR(9998f, ldp \reg1, \reg2, \addr)
+ .endm
+
+ .macro ret1
+ mov x0, #0
+ ret
+ .endm
+
+ .macro cpy1 dst, src, count
+ .arch_extension mops
+ USER_CPY(9998f, 0, cpyp [\dst]!, [\src]!, \count!)
+ USER_CPY(9996f, 0, cpym [\dst]!, [\src]!, \count!)
+ USER_CPY(9996f, 0, cpye [\dst]!, [\src]!, \count!)
+ .endm
+
+SYM_FUNC_START(__memcpy_mc)
+#include "memcpy_template.S"
+
+ // Exception fixups
+9996: b.cs 9998f
+ // Registers are in Option A format
+ add dst, dst, count
+9998: sub x0, dstend, dstin // bytes not copied
+ ret
+SYM_FUNC_END(__memcpy_mc)
+
+EXPORT_SYMBOL(__memcpy_mc)
+SYM_FUNC_ALIAS_WEAK(memcpy_mc, __memcpy_mc)
+EXPORT_SYMBOL(memcpy_mc)
diff --git a/arch/arm64/lib/memcpy_template.S b/arch/arm64/lib/memcpy_template.S
new file mode 100644
index 000000000000..205516c6e076
--- /dev/null
+++ b/arch/arm64/lib/memcpy_template.S
@@ -0,0 +1,249 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2012-2021, Arm Limited.
+ *
+ * Adapted from the original at:
+ * https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, unaligned accesses.
+ *
+ */
+
+#define L(label) .L ## label
+
+#define dstin x0
+#define src x1
+#define count x2
+#define dst x3
+#define srcend x4
+#define dstend x5
+#define A_l x6
+#define A_lw w6
+#define A_h x7
+#define B_l x8
+#define B_lw w8
+#define B_h x9
+#define C_l x10
+#define C_lw w10
+#define C_h x11
+#define D_l x12
+#define D_h x13
+#define E_l x14
+#define E_h x15
+#define F_l x16
+#define F_h x17
+#define G_l count
+#define G_h dst
+#define H_l src
+#define H_h srcend
+#define tmp1 x14
+
+/* This implementation handles overlaps and supports both memcpy and memmove
+ from a single entry point. It uses unaligned accesses and branchless
+ sequences to keep the code small, simple and improve performance.
+
+ Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+ copies of up to 128 bytes, and large copies. The overhead of the overlap
+ check is negligible since it is only required for large copies.
+
+ Large copies use a software pipelined loop processing 64 bytes per iteration.
+ The destination pointer is 16-byte aligned to minimize unaligned accesses.
+ The loop tail is handled by always copying 64 bytes from the end.
+*/
+
+#ifdef CONFIG_AS_HAS_MOPS
+alternative_if_not ARM64_HAS_MOPS
+ b L(no_mops)
+alternative_else_nop_endif
+
+ cpy1 dst, src, count
+ ret1
+#endif
+
+L(no_mops):
+ add srcend, src, count
+ add dstend, dstin, count
+ cmp count, 128
+ b.hi L(copy_long)
+ cmp count, 32
+ b.hi L(copy32_128)
+
+ /* Small copies: 0..32 bytes. */
+ cmp count, 16
+ b.lo L(copy16)
+ ldp1 A_l, A_h, [src]
+ ldp1 D_l, D_h, [srcend, -16]
+ stp A_l, A_h, [dstin]
+ stp D_l, D_h, [dstend, -16]
+ ret1
+
+ /* Copy 8-15 bytes. */
+L(copy16):
+ tbz count, 3, L(copy8)
+ ldr1 A_l, [src]
+ ldr1 A_h, [srcend, -8]
+ str A_l, [dstin]
+ str A_h, [dstend, -8]
+ ret1
+
+ .p2align 3
+ /* Copy 4-7 bytes. */
+L(copy8):
+ tbz count, 2, L(copy4)
+ ldr1 A_lw, [src]
+ ldr1 B_lw, [srcend, -4]
+ str A_lw, [dstin]
+ str B_lw, [dstend, -4]
+ ret1
+
+ /* Copy 0..3 bytes using a branchless sequence. */
+L(copy4):
+ cbz count, L(copy0)
+ lsr tmp1, count, 1
+ ldrb1 A_lw, [src]
+ ldrb1 C_lw, [srcend, -1]
+ ldrb1 B_lw, [src, tmp1]
+ strb A_lw, [dstin]
+ strb B_lw, [dstin, tmp1]
+ strb C_lw, [dstend, -1]
+L(copy0):
+ ret1
+
+ .p2align 4
+ /* Medium copies: 33..128 bytes. */
+L(copy32_128):
+ ldp1 A_l, A_h, [src]
+ ldp1 B_l, B_h, [src, 16]
+ ldp1 C_l, C_h, [srcend, -32]
+ ldp1 D_l, D_h, [srcend, -16]
+ cmp count, 64
+ b.hi L(copy128)
+ stp A_l, A_h, [dstin]
+ stp B_l, B_h, [dstin, 16]
+ stp C_l, C_h, [dstend, -32]
+ stp D_l, D_h, [dstend, -16]
+ ret1
+
+ .p2align 4
+ /* Copy 65..128 bytes. */
+L(copy128):
+ ldp1 E_l, E_h, [src, 32]
+ ldp1 F_l, F_h, [src, 48]
+ cmp count, 96
+ b.ls L(copy96)
+ ldp1 G_l, G_h, [srcend, -64]
+ ldp1 H_l, H_h, [srcend, -48]
+ stp G_l, G_h, [dstend, -64]
+ stp H_l, H_h, [dstend, -48]
+L(copy96):
+ stp A_l, A_h, [dstin]
+ stp B_l, B_h, [dstin, 16]
+ stp E_l, E_h, [dstin, 32]
+ stp F_l, F_h, [dstin, 48]
+ stp C_l, C_h, [dstend, -32]
+ stp D_l, D_h, [dstend, -16]
+ ret1
+
+ .p2align 4
+ /* Copy more than 128 bytes. */
+L(copy_long):
+ /* Use backwards copy if there is an overlap. */
+ sub tmp1, dstin, src
+ cbz tmp1, L(copy0)
+ cmp tmp1, count
+ b.lo L(copy_long_backwards)
+
+ /* Copy 16 bytes and then align dst to 16-byte alignment. */
+
+ ldp1 D_l, D_h, [src]
+ and tmp1, dstin, 15
+ bic dst, dstin, 15
+ sub src, src, tmp1
+ add count, count, tmp1 /* Count is now 16 too large. */
+ ldp1 A_l, A_h, [src, 16]
+ stp D_l, D_h, [dstin]
+ ldp1 B_l, B_h, [src, 32]
+ ldp1 C_l, C_h, [src, 48]
+ ldp1 D_l, D_h, [src, 64]!
+ subs count, count, 128 + 16 /* Test and readjust count. */
+ b.ls L(copy64_from_end)
+
+L(loop64):
+ stp A_l, A_h, [dst, 16]
+ ldp1 A_l, A_h, [src, 16]
+ stp B_l, B_h, [dst, 32]
+ ldp1 B_l, B_h, [src, 32]
+ stp C_l, C_h, [dst, 48]
+ ldp1 C_l, C_h, [src, 48]
+ stp D_l, D_h, [dst, 64]!
+ ldp1 D_l, D_h, [src, 64]!
+ subs count, count, 64
+ b.hi L(loop64)
+
+ /* Write the last iteration and copy 64 bytes from the end. */
+L(copy64_from_end):
+ ldp1 E_l, E_h, [srcend, -64]
+ stp A_l, A_h, [dst, 16]
+ ldp1 A_l, A_h, [srcend, -48]
+ stp B_l, B_h, [dst, 32]
+ ldp1 B_l, B_h, [srcend, -32]
+ stp C_l, C_h, [dst, 48]
+ ldp1 C_l, C_h, [srcend, -16]
+ stp D_l, D_h, [dst, 64]
+ stp E_l, E_h, [dstend, -64]
+ stp A_l, A_h, [dstend, -48]
+ stp B_l, B_h, [dstend, -32]
+ stp C_l, C_h, [dstend, -16]
+ ret1
+
+ .p2align 4
+
+ /* Large backwards copy for overlapping copies.
+ Copy 16 bytes and then align dst to 16-byte alignment. */
+L(copy_long_backwards):
+ ldp1 D_l, D_h, [srcend, -16]
+ and tmp1, dstend, 15
+ sub srcend, srcend, tmp1
+ sub count, count, tmp1
+ ldp1 A_l, A_h, [srcend, -16]
+ stp D_l, D_h, [dstend, -16]
+ ldp1 B_l, B_h, [srcend, -32]
+ ldp1 C_l, C_h, [srcend, -48]
+ ldp1 D_l, D_h, [srcend, -64]!
+ sub dstend, dstend, tmp1
+ subs count, count, 128
+ b.ls L(copy64_from_start)
+
+L(loop64_backwards):
+ stp A_l, A_h, [dstend, -16]
+ ldp1 A_l, A_h, [srcend, -16]
+ stp B_l, B_h, [dstend, -32]
+ ldp1 B_l, B_h, [srcend, -32]
+ stp C_l, C_h, [dstend, -48]
+ ldp1 C_l, C_h, [srcend, -48]
+ stp D_l, D_h, [dstend, -64]!
+ ldp1 D_l, D_h, [srcend, -64]!
+ subs count, count, 64
+ b.hi L(loop64_backwards)
+
+ /* Write the last iteration and copy 64 bytes from the start. */
+L(copy64_from_start):
+ ldp1 G_l, G_h, [src, 48]
+ stp A_l, A_h, [dstend, -16]
+ ldp1 A_l, A_h, [src, 32]
+ stp B_l, B_h, [dstend, -32]
+ ldp1 B_l, B_h, [src, 16]
+ stp C_l, C_h, [dstend, -48]
+ ldp1 C_l, C_h, [src]
+ stp D_l, D_h, [dstend, -64]
+ stp G_l, G_h, [dstin, 48]
+ stp A_l, A_h, [dstin, 32]
+ stp B_l, B_h, [dstin, 16]
+ stp C_l, C_h, [dstin]
+ ret1
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index d286e0a04543..3128f0d9cc46 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -79,6 +79,18 @@ void *memcpy(void *dest, const void *src, size_t len)
}
#endif
+#ifdef __HAVE_ARCH_MEMCPY_MC
+#undef memcpy_mc
+int memcpy_mc(void *dest, const void *src, size_t len)
+{
+ if (!kasan_check_range(src, len, false, _RET_IP_) ||
+ !kasan_check_range(dest, len, true, _RET_IP_))
+ return (int)len;
+
+ return __memcpy_mc(dest, src, len);
+}
+#endif
+
void *__asan_memset(void *addr, int c, ssize_t len)
{
if (!kasan_check_range(addr, len, true, _RET_IP_))
--
2.39.3
^ permalink raw reply related
* [PATCH v14 8/8] lib/tests: memcpy_kunit: add memcpy_mc() and memcpy_mc_large() test
From: Ruidong Tian @ 2026-05-18 8:49 UTC (permalink / raw)
To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
tianruidong
In-Reply-To: <20260518084956.2538442-1-tianruidong@linux.alibaba.com>
memcpy_mc() is the Machine-Check safe memcpy variant that returns the
number of bytes NOT copied on a hardware memory error, or 0 on success.
Add two test cases, memcpy_mc_test() and memcpy_mc_large_test(), modeled
after the existing memcpy_test() and memcpy_large_test() implementations.
Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
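Note: the return contract described above can be sketched in userspace
as follows. This is only an illustrative model, not the arm64
implementation: model_memcpy_mc(), MODEL_NO_POISON and the poison_off
parameter are hypothetical names introduced here to simulate a poisoned
source byte, and a real implementation may report more bytes as not
copied than this model does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical marker: no byte of the source is poisoned. */
#define MODEL_NO_POISON ((size_t)-1)

/*
 * Userspace model of the memcpy_mc() return contract only.
 * poison_off simulates the offset of the first poisoned source byte.
 * Everything before that offset is copied; the rest is reported as
 * "not copied".
 */
static int model_memcpy_mc(void *dst, const void *src, size_t len,
			   size_t poison_off)
{
	size_t ok = len;

	assert(dst && src);

	if (poison_off != MODEL_NO_POISON && poison_off < len)
		ok = poison_off;

	memcpy(dst, src, ok);

	/* 0 on success, otherwise the number of bytes NOT copied. */
	return (int)(len - ok);
}
```

A caller is expected to treat any non-zero return value as a failed
copy and must not rely on the destination contents past the copied
prefix.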
lib/tests/memcpy_kunit.c | 113 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 112 insertions(+), 1 deletion(-)
diff --git a/lib/tests/memcpy_kunit.c b/lib/tests/memcpy_kunit.c
index 85df53ccfb0c..b4b2dafb50f1 100644
--- a/lib/tests/memcpy_kunit.c
+++ b/lib/tests/memcpy_kunit.c
@@ -552,6 +552,115 @@ static void copy_mc_page_test(struct kunit *test)
memcmp(page_dst + PAGE_SIZE, page_zero, PAGE_SIZE), 0,
"copy_mc_page overflow into adjacent page");
}
+/*
+ * memcpy_mc() is a Machine-Check safe memcpy variant.
+ * Signature: int memcpy_mc(void *dst, const void *src, size_t len)
+ * Returns: 0 on success, or number of bytes NOT copied on MC error.
+ *
+ * In the normal (no-poison) path it must behave identically to memcpy()
+ * and always return 0.
+ */
+static void memcpy_mc_test(struct kunit *test)
+{
+#define TEST_OP "memcpy_mc"
+ struct some_bytes control = {
+ .data = { 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+ 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+ 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+ 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+ },
+ };
+ struct some_bytes zero = { };
+ struct some_bytes middle = {
+ .data = { 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+ 0x20, 0x20, 0x20, 0x20, 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x20, 0x20, 0x20, 0x20, 0x20,
+ 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+ },
+ };
+ struct some_bytes three = {
+ .data = { 0x00, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+ 0x20, 0x00, 0x00, 0x20, 0x20, 0x20, 0x20, 0x20,
+ 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+ 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+ },
+ };
+ struct some_bytes dest = { };
+ int ret, count;
+ u8 *ptr;
+
+ /* Verify static initializers. */
+ check(control, 0x20);
+ check(zero, 0);
+ compare("static initializers", dest, zero);
+
+ /* Verify assignment. */
+ dest = control;
+ compare("direct assignment", dest, control);
+
+ /* Verify complete overwrite. */
+ ret = memcpy_mc(dest.data, control.data, sizeof(dest.data));
+ KUNIT_ASSERT_EQ(test, ret, 0);
+ compare("complete overwrite", dest, control);
+
+ /* Verify middle overwrite: 7 bytes at offset 12. */
+ dest = control;
+ ret = memcpy_mc(dest.data + 12, zero.data, 7);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+ compare("middle overwrite", dest, middle);
+
+ /* Verify zero-length copy is a no-op. */
+ dest = control;
+ ret = memcpy_mc(dest.data, zero.data, 0);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+ compare("zero length", dest, control);
+
+ /* Verify argument side-effects aren't repeated. */
+ dest = control;
+ ptr = dest.data;
+ count = 1;
+ ret = memcpy_mc(ptr++, zero.data, count++);
+ ptr += 8;
+ ret = memcpy_mc(ptr++, zero.data, count++);
+ compare("argument side-effects", dest, three);
+#undef TEST_OP
+}
+
+static void memcpy_mc_large_test(struct kunit *test)
+{
+ init_large(test);
+
+ /* Sweep 1..1024 bytes x shifting offset to cover all template paths. */
+ for (int bytes = 1; bytes <= ARRAY_SIZE(large_src); bytes++) {
+ for (int offset = 0; offset < ARRAY_SIZE(large_src); offset++) {
+ int right_zero_pos = offset + bytes;
+ int right_zero_size = ARRAY_SIZE(large_dst) - right_zero_pos;
+ int ret;
+
+ ret = memcpy_mc(large_dst + offset, large_src, bytes);
+ KUNIT_ASSERT_EQ_MSG(test, ret, 0,
+ "memcpy_mc returned %d with size %d at offset %d",
+ ret, bytes, offset);
+
+ /* No write before copy area. */
+ KUNIT_ASSERT_EQ_MSG(test,
+ memcmp(large_dst, large_zero, offset), 0,
+ "with size %d at offset %d", bytes, offset);
+ /* No write after copy area. */
+ KUNIT_ASSERT_EQ_MSG(test,
+ memcmp(&large_dst[right_zero_pos], large_zero,
+ right_zero_size), 0,
+ "with size %d at offset %d", bytes, offset);
+ /* Byte-for-byte exact. */
+ KUNIT_ASSERT_EQ_MSG(test,
+ memcmp(large_dst + offset, large_src, bytes), 0,
+ "with size %d at offset %d", bytes, offset);
+
+ memset(large_dst + offset, 0, bytes);
+ }
+ cond_resched();
+ }
+}
#endif /* CONFIG_ARCH_HAS_COPY_MC */
static struct kunit_case memcpy_test_cases[] = {
@@ -564,6 +673,8 @@ static struct kunit_case memcpy_test_cases[] = {
KUNIT_CASE(copy_page_test),
#ifdef CONFIG_ARCH_HAS_COPY_MC
KUNIT_CASE(copy_mc_page_test),
+ KUNIT_CASE(memcpy_mc_test),
+ KUNIT_CASE_SLOW(memcpy_mc_large_test),
#endif
{}
};
@@ -575,5 +686,5 @@ static struct kunit_suite memcpy_test_suite = {
kunit_test_suite(memcpy_test_suite);
-MODULE_DESCRIPTION("test cases for memcpy(), memmove(), memset() and copy_page()");
+MODULE_DESCRIPTION("test cases for memcpy(), memmove(), memset(), copy_page() and memcpy_mc()");
MODULE_LICENSE("GPL");
--
2.39.3
^ permalink raw reply related
* [PATCH v14 5/8] arm64: support copy_mc_[user]_highpage()
From: Ruidong Tian @ 2026-05-18 8:49 UTC (permalink / raw)
To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
tianruidong
In-Reply-To: <20260518084956.2538442-1-tianruidong@linux.alibaba.com>
From: Tong Tiangen <tongtiangen@huawei.com>
Many scenarios in the kernel can already tolerate memory errors when
copying a page[1~5], all of which are implemented via
copy_mc_[user]_highpage(). arm64 should support this mechanism as well.
Because of MTE, arm64 needs its own architecture implementation of
copy_mc_[user]_highpage(). The macros __HAVE_ARCH_COPY_MC_HIGHPAGE and
__HAVE_ARCH_COPY_MC_USER_HIGHPAGE are added to select it.
Add a new helper, copy_mc_page(), which provides a page copy that is safe
against hardware memory errors. Its logic is the same as copy_page(); the
main difference is that the ldp instructions in copy_mc_page() carry the
fixup type EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR. The common logic is therefore
extracted to copy_page_template.S. In addition, the fixup of MOPS
instructions is not considered at present.
[Ruidong: add FEAT_MOPS support]
[1] commit d302c2398ba2 ("mm, hwpoison: when copy-on-write hits poison, take page offline")
[2] commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults")
[3] commit 6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")
[4] commit 98c76c9f1ef7 ("mm/khugepaged: recover from poisoned anonymous memory")
[5] commit 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory")
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
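Note: the "one template, two instantiations" approach described above
can be sketched in plain C as follows. This is a userspace model under
stated assumptions, not the assembly in this patch: MODEL_PAGE_WORDS,
model_poison, LOAD_PLAIN, LOAD_MC and the model_copy_*() names are
hypothetical, and a pointer comparison stands in for the hardware
reporting a memory error on a load.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_WORDS 512	/* hypothetical: 4 KiB page of 64-bit words */

/* Hypothetical stand-in for a poisoned source location; NULL means none. */
static const uint64_t *model_poison;

/* Plain load: always succeeds, like the ldp1 macro in copy_page.S. */
#define LOAD_PLAIN(v, p)	((v) = *(p), 1)

/* Error-aware load: fails instead of reading the poisoned location. */
#define LOAD_MC(v, p)		((p) == model_poison ? 0 : ((v) = *(p), 1))

/*
 * One copy loop, instantiated twice with a different LOAD step, in the
 * same spirit as including copy_page_template.S with a different ldp1.
 */
#define DEFINE_MODEL_COPY(name, LOAD)				\
static int name(uint64_t *to, const uint64_t *from)		\
{								\
	for (size_t i = 0; i < MODEL_PAGE_WORDS; i++) {		\
		uint64_t v = 0;					\
								\
		if (!LOAD(v, &from[i]))				\
			return -EFAULT;				\
		to[i] = v;					\
	}							\
	return 0;						\
}

DEFINE_MODEL_COPY(model_copy_page, LOAD_PLAIN)
DEFINE_MODEL_COPY(model_copy_mc_page, LOAD_MC)
```

Only the load step differs between the two variants, so the store
sequence and loop structure cannot drift apart between the plain and
the error-safe copy.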
arch/arm64/include/asm/mte.h | 9 ++++
arch/arm64/include/asm/page.h | 10 ++++
arch/arm64/lib/Makefile | 2 +
arch/arm64/lib/copy_mc_page.S | 44 +++++++++++++++++
arch/arm64/lib/copy_page.S | 62 ++----------------------
arch/arm64/lib/copy_page_template.S | 71 +++++++++++++++++++++++++++
arch/arm64/lib/mte.S | 29 +++++++++++
arch/arm64/mm/copypage.c | 75 +++++++++++++++++++++++++++++
include/linux/highmem.h | 8 +++
9 files changed, 253 insertions(+), 57 deletions(-)
create mode 100644 arch/arm64/lib/copy_mc_page.S
create mode 100644 arch/arm64/lib/copy_page_template.S
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 7f7b97e09996..a0b1757f4847 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -98,6 +98,11 @@ static inline bool try_page_mte_tagging(struct page *page)
void mte_zero_clear_page_tags(void *addr);
void mte_sync_tags(pte_t pte, unsigned int nr_pages);
void mte_copy_page_tags(void *kto, const void *kfrom);
+
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+int mte_copy_mc_page_tags(void *kto, const void *kfrom);
+#endif
+
void mte_thread_init_user(void);
void mte_thread_switch(struct task_struct *next);
void mte_cpu_setup(void);
@@ -134,6 +139,10 @@ static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
static inline void mte_copy_page_tags(void *kto, const void *kfrom)
{
}
+static inline int mte_copy_mc_page_tags(void *kto, const void *kfrom)
+{
+ return 0;
+}
static inline void mte_thread_init_user(void)
{
}
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index e25d0d18f6d7..f65818ee614a 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -29,6 +29,16 @@ void copy_user_highpage(struct page *to, struct page *from,
void copy_highpage(struct page *to, struct page *from);
#define __HAVE_ARCH_COPY_HIGHPAGE
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+int copy_mc_page(void *to, const void *from);
+int copy_mc_highpage(struct page *to, struct page *from);
+#define __HAVE_ARCH_COPY_MC_HIGHPAGE
+
+int copy_mc_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma);
+#define __HAVE_ARCH_COPY_MC_USER_HIGHPAGE
+#endif
+
struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
unsigned long vaddr);
#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 448c917494f3..1f4c3f743a20 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -7,6 +7,8 @@ lib-y := clear_user.o delay.o copy_from_user.o \
lib-$(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) += uaccess_flushcache.o
+lib-$(CONFIG_ARCH_HAS_COPY_MC) += copy_mc_page.o
+
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
obj-$(CONFIG_ARM64_MTE) += mte.o
diff --git a/arch/arm64/lib/copy_mc_page.S b/arch/arm64/lib/copy_mc_page.S
new file mode 100644
index 000000000000..ad1371e9e687
--- /dev/null
+++ b/arch/arm64/lib/copy_mc_page.S
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <linux/linkage.h>
+#include <linux/const.h>
+#include <asm/assembler.h>
+#include <asm/page.h>
+#include <asm/cpufeature.h>
+#include <asm/alternative.h>
+#include <asm/asm-extable.h>
+#include <asm/asm-uaccess.h>
+
+/*
+ * Copy a page from src to dest (both page aligned), safe against memory errors
+ *
+ * Parameters:
+ * x0 - dest
+ * x1 - src
+ * Returns:
+ * x0 - Return 0 if copy success, or -EFAULT if anything goes wrong
+ * while copying.
+ */
+ .macro ldp1 reg1, reg2, ptr, val
+ KERNEL_MEM_ERR(9998f, ldp \reg1, \reg2, [\ptr, \val])
+ .endm
+
+ .macro cpy1 dst, src, count
+ .arch_extension mops
+ USER_CPY(9998f, 0, cpyfprt [\dst]!, [\src]!, \count!)
+ USER_CPY(9998f, 0, cpyfmrt [\dst]!, [\src]!, \count!)
+ USER_CPY(9998f, 0, cpyfert [\dst]!, [\src]!, \count!)
+ .endm
+
+SYM_FUNC_START(__pi_copy_mc_page)
+#include "copy_page_template.S"
+
+ mov x0, #0
+ ret
+
+9998: mov x0, #-EFAULT
+ ret
+
+SYM_FUNC_END(__pi_copy_mc_page)
+SYM_FUNC_ALIAS(copy_mc_page, __pi_copy_mc_page)
+EXPORT_SYMBOL(copy_mc_page)
diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index e6374e7e5511..d0186bbf99f1 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -17,65 +17,13 @@
* x0 - dest
* x1 - src
*/
-SYM_FUNC_START(__pi_copy_page)
-#ifdef CONFIG_AS_HAS_MOPS
- .arch_extension mops
-alternative_if_not ARM64_HAS_MOPS
- b .Lno_mops
-alternative_else_nop_endif
-
- mov x2, #PAGE_SIZE
- cpypwn [x0]!, [x1]!, x2!
- cpymwn [x0]!, [x1]!, x2!
- cpyewn [x0]!, [x1]!, x2!
- ret
-.Lno_mops:
-#endif
- ldp x2, x3, [x1]
- ldp x4, x5, [x1, #16]
- ldp x6, x7, [x1, #32]
- ldp x8, x9, [x1, #48]
- ldp x10, x11, [x1, #64]
- ldp x12, x13, [x1, #80]
- ldp x14, x15, [x1, #96]
- ldp x16, x17, [x1, #112]
-
- add x0, x0, #256
- add x1, x1, #128
-1:
- tst x0, #(PAGE_SIZE - 1)
- stnp x2, x3, [x0, #-256]
- ldp x2, x3, [x1]
- stnp x4, x5, [x0, #16 - 256]
- ldp x4, x5, [x1, #16]
- stnp x6, x7, [x0, #32 - 256]
- ldp x6, x7, [x1, #32]
- stnp x8, x9, [x0, #48 - 256]
- ldp x8, x9, [x1, #48]
- stnp x10, x11, [x0, #64 - 256]
- ldp x10, x11, [x1, #64]
- stnp x12, x13, [x0, #80 - 256]
- ldp x12, x13, [x1, #80]
- stnp x14, x15, [x0, #96 - 256]
- ldp x14, x15, [x1, #96]
- stnp x16, x17, [x0, #112 - 256]
- ldp x16, x17, [x1, #112]
-
- add x0, x0, #128
- add x1, x1, #128
-
- b.ne 1b
-
- stnp x2, x3, [x0, #-256]
- stnp x4, x5, [x0, #16 - 256]
- stnp x6, x7, [x0, #32 - 256]
- stnp x8, x9, [x0, #48 - 256]
- stnp x10, x11, [x0, #64 - 256]
- stnp x12, x13, [x0, #80 - 256]
- stnp x14, x15, [x0, #96 - 256]
- stnp x16, x17, [x0, #112 - 256]
+ .macro ldp1 reg1, reg2, ptr, val
+ ldp \reg1, \reg2, [\ptr, \val]
+ .endm
+SYM_FUNC_START(__pi_copy_page)
+#include "copy_page_template.S"
ret
SYM_FUNC_END(__pi_copy_page)
SYM_FUNC_ALIAS(copy_page, __pi_copy_page)
diff --git a/arch/arm64/lib/copy_page_template.S b/arch/arm64/lib/copy_page_template.S
new file mode 100644
index 000000000000..d466b51c8ed9
--- /dev/null
+++ b/arch/arm64/lib/copy_page_template.S
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2012 ARM Ltd.
+ */
+
+/*
+ * Copy a page from src to dest (both are page aligned)
+ *
+ * Parameters:
+ * x0 - dest
+ * x1 - src
+ */
+dstin .req x0
+src .req x1
+
+#ifdef CONFIG_AS_HAS_MOPS
+ .arch_extension mops
+alternative_if_not ARM64_HAS_MOPS
+ b .Lno_mops
+alternative_else_nop_endif
+ mov x2, #PAGE_SIZE
+ cpy1 dstin, src, x2
+ b .Lexitfunc
+.Lno_mops:
+#endif
+
+ ldp1 x2, x3, x1, #0
+ ldp1 x4, x5, x1, #16
+ ldp1 x6, x7, x1, #32
+ ldp1 x8, x9, x1, #48
+ ldp1 x10, x11, x1, #64
+ ldp1 x12, x13, x1, #80
+ ldp1 x14, x15, x1, #96
+ ldp1 x16, x17, x1, #112
+
+ add x0, x0, #256
+ add x1, x1, #128
+1:
+ tst x0, #(PAGE_SIZE - 1)
+
+ stnp x2, x3, [x0, #-256]
+ ldp1 x2, x3, x1, #0
+ stnp x4, x5, [x0, #16 - 256]
+ ldp1 x4, x5, x1, #16
+ stnp x6, x7, [x0, #32 - 256]
+ ldp1 x6, x7, x1, #32
+ stnp x8, x9, [x0, #48 - 256]
+ ldp1 x8, x9, x1, #48
+ stnp x10, x11, [x0, #64 - 256]
+ ldp1 x10, x11, x1, #64
+ stnp x12, x13, [x0, #80 - 256]
+ ldp1 x12, x13, x1, #80
+ stnp x14, x15, [x0, #96 - 256]
+ ldp1 x14, x15, x1, #96
+ stnp x16, x17, [x0, #112 - 256]
+ ldp1 x16, x17, x1, #112
+
+ add x0, x0, #128
+ add x1, x1, #128
+
+ b.ne 1b
+
+ stnp x2, x3, [x0, #-256]
+ stnp x4, x5, [x0, #16 - 256]
+ stnp x6, x7, [x0, #32 - 256]
+ stnp x8, x9, [x0, #48 - 256]
+ stnp x10, x11, [x0, #64 - 256]
+ stnp x12, x13, [x0, #80 - 256]
+ stnp x14, x15, [x0, #96 - 256]
+ stnp x16, x17, [x0, #112 - 256]
+.Lexitfunc:
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 5018ac03b6bf..9d4eeb76a838 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -80,6 +80,35 @@ SYM_FUNC_START(mte_copy_page_tags)
ret
SYM_FUNC_END(mte_copy_page_tags)
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+/*
+ * Copy the tags from the source page to the destination one, safe against memory errors
+ * x0 - address of the destination page
+ * x1 - address of the source page
+ * Returns:
+ * x0 - Return 0 if copy success, or
+ * -EFAULT if anything goes wrong while copying.
+ */
+SYM_FUNC_START(mte_copy_mc_page_tags)
+ mov x2, x0
+ mov x3, x1
+ multitag_transfer_size x5, x6
+1:
+KERNEL_MEM_ERR(2f, ldgm x4, [x3])
+ stgm x4, [x2]
+ add x2, x2, x5
+ add x3, x3, x5
+ tst x2, #(PAGE_SIZE - 1)
+ b.ne 1b
+
+ mov x0, #0
+ ret
+
+2: mov x0, #-EFAULT
+ ret
+SYM_FUNC_END(mte_copy_mc_page_tags)
+#endif
+
/*
* Read tags from a user buffer (one tag per byte) and set the corresponding
* tags at the given kernel address. Used by PTRACE_POKEMTETAGS.
diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index cd5912ba617b..9fd773baf17b 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -72,3 +72,78 @@ void copy_user_highpage(struct page *to, struct page *from,
flush_dcache_page(to);
}
EXPORT_SYMBOL_GPL(copy_user_highpage);
+
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+/*
+ * Return -EFAULT if anything goes wrong while copying page or mte.
+ */
+int copy_mc_highpage(struct page *to, struct page *from)
+{
+ void *kto = page_address(to);
+ void *kfrom = page_address(from);
+ struct folio *src = page_folio(from);
+ struct folio *dst = page_folio(to);
+ unsigned int i, nr_pages;
+ int ret;
+
+ ret = copy_mc_page(kto, kfrom);
+ if (ret)
+ return -EFAULT;
+
+ if (kasan_hw_tags_enabled())
+ page_kasan_tag_reset(to);
+
+ if (!system_supports_mte())
+ return 0;
+
+ if (folio_test_hugetlb(src)) {
+ if (!folio_test_hugetlb_mte_tagged(src) ||
+ from != folio_page(src, 0))
+ return 0;
+
+ WARN_ON_ONCE(!folio_try_hugetlb_mte_tagging(dst));
+
+ /*
+ * Populate tags for all subpages.
+ *
+ * Don't assume the first page is head page since
+ * huge page copy may start from any subpage.
+ */
+ nr_pages = folio_nr_pages(src);
+ for (i = 0; i < nr_pages; i++) {
+ kfrom = page_address(folio_page(src, i));
+ kto = page_address(folio_page(dst, i));
+ ret = mte_copy_mc_page_tags(kto, kfrom);
+ if (ret)
+ return -EFAULT;
+ }
+ folio_set_hugetlb_mte_tagged(dst);
+ } else if (page_mte_tagged(from)) {
+ /* It's a new page, shouldn't have been tagged yet */
+ WARN_ON_ONCE(!try_page_mte_tagging(to));
+
+ ret = mte_copy_mc_page_tags(kto, kfrom);
+ if (ret)
+ return -EFAULT;
+ set_page_mte_tagged(to);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(copy_mc_highpage);
+
+int copy_mc_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma)
+{
+ int ret;
+
+ ret = copy_mc_highpage(to, from);
+ if (ret)
+ return ret;
+
+ flush_dcache_page(to);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(copy_mc_user_highpage);
+#endif
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 18dc4aca4aa1..f168c9d4ad0e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -424,6 +424,7 @@ static inline void copy_highpage(struct page *to, struct page *from)
#endif
#ifdef copy_mc_to_kernel
+#ifndef __HAVE_ARCH_COPY_MC_USER_HIGHPAGE
/*
* If architecture supports machine check exception handling, define the
* #MC versions of copy_user_highpage and copy_highpage. They copy a memory
@@ -449,7 +450,9 @@ static inline int copy_mc_user_highpage(struct page *to, struct page *from,
return ret ? -EFAULT : 0;
}
+#endif
+#ifndef __HAVE_ARCH_COPY_MC_HIGHPAGE
static inline int copy_mc_highpage(struct page *to, struct page *from)
{
unsigned long ret;
@@ -468,20 +471,25 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
return ret ? -EFAULT : 0;
}
+#endif
#else
+#ifndef __HAVE_ARCH_COPY_MC_USER_HIGHPAGE
static inline int copy_mc_user_highpage(struct page *to, struct page *from,
unsigned long vaddr, struct vm_area_struct *vma)
{
copy_user_highpage(to, from, vaddr, vma);
return 0;
}
+#endif
+#ifndef __HAVE_ARCH_COPY_MC_HIGHPAGE
static inline int copy_mc_highpage(struct page *to, struct page *from)
{
copy_highpage(to, from);
return 0;
}
#endif
+#endif
static inline void memcpy_page(struct page *dst_page, size_t dst_off,
struct page *src_page, size_t src_off,
--
2.39.3
^ permalink raw reply related
* [PATCH v14 3/8] arm64: add support for ARCH_HAS_COPY_MC
From: Ruidong Tian @ 2026-05-18 8:49 UTC (permalink / raw)
To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
tianruidong
In-Reply-To: <20260518084956.2538442-1-tianruidong@linux.alibaba.com>
From: Tong Tiangen <tongtiangen@huawei.com>
On arm64, when the kernel handles a hardware memory error reported through
a synchronous notification (do_sea()) and the error is consumed within the
kernel, it currently panics. This is not optimal.
Take copy_from/to_user() as an example: if ld* triggers a memory error,
even in kernel mode, only the associated process is affected. Killing the
user process and isolating the corrupt page is a better choice.
Add a new fixup type, EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR, to identify
instructions that can recover from memory errors triggered by accesses to
kernel memory. This fixup type is used in __arch_copy_to_user(), so the
regular copy_to_user() now handles kernel memory errors.
[Ruidong: handle EX_TYPE_UACCESS_CPY in fixup_exception_me()]
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
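For readers following the control flow, here is a minimal userspace
sketch of the dispatch described above. It is not the kernel code: the
model_* names, the flat table and the linear search are hypothetical
stand-ins for search_exception_tables() and the real ex_handler_*()
helpers, which also update the err/zero registers. The value 2 for
EX_TYPE_UACCESS_ERR_ZERO is assumed from mainline; 3, 4 and 6 are taken
from the asm-extable.h hunk below. CONFIG_ARCH_HAS_COPY_MC is assumed
to be enabled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_TYPE_UACCESS_ERR_ZERO		2 /* assumed from mainline */
#define EX_TYPE_KACCESS_ERR_ZERO		3
#define EX_TYPE_UACCESS_CPY			4
#define EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR	6 /* added by this patch */

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_extable_entry {
	uintptr_t insn;		/* address of the faulting instruction */
	uintptr_t fixup;	/* address to resume at on recovery */
	int type;
};

struct model_regs {
	uintptr_t pc;
};

static const struct model_extable_entry model_table[] = {
	{ 0x1000, 0x2000, EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR },
	{ 0x1004, 0x2000, EX_TYPE_UACCESS_ERR_ZERO },
	{ 0x1008, 0x3000, EX_TYPE_KACCESS_ERR_ZERO },
};

static const struct model_extable_entry *
model_search(const struct model_extable_entry *tbl, size_t n, uintptr_t pc)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].insn == pc)
			return &tbl[i];
	return NULL;
}

/* Only entries whose type is marked memory-error safe are fixed up. */
static bool model_fixup_exception_me(const struct model_extable_entry *tbl,
				     size_t n, struct model_regs *regs)
{
	const struct model_extable_entry *ex;

	assert(regs != NULL);
	ex = model_search(tbl, n, regs->pc);
	if (!ex)
		return false;

	switch (ex->type) {
	case EX_TYPE_UACCESS_CPY:
	case EX_TYPE_UACCESS_ERR_ZERO:
	case EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR:
		regs->pc = ex->fixup;	/* resume at the fixup label */
		return true;
	}
	return false;
}

/*
 * Mirrors the decision in do_apei_claim_sea(): apei_ret models the
 * return value of apei_claim_sea(). A non-zero result means do_sea()
 * falls through to its existing fatal path.
 */
static int model_do_apei_claim_sea(int apei_ret, bool user_mode,
				   const struct model_extable_entry *tbl,
				   size_t n, struct model_regs *regs)
{
	if (apei_ret)
		return apei_ret;
	if (!user_mode && !model_fixup_exception_me(tbl, n, regs))
		return -ENOENT;
	return 0;
}
```

In this model, a kernel-mode fault at 0x1000 or 0x1004 resumes at the
fixup address, while a fault at 0x1008 (plain EX_TYPE_KACCESS_ERR_ZERO)
or at an address with no table entry is reported as unrecoverable.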
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/asm-extable.h | 22 +++++++++++++++++++-
arch/arm64/include/asm/asm-uaccess.h | 4 ++++
arch/arm64/include/asm/extable.h | 1 +
arch/arm64/lib/copy_to_user.S | 10 +++++-----
arch/arm64/mm/extable.c | 21 +++++++++++++++++++
arch/arm64/mm/fault.c | 30 ++++++++++++++++++++--------
7 files changed, 75 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..831b20d45893 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -21,6 +21,7 @@ config ARM64
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_CC_PLATFORM
select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
+ select ARCH_HAS_COPY_MC if ACPI_APEI_GHES
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEBUG_VM_PGTABLE
diff --git a/arch/arm64/include/asm/asm-extable.h b/arch/arm64/include/asm/asm-extable.h
index d67e2fdd1aee..4980023f2fbd 100644
--- a/arch/arm64/include/asm/asm-extable.h
+++ b/arch/arm64/include/asm/asm-extable.h
@@ -11,6 +11,8 @@
#define EX_TYPE_KACCESS_ERR_ZERO 3
#define EX_TYPE_UACCESS_CPY 4
#define EX_TYPE_LOAD_UNALIGNED_ZEROPAD 5
+/* kernel access memory error safe */
+#define EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR 6
/* Data fields for EX_TYPE_UACCESS_ERR_ZERO */
#define EX_DATA_REG_ERR_SHIFT 0
@@ -42,7 +44,7 @@
(.L__gpr_num_##gpr << EX_DATA_REG_##reg##_SHIFT)
#define _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, err, zero) \
- __ASM_EXTABLE_RAW(insn, fixup, \
+ __ASM_EXTABLE_RAW(insn, fixup, \
EX_TYPE_UACCESS_ERR_ZERO, \
( \
EX_DATA_REG(ERR, err) | \
@@ -55,6 +57,17 @@
#define _ASM_EXTABLE_UACCESS(insn, fixup) \
_ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, wzr, wzr)
+#define _ASM_EXTABLE_KACCESS_ERR_ZERO_MEM_ERR(insn, fixup, err, zero) \
+ __ASM_EXTABLE_RAW(insn, fixup, \
+ EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR, \
+ ( \
+ EX_DATA_REG(ERR, err) | \
+ EX_DATA_REG(ZERO, zero) \
+ ))
+
+#define _ASM_EXTABLE_KACCESS_MEM_ERR(insn, fixup) \
+ _ASM_EXTABLE_KACCESS_ERR_ZERO_MEM_ERR(insn, fixup, wzr, wzr)
+
/*
* Create an exception table entry for uaccess `insn`, which will branch to `fixup`
* when an unhandled fault is taken.
@@ -76,6 +89,13 @@
.macro _asm_extable_uaccess_cpy, insn, fixup, uaccess_is_write
__ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_UACCESS_CPY, \uaccess_is_write)
.endm
+/*
+ * Create an exception table entry for kaccess `insn`, which will branch to
+ * `fixup` when an unhandled fault is taken.
+ */
+ .macro _asm_extable_kaccess_mem_err, insn, fixup
+ _ASM_EXTABLE_KACCESS_MEM_ERR(\insn, \fixup)
+ .endm
#else /* __ASSEMBLER__ */
diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
index 12aa6a283249..c8f0af5fde63 100644
--- a/arch/arm64/include/asm/asm-uaccess.h
+++ b/arch/arm64/include/asm/asm-uaccess.h
@@ -57,6 +57,10 @@ alternative_else_nop_endif
.endm
#endif
+#define KERNEL_MEM_ERR(l, x...) \
+9999: x; \
+ _asm_extable_kaccess_mem_err 9999b, l
+
#define USER(l, x...) \
9999: x; \
_asm_extable_uaccess 9999b, l
diff --git a/arch/arm64/include/asm/extable.h b/arch/arm64/include/asm/extable.h
index 9dc39612bdf5..47c851d7df4f 100644
--- a/arch/arm64/include/asm/extable.h
+++ b/arch/arm64/include/asm/extable.h
@@ -48,4 +48,5 @@ bool ex_handler_bpf(const struct exception_table_entry *ex,
#endif /* !CONFIG_BPF_JIT */
bool fixup_exception(struct pt_regs *regs, unsigned long esr);
+bool fixup_exception_me(struct pt_regs *regs);
#endif
diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S
index 819f2e3fc7a9..991d94ecc1a8 100644
--- a/arch/arm64/lib/copy_to_user.S
+++ b/arch/arm64/lib/copy_to_user.S
@@ -20,7 +20,7 @@
* x0 - bytes not copied
*/
.macro ldrb1 reg, ptr, val
- ldrb \reg, [\ptr], \val
+ KERNEL_MEM_ERR(9998f, ldrb \reg, [\ptr], \val)
.endm
.macro strb1 reg, ptr, val
@@ -28,7 +28,7 @@
.endm
.macro ldrh1 reg, ptr, val
- ldrh \reg, [\ptr], \val
+ KERNEL_MEM_ERR(9998f, ldrh \reg, [\ptr], \val)
.endm
.macro strh1 reg, ptr, val
@@ -36,7 +36,7 @@
.endm
.macro ldr1 reg, ptr, val
- ldr \reg, [\ptr], \val
+ KERNEL_MEM_ERR(9998f, ldr \reg, [\ptr], \val)
.endm
.macro str1 reg, ptr, val
@@ -44,7 +44,7 @@
.endm
.macro ldp1 reg1, reg2, ptr, val
- ldp \reg1, \reg2, [\ptr], \val
+ KERNEL_MEM_ERR(9998f, ldp \reg1, \reg2, [\ptr], \val)
.endm
.macro stp1 reg1, reg2, ptr, val
@@ -74,7 +74,7 @@ SYM_FUNC_START(__arch_copy_to_user)
9997: cmp dst, dstin
b.ne 9998f
// Before being absolutely sure we couldn't copy anything, try harder
- ldrb tmp1w, [srcin]
+KERNEL_MEM_ERR(9998f, ldrb tmp1w, [srcin])
USER(9998f, sttrb tmp1w, [dst])
add dst, dst, #1
9998: sub x0, end, dst // bytes not copied
diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index 6e0528831cd3..f78ac7e92845 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -110,7 +110,28 @@ bool fixup_exception(struct pt_regs *regs, unsigned long esr)
return ex_handler_uaccess_cpy(ex, regs, esr);
case EX_TYPE_LOAD_UNALIGNED_ZEROPAD:
return ex_handler_load_unaligned_zeropad(ex, regs);
+ case EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR:
+ return false;
}
BUG();
}
+
+bool fixup_exception_me(struct pt_regs *regs)
+{
+ const struct exception_table_entry *ex;
+
+ ex = search_exception_tables(instruction_pointer(regs));
+ if (!ex)
+ return false;
+
+ switch (ex->type) {
+ case EX_TYPE_UACCESS_CPY:
+ return ex_handler_uaccess_cpy(ex, regs, 0);
+ case EX_TYPE_UACCESS_ERR_ZERO:
+ case EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR:
+ return ex_handler_uaccess_err_zero(ex, regs);
+ }
+
+ return false;
+}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 0f3c5c7ca054..efbda54770be 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -858,21 +858,35 @@ static int do_bad(unsigned long far, unsigned long esr, struct pt_regs *regs)
return 1; /* "fault" */
}
+/*
+ * APEI claimed this as a firmware-first notification.
+ * Some processing deferred to task_work before ret_to_user().
+ */
+static int do_apei_claim_sea(struct pt_regs *regs)
+{
+ int ret;
+
+ ret = apei_claim_sea(regs);
+ if (ret)
+ return ret;
+
+ if (!user_mode(regs) && IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC)) {
+ if (!fixup_exception_me(regs))
+ return -ENOENT;
+ }
+
+ return ret;
+}
+
static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
{
const struct fault_info *inf;
unsigned long siaddr;
- inf = esr_to_fault_info(esr);
-
- if (user_mode(regs) && apei_claim_sea(regs) == 0) {
- /*
- * APEI claimed this as a firmware-first notification.
- * Some processing deferred to task_work before ret_to_user().
- */
+ if (do_apei_claim_sea(regs) == 0)
return 0;
- }
+ inf = esr_to_fault_info(esr);
if (esr & ESR_ELx_FnV) {
siaddr = 0;
} else {
--
2.39.3
^ permalink raw reply related