* Re: [PATCH v14 0/8] arm64: add ARCH_HAS_COPY_MC support
From: Kefeng Wang @ 2026-05-18 15:05 UTC
To: Ruidong Tian, catalin.marinas, will, rafael, tony.luck, guohanjun,
mchehab, xueshuai, tongtiangen, james.morse, robin.murphy,
andreyknvl, dvyukov, vincenzo.frascino, mpe, npiggin,
ryabinin.a.a, glider, christophe.leroy, aneesh.kumar,
naveen.n.rao, tglx, mingo
Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev
In-Reply-To: <20260518084956.2538442-1-tianruidong@linux.alibaba.com>
On 5/18/2026 4:49 PM, Ruidong Tian wrote:
> This series continues Tong Tiangen's work on arm64 ARCH_HAS_COPY_MC
> support. We encounter the same problem, and from a forward-looking
> perspective, large-memory ARM machines such as Grace and Vera will suffer
> more from this class of issues, which motivates us to push this feature
> upstream.
>
> Problem
> =========
> With the increase of memory capacity and density, the probability of memory
> error also increases. The increasing size and density of server RAM in data
> centers and clouds have shown increased uncorrectable memory errors.
>
> Currently, more and more scenarios can tolerate memory errors, such as
> COW[1,2], KSM copy[3], coredump copy[4], khugepaged[5,6], uaccess copy[7],
> etc.
We have encountered more scenarios and have made more enhancements, e.g.:
658be46520ce mm: support poison recovery from copy_present_page()
aa549f923f5e mm: support poison recovery from do_cow_fault()
f00b295b9b61 fs: hugetlbfs: support poisoned recover from hugetlbfs_migrate_folio()
060913999d7a mm: migrate: support poisoned recover from migrate folio
I hope the architecture-specific patches can get reviews and responses.
Thanks.
> Solution
> =========
>
> This patchset introduces a new processing framework on ARM64, which enables
> ARM64 to support error recovery in the above scenarios, and more scenarios
> can be expanded based on this in the future.
>
> On arm64, memory errors are handled in do_sea(), which distinguishes two cases:
> 1. If the user state consumed the memory errors, the solution is to kill
> the user process and isolate the error page.
> 2. If the kernel state consumed the memory errors, the solution is to
> panic.
>
> For case 2, an undifferentiated panic may not be the optimal choice, as the
> error can be handled better. In some scenarios we can avoid the panic. For
> uaccess, if the access fails due to a memory error, only the user process is
> affected; returning an error to the caller and isolating the user page with
> the hardware memory error is a better choice.
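
As a rough illustration of the two do_sea() cases above and of the proposed
recovery path, the decision can be modelled in plain C. This is a user-space
sketch, not the kernel code: sea_ctx, sea_action and sea_decide() are
hypothetical names, and the real handler works on pt_regs and the exception
table rather than on two booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome labels; these are not kernel identifiers. */
enum sea_action {
	SEA_KILL_TASK_ISOLATE_PAGE,	/* case 1: error consumed in user mode */
	SEA_FIXUP_RETURN_ERROR,		/* case 2, recoverable access such as uaccess */
	SEA_PANIC,			/* case 2, no recovery path known */
};

/* Hypothetical summary of the abort context. */
struct sea_ctx {
	bool from_user;		/* abort taken while executing in user mode */
	bool has_mc_fixup;	/* faulting kernel access has a memory-error fixup */
};

static enum sea_action sea_decide(const struct sea_ctx *ctx)
{
	assert(ctx != NULL);

	if (ctx->from_user)
		return SEA_KILL_TASK_ISOLATE_PAGE;
	if (ctx->has_mc_fixup)
		return SEA_FIXUP_RETURN_ERROR;
	return SEA_PANIC;
}
```

Only kernel accesses that are explicitly marked as recoverable take the middle
branch; every other kernel-mode consumption still panics, as it does today.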
>
> [1] commit d302c2398ba2 ("mm, hwpoison: when copy-on-write hits poison, take page offline")
> [2] commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults")
> [3] commit 6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")
> [4] commit 245f09226893 ("mm: hwpoison: coredump: support recovery from dump_user_range()")
> [5] commit 98c76c9f1ef7 ("mm/khugepaged: recover from poisoned anonymous memory")
> [6] commit 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory")
> [7] commit 278b917f8cb9 ("x86/mce: Add _ASM_EXTABLE_CPY for copy user access")
>
> ------------------
> Test result:
>
> Tested on Kunpeng 920.
>
> 1. copy_page() and copy_mc_page() basic function tests pass, and the
> disassembly remains the same before and after the refactor.
>
> 2. copy_to/from_user() accessing a kernel NULL pointer raises a translation
> fault, dumps an error message and then calls die(); test passes.
>
> 3. Test following scenarios: copy_from_user(), get_user(), COW.
>
> Before the patch: triggering a hardware memory error causes a panic.
> After the patch: triggering a hardware memory error does not cause a panic.
>
> Testing steps:
> step1. Start a user process.
> step2. Poison (einj) one of the user process's pages.
> step3. The user process accesses the poisoned page in kernel mode, which triggers an SEA.
> step4. The kernel does not panic; only the user process is killed and the
> poisoned page is isolated. (Before the patch, the kernel panics in do_sea().)
>
> The above tests can also be reproduced using ras-tools, which provides
> einj-based injection and validation for uaccess and COW scenarios.
> Example usage:
>
> einj_mem_uc futex # get_user
> einj_mem_uc copyin # copy_from_user
> einj_mem_uc copy-on-write # COW
>
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>
> ------------------
>
> Benefits
> =========
> According to Huawei's statistics from their storage products, memory errors
> triggered in kernel-mode by COW and page cache read (uaccess) scenarios
> account for more than 50%. With this patchset deployed, all kernel panics
> caused by COW and page cache memory errors are eliminated.
> Alibaba Cloud has also observed memory errors occurring in uaccess contexts.
>
> Since V13:
> 1. Changed MC-safe functions to return an error rather than kill the user
> process. When a user program invokes a syscall and the kernel encounters
> a memory error during uaccess, killing the process is unexpected; the
> syscall should return an error.
> 2. Added FEAT_MOPS support for the copy_page_mc paths.
> 3. Refactored copy_page() and memcpy() on top of the shared memcpy_template,
> reducing duplicated assembly code.
>
> Since v12:
> Thanks to the suggestions of Jonathan, Mark, and Mauro, the following modifications
> are made:
> 1. Rebase to latest kernel version.
> 2. Patch1, add Jonathan's and Mauro's Reviewed-by.
> 3. Patch2, modified do_apei_claim_sea() according to Mark's and Jonathan's suggestions,
> and optimized the commit message according to Mark's suggestions (added a description of
> the impact on regular copy_to_user()).
> 4. Patch3, optimized the commit message according to Mauro's suggestions and added Jonathan's
> Reviewed-by.
> 5. Patch4, modified copy_mc_user_highpage() and optimized the commit message according to
> Jonathan's suggestions (no functional changes).
> 6. Patch5, optimized the commit message according to Mauro's suggestions.
> 7. Patch4/5, FEAT_MOPS is added to the code logic. Currently, the fixup is not performed
> on the MOPS instruction.
> 8. Remove patch6 in v12 according to Jonathan's suggestions.
>
> Since v11:
> 1. Rebase to latest kernel version 6.9-rc1.
> 2. Add patch 5, since the problem described in "Since V10 Besides 3" has
> been solved in a50026bdb867 ('iov_iter: get rid of 'copy_mc' flag').
> 3. Add the benefit of applying the patch set to our company to the description of patch0.
>
> Since V10:
> According to Mark's suggestions:
> 1. Merge V10's patch2 and patch3 to V11's patch2.
> 2. Patch2(V11): use new fixup_type for ld* in copy_to_user(), fix fatal
> issues (NULL kernel pointer access) being fixed up incorrectly.
> 3. Patch2(V11): refactoring the logic of do_sea().
> 4. Patch4(V11): Remove duplicate assembly logic and remove do_mte().
>
> Besides:
> 1. Patch2(V11): remove st* insn's fixup, st* generally does not trigger a memory error.
> 2. Split a part of the logic of patch2(V11) to patch5(V11), for detail,
> see patch5(V11)'s commit msg.
> 3. Remove patch6(v10) “arm64: introduce copy_mc_to_kernel() implementation”.
> During modification, some problems that cannot be solved in a short
> period are found. The patch will be released after the problems are
> solved.
> 4. Add test result in this patch.
> 5. Modify patchset title, do not use machine check and remove "-next".
>
> Since V9:
> 1. Rebase to latest kernel version 6.8-rc2.
> 2. Add patch 6/6 to support copy_mc_to_kernel().
>
> Since V8:
> 1. Rebase to latest kernel version and fix typos in some of the patches.
> 2. According to the suggestion of Catalin, I attempted to modify the
> return value of function copy_mc_[user]_highpage() to bytes not copied.
> During the modification process, I found that it would be more
> reasonable to return -EFAULT when copy error occurs (referring to the
> newly added patch 4).
>
> For ARM64, the implementation of copy_mc_[user]_highpage() needs to
> consider MTE. Considering the scenario where data copying is successful
> but the MTE tag copying fails, it is also not reasonable to return
> bytes not copied.
> 3. Considering the recent addition of machine check safe support for
> multiple scenarios, modify commit message for patch 5 (patch 4 for V8).
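
The return-value reasoning in item 2 can be sketched as follows. This is a
simplified user-space model with hypothetical names (model_page,
model_copy_mc_highpage()); the two "poisoned" flags stand in for a hardware
memory error hit while reading the data or the MTE tags, which the real code
would detect through exception fixups.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for a page plus its tag storage. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool poisoned;		/* reading the data hits a memory error */
	bool tag_poisoned;	/* reading the tags hits a memory error */
};

/* Returns 0 on success, -EFAULT if either the data or the tag copy fails. */
static int model_copy_mc_highpage(struct model_page *dst,
				  const struct model_page *src)
{
	assert(dst != NULL && src != NULL);

	if (src->poisoned)
		return -EFAULT;
	memcpy(dst->data, src->data, MODEL_PAGE_SIZE);

	/*
	 * All data bytes were copied, so "bytes not copied" would be 0 here,
	 * yet the destination page is still unusable without its tags.
	 */
	if (src->tag_poisoned)
		return -EFAULT;

	return 0;
}
```

A single error code covers both failure modes, whereas a "bytes not copied"
count has no sensible value for the tag-only failure.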
>
> Since V7:
> Currently, there are patches supporting recovery from poison
> consumption for the cow scenario[1]. Therefore, supporting the cow
> scenario under the arm64 architecture only needs to modify the relevant
> code under arch/.
> [1] https://lore.kernel.org/lkml/20221031201029.102123-1-tony.luck@intel.com/
>
> Since V6:
> Resend patches that are not merged into the mainline in V6.
>
> Since V5:
> 1. Add patch2/3 to add uaccess assembly helpers.
> 2. Optimize the implementation logic of arm64_do_kernel_sea() in patch8.
> 3. Remove kernel access fixup in patch9.
> All suggestions are from Mark.
>
> Since V4:
> 1. According to Michael's suggestion, add patch5.
> 2. According to Mark's suggestion, do some restructuring of the arm64
> extable, then a new adaptation of machine check safe support is made based
> on this.
> 3. According to Mark's suggestion, support machine check safe in do_mte() in
> the cow scenario.
> 4. In V4, two patches have been merged into -next, so V5 does not send these
> two patches.
>
> Since V3:
> 1. According to Robin's suggestion, directly modify user_ldst and
> user_ldp in asm-uaccess.h and modify mte.S.
> 2. Add new macro USER_MC in asm-uaccess.h, used in copy_from_user.S
> and copy_to_user.S.
> 3. According to Robin's suggestion, use macros in copy_page_mc.S to
> simplify code.
> 4. According to Kefeng's suggestion, modify powerpc code in patch1.
> 5. According to Kefeng's suggestion, modify mm/extable.c and do some code
> optimization.
>
> Since V2:
> 1. According to Mark's suggestion, all uaccess can be recovered from
> memory errors.
> 2. The pagecache reading scenario is also supported as part of uaccess
> (copy_to_user()) and the code duplication problem is also solved.
> Thanks to Robin for the suggestion.
> 3. According to Mark's suggestion, update commit message of patch 2/5.
> 4. According to Borislav's suggestion, update commit message of patch 1/5.
>
> Since V1:
> 1. Consistent with PPC/x86, use CONFIG_ARCH_HAS_COPY_MC instead of
> ARM64_UCE_KERNEL_RECOVERY.
> 2. Add two new scenarios, cow and pagecache reading.
> 3. Fix two small bugs (the first two patches).
>
> V1 is here:
> https://lore.kernel.org/lkml/20220323033705.3966643-1-tongtiangen@huawei.com/
>
> Ruidong Tian (3):
> ACPI: APEI: GHES: use exception context to gate SIGBUS on poison
> consumption
> lib/test: memcpy_kunit: add copy_page() and copy_mc_page() tests
> lib/tests: memcpy_kunit: add memcpy_mc() and memcpy_mc_large() test
>
> Tong Tiangen (5):
> uaccess: add generic fallback version of copy_mc_to_user()
> arm64: add support for ARCH_HAS_COPY_MC
> mm/hwpoison: return -EFAULT when copy fail in
> copy_mc_[user]_highpage()
> arm64: support copy_mc_[user]_highpage()
> arm64: introduce copy_mc_to_kernel() implementation
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/asm-extable.h | 22 ++-
> arch/arm64/include/asm/asm-uaccess.h | 4 +
> arch/arm64/include/asm/extable.h | 1 +
> arch/arm64/include/asm/mte.h | 9 +
> arch/arm64/include/asm/page.h | 10 ++
> arch/arm64/include/asm/string.h | 5 +
> arch/arm64/include/asm/uaccess.h | 17 ++
> arch/arm64/kernel/acpi.c | 2 +-
> arch/arm64/lib/Makefile | 2 +
> arch/arm64/lib/copy_mc_page.S | 44 +++++
> arch/arm64/lib/copy_page.S | 62 +------
> arch/arm64/lib/copy_page_template.S | 71 ++++++++
> arch/arm64/lib/copy_to_user.S | 10 +-
> arch/arm64/lib/memcpy.S | 253 ++-------------------------
> arch/arm64/lib/memcpy_mc.S | 56 ++++++
> arch/arm64/lib/memcpy_template.S | 249 ++++++++++++++++++++++++++
> arch/arm64/lib/mte.S | 29 +++
> arch/arm64/mm/copypage.c | 75 ++++++++
> arch/arm64/mm/extable.c | 21 +++
> arch/arm64/mm/fault.c | 30 +++-
> arch/powerpc/include/asm/uaccess.h | 1 +
> arch/x86/include/asm/uaccess.h | 1 +
> drivers/acpi/apei/ghes.c | 36 ++--
> include/acpi/ghes.h | 6 +-
> include/linux/highmem.h | 16 +-
> include/linux/uaccess.h | 8 +
> lib/tests/memcpy_kunit.c | 178 ++++++++++++++++++-
> mm/kasan/shadow.c | 12 ++
> mm/khugepaged.c | 4 +-
> 30 files changed, 904 insertions(+), 331 deletions(-)
> create mode 100644 arch/arm64/lib/copy_mc_page.S
> create mode 100644 arch/arm64/lib/copy_page_template.S
> create mode 100644 arch/arm64/lib/memcpy_mc.S
> create mode 100644 arch/arm64/lib/memcpy_template.S
>
* Re: [PATCH v2 0/6] fsl-mc: Move over to device MSI infrastructure
From: Arnd Bergmann @ 2026-05-18 15:24 UTC
To: Marc Zyngier, Christophe Leroy
Cc: Ioana Ciornei, Thomas Gleixner, Sascha Bischoff, linux-kernel,
linux-arm-kernel, linuxppc-dev
In-Reply-To: <86zf1wx0b4.wl-maz@kernel.org>
On Mon, May 18, 2026, at 16:24, Marc Zyngier wrote:
> On Mon, 18 May 2026 14:51:48 +0100,
> "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> wrote:
>>
>> > > Do I need to respin it?
>>
>> No, I'd like to avoid having to rebase again. If you have changes to
>> the series please send followup patches.
Sorry this got held up even longer now. I meant to reply
earlier but dropped the ball on that while sending the merge
window contents.
This was indeed bad timing as the original pull request reached
me only after 7.0 was already out.
> No follow-up patches for that particular series, I just wanted to find
> out whether I could start posting additional changes that do not
> directly involve fsl-mc, but that are prevented by the current state
> of the code (such as trying to move the ITS initialisation much later
> in the boot process).
>
> I'll postpone my changes to 7.3, and keep my fingers crossed for this
> to hit 7.2.
I've merged the soc_fsl-7.1-2 tag into the soc/drivers branch
for 7.2 now. You should be able to base your other changes on top
of f0a2eac6a597 ("platform-msi: Remove stale comment") as a shared
branch.
Arnd
* Re: [PATCH 2/5] arm/pci: Use official API to iterate over PCI buses
From: Gerd Bayer @ 2026-05-18 15:45 UTC
To: Russell King
Cc: Yinghai Lu, linux-alpha, linux-kernel, linux-arm-kernel,
linuxppc-dev, linux-pci, Richard Henderson, Matt Turner,
Magnus Lindholm, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Bjorn Helgaas,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Gerd Bayer
In-Reply-To: <20260515-priv_root_buses-v1-2-f8e393c57390@linux.ibm.com>
On Fri, 2026-05-15 at 16:22 +0200, Gerd Bayer wrote:
> Replace iterating over pci_root_buses with the official
> pci_find_next_bus() call provided by PCI core. This allows making
> pci_root_buses private to PCI core.
>
> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
> ---
> arch/arm/kernel/bios32.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/kernel/bios32.c b/arch/arm/kernel/bios32.c
> index ac0e890510da..35642c9ba054 100644
> --- a/arch/arm/kernel/bios32.c
> +++ b/arch/arm/kernel/bios32.c
> @@ -59,9 +59,9 @@ static void pcibios_bus_report_status(struct pci_bus *bus, u_int status_mask, in
>
> void pcibios_report_status(u_int status_mask, int warn)
> {
> - struct pci_bus *bus;
> + struct pci_bus *bus = NULL;
>
> - list_for_each_entry(bus, &pci_root_buses, node)
> + while ((bus = pci_find_next_bus(bus)) != NULL)
> pcibios_bus_report_status(bus, status_mask, warn);
> }
>
Hi Russell,
Sashiko
https://sashiko.dev/#/message/20260515145940.E85AAC2BCB0%40smtp.kernel.org
reported:
> Since pci_find_next_bus() unconditionally acquires the pci_bus_sem read-write
> semaphore using down_read(), this introduces a blocking operation into that
> atomic path:
>
> dc21285_abort_irq() [hardirq context]
> pcibios_report_status()
> pci_find_next_bus()
> down_read(&pci_bus_sem) [sleeps]
>
> Does this path need an alternative approach to safely iterate over the buses
> without taking a sleeping lock?
IMHO, it looks like this entire pcibios_report_status() iterating over
all PCI buses and all their devices would be better off if moved
outside of the hardirq context?
Or could pcibios_report_status() be converted to use
for_each_pci_device()?
Any suggestions welcome...
Gerd
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Matthew Wilcox @ 2026-05-18 16:17 UTC
To: Barry Song
Cc: Lorenzo Stoakes, surenb, akpm, linux-mm, david, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4zqLfdWoTH9s7FFaqWWj0mESfikYgr7=GcV64qcuXrPxA@mail.gmail.com>
On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.
It's always been a terrible idea to call fork() from a multithreaded
application. For example, this question:
https://stackoverflow.com/questions/53601200/calling-fork-on-a-multithreaded-process
or this lwn thread: https://lwn.net/Articles/674660/
Do we have any insight into why these applications are doing this
horrible thing?
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Suren Baghdasaryan @ 2026-05-18 19:56 UTC
To: Barry Song
Cc: Lorenzo Stoakes, Matthew Wilcox, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4zqLfdWoTH9s7FFaqWWj0mESfikYgr7=GcV64qcuXrPxA@mail.gmail.com>
On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > > page faults. I assume this is due to buggy user code
> > > or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > > is blocked by page-fault I/O in some applications.
> > > The blocking occurs in the `dup_mmap()` path during
> > > fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> >
> > Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.
>
>
> >
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> >
> > Yeah I'm really not sure about that.
> >
> > Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> > page faults, which is really what fb49c455323ff is about.
> >
> > So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> > now you're saying 'let's change the forking behaviour that's been like that for
> > forever'.
>
>
> I am afraid not. Before we introduced the per-VMA lock, we
> were not performing I/O while holding `mmap_lock`. A page fault
> that needed I/O would drop the `mmap_lock` read lock and allow
> `fork()` to proceed.
>
> Now, you are suggesting performing I/O while holding the VMA
> lock, which changes the requirements and introduces this
> problem.
>
> >
> > I think you would _really_ have to be sure that's safe. And forking is a very
> > dangerous time in terms of complexity and sensitivity and 'weird stuff'
> > happening so I'd tread _very_ carefully here.
>
> Yep. I think my original proposal did not require any changes
> to `fork()`, since it simply preserved the current behavior of
> dropping the VMA lock before performing I/O. In that model,
> `fork()` would not end up waiting on I/O at all.
>
> What you are suggesting now appears to be performing I/O while
> holding the VMA lock, which in turn introduces the need to
> change `fork()`.
>
> >
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > *mm, struct mm_struct *oldmm)
> > > for_each_vma(vmi, mpnt) {
> > > struct file *file;
> > >
> > > - retval = vma_start_write_killable(mpnt);
> > > + /*
> > > + * For anonymous or writable private VMAs, prevent
> > > + * concurrent CoW faults.
> > > + */
> >
> > > To nitpick, I think the comment's confusing but it also tells you that you don't need the
> > > specific anon check - writable private is sufficient. And it's not really just
> > CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> >
> > > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > + (mpnt->vm_flags & VM_WRITE)))
> > > + retval = vma_start_write_killable(mpnt);
> >
> > I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> > it R/W.
> >
> > I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> > likely PROT_NONE) is here, just do the second check?
> >
> > (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> > vma_test(mpnt, VMA_MAYWRITE_BIT))
>
> Yep, I can definitely refine the check further. But before
> doing that, I'd first like to confirm that we are aligned on
> the direction.
>
> If you still intend to hold the VMA lock while performing I/O,
> then I think we should fix `fork()` to avoid taking
> `vma_start_write()`.
>
> >
> > > if (retval < 0)
> > > goto loop_out;
> > > if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> >
> > Technical commentary above is sort of 'just cos' :) because I really question
> > doing this honestly.
>
> I think we either need to fix `fork()`, or keep the current
> behavior of dropping the VMA lock before performing I/O.
I see. So, this problem arises from the fact that we are changing
page faults that require I/O to hold the VMA lock...
And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
anonymous and COW VMAs only while holding mmap_write_lock, preventing
any VMA modification. On the surface, that looks ok to me but I might
be missing some corner cases. If nobody sees any obvious issues, I
think it's worth a try.
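
To make the condition concrete, here is a small stand-alone sketch of the
predicate. It is a user-space model only: the flag values mirror what
include/linux/mm.h has traditionally used and are an assumption here,
model_is_cow_mapping() copies the logic of the kernel's is_cow_mapping(), and
model_fork_needs_vma_write_lock() is a hypothetical name for the check
dup_mmap() would make.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed flag values, mirroring the traditional include/linux/mm.h ones. */
#define VM_WRITE	0x00000002UL	/* used by callers building vm_flags */
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

static_assert((VM_SHARED & VM_MAYWRITE) == 0, "flags must be distinct bits");

/* Private mapping that is, or could be made, writable. */
static bool model_is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Hypothetical helper: should fork write-lock this VMA? */
static bool model_fork_needs_vma_write_lock(bool is_anonymous,
					    unsigned long vm_flags)
{
	return is_anonymous || model_is_cow_mapping(vm_flags);
}
```

Testing VM_MAYWRITE rather than VM_WRITE covers the mprotect() case raised
earlier: a private mapping that is currently read-only but may become
writable still gets locked, while shared file mappings are skipped.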
>
> >
> > I'd also like to get Suren's input, however.
>
> Yes. of course.
>
> >
> > Thanks, Lorenzo
>
> Best Regards
> Barry
* Re: [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Bjorn Helgaas @ 2026-05-18 20:29 UTC
To: Yury Murashka
Cc: bhelgaas, mahesh, oohall, corbet, skhan, linux-pci, linux-doc,
linux-kernel, linuxppc-dev, Lukas Wunner
In-Reply-To: <CAPzpGcRCTCZtaX1EVaJNZ103THZKsoszZduY7=gwfYdcrMo-SQ@mail.gmail.com>
[+cc Lukas]
On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
> If a new AER error is subsequently reported, the AER driver calls
> find_source_device() to find the source of the error. It rescans the
> whole bus and picks the first device reporting an AER error. Because the
> previous error was never cleared, the error is attributed to the wrong
> device and AER recovery is started for the wrong device.
>
> Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
> AER error status even when recovery fails, preventing stale errors from
> causing incorrect device identification on subsequent AER events.
Why should we add a kernel parameter for this? How would a user
decide whether to use the parameter? Are there cases where we
find the source of the first error, but we *wouldn't* want to clear
it if recovery fails?
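
For reference, the misattribution described in the patch description can be
modelled with a few lines of stand-alone C. This is only a sketch of the
"first device with status set wins" behaviour as the description states it;
model_dev and model_find_source() are hypothetical names and this is not the
AER driver's find_source_device().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device and its AER status register. */
struct model_dev {
	const char *name;
	unsigned int aer_status;	/* non-zero: device still reports an error */
};

/* Return the first device in scan order with error status set, or NULL. */
static const struct model_dev *model_find_source(const struct model_dev *devs,
						 size_t n)
{
	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (devs[i].aer_status)
			return &devs[i];
	return NULL;
}
```

If an earlier device keeps stale status after a failed recovery, a scan
triggered by a later device's error stops at the earlier one; once the stale
status is cleared, the scan reaches the device that actually raised the new
error.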
* Re: [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Yury M. @ 2026-05-18 20:49 UTC
To: Bjorn Helgaas
Cc: bhelgaas, mahesh, oohall, corbet, skhan, linux-pci, linux-doc,
linux-kernel, linuxppc-dev, Lukas Wunner
In-Reply-To: <20260518202903.GA641158@bhelgaas>
Current behavior has existed for a long time and I could easily imagine
that there is software which relies on the fact that the system is in a
non-modified state if AER recovery failed. The software can analyze the
system and do cleanup afterwards. Sometimes, if something fails in the
system, it is better to have it in a non-modified state.
In short, I just wanted to preserve the current logic by default because
there is a chance that we have software which relies on the current
behavior.
On 5/18/26 21:29, Bjorn Helgaas wrote:
> [+cc Lukas]
>
> On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
>> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
>> If a new AER error is subsequently reported, the AER driver calls
>> find_source_device() to find the source of the error. It rescans the
>> whole bus and picks the first device reporting an AER error. Because the
>> previous error was never cleared, the error is attributed to the wrong
>> device and AER recovery is started for the wrong device.
>>
>> Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
>> AER error status even when recovery fails, preventing stale errors from
>> causing incorrect device identification on subsequent AER events.
> Why should we add a kernel parameter for this? How would a user
> decide whether to use the parameter? Are there cases where we
> find the source of the first error, but we *wouldn't* want to clear
> it if recovery fails?
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Barry Song @ 2026-05-18 20:50 UTC
To: Matthew Wilcox
Cc: Lorenzo Stoakes, surenb, akpm, linux-mm, david, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <ags7mPK7Ong0ZsBf@casper.infradead.org>
On Tue, May 19, 2026 at 12:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> > We have clearly observed that the `fork()` operations of many
> > popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> > end up waiting on page-fault (PF) I/O when the VMA lock is
> > held during I/O operations. This has already become a
> > practical issue. I also believe this can lead to chained
> > waiting, since the global `mmap_lock` blocks all threads that
> > need to acquire it.
>
> It's always been a terrible idea to call fork() from a multithreaded
> application. For example, this question:
>
> https://stackoverflow.com/questions/53601200/calling-fork-on-a-multithreaded-process
>
> or this lwn thread: https://lwn.net/Articles/674660/
>
> Do we have any insight into why these applications are doing this
> horrible thing?
I swear I read the two links you shared. But the reality
is that as long as people use the Android framework,
even the simplest "Hello World" app already runs with
10+ threads :-)
main
RenderThread
ReferenceQueueDaemon
FinalizerDaemon
FinalizerWatchdogDaemon
HeapTaskDaemon
Binder:1234_1
Binder:1234_2
Signal Catcher
JDWP
...
Best Regards
Barry
* Re: cleanup the RAID6 P/Q library v3
From: Andrew Morton @ 2026-05-18 21:12 UTC
To: Christoph Hellwig
Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Huacai Chen,
WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Herbert Xu, Dan Williams,
Chris Mason, David Sterba, Arnd Bergmann, Song Liu, Yu Kuai,
Li Nan, linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
linux-riscv, linux-s390, linux-crypto, linux-btrfs, linux-arch,
linux-raid
In-Reply-To: <20260518051804.462141-1-hch@lst.de>
On Mon, 18 May 2026 07:17:43 +0200 Christoph Hellwig <hch@lst.de> wrote:
> this series cleans up the RAID6 P/Q library to match the recent updates
> to the RAID 5 XOR library and other CRC/crypto libraries. This includes
> providing properly documented external interfaces, hiding the internals,
> using static_call instead of indirect calls and turning the user space
> test suite into an in-kernel kunit test which is also extended to
> improve coverage.
Cool, I'll add this to mm.git's mm-nonmm-unstable branch for some
linux-next testing.
AI review found quite a lot to talk about:
https://sashiko.dev/#/patchset/20260518051804.462141-1-hch@lst.de
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Barry Song @ 2026-05-18 21:14 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Lorenzo Stoakes, Matthew Wilcox, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAJuCfpE0WQrB3zJp9qn3jvn5DthS=ttpX7gJJvyEhA_BJGrp5g@mail.gmail.com>
On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
[...]
> >
> > I think we either need to fix `fork()`, or keep the current
> > behavior of dropping the VMA lock before performing I/O.
>
> I see. So, this problem arises from the fact that we are changing the
> pagefaults requiring I/O operation to hold VMA lock...
> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> any VMA modification. On the surface, that looks ok to me but I might
> be missing some corner cases. If nobody sees any obvious issues, I
> think it's worth a try.
>
Thanks. Besides the creation of processes via fork(), I
am also beginning to worry about the death of processes.
One thing that came to my mind this morning
is that when lowmemorykiller decides to kill an app, we
want the memory to be released as quickly as possible so
the new app or user scenario can get memory sooner.
In that case, if the app being killed is performing I/O
while holding the VMA lock, the unmapping procedure
could end up being blocked as well.
If we release the VMA lock as we currently do, we allow
process exit to proceed.
I haven't thought it through very clearly yet, and I
may be wrong. I'd like to do more investigation. I hope
the apps being killed stay very still, but who knows—we
have so many applications in the market.
Meanwhile, if you have any comments regarding the death
of processes, they would be very welcome.
Best Regards
Barry
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Yang Shi @ 2026-05-18 21:21 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4ysMcrmDLSOwBkf7qwCQrcDWeEMXkHDajTJFMLKUk0bSQ@mail.gmail.com>
On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
>
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen? I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen". How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
> >
>
> Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> waiting for answers),
>
> As promised during LSF/MM/BPF, we conducted thorough
> testing on Android phones to determine whether performing
> I/O in `filemap_fault()` can block `vma_start_write()`.
> I wanted to give a quick update on this question.
>
> Nanzhe at Xiaomi created tracing scripts and ran various
> applications on Android devices with I/O performed under
> the VMA lock in `filemap_fault()`. We found that:
>
> 1. There are very few cases where unmap() is blocked by
> page faults. I assume this is due to buggy user code
> or poor synchronization between reads and unmap().
> So I assume it is not a problem.
>
> 2. We observed many cases where `vma_start_write()`
> is blocked by page-fault I/O in some applications.
> The blocking occurs in the `dup_mmap()` path during
> fork().
>
> With Suren's commit fb49c455323ff ("fork: lock VMAs of
> the parent process when forking"), we now always hold
> `vma_write_lock()` for each VMA. Note that the
> `mmap_lock` write lock is also held, which could lead to
> chained waiting if page-fault I/O is performed without
> releasing the VMA lock.
>
> My gut feeling is that Suren's commit may be overshooting,
> so my rough idea is that we might want to do something like
> the following (we haven't tested it yet and it might be
> wrong):
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..5ddaf297f31a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> *mm, struct mm_struct *oldmm)
> for_each_vma(vmi, mpnt) {
> struct file *file;
>
> - retval = vma_start_write_killable(mpnt);
> + /*
> + * For anonymous or writable private VMAs, prevent
> + * concurrent CoW faults.
> + */
> + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> + (mpnt->vm_flags & VM_WRITE)))
> + retval = vma_start_write_killable(mpnt);
> if (retval < 0)
> goto loop_out;
> if (mpnt->vm_flags & VM_DONTCOPY) {
Maybe a little bit off topic. This is an interesting idea. It seems
possible we don't have to take vma write lock unconditionally. IIUC
the write lock is mainly used to serialize against page fault and
madvise, right? I got a crazy idea off the top of my head. We may be
able to just take vma write lock iff vma->anon_vma is not NULL.
First of all, the write mmap_lock is held, so the vma can't go away or
be changed under us.
Secondly, if vma->anon_vma is NULL, it basically means either no page
fault happened or no cow happened, so there is no page table to copy,
this is also what copy_page_range() does currently. So we can shrink
the critical section to:
    if (vma->anon_vma) {
        vma_start_write_killable(src_vma);
        anon_vma_fork(dst_vma, src_vma);
        copy_page_range(dst_vma, src_vma);
    }
But a page fault can start before the write mmap_lock is taken, so when
we check vma->anon_vma it may not have been set up yet. That seems to be
equivalent to a page fault after fork, though, and shouldn't break the
semantics.
Anyway, just a crazy idea, I may miss some corner cases.
Thanks,
Yang
}
>
> Based on the above, we may want to re-check whether fork()
> can be blocked by page faults. At the same time, if Suren,
> you, or anyone else has any comments, please feel free to
> share them.
>
> Best Regards
> Barry
>
* Re: [PATCH v5 00/14] module: Introduce hash-based integrity checking
From: Sami Tolvanen @ 2026-05-18 21:55 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Nathan Chancellor,
Nicolas Schier, Arnd Bergmann, Luis Chamberlain, Petr Pavlu,
Daniel Gomez, Paul Moore, James Morris, Serge E. Hallyn,
Jonathan Corbet, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Naveen N Rao, Mimi Zohar, Roberto Sassu,
Dmitry Kasatkin, Eric Snowberg, Nicolas Schier, Daniel Gomez,
Aaron Tomlin, Christophe Leroy (CS GROUP), Nicolas Bouchinet,
Xiu Jianfeng, Martin KaFai Lau, Song Liu, Yonghong Song,
Jiri Olsa, bpf, Fabian Grünbichler, Arnout Engelen,
Mattia Rizzolo, kpcyrd, Christian Heusel, Câju Mihai-Drosi,
Eric Biggers, Sebastian Andrzej Siewior, linux-kbuild,
linux-kernel, linux-arch, linux-modules, linux-security-module,
linux-doc, linuxppc-dev, linux-integrity, debian-kernel
In-Reply-To: <20260505-module-hashes-v5-0-e174a5a49fce@weissschuh.net>
Hi Thomas,
On Tue, May 05, 2026 at 11:05:04AM +0200, Thomas Weißschuh wrote:
> The current signature-based module integrity checking has some drawbacks
> in combination with reproducible builds. Either the module signing key
> is generated at build time, which makes the build unreproducible, or a
> static signing key is used, which precludes rebuilds by third parties
> and makes the whole build and packaging process much more complicated.
>
> The goal is to reach bit-for-bit reproducibility. Excluding certain
> parts of the build output from the reproducibility analysis would be
> error-prone and force each downstream consumer to introduce new tooling.
>
> Introduce a new mechanism to ensure only well-known modules are loaded
> by embedding a merkle tree root of all modules built as part of the full
> kernel build into vmlinux.
I noticed Sashiko had a few concerns about the build changes. Would you
mind taking a look to see if they're valid?
https://sashiko.dev/#/patchset/20260505-module-hashes-v5-0-e174a5a49fce%40weissschuh.net
Sami
* Re: [PATCH v2 0/3] KVM: Fix and clean up kvm_vcpu_map[_readonly]() usages
From: Sean Christopherson @ 2026-05-19 0:40 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Madhavan Srinivasan,
Nicholas Piggin, Peter Fang
Cc: Yosry Ahmed, Ritesh Harjani, Michael Ellerman,
Christophe Leroy (CS GROUP), Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, kvm,
linuxppc-dev, linux-kernel
In-Reply-To: <20260408001137.3290444-1-peter.fang@intel.com>
On Tue, 07 Apr 2026 17:11:27 -0700, Peter Fang wrote:
> kvm_vcpu_map() and kvm_vcpu_map_readonly() are declared to take a gpa_t
> in kvm_host.h when they're supposed to take a gfn_t. First fix the
> function prototypes, and then refactor them to correctly take a gpa_t,
> reducing boilerplate gpa->gfn conversions at all call sites.
>
> No actual harm has been done yet as all of the call sites are correctly
> passing in a gfn.
>
> [...]
Applied patch 1 to kvm-x86 generic. I'm moderately optimistic that the gpc
stuff will land soon enough that I won't regret skipping 2 and 3 :-)
Thanks much!
[1/3] KVM: Fix kvm_vcpu_map[_readonly]() function prototypes
https://github.com/kvm-x86/linux/commit/ccd6c77223bb
--
https://github.com/kvm-x86/linux/tree/next
* Re: [PATCH v2] KVM: PPC: Kconfig: Enable CONFIG_VPA_PMU with KVM
From: Sean Christopherson @ 2026-05-19 1:01 UTC (permalink / raw)
To: Gautam Menghani
Cc: maddy, npiggin, mpe, chleroy, atrajeev, linuxppc-dev, kvm,
linux-kernel, stable
In-Reply-To: <20260518044150.34632-1-gautam@linux.ibm.com>
On Mon, May 18, 2026, Gautam Menghani wrote:
> Enable CONFIG_VPA_PMU with KVM to enable its usage. Currently, the
> vpa-pmu driver cannot be used since it is not enabled in distro configs.
That seems like a problem to take up with distros, no?
* [powerpc:merge] BUILD SUCCESS d850c9d4d46e1c3f70922e048815ce8bc0235cec
From: kernel test robot @ 2026-05-19 2:16 UTC (permalink / raw)
To: Madhavan Srinivasan; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: d850c9d4d46e1c3f70922e048815ce8bc0235cec Automatic merge of 'fixes' into merge (2026-05-18 15:48)
elapsed time: 886m
configs tested: 326
configs skipped: 14
The following configs have been built successfully.
More configs may be tested in the coming days.
tested configs:
alpha allnoconfig gcc-15.2.0
alpha allyesconfig gcc-15.2.0
alpha defconfig gcc-15.2.0
arc alldefconfig gcc-15.2.0
arc allmodconfig clang-16
arc allmodconfig clang-17
arc allmodconfig gcc-15.2.0
arc allnoconfig gcc-15.2.0
arc allyesconfig clang-23
arc allyesconfig gcc-15.2.0
arc defconfig gcc-15.2.0
arc randconfig-001-20260518 gcc-8.5.0
arc randconfig-001-20260518 gcc-9.5.0
arc randconfig-001-20260519 clang-23
arc randconfig-002-20260518 gcc-8.5.0
arc randconfig-002-20260519 clang-23
arm allnoconfig clang-23
arm allnoconfig gcc-15.2.0
arm allyesconfig clang-16
arm allyesconfig clang-17
arm allyesconfig gcc-15.2.0
arm defconfig gcc-15.2.0
arm mv78xx0_defconfig clang-19
arm randconfig-001-20260518 clang-23
arm randconfig-001-20260518 gcc-8.5.0
arm randconfig-001-20260519 clang-23
arm randconfig-002-20260518 gcc-8.5.0
arm randconfig-002-20260519 clang-23
arm randconfig-003-20260518 gcc-8.5.0
arm randconfig-003-20260519 clang-23
arm randconfig-004-20260518 gcc-10.5.0
arm randconfig-004-20260518 gcc-8.5.0
arm randconfig-004-20260519 clang-23
arm64 allmodconfig clang-23
arm64 allnoconfig gcc-15.2.0
arm64 defconfig gcc-15.2.0
arm64 randconfig-001-20260519 gcc-8.5.0
arm64 randconfig-002-20260519 gcc-8.5.0
arm64 randconfig-003-20260519 gcc-8.5.0
arm64 randconfig-004-20260519 gcc-8.5.0
csky allmodconfig gcc-15.2.0
csky allnoconfig gcc-15.2.0
csky defconfig gcc-15.2.0
csky randconfig-001-20260519 gcc-8.5.0
csky randconfig-002-20260518 gcc-11.5.0
csky randconfig-002-20260519 gcc-8.5.0
hexagon allmodconfig clang-17
hexagon allmodconfig gcc-15.2.0
hexagon allnoconfig clang-23
hexagon allnoconfig gcc-15.2.0
hexagon defconfig gcc-15.2.0
hexagon randconfig-001-20260518 clang-16
hexagon randconfig-001-20260518 gcc-11.5.0
hexagon randconfig-001-20260519 clang-23
hexagon randconfig-001-20260519 gcc-10.5.0
hexagon randconfig-002-20260518 clang-23
hexagon randconfig-002-20260518 gcc-11.5.0
hexagon randconfig-002-20260519 clang-23
hexagon randconfig-002-20260519 gcc-10.5.0
i386 allmodconfig clang-20
i386 allmodconfig gcc-14
i386 allnoconfig gcc-14
i386 allnoconfig gcc-15.2.0
i386 allyesconfig clang-20
i386 allyesconfig gcc-14
i386 buildonly-randconfig-001-20260518 clang-20
i386 buildonly-randconfig-001-20260519 gcc-12
i386 buildonly-randconfig-002-20260518 clang-20
i386 buildonly-randconfig-002-20260519 gcc-12
i386 buildonly-randconfig-003-20260518 gcc-14
i386 buildonly-randconfig-003-20260519 gcc-12
i386 buildonly-randconfig-004-20260518 gcc-14
i386 buildonly-randconfig-004-20260519 gcc-12
i386 buildonly-randconfig-005-20260518 gcc-14
i386 buildonly-randconfig-005-20260519 gcc-12
i386 buildonly-randconfig-006-20260518 clang-20
i386 buildonly-randconfig-006-20260519 gcc-12
i386 defconfig gcc-15.2.0
i386 randconfig-001-20260518 clang-20
i386 randconfig-001-20260518 gcc-14
i386 randconfig-001-20260519 gcc-14
i386 randconfig-002-20260518 gcc-14
i386 randconfig-002-20260519 gcc-14
i386 randconfig-003-20260518 gcc-14
i386 randconfig-003-20260519 gcc-14
i386 randconfig-004-20260518 clang-20
i386 randconfig-004-20260518 gcc-14
i386 randconfig-004-20260519 gcc-14
i386 randconfig-005-20260518 clang-20
i386 randconfig-005-20260518 gcc-14
i386 randconfig-005-20260519 gcc-14
i386 randconfig-006-20260518 gcc-13
i386 randconfig-006-20260518 gcc-14
i386 randconfig-006-20260519 gcc-14
i386 randconfig-007-20260518 clang-20
i386 randconfig-007-20260518 gcc-14
i386 randconfig-007-20260519 gcc-14
i386 randconfig-011-20260518 gcc-14
i386 randconfig-011-20260519 gcc-14
i386 randconfig-012-20260518 clang-20
i386 randconfig-012-20260518 gcc-14
i386 randconfig-012-20260519 gcc-14
i386 randconfig-013-20260518 gcc-14
i386 randconfig-013-20260519 gcc-14
i386 randconfig-014-20260518 gcc-14
i386 randconfig-014-20260519 gcc-14
i386 randconfig-015-20260518 clang-20
i386 randconfig-015-20260518 gcc-14
i386 randconfig-015-20260519 gcc-14
i386 randconfig-016-20260518 gcc-14
i386 randconfig-016-20260519 gcc-14
i386 randconfig-017-20260518 gcc-14
i386 randconfig-017-20260519 gcc-14
loongarch allmodconfig clang-23
loongarch allnoconfig clang-23
loongarch allnoconfig gcc-15.2.0
loongarch defconfig clang-19
loongarch randconfig-001-20260518 gcc-11.5.0
loongarch randconfig-001-20260518 gcc-15.2.0
loongarch randconfig-001-20260519 clang-23
loongarch randconfig-001-20260519 gcc-10.5.0
loongarch randconfig-002-20260518 clang-23
loongarch randconfig-002-20260518 gcc-11.5.0
loongarch randconfig-002-20260519 clang-23
loongarch randconfig-002-20260519 gcc-10.5.0
m68k allmodconfig gcc-15.2.0
m68k allnoconfig gcc-15.2.0
m68k allyesconfig clang-16
m68k allyesconfig clang-17
m68k allyesconfig gcc-15.2.0
m68k defconfig clang-19
m68k defconfig gcc-15.2.0
m68k m5249evb_defconfig gcc-15.2.0
m68k m5407c3_defconfig gcc-15.2.0
microblaze allnoconfig gcc-15.2.0
microblaze allyesconfig gcc-15.2.0
microblaze defconfig clang-19
microblaze defconfig gcc-15.2.0
mips allmodconfig gcc-15.2.0
mips allnoconfig gcc-15.2.0
mips allyesconfig gcc-15.2.0
mips bcm47xx_defconfig clang-18
nios2 allmodconfig clang-23
nios2 allmodconfig gcc-11.5.0
nios2 allnoconfig clang-23
nios2 allnoconfig gcc-11.5.0
nios2 defconfig clang-19
nios2 defconfig gcc-11.5.0
nios2 randconfig-001-20260518 gcc-11.5.0
nios2 randconfig-001-20260519 gcc-10.5.0
nios2 randconfig-002-20260518 gcc-11.5.0
nios2 randconfig-002-20260519 gcc-10.5.0
openrisc allmodconfig clang-23
openrisc allmodconfig gcc-15.2.0
openrisc allnoconfig clang-23
openrisc allnoconfig gcc-15.2.0
openrisc defconfig gcc-15.2.0
parisc allmodconfig gcc-15.2.0
parisc allnoconfig clang-23
parisc allnoconfig gcc-15.2.0
parisc allyesconfig clang-19
parisc allyesconfig gcc-15.2.0
parisc defconfig gcc-15.2.0
parisc randconfig-001-20260518 gcc-15.2.0
parisc randconfig-001-20260519 gcc-8.5.0
parisc randconfig-002-20260518 gcc-12.5.0
parisc randconfig-002-20260519 gcc-8.5.0
parisc64 defconfig clang-19
parisc64 defconfig gcc-15.2.0
powerpc allmodconfig gcc-15.2.0
powerpc allnoconfig clang-23
powerpc allnoconfig gcc-15.2.0
powerpc powernv_defconfig gcc-15.2.0
powerpc randconfig-001-20260518 clang-23
powerpc randconfig-001-20260519 gcc-8.5.0
powerpc randconfig-002-20260518 clang-23
powerpc randconfig-002-20260519 gcc-8.5.0
powerpc64 randconfig-001-20260518 gcc-11.5.0
powerpc64 randconfig-001-20260519 gcc-8.5.0
powerpc64 randconfig-002-20260518 clang-23
powerpc64 randconfig-002-20260519 gcc-8.5.0
riscv allmodconfig clang-23
riscv allnoconfig clang-23
riscv allnoconfig gcc-15.2.0
riscv allyesconfig clang-16
riscv allyesconfig clang-17
riscv defconfig clang-23
riscv defconfig gcc-15.2.0
riscv randconfig-001-20260518 clang-23
riscv randconfig-001-20260519 gcc-13.4.0
riscv randconfig-002-20260518 clang-23
riscv randconfig-002-20260519 gcc-13.4.0
s390 allmodconfig clang-18
s390 allmodconfig clang-19
s390 allnoconfig clang-23
s390 allyesconfig gcc-15.2.0
s390 defconfig clang-23
s390 defconfig gcc-15.2.0
s390 randconfig-001-20260518 gcc-12.5.0
s390 randconfig-001-20260519 gcc-13.4.0
s390 randconfig-002-20260518 gcc-12.5.0
s390 randconfig-002-20260519 gcc-13.4.0
sh allmodconfig gcc-15.2.0
sh allnoconfig clang-23
sh allnoconfig gcc-15.2.0
sh allyesconfig clang-19
sh allyesconfig gcc-15.2.0
sh defconfig gcc-14
sh defconfig gcc-15.2.0
sh randconfig-001-20260518 gcc-14.3.0
sh randconfig-001-20260519 gcc-13.4.0
sh randconfig-002-20260518 gcc-11.5.0
sh randconfig-002-20260519 gcc-13.4.0
sparc allnoconfig clang-23
sparc allnoconfig gcc-15.2.0
sparc defconfig gcc-15.2.0
sparc randconfig-001-20260518 gcc-15.2.0
sparc randconfig-001-20260519 gcc-14.3.0
sparc randconfig-002-20260518 gcc-15.2.0
sparc randconfig-002-20260519 gcc-14.3.0
sparc64 allmodconfig clang-23
sparc64 defconfig clang-20
sparc64 defconfig gcc-14
sparc64 randconfig-001-20260518 gcc-15.2.0
sparc64 randconfig-001-20260519 gcc-14.3.0
sparc64 randconfig-002-20260518 gcc-15.2.0
sparc64 randconfig-002-20260519 gcc-14.3.0
um allmodconfig clang-19
um allnoconfig clang-23
um allyesconfig gcc-14
um allyesconfig gcc-15.2.0
um defconfig clang-23
um defconfig gcc-14
um i386_defconfig gcc-14
um randconfig-001-20260518 clang-16
um randconfig-001-20260518 gcc-15.2.0
um randconfig-001-20260519 gcc-14.3.0
um randconfig-002-20260518 clang-23
um randconfig-002-20260518 gcc-15.2.0
um randconfig-002-20260519 gcc-14.3.0
um x86_64_defconfig clang-23
um x86_64_defconfig gcc-14
x86_64 allmodconfig clang-20
x86_64 allnoconfig clang-20
x86_64 allnoconfig clang-23
x86_64 allyesconfig clang-20
x86_64 buildonly-randconfig-001-20260518 clang-20
x86_64 buildonly-randconfig-001-20260518 gcc-14
x86_64 buildonly-randconfig-001-20260519 gcc-14
x86_64 buildonly-randconfig-002-20260518 clang-20
x86_64 buildonly-randconfig-002-20260518 gcc-14
x86_64 buildonly-randconfig-002-20260519 gcc-14
x86_64 buildonly-randconfig-003-20260518 clang-20
x86_64 buildonly-randconfig-003-20260518 gcc-14
x86_64 buildonly-randconfig-003-20260519 gcc-14
x86_64 buildonly-randconfig-004-20260518 gcc-14
x86_64 buildonly-randconfig-004-20260519 gcc-14
x86_64 buildonly-randconfig-005-20260518 gcc-14
x86_64 buildonly-randconfig-005-20260519 gcc-14
x86_64 buildonly-randconfig-006-20260518 clang-20
x86_64 buildonly-randconfig-006-20260518 gcc-14
x86_64 buildonly-randconfig-006-20260519 gcc-14
x86_64 defconfig gcc-14
x86_64 kexec clang-20
x86_64 randconfig-001 gcc-14
x86_64 randconfig-001-20260518 gcc-14
x86_64 randconfig-001-20260519 clang-20
x86_64 randconfig-002 gcc-14
x86_64 randconfig-002-20260518 clang-20
x86_64 randconfig-002-20260518 gcc-14
x86_64 randconfig-002-20260519 clang-20
x86_64 randconfig-003 gcc-14
x86_64 randconfig-003-20260518 gcc-14
x86_64 randconfig-003-20260519 clang-20
x86_64 randconfig-004 gcc-14
x86_64 randconfig-004-20260518 gcc-14
x86_64 randconfig-004-20260519 clang-20
x86_64 randconfig-005 gcc-14
x86_64 randconfig-005-20260518 gcc-14
x86_64 randconfig-005-20260519 clang-20
x86_64 randconfig-006 gcc-14
x86_64 randconfig-006-20260518 gcc-14
x86_64 randconfig-006-20260519 clang-20
x86_64 randconfig-011-20260518 clang-20
x86_64 randconfig-011-20260519 clang-20
x86_64 randconfig-012-20260518 clang-20
x86_64 randconfig-012-20260519 clang-20
x86_64 randconfig-013-20260518 gcc-14
x86_64 randconfig-013-20260519 clang-20
x86_64 randconfig-014-20260518 clang-20
x86_64 randconfig-014-20260519 clang-20
x86_64 randconfig-015-20260518 clang-20
x86_64 randconfig-015-20260519 clang-20
x86_64 randconfig-016-20260518 gcc-14
x86_64 randconfig-016-20260519 clang-20
x86_64 randconfig-071-20260518 clang-20
x86_64 randconfig-071-20260518 gcc-14
x86_64 randconfig-071-20260519 gcc-14
x86_64 randconfig-072-20260518 clang-20
x86_64 randconfig-072-20260518 gcc-14
x86_64 randconfig-072-20260519 gcc-14
x86_64 randconfig-073-20260518 clang-20
x86_64 randconfig-073-20260519 gcc-14
x86_64 randconfig-074-20260518 clang-20
x86_64 randconfig-074-20260519 gcc-14
x86_64 randconfig-075-20260518 clang-20
x86_64 randconfig-075-20260519 gcc-14
x86_64 randconfig-076-20260518 clang-20
x86_64 randconfig-076-20260519 gcc-14
x86_64 rhel-9.4 clang-20
x86_64 rhel-9.4-bpf gcc-14
x86_64 rhel-9.4-func clang-20
x86_64 rhel-9.4-kselftests clang-20
x86_64 rhel-9.4-kunit gcc-14
x86_64 rhel-9.4-ltp gcc-14
x86_64 rhel-9.4-rust clang-20
xtensa allnoconfig clang-23
xtensa allnoconfig gcc-15.2.0
xtensa allyesconfig clang-23
xtensa allyesconfig gcc-15.2.0
xtensa randconfig-001-20260518 gcc-12.5.0
xtensa randconfig-001-20260518 gcc-15.2.0
xtensa randconfig-001-20260519 gcc-14.3.0
xtensa randconfig-002-20260518 gcc-15.2.0
xtensa randconfig-002-20260518 gcc-9.5.0
xtensa randconfig-002-20260519 gcc-14.3.0
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2 1/2] lkdtm/powerpc: add isync after slbmte to enforce SLB update ordering
From: Ritesh Harjani @ 2026-05-19 2:24 UTC (permalink / raw)
To: Sayali Patil, linuxppc-dev, maddy; +Cc: linux-kernel, Mahesh Salgaonkar
In-Reply-To: <2f8d430962a96a7498903b994f081deee4a4d97a.1778975974.git.sayalip@linux.ibm.com>
Sayali Patil <sayalip@linux.ibm.com> writes:
> The slbmte instruction modifies the Segment Lookaside Buffer, but without
> a context synchronizing operation the CPU is not guaranteed to observe
> the updated SLB state for subsequent instructions. This can result in
> use of stale translation state when memory is accessed immediately after
> SLB modifications.
>
> Add isync after each slbmte in the PPC_SLB_MULTIHIT test to ensure proper
> ordering of SLB updates before subsequent memory accesses.
>
> This aligns with Power ISA context synchronization requirements for changes
> in address translation state and improves the reliability of SLB multihit
> injection tests in hash MMU mode.
>
Yup, a CSI is required before & after an slbmte. Given we are trying to
add duplicate slb entries, I think the isync()s added in this patch are
sufficient.
LGTM. Feel free to add:
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
As Mpe added - This needs to be cc'd to Kees.
-> ./scripts/get_maintainer.pl -f drivers/misc/lkdtm/powerpc.c
Kees Cook <kees@kernel.org> (maintainer:LINUX KERNEL DUMP TEST MODULE (LKDTM))
Arnd Bergmann <arnd@arndb.de> (maintainer:CHAR and MISC DRIVERS)
Greg Kroah-Hartman <gregkh@linuxfoundation.org> (maintainer:CHAR and MISC DRIVERS)
linux-kernel@vger.kernel.org (open list)
CHAR and MISC DRIVERS status: Supported
-ritesh
* Re: [PATCH 3/8] mm/bootmem_info: stop using PG_private
From: Lance Yang @ 2026-05-19 2:56 UTC (permalink / raw)
To: david
Cc: davem, andreas, rppt, akpm, agordeev, gerald.schaefer, hca, gor,
borntraeger, svens, maddy, mpe, npiggin, chleroy, ljs, liam,
vbabka, surenb, mhocko, sparclinux, linux-kernel, linux-mm,
linux-s390, linuxppc-dev, Lance Yang
In-Reply-To: <20260511-bootmem_info_prep-v1-3-3fb0be6fc688@kernel.org>
On Mon, May 11, 2026 at 04:05:31PM +0200, David Hildenbrand (Arm) wrote:
>Nobody checks PG_private for these pages, and we can happily use
>set_page_private() without setting PG_private. So let's just stop
>setting/clearing PG_private.
>
>Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>---
> mm/bootmem_info.c | 2 --
> 1 file changed, 2 deletions(-)
>
>diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
>index a0a1ecdec8d0..6e2aaab3dca9 100644
>--- a/mm/bootmem_info.c
>+++ b/mm/bootmem_info.c
>@@ -19,7 +19,6 @@ void get_page_bootmem(unsigned long info, struct page *page,
> {
> BUG_ON(type > 0xf);
> BUG_ON(info > (ULONG_MAX >> 4));
>- SetPagePrivate(page);
Right, the users classify these pages via PageReserved()/bootmem_type(),
not PagePrivate().
So it makes sense not to set PG_private in the first place.
> set_page_private(page, info << 4 | type);
> page_ref_inc(page);
> }
>@@ -32,7 +31,6 @@ void put_page_bootmem(struct page *page)
> type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
>
> if (page_ref_dec_return(page) == 1) {
>- ClearPagePrivate(page);
Nothing sets it anymore, so there is nothing to clear here.
LGTM, feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>
> set_page_private(page, 0);
> kmemleak_free_part_phys(PFN_PHYS(page_to_pfn(page)), PAGE_SIZE);
> free_reserved_page(page);
>
>--
>2.43.0
>
>
* Re: cleanup the RAID6 P/Q library v3
From: Christoph Hellwig @ 2026-05-19 8:24 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, Catalin Marinas, Will Deacon, Ard Biesheuvel,
Huacai Chen, WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Herbert Xu, Dan Williams,
Chris Mason, David Sterba, Arnd Bergmann, Song Liu, Yu Kuai,
Li Nan, linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
linux-riscv, linux-s390, linux-crypto, linux-btrfs, linux-arch,
linux-raid
In-Reply-To: <20260518141205.c100f76eec5f58e78bbbf7af@linux-foundation.org>
On Mon, May 18, 2026 at 02:12:05PM -0700, Andrew Morton wrote:
> Cool, I'll add this to mm.git's mm-nonmm-unstable branch for some
> linux-next testing.
>
> AI review found quite a lot to talk about:
> https://sashiko.dev/#/patchset/20260518051804.462141-1-hch@lst.de
Not a lot of it is very useful, though:
raid6: turn the userspace test harness into a kunit test
- complains about basically adding need_resched, which we've decided
we won't do now that we have lazy preempt. This is probably going
to come up in lots of places because of the old training data
raid6: use named initializers for struct raid6_calls
- whining about keeping totally pointless comments
raid6: warn when using less than four devices
- complains about warning for btrfs which is clearly documented as the
outcome in the commit log
- and also complaining that the enforcement isn't hard enough, but the
WARN_ON is the best we can do here
raid6: rework registration of optimized algorithms
- less registration causing less kunit coverage: that's intentional
as it keeps testing time down and similar to other arch optimized
tests in crc and crypto code. It also doesn't really reduce
coverage as before this series there was none.
raid6: use static_call for gen_syndrom and xor_syndrom
- doesn't seem to know that bool fails when an initcall fails
raid6_kunit: use KUNIT_CASE_PARAM
- whining about the code style. I don't really like it either,
but the kunit case stuff is a mess
There are a few somewhat useful things, though.
raid6: hide internals
- yes, the -I is duplicate and should be fixed
raid6: rework registration of optimized algorithms
- avx2 instead of avx512 is probably the right thing for no
benchmarking, but if it was intentional (it wasn't), that should
be documented. So I'll just switch back to the previous version to
keep the state of the art
* Re: [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Lukas Wunner @ 2026-05-19 9:53 UTC (permalink / raw)
To: Yury Murashka
Cc: bhelgaas, mahesh, oohall, linux-pci, linux-kernel, linuxppc-dev
In-Reply-To: <CAPzpGcRCTCZtaX1EVaJNZ103THZKsoszZduY7=gwfYdcrMo-SQ@mail.gmail.com>
On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
> If a new AER error is subsequently reported, the AER driver calls
> find_source_device() to find the source of the error. It rescans the
> whole bus and picks the first device reporting an AER error. Because the
> previous error was never cleared, the error is attributed to the wrong
> device and AER recovery is started for the wrong device.
I guess the rationale of the current behavior is that the devices
affected by the failed error recovery are basically in a broken
state once error recovery failed and so user intervention is
required, e.g. a remove/rescan via sysfs.
My question is, why is error recovery failing for the devices
in the first place?
And what does the hierarchy look like?
(lspci -tv and lspci -vvv output please)
I also don't quite follow your assertion that (only) the first device
reporting an error is picked. The algorithm tries to collect *all*
error-reporting devices in the affected portion of the hierarchy.
Thanks,
Lukas
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 11:04 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5ah5oa59wy.fsf@kernel.org>
On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
> >>
> >> What I meant was that we need a generic way to identify a pKVM guest, so
> >> that we can use it in the conditional above.
> >
> > I have this patch, with that I can boot with your series unmodified,
> > but I will need to do more testing.
> >
>
> Thanks, I can add this to the series once you complete the required testing.
>
I am still running more tests, but looking more into it. Setting
force_dma_unencrypted() to true for pKVM guests is wrong, as the
guest shouldn’t try to decrypt arbitrary memory as it can include
sensitive information (for example in case of virtio sub-page
allocation) and should strictly rely on the restricted-dma-pool
for that.
However, with my patch and setting force_dma_unencrypted() to false
on top of this series, it fails on pKVM due to a missing shared
attribute as Alexey mentioned, as now SWIOTLB rejects non shared
attrs, so, the DMA-API has to pass it. With that, I can boot again:
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 5103a04df99f..b19aeec03f27 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -286,6 +286,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
if (is_swiotlb_for_alloc(dev)) {
+ attrs |= DMA_ATTR_CC_SHARED;
+
page = dma_direct_alloc_swiotlb(dev, size, attrs);
if (page) {
/*
@@ -449,6 +451,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
&cpu_addr, gfp, attrs);
if (is_swiotlb_for_alloc(dev)) {
+ attrs |= DMA_ATTR_CC_SHARED;
+
page = dma_direct_alloc_swiotlb(dev, size, attrs);
if (!page)
return NULL;
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 4e35264ab6f8..8ee5bbf78cfb 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -92,6 +92,7 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
return DMA_MAPPING_ERROR;
+ attrs |= DMA_ATTR_CC_SHARED;
return swiotlb_map(dev, phys, size, dir, attrs);
}
--
I will keep testing and let you know how it goes. If there is nothing
else required to convert pKVM guests to CC, I can just post the patch
separately as it has no dependency on this series.
Re force_dma_unencrypted(), I am looking into a safe way to use it
for pKVM as I believe it will be useful to eliminate some bouncing.
However, that’s not critical for this series and can be added later
as I am still investigating it, if I reach something I can post it
along the pKVM patch above.
Thanks,
Mostafa
>
>
> -aneesh
^ permalink raw reply related
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 11:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260515225113.GN7702@ziepe.ca>
On Fri, May 15, 2026 at 07:51:13PM -0300, Jason Gunthorpe wrote:
> On Thu, May 14, 2026 at 02:43:39PM +0000, Mostafa Saleh wrote:
> > > That's a somewhat different problem, we have the dev->trusted stuff
> > > that is supposed to deal with this kind of security. We need it for
> > > IOMMU based systems too, eg hot plug thunderbolt should have it.
> >
> > I see that it is used only for dma-iommu and for PCI devices.
> > However, I think that should be a problem with other CCA solutions
> > with emulated devices as they are untrusted. As I'd expect they
> > would have virtio devices.
>
> Yes, any security solution with an out of TCB device should be using either
> memory encryption so the kernel already bounces or this trusted stuff
> and a force strict dma-iommu so the dma layer is careful.
>
> This is more policy from userspace what devices they want in or out of
> their TCB. Like you may accept the device into T=1 but then still
> want to keep it out of your TCB with the vIOMMU, I can see good
> arguments for something like that.
>
> > > > While we can debate the aesthetics of the setup , this is
> > > > the existing behaviour for Linux, which existed for years
> > > > and pKVM relies on and is used extensively.
> > > > And, this patch alters that long-standing logic and introduces
> > > > a functional regression.
> > >
> > > Yeah, Aneesh needs to do something here, I'm pointing out it is
> > > an entirely separate thing from the CC path we are working on, which is
> > > decoupling CC from relying on force swiotlb.
> >
> > I am looking into converting pKVM to use the CC stuff, I replied with
> > a patch to Aneesh in this thread. However, I need to do more testing
> > and make sure there are not any unwanted consequences.
>
> Yeah, it is a nice patch and I think it will help reduce the
> complexity if it aligns to CCA type stuff.
>
> > > In a pkvm world it should be the same, the S2 table for the SMMU will
> > > control what the device can access, and if the SMMU points to a
> > > "private" or "shared" page is not something the device needs to know
> > > or care about.
> >
> > I see that's because dma-iommu chooses the attrs for iommu_map().
>
> Long term the DMA API path through the dma-iommu will pass the
> ATTR_CC_SHARED through to iommu_map so when the arch requires a
> different IOPTE it can construct it.
>
> > In pKVM, dma_addr_t and IOPTE are the same for private and shared,
> > so nothing differs in that case.
>
> Yes, so you don't have to worry.
>
> > We don’t expect pass-through devices to interact with shared
> > memory (T=0) at the moment.
> > However, I can see use cases for that, where the host and the guest
> > collaborate with device passthrough and require zero copy.
>
> Once you add the CC patch it becomes immediately possible though
> because the user can allocate a CC shared DMA HEAP and feed that all
> over the place.
>
> > One other interesting case for device-passthrough is non-coherent
> > devices which then require private pools for bouncing.
>
> Why does shared/private matter for bouncing? Why do you need to bounce
> at all? Do cmo's not work in pkvm guests?
At the moment, in iommu_dma_map_phys(), if a non coherent device
tries to map an unaligned address or size it will be bounced.
In pKVM, dma-iommu is used for assigned devices which operate on
private memory, so bouncing that through the SWIOTLB would leak
information from the guest as the SWIOTLB is decrypted.
In that case, the device needs a pool which remains private.
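A minimal userspace sketch of the bounce decision described above. It is a simplified model, not the kernel's iommu_dma_map_phys() logic, which takes more conditions into account. The name sketch_needs_bounce() and the granule parameter are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: a non-coherent device must bounce a mapping whose
 * physical address or size is not aligned to the IOMMU granule.
 * "granule" must be a power of two (e.g. 4096).
 */
static bool sketch_needs_bounce(bool dev_coherent, uint64_t phys,
				size_t size, uint64_t granule)
{
	assert(granule != 0 && (granule & (granule - 1)) == 0);

	/* Coherent devices never bounce in this simplified model. */
	if (dev_coherent)
		return false;

	/* Any low bit set in either phys or size means "unaligned". */
	return ((phys | (uint64_t)size) & (granule - 1)) != 0;
}
```

In this model, any mapping for which sketch_needs_bounce() returns true would be copied through a bounce pool. For an assigned device operating on private memory, that pool would have to stay private, which is the point made above.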
Thanks,
Mostafa
>
> Jason
^ permalink raw reply
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Barry Song @ 2026-05-19 11:07 UTC (permalink / raw)
To: Yang Shi
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAHbLzkrOSoh-jmR=uaNvx73n_wn+vExoKY0UzH5zGcfdAiDbNg@mail.gmail.com>
On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen". How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> > >
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> > page faults. I assume this is due to buggy user code
> > or poor synchronization between reads and unmap().
> > So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> > is blocked by page-fault I/O in some applications.
> > The blocking occurs in the `dup_mmap()` path during
> > fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> > for_each_vma(vmi, mpnt) {
> > struct file *file;
> >
> > - retval = vma_start_write_killable(mpnt);
> > + /*
> > + * For anonymous or writable private VMAs, prevent
> > + * concurrent CoW faults.
> > + */
> > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > + (mpnt->vm_flags & VM_WRITE)))
> > + retval = vma_start_write_killable(mpnt);
> > if (retval < 0)
> > goto loop_out;
> > if (mpnt->vm_flags & VM_DONTCOPY) {
>
> Maybe a little bit off topic. This is an interesting idea. It seems
> possible we don't have to take vma write lock unconditionally. IIUC
> the write lock is mainly used to serialize against page fault and
> madvise, right? I got a crazy idea off the top of my head. We may be
> able to just take vma write lock iff vma->anon_vma is not NULL.
>
> First of all, write mmap_lock is held, so the vma can't go or be
> changed under us.
>
> Secondly, if vma->anon_vma is NULL, it basically means either no page
> fault happened or no cow happened, so there is no page table to copy,
> this is also what copy_page_range() does currently. So we can shrink
> the critical section to:
>
> if (vma->anon_vma) {
> vma_start_write_killable(src_vma);
> anon_vma_fork(dst_vma, src_vma);
> copy_page_range(dst_vma, src_vma);
> }
>
> But page fault can happen before write mmap_lock is taken, when we
> check vma->anon_vma, it is possible it has not been set up yet. But it
> seems to be equivalent to page fault after fork and won't break the
> semantic.
Re-reading Suren's commit log for fb49c455323ff8
("fork: lock VMAs of the parent process when forking"),
it seems that vma_start_write() is used to protect
against a race where anon_vma changes from NULL to
non-NULL during fork. In that scenario, we hold the
mmap_lock write lock, but not vma_start_write(), so a
concurrent anon_vma_prepare() could still install an
anon_vma.
" A concurrent page fault on a page newly marked read-only by the page
copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
source vma, defeating the anon_vma_clone() that wasn't done because the
parent vma originally didn't have an anon_vma, but we now might end up
copying a pte entry for a page that has one.
"
If that is the case, then your change does not work.
Nowadays, nobody calls anon_vma_prepare(vma) directly.
Instead, vmf_anon_prepare() is used, and we always
require the mmap_lock read lock before calling
__anon_vma_prepare(). As a result, anon_vma cannot
transition from NULL to non-NULL during fork.
So the original race condition has effectively
disappeared.
You also mentioned the madvise() case. If I understand
correctly, madvise() should take mmap_lock before
modifying anon_vma. Only some parts of madvise() can
support per-VMA locking. Therefore, we probably do not
need:
if (vma->anon_vma) {
vma_start_write_killable(src_vma);
...
}
>
> Anyway, just a crazy idea, I may miss some corner cases.
To me, it seems that we could remove vma_start_write()
entirely now. Or is that an even crazier idea?
Thanks
Barry
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 12:27 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxDxdxynp4KEovA@google.com>
Mostafa Saleh <smostafa@google.com> writes:
> On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
>> >>
>> >> What I meant was that we need a generic way to identify a pKVM guest, so
>> >> that we can use it in the conditional above.
>> >
>> > I have this patch, with that I can boot with your series unmodified,
>> > but I will need to do more testing.
>> >
>>
>> Thanks, I can add this to the series once you complete the required testing.
>>
>
> I am still running more tests, but looking more into it. Setting
> force_dma_unencrypted() to true for pKVM guests is wrong, as the
> guest shouldn’t try to decrypt arbitrary memory as it can include
> sensitive information (for example in case of virtio sub-page
> allocation) and should strictly rely on the restricted-dma-pool
> for that.
>
> However, with my patch and setting force_dma_unencrypted() to false
> on top of this series, it fails on pKVM due to a missing shared
> attribute as Alexey mentioned, as now SWIOTLB rejects non shared
> attrs, so, the DMA-API has to pass it. With that, I can boot again:
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 5103a04df99f..b19aeec03f27 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -286,6 +286,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> }
>
> if (is_swiotlb_for_alloc(dev)) {
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> page = dma_direct_alloc_swiotlb(dev, size, attrs);
> if (page) {
> /*
> @@ -449,6 +451,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> &cpu_addr, gfp, attrs);
>
> if (is_swiotlb_for_alloc(dev)) {
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> page = dma_direct_alloc_swiotlb(dev, size, attrs);
> if (!page)
> return NULL;
> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
> index 4e35264ab6f8..8ee5bbf78cfb 100644
> --- a/kernel/dma/direct.h
> +++ b/kernel/dma/direct.h
> @@ -92,6 +92,7 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
> return DMA_MAPPING_ERROR;
>
> + attrs |= DMA_ATTR_CC_SHARED;
> return swiotlb_map(dev, phys, size, dir, attrs);
> }
>
> --
>
How about the below?
modified kernel/dma/direct.c
@@ -278,6 +278,10 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
if (is_swiotlb_for_alloc(dev)) {
+
+ if (dev->dma_io_tlb_mem->unencrypted)
+ attrs |= DMA_ATTR_CC_SHARED;
+
page = dma_direct_alloc_swiotlb(dev, size, attrs);
if (page) {
/*
@@ -451,6 +455,10 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
&cpu_addr, gfp, attrs);
if (is_swiotlb_for_alloc(dev)) {
+
+ if (dev->dma_io_tlb_mem->unencrypted)
+ attrs |= DMA_ATTR_CC_SHARED;
+
page = dma_direct_alloc_swiotlb(dev, size, attrs);
if (!page)
return NULL;
modified kernel/dma/direct.h
@@ -92,6 +92,9 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
return DMA_MAPPING_ERROR;
+ if (dev->dma_io_tlb_mem->unencrypted)
+ attrs |= DMA_ATTR_CC_SHARED;
+
return swiotlb_map(dev, phys, size, dir, attrs);
}
>
>
> I will keep testing and let you know how it goes. If there is nothing
> else required to convert pKVM guests to CC, I can just post the patch
> separately as it has no dependency on this series.
>
That would be useful. I can then carry the patch as a dependent change,
which can also be merged separately.
>
> Re force_dma_unencrypted(), I am looking into a safe way to use it
> for pKVM as I believe it will be useful to eliminate some bouncing.
> However, that’s not critical for this series and can be added later
> as I am still investigating it, if I reach something I can post it
> along the pKVM patch above.
>
> Thanks,
> Mostafa
>
>>
>>
>> -aneesh
^ permalink raw reply
* Re: [PATCH v13 04/15] arm64: kexec_file: Fix potential buffer overflow in prepare_elf_headers()
From: Jinjie Ruan @ 2026-05-19 12:42 UTC (permalink / raw)
To: Breno Leitao
Cc: corbet, skhan, catalin.marinas, will, chenhuacai, kernel, maddy,
mpe, npiggin, chleroy, pjw, palmer, aou, alex, tglx, mingo, bp,
dave.hansen, hpa, robh, saravanak, akpm, bhe, rppt,
pasha.tatashin, pratyush, ruirui.yang, rdunlap, pmladek,
dapeng1.mi, kees, elver, kuba, ebiggers, lirongqing, paulmck,
sourabhjain, coxu, jbohac, ryan.roberts, osandov, cfsworks,
tangyouling, ritesh.list, adityag, guoren, songshuaishuai,
kevin.brodsky, vishal.moola, junhui.liu, wangruikang, namcao,
chao.gao, seanjc, fuqiang.wang, ardb, chenjiahao16, hbathini,
takahiro.akashi, james.morse, lizhengyu3, x86, linux-doc,
linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
linux-riscv, devicetree, kexec
In-Reply-To: <agGkvrg06KNDNfDi@gmail.com>
On 5/11/2026 5:46 PM, Breno Leitao wrote:
> On Mon, May 11, 2026 at 11:04:43AM +0800, Jinjie Ruan wrote:
>> There is a race condition between the kexec_load() system call
>> (crash kernel loading path) and memory hotplug operations that can
>> lead to buffer overflow and potential kernel crash.
>>
>> During prepare_elf_headers(), the following steps occur:
>> 1. The first for_each_mem_range() queries current System RAM memory ranges
>> 2. Allocates buffer based on queried count
>> 3. The 2nd for_each_mem_range() populates ranges from memblock
>>
>> If memory hotplug occurs between step 1 and step 3, the number of ranges
>> can increase, causing out-of-bounds write when populating cmem->ranges[].
>>
>> This happens because kexec_load() uses kexec_trylock (atomic_t) while
>> memory hotplug uses device_hotplug_lock (mutex), so they don't serialize
>> with each other.
>>
>> Add the explicit bounds checking to prevent out-of-bounds access.
>
> It seems you have a TOCTOU type of issue, and this seems to be shrinking
> the window, but not fully solving it?
I plan to fix this issue as follows, and would appreciate your feedback
on whether this is reasonable.
Sashiko AI code review pointed out there is a TOCTOU (Time-of-Check to
Time-of-Use) race condition in prepare_elf_headers() between the initial
pass that counts System RAM ranges and the second pass that populates them.
If a memory hotplug event occurs between these two steps, the number of
memory regions may increase, causing an out-of-bounds write to
the cmem->ranges[] array.
To resolve this and ensure data consistency, this patch:
1. Wraps the counting and population passes with get_online_mems() and
crash_hotplug_lock(). This serializes the kexec_file_load() path
with concurrent memory hotplug operations, ensuring the memory
map remains consistent throughout the header preparation.
2. Adds an explicit boundary check in prepare_elf64_ram_headers_callback().
If the number of ranges exceeds the allocated maximum, it now returns
-EAGAIN, which indicates a transient race, signaling userspace
kexec-tools to retry the syscall instead of leaving the system
without a loaded crash kernel.
index daf81a873bbd..546be6261177 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -15,6 +15,7 @@
#include <linux/kexec.h>
#include <linux/libfdt.h>
#include <linux/memblock.h>
+#include <linux/memory_hotplug.h>
#include <linux/of.h>
#include <linux/of_fdt.h>
#include <linux/slab.h>
@@ -40,7 +41,7 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
}
#ifdef CONFIG_CRASH_DUMP
-int prepare_elf_headers(void **addr, unsigned long *sz)
+static int __prepare_elf_headers(void **addr, unsigned long *sz)
{
struct crash_mem *cmem;
unsigned int nr_ranges;
@@ -59,6 +60,11 @@ int prepare_elf_headers(void **addr, unsigned long *sz)
cmem->max_nr_ranges = nr_ranges;
cmem->nr_ranges = 0;
for_each_mem_range(i, &start, &end) {
+ if (cmem->nr_ranges >= cmem->max_nr_ranges) {
+ ret = -EAGAIN;
+ goto out;
+ }
+
cmem->ranges[cmem->nr_ranges].start = start;
cmem->ranges[cmem->nr_ranges].end = end - 1;
cmem->nr_ranges++;
@@ -81,6 +87,21 @@ int prepare_elf_headers(void **addr, unsigned long *sz)
kfree(cmem);
return ret;
}
+
+int prepare_elf_headers(void **addr, unsigned long *sz)
+{
+ int ret;
+
+ crash_hotplug_lock();
+ get_online_mems();
+
+ ret = __prepare_elf_headers(addr, sz);
+
+ put_online_mems();
+ crash_hotplug_unlock();
+
+ return ret;
+}
#endif
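The count-then-populate pattern and its bounds check can be sketched in plain userspace C. This is an illustration under assumptions, not the kernel code: struct range_buf stands in for struct crash_mem, and the caller-supplied array stands in for the memblock ranges, which may have grown between the two passes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one memory range with an exclusive end. */
struct range {
	unsigned long start, end;
};

/* Hypothetical stand-in for struct crash_mem. */
struct range_buf {
	size_t max_nr_ranges;
	size_t nr_ranges;
	struct range ranges[];
};

/* Pass 1: size the buffer from the number of ranges counted right now. */
static struct range_buf *alloc_range_buf(size_t counted)
{
	struct range_buf *buf;

	buf = malloc(sizeof(*buf) + counted * sizeof(buf->ranges[0]));
	if (!buf)
		return NULL;
	buf->max_nr_ranges = counted;
	buf->nr_ranges = 0;
	return buf;
}

/*
 * Pass 2: populate. If more ranges exist than were counted in pass 1,
 * stop before writing out of bounds and report a transient failure.
 */
static int fill_range_buf(struct range_buf *buf, const struct range *src,
			  size_t nr_src)
{
	assert(buf != NULL && (src != NULL || nr_src == 0));

	for (size_t i = 0; i < nr_src; i++) {
		if (buf->nr_ranges >= buf->max_nr_ranges)
			return -EAGAIN;
		buf->ranges[buf->nr_ranges].start = src[i].start;
		buf->ranges[buf->nr_ranges].end = src[i].end - 1; /* inclusive end */
		buf->nr_ranges++;
	}
	return 0;
}
```

The check alone only turns the overflow into an error. Holding a lock across both passes, as the wrapper above does, is what keeps the two passes looking at the same set of ranges.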
>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Baoquan He <bhe@redhat.com>
>> Cc: Breno Leitao <leitao@debian.org>
>> Cc: stable@vger.kernel.org
>> Fixes: 3751e728cef2 ("arm64: kexec_file: add crash dump support")
>> Closes: https://sashiko.dev/#/patchset/20260323072745.2481719-1-ruanjinjie%40huawei.com
>> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
>> ---
>> arch/arm64/kernel/machine_kexec_file.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
>> index e31fabed378a..a67e7b1abbab 100644
>> --- a/arch/arm64/kernel/machine_kexec_file.c
>> +++ b/arch/arm64/kernel/machine_kexec_file.c
>> @@ -59,6 +59,11 @@ static int prepare_elf_headers(void **addr, unsigned long *sz)
>> cmem->max_nr_ranges = nr_ranges;
>> cmem->nr_ranges = 0;
>> for_each_mem_range(i, &start, &end) {
>> + if (cmem->nr_ranges >= cmem->max_nr_ranges) {
>> + ret = -ENOMEM;
>
> -ENOMEM seems to be the wrong errno. This isn't an allocation
> failure; it's a transient race. -EBUSY or -EAGAIN would be more honest
^ permalink raw reply related
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Lorenzo Stoakes @ 2026-05-19 12:43 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4zqLfdWoTH9s7FFaqWWj0mESfikYgr7=GcV64qcuXrPxA@mail.gmail.com>
On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > > page faults. I assume this is due to buggy user code
> > > or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > > is blocked by page-fault I/O in some applications.
> > > The blocking occurs in the `dup_mmap()` path during
> > > fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> >
> > Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.
I asked about the chained waiting :) I'm aware you've observed contention on
write lock, you said so in your LSF talk.
So have you observed that or is this a theory?
>
>
> >
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> >
> > Yeah I'm really not sure about that.
> >
> > Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> > page faults, which is really what fb49c455323ff is about.
> >
> > So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> > now you're saying 'let's change the forking behaviour that's been like that for
> > forever'.
>
>
> I am afraid not. Before we introduced the per-VMA lock, we
> were not performing I/O while holding `mmap_lock`. A page fault
> that needed I/O would drop the `mmap_lock` read lock and allow
> `fork()` to proceed.
Err I'm talking about fork? The patch you reference is a change to fork?
So you're saying that fb49c455323ff, which explicitly takes the VMA write lock on
fork, was somehow an addendum after fork didn't take the mmap write lock?
I must be imagining
https://elixir.bootlin.com/linux/v6.0/source/kernel/fork.c#L590 then in v6.0
pre-vma locks :)
I suspect that's _not_ what you're saying, so now what you're suggesting as I
stated above, is to fundamentally change fork behaviour to account for the
existing per-VMA lock behaviour on the fault path?
Again I state - are you really sure you want to fundamentally change fork
behaviour for this?
I am extremely concerned about doing that.
>
> Now, you are suggesting performing I/O while holding the VMA
> lock, which changes the requirements and introduces this
> problem.
>
> >
> > I think you would _really_ have to be sure that's safe. And forking is a very
> > dangerous time in terms of complexity and sensitivity and 'weird stuff'
> > happening so I'd tread _very_ carefully here.
>
> Yep. I think my original proposal did not require any changes
> to `fork()`, since it simply preserved the current behavior of
> dropping the VMA lock before performing I/O. In that model,
> `fork()` would not end up waiting on I/O at all.
>
> What you are suggesting now appears to be performing I/O while
> holding the VMA lock, which in turn introduces the need to
> change `fork()`.
Again, you're saying we should fundamentally change the way fork has worked
forever to work around something else.
At LSF I raised the fact that Josef himself suggested we simply drop this I/O
waiting behaviour for file-backed mappings. Isn't there a way forward that way
rather than 'hey let's drop locks and hope for the best!'
I am really reticent about this because we've seen HORRIBLE bugs come from fork
behaviour, especially edge cases, and mm testing isn't great so I am basically
opposed to this, and you're not really convincing me here.
>
> >
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> > > for_each_vma(vmi, mpnt) {
> > > struct file *file;
> > >
> > > - retval = vma_start_write_killable(mpnt);
> > > + /*
> > > + * For anonymous or writable private VMAs, prevent
> > > + * concurrent CoW faults.
> > > + */
> >
> > To nitpick, I think the comment's confusing but also tells you you don't need a
> > specific anon check - writable private is sufficient. And it's not really just
> > CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> >
> > > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > + (mpnt->vm_flags & VM_WRITE)))
> > > + retval = vma_start_write_killable(mpnt);
> >
> > I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> > it R/W.
> >
> > I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> > likely PROT_NONE) is here, just do the second check?
> >
> > (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> > vma_test(mpnt, VMA_MAYWRITE_BIT))
>
> Yep, I can definitely refine the check further. But before
> doing that, I'd first like to confirm that we are aligned on
> the direction.
>
> If you still intend to hold the VMA lock while performing I/O,
> then I think we should fix `fork()` to avoid taking
> `vma_start_write()`.
Yeah or we could do something different, it isn't a case of you get to do one of
two options you propose - the maintainers decide which way is appropriate.
Of the two options dropping the lock on the fault path rather than this fork
insanity is my preference but I wonder if we can't find another way.
Let me read through the series and give more thoughts I guess.
>
> >
> > > if (retval < 0)
> > > goto loop_out;
> > > if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> >
> > Technical commentary above is sort of 'just cos' :) because I really question
> > doing this honestly.
>
> I think we either need to fix `fork()`, or keep the current
> behavior of dropping the VMA lock before performing I/O.
Yup you said :)
>
> >
> > I'd also like to get Suren's input, however.
>
> Yes. of course.
>
> >
> > Thanks, Lorenzo
>
> Best Regards
> Barry
Thanks, Lorenzo
^ permalink raw reply