LinuxPPC-Dev Archive on lore.kernel.org


* Re: [PATCH v14 0/8] arm64: add ARCH_HAS_COPY_MC support
From: Kefeng Wang @ 2026-05-18 15:05 UTC (permalink / raw)
  To: Ruidong Tian, catalin.marinas, will, rafael, tony.luck, guohanjun,
	mchehab, xueshuai, tongtiangen, james.morse, robin.murphy,
	andreyknvl, dvyukov, vincenzo.frascino, mpe, npiggin,
	ryabinin.a.a, glider, christophe.leroy, aneesh.kumar,
	naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev
In-Reply-To: <20260518084956.2538442-1-tianruidong@linux.alibaba.com>



On 5/18/2026 4:49 PM, Ruidong Tian wrote:
> This series continues Tong Tiangen's work on arm64 ARCH_HAS_COPY_MC
> support. We encounter the same problem, and from a forward-looking
> perspective, large-memory ARM machines such as Grace and Vera will suffer
> more from this class of issues, which motivates us to push this feature
> upstream.
> 
> Problem
> =========
> With the increase of memory capacity and density, the probability of memory
> error also increases. The increasing size and density of server RAM in data
> centers and clouds have shown increased uncorrectable memory errors.
> 
> Currently, more and more scenarios can tolerate memory errors, such as
> COW[1,2], KSM copy[3], coredump copy[4], khugepaged[5,6], uaccess copy[7],
> etc.

We have encountered more scenarios and have made more enhancements, e.g.:

  658be46520ce mm: support poison recovery from copy_present_page()
  aa549f923f5e mm: support poison recovery from do_cow_fault()
  f00b295b9b61 fs: hugetlbfs: support poisoned recover from hugetlbfs_migrate_folio()
  060913999d7a mm: migrate: support poisoned recover from migrate folio

I hope the architecture-related parts can get the relevant reviews
and responses.

Thanks.
> Solution
> =========
> 
> This patchset introduces a new processing framework on ARM64, which enables
> ARM64 to support error recovery in the above scenarios, and more scenarios
> can be expanded based on this in the future.
> 
> On arm64, memory errors are handled in do_sea(), which distinguishes two cases:
>   1. If the user state consumed the memory errors, the solution is to kill
>      the user process and isolate the error page.
>   2. If the kernel state consumed the memory errors, the solution is to
>      panic.
> 
> For case 2, an undifferentiated panic may not be the optimal choice, as some
> cases can be handled better. In some scenarios, such as uaccess, the panic
> can be avoided: if the uaccess fails due to a memory error, only the user
> process is affected, so returning an error to the caller and isolating the
> user page with the hardware memory error is a better choice.
> 
> [1] commit d302c2398ba2 ("mm, hwpoison: when copy-on-write hits poison, take page offline")
> [2] commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults")
> [3] commit 6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")
> [4] commit 245f09226893 ("mm: hwpoison: coredump: support recovery from dump_user_range()")
> [5] commit 98c76c9f1ef7 ("mm/khugepaged: recover from poisoned anonymous memory")
> [6] commit 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory")
> [7] commit 278b917f8cb9 ("x86/mce: Add _ASM_EXTABLE_CPY for copy user access")
> 
> ------------------
> Test result:
> 
> Tested on Kunpeng 920.
> 
> 1. copy_page(), copy_mc_page() basic function test pass, and the disassembly
>     contents remain the same before and after the refactor.
> 
> 2. copy_to/from_user() accessing a kernel NULL pointer raises a translation
>     fault, dumps an error message, then calls die(); test pass.
> 
> 3. Test following scenarios: copy_from_user(), get_user(), COW.
> 
>     Before patched: trigger a hardware memory error then panic.
>     After  patched: trigger a hardware memory error without panic.
> 
>     Testing step:
>     step1. start a user process.
>     step2. poison (einj) a page of the user process.
>     step3. the user process accesses the poisoned page in kernel mode, which triggers an SEA.
>     step4. the kernel does not panic; only the user process is killed and the poisoned
>            page is isolated. (before the patch, the kernel panics in do_sea())
> 
>     The above tests can also be reproduced using ras-tools, which provides
>     einj-based injection and validation for uaccess and COW scenarios.
>     Example usage:
> 
>       einj_mem_uc futex          # get_user
>       einj_mem_uc copyin         # copy_from_user
>       einj_mem_uc copy-on-write  # COW
> 
>     Link: https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> 
> ------------------
> 
> Benefits
> =========
> According to Huawei's statistics from their storage products, memory errors
> triggered in kernel-mode by COW and page cache read (uaccess) scenarios
> account for more than 50%. With this patchset deployed, all kernel panics
> caused by COW and page cache memory errors are eliminated.
> Alibaba Cloud has also observed memory errors occurring in uaccess contexts.
> 
> Since V13:
> 1. Changed MC-safe functions to return an error rather than kill the user
>     process. When a user program invokes a syscall and the kernel encounters
>     a memory error during uaccess, killing the process is unexpected; the
>     syscall should return an error.
> 2. Added FEAT_MOPS support for the copy_page_mc paths.
> 3. Refactored copy_page() and memcpy() on top of the shared memcpy_template,
>     reducing duplicated assembly code.
> 
> Since v12:
> Thanks to the suggestions of Jonathan, Mark, and Mauro, the following modifications
> are made:
> 1. Rebase to latest kernel version.
> 2. Patch1, add Jonathan's and Mauro's review-by.
> 3. Patch2, modified do_apei_claim_sea() according to Mark's and Jonathan's suggestions,
>     and optimized the commit message according to Mark's suggestions(Added description of
>     the impact on regular copy_to_user()).
> 4. Patch3, optimized the commit message according to Mauro's suggestions and add Jonathan's
>     review-by.
> 5. Patch4, modified copy_mc_user_highpage() and Optimized the commit message according to
>     Jonathan's suggestions(no functional changes).
> 6. Patch5, optimized the commit message according to Mauro's suggestions.
> 7. Patch4/5, FEAT_MOPS is added to the code logic. Currently, the fixup is not performed
>     on the MOPS instruction.
> 8. Remove patch6 in v12 according to Jonathan's suggestions.
> 
> Since v11:
> 1. Rebase to latest kernel version 6.9-rc1.
> 2. Add patch 5, Since the problem described in "Since V10 Besides 3" has
>     been solved in a50026bdb867 ('iov_iter: get rid of 'copy_mc' flag').
> 3. Add the benefit of applying the patch set to our company to the description of patch0.
> 
> Since V10:
>   According to Mark's suggestion:
>   1. Merge V10's patch2 and patch3 to V11's patch2.
>   2. Patch2(V11): use new fixup_type for ld* in copy_to_user(), fix fatal
>      issues (NULL kernel pointer access) being fixed up incorrectly.
>   3. Patch2(V11): refactoring the logic of do_sea().
>   4. Patch4(V11): Remove duplicate assembly logic and remove do_mte().
> 
>   Besides:
>   1. Patch2(V11): remove st* insn's fixup, st* generally does not trigger memory errors.
>   2. Split a part of the logic of patch2(V11) to patch5(V11), for detail,
>      see patch5(V11)'s commit msg.
>   3. Remove patch6(v10) “arm64: introduce copy_mc_to_kernel() implementation”.
>      During modification, some problems that cannot be solved in a short
>      period are found. The patch will be released after the problems are
>      solved.
>   4. Add test result in this patch.
>   5. Modify patchset title, do not use machine check and remove "-next".
> 
> Since V9:
>   1. Rebase to latest kernel version 6.8-rc2.
>   2. Add patch 6/6 to support copy_mc_to_kernel().
> 
> Since V8:
>   1. Rebase to latest kernel version and fix typos in some of the patches.
>   2. According to the suggestion of Catalin, I attempted to modify the
>      return value of function copy_mc_[user]_highpage() to bytes not copied.
>      During the modification process, I found that it would be more
>      reasonable to return -EFAULT when copy error occurs (referring to the
>      newly added patch 4).
> 
>      For ARM64, the implementation of copy_mc_[user]_highpage() needs to
>      consider MTE. Considering the scenario where data copying is successful
>      but the MTE tag copying fails, it is also not reasonable to return
>      bytes not copied.
>   3. Considering the recent addition of machine check safe support for
>      multiple scenarios, modify commit message for patch 5 (patch 4 for V8).
> 
> Since V7:
>   Patches supporting recovery from poison consumption in the COW
>   scenario now exist[1]. Therefore, supporting the COW scenario on
>   arm64 only requires modifying the relevant code under arch/.
>   [1]https://lore.kernel.org/lkml/20221031201029.102123-1-tony.luck@intel.com/
> 
> Since V6:
>   Resend patches that are not merged into the mainline in V6.
> 
> Since V5:
>   1. Add patch2/3 to add uaccess assembly helpers.
>   2. Optimize the implementation logic of arm64_do_kernel_sea() in patch8.
>   3. Remove kernel access fixup in patch9.
>   All suggestions are from Mark.
> 
> Since V4:
>   1. According to Michael's suggestion, add patch5.
>   2. According to Mark's suggestion, do some restructuring of the arm64
>   extable, then a new adaptation of machine check safe support is made based
>   on this.
>   3. According to Mark's suggestion, support machine check safe in do_mte() in
>   the COW scenario.
>   4. In V4, two patches have been merged into -next, so V5 does not send these
>   two patches.
> 
> Since V3:
>   1. According to Robin's suggestion, direct modify user_ldst and
>   user_ldp in asm-uaccess.h and modify mte.S.
>   2. Add new macro USER_MC in asm-uaccess.h, used in copy_from_user.S
>   and copy_to_user.S.
>   3. According to Robin's suggestion, use a macro in copy_page_mc.S to
>   simplify the code.
>   4. According to KeFeng's suggestion, modify powerpc code in patch1.
>   5. According to KeFeng's suggestion, modify mm/extable.c and some code
>   optimization.
> 
> Since V2:
>   1. According to Mark's suggestion, all uaccess can be recovered due to
>      memory error.
>   2. Scenario pagecache reading is also supported as part of uaccess
>      (copy_to_user()) and duplication code problem is also solved.
>      Thanks for Robin's suggestion.
>   3. According to Mark's suggestion, update commit message of patch 2/5.
>   4. According to Borislav's suggestion, update commit message of patch 1/5.
> 
> Since V1:
>   1. Consistent with PPC/x86, use CONFIG_ARCH_HAS_COPY_MC instead of
>      ARM64_UCE_KERNEL_RECOVERY.
>   2. Add two new scenarios, COW and pagecache reading.
>   3. Fix two small bugs (the first two patches).
> 
> V1 in here:
> https://lore.kernel.org/lkml/20220323033705.3966643-1-tongtiangen@huawei.com/
> 
> Ruidong Tian (3):
>    ACPI: APEI: GHES: use exception context to gate SIGBUS on poison
>      consumption
>    lib/test: memcpy_kunit: add copy_page() and copy_mc_page() tests
>    lib/tests: memcpy_kunit: add memcpy_mc() and memcpy_mc_large() test
> 
> Tong Tiangen (5):
>    uaccess: add generic fallback version of copy_mc_to_user()
>    arm64: add support for ARCH_HAS_COPY_MC
>    mm/hwpoison: return -EFAULT when copy fail in
>      copy_mc_[user]_highpage()
>    arm64: support copy_mc_[user]_highpage()
>    arm64: introduce copy_mc_to_kernel() implementation
> 
>   arch/arm64/Kconfig                   |   1 +
>   arch/arm64/include/asm/asm-extable.h |  22 ++-
>   arch/arm64/include/asm/asm-uaccess.h |   4 +
>   arch/arm64/include/asm/extable.h     |   1 +
>   arch/arm64/include/asm/mte.h         |   9 +
>   arch/arm64/include/asm/page.h        |  10 ++
>   arch/arm64/include/asm/string.h      |   5 +
>   arch/arm64/include/asm/uaccess.h     |  17 ++
>   arch/arm64/kernel/acpi.c             |   2 +-
>   arch/arm64/lib/Makefile              |   2 +
>   arch/arm64/lib/copy_mc_page.S        |  44 +++++
>   arch/arm64/lib/copy_page.S           |  62 +------
>   arch/arm64/lib/copy_page_template.S  |  71 ++++++++
>   arch/arm64/lib/copy_to_user.S        |  10 +-
>   arch/arm64/lib/memcpy.S              | 253 ++-------------------------
>   arch/arm64/lib/memcpy_mc.S           |  56 ++++++
>   arch/arm64/lib/memcpy_template.S     | 249 ++++++++++++++++++++++++++
>   arch/arm64/lib/mte.S                 |  29 +++
>   arch/arm64/mm/copypage.c             |  75 ++++++++
>   arch/arm64/mm/extable.c              |  21 +++
>   arch/arm64/mm/fault.c                |  30 +++-
>   arch/powerpc/include/asm/uaccess.h   |   1 +
>   arch/x86/include/asm/uaccess.h       |   1 +
>   drivers/acpi/apei/ghes.c             |  36 ++--
>   include/acpi/ghes.h                  |   6 +-
>   include/linux/highmem.h              |  16 +-
>   include/linux/uaccess.h              |   8 +
>   lib/tests/memcpy_kunit.c             | 178 ++++++++++++++++++-
>   mm/kasan/shadow.c                    |  12 ++
>   mm/khugepaged.c                      |   4 +-
>   30 files changed, 904 insertions(+), 331 deletions(-)
>   create mode 100644 arch/arm64/lib/copy_mc_page.S
>   create mode 100644 arch/arm64/lib/copy_page_template.S
>   create mode 100644 arch/arm64/lib/memcpy_mc.S
>   create mode 100644 arch/arm64/lib/memcpy_template.S
> 
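
Editorial sketch (not part of the posted series): the quoted cover letter describes do_sea() killing the task for user-mode consumption, panicking for kernel-mode consumption, and, with this series, recovering when the faulting kernel instruction carries a dedicated fixup type. The userspace model below illustrates only that decision. All names here (`enum fixup_type`, `handle_sea()`, `search_extable()` as written) are hypothetical stand-ins and assumptions, not the identifiers used in the patches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixup classes; illustrative only, not the series' real types. */
enum fixup_type {
	FIXUP_UACCESS,	/* ordinary uaccess fault, e.g. a bad pointer */
	FIXUP_MC_SAFE,	/* load that is allowed to consume a memory error */
};

/* Simplified exception-table entry: faulting insn -> resume address. */
struct extable_entry {
	uintptr_t insn;
	uintptr_t fixup;
	enum fixup_type type;
};

enum sea_action { SEA_KILL_TASK, SEA_FIXUP, SEA_PANIC };

/* Linear lookup by faulting PC (the kernel uses a sorted table instead). */
static const struct extable_entry *
search_extable(const struct extable_entry *tbl, size_t n, uintptr_t pc)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].insn == pc)
			return &tbl[i];
	return NULL;
}

/*
 * Model of the decision described in the cover letter:
 *  - user mode consumed the error: kill the task, isolate the page;
 *  - kernel mode with an MC-safe fixup: resume at the fixup address so the
 *    caller can return an error;
 *  - anything else in kernel mode: panic, as before.
 */
static enum sea_action
handle_sea(const struct extable_entry *tbl, size_t n, uintptr_t pc,
	   bool user_mode, uintptr_t *resume_pc)
{
	const struct extable_entry *e;

	if (user_mode)
		return SEA_KILL_TASK;

	e = search_extable(tbl, n, pc);
	if (e && e->type == FIXUP_MC_SAFE) {
		*resume_pc = e->fixup;
		return SEA_FIXUP;
	}
	return SEA_PANIC;
}
```

Keeping a separate type for MC-safe entries matches the "Since V10" note above: an ordinary uaccess fixup must not silently absorb a fatal fault such as a NULL kernel pointer access, so in this model only entries explicitly marked MC-safe avoid the panic.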




* Re: [PATCH v2 0/6] fsl-mc: Move over to device MSI infrastructure
From: Arnd Bergmann @ 2026-05-18 15:24 UTC (permalink / raw)
  To: Marc Zyngier, Christophe Leroy
  Cc: Ioana Ciornei, Thomas Gleixner, Sascha Bischoff, linux-kernel,
	linux-arm-kernel, linuxppc-dev
In-Reply-To: <86zf1wx0b4.wl-maz@kernel.org>

On Mon, May 18, 2026, at 16:24, Marc Zyngier wrote:
> On Mon, 18 May 2026 14:51:48 +0100,
> "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> wrote:
>>
>> > > Do I need to respin it?
>> 
>> No, I'd like to avoid having to rebase again. If you have changes to
>> the series please send followup patches.

Sorry this got held up even longer now. I meant to reply
earlier but dropped the ball on that while sending the merge
window contents.

This was indeed bad timing as the original pull request reached
me only after 7.0 was already out.

> No follow-up patches for that particular series, I just wanted to find
> out whether I could start posting additional changes that do not
> directly involve fsl-mc, but that are prevented by the current state
> of the code (such as trying to move the ITS initialisation much later
> in the boot process).
>
> I'll postpone my changes to 7.3, and keep my fingers crossed for this
> to hit 7.2.

I've merged the soc_fsl-7.1-2 tag into the soc/drivers branch
for 7.2 now. You should be able to base your other changes on top
of f0a2eac6a597 ("platform-msi: Remove stale comment") as a shared
branch.

    Arnd



* Re: [PATCH 2/5] arm/pci: Use official API to iterate over PCI buses
From: Gerd Bayer @ 2026-05-18 15:45 UTC (permalink / raw)
  To: Russell King
  Cc: Yinghai Lu, linux-alpha, linux-kernel, linux-arm-kernel,
	linuxppc-dev, linux-pci, Richard Henderson, Matt Turner,
	Magnus Lindholm, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Bjorn Helgaas,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Gerd Bayer
In-Reply-To: <20260515-priv_root_buses-v1-2-f8e393c57390@linux.ibm.com>

On Fri, 2026-05-15 at 16:22 +0200, Gerd Bayer wrote:
> Replace iterating over pci_root_buses with the official
> pci_find_next_bus() call provided by PCI core. This allows to make
> pci_root_buses private to PCI core.
> 
> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
> ---
>  arch/arm/kernel/bios32.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/kernel/bios32.c b/arch/arm/kernel/bios32.c
> index ac0e890510da..35642c9ba054 100644
> --- a/arch/arm/kernel/bios32.c
> +++ b/arch/arm/kernel/bios32.c
> @@ -59,9 +59,9 @@ static void pcibios_bus_report_status(struct pci_bus *bus, u_int status_mask, in
>  
>  void pcibios_report_status(u_int status_mask, int warn)
>  {
> -	struct pci_bus *bus;
> +	struct pci_bus *bus = NULL;
>  
> -	list_for_each_entry(bus, &pci_root_buses, node)
> +	while ((bus = pci_find_next_bus(bus)) != NULL)
>  		pcibios_bus_report_status(bus, status_mask, warn);
>  }
>  

Hi Russell,

Sashiko
https://sashiko.dev/#/message/20260515145940.E85AAC2BCB0%40smtp.kernel.org
reported:

> Since pci_find_next_bus() unconditionally acquires the pci_bus_sem read-write
> semaphore using down_read(), this introduces a blocking operation into that
> atomic path:
> 
> dc21285_abort_irq() [hardirq context]
>   pcibios_report_status()
>     pci_find_next_bus()
>       down_read(&pci_bus_sem) [sleeps]
> 
> Does this path need an alternative approach to safely iterate over the buses
> without taking a sleeping lock?

IMHO, it looks like this entire pcibios_report_status() iterating over
all PCI buses and all their devices would be better off if moved
outside of the hardirq context?

Or could pcibios_report_status() be converted to use
for_each_pci_device()?

Any suggestions welcome...
Gerd



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Matthew Wilcox @ 2026-05-18 16:17 UTC (permalink / raw)
  To: Barry Song
  Cc: Lorenzo Stoakes, surenb, akpm, linux-mm, david, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4zqLfdWoTH9s7FFaqWWj0mESfikYgr7=GcV64qcuXrPxA@mail.gmail.com>

On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.

It's always been a terrible idea to call fork() from a multithreaded
application.  For example, this question:

https://stackoverflow.com/questions/53601200/calling-fork-on-a-multithreaded-process

or this lwn thread: https://lwn.net/Articles/674660/

Do we have any insight into why these applications are doing this
horrible thing?



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Suren Baghdasaryan @ 2026-05-18 19:56 UTC (permalink / raw)
  To: Barry Song
  Cc: Lorenzo Stoakes, Matthew Wilcox, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4zqLfdWoTH9s7FFaqWWj0mESfikYgr7=GcV64qcuXrPxA@mail.gmail.com>

On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen".  How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > >    page faults. I assume this is due to buggy user code
> > >    or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > >    is blocked by page-fault I/O in some applications.
> > >    The blocking occurs in the `dup_mmap()` path during
> > >    fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> >
> > Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.
>
>
> >
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> >
> > Yeah I'm really not sure about that.
> >
> > Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> > page faults, which is really what fb49c455323ff is about.
> >
> > So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> > now you're saying 'let's change the forking behaviour that's been like that for
> > forever'.
>
>
> I am afraid not. Before we introduced the per-VMA lock, we
> were not performing I/O while holding `mmap_lock`. A page fault
> that needed I/O would drop the `mmap_lock` read lock and allow
> `fork()` to proceed.
>
> Now, you are suggesting performing I/O while holding the VMA
> lock, which changes the requirements and introduces this
> problem.
>
> >
> > I think you would _really_ have to be sure that's safe. And forking is a very
> > dangerous time in terms of complexity and sensitivity and 'weird stuff'
> > happening so I'd tread _very_ carefully here.
>
> Yep. I think my original proposal did not require any changes
> to `fork()`, since it simply preserved the current behavior of
> dropping the VMA lock before performing I/O. In that model,
> `fork()` would not end up waiting on I/O at all.
>
> What you are suggesting now appears to be performing I/O while
> holding the VMA lock, which in turn introduces the need to
> change `fork()`.
>
> >
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > *mm, struct mm_struct *oldmm)
> > >         for_each_vma(vmi, mpnt) {
> > >                 struct file *file;
> > >
> > > -               retval = vma_start_write_killable(mpnt);
> > > +               /*
> > > +                * For anonymous or writable private VMAs, prevent
> > > +                * concurrent CoW faults.
> > > +                */
> >
> > > To nitpick, I think the comment's confusing, but it also tells you that you don't need a
> > > specific anon check - writable private is sufficient. And it's not really just
> > > CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> >
> > > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > +                                       (mpnt->vm_flags & VM_WRITE)))
> > > +                       retval = vma_start_write_killable(mpnt);
> >
> > I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> > it R/W.
> >
> > I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> > likely PROT_NONE) is here, just do the second check?
> >
> > (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> > vma_test(mpnt, VMA_MAYWRITE_BIT))
>
> Yep, I can definitely refine the check further. But before
> doing that, I'd first like to confirm that we are aligned on
> the direction.
>
> If you still intend to hold the VMA lock while performing I/O,
> then I think we should fix `fork()` to avoid taking
> `vma_start_write()`.
>
> >
> > >                 if (retval < 0)
> > >                         goto loop_out;
> > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> >
> > Technical commentary above is sort of 'just cos' :) because I really question
> > doing this honestly.
>
> I think we either need to fix `fork()`, or keep the current
> behavior of dropping the VMA lock before performing I/O.

I see. So, this problem arises from the fact that we are changing
page faults that require I/O to hold the VMA lock...
And you want to lock the VMA on fork only if vma_is_anonymous(vma) ||
is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
anonymous and COW VMAs only, while holding mmap_write_lock, which prevents
any VMA modification. On the surface, that looks ok to me but I might
be missing some corner cases. If nobody sees any obvious issues, I
think it's worth a try.
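
Editorial sketch (not from the thread): the two predicates under discussion can be compared in a small userspace model. The flag values are assumed to mirror include/linux/mm.h, `is_cow_mapping_model()` follows the shape of the kernel's is_cow_mapping(), and `struct vma_model` with its `anon` field is a hypothetical simplification that collapses both the `!vm_file` test from the diff and vma_is_anonymous() into one boolean.

```c
#include <stdbool.h>

/* Assumed to match include/linux/mm.h; treat the values as illustrative. */
#define VM_WRITE	0x00000002UL
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/* Hypothetical stand-in for struct vm_area_struct. */
struct vma_model {
	unsigned long vm_flags;
	bool anon;	/* models !vma->vm_file / vma_is_anonymous() */
};

/* Check as written in the proposed dup_mmap() hunk (uses VM_WRITE). */
static bool needs_write_lock_proposed(const struct vma_model *v)
{
	return v->anon ||
	       (!(v->vm_flags & VM_SHARED) && (v->vm_flags & VM_WRITE));
}

/* Same shape as the kernel's is_cow_mapping(): private and may be written. */
static bool is_cow_mapping_model(unsigned long flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Variant following the review comments (VM_MAYWRITE instead of VM_WRITE). */
static bool needs_write_lock_reviewed(const struct vma_model *v)
{
	return v->anon || is_cow_mapping_model(v->vm_flags);
}
```

The two variants differ for a private file mapping that is currently read-only but could later be made writable with mprotect(): the VM_WRITE form skips the lock, while the VM_MAYWRITE form takes it, which is the case Lorenzo raised.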




>
> >
> > I'd also like to get Suren's input, however.
>
> Yes. of course.
>
> >
> > Thanks, Lorenzo
>
> Best Regards
> Barry



* Re: [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Bjorn Helgaas @ 2026-05-18 20:29 UTC (permalink / raw)
  To: Yury Murashka
  Cc: bhelgaas, mahesh, oohall, corbet, skhan, linux-pci, linux-doc,
	linux-kernel, linuxppc-dev, Lukas Wunner
In-Reply-To: <CAPzpGcRCTCZtaX1EVaJNZ103THZKsoszZduY7=gwfYdcrMo-SQ@mail.gmail.com>

[+cc Lukas]

On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
> If a new AER error is subsequently reported, the AER driver calls
> find_source_device() to find the source of the error. It rescans the
> whole bus and picks the first device reporting an AER error. Because the
> previous error was never cleared, the error is attributed to the wrong
> device and AER recovery is started for the wrong device.
> 
> Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
> AER error status even when recovery fails, preventing stale errors from
> causing incorrect device identification on subsequent AER events.

Why should we add a kernel parameter for this?  How would a user
decide whether to use the parameter?  Are there cases where we
find the source of the first error, but we *wouldn't* want to clear
it if recovery fails?
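
Editorial sketch (not from the thread): the misattribution described in the quoted patch text can be reproduced with a toy model. It assumes, as that description states, that source lookup picks the first device with logged error status. `struct dev_model`, `find_source_model()` and `recover_model()` are hypothetical names, and the real find_source_device() logic is more involved than this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model: nonzero status means an error is logged. */
struct dev_model {
	const char *name;
	unsigned int uncor_status;
};

/* Return the index of the first device with logged status, or -1. */
static int find_source_model(const struct dev_model *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].uncor_status)
			return (int)i;
	return -1;
}

/*
 * Model of recovery: status is cleared on success, and on failure only
 * when clear_on_failure is set (the behavior the patch makes optional).
 */
static void recover_model(struct dev_model *dev, bool success,
			  bool clear_on_failure)
{
	if (success || clear_on_failure)
		dev->uncor_status = 0;
}
```

In this model, a failed recovery on the first device without clearing leaves its status set, so a later error on a second device is still attributed to the first one; clearing on failure removes that stale match.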



* Re: [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Yury M. @ 2026-05-18 20:49 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, mahesh, oohall, corbet, skhan, linux-pci, linux-doc,
	linux-kernel, linuxppc-dev, Lukas Wunner
In-Reply-To: <20260518202903.GA641158@bhelgaas>

The current behavior has existed for a long time, and I could easily imagine
that there is software which relies on the system being left in a
non-modified state if AER recovery failed. The software can analyze the
system and do cleanup afterwards. Sometimes, if something fails in the
system, it is better to have it in a non-modified state.
In short, I just wanted to preserve the current logic by default because
there is a chance that we have software which relies on the current
behavior.

On 5/18/26 21:29, Bjorn Helgaas wrote:
> [+cc Lukas]
>
> On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
>> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
>> If a new AER error is subsequently reported, the AER driver calls
>> find_source_device() to find the source of the error. It rescans the
>> whole bus and picks the first device reporting an AER error. Because the
>> previous error was never cleared, the error is attributed to the wrong
>> device and AER recovery is started for the wrong device.
>>
>> Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
>> AER error status even when recovery fails, preventing stale errors from
>> causing incorrect device identification on subsequent AER events.
> Why should we add a kernel parameter for this?  How would a user
> decide whether to use the parameter?  Are there cases where we
> find the source of the first error, but we *wouldn't* want to clear
> it if recovery fails?



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Barry Song @ 2026-05-18 20:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, surenb, akpm, linux-mm, david, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <ags7mPK7Ong0ZsBf@casper.infradead.org>

On Tue, May 19, 2026 at 12:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> > We have clearly observed that the `fork()` operations of many
> > popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> > end up waiting on page-fault (PF) I/O when the VMA lock is
> > held during I/O operations. This has already become a
> > practical issue. I also believe this can lead to chained
> > waiting, since the global `mmap_lock` blocks all threads that
> > need to acquire it.
>
> It's always been a terrible idea to call fork() from a multithreaded
> application.  For example, this question:
>
> https://stackoverflow.com/questions/53601200/calling-fork-on-a-multithreaded-process
>
> or this lwn thread: https://lwn.net/Articles/674660/
>
> Do we have any insight into why these applications are doing this
> horrible thing?

I swear I read the two links you shared. But the reality
is that as long as people use the Android framework,
even the simplest "Hello World" app already runs with
10+ threads :-)


main
RenderThread
ReferenceQueueDaemon
FinalizerDaemon
FinalizerWatchdogDaemon
HeapTaskDaemon
Binder:1234_1
Binder:1234_2
Signal Catcher
JDWP
...

Best Regards
Barry


* Re: cleanup the RAID6 P/Q library v3
From: Andrew Morton @ 2026-05-18 21:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Huacai Chen,
	WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Herbert Xu, Dan Williams,
	Chris Mason, David Sterba, Arnd Bergmann, Song Liu, Yu Kuai,
	Li Nan, linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, linux-crypto, linux-btrfs, linux-arch,
	linux-raid
In-Reply-To: <20260518051804.462141-1-hch@lst.de>

On Mon, 18 May 2026 07:17:43 +0200 Christoph Hellwig <hch@lst.de> wrote:

> this series cleans up the RAID6 P/Q library to match the recent updates
> to the RAID 5 XOR library and other CRC/crypto libraries.  This includes
> providing properly documented external interfaces, hiding the internals,
> using static_call instead of indirect calls and turning the user space
> test suite into an in-kernel kunit test which is also extended to
> improve coverage.

Cool, I'll add this to mm.git's mm-nonmm-unstable branch for some
linux-next testing.

AI review found quite a lot to talk about:
	https://sashiko.dev/#/patchset/20260518051804.462141-1-hch@lst.de


* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Barry Song @ 2026-05-18 21:14 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Lorenzo Stoakes, Matthew Wilcox, akpm, linux-mm, david, liam,
	vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
	lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
	nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAJuCfpE0WQrB3zJp9qn3jvn5DthS=ttpX7gJJvyEhA_BJGrp5g@mail.gmail.com>

On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > >
> > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
[...]
> >
> > I think we either need to fix `fork()`, or keep the current
> > behavior of dropping the VMA lock before performing I/O.
>
> I see. So, this problem arises from the fact that we are changing the
> pagefaults requiring I/O operation to hold VMA lock...
> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> any VMA modification. On the surface, that looks ok to me but I might
> be missing some corner cases. If nobody sees any obvious issues, I
> think it's worth a try.
>
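To make the condition discussed above concrete, here is a minimal user-space sketch of the predicate "lock the VMA on fork only if vma_is_anonymous(vma) || is_cow_mapping(vma->vm_flags)". The two predicates follow the kernel helpers of the same names as far as I know them; `struct mock_vma`, its `has_vm_ops` field, and `fork_needs_vma_write_lock()` are hypothetical names used only for illustration, and the flag values should be treated as illustrative too. This is not the proposed patch, only a way to see which VMAs would still be write-locked.

```c
#include <stdbool.h>

/* Flag bits modelled on the kernel's VM_* flags; illustrative values. */
#define VM_WRITE    0x00000002UL
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* Hypothetical, stripped-down stand-in for struct vm_area_struct. */
struct mock_vma {
	unsigned long vm_flags;
	bool has_vm_ops;	/* models vma->vm_ops != NULL */
};

/* A COW mapping is private (not VM_SHARED) and may become writable. */
static bool is_cow_mapping(unsigned long flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Anonymous VMAs are modelled here as VMAs without vm_ops. */
static bool vma_is_anonymous(const struct mock_vma *vma)
{
	return !vma->has_vm_ops;
}

/* Hypothetical helper: would fork() still take the VMA write lock? */
static bool fork_needs_vma_write_lock(const struct mock_vma *vma)
{
	return vma_is_anonymous(vma) || is_cow_mapping(vma->vm_flags);
}
```

Under this sketch a shared file mapping (VM_SHARED set, vm_ops present) is skipped, so page-fault I/O on it would not block fork(), while anonymous and private writable mappings are still locked.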

Thanks. Besides the creation of processes via fork(), I
am also beginning to worry about the death of processes.

One thing that came to my mind this morning
is that when lowmemorykiller decides to kill an app, we
want the memory to be released as quickly as possible so
the new app or user scenario can get memory sooner.

In that case, if the app being killed is performing I/O
while holding the VMA lock, the unmapping procedure
could end up being blocked as well.

If we release the VMA lock as we currently do, we allow
process exit to proceed.

I haven't thought it through very clearly yet, and I
may be wrong. I'd like to do more investigation. I hope
the apps being killed stay very still, but who knows—we
have so many applications in the market.

Meanwhile, if you have any comments regarding the death
of processes, they would be very welcome.

Best Regards
Barry

* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Yang Shi @ 2026-05-18 21:21 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4ysMcrmDLSOwBkf7qwCQrcDWeEMXkHDajTJFMLKUk0bSQ@mail.gmail.com>

On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
>
> On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > both the hardware and the software stack (bio/request queues and the
> > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > for an unpredictable amount of time.
> > > >
> > > > But does that actually happen?  I find it hard to believe that thread A
> > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > it still seems really unlikely to me.
> > >
> > > It doesn’t have to involve unmapping or applying mprotect to
> > > the entire VMA—just a portion of it is sufficient.
> >
> > Yes, but that still fails to answer "does this actually happen".  How much
> > performance is all this complexity in the page fault handler buying us?
> > If you don't answer this question, I'm just going to go in and rip it
> > all out.
> >
>
> Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> waiting for answers),
>
> As promised during LSF/MM/BPF, we conducted thorough
> testing on Android phones to determine whether performing
> I/O in `filemap_fault()` can block `vma_start_write()`.
> I wanted to give a quick update on this question.
>
> Nanzhe at Xiaomi created tracing scripts and ran various
> applications on Android devices with I/O performed under
> the VMA lock in `filemap_fault()`. We found that:
>
> 1. There are very few cases where unmap() is blocked by
>    page faults. I assume this is due to buggy user code
>    or poor synchronization between reads and unmap().
>    So I assume it is not a problem.
>
> 2. We observed many cases where `vma_start_write()`
>    is blocked by page-fault I/O in some applications.
>    The blocking occurs in the `dup_mmap()` path during
>    fork().
>
> With Suren's commit fb49c455323ff ("fork: lock VMAs of
> the parent process when forking"), we now always hold
> `vma_write_lock()` for each VMA. Note that the
> `mmap_lock` write lock is also held, which could lead to
> chained waiting if page-fault I/O is performed without
> releasing the VMA lock.
>
> My gut feeling is that Suren's commit may be overshooting,
> so my rough idea is that we might want to do something like
> the following (we haven't tested it yet and it might be
> wrong):
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..5ddaf297f31a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>         for_each_vma(vmi, mpnt) {
>                 struct file *file;
>
> -               retval = vma_start_write_killable(mpnt);
> +               /*
> +                * For anonymous or writable private VMAs, prevent
> +                * concurrent CoW faults.
> +                */
> +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> +                                       (mpnt->vm_flags & VM_WRITE)))
> +                       retval = vma_start_write_killable(mpnt);
>                 if (retval < 0)
>                         goto loop_out;
>                 if (mpnt->vm_flags & VM_DONTCOPY) {

Maybe a little bit off topic. This is an interesting idea. It seems
possible we don't have to take vma write lock unconditionally. IIUC
the write lock is mainly used to serialize against page fault and
madvise, right? I got a crazy idea off the top of my head. We may be
able to just take vma write lock iff vma->anon_vma is not NULL.

First of all, write mmap_lock is held, so the vma can't go away or be
changed under us.

Secondly, if vma->anon_vma is NULL, it basically means either no page
fault happened or no cow happened, so there is no page table to copy,
this is also what copy_page_range() does currently. So we can shrink
the critical section to:

if (src_vma->anon_vma) {
    vma_start_write_killable(src_vma);
    anon_vma_fork(dst_vma, src_vma);
    copy_page_range(dst_vma, src_vma);
}

A page fault can still happen before the write mmap_lock is taken, so
when we check vma->anon_vma it may not have been set up yet. But that
seems equivalent to a page fault happening after fork, so it won't
break the semantics.

Anyway, just a crazy idea, I may miss some corner cases.

Thanks,
Yang

>
> Based on the above, we may want to re-check whether fork()
> can be blocked by page faults. At the same time, if Suren,
> you, or anyone else has any comments, please feel free to
> share them.
>
> Best Regards
> Barry
>


* Re: [PATCH v5 00/14] module: Introduce hash-based integrity checking
From: Sami Tolvanen @ 2026-05-18 21:55 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Nathan Chancellor,
	Nicolas Schier, Arnd Bergmann, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Paul Moore, James Morris, Serge E. Hallyn,
	Jonathan Corbet, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Naveen N Rao, Mimi Zohar, Roberto Sassu,
	Dmitry Kasatkin, Eric Snowberg, Nicolas Schier, Daniel Gomez,
	Aaron Tomlin, Christophe Leroy (CS GROUP), Nicolas Bouchinet,
	Xiu Jianfeng, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jiri Olsa, bpf, Fabian Grünbichler, Arnout Engelen,
	Mattia Rizzolo, kpcyrd, Christian Heusel, Câju Mihai-Drosi,
	Eric Biggers, Sebastian Andrzej Siewior, linux-kbuild,
	linux-kernel, linux-arch, linux-modules, linux-security-module,
	linux-doc, linuxppc-dev, linux-integrity, debian-kernel
In-Reply-To: <20260505-module-hashes-v5-0-e174a5a49fce@weissschuh.net>

Hi Thomas,

On Tue, May 05, 2026 at 11:05:04AM +0200, Thomas Weißschuh wrote:
> The current signature-based module integrity checking has some drawbacks
> in combination with reproducible builds. Either the module signing key
> is generated at build time, which makes the build unreproducible, or a
> static signing key is used, which precludes rebuilds by third parties
> and makes the whole build and packaging process much more complicated.
> 
> The goal is to reach bit-for-bit reproducibility. Excluding certain
> parts of the build output from the reproducibility analysis would be
> error-prone and force each downstream consumer to introduce new tooling.
> 
> Introduce a new mechanism to ensure only well-known modules are loaded
> by embedding a merkle tree root of all modules built as part of the full
> kernel build into vmlinux.
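
As a rough illustration of the Merkle-root idea described above, the following user-space sketch hashes a list of module blobs into a single root value. It is only a toy under stated assumptions: FNV-1a stands in for a real cryptographic hash, an odd node is paired with itself, and `fnv1a()`, `hash_pair()`, `merkle_root()` and `MAX_LEAVES` are hypothetical names. The hash algorithm and tree layout actually used by the series are not shown in this message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_LEAVES 64
#define FNV_OFFSET 0xcbf29ce484222325ULL
#define FNV_PRIME  0x100000001b3ULL

/* Toy 64-bit FNV-1a; NOT cryptographically secure, illustration only. */
static uint64_t fnv1a(const void *data, size_t len, uint64_t h)
{
	const unsigned char *p = data;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= FNV_PRIME;
	}
	return h;
}

/* Interior node: hash the left child's hash followed by the right's. */
static uint64_t hash_pair(uint64_t left, uint64_t right)
{
	uint64_t h = fnv1a(&left, sizeof(left), FNV_OFFSET);

	return fnv1a(&right, sizeof(right), h);
}

/* Root over n module blobs (here: strings); returns 0 on invalid n. */
static uint64_t merkle_root(const char *const modules[], size_t n)
{
	uint64_t level[MAX_LEAVES];

	if (n == 0 || n > MAX_LEAVES)
		return 0;

	/* Leaves: one hash per module. */
	for (size_t i = 0; i < n; i++)
		level[i] = fnv1a(modules[i], strlen(modules[i]), FNV_OFFSET);

	/* Reduce pairwise until a single root hash remains. */
	while (n > 1) {
		size_t out = 0;

		for (size_t i = 0; i < n; i += 2) {
			/* An odd node at the end is paired with itself. */
			uint64_t right = (i + 1 < n) ? level[i + 1] : level[i];

			level[out++] = hash_pair(level[i], right);
		}
		n = out;
	}
	return level[0];
}
```

The point of the construction is that only the root needs to be embedded: changing any single module changes its leaf hash and, with a real hash function, the root as well.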

I noticed Sashiko had a few concerns about the build changes. Would you
mind taking a look to see if they're valid?

https://sashiko.dev/#/patchset/20260505-module-hashes-v5-0-e174a5a49fce%40weissschuh.net

Sami


* Re: [PATCH v2 0/3] KVM: Fix and clean up kvm_vcpu_map[_readonly]() usages
From: Sean Christopherson @ 2026-05-19  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Madhavan Srinivasan,
	Nicholas Piggin, Peter Fang
  Cc: Yosry Ahmed, Ritesh Harjani, Michael Ellerman,
	Christophe Leroy (CS GROUP), Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, kvm,
	linuxppc-dev, linux-kernel
In-Reply-To: <20260408001137.3290444-1-peter.fang@intel.com>

On Tue, 07 Apr 2026 17:11:27 -0700, Peter Fang wrote:
> kvm_vcpu_map() and kvm_vcpu_map_readonly() are declared to take a gpa_t
> in kvm_host.h when they're supposed to take a gfn_t. First fix the
> function prototypes, and then refactor them to correctly take a gpa_t,
> reducing boilerplate gpa->gfn conversions at all call sites.
> 
> No actual harm has been done yet as all of the call sites are correctly
> passing in a gfn.
> 
> [...]
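
For readers less familiar with the gpa/gfn distinction in the quoted cover letter, here is a small user-space sketch. It assumes 4 KiB pages (`PAGE_SHIFT` of 12) and models both types as plain 64-bit integers, which is why a mix-up compiles silently; `gpa_offset()` is a hypothetical helper added for illustration, and none of this is the code from the series.

```c
#include <stdint.h>

typedef uint64_t gpa_t;	/* guest physical address */
typedef uint64_t gfn_t;	/* guest frame number */

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Drop the in-page offset to get the frame number. */
static gfn_t gpa_to_gfn(gpa_t gpa)
{
	return (gfn_t)(gpa >> PAGE_SHIFT);
}

/* Base address of the page that a frame number refers to. */
static gpa_t gfn_to_gpa(gfn_t gfn)
{
	return (gpa_t)gfn << PAGE_SHIFT;
}

/* Hypothetical helper: byte offset of a gpa within its page. */
static uint64_t gpa_offset(gpa_t gpa)
{
	return gpa & (PAGE_SIZE - 1);
}
```

With these definitions, a gpa of 0x12345678 corresponds to gfn 0x12345 with an in-page offset of 0x678. Passing the gpa itself where a gfn is expected would refer to a completely different page, and the compiler cannot catch it because both typedefs resolve to the same integer type.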

Applied patch 1 to kvm-x86 generic.  I'm moderately optimistic that the gpc
stuff will land soon enough that I won't regret skipping 2 and 3 :-)

Thanks much!

[1/3] KVM: Fix kvm_vcpu_map[_readonly]() function prototypes
      https://github.com/kvm-x86/linux/commit/ccd6c77223bb

--
https://github.com/kvm-x86/linux/tree/next


* Re: [PATCH v2] KVM: PPC: Kconfig: Enable CONFIG_VPA_PMU with KVM
From: Sean Christopherson @ 2026-05-19  1:01 UTC (permalink / raw)
  To: Gautam Menghani
  Cc: maddy, npiggin, mpe, chleroy, atrajeev, linuxppc-dev, kvm,
	linux-kernel, stable
In-Reply-To: <20260518044150.34632-1-gautam@linux.ibm.com>

On Mon, May 18, 2026, Gautam Menghani wrote:
> Enable CONFIG_VPA_PMU with KVM to enable its usage. Currently, the
> vpa-pmu driver cannot be used since it is not enabled in distro configs.

That seems like a problem to take up with distros, no?


* [powerpc:merge] BUILD SUCCESS d850c9d4d46e1c3f70922e048815ce8bc0235cec
From: kernel test robot @ 2026-05-19  2:16 UTC (permalink / raw)
  To: Madhavan Srinivasan; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: d850c9d4d46e1c3f70922e048815ce8bc0235cec  Automatic merge of 'fixes' into merge (2026-05-18 15:48)

elapsed time: 886m

configs tested: 326
configs skipped: 14

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig    gcc-15.2.0
alpha                            allyesconfig    gcc-15.2.0
alpha                               defconfig    gcc-15.2.0
arc                              alldefconfig    gcc-15.2.0
arc                              allmodconfig    clang-16
arc                              allmodconfig    clang-17
arc                              allmodconfig    gcc-15.2.0
arc                               allnoconfig    gcc-15.2.0
arc                              allyesconfig    clang-23
arc                              allyesconfig    gcc-15.2.0
arc                                 defconfig    gcc-15.2.0
arc                   randconfig-001-20260518    gcc-8.5.0
arc                   randconfig-001-20260518    gcc-9.5.0
arc                   randconfig-001-20260519    clang-23
arc                   randconfig-002-20260518    gcc-8.5.0
arc                   randconfig-002-20260519    clang-23
arm                               allnoconfig    clang-23
arm                               allnoconfig    gcc-15.2.0
arm                              allyesconfig    clang-16
arm                              allyesconfig    clang-17
arm                              allyesconfig    gcc-15.2.0
arm                                 defconfig    gcc-15.2.0
arm                         mv78xx0_defconfig    clang-19
arm                   randconfig-001-20260518    clang-23
arm                   randconfig-001-20260518    gcc-8.5.0
arm                   randconfig-001-20260519    clang-23
arm                   randconfig-002-20260518    gcc-8.5.0
arm                   randconfig-002-20260519    clang-23
arm                   randconfig-003-20260518    gcc-8.5.0
arm                   randconfig-003-20260519    clang-23
arm                   randconfig-004-20260518    gcc-10.5.0
arm                   randconfig-004-20260518    gcc-8.5.0
arm                   randconfig-004-20260519    clang-23
arm64                            allmodconfig    clang-23
arm64                             allnoconfig    gcc-15.2.0
arm64                               defconfig    gcc-15.2.0
arm64                 randconfig-001-20260519    gcc-8.5.0
arm64                 randconfig-002-20260519    gcc-8.5.0
arm64                 randconfig-003-20260519    gcc-8.5.0
arm64                 randconfig-004-20260519    gcc-8.5.0
csky                             allmodconfig    gcc-15.2.0
csky                              allnoconfig    gcc-15.2.0
csky                                defconfig    gcc-15.2.0
csky                  randconfig-001-20260519    gcc-8.5.0
csky                  randconfig-002-20260518    gcc-11.5.0
csky                  randconfig-002-20260519    gcc-8.5.0
hexagon                          allmodconfig    clang-17
hexagon                          allmodconfig    gcc-15.2.0
hexagon                           allnoconfig    clang-23
hexagon                           allnoconfig    gcc-15.2.0
hexagon                             defconfig    gcc-15.2.0
hexagon               randconfig-001-20260518    clang-16
hexagon               randconfig-001-20260518    gcc-11.5.0
hexagon               randconfig-001-20260519    clang-23
hexagon               randconfig-001-20260519    gcc-10.5.0
hexagon               randconfig-002-20260518    clang-23
hexagon               randconfig-002-20260518    gcc-11.5.0
hexagon               randconfig-002-20260519    clang-23
hexagon               randconfig-002-20260519    gcc-10.5.0
i386                             allmodconfig    clang-20
i386                             allmodconfig    gcc-14
i386                              allnoconfig    gcc-14
i386                              allnoconfig    gcc-15.2.0
i386                             allyesconfig    clang-20
i386                             allyesconfig    gcc-14
i386        buildonly-randconfig-001-20260518    clang-20
i386        buildonly-randconfig-001-20260519    gcc-12
i386        buildonly-randconfig-002-20260518    clang-20
i386        buildonly-randconfig-002-20260519    gcc-12
i386        buildonly-randconfig-003-20260518    gcc-14
i386        buildonly-randconfig-003-20260519    gcc-12
i386        buildonly-randconfig-004-20260518    gcc-14
i386        buildonly-randconfig-004-20260519    gcc-12
i386        buildonly-randconfig-005-20260518    gcc-14
i386        buildonly-randconfig-005-20260519    gcc-12
i386        buildonly-randconfig-006-20260518    clang-20
i386        buildonly-randconfig-006-20260519    gcc-12
i386                                defconfig    gcc-15.2.0
i386                  randconfig-001-20260518    clang-20
i386                  randconfig-001-20260518    gcc-14
i386                  randconfig-001-20260519    gcc-14
i386                  randconfig-002-20260518    gcc-14
i386                  randconfig-002-20260519    gcc-14
i386                  randconfig-003-20260518    gcc-14
i386                  randconfig-003-20260519    gcc-14
i386                  randconfig-004-20260518    clang-20
i386                  randconfig-004-20260518    gcc-14
i386                  randconfig-004-20260519    gcc-14
i386                  randconfig-005-20260518    clang-20
i386                  randconfig-005-20260518    gcc-14
i386                  randconfig-005-20260519    gcc-14
i386                  randconfig-006-20260518    gcc-13
i386                  randconfig-006-20260518    gcc-14
i386                  randconfig-006-20260519    gcc-14
i386                  randconfig-007-20260518    clang-20
i386                  randconfig-007-20260518    gcc-14
i386                  randconfig-007-20260519    gcc-14
i386                  randconfig-011-20260518    gcc-14
i386                  randconfig-011-20260519    gcc-14
i386                  randconfig-012-20260518    clang-20
i386                  randconfig-012-20260518    gcc-14
i386                  randconfig-012-20260519    gcc-14
i386                  randconfig-013-20260518    gcc-14
i386                  randconfig-013-20260519    gcc-14
i386                  randconfig-014-20260518    gcc-14
i386                  randconfig-014-20260519    gcc-14
i386                  randconfig-015-20260518    clang-20
i386                  randconfig-015-20260518    gcc-14
i386                  randconfig-015-20260519    gcc-14
i386                  randconfig-016-20260518    gcc-14
i386                  randconfig-016-20260519    gcc-14
i386                  randconfig-017-20260518    gcc-14
i386                  randconfig-017-20260519    gcc-14
loongarch                        allmodconfig    clang-23
loongarch                         allnoconfig    clang-23
loongarch                         allnoconfig    gcc-15.2.0
loongarch                           defconfig    clang-19
loongarch             randconfig-001-20260518    gcc-11.5.0
loongarch             randconfig-001-20260518    gcc-15.2.0
loongarch             randconfig-001-20260519    clang-23
loongarch             randconfig-001-20260519    gcc-10.5.0
loongarch             randconfig-002-20260518    clang-23
loongarch             randconfig-002-20260518    gcc-11.5.0
loongarch             randconfig-002-20260519    clang-23
loongarch             randconfig-002-20260519    gcc-10.5.0
m68k                             allmodconfig    gcc-15.2.0
m68k                              allnoconfig    gcc-15.2.0
m68k                             allyesconfig    clang-16
m68k                             allyesconfig    clang-17
m68k                             allyesconfig    gcc-15.2.0
m68k                                defconfig    clang-19
m68k                                defconfig    gcc-15.2.0
m68k                       m5249evb_defconfig    gcc-15.2.0
m68k                        m5407c3_defconfig    gcc-15.2.0
microblaze                        allnoconfig    gcc-15.2.0
microblaze                       allyesconfig    gcc-15.2.0
microblaze                          defconfig    clang-19
microblaze                          defconfig    gcc-15.2.0
mips                             allmodconfig    gcc-15.2.0
mips                              allnoconfig    gcc-15.2.0
mips                             allyesconfig    gcc-15.2.0
mips                        bcm47xx_defconfig    clang-18
nios2                            allmodconfig    clang-23
nios2                            allmodconfig    gcc-11.5.0
nios2                             allnoconfig    clang-23
nios2                             allnoconfig    gcc-11.5.0
nios2                               defconfig    clang-19
nios2                               defconfig    gcc-11.5.0
nios2                 randconfig-001-20260518    gcc-11.5.0
nios2                 randconfig-001-20260519    gcc-10.5.0
nios2                 randconfig-002-20260518    gcc-11.5.0
nios2                 randconfig-002-20260519    gcc-10.5.0
openrisc                         allmodconfig    clang-23
openrisc                         allmodconfig    gcc-15.2.0
openrisc                          allnoconfig    clang-23
openrisc                          allnoconfig    gcc-15.2.0
openrisc                            defconfig    gcc-15.2.0
parisc                           allmodconfig    gcc-15.2.0
parisc                            allnoconfig    clang-23
parisc                            allnoconfig    gcc-15.2.0
parisc                           allyesconfig    clang-19
parisc                           allyesconfig    gcc-15.2.0
parisc                              defconfig    gcc-15.2.0
parisc                randconfig-001-20260518    gcc-15.2.0
parisc                randconfig-001-20260519    gcc-8.5.0
parisc                randconfig-002-20260518    gcc-12.5.0
parisc                randconfig-002-20260519    gcc-8.5.0
parisc64                            defconfig    clang-19
parisc64                            defconfig    gcc-15.2.0
powerpc                          allmodconfig    gcc-15.2.0
powerpc                           allnoconfig    clang-23
powerpc                           allnoconfig    gcc-15.2.0
powerpc                     powernv_defconfig    gcc-15.2.0
powerpc               randconfig-001-20260518    clang-23
powerpc               randconfig-001-20260519    gcc-8.5.0
powerpc               randconfig-002-20260518    clang-23
powerpc               randconfig-002-20260519    gcc-8.5.0
powerpc64             randconfig-001-20260518    gcc-11.5.0
powerpc64             randconfig-001-20260519    gcc-8.5.0
powerpc64             randconfig-002-20260518    clang-23
powerpc64             randconfig-002-20260519    gcc-8.5.0
riscv                            allmodconfig    clang-23
riscv                             allnoconfig    clang-23
riscv                             allnoconfig    gcc-15.2.0
riscv                            allyesconfig    clang-16
riscv                            allyesconfig    clang-17
riscv                               defconfig    clang-23
riscv                               defconfig    gcc-15.2.0
riscv                 randconfig-001-20260518    clang-23
riscv                 randconfig-001-20260519    gcc-13.4.0
riscv                 randconfig-002-20260518    clang-23
riscv                 randconfig-002-20260519    gcc-13.4.0
s390                             allmodconfig    clang-18
s390                             allmodconfig    clang-19
s390                              allnoconfig    clang-23
s390                             allyesconfig    gcc-15.2.0
s390                                defconfig    clang-23
s390                                defconfig    gcc-15.2.0
s390                  randconfig-001-20260518    gcc-12.5.0
s390                  randconfig-001-20260519    gcc-13.4.0
s390                  randconfig-002-20260518    gcc-12.5.0
s390                  randconfig-002-20260519    gcc-13.4.0
sh                               allmodconfig    gcc-15.2.0
sh                                allnoconfig    clang-23
sh                                allnoconfig    gcc-15.2.0
sh                               allyesconfig    clang-19
sh                               allyesconfig    gcc-15.2.0
sh                                  defconfig    gcc-14
sh                                  defconfig    gcc-15.2.0
sh                    randconfig-001-20260518    gcc-14.3.0
sh                    randconfig-001-20260519    gcc-13.4.0
sh                    randconfig-002-20260518    gcc-11.5.0
sh                    randconfig-002-20260519    gcc-13.4.0
sparc                             allnoconfig    clang-23
sparc                             allnoconfig    gcc-15.2.0
sparc                               defconfig    gcc-15.2.0
sparc                 randconfig-001-20260518    gcc-15.2.0
sparc                 randconfig-001-20260519    gcc-14.3.0
sparc                 randconfig-002-20260518    gcc-15.2.0
sparc                 randconfig-002-20260519    gcc-14.3.0
sparc64                          allmodconfig    clang-23
sparc64                             defconfig    clang-20
sparc64                             defconfig    gcc-14
sparc64               randconfig-001-20260518    gcc-15.2.0
sparc64               randconfig-001-20260519    gcc-14.3.0
sparc64               randconfig-002-20260518    gcc-15.2.0
sparc64               randconfig-002-20260519    gcc-14.3.0
um                               allmodconfig    clang-19
um                                allnoconfig    clang-23
um                               allyesconfig    gcc-14
um                               allyesconfig    gcc-15.2.0
um                                  defconfig    clang-23
um                                  defconfig    gcc-14
um                             i386_defconfig    gcc-14
um                    randconfig-001-20260518    clang-16
um                    randconfig-001-20260518    gcc-15.2.0
um                    randconfig-001-20260519    gcc-14.3.0
um                    randconfig-002-20260518    clang-23
um                    randconfig-002-20260518    gcc-15.2.0
um                    randconfig-002-20260519    gcc-14.3.0
um                           x86_64_defconfig    clang-23
um                           x86_64_defconfig    gcc-14
x86_64                           allmodconfig    clang-20
x86_64                            allnoconfig    clang-20
x86_64                            allnoconfig    clang-23
x86_64                           allyesconfig    clang-20
x86_64      buildonly-randconfig-001-20260518    clang-20
x86_64      buildonly-randconfig-001-20260518    gcc-14
x86_64      buildonly-randconfig-001-20260519    gcc-14
x86_64      buildonly-randconfig-002-20260518    clang-20
x86_64      buildonly-randconfig-002-20260518    gcc-14
x86_64      buildonly-randconfig-002-20260519    gcc-14
x86_64      buildonly-randconfig-003-20260518    clang-20
x86_64      buildonly-randconfig-003-20260518    gcc-14
x86_64      buildonly-randconfig-003-20260519    gcc-14
x86_64      buildonly-randconfig-004-20260518    gcc-14
x86_64      buildonly-randconfig-004-20260519    gcc-14
x86_64      buildonly-randconfig-005-20260518    gcc-14
x86_64      buildonly-randconfig-005-20260519    gcc-14
x86_64      buildonly-randconfig-006-20260518    clang-20
x86_64      buildonly-randconfig-006-20260518    gcc-14
x86_64      buildonly-randconfig-006-20260519    gcc-14
x86_64                              defconfig    gcc-14
x86_64                                  kexec    clang-20
x86_64                         randconfig-001    gcc-14
x86_64                randconfig-001-20260518    gcc-14
x86_64                randconfig-001-20260519    clang-20
x86_64                         randconfig-002    gcc-14
x86_64                randconfig-002-20260518    clang-20
x86_64                randconfig-002-20260518    gcc-14
x86_64                randconfig-002-20260519    clang-20
x86_64                         randconfig-003    gcc-14
x86_64                randconfig-003-20260518    gcc-14
x86_64                randconfig-003-20260519    clang-20
x86_64                         randconfig-004    gcc-14
x86_64                randconfig-004-20260518    gcc-14
x86_64                randconfig-004-20260519    clang-20
x86_64                         randconfig-005    gcc-14
x86_64                randconfig-005-20260518    gcc-14
x86_64                randconfig-005-20260519    clang-20
x86_64                         randconfig-006    gcc-14
x86_64                randconfig-006-20260518    gcc-14
x86_64                randconfig-006-20260519    clang-20
x86_64                randconfig-011-20260518    clang-20
x86_64                randconfig-011-20260519    clang-20
x86_64                randconfig-012-20260518    clang-20
x86_64                randconfig-012-20260519    clang-20
x86_64                randconfig-013-20260518    gcc-14
x86_64                randconfig-013-20260519    clang-20
x86_64                randconfig-014-20260518    clang-20
x86_64                randconfig-014-20260519    clang-20
x86_64                randconfig-015-20260518    clang-20
x86_64                randconfig-015-20260519    clang-20
x86_64                randconfig-016-20260518    gcc-14
x86_64                randconfig-016-20260519    clang-20
x86_64                randconfig-071-20260518    clang-20
x86_64                randconfig-071-20260518    gcc-14
x86_64                randconfig-071-20260519    gcc-14
x86_64                randconfig-072-20260518    clang-20
x86_64                randconfig-072-20260518    gcc-14
x86_64                randconfig-072-20260519    gcc-14
x86_64                randconfig-073-20260518    clang-20
x86_64                randconfig-073-20260519    gcc-14
x86_64                randconfig-074-20260518    clang-20
x86_64                randconfig-074-20260519    gcc-14
x86_64                randconfig-075-20260518    clang-20
x86_64                randconfig-075-20260519    gcc-14
x86_64                randconfig-076-20260518    clang-20
x86_64                randconfig-076-20260519    gcc-14
x86_64                               rhel-9.4    clang-20
x86_64                           rhel-9.4-bpf    gcc-14
x86_64                          rhel-9.4-func    clang-20
x86_64                    rhel-9.4-kselftests    clang-20
x86_64                         rhel-9.4-kunit    gcc-14
x86_64                           rhel-9.4-ltp    gcc-14
x86_64                          rhel-9.4-rust    clang-20
xtensa                            allnoconfig    clang-23
xtensa                            allnoconfig    gcc-15.2.0
xtensa                           allyesconfig    clang-23
xtensa                           allyesconfig    gcc-15.2.0
xtensa                randconfig-001-20260518    gcc-12.5.0
xtensa                randconfig-001-20260518    gcc-15.2.0
xtensa                randconfig-001-20260519    gcc-14.3.0
xtensa                randconfig-002-20260518    gcc-15.2.0
xtensa                randconfig-002-20260518    gcc-9.5.0
xtensa                randconfig-002-20260519    gcc-14.3.0

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH v2 1/2] lkdtm/powerpc: add isync after slbmte to enforce SLB update ordering
From: Ritesh Harjani @ 2026-05-19  2:24 UTC (permalink / raw)
  To: Sayali Patil, linuxppc-dev, maddy; +Cc: linux-kernel, Mahesh Salgaonkar
In-Reply-To: <2f8d430962a96a7498903b994f081deee4a4d97a.1778975974.git.sayalip@linux.ibm.com>

Sayali Patil <sayalip@linux.ibm.com> writes:

> The slbmte instruction modifies the Segment Lookaside Buffer, but without
> a context synchronizing operation the CPU is not guaranteed to observe
> the updated SLB state for subsequent instructions. This can result in
> use of stale translation state when memory is accessed immediately after
> SLB modifications.
>
> Add isync after each slbmte in the PPC_SLB_MULTIHIT test to ensure proper
> ordering of SLB updates before subsequent memory accesses.
>
> This aligns with Power ISA context synchronization requirements for changes
> in address translation state and improves the reliability of SLB multihit
> injection tests in hash MMU mode.
>

Yup, a CSI is required before & after a slbmte. Given we are trying to
add duplicate slb entries, I think the isync()s added in this patch are
sufficient.

LGTM. Feel free to add:
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>


As Mpe added - This needs to be cc'd to Kees.

-> ./scripts/get_maintainer.pl -f drivers/misc/lkdtm/powerpc.c
Kees Cook <kees@kernel.org> (maintainer:LINUX KERNEL DUMP TEST MODULE (LKDTM))
Arnd Bergmann <arnd@arndb.de> (maintainer:CHAR and MISC DRIVERS)
Greg Kroah-Hartman <gregkh@linuxfoundation.org> (maintainer:CHAR and MISC DRIVERS)
linux-kernel@vger.kernel.org (open list)
CHAR and MISC DRIVERS status: Supported

-ritesh



* Re: [PATCH 3/8] mm/bootmem_info: stop using PG_private
From: Lance Yang @ 2026-05-19  2:56 UTC (permalink / raw)
  To: david
  Cc: davem, andreas, rppt, akpm, agordeev, gerald.schaefer, hca, gor,
	borntraeger, svens, maddy, mpe, npiggin, chleroy, ljs, liam,
	vbabka, surenb, mhocko, sparclinux, linux-kernel, linux-mm,
	linux-s390, linuxppc-dev, Lance Yang
In-Reply-To: <20260511-bootmem_info_prep-v1-3-3fb0be6fc688@kernel.org>


On Mon, May 11, 2026 at 04:05:31PM +0200, David Hildenbrand (Arm) wrote:
>Nobody checks PG_private for these pages, and we can happily use
>set_page_private() without setting PG_private. So let's just stop
>setting/clearing PG_private.
>
>Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>---
> mm/bootmem_info.c | 2 --
> 1 file changed, 2 deletions(-)
>
>diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
>index a0a1ecdec8d0..6e2aaab3dca9 100644
>--- a/mm/bootmem_info.c
>+++ b/mm/bootmem_info.c
>@@ -19,7 +19,6 @@ void get_page_bootmem(unsigned long info, struct page *page,
> {
> 	BUG_ON(type > 0xf);
> 	BUG_ON(info > (ULONG_MAX >> 4));
>-	SetPagePrivate(page);

Right, the users classify these pages via PageReserved()/bootmem_type(),
not PagePrivate().

So makes sense to not set PG_private in the first place.

> 	set_page_private(page, info << 4 | type);
> 	page_ref_inc(page);
> }
>@@ -32,7 +31,6 @@ void put_page_bootmem(struct page *page)
> 	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
> 
> 	if (page_ref_dec_return(page) == 1) {
>-		ClearPagePrivate(page);

Nothing sets it anymore, so there is nothing to clear here.

LGTM, feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>

> 		set_page_private(page, 0);
> 		kmemleak_free_part_phys(PFN_PHYS(page_to_pfn(page)), PAGE_SIZE);
> 		free_reserved_page(page);
>
>-- 
>2.43.0
>
>
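
For readers following along, the packing done by set_page_private(page,
info << 4 | type) can be sketched in plain userspace C. This is only an
illustration: the helper names below are hypothetical and not kernel API,
and assert() stands in for the BUG_ON() checks shown in the quoted hunk.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helpers for illustration; these names are not kernel API. */

/* Pack a 4-bit type into the low bits and the info value above it. */
static unsigned long bootmem_pack(unsigned long info, unsigned long type)
{
	/* Same limits as the BUG_ON() checks in get_page_bootmem(). */
	assert(type <= 0xf);
	assert(info <= (ULONG_MAX >> 4));
	return info << 4 | type;
}

/* Recover the type from the low four bits of the packed word. */
static unsigned long bootmem_unpack_type(unsigned long priv)
{
	return priv & 0xf;
}

/* Recover the info value from the remaining upper bits. */
static unsigned long bootmem_unpack_info(unsigned long priv)
{
	return priv >> 4;
}
```

The point of the sketch is that the whole encoding lives in the private
word itself, so no page flag is needed to interpret it.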



* Re: cleanup the RAID6 P/Q library v3
From: Christoph Hellwig @ 2026-05-19  8:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Catalin Marinas, Will Deacon, Ard Biesheuvel,
	Huacai Chen, WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Herbert Xu, Dan Williams,
	Chris Mason, David Sterba, Arnd Bergmann, Song Liu, Yu Kuai,
	Li Nan, linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
	linux-riscv, linux-s390, linux-crypto, linux-btrfs, linux-arch,
	linux-raid
In-Reply-To: <20260518141205.c100f76eec5f58e78bbbf7af@linux-foundation.org>

On Mon, May 18, 2026 at 02:12:05PM -0700, Andrew Morton wrote:
> Cool, I'll add this to mm.git's mm-nonmm-unstable branch for some
> linux-next testing.
> 
> AI review found quite a lot to talk about:
> 	https://sashiko.dev/#/patchset/20260518051804.462141-1-hch@lst.de

Not a lot of it is very useful, though:

raid6: turn the userspace test harness into a kunit test

 - complains about basically adding need_resched, which we've decided
   we won't do now that we have lazy preempt.  This is probably going
   to come up in lots of places because of the old training data

raid6: use named initializers for struct raid6_calls

 - whining about keeping totally pointless comments
 
raid6: warn when using less than four devices

 - complains about warning for btrfs which is clearly documented as the
   outcome in the commit log
 - and also complaining that the enforcement isn't hard enough, but the
   WARN_ON is the best we can do here

raid6: rework registration of optimized algorithms

 - less registration causing less kunit coverage:  that's intentional
   as it keeps testing time down and similar to other arch optimized
   tests in crc and crypto code.  It also doesn't really reduce
   coverage as before this series there was none.

raid6: use static_call for gen_syndrom and xor_syndrom

 - doesn't seem to know that bool fails when an initcall fails

raid6_kunit: use KUNIT_CASE_PARAM

 - whining about the code style.  I don't really like it either,
   but the kunit case stuff is a mess

There are a few somewhat useful things, though.

raid6: hide internals

 - yes, the -I is duplicate and should be fixed

raid6: rework registration of optimized algorithms

 - avx2 instead of avx512 is probably the right thing for no
   benchmarking, but if it was intentional (it wasn't), that should
   be documented.  So I'll just switch back to the previous version to
   keep the state of the art



* Re: [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Lukas Wunner @ 2026-05-19  9:53 UTC (permalink / raw)
  To: Yury Murashka
  Cc: bhelgaas, mahesh, oohall, linux-pci, linux-kernel, linuxppc-dev
In-Reply-To: <CAPzpGcRCTCZtaX1EVaJNZ103THZKsoszZduY7=gwfYdcrMo-SQ@mail.gmail.com>

On Mon, May 18, 2026 at 02:23:36PM +0100, Yury Murashka wrote:
> pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
> If a new AER error is subsequently reported, the AER driver calls
> find_source_device() to find the source of the error. It rescans the
> whole bus and picks the first device reporting an AER error. Because the
> previous error was never cleared, the error is attributed to the wrong
> device and AER recovery is started for the wrong device.

I guess the rationale of the current behavior is that the devices
affected by the failed error recovery are basically in a broken
state once error recovery failed and so user intervention is
required, e.g. a remove/rescan via sysfs.

My question is, why is error recovery failing for the devices
in the first place?

And what does the hierarchy look like?
(lspci -tv and lspci -vvv output please)

I also don't quite follow your assertion that (only) the first device
reporting an error is picked.  The algorithm tries to collect *all*
error-reporting devices in the affected portion of the hierarchy.

Thanks,

Lukas


* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 11:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5ah5oa59wy.fsf@kernel.org>

On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
> >> 
> >> What I meant was that we need a generic way to identify a pKVM guest, so
> >> that we can use it in the conditional above.
> >
> > I have this patch, with that I can boot with your series unmodified,
> > but I will need to do more testing.
> >
> 
> Thanks, I can add this to the series once you complete the required testing.
> 

I am still running more tests, but looking more into it. Setting
force_dma_unencrypted() to true for pKVM guests is wrong, as the
guest shouldn’t try to decrypt arbitrary memory as it can include
sensitive information (for example in case of virtio sub-page
allocation) and should strictly rely on the restricted-dma-pool
for that.

However, with my patch and setting force_dma_unencrypted() to false
on top of this series, it fails on pKVM due to a missing shared
attribute as Alexey mentioned, as now SWIOTLB rejects non shared
attrs, so, the DMA-API has to pass it. With that, I can boot again:

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 5103a04df99f..b19aeec03f27 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -286,6 +286,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
 	}

 	if (is_swiotlb_for_alloc(dev)) {
+		attrs |= DMA_ATTR_CC_SHARED;
+
 		page = dma_direct_alloc_swiotlb(dev, size, attrs);
 		if (page) {
 			/*
@@ -449,6 +451,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
 						  &cpu_addr, gfp, attrs);

 	if (is_swiotlb_for_alloc(dev)) {
+		attrs |= DMA_ATTR_CC_SHARED;
+
 		page = dma_direct_alloc_swiotlb(dev, size, attrs);
 		if (!page)
 			return NULL;
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 4e35264ab6f8..8ee5bbf78cfb 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -92,6 +92,7 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
 			return DMA_MAPPING_ERROR;

+		attrs |= DMA_ATTR_CC_SHARED;
 		return swiotlb_map(dev, phys, size, dir, attrs);
 	}

-- 

I will keep testing and let you know how it goes. If there is nothing
else required to convert pKVM guests to CC, I can just post the patch
separately as it has no dependency on this series.

Re force_dma_unencrypted(), I am looking into a safe way to use it
for pKVM as I believe it will be useful to eliminate some bouncing.
However, that’s not critical for this series and can be added later
as I am still investigating it, if I reach something I can post it
along the pKVM patch above.

Thanks,
Mostafa

> 
> 
> -aneesh


* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 11:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260515225113.GN7702@ziepe.ca>

On Fri, May 15, 2026 at 07:51:13PM -0300, Jason Gunthorpe wrote:
> On Thu, May 14, 2026 at 02:43:39PM +0000, Mostafa Saleh wrote:
> > > That's a somewhat different problem, we have the dev->trusted stuff
> > > that is supposed to deal with this kind of security. We need it for
> > > IOMMU based systems too, eg hot plug thunderbolt should have it.
> > 
> > I see that it is used only for dma-iommu and for PCI devices.
> > However, I think that should be a problem with other CCA solutions
> > with emulated devices as they are untrusted. As I'd expect they
> > would have virtio devices.
> 
> Yes, any security solution with an out of TCB device should be using either
> memory encryption so the kernel already bounces or this trusted stuff
> and a force strict dma-iommu so the dma layer is careful.
> 
> This is more policy from userspace what devices they want in or out of
> their TCB. Like you may accept the device into T=1 but then still
> want to keep it out of your TCB with the vIOMMU, I can see good
> arguments for something like that.
> 
> > > > While we can debate the aesthetics of the setup , this is
> > > > the existing behaviour for Linux, which existed for years
> > > > and pKVM relies on and is used extensively.
> > > > And, this patch alters that long-standing logic and introduces
> > > > a functional regression.
> > > 
> > > Yeah, Aneesh needs to do something here, I'm pointing out it is
> > > entirely separate thing from the CC path we are working on which is
> > > decoupling CC from relying on force swiotlb.
> > 
> > I am looking into converting pKVM to use the CC stuff, I replied with
> > a patch to Aneesh in this thread. However, I need to do more testing
> > and make sure there are not any unwanted consequences.
> 
> Yeah, it is a nice patch and I think it will help reduce the
> complexity if it aligns to CCA type stuff.
> 
> > > In a pkvm world it should be the same, the S2 table for the SMMU will
> > > control what the device can access, and if the SMMU points to a
> > > "private" or "shared" page is not something the device needs to know
> > > or care about.
> > 
> > I see that's because dma-iommu chooses the attrs for iommu_map().
> 
> Long term the DMA API path through the dma-iommu will pass the
> ATTR_CC_SHARED through to iommu_map so when the arch requires a
> different IOPTE it can construct it.
> 
> > In pKVM, dma_addr_t and IOPTE are the same for private and shared,
> > so nothing differs in that case.
> 
> Yes, so you don't have to worry.
> 
> > We don’t expect pass-through devices to interact with shared
> > memory (T=0) at the moment.
> > However, I can see use cases for that, where the host and the guest
> > collaborate with device passthrough and require zero copy.
> 
> Once you add the CC patch it becomes immediately possible though
> because the user can allocate a CC shared DMA HEAP and feed that all
> over the place.
> 
> > One other interesting case for device-passthrough is non-coherent
> > devices which then require private pools for bouncing.
> 
> Why does shared/private matter for bouncing? Why do you need to bounce
> at all? Do cmo's not work in pkvm guests?

At the moment, in iommu_dma_map_phys(), if a non coherent device
tries to map an unaligned address or size it will be bounced.
In pKVM, dma-iommu is used for assigned devices which operate on
private memory, so bouncing that through the SWIOTLB would leak
information from the guest as the SWIOTLB is decrypted.
In that case, the device needs a pool which remains private.

Thanks,
Mostafa

> 
> Jason



* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Barry Song @ 2026-05-19 11:07 UTC (permalink / raw)
  To: Yang Shi
  Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, ljs, liam, vbabka,
	rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
	kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAHbLzkrOSoh-jmR=uaNvx73n_wn+vExoKY0UzH5zGcfdAiDbNg@mail.gmail.com>

On Tue, May 19, 2026 at 5:21 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Sun, May 17, 2026 at 1:45 AM Barry Song <baohua@kernel.org> wrote:
> >
> > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > for an unpredictable amount of time.
> > > > >
> > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > it still seems really unlikely to me.
> > > >
> > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > the entire VMA—just a portion of it is sufficient.
> > >
> > > Yes, but that still fails to answer "does this actually happen".  How much
> > > performance is all this complexity in the page fault handler buying us?
> > > If you don't answer this question, I'm just going to go in and rip it
> > > all out.
> > >
> >
> > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > waiting for answers),
> >
> > As promised during LSF/MM/BPF, we conducted thorough
> > testing on Android phones to determine whether performing
> > I/O in `filemap_fault()` can block `vma_start_write()`.
> > I wanted to give a quick update on this question.
> >
> > Nanzhe at Xiaomi created tracing scripts and ran various
> > applications on Android devices with I/O performed under
> > the VMA lock in `filemap_fault()`. We found that:
> >
> > 1. There are very few cases where unmap() is blocked by
> >    page faults. I assume this is due to buggy user code
> >    or poor synchronization between reads and unmap().
> >    So I assume it is not a problem.
> >
> > 2. We observed many cases where `vma_start_write()`
> >    is blocked by page-fault I/O in some applications.
> >    The blocking occurs in the `dup_mmap()` path during
> >    fork().
> >
> > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > the parent process when forking"), we now always hold
> > `vma_write_lock()` for each VMA. Note that the
> > `mmap_lock` write lock is also held, which could lead to
> > chained waiting if page-fault I/O is performed without
> > releasing the VMA lock.
> >
> > My gut feeling is that Suren's commit may be overshooting,
> > so my rough idea is that we might want to do something like
> > the following (we haven't tested it yet and it might be
> > wrong):
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2311ae7c2ff4..5ddaf297f31a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >         for_each_vma(vmi, mpnt) {
> >                 struct file *file;
> >
> > -               retval = vma_start_write_killable(mpnt);
> > +               /*
> > +                * For anonymous or writable private VMAs, prevent
> > +                * concurrent CoW faults.
> > +                */
> > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > +                                       (mpnt->vm_flags & VM_WRITE)))
> > +                       retval = vma_start_write_killable(mpnt);
> >                 if (retval < 0)
> >                         goto loop_out;
> >                 if (mpnt->vm_flags & VM_DONTCOPY) {
>
> Maybe a little bit off topic. This is an interesting idea. It seems
> possible we don't have to take vma write lock unconditionally. IIUC
> the write lock is mainly used to serialize against page fault and
> madvise, right? I got a crazy idea off the top of my head. We may be
> able to just take vma write lock iff vma->anon_vma is not NULL.
>
> First of all, write mmap_lock is held, so the vma can't go or be
> changed under us.
>
> Secondly, if vma->anon_vma is NULL, it basically means either no page
> fault happened or no cow happened, so there is no page table to copy,
> this is also what copy_page_range() does currently. So we can shrink
> the critical section to:
>
> if (vma->anon_vma) {
>     vma_start_write_killable(src_vma);
>     anon_vma_fork(dst_vma, src_vma);
>     copy_page_range(dst_vma, src_vma);
> }
>
> But page fault can happen before write mmap_lock is taken, when we
> check vma->anon_vma, it is possible it has not been set up yet. But it
> seems to be equivalent to page fault after fork and won't break the
> semantic.

Re-reading Suren's commit log for fb49c455323ff8
("fork: lock VMAs of the parent process when forking"),
it seems that vma_start_write() is used to protect
against a race where anon_vma changes from NULL to
non-NULL during fork. In that scenario, we hold the
mmap_lock write lock, but not vma_start_write(), so a
concurrent anon_vma_prepare() could still install an
anon_vma.

"    A concurrent page fault on a page newly marked read-only by the page
    copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
    source vma, defeating the anon_vma_clone() that wasn't done because the
    parent vma originally didn't have an anon_vma, but we now might end up
    copying a pte entry for a page that has one.
"

If that is the case, then your change does not work.

Nowadays, nobody calls anon_vma_prepare(vma) directly.
Instead, vmf_anon_prepare() is used, and we always
require the mmap_lock read lock before calling
__anon_vma_prepare(). As a result, anon_vma cannot
transition from NULL to non-NULL during fork.

So the original race condition has effectively
disappeared.

You also mentioned the madvise() case. If I understand
correctly, madvise() should take mmap_lock before
modifying anon_vma. Only some parts of madvise() can
support per-VMA locking. Therefore, we probably do not
need:

if (vma->anon_vma) {
    vma_start_write_killable(src_vma);
    ...
}

>
> Anyway, just a crazy idea, I may miss some corner cases.

To me, it seems that we could remove vma_start_write()
entirely now. Or is that an even crazier idea?
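
For reference, the condition from the untested dup_mmap() draft quoted
earlier in this thread can be written as a standalone predicate. This is
a userspace sketch only: the struct, the function name and the SKETCH_*
constants are made up for illustration and merely stand in for
struct vm_area_struct, VM_WRITE and VM_SHARED.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag bits for this sketch; not taken from kernel headers. */
#define SKETCH_VM_WRITE  0x2UL
#define SKETCH_VM_SHARED 0x8UL

/* Minimal stand-in for the two VMA fields the draft condition reads. */
struct sketch_vma {
	const void *vm_file;	/* NULL for anonymous mappings */
	unsigned long vm_flags;
};

/*
 * True when the draft would still take the per-VMA write lock in fork:
 * anonymous VMAs, or file-backed VMAs that are private and writable
 * (the cases where a concurrent CoW fault is possible).
 */
static bool sketch_needs_write_lock(const struct sketch_vma *vma)
{
	if (!vma->vm_file)
		return true;
	return !(vma->vm_flags & SKETCH_VM_SHARED) &&
	       (vma->vm_flags & SKETCH_VM_WRITE);
}
```

Shared file mappings and read-only private file mappings are the cases
the draft would skip, which are exactly the ones where page-fault I/O
was observed to block fork.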

Thanks
Barry



* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 12:27 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxDxdxynp4KEovA@google.com>

Mostafa Saleh <smostafa@google.com> writes:

> On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
>> >> 
>> >> What I meant was that we need a generic way to identify a pKVM guest, so
>> >> that we can use it in the conditional above.
>> >
>> > I have this patch, with that I can boot with your series unmodified,
>> > but I will need to do more testing.
>> >
>> 
>> Thanks, I can add this to the series once you complete the required testing.
>> 
>
> I am still running more tests, but looking more into it. Setting
> force_dma_unencrypted() to true for pKVM guests is wrong, as the
> guest shouldn’t try to decrypt arbitrary memory as it can include
> sensitive information (for example in case of virtio sub-page
> allocation) and should strictly rely on the restricted-dma-pool
> for that.
>
> However, with my patch and setting force_dma_unencrypted() to false
> on top of this series, it fails on pKVM due to a missing shared
> attribute as Alexey mentioned, as now SWIOTLB rejects non shared
> attrs, so, the DMA-API has to pass it. With that, I can boot again:
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 5103a04df99f..b19aeec03f27 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -286,6 +286,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	}
>  
>  	if (is_swiotlb_for_alloc(dev)) {
> +		attrs |= DMA_ATTR_CC_SHARED;
> +
>  		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>  		if (page) {
>  			/*
> @@ -449,6 +451,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  						  &cpu_addr, gfp, attrs);
>  
>  	if (is_swiotlb_for_alloc(dev)) {
> +		attrs |= DMA_ATTR_CC_SHARED;
> +
>  		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>  		if (!page)
>  			return NULL;
> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
> index 4e35264ab6f8..8ee5bbf78cfb 100644
> --- a/kernel/dma/direct.h
> +++ b/kernel/dma/direct.h
> @@ -92,6 +92,7 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
>  		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
>  			return DMA_MAPPING_ERROR;
>  
> +		attrs |= DMA_ATTR_CC_SHARED;
>  		return swiotlb_map(dev, phys, size, dir, attrs);
>  	}
>  
> --
>

How about the below?

modified   kernel/dma/direct.c
@@ -278,6 +278,10 @@ void *dma_direct_alloc(struct device *dev, size_t size,
 	}
 
 	if (is_swiotlb_for_alloc(dev)) {
+
+		if (dev->dma_io_tlb_mem->unencrypted)
+			attrs |= DMA_ATTR_CC_SHARED;
+
 		page = dma_direct_alloc_swiotlb(dev, size, attrs);
 		if (page) {
 			/*
@@ -451,6 +455,10 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
 						  &cpu_addr, gfp, attrs);
 
 	if (is_swiotlb_for_alloc(dev)) {
+
+		if (dev->dma_io_tlb_mem->unencrypted)
+			attrs |= DMA_ATTR_CC_SHARED;
+
 		page = dma_direct_alloc_swiotlb(dev, size, attrs);
 		if (!page)
 			return NULL;
modified   kernel/dma/direct.h
@@ -92,6 +92,9 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
 			return DMA_MAPPING_ERROR;
 
+		if (dev->dma_io_tlb_mem->unencrypted)
+			attrs |= DMA_ATTR_CC_SHARED;
+
 		return swiotlb_map(dev, phys, size, dir, attrs);
 	}
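
The difference from the unconditional version can be seen in a small
userspace sketch. All names here are hypothetical stand-ins:
sketch_io_tlb_mem mimics only the "unencrypted" field used above, and
SKETCH_ATTR_CC_SHARED is a placeholder bit, not the real value of
DMA_ATTR_CC_SHARED.

```c
#include <stdbool.h>

/* Placeholder bit for this sketch; the real attribute value may differ. */
#define SKETCH_ATTR_CC_SHARED (1UL << 0)

/* Stand-in for the pool state consulted by the proposed change. */
struct sketch_io_tlb_mem {
	bool unencrypted;	/* pool memory is shared/decrypted */
};

/*
 * Tag the request as shared only when the device's pool really is
 * unencrypted; a private pool leaves the caller's attrs untouched.
 */
static unsigned long sketch_swiotlb_attrs(const struct sketch_io_tlb_mem *mem,
					  unsigned long attrs)
{
	if (mem->unencrypted)
		attrs |= SKETCH_ATTR_CC_SHARED;
	return attrs;
}
```

With this shape a device bound to a private pool never gets the shared
attribute forced onto its requests.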
 


>
>
> I will keep testing and let you know how it goes. If there is nothing
> else required to convert pKVM guests to CC, I can just post the patch
> separately as it has no dependency on this series.
>

That would be useful. I can then carry the patch as a dependent change,
which can also be merged separately.

>
> Re force_dma_unencrypted(), I am looking into a safe way to use it
> for pKVM as I believe it will be useful to eliminate some bouncing.
> However, that’s not critical for this series and can be added later
> as I am still investigating it, if I reach something I can post it
> along the pKVM patch above.
>
> Thanks,
> Mostafa
>
>> 
>> 
>> -aneesh



* Re: [PATCH v13 04/15] arm64: kexec_file: Fix potential buffer overflow in prepare_elf_headers()
From: Jinjie Ruan @ 2026-05-19 12:42 UTC (permalink / raw)
  To: Breno Leitao
  Cc: corbet, skhan, catalin.marinas, will, chenhuacai, kernel, maddy,
	mpe, npiggin, chleroy, pjw, palmer, aou, alex, tglx, mingo, bp,
	dave.hansen, hpa, robh, saravanak, akpm, bhe, rppt,
	pasha.tatashin, pratyush, ruirui.yang, rdunlap, pmladek,
	dapeng1.mi, kees, elver, kuba, ebiggers, lirongqing, paulmck,
	sourabhjain, coxu, jbohac, ryan.roberts, osandov, cfsworks,
	tangyouling, ritesh.list, adityag, guoren, songshuaishuai,
	kevin.brodsky, vishal.moola, junhui.liu, wangruikang, namcao,
	chao.gao, seanjc, fuqiang.wang, ardb, chenjiahao16, hbathini,
	takahiro.akashi, james.morse, lizhengyu3, x86, linux-doc,
	linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
	linux-riscv, devicetree, kexec
In-Reply-To: <agGkvrg06KNDNfDi@gmail.com>



On 5/11/2026 5:46 PM, Breno Leitao wrote:
> On Mon, May 11, 2026 at 11:04:43AM +0800, Jinjie Ruan wrote:
>> There is a race condition between the kexec_load() system call
>> (crash kernel loading path) and memory hotplug operations that can
>> lead to buffer overflow and potential kernel crash.
>>
>> During prepare_elf_headers(), the following steps occur:
>> 1. The first for_each_mem_range() queries current System RAM memory ranges
>> 2. Allocates buffer based on queried count
>> 3. The 2nd for_each_mem_range() populates ranges from memblock
>>
>> If memory hotplug occurs between step 1 and step 3, the number of ranges
>> can increase, causing out-of-bounds write when populating cmem->ranges[].
>>
>> This happens because kexec_load() uses kexec_trylock (atomic_t) while
>> memory hotplug uses device_hotplug_lock (mutex), so they don't serialize
>> with each other.
>>
>> Add the explicit bounds checking to prevent out-of-bounds access.
> 
> It seems you have a TOCTOU type of issue, and this seems to be shrinking
> the window, but not fully solving it?

I plan to fix this issue as follows, and would appreciate your feedback
on whether this is reasonable.

Sashiko AI code review pointed out there is a TOCTOU (Time-of-Check to
Time-of-Use) race condition in prepare_elf_headers() between the initial
pass that counts System RAM ranges and the second pass that populates them.
If a memory hotplug event occurs between these two steps, the number of
memory regions may increase, causing an out-of-bounds write to
the cmem->ranges[] array.

To resolve this and ensure data consistency, this patch:

1. Wraps the counting and population passes with get_online_mems() and
   crash_hotplug_lock(). This serializes the kexec_file_load() path
   with concurrent memory hotplug operations, ensuring the memory
   map remains consistent throughout the header preparation.

2. Adds an explicit boundary check in prepare_elf64_ram_headers_callback().
   If the number of ranges exceeds the allocated maximum, it now returns
   -EAGAIN, which indicates a transient race, signaling userspace
   kexec-tools to retry the syscall instead of leaving the system
   without a loaded crash kernel.
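
As a rough illustration of point 2 only, the bounded population pass can
be sketched in userspace C. The sketch_* types and function are
hypothetical and not the kernel's struct crash_mem API; the input array
stands in for whatever for_each_mem_range() would report on the second
pass, and the locking from point 1 is not modelled.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the range bookkeeping; not kernel types. */
struct sketch_range {
	unsigned long start;
	unsigned long end;
};

struct sketch_mem {
	unsigned int max_nr_ranges;	/* capacity fixed by the counting pass */
	unsigned int nr_ranges;		/* entries filled so far */
	struct sketch_range *ranges;
};

/*
 * Copy n source ranges (given as [start, end) pairs) into mem, storing
 * an inclusive end. If the source grew past the capacity computed
 * earlier, stop before writing out of bounds and report a transient
 * failure.
 */
static int sketch_populate(struct sketch_mem *mem,
			   const struct sketch_range *src, size_t n)
{
	mem->nr_ranges = 0;
	for (size_t i = 0; i < n; i++) {
		if (mem->nr_ranges >= mem->max_nr_ranges)
			return -EAGAIN;
		mem->ranges[mem->nr_ranges].start = src[i].start;
		mem->ranges[mem->nr_ranges].end = src[i].end - 1;
		mem->nr_ranges++;
	}
	return 0;
}
```

A caller that sees -EAGAIN here would redo both passes rather than treat
the failure as permanent.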

index daf81a873bbd..546be6261177 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -15,6 +15,7 @@
 #include <linux/kexec.h>
 #include <linux/libfdt.h>
 #include <linux/memblock.h>
+#include <linux/memory_hotplug.h>
 #include <linux/of.h>
 #include <linux/of_fdt.h>
 #include <linux/slab.h>
@@ -40,7 +41,7 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
 }

 #ifdef CONFIG_CRASH_DUMP
-int prepare_elf_headers(void **addr, unsigned long *sz)
+static int __prepare_elf_headers(void **addr, unsigned long *sz)
 {
 	struct crash_mem *cmem;
 	unsigned int nr_ranges;
@@ -59,6 +60,11 @@ int prepare_elf_headers(void **addr, unsigned long *sz)
 	cmem->max_nr_ranges = nr_ranges;
 	cmem->nr_ranges = 0;
 	for_each_mem_range(i, &start, &end) {
+		if (cmem->nr_ranges >= cmem->max_nr_ranges) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
 		cmem->ranges[cmem->nr_ranges].start = start;
 		cmem->ranges[cmem->nr_ranges].end = end - 1;
 		cmem->nr_ranges++;
@@ -81,6 +87,21 @@ int prepare_elf_headers(void **addr, unsigned long *sz)
 	kfree(cmem);
 	return ret;
 }
+
+int prepare_elf_headers(void **addr, unsigned long *sz)
+{
+	int ret;
+
+	crash_hotplug_lock();
+	get_online_mems();
+
+	ret = __prepare_elf_headers(addr, sz);
+
+	put_online_mems();
+	crash_hotplug_unlock();
+
+	return ret;
+}
 #endif

> 
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Baoquan He <bhe@redhat.com>
>> Cc: Breno Leitao <leitao@debian.org>
>> Cc: stable@vger.kernel.org
>> Fixes: 3751e728cef2 ("arm64: kexec_file: add crash dump support")
>> Closes: https://sashiko.dev/#/patchset/20260323072745.2481719-1-ruanjinjie%40huawei.com
>> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
>> ---
>>  arch/arm64/kernel/machine_kexec_file.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
>> index e31fabed378a..a67e7b1abbab 100644
>> --- a/arch/arm64/kernel/machine_kexec_file.c
>> +++ b/arch/arm64/kernel/machine_kexec_file.c
>> @@ -59,6 +59,11 @@ static int prepare_elf_headers(void **addr, unsigned long *sz)
>>  	cmem->max_nr_ranges = nr_ranges;
>>  	cmem->nr_ranges = 0;
>>  	for_each_mem_range(i, &start, &end) {
>> +		if (cmem->nr_ranges >= cmem->max_nr_ranges) {
>> +			ret = -ENOMEM;
> 
> -ENOMEM seems to be the wrong errno. This isn't an allocation
> failure; it's a transient race. -EBUSY or -EAGAIN would be more honest




* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Lorenzo Stoakes @ 2026-05-19 12:43 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
	mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
	liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
	youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4zqLfdWoTH9s7FFaqWWj0mESfikYgr7=GcV64qcuXrPxA@mail.gmail.com>

On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen?  I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA.  mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen".  How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > >    page faults. I assume this is due to buggy user code
> > >    or poor synchronization between reads and unmap().
> > >    So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > >    is blocked by page-fault I/O in some applications.
> > >    The blocking occurs in the `dup_mmap()` path during
> > >    fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> >
> > Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.

I asked about the chained waiting :) I'm aware you've observed contention on
write lock, you said so in your LSF talk.

So have you observed that or is this a theory?

>
>
> >
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> >
> > Yeah I'm really not sure about that.
> >
> > Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> > page faults, which is really what fb49c455323ff is about.
> >
> > So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> > now you're saying 'let's change the forking behaviour that's been like that for
> > forever'.
>
>
> I am afraid not. Before we introduced the per-VMA lock, we
> were not performing I/O while holding `mmap_lock`. A page fault
> that needed I/O would drop the `mmap_lock` read lock and allow
> `fork()` to proceed.

Err I'm talking about fork? The patch you reference is a change to fork?

So you're saying that fb49c455323ff which explicitly takes the VMA write lock on
fork, was somehow an addendum after fork didn't take the mmap write lock?

I must be imagining
https://elixir.bootlin.com/linux/v6.0/source/kernel/fork.c#L590 then in v6.0
pre-vma locks :)

I suspect that's _not_ what you're saying, so now what you're suggesting as I
stated above, is to fundamentally change fork behaviour to account for the
existing per-VMA lock behaviour on the fault path?

Again I state - are you really sure you want to fundamentally change fork
behaviour for this?

I am extremely concerned about doing that.

>
> Now, you are suggesting performing I/O while holding the VMA
> lock, which changes the requirements and introduces this
> problem.
>
> >
> > I think you would _really_ have to be sure that's safe. And forking is a very
> > dangerous time in terms of complexity and sensitivity and 'weird stuff'
> > happening so I'd tread _very_ carefully here.
>
> Yep. I think my original proposal did not require any changes
> to `fork()`, since it simply preserved the current behavior of
> dropping the VMA lock before performing I/O. In that model,
> `fork()` would not end up waiting on I/O at all.
>
> What you are suggesting now appears to be performing I/O while
> holding the VMA lock, which in turn introduces the need to
> change `fork()`.

Again, you're saying we should fundamentally change the way fork has worked
forever to work around something else.

At LSF I raised the fact that Josef himself suggested we simply drop this I/O
waiting behaviour for file-backed mappings. Isn't there a way forward that way
rather than 'hey let's drop locks and hope for the best!'?

I am really reticent about this because we've seen HORRIBLE bugs come from fork
behaviour, especially edge cases, and mm testing isn't great so I am basically
opposed to this, and you're not really convincing me here.

>
> >
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> > >         for_each_vma(vmi, mpnt) {
> > >                 struct file *file;
> > >
> > > -               retval = vma_start_write_killable(mpnt);
> > > +               /*
> > > +                * For anonymous or writable private VMAs, prevent
> > > +                * concurrent CoW faults.
> > > +                */
> >
> > To nit pick I think the comment's confusing but also tells you you don't need the
> > specific anon check - writable private is sufficient. And it's not really just
> > CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> >
> > > +               if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > +                                       (mpnt->vm_flags & VM_WRITE)))
> > > +                       retval = vma_start_write_killable(mpnt);
> >
> > I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> > it R/W.
> >
> > I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> > likely PROT_NONE) is here, just do the second check?
> >
> > (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> > vma_test(mpnt, VMA_MAYWRITE_BIT))
>
> Yep, I can definitely refine the check further. But before
> doing that, I'd first like to confirm that we are aligned on
> the direction.
>
> If you still intend to hold the VMA lock while performing I/O,
> then I think we should fix `fork()` to avoid taking
> `vma_start_write()`.

Yeah or we could do something different, it isn't a case of you get to do one of
two options you propose - the maintainers decide which way is appropriate.

Of the two options dropping the lock on the fault path rather than this fork
insanity is my preference but I wonder if we can't find another way.

Let me read through the series and give more thoughts I guess.
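
For reference, the difference between the VM_WRITE check in the quoted diff
and the VM_MAYWRITE variant suggested in the quoted review can be sketched
in userspace as below. This is an illustration only: the helper names are
hypothetical, the anon (!vm_file) test is left out, and the flag values are
assumed to mirror the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; values are assumed to mirror the kernel's. */
#define VM_WRITE    0x00000002UL
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* Check as in the quoted diff: private and currently writable. */
static bool needs_lock_vm_write(unsigned long flags)
{
	return !(flags & VM_SHARED) && (flags & VM_WRITE);
}

/* Suggested variant: private and allowed to become writable later. */
static bool needs_lock_vm_maywrite(unsigned long flags)
{
	return !(flags & VM_SHARED) && (flags & VM_MAYWRITE);
}
```

A private read-only mapping that could later be made R/W via mprotect() has
VM_MAYWRITE set but not VM_WRITE, so only the second predicate covers it.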

>
> >
> > >                 if (retval < 0)
> > >                         goto loop_out;
> > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> >
> > Technical commentary above is sort of 'just cos' :) because I really question
> > doing this honestly.
>
> I think we either need to fix `fork()`, or keep the current
> behavior of dropping the VMA lock before performing I/O.

Yup you said :)

>
> >
> > I'd also like to get Suren's input, however.
>
> Yes. of course.
>
> >
> > Thanks, Lorenzo
>
> Best Regards
> Barry

Thanks, Lorenzo

