* Re: [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device
From: Jason Gunthorpe @ 2026-05-19 12:07 UTC (permalink / raw)
To: Nicolin Chen
Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <745da1a819eb943f2519e660c8bcfde715885c6c.1779161849.git.nicolinc@nvidia.com>
On Mon, May 18, 2026 at 08:38:54PM -0700, Nicolin Chen wrote:
> +void iommu_report_device_broken(struct device *dev)
> +{
> + struct group_device *gdev;
> +
> + /*
> + * We cannot hold group->mutex here. Rely on iommu_group_broken_worker()
> + * to validate dev_has_iommu(). The iommu_group memory is RCU-protected
> + * via kfree_rcu() in iommu_group_release(), and group->devices is an
> + * RCU-protected list, so the lookup runs entirely under rcu_read_lock.
> + *
> + * Note the device might have been concurrently removed from the group
> + * (list_del_rcu) before iommu_deinit_device() cleared the dev->iommu.
> + */
> + rcu_read_lock();
> + gdev = __dev_to_gdev_rcu(dev);
> + if (gdev) {
If this is why the RCU is being added it seems like overkill.
Just add the worker to struct dev_iommu and push it there so it can
use a mutex but I'm confused why are we even adding this function?
The entire design of this series was supposed to have the IOMMU driver
itself adjust it's "STE" to inhibit translated TLPs synchronosly
within its fully locked invalidation loop.
Whats the async worker for?
Jason
^ permalink raw reply
* Re: [PATCH v4 19/24] iommu/arm-smmu-v3: Add invs and has_ats to struct arm_smmu_cmdq_batch
From: Jason Gunthorpe @ 2026-05-19 12:09 UTC (permalink / raw)
To: Nicolin Chen
Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <713ce34e90e836e32f484e214cb6941a6973a962.1779161849.git.nicolinc@nvidia.com>
On Mon, May 18, 2026 at 08:39:02PM -0700, Nicolin Chen wrote:
> The arm_smmu_cmdq_batch_add_cmd_p() might flush a sub-batch mid-way, when
> the ARM_SMMU_OPT_CMDQ_FORCE_SYNC is set or when a batch is full. To allow
> a future change to retry these sub-batch flushes on a timeout and identify
> the broken master, the batch needs to carry both the per-domain invs and
> a per-batch indicator of whether the batch contains an ATC_INV.
This seems like too much to tackle in one series. Let's just assume
all the ATCs in the batch have failed and blow up everything for this
point.
I think retrying will be a lot easier by removing the batch..
This series has become way too big now anyhow..
Jason
^ permalink raw reply
* Re: [PATCH] ARM64: remove unnecessary architecture-specific <asm/device.h>
From: Will Deacon @ 2026-05-19 12:12 UTC (permalink / raw)
To: Ethan Nelson-Moore; +Cc: linux-arm-kernel, linux-kernel, Catalin Marinas
In-Reply-To: <20260517025348.96906-1-enelsonmoore@gmail.com>
On Sat, May 16, 2026 at 07:53:42PM -0700, Ethan Nelson-Moore wrote:
> arch/arm64/include/asm/device.h is identical to
> include/asm-generic/device.h, and therefore the ARM64-specific version
> is unnecessary. Remove it.
>
> Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
> ---
> arch/arm64/include/asm/device.h | 14 --------------
> 1 file changed, 14 deletions(-)
> delete mode 100644 arch/arm64/include/asm/device.h
>
> diff --git a/arch/arm64/include/asm/device.h b/arch/arm64/include/asm/device.h
> deleted file mode 100644
> index 996498751318..000000000000
> --- a/arch/arm64/include/asm/device.h
> +++ /dev/null
> @@ -1,14 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -/*
> - * Copyright (C) 2012 ARM Ltd.
> - */
> -#ifndef __ASM_DEVICE_H
> -#define __ASM_DEVICE_H
> -
> -struct dev_archdata {
> -};
> -
> -struct pdev_archdata {
> -};
> -
> -#endif
Do we need to add a generic-y entry to arch/arm64/include/asm/Kbuild
so that we pick up the asm-generic variant?
Will
^ permalink raw reply
* Re: [PATCH v4 22/24] iommu/arm-smmu-v3: Introduce master->ats_invs
From: Jason Gunthorpe @ 2026-05-19 12:12 UTC (permalink / raw)
To: Nicolin Chen
Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <7fb3652c6448d757aa4b5e7bf32018333e1db1a9.1779161849.git.nicolinc@nvidia.com>
On Mon, May 18, 2026 at 08:39:05PM -0700, Nicolin Chen wrote:
> Similar to master->build_invs used by a per-domain invalidation, add a new
> master->ats_invs to be used by arm_smmu_atc_inv_master().
>
> Since arm_smmu_cmdq_batch_init_cmd() now takes an invs pointer, pass it in.
>
> This will be useful by arm_smmu_cmdq_batch_issue() to backtrack the master
> pointer from a timed out ATC invalidation command in a subsequent change.
Again this is a good place to just use the SID and get back to the
master through the rbtree under a spinlock.
Jason
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Paolo Bonzini @ 2026-05-19 12:13 UTC (permalink / raw)
To: David Woodhouse
Cc: Will Deacon, Marc Zyngier, Jonathan Corbet, Shuah Khan, kvm,
Linux Doc Mailing List, Kernel Mailing List, Linux,
Sean Christopherson, Jim Mattson, Oliver Upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <cf429f2082e863571595f74d1d3dedc3e6a82964.camel@infradead.org>
On Tue, May 19, 2026 at 1:44 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > > So... what next? Is one of the other KVM/arm64 maintainers going to
> > > speak up? Paolo would you consider taking the fixes through your tree
> > > directly?
I admit that my knowledge of Arm is really limited, and I do not
understand which IIDR values have architecturally allowed behaviors
and which (if any) were made up by KVM; but even if I cannot honestly
remark on the code or even the approach, a compatibility knob is the
right thing to have. That's a userspace API design matter, not an Arm
or GIC matter.
I hope that Marc provides a better explanation of why he believes
https://lore.kernel.org/all/20260511113558.3325004-2-dwmw2@infradead.org/
shouldn't be accepted, because I am more than a bit puzzled about
*why* that patch is being rejected or (in v3) so far ignored. Marc in
this thread wrote: "If userspace is not a total joke, it will read all
the ID registers, and configure what it wants to see, assuming it is a
feature that can be configured (not everything can, because the
architecture itself is not fully backward compatible)". But in this
case there's an ID register that tells KVM if userspace wants the old
or the new behavior, independent of whether that old behavior is
architecturally valid or not.
I will certainly take this patch, but I won't override Marc. However
I'd like to better understand his point of view, because right now I
just don't get it.
> If KVM on arm64 doesn't aspire to maintain guest compatibility across
> host kernel changes — regardless of whether the previous kernel's
> behaviour was "blessed" by the architecture specification or not — then
> it does not meet the expectation that we have of KVM implementations in
> the Linux kernel.
I agree with the "aspire" wording. Even if it's not going to be 100%
achievable, KVM *needs* to aspire to maintain both guest compatibility
and architecture precision. Sometimes it's impossible, sometimes there
are constraints that require you to trade off one for another (e.g.
via quirks, or by breaking behavior that no sane guest would have
cared about). But in general as a maintainer you don't *get* to
choose.
Paolo
> Or indeed the standards that we've held for Linux kernel ABIs for the
> last 35 years.
^ permalink raw reply
* Re: [PATCH v13 2/2] arm: dts: aspeed: ventura: add Meta Ventura BMC
From: P.K. Lee @ 2026-05-19 12:16 UTC (permalink / raw)
To: Andrew Jeffery
Cc: robh+dt, krzysztof.kozlowski+dt, conor+dt, joel, devicetree,
linux-arm-kernel, linux-aspeed, linux-kernel, Jason-Hsu, p.k.lee
In-Reply-To: <e5d1cbdb34dddf17b4d474446fb9cebddc0894d9.camel@codeconstruct.com.au>
> > + model = "Facebook ventura RMC";
>
> I suggest capitalising 'Ventura'.
>
> ...
>
> > + i2c47 = &i2c2mux0ch7;
>
> Many of the buses aliased here don't have any devices described below
> them. Can you add some commentary about why it's necessary to enable
> and alias each of these?
>
> ...
>
> > + // Fan Board 1 FRU
>
> I'd rather we pick one commenting style (/* */). Can you please fix
> that throughout?
>
> ...
>
> > +
> > + //fan 0 IL
>
> Can you please add a space between the comment marker and the comment
> itself? This needs fixing throughout.
>
> ...
>
> > +&mdio0 {
> > + status = "okay";
> > + /* * Intentionally left empty.
>
> The comment is a bit busted here. Can you please fix it?
>
> Andrew
>
Thank you for your suggestions. I will address them in the next
revision of the DTS patch.
P.K.
^ permalink raw reply
* Re: [PATCH 0/2] iommu/arm-smmu: Use FIELD_MODIFY() for bitfield operations
From: Hans Zhang @ 2026-05-19 12:20 UTC (permalink / raw)
To: Will Deacon
Cc: robin.murphy, joro, iommu, linux-arm-msm, linux-kernel,
linux-arm-kernel
In-Reply-To: <agxAvl02rcPoEHaq@willie-the-truck>
On 5/19/26 18:51, Will Deacon wrote:
> On Fri, May 01, 2026 at 12:45:43AM +0800, Hans Zhang wrote:
>> Replace open-coded bitfield modifications with the standard FIELD_MODIFY()
>> macro. This improves code readability and adds type/range checking without
>> functional changes.
>
> Does it _really_ improve the readability? '&=' and '|=' patterns are
> pretty idiomatic C code, if you ask me.
>
>> FIELD_MODIFY() internally performs the same mask-clear + set operation but
>> eliminates repetitive boilerplate.
>>
>> ---
>> Hi, If the Maintainers think it's not necessary, please ignore it.
>
> I don't really mind the code either way, so I think I'd prefer to leave
> it as-is unless somebody wants to convince me otherwise...
>
> Will
Hi Will,
It's not like that. Please take a look at the accepted patch below.
https://lore.kernel.org/linux-pci/20260505165436.GA737933@bhelgaas/
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=42ec65b46a4fc7565d48daa42bf025fdc67800eb
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=5fd6f2154734f447e83b6de9a08d16848605191e
Best regards,
Hans
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 12:27 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxDxdxynp4KEovA@google.com>
Mostafa Saleh <smostafa@google.com> writes:
> On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
>> >>
>> >> What I meant was that we need a generic way to identify a pKVM guest, so
>> >> that we can use it in the conditional above.
>> >
>> > I have this patch, with that I can boot with your series unmodified,
>> > but I will need to do more testing.
>> >
>>
>> Thanks, I can add this to the series once you complete the required testing.
>>
>
> I am still running more tests, but looking more into it. Setting
> force_dma_unencrypted() to true for pKVM guests is wrong, as the
> guest shouldn’t try to decrypt arbitrary memory as it can include
> sensitive information (for example in case of virtio sub-page
> allocation) and should strictly rely on the restricted-dma-pool
> for that.
>
> However, with my patch and setting force_dma_unencrypted() to false
> on top of this series, it fails on pKVM due to a missing shared
> attribute as Alexey mentioned, as now SWIOTLB rejects non shared
> attrs, so, the DMA-API has to pass it. With that, I can boot again:
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 5103a04df99f..b19aeec03f27 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -286,6 +286,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> }
>
> if (is_swiotlb_for_alloc(dev)) {
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> page = dma_direct_alloc_swiotlb(dev, size, attrs);
> if (page) {
> /*
> @@ -449,6 +451,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> &cpu_addr, gfp, attrs);
>
> if (is_swiotlb_for_alloc(dev)) {
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> page = dma_direct_alloc_swiotlb(dev, size, attrs);
> if (!page)
> return NULL;
> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
> index 4e35264ab6f8..8ee5bbf78cfb 100644
> --- a/kernel/dma/direct.h
> +++ b/kernel/dma/direct.h
> @@ -92,6 +92,7 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
> return DMA_MAPPING_ERROR;
>
> + attrs |= DMA_ATTR_CC_SHARED;
> return swiotlb_map(dev, phys, size, dir, attrs);
> }
>
> --
>
How about the below?
modified kernel/dma/direct.c
@@ -278,6 +278,10 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
if (is_swiotlb_for_alloc(dev)) {
+
+ if (dev->dma_io_tlb_mem->unencrypted)
+ attrs |= DMA_ATTR_CC_SHARED;
+
page = dma_direct_alloc_swiotlb(dev, size, attrs);
if (page) {
/*
@@ -451,6 +455,10 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
&cpu_addr, gfp, attrs);
if (is_swiotlb_for_alloc(dev)) {
+
+ if (dev->dma_io_tlb_mem->unencrypted)
+ attrs |= DMA_ATTR_CC_SHARED;
+
page = dma_direct_alloc_swiotlb(dev, size, attrs);
if (!page)
return NULL;
modified kernel/dma/direct.h
@@ -92,6 +92,9 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
return DMA_MAPPING_ERROR;
+ if (dev->dma_io_tlb_mem->unencrypted)
+ attrs |= DMA_ATTR_CC_SHARED;
+
return swiotlb_map(dev, phys, size, dir, attrs);
}
>
>
> I will keep testing and let you know how it goes. If there is nothing
> else required to convert pKVM guests to CC, I can just post the patch
> separately as it has no dependency on this series.
>
That would be useful. I can then carry the patch as a dependent change,
which can also be merged separately
>
> Re force_dma_unencrypted(), I am looking into a safe way to use it
> for pKVM as I beleive it will be useful to eliminate some bouncing.
> However, that’s not critical for this series and can be added later
> as I am still investigating it, if I reach something I can post it
> along the pKVM patch above.
>
> Thanks,
> Mostafa
>
>>
>>
>> -aneesh
^ permalink raw reply
* [ANNOUNCEMENT / CFP] LPC 2026 Devicetree Microconference CFP
From: Krzysztof Kozlowski @ 2026-05-19 12:28 UTC (permalink / raw)
To: devicetree, linux-arm-kernel, linux-arm-msm, Rob Herring,
Conor Dooley, imx, linux-kernel, devicetree-compiler,
devicetree-spec
Hey there,
There will be a dedicated Devicetree Microconference on Linux Plumbers
2026 in October in Prague. Call For Papers is currently open.
https://lpc.events/event/20/contributions/2317/
The Devicetree Microconference focuses on discussing and solving
problems present in the systems using Devicetree as firmware
representation. This notably is Linux kernel and U-Boot, which share the
Devicetree bindings and sources, but also can cover topics relevant to
Zephyr or System Devicetrees.
Topics suggested for discussion are mentioned in "Ongoing problems" in
session description (link above), but of course are not limited to
these. Topics should obviously follow standard Microconference
expectation, that is to be discussion oriented.
We discuss and try to solve actual problems, thus if one wants to
present his new patchset, then that patchset must have been already
reviewed on the list with inconclusive result thus having another
discussing in person would be beneficial.
Allocated time per slot will be between 15 to 30 minutes (20 minutes
last year).
CFP end is not fixed yet, but it will be some time after end of "Kernel
Summit" CFP, so *probably middle of July*. I will confirm that later.
Proposals can be submitted here:
https://lpc.events/event/20/abstracts/
Please remember to choose "Devicetree MC" as the track.
Best regards,
Krzysztof
^ permalink raw reply
* Re: [PATCH v6 1/2] KVM: arm64: Support FFA_MSG_SEND_DIRECT_REQ in host handler
From: Fuad Tabba @ 2026-05-19 12:29 UTC (permalink / raw)
To: perlarsen
Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
Catalin Marinas, Will Deacon, Yeoreum Yun, Ben Horgan,
Oliver Upton, Armelle Laine, Sebastien Ene, linux-arm-kernel,
kvmarm, linux-kernel
In-Reply-To: <20260501-host-direct-messages-v6-1-3f4af727ed85@google.com>
Hi Per and Seb,
On Fri, 1 May 2026 at 06:34, Per Larsen via B4 Relay
<devnull+perlarsen.google.com@kernel.org> wrote:
>
> From: Sebastian Ene <sebastianene@google.com>
>
> Allow direct messages to be forwarded from the host. The host should
> not be sending framework messages so they are filtered out.
>
> Signed-off-by: Sebastian Ene <sebastianene@google.com>
> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
> Signed-off-by: Per Larsen <perlarsen@google.com>
> ---
> arch/arm64/kvm/hyp/nvhe/ffa.c | 26 ++++++++++++++++++++++++++
> include/linux/arm_ffa.h | 2 ++
> 2 files changed, 28 insertions(+)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/ffa.c b/arch/arm64/kvm/hyp/nvhe/ffa.c
> index 1af722771178..3a58e01d255f 100644
> --- a/arch/arm64/kvm/hyp/nvhe/ffa.c
> +++ b/arch/arm64/kvm/hyp/nvhe/ffa.c
> @@ -862,6 +862,28 @@ static void do_ffa_part_get(struct arm_smccc_1_2_regs *res,
> hyp_spin_unlock(&host_buffers.lock);
> }
>
> +static void do_ffa_direct_msg(struct arm_smccc_1_2_regs *res,
> + struct kvm_cpu_context *ctxt)
> +{
> + DECLARE_REG(u32, endp, ctxt, 1);
> + DECLARE_REG(u32, flags, ctxt, 2);
> +
> + struct arm_smccc_1_2_regs *args = (void *)&ctxt->regs.regs[0];
> +
> + if (FIELD_GET(FFA_SRC_ENDPOINT_MASK, endp) != HOST_FFA_ID) {
> + ffa_to_smccc_error(res, FFA_RET_INVALID_PARAMETERS);
> + return;
> + }
> +
> + /* filter out framework messages and validate SBZ/MBZ bits */
> + if (flags) {
> + ffa_to_smccc_error(res, FFA_RET_INVALID_PARAMETERS);
> + return;
> + }
> +
> + arm_smccc_1_2_smc(args, res);
> +}
> +
> bool kvm_host_ffa_handler(struct kvm_cpu_context *host_ctxt, u32 func_id)
> {
> struct arm_smccc_1_2_regs res;
> @@ -920,6 +942,10 @@ bool kvm_host_ffa_handler(struct kvm_cpu_context *host_ctxt, u32 func_id)
> case FFA_PARTITION_INFO_GET:
> do_ffa_part_get(&res, host_ctxt);
> goto out_handled;
> + case FFA_MSG_SEND_DIRECT_REQ:
> + case FFA_FN64_MSG_SEND_DIRECT_REQ:
> + do_ffa_direct_msg(&res, host_ctxt);
> + goto out_handled;
> }
The commit message says the host should not be sending framework
messages, so they're filtered out. That filter is only installed for
REQ. The RESP side is asymmetric: FFA_MSG_SEND_DIRECT_RESP is in the
reject list, but FFA_FN64_MSG_SEND_DIRECT_RESP isn't. Should you add
that too?
Sashiko [1] also flagged two issues here. Took a quick look and the
tracing one seems to be worth a closer look. The upper-bits question
is judgement but plausible.
[1] https://sashiko.dev/#/patchset/20260501-host-direct-messages-v6-0-3f4af727ed85%40google.com
Cheers,
/fuad
>
> if (ffa_call_supported(func_id))
> diff --git a/include/linux/arm_ffa.h b/include/linux/arm_ffa.h
> index 81e603839c4a..f47d4bd51b2d 100644
> --- a/include/linux/arm_ffa.h
> +++ b/include/linux/arm_ffa.h
> @@ -269,6 +269,8 @@ bool ffa_partition_check_property(struct ffa_device *dev, u32 property)
> (ffa_partition_check_property(dev, FFA_PARTITION_DIRECT_REQ2_RECV) && \
> !dev->mode_32bit)
>
> +#define FFA_SRC_ENDPOINT_MASK GENMASK(31, 16)
> +
> /* For use with FFA_MSG_SEND_DIRECT_{REQ,RESP} which pass data via registers */
> struct ffa_send_direct_data {
> unsigned long data0; /* w3/x3 */
>
> --
> 2.54.0.545.g6539524ca2-goog
>
>
>
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Marc Zyngier @ 2026-05-19 12:38 UTC (permalink / raw)
To: Paolo Bonzini
Cc: David Woodhouse, Will Deacon, Jonathan Corbet, Shuah Khan, kvm,
Linux Doc Mailing List, Kernel Mailing List, Linux,
Sean Christopherson, Jim Mattson, Oliver Upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <CABgObfacAYexR25SMi1kSZMRnHx3EDGj8=E84V1DumER66ibnQ@mail.gmail.com>
On Tue, 19 May 2026 13:13:41 +0100,
Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On Tue, May 19, 2026 at 1:44 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > > > So... what next? Is one of the other KVM/arm64 maintainers going to
> > > > speak up? Paolo would you consider taking the fixes through your tree
> > > > directly?
>
> I admit that my knowledge of Arm is really limited, and I do not
> understand which IIDR values have architecturally allowed behaviors
> and which (if any) were made up by KVM; but even if I cannot honestly
> remark on the code or even the approach, a compatibility knob is the
> right thing to have. That's a userspace API design matter, not an Arm
> or GIC matter.
I agree that we can have the knob -- not having it is a userspace
issue, and I have said that I was OK with preserving the userspace
interface.
>
> I hope that Marc provides a better explanation of why he believes
> https://lore.kernel.org/all/20260511113558.3325004-2-dwmw2@infradead.org/
> shouldn't be accepted, because I am more than a bit puzzled about
> *why* that patch is being rejected or (in v3) so far ignored. Marc in
> this thread wrote: "If userspace is not a total joke, it will read all
> the ID registers, and configure what it wants to see, assuming it is a
> feature that can be configured (not everything can, because the
> architecture itself is not fully backward compatible)".
This was a more general comment on the full mechanism that we use to
save/restore the state and at the same time configure the feature
set. Which is what the GICD_IIDR does to some extent for the GIC.
> But in this case there's an ID register that tells KVM if userspace
> wants the old or the new behavior, independent of whether that old
> behavior is architecturally valid or not.
But the "old behaviour" makes no sense, and cannot be used by a guest:
- either the guest doesn't use the alternative interrupt groups, then
it wasn't affected by the bug. That's 100% of the guests.
- or the guest did try to use the alternative groups, and it *NEVER*
worked, as it wouldn't get any interrupt at all. What is the point
of preserving a "feature" that only results in a non-working guest?
Given that, re-introducing a behaviour that cannot be used makes zero
sense to me.
> I will certainly take this patch, but I won't override Marc. However
> I'd like to better understand his point of view, because right now I
> just don't get it.
I don't get it either, but for different reasons.
>
> > If KVM on arm64 doesn't aspire to maintain guest compatibility across
> > host kernel changes — regardless of whether the previous kernel's
> > behaviour was "blessed" by the architecture specification or not — then
> > it does not meet the expectation that we have of KVM implementations in
> > the Linux kernel.
>
> I agree with the "aspire" wording. Even if it's not going to be 100%
> achievable, KVM *needs* to aspire to maintain both guest compatibility
> and architecture precision. Sometimes it's impossible, sometimes there
> are constraints that require you to trade off one for another (e.g.
> via quirks, or by breaking behavior that no sane guest would have
> cared about). But in general as a maintainer you don't *get* to
> choose.
>
> Paolo
>
> > Or indeed the standards that we've held for Linux kernel ABIs for the
> > last 35 years.
As I said before, I'd be OK with something that would restore IIDR to
REV1. But not something that actively breaks the GIC emulation by
reintroducing a bug. That's, by construction, dead code that will only
bitrot, because there is no SW that can make use of this nonsense.
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: David Woodhouse @ 2026-05-19 12:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Will Deacon, Marc Zyngier, Jonathan Corbet, Shuah Khan, kvm,
Linux Doc Mailing List, Kernel Mailing List, Linux,
Sean Christopherson, Jim Mattson, Oliver Upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <CABgObfacAYexR25SMi1kSZMRnHx3EDGj8=E84V1DumER66ibnQ@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2201 bytes --]
On Tue, 2026-05-19 at 14:13 +0200, Paolo Bonzini wrote:
> I admit that my knowledge of Arm is really limited, and I do not
> understand which IIDR values have architecturally allowed behaviors
> and which (if any) were made up by KVM; but even if I cannot honestly
> remark on the code or even the approach, a compatibility knob is the
> right thing to have. That's a userspace API design matter, not an Arm
> or GIC matter.
To be clear: the "IID" in IIDR is "implementer identification"; the
implementer in this case being KVM.
The revision field in the IIDR literally *is* the compatibility knob.
The values are indeed 'made up by KVM', and have been correctly bumped
from 1 to 2 to 3 as guest-visible behaviour has changed.
The only problem is that there's no way to set it *back* to 1 again, on
a newer kernel (which defaults to 3). We can only set it to 2 or 3.
The patch which is causing all this fuss is little more than a one-
liner which allows userspace to set it to 1 again, and a second line to
actually *honour* the corresponding behaviour that certain registers
aren't writable.
This is the only way to retain that historical behaviour so that we
don't have to change it underneath running guests on a kernel upgrade
(or worse, rip the new behaviour *away* from newly-launched guests, if
we have to roll back to the old kernel after launching on the new one).
> I will certainly take this patch, but I won't override Marc. However
> I'd like to better understand his point of view, because right now I
> just don't get it.
Indeed. Like you, I just don't get it. I cannot see any reason *not* to
take the fix, and I am *trying* (with limited success) to limit the
expression of my frustration to the specific technical issue at hand.
Marc, I have a huge amount of respect for you, and I'm painfully aware
that I risk burning bridges here by pressing the issue. But on this
specific topic I respectfully believe that you have made the wrong
decision, and I beg you to reconsider.
We *need* to be able to upgrade without changing behaviour for guests.
Even if the old behaviour was "wrong" according to the architecture
specification.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 1/2] KVM: arm64: Introduce S2 walker SKIP return options
From: Will Deacon @ 2026-05-19 12:43 UTC (permalink / raw)
To: Leonardo Bras
Cc: Oliver Upton, Marc Zyngier, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Fuad Tabba, Raghavendra Rao Ananta,
linux-arm-kernel, kvmarm, linux-kernel
In-Reply-To: <agsYF-Dav2bpxWK4@devkitleo>
On Mon, May 18, 2026 at 02:45:59PM +0100, Leonardo Bras wrote:
> Hello Oliver, Will,
> Thanks for reviewing!
>
> On Mon, May 18, 2026 at 09:52:16AM +0100, Will Deacon wrote:
> > On Mon, May 18, 2026 at 12:22:47AM -0700, Oliver Upton wrote:
> > > On Fri, May 15, 2026 at 08:59:02PM +0100, Leonardo Bras wrote:
> > > > Introduce S2 walker return values:
> > > > - SKIP_CHILDREN: skip walking the children of the current node
> > > > - SKIP_SIBLINGS: skip waling the siblings of the current node
> > > >
> > > > Also, modify __kvm_pgtable_visit() to fulfil the hing on above return
> > > > values. Current walkers should not be impacted
> > >
> > > I'd rather see something based around new walk flags than introducing an
> > > entirely new mechanic around return values.
> > >
> > > e.g. you could split the LEAF flag into separate flags for blocks v.
> > > pages:
> > >
> > > KVM_PGTABLE_WALK_PAGE,
> > > KVM_PGTABLE_WALK_BLOCK,
> > > KVM_PGTABLE_WALK_LEAF = KVM_PGTABLE_WALK_PAGE |
> > > KVM_PGTABLE_WALK_BLOCK,
> > >
> > > and then let __kvm_pgtable_visit() decide how to steer the walk. You may
> > > need some special handling to get the address arithmetic right when
> > > skipping over a table of page descriptors.
>
> I am probably not getting the whole inner workings of this solution, but
> IIUC the idea would be to walk the blocks, but not the pages, right?
>
> Blocks meaning level2- and pages being level3?
>
> > I was wondering along similar lines, but maybe it would be useful just
> > to pass a maximum level to the walker logic? That feels like the most
> > general case without complicating the existing logic.
>
> This proposal seems simpler for me to understand, and indeed looks like a
> better solution than what I have proposed, taking care of the
> 'already split' case with better performance, as it don't even walk a
> single level-3 entry.
>
> On the 'splitting' case, it also works flawlessly if the memory is given in
> level-2 blocks. There is only one case that I would like to address here:
>
> - Memory given in level-1 blocks (say 1GB)
> - Walker flag says 'walk down to level-2 only'
> - Split Walker on level-1 will break page down to (up to) level-3 entries.
> - Walker will continue to be called on level-2 entries, even though it's
> not necessary.
If you're only visiting leaves, why would it be called on the level-2
table entries?
Will
^ permalink raw reply
* RE: [PATCH] spi: aspeed: Replace VLA parameter with flat pointer in calibration helper
From: Chin-Ting Kuo @ 2026-05-19 12:43 UTC (permalink / raw)
To: Mark Brown
Cc: clg@kaod.org, joel@jms.id.au, andrew@codeconstruct.com.au,
linux-aspeed@lists.ozlabs.org, openbmc@lists.ozlabs.org,
linux-spi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, BMC-SW, kernel test robot
In-Reply-To: <659a6593-0223-4a26-830b-1390326b84e5@sirena.org.uk>
Hi Mark,
Thanks for the review.
> -----Original Message-----
> From: Mark Brown <broonie@kernel.org>
> Sent: Tuesday, May 19, 2026 7:04 PM
> Subject: Re: [PATCH] spi: aspeed: Replace VLA parameter with flat pointer in
> calibration helper
>
> On Mon, May 18, 2026 at 05:57:08PM +0800, Chin-Ting Kuo wrote:
>
> > - while (k < cols && buf[i][k])
> > + while (k < cols && buf[i * cols + k])
>
> This really needs () to make it clear what's going on; the precedence is well
> defined but not everyone is going to know that off the top of their head.
Okay, below change will be added in the next patch version.
while (k < cols && buf[(i * cols) + k])
Chin-Ting
^ permalink raw reply
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Lorenzo Stoakes @ 2026-05-19 12:43 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, surenb, akpm, linux-mm, david, liam, vbabka, rppt,
mhocko, jack, pfalcato, wanglian, chentao, lianux.mm, kunwu.chan,
liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4zqLfdWoTH9s7FFaqWWj0mESfikYgr7=GcV64qcuXrPxA@mail.gmail.com>
On Mon, May 18, 2026 at 07:25:54PM +0800, Barry Song wrote:
> On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote:
> > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote:
> > > > > > > 1. There is no deterministic latency for I/O completion. It depends on
> > > > > > > both the hardware and the software stack (bio/request queues and the
> > > > > > > block scheduler). Sometimes the latency is short; at other times it can
> > > > > > > be quite long. In such cases, a high-priority thread performing operations
> > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced to wait
> > > > > > > for an unpredictable amount of time.
> > > > > >
> > > > > > But does that actually happen? I find it hard to believe that thread A
> > > > > > unmaps a VMA while thread B is in the middle of taking a page fault in
> > > > > > that same VMA. mprotect() and madvise() are more likely to happen, but
> > > > > > it still seems really unlikely to me.
> > > > >
> > > > > It doesn’t have to involve unmapping or applying mprotect to
> > > > > the entire VMA—just a portion of it is sufficient.
> > > >
> > > > Yes, but that still fails to answer "does this actually happen". How much
> > > > performance is all this complexity in the page fault handler buying us?
> > > > If you don't answer this question, I'm just going to go in and rip it
> > > > all out.
> > > >
> > >
> > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be
> > > waiting for answers),
> > >
> > > As promised during LSF/MM/BPF, we conducted thorough
> > > testing on Android phones to determine whether performing
> > > I/O in `filemap_fault()` can block `vma_start_write()`.
> > > I wanted to give a quick update on this question.
> > >
> > > Nanzhe at Xiaomi created tracing scripts and ran various
> > > applications on Android devices with I/O performed under
> > > the VMA lock in `filemap_fault()`. We found that:
> > >
> > > 1. There are very few cases where unmap() is blocked by
> > > page faults. I assume this is due to buggy user code
> > > or poor synchronization between reads and unmap().
> > > So I assume it is not a problem.
> > >
> > > 2. We observed many cases where `vma_start_write()`
> > > is blocked by page-fault I/O in some applications.
> > > The blocking occurs in the `dup_mmap()` path during
> > > fork().
> > >
> > > With Suren's commit fb49c455323ff ("fork: lock VMAs of
> > > the parent process when forking"), we now always hold
> > > `vma_write_lock()` for each VMA. Note that the
> > > `mmap_lock` write lock is also held, which could lead to
> > > chained waiting if page-fault I/O is performed without
> > > releasing the VMA lock.
> >
> > Hm but did you observe this 'chained waiting'? And what were the latencies?
>
> We have clearly observed that the `fork()` operations of many
> popular Android apps, such as iQiyi, Baidu Tieba, and 10086,
> end up waiting on page-fault (PF) I/O when the VMA lock is
> held during I/O operations. This has already become a
> practical issue. I also believe this can lead to chained
> waiting, since the global `mmap_lock` blocks all threads that
> need to acquire it.
I asked about the chained waiting :) I'm aware you've observed contention on
write lock, you said so in your LSF talk.
So have you observed that or is this a theory?
>
>
> >
> > >
> > > My gut feeling is that Suren's commit may be overshooting,
> > > so my rough idea is that we might want to do something like
> > > the following (we haven't tested it yet and it might be
> > > wrong):
> >
> > Yeah I'm really not sure about that.
> >
> > Prior to the VMA locks, the mmap write lock would have guaranteed no concurrent
> > page faults, which is really what Fb49c455323ff is about.
> >
> > So Suren's patch was essentially restoring the _existing_ forking behaviour, and
> > now you're saying 'let's change the forking behaviour that's been like that for
> > forever'.
>
>
> I am afraid not. Before we introduced the per-VMA lock, we
> were not performing I/O while holding `mmap_lock`. A page fault
> that needed I/O would drop the `mmap_lock` read lock and allow
> `fork()` to proceed.
Err I'm talking about fork? The patch you reference is a change to fork?
So you're saying that Fb49c455323ff which explicitly takes the VMA write lock on
fork, was somehow an addendum after fork didnt take the mmap write lock?
I must be imagining
https://elixir.bootlin.com/linux/v6.0/source/kernel/fork.c#L590 then in v6.0
pre-vma locks :)
I suspect that's _not_ what you're saying, so now what you're suggesting as I
stated above, is to fundamentally change fork behaviour to account for the
existing per-VMA lock behaviour on the fault path?
Again I state - are you really sure you want to fundamentally change fork
behaviour for this?
I am extremely concerned about doing that.
>
> Now, you are suggesting performing I/O while holding the VMA
> lock, which changes the requirements and introduces this
> problem.
>
> >
> > I think you would _really_ have to be sure that's safe. And forking is a very
> > dangerous time in terms of complexity and sensitivity and 'weird stuff'
> > happening so I'd tread _very_ carefully here.
>
> Yep. I think my original proposal did not require any changes
> to `fork()`, since it simply preserved the current behavior of
> dropping the VMA lock before performing I/O. In that model,
> `fork()` would not end up waiting on I/O at all.
>
> What you are suggesting now appears to be performing I/O while
> holding the VMA lock, which in turn introduces the need to
> change `fork()`.
Again, you're saying we should fundamentally change the way fork has worked
forever to work around something else.
At LSF I raised the fact that Josef himself suggested we simply drop this I/O
waiting behaviour for file-backed mapppings. Isn't there a way forward that way
rather than 'hey let's drop locks and hope for the best!'
I am really reticent about this because we've seen HORRIBLE bugs come from fork
behaviour, especially edge cases, and mm testing isn't great so I am basically
opposed to this, and you're not really convincing me here.
>
> >
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2311ae7c2ff4..5ddaf297f31a 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct
> > > *mm, struct mm_struct *oldmm)
> > > for_each_vma(vmi, mpnt) {
> > > struct file *file;
> > >
> > > - retval = vma_start_write_killable(mpnt);
> > > + /*
> > > + * For anonymous or writable private VMAs, prevent
> > > + * concurrent CoW faults.
> > > + */
> >
> > To nit pick I think the comment's confusing but also tells you you don't need to
> > specific anon check - writable private is sufficient. And it's not really just
> > CoW that's the issue, it's anon_vma population _at all_ as well as CoW.
> >
> > > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) &&
> > > + (mpnt->vm_flags & VM_WRITE)))
> > > + retval = vma_start_write_killable(mpnt);
> >
> > I think this has to be VM_MAYWRITE, because somebody could otherwise mprotect()
> > it R/W.
> >
> > I also don't understand why !mpnt->vm_file for a read-only anon mapping (more
> > likely PROT_NONE) is here, just do the second check?
> >
> > (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) &&
> > vma_test(mpnt, VMA_MAYWRITE_BIT))
>
> Yep, I can definitely refine the check further. But before
> doing that, I'd first like to confirm that we are aligned on
> the direction.
>
> If you still intend to hold the VMA lock while performing I/O,
> then I think we should fix `fork()` to avoid taking
> `vma_start_write()`.
Yeah or we could do something different, it isn't a case of you get to do one of
two options you propose - the maintainers decide which way is appropriate.
Of the two options dropping the lock on the fault path rather than this fork
insanity is my preference but I wonder if we can't find another way.
Let me read through the series and give more thoughts I guess.
>
> >
> > > if (retval < 0)
> > > goto loop_out;
> > > if (mpnt->vm_flags & VM_DONTCOPY) {
> > >
> > > Based on the above, we may want to re-check whether fork()
> > > can be blocked by page faults. At the same time, if Suren,
> > > you, or anyone else has any comments, please feel free to
> > > share them.
> > >
> > > Best Regards
> > > Barry
> >
> > Technical commentary above is sort of 'just cos' :) because I really question
> > doing this honestly.
>
> I think we either need to fix `fork()`, or keep the current
> behavior of dropping the VMA lock before performing I/O.
Yup you said :)
>
> >
> > I'd also like to get Suren's input, however.
>
> Yes. of course.
>
> >
> > Thanks, Lorenzo
>
> Best Regards
> Barry
Thanks, Lorenzo
^ permalink raw reply
* [PATCH v3] coresight: etm-perf: Fix reference count leak in etm_setup_aux
From: Ma Ke @ 2026-05-19 12:43 UTC (permalink / raw)
To: suzuki.poulose, mike.leach, james.clark, leo.yan,
alexander.shishkin, mathieu.poirier
Cc: coresight, linux-arm-kernel, linux-kernel, akpm, Ma Ke, stable
bus_find_device() returns a device with its reference count
incremented. When a user-selected sink is obtained through
coresight_get_sink_by_id(), etm_setup_aux() keeps using the returned
sink while building the path and allocating the sink buffer.
Therefore the lookup reference must remain valid while etm_setup_aux()
is still using the sink, otherwise the sink could be removed under the
caller. Drop the lookup reference on the common exit path, after
etm_setup_aux() no longer directly uses the user-selected sink.
The CoreSight path code takes the references it needs for built paths,
so the initial lookup reference from coresight_get_sink_by_id() is no
longer needed after setup_aux finishes.
Found by code review.
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Cc: stable@vger.kernel.org
Fixes: 0e6c20517596 ("coresight: etm-perf: Allow an event to use different sinks")
---
Changes in v3:
- do not drop the lookup reference in coresight_get_sink_by_id(), as
that would return a sink pointer without keeping the device reference
while etm_setup_aux() is still using it.
- dropped the lookup reference in etm_setup_aux on the common exit path,
as suggested by Suzuki.
- updated the commit message to describe why the reference is kept
until etm_setup_aux() finishes using the sink.
Changes in v2:
- modified the patch as suggestions.
---
drivers/hwtracing/coresight/coresight-etm-perf.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c
index f85dedf89a3f..d5116177c1b9 100644
--- a/drivers/hwtracing/coresight/coresight-etm-perf.c
+++ b/drivers/hwtracing/coresight/coresight-etm-perf.c
@@ -456,6 +456,11 @@ static void *etm_setup_aux(struct perf_event *event, void **pages,
goto err;
out:
+ if (user_sink) {
+ put_device(&user_sink->dev);
+ user_sink = NULL;
+ }
+
return event_data;
err:
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Lorenzo Stoakes @ 2026-05-19 12:45 UTC (permalink / raw)
To: Barry Song
Cc: Suren Baghdasaryan, Matthew Wilcox, akpm, linux-mm, david, liam,
vbabka, rppt, mhocko, jack, pfalcato, wanglian, chentao,
lianux.mm, kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng,
nphamcs, bhe, youngjun.park, linux-arm-kernel, linux-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAGsJ_4zgeE3ebEJ+j7GJFzWVoVYHeiOn4dOoOsmWCLA6s=oECQ@mail.gmail.com>
On Tue, May 19, 2026 at 05:14:45AM +0800, Barry Song wrote:
> On Tue, May 19, 2026 at 3:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, May 18, 2026 at 4:26 AM Barry Song <baohua@kernel.org> wrote:
> > >
> > > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> > > >
> > > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote:
> [...]
> > >
> > > I think we either need to fix `fork()`, or keep the current
> > > behavior of dropping the VMA lock before performing I/O.
> >
> > I see. So, this problem arises from the fact that we are changing the
> > pagefaults requiring I/O operation to hold VMA lock...
> > And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> > is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> > anonymous and COW VMAs only while holding mmap_write_lock, preventing
> > any VMA modification. On the surface, that looks ok to me but I might
> > be missing some corner cases. If nobody sees any obvious issues, I
> > think it's worth a try.
> >
>
> Thanks. Besides the creation of processes via fork(), I
> am also beginning to worry about the death of processes.
>
> One thing that came to my mind this morning
> is that when lowmemorykiller decides to kill an app, we
What's the lowmemorykiller? :P you mean the OOM killer?
> want the memory to be released as quickly as possible so
> the new app or user scenario can get memory sooner.
>
> In that case, if the app being killed is performing I/O
> while holding the VMA lock, the unmapping procedure
> could end up being blocked as well.
>
> If we release the VMA lock as we currently do, we allow
> process exit to proceed.
>
> I haven't thought it through very clearly yet, and I
> may be wrong. I'd like to do more investigation. I hope
> the apps being killed stay very still, but who knows—we
> have so many applications in the market.
Yeah let's tread very carefully please, you're picking two of the most fraught
areas of mm, I'm not going to want to see changes there unless they're
substantially more convincingly argued.
>
> Meanwhile, if you have any comments regarding the death
> of processes, they would be very welcome.
As above, leave it alone please :)
>
> Best Regards
> Barry
Thanks, Lorenzo
^ permalink raw reply
* Re: [PATCH] coresight: platform: defer connection counter increment until alloc succeeds
From: James Clark @ 2026-05-19 12:46 UTC (permalink / raw)
To: Jie Gan
Cc: coresight, linux-arm-kernel, linux-kernel, Suzuki K Poulose,
Mike Leach, Leo Yan, Alexander Shishkin, Tingwei Zhang
In-Reply-To: <20260511-fix-ref-count-issue-v1-1-99d647810d3c@oss.qualcomm.com>
On 11/05/2026 5:19 am, Jie Gan wrote:
> coresight_add_out_conn() increments nr_outconns before calling
> devm_krealloc_array() and again before devm_kmalloc(). If either
> allocation fails, the counter is already bumped while the corresponding
> array entry is NULL or uninitialized garbage.
>
> coresight_add_in_conn() has the same problem with nr_inconns and
> devm_krealloc_array().
>
> In both cases the probe returns -ENOMEM, which causes
> coresight_get_platform_data() to call coresight_release_platform_data()
> for cleanup. That function iterates up to nr_outconns (or nr_inconns)
> entries and dereferences each pointer unconditionally, hitting the NULL
> or garbage entry and panicking instead of failing gracefully.
>
> Fix by moving the counter increments to after all allocations succeed,
> so the struct is always consistent on any error path.
>
> Fixes: 3d4ff657e454 ("coresight: Dynamically add connections")
> Fixes: e3f4e68797a9 ("coresight: Store in-connections as well as out-connections")
> Signed-off-by: Jie Gan <jie.gan@oss.qualcomm.com>
Reviewed-by: James Clark <james.clark@linaro.org>
> ---
> drivers/hwtracing/coresight/coresight-platform.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-platform.c b/drivers/hwtracing/coresight/coresight-platform.c
> index e337b6e2bf32..93c2d075cad6 100644
> --- a/drivers/hwtracing/coresight/coresight-platform.c
> +++ b/drivers/hwtracing/coresight/coresight-platform.c
> @@ -45,9 +45,8 @@ coresight_add_out_conn(struct device *dev,
> }
> }
>
> - pdata->nr_outconns++;
> pdata->out_conns =
> - devm_krealloc_array(dev, pdata->out_conns, pdata->nr_outconns,
> + devm_krealloc_array(dev, pdata->out_conns, pdata->nr_outconns + 1,
> sizeof(*pdata->out_conns), GFP_KERNEL);
> if (!pdata->out_conns)
> return ERR_PTR(-ENOMEM);
> @@ -63,7 +62,8 @@ coresight_add_out_conn(struct device *dev,
> * used right away.
> */
> *conn = *new_conn;
> - pdata->out_conns[pdata->nr_outconns - 1] = conn;
> + pdata->out_conns[pdata->nr_outconns] = conn;
> + pdata->nr_outconns++;
> return conn;
> }
> EXPORT_SYMBOL_GPL(coresight_add_out_conn);
> @@ -86,13 +86,13 @@ int coresight_add_in_conn(struct coresight_connection *out_conn)
> return 0;
> }
>
> - pdata->nr_inconns++;
> pdata->in_conns =
> - devm_krealloc_array(dev, pdata->in_conns, pdata->nr_inconns,
> + devm_krealloc_array(dev, pdata->in_conns, pdata->nr_inconns + 1,
> sizeof(*pdata->in_conns), GFP_KERNEL);
> if (!pdata->in_conns)
> return -ENOMEM;
> - pdata->in_conns[pdata->nr_inconns - 1] = out_conn;
> + pdata->in_conns[pdata->nr_inconns] = out_conn;
> + pdata->nr_inconns++;
> return 0;
> }
> EXPORT_SYMBOL_GPL(coresight_add_in_conn);
>
> ---
> base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
> change-id: 20260511-fix-ref-count-issue-7c44ce39700f
>
> Best regards,
^ permalink raw reply
* Re: [PATCH v2] cpu/hotplug: Fix NULL kobject warning in cpuhp_smt_enable()
From: Will Deacon @ 2026-05-19 12:48 UTC (permalink / raw)
To: Catalin Marinas
Cc: Jinjie Ruan, punit.agrawal, rafael.j.wysocki, fengchengwen,
chenl311, suzuki.poulose, maz, timothy.hayes, lpieralisi,
mrigendra.chaubey, arnd, sudeep.holla, yangyicong, jic23,
pierre.gondois, linux-arm-kernel, linux-kernel, james.morse
In-Reply-To: <agIT0R4KV3YBYjlo@arm.com>
On Mon, May 11, 2026 at 06:37:21PM +0100, Catalin Marinas wrote:
> On Mon, Apr 27, 2026 at 10:35:07AM +0800, Jinjie Ruan wrote:
> > On arm64, when booting with `maxcpus` greater than the number of present
> > CPUs (e.g., QEMU -smp cpus=4,maxcpus=8), some CPUs are marked as 'present'
> > but have not yet been registered via register_cpu(). Consequently,
> > the per-cpu device objects for these CPUs are not yet initialized.
> [...]
> > Fix this by:
> >
> > 1. When booting with ACPI, checking the ACPI_MADT_ENABLED flag in the GICC
> > entry before calling set_cpu_present() during SMP initialization.
> >
> > 2. Properly managing the present mask in acpi_map_cpu() and
> > acpi_unmap_cpu() to support actual CPU hotplug events, This aligns with
> > other architectures like x86 and LoongArch.
>
> I had a chat with James earlier and IIUC the decision was to mark all
> CPUs present and the GIC must be fully initialised. But digging through
> the GICv3 code, I don't see it depending on cpu_present_mask but rather
> on the "always on" MADT GICR description. So I think it should be safe
> as long as we don't rely on the GICC gicr_base_address. But we should
> update Documentation/arch/arm64/cpu-hotplug.rst to no longer state that
> all online-capable vCPUs are marked as present by the kernel.
>
> (or maybe I misunderstood all this)
Jinjie, do you plan to respin with a documentation update as per
Catalin's comment above?
Will
^ permalink raw reply
* Re: [PATCH v2] media: meson: vdec: Fix memory leak in error path of vdec_open
From: Anand Moon @ 2026-05-19 12:51 UTC (permalink / raw)
To: Nicolas Dufresne
Cc: Neil Armstrong, Mauro Carvalho Chehab, Greg Kroah-Hartman,
Kevin Hilman, Jerome Brunet, Martin Blumenstingl, Maxime Jourdan,
Hans Verkuil,
open list:MESON VIDEO DECODER DRIVER FOR AMLOGIC SOCS,
open list:MESON VIDEO DECODER DRIVER FOR AMLOGIC SOCS,
open list:STAGING SUBSYSTEM,
moderated list:ARM/Amlogic Meson SoC support, open list
In-Reply-To: <49260ed9ce09b0684ed72787b5635e2c26059297.camel@ndufresne.ca>
Hi Nicolas.
Thanks for your review comments
On Fri, 8 May 2026 at 23:28, Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>
> Hi,
>
> sorry I missed your patch, catching up now.
>
>
> Le samedi 21 mars 2026 à 12:24 +0530, Anand Moon a écrit :
> > The vdec_open and vdec_close functions in the Meson VDEC driver failed
> > to release several resources, leading to memory leaks and potential
> > use-after-free scenarios.
> >
> > This patch addresses:
> > - Missing v4l2_ctrl_handler_free() in both the close path and error
> > exit of the open path, preventing control memory leaks.
> > - A leak of the M2M context if vdec_init_ctrls() failed.
> >
> > The error labels in vdec_open() have been reordered to ensure a proper
> > Last-In-First-Out (LIFO) teardown of all initialized resources.
> >
> > This was identified via kmemleak:
> > unreferenced object 0xffff0000205d6878 (size 8):
> > comm "v4l_id", pid 5289, jiffies 4294938580
> > hex dump (first 8 bytes):
> > 40 d2 49 18 00 00 ff ff @.I.....
> > backtrace (crc d3204599):
> > kmemleak_alloc+0xc8/0xf0
> > __kvmalloc_node_noprof+0x60c/0x850
> > v4l2_ctrl_handler_init_class+0x1b4/0x2e8 [videodev]
> > vdec_open+0x1f4/0x788 [meson_vdec]
> > v4l2_open+0x144/0x460 [videodev]
> > chrdev_open+0x1ac/0x500
> > do_dentry_open+0x3f0/0xfe8
> > vfs_open+0x68/0x320
> > do_open+0x2d8/0x9a8
> > path_openat+0x1d0/0x4f0
> > do_filp_open+0x190/0x380
> > do_sys_openat2+0xf8/0x1b0
> > __arm64_sys_openat+0x13c/0x1e8
> > invoke_syscall+0xdc/0x268
> > el0_svc_common.constprop.0+0x178/0x258
> > do_el0_svc+0x4c/0x70
> >
> > Cc: Nicolas Dufresne <nicolas@ndufresne.ca>
> > Fixes: 3e7f51bd9607 ("media: meson: add v4l2 m2m video decoder driver")
> > Signed-off-by: Anand Moon <linux.amoon@gmail.com>
> > ---
> > v1: https://lore.kernel.org/all/20260304100557.126488-1-linux.amoon@gmail.com/
> > tried to address the issue reported by Nicolas
> > improve the commit message.
> > ---
> > drivers/staging/media/meson/vdec/vdec.c | 9 ++++++---
> > 1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/staging/media/meson/vdec/vdec.c
> > b/drivers/staging/media/meson/vdec/vdec.c
> > index 4b77ec1af5a76..3a5e4ebe0b34c 100644
> > --- a/drivers/staging/media/meson/vdec/vdec.c
> > +++ b/drivers/staging/media/meson/vdec/vdec.c
> > @@ -877,7 +877,7 @@ static int vdec_open(struct file *file)
> > if (IS_ERR(sess->m2m_dev)) {
> > dev_err(dev, "Fail to v4l2_m2m_init\n");
> > ret = PTR_ERR(sess->m2m_dev);
> > - goto err_free_sess;
> > + goto err_m2m_release;
>
> If m2m_dev creation failed, why do you want to call v4l2_m2m_release() ?
>
I don’t recall the exact details, but the current handling appears incorrect.
I’ve prepared the following fix to resolve the issue, based on
sashiko’s suggestion.
[1] https://sashiko.dev/#/patchset/20260321065408.209723-1-linux.amoon%40gmail.com
-----8<----------8<--------
$ git diff drivers/staging/media/meson/vdec/vdec.c
diff --git a/drivers/staging/media/meson/vdec/vdec.c
b/drivers/staging/media/meson/vdec/vdec.c
index 4b77ec1af5a7..a039d925c0fe 100644
--- a/drivers/staging/media/meson/vdec/vdec.c
+++ b/drivers/staging/media/meson/vdec/vdec.c
@@ -889,7 +889,7 @@ static int vdec_open(struct file *file)
ret = vdec_init_ctrls(sess);
if (ret)
- goto err_m2m_release;
+ goto err_m2m_ctx_release;
sess->pixfmt_cap = formats[0].pixfmts_cap[0];
sess->fmt_out = &formats[0];
@@ -913,6 +913,8 @@ static int vdec_open(struct file *file)
return 0;
+err_m2m_ctx_release:
+ v4l2_m2m_ctx_release(sess->m2m_ctx);
err_m2m_release:
v4l2_m2m_release(sess->m2m_dev);
err_free_sess:
-----8<----------8<--------
Thanks
-Anand
^ permalink raw reply related
* Re: [PATCH v2 0/5] mm: reduce mmap_lock contention and improve page fault performance
From: Lorenzo Stoakes @ 2026-05-19 12:53 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, david, liam, vbabka,
rppt, mhocko, jack, pfalcato, wanglian, chentao, lianux.mm,
kunwu.chan, liyangouwen1, chrisl, kasong, shikemeng, nphamcs, bhe,
youngjun.park, linux-arm-kernel, linux-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, Nanzhe Zhao
In-Reply-To: <CAJuCfpE0WQrB3zJp9qn3jvn5DthS=ttpX7gJJvyEhA_BJGrp5g@mail.gmail.com>
On Mon, May 18, 2026 at 12:56:59PM -0700, Suren Baghdasaryan wrote:
> >
> > I think we either need to fix `fork()`, or keep the current
> > behavior of dropping the VMA lock before performing I/O.
>
> I see. So, this problem arises from the fact that we are changing the
> pagefaults requiring I/O operation to hold VMA lock...
> And you want to lock VMA on fork only if vma_is_anonymous(vma) ||
> is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for
> anonymous and COW VMAs only while holding mmap_write_lock, preventing
> any VMA modification. On the surface, that looks ok to me but I might
> be missing some corner cases. If nobody sees any obvious issues, I
> think it's worth a try.
Not sure if you noticed but I did raise concerns ;)
I wonder if you've confused the fault path and fork here, as I think Barry has
been a little unclear on that.
What's being suggested in this thread is to fundamentally change fork behaviour
so it's different from the entire history of the kernel (or - presumably - at
least recent history :) and permit concurrent page faults to occur on a forking
process.
I absolutely object to this for being pretty crazy. I mean I'm not sure we
really want to be simultaneously modifying page tables while invoking
copy_page_range()? No?
OK you cover anon and MAP_PRIVATE file-backed but hang on there's
VM_COPY_ON_FORK too.. so PFN mapped, mixed map and (the accursed) UFFD W/P as
well as possibly-guard region containing VMAs now can have page tables raced.
That's not to mention anything else that relies on serialisation here (this
would be changing how forking has been done in general) that we may or may not
know about.
The risk level is high, for what amounts to a hack to work around the fault
issue.
I suggest that if we have a problem with the fault path, let's look at the fault
path :)
So yeah I'm very opposed to this unless I'm somehow horribly mistaken here or a
very convincing argument is made.
>
>
>
>
> >
> > >
> > > I'd also like to get Suren's input, however.
> >
> > Yes. of course.
> >
> > >
> > > Thanks, Lorenzo
> >
> > Best Regards
> > Barry
Cheers, Lorenzo
^ permalink raw reply
* Re: [PATCH v14 27/44] arm64: RMI: Set RIPAS of initial memslots
From: Aneesh Kumar K.V @ 2026-05-19 12:55 UTC (permalink / raw)
To: Suzuki K Poulose, Steven Price, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
Alper Gun, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
Lorenzo.Pieralisi2
In-Reply-To: <6681f10b-0966-42e2-ae04-4e1aef47ec4d@arm.com>
Suzuki K Poulose <suzuki.poulose@arm.com> writes:
> On 19/05/2026 11:02, Aneesh Kumar K.V wrote:
>> Steven Price <steven.price@arm.com> writes:
>>
>>> The memory which the realm guest accesses must be set to RIPAS_RAM.
>>> Iterate over the memslots and set all gmem memslots to RIPAS_RAM.
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>> ---
>>
>> ...
>>
>>> +static int set_ripas_of_protected_regions(struct kvm *kvm)
>>> +{
>>> + struct kvm_memslots *slots;
>>> + struct kvm_memory_slot *memslot;
>>> + int idx, bkt;
>>> + int ret = 0;
>>> +
>>> + idx = srcu_read_lock(&kvm->srcu);
>>> +
>>> + slots = kvm_memslots(kvm);
>>> + kvm_for_each_memslot(memslot, bkt, slots) {
>>> + if (!kvm_slot_has_gmem(memslot))
>>> + continue;
>>> +
>>> + ret = realm_init_ipa_state(kvm, memslot->base_gfn,
>>> + memslot->npages);
>>> + if (ret)
>>> + break;
>>> + }
>>> + srcu_read_unlock(&kvm->srcu, idx);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> int kvm_arm_rmi_populate(struct kvm *kvm,
>>> struct kvm_arm_rmi_populate *args)
>>> {
>>> @@ -890,6 +922,10 @@ int kvm_activate_realm(struct kvm *kvm)
>>> return ret;
>>> }
>>>
>>> + ret = set_ripas_of_protected_regions(kvm);
>>> + if (ret)
>>> + return ret;
>>> +
>>> ret = rmi_realm_activate(virt_to_phys(realm->rd));
>>> if (ret)
>>> return -ENXIO;
>>
>> relam guest already does.
>> for_each_mem_range(i, &start, &end) {
>> if (rsi_set_memory_range_protected_safe(start, end)) {
>> panic("Failed to set memory range to protected: %pa-%pa",
>> &start, &end);
>> }
>> }
>>
>> if so why is host required to do this ?
>
> Ideally this should be a call from the VMM (i.e., user). Irrespective of
> what the guest does (which the host has no knowledge about), the VMM/
> user is better aware of what to do for a given guest. We have done this
> implicitly in the KVM as a start, to keep the initial implementation
> simple. This could be moved out to the VMM as UABI, if there is
> sufficient demand for it.
>
> TL,DR: This should be a host/deployer decision, not the Guest. There
> may other guest OS, which do not do RIPAS_RAM early enough.
>
Are we suggesting that when the guest is running out of DRAM initialized
via rmi_rtt_data_map_init(), it may need to access memory outside that
range before it gets a chance to set the RIPAS as RAM?
Does that mean the guest now has to trust the host for that?
rmi_rtt_init_ripas() is not added to the measurement details, right?
-aneesh
^ permalink raw reply
* Re: [PATCH] Documentation: KVM: Document guest-visible compatibility expectations
From: Marc Zyngier @ 2026-05-19 12:56 UTC (permalink / raw)
To: Paolo Bonzini
Cc: David Woodhouse, Will Deacon, Jonathan Corbet, Shuah Khan, kvm,
Linux Doc Mailing List, Kernel Mailing List, Linux,
Sean Christopherson, Jim Mattson, Oliver Upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kselftest
In-Reply-To: <86qzn7wp3y.wl-maz@kernel.org>
On Tue, 19 May 2026 13:38:57 +0100,
Marc Zyngier <maz@kernel.org> wrote:
>
> As I said before, I'd be OK with something that would restore IIDR to
> REV1. But not something that actively breaks the GIC emulation by
> reintroducing a bug. That's, by construction, dead code that will only
> bitrot, because there is no SW that can make use of this nonsense.
I will also add that if we make it a policy to preserve buggy
behaviours that the guest cannot be relying on, then I question
whether we should be fixing anything at all.
For example, 6.19 fixed a totally buggy behaviour where a guest
couldn't not have more than (on most HW) 4 interrupts in flight at any
given time. This was obviously totally bogus, and this was fixed
unconditionally, as legitimate guests could experience gold-platted
lock-ups.
Should we revert to the previous behaviour? In the affirmative, I will
simply stop fixing things, and someone else can have fun retrofitting
buggy crap.
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply
* Re: [PATCH v2] cpu/hotplug: Fix NULL kobject warning in cpuhp_smt_enable()
From: Jinjie Ruan @ 2026-05-19 12:56 UTC (permalink / raw)
To: Will Deacon, Catalin Marinas
Cc: punit.agrawal, rafael.j.wysocki, fengchengwen, chenl311,
suzuki.poulose, maz, timothy.hayes, lpieralisi, mrigendra.chaubey,
arnd, sudeep.holla, yangyicong, jic23, pierre.gondois,
linux-arm-kernel, linux-kernel, james.morse
In-Reply-To: <agxcF1bzmErb_lB9@willie-the-truck>
On 5/19/2026 8:48 PM, Will Deacon wrote:
> On Mon, May 11, 2026 at 06:37:21PM +0100, Catalin Marinas wrote:
>> On Mon, Apr 27, 2026 at 10:35:07AM +0800, Jinjie Ruan wrote:
>>> On arm64, when booting with `maxcpus` greater than the number of present
>>> CPUs (e.g., QEMU -smp cpus=4,maxcpus=8), some CPUs are marked as 'present'
>>> but have not yet been registered via register_cpu(). Consequently,
>>> the per-cpu device objects for these CPUs are not yet initialized.
>> [...]
>>> Fix this by:
>>>
>>> 1. When booting with ACPI, checking the ACPI_MADT_ENABLED flag in the GICC
>>> entry before calling set_cpu_present() during SMP initialization.
>>>
>>> 2. Properly managing the present mask in acpi_map_cpu() and
>>> acpi_unmap_cpu() to support actual CPU hotplug events, This aligns with
>>> other architectures like x86 and LoongArch.
>>
>> I had a chat with James earlier and IIUC the decision was to mark all
>> CPUs present and the GIC must be fully initialised. But digging through
>> the GICv3 code, I don't see it depending on cpu_present_mask but rather
>> on the "always on" MADT GICR description. So I think it should be safe
>> as long as we don't rely on the GICC gicr_base_address. But we should
>> update Documentation/arch/arm64/cpu-hotplug.rst to no longer state that
>> all online-capable vCPUs are marked as present by the kernel.
>>
>> (or maybe I misunderstood all this)
>
> Jinjie, do you plan to respin with a documentation update as per
> Catalin's comment above?
Ah, sorry, I will definitely include this change in the next version soon.
Thanks,
Jinjie
>
> Will
>
^ permalink raw reply
* Re: [RFC PATCH 1/2] KVM: arm64: Introduce S2 walker SKIP return options
From: Leonardo Bras @ 2026-05-19 12:56 UTC (permalink / raw)
To: Will Deacon
Cc: Leonardo Bras, Oliver Upton, Marc Zyngier, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Fuad Tabba,
Raghavendra Rao Ananta, linux-arm-kernel, kvmarm, linux-kernel
In-Reply-To: <agxa-TydN1PoKsYn@willie-the-truck>
On Tue, May 19, 2026 at 01:43:37PM +0100, Will Deacon wrote:
> On Mon, May 18, 2026 at 02:45:59PM +0100, Leonardo Bras wrote:
> > Hello Oliver, Will,
> > Thanks for reviewing!
> >
> > On Mon, May 18, 2026 at 09:52:16AM +0100, Will Deacon wrote:
> > > On Mon, May 18, 2026 at 12:22:47AM -0700, Oliver Upton wrote:
> > > > On Fri, May 15, 2026 at 08:59:02PM +0100, Leonardo Bras wrote:
> > > > > Introduce S2 walker return values:
> > > > > - SKIP_CHILDREN: skip walking the children of the current node
> > > > > - SKIP_SIBLINGS: skip waling the siblings of the current node
> > > > >
> > > > > Also, modify __kvm_pgtable_visit() to fulfil the hing on above return
> > > > > values. Current walkers should not be impacted
> > > >
> > > > I'd rather see something based around new walk flags than introducing an
> > > > entirely new mechanic around return values.
> > > >
> > > > e.g. you could split the LEAF flag into separate flags for blocks v.
> > > > pages:
> > > >
> > > > KVM_PGTABLE_WALK_PAGE,
> > > > KVM_PGTABLE_WALK_BLOCK,
> > > > KVM_PGTABLE_WALK_LEAF = KVM_PGTABLE_WALK_PAGE |
> > > > KVM_PGTABLE_WALK_BLOCK,
> > > >
> > > > and then let __kvm_pgtable_visit() decide how to steer the walk. You may
> > > > need some special handling to get the address arithmetic right when
> > > > skipping over a table of page descriptors.
> >
> > I am probably not getting the whole inner workings of this solution, but
> > IIUC the idea would be to walk the blocks, but not the pages, right?
> >
> > Blocks meaning level2- and pages being level3?
> >
> > > I was wondering along similar lines, but maybe it would be useful just
> > > to pass a maximum level to the walker logic? That feels like the most
> > > general case without complicating the existing logic.
> >
> > This proposal seems simpler for me to understand, and indeed looks like a
> > better solution than what I have proposed, taking care of the
> > 'already split' case with better performance, as it don't even walk a
> > single level-3 entry.
> >
> > On the 'splitting' case, it also works flawlessly if the memory is given in
> > level-2 blocks. There is only one case that I would like to address here:
> >
> > - Memory given in level-1 blocks (say 1GB)
> > - Walker flag says 'walk down to level-2 only'
> > - Split Walker on level-1 will break page down to (up to) level-3 entries.
> > - Walker will continue to be called on level-2 entries, even though it's
> > not necessary.
>
> If you're only visiting leaves, why would it be called on the level-2
> table entries?
>
Because once the leaf is turned into a table by the splitting walker, it
gets reloaded and walked. This is an excerpt of __kvm_pgtable_visit():
---
if (!table && (ctx.flags & KVM_PGTABLE_WALK_LEAF)) {
ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_LEAF);
reload = true;
}
/*
* Reload the page table after invoking the walker callback for leaf
* entries or after pre-order traversal, to allow the walker to descend
* into a newly installed or replaced table.
*/
if (reload) {
ctx.old = READ_ONCE(*ptep);
table = kvm_pte_table(ctx.old, level);
}
if (!kvm_pgtable_walk_continue(data->walker, ret))
goto out;
if (!table) {
data->addr = ALIGN_DOWN(data->addr, kvm_granule_size(level));
data->addr += kvm_granule_size(level);
goto out;
}
childp = (kvm_pteref_t)kvm_pte_follow(ctx.old, mm_ops);
ret = __kvm_pgtable_walk(data, mm_ops, childp, level + 1);
if (!kvm_pgtable_walk_continue(data->walker, ret))
goto out;
---
After the leaf is visited, it makes reload=true, which causes the newly
created table to be detected as such and waked down. It means even the page
that just got split will be walked down to the specified level.
Example:
- Split this level-1 leave:
- Walker creates the whole structure up to given level (currently 3)
- Walker returns, gets reloaded, table detected, go down on that one
- Level 2 entries walked (which is unnecessary)
Please let me know if I am misunderstanding something.
Thanks!
Leo
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox