Linux Documentation
 help / color / mirror / Atom feed
* Re: [PATCH 00/19] crypto: cmh - add CRI CryptoManager Hub driver
From: Eric Biggers @ 2026-06-25 18:05 UTC (permalink / raw)
  To: Saravanakrishnan Krishnamoorthy
  Cc: Albert Ou, Alex Ousherovitch, Conor Dooley, David S. Miller,
	Herbert Xu, Jonathan Corbet, Krzysztof Kozlowski, Palmer Dabbelt,
	Paul Walmsley, Rob Herring, Shuah Khan, Alexandre Ghiti,
	devicetree, Joel Wittenauer, linux-api, linux-crypto, linux-doc,
	linux-kernel, linux-kselftest, linux-riscv, Shuah Khan,
	sipsupport, Thi Nguyen
In-Reply-To: <20260625173328.1140487-1-skrishnamoorthy@rambus.com>

On Thu, Jun 25, 2026 at 10:33:08AM -0700, Saravanakrishnan Krishnamoorthy wrote:
> ** This message and any attachments are for the sole use of the
> intended recipient(s). It may contain information that is confidential
> and privileged. If you are not the intended recipient of this message,
> you are prohibited from printing, copying, forwarding or saving it.
> Please delete the message and attachments and notify the sender
> immediately. **

Okay, I deleted it.

- Eric

^ permalink raw reply

* Re: [PATCH RFC 01/12] Documentation: KVM: Elaborate comment on kvm_usage_lock
From: Sean Christopherson @ 2026-06-25 18:12 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Tianrui Zhao,
	Bibo Mao, Huacai Chen, WANG Xuerui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Fuad Tabba, vannapurve, x86,
	H. Peter Anvin, kvm, linux-doc, linux-kernel, loongarch
In-Reply-To: <20260527-kvm-locking-docs-v1-1-4fe8b602ff47@google.com>

On Wed, May 27, 2026, Ackerley Tng wrote:
> The original comment talks about cpus_read_lock() and kvm_usage_count, but
> doesn't explain why they are related.
> 
> Elaborate comment on kvm_usage_lock to provide more context.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  Documentation/virt/kvm/locking.rst | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
> index 662231e958a07..5564c8b38b9cc 100644
> --- a/Documentation/virt/kvm/locking.rst
> +++ b/Documentation/virt/kvm/locking.rst
> @@ -248,8 +248,23 @@ time it will be set using the Dirty tracking mechanism described above.
>  :Arch:		any
>  :Protects:	- kvm_usage_count
>  		- hardware virtualization enable/disable
> -:Comment:	Exists to allow taking cpus_read_lock() while kvm_usage_count is
> -		protected, which simplifies the virtualization enabling logic.
> +:Comment:       ``kvm_usage_count`` serves to deduplicate hardware
> +    virtualization enabling and disabling requests from different VMs
> +    being created.

kvm_usage_count does that and more, i.e. this is 'wrong" by being incomplete. 

> +
> +    Hardware virtualization enabling/disabling requires taking
> +    ``cpus_read_lock()``.
> +
> +    ``kvm_lock`` used to also protect ``kvm_usage_count``, but other
> +    parts of the Linux kernel holding ``cpus_read_lock()`` need to
> +    call into KVM to ensure that VM state remains consistent with the
> +    host's state. For example, when the CPU frequency changes, KVM is
> +    notified. ``kvmclock_cpufreq_notifier()`` takes ``kvm_lock`` to
> +    iterate ``vm_list``.
> +
> +    To decouple these, use different locks, ``kvm_lock`` for
> +    ``vm_list`` and ``kvm_usage_lock`` for enabling/disabling hardware
> +    virtualization.

I appreciate the effort, but honestly I think this does more harm than good.  I
already know what this code does, and the above confused me more than anything.

>  
>  ``kvm->mn_invalidate_lock``
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> -- 
> 2.54.0.823.g6e5bcc1fc9-goog
> 

^ permalink raw reply

* Re: [PATCH RFC 02/12] Documentation: KVM: Consolidate notes about cpu_read_lock() and kvm_lock
From: Sean Christopherson @ 2026-06-25 18:12 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Tianrui Zhao,
	Bibo Mao, Huacai Chen, WANG Xuerui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Fuad Tabba, vannapurve, x86,
	H. Peter Anvin, kvm, linux-doc, linux-kernel, loongarch
In-Reply-To: <20260527-kvm-locking-docs-v1-2-4fe8b602ff47@google.com>

On Wed, May 27, 2026, Ackerley Tng wrote:
> Move the detail about cpu_read_lock() and kvm_lock to where the acquisition
> order is mentioned.

Why?

^ permalink raw reply

* Re: [PATCH RFC 03/12] Documentation: KVM: Consolidate notes about kvm->slots_lock and irq_lock
From: Sean Christopherson @ 2026-06-25 18:12 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Tianrui Zhao,
	Bibo Mao, Huacai Chen, WANG Xuerui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Fuad Tabba, vannapurve, x86,
	H. Peter Anvin, kvm, linux-doc, linux-kernel, loongarch
In-Reply-To: <20260527-kvm-locking-docs-v1-3-4fe8b602ff47@google.com>

On Wed, May 27, 2026, Ackerley Tng wrote:
> Move the detail about ordering between kvm->slots_lock and kvm->irq_lock to
> where the two locks are first mentioned.

Why?

^ permalink raw reply

* Re: [PATCH RFC 08/12] Documentation: KVM: Add example for kvm->srcu in relation to mutex/lock
From: Sean Christopherson @ 2026-06-25 18:17 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Tianrui Zhao,
	Bibo Mao, Huacai Chen, WANG Xuerui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Fuad Tabba, vannapurve, x86,
	H. Peter Anvin, kvm, linux-doc, linux-kernel, loongarch
In-Reply-To: <20260527-kvm-locking-docs-v1-8-4fe8b602ff47@google.com>

On Wed, May 27, 2026, Ackerley Tng wrote:
> Add example of where vcpu->mutex and kvm->slots_lock are held while calling
> synchronize_srcu(&kvm->srcu) to concretely show where the synchronization
> primitives overlap.

Sorry, but NAK.  This is too x86-centric, and IMO the risk of the documentation
becoming stale and confusing outweighs any benefits from providing an incomplete
example.  Because like the kvm_usage_count stuff, I know the code in question,
and the example confused me and makes it harder to understand the rule(s).

^ permalink raw reply

* Re: [PATCH RFC 10/12] KVM: guest_memfd: Clarify comment about gmem.file vs kvm->srcu
From: Sean Christopherson @ 2026-06-25 18:19 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Tianrui Zhao,
	Bibo Mao, Huacai Chen, WANG Xuerui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Fuad Tabba, vannapurve, x86,
	H. Peter Anvin, kvm, linux-doc, linux-kernel, loongarch
In-Reply-To: <20260527-kvm-locking-docs-v1-10-4fe8b602ff47@google.com>

On Wed, May 27, 2026, Ackerley Tng wrote:
> Clarify the existing comment about synchronize_srcu() and
> kvm_gmem_get_pfn() to provide further context. Explain which
> synchronize_srcu() prevents races with how kvm_gmem_get_pfn() is used.
> 
> Also point reader to documentation for better understanding.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 69c9d6d546b28..f2218db0af980 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -711,8 +711,13 @@ static void __kvm_gmem_unbind(struct kvm_memory_slot *slot, struct gmem_file *f)
>  	xa_store_range(&f->bindings, start, end - 1, NULL, GFP_KERNEL);
>  
>  	/*
> -	 * synchronize_srcu(&kvm->srcu) ensured that kvm_gmem_get_pfn()
> -	 * cannot see this memslot.
> +	 * This is called when memslots are updated, after the old
> +	 * memslot container is no longer in
> +	 * use. synchronize_srcu(&kvm->srcu) was called there, so
> +	 * kvm_gmem_get_pfn() from KVM's guest fault handling cannot
> +	 * see this memslot. See Documentation/virt/kvm/locking.rst
> +	 * for more information about kvm->srcu and the memslots
> +	 * container.

If we want to add to this comment, I would much rather do so as part of an update
to kvm_gmem_release()'s comment as well.

https://lore.kernel.org/all/20251113232229.1698886-1-seanjc@google.com

^ permalink raw reply

* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Ackerley Tng @ 2026-06-25 18:20 UTC (permalink / raw)
  To: Yan Zhao
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajyCn0PnFtQK+Nka@yzhao56-desk.sh.intel.com>

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jun 24, 2026 at 05:05:44PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>>
>> >
>> > [...snip...]
>> >
>> >>
>> >>  #ifdef kvm_arch_has_private_mem
>> >> -bool __ro_after_init gmem_in_place_conversion = false;
>> >> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
>> >> +module_param(gmem_in_place_conversion, bool, 0444);
>> >
>> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
>> > MMAP flag. In such cases, shared memory is allocated from different backends.
>> > This means this module parameter only enables per-gmem memory attribute and does
>> > not guarantee that gmem in-place conversion will actually occur.
>> >
>> > To avoid confusion, could we rename this module parameter to something more
>> > accurate, such as gmem_memory_attribute?
>> >
>>
>> I asked Sean about this after getting some fixes off list. Sean said
>> gmem_in_place_conversion is named for a host admin to use, and something
>> like gmem_memory_attributes is too much implementation details for the
>> admin.
> Thanks for this background.
>
> Some more context on why I'm asking:
>
> Currently, I'm testing TDX huge pages with the following two gmem components:
> 1. The gmem memory attribute in this gmem in-place conversion v8.
> 2. The gmem 2MB from buddy allocator. (for development/testing only).
>
> The gmem 2MB from buddy allocator allocates 2MB folios from buddy for private
> memory, while shared memory is allocated from a different backend.
> (To avoid fragmentation, only private mappings are split during private-to-shared
> conversions. In this approach, the 2MB folios are always retained in the gmem
> inode filemap cache without splitting.)
>
> Since shared memory is not allocated from gmem, there're no in-place conversions.
> The reason I'm using "gmem memory attribute" is that the per-VM attribute is
> being deprecated, as suggested by Sean [1].
>

v8 of conversions series changed that slightly, per-VM attributes is
going to stay around (because of work on RWX attributes, coming up) and
RWX will stay tracked at the VM level.

For v8 and beyond, only tracking of private/shared in per-VM attributes
is being deprecated.

By extension the entire thing about using guest_memfd for private memory
and a different backing memory for shared memory is being deprecated.

> Besides my current usage,

I think you can set up guest_memfd+2M for private memory and shared
memory from some other source, and that's the deprecated usage pattern.

> there may be other scenarios where gmem memory
> attributes is preferred without allocating shared memory from gmem.
> (e.g., PAGE.ADD from a temp extra shared source memory).
>

Is this TDH.MEM.PAGE.ADD, used indirectly from
tdx_gmem_post_populate()? This use case isn't blocked. Even if
gmem_in_place_conversion=true, you can still set src_address to
non-guest_memfd memory and load from anywhere you like.

Please let me know if that is broken! I think I accidentally used that
setup in selftests and it worked. The selftests are now defaulting to
in-place conversion.

> For such use cases, I'm concerns that the admins may find it confusing if they
> enable gmem_in_place_conversion but still observe extra memory consumptions for
> shared memory.
>

Hmm but I guess if someone enables gmem_in_place_conversion but still
allocates from elsewhere, they'd have to figure it out?

> [1] https://lore.kernel.org/kvm/aWmEegVP_A613WIr@google.com/
>
>> Sean, would you reconsider since Yan also asked? If the admin compiled
>> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
>> admin would also be able to use a param like gmem_memory_attributes?
>>
>> There's the additional benefit that the similar naming aids in
>> understanding for both the admin and software engineers.
>>
>> Either way, in the next revision, I'll also add this documentation for
>> this module_param:
>>
>>   Setting the module parameter gmem_in_place_conversion to true will
>>   enable the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl and disables
>>   the KVM_SET_MEMORY_ATTRIBUTES VM ioctl. If gmem_in_place_conversion is
>>   true, the private/shared attribute will be tracked per-guest_memfd
>>   instead of per-VM.
>>
>> Let me know what y'all think of the wording!
>>
>> >>
>> >> [...snip...]
>> >>

^ permalink raw reply

* [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering
From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira

This series works around the NVIDIA Olympus device store/load ordering
erratum (T410-OLY-1027): a Device-nGnR* load can be observed by a
peripheral before an older, non-overlapping Device-nGnR* store to the
same peripheral, breaking the program order that drivers rely on for
MMIO and potentially leaving a device in an incorrect state.

Patch 1 adds the workaround. It promotes the raw MMIO store helpers
(__raw_writeb/w/l/q, and therefore writel()/writel_relaxed()) to
store-release on affected CPUs, and promotes the trailing DGH of the
write-combining __iowrite{32,64}_copy() helpers to dmb osh. Everything is
gated on a new ARM64_WORKAROUND_DEVICE_STORE_RELEASE cpucap and patched
in only on affected parts, so it is a no-op elsewhere.

Patch 2 provides arm64 memset_io()/memcpy_toio(). The generic versions
are built on __raw_write*(), so patch 1 would promote every store in a
block to a store-release; as each STLR drains the write-combining buffer,
block MMIO becomes O(n) store-releases. The arm64 versions emit plain
STR in the loop and order the whole block with a single trailing dmb osh,
keeping block MMIO at one-barrier cost.

Performance: NVIDIA Olympus, write-combining MMIO to a device BAR, single
PE pinned; per-call cost in ns. Consecutive writes ping-pong between two
buffers so repeated stores are not coalesced. iowrite64/iowrite32 =
__iowrite{64,32}_copy().

Table 1 - workaround off (CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM=n)
+-------+-----------+-----------+-----------+-------------+
|  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
|    8B |   67.9 ns |   67.8 ns |    3.6 ns |    3.6 ns   |
|   16B |   67.9 ns |   67.8 ns |    4.0 ns |    4.0 ns   |
|   32B |   67.9 ns |   67.9 ns |    4.6 ns |    4.6 ns   |
|   64B |   69.1 ns |   69.1 ns |   69.1 ns |   69.0 ns   |
|  128B |  138.3 ns |  138.3 ns |  138.4 ns |  138.3 ns   |
|  256B |  276.6 ns |  276.6 ns |  276.6 ns |  276.7 ns   |
|  512B |  276.6 ns |  276.5 ns |  276.6 ns |  276.6 ns   |
|   1KB |  276.6 ns |  278.4 ns |  276.6 ns |  276.6 ns   |
|   2KB |  278.4 ns |  278.4 ns |  275.9 ns |  276.6 ns   |
|   4KB |  365.7 ns |  365.7 ns |  365.7 ns |  365.7 ns   |
+-------+-----------+-----------+-----------+-------------+
relaxed/no-flush: memset_io()/memcpy_toio() issue plain stores with no
trailing dgh() or barrier, unlike __iowrite*_copy() which ends with dgh().

Table 2 - workaround on, arm64 memset_io/memcpy_toio (this series)
+-------+-----------+-----------+-----------+-------------+
|  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
|    8B |  231.6 ns |  231.6 ns |  232.4 ns |  232.4 ns   |
|   16B |  231.7 ns |  231.9 ns |  232.7 ns |  232.6 ns   |
|   32B |  231.9 ns |  232.7 ns |  232.9 ns |  232.9 ns   |
|   64B |  232.7 ns |  235.0 ns |  233.7 ns |  233.6 ns   |
|  128B |  233.6 ns |  235.8 ns |  234.4 ns |  234.3 ns   |
|  256B |  237.7 ns |  276.8 ns |  264.0 ns |  276.7 ns   |
|  512B |  237.7 ns |  277.1 ns |  238.1 ns |  277.6 ns   |
|   1KB |  253.7 ns |  279.3 ns |  276.1 ns |  294.1 ns   |
|   2KB |  295.0 ns |  318.7 ns |  288.5 ns |  308.3 ns   |
|   4KB |  365.9 ns |  381.4 ns |  365.7 ns |  381.3 ns   |
+-------+-----------+-----------+-----------+-------------+
all four helpers end with a single trailing barrier (dmb osh).

Table 3 - workaround on, generic per-store memset_io/memcpy_toio
+-------+-----------+-----------+-------------+--------------+
|  size | iowrite64 | iowrite32 |   memset_io |  memcpy_toio |
+-------+-----------+-----------+-------------+--------------+
|    8B |  231.6 ns |  231.6 ns |    229.0 ns |    229.0 ns  |
|   16B |  231.7 ns |  231.9 ns |    458.4 ns |    458.5 ns  |
|   32B |  231.9 ns |  232.7 ns |    917.4 ns |    917.5 ns  |
|   64B |  232.7 ns |  234.8 ns |   1835.4 ns |   1835.5 ns  |
|  128B |  233.6 ns |  235.8 ns |   3670.9 ns |   3670.8 ns  |
|  256B |  237.7 ns |  276.7 ns |   7341.6 ns |   7341.6 ns  |
|  512B |  237.7 ns |  279.4 ns |  14001.4 ns |  14001.3 ns  |
|   1KB |  253.7 ns |  279.1 ns |  28631.5 ns |  28631.8 ns  |
|   2KB |  279.4 ns |  317.9 ns |  57276.3 ns |  57275.2 ns  |
|   4KB |  365.7 ns |  381.5 ns | 114564.4 ns | 114563.6 ns  |
+-------+-----------+-----------+-------------+--------------+
the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
workaround promotes to store-release, so every store is individually
ordered - hence O(n) in the store count.

Tables 2 and 3 show why patch 2 is needed: the generic per-store block
writers collapse to O(n) under the workaround (4KB ~314x slower, ~115 us
vs ~366 ns), while the arm64 versions stay flat at one-barrier cost.

Changes since v3:
  - Split the workaround into two patches: the erratum fix (1/2) and the
    arm64 memset_io()/memcpy_toio() block writers (2/2).
  - Reworked the raw MMIO write helpers to use a direct base-register
    str*/stlr* alternative sequence instead of a per-write static branch.
  - Covered the write-combining __iowrite{32,64}_copy() path by patching
    dgh() to dmb osh on affected CPUs, keeping the contiguous STR groups
    and the ordering barrier outside the copy loop; the single-element
    case now uses a plain str* as well.
  - Added arm64 memset_io()/memcpy_toio() so the byte/word block writers
    take one trailing dmb osh instead of a per-store store-release.
  - Updated the commit messages to describe the offset-addressing
    trade-off.

Changes since v2:
  - Reworked the raw MMIO write helpers so unaffected CPUs keep the
    existing offset-addressed STR sequence, while affected CPUs use the
    base-register STLR path.
  - Updated the commit message to match the code changes.
  - Rebased on top of the arm64 for-next/errata branch:
    https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata

Changes since v1:
  - Updated the commit message based on feedback from Vladimir Murzin.

Shanker Donthineni (2):
  arm64: errata: Workaround NVIDIA Olympus device store/load ordering
  arm64: io: apply the device store-release workaround once per block
    write

 Documentation/arch/arm64/silicon-errata.rst |  2 +
 arch/arm64/Kconfig                          | 25 +++++++++
 arch/arm64/include/asm/barrier.h            |  4 +-
 arch/arm64/include/asm/io.h                 | 36 +++++++++----
 arch/arm64/kernel/cpu_errata.c              |  8 +++
 arch/arm64/kernel/io.c                      | 82 +++++++++++++++++++++++++++++
 arch/arm64/tools/cpucaps                    |  1 +
 7 files changed, 146 insertions(+), 12 deletions(-)

-- 
2.54.0.windows.1


^ permalink raw reply

* [PATCH v4 1/2] arm64: errata: Workaround NVIDIA Olympus device store/load ordering
From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira
In-Reply-To: <20260625182425.3194066-1-sdonthineni@nvidia.com>

On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
observed by a peripheral before an older, non-overlapping Device-nGnR*
store to the same peripheral. This breaks the program-order guarantee
that software expects for Device-nGnR* accesses and can leave a
peripheral in an incorrect state, as a load is observed before an
earlier store takes effect.

The erratum can occur only when all of the following apply:

  - A PE executes a Device-nGnR* store followed by a younger
    Device-nGnR* load.
  - The store is not a store-release.
  - The accesses target the same peripheral and do not overlap in bytes.
  - There is at most one intervening Device-nGnR* store in program
    order, and there are no intervening Device-nGnR* loads.
  - There is no DSB, and no DMB that orders loads, between the store and
    the load.
  - Specific micro-architectural and timing conditions occur.

Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain str*
to stlr* (Store-Release) on affected CPUs, which removes the "store is
not a store-release" condition for every device write the kernel issues.
Because writel() and writel_relaxed() are both built on __raw_writel()
in asm-generic/io.h, patching the raw variants covers both the
non-relaxed and relaxed APIs without touching the higher layers. Note
that writel()'s own barrier sits before the store, so it does not order
the store against a subsequent readl(); the store-release promotion is
what provides that ordering.

Like ARM64_ERRATUM_832075 on the load side, the change is gated on a new
ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only activated on
parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs continue to use
plain str* instructions.

Note: stlr* only supports base-register addressing, so the raw MMIO
write helpers use a base-register str*/stlr* alternative sequence. This
gives up the offset-addressed str* code generation introduced by commit
d044d6ba6f02 ("arm64: io: permit offset addressing"). A static-branch
implementation would add extra control flow without preserving the
desired offset-addressed code generation in practice, so use a direct
base-register str*/stlr* alternative instead.

For the write-combining copy helpers (__iowrite{32,64}_copy()), the
contiguous str* groups are kept, because replacing those stores would
defeat the write-combining behaviour used to improve store performance.
Rather than rely on the relaxed, no-ordering contract of these helpers -
which would leave affected CPUs behaving differently from every other
arm64 system and exposed to any future driver that depends on ordering
across such copies - the DGH hint emitted once after each copy is
promoted to dmb osh on affected CPUs. That orders the grouped stores
against subsequent loads without placing a barrier in the copy loop,
while unaffected CPUs keep the existing DGH hint. The single-element
case of __const_memcpy_toio_aligned{32,64}() likewise uses a plain str*
(instead of __raw_write*()) so it shares that str* group + DGH path
rather than taking a per-store store-release.

Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Link: https://lore.kernel.org/all/ajVZBJgKn-5sxHD6@willie-the-truck/
---
 Documentation/arch/arm64/silicon-errata.rst |  2 ++
 arch/arm64/Kconfig                          | 25 +++++++++++++++++
 arch/arm64/include/asm/barrier.h            |  4 ++-
 arch/arm64/include/asm/io.h                 | 31 +++++++++++++--------
 arch/arm64/kernel/cpu_errata.c              |  8 ++++++
 arch/arm64/tools/cpucaps                    |  1 +
 6 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index ad04d1cdc0f0..c4137f89acef 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -298,6 +298,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
 +----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
++----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 10c69474f276..da4e66b19209 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -564,6 +564,31 @@ config ARM64_ERRATUM_832075
 
 	  If unsure, say Y.
 
+config NVIDIA_OLYMPUS_1027_ERRATUM
+	bool "NVIDIA Olympus: device store/load ordering erratum"
+	default y
+	help
+	  This option adds an alternative code sequence to work around an
+	  NVIDIA Olympus core erratum where a Device-nGnR* store can be
+	  observed by a peripheral after a younger Device-nGnR* load to the
+	  same peripheral. This breaks the program order that drivers rely
+	  on for MMIO and can leave a device in an incorrect state.
+
+	  The workaround promotes the raw MMIO store helpers
+	  (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
+	  required ordering. Because writel() and writel_relaxed() are built
+	  on __raw_writel(), both are covered without changes to the higher
+	  layers. It also promotes the DGH hint used after write-combining
+	  memcpy-to-IO sequences to a DMB, so grouped stores are ordered
+	  against subsequent reads without placing a barrier in the copy loop.
+
+	  The fix is applied through the alternatives framework, so enabling
+	  this option does not by itself activate the workaround: it is
+	  patched in only when an affected CPU is detected, and is a no-op on
+	  unaffected CPUs.
+
+	  If unsure, say Y.
+
 config ARM64_ERRATUM_834220
 	bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
 	depends on KVM
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 9495c4441a46..22792d1305aa 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -38,7 +38,9 @@
  * Device-GRE attributes before the hint instruction with any memory accesses
  * appearing after the hint instruction.
  */
-#define dgh()		asm volatile("hint #6" : : : "memory")
+#define dgh()		asm volatile(ALTERNATIVE("hint #6", "dmb osh",	\
+					 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)	\
+				     : : : "memory")
 
 #define spec_bar()	asm volatile(ALTERNATIVE("dsb nsh\nisb\n",		\
 						 SB_BARRIER_INSN"nop\n",	\
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8cbd1e96fd50..69e0fa004d31 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -16,7 +16,6 @@
 #include <asm/memory.h>
 #include <asm/early_ioremap.h>
 #include <asm/alternative.h>
-#include <asm/cpufeature.h>
 #include <asm/rsi.h>
 
 /*
@@ -25,29 +24,37 @@
 #define __raw_writeb __raw_writeb
 static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 {
-	volatile u8 __iomem *ptr = addr;
-	asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("strb %w0, [%1]",
+				 "stlrb %w0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_writew __raw_writew
 static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 {
-	volatile u16 __iomem *ptr = addr;
-	asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("strh %w0, [%1]",
+				 "stlrh %w0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_writel __raw_writel
 static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 {
-	volatile u32 __iomem *ptr = addr;
-	asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("str %w0, [%1]",
+				 "stlr %w0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_writeq __raw_writeq
 static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
 {
-	volatile u64 __iomem *ptr = addr;
-	asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
+	asm volatile(ALTERNATIVE("str %x0, [%1]",
+				 "stlr %x0, [%1]",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_readb __raw_readb
@@ -178,7 +185,8 @@ __const_memcpy_toio_aligned32(volatile u32 __iomem *to, const u32 *from,
 			     : "rZ"(from[0]), "rZ"(from[1]), "r"(to));
 		break;
 	case 1:
-		__raw_writel(*from, to);
+		asm volatile("str %w0, [%1]"
+			     : : "rZ"(from[0]), "r"(to) : "memory");
 		break;
 	default:
 		BUILD_BUG();
@@ -235,7 +243,8 @@ __const_memcpy_toio_aligned64(volatile u64 __iomem *to, const u64 *from,
 			     : "rZ"(from[0]), "rZ"(from[1]), "r"(to));
 		break;
 	case 1:
-		__raw_writeq(*from, to);
+		asm volatile("str %x0, [%1]"
+			     : : "rZ"(from[0]), "r"(to) : "memory");
 		break;
 	default:
 		BUILD_BUG();
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 4b0d5d932897..76c1f8cf1ee0 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -839,6 +839,14 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
 		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL),
 	},
 #endif
+#ifdef CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM
+	{
+		/* NVIDIA Olympus core */
+		.desc = "NVIDIA Olympus device load/store ordering erratum",
+		.capability = ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
+		ERRATA_MIDR_ALL_VERSIONS(MIDR_NVIDIA_OLYMPUS),
+	},
+#endif
 #ifdef CONFIG_ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 	{
 		/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 811c2479e82d..d367257bf770 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -120,6 +120,7 @@ WORKAROUND_CAVIUM_TX2_219_PRFM
 WORKAROUND_CAVIUM_TX2_219_TVM
 WORKAROUND_CLEAN_CACHE
 WORKAROUND_DEVICE_LOAD_ACQUIRE
+WORKAROUND_DEVICE_STORE_RELEASE
 WORKAROUND_NVIDIA_CARMEL_CNP
 WORKAROUND_PMUV3_IMPDEF_TRAPS
 WORKAROUND_QCOM_FALKOR_E1003
-- 
2.54.0.windows.1


^ permalink raw reply related

* [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write
From: Shanker Donthineni @ 2026-06-25 18:24 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Vladimir Murzin
  Cc: Jason Gunthorpe, linux-arm-kernel, Mark Rutland, linux-kernel,
	linux-doc, Shanker Donthineni, Vikram Sethi, Jason Sequeira
In-Reply-To: <20260625182425.3194066-1-sdonthineni@nvidia.com>

The generic memset_io()/memcpy_toio() are built on __raw_write*(), so on
parts with the NVIDIA Olympus device store/load ordering erratum the
ARM64_WORKAROUND_DEVICE_STORE_RELEASE workaround promotes every store in
the block to a store-release. Each stlr* carries a barrier cost, so block
MMIO becomes O(n) store-releases, making a block copy many times slower
than a single ordered burst and growing with the transfer size.

Provide arm64 memset_io()/memcpy_toio() that emit plain str* in the loop
and order the whole block against subsequent loads with a single
trailing dmb osh on affected CPUs (a no-op elsewhere, preserving the
relaxed contract of these helpers). This keeps block MMIO writes at
one-barrier cost rather than scaling with the transfer size.

Performance (NVIDIA Olympus, write-combining MMIO to a device BAR, single
PE pinned; per-call cost in ns; consecutive writes ping-pong between two
buffers so repeated stores are not coalesced; iowrite64/iowrite32 =
__iowrite{64,32}_copy()):

Table 1 - arm64 memset_io/memcpy_toio (this patch)
+-------+-----------+-----------+-----------+-------------+
|  size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
|    8B |  231.6 ns |  231.6 ns |  232.4 ns |  232.4 ns   |
|   16B |  231.7 ns |  231.9 ns |  232.7 ns |  232.6 ns   |
|   32B |  231.9 ns |  232.7 ns |  232.9 ns |  232.9 ns   |
|   64B |  232.7 ns |  235.0 ns |  233.7 ns |  233.6 ns   |
|  128B |  233.6 ns |  235.8 ns |  234.4 ns |  234.3 ns   |
|  256B |  237.7 ns |  276.8 ns |  264.0 ns |  276.7 ns   |
|  512B |  237.7 ns |  277.1 ns |  238.1 ns |  277.6 ns   |
|   1KB |  253.7 ns |  279.3 ns |  276.1 ns |  294.1 ns   |
|   2KB |  295.0 ns |  318.7 ns |  288.5 ns |  308.3 ns   |
|   4KB |  365.9 ns |  381.4 ns |  365.7 ns |  381.3 ns   |
+-------+-----------+-----------+-----------+-------------+
all four helpers end with a single trailing barrier (dmb osh).

Table 2 - generic per-store memset_io/memcpy_toio
+-------+-----------+-----------+-------------+--------------+
|  size | iowrite64 | iowrite32 |   memset_io |  memcpy_toio |
+-------+-----------+-----------+-------------+--------------+
|    8B |  231.6 ns |  231.6 ns |    229.0 ns |    229.0 ns  |
|   16B |  231.7 ns |  231.9 ns |    458.4 ns |    458.5 ns  |
|   32B |  231.9 ns |  232.7 ns |    917.4 ns |    917.5 ns  |
|   64B |  232.7 ns |  234.8 ns |   1835.4 ns |   1835.5 ns  |
|  128B |  233.6 ns |  235.8 ns |   3670.9 ns |   3670.8 ns  |
|  256B |  237.7 ns |  276.7 ns |   7341.6 ns |   7341.6 ns  |
|  512B |  237.7 ns |  279.4 ns |  14001.4 ns |  14001.3 ns  |
|   1KB |  253.7 ns |  279.1 ns |  28631.5 ns |  28631.8 ns  |
|   2KB |  279.4 ns |  317.9 ns |  57276.3 ns |  57275.2 ns  |
|   4KB |  365.7 ns |  381.5 ns | 114564.4 ns | 114563.6 ns  |
+-------+-----------+-----------+-------------+--------------+
the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
workaround promotes to store-release, so every store is individually
ordered - hence O(n) in the store count.

The arm64 versions stay flat at one-barrier cost while the generic
per-store writers collapse to O(n): at 4KB ~314x slower (~115 us vs
~366 ns).

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 arch/arm64/include/asm/io.h |  5 +++
 arch/arm64/kernel/io.c      | 82 +++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 69e0fa004d31..649503f347bc 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -266,6 +266,11 @@ __iowrite64_copy(void __iomem *to, const void *from, size_t count)
 }
 #define __iowrite64_copy __iowrite64_copy
 
+void memset_io(volatile void __iomem *dst, int c, size_t count);
+#define memset_io memset_io
+void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count);
+#define memcpy_toio memcpy_toio
+
 /*
  * I/O memory mapping functions.
  */
diff --git a/arch/arm64/kernel/io.c b/arch/arm64/kernel/io.c
index fe86ada23c7d..b5fd9ee6d9eb 100644
--- a/arch/arm64/kernel/io.c
+++ b/arch/arm64/kernel/io.c
@@ -5,9 +5,91 @@
  * Copyright (C) 2012 ARM Ltd.
  */
 
+#include <linux/align.h>
 #include <linux/export.h>
 #include <linux/types.h>
 #include <linux/io.h>
+#include <linux/unaligned.h>
+
+#include <asm/alternative.h>
+
+/*
+ * ARM64_WORKAROUND_DEVICE_STORE_RELEASE promotes every raw MMIO store
+ * (__raw_write*()) to a store-release on affected CPUs. The generic
+ * memset_io()/memcpy_toio() are built on those helpers, so the workaround would
+ * emit one store-release per element and turn a block write into O(n) ordered
+ * stores - far more costly than the single barrier a block actually needs.
+ *
+ * Provide arm64 versions that emit plain STR in the loop and order the whole
+ * block against subsequent loads with one trailing DMB OSH, patched in only on
+ * affected CPUs (a no-op elsewhere, so the relaxed contract of these helpers is
+ * preserved).
+ *
+ * This capability is currently enabled only for the NVIDIA Olympus device
+ * store/load ordering erratum, where a Device-nGnR* load may be observed before
+ * an older, non-overlapping Device-nGnR* store to the same peripheral.
+ */
+static __always_inline void iomem_block_store_barrier(void)
+{
+	asm volatile(ALTERNATIVE("nop", "dmb osh",
+				 ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+		     : : : "memory");
+}
+
+void memset_io(volatile void __iomem *dst, int c, size_t count)
+{
+	u64 qc = (u8)c;
+
+	qc *= ~0ULL / 0xff;
+
+	while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
+		asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
+		dst++;
+		count--;
+	}
+	while (count >= sizeof(u64)) {
+		asm volatile("str %x0, [%1]" : : "rZ"(qc), "r"(dst) : "memory");
+		dst += sizeof(u64);
+		count -= sizeof(u64);
+	}
+	while (count) {
+		asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
+		dst++;
+		count--;
+	}
+
+	iomem_block_store_barrier();
+}
+EXPORT_SYMBOL(memset_io);
+
+void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
+{
+	while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
+		asm volatile("strb %w0, [%1]"
+			     : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
+		src++;
+		dst++;
+		count--;
+	}
+	while (count >= sizeof(u64)) {
+		asm volatile("str %x0, [%1]"
+			     : : "rZ"(get_unaligned((const u64 *)src)), "r"(dst)
+			     : "memory");
+		src += sizeof(u64);
+		dst += sizeof(u64);
+		count -= sizeof(u64);
+	}
+	while (count) {
+		asm volatile("strb %w0, [%1]"
+			     : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
+		src++;
+		dst++;
+		count--;
+	}
+
+	iomem_block_store_barrier();
+}
+EXPORT_SYMBOL(memcpy_toio);
 
 /*
  * This generates a memcpy that works on a from/to address which is aligned to
-- 
2.54.0.windows.1


^ permalink raw reply related

* Re: [PATCH RFC 12/12] Documentation: KVM: Focus acquisition order section on preventing deadlocks
From: Sean Christopherson @ 2026-06-25 18:25 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Tianrui Zhao,
	Bibo Mao, Huacai Chen, WANG Xuerui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Fuad Tabba, vannapurve, x86,
	H. Peter Anvin, kvm, linux-doc, linux-kernel, loongarch
In-Reply-To: <20260527-kvm-locking-docs-v1-12-4fe8b602ff47@google.com>

On Wed, May 27, 2026, Ackerley Tng wrote:
> Now that the first sentence is already described in more detail in the new
> section on synchronization while managing guest faults, drop the first
> sentence.

Nope, nothing in that sections says anything about the role of
mn_active_invalidate_count.

^ permalink raw reply

* Re: [PATCH 17/19] Documentation: ioctl: add CMH ioctl documentation and register 'J'
From: Randy Dunlap @ 2026-06-25 18:29 UTC (permalink / raw)
  To: Saravanakrishnan Krishnamoorthy, Albert Ou, Alex Ousherovitch,
	Conor Dooley, David S. Miller, Herbert Xu, Jonathan Corbet,
	Krzysztof Kozlowski, Palmer Dabbelt, Paul Walmsley, Rob Herring,
	Shuah Khan
  Cc: Alexandre Ghiti, devicetree, Joel Wittenauer, linux-api,
	linux-crypto, linux-doc, linux-kernel, linux-kselftest,
	linux-riscv, Shuah Khan, sipsupport, Thi Nguyen
In-Reply-To: <20260625173328.1140487-18-skrishnamoorthy@rambus.com>



On 6/25/26 10:33 AM, Saravanakrishnan Krishnamoorthy wrote:
> From: Alex Ousherovitch <aousherovitch@rambus.com>
> 
> Add Documentation/userspace-api/ioctl/cmh_mgmt.rst documenting the
> ioctl commands on the /dev/cmh_mgmt misc device for the CRI
> CryptoManager Hub (CMH) hardware crypto accelerator driver.  Covers
> key management, KIC key derivation, PKE (RSA, ECDSA, ECDH, EdDSA),
> PQC (ML-KEM, ML-DSA, SLH-DSA), SM2, EAC, and DRBG.
> 
> Register ioctl magic number 'J' (0x4A) in ioctl-number.rst.  The
> driver uses ioctls 0x01-0x40.
> 
> Co-developed-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
> Signed-off-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
> Signed-off-by: Alex Ousherovitch <aousherovitch@rambus.com>
> Reviewed-by: Joel Wittenauer <Joel.Wittenauer@cryptography.com>
> Reviewed-by: Thi Nguyen <thin@rambus.com>
> ---
>  .../userspace-api/ioctl/cmh_mgmt.rst          | 941 ++++++++++++++++++
>  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>  2 files changed, 942 insertions(+)
>  create mode 100644 Documentation/userspace-api/ioctl/cmh_mgmt.rst
> 

One docs build warning:

Documentation/userspace-api/ioctl/cmh_mgmt.rst: WARNING: document isn't included in any toctree [toc.not_included]


-- 
~Randy


^ permalink raw reply

* Re: [PATCH RFC 11/12] KVM: mmu: Point users of host_pfn_mapping_level() to docs
From: Sean Christopherson @ 2026-06-25 18:29 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Tianrui Zhao,
	Bibo Mao, Huacai Chen, WANG Xuerui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Fuad Tabba, vannapurve, x86,
	H. Peter Anvin, kvm, linux-doc, linux-kernel, loongarch
In-Reply-To: <20260527-kvm-locking-docs-v1-11-4fe8b602ff47@google.com>

On Wed, May 27, 2026, Ackerley Tng wrote:
> After consolidating documentation for host_pfn_mapping_level() in
> Documentation/virt/kvm/locking.rst, point users of function to docs.

NAK.  I want the "formal" documentation to describe the rules and general mechanisms,
not arch specific implementation details.  It's unfortunate the LoongArch copy+pasted
x86's code, comment and all, but that's a separate problem.

^ permalink raw reply

* Re: [PATCH RFC 07/12] Documentation: KVM: Drop mention of kvm->lock in SRCU documentation
From: Sean Christopherson @ 2026-06-25 18:35 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Tianrui Zhao,
	Bibo Mao, Huacai Chen, WANG Xuerui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Fuad Tabba, vannapurve, x86,
	H. Peter Anvin, kvm, linux-doc, linux-kernel, loongarch
In-Reply-To: <20260527-kvm-locking-docs-v1-7-4fe8b602ff47@google.com>

On Wed, May 27, 2026, Ackerley Tng wrote:
> The original comment says that synchronize_srcu(&kvm->srcu) is called
> inside critical sections for kvm->lock, vcpu->mutex and
> kvm->slots_lock. Drop mention of kvm->lock since this is no longer true.

I would *much* rather "fix" this by saying synchronize_srcu() *may* be called
inside blah blah blah.  Because (a) I don't feel like auditing all of KVM to see
if the above is true, (b) KVM's implementation may change again in the future,
and (c) taking kvm->lock inside a kvm->srcu read-side critical section is still
unsafe as we'd end up with ABBA deadlock (well, ABCCA?).

  1. SRCU held, waiting on kvm->lock
  2. kvm->lock held, waiting on vcpu->mutex
  3. vcpu->mutex held, waiting on synchronize_srcu()

^ permalink raw reply

* Re: [PATCH RFC 00/12] Document synchronization used in managing guest faults
From: Sean Christopherson @ 2026-06-25 18:37 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Tianrui Zhao,
	Bibo Mao, Huacai Chen, WANG Xuerui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Fuad Tabba, vannapurve, x86,
	H. Peter Anvin, kvm, linux-doc, linux-kernel, loongarch
In-Reply-To: <20260527-kvm-locking-docs-v1-0-4fe8b602ff47@google.com>

On Wed, May 27, 2026, Ackerley Tng wrote:
> Ackerley Tng (12):
>       Documentation: KVM: Elaborate comment on kvm_usage_lock
>       Documentation: KVM: Consolidate notes about cpu_read_lock() and kvm_lock
>       Documentation: KVM: Consolidate notes about kvm->slots_lock and irq_lock
>       Documentation: KVM: Turn - into bullet point
>       Documentation: KVM: Explain what rule the exception section is meant for
>       Documentation: KVM: Have actual headings for exceptions
>       Documentation: KVM: Drop mention of kvm->lock in SRCU documentation
>       Documentation: KVM: Add example for kvm->srcu in relation to mutex/lock
>       Documentation: KVM: Document synchronization for managing guest faults
>       KVM: guest_memfd: Clarify comment about gmem.file vs kvm->srcu
>       KVM: mmu: Point users of host_pfn_mapping_level() to docs
>       Documentation: KVM: Focus acquisition order section on preventing deadlocks

Please split these up into standalone patches or small series of patches that are
logically related at a finer granularity.  "Here's a pile of KVM documentation
updates" is not a reasonable level of granularity.

^ permalink raw reply

* Re: [PATCH v2 7/8] dt-bindings: riscv: Add generic CBQRI controller binding
From: Drew Fustini @ 2026-06-25 19:21 UTC (permalink / raw)
  To: Conor Dooley
  Cc: Adrien Ricciardi, Alexandre Ghiti, Atish Kumar Patra, Atish Patra,
	Babu Moger, Ben Horgan, Borislav Petkov, Chen Pei, Conor Dooley,
	Conor Dooley, Dave Hansen, Dave Martin, Fenghua Yu, Gong Shuai,
	Gong Shuai, guo.wenjia23, James Morse, Kornel Dulęba,
	Krzysztof Kozlowski, liu.qingtao2, Liu Zhiwei, Palmer Dabbelt,
	Paul Walmsley, Peter Newman, Radim Krčmář,
	Reinette Chatre, Rob Herring, Samuel Holland,
	Sebastian Andrzej Siewior, Tony Luck, Vasudevan Srinivasan,
	Ved Shanbhogue, Weiwei Li, yunhui cui, linux-kernel, linux-riscv,
	x86, devicetree, linux-rt-devel, linux-doc
In-Reply-To: <20260625-cupbearer-failing-9ce0abf97b93@spud>

[-- Attachment #1: Type: text/plain, Size: 4636 bytes --]

On Thu, Jun 25, 2026 at 05:19:28PM +0100, Conor Dooley wrote:
> On Wed, Jun 24, 2026 at 06:38:35PM -0700, Drew Fustini wrote:
> > Document the generic compatibles for capacity and bandwidth controllers
> > that implement the RISC-V CBQRI specification. The binding also
> > describes the common riscv,cbqri-rcid and riscv,cbqri-mcid properties,
> > and the optional riscv,cbqri-cache phandle that links a capacity
> > controller to the cache whose capacity it allocates.
> > 
> > Assisted-by: Claude:claude-opus-4-8
> > Co-developed-by: Adrien Ricciardi <aricciardi@baylibre.com>
> > Signed-off-by: Adrien Ricciardi <aricciardi@baylibre.com>
> > Signed-off-by: Drew Fustini <fustini@kernel.org>
> > ---
> >  .../devicetree/bindings/riscv/riscv,cbqri.yaml     | 97 ++++++++++++++++++++++
> >  MAINTAINERS                                        |  1 +
> >  2 files changed, 98 insertions(+)

Thanks for the review.

[..]
> > +properties:
> > +  compatible:
> > +    oneOf:
> > +      - items:
> > +          - description: Tenstorrent Ascalon Shared Cache
> > +            const: tenstorrent,ascalon-sc-cbqri
> > +          - const: riscv,cbqri-capacity-controller
> > +      - enum:
> > +          - riscv,cbqri-capacity-controller
> > +          - riscv,cbqri-bandwidth-controller
> 
> Please modify this, as has been done for other riscv spec related
> bindings, to let people get away without using device-specific
> compatibles.
> 
> In this case, you can just delete the first entry from this enum, since
> it already has a user and only have to implement this feedback for the
> second entry.

Would this work?

properties:
  compatible:
    oneOf:
      - items:
          - enum:
              - tenstorrent,ascalon-sc-cbqri # Tenstorrent Ascalon Shared Cache
          - const: riscv,cbqri-capacity-controller
      - items:
          - {}
          - const: riscv,cbqri-bandwidth-controller

> > +
> > +  reg:
> > +    maxItems: 1
> > +    description:
> > +      The CBQRI controller register block.
> > +
> > +  riscv,cbqri-rcid:
> > +    $ref: /schemas/types.yaml#/definitions/uint32
> > +    description:
> > +      The maximum number of RCIDs the controller supports. RCIDs are the
> > +      resource-control IDs that allocation operations target.
> > +
> > +  riscv,cbqri-mcid:
> > +    $ref: /schemas/types.yaml#/definitions/uint32
> > +    description:
> > +      The maximum number of MCIDs the controller supports. MCIDs are the
> > +      monitoring-counter IDs that usage-monitoring operations target. Present
> > +      on controllers that implement monitoring.
> > +
> > +  riscv,cbqri-cache:
> > +    $ref: /schemas/types.yaml#/definitions/phandle
> > +    description:
> > +      Phandle to the cache node whose capacity this controller allocates.
> > +      Applies to capacity controllers that back a CPU cache. The cache level
> > +      and the harts sharing it are taken from that node's cache topology.
> 
> Architecturally, is it impossible for a capacity controller to control
> more than one cache?

Yes, there is only ever a single logical capacity resource per capacity
controller. When that resource is a cache, the controller handles that
one logical cache. The hardware may implement the cache as a collection
of slices, but that stays opaque to CBQRI. So riscv,cbqri-cache stays a
single phandle.

> > +
> > +required:
> > +  - compatible
> > +  - reg
> > +
> > +allOf:
> > +  - if:
> > +      properties:
> > +        compatible:
> > +          contains:
> > +            const: tenstorrent,ascalon-sc-cbqri
> > +    then:
> > +      required:
> > +        - riscv,cbqri-rcid
> > +        - riscv,cbqri-cache
> > +
> > +additionalProperties: false
> > +
> > +examples:
> > +  - |
> > +    l2_cache: l2-cache {
> > +        compatible = "cache";
> > +        cache-level = <2>;
> > +        cache-unified;
> > +        cache-size = <0xc00000>;
> > +        cache-sets = <512>;
> > +        cache-block-size = <64>;
> > +    };
> > +
> > +    cache-controller@a21a00c0 {
> > +        compatible = "tenstorrent,ascalon-sc-cbqri",
> > +                     "riscv,cbqri-capacity-controller";
> 
> Is this or is this not a cache controller?
> The compatible and fact that the property points to an actual cache
> controller suggests that this is not.

Good point. This nodes represents just the QoS interface (CBQRI) and
should not use that node name. 'qos-controller' seems like it would be
more appropriate but that has no precedent. What do you think?

Thanks,
Drew

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v14 4/5] gpio: rpmsg: add generic rpmsg GPIO driver
From: Andrew Davis @ 2026-06-25 20:32 UTC (permalink / raw)
  To: Shenwei Wang, Linus Walleij, Bartosz Golaszewski, Jonathan Corbet,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Mathieu Poirier, Frank Li, Sascha Hauer
  Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
	Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
	devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
	Arnaud POULIQUEN, b-padhi, Andrew Lunn, Bartosz Golaszewski
In-Reply-To: <20260625155432.815185-5-shenwei.wang@oss.nxp.com>

On 6/25/26 10:54 AM, Shenwei Wang wrote:
> From: Shenwei Wang <shenwei.wang@nxp.com>
> 
> On an AMP platform, the system may include multiple processors:
> 	- MCUs running an RTOS
> 	- An MPU running Linux
> 
> These processors communicate via the RPMSG protocol.
> The driver implements the standard GPIO interface, allowing
> the Linux side to control GPIO controllers which reside in
> the remote processor via RPMSG protocol.
> 
> Cc: Bartosz Golaszewski <brgl@bgdev.pl>
> Cc: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
> ---
>   drivers/gpio/Kconfig      |  17 ++
>   drivers/gpio/Makefile     |   1 +
>   drivers/gpio/gpio-rpmsg.c | 568 ++++++++++++++++++++++++++++++++++++++
>   3 files changed, 586 insertions(+)
>   create mode 100644 drivers/gpio/gpio-rpmsg.c
> 
> diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
> index 020e51e30317..4ad299fe3c6f 100644
> --- a/drivers/gpio/Kconfig
> +++ b/drivers/gpio/Kconfig
> @@ -1917,6 +1917,23 @@ config GPIO_SODAVILLE
>   
>   endmenu
>   
> +menu "RPMSG GPIO drivers"
> +	depends on RPMSG
> +
> +config GPIO_RPMSG
> +	tristate "Generic RPMSG GPIO support"
> +	depends on OF && REMOTEPROC
> +	select GPIOLIB_IRQCHIP
> +	default REMOTEPROC
> +	help
> +	  Say yes here to support the generic GPIO functions over the RPMSG
> +	  bus. Currently supported devices: i.MX7ULP, i.MX8ULP, i.MX8x, and
> +	  i.MX9x.

The support would depend on if the right firmware is loaded/running on the given
remote core. Also if you want to make this generic, then any vendor should be able
to make a firmware that implements this protocol and make use of this driver.
Suggest dropping this NXP specific device list.

> +
> +	  If unsure, say N.
> +
> +endmenu
> +
>   menu "SPI GPIO expanders"
>   	depends on SPI_MASTER
>   
> diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile
> index b267598b517d..ee75c0e65b8b 100644
> --- a/drivers/gpio/Makefile
> +++ b/drivers/gpio/Makefile
> @@ -157,6 +157,7 @@ obj-$(CONFIG_GPIO_RDC321X)		+= gpio-rdc321x.o
>   obj-$(CONFIG_GPIO_REALTEK_OTTO)		+= gpio-realtek-otto.o
>   obj-$(CONFIG_GPIO_REG)			+= gpio-reg.o
>   obj-$(CONFIG_GPIO_ROCKCHIP)	+= gpio-rockchip.o
> +obj-$(CONFIG_GPIO_RPMSG)		+= gpio-rpmsg.o
>   obj-$(CONFIG_GPIO_RTD)			+= gpio-rtd.o
>   obj-$(CONFIG_ARCH_SA1100)		+= gpio-sa1100.o
>   obj-$(CONFIG_GPIO_SAMA5D2_PIOBU)	+= gpio-sama5d2-piobu.o
> diff --git a/drivers/gpio/gpio-rpmsg.c b/drivers/gpio/gpio-rpmsg.c
> new file mode 100644
> index 000000000000..332e2925a830
> --- /dev/null
> +++ b/drivers/gpio/gpio-rpmsg.c
> @@ -0,0 +1,568 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright 2026 NXP
> + *
> + * The driver exports a standard gpiochip interface to control
> + * the GPIO controllers via RPMSG on a remote processor.
> + */
> +
> +#include <linux/completion.h>
> +#include <linux/device.h>
> +#include <linux/err.h>
> +#include <linux/gpio/driver.h>
> +#include <linux/init.h>
> +#include <linux/irqdomain.h>
> +#include <linux/mod_devicetable.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/of.h>
> +#include <linux/of_device.h>
> +#include <linux/of_platform.h>
> +#include <linux/platform_device.h>
> +#include <linux/remoteproc.h>
> +#include <linux/rpmsg.h>
> +#include <linux/virtio_gpio.h>
> +
> +#define GPIOS_PER_PORT_DEFAULT		32
> +#define RPMSG_TIMEOUT			1000
> +
> +/* Additional commands beyond virtio-gpio */
> +#define VIRTIO_GPIO_MSG_SET_WAKEUP	0x0010
> +
> +/* GPIO Receive MSG Type */
> +#define GPIO_RPMSG_REPLY	1
> +#define GPIO_RPMSG_NOTIFY	2
> +
> +#define CHAN_NAME_PREFIX	"rpmsg-io-"
> +#define GPIO_COMPAT_STR		"rpmsg-gpio"
> +
> +struct rpmsg_gpio_response {
> +	__u8 type;
> +	union {
> +		/* command reply */
> +		struct {
> +			__u8 status;
> +			__u8 value;
> +		};
> +
> +		/* interrupt notification */
> +		struct {
> +			__u8 line;
> +			__u8 trigger; /* rising/falling/high/low */
> +		};
> +	};
> +};
> +
> +struct rpmsg_gpio_line {
> +	u8 irq_shutdown;
> +	u8 irq_unmask;
> +	u8 irq_mask;
> +	u32 irq_wake_enable;
> +	u32 irq_type;
> +};
> +
> +struct rpmsg_gpio_port {
> +	struct gpio_chip gc;
> +	struct rpmsg_device *rpdev;
> +	struct virtio_gpio_request *send_msg;
> +	struct rpmsg_gpio_response *recv_msg;
> +	struct completion cmd_complete;
> +	struct mutex lock;
> +	u32 ngpios;
> +	u32 idx;
> +	struct rpmsg_gpio_line lines[GPIOS_PER_PORT_DEFAULT];
> +};
> +
> +static int rpmsg_gpio_send_message(struct rpmsg_gpio_port *port)
> +{
> +	int ret;
> +
> +	reinit_completion(&port->cmd_complete);
> +
> +	ret = rpmsg_send(port->rpdev->ept, port->send_msg, sizeof(*port->send_msg));
> +	if (ret) {
> +		dev_err(&port->rpdev->dev, "rpmsg_send failed: cmd=%d ret=%d\n",
> +			port->send_msg->type, ret);
> +		return ret;
> +	}
> +
> +	ret = wait_for_completion_timeout(&port->cmd_complete,
> +					  msecs_to_jiffies(RPMSG_TIMEOUT));
> +	if (ret == 0) {
> +		dev_err(&port->rpdev->dev, "rpmsg_send timeout! cmd=%d\n",
> +			port->send_msg->type);
> +		return -ETIMEDOUT;
> +	}
> +
> +	if (unlikely(port->recv_msg->status != VIRTIO_GPIO_STATUS_OK)) {
> +		dev_err(&port->rpdev->dev, "remote core replies an error: cmd=%d!\n",
> +			port->send_msg->type);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static struct virtio_gpio_request *
> +rpmsg_gpio_msg_prepare(struct rpmsg_gpio_port *port, u16 line, u16 cmd, u32 val)
> +{
> +	struct virtio_gpio_request *msg = port->send_msg;
> +
> +	msg->type = cmd;
> +	msg->gpio = line;
> +	msg->value = val;
> +
> +	return msg;
> +}
> +
> +static int rpmsg_gpio_get(struct gpio_chip *gc, unsigned int line)
> +{
> +	struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
> +	int ret;
> +
> +	guard(mutex)(&port->lock);
> +
> +	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_GET_VALUE, 0);
> +
> +	ret = rpmsg_gpio_send_message(port);
> +	return ret ? ret : port->recv_msg->value;
> +}
> +
> +static int rpmsg_gpio_get_direction(struct gpio_chip *gc, unsigned int line)
> +{
> +	struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
> +	int ret;
> +
> +	guard(mutex)(&port->lock);
> +
> +	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_GET_DIRECTION, 0);
> +
> +	ret = rpmsg_gpio_send_message(port);
> +	if (ret)
> +		return ret;
> +
> +	switch (port->recv_msg->value) {
> +	case VIRTIO_GPIO_DIRECTION_IN:
> +		return GPIO_LINE_DIRECTION_IN;
> +	case VIRTIO_GPIO_DIRECTION_OUT:
> +		return GPIO_LINE_DIRECTION_OUT;
> +	default:
> +		break;
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +static int rpmsg_gpio_direction_input(struct gpio_chip *gc, unsigned int line)
> +{
> +	struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
> +
> +	guard(mutex)(&port->lock);
> +
> +	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_DIRECTION,
> +			       VIRTIO_GPIO_DIRECTION_IN);
> +
> +	return rpmsg_gpio_send_message(port);
> +}
> +
> +static int rpmsg_gpio_set(struct gpio_chip *gc, unsigned int line, int val)
> +{
> +	struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
> +
> +	guard(mutex)(&port->lock);
> +
> +	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_VALUE, val);
> +
> +	return rpmsg_gpio_send_message(port);
> +}
> +
> +static int rpmsg_gpio_direction_output(struct gpio_chip *gc, unsigned int line, int val)
> +{
> +	struct rpmsg_gpio_port *port = gpiochip_get_data(gc);
> +	int ret;
> +
> +	guard(mutex)(&port->lock);
> +
> +	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_DIRECTION,
> +			       VIRTIO_GPIO_DIRECTION_OUT);
> +
> +	ret = rpmsg_gpio_send_message(port);
> +	if (ret)
> +		return ret;
> +
> +	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_VALUE, val);
> +
> +	return rpmsg_gpio_send_message(port);
> +}
> +
> +static int gpio_rpmsg_irq_set_type(struct irq_data *d, u32 type)
> +{
> +	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
> +	u32 line = d->hwirq;
> +
> +	switch (type) {
> +	case IRQ_TYPE_EDGE_RISING:
> +		type = VIRTIO_GPIO_IRQ_TYPE_EDGE_RISING;
> +		irq_set_handler_locked(d, handle_simple_irq);
> +		break;
> +	case IRQ_TYPE_EDGE_FALLING:
> +		type = VIRTIO_GPIO_IRQ_TYPE_EDGE_FALLING;
> +		irq_set_handler_locked(d, handle_simple_irq);
> +		break;
> +	case IRQ_TYPE_EDGE_BOTH:
> +		type = VIRTIO_GPIO_IRQ_TYPE_EDGE_BOTH;
> +		irq_set_handler_locked(d, handle_simple_irq);
> +		break;
> +	case IRQ_TYPE_LEVEL_LOW:
> +		type = VIRTIO_GPIO_IRQ_TYPE_LEVEL_LOW;
> +		irq_set_handler_locked(d, handle_level_irq);
> +		break;
> +	case IRQ_TYPE_LEVEL_HIGH:
> +		type = VIRTIO_GPIO_IRQ_TYPE_LEVEL_HIGH;
> +		irq_set_handler_locked(d, handle_level_irq);
> +		break;
> +	default:
> +		dev_err(&port->rpdev->dev, "unsupported irq type: %u\n", type);
> +		return -EINVAL;
> +	}
> +
> +	port->lines[line].irq_type = type;
> +
> +	return 0;
> +}
> +
> +static int gpio_rpmsg_irq_set_wake(struct irq_data *d, u32 enable)
> +{
> +	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
> +	u32 line = d->hwirq;
> +
> +	port->lines[line].irq_wake_enable = enable;
> +
> +	return 0;
> +}
> +
> +/*
> + * This unmask/mask function is invoked in two situations:
> + *   - when an interrupt is being set up, and
> + *   - after an interrupt has occurred.
> + *
> + * The GPIO driver does not access hardware registers directly.
> + * Instead, it caches all relevant information locally, and then sends
> + * the accumulated state to the remote system at this stage.
> + */
> +static void gpio_rpmsg_unmask_irq(struct irq_data *d)
> +{
> +	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
> +	u32 line = d->hwirq;
> +
> +	port->lines[line].irq_unmask = 1;
> +}
> +
> +static void gpio_rpmsg_mask_irq(struct irq_data *d)
> +{
> +	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
> +	u32 line = d->hwirq;
> +
> +	/*
> +	 * When an interrupt occurs, the remote system masks the interrupt
> +	 * and then sends a notification to Linux. After Linux processes
> +	 * that notification, it sends an RPMsg command back to the remote
> +	 * system to unmask the interrupt again.
> +	 */
> +	port->lines[line].irq_mask = 1;
> +}
> +
> +static void gpio_rpmsg_irq_shutdown(struct irq_data *d)
> +{
> +	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
> +	u32 line = d->hwirq;
> +
> +	port->lines[line].irq_shutdown = 1;
> +}
> +
> +static void gpio_rpmsg_irq_bus_lock(struct irq_data *d)
> +{
> +	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
> +
> +	mutex_lock(&port->lock);
> +}
> +
> +static void gpio_rpmsg_irq_bus_sync_unlock(struct irq_data *d)
> +{
> +	struct rpmsg_gpio_port *port = irq_data_get_irq_chip_data(d);
> +	u32 line = d->hwirq;
> +
> +	rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_SET_WAKEUP,
> +			       port->lines[line].irq_wake_enable);
> +	rpmsg_gpio_send_message(port);
> +
> +	/*
> +	 * For mask irq, do nothing here.
> +	 * The remote system will mask interrupt after an interrupt occurs,
> +	 * and then send a notification to Linux system. After Linux system
> +	 * handles the notification, it sends an rpmsg back to the remote
> +	 * system to unmask this interrupt again.
> +	 */
> +	if (port->lines[line].irq_mask && !port->lines[line].irq_unmask) {
> +		port->lines[line].irq_mask = 0;
> +		mutex_unlock(&port->lock);
> +		return;
> +	}
> +
> +	if (port->lines[line].irq_shutdown) {
> +		rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_IRQ_TYPE,
> +				       VIRTIO_GPIO_IRQ_TYPE_NONE);
> +		port->lines[line].irq_shutdown = 0;
> +	} else {
> +		rpmsg_gpio_msg_prepare(port, line, VIRTIO_GPIO_MSG_IRQ_TYPE,
> +				       port->lines[line].irq_type);
> +
> +		if (port->lines[line].irq_unmask)
> +			port->lines[line].irq_unmask = 0;
> +	}
> +
> +	rpmsg_gpio_send_message(port);
> +	mutex_unlock(&port->lock);
> +}
> +
> +static const struct irq_chip gpio_rpmsg_irq_chip = {
> +	.irq_mask = gpio_rpmsg_mask_irq,
> +	.irq_unmask = gpio_rpmsg_unmask_irq,
> +	.irq_set_wake = gpio_rpmsg_irq_set_wake,
> +	.irq_set_type = gpio_rpmsg_irq_set_type,
> +	.irq_shutdown = gpio_rpmsg_irq_shutdown,
> +	.irq_bus_lock = gpio_rpmsg_irq_bus_lock,
> +	.irq_bus_sync_unlock = gpio_rpmsg_irq_bus_sync_unlock,
> +	.flags = IRQCHIP_IMMUTABLE,
> +};
> +
> +static int rpmsg_gpiochip_register(struct rpmsg_device *rpdev,
> +				   struct device_node *np, const char *name)
> +{
> +	struct rpmsg_gpio_port *port;
> +	struct gpio_irq_chip *girq;
> +	struct gpio_chip *gc;
> +	int ret;
> +
> +	port = devm_kzalloc(&rpdev->dev, sizeof(*port), GFP_KERNEL);
> +	if (!port)
> +		return -ENOMEM;
> +
> +	ret = of_property_read_u32(np, "reg", &port->idx);
> +	if (ret)
> +		return ret;
> +
> +	ret = devm_mutex_init(&rpdev->dev, &port->lock);
> +	if (ret)
> +		return ret;
> +
> +	ret = of_property_read_u32(np, "ngpios", &port->ngpios);
> +	if (ret || port->ngpios > GPIOS_PER_PORT_DEFAULT)
> +		port->ngpios = GPIOS_PER_PORT_DEFAULT;
> +
> +	port->send_msg = devm_kzalloc(&rpdev->dev,
> +				      sizeof(*port->send_msg),
> +				      GFP_KERNEL);
> +
> +	port->recv_msg = devm_kzalloc(&rpdev->dev,
> +				      sizeof(*port->recv_msg),
> +				      GFP_KERNEL);
> +	if (!port->send_msg || !port->recv_msg)
> +		return -ENOMEM;
> +
> +	init_completion(&port->cmd_complete);
> +	port->rpdev = rpdev;
> +
> +	gc = &port->gc;
> +	gc->owner = THIS_MODULE;
> +	gc->parent = &rpdev->dev;
> +	gc->fwnode = of_fwnode_handle(np);
> +	gc->ngpio = port->ngpios;
> +	gc->base = -1;
> +	gc->label = devm_kasprintf(&rpdev->dev, GFP_KERNEL, "%s-gpio%d",
> +				   name, port->idx);
> +
> +	gc->direction_input = rpmsg_gpio_direction_input;
> +	gc->direction_output = rpmsg_gpio_direction_output;
> +	gc->get_direction = rpmsg_gpio_get_direction;
> +	gc->get = rpmsg_gpio_get;
> +	gc->set = rpmsg_gpio_set;
> +
> +	girq = &gc->irq;
> +	gpio_irq_chip_set_chip(girq, &gpio_rpmsg_irq_chip);
> +	girq->parent_handler = NULL;
> +	girq->num_parents = 0;
> +	girq->parents = NULL;
> +	girq->chip->name = devm_kstrdup(&rpdev->dev, gc->label, GFP_KERNEL);
> +
> +	dev_set_drvdata(&rpdev->dev, port);
> +
> +	return devm_gpiochip_add_data(&rpdev->dev, gc, port);
> +}
> +
> +static const char *rpmsg_get_rproc_node_name(struct rpmsg_device *rpdev)
> +{
> +	const char *name = NULL;
> +	struct device_node *np;
> +	struct rproc *rproc;
> +
> +	rproc = rproc_get_by_child(&rpdev->dev);
> +	if (!rproc)
> +		return NULL;
> +
> +	np = of_node_get(rproc->dev.of_node);
> +	if (!np && rproc->dev.parent)
> +		np = of_node_get(rproc->dev.parent->of_node);
> +
> +	if (np) {
> +		name = devm_kstrdup(&rpdev->dev, np->name, GFP_KERNEL);
> +		of_node_put(np);
> +	}
> +
> +	return name;
> +}
> +
> +static struct device_node *
> +rpmsg_find_child_by_compat_reg(struct device_node *parent, const char *compat, u32 idx)
> +{
> +	struct device_node *child;
> +	u32 reg;
> +
> +	for_each_available_child_of_node(parent, child) {
> +		if (!of_device_is_compatible(child, compat))
> +			continue;
> +
> +		if (of_property_read_u32(child, "reg", &reg))
> +			continue;
> +
> +		if (reg == idx)
> +			return child;
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct device_node *
> +rpmsg_get_channel_ofnode(struct rpmsg_device *rpdev, const char *compat, u32 idx)
> +{
> +	struct device_node *np_chan = NULL, *np;
> +	struct rproc *rproc;
> +
> +	rproc = rproc_get_by_child(&rpdev->dev);
> +	if (!rproc)
> +		return NULL;
> +
> +	np = of_node_get(rproc->dev.of_node);
> +	if (!np && rproc->dev.parent)
> +		np = of_node_get(rproc->dev.parent->of_node);
> +
> +	if (np)
> +		np_chan = rpmsg_find_child_by_compat_reg(np, compat, idx);
> +
> +	return np_chan;
> +}
> +
> +static int rpmsg_get_gpio_index(const char *name, const char *prefix)
> +{
> +	const char *p;
> +	int base = 10;
> +	int val;
> +
> +	if (!name)
> +		return -EINVAL;
> +
> +	/* Ensure correct prefix */
> +	if (!str_has_prefix(name, prefix))
> +		return -EINVAL;
> +
> +	/* Find last '-' */
> +	p = strrchr(name, '-');
> +
> +	if (!p || *(p + 1) == '\0')
> +		return -EINVAL;
> +
> +	if (p[1] == '0' && (p[2] == 'x' || p[2] == 'X'))
> +		base = 16;
> +
> +	if (kstrtoint(p + 1, base, &val))
> +		return -EINVAL;
> +
> +	return val;
> +}
> +
> +static int rpmsg_gpio_channel_callback(struct rpmsg_device *rpdev, void *data,
> +				       int len, void *priv, u32 src)
> +{
> +	struct rpmsg_gpio_response *msg = data;
> +	struct rpmsg_gpio_port *port = NULL;
> +
> +	port = dev_get_drvdata(&rpdev->dev);
> +
> +	if (!port) {
> +		dev_err(&rpdev->dev, "port is null\n");
> +		return -EINVAL;
> +	}
> +
> +	if (msg->type == GPIO_RPMSG_REPLY) {
> +		*port->recv_msg = *msg;
> +		complete(&port->cmd_complete);
> +	} else if (msg->type == GPIO_RPMSG_NOTIFY) {
> +		generic_handle_domain_irq_safe(port->gc.irq.domain, msg->line);
> +	} else {
> +		dev_err(&rpdev->dev, "wrong message type (0x%x)\n", msg->type);
> +	}
> +
> +	return 0;
> +}
> +
> +static int rpmsg_gpio_channel_probe(struct rpmsg_device *rpdev)
> +{
> +	struct device *dev = &rpdev->dev;
> +	struct device_node *np;
> +	const char *rproc_name;
> +	int idx;
> +
> +	idx = rpmsg_get_gpio_index(rpdev->id.name, CHAN_NAME_PREFIX);
> +	if (idx < 0)
> +		return -EINVAL;
> +
> +	if (!dev->of_node) {
> +		np = rpmsg_get_channel_ofnode(rpdev, GPIO_COMPAT_STR, idx);
> +		if (!np)
> +			return -ENODEV;

This seems to imply that DT nodes are required. RPMSG is a discoverable
bus with a nameservice that can bind/probe new devices. While then optionally
binding to a DT node when available so sub-devices can be described in DT is
fine, I don't see why it should be required.

> +
> +		dev->of_node = np;
> +		set_primary_fwnode(dev, of_fwnode_handle(np));
> +		return -EPROBE_DEFER;
> +	}
> +
> +	rproc_name = rpmsg_get_rproc_node_name(rpdev);
> +
> +	return rpmsg_gpiochip_register(rpdev, dev->of_node, rproc_name);
> +}
> +
> +static const struct of_device_id rpmsg_gpio_dt_ids[] = {
> +	{ .compatible = GPIO_COMPAT_STR },
> +	{ /* sentinel */ }
> +};
> +
> +static struct rpmsg_device_id rpmsg_gpio_channel_id_table[] = {
> +	{ .name = CHAN_NAME_PREFIX },
> +	{ },
> +};
> +MODULE_DEVICE_TABLE(rpmsg, rpmsg_gpio_channel_id_table);
> +
> +static struct rpmsg_driver rpmsg_gpio_channel_client = {
> +	.callback	= rpmsg_gpio_channel_callback,
> +	.id_table	= rpmsg_gpio_channel_id_table,
> +	.probe		= rpmsg_gpio_channel_probe,
> +	.drv		= {
> +		.name	= KBUILD_MODNAME,
> +		.of_match_table = rpmsg_gpio_dt_ids,

Does this line actually do anything anymore? Maybe it did when
this was a platform_driver, but this is a rpmsg_driver and
will probe though .id_table matches.

Andrew

> +	},
> +};
> +module_rpmsg_driver(rpmsg_gpio_channel_client);
> +
> +MODULE_AUTHOR("Shenwei Wang <shenwei.wang@nxp.com>");
> +MODULE_DESCRIPTION("generic rpmsg gpio driver");
> +MODULE_LICENSE("GPL");


^ permalink raw reply

* [PATCH] Docs: SMP: add an SMP docbook chapter
From: Randy Dunlap @ 2026-06-25 21:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Randy Dunlap, Jonathan Corbet, Shuah Khan, linux-doc,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton

Add SMP primitives to the core-api documentation.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
---
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>

 Documentation/core-api/SMP.rst   |   11 +++++++++++
 Documentation/core-api/index.rst |    1 +
 2 files changed, 12 insertions(+)

--- linext-2026-0623.orig/Documentation/core-api/index.rst
+++ linext-2026-0623/Documentation/core-api/index.rst
@@ -81,6 +81,7 @@ Documentation/locking/index.rst for more
    padata
    ../RCU/index
    wrappers/memory-barriers.rst
+   SMP
 
 Low-level hardware management
 =============================
--- /dev/null
+++ linext-2026-0623/Documentation/core-api/SMP.rst
@@ -0,0 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==============
+SMP primitives
+==============
+
+.. kernel-doc:: include/linux/smp.h
+   :internal:
+
+.. kernel-doc:: kernel/smp.c
+   :export:

^ permalink raw reply

* Re: [PATCH 17/19] Documentation: ioctl: add CMH ioctl documentation and register 'J'
From: Krishnamoorthy, Saravanakrishnan @ 2026-06-25 21:21 UTC (permalink / raw)
  To: Randy Dunlap, Albert Ou, Ousherovitch, Alex, Conor Dooley,
	David S. Miller, Herbert Xu, Jonathan Corbet, Krzysztof Kozlowski,
	Palmer Dabbelt, Paul Walmsley, Rob Herring, Shuah Khan
  Cc: Alexandre Ghiti, devicetree@vger.kernel.org, Wittenauer, Joel,
	linux-api@vger.kernel.org, linux-crypto@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-riscv@lists.infradead.org,
	Shuah Khan, SIPSupport, Nguyen, Thi
In-Reply-To: <60a7850d-1527-4517-aae8-cd5cbebcb9b5@infradead.org>

Thanks for catching this, Randy. You're right. cmh_mgmt.rst isn't pulled into any toctree. We will fix it in v2 by adding cmh_mgmt to the toctree in Documentation/userspace-api/ioctl/index.rst (alongside cdrom/hdio).

Krishnan

________________________________________
From: Randy Dunlap <rdunlap@infradead.org>
Sent: Thursday, June 25, 2026 11:29 AM
To: Krishnamoorthy, Saravanakrishnan; Albert Ou; Ousherovitch, Alex; Conor Dooley; David S. Miller; Herbert Xu; Jonathan Corbet; Krzysztof Kozlowski; Palmer Dabbelt; Paul Walmsley; Rob Herring; Shuah Khan
Cc: Alexandre Ghiti; devicetree@vger.kernel.org; Wittenauer, Joel; linux-api@vger.kernel.org; linux-crypto@vger.kernel.org; linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; linux-kselftest@vger.kernel.org; linux-riscv@lists.infradead.org; Shuah Khan; SIPSupport; Nguyen, Thi
Subject: Re: [PATCH 17/19] Documentation: ioctl: add CMH ioctl documentation and register 'J'

[Some people who received this message don't often get email from rdunlap@infradead.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]

Caution: < External Email >

On 6/25/26 10:33 AM, Saravanakrishnan Krishnamoorthy wrote:
> From: Alex Ousherovitch <aousherovitch@rambus.com>
>
> Add Documentation/userspace-api/ioctl/cmh_mgmt.rst documenting the
> ioctl commands on the /dev/cmh_mgmt misc device for the CRI
> CryptoManager Hub (CMH) hardware crypto accelerator driver.  Covers
> key management, KIC key derivation, PKE (RSA, ECDSA, ECDH, EdDSA),
> PQC (ML-KEM, ML-DSA, SLH-DSA), SM2, EAC, and DRBG.
>
> Register ioctl magic number 'J' (0x4A) in ioctl-number.rst.  The
> driver uses ioctls 0x01-0x40.
>
> Co-developed-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
> Signed-off-by: Saravanakrishnan Krishnamoorthy <skrishnamoorthy@rambus.com>
> Signed-off-by: Alex Ousherovitch <aousherovitch@rambus.com>
> Reviewed-by: Joel Wittenauer <Joel.Wittenauer@cryptography.com>
> Reviewed-by: Thi Nguyen <thin@rambus.com>
> ---
>  .../userspace-api/ioctl/cmh_mgmt.rst          | 941 ++++++++++++++++++
>  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>  2 files changed, 942 insertions(+)
>  create mode 100644 Documentation/userspace-api/ioctl/cmh_mgmt.rst
>

One docs build warning:

Documentation/userspace-api/ioctl/cmh_mgmt.rst: WARNING: document isn't included in any toctree [toc.not_included]


--
~Randy


^ permalink raw reply

* Re: [PATCH v14 4/5] gpio: rpmsg: add generic rpmsg GPIO driver
From: Julian Braha @ 2026-06-25 22:17 UTC (permalink / raw)
  To: Shenwei Wang, Linus Walleij, Bartosz Golaszewski, Jonathan Corbet,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Mathieu Poirier, Frank Li, Sascha Hauer
  Cc: Shuah Khan, linux-gpio, linux-doc, linux-kernel,
	Pengutronix Kernel Team, Fabio Estevam, Shenwei Wang, Peng Fan,
	devicetree, linux-remoteproc, imx, linux-arm-kernel, linux-imx,
	Arnaud POULIQUEN, b-padhi, Andrew Lunn, Bartosz Golaszewski
In-Reply-To: <20260625155432.815185-5-shenwei.wang@oss.nxp.com>

Hi Shenwei,

On 6/25/26 16:54, Shenwei Wang wrote:
> +config GPIO_RPMSG
> +	tristate "Generic RPMSG GPIO support"
> +	depends on OF && REMOTEPROC
> +	select GPIOLIB_IRQCHIP
> +	default REMOTEPROC
> +	help
> +	  Say yes here to support the generic GPIO functions over the RPMSG
> +	  bus. Currently supported devices: i.MX7ULP, i.MX8ULP, i.MX8x, and
> +	  i.MX9x.
> +
> +	  If unsure, say N.

You've got GPIO_RPMSG defaulting to REMOTEPROC, which is also a
dependency, so it's effectively the same as using 'default y'.

But then you've got "If unsure, say N" in the help text. Usually this is
used when the default is N.

- Julian Braha

^ permalink raw reply

* Re: [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
From: Sean Christopherson @ 2026-06-25 23:09 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Paul Durrant, kvm, linux-doc, linux-kernel, linux-kselftest
In-Reply-To: <c855535b4262ecd41f67734d19e8f48a7f014c2a.camel@infradead.org>

On Tue, Apr 28, 2026, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Commit 3617c0ee7decb ("KVM: x86/xen: Only write Xen hypercall page for
> guest writes to MSR") blocked host-initiated writes from triggering the
> Xen hypercall page setup, to fix an SRCU usage violation when the
> hypercall MSR index collides with a real MSR written during vCPU reset.
> 
> However, some VMMs legitimately need to trigger hypercall page setup
> from host context. For example, a VMM may intercept the guest's MSR
> write to track an epoch (for kexec/crash recovery), and then replay the
> write as a host-initiated KVM_SET_MSRS to populate the hypercall page.
> The host_initiated check breaks this use case.
> 
> Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE as a new vcpu attribute
> that explicitly invokes kvm_xen_write_hypercall_page() under proper
> locking. This gives userspace a safe interface to trigger hypercall page
> setup without going through the MSR write path, preserving the
> host_initiated defence in depth while restoring the lost functionality.

This is all kinda silly.  Userspace provides KVM a blob, then userspace intercepts
the MSR write that triggers doing something with said blob, only to call back into
KVM to consume the blob that userspace provided in the first place.

Any chance we can deprecate KVM's kvm_xen_write_hypercall_page(), and instead
rely on userspace to fill the page?  This extra bit obviously isn't much code to
carry, but it's yet one more Xen thing to maintain, and we've accumulated a lot
of those over the years...

> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 91fd3673c09a..c16b4560c9e7 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -907,6 +907,13 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
>  {
>  	int idx, r = -ENOENT;
>  
> +	/*
> +	 * kvm_xen_write_hypercall_page() manages its own locking.
> +	 * Handle it before taking xen_lock to avoid a deadlock.

Do we actually want the side effects that necessitate taking xen.xen_lock?  From
a uAPI perspective, it's odd to effectively bundle KVM_XEN_ATTR_TYPE_LONG_MODE
into KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE.

The other question is, why does kvm_xen_write_hypercall_page() drop xen_lock
when writing guest memory?  That seems odd and unnecessary.

> +	 */
> +	if (data->type == KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE)
> +		return kvm_xen_write_hypercall_page(vcpu, data->u.gpa) ? -EIO : 0;

-EIO is rather weird, wouldn't -EINVAL be more appropriate?  Ah, and both are
wrong if copying the blob fails.

> +
>  	mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
>  	idx = srcu_read_lock(&vcpu->kvm->srcu);

Speaking of writing memory, kvm_xen_write_hypercall_page() expects the caller
to be in a read-side SRCU critical section (I didn't actually run this with
PROVE_LOCKING=y, but I don't think I'm missing anything?)

So, if this uAPI is unavoidable seems like we want something like the below.
Either that or guard all of kvm_xen_write_hypercall_page() with a lock, and put
the entire thing in a helper so that KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
can be handled in a case-statement and doesn't need to grab SRCU on its own.

---
 arch/x86/include/uapi/asm/kvm.h |  3 ++
 arch/x86/kvm/x86.c              |  3 +-
 arch/x86/kvm/xen.c              | 64 ++++++++++++++++++++-------------
 3 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1585ec804066..7732b92a4db0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -598,6 +598,7 @@ struct kvm_x86_mce {
 #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG	(1 << 6)
 #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE	(1 << 7)
 #define KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA	(1 << 8)
+#define KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE	(1 << 9)
 
 #define KVM_XEN_MSR_MIN_INDEX			0x40000000u
 #define KVM_XEN_MSR_MAX_INDEX			0x4fffffffu
@@ -706,6 +707,8 @@ struct kvm_xen_vcpu_attr {
 #define KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR	0x8
 /* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA */
 #define KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA	0x9
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE */
+#define KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE 0xa
 
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0626e835e9eb..ced19e84bf6c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2287,7 +2287,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		    KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL |
 		    KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
 		    KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE |
-		    KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA;
+		    KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA |
+		    KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE;
 		if (sched_info_on())
 			r |= KVM_XEN_HVM_CONFIG_RUNSTATE |
 			     KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG;
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index db10f12d10cf..b72845aa67e2 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -904,6 +904,8 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 	return r;
 }
 
+static int __kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data);
+
 int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 {
 	int idx, r = -ENOENT;
@@ -1138,7 +1140,9 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 			r = 0;
 		}
 		break;
-
+	case KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE:
+		r = __kvm_xen_write_hypercall_page(vcpu, data->u.gpa);
+		break;
 	default:
 		break;
 	}
@@ -1274,30 +1278,12 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 	return r;
 }
 
-int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
+static int __kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 {
-	struct kvm *kvm = vcpu->kvm;
 	u32 page_num = data & ~PAGE_MASK;
 	u64 page_addr = data & PAGE_MASK;
 	bool lm = is_long_mode(vcpu);
-	int r = 0;
-
-	mutex_lock(&kvm->arch.xen.xen_lock);
-	if (kvm->arch.xen.long_mode != lm) {
-		kvm->arch.xen.long_mode = lm;
-
-		/*
-		 * Re-initialize shared_info to put the wallclock in the
-		 * correct place.
-		 */
-		if (kvm->arch.xen.shinfo_cache.active &&
-		    kvm_xen_shared_info_init(kvm))
-			r = 1;
-	}
-	mutex_unlock(&kvm->arch.xen.xen_lock);
-
-	if (r)
-		return r;
+	struct kvm *kvm = vcpu->kvm;
 
 	/*
 	 * If Xen hypercall intercept is enabled, fill the hypercall
@@ -1310,7 +1296,7 @@ int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 		int i;
 
 		if (page_num)
-			return 1;
+			return -EINVAL;
 
 		/* mov imm32, %eax */
 		instructions[0] = 0xb8;
@@ -1329,7 +1315,7 @@ int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 			if (kvm_vcpu_write_guest(vcpu,
 						 page_addr + (i * sizeof(instructions)),
 						 instructions, sizeof(instructions)))
-				return 1;
+				return -EINVAL;
 		}
 	} else {
 		/*
@@ -1344,7 +1330,7 @@ int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 		int ret;
 
 		if (page_num >= blob_size)
-			return 1;
+			return -EINVAL;
 
 		blob_addr += page_num * PAGE_SIZE;
 
@@ -1355,11 +1341,39 @@ int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 		ret = kvm_vcpu_write_guest(vcpu, page_addr, page, PAGE_SIZE);
 		kfree(page);
 		if (ret)
-			return 1;
+			return -EINVAL;
 	}
 	return 0;
 }
 
+
+int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
+{
+	struct kvm *kvm = vcpu->kvm;
+	bool lm = is_long_mode(vcpu);
+	int r = 0;
+
+	mutex_lock(&kvm->arch.xen.xen_lock);
+	if (kvm->arch.xen.long_mode != lm) {
+		kvm->arch.xen.long_mode = lm;
+
+		/*
+		 * Re-initialize shared_info to put the wallclock in the
+		 * correct place.
+		 */
+		if (kvm->arch.xen.shinfo_cache.active &&
+		    kvm_xen_shared_info_init(kvm))
+			r = 1;
+	}
+	mutex_unlock(&kvm->arch.xen.xen_lock);
+
+	if (r)
+		return r;
+
+
+	return __kvm_xen_write_hypercall_page(vcpu, data) ? 1 : 0;
+}
+
 int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc)
 {
 	/* Only some feature flags need to be *enabled* by userspace */

base-commit: 8867ab1259b34261eaa96f1f3a2a092cd03e5a17
--

^ permalink raw reply related

* Re: [PATCH v4 2/2] tracing: Remove trace_printk.h from kernel.h
From: Nathan Chancellor @ 2026-06-25 23:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
	Sebastian Andrzej Siewior, John Ogness, Thomas Gleixner,
	Peter Zijlstra, Julia Lawall, Yury Norov, linux-doc, linux-kbuild,
	linuxppc-dev, dri-devel, linux-stm32, linux-arm-kernel,
	linux-rdma, linux-usb, linux-ext4, linux-nfs, kvm, intel-gfx
In-Reply-To: <20260625104402.210473477@kernel.org>

Hi Steve,

On Thu, Jun 25, 2026 at 06:40:09AM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> There have been complaints about trace_printk.h causing more build time
> for being in kernel.h if it changes. There is also an effort to clean up
> kernel.h to have it not include unneeded header files. Move trace_printk.h
> out of kernel.h and place it in the headers and C files that use it.
> 
> Link: https://lore.kernel.org/all/CAHk-=wikCBeVFjVXiY4o-oepdbjAoir5+TcAgtL12c4u1TpZLQ@mail.gmail.com/
> 
> Suggested-by: Yury Norov <yury.norov@gmail.com>
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

This patch breaks lib/test_context-analysis.c for me in several
configurations:

  In file included from lib/test_context-analysis.c:9:
  In file included from include/linux/local_lock.h:5:
  include/linux/local_lock_internal.h:46:2: error: use of undeclared identifier '_THIS_IP_'
     46 |         lock_map_acquire(&l->dep_map);
        |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  include/linux/lockdep.h:541:69: note: expanded from macro 'lock_map_acquire'
    541 | #define lock_map_acquire(l)                     lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)
        |                                                                                       ^~~~~~~~~
  In file included from lib/test_context-analysis.c:9:
  In file included from include/linux/local_lock.h:5:
  include/linux/local_lock_internal.h:53:2: error: use of undeclared identifier '_THIS_IP_'
     53 |         lock_map_acquire_try(&l->dep_map);
        |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  include/linux/lockdep.h:542:73: note: expanded from macro 'lock_map_acquire_try'
    542 | #define lock_map_acquire_try(l)                 lock_acquire_exclusive(l, 0, 1, NULL, _THIS_IP_)
        |                                                                                       ^~~~~~~~~
  In file included from lib/test_context-analysis.c:9:
  In file included from include/linux/local_lock.h:5:
  include/linux/local_lock_internal.h:62:2: error: use of undeclared identifier '_THIS_IP_'
     62 |         lock_map_release(&l->dep_map);
        |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  include/linux/lockdep.h:545:47: note: expanded from macro 'lock_map_release'
    545 | #define lock_map_release(l)                     lock_release(l, _THIS_IP_)
        |                                                                 ^~~~~~~~~
  3 errors generated.

The following diff resolves it for me, should I send it as a separate
patch or do you want to just fold it in with a note?

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 621566345406..2301a701ffbb 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -10,6 +10,7 @@
 #ifndef __LINUX_LOCKDEP_H
 #define __LINUX_LOCKDEP_H
 
+#include <linux/instruction_pointer.h>
 #include <linux/lockdep_types.h>
 #include <linux/smp.h>
 #include <asm/percpu.h>
-- 
Cheers,
Nathan

^ permalink raw reply related

* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-26  0:04 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgFfgV0FbQLzP8hhNH5hMGaQao6OFQin4cb3TAmC7SVhfA@mail.gmail.com>

On Thu, Jun 25, 2026 at 11:20:30AM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jun 24, 2026 at 05:05:44PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >>
> >> >
> >> > [...snip...]
> >> >
> >> >>
> >> >>  #ifdef kvm_arch_has_private_mem
> >> >> -bool __ro_after_init gmem_in_place_conversion = false;
> >> >> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
> >> >> +module_param(gmem_in_place_conversion, bool, 0444);
> >> >
> >> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> >> > MMAP flag. In such cases, shared memory is allocated from different backends.
> >> > This means this module parameter only enables per-gmem memory attribute and does
> >> > not guarantee that gmem in-place conversion will actually occur.
> >> >
> >> > To avoid confusion, could we rename this module parameter to something more
> >> > accurate, such as gmem_memory_attribute?
> >> >
> >>
> >> I asked Sean about this after getting some fixes off list. Sean said
> >> gmem_in_place_conversion is named for a host admin to use, and something
> >> like gmem_memory_attributes is too much implementation details for the
> >> admin.
> > Thanks for this background.
> >
> > Some more context on why I'm asking:
> >
> > Currently, I'm testing TDX huge pages with the following two gmem components:
> > 1. The gmem memory attribute in this gmem in-place conversion v8.
> > 2. The gmem 2MB from buddy allocator. (for development/testing only).
> >
> > The gmem 2MB from buddy allocator allocates 2MB folios from buddy for private
> > memory, while shared memory is allocated from a different backend.
> > (To avoid fragmentation, only private mappings are split during private-to-shared
> > conversions. In this approach, the 2MB folios are always retained in the gmem
> > inode filemap cache without splitting.)
> >
> > Since shared memory is not allocated from gmem, there're no in-place conversions.
> > The reason I'm using "gmem memory attribute" is that the per-VM attribute is
> > being deprecated, as suggested by Sean [1].
> >
> 
> v8 of conversions series changed that slightly, per-VM attributes is
> going to stay around (because of work on RWX attributes, coming up) and
> RWX will stay tracked at the VM level.
> 
> For v8 and beyond, only tracking of private/shared in per-VM attributes
> is being deprecated.
> 
> By extension the entire thing about using guest_memfd for private memory
> and a different backing memory for shared memory is being deprecated.
Thanks for the info. I was actually referring to the per-VM shared/private
attribute, which is being deprecated. Sean hoped TDX huge page would be the
first mandated user of the per-gmem shared/private attribute.


> > Besides my current usage,
> 
> I think you can set up guest_memfd+2M for private memory and shared
> memory from some other source, and that's the deprecated usage pattern.

Yes, though this is the deprecated usage pattern, gmem_in_place_conversion=true
allows it.

In fact, even without huge pages, v8 allows userspace to have shared memory
allocated from other source when gmem_in_place_conversion=true.
(My default testing of this series for the 4KB setting is with this
configuration).

> > there may be other scenarios where gmem memory
> > attributes is preferred without allocating shared memory from gmem.
> > (e.g., PAGE.ADD from a temp extra shared source memory).
> >
> 
> Is this TDH.MEM.PAGE.ADD, used indirectly from
> tdx_gmem_post_populate()? This use case isn't blocked. Even if
> gmem_in_place_conversion=true, you can still set src_address to
> non-guest_memfd memory and load from anywhere you like.
> 
> Please let me know if that is broken! I think I accidentally used that
It's not broken. I tested it with my hacked-up QEMU.

> setup in selftests and it worked. The selftests are now defaulting to
> in-place conversion.
> 
> > For such use cases, I'm concerns that the admins may find it confusing if they
> > enable gmem_in_place_conversion but still observe extra memory consumptions for
> > shared memory.
> >
> 
> Hmm but I guess if someone enables gmem_in_place_conversion but still
> allocates from elsewhere, they'd have to figure it out?

If gmem_in_place_conversion=true means gmem in place conversion is allowed (but
not enforced), I agree.

I'm wondering if we could rename it to "allow_gmem_in_place_conversion":)

> > [1] https://lore.kernel.org/kvm/aWmEegVP_A613WIr@google.com/
> >
> >> Sean, would you reconsider since Yan also asked? If the admin compiled
> >> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> >> admin would also be able to use a param like gmem_memory_attributes?
> >>
> >> There's the additional benefit that the similar naming aids in
> >> understanding for both the admin and software engineers.
> >>
> >> Either way, in the next revision, I'll also add this documentation for
> >> this module_param:
> >>
> >>   Setting the module parameter gmem_in_place_conversion to true will
> >>   enable the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl and disables
> >>   the KVM_SET_MEMORY_ATTRIBUTES VM ioctl. If gmem_in_place_conversion is
> >>   true, the private/shared attribute will be tracked per-guest_memfd
> >>   instead of per-VM.
> >>
> >> Let me know what y'all think of the wording!
> >>
> >> >>
> >> >> [...snip...]
> >> >>

^ permalink raw reply

* Re: [PATCH] docs: pagemap: fix flags location, member name and sample code
From: SeongJae Park @ 2026-06-26  0:04 UTC (permalink / raw)
  To: Zenghui Yu
  Cc: SeongJae Park, linux-mm, linux-doc, linux-kernel, akpm, david,
	ljs, liam, vbabka, rppt, surenb, mhocko, corbet, skhan
In-Reply-To: <20260625174447.24292-1-zenghui.yu@linux.dev>

On Fri, 26 Jun 2026 01:44:47 +0800 Zenghui Yu <zenghui.yu@linux.dev> wrote:

> The userland visible page flags (KPF_*) were initially moved to
> include/linux/kernel-page-flags.h in commit 1a9b5b7fe0c5 ("mm: export
> stable page flags"), and later moved to
> include/uapi/linux/kernel-page-flags.h in commit 607ca46e97a1 ("UAPI:
> (Scripted) Disintegrate include/linux"). Upadte the doc to reflect the
> current location of these flags.
> 
> The member @walk_end of struct pm_scan_arg {} was wrongly written as
> "end_walk".
> 
> The first sample code of the PAGEMAP_SCAN ioctl wrongly used the
> PM_SCAN_CHECK_WPASYNC flag twice, instead of the PM_SCAN_WP_MATCHING flag.
> The second one missed PAGE_IS_FILE in the required mask.
> 
> Fix them all together.

Looks good to me, thank you!

> 
> Signed-off-by: Zenghui Yu <zenghui.yu@linux.dev>

Reviewed-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]

^ permalink raw reply

* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Ackerley Tng @ 2026-06-26  0:07 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <ajyRg3BwGu5dCfOn@yzhao56-desk.sh.intel.com>

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>> > On Tue, Jun 23, 2026, Yan Zhao wrote:
>> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
>> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
>> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
>> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
>> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
>> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
>> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
>> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> >> > > > >  	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
>> >> > > > >  		return -EIO;
>> >> > > > >
>> >> > > > > -	if (!src_page)
>> >> > > > > -		return -EOPNOTSUPP;
>> >> > > > > +	if (!src_page) {
>> >> > > > > +		if (!gmem_in_place_conversion)
>> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
>> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
>> >> > > > error.
>> >> > >
>> >> > > Why MMAP?
>> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
>> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
>> >> >
>> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
>> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
>> >> > > and written memory.  And when write() lands, MMAP wouldn't be necessary to
>> >> > > initialize the memory.
>> >> > Do you mean using up-to-date flag as below?
>> >
>> > Yes?  I didn't actually look at the implementation details.
>> >
>> >> > if (!src_page) {
>> >> > 	src_page = pfn_to_page(pfn);
>> >> > 	if (!folio_test_uptodate(page_folio(src_page)))
>> >> > 		return -EOPNOTSUPP;
>> >> > }
>>
>> Yan is right that with the earlier patch "Zero page while getting pfn",
>> folio_test_uptodate() here will always return true.
>>
>> Actually, this is an alternative fix for the issue Sashiko pointed out
>> on v7 where userspace can do a populate() (either TDX or SNP) without
>> first allocating the page, with src_address == NULL, and leak
>> uninitialized memory into the guest.
>>
>> Advantage of using the uptodate check in populate: if the host never
>> allocates the page, populate doesn't incur zeroing before writing the
>> page anyway in populate().
>>
>> Disadvantage: Both TDX and SNP will have to implement this uptodate
>> check. guest_memfd can't check centrally because for SNP, for a
>> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
>> firmware will zero and there's no leakage of uninitialized host memory?
> Another disadvantage: the uptodate flag is per-folio. What if the folio
> is only partially initialized by the userspace especially after huge page is
> supported?
>

Good point on huge pages!

The uptodate flag on the folio in guest_memfd means "this folio has been
written to". As of now (before patch at [1]), this happens when

+ folio is zeroed on first use by userspace
+ folio is zeroed on first use of the guest
+ folio is populated

When huge pages are supported, the folio can't partially be initialized?

On allocation, if any part is shared, we split the page. The parts are
separate folios that have their own uptodate flags.

On splitting, if the huge page is uptodate, the split pages will also be
uptodate. If the huge page is not uptodate, the split pages won't be
uptodate, but that's ok since they will be marked uptodate on first use.

On merging, the non-uptodate parts have to be zeroed and then marked
uptodate. Any parts that are in use would have been marked uptodate
already, so there's no overwriting data that is in use. I'll need to
think more about when it's safe to zero.

I'm still on the fence between the two options

1. Using uptodate check in populate to reject src_pages that have never
   been written to or
2. Always zero before populate

but whether the uptodate flag is per-folio or not doesn't affect these
two options in terms of fixing the leak of uninitialized host memory,
right?

>
>> >> Another concern with this fix is that:
>> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
>> >> folio uptodate before reaching post_populate().
>> >>
>> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
>> >>
>> >> > One concern is that TDX now does not much care about the up-to-date flag since
>> >> > TDX doesn't rely on the flag to clear pages on conversions.
>> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
>> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
>> >> > user access.
>> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
>> >> > to have more effort than introducing a dedicated flag for TDX.
>> >> >
>> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
>> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
>> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
>> >> > > > 1. create guest_memfd with MMAP flag
>> >> > > > 2. mmap the guest_memfd.
>> >> > > > 3. convert the initial memory range to shared.
>> >> > > > 4. copy initial content to the source page.
>> >> > > > 5. convert the initial memory range to private
>> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
>> >> > > > 7. do not unmap the source backend.
>> >> > > >
>> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
>> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
>> >> > >
>> >> > > Why?  It's userspace's responsibility to get the above right.  If userspace fails
>> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
>>
>> Yan, is your concern that userspace forgot to update the code and
>> forgets to provide a src_page, and if we keep the "Zero page while
> Yes. Previously, it would be rejected after GUP fails.
>

I see, didn't realize previously it would be rejected because GUP
fails. GUP failed because it wasn't faulted into the host?

That's kind of orthogonal, I don't think GUP fail leading to rejecting
populate was meant to help userspace catch these issues. GUP would also
fail if the user did mmap(), write to it, unmap using
madvise(MADV_DONTNEED), then forget and pass 0 as src_address.

>> getting pfn" patch, ends up with the guest silently having a zero page?
>> I think that would be found quite early in userspace VMM testing...
> I actually encountered this during testing this patch.
> I update most code path to follow this sequence. However, still some corner ones
> for TDVF HOB, which are less obvious and harder to update.
> The TD just booted up and hang silently.
>

I think this is just the life of a close-to-hardware software engineer
:P no errors, got stuck somewhere, root cause is some unitialized
thing.

>> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
>> >> > kernel to detect this mistake, similar to how it validates whether source_addr
>> >> > is PAGE_ALIGNED.
>> >
>> > The alignment case is different.  If userspace provides an unaligned value, KVM
>> > *can't* do what userspace is asking because hardware and thus KVM only supports
>> > converting on page boundaries.
>> >
>> > For a NULL source, KVM can still do what userspace is asking.  Rejecting userspace's
>> > request would then be making assumptions about what userspace wants.
>> >
>>
>> Also, +1 on this, what if userspace, knowing that pages are zeroed on
>> allocation, actually wants to rely on that to get a zero page in the guest?
> What if 0 uaddr is a valid address? :)
>
>> >> > Since userspace already needs to perform additional steps to enable in-place
>> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
>> >> > intentional seems like a reasonable burden.
>> >
>> > I don't see how it adds any value.  I wouldn't be at all surprised if most VMMs
>> > just wen up with code that does:
>> >
>> > 	if (in-place) {
>> > 		src = NULL;
>> > 		flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
>> > 	}
>>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox