* Re: [PATCH 01/15] x86/virt/tdx: Read global metadata for TDX Module Extensions
From: Kiryl Shutsemau @ 2026-05-27 15:35 UTC (permalink / raw)
To: Xiaoyao Li
Cc: Xu Yilun, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
zhenzhong.duan
In-Reply-To: <956fa1e6-2920-4b2e-8037-d4b9d812ae53@intel.com>
On Mon, May 25, 2026 at 02:54:40PM +0800, Xiaoyao Li wrote:
> On 5/22/2026 11:41 AM, Xu Yilun wrote:
> ...
> > +static __init int get_tdx_sys_info_ext(struct tdx_sys_info_ext *sysinfo_ext)
> > +{
> > + int ret = 0;
> > + u64 val;
> > +
> > + if (!ret && !(ret = read_sys_metadata_field(0x3100000100000000, &val)))
> > + sysinfo_ext->memory_pool_required_pages = val;
> > + if (!ret && !(ret = read_sys_metadata_field(0x3100000000000001, &val)))
> > + sysinfo_ext->ext_required = val;
> > +
> > + return ret;
> > +}
> > +
> > static __init int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
> > {
> > int ret = 0;
> > @@ -116,5 +129,8 @@ static __init int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
> > ret = ret ?: get_tdx_sys_info_td_ctrl(&sysinfo->td_ctrl);
> > ret = ret ?: get_tdx_sys_info_td_conf(&sysinfo->td_conf);
> > + if (sysinfo->features.tdx_features0 & TDX_FEATURES0_EXT)
> > + ret = ret ?: get_tdx_sys_info_ext(&sysinfo->ext);
>
> Is it correct to read "memory_pool_required_pages" and "ext_required" so
> early in get_tdx_sys_info()? get_tdx_sys_info() is called before
> config_tdx_module() which calls TDH.SYS.CONFIG.
>
> If I read the TDX module base spec correctly, the amount of memory for
> extensions and EXT_REQUIRED field depends on the enabled features, which is
> determined by TDH.SYS.CONFIG/TDH.SYS.UPDATE ?
This is my read too. Looks like we need a separate step after
config_tdx_module() to readout config-dependatant metadata.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH v3 1/2] x86/tdx: Fix off-by-one in port I/O handling
From: Edgecombe, Rick P @ 2026-05-27 15:38 UTC (permalink / raw)
To: x86@kernel.org, mingo@redhat.com, kas@kernel.org, tglx@kernel.org,
bp@alien8.de, dave.hansen@linux.intel.com
Cc: seanjc@google.com, Huang, Kai, hpa@zytor.com,
sathyanarayanan.kuppuswamy@linux.intel.com,
linux-kernel@vger.kernel.org, tsyrulnikov.borys@gmail.com,
kvm@vger.kernel.org, linux-coco@lists.linux.dev,
stable@vger.kernel.org
In-Reply-To: <20260527120544.2903923-2-kas@kernel.org>
On Wed, 2026-05-27 at 13:05 +0100, Kiryl Shutsemau (Meta) wrote:
> handle_in() and handle_out() in arch/x86/coco/tdx/tdx.c use:
>
> u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
>
> GENMASK(h, l) includes bit h. For size=1 (INB), this produces
> GENMASK(8, 0) = 0x1FF (9 bits) instead of GENMASK(7, 0) = 0xFF (8
> bits). The mask is one bit too wide for all I/O sizes.
>
> Fix the mask calculation.
>
> Fixes: 03149948832a ("x86/tdx: Port I/O: Add runtime hypercalls")
> Reported-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
> Link:
> https://lore.kernel.org/all/CAKw_Dz96rfSQc6Rn+9QBcUFHhmkK+9zu+P=bxowfZwxrATCBRg@mail.gmail.com/
> Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com>
> Cc: stable@vger.kernel.org
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
^ permalink raw reply
* Re: [PATCH v3 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Edgecombe, Rick P @ 2026-05-27 15:45 UTC (permalink / raw)
To: x86@kernel.org, mingo@redhat.com, kas@kernel.org, tglx@kernel.org,
bp@alien8.de, dave.hansen@linux.intel.com
Cc: seanjc@google.com, Huang, Kai, hpa@zytor.com,
sathyanarayanan.kuppuswamy@linux.intel.com,
linux-kernel@vger.kernel.org, tsyrulnikov.borys@gmail.com,
kvm@vger.kernel.org, linux-coco@lists.linux.dev,
stable@vger.kernel.org
In-Reply-To: <20260527120544.2903923-3-kas@kernel.org>
On Wed, 2026-05-27 at 13:05 +0100, Kiryl Shutsemau (Meta) wrote:
> + /*
> + * IN writes the result into a sub-register of RAX. Only the
> + * 32-bit form zero-extends; the smaller forms leave the upper
> + * bits untouched:
> + *
> + * insn dest size bits written bits preserved
> + * inb AL 1 RAX[ 7: 0] RAX[63: 8]
> + * inw AX 2 RAX[15: 0] RAX[63:16]
> + * inl EAX 4 RAX[63: 0] (none, zero-extended)
We are working on getting the GHCI spec amended to clarify who is supposed to do
this zero-extending and masking, host or guest. For this and the similar
tdvmcalls. The process involves getting all VMMs in agreement.
Today I think the spec doesn't say to *not* do it, so I think it is reasonable
to merge this, but there is some small risk of complications depending on how
that discussion goes.
> + *
> + * 'mask' only covers the low 'size' bytes, which is exactly the
> + * range affected for size 1 and 2. For size 4 the write also
> + * clears RAX[63:32], so widen the clear-mask.
> + */
^ permalink raw reply
* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Sohil Mehta @ 2026-05-27 17:09 UTC (permalink / raw)
To: Xu Yilun
Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
linux-kernel, kvm, yilun.xu, baolu.lu, zhenzhong.duan, xiaoyao.li
In-Reply-To: <ahbJnCFlGFwxrEkw@yilunxu-OptiPlex-7050>
On 5/27/2026 3:38 AM, Xu Yilun wrote:
>
> Because for security purpose, these add-on features are always needed,
> even if not all of them, so Extensions will most likely be enabled.
>
A cover letter is a good place to explain such nuances, alternate
approaches, and tradeoffs.
> And even if someone switched them off all and saved the memory, compared
> to the memory of a typical TDX capable system (lets say 1TB), the saving
> is still little (0.001%).
>
In this case percentages make it harder to understand. Does it need a
fixed amount of memory (~50MB) irrespective of the feature or the number
of features? If so, it would be good to mention that.
>> In addition, could you briefly describe the complexity we are trading off?
>
> If we delay the Extensions initialization to the first Extension
> SEAMCALL, we need to maintain additional TDX state machine for
> lifecycle, and we need mechanisms to synchronize parallel Extension
> enabling request from multiple callers.
This would be good to include in the cover as well.
^ permalink raw reply
* Re: [PATCH 01/15] x86/virt/tdx: Read global metadata for TDX Module Extensions
From: Sohil Mehta @ 2026-05-27 17:17 UTC (permalink / raw)
To: Xu Yilun
Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
linux-kernel, kvm, yilun.xu, baolu.lu, zhenzhong.duan, xiaoyao.li
In-Reply-To: <ahaZPjf+A7ms0Ba9@yilunxu-OptiPlex-7050>
On 5/27/2026 12:11 AM, Xu Yilun wrote:
>>> +struct tdx_sys_info_ext {
>>> + u16 memory_pool_required_pages;
>>> + u8 ext_required;
>>
>> The name ext_required seems like a boolean. It is also used like a
>> boolean later.
>> if (!tdx_sysinfo.ext.ext_required)
>> return 0;
>>
>> But, IIUC, is it actually a mask that lists any feature that needs
>
> No it is just a bool about Extentions needs to be initialized or not.
>
How does the kernel know which features need Extensions? Is there any
hardware enumeration or the kernel just keeps a static list?
^ permalink raw reply
* Re: [PATCH v3 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Dave Hansen @ 2026-05-27 17:45 UTC (permalink / raw)
To: Kiryl Shutsemau (Meta), Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86
Cc: H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
Kai Huang, Sean Christopherson, Borys Tsyrulnikov, linux-kernel,
linux-coco, kvm, stable
In-Reply-To: <20260527120544.2903923-3-kas@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 1959 bytes --]
On 5/27/26 05:05, Kiryl Shutsemau (Meta) wrote:
...
> - /* Update part of the register affected by the emulated instruction */
> - regs->ax &= ~mask;
> + /*
> + * IN writes the result into a sub-register of RAX. Only the
> + * 32-bit form zero-extends; the smaller forms leave the upper
> + * bits untouched:
> + *
> + * insn dest size bits written bits preserved
> + * inb AL 1 RAX[ 7: 0] RAX[63: 8]
> + * inw AX 2 RAX[15: 0] RAX[63:16]
> + * inl EAX 4 RAX[63: 0] (none, zero-extended)
> + *
> + * 'mask' only covers the low 'size' bytes, which is exactly the
> + * range affected for size 1 and 2. For size 4 the write also
> + * clears RAX[63:32], so widen the clear-mask.
> + */
> + if (size == 4)
> + regs->ax = 0;
> + else
> + regs->ax &= ~mask;
> +
Is there any way we could do this with fewer comments and more code?
I mean, there's only three cases. Why have;
u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
When there are only 3 possible cases:
1 => 0xf
2 => 0xff
4 => 0xffff
and one of those cases needs a special case on top of it.
Maybe something like this?
/* Clear out part of RAX so part of args.r11 can be OR'd in: */
switch (size) {
case 1:
/* inb consumes lower 8 bits of r11: */
regs->ax &= ~GENMASK_ULL(7, 0);
args.r11 &= GENMASK_ULL(7, 0);
break;
case 2:
/* inw consumes lower 16 bits of r11: */
regs->ax &= ~GENMASK_ULL(15, 0);
args.r11 &= GENMASK_ULL(15, 0);
break;
case 4:
/* inl is weird and zeros the whole register: */
regs->ax &= ~GENMASK_ULL(63, 0);
/* But only consumes 32-bits from r11: */
args.r11 &= GENMASK_ULL(31, 0);
break;
default:
/* Probable TDX module bug. Illegal in[bwl] size: */
WARN_ON_ONCE(1);
success = 0;
}
if (success)
regs->ax |= args.r11;
It might need a temporary variable for args.r11, but you get the point.
That's basically the data from the comment but written as code.
[-- Attachment #2: tdxinX.patch --]
[-- Type: text/x-patch, Size: 93 bytes --]
tdx.c | 29 ++++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
^ permalink raw reply
* Re: [PATCH v5 5/5] iommufd/vdevice: add TSM request ioctl
From: Aneesh Kumar K.V @ 2026-05-27 17:49 UTC (permalink / raw)
To: Dan Williams (nvidia), Alexey Kardashevskiy, linux-coco, iommu,
linux-kernel, kvm
Cc: Bjorn Helgaas, Dan Williams, Jason Gunthorpe, Joerg Roedel,
Jonathan Cameron, Kevin Tian, Nicolin Chen, Samuel Ortiz,
Steven Price, Suzuki K Poulose, Will Deacon, Xu Yilun,
Shameer Kolothum, Paolo Bonzini, Tony Krowiak, Halil Pasic,
Jason Herne, Harald Freudenberger, Holger Dengler, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Alex Williamson, Matthew Rosato, Farhan Ali,
Eric Farman, linux-s390
In-Reply-To: <yq5apl2gsw6y.fsf@kernel.org>
Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes:
> "Dan Williams (nvidia)" <djbw@kernel.org> writes:
>
>> Alexey Kardashevskiy wrote:
>>>
......
>>>
>>> I have 3 types of requests to fit here, all go via VM -> KVM -> QEMU -> IOMMUFD -> TSM.
>>>
>>> 1) bind/unbind TDI <- moves to CONFIG_LOCKED, this is "OP";
>>> 2) start/stop TDI <- moves to RUN, this is "GR"? Right now I route it via "OP";
>>> 3) enable/disable MMIO/DMA <- no TDI state change, this is "GR" but which scope is it here?
>>
>> The scope parameter was meant to enumerate a security model for classes
>> of commands that are otherwise opaque to the kernel. However, none of
>> the commands we are targeting are opaque (private specification with
>> unknown effect). It now turns out there is no role for @scope for
>> security.
>>
>> Now a command family that iommufd can validate seems useful. As it
>> stands this implementation aliases command codes across TSMs. Do we
>> proceed with creating an actual shared command uapi for the truly shared
>> commands:
>>
>> TSM_REQ_TYPE_DEFAULT: Commands every arch needs
>> TSM_REQ_READ_OBJECT
>> TSM_REQ_REGEN_OBJECT
>> TSM_REQ_OBJECT_INFO
>> TSM_REQ_VALIDATE_MMIO
>> TSM_REQ_SET_TDI_STATE
>>
>> TSM_REQ_TYPE_SEV: Commands only SEV needs
>> TSM_REQ_SEV_ENABLE_DMA
>> TSM_REQ_SEV_DISABLE_DMA
>>
>> ...or just observe that per CC arch commands are needed to setup the VM
>> so per CC arch commands are needed to marshal device assignment support
>> requests.
>>
>> In that case pci_tsm_req_scope becomes tsm_req_type and is just:
>>
>> TSM_REQ_TYPE_CCA
>> TSM_REQ_TYPE_SEV
>> TSM_REQ_TYPE_TDX
>>
>> I am leaning towards the latter at this point.
>
> But we already have struct pci_tsm_ops::guest_req, which is specific to
> the underlying CC architecture. From the above, pci_tsm_req_scope also
> appears to carry the same information. Is that useful?
>
I think there is value in having the VMM express the guest’s
confidential computing architecture, so that the TSM backend can
validate whether it should handle that guest request ?.
So it would not be the IOMMU validating the scope value, but rather
pci_tsm_ops::guest_req.
static ssize_t cca_tsm_guest_req(struct pci_tdi *tdi, enum pci_tsm_req_scope scope,
sockptr_t req, size_t req_len, sockptr_t resp,
size_t resp_len, u64 *tsm_code)
{
struct pci_dev *pdev = tdi->pdev;
/* reject the guest request if VMM was using the link tsm wrongly. The guest
* was using a wrong CC archiecture with this link tsm
*/
if (scope != TSM_REQ_TYPE_CCA)
return -EINVAL;
Jason Gunthorpe <jgg@ziepe.ca> writes:
> On Tue, May 26, 2026 at 11:17:50PM -0700, Dan Williams (nvidia) wrote:
>
>> In that case pci_tsm_req_scope becomes tsm_req_type and is just:
>>
>> TSM_REQ_TYPE_CCA
>> TSM_REQ_TYPE_SEV
>> TSM_REQ_TYPE_TDX
>>
>> I am leaning towards the latter at this point.
>
> Yeah, this sounds good. I would also include an common op field that
> can be decoded by the TSM driver based on the TYPE above, and the
> usual in/out message buffers.
We already have iommufd_vdevice_tsm_op_ioctl() to handle common
operations. Right now, it handles IOMMU_VDEVICE_TSM_BIND and
IOMMU_VDEVICE_TSM_UNBIND. I guess we should move TSM_REQ_SET_TDI_STATE
operations to that as well?
-aneesh
^ permalink raw reply
* Re: [PATCH v2 0/4] struct page to PFN conversion for TDX guest private memory
From: Sean Christopherson @ 2026-05-27 18:10 UTC (permalink / raw)
To: Sean Christopherson, dave.hansen, pbonzini, Yan Zhao
Cc: tglx, mingo, bp, kas, x86, linux-kernel, kvm, linux-coco,
kai.huang, rick.p.edgecombe, yilun.xu, vannapurve, ackerleytng,
sagis, binbin.wu, xiaoyao.li, isaku.yamahata
In-Reply-To: <20260430014852.24183-1-yan.y.zhao@intel.com>
On Thu, 30 Apr 2026 09:48:52 +0800, Yan Zhao wrote:
> This is v2 of the struct page to PFN conversion series, which converts TDX
> guest private memory mapping/unmapping APIs from taking struct page to
> taking PFN as input.
>
> v2 is based on v7.1.0-rc1 + Sean's 4 cleanup patches (see details in
> section "Base" below). The purpose is to get Dave's Ack, so Sean can take
> it from the KVM x86 tree. The full stack of v2 is available at [14].
>
> [...]
Applied to kvm-x86 mmu, thanks!
[1/4] x86/tdx: Use PFN directly for mapping guest private memory
https://github.com/kvm-x86/linux/commit/6ad0badd765c
[2/4] x86/tdx: Use PFN directly for unmapping guest private memory
https://github.com/kvm-x86/linux/commit/4c7a1247646c
[3/4] x86/tdx: Drop exported function tdx_quirk_reset_page()
https://github.com/kvm-x86/linux/commit/4a72a6dc447d
[4/4] x86/virt/tdx: Move mk_keyed_paddr() to tdx.c due to no external users
https://github.com/kvm-x86/linux/commit/3f330fbb918f
--
https://github.com/kvm-x86/linux/tree/next
^ permalink raw reply
* Re: [PATCH v2 0/5] guest_memfd fixes for bind and populate
From: Sean Christopherson @ 2026-05-27 18:19 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Kiryl Shutsemau, Rick Edgecombe, Vishal Annapurve, Yan Zhao,
Michael Roth, Isaku Yamahata, Chao Peng, Xiaoyao Li, Zongyao Chen,
Ackerley Tng
Cc: kvm, linux-kernel, linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <20260522-fix-sev-gmem-post-populate-v2-0-3f196bfad5a1@google.com>
On Fri, 22 May 2026 15:46:05 -0700, Ackerley Tng wrote:
> This series is a group of fixes for the bind and populate flows for
> guest_memfd, and fixes some issues reported by Sashiko after reviewing the
> guest_memfd in-place conversions series [1] and another fixup series Sean
> posted [3].
>
> Changes in v2:
>
> [...]
Applied 1, 4, and 5 to kvm-x86 sev, with massaged shortlogs+changelogs.
[1/5] KVM: SEV: Pin source page for write when adding CPUID data for SNP guest
https://github.com/kvm-x86/linux/commit/f13e90059908
[2/5] KVM: guest_memfd: Fix possible signed integer overflow
[SKIP]
[3/5] KVM: guest_memfd: Handle errors from xa_store_range() when binding
[SKIP]
[4/5] KVM: SEV: Unmap local kmaps in LIFO order, per highmem requirements
https://github.com/kvm-x86/linux/commit/138f5f9cbe37
[5/5] KVM: SEV: Mark source page dirty when writing back CPUID data on failure
https://github.com/kvm-x86/linux/commit/97cd21d57e9b
--
https://github.com/kvm-x86/linux/tree/next
^ permalink raw reply
* Re: [PATCH v2 2/5] KVM: guest_memfd: Fix possible signed integer overflow
From: Ackerley Tng @ 2026-05-27 18:26 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <ahXB_tBWakUUjt6q@google.com>
Sean Christopherson <seanjc@google.com> writes:
> For shortlogs (and changeloges), when possible, describe the _change_ itself, not
> its impact is. Sometimes "Fix xyz" is the best shortlog, e.g. when fixing build
> failures, but here, I would go with:
>
> KVM: guest_memfd: Treat memslot binding offset+size as unsigned values
>
> for two reasons. First, it provides a lot more context for future readers, versus
> "Fix possible signed integer overflow" which doesn't even capture what flow is
> affected, how the overflow is being fixed, etc. Second, if the fix is wrong,
> incomplete, etc., we don't end up with a follow-up patch that start with "Really
> fix ...".
>
Thanks for explaining!
> Oh, actually, three reasons. This doesn't only affect the overflow check. The
> check on a negative offset is flawed, as it means KVM would incorrectly reject
> bindings with (comically) large offsets.
>
Makes sense.
> LOL, four. There is no bug. The size of the memslot is ((1UL << 31) - 1)
> pages, i.e. 0x7FF_FFFFF000:
>
> if (id < KVM_USER_MEM_SLOTS &&
> (mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
> return -EINVAL;
>
> and so "loff_t size" can never be negative.
>
I think the bug was that the sum of offset + size in kvm_gmem_bind()
when interpreted as signed integers could be smaller than
i_size_read(inode) and allow binding.
So IIUC even if size is small (and not negative), nothing catches a
large enough offset where offset + size (interpreted as unsigned
integers) doesn't overflow, but offset + size (interpreted as signed
integers) overflows.
> As for the offset, the negative check is intentional, because KVM_CREATE_GUEST_MEMFD
> takes loff_t for the size, and so an offset that is negative would also be larger
> than the size of the file.
>
> I still think it's worth taking unsigned values, because teasing out all of that
> information wasn't exactly easy.
>
Yup it's still easier this way, and your proposed shortlog is good.
> On Fri, May 22, 2026, Ackerley Tng wrote:
>> From: Sean Christopherson <seanjc@google.com>
>>
>> The caller, kvm_set_memory_region(), checks for an overflow in an unsigned
>> u64 guest_memfd_offset. When guest_memfd_offset is passed to kvm_gmem_bind,
>> it is cast into a signed 64-bit integer.
>>
>> Hence, a large 64-bit offset could result in a negative loff_t, which could
>> result in the overflow checks failing.
>>
>> Make kvm_gmem_bind() take u64 instead of loff_t to consistently deal with
>> unsigned values to avoid this issue.
>>
>> Fixes: a7800aa80ea4d ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> [Use size_t for size instead of u64]
>
> Why? Oh, right, because kvm_memory_slot.npages is an "unsigned long". The
> discrepancy between a u64 for the offset and a size_t for the size is confusing,
> as they are both conceptually in the same "domain".
>
> Rather than u64 and size_t, we should use pgoff_t, which is what KVM already uses
> as the storage for kvm_memory_slot.gmem.pgoff.
>
I picked size_t more because I thought it was semantically correct to
use the size type for a size. size_t may have different sizes (64 vs
32), but in the comparison offset + size > i_size_read(inode), size is
promoted to 64 bits, and signed inode size is cast to unsigned for
comparison, so I think that works.
pgoff_t is also unsigned, but I think that should be reserved for page
offsets/indices? Is that the way to think when choosing types? What does
"domain" mean above?
> I'll send a new version as a standalone patch.
^ permalink raw reply
* Re: [PATCH v2 3/5] KVM: guest_memfd: Handle errors from xa_store_range() when binding
From: Ackerley Tng @ 2026-05-27 19:11 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <ahXMzPz1cQl6kZge@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Fri, May 22, 2026, Ackerley Tng wrote:
>> Unhandled errors from xa_store_range() means kvm_gmem_bind() might falsely
>> reporting success, leading to false assumptions in guest_memfd's lifecycle
>> later.
>>
>> On error, restore the unbound state and return the error to userspace.
>>
>> Fixes: a7800aa80ea4d ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>> virt/kvm/guest_memfd.c | 11 +++++++++--
>> 1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index d203135969d13..5b4911ffa208a 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -648,6 +648,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>> struct inode *inode;
>> struct file *file;
>> int r = -EINVAL;
>> + void *result;
>
> I would rather go with "xr". "result" is too generic, e.g. begs the question of
> "result of what?"
>
Good to go with xr too.
> Actually, I don't think we even need an intermediate variable.
>
>> BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
>>
>> @@ -688,7 +689,14 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>> if (kvm_gmem_supports_mmap(inode))
>> slot->flags |= KVM_MEMSLOT_GMEM_ONLY;
>>
>> - xa_store_range(&f->bindings, start, end - 1, slot, GFP_KERNEL);
>> + result = xa_store_range(&f->bindings, start, end - 1, slot, GFP_KERNEL);
>> + if (xa_is_err(result)) {
>> + r = xa_err(result);
>> + xa_store_range(&f->bindings, start, end - 1, NULL, GFP_KERNEL);
>
> I'm not convinced this is necessary. Sashiko "asked" the question:
>
> : If xa_store_range() fails midway through storing a large range (for example,
> : returning -ENOMEM), does it leave the already-processed entries in the
> : f->bindings XArray?
> :
> : When this error is propagated back, the caller __kvm_set_memory_region()
> : will abort the operation and free the memslot without calling
> : kvm_gmem_unbind().
> :
> : Since the partial XArray updates aren't rolled back here, could this leave
> : dangling pointers to the freed memslot in f->bindings? If so, when the file
> : is eventually closed, kvm_gmem_release() might iterate over these dangling
> : pointers and write to slot->gmem.file, resulting in a use-after-free.
>
> but I think Sashiko is hallicunating.
>
When I updated this I kind of just assumed xa_store_range() always
iterates indices (so for a range [0, 10], it would store 11 times), and
an earlier index could be set, and a later store could result in
-ENOMEM.
Since you called this out, I dug into it more.
> If @entry is non-NULL, xa_store_range() pre-creates the entire range, before
> storing anything into the range:
>
> if (entry) {
> unsigned int order = BITS_PER_LONG;
> if (last + 1)
> order = __ffs(last + 1);
> xas_set_order(&xas, last, order);
> xas_create(&xas, true);
> if (xas_error(&xas))
> goto unlock;
> }
>
xa_store_range() doesn't actually always iterate: if last + 1 is some
clean power of 2, it'll create a higher order xarray node.
Otherwise, it falls back to creating and storing 1 index/node at a time:
if the above did manage to create an xarray node, xas_error() returns
false, it goes on to the store below.
> Yes, the API handles failure on the subsequent xas_store(), but I can't imagine
> that failure is actually, barring garbage input from KVM:
>
> do {
> xas_set_range(&xas, first, last);
> xas_store(&xas, entry);
> if (xas_error(&xas))
> goto unlock;
> first += xas_size(&xas);
> } while (first <= last);
>
So if a later xas_create() fails because it runs out of memory, the
earlier stores would have already been committed.
This ignores -EEXIST being returned since earlier in kvm_gmem_bind()
conflicts were already checked.
> Purely from a design perspective, providing an API that can fail partway through
> under normal operation, with no indication of where failure occured (AFAICT),
> would be awful.
>
Do you mean the API of xas_store_range()? xas is updated by
xas_set_range() so that should track the last store. Since the cleanup
is storing NULLs and won't allocate, I thought it would be fine to just
store NULL on the entire range on error.
>> + } else {
>> + r = 0;
>> + }
>> +
>> filemap_invalidate_unlock(inode->i_mapping);
>>
>> /*
>>
>> [...snip...]
>>
^ permalink raw reply
* Re: [PATCH v2 5/5] KVM: SNP: Mark source page dirty in sev_gmem_post_populate
From: Ackerley Tng @ 2026-05-27 19:14 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <ahXOiy0dBSBOQli2@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Fri, May 22, 2026, Ackerley Tng wrote:
>> Mark the folio as dirty after copying data into the source page in
>> sev_gmem_post_populate. After the memcpy, failing to mark the page dirty
>> can lead to the memory management subsystem discarding the changes if the
>> page is reclaimed or otherwise processed by the swap subsystem.
>>
>> Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>> arch/x86/kvm/svm/sev.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index dbf75326a40f4..1a361f08c7a3d 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2395,6 +2395,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> void *dst_vaddr = kmap_local_pfn(pfn);
>>
>> memcpy(src_vaddr, dst_vaddr, PAGE_SIZE);
>> + folio_mark_dirty(page_folio(src_page));
>
> I'd rather use set_page_dirty(). I'll fixup when applying, unless someon objects.
>
I was looking for page, dirty but set_page_dirty() somehow escaped my
search. This works, thanks!
>> kunmap_local(dst_vaddr);
>> kunmap_local(src_vaddr);
>>
>> --
>> 2.54.0.794.g4f17f83d09-goog
>>
^ permalink raw reply
* Re: [PATCH v2 2/5] KVM: guest_memfd: Fix possible signed integer overflow
From: Sean Christopherson @ 2026-05-27 19:26 UTC (permalink / raw)
To: Ackerley Tng
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <CAEvNRgEN6syKVLjvFX-s=xU=r3CBZ3KtmeKvVYOC09uvFGXSFg@mail.gmail.com>
On Wed, May 27, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
> > For shortlogs (and changeloges), when possible, describe the _change_ itself, not
> > its impact is. Sometimes "Fix xyz" is the best shortlog, e.g. when fixing build
> > failures, but here, I would go with:
> >
> > KVM: guest_memfd: Treat memslot binding offset+size as unsigned values
> >
> > for two reasons. First, it provides a lot more context for future readers, versus
> > "Fix possible signed integer overflow" which doesn't even capture what flow is
> > affected, how the overflow is being fixed, etc. Second, if the fix is wrong,
> > incomplete, etc., we don't end up with a follow-up patch that start with "Really
> > fix ...".
> >
>
> Thanks for explaining!
>
> > Oh, actually, three reasons. This doesn't only affect the overflow check. The
> > check on a negative offset is flawed, as it means KVM would incorrectly reject
> > bindings with (comically) large offsets.
> >
>
> Makes sense.
>
> > LOL, four. There is no bug. The size of the memslot is ((1UL << 31) - 1)
> > pages, i.e. 0x7FF_FFFFF000:
> >
> > if (id < KVM_USER_MEM_SLOTS &&
> > (mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
> > return -EINVAL;
> >
> > and so "loff_t size" can never be negative.
>
> I think the bug was that the sum of offset + size in kvm_gmem_bind()
> when interpreted as signed integers could be smaller than
> i_size_read(inode) and allow binding.
>
> So IIUC even if size is small (and not negative), nothing catches a
> large enough offset where offset + size (interpreted as unsigned
> integers) doesn't overflow, but offset + size (interpreted as signed
> integers) overflows.
Oooh, duh, if @offset is positive, but @offset+size is negative. Yes, that's a
real bug, confirmed via selftest. I'll send a fix along with a selftest testcase.
Thanks much!
> >> Fixes: a7800aa80ea4d ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
> >> Signed-off-by: Sean Christopherson <seanjc@google.com>
> >> [Use size_t for size instead of u64]
> >
> > Why? Oh, right, because kvm_memory_slot.npages is an "unsigned long". The
> > discrepancy between a u64 for the offset and a size_t for the size is confusing,
> > as they are both conceptually in the same "domain".
> >
> > Rather than u64 and size_t, we should use pgoff_t, which is what KVM already uses
> > as the storage for kvm_memory_slot.gmem.pgoff.
> >
>
> I picked size_t more because I thought it was semantically correct to
> use the size type for a size. size_t may have different sizes (64 vs
> 32), but in the comparison offset + size > i_size_read(inode), size is
> promoted to 64 bits, and signed inode size is cast to unsigned for
> comparison, so I think that works.
>
> pgoff_t is also unsigned, but I think that should be reserved for page
> offsets/indices?
Just to avoid confusion over the definition of an offset/idnex:
* The type of an index into the pagecache.
I.e. it's not the 12-bit offset into a 4KiB page. Which I'm pretty sure you were
saying as well, just want to ensure we're on the same page.
I like pgoff_t more than size_t because, for KVM, it's really all about addressing
memory, thanks to the offset into guest_memfd being associated 1:1 with a GPA.
It's not perfect, because GPAs are tracked as 64-bit values, whereas the kernel
restricts itself to "unsigned long". But that's a non-issue in practice since
guest_memfd is 64-bit only.
But conceptually, I like tracking the gmem offset as a pgoff_t to tie it back
to using GPAs to offset/index into gmem. And for all intents and purposes, gmem
is nothing more than a glorified pagecache :-)
^ permalink raw reply
* Re: [PATCH v2 2/5] KVM: guest_memfd: Fix possible signed integer overflow
From: Ackerley Tng @ 2026-05-27 20:17 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <ahdFSxXnWhhsjhKe@google.com>
Sean Christopherson <seanjc@google.com> writes:
>
> [...snip...]
>
>> >> [Use size_t for size instead of u64]
>> >
>> > Why? Oh, right, because kvm_memory_slot.npages is an "unsigned long". The
>> > discrepancy between a u64 for the offset and a size_t for the size is confusing,
>> > as they are both conceptually in the same "domain".
>> >
>> > Rather than u64 and size_t, we should use pgoff_t, which is what KVM already uses
>> > as the storage for kvm_memory_slot.gmem.pgoff.
>> >
>>
>> I picked size_t more because I thought it was semantically correct to
>> use the size type for a size. size_t may have different sizes (64 vs
>> 32), but in the comparison offset + size > i_size_read(inode), size is
>> promoted to 64 bits, and signed inode size is cast to unsigned for
>> comparison, so I think that works.
>>
>> pgoff_t is also unsigned, but I think that should be reserved for page
>> offsets/indices?
>
> Just to avoid confusion over the definition of an offset/idnex:
>
> * The type of an index into the pagecache.
>
> I.e. it's not the 12-bit offset into a 4KiB page. Which I'm pretty sure you were
> saying as well, just want to ensure we're on the same page.
>
Wait yes I meant index as in the key if you think of the xarray of the
page cache as a key-value store.
> I like pgoff_t more than size_t because, for KVM, it's really all about addressing
> memory, thanks to the offset into guest_memfd being associated 1:1 with a GPA.
The offset into guest_memfd is associated 1:1 with a GPA, and this
offset is
offset = index << PAGE_SHIFT
> It's not perfect, because GPAs are tracked as 64-bit values, whereas the kernel
> restricts itself to "unsigned long". But that's a non-issue in practice since
> guest_memfd is 64-bit only.
>
> But conceptually, I like tracking the gmem offset as a pgoff_t to tie it back
> to using GPAs to offset/index into gmem. And for all intents and purposes, gmem
> is nothing more than a glorified pagecache :-)
So we actually want to use u64s for gmem offsets (where offset = index
<< PAGE_SHIFT), and pgoff_t for indices, since indices (aka page
offsets) are semantically the offset, counted in units of pages?
I pulled this conclusion together from filemap-related code like
filemap_add_folio() takes a pgoff_t index, so I thought gmem should
follow that and stick with pgoff_t for index/indices.
^ permalink raw reply
* Re: [PATCH v5 1/7] x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
From: Borislav Petkov @ 2026-05-27 20:17 UTC (permalink / raw)
To: Ashish Kalra
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <305b625c0528f16a95542001c66e643fbd3a2622.1779133590.git.ashish.kalra@amd.com>
On Mon, May 18, 2026 at 09:41:53PM +0000, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
>
> Add a flag indicating whether RMPOPT instruction is supported.
>
> RMPOPT is a new instruction designed to minimize the performance
> overhead of RMP checks on the hypervisor and on non-SNP guests by
> allowing RMP checks to be skipped when 1G regions of memory are known
> not to contain any SEV-SNP guest memory.
Streamline:
"RMPOPT is a new instruction that reduces the performance overhead of RMP
checks for the hypervisor and non‑SNP guests by allowing those checks to be
skipped when 1‑GB memory regions are known to contain no SEV‑SNP guest
memory."
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Borislav Petkov @ 2026-05-27 21:06 UTC (permalink / raw)
To: Ashish Kalra
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <c9fe5c2fef063f5006cc9bfa03eec824ac015db7.1779133590.git.ashish.kalra@amd.com>
On Mon, May 18, 2026 at 09:42:15PM +0000, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
>
> The existing wrmsr_on_cpus() takes a per-cpu struct msr array, requiring
> callers to allocate and populate per-cpu storage even when every CPU
> receives the same value. This is unnecessary overhead for the common
> case of writing a single uniform u64 to a per-CPU MSR across multiple
> CPUs.
>
> Add wrmsrq_on_cpus() which writes the same u64 value to the specified
> MSR on all CPUs in the given cpumask.
So let's add yet another function which name differs from the other one by
a single letter and have people go look at the implementation to know which is
which...?
Instead of unifying what's there and extending this one to do what you want it
to do?
And now you have a wrmsrQ_on_cpus() but no rdmsrQ_on_cpus()?
Because if you look at the code, you'll see how those are used: first you
rdmsr on CPUs, modify each value and then wrmsr on same CPUs.
So no, try again pls.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Dave Hansen @ 2026-05-27 21:38 UTC (permalink / raw)
To: Borislav Petkov, Ashish Kalra
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <20260527210603.GCahdcu8zvVjfKfGEL@fat_crate.local>
[-- Attachment #1: Type: text/plain, Size: 2757 bytes --]
On 5/27/26 14:06, Borislav Petkov wrote:
> On Mon, May 18, 2026 at 09:42:15PM +0000, Ashish Kalra wrote:
>> From: Ashish Kalra <ashish.kalra@amd.com>
>>
>> The existing wrmsr_on_cpus() takes a per-cpu struct msr array, requiring
>> callers to allocate and populate per-cpu storage even when every CPU
>> receives the same value. This is unnecessary overhead for the common
>> case of writing a single uniform u64 to a per-CPU MSR across multiple
>> CPUs.
>>
>> Add wrmsrq_on_cpus() which writes the same u64 value to the specified
>> MSR on all CPUs in the given cpumask.
>
> So let's add yet another function which name differs from the other one by
> a single letter and have people go look at the implementation to know which is
> which...?
>
> Instead of unifying what's there and extending this one to do what you want it
> to do?
>
> And now you have a wrmsrQ_on_cpus() but no rdmsrQ_on_cpus()?
>
> Because if you look at the code, you'll see how those are used: first you
> rdmsr on CPUs, modify each value and then wrmsr on same CPUs.
This one is my doing.
wrmsr_on_cpus() is kinda a mess. I think it only has a single user. It's
also not very flexible because it needs a 'struct msr __percpu *msrs'
argument where each MSR has a value in memory.
The use case for RMPOPT is that all CPUs get the same value. It'd be a
little awkward to go create a percpu data structure to duplcate the same
value to call wrmsr_on_cpus(). The RMPOPT case is also arguably
performance sensitive since it's done during boot. It should do the IPIs
in parallel.
toggle_ecc_err_reporting(), on the other hand, is done at module init
time. It's not really performance sensitive. It's probably pretty easy
to zap wrmsr_on_cpus() and just have toggle_ecc_err_reporting() do
something slightly less efficient.
Yeah, the
wrmsr_on_cpus()
wrmsrq_on_cpus()
naming pain is real. There's little chance of bugs coming from it
because the function signatures are *SO* different. But, it certainly
could confuse humans for a minute.
But the real solution to this is axing wrmsr_on_cpus(). Which I think we
could do after killing its one user which the attached (completely
untested) patch does. The only downside of the patch is that it does
RDMSR via IPIs one CPU at a time. But, looking at the code, I'm not sure
anyone would care. If anyone did, I _think_ all those MSRs have the same
value and the code could be simplified further. But that would take more
than 3 minutes.
It's also possible that my grepping was bad or I'm completely
misunderstanding amd64_edac.c. Cluebat welcome if I'm being dense.
BTW, I also don't feel the need to make Ashish go do any of this edac
cleanup. I think it can just be done in parallel. But I wouldn't stop
him if he volunteered.
[-- Attachment #2: axe-wrmsr_on_cpus.patch --]
[-- Type: text/x-patch, Size: 2609 bytes --]
---
b/drivers/edac/amd64_edac.c | 35 +++++++++++------------------------
1 file changed, 11 insertions(+), 24 deletions(-)
diff -puN drivers/edac/amd64_edac.c~axe-wrmsr_on_cpus drivers/edac/amd64_edac.c
--- a/drivers/edac/amd64_edac.c~axe-wrmsr_on_cpus 2026-05-27 14:25:07.881694152 -0700
+++ b/drivers/edac/amd64_edac.c 2026-05-27 14:33:03.278139821 -0700
@@ -14,8 +14,6 @@ static struct edac_pci_ctl_info *pci_ctl
static int ecc_enable_override;
module_param(ecc_enable_override, int, 0644);
-static struct msr __percpu *msrs;
-
static inline u32 get_umc_reg(struct amd64_pvt *pvt, u32 reg)
{
if (!pvt->flags.zn_regs_v2)
@@ -3215,14 +3213,14 @@ static bool nb_mce_bank_enabled_on_node(
get_cpus_on_this_dct_cpumask(mask, nid);
- rdmsr_on_cpus(mask, MSR_IA32_MCG_CTL, msrs);
-
for_each_cpu(cpu, mask) {
- struct msr *reg = per_cpu_ptr(msrs, cpu);
- nbe = reg->l & MSR_MCGCTL_NBE;
+ u64 mcg_ctl;
+
+ rdmsrq_on_cpu(cpu, MSR_IA32_MCG_CTL, &mcg_ctl);
+ nbe = mcg_ctl & MSR_MCGCTL_NBE;
edac_dbg(0, "core: %u, MCG_CTL: 0x%llx, NB MSR is %s\n",
- cpu, reg->q, str_enabled_disabled(nbe));
+ cpu, mcg_ctl, str_enabled_disabled(nbe));
if (!nbe)
goto out;
@@ -3246,26 +3244,25 @@ static int toggle_ecc_err_reporting(stru
get_cpus_on_this_dct_cpumask(cmask, nid);
- rdmsr_on_cpus(cmask, MSR_IA32_MCG_CTL, msrs);
-
for_each_cpu(cpu, cmask) {
+ u64 mcg_ctl;
- struct msr *reg = per_cpu_ptr(msrs, cpu);
+ rdmsrq_on_cpu(cpu, MSR_IA32_MCG_CTL, &mcg_ctl);
if (on) {
- if (reg->l & MSR_MCGCTL_NBE)
+ if (mcg_ctl & MSR_MCGCTL_NBE)
s->flags.nb_mce_enable = 1;
- reg->l |= MSR_MCGCTL_NBE;
+ mcg_ctl |= MSR_MCGCTL_NBE;
} else {
/*
* Turn off NB MCE reporting only when it was off before
*/
if (!s->flags.nb_mce_enable)
- reg->l &= ~MSR_MCGCTL_NBE;
+ mcg_ctl &= ~MSR_MCGCTL_NBE;
}
+ wrmsrq_on_cpu(cpu, MSR_IA32_MCG_CTL, mcg_ctl);
}
- wrmsr_on_cpus(cmask, MSR_IA32_MCG_CTL, msrs);
free_cpumask_var(cmask);
@@ -4153,10 +4150,6 @@ static int __init amd64_edac_init(void)
if (!ecc_stngs)
goto err_free;
- msrs = msrs_alloc();
- if (!msrs)
- goto err_free;
-
for (i = 0; i < amd_nb_num(); i++) {
err = probe_one_instance(i);
if (err) {
@@ -4190,9 +4183,6 @@ static int __init amd64_edac_init(void)
err_pci:
pci_ctl_dev = NULL;
- msrs_free(msrs);
- msrs = NULL;
-
err_free:
kfree(ecc_stngs);
ecc_stngs = NULL;
@@ -4220,9 +4210,6 @@ static void __exit amd64_edac_exit(void)
ecc_stngs = NULL;
pci_ctl_dev = NULL;
-
- msrs_free(msrs);
- msrs = NULL;
}
module_init(amd64_edac_init);
_
^ permalink raw reply
* Re: [PATCH v2 2/5] KVM: guest_memfd: Fix possible signed integer overflow
From: Sean Christopherson @ 2026-05-27 22:08 UTC (permalink / raw)
To: Ackerley Tng
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <CAEvNRgEW8nph=Hz-3pcQD7ZxRuAOcjcOfj8afgJMsRgESD-Wbw@mail.gmail.com>
On Wed, May 27, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > I like pgoff_t more than size_t because, for KVM, it's really all about addressing
> > memory, thanks to the offset into guest_memfd being associated 1:1 with a GPA.
>
> The offset into guest_memfd is associated 1:1 with a GPA, and this
> offset is
>
> offset = index << PAGE_SHIFT
>
> > It's not perfect, because GPAs are tracked as 64-bit values, whereas the kernel
> > restricts itself to "unsigned long". But that's a non-issue in practice since
> > guest_memfd is 64-bit only.
> >
> > But conceptually, I like tracking the gmem offset as a pgoff_t to tie it back
> > to using GPAs to offset/index into gmem. And for all intents and purposes, gmem
> > is nothing more than a glorified pagecache :-)
>
> So we actually want to use u64s for gmem offsets (where offset = index
> << PAGE_SHIFT),
Hrm, poking around, I guess what we really should use for the byte offset is
uoff_t. My only hesitation to using uoff_t was that it's hardly used anywhere,
but it does seem to fix exactly what we're trying to do.
I don't want to use a raw u64, because I dislike using u{8,16,32,64} (in KVM)
unless something absolutely _must_ be that size (and ideally _exactly_ that size).
Limiting use of raw uNN helps identify fields/variables that correspond to some
hardware asset, versus fields/variables that just need to be "big enough". It's
not a 100% comprehensive rule of anything, e.g. there are still many "naturally
sized" hardware assets that need to be tracked with "unsigned long", but I still
find it helpful/valuable to highlight hardware-derived fields/variables.
> and pgoff_t for indices, since indices (aka page
> offsets) are semantically the offset, counted in units of pages?
Yeah, I agree the distinction will help us differentiate between byte offsets
and pfn offsets, especially with another compile-time assert to show the
relationship:
BUILD_BUG_ON(sizeof(gpa_t) != sizeof(offset));
BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
> I pulled this conclusion together from filemap-related code like
> filemap_add_folio() takes a pgoff_t index, so I thought gmem should
> follow that and stick with pgoff_t for index/indices.
^ permalink raw reply
* [PATCH] MAINTAINERS: Move Rick Edgecombe to TDX maintainer
From: Rick Edgecombe @ 2026-05-27 22:13 UTC (permalink / raw)
To: kas, dave.hansen, x86, linux-coco, kvm, pbonzini, seanjc; +Cc: Rick Edgecombe
Per some offline discussion with Kiryl, he could use some help on the TDX
host side. I have worked on the TDX host side for the past few years
including wrangling the initial KVM support, and can help with this.
I am already listed as TDX reviewer. Move it to maintainer.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kiryl Shutsemau <kas@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
---
MAINTAINERS | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index 882214b0e7db5..a838ff047d891 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -28943,8 +28943,8 @@ F: arch/x86/kernel/unwind_*.c
X86 TRUST DOMAIN EXTENSIONS (TDX)
M: Kiryl Shutsemau <kas@kernel.org>
+M: Rick Edgecombe <rick.p.edgecombe@intel.com>
R: Dave Hansen <dave.hansen@linux.intel.com>
-R: Rick Edgecombe <rick.p.edgecombe@intel.com>
L: x86@kernel.org
L: linux-coco@lists.linux.dev
L: kvm@vger.kernel.org
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v2 3/5] KVM: guest_memfd: Handle errors from xa_store_range() when binding
From: Sean Christopherson @ 2026-05-27 22:43 UTC (permalink / raw)
To: Ackerley Tng
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Kiryl Shutsemau, Rick Edgecombe,
Vishal Annapurve, Yan Zhao, Michael Roth, Isaku Yamahata,
Chao Peng, Xiaoyao Li, Zongyao Chen, kvm, linux-kernel,
linux-coco, Yu Zhang, Fuad Tabba
In-Reply-To: <CAEvNRgHnxEO+jXqEKuewiDgFBDQDSr68A8MnJJSyf9K5AgbwWg@mail.gmail.com>
On Wed, May 27, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > On Fri, May 22, 2026, Ackerley Tng wrote:
> > If @entry is non-NULL, xa_store_range() pre-creates the entire range, before
> > storing anything into the range:
> >
> > if (entry) {
> > unsigned int order = BITS_PER_LONG;
> > if (last + 1)
> > order = __ffs(last + 1);
> > xas_set_order(&xas, last, order);
> > xas_create(&xas, true);
> > if (xas_error(&xas))
> > goto unlock;
> > }
> >
>
> xa_store_range() doesn't actually always iterate: if last + 1 is some
> clean power of 2, it'll create a higher order xarray node.
>
> Otherwise, it falls back to creating and storing 1 index/node at a time:
Ugh, _that's_ what the code is doing? Argh, I missed that "first" is incremented
by whatever the batch size happened to be.
first += xas_size(&xas); <====
} while (first <= last);
> if the above did manage to create an xarray node, xas_error() returns
> false, it goes on to the store below.
>
> > Yes, the API handles failure on the subsequent xas_store(), but I can't imagine
> > that failure is actually, barring garbage input from KVM:
> >
> > do {
> > xas_set_range(&xas, first, last);
> > xas_store(&xas, entry);
> > if (xas_error(&xas))
> > goto unlock;
> > first += xas_size(&xas);
> > } while (first <= last);
> >
>
> So if a later xas_create() fails because it runs out of memory, the
> earlier stores would have already been committed.
>
> This ignores -EEXIST being returned since earlier in kvm_gmem_bind()
> conflicts were already checked.
>
> > Purely from a design perspective, providing an API that can fail partway through
> > under normal operation, with no indication of where failure occured (AFAICT),
> > would be awful.
> >
>
> Do you mean the API of xas_store_range()?
No, I mean xa_store_range(). AFAICT, on failure, it doesn't actually communicate
"where" failure occurred. That's quite nasty.
> xas is updated by xas_set_range() so that should track the last store. Since
> the cleanup is storing NULLs and won't allocate, I thought it would be fine
> to just store NULL on the entire range on error.
Yeah, it's totally fine, and AFAICT the only remotely sane approach.
^ permalink raw reply
* Re: [PATCH v5 5/5] iommufd/vdevice: add TSM request ioctl
From: Dan Williams (nvidia) @ 2026-05-27 22:49 UTC (permalink / raw)
To: Aneesh Kumar K.V, Dan Williams (nvidia), Alexey Kardashevskiy,
linux-coco, iommu, linux-kernel, kvm
Cc: Bjorn Helgaas, Dan Williams, Jason Gunthorpe, Joerg Roedel,
Jonathan Cameron, Kevin Tian, Nicolin Chen, Samuel Ortiz,
Steven Price, Suzuki K Poulose, Will Deacon, Xu Yilun,
Shameer Kolothum, Paolo Bonzini, Tony Krowiak, Halil Pasic,
Jason Herne, Harald Freudenberger, Holger Dengler, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Alex Williamson, Matthew Rosato, Farhan Ali,
Eric Farman, linux-s390
In-Reply-To: <yq5aldd4spyc.fsf@kernel.org>
Aneesh Kumar K.V wrote:
> >> I am leaning towards the latter at this point.
> >
> > But we already have struct pci_tsm_ops::guest_req, which is specific to
> > the underlying CC architecture. From the above, pci_tsm_req_scope also
> > appears to carry the same information. Is that useful?
> >
>
> I think there is value in having the VMM express the guest’s
> confidential computing architecture, so that the TSM backend can
> validate whether it should handle that guest request ?.
Yes, that is the idea.
> So it would not be the IOMMU validating the scope value, but rather
> pci_tsm_ops::guest_req.
>
> static ssize_t cca_tsm_guest_req(struct pci_tdi *tdi, enum pci_tsm_req_scope scope,
> sockptr_t req, size_t req_len, sockptr_t resp,
> size_t resp_len, u64 *tsm_code)
> {
> struct pci_dev *pdev = tdi->pdev;
>
> /* reject the guest request if VMM was using the link tsm wrongly. The guest
> * was using a wrong CC archiecture with this link tsm
> */
> if (scope != TSM_REQ_TYPE_CCA)
> return -EINVAL;
Right, iommufd is tunneling TSM requests. The tunnel should have an
envelope of TSM_REQ_TYPE_* and an @op field. The TSM driver gets those
from iommufd, validates the envelope and then processes @req.
This self-consistency and explicitness also buys some future-proofing.
It allows for alternate command sets within an arch, cross TSM
implementation shared commands, IOMMUFD-to-TSM requests outside of guest
requests.
> Jason Gunthorpe <jgg@ziepe.ca> writes:
>
> > On Tue, May 26, 2026 at 11:17:50PM -0700, Dan Williams (nvidia) wrote:
> >
> >> In that case pci_tsm_req_scope becomes tsm_req_type and is just:
> >>
> >> TSM_REQ_TYPE_CCA
> >> TSM_REQ_TYPE_SEV
> >> TSM_REQ_TYPE_TDX
> >>
> >> I am leaning towards the latter at this point.
> >
> > Yeah, this sounds good. I would also include an common op field that
> > can be decoded by the TSM driver based on the TYPE above, and the
> > usual in/out message buffers.
>
> We already have iommufd_vdevice_tsm_op_ioctl() to handle common
> operations.
Per above, I believe this is about an @op value in a common location
that iommufd can forward to the backend for validation of guest
requests.
> Right now, it handles IOMMU_VDEVICE_TSM_BIND and
> IOMMU_VDEVICE_TSM_UNBIND. I guess we should move TSM_REQ_SET_TDI_STATE
> operations to that as well?
I think we can wait to move it to its own IOMMU operation unless/until
there is a need to set RUN outside of an explicit guest request, right?
^ permalink raw reply
* Re: [PATCH v3 07/41] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary
From: Wei Liu @ 2026-05-27 22:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-8-seanjc@google.com>
On Fri, May 15, 2026 at 12:19:08PM -0700, Sean Christopherson wrote:
> Register the Hyper-V reference counter (refcounter) callbacks for saving
> and restoring its PV sched_clock, if and only if the refcounter is
> actually being used for sched_clock. Currently, Hyper-V overrides the
> save/restore hooks if the reference TSC available, whereas the Hyper-V
> refcounter code only overrides sched_clock if the reference TSC is
> available *and* it's not invariant. The flaw is effectively papered over
> by invoking the "old" save/restore callbacks as part of save/restore, but
> that's unnecessary and fragile.
>
> To avoid introducing more complexity, and to allow for additional cleanups
> of the PV sched_clock code, move the save/restore hooks and logic into
> hyperv_timer.c and simply wire up the hooks when overriding sched_clock
> itself.
>
> Note, while the Hyper-V refcounter code is intended to be architecture
> neutral, CONFIG_PARAVIRT is firmly x86-only, i.e. adding a small amount of
> x86 specific code (which will be reduced in future cleanups) doesn't
> meaningfully pollute generic code.
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
^ permalink raw reply
* Re: [PATCH v3 08/41] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers
From: Wei Liu @ 2026-05-27 22:50 UTC (permalink / raw)
To: Sean Christopherson
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-9-seanjc@google.com>
On Fri, May 15, 2026 at 12:19:09PM -0700, Sean Christopherson wrote:
> Now that all of the Hyper-V reference counter sched_clock code is located
> in a single file, drop the superfluous wrappers for the save/restore flows.
>
> No functional change intended.
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
^ permalink raw reply
* Re: [PATCH v3 09/41] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock
From: Wei Liu @ 2026-05-27 22:50 UTC (permalink / raw)
To: Sean Christopherson
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-10-seanjc@google.com>
On Fri, May 15, 2026 at 12:19:10PM -0700, Sean Christopherson wrote:
> Now that Hyper-V overrides the sched_clock save/restore hooks if and only
> sched_clock itself is set to the Hyper-V reference counter, drop the
> invocation of the "old" save/restore callbacks. When the registration of
> the PV sched_clock was done separately from overriding the save/restore
> hooks, it was possible for Hyper-V to clobber the TSC save/restore
> callbacks without actually switching to the Hyper-V refcounter.
>
> Enabling a PV sched_clock is a one-way street, i.e. the kernel will never
> revert to using TSC for sched_clock, and so there is no need to invoke the
> TSC save/restore hooks (and if there was, it belongs in common PV code).
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
^ permalink raw reply
* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Borislav Petkov @ 2026-05-28 0:43 UTC (permalink / raw)
To: Dave Hansen
Cc: Ashish Kalra, tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <eea0497f-6930-43e3-947d-dae139e657ad@intel.com>
On Wed, May 27, 2026 at 02:38:05PM -0700, Dave Hansen wrote:
> This one is my doing.
I know.
But hey, maybe we should not disagree on the public ML because the submitter
might disappear like the last one. :-P
> wrmsr_on_cpus() is kinda a mess. I think it only has a single user. It's
> also not very flexible because it needs a 'struct msr __percpu *msrs'
> argument where each MSR has a value in memory.
Right, we did that a looong time ago.
The only reason I'd have for per-CPU MSR structs is reading different MSR
values on different cores, modifying only the bits you need and then *keeping*
the remaining values as they were. And that interface allows you to do that
while this new thing won't.
And I'm going to venture a guess here that adding a simpler interface which
simply forces a new value ontop of a whole MSR could cause a lot of subtle
bugs when people don't pay attention to keep the old values.
> The use case for RMPOPT is that all CPUs get the same value. It'd be a
> little awkward to go create a percpu data structure to duplcate the same
> value to call wrmsr_on_cpus(). The RMPOPT case is also arguably
> performance sensitive since it's done during boot. It should do the IPIs
> in parallel.
Oh sure, my meaning was to create something that serves both purposes.
> toggle_ecc_err_reporting(), on the other hand, is done at module init
> time. It's not really performance sensitive. It's probably pretty easy
> to zap wrmsr_on_cpus() and just have toggle_ecc_err_reporting() do
> something slightly less efficient.
Sure. That's fine.
> Yeah, the
>
> wrmsr_on_cpus()
> wrmsrq_on_cpus()
>
> naming pain is real. There's little chance of bugs coming from it
> because the function signatures are *SO* different. But, it certainly
> could confuse humans for a minute.
Yap.
> But the real solution to this is axing wrmsr_on_cpus().
Yap, for example. Basically reingeneering the whole
write-MSRs-on-multiple-CPUs functionality is what I meant.
> Which I think we could do after killing its one user which the attached
> (completely untested) patch does. The only downside of the patch is that it
> does RDMSR via IPIs one CPU at a time. But, looking at the code, I'm not
> sure anyone would care. If anyone did, I _think_ all those MSRs have the
> same value and the code could be simplified further. But that would take
> more than 3 minutes.
>
> It's also possible that my grepping was bad or I'm completely
> misunderstanding amd64_edac.c. Cluebat welcome if I'm being dense.
Looks ok to me, we can surely do that. I even hw to test it. I think...
> BTW, I also don't feel the need to make Ashish go do any of this edac
> cleanup. I think it can just be done in parallel. But I wouldn't stop
> him if he volunteered.
Why not?
It has always been the case: cleanups and bug fixes first, new features ontop.
So yeah, modulo figuring out how to redefine the *msr_on_cpus() interface,
I think this all makes sense.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox