* [GIT PULL] KVM changes for Linux 6.14
@ 2025-01-24 16:37 Paolo Bonzini
From: Paolo Bonzini @ 2025-01-24 16:37 UTC
To: torvalds; +Cc: linux-kernel, kvm
Linus,
The following changes since commit 5bc55a333a2f7316b58edc7573e8e893f7acb532:
Linux 6.13-rc7 (2025-01-12 14:37:56 -0800)
are available in the Git repository at:
https://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus
for you to fetch changes up to 931656b9e2ff7029aee0b36e17780621948a6ac1:
kvm: defer huge page recovery vhost task to later (2025-01-24 10:53:56 -0500)
There is a conflict in arch/x86/kvm/cpuid.c that is nasty to describe
because the affected area has been completely rewritten, but is really a
one-liner. The change to be reproduced is commit 716f86b523d8 ("KVM:
x86: Advertise SRSO_USER_KERNEL_NO to userspace"):
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ae0b438a2c99..f7e222953cab 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -821,7 +821,7 @@ void kvm_set_cpu_caps(void)
kvm_cpu_cap_mask(CPUID_8000_0021_EAX,
F(NO_NESTED_DATA_BP) | F(LFENCE_RDTSC) | 0 /* SmmPgCfgLock */ |
F(NULL_SEL_CLR_BASE) | F(AUTOIBRS) | 0 /* PrefetchCtlMsr */ |
- F(WRMSR_XX_BASE_NS)
+ F(WRMSR_XX_BASE_NS) | F(SRSO_USER_KERNEL_NO)
);
kvm_cpu_cap_check_and_set(X86_FEATURE_SBPB);
but you can throw away the <<<< ... ==== part completely, and apply the
same change on top of the new implementation:
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index edef30359c19..9f9a29be3beb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1177,6 +1177,7 @@ void kvm_set_cpu_caps(void)
EMULATED_F(NO_SMM_CTL_MSR),
/* PrefetchCtlMsr */
F(WRMSR_XX_BASE_NS),
+ F(SRSO_USER_KERNEL_NO),
SYNTHESIZED_F(SBPB),
SYNTHESIZED_F(IBPB_BRTYPE),
SYNTHESIZED_F(SRSO_NO),
I cannot blame Boris at all for including the change to cpuid.c, since
this file has never been much of a source of conflicts (and is not
expected to become one in the future).
Thanks,
Paolo
----------------------------------------------------------------
Loongarch:
* Clear LLBCTL if secondary mmu mapping changes.
* Add hypercall service support for usermode VMM.
x86:
* Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a
direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
* Ensure that all SEV code is compiled out when disabled in Kconfig, even
if building with less brilliant compilers.
* Remove a redundant TLB flush on AMD processors when guest CR4.PGE changes.
* Use str_enabled_disabled() to replace open-coded strings.
* Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache
prior to every VM-Enter.
* Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
instead of just those where KVM needs to manage state and/or explicitly
enable the feature in hardware. Along the way, refactor the code to make
it easier to add features, and to make it more self-documenting how KVM
is handling each feature.
* Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
where KVM unintentionally puts the vCPU into infinite loops in some scenarios
(e.g. if emulation is triggered by the exit), and brings parity between VMX
and SVM.
* Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
* Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU, due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
* Make the completion of hypercalls go through the complete_hypercall
function pointer argument, regardless of whether the hypercall exits to
userspace. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
specifically went to userspace, and all the others did not; the new code
need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not.
* As part of enabling TDX virtual machines, support separation of
private/shared EPT into separate roots. Once TDX is enabled, operations
on private pages will need to go through the privileged TDX Module via SEAMCALLs;
as a result, they are limited and relatively slow compared to reading a PTE.
The patches included in 6.14 allow KVM to keep a mirror of the private EPT in
host memory, and define entries in kvm_x86_ops to operate on external page
tables such as the TDX private EPT.
* The recently introduced conversion of the NX-page reclamation kthread to
vhost_task moved the task under the main process. The task is created as
soon as KVM_CREATE_VM is invoked and this, of course, broke userspace that
didn't expect to see any child task of the VM process until it started
creating its own userspace threads. In particular, crosvm refuses to fork()
if procfs shows any child task, so unbreak it by creating the task lazily.
This is arguably a userspace bug, as there can be other kinds of legitimate
worker tasks and they wouldn't impede fork(); but it's not like userspace
has a way to distinguish kernel worker tasks right now. Should they show
as "Kthread: 1" in /proc/.../status?
x86 - Intel:
* Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit
while L2 is active, which ultimately results in a hardware-accelerated L1
EOI effectively being lost.
* Honor event priority when emulating Posted Interrupt delivery during nested
VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the
interrupt.
* Rework KVM's processing of the Page-Modification Logging buffer to reap
entries in the same order they were created, i.e. to mark gfns dirty in the
same order that hardware marked the page/PTE dirty.
* Misc cleanups.
Generic:
* Cleanup and harden kvm_set_memory_region(); add proper lockdep assertions when
setting memory regions and add a dedicated API for setting KVM-internal
memory regions. The API can then explicitly disallow all flags for
KVM-internal memory regions.
* Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug
where KVM would return a pointer to a vCPU prior to it being fully online,
and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.
* Wait for a vCPU to come online prior to executing a vCPU ioctl, to fix a
bug where userspace could coerce KVM into handling the ioctl on a vCPU that
isn't yet online.
* Gracefully handle xarray insertion failures; such failures are impossible
in practice after xa_reserve(), but reserving an entry is always followed
by an xa_store() that does not know (or care) whether an xa_reserve()
preceded it.
RISC-V:
* Zabha, Svvptc, and Ziccrse extension support for guests. None of them
require anything in KVM except for detecting them and marking them
as supported; Zabha adds byte and halfword atomic operations, while the
others are markers for specific behavior of the TLB and of LL/SC
instructions, respectively.
* Virtualize SBI system suspend extension for Guest/VM
* Support firmware counters that can be used by guests to collect
statistics about traps that occur in the host.
Selftests:
* Rework vcpu_get_reg() to return a value instead of using an out-param, and
update all affected arch code accordingly.
* Convert the max_guest_memory_test into a more generic mmu_stress_test.
The gist of the "conversion" is to have the test do mprotect() on
guest memory while vCPUs are accessing said memory, e.g. to verify KVM
and mmu_notifiers are working as intended.
* Play nice with treewide builds of unsupported architectures, e.g. arm
(32-bit), as KVM selftests' Makefile doesn't do anything to ensure the
target architecture is actually one KVM selftests supports.
* Use the kernel's $(ARCH) definition instead of the target triple for arch
specific directories, e.g. arm64 instead of aarch64, mainly so as not to
be different from the rest of the kernel.
* Ensure that format strings for logging statements are checked by the
compiler even when the logging statement itself is disabled.
* Attempt to whack the last LLC references/misses mole in the Intel PMU
counters test by adding a data load and doing CLFLUSH{OPT} on the data
instead of the code being executed. It seems that modern Intel CPUs
have learned new code prefetching tricks that bypass the PMU counters.
* Fix a flaw in the Intel PMU counters test where it asserts that events
are counting correctly without actually knowing what the events count
given the underlying hardware; this can happen if Intel reuses a
formerly microarchitecture-specific event encoding as an architectural
event, as was the case for Top-Down Slots.
----------------------------------------------------------------
Adrian Hunter (1):
KVM: VMX: Allow toggling bits in MSR_IA32_RTIT_CTL when enable bit is cleared
Andrew Jones (2):
RISC-V: KVM: Add SBI system suspend support
KVM: riscv: selftests: Add SBI SUSP to get-reg-list test
Atish Patra (2):
RISC-V: KVM: Update firmware counters for various events
RISC-V: KVM: Add new exit statstics for redirected traps
Bibo Mao (2):
LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
LoongArch: KVM: Add hypercall service support for usermode VMM
Binbin Wu (1):
KVM: x86: Add a helper to check for user interception of KVM hypercalls
Chao Gao (2):
KVM: nVMX: Defer SVI update to vmcs01 on EOI when L2 is active w/o VID
KVM: x86: Remove hwapic_irr_update() from kvm_x86_ops
Costas Argyris (1):
KVM: VMX: Reinstate __exit attribute for vmx_exit()
Gao Shiyuan (1):
KVM: VMX: Fix comment of handle_vmx_instruction()
Isaku Yamahata (12):
KVM: Add member to struct kvm_gfn_range to indicate private/shared
KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page
KVM: x86/mmu: Add an is_mirror member for union kvm_mmu_page_role
KVM: x86/tdp_mmu: Take struct kvm in iter loops
KVM: x86/mmu: Support GFN direct bits
KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root()
KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table type
KVM: x86/tdp_mmu: Take root in tdp_mmu_for_each_pte()
KVM: x86/tdp_mmu: Support mirror root for TDP MMU
KVM: x86/tdp_mmu: Propagate building mirror page tables
KVM: x86/tdp_mmu: Propagate tearing down mirror page tables
KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()
Ivan Orlov (7):
KVM: x86: Add function for vectoring error generation
KVM: x86: Add emulation status for unhandleable exception vectoring
KVM: x86: Try to unprotect and retry on unhandleable emulation failure
KVM: VMX: Handle event vectoring error in check_emulate_instruction()
KVM: SVM: Handle event vectoring error in check_emulate_instruction()
KVM: selftests: Add and use a helper function for x86's LIDT
KVM: selftests: Add test case for MMIO during vectoring on x86
Juergen Gross (1):
KVM/x86: add comment to kvm_mmu_do_page_fault()
Keith Busch (1):
kvm: defer huge page recovery vhost task to later
Liam Ni (1):
KVM: x86: Use LVT_TIMER instead of an open coded literal
Maxim Levitsky (4):
KVM: x86: Add interrupt injection information to the kvm_entry tracepoint
KVM: x86: Add information about pending requests to kvm_exit tracepoint
KVM: VMX: refactor PML terminology
KVM: VMX: read the PML log in the same order as it was written
Paolo Bonzini (15):
Merge tag 'kvm-selftests-treewide-6.14' of https://github.com/kvm-x86/linux into HEAD
Merge tag 'kvm-x86-fixes-6.13-rcN' of https://github.com/kvm-x86/linux into HEAD
KVM: x86: clear vcpu->run->hypercall.ret before exiting for KVM_EXIT_HYPERCALL
KVM: x86: Refactor __kvm_emulate_hypercall() into a macro
KVM: x86/tdp_mmu: Propagate attr_filter to MMU notifier callbacks
Merge tag 'loongarch-kvm-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
Merge tag 'kvm-memslots-6.14' of https://github.com/kvm-x86/linux into HEAD
Merge tag 'kvm-x86-vcpu_array-6.14' of https://github.com/kvm-x86/linux into HEAD
Merge tag 'kvm-x86-mmu-6.14' of https://github.com/kvm-x86/linux into HEAD
Merge tag 'kvm-x86-svm-6.14' of https://github.com/kvm-x86/linux into HEAD
Merge tag 'kvm-x86-vmx-6.14' of https://github.com/kvm-x86/linux into HEAD
Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD
Merge tag 'kvm-riscv-6.14-1' of https://github.com/kvm-riscv/linux into HEAD
Merge branch 'kvm-userspace-hypercall' into HEAD
Merge branch 'kvm-mirror-page-tables' into HEAD
Quan Zhou (5):
RISC-V: KVM: Allow Svvptc extension for Guest/VM
RISC-V: KVM: Allow Zabha extension for Guest/VM
RISC-V: KVM: Allow Ziccrse extension for Guest/VM
KVM: riscv: selftests: Add Svvptc/Zabha/Ziccrse exts to get-reg-list test
RISC-V: KVM: Redirect instruction access fault trap to guest
Rick Edgecombe (5):
KVM: x86/mmu: Zap invalid roots with mmu_lock holding for write at uninit
KVM: x86: Add a VM type define for TDX
KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void
KVM: x86/tdp_mmu: Don't zap valid mirror roots in kvm_tdp_mmu_zap_all()
KVM: x86/mmu: Prevent aliased memslot GFNs
Sean Christopherson (96):
KVM: Explicitly verify target vCPU is online in kvm_get_vcpu()
KVM: Verify there's at least one online vCPU when iterating over all vCPUs
KVM: Grab vcpu->mutex across installing the vCPU's fd and bumping online_vcpus
Revert "KVM: Fix vcpu_array[0] races"
KVM: Don't BUG() the kernel if xa_insert() fails with -EBUSY
KVM: Drop hack that "manually" informs lockdep of kvm->lock vs. vcpu->mutex
KVM: x86: Plumb in the vCPU to kvm_x86_ops.hwapic_isr_update()
KVM: SVM: Macrofy SEV=n versions of sev_xxx_guest()
KVM: SVM: Remove redundant TLB flush on guest CR4.PGE change
KVM: Move KVM_REG_SIZE() definition to common uAPI header
KVM: selftests: Return a value from vcpu_get_reg() instead of using an out-param
KVM: selftests: Assert that vcpu_{g,s}et_reg() won't truncate
KVM: selftests: Check for a potential unhandled exception iff KVM_RUN succeeded
KVM: selftests: Rename max_guest_memory_test to mmu_stress_test
KVM: selftests: Only muck with SREGS on x86 in mmu_stress_test
KVM: selftests: Compute number of extra pages needed in mmu_stress_test
KVM: sefltests: Explicitly include ucall_common.h in mmu_stress_test.c
KVM: selftests: Enable mmu_stress_test on arm64
KVM: selftests: Use vcpu_arch_put_guest() in mmu_stress_test
KVM: selftests: Precisely limit the number of guest loops in mmu_stress_test
KVM: selftests: Add a read-only mprotect() phase to mmu_stress_test
KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)
KVM: selftests: Provide empty 'all' and 'clean' targets for unsupported ARCHs
KVM: selftests: Use canonical $(ARCH) paths for KVM selftests directories
KVM: selftests: Override ARCH for x86_64 instead of using ARCH_DIR
KVM: x86: Use feature_bit() to clear CONSTANT_TSC when emulating CPUID
KVM: x86: Limit use of F() and SF() to kvm_cpu_cap_{mask,init_kvm_defined}()
KVM: x86: Do all post-set CPUID processing during vCPU creation
KVM: x86: Explicitly do runtime CPUID updates "after" initial setup
KVM: x86: Account for KVM-reserved CR4 bits when passing through CR4 on VMX
KVM: selftests: Update x86's set_sregs_test to match KVM's CPUID enforcement
KVM: selftests: Assert that vcpu->cpuid is non-NULL when getting CPUID entries
KVM: selftests: Refresh vCPU CPUID cache in __vcpu_get_cpuid_entry()
KVM: selftests: Verify KVM stuffs runtime CPUID OS bits on CR4 writes
KVM: x86: Move __kvm_is_valid_cr4() definition to x86.h
KVM: x86/pmu: Drop now-redundant refresh() during init()
KVM: x86: Drop now-redundant MAXPHYADDR and GPA rsvd bits from vCPU creation
KVM: x86: Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation
KVM: x86: Reject disabling of MWAIT/HLT interception when not allowed
KVM: x86: Drop the now unused KVM_X86_DISABLE_VALID_EXITS
KVM: selftests: Fix a bad TEST_REQUIRE() in x86's KVM PV test
KVM: selftests: Update x86's KVM PV test to match KVM's disabling exits behavior
KVM: x86: Zero out PV features cache when the CPUID leaf is not present
KVM: x86: Don't update PV features caches when enabling enforcement capability
KVM: x86: Do reverse CPUID sanity checks in __feature_leaf()
KVM: x86: Account for max supported CPUID leaf when getting raw host CPUID
KVM: x86: Unpack F() CPUID feature flag macros to one flag per line of code
KVM: x86: Rename kvm_cpu_cap_mask() to kvm_cpu_cap_init()
KVM: x86: Add a macro to init CPUID features that are 64-bit only
KVM: x86: Add a macro to precisely handle aliased 0x1.EDX CPUID features
KVM: x86: Handle kernel- and KVM-defined CPUID words in a single helper
KVM: x86: #undef SPEC_CTRL_SSBD in cpuid.c to avoid macro collisions
KVM: x86: Harden CPU capabilities processing against out-of-scope features
KVM: x86: Add a macro to init CPUID features that ignore host kernel support
KVM: x86: Add a macro to init CPUID features that KVM emulates in software
KVM: x86: Swap incoming guest CPUID into vCPU before massaging in KVM_SET_CPUID2
KVM: x86: Clear PV_UNHALT for !HLT-exiting only when userspace sets CPUID
KVM: x86: Remove unnecessary caching of KVM's PV CPUID base
KVM: x86: Always operate on kvm_vcpu data in cpuid_entry2_find()
KVM: x86: Move kvm_find_cpuid_entry{,_index}() up near cpuid_entry2_find()
KVM: x86: Remove all direct usage of cpuid_entry2_find()
KVM: x86: Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUID
KVM: x86: Advertise HYPERVISOR in KVM_GET_SUPPORTED_CPUID
KVM: x86: Rename "governed features" helpers to use "guest_cpu_cap"
KVM: x86: Replace guts of "governed" features with comprehensive cpu_caps
KVM: x86: Initialize guest cpu_caps based on guest CPUID
KVM: x86: Extract code for generating per-entry emulated CPUID information
KVM: x86: Treat MONTIOR/MWAIT as a "partially emulated" feature
KVM: x86: Initialize guest cpu_caps based on KVM support
KVM: x86: Avoid double CPUID lookup when updating MWAIT at runtime
KVM: x86: Drop unnecessary check that cpuid_entry2_find() returns right leaf
KVM: x86: Update OS{XSAVE,PKE} bits in guest CPUID irrespective of host support
KVM: x86: Update guest cpu_caps at runtime for dynamic CPUID-based features
KVM: x86: Shuffle code to prepare for dropping guest_cpuid_has()
KVM: x86: Replace (almost) all guest CPUID feature queries with cpu_caps
KVM: x86: Drop superfluous host XSAVE check when adjusting guest XSAVES caps
KVM: x86: Add a macro for features that are synthesized into boot_cpu_data
KVM: x86: Pull CPUID capabilities from boot_cpu_data only as needed
KVM: x86: Rename "SF" macro to "SCATTERED_F"
KVM: x86: Explicitly track feature flags that require vendor enabling
KVM: x86: Explicitly track feature flags that are enabled at runtime
KVM: x86: Use only local variables (no bitmask) to init kvm_cpu_caps
KVM: nVMX: Explicitly update vPPR on successful nested VM-Enter
KVM: nVMX: Check for pending INIT/SIPI after entering non-root mode
KVM: nVMX: Drop manual vmcs01.GUEST_INTERRUPT_STATUS.RVI check at VM-Enter
KVM: nVMX: Use vmcs01's controls shadow to check for IRQ/NMI windows at VM-Enter
KVM: nVMX: Honor event priority when emulating PI delivery during VM-Enter
KVM: x86: Move "emulate hypercall" function declarations to x86.h
KVM: x86: Bump hypercall stat prior to fully completing hypercall
KVM: x86: Always complete hypercall via function callback
KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
KVM: Assert slots_lock is held when setting memory regions
KVM: Add a dedicated API for setting KVM-internal memslots
KVM: x86: Drop double-underscores from __kvm_set_memory_region()
KVM: Disallow all flags for KVM-internal memslots
Thorsten Blum (2):
KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
Yan Zhao (2):
KVM: guest_memfd: Remove RCU-protected attribute from slot->gmem.file
KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
Documentation/virt/kvm/api.rst | 10 +-
MAINTAINERS | 12 +-
arch/arm64/include/uapi/asm/kvm.h | 3 -
arch/loongarch/include/asm/kvm_host.h | 1 +
arch/loongarch/include/asm/kvm_para.h | 3 +
arch/loongarch/include/asm/kvm_vcpu.h | 1 +
arch/loongarch/include/uapi/asm/kvm_para.h | 1 +
arch/loongarch/kvm/exit.c | 30 +
arch/loongarch/kvm/main.c | 18 +
arch/loongarch/kvm/vcpu.c | 7 +-
arch/riscv/include/asm/kvm_host.h | 5 +
arch/riscv/include/asm/kvm_vcpu_sbi.h | 1 +
arch/riscv/include/uapi/asm/kvm.h | 7 +-
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/vcpu.c | 7 +-
arch/riscv/kvm/vcpu_exit.c | 37 +-
arch/riscv/kvm/vcpu_onereg.c | 6 +
arch/riscv/kvm/vcpu_sbi.c | 4 +
arch/riscv/kvm/vcpu_sbi_system.c | 73 ++
arch/x86/include/asm/kvm-x86-ops.h | 6 +-
arch/x86/include/asm/kvm_host.h | 107 ++-
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/cpuid.c | 997 ++++++++++++++-------
arch/x86/kvm/cpuid.h | 132 ++-
arch/x86/kvm/governed_features.h | 22 -
arch/x86/kvm/hyperv.c | 2 +-
arch/x86/kvm/kvm_emulate.h | 2 +
arch/x86/kvm/lapic.c | 31 +-
arch/x86/kvm/lapic.h | 1 +
arch/x86/kvm/mmu.h | 33 +-
arch/x86/kvm/mmu/mmu.c | 82 +-
arch/x86/kvm/mmu/mmu_internal.h | 80 +-
arch/x86/kvm/mmu/spte.h | 5 +
arch/x86/kvm/mmu/tdp_iter.c | 10 +-
arch/x86/kvm/mmu/tdp_iter.h | 21 +-
arch/x86/kvm/mmu/tdp_mmu.c | 325 +++++--
arch/x86/kvm/mmu/tdp_mmu.h | 51 +-
arch/x86/kvm/pmu.c | 1 -
arch/x86/kvm/reverse_cpuid.h | 23 +-
arch/x86/kvm/smm.c | 10 +-
arch/x86/kvm/svm/nested.c | 22 +-
arch/x86/kvm/svm/pmu.c | 8 +-
arch/x86/kvm/svm/sev.c | 43 +-
arch/x86/kvm/svm/svm.c | 78 +-
arch/x86/kvm/svm/svm.h | 23 +-
arch/x86/kvm/trace.h | 17 +-
arch/x86/kvm/vmx/hyperv.h | 2 +-
arch/x86/kvm/vmx/main.c | 4 +-
arch/x86/kvm/vmx/nested.c | 102 ++-
arch/x86/kvm/vmx/pmu_intel.c | 4 +-
arch/x86/kvm/vmx/sgx.c | 14 +-
arch/x86/kvm/vmx/vmx.c | 176 ++--
arch/x86/kvm/vmx/vmx.h | 6 +-
arch/x86/kvm/vmx/x86_ops.h | 6 +-
arch/x86/kvm/x86.c | 261 +++---
arch/x86/kvm/x86.h | 34 +-
include/linux/call_once.h | 45 +
include/linux/kvm_host.h | 37 +-
include/uapi/linux/kvm.h | 8 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 347 +------
tools/testing/selftests/kvm/Makefile.kvm | 330 +++++++
.../kvm/{aarch64 => arm64}/aarch32_id_regs.c | 10 +-
.../selftests/kvm/{aarch64 => arm64}/arch_timer.c | 0
.../kvm/{aarch64 => arm64}/arch_timer_edge_cases.c | 0
.../kvm/{aarch64 => arm64}/debug-exceptions.c | 4 +-
.../kvm/{aarch64 => arm64}/get-reg-list.c | 0
.../selftests/kvm/{aarch64 => arm64}/hypercalls.c | 6 +-
.../selftests/kvm/{aarch64 => arm64}/mmio_abort.c | 0
.../selftests/kvm/{aarch64 => arm64}/no-vgic-v3.c | 2 +-
.../kvm/{aarch64 => arm64}/page_fault_test.c | 0
.../selftests/kvm/{aarch64 => arm64}/psci_test.c | 8 +-
.../selftests/kvm/{aarch64 => arm64}/set_id_regs.c | 22 +-
.../kvm/{aarch64 => arm64}/smccc_filter.c | 0
.../kvm/{aarch64 => arm64}/vcpu_width_config.c | 0
.../selftests/kvm/{aarch64 => arm64}/vgic_init.c | 0
.../selftests/kvm/{aarch64 => arm64}/vgic_irq.c | 0
.../kvm/{aarch64 => arm64}/vgic_lpi_stress.c | 0
.../kvm/{aarch64 => arm64}/vpmu_counter_access.c | 19 +-
tools/testing/selftests/kvm/dirty_log_perf_test.c | 2 +-
.../kvm/include/{aarch64 => arm64}/arch_timer.h | 0
.../kvm/include/{aarch64 => arm64}/delay.h | 0
.../selftests/kvm/include/{aarch64 => arm64}/gic.h | 0
.../kvm/include/{aarch64 => arm64}/gic_v3.h | 0
.../kvm/include/{aarch64 => arm64}/gic_v3_its.h | 0
.../kvm/include/{aarch64 => arm64}/kvm_util_arch.h | 0
.../kvm/include/{aarch64 => arm64}/processor.h | 0
.../kvm/include/{aarch64 => arm64}/spinlock.h | 0
.../kvm/include/{aarch64 => arm64}/ucall.h | 0
.../kvm/include/{aarch64 => arm64}/vgic.h | 0
tools/testing/selftests/kvm/include/kvm_util.h | 10 +-
.../kvm/include/{s390x => s390}/debug_print.h | 0
.../include/{s390x => s390}/diag318_test_handler.h | 0
.../kvm/include/{s390x => s390}/facility.h | 0
.../kvm/include/{s390x => s390}/kvm_util_arch.h | 0
.../kvm/include/{s390x => s390}/processor.h | 0
.../selftests/kvm/include/{s390x => s390}/sie.h | 0
.../selftests/kvm/include/{s390x => s390}/ucall.h | 0
.../selftests/kvm/include/{x86_64 => x86}/apic.h | 2 -
.../selftests/kvm/include/{x86_64 => x86}/evmcs.h | 3 -
.../selftests/kvm/include/{x86_64 => x86}/hyperv.h | 3 -
.../kvm/include/{x86_64 => x86}/kvm_util_arch.h | 0
.../selftests/kvm/include/{x86_64 => x86}/mce.h | 2 -
.../selftests/kvm/include/{x86_64 => x86}/pmu.h | 0
.../kvm/include/{x86_64 => x86}/processor.h | 27 +-
.../selftests/kvm/include/{x86_64 => x86}/sev.h | 0
.../selftests/kvm/include/{x86_64 => x86}/svm.h | 6 -
.../kvm/include/{x86_64 => x86}/svm_util.h | 3 -
.../selftests/kvm/include/{x86_64 => x86}/ucall.h | 0
.../selftests/kvm/include/{x86_64 => x86}/vmx.h | 2 -
.../selftests/kvm/lib/{aarch64 => arm64}/gic.c | 0
.../kvm/lib/{aarch64 => arm64}/gic_private.h | 0
.../selftests/kvm/lib/{aarch64 => arm64}/gic_v3.c | 0
.../kvm/lib/{aarch64 => arm64}/gic_v3_its.c | 0
.../kvm/lib/{aarch64 => arm64}/handlers.S | 0
.../kvm/lib/{aarch64 => arm64}/processor.c | 8 +-
.../kvm/lib/{aarch64 => arm64}/spinlock.c | 0
.../selftests/kvm/lib/{aarch64 => arm64}/ucall.c | 0
.../selftests/kvm/lib/{aarch64 => arm64}/vgic.c | 0
tools/testing/selftests/kvm/lib/kvm_util.c | 3 +-
tools/testing/selftests/kvm/lib/riscv/processor.c | 66 +-
.../kvm/lib/{s390x => s390}/diag318_test_handler.c | 0
.../selftests/kvm/lib/{s390x => s390}/facility.c | 0
.../selftests/kvm/lib/{s390x => s390}/processor.c | 0
.../selftests/kvm/lib/{s390x => s390}/ucall.c | 0
.../selftests/kvm/lib/{x86_64 => x86}/apic.c | 0
.../selftests/kvm/lib/{x86_64 => x86}/handlers.S | 0
.../selftests/kvm/lib/{x86_64 => x86}/hyperv.c | 0
.../selftests/kvm/lib/{x86_64 => x86}/memstress.c | 2 +-
.../selftests/kvm/lib/{x86_64 => x86}/pmu.c | 0
.../selftests/kvm/lib/{x86_64 => x86}/processor.c | 2 -
.../selftests/kvm/lib/{x86_64 => x86}/sev.c | 0
.../selftests/kvm/lib/{x86_64 => x86}/svm.c | 1 -
.../selftests/kvm/lib/{x86_64 => x86}/ucall.c | 0
.../selftests/kvm/lib/{x86_64 => x86}/vmx.c | 2 -
.../{max_guest_memory_test.c => mmu_stress_test.c} | 162 +++-
tools/testing/selftests/kvm/riscv/arch_timer.c | 2 +-
tools/testing/selftests/kvm/riscv/ebreak_test.c | 2 +-
tools/testing/selftests/kvm/riscv/get-reg-list.c | 18 +-
tools/testing/selftests/kvm/riscv/sbi_pmu_test.c | 2 +-
.../selftests/kvm/{s390x => s390}/cmma_test.c | 0
tools/testing/selftests/kvm/{s390x => s390}/config | 0
.../kvm/{s390x => s390}/cpumodel_subfuncs_test.c | 0
.../selftests/kvm/{s390x => s390}/debug_test.c | 0
.../testing/selftests/kvm/{s390x => s390}/memop.c | 0
.../testing/selftests/kvm/{s390x => s390}/resets.c | 2 +-
.../kvm/{s390x => s390}/shared_zeropage_test.c | 0
.../selftests/kvm/{s390x => s390}/sync_regs_test.c | 0
.../testing/selftests/kvm/{s390x => s390}/tprot.c | 0
.../selftests/kvm/{s390x => s390}/ucontrol_test.c | 0
.../testing/selftests/kvm/set_memory_region_test.c | 59 +-
tools/testing/selftests/kvm/steal_time.c | 3 +-
.../selftests/kvm/{x86_64 => x86}/amx_test.c | 0
.../kvm/{x86_64 => x86}/apic_bus_clock_test.c | 0
.../selftests/kvm/{x86_64 => x86}/cpuid_test.c | 0
.../kvm/{x86_64 => x86}/cr4_cpuid_sync_test.c | 0
.../selftests/kvm/{x86_64 => x86}/debug_regs.c | 0
.../dirty_log_page_splitting_test.c | 0
.../exit_on_emulation_failure_test.c | 0
.../kvm/{x86_64 => x86}/feature_msrs_test.c | 0
.../kvm/{x86_64 => x86}/fix_hypercall_test.c | 0
.../selftests/kvm/{x86_64 => x86}/flds_emulation.h | 0
.../selftests/kvm/{x86_64 => x86}/hwcr_msr_test.c | 0
.../selftests/kvm/{x86_64 => x86}/hyperv_clock.c | 0
.../selftests/kvm/{x86_64 => x86}/hyperv_cpuid.c | 0
.../selftests/kvm/{x86_64 => x86}/hyperv_evmcs.c | 0
.../{x86_64 => x86}/hyperv_extended_hypercalls.c | 0
.../kvm/{x86_64 => x86}/hyperv_features.c | 0
.../selftests/kvm/{x86_64 => x86}/hyperv_ipi.c | 0
.../kvm/{x86_64 => x86}/hyperv_svm_test.c | 0
.../kvm/{x86_64 => x86}/hyperv_tlb_flush.c | 0
.../selftests/kvm/{x86_64 => x86}/kvm_clock_test.c | 0
.../selftests/kvm/{x86_64 => x86}/kvm_pv_test.c | 38 +-
.../kvm/{x86_64 => x86}/max_vcpuid_cap_test.c | 0
.../kvm/{x86_64 => x86}/monitor_mwait_test.c | 0
.../kvm/{x86_64 => x86}/nested_exceptions_test.c | 0
.../kvm/{x86_64 => x86}/nx_huge_pages_test.c | 0
.../kvm/{x86_64 => x86}/nx_huge_pages_test.sh | 0
.../kvm/{x86_64 => x86}/platform_info_test.c | 0
.../kvm/{x86_64 => x86}/pmu_counters_test.c | 0
.../kvm/{x86_64 => x86}/pmu_event_filter_test.c | 0
.../{x86_64 => x86}/private_mem_conversions_test.c | 0
.../{x86_64 => x86}/private_mem_kvm_exits_test.c | 0
.../kvm/{x86_64 => x86}/recalc_apic_map_test.c | 0
.../kvm/{x86_64 => x86}/set_boot_cpu_id.c | 0
.../selftests/kvm/{x86_64 => x86}/set_sregs_test.c | 63 +-
.../kvm/{x86_64 => x86}/sev_init2_tests.c | 0
.../kvm/{x86_64 => x86}/sev_migrate_tests.c | 0
.../selftests/kvm/{x86_64 => x86}/sev_smoke_test.c | 2 +-
.../smaller_maxphyaddr_emulation_test.c | 0
.../selftests/kvm/{x86_64 => x86}/smm_test.c | 0
.../selftests/kvm/{x86_64 => x86}/state_test.c | 0
.../kvm/{x86_64 => x86}/svm_int_ctl_test.c | 0
.../kvm/{x86_64 => x86}/svm_nested_shutdown_test.c | 0
.../{x86_64 => x86}/svm_nested_soft_inject_test.c | 0
.../kvm/{x86_64 => x86}/svm_vmcall_test.c | 0
.../selftests/kvm/{x86_64 => x86}/sync_regs_test.c | 0
.../kvm/{x86_64 => x86}/triple_fault_event_test.c | 0
.../selftests/kvm/{x86_64 => x86}/tsc_msrs_test.c | 0
.../kvm/{x86_64 => x86}/tsc_scaling_sync.c | 0
.../kvm/{x86_64 => x86}/ucna_injection_test.c | 0
.../kvm/{x86_64 => x86}/userspace_io_test.c | 0
.../kvm/{x86_64 => x86}/userspace_msr_exit_test.c | 0
.../kvm/{x86_64 => x86}/vmx_apic_access_test.c | 0
.../{x86_64 => x86}/vmx_close_while_nested_test.c | 0
.../kvm/{x86_64 => x86}/vmx_dirty_log_test.c | 0
.../vmx_exception_with_invalid_guest_state.c | 0
.../vmx_invalid_nested_guest_state.c | 0
.../selftests/kvm/{x86_64 => x86}/vmx_msrs_test.c | 0
.../{x86_64 => x86}/vmx_nested_tsc_scaling_test.c | 0
.../kvm/{x86_64 => x86}/vmx_pmu_caps_test.c | 0
.../{x86_64 => x86}/vmx_preemption_timer_test.c | 0
.../{x86_64 => x86}/vmx_set_nested_state_test.c | 0
.../kvm/{x86_64 => x86}/vmx_tsc_adjust_test.c | 0
.../selftests/kvm/{x86_64 => x86}/xapic_ipi_test.c | 0
.../kvm/{x86_64 => x86}/xapic_state_test.c | 0
.../kvm/{x86_64 => x86}/xcr0_cpuid_test.c | 0
.../kvm/{x86_64 => x86}/xen_shinfo_test.c | 0
.../kvm/{x86_64 => x86}/xen_vmcall_test.c | 0
.../selftests/kvm/{x86_64 => x86}/xss_msr_test.c | 0
virt/kvm/guest_memfd.c | 36 +-
virt/kvm/kvm_main.c | 115 ++-
222 files changed, 2909 insertions(+), 1547 deletions(-)
create mode 100644 arch/riscv/kvm/vcpu_sbi_system.c
delete mode 100644 arch/x86/kvm/governed_features.h
create mode 100644 include/linux/call_once.h
create mode 100644 tools/testing/selftests/kvm/Makefile.kvm
rename tools/testing/selftests/kvm/{aarch64 => arm64}/aarch32_id_regs.c (95%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/arch_timer.c (100%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/arch_timer_edge_cases.c (100%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/debug-exceptions.c (99%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/get-reg-list.c (100%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/hypercalls.c (98%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/mmio_abort.c (100%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/no-vgic-v3.c (98%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/page_fault_test.c (100%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/psci_test.c (96%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/set_id_regs.c (97%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/smccc_filter.c (100%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/vcpu_width_config.c (100%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/vgic_init.c (100%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/vgic_irq.c (100%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/vgic_lpi_stress.c (100%)
rename tools/testing/selftests/kvm/{aarch64 => arm64}/vpmu_counter_access.c (97%)
rename tools/testing/selftests/kvm/include/{aarch64 => arm64}/arch_timer.h (100%)
rename tools/testing/selftests/kvm/include/{aarch64 => arm64}/delay.h (100%)
rename tools/testing/selftests/kvm/include/{aarch64 => arm64}/gic.h (100%)
rename tools/testing/selftests/kvm/include/{aarch64 => arm64}/gic_v3.h (100%)
rename tools/testing/selftests/kvm/include/{aarch64 => arm64}/gic_v3_its.h (100%)
rename tools/testing/selftests/kvm/include/{aarch64 => arm64}/kvm_util_arch.h (100%)
rename tools/testing/selftests/kvm/include/{aarch64 => arm64}/processor.h (100%)
rename tools/testing/selftests/kvm/include/{aarch64 => arm64}/spinlock.h (100%)
rename tools/testing/selftests/kvm/include/{aarch64 => arm64}/ucall.h (100%)
rename tools/testing/selftests/kvm/include/{aarch64 => arm64}/vgic.h (100%)
rename tools/testing/selftests/kvm/include/{s390x => s390}/debug_print.h (100%)
rename tools/testing/selftests/kvm/include/{s390x => s390}/diag318_test_handler.h (100%)
rename tools/testing/selftests/kvm/include/{s390x => s390}/facility.h (100%)
rename tools/testing/selftests/kvm/include/{s390x => s390}/kvm_util_arch.h (100%)
rename tools/testing/selftests/kvm/include/{s390x => s390}/processor.h (100%)
rename tools/testing/selftests/kvm/include/{s390x => s390}/sie.h (100%)
rename tools/testing/selftests/kvm/include/{s390x => s390}/ucall.h (100%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/apic.h (98%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/evmcs.h (99%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/hyperv.h (99%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/kvm_util_arch.h (100%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/mce.h (94%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/pmu.h (100%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/processor.h (99%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/sev.h (100%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/svm.h (98%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/svm_util.h (94%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/ucall.h (100%)
rename tools/testing/selftests/kvm/include/{x86_64 => x86}/vmx.h (99%)
rename tools/testing/selftests/kvm/lib/{aarch64 => arm64}/gic.c (100%)
rename tools/testing/selftests/kvm/lib/{aarch64 => arm64}/gic_private.h (100%)
rename tools/testing/selftests/kvm/lib/{aarch64 => arm64}/gic_v3.c (100%)
rename tools/testing/selftests/kvm/lib/{aarch64 => arm64}/gic_v3_its.c (100%)
rename tools/testing/selftests/kvm/lib/{aarch64 => arm64}/handlers.S (100%)
rename tools/testing/selftests/kvm/lib/{aarch64 => arm64}/processor.c (98%)
rename tools/testing/selftests/kvm/lib/{aarch64 => arm64}/spinlock.c (100%)
rename tools/testing/selftests/kvm/lib/{aarch64 => arm64}/ucall.c (100%)
rename tools/testing/selftests/kvm/lib/{aarch64 => arm64}/vgic.c (100%)
rename tools/testing/selftests/kvm/lib/{s390x => s390}/diag318_test_handler.c (100%)
rename tools/testing/selftests/kvm/lib/{s390x => s390}/facility.c (100%)
rename tools/testing/selftests/kvm/lib/{s390x => s390}/processor.c (100%)
rename tools/testing/selftests/kvm/lib/{s390x => s390}/ucall.c (100%)
rename tools/testing/selftests/kvm/lib/{x86_64 => x86}/apic.c (100%)
rename tools/testing/selftests/kvm/lib/{x86_64 => x86}/handlers.S (100%)
rename tools/testing/selftests/kvm/lib/{x86_64 => x86}/hyperv.c (100%)
rename tools/testing/selftests/kvm/lib/{x86_64 => x86}/memstress.c (98%)
rename tools/testing/selftests/kvm/lib/{x86_64 => x86}/pmu.c (100%)
rename tools/testing/selftests/kvm/lib/{x86_64 => x86}/processor.c (99%)
rename tools/testing/selftests/kvm/lib/{x86_64 => x86}/sev.c (100%)
rename tools/testing/selftests/kvm/lib/{x86_64 => x86}/svm.c (99%)
rename tools/testing/selftests/kvm/lib/{x86_64 => x86}/ucall.c (100%)
rename tools/testing/selftests/kvm/lib/{x86_64 => x86}/vmx.c (99%)
rename tools/testing/selftests/kvm/{max_guest_memory_test.c => mmu_stress_test.c} (60%)
rename tools/testing/selftests/kvm/{s390x => s390}/cmma_test.c (100%)
rename tools/testing/selftests/kvm/{s390x => s390}/config (100%)
rename tools/testing/selftests/kvm/{s390x => s390}/cpumodel_subfuncs_test.c (100%)
rename tools/testing/selftests/kvm/{s390x => s390}/debug_test.c (100%)
rename tools/testing/selftests/kvm/{s390x => s390}/memop.c (100%)
rename tools/testing/selftests/kvm/{s390x => s390}/resets.c (99%)
rename tools/testing/selftests/kvm/{s390x => s390}/shared_zeropage_test.c (100%)
rename tools/testing/selftests/kvm/{s390x => s390}/sync_regs_test.c (100%)
rename tools/testing/selftests/kvm/{s390x => s390}/tprot.c (100%)
rename tools/testing/selftests/kvm/{s390x => s390}/ucontrol_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/amx_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/apic_bus_clock_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/cpuid_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/cr4_cpuid_sync_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/debug_regs.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/dirty_log_page_splitting_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/exit_on_emulation_failure_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/feature_msrs_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/fix_hypercall_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/flds_emulation.h (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/hwcr_msr_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/hyperv_clock.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/hyperv_cpuid.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/hyperv_evmcs.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/hyperv_extended_hypercalls.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/hyperv_features.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/hyperv_ipi.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/hyperv_svm_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/hyperv_tlb_flush.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/kvm_clock_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/kvm_pv_test.c (76%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/max_vcpuid_cap_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/monitor_mwait_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/nested_exceptions_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/nx_huge_pages_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/nx_huge_pages_test.sh (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/platform_info_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/pmu_counters_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/pmu_event_filter_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/private_mem_conversions_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/private_mem_kvm_exits_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/recalc_apic_map_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/set_boot_cpu_id.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/set_sregs_test.c (75%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/sev_init2_tests.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/sev_migrate_tests.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/sev_smoke_test.c (99%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/smaller_maxphyaddr_emulation_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/smm_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/state_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/svm_int_ctl_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/svm_nested_shutdown_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/svm_nested_soft_inject_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/svm_vmcall_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/sync_regs_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/triple_fault_event_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/tsc_msrs_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/tsc_scaling_sync.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/ucna_injection_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/userspace_io_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/userspace_msr_exit_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_apic_access_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_close_while_nested_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_dirty_log_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_exception_with_invalid_guest_state.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_invalid_nested_guest_state.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_msrs_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_nested_tsc_scaling_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_pmu_caps_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_preemption_timer_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_set_nested_state_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/vmx_tsc_adjust_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/xapic_ipi_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/xapic_state_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/xcr0_cpuid_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/xen_shinfo_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/xen_vmcall_test.c (100%)
rename tools/testing/selftests/kvm/{x86_64 => x86}/xss_msr_test.c (100%)
^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [GIT PULL] KVM changes for Linux 6.14
From: Marc Zyngier @ 2025-01-25 14:30 UTC (permalink / raw)
To: Paolo Bonzini
Cc: torvalds, linux-kernel, kvm, Oliver Upton, Will Deacon, Catalin Marinas

On Fri, 24 Jan 2025 16:37:41 +0000,
Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Linus,
>
> The following changes since commit 5bc55a333a2f7316b58edc7573e8e893f7acb532:
>
>   Linux 6.13-rc7 (2025-01-12 14:37:56 -0800)
>
> are available in the Git repository at:
>
>   https://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus
>
> for you to fetch changes up to 931656b9e2ff7029aee0b36e17780621948a6ac1:
>
>   kvm: defer huge page recovery vhost task to later (2025-01-24 10:53:56 -0500)

Sorry to ask the obvious, but why didn't you include the KVM/arm64
updates which you said you had pulled[1]? Was there anything wrong
with it? If so, I would very much like to know.

	M.

[1] https://lore.kernel.org/r/CABgObfYckN2J_Q3d--ZfAP=QbtGWp-teOpXGPfubU=BN4DSrgw@mail.gmail.com

--
Without deviation from the norm, progress is not possible.
* Re: [GIT PULL] KVM changes for Linux 6.14
From: Linus Torvalds @ 2025-01-25 18:12 UTC (permalink / raw)
To: Paolo Bonzini, Michael S. Tsirkin, Christian Brauner, Eric W. Biederman, Oleg Nesterov
Cc: linux-kernel, kvm

Let's bring some thread setup people in on this..

The kvm people obviously already solved their particular issue, but I
get the feeling that the kvm solution is kind of a hack that works
around a user space oddity.

For newly added people: see commit 931656b9e2ff ("kvm: defer huge page
recovery vhost task to later") and the explanation below, and this
thread on the mailing lists:

  https://lore.kernel.org/all/Z2RYyagu3phDFIac@kbusch-mbp.dhcp.thefacebook.com/

Arguably the user space oddity is just strange and Paolo even calls it
a bug, but at the same time, I do think user space can and should
reasonably expect that it only has children that it created
explicitly, and the automatic reclamation thread most definitely is a
bit too implicit.

On Fri, 24 Jan 2025 at 08:38, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> * The recently introduced conversion of the NX-page reclamation kthread to
>   vhost_task moved the task under the main process. The task is created as
>   soon as KVM_CREATE_VM was invoked and this, of course, broke userspace that
>   didn't expect to see any child task of the VM process until it started
>   creating its own userspace threads. In particular crosvm refuses to fork()
>   if procfs shows any child task, so unbreak it by creating the task lazily.
>   This is arguably a userspace bug, as there can be other kinds of legitimate
>   worker tasks and they wouldn't impede fork(); but it's not like userspace
>   has a way to distinguish kernel worker tasks right now. Should they show
>   as "Kthread: 1" in proc/.../status?

So first off, let me just say that I still absolutely think that the
current "vhost workers are children of the starter" is the right
model, even if it has caused some issues because of various legacy
expectations.

But in this case I do wonder if we should hide the implicit kernel
threads from user space somehow.

Keith pinpointed the user space logic to fork_remap():

  https://github.com/google/minijail/blob/main/rust/minijail/src/lib.rs#L987

and honestly, I do think it makes sense for user space to ask "am I
single-threaded" (which is presumably the thing that breaks), and the
code for that is pretty simple:

    fn is_single_threaded() -> io::Result<bool> {
        match count_dir_entries("/proc/self/task") {
            Ok(1) => Ok(true),
            Ok(_) => Ok(false),
            Err(e) => Err(e),
        }
    }

and I really don't think user space is "wrong".

So the fact that a kernel helper thread that runs async in the
background and does random background infrastructure things that do
not really affect user space should probably simply not break this
kind of simple (and admittedly simplistic) user space logic.

Should we just add some flag to say "don't show this thread in this
context"?  We obviously still want to see it for management purposes,
so it's not like the thing should be entirely invisible, but maybe
Christian / Eric / Oleg have some opinions on how to do this cleanly
in "copy_process()" or similar?

              Linus
* Re: [GIT PULL] KVM changes for Linux 6.14
From: Linus Torvalds @ 2025-01-25 18:31 UTC (permalink / raw)
To: Paolo Bonzini, Michael S. Tsirkin, Christian Brauner, Eric W. Biederman, Oleg Nesterov
Cc: linux-kernel, kvm

On Sat, 25 Jan 2025 at 10:12, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Arguably the user space oddity is just strange and Paolo even calls it
> a bug, but at the same time, I do think user space can and should
> reasonably expect that it only has children that it created
> explicitly [..]

Note that I think that doing things like "io_uring" and getting IO
helper threads that way would very much count as "explicit children",
so I don't argue that all kernel helper threads would fall under this
category.

And I suspect that the normal vhost workers fall under that same kind
of "it's like io_uring". If you use VHOST_NEW_WORKER to create a
worker thread, then that's a pretty explicit "I have a child process".

So it's really just that hugepage recovery thread that seems to be a
bit "too" much of an implicit kernel helper thread that user space
kind of gets accidentally and implicitly just because of a kernel
implementation detail.

I'm sure the kvm hack to just start it later (at KVM_RUN time?) is
sufficient in practice, but it still feels conceptually iffy to me.

              Linus
* Re: [GIT PULL] KVM changes for Linux 6.14
From: Eric W. Biederman @ 2025-01-27 3:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paolo Bonzini, Michael S. Tsirkin, Christian Brauner, Oleg Nesterov, linux-kernel, kvm

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, 25 Jan 2025 at 10:12, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Arguably the user space oddity is just strange and Paolo even calls it
>> a bug, but at the same time, I do think user space can and should
>> reasonably expect that it only has children that it created
>> explicitly [..]
>
> Note that I think that doing things like "io_uring" and getting IO
> helper threads that way would very much count as "explicit children",
> so I don't argue that all kernel helper threads would fall under this
> category.
>
> And I suspect that the normal vhost workers fall under that same kind
> of "it's like io_uring". If you use VHOST_NEW_WORKER to create a
> worker thread, then that's a pretty explicit "I have a child process".
>
> So it's really just that hugepage recovery thread that seems to be a
> bit "too" much of an implicit kernel helper thread that user space
> kind of gets accidentally and implicitly just because of a kernel
> implementation detail.
>
> I'm sure the kvm hack to just start it later (at KVM_RUN time?) is
> sufficient in practice, but it still feels conceptually iffy to me.

I don't think implicit vs explicit is the right question.  Rather we
should be asking: can userspace care?

If I read the context from the commit correctly, what userspace is
asking is: am I single-threaded, so that I know nothing funny will
happen in the forked process?

The most common funny I am aware of for forked multi-threaded
processes is that if they fork with another thread holding a lock,
the forked process might hang forever on the lock, because the lock
will never be released.

The most interesting part of the hugepage reaper appears to be
kvm_mmu_commit_zap_page, where a page is freed after being flushed
from the tlb.

I would argue that if kvm_mmu_commit_zap_page and friends change the
page tables in a way that userspace can see after a fork, and in turn
could affect how the forked process will execute, userspace is doing
something sensible in testing for it.

On the flip side, if this isn't something userspace can observe in its
own process, I would argue that the proper solution is to use a
regular kthread.

In summary, the conceptually clean approach is to only have threads
that, when running, can affect the process they are a part of in a
userspace-visible way.

Assuming the hugepage reaper can affect the process it is a part of,
the only problem I see is the hugepage reaper existing when it had
nothing it could possibly do.

I don't think hiding threads is a useful solution, because the threads
will affect the process they are a part of.  If the threads aren't
affecting the process they are a part of, we have other solutions
besides threads.

Eric
* Re: [GIT PULL] KVM changes for Linux 6.14
From: Oleg Nesterov @ 2025-01-26 14:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paolo Bonzini, Michael S. Tsirkin, Christian Brauner, Eric W. Biederman, linux-kernel, kvm

On 01/25, Linus Torvalds wrote:
>
> Keith pinpointed the user space logic to fork_remap():
>
>   https://github.com/google/minijail/blob/main/rust/minijail/src/lib.rs#L987
>
> and honestly, I do think it makes sense for user space to ask "am I
> single-threaded" (which is presumably the thing that breaks), and the
> code for that is pretty simple:
>
>     fn is_single_threaded() -> io::Result<bool> {
>         match count_dir_entries("/proc/self/task") {
>             Ok(1) => Ok(true),
>             Ok(_) => Ok(false),
>             Err(e) => Err(e),
>         }
>     }
>
> and I really don't think user space is "wrong".
>
> So the fact that a kernel helper thread that runs async in the
> background and does random background infrastructure things that do
> not really affect user space should probably simply not break this
> kind of simple (and admittedly simplistic) user space logic.
>
> Should we just add some flag to say "don't show this thread in this
> context"?

Not sure I understand... Looking at is_single_threaded() above I guess
something like below should work (incomplete, in particular we need to
change first_tid() as well). But a PF_HIDDEN sub-thread will still be
visible via /proc/$pid_of_PF_HIDDEN

> We obviously still want to see it for management purposes,
> so it's not like the thing should be entirely invisible,

Can you explain?

Oleg.

--- x/include/linux/sched.h
+++ x/include/linux/sched.h
@@ -1685,7 +1685,7 @@ extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_USER_WORKER		0x00004000	/* Kernel thread cloned from userspace thread */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF__HOLE__00010000	0x00010000
+#define PF_HIDDEN		0x00010000
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocations inherit GFP_NOFS. See memalloc_nfs_save() */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocations inherit GFP_NOIO. See memalloc_noio_save() */
--- x/include/linux/sched/task.h
+++ x/include/linux/sched/task.h
@@ -31,6 +31,7 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
+	u32 hidden:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
--- x/kernel/fork.c
+++ x/kernel/fork.c
@@ -2237,6 +2237,8 @@ __latent_entropy struct task_struct *copy_process(
 	}
 	if (args->io_thread)
 		p->flags |= PF_IO_WORKER;
+	if (args->hidden)
+		p->flags |= PF_HIDDEN;

 	if (args->name)
 		strscpy_pad(p->comm, args->name, sizeof(p->comm));
--- x/kernel/vhost_task.c
+++ x/kernel/vhost_task.c
@@ -117,7 +117,7 @@ EXPORT_SYMBOL_GPL(vhost_task_stop);
  */
 struct vhost_task *vhost_task_create(bool (*fn)(void *),
 				     void (*handle_sigkill)(void *), void *arg,
-				     const char *name)
+				     bool hidden, const char *name)
 {
 	struct kernel_clone_args args = {
 		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
@@ -125,6 +125,7 @@ struct vhost_task *vhost_task_create(bool (*fn)(void *),
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,
+		.hidden		= hidden,
 		.user_worker	= 1,
 		.no_files	= 1,
 	};
--- x/fs/proc/base.c
+++ x/fs/proc/base.c
@@ -3906,9 +3906,12 @@ static struct task_struct *next_tid(struct task_struct *start)
 	struct task_struct *pos = NULL;
 	rcu_read_lock();
 	if (pid_alive(start)) {
-		pos = __next_thread(start);
-		if (pos)
-			get_task_struct(pos);
+		for (pos = start; (pos = __next_thread(pos)); ) {
+			if (!(pos->flags & PF_HIDDEN)) {
+				get_task_struct(pos);
+				break;
+			}
+		}
 	}
 	rcu_read_unlock();
 	put_task_struct(start);
* Re: [GIT PULL] KVM changes for Linux 6.14
From: Linus Torvalds @ 2025-01-26 18:34 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Paolo Bonzini, Michael S. Tsirkin, Christian Brauner, Eric W. Biederman, linux-kernel, kvm

On Sun, 26 Jan 2025 at 06:21, Oleg Nesterov <oleg@redhat.com> wrote:
>
> > Should we just add some flag to say "don't show this thread in this
> > context"?
>
> Not sure I understand... Looking at is_single_threaded() above I guess
> something like below should work (incomplete, in particular we need to
> change first_tid() as well).

So yes, I was thinking something similar, but:

> But a PF_HIDDEN sub-thread will still be visible via /proc/$pid_of_PF_HIDDEN
>
> > We obviously still want to see it for management purposes,
> > so it's not like the thing should be entirely invisible,
>
> Can you explain?

I was literally thinking that instead of a "hidden" flag, it would be
a "self-hidden" flag.

So if somebody _else_ (notably the sysadmin) does "ps" they see the
kernel thread as a subthread.

But when you look at your own /proc/self/task/ listing, you only see
your own explicit threads. So that "is_singlethreaded()" logic works.

Maybe that's just too ugly for words, and the kvm workaround is better.

              Linus
* Re: [GIT PULL] KVM changes for Linux 6.14
From: Oleg Nesterov @ 2025-01-26 18:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paolo Bonzini, Michael S. Tsirkin, Christian Brauner, Eric W. Biederman, linux-kernel, kvm

On 01/26, Linus Torvalds wrote:
>
> On Sun, 26 Jan 2025 at 06:21, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > Should we just add some flag to say "don't show this thread in this
> > > context"?
> >
> > Not sure I understand... Looking at is_single_threaded() above I guess
> > something like below should work (incomplete, in particular we need to
> > change first_tid() as well).
>
> So yes, I was thinking something similar, but:
>
> > But a PF_HIDDEN sub-thread will still be visible via /proc/$pid_of_PF_HIDDEN
> >
> > > We obviously still want to see it for management purposes,
> > > so it's not like the thing should be entirely invisible,
> >
> > Can you explain?
>
> I was literally thinking that instead of a "hidden" flag, it would be
> a "self-hidden" flag.
>
> So if somebody _else_ (notably the sysadmin) does "ps" they see the
> kernel thread as a subthread.
>
> But when you look at your own /proc/self/task/ listing, you only see
> your own explicit threads. So that "is_singlethreaded()" logic works.

Got it... I don't think we even need to detect the /proc/self/ or
/proc/self-thread/ case, next_tid() can just check same_thread_group:

-		if (!(pos->flags & PF_HIDDEN)) {
+		if (!(pos->flags & PF_HIDDEN) || !same_thread_group(current, pos)) {

right?

Oleg.
* Re: [GIT PULL] KVM changes for Linux 6.14
From: Oleg Nesterov @ 2025-01-26 19:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paolo Bonzini, Michael S. Tsirkin, Christian Brauner, Eric W. Biederman, linux-kernel, kvm

On 01/26, Oleg Nesterov wrote:
>
> On 01/26, Linus Torvalds wrote:
> >
> > I was literally thinking that instead of a "hidden" flag, it would be
> > a "self-hidden" flag.
> >
> > So if somebody _else_ (notably the sysadmin) does "ps" they see the
> > kernel thread as a subthread.
> >
> > But when you look at your own /proc/self/task/ listing, you only see
> > your own explicit threads. So that "is_singlethreaded()" logic works.
>
> Got it...
>
> I don't think we even need to detect the /proc/self/ or /proc/self-thread/
> case, next_tid() can just check same_thread_group,
>
> -		if (!(pos->flags & PF_HIDDEN)) {
> +		if (!(pos->flags & PF_HIDDEN) || !same_thread_group(current, pos)) {
>
> right?

Or we can exclude them from the /proc/whatever/task/ listing
unconditionally, and change next_tgid() to report them as if they are
not sub-threads, iow "ps ax" will show all the PF_HIDDEN tasks...

I dunno.

Oleg.
* Re: [GIT PULL] KVM changes for Linux 6.14
From: Linus Torvalds @ 2025-01-26 19:16 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Paolo Bonzini, Michael S. Tsirkin, Christian Brauner, Eric W. Biederman, linux-kernel, kvm

On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I don't think we even need to detect the /proc/self/ or /proc/self-thread/
> case, next_tid() can just check same_thread_group,

That was my thinking yes.

If we exclude them from /proc/*/task entirely, I'd worry that it would
hide it from some management tool and be used for nefarious purposes
(even if they then show up elsewhere that the tool wouldn't look at).

But as mentioned, maybe this is all more of a hack than what kvm now does.

              Linus
* Re: [GIT PULL] KVM changes for Linux 6.14
From: Oleg Nesterov @ 2025-01-27 14:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paolo Bonzini, Michael S. Tsirkin, Christian Brauner, Eric W. Biederman, linux-kernel, kvm

On 01/26, Linus Torvalds wrote:
>
> On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I don't think we even need to detect the /proc/self/ or /proc/self-thread/
> > case, next_tid() can just check same_thread_group,
>
> That was my thinking yes.
>
> If we exclude them from /proc/*/task entirely, I'd worry that it would
> hide it from some management tool and be used for nefarious purposes

Agreed,

> (even if they then show up elsewhere that the tool wouldn't look at).

Even if we move them from /proc/*/task to /proc ?

Perhaps, I honestly do not know what will/can confuse userspace more.

> But as mentioned, maybe this is all more of a hack than what kvm now does.

I don't know. But I will be happy to make a patch if we have a consensus.

Oleg.
* Re: [GIT PULL] KVM changes for Linux 6.14
From: Paolo Bonzini @ 2025-01-27 15:15 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Michael S. Tsirkin, Christian Brauner, Eric W. Biederman, linux-kernel, kvm

On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/26, Linus Torvalds wrote:
> > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/
> > > case, next_tid() can just check same_thread_group,
> >
> > That was my thinking yes.
> >
> > If we exclude them from /proc/*/task entirely, I'd worry that it would
> > hide it from some management tool and be used for nefarious purposes
>
> Agreed,
>
> > (even if they then show up elsewhere that the tool wouldn't look at).
>
> Even if we move them from /proc/*/task to /proc ?

Indeed---as long as they show up somewhere, it's not worse than it
used to be. The reason why I'd prefer them to stay in /proc/*/task is
that moving them away at least partly negates the benefits of the
"workers are children of the starter" model. For example it
complicates measuring their cost within the process that runs the VM.
Maybe it's more of a romantic thing than a real practical issue,
because in the real world resource accounting for VMs is done via
cgroups. But unlike the lazy creation in KVM, which is overall pretty
self-contained, I am afraid the ugliness in procfs would be much worse
compared to the benefit, if there's a benefit at all.

> Perhaps, I honestly do not know what will/can confuse userspace more.

At the very least, marking workers as "Kthread: 1" makes sense and
should not cause too much confusion. I wouldn't go beyond that unless
we get more reports of similar issues, and I'm not even sure how
common it is for userspace libraries to check for single-threadedness.

Paolo

> > But as mentioned, maybe this is all more of a hack than what kvm now does.
>
> I don't know. But I will be happy to make a patch if we have a consensus.
>
> Oleg.
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-01-27 15:15 ` Paolo Bonzini
@ 2025-02-04 14:19 ` Christian Brauner
  2025-02-04 16:05 ` Paolo Bonzini
  0 siblings, 1 reply; 23+ messages in thread
From: Christian Brauner @ 2025-02-04 14:19 UTC (permalink / raw)
To: Paolo Bonzini, Oleg Nesterov, Linus Torvalds
Cc: Michael S. Tsirkin, Eric W. Biederman, linux-kernel, kvm

On Mon, Jan 27, 2025 at 04:15:01PM +0100, Paolo Bonzini wrote:
> On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > On 01/26, Linus Torvalds wrote:
> > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/
> > > > case, next_tid() can just check same_thread_group,
> > >
> > > That was my thinking yes.
> > >
> > > If we exclude them from /proc/*/task entirely, I'd worry that it would
> > > hide it from some management tool and be used for nefarious purposes
> >
> > Agreed,
> >
> > > (even if they then show up elsewhere that the tool wouldn't look at).
> >
> > Even if we move them from /proc/*/task to /proc ?
>
> Indeed---as long as they show up somewhere, it's not worse than it
> used to be. The reason why I'd prefer them to stay in /proc/*/task is
> that moving them away at least partly negates the benefits of the
> "workers are children of the starter" model. For example it
> complicates measuring their cost within the process that runs the VM.
> Maybe it's more of a romantic thing than a real practical issue,
> because in the real world resource accounting for VMs is done via
> cgroups. But unlike the lazy creation in KVM, which is overall pretty
> self-contained, I am afraid the ugliness in procfs would be much worse
> compared to the benefit, if there's a benefit at all.
>
> > Perhaps, I honestly do not know what will/can confuse userspace more.
>
> At the very least, marking workers as "Kthread: 1" makes sense and
> should not cause too much confusion. I wouldn't go beyond that unless
> we get more reports of similar issues, and I'm not even sure how
> common it is for userspace libraries to check for single-threadedness.

Sorry, just saw this thread now.

What if we did what Linus suggests and hide (odd) user workers from
/proc/<pid>/task/* but also added /proc/<pid>/workers/*. The latter
would only list the workers that got spawned by the kernel for that
particular task? This would acknowledge their somewhat special status
and allow userspace to still detect them as "Hey, I didn't actually
spawn those, they got shoved into my workload by the kernel for me.".

(Another (ugly) alternative would be to abuse prctl() and have workloads
opt-in to hiding user workers from /proc/<pid>/task/.)

> Paolo
>
> > But as mentioned, maybe this is all more of a hack than what kvm now does.
> > I don't know. But I will be happy to make a patch if we have a consensus.
> >
> > Oleg.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-02-04 14:19 ` Christian Brauner
@ 2025-02-04 16:05 ` Paolo Bonzini
  2025-02-05 11:49 ` Christian Brauner
  0 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2025-02-04 16:05 UTC (permalink / raw)
To: Christian Brauner
Cc: Oleg Nesterov, Linus Torvalds, Michael S. Tsirkin,
	Eric W. Biederman, linux-kernel, kvm

On Tue, Feb 4, 2025 at 3:19 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Jan 27, 2025 at 04:15:01PM +0100, Paolo Bonzini wrote:
> > On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > > On 01/26, Linus Torvalds wrote:
> > > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote:
> > > > >
> > > > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/
> > > > > case, next_tid() can just check same_thread_group,
> > > >
> > > > That was my thinking yes.
> > > >
> > > > If we exclude them from /proc/*/task entirely, I'd worry that it would
> > > > hide it from some management tool and be used for nefarious purposes
> > >
> > > Agreed,
> > >
> > > > (even if they then show up elsewhere that the tool wouldn't look at).
> > >
> > > Even if we move them from /proc/*/task to /proc ?
> >
> > Indeed---as long as they show up somewhere, it's not worse than it
> > used to be. The reason why I'd prefer them to stay in /proc/*/task is
> > that moving them away at least partly negates the benefits of the
> > "workers are children of the starter" model. For example it
> > complicates measuring their cost within the process that runs the VM.
> > Maybe it's more of a romantic thing than a real practical issue,
> > because in the real world resource accounting for VMs is done via
> > cgroups. But unlike the lazy creation in KVM, which is overall pretty
> > self-contained, I am afraid the ugliness in procfs would be much worse
> > compared to the benefit, if there's a benefit at all.
> >
> > > Perhaps, I honestly do not know what will/can confuse userspace more.
> >
> > At the very least, marking workers as "Kthread: 1" makes sense and
> > should not cause too much confusion. I wouldn't go beyond that unless
> > we get more reports of similar issues, and I'm not even sure how
> > common it is for userspace libraries to check for single-threadedness.
>
> Sorry, just saw this thread now.
>
> What if we did what Linus suggests and hide (odd) user workers from
> /proc/<pid>/task/* but also added /proc/<pid>/workers/*. The latter
> would only list the workers that got spawned by the kernel for that
> particular task? This would acknowledge their somewhat special status
> and allow userspace to still detect them as "Hey, I didn't actually
> spawn those, they got shoved into my workload by the kernel for me.".

Wouldn't the workers then disappear completely from ps, top or other
tools that look at /proc/$PID/task? That seems a bit too underhanded
towards userspace...

Paolo

> (Another (ugly) alternative would be to abuse prctl() and have workloads
> opt-in to hiding user workers from /proc/<pid>/task/.)

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-02-04 16:05 ` Paolo Bonzini
@ 2025-02-05 11:49 ` Christian Brauner
  2025-02-05 16:12 ` Linus Torvalds
  2025-02-26 12:14 ` Christian Brauner
  0 siblings, 2 replies; 23+ messages in thread
From: Christian Brauner @ 2025-02-05 11:49 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Oleg Nesterov, Linus Torvalds, Michael S. Tsirkin,
	Eric W. Biederman, linux-kernel, kvm

On Tue, Feb 04, 2025 at 05:05:06PM +0100, Paolo Bonzini wrote:
> On Tue, Feb 4, 2025 at 3:19 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Mon, Jan 27, 2025 at 04:15:01PM +0100, Paolo Bonzini wrote:
> > > On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > > > On 01/26, Linus Torvalds wrote:
> > > > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote:
> > > > > >
> > > > > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/
> > > > > > case, next_tid() can just check same_thread_group,
> > > > >
> > > > > That was my thinking yes.
> > > > >
> > > > > If we exclude them from /proc/*/task entirely, I'd worry that it would
> > > > > hide it from some management tool and be used for nefarious purposes
> > > >
> > > > Agreed,
> > > >
> > > > > (even if they then show up elsewhere that the tool wouldn't look at).
> > > >
> > > > Even if we move them from /proc/*/task to /proc ?
> > >
> > > Indeed---as long as they show up somewhere, it's not worse than it
> > > used to be. The reason why I'd prefer them to stay in /proc/*/task is
> > > that moving them away at least partly negates the benefits of the
> > > "workers are children of the starter" model. For example it
> > > complicates measuring their cost within the process that runs the VM.
> > > Maybe it's more of a romantic thing than a real practical issue,
> > > because in the real world resource accounting for VMs is done via
> > > cgroups. But unlike the lazy creation in KVM, which is overall pretty
> > > self-contained, I am afraid the ugliness in procfs would be much worse
> > > compared to the benefit, if there's a benefit at all.
> > >
> > > > Perhaps, I honestly do not know what will/can confuse userspace more.
> > >
> > > At the very least, marking workers as "Kthread: 1" makes sense and

You mean in /proc/<pid>/status? Yeah, we can do that. This expands the
definition of Kthread a bit. It would now mean anything that the kernel
spawned for userspace. But that is probably fine.

But it won't help with the problem of just checking /proc/<pid>/task/ to
figure out whether the caller is single-threaded or not. If the caller
has more than 1 entry in there they need to walk through all of them and
parse through /proc/<pid>/status to discount them if they're kernel
threads.

> > > should not cause too much confusion. I wouldn't go beyond that unless
> > > we get more reports of similar issues, and I'm not even sure how
> > > common it is for userspace libraries to check for single-threadedness.
> >
> > Sorry, just saw this thread now.
> >
> > What if we did what Linus suggests and hide (odd) user workers from
> > /proc/<pid>/task/* but also added /proc/<pid>/workers/*. The latter
> > would only list the workers that got spawned by the kernel for that
> > particular task? This would acknowledge their somewhat special status
> > and allow userspace to still detect them as "Hey, I didn't actually
> > spawn those, they got shoved into my workload by the kernel for me.".
>
> Wouldn't the workers then disappear completely from ps, top or other
> tools that look at /proc/$PID/task? That seems a bit too underhanded
> towards userspace...

So maybe, but then there's also the possibility to do:

- Have /proc/<pid>/status list all tasks.
- Have /proc/<pid>/worker only list user workers spawned by the kernel for userspace.

count(/proc/<pid>/status) - count(/proc/<pid>/workers) == 1 => (userspace) single threaded

My wider point is that I would prefer we add something that is
consistent and doesn't have to give the caller a different view than a
foreign task. I think that will just create confusion in the long run.

Btw, checking whether single-threaded this way:

    fn is_single_threaded() -> io::Result<bool> {
        match count_dir_entries("/proc/self/task") {
            Ok(1) => Ok(true),
            Ok(_) => Ok(false),
            Err(e) => Err(e),
        }
    }

can be simplified. It should be sufficient to do:

    stat("/proc/self/task", &st);
    if ((st->st_nlink - 2) == 1)
            // single threaded

Since procfs adds the number of tasks to st_nlink (Which is a bit weird
given that /proc/<pid>/fd puts the number of file descriptors in
st->st_size.).

^ permalink raw reply	[flat|nested] 23+ messages in thread
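The status-file filtering Christian describes can be sketched from userspace. The following is an editorial illustration, not code from the thread: it counts the entries in /proc/<pid>/task and optionally discounts threads whose status file reports "Kthread: 1" (a field that only newer kernels emit; its absence is treated as 0 here).

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count entries in /proc/<pid>/task.  If skip_kthreads is set, also read
 * each thread's status file and skip entries marked "Kthread: 1" (newer
 * kernels only; a missing field is treated as "not a kernel thread").
 * Returns -1 if the task directory cannot be opened.
 */
static int count_tasks(pid_t pid, int skip_kthreads)
{
	char path[128], line[256];
	struct dirent *de;
	DIR *dir;
	int count = 0;

	snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		int kthread = 0;

		/* Skip "." and ".."; tids are purely numeric. */
		if (de->d_name[0] < '0' || de->d_name[0] > '9')
			continue;

		if (skip_kthreads) {
			FILE *f;

			snprintf(path, sizeof(path), "/proc/%d/task/%s/status",
				 (int)pid, de->d_name);
			f = fopen(path, "r");
			if (f) {
				while (fgets(line, sizeof(line), f))
					if (sscanf(line, "Kthread: %d", &kthread) == 1)
						break;
				fclose(f);
			}
		}
		if (!kthread)
			count++;
	}
	closedir(dir);
	return count;
}
```

A caller would treat `count_tasks(pid, 1) == 1` as "single-threaded from userspace's point of view", which is exactly the extra walk-and-parse step Christian points out the Kthread marking still requires.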
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-02-05 11:49 ` Christian Brauner
@ 2025-02-05 16:12 ` Linus Torvalds
  2025-02-26 12:14 ` Christian Brauner
  1 sibling, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2025-02-05 16:12 UTC (permalink / raw)
To: Christian Brauner
Cc: Paolo Bonzini, Oleg Nesterov, Michael S. Tsirkin,
	Eric W. Biederman, linux-kernel, kvm

On Wed, 5 Feb 2025 at 03:49, Christian Brauner <brauner@kernel.org> wrote:
>
> Btw, checking whether single-threaded this can be simplified.
> It should be sufficient to do:
>
>     stat("/proc/self/task", &st);
>     if ((st->st_nlink - 2) == 1)
>             // single threaded
>
> since procfs adds the number of tasks to st_nlink

I'd be careful about depending on st_nlink on strange filesystems,
particularly for directories. And /proc is stranger than most.

So the above may happen to work, but I'm not convinced it always has
had that st_nlink thing. We do it because some tools do end up looking
at n_link to prune recursive directory traversal.

But *most* such tools also know that st_nlink < 2 is special and might
mean "don't know" (because not all filesystems actually count
directory links the way traditional Unix filesystems do).

So relying on /proc acting "normal" seems fragile.

              Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-02-05 11:49 ` Christian Brauner
  2025-02-05 16:12 ` Linus Torvalds
@ 2025-02-26 12:14 ` Christian Brauner
  2025-02-26 19:03 ` Oleg Nesterov
  1 sibling, 1 reply; 23+ messages in thread
From: Christian Brauner @ 2025-02-26 12:14 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Oleg Nesterov, Linus Torvalds, Michael S. Tsirkin,
	Eric W. Biederman, linux-kernel, kvm

On Wed, Feb 05, 2025 at 12:49:30PM +0100, Christian Brauner wrote:
> On Tue, Feb 04, 2025 at 05:05:06PM +0100, Paolo Bonzini wrote:
> > On Tue, Feb 4, 2025 at 3:19 PM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Mon, Jan 27, 2025 at 04:15:01PM +0100, Paolo Bonzini wrote:
> > > > On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > > > > On 01/26, Linus Torvalds wrote:
> > > > > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote:
> > > > > > >
> > > > > > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/
> > > > > > > case, next_tid() can just check same_thread_group,
> > > > > >
> > > > > > That was my thinking yes.
> > > > > >
> > > > > > If we exclude them from /proc/*/task entirely, I'd worry that it would
> > > > > > hide it from some management tool and be used for nefarious purposes
> > > > >
> > > > > Agreed,
> > > > >
> > > > > > (even if they then show up elsewhere that the tool wouldn't look at).
> > > > >
> > > > > Even if we move them from /proc/*/task to /proc ?
> > > >
> > > > Indeed---as long as they show up somewhere, it's not worse than it
> > > > used to be. The reason why I'd prefer them to stay in /proc/*/task is
> > > > that moving them away at least partly negates the benefits of the
> > > > "workers are children of the starter" model. For example it
> > > > complicates measuring their cost within the process that runs the VM.
> > > > Maybe it's more of a romantic thing than a real practical issue,
> > > > because in the real world resource accounting for VMs is done via
> > > > cgroups. But unlike the lazy creation in KVM, which is overall pretty
> > > > self-contained, I am afraid the ugliness in procfs would be much worse
> > > > compared to the benefit, if there's a benefit at all.
> > > >
> > > > > Perhaps, I honestly do not know what will/can confuse userspace more.
> > > >
> > > > At the very least, marking workers as "Kthread: 1" makes sense and
>
> You mean in /proc/<pid>/status? Yeah, we can do that. This expands the
> definition of Kthread a bit. It would now mean anything that the kernel
> spawned for userspace. But that is probably fine.
>
> But it won't help with the problem of just checking /proc/<pid>/task/ to
> figure out whether the caller is single-threaded or not. If the caller
> has more than 1 entry in there they need to walk through all of them and
> parse through /proc/<pid>/status to discount them if they're kernel
> threads.
>
> > > > should not cause too much confusion. I wouldn't go beyond that unless
> > > > we get more reports of similar issues, and I'm not even sure how
> > > > common it is for userspace libraries to check for single-threadedness.
> > >
> > > Sorry, just saw this thread now.
> > >
> > > What if we did what Linus suggests and hide (odd) user workers from
> > > /proc/<pid>/task/* but also added /proc/<pid>/workers/*. The latter
> > > would only list the workers that got spawned by the kernel for that
> > > particular task? This would acknowledge their somewhat special status
> > > and allow userspace to still detect them as "Hey, I didn't actually
> > > spawn those, they got shoved into my workload by the kernel for me.".
> >
> > Wouldn't the workers then disappear completely from ps, top or other
> > tools that look at /proc/$PID/task? That seems a bit too underhanded
> > towards userspace...
>
> So maybe, but then there's also the possibility to do:
>
> - Have /proc/<pid>/status list all tasks.
> - Have /proc/<pid>/worker only list user workers spawned by the kernel for userspace.
>
> count(/proc/<pid>/status) - count(/proc/<pid>/workers) == 1 => (userspace) single threaded
>
> My wider point is that I would prefer we add something that is
> consistent and doesn't have to give the caller a different view than a
> foreign task. I think that will just create confusion in the long run.

So what I had in mind was (quickly sketched) the rough draft below. This
will unconditionally skip PF_USER_WORKER tasks in /proc/<pid>/task and
will only list them in /proc/<pid>/worker.

 fs/proc/base.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 116 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index cd89e956c322..60e6b2cea259 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3315,10 +3315,13 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
  * Thread groups
  */
 static const struct file_operations proc_task_operations;
+static const struct file_operations proc_worker_operations;
 static const struct inode_operations proc_task_inode_operations;
+static const struct inode_operations proc_worker_inode_operations;
 
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task", S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
+	DIR("worker", S_IRUGO|S_IXUGO, proc_worker_inode_operations, proc_worker_operations),
 	DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo", S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations),
@@ -3835,11 +3838,14 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	fs_info = proc_sb_info(dentry->d_sb);
 	ns = fs_info->pid_ns;
-	rcu_read_lock();
-	task = find_task_by_pid_ns(tid, ns);
-	if (task)
-		get_task_struct(task);
-	rcu_read_unlock();
+	scoped_guard(rcu) {
+		task = find_task_by_pid_ns(tid, ns);
+		if (task) {
+			if (task->flags & PF_USER_WORKER)
+				goto out;
+			get_task_struct(task);
+		}
+	}
 	if (!task)
 		goto out;
 	if (!same_thread_group(leader, task))
@@ -3949,7 +3955,7 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
 	tid = (int)(intptr_t)file->private_data;
 	file->private_data = NULL;
 	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
-	     task;
+	     task && !(task->flags & PF_USER_WORKER);
 	     task = next_tid(task), ctx->pos++) {
 		char name[10 + 1];
 		unsigned int len;
@@ -3987,6 +3993,97 @@ static int proc_task_getattr(struct mnt_idmap *idmap,
 	return 0;
 }
 
+static struct dentry *
+proc_worker_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	struct task_struct *task;
+	struct task_struct *leader = get_proc_task(dir);
+	unsigned tid;
+	struct proc_fs_info *fs_info;
+	struct pid_namespace *ns;
+	struct dentry *result = ERR_PTR(-ENOENT);
+
+	if (!leader)
+		goto out_no_task;
+
+	tid = name_to_int(&dentry->d_name);
+	if (tid == ~0U)
+		goto out;
+
+	fs_info = proc_sb_info(dentry->d_sb);
+	ns = fs_info->pid_ns;
+	scoped_guard(rcu) {
+		task = find_task_by_pid_ns(tid, ns);
+		if (task) {
+			if (!(task->flags & PF_USER_WORKER))
+				goto out;
+			get_task_struct(task);
+		}
+	}
+	if (!task)
+		goto out;
+	if (!same_thread_group(leader, task))
+		goto out_drop_task;
+
+	result = proc_task_instantiate(dentry, task, NULL);
+out_drop_task:
+	put_task_struct(task);
+out:
+	put_task_struct(leader);
+out_no_task:
+	return result;
+}
+
+static int proc_worker_getattr(struct mnt_idmap *idmap, const struct path *path,
+			       struct kstat *stat, u32 request_mask,
+			       unsigned int query_flags)
+{
+	generic_fillattr(&nop_mnt_idmap, request_mask, d_inode(path->dentry), stat);
+	return 0;
+}
+
+static int proc_worker_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct task_struct *task;
+	struct pid_namespace *ns;
+	int tid;
+
+	if (proc_inode_is_dead(inode))
+		return -ENOENT;
+
+	if (!dir_emit_dots(file, ctx))
+		return 0;
+
+	/* We cache the tgid value that the last readdir call couldn't
+	 * return and lseek resets it to 0.
+	 */
+	ns = proc_pid_ns(inode->i_sb);
+	tid = (int)(intptr_t)file->private_data;
+	file->private_data = NULL;
+	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
+	     task && (task->flags & PF_USER_WORKER);
+	     task = next_tid(task), ctx->pos++) {
+		char name[10 + 1];
+		unsigned int len;
+
+		tid = task_pid_nr_ns(task, ns);
+		if (!tid)
+			continue;	/* The task has just exited. */
+		len = snprintf(name, sizeof(name), "%u", tid);
+		if (!proc_fill_cache(file, ctx, name, len,
+				     proc_task_instantiate, task, NULL)) {
+			/* returning this tgid failed, save it as the first
+			 * pid for the next readir call */
+			file->private_data = (void *)(intptr_t)tid;
+			put_task_struct(task);
+			break;
+		}
+	}
+
+	return 0;
+}
+
 /*
  * proc_task_readdir() set @file->private_data to a positive integer
  * value, so casting that to u64 is safe. generic_llseek_cookie() will
@@ -4005,6 +4102,19 @@ static loff_t proc_dir_llseek(struct file *file, loff_t offset, int whence)
 	return off;
 }
 
+static const struct inode_operations proc_worker_inode_operations = {
+	.lookup		= proc_worker_lookup,
+	.getattr	= proc_worker_getattr,
+	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
+};
+
+static const struct file_operations proc_worker_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_worker_readdir,
+	.llseek		= proc_dir_llseek,
+};
+
 static const struct inode_operations proc_task_inode_operations = {
 	.lookup		= proc_task_lookup,
 	.getattr	= proc_task_getattr,
-- 
2.47.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-02-26 12:14 ` Christian Brauner
@ 2025-02-26 19:03 ` Oleg Nesterov
  2025-02-27  8:15 ` Christian Brauner
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2025-02-26 19:03 UTC (permalink / raw)
To: Christian Brauner
Cc: Paolo Bonzini, Linus Torvalds, Michael S. Tsirkin,
	Eric W. Biederman, linux-kernel, kvm

Sorry, didn't have time to actually read this patch, but after a quick
glance...

On 02/26, Christian Brauner wrote:
>
> @@ -3949,7 +3955,7 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
>  	tid = (int)(intptr_t)file->private_data;
>  	file->private_data = NULL;
>  	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
> -	     task;
> +	     task && !(task->flags & PF_USER_WORKER);

unless I am totally confused this looks "obviously wrong".

proc_task_readdir() should not stop if it sees a PF_USER_WORKER task, this
check should go into first_tid/next_tid.

Oleg.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-02-26 19:03 ` Oleg Nesterov
@ 2025-02-27  8:15 ` Christian Brauner
  0 siblings, 0 replies; 23+ messages in thread
From: Christian Brauner @ 2025-02-27 8:15 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Paolo Bonzini, Linus Torvalds, Michael S. Tsirkin,
	Eric W. Biederman, linux-kernel, kvm

On Wed, Feb 26, 2025 at 08:03:23PM +0100, Oleg Nesterov wrote:
> Sorry, didn't have time to actually read this patch, but after a quick
> glance...
>
> On 02/26, Christian Brauner wrote:
> >
> > @@ -3949,7 +3955,7 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
> >  	tid = (int)(intptr_t)file->private_data;
> >  	file->private_data = NULL;
> >  	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
> > -	     task;
> > +	     task && !(task->flags & PF_USER_WORKER);
>
> unless I am totally confused this looks "obviously wrong".
>
> proc_task_readdir() should not stop if it sees a PF_USER_WORKER task, this
> check should go into first_tid/next_tid.

It's really a draft as I said. I'm more interested in whether this is a
viable idea to separate kernel spawned workers into /proc/<pid>/worker
and not show them in /proc/<pid>/task or if this is a non-starter. If so
then I'll send an actual patch that also doesn't include
code-duplication to no end. ;)

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-01-24 16:37 [GIT PULL] KVM changes for Linux 6.14 Paolo Bonzini
  2025-01-25 14:30 ` Marc Zyngier
  2025-01-25 18:12 ` Linus Torvalds
@ 2025-01-25 18:16 ` Linus Torvalds
  2025-01-27 15:24 ` Sean Christopherson
  2025-01-27 15:25 ` Paolo Bonzini
  2025-01-25 18:30 ` pr-tracker-bot
  3 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2025-01-25 18:16 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: linux-kernel, kvm

On Fri, 24 Jan 2025 at 08:38, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> but you can throw away the <<<< ... ==== part completely, and apply the
> same change on top of the new implementation:
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index edef30359c19..9f9a29be3beb 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1177,6 +1177,7 @@ void kvm_set_cpu_caps(void)
>                 EMULATED_F(NO_SMM_CTL_MSR),
>                 /* PrefetchCtlMsr */
>                 F(WRMSR_XX_BASE_NS),
> +               F(SRSO_USER_KERNEL_NO),
>                 SYNTHESIZED_F(SBPB),
>                 SYNTHESIZED_F(IBPB_BRTYPE),
>                 SYNTHESIZED_F(SRSO_NO),

Ehh. My resolution ended up being different.

I did this instead:

                F(WRMSR_XX_BASE_NS),
                SYNTHESIZED_F(SBPB),
                SYNTHESIZED_F(IBPB_BRTYPE),
                SYNTHESIZED_F(SRSO_NO),
+               SYNTHESIZED_F(SRSO_USER_KERNEL_NO),

which (apart from the line ordering) differs from your suggestion in
F() vs SYNTHESIZED_F().

That really seemed to be the RightThing(tm) to do from the context of
the two conflicting commits, but maybe there was some reason that I
didn't catch that you kept it as a plain "F()".

So please take a look, and if I screwed up send me a fix (with a
scathing explanation for why I'm maternally related to some
less-than-gifted rodentia with syphilis).

              Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-01-25 18:16 ` Linus Torvalds
@ 2025-01-27 15:24 ` Sean Christopherson
  0 siblings, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2025-01-27 15:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paolo Bonzini, linux-kernel, kvm

On Sat, Jan 25, 2025, Linus Torvalds wrote:
> On Fri, 24 Jan 2025 at 08:38, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > but you can throw away the <<<< ... ==== part completely, and apply the
> > same change on top of the new implementation:
> >
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index edef30359c19..9f9a29be3beb 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1177,6 +1177,7 @@ void kvm_set_cpu_caps(void)
> >                 EMULATED_F(NO_SMM_CTL_MSR),
> >                 /* PrefetchCtlMsr */
> >                 F(WRMSR_XX_BASE_NS),
> > +               F(SRSO_USER_KERNEL_NO),
> >                 SYNTHESIZED_F(SBPB),
> >                 SYNTHESIZED_F(IBPB_BRTYPE),
> >                 SYNTHESIZED_F(SRSO_NO),
>
> Ehh. My resolution ended up being different.
>
> I did this instead:
>
>                 F(WRMSR_XX_BASE_NS),
>                 SYNTHESIZED_F(SBPB),
>                 SYNTHESIZED_F(IBPB_BRTYPE),
>                 SYNTHESIZED_F(SRSO_NO),
> +               SYNTHESIZED_F(SRSO_USER_KERNEL_NO),
>
> which (apart from the line ordering) differs from your suggestion in
> F() vs SYNTHESIZED_F().
>
> That really seemed to be the RightThing(tm) to do from the context of
> the two conflicting commits, but maybe there was some reason that I
> didn't catch that you kept it as a plain "F()".

Heh, I waffled on whether SRSO_USER_KERNEL_NO should be F() or
SYNTHESIZED_F() when the initial commit went in.  I would prefer to keep
it F(), though it doesn't matter terribly at the moment.

The "synthesized" features are for cases where the kernel stuffs
X86_FEATURE_xxx via set_cpu_cap() even when the feature isn't present in
CPUID, and it's correct for KVM to relay the synthesized feature to the
guest.  E.g. SRSO_NO is synthesized into cpu_caps for Zen1/2, and in that
case the absence of the SRSO flaw extends to the guest as well.

	if (boot_cpu_data.x86 < 0x19 && !cpu_smt_possible()) {
		setup_force_cpu_cap(X86_FEATURE_SRSO_NO);
		return;
	}

For SRSO_USER_KERNEL_NO, it's currently not force set, i.e. it's a pure
reflection of hardware capabilities.  Treating it as synthesized is
effectively a nop with the current code, but that would change if the
kernel were to force set the flag.

If a future commit force set SRSO_USER_KERNEL_NO because of a ucode
update that didn't also modify CPUID behavior, then treating the flag as
synthesized would be desirable, e.g. so that the guest could also avoid
the overhead of mitigating SRSO.  But if a future commit set the flag for
some other reason, e.g. if the kernel somehow isn't vulnerable even when
running on buggy hardware, then enumerating SRSO_USER_KERNEL_NO to the
guest could cause the guest kernel to incorrectly skip its mitigation.

My vote is to err on the side of caution and go with F().

^ permalink raw reply	[flat|nested] 23+ messages in thread
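The distinction Sean draws can be modeled with a toy userspace mock. This is purely illustrative; the types and helpers below are invented for the example and are not the kernel's implementation. The idea is that F() advertises only what raw hardware CPUID enumerates, while SYNTHESIZED_F() also relays capability bits the kernel force-set for itself.

```c
#include <assert.h>
#include <stdint.h>

#define FEAT_SRSO_NO              (1u << 0)
#define FEAT_SRSO_USER_KERNEL_NO  (1u << 1)

/* Toy model of a CPU as the kernel sees it. */
struct cpu_model {
	uint32_t raw_cpuid;	/* what the hardware actually enumerates */
	uint32_t kernel_caps;	/* raw_cpuid plus any bits force-set via
				 * something like setup_force_cpu_cap() */
};

/* F()-style: the guest sees the bit only if hardware CPUID reports it. */
static uint32_t advertise_f(const struct cpu_model *c, uint32_t feat)
{
	return c->raw_cpuid & feat;
}

/* SYNTHESIZED_F()-style: the guest also sees bits the kernel stuffed
 * into its capability word itself. */
static uint32_t advertise_synthesized_f(const struct cpu_model *c, uint32_t feat)
{
	return c->kernel_caps & feat;
}
```

With a Zen1/2-like model, where SRSO_NO is absent from CPUID but force-set by the kernel, advertise_f() hides the bit from the guest while advertise_synthesized_f() relays it, which is exactly why SRSO_NO is SYNTHESIZED_F() and a pure hardware-reflection bit like SRSO_USER_KERNEL_NO can stay F().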
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-01-25 18:16 ` Linus Torvalds
  2025-01-27 15:24 ` Sean Christopherson
@ 2025-01-27 15:25 ` Paolo Bonzini
  1 sibling, 0 replies; 23+ messages in thread
From: Paolo Bonzini @ 2025-01-27 15:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, kvm

On Sat, Jan 25, 2025 at 7:16 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Ehh. My resolution ended up being different.
>
> I did this instead:
>
>                 F(WRMSR_XX_BASE_NS),
>                 SYNTHESIZED_F(SBPB),
>                 SYNTHESIZED_F(IBPB_BRTYPE),
>                 SYNTHESIZED_F(SRSO_NO),
> +               SYNTHESIZED_F(SRSO_USER_KERNEL_NO),
>
> which (apart from the line ordering) differs from your suggestion in
> F() vs SYNTHESIZED_F().
>
> That really seemed to be the RightThing(tm) to do from the context of
> the two conflicting commits, but maybe there was some reason that I
> didn't catch that you kept it as a plain "F()".

SYNTHESIZED_F() generally is used together with setup_force_cpu_cap(),
i.e. when it makes sense to present the feature even if cpuid does not
have it *and* the VM is not able to see the difference. You use it when
mitigations on the host automatically protect the guest as well.

For example, F() vs. SYNTHESIZED_F() makes a difference for
X86_FEATURE_SRSO_NO because F() hides the feature from the guests and
SYNTHESIZED_F() lets them use it.

It doesn't hurt at all in this case, or make a difference for that
matter, because there's no
setup_force_cpu_cap(X86_FEATURE_SRSO_USER_KERNEL_NO). But here using
SYNTHESIZED_F is just a little less self-documenting and a little less
future proof, nothing that a quick follow-up PR can't fix, and also I
managed to pull the KVM/ARM changes from the wrong machine so I have to
send a second KVM pull request anyway.

> So please take a look, and if I screwed up send me a fix (with a
> scathing explanation for why I'm maternally related to some
> less-than-gifted rodentia with syphilis).

I think I don't want to know if it's a Finnish metaphor, or you came up
with it all on your own...

Paolo

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [GIT PULL] KVM changes for Linux 6.14
  2025-01-24 16:37 [GIT PULL] KVM changes for Linux 6.14 Paolo Bonzini
                   ` (2 preceding siblings ...)
  2025-01-25 18:16 ` Linus Torvalds
@ 2025-01-25 18:30 ` pr-tracker-bot
  3 siblings, 0 replies; 23+ messages in thread
From: pr-tracker-bot @ 2025-01-25 18:30 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: torvalds, linux-kernel, kvm

The pull request you sent on Fri, 24 Jan 2025 11:37:41 -0500:

> https://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/0f8e26b38d7ac72b3ad764944a25dd5808f37a6e

Thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 23+ messages in thread
end of thread, other threads:[~2025-02-27 8:15 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-24 16:37 [GIT PULL] KVM changes for Linux 6.14 Paolo Bonzini
2025-01-25 14:30 ` Marc Zyngier
2025-01-25 18:12 ` Linus Torvalds
2025-01-25 18:31 ` Linus Torvalds
2025-01-27 3:55 ` Eric W. Biederman
2025-01-26 14:20 ` Oleg Nesterov
2025-01-26 18:34 ` Linus Torvalds
2025-01-26 18:53 ` Oleg Nesterov
2025-01-26 19:03 ` Oleg Nesterov
2025-01-26 19:16 ` Linus Torvalds
2025-01-27 14:09 ` Oleg Nesterov
2025-01-27 15:15 ` Paolo Bonzini
2025-02-04 14:19 ` Christian Brauner
2025-02-04 16:05 ` Paolo Bonzini
2025-02-05 11:49 ` Christian Brauner
2025-02-05 16:12 ` Linus Torvalds
2025-02-26 12:14 ` Christian Brauner
2025-02-26 19:03 ` Oleg Nesterov
2025-02-27 8:15 ` Christian Brauner
2025-01-25 18:16 ` Linus Torvalds
2025-01-27 15:24 ` Sean Christopherson
2025-01-27 15:25 ` Paolo Bonzini
2025-01-25 18:30 ` pr-tracker-bot