* [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
@ 2026-04-28 22:29 Pasha Tatashin
2026-04-29 8:13 ` Alexander Graf
2026-04-30 13:28 ` Paolo Bonzini
0 siblings, 2 replies; 12+ messages in thread
From: Pasha Tatashin @ 2026-04-28 22:29 UTC (permalink / raw)
To: linux-kernel, kexec, kvm, linux-mm, kvmarm
Cc: pasha.tatashin, rppt, graf, pratyush, pbonzini, seanjc, maz,
oupton, dwmw2, alex.williamson, kevin.tian, rientjes,
Tycho.Andersen, anthony.yznaga, baolu.lu, david, dmatlack, mheyne,
jgowans, jgg, pankaj.gupta.linux, kpraveen.lkml, vipinsh,
vannapurve, corbet, loeser, tglx, mingo, bp, dave.hansen, x86,
hpa, roman.gushchin, akpm, pjt
Hi all,
TL;DR: Below is a proposal for the next step in Live Update
functionality: maintaining vCPU execution during a host kernel reboot
via a "Caretaker" for orphaned VMs.
As cloud infrastructure continues to push toward zero-downtime host
maintenance, extending the capabilities of kexec-based Live Update to
minimize guest disruption is becoming increasingly critical.
I would greatly appreciate your thoughts on the overall architecture,
the proposed ABI, and any security or hardware-specific edge cases we
might have missed.
Background
==========
Currently, Live Update allows us to preserve hardware resources across a
kexec boundary, enabling VMs to be re-attached and resumed on the new
kernel. This resource preservation is orchestrated from the kernel side
by the LUO (https://docs.kernel.org/core-api/liveupdate.html). This
proposal outlines how we can extend this foundation to keep the VMs
running rather than just suspended during the transition, significantly
reducing the effective blackout experienced by the virtual machines.
Orphaned VMs
============
Definition: An Orphaned VM is a virtual machine actively executing guest
instructions on isolated physical hardware, completely decoupled from a
Host Operating System or a userspace Virtual Machine Monitor (VMM).
Historically, a VM's lifecycle has been strictly tied to a userspace VMM
process. The Orphaned VM strategy breaks this dependency by separating
resource ownership from active execution.
Before the old kernel shuts down, it uses the Live Update Orchestrator
to maintain ownership of the guest's underlying resources. The LUO
preserves the vmfd and vcpufd structures, guest memory, secondary page
tables, and other critical KVM metadata required to successfully restore
the VM state in the new kernel.
During the transition, the active execution of the VM is not managed by
the host kernel. Instead, execution is handed off to a specialized
bare-metal component: The Caretaker.
While the VM is "orphaned," it operates entirely outside of standard
userspace and kernel space. The physical CPU (pCPU) hosting the VM is
completely isolated from the Host OS. The vCPUs continue to execute
guest instructions uninterrupted, and any VM Exits are trapped and
handled exclusively by the Caretaker. This isolated execution continues
while the host kernel reboots on separate management cores (or while
the VMM restarts, if the Live Update is limited to a userspace
component).
The Caretaker
=============
The Caretaker is a specialized, identity-mapped bare-metal executable
attached to each vCPU. It is installed during the initial VM setup,
remains permanently enabled as an interpose layer between the Guest and
KVM, and is the key architectural feature required for a VM to be
orphaned. By providing an execution environment independent of the host
OS, the Caretaker enables the guest workload to safely survive the host
lifecycle transition.
During normal VM operation, the Caretaker acts as a fast-path shim: it
forwards standard VM Exits down to the backing KVM kernel module and
handles certain exits autonomously. However, during the "Management Gap"
(when no host OS is active and standard KVM handlers are offline), the
Caretaker enters a standalone mode to ensure continuous guest execution
without host intervention.
While this proposal focuses on its critical role in minimally disruptive
Live Update, the Caretaker is fundamentally designed as an extensible
primitive. Its architecture allows it to be leveraged for a variety of
other advanced virtualization use cases, such as running custom
lightweight hypervisors or completely offloading virtualization duties
to an accelerator card.
Constraints
-----------
The Caretaker is a bare-metal executable. It does not have a backing
kernel, it makes zero syscalls, and it lacks access to underlying
hardware devices or complex I/O components.
Hardware Setup & ABI Interface
==============================
To interpose the Caretaker between the guest and the host OS, the
initialization sequence must load the bare-metal payload, rewire the
physical CPU's virtualization structures, and establish a syscall-free
communication channel.
API Design & Caretaker Installation
-----------------------------------
The Caretaker is installed early in the VM's lifecycle (e.g., shortly
after KVM_CREATE_VM). To manage this, we introduce a new KVM ioctl
(KVM_SET_CARETAKER) that configures the shim on the vcpufd
(alternatively on vmfd). Userspace provides the Caretaker payload
compiled as an ELF binary.
Using an ELF binary allows the Caretaker to be separated into distinct
memory sections:
* .text: The executable bare-metal instructions. This section is
mapped as read-execute (RX).
* .data / .rodata: The static data and variables. This includes the
pre-populated per-vCPU metadata (such as CPUID topology responses)
generated by the VMM.
* .ccb: A dedicated section specifically reserved for the Caretaker
Control Block.
During installation, the userspace VMM opens the ELF file and passes a
file descriptor to the KVM_SET_CARETAKER ioctl. Using a structured ELF
payload keeps the Caretaker's logic highly flexible and allows the
architecture to be reused for broader features, such as injecting a
custom lightweight hypervisor or unconditionally forwarding VM Exits to
an accelerator card.
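For illustration only, here is a minimal sketch of what the new uapi
could look like. The structure layout, field names, and ioctl number are
assumptions, not a finalized ABI:

  /* Hypothetical uapi sketch for KVM_SET_CARETAKER; not a committed ABI. */
  #include <linux/kvm.h>
  #include <linux/types.h>

  struct kvm_set_caretaker {
          __u32 elf_fd;       /* fd of the Caretaker ELF, parsed in-kernel */
          __u32 flags;        /* e.g. install on this vcpufd vs. the whole vmfd */
          __u64 reserved[6];
  };

  /* Ioctl number chosen arbitrarily here. */
  #define KVM_SET_CARETAKER _IOW(KVMIO, 0xd0, struct kvm_set_caretaker)

The VMM would open(2) the Caretaker ELF, fill in the structure, and issue
ioctl(vcpu_fd, KVM_SET_CARETAKER, &args) right after KVM_CREATE_VCPU.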
Hardware Interposition
----------------------
During the execution of the KVM_SET_CARETAKER ioctl, instead of
pointing the hardware's return path to standard KVM entry points (e.g.,
vmx_vmexit or svm_vcpu_run), KVM reprograms the host-state return area
of the CPU's hardware virtualization control structures (e.g., Intel
VMCS, AMD VMCB, or ARM equivalent) to point directly into the
bare-metal Caretaker environment.
Specifically, the following critical host-state fields (using x86
terminology for illustration) are updated:
* HOST_RIP: Set to point to the e_entry (entry point) of the provided
ELF .text segment. Whenever the guest triggers a VM Exit, the CPU
hardware will unconditionally jump to this address.
* HOST_RSP: Programmed to point to a specialized, pre-allocated
bare-metal stack dedicated strictly to this vCPU's Caretaker. This
allows it to execute completely independently of the host kernel's
normal thread stacks.
* HOST_SSP and HOST_INTR_SSP_TABLE: KVM pre-allocates a Shadow Stack
for the Caretaker. These VMCS fields are programmed during
initialization to ensure that RET and IRET instructions executed by
the Caretaker do not trigger fatal #CP faults when the host kernel
is detached.
* HOST_GS_BASE: Programmed to point to the CCB/Shared Metadata for
this vCPU.
* HOST_CR3: Configured to point to the Caretaker's isolated,
identity-mapped page tables, ensuring memory fetch safety when the
host kernel's page tables are torn down.
Note on Optimization vs. Security: Constantly switching the page table
(CR3) on every VM Exit can be expensive due to TLB flushing. To
optimize performance, the Caretaker can share the host kernel's page
tables while the kernel is still around, and dynamically replace
HOST_CR3 with the dedicated, isolated page tables only when the vCPU is
orphaned (during the detachment phase). On the other hand, maintaining
a permanently isolated CR3 for the Caretaker adds a strong security
boundary, achieving hardware-enforced separation similar to KVM Address
Space Isolation (ASI).
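For concreteness, below is a rough x86/VMX sketch of the host-state
rewiring performed by KVM_SET_CARETAKER. vmcs_writel() and the HOST_*
field names are existing VMX/KVM primitives; the caretaker_env structure
and its fields are illustrative assumptions:

  /*
   * Illustrative sketch: point the VMCS host-state area at the Caretaker
   * so that a VM Exit lands there instead of in vmx_vmexit.
   */
  struct caretaker_env {
          unsigned long entry;        /* e_entry of the loaded .text segment */
          unsigned long stack_top;    /* dedicated bare-metal stack */
          unsigned long shadow_stack; /* pre-allocated CET shadow stack */
          unsigned long ccb_base;     /* Caretaker Control Block (.ccb) */
          unsigned long cr3;          /* identity-mapped page tables */
  };

  static void caretaker_program_host_state(const struct caretaker_env *env)
  {
          vmcs_writel(HOST_RIP, env->entry);
          vmcs_writel(HOST_RSP, env->stack_top);
          vmcs_writel(HOST_GS_BASE, env->ccb_base);
          /* The isolated CR3 may be deferred to detach time, per the note above. */
          vmcs_writel(HOST_CR3, env->cr3);
          if (cpu_feature_enabled(X86_FEATURE_SHSTK))
                  vmcs_writel(HOST_SSP, env->shadow_stack);
  }

HOST_IDTR/HOST_GDTR handling for the detached case is covered under
"Stray Hardware Interrupts" below.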
KVM-Caretaker ABI
=================
The Caretaker requires a defined ABI to communicate with the host KVM
subsystem. This ABI is implemented via the shared, identity-mapped .ccb
section of the ELF payload, acting as the Caretaker Control Block
(CCB).
The CCB acts as the source of truth for the Caretaker's execution loop
and contains three primary elements:
* Attachment State Flag: An atomic variable indicating the current
relationship with the host KVM subsystem (e.g., KVM_ATTACHED or
KVM_DETACHED).
* KVM Routing Pointers: The physical function pointers that the
Caretaker uses to safely jump into the host KVM's standard VM Exit
handlers when operating in normal mode.
* Shared Configuration Metadata: A physical pointer to dedicated
memory pages used by the kernel to share dynamic vCPU configuration
data with the Caretaker. Because every guest is configured
differently, KVM populates these pages with the specific parameters
negotiated during VM initialization (such as CPUID feature masks,
APIC routing, and timer states). These pages also include a
pre-allocated Telemetry Buffer for the Caretaker to log VM Exits
and spin-wait durations. These dedicated pages are explicitly
preserved across the host reboot via KHO, ensuring the Caretaker
maintains continuous access to the exact context required to
accurately emulate trivial exits during the gap.
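As a concrete illustration of these three elements, the CCB could be laid
out roughly as follows (field names, types, and sizes are assumptions,
not a committed layout):

  #include <linux/types.h>

  /* Illustrative Caretaker Control Block; lives in the .ccb ELF section. */
  enum ccb_attach_state {
          KVM_ATTACHED = 0,
          KVM_DETACHED = 1,
  };

  struct caretaker_ccb {
          /* Attachment State Flag, flipped atomically by KVM. */
          __u32 state;                /* enum ccb_attach_state */
          __u32 abi_version;

          /* KVM Routing Pointers into the host's standard exit handlers. */
          __u64 kvm_exit_handler;     /* physical entry point into KVM */
          __u64 kvm_host_rsp;         /* host kernel stack to restore */
          __u64 kvm_host_cr3;         /* host kernel page tables to restore */

          /* Shared Configuration Metadata, preserved across kexec via KHO. */
          __u64 shared_meta_pa;       /* CPUID masks, APIC routing, timer state */
          __u64 telemetry_pa;         /* Telemetry Buffer, see below */
  };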
Caretaker VM Exit Flow
======================
Upon a VM Exit, the Caretaker's execution flow follows a routing hierarchy:
The Fast Path
-------------
The Caretaker first evaluates the VM Exit reason. If the exit belongs to
a category that the Caretaker is programmed to resolve natively, it
handles it internally. For example, profiling of guests has identified
the following exit categories for potential local resolution:
* Guest Idle Exits (e.g., HLT): When the guest OS goes idle, it
triggers idle exits. The Caretaker intercepts these and halts the
physical core until the next guest-bound interrupt fires, preserving
host power.
* Timer and APIC Exits: Even an idle guest frequently writes to
interrupt controllers and system registers to configure internal
timers. The Caretaker handles these trivial writes directly,
acknowledging the timer updates.
* CPUID and System Registers: Exits like CPUID or safe system register
accesses are resolved by reading pre-computed responses from the
shared metadata pages, allowing the guest to be serviced accurately
without host involvement.
Host Routing
------------
If the VM Exit requires KVM/VMM intervention (e.g., a Page Fault or
emulated device I/O), the Caretaker cannot resolve it locally. It must
check the CCB attachment state flag to determine where to route the
exit:
* If KVM_ATTACHED: The host kernel is actively managing the system.
The Caretaker acts as a fast-path trampoline, setting up the
standard host registers and transferring execution to the KVM
Routing Pointers.
* If KVM_DETACHED: The host kernel is offline for Live Update. The
Caretaker places the specific vCPU into a safe "spin-wait" polling
loop, continuously checking the CCB flag for a change.
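Combining the fast path with the routing check, the Caretaker's top-level
loop could look roughly like the sketch below. The EXIT_REASON_* constants
and READ_ONCE()/cpu_relax() are existing kernel primitives;
handle_locally(), jump_to_kvm(), and enter_guest() are illustrative
placeholders:

  /* Illustrative top-level Caretaker exit loop (x86/VMX naming). */
  static void caretaker_exit_loop(struct caretaker_ccb *ccb, u32 exit_reason)
  {
          for (;;) {
                  switch (exit_reason) {
                  case EXIT_REASON_HLT:
                  case EXIT_REASON_CPUID:
                  case EXIT_REASON_APIC_WRITE:
                          /* Fast path: resolve natively from shared metadata. */
                          handle_locally(ccb, exit_reason);
                          break;
                  default:
                          if (READ_ONCE(ccb->state) == KVM_DETACHED) {
                                  /* Host offline: spin-wait until it returns. */
                                  while (READ_ONCE(ccb->state) != KVM_ATTACHED)
                                          cpu_relax();
                          }
                          /* Trampoline into the host KVM exit handler. */
                          jump_to_kvm(ccb, exit_reason);  /* does not return here */
                  }
                  exit_reason = enter_guest(ccb);  /* VMRESUME, wait for next exit */
          }
  }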
Preservation of the vCPU
========================
For this orchestration to work across a host OS replacement, the file
descriptor associated with the vCPU must outlive the userspace process
that created it, leveraging the LUO:
* Creation: The VMM initially creates the vcpufd via the standard
KVM_CREATE_VCPU ioctl on the VM-wide kvmfd. Immediately after
creation, the VMM issues the KVM_SET_CARETAKER ioctl on the newly
created vcpufd. This installs the Caretaker as an interpose layer,
allocating the CCB and rewiring the hardware virtualization control
structures (e.g., HOST_RIP) for this specific vCPU. A dedicated
userspace thread then takes ownership of this file descriptor to
drive the KVM_RUN loop.
* LUO Registration & Isolation: In preparation for minimally
disruptive Live Update, the VMM acquires a session from the
userspace luo-agent and registers the vcpufd using
LIVEUPDATE_SESSION_PRESERVE_FD. KVM's LUO .preserve() handler is
invoked. Prior to this step, the VMM must pause emulated devices;
the guest relies on VFIO pass-through for primary I/O to survive
the gap without triggering VMM-bound exits. Crucially, the
.preserve() phase is the moment when the pCPU is isolated from the
host OS. The kernel finalizes the state, offlines the core (from
the OS perspective), transitions the vCPU fully into the Caretaker,
and preserves the required KVM data to KHO.
* The Gap: When the VMM process exits prior to the kexec transition,
open file descriptors are normally destroyed. However, because the
vcpufd is registered with LUO, the kernel holds a reference to the
underlying struct file to ensure it survives the reboot. While the
VMM has exited, the isolated pCPU appears completely offline to the
host OS. During the boot of the Next Kernel, the core smp_init()
routine parses the LUO FLB data and explicitly skips initializing
these preserved pCPUs. This shields the dedicated cores from reset
signals, allowing the guest workload to continue uninterrupted.
* Reclamation: The Next Kernel boots and LUO deserializes the
session. When the new VMM process spawns, it retrieves the
preserved session and issues LIVEUPDATE_SESSION_RETRIEVE_FD using
its token. LUO invokes KVM's .retrieve() callback to map the
preserved vcpufd back into the new VMM's file descriptor table. As
part of this retrieval process, the host formally brings the
isolated pCPU back online, and the new VMM userspace thread is
attached back to the active VM thread running on the vCPU. Finally,
KVM populates the new KVM Routing Pointers in the CCB and
atomically flips the Host State Flag back to KVM_ATTACHED. This
breaks the Caretaker's spin-wait loop (if it is in this state),
allowing standard KVM operation to resume.
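From the VMM's perspective, the preserve/retrieve sequence could look
roughly like the sketch below. The LIVEUPDATE_SESSION_* ioctl names follow
the text above, but the argument structures and the session-fd handling
are assumptions for illustration:

  #include <sys/ioctl.h>
  #include <linux/types.h>

  /* Old-kernel VMM: register the vcpufd with the LUO session (sketch). */
  static int preserve_vcpu(int session_fd, int vcpu_fd, __u64 token)
  {
          struct liveupdate_preserve_fd args = {
                  .fd    = vcpu_fd,
                  .token = token,   /* shared with the new VMM out of band */
          };

          /* Emulated devices are already paused; passthrough I/O keeps running. */
          return ioctl(session_fd, LIVEUPDATE_SESSION_PRESERVE_FD, &args);
  }

  /* New-kernel VMM: retrieve the same vcpufd after the kexec (sketch). */
  static int retrieve_vcpu(int session_fd, __u64 token)
  {
          struct liveupdate_retrieve_fd args = { .token = token };

          if (ioctl(session_fd, LIVEUPDATE_SESSION_RETRIEVE_FD, &args) < 0)
                  return -1;
          return args.fd;   /* KVM flips the CCB back to KVM_ATTACHED here */
  }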
Gap Observability and Telemetry
===============================
During the host transition, standard host-level observability tools
(e.g., ftrace, perf, eBPF) are completely offline. To ensure production
readiness, site reliability, and precise latency accounting, the
Caretaker must act as a standalone micro-profiler while the host is
detached.
The architecture includes a pre-allocated, identity-mapped Telemetry
Buffer, passed to the Caretaker via the Shared Configuration Metadata
during installation. During the gap, the Caretaker writes to this
memory region to record execution data:
* Exit Counters: The Caretaker increments hardware-specific counters
for every VM Exit reason it encounters. This provides a precise
tally of natively handled events (e.g., HLT, CPUID, or APIC writes)
that occurred while the host was offline.
* Spin-Wait Profiling: If a complex exit forces the Caretaker to
block, it reads the hardware time stamp counter (e.g., RDTSC on
x86, CNTVCT_EL0 on ARM64) immediately before entering the spin-wait
loop. It reads the counter again the exact moment the CCB flag
flips back to KVM_ATTACHED. The Caretaker records the specific exit
reason that caused the block and the total accumulated stall time.
* Post-Gap Ingestion: Upon successful reattachment, the newly booted
host KVM subsystem parses this Telemetry Buffer. The recorded data
is exported via debugfs (e.g., /sys/kernel/debug/kvm/).
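One possible shape for the Telemetry Buffer, shown only to make the
counters and spin-wait accounting concrete (the layout and field names
are assumptions):

  #include <linux/types.h>

  /* Illustrative per-vCPU Telemetry Buffer written by the Caretaker. */
  #define CT_MAX_EXIT_REASONS 128   /* sized to cover the architecture's exit codes */

  struct caretaker_telemetry {
          /* Per-exit-reason tallies of natively handled events during the gap. */
          __u64 exit_count[CT_MAX_EXIT_REASONS];

          /* Spin-wait profiling for blocking exits. */
          __u32 blocking_exit_reason;   /* exit reason that forced the spin-wait */
          __u32 blocking_exit_count;
          __u64 spin_wait_start;        /* RDTSC / CNTVCT_EL0 before parking */
          __u64 spin_wait_total;        /* accumulated stall cycles */
  };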
Challenges During Gap
=====================
While the Caretaker manages standard VM Exits, completely offlining a
physical CPU introduces critical edge cases regarding asynchronous
hardware interrupts. Below are the primary architectural challenges and
their proposed mitigations.
VFIO Interrupt Routing
----------------------
* The Problem: The design relies on VFIO pass-through for primary
I/O to survive the gap. However, when a physical device (e.g., an
NVMe drive) completes a read, it sends a hardware interrupt
(MSI/MSI-X) back to the CPU. By default, external hardware
interrupts cause a VM Exit. Because the Caretaker natively lacks
Linux IRQ routing tables, it has no immediate mechanism to inject
that specific device interrupt into the guest's virtual APIC or
GIC. The guest would stall indefinitely waiting for the I/O
completion.
* Proposed Solution: The deployment may utilize hardware-accelerated
interrupt routing, such as Posted Interrupts (Intel PI), Advanced
Virtual Interrupt Controller (AMD AVIC), or GICv4 Direct Virtual
Interrupt Injection (ARM64). This hardware feature allows the
IOMMU/SMMU to route physical device interrupts directly into the
guest's virtual interrupt controller without causing a VM Exit.
Alternatively, KVM could share its IRQ routing tables with the
Caretaker via the Shared Configuration Metadata during setup. This
would allow the Caretaker to intercept the hardware interrupt and
manually inject the virtual interrupt into the guest in software.
Guest-to-Guest IPIs
-------------------
* The Problem: If the guest OS attempts to wake up a sleeping thread,
one orphaned vCPU will send an Inter-Processor Interrupt (IPI) to
another orphaned vCPU. In standard virtualization without hardware
assistance, writing to the APIC ICR (or sending an ARM SGI) causes
a VM Exit so the host KVM can emulate the message delivery. During
the gap, KVM is unavailable to route this message.
* Proposed Solution: The architecture may leverage hardware virtualized
interrupts (Intel APICv, AMD AVIC, or ARM GICv4.1 virtual SGIs).
This allows the hardware silicon to handle IPI delivery between the
isolated pCPUs natively, eliminating the VM Exit. Alternatively,
the Caretaker can be programmed to emulate the IPI delivery. By
utilizing the shared memory metadata, the Caretaker can determine
the target vCPU and directly update its pending interrupt state.
Stray Hardware Interrupts
-------------------------
* The Problem: What happens if a Non-Maskable Interrupt (NMI), a
hardware timer tick, or a Machine Check Exception / System Error
(MCE / ARM SError) arrives while the CPU is actively executing
Caretaker code in KVM_DETACHED mode?
* Proposed Solution: To safely handle these asynchronous events, the
Caretaker payload should establish and load its own minimal,
self-contained Interrupt Descriptor Table (IDT on x86) or Exception
Vector Table (via VBAR_EL2 on ARM64) during initialization. This
ensures that if a stray hardware event or NMI arrives while
executing Caretaker instructions in host mode, the CPU can catch
the fault and park the core in a known state rather than triggering
a catastrophic host reboot.
On x86, when transitioning into the gap, KVM explicitly programs
HOST_IDTR and HOST_GDTR to these self-contained tables. If an NMI
or stray hardware event arrives, the CPU catches the fault using
the Caretaker's native handlers, parking the core or logging the
event rather than attempting host recovery. KVM also programs
HOST_INTR_SSP_TABLE within the Caretaker's isolated environment so
that if the exception handlers execute an IRET, the hardware's CET
shadow stack unroll succeeds without triggering a #CP exception.
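A minimal sketch of the "park the core" behavior described above. The
table-setup and logging helpers here are assumptions private to the
Caretaker payload, not existing kernel APIs:

  /* Illustrative Caretaker stray-event handling (x86 flavor). */
  #define CT_NR_VECTORS 256

  /* Common tail for every Caretaker exception/NMI stub. */
  static void ct_handle_stray(unsigned int vector)
  {
          ct_telemetry_log_event(vector);   /* assumed Caretaker-local helper */
          for (;;)
                  asm volatile("cli; hlt"); /* park in a known state */
  }

  static void ct_setup_exception_tables(void)
  {
          unsigned int v;

          for (v = 0; v < CT_NR_VECTORS; v++)
                  ct_set_idt_gate(v, ct_stray_stub[v]); /* assumed helpers/stubs */

          ct_load_idt();   /* lidt on the Caretaker's private IDT */
          ct_load_gdt();   /* matching private GDT, referenced by HOST_GDTR */
  }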
Timekeeping Drift
-----------------
* The Problem: While the guest is orphaned, it continues executing
and reading time via the physical CPU's hardware Time Stamp Counter
(TSC or ARM generic timer). Because the host kernel is offline,
software-based paravirtualized clocks (such as kvmclock) are no
longer being updated by host background threads. Furthermore, when
the Next Kernel boots, its calculation of host CLOCK_MONOTONIC or
wall-time might drift from the old kernel. If the new KVM
subsystem resets the VM's TSC offsets or updates the PV clock
structures with standard initialization values upon adoption, the
guest could experience a time jump.
* Proposed Solution: During the gap, the guest relies entirely on the
physical CPU's Invariant TSC (which the hardware automatically
offsets natively via the VMCS/VMCB) for continuous timekeeping. To
ensure safe reattachment, KVM must serialize the exact state of the
guest's PV clocks, TSC offsets, and the old kernel's base reference
times into KHO memory during the LUO .preserve() phase. Upon
adoption, the new KVM subsystem must synchronize its internal
tracking with this preserved data, ensuring that any subsequent
updates to the guest's PV clock memory guarantee a strictly
monotonic and smooth time progression.
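For example, the clock state serialized into KHO at .preserve() time
might look like the structure below (a sketch; the field names are
assumptions, not an agreed KHO schema):

  #include <linux/types.h>

  /* Illustrative per-VM clock state preserved across the kexec. */
  struct kvm_preserved_clock_state {
          __u64 tsc_offset;           /* guest TSC offset programmed in VMCS/VMCB */
          __u64 tsc_scaling_ratio;
          __u64 kvmclock_base_ns;     /* old kernel's monotonic base for kvmclock */
          __u64 kvmclock_base_tsc;    /* host TSC captured at that base */
          __u64 wall_clock_ns;        /* old kernel's wall-time reference */
          __u8  pvclock_flags;        /* e.g. PVCLOCK_TSC_STABLE_BIT */
          __u8  pad[7];
  };

The new KVM subsystem would read this record during .retrieve() and
re-derive its internal kvmclock tracking from it rather than
re-initializing from scratch.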
Nested Page Table Updates
-------------------------
* The Problem: As the guest executes, it may attempt to access memory
that has not yet been mapped by the hypervisor, or it may interact
with MMIO regions. Normally, this triggers an EPT Violation (Intel)
or NPT Page Fault (AMD), prompting KVM to allocate host pages and
update the secondary page tables. How are these updates handled
when the host KVM subsystem is offline during the gap?
* Proposed Solution: During the "Management Gap," there are absolutely
no updates made to the NPT/EPT. The existing secondary page tables
are fully preserved in memory via LUO kvmfd preservation prior to
detachment, allowing the guest to seamlessly access all previously
mapped memory. If the guest triggers a new page fault (requiring an
NPT/EPT update) during the gap, the Caretaker simply categorizes it
as a Blocking Exit.
Compromised Caretaker
---------------------
* The Problem: The Caretaker runs in Host Mode. If left unprotected,
this could allow a lightly privileged userspace process (e.g., QEMU
or crosvm) to inject arbitrary executable code directly into the
CPU's most privileged hardware state (VMX Root / Ring 0 / EL2).
* Proposed Solution: To mitigate this risk, the KVM_SET_CARETAKER
ioctl may adopt the security model used by the kexec_file_load()
syscall. Rather than trusting userspace to pass physical addresses,
the kernel must take full ownership of payload validation:
- In-Kernel ELF Parsing: Instead of passing raw segment addresses
(which is vulnerable to manipulation), the VMM passes a file
descriptor for the Caretaker binary. The host KVM subsystem then
performs the ELF parsing entirely in-kernel. This guarantees that
the kernel controls exactly where the .text, .data, and .ccb
sections are mapped, preventing userspace from tricking the
kernel into overwriting sensitive host memory.
- Signature Verification & Secure Boot: If the host is running with
Secure Boot or Kernel Lockdown enabled, KVM mandates that the
Caretaker ELF binary be cryptographically signed. The kernel
verifies the signature against the system's trusted keyring
(e.g., .builtin_trusted_keys) before loading it. An unsigned or
modified payload is outright rejected.
- IMA (Integrity Measurement Architecture) Integration: The loading
of the Caretaker is hooked directly into the kernel's IMA
subsystem. The binary is measured (hashed and extended into the
hardware TPM) for remote attestation and appraised against local
security policies before execution is permitted.
Caretaker Update
----------------
* The Problem: Given that the Caretaker is permanently installed
during VM setup, how does it get updated on long-running VMs?
* Proposed Solution: The Caretaker can only be updated while the vCPUs
are not isolated (i.e., not during the live update gap):
- vCPU Quiescence: The userspace VMM issues the KVM_SET_CARETAKER
ioctl. KVM then calls kvm_vcpu_kick() to force the target vCPU to
exit the guest and return to the host kernel.
- KVM parses the new ELF. It allocates fresh memory pages for the
new .text and .data segments.
- The kernel populates the new .ccb with the existing VM context.
- After the new environment is fully staged, KVM updates the
physical CPU's virtualization control structures.
- When the vCPU thread resumes and re-enters the guest, the next
VM Exit will trigger the hardware to jump into the new Caretaker
payload. The old memory segments are then safely freed.
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-04-28 22:29 [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update Pasha Tatashin
@ 2026-04-29 8:13 ` Alexander Graf
2026-04-29 8:40 ` David Woodhouse
2026-04-29 16:02 ` Pasha Tatashin
2026-04-30 13:28 ` Paolo Bonzini
1 sibling, 2 replies; 12+ messages in thread
From: Alexander Graf @ 2026-04-29 8:13 UTC (permalink / raw)
To: Pasha Tatashin, linux-kernel, kexec, kvm, linux-mm, kvmarm
Cc: rppt, pratyush, pbonzini, seanjc, maz, oupton, dwmw2,
alex.williamson, kevin.tian, rientjes, Tycho.Andersen,
anthony.yznaga, baolu.lu, david, dmatlack, mheyne, jgowans, jgg,
pankaj.gupta.linux, kpraveen.lkml, vipinsh, vannapurve, corbet,
loeser, tglx, mingo, bp, dave.hansen, x86, hpa, roman.gushchin,
akpm, pjt, Petrongonas, Evangelos
On 29.04.26 00:29, Pasha Tatashin wrote:
> Hi all,
>
> TL;DR: Below is a proposal for the next step in Live Update
> functionality: maintaining vCPU execution during a host kernel reboot
> via a "Caretaker" for orphaned VMs.
>
> As cloud infrastructure continues to push toward zero-downtime host
> maintenance, extending the capabilities of kexec-based Live Update to
> minimize guest disruption is becoming increasingly critical.
>
> I would greatly appreciate your thoughts on the overall architecture,
> the proposed ABI, and any security or hardware-specific edge cases we
> might have missed.
>
> Background
> ==========
>
> Currently, Live Update allows us to preserve hardware resources across a
> kexec boundary, enabling VMs to be re-attached and resumed on the new
> kernel. This resource preservation is orchestrated from the kernel side
> by the LUO (https://docs.kernel.org/core-api/liveupdate.html). This
> proposal outlines how we can extend this foundation to keep the VMs
> running rather than just suspended during the transition, significantly
> reducing the effective blackout experienced by the virtual machines.
>
> Orphaned VMs
> ============
>
> Definition: An Orphaned VM is a virtual machine actively executing guest
> instructions on isolated physical hardware, completely decoupled from a
> Host Operating System or a userspace Virtual Machine Monitor (VMM).
>
> Historically, a VM's lifecycle has been strictly tied to a userspace VMM
> process. The Orphaned VM strategy breaks this dependency by separating
> resource ownership from active execution.
>
> Before the old kernel shuts down, it uses the Live Update Orchestrator
> to maintain ownership of the guest's underlying resources. The LUO
> preserves the vmfd and vcpufd structures, guest memory, secondary page
> tables, and other critical KVM metadata required to successfully restore
> the VM state in the new kernel.
>
> During the transition, the active execution of the VM is not managed by
> the host kernel. Instead, execution is handed off to a specialized
> bare-metal component: The Caretaker.
>
> While the VM is "orphaned," it operates entirely outside of standard
> userspace and kernel space. The physical CPU (pCPU) hosting the VM is
> completely isolated from the Host OS. The vCPUs continue to execute
> guest instructions uninterrupted, and any VM Exits are trapped and
> handled exclusively by the Caretaker. This isolated execution continues
> while the host kernel reboots on separate management cores (or while
> the VMM restarts, if the Live Update is limited to a userspace
> component).
>
> The Caretaker
> =============
>
> The Caretaker is a specialized, identity-mapped bare-metal executable
> attached to each vCPU. While it is installed during the initial VM setup
> and remains permanently enabled as an interpose layer between the Guest
> and KVM, it is the key architectural feature required for a VM to be
> orphaned. By providing an execution environment independent of the host
> OS, the Caretaker enables the guest workload to safely survive the host
> lifecycle transition.
>
> During normal VM operation, the Caretaker acts as a fast-path shim: it
> forwards standard VM Exits down to the backing KVM kernel module and
> handles certain exits autonomously. However, during the "Management Gap"
> (when no host OS is active and standard KVM handlers are offline), the
> Caretaker enters a standalone mode to ensure continuous guest execution
> without host intervention.
>
> While this proposal focuses on its critical role in minimally disruptive
> Live Update, the Caretaker is fundamentally designed as an extensible
> primitive. Its architecture allows it to be leveraged for a variety of
> other advanced virtualization use cases, such as running custom
> lightweight hypervisors or completely offloading virtualization duties
> to an accelerator card.
>
> Constraints
> -----------
>
> The Caretaker is a bare-metal executable. It does not have a backing
> kernel, it makes zero syscalls, and it lacks access to underlying
> hardware devices or complex I/O components.
I think you still want to define an ABI with the outer execution
environment to allow primitives such as timers. The same the other way
around: You want to define multiple entry points, so that you can for
example have one that says "You're up for running this vCPU now" or
"Stop all your work, serialize your state".
>
> Hardware Setup & ABI Interface
> ==============================
>
> To interpose the Caretaker between the guest and the host OS, the
> initialization sequence must load the bare-metal payload, rewire the
> physical CPU's virtualization structures, and establish a syscall-free
> communication channel.
>
> API Design & Caretaker Installation
> -----------------------------------
>
> The Caretaker is installed early in the VM's lifecycle (e.g., shortly
> after KVM_CREATE_VM). To manage this, we introduce a new KVM ioctl
> (KVM_SET_CARETAKER) that configures the shim on the vcpufd
> (alternatively on vmfd). Userspace provides the Caretaker payload
> compiled as an ELF binary.
Yikes. So you get a random "unprivileged user space injects binary code
into the kernel" primitive? Not great.
I think it would be better to think of the caretaker as a separate
subsystem. Maybe even a kernel module that you load (that way it goes
through the same signing logic as everything else). It can then take a
KVM fd and cooperate in-kernel with KVM to do the hand-over.
Or alternatively you follow a model where the caretaker is always
compiled in as part of KVM. That makes the outgoing ABI easier, but
complicates the incoming path a bit.
>
> Using an ELF binary allows the Caretaker to be separated into distinct
> memory sections:
>
> * .text: The executable bare-metal instructions. This section is
> mapped as read-execute (RX).
> * .data / .rodata: The static data and variables. This includes the
> pre-populated per-vCPU metadata (such as CPUID topology responses)
> generated by the VMM.
> * .ccb: A dedicated section specifically reserved for the Caretaker
> Control Block.
>
> During installation, the userspace VMM opens the ELF file and passes a
> file descriptor to the KVM_SET_CARETAKER ioctl. By utilizing a
> structured ELF payload, the Caretaker's logic remains highly flexible,
> and allows the architecture to be utilized for broader features, such
> as injecting a custom lightweight hypervisor, or unconditionally
> forwarding VM exits to an accelerator card.
Again, no way you want to have a random kernel binary load API :).
> Hardware Interposition
> ----------------------
>
> During the execution of the KVM_SET_CARETAKER ioctl, instead of
> pointing the hardware's return path to standard KVM entry points (e.g.,
> vmx_vmexit or svm_vcpu_run), KVM reprograms the host-state return area
> of the CPU's hardware virtualization control structures (e.g., Intel
> VMCS, AMD VMCB, or ARM equivalent) to point directly into the
> bare-metal Caretaker environment.
>
> Specifically, the following critical host-state fields (using x86
> terminology for illustration) are updated:
>
> * HOST_RIP: Set to point to the e_entry (entry point) of the provided
> ELF .text segment. Whenever the guest triggers a VM Exit, the CPU
> hardware will unconditionally jump to this address.
> * HOST_RSP: Programmed to point to a specialized, pre-allocated
> bare-metal stack dedicated strictly to this vCPU's Caretaker. This
> allows it to execute completely independently of the host kernel's
> normal thread stacks.
> * HOST_SSP and HOST_INTR_SSP_TABLE: KVM pre-allocates a Shadow Stack
> for the Caretaker. These VMCS fields are programmed during
> initialization to ensure that RET and IRET instructions executed by
> the Caretaker do not trigger fatal #CP faults when the host kernel
> is detached.
> * HOST_GS_BASE: Programmed to point to the CCB/Shared Metadata for
> this vCPU.
> * HOST_CR3: Configured to point to the Caretaker's isolated,
> identity-mapped page tables, ensuring memory fetch safety when the
> host kernel's page tables are torn down.
>
> Note on Optimization vs. Security: Constantly switching the page table
> (CR3) on every VM Exit can be expensive due to TLB flushing. To
> optimize performance, the Caretaker can share the host kernel's page
> tables while the kernel is still around, and dynamically replace
> HOST_CR3 with the dedicated, isolated page tables only when the vCPU is
> orphaned (during the detachment phase). On the other hand, maintaining
> a permanently isolated CR3 for the Caretaker adds a strong security
> boundary, achieving hardware-enforced separation similar to KVM Address
> Space Isolation (ASI).
This means you're pulling the rug out from under Linux while executing that code.
How will this work in practice? Won't Linux get upset and scream that it
can no longer access the CPU? What happens on IPI?
> KVM-Caretaker ABI
> =================
>
> The Caretaker requires a defined ABI to communicate with the host KVM
> subsystem. This ABI is implemented via the shared, identity-mapped .ccb
> section of the ELF payload, acting as the Caretaker Control Block
> (CCB).
>
> The CCB acts as the source of truth for the Caretaker's execution loop
> and contains three primary elements:
>
> * Attachment State Flag: An atomic variable indicating the current
> relationship with the host KVM subsystem (e.g., KVM_ATTACHED or
> KVM_DETACHED).
> * KVM Routing Pointers: The physical function pointers that the
> Caretaker uses to safely jump into the host KVM's standard VM Exit
> handlers when operating in normal mode.
> * Shared Configuration Metadata: A physical pointer to dedicated
> memory pages used by the kernel to share dynamic vCPU configuration
> data with the Caretaker. Because every guest is configured
> differently, KVM populates these pages with the specific parameters
> negotiated during VM initialization (such as CPUID feature masks,
> APIC routing, and timer states). These pages also include a
> pre-allocated Telemetry Buffer for the Caretaker to log VM Exits
> and spin-wait durations. These dedicated pages are explicitly
> preserved across the host reboot via KHO, ensuring the Caretaker
> maintains continuous access to the exact context required to
> accurately emulate trivial exits during the gap.
Why not create a clear handover point? You are either running with KVM
and its user space vmm or you are running this "caretaker vmm" which
then runs its own binary hypervisor thing. That way the handover can
also be done fully through UAPI. You could even implement the same KVM
API on a caretaker fd for example and just serialize/deserialize the
vCPU state.
Or if we embed it into KVM proper, we can probably define a KHO
representation of that state and reuse that. Then ingesting it on the
incoming environment is the same problem you already have to solve with
KHO compatibility.
>
> Caretaker VM Exit Flow
> ======================
>
> Upon vmexit, the Caretaker's execution flow follows a routing hierarchy:
>
> The Fast Path
> -------------
>
> The Caretaker first evaluates the VM Exit reason. If the exit belongs to
> a category that the Caretaker is programmed to resolve natively, it
> handles it internally. For example, profiling of guests has identified
> the following exit categories for potential local resolution:
>
> * Guest Idle Exits (e.g., HLT): When the guest OS goes idle, it
> triggers idle exits. The Caretaker intercepts these and halts the
> physical core until the next guest-bound interrupt fires, preserving
> host power.
> * Timer and APIC Exits: Even an idle guest frequently writes to
> interrupt controllers and system registers to configure internal
> timers. The Caretaker handles these trivial writes directly,
> acknowledging the timer updates.
> * CPUID and System Registers: Exits like CPUID or safe system register
> accesses are resolved by reading pre-computed responses from the
> shared metadata pages to accurately service the guest dynamically.
>
> Host Routing
> ------------
>
> If the VM Exit requires KVM/VMM intervention (e.g., a Page Fault or
> emulated device I/O), the Caretaker cannot resolve it locally. It must
> check the CCB attachment state flag to determine where to route the
> exit:
>
> * If KVM_ATTACHED: The host kernel is actively managing the system.
> The Caretaker acts as a fast-path trampoline, setting up the
> standard host registers and transferring execution to the KVM
> Routing Pointers.
> * If KVM_DETACHED: The host kernel is offline for Live Update. The
> Caretaker places the specific vCPU into a safe "spin-wait" polling
> loop, continuously checking the CCB flag for a change.
>
> Preservation of the vCPU
> ========================
>
> For this orchestration to work across a host OS replacement, the file
> descriptor associated with the vCPU must outlive the userspace process
> that created it, leveraging the LUO:
>
> * Creation: The VMM initially creates the vcpufd via the standard
> KVM_CREATE_VCPU ioctl on the VM-wide kvmfd. Immediately after
> creation, the VMM issues the KVM_SET_CARETAKER ioctl on the newly
> created vcpufd. This installs the Caretaker as an interpose layer,
> allocating the CCB and rewiring the hardware virtualization control
> structures (e.g., HOST_RIP) for this specific vCPU. A dedicated
> userspace thread then takes ownership of this file descriptor to
> drive the KVM_RUN loop.
>
> * LUO Registration & Isolation: In preparation for minimally
> disruptive Live Update, the VMM acquires a session from the
> userspace luo-agent and registers the vcpufd using
> LIVEUPDATE_SESSION_PRESERVE_FD. KVM's LUO .preserve() handler is
> invoked. Prior to this step, the VMM must pause emulated devices;
> the guest relies on VFIO pass-through for primary I/O to survive
> the gap without triggering VMM-bound exits. Crucially, the
> .preserve() phase is the moment when the pCPU is isolated from the
> host OS. The kernel finalizes the state, offlines the core (from
> the OS perspective), transitions the vCPU fully into the Caretaker,
> and preserves the required KVM data to KHO.
>
> * The Gap: When the VMM process exits prior to the kexec transition,
> open file descriptors are normally destroyed. However, because the
> vcpufd is registered with LUO, the kernel holds a reference to the
> underlying struct file to ensure it survives the reboot. While the
> VMM has exited, the isolated pCPU appears completely offline to the
> host OS. During the boot of the Next Kernel, the core smp_init()
> routine parses the LUO FLB data and explicitly skips initializing
> these preserved pCPUs. This shields the dedicated cores from reset
> signals, allowing the guest workload to continue uninterrupted.
>
> * Reclamation: The Next Kernel boots and LUO deserializes the
> session. When the new VMM process spawns, it retrieves the
> preserved session and issues LIVEUPDATE_SESSION_RETRIEVE_FD using
> its token. LUO invokes KVM's .retrieve() callback to map the
> preserved vcpufd back into the new VMM's file descriptor table. As
> part of this retrieval process, the host formally brings the
> isolated pCPU back online, and the new VMM userspace thread is
> attached back to the active VM thread running on the vCPU. Finally,
> KVM populates the new KVM Routing Pointers in the CCB and
> atomically flips the Host State Flag back to KVM_ATTACHED. This
> breaks the Caretaker's spin-wait loop (if it is in this state),
> allowing standard KVM operation to resume.
How does the caretaker continue execution across kexec?
>
> Gap Observability and Telemetry
> ===============================
>
> During the host transition standard host-level observability tools
> (e.g., ftrace, perf, eBPF) are completely offline. To ensure production
> readiness, site reliability, and precise latency accounting, the
> Caretaker must act as a standalone micro-profiler while the host is
> detached.
>
> The architecture includes a pre-allocated, identity-mapped Telemetry
> Buffer, passed to the Caretaker via the Shared Configuration Metadata
> during installation. During the gap, the Caretaker writes to this
> memory region to record execution data:
>
> * Exit Counters: The Caretaker increments hardware-specific counters
> for every VM Exit reason it encounters. This provides a precise
> tally of natively handled events (e.g., HLT, CPUID, or APIC writes)
> that occurred while the host was offline.
>
> * Spin-Wait Profiling: If a complex exit forces the Caretaker to
> block, it reads the hardware time stamp counter (e.g., RDTSC on
> x86, CNTVCT_EL0 on ARM64) immediately before entering the spin-wait
> loop. It reads the counter again the exact moment the CCB flag
> flips back to KVM_ATTACHED. The Caretaker records the specific exit
> reason that caused the block and the total accumulated stall time.
>
> * Post-Gap Ingestion: Upon successful reattachment, the newly booted
> host KVM subsystem parses this Telemetry Buffer. The recorded data
> is exported via debugfs (e.g., /sys/kernel/debug/kvm/).
>
> Challenges During Gap
> =====================
>
> While the Caretaker manages standard VM Exits, completely offlining a
> physical CPU introduces critical edge cases regarding asynchronous
> hardware interrupts. Below are the primary architectural challenges and
> their proposed mitigations.
>
> VFIO Interrupt Routing
> ----------------------
>
> * The Problem: The design relies on VFIO pass-through for primary
> I/O to survive the gap. However, when a physical device (e.g., an
> NVMe drive) completes a read, it sends a hardware interrupt
> (MSI/MSI-X) back to the CPU. By default, external hardware
> interrupts cause a VM Exit. Because the Caretaker natively lacks
> Linux IRQ routing tables, it has no immediate mechanism to inject
> that specific device interrupt into the guest's virtual APIC or
> GIC. The guest would stall indefinitely waiting for the I/O
> completion.
>
> * Proposed Solution: The deployment may utilize hardware-accelerated
> interrupt routing, such as Posted Interrupts (Intel PI), Advanced
> Virtual Interrupt Controller (AMD AVIC), or GICv4 Direct Virtual
> Interrupt Injection (ARM64). This hardware feature allows the
> IOMMU/SMMU to route physical device interrupts directly into the
> guest's virtual interrupt controller without causing a VM Exit.
> Alternatively, KVM could share its IRQ routing tables with the
> Caretaker via the Shared Configuration Metadata during setup. This
> would allow the Caretaker to intercept the hardware interrupt and
> manually inject the virtual interrupt into the guest in software.
>
> Guest-to-Guest IPIs
> -------------------
>
> * The Problem: If the guest OS attempts to wake up a sleeping thread,
> one orphaned vCPU will send an Inter-Processor Interrupt (IPI) to
> another orphaned vCPU. In standard virtualization without hardware
> assistance, writing to the APIC ICR (or sending an ARM SGI) causes
> a VM Exit so the host KVM can emulate the message delivery. During
> the gap, KVM is unavailable to route this message.
>
> * Proposed Solution: The architecture may leverage hardware virtualized
> interrupts (Intel APICv, AMD AVIC, or ARM GICv4.1 virtual SGIs).
> This allows the hardware silicon to handle IPI delivery between the
> isolated pCPUs natively, eliminating the VM Exit. Alternatively,
> the Caretaker can be programmed to emulate the IPI delivery. By
> utilizing the shared memory metadata, the Caretaker can determine
> the target vCPU and directly update its pending interrupt state.
>
> Stray Hardware Interrupts
> -------------------------
>
> * The Problem: What happens if a Non-Maskable Interrupt (NMI), a
> hardware timer tick, or a Machine Check Exception / System Error
> (MCE / ARM SError) arrives while the CPU is actively executing
> Caretaker code in KVM_DETACHED mode?
>
> * Proposed Solution: To safely handle these asynchronous events, the
> Caretaker payload should establish and load its own minimal,
> self-contained Interrupt Descriptor Table (IDT on x86) or Exception
> Vector Table (via VBAR_EL2 on ARM64) during initialization. This
> ensures that if a stray hardware event or NMI arrives while
> executing Caretaker instructions in host mode, the CPU can catch
> the fault and park the core in a known state rather than triggering
> a catastrophic host reboot.
>
> On x86, when transitioning into the gap, KVM explicitly programs
> HOST_IDTR and HOST_GDTR to these self-contained tables. If an NMI
> or stray hardware event arrives, the CPU catches the fault using
> the Caretaker's native handlers, parking the core or logging the
> event rather than attempting host recovery. KVM also programs
> HOST_INTR_SSP_TABLE within the Caretaker's isolated environment so
> that if the exception handlers execute an IRET, the hardware's CET
> shadow stack unroll succeeds without triggering a #CP exception.
>
> Timekeeping Drift
> -----------------
>
> * The Problem: While the guest is orphaned, it continues executing
> and reading time via the physical CPU's hardware Time Stamp Counter
> (TSC or ARM generic timer). Because the host kernel is offline,
> software-based paravirtualized clocks (such as kvmclock) are no
> longer being updated by host background threads. Furthermore, when
> the Next Kernel boots, its calculation of host CLOCK_MONOTONIC or
> wall-time might drift from the old kernel. If the new KVM
> subsystem resets the VM's TSC offsets or updates the PV clock
> structures with standard initialization values upon adoption, the
> guest could experience a time jump.
>
> * Proposed Solution: During the gap, the guest relies entirely on the
> physical CPU's Invariant TSC (which the hardware automatically
> offsets natively via the VMCS/VMCB) for continuous timekeeping. To
> ensure safe reattachment, KVM must serialize the exact state of the
> guest's PV clocks, TSC offsets, and the old kernel's base reference
> times into KHO memory during the LUO .preserve() phase. Upon
> adoption, the new KVM subsystem must synchronize its internal
> tracking with this preserved data, ensuring that any subsequent
> updates to the guest's PV clock memory guarantee a strictly
> monotonic and smooth time progression.
>
> Nested Page Table Updates
> -------------------------
>
> * The Problem: As the guest executes, it may attempt to access memory
> that has not yet been mapped by the hypervisor, or it may interact
> with MMIO regions. Normally, this triggers an EPT Violation (Intel)
> or NPT Page Fault (AMD), prompting KVM to allocate host pages and
> update the secondary page tables. How are these updates handled
> when the host KVM subsystem is offline during the gap?
>
> * Proposed Solution: During the "Management Gap," there are absolutely
> no updates made to the NPT/EPT. The existing secondary page tables
> are fully preserved in memory via LUO kvmfd preservation prior to
> detachment, allowing the guest to seamlessly access all previously
> mapped memory. If the guest triggers a new page fault (requiring an
> NPT/EPT update) during the gap, the Caretaker simply categorizes it
> as a Blocking Exit.
You may want to prefault all nested page tables before you switch to
detached mode.
>
> Compromised Caretaker
> ---------------------
>
> * The Problem: The Caretaker runs in Host Mode. If left unprotected,
> this could allow a lightly privileged userspace process (e.g., QEMU
> or crosvm) to inject arbitrary executable code directly into the
> CPU's most privileged hardware state (VMX Root / Ring 0 / EL2).
>
> * Proposed Solution: To mitigate this risk, the KVM_SET_CARETAKER
> ioctl may adopt the security model used by the kexec_file_load()
> syscall. Rather than trusting userspace to pass physical addresses,
> the kernel must take full ownership of payload validation:
>
> - In-Kernel ELF Parsing: Instead of passing raw segment addresses
> (which is vulnerable to manipulation), the VMM passes a file
> descriptor for the Caretaker binary. The host KVM subsystem then
> performs the ELF parsing entirely in-kernel. This guarantees that
> the kernel controls exactly where the .text, .data, and .ccb
> sections are mapped, preventing userspace from tricking the
> kernel into overwriting sensitive host memory.
>
> - Signature Verification & Secure Boot: If the host is running with
> Secure Boot or Kernel Lockdown enabled, KVM mandates that the
> Caretaker ELF binary be cryptographically signed. The kernel
> verifies the signature against the system's trusted keyring
> (e.g., .builtin_trusted_keys) before loading it. An unsigned or
> modified payload is outright rejected.
>
> - IMA (Integrity Measurement Architecture) Integration: The loading
> of the Caretaker is hooked directly into the kernel's IMA
> subsystem. The binary is measured (hashed and extended into the
> hardware TPM) for remote attestation and appraised against local
> security policies before execution is permitted.
We have all of this already. It's called "kernel module loader", no? :)
> Caretaker Update
> ----------------
>
> * The Problem: Given that the Caretaker is permanently installed
> during VM setup, how does it get updated on long-running VMs?
>
> * Proposed Solution: To update the Caretaker, we can only perform the
> update when vCPUs are not isolated (i.e., not during the live
> update gap):
>
> - vCPU Quiescence: The userspace VMM issues the KVM_SET_CARETAKER
> ioctl. KVM then sends kvm_vcpu_kick to force the target vCPU to
> exit the guest and return to the host kernel.
> - KVM parses the new ELF. It allocates fresh memory pages for the
> new .text and .data segments.
> - The kernel populates the new .ccb with the existing VM context.
> - After the new environment is fully staged, KVM updates the
> physical CPU's virtualization control structures.
> - When the vCPU thread resumes and re-enters the guest, the next
> VM Exit will trigger the hardware to jump into the new Caretaker
> payload. The old memory segments are then safely freed.
I would prefer we only attach the whole caretaker and all of its
specialties right around the point when live update happens. Why keep it
dangling and active forever? That way you can also late load the kernel
module that contains it, so you can be sure it's an up to date version.
Alex
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-04-29 8:13 ` Alexander Graf
@ 2026-04-29 8:40 ` David Woodhouse
2026-04-29 16:13 ` Pasha Tatashin
2026-04-29 16:02 ` Pasha Tatashin
1 sibling, 1 reply; 12+ messages in thread
From: David Woodhouse @ 2026-04-29 8:40 UTC (permalink / raw)
To: Alexander Graf, Pasha Tatashin, linux-kernel, kexec, kvm,
linux-mm, kvmarm
Cc: rppt, pratyush, pbonzini, seanjc, maz, oupton, alex.williamson,
kevin.tian, rientjes, Tycho.Andersen, anthony.yznaga, baolu.lu,
david, dmatlack, mheyne, jgowans, jgg, pankaj.gupta.linux,
kpraveen.lkml, vipinsh, vannapurve, corbet, loeser, tglx, mingo,
bp, dave.hansen, x86, hpa, roman.gushchin, akpm, pjt,
Petrongonas, Evangelos
On Wed, 2026-04-29 at 10:13 +0200, Alexander Graf wrote:
> I would prefer we only attach the whole caretaker and all of its
> specialties right around the point when live update happens. Why keep it
> dangling and active forever? That way you can also late load the kernel
> module that contains it, so you can be sure it's an up to date version.
"Why keep it dangling and active forever?"
I've always wanted to tie this to address space isolation.
The only way to truly stay in front of the constant stream of new
speculation vulnerabilities has been to just make sure there's nothing
sensitive accessible in the address space at all. Hence all the work on
secret hiding, XPFO, proclocal, etc. — and hence the occasional
researcher finding their shiny new (5-year-old) vulnerability and being
confused when it doesn't leak anything *interesting* in certain
environments.
I'd like to see the inner KVM_RUN loop switch to a completely separate
address space, in which there's a kind of caretaker which can handle
the bare minimum of interrupts and timers and the most common exits,
and which *relatively* rarely has to come back into the real Linux
address space.
And once you have that caretaker running in its own address space...
why not just let it keep going while Linux does its kexec?
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-04-29 8:13 ` Alexander Graf
2026-04-29 8:40 ` David Woodhouse
@ 2026-04-29 16:02 ` Pasha Tatashin
1 sibling, 0 replies; 12+ messages in thread
From: Pasha Tatashin @ 2026-04-29 16:02 UTC (permalink / raw)
To: Alexander Graf
Cc: Pasha Tatashin, linux-kernel, kexec, kvm, linux-mm, kvmarm, rppt,
pratyush, pbonzini, seanjc, maz, oupton, dwmw2, alex.williamson,
kevin.tian, rientjes, Tycho.Andersen, anthony.yznaga, baolu.lu,
david, dmatlack, mheyne, jgowans, jgg, pankaj.gupta.linux,
kpraveen.lkml, vipinsh, vannapurve, corbet, loeser, tglx, mingo,
bp, dave.hansen, x86, hpa, roman.gushchin, akpm, pjt,
Petrongonas, Evangelos
On 04-29 10:13, Alexander Graf wrote:
>
> On 29.04.26 00:29, Pasha Tatashin wrote:
> > Hi all,
> >
> > TL;DR: Below is a proposal for the next step in Live Update
> > functionality: maintaining vCPU execution during a host kernel reboot
> > via a "Caretaker" for orphaned VMs.
> >
> > As cloud infrastructure continues to push toward zero-downtime host
> > maintenance, extending the capabilities of kexec-based Live Update to
> > minimize guest disruption is becoming increasingly critical.
> >
> > I would greatly appreciate your thoughts on the overall architecture,
> > the proposed ABI, and any security or hardware-specific edge cases we
> > might have missed.
> >
> > Background
> > ==========
> >
> > Currently, Live Update allows us to preserve hardware resources across a
> > kexec boundary, enabling VMs to be re-attached and resumed on the new
> > kernel. This resource preservation is orchestrated from the kernel side
> > by the LUO (https://docs.kernel.org/core-api/liveupdate.html). This
> > proposal outlines how we can extend this foundation to keep the VMs
> > running rather than just suspended during the transition, significantly
> > reducing the effective blackout experienced by the virtual machines.
> >
> > Orphaned VMs
> > ============
> >
> > Definition: An Orphaned VM is a virtual machine actively executing guest
> > instructions on isolated physical hardware, completely decoupled from a
> > Host Operating System or a userspace Virtual Machine Monitor (VMM).
> >
> > Historically, a VM's lifecycle has been strictly tied to a userspace VMM
> > process. The Orphaned VM strategy breaks this dependency by separating
> > resource ownership from active execution.
> >
> > Before the old kernel shuts down, it uses the Live Update Orchestrator
> > to maintain ownership of the guest's underlying resources. The LUO
> > preserves the vmfd and vcpufd structures, guest memory, secondary page
> > tables, and other critical KVM metadata required to successfully restore
> > the VM state in the new kernel.
> >
> > During the transition, the active execution of the VM is not managed by
> > the host kernel. Instead, execution is handed off to a specialized
> > bare-metal component: The Caretaker.
> >
> > While the VM is "orphaned," it operates entirely outside of standard
> > userspace and kernel space. The physical CPU (pCPU) hosting the VM is
> > completely isolated from the Host OS. The vCPUs continue to execute
> > guest instructions uninterrupted, and any VM Exits are trapped and
> > handled exclusively by the Caretaker. This isolated execution continues
> > while the host kernel reboots on separate management cores (or while
> > the VMM restarts, if the Live Update is limited to a userspace
> > component).
> >
> > The Caretaker
> > =============
> >
> > The Caretaker is a specialized, identity-mapped bare-metal executable
> > attached to each vCPU. While it is installed during the initial VM setup
> > and remains permanently enabled as an interpose layer between the Guest
> > and KVM, it is the key architectural feature required for a VM to be
> > orphaned. By providing an execution environment independent of the host
> > OS, the Caretaker enables the guest workload to safely survive the host
> > lifecycle transition.
> >
> > During normal VM operation, the Caretaker acts as a fast-path shim: it
> > forwards standard VM Exits down to the backing KVM kernel module and
> > handles certain exits autonomously. However, during the "Management Gap"
> > (when no host OS is active and standard KVM handlers are offline), the
> > Caretaker enters a standalone mode to ensure continuous guest execution
> > without host intervention.
> >
> > While this proposal focuses on its critical role in minimally disruptive
> > Live Update, the Caretaker is fundamentally designed as an extensible
> > primitive. Its architecture allows it to be leveraged for a variety of
> > other advanced virtualization use cases, such as running custom
> > lightweight hypervisors or completely offloading virtualization duties
> > to an accelerator card.
> >
> > Constraints
> > -----------
> >
> > The Caretaker is a bare-metal executable. It does not have a backing
> > kernel, it makes zero syscalls, and it lacks access to underlying
> > hardware devices or complex I/O components.
>
>
> I think you still want to define an ABI with the outer execution environment
> to allow primitives such as timers. The same the other way around: You want
> to define multiple entry points, so that you can for example have one that
> says "You're up for running this vCPU now" or "Stop all your work, serialize
> your state".
You are right. The intent is to use the CCB (Caretaker Control Block) as
the ABI between the Linux kernel and the Caretaker.
To keep execution as similar as possible between normal operation and
the orphaned state, this shared memory region is always populated by the
kernel. Because the Caretaker lacks syscalls, all communication, basic
primitives, and lifecycle signaling must be negotiated through the CCB.
Regarding lifecycle commands (like "Stop all your work"), the design
currently relies on the Attachment State Flag in the CCB.
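To make that concrete, the CCB could look roughly like the sketch below
(purely illustrative; the field names and layout are not a proposed ABI):

  #include <stdint.h>

  /* Rough sketch of the Caretaker Control Block (CCB). All names are
   * illustrative, not a final layout. */
  enum ccb_attach_state {
          CCB_KVM_ATTACHED = 0,   /* host KVM online: forward complex exits */
          CCB_KVM_DETACHED = 1,   /* live update gap: handle locally or spin-wait */
  };

  struct ccb {
          uint32_t attach_state;          /* Attachment State Flag, flipped atomically */
          uint32_t lifecycle_cmd;         /* e.g. "stop all work", "serialize state"   */
          uint64_t kvm_exit_handler_pa;   /* KVM Routing Pointer (physical address)    */
          uint64_t shared_metadata_pa;    /* per-vCPU config + telemetry pages         */
  };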
>
>
> >
> > Hardware Setup & ABI Interface
> > ==============================
> >
> > To interpose the Caretaker between the guest and the host OS, the
> > initialization sequence must load the bare-metal payload, rewire the
> > physical CPU's virtualization structures, and establish a syscall-free
> > communication channel.
> >
> > API Design & Caretaker Installation
> > -----------------------------------
> >
> > The Caretaker is installed early in the VM's lifecycle (e.g., shortly
> > after KVM_CREATE_VM). To manage this, we introduce a new KVM ioctl
> > (KVM_SET_CARETAKER) that configures the shim on the vcpufd
> > (alternatively on vmfd). Userspace provides the Caretaker payload
> > compiled as an ELF binary.
>
>
> Yikes. So you get a random unprivileged user space injects binary code into
> the kernel primitive? Not great.
>
> I think it would be better to think of the caretaker as a separate
> subsystem. Maybe even a kernel module that you load (that way it goes
> through the same signing logic as everything else). It can then take a KVM
> fd and cooperate in-kernel with KVM to do the hand-over.
>
>
> Or alternatively you follow a model where the caretaker is always compiled
> in as part of KVM. That makes the outgoing ABI easier, but complicates the
> incoming path a bit.
The intent is not to allow unprivileged userspace to load random code.
As outlined in the RFC under "Compromised Caretaker", the security
model mirrors kexec_file_load(). The kernel mandates cryptographic
signature verification of the ELF payload against the trusted kernel
keyring before allowing it to be mapped.
Regarding the kernel module approach: standard modules are tied to the
running kernel version. Because the Caretaker must survive the kexec
transition and interact with the next kernel upon reattachment, it must
maintain ABI compatibility (via the CCB) across different kernel
versions. Treating it as a standalone, signed payload facilitates this
cross-kernel lifecycle.
Compiling the Caretaker directly into KVM also has drawbacks. It reduces
extensibility, preventing use cases like forwarding VM Exits to a remote
VMM (e.g., running on an accelerator card or in another VM).
Additionally, loading it as a distinct payload allows us to leverage
Address Space Isolation: running the Caretaker with its own isolated CR3
provides a hardware-enforced security boundary that protects the host
kernel.
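For illustration, the KVM_SET_CARETAKER handler could follow roughly the
kexec_file_load() pattern: read the payload in-kernel and verify its
signature before mapping anything. kernel_read_file_from_fd() and
verify_pkcs7_signature() are existing kernel helpers; the ioctl handler
itself and the detached-signature argument are hypothetical:

  static int kvm_set_caretaker(struct kvm_vcpu *vcpu, int payload_fd,
                               const void *sig, size_t sig_len)
  {
          void *elf = NULL;
          size_t elf_size;
          ssize_t ret;

          /* Read the ELF entirely in-kernel; never trust user addresses. */
          ret = kernel_read_file_from_fd(payload_fd, 0, &elf, INT_MAX,
                                         &elf_size, READING_UNKNOWN);
          if (ret < 0)
                  return ret;

          /* Reject unsigned or modified payloads, as kexec_file_load() does. */
          ret = verify_pkcs7_signature(elf, elf_size, sig, sig_len,
                                       VERIFY_USE_SECONDARY_KEYRING,
                                       VERIFYING_UNSPECIFIED_SIGNATURE,
                                       NULL, NULL);
          if (ret)
                  goto out_free;

          /* In-kernel ELF parsing, mapping of .text/.data/.ccb, and VMCS/VMCB
           * rewiring for this vCPU would follow here; the payload stays owned
           * by the kernel for the lifetime of the vCPU. */
          return 0;

  out_free:
          vfree(elf);
          return ret;
  }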
>
> >
> > Using an ELF binary allows the Caretaker to be separated into distinct
> > memory sections:
> >
> > * .text: The executable bare-metal instructions. This section is
> > mapped as read-execute (RX).
> > * .data / .rodata: The static data and variables. This includes the
> > pre-populated per-vCPU metadata (such as CPUID topology responses)
> > generated by the VMM.
> > * .ccb: A dedicated section specifically reserved for the Caretaker
> > Control Block.
> >
> > During installation, the userspace VMM opens the ELF file and passes a
> > file descriptor to the KVM_SET_CARETAKER ioctl. By utilizing a
> > structured ELF payload, the Caretaker's logic remains highly flexible,
> > and allows the architecture to be utilized for broader features, such
> > as injecting a custom lightweight hypervisor, or unconditionally
> > forwarding VM exits to an accelerator card.
>
>
> Again, no way you want to have a random kernel binary load API :).
>
>
> > Hardware Interposition
> > ----------------------
> >
> > During the execution of the KVM_SET_CARETAKER ioctl, instead of
> > pointing the hardware's return path to standard KVM entry points (e.g.,
> > vmx_vmexit or svm_vcpu_run), KVM reprograms the host-state return area
> > of the CPU's hardware virtualization control structures (e.g., Intel
> > VMCS, AMD VMCB, or ARM equivalent) to point directly into the
> > bare-metal Caretaker environment.
> >
> > Specifically, the following critical host-state fields (using x86
> > terminology for illustration) are updated:
> >
> > * HOST_RIP: Set to point to the e_entry (entry point) of the provided
> > ELF .text segment. Whenever the guest triggers a VM Exit, the CPU
> > hardware will unconditionally jump to this address.
> > * HOST_RSP: Programmed to point to a specialized, pre-allocated
> > bare-metal stack dedicated strictly to this vCPU's Caretaker. This
> > allows it to execute completely independently of the host kernel's
> > normal thread stacks.
> > * HOST_SSP and HOST_INTR_SSP_TABLE: KVM pre-allocates a Shadow Stack
> > for the Caretaker. These VMCS fields are programmed during
> > initialization to ensure that RET and IRET instructions executed by
> > the Caretaker do not trigger fatal #CP faults when the host kernel
> > is detached.
> > * HOST_GS_BASE: Programmed to point to the CCB/Shared Metadata for
> > this vCPU.
> > * HOST_CR3: Configured to point to the Caretaker's isolated,
> > identity-mapped page tables, ensuring memory fetch safety when the
> > host kernel's page tables are torn down.
> >
> > Note on Optimization vs. Security: Constantly switching the page table
> > (CR3) on every VM Exit can be expensive due to TLB flushing. To
> > optimize performance, the Caretaker can share the host kernel's page
> > tables while the kernel is still around, and dynamically replace
> > HOST_CR3 with the dedicated, isolated page tables only when the vCPU is
> > orphaned (during the detachment phase). On the other hand, maintaining
> > a permanently isolated CR3 for the Caretaker adds a strong security
> > boundary, achieving hardware-enforced separation similar to KVM Address
> > Space Isolation (ASI).
>
>
> This means you're pulling the rug underneath Linux executing that code. How
> will this work in practice? Won't Linux get upset and scream that it can no
> longer access the CPU? What happens on IPI?
It will not, because the CPU is fully isolated and appears offline from
the kernel's perspective. The host kernel explicitly offlines the core
before the Caretaker takes full control, so the scheduler and host
interrupt routing no longer target it.
In fact, this isolation can be achieved without kexec. The host kernel
can offline and isolate the CPU while the Caretaker keeps the vCPU
executing, and later re-adopt the vCPU and bring the CPU back online in
the host.
Regarding IPIs, the handling of guest-to-guest IPIs and stray hardware
interrupts is covered later in the RFC under the "Challenges During Gap"
section.
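For illustration, the non-kexec detach/re-adopt sequence could be ordered
like this. remove_cpu()/add_cpu() are the existing CPU-hotplug helpers;
the CCB fields, the vcpu->ccb pointer, kvm_exit_entry_pa() and the helpers
themselves are hypothetical, and the real hotplug teardown would have to
be taught to leave the core executing the Caretaker instead of parking it:

  /* Hand the pCPU over to the Caretaker without kexec. Ordering sketch only. */
  static int caretaker_detach_pcpu(struct kvm_vcpu *vcpu, unsigned int pcpu)
  {
          /* The vCPU must already be pinned to, and exclusively own, this pCPU. */
          WRITE_ONCE(vcpu->ccb->attach_state, CCB_KVM_DETACHED);

          /* Take the core out of the host scheduler and IRQ routing. A modified
           * teardown path keeps the guest running under the Caretaker. */
          return remove_cpu(pcpu);
  }

  static int caretaker_reattach_pcpu(struct kvm_vcpu *vcpu, unsigned int pcpu)
  {
          int err = add_cpu(pcpu);        /* bring the core back online */

          if (!err) {
                  /* Repopulate the KVM Routing Pointers before flipping the flag. */
                  WRITE_ONCE(vcpu->ccb->kvm_exit_handler_pa, kvm_exit_entry_pa());
                  WRITE_ONCE(vcpu->ccb->attach_state, CCB_KVM_ATTACHED);
          }
          return err;
  }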
>
>
> > KVM-Caretaker ABI
> > =================
> >
> > The Caretaker requires a defined ABI to communicate with the host KVM
> > subsystem. This ABI is implemented via the shared, identity-mapped .ccb
> > section of the ELF payload, acting as the Caretaker Control Block
> > (CCB).
> >
> > The CCB acts as the source of truth for the Caretaker's execution loop
> > and contains three primary elements:
> >
> > * Attachment State Flag: An atomic variable indicating the current
> > relationship with the host KVM subsystem (e.g., KVM_ATTACHED or
> > KVM_DETACHED).
> > * KVM Routing Pointers: The physical function pointers that the
> > Caretaker uses to safely jump into the host KVM's standard VM Exit
> > handlers when operating in normal mode.
> > * Shared Configuration Metadata: A physical pointer to dedicated
> > memory pages used by the kernel to share dynamic vCPU configuration
> > data with the Caretaker. Because every guest is configured
> > differently, KVM populates these pages with the specific parameters
> > negotiated during VM initialization (such as CPUID feature masks,
> > APIC routing, and timer states). These pages also include a
> > pre-allocated Telemetry Buffer for the Caretaker to log VM Exits
> > and spin-wait durations. These dedicated pages are explicitly
> > preserved across the host reboot via KHO, ensuring the Caretaker
> > maintains continuous access to the exact context required to
> > accurately emulate trivial exits during the gap.
>
>
> Why not create a clear handover point? You are either running with KVM and
> its user space vmm or you are running this "caretaker vmm" which then runs
> its own binary hypervisor thing. That way the handover can also be done
> fully through UAPI. You could even implement the same KVM API on a caretaker
> fd for example and just serialize/deserialize the vCPU state.
>
> Or if we embed it into KVM proper, we can probably define a KHO
> representation of that state and reuse that. Then ingesting it on the
> incoming environment is the same problem you already have to solve with KHO
> compatibility.
A clear handover point implies maintaining two entirely distinct
execution paths. We explicitly want to avoid that because execution
during normal operation and during the Live Update gap should remain as
similar as possible. Running the Caretaker continuously minimizes state
transitions and avoids introducing untested, hard-to-debug corner cases
that only manifest during a host transition.
Additionally, we want the Caretaker to provide architectural value
beyond Live Update. By keeping it permanently installed in the VM Exit
path, it serves as the foundation for Address Space Isolation and
offers an extension point for routing or offloading virtualization
tasks. Switching to a separate environment only during the gap defeats
these broader goals.
>
>
> >
> > Caretaker VM Exit Flow
> > ======================
> >
> > Upon vmexit, the Caretaker's execution flow follows a routing hierarchy:
> >
> > The Fast Path
> > -------------
> >
> > The Caretaker first evaluates the VM Exit reason. If the exit belongs to
> > a category that the Caretaker is programmed to resolve natively, it
> > handles it internally. For example, profiling of guests has identified
> > the following exit categories for potential local resolution:
> >
> > * Guest Idle Exits (e.g., HLT): When the guest OS goes idle, it
> > triggers idle exits. The Caretaker intercepts these and halts the
> > physical core until the next guest-bound interrupt fires, preserving
> > host power.
> > * Timer and APIC Exits: Even an idle guest frequently writes to
> > interrupt controllers and system registers to configure internal
> > timers. The Caretaker handles these trivial writes directly,
> > acknowledging the timer updates.
> > * CPUID and System Registers: Exits like CPUID or safe system register
> > accesses are resolved by reading pre-computed responses from the
> > shared metadata pages to accurately service the guest dynamically.
> >
> > Host Routing
> > ------------
> >
> > If the VM Exit requires KVM/VMM intervention (e.g., a Page Fault or
> > emulated device I/O), the Caretaker cannot resolve it locally. It must
> > check the CCB attachment state flag to determine where to route the
> > exit:
> >
> > * If KVM_ATTACHED: The host kernel is actively managing the system.
> > The Caretaker acts as a fast-path trampoline, setting up the
> > standard host registers and transferring execution to the KVM
> > Routing Pointers.
> > * If KVM_DETACHED: The host kernel is offline for Live Update. The
> > Caretaker places the specific vCPU into a safe "spin-wait" polling
> > loop, continuously checking the CCB flag for a change.
> >
> > Preservation of the vCPU
> > ========================
> >
> > For this orchestration to work across a host OS replacement, the file
> > descriptor associated with the vCPU must outlive the userspace process
> > that created it, leveraging the LUO:
> >
> > * Creation: The VMM initially creates the vcpufd via the standard
> > KVM_CREATE_VCPU ioctl on the VM-wide kvmfd. Immediately after
> > creation, the VMM issues the KVM_SET_CARETAKER ioctl on the newly
> > created vcpufd. This installs the Caretaker as an interpose layer,
> > allocating the CCB and rewiring the hardware virtualization control
> > structures (e.g., HOST_RIP) for this specific vCPU. A dedicated
> > userspace thread then takes ownership of this file descriptor to
> > drive the KVM_RUN loop.
> >
> > * LUO Registration & Isolation: In preparation for minimally
> > disruptive Live Update, the VMM acquires a session from the
> > userspace luo-agent and registers the vcpufd using
> > LIVEUPDATE_SESSION_PRESERVE_FD. KVM's LUO .preserve() handler is
> > invoked. Prior to this step, the VMM must pause emulated devices;
> > the guest relies on VFIO pass-through for primary I/O to survive
> > the gap without triggering VMM-bound exits. Crucially, the
> > .preserve() phase is the moment when the pCPU is isolated from the
> > host OS. The kernel finalizes the state, offlines the core (from
> > the OS perspective), transitions the vCPU fully into the Caretaker,
> > and preserves the required KVM data to KHO.
> >
> > * The Gap: When the VMM process exits prior to the kexec transition,
> > open file descriptors are normally destroyed. However, because the
> > vcpufd is registered with LUO, the kernel holds a reference to the
> > underlying struct file to ensure it survives the reboot. While the
> > VMM has exited, the isolated pCPU appears completely offline to the
> > host OS. During the boot of the Next Kernel, the core smp_init()
> > routine parses the LUO FLB data and explicitly skips initializing
> > these preserved pCPUs. This shields the dedicated cores from reset
> > signals, allowing the guest workload to continue uninterrupted.
> >
> > * Reclamation: The Next Kernel boots and LUO deserializes the
> > session. When the new VMM process spawns, it retrieves the
> > preserved session and issues LIVEUPDATE_SESSION_RETRIEVE_FD using
> > its token. LUO invokes KVM's .retrieve() callback to map the
> > preserved vcpufd back into the new VMM's file descriptor table. As
> > part of this retrieval process, the host formally brings the
> > isolated pCPU back online, and the new VMM userspace thread is
> > attached back to the active VM thread running on the vCPU. Finally,
> > KVM populates the new KVM Routing Pointers in the CCB and
> > atomically flips the Host State Flag back to KVM_ATTACHED. This
> > breaks the Caretaker's spin-wait loop (if it is in this state),
> > allowing standard KVM operation to resume.
>
>
> How does the caretaker continue execution across kexec?
The pCPU running the Caretaker is offlined and isolated by the host OS
prior to the kexec transition.
Although the pCPUs are offlined, they remain owned by the kernel because
the associated vcpu file descriptors are preserved in the LUO.
During kexec, the primary boot CPU drives the kernel replacement.
Because the Caretaker's pCPUs are offline, they are ignored by the kexec
teardown path. The Caretaker's and guest's memory is preserved via KHO,
and the host stops targeting these cores with interrupts or scheduler
activity.
During the new kernel's boot, smp_init() normally sends INIT/SIPI (on
x86) to reset all secondary cores. In this architecture, smp_init()
parses the KHO/LUO metadata, identifies the preserved pCPUs, and skips
sending reset signals to them.
Because the hardware core is not reset by the kexec process or the new
kernel's boot sequence, the Caretaker continues executing in preserved
memory while the new host kernel boots.
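A minimal sketch of what the boot-side check could look like, assuming a
hypothetical luo_pcpu_is_preserved() helper over the KHO/LUO metadata
(the surrounding bringup code is the existing smp path, unchanged):

  /* Consulted by the next kernel's secondary-CPU bringup (e.g. right before
   * sending INIT/SIPI on x86). luo_pcpu_is_preserved() is an assumed helper
   * that parses the metadata handed over by the old kernel. */
  static bool __init cpu_owned_by_caretaker(unsigned int cpu)
  {
          if (!luo_pcpu_is_preserved(cpu))
                  return false;

          /* Do not reset this core, do not register it with cpuhp, and leave
           * its interrupt routing untouched: it is still executing guest code
           * under the Caretaker until LUO .retrieve() re-adopts it. */
          pr_info("cpu%u: preserved across kexec, skipping bringup\n", cpu);
          return true;
  }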
>
>
> >
> > Gap Observability and Telemetry
> > ===============================
> >
> > During the host transition standard host-level observability tools
> > (e.g., ftrace, perf, eBPF) are completely offline. To ensure production
> > readiness, site reliability, and precise latency accounting, the
> > Caretaker must act as a standalone micro-profiler while the host is
> > detached.
> >
> > The architecture includes a pre-allocated, identity-mapped Telemetry
> > Buffer, passed to the Caretaker via the Shared Configuration Metadata
> > during installation. During the gap, the Caretaker writes to this
> > memory region to record execution data:
> >
> > * Exit Counters: The Caretaker increments hardware-specific counters
> > for every VM Exit reason it encounters. This provides a precise
> > tally of natively handled events (e.g., HLT, CPUID, or APIC writes)
> > that occurred while the host was offline.
> >
> > * Spin-Wait Profiling: If a complex exit forces the Caretaker to
> > block, it reads the hardware time stamp counter (e.g., RDTSC on
> > x86, CNTVCT_EL0 on ARM64) immediately before entering the spin-wait
> > loop. It reads the counter again the exact moment the CCB flag
> > flips back to KVM_ATTACHED. The Caretaker records the specific exit
> > reason that caused the block and the total accumulated stall time.
> >
> > * Post-Gap Ingestion: Upon successful reattachment, the newly booted
> > host KVM subsystem parses this Telemetry Buffer. The recorded data
> > is exported via debugfs (e.g., /sys/kernel/debug/kvm/).
> >
> > Challenges During Gap
> > =====================
> >
> > While the Caretaker manages standard VM Exits, completely offlining a
> > physical CPU introduces critical edge cases regarding asynchronous
> > hardware interrupts. Below are the primary architectural challenges and
> > their proposed mitigations.
> >
> > VFIO Interrupt Routing
> > ----------------------
> >
> > * The Problem: The design relies on VFIO pass-through for primary
> > I/O to survive the gap. However, when a physical device (e.g., an
> > NVMe drive) completes a read, it sends a hardware interrupt
> > (MSI/MSI-X) back to the CPU. By default, external hardware
> > interrupts cause a VM Exit. Because the Caretaker natively lacks
> > Linux IRQ routing tables, it has no immediate mechanism to inject
> > that specific device interrupt into the guest's virtual APIC or
> > GIC. The guest would stall indefinitely waiting for the I/O
> > completion.
> >
> > * Proposed Solution: The deployment may utilize hardware-accelerated
> > interrupt routing, such as Posted Interrupts (Intel PI), Advanced
> > Virtual Interrupt Controller (AMD AVIC), or GICv4 Direct Virtual
> > Interrupt Injection (ARM64). This hardware feature allows the
> > IOMMU/SMMU to route physical device interrupts directly into the
> > guest's virtual interrupt controller without causing a VM Exit.
> > Alternatively, KVM could share its IRQ routing tables with the
> > Caretaker via the Shared Configuration Metadata during setup. This
> > would allow the Caretaker to intercept the hardware interrupt and
> > manually inject the virtual interrupt into the guest in software.
> >
> > Guest-to-Guest IPIs
> > -------------------
> >
> > * The Problem: If the guest OS attempts to wake up a sleeping thread,
> > one orphaned vCPU will send an Inter-Processor Interrupt (IPI) to
> > another orphaned vCPU. In standard virtualization without hardware
> > assistance, writing to the APIC ICR (or sending an ARM SGI) causes
> > a VM Exit so the host KVM can emulate the message delivery. During
> > the gap, KVM is unavailable to route this message.
> >
> > * Proposed Solution: The architecture may leverage hardware virtualized
> > interrupts (Intel APICv, AMD AVIC, or ARM GICv4.1 virtual SGIs).
> > This allows the hardware silicon to handle IPI delivery between the
> > isolated pCPUs natively, eliminating the VM Exit. Alternatively,
> > the Caretaker can be programmed to emulate the IPI delivery. By
> > utilizing the shared memory metadata, the Caretaker can determine
> > the target vCPU and directly update its pending interrupt state.
> >
> > Stray Hardware Interrupts
> > -------------------------
> >
> > * The Problem: What happens if a Non-Maskable Interrupt (NMI), a
> > hardware timer tick, or a Machine Check Exception / System Error
> > (MCE / ARM SError) arrives while the CPU is actively executing
> > Caretaker code in KVM_DETACHED mode?
> >
> > * Proposed Solution: To safely handle these asynchronous events, the
> > Caretaker payload should establish and load its own minimal,
> > self-contained Interrupt Descriptor Table (IDT on x86) or Exception
> > Vector Table (via VBAR_EL2 on ARM64) during initialization. This
> > ensures that if a stray hardware event or NMI arrives while
> > executing Caretaker instructions in host mode, the CPU can catch
> > the fault and park the core in a known state rather than triggering
> > a catastrophic host reboot.
> >
> > On x86, when transitioning into the gap, KVM explicitly programs
> > HOST_IDTR and HOST_GDTR to these self-contained tables. If an NMI
> > or stray hardware event arrives, the CPU catches the fault using
> > the Caretaker's native handlers, parking the core or logging the
> > event rather than attempting host recovery. KVM also programs
> > HOST_INTR_SSP_TABLE within the Caretaker's isolated environment so
> > that if the exception handlers execute an IRET, the hardware's CET
> > shadow stack unroll succeeds without triggering a #CP exception.
> >
> > Timekeeping Drift
> > -----------------
> >
> > * The Problem: While the guest is orphaned, it continues executing
> > and reading time via the physical CPU's hardware Time Stamp Counter
> > (TSC or ARM generic timer). Because the host kernel is offline,
> > software-based paravirtualized clocks (such as kvmclock) are no
> > longer being updated by host background threads. Furthermore, when
> > the Next Kernel boots, its calculation of host CLOCK_MONOTONIC or
> > wall-time might drift from the old kernel. If the new KVM
> > subsystem resets the VM's TSC offsets or updates the PV clock
> > structures with standard initialization values upon adoption, the
> > guest could experience a time jump.
> >
> > * Proposed Solution: During the gap, the guest relies entirely on the
> > physical CPU's Invariant TSC (which the hardware automatically
> > offsets natively via the VMCS/VMCB) for continuous timekeeping. To
> > ensure safe reattachment, KVM must serialize the exact state of the
> > guest's PV clocks, TSC offsets, and the old kernel's base reference
> > times into KHO memory during the LUO .preserve() phase. Upon
> > adoption, the new KVM subsystem must synchronize its internal
> > tracking with this preserved data, ensuring that any subsequent
> > updates to the guest's PV clock memory guarantee a strictly
> > monotonic and smooth time progression.
> >
> > Nested Page Table Updates
> > -------------------------
> >
> > * The Problem: As the guest executes, it may attempt to access memory
> > that has not yet been mapped by the hypervisor, or it may interact
> > with MMIO regions. Normally, this triggers an EPT Violation (Intel)
> > or NPT Page Fault (AMD), prompting KVM to allocate host pages and
> > update the secondary page tables. How are these updates handled
> > when the host KVM subsystem is offline during the gap?
> >
> > * Proposed Solution: During the "Management Gap," there are absolutely
> > no updates made to the NPT/EPT. The existing secondary page tables
> > are fully preserved in memory via LUO kvmfd preservation prior to
> > detachment, allowing the guest to seamlessly access all previously
> > mapped memory. If the guest triggers a new page fault (requiring an
> > NPT/EPT update) during the gap, the Caretaker simply categorizes it
> > as a Blocking Exit.
>
>
> You may want to prefault all nested page tables before you switch to
> detached mode.
Yes, prefaulting all nested page tables is an option to minimize guest
stalls. However, it is not strictly necessary.
If a VM Exit due to an EPT or NPT violation occurs during the gap, the
Caretaker categorizes it as a Blocking Exit. The vCPU stalls and waits
for the next kernel to re-adopt the vCPU and handle the fault before
continuing execution.
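If prefaulting is desired, the VMM could do it right before the LUO
.preserve() step using the existing KVM_PRE_FAULT_MEMORY vCPU ioctl.
Rough userspace sketch; it assumes KVM_CAP_PRE_FAULT_MEMORY is available
and that the caller iterates over its memslots to build the GPA ranges:

  #include <errno.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Prefault one guest-physical range on a vCPU so that EPT/NPT violations
   * do not turn into Blocking Exits during the gap. */
  static int prefault_range(int vcpu_fd, uint64_t gpa, uint64_t size)
  {
          struct kvm_pre_fault_memory range = {
                  .gpa  = gpa,
                  .size = size,
          };

          while (range.size) {
                  /* On partial progress KVM updates .gpa/.size and may return
                   * -EINTR; simply retry the remainder. */
                  if (ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range) < 0 &&
                      errno != EINTR)
                          return -errno;
          }
          return 0;
  }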
>
>
> >
> > Compromised Caretaker
> > ---------------------
> >
> > * The Problem: The Caretaker runs in Host Mode. If left unprotected,
> > this could allow a lightly privileged userspace process (e.g., QEMU
> > or crosvm) to inject arbitrary executable code directly into the
> > CPU's most privileged hardware state (VMX Root / Ring 0 / EL2).
> >
> > * Proposed Solution: To mitigate this risk, the KVM_SET_CARETAKER
> > ioctl may adopt the security model used by the kexec_file_load()
> > syscall. Rather than trusting userspace to pass physical addresses,
> > the kernel must take full ownership of payload validation:
> >
> > - In-Kernel ELF Parsing: Instead of passing raw segment addresses
> > (which is vulnerable to manipulation), the VMM passes a file
> > descriptor for the Caretaker binary. The host KVM subsystem then
> > performs the ELF parsing entirely in-kernel. This guarantees that
> > the kernel controls exactly where the .text, .data, and .ccb
> > sections are mapped, preventing userspace from tricking the
> > kernel into overwriting sensitive host memory.
> >
> > - Signature Verification & Secure Boot: If the host is running with
> > Secure Boot or Kernel Lockdown enabled, KVM mandates that the
> > Caretaker ELF binary be cryptographically signed. The kernel
> > verifies the signature against the system's trusted keyring
> > (e.g., .builtin_trusted_keys) before loading it. An unsigned or
> > modified payload is outright rejected.
> >
> > - IMA (Integrity Measurement Architecture) Integration: The loading
> > of the Caretaker is hooked directly into the kernel's IMA
> > subsystem. The binary is measured (hashed and extended into the
> > hardware TPM) for remote attestation and appraised against local
> > security policies before execution is permitted.
>
>
> We have all of this already. It's called "kernel module loader", no? :)
Yes, the mechanisms are identical. However, standard kernel modules are
tied to specific kernel versions.
Because the Caretaker must survive the kexec transition and interface
with the next kernel, it requires a standalone payload verified through
the same security primitives (similar to kexec_file_load). This avoids
the version coupling of standard modules while maintaining the same
security guarantees.
>
>
> > Caretaker Update
> > ----------------
> >
> > * The Problem: Given that the Caretaker is permanently installed
> > during VM setup, how does it get updated on long-running VMs?
> >
> > * Proposed Solution: To update the Caretaker, we can only perform the
> > update when vCPUs are not isolated (i.e., not during the live
> > update gap):
> >
> > - vCPU Quiescence: The userspace VMM issues the KVM_SET_CARETAKER
> > ioctl. KVM then sends kvm_vcpu_kick to force the target vCPU to
> > exit the guest and return to the host kernel.
> > - KVM parses the new ELF. It allocates fresh memory pages for the
> > new .text and .data segments.
> > - The kernel populates the new .ccb with the existing VM context.
> > - After the new environment is fully staged, KVM updates the
> > physical CPU's virtualization control structures.
> > - When the vCPU thread resumes and re-enters the guest, the next
> > VM Exit will trigger the hardware to jump into the new Caretaker
> > payload. The old memory segments are then safely freed.
>
>
> I would prefer we only attach the whole caretaker and all of its specialties
> right around the point when live update happens. Why keep it dangling and
> active forever? That way you can also late load the kernel module that
> contains it, so you can be sure it's an up to date version.
I touched on this earlier in the reply. Keeping the Caretaker active
permanently ensures execution symmetry and provides architectural
benefits beyond Live Update, such as Address Space Isolation and
extensibility, which would be lost if it were only attached during the
gap.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-04-29 8:40 ` David Woodhouse
@ 2026-04-29 16:13 ` Pasha Tatashin
0 siblings, 0 replies; 12+ messages in thread
From: Pasha Tatashin @ 2026-04-29 16:13 UTC (permalink / raw)
To: David Woodhouse
Cc: Alexander Graf, Pasha Tatashin, linux-kernel, kexec, kvm,
linux-mm, kvmarm, rppt, pratyush, pbonzini, seanjc, maz, oupton,
alex.williamson, kevin.tian, rientjes, Tycho.Andersen,
anthony.yznaga, baolu.lu, david, dmatlack, mheyne, jgowans, jgg,
pankaj.gupta.linux, kpraveen.lkml, vipinsh, vannapurve, corbet,
loeser, tglx, mingo, bp, dave.hansen, x86, hpa, roman.gushchin,
akpm, pjt, Petrongonas, Evangelos, kpsingh, jackmanb
On 04-29 09:40, David Woodhouse wrote:
> On Wed, 2026-04-29 at 10:13 +0200, Alexander Graf wrote:
> > I would prefer we only attach the whole caretaker and all of its
> > specialties right around the point when live update happens. Why keep it
> > dangling and active forever? That way you can also late load the kernel
> > module that contains it, so you can be sure it's an up to date version.
>
> "Why keep it dangling and active forever?"
>
> I've always wanted to tie this to address space isolation.
>
> The only way to truly stay in front of the constant stream of new
> speculation vulnerabilities has been to just make sure there's nothing
> sensitive accessible in the address space at all. Hence all the work on
> secret hiding, XPFO, proclocal, etc. — and hence the occasional
> researcher finding their shiny new (5-year-old) vulnerability and being
> confused when it doesn't leak anything *interesting* in certain
> environments.
>
> I'd like to see the inner KVM_RUN loop switch to a completely separate
> address space, in which there's a kind of caretaker which can handle
> the bare minimum of interrupts and timers and the most common exits,
> and which *relatively* rarely has to come back into the real Linux
> address space.
>
> And once you have that caretaker running in its own address space...
> why not just let it keep going while Linux does its kexec?
Yep, this captures one of the benefits of having a permanently attached
Caretaker.
By establishing that isolated execution environment for the inner
KVM_RUN loop to mitigate speculation vulnerabilities, we naturally get
the hardware-enforced boundary required to survive the kexec gap. The
Live Update capability is effectively a byproduct of achieving true
Address Space Isolation.
+CC KP and Brendan
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-04-28 22:29 [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update Pasha Tatashin
2026-04-29 8:13 ` Alexander Graf
@ 2026-04-30 13:28 ` Paolo Bonzini
2026-04-30 15:27 ` David Woodhouse
2026-05-01 21:48 ` Pasha Tatashin
1 sibling, 2 replies; 12+ messages in thread
From: Paolo Bonzini @ 2026-04-30 13:28 UTC (permalink / raw)
To: Pasha Tatashin, linux-kernel, kexec, kvm, linux-mm, kvmarm
Cc: rppt, graf, pratyush, seanjc, maz, oupton, dwmw2, alex.williamson,
kevin.tian, rientjes, Tycho.Andersen, anthony.yznaga, baolu.lu,
david, dmatlack, mheyne, jgowans, jgg, pankaj.gupta.linux,
kpraveen.lkml, vipinsh, vannapurve, corbet, loeser, tglx, mingo,
bp, dave.hansen, x86, hpa, roman.gushchin, akpm, pjt
I have some very similar observations to Alex and some very similar
observations to David. This has to imply that everyone will agree with
me. :)
Seriously, the main contention point, from reading the thread, is the
placement and lifecycle of the caretaker. More on this later...
On 4/29/26 00:29, Pasha Tatashin wrote:
> While this proposal focuses on its critical role in minimally disruptive
> Live Update, the Caretaker is fundamentally designed as an extensible
> primitive. Its architecture allows it to be leveraged for a variety of
> other advanced virtualization use cases, such as running custom
> lightweight hypervisors or completely offloading virtualization duties
> to an accelerator card.
One step at a time please---and as an initial step, just place it inside
the kernel, a la Arm nVHE.
Since your design would have the ability to update the caretaker anyway,
you can embed that part into the reattachment process, so that the new
kernel can use its own caretaker.
This greatly reduces the need to establish a stable-ish ABI. Only the
handover (kexec/LUO) needs to be stable, so that the new kernel can
populate its kvm and kvm_vcpu structs. And for that we mostly have a
solution already: a stream of serialized ioctls.
> During the execution of the KVM_SET_CARETAKER ioctl, instead of
> pointing the hardware's return path to standard KVM entry points (e.g.,
> vmx_vmexit or svm_vcpu_run), KVM reprograms the host-state return area
> of the CPU's hardware virtualization control structures (e.g., Intel
> VMCS, AMD VMCB, or ARM equivalent) to point directly into the
> bare-metal Caretaker environment.
This can be done unconditionally for all VMs based on a module
parameter, again as in Arm nVHE.
> Note on Optimization vs. Security: Constantly switching the page table
> (CR3) on every VM Exit can be expensive due to TLB flushing. To
> optimize performance, the Caretaker can share the host kernel's page
> tables while the kernel is still around, and dynamically replace
> HOST_CR3 with the dedicated, isolated page tables only when the vCPU is
> orphaned (during the detachment phase). On the other hand, maintaining
> a permanently isolated CR3 for the Caretaker adds a strong security
> boundary, achieving hardware-enforced separation similar to KVM Address
> Space Isolation (ASI).
Agreed on this.
> The Caretaker requires a defined ABI to communicate with the host KVM
> subsystem. This ABI is implemented via the shared, identity-mapped .ccb
> section of the ELF payload, acting as the Caretaker Control Block
> (CCB).
>
> The CCB acts as the source of truth for the Caretaker's execution loop
> and contains three primary elements:
>
> * Attachment State Flag: An atomic variable indicating the current
> relationship with the host KVM subsystem (e.g., KVM_ATTACHED or
> KVM_DETACHED).
This must be done atomically at the time Linux offlines/onlines a pCPU.
The interface from Linux to the caretaker must use some kind of IPI so
that the new kernel can force a VMEXIT (if needed) in the caretaker, ask
it to serialize the vm state, and pass it down to the new kernel's
caretaker.
> * KVM Routing Pointers: The physical function pointers that the
> Caretaker uses to safely jump into the host KVM's standard VM Exit
> handlers when operating in normal mode.
> * Shared Configuration Metadata: A physical pointer to dedicated
> memory pages used by the kernel to share dynamic vCPU configuration
> data with the Caretaker. Because every guest is configured
> differently, KVM populates these pages with the specific parameters
> negotiated during VM initialization (such as CPUID feature masks,
> APIC routing, and timer states). These pages also include a
> pre-allocated Telemetry Buffer for the Caretaker to log VM Exits
> and spin-wait durations. These dedicated pages are explicitly
> preserved across the host reboot via KHO, ensuring the Caretaker
> maintains continuous access to the exact context required to
> accurately emulate trivial exits during the gap.
All this is mostly unnecessary if the caretaker is provided by the
kernel. The recently introduced remote ring buffers can be used for
tracing too.
> The Caretaker first evaluates the VM Exit reason. If the exit belongs to
> a category that the Caretaker is programmed to resolve natively, it
> handles it internally. For example, profiling of guests has identified
> the following exit categories for potential local resolution:
>
> * Guest Idle Exits (e.g., HLT): When the guest OS goes idle, it
> triggers idle exits. The Caretaker intercepts these and halts the
> physical core until the next guest-bound interrupt fires, preserving
> host power.
I don't think HLT can be handled entirely here. Either you skip the
exit completely or you have to go out to the scheduler. The HLT exit
could be skipped unconditionally for an orphaned VM, but while there is
a running kernel the caretaker has to run entirely with interrupts off
and that limits what you can do.
In fact there is already a blueprint of what can be handled easily in
the caretaker, namely
vmx_exit_handlers_fastpath()/svm_exit_handlers_fastpath(). Stick to
what exists already.
> * Timer and APIC Exits: Even an idle guest frequently writes to
> interrupt controllers and system registers to configure internal
> timers. The Caretaker handles these trivial writes directly,
> acknowledging the timer updates.
This depends heavily on the implementation of the hypervisor, for
example it can be done on Intel via the preemption timer but not on AMD
where an actual hrtimer is needed.
[...]
> When the new VMM process spawns, it retrieves the
> preserved session and issues LIVEUPDATE_SESSION_RETRIEVE_FD using
> its token. LUO invokes KVM's .retrieve() callback to map the
> preserved vcpufd back into the new VMM's file descriptor table. As
> part of this retrieval process, the host formally brings the
> isolated pCPU back online, and the new VMM userspace thread is
> attached back to the active VM thread running on the vCPU. Finally,
> KVM populates the new KVM Routing Pointers in the CCB and
> atomically flips the Host State Flag back to KVM_ATTACHED. This
> breaks the Caretaker's spin-wait loop (if it is in this state),
> allowing standard KVM operation to resume.
This would also include some kind of serialization of the old VM into
the new kernel's struct kvm_vcpu.
Also some kind of feature negotiation is needed (if that fails, the VMs
are terminated unceremoniously) so I believe that the transition into
and out of the gap must be synchronous. For example with INIT/SIPI for
the entry, and an IPI for the exit?
> Guest-to-Guest IPIs
> -------------------
>
> * The Problem: If the guest OS attempts to wake up a sleeping thread,
> one orphaned vCPU will send an Inter-Processor Interrupt (IPI) to
> another orphaned vCPU. In standard virtualization without hardware
> assistance, writing to the APIC ICR (or sending an ARM SGI) causes
> a VM Exit so the host KVM can emulate the message delivery. During
> the gap, KVM is unavailable to route this message.
>
> * Proposed Solution: The architecture may leverage hardware virtualized
> interrupts (Intel APICv, AMD AVIC, or ARM GICv4.1 virtual SGIs).
> This allows the hardware silicon to handle IPI delivery between the
> isolated pCPUs natively, eliminating the VM Exit. Alternatively,
> the Caretaker can be programmed to emulate the IPI delivery. By
> utilizing the shared memory metadata, the Caretaker can determine
> the target vCPU and directly update its pending interrupt state.
Yeah, I think APIC emulation to some extent must be moved into the
VMX/SVM fastpaths. The good news is that this can be done already as a
PoC without needing the whole caretaker and LUO infrastructure.
> * The Problem: What happens if a Non-Maskable Interrupt (NMI), a
> hardware timer tick, or a Machine Check Exception / System Error
> (MCE / ARM SError) arrives while the CPU is actively executing
> Caretaker code in KVM_DETACHED mode?
>
> * Proposed Solution: To safely handle these asynchronous events, [...]
> on x86, when transitioning into the gap, KVM explicitly programs
> HOST_IDTR and HOST_GDTR to [the caretaker's] tables.
Agreed and this also shows that the transition must be synchronous.
> * The Problem: As the guest executes, it may attempt to access memory
> that has not yet been mapped by the hypervisor, or it may interact
> with MMIO regions. Normally, this triggers an EPT Violation (Intel)
> or NPT Page Fault (AMD), prompting KVM to allocate host pages and
> update the secondary page tables. How are these updates handled
> when the host KVM subsystem is offline during the gap?
>
> * Proposed Solution: During the "Management Gap," there are absolutely
> no updates made to the NPT/EPT. The existing secondary page tables
> are fully preserved in memory via LUO kvmfd preservation prior to
> detachment, allowing the guest to seamlessly access all previously
> mapped memory. If the guest triggers a new page fault (requiring an
> NPT/EPT update) during the gap, the Caretaker simply categorizes it
> as a Blocking Exit.
Yes, by default everything is a blocking exit. In particular, unless
one day we do x86/pKVM, page tables can be handled entirely by Linux
rather than the caretaker with no change to the existing MMU notifier
architecture.
As a consequence, the caretaker is absolutely not going to be a TCB---at
least not in the beginning.
> Compromised Caretaker
> ---------------------
>
> * The Problem: The Caretaker runs in Host Mode. If left unprotected,
> this could allow a lightly privileged userspace process (e.g., QEMU
> or crosvm) to inject arbitrary executable code directly into the
> CPU's most privileged hardware state (VMX Root / Ring 0 / EL2).
>
> * Proposed Solution: To mitigate this risk, the KVM_SET_CARETAKER
> ioctl may adopt the security model used by the kexec_file_load()
> syscall. Rather than trusting userspace to pass physical addresses,
> the kernel must take full ownership of payload validation:
-EOVERENGINEERED. Just shove it into the kernel.
> Caretaker Update
> ----------------
>
> * The Problem: Given that the Caretaker is permanently installed
> during VM setup, how does it get updated on long-running VMs?
Via kexec. :) I understand you have bigger plans, but we need to crawl
before walk^Wattempting a marathon.
I even wonder if, for long term simplicity, the interface for
host->caretaker should be just for the caretaker to swallow the host
into non-root mode, again as in Arm nVHE. That would make it much
harder to implement some kind of live update, but my answer to that
*really* is just to use kexec.
Paolo
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-04-30 13:28 ` Paolo Bonzini
@ 2026-04-30 15:27 ` David Woodhouse
2026-05-01 3:32 ` Paolo Bonzini
2026-05-01 21:48 ` Pasha Tatashin
1 sibling, 1 reply; 12+ messages in thread
From: David Woodhouse @ 2026-04-30 15:27 UTC (permalink / raw)
To: Paolo Bonzini, Pasha Tatashin, linux-kernel, kexec, kvm, linux-mm,
kvmarm
Cc: rppt, graf, pratyush, seanjc, maz, oupton, alex.williamson,
kevin.tian, rientjes, Tycho.Andersen, anthony.yznaga, baolu.lu,
david, dmatlack, mheyne, jgowans, jgg, pankaj.gupta.linux,
kpraveen.lkml, vipinsh, vannapurve, corbet, loeser, tglx, mingo,
bp, dave.hansen, x86, hpa, roman.gushchin, akpm, pjt
On Thu, 2026-04-30 at 15:28 +0200, Paolo Bonzini wrote:
> I even wonder if, for long term simplicity, the interface for
> host->caretaker should be just for the caretaker to swallow the host
> into non-root mode, again as in Arm nVHE.
There's a lot of merit in that approach.
I talked about wanting to use this 'caretaker' for secret hiding. But
why have *voluntary* secret hiding with the kernel hiding things from
its own address space, when you have have *mandatory* secret hiding
with something running in EL2, like pKVM. Or the Nitro Isolation Engine
which adds formal proof of correctness on top and is designed to allow
for live update of both itself *and* the kernel it hosts.
Honestly, I don't see the *caretaker* being much of an ABI at all,
except from one kernel to the next.
The *userspace* ABI considerations are all about how you make a vCPU
that runs asynchronously (should it conceptually just be an async
KVM_RUN call, which allows the vCPU to run in a kernel thread up to the
point of kexec? Why is it fundamentally tied to kexec at all?).
I'd love to start without kexec in the picture at all. Just show me the
KVM API for starting a *confidential* guest (pKVM, SEV-SNP, whatever),
leaving it running, completely stopping the VMM and then starting a new
VMM to pick up from where it left off.
Sometimes the vCPUs might all actually still be running. Sometimes they
might have hit an exit that couldn't be handled.
Doing kexec while the VMM is "hands-off" is then the *next* challenge.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-04-30 15:27 ` David Woodhouse
@ 2026-05-01 3:32 ` Paolo Bonzini
2026-05-01 8:56 ` David Woodhouse
0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2026-05-01 3:32 UTC (permalink / raw)
To: David Woodhouse, Pasha Tatashin, linux-kernel, kexec, kvm,
linux-mm, kvmarm
Cc: rppt, graf, pratyush, seanjc, maz, oupton, alex.williamson,
kevin.tian, rientjes, Tycho.Andersen, anthony.yznaga, baolu.lu,
david, dmatlack, mheyne, jgowans, jgg, pankaj.gupta.linux,
kpraveen.lkml, vipinsh, vannapurve, corbet, loeser, tglx, mingo,
bp, dave.hansen, x86, hpa, roman.gushchin, akpm, pjt
On 4/30/26 17:27, David Woodhouse wrote:
> On Thu, 2026-04-30 at 15:28 +0200, Paolo Bonzini wrote:
>> I even wonder if, for long term simplicity, the interface for
>> host->caretaker should be just for the caretaker to swallow the host
>> into non-root mode, again as in Arm nVHE.
>
> There's a lot of merit in that approach.
>
> I talked about wanting to use this 'caretaker' for secret hiding. But
> why have *voluntary* secret hiding with the kernel hiding things from
> its own address space, when you can have *mandatory* secret hiding
> with something running in EL2, like pKVM.
Well, other than because it's a lot of work? :)
> Honestly, I don't see the *caretaker* being much of an ABI at all,
> except from one kernel to the next.
I agree.
> The *userspace* ABI considerations are all about how you make a vCPU
> that runs asynchronously (should it conceptually just be an async
> KVM_RUN call, which allows the vCPU to run in a kernel thread up to the
> point of kexec? Why is it fundamentally tied to kexec at all?).
It's not tied to kexec. kexec is just forcing a handoff + forcing an
update.
The big difference is that:
1) if you don't tie it to kexec, a detached vCPU thread is a struct
vhost_task and a blocking vmexit schedules out the thread; while during
kexec you have s/kthread/pCPU/ and halting the CPU instead of scheduling
it out.
2) if you don't tie it to kexec, address space isolation is the only
real reason for the complication of treating the caretaker as a separate
bare metal program. OTOH maybe that's a feature - you could do:
- ioctl(KVM_RUN_ASYNC)
- then vmfd/vcpufd handoff to a new mm on top
- then address space isolation on top
- then kexec (de)serialization on top
> I'd love to start without kexec in the picture at all. Just show me the
> KVM API for starting a *confidential* guest (pKVM, SEV-SNP, whatever),
> leaving it running, completely stopping the VMM and then starting a new
> VMM to pick up from where it left off.
Why confidential?
Paolo
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-05-01 3:32 ` Paolo Bonzini
@ 2026-05-01 8:56 ` David Woodhouse
2026-05-01 22:07 ` Pasha Tatashin
0 siblings, 1 reply; 12+ messages in thread
From: David Woodhouse @ 2026-05-01 8:56 UTC (permalink / raw)
To: Paolo Bonzini, Pasha Tatashin, linux-kernel, kexec, kvm, linux-mm,
kvmarm
Cc: rppt, graf, pratyush, seanjc, maz, oupton, alex.williamson,
kevin.tian, rientjes, Tycho.Andersen, anthony.yznaga, baolu.lu,
david, dmatlack, mheyne, jgowans, jgg, pankaj.gupta.linux,
kpraveen.lkml, vipinsh, vannapurve, corbet, tglx, mingo, bp,
dave.hansen, x86, hpa, roman.gushchin, akpm, pjt
On Fri, 2026-05-01 at 05:32 +0200, Paolo Bonzini wrote:
> On 4/30/26 17:27, David Woodhouse wrote:
> > On Thu, 2026-04-30 at 15:28 +0200, Paolo Bonzini wrote:
> > > I even wonder if, for long term simplicity, the interface for
> > > host->caretaker should be just for the caretaker to swallow the host
> > > into non-root mode, again as in Arm nVHE.
> >
> > There's a lot of merit in that approach.
> >
> > I talked about wanting to use this 'caretaker' for secret hiding. But
> > why have *voluntary* secret hiding with the kernel hiding things from
> > its own address space, when you can have *mandatory* secret hiding
> > with something running in EL2, like pKVM.
>
> Well, other than because it's a lot of work? :)
If we avoided those things then we'd never have any fun!
And in a week where there seems to be a new user-to-root exploit posted
every day, the 'deprivilege the VMM and assume the guest has owned it'
security model is looking rather scary. So the additional defence in
depth of knowing that even *root* can't get the kernel to access other
guests' memory might be the only thing that lets you sleep at night :)
Yes, it's a lot of work. But I think we've reached the point where
mandatory secret hiding is... well... mandatory.
> > The *userspace* ABI considerations are all about how you make a vCPU
> > that runs asynchronously (should it conceptually just be an async
> > KVM_RUN call, which allows the vCPU to run in a kernel thread up to the
> > point of kexec? Why is it fundamentally tied to kexec at all?).
>
> It's not tied to kexec. kexec is just forcing a handoff + forcing an
> update.
>
> The big difference is that:
>
> 1) if you don't tie it to kexec, a detached vCPU thread is a struct
> vhost_task and a blocking vmexit schedules out the thread; while during
> kexec you have s/kthread/pCPU/ and halting the CPU instead of scheduling
> it out.
For now maybe. But "how does the caretaker do scheduling" is definitely
on the list of future problems, for any environment where a
physical host with N pCPUs is hosting >= N vCPUs.
(In the case of a true mandatory-secret-hiding caretaker at EL2, the
scheduling part *could* be done by the residual purgatory-caretaker-
thing at EL1 that all the secondary CPUs go to instead of being turned
off. It would just be calling into EL2 to run the actual vCPUs. Thus
leaving the EL2 code just to do its *one* job, which has the added
benefit that the automated reasoning people put the knives down and no
longer have that look in their eyes that they got when they thought you
wanted to put a scheduler in their formally-proven EL2 code...)
> 2) if you don't tie it to kexec, address space isolation is the only
> real reason for the complication of treating the caretaker as a separate
> bare metal program. OTOH maybe that's a feature - you could do:
>
> - ioctl(KVM_RUN_ASYNC)
>
> - then vmfd/vcpufd handoff to a new mm on top
This much gives you a seamless upgrade of the userspace VMM without
having to play fd-handover tricks. The old VMM detaches, the new one
attaches. If you're quick, and the guests aren't doing much "admin"
work but only passing traffic through passthrough PCI devices, the
guests might not experience any non-negligible steal time at all.
> - then address space isolation on top
Even voluntary secret hiding lets you sleep at night when the next
Retbleed happens.
> - then kexec (de)serialization on top
... and this one is the holy grail.
So yes, that's exactly the kind of thing I was thinking, rather than
trying to boil the ocean. There are sensible milestones along the way
which give practical benefits.
But my point was *also* about understanding the actual userspace
interface for this, even if we were to just focus on the live update
and do it all in one amphetamine-and-tokens-fueled epic. What does it
even look like, from the VMM point of view? How does the new VMM under
the new kernel 'reattach' to the existing vCPUs?
I think we need the userspace API concepts for 'detach' and 'attach',
including the permissions model for reattach, and we might as well
implement and test them without the kexec in the middle to start with.
> > I'd love to start without kexec in the picture at all. Just show me the
> > KVM API for starting a *confidential* guest (pKVM, SEV-SNP, whatever),
> > leaving it running, completely stopping the VMM and then starting a new
> > VMM to pick up from where it left off.
>
> Why confidential?
Mostly so that confidential VMs aren't an *afterthought*, and the
design of the detach/attach userspace ABI gets them right from the
start.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-04-30 13:28 ` Paolo Bonzini
2026-04-30 15:27 ` David Woodhouse
@ 2026-05-01 21:48 ` Pasha Tatashin
2026-05-03 16:57 ` Paolo Bonzini
1 sibling, 1 reply; 12+ messages in thread
From: Pasha Tatashin @ 2026-05-01 21:48 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Pasha Tatashin, linux-kernel, kexec, kvm, linux-mm, kvmarm, rppt,
graf, pratyush, seanjc, maz, oupton, dwmw2, alex.williamson,
kevin.tian, rientjes, Tycho.Andersen, anthony.yznaga, baolu.lu,
david, dmatlack, mheyne, jgowans, jgg, pankaj.gupta.linux,
kpraveen.lkml, vipinsh, vannapurve, corbet, loeser, tglx, mingo,
bp, dave.hansen, x86, hpa, roman.gushchin, akpm, pjt
On 04-30 15:28, Paolo Bonzini wrote:
> I have some very similar observations to Alex and some very similar
> observations to David. This has to imply that everyone will agree with me.
> :)
>
> Seriously, the main contention point, from reading the thread, is the
> placement and lifecycle of the caretaker. More on this later...
>
> On 4/29/26 00:29, Pasha Tatashin wrote:
> > While this proposal focuses on its critical role in minimally disruptive
> > Live Update, the Caretaker is fundamentally designed as an extensible
> > primitive. Its architecture allows it to be leveraged for a variety of
> > other advanced virtualization use cases, such as running custom
> > lightweight hypervisors or completely offloading virtualization duties
> > to an accelerator card.
>
> One step at a time please---and as an initial step, just place it inside the
> kernel, a la Arm nVHE.
>
> Since your design would have anyway the ability to update the caretaker, you
> can embed that part into the reattachment process, so that the new kernel
> can use its own caretaker.
This is a good point. If we make the Caretaker built-in and always
enabled via a module parameter, the update to the new Caretaker can be
handled by the new kernel when it adopts the Orphaned VM thread during
reattachment.
> This reduces a lot the need to establish a stable-ish ABI. Only the
> handover (kexec/LUO) needs to be stable, so that the new kernel can populate
> its kvm and kvm_vcpu structs. And for that we mostly have a solution
> already: a stream of serialized ioctls.
The way I see it, vmfd and vcpufd need to support LUO preservation by
implementing the liveupdate_file_ops callbacks (.preserve, .retrieve).
When userspace preserves the vcpufd, the kernel isolates the assigned
pCPU (probably requiring the vCPU to be pinned to the CPU, and the CPU
to be in an isolated cpuset), offlines it from the kernel's perspective,
and signals to the Caretaker that KVM is detached.
We can start with the simplest form of the Caretaker for a PoC:
* No ASI. It duplicates the kernel page tables and uses its own mappings.
* It does not handle any VM exits locally. It passes everything down to
KVM during normal operation, and treats all exits as blocking events
when detached.
The PoC can be demonstrated without kexec: execute vcpufd .preserve() to
isolate and offline the pCPU, and later use LUO .retrieve() to online
the pCPU and re-adopt the running vCPU thread.
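To make the shape of this concrete, here is a rough sketch of the
vcpufd side; the callback signatures and every caretaker_* helper below
are illustrative, not the final LUO interface:

        static int kvm_vcpu_luo_preserve(struct file *file, u64 *data)
        {
                struct kvm_vcpu *vcpu = file->private_data;

                /* Require the vCPU to be pinned to an isolated pCPU. */
                if (!kvm_vcpu_pinned_isolated(vcpu))
                        return -EINVAL;

                /* Flip the CCB to KVM_DETACHED via IPI, then take the pCPU
                 * away from the host.  This is not the normal remove_cpu()
                 * path: the guest keeps running on that core. */
                caretaker_detach(vcpu);
                return caretaker_offline_cpu(vcpu->cpu);
        }

        static int kvm_vcpu_luo_retrieve(struct file *file, u64 data)
        {
                struct kvm_vcpu *vcpu = file->private_data;

                /* Online the pCPU again and re-adopt the running vCPU. */
                caretaker_online_cpu(vcpu->cpu);
                return caretaker_attach(vcpu);
        }

        static const struct liveupdate_file_ops kvm_vcpu_luo_ops = {
                .preserve = kvm_vcpu_luo_preserve,
                .retrieve = kvm_vcpu_luo_retrieve,
        };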
Even for this simplest form, we still need a defined ABI between the
host kernel and the Caretaker. The host must send an IPI to
synchronously notify the Caretaker of attach and detach transitions.
This ABI must also handle Caretaker replacement during the adoption
phase: when the new kernel retrieves the vCPU, it requires a protocol to
notify the running Caretaker that the pCPU is being onlined. The
Caretaker must reach a known state at that moment so the kernel can
seamlessly replace the previous kernel's Caretaker version with the
current one. Finally, the Caretaker still relies on the CCB to access
KVM routing pointers for forwarding VM exits back to the host during
normal operation.
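In this minimal form the CCB itself could be very small; the layout
below is only an illustration of the idea, not a proposed ABI:

        struct caretaker_ccb {
                atomic_t      state;            /* KVM_ATTACHED / KVM_DETACHED */
                unsigned long kvm_exit_handler; /* routing pointer back to KVM */
                phys_addr_t   shared_config;    /* per-vCPU configuration pages */
                phys_addr_t   telemetry;        /* trace/telemetry ring */
        };

        /* Host side, executed on the target pCPU from the IPI handler. */
        static void ccb_set_detached(struct caretaker_ccb *ccb)
        {
                atomic_set_release(&ccb->state, KVM_DETACHED);
        }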
> > During the execution of the KVM_SET_CARETAKER ioctl, instead of
> > pointing the hardware's return path to standard KVM entry points (e.g.,
> > vmx_vmexit or svm_vcpu_run), KVM reprograms the host-state return area
> > of the CPU's hardware virtualization control structures (e.g., Intel
> > VMCS, AMD VMCB, or ARM equivalent) to point directly into the
> > bare-metal Caretaker environment.
>
> This can be done unconditionally for all VMs based on a module parameter,
> again as in Arm nVHE.
Makes sense.
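On VMX, for instance, installing the Caretaker could boil down to
something like the following (HOST_RIP/HOST_CR3/HOST_IDTR_BASE are the
actual VMCS fields; the caretaker_* symbols are made up):

        /* Make VM exits land in the caretaker instead of KVM's exit stub. */
        vmcs_writel(HOST_RIP, (unsigned long)caretaker_vmexit_entry);
        vmcs_writel(HOST_CR3, caretaker_cr3);            /* isolated page tables */
        vmcs_writel(HOST_IDTR_BASE, caretaker_idt_base); /* NMIs/MCEs in the gap */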
>
> > Note on Optimization vs. Security: Constantly switching the page table
> > (CR3) on every VM Exit can be expensive due to TLB flushing. To
> > optimize performance, the Caretaker can share the host kernel's page
> > tables while the kernel is still around, and dynamically replace
> > HOST_CR3 with the dedicated, isolated page tables only when the vCPU is
> > orphaned (during the detachment phase). On the other hand, maintaining
> > a permanently isolated CR3 for the Caretaker adds a strong security
> > boundary, achieving hardware-enforced separation similar to KVM Address
> > Space Isolation (ASI).
>
> Agreed on this.
>
> > The Caretaker requires a defined ABI to communicate with the host KVM
> > subsystem. This ABI is implemented via the shared, identity-mapped .ccb
> > section of the ELF payload, acting as the Caretaker Control Block
> > (CCB).
> >
> > The CCB acts as the source of truth for the Caretaker's execution loop
> > and contains three primary elements:
> >
> > * Attachment State Flag: An atomic variable indicating the current
> > relationship with the host KVM subsystem (e.g., KVM_ATTACHED or
> > KVM_DETACHED).
>
> This must be done atomically at the time Linux offlines/onlines a pCPU. The
> interface from Linux to the caretaker must use some kind of IPI so that the
> new kernel can force a VMEXIT (if needed) in the caretaker, ask it to
> serialize the vm state, and pass it down to the new kernel's caretaker.
Yes, agree.
>
> > * KVM Routing Pointers: The physical function pointers that the
> > Caretaker uses to safely jump into the host KVM's standard VM Exit
> > handlers when operating in normal mode.
> > * Shared Configuration Metadata: A physical pointer to dedicated
> > memory pages used by the kernel to share dynamic vCPU configuration
> > data with the Caretaker. Because every guest is configured
> > differently, KVM populates these pages with the specific parameters
> > negotiated during VM initialization (such as CPUID feature masks,
> > APIC routing, and timer states). These pages also include a
> > pre-allocated Telemetry Buffer for the Caretaker to log VM Exits
> > and spin-wait durations. These dedicated pages are explicitly
> > preserved across the host reboot via KHO, ensuring the Caretaker
> > maintains continuous access to the exact context required to
> > accurately emulate trivial exits during the gap.
>
> All this is mostly unnecessary if the caretaker is provided by the kernel.
> The recently introduced remote ring buffers can be used for tracing too.
I will need to learn more about how remote ring buffers work, but it
would have to be possible to use them even when there is no kernel. So,
as I understand it, something like this would happen:
The old kernel allocates the remote ring buffer, passes the physical
address to the Caretaker, and preserves this memory via KHO.
During the kexec gap (when the host kernel is completely offline), the
Caretaker acts purely as a producer, writing trace events directly to
this physical memory block and advancing the ring buffer pointers.
When the new kernel boots and re-adopts the orphaned vCPU, it retrieves
this memory from KHO and attaches it back to the tracing subsystem.
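The producer side during the gap can stay trivial; something along
these lines, with a purely illustrative record format rather than the
real remote ring buffer ABI:

        struct ct_trace_event {
                u32 exit_reason;
                u64 tsc;
        };

        struct ct_trace_ring {
                u64 head;
                u64 nr_slots;
                struct ct_trace_event ev[];
        };

        static void ct_trace_exit(struct ct_trace_ring *ring, u32 exit_reason)
        {
                struct ct_trace_event *e = &ring->ev[ring->head % ring->nr_slots];

                e->exit_reason = exit_reason;
                e->tsc = rdtsc();
                /* Publish the record before moving the head for a future reader. */
                smp_store_release(&ring->head, ring->head + 1);
        }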
> > The Caretaker first evaluates the VM Exit reason. If the exit belongs to
> > a category that the Caretaker is programmed to resolve natively, it
> > handles it internally. For example, profiling of guests has identified
> > the following exit categories for potential local resolution:
> >
> > * Guest Idle Exits (e.g., HLT): When the guest OS goes idle, it
> > triggers idle exits. The Caretaker intercepts these and halts the
> > physical core until the next guest-bound interrupt fires, preserving
> > host power.
>
> I don't think HLT can be handled entirely here. Either you skip the exit
> completely or you have to go out to the scheduler. The HLT exit could be
> skipped unconditionally for an orphaned VM, but while there is a running
> kernel the caretaker has to run entirely with interrupts off and that limits
> what you can do.
>
> In fact there is already a blueprint of what can be handled easily in the
> caretaker, namely vmx_exit_handlers_fastpath()/svm_exit_handlers_fastpath().
> Stick to what exists already.
I agree regarding HLT: during the gap, the simplest approach is to just
skip the exit and return directly to the VM without attempting to handle
it.
Also, yes, vmx_exit_handlers_fastpath() and svm_exit_handlers_fastpath()
provide a good blueprint for the VM exits that can be handled directly
in the Caretaker.
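For the detached case the dispatch can then stay very small; a sketch
on VMX, where the exit-reason constants are the real ones and the
caretaker_* helpers are hypothetical:

        /* Control lands here because HOST_RIP points at the caretaker while
         * the vCPU is orphaned. */
        static void caretaker_handle_exit_detached(void)
        {
                u32 reason = vmcs_read32(VM_EXIT_REASON) & 0xffff;

                switch (reason) {
                case EXIT_REASON_HLT:
                case EXIT_REASON_PAUSE_INSTRUCTION:
                        /* Skip the exit and go straight back into the guest. */
                        caretaker_skip_emulated_instruction();
                        break;
                default:
                        /* Everything else is a blocking exit during the gap. */
                        caretaker_wait_for_attach();
                        break;
                }
                caretaker_vmresume();   /* does not return */
        }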
>
> > * Timer and APIC Exits: Even an idle guest frequently writes to
> > interrupt controllers and system registers to configure internal
> > timers. The Caretaker handles these trivial writes directly,
> > acknowledging the timer updates.
>
> This depends heavily on the implementation of the hypervisor, for example it
> can be done on Intel via the preemption timer but not on AMD where an actual
> hrtimer is needed.
The Caretaker will have varying degrees of support depending on the
architecture.
For the initial PoC, we can start with the most primitive form: treating
these as blocking exits across all relevant architectures (Intel, AMD,
ARM).
Beyond the PoC, we will expand this functionality independently per
platform to reduce the blackout window, perhaps based on fastpath
support. If a specific architecture lacks the hardware features to
handle a certain exit natively in the Caretaker, it will remain a
blocking exit on that platform during the gap.
>
> [...]
>
> > When the new VMM process spawns, it retrieves the
> > preserved session and issues LIVEUPDATE_SESSION_RETRIEVE_FD using
> > its token. LUO invokes KVM's .retrieve() callback to map the
> > preserved vcpufd back into the new VMM's file descriptor table. As
> > part of this retrieval process, the host formally brings the
> > isolated pCPU back online, and the new VMM userspace thread is
> > attached back to the active VM thread running on the vCPU. Finally,
> > KVM populates the new KVM Routing Pointers in the CCB and
> > atomically flips the Host State Flag back to KVM_ATTACHED. This
> > breaks the Caretaker's spin-wait loop (if it is in this state),
> > allowing standard KVM operation to resume.
>
> This would also include some kind of serialization of the old VM into the
> new kernel's struct kvm_vcpu.
>
> Also some kind of feature negotiation is needed (if that fails, the VMs are
> terminated unceremoniously) so I believe that the transition into and out of
> the gap must be synchronous. For example with INIT/SIPI for the entry, and
> an IPI for the exit?
Yes, the serialization of the old VM state into the new kernel's struct
kvm_vcpu is handled via KHO (Kexec Handover) during the LUO .preserve()
and .retrieve() phases.
Regarding feature negotiation, compatibility checks are a fundamental
requirement for live update. If the new kernel does not support the
features required by the preserved VM, the update should ideally be
aborted before the point of no return, or as you said, the VM must be
terminated unceremoniously.
I completely agree that the transition into and out of the gap must be
synchronous. As discussed above, using an IPI is the right approach. For
entry, the host kernel signals the Caretaker via IPI to ensure it
reaches a known state before the pCPU is offlined. For exit, the new
kernel sends an IPI to the orphaned pCPU to force a VM exit, allowing
the new kernel to take control, update the Caretaker environment,
and complete the reattachment.
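Roughly, the re-adoption side in the new kernel could look like this
(all helpers are hypothetical placeholders):

        static int kvm_caretaker_adopt(struct kvm_vcpu *vcpu)
        {
                struct caretaker_ccb *ccb = vcpu_ccb(vcpu);

                /* Force a VM exit on the orphaned pCPU and wait until the
                 * old caretaker parks itself in a known state. */
                caretaker_send_adopt_ipi(vcpu->cpu);
                caretaker_wait_parked(ccb);

                /* Install the new kernel's caretaker, hand control to KVM. */
                caretaker_install(vcpu->cpu);
                ccb->kvm_exit_handler = (unsigned long)vmx_vmexit;
                atomic_set_release(&ccb->state, KVM_ATTACHED);
                return 0;
        }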
> > Guest-to-Guest IPIs
> > -------------------
> >
> > * The Problem: If the guest OS attempts to wake up a sleeping thread,
> > one orphaned vCPU will send an Inter-Processor Interrupt (IPI) to
> > another orphaned vCPU. In standard virtualization without hardware
> > assistance, writing to the APIC ICR (or sending an ARM SGI) causes
> > a VM Exit so the host KVM can emulate the message delivery. During
> > the gap, KVM is unavailable to route this message.
> >
> > * Proposed Solution: The architecture may leverage hardware virtualized
> > interrupts (Intel APICv, AMD AVIC, or ARM GICv4.1 virtual SGIs).
> > This allows the hardware silicon to handle IPI delivery between the
> > isolated pCPUs natively, eliminating the VM Exit. Alternatively,
> > the Caretaker can be programmed to emulate the IPI delivery. By
> > utilizing the shared memory metadata, the Caretaker can determine
> > the target vCPU and directly update its pending interrupt state.
>
> Yeah, I think APIC emulation to some extent must be moved into the VMX/SVM
> fastpaths. The good news is that this can be done already as a PoC without
> needing the whole caretaker and LUO infrastructure.
Moving APIC emulation into the VMX/SVM fastpaths makes sense as a
standalone effort.
>
> > * The Problem: What happens if a Non-Maskable Interrupt (NMI), a
> > hardware timer tick, or a Machine Check Exception / System Error
> > (MCE / ARM SError) arrives while the CPU is actively executing
> > Caretaker code in KVM_DETACHED mode?
> >
> > * Proposed Solution: To safely handle these asynchronous events, [...]
> > on x86, when transitioning into the gap, KVM explicitly programs
> > HOST_IDTR and HOST_GDTR to [the caretaker's] tables.
>
> Agreed and this also shows that the transition must be synchronous.
> > * The Problem: As the guest executes, it may attempt to access memory
> > that has not yet been mapped by the hypervisor, or it may interact
> > with MMIO regions. Normally, this triggers an EPT Violation (Intel)
> > or NPT Page Fault (AMD), prompting KVM to allocate host pages and
> > update the secondary page tables. How are these updates handled
> > when the host KVM subsystem is offline during the gap?
> >
> > * Proposed Solution: During the "Management Gap," there are absolutely
> > no updates made to the NPT/EPT. The existing secondary page tables
> > are fully preserved in memory via LUO kvmfd preservation prior to
> > detachment, allowing the guest to seamlessly access all previously
> > mapped memory. If the guest triggers a new page fault (requiring an
> > NPT/EPT update) during the gap, the Caretaker simply categorizes it
> > as a Blocking Exit.
>
> Yes, by default everything is a blocking exit. In particular, unless one
> day we do x86/pKVM, page tables can be handled entirely by Linux rather than
> the caretaker with no change to the existing MMU notifier architecture.
>
> As a consequence, the caretaker is absolutely not going to be a TCB---at
> least not in the beginning.
+1.
>
> > Compromised Caretaker
> > ---------------------
> >
> > * The Problem: The Caretaker runs in Host Mode. If left unprotected,
> > this could allow a lightly privileged userspace process (e.g., QEMU
> > or crosvm) to inject arbitrary executable code directly into the
> > CPU's most privileged hardware state (VMX Root / Ring 0 / EL2).
> >
> > * Proposed Solution: To mitigate this risk, the KVM_SET_CARETAKER
> > ioctl may adopt the security model used by the kexec_file_load()
> > syscall. Rather than trusting userspace to pass physical addresses,
> > the kernel must take full ownership of payload validation:
>
> -EOVERENGINEERED. Just shove it into the kernel.
Ok.
>
> > Caretaker Update
> > ----------------
> >
> > * The Problem: Given that the Caretaker is permanently installed
> > during VM setup, how does it get updated on long-running VMs?
>
> Via kexec. :) I understand you have bigger plans, but we need to crawl
> before walk^Wattempting a marathon.
Sure.
>
> I even wonder if, for long term simplicity, the interface for
> host->caretaker should be just for the caretaker to swallow the host into
> non-root mode, again as in Arm nVHE. That would make it much harder to
> implement some kind of live update, but my answer to that *really* is just
> to use kexec.
>
> Paolo
>
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-05-01 8:56 ` David Woodhouse
@ 2026-05-01 22:07 ` Pasha Tatashin
0 siblings, 0 replies; 12+ messages in thread
From: Pasha Tatashin @ 2026-05-01 22:07 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Pasha Tatashin, linux-kernel, kexec, kvm, linux-mm,
kvmarm, rppt, graf, pratyush, seanjc, maz, oupton,
alex.williamson, kevin.tian, rientjes, Tycho.Andersen,
anthony.yznaga, baolu.lu, david, dmatlack, mheyne, jgowans, jgg,
pankaj.gupta.linux, kpraveen.lkml, vipinsh, vannapurve, corbet,
tglx, mingo, bp, dave.hansen, x86, hpa, roman.gushchin, akpm, pjt
On 05-01 09:56, David Woodhouse wrote:
> On Fri, 2026-05-01 at 05:32 +0200, Paolo Bonzini wrote:
> > On 4/30/26 17:27, David Woodhouse wrote:
> > > On Thu, 2026-04-30 at 15:28 +0200, Paolo Bonzini wrote:
> > > > I even wonder if, for long term simplicity, the interface for
> > > > host->caretaker should be just for the caretaker to swallow the host
> > > > into non-root mode, again as in Arm nVHE.
> > >
> > > There's a lot of merit in that approach.
> > >
> > > I talked about wanting to use this 'caretaker' for secret hiding. But
> > > why have *voluntary* secret hiding with the kernel hiding things from
> its own address space, when you can have *mandatory* secret hiding
> > > with something running in EL2, like pKVM.
> >
> > Well, other than because it's a lot of work? :)
>
> If we avoided those things then we'd never have any fun!
>
> And in a week where there seems to be a new user-to-root exploit posted
> every day, the 'deprivilege the VMM and assume the guest has owned it'
> security model is looking rather scary. So the additional defence in
> depth of knowing that even *root* can't get the kernel to access other
> guests' memory might be the only thing that lets you sleep at night :)
>
> Yes, it's a lot of work. But I think we've reached the point where
> mandatory secret hiding is... well... mandatory.
>
> > > The *userspace* ABI considerations are all about how you make a vCPU
> > > that runs asynchronously (should it conceptually just be an async
> > > KVM_RUN call, which allows the vCPU to run in a kernel thread up to the
> > > point of kexec? Why is it fundamentally tied to kexec at all?).
> >
> > It's not tied to kexec. kexec is just forcing a handoff + forcing an
> > update.
> >
> > The big difference is that:
> >
> > 1) if you don't tie it to kexec, a detached vCPU thread is a struct
> > vhost_task and a blocking vmexit schedules out the thread; while during
> > kexec you have s/kthread/pCPU/ and halting the CPU instead of scheduling
> > it out.
>
> For now maybe. But "how does the caretaker do scheduling" is
> definitely on the list of future problems, for any environment where a
> physical host with N pCPUs is hosting >= N vCPUs.
>
> (In the case of a true mandatory-secret-hiding caretaker at EL2, the
> scheduling part *could* be done by the residual purgatory-caretaker-
> thing at EL1 that all the secondary CPUs go to instead of being turned
> off. It would just be calling into EL2 to run the actual vCPUs. Thus
> leaving the EL2 code just to do its *one* job, which has the added
> benefit that the automated reasoning people put the knives down and no
> longer have that look in their eyes that they got when they thought you
> wanted to put a scheduler in their formally-proven EL2 code...)
For the initial PoC, however, we will bypass the scheduling problem
entirely by enforcing a 1:1 mapping of active vCPUs to isolated pCPUs.
If a system is overcommitted, we can either pin some vCPUs, or simply
suspend them across the kexec gap and wait for the new kernel's scheduler
to resume them, i.e. the same as what we have now without Orphaned VM
support.
> > 2) if you don't tie it to kexec, address space isolation is the only
> > real reason for the complication of treating the caretaker as a separate
> > bare metal program. OTOH maybe that's a feature - you could do:
> >
> > - ioctl(KVM_RUN_ASYNC)
> >
> > - then vmfd/vcpufd handoff to a new mm on top
>
> This much gives you a seamless upgrade of the userspace VMM without
> having to play fd-handover tricks. The old VMM detaches, the new one
> attaches. If you're quick, and the guests aren't doing much "admin"
> work but only passing traffic through passthrough PCI devices, the
> guests might not experience any non-negligible steal time at all.
I agree. During development, we should maintain a workflow that is
functionally identical to the kexec transition. Even without a kernel
reboot, the process should be: preserve resources via LUO, isolate and
offline the pCPU, launch the new VMM, and then retrieve the resources to
online the pCPU and re-adopt the vCPU.
>
> > - then address space isolation on top
>
> Even voluntary secret hiding lets you sleep at night when the next
> Retbleed happens.
>
> > - then kexec (de)serialization on top
>
> ... and this one is the holy grail.
>
> So yes, that's exactly the kind of thing I was thinking, rather than
> trying to boil the ocean. There are sensible milestones along the way
> which give practical benefits.
>
> But my point was *also* about understanding the actual userspace
> interface for this, even if we were to just focus on the live update
> and do it all in one amphetamine-and-tokens-fueled epic. What does it
> even look like, from the VMM point of view? How does the new VMM under
> the new kernel 'reattach' to the existing vCPUs?
>
> I think we need the userspace API concepts for 'detach' and 'attach',
> including the permissions model for reattach, and we might as well
> implement and test them without the kexec in the middle to start with.
From the VMM point of view, the interface would follow the standard Live
Update Orchestrator flow for file descriptor preservation. This is
the same mechanism used to preserve and restore resources like vfiofd,
iommufd, and memfd across a transition.
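Concretely, the VMM-side flow would look roughly like this; the ioctl
names come from LUO, but the argument structs and the token below are
made up for illustration:

        /* Old VMM, before exiting: */
        struct luo_preserve_fd p = { .fd = vcpufd, .token = VCPU0_TOKEN };
        ioctl(luo_session_fd, LIVEUPDATE_SESSION_PRESERVE_FD, &p);
        /* ... old VMM exits; kexec (or just a VMM restart) happens here ... */

        /* New VMM, after it starts and opens the preserved session: */
        struct luo_retrieve_fd r = { .token = VCPU0_TOKEN };
        ioctl(luo_session_fd, LIVEUPDATE_SESSION_RETRIEVE_FD, &r);
        vcpufd = r.fd;  /* pCPU brought back online, vCPU thread re-adopted */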
>
> > > I'd love to start without kexec in the picture at all. Just show me the
> > > KVM API for starting a *confidential* guest (pKVM, SEV-SNP, whatever),
> > > leaving it running, completely stopping the VMM and then starting a new
> > > VMM to pick up from where it left off.
> >
> > Why confidential?
>
> Mostly so that confidential VMs aren't an *afterthought*, and the
> design of the detach/attach userspace ABI gets them right from the
> start.
* Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
2026-05-01 21:48 ` Pasha Tatashin
@ 2026-05-03 16:57 ` Paolo Bonzini
0 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2026-05-03 16:57 UTC (permalink / raw)
To: Pasha Tatashin
Cc: linux-kernel, kexec, kvm, linux-mm, kvmarm, rppt, graf, pratyush,
seanjc, maz, oupton, dwmw2, alex.williamson, kevin.tian, rientjes,
Tycho.Andersen, anthony.yznaga, baolu.lu, david, dmatlack, mheyne,
jgowans, jgg, pankaj.gupta.linux, kpraveen.lkml, vipinsh,
vannapurve, corbet, loeser, tglx, mingo, bp, dave.hansen, x86,
hpa, roman.gushchin, akpm, pjt
On Fri, May 1, 2026 at 11:48 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
> The way I see it, vmfd and vcpufd need to support LUO preservation by
> implementing the liveupdate_file_ops callbacks (.preserve, .restore).
>
> When userspace preserves the vcpufd, the kernel isolates the assigned
> pCPU (probably requiring the vCPU to be pinned to the CPU, and the CPU
> to be in an isolated cpuset),
Generally speaking vCPUs do not care if they are pinned. Do we need a
preparatory ioctl for this on the vcpufd side, even just a
KVM_ENABLE_CAP or a new bit for
KVM_ENABLE_CAP(KVM_CAP_X86_DISABLE_EXITS)? If
LIVEUPDATE_SESSION_PRESERVE_FD suddenly can start offlining pCPUs,
that would require a capability check.
> Even for this simplest form, we still need a defined ABI between the
> host kernel and the Caretaker. The host must send an IPI to
> synchronously notify the Caretaker of attach and detach transitions.
Yes, but it's unlikely to need stability, unlike the KHO serialization.
> This ABI must also handle Caretaker replacement during the adoption
> phase: when the new kernel retrieves the vCPU, it requires a protocol to
> notify the running Caretaker that the pCPU is being onlined. The
> Caretaker must reach a known, state at that moment so the kernel can
> seamlessly replace the previous kernel's Caretaker version with the
> current one. Finally, the Caretaker still relies on the CCB to access
> KVM routing pointers for forwarding VM exits back to the host during
> normal operation.
I'm not sure if routing pointers are needed, as opposed to just a
call/ret (ignoring detachment and reattachment which are different
anyway). In the end, the x86 caretaker is basically the
non-preemptable part of vcpu_enter_guest(). Even once you add
attach/detach, the body of code that runs in the caretaker is roughly
the same and what changes is setting up the address space, the IDT,
etc.
> > This must be done atomically at the time Linux offlines/onlines a pCPU. The
> > interface from Linux to the caretaker must use some kind of IPI so that the
> > new kernel can force a VMEXIT (if needed) in the caretaker, ask it to
> > serialize the vm state, and pass it down to the new kernel's caretaker.
>
> Yes, agree.
BTW, the same IPI is needed to force a VMEXIT even before kexec, if
the old kernel does anything that breaks the running VM such as
madvise(MADV_FREE). Should not happen with properly behaving
userspace, but it must be accounted for.
> During the kexec gap (when the host kernel is completely offline), the
> Caretaker acts purely as a producer, writing trace events directly to
> this physical memory block and advancing the ring buffer pointers.
>
> When the new kernel boots and re-adopts the orphaned vCPU, it retrieves
> this memory from KHO and attaches it back to the tracing subsystem.
Yes, this should work with remote trace buffers.
> I agree regarding HLT: during the gap, the simplest approach is to just
> skip the exit and return directly to the VM without attempting to handle
> it.
Or even disable intercepts for HLT/PAUSE/MONITOR/MWAIT.
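That much is available today via KVM_ENABLE_CAP; roughly (per VM, and
IIRC it has to be set before vCPUs are created):

        struct kvm_enable_cap cap = {
                .cap  = KVM_CAP_X86_DISABLE_EXITS,
                .args = { KVM_X86_DISABLE_EXITS_HLT |
                          KVM_X86_DISABLE_EXITS_PAUSE |
                          KVM_X86_DISABLE_EXITS_MWAIT },
        };
        ioctl(vmfd, KVM_ENABLE_CAP, &cap);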
> I completely agree that the transition into and out of the gap must be
> synchronous. As discussed above, using an IPI is the right approach. For
> entry, the host kernel signals the Caretaker via IPI to ensure it
> reaches a known state before the pCPU is offlined.
And then does INIT/SIPI on the offlined pCPU to re-enter the caretaker
in detached mode.
> For exit, the new
> kernel sends an IPI to the orphaned pCPU to force a VM exit, allowing
> the new kernel to take control, update the Caretaker environment,
> and complete the reattachment.
> > Yeah, I think APIC emulation to some extent must be moved into the VMX/SVM
> > fastpaths. The good news is that this can be done already as a PoC without
> > needing the whole caretaker and LUO infrastructure.
>
> Moving APIC emulation into the VMX/SVM fastpaths makes sense as a
> standalone effort.
Yup, that's nice to have.
Thanks for the detailed reply, I limited mine to where I wanted your
input as the author of LUO - especially with respect to privilege
separation.
Paolo
Thread overview: 12+ messages
2026-04-28 22:29 [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update Pasha Tatashin
2026-04-29 8:13 ` Alexander Graf
2026-04-29 8:40 ` David Woodhouse
2026-04-29 16:13 ` Pasha Tatashin
2026-04-29 16:02 ` Pasha Tatashin
2026-04-30 13:28 ` Paolo Bonzini
2026-04-30 15:27 ` David Woodhouse
2026-05-01 3:32 ` Paolo Bonzini
2026-05-01 8:56 ` David Woodhouse
2026-05-01 22:07 ` Pasha Tatashin
2026-05-01 21:48 ` Pasha Tatashin
2026-05-03 16:57 ` Paolo Bonzini