qemu-devel.nongnu.org archive mirror
* [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch
@ 2025-10-01  1:01 salil.mehta
From: salil.mehta @ 2025-10-01  1:01 UTC
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

[!] Sending again: It looks like mails sent from my official ID are being held
somewhere. Hence, I am using my other email address. Sorry for any inconvenience
this may have caused.

============
(I) Prologue
============

This patch series adds support for a virtual CPU hotplug-like feature (in terms
of usage) to Armv8+ platforms. Administrators are able to dynamically scale the
compute capacity on demand by adding or removing vCPUs. The interface is similar
in look-and-feel to the vCPU hotplug feature supported on x86 platforms. While
this series for Arm platforms shares the end goal with x86, it is implemented
differently because of inherent differences in the CPU architecture and the
constraints it imposes.

In this implementation, the meaning of "CPU hotplug" is as described in the Arm
Power State Coordination Interface (PSCI) specification (DEN0022F.b, §4.3 "CPU
hotplug and secondary CPU boot", §5.5, 5.6). This definition has not changed.
On Arm platforms, the guest kernel itself can request CPU onlining or offlining
using PSCI calls (via SMC/HVC), since the CPU_ON and CPU_OFF functions are part
of the standard PSCI interface exposed to the non-secure world.

This patch series instead adds the infrastructure required to implement
administrative policy control in QEMU/VMM (as privileged software) along with
the ability to convey changes via ACPI to the guest kernel. This ensures the
guest is notified of compute capacity changes that result from per-vCPU
administrative policy. This is conceptually similar to the traditional CPU
hotplug mechanism that x86 follows. It allows or denies guest-initiated PSCI
CPU_ON/OFF requests by enabling or disabling an already ACPI-described and
present vCPU via HMP/QMP 'device_set' (a new interface), making it (un)available
to the guest kernel. This provides the look-and-feel of vCPU hotplug through
ACPI _STA.Enabled toggling, while keeping all vCPUs enumerated in ACPI tables at
boot.

Unlike x86, where vCPUs can become not-present after boot and the kernel
tolerates some level of dynamic topology change (perhaps because the
architecture allows it), the Arm CPU architecture requires that the number of
vCPUs and their associated redistributors remain fixed once the system has
booted. Consequently, the Arm host and guest kernels do not tolerate the
addition or removal of CPU objects after boot.

Offline vCPUs remain described to guest firmware and OSPM, and can be brought
online later by toggling the ACPI _STA Enabled bit. This model aligns with
ACPI 6.5 (Table 5.37, GICC CPU Interface Flags), which introduced the "Online
Capable" bit to signal processors that can be enabled at runtime. It is also
consistent with the Arm GIC Architecture Specification (IHI0069H, §11.1), which
defines CPU interface power domain behavior.

Corresponding kernel changes from James Morse (ARM) have already been accepted
and have been part of the mainline Linux kernel since the 6.11 release.

====================================
(II) Summary of `Recent` Key Changes
====================================

RFC V5 -> RFC V6

(*) KeyChange: Introduced new infrastructure to handle administrative PowerState
    transitions (enable-to-disable & vice-versa) as-per Policy.
(*) Stopped using the existing vCPU Hotplug infrastructure code
(*) Replaced 'device_add/-device' with new 'device_set/-deviceset' interface
(*) Introduced '-smp disabledcpus=N' parameter for Qemu CLI
(*) Dropped 'info hotpluggable'. Added 'info cpus-powerstate' command
(*) Introduced DeviceState::admin_power_state property={enabled,disabled,removed} states
(*) Introduced new 'PowerStateHandler' abstract interface with powerstate hooks.
(*) Dropped destruction of disabled vCPU objects post cpu init.
(*) Dropped vCPU Hotplug ACPI support; introduced ACPI/GED support specifically for ARM type vCPUs
(*) Dropped GIC IRQ unwiring support once VM is initialized.
(*) Dropped vCPU unrealization support. Retained lazy realization of disabled vCPUs (boot time).
(*) All vCPU objects exist for lifetime of VM.
(*) Introduced a separate ACPI CPU/OSPM interface to handle device check, eject
    request etc. to notify the guest kernel about changes in policy.
(*) Introduced new concept of *userspace parking* of 'disabled' KVM vCPUs
(*) We do not migrate disabled vCPUs
(*) Mitigation to pause_all_vcpus() problem. Caching the ICC_CTLR_EL1 in Qemu
(*) Stopped reconciling (for now) vCPU config at destination VM during Migration

Dropped Due to change in vCPU handling approach:

[PATCH RFC V5 03/30] hw/arm/virt: Move setting of common vCPU properties in a function
[PATCH RFC V5 04/30] arm/virt, target/arm: Machine init time change common to vCPU {cold|hot}-plug
[PATCH RFC V5 09/30] arm/acpi: Enable ACPI support for vCPU hotplug
[PATCH RFC V5 12/30] arm/virt: Release objects for *disabled* possible vCPUs after init
[PATCH RFC V5 14/30] hw/acpi: Make _MAT method optional
[PATCH RFC V5 16/30] target/arm: Force ARM vCPU *present* status ACPI *persistent*
[PATCH RFC V5 18/30] arm/virt: Changes to (un)wire GICC<->vCPU IRQs during hot-(un)plug
[PATCH RFC V5 22/30] target/arm/cpu: Check if hotplugged ARM vCPU's FEAT match existing
[PATCH RFC V5 24/30] target/arm: Add support to *unrealize* ARMCPU during vCPU Hot-unplug
[PATCH RFC V5 25/30] tcg/mttcg: Introduce MTTCG thread unregistration leg
[PATCH RFC V5 30/30] hw/arm/virt: Expose cold-booted vCPUs as MADT GICC *Enabled*

Modified or Code reused in other patches:

[PATCH RFC V5 19/30] hw/arm, gicv3: Changes to notify GICv3 CPU state with vCPU hot-(un)plug event
[PATCH RFC V5 17/30] arm/virt: Add/update basic hot-(un)plug framework
[PATCH RFC V5 20/30] hw/arm: Changes required for reset and to support next boot
[PATCH RFC V5 21/30] arm/virt: Update the guest(via GED) about vCPU hot-(un)plug events

---------------------------------
[!] Expectations From This RFC v6
---------------------------------

Please refer to the DISCLAIMER in Section (XI) for the correct expectations from
this version of the RFC.

===============
(II) Motivation
===============

This series adds virtual CPU hot-plug-like support for the ARMv8+ architecture
in QEMU. It allows vCPUs to be brought online or offline after VM boot, similar
to x86, while keeping all CPU resources provisioned and described at startup.
This enables scaling guest VM compute capacity on demand, which is useful in
several scenarios:

1. Vertical Pod Autoscaling [9][10] in the cloud: As part of an orchestration
   framework, resource requests (CPU and memory) for containers in a pod can be
   adjusted dynamically based on usage.

2. Pay-as-you-grow business model: Infrastructure providers may allocate and
   restrict the total compute resources available to a guest VM according to
   the SLA (Service Level Agreement). VM owners can then request additional
   CPUs to be hot-plugged at extra cost.

In Kubernetes environments, workloads such as Kata Container VMs often adopt
a "hot-plug everything" model: start with the minimum resources and add vCPUs
later as needed. For example, a VM may boot with just one vCPU, then scale up
once the workload is provisioned. This approach provides:

1. Faster boot times, and
2. Lower memory footprint.

vCPU hot-plug is therefore one of the steps toward realizing the broader
"hot-plug everything" objective. Other hot-plug mechanisms already exist on ARM,
such as ACPI-based memory hot-plug and PCIe device hot-plug, and are supported
in both QEMU and the Linux guest. Extending vCPU hot-plug in this series aligns
with those efforts and fills the remaining gap.

================
(III) Background
================

The ARM architecture does not support physical CPU hot-plug and lacks a
specification describing the behavior of per-CPU components (e.g. GIC CPU
interface, redistributors, PMUs, timers) when such events occur. As a result,
both host and guest kernels are intolerant to changes in the number of CPUs
enumerated by firmware and described by ACPI at boot time.

We need to respect these architectural constraints and the kernel limitations
they impose, namely the inability to tolerate changes in the number of CPUs
enumerated by firmware once the system has booted, and create a practical
solution with workarounds in the VMM/QEMU.

This patch set implements a non-intrusive solution by provisioning all vCPU
resources during VM initialization and exposing them via ACPI to the guest
kernel. The resources remain fixed, while the effect of hot-plug is achieved by
toggling ACPI CPU status (enabled) bits to bring vCPUs online or offline.

-----------
Terminology
-----------

(*) Possible CPUs: Total vCPUs that could ever exist in the VM. This includes
                   any 'present' & 'enabled' CPUs plus any CPUs that are
                   'present' but are 'disabled' at boot time.
                   - Qemu parameter (-smp cpus=N1,disabled=N2)
                   - Possible vCPUs = N1 + N2
(*) Present CPUs:  Possible CPUs that are ACPI 'present'. These might or might
                   not be ACPI 'enabled'.
(*) Enabled CPUs:  Possible CPUs that are ACPI 'present' and 'enabled' and can
                   now be 'onlined' (PSCI) for use by the Guest Kernel. All cold-
                   booted vCPUs are ACPI 'enabled' at boot. Later, using
                   'device_set/-deviceset', more vCPUs can be ACPI 'enabled'.
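
As a rough illustration of the terminology above, the following Python sketch
derives the three sets from the '-smp cpus=N1,disabled=N2' values. The function
name `cpu_sets` is hypothetical, and the assumption that cold-booted vCPUs
occupy the lowest indices is made only for illustration; neither is taken from
the patches.

```python
def cpu_sets(cpus, disabled):
    """Return (possible, present, enabled) vCPU index sets at boot.

    Assumption for illustration: enabled (cold-booted) vCPUs take the
    lowest indices and disabled vCPUs follow them.
    """
    possible = set(range(cpus + disabled))  # Possible vCPUs = N1 + N2
    present = set(possible)                 # all possible vCPUs are ACPI 'present'
    enabled = set(range(cpus))              # only cold-booted vCPUs are 'enabled'
    return possible, present, enabled


possible, present, enabled = cpu_sets(cpus=4, disabled=2)
print(sorted(possible), sorted(present), sorted(enabled))
```

With cpus=4 and disabled=2 this gives six possible and present vCPUs, of which
four are enabled at boot.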


Below are further details of the constraints:

===============================================
(IV) Constraints Due to ARMv8+ CPU Architecture
===============================================

A. Physical Limitation to Support CPU Hotplug: (Architectural Constraint)

   1. ARMv8 CPU architecture does not support the concept of physical CPU
      hotplug.
      a. There are many per-CPU components like PMU, SVE, MTE, Arch timers, etc.,
         whose behavior needs to be clearly defined when the CPU is
         hot(un)plugged. The current specification does not define this, nor
         are there any immediate plans from ARM to extend support for such a
         feature.
   2. Other ARM components like GIC, etc., have not been designed to realize
      physical CPU hotplug capability as of now. For example,
      a. Every physical CPU has a unique GICC (GIC CPU Interface) by construction.
         The architecture does not specify what CPU hot(un)plug would mean in
         the context of any of these.
      b. CPUs/GICC are physically connected to unique GICR (GIC Redistributor).
         GIC Redistributors are always part of the always-on power domain. Hence,
         they cannot be powered off as per specification.

B. Limitation in Firmware/ACPI (Architectural Constraint)

   1. Firmware has to expose GICC, GICR, and other per-CPU features like PMU,
      SVE, MTE, Arch Timers, etc., to the OS. Due to the architectural constraint
      stated in section A1(a), all interrupt controller structures of
      MADT describing GIC CPU Interfaces and the GIC Redistributors MUST be
      presented by firmware to the OSPM during boot time.
   2. Architectures that support CPU hotplug can evaluate the ACPI _MAT method
      to get this kind of information from the firmware even after boot, and the
      OSPM has the capability to process these. The ARM kernel uses information
      in MADT interrupt controller structures to identify the number of present
      CPUs during boot and hence does not allow these to change after boot. The
      number of present CPUs cannot be changed. It is an architectural constraint!

C. Limitations in KVM to Support Virtual CPU Hotplug (Architectural Constraint)

   1. KVM VGIC:
      a. Sizing of various VGIC resources like memory regions, etc., related to
         the redistributor happens only once and is fixed at the VM init time
         and cannot be changed later after initialization has happened.
         KVM statically configures these resources based on the number of vCPUs
         and the number/size of redistributor ranges.
      b. Association between vCPU and its VGIC redistributor is fixed at the
         VM init time within the KVM, i.e., when redistributor iodevs get
         registered. VGIC does not allow this association to be set up or
         changed after VM initialization has happened. Physically, every
         CPU/GICC is uniquely connected with its redistributor, and there is
         no architectural way to set this up.
   2. KVM vCPUs:
      a. Lack of specification means destruction of KVM vCPUs does not exist as
         there is no reference to tell what to do with other per-vCPU
         components like redistributors, arch timer, etc.
      b. In fact, KVM does not implement the destruction of vCPUs for any
         architecture. This is independent of whether the architecture
         actually supports CPU Hotplug feature. For example, even for x86 KVM
         does not implement the destruction of vCPUs.

D. Considerations in Qemu due to ARM CPU Architecture & related KVM Constraints:

   1. Qemu CPU Objects MUST be created to initialize all the Host KVM vCPUs to
      overcome the KVM constraint. KVM vCPUs are created and initialized when
      Qemu CPU Objects are realized.
   2. The 'GICV3State' and 'GICV3CPUState' objects must be sized for all possible
      vCPUs at VM initialization, when the QOM GICv3 object is realized. This is
      required because the KVM VGIC can only be initialized once, and the number
      of redistributors, their per-vCPU interfaces, and associated data
      structures or I/O device regions are all fixed at VM init time.
   3. How should new QOM CPU objects be connected back to the 'GICV3CPUState'
      objects and disconnected from it in case the CPU is being hot(un)plugged?
   4. How should 'unplugged' or 'yet-to-be-plugged' vCPUs be represented in the
      QOM for which KVM vCPU already exists? For example, whether to keep,
       a. No QOM CPU objects Or
       b. Unrealized CPU Objects
   5. How should vCPU state be exposed via ACPI to the Guest? Especially for
      the unplugged/yet-to-be-plugged vCPUs whose CPU objects might not exist
      within the QOM but the Guest always expects all possible vCPUs to be
      identified as ACPI *present* during boot.
   6. How should Qemu expose GIC CPU interfaces for the unplugged or
      yet-to-be-plugged vCPUs using ACPI MADT Table to the Guest?

E. How are the above questions addressed in this QEMU implementation?

   1. Respect the limitations imposed by the Arm architecture in KVM, ACPI, and
      the guest kernel. This requires always keeping the vCPU count constant.
   2. Implement a workaround in QEMU by keeping all vCPUs present and toggling
      the ACPI _STA.Enabled bit to realize a vCPU hotplug-like effect.
   3. Never destroy vCPU objects once initialized, since they hold the ARMCPU
      state that is set up once during VM initialization.
   4. Size other per-vCPU components, such as the VGIC CPU interface and
      redistributors, for the maximum number of vCPUs possible during the VM’s
      lifetime.
   5. Exit HVC/SMC KVM hypercalls (triggered by PSCI CPU_ON/OFF) to user space
      for policy checks that allow or deny the guest kernel’s power-on/off
      request.
   6. Disabled vCPUs remain parked in user space and are never migrated.
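
The policy check in item 5 can be pictured with a small Python sketch. The
numeric return codes are those defined by the PSCI specification; the function
name `psci_cpu_on_policy` and the choice of DENIED as the reply for an
administratively disabled vCPU are assumptions made for illustration, not a
statement of what the patches return.

```python
# Return codes as defined by the PSCI specification.
PSCI_SUCCESS = 0
PSCI_DENIED = -3
PSCI_ALREADY_ON = -4


def psci_cpu_on_policy(admin_enabled, powered_on):
    """Decide how a guest CPU_ON request is answered in user space.

    Hypothetical model: a request targeting an administratively disabled
    vCPU is refused; otherwise the usual PSCI semantics apply.
    """
    if not admin_enabled:
        return PSCI_DENIED      # assumption: policy denial maps to DENIED
    if powered_on:
        return PSCI_ALREADY_ON  # vCPU is already running
    return PSCI_SUCCESS         # allowed: caller would power on the vCPU


print(psci_cpu_on_policy(admin_enabled=False, powered_on=False))
print(psci_cpu_on_policy(admin_enabled=True, powered_on=False))
```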

===================  
(V) Summary of Flow  
===================  

-------------------  
vCPU Initialization  
-------------------  
   1. Keep all vCPUs always enumerated and present (enabled/disabled) in the
      guest kernel, host KVM, and QEMU with topology fixed.  
   2. Realize hotplug-like functionality by toggling the ACPI _STA.Enabled bit
      for each vCPU.  
   3. Never destroy a vCPU. vCPU objects and threads remain alive throughout the
      VM lifetime once created. No un-realization handling code is required.
      Threads may be realized lazily for disabled vCPUs.  
   4. At VM init, pre-create all possible vCPUs in KVM, including those not yet
      enabled in QEMU, but keep them in the PSCI powered-off state.  
   5. Park disabled vCPU threads in user space to avoid KVM lock contention.
      This means 'CPUState::halted=1'; 'CPUState::stopped=1'; and 'CPUState::parked=1' (new).  
-------------------  
VGIC Initialization  
-------------------  
   6. Size 'GICv3State' and 'GICv3CPUState' objects over possible vCPUs at VM
      init time when the QEMU GIC is realized. This also sizes KVM VGIC
      resources such as redistributor regions. This sizing never changes after
      VM init.
-------------------  
ACPI Initialization  
-------------------  
   7. Build the ACPI MADT table with updates:  
      a. Number of GIC CPU interface entries = possible vCPUs.  
      b. Boot vCPU as MADT.GICC.Enabled=1 (not hot[un]pluggable).  
      c. Hot[un]pluggable vCPUs as MADT.GICC.online-capable=1 and  
         MADT.GICC.Enabled=0 (mutually exclusive). These vCPUs can be enabled
         and onlined after guest boot (firmware policy).  
   8. Expose ACPI _STA status to the guest kernel:  
      a. Always _STA.Present=1 (all possible vCPUs).  
      b. _STA.Enabled=1 (enabled vCPUs = plugged).  
      c. _STA.Enabled=0 (disabled vCPUs = unplugged).  
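
The flag handling in items 7 and 8 can be sketched in Python as below. The bit
positions follow the ACPI definitions of the MADT GICC flags (Enabled is bit 0,
Online Capable is bit 3) and of _STA (Present is bit 0, Enabled is bit 1). The
helper names are hypothetical, only these two _STA bits are modelled, and
treating every vCPU that is enabled at boot like the boot vCPU is an assumption
made for illustration.

```python
# MADT GICC flag bits (ACPI).
GICC_ENABLED = 1 << 0
GICC_ONLINE_CAPABLE = 1 << 3

# _STA bits (ACPI); only the two bits discussed above are modelled.
STA_PRESENT = 1 << 0
STA_ENABLED = 1 << 1


def madt_gicc_flags(enabled_at_boot):
    """Enabled and Online Capable are mutually exclusive in the MADT."""
    return GICC_ENABLED if enabled_at_boot else GICC_ONLINE_CAPABLE


def acpi_sta(admin_enabled):
    """All possible vCPUs are always present; Enabled tracks admin state."""
    sta = STA_PRESENT
    if admin_enabled:
        sta |= STA_ENABLED
    return sta


for idx, at_boot in enumerate([True, True, False]):
    print(idx, hex(madt_gicc_flags(at_boot)), hex(acpi_sta(at_boot)))
```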
---------------------------------------------------------------  
vCPU Administrative *First* Enable [= vCPU Hotplug-like Action]  
---------------------------------------------------------------  
   9. The first administrative enable of a vCPU leads to deferred realization of
      the QEMU vCPU object initialized at VM init:  
      a. Realizes the vCPU object and spawns the QEMU vCPU thread.  
      b. Unparks the existing KVM vCPU ("kvm_parked_vcpus" list).  
      c. Reinitializes the KVM vCPU in the host (reset core/sys regs, set
         defaults). 
      d. Runs the KVM vCPU (created with 'start-powered-off'). Thread waits for
         PSCI.
      e. Marks QEMU 'GICv3CPUState' interface accessible.  
      f. Updates ACPI _STA.Enabled=1.  
      g. Notifies guest (GED Device-Check). Guest sees Enabled=1 and registers
         CPU. 
      h. Guest onlines vCPU (PSCI CPU_ON over HVC/SMC).  
         - KVM exits to QEMU (policy check).  
         - If allowed, QEMU calls `cpu_reset()` and powers on the vCPU in KVM.
         - KVM wakes the vCPU thread out of sleep and sets the vCPU MP state
           to RUNNABLE.
-----------------------------------------------------------  
vCPU Administrative Disable [= vCPU Hot-unplug-like Action]  
-----------------------------------------------------------  
  10. Administrative disable does not un-realize the QOM CPU object or destroy
      the vCPU thread. Instead:  
      a. Notifies guest (GED Eject Request). Guest offlines vCPU (CPU_OFF PSCI).
      b. KVM exits to QEMU (policy check). 
         - QEMU powers off vCPU in KVM and
         - KVM sets the vCPU MP state to STOPPED and sleeps on RCUWait.
      c. Guest signals eject after quiescing vCPU.  
      d. QEMU updates ACPI _STA.Enabled=0.  
      e. Marks QEMU 'GICv3CPUState' interface inaccessible.  
      f. Parks the vCPU thread in user space (unblocks from KVM to avoid vCPU
         lock contention):  
         - Unregisters VMSD from migration.  
         - Removes vCPU from present/active lists.  
         - Pauses the vCPU (`cpu_pause`).  
         - Kicks vCPU thread to user space ('CPUState::parked=1').  
      g. Guest sees ACPI _STA.Enabled=0 and removes CPU (unregisters from LDM).
--------------------------------------------------------------------  
vCPU Administrative *Subsequent* Enable [= vCPU Hotplug-like Action]  
--------------------------------------------------------------------  
  11. A subsequent administrative enable does not realize objects or spawn a new
      thread. Instead:  
      a. Unparks the vCPU thread in user space:  
         - Re-registers VMSD for migration.  
         - Adds back to present/active lists.  
         - Resumes the vCPU (`cpu_resume`).  
         - Clears parked flag ('CPUState::parked=0').  
      b. Marks QEMU 'GICv3CPUState' interface accessible again.  
      c. Updates ACPI _STA.Enabled=1.  
      d. Notifies guest (GED Device-Check). Guest sees Enabled=1 and registers
         CPU.
      e. Guest onlines vCPU (PSCI CPU_ON over HVC/SMC).  
         - KVM exits to QEMU (policy check).  
         - QEMU sets power-state=PSCI_ON, calls `cpu_reset()`, and powers on
           the vCPU.
         - KVM changes MP state to RUNNABLE.  
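
The enable/disable flow in steps 9 to 11 can be summarised as a small state
model. The Python sketch below is a deliberate simplification: the class
`VCpu`, its fields, and its method names are hypothetical and do not mirror the
QEMU data structures; it only captures that realization happens once, that
later enables merely unpark, and that guest PSCI CPU_ON is refused while the
vCPU is administratively disabled.

```python
from dataclasses import dataclass


@dataclass
class VCpu:
    """Toy model of one vCPU's administrative and PSCI state."""
    index: int
    admin_enabled: bool = False
    realized: bool = False    # set on first enable, never cleared
    parked: bool = True       # thread parked in user space while disabled
    powered_on: bool = False  # PSCI power state

    def admin_enable(self):
        """Model of 'device_set ...,admin-state=enable'."""
        if self.admin_enabled:
            return "noop"
        if not self.realized:
            self.realized = True       # step 9: deferred realization
            path = "first-enable"
        else:
            path = "subsequent-enable"  # step 11: only unpark
        self.parked = False
        self.admin_enabled = True       # _STA.Enabled=1, guest notified
        return path

    def psci_cpu_on(self):
        """Guest CPU_ON exits to user space for a policy check."""
        if not self.admin_enabled:
            return False
        self.powered_on = True
        return True

    def admin_disable(self):
        """Model of 'device_set ...,admin-state=disable' (step 10)."""
        self.powered_on = False    # guest offlines the vCPU via CPU_OFF
        self.admin_enabled = False  # _STA.Enabled=0
        self.parked = True          # thread parked, object kept alive


cpu = VCpu(index=4)
print(cpu.admin_enable(), cpu.psci_cpu_on())
cpu.admin_disable()
print(cpu.admin_enable())
```

Note that `realized` is never reset in this model, matching the statement that
vCPU objects and threads remain alive for the lifetime of the VM.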

============================================
(VI) Work Presented at KVM Forum Conferences
============================================

Details of the above work have been presented at KVMForum2020 and KVMForum2023
conferences. Slides & video are available at the links below:
a. KVMForum 2023
   - Challenges Revisited in Supporting Virt CPU Hotplug on architectures that don't Support CPU Hotplug (like ARM64).
     https://kvm-forum.qemu.org/2023/KVM-forum-cpu-hotplug_7OJ1YyJ.pdf
     https://kvm-forum.qemu.org/2023/Challenges_Revisited_in_Supporting_Virt_CPU_Hotplug_-__ii0iNb3.pdf
     https://www.youtube.com/watch?v=hyrw4j2D6I0&t=23970s
     https://kvm-forum.qemu.org/2023/talk/9SMPDQ/
b. KVMForum 2020
   - Challenges in Supporting Virtual CPU Hotplug on SoC Based Systems (like ARM64) - Salil Mehta, Huawei.
     https://kvmforum2020.sched.com/event/eE4m

===================
(VII) Commands Used
===================

A. Qemu launch commands to init the machine (with 6 possible vCPUs):

$ qemu-system-aarch64 --enable-kvm -machine virt,gic-version=3 \
-cpu host -smp cpus=4,disabled=2 \
-m 300M \
-kernel Image \
-initrd rootfs.cpio.gz \
-append "console=ttyAMA0 root=/dev/ram rdinit=/init maxcpus=2 acpi=force" \
-nographic \
-bios QEMU_EFI.fd

B. Administrative '[En,Dis]able' [akin to 'Hot-(un)plug'] related commands:

# Hot(un)plug a host vCPU (accel=kvm):
(qemu) device_set host-arm-cpu,id=core4,core-id=4,admin-state=enable
(qemu) device_set host-arm-cpu,id=core4,core-id=4,admin-state=disable

# Hot(un)plug a vCPU (accel=tcg):
(qemu) device_set cortex-a57-arm-cpu,id=core4,core-id=4,admin-state=enable
(qemu) device_set cortex-a57-arm-cpu,id=core4,core-id=4,admin-state=disable

Sample output on guest after boot:

    $ cat /sys/devices/system/cpu/possible
    0-5
    $ cat /sys/devices/system/cpu/present
    0-5
    $ cat /sys/devices/system/cpu/enabled
    0-3
    $ cat /sys/devices/system/cpu/online
    0-1
    $ cat /sys/devices/system/cpu/offline
    2-5

Sample output on guest after 'enabling'[='hotplug'] & 'online' of vCPU=4:

    $ echo 1 > /sys/devices/system/cpu/cpu4/online

    $ cat /sys/devices/system/cpu/possible
    0-5
    $ cat /sys/devices/system/cpu/present
    0-5
    $ cat /sys/devices/system/cpu/enabled
    0-4
    $ cat /sys/devices/system/cpu/online
    0-1,4
    $ cat /sys/devices/system/cpu/offline
    2-3,5
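
When scripting checks against the sysfs files shown above, the CPU list format
(for example "0-1,4") has to be expanded into individual indices. A minimal
Python sketch follows; the helper name `parse_cpulist` is hypothetical, and it
handles only plain comma-separated indices and ranges as they appear in the
sample output.

```python
def parse_cpulist(text):
    """Expand a sysfs CPU list such as '0-1,4' into a list of indices."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs in this state
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


possible = set(parse_cpulist("0-5"))
online = set(parse_cpulist("0-1,4"))
# CPUs that are possible but not online should match the 'offline' file.
print(sorted(possible - online))
```

On a guest, the strings would come from reading files such as
/sys/devices/system/cpu/online rather than from literals.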

===================
(VIII) Repositories
===================

(*) Latest Qemu RFC V6 (Architecture Specific) patch set:
    https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v6
(*) Older QEMU changes for vCPU hotplug can be cloned from below site:
    https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-{v1,v2,v3,v4,v5}
(*) `Accepted` Qemu Architecture Agnostic patch is present here:
    https://github.com/salil-mehta/qemu/commits/virt-cpuhp-armv8/rfc-v3.arch.agnostic.v16/
(*) All Kernel changes are already part of mainline v6.11
(*) Original Guest Kernel changes (by James Morse, ARM) are available here:
    https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git virtual_cpu_hotplug/rfc/v2

================================
(IX) KNOWN ISSUES & THINGS TO DO
================================

1. TCG currently faces some hang issues due to unhandled cases. We aim to fix
   these within the next one to two weeks.
2. Comprehensive testing is ongoing. This is fresh code, and we expect to
   complete testing within two weeks.
3. QEMU documentation (.rst) still needs to be updated.
4. Migration has only been lightly tested but has worked as expected so far.
5. Mitigation to avoid `pause_all_vcpus` needs broader community discussion. An
   alternative change has been prepared in KVM, which maintains a shadow of
   `ICC_CTLR_EL1` to reduce lock contention when using KVM device IOCTLs. This
   avoids synchronization issues if the register value changes during VM runtime.
   While not mandatory, this enhancement would provide a more comprehensive fix
   than the current QEMU assumption that the relevant fields are invariant or
   pseudo-static. An RFC for this KVM change will be floated within a week.
6. Mitigation of parking disabled vCPU threads in user space, to avoid blocking
   them inside KVM, needs review by the wider community to ensure no hidden
   issues are introduced.
7. A discussion (if needed) on why `device_set` was chosen instead of `qom-set`
   for administrative state control.
8. CPU_SUSPEND/Standby related handling (if required)
9. HVF and qtest are not supported or done yet.

============================
(X) ORGANIZATION OF PATCHES
============================

 [Patch 1-2, 22-23] New HMP/QMP interface ('device_set') related changes
    (*) New ('DeviceState::admin_power_state') property; Enabled/Disabled States and handling
    (*) New Qemu CLI parameter ('-smp CPUS, disabled=N') handling
    (*) Logic to find the existing object not part of the QOM
 [Patch 3-5, 10] logic required during machine init.
    (*) Some validation checks.
    (*) Introduces core-id,socket-id,cluster-id property and some util functions required later.
    (*) Logic to setup lazy realization of the QOM vCPUs 
    (*) Logic to pre-create vCPUs in the KVM host kernel.
 [Patch 6-7, 8-9] logic required to size the GICv3 State
    (*) GIC initialization pre-sized with possible vCPUs. 
    (*) Introduction of the GICv3 CPU Interface `accessibility` property & accessors
    (*) Refactoring to make KVM & TCG 'GICv3CPUState' initialization common.
    (*) Changes in GICv3 post/pre-load function for migration 
 [Patch 11,14-16,19] logic related to ACPI at machine init time.
    (*) ACPI CPU OSPM interface for ACPI _STA.Enable/Disable handling  
    (*) ACPI GED framework to cater to CPU DeviceCheck/Eject Events.
    (*) ACPI DSDT, MADT changes.
 [Patch 12-13, 17] Qdev, Virt Machine, PowerState Handler Changes
    (*) Changes to introduce 'PowerStateHandler' and its abstract interface.
    (*) Qdev changes to handle the administrative enabling/disabling of device
    (*) Virt Machine implementation of 'PowerStateHandler' Hooks
    (*) vCPU thread user-space parking and unparking logic.
 [Patch 18,20-21,24] Misc.
    (*) Handling of SMCCC Hypercall Exits by KVM to Qemu for PSCI.
    (*) Mitigation to avoid using 'pause_all_vcpus' during ICC_CTLR_EL1 reset.
    (*) Mitigation when TCG 'TB Code Cache' is found saturated

===============
(XI) DISCLAIMER
===============

This patch-set is the culmination of over four years of ongoing effort to bring
a vCPU hotplug-like feature to the Arm platform. The work has already led to
changes in the ACPI specification and the Linux kernel, and this series now
introduces the missing piece within QEMU.

The transition from RFC v5 to RFC v6 resulted in a shift of approach, based on
maintainer feedback, and required substantial code to be re-written. This is
*not* production-level code and may still contain bugs. Comprehensive testing is
in progress on HiSilicon Kunpeng920 SoCs, Oracle servers, and Ampere platforms.
We expect to fix outstanding issues in the coming month and, subject to no major
concerns from maintainers about the chosen approach, a near-stable, non-RFC
version will be posted soon.

This work largely follows the direction of prior community discussions over the
years [see refs below], including mailing list threads, Linaro Open Discussions,
and sessions at KVM Forum. This RFC is intended to validate the overall approach
outlined here and to gather community feedback before moving forward with a
formal patch series.

[The concept being presented has been found to work!]

================
(XII) Change Log
================

RFC V4 -> RFC V5:
-----------------
1. Dropped "[PATCH RFC V4 19/33] target/arm: Force ARM vCPU *present* status ACPI *persistent*"
   - Separated the architecture agnostic ACPI changes required to support vCPU Hotplug
     Link: https://lore.kernel.org/qemu-devel/20241014192205.253479-1-salil.mehta@huawei.com/#t
2. Dropped "[PATCH RFC V4 02/33] cpu-common: Add common CPU utility for possible vCPUs"
   - Dropped qemu_{present,enabled}_cpu() APIs. Commented on by Gavin (Redhat), Miguel (Oracle), Igor (Redhat)
3. Added "Reviewed-by: Miguel Luis <miguel.luis@oracle.com>" to [PATCH RFC V4 01/33]
4. Dropped the `CPUState::disabled` flag and introduced `GICv3State::num_smp_cpus` flag
   - All `GICv3CPUState` between [num_smp_cpus,num_cpus) are marked as `inaccessible` during gicv3_common_realize()
   - qemu_enabled_cpu() not required - removed!
   - removed usage of `CPUState::disabled` from virt.c and hw/cpu64.c
5. Removed virt_cpu_properties() and introduced property `mp-affinity` get accessor
6. Dropped "[PATCH RFC V4 12/33] arm/virt: Create GED device before *disabled* vCPU Objects are destroyed"

RFC V3 -> RFC V4:
-----------------
1. Addressed Nicholas Piggin's (IBM) comments
   - Moved qemu_get_cpu_archid() as an inline ACPI helper in acpi/cpu.h
     https://lore.kernel.org/qemu-devel/D2GFCLH11HGJ.1IJGANHQ9ZQRL@gmail.com/
   - Introduced new macro CPU_FOREACH_POSSIBLE() in [PATCH 12/33]
     https://lore.kernel.org/qemu-devel/D2GF9A9AJO02.1G1G8UEXA5AOD@gmail.com/
   - Converted CPUState::acpi_persistent into Property. Improved the cover note
     https://lore.kernel.org/qemu-devel/D2H62RK48KT7.2BTQEZUOEGG4L@gmail.com/
   - Fixed the cover note of the [PATCH ] and clearly mentioned KVM parking
     https://lore.kernel.org/qemu-devel/D2GFOGQC3HYO.2LKOV306JIU98@gmail.com/
2. Addressed Gavin Shan's (RedHat) comments:
   - Introduced the ARM Extensions check. [Looks like I missed the PMU check :( ]
     https://lore.kernel.org/qemu-devel/28f3107f-0267-4112-b0ca-da59df2968ae@redhat.com/
   - Moved create_gpio() along with create_ged()
     https://lore.kernel.org/qemu-devel/143ad7d2-8f45-4428-bed3-891203a49029@redhat.com/
   - Improved the logic of the GIC creation and initialization
     https://lore.kernel.org/qemu-devel/9b7582f0-8149-4bf0-a1aa-4d4fe0d35e70@redhat.com/
   - Removed redundant !dev->realized checks in cpu_hotunplug(_request)
     https://lore.kernel.org/qemu-devel/64e9feaa-8df2-4108-9e73-c72517fb074a@redhat.com/
3. Addressed Alex Bennée's and Gustavo Romero's (Linaro) comments
   - Fixed the TCG support and now it works for all the cases including migration.
     https://lore.kernel.org/qemu-devel/87bk1b3azm.fsf@draig.linaro.org/
   - Fixed the cpu_address_space_destroy() compilation failure in user-mode
     https://lore.kernel.org/qemu-devel/87v800wkb1.fsf@draig.linaro.org/
4. Fixed crash in .post_gicv3() during migration with asymmetrically *enabled*
     vCPUs at destination VM

RFC V2 -> RFC V3:
-----------------
1. Miscellaneous:
   - Split the RFC V2 into arch-agnostic and arch-specific patch sets.
2. Addressed Gavin Shan's (RedHat) comments:
   - Made CPU property accessors inline.
     https://lore.kernel.org/qemu-devel/6cd28639-2cfa-f233-c6d9-d5d2ec5b1c58@redhat.com/
   - Collected Reviewed-bys [PATCH RFC V2 4/37, 14/37, 22/37].
   - Dropped the patch as it was not required after init logic was refactored.
     https://lore.kernel.org/qemu-devel/4fb2eef9-6742-1eeb-721a-b3db04b1be97@redhat.com/
   - Fixed the range check for the core during vCPU Plug.
     https://lore.kernel.org/qemu-devel/1c5fa24c-6bf3-750f-4f22-087e4a9311af@redhat.com/
   - Added has_hotpluggable_vcpus check to make build_cpus_aml() conditional.
     https://lore.kernel.org/qemu-devel/832342cb-74bc-58dd-c5d7-6f995baeb0f2@redhat.com/
   - Fixed the states initialization in cpu_hotplug_hw_init() to accommodate previous refactoring.
     https://lore.kernel.org/qemu-devel/da5e5609-1883-8650-c7d8-6868c7b74f1c@redhat.com/
   - Fixed typos.
     https://lore.kernel.org/qemu-devel/eb1ac571-7844-55e6-15e7-3dd7df21366b@redhat.com/
   - Removed the unnecessary 'goto fail'.
     https://lore.kernel.org/qemu-devel/4d8980ac-f402-60d4-fe52-787815af8a7d@redhat.com/#t
   - Added check for hotpluggable vCPUs in the _OSC method.
     https://lore.kernel.org/qemu-devel/20231017001326.FUBqQ1PTowF2GxQpnL3kIW0AhmSqbspazwixAHVSi6c@z/
3. Addressed Shaoqin Huang's (Intel) comments:
   - Fixed the compilation break caused by a missing call to
     virt_cpu_properties() and its missing definition.
     https://lore.kernel.org/qemu-devel/3632ee24-47f7-ae68-8790-26eb2cf9950b@redhat.com/
4. Addressed Jonathan Cameron's (Huawei) comments:
   - Gated the 'disabled vcpu message' for GIC version < 3.
     https://lore.kernel.org/qemu-devel/20240116155911.00004fe1@Huawei.com/

RFC V1 -> RFC V2:
-----------------
1. Addressed James Morse's (ARM) requirement as per Linaro Open Discussion:
   - Exposed all possible vCPUs as always ACPI _STA.present and available during boot time.
   - Added the _OSC handling as required by James's patches.
   - Introduction of 'online-capable' bit handling in the flag of MADT GICC.
   - SMCCC hypercall exit handling in QEMU.
2. Addressed Marc Zyngier's comment:
   - Fixed the note about GIC CPU Interface in the cover letter.
3. Addressed issues raised by Vishnu Pajjuri (Ampere) & Miguel Luis (Oracle) during testing:
   - Live/Pseudo Migration crashes.
4. Others:
   - Introduced the concept of persistent vCPU at QOM.
   - Introduced wrapper APIs of present, possible, and persistent.
   - Changed the ACPI hotplug H/W init leg to initialize the is_present and is_enabled states.
   - Check to avoid unplugging cold-booted vCPUs.
   - Disabled hotplugging with TCG/HVF/QTEST.
   - Introduced CPU Topology, {socket, cluster, core, thread}-id property.
   - Extracted virt CPU properties into a common virt_vcpu_properties() function.

=======================
(XIII) ACKNOWLEDGEMENTS
=======================

I would like to thank the following people for various discussions with me over
different channels during development:

Marc Zyngier (Google), Catalin Marinas (ARM), James Morse (ARM), Will Deacon (Google), 
Jean-Philippe Brucker (Linaro), Sudeep Holla (ARM), Lorenzo Pieralisi (Linaro), 
Gavin Shan (RedHat), Jonathan Cameron (Huawei), Darren Hart (Ampere), 
Igor Mammedov (RedHat), Ilkka Koskinen (Ampere), Andrew Jones (RedHat), 
Karl Heubaum (Oracle), Keqian Zhu (Huawei), Miguel Luis (Oracle), 
Xiongfeng Wang (Huawei), Vishnu Pajjuri (Ampere), Shameerali Kolothum (Huawei), 
Russell King (Oracle), Xuwei/Joy (Huawei), Peter Maydell (Linaro),
Zengtao/Prime (Huawei), Nicholas Piggin (IBM), Alex Bennée (Linaro) and all those
whom I have missed!

Many thanks to the following people for their current or past contributions:

1. James Morse (ARM)
   (Current Kernel part of vCPU Hotplug Support on AARCH64)
2. Jean-Philippe Brucker (Linaro)
   (Prototyped one of the earlier PSCI-based POC [17][18] based on RFC V1)
3. Keqian Zhu (Huawei)
   (Co-developed QEMU prototype)
4. Xiongfeng Wang (Huawei)
   (Co-developed an earlier kernel prototype with me)
5. Vishnu Pajjuri (Ampere)
   (Verification on Ampere ARM64 Platforms + fixes)
6. Miguel Luis (Oracle)
   (Verification on Oracle ARM64 Platforms + fixes)
7. Russell King (Oracle) & Jonathan Cameron (Huawei)
   (Helping in upstreaming James Morse's Kernel patches).

================
(XIV) REFERENCES
================

[1] https://lore.kernel.org/qemu-devel/20200613213629.21984-1-salil.mehta@huawei.com/
[2] https://lore.kernel.org/linux-arm-kernel/20200625133757.22332-1-salil.mehta@huawei.com/
[3] https://lore.kernel.org/lkml/20230203135043.409192-1-james.morse@arm.com/
[4] https://lore.kernel.org/all/20230913163823.7880-1-james.morse@arm.com/
[5] https://lore.kernel.org/all/20230404154050.2270077-1-oliver.upton@linux.dev/
[6] https://bugzilla.tianocore.org/show_bug.cgi?id=3706
[7] https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#gic-cpu-interface-gicc-structure
[8] https://bugzilla.tianocore.org/show_bug.cgi?id=4481#c5
[9] https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler
[10] https://docs.aws.amazon.com/eks/latest/userguide/vertical-pod-autoscaler.html
[11] https://lkml.org/lkml/2019/7/10/235
[12] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-July/032316.html
[13] https://lists.gnu.org/archive/html/qemu-devel/2020-01/msg06517.html
[14] https://op-lists.linaro.org/archives/list/linaro-open-discussions@op-lists.linaro.org/thread/7CGL6JTACPUZEYQC34CZ2ZBWJGSR74WE/
[15] http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg01168.html
[16] https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg00131.html
[17] https://op-lists.linaro.org/archives/list/linaro-open-discussions@op-lists.linaro.org/message/X74JS6P2N4AUWHHATJJVVFDI2EMDZJ74/
[18] https://lore.kernel.org/lkml/20210608154805.216869-1-jean-philippe@linaro.org/
[19] https://lore.kernel.org/all/20230913163823.7880-1-james.morse@arm.com/ 
[20] https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#gicc-cpu-interface-flags
[21] https://lore.kernel.org/qemu-devel/20230926100436.28284-1-salil.mehta@huawei.com/
[22] https://lore.kernel.org/qemu-devel/20240607115649.214622-1-salil.mehta@huawei.com/T/#md0887eb07976bc76606a8204614ccc7d9a01c1f7
[23] RFC V3: https://lore.kernel.org/qemu-devel/20240613233639.202896-1-salil.mehta@huawei.com/#t

Author Salil Mehta (1):
  target/arm/kvm,tcg: Handle SMCCC hypercall exits in VMM during PSCI_CPU_{ON,OFF}

Jean-Philippe Brucker (1):
  target/arm/kvm: Write vCPU's state back to KVM on cold-reset

Salil Mehta (22):
  hw/core: Introduce administrative power-state property and its accessors
  hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter
  hw/arm/virt: Clamp 'maxcpus' as-per machine's vCPU deferred online-capability
  arm/virt,target/arm: Add new ARMCPU {socket,cluster,core,thread}-id property
  arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  arm/virt,gicv3: Pre-size GIC with possible vCPUs at machine init
  arm/gicv3: Refactor CPU interface init for shared TCG/KVM use
  arm/virt, gicv3: Guard CPU interface access for admin disabled vCPUs
  hw/intc/arm_gicv3_common: Migrate & check 'GICv3CPUState' accessibility mismatch
  arm/virt: Init PMU at host for all present vCPUs
  hw/arm/acpi: MADT change to size the guest with possible vCPUs
  hw/core: Introduce generic device power-state handler interface
  qdev: make admin power state changes trigger platform transitions via ACPI
  arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms
  acpi/ged: Notify OSPM of CPU administrative state changes via GED
  arm/virt/acpi: Update ACPI DSDT Tbl to include 'Online-Capable' CPUs AML
  hw/arm/virt,acpi/ged: Add PowerStateHandler hooks for runtime CPU state changes
  target/arm/cpu: Add the Accessor hook to fetch ARM CPU arch-id
  hw/intc/arm-gicv3-kvm: Pause all vCPUs & cache ICC_CTLR_EL1 for userspace PSCI CPU_ON
  monitor,qdev: Introduce 'device_set' to change admin state of existing devices
  monitor,qapi: add 'info cpus-powerstate' and QMP query (Admin + Oper states)
  tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc

 accel/kvm/kvm-all.c                    |   2 +-
 accel/tcg/tcg-accel-ops-mttcg.c        |   2 +-
 accel/tcg/tcg-accel-ops-rr.c           |   2 +-
 cpu-common.c                           |   4 +-
 hmp-commands-info.hx                   |  32 ++
 hmp-commands.hx                        |  30 +
 hw/acpi/Kconfig                        |   3 +
 hw/acpi/acpi-cpu-ospm-interface-stub.c |  41 ++
 hw/acpi/cpu_ospm_interface.c           | 747 +++++++++++++++++++++++++
 hw/acpi/generic_event_device.c         |  91 +++
 hw/acpi/meson.build                    |   2 +
 hw/acpi/trace-events                   |  17 +
 hw/arm/Kconfig                         |   1 +
 hw/arm/virt-acpi-build.c               |  75 ++-
 hw/arm/virt.c                          | 573 +++++++++++++++++--
 hw/core/cpu-common.c                   |  12 +
 hw/core/machine-hmp-cmds.c             |  62 ++
 hw/core/machine-qmp-cmds.c             | 107 ++++
 hw/core/machine-smp.c                  |  24 +-
 hw/core/machine.c                      |  28 +
 hw/core/meson.build                    |   1 +
 hw/core/powerstate.c                   | 100 ++++
 hw/core/qdev.c                         | 197 +++++++
 hw/intc/arm_gicv3.c                    |   1 +
 hw/intc/arm_gicv3_common.c             |  64 ++-
 hw/intc/arm_gicv3_cpuif.c              | 270 ++++-----
 hw/intc/arm_gicv3_cpuif_common.c       |  58 ++
 hw/intc/arm_gicv3_kvm.c                | 123 +++-
 hw/intc/gicv3_internal.h               |   1 +
 include/hw/acpi/acpi_dev_interface.h   |   1 +
 include/hw/acpi/cpu_ospm_interface.h   |  78 +++
 include/hw/acpi/generic_event_device.h |   6 +
 include/hw/arm/virt.h                  |  42 +-
 include/hw/boards.h                    |  37 ++
 include/hw/core/cpu.h                  |  71 +++
 include/hw/intc/arm_gicv3_common.h     |  65 +++
 include/hw/powerstate.h                | 177 ++++++
 include/hw/qdev-core.h                 | 151 +++++
 include/monitor/hmp.h                  |   3 +
 include/monitor/qdev.h                 |  30 +
 include/system/kvm.h                   |   8 +
 include/system/system.h                |   1 +
 include/tcg/startup.h                  |   6 +
 include/tcg/tcg.h                      |   1 +
 qapi/machine.json                      |  90 +++
 qemu-options.hx                        | 129 ++++-
 stubs/meson.build                      |   1 +
 stubs/powerstate-stubs.c               |  47 ++
 system/cpus.c                          |   4 +-
 system/qdev-monitor.c                  | 139 ++++-
 system/vl.c                            |  42 ++
 target/arm/arm-powerctl.c              |  29 +-
 target/arm/cpu.c                       |  14 +
 target/arm/cpu.h                       |   5 +
 target/arm/helper.c                    |   2 +-
 target/arm/internals.h                 |   2 +-
 target/arm/kvm.c                       | 140 ++++-
 target/arm/kvm_arm.h                   |  25 +
 target/arm/meson.build                 |   1 +
 target/arm/{tcg => }/psci.c            |   9 +
 target/arm/tcg/meson.build             |   4 -
 tcg/region.c                           |  16 +
 tcg/tcg.c                              |  19 +-
 63 files changed, 3800 insertions(+), 265 deletions(-)
 create mode 100644 hw/acpi/acpi-cpu-ospm-interface-stub.c
 create mode 100644 hw/acpi/cpu_ospm_interface.c
 create mode 100644 hw/core/powerstate.c
 create mode 100644 include/hw/acpi/cpu_ospm_interface.h
 create mode 100644 include/hw/powerstate.h
 create mode 100644 stubs/powerstate-stubs.c
 rename target/arm/{tcg => }/psci.c (96%)

-- 
2.34.1




* [PATCH RFC V6 01/24] hw/core: Introduce administrative power-state property and its accessors
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-09 10:48   ` Miguel Luis
  2025-10-01  1:01 ` [PATCH RFC V6 02/24] hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter salil.mehta
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

Some devices cannot be hot-unplugged, either because removal is not meaningful
(e.g. on-board devices) or not supported (e.g. certain PCIe devices). Others,
such as CPUs on architectures like ARM, lack native hotplug support but can
still have their availability controlled through host policy. In all these
cases, a mechanism is needed to track and control a device’s *administrative*
power state — independent of its runtime operational state — so QEMU can:

  - Disable a device while keeping it described in firmware, ACPI, or other
    configuration.
  - Prevent guest use until explicitly re-enabled.
  - Coordinate transitions with platform-specific power handlers and migration
    logic.

This patch introduces the core qdev support for administrative power state —
defining the property, enum, and accessors — without yet applying it to any
device. Later patches in this series integrate it with helper APIs
(qdev_disable(), qdev_enable(), etc.) and specific device types such as CPUs,
completing the flow with platform-specific handlers.

Key additions:
  - New enum DeviceAdminPowerState with ENABLED, DISABLED, and REMOVED states,
    defaulting to ENABLED.
  - New DeviceClass flag admin_power_state_supported to advertise support for
    administrative transitions.
  - New QOM property "admin_power_state" to query or set the state on supported
    devices.
  - Internal accessors device_get_admin_power_state() and
    device_set_admin_power_state() to manage state changes, including safe
    handling when the device is not yet realized.
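
The gating performed by the setter can be modelled outside QEMU with a
minimal sketch. Every name below (ModelDevice, ModelDeviceClass,
model_set_admin_power_state(), the MODEL_* return codes) is a hypothetical
stand-in chosen for illustration and is not part of this patch; only the
behaviour it mimics (default ENABLED, class-level support flag, REMOVED not
being settable through the property) is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the QEMU types; not the real definitions. */
typedef enum {
    ADMIN_ENABLED = 0,
    ADMIN_DISABLED,
    ADMIN_REMOVED,
    ADMIN_MAX
} AdminPowerState;

typedef struct {
    bool admin_power_state_supported;   /* per-class capability flag */
} ModelDeviceClass;

typedef struct {
    const ModelDeviceClass *dc;
    AdminPowerState admin_power_state;
} ModelDevice;

/* Hypothetical return codes used by this sketch only. */
enum { MODEL_OK = 0, MODEL_ENOTSUP = -1, MODEL_EINVAL = -2 };

static void model_device_init(ModelDevice *dev, const ModelDeviceClass *dc)
{
    assert(dev != NULL && dc != NULL);
    dev->dc = dc;
    /* Every device starts administratively ENABLED. */
    dev->admin_power_state = ADMIN_ENABLED;
}

static int model_set_admin_power_state(ModelDevice *dev, int new_state)
{
    assert(dev != NULL);

    /* Classes that do not advertise support reject any change. */
    if (!dev->dc->admin_power_state_supported) {
        return MODEL_ENOTSUP;
    }

    switch (new_state) {
    case ADMIN_ENABLED:
    case ADMIN_DISABLED:
        dev->admin_power_state = (AdminPowerState)new_state;
        return MODEL_OK;
    default:
        /* REMOVED (and anything out of range) is not settable here. */
        return MODEL_EINVAL;
    }
}
```

A caller would initialise a device against its class and then request a
transition; a class without the flag keeps its devices ENABLED regardless of
what is requested.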

The enum models *policy* rather than electrical or functional power state, and
is distinct from runtime mechanisms (e.g. PSCI for ARM CPUs). The actual
operational state of a device is maintained by platform-specific or device-
specific code, which enforces runtime behaviour based on the administrative
setting. Every device starts administratively ENABLED by default. A DISABLED
device remains logically present but blocked from operation; a REMOVED device
is logically absent.
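
As a rough illustration of the policy/operational split described above, the
sketch below keeps the two states in separate fields and lets the
administrative one gate a guest-initiated power-on. All identifiers
(ModelCpu, model_guest_power_on(), model_admin_set()) are hypothetical, and
forcing the operational state to OFF on an administrative disable is an
assumption of this sketch; the patch itself leaves that transition as a TODO
for later patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { POLICY_ENABLED, POLICY_DISABLED, POLICY_REMOVED } AdminPolicy;
typedef enum { OPER_OFF, OPER_ON } OperState;

typedef struct {
    AdminPolicy admin;   /* host-side policy */
    OperState oper;      /* runtime state, e.g. driven by PSCI on ARM */
} ModelCpu;

/* Guest-initiated power-on: honoured only if policy permits it. */
static bool model_guest_power_on(ModelCpu *cpu)
{
    assert(cpu != NULL);
    if (cpu->admin != POLICY_ENABLED) {
        return false;            /* blocked from operation */
    }
    cpu->oper = OPER_ON;
    return true;
}

/* Host-side policy change. */
static void model_admin_set(ModelCpu *cpu, AdminPolicy policy)
{
    assert(cpu != NULL);
    cpu->admin = policy;
    if (policy != POLICY_ENABLED) {
        /* Assumption: a non-enabled device is forced operationally off. */
        cpu->oper = OPER_OFF;
    }
}
```

Note that re-enabling the policy does not by itself power the device on in
this model; the guest still has to request it.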

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/core/qdev.c         | 62 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/qdev-core.h | 54 ++++++++++++++++++++++++++++++++++++
 target/arm/cpu.c       |  1 +
 3 files changed, 117 insertions(+)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index f600226176..8502d6216f 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -633,6 +633,53 @@ static bool device_get_hotplugged(Object *obj, Error **errp)
     return dev->hotplugged;
 }
 
+static int device_get_admin_power_state(Object *obj, Error **errp)
+{
+    DeviceState *dev = DEVICE(obj);
+
+    return dev->admin_power_state;
+}
+
+static void
+device_set_admin_power_state(Object *obj, int new_state, Error **errp)
+{
+    DeviceState *dev = DEVICE(obj);
+    DeviceClass *dc = DEVICE_GET_CLASS(dev);
+
+    if (!dc->admin_power_state_supported) {
+        error_setg(errp, "Device '%s' admin power state change not supported",
+                   object_get_typename(obj));
+        return;
+    }
+
+    switch (new_state) {
+    case DEVICE_ADMIN_POWER_STATE_DISABLED: {
+        /*
+         * TODO: Operational state transition triggered by administrative action
+         * Powering off the realized device either synchronously or via OSPM.
+         */
+
+        qatomic_set(&dev->admin_power_state, DEVICE_ADMIN_POWER_STATE_DISABLED);
+        smp_wmb();
+        break;
+    }
+    case DEVICE_ADMIN_POWER_STATE_ENABLED: {
+        /*
+         * TODO: Operational state transition triggered by administrative action
+         * Powering on the device and restoring migration registration.
+         */
+
+        qatomic_set(&dev->admin_power_state, DEVICE_ADMIN_POWER_STATE_ENABLED);
+        smp_wmb();
+        break;
+    }
+    default:
+        error_setg(errp, "Invalid admin power state %d for device '%s'",
+                   new_state, object_get_typename(obj));
+        break;
+    }
+}
+
 static void device_initfn(Object *obj)
 {
     DeviceState *dev = DEVICE(obj);
@@ -644,6 +691,7 @@ static void device_initfn(Object *obj)
 
     dev->instance_id_alias = -1;
     dev->realized = false;
+    dev->admin_power_state = DEVICE_ADMIN_POWER_STATE_ENABLED;
     dev->allow_unplug_during_migration = false;
 
     QLIST_INIT(&dev->gpios);
@@ -731,6 +779,15 @@ device_vmstate_if_get_id(VMStateIf *obj)
     return qdev_get_dev_path(dev);
 }
 
+static const QEnumLookup device_admin_power_state_lookup = {
+    .array = (const char *const[]) {
+        [DEVICE_ADMIN_POWER_STATE_ENABLED]  = "enabled",
+        [DEVICE_ADMIN_POWER_STATE_REMOVED]  = "removed",
+        [DEVICE_ADMIN_POWER_STATE_DISABLED] = "disabled",
+    },
+    .size = DEVICE_ADMIN_POWER_STATE_MAX,
+};
+
 static void device_class_init(ObjectClass *class, const void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(class);
@@ -765,6 +822,11 @@ static void device_class_init(ObjectClass *class, const void *data)
                                    device_get_hotpluggable, NULL);
     object_class_property_add_bool(class, "hotplugged",
                                    device_get_hotplugged, NULL);
+    object_class_property_add_enum(class, "admin_power_state",
+                                   "DeviceAdminPowerState",
+                                   &device_admin_power_state_lookup,
+                                   device_get_admin_power_state,
+                                   device_set_admin_power_state);
     object_class_property_add_link(class, "parent_bus", TYPE_BUS,
                                    offsetof(DeviceState, parent_bus), NULL, 0);
 }
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 530f3da702..3bc212ab3a 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -159,6 +159,7 @@ struct DeviceClass {
      */
     bool user_creatable;
     bool hotpluggable;
+    bool admin_power_state_supported;
 
     /* callbacks */
     /**
@@ -217,6 +218,55 @@ typedef QLIST_HEAD(, NamedGPIOList) NamedGPIOListHead;
 typedef QLIST_HEAD(, NamedClockList) NamedClockListHead;
 typedef QLIST_HEAD(, BusState) BusStateHead;
 
+/**
+ * enum DeviceAdminPowerState - Administrative control states for a device
+ *
+ * This enum defines abstract administrative states used by QEMU to enable,
+ * disable, or logically remove a device from the virtual machine. These
+ * states reflect administrative control over a device's power availability
+ * and presence in the system. These administrative states are distinct from
+ * runtime operational power states (e.g., PSCI states for ARM CPUs). They
+ * represent administrative *policy* rather than physical, electrical, or
+ * functional state.
+ *
+ * Administrative state is managed externally (via QMP, firmware, or other
+ * host-side policy agents) and acts as a gating policy that determines
+ * whether guest software is permitted to interact with the device. Most
+ * devices default to the ENABLED state unless explicitly disabled or removed.
+ *
+ * Changing a device's administrative state may directly or indirectly affect
+ * its operational behavior. For example, a DISABLED device will reject guest
+ * attempts to power it on or transition it out of a suspended state. Not all
+ * devices support dynamic transitions between administrative states.
+ *
+ * - DEVICE_ADMIN_POWER_STATE_ENABLED:
+ *     The device is administratively enabled (i.e., logically present and
+ *     permitted to operate). Guest software may change its operational state
+ *     (e.g., activate, deactivate, suspend) within allowed architectural
+ *     semantics. This is the default state for most devices unless explicitly
+ *     disabled or unplugged.
+ *
+ * - DEVICE_ADMIN_POWER_STATE_DISABLED:
+ *     The device is administratively disabled. It remains logically present
+ *     but is blocked from functional operation. Guest-initiated transitions
+ *     are either suppressed or ignored. This is typically used to enforce
+ *     shutdown, deny execution, or offline the device without removing it.
+ *
+ * - DEVICE_ADMIN_POWER_STATE_REMOVED:
+ *     The device has been logically removed (e.g., via hot-unplug). It is no
+ *     longer considered present or visible to the guest. This state exists
+ *     for representational or transitional purposes only. In most cases,
+ *     once removed, the corresponding DeviceState object is destroyed and
+ *     no longer tracked. This concept may not apply to some devices as
+ *     architectural limitations might make unplug not meaningful.
+ */
+typedef enum DeviceAdminPowerState {
+    DEVICE_ADMIN_POWER_STATE_ENABLED = 0,
+    DEVICE_ADMIN_POWER_STATE_DISABLED,
+    DEVICE_ADMIN_POWER_STATE_REMOVED,
+    DEVICE_ADMIN_POWER_STATE_MAX
+} DeviceAdminPowerState;
+
 /**
  * struct DeviceState - common device state, accessed with qdev helpers
  *
@@ -240,6 +290,10 @@ struct DeviceState {
      * @realized: has device been realized?
      */
     bool realized;
+    /**
+     * @admin_power_state: device administrative power state
+     */
+    DeviceAdminPowerState admin_power_state;
     /**
      * @pending_deleted_event: track pending deletion events during unplug
      */
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index e2b2337399..0c9a2e7ea4 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -2765,6 +2765,7 @@ static void arm_cpu_class_init(ObjectClass *oc, const void *data)
     cc->gdb_get_core_xml_file = arm_gdb_get_core_xml_file;
     cc->gdb_stop_before_watchpoint = true;
     cc->disas_set_info = arm_disas_set_info;
+    dc->admin_power_state_supported = true;
 
 #ifdef CONFIG_TCG
     cc->tcg_ops = &arm_tcg_ops;
-- 
2.34.1




* [PATCH RFC V6 02/24] hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 01/24] hw/core: Introduce administrative power-state property and its accessors salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-09 11:28   ` Miguel Luis
  2025-10-09 11:51   ` Markus Armbruster
  2025-10-01  1:01 ` [PATCH RFC V6 03/24] hw/arm/virt: Clamp 'maxcpus' as-per machine's vCPU deferred online-capability salil.mehta
                   ` (24 subsequent siblings)
  26 siblings, 2 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

Add support for a new SMP configuration parameter, 'disabledcpus', which
specifies the number of additional CPUs that are present in the virtual
machine but administratively disabled at boot. These CPUs are visible in
firmware (e.g. ACPI tables) yet unavailable to the guest until explicitly
enabled via QMP/HMP, or via the 'device_set' API (introduced in later
patches).

This feature is intended for architectures that lack native CPU hotplug
support but can change the administrative power state of present CPUs.
It allows simulating CPU hot-add–like scenarios while all CPUs remain
physically present in the topology at boot time.

Note: ARM is the first architecture to support this concept.

Changes include:
 - Extend CpuTopology with a 'disabledcpus' field.
 - Update machine_parse_smp_config() to account for disabled CPUs when
   computing 'cpus' and 'maxcpus'.
 - Update SMPConfiguration in QAPI to accept 'disabledcpus'.
 - Extend -smp option documentation to describe 'disabledcpus' usage and
   behavior.
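
The count derivation listed above can be approximated by the following
standalone sketch. It is not the code from machine_parse_smp_config(): the
function name, the SmpCounts struct and the 'topo_product' parameter (the
product of the topology members when both 'cpus' and 'maxcpus' are omitted)
are hypothetical, and the explicit check against 'disabledcpus' consuming
every CPU is an assumption added here to avoid unsigned underflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned cpus;          /* present and enabled at boot */
    unsigned disabledcpus;  /* present but administratively disabled */
    unsigned maxcpus;       /* total possible CPUs */
} SmpCounts;

/* A value of 0 means "omitted by the user". Returns false if invalid. */
static bool model_resolve_smp(unsigned cpus, unsigned disabledcpus,
                              unsigned maxcpus, unsigned topo_product,
                              SmpCounts *out)
{
    assert(out != NULL);

    if (cpus == 0 && maxcpus == 0) {
        /* Both omitted: the topology decides the total. */
        maxcpus = topo_product > 0 ? topo_product : 1;
        if (disabledcpus >= maxcpus) {
            return false;   /* assumption: at least one enabled CPU */
        }
        cpus = maxcpus - disabledcpus;
    } else {
        if (maxcpus == 0) {
            maxcpus = cpus + disabledcpus;
        }
        if (cpus == 0) {
            if (disabledcpus >= maxcpus) {
                return false;
            }
            cpus = maxcpus - disabledcpus;
        }
    }

    /* maxcpus must cover both enabled and disabled CPUs. */
    if (maxcpus < cpus + disabledcpus) {
        return false;
    }

    out->cpus = cpus;
    out->disabledcpus = disabledcpus;
    out->maxcpus = maxcpus;
    return true;
}
```

Under these assumptions, '-smp 4,disabledcpus=2' resolves to 6 possible CPUs,
while '-smp 6,disabledcpus=4,maxcpus=8' is rejected because 6 + 4 exceeds 8.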

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/core/machine-smp.c | 24 +++++++-----
 include/hw/boards.h   |  2 +
 qapi/machine.json     |  3 ++
 qemu-options.hx       | 86 +++++++++++++++++++++++++++++++++----------
 system/vl.c           |  3 ++
 5 files changed, 89 insertions(+), 29 deletions(-)

diff --git a/hw/core/machine-smp.c b/hw/core/machine-smp.c
index 0be0ac044c..c1a09fdc3f 100644
--- a/hw/core/machine-smp.c
+++ b/hw/core/machine-smp.c
@@ -87,6 +87,7 @@ void machine_parse_smp_config(MachineState *ms,
 {
     MachineClass *mc = MACHINE_GET_CLASS(ms);
     unsigned cpus    = config->has_cpus ? config->cpus : 0;
+    unsigned disabledcpus = config->has_disabledcpus ? config->disabledcpus : 0;
     unsigned drawers = config->has_drawers ? config->drawers : 0;
     unsigned books   = config->has_books ? config->books : 0;
     unsigned sockets = config->has_sockets ? config->sockets : 0;
@@ -166,8 +167,13 @@ void machine_parse_smp_config(MachineState *ms,
         sockets = sockets > 0 ? sockets : 1;
         cores = cores > 0 ? cores : 1;
         threads = threads > 0 ? threads : 1;
+
+        maxcpus = drawers * books * sockets * dies * clusters *
+                    modules * cores * threads;
+        cpus = maxcpus - disabledcpus;
     } else {
-        maxcpus = maxcpus > 0 ? maxcpus : cpus;
+        maxcpus = maxcpus > 0 ? maxcpus : cpus + disabledcpus;
+        cpus = cpus > 0 ? cpus : maxcpus - disabledcpus;
 
         if (mc->smp_props.prefer_sockets) {
             /* prefer sockets over cores before 6.2 */
@@ -207,12 +213,8 @@ void machine_parse_smp_config(MachineState *ms,
         }
     }
 
-    total_cpus = drawers * books * sockets * dies *
-                 clusters * modules * cores * threads;
-    maxcpus = maxcpus > 0 ? maxcpus : total_cpus;
-    cpus = cpus > 0 ? cpus : maxcpus;
-
     ms->smp.cpus = cpus;
+    ms->smp.disabledcpus = disabledcpus;
     ms->smp.drawers = drawers;
     ms->smp.books = books;
     ms->smp.sockets = sockets;
@@ -226,6 +228,8 @@ void machine_parse_smp_config(MachineState *ms,
     mc->smp_props.has_clusters = config->has_clusters;
 
     /* sanity-check of the computed topology */
+    total_cpus = drawers * books * sockets * dies * clusters *
+                 modules * cores * threads;
     if (total_cpus != maxcpus) {
         g_autofree char *topo_msg = cpu_hierarchy_to_string(ms);
         error_setg(errp, "Invalid CPU topology: "
@@ -235,12 +239,12 @@ void machine_parse_smp_config(MachineState *ms,
         return;
     }
 
-    if (maxcpus < cpus) {
+    if (maxcpus < (cpus + disabledcpus)) {
         g_autofree char *topo_msg = cpu_hierarchy_to_string(ms);
         error_setg(errp, "Invalid CPU topology: "
-                   "maxcpus must be equal to or greater than smp: "
-                   "%s == maxcpus (%u) < smp_cpus (%u)",
-                   topo_msg, maxcpus, cpus);
+                   "maxcpus must be equal to or greater than smp[+disabledcpus]: "
+                   "%s == maxcpus (%u) < smp_cpus (%u) [+ offline cpus (%u)]",
+                   topo_msg, maxcpus, cpus, disabledcpus);
         return;
     }
 
diff --git a/include/hw/boards.h b/include/hw/boards.h
index f94713e6e2..2b182d7817 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -361,6 +361,7 @@ typedef struct DeviceMemoryState {
 /**
  * CpuTopology:
  * @cpus: the number of present logical processors on the machine
+ * @disabledcpus: the number of additional present but admin-disabled CPUs
  * @drawers: the number of drawers on the machine
  * @books: the number of books in one drawer
  * @sockets: the number of sockets in one book
@@ -373,6 +374,7 @@ typedef struct DeviceMemoryState {
  */
 typedef struct CpuTopology {
     unsigned int cpus;
+    unsigned int disabledcpus;
     unsigned int drawers;
     unsigned int books;
     unsigned int sockets;
diff --git a/qapi/machine.json b/qapi/machine.json
index 038eab281c..e45740da33 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1634,6 +1634,8 @@
 #
 # @cpus: number of virtual CPUs in the virtual machine
 #
+# @disabledcpus: number of additional present but disabled (or offline) CPUs
+#
 # @maxcpus: maximum number of hotpluggable virtual CPUs in the virtual
 #     machine
 #
@@ -1657,6 +1659,7 @@
 ##
 { 'struct': 'SMPConfiguration', 'data': {
      '*cpus': 'int',
+     '*disabledcpus': 'int',
      '*drawers': 'int',
      '*books': 'int',
      '*sockets': 'int',
diff --git a/qemu-options.hx b/qemu-options.hx
index ab23f14d21..83ccde341b 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -326,12 +326,15 @@ SRST
 ERST
 
 DEF("smp", HAS_ARG, QEMU_OPTION_smp,
-    "-smp [[cpus=]n][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets]\n"
-    "               [,dies=dies][,clusters=clusters][,modules=modules][,cores=cores]\n"
-    "               [,threads=threads]\n"
-    "                set the number of initial CPUs to 'n' [default=1]\n"
-    "                maxcpus= maximum number of total CPUs, including\n"
-    "                offline CPUs for hotplug, etc\n"
+    "-smp [[cpus=]n][,disabledcpus=disabledcpus][,maxcpus=maxcpus][,drawers=drawers][,books=books]\n"
+    "               [,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules]\n"
+    "               [,cores=cores][,threads=threads]\n"
+    "                set the initial number of CPUs present and\n"
+    "                  administratively enabled at boot time to 'n' [default=1]\n"
+    "                disabledcpus= number of present but administratively\n"
+    "                  disabled CPUs (unavailable to the guest at boot)\n"
+    "                maxcpus= maximum total CPUs (present + hotpluggable)\n"
+    "                  on machines without CPU hotplug, defaults to n + disabledcpus\n"
     "                drawers= number of drawers on the machine board\n"
     "                books= number of books in one drawer\n"
     "                sockets= number of sockets in one book\n"
@@ -351,22 +354,49 @@ DEF("smp", HAS_ARG, QEMU_OPTION_smp,
     "      For a particular machine type board, an expected CPU topology hierarchy\n"
     "      can be defined through the supported sub-option. Unsupported parameters\n"
     "      can also be provided in addition to the sub-option, but their values\n"
-    "      must be set as 1 in the purpose of correct parsing.\n",
+    "      must be set as 1 in the purpose of correct parsing.\n"
+    "                                                          \n"
+    "      Administratively disabled CPUs: Some machine types do not support vCPU\n"
+    "      hotplug but their CPUs can be marked disabled (powered off) and kept\n"
+    "      unavailable to the guest. Later, such CPUs can be enabled via QMP/HMP\n"
+    "      (e.g., 'device_set ... admin-state=enable'). This is similar to hotplug,\n"
+    "      except all disabled CPUs are already present at boot. Useful on\n"
+    "      architectures that lack architectural CPU hotplug.\n",
     QEMU_ARCH_ALL)
 SRST
-``-smp [[cpus=]n][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules][,cores=cores][,threads=threads]``
-    Simulate a SMP system with '\ ``n``\ ' CPUs initially present on
-    the machine type board. On boards supporting CPU hotplug, the optional
-    '\ ``maxcpus``\ ' parameter can be set to enable further CPUs to be
-    added at runtime. When both parameters are omitted, the maximum number
+``-smp [[cpus=]n][,disabledcpus=disabledcpus][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules][,cores=cores][,threads=threads]``
+    Simulate a SMP system with '\ ``n``\ ' CPUs initially present & enabled on
+    the machine type board. Furthermore, on architectures that support changing
+    the administrative power state of CPUs, the optional '\ ``disabledcpus``\ '
+    parameter specifies *additional* CPUs that are present in firmware (e.g.,
+    ACPI) but are administratively disabled (i.e., not usable by the guest at
+    boot time).
+
+    This is different from CPU hotplug where additional CPUs are not even
+    present in the system description. Administratively disabled CPUs appear in
+    ACPI tables, i.e. they are provisioned, but cannot be used until explicitly
+    enabled via QMP/HMP or the 'device_set' API.
+
+    On boards supporting CPU hotplug, the optional '\ ``maxcpus``\ ' parameter
+    can be set to enable further CPUs to be added at runtime. When both
+    '\ ``n``\ ' & '\ ``maxcpus``\ ' parameters are omitted, the maximum number
     of CPUs will be calculated from the provided topology members and the
-    initial CPU count will match the maximum number. When only one of them
-    is given then the omitted one will be set to its counterpart's value.
-    Both parameters may be specified, but the maximum number of CPUs must
-    be equal to or greater than the initial CPU count. Product of the
-    CPU topology hierarchy must be equal to the maximum number of CPUs.
-    Both parameters are subject to an upper limit that is determined by
-    the specific machine type chosen.
+    initial CPU count will match the maximum number. When only one of them is
+    given then the omitted one will be set to its counterpart's value. Both
+    parameters may be specified, but the maximum number of CPUs must be equal
+    to or greater than the initial CPU count. Product of the CPU topology
+    hierarchy must be equal to the maximum number of CPUs. Both parameters are
+    subject to an upper limit that is determined by the specific machine type
+    chosen. Boards that support administratively disabled CPUs but do *not*
+    support CPU hotplug derive the maximum number of CPUs implicitly:
+    '\ ``maxcpus``\ ' is treated as '\ ``n + disabledcpus``\ ' (the total CPUs
+    present in firmware). If '\ ``maxcpus``\ ' is provided, it must equal
+    '\ ``n + disabledcpus``\ '. The topology product must equal this derived
+    maximum as well.
+
+    Note: Administratively disabled CPUs will appear to the guest as
+    unavailable, and any attempt to bring them online must go through QMP/HMP
+    commands like 'device_set'.
 
     To control reporting of CPU topology information, values of the topology
     parameters can be specified. Machines may only support a subset of the
@@ -425,6 +455,24 @@ SRST
 
         -smp 2
 
+    Examples using 'disabledcpus':
+
+    For a board without CPU hotplug, enable 4 CPUs at boot and provision
+    2 additional administratively disabled CPUs (maximum is derived
+    implicitly as 6 = 4 + 2):
+
+    ::
+
+        -smp cpus=4,disabledcpus=2
+
+    For a board that supports CPU hotplug and 'disabledcpus', enable 4 CPUs
+    at boot, provision 2 administratively disabled CPUs, and allow hotplug of
+    2 more CPUs (for a maximum of 8):
+
+    ::
+
+        -smp cpus=4,disabledcpus=2,maxcpus=8
+
     Note: The cluster topology will only be generated in ACPI and exposed
     to guest if it's explicitly specified in -smp.
 ERST
diff --git a/system/vl.c b/system/vl.c
index 3b7057e6c6..2f0fd21a1f 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -736,6 +736,9 @@ static QemuOptsList qemu_smp_opts = {
         {
             .name = "cpus",
             .type = QEMU_OPT_NUMBER,
+        }, {
+            .name = "disabledcpus",
+            .type = QEMU_OPT_NUMBER,
         }, {
             .name = "drawers",
             .type = QEMU_OPT_NUMBER,
-- 
2.34.1




* [PATCH RFC V6 03/24] hw/arm/virt: Clamp 'maxcpus' as-per machine's vCPU deferred online-capability
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 01/24] hw/core: Introduce administrative power-state property and its accessors salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 02/24] hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-09 12:32   ` Miguel Luis
  2025-10-01  1:01 ` [PATCH RFC V6 04/24] arm/virt, target/arm: Add new ARMCPU {socket, cluster, core, thread}-id property salil.mehta
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

To support a vCPU hot-add–like model on ARM, the virt machine may be set up with
more CPUs than are active at boot. These additional CPUs are fully realized in
KVM and listed in ACPI tables from the start, but begin in a disabled state.
They can later be brought online or taken offline under host or platform policy
control. The CPU topology is fixed at VM creation time and cannot change
dynamically on ARM. Therefore, we must determine precisely the 'maxcpus' value
that applies for the full lifetime of the VM.

On ARM, this deferred online-capable model is only valid if:
  - The GIC version is 3 or higher, and
  - Each non-boot CPU’s GIC CPU Interface is marked “online-capable” in its
    ACPI GICC structure (UEFI ACPI Specification 6.5, §5.2.12.14, Table 5.37
    “GICC CPU Interface Flags”), and
  - The chosen accelerator supports safe deferred CPU online:
      * TCG with multi-threaded TCG (MTTCG) enabled
      * KVM (on supported hosts)
      * Not HVF or QTest

This patch sizes the machine’s max-possible CPUs during VM init:
  - If all conditions are satisfied, retain the full set of CPUs corresponding
    to (`-smp cpus` + `-smp disabledcpus`), allowing the additional (initially
    disabled) CPUs to participate in later policy-driven online.
  - Otherwise, clamp the max-possible CPUs to the boot-enabled count
    (`-smp disabledcpus=0` equivalent) to avoid advertising CPUs the guest can
    never use.

A new MachineClass flag, `has_online_capable_cpus`, records whether the machine
supports deferred vCPU online. This is usable by other machine types as well.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
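The sizing rule described in the commit message can be sketched as a small
standalone model. This is only an illustration under stated assumptions:
`struct smp_request`, `enum accel`, `deferred_online_supported()` and
`effective_max_cpus()` are hypothetical names that do not exist in QEMU. The
real logic lives in machvirt_init() and consults tcg_enabled(), hvf_enabled(),
qtest_enabled() and the finalized GIC version.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical accelerator tags, for illustration only. */
enum accel {
    ACCEL_KVM,
    ACCEL_TCG_MTTCG,   /* TCG with multi-threaded TCG enabled */
    ACCEL_TCG_SINGLE,  /* TCG without MTTCG */
    ACCEL_HVF,
    ACCEL_QTEST,
};

/* Hypothetical request model; not a QEMU structure. */
struct smp_request {
    unsigned int cpus;          /* -smp cpus=N, enabled at boot */
    unsigned int disabledcpus;  /* -smp disabledcpus=M, present but disabled */
    int gic_version;            /* 2, 3 or 4 */
    enum accel accel;
};

/* Deferred online needs GICv3 or later and a suitable accelerator. */
static bool deferred_online_supported(const struct smp_request *r)
{
    if (r->gic_version < 3) {
        return false;
    }
    return r->accel == ACCEL_KVM || r->accel == ACCEL_TCG_MTTCG;
}

/*
 * Keep cpus + disabledcpus when deferred online is possible; otherwise
 * clamp to the boot-enabled count, as if disabledcpus=0 had been given.
 */
static unsigned int effective_max_cpus(const struct smp_request *r)
{
    if (deferred_online_supported(r)) {
        return r->cpus + r->disabledcpus;
    }
    return r->cpus;
}
```

With `-smp cpus=4,disabledcpus=2`, this model yields 6 possible CPUs under KVM
with GICv3 and 4 under GICv2, HVF, qtest or single-threaded TCG.
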
 hw/arm/virt.c       | 84 ++++++++++++++++++++++++++++++---------------
 include/hw/boards.h |  1 +
 2 files changed, 57 insertions(+), 28 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ef6be3660f..76f21bd56a 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2168,8 +2168,7 @@ static void machvirt_init(MachineState *machine)
     bool has_ged = !vmc->no_ged;
     unsigned int smp_cpus = machine->smp.cpus;
     unsigned int max_cpus = machine->smp.max_cpus;
-
-    possible_cpus = mc->possible_cpu_arch_ids(machine);
+    DeviceClass *dc;
 
     /*
      * In accelerated mode, the memory map is computed earlier in kvm_type()
@@ -2186,7 +2185,7 @@ static void machvirt_init(MachineState *machine)
          * we are about to deal with. Once this is done, get rid of
          * the object.
          */
-        cpuobj = object_new(possible_cpus->cpus[0].type);
+        cpuobj = object_new(machine->cpu_type);
         armcpu = ARM_CPU(cpuobj);
 
         pa_bits = arm_pamax(armcpu);
@@ -2201,6 +2200,57 @@ static void machvirt_init(MachineState *machine)
      */
     finalize_gic_version(vms);
 
+    /*
+     * The maximum number of CPUs depends on the GIC version, or on how
+     * many redistributors we can fit into the memory map (which in turn
+     * depends on whether this is a GICv3 or v4).
+     */
+    if (vms->gic_version == VIRT_GIC_VERSION_2) {
+        virt_max_cpus = GIC_NCPU;
+    } else {
+        virt_max_cpus = virt_redist_capacity(vms, VIRT_GIC_REDIST);
+        if (vms->highmem_redists) {
+            virt_max_cpus += virt_redist_capacity(vms, VIRT_HIGH_GIC_REDIST2);
+        }
+    }
+
+    if ((tcg_enabled() && !qemu_tcg_mttcg_enabled()) || hvf_enabled() ||
+        qtest_enabled() || vms->gic_version == VIRT_GIC_VERSION_2) {
+        max_cpus = machine->smp.max_cpus = smp_cpus;
+        if (mc->has_online_capable_cpus) {
+            if (vms->gic_version == VIRT_GIC_VERSION_2) {
+                warn_report("GICv2 does not support online-capable CPUs");
+            }
+            mc->has_online_capable_cpus = false;
+        }
+    }
+
+    if (mc->has_online_capable_cpus) {
+        max_cpus = smp_cpus + machine->smp.disabledcpus;
+        machine->smp.max_cpus = max_cpus;
+    }
+
+    if (max_cpus > virt_max_cpus) {
+        error_report("Number of SMP CPUs requested (%d) exceeds max CPUs "
+                     "supported by machine 'mach-virt' (%d)",
+                     max_cpus, virt_max_cpus);
+        if (vms->gic_version != VIRT_GIC_VERSION_2 && !vms->highmem_redists) {
+            error_printf("Try 'highmem-redists=on' for more CPUs\n");
+        }
+
+        exit(1);
+    }
+
+    dc = DEVICE_CLASS(object_class_by_name(machine->cpu_type));
+    if (!dc) {
+        error_report("CPU type '%s' not registered", machine->cpu_type);
+        exit(1);
+    }
+    dc->admin_power_state_supported = mc->has_online_capable_cpus;
+
+    /* uses smp.max_cpus to initialize all possible vCPUs */
+    possible_cpus = mc->possible_cpu_arch_ids(machine);
+
     if (vms->secure) {
         /*
          * The Secure view of the world is the same as the NonSecure,
@@ -2235,31 +2285,6 @@ static void machvirt_init(MachineState *machine)
         vms->psci_conduit = QEMU_PSCI_CONDUIT_HVC;
     }
 
-    /*
-     * The maximum number of CPUs depends on the GIC version, or on how
-     * many redistributors we can fit into the memory map (which in turn
-     * depends on whether this is a GICv3 or v4).
-     */
-    if (vms->gic_version == VIRT_GIC_VERSION_2) {
-        virt_max_cpus = GIC_NCPU;
-    } else {
-        virt_max_cpus = virt_redist_capacity(vms, VIRT_GIC_REDIST);
-        if (vms->highmem_redists) {
-            virt_max_cpus += virt_redist_capacity(vms, VIRT_HIGH_GIC_REDIST2);
-        }
-    }
-
-    if (max_cpus > virt_max_cpus) {
-        error_report("Number of SMP CPUs requested (%d) exceeds max CPUs "
-                     "supported by machine 'mach-virt' (%d)",
-                     max_cpus, virt_max_cpus);
-        if (vms->gic_version != VIRT_GIC_VERSION_2 && !vms->highmem_redists) {
-            error_printf("Try 'highmem-redists=on' for more CPUs\n");
-        }
-
-        exit(1);
-    }
-
     if (vms->secure && !tcg_enabled() && !qtest_enabled()) {
         error_report("mach-virt: %s does not support providing "
                      "Security extensions (TrustZone) to the guest CPU",
@@ -3245,6 +3270,9 @@ static void virt_machine_class_init(ObjectClass *oc, const void *data)
     hc->plug = virt_machine_device_plug_cb;
     hc->unplug_request = virt_machine_device_unplug_request_cb;
     hc->unplug = virt_machine_device_unplug_cb;
+
+    mc->has_online_capable_cpus = true;
+
     mc->nvdimm_supported = true;
     mc->smp_props.clusters_supported = true;
     mc->auto_enable_numa_with_memhp = true;
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 2b182d7817..b27c2326a2 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -302,6 +302,7 @@ struct MachineClass {
     bool rom_file_has_mr;
     int minimum_page_bits;
     bool has_hotpluggable_cpus;
+    bool has_online_capable_cpus;
     bool ignore_memory_transaction_failures;
     int numa_mem_align_shift;
     const char * const *valid_cpu_types;
-- 
2.34.1




* [PATCH RFC V6 04/24] arm/virt, target/arm: Add new ARMCPU {socket, cluster, core, thread}-id property
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (2 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 03/24] hw/arm/virt: Clamp 'maxcpus' as-per machine's vCPU deferred online-capability salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 05/24] arm/virt, kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init salil.mehta
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

Store the user-specified topology (socket/cluster/core/thread) and derive a
unique 'vcpu-id'. The 'vcpu-id' is used as the slot index in the possible vCPUs
list when administratively enabling or disabling a vCPU.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Reviewed-by: Miguel Luis <miguel.luis@oracle.com>
---
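As an illustration of how a linear slot index relates to the four topology
IDs, here is a minimal standalone sketch. It assumes the thread level varies
fastest and the socket level slowest; `struct topo`, `struct topo_ids`,
`index_to_ids()` and `ids_to_index()` are hypothetical names used only for
this note. The series itself reads the IDs from the `props` of each
possible_cpus entry, and its actual 'vcpu-id' derivation may differ.

```c
#include <assert.h>

/* Hypothetical topology sizes, mirroring -smp sockets/clusters/cores/threads. */
struct topo {
    unsigned int sockets;
    unsigned int clusters;  /* clusters per socket */
    unsigned int cores;     /* cores per cluster */
    unsigned int threads;   /* threads per core */
};

struct topo_ids {
    unsigned int socket_id;
    unsigned int cluster_id;
    unsigned int core_id;
    unsigned int thread_id;
};

/* Split a linear index into per-level IDs, thread level varying fastest. */
static struct topo_ids index_to_ids(const struct topo *t, unsigned int idx)
{
    struct topo_ids ids;

    ids.thread_id = idx % t->threads;
    ids.core_id = (idx / t->threads) % t->cores;
    ids.cluster_id = (idx / (t->threads * t->cores)) % t->clusters;
    ids.socket_id = (idx / (t->threads * t->cores * t->clusters)) % t->sockets;
    return ids;
}

/* Inverse mapping: recombine the IDs into the linear slot index. */
static unsigned int ids_to_index(const struct topo *t,
                                 const struct topo_ids *ids)
{
    return ((ids->socket_id * t->clusters + ids->cluster_id) * t->cores +
            ids->core_id) * t->threads + ids->thread_id;
}
```

The round trip index -> IDs -> index is the property that makes such an ID
usable as a slot index into the possible vCPUs list.
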
 hw/arm/virt.c         | 10 ++++++++++
 include/hw/arm/virt.h | 36 ++++++++++++++++++++++++++++++++++++
 target/arm/cpu.c      |  4 ++++
 target/arm/cpu.h      |  4 ++++
 4 files changed, 54 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 76f21bd56a..4ded19dc69 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2334,6 +2334,14 @@ static void machvirt_init(MachineState *machine)
                           &error_fatal);
 
         aarch64 &= object_property_get_bool(cpuobj, "aarch64", NULL);
+        object_property_set_int(cpuobj, "socket-id", virt_get_socket_id(n),
+                                NULL);
+        object_property_set_int(cpuobj, "cluster-id", virt_get_cluster_id(n),
+                                NULL);
+        object_property_set_int(cpuobj, "core-id", virt_get_core_id(n),
+                                NULL);
+        object_property_set_int(cpuobj, "thread-id", virt_get_thread_id(n),
+                                NULL);
 
         if (!vms->secure) {
             object_property_set_bool(cpuobj, "has_el3", false, NULL);
@@ -2902,6 +2910,7 @@ static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
 {
     int n;
     unsigned int max_cpus = ms->smp.max_cpus;
+    unsigned int smp_threads = ms->smp.threads;
     VirtMachineState *vms = VIRT_MACHINE(ms);
     MachineClass *mc = MACHINE_GET_CLASS(vms);
 
@@ -2915,6 +2924,7 @@ static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
     ms->possible_cpus->len = max_cpus;
     for (n = 0; n < ms->possible_cpus->len; n++) {
         ms->possible_cpus->cpus[n].type = ms->cpu_type;
+        ms->possible_cpus->cpus[n].vcpus_count = smp_threads;
         ms->possible_cpus->cpus[n].arch_id =
             virt_cpu_mp_affinity(vms, n);
 
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 365a28b082..683e4b965a 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -213,4 +213,40 @@ static inline int virt_gicv3_redist_region_count(VirtMachineState *vms)
             vms->highmem_redists) ? 2 : 1;
 }
 
+static inline int virt_get_socket_id(int cpu_index)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    assert(cpu_index >= 0 && cpu_index < ms->possible_cpus->len);
+
+    return ms->possible_cpus->cpus[cpu_index].props.socket_id;
+}
+
+static inline int virt_get_cluster_id(int cpu_index)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    assert(cpu_index >= 0 && cpu_index < ms->possible_cpus->len);
+
+    return ms->possible_cpus->cpus[cpu_index].props.cluster_id;
+}
+
+static inline int virt_get_core_id(int cpu_index)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    assert(cpu_index >= 0 && cpu_index < ms->possible_cpus->len);
+
+    return ms->possible_cpus->cpus[cpu_index].props.core_id;
+}
+
+static inline int virt_get_thread_id(int cpu_index)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    assert(cpu_index >= 0 && cpu_index < ms->possible_cpus->len);
+
+    return ms->possible_cpus->cpus[cpu_index].props.thread_id;
+}
+
 #endif /* QEMU_ARM_VIRT_H */
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 0c9a2e7ea4..7e0d5b2ed8 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -2607,6 +2607,10 @@ static const Property arm_cpu_properties[] = {
     DEFINE_PROP_UINT64("mp-affinity", ARMCPU,
                         mp_affinity, ARM64_AFFINITY_INVALID),
     DEFINE_PROP_INT32("node-id", ARMCPU, node_id, CPU_UNSET_NUMA_NODE_ID),
+    DEFINE_PROP_INT32("socket-id", ARMCPU, socket_id, 0),
+    DEFINE_PROP_INT32("cluster-id", ARMCPU, cluster_id, 0),
+    DEFINE_PROP_INT32("core-id", ARMCPU, core_id, 0),
+    DEFINE_PROP_INT32("thread-id", ARMCPU, thread_id, 0),
     DEFINE_PROP_INT32("core-count", ARMCPU, core_count, -1),
     /* True to default to the backward-compat old CNTFRQ rather than 1Ghz */
     DEFINE_PROP_BOOL("backcompat-cntfrq", ARMCPU, backcompat_cntfrq, false),
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index dc9b6dce4c..cd5982d362 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1126,6 +1126,10 @@ struct ArchCPU {
     QLIST_HEAD(, ARMELChangeHook) el_change_hooks;
 
     int32_t node_id; /* NUMA node this CPU belongs to */
+    int32_t socket_id;
+    int32_t cluster_id;
+    int32_t core_id;
+    int32_t thread_id;
 
     /* Used to synchronize KVM and QEMU in-kernel device levels */
     uint8_t device_irq_level;
-- 
2.34.1




* [PATCH RFC V6 05/24] arm/virt, kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (3 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 04/24] arm/virt, target/arm: Add new ARMCPU {socket, cluster, core, thread}-id property salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-22 10:36   ` [PATCH RFC V6 05/24] arm/virt,kvm: " Gavin Shan
  2025-10-01  1:01 ` [PATCH RFC V6 06/24] arm/virt, gicv3: Pre-size GIC with possible " salil.mehta
                   ` (21 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu, Keqian Zhu

From: Salil Mehta <salil.mehta@huawei.com>

The ARM CPU architecture does not allow CPUs to be plugged in after the system
has initialized. Because of this constraint, the kernel must know about all the
CPUs during its initialization. This applies to the guest kernel as well;
therefore, the number of KVM vCPU descriptors in the host must be fixed at VM
initialization time.

Also, the GIC must know all the CPUs it is connected to during its
initialization, and this cannot change afterward. This must also be ensured
during the initialization of the VGIC in KVM. This is necessary because:

1. The association between GICR and MPIDR must be fixed at VM initialization
   time. This is represented by the register
   `GICR_TYPER(mp_affinity, proc_num)`.
2. Memory regions associated with GICR, etc., cannot be changed (added,
   deleted, or modified) after the VM has been initialized. This is not an
   ARM architectural constraint; rather, lifting it would require difficult
   and invasive changes to the VGIC data structures.

To enable a hot-add–like model while preserving these constraints, the virt
machine may enumerate more CPUs than are enabled at boot using
`-smp disabledcpus=N`. Such CPUs are present but start offline (i.e.,
administratively disabled at init). The topology remains fixed at VM
creation time; only the online/offline status may change later.

Administratively disabled vCPUs are not realized in QOM until first enabled,
avoiding creation of unnecessary vCPU threads at boot. On large systems, this
reduces startup time proportionally to the number of disabled vCPUs. Once a
QOM vCPU is realized and its thread created, subsequent enable/disable actions
do not unrealize it. This behaviour was adopted following review feedback and
differs from earlier RFC versions.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
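The create, park and reuse lifecycle can be pictured with the toy model below.
It is a sketch only: `struct toy_vm`, `toy_get_or_create_vcpu()` and
`toy_park_vcpu()` are hypothetical stand-ins that do not call into KVM, and
the fd values are fake. In the patch, the corresponding steps are
kvm_create_vcpu(), kvm_arch_init_vcpu() and kvm_park_vcpu().

```c
#include <assert.h>

#define TOY_MAX_VCPUS 8

/* Toy stand-in for a host vCPU descriptor. */
struct toy_vcpu {
    int fd;      /* fake descriptor number */
    int exists;  /* descriptor has been created in the "host" */
    int parked;  /* created but not attached to a vCPU thread */
};

struct toy_vm {
    struct toy_vcpu vcpus[TOY_MAX_VCPUS];
    int created;  /* number of descriptors ever created */
    int next_fd;  /* next fake descriptor number to hand out */
};

/*
 * Reuse the parked descriptor for vcpu_id if there is one, otherwise
 * create it. Returns the fd, or -1 if the descriptor is already in use.
 */
static int toy_get_or_create_vcpu(struct toy_vm *vm, unsigned int vcpu_id)
{
    struct toy_vcpu *v;

    assert(vcpu_id < TOY_MAX_VCPUS);
    v = &vm->vcpus[vcpu_id];

    if (v->exists) {
        if (!v->parked) {
            return -1;
        }
        v->parked = 0;
        return v->fd;
    }

    v->fd = vm->next_fd++;
    v->exists = 1;
    vm->created++;
    return v->fd;
}

/* Park a created descriptor so that a later enable can pick it up again. */
static void toy_park_vcpu(struct toy_vm *vm, unsigned int vcpu_id)
{
    assert(vcpu_id < TOY_MAX_VCPUS && vm->vcpus[vcpu_id].exists);
    vm->vcpus[vcpu_id].parked = 1;
}
```

In this model, a disabled vCPU is created and parked once at machine init;
enabling it later returns the same descriptor and creates nothing new, so the
number of host descriptors stays fixed after initialization.
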
 accel/kvm/kvm-all.c    |  2 +-
 hw/arm/virt.c          | 77 ++++++++++++++++++++++++++++++++++++++----
 hw/core/qdev.c         | 17 ++++++++++
 include/hw/qdev-core.h | 19 +++++++++++
 include/system/kvm.h   |  8 +++++
 target/arm/cpu.c       |  2 ++
 target/arm/kvm.c       | 40 +++++++++++++++++++++-
 target/arm/kvm_arm.h   | 11 ++++++
 8 files changed, 168 insertions(+), 8 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 890d5ea9f8..0e7d9d5c3d 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -460,7 +460,7 @@ static void kvm_reset_parked_vcpus(KVMState *s)
  *
  * @returns: 0 when success, errno (<0) when failed.
  */
-static int kvm_create_vcpu(CPUState *cpu)
+int kvm_create_vcpu(CPUState *cpu)
 {
     unsigned long vcpu_id = kvm_arch_vcpu_id(cpu);
     KVMState *s = kvm_state;
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 4ded19dc69..f4eeeacf6c 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2152,6 +2152,49 @@ static void virt_post_cpus_gic_realized(VirtMachineState *vms,
     }
 }
 
+static void
+virt_setup_lazy_vcpu_realization(Object *cpuobj, VirtMachineState *vms)
+{
+    /*
+     * Present & administratively disabled vCPUs:
+     *
+     * These CPUs are marked offline at init via '-smp disabledcpus=N'. We
+     * intentionally do not realize them during the first boot, since it is
+     * not known if or when they will ever be enabled. The decision to enable
+     * such CPUs depends on policy (e.g. guided by SLAs or other deployment
+     * requirements).
+     *
+     * Realizing all disabled vCPUs up front would make boot time proportional
+     * to 'maxcpus', even if policy permits only a small subset to be enabled.
+     * This can lead to unacceptable boot delays in some scenarios.
+     *
+     * Instead, these CPUs remain administratively disabled and unrealized at
+     * boot, to be instantiated and brought online only if policy later allows
+     * it.
+     */
+
+    /* set this vCPU to be administratively 'disabled' in QOM */
+    qdev_disable(DEVICE(cpuobj), NULL, &error_fatal);
+
+    if (vms->psci_conduit != QEMU_PSCI_CONDUIT_DISABLED) {
+        object_property_set_int(cpuobj, "psci-conduit", vms->psci_conduit,
+                                NULL);
+    }
+
+    /*
+     * [!] Constraint: The ARM CPU architecture does not permit new CPUs
+     * to be added after system initialization.
+     *
+     * Workaround: Pre-create KVM vCPUs even for those that are not yet
+     * online i.e. powered-off, keeping them `parked` and in an
+     * `unrealized (at least during boot time)` state within QEMU until
+     * they are powered-on and made online.
+     */
+    if (kvm_enabled()) {
+        kvm_arm_create_host_vcpu(ARM_CPU(cpuobj));
+    }
+}
+
 static void machvirt_init(MachineState *machine)
 {
     VirtMachineState *vms = VIRT_MACHINE(machine);
@@ -2319,10 +2362,6 @@ static void machvirt_init(MachineState *machine)
         Object *cpuobj;
         CPUState *cs;
 
-        if (n >= smp_cpus) {
-            break;
-        }
-
         cpuobj = object_new(possible_cpus->cpus[n].type);
         object_property_set_int(cpuobj, "mp-affinity",
                                 possible_cpus->cpus[n].arch_id, NULL);
@@ -2427,8 +2466,34 @@ static void machvirt_init(MachineState *machine)
             }
         }
 
-        qdev_realize(DEVICE(cpuobj), NULL, &error_fatal);
-        object_unref(cpuobj);
+        /* start secondary vCPUs in a powered-down state */
+        if (n && mc->has_online_capable_cpus) {
+            object_property_set_bool(cpuobj, "start-powered-off", true, NULL);
+        }
+
+        if (n < smp_cpus) {
+            /* 'Present' & 'Enabled' vCPUs */
+            qdev_realize(DEVICE(cpuobj), NULL, &error_fatal);
+            object_unref(cpuobj);
+        } else {
+            /* 'Present' & 'Disabled' vCPUs */
+            virt_setup_lazy_vcpu_realization(cpuobj, vms);
+        }
+
+        /*
+         * All possible vCPUs should have QOM vCPU Object pointer & arch-id.
+         * 'cpus_queue' (accessed via qemu_get_cpu()) contains only realized and
+         * enabled vCPUs. Hence, we must now populate the 'possible_cpus' list.
+         */
+        if (kvm_enabled()) {
+            /*
+             * Override the default architecture ID with the one retrieved
+             * from KVM, as they currently differ.
+             */
+            machine->possible_cpus->cpus[n].arch_id =
+                arm_cpu_mp_affinity(ARM_CPU(cs));
+        }
+        machine->possible_cpus->cpus[n].cpu = cs;
     }
 
     /* Now we've created the CPUs we can see if they have the hypvirt timer */
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 8502d6216f..5816abae39 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -309,6 +309,23 @@ void qdev_assert_realized_properly(void)
                                    qdev_assert_realized_properly_cb, NULL);
 }
 
+bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp)
+{
+    g_assert(dev);
+
+    if (bus) {
+        error_setg(errp, "Device %s 'disable' operation not supported",
+                   object_get_typename(OBJECT(dev)));
+        return false;
+    }
+
+    /* devices like cpu don't have bus */
+    g_assert(!DEVICE_GET_CLASS(dev)->bus_type);
+
+    return object_property_set_str(OBJECT(dev), "admin_power_state", "disabled",
+                                   errp);
+}
+
 bool qdev_machine_modified(void)
 {
     return qdev_hot_added || qdev_hot_removed;
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 3bc212ab3a..2c22b32a3f 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -570,6 +570,25 @@ bool qdev_realize(DeviceState *dev, BusState *bus, Error **errp);
  */
 bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp);
 
+/**
+ * qdev_disable - Initiate administrative disablement and power-off of device
+ * @dev:   The device to be administratively powered off
+ * @bus:   The bus on which the device resides (may be NULL for CPUs)
+ * @errp:  Pointer to a location where an error can be reported
+ *
+ * This function initiates an administrative transition of the device into a
+ * DISABLED state. This may trigger a graceful shutdown process depending on
+ * platform capabilities. For ACPI platforms, this typically involves notifying
+ * the guest via events such as Notify(..., 0x03) and executing _EJx.
+ *
+ * Once completed, the device's operational power is turned off and it is
+ * marked as administratively DISABLED. Further guest usage is blocked until
+ * re-enabled by host-side policy.
+ *
+ * Returns true on success; false if an error occurs, with @errp populated.
+ */
+bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp);
+
 /**
  * qdev_unrealize: Unrealize a device
  * @dev: device to unrealize
diff --git a/include/system/kvm.h b/include/system/kvm.h
index 3c7d314736..4896a3c9c5 100644
--- a/include/system/kvm.h
+++ b/include/system/kvm.h
@@ -317,6 +317,14 @@ int kvm_create_device(KVMState *s, uint64_t type, bool test);
  */
 bool kvm_device_supported(int vmfd, uint64_t type);
 
+/**
+ * kvm_create_vcpu - Gets a parked KVM vCPU or creates a KVM vCPU
+ * @cpu: QOM CPUState object for which KVM vCPU has to be fetched/created.
+ *
+ * @returns: 0 when success, errno (<0) when failed.
+ */
+int kvm_create_vcpu(CPUState *cpu);
+
 /**
  * kvm_park_vcpu - Park QEMU KVM vCPU context
  * @cpu: QOM CPUState object for which QEMU KVM vCPU context has to be parked.
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 7e0d5b2ed8..a5906d1672 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -1500,6 +1500,8 @@ static void arm_cpu_initfn(Object *obj)
         /* TCG and HVF implement PSCI 1.1 */
         cpu->psci_version = QEMU_PSCI_VERSION_1_1;
     }
+
+    CPU(obj)->thread_id = 0;
 }
 
 /*
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 6672344855..1962eb29b2 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -991,6 +991,38 @@ void kvm_arm_reset_vcpu(ARMCPU *cpu)
     write_list_to_cpustate(cpu);
 }
 
+void kvm_arm_create_host_vcpu(ARMCPU *cpu)
+{
+    CPUState *cs = CPU(cpu);
+    unsigned long vcpu_id = cs->cpu_index;
+    int ret;
+
+    ret = kvm_create_vcpu(cs);
+    if (ret < 0) {
+        error_report("Failed to create host vcpu %lu", vcpu_id);
+        abort();
+    }
+
+    /*
+     * Initialize the vCPU in the host. This will reset the sys regs
+     * for this vCPU and related registers like MPIDR_EL1 etc. also
+     * get programmed during this call to host. These are referenced
+     * later while setting device attributes of the GICR during GICv3
+     * reset.
+     */
+    ret = kvm_arch_init_vcpu(cs);
+    if (ret < 0) {
+        error_report("Failed to initialize host vcpu %lu", vcpu_id);
+        abort();
+    }
+
+    /*
+     * Park the created vCPU. It is picked up again by kvm_get_vcpu() when
+     * threads are created during realization of ARM vCPUs.
+     */
+    kvm_park_vcpu(cs);
+}
+
 /*
  * Update KVM's MP_STATE based on what QEMU thinks it is
  */
@@ -1876,7 +1908,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
         return -EINVAL;
     }
 
-    qemu_add_vm_change_state_handler(kvm_arm_vm_state_change, cpu);
+    /*
+     * Install VM change handler only when vCPU thread has been spawned
+     * i.e. vCPU is being realized
+     */
+    if (cs->thread_id) {
+        qemu_add_vm_change_state_handler(kvm_arm_vm_state_change, cpu);
+    }
 
     /* Determine init features for this CPU */
     memset(cpu->kvm_init_features, 0, sizeof(cpu->kvm_init_features));
diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 6a9b6374a6..ec9dc95ee8 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -98,6 +98,17 @@ bool kvm_arm_cpu_post_load(ARMCPU *cpu);
 void kvm_arm_reset_vcpu(ARMCPU *cpu);
 
 struct kvm_vcpu_init;
+
+/**
+ * kvm_arm_create_host_vcpu:
+ * @cpu: ARMCPU
+ *
+ * Pre-creates a possible KVM vCPU in the host during the `virt_machine`
+ * initialization phase. The pre-created vCPU is parked and reused when the
+ * ARM QOM vCPU is actually hotplugged.
+ */
+void kvm_arm_create_host_vcpu(ARMCPU *cpu);
+
 /**
  * kvm_arm_create_scratch_host_vcpu:
  * @fdarray: filled in with kvmfd, vmfd, cpufd file descriptors in that order
-- 
2.34.1




* [PATCH RFC V6 06/24] arm/virt, gicv3: Pre-size GIC with possible vCPUs at machine init
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (4 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 05/24] arm/virt, kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 07/24] arm/gicv3: Refactor CPU interface init for shared TCG/KVM use salil.mehta
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

Pre-size the GIC with the maximum possible vCPUs during machine initialization
instead of the currently enabled CPU count. This ensures that the GIC is fully
provisioned for any vCPUs that may be enabled later by administrative or
hot-add–like operations.

Pre-sizing must also include redistributors for administratively disabled
vCPUs. This is required because:

1. Memory regions and resources associated with GICC/GICR cannot be modified
   (added, deleted, or resized) after VM initialization.
2. The GICD_TYPER and related redistributor structures must be initialized with
   correct mp_affinity and CPU interface numbering at creation time, and cannot
   be altered later.
3. Dynamically resizing GIC CPU interfaces is unsupported and would break
   architectural guarantees.

This patch:
 - Replaces use of `ms->smp.cpus` with `ms->smp.max_cpus` for GIC sizing,
   redistributor allocation, and interrupt wiring.
 - Updates GICv3 realization to fetch CPU references via
   `machine_get_possible_cpu()` instead of `qemu_get_cpu()`, ensuring that CPUs
   not yet realized but part of the possible set are accounted for.
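
For illustration only, the redistributor sizing described above can be sketched
as below. The struct and helper names are hypothetical and are not part of this
patch, and the capacities used in any call are made-up numbers. Unlike
create_gic(), the sketch always computes a second-region count, which is simply
0 when everything fits in the first region.

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical container for the per-region redistributor counts. */
struct redist_split {
    uint32_t region0;
    uint32_t region1;
};

/*
 * Hypothetical helper: size the redistributor regions from the maximum
 * possible vCPU count (max_cpus), not from the boot-time enabled count.
 */
static struct redist_split split_redistributors(uint32_t max_cpus,
                                                uint32_t redist0_capacity,
                                                uint32_t redist1_capacity)
{
    struct redist_split s;

    /* Fill the first region, then spill the remainder into the second. */
    s.region0 = MIN(max_cpus, redist0_capacity);
    s.region1 = MIN(max_cpus - s.region0, redist1_capacity);

    /* The regions never hold more redistributors than possible vCPUs. */
    assert(s.region0 + s.region1 <= max_cpus);
    return s;
}
```

With max_cpus larger than the first region's capacity, the remainder lands in
the second region even if only a few vCPUs are enabled at boot.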

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/arm/virt.c              | 24 ++++++++++++------------
 hw/core/machine.c          | 14 ++++++++++++++
 hw/intc/arm_gicv3_common.c |  4 ++--
 include/hw/arm/virt.h      |  2 +-
 include/hw/boards.h        | 12 ++++++++++++
 5 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index f4eeeacf6c..ee09aa19bd 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -793,7 +793,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion *mem)
     SysBusDevice *gicbusdev;
     const char *gictype;
     int i;
-    unsigned int smp_cpus = ms->smp.cpus;
+    unsigned int max_cpus = ms->smp.max_cpus;
     uint32_t nb_redist_regions = 0;
     int revision;
 
@@ -825,7 +825,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion *mem)
 
     vms->gic = qdev_new(gictype);
     qdev_prop_set_uint32(vms->gic, "revision", revision);
-    qdev_prop_set_uint32(vms->gic, "num-cpu", smp_cpus);
+    qdev_prop_set_uint32(vms->gic, "num-cpu", max_cpus);
     /* Note that the num-irq property counts both internal and external
      * interrupts; there are always 32 of the former (mandated by GIC spec).
      */
@@ -837,7 +837,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion *mem)
     if (vms->gic_version != VIRT_GIC_VERSION_2) {
         QList *redist_region_count;
         uint32_t redist0_capacity = virt_redist_capacity(vms, VIRT_GIC_REDIST);
-        uint32_t redist0_count = MIN(smp_cpus, redist0_capacity);
+        uint32_t redist0_count = MIN(max_cpus, redist0_capacity);
 
         nb_redist_regions = virt_gicv3_redist_region_count(vms);
 
@@ -848,7 +848,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion *mem)
                 virt_redist_capacity(vms, VIRT_HIGH_GIC_REDIST2);
 
             qlist_append_int(redist_region_count,
-                MIN(smp_cpus - redist0_count, redist1_capacity));
+                MIN(max_cpus - redist0_count, redist1_capacity));
         }
         qdev_prop_set_array(vms->gic, "redist-region-count",
                             redist_region_count);
@@ -896,8 +896,8 @@ static void create_gic(VirtMachineState *vms, MemoryRegion *mem)
      * and the GIC's IRQ/FIQ/VIRQ/VFIQ/NMI/VINMI interrupt outputs to the
      * CPU's inputs.
      */
-    for (i = 0; i < smp_cpus; i++) {
-        DeviceState *cpudev = DEVICE(qemu_get_cpu(i));
+    for (i = 0; i < max_cpus; i++) {
+        DeviceState *cpudev = DEVICE(machine_get_possible_cpu(i));
         int intidbase = NUM_IRQS + i * GIC_INTERNAL;
         /* Mapping from the output timer irq lines from the CPU to the
          * GIC PPI inputs we use for the virt board.
@@ -926,7 +926,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion *mem)
         } else if (vms->virt) {
             qemu_irq irq = qdev_get_gpio_in(vms->gic,
                                             intidbase + ARCH_GIC_MAINT_IRQ);
-            sysbus_connect_irq(gicbusdev, i + 4 * smp_cpus, irq);
+            sysbus_connect_irq(gicbusdev, i + 4 * max_cpus, irq);
         }
 
         qdev_connect_gpio_out_named(cpudev, "pmu-interrupt", 0,
@@ -934,17 +934,17 @@ static void create_gic(VirtMachineState *vms, MemoryRegion *mem)
                                                      + VIRTUAL_PMU_IRQ));
 
         sysbus_connect_irq(gicbusdev, i, qdev_get_gpio_in(cpudev, ARM_CPU_IRQ));
-        sysbus_connect_irq(gicbusdev, i + smp_cpus,
+        sysbus_connect_irq(gicbusdev, i + max_cpus,
                            qdev_get_gpio_in(cpudev, ARM_CPU_FIQ));
-        sysbus_connect_irq(gicbusdev, i + 2 * smp_cpus,
+        sysbus_connect_irq(gicbusdev, i + 2 * max_cpus,
                            qdev_get_gpio_in(cpudev, ARM_CPU_VIRQ));
-        sysbus_connect_irq(gicbusdev, i + 3 * smp_cpus,
+        sysbus_connect_irq(gicbusdev, i + 3 * max_cpus,
                            qdev_get_gpio_in(cpudev, ARM_CPU_VFIQ));
 
         if (vms->gic_version != VIRT_GIC_VERSION_2) {
-            sysbus_connect_irq(gicbusdev, i + 4 * smp_cpus,
+            sysbus_connect_irq(gicbusdev, i + 4 * max_cpus,
                                qdev_get_gpio_in(cpudev, ARM_CPU_NMI));
-            sysbus_connect_irq(gicbusdev, i + 5 * smp_cpus,
+            sysbus_connect_irq(gicbusdev, i + 5 * max_cpus,
                                qdev_get_gpio_in(cpudev, ARM_CPU_VINMI));
         }
     }
diff --git a/hw/core/machine.c b/hw/core/machine.c
index bd47527479..69d5632464 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -1369,6 +1369,20 @@ bool machine_require_guest_memfd(MachineState *machine)
     return machine->cgs && machine->cgs->require_guest_memfd;
 }
 
+CPUState *machine_get_possible_cpu(int64_t cpu_index)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    const CPUArchIdList *possible_cpus = ms->possible_cpus;
+
+    for (int i = 0; i < possible_cpus->len; i++) {
+        if (possible_cpus->cpus[i].cpu &&
+            possible_cpus->cpus[i].cpu->cpu_index == cpu_index) {
+            return possible_cpus->cpus[i].cpu;
+        }
+    }
+    return NULL;
+}
+
 static char *cpu_slot_to_string(const CPUArchId *cpu)
 {
     GString *s = g_string_new(NULL);
diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index e438d8c042..f6a9f1c68b 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -32,7 +32,7 @@
 #include "gicv3_internal.h"
 #include "hw/arm/linux-boot-if.h"
 #include "system/kvm.h"
-
+#include "hw/boards.h"
 
 static void gicv3_gicd_no_migration_shift_bug_post_load(GICv3State *cs)
 {
@@ -436,7 +436,7 @@ static void arm_gicv3_common_realize(DeviceState *dev, Error **errp)
     s->cpu = g_new0(GICv3CPUState, s->num_cpu);
 
     for (i = 0; i < s->num_cpu; i++) {
-        CPUState *cpu = qemu_get_cpu(i);
+        CPUState *cpu = machine_get_possible_cpu(i);
         uint64_t cpu_affid;
 
         s->cpu[i].cpu = cpu;
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 683e4b965a..ace4154cc6 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -209,7 +209,7 @@ static inline int virt_gicv3_redist_region_count(VirtMachineState *vms)
 
     assert(vms->gic_version != VIRT_GIC_VERSION_2);
 
-    return (MACHINE(vms)->smp.cpus > redist0_capacity &&
+    return (MACHINE(vms)->smp.max_cpus > redist0_capacity &&
             vms->highmem_redists) ? 2 : 1;
 }
 
diff --git a/include/hw/boards.h b/include/hw/boards.h
index b27c2326a2..3ff77a8b3a 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -118,6 +118,18 @@ bool device_is_dynamic_sysbus(MachineClass *mc, DeviceState *dev);
 MemoryRegion *machine_consume_memdev(MachineState *machine,
                                      HostMemoryBackend *backend);
 
+/**
+ * machine_get_possible_cpu: Gets 'CPUState' for the CPU with the given logical
+ * cpu_index. The slot index in possible_cpus[] list is always sequential, but
+ * 'cpu_index' values may not be sequential depending on machine implementation
+ * (e.g. with hotplug/unplug). Therefore, this function must scan the list to
+ * find a match.
+ * @cpu_index: logical cpu index to search for 'CPUState'
+ *
+ * Returns: pointer to CPUState, or NULL if not found.
+ */
+CPUState *machine_get_possible_cpu(int64_t cpu_index);
+
 /**
  * CPUArchId:
  * @arch_id - architecture-dependent CPU ID of present or possible CPU
-- 
2.34.1




* [PATCH RFC V6 07/24] arm/gicv3: Refactor CPU interface init for shared TCG/KVM use
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (5 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 06/24] arm/virt, gicv3: Pre-size GIC with possible " salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 08/24] arm/virt, gicv3: Guard CPU interface access for admin disabled vCPUs salil.mehta
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

GICv3 CPU interface initialization currently has separate logic paths for TCG
and KVM accelerators, even though much of the flow (such as iterating over
vCPUs and applying common setup) should be identical. This separation makes it
harder to add new CPU interface features that apply to both backends, as each
needs to be updated individually.

To address this, the common CPU interface setup is now centralized in
`gicv3_init_cpuif()`, called during GIC realization. Accelerator-specific
register-level initialization is still handled via the `init_cpu_reginfo`
class hook, but all iteration and shared setup is unified.

This refactoring is required to:
 - Ensure later patches can set `gicc_accessible` for all vCPUs in a consistent
   manner.
 - Provide a single entry point for any future common initialization, avoiding
   duplication between TCG and KVM.
 - Maintain correct initialization for both enabled and administratively
   disabled but present vCPUs.
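
As a rough sketch of the resulting shape (not the QEMU code itself), the shared
loop and the per-accelerator hook can be modelled as below. All type and
function names here are hypothetical stand-ins for CPUState, GICv3State and
ARMGICv3CommonClass; only the hook name init_cpu_reginfo mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU; the fields only record what was done. */
struct cpu {
    int regs_defined;
    int el_hook_registered;
};

/* Hypothetical stand-in for the GIC class with its per-accelerator hook. */
struct gic_class {
    void (*init_cpu_reginfo)(struct cpu *c);
};

struct gic {
    const struct gic_class *klass;
    struct cpu **cpus;
    size_t num_cpu;
};

/* TCG-like backend: defines registers and registers an EL change hook. */
static void tcg_init_cpu_reginfo(struct cpu *c)
{
    c->regs_defined = 1;
    c->el_hook_registered = 1;
}

/* KVM-like backend: only defines the registers. */
static void kvm_init_cpu_reginfo(struct cpu *c)
{
    c->regs_defined = 1;
}

/* Shared entry point: one loop, with backend specifics behind the hook. */
static void init_cpuif(struct gic *g)
{
    assert(g->klass->init_cpu_reginfo != NULL);
    for (size_t i = 0; i < g->num_cpu; i++) {
        g->klass->init_cpu_reginfo(g->cpus[i]);
    }
}
```

Any later common step (for example setting a per-vCPU flag) would then be added
once inside the shared loop rather than in each backend.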

No functional change intended here.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/intc/arm_gicv3.c                |   1 +
 hw/intc/arm_gicv3_cpuif.c          | 262 ++++++++++++++---------------
 hw/intc/arm_gicv3_cpuif_common.c   |  11 ++
 hw/intc/arm_gicv3_kvm.c            |  12 +-
 hw/intc/gicv3_internal.h           |   1 +
 include/hw/intc/arm_gicv3_common.h |   1 +
 6 files changed, 150 insertions(+), 138 deletions(-)

diff --git a/hw/intc/arm_gicv3.c b/hw/intc/arm_gicv3.c
index 6059ce926a..8ca61413d2 100644
--- a/hw/intc/arm_gicv3.c
+++ b/hw/intc/arm_gicv3.c
@@ -459,6 +459,7 @@ static void arm_gicv3_class_init(ObjectClass *klass, const void *data)
     ARMGICv3Class *agc = ARM_GICV3_CLASS(klass);
 
     agcc->post_load = arm_gicv3_post_load;
+    agcc->init_cpu_reginfo = gicv3_init_cpu_reginfo;
     device_class_set_parent_realize(dc, arm_gic_realize, &agc->parent_realize);
 }
 
diff --git a/hw/intc/arm_gicv3_cpuif.c b/hw/intc/arm_gicv3_cpuif.c
index 4b4cf09157..a7904237ac 100644
--- a/hw/intc/arm_gicv3_cpuif.c
+++ b/hw/intc/arm_gicv3_cpuif.c
@@ -3016,154 +3016,150 @@ static void gicv3_cpuif_el_change_hook(ARMCPU *cpu, void *opaque)
     gicv3_cpuif_virt_irq_fiq_update(cs);
 }
 
-void gicv3_init_cpuif(GICv3State *s)
+void gicv3_init_cpu_reginfo(CPUState *cs)
 {
     /* Called from the GICv3 realize function; register our system
      * registers with the CPU
      */
-    int i;
-
-    for (i = 0; i < s->num_cpu; i++) {
-        ARMCPU *cpu = ARM_CPU(qemu_get_cpu(i));
-        GICv3CPUState *cs = &s->cpu[i];
+    ARMCPU *cpu = ARM_CPU(cs);
+    GICv3CPUState *gcs = icc_cs_from_env(&cpu->env);
 
-        /*
-         * If the CPU doesn't define a GICv3 configuration, probably because
-         * in real hardware it doesn't have one, then we use default values
-         * matching the one used by most Arm CPUs. This applies to:
-         *  cpu->gic_num_lrs
-         *  cpu->gic_vpribits
-         *  cpu->gic_vprebits
-         *  cpu->gic_pribits
-         */
+    /*
+     * If the CPU doesn't define a GICv3 configuration, probably because
+     * in real hardware it doesn't have one, then we use default values
+     * matching the one used by most Arm CPUs. This applies to:
+     *  cpu->gic_num_lrs
+     *  cpu->gic_vpribits
+     *  cpu->gic_vprebits
+     *  cpu->gic_pribits
+     */
 
-        /* Note that we can't just use the GICv3CPUState as an opaque pointer
-         * in define_arm_cp_regs_with_opaque(), because when we're called back
-         * it might be with code translated by CPU 0 but run by CPU 1, in
-         * which case we'd get the wrong value.
-         * So instead we define the regs with no ri->opaque info, and
-         * get back to the GICv3CPUState from the CPUARMState.
-         *
-         * These CP regs callbacks can be called from either TCG or HVF code.
-         */
-        define_arm_cp_regs(cpu, gicv3_cpuif_reginfo);
+    /* Note that we can't just use the GICv3CPUState as an opaque pointer
+     * in define_arm_cp_regs_with_opaque(), because when we're called back
+     * it might be with code translated by CPU 0 but run by CPU 1, in
+     * which case we'd get the wrong value.
+     * So instead we define the regs with no ri->opaque info, and
+     * get back to the GICv3CPUState from the CPUARMState.
+     *
+     * These CP regs callbacks can be called from either TCG or HVF code.
+     */
+    define_arm_cp_regs(cpu, gicv3_cpuif_reginfo);
 
-        /*
-         * If the CPU implements FEAT_NMI and FEAT_GICv3 it must also
-         * implement FEAT_GICv3_NMI, which is the CPU interface part
-         * of NMI support. This is distinct from whether the GIC proper
-         * (redistributors and distributor) have NMI support. In QEMU
-         * that is a property of the GIC device in s->nmi_support;
-         * cs->nmi_support indicates the CPU interface's support.
-         */
-        if (cpu_isar_feature(aa64_nmi, cpu)) {
-            cs->nmi_support = true;
-            define_arm_cp_regs(cpu, gicv3_cpuif_gicv3_nmi_reginfo);
-        }
+    /*
+     * If the CPU implements FEAT_NMI and FEAT_GICv3 it must also
+     * implement FEAT_GICv3_NMI, which is the CPU interface part
+     * of NMI support. This is distinct from whether the GIC proper
+     * (redistributors and distributor) have NMI support. In QEMU
+     * that is a property of the GIC device in s->nmi_support;
+     * gcs->nmi_support indicates the CPU interface's support.
+     */
+    if (cpu_isar_feature(aa64_nmi, cpu)) {
+        gcs->nmi_support = true;
+        define_arm_cp_regs(cpu, gicv3_cpuif_gicv3_nmi_reginfo);
+    }
 
-        /*
-         * The CPU implementation specifies the number of supported
-         * bits of physical priority. For backwards compatibility
-         * of migration, we have a compat property that forces use
-         * of 8 priority bits regardless of what the CPU really has.
-         */
-        if (s->force_8bit_prio) {
-            cs->pribits = 8;
-        } else {
-            cs->pribits = cpu->gic_pribits ?: 5;
-        }
+    /*
+     * The CPU implementation specifies the number of supported
+     * bits of physical priority. For backwards compatibility
+     * of migration, we have a compat property that forces use
+     * of 8 priority bits regardless of what the CPU really has.
+     */
+    if (gcs->gic->force_8bit_prio) {
+        gcs->pribits = 8;
+    } else {
+        gcs->pribits = cpu->gic_pribits ?: 5;
+    }
 
-        /*
-         * The GICv3 has separate ID register fields for virtual priority
-         * and preemption bit values, but only a single ID register field
-         * for the physical priority bits. The preemption bit count is
-         * always the same as the priority bit count, except that 8 bits
-         * of priority means 7 preemption bits. We precalculate the
-         * preemption bits because it simplifies the code and makes the
-         * parallels between the virtual and physical bits of the GIC
-         * a bit clearer.
-         */
-        cs->prebits = cs->pribits;
-        if (cs->prebits == 8) {
-            cs->prebits--;
-        }
-        /*
-         * Check that CPU code defining pribits didn't violate
-         * architectural constraints our implementation relies on.
-         */
-        g_assert(cs->pribits >= 4 && cs->pribits <= 8);
+    /*
+     * The GICv3 has separate ID register fields for virtual priority
+     * and preemption bit values, but only a single ID register field
+     * for the physical priority bits. The preemption bit count is
+     * always the same as the priority bit count, except that 8 bits
+     * of priority means 7 preemption bits. We precalculate the
+     * preemption bits because it simplifies the code and makes the
+     * parallels between the virtual and physical bits of the GIC
+     * a bit clearer.
+     */
+    gcs->prebits = gcs->pribits;
+    if (gcs->prebits == 8) {
+        gcs->prebits--;
+    }
+    /*
+     * Check that CPU code defining pribits didn't violate
+     * architectural constraints our implementation relies on.
+     */
+    g_assert(gcs->pribits >= 4 && gcs->pribits <= 8);
 
-        /*
-         * gicv3_cpuif_reginfo[] defines ICC_AP*R0_EL1; add definitions
-         * for ICC_AP*R{1,2,3}_EL1 if the prebits value requires them.
-         */
-        if (cs->prebits >= 6) {
-            define_arm_cp_regs(cpu, gicv3_cpuif_icc_apxr1_reginfo);
-        }
-        if (cs->prebits == 7) {
-            define_arm_cp_regs(cpu, gicv3_cpuif_icc_apxr23_reginfo);
-        }
+    /*
+     * gicv3_cpuif_reginfo[] defines ICC_AP*R0_EL1; add definitions
+     * for ICC_AP*R{1,2,3}_EL1 if the prebits value requires them.
+     */
+    if (gcs->prebits >= 6) {
+        define_arm_cp_regs(cpu, gicv3_cpuif_icc_apxr1_reginfo);
+    }
+    if (gcs->prebits == 7) {
+        define_arm_cp_regs(cpu, gicv3_cpuif_icc_apxr23_reginfo);
+    }
 
-        if (arm_feature(&cpu->env, ARM_FEATURE_EL2)) {
-            int j;
+    if (arm_feature(&cpu->env, ARM_FEATURE_EL2)) {
+        int j;
 
-            cs->num_list_regs = cpu->gic_num_lrs ?: 4;
-            cs->vpribits = cpu->gic_vpribits ?: 5;
-            cs->vprebits = cpu->gic_vprebits ?: 5;
+        gcs->num_list_regs = cpu->gic_num_lrs ?: 4;
+        gcs->vpribits = cpu->gic_vpribits ?: 5;
+        gcs->vprebits = cpu->gic_vprebits ?: 5;
 
-            /* Check against architectural constraints: getting these
-             * wrong would be a bug in the CPU code defining these,
-             * and the implementation relies on them holding.
-             */
-            g_assert(cs->vprebits <= cs->vpribits);
-            g_assert(cs->vprebits >= 5 && cs->vprebits <= 7);
-            g_assert(cs->vpribits >= 5 && cs->vpribits <= 8);
+        /* Check against architectural constraints: getting these
+         * wrong would be a bug in the CPU code defining these,
+         * and the implementation relies on them holding.
+         */
+        g_assert(gcs->vprebits <= gcs->vpribits);
+        g_assert(gcs->vprebits >= 5 && gcs->vprebits <= 7);
+        g_assert(gcs->vpribits >= 5 && gcs->vpribits <= 8);
 
-            define_arm_cp_regs(cpu, gicv3_cpuif_hcr_reginfo);
+        define_arm_cp_regs(cpu, gicv3_cpuif_hcr_reginfo);
 
-            for (j = 0; j < cs->num_list_regs; j++) {
-                /* Note that the AArch64 LRs are 64-bit; the AArch32 LRs
-                 * are split into two cp15 regs, LR (the low part, with the
-                 * same encoding as the AArch64 LR) and LRC (the high part).
-                 */
-                ARMCPRegInfo lr_regset[] = {
-                    { .name = "ICH_LRn_EL2", .state = ARM_CP_STATE_BOTH,
-                      .opc0 = 3, .opc1 = 4, .crn = 12,
-                      .crm = 12 + (j >> 3), .opc2 = j & 7,
-                      .type = ARM_CP_IO | ARM_CP_NO_RAW,
-                      .nv2_redirect_offset = 0x400 + 8 * j,
-                      .access = PL2_RW,
-                      .readfn = ich_lr_read,
-                      .writefn = ich_lr_write,
-                    },
-                    { .name = "ICH_LRCn_EL2", .state = ARM_CP_STATE_AA32,
-                      .cp = 15, .opc1 = 4, .crn = 12,
-                      .crm = 14 + (j >> 3), .opc2 = j & 7,
-                      .type = ARM_CP_IO | ARM_CP_NO_RAW,
-                      .access = PL2_RW,
-                      .readfn = ich_lr_read,
-                      .writefn = ich_lr_write,
-                    },
-                };
-                define_arm_cp_regs(cpu, lr_regset);
-            }
-            if (cs->vprebits >= 6) {
-                define_arm_cp_regs(cpu, gicv3_cpuif_ich_apxr1_reginfo);
-            }
-            if (cs->vprebits == 7) {
-                define_arm_cp_regs(cpu, gicv3_cpuif_ich_apxr23_reginfo);
-            }
-        }
-        if (tcg_enabled() || qtest_enabled()) {
-            /*
-             * We can only trap EL changes with TCG. However the GIC interrupt
-             * state only changes on EL changes involving EL2 or EL3, so for
-             * the non-TCG case this is OK, as EL2 and EL3 can't exist.
+        for (j = 0; j < gcs->num_list_regs; j++) {
+            /* Note that the AArch64 LRs are 64-bit; the AArch32 LRs
+             * are split into two cp15 regs, LR (the low part, with the
+             * same encoding as the AArch64 LR) and LRC (the high part).
              */
-            arm_register_el_change_hook(cpu, gicv3_cpuif_el_change_hook, cs);
-        } else {
-            assert(!arm_feature(&cpu->env, ARM_FEATURE_EL2));
-            assert(!arm_feature(&cpu->env, ARM_FEATURE_EL3));
-        }
+            ARMCPRegInfo lr_regset[] = {
+                { .name = "ICH_LRn_EL2", .state = ARM_CP_STATE_BOTH,
+                  .opc0 = 3, .opc1 = 4, .crn = 12,
+                  .crm = 12 + (j >> 3), .opc2 = j & 7,
+                  .type = ARM_CP_IO | ARM_CP_NO_RAW,
+                  .nv2_redirect_offset = 0x400 + 8 * j,
+                  .access = PL2_RW,
+                  .readfn = ich_lr_read,
+                  .writefn = ich_lr_write,
+                },
+                { .name = "ICH_LRCn_EL2", .state = ARM_CP_STATE_AA32,
+                  .cp = 15, .opc1 = 4, .crn = 12,
+                  .crm = 14 + (j >> 3), .opc2 = j & 7,
+                  .type = ARM_CP_IO | ARM_CP_NO_RAW,
+                  .access = PL2_RW,
+                  .readfn = ich_lr_read,
+                  .writefn = ich_lr_write,
+                },
+            };
+            define_arm_cp_regs(cpu, lr_regset);
+        }
+        if (gcs->vprebits >= 6) {
+            define_arm_cp_regs(cpu, gicv3_cpuif_ich_apxr1_reginfo);
+        }
+        if (gcs->vprebits == 7) {
+            define_arm_cp_regs(cpu, gicv3_cpuif_ich_apxr23_reginfo);
+        }
+    }
+    if (tcg_enabled() || qtest_enabled()) {
+        /*
+         * We can only trap EL changes with TCG. However the GIC interrupt
+         * state only changes on EL changes involving EL2 or EL3, so for
+         * the non-TCG case this is OK, as EL2 and EL3 can't exist.
+         */
+        arm_register_el_change_hook(cpu, gicv3_cpuif_el_change_hook, gcs);
+    } else {
+        assert(!arm_feature(&cpu->env, ARM_FEATURE_EL2));
+        assert(!arm_feature(&cpu->env, ARM_FEATURE_EL3));
     }
 }
diff --git a/hw/intc/arm_gicv3_cpuif_common.c b/hw/intc/arm_gicv3_cpuif_common.c
index ff1239f65d..f9a9b2d8a3 100644
--- a/hw/intc/arm_gicv3_cpuif_common.c
+++ b/hw/intc/arm_gicv3_cpuif_common.c
@@ -20,3 +20,14 @@ void gicv3_set_gicv3state(CPUState *cpu, GICv3CPUState *s)
 
     env->gicv3state = (void *)s;
 };
+
+void gicv3_init_cpuif(GICv3State *s)
+{
+    ARMGICv3CommonClass *agcc = ARM_GICV3_COMMON_GET_CLASS(s);
+    int i;
+
+    /* Define and register system registers with the vCPU */
+    for (i = 0; i < s->num_cpu; i++) {
+        agcc->init_cpu_reginfo(s->cpu[i].cpu);
+    }
+}
diff --git a/hw/intc/arm_gicv3_kvm.c b/hw/intc/arm_gicv3_kvm.c
index 6166283cd1..4ca889da45 100644
--- a/hw/intc/arm_gicv3_kvm.c
+++ b/hw/intc/arm_gicv3_kvm.c
@@ -776,6 +776,10 @@ static void vm_change_state_handler(void *opaque, bool running,
     }
 }
 
+static void kvm_gicv3_init_cpu_reginfo(CPUState *cs)
+{
+    define_arm_cp_regs(ARM_CPU(cs), gicv3_cpuif_reginfo);
+}
 
 static void kvm_arm_gicv3_realize(DeviceState *dev, Error **errp)
 {
@@ -811,11 +815,8 @@ static void kvm_arm_gicv3_realize(DeviceState *dev, Error **errp)
 
     gicv3_init_irqs_and_mmio(s, kvm_arm_gicv3_set_irq, NULL);
 
-    for (i = 0; i < s->num_cpu; i++) {
-        ARMCPU *cpu = ARM_CPU(qemu_get_cpu(i));
-
-        define_arm_cp_regs(cpu, gicv3_cpuif_reginfo);
-    }
+    /* initialize vCPU interface */
+    gicv3_init_cpuif(s);
 
     /* Try to create the device via the device control API */
     s->dev_fd = kvm_create_device(kvm_state, KVM_DEV_TYPE_ARM_VGIC_V3, false);
@@ -929,6 +930,7 @@ static void kvm_arm_gicv3_class_init(ObjectClass *klass, const void *data)
 
     agcc->pre_save = kvm_arm_gicv3_get;
     agcc->post_load = kvm_arm_gicv3_put;
+    agcc->init_cpu_reginfo = kvm_gicv3_init_cpu_reginfo;
     device_class_set_parent_realize(dc, kvm_arm_gicv3_realize,
                                     &kgc->parent_realize);
     resettable_class_set_parent_phases(rc, NULL, kvm_arm_gicv3_reset_hold, NULL,
diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index bc9f518fe8..cc8edc499b 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -722,6 +722,7 @@ void gicv3_redist_vinvall(GICv3CPUState *cs, uint64_t vptaddr);
 
 void gicv3_redist_send_sgi(GICv3CPUState *cs, int grp, int irq, bool ns);
 void gicv3_init_cpuif(GICv3State *s);
+void gicv3_init_cpu_reginfo(CPUState *cs);
 
 /**
  * gicv3_cpuif_update:
diff --git a/include/hw/intc/arm_gicv3_common.h b/include/hw/intc/arm_gicv3_common.h
index c18503869f..3720728227 100644
--- a/include/hw/intc/arm_gicv3_common.h
+++ b/include/hw/intc/arm_gicv3_common.h
@@ -313,6 +313,7 @@ struct ARMGICv3CommonClass {
 
     void (*pre_save)(GICv3State *s);
     void (*post_load)(GICv3State *s);
+    void (*init_cpu_reginfo)(CPUState *cs);
 };
 
 void gicv3_init_irqs_and_mmio(GICv3State *s, qemu_irq_handler handler,
-- 
2.34.1




* [PATCH RFC V6 08/24] arm/virt, gicv3: Guard CPU interface access for admin disabled vCPUs
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (6 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 07/24] arm/gicv3: Refactor CPU interface init for shared TCG/KVM use salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-24  4:07   ` Gavin Shan
  2025-10-01  1:01 ` [PATCH RFC V6 09/24] hw/intc/arm_gicv3_common: Migrate & check 'GICv3CPUState' accessibility mismatch salil.mehta
                   ` (18 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

Per Arm GIC Architecture Specification (IHI0069H_b, §11.1), the CPU interface
and its Processing Element (PE) share a power domain. If the PE is powered down
or administratively disabled, the CPU interface must be quiescent or off, and
any access is architecturally UNPREDICTABLE. Without explicit checks, QEMU may
issue GICC register operations for vCPUs that are offline, removed, or
otherwise unavailable, risking inconsistent state or undefined behavior in both
TCG and KVM accelerators.

To address this, introduce a per-vCPU gicc_accessible flag that reflects the
administrative enablement of the corresponding QOM vCPU in accordance with the
policy. This is permissible when the GICC (GIC CPU Interface) is online-capable,
meaning vCPUs can be brought online in the guest kernel after boot. The flag is
set during GIC realization and used to skip VGIC register reads/writes, SGI
generation, and CPU interface updates when the GICC is not accessible. This
prevents unsafe operations and ensures compliance when managing administratively
disabled but present vCPUs.
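
The guard pattern can be sketched as below. This is a minimal illustration
only: the struct and helper names are hypothetical, a single register stands in
for the whole CPU interface state, and only the flag name gicc_accessible comes
from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced model of a per-vCPU GIC CPU interface. */
struct cpuif {
    bool gicc_accessible;   /* mirrors admin enablement of the vCPU */
    uint64_t ctlr;          /* stand-in for any CPU interface register */
};

/*
 * Hypothetical guarded write: returns true if the register was written,
 * false if the access was skipped because the GICC is not accessible.
 */
static bool cpuif_write_ctlr(struct cpuif *c, uint64_t val)
{
    if (!c->gicc_accessible) {
        return false;       /* skip: the PE is administratively disabled */
    }
    c->ctlr = val;
    return true;
}
```

Callers treat a skipped access as a no-op, so state for a disabled vCPU is
left untouched until the flag is set again.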

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/core/qdev.c                     | 26 +++++++++++++++++
 hw/intc/arm_gicv3_common.c         | 23 +++++++++++++++
 hw/intc/arm_gicv3_cpuif.c          |  8 +++++
 hw/intc/arm_gicv3_cpuif_common.c   | 47 ++++++++++++++++++++++++++++++
 hw/intc/arm_gicv3_kvm.c            | 18 ++++++++++++
 include/hw/intc/arm_gicv3_common.h | 24 +++++++++++++++
 include/hw/qdev-core.h             | 24 +++++++++++++++
 7 files changed, 170 insertions(+)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 5816abae39..8e9a4da6b5 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -326,6 +326,32 @@ bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp)
                                    errp);
 }
 
+int qdev_get_admin_power_state(DeviceState *dev)
+{
+    DeviceClass *dc;
+
+    if (!dev) {
+        return DEVICE_ADMIN_POWER_STATE_REMOVED;
+    }
+
+    dc = DEVICE_GET_CLASS(dev);
+    if (dc->admin_power_state_supported) {
+        return object_property_get_enum(OBJECT(dev), "admin_power_state",
+                                        "DeviceAdminPowerState", NULL);
+    }
+
+    return DEVICE_ADMIN_POWER_STATE_ENABLED;
+}
+
+bool qdev_check_enabled(DeviceState *dev)
+{
+    /*
+     * If the device supports administrative power-state transitions, report
+     * it as enabled only when it is in the 'enabled' state.
+     */
+    return qdev_get_admin_power_state(dev) == DEVICE_ADMIN_POWER_STATE_ENABLED;
+}
+
 bool qdev_machine_modified(void)
 {
     return qdev_hot_added || qdev_hot_removed;
diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index f6a9f1c68b..f4428ad165 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -439,6 +439,29 @@ static void arm_gicv3_common_realize(DeviceState *dev, Error **errp)
         CPUState *cpu = machine_get_possible_cpu(i);
         uint64_t cpu_affid;
 
+        /*
+         * Ref: Arm Generic Interrupt Controller Architecture Specification
+         * (GIC Architecture version 3 and version 4), IHI0069H_b,
+         * Section 11.1: Power Management
+         * https://developer.arm.com/documentation/ihi0069
+         *
+         * According to this specification, the CPU interface and the
+         * Processing Element (PE) must reside in the same power domain.
+         * Therefore, when a CPU/PE is powered off, its corresponding CPU
+         * interface must also be in the off state or in a quiescent state—
+         * depending on the state of the associated Redistributor.
+         *
+         * The Redistributor may reside in a separate power domain and may
+         * remain powered even when the associated PE is turned off.
+         *
+         * Accessing the GIC CPU interface while the PE is powered down can
+         * lead to UNPREDICTABLE behavior.
+         *
+         * Accordingly, the QOM object `GICv3CPUState` should be marked as
+         * either accessible or inaccessible based on the power state of the
+         * associated `CPUState` vCPU.
+         */
+        s->cpu[i].gicc_accessible = qdev_check_enabled(DEVICE(cpu));
         s->cpu[i].cpu = cpu;
         s->cpu[i].gic = s;
         /* Store GICv3CPUState in CPUARMState gicv3state pointer */
diff --git a/hw/intc/arm_gicv3_cpuif.c b/hw/intc/arm_gicv3_cpuif.c
index a7904237ac..6430b2c649 100644
--- a/hw/intc/arm_gicv3_cpuif.c
+++ b/hw/intc/arm_gicv3_cpuif.c
@@ -1052,6 +1052,10 @@ void gicv3_cpuif_update(GICv3CPUState *cs)
     ARMCPU *cpu = ARM_CPU(cs->cpu);
     CPUARMState *env = &cpu->env;
 
+    if (!gicv3_gicc_accessible(OBJECT(cs->gic), CPU(cpu)->cpu_index)) {
+        return;
+    }
+
     g_assert(bql_locked());
 
     trace_gicv3_cpuif_update(gicv3_redist_affid(cs), cs->hppi.irq,
@@ -2036,6 +2040,10 @@ static void icc_generate_sgi(CPUARMState *env, GICv3CPUState *cs,
     for (i = 0; i < s->num_cpu; i++) {
         GICv3CPUState *ocs = &s->cpu[i];
 
+        if (!gicv3_gicc_accessible(OBJECT(s), i)) {
+            continue;
+        }
+
         if (irm) {
             /* IRM == 1 : route to all CPUs except self */
             if (cs == ocs) {
diff --git a/hw/intc/arm_gicv3_cpuif_common.c b/hw/intc/arm_gicv3_cpuif_common.c
index f9a9b2d8a3..8f9a5b6fa2 100644
--- a/hw/intc/arm_gicv3_cpuif_common.c
+++ b/hw/intc/arm_gicv3_cpuif_common.c
@@ -12,6 +12,9 @@
 #include "qemu/osdep.h"
 #include "gicv3_internal.h"
 #include "cpu.h"
+#include "qemu/log.h"
+#include "monitor/monitor.h"
+#include "qapi/visitor.h"
 
 void gicv3_set_gicv3state(CPUState *cpu, GICv3CPUState *s)
 {
@@ -21,6 +24,41 @@ void gicv3_set_gicv3state(CPUState *cpu, GICv3CPUState *s)
     env->gicv3state = (void *)s;
 };
 
+static void
+gicv3_get_gicc_accessibility(Object *obj, Visitor *v, const char *name,
+                             void *opaque, Error **errp)
+{
+    GICv3CPUState *cs = (GICv3CPUState *)opaque;
+    bool value = cs->gicc_accessible;
+
+    visit_type_bool(v, name, &value, errp);
+}
+
+static void
+gicv3_set_gicc_accessibility(Object *obj, Visitor *v, const char *name,
+                             void *opaque, Error **errp)
+{
+    GICv3CPUState *gcs = opaque;
+    CPUState *cs = gcs->cpu;
+    bool value;
+
+    visit_type_bool(v, name, &value, errp);
+
+    /* Block external attempts to set */
+    if (monitor_cur_is_qmp()) {
+        error_setg(errp, "Property 'gicc-accessible' is read-only externally");
+        return;
+    }
+
+    if (gcs->gicc_accessible != value) {
+        gcs->gicc_accessible = value;
+
+        qemu_log_mask(LOG_UNIMP,
+                      "GICC accessibility changed: vCPU %d = %s\n",
+                      cs->cpu_index, value ? "accessible" : "inaccessible");
+    }
+}
+
 void gicv3_init_cpuif(GICv3State *s)
 {
     ARMGICv3CommonClass *agcc = ARM_GICV3_COMMON_GET_CLASS(s);
@@ -28,6 +66,15 @@ void gicv3_init_cpuif(GICv3State *s)
 
     /* define and register `system registers` with the vCPU  */
     for (i = 0; i < s->num_cpu; i++) {
+        g_autofree char *propname = g_strdup_printf("gicc-accessible[%d]", i);
+        object_property_add(OBJECT(s), propname, "bool",
+                            gicv3_get_gicc_accessibility,
+                            gicv3_set_gicc_accessibility,
+                            NULL, &s->cpu[i]);
+
+        object_property_set_description(OBJECT(s), propname,
+            "Per-vCPU GICC interface accessibility (internal set only)");
+
         agcc->init_cpu_reginfo(s->cpu[i].cpu);
     }
 }
diff --git a/hw/intc/arm_gicv3_kvm.c b/hw/intc/arm_gicv3_kvm.c
index 4ca889da45..e97578f59a 100644
--- a/hw/intc/arm_gicv3_kvm.c
+++ b/hw/intc/arm_gicv3_kvm.c
@@ -457,6 +457,16 @@ static void kvm_arm_gicv3_put(GICv3State *s)
         GICv3CPUState *c = &s->cpu[ncpu];
         int num_pri_bits;
 
+        /*
+         * We must ensure that we do not attempt to access or update KVM GICC
+         * registers if their corresponding QOM `GICv3CPUState` is marked as
+         * 'inaccessible', because their corresponding QOM vCPU objects
+         * are in administratively 'disabled' state.
+         */
+        if (!gicv3_gicc_accessible(OBJECT(s), ncpu)) {
+            continue;
+        }
+
         kvm_gicc_access(s, ICC_SRE_EL1, ncpu, &c->icc_sre_el1, true);
         kvm_gicc_access(s, ICC_CTLR_EL1, ncpu,
                         &c->icc_ctlr_el1[GICV3_NS], true);
@@ -615,6 +625,14 @@ static void kvm_arm_gicv3_get(GICv3State *s)
         GICv3CPUState *c = &s->cpu[ncpu];
         int num_pri_bits;
 
+        /*
+         * don't attempt to access KVM VGIC for the disabled vCPUs where
+         * GICv3CPUState is inaccessible.
+         */
+        if (!gicv3_gicc_accessible(OBJECT(s), ncpu)) {
+            continue;
+        }
+
         kvm_gicc_access(s, ICC_SRE_EL1, ncpu, &c->icc_sre_el1, false);
         kvm_gicc_access(s, ICC_CTLR_EL1, ncpu,
                         &c->icc_ctlr_el1[GICV3_NS], false);
diff --git a/include/hw/intc/arm_gicv3_common.h b/include/hw/intc/arm_gicv3_common.h
index 3720728227..bbf899184e 100644
--- a/include/hw/intc/arm_gicv3_common.h
+++ b/include/hw/intc/arm_gicv3_common.h
@@ -27,6 +27,7 @@
 #include "hw/sysbus.h"
 #include "hw/intc/arm_gic_common.h"
 #include "qom/object.h"
+#include "qapi/error.h"
 
 /*
  * Maximum number of possible interrupts, determined by the GIC architecture.
@@ -164,6 +165,7 @@ struct GICv3CPUState {
     uint64_t icc_apr[3][4];
     uint64_t icc_igrpen[3];
     uint64_t icc_ctlr_el3;
+    bool gicc_accessible;
 
     /* Virtualization control interface */
     uint64_t ich_apr[3][4]; /* ich_apr[GICV3_G1][x] never used */
@@ -329,4 +331,26 @@ void gicv3_init_irqs_and_mmio(GICv3State *s, qemu_irq_handler handler,
  */
 const char *gicv3_class_name(void);
 
+/**
+ * gicv3_gicc_accessible:
+ * @obj: QOM object implementing the GICv3 device
+ * @cpu: Index of the vCPU whose GICC accessibility is being queried
+ *
+ * Returns: true if the GICC interface for vCPU @cpu is accessible.
+ * Uses QOM property lookup for "gicc-accessible[%d]".
+ */
+static inline bool gicv3_gicc_accessible(Object *obj, int cpu)
+{
+    g_autofree gchar *propname = g_strdup_printf("gicc-accessible[%d]", cpu);
+    Error *local_err = NULL;
+    bool value;
+
+    value = object_property_get_bool(obj, propname, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return false;
+    }
+
+    return value;
+}
 #endif
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 2c22b32a3f..b1d3fa4a25 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -589,6 +589,30 @@ bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp);
  */
 bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp);
 
+/**
+ * qdev_check_enabled - Check if a device is administratively enabled
+ * @dev:  The device to check
+ *
+ * This function returns whether the device is currently in administrative
+ * ENABLED state. It does not reflect runtime operational power state, but
+ * rather the host policy on whether the guest may interact with the device.
+ *
+ * Returns true if the device is administratively enabled; false otherwise.
+ */
+bool qdev_check_enabled(DeviceState *dev);
+
+/**
+ * qdev_get_admin_power_state - Query administrative power state of a device
+ * @dev:  The device whose state is being queried
+ *
+ * Returns the current administrative power state as host-level policy, not
+ * the operational runtime state seen by the guest. A NULL @dev is reported
+ * as REMOVED; a device without admin power-state support as ENABLED.
+ *
+ * Returns an integer from the DeviceAdminPowerState enum.
+ */
+int qdev_get_admin_power_state(DeviceState *dev);
+
 /**
  * qdev_unrealize: Unrealize a device
  * @dev: device to unrealize
-- 
2.34.1




* [PATCH RFC V6 09/24] hw/intc/arm_gicv3_common: Migrate & check 'GICv3CPUState' accessibility mismatch
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (7 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 08/24] arm/virt, gicv3: Guard CPU interface access for admin disabled vCPUs salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 10/24] arm/virt: Init PMU at host for all present vCPUs salil.mehta
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

At the source, administratively disabled vCPUs may lack a CPU VMSD: either they
were never realized (never enabled even once), or they were realized and later
disabled, causing the VMSD to be unregistered. Such vCPUs are not migrated as
CPU devices. However, the GICv3CPUState for all vCPUs is still migrated to the
destination VM and must be checked for mismatches in CPU interface
accessibility.

To preserve correctness, migrate the per-vCPU `gicc_accessible` bit as part of
the GICv3 device state, and fail migration on load if a mismatch is detected.
Administrators must ensure that the number of possible vCPUs and the number of
administratively disabled vCPUs remain consistent across hosts.

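Changes:
 - Add `VMSTATE_BOOL(gicc_accessible, GICv3CPUState)` to the per-vCPU GICv3
   state.
 - Add a per-vCPU `post_load` hook that fails the load when the migrated GIC
   CPU interface accessibility disagrees with the destination vCPU's
   administrative state.
 - Fail `gicv3_post_load()` when the source `num_cpu` differs from the
   destination `maxcpus`.
 - Add `qdev_enable()` as the counterpart of `qdev_disable()`.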
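
The load-time check can be sketched as a standalone model. This is only an
illustration under stated assumptions: `struct vcpu_record` and
`check_admin_state()` are hypothetical names, and the real hook runs once per
vCPU from the VMState machinery rather than over an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one vCPU as seen on the destination after load. */
struct vcpu_record {
    bool src_gicc_accessible; /* value carried in the migration stream */
    bool dst_admin_enabled;   /* admin state configured on the destination */
};

/*
 * Return 0 if source and destination agree for every vCPU. Otherwise
 * return -1 and store the index of the first mismatching vCPU in *bad_idx.
 */
static int check_admin_state(const struct vcpu_record *v, size_t n,
                             size_t *bad_idx)
{
    assert(v && bad_idx);
    for (size_t i = 0; i < n; i++) {
        if (v[i].src_gicc_accessible != v[i].dst_admin_enabled) {
            *bad_idx = i;
            return -1;   /* a non-zero result fails the migration */
        }
    }
    return 0;
}
```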

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/core/qdev.c             | 17 +++++++++++++++++
 hw/intc/arm_gicv3_common.c | 37 +++++++++++++++++++++++++++++++++++++
 include/hw/qdev-core.h     | 15 +++++++++++++++
 3 files changed, 69 insertions(+)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 8e9a4da6b5..23b84a7756 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -326,6 +326,23 @@ bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp)
                                    errp);
 }
 
+bool qdev_enable(DeviceState *dev, BusState *bus, Error **errp)
+{
+    g_assert(dev);
+
+    if (bus) {
+        error_setg(errp, "Device %s does not support 'enable' operation",
+                   object_get_typename(OBJECT(dev)));
+        return false;
+    }
+
+    /* devices like cpu don't have bus */
+    g_assert(!DEVICE_GET_CLASS(dev)->bus_type);
+
+    return object_property_set_str(OBJECT(dev), "admin_power_state", "enabled",
+                                    errp);
+}
+
 int qdev_get_admin_power_state(DeviceState *dev)
 {
     DeviceClass *dc;
diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index f4428ad165..9139352330 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -84,6 +84,15 @@ static int gicv3_post_load(void *opaque, int version_id)
 {
     GICv3State *s = (GICv3State *)opaque;
     ARMGICv3CommonClass *c = ARM_GICV3_COMMON_GET_CLASS(s);
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    /* ensure source and destination VM 'maxcpu' count matches */
+    if (s->num_cpu != ms->smp.max_cpus) {
+        error_report("GICv3: source num_cpu(%u) != dest maxcpus(%u). "
+                     "Launch dest with -smp maxcpus=%u",
+                     s->num_cpu, ms->smp.max_cpus, s->num_cpu);
+        return -1;
+    }
 
     gicv3_gicd_no_migration_shift_bug_post_load(s);
 
@@ -127,6 +136,32 @@ static int vmstate_gicv3_cpu_pre_load(void *opaque)
     return 0;
 }
 
+static int vmstate_gicv3_cpu_post_load(void *opaque, int version_id)
+{
+    bool src_enabled, dst_enabled;
+    GICv3CPUState *gcs = opaque;
+    CPUState *cs = gcs->cpu;
+
+    if (!cs) {
+        return 0;
+    }
+
+    /* we derive the source vCPU admin state via GIC CPU Interface */
+    src_enabled = gicv3_gicc_accessible(OBJECT(gcs->gic), cs->cpu_index);
+    dst_enabled = qdev_check_enabled(DEVICE(cs));
+
+    if (dst_enabled != src_enabled) {
+        error_report("GICv3: CPU %d admin-state mismatch: dst=%s, src=%s;"
+                     " Aborting!", cs->cpu_index,
+                     dst_enabled ? "enabled" : "disabled",
+                     src_enabled ? "enabled" : "disabled");
+
+        return -1;
+    }
+
+    return 0;
+}
+
 static bool icc_sre_el1_reg_needed(void *opaque)
 {
     GICv3CPUState *cs = opaque;
@@ -187,6 +222,7 @@ static const VMStateDescription vmstate_gicv3_cpu = {
     .version_id = 1,
     .minimum_version_id = 1,
     .pre_load = vmstate_gicv3_cpu_pre_load,
+    .post_load = vmstate_gicv3_cpu_post_load,
     .fields = (const VMStateField[]) {
         VMSTATE_UINT32(level, GICv3CPUState),
         VMSTATE_UINT32(gicr_ctlr, GICv3CPUState),
@@ -208,6 +244,7 @@ static const VMStateDescription vmstate_gicv3_cpu = {
         VMSTATE_UINT64_2DARRAY(icc_apr, GICv3CPUState, 3, 4),
         VMSTATE_UINT64_ARRAY(icc_igrpen, GICv3CPUState, 3),
         VMSTATE_UINT64(icc_ctlr_el3, GICv3CPUState),
+        VMSTATE_BOOL(gicc_accessible, GICv3CPUState),
         VMSTATE_END_OF_LIST()
     },
     .subsections = (const VMStateDescription * const []) {
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index b1d3fa4a25..855ff865ba 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -589,6 +589,21 @@ bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp);
  */
 bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp);
 
+/**
+ * qdev_enable - Power on and administratively enable a device
+ * @dev:   The device to be powered on and administratively enabled
+ * @bus:   The bus the device is connected to; must be NULL (e.g. for CPUs)
+ * @errp:  Pointer to a location where an error can be reported
+ *
+ * This function performs both administrative and operational power-on of
+ * the specified device. It transitions the device into ENABLED state and
+ * restores runtime availability. If applicable, the device is also re-added
+ * to the migration stream.
+ *
+ * Returns true if the operation succeeds; false otherwise, with @errp set.
+ */
+bool qdev_enable(DeviceState *dev, BusState *bus, Error **errp);
+
 /**
  * qdev_check_enabled - Check if a device is administratively enabled
  * @dev:  The device to check
-- 
2.34.1




* [PATCH RFC V6 10/24] arm/virt: Init PMU at host for all present vCPUs
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (8 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 09/24] hw/intc/arm_gicv3_common: Migrate & check 'GICv3CPUState' accessibility mismatch salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-03 15:02   ` Igor Mammedov
  2025-10-01  1:01 ` [PATCH RFC V6 11/24] hw/arm/acpi: MADT change to size the guest with possible vCPUs salil.mehta
                   ` (16 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

The Arm architecture requires all CPUs that form part of the VM to expose
identical feature sets and consistent system components at creation time.
This includes the Performance Monitoring Unit (PMU). If only the boot CPUs
had their PMU state initialized, the remaining CPUs counted by
`smp.disabled_cpus` would violate this architectural requirement, leading
to inconsistencies and guest misbehavior.

To comply with this constraint, PMU initialization must cover the entire set
of present vCPUs:

    present = smp.cpus + smp.disabled_cpus

CPUs outside this set (`smp.max_cpus - present`) are not considered part of
the machine at creation and are therefore not initialized.
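
As a worked illustration, the sketch below walks a list of possible-CPU slots
and initializes only the present ones. It is a minimal model under stated
assumptions: `struct vcpu`, `struct cpu_slot`, and `init_pmu_for_present()`
are hypothetical names, and a NULL `cpu` pointer stands for a slot outside
the present set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vCPU model; only the PMU state matters for this sketch. */
struct vcpu {
    bool pmu_ready;
};

/* Hypothetical possible-CPU slot; cpu is NULL when no vCPU was created. */
struct cpu_slot {
    struct vcpu *cpu;
};

/*
 * Initialize the PMU of every present vCPU (enabled or admin-disabled)
 * and skip empty slots. Returns the number of vCPUs initialized, which
 * should equal smp.cpus + smp.disabled_cpus.
 */
static size_t init_pmu_for_present(struct cpu_slot *slots, size_t max_cpus)
{
    size_t initialized = 0;

    assert(slots);
    for (size_t i = 0; i < max_cpus; i++) {
        if (!slots[i].cpu) {
            continue;   /* slot outside the present set */
        }
        slots[i].cpu->pmu_ready = true;
        initialized++;
    }
    return initialized;
}
```

With `smp.cpus = 2`, `smp.disabled_cpus = 1`, and `smp.max_cpus = 4`, three
slots hold a vCPU and one slot is left untouched.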

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/arm/virt.c         | 13 +++++++---
 include/hw/arm/virt.h |  1 +
 include/hw/core/cpu.h | 57 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ee09aa19bd..3980f553db 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2087,12 +2087,13 @@ static void finalize_gic_version(VirtMachineState *vms)
 static void virt_post_cpus_gic_realized(VirtMachineState *vms,
                                         MemoryRegion *sysmem)
 {
+    CPUArchIdList *possible_cpus = vms->parent.possible_cpus;
     int max_cpus = MACHINE(vms)->smp.max_cpus;
-    bool aarch64, pmu, steal_time;
+    bool aarch64, steal_time;
     CPUState *cpu;
 
     aarch64 = object_property_get_bool(OBJECT(first_cpu), "aarch64", NULL);
-    pmu = object_property_get_bool(OBJECT(first_cpu), "pmu", NULL);
+    vms->pmu = object_property_get_bool(OBJECT(first_cpu), "pmu", NULL);
     steal_time = object_property_get_bool(OBJECT(first_cpu),
                                           "kvm-steal-time", NULL);
 
@@ -2123,8 +2124,12 @@ static void virt_post_cpus_gic_realized(VirtMachineState *vms,
             exit(1);
         }
 
-        CPU_FOREACH(cpu) {
-            if (pmu) {
+        CPU_FOREACH_POSSIBLE(cpu, possible_cpus) {
+            if (!cpu) {
+                continue;
+            }
+
+            if (vms->pmu) {
                 assert(arm_feature(&ARM_CPU(cpu)->env, ARM_FEATURE_PMU));
                 if (kvm_irqchip_in_kernel()) {
                     kvm_arm_pmu_set_irq(ARM_CPU(cpu), VIRTUAL_PMU_IRQ);
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index ace4154cc6..02cc311452 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -154,6 +154,7 @@ struct VirtMachineState {
     bool mte;
     bool dtb_randomness;
     bool second_ns_uart_present;
+    bool pmu;
     OnOffAuto acpi;
     VirtGICType gic_version;
     VirtIOMMUType iommu;
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 5eaf41a566..2ee202a8a5 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -602,6 +602,63 @@ extern CPUTailQ cpus_queue;
 #define CPU_FOREACH_SAFE(cpu, next_cpu) \
     QTAILQ_FOREACH_SAFE_RCU(cpu, &cpus_queue, node, next_cpu)
 
+
+/**
+ * CPU_FOREACH_POSSIBLE(cpu_, archid_list_)
+ *
+ * Iterate over all entries in a CPUArchIdList, assigning each entry’s
+ * CPUState* to @cpu_. This hides the loop index and reads like a normal
+ * C for-loop.
+ *
+ * A CPUArchIdList represents the set of *possible* CPUs for a machine.
+ * Each entry contains:
+ *   - @cpu:        CPUState pointer, or NULL if not realized yet
+ *   - @arch_id:    architecture-specific identifier (e.g. MPIDR)
+ *   - @vcpus_count: number of vCPUs represented (usually 1)
+ *
+ * The list models *possible* CPUs: it includes (a) currently plugged vCPUs
+ * made available through hotplug, (b) present (and perhaps visible to OSPM)
+ * but kept ACPI-disabled vCPUs, and (c) reserved slots for CPUs that may be
+ * created in the future. This supports co-existence of hotpluggable and
+ * admin-disabled vCPUs if architectures permit.
+ *
+ * Example:
+ *
+ *   CPUArchIdList *alist = machine_possible_cpus(ms);
+ *   CPUState *cpu;
+ *
+ *   CPU_FOREACH_POSSIBLE(cpu, alist) {
+ *       if (!cpu) {
+ *           continue; // reserved slot for hotplug case
+ *       }
+ *
+ *       < Do Something >
+ *   }
+ *
+ * Expanded equivalent:
+ *
+ *   for (int __cpu_idx = 0; alist && __cpu_idx < alist->len; __cpu_idx++) {
+ *       if ((cpu = alist->cpus[__cpu_idx].cpu, 1)) {
+ *           if (!cpu) {
+ *               continue;
+ *           }
+ *
+ *           < Do Something >
+ *       }
+ *   }
+ *
+ * Notes:
+ *   - Callers must check @cpu for NULL when filtering unplugged CPUs.
+ *   - Mirrors the style of CPU_FOREACH(), but iterates all *possible* CPUs
+ *     (plugged, ACPI-disabled, and reserved slots) rather than only present
+ *     and enabled vCPUs.
+ */
+#define CPU_FOREACH_POSSIBLE(cpu_, archid_list_) \
+    for (int __cpu_idx = 0; \
+         (archid_list_) && __cpu_idx < (archid_list_)->len; \
+         __cpu_idx++) \
+        if (((cpu_) = (archid_list_)->cpus[__cpu_idx].cpu, 1))
+
 extern __thread CPUState *current_cpu;
 
 /**
-- 
2.34.1




* [PATCH RFC V6 11/24] hw/arm/acpi: MADT change to size the guest with possible vCPUs
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (9 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 10/24] arm/virt: Init PMU at host for all present vCPUs salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-03 15:09   ` Igor Mammedov
  2025-10-01  1:01 ` [PATCH RFC V6 12/24] hw/core: Introduce generic device power-state handler interface salil.mehta
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

When QEMU builds the MADT, it must also include information about possible
vCPUs that are exposed as ACPI-disabled (i.e., `_STA.Enabled=0`). This
information helps the guest kernel pre-size its resources at boot. Pre-sizing
based on possible vCPUs facilitates later hotplug of the currently disabled
vCPUs.

Additionally, this change sets the MADT GIC CPU Interface flags introduced in
the ACPI 6.5 specification [1]: Enabled (bit 0) and Online Capable (bit 3).
These flags enable deferred virtual CPU onlining in the guest kernel.

Reference:
[1] 5.2.12.14. GIC CPU Interface (GICC) Structure (Table 5.37 GICC CPU Interface Flags)
    Link: https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#gic-cpu-interface-gicc-structure
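
The flag selection can be sketched as a small pure function. This is a
simplified model, not the code in this patch: `gicc_flags()` and its boolean
parameters are hypothetical, while the bit positions follow the GICC flags
table referenced in [1].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GICC_FLAG_ENABLED        (UINT32_C(1) << 0)  /* bit 0: Enabled */
#define GICC_FLAG_ONLINE_CAPABLE (UINT32_C(1) << 3)  /* bit 3: Online Capable */

/*
 * Pick the MADT GICC flags for one possible-CPU slot:
 *  - no vCPU in the slot:             neither flag is set
 *  - machine lacks online-capable:    Enabled only
 *  - otherwise: boot CPU is Enabled, all other vCPUs are Online Capable
 */
static uint32_t gicc_flags(bool present, bool is_boot_cpu,
                           bool has_online_capable_cpus)
{
    if (!present) {
        return 0;
    }
    if (!has_online_capable_cpus) {
        return GICC_FLAG_ENABLED;
    }
    return is_boot_cpu ? GICC_FLAG_ENABLED : GICC_FLAG_ONLINE_CAPABLE;
}
```

The two flags are never set together in this sketch, so a secondary vCPU on
an online-capable machine is reported with value 0x8 and the boot CPU with
value 0x1.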

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/arm/virt-acpi-build.c | 40 ++++++++++++++++++++++++++++++++++------
 hw/core/machine.c        | 14 ++++++++++++++
 include/hw/boards.h      | 20 ++++++++++++++++++++
 3 files changed, 68 insertions(+), 6 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index b01fc4f8ef..7c24dd6369 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -760,6 +760,32 @@ static void build_append_gicr(GArray *table_data, uint64_t base, uint32_t size)
     build_append_int_noprefix(table_data, size, 4); /* Discovery Range Length */
 }
 
+static uint32_t virt_acpi_get_gicc_flags(CPUState *cpu)
+{
+    MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());
+    const uint32_t GICC_FLAG_ENABLED = BIT(0);
+    const uint32_t GICC_FLAG_ONLINE_CAPABLE = BIT(3);
+
+    /* ARM architecture does not support vCPU hotplug yet */
+    if (!cpu) {
+        return 0;
+    }
+
+    /*
+     * If the machine does not support online-capable CPUs, report the GICC as
+     * 'enabled' only.
+     */
+    if (!mc->has_online_capable_cpus) {
+        return GICC_FLAG_ENABLED;
+    }
+
+    /*
+     * ACPI 6.5, 5.2.12.14 (GICC): mark the boot CPU 'enabled' and all others
+     * 'online-capable'.
+     */
+    return (cpu == first_cpu) ? GICC_FLAG_ENABLED : GICC_FLAG_ONLINE_CAPABLE;
+}
+
 static void
 build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 {
@@ -785,12 +811,14 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     build_append_int_noprefix(table_data, vms->gic_version, 1);
     build_append_int_noprefix(table_data, 0, 3);   /* Reserved */
 
-    for (i = 0; i < MACHINE(vms)->smp.cpus; i++) {
-        ARMCPU *armcpu = ARM_CPU(qemu_get_cpu(i));
+    for (i = 0; i < MACHINE(vms)->smp.max_cpus; i++) {
+        CPUState *cpu = machine_get_possible_cpu(i);
         uint64_t physical_base_address = 0, gich = 0, gicv = 0;
         uint32_t vgic_interrupt = vms->virt ? ARCH_GIC_MAINT_IRQ : 0;
-        uint32_t pmu_interrupt = arm_feature(&armcpu->env, ARM_FEATURE_PMU) ?
-                                             VIRTUAL_PMU_IRQ : 0;
+        uint32_t pmu_interrupt = vms->pmu ? VIRTUAL_PMU_IRQ : 0;
+        CPUArchId *archid = machine_get_possible_cpu_arch_id(i);
+        uint32_t flags = virt_acpi_get_gicc_flags(cpu);
+        uint64_t mpidr = archid->arch_id;
 
         if (vms->gic_version == VIRT_GIC_VERSION_2) {
             physical_base_address = memmap[VIRT_GIC_CPU].base;
@@ -805,7 +833,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
         build_append_int_noprefix(table_data, i, 4);    /* GIC ID */
         build_append_int_noprefix(table_data, i, 4);    /* ACPI Processor UID */
         /* Flags */
-        build_append_int_noprefix(table_data, 1, 4);    /* Enabled */
+        build_append_int_noprefix(table_data, flags, 4);
         /* Parking Protocol Version */
         build_append_int_noprefix(table_data, 0, 4);
         /* Performance Interrupt GSIV */
@@ -819,7 +847,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
         build_append_int_noprefix(table_data, vgic_interrupt, 4);
         build_append_int_noprefix(table_data, 0, 8);    /* GICR Base Address*/
         /* MPIDR */
-        build_append_int_noprefix(table_data, arm_cpu_mp_affinity(armcpu), 8);
+        build_append_int_noprefix(table_data, mpidr, 8);
         /* Processor Power Efficiency Class */
         build_append_int_noprefix(table_data, 0, 1);
         /* Reserved */
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 69d5632464..65388d859a 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -1383,6 +1383,20 @@ CPUState *machine_get_possible_cpu(int64_t cpu_index)
     return NULL;
 }
 
+CPUArchId *machine_get_possible_cpu_arch_id(int64_t cpu_index)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    CPUArchIdList *possible_cpus = ms->possible_cpus;
+
+    for (int i = 0; i < possible_cpus->len; i++) {
+        if (possible_cpus->cpus[i].cpu &&
+            possible_cpus->cpus[i].cpu->cpu_index == cpu_index) {
+            return &possible_cpus->cpus[i];
+        }
+    }
+    return NULL;
+}
+
 static char *cpu_slot_to_string(const CPUArchId *cpu)
 {
     GString *s = g_string_new(NULL);
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 3ff77a8b3a..fe51ca58bf 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -461,6 +461,26 @@ struct MachineState {
     bool acpi_spcr_enabled;
 };
 
+/*
+ * machine_get_possible_cpu_arch_id:
+ * @cpu_index: logical cpu_index to search for
+ *
+ * Return a pointer to the CPUArchId entry matching the given @cpu_index
+ * in the current machine's MachineState. The possible_cpus array holds
+ * the full set of CPUs that the machine could support, including those
+ * that may be created as disabled or taken offline.
+ *
+ * The slot index in ms->possible_cpus->cpus[] is always sequential, but the
+ * logical cpu_index values are assigned by QEMU and may or may not be
+ * sequential depending on the implementation of a particular machine.
+ * Direct indexing by cpu_index is therefore unsafe in general. This
+ * helper performs a linear search of the possible_cpus array to find
+ * the matching entry.
+ *
+ * Returns: pointer to the matching CPUArchId, or NULL if not found.
+ */
+CPUArchId *machine_get_possible_cpu_arch_id(int64_t cpu_index);
+
 /*
  * The macros which follow are intended to facilitate the
  * definition of versioned machine types, using a somewhat
-- 
2.34.1




* [PATCH RFC V6 12/24] hw/core: Introduce generic device power-state handler interface
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (10 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 11/24] hw/arm/acpi: MADT change to size the guest with possible vCPUs salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 13/24] qdev: make admin power state changes trigger platform transitions via ACPI salil.mehta
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC

From: Salil Mehta <salil.mehta@huawei.com>

Device power-state transitions such as powering on, powering off, or entering
standby may be triggered by administrative state changes (enable to disable or
disable to enable), guest OSPM requests in response to workload or policy, or
platform-specific control flows (e.g. ACPI, firmware, or machine hooks). These
varied triggers require coordinated handling to ensure consistent behavior
across devices.

Without a common interface, each device type must implement ad-hoc logic,
making it harder to manage and extend power-state control in QEMU.

This patch introduces a generic PowerStateHandler QOM interface that allows
devices to expose callbacks for operational power-state transitions. The model
distinguishes between administrative state (enable/disable, host-driven) and
operational state (on/off/standby, runtime). An administrative transition may
trigger an operational change, with QEMU signaling the guest through platform
interfaces and OSPM coordinating the transition. Some platforms may enforce
transitions directly, without OSPM involvement.

Key features:
 - New TYPE_POWERSTATE_HANDLER QOM interface.
 - PowerStateHandlerClass with optional callbacks for operational transitions:
      device_request_poweroff() – notify guest or internal logic to begin a
                                  graceful shutdown sequence.
      device_post_poweroff()    – complete disable after OSPM has powered off
                                  operationally; device is inactive and freed.
      device_pre_poweron()      – prepare for activation on administrative
                                  enable; reinit state and notify guest/OSPM.
      device_request_standby()  – request a standby state without full poweroff,
                                  retaining sufficient state for resume.
 - Helper functions in hw/core/powerstate.c to:
      - Retrieve a device’s PowerStateHandler from the machine.
      - Invoke the registered callbacks if present.
 - Intended for use by any device type (CPU or non-CPU) that supports controlled
   power transitions, regardless of whether it supports architectural hotplug.
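
The optional-callback dispatch listed above can be modelled outside QEMU. The
following is a minimal sketch, not the code added by this patch: Dev, Err,
Handler, call_hook() and the dev_*() wrappers are hypothetical stand-ins for
DeviceState, Error, PowerStateHandlerClass and the device_*() helpers. It only
shows that a missing handler or an unregistered hook turns the call into a
silent no-op.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not QEMU's DeviceState or Error types. */
typedef struct Dev { const char *id; int powered; } Dev;
typedef struct Err { const char *msg; } Err;

typedef struct Handler Handler;
typedef void (*transition_fn)(Handler *h, Dev *dev, Err *err);

/* Same shape as PowerStateHandlerClass: any hook may be left NULL. */
struct Handler {
    transition_fn request_poweroff;
    transition_fn post_poweroff;
    transition_fn pre_poweron;
    transition_fn request_standby;
};

/*
 * Run a hook only when a handler exists and has registered it.
 * Returns 1 if the hook ran, 0 if the call was a silent no-op.
 */
static int call_hook(Handler *h, transition_fn fn, Dev *dev, Err *err)
{
    if (!h || !fn) {
        return 0;
    }
    fn(h, dev, err);
    return 1;
}

static int dev_post_poweroff(Handler *h, Dev *dev, Err *err)
{
    return call_hook(h, h ? h->post_poweroff : NULL, dev, err);
}

static int dev_pre_poweron(Handler *h, Dev *dev, Err *err)
{
    return call_hook(h, h ? h->pre_poweron : NULL, dev, err);
}

static int dev_request_standby(Handler *h, Dev *dev, Err *err)
{
    return call_hook(h, h ? h->request_standby : NULL, dev, err);
}

/* Example hooks that a platform might register. */
static void example_off(Handler *h, Dev *dev, Err *err)
{
    (void)h; (void)err;
    dev->powered = 0;
}

static void example_on(Handler *h, Dev *dev, Err *err)
{
    (void)h; (void)err;
    dev->powered = 1;
}
```

A handler that registers only example_off and example_on would leave
dev_request_standby() as a no-op, which is the behaviour the "optional
callbacks" wording describes.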

High-level flow:
 QMP/HMP
   |    user issues: {"execute":"device-set", ...} (in later patches)
   v
 QDEV (Prop: admin-power-state) (Administrative State Handling)
   |    invokes PowerStateHandler callbacks via interface
   v
 Machine (PowerStateHandler) (Operational State Handling)
   |    coordinates platform policy and may call firmware handler
   v
 ACPI GED (PowerStateHandler, firmware)
   |    signals events/notifications to the guest
   v
 ACPI SCI (System Control Interrupt) to guest OS
   |  SCI is delivered on GSI N (GED Interrupt() _CRS = N, with FADT
   |  designating N as SCI)
   |  OSPM receives SCI/GSI IRQ
   v
 OSPM (in-guest housekeeping) evaluates ACPI methods from firmware tables
        (e.g. _EJ0, _STA, _OST) and completes the transition

Integration model:
Both Machine and ACPI GED implement the PowerStateHandler interface.
QDEV calls the handler hooks; Machine applies platform policy and can invoke
GED to coordinate with OSPM. This keeps QDEV generic while arch-specific logic
resides in Machine and firmware.
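
The layering in the integration model can be sketched as two cooperating
handlers. This is an illustration under stated assumptions, not the series'
implementation: Machine, Firmware, the allow_cpu_poweroff policy flag and the
return codes are all hypothetical. The point is only that the generic caller
talks to the machine-level handler, which applies policy and then forwards to
the firmware-level handler that notifies the guest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device model: only what the sketch needs. */
typedef struct Dev { int is_cpu; int notified; } Dev;

/* Firmware-side handler (stands in for ACPI GED): raises a guest event. */
typedef struct Firmware { int events_raised; } Firmware;

static void firmware_request_poweroff(Firmware *fw, Dev *dev)
{
    fw->events_raised++;
    dev->notified = 1;
}

/* Machine-side handler: owns platform policy and an optional firmware hook. */
typedef struct Machine {
    Firmware *fw;            /* may be NULL if the platform has no firmware path */
    int allow_cpu_poweroff;  /* hypothetical platform policy knob */
} Machine;

/*
 * Returns 0 on success, -1 if platform policy rejects the request,
 * -2 if there is no firmware handler to coordinate with the guest.
 */
static int machine_request_poweroff(Machine *m, Dev *dev)
{
    if (dev->is_cpu && !m->allow_cpu_poweroff) {
        return -1;
    }
    if (!m->fw) {
        return -2;
    }
    firmware_request_poweroff(m->fw, dev);
    return 0;
}
```

Keeping the policy check in the machine layer means the generic caller never
needs to know whether a firmware path exists or what the platform permits.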

This interface will be used in later patches to coordinate CPU administrative
enable/disable operations on architectures that lack native CPU hotplug, and
can also be adopted by other device classes requiring similar control.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/core/meson.build      |   1 +
 hw/core/powerstate.c     | 100 +++++++++++++++++++++++
 include/hw/boards.h      |   2 +
 include/hw/powerstate.h  | 171 +++++++++++++++++++++++++++++++++++++++
 stubs/meson.build        |   1 +
 stubs/powerstate-stubs.c |  47 +++++++++++
 6 files changed, 322 insertions(+)
 create mode 100644 hw/core/powerstate.c
 create mode 100644 include/hw/powerstate.h
 create mode 100644 stubs/powerstate-stubs.c

diff --git a/hw/core/meson.build b/hw/core/meson.build
index b5a545a0ed..d9d716ce55 100644
--- a/hw/core/meson.build
+++ b/hw/core/meson.build
@@ -40,6 +40,7 @@ system_ss.add(files(
   'numa.c',
   'qdev-fw.c',
   'qdev-hotplug.c',
+  'powerstate.c',
   'qdev-properties-system.c',
   'reset.c',
   'sysbus.c',
diff --git a/hw/core/powerstate.c b/hw/core/powerstate.c
new file mode 100644
index 0000000000..0e1d12b3f6
--- /dev/null
+++ b/hw/core/powerstate.c
@@ -0,0 +1,100 @@
+/*
+ * Device Power State transition handler interface
+ *
+ * An administrative request to 'enable' or 'disable' a device results in a
+ * change of its operational status. The transition may be performed either
+ * synchronously or asynchronously, with OSPM assistance where required.
+ *
+ * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
+ *
+ * Author: Salil Mehta <salil.mehta@huawei.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#include "qemu/osdep.h"
+#include "hw/powerstate.h"
+#include "qemu/module.h"
+#include "qapi/error.h"
+#include "hw/boards.h"
+
+PowerStateHandler *powerstate_handler(DeviceState *dev)
+{
+    MachineState *machine = MACHINE(qdev_get_machine());
+    MachineClass *mc = MACHINE_GET_CLASS(machine);
+
+    if (mc->get_powerstate_handler) {
+        return (PowerStateHandler *)mc->get_powerstate_handler(machine, dev);
+    }
+
+    return NULL;
+}
+
+DeviceOperPowerState qdev_get_oper_power_state(DeviceState *dev)
+{
+    PowerStateHandler *h = powerstate_handler(dev);
+    PowerStateHandlerClass *pshc = h ? POWERSTATE_HANDLER_GET_CLASS(h) : NULL;
+
+    if (pshc && pshc->get_oper_state) {
+        return pshc->get_oper_state(dev, &error_warn);
+    }
+
+    return DEVICE_OPER_POWER_STATE_UNKNOWN;
+}
+
+void device_request_poweroff(DeviceState *dev, Error **errp)
+{
+    PowerStateHandler *h = powerstate_handler(dev);
+    PowerStateHandlerClass *pshc = h ? POWERSTATE_HANDLER_GET_CLASS(h) : NULL;
+
+    if (pshc && pshc->request_poweroff) {
+        pshc->request_poweroff(h, dev, errp);
+    }
+}
+
+void device_post_poweroff(DeviceState *dev, Error **errp)
+{
+    PowerStateHandler *h = powerstate_handler(dev);
+    PowerStateHandlerClass *pshc = h ? POWERSTATE_HANDLER_GET_CLASS(h) : NULL;
+
+    if (pshc && pshc->post_poweroff) {
+        pshc->post_poweroff(h, dev, errp);
+    }
+}
+
+void device_pre_poweron(DeviceState *dev, Error **errp)
+{
+    PowerStateHandler *h = powerstate_handler(dev);
+    PowerStateHandlerClass *pshc = h ? POWERSTATE_HANDLER_GET_CLASS(h) : NULL;
+
+    if (pshc && pshc->pre_poweron) {
+        pshc->pre_poweron(h, dev, errp);
+    }
+}
+
+void device_request_standby(DeviceState *dev, Error **errp)
+{
+    PowerStateHandler *h = powerstate_handler(dev);
+    PowerStateHandlerClass *pshc = h ? POWERSTATE_HANDLER_GET_CLASS(h) : NULL;
+
+    if (pshc && pshc->request_standby) {
+        pshc->request_standby(h, dev, errp);
+    }
+}
+
+static const TypeInfo powerstate_handler_info = {
+    .name          = TYPE_POWERSTATE_HANDLER,
+    .parent        = TYPE_INTERFACE,
+    .class_size    = sizeof(PowerStateHandlerClass),
+};
+
+static void powerstate_handler_register_types(void)
+{
+    type_register_static(&powerstate_handler_info);
+}
+
+type_init(powerstate_handler_register_types)
diff --git a/include/hw/boards.h b/include/hw/boards.h
index fe51ca58bf..161505911f 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -332,6 +332,8 @@ struct MachineClass {
 
     HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
                                            DeviceState *dev);
+    void *(*get_powerstate_handler)(MachineState *machine,
+                                    DeviceState *dev);
     bool (*hotplug_allowed)(MachineState *state, DeviceState *dev,
                             Error **errp);
     CpuInstanceProperties (*cpu_index_to_instance_props)(MachineState *machine,
diff --git a/include/hw/powerstate.h b/include/hw/powerstate.h
new file mode 100644
index 0000000000..c16da0f24d
--- /dev/null
+++ b/include/hw/powerstate.h
@@ -0,0 +1,171 @@
+/*
+ * Device Power State transition handler interface
+ *
+ * An administrative request to 'enable' or 'disable' a device results in a
+ * change of its operational status. The transition may be performed either
+ * synchronously or asynchronously, with OSPM assistance where required.
+ *
+ * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
+ *
+ * Author: Salil Mehta <salil.mehta@huawei.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#ifndef POWERSTATE_H
+#define POWERSTATE_H
+
+#include "qom/object.h"
+
+#define TYPE_POWERSTATE_HANDLER "powerstate-handler"
+
+typedef struct PowerStateHandlerClass PowerStateHandlerClass;
+DECLARE_CLASS_CHECKERS(PowerStateHandlerClass, POWERSTATE_HANDLER,
+                       TYPE_POWERSTATE_HANDLER)
+#define POWERSTATE_HANDLER(obj) \
+     INTERFACE_CHECK(PowerStateHandler, (obj), TYPE_POWERSTATE_HANDLER)
+
+typedef struct PowerStateHandler PowerStateHandler;
+
+/**
+ * DeviceOperPowerState:
+ *
+ * Enumeration of operational power states for devices. These represent runtime
+ * states controlled through platform interfaces (e.g. ACPI, PSCI, or other
+ * OSPM mechanisms), and are distinct from administrative presence or enable/
+ * disable state.
+ *
+ * Transitions may be initiated by the guest OSPM in response to workload or
+ * policy, or triggered by administrative actions due to policy change. Please
+ * check PowerStateHandlerClass for more details on these.
+ *
+ * Platforms may optionally implement a callback to fetch the current state.
+ * That callback must map internal platform state to one of the values here.
+ *
+ * @DEVICE_OPER_POWER_STATE_UNKNOWN: State reporting unsupported, or state
+ *                                   could not be determined. If @errp is set,
+ *                                   this indicates an error. Platform firmware
+ *                                   may also enforce state changes directly;
+ *                                   the callback must return the resulting
+ *                                   state.
+ *
+ * @DEVICE_OPER_POWER_STATE_ON:      Device is powered on and fully active.
+ *
+ * @DEVICE_OPER_POWER_STATE_OFF:     Device is powered off and inactive. It
+ *                                   should not consume resources and may
+ *                                   require reinitialization on power on.
+ *
+ * @DEVICE_OPER_POWER_STATE_STANDBY: Device is in a low-power standby state.
+ *                                   It retains enough state to allow fast
+ *                                   resume without full reinitialization.
+ *
+ * See also: PowerStateHandlerClass, powerstate_get_fn
+ */
+typedef enum DeviceOperPowerState {
+    DEVICE_OPER_POWER_STATE_UNKNOWN = -1,
+    DEVICE_OPER_POWER_STATE_ON = 0,
+    DEVICE_OPER_POWER_STATE_OFF,
+    DEVICE_OPER_POWER_STATE_STANDBY,
+    DEVICE_OPER_POWER_STATE_MAX
+} DeviceOperPowerState;
+
+/**
+ * powerstate_fn:
+ * @handler: Power state handler for the device performing the transition.
+ * @dev: The device being transitioned as a result of an administrative
+ *       state change (e.g. enable-to-disable or disable-to-enable), which
+ *       in turn affects its operational state (on, off, standby).
+ * @errp: Pointer to return an error if the function fails.
+ *
+ * Generic function signature for device power state transitions. An
+ * administrative state change triggers the corresponding operational
+ * transition, which may be implemented synchronously or asynchronously.
+ */
+typedef void (*powerstate_fn)(PowerStateHandler *handler, DeviceState *dev,
+                              Error **errp);
+
+/**
+ * powerstate_get_fn:
+ * @dev:  The device whose operational state is being queried.
+ * @errp: Pointer to an error object, set on failure.
+ *
+ * Callback type to query the current operational power state of a device.
+ * Platforms may optionally implement this to expose their internal power
+ * management status. When present, the callback must map the platform’s
+ * internal state into one of the DeviceOperPowerState values.
+ *
+ * Returns: A DeviceOperPowerState value on success. If the platform does not
+ * support state reporting, returns DEVICE_OPER_POWER_STATE_UNKNOWN without
+ * setting @errp. If the state could not be determined due to an error, sets
+ * @errp and also returns DEVICE_OPER_POWER_STATE_UNKNOWN; callers must check
+ * @errp first and ignore the return value whenever it is set.
+ */
+typedef DeviceOperPowerState (*powerstate_get_fn)(DeviceState *dev,
+                                                  Error **errp);
+
+/**
+ * PowerStateHandlerClass:
+ *
+ * Interface for devices that support transitions of their operational power
+ * state (on, off, standby). These transitions may be driven by changes in the
+ * device’s administrative state (enable to/from disable), or initiated by the
+ * guest OSPM based on runtime policy.
+ *
+ * Administrative changes are host-driven (e.g. 'device_set') and can trigger
+ * corresponding operational transitions. QEMU may signal the guest via platform
+ * interfaces (such as ACPI) so that OSPM coordinates the change. Some platforms
+ * may also enforce transitions directly, without OSPM involvement.
+ *
+ * @parent: Opaque parent interface.
+ *
+ * @get_oper_state: Optional callback to query the current operational state.
+ *                  Implementations must map the internal state to the
+ *                  'DeviceOperPowerState' enum.
+ *
+ * @request_poweroff: Optional callback to notify the guest or internal logic
+ *                    that the device is about to be disabled. Used to initiate
+ *                    graceful shutdown or cleanup within OSPM.
+ *
+ * @post_poweroff: Callback invoked after OSPM has powered off the device
+ *                 operationally. Completes the administrative transition to
+ *                 'disabled', ensuring the device is fully inactive and not
+ *                 consuming resources.
+ *
+ * @pre_poweron: Callback to prepare a device for re-activation after an
+ *               administrative 'enable'. May reinitialize state and notify the
+ *               guest that the device is available. The guest OSPM may or may
+ *               not then make the device operationally active.
+ *
+ * @request_standby: Optional callback to place the device into a standby state
+ *                   without full power-off. The device is expected to retain
+ *                   sufficient state for efficient resume, e.g. CPU_SUSPEND.
+ */
+struct PowerStateHandlerClass {
+    /* <private> */
+    InterfaceClass parent;
+
+    /* <public> */
+    powerstate_get_fn get_oper_state;
+
+    powerstate_fn request_poweroff;
+    powerstate_fn post_poweroff;
+    powerstate_fn pre_poweron;
+    powerstate_fn request_standby;
+};
+
+PowerStateHandler *powerstate_handler(DeviceState *dev);
+
+DeviceOperPowerState qdev_get_oper_power_state(DeviceState *dev);
+
+void device_request_poweroff(DeviceState *dev, Error **errp);
+
+void device_post_poweroff(DeviceState *dev, Error **errp);
+
+void device_pre_poweron(DeviceState *dev, Error **errp);
+
+void device_request_standby(DeviceState *dev, Error **errp);
+#endif /* POWERSTATE_H */
diff --git a/stubs/meson.build b/stubs/meson.build
index cef046e685..f38cdd1947 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -95,5 +95,6 @@ if have_system or have_user
 
   # Also included in have_system for tests/unit/test-qdev-global-props
   stub_ss.add(files('hotplug-stubs.c'))
+  stub_ss.add(files('powerstate-stubs.c'))
   stub_ss.add(files('sysbus.c'))
 endif
diff --git a/stubs/powerstate-stubs.c b/stubs/powerstate-stubs.c
new file mode 100644
index 0000000000..01c615cda2
--- /dev/null
+++ b/stubs/powerstate-stubs.c
@@ -0,0 +1,47 @@
+/*
+ * Device Power State handler interface Stubs.
+ *
+ * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
+ *
+ * Author: Salil Mehta <salil.mehta@huawei.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#include "qemu/osdep.h"
+#include "hw/powerstate.h"
+#include "hw/qdev-core.h"
+
+PowerStateHandler *powerstate_handler(DeviceState *dev)
+{
+    return NULL;
+}
+
+DeviceOperPowerState qdev_get_oper_power_state(DeviceState *dev)
+{
+    return DEVICE_OPER_POWER_STATE_UNKNOWN;
+}
+
+void device_request_poweroff(DeviceState *dev, Error **errp)
+{
+    g_assert_not_reached();
+}
+
+void device_post_poweroff(DeviceState *dev, Error **errp)
+{
+    g_assert_not_reached();
+}
+
+void device_pre_poweron(DeviceState *dev, Error **errp)
+{
+    g_assert_not_reached();
+}
+
+void device_request_standby(DeviceState *dev, Error **errp)
+{
+    g_assert_not_reached();
+}
-- 
2.34.1




* [PATCH RFC V6 13/24] qdev: make admin power state changes trigger platform transitions via ACPI
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (11 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 12/24] hw/core: Introduce generic device power-state handler interface salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 14/24] arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms salil.mehta
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC

From: Salil Mehta <salil.mehta@huawei.com>

Changing a device's administrative power state must trigger a concrete
operational transition at the platform layer via ACPI coordination with OSPM.
The platform is responsible for actually powering devices off or on and for
notifying the guest when required.

Some machines can coordinate transitions asynchronously with OSPM using ACPI
methods and events (e.g. _EJx, device-check, _OST), while others cannot or
may not be ready when the policy changes. Without a defined linkage, admin
policy can drift from runtime reality: devices may stay active while
'disabled' or disappear without guest notification, and migration metadata
can fall out of sync.

This change establishes that linkage: administrative DISABLED/ENABLED requests
first drive the platform's operational transition via ACPI (prefer OSPM
coordination; otherwise fall back to a synchronous in-QEMU path) and only
then update QOM state and migration registration. This provides uniform
semantics and a reliable contract for management and tests.
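
The disable-side ordering described above (operational transition first,
administrative state committed only on success) can be sketched as below. This
is a simplified model under stated assumptions, not the qdev code in this
patch: Device, hook_fn and admin_disable() are hypothetical, hooks report
failure through a bool rather than Error **, and a device with no hooks is
treated as a no-op success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ADMIN_ENABLED, ADMIN_DISABLED } AdminState;

typedef struct Device Device;
typedef bool (*hook_fn)(Device *dev);   /* returns false on failure */

struct Device {
    AdminState admin;
    int hook_calls;             /* bookkeeping for the example only */
    hook_fn request_poweroff;   /* optional: guest-coordinated, asynchronous */
    hook_fn post_poweroff;      /* optional: synchronous fallback path */
};

/* Disable a device: run the operational transition, then commit the state. */
static bool admin_disable(Device *dev)
{
    bool ok = true;

    if (dev->admin == ADMIN_DISABLED) {
        return true;            /* already disabled: nothing to do */
    }

    if (dev->request_poweroff) {
        ok = dev->request_poweroff(dev);    /* prefer OSPM coordination */
    } else if (dev->post_poweroff) {
        ok = dev->post_poweroff(dev);       /* otherwise power off directly */
    }

    if (!ok) {
        return false;           /* admin state is left untouched on failure */
    }

    dev->admin = ADMIN_DISABLED;
    return true;
}

/* Example hooks for exercising both outcomes. */
static bool hook_ok(Device *dev)   { dev->hook_calls++; return true; }
static bool hook_fail(Device *dev) { dev->hook_calls++; return false; }
```

Because the state is written only after the hook succeeds, a failed transition
cannot leave a device marked 'disabled' while it is still running.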

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/core/qdev.c          | 68 ++++++++++++++++++++++++++++++++++++-----
 include/hw/powerstate.h |  6 ++++
 include/hw/qdev-core.h  | 17 +++++++++++
 3 files changed, 84 insertions(+), 7 deletions(-)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 23b84a7756..3aba99b912 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -326,6 +326,30 @@ bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp)
                                    errp);
 }
 
+void qdev_sync_disable(DeviceState *dev, Error **errp)
+{
+    ERRP_GUARD(); /* errp may be NULL; keeps the *errp check below safe */
+    g_assert(dev);
+    g_assert(powerstate_handler(dev));
+    /*
+     * Administrative disable triggered either after OSPM completes _EJx
+     * (post Notify(..., 0x03)), or due to lack of async shutdown support.
+     *
+     * Device may still appear in ACPI namespace but remains disabled at
+     * the platform level. Guest cannot re-enable it until host allows.
+     */
+
+    /* Perform operational shutdown */
+    device_post_poweroff(dev, errp);
+    if (*errp) {
+        return;
+    }
+
+    /* Mark the device administratively disabled */
+    qatomic_set(&dev->admin_power_state, DEVICE_ADMIN_POWER_STATE_DISABLED);
+    smp_wmb();
+}
+
 bool qdev_enable(DeviceState *dev, BusState *bus, Error **errp)
 {
     g_assert(dev);
@@ -705,6 +729,7 @@ device_set_admin_power_state(Object *obj, int new_state, Error **errp)
 {
     DeviceState *dev = DEVICE(obj);
     DeviceClass *dc = DEVICE_GET_CLASS(dev);
+    DeviceAdminPowerState old_state;
 
     if (!dc->admin_power_state_supported) {
         error_setg(errp, "Device '%s' admin power state change not supported",
@@ -712,25 +737,54 @@ device_set_admin_power_state(Object *obj, int new_state, Error **errp)
         return;
     }
 
+    g_assert(powerstate_handler(dev));
+    old_state = qatomic_read(&dev->admin_power_state);
+
     switch (new_state) {
     case DEVICE_ADMIN_POWER_STATE_DISABLED: {
+        if (old_state == DEVICE_ADMIN_POWER_STATE_DISABLED) {
+            break;
+        }
+
         /*
-         * TODO: Operational state transition triggered by administrative action
+         * Operational state transition triggered by administrative action:
          * Powering off the realized device either synchronously or via OSPM.
          */
+        if (device_graceful_poweroff_supported(dev)) {
+            /* Graceful shutdown via guest coordination */
+            device_request_poweroff(dev, errp);
+            if (*errp) {
+                return;
+            }
 
-        qatomic_set(&dev->admin_power_state, DEVICE_ADMIN_POWER_STATE_DISABLED);
-        smp_wmb();
+            qatomic_set(&dev->admin_power_state,
+                        DEVICE_ADMIN_POWER_STATE_DISABLED);
+            smp_wmb();
+        } else {
+            /* Immediate shutdown within QEMU synchronously */
+            qdev_sync_disable(dev, errp);
+            if (*errp) {
+                return;
+            }
+        }
         break;
     }
     case DEVICE_ADMIN_POWER_STATE_ENABLED: {
-        /*
-         * TODO: Operational state transition triggered by administrative action
-         * Powering on the device and restoring migration registration.
-         */
+        if (old_state == DEVICE_ADMIN_POWER_STATE_ENABLED) {
+            break;
+        }
 
         qatomic_set(&dev->admin_power_state, DEVICE_ADMIN_POWER_STATE_ENABLED);
         smp_wmb();
+
+        /*
+         * Operational state transition triggered by administrative action:
+         * powering on the device and restoring migration registration.
+         */
+        device_pre_poweron(dev, errp);
+        if (*errp) {
+            return;
+        }
         break;
     }
     default:
diff --git a/include/hw/powerstate.h b/include/hw/powerstate.h
index c16da0f24d..b35650bac4 100644
--- a/include/hw/powerstate.h
+++ b/include/hw/powerstate.h
@@ -168,4 +168,10 @@ void device_post_poweroff(DeviceState *dev, Error **errp);
 void device_pre_poweron(DeviceState *dev, Error **errp);
 
 void device_request_standby(DeviceState *dev, Error **errp);
+
+static inline bool device_graceful_poweroff_supported(DeviceState *dev)
+{
+    PowerStateHandler *h = powerstate_handler(dev);
+    return h && POWERSTATE_HANDLER_GET_CLASS(h)->request_poweroff;
+}
 #endif /* POWERSTATE_H */
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 855ff865ba..3e08cfb59f 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -8,6 +8,7 @@
 #include "qemu/rcu_queue.h"
 #include "qom/object.h"
 #include "hw/hotplug.h"
+#include "hw/powerstate.h"
 #include "hw/resettable.h"
 
 /**
@@ -589,6 +590,22 @@ bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp);
  */
 bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp);
 
+/**
+ * qdev_sync_disable - Force immediate power-off and administrative disable
+ * @dev:   The device to be powered off and administratively disabled
+ * @errp:  Pointer to a location where an error can be reported
+ *
+ * This function performs a synchronous power-off of the device and marks it
+ * as administratively DISABLED. It assumes that prior graceful handling (e.g.,
+ * ACPI _EJx) has already been completed, or that asynchronous mechanisms are
+ * unsupported.
+ *
+ * After execution, the device remains visible to the guest (e.g. via ACPI),
+ * but cannot be brought back online unless explicitly re-enabled via admin
+ * policy. This function also removes the device from the migration stream.
+ */
+void qdev_sync_disable(DeviceState *dev, Error **errp);
+
 /**
  * qdev_enable - Power on and administratively enable a device
  * @dev:   The device to be powered on and administratively enabled
-- 
2.34.1




* [PATCH RFC V6 14/24] arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (12 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 13/24] qdev: make admin power state changes trigger platform transitions via ACPI salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-03 14:58   ` Igor Mammedov
  2025-10-24  4:47   ` Gavin Shan
  2025-10-01  1:01 ` [PATCH RFC V6 15/24] acpi/ged: Notify OSPM of CPU administrative state changes via GED salil.mehta
                   ` (12 subsequent siblings)
  26 siblings, 2 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC

From: Salil Mehta <salil.mehta@huawei.com>

The existing ACPI CPU hotplug interface is built for x86 platforms where CPUs
can be inserted or removed and resources are allocated dynamically. On ARM, CPUs
are never hotpluggable: resources are allocated at boot and QOM vCPU objects
always exist. Instead, CPUs are administratively managed by toggling ACPI _STA
to enable or disable them, which gives a hotplug-like effect but does not match
the x86 model.

Reusing the x86 hotplug AML code would complicate maintenance since much of its
logic relies on toggling the _STA.Present bit to notify OSPM about CPU insertion
or removal. Such usage is not architecturally valid on ARM, where CPUs cannot
appear or disappear at runtime. Mixing both models in one interface would
increase complexity and make the AML harder to extend. A separate path is
therefore required. The new design is heavily inspired by the CPU hotplug
interface but avoids its unsuitable semantics.

This patch adds a dedicated CPU OSPM (Operating System Power Management)
interface. It provides a memory-mapped control region with selector, flags,
command, and data fields, and AML methods for device-check, eject request, and
_OST reporting. OSPM is notified through GED events and can coordinate CPU
events directly with QEMU. Other ARM-like architectures may also use this
interface.
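
The selector-plus-flags access pattern of the control region can be sketched
as follows. This is an illustration only: the bit positions (FLAG_ENABLED,
FLAG_DEVCHK, FLAG_EJECTRQ), the write-1-to-clear acknowledgement and the
MAX_CPUS limit are assumptions made for the example and are not taken from the
register definition in this patch. It shows how one small window can address
any CPU by first writing its index to the selector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CPUS 4

/* Hypothetical bit assignments, chosen for this example only. */
#define FLAG_ENABLED  (1u << 0)
#define FLAG_DEVCHK   (1u << 1)
#define FLAG_EJECTRQ  (1u << 2)

typedef struct CpuSlot { uint8_t flags; } CpuSlot;

typedef struct CpuWindow {
    uint32_t selector;          /* index of the CPU the window exposes */
    CpuSlot slots[MAX_CPUS];
} CpuWindow;

/* Guest side: choose which CPU the flags field refers to. */
static void window_select(CpuWindow *w, uint32_t idx)
{
    w->selector = idx;
}

/* Guest side: read flags of the selected CPU; out-of-range reads as 0. */
static uint8_t window_read_flags(const CpuWindow *w)
{
    return w->selector < MAX_CPUS ? w->slots[w->selector].flags : 0;
}

/* Guest side: acknowledge events (assumed write-1-to-clear semantics). */
static void window_ack(CpuWindow *w, uint8_t events)
{
    if (w->selector < MAX_CPUS) {
        w->slots[w->selector].flags &= (uint8_t)~events;
    }
}

/* Host side: mark an event as pending for one CPU. */
static void window_raise(CpuWindow *w, uint32_t idx, uint8_t event)
{
    if (idx < MAX_CPUS) {
        w->slots[idx].flags |= event;
    }
}

/* Guest-style scan: index of the first CPU with the event pending, or -1. */
static int window_scan(CpuWindow *w, uint8_t event)
{
    for (uint32_t i = 0; i < MAX_CPUS; i++) {
        window_select(w, i);
        if (window_read_flags(w) & event) {
            return (int)i;      /* selector is left pointing at this CPU */
        }
    }
    return -1;
}
```

A scan that stops at the first pending CPU leaves the selector on it, so the
guest can acknowledge the event without selecting again.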

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/acpi/Kconfig                        |   3 +
 hw/acpi/acpi-cpu-ospm-interface-stub.c |  41 ++
 hw/acpi/cpu_ospm_interface.c           | 747 +++++++++++++++++++++++++
 hw/acpi/meson.build                    |   2 +
 hw/acpi/trace-events                   |  17 +
 hw/arm/Kconfig                         |   1 +
 include/hw/acpi/cpu_ospm_interface.h   |  78 +++
 7 files changed, 889 insertions(+)
 create mode 100644 hw/acpi/acpi-cpu-ospm-interface-stub.c
 create mode 100644 hw/acpi/cpu_ospm_interface.c
 create mode 100644 include/hw/acpi/cpu_ospm_interface.h

diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
index 1d4e9f0845..aa52f0468f 100644
--- a/hw/acpi/Kconfig
+++ b/hw/acpi/Kconfig
@@ -21,6 +21,9 @@ config ACPI_ICH9
 config ACPI_CPU_HOTPLUG
     bool
 
+config ACPI_CPU_OSPM_INTERFACE
+    bool
+
 config ACPI_MEMORY_HOTPLUG
     bool
     select MEM_DEVICE
diff --git a/hw/acpi/acpi-cpu-ospm-interface-stub.c b/hw/acpi/acpi-cpu-ospm-interface-stub.c
new file mode 100644
index 0000000000..f6f333f641
--- /dev/null
+++ b/hw/acpi/acpi-cpu-ospm-interface-stub.c
@@ -0,0 +1,41 @@
+/*
+ * ACPI CPU OSPM Interface Handling.
+ *
+ * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
+ *
+ * Author: Salil Mehta <salil.mehta@huawei.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/acpi/cpu_ospm_interface.h"
+
+void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
+                              uint32_t event_st, Error **errp)
+{
+}
+
+void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
+                               uint32_t event_st, Error **errp)
+{
+}
+
+void acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev, Error **errp)
+{
+}
+
+void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
+                                        AcpiCpuOspmState *state,
+                                        hwaddr base_addr)
+{
+}
+
+void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st, ACPIOSTInfoList ***list)
+{
+}
diff --git a/hw/acpi/cpu_ospm_interface.c b/hw/acpi/cpu_ospm_interface.c
new file mode 100644
index 0000000000..61aab8a793
--- /dev/null
+++ b/hw/acpi/cpu_ospm_interface.c
@@ -0,0 +1,747 @@
+/*
+ * ACPI CPU OSPM Interface Handling.
+ *
+ * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
+ *
+ * Author: Salil Mehta <salil.mehta@huawei.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include "qemu/osdep.h"
+#include "migration/vmstate.h"
+#include "hw/core/cpu.h"
+#include "qapi/error.h"
+#include "trace.h"
+#include "qapi/qapi-events-acpi.h"
+#include "hw/acpi/cpu_ospm_interface.h"
+
+/* CPU identifier and resource device */
+#define CPU_NAME_FMT      "C%.03X" /* CPU name format (e.g., C001) */
+#define CPU_RES_DEVICE    "CPUR" /* CPU resource device name */
+#define CPU_DEVICE        "CPUS" /* CPUs device name */
+#define CPU_LOCK          "CPLK" /* CPU lock object */
+/* ACPI method (_STA, _EJ0, etc.) handlers */
+#define CPU_STS_METHOD    "CSTA" /* CPU status method (_STA.Enabled) */
+#define CPU_SCAN_METHOD   "CSCN" /* CPU scan method for enumeration */
+#define CPU_NOTIFY_METHOD "CTFY" /* Notify method for CPU events */
+#define CPU_EJECT_METHOD  "CEJ0" /* CPU eject method (_EJ0) */
+#define CPU_OST_METHOD    "COST" /* OSPM status reporting (_OST) */
+/* CPU MMIO region fields (in PRST region) */
+#define CPU_SELECTOR      "CSEL" /* CPU selector index (WO) */
+#define CPU_ENABLED_F     "CPEN" /* Flag: CPU enabled status (_STA) (RO) */
+#define CPU_DEVCHK_F      "CDCK" /* Flag: Device-check event (RW) */
+#define CPU_EJECTRQ_F     "CEJR" /* Flag: Eject-request event (RW) */
+#define CPU_EJECT_F       "CEJ0" /* Flag: Ejection trigger (WO) */
+#define CPU_COMMAND       "CCMD" /* Command register (WO) */
+#define CPU_DATA          "CDAT" /* Data register (RW) */
+
+/*
+ * CPU OSPM Interface MMIO Layout (Total: 16 bytes)
+ *
+ * +--------+--------+--------+--------+--------+--------+--------+--------+
+ * |  0x00  |  0x01  |  0x02  |  0x03  |  0x04  |  0x05  |  0x06  |  0x07  |
+ * +--------+--------+--------+--------+--------+--------+--------+--------+
+ * |       Selector (DWord, write-only)         | Flags  |Command |Reserved|
+ * |                                            | (RO/RW)|  (WO)  |(2B pad)|
+ * |        4 bytes (32 bits)                   | 1B     |   1B   | 2B     |
+ * +-----------------------------------------------------------------------+
+ * |  0x08  |  0x09  |  0x0A  |  0x0B  |  0x0C  |  0x0D  |  0x0E  |  0x0F  |
+ * +--------+--------+--------+--------+--------+--------+--------+--------+
+ * |                        Data (QWord, read/write)                       |
+ * |               Used by CPU scan and _OST methods (64 bits)             |
+ * +-----------------------------------------------------------------------+
+ *
+ * Field Overview:
+ *
+ * - Selector: 4 bytes @0x00 (DWord, WO)
+ *               - Selects target CPU index for the current operation.
+ * - Flags:    1 byte  @0x04 (RO/RW)
+ *               - Bit 0: ENABLED  – CPU is powered on (RO)
+ *               - Bit 1: DEVCHK   – Device-check pending; write 1 clears (RW)
+ *               - Bit 2: EJECTRQ  – Eject-request pending; write 1 clears (RW)
+ *               - Bit 3: EJECT    – Trigger CPU ejection (WO)
+ *               - Bits 4–7: Reserved (write 0)
+ * - Command:  1 byte  @0x05 (WO)
+ *               - Specifies control operation (e.g., scan, _OST event/status).
+ * - Reserved: 2 bytes @0x06–0x07
+ *               - Alignment padding; must be zero on write.
+ * - Data:     8 bytes @0x08 (QWord, RW)
+ *               - Input/output for command-specific data.
+ *               - Used by CPU scan or _OST.
+ */
+
+/*
+ * Macros defining the CPU MMIO region layout. Change field sizes here to
+ * alter the overall MMIO region size.
+ */
+/* Sub-Field sizes (in bytes) */
+#define ACPI_CPU_MR_SELECTOR_SIZE  4 /* Write-only (DWord access) */
+#define ACPI_CPU_MR_FLAGS_SIZE     1 /* Read-write (Byte access) */
+#define ACPI_CPU_MR_RES_FLAGS_SIZE 0 /* Reserved padding */
+#define ACPI_CPU_MR_CMD_SIZE       1 /* Write-only (Byte access) */
+#define ACPI_CPU_MR_RES_CMD_SIZE   2 /* Reserved padding */
+#define ACPI_CPU_MR_CMD_DATA_SIZE  8 /* Read-write (QWord access) */
+
+#define ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE \
+    MAX_CONST(ACPI_CPU_MR_CMD_DATA_SIZE, \
+    MAX_CONST(ACPI_CPU_MR_SELECTOR_SIZE, \
+    MAX_CONST(ACPI_CPU_MR_CMD_SIZE, ACPI_CPU_MR_FLAGS_SIZE)))
+
+/* Validate layout against exported total length */
+_Static_assert(ACPI_CPU_OSPM_IF_REG_LEN ==
+               (ACPI_CPU_MR_SELECTOR_SIZE +
+                ACPI_CPU_MR_FLAGS_SIZE +
+                ACPI_CPU_MR_RES_FLAGS_SIZE +
+                ACPI_CPU_MR_CMD_SIZE +
+                ACPI_CPU_MR_RES_CMD_SIZE +
+                ACPI_CPU_MR_CMD_DATA_SIZE),
+               "ACPI_CPU_OSPM_IF_REG_LEN mismatch with internal MMIO layout");
+
+/* Sub-Field sizes (in bits) */
+#define ACPI_CPU_MR_SELECTOR_SIZE_BITS \
+    (ACPI_CPU_MR_SELECTOR_SIZE * BITS_PER_BYTE)  /* Write-only (DWord Acc) */
+#define ACPI_CPU_MR_FLAGS_SIZE_BITS \
+    (ACPI_CPU_MR_FLAGS_SIZE * BITS_PER_BYTE)     /* Read-write (Byte Acc) */
+#define ACPI_CPU_MR_RES_FLAGS_SIZE_BITS \
+    (ACPI_CPU_MR_RES_FLAGS_SIZE * BITS_PER_BYTE) /* Reserved padding */
+#define ACPI_CPU_MR_CMD_SIZE_BITS \
+    (ACPI_CPU_MR_CMD_SIZE * BITS_PER_BYTE)       /* Write-only (Byte Acc) */
+#define ACPI_CPU_MR_RES_CMD_SIZE_BITS \
+    (ACPI_CPU_MR_RES_CMD_SIZE * BITS_PER_BYTE)   /* Reserved padding */
+#define ACPI_CPU_MR_CMD_DATA_SIZE_BITS \
+    (ACPI_CPU_MR_CMD_DATA_SIZE * BITS_PER_BYTE)  /* Read-write (QWord Acc) */
+
+/* Field offsets (in bytes) */
+#define ACPI_CPU_MR_SELECTOR_OFFSET_WO  0
+#define ACPI_CPU_MR_FLAGS_OFFSET_RW \
+    (ACPI_CPU_MR_SELECTOR_OFFSET_WO + \
+     ACPI_CPU_MR_SELECTOR_SIZE)
+#define ACPI_CPU_MR_CMD_OFFSET_WO \
+    (ACPI_CPU_MR_FLAGS_OFFSET_RW + \
+     ACPI_CPU_MR_FLAGS_SIZE + \
+     ACPI_CPU_MR_RES_FLAGS_SIZE)
+#define ACPI_CPU_MR_CMD_DATA_OFFSET_RW \
+    (ACPI_CPU_MR_CMD_OFFSET_WO + \
+     ACPI_CPU_MR_CMD_SIZE + \
+     ACPI_CPU_MR_RES_CMD_SIZE)
+
+/* Ensure all offsets are at their natural size alignment boundaries */
+#define STATIC_ASSERT_FIELD_ALIGNMENT(offset, type, field_name)               \
+    _Static_assert((offset) % sizeof(type) == 0,                              \
+                   field_name " is not aligned to its natural boundary")
+
+STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_SELECTOR_OFFSET_WO,
+                              uint32_t, "Selector");
+STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_FLAGS_OFFSET_RW,
+                              uint8_t, "Flags");
+STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_CMD_OFFSET_WO,
+                              uint8_t, "Command");
+STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_CMD_DATA_OFFSET_RW,
+                              uint64_t, "Command Data");
+
+/* Flag bit positions (used within 'flags' subfield) */
+#define ACPI_CPU_FLAGS_USED_BITS 4
+#define ACPI_CPU_MR_FLAGS_BIT_ENABLED BIT(0)
+#define ACPI_CPU_MR_FLAGS_BIT_DEVCHK  BIT(1)
+#define ACPI_CPU_MR_FLAGS_BIT_EJECTRQ BIT(2)
+#define ACPI_CPU_MR_FLAGS_BIT_EJECT   BIT(ACPI_CPU_FLAGS_USED_BITS - 1)
+
+#define ACPI_CPU_MR_RES_FLAG_BITS (BITS_PER_BYTE - ACPI_CPU_FLAGS_USED_BITS)
+
+enum {
+    ACPI_GET_NEXT_CPU_WITH_EVENT_CMD = 0,
+    ACPI_OST_EVENT_CMD = 1,
+    ACPI_OST_STATUS_CMD = 2,
+    ACPI_CMD_MAX
+};
+
+#define AML_APPEND_MR_RESVD_FIELD(mr_field, size_bits)       \
+    do {                                                        \
+        if ((size_bits) != 0) {                                 \
+            aml_append((mr_field), aml_reserved_field(size_bits)); \
+        }                                                       \
+    } while (0)
+
+#define AML_APPEND_MR_NAMED_FIELD(mr_field, name, size_bits)    \
+    do {                                                        \
+        if ((size_bits) != 0) {                                 \
+            aml_append((mr_field), aml_named_field((name), (size_bits))); \
+        }                                                       \
+    } while (0)
+
+#define AML_CPU_RES_DEV(base, field) \
+        aml_name("%s.%s.%s", (base), CPU_RES_DEVICE, (field))
+
+static ACPIOSTInfo *
+acpi_cpu_ospm_ost_status(int idx, AcpiCpuOspmStateStatus *cdev)
+{
+    ACPIOSTInfo *info = g_new0(ACPIOSTInfo, 1);
+
+    info->source = cdev->ost_event;
+    info->status = cdev->ost_status;
+    if (cdev->cpu) {
+        DeviceState *dev = DEVICE(cdev->cpu);
+        if (dev->id) {
+            info->device = g_strdup(dev->id);
+        }
+    }
+    return info;
+}
+
+void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st, ACPIOSTInfoList ***list)
+{
+    ACPIOSTInfoList ***tail = list;
+    int i;
+
+    for (i = 0; i < cpu_st->dev_count; i++) {
+        QAPI_LIST_APPEND(*tail, acpi_cpu_ospm_ost_status(i, &cpu_st->devs[i]));
+    }
+}
+
+static uint64_t
+acpi_cpu_ospm_intf_mr_read(void *opaque, hwaddr addr, unsigned size)
+{
+    AcpiCpuOspmState *cpu_st = opaque;
+    AcpiCpuOspmStateStatus *cdev;
+    uint64_t val = 0;
+
+    if (cpu_st->selector >= cpu_st->dev_count) {
+        return val;
+    }
+    cdev = &cpu_st->devs[cpu_st->selector];
+    switch (addr) {
+    case ACPI_CPU_MR_FLAGS_OFFSET_RW:
+        val |= qdev_check_enabled(DEVICE(cdev->cpu)) ?
+                                  ACPI_CPU_MR_FLAGS_BIT_ENABLED : 0;
+        val |= cdev->devchk_pending ? ACPI_CPU_MR_FLAGS_BIT_DEVCHK : 0;
+        val |= cdev->ejrqst_pending ? ACPI_CPU_MR_FLAGS_BIT_EJECTRQ : 0;
+        trace_acpi_cpuos_if_read_flags(cpu_st->selector, val);
+        break;
+    case ACPI_CPU_MR_CMD_DATA_OFFSET_RW:
+        switch (cpu_st->command) {
+        case ACPI_GET_NEXT_CPU_WITH_EVENT_CMD:
+            val = cpu_st->selector;
+            break;
+        default:
+            trace_acpi_cpuos_if_read_invalid_cmd_data(cpu_st->selector,
+                                                      cpu_st->command);
+            break;
+        }
+        trace_acpi_cpuos_if_read_cmd_data(cpu_st->selector, val);
+        break;
+    default:
+        break;
+    }
+    return val;
+}
+
+static void
+acpi_cpu_ospm_intf_mr_write(void *opaque, hwaddr addr, uint64_t data,
+                            unsigned int size)
+{
+    AcpiCpuOspmState *cpu_st = opaque;
+    AcpiCpuOspmStateStatus *cdev;
+    ACPIOSTInfo *info;
+
+    assert(cpu_st->dev_count);
+    if (addr) {
+        if (cpu_st->selector >= cpu_st->dev_count) {
+            trace_acpi_cpuos_if_invalid_idx_selected(cpu_st->selector);
+            return;
+        }
+    }
+
+    switch (addr) {
+    case ACPI_CPU_MR_SELECTOR_OFFSET_WO: /* current CPU selector */
+        cpu_st->selector = data;
+        trace_acpi_cpuos_if_write_idx(cpu_st->selector);
+        break;
+    case ACPI_CPU_MR_FLAGS_OFFSET_RW: /* clear events or trigger eject */
+        cdev = &cpu_st->devs[cpu_st->selector];
+        if (data & ACPI_CPU_MR_FLAGS_BIT_DEVCHK) {
+            /* clear device-check pending event */
+            cdev->devchk_pending = false;
+            trace_acpi_cpuos_if_clear_devchk_evt(cpu_st->selector);
+        } else if (data & ACPI_CPU_MR_FLAGS_BIT_EJECTRQ) {
+            /* clear eject-request pending event */
+            cdev->ejrqst_pending = false;
+            trace_acpi_cpuos_if_clear_ejrqst_evt(cpu_st->selector);
+        } else if (data & ACPI_CPU_MR_FLAGS_BIT_EJECT) {
+            DeviceState *dev = NULL;
+            if (!cdev->cpu || cdev->cpu == first_cpu) {
+                trace_acpi_cpuos_if_ejecting_invalid_cpu(cpu_st->selector);
+                break;
+            }
+            /*
+             * OSPM has evaluated _EJ0, so it is now safe to put the CPU
+             * device into the powered-off state.
+             */
+            trace_acpi_cpuos_if_ejecting_cpu(cpu_st->selector);
+            dev = DEVICE(cdev->cpu);
+            qdev_sync_disable(dev, &error_fatal);
+        }
+        break;
+    case ACPI_CPU_MR_CMD_OFFSET_WO:
+        trace_acpi_cpuos_if_write_cmd(cpu_st->selector, data);
+        if (data < ACPI_CMD_MAX) {
+            cpu_st->command = data;
+            if (cpu_st->command == ACPI_GET_NEXT_CPU_WITH_EVENT_CMD) {
+                uint32_t iter = cpu_st->selector;
+
+                do {
+                    cdev = &cpu_st->devs[iter];
+                    if (cdev->devchk_pending || cdev->ejrqst_pending) {
+                        cpu_st->selector = iter;
+                        trace_acpi_cpuos_if_cpu_has_events(cpu_st->selector,
+                            cdev->devchk_pending, cdev->ejrqst_pending);
+                        break;
+                    }
+                    iter = iter + 1 < cpu_st->dev_count ? iter + 1 : 0;
+                } while (iter != cpu_st->selector);
+            }
+        }
+        break;
+    case ACPI_CPU_MR_CMD_DATA_OFFSET_RW:
+        switch (cpu_st->command) {
+        case ACPI_OST_EVENT_CMD: {
+            cdev = &cpu_st->devs[cpu_st->selector];
+            cdev->ost_event = data;
+            trace_acpi_cpuos_if_write_ost_ev(cpu_st->selector, cdev->ost_event);
+            break;
+        }
+        case ACPI_OST_STATUS_CMD: {
+            cdev = &cpu_st->devs[cpu_st->selector];
+            cdev->ost_status = data;
+            info = acpi_cpu_ospm_ost_status(cpu_st->selector, cdev);
+            qapi_event_send_acpi_device_ost(info);
+            qapi_free_ACPIOSTInfo(info);
+            trace_acpi_cpuos_if_write_ost_status(cpu_st->selector,
+                                                 cdev->ost_status);
+            break;
+        }
+        default:
+            trace_acpi_cpuos_if_write_invalid_cmd(cpu_st->selector,
+                                                  cpu_st->command);
+            break;
+        }
+        break;
+    default:
+        trace_acpi_cpuos_if_write_invalid_offset(cpu_st->selector, addr);
+        break;
+    }
+}
+
+static const MemoryRegionOps cpu_common_mr_ops = {
+    .read = acpi_cpu_ospm_intf_mr_read,
+    .write = acpi_cpu_ospm_intf_mr_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
+        .unaligned = false,
+    },
+};
+
+void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
+                                        AcpiCpuOspmState *state,
+                                        hwaddr base_addr)
+{
+    MachineState *machine = MACHINE(qdev_get_machine());
+    MachineClass *mc = MACHINE_GET_CLASS(machine);
+    const CPUArchIdList *id_list;
+    int i;
+
+    assert(mc->possible_cpu_arch_ids);
+    id_list = mc->possible_cpu_arch_ids(machine);
+    state->dev_count = id_list->len;
+    state->devs = g_new0(typeof(*state->devs), state->dev_count);
+    for (i = 0; i < id_list->len; i++) {
+        state->devs[i].cpu = CPU(id_list->cpus[i].cpu);
+        state->devs[i].arch_id = id_list->cpus[i].arch_id;
+    }
+    memory_region_init_io(&state->ctrl_reg, owner, &cpu_common_mr_ops, state,
+                          "ACPI CPU OSPM State Interface Memory Region",
+                          ACPI_CPU_OSPM_IF_REG_LEN);
+    memory_region_add_subregion(as, base_addr, &state->ctrl_reg);
+}
+
+static AcpiCpuOspmStateStatus *
+acpi_get_cpu_status(AcpiCpuOspmState *cpu_st, DeviceState *dev)
+{
+    CPUClass *k = CPU_GET_CLASS(dev);
+    uint64_t cpu_arch_id = k->get_arch_id(CPU(dev));
+    int i;
+
+    for (i = 0; i < cpu_st->dev_count; i++) {
+        if (cpu_arch_id == cpu_st->devs[i].arch_id) {
+            return &cpu_st->devs[i];
+        }
+    }
+    return NULL;
+}
+
+void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
+                              uint32_t event_st, Error **errp)
+{
+    AcpiCpuOspmStateStatus *cdev;
+    cdev = acpi_get_cpu_status(cpu_st, dev);
+    if (!cdev) {
+        return;
+    }
+    assert(cdev->cpu);
+
+    /*
+     * Tell OSPM via GED IRQ(GSI) that a powered-off cpu is being powered-on.
+     * Also, mark 'device-check' event pending for this cpu. This will
+     * eventually result in OSPM evaluating the ACPI _EVT method and scanning
+     * the cpus.
+     */
+    cdev->devchk_pending = true;
+    acpi_send_event(cpu_st->acpi_dev, event_st);
+}
+
+void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
+                               uint32_t event_st, Error **errp)
+{
+    AcpiCpuOspmStateStatus *cdev;
+    cdev = acpi_get_cpu_status(cpu_st, dev);
+    if (!cdev) {
+        return;
+    }
+    assert(cdev->cpu);
+
+    /*
+     * Tell OSPM via GED IRQ(GSI) that a cpu is requested to power off or go on
+     * standby, and mark the 'eject-request' event pending (graceful shutdown).
+     */
+    cdev->ejrqst_pending = true;
+    acpi_send_event(cpu_st->acpi_dev, event_st);
+}
+
+void
+acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev, Error **errp)
+{
+    /* TODO: possible handling here */
+}
+
+static const VMStateDescription vmstate_cpu_ospm_state_sts = {
+    .name = "CPU OSPM state status",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_BOOL(devchk_pending, AcpiCpuOspmStateStatus),
+        VMSTATE_BOOL(ejrqst_pending, AcpiCpuOspmStateStatus),
+        VMSTATE_UINT32(ost_event, AcpiCpuOspmStateStatus),
+        VMSTATE_UINT32(ost_status, AcpiCpuOspmStateStatus),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+const VMStateDescription vmstate_cpu_ospm_state = {
+    .name = "CPU OSPM state",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]) {
+        VMSTATE_UINT32(selector, AcpiCpuOspmState),
+        VMSTATE_UINT8(command, AcpiCpuOspmState),
+        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(devs, AcpiCpuOspmState,
+                                             dev_count,
+                                             vmstate_cpu_ospm_state_sts,
+                                             AcpiCpuOspmStateStatus),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+void acpi_build_cpus_aml(Aml *table, hwaddr base_addr, const char *root,
+                         const char *event_handler_method)
+{
+    MachineState *machine = MACHINE(qdev_get_machine());
+    MachineClass *mc = MACHINE_GET_CLASS(machine);
+    const CPUArchIdList *arch_ids = mc->possible_cpu_arch_ids(machine);
+    Aml *sb_scope = aml_scope("_SB"); /* System Bus Scope */
+    Aml *ifctx, *field, *method, *cpu_res_dev, *cpus_dev;
+    Aml *zero = aml_int(0);
+    Aml *one = aml_int(1);
+
+    cpu_res_dev = aml_device("%s.%s", root, CPU_RES_DEVICE);
+    {
+        Aml *crs;
+
+        aml_append(cpu_res_dev,
+            aml_name_decl("_HID", aml_eisaid("PNP0A06")));
+        aml_append(cpu_res_dev,
+            aml_name_decl("_UID", aml_string("CPU OSPM Interface resources")));
+        aml_append(cpu_res_dev, aml_mutex(CPU_LOCK, 0));
+
+        crs = aml_resource_template();
+        aml_append(crs, aml_memory32_fixed(base_addr, ACPI_CPU_OSPM_IF_REG_LEN,
+                   AML_READ_WRITE));
+
+        aml_append(cpu_res_dev, aml_name_decl("_CRS", crs));
+
+        /* declare CPU OSPM Interface MMIO region related access fields */
+        aml_append(cpu_res_dev,
+                   aml_operation_region("PRST", AML_SYSTEM_MEMORY,
+                                        aml_int(base_addr),
+                                        ACPI_CPU_OSPM_IF_REG_LEN));
+
+        /*
+         * define named fields within PRST region with 'Byte' access widths
+         * and reserve fields with other access width
+         */
+        field = aml_field("PRST", AML_BYTE_ACC, AML_NOLOCK, AML_PRESERVE);
+        /* reserve CPU 'selector' field (size in bits) */
+        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_SELECTOR_SIZE_BITS);
+        /* Flag::Enabled Bit(RO) - Read '1' if enabled */
+        AML_APPEND_MR_NAMED_FIELD(field, CPU_ENABLED_F, 1);
+        /* Flag::Devchk Bit(RW) - Read '1', has an event. Write '1' to clear */
+        AML_APPEND_MR_NAMED_FIELD(field, CPU_DEVCHK_F, 1);
+        /* Flag::Ejectrq Bit(RW) - Read 1, has event. Write 1 to clear */
+        AML_APPEND_MR_NAMED_FIELD(field, CPU_EJECTRQ_F, 1);
+        /* Flag::Eject Bit(WO) - OSPM evals _EJx, initiates CPU Eject in QEMU */
+        AML_APPEND_MR_NAMED_FIELD(field, CPU_EJECT_F, 1);
+        /* Flag::Bit(ACPI_CPU_FLAGS_USED_BITS)-Bit(7) - Reserve leftover bits */
+        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_FLAG_BITS);
+        /* Reserved space: padding after flags */
+        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_FLAGS_SIZE_BITS);
+        /* Command field written by OSPM */
+        AML_APPEND_MR_NAMED_FIELD(field, CPU_COMMAND,
+                                  ACPI_CPU_MR_CMD_SIZE_BITS);
+        /* Reserved space: padding after command field */
+        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_CMD_SIZE_BITS);
+        /* Command data: reserved here, accessed via the Qword field below */
+        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_CMD_DATA_SIZE_BITS);
+        aml_append(cpu_res_dev, field);
+
+        /*
+         * define named fields with 'Dword' access widths and reserve fields
+         * with other access width
+         */
+        field = aml_field("PRST", AML_DWORD_ACC, AML_NOLOCK, AML_PRESERVE);
+        /* CPU selector, write only */
+        AML_APPEND_MR_NAMED_FIELD(field, CPU_SELECTOR,
+                                  ACPI_CPU_MR_SELECTOR_SIZE_BITS);
+        aml_append(cpu_res_dev, field);
+
+        /*
+         * define named fields with 'Qword' access widths and reserve fields
+         * with other access width
+         */
+        field = aml_field("PRST", AML_QWORD_ACC, AML_NOLOCK, AML_PRESERVE);
+        /*
+         * Reserve space: selector, flags, reserved flags, command, reserved
+         * command for Qword alignment.
+         */
+        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_SELECTOR_SIZE_BITS +
+                                            ACPI_CPU_MR_FLAGS_SIZE_BITS +
+                                            ACPI_CPU_MR_RES_FLAGS_SIZE_BITS +
+                                            ACPI_CPU_MR_CMD_SIZE_BITS +
+                                            ACPI_CPU_MR_RES_CMD_SIZE_BITS);
+        /* Command data accessible via Qword */
+        AML_APPEND_MR_NAMED_FIELD(field, CPU_DATA,
+                                  ACPI_CPU_MR_CMD_DATA_SIZE_BITS);
+        aml_append(cpu_res_dev, field);
+    }
+    aml_append(sb_scope, cpu_res_dev);
+
+    cpus_dev = aml_device("%s.%s", root, CPU_DEVICE);
+    {
+        Aml *ctrl_lock = AML_CPU_RES_DEV(root, CPU_LOCK);
+        Aml *cpu_selector = AML_CPU_RES_DEV(root, CPU_SELECTOR);
+        Aml *is_enabled = AML_CPU_RES_DEV(root, CPU_ENABLED_F);
+        Aml *dvchk_evt = AML_CPU_RES_DEV(root, CPU_DEVCHK_F);
+        Aml *ejrq_evt = AML_CPU_RES_DEV(root, CPU_EJECTRQ_F);
+        Aml *ej_evt = AML_CPU_RES_DEV(root, CPU_EJECT_F);
+        Aml *cpu_cmd = AML_CPU_RES_DEV(root, CPU_COMMAND);
+        Aml *cpu_data = AML_CPU_RES_DEV(root, CPU_DATA);
+        int i;
+
+        aml_append(cpus_dev, aml_name_decl("_HID", aml_string("ACPI0010")));
+        aml_append(cpus_dev, aml_name_decl("_CID", aml_eisaid("PNP0A05")));
+
+        method = aml_method(CPU_NOTIFY_METHOD, 2, AML_NOTSERIALIZED);
+        for (i = 0; i < arch_ids->len; i++) {
+            Aml *cpu = aml_name(CPU_NAME_FMT, i);
+            Aml *uid = aml_arg(0);
+            Aml *event = aml_arg(1);
+
+            ifctx = aml_if(aml_equal(uid, aml_int(i)));
+            {
+                aml_append(ifctx, aml_notify(cpu, event));
+            }
+            aml_append(method, ifctx);
+        }
+        aml_append(cpus_dev, method);
+
+        method = aml_method(CPU_STS_METHOD, 1, AML_SERIALIZED);
+        {
+            Aml *idx = aml_arg(0);
+            Aml *sta = aml_local(0);
+            Aml *else_ctx;
+
+            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
+            aml_append(method, aml_store(idx, cpu_selector));
+            aml_append(method, aml_store(zero, sta));
+            ifctx = aml_if(aml_equal(is_enabled, one));
+            {
+                /* cpu is present and enabled */
+                aml_append(ifctx, aml_store(aml_int(0xF), sta));
+            }
+            aml_append(method, ifctx);
+            else_ctx = aml_else();
+            {
+                /* cpu is present but disabled */
+                aml_append(else_ctx, aml_store(aml_int(0xD), sta));
+            }
+            aml_append(method, else_ctx);
+            aml_append(method, aml_release(ctrl_lock));
+            aml_append(method, aml_return(sta));
+        }
+        aml_append(cpus_dev, method);
+
+        method = aml_method(CPU_EJECT_METHOD, 1, AML_SERIALIZED);
+        {
+            Aml *idx = aml_arg(0);
+
+            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
+            aml_append(method, aml_store(idx, cpu_selector));
+            aml_append(method, aml_store(one, ej_evt));
+            aml_append(method, aml_release(ctrl_lock));
+        }
+        aml_append(cpus_dev, method);
+
+        method = aml_method(CPU_SCAN_METHOD, 0, AML_SERIALIZED);
+        {
+            Aml *has_event = aml_local(0); /* Local0: Loop control flag */
+            Aml *uid = aml_local(1); /* Local1: Current CPU UID */
+            /* Constants */
+            Aml *dev_chk = aml_int(1); /* Notify: device check to enable */
+            Aml *eject_req = aml_int(3); /* Notify: eject for removal */
+            Aml *next_cpu_cmd = aml_int(ACPI_GET_NEXT_CPU_WITH_EVENT_CMD);
+
+            /* Acquire CPU lock */
+            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
+
+            /* Initialize loop */
+            aml_append(method, aml_store(zero, uid));
+            aml_append(method, aml_store(one, has_event));
+
+            Aml *while_ctx = aml_while(aml_land(
+                aml_equal(has_event, one),
+                aml_lless(uid, aml_int(arch_ids->len))
+            ));
+            {
+                aml_append(while_ctx, aml_store(zero, has_event));
+                /*
+                 * Issue scan cmd: QEMU will return next CPU with event in
+                 * cpu_data
+                 */
+                aml_append(while_ctx, aml_store(uid, cpu_selector));
+                aml_append(while_ctx, aml_store(next_cpu_cmd, cpu_cmd));
+
+                /* If scan wrapped around to an earlier UID, exit loop */
+                Aml *wrap_check = aml_if(aml_lless(cpu_data, uid));
+                aml_append(wrap_check, aml_break());
+                aml_append(while_ctx, wrap_check);
+
+                /* Set UID to scanned result */
+                aml_append(while_ctx, aml_store(cpu_data, uid));
+
+                /* send CPU device-check(resume) event to OSPM */
+                Aml *if_devchk = aml_if(aml_equal(dvchk_evt, one));
+                {
+                    aml_append(if_devchk,
+                        aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
+                    /* clear local device-check event sent flag */
+                    aml_append(if_devchk, aml_store(one, dvchk_evt));
+                    aml_append(if_devchk, aml_store(one, has_event));
+                }
+                aml_append(while_ctx, if_devchk);
+
+                /*
+                 * send CPU eject-request event to OSPM to gracefully handle
+                 * OSPM related tasks running on this CPU
+                 */
+                Aml *else_ctx = aml_else();
+                Aml *if_ejrq = aml_if(aml_equal(ejrq_evt, one));
+                {
+                    aml_append(if_ejrq,
+                        aml_call2(CPU_NOTIFY_METHOD, uid, eject_req));
+                    /* clear local eject-request event sent flag */
+                    aml_append(if_ejrq, aml_store(one, ejrq_evt));
+                    aml_append(if_ejrq, aml_store(one, has_event));
+                }
+                aml_append(else_ctx, if_ejrq);
+                aml_append(while_ctx, else_ctx);
+
+                /* Increment UID */
+                aml_append(while_ctx, aml_increment(uid));
+            }
+            aml_append(method, while_ctx);
+
+            /* Release cpu lock */
+            aml_append(method, aml_release(ctrl_lock));
+        }
+        aml_append(cpus_dev, method);
+
+        method = aml_method(CPU_OST_METHOD, 4, AML_SERIALIZED);
+        {
+            Aml *uid = aml_arg(0);
+            Aml *ev_cmd = aml_int(ACPI_OST_EVENT_CMD);
+            Aml *st_cmd = aml_int(ACPI_OST_STATUS_CMD);
+
+            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
+            aml_append(method, aml_store(uid, cpu_selector));
+            aml_append(method, aml_store(ev_cmd, cpu_cmd));
+            aml_append(method, aml_store(aml_arg(1), cpu_data));
+            aml_append(method, aml_store(st_cmd, cpu_cmd));
+            aml_append(method, aml_store(aml_arg(2), cpu_data));
+            aml_append(method, aml_release(ctrl_lock));
+        }
+        aml_append(cpus_dev, method);
+
+        /* build Processor object for each processor */
+        for (i = 0; i < arch_ids->len; i++) {
+            Aml *dev;
+            Aml *uid = aml_int(i);
+
+            dev = aml_device(CPU_NAME_FMT, i);
+            aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0007")));
+            aml_append(dev, aml_name_decl("_UID", uid));
+
+            method = aml_method("_STA", 0, AML_SERIALIZED);
+            aml_append(method, aml_return(aml_call1(CPU_STS_METHOD, uid)));
+            aml_append(dev, method);
+
+            if (CPU(arch_ids->cpus[i].cpu) != first_cpu) {
+                method = aml_method("_EJ0", 1, AML_NOTSERIALIZED);
+                aml_append(method, aml_call1(CPU_EJECT_METHOD, uid));
+                aml_append(dev, method);
+            }
+
+            method = aml_method("_OST", 3, AML_SERIALIZED);
+            aml_append(method,
+                aml_call4(CPU_OST_METHOD, uid, aml_arg(0),
+                          aml_arg(1), aml_arg(2))
+            );
+            aml_append(dev, method);
+            aml_append(cpus_dev, dev);
+        }
+    }
+    aml_append(sb_scope, cpus_dev);
+    aml_append(table, sb_scope);
+
+    method = aml_method(event_handler_method, 0, AML_NOTSERIALIZED);
+    aml_append(method, aml_call0("\\_SB.CPUS." CPU_SCAN_METHOD));
+    aml_append(table, method);
+}
diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
index 73f02b9691..6d83396ab4 100644
--- a/hw/acpi/meson.build
+++ b/hw/acpi/meson.build
@@ -8,6 +8,8 @@ acpi_ss.add(files(
 ))
 acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_true: files('cpu.c', 'cpu_hotplug.c'))
 acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_false: files('acpi-cpu-hotplug-stub.c'))
+acpi_ss.add(when: 'CONFIG_ACPI_CPU_OSPM_INTERFACE', if_true: files('cpu_ospm_interface.c'))
+acpi_ss.add(when: 'CONFIG_ACPI_CPU_OSPM_INTERFACE', if_false: files('acpi-cpu-ospm-interface-stub.c'))
 acpi_ss.add(when: 'CONFIG_ACPI_MEMORY_HOTPLUG', if_true: files('memory_hotplug.c'))
 acpi_ss.add(when: 'CONFIG_ACPI_MEMORY_HOTPLUG', if_false: files('acpi-mem-hotplug-stub.c'))
 acpi_ss.add(when: 'CONFIG_ACPI_NVDIMM', if_true: files('nvdimm.c'))
diff --git a/hw/acpi/trace-events b/hw/acpi/trace-events
index edc93e703c..c0ecbdd48f 100644
--- a/hw/acpi/trace-events
+++ b/hw/acpi/trace-events
@@ -40,6 +40,23 @@ cpuhp_acpi_fw_remove_cpu(uint32_t idx) "0x%"PRIx32
 cpuhp_acpi_write_ost_ev(uint32_t slot, uint32_t ev) "idx[0x%"PRIx32"] OST EVENT: 0x%"PRIx32
 cpuhp_acpi_write_ost_status(uint32_t slot, uint32_t st) "idx[0x%"PRIx32"] OST STATUS: 0x%"PRIx32
 
+# cpu_ospm_interface.c
+acpi_cpuos_if_invalid_idx_selected(uint32_t idx) "selector idx[0x%"PRIx32"]"
+acpi_cpuos_if_read_flags(uint32_t idx, uint8_t flags) "cpu idx[0x%"PRIx32"] flags: 0x%"PRIx8
+acpi_cpuos_if_write_idx(uint32_t idx) "set active cpu idx: 0x%"PRIx32
+acpi_cpuos_if_write_cmd(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] cmd: 0x%"PRIx8
+acpi_cpuos_if_write_invalid_cmd(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] invalid cmd: 0x%"PRIx8
+acpi_cpuos_if_write_invalid_offset(uint32_t idx, uint64_t addr) "cpu idx[0x%"PRIx32"] invalid offset: 0x%"PRIx64
+acpi_cpuos_if_read_cmd_data(uint32_t idx, uint32_t data) "cpu idx[0x%"PRIx32"] data: 0x%"PRIx32
+acpi_cpuos_if_read_invalid_cmd_data(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] invalid cmd: 0x%"PRIx8
+acpi_cpuos_if_cpu_has_events(uint32_t idx, bool devchk, bool ejrqst) "cpu idx[0x%"PRIx32"] device-check pending: %d, eject-request pending: %d"
+acpi_cpuos_if_clear_devchk_evt(uint32_t idx) "cpu idx[0x%"PRIx32"]"
+acpi_cpuos_if_clear_ejrqst_evt(uint32_t idx) "cpu idx[0x%"PRIx32"]"
+acpi_cpuos_if_ejecting_invalid_cpu(uint32_t idx) "invalid cpu idx[0x%"PRIx32"]"
+acpi_cpuos_if_ejecting_cpu(uint32_t idx) "cpu idx[0x%"PRIx32"]"
+acpi_cpuos_if_write_ost_ev(uint32_t idx, uint32_t ev) "cpu idx[0x%"PRIx32"] OST Event: 0x%"PRIx32
+acpi_cpuos_if_write_ost_status(uint32_t idx, uint32_t st) "cpu idx[0x%"PRIx32"] OST Status: 0x%"PRIx32
+
 # pcihp.c
 acpi_pci_eject_slot(unsigned bsel, unsigned slot) "bsel: %u slot: %u"
 acpi_pci_unplug(int bsel, int slot) "bsel: %d slot: %d"
diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index 2aa4b5d778..c9991e00c7 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -39,6 +39,7 @@ config ARM_VIRT
     select VIRTIO_MEM_SUPPORTED
     select ACPI_CXL
     select ACPI_HMAT
+    select ACPI_CPU_OSPM_INTERFACE
 
 config CUBIEBOARD
     bool
diff --git a/include/hw/acpi/cpu_ospm_interface.h b/include/hw/acpi/cpu_ospm_interface.h
new file mode 100644
index 0000000000..5dda327a34
--- /dev/null
+++ b/include/hw/acpi/cpu_ospm_interface.h
@@ -0,0 +1,78 @@
+/*
+ * ACPI CPU OSPM Interface Handling.
+ *
+ * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
+ *
+ * Author: Salil Mehta <salil.mehta@huawei.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#ifndef CPU_OSPM_INTERFACE_H
+#define CPU_OSPM_INTERFACE_H
+
+#include "qapi/qapi-types-acpi.h"
+#include "hw/qdev-core.h"
+#include "hw/acpi/acpi.h"
+#include "hw/acpi/aml-build.h"
+#include "hw/boards.h"
+
+/**
+ * Total size (in bytes) of the ACPI CPU OSPM Interface MMIO region.
+ *
+ * This region contains control and status fields such as CPU selector,
+ * flags, command register, and data register. It must exactly match the
+ * layout defined in the AML code and the memory region implementation.
+ *
+ * Any mismatch between this definition and the AML layout may result in
+ * runtime errors or build-time assertion failures (e.g., _Static_assert),
+ * breaking correct device emulation and guest OS coordination.
+ */
+#define ACPI_CPU_OSPM_IF_REG_LEN 16
+
+typedef struct  {
+    CPUState *cpu;
+    uint64_t arch_id;
+    bool devchk_pending; /* device-check pending */
+    bool ejrqst_pending; /* eject-request pending */
+    uint32_t ost_event;
+    uint32_t ost_status;
+} AcpiCpuOspmStateStatus;
+
+typedef struct AcpiCpuOspmState {
+    DeviceState *acpi_dev;
+    MemoryRegion ctrl_reg;
+    uint32_t selector;
+    uint8_t command;
+    uint32_t dev_count;
+    AcpiCpuOspmStateStatus *devs;
+} AcpiCpuOspmState;
+
+void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
+                              uint32_t event_st, Error **errp);
+
+void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
+                               uint32_t event_st, Error **errp);
+
+void acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
+                       Error **errp);
+
+void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
+                                        AcpiCpuOspmState *state,
+                                        hwaddr base_addr);
+
+void acpi_build_cpus_aml(Aml *table, hwaddr base_addr, const char *root,
+                         const char *event_handler_method);
+
+void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st,
+                           ACPIOSTInfoList ***list);
+
+extern const VMStateDescription vmstate_cpu_ospm_state;
+#define VMSTATE_CPU_OSPM_STATE(cpuospm, state) \
+    VMSTATE_STRUCT(cpuospm, state, 1, \
+                   vmstate_cpu_ospm_state, AcpiCpuOspmState)
+#endif  /* CPU_OSPM_INTERFACE_H */
-- 
2.34.1




* [PATCH RFC V6 15/24] acpi/ged: Notify OSPM of CPU administrative state changes via GED
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (13 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 14/24] arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 16/24] arm/virt/acpi: Update ACPI DSDT Tbl to include 'Online-Capable' CPUs AML salil.mehta
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

When vCPUs are administratively enabled or disabled, the guest OSPM must be
notified so it can coordinate the corresponding operational transitions and
preserve system stability.

When a CPU is administratively enabled, GED raises a Device Check event. OSPM
then uses the ACPI _EVT handler to identify the CPU device and evaluates its
_STA, ensuring the CPU is identified, registered with the Linux device model,
enabled in the guest kernel, and made available to the scheduler.

When a CPU is administratively disabled, GED raises an Eject Request event. OSPM
again uses the ACPI _EVT handler to identify the CPU device and evaluates its
_STA, marking the CPU absent. This allows OSPM to invoke the _EJ0 path,
gracefully offload tasks, and shut down state before removal. Without this
coordination, CPUs may be forcefully removed, risking state loss or kernel
instability.
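
For illustration only, the scan that OSPM performs after such a GED event can
be modelled as below. This is a minimal sketch, not the AML or C code from this
series: CpuEventModel, NotifyRecord and cpu_scan_model() are hypothetical
names. Only the two pending flags mirror AcpiCpuOspmStateStatus, and the Notify
values are the standard ACPI Device Check (0x1) and Eject Request (0x3) codes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard ACPI Notify values for the two events discussed above. */
#define NOTIFY_DEVICE_CHECK  0x1
#define NOTIFY_EJECT_REQUEST 0x3

/* Hypothetical, reduced per-CPU status: only the two pending flags. */
typedef struct {
    bool devchk_pending; /* device-check pending */
    bool ejrqst_pending; /* eject-request pending */
} CpuEventModel;

/* Hypothetical record of one Notify() the scan would issue. */
typedef struct {
    uint32_t cpu_idx;
    uint8_t notify;
} NotifyRecord;

/*
 * Walk all CPUs, record one notification per pending event and clear the
 * flag once it has been consumed. Returns the number of records written.
 */
size_t cpu_scan_model(CpuEventModel *cpus, uint32_t ncpus,
                      NotifyRecord *out, size_t out_len)
{
    size_t n = 0;

    assert(cpus != NULL && out != NULL);
    for (uint32_t i = 0; i < ncpus; i++) {
        if (cpus[i].devchk_pending && n < out_len) {
            out[n++] = (NotifyRecord){ i, NOTIFY_DEVICE_CHECK };
            cpus[i].devchk_pending = false;
        }
        if (cpus[i].ejrqst_pending && n < out_len) {
            out[n++] = (NotifyRecord){ i, NOTIFY_EJECT_REQUEST };
            cpus[i].ejrqst_pending = false;
        }
    }
    return n;
}
```

A second call on the same array returns 0, because every consumed flag has been
cleared. That is the property that keeps one administrative change from being
reported to the guest twice.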

Platform code (e.g. Arm virt machine) calls PowerStateHandler hooks, which in
turn drive the GED callbacks. Those callbacks use ACPI events to reflect the
administrative change and let OSPM orchestrate the operational transition.
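
The path from an ACPI status bit to the GED selector bit that _EVT tests can be
sketched as follows. This is a standalone model under stated assumptions, not
the code in generic_event_device.c: ged_selector_for() and evt_calls_pscn() are
hypothetical helpers, and only two events are modelled. The constant values are
the ones used by the headers touched in this patch, with 0x2 being the
power-button bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* AcpiEventStatusBits values (acpi_dev_interface.h). */
#define ACPI_POWER_DOWN_STATUS      64
#define ACPI_CPU_POWERSTATE_STATUS  128

/* GED selector bits (generic_event_device.h). */
#define ACPI_GED_PWR_DOWN_EVT       0x2
#define ACPI_GED_CPU_POWERSTATE_EVT 0x20

/*
 * Hypothetical model of the if/else chain in the send_event hook: the first
 * matching status bit decides which selector bit is latched. Returns 0 for
 * a status this reduced model does not know about.
 */
uint32_t ged_selector_for(uint32_t ev)
{
    if (ev & ACPI_POWER_DOWN_STATUS) {
        return ACPI_GED_PWR_DOWN_EVT;
    } else if (ev & ACPI_CPU_POWERSTATE_STATUS) {
        return ACPI_GED_CPU_POWERSTATE_EVT;
    }
    return 0;
}

/*
 * Hypothetical model of the guest side: _EVT reads the selector and calls
 * the power-state scan method only when the matching bit is set.
 */
bool evt_calls_pscn(uint32_t esel)
{
    return (esel & ACPI_GED_CPU_POWERSTATE_EVT) != 0;
}
```

The selector is a bitmap, so each event type needs its own bit. That is why the
patch adds 0x20 rather than reusing the CPU hotplug bit 0x8.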

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/acpi/generic_event_device.c         | 91 ++++++++++++++++++++++++++
 hw/arm/virt.c                          |  9 ++-
 include/hw/acpi/acpi_dev_interface.h   |  1 +
 include/hw/acpi/generic_event_device.h |  6 ++
 include/hw/arm/virt.h                  |  1 +
 5 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
index 95682b79a2..4fbf5aaa20 100644
--- a/hw/acpi/generic_event_device.c
+++ b/hw/acpi/generic_event_device.c
@@ -23,11 +23,13 @@
 #include "migration/vmstate.h"
 #include "qemu/error-report.h"
 #include "system/runstate.h"
+#include "hw/powerstate.h"
 
 static const uint32_t ged_supported_events[] = {
     ACPI_GED_MEM_HOTPLUG_EVT,
     ACPI_GED_PWR_DOWN_EVT,
     ACPI_GED_NVDIMM_HOTPLUG_EVT,
+    ACPI_GED_CPU_POWERSTATE_EVT,
     ACPI_GED_CPU_HOTPLUG_EVT,
     ACPI_GED_PCI_HOTPLUG_EVT,
 };
@@ -112,6 +114,9 @@ void build_ged_aml(Aml *table, const char *name, HotplugHandler *hotplug_dev,
                 aml_append(if_ctx, aml_call0(MEMORY_DEVICES_CONTAINER "."
                                              MEMORY_SLOT_SCAN_METHOD));
                 break;
+            case ACPI_GED_CPU_POWERSTATE_EVT:
+                aml_append(if_ctx, aml_call0(AML_GED_EVT_CPUPS_SCAN_METHOD));
+                break;
             case ACPI_GED_CPU_HOTPLUG_EVT:
                 aml_append(if_ctx, aml_call0(AML_GED_EVT_CPU_SCAN_METHOD));
                 break;
@@ -302,12 +307,57 @@ static void acpi_ged_unplug_cb(HotplugHandler *hotplug_dev,
     }
 }
 
+static void
+acpi_ged_pre_poweron_cb(PowerStateHandler *handler, DeviceState *dev,
+                        Error **errp)
+{
+    AcpiGedState *s = ACPI_GED(handler);
+
+    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+        acpi_cpu_device_check_cb(&s->cpuospm_state, dev,
+                                 ACPI_CPU_POWERSTATE_STATUS, errp);
+    } else {
+        error_setg(errp, "acpi: poweron transition on unsupported device"
+                   " type %s", object_get_typename(OBJECT(dev)));
+    }
+}
+
+static void
+acpi_ged_request_poweroff_cb(PowerStateHandler *handler, DeviceState *dev,
+                             Error **errp)
+{
+    AcpiGedState *s = ACPI_GED(handler);
+
+    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+        acpi_cpu_eject_request_cb(&s->cpuospm_state, dev,
+                                  ACPI_CPU_POWERSTATE_STATUS, errp);
+    } else {
+        error_setg(errp, "acpi: poweroff transition request for unsupported"
+                   " device type: %s", object_get_typename(OBJECT(dev)));
+    }
+}
+
+static void
+acpi_ged_post_poweroff_cb(PowerStateHandler *handler, DeviceState *dev,
+                          Error **errp)
+{
+    AcpiGedState *s = ACPI_GED(handler);
+
+    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+        acpi_cpu_eject_cb(&s->cpuospm_state, dev, errp);
+    } else {
+        error_setg(errp, "acpi: post poweroff handling on unsupported device"
+                   " type %s", object_get_typename(OBJECT(dev)));
+    }
+}
+
 static void acpi_ged_ospm_status(AcpiDeviceIf *adev, ACPIOSTInfoList ***list)
 {
     AcpiGedState *s = ACPI_GED(adev);
 
     acpi_memory_ospm_status(&s->memhp_state, list);
     acpi_cpu_ospm_status(&s->cpuhp_state, list);
+    acpi_cpus_ospm_status(&s->cpuospm_state, list);
 }
 
 static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
@@ -322,6 +372,8 @@ static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
         sel = ACPI_GED_PWR_DOWN_EVT;
     } else if (ev & ACPI_NVDIMM_HOTPLUG_STATUS) {
         sel = ACPI_GED_NVDIMM_HOTPLUG_EVT;
+    } else if (ev & ACPI_CPU_POWERSTATE_STATUS) {
+        sel = ACPI_GED_CPU_POWERSTATE_EVT;
     } else if (ev & ACPI_CPU_HOTPLUG_STATUS) {
         sel = ACPI_GED_CPU_HOTPLUG_EVT;
     } else if (ev & ACPI_PCI_HOTPLUG_STATUS) {
@@ -379,6 +431,24 @@ static const VMStateDescription vmstate_cpuhp_state = {
     }
 };
 
+static bool cpuospm_needed(void *opaque)
+{
+    MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());
+
+    return mc->has_online_capable_cpus;
+}
+
+static const VMStateDescription vmstate_cpuospm_state = {
+    .name = "acpi-ged/cpu-ospm",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = cpuospm_needed,
+    .fields      = (VMStateField[]) {
+        VMSTATE_CPU_OSPM_STATE(cpuospm_state, AcpiGedState),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static const VMStateDescription vmstate_ged_state = {
     .name = "acpi-ged-state",
     .version_id = 1,
@@ -447,6 +517,7 @@ static const VMStateDescription vmstate_acpi_ged = {
     .subsections = (const VMStateDescription * const []) {
         &vmstate_memhp_state,
         &vmstate_cpuhp_state,
+        &vmstate_cpuospm_state,
         &vmstate_ghes_state,
         &vmstate_pcihp_state,
         NULL
@@ -461,6 +532,8 @@ static void acpi_ged_realize(DeviceState *dev, Error **errp)
     uint32_t ged_events;
     int i;
 
+    s->cpuospm_state.acpi_dev = dev;
+
     if (pcihp_state->use_acpi_hotplug_bridge) {
         s->ged_event_bitmap |= ACPI_GED_PCI_HOTPLUG_EVT;
     }
@@ -474,6 +547,18 @@ static void acpi_ged_realize(DeviceState *dev, Error **errp)
         }
 
         switch (event) {
+        case ACPI_GED_CPU_POWERSTATE_EVT:
+            /* Initialize the regions of the CPU OSPM interface, which is
+             * used to notify OSPM of power-on and power-off events.
+             */
+            memory_region_init(&s->container_cpuospm, OBJECT(dev),
+                               ACPI_CPUOSPM_REGION_NAME,
+                               ACPI_CPU_OSPM_IF_REG_LEN);
+            sysbus_init_mmio(sbd, &s->container_cpuospm);
+            acpi_cpu_ospm_state_interface_init(&s->container_cpuospm,
+                                               OBJECT(dev),
+                                               &s->cpuospm_state, 0);
+            break;
         case ACPI_GED_CPU_HOTPLUG_EVT:
             /* initialize CPU Hotplug related regions */
             memory_region_init(&s->container_cpuhp, OBJECT(dev),
@@ -544,6 +629,7 @@ static void acpi_ged_class_init(ObjectClass *class, const void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(class);
     HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(class);
+    PowerStateHandlerClass *pshc = POWERSTATE_HANDLER_CLASS(class);
     AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_CLASS(class);
     ResettableClass *rc = RESETTABLE_CLASS(class);
     AcpiGedClass *gedc = ACPI_GED_CLASS(class);
@@ -560,6 +646,10 @@ static void acpi_ged_class_init(ObjectClass *class, const void *data)
     resettable_class_set_parent_phases(rc, NULL, ged_reset_hold, NULL,
                                        &gedc->parent_phases);
 
+    pshc->pre_poweron = acpi_ged_pre_poweron_cb;
+    pshc->request_poweroff = acpi_ged_request_poweroff_cb;
+    pshc->post_poweroff = acpi_ged_post_poweroff_cb;
+
     adevc->ospm_status = acpi_ged_ospm_status;
     adevc->send_event = acpi_ged_send_event;
 }
@@ -573,6 +663,7 @@ static const TypeInfo acpi_ged_info = {
     .class_size    = sizeof(AcpiGedClass),
     .interfaces = (const InterfaceInfo[]) {
         { TYPE_HOTPLUG_HANDLER },
+        { TYPE_POWERSTATE_HANDLER },
         { TYPE_ACPI_DEVICE_IF },
         { }
     }
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 3980f553db..8d498708ab 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -188,6 +188,7 @@ static const MemMapEntry base_memmap[] = {
     [VIRT_PVTIME] =             { 0x090a0000, 0x00010000 },
     [VIRT_SECURE_GPIO] =        { 0x090b0000, 0x00001000 },
     [VIRT_ACPI_PCIHP] =         { 0x090c0000, ACPI_PCIHP_SIZE },
+    [VIRT_ACPI_CPUPS] =         { 0x090d0000, ACPI_CPU_OSPM_IF_REG_LEN },
     [VIRT_MMIO] =               { 0x0a000000, 0x00000200 },
     /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
     [VIRT_PLATFORM_BUS] =       { 0x0c000000, 0x02000000 },
@@ -688,9 +689,10 @@ static inline DeviceState *create_acpi_ged(VirtMachineState *vms)
 {
     DeviceState *dev;
     MachineState *ms = MACHINE(vms);
+    MachineClass *mc = MACHINE_GET_CLASS(ms);
     SysBusDevice *sbdev;
     int irq = vms->irqmap[VIRT_ACPI_GED];
-    uint32_t event = ACPI_GED_PWR_DOWN_EVT;
+    uint32_t event = ACPI_GED_PWR_DOWN_EVT | ACPI_GED_CPU_POWERSTATE_EVT;
     bool acpi_pcihp;
 
     if (ms->ram_slots) {
@@ -711,6 +713,11 @@ static inline DeviceState *create_acpi_ged(VirtMachineState *vms)
     sysbus_mmio_map_name(sbdev, ACPI_MEMHP_REGION_NAME,
                          vms->memmap[VIRT_PCDIMM_ACPI].base);
 
+    if (mc->has_online_capable_cpus) {
+        sysbus_mmio_map_name(sbdev, ACPI_CPUOSPM_REGION_NAME,
+                             vms->memmap[VIRT_ACPI_CPUPS].base);
+    }
+
     acpi_pcihp = object_property_get_bool(OBJECT(dev),
                                           ACPI_PM_PROP_ACPI_PCIHP_BRIDGE, NULL);
 
diff --git a/include/hw/acpi/acpi_dev_interface.h b/include/hw/acpi/acpi_dev_interface.h
index 68d9d15f50..eea03ca47d 100644
--- a/include/hw/acpi/acpi_dev_interface.h
+++ b/include/hw/acpi/acpi_dev_interface.h
@@ -13,6 +13,7 @@ typedef enum {
     ACPI_NVDIMM_HOTPLUG_STATUS = 16,
     ACPI_VMGENID_CHANGE_STATUS = 32,
     ACPI_POWER_DOWN_STATUS = 64,
+    ACPI_CPU_POWERSTATE_STATUS = 128,
 } AcpiEventStatusBits;
 
 #define TYPE_ACPI_DEVICE_IF "acpi-device-interface"
diff --git a/include/hw/acpi/generic_event_device.h b/include/hw/acpi/generic_event_device.h
index 2c5b055327..87e4e5e6ce 100644
--- a/include/hw/acpi/generic_event_device.h
+++ b/include/hw/acpi/generic_event_device.h
@@ -64,6 +64,7 @@
 #include "hw/acpi/ghes.h"
 #include "hw/acpi/cpu.h"
 #include "hw/acpi/pcihp.h"
+#include "hw/acpi/cpu_ospm_interface.h"
 #include "qom/object.h"
 
 #define ACPI_POWER_BUTTON_DEVICE "PWRB"
@@ -92,6 +93,7 @@ OBJECT_DECLARE_TYPE(AcpiGedState, AcpiGedClass, ACPI_GED)
 #define AML_GED_EVT_REG "EREG"
 #define AML_GED_EVT_SEL "ESEL"
 #define AML_GED_EVT_CPU_SCAN_METHOD "\\_SB.GED.CSCN"
+#define AML_GED_EVT_CPUPS_SCAN_METHOD "\\_SB.GED.PSCN"  /* Power State Scan */
 
 /*
  * Platforms need to specify the GED event bitmap
@@ -103,6 +105,7 @@ OBJECT_DECLARE_TYPE(AcpiGedState, AcpiGedClass, ACPI_GED)
 #define ACPI_GED_NVDIMM_HOTPLUG_EVT 0x4
 #define ACPI_GED_CPU_HOTPLUG_EVT    0x8
 #define ACPI_GED_PCI_HOTPLUG_EVT    0x10
+#define ACPI_GED_CPU_POWERSTATE_EVT 0x20
 
 typedef struct GEDState {
     MemoryRegion evt;
@@ -112,6 +115,7 @@ typedef struct GEDState {
 
 #define ACPI_PCIHP_REGION_NAME "pcihp container"
 #define ACPI_MEMHP_REGION_NAME "memhp container"
+#define ACPI_CPUOSPM_REGION_NAME "cpuospm container"
 
 struct AcpiGedState {
     SysBusDevice parent_obj;
@@ -121,6 +125,8 @@ struct AcpiGedState {
     MemoryRegion container_cpuhp;
     AcpiPciHpState pcihp_state;
     MemoryRegion container_pcihp;
+    AcpiCpuOspmState cpuospm_state;
+    MemoryRegion container_cpuospm;
     GEDState ged_state;
     uint32_t ged_event_bitmap;
     qemu_irq irq;
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 02cc311452..68081b79bb 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -81,6 +81,7 @@ enum {
     VIRT_NVDIMM_ACPI,
     VIRT_PVTIME,
     VIRT_ACPI_PCIHP,
+    VIRT_ACPI_CPUPS,
     VIRT_LOWMEMMAP_LAST,
 };
 
-- 
2.34.1




* [PATCH RFC V6 16/24] arm/virt/acpi: Update ACPI DSDT Tbl to include 'Online-Capable' CPUs AML
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (14 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 15/24] acpi/ged: Notify OSPM of CPU administrative state changes via GED salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 17/24] hw/arm/virt, acpi/ged: Add PowerStateHandler hooks for runtime CPU state changes salil.mehta
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

This change emits AML in DSDT to support vCPU deferred online-capability on
arm/virt. It wires the CPU OSPM coordination paths so that CPUs which are
administratively disabled at boot can be brought online later under policy,
providing hotplug-like functionality without claiming full hotplug support.

The AML connects the CPUS scan method to a GED handler so QEMU and the
guest OSPM can coordinate CPU add/remove while the VM is running (e.g.
device-check, eject-request, _EJ0, CPU scan, _OST status reporting).

It also fixes an ACPI namespace load error:
  AE_NOT_FOUND resolving \_SB.GED.PSCN
Error excerpt:
[    0.070518] ACPI BIOS Error (bug): Object does not exist: GED_
[    0.071457] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.GED.PSCN],
[    0.073084] ACPI Error: AE_NOT_FOUND, During name lookup/catalog

Root cause was build order and naming: the PSCN handler must be created
under \_SB.GED using a short ACPI 'NameSeg', and referenced elsewhere by its
fully qualified path. The GED device (and PSCN) are now defined before the CPUS
AML, preventing the early lookup failure.
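
The ordering constraint can be illustrated with a toy namespace in which an
object can only be created once its parent scope exists. This is a deliberately
simplified model and an assumption about why the order matters, not ACPICA's
actual loader: ToyNamespace, ns_exists() and ns_create() are hypothetical
names, and a failed creation merely stands in for AE_NOT_FOUND.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NS_MAX 16

/* Toy namespace: a flat list of fully qualified paths. */
typedef struct {
    const char *paths[NS_MAX];
    int count;
} ToyNamespace;

/* Return true if 'path' has already been created. */
bool ns_exists(const ToyNamespace *ns, const char *path)
{
    for (int i = 0; i < ns->count; i++) {
        if (strcmp(ns->paths[i], path) == 0) {
            return true;
        }
    }
    return false;
}

/*
 * Create an object at 'path'. The parent (everything before the last '.')
 * must already exist, otherwise the creation fails. A path without a dot
 * is treated as a root-level scope and always succeeds.
 */
bool ns_create(ToyNamespace *ns, const char *path)
{
    char parent[64];
    const char *dot = strrchr(path, '.');

    assert(ns->count < NS_MAX);
    if (dot) {
        size_t len = (size_t)(dot - path);

        if (len >= sizeof(parent)) {
            return false;
        }
        memcpy(parent, path, len);
        parent[len] = '\0';
        if (!ns_exists(ns, parent)) {
            return false;
        }
    }
    ns->paths[ns->count++] = path;
    return true;
}
```

In this model, creating "\\_SB.GED.PSCN" before "\\_SB.GED" fails, while the
order "\\_SB", "\\_SB.GED", "\\_SB.GED.PSCN" succeeds. The second order is the
one the reordered build_dsdt() produces.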

Notes:
  * CPU enumeration remains from MADT (GICC). CPU0 is Enabled; other CPUs
    may be Disabled but Online-Capable.
  * Policy (which CPUs start disabled, later enabled) is administrative
    and not decided by OSPM.

Tested: boot with EDK2/ACPI; no AE_NOT_FOUND for \_SB.GED.PSCN; generic CPU
devices register; sysfs topology group warnings do not occur.

           DSDT.dsl (Not Working)                                                    DSDT.dsl (Working)
           ---------------------                                                     ------------------

DefinitionBlock ("", "DSDT", 2, "BOCHS ", "BXPC    ", 0x00000001)        DefinitionBlock ("", "DSDT", 2, "BOCHS ", "BXPC    ", 0x00000001)
{                                                                        {
    Scope (\_SB)                                                             Scope (\_SB)
    {                                                                        {
        Scope (_SB)                                                              Device (\_SB.GED)
        {                                                                        {
            Device (\_SB.CPUR)                                                       Name (_HID, "ACPI0013")
            {                                                                        Name (_UID, "GED")
	    [...]                                                                    Name (_CRS, ResourceTemplate ()
            Device (\_SB.CPUS)                                                 	     [...]
            {                                                                        Method (_EVT, 1, Serialized)
                Name (_HID, "ACPI0010")                                              {
                Name (_CID, EisaId ("PNP0A05"))                                          Local0 = ESEL /* \_SB_.GED_.ESEL */
                Method (CTFY, 2, NotSerialized)                                          If (((Local0 & 0x02) == 0x02))
                {	                                                                 {
	    [...]                                                                             Notify (PWRB, 0x80)
                Method (CSTA, 1, Serialized)                                             }
                {
	    [...]                                                                        If (((Local0 & 0x08) == 0x08))
                Method (CEJ0, 1, Serialized)                                             {
                {                                                                            \_SB.GED.PSCN ()
	    [...]                                                                        }
                Method (CSCN, 0, Serialized)                                         }
                {                                                                }
	    [...]
                Method (COST, 4, Serialized)                                     Scope (_SB)
                {			                                         {
	    [...]                                                                    Device (\_SB.CPUR)
                Device (C000)                                                        {
                {		                                                    	  [...]
	    [...]                                                                    Device (\_SB.CPUS)
                Device (C001)                                                        {
                {                                                                         Name (_HID, "ACPI0010")
	    [...]                                                                         Name (_CID, EisaId ("PNP0A05"))
                Device (C002)                                                             Method (CTFY, 2, NotSerialized)
                {		                                                          {
	    [...]                                                                    [...]
                Device (C003)                                                             Method (CSTA, 1, Serialized)
                {                                                                         {
	    [...]                                                                    [...]
                Device (C004)                                                             Method (CEJ0, 1, Serialized)
                {		                                                          {
	    [...]                                                                    [...]
                Device (C005)                                                             Method (CSCN, 0, Serialized)
                {			                                                  {
            }                                                                        [...]
        }                                                                                 Method (COST, 4, Serialized)
                                                                                          {
        Method (\_SB.GED.PSCN, 0, NotSerialized)                                     [...]
        {                                                                                 Device (C000)
            \_SB.CPUS.CSCN ()                                                             {
        }                                                                            [...]
                                                                                          Device (C001)
        Device (COM0)                                                                     {
        {	                                                                     [...]
	    [...]                                                                         Device (C002)
                                                                                          {
        Device (\_SB.GED)                                                            [...]
        {                                                                                 Device (C003)
            Name (_HID, "ACPI0013")                                                       {
            Name (_UID, "GED")                                                       [...]
            Name (_CRS, ResourceTemplate ()                                               Device (C004)
            {	                                                                          {
	    [...]                                                                    [...]
            OperationRegion (EREG, SystemMemory, 0x09080000, 0x04)                        Device (C005)
            Field (EREG, DWordAcc, NoLock, WriteAsZeros)                                  {
            {	                                                                      }
	    [...]                                                                 }

            Method (_EVT, 1, Serialized)                                          Method (\_SB.GED.PSCN, 0, NotSerialized)
            {                                                                     {
                Local0 = ESEL                                                         \_SB.CPUS.CSCN ()
                If (((Local0 & 0x02) == 0x02))                                    }
                {
                    Notify (PWRB, 0x80)                                           Device (COM0)
                }                                                                 {
                                                                                      [...]
                If (((Local0 & 0x08) == 0x08))                               }
                {                                                        }
                    \_SB.GED.PSCN ()
                }
            }
        }

        Device (PWRB)
        {
	    [...]
    }
}

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/arm/virt-acpi-build.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 7c24dd6369..5e5acb3026 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -931,6 +931,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     VirtMachineClass *vmc = VIRT_MACHINE_GET_CLASS(vms);
     Aml *scope, *dsdt;
     MachineState *ms = MACHINE(vms);
+    MachineClass *mc = MACHINE_GET_CLASS(ms);
     const MemMapEntry *memmap = vms->memmap;
     const int *irqmap = vms->irqmap;
     AcpiTable table = { .sig = "DSDT", .rev = 2, .oem_id = vms->oem_id,
@@ -946,7 +947,30 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
      * the RTC ACPI device at all when using UEFI.
      */
     scope = aml_scope("\\_SB");
-    acpi_dsdt_add_cpus(scope, vms);
+    if (vms->acpi_dev) {
+        build_ged_aml(scope, "\\_SB."GED_DEVICE,
+                      HOTPLUG_HANDLER(vms->acpi_dev),
+                      irqmap[VIRT_ACPI_GED] + ARM_SPI_BASE, AML_SYSTEM_MEMORY,
+                      memmap[VIRT_ACPI_GED].base);
+    } else {
+        acpi_dsdt_add_gpio(scope, &memmap[VIRT_GPIO],
+                           (irqmap[VIRT_GPIO] + ARM_SPI_BASE));
+    }
+
+    /*
+     * If the machine supports deferred onlining of administratively disabled
+     * vCPUs under policy, build AML to coordinate the addition and removal of
+     * CPUs gracefully with the OSPM while the VM is running. This includes
+     * events such as device-check, eject-request, ejection (_EJ0), CPU scan,
+     * _OST status reporting, etc.
+     */
+    if (vms->acpi_dev && mc->has_online_capable_cpus) {
+        acpi_build_cpus_aml(scope, memmap[VIRT_ACPI_CPUPS].base, "\\_SB",
+                            AML_GED_EVT_CPUPS_SCAN_METHOD);
+    } else {
+        acpi_dsdt_add_cpus(scope, vms);
+    }
+
     acpi_dsdt_add_uart(scope, &memmap[VIRT_UART0],
                        (irqmap[VIRT_UART0] + ARM_SPI_BASE), 0);
     if (vms->second_ns_uart_present) {
@@ -961,15 +985,6 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
                          (irqmap[VIRT_MMIO] + ARM_SPI_BASE),
                          0, NUM_VIRTIO_TRANSPORTS);
     acpi_dsdt_add_pci(scope, memmap, irqmap[VIRT_PCIE] + ARM_SPI_BASE, vms);
-    if (vms->acpi_dev) {
-        build_ged_aml(scope, "\\_SB."GED_DEVICE,
-                      HOTPLUG_HANDLER(vms->acpi_dev),
-                      irqmap[VIRT_ACPI_GED] + ARM_SPI_BASE, AML_SYSTEM_MEMORY,
-                      memmap[VIRT_ACPI_GED].base);
-    } else {
-        acpi_dsdt_add_gpio(scope, &memmap[VIRT_GPIO],
-                           (irqmap[VIRT_GPIO] + ARM_SPI_BASE));
-    }
 
     if (vms->acpi_dev) {
         uint32_t event = object_property_get_uint(OBJECT(vms->acpi_dev),
-- 
2.34.1




* [PATCH RFC V6 17/24] hw/arm/virt, acpi/ged: Add PowerStateHandler hooks for runtime CPU state changes
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (15 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 16/24] arm/virt/acpi: Update ACPI DSDT Tbl to include 'Online-Capable' CPUs AML salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 18/24] target/arm/kvm, tcg: Handle SMCCC hypercall exits in VMM during PSCI_CPU_{ON, OFF} salil.mehta
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

An administrative power-state property was introduced earlier in this patch set,
but QEMU currently lacks a way for platforms to react to such control requests
(e.g. 'device_set ... admin-state=disable'). These host-driven changes must
drive the corresponding operational transitions and involve OSPM where
appropriate.

Summary of Handling:
===================

Since vCPUs are always enumerated as present, administrative enable must ensure
they also become operationally usable. This requires realizing the vCPU (if
enabled for the first time) or unparking it otherwise, re-registering it with
the VMState handler, adding it back to the active vCPU list, and kicking its
sleeping thread into KVM so it can transition to the guest runnable state once
the kernel issues CPU_ON. The GICC interface must also be marked accessible, and
OSPM must be notified through a Device Check event so that _EVT/_STA evaluation
can identify the CPU, register it with the Linux device model, enable it in the
guest kernel, and make it available to the scheduler.

When a CPU is administratively disabled, the virt machine invokes its
PowerStateHandler callbacks to request powering off the vCPU. As a consequence,
GED raises an Eject Request event so OSPM can invoke _EJ0 to offload tasks and
shut down state before removal. The vCPU is then quiesced, unregistered from
VMState, and removed from the active vCPU list; its sleeping thread is kicked
out of KVM and re-blocked inside QEMU, leaving the vCPU parked in userspace.
Parking in userspace helps reduce lock contention inside the kernel.

The callbacks introduced in this patch handle the above flows: they avoid
forceful removal without kernel coordination, keep firmware and GIC access in
sync, and integrate with the existing ACPI GED-based signaling.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 cpu-common.c                       |   4 +-
 hw/arm/virt.c                      | 233 ++++++++++++++++++++++++++++-
 include/hw/arm/virt.h              |   1 +
 include/hw/core/cpu.h              |   2 +
 include/hw/intc/arm_gicv3_common.h |  30 ++++
 system/cpus.c                      |   4 +-
 target/arm/cpu.c                   |   1 +
 7 files changed, 271 insertions(+), 4 deletions(-)

diff --git a/cpu-common.c b/cpu-common.c
index ef5757d23b..7eced58434 100644
--- a/cpu-common.c
+++ b/cpu-common.c
@@ -103,7 +103,9 @@ void cpu_list_remove(CPUState *cpu)
     }
 
     QTAILQ_REMOVE_RCU(&cpus_queue, cpu, node);
-    cpu->cpu_index = UNASSIGNED_CPU_INDEX;
+    if (!cpu->preserve_assigned_cpu_index) {
+        cpu->cpu_index = UNASSIGNED_CPU_INDEX;
+    }
     cpu_list_generation_id++;
 }
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 8d498708ab..9a41a0682b 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -45,6 +45,7 @@
 #include "system/device_tree.h"
 #include "system/numa.h"
 #include "system/runstate.h"
+#include "system/reset.h"
 #include "system/tpm.h"
 #include "system/tcg.h"
 #include "system/kvm.h"
@@ -91,6 +92,8 @@
 #include "hw/cxl/cxl.h"
 #include "hw/cxl/cxl_host.h"
 #include "qemu/guest-random.h"
+#include "hw/powerstate.h"
+#include "arm-powerctl.h"
 
 static GlobalProperty arm_virt_compat[] = {
     { TYPE_VIRTIO_IOMMU_PCI, "aw-bits", "48" },
@@ -1400,7 +1403,7 @@ static FWCfgState *create_fw_cfg(const VirtMachineState *vms, AddressSpace *as)
     char *nodename;
 
     fw_cfg = fw_cfg_init_mem_wide(base + 8, base, 8, base + 16, as);
-    fw_cfg_add_i16(fw_cfg, FW_CFG_NB_CPUS, (uint16_t)ms->smp.cpus);
+    fw_cfg_add_i16(fw_cfg, FW_CFG_NB_CPUS, vms->boot_cpus);
 
     nodename = g_strdup_printf("/fw-cfg@%" PRIx64, base);
     qemu_fdt_add_subnode(ms->fdt, nodename);
@@ -1821,6 +1824,179 @@ void virt_machine_done(Notifier *notifier, void *data)
     virt_build_smbios(vms);
 }
 
+static void virt_park_cpu_in_userspace(CPUState *cs)
+{
+    /* we don't want to migrate 'disabled' vCPU state (even if realized) */
+    cpu_vmstate_unregister(cs);
+    /* remove from 'present' and 'enabled' list of active vCPUs */
+    cpu_list_remove(cs);
+    /* ensure that other contexts do not kick us out of the parked state */
+    cs->parked = true;
+    /* this will kick the sleeping KVM vCPUs to QEMU; releasing vCPU mutex */
+    cpu_pause(cs);
+}
+
+static void virt_unpark_cpu_in_userspace(CPUState *cs)
+{
+    /* disabled vCPUs lack a VMStateDescription; re-register */
+    cpu_vmstate_register(cs);
+    /* add back to 'present' and 'enabled' list of active vCPUs */
+    cpu_list_add(cs);
+    /*
+     * kick back the vCPU into action; operational power-on will happen in
+     * the context of the PSCI CPU_ON executed by the guest. We are just
+     * enabling the infrastructure here and making it available to the guest.
+     */
+    cs->parked = false;
+    cpu_resume(cs);
+}
+
+static void
+virt_cpu_pre_poweron(PowerStateHandler *handler, DeviceState *dev, Error **errp)
+{
+    VirtMachineState *vms = VIRT_MACHINE(handler);
+    PowerStateHandlerClass *pshc;
+    CPUState *cs = CPU(dev);
+
+    /*
+     * Lazy realization path: bring the CPU to a realized state the first time
+     * it is powered on. Saves boot time; later power-ons skip this.
+     */
+    if (!dev->realized) {
+        qdev_realize(dev, NULL, errp);
+    } else {
+        /* Realized but parked 'disabled' vCPUs */
+        virt_unpark_cpu_in_userspace(cs);
+    }
+
+    gicv3_mark_gicc_accessible(OBJECT(vms->gic), cs->cpu_index, errp);
+    if (*errp) {
+        error_setg(errp, "couldn't mark GICC accessible for CPU %d",
+                   cs->cpu_index);
+        return;
+    }
+
+    /* update the firmware information for the next boot. */
+    vms->boot_cpus++;
+    if (vms->fw_cfg) {
+        fw_cfg_modify_i16(vms->fw_cfg, FW_CFG_NB_CPUS, vms->boot_cpus);
+    }
+
+    /*
+     * Notify the guest that a CPU is powered-on (_STA.Ena = 1), triggering a
+     * Device Check (Notify(..., 0x80)) via GED. This prompts OSPM to
+     * re-evaluate the ACPI _STA method.
+     *
+     * Only notify after the VM is ready i.e., the guest kernel is initialized.
+     * For example, during boot-time '-deviceset' usage, the kernel isn't ready,
+     * so sending a notification is pointless.
+     */
+    if (phase_check(PHASE_MACHINE_READY) &&
+        !runstate_check(RUN_STATE_INMIGRATE)) {
+        pshc = POWERSTATE_HANDLER_GET_CLASS(vms->acpi_dev);
+        pshc->pre_poweron(POWERSTATE_HANDLER(vms->acpi_dev), dev, errp);
+        if (*errp) {
+            error_setg(errp, "failed to notify OSPM about CPU %d power-on",
+                       cs->cpu_index);
+            return;
+        }
+    }
+
+    /*
+     * Guest Kernel/OSPM will issue PSCI CPU_ON, which performs the cold start
+     * (reset + entry state) for this CPU
+     */
+}
+
+static void
+virt_cpu_request_poweroff(PowerStateHandler *handler, DeviceState *dev,
+                          Error **errp)
+{
+    VirtMachineState *vms = VIRT_MACHINE(handler);
+    PowerStateHandlerClass *pshc;
+    ARMCPU *cpu = ARM_CPU(dev);
+    CPUState *cs = CPU(dev);
+
+    if (cs->cpu_index == first_cpu->cpu_index) {
+        error_setg(errp, "can't power-off boot CPU (id=%d [%d:%d:%d:%d])",
+                   first_cpu->cpu_index, cpu->socket_id, cpu->cluster_id,
+                   cpu->core_id, cpu->thread_id);
+        return;
+    }
+
+    /*
+     * Check that we are not tearing down too early when no live state exists.
+     * This can happen in:
+     *  1. Lazy device realization
+     *  2. Use of '-device-set' at qemu prompt
+     *  3. Post-migration on the destination VM
+     */
+    if (!dev->realized) {
+        return;
+    }
+
+    if (!phase_check(PHASE_MACHINE_READY) ||
+        runstate_check(RUN_STATE_INMIGRATE)) {
+        virt_park_cpu_in_userspace(cs);
+        return;
+    }
+
+    /*
+     * powering-off a CPU triggers an Eject Request (Notify(..., 0x03))
+     * via GED, prompting the OSPM to invoke _EJ0 for device removal handling.
+     */
+    pshc = POWERSTATE_HANDLER_GET_CLASS(vms->acpi_dev);
+    pshc->request_poweroff(POWERSTATE_HANDLER(vms->acpi_dev), dev, errp);
+    if (*errp) {
+        error_setg(errp, "request failed to power-off CPU %d", cs->cpu_index);
+        return;
+    }
+}
+
+static void
+virt_cpu_post_poweroff(PowerStateHandler *handler, DeviceState *dev,
+                       Error **errp)
+{
+    VirtMachineState *vms = VIRT_MACHINE(handler);
+    PowerStateHandlerClass *pshc;
+    CPUState *cs = CPU(dev);
+
+    /*
+     * Just in case we are here too early. Ignore admin power-off before
+     * realize; no live state to tear down.
+     */
+    if (!dev->realized) {
+        return;
+    }
+
+    /* we are here because OSPM has already offlined the CPU and issued _EJ0 */
+    pshc = POWERSTATE_HANDLER_GET_CLASS(vms->acpi_dev);
+    pshc->post_poweroff(POWERSTATE_HANDLER(vms->acpi_dev), dev, errp);
+    if (*errp) {
+        error_setg(errp, "failed to complete CPU %d power-off", cs->cpu_index);
+        return;
+    }
+
+    vms->boot_cpus--;
+    if (vms->fw_cfg) {
+        fw_cfg_modify_i16(vms->fw_cfg, FW_CFG_NB_CPUS, vms->boot_cpus);
+    }
+
+    gicv3_mark_gicc_inaccessible(OBJECT(vms->gic), cs->cpu_index, errp);
+    if (*errp) {
+        error_setg(errp, "couldn't mark GICC inaccessible for CPU %d",
+                   cs->cpu_index);
+        return;
+    }
+
+    /*
+     * A 'disabled' vCPU is quiesced; now park it in userspace. For KVM,
+     * this unblocks the sleeping vCPU thread and re-blocks it inside QEMU,
+     * reducing KVM vCPU lock contention.
+     */
+    virt_park_cpu_in_userspace(cs);
+}
+
 static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
 {
     uint8_t clustersz;
@@ -3218,6 +3394,53 @@ static HotplugHandler *virt_machine_get_hotplug_handler(MachineState *machine,
     return NULL;
 }
 
+static void
+virt_machine_device_request_poweroff(PowerStateHandler *handler,
+                                     DeviceState *dev,
+                                     Error **errp)
+{
+    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+        virt_cpu_request_poweroff(handler, dev, errp);
+    } else {
+        error_setg(errp, "power-off request for unsupported device-type: %s",
+                   object_get_typename(OBJECT(dev)));
+    }
+}
+
+static void
+virt_machine_device_post_poweroff(PowerStateHandler *handler, DeviceState *dev,
+                                  Error **errp)
+{
+    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+        virt_cpu_post_poweroff(handler, dev, errp);
+    } else {
+        error_setg(errp, "can't complete power-off, unsupported device-type %s",
+                   object_get_typename(OBJECT(dev)));
+    }
+}
+
+static void
+virt_machine_device_pre_poweron(PowerStateHandler *handler, DeviceState *dev,
+                                Error **errp)
+{
+    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+        virt_cpu_pre_poweron(handler, dev, errp);
+    } else {
+        error_setg(errp, "can't prepare power-on, unsupported device-type %s",
+                   object_get_typename(OBJECT(dev)));
+    }
+}
+
+static void *
+virt_machine_powerstate_handler(MachineState *machine, DeviceState *dev)
+{
+    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+        return (void *)POWERSTATE_HANDLER(machine);
+    }
+
+    return NULL;
+}
+
 /*
  * for arm64 kvm_type [7-0] encodes the requested number of bits
  * in the IPA address space
@@ -3294,6 +3517,7 @@ static void virt_machine_class_init(ObjectClass *oc, const void *data)
 {
     MachineClass *mc = MACHINE_CLASS(oc);
     HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(oc);
+    PowerStateHandlerClass *pshc = POWERSTATE_HANDLER_CLASS(oc);
     static const char * const valid_cpu_types[] = {
 #ifdef CONFIG_TCG
         ARM_CPU_TYPE_NAME("cortex-a7"),
@@ -3358,7 +3582,13 @@ static void virt_machine_class_init(ObjectClass *oc, const void *data)
     hc->unplug_request = virt_machine_device_unplug_request_cb;
     hc->unplug = virt_machine_device_unplug_cb;
 
+    /* virt machine device powerstate handlers & callbacks */
+    assert(!mc->get_powerstate_handler);
     mc->has_online_capable_cpus = true;
+    mc->get_powerstate_handler = virt_machine_powerstate_handler;
+    pshc->request_poweroff = virt_machine_device_request_poweroff;
+    pshc->post_poweroff = virt_machine_device_post_poweroff;
+    pshc->pre_poweron = virt_machine_device_pre_poweron;
 
     mc->nvdimm_supported = true;
     mc->smp_props.clusters_supported = true;
@@ -3560,6 +3790,7 @@ static const TypeInfo virt_machine_info = {
     .instance_init = virt_instance_init,
     .interfaces = (const InterfaceInfo[]) {
          { TYPE_HOTPLUG_HANDLER },
+         { TYPE_POWERSTATE_HANDLER },
          { }
     },
 };
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 68081b79bb..0898e8eed3 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -166,6 +166,7 @@ struct VirtMachineState {
     MemMapEntry *memmap;
     char *pciehb_nodename;
     const int *irqmap;
+    uint16_t boot_cpus;
     int fdt_size;
     uint32_t clock_phandle;
     uint32_t gic_phandle;
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 2ee202a8a5..ccf5588011 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -485,6 +485,7 @@ struct CPUState {
     bool created;
     bool stop;
     bool stopped;
+    bool parked;
 
     /* Should CPU start in powered-off state? */
     bool start_powered_off;
@@ -549,6 +550,7 @@ struct CPUState {
 
     /* TODO Move common fields from CPUArchState here. */
     int cpu_index;
+    bool preserve_assigned_cpu_index;
     int cluster_index;
     uint32_t tcg_cflags;
     uint32_t halted;
diff --git a/include/hw/intc/arm_gicv3_common.h b/include/hw/intc/arm_gicv3_common.h
index bbf899184e..a8a84c4687 100644
--- a/include/hw/intc/arm_gicv3_common.h
+++ b/include/hw/intc/arm_gicv3_common.h
@@ -353,4 +353,34 @@ static inline bool gicv3_gicc_accessible(Object *obj, int cpu)
 
     return value;
 }
+
+/**
+ * gicv3_mark_gicc_accessible:
+ * @obj: QOM object implementing the GICv3 device
+ * @cpu: Index of the vCPU to mark as GICC-accessible
+ * @errp: Pointer to an Error* for reporting failures
+ *
+ * Marks GICv3CPUState::gicc_accessible as accessible and available for use.
+ */
+static inline void
+gicv3_mark_gicc_accessible(Object *obj, int cpu, Error **errp)
+{
+    g_autofree gchar *propname = g_strdup_printf("gicc-accessible[%d]", cpu);
+    object_property_set_bool(obj, propname, true, errp);
+}
+
+/**
+ * gicv3_mark_gicc_inaccessible:
+ * @obj: QOM object implementing the GICv3 device
+ * @cpu: Index of the vCPU to mark as GICC-inaccessible
+ * @errp: Pointer to an Error* for reporting failures
+ *
+ * Marks GICv3CPUState::gicc_accessible as inaccessible and unavailable for use.
+ */
+static inline void
+gicv3_mark_gicc_inaccessible(Object *obj, int cpu, Error **errp)
+{
+    g_autofree gchar *propname = g_strdup_printf("gicc-accessible[%d]", cpu);
+    object_property_set_bool(obj, propname, false, errp);
+}
 #endif
diff --git a/system/cpus.c b/system/cpus.c
index 256723558d..0545aaaa0f 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -89,7 +89,7 @@ bool cpu_thread_is_idle(CPUState *cpu)
     if (cpu->stop || !cpu_work_list_empty(cpu)) {
         return false;
     }
-    if (cpu_is_stopped(cpu)) {
+    if (cpu_is_stopped(cpu) || cpu->parked) {
         return true;
     }
     if (!cpu->halted || cpu_has_work(cpu)) {
@@ -327,7 +327,7 @@ bool cpu_can_run(CPUState *cpu)
     if (cpu->stop) {
         return false;
     }
-    if (cpu_is_stopped(cpu)) {
+    if (cpu_is_stopped(cpu) || cpu->parked) {
         return false;
     }
     return true;
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index a5906d1672..0ceaf69092 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -1502,6 +1502,7 @@ static void arm_cpu_initfn(Object *obj)
     }
 
     CPU(obj)->thread_id = 0;
+    CPU(obj)->preserve_assigned_cpu_index = true;
 }
 
 /*
-- 
2.34.1




* [PATCH RFC V6 18/24] target/arm/kvm, tcg: Handle SMCCC hypercall exits in VMM during PSCI_CPU_{ON, OFF}
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (16 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 17/24] hw/arm/virt, acpi/ged: Add PowerStateHandler hooks for runtime CPU state changes salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 19/24] target/arm/cpu: Add the Accessor hook to fetch ARM CPU arch-id salil.mehta
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

To support the vCPU hotplug-like feature, any `HVC`/`SMC` `PSCI_CPU_{ON,OFF}`
hypercalls must be forwarded from the host KVM to QEMU for policy checks. This
ensures the following when a vCPU is brought online:

1. The vCPU is actually plugged in (i.e., present).
2. The vCPU is not administratively disabled. (Policy Checks)

Implement the registration and handling of `HVC`/`SMC` hypercall exits within
the VMM, ensuring that proper policy checks and control flow are enforced during
the vCPU onlining and offlining processes.
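
The two checks listed above can be sketched as a small stand-alone model. This
is only an illustration under stated assumptions: `ToyCpu`, `toy_find_cpu()`,
`toy_cpu_on()` and the `TOY_CPU_ON_*` codes are hypothetical and do not
correspond to the real PSCI return values or QEMU helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes; the real PSCI/QEMU values differ. */
enum { TOY_CPU_ON_OK = 0, TOY_CPU_ON_DENIED = -1 };

/* Hypothetical per-vCPU record: affinity ID plus administrative state. */
typedef struct ToyCpu {
    uint64_t mp_affinity;
    bool admin_enabled;
} ToyCpu;

/* Walk all *possible* slots; a NULL slot models a vCPU not plugged in. */
static const ToyCpu *toy_find_cpu(const ToyCpu *const *slots, size_t n,
                                  uint64_t id)
{
    assert(slots != NULL);
    for (size_t i = 0; i < n; i++) {
        if (slots[i] && slots[i]->mp_affinity == id) {
            return slots[i];
        }
    }
    return NULL;
}

/* Policy check performed before a CPU_ON request is honoured. */
static int toy_cpu_on(const ToyCpu *const *slots, size_t n, uint64_t id)
{
    const ToyCpu *cpu = toy_find_cpu(slots, n, id);

    if (!cpu || !cpu->admin_enabled) {
        return TOY_CPU_ON_DENIED; /* absent or administratively disabled */
    }
    return TOY_CPU_ON_OK;
}
```

The lookup deliberately iterates over every possible slot rather than only the
active list, since a disabled vCPU is not on the active list in this model.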

Co-developed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 target/arm/arm-powerctl.c   | 27 ++++++++---
 target/arm/helper.c         |  2 +-
 target/arm/internals.h      |  2 +-
 target/arm/kvm.c            | 93 +++++++++++++++++++++++++++++++++++++
 target/arm/kvm_arm.h        | 14 ++++++
 target/arm/meson.build      |  1 +
 target/arm/{tcg => }/psci.c |  9 ++++
 target/arm/tcg/meson.build  |  4 --
 8 files changed, 139 insertions(+), 13 deletions(-)
 rename target/arm/{tcg => }/psci.c (96%)

diff --git a/target/arm/arm-powerctl.c b/target/arm/arm-powerctl.c
index 20c70c7d6b..ab4422b261 100644
--- a/target/arm/arm-powerctl.c
+++ b/target/arm/arm-powerctl.c
@@ -17,6 +17,7 @@
 #include "qemu/main-loop.h"
 #include "system/tcg.h"
 #include "target/arm/multiprocessing.h"
+#include "hw/boards.h"
 
 #ifndef DEBUG_ARM_POWERCTL
 #define DEBUG_ARM_POWERCTL 0
@@ -31,14 +32,17 @@
 
 CPUState *arm_get_cpu_by_id(uint64_t id)
 {
+    MachineState *ms = MACHINE(qdev_get_machine());
     CPUState *cpu;
 
     DPRINTF("cpu %" PRId64 "\n", id);
 
-    CPU_FOREACH(cpu) {
-        ARMCPU *armcpu = ARM_CPU(cpu);
-
-        if (arm_cpu_mp_affinity(armcpu) == id) {
+    /*
+     * with vCPU standby/hotplug support, we must now check for all
+     * possible vCPUs
+     */
+    CPU_FOREACH_POSSIBLE(cpu, ms->possible_cpus) {
+        if (cpu && (arm_cpu_mp_affinity(ARM_CPU(cpu)) == id)) {
             return cpu;
         }
     }
@@ -119,9 +123,18 @@ int arm_set_cpu_on(uint64_t cpuid, uint64_t entry, uint64_t context_id,
 
     /* Retrieve the cpu we are powering up */
     target_cpu_state = arm_get_cpu_by_id(cpuid);
-    if (!target_cpu_state) {
-        /* The cpu was not found */
-        return QEMU_ARM_POWERCTL_INVALID_PARAM;
+
+    /* Policy check: verify 'administrative' power state of target CPU */
+    if (!target_cpu_state || !qdev_check_enabled(DEVICE(target_cpu_state))) {
+        /*
+         * The CPU is either not plugged in or administratively disabled.
+         * Return the appropriate value, as introduced in DEN0022E (PSCI 1.2)
+         */
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "[ARM]%s: Denying attempt to online ACPI disabled "
+                      "(_STA.Ena=0) CPU%" PRId64", needs admin action first!\n",
+                      __func__, cpuid);
+        return QEMU_ARM_POWERCTL_IS_OFF;
     }
 
     target_cpu = ARM_CPU(target_cpu_state);
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 0c1299ff84..814fe719da 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -9110,7 +9110,7 @@ void arm_cpu_do_interrupt(CPUState *cs)
                       env->exception.syndrome);
     }
 
-    if (tcg_enabled() && arm_is_psci_call(cpu, cs->exception_index)) {
+    if (arm_is_psci_call(cpu, cs->exception_index)) {
         arm_handle_psci_call(cpu);
         qemu_log_mask(CPU_LOG_INT, "...handled as PSCI call\n");
         return;
diff --git a/target/arm/internals.h b/target/arm/internals.h
index 1b3d0244fd..ffd82a7ace 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -645,7 +645,7 @@ vaddr arm_adjust_watchpoint_address(CPUState *cs, vaddr addr, int len);
 /* Callback function for when a watchpoint or breakpoint triggers. */
 void arm_debug_excp_handler(CPUState *cs);
 
-#if defined(CONFIG_USER_ONLY) || !defined(CONFIG_TCG)
+#if defined(CONFIG_USER_ONLY)
 static inline bool arm_is_psci_call(ARMCPU *cpu, int excp_type)
 {
     return false;
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 1962eb29b2..98eb6db9ed 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -529,9 +529,51 @@ int kvm_arch_get_default_type(MachineState *ms)
     return fixed_ipa ? 0 : size;
 }
 
+static bool kvm_arm_set_vm_attr(struct kvm_device_attr *attr, const char *name)
+{
+    int err;
+
+    err = kvm_vm_ioctl(kvm_state, KVM_HAS_DEVICE_ATTR, attr);
+    if (err != 0) {
+        error_report("%s: KVM_HAS_DEVICE_ATTR: %s", name, strerror(-err));
+        return false;
+    }
+
+    err = kvm_vm_ioctl(kvm_state, KVM_SET_DEVICE_ATTR, attr);
+    if (err != 0) {
+        error_report("%s: KVM_SET_DEVICE_ATTR: %s", name, strerror(-err));
+        return false;
+    }
+
+    return true;
+}
+
+int kvm_arm_set_smccc_filter(uint64_t func, uint8_t faction)
+{
+    struct kvm_smccc_filter filter = {
+        .base = func,
+        .nr_functions = 1,
+        .action = faction,
+    };
+    struct kvm_device_attr attr = {
+        .group = KVM_ARM_VM_SMCCC_CTRL,
+        .attr = KVM_ARM_VM_SMCCC_FILTER,
+        .flags = 0,
+        .addr = (uintptr_t)&filter,
+    };
+
+    if (!kvm_arm_set_vm_attr(&attr, "SMCCC Filter")) {
+        error_report("failed to set SMCCC filter in KVM Host");
+        return -1;
+    }
+
+    return 0;
+}
+
 int kvm_arch_init(MachineState *ms, KVMState *s)
 {
     int ret = 0;
+
     /* For ARM interrupt delivery is always asynchronous,
      * whether we are using an in-kernel VGIC or not.
      */
@@ -594,6 +636,22 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
     hw_breakpoints = g_array_sized_new(true, true,
                                        sizeof(HWBreakpoint), max_hw_bps);
 
+    /*
+     * To handle PSCI CPU_ON and CPU_OFF calls in QEMU, we need to install
+     * SMCCC filters in the host KVM. This is required to support features
+     * like virtual CPU hotplug on Arm platforms.
+     */
+    if (kvm_arm_set_smccc_filter(PSCI_0_2_FN64_CPU_ON,
+                                 KVM_SMCCC_FILTER_FWD_TO_USER)) {
+        error_report("CPU On PSCI-to-user-space fwd filter install failed");
+        abort();
+    }
+    if (kvm_arm_set_smccc_filter(PSCI_0_2_FN_CPU_OFF,
+                                 KVM_SMCCC_FILTER_FWD_TO_USER)) {
+        error_report("CPU Off PSCI-to-user-space fwd filter install failed");
+        abort();
+    }
+
     return ret;
 }
 
@@ -1440,6 +1498,38 @@ static bool kvm_arm_handle_debug(ARMCPU *cpu,
     return false;
 }
 
+static int kvm_arm_handle_hypercall(CPUState *cs, struct kvm_run *run)
+{
+    ARMCPU *cpu = ARM_CPU(cs);
+    CPUARMState *env = &cpu->env;
+
+    kvm_cpu_synchronize_state(cs);
+
+    /*
+     * Hard-code the immediate to 0, as we don't expect a non-zero value for
+     * now. This might change in future versions; KVM_GET_ONE_REG could then be
+     * used, but it would first have to be enhanced so that synchronization
+     * also fetches the ESR_EL2 value.
+     */
+    if (run->hypercall.flags == KVM_HYPERCALL_EXIT_SMC) {
+        cs->exception_index = EXCP_SMC;
+        env->exception.syndrome = syn_aa64_smc(0);
+    } else {
+        cs->exception_index = EXCP_HVC;
+        env->exception.syndrome = syn_aa64_hvc(0);
+    }
+    env->exception.target_el = 1;
+    bql_lock();
+    arm_cpu_do_interrupt(cs);
+    bql_unlock();
+
+    /*
+     * For PSCI, exit the kvm_run loop and process the work. Especially
+     * important if this was a CPU_OFF command and we can't return to the guest.
+     */
+    return EXCP_INTERRUPT;
+}
+
 int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
 {
     ARMCPU *cpu = ARM_CPU(cs);
@@ -1456,6 +1546,9 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
         ret = kvm_arm_handle_dabt_nisv(cpu, run->arm_nisv.esr_iss,
                                        run->arm_nisv.fault_ipa);
         break;
+    case KVM_EXIT_HYPERCALL:
+        ret = kvm_arm_handle_hypercall(cs, run);
+        break;
     default:
         qemu_log_mask(LOG_UNIMP, "%s: un-handled exit reason %d\n",
                       __func__, run->exit_reason);
diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index ec9dc95ee8..bb2dfde3af 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -216,6 +216,15 @@ bool kvm_arm_mte_supported(void);
  * Returns true if KVM can enable EL2 and false otherwise.
  */
 bool kvm_arm_el2_supported(void);
+
+/**
+ * kvm_arm_set_smccc_filter
+ * @func: SMCCC function ID to filter
+ * @faction: SMCCC filter action (handle, deny, fwd-to-user) to be deployed
+ *
+ * Sets the Arm SMCCC filter in the host KVM for selective hypercall exits
+ */
+int kvm_arm_set_smccc_filter(uint64_t func, uint8_t faction);
 #else
 
 static inline bool kvm_arm_aarch32_supported(void)
@@ -242,6 +251,11 @@ static inline bool kvm_arm_el2_supported(void)
 {
     return false;
 }
+
+static inline int kvm_arm_set_smccc_filter(uint64_t func, uint8_t faction)
+{
+    g_assert_not_reached();
+}
 #endif
 
 /**
diff --git a/target/arm/meson.build b/target/arm/meson.build
index 07d9271aa4..ae4e75c4a9 100644
--- a/target/arm/meson.build
+++ b/target/arm/meson.build
@@ -15,6 +15,7 @@ arm_system_ss.add(files(
 ))
 arm_system_ss.add(when: 'CONFIG_KVM', if_true: files('hyp_gdbstub.c', 'kvm.c'))
 arm_system_ss.add(when: 'CONFIG_HVF', if_true: files('hyp_gdbstub.c'))
+arm_system_ss.add(files('psci.c'))
 
 arm_user_ss = ss.source_set()
 arm_user_ss.add(files('cpu.c'))
diff --git a/target/arm/tcg/psci.c b/target/arm/psci.c
similarity index 96%
rename from target/arm/tcg/psci.c
rename to target/arm/psci.c
index cabed43e8a..fbd2bd2d6f 100644
--- a/target/arm/tcg/psci.c
+++ b/target/arm/psci.c
@@ -21,10 +21,13 @@
 #include "exec/helper-proto.h"
 #include "kvm-consts.h"
 #include "qemu/main-loop.h"
+#include "qemu/error-report.h"
 #include "system/runstate.h"
+#include "system/tcg.h"
 #include "internals.h"
 #include "arm-powerctl.h"
 #include "target/arm/multiprocessing.h"
+#include "exec/target_long.h"
 
 bool arm_is_psci_call(ARMCPU *cpu, int excp_type)
 {
@@ -158,6 +161,11 @@ void arm_handle_psci_call(ARMCPU *cpu)
     case QEMU_PSCI_0_1_FN_CPU_SUSPEND:
     case QEMU_PSCI_0_2_FN_CPU_SUSPEND:
     case QEMU_PSCI_0_2_FN64_CPU_SUSPEND:
+        if (!tcg_enabled()) {
+            warn_report("CPU suspend not supported in non-tcg mode");
+            break;
+        }
+#ifdef CONFIG_TCG
         /* Affinity levels are not supported in QEMU */
         if (param[1] & 0xfffe0000) {
             ret = QEMU_PSCI_RET_INVALID_PARAMS;
@@ -170,6 +178,7 @@ void arm_handle_psci_call(ARMCPU *cpu)
             env->regs[0] = 0;
         }
         helper_wfi(env, 4);
+#endif
         break;
     case QEMU_PSCI_1_0_FN_PSCI_FEATURES:
         switch (param[1]) {
diff --git a/target/arm/tcg/meson.build b/target/arm/tcg/meson.build
index 895facdc30..f4d8db0f79 100644
--- a/target/arm/tcg/meson.build
+++ b/target/arm/tcg/meson.build
@@ -49,10 +49,6 @@ arm_ss.add(when: 'TARGET_AARCH64', if_true: files(
   'sve_helper.c',
 ))
 
-arm_system_ss.add(files(
-  'psci.c',
-))
-
 arm_system_ss.add(when: 'CONFIG_ARM_V7M', if_true: files('cpu-v7m.c'))
 arm_user_ss.add(when: 'TARGET_AARCH64', if_false: files('cpu-v7m.c'))
 
-- 
2.34.1




* [PATCH RFC V6 19/24] target/arm/cpu: Add the Accessor hook to fetch ARM CPU arch-id
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (17 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 18/24] target/arm/kvm, tcg: Handle SMCCC hypercall exits in VMM during PSCI_CPU_{ON, OFF} salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 20/24] target/arm/kvm: Write vCPU's state back to KVM on cold-reset salil.mehta
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

The ACPI callbacks `acpi_cpu_{device_check,eject_request}_cb()` use the
`get_cpu_status()` API to get the existing `AcpiCpuOspmStateStatus` of the CPU
being onlined or offlined after the VM has initialized. The latter uses
`CPUClass::get_arch_id` to match the CPU. Hence, we must add an Arm CPU
architecture-specific accessor hook to fetch the `mp-affinity` programmed in the
KVM host.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 target/arm/cpu.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 0ceaf69092..d147e786c1 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -2744,6 +2744,11 @@ static const TCGCPUOps arm_tcg_ops = {
 };
 #endif /* CONFIG_TCG */
 
+static int64_t arm_cpu_get_arch_id(CPUState *cs)
+{
+    return arm_cpu_mp_affinity(ARM_CPU(cs));
+}
+
 static void arm_cpu_class_init(ObjectClass *oc, const void *data)
 {
     ARMCPUClass *acc = ARM_CPU_CLASS(oc);
@@ -2763,6 +2768,7 @@ static void arm_cpu_class_init(ObjectClass *oc, const void *data)
     cc->dump_state = arm_cpu_dump_state;
     cc->set_pc = arm_cpu_set_pc;
     cc->get_pc = arm_cpu_get_pc;
+    cc->get_arch_id = arm_cpu_get_arch_id;
     cc->gdb_read_register = arm_cpu_gdb_read_register;
     cc->gdb_write_register = arm_cpu_gdb_write_register;
 #ifndef CONFIG_USER_ONLY
-- 
2.34.1




* [PATCH RFC V6 20/24] target/arm/kvm: Write vCPU's state back to KVM on cold-reset
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (18 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 19/24] target/arm/cpu: Add the Accessor hook to fetch ARM CPU arch-id salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 21/24] hw/intc/arm-gicv3-kvm: Pause all vCPUs & cache ICC_CTLR_EL1 for userspace PSCI CPU_ON salil.mehta
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)

From: Jean-Philippe Brucker <jean-philippe@linaro.org>

Previously, all `PSCI_CPU_{ON, OFF}` calls were handled directly by KVM.
However, with the introduction of this new vCPU hotplug-like feature, these
hypervisor calls are now trapped to QEMU for policy checks. This shift can lead
to inconsistent vCPU states between KVM and QEMU, particularly when the vCPU has
been recently administratively enabled and is transitioning either from the
unparked state in QOM (due to 'lazy realization') or from the 'powered-off'
state. Therefore, it is crucial to synchronize the vCPU state with KVM,
especially in the context of a cold reset of the QOM vCPU. The same applies when
PSCI CPU_OFF is handled by QEMU: it must ensure that the KVM vCPUs are
powered off as well.

To ensure this synchronization, mark the QOM vCPU as "dirty" to trigger a call
to `kvm_arch_put_registers()`. This guarantees that KVM’s `MP_STATE` is updated
accordingly, forcing synchronization of the `mp_state` between QEMU and KVM.
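
The dirty-flag mechanism described above can be sketched in isolation as
follows. This is only an illustration, not QEMU code: the `ToyVcpu` type and
the `toy_*` helpers are hypothetical stand-ins for CPUState,
kvm_arch_put_registers() and the synchronization done before re-entering the
guest.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vCPU tracked in both userspace and kernel. */
typedef struct ToyVcpu {
    int user_mp_state;    /* userspace (QEMU-like) view of the power state */
    int kernel_mp_state;  /* kernel (KVM-like) view of the power state */
    bool dirty;           /* userspace copy is newer than the kernel copy */
} ToyVcpu;

enum { TOY_RUNNABLE = 0, TOY_STOPPED = 1 };

/* Push the userspace state to the kernel side and clear the dirty flag. */
static void toy_put_registers(ToyVcpu *v)
{
    v->kernel_mp_state = v->user_mp_state;
    v->dirty = false;
}

/* Called before (re)entering the guest: sync only when marked dirty. */
static void toy_sync_before_run(ToyVcpu *v)
{
    if (v->dirty) {
        toy_put_registers(v);
    }
}

/* Userspace-handled power-off: update the local state and mark it dirty. */
static void toy_power_off(ToyVcpu *v)
{
    v->user_mp_state = TOY_STOPPED;
    v->dirty = true;
}
```

Without the `dirty = true` assignment in toy_power_off(), the kernel-side copy
would keep reporting the vCPU as runnable, which is the kind of inconsistency
this patch is meant to avoid.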

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 target/arm/arm-powerctl.c | 1 +
 target/arm/kvm.c          | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/target/arm/arm-powerctl.c b/target/arm/arm-powerctl.c
index ab4422b261..89074918a9 100644
--- a/target/arm/arm-powerctl.c
+++ b/target/arm/arm-powerctl.c
@@ -263,6 +263,7 @@ static void arm_set_cpu_off_async_work(CPUState *target_cpu_state,
 
     assert(bql_locked());
     target_cpu->power_state = PSCI_OFF;
+    target_cpu_state->vcpu_dirty = true;
     target_cpu_state->halted = 1;
     target_cpu_state->exception_index = EXCP_HLT;
 }
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 98eb6db9ed..c4b68a0b17 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1026,6 +1026,7 @@ bool kvm_arm_cpu_post_load(ARMCPU *cpu)
 void kvm_arm_reset_vcpu(ARMCPU *cpu)
 {
     int ret;
+    CPUState *cs = CPU(cpu);
 
     /* Re-init VCPU so that all registers are set to
      * their respective reset values.
@@ -1047,6 +1048,12 @@ void kvm_arm_reset_vcpu(ARMCPU *cpu)
      * for the same reason we do so in kvm_arch_get_registers().
      */
     write_list_to_cpustate(cpu);
+
+    /*
+     * Ensure we call kvm_arch_put_registers(). The vCPU isn't marked dirty if
+     * it was parked in KVM and is now booting from a PSCI CPU_ON call.
+     */
+    cs->vcpu_dirty = true;
 }
 
 void kvm_arm_create_host_vcpu(ARMCPU *cpu)
-- 
2.34.1




* [PATCH RFC V6 21/24] hw/intc/arm-gicv3-kvm: Pause all vCPUs & cache ICC_CTLR_EL1 for userspace PSCI CPU_ON
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (19 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 20/24] target/arm/kvm: Write vCPU's state back to KVM on cold-reset salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01  1:01 ` [PATCH RFC V6 22/24] monitor, qdev: Introduce 'device_set' to change admin state of existing devices salil.mehta
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)

From: Salil Mehta <salil.mehta@huawei.com>

Problem:
=======

When PSCI CPU_ON was handled entirely in KVM, the operation executed under
VGIC/KVM locks at EL2 and appeared atomic to other vCPU threads (intermediate
states were not observable). With the SMCCC forward-to-userspace filter enabled,
PSCI ON/OFF calls now exit to QEMU, where policy checks are performed.

In the userspace CPU_ON handling (during cpu_reset?), QEMU must perform IOCTLs
to fetch ICC_CTLR_EL1 fields that reflect supported features and IRQ-related
configuration (e.g. EOImode, PMHE, CBPR). While these IOCTLs are in flight,
other vCPUs can run and cause transient inconsistency. KVM enforces atomicity by
trying to take all vCPU locks (kvm_trylock_all_vcpus() -> -EBUSY). QEMU
therefore pauses all vCPUs before issuing these IOCTLs to avoid contending for
locks and to prevent -EBUSY failures during cpu_reset.

KVM Details: (As I understand and stand ready to be corrected! :))

Userspace fetch of sysreg ICC_CTLR_EL1 results in access of ICH_VMCR_EL2 reg.
VMCR is per-vCPU and controls the CPU interface. Pending state is recorded in
the distributor for SPIs, and in each redistributor for SGIs and PPIs. Delivery
to the PE depends on the CPU interface configuration (VMCR fields such as PMR,
IGRPEN, EOImode, BPR). Updates to VMCR must therefore be applied atomically with
respect to interrupt injection and deactivation. The KVM ioctl layer first
attempts to lock all vCPU mutexes, and only then takes the VM lock before
calling vgic_v3_attr_regs_access(). This ordering serializes userspace accesses
with IRQ handling (IAR/EOI and SGI delivery?).

ICC_CTLR_EL1 initially reflects architectural defaults (e.g. EOImode, PMR).
Most fields are read-only feature indicators that never change. Writable
fields such as EOImode, PMHE and CBPR are configured once by the guest GICv3
driver and then remain pseudo-static. Both the initial defaults and the
guest-configured values can be cached and reused across resets, avoiding
repeated VM-wide pauses to fetch ICC_CTLR_EL1 from KVM on every cpu_reset().

Appendix: ICC_CTLR_EL1 layout (for reviewers)
=============================================

ICC_CTLR_EL1 [63:0]

  63                                                                        32
 +----------------------------------------------------------------------------+
 |                                   RES0                                     |
 +----------------------------------------------------------------------------+
  31        20 19 18 17 16 15 14 13 12 11 10   9   8  7  6   5  4  3  2  1  0
 +------------+--+--+--+--+--+--+--+--+--+---+---+---+--+--+--+--+--+--+--+--+
 |    RES0    |Ex|RS|RES0 |A3|SE| IDbits |  PRIbits  |R0|PM|  RES0     |EO|CB|
 +------------+--+--+--+--+--+--+--+--+--+---+---+---+--+--+--+--+--+--+--+--+
              |  |        |  |                       |   |             |  |
              |  |        |  |                       |   |             |  +CBPR
              |  |        |  |                       |   |             +EOImode
              |  |        |  |                       |   +-PMHE
              |  |        |  |                       +----RES0
              |  |        |  +--SEIS
              |  |        +-----A3V
              |  +--------------RSS
              +-----------------ExtRange

 Access: {Ex, RS, A3, SE, IDbits, PRIbits} = RO;
         {PMHE} = RW*;
         {EO, CB} = RW**;
         others = RES0.
 Notes : * impl-def (may be RO when DS=0)
         ** CB may be RO when DS=0 (EO stays RW)

 Source: Arm GIC Architecture Specification (IHI 0069H.b),
         §12.2.6 “ICC_CTLR_EL1”, pp. 12-233…12-237
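
For reviewers who prefer code to diagrams, the layout above can be expressed as
a small field-extraction sketch. The macro and helper names below are
hypothetical (they are not the definitions used in QEMU's gicv3_internal.h);
only the bit positions are taken from the diagram.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the layout diagram above (names are made up). */
#define ICC_CTLR_CBPR_SHIFT     0
#define ICC_CTLR_EOIMODE_SHIFT  1
#define ICC_CTLR_PMHE_SHIFT     6
#define ICC_CTLR_PRIBITS_SHIFT  8   /* 3 bits: [10:8] */
#define ICC_CTLR_IDBITS_SHIFT   11  /* 3 bits: [13:11] */
#define ICC_CTLR_SEIS_SHIFT     14
#define ICC_CTLR_A3V_SHIFT      15
#define ICC_CTLR_RSS_SHIFT      18
#define ICC_CTLR_EXTRANGE_SHIFT 19

/* Extract 'len' bits of 'reg' starting at bit 'shift'. */
static inline uint64_t icc_ctlr_field(uint64_t reg, unsigned shift,
                                      unsigned len)
{
    return (reg >> shift) & ((UINT64_C(1) << len) - 1);
}

/* Mask of the potentially guest-writable bits: PMHE, EOImode and CBPR. */
static inline uint64_t icc_ctlr_rw_mask(void)
{
    return (UINT64_C(1) << ICC_CTLR_PMHE_SHIFT) |
           (UINT64_C(1) << ICC_CTLR_EOIMODE_SHIFT) |
           (UINT64_C(1) << ICC_CTLR_CBPR_SHIFT);
}
```

Everything outside icc_ctlr_rw_mask() is either RO or RES0, which is why the
cached copy is expected to stay valid once the guest driver has configured
these three bits.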

Resets that may trigger ICC_CTLR_EL1 fetch include:
  1. PSCI CPU_ON
  2. qemu_system_reset() during full VM reset
  3. Post-load path on migration
  4. Lazy realization via device_set/-deviceset

It can be expensive to pause the entire VM just to reset one vCPU, especially
for long-lived workloads where hundreds of resets may occur. For such systems,
frequent VM-wide pauses are unacceptable.

Solution:
========

This patch caches ICC_CTLR_EL1 early, seeding it either from architectural
defaults or on the first PSCI CPU_ON when the guest GICv3 driver has initialized
the interface. The cached value is then reused on every cpu_reset(), avoiding
repeated VM-wide pauses and heavy IOCTLs. The IOCTL path is retained only as a
fallback if the cached shadow is not valid.
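
The cache-then-fallback idea can be sketched on its own as below. This is a
simplified illustration under stated assumptions, not the code in this patch:
`ShadowReg`, `shadow_reg_get()` and `demo_fetch()` are hypothetical names, the
slow fetch is abstracted behind a callback, and the sketch returns an error
instead of pausing vCPUs or aborting.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU shadow of a rarely changing register. */
typedef struct ShadowReg {
    uint64_t value;
    bool valid;
} ShadowReg;

/* Hypothetical slow fetch; returns 0 or a negative errno such as -EBUSY. */
typedef int (*slow_fetch_fn)(void *opaque, uint64_t *out);

/*
 * Return the cached value when valid; otherwise fetch it, retrying up to
 * 'max_tries' times on -EBUSY/-EAGAIN, and seed the cache on success.
 */
static int shadow_reg_get(ShadowReg *s, slow_fetch_fn fetch, void *opaque,
                          int max_tries, uint64_t *out)
{
    int ret = -EBUSY;
    int i;

    if (s->valid) {
        *out = s->value;          /* fast path: no fetch at all */
        return 0;
    }
    for (i = 0; i < max_tries; i++) {
        ret = fetch(opaque, &s->value);
        if (ret == 0) {
            s->valid = true;
            *out = s->value;
            return 0;
        }
        if (ret != -EBUSY && ret != -EAGAIN) {
            break;                /* hard error: do not retry */
        }
    }
    return ret;
}

/* Demo fetcher: returns -EBUSY until the counter in 'opaque' reaches zero. */
static int demo_fetch(void *opaque, uint64_t *out)
{
    int *busy_left = opaque;

    if (*busy_left > 0) {
        (*busy_left)--;
        return -EBUSY;
    }
    *out = 0x443;
    return 0;
}
```

Once the shadow is valid, later calls never reach the fetch callback, which is
the property that keeps the per-reset cost independent of the number of
running vCPUs.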

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/intc/arm_gicv3_kvm.c            | 93 ++++++++++++++++++++++++++++--
 include/hw/intc/arm_gicv3_common.h | 10 ++++
 target/arm/arm-powerctl.c          |  1 +
 target/arm/cpu.h                   |  1 +
 4 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/hw/intc/arm_gicv3_kvm.c b/hw/intc/arm_gicv3_kvm.c
index e97578f59a..62d6016e8a 100644
--- a/hw/intc/arm_gicv3_kvm.c
+++ b/hw/intc/arm_gicv3_kvm.c
@@ -27,6 +27,7 @@
 #include "qemu/module.h"
 #include "system/kvm.h"
 #include "system/runstate.h"
+#include "system/cpus.h"
 #include "kvm_arm.h"
 #include "gicv3_internal.h"
 #include "vgic_common.h"
@@ -681,13 +682,73 @@ static void kvm_arm_gicv3_get(GICv3State *s)
     }
 }
 
+/* Caller must hold the iothread (BQL). */
+static inline void
+kvm_gicc_get_cached_icc_ctlr_el1(GICv3CPUState *c, uint64_t regval[2],
+                                      bool *valid)
+{
+    const uint64_t attr = (uint64_t)KVM_VGIC_ATTR(ICC_CTLR_EL1, c->gicr_typer);
+    const int group = KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS;
+    GICv3State *s = c->gic;
+    uint64_t val = 0;
+    int ret;
+
+    assert(regval && valid);
+
+    if (*valid) {
+        /* Fast path: return cached (no vCPU pausing required). */
+        c->icc_ctlr_el1[GICV3_NS] = regval[GICV3_NS];
+        c->icc_ctlr_el1[GICV3_S] = regval[GICV3_S];
+        return;
+    }
+
+    ret = kvm_device_access(s->dev_fd, group, attr, &val, false, NULL);
+    if (ret == -EBUSY || ret == -EAGAIN) {
+        int tries;
+
+        /* One-time heavy path: avoid contention by pausing all vCPUs. */
+        pause_all_vcpus();
+        /*
+         * Even with vCPUs paused, we cannot fully rule out a non-vCPU context
+         * temporarily holding KVM vCPU mutexes; treat -EBUSY/-EAGAIN as
+         * transient and retry a few times. Final attempt aborts in-loop.
+         */
+        for (tries = 0; tries < 5; tries++) {
+            Error **errp = (tries == 4) ? &error_abort : NULL;
+
+            ret = kvm_device_access(s->dev_fd, group, attr, &val, false, errp);
+            if (!ret) {
+                break;
+            }
+            if (ret != -EBUSY && ret != -EAGAIN) {
+               error_setg_errno(&error_abort, -ret,
+                                "KVM_GET_DEVICE_ATTR failed: Group %d "
+                                "attr 0x%016" PRIx64, group, attr);
+               /* not reached */
+            }
+            g_usleep(50);
+        }
+        resume_all_vcpus();
+    }
+
+    /* Success: publish and seed cache. */
+    c->icc_ctlr_el1[GICV3_NS] = val;
+    c->icc_ctlr_el1[GICV3_S] = val;
+
+    regval[GICV3_NS] = c->icc_ctlr_el1[GICV3_NS];
+    regval[GICV3_S] = c->icc_ctlr_el1[GICV3_S];
+    *valid = true;
+}
+
 static void arm_gicv3_icc_reset(CPUARMState *env, const ARMCPRegInfo *ri)
 {
     GICv3State *s;
     GICv3CPUState *c;
+    ARMCPU *cpu;
 
     c = (GICv3CPUState *)env->gicv3state;
     s = c->gic;
+    cpu = ARM_CPU(c->cpu);
 
     c->icc_pmr_el1 = 0;
     /*
@@ -713,11 +774,33 @@ static void arm_gicv3_icc_reset(CPUARMState *env, const ARMCPRegInfo *ri)
     }
 
     /* Initialize to actual HW supported configuration */
-    kvm_device_access(s->dev_fd, KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS,
-                      KVM_VGIC_ATTR(ICC_CTLR_EL1, c->gicr_typer),
-                      &c->icc_ctlr_el1[GICV3_NS], false, &error_abort);
-
-    c->icc_ctlr_el1[GICV3_S] = c->icc_ctlr_el1[GICV3_NS];
+    /*
+     * Avoid racy VGIC CPU sysreg reads while vCPUs are running. KVM requires
+     * pausing all vCPUs for ICC_* sysregs accesses to prevent races with
+     * in-flight IRQ delivery (e.g. EOImode etc.).
+     *
+     * To keep the reset path fast, cache the architectural default and the
+     * guest GICv3 driver configured ICC_CTLR_EL1 on the first access and then
+     * reuse that for subsequent resets. Most fields in this register are
+     * invariants throughout the life of VM. Fields EOImode, PMHE and CBPR are
+     * pseudo-static and don't change once configured by the guest driver.
+     */
+    if (cpu->first_psci_on_request_seen || s->guest_gicc_initialized) {
+        if (!s->guest_gicc_initialized) {
+            s->guest_gicc_initialized = true;
+        }
+        kvm_gicc_get_cached_icc_ctlr_el1(c, c->icc_ctlr_configured,
+                                         &c->icc_ctlr_configured_valid);
+    } else {
+        /*
+         * Kernel has not loaded yet. It is safe to assume no other vCPU is in
+         * KVM_RUN except vCPU 0 at this moment. Just in case some other
+         * privileged KVM context is accessing the register, the KVM device
+         * access can potentially return -EBUSY.
+         */
+        kvm_gicc_get_cached_icc_ctlr_el1(c, c->icc_ctlr_arch_def,
+                                         &c->icc_ctlr_arch_def_valid);
+    }
 }
 
 static void kvm_arm_gicv3_reset_hold(Object *obj, ResetType type)
diff --git a/include/hw/intc/arm_gicv3_common.h b/include/hw/intc/arm_gicv3_common.h
index a8a84c4687..0282a94edc 100644
--- a/include/hw/intc/arm_gicv3_common.h
+++ b/include/hw/intc/arm_gicv3_common.h
@@ -165,6 +165,15 @@ struct GICv3CPUState {
     uint64_t icc_apr[3][4];
     uint64_t icc_igrpen[3];
     uint64_t icc_ctlr_el3;
+    /*
+     * Shadow copy of ICC_CTLR_EL1 architectural default. Fetched once per-vCPU
+     * when no vCPUs are running, and reused on reset to avoid calling
+     * kvm_device_access() in the hot path.
+     */
+    uint64_t icc_ctlr_arch_def[2]; /* per-secstate (NS=0,S=1) */
+    bool icc_ctlr_arch_def_valid;
+    uint64_t icc_ctlr_configured[2];
+    bool icc_ctlr_configured_valid;
     bool gicc_accessible;
 
     /* Virtualization control interface */
@@ -240,6 +249,7 @@ struct GICv3State {
     bool force_8bit_prio;
     bool irq_reset_nonsecure;
     bool gicd_no_migration_shift_bug;
+    bool guest_gicc_initialized;
 
     int dev_fd; /* kvm device fd if backed by kvm vgic support */
     Error *migration_blocker;
diff --git a/target/arm/arm-powerctl.c b/target/arm/arm-powerctl.c
index 89074918a9..0b65898cec 100644
--- a/target/arm/arm-powerctl.c
+++ b/target/arm/arm-powerctl.c
@@ -68,6 +68,7 @@ static void arm_set_cpu_on_async_work(CPUState *target_cpu_state,
     ARMCPU *target_cpu = ARM_CPU(target_cpu_state);
     struct CpuOnInfo *info = (struct CpuOnInfo *) data.host_ptr;
 
+    target_cpu->first_psci_on_request_seen = true;
     /* Initialize the cpu we are turning on */
     cpu_reset(target_cpu_state);
     arm_emulate_firmware_reset(target_cpu_state, info->target_el);
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index cd5982d362..603e482b3a 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -974,6 +974,7 @@ struct ArchCPU {
 
     /* Current power state, access guarded by BQL */
     ARMPSCIState power_state;
+    bool first_psci_on_request_seen;
 
     /* CPU has virtualization extension */
     bool has_el2;
-- 
2.34.1




* [PATCH RFC V6 22/24] monitor, qdev: Introduce 'device_set' to change admin state of existing devices
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (20 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 21/24] hw/intc/arm-gicv3-kvm: Pause all vCPUs & cache ICC_CTLR_EL1 for userspace PSCI CPU_ON salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-09  8:55   ` [PATCH RFC V6 22/24] monitor,qdev: " Markus Armbruster
  2025-10-01  1:01 ` [PATCH RFC V6 23/24] monitor, qapi: add 'info cpus-powerstate' and QMP query (Admin + Oper states) salil.mehta
                   ` (4 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)

From: Salil Mehta <salil.mehta@huawei.com>

This patch adds a "device_set" interface for modifying properties of devices
that already exist in the guest topology. Unlike 'device_add'/'device_del'
(hot-plug), 'device_set' does not create or destroy devices. It is intended
for guest-visible hot-add semantics where hardware is provisioned at boot but
logically enabled/disabled later via administrative policy.

Compared to the existing 'qom-set' command, which is less intuitive and works
only with object IDs, device_set provides a more device-oriented interface.
It can be invoked at the QEMU prompt using natural device arguments, and the
new '-deviceset' CLI option allows properties to be set at boot time, similar
to how '-device' specifies device creation.

While the initial implementation focuses on "admin-state" changes (e.g.,
enable/disable a CPU already described by ACPI/DT), the interface is designed
to be generic. In future, it could be used for other per-device set/unset
style controls (beyond administrative power-states), provided the target
device explicitly allows such changes. This enables fine-grained runtime
control of device properties.

Key pieces:
  * QMP: qmp_device_set() to update an existing device. The device can be
    located by "id" or via driver+property match using a DeviceListener
    callback (qdev_find_device()).
  * HMP: "device_set" command with tab-completion. Errors are surfaced via
    hmp_handle_error().
  * CLI: "-deviceset" option for setting startup/admin properties at boot,
    including a JSON form. Options are parsed into qemu_deviceset_opts and
    applied after device creation.
  * Docs/help: HMP help text and qemu-options.hx additions explain usage and
    explicitly note that no hot-plug occurs.
  * Safety: disallowed during live migration (migration_is_idle() check).

Semantics:
  * Operates on an existing DeviceState; no new device is created or
    enumerated by the guest.
  * Complements device_add/device_del by providing state mutation only.
  * Backward compatible: no behavior change unless "device_set"/"-deviceset"
    is used.

Examples:
  HMP:
    (qemu) device_set host-arm-cpu,core-id=3,admin-state=enable

  CLI (at boot):
    -smp cpus=4,maxcpus=4 \
    -deviceset host-arm-cpu,core-id=2,admin-state=disable

  QMP (JSON form):
    { "execute": "device_set",
      "arguments": {
        "driver": "host-arm-cpu",
        "core-id": 1,
        "admin-state": "disable"
      }
    }

NOTE: The qdev_enable()/qdev_disable() hooks for acting on admin-state will be
added in subsequent patches. Device classes must explicitly support any
property they want to expose through device_set.
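
For the driver+property lookup used in the examples above, the topology IDs
are mapped to a linear CPU index with t + T*(c + C*(k + K*s)), as done in
virt_find_cpu() in this patch. A standalone sketch of that mapping, using the
hypothetical `ToyTopo` type and `toy_cpu_index()` helper in place of
MachineState and the real lookup, could look like:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the machine's SMP topology. */
typedef struct ToyTopo {
    int64_t sockets, clusters, cores, threads;
} ToyTopo;

/*
 * Map (socket, cluster, core, thread) to a linear CPU index using
 * t + T*(c + C*(k + K*s)); return -1 if any ID is out of range.
 */
static int64_t toy_cpu_index(const ToyTopo *tp, int64_t socket_id,
                             int64_t cluster_id, int64_t core_id,
                             int64_t thread_id)
{
    if (socket_id < 0 || socket_id >= tp->sockets ||
        cluster_id < 0 || cluster_id >= tp->clusters ||
        core_id < 0 || core_id >= tp->cores ||
        thread_id < 0 || thread_id >= tp->threads) {
        return -1;
    }
    return thread_id + tp->threads *
           (core_id + tp->cores * (cluster_id + tp->clusters * socket_id));
}
```

IDs that are omitted on the command line default to 0 in the patch, so with a
single socket, cluster and thread, 'core-id=3' simply selects CPU index 3.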

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hmp-commands.hx         |  30 +++++++++
 hw/arm/virt.c           |  86 +++++++++++++++++++++++++
 hw/core/cpu-common.c    |  12 ++++
 hw/core/qdev.c          |  21 ++++++
 include/hw/arm/virt.h   |   1 +
 include/hw/core/cpu.h   |  11 ++++
 include/hw/qdev-core.h  |  22 +++++++
 include/monitor/hmp.h   |   2 +
 include/monitor/qdev.h  |  30 +++++++++
 include/system/system.h |   1 +
 qemu-options.hx         |  51 +++++++++++++--
 system/qdev-monitor.c   | 139 +++++++++++++++++++++++++++++++++++++++-
 system/vl.c             |  39 +++++++++++
 13 files changed, 440 insertions(+), 5 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index d0e4f35a30..18056cf21d 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -707,6 +707,36 @@ SRST
   or a QOM object path.
 ERST
 
+{
+    .name       = "device_set",
+    .args_type  = "device:O",
+    .params     = "driver[,prop=value][,...]",
+    .help       = "set/unset existing device property",
+    .cmd        = hmp_device_set,
+    .command_completion = device_set_completion,
+},
+
+SRST
+``device_set`` *driver[,prop=value][,...]*
+  Change the administrative power state of an existing device.
+
+  This command enables or disables a known device (e.g., CPU) using the
+  "device_set" interface. It does not hotplug or add a new device.
+
+  Depending on platform support (e.g., PSCI or ACPI), this may trigger
+  corresponding operational changes — such as powering down a CPU or
+  transitioning it to active use.
+
+  Administrative state:
+    * *enabled*  — Allows the guest to use the device (e.g., CPU_ON)
+    * *disabled* — Prevents guest use; device is powered off (e.g., CPU_OFF)
+
+  Note: The device must already exist (be declared during machine creation).
+
+  Example:
+      (qemu) device_set host-arm-cpu,core-id=3,admin-state=disabled
+ERST
+
     {
         .name       = "cpu",
         .args_type  = "index:i",
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 9a41a0682b..7bd37ffb75 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -74,6 +74,7 @@
 #include "qapi/visitor.h"
 #include "qapi/qapi-visit-common.h"
 #include "qobject/qlist.h"
+#include "qobject/qdict.h"
 #include "standard-headers/linux/input.h"
 #include "hw/arm/smmuv3.h"
 #include "hw/acpi/acpi.h"
@@ -1824,6 +1825,88 @@ void virt_machine_done(Notifier *notifier, void *data)
     virt_build_smbios(vms);
 }
 
+static DeviceState *virt_find_cpu(const QDict *opts, Error **errp)
+{
+    int64_t socket_id, cluster_id, core_id, thread_id;
+    MachineState *ms = MACHINE(qdev_get_machine());
+    int64_t T, C, K, cpu_id;
+    CPUState *cpu;
+    const char *s;
+
+    /* parse topology */
+    socket_id  = (s = qdict_get_try_str(opts, "socket-id")) ?
+                  strtoll(s, NULL, 10) : 0;
+    cluster_id = (s = qdict_get_try_str(opts, "cluster-id")) ?
+                 strtoll(s, NULL, 10) : 0;
+    core_id    = (s = qdict_get_try_str(opts, "core-id")) ?
+                 strtoll(s, NULL, 10) : 0;
+    thread_id  = (s = qdict_get_try_str(opts, "thread-id")) ?
+                 strtoll(s, NULL, 10) : 0;
+
+    /* Range checks */
+    if (thread_id < 0 || thread_id >= ms->smp.threads) {
+        error_setg(errp,
+                   "Couldn't find cpu(%ld:%ld:%ld:%ld), Invalid thread-id %ld",
+                   socket_id, cluster_id, core_id, thread_id, thread_id);
+        return NULL;
+    }
+    if (core_id < 0 || core_id >= ms->smp.cores) {
+        error_setg(errp,
+                   "Couldn't find cpu(%ld:%ld:%ld:%ld), Invalid core-id %ld",
+                   socket_id, cluster_id, core_id, thread_id, core_id);
+        return NULL;
+    }
+    if (cluster_id < 0 || cluster_id >= ms->smp.clusters) {
+        error_setg(errp,
+                   "Couldn't find cpu(%ld:%ld:%ld:%ld), Invalid cluster-id %ld",
+                   socket_id, cluster_id, core_id, thread_id, cluster_id);
+        return NULL;
+    }
+    if (socket_id < 0 || socket_id >= ms->smp.sockets) {
+        error_setg(errp,
+                   "Couldn't find cpu(%ld:%ld:%ld:%ld), Invalid socket-id %ld",
+                   socket_id, cluster_id, core_id, thread_id, socket_id);
+        return NULL;
+    }
+
+    /* Compute logical CPU index: t + T*(c + C*(k + K*s)). */
+    T = ms->smp.threads;
+    C = ms->smp.cores;
+    K = ms->smp.clusters;
+    cpu_id = thread_id + T * (core_id + C * (cluster_id + K * socket_id));
+
+    cpu = machine_get_possible_cpu((int)cpu_id);
+    if (!cpu) {
+        error_setg(errp,
+                   "Couldn't find cpu(%ld:%ld:%ld:%ld), Invalid cpu-index %ld",
+                   socket_id, cluster_id, core_id, thread_id, cpu_id);
+        return NULL;
+    }
+
+    return DEVICE(cpu);
+}
+
+static DeviceState *
+virt_find_device(DeviceListener *listener, const QDict *opts, Error **errp)
+{
+    const char *typename;
+
+    g_assert(opts);
+
+    typename = qdict_get_try_str(opts, "driver");
+    if (!typename)
+    {
+        error_setg(errp, "no driver specified");
+        return NULL;
+    }
+
+    if (cpu_typename_is_a(typename, TYPE_ARM_CPU)) {
+        return virt_find_cpu(opts, errp);
+    }
+
+    return NULL;
+}
+
 static void virt_park_cpu_in_userspace(CPUState *cs)
 {
     /* we don't want to migrate 'disabled' vCPU state(even if realized) */
@@ -2545,6 +2628,9 @@ static void machvirt_init(MachineState *machine)
 
     create_fdt(vms);
 
+    vms->device_listener.find_device = virt_find_device;
+    device_listener_register(&vms->device_listener);
+
     assert(possible_cpus->len == max_cpus);
     for (n = 0; n < possible_cpus->len; n++) {
         Object *cpuobj;
diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index 39e674aca2..6883dba75e 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -170,6 +170,18 @@ char *cpu_model_from_type(const char *typename)
     return g_strdup(typename);
 }
 
+bool cpu_typename_is_a(const char *typename, const char *base_typename)
+{
+    ObjectClass *oc;
+
+    if (!typename || !base_typename) {
+        return false;
+    }
+
+    oc = object_class_by_name(typename);
+    return oc && object_class_dynamic_cast(oc, base_typename);
+}
+
 static void cpu_common_parse_features(const char *typename, char *features,
                                       Error **errp)
 {
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 3aba99b912..4fa2988ca0 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -226,6 +226,27 @@ bool qdev_should_hide_device(const QDict *opts, bool from_json, Error **errp)
     return false;
 }
 
+DeviceState *
+qdev_find_device(const QDict *opts, Error **errp)
+{
+    ERRP_GUARD();
+    DeviceListener *listener;
+    DeviceState *dev;
+
+    QTAILQ_FOREACH(listener, &device_listeners, link) {
+        if (listener->find_device) {
+            dev = listener->find_device(listener, opts, errp);
+            if (*errp) {
+                return NULL;
+            } else if (dev) {
+                return dev;
+            }
+        }
+    }
+
+    return NULL;
+}
+
 void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
                                  int required_for_version)
 {
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 0898e8eed3..de4a08175e 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -182,6 +182,7 @@ struct VirtMachineState {
     char *oem_table_id;
     bool ns_el2_virt_timer_irq;
     CXLState cxl_devices_state;
+    DeviceListener device_listener;
 };
 
 #define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM : VIRT_PCIE_ECAM)
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index ccf5588011..c9ce9bbdaf 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -853,6 +853,17 @@ ObjectClass *cpu_class_by_name(const char *typename, const char *cpu_model);
  */
 char *cpu_model_from_type(const char *typename);
 
+/**
+ * cpu_typename_is_a:
+ * @typename: QOM type name to check (e.g. "host-arm-cpu").
+ * @base_typename: Base QOM typename to test against (e.g. TYPE_ARM_CPU).
+ *
+ * Return: true if @typename names a class that is-a @base_typename, else false.
+ *
+ * Notes: Safe for common code; depends only on QOM (no target headers).
+ */
+bool cpu_typename_is_a(const char *typename, const char *base_typename);
+
 /**
  * cpu_create:
  * @typename: The CPU type.
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 3e08cfb59f..19d1d1a144 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -371,6 +371,15 @@ struct DeviceListener {
      */
     bool (*hide_device)(DeviceListener *listener, const QDict *device_opts,
                         bool from_json, Error **errp);
+    /*
+     * Used by qdev to find any device corresponding to the device opts
+     *
+     * Returns the `DeviceState` on success and NULL if device was not found.
+     * On errors, it returns NULL and errp is set
+     */
+    DeviceState * (*find_device)(DeviceListener *listener,
+                                 const QDict *device_opts,
+                                 Error **errp);
     QTAILQ_ENTRY(DeviceListener) link;
 };
 
@@ -1252,6 +1261,19 @@ void device_listener_unregister(DeviceListener *listener);
  */
 bool qdev_should_hide_device(const QDict *opts, bool from_json, Error **errp);
 
+/**
+ * qdev_find_device() - find the device
+ *
+ * @opts: options QDict
+ * @errp: pointer to error object
+ *
+ * Called when device state is toggled via qdev_device_state()
+ *
+ * Return: a DeviceState on success and NULL on failure
+ */
+DeviceState *
+qdev_find_device(const QDict *opts, Error **errp);
+
 typedef enum MachineInitPhase {
     /* current_machine is NULL.  */
     PHASE_NO_MACHINE,
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index ae116d9804..3e8c492c28 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -84,6 +84,7 @@ void hmp_change_medium(Monitor *mon, const char *device, const char *target,
 void hmp_migrate(Monitor *mon, const QDict *qdict);
 void hmp_device_add(Monitor *mon, const QDict *qdict);
 void hmp_device_del(Monitor *mon, const QDict *qdict);
+void hmp_device_set(Monitor *mon, const QDict *qdict);
 void hmp_dump_guest_memory(Monitor *mon, const QDict *qdict);
 void hmp_netdev_add(Monitor *mon, const QDict *qdict);
 void hmp_netdev_del(Monitor *mon, const QDict *qdict);
@@ -117,6 +118,7 @@ void object_add_completion(ReadLineState *rs, int nb_args, const char *str);
 void object_del_completion(ReadLineState *rs, int nb_args, const char *str);
 void device_add_completion(ReadLineState *rs, int nb_args, const char *str);
 void device_del_completion(ReadLineState *rs, int nb_args, const char *str);
+void device_set_completion(ReadLineState *rs, int nb_args, const char *str);
 void sendkey_completion(ReadLineState *rs, int nb_args, const char *str);
 void chardev_remove_completion(ReadLineState *rs, int nb_args, const char *str);
 void chardev_add_completion(ReadLineState *rs, int nb_args, const char *str);
diff --git a/include/monitor/qdev.h b/include/monitor/qdev.h
index 1d57bf6577..b10040e27f 100644
--- a/include/monitor/qdev.h
+++ b/include/monitor/qdev.h
@@ -6,6 +6,36 @@
 void hmp_info_qtree(Monitor *mon, const QDict *qdict);
 void hmp_info_qdm(Monitor *mon, const QDict *qdict);
 void qmp_device_add(QDict *qdict, QObject **ret_data, Error **errp);
+/**
+ * qmp_device_set:
+ * @qdict: Boxed arguments identifying the target device and property changes.
+ *
+ *         The device can be identified in one of two ways:
+ *           1. By "id":      Device instance ID (string), or
+ *           2. By "driver":  Device type (string) plus one or more
+ *                            property=value pairs to match.
+ *
+ *         Must also include at least one property assignment to change.
+ *         Currently used for:
+ *           - "admin-state": "enable" | "disable"
+ *
+ *         Additional properties may be supported by specific devices
+ *         in future.
+ *
+ * @errp:  Pointer to error object (set on failure).
+ *
+ * Change one or more mutable properties of an existing device at runtime.
+ * Initially intended for administrative CPU power-state control via
+ * "admin-state" on CPU devices, but may be extended to support other
+ * per-device set/unset controls when allowed by the target device class.
+ *
+ * Returns: Nothing. On success, replies with `{ "return": true }` via QMP.
+ *
+ * Errors:
+ *  - DeviceNotFound:  No matching device found
+ *  - GenericError:    Parameter validation failed or operation unsupported
+ */
+void qmp_device_set(const QDict *qdict, Error **errp);
 
 int qdev_device_help(QemuOpts *opts);
 DeviceState *qdev_device_add(QemuOpts *opts, Error **errp);
diff --git a/include/system/system.h b/include/system/system.h
index a7effe7dfd..3702325cfb 100644
--- a/include/system/system.h
+++ b/include/system/system.h
@@ -116,6 +116,7 @@ extern QemuOptsList qemu_drive_opts;
 extern QemuOptsList bdrv_runtime_opts;
 extern QemuOptsList qemu_chardev_opts;
 extern QemuOptsList qemu_device_opts;
+extern QemuOptsList qemu_deviceset_opts;
 extern QemuOptsList qemu_netdev_opts;
 extern QemuOptsList qemu_nic_opts;
 extern QemuOptsList qemu_net_opts;
diff --git a/qemu-options.hx b/qemu-options.hx
index 83ccde341b..f517b91042 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -375,7 +375,10 @@ SRST
     This is different from CPU hotplug where additional CPUs are not even
     present in the system description. Administratively disabled CPUs appear in
     ACPI tables i.e. are provisioned, but cannot be used until explicitly
-    enabled via QMP/HMP or the deviceset API.
+    enabled via QMP/HMP or the deviceset API. On ACPI guests, each vCPU counted
+    by 'disabledcpus=' is provisioned with '\ ``_STA``\ ' reporting Present=1
+    and Enabled=0 (present-offline) at boot; it becomes Enabled=1 when brought
+    online via 'device_set ... admin-state=enable'.
 
     On boards supporting CPU hotplug, the optional '\ ``maxcpus``\ ' parameter
     can be set to enable further CPUs to be added at runtime. When both
@@ -455,6 +458,15 @@ SRST
 
         -smp 2
 
+    Note: The cluster topology will only be generated in ACPI and exposed
+    to guest if it's explicitly specified in -smp.
+
+    Note: Administratively disabled CPUs (specified via 'disabledcpus=' and
+    '-deviceset' at CLI during boot) are especially useful for platforms like
+    ARM that lack native CPU hotplug support. These CPUs will appear to the
+    guest as unavailable, and any attempt to bring them online must go through
+    QMP/HMP commands like 'device_set'.
+
     Examples using 'disabledcpus':
 
     For a board without CPU hotplug, enable 4 CPUs at boot and provision
@@ -472,9 +484,6 @@ SRST
     ::
 
         -smp cpus=4,disabledcpus=2,maxcpus=8
-
-    Note: The cluster topology will only be generated in ACPI and exposed
-    to guest if it's explicitly specified in -smp.
 ERST
 
 DEF("numa", HAS_ARG, QEMU_OPTION_numa,
@@ -1281,6 +1290,40 @@ SRST
 
 ERST
 
+DEF("deviceset", HAS_ARG, QEMU_OPTION_deviceset,
+    "-deviceset driver[,prop[=value]][,...]\n"
+    "                Set administrative power state of an existing device.\n"
+    "                Does not hotplug a new device. Can disable or enable\n"
+    "                devices (such as CPUs) at boot based on policy.\n"
+    "                Example:\n"
+    "                    -deviceset host-arm-cpu,core-id=2,admin-state=disable\n"
+    "                Use '-deviceset help' for supported drivers\n"
+    "                Use '-deviceset driver,help' for driver-specific properties\n",
+    QEMU_ARCH_ALL)
+SRST
+``-deviceset driver[,prop[=value]][,...]``
+    Configure an existing device's administrative power state or properties.
+
+    Unlike ``-device``, this option does not create a new device. Instead,
+    it sets startup properties (such as administrative power state) for
+    a device already declared via -smp or other machine configuration.
+
+    Example:
+        -smp cpus=4
+        -deviceset host-arm-cpu,core-id=2,admin-state=disable
+
+    The above disables CPU core 2 at boot using administrative offlining.
+    The core may later be re-enabled with ``device_set`` (if policy permits).
+
+    ``admin-state=enable|disable``
+        Sets the administrative state of the device:
+        - ``enable``: device is made available at boot
+        - ``disable``: device is administratively disabled and powered off
+
+    Use ``-deviceset help`` to view all supported drivers.
+    Use ``-deviceset driver,help`` for property-specific help.
+ERST
+
 DEF("name", HAS_ARG, QEMU_OPTION_name,
     "-name string1[,process=string2][,debug-threads=on|off]\n"
     "                set the name of the guest\n"
diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
index 2ac92d0a07..1099b1237d 100644
--- a/system/qdev-monitor.c
+++ b/system/qdev-monitor.c
@@ -263,12 +263,20 @@ static DeviceClass *qdev_get_device_class(const char **driver, Error **errp)
     }
 
     dc = DEVICE_CLASS(oc);
-    if (!dc->user_creatable) {
+    if (!dc->user_creatable && !dc->admin_power_state_supported) {
         error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "driver",
                    "a pluggable device type");
         return NULL;
     }
 
+    if (phase_check(PHASE_MACHINE_READY) &&
+        (!dc->hotpluggable && !dc->admin_power_state_supported)) {
+        error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "driver",
+                   "a pluggable device type or one that supports "
+                   "administrative power-state changes");
+        return NULL;
+    }
+
     if (object_class_dynamic_cast(oc, TYPE_SYS_BUS_DEVICE)) {
         /* sysbus devices need to be allowed by the machine */
         MachineClass *mc = MACHINE_CLASS(object_get_class(qdev_get_machine()));
@@ -939,6 +947,76 @@ void qmp_device_del(const char *id, Error **errp)
     }
 }
 
+void qmp_device_set(const QDict *qdict, Error **errp)
+{
+    const char *state;
+    const char *driver;
+    DeviceState *dev;
+    DeviceClass *dc;
+    const char *id;
+
+    driver = qdict_get_try_str(qdict, "driver");
+    if (!driver) {
+        error_setg(errp, "Parameter 'driver' is missing");
+        return;
+    }
+
+    /* check driver exists and we are at the right phase of machine init */
+    dc = qdev_get_device_class(&driver, errp);
+    if (!dc) {
+        error_prepend(errp, "driver '%s' not supported: ", driver);
+        return;
+    }
+
+    if (migration_is_running()) {
+        error_setg(errp, "device_set not allowed while migrating");
+        return;
+    }
+
+    id = qdict_get_try_str(qdict, "id");
+
+    if (id) {
+        /* Lookup by ID */
+        dev = find_device_state(id, false, errp);
+        if (errp && *errp) {
+            error_prepend(errp, "Device lookup failed for ID '%s': ", id);
+            return;
+        }
+    } else {
+        /* Lookup using driver and properties */
+        dev = qdev_find_device(qdict, errp);
+        if (errp && *errp) {
+            error_prepend(errp, "Device lookup for %s failed: ", driver);
+            return;
+        }
+    }
+    if (!dev) {
+        error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
+                  "No device found for driver '%s'", driver);
+        return;
+    }
+
+    state = qdict_get_try_str(qdict, "admin-state");
+    if (!state) {
+        error_setg(errp, "no device state change specified for device %s",
+                   dev->id);
+        return;
+    } else if (!strcmp(state, "enable")) {
+
+        if (!qdev_enable(dev, qdev_get_parent_bus(DEVICE(dev)), errp)) {
+            return;
+        }
+    } else if (!strcmp(state, "disable")) {
+        if (!qdev_disable(dev, qdev_get_parent_bus(DEVICE(dev)), errp)) {
+            return;
+        }
+    } else {
+        error_setg(errp, "unrecognized admin-state '%s' for device %s",
+                   state, dev->id);
+        return;
+    }
+}
+
 int qdev_sync_config(DeviceState *dev, Error **errp)
 {
     DeviceClass *dc = DEVICE_GET_CLASS(dev);
@@ -1019,6 +1097,14 @@ void hmp_device_del(Monitor *mon, const QDict *qdict)
     hmp_handle_error(mon, err);
 }
 
+void hmp_device_set(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+
+    qmp_device_set(qdict, &err);
+    hmp_handle_error(mon, err);
+}
+
 void device_add_completion(ReadLineState *rs, int nb_args, const char *str)
 {
     GSList *list, *elt;
@@ -1101,6 +1187,41 @@ void device_del_completion(ReadLineState *rs, int nb_args, const char *str)
     peripheral_device_del_completion(rs, str);
 }
 
+void device_set_completion(ReadLineState *rs, int nb_args, const char *str)
+{
+    GSList *list, *elt;
+    size_t len;
+
+    if (nb_args == 2) {
+        len = strlen(str);
+        readline_set_completion_index(rs, len);
+
+        list = elt = object_class_get_list(TYPE_DEVICE, false);
+        while (elt) {
+            DeviceClass *dc = OBJECT_CLASS_CHECK(DeviceClass, elt->data,
+                                                 TYPE_DEVICE);
+            readline_add_completion_of(
+                rs, str, object_class_get_name(OBJECT_CLASS(dc)));
+            elt = elt->next;
+        }
+        g_slist_free(list);
+        return;
+    }
+
+    if (nb_args == 3) {
+        readline_set_completion_index(rs, strlen(str));
+        readline_add_completion_of(rs, str, "admin-state");
+        return;
+    }
+
+    if (nb_args == 4) {
+        readline_set_completion_index(rs, strlen(str));
+        readline_add_completion_of(rs, str, "enable");
+        readline_add_completion_of(rs, str, "disable");
+        return;
+    }
+}
+
 BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
 {
     DeviceState *dev;
@@ -1134,6 +1255,22 @@ QemuOptsList qemu_device_opts = {
     },
 };
 
+QemuOptsList qemu_deviceset_opts = {
+    .name = "deviceset",
+    .implied_opt_name = "driver",
+    .head = QTAILQ_HEAD_INITIALIZER(qemu_deviceset_opts.head),
+    .desc = {
+        /*
+         * no fixed schema; parameters include:
+         * - driver=<device-name>
+         * - id=<device-id> (optional)
+         * - admin-state=enable|disable
+         * - other optional props for locating the device
+         */
+        { /* end of list */ }
+    },
+};
+
 QemuOptsList qemu_global_opts = {
     .name = "global",
     .head = QTAILQ_HEAD_INITIALIZER(qemu_global_opts.head),
diff --git a/system/vl.c b/system/vl.c
index 2f0fd21a1f..c1731de202 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -1218,6 +1218,16 @@ static int device_init_func(void *opaque, QemuOpts *opts, Error **errp)
     return 0;
 }
 
+static int deviceset_init_func(void *opaque, QemuOpts *opts, Error **errp)
+{
+    QDict *qdict = qemu_opts_to_qdict(opts, NULL);
+
+    qmp_device_set(qdict, errp);
+    qobject_unref(qdict);
+
+    return *errp ? -1 : 0;
+}
+
 static int chardev_init_func(void *opaque, QemuOpts *opts, Error **errp)
 {
     Error *local_err = NULL;
@@ -2755,6 +2765,10 @@ static void qemu_create_cli_devices(void)
         assert(ret_data == NULL); /* error_fatal aborts */
         loc_pop(&opt->loc);
     }
+
+    /* add deferred 'deviceset' list handling - common to JSON/non-JSON path */
+    qemu_opts_foreach(qemu_find_opts("deviceset"), deviceset_init_func, NULL,
+                      &error_fatal);
 }
 
 static bool qemu_machine_creation_done(Error **errp)
@@ -2855,6 +2869,7 @@ void qemu_init(int argc, char **argv)
     qemu_add_drive_opts(&bdrv_runtime_opts);
     qemu_add_opts(&qemu_chardev_opts);
     qemu_add_opts(&qemu_device_opts);
+    qemu_add_opts(&qemu_deviceset_opts);
     qemu_add_opts(&qemu_netdev_opts);
     qemu_add_opts(&qemu_nic_opts);
     qemu_add_opts(&qemu_net_opts);
@@ -3458,6 +3473,30 @@ void qemu_init(int argc, char **argv)
                     }
                 }
                 break;
+            case QEMU_OPTION_deviceset:
+                if (optarg[0] == '{') {
+                     /* JSON input: convert to QDict and then to QemuOpts */
+                     QObject *obj = qobject_from_json(optarg, &error_fatal);
+                     QDict *qdict = qobject_to(QDict, obj);
+                     if (!qdict) {
+                         error_report("Invalid JSON object for -deviceset");
+                         exit(1);
+                     }
+
+                     opts = qemu_opts_from_qdict(qemu_find_opts("deviceset"),
+                                                 qdict, &error_fatal);
+                     qobject_unref(qdict);
+                     if (!opts) {
+                         error_report_err(error_fatal);
+                         exit(1);
+                     }
+                } else {
+                    if (!qemu_opts_parse_noisily(qemu_find_opts("deviceset"),
+                                                 optarg, true)) {
+                        exit(1);
+                    }
+                }
+                break;
             case QEMU_OPTION_smp:
                 machine_parse_property_opt(qemu_find_opts("smp-opts"),
                                            "smp", optarg);
-- 
2.34.1




* [PATCH RFC V6 23/24] monitor, qapi: add 'info cpus-powerstate' and QMP query (Admin + Oper states)
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (21 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 22/24] monitor, qdev: Introduce 'device_set' to change admin state of existing devices salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-09 11:53   ` [PATCH RFC V6 23/24] monitor,qapi: " Markus Armbruster
  2025-10-01  1:01 ` [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc salil.mehta
                   ` (3 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

The existing 'info hotpluggable-cpus' applies to platforms with true CPU
hotplug. On ARM, vCPUs are not hotpluggable: resources are allocated at
boot and policy is enforced administratively (e.g. via ACPI _STA) to
achieve a hotplug-like effect. As a result, the hotpluggable interface
cannot describe ARM CPU state, whether administrative or runtime.

Operators need a clear view of both administrative policy (Enabled,
Disabled, Removed) and guest runtime status (On, Standby, Off, Unknown)
for all possible vCPUs. This separation is essential to debug CPU life
cycle flows on ARM, where PSCI CPU_ON/CPU_OFF and ACPI methods are used,
and to distinguish CPUs that are enumerated but administratively blocked
from those actually executing in the guest.

The new interface is independent of hotplug and coexists with 'info
hotpluggable-cpus' on platforms that support it (e.g. x86). By default
devices are administratively Enabled; on hotpluggable systems, absent
CPUs appear as Removed here.

This patch introduces:
  * QMP 'query-cpus-power-state' returning CPUPowerStateInfo per possible
    vCPU.
  * HMP 'info cpus-powerstate' for human-readable output.
  * Enums:
      - CPUAdminPowerState { enabled, disabled, removed }
      - CPUOperPowerState  { on, standby, off, unknown }
  * CPUPowerStateInfo with admin/oper state, optional topology ids, and
    qom-path.
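
As a rough illustration of how a consumer might render one such record, here
is a minimal standalone sketch. It does not use QEMU headers: the Sketch*
types and the sketch_* helpers are hypothetical stand-ins for the generated
QAPI types, and only one optional topology id is modelled.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirrors of the two enums named in the list above. */
typedef enum {
    SK_ADMIN_ENABLED, SK_ADMIN_DISABLED, SK_ADMIN_REMOVED
} SketchAdminState;

typedef enum {
    SK_OPER_ON, SK_OPER_STANDBY, SK_OPER_OFF, SK_OPER_UNKNOWN
} SketchOperState;

/* Reduced per-vCPU record: one optional topology id plus both states. */
typedef struct {
    int64_t id;
    bool has_core_id;          /* optional field present? */
    int64_t core_id;
    SketchAdminState admin_state;
    SketchOperState oper_state;
} SketchCpuPowerInfo;

static const char *sketch_admin_str(SketchAdminState s)
{
    switch (s) {
    case SK_ADMIN_ENABLED:  return "enabled";
    case SK_ADMIN_DISABLED: return "disabled";
    case SK_ADMIN_REMOVED:  return "removed";
    }
    return "?";
}

static const char *sketch_oper_str(SketchOperState s)
{
    switch (s) {
    case SK_OPER_ON:      return "on";
    case SK_OPER_STANDBY: return "standby";
    case SK_OPER_OFF:     return "off";
    case SK_OPER_UNKNOWN: return "unknown";
    }
    return "?";
}

/* Format one record on a single line; optional ids are printed only if set. */
static int sketch_format_cpu(const SketchCpuPowerInfo *c, char *buf, size_t len)
{
    if (c->has_core_id) {
        return snprintf(buf, len,
                        "CPU %lld core-id=%lld admin-state=%s oper-state=%s",
                        (long long)c->id, (long long)c->core_id,
                        sketch_admin_str(c->admin_state),
                        sketch_oper_str(c->oper_state));
    }
    return snprintf(buf, len, "CPU %lld admin-state=%s oper-state=%s",
                    (long long)c->id,
                    sketch_admin_str(c->admin_state),
                    sketch_oper_str(c->oper_state));
}
```

The point of the sketch is only that admin and oper state are reported side by
side for every possible vCPU, with topology ids included when available.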

Operational state semantics:
  * 'on'      : CPU is on and runnable.
  * 'standby' : Reserved for suspend-with-context (e.g. PSCI CPU_SUSPEND).
                Not emitted yet.
  * 'off'     : CPU is powered off.
                - At initial boot, admin-disabled vCPUs may be left
                  unrealized (lazy realize) and are reported Off.
                - After an admin enable, the vCPU is realized; if later
                  powered down, it remains realized and reported Off.
  * 'unknown' : State cannot be determined (very early init/teardown,
                transient hot-(un)plug window, or no power-state handler).
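
The semantics above can be modelled in a few lines. This is a simplified,
standalone sketch and not the QEMU code: the Model* types and model_* helpers
are hypothetical, and it assumes that an admin enable only realizes the vCPU
while the guest's PSCI calls drive the on/off transition.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PSCI and operational power states. */
typedef enum {
    MODEL_PSCI_ON, MODEL_PSCI_OFF, MODEL_PSCI_ON_PENDING
} ModelPsciState;

typedef enum {
    MODEL_OPER_ON, MODEL_OPER_STANDBY, MODEL_OPER_OFF, MODEL_OPER_UNKNOWN
} ModelOperState;

typedef struct {
    bool realized;          /* false while the vCPU is lazily unrealized */
    ModelPsciState psci;
} ModelVcpu;

/* Boot-time setup of an admin-disabled vCPU: unrealized, preset to OFF. */
static void model_setup_lazy_vcpu(ModelVcpu *cpu)
{
    cpu->realized = false;
    cpu->psci = MODEL_PSCI_OFF;
}

/* Admin enable realizes the vCPU (assumption: power state is untouched). */
static void model_admin_enable(ModelVcpu *cpu)
{
    cpu->realized = true;
}

/* Guest-driven power change, e.g. via PSCI CPU_ON / CPU_OFF. */
static void model_guest_power(ModelVcpu *cpu, bool on)
{
    cpu->psci = on ? MODEL_PSCI_ON : MODEL_PSCI_OFF;
}

/* Map the PSCI state to the reported operational state. */
static ModelOperState model_oper_state(const ModelVcpu *cpu)
{
    switch (cpu->psci) {
    case MODEL_PSCI_ON:
        return MODEL_OPER_ON;
    case MODEL_PSCI_OFF:
        return MODEL_OPER_OFF;
    default:
        /* 'standby' is reserved and not emitted; anything else is unknown */
        return MODEL_OPER_UNKNOWN;
    }
}
```

In this model a vCPU reports 'off' both before it is realized and after it has
been realized and powered down again, which matches the two 'off' cases above.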

Migration semantics:
  * Admin-disabled (unrealized) vCPUs do not migrate.
  * Admin-enabled vCPUs migrate their operational state, including Off.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hmp-commands-info.hx       |  32 +++++++++++
 hw/arm/virt.c              |  32 +++++++++++
 hw/core/machine-hmp-cmds.c |  62 +++++++++++++++++++++
 hw/core/machine-qmp-cmds.c | 107 +++++++++++++++++++++++++++++++++++++
 include/monitor/hmp.h      |   1 +
 qapi/machine.json          |  87 ++++++++++++++++++++++++++++++
 6 files changed, 321 insertions(+)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 6142f60e7b..b4d24c8aed 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -766,6 +766,38 @@ ERST
 SRST
   ``info hotpluggable-cpus``
     Show information about hotpluggable CPUs
+
+ERST
+
+{
+    .name       = "cpus-powerstate",
+    .args_type  = "",
+    .params     = "",
+    .help       = "Show administrative and operational CPU states",
+    .cmd        = hmp_info_cpus_powerstate,
+    .flags      = "p",
+},
+
+SRST
+  ``info cpus-powerstate``
+    Display administrative (policy) and operational (runtime) power
+    states for each virtual CPU.
+
+    Administrative states:
+      - ``Enabled``  : CPU is available to the guest
+      - ``Disabled`` : CPU is present but administratively blocked
+      - ``Removed``  : CPU is not present (hidden from the guest)
+
+    Operational states (if available):
+      - ``On``       : CPU is powered on and executing
+      - ``Standby``  : CPU is idle/low-power and can resume on an event
+      - ``Off``      : CPU is powered off or guest-offlined
+      - ``Unknown``  : State cannot be determined (e.g. very early init,
+                       teardown, transient hotplug/hotremove window, or
+                       target/platform does not expose a queryable state)
+
+    The administrative state constrains which operational states are
+    possible.
 ERST
 
     {
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 7bd37ffb75..5e02d6749d 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2080,6 +2080,21 @@ virt_cpu_post_poweroff(PowerStateHandler *handler, DeviceState *dev,
     virt_park_cpu_in_userspace(cs);
 }
 
+static
+DeviceOperPowerState virt_cpu_get_oper_state(DeviceState *dev, Error **errp)
+{
+    ARMCPU *cpu = ARM_CPU(CPU(dev));
+
+    switch (cpu->power_state) {
+    case PSCI_ON:
+        return DEVICE_OPER_POWER_STATE_ON;
+    case PSCI_OFF:
+        return DEVICE_OPER_POWER_STATE_OFF;
+    default:
+        return DEVICE_OPER_POWER_STATE_UNKNOWN;
+    }
+}
+
 static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
 {
     uint8_t clustersz;
@@ -2452,6 +2467,9 @@ virt_setup_lazy_vcpu_realization(Object *cpuobj, VirtMachineState *vms)
                                 NULL);
     }
 
+    /* set operational state of disabled CPUs as OFF */
+    ARM_CPU(cpuobj)->power_state = PSCI_OFF;
+
     /*
      * [!] Constraint: The ARM CPU architecture does not permit new CPUs
      * to be added after system initialization.
@@ -3517,6 +3535,19 @@ virt_machine_device_pre_poweron(PowerStateHandler *handler, DeviceState *dev,
     }
 }
 
+static DeviceOperPowerState
+virt_machine_get_device_oper_state(DeviceState *dev, Error **errp)
+{
+    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+        return virt_cpu_get_oper_state(dev, errp);
+    } else {
+        error_setg(errp, "can't get power state for unsupported device-type %s",
+                   object_get_typename(OBJECT(dev)));
+    }
+
+    return DEVICE_OPER_POWER_STATE_UNKNOWN;
+}
+
 static void *
 virt_machine_powerstate_handler(MachineState *machine, DeviceState *dev)
 {
@@ -3672,6 +3703,7 @@ static void virt_machine_class_init(ObjectClass *oc, const void *data)
     assert(!mc->get_powerstate_handler);
     mc->has_online_capable_cpus = true;
     mc->get_powerstate_handler = virt_machine_powerstate_handler;
+    pshc->get_oper_state = virt_machine_get_device_oper_state;
     pshc->request_poweroff = virt_machine_device_request_poweroff;
     pshc->post_poweroff = virt_machine_device_post_poweroff;
     pshc->pre_poweron = virt_machine_device_pre_poweron;
diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 3a612e2232..b01d8b800a 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -107,6 +107,68 @@ void hmp_hotpluggable_cpus(Monitor *mon, const QDict *qdict)
     qapi_free_HotpluggableCPUList(saved);
 }
 
+void hmp_info_cpus_powerstate(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    CPUPowerStateInfoList *list = qmp_query_cpus_power_state(&err);
+    CPUPowerStateInfoList *entry = list;
+
+    if (hmp_handle_error(mon, err)) {
+        return;
+    }
+
+    monitor_printf(mon, "CPUs Power State Info:\n");
+
+    while (entry) {
+        CPUPowerStateInfo *cpu = entry->value;
+
+        monitor_printf(mon, "  CPU ID: %" PRIi64 "\n", cpu->id);
+
+        if (cpu->has_socket_id) {
+            monitor_printf(mon, "    socket-id: %" PRIi64 "\n", cpu->socket_id);
+        }
+        if (cpu->has_cluster_id) {
+            monitor_printf(mon, "    cluster-id: %" PRIi64 "\n", cpu->cluster_id);
+        }
+        if (cpu->has_core_id) {
+            monitor_printf(mon, "    core-id: %" PRIi64 "\n", cpu->core_id);
+        }
+        if (cpu->has_thread_id) {
+            monitor_printf(mon, "    thread-id: %" PRIi64 "\n", cpu->thread_id);
+        }
+        if (cpu->has_die_id) {
+            monitor_printf(mon, "    die-id: %" PRIi64 "\n", cpu->die_id);
+        }
+        if (cpu->has_module_id) {
+            monitor_printf(mon, "    module-id: %" PRIi64 "\n", cpu->module_id);
+        }
+        if (cpu->has_book_id) {
+            monitor_printf(mon, "    book-id: %" PRIi64 "\n", cpu->book_id);
+        }
+        if (cpu->has_drawer_id) {
+            monitor_printf(mon, "    drawer-id: %" PRIi64 "\n", cpu->drawer_id);
+        }
+        if (cpu->has_node_id) {
+            monitor_printf(mon, "    node-id: %" PRIi64 "\n", cpu->node_id);
+        }
+        if (cpu->has_vcpus_count) {
+            monitor_printf(mon, "    vcpus-count: %" PRIi64 "\n", cpu->vcpus_count);
+        }
+        if (cpu->qom_path) {
+            monitor_printf(mon, "    qom-path: \"%s\"\n", cpu->qom_path);
+        }
+
+        monitor_printf(mon, "    admin-state: \"%s\"\n",
+                       CPUAdminPowerState_str(cpu->admin_state));
+        monitor_printf(mon, "    oper-state: \"%s\"\n",
+                       CPUOperPowerState_str(cpu->oper_state));
+
+        entry = entry->next;
+    }
+
+    qapi_free_CPUPowerStateInfoList(list);
+}
+
 void hmp_info_memdev(Monitor *mon, const QDict *qdict)
 {
     Error *err = NULL;
diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c
index 6aca1a626e..b48356f36f 100644
--- a/hw/core/machine-qmp-cmds.c
+++ b/hw/core/machine-qmp-cmds.c
@@ -158,6 +158,113 @@ HotpluggableCPUList *qmp_query_hotpluggable_cpus(Error **errp)
     return machine_query_hotpluggable_cpus(ms);
 }
 
+CPUPowerStateInfoList *qmp_query_cpus_power_state(Error **errp)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    CPUPowerStateInfoList *head = NULL;
+    CPUPowerStateInfoList **tail = &head;
+    CPUPowerStateInfo *info;
+    CPUState *cpu;
+
+    CPU_FOREACH_POSSIBLE(cpu, ms->possible_cpus) {
+        CPUArchId *arch_id = machine_get_possible_cpu_arch_id(cpu->cpu_index);
+        if (!arch_id) {
+            continue;
+        }
+
+        info = g_new0(CPUPowerStateInfo, 1);
+        info->id = cpu->cpu_index;
+
+        /* Optional topology fields */
+        if (arch_id->props.has_socket_id) {
+            info->socket_id = arch_id->props.socket_id;
+            info->has_socket_id = true;
+        }
+        if (arch_id->props.has_cluster_id) {
+            info->cluster_id = arch_id->props.cluster_id;
+            info->has_cluster_id = true;
+        }
+        if (arch_id->props.has_core_id) {
+            info->core_id = arch_id->props.core_id;
+            info->has_core_id = true;
+        }
+        if (arch_id->props.has_thread_id) {
+            info->thread_id = arch_id->props.thread_id;
+            info->has_thread_id = true;
+        }
+        if (arch_id->props.has_die_id) {
+            info->die_id = arch_id->props.die_id;
+            info->has_die_id = true;
+        }
+        if (arch_id->props.has_module_id) {
+            info->module_id = arch_id->props.module_id;
+            info->has_module_id = true;
+        }
+        if (arch_id->props.has_book_id) {
+            info->book_id = arch_id->props.book_id;
+            info->has_book_id = true;
+        }
+        if (arch_id->props.has_drawer_id) {
+            info->drawer_id = arch_id->props.drawer_id;
+            info->has_drawer_id = true;
+        }
+        if (arch_id->props.has_node_id) {
+            info->node_id = arch_id->props.node_id;
+            info->has_node_id = true;
+        }
+
+        info->vcpus_count = arch_id->vcpus_count;
+        info->has_vcpus_count = true;
+
+        info->qom_path = object_get_canonical_path(OBJECT(cpu));
+
+        /* Determine current administrative power state */
+        switch (qdev_get_admin_power_state(DEVICE(cpu))) {
+        case DEVICE_ADMIN_POWER_STATE_ENABLED:
+            info->admin_state = CPU_ADMIN_POWER_STATE_ENABLED;
+            break;
+        case DEVICE_ADMIN_POWER_STATE_DISABLED:
+            info->admin_state = CPU_ADMIN_POWER_STATE_DISABLED;
+            break;
+        case DEVICE_ADMIN_POWER_STATE_REMOVED:
+            info->admin_state = CPU_ADMIN_POWER_STATE_REMOVED;
+            break;
+        default:
+            /* This should never be hit */
+            g_assert_not_reached();
+            break;
+        }
+
+        /* Determine current operational power state */
+        switch (qdev_get_oper_power_state(DEVICE(cpu))) {
+        case DEVICE_OPER_POWER_STATE_ON:
+            info->oper_state = CPU_OPER_POWER_STATE_ON;
+            break;
+        case DEVICE_OPER_POWER_STATE_OFF:
+            info->oper_state = CPU_OPER_POWER_STATE_OFF;
+            break;
+        case DEVICE_OPER_POWER_STATE_STANDBY:
+            info->oper_state = CPU_OPER_POWER_STATE_STANDBY;
+            break;
+        case DEVICE_OPER_POWER_STATE_UNKNOWN:
+            info->oper_state = CPU_OPER_POWER_STATE_UNKNOWN;
+            break;
+        default:
+            /* This should never be hit */
+            g_assert_not_reached();
+            break;
+        }
+
+        /* Add to result list */
+        CPUPowerStateInfoList *entry = g_new0(CPUPowerStateInfoList, 1);
+        entry->value = info;
+        *tail = entry;
+        tail = &entry->next;
+    }
+
+    return head;
+}
+
 void qmp_set_numa_node(NumaOptions *cmd, Error **errp)
 {
     if (phase_check(PHASE_MACHINE_INITIALIZED)) {
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 3e8c492c28..946ccb90c1 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -142,6 +142,7 @@ void hmp_rocker_of_dpa_flows(Monitor *mon, const QDict *qdict);
 void hmp_rocker_of_dpa_groups(Monitor *mon, const QDict *qdict);
 void hmp_info_dump(Monitor *mon, const QDict *qdict);
 void hmp_hotpluggable_cpus(Monitor *mon, const QDict *qdict);
+void hmp_info_cpus_powerstate(Monitor *mon, const QDict *qdict);
 void hmp_info_vm_generation_id(Monitor *mon, const QDict *qdict);
 void hmp_info_memory_size_summary(Monitor *mon, const QDict *qdict);
 void hmp_info_replay(Monitor *mon, const QDict *qdict);
diff --git a/qapi/machine.json b/qapi/machine.json
index e45740da33..3856785b27 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1069,6 +1069,93 @@
 { 'command': 'query-hotpluggable-cpus', 'returns': ['HotpluggableCPU'],
              'allow-preconfig': true }
 
+##
+# @CPUOperPowerState:
+#
+# Guest-visible operational state of the CPU.
+# This reflects runtime status such as guest online/offline status or
+# suspended state (e.g., CPU halted, suspended in a WFI loop).
+#
+# .. note::
+#    This field is read-only. It is derived by QEMU from runtime
+#    information (e.g., CPU execution/architectural state, PSCI power
+#    status, vCPU runstate) and cannot be set by management tools or
+#    user commands.
+#
+# @on: CPU is online and executing.
+# @standby: CPU is idle or suspended (e.g., WFI).
+# @off: CPU is guest-offlined or halted.
+# @unknown: State cannot be determined at this time (e.g., very early
+#           init/teardown, transient hotplug/hotremove window, no
+#           power-state handler registered, or the target/platform does
+#           not expose a queryable CPU state).
+##
+{ 'enum': 'CPUOperPowerState',
+  'data': ['on', 'standby', 'off', 'unknown'] }
+
+##
+# @CPUAdminPowerState:
+#
+# Host-side administrative power state of the CPU device.
+# Controls guest visibility and lifecycle.
+#
+# @enabled: CPU is administratively enabled (can be used by guest)
+# @disabled: CPU is administratively disabled (guest-visible but unusable)
+# @removed: CPU is logically removed (not visible to guest)
+##
+{ 'enum': 'CPUAdminPowerState',
+  'data': ['enabled', 'disabled', 'removed'] }
+
+##
+# @CPUPowerStateInfo:
+#
+# CPU status combining both administrative and operational/runtime state.
+#
+# @id: CPU index
+# @core-id: Core ID (optional)
+# @socket-id: Socket ID (optional)
+# @cluster-id: Cluster ID (optional)
+# @thread-id: Thread ID (optional)
+# @node-id: NUMA node ID (optional)
+# @drawer-id: Drawer ID (optional)
+# @book-id: Book ID (optional)
+# @die-id: Die ID (optional)
+# @module-id: Module ID (optional)
+# @vcpus-count: Number of threads under this logical CPU (optional)
+# @qom-path: QOM object path (optional)
+# @admin-state: Administrative power state (enabled/disabled/removed)
+# @oper-state: Guest-visible runtime power state (on/standby/off/unknown)
+##
+{ 'struct': 'CPUPowerStateInfo',
+  'data': {
+    'id': 'int',
+    '*core-id': 'int',
+    '*socket-id': 'int',
+    '*cluster-id': 'int',
+    '*thread-id': 'int',
+    '*node-id': 'int',
+    '*drawer-id': 'int',
+    '*book-id': 'int',
+    '*die-id': 'int',
+    '*module-id': 'int',
+    '*vcpus-count': 'int',
+    '*qom-path': 'str',
+    'admin-state': 'CPUAdminPowerState',
+    'oper-state': 'CPUOperPowerState'
+  } }
+
+##
+# @query-cpus-power-state:
+#
+# Returns all CPUs and their power state info, combining host policy and
+# runtime guest status. This is useful for debugging vCPU hotplug,
+# suspend/resume, admin power states or offline state flows.
+#
+# Returns: a list of @CPUPowerStateInfo
+##
+{ 'command': 'query-cpus-power-state',
+  'returns': ['CPUPowerStateInfo'] }
+
 ##
 # @set-numa-node:
 #
-- 
2.34.1




* [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (22 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 23/24] monitor, qapi: add 'info cpus-powerstate' and QMP query (Admin + Oper states) salil.mehta
@ 2025-10-01  1:01 ` salil.mehta
  2025-10-01 21:34   ` Richard Henderson
  2025-10-06 14:00 ` [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch Igor Mammedov
                   ` (2 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: salil.mehta @ 2025-10-01  1:01 UTC (permalink / raw)
  To: qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	gshan, rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, salil.mehta, zhukeqian1, wangxiongfeng2, wangyanan55,
	wangzhou1, linuxarm, jiakernel2, maobibo, lixianglai, shahuang,
	zhao1.liu

From: Salil Mehta <salil.mehta@huawei.com>

The TCG code cache is split into regions shared by vCPUs under MTTCG. For
cold-boot (early realized) vCPUs, regions are sized/allocated during bring-up.
However, when a vCPU is *lazy_realized* (administratively "disabled" at boot
and realized later on demand), its TCGContext may fail the very first code
region allocation if the shared TB cache is saturated by already-running
vCPUs.

Flushing the TB cache is the right remediation, but `tb_flush()` must be
performed from the safe execution context (cpu_exec_loop()/tb_gen_code()).
This patch wires a deferred flush:

  * In `tcg_region_initial_alloc__locked()`, treat an initial allocation
    failure for a lazily realized vCPU as non-fatal: set `s->tbflush_pend`
    and return.

  * In `tcg_tb_alloc()`, if `s->tbflush_pend` is observed, clear it and
    return NULL so the caller performs a synchronous `tb_flush()` and then
    retries allocation.

This avoids hangs observed when a newly realized vCPU cannot obtain its first
region under TB-cache pressure, while keeping the flush at a safe point.
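
The two steps above can be pictured with a toy model. This is only a sketch
under stated assumptions: every `toy_*` / `Toy*` name is hypothetical and does
not exist in QEMU, regions are reduced to a counter, and a flush simply returns
all regions to the pool (contexts pick up a region again on their next
allocation).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in QEMU. */
enum { TOY_NUM_REGIONS = 4, TOY_TBS_PER_REGION = 4 };

typedef struct ToyPool {
    int free_regions;       /* regions not yet handed to any context */
} ToyPool;

typedef struct ToyCtx {
    bool lazy_realized;     /* context belongs to a late-realized vCPU */
    bool flush_pending;     /* first region allocation failed */
    int tbs_left;           /* TB slots left in the current region */
} ToyCtx;

/* Hand out one region; returns true on failure, like 'err' in the patch. */
static bool toy_region_alloc(ToyPool *p, ToyCtx *c)
{
    if (p->free_regions == 0) {
        return true;
    }
    p->free_regions--;
    c->tbs_left = TOY_TBS_PER_REGION;
    return false;
}

/* First allocation at thread registration: defer instead of asserting. */
static void toy_initial_alloc(ToyPool *p, ToyCtx *c)
{
    bool err = toy_region_alloc(p, c);

    if (err && c->lazy_realized) {
        c->flush_pending = true;    /* non-fatal: ask for a flush later */
        return;
    }
    assert(!err);                   /* cold-boot contexts must succeed */
}

/* TB allocation: returns false when the caller must flush and retry. */
static bool toy_tb_alloc(ToyPool *p, ToyCtx *c)
{
    if (c->flush_pending) {
        c->flush_pending = false;
        return false;
    }
    if (c->tbs_left == 0 && toy_region_alloc(p, c)) {
        return false;
    }
    c->tbs_left--;
    return true;
}

/* Flush: all regions go back to the pool, every context starts over. */
static void toy_flush(ToyPool *p, ToyCtx *ctxs[], size_t n)
{
    p->free_regions = TOY_NUM_REGIONS;
    for (size_t i = 0; i < n; i++) {
        ctxs[i]->tbs_left = 0;
    }
}
```

In this model, two cold-boot contexts that have consumed all four regions
leave nothing for a late context: `toy_initial_alloc()` marks it pending, its
first `toy_tb_alloc()` fails once, and after `toy_flush()` the retry succeeds.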

No change for cold-boot vCPUs or when the accelerator is KVM.

In earlier versions of this series, this patch was named:
'tcg: Update tcg_register_thread() leg to handle region alloc for hotplugged vCPU'

Reported-by: Miguel Luis <miguel.luis@oracle.com>
Signed-off-by: Miguel Luis <miguel.luis@oracle.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 accel/tcg/tcg-accel-ops-mttcg.c |  2 +-
 accel/tcg/tcg-accel-ops-rr.c    |  2 +-
 hw/arm/virt.c                   |  5 +++++
 include/hw/core/cpu.h           |  1 +
 include/tcg/startup.h           |  6 ++++++
 include/tcg/tcg.h               |  1 +
 tcg/region.c                    | 16 ++++++++++++++++
 tcg/tcg.c                       | 19 ++++++++++++++++++-
 8 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/accel/tcg/tcg-accel-ops-mttcg.c b/accel/tcg/tcg-accel-ops-mttcg.c
index 337b993d3d..cdb7345340 100644
--- a/accel/tcg/tcg-accel-ops-mttcg.c
+++ b/accel/tcg/tcg-accel-ops-mttcg.c
@@ -73,7 +73,7 @@ static void *mttcg_cpu_thread_fn(void *arg)
     force_rcu.notifier.notify = mttcg_force_rcu;
     force_rcu.cpu = cpu;
     rcu_add_force_rcu_notifier(&force_rcu.notifier);
-    tcg_register_thread();
+    tcg_register_thread(cpu);
 
     bql_lock();
     qemu_thread_get_self(cpu->thread);
diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
index 6eec5c9eee..18e713cada 100644
--- a/accel/tcg/tcg-accel-ops-rr.c
+++ b/accel/tcg/tcg-accel-ops-rr.c
@@ -186,7 +186,7 @@ static void *rr_cpu_thread_fn(void *arg)
     rcu_register_thread();
     force_rcu.notify = rr_force_rcu;
     rcu_add_force_rcu_notifier(&force_rcu);
-    tcg_register_thread();
+    tcg_register_thread(cpu);
 
     bql_lock();
     qemu_thread_get_self(cpu->thread);
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 5e02d6749d..254303727b 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2482,6 +2482,11 @@ virt_setup_lazy_vcpu_realization(Object *cpuobj, VirtMachineState *vms)
     if (kvm_enabled()) {
         kvm_arm_create_host_vcpu(ARM_CPU(cpuobj));
     }
+
+    /* we may have to nuke the TB cache */
+    if (tcg_enabled()) {
+        CPU(cpuobj)->lazy_realized = true;
+    }
 }
 
 static void machvirt_init(MachineState *machine)
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index c9ce9bbdaf..c2d45fb494 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -486,6 +486,7 @@ struct CPUState {
     bool stop;
     bool stopped;
     bool parked;
+    bool lazy_realized; /* realized after machine init (lazy realization) */
 
     /* Should CPU start in powered-off state? */
     bool start_powered_off;
diff --git a/include/tcg/startup.h b/include/tcg/startup.h
index 95f574af2b..f9126bb0bd 100644
--- a/include/tcg/startup.h
+++ b/include/tcg/startup.h
@@ -25,6 +25,8 @@
 #ifndef TCG_STARTUP_H
 #define TCG_STARTUP_H
 
+#include "hw/core/cpu.h"
+
 /**
  * tcg_init: Initialize the TCG runtime
  * @tb_size: translation buffer size
@@ -43,7 +45,11 @@ void tcg_init(size_t tb_size, int splitwx, unsigned max_threads);
  * accelerator's init_machine() method) must register with this
  * function before initiating translation.
  */
+#ifdef CONFIG_USER_ONLY
 void tcg_register_thread(void);
+#else
+void tcg_register_thread(CPUState *cpu);
+#endif
 
 /**
  * tcg_prologue_init(): Generate the code for the TCG prologue
diff --git a/include/tcg/tcg.h b/include/tcg/tcg.h
index a6d9aa50d4..e197ee03c0 100644
--- a/include/tcg/tcg.h
+++ b/include/tcg/tcg.h
@@ -396,6 +396,7 @@ struct TCGContext {
 
     /* Track which vCPU triggers events */
     CPUState *cpu;                      /* *_trans */
+    bool tbflush_pend; /* TB flush pending due to lazy vCPU realization */
 
     /* These structures are private to tcg-target.c.inc.  */
     QSIMPLEQ_HEAD(, TCGLabelQemuLdst) ldst_labels;
diff --git a/tcg/region.c b/tcg/region.c
index 7ea0b37a84..23635e0194 100644
--- a/tcg/region.c
+++ b/tcg/region.c
@@ -393,6 +393,22 @@ bool tcg_region_alloc(TCGContext *s)
 static void tcg_region_initial_alloc__locked(TCGContext *s)
 {
     bool err = tcg_region_alloc__locked(s);
+
+    /*
+     * Lazily realized vCPUs (administratively "disabled" at boot and realized
+     * later on demand) may initially fail to obtain even a single code region
+     * if the shared TB cache is under pressure from already running vCPUs.
+     *
+     * Treat this first-allocation failure as non-fatal: mark this TCGContext
+     * to request a TB cache flush and return. The flush is performed later,
+     * synchronously in the vCPU execution path (cpu_exec_loop()/tb_gen_code()),
+     * which is the safe place for tb_flush().
+     */
+    if (err && s->cpu && s->cpu->lazy_realized) {
+        s->tbflush_pend = true;
+        return;
+    }
+
     g_assert(!err);
 }
 
diff --git a/tcg/tcg.c b/tcg/tcg.c
index afac55a203..5867952ae7 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1285,12 +1285,14 @@ void tcg_register_thread(void)
     tcg_ctx = &tcg_init_ctx;
 }
 #else
-void tcg_register_thread(void)
+void tcg_register_thread(CPUState *cpu)
 {
     TCGContext *s = g_malloc(sizeof(*s));
     unsigned int i, n;
 
     *s = tcg_init_ctx;
+    s->cpu = cpu;
+    s->tbflush_pend = false;
 
     /* Relink mem_base.  */
     for (i = 0, n = tcg_init_ctx.nb_globals; i < n; ++i) {
@@ -1871,6 +1873,21 @@ TranslationBlock *tcg_tb_alloc(TCGContext *s)
     TranslationBlock *tb;
     void *next;
 
+    /*
+     * Lazy realization:
+     * A vCPU that was realized after machine init may have failed its first
+     * code-region allocation (see tcg_region_initial_alloc__locked()) and
+     * requested a deferred TB-cache flush by setting s->tbflush_pend.
+     *
+     * If the flag is set, do not attempt allocation here. Clear the flag and
+     * return NULL so the caller (tb_gen_code()/cpu_exec_loop()) can perform a
+     * safe tb_flush() and then retry TB allocation.
+     */
+    if (s->tbflush_pend) {
+        s->tbflush_pend = false;
+        return NULL;
+    }
+
  retry:
     tb = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, align);
     next = (void *)ROUND_UP((uintptr_t)(tb + 1), align);
-- 
2.34.1




* Re: [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc
  2025-10-01  1:01 ` [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc salil.mehta
@ 2025-10-01 21:34   ` Richard Henderson
  2025-10-02 12:27     ` Salil Mehta via
  0 siblings, 1 reply; 67+ messages in thread
From: Richard Henderson @ 2025-10-01 21:34 UTC (permalink / raw)
  To: salil.mehta, qemu-devel, qemu-arm, mst

On 9/30/25 18:01, salil.mehta@opnsrc.net wrote:
> From: Salil Mehta <salil.mehta@huawei.com>
> 
> The TCG code cache is split into regions shared by vCPUs under MTTCG. For
> cold-boot (early realized) vCPUs, regions are sized/allocated during bring-up.
> However, when a vCPU is *lazy_realized* (administratively "disabled" at boot
> and realized later on demand), its TCGContext may fail the very first code
> region allocation if the shared TB cache is saturated by already-running
> vCPUs.
> 
> Flushing the TB cache is the right remediation, but `tb_flush()` must be
> performed from the safe execution context (cpu_exec_loop()/tb_gen_code()).
> This patch wires a deferred flush:
> 
>    * In `tcg_region_initial_alloc__locked()`, treat an initial allocation
>      failure for a lazily realized vCPU as non-fatal: set `s->tbflush_pend`
>      and return.
> 
>    * In `tcg_tb_alloc()`, if `s->tbflush_pend` is observed, clear it and
>      return NULL so the caller performs a synchronous `tb_flush()` and then
>      retries allocation.
> 
> This avoids hangs observed when a newly realized vCPU cannot obtain its first
> region under TB-cache pressure, while keeping the flush at a safe point.
> 
> No change for cold-boot vCPUs and when accel ops is KVM.
> 
> In earlier series, this patch was with below named,
> 'tcg: Update tcg_register_thread() leg to handle region alloc for hotplugged vCPU'


I don't see why you need two different booleans for this. 	
It seems to me that you could create the cpu in a state for which the first call to 
tcg_tb_alloc() sees highwater state, and everything after that happens per usual 
allocating a new region, and possibly flushing the full buffer.

What is the testcase for this?


r~



* RE: [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc
  2025-10-01 21:34   ` Richard Henderson
@ 2025-10-02 12:27     ` Salil Mehta via
  2025-10-02 15:41       ` Richard Henderson
  0 siblings, 1 reply; 67+ messages in thread
From: Salil Mehta via @ 2025-10-02 12:27 UTC (permalink / raw)
  To: Richard Henderson, salil.mehta@opnsrc.net, qemu-devel@nongnu.org,
	qemu-arm@nongnu.org, mst@redhat.com

Hi Richard,

Thanks for the reply. Please find my response inline.

Cheers.

> From: qemu-devel-bounces+salil.mehta=huawei.com@nongnu.org <qemu-
> devel-bounces+salil.mehta=huawei.com@nongnu.org> On Behalf Of Richard
> Henderson
> Sent: Wednesday, October 1, 2025 10:34 PM
> To: salil.mehta@opnsrc.net; qemu-devel@nongnu.org; qemu-
> arm@nongnu.org; mst@redhat.com
> Subject: Re: [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs
> on first region alloc
> 
> On 9/30/25 18:01, salil.mehta@opnsrc.net wrote:
> > From: Salil Mehta <salil.mehta@huawei.com>
> >
> > The TCG code cache is split into regions shared by vCPUs under MTTCG.
> > For cold-boot (early realized) vCPUs, regions are sized/allocated during
> bring-up.
> > However, when a vCPU is *lazy_realized* (administratively "disabled"
> > at boot and realized later on demand), its TCGContext may fail the
> > very first code region allocation if the shared TB cache is saturated
> > by already-running vCPUs.
> >
> > Flushing the TB cache is the right remediation, but `tb_flush()` must
> > be performed from the safe execution context
> (cpu_exec_loop()/tb_gen_code()).
> > This patch wires a deferred flush:
> >
> >    * In `tcg_region_initial_alloc__locked()`, treat an initial allocation
> >      failure for a lazily realized vCPU as non-fatal: set `s->tbflush_pend`
> >      and return.
> >
> >    * In `tcg_tb_alloc()`, if `s->tbflush_pend` is observed, clear it and
> >      return NULL so the caller performs a synchronous `tb_flush()` and then
> >      retries allocation.
> >
> > This avoids hangs observed when a newly realized vCPU cannot obtain
> > its first region under TB-cache pressure, while keeping the flush at a safe
> point.
> >
> > No change for cold-boot vCPUs and when accel ops is KVM.
> >
> > In earlier series, this patch was with below named,
> > 'tcg: Update tcg_register_thread() leg to handle region alloc for hotplugged
> vCPU'
> 
> 
> I don't see why you need two different booleans for this.


I can see your point. Maybe I can move `s->tbflush_pend`  to 'CPUState' instead? 


> It seems to me that you could create the cpu in a state for which the first call
> to
> tcg_tb_alloc() sees highwater state, and everything after that happens per
> usual allocating a new region, and possibly flushing the full buffer.


Correct, but with the distinction that the highwater state is relevant to a
TCGContext, while the regions are allocated from a common pool, the 'Code
Generation Buffer'. 'code_gen_highwater' is used to detect whether the current
context needs another region for dynamic translation to continue. This is a
different condition from the one we are encountering, which is the worst case:
the entire code generation buffer is saturated and not even a single free TCG
region can be allocated. In such a case, we have no option other than to flush
the entire buffer and reallocate the regions to all the threads. It is a
rebalancing act to accommodate a new vCPU, which is expensive, but the good
thing is that this does not happen every time. It is a worst-case condition,
i.e. when a system is under tremendous stress and is running out of resources.


We are avoiding this crash:

ERROR:../tcg/region.c:396:tcg_region_initial_alloc__locked: assertion failed: (!err)
Bail out! ERROR:../tcg/region.c:396:tcg_region_initial_alloc__locked: assertion failed: (!err)
./run-qemu.sh: line 8: 255346 Aborted                 (core dumped) ./qemu/build/qemu-system-aarch64 -M virt,accel=tcg

Dump is here:

Thread 65 "qemu-system-aar" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff48ff9640 (LWP 633577)]
0x00007ffff782f98c in __pthread_kill_implementation () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff782f98c in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x00007ffff77e2646 in raise () at /lib64/libc.so.6
#2  0x00007ffff77cc7f3 in abort () at /lib64/libc.so.6
#3  0x00007ffff7c21d6c in g_assertion_message_expr.cold () at /lib64/libglib-2.0.so.0
#4  0x00007ffff7c7ce2f in g_assertion_message_expr () at /lib64/libglib-2.0.so.0
#5  0x00005555561cf359 in tcg_region_initial_alloc__locked (s=0x7fff10000b60) at ../tcg/region.c:396
#6  0x00005555561cf3ab in tcg_region_initial_alloc (s=0x7fff10000b60) at ../tcg/region.c:402
#7  0x00005555561da83c in tcg_register_thread () at ../tcg/tcg.c:820
#8  0x00005555561a97bb in mttcg_cpu_thread_fn (arg=0x555557e0c2b0) at ../accel/tcg/tcg-accel-ops-mttcg.c:77
#9  0x00005555564f18ab in qemu_thread_start (args=0x5555582e2bc0) at ../util/qemu-thread-posix.c:541
#10 0x00007ffff782dc12 in start_thread () at /lib64/libc.so.6
#11 0x00007ffff78b2cc0 in clone3 () at /lib64/libc.so.6
(gdb)



> 
> What is the testcase for this?


As mentioned, this tackles the worst case, when the 'code generation buffer'
runs out of space entirely. We need a better mitigation plan than to simply
assert().

It can be easily reproduced by decreasing 'tb_size', increasing the number of
vCPUs, and running larger programs simultaneously. I was able to reproduce it
with only 6 vCPUs and 'tb_size=10'. Booting was dead slow, but a single vCPU
hotplug action is enough to reproduce it.

RFC V6 has TCG broken for some other reason and I'm trying to fix it.
But if you wish, you can try this on RFC V5, which has a greater chance of
this happening as it actually uses the vCPU hotplug approach, i.e. threads
can be created and deleted.

https://github.com/salil-mehta/qemu/commits/virt-cpuhp-armv8/rfc-v5/

With RFC V6, this condition is likely to happen only once, during the delayed
spawning of the vCPU thread of a vCPU being lazily realized. We do not
delete the spawned thread.

Many thanks!

Best regards
Salil.

> 
> 
> r~



* Re: [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc
  2025-10-02 12:27     ` Salil Mehta via
@ 2025-10-02 15:41       ` Richard Henderson
  2025-10-07 10:14         ` Salil Mehta via
  0 siblings, 1 reply; 67+ messages in thread
From: Richard Henderson @ 2025-10-02 15:41 UTC (permalink / raw)
  To: Salil Mehta, salil.mehta@opnsrc.net, qemu-devel@nongnu.org,
	qemu-arm@nongnu.org, mst@redhat.com

On 10/2/25 05:27, Salil Mehta wrote:
> Hi Richard,
> 
> Thanks for the reply. Please find my response inline.
> 
> Cheers.
> 
>> From: qemu-devel-bounces+salil.mehta=huawei.com@nongnu.org <qemu-
>> devel-bounces+salil.mehta=huawei.com@nongnu.org> On Behalf Of Richard
>> Henderson
>> Sent: Wednesday, October 1, 2025 10:34 PM
>> To: salil.mehta@opnsrc.net; qemu-devel@nongnu.org; qemu-
>> arm@nongnu.org; mst@redhat.com
>> Subject: Re: [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs
>> on first region alloc
>>
>> On 9/30/25 18:01, salil.mehta@opnsrc.net wrote:
>>> From: Salil Mehta <salil.mehta@huawei.com>
>>>
>>> The TCG code cache is split into regions shared by vCPUs under MTTCG.
>>> For cold-boot (early realized) vCPUs, regions are sized/allocated during
>> bring-up.
>>> However, when a vCPU is *lazy_realized* (administratively "disabled"
>>> at boot and realized later on demand), its TCGContext may fail the
>>> very first code region allocation if the shared TB cache is saturated
>>> by already-running vCPUs.
>>>
>>> Flushing the TB cache is the right remediation, but `tb_flush()` must
>>> be performed from the safe execution context
>> (cpu_exec_loop()/tb_gen_code()).
>>> This patch wires a deferred flush:
>>>
>>>     * In `tcg_region_initial_alloc__locked()`, treat an initial allocation
>>>       failure for a lazily realized vCPU as non-fatal: set `s->tbflush_pend`
>>>       and return.
>>>
>>>     * In `tcg_tb_alloc()`, if `s->tbflush_pend` is observed, clear it and
>>>       return NULL so the caller performs a synchronous `tb_flush()` and then
>>>       retries allocation.
>>>
>>> This avoids hangs observed when a newly realized vCPU cannot obtain
>>> its first region under TB-cache pressure, while keeping the flush at a safe
>> point.
>>>
>>> No change for cold-boot vCPUs and when accel ops is KVM.
>>>
>>> In earlier series, this patch was with below named,
>>> 'tcg: Update tcg_register_thread() leg to handle region alloc for hotplugged
>> vCPU'
>>
>>
>> I don't see why you need two different booleans for this.
> 
> 
> I can see your point. Maybe I can move `s->tbflush_pend`  to 'CPUState' instead?
> 
> 
>> It seems to me that you could create the cpu in a state for which the first call
>> to
>> tcg_tb_alloc() sees highwater state, and everything after that happens per
>> usual allocating a new region, and possibly flushing the full buffer.
> 
> 
> Correct. but with a distinction that highwater state is relevant to a TCGContext
> and the regions are allocated from a common pool 'Code Generation Buffer'.
> 'code_gen_highwater' is use to detect whether current context needs more
> region allocation for the dynamic translation to continue. This is a different
> condition than what we are encountering; which is the worst case condition
> that the entire code generation buffer is saturated and cannot even allocate
> a single free TCG region successfully.

I think you misunderstand "and everything after that happens per usual".

When allocating a tb, if a cpu finds that its current region is full, then it tries to 
allocate another region.  If that is not successful, then we flush the entire 
code_gen_buffer and try again.

Thus tbflush_pend is exactly equivalent to setting

     s->code_gen_ptr > s->code_gen_highwater.
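
A toy model of this equivalence, as a sketch only: all `toy_*` / `Toy*` names
are hypothetical and not QEMU code, offsets stand in for pointers, and the
flush only refills the pool because a single context is modelled. A context
created with its pointer already past the highwater mark takes the ordinary
"region full" path on its first allocation, so no extra flag is involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in QEMU. */
enum { TOY_REGION_SIZE = 64, TOY_TB_SIZE = 16 };

typedef struct ToyCtx {
    size_t code_ptr;    /* next free offset in the current region */
    size_t highwater;   /* allocating past this needs a new region */
} ToyCtx;

static int toy_free_regions;    /* shared pool, reduced to a counter */

/* Create a context that owns no region: code_ptr > highwater. */
static void toy_ctx_init_empty(ToyCtx *c)
{
    c->code_ptr = 1;
    c->highwater = 0;
}

/* Take one region from the pool; returns true on failure. */
static bool toy_region_alloc(ToyCtx *c)
{
    if (toy_free_regions == 0) {
        return true;
    }
    toy_free_regions--;
    c->code_ptr = 0;
    c->highwater = TOY_REGION_SIZE;
    return false;
}

/* Returns false when the caller has to flush and retry. */
static bool toy_tb_alloc(ToyCtx *c)
{
    if (c->code_ptr + TOY_TB_SIZE > c->highwater) {
        /* Region full, or never had one: same path either way. */
        if (toy_region_alloc(c)) {
            return false;
        }
    }
    c->code_ptr += TOY_TB_SIZE;
    return true;
}

/* Flush: return all regions to the pool. */
static void toy_flush(int total_regions)
{
    toy_free_regions = total_regions;
}
```

With an empty pool, the first `toy_tb_alloc()` on a context set up by
`toy_ctx_init_empty()` fails; after `toy_flush()` the retry obtains a region
and succeeds.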

As far as lazy_realized...  The utility of the assert under these conditions may be called 
into question; we could just remove it.


r~



* Re: [PATCH RFC V6 14/24] arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms
  2025-10-01  1:01 ` [PATCH RFC V6 14/24] arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms salil.mehta
@ 2025-10-03 14:58   ` Igor Mammedov
       [not found]     ` <7da6a9c470684754810414f0abd23a62@huawei.com>
  2025-10-24  4:47   ` Gavin Shan
  1 sibling, 1 reply; 67+ messages in thread
From: Igor Mammedov @ 2025-10-03 14:58 UTC (permalink / raw)
  To: salil.mehta
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	armbru, andrew.jones, david, philmd, eric.auger, will, ardb,
	oliver.upton, pbonzini, gshan, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu

On Wed,  1 Oct 2025 01:01:17 +0000
salil.mehta@opnsrc.net wrote:

> From: Salil Mehta <salil.mehta@huawei.com>
> 
> The existing ACPI CPU hotplug interface is built for x86 platforms where CPUs
> can be inserted or removed and resources are allocated dynamically. On ARM, CPUs
> are never hotpluggable: resources are allocated at boot and QOM vCPU objects
> always exist. Instead, CPUs are administratively managed by toggling ACPI _STA
> to enable or disable them, which gives a hotplug-like effect but does not match
> the x86 model.
> 
> Reusing the x86 hotplug AML code would complicate maintenance since much of its
> logic relies on toggling the _STA.Present bit to notify OSPM about CPU insertion
> or removal. Such usage is not architecturally valid on ARM, where CPUs cannot
> appear or disappear at runtime. Mixing both models in one interface would
> increase complexity and make the AML harder to extend. A separate path is
> therefore required. The new design is heavily inspired by the CPU hotplug
> interface but avoids its unsuitable semantics.

Let me ask how large the existing CPUHP AML code would become
if you reused it and added handling of the 'enabled' bit there?

Would it be the same 700 LOC as in this patch,
which is basically a duplication of the existing CPUHP ACPI interface?

> 
> This patch adds a dedicated CPU OSPM (Operating System Power Management)
> interface. It provides a memory-mapped control region with selector, flags,
> command, and data fields, and AML methods for device-check, eject request, and
> _OST reporting. OSPM is notified through GED events and can coordinate CPU
> events directly with QEMU. Other ARM-like architectures may also use this
> interface.
> 
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> ---
>  hw/acpi/Kconfig                        |   3 +
>  hw/acpi/acpi-cpu-ospm-interface-stub.c |  41 ++
>  hw/acpi/cpu_ospm_interface.c           | 747 +++++++++++++++++++++++++
>  hw/acpi/meson.build                    |   2 +
>  hw/acpi/trace-events                   |  17 +
>  hw/arm/Kconfig                         |   1 +
>  include/hw/acpi/cpu_ospm_interface.h   |  78 +++
>  7 files changed, 889 insertions(+)
>  create mode 100644 hw/acpi/acpi-cpu-ospm-interface-stub.c
>  create mode 100644 hw/acpi/cpu_ospm_interface.c
>  create mode 100644 include/hw/acpi/cpu_ospm_interface.h
> 
> diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
> index 1d4e9f0845..aa52f0468f 100644
> --- a/hw/acpi/Kconfig
> +++ b/hw/acpi/Kconfig
> @@ -21,6 +21,9 @@ config ACPI_ICH9
>  config ACPI_CPU_HOTPLUG
>      bool
>  
> +config ACPI_CPU_OSPM_INTERFACE
> +    bool
> +
>  config ACPI_MEMORY_HOTPLUG
>      bool
>      select MEM_DEVICE
> diff --git a/hw/acpi/acpi-cpu-ospm-interface-stub.c b/hw/acpi/acpi-cpu-ospm-interface-stub.c
> new file mode 100644
> index 0000000000..f6f333f641
> --- /dev/null
> +++ b/hw/acpi/acpi-cpu-ospm-interface-stub.c
> @@ -0,0 +1,41 @@
> +/*
> + * ACPI CPU OSPM Interface Handling.
> + *
> + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> + *
> + * Author: Salil Mehta <salil.mehta@huawei.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/acpi/cpu_ospm_interface.h"
> +
> +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                              uint32_t event_st, Error **errp)
> +{
> +}
> +
> +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                               uint32_t event_st, Error **errp)
> +{
> +}
> +
> +void acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev, Error **errp)
> +{
> +}
> +
> +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> +                                        AcpiCpuOspmState *state,
> +                                        hwaddr base_addr)
> +{
> +}
> +
> +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st, ACPIOSTInfoList ***list)
> +{
> +}
> diff --git a/hw/acpi/cpu_ospm_interface.c b/hw/acpi/cpu_ospm_interface.c
> new file mode 100644
> index 0000000000..61aab8a793
> --- /dev/null
> +++ b/hw/acpi/cpu_ospm_interface.c
> @@ -0,0 +1,747 @@
> +/*
> + * ACPI CPU OSPM Interface Handling.
> + *
> + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> + *
> + * Author: Salil Mehta <salil.mehta@huawei.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "migration/vmstate.h"
> +#include "hw/core/cpu.h"
> +#include "qapi/error.h"
> +#include "trace.h"
> +#include "qapi/qapi-events-acpi.h"
> +#include "hw/acpi/cpu_ospm_interface.h"
> +
> +/* CPU identifier and resource device */
> +#define CPU_NAME_FMT      "C%.03X" /* CPU name format (e.g., C001) */
> +#define CPU_RES_DEVICE    "CPUR" /* CPU resource device name */
> +#define CPU_DEVICE        "CPUS" /* CPUs device name */
> +#define CPU_LOCK          "CPLK" /* CPU lock object */
> +/* ACPI method(_STA, _EJ0, etc.) handlers */
> +#define CPU_STS_METHOD    "CSTA" /* CPU status method (_STA.Enabled) */
> +#define CPU_SCAN_METHOD   "CSCN" /* CPU scan method for enumeration */
> +#define CPU_NOTIFY_METHOD "CTFY" /* Notify method for CPU events */
> +#define CPU_EJECT_METHOD  "CEJ0" /* CPU eject method (_EJ0) */
> +#define CPU_OST_METHOD    "COST" /* OSPM status reporting (_OST) */
> +/* CPU MMIO region fields (in PRST region) */
> +#define CPU_SELECTOR      "CSEL" /* CPU selector index (WO) */
> +#define CPU_ENABLED_F     "CPEN" /* Flag: CPU enabled status(_STA) (RO) */
> +#define CPU_DEVCHK_F      "CDCK" /* Flag: Device-check event (RW) */
> +#define CPU_EJECTRQ_F     "CEJR" /* Flag: Eject-request event (RW)*/
> +#define CPU_EJECT_F       "CEJ0" /* Flag: Ejection trigger (WO) */
> +#define CPU_COMMAND       "CCMD" /* Command register (RW) */
> +#define CPU_DATA          "CDAT" /* Data register (RW) */
> +
> + /*
> + * CPU OSPM Interface MMIO Layout (Total: 16 bytes)
> + *
> + * +--------+--------+--------+--------+--------+--------+--------+--------+
> + * |  0x00  |  0x01  |  0x02  |  0x03  |  0x04  |  0x05  |  0x06  |  0x07  |
> + * +--------+--------+--------+--------+--------+--------+--------+--------+
> + * |       Selector (DWord, write-only)         | Flags  |Command |Reserved|
> + * |                                            | (RO/RW)|  (WO)  |(2B pad)|
> + * |        4 bytes (32 bits)                   | 1B     |   1B   | 2B     |
> + * +-----------------------------------------------------------------------+
> + * |  0x08  |  0x09  |  0x0A  |  0x0B  |  0x0C  |  0x0D  |  0x0E  |  0x0F  |
> + * +--------+--------+--------+--------+--------+--------+--------+--------+
> + * |                        Data (QWord, read/write)                       |
> + * |               Used by CPU scan and _OST methods (64 bits)             |
> + * +-----------------------------------------------------------------------+
> + *
> + * Field Overview:
> + *
> + * - Selector: 4 bytes @0x00 (DWord, WO)
> + *               - Selects target CPU index for the current operation.
> + * - Flags:    1 byte  @0x04 (RO/RW)
> + *               - Bit 0: ENABLED  – CPU is powered on (RO)
> + *               - Bit 1: DEVCHK   – Device-check completed (RW)
> + *               - Bit 2: EJECTRQ  – Guest requests CPU eject (RW)
> + *               - Bit 3: EJECT    – Trigger CPU ejection (WO)
> + *               - Bits 4–7: Reserved (write 0)
> + * - Command:  1 byte  @0x05 (WO)
> + *               - Specifies control operation (e.g., scan, _OST, eject).
> + * - Reserved: 2 bytes @0x06–0x07
> + *               - Alignment padding; must be zero on write.
> + * - Data:     8 bytes @0x08 (QWord, RW)
> + *               - Input/output for command-specific data.
> + *               - Used by CPU scan or _OST.
> + */
> +
> +/*
> + * Macros defining the CPU MMIO region layout. Change field sizes here to
> + * alter the overall MMIO region size.
> + */
> +/* Sub-Field sizes (in bytes) */
> +#define ACPI_CPU_MR_SELECTOR_SIZE  4 /* Write-only (DWord access) */
> +#define ACPI_CPU_MR_FLAGS_SIZE     1 /* Read-write (Byte access) */
> +#define ACPI_CPU_MR_RES_FLAGS_SIZE 0 /* Reserved padding */
> +#define ACPI_CPU_MR_CMD_SIZE       1 /* Write-only (Byte access) */
> +#define ACPI_CPU_MR_RES_CMD_SIZE   2 /* Reserved padding */
> +#define ACPI_CPU_MR_CMD_DATA_SIZE  8 /* Read-write (QWord access) */
> +
> +#define ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE \
> +    MAX_CONST(ACPI_CPU_MR_CMD_DATA_SIZE, \
> +    MAX_CONST(ACPI_CPU_MR_SELECTOR_SIZE, \
> +    MAX_CONST(ACPI_CPU_MR_CMD_SIZE, ACPI_CPU_MR_FLAGS_SIZE)))
> +
> +/* Validate layout against exported total length */
> +_Static_assert(ACPI_CPU_OSPM_IF_REG_LEN ==
> +               (ACPI_CPU_MR_SELECTOR_SIZE +
> +                ACPI_CPU_MR_FLAGS_SIZE +
> +                ACPI_CPU_MR_RES_FLAGS_SIZE +
> +                ACPI_CPU_MR_CMD_SIZE +
> +                ACPI_CPU_MR_RES_CMD_SIZE +
> +                ACPI_CPU_MR_CMD_DATA_SIZE),
> +               "ACPI_CPU_OSPM_IF_REG_LEN mismatch with internal MMIO layout");
> +
> +/* Sub-Field sizes (in bits) */
> +#define ACPI_CPU_MR_SELECTOR_SIZE_BITS \
> +    (ACPI_CPU_MR_SELECTOR_SIZE * BITS_PER_BYTE)  /* Write-only (DWord Acc) */
> +#define ACPI_CPU_MR_FLAGS_SIZE_BITS \
> +    (ACPI_CPU_MR_FLAGS_SIZE * BITS_PER_BYTE)     /* Read-write (Byte Acc) */
> +#define ACPI_CPU_MR_RES_FLAGS_SIZE_BITS \
> +    (ACPI_CPU_MR_RES_FLAGS_SIZE * BITS_PER_BYTE) /* Reserved padding */
> +#define ACPI_CPU_MR_CMD_SIZE_BITS \
> +    (ACPI_CPU_MR_CMD_SIZE * BITS_PER_BYTE)       /* Write-only (Byte Acc) */
> +#define ACPI_CPU_MR_RES_CMD_SIZE_BITS \
> +    (ACPI_CPU_MR_RES_CMD_SIZE * BITS_PER_BYTE)   /* Reserved padding */
> +#define ACPI_CPU_MR_CMD_DATA_SIZE_BITS \
> +    (ACPI_CPU_MR_CMD_DATA_SIZE * BITS_PER_BYTE)  /* Read-write (QWord Acc) */
> +
> +/* Field offsets (in bytes) */
> +#define ACPI_CPU_MR_SELECTOR_OFFSET_WO  0
> +#define ACPI_CPU_MR_FLAGS_OFFSET_RW \
> +    (ACPI_CPU_MR_SELECTOR_OFFSET_WO + \
> +     ACPI_CPU_MR_SELECTOR_SIZE)
> +#define ACPI_CPU_MR_CMD_OFFSET_WO \
> +    (ACPI_CPU_MR_FLAGS_OFFSET_RW + \
> +     ACPI_CPU_MR_FLAGS_SIZE + \
> +     ACPI_CPU_MR_RES_FLAGS_SIZE)
> +#define ACPI_CPU_MR_CMD_DATA_OFFSET_RW \
> +    (ACPI_CPU_MR_CMD_OFFSET_WO + \
> +     ACPI_CPU_MR_CMD_SIZE + \
> +     ACPI_CPU_MR_RES_CMD_SIZE)
> +
> +/* ensure all offsets are at their natural size alignment boundaries */
> +#define STATIC_ASSERT_FIELD_ALIGNMENT(offset, type, field_name)               \
> +    _Static_assert((offset) % sizeof(type) == 0,                              \
> +                   field_name " is not aligned to its natural boundary")
> +
> +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_SELECTOR_OFFSET_WO,
> +                              uint32_t, "Selector");
> +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_FLAGS_OFFSET_RW,
> +                              uint8_t, "Flags");
> +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_CMD_OFFSET_WO,
> +                              uint8_t, "Command");
> +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_CMD_DATA_OFFSET_RW,
> +                              uint64_t, "Command Data");
> +
> +/* Flag bit positions (used within 'flags' subfield) */
> +#define ACPI_CPU_FLAGS_USED_BITS 4
> +#define ACPI_CPU_MR_FLAGS_BIT_ENABLED BIT(0)
> +#define ACPI_CPU_MR_FLAGS_BIT_DEVCHK  BIT(1)
> +#define ACPI_CPU_MR_FLAGS_BIT_EJECTRQ BIT(2)
> +#define ACPI_CPU_MR_FLAGS_BIT_EJECT   BIT(ACPI_CPU_FLAGS_USED_BITS - 1)
> +
> +#define ACPI_CPU_MR_RES_FLAG_BITS (BITS_PER_BYTE - ACPI_CPU_FLAGS_USED_BITS)
> +
> +enum {
> +    ACPI_GET_NEXT_CPU_WITH_EVENT_CMD = 0,
> +    ACPI_OST_EVENT_CMD = 1,
> +    ACPI_OST_STATUS_CMD = 2,
> +    ACPI_CMD_MAX
> +};
> +
> +#define AML_APPEND_MR_RESVD_FIELD(mr_field, size_bits)       \
> +    do {                                                        \
> +        if ((size_bits) != 0) {                                 \
> +            aml_append((mr_field), aml_reserved_field(size_bits)); \
> +        }                                                       \
> +    } while (0)
> +
> +#define AML_APPEND_MR_NAMED_FIELD(mr_field, name, size_bits)    \
> +    do {                                                        \
> +        if ((size_bits) != 0) {                                 \
> +            aml_append((mr_field), aml_named_field((name), (size_bits))); \
> +        }                                                       \
> +    } while (0)
> +
> +#define AML_CPU_RES_DEV(base, field) \
> +        aml_name("%s.%s.%s", (base), CPU_RES_DEVICE, (field))
> +
> +static ACPIOSTInfo *
> +acpi_cpu_ospm_ost_status(int idx, AcpiCpuOspmStateStatus *cdev)
> +{
> +    ACPIOSTInfo *info = g_new0(ACPIOSTInfo, 1);
> +
> +    info->source = cdev->ost_event;
> +    info->status = cdev->ost_status;
> +    if (cdev->cpu) {
> +        DeviceState *dev = DEVICE(cdev->cpu);
> +        if (dev->id) {
> +            info->device = g_strdup(dev->id);
> +        }
> +    }
> +    return info;
> +}
> +
> +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st, ACPIOSTInfoList ***list)
> +{
> +    ACPIOSTInfoList ***tail = list;
> +    int i;
> +
> +    for (i = 0; i < cpu_st->dev_count; i++) {
> +        QAPI_LIST_APPEND(*tail, acpi_cpu_ospm_ost_status(i, &cpu_st->devs[i]));
> +    }
> +}
> +
> +static uint64_t
> +acpi_cpu_ospm_intf_mr_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    AcpiCpuOspmState *cpu_st = opaque;
> +    AcpiCpuOspmStateStatus *cdev;
> +    uint64_t val = 0;
> +
> +    if (cpu_st->selector >= cpu_st->dev_count) {
> +        return val;
> +    }
> +    cdev = &cpu_st->devs[cpu_st->selector];
> +    switch (addr) {
> +    case ACPI_CPU_MR_FLAGS_OFFSET_RW:
> +        val |= qdev_check_enabled(DEVICE(cdev->cpu)) ?
> +                                  ACPI_CPU_MR_FLAGS_BIT_ENABLED : 0;
> +        val |= cdev->devchk_pending ? ACPI_CPU_MR_FLAGS_BIT_DEVCHK : 0;
> +        val |= cdev->ejrqst_pending ? ACPI_CPU_MR_FLAGS_BIT_EJECTRQ : 0;
> +        trace_acpi_cpuos_if_read_flags(cpu_st->selector, val);
> +        break;
> +    case ACPI_CPU_MR_CMD_DATA_OFFSET_RW:
> +        switch (cpu_st->command) {
> +        case ACPI_GET_NEXT_CPU_WITH_EVENT_CMD:
> +            val = cpu_st->selector;
> +            break;
> +        default:
> +            trace_acpi_cpuos_if_read_invalid_cmd_data(cpu_st->selector,
> +                                                      cpu_st->command);
> +            break;
> +        }
> +        trace_acpi_cpuos_if_read_cmd_data(cpu_st->selector, val);
> +        break;
> +    default:
> +        break;
> +    }
> +    return val;
> +}
> +
> +static void
> +acpi_cpu_ospm_intf_mr_write(void *opaque, hwaddr addr, uint64_t data,
> +                            unsigned int size)
> +{
> +    AcpiCpuOspmState *cpu_st = opaque;
> +    AcpiCpuOspmStateStatus *cdev;
> +    ACPIOSTInfo *info;
> +
> +    assert(cpu_st->dev_count);
> +    if (addr) {
> +        if (cpu_st->selector >= cpu_st->dev_count) {
> +            trace_acpi_cpuos_if_invalid_idx_selected(cpu_st->selector);
> +            return;
> +        }
> +    }
> +
> +    switch (addr) {
> +    case ACPI_CPU_MR_SELECTOR_OFFSET_WO: /* current CPU selector */
> +        cpu_st->selector = data;
> +        trace_acpi_cpuos_if_write_idx(cpu_st->selector);
> +        break;
> +    case ACPI_CPU_MR_FLAGS_OFFSET_RW: /* clear pending events or eject */
> +        cdev = &cpu_st->devs[cpu_st->selector];
> +        if (data & ACPI_CPU_MR_FLAGS_BIT_DEVCHK) {
> +            /* clear device-check pending event */
> +            cdev->devchk_pending = false;
> +            trace_acpi_cpuos_if_clear_devchk_evt(cpu_st->selector);
> +        } else if (data & ACPI_CPU_MR_FLAGS_BIT_EJECTRQ) {
> +            /* clear eject-request pending event */
> +            cdev->ejrqst_pending = false;
> +            trace_acpi_cpuos_if_clear_ejrqst_evt(cpu_st->selector);
> +        } else if (data & ACPI_CPU_MR_FLAGS_BIT_EJECT) {
> +            DeviceState *dev = NULL;
> +            if (!cdev->cpu || cdev->cpu == first_cpu) {
> +                trace_acpi_cpuos_if_ejecting_invalid_cpu(cpu_st->selector);
> +                break;
> +            }
> +            /*
> +             * OSPM has responded with eject. It is therefore safe to put the
> +             * CPU device into the powered-off state.
> +             */
> +            trace_acpi_cpuos_if_ejecting_cpu(cpu_st->selector);
> +            dev = DEVICE(cdev->cpu);
> +            qdev_sync_disable(dev, &error_fatal);
> +        }
> +        break;
> +    case ACPI_CPU_MR_CMD_OFFSET_WO:
> +        trace_acpi_cpuos_if_write_cmd(cpu_st->selector, data);
> +        if (data < ACPI_CMD_MAX) {
> +            cpu_st->command = data;
> +            if (cpu_st->command == ACPI_GET_NEXT_CPU_WITH_EVENT_CMD) {
> +                uint32_t iter = cpu_st->selector;
> +
> +                do {
> +                    cdev = &cpu_st->devs[iter];
> +                    if (cdev->devchk_pending || cdev->ejrqst_pending) {
> +                        cpu_st->selector = iter;
> +                        trace_acpi_cpuos_if_cpu_has_events(cpu_st->selector,
> +                            cdev->devchk_pending, cdev->ejrqst_pending);
> +                        break;
> +                    }
> +                    iter = iter + 1 < cpu_st->dev_count ? iter + 1 : 0;
> +                } while (iter != cpu_st->selector);
> +            }
> +        }
> +        break;
> +    case ACPI_CPU_MR_CMD_DATA_OFFSET_RW:
> +        switch (cpu_st->command) {
> +        case ACPI_OST_EVENT_CMD: {
> +            cdev = &cpu_st->devs[cpu_st->selector];
> +            cdev->ost_event = data;
> +            trace_acpi_cpuos_if_write_ost_ev(cpu_st->selector,
> +                                             cdev->ost_event);
> +            break;
> +        }
> +        case ACPI_OST_STATUS_CMD: {
> +            cdev = &cpu_st->devs[cpu_st->selector];
> +            cdev->ost_status = data;
> +            info = acpi_cpu_ospm_ost_status(cpu_st->selector, cdev);
> +            qapi_event_send_acpi_device_ost(info);
> +            qapi_free_ACPIOSTInfo(info);
> +            trace_acpi_cpuos_if_write_ost_status(cpu_st->selector,
> +                                                 cdev->ost_status);
> +            break;
> +        }
> +        default:
> +            trace_acpi_cpuos_if_write_invalid_cmd(cpu_st->selector,
> +                                                  cpu_st->command);
> +            break;
> +        }
> +        break;
> +    default:
> +        trace_acpi_cpuos_if_write_invalid_offset(cpu_st->selector, addr);
> +        break;
> +    }
> +}
> +
> +static const MemoryRegionOps cpu_common_mr_ops = {
> +    .read = acpi_cpu_ospm_intf_mr_read,
> +    .write = acpi_cpu_ospm_intf_mr_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .valid = {
> +        .min_access_size = 1,
> +        .max_access_size = ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
> +    },
> +    .impl = {
> +        .min_access_size = 1,
> +        .max_access_size = ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
> +        .unaligned = false,
> +    },
> +};
> +
> +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> +                                        AcpiCpuOspmState *state,
> +                                        hwaddr base_addr)
> +{
> +    MachineState *machine = MACHINE(qdev_get_machine());
> +    MachineClass *mc = MACHINE_GET_CLASS(machine);
> +    const CPUArchIdList *id_list;
> +    int i;
> +
> +    assert(mc->possible_cpu_arch_ids);
> +    id_list = mc->possible_cpu_arch_ids(machine);
> +    state->dev_count = id_list->len;
> +    state->devs = g_new0(typeof(*state->devs), state->dev_count);
> +    for (i = 0; i < id_list->len; i++) {
> +        state->devs[i].cpu = CPU(id_list->cpus[i].cpu);
> +        state->devs[i].arch_id = id_list->cpus[i].arch_id;
> +    }
> +    memory_region_init_io(&state->ctrl_reg, owner, &cpu_common_mr_ops, state,
> +                          "ACPI CPU OSPM State Interface Memory Region",
> +                          ACPI_CPU_OSPM_IF_REG_LEN);
> +    memory_region_add_subregion(as, base_addr, &state->ctrl_reg);
> +}
> +
> +static AcpiCpuOspmStateStatus *
> +acpi_get_cpu_status(AcpiCpuOspmState *cpu_st, DeviceState *dev)
> +{
> +    CPUClass *k = CPU_GET_CLASS(dev);
> +    uint64_t cpu_arch_id = k->get_arch_id(CPU(dev));
> +    int i;
> +
> +    for (i = 0; i < cpu_st->dev_count; i++) {
> +        if (cpu_arch_id == cpu_st->devs[i].arch_id) {
> +            return &cpu_st->devs[i];
> +        }
> +    }
> +    return NULL;
> +}
> +
> +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                              uint32_t event_st, Error **errp)
> +{
> +    AcpiCpuOspmStateStatus *cdev;
> +    cdev = acpi_get_cpu_status(cpu_st, dev);
> +    if (!cdev) {
> +        return;
> +    }
> +    assert(cdev->cpu);
> +
> +    /*
> +     * Tell OSPM via GED IRQ(GSI) that a powered-off CPU is being powered on.
> +     * Also, mark the 'device-check' event pending for this CPU. This will
> +     * eventually result in OSPM evaluating the ACPI _EVT method and
> +     * scanning the CPUs.
> +     */
> +    cdev->devchk_pending = true;
> +    acpi_send_event(cpu_st->acpi_dev, event_st);
> +}
> +
> +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                               uint32_t event_st, Error **errp)
> +{
> +    AcpiCpuOspmStateStatus *cdev;
> +    cdev = acpi_get_cpu_status(cpu_st, dev);
> +    if (!cdev) {
> +        return;
> +    }
> +    assert(cdev->cpu);
> +
> +    /*
> +     * Tell OSPM via GED IRQ(GSI) that a CPU wants to power off or go on
> +     * standby. Also, mark the 'eject-request' event pending for this CPU
> +     * (graceful shutdown).
> +     */
> +    cdev->ejrqst_pending = true;
> +    acpi_send_event(cpu_st->acpi_dev, event_st);
> +}
> +
> +void
> +acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev, Error **errp)
> +{
> +    /* TODO: possible handling here */
> +}
> +
> +static const VMStateDescription vmstate_cpu_ospm_state_sts = {
> +    .name = "CPU OSPM state status",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (const VMStateField[]) {
> +        VMSTATE_BOOL(devchk_pending, AcpiCpuOspmStateStatus),
> +        VMSTATE_BOOL(ejrqst_pending, AcpiCpuOspmStateStatus),
> +        VMSTATE_UINT32(ost_event, AcpiCpuOspmStateStatus),
> +        VMSTATE_UINT32(ost_status, AcpiCpuOspmStateStatus),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +const VMStateDescription vmstate_cpu_ospm_state = {
> +    .name = "CPU OSPM state",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (const VMStateField[]) {
> +        VMSTATE_UINT32(selector, AcpiCpuOspmState),
> +        VMSTATE_UINT8(command, AcpiCpuOspmState),
> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(devs, AcpiCpuOspmState,
> +                                             dev_count,
> +                                             vmstate_cpu_ospm_state_sts,
> +                                             AcpiCpuOspmStateStatus),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +void acpi_build_cpus_aml(Aml *table, hwaddr base_addr, const char *root,
> +                         const char *event_handler_method)
> +{
> +    MachineState *machine = MACHINE(qdev_get_machine());
> +    MachineClass *mc = MACHINE_GET_CLASS(machine);
> +    const CPUArchIdList *arch_ids = mc->possible_cpu_arch_ids(machine);
> +    Aml *sb_scope = aml_scope("_SB"); /* System Bus Scope */
> +    Aml *ifctx, *field, *method, *cpu_res_dev, *cpus_dev;
> +    Aml *zero = aml_int(0);
> +    Aml *one = aml_int(1);
> +
> +    cpu_res_dev = aml_device("%s.%s", root, CPU_RES_DEVICE);
> +    {
> +        Aml *crs;
> +
> +        aml_append(cpu_res_dev,
> +            aml_name_decl("_HID", aml_eisaid("PNP0A06")));
> +        aml_append(cpu_res_dev,
> +            aml_name_decl("_UID", aml_string("CPU OSPM Interface resources")));
> +        aml_append(cpu_res_dev, aml_mutex(CPU_LOCK, 0));
> +
> +        crs = aml_resource_template();
> +        aml_append(crs, aml_memory32_fixed(base_addr, ACPI_CPU_OSPM_IF_REG_LEN,
> +                   AML_READ_WRITE));
> +
> +        aml_append(cpu_res_dev, aml_name_decl("_CRS", crs));
> +
> +        /* declare CPU OSPM Interface MMIO region related access fields */
> +        aml_append(cpu_res_dev,
> +                   aml_operation_region("PRST", AML_SYSTEM_MEMORY,
> +                                        aml_int(base_addr),
> +                                        ACPI_CPU_OSPM_IF_REG_LEN));
> +
> +        /*
> +         * define named fields within PRST region with 'Byte' access widths
> +         * and reserve fields with other access width
> +         */
> +        field = aml_field("PRST", AML_BYTE_ACC, AML_NOLOCK, AML_PRESERVE);
> +        /* reserve CPU 'selector' field (size in bits) */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_SELECTOR_SIZE_BITS);
> +        /* Flag::Enabled Bit(RO) - Read '1' if enabled */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_ENABLED_F, 1);
> +        /* Flag::Devchk Bit(RW) - Read '1', has an event. Write '1' to clear */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_DEVCHK_F, 1);
> +        /* Flag::Ejectrq Bit(RW) - Read '1', has an event. Write '1' to clear */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_EJECTRQ_F, 1);
> +        /* Flag::Eject Bit(WO) - OSPM evals _EJx, initiates CPU eject in QEMU */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_EJECT_F, 1);
> +        /* Flag::Bit(ACPI_CPU_FLAGS_USED_BITS)-Bit(7) - Reserve leftover bits */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_FLAG_BITS);
> +        /* Reserved space: padding after flags */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_FLAGS_SIZE_BITS);
> +        /* Command field written by OSPM */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_COMMAND,
> +                                  ACPI_CPU_MR_CMD_SIZE_BITS);
> +        /* Reserved space: padding after command field */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_CMD_SIZE_BITS);
> +        /* Command data (64-bit): reserved here, named in the Qword field */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_CMD_DATA_SIZE_BITS);
> +        aml_append(cpu_res_dev, field);
> +
> +        /*
> +         * define named fields with 'Dword' access widths and reserve fields
> +         * with other access width
> +         */
> +        field = aml_field("PRST", AML_DWORD_ACC, AML_NOLOCK, AML_PRESERVE);
> +        /* CPU selector, write only */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_SELECTOR,
> +                                  ACPI_CPU_MR_SELECTOR_SIZE_BITS);
> +        aml_append(cpu_res_dev, field);
> +
> +        /*
> +         * define named fields with 'Qword' access widths and reserve fields
> +         * with other access width
> +         */
> +        field = aml_field("PRST", AML_QWORD_ACC, AML_NOLOCK, AML_PRESERVE);
> +        /*
> +         * Reserve space: selector, flags, reserved flags, command, reserved
> +         * command for Qword alignment.
> +         */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_SELECTOR_SIZE_BITS +
> +                                            ACPI_CPU_MR_FLAGS_SIZE_BITS +
> +                                            ACPI_CPU_MR_RES_FLAGS_SIZE_BITS +
> +                                            ACPI_CPU_MR_CMD_SIZE_BITS +
> +                                            ACPI_CPU_MR_RES_CMD_SIZE_BITS);
> +        /* Command data accessible via Qword */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_DATA,
> +                                  ACPI_CPU_MR_CMD_DATA_SIZE_BITS);
> +        aml_append(cpu_res_dev, field);
> +    }
> +    aml_append(sb_scope, cpu_res_dev);
> +
> +    cpus_dev = aml_device("%s.%s", root, CPU_DEVICE);
> +    {
> +        Aml *ctrl_lock = AML_CPU_RES_DEV(root, CPU_LOCK);
> +        Aml *cpu_selector = AML_CPU_RES_DEV(root, CPU_SELECTOR);
> +        Aml *is_enabled = AML_CPU_RES_DEV(root, CPU_ENABLED_F);
> +        Aml *dvchk_evt = AML_CPU_RES_DEV(root, CPU_DEVCHK_F);
> +        Aml *ejrq_evt = AML_CPU_RES_DEV(root, CPU_EJECTRQ_F);
> +        Aml *ej_evt = AML_CPU_RES_DEV(root, CPU_EJECT_F);
> +        Aml *cpu_cmd = AML_CPU_RES_DEV(root, CPU_COMMAND);
> +        Aml *cpu_data = AML_CPU_RES_DEV(root, CPU_DATA);
> +        int i;
> +
> +        aml_append(cpus_dev, aml_name_decl("_HID", aml_string("ACPI0010")));
> +        aml_append(cpus_dev, aml_name_decl("_CID", aml_eisaid("PNP0A05")));
> +
> +        method = aml_method(CPU_NOTIFY_METHOD, 2, AML_NOTSERIALIZED);
> +        for (i = 0; i < arch_ids->len; i++) {
> +            Aml *cpu = aml_name(CPU_NAME_FMT, i);
> +            Aml *uid = aml_arg(0);
> +            Aml *event = aml_arg(1);
> +
> +            ifctx = aml_if(aml_equal(uid, aml_int(i)));
> +            {
> +                aml_append(ifctx, aml_notify(cpu, event));
> +            }
> +            aml_append(method, ifctx);
> +        }
> +        aml_append(cpus_dev, method);
> +
> +        method = aml_method(CPU_STS_METHOD, 1, AML_SERIALIZED);
> +        {
> +            Aml *idx = aml_arg(0);
> +            Aml *sta = aml_local(0);
> +            Aml *else_ctx;
> +
> +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> +            aml_append(method, aml_store(idx, cpu_selector));
> +            aml_append(method, aml_store(zero, sta));
> +            ifctx = aml_if(aml_equal(is_enabled, one));
> +            {
> +                /* cpu is present and enabled */
> +                aml_append(ifctx, aml_store(aml_int(0xF), sta));
> +            }
> +            aml_append(method, ifctx);
> +            else_ctx = aml_else();
> +            {
> +                /* cpu is present but disabled */
> +                aml_append(else_ctx, aml_store(aml_int(0xD), sta));
> +            }
> +            aml_append(method, else_ctx);
> +            aml_append(method, aml_release(ctrl_lock));
> +            aml_append(method, aml_return(sta));
> +        }
> +        aml_append(cpus_dev, method);
> +
> +        method = aml_method(CPU_EJECT_METHOD, 1, AML_SERIALIZED);
> +        {
> +            Aml *idx = aml_arg(0);
> +
> +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> +            aml_append(method, aml_store(idx, cpu_selector));
> +            aml_append(method, aml_store(one, ej_evt));
> +            aml_append(method, aml_release(ctrl_lock));
> +        }
> +        aml_append(cpus_dev, method);
> +
> +        method = aml_method(CPU_SCAN_METHOD, 0, AML_SERIALIZED);
> +        {
> +            Aml *has_event = aml_local(0); /* Local0: Loop control flag */
> +            Aml *uid = aml_local(1); /* Local1: Current CPU UID */
> +            /* Constants */
> +            Aml *dev_chk = aml_int(1); /* Notify: device check to enable */
> +            Aml *eject_req = aml_int(3); /* Notify: eject for removal */
> +            Aml *next_cpu_cmd = aml_int(ACPI_GET_NEXT_CPU_WITH_EVENT_CMD);
> +
> +            /* Acquire CPU lock */
> +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> +
> +            /* Initialize loop */
> +            aml_append(method, aml_store(zero, uid));
> +            aml_append(method, aml_store(one, has_event));
> +
> +            Aml *while_ctx = aml_while(aml_land(
> +                aml_equal(has_event, one),
> +                aml_lless(uid, aml_int(arch_ids->len))
> +            ));
> +            {
> +                aml_append(while_ctx, aml_store(zero, has_event));
> +                /*
> +                 * Issue scan cmd: QEMU will return next CPU with event in
> +                 * cpu_data
> +                 */
> +                aml_append(while_ctx, aml_store(uid, cpu_selector));
> +                aml_append(while_ctx, aml_store(next_cpu_cmd, cpu_cmd));
> +
> +                /* If scan wrapped around to an earlier UID, exit loop */
> +                Aml *wrap_check = aml_if(aml_lless(cpu_data, uid));
> +                aml_append(wrap_check, aml_break());
> +                aml_append(while_ctx, wrap_check);
> +
> +                /* Set UID to scanned result */
> +                aml_append(while_ctx, aml_store(cpu_data, uid));
> +
> +                /* send CPU device-check(resume) event to OSPM */
> +                Aml *if_devchk = aml_if(aml_equal(dvchk_evt, one));
> +                {
> +                    aml_append(if_devchk,
> +                        aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
> +                    /* write 1 to clear the pending device-check event */
> +                    aml_append(if_devchk, aml_store(one, dvchk_evt));
> +                    aml_append(if_devchk, aml_store(one, has_event));
> +                }
> +                aml_append(while_ctx, if_devchk);
> +
> +                /*
> +                 * send CPU eject-request event to OSPM to gracefully handle
> +                 * OSPM related tasks running on this CPU
> +                 */
> +                Aml *else_ctx = aml_else();
> +                Aml *if_ejrq = aml_if(aml_equal(ejrq_evt, one));
> +                {
> +                    aml_append(if_ejrq,
> +                        aml_call2(CPU_NOTIFY_METHOD, uid, eject_req));
> +                    /* write 1 to clear the pending eject-request event */
> +                    aml_append(if_ejrq, aml_store(one, ejrq_evt));
> +                    aml_append(if_ejrq, aml_store(one, has_event));
> +                }
> +                aml_append(else_ctx, if_ejrq);
> +                aml_append(while_ctx, else_ctx);
> +
> +                /* Increment UID */
> +                aml_append(while_ctx, aml_increment(uid));
> +            }
> +            aml_append(method, while_ctx);
> +
> +            /* Release cpu lock */
> +            aml_append(method, aml_release(ctrl_lock));
> +        }
> +        aml_append(cpus_dev, method);
> +
> +        method = aml_method(CPU_OST_METHOD, 4, AML_SERIALIZED);
> +        {
> +            Aml *uid = aml_arg(0);
> +            Aml *ev_cmd = aml_int(ACPI_OST_EVENT_CMD);
> +            Aml *st_cmd = aml_int(ACPI_OST_STATUS_CMD);
> +
> +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> +            aml_append(method, aml_store(uid, cpu_selector));
> +            aml_append(method, aml_store(ev_cmd, cpu_cmd));
> +            aml_append(method, aml_store(aml_arg(1), cpu_data));
> +            aml_append(method, aml_store(st_cmd, cpu_cmd));
> +            aml_append(method, aml_store(aml_arg(2), cpu_data));
> +            aml_append(method, aml_release(ctrl_lock));
> +        }
> +        aml_append(cpus_dev, method);
> +
> +        /* build Processor object for each processor */
> +        for (i = 0; i < arch_ids->len; i++) {
> +            Aml *dev;
> +            Aml *uid = aml_int(i);
> +
> +            dev = aml_device(CPU_NAME_FMT, i);
> +            aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0007")));
> +            aml_append(dev, aml_name_decl("_UID", uid));
> +
> +            method = aml_method("_STA", 0, AML_SERIALIZED);
> +            aml_append(method, aml_return(aml_call1(CPU_STS_METHOD, uid)));
> +            aml_append(dev, method);
> +
> +            if (CPU(arch_ids->cpus[i].cpu) != first_cpu) {
> +                method = aml_method("_EJ0", 1, AML_NOTSERIALIZED);
> +                aml_append(method, aml_call1(CPU_EJECT_METHOD, uid));
> +                aml_append(dev, method);
> +            }
> +
> +            method = aml_method("_OST", 3, AML_SERIALIZED);
> +            aml_append(method,
> +                aml_call4(CPU_OST_METHOD, uid, aml_arg(0),
> +                          aml_arg(1), aml_arg(2))
> +            );
> +            aml_append(dev, method);
> +            aml_append(cpus_dev, dev);
> +        }
> +    }
> +    aml_append(sb_scope, cpus_dev);
> +    aml_append(table, sb_scope);
> +
> +    method = aml_method(event_handler_method, 0, AML_NOTSERIALIZED);
> +    aml_append(method, aml_call0("\\_SB.CPUS." CPU_SCAN_METHOD));
> +    aml_append(table, method);
> +}
> diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
> index 73f02b9691..6d83396ab4 100644
> --- a/hw/acpi/meson.build
> +++ b/hw/acpi/meson.build
> @@ -8,6 +8,8 @@ acpi_ss.add(files(
>  ))
>  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_true: files('cpu.c', 'cpu_hotplug.c'))
>  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_false: files('acpi-cpu-hotplug-stub.c'))
> +acpi_ss.add(when: 'CONFIG_ACPI_CPU_OSPM_INTERFACE', if_true: files('cpu_ospm_interface.c'))
> +acpi_ss.add(when: 'CONFIG_ACPI_CPU_OSPM_INTERFACE', if_false: files('acpi-cpu-ospm-interface-stub.c'))
>  acpi_ss.add(when: 'CONFIG_ACPI_MEMORY_HOTPLUG', if_true: files('memory_hotplug.c'))
>  acpi_ss.add(when: 'CONFIG_ACPI_MEMORY_HOTPLUG', if_false: files('acpi-mem-hotplug-stub.c'))
>  acpi_ss.add(when: 'CONFIG_ACPI_NVDIMM', if_true: files('nvdimm.c'))
> diff --git a/hw/acpi/trace-events b/hw/acpi/trace-events
> index edc93e703c..c0ecbdd48f 100644
> --- a/hw/acpi/trace-events
> +++ b/hw/acpi/trace-events
> @@ -40,6 +40,23 @@ cpuhp_acpi_fw_remove_cpu(uint32_t idx) "0x%"PRIx32
>  cpuhp_acpi_write_ost_ev(uint32_t slot, uint32_t ev) "idx[0x%"PRIx32"] OST EVENT: 0x%"PRIx32
>  cpuhp_acpi_write_ost_status(uint32_t slot, uint32_t st) "idx[0x%"PRIx32"] OST STATUS: 0x%"PRIx32
>  
> +# cpu_ospm_interface.c
> +acpi_cpuos_if_invalid_idx_selected(uint32_t idx) "selector idx[0x%"PRIx32"]"
> +acpi_cpuos_if_read_flags(uint32_t idx, uint8_t flags) "cpu idx[0x%"PRIx32"] flags: 0x%"PRIx8
> +acpi_cpuos_if_write_idx(uint32_t idx) "set active cpu idx: 0x%"PRIx32
> +acpi_cpuos_if_write_cmd(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] cmd: 0x%"PRIx8
> +acpi_cpuos_if_write_invalid_cmd(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] invalid cmd: 0x%"PRIx8
> +acpi_cpuos_if_write_invalid_offset(uint32_t idx, uint64_t addr) "cpu idx[0x%"PRIx32"] invalid offset: 0x%"PRIx64
> +acpi_cpuos_if_read_cmd_data(uint32_t idx, uint32_t data) "cpu idx[0x%"PRIx32"] data: 0x%"PRIx32
> +acpi_cpuos_if_read_invalid_cmd_data(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] invalid cmd: 0x%"PRIx8
> +acpi_cpuos_if_cpu_has_events(uint32_t idx, bool devchk, bool ejrqst) "cpu idx[0x%"PRIx32"] device-check pending: %d, eject-request pending: %d"
> +acpi_cpuos_if_clear_devchk_evt(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> +acpi_cpuos_if_clear_ejrqst_evt(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> +acpi_cpuos_if_ejecting_invalid_cpu(uint32_t idx) "invalid cpu idx[0x%"PRIx32"]"
> +acpi_cpuos_if_ejecting_cpu(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> +acpi_cpuos_if_write_ost_ev(uint32_t idx, uint32_t ev) "cpu idx[0x%"PRIx32"] OST Event: 0x%"PRIx32
> +acpi_cpuos_if_write_ost_status(uint32_t idx, uint32_t st) "cpu idx[0x%"PRIx32"] OST Status: 0x%"PRIx32
> +
>  # pcihp.c
>  acpi_pci_eject_slot(unsigned bsel, unsigned slot) "bsel: %u slot: %u"
>  acpi_pci_unplug(int bsel, int slot) "bsel: %d slot: %d"
> diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
> index 2aa4b5d778..c9991e00c7 100644
> --- a/hw/arm/Kconfig
> +++ b/hw/arm/Kconfig
> @@ -39,6 +39,7 @@ config ARM_VIRT
>      select VIRTIO_MEM_SUPPORTED
>      select ACPI_CXL
>      select ACPI_HMAT
> +    select ACPI_CPU_OSPM_INTERFACE
>  
>  config CUBIEBOARD
>      bool
> diff --git a/include/hw/acpi/cpu_ospm_interface.h b/include/hw/acpi/cpu_ospm_interface.h
> new file mode 100644
> index 0000000000..5dda327a34
> --- /dev/null
> +++ b/include/hw/acpi/cpu_ospm_interface.h
> @@ -0,0 +1,78 @@
> +/*
> + * ACPI CPU OSPM Interface Handling.
> + *
> + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> + *
> + * Author: Salil Mehta <salil.mehta@huawei.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +#ifndef CPU_OSPM_INTERFACE_H
> +#define CPU_OSPM_INTERFACE_H
> +
> +#include "qapi/qapi-types-acpi.h"
> +#include "hw/qdev-core.h"
> +#include "hw/acpi/acpi.h"
> +#include "hw/acpi/aml-build.h"
> +#include "hw/boards.h"
> +
> +/**
> + * Total size (in bytes) of the ACPI CPU OSPM Interface MMIO region.
> + *
> + * This region contains control and status fields such as CPU selector,
> + * flags, command register, and data register. It must exactly match the
> + * layout defined in the AML code and the memory region implementation.
> + *
> + * Any mismatch between this definition and the AML layout may result in
> + * runtime errors or build-time assertion failures (e.g., _Static_assert),
> + * breaking correct device emulation and guest OS coordination.
> + */
> +#define ACPI_CPU_OSPM_IF_REG_LEN 16
> +
> +typedef struct {
> +    CPUState *cpu;
> +    uint64_t arch_id;
> +    bool devchk_pending; /* device-check pending */
> +    bool ejrqst_pending; /* eject-request pending */
> +    uint32_t ost_event;
> +    uint32_t ost_status;
> +} AcpiCpuOspmStateStatus;
> +
> +typedef struct AcpiCpuOspmState {
> +    DeviceState *acpi_dev;
> +    MemoryRegion ctrl_reg;
> +    uint32_t selector;
> +    uint8_t command;
> +    uint32_t dev_count;
> +    AcpiCpuOspmStateStatus *devs;
> +} AcpiCpuOspmState;
> +
> +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                              uint32_t event_st, Error **errp);
> +
> +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                               uint32_t event_st, Error **errp);
> +
> +void acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                       Error **errp);
> +
> +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> +                                        AcpiCpuOspmState *state,
> +                                        hwaddr base_addr);
> +
> +void acpi_build_cpus_aml(Aml *table, hwaddr base_addr, const char *root,
> +                         const char *event_handler_method);
> +
> +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st,
> +                           ACPIOSTInfoList ***list);
> +
> +extern const VMStateDescription vmstate_cpu_ospm_state;
> +#define VMSTATE_CPU_OSPM_STATE(cpuospm, state) \
> +    VMSTATE_STRUCT(cpuospm, state, 1, \
> +                   vmstate_cpu_ospm_state, AcpiCpuOspmState)
> +#endif  /* CPU_OSPM_INTERFACE_H */




* Re: [PATCH RFC V6 10/24] arm/virt: Init PMU at host for all present vCPUs
  2025-10-01  1:01 ` [PATCH RFC V6 10/24] arm/virt: Init PMU at host for all present vCPUs salil.mehta
@ 2025-10-03 15:02   ` Igor Mammedov
  0 siblings, 0 replies; 67+ messages in thread
From: Igor Mammedov @ 2025-10-03 15:02 UTC (permalink / raw)
  To: salil.mehta
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	armbru, andrew.jones, david, philmd, eric.auger, will, ardb,
	oliver.upton, pbonzini, gshan, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu

On Wed,  1 Oct 2025 01:01:13 +0000
salil.mehta@opnsrc.net wrote:

> From: Salil Mehta <salil.mehta@huawei.com>
> 

...

> diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
> index 5eaf41a566..2ee202a8a5 100644
> --- a/include/hw/core/cpu.h
> +++ b/include/hw/core/cpu.h
> @@ -602,6 +602,63 @@ extern CPUTailQ cpus_queue;
>  #define CPU_FOREACH_SAFE(cpu, next_cpu) \
>      QTAILQ_FOREACH_SAFE_RCU(cpu, &cpus_queue, node, next_cpu)
>  
> +
> +/**
> + * CPU_FOREACH_POSSIBLE(cpu_, archid_list_)
> + *
> + * Iterate over all entries in a CPUArchIdList, assigning each entry’s
> + * CPUState* to @cpu_. This hides the loop index and reads like a normal
> + * C for-loop.
> + *
> + * A CPUArchIdList represents the set of *possible* CPUs for a machine.
> + * Each entry contains:
> + *   - @cpu:        CPUState pointer, or NULL if not realized yet
> + *   - @arch_id:    architecture-specific identifier (e.g. MPIDR)
> + *   - @vcpus_count: number of vCPUs represented (usually 1)
> + *
> + * The list models *possible* CPUs: it includes (a) currently plugged vCPUs
> + * made available through hotplug, (b) present (and perhaps visible to OSPM)
> + * but kept ACPI-disabled vCPUs, and (c) reserved slots for CPUs that may be
> + * created in the future. This supports co-existence of hotpluggable and
> + * admin-disabled vCPUs if architectures permit.
> + *
> + * Example:
> + *
> + *   CPUArchIdList *alist = machine_possible_cpus(ms);
> + *   CPUState *cpu;
> + *
> + *   CPU_FOREACH_POSSIBLE(cpu, alist) {
> + *       if (!cpu) {
> + *           continue; // reserved slot for hotplug case
> + *       }
> + *
> + *       < Do Something >
> + *   }
> + *
> + * Expanded equivalent:
> + *
> + *   for (int __cpu_idx = 0; alist && __cpu_idx < alist->len; __cpu_idx++) {
> + *       if ((cpu = alist->cpus[__cpu_idx].cpu, 1)) {
> + *           if (!cpu) {
> + *               continue;
> + *           }
> + *
> + *           < Do Something >
> + *       }
> + *   }
> + *
> + * Notes:
> + *   - Callers must check @cpu for NULL when filtering unplugged CPUs.
> + *   - Mirrors the style of CPU_FOREACH(), but iterates all *possible* CPUs
> + *     (plugged, ACPI-disabled, and reserved slots) rather than only present
> + *     and enabled vCPUs.
> + */
> +#define CPU_FOREACH_POSSIBLE(cpu_, archid_list_) \
> +    for (int __cpu_idx = 0; \
> +         (archid_list_) && __cpu_idx < (archid_list_)->len; \
> +         __cpu_idx++) \
> +        if (((cpu_) = (archid_list_)->cpus[__cpu_idx].cpu, 1))
> +
>  extern __thread CPUState *current_cpu;

make it a separate patch and refactor existing loops to use it.
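
A minimal sketch of how a refactored loop could look, using stand-in types
rather than QEMU's real headers. The struct layouts, the fixed-size array, the
loop variable name `cpu_idx_` (chosen here to avoid the reserved `__` prefix)
and the helper `count_created_cpus()` are all assumptions for illustration,
not code from the series:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in types (assumption); QEMU's real definitions have more fields. */
typedef struct CPUState { int cpu_index; } CPUState;
typedef struct CPUArchId { uint64_t arch_id; CPUState *cpu; } CPUArchId;
typedef struct CPUArchIdList { int len; CPUArchId cpus[8]; } CPUArchIdList;

/* Same shape as the macro in the patch: visit every slot, including
 * slots whose CPUState pointer is still NULL. */
#define CPU_FOREACH_POSSIBLE(cpu_, archid_list_) \
    for (int cpu_idx_ = 0; \
         (archid_list_) && cpu_idx_ < (archid_list_)->len; \
         cpu_idx_++) \
        if (((cpu_) = (archid_list_)->cpus[cpu_idx_].cpu, 1))

/* Hypothetical caller: count the slots that already have a CPU object. */
static int count_created_cpus(CPUArchIdList *list)
{
    CPUState *cpu;
    int n = 0;

    CPU_FOREACH_POSSIBLE(cpu, list) {
        if (!cpu) {
            continue;   /* empty slot, nothing created here yet */
        }
        n++;
    }
    return n;
}
```

Because the macro expands to `for (...) if (...)`, an `else` written directly
after the loop body would attach to the hidden `if`, so callers should always
use a braced body.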

>  
>  /**




* Re: [PATCH RFC V6 11/24] hw/arm/acpi: MADT change to size the guest with possible vCPUs
  2025-10-01  1:01 ` [PATCH RFC V6 11/24] hw/arm/acpi: MADT change to size the guest with possible vCPUs salil.mehta
@ 2025-10-03 15:09   ` Igor Mammedov
       [not found]     ` <0175e40f70424dd9a29389b8a4f16c42@huawei.com>
  0 siblings, 1 reply; 67+ messages in thread
From: Igor Mammedov @ 2025-10-03 15:09 UTC (permalink / raw)
  To: salil.mehta
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	armbru, andrew.jones, david, philmd, eric.auger, will, ardb,
	oliver.upton, pbonzini, gshan, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu

On Wed,  1 Oct 2025 01:01:14 +0000
salil.mehta@opnsrc.net wrote:

> From: Salil Mehta <salil.mehta@huawei.com>
> 
> When QEMU builds the MADT table, modifications are needed to include information
> about possible vCPUs that are exposed as ACPI-disabled (i.e., `_STA.Enabled=0`).
> This new information will help the guest kernel pre-size its resources during
> boot time. Pre-sizing based on possible vCPUs will facilitate the future
> hot-plugging of the currently disabled vCPUs.
> 
> Additionally, this change addresses updates to the ACPI MADT GIC CPU interface
> flags, as introduced in the UEFI ACPI 6.5 specification [1]. These updates
> enable deferred virtual CPU onlining in the guest kernel.
> 
> Reference:
> [1] 5.2.12.14. GIC CPU Interface (GICC) Structure (Table 5.37 GICC CPU Interface Flags)
>     Link: https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#gic-cpu-interface-gicc-structure
> 
> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> ---
>  hw/arm/virt-acpi-build.c | 40 ++++++++++++++++++++++++++++++++++------
>  hw/core/machine.c        | 14 ++++++++++++++
>  include/hw/boards.h      | 20 ++++++++++++++++++++
>  3 files changed, 68 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index b01fc4f8ef..7c24dd6369 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -760,6 +760,32 @@ static void build_append_gicr(GArray *table_data, uint64_t base, uint32_t size)
>      build_append_int_noprefix(table_data, size, 4); /* Discovery Range Length */
>  }
>  
> +static uint32_t virt_acpi_get_gicc_flags(CPUState *cpu)
> +{
> +    MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());
> +    const uint32_t GICC_FLAG_ENABLED = BIT(0);
> +    const uint32_t GICC_FLAG_ONLINE_CAPABLE = BIT(3);
> +
> +    /* ARM architecture does not support vCPU hotplug yet */
> +    if (!cpu) {
> +        return 0;
> +    }
> +
> +    /*
> +     * If the machine does not support online-capable CPUs, report the GICC as
> +     * 'enabled' only.
> +     */
> +    if (!mc->has_online_capable_cpus) {
> +        return GICC_FLAG_ENABLED;
> +    }
> +
> +    /*
> +     * ACPI 6.5, 5.2.12.14 (GICC): mark the boot CPU 'enabled' and all others
> +     * 'online-capable'.
> +     */
> +    return (cpu == first_cpu) ? GICC_FLAG_ENABLED : GICC_FLAG_ONLINE_CAPABLE;
> +}
> +
>  static void
>  build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>  {
> @@ -785,12 +811,14 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>      build_append_int_noprefix(table_data, vms->gic_version, 1);
>      build_append_int_noprefix(table_data, 0, 3);   /* Reserved */
>  
> -    for (i = 0; i < MACHINE(vms)->smp.cpus; i++) {
> -        ARMCPU *armcpu = ARM_CPU(qemu_get_cpu(i));
> +    for (i = 0; i < MACHINE(vms)->smp.max_cpus; i++) {
                                     ^^^^^^^^^^^^
> +        CPUState *cpu = machine_get_possible_cpu(i);
...
> +        CPUArchId *archid = machine_get_possible_cpu_arch_id(i);

What does the complexity above add? (And then you say that instantiating an
ARM VM is slow.)

I'd drop machine_get_possible_cpu/machine_get_possible_cpu_arch_id altogether
and mimic what acpi_build_madt() does.
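
A rough sketch of the single-pass alternative: walk the possible-CPU slots by
slot index and read both the CPU pointer and the arch_id from the slot itself,
instead of doing a linear search per cpu_index. The types are stand-ins, and
`GiccEntry`, `gicc_flags()` and `build_gicc_entries()` are hypothetical names;
treating slot 0 as the boot CPU is also an assumption made for this example
(the patch compares against first_cpu):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types (assumption); not QEMU's real definitions. */
typedef struct CPUState { int cpu_index; } CPUState;
typedef struct CPUArchId { uint64_t arch_id; CPUState *cpu; } CPUArchId;
typedef struct CPUArchIdList { int len; CPUArchId cpus[8]; } CPUArchIdList;

#define GICC_FLAG_ENABLED        (1u << 0)
#define GICC_FLAG_ONLINE_CAPABLE (1u << 3)

/* Hypothetical record holding the per-CPU values that go into a GICC entry. */
typedef struct GiccEntry { uint32_t uid; uint32_t flags; uint64_t mpidr; } GiccEntry;

/* Mirrors the flag policy of virt_acpi_get_gicc_flags() in the patch. */
static uint32_t gicc_flags(const CPUArchId *slot, bool is_boot_cpu,
                           bool has_online_capable_cpus)
{
    if (!slot->cpu) {
        return 0;                       /* no CPU object behind this slot */
    }
    if (!has_online_capable_cpus || is_boot_cpu) {
        return GICC_FLAG_ENABLED;
    }
    return GICC_FLAG_ONLINE_CAPABLE;
}

/* One pass over the slots; no per-index lookup helpers are needed. */
static int build_gicc_entries(const CPUArchIdList *list,
                              bool has_online_capable_cpus, GiccEntry *out)
{
    for (int i = 0; i < list->len; i++) {
        const CPUArchId *slot = &list->cpus[i];

        out[i].uid = (uint32_t)i;
        out[i].flags = gicc_flags(slot, i == 0, has_online_capable_cpus);
        out[i].mpidr = slot->arch_id;   /* valid even when slot->cpu is NULL */
    }
    return list->len;
}
```

Reading arch_id from the slot also avoids dereferencing a NULL lookup result
for slots that have no CPU object yet.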

> +        uint32_t flags = virt_acpi_get_gicc_flags(cpu);
> +        uint64_t mpidr = archid->arch_id;
>  
>          if (vms->gic_version == VIRT_GIC_VERSION_2) {
>              physical_base_address = memmap[VIRT_GIC_CPU].base;
> @@ -805,7 +833,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>          build_append_int_noprefix(table_data, i, 4);    /* GIC ID */
>          build_append_int_noprefix(table_data, i, 4);    /* ACPI Processor UID */
>          /* Flags */
> -        build_append_int_noprefix(table_data, 1, 4);    /* Enabled */
> +        build_append_int_noprefix(table_data, flags, 4);
>          /* Parking Protocol Version */
>          build_append_int_noprefix(table_data, 0, 4);
>          /* Performance Interrupt GSIV */
> @@ -819,7 +847,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>          build_append_int_noprefix(table_data, vgic_interrupt, 4);
>          build_append_int_noprefix(table_data, 0, 8);    /* GICR Base Address*/
>          /* MPIDR */
> -        build_append_int_noprefix(table_data, arm_cpu_mp_affinity(armcpu), 8);
> +        build_append_int_noprefix(table_data, mpidr, 8);
>          /* Processor Power Efficiency Class */
>          build_append_int_noprefix(table_data, 0, 1);
>          /* Reserved */
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 69d5632464..65388d859a 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -1383,6 +1383,20 @@ CPUState *machine_get_possible_cpu(int64_t cpu_index)
>      return NULL;
>  }
>  
> +CPUArchId *machine_get_possible_cpu_arch_id(int64_t cpu_index)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    CPUArchIdList *possible_cpus = ms->possible_cpus;
> +
> +    for (int i = 0; i < possible_cpus->len; i++) {
> +        if (possible_cpus->cpus[i].cpu &&
> +            possible_cpus->cpus[i].cpu->cpu_index == cpu_index) {
> +            return &possible_cpus->cpus[i];
> +        }
> +    }
> +    return NULL;
> +}
> +
>  static char *cpu_slot_to_string(const CPUArchId *cpu)
>  {
>      GString *s = g_string_new(NULL);
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 3ff77a8b3a..fe51ca58bf 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -461,6 +461,26 @@ struct MachineState {
>      bool acpi_spcr_enabled;
>  };
>  
> +/*
> + * machine_get_possible_cpu_arch_id:
> + * @cpu_index: logical cpu_index to search for
> + *
> + * Return a pointer to the CPUArchId entry matching the given @cpu_index
> + * in the current machine's MachineState. The possible_cpus array holds
> + * the full set of CPUs that the machine could support, including those
> + * that may be created as disabled or taken offline.
> + *
> + * The slot index in ms->possible_cpus[] is always sequential, but the
> + * logical cpu_index values are assigned by QEMU and may or may not be
> + * sequential depending on the implementation of a particular machine.
> + * Direct indexing by cpu_index is therefore unsafe in general. This
> + * helper performs a linear search of the possible_cpus array to find
> + * the matching entry.
> + *
> + * Returns: pointer to the matching CPUArchId, or NULL if not found.
> + */
> +CPUArchId *machine_get_possible_cpu_arch_id(int64_t cpu_index);
> +
>  /*
>   * The macros which follow are intended to facilitate the
>   * definition of versioned machine types, using a somewhat




* Re: [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (23 preceding siblings ...)
  2025-10-01  1:01 ` [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc salil.mehta
@ 2025-10-06 14:00 ` Igor Mammedov
  2025-10-13  0:34 ` Gavin Shan
  2025-10-22 10:07 ` Gavin Shan
  26 siblings, 0 replies; 67+ messages in thread
From: Igor Mammedov @ 2025-10-06 14:00 UTC (permalink / raw)
  To: salil.mehta
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	armbru, andrew.jones, david, philmd, eric.auger, will, ardb,
	oliver.upton, pbonzini, gshan, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu

On Wed,  1 Oct 2025 01:01:03 +0000
salil.mehta@opnsrc.net wrote:

> From: Salil Mehta <salil.mehta@huawei.com>
> 
> [!] Sending again: It looks like mails sent from my official ID are being held
> somewhere. Hence, I am using my other email address. Sorry for any inconvenience
> this may have caused.
> 
> ============
> (I) Prologue
> ============
> 
> This patch series adds support for a virtual CPU hotplug-like feature (in terms
> of usage) to Armv8+ platforms. Administrators are able to dynamically scale the
> compute capacity on demand by adding or removing vCPUs. The interface is similar
> in look-and-feel to the vCPU hotplug feature supported on x86 platforms. While
> this series for Arm platforms shares the end goal with x86, it is implemented
> differently because of inherent differences in the CPU architecture and the
> constraints it imposes.
> 
> In this implementation, the meaning of "CPU hotplug" is as described in the Arm
> Power State Coordination Interface (PSCI) specification (DEN0022F.b, §4.3 "CPU
> hotplug and secondary CPU boot", §5.5, 5.6). This definition has not changed.
> On Arm platforms, the guest kernel itself can request CPU onlining or offlining
> using PSCI calls (via SMC/HVC), since the CPU_ON and CPU_OFF functions are part
> of the standard PSCI interface exposed to the non-secure world.
> 
> This patch series instead adds the infrastructure required to implement
> administrative policy control in QEMU/VMM (as privileged software) along with
> the ability to convey changes via ACPI to the guest kernel. This ensures the
> guest is notified of compute capacity changes that result from per-vCPU
> administrative policy. This is conceptually similar to the traditional CPU
> hotplug mechanism that x86 follows. It allows or denies guest-initiated PSCI
> CPU_ON/OFF requests by enabling or disabling an already ACPI-described and
> present vCPU via HMP/QMP 'device_set' (a new interface), making it (un)available
> to the guest kernel. This provides the look-and-feel of vCPU hotplug through
> ACPI _STA.Enabled toggling, while keeping all vCPUs enumerated in ACPI tables at
> boot.
> 
> Unlike x86, where vCPUs can become not-present after boot and the kernel (maybe
> because architecture allows this?) tolerates some level of dynamic topology
> changes, the Arm CPU Architecture requires that the number of vCPUs and their
> associated redistributors remain fixed once the system has booted. Consequently,
> the Arm host kernel and guest kernel do not tolerate removal or addition of CPU
> objects after boot.
> 
> Offline vCPUs remain described to guest firmware and OSPM, and can be brought
> online later by toggling the ACPI _STA Enabled bit. This model aligns with
> ACPI 6.5 (Table 5.37, GICC CPU Interface Flags), which introduced the "Online
> Capable" bit to signal processors that can be enabled at runtime. It is also
> consistent with the Arm GIC Architecture Specification (IHI0069H, §11.1), which
> defines CPU interface power domain behavior.
> 
> Corresponding kernel changes from James Morse (ARM) have already been accepted
> and are part of the mainline Linux kernel since 6.11 release.

The series is a bit tangled up. I'd suggest splitting it into 3 separate
series that follow one another, to make it more digestible; that should
simplify and speed up the review and merging process.

Here is the suggested order (1 and 2 could be swapped or done in parallel):

1. Introduce power state infrastructure (the hw one, as opposed to the
   administrative one) and apply it to ARM vcpus.
   (On completion, the guest should be able to power cores up/down using
    existing Linux interfaces, which I'd guess end up calling PSCI.)

2. Instead of complicating ‘-smp’ semantics and, as a consequence, having to
    deal with the complications it causes (the -deviceset CLI option),
    I’d suggest making ‘-device arm-cpu’ work. That way
    ‘-smp’ semantics stay the same
    and you don’t have to invent a -deviceset CLI option to fix up built-in CPUs.

    What I see in this series has a number of problems with respect to the above:
    2.1. -smp creates anonymous vcpus that aren’t supposed to be managed by the user,
         as opposed to devices created by -device, which must have a user-assigned ‘id’
         if they are to be managed by the user.
    2.2. To work around that, the series adds ‘-deviceset/device-set’, which
         pattern-matches vcpus by some subset of device properties, thus violating point #1.

    The above can be solved by using the existing ‘-device’, which would allow setting
    the initial power state at vcpu creation time and let the user assign an ‘id’
    if one wishes to manage that vcpu later on.
    That alleviates the need for -deviceset accessing anonymous vcpus,
    and the remaining device-set QMP command would have a user-provided ‘id’ to address
    managed vcpus.
    I.e. the UI will be consistent with what we do in the x86/s390/spapr cases,
    as well as with any other devices that support '-device'.

3. Last would come the series to support the administrative power state policy,
   aka the ‘hotplug-like’ feature (incl. the device-set QMP command).


Also:
Looking at previous reviews, the lazy-realize issue still hasn’t been dropped.
It’s a hack to work around the slowness of the VCPU creation code, and
I still maintain the opinion that it doesn't fit this series.
(It actually gets in the way of the ‘hotplug-like’ feature,
negatively influencing/complicating this series.)

If faster vcpu creation times are needed, fix the relevant code instead of
covering it up with a hack. (In previous reviews I've even pointed to
some low-hanging fruit that can speed it up.)
Anyway, speed-up optimizations should be a separate series and
shouldn't be conflated with this series at all.

> ====================================
> (II) Summary of `Recent` Key Changes
> ====================================
> 
> RFC V5 -> RFC V6
> 
> (*) KeyChange: Introduced new infrastructure to handle administrative PowerState
>     transitions (enable-to-disable & vice-versa) as-per Policy.
> (*) Stopped using the existing vCPU Hotplug infrastructure code
> (*) Replaced 'device_add/-device' with new 'device_set/-deviceset' interface
> (*) Introduced '-smp disabledcpus=N' parameter for Qemu CLI
> (*) Dropped 'info hotpluggable'. Added 'info cpus-powerstate' command
> (*) Introduced DeviceState::admin_power_state property={enabled,disabled,removed} states
> (*) Introduced new 'PowerStateHandler' abstract interface with powerstate hooks.
> (*) Dropped destruction of disabled vCPU objects post cpu init.
> (*) Dropped vCPU Hotplug ACPI support introduced ACPI/GED specifically for ARM type vCPUs
> (*) Dropped GIC IRQ unwiring support once VM is initialized.
> (*) Dropped vCPU unrealization support. Retained lazy realization of disabled vCPUs(boot time).
> (*) All vCPU objects exist for lifetime of VM.
> (*) Introduced a separate ACPI CPU/OSPM interface to handle device check, eject
>     request etc. to intimate guest kernel about change in policy.
> (*) Introduced new concept of *userspace parking* of 'disabled' KVM vCPUs 

‘Parking’ was a KVM concept, due to the inability to destroy VCPUs on the KVM side.
Please do not use/propagate/expose it in newer code (unless it's KVM related).
In the absence of lazy-realize it's likely not needed, as parking is only
related to unrealize in the KVM context.

> (*) We do not migrate disabled vCPUs

I’d migrate disabled vcpus as well (at least the power state). While it would
consume time, it would be in line with what we do with other present devices.
We can always reduce what we migrate later on, as a patch on top, if necessary.

> (*) Mitigation to pause_all_vcpus() problem. Caching the ICC_CTLR_EL1 in Qemu
> (*) Stopped reconciling (for now) vCPU config at destination VM during Migration

I guess that concludes my review of this revision.

PS:
I don't think that a per-patch review at this stage would make much sense,
as the things the patches do are all mixed up and some of that should go away,
drastically changing the next revision.
But given that the series is a complete rewrite, that's expected, and it allows
us to identify how to amend the following series.


> Dropped Due to change in vCPU handling approach:
> 
> [PATCH RFC V5 03/30] hw/arm/virt: Move setting of common vCPU properties in a function
> [PATCH RFC V5 04/30] arm/virt, target/arm: Machine init time change common to vCPU {cold|hot}-plug
> [PATCH RFC V5 09/30] arm/acpi: Enable ACPI support for vCPU hotplug
> [PATCH RFC V5 12/30] arm/virt: Release objects for *disabled* possible vCPUs after init
> [PATCH RFC V5 14/30] hw/acpi: Make _MAT method optional
> [PATCH RFC V5 16/30] target/arm: Force ARM vCPU *present* status ACPI *persistent*
> [PATCH RFC V5 18/30] arm/virt: Changes to (un)wire GICC<->vCPU IRQs during hot-(un)plug
> [PATCH RFC V5 22/30] target/arm/cpu: Check if hotplugged ARM vCPU's FEAT match existing
> [PATCH RFC V5 24/30] target/arm: Add support to *unrealize* ARMCPU during vCPU Hot-unplug
> [PATCH RFC V5 25/30] tcg/mttcg: Introduce MTTCG thread unregistration leg
> [PATCH RFC V5 30/30] hw/arm/virt: Expose cold-booted vCPUs as MADT GICC *Enabled*
> 
> Modified or Code reused in other patches:
> 
> [PATCH RFC V5 19/30] hw/arm, gicv3: Changes to notify GICv3 CPU state with vCPU hot-(un)plug event
> [PATCH RFC V5 17/30] arm/virt: Add/update basic hot-(un)plug framework
> [PATCH RFC V5 20/30] hw/arm: Changes required for reset and to support next boot
> [PATCH RFC V5 21/30] arm/virt: Update the guest(via GED) about vCPU hot-(un)plug events
> 
> ---------------------------------
> [!] Expectations From This RFC v6
> ---------------------------------
> 
> Please refer to the DISCLAIMER in Section (XI) for the correct expectations from
> this version of the RFC
> 
> ===============
> (II) Motivation
> ===============
> 
> Adds virtual CPU hot-plug-like support for ARMv8+ Arch in QEMU. Allows vCPUs to
> be brought online or offline after VM boot, similar to x86 arch, while keeping
> all CPU resources provisioned and described at startup. Enables scaling guest VM
> compute capacity on demand, useful in several scenarios:
> 
> 1. Vertical Pod Autoscaling [9][10] in the cloud: As part of an orchestration
>    framework, resource requests (CPU and memory) for containers in a pod can be
>    adjusted dynamically based on usage.
> 
> 2. Pay-as-you-grow business model: Infrastructure providers may allocate and
>    restrict the total compute resources available to a guest VM according to
>    the SLA (Service Level Agreement). VM owners can then request additional
>    CPUs to be hot-plugged at extra cost.
> 
> In Kubernetes environments, workloads such as Kata Container VMs often adopt
> a "hot-plug everything" model: start with the minimum resources and add vCPUs
> later as needed. For example, a VM may boot with just one vCPU, then scale up
> once the workload is provisioned. This approach provides:
> 
> 1. Faster boot times, and
> 2. Lower memory footprint.
> 
> vCPU hot-plug is therefore one of the steps toward realizing the broader
> "hot-plug everything" objective. Other hot-plug mechanisms already exist on ARM,
> such as ACPI-based memory hot-plug and PCIe device hot-plug, and are supported
> in both QEMU and the Linux guest. Extending vCPU hot-plug in this series aligns
> with those efforts and fills the remaining gap.
> 
> ================
> (III) Background
> ================
> 
> The ARM architecture does not support physical CPU hot-plug and lacks a
> specification describing the behavior of per-CPU components (e.g. GIC CPU
> interface, redistributors, PMUs, timers) when such events occur. As a result,
> both host and guest kernels are intolerant to changes in the number of CPUs
> enumerated by firmware and described by ACPI at boot time.
> 
> We need to respect these architectural constraints and the kernel limitations
> they impose, namely the inability to tolerate changes in the number of CPUs
> enumerated by firmware once the system has booted, and create a practical
> solution with workarounds in the VMM/QEMU.
> 
> This patch set implements a non-intrusive solution by provisioning all vCPU
> resources during VM initialization and exposing them via ACPI to the guest
> kernel. The resources remain fixed, while the effect of hot-plug is achieved by
> toggling ACPI CPU status (enabled) bits to bring vCPUs online or offline.
> 
> -----------
> Terminology
> -----------
> 
> (*) Possible CPUs: Total vCPUs that could ever exist in the VM. This includes
>                    any 'present' & 'enabled' CPUs plus any CPUs that are
>                    'present' but are 'disabled' at boottime.
>                    - Qemu parameter (-smp cpus=N1, disabled=N2)
>                    - Possible vCPUs = N1 + N2
> (*) Present CPUs:  Possible CPUs that are ACPI 'present'. These might or might
>                    not be ACPI 'enabled'. 
> (*) Enabled CPUs:  Possible CPUs that are ACPI 'present' and 'enabled' and can
>                    now be ‘onlined’ (PSCI) for use by the Guest Kernel. All cold-
>                    booted vCPUs are ACPI 'enabled' at boot. Later, using
>                    'device_set/-deviceset', more vCPUs can be ACPI 'enabled'.
> 
> 
> Below are further details of the constraints:
> 
> ===============================================
> (IV) Constraints Due to ARMv8+ CPU Architecture
> ===============================================
> 
> A. Physical Limitation to Support CPU Hotplug: (Architectural Constraint)
> 
>    1. ARMv8 CPU architecture does not support the concept of the physical CPU
>       hotplug. 
>       a. There are many per-CPU components like PMU, SVE, MTE, Arch timers, etc.,
>          whose behavior needs to be clearly defined when the CPU is
>          hot(un)plugged. Current specification does not define this nor are any
>          immediate plans from ARM to extend support for such a feature.
>    2. Other ARM components like GIC, etc., have not been designed to realize
>       physical CPU hotplug capability as of now. For example,
>       a. Every physical CPU has a unique GICC (GIC CPU Interface) by construct.
>          Architecture does not specify what CPU hot(un)plug would mean in
>          context to any of these.
>       b. CPUs/GICC are physically connected to unique GICR (GIC Redistributor).
>          GIC Redistributors are always part of the always-on power domain. Hence,
>          they cannot be powered off as per specification.
> 
> B. Limitation in Firmware/ACPI (Architectural Constraint)
> 
>    1. Firmware has to expose GICC, GICR, and other per-CPU features like PMU,
>       SVE, MTE, Arch Timers, etc., to the OS. Due to the architectural constraint
>       stated in section A1(a), all interrupt controller structures of
>       MADT describing GIC CPU Interfaces and the GIC Redistributors MUST be
>       presented by firmware to the OSPM during boot time.
>    2. Architectures that support CPU hotplug can evaluate the ACPI _MAT method
>       to get this kind of information from the firmware even after boot, and the
>       OSPM has the capability to process these. ARM kernel uses information in
>       MADT interrupt controller structures to identify the number of present CPUs
>       during boot and hence does not allow to change these after boot. The number
>       of present CPUs cannot be changed. It is an architectural constraint!
> 
> C. Limitations in KVM to Support Virtual CPU Hotplug (Architectural Constraint)
> 
>    1. KVM VGIC:
>       a. Sizing of various VGIC resources like memory regions, etc., related to
>          the redistributor happens only once and is fixed at the VM init time
>          and cannot be changed later after initialization has happened.
>          KVM statically configures these resources based on the number of vCPUs
>          and the number/size of redistributor ranges.
>       b. Association between vCPU and its VGIC redistributor is fixed at the
>          VM init time within the KVM, i.e., when redistributor iodevs gets
>          registered. VGIC does not allow to setup/change this association
>          after VM initialization has happened. Physically, every CPU/GICC is
>          uniquely connected with its redistributor, and there is no
>          architectural way to set this up.
>    2. KVM vCPUs:
>       a. Lack of specification means destruction of KVM vCPUs does not exist as
>          there is no reference to tell what to do with other per-vCPU
>          components like redistributors, arch timer, etc.
>       b. In fact, KVM does not implement the destruction of vCPUs for any
>          architecture. This is independent of whether the architecture
>          actually supports CPU Hotplug feature. For example, even for x86 KVM
>          does not implement the destruction of vCPUs.
> 
> D. Considerations in Qemu due to ARM CPU Architecture & related KVM Constraints:
> 
>    1. Qemu CPU Objects MUST be created to initialize all the Host KVM vCPUs to
>       overcome the KVM constraint. KVM vCPUs are created and initialized when
>       Qemu CPU Objects are realized.
>    2. The 'GICV3State' and 'GICV3CPUState' objects must be sized for all possible
>       vCPUs at VM initialization, when the QOM GICv3 object is realized. This is
>       required because the KVM VGIC can only be initialized once, and the number
>       of redistributors, their per-vCPU interfaces, and associated data
>       structures or I/O device regions are all fixed at VM init time.
>    3. How should new QOM CPU objects be connected back to the 'GICV3CPUState'
>       objects and disconnected from it in case the CPU is being hot(un)plugged?
>    4. How should 'unplugged' or 'yet-to-be-plugged' vCPUs be represented in the
>       QOM for which KVM vCPU already exists? For example, whether to keep,
>        a. No QOM CPU objects Or
>        b. Unrealized CPU Objects
>    5. How should vCPU state be exposed via ACPI to the Guest? Especially for
>       the unplugged/yet-to-be-plugged vCPUs whose CPU objects might not exist
>       within the QOM but the Guest always expects all possible vCPUs to be
>       identified as ACPI *present* during boot.
>    6. How should Qemu expose GIC CPU interfaces for the unplugged or
>       yet-to-be-plugged vCPUs using ACPI MADT Table to the Guest?
> 
> E. How are the above questions addressed in this QEMU implementation?
> 
>    1. Respect the limitations imposed by the Arm architecture in KVM, ACPI, and
>       the guest kernel. This requires always keeping the vCPU count constant.
>    2. Implement a workaround in QEMU by keeping all vCPUs present and toggling
>       the ACPI _STA.Enabled bit to realize a vCPU hotplug-like effect.
>    3. Never destroy vCPU objects once initialized, since they hold the ARMCPU
>       state that is set up once during VM initialization.
>    4. Size other per-vCPU components, such as the VGIC CPU interface and
>       redistributors, for the maximum number of vCPUs possible during the VM’s
>       lifetime.
>    5. Exit HVC/SMC KVM hypercalls (triggered by PSCI CPU_ON/OFF) to user space
>       for policy checks that allow or deny the guest kernel’s power-on/off
>       request.
>    6. Disabled vCPUs remain parked in user space and are never migrated.
> 
> ===================  
> (V) Summary of Flow  
> ===================  
> 
> -------------------  
> vCPU Initialization  
> -------------------  
>    1. Keep all vCPUs always enumerated and present (enabled/disabled) in the
>       guest kernel, host KVM, and QEMU with topology fixed.  
>    2. Realize hotplug-like functionality by toggling the ACPI _STA.Enabled bit
>       for each vCPU.  
>    3. Never destroy a vCPU. vCPU objects and threads remain alive throughout the
>       VM lifetime once created. No un-realization handling code is required.
>       Threads may be realized lazily for disabled vCPUs.  
>    4. At VM init, pre-create all possible vCPUs in KVM, including those not yet
>       enabled in QEMU, but keep them in the PSCI powered-off state.  
>    5. Park disabled vCPU threads in user space to avoid KVM lock contention.
>       This means 'CPUState::halted=1'; 'CPUState::stopped=1'; and 'CPUState::parked=1' (new).  
> -------------------  
> VGIC Initialization  
> -------------------  
>    6. Size 'GICv3State' and 'GICv3CPUState' objects over possible vCPUs at VM
>       init time when the QEMU GIC is realized. This also sizes KVM VGIC
>       resources such as redistributor regions. This sizing never changes after
>       VM init.
> -------------------  
> ACPI Initialization  
> -------------------  
>    7. Build the ACPI MADT table with updates:  
>       a. Number of GIC CPU interface entries = possible vCPUs.  
>       b. Boot vCPU as MADT.GICC.Enabled=1 (not hot[un]pluggable).  
>       c. Hot[un]pluggable vCPUs as MADT.GICC.online-capable=1 and  
>          MADT.GICC.Enabled=0 (mutually exclusive). These vCPUs can be enabled
>          and onlined after guest boot (firmware policy).  
>    8. Expose ACPI _STA status to the guest kernel:  
>       a. Always _STA.Present=1 (all possible vCPUs).  
>       b. _STA.Enabled=1 (enabled vCPUs = plugged).  
>       c. _STA.Enabled=0 (disabled vCPUs = unplugged).  
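
As a rough illustration of steps 7 and 8, the sketch below computes the MADT GICC flags and the _STA value for a vCPU. The bit positions follow the ACPI 6.5 definitions (GICC flags: bit 0 Enabled, bit 3 Online Capable; _STA: bit 0 Present, bit 1 Enabled). The function names are hypothetical and are not taken from the patches, and the remaining _STA bits are left out for brevity.

```python
# Toy model of the flag values described in steps 7 and 8; not QEMU code.
GICC_ENABLED = 1 << 0         # ACPI 6.5 MADT GICC flags, bit 0
GICC_ONLINE_CAPABLE = 1 << 3  # ACPI 6.5 MADT GICC flags, bit 3
STA_PRESENT = 1 << 0          # ACPI _STA, bit 0
STA_ENABLED = 1 << 1          # ACPI _STA, bit 1


def gicc_flags(is_boot_cpu):
    """Return MADT GICC flags; Enabled and Online Capable are mutually exclusive."""
    # Assumption: only the boot vCPU is marked Enabled (step 7b); every other
    # possible vCPU is marked Online Capable (step 7c).
    return GICC_ENABLED if is_boot_cpu else GICC_ONLINE_CAPABLE


def sta_value(admin_enabled):
    """Return the _STA bits modelled here: always present, enabled on demand."""
    return STA_PRESENT | (STA_ENABLED if admin_enabled else 0)
```

With these definitions, a disabled vCPU still reports Present through `sta_value(False)`, which is what lets the guest enumerate all possible vCPUs at boot.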
> ---------------------------------------------------------------  
> vCPU Administrative *First* Enable [= vCPU Hotplug-like Action]  
> ---------------------------------------------------------------  
>    9. The first administrative enable of a vCPU leads to deferred realization of
>       the QEMU vCPU object initialized at VM init:  
>       a. Realizes the vCPU object and spawns the QEMU vCPU thread.  
>       b. Unparks the existing KVM vCPU ("kvm_parked_vcpus" list).  
>       c. Reinitializes the KVM vCPU in the host (reset core/sys regs, set
>          defaults). 
>       d. Runs the KVM vCPU (created with 'start-powered-off'). Thread waits for
>          PSCI.
>       e. Marks QEMU 'GICv3CPUState' interface accessible.  
>       f. Updates ACPI _STA.Enabled=1.  
>       g. Notifies guest (GED Device-Check). Guest sees Enabled=1 and registers
>          CPU. 
>       h. Guest onlines vCPU (PSCI CPU_ON over HVC/SMC).  
>          - KVM exits to QEMU (policy check).  
>          - If allowed, QEMU calls `cpu_reset()` and powers on the vCPU in KVM.
>          - KVM wakes the vCPU thread from sleep and sets the vCPU MP state to
>            RUNNABLE.
> -----------------------------------------------------------  
> vCPU Administrative Disable [= vCPU Hot-unplug-like Action]  
> -----------------------------------------------------------  
>   10. Administrative disable does not un-realize the QOM CPU object or destroy
>       the vCPU thread. Instead:  
>       a. Notifies guest (GED Eject Request). Guest offlines vCPU (CPU_OFF PSCI).
>       b. KVM exits to QEMU (policy check). 
>          - QEMU powers off the vCPU in KVM.
>          - KVM sets the vCPU MP state to STOPPED and sleeps on RCUWait.
>       c. Guest signals eject after quiescing vCPU.  
>       d. QEMU updates ACPI _STA.Enabled=0.  
>       e. Marks QEMU 'GICv3CPUState' interface inaccessible.  
>       f. Parks the vCPU thread in user space (unblocks from KVM to avoid vCPU
>          lock contention):  
>          - Unregisters VMSD from migration.  
>          - Removes vCPU from present/active lists.  
>          - Pauses the vCPU (`cpu_pause`).  
>          - Kicks vCPU thread to user space ('CPUState::parked=1').  
>       g. Guest sees ACPI _STA.Enabled=0 and removes CPU (unregisters from LDM).
> --------------------------------------------------------------------  
> vCPU Administrative *Subsequent* Enable [= vCPU Hotplug-like Action]  
> --------------------------------------------------------------------  
>   11. A subsequent administrative enable does not realize objects or spawn a new
>       thread. Instead:  
>       a. Unparks the vCPU thread in user space:  
>          - Re-registers VMSD for migration.  
>          - Adds back to present/active lists.  
>          - Resumes the vCPU (`cpu_resume`).  
>          - Clears parked flag ('CPUState::parked=0').  
>       b. Marks QEMU 'GICv3CPUState' interface accessible again.  
>       c. Updates ACPI _STA.Enabled=1.  
>       d. Notifies guest (GED Device-Check). Guest sees Enabled=1 and registers
>          CPU.
>       e. Guest onlines vCPU (PSCI CPU_ON over HVC/SMC).  
>          - KVM exits to QEMU (policy check).  
>          - QEMU sets power-state=PSCI_ON, calls `cpu_reset()`, and powers on
> 	   vCPU.  
>          - KVM changes MP state to RUNNABLE.  
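
The difference between the first enable, the disable, and a subsequent enable (steps 9 to 11) can be sketched as a small state model. This is a toy illustration only: `ToyVcpu`, `admin_enable()` and `admin_disable()` are hypothetical names, the fields merely mirror the states named in the flow above, and guest notification, PSCI and migration handling are omitted.

```python
from dataclasses import dataclass


@dataclass
class ToyVcpu:
    """Hypothetical stand-in for the per-vCPU state tracked in steps 9 to 11."""
    realized: bool = False        # QOM object realized (happens once)
    threads_spawned: int = 0      # number of vCPU threads ever created
    parked: bool = False          # models 'CPUState::parked'
    gic_accessible: bool = False  # models 'GICv3CPUState' accessibility
    sta_enabled: bool = False     # models ACPI _STA.Enabled


def admin_enable(cpu):
    if not cpu.realized:
        # First enable (step 9): deferred realization and thread creation.
        cpu.realized = True
        cpu.threads_spawned += 1
    else:
        # Subsequent enable (step 11): only unpark the existing thread.
        cpu.parked = False
    cpu.gic_accessible = True
    cpu.sta_enabled = True


def admin_disable(cpu):
    # Step 10: nothing is un-realized or destroyed; the thread is parked.
    cpu.sta_enabled = False
    cpu.gic_accessible = False
    cpu.parked = True
```

Calling `admin_enable()`, `admin_disable()` and `admin_enable()` in sequence leaves `threads_spawned` at 1, which is the property the flow relies on: a vCPU is realized once and only parked or unparked afterwards.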
> 
> ============================================
> (VI) Work Presented at KVM Forum Conferences
> ============================================
> 
> Details of the above work have been presented at the KVM Forum 2020 and KVM
> Forum 2023 conferences. Slides and video are available at the links below:
> a. KVMForum 2023
>    - Challenges Revisited in Supporting Virt CPU Hotplug on architectures that don't Support CPU Hotplug (like ARM64).
>      https://kvm-forum.qemu.org/2023/KVM-forum-cpu-hotplug_7OJ1YyJ.pdf
>      https://kvm-forum.qemu.org/2023/Challenges_Revisited_in_Supporting_Virt_CPU_Hotplug_-__ii0iNb3.pdf
>      https://www.youtube.com/watch?v=hyrw4j2D6I0&t=23970s
>      https://kvm-forum.qemu.org/2023/talk/9SMPDQ/
> b. KVMForum 2020
>    - Challenges in Supporting Virtual CPU Hotplug on SoC Based Systems (like ARM64) - Salil Mehta, Huawei.
>      https://kvmforum2020.sched.com/event/eE4m
> 
> ===================
> (VII) Commands Used
> ===================
> 
> A. Qemu launch commands to init the machine (with 6 possible vCPUs):
> 
> $ qemu-system-aarch64 --enable-kvm -machine virt,gic-version=3 \
> -cpu host -smp cpus=4,disabled=2 \
> -m 300M \
> -kernel Image \
> -initrd rootfs.cpio.gz \
> -append "console=ttyAMA0 root=/dev/ram rdinit=/init maxcpus=2 acpi=force" \
> -nographic \
> -bios QEMU_EFI.fd
> 
> B. Administrative '[En,Dis]able' [akin to 'Hot-(un)plug'] related commands:
> 
> # Hot(un)plug a host vCPU (accel=kvm):
> (qemu) device_set host-arm-cpu,id=core4,core-id=4,admin-state=enable
> (qemu) device_set host-arm-cpu,id=core4,core-id=4,admin-state=disable
> 
> # Hot(un)plug a vCPU (accel=tcg):
> (qemu) device_set cortex-a57-arm-cpu,id=core4,core-id=4,admin-state=enable
> (qemu) device_set cortex-a57-arm-cpu,id=core4,core-id=4,admin-state=disable
> 
> Sample output on guest after boot:
> 
>     $ cat /sys/devices/system/cpu/possible
>     0-5
>     $ cat /sys/devices/system/cpu/present
>     0-5
>     $ cat /sys/devices/system/cpu/enabled
>     0-3
>     $ cat /sys/devices/system/cpu/online
>     0-1
>     $ cat /sys/devices/system/cpu/offline
>     2-5
> 
> Sample output on guest after 'enabling'[='hotplug'] & 'online' of vCPU=4:
> 
>     $ echo 1 > /sys/devices/system/cpu/cpu4/online
> 
>     $ cat /sys/devices/system/cpu/possible
>     0-5
>     $ cat /sys/devices/system/cpu/present
>     0-5
>     $ cat /sys/devices/system/cpu/enabled
>     0-4
>     $ cat /sys/devices/system/cpu/online
>     0-1,4
>     $ cat /sys/devices/system/cpu/offline
>     2-3,5
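
When checking such output from a script, the sysfs CPU lists can be expanded into explicit CPU numbers. The helper below is a minimal sketch; `parse_cpu_list()` is a hypothetical name, and it only handles the plain "a-b,c" form shown in the sample output above.

```python
def parse_cpu_list(text):
    """Expand a Linux sysfs CPU list such as '0-1,4' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs in that set
        first, _, last = part.partition("-")
        # A single number has no '-', so 'last' is empty and the range is one CPU.
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)
```

For the sample above, the online list "0-1,4" and the offline list "2-3,5" together cover exactly the possible list "0-5".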
> 
> ===================
> (VIII) Repositories
> ===================
> 
> (*) Latest Qemu RFC V6 (Architecture Specific) patch set:
>     https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v6
> (*) Older QEMU changes for vCPU hotplug can be cloned from below site:
>     https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-{v1,v2,v3,v4,v5}
> (*) `Accepted` Qemu Architecture Agnostic patch is present here:
>     https://github.com/salil-mehta/qemu/commits/virt-cpuhp-armv8/rfc-v3.arch.agnostic.v16/
> (*) All Kernel changes are already part of mainline v6.11
> (*) Original Guest Kernel changes (by James Morse, ARM) are available here:
>     https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git virtual_cpu_hotplug/rfc/v2
> 
> ================================
> (IX) KNOWN ISSUES & THINGS TO DO
> ================================
> 
> 1. TCG currently faces some hang issues due to unhandled cases. We aim to fix
>    these within the next one to two weeks.
> 2. Comprehensive testing is ongoing. This is fresh code, and we expect to
>    complete testing within two weeks.
> 3. QEMU documentation (.rst) still needs to be updated.
> 4. Migration has been lightly tested but is working as expected.
> 5. Mitigation to avoid `pause_all_vcpus` needs broader community discussion. An
>    alternative change has been prepared in KVM, which maintains a shadow of
>    `ICC_CTLR_EL1` to reduce lock contention when using KVM device IOCTLs. This
>    avoids synchronization issues if the register value changes during VM runtime.
>    While not mandatory, this enhancement would provide a more comprehensive fix
>    than the current QEMU assumption that the relevant fields are invariant or
>    pseudo-static. An RFC for this KVM change will be floated within a week.
> 6. Mitigation of parking disabled vCPU threads in user space, to avoid blocking
>    them inside KVM, needs review by the wider community to ensure no hidden
>    issues are introduced.
> 7. A discussion (if needed) on why `device_set` was chosen instead of `qom-set`
>    for administrative state control.
> 8. CPU_SUSPEND/Standby-related handling (if required).
> 9. HVF and qtest are not supported or done yet.
> 
> ============================
> (X) ORGANIZATION OF PATCHES
> ============================
> 
>  [Patch 1-2, 22-23] New HMP/QMP interface ('device_set') related changes
>     (*) New ('DeviceState::admin_power_state') property; Enabled/Disabled States and handling
>     (*) New Qemu CLI parameter ('-smp CPUS, disabled=N') handling
>     (*) Logic to find the existing object not part of the QOM
>  [Patch 3-5, 10] logic required during machine init.
>     (*) Some validation checks.
>     (*) Introduces core-id,socket-id,cluster-id property and some util functions required later.
>     (*) Logic to setup lazy realization of the QOM vCPUs 
>     (*) Logic to pre-create vCPUs in the KVM host kernel.
>  [Patch 6-7, 8-9] logic required to size the GICv3 State
>     (*) GIC initialization pre-sized with possible vCPUs. 
>     (*) Introduction of the GICv3 CPU Interface `accessibility` property & accessors
>     (*) Refactoring to make KVM & TCG 'GICv3CPUState' initialization common.
>     (*) Changes in GICv3 post/pre-load function for migration 
>  [Patch 11,14-16,19] logic related to ACPI at machine init time.
>     (*) ACPI CPU OSPM interface for ACPI _STA.Enable/Disable handling  
>     (*) ACPI GED framework to cater to CPU DeviceCheck/Eject Events.
>     (*) ACPI DSDT, MADT changes.
>  [Patch 12-13, 17] Qdev, Virt Machine, PowerState Handler Changes
>     (*) Changes to introduce 'PowerStateHandler' and its abstract interface.
>     (*) Qdev changes to handle the administrative enabling/disabling of device
>     (*) Virt Machine implementation of 'PowerStateHandler' Hooks
>     (*) vCPU thread user-space parking and unparking logic.
>  [Patch 18,20-21,24] Misc.
>     (*) Handling of SMCCC Hypercall Exits by KVM to Qemu for PSCI.
>     (*) Mitigation to avoid using 'pause_all_vcpus' during ICC_CTLR_EL1 reset.
>     (*) Mitigation when TCG 'TB Code Cache' is found saturated
> 
> ===============
> (XI) DISCLAIMER
> ===============
> 
> This patch-set is the culmination of over four years of ongoing effort to bring
> a vCPU hotplug-like feature to the Arm platform. The work has already led to
> changes in the ACPI specification and the Linux kernel, and this series now
> introduces the missing piece within QEMU.
> 
> The transition from RFC v5 to RFC v6 resulted in a shift of approach, based on
> maintainer feedback, and required substantial code to be re-written. This is
> *not* production-level code and may still contain bugs. Comprehensive testing is
> in progress on HiSilicon Kunpeng920 SoCs, Oracle servers, and Ampere platforms.
> We expect to fix outstanding issues in the coming month and, subject to no major
> concerns from maintainers about the chosen approach, a near-stable, non-RFC
> version will be posted soon.
> 
> This work largely follows the direction of prior community discussions over the
> years [see refs below], including mailing list threads, Linaro Open Discussions,
> and sessions at KVM Forum. This RFC is intended to validate the overall approach
> outlined here and to gather community feedback before moving forward with a
> formal patch series.
> 
> [The concept being presented has been found to work!]
> 
> ================
> (XII) Change Log
> ================
> 
> RFC V4 -> RFC V5:
> -----------------
> 1. Dropped "[PATCH RFC V4 19/33] target/arm: Force ARM vCPU *present* status ACPI *persistent*"
>    - Separated the architecture agnostic ACPI changes required to support vCPU Hotplug
>      Link: https://lore.kernel.org/qemu-devel/20241014192205.253479-1-salil.mehta@huawei.com/#t
> 2. Dropped "[PATCH RFC V4 02/33] cpu-common: Add common CPU utility for possible vCPUs"
>    - Dropped qemu_{present,enabled}_cpu() APIs. Commented by Gavin (Redhat), Miguel (Oracle), Igor (Redhat)
> 3. Added "Reviewed-by: Miguel Luis <miguel.luis@oracle.com>" to [PATCH RFC V4 01/33]
> 4. Dropped the `CPUState::disabled` flag and introduced `GICv3State::num_smp_cpus` flag
>    - All `GICv3CPUState` between [num_smp_cpus,num_cpus) are marked as `inaccessible` during gicv3_common_realize()
>    - qemu_enabled_cpu() not required - removed!
>    - removed usage of `CPUState::disabled` from virt.c and hw/cpu64.c
> 5. Removed virt_cpu_properties() and introduced property `mp-affinity` get accessor
> 6. Dropped "[PATCH RFC V4 12/33] arm/virt: Create GED device before *disabled* vCPU Objects are destroyed"
> 
> RFC V3 -> RFC V4:
> -----------------
> 1. Addressed Nicholas Piggin's (IBM) comments
>    - Moved qemu_get_cpu_archid() into acpi/cpu.h as an inline ACPI helper
>      https://lore.kernel.org/qemu-devel/D2GFCLH11HGJ.1IJGANHQ9ZQRL@gmail.com/
>    - Introduced new macro CPU_FOREACH_POSSIBLE() in [PATCH 12/33] 
>      https://lore.kernel.org/qemu-devel/D2GF9A9AJO02.1G1G8UEXA5AOD@gmail.com/
>    - Converted CPUState::acpi_persistent into Property. Improved the cover note
>      https://lore.kernel.org/qemu-devel/D2H62RK48KT7.2BTQEZUOEGG4L@gmail.com/
>    - Fixed the cover note of the [PATCH ] and clearly mentioned KVM parking
>      https://lore.kernel.org/qemu-devel/D2GFOGQC3HYO.2LKOV306JIU98@gmail.com/ 
> 2. Addressed Gavin Shan's (RedHat) comments:
>    - Introduced the ARM Extensions check. [Looks like I missed the PMU check :( ]
>      https://lore.kernel.org/qemu-devel/28f3107f-0267-4112-b0ca-da59df2968ae@redhat.com/
>    - Moved create_gpio() along with create_ged()
>      https://lore.kernel.org/qemu-devel/143ad7d2-8f45-4428-bed3-891203a49029@redhat.com/
>    - Improved the logic of the GIC creation and initialization
>      https://lore.kernel.org/qemu-devel/9b7582f0-8149-4bf0-a1aa-4d4fe0d35e70@redhat.com/
>    - Removed redundant !dev->realized checks in cpu_hotunplug(_request)
>      https://lore.kernel.org/qemu-devel/64e9feaa-8df2-4108-9e73-c72517fb074a@redhat.com/
> 3. Addressed Alex Bennée's and Gustavo Romero's (Linaro) comments:
>    - Fixed the TCG support and now it works for all the cases including migration.
>      https://lore.kernel.org/qemu-devel/87bk1b3azm.fsf@draig.linaro.org/
>    - Fixed the cpu_address_space_destroy() compilation failure in user-mode
>      https://lore.kernel.org/qemu-devel/87v800wkb1.fsf@draig.linaro.org/
> 4. Fixed crash in .post_gicv3() during migration with asymmetrically *enabled*
>      vCPUs at destination VM
> 
> RFC V2 -> RFC V3:
> -----------------
> 1. Miscellaneous:
>    - Split the RFC V2 into arch-agnostic and arch-specific patch sets.
> 2. Addressed Gavin Shan's (RedHat) comments:
>    - Made CPU property accessors inline.
>      https://lore.kernel.org/qemu-devel/6cd28639-2cfa-f233-c6d9-d5d2ec5b1c58@redhat.com/
>    - Collected Reviewed-bys [PATCH RFC V2 4/37, 14/37, 22/37].
>    - Dropped the patch as it was not required after init logic was refactored.
>      https://lore.kernel.org/qemu-devel/4fb2eef9-6742-1eeb-721a-b3db04b1be97@redhat.com/
>    - Fixed the range check for the core during vCPU Plug.
>      https://lore.kernel.org/qemu-devel/1c5fa24c-6bf3-750f-4f22-087e4a9311af@redhat.com/
>    - Added has_hotpluggable_vcpus check to make build_cpus_aml() conditional.
>      https://lore.kernel.org/qemu-devel/832342cb-74bc-58dd-c5d7-6f995baeb0f2@redhat.com/
>    - Fixed the states initialization in cpu_hotplug_hw_init() to accommodate previous refactoring.
>      https://lore.kernel.org/qemu-devel/da5e5609-1883-8650-c7d8-6868c7b74f1c@redhat.com/
>    - Fixed typos.
>      https://lore.kernel.org/qemu-devel/eb1ac571-7844-55e6-15e7-3dd7df21366b@redhat.com/
>    - Removed the unnecessary 'goto fail'.
>      https://lore.kernel.org/qemu-devel/4d8980ac-f402-60d4-fe52-787815af8a7d@redhat.com/#t
>    - Added check for hotpluggable vCPUs in the _OSC method.
>      https://lore.kernel.org/qemu-devel/20231017001326.FUBqQ1PTowF2GxQpnL3kIW0AhmSqbspazwixAHVSi6c@z/
> 3. Addressed Shaoqin Huang's (Intel) comments:
>    - Fixed the compilation break caused by a missing call to virt_cpu_properties()
>      and its missing definition.
>      https://lore.kernel.org/qemu-devel/3632ee24-47f7-ae68-8790-26eb2cf9950b@redhat.com/
> 4. Addressed Jonathan Cameron's (Huawei) comments:
>    - Gated the 'disabled vcpu message' for GIC version < 3.
>      https://lore.kernel.org/qemu-devel/20240116155911.00004fe1@Huawei.com/
> 
> RFC V1 -> RFC V2:
> -----------------
> 1. Addressed James Morse's (ARM) requirement as per Linaro Open Discussion:
>    - Exposed all possible vCPUs as always ACPI _STA.present and available during boot time.
>    - Added the _OSC handling as required by James's patches.
>    - Introduction of 'online-capable' bit handling in the flag of MADT GICC.
>    - SMCCC Hypercall Exit handling in Qemu.
> 2. Addressed Marc Zyngier's comment:
>    - Fixed the note about GIC CPU Interface in the cover letter.
> 3. Addressed issues raised by Vishnu Pajjuru (Ampere) & Miguel Luis (Oracle) during testing:
>    - Live/Pseudo Migration crashes.
> 4. Others:
>    - Introduced the concept of persistent vCPU at QOM.
>    - Introduced wrapper APIs of present, possible, and persistent.
>    - Change at ACPI hotplug H/W init leg accommodating initializing is_present and is_enabled states.
>    - Check to avoid unplugging cold-booted vCPUs.
>    - Disabled hotplugging with TCG/HVF/QTEST.
>    - Introduced CPU Topology, {socket, cluster, core, thread}-id property.
>    - Extract virt CPU properties as a common virt_vcpu_properties() function.
> 
> =======================
> (XIII) ACKNOWLEDGEMENTS
> =======================
> 
> I would like to thank the following people for various discussions with me over
> different channels during development:
> 
> Marc Zyngier (Google), Catalin Marinas (ARM), James Morse (ARM), Will Deacon (Google), 
> Jean-Philippe Brucker (Linaro), Sudeep Holla (ARM), Lorenzo Pieralisi (Linaro), 
> Gavin Shan (RedHat), Jonathan Cameron (Huawei), Darren Hart (Ampere), 
> Igor Mammedov (RedHat), Ilkka Koskinen (Ampere), Andrew Jones (RedHat), 
> Karl Heubaum (Oracle), Keqian Zhu (Huawei), Miguel Luis (Oracle), 
> Xiongfeng Wang (Huawei), Vishnu Pajjuri (Ampere), Shameerali Kolothum (Huawei), 
> Russell King (Oracle), Xuwei/Joy (Huawei), Peter Maydell (Linaro),
> Zengtao/Prime (Huawei), Nicholas Piggin (IBM), Alex Bennée (Linaro) and all those
> whom I have missed!
> 
> Many thanks to the following people for their current or past contributions:
> 
> 1. James Morse (ARM)
>    (Current Kernel part of vCPU Hotplug Support on AARCH64)
> 2. Jean-Philippe Brucker (Linaro)
>    (Prototyped one of the earlier PSCI-based POC [17][18] based on RFC V1)
> 3. Keqian Zhu (Huawei)
>    (Co-developed Qemu prototype)
> 4. Xiongfeng Wang (Huawei)
>    (Co-developed an earlier kernel prototype with me)
> 5. Vishnu Pajjuri (Ampere)
>    (Verification on Ampere ARM64 Platforms + fixes)
> 6. Miguel Luis (Oracle)
>    (Verification on Oracle ARM64 Platforms + fixes)
> 7. Russell King (Oracle) & Jonathan Cameron (Huawei)
>    (Helping in upstreaming James Morse's Kernel patches).
> 
> ================
> (XIV) REFERENCES
> ================
> 
> [1] https://lore.kernel.org/qemu-devel/20200613213629.21984-1-salil.mehta@huawei.com/
> [2] https://lore.kernel.org/linux-arm-kernel/20200625133757.22332-1-salil.mehta@huawei.com/
> [3] https://lore.kernel.org/lkml/20230203135043.409192-1-james.morse@arm.com/
> [4] https://lore.kernel.org/all/20230913163823.7880-1-james.morse@arm.com/
> [5] https://lore.kernel.org/all/20230404154050.2270077-1-oliver.upton@linux.dev/
> [6] https://bugzilla.tianocore.org/show_bug.cgi?id=3706
> [7] https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#gic-cpu-interface-gicc-structure
> [8] https://bugzilla.tianocore.org/show_bug.cgi?id=4481#c5
> [9] https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler
> [10] https://docs.aws.amazon.com/eks/latest/userguide/vertical-pod-autoscaler.html
> [11] https://lkml.org/lkml/2019/7/10/235
> [12] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-July/032316.html
> [13] https://lists.gnu.org/archive/html/qemu-devel/2020-01/msg06517.html
> [14] https://op-lists.linaro.org/archives/list/linaro-open-discussions@op-lists.linaro.org/thread/7CGL6JTACPUZEYQC34CZ2ZBWJGSR74WE/
> [15] http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg01168.html
> [16] https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg00131.html
> [17] https://op-lists.linaro.org/archives/list/linaro-open-discussions@op-lists.linaro.org/message/X74JS6P2N4AUWHHATJJVVFDI2EMDZJ74/
> [18] https://lore.kernel.org/lkml/20210608154805.216869-1-jean-philippe@linaro.org/
> [19] https://lore.kernel.org/all/20230913163823.7880-1-james.morse@arm.com/ 
> [20] https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#gicc-cpu-interface-flags
> [21] https://lore.kernel.org/qemu-devel/20230926100436.28284-1-salil.mehta@huawei.com/
> [22] https://lore.kernel.org/qemu-devel/20240607115649.214622-1-salil.mehta@huawei.com/T/#md0887eb07976bc76606a8204614ccc7d9a01c1f7
> [23] RFC V3: https://lore.kernel.org/qemu-devel/20240613233639.202896-1-salil.mehta@huawei.com/#t
> 
> Author Salil Mehta (1):
>   target/arm/kvm,tcg: Handle SMCCC hypercall exits in VMM during PSCI_CPU_{ON,OFF}
> 
> Jean-Philippe Brucker (1):
>   target/arm/kvm: Write vCPU's state back to KVM on cold-reset
> 
> Salil Mehta (22):
>   hw/core: Introduce administrative power-state property and its accessors
>   hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter
>   hw/arm/virt: Clamp 'maxcpus' as-per machine's vCPU deferred online-capability
>   arm/virt,target/arm: Add new ARMCPU {socket,cluster,core,thread}-id property
>   arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
>   arm/virt,gicv3: Pre-size GIC with possible vCPUs at machine init
>   arm/gicv3: Refactor CPU interface init for shared TCG/KVM use
>   arm/virt, gicv3: Guard CPU interface access for admin disabled vCPUs
>   hw/intc/arm_gicv3_common: Migrate & check 'GICv3CPUState' accessibility mismatch
>   arm/virt: Init PMU at host for all present vCPUs
>   hw/arm/acpi: MADT change to size the guest with possible vCPUs
>   hw/core: Introduce generic device power-state handler interface
>   qdev: make admin power state changes trigger platform transitions via ACPI
>   arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms
>   acpi/ged: Notify OSPM of CPU administrative state changes via GED
>   arm/virt/acpi: Update ACPI DSDT Tbl to include 'Online-Capable' CPUs AML
>   hw/arm/virt,acpi/ged: Add PowerStateHandler hooks for runtime CPU state changes
>   target/arm/cpu: Add the Accessor hook to fetch ARM CPU arch-id
>   hw/intc/arm-gicv3-kvm: Pause all vCPUs & cache ICC_CTLR_EL1 for userspace PSCI CPU_ON
>   monitor,qdev: Introduce 'device_set' to change admin state of existing devices
>   monitor,qapi: add 'info cpus-powerstate' and QMP query (Admin + Oper states)
>   tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc
> 
>  accel/kvm/kvm-all.c                    |   2 +-
>  accel/tcg/tcg-accel-ops-mttcg.c        |   2 +-
>  accel/tcg/tcg-accel-ops-rr.c           |   2 +-
>  cpu-common.c                           |   4 +-
>  hmp-commands-info.hx                   |  32 ++
>  hmp-commands.hx                        |  30 +
>  hw/acpi/Kconfig                        |   3 +
>  hw/acpi/acpi-cpu-ospm-interface-stub.c |  41 ++
>  hw/acpi/cpu_ospm_interface.c           | 747 +++++++++++++++++++++++++
>  hw/acpi/generic_event_device.c         |  91 +++
>  hw/acpi/meson.build                    |   2 +
>  hw/acpi/trace-events                   |  17 +
>  hw/arm/Kconfig                         |   1 +
>  hw/arm/virt-acpi-build.c               |  75 ++-
>  hw/arm/virt.c                          | 573 +++++++++++++++++--
>  hw/core/cpu-common.c                   |  12 +
>  hw/core/machine-hmp-cmds.c             |  62 ++
>  hw/core/machine-qmp-cmds.c             | 107 ++++
>  hw/core/machine-smp.c                  |  24 +-
>  hw/core/machine.c                      |  28 +
>  hw/core/meson.build                    |   1 +
>  hw/core/powerstate.c                   | 100 ++++
>  hw/core/qdev.c                         | 197 +++++++
>  hw/intc/arm_gicv3.c                    |   1 +
>  hw/intc/arm_gicv3_common.c             |  64 ++-
>  hw/intc/arm_gicv3_cpuif.c              | 270 ++++-----
>  hw/intc/arm_gicv3_cpuif_common.c       |  58 ++
>  hw/intc/arm_gicv3_kvm.c                | 123 +++-
>  hw/intc/gicv3_internal.h               |   1 +
>  include/hw/acpi/acpi_dev_interface.h   |   1 +
>  include/hw/acpi/cpu_ospm_interface.h   |  78 +++
>  include/hw/acpi/generic_event_device.h |   6 +
>  include/hw/arm/virt.h                  |  42 +-
>  include/hw/boards.h                    |  37 ++
>  include/hw/core/cpu.h                  |  71 +++
>  include/hw/intc/arm_gicv3_common.h     |  65 +++
>  include/hw/powerstate.h                | 177 ++++++
>  include/hw/qdev-core.h                 | 151 +++++
>  include/monitor/hmp.h                  |   3 +
>  include/monitor/qdev.h                 |  30 +
>  include/system/kvm.h                   |   8 +
>  include/system/system.h                |   1 +
>  include/tcg/startup.h                  |   6 +
>  include/tcg/tcg.h                      |   1 +
>  qapi/machine.json                      |  90 +++
>  qemu-options.hx                        | 129 ++++-
>  stubs/meson.build                      |   1 +
>  stubs/powerstate-stubs.c               |  47 ++
>  system/cpus.c                          |   4 +-
>  system/qdev-monitor.c                  | 139 ++++-
>  system/vl.c                            |  42 ++
>  target/arm/arm-powerctl.c              |  29 +-
>  target/arm/cpu.c                       |  14 +
>  target/arm/cpu.h                       |   5 +
>  target/arm/helper.c                    |   2 +-
>  target/arm/internals.h                 |   2 +-
>  target/arm/kvm.c                       | 140 ++++-
>  target/arm/kvm_arm.h                   |  25 +
>  target/arm/meson.build                 |   1 +
>  target/arm/{tcg => }/psci.c            |   9 +
>  target/arm/tcg/meson.build             |   4 -
>  tcg/region.c                           |  16 +
>  tcg/tcg.c                              |  19 +-
>  63 files changed, 3800 insertions(+), 265 deletions(-)
>  create mode 100644 hw/acpi/acpi-cpu-ospm-interface-stub.c
>  create mode 100644 hw/acpi/cpu_ospm_interface.c
>  create mode 100644 hw/core/powerstate.c
>  create mode 100644 include/hw/acpi/cpu_ospm_interface.h
>  create mode 100644 include/hw/powerstate.h
>  create mode 100644 stubs/powerstate-stubs.c
>  rename target/arm/{tcg => }/psci.c (96%)
> 




* RE: [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc
  2025-10-02 15:41       ` Richard Henderson
@ 2025-10-07 10:14         ` Salil Mehta via
  0 siblings, 0 replies; 67+ messages in thread
From: Salil Mehta via @ 2025-10-07 10:14 UTC (permalink / raw)
  To: Richard Henderson, salil.mehta@opnsrc.net, qemu-devel@nongnu.org,
	qemu-arm@nongnu.org, mst@redhat.com

Hi Richard,

Sorry for the delay in reply. 

> From: Richard Henderson <richard.henderson@linaro.org>
> Sent: Thursday, October 2, 2025 4:41 PM
> 
> On 10/2/25 05:27, Salil Mehta wrote:
> > Hi Richard,
> >
> > Thanks for the reply. Please find my response inline.
> >
> > Cheers.
> >
> >> From: qemu-devel-bounces+salil.mehta=huawei.com@nongnu.org
> <qemu-
> >> devel-bounces+salil.mehta=huawei.com@nongnu.org> On Behalf Of
> Richard
> >> Henderson
> >> Sent: Wednesday, October 1, 2025 10:34 PM
> >> To: salil.mehta@opnsrc.net; qemu-devel@nongnu.org; qemu-
> >> arm@nongnu.org; mst@redhat.com
> >> Subject: Re: [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy
> >> realized' vCPUs on first region alloc
> >>
> >> On 9/30/25 18:01, salil.mehta@opnsrc.net wrote:
> >>> From: Salil Mehta <salil.mehta@huawei.com>
> >>>
> >>> The TCG code cache is split into regions shared by vCPUs under MTTCG.
> >>> For cold-boot (early realized) vCPUs, regions are sized/allocated
> >>> during
> >> bring-up.
> >>> However, when a vCPU is *lazy_realized* (administratively "disabled"
> >>> at boot and realized later on demand), its TCGContext may fail the
> >>> very first code region allocation if the shared TB cache is
> >>> saturated by already-running vCPUs.
> >>>
> >>> Flushing the TB cache is the right remediation, but `tb_flush()`
> >>> must be performed from the safe execution context
> >> (cpu_exec_loop()/tb_gen_code()).
> >>> This patch wires a deferred flush:
> >>>
> >>>     * In `tcg_region_initial_alloc__locked()`, treat an initial allocation
> >>>       failure for a lazily realized vCPU as non-fatal: set `s->tbflush_pend`
> >>>       and return.
> >>>
> >>>     * In `tcg_tb_alloc()`, if `s->tbflush_pend` is observed, clear it and
> >>>       return NULL so the caller performs a synchronous `tb_flush()` and
> then
> >>>       retries allocation.
> >>>
> >>> This avoids hangs observed when a newly realized vCPU cannot obtain
> >>> its first region under TB-cache pressure, while keeping the flush at
> >>> a safe
> >> point.
> >>>
> >>> No change for cold-boot vCPUs and when accel ops is KVM.
> >>>
> >>> In earlier series, this patch was named:
> >>> 'tcg: Update tcg_register_thread() leg to handle region alloc for
> >>> hotplugged
> >> vCPU'
> >>
> >>
> >> I don't see why you need two different booleans for this.
> >
> >
> > I can see your point. Maybe I can move `s->tbflush_pend`  to 'CPUState'
> instead?
> >
> >
> >> It seems to me that you could create the cpu in a state for which the
> >> first call to
> >> tcg_tb_alloc() sees highwater state, and everything after that
> >> happens per usual allocating a new region, and possibly flushing the full
> buffer.
> >
> >
> > Correct, but with the distinction that the highwater state is relevant to a
> > TCGContext and the regions are allocated from a common pool 'Code
> Generation Buffer'.
> > 'code_gen_highwater' is used to detect whether the current context needs
> > more region allocation for the dynamic translation to continue. This
> > is a different condition than what we are encountering; which is the
> > worst case condition that the entire code generation buffer is
> > saturated and cannot even allocate a single free TCG region successfully.
> 
> I think you misunderstand "and everything after that happens per usual".
> 
> When allocating a tb, if a cpu finds that its current region is full, then it tries
> to allocate another region.  If that is not successful, then we flush the entire
> code_gen_buffer and try again.
> 
> Thus tbflush_pend is exactly equivalent to setting
> 
>      s->code_gen_ptr > s->code_gen_highwater.
> 
> As far as lazy_realized...  The utility of the assert under these conditions may
> be called into question; we could just remove it.


I understand your point. I'll remove the 'tbflush_pend' flag and directly use
'code_gen_highwater = NULL' so that we hit the highwater condition early
when a TCG thread gets lazily realized. And yes, we might have to either
remove or conditionally bypass the assert(). I will dig further and validate.
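
For reference, the allocate/flush/retry flow discussed above can be modelled in
a few lines. This is only a rough, self-contained sketch and not QEMU code:
`ToyPool`, `ToyCtx`, `toy_tb_alloc()` and the other `toy_*` names are made up
for illustration, and a zero `highwater` stands in for the "no region yet"
state of a lazily realized context. The first allocation check fails, the
normal new-region path is taken, and a full flush happens only if the shared
pool is exhausted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and names are illustrative, not QEMU's. */
enum { TOY_REGION_SIZE = 64, TOY_NUM_REGIONS = 2 };

typedef struct {
    size_t regions_in_use;  /* regions handed out from the shared pool */
    unsigned flush_count;   /* number of full flushes performed */
} ToyPool;

typedef struct {
    size_t ptr;             /* next free offset in the current region */
    size_t highwater;       /* 0 models "no region assigned yet" */
} ToyCtx;

/* Give ctx a fresh region; fails when the shared pool is exhausted. */
static bool toy_region_alloc(ToyPool *pool, ToyCtx *ctx)
{
    if (pool->regions_in_use >= TOY_NUM_REGIONS) {
        return false;
    }
    pool->regions_in_use++;
    ctx->ptr = 0;
    ctx->highwater = TOY_REGION_SIZE;
    return true;
}

/* Returns the offset of the new block, or -1 if the caller must flush. */
static long toy_tb_alloc(ToyPool *pool, ToyCtx *ctx, size_t size)
{
    assert(size <= TOY_REGION_SIZE);
    if (ctx->ptr + size > ctx->highwater && !toy_region_alloc(pool, ctx)) {
        return -1;
    }
    long off = (long)ctx->ptr;
    ctx->ptr += size;
    return off;
}

/* Flush: release every region. The toy resets only the calling context. */
static void toy_flush(ToyPool *pool, ToyCtx *ctx)
{
    pool->regions_in_use = 0;
    pool->flush_count++;
    ctx->ptr = 0;
    ctx->highwater = 0;
}

/* Caller side: allocate, and on failure flush once and retry. */
static long toy_tb_gen(ToyPool *pool, ToyCtx *ctx, size_t size)
{
    long off = toy_tb_alloc(pool, ctx, size);
    if (off < 0) {
        toy_flush(pool, ctx);
        off = toy_tb_alloc(pool, ctx, size);
    }
    return off;
}
```

In this model no separate "flush pending" flag is needed: a context that
starts with `highwater == 0` while the pool is saturated gets -1 from its first
`toy_tb_alloc()`, and `toy_tb_gen()` then flushes and succeeds on the retry.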

Many thanks for this optimization!

Best regards
Salil.


> 
> 
> r~

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 14/24] arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms
       [not found]     ` <7da6a9c470684754810414f0abd23a62@huawei.com>
@ 2025-10-07 12:06       ` Igor Mammedov
  2025-10-10  3:00         ` Salil Mehta
  0 siblings, 1 reply; 67+ messages in thread
From: Igor Mammedov @ 2025-10-07 12:06 UTC (permalink / raw)
  To: Salil Mehta
  Cc: salil.mehta@opnsrc.net, qemu-devel@nongnu.org,
	qemu-arm@nongnu.org, mst@redhat.com, maz@kernel.org,
	jean-philippe@linaro.org, Jonathan Cameron, lpieralisi@kernel.org,
	peter.maydell@linaro.org, richard.henderson@linaro.org,
	armbru@redhat.com, andrew.jones@linux.dev, david@redhat.com,
	philmd@linaro.org, eric.auger@redhat.com, will@kernel.org,
	ardb@kernel.org, oliver.upton@linux.dev, pbonzini@redhat.com,
	gshan@redhat.com, rafael@kernel.org, borntraeger@linux.ibm.com,
	alex.bennee@linaro.org, gustavo.romero@linaro.org,
	npiggin@gmail.com, harshpb@linux.ibm.com, linux@armlinux.org.uk,
	darren@os.amperecomputing.com, ilkka@os.amperecomputing.com,
	vishnu@os.amperecomputing.com, gankulkarni@os.amperecomputing.com,
	karl.heubaum@oracle.com, miguel.luis@oracle.com, zhukeqian,
	wangxiongfeng (C), wangyanan (Y), Wangzhou (B), Linuxarm,
	jiakernel2@gmail.com, maobibo@loongson.cn, lixianglai@loongson.cn,
	shahuang@redhat.com, zhao1.liu@intel.com

On Tue, 7 Oct 2025 11:15:47 +0000
Salil Mehta <salil.mehta@huawei.com> wrote:

> Hi Igor,
> 
> Thanks for the reviews and sorry for the late reply. Please find my replies inline.
> 
> 
> > From: Igor Mammedov <imammedo@redhat.com>
> > Sent: Friday, October 3, 2025 3:58 PM
> > 
> > On Wed,  1 Oct 2025 01:01:17 +0000
> > salil.mehta@opnsrc.net wrote:
> >   
> > > From: Salil Mehta <salil.mehta@huawei.com>
> > >
> > > The existing ACPI CPU hotplug interface is built for x86 platforms
> > > where CPUs can be inserted or removed and resources are allocated
> > > dynamically. On ARM, CPUs are never hotpluggable: resources are
> > > allocated at boot and QOM vCPU objects always exist. Instead, CPUs are
> > > administratively managed by toggling ACPI _STA to enable or disable
> > > them, which gives a hotplug-like effect but does not match the x86 model.
> > >
> > > Reusing the x86 hotplug AML code would complicate maintenance since
> > > much of its logic relies on toggling the _STA.Present bit to notify
> > > OSPM about CPU insertion or removal. Such usage is not architecturally
> > > valid on ARM, where CPUs cannot appear or disappear at runtime. Mixing
> > > both models in one interface would increase complexity and make the
> > > AML harder to extend. A separate path is therefore required. The new
> > > > design is heavily inspired by the CPU hotplug interface but avoids its
> > > > unsuitable semantics.
> > 
> > > Let me ask how large the existing CPUHP AML code would become if you reused
> > > it and added handling of the 'enabled' bit there?
> > > 
> > > Would it be the same 700 LOC as in this patch, which is basically a
> > > duplication of the existing CPUHP ACPI interface?
> 
> 
> It is by design, as we have now adopted a non-hotplug approach and closely aligned
> ourselves with what the PSCI standard considers the definition of CPU hotplug on ARM
> platforms - at least, as of today! And it is *NOT* what 'CPU hotplug' means on x86 platforms.

There is no argument that they are different, but could you point to the
PSCI-specific parts in this patch?

> In essence, this means:
> 1. Dropping any hotplug/unplug infrastructure and its related paraphernalia from the
>     ARM implementation until the meaning of physical CPU hotplug is made clear by
>     the specification. We do not want to model in QEMU something which does not
>     exist or is not defined, especially for the CPU hotplug case.

There is an 'opts' config struct that lets the user opt in to or out of specific
AML being generated. You could use that to disable some hotplug-only bits of AML.
Other bits that are more generic/reusable can simply be refactored/renamed to
more generic names.
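
As a rough illustration of that kind of opt-in/opt-out, a single generator can
always emit the shared parts and gate the rest on an options struct. This is a
self-contained sketch only: `ToyCpuAmlOpts`, its fields and the part names are
hypothetical and do not correspond to the actual struct or AML methods in
acpi/cpu.c, and which parts count as "shared" here is an assumption made for
the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical option set; field names are illustrative only. */
typedef struct {
    bool sta_present_toggle;  /* hotplug-only: CPUs appear and disappear */
    bool sta_enabled_toggle;  /* CPUs always present, only enabled/disabled */
} ToyCpuAmlOpts;

/* Append one space-separated part name to out, skipping it if out is full. */
static void toy_append(char *out, size_t len, const char *part)
{
    size_t used = strlen(out);
    if (used + 1 < len) {
        snprintf(out + used, len - used, "%s%s", used ? " " : "", part);
    }
}

/* One generator: shared parts always, optional parts gated by opts. */
static void toy_build_cpu_aml(const ToyCpuAmlOpts *opts, char *out, size_t len)
{
    assert(len > 0);
    out[0] = '\0';
    toy_append(out, len, "scan");
    toy_append(out, len, "ost");
    toy_append(out, len, "eject");
    if (opts->sta_present_toggle) {
        toy_append(out, len, "sta-present");
    }
    if (opts->sta_enabled_toggle) {
        toy_append(out, len, "sta-enabled");
    }
}
```

With only `sta_enabled_toggle` set, the generator emits the shared parts plus
the enabled-bit handling and nothing that depends on the Present bit.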

> 2. This also means *NOT* enabling the  ACPI_CPU_HOTPLUG compilation switch to 
>      preserve the sanctity of the clean design.

That's semantics; I'd suggest renaming it to ACPI_CPU.

> 3. Yes, there is code duplication for now, but that is a matter of further
>     optimization and cleanup, not a design issue. Some examples are:
>     (1)  ACPI device-check and eject-request handling code can be extracted and
>     made generic for all devices, not just for CPUs.

Make it more generic in acpi/cpu.c instead of copying.
I don't have any objections to refactoring existing code if it makes sense and
we can share the code.

>     (2)  Right now, acpi/cpu.c assumes that resources and templates should be the
>      same for all the CPUs using the CPUs AML described in it. There is no need for
>      such a restriction. Every platform should be free to choose the way it wants to
>      manage the resources and the interpretation of the fields inside it.

Please be more specific about why you would need a different resources/MMIO
layout for this series.

The thing is, if we copied code every time we needed something a bit different,
we would end up with an unsupportable/bloated QEMU.

>     (3) Callbacks used with GED make an assumption of a HOTPLUG interface, etc.
>     (4) In fact, the prototype of the GED event handler makes a similar mistake of
>      assuming that GED is only meant for devices supporting hotplug when this is not
>      the case even as per the ACPI specification.

Please be more specific and point to the problematic code.

The current acpi/cpu.c might be compiled under the ACPI_CPU_HOTPLUG knob, but it is
not really limited to hotplug. It is compiled that way only because hotplug was
the sole reason for building CPUs AML at all.

What I see in the patch is simplifying the current code somewhat by dropping
some hotplug-related bits, plus a bunch of renaming.
Otherwise it's pretty much duplicating the current acpi/cpu.c.

Besides that simplification, I don't see any reason why duplicating such an amount
of code is a good idea. Consider making the existing acpi/cpu.c more generic instead.

> RFC V5 was an attempt to implement this feature using the hotplug infrastructure,
> and this RFC V6 deviates from that approach towards a non-hotplug one. We do
> not want a hotchpotch approach because that’s a recipe for future disaster.
> 
> 
> Many Thanks!
> Salil.
> 
> 
> >   
> > >
> > > This patch adds a dedicated CPU OSPM (Operating System Power
> > > Management) interface. It provides a memory-mapped control region with
> > > selector, flags, command, and data fields, and AML methods for
> > > device-check, eject request, and _OST reporting. OSPM is notified
> > > through GED events and can coordinate CPU events directly with QEMU.
> > > Other ARM-like architectures may also use this interface.
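
To make the register protocol easier to follow, here is a small device-side
model of the control region. It is a sketch under assumptions, not the patch
itself: the offsets are taken from the layout comment in the patch (selector
at 0x00, flags at 0x04, command at 0x05, data at 0x08, 16 bytes in total),
while `ToyDev`, `ToyCpu`, `TOY_NCPUS` and the `toy_*` functions are invented
names. The scan helper mirrors the wrap-around search that the
"get next CPU with event" command performs from the currently selected CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Register offsets as given in the patch's layout comment (16 bytes). */
enum {
    TOY_OFF_SELECTOR = 0x00, /* 4 bytes */
    TOY_OFF_FLAGS    = 0x04, /* 1 byte  */
    TOY_OFF_COMMAND  = 0x05, /* 1 byte, followed by 2 reserved bytes */
    TOY_OFF_DATA     = 0x08, /* 8 bytes */
    TOY_REG_LEN      = 0x10,
};
_Static_assert(TOY_OFF_DATA + 8 == TOY_REG_LEN, "layout must total 16 bytes");
_Static_assert(TOY_OFF_DATA % 8 == 0, "data register must be 8-byte aligned");

/* Flag bits within the one-byte flags register. */
enum {
    TOY_FLAG_ENABLED = 1u << 0,
    TOY_FLAG_DEVCHK  = 1u << 1,
    TOY_FLAG_EJECTRQ = 1u << 2,
};

enum { TOY_NCPUS = 4 };  /* illustrative CPU count */

typedef struct {
    bool enabled;
    bool devchk_pending;
    bool ejrqst_pending;
} ToyCpu;

typedef struct {
    uint32_t selector;   /* index of the currently selected CPU */
    ToyCpu cpus[TOY_NCPUS];
} ToyDev;

/*
 * Starting at the selected CPU, find the next CPU with a pending event,
 * wrapping around at most once. The selector is left unchanged if no CPU
 * has an event.
 */
static uint32_t toy_next_cpu_with_event(ToyDev *d)
{
    assert(d->selector < (uint32_t)TOY_NCPUS);
    uint32_t iter = d->selector;

    do {
        if (d->cpus[iter].devchk_pending || d->cpus[iter].ejrqst_pending) {
            d->selector = iter;
            break;
        }
        iter = iter + 1 < (uint32_t)TOY_NCPUS ? iter + 1 : 0;
    } while (iter != d->selector);
    return d->selector;
}

/* Compose the flags byte for the currently selected CPU. */
static uint8_t toy_read_flags(const ToyDev *d)
{
    const ToyCpu *c = &d->cpus[d->selector];
    uint8_t val = 0;

    val |= c->enabled ? TOY_FLAG_ENABLED : 0;
    val |= c->devchk_pending ? TOY_FLAG_DEVCHK : 0;
    val |= c->ejrqst_pending ? TOY_FLAG_EJECTRQ : 0;
    return val;
}
```

A guest-side scan would then write the selector, issue the scan command, read
back the selected index from the data register, and finally read the flags to
decide whether a device-check or an eject-request needs handling.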
> > >
> > > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > > ---
> > >  hw/acpi/Kconfig                        |   3 +
> > >  hw/acpi/acpi-cpu-ospm-interface-stub.c |  41 ++
> > > >  hw/acpi/cpu_ospm_interface.c           | 747 +++++++++++++++++++++++++
> > >  hw/acpi/meson.build                    |   2 +
> > >  hw/acpi/trace-events                   |  17 +
> > >  hw/arm/Kconfig                         |   1 +
> > >  include/hw/acpi/cpu_ospm_interface.h   |  78 +++
> > >  7 files changed, 889 insertions(+)
> > >  create mode 100644 hw/acpi/acpi-cpu-ospm-interface-stub.c
> > > >  create mode 100644 hw/acpi/cpu_ospm_interface.c
> > > >  create mode 100644 include/hw/acpi/cpu_ospm_interface.h
> > >
> > > > diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
> > > > index 1d4e9f0845..aa52f0468f 100644
> > > --- a/hw/acpi/Kconfig
> > > +++ b/hw/acpi/Kconfig
> > > @@ -21,6 +21,9 @@ config ACPI_ICH9
> > >  config ACPI_CPU_HOTPLUG
> > >      bool
> > >
> > > +config ACPI_CPU_OSPM_INTERFACE
> > > +    bool
> > > +
> > >  config ACPI_MEMORY_HOTPLUG
> > >      bool
> > >      select MEM_DEVICE
> > > > diff --git a/hw/acpi/acpi-cpu-ospm-interface-stub.c b/hw/acpi/acpi-cpu-ospm-interface-stub.c
> > > new file mode 100644
> > > index 0000000000..f6f333f641
> > > --- /dev/null
> > > +++ b/hw/acpi/acpi-cpu-ospm-interface-stub.c
> > > @@ -0,0 +1,41 @@
> > > +/*
> > > + * ACPI CPU OSPM Interface Handling.
> > > + *
> > > + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> > > + *
> > > + * Author: Salil Mehta <salil.mehta@huawei.com>
> > > + *
> > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > + *
> > > > + * This program is free software; you can redistribute it and/or modify
> > > > + * it under the terms of the GNU General Public License as published by
> > > > + * the Free Software Foundation; either version 2 of the License, or
> > > > + * (at your option) any later version.
> > > + */
> > > +
> > > +#include "qemu/osdep.h"
> > > +#include "hw/acpi/cpu_ospm_interface.h"
> > > +
> > > > +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                              uint32_t event_st, Error **errp)
> > > > +{
> > > > +}
> > > > +
> > > > +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                               uint32_t event_st, Error **errp)
> > > > +{
> > > > +}
> > > > +
> > > > +void acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev, Error **errp)
> > > > +{
> > > > +}
> > > > +
> > > > +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> > > > +                                        AcpiCpuOspmState *state,
> > > > +                                        hwaddr base_addr)
> > > > +{
> > > > +}
> > > > +
> > > > +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st, ACPIOSTInfoList ***list)
> > > > +{
> > > > +}
> > > > diff --git a/hw/acpi/cpu_ospm_interface.c b/hw/acpi/cpu_ospm_interface.c
> > > > new file mode 100644
> > > > index 0000000000..61aab8a793
> > > --- /dev/null
> > > +++ b/hw/acpi/cpu_ospm_interface.c
> > > @@ -0,0 +1,747 @@
> > > +/*
> > > + * ACPI CPU OSPM Interface Handling.
> > > + *
> > > + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> > > + *
> > > + * Author: Salil Mehta <salil.mehta@huawei.com>
> > > + *
> > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > + *
> > > > + * This program is free software; you can redistribute it and/or modify
> > > > + * it under the terms of the GNU General Public License as published by
> > > > + * the Free Software Foundation; either version 2 of the License, or
> > > > + * (at your option) any later version.
> > > + */
> > > +
> > > +#include "qemu/osdep.h"
> > > +#include "migration/vmstate.h"
> > > +#include "hw/core/cpu.h"
> > > +#include "qapi/error.h"
> > > +#include "trace.h"
> > > +#include "qapi/qapi-events-acpi.h"
> > > +#include "hw/acpi/cpu_ospm_interface.h"
> > > +
> > > +/* CPU identifier and resource device */
> > > > +#define CPU_NAME_FMT      "C%.03X" /* CPU name format (e.g., C001) */
> > > > +#define CPU_RES_DEVICE    "CPUR" /* CPU resource device name */
> > > > +#define CPU_DEVICE        "CPUS" /* CPUs device name */
> > > > +#define CPU_LOCK          "CPLK" /* CPU lock object */
> > > > +/* ACPI method(_STA, _EJ0, etc.) handlers */
> > > > +#define CPU_STS_METHOD    "CSTA" /* CPU status method (_STA.Enabled) */
> > > > +#define CPU_SCAN_METHOD   "CSCN" /* CPU scan method for enumeration */
> > > > +#define CPU_NOTIFY_METHOD "CTFY" /* Notify method for CPU events */
> > > > +#define CPU_EJECT_METHOD  "CEJ0" /* CPU eject method (_EJ0) */
> > > > +#define CPU_OST_METHOD    "COST" /* OSPM status reporting (_OST) */
> > > > +/* CPU MMIO region fields (in PRST region) */
> > > > +#define CPU_SELECTOR      "CSEL" /* CPU selector index (WO) */
> > > > +#define CPU_ENABLED_F     "CPEN" /* Flag: CPU enabled status(_STA) (RO) */
> > > +#define CPU_DEVCHK_F      "CDCK" /* Flag: Device-check event (RW) */
> > > +#define CPU_EJECTRQ_F     "CEJR" /* Flag: Eject-request event (RW)*/
> > > +#define CPU_EJECT_F       "CEJ0" /* Flag: Ejection trigger (WO) */
> > > +#define CPU_COMMAND       "CCMD" /* Command register (RW) */
> > > +#define CPU_DATA          "CDAT" /* Data register (RW) */
> > > +
> > > + /*
> > > + * CPU OSPM Interface MMIO Layout (Total: 16 bytes)
> > > + *
> > > + *
> > > + +--------+--------+--------+--------+--------+--------+--------+----
> > > + ----+
> > > + * |  0x00  |  0x01  |  0x02  |  0x03  |  0x04  |  0x05  |  0x06  |
> > > + 0x07  |
> > > + * +--------+--------+--------+--------+--------+--------+--------+--------+
> > > + * |       Selector (DWord, write-only)         | Flags  |Command |Reserved|
> > > + * |                                            | (RO/RW)|  (WO)  |(2B pad)|
> > > + * |        4 bytes (32 bits)                   | 1B     |   1B   | 2B     |
> > > + *
> > > + +-------------------------------------------------------------------
> > > + ----+
> > > + * |  0x08  |  0x09  |  0x0A  |  0x0B  |  0x0C  |  0x0D  |  0x0E  |
> > > + 0x0F  |
> > > + * +--------+--------+--------+--------+--------+--------+--------+--------+
> > > + * |                        Data (QWord, read/write)                       |
> > > + * |               Used by CPU scan and _OST methods (64 bits)             |
> > > + *
> > > + +-------------------------------------------------------------------
> > > + ----+
> > > + *
> > > + * Field Overview:
> > > + *
> > > + * - Selector: 4 bytes @0x00 (DWord, WO)
> > > + *               - Selects target CPU index for the current operation.
> > > + * - Flags:    1 byte  @0x04 (RO/RW)
> > > + *               - Bit 0: ENABLED  – CPU is powered on (RO)
> > > + *               - Bit 1: DEVCHK   – Device-check completed (RW)
> > > + *               - Bit 2: EJECTRQ  – Guest requests CPU eject (RW)
> > > + *               - Bit 3: EJECT    – Trigger CPU ejection (WO)
> > > + *               - Bits 4–7: Reserved (write 0)
> > > + * - Command:  1 byte  @0x05 (WO)
> > > + *               - Specifies control operation (e.g., scan, _OST, eject).
> > > + * - Reserved: 2 bytes @0x06–0x07
> > > + *               - Alignment padding; must be zero on write.
> > > + * - Data:     8 bytes @0x08 (QWord, RW)
> > > + *               - Input/output for command-specific data.
> > > + *               - Used by CPU scan or _OST.
> > > + */
> > > +
> > > +/*
> > > > + * Macros defining the CPU MMIO region layout. Change field sizes here to
> > > > + * alter the overall MMIO region size.
> > > + */
> > > +/* Sub-Field sizes (in bytes) */
> > > > +#define ACPI_CPU_MR_SELECTOR_SIZE  4 /* Write-only (DWord access) */
> > > > +#define ACPI_CPU_MR_FLAGS_SIZE     1 /* Read-write (Byte access) */
> > > > +#define ACPI_CPU_MR_RES_FLAGS_SIZE 0 /* Reserved padding */
> > > > +#define ACPI_CPU_MR_CMD_SIZE       1 /* Write-only (Byte access) */
> > > > +#define ACPI_CPU_MR_RES_CMD_SIZE   2 /* Reserved padding */
> > > > +#define ACPI_CPU_MR_CMD_DATA_SIZE  8 /* Read-write (QWord access) */
> > > > +
> > > > +#define ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE \
> > > > +    MAX_CONST(ACPI_CPU_MR_CMD_DATA_SIZE, \
> > > > +    MAX_CONST(ACPI_CPU_MR_SELECTOR_SIZE, \
> > > > +    MAX_CONST(ACPI_CPU_MR_CMD_SIZE, ACPI_CPU_MR_FLAGS_SIZE)))
> > > +
> > > +/* Validate layout against exported total length */
> > > +_Static_assert(ACPI_CPU_OSPM_IF_REG_LEN ==
> > > +               (ACPI_CPU_MR_SELECTOR_SIZE +
> > > +                ACPI_CPU_MR_FLAGS_SIZE +
> > > +                ACPI_CPU_MR_RES_FLAGS_SIZE +
> > > +                ACPI_CPU_MR_CMD_SIZE +
> > > +                ACPI_CPU_MR_RES_CMD_SIZE +
> > > > +                ACPI_CPU_MR_CMD_DATA_SIZE),
> > > > +               "ACPI_CPU_OSPM_IF_REG_LEN mismatch with internal MMIO layout");
> > > +
> > > +/* Sub-Field sizes (in bits) */
> > > > +#define ACPI_CPU_MR_SELECTOR_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_SELECTOR_SIZE * BITS_PER_BYTE)  /* Write-only (DWord Acc) */
> > > > +#define ACPI_CPU_MR_FLAGS_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_FLAGS_SIZE * BITS_PER_BYTE)     /* Read-write (Byte Acc) */
> > > > +#define ACPI_CPU_MR_RES_FLAGS_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_RES_FLAGS_SIZE * BITS_PER_BYTE) /* Reserved padding */
> > > > +#define ACPI_CPU_MR_CMD_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_CMD_SIZE * BITS_PER_BYTE)       /* Write-only (Byte Acc) */
> > > > +#define ACPI_CPU_MR_RES_CMD_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_RES_CMD_SIZE * BITS_PER_BYTE)   /* Reserved padding */
> > > > +#define ACPI_CPU_MR_CMD_DATA_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_CMD_DATA_SIZE * BITS_PER_BYTE)  /* Read-write (QWord Acc) */
> > > +
> > > +/* Field offsets (in bytes) */
> > > > +#define ACPI_CPU_MR_SELECTOR_OFFSET_WO  0
> > > > +#define ACPI_CPU_MR_FLAGS_OFFSET_RW \
> > > +    (ACPI_CPU_MR_SELECTOR_OFFSET_WO + \
> > > +     ACPI_CPU_MR_SELECTOR_SIZE)
> > > +#define ACPI_CPU_MR_CMD_OFFSET_WO \
> > > +    (ACPI_CPU_MR_FLAGS_OFFSET_RW + \
> > > +     ACPI_CPU_MR_FLAGS_SIZE + \
> > > +     ACPI_CPU_MR_RES_FLAGS_SIZE)
> > > +#define ACPI_CPU_MR_CMD_DATA_OFFSET_RW \
> > > +    (ACPI_CPU_MR_CMD_OFFSET_WO + \
> > > +     ACPI_CPU_MR_CMD_SIZE + \
> > > +     ACPI_CPU_MR_RES_CMD_SIZE)
> > > +
> > > +/* ensure all offsets are at their natural size alignment boundaries */
> > > > +#define STATIC_ASSERT_FIELD_ALIGNMENT(offset, type, field_name)               \
> > > > +    _Static_assert((offset) % sizeof(type) == 0,                              \
> > > > +                   field_name " is not aligned to its natural boundary")
> > > > +
> > > > +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_SELECTOR_OFFSET_WO,
> > > > +                              uint32_t, "Selector");
> > > > +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_FLAGS_OFFSET_RW,
> > > > +                              uint8_t, "Flags");
> > > > +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_CMD_OFFSET_WO,
> > > > +                              uint8_t, "Command");
> > > > +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_CMD_DATA_OFFSET_RW,
> > > > +                              uint64_t, "Command Data");
> > > > +
> > > > +/* Flag bit positions (used within 'flags' subfield) */
> > > > +#define ACPI_CPU_FLAGS_USED_BITS 4
> > > > +#define ACPI_CPU_MR_FLAGS_BIT_ENABLED BIT(0)
> > > > +#define ACPI_CPU_MR_FLAGS_BIT_DEVCHK  BIT(1)
> > > > +#define ACPI_CPU_MR_FLAGS_BIT_EJECTRQ BIT(2)
> > > > +#define ACPI_CPU_MR_FLAGS_BIT_EJECT   BIT(ACPI_CPU_FLAGS_USED_BITS - 1)
> > > > +
> > > > +#define ACPI_CPU_MR_RES_FLAG_BITS (BITS_PER_BYTE - ACPI_CPU_FLAGS_USED_BITS)
> > > +
> > > +enum {
> > > +    ACPI_GET_NEXT_CPU_WITH_EVENT_CMD = 0,
> > > +    ACPI_OST_EVENT_CMD = 1,
> > > +    ACPI_OST_STATUS_CMD = 2,
> > > +    ACPI_CMD_MAX
> > > +};
> > > +
> > > +#define AML_APPEND_MR_RESVD_FIELD(mr_field, size_bits)       \
> > > +    do {                                                        \
> > > +        if ((size_bits) != 0) {                                 \
> > > +            aml_append((mr_field), aml_reserved_field(size_bits)); \
> > > +        }                                                       \
> > > +    } while (0)
> > > +
> > > +#define AML_APPEND_MR_NAMED_FIELD(mr_field, name, size_bits)    \
> > > +    do {                                                        \
> > > +        if ((size_bits) != 0) {                                 \
> > > +            aml_append((mr_field), aml_named_field((name), (size_bits))); \
> > > +        }                                                       \
> > > +    } while (0)
> > > +
> > > +#define AML_CPU_RES_DEV(base, field) \
> > > +        aml_name("%s.%s.%s", (base), CPU_RES_DEVICE, (field))
> > > +
> > > +static ACPIOSTInfo *
> > > > +acpi_cpu_ospm_ost_status(int idx, AcpiCpuOspmStateStatus *cdev)
> > > > +{
> > > +    ACPIOSTInfo *info = g_new0(ACPIOSTInfo, 1);
> > > +
> > > +    info->source = cdev->ost_event;
> > > +    info->status = cdev->ost_status;
> > > +    if (cdev->cpu) {
> > > +        DeviceState *dev = DEVICE(cdev->cpu);
> > > +        if (dev->id) {
> > > +            info->device = g_strdup(dev->id);
> > > +        }
> > > +    }
> > > +    return info;
> > > +}
> > > +
> > > > +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st, ACPIOSTInfoList ***list)
> > > > +{
> > > > +    ACPIOSTInfoList ***tail = list;
> > > > +    int i;
> > > > +
> > > > +    for (i = 0; i < cpu_st->dev_count; i++) {
> > > > +        QAPI_LIST_APPEND(*tail, acpi_cpu_ospm_ost_status(i, &cpu_st->devs[i]));
> > > > +    }
> > > > +}
> > > +
> > > +static uint64_t
> > > > +acpi_cpu_ospm_intf_mr_read(void *opaque, hwaddr addr, unsigned size)
> > > > +{
> > > +    AcpiCpuOspmState *cpu_st = opaque;
> > > +    AcpiCpuOspmStateStatus *cdev;
> > > +    uint64_t val = 0;
> > > +
> > > +    if (cpu_st->selector >= cpu_st->dev_count) {
> > > +        return val;
> > > +    }
> > > +    cdev = &cpu_st->devs[cpu_st->selector];
> > > +    switch (addr) {
> > > +    case ACPI_CPU_MR_FLAGS_OFFSET_RW:
> > > +        val |= qdev_check_enabled(DEVICE(cdev->cpu)) ?
> > > +                                  ACPI_CPU_MR_FLAGS_BIT_ENABLED : 0;
> > > > +        val |= cdev->devchk_pending ? ACPI_CPU_MR_FLAGS_BIT_DEVCHK : 0;
> > > > +        val |= cdev->ejrqst_pending ? ACPI_CPU_MR_FLAGS_BIT_EJECTRQ : 0;
> > > +        trace_acpi_cpuos_if_read_flags(cpu_st->selector, val);
> > > +        break;
> > > +    case ACPI_CPU_MR_CMD_DATA_OFFSET_RW:
> > > +        switch (cpu_st->command) {
> > > +        case ACPI_GET_NEXT_CPU_WITH_EVENT_CMD:
> > > +           val = cpu_st->selector;
> > > +           break;
> > > +        default:
> > > +           trace_acpi_cpuos_if_read_invalid_cmd_data(cpu_st->selector,
> > > +                                                     cpu_st->command);
> > > +           break;
> > > +        }
> > > +        trace_acpi_cpuos_if_read_cmd_data(cpu_st->selector, val);
> > > +        break;
> > > +    default:
> > > +        break;
> > > +    }
> > > +    return val;
> > > +}
> > > +
> > > +static void
> > > > +acpi_cpu_ospm_intf_mr_write(void *opaque, hwaddr addr, uint64_t data,
> > > > +                            unsigned int size)
> > > > +{
> > > +    AcpiCpuOspmState *cpu_st = opaque;
> > > +    AcpiCpuOspmStateStatus *cdev;
> > > +    ACPIOSTInfo *info;
> > > +
> > > +    assert(cpu_st->dev_count);
> > > +    if (addr) {
> > > +        if (cpu_st->selector >= cpu_st->dev_count) {
> > > +            trace_acpi_cpuos_if_invalid_idx_selected(cpu_st->selector);
> > > +            return;
> > > +        }
> > > +    }
> > > +
> > > +    switch (addr) {
> > > > +    case ACPI_CPU_MR_SELECTOR_OFFSET_WO: /* current CPU selector */
> > > +        cpu_st->selector = data;
> > > +        trace_acpi_cpuos_if_write_idx(cpu_st->selector);
> > > +        break;
> > > +    case ACPI_CPU_MR_FLAGS_OFFSET_RW: /* set is_* fields  */
> > > +        cdev = &cpu_st->devs[cpu_st->selector];
> > > +        if (data & ACPI_CPU_MR_FLAGS_BIT_DEVCHK) {
> > > +            /* clear device-check pending event */
> > > +            cdev->devchk_pending = false;
> > > +            trace_acpi_cpuos_if_clear_devchk_evt(cpu_st->selector);
> > > +        } else if (data & ACPI_CPU_MR_FLAGS_BIT_EJECTRQ) {
> > > +            /* clear eject-request pending event */
> > > +            cdev->ejrqst_pending = false;
> > > +            trace_acpi_cpuos_if_clear_ejrqst_evt(cpu_st->selector);
> > > +        } else if (data & ACPI_CPU_MR_FLAGS_BIT_EJECT) {
> > > +            DeviceState *dev = NULL;
> > > +            if (!cdev->cpu || cdev->cpu == first_cpu) {
> > > +                trace_acpi_cpuos_if_ejecting_invalid_cpu(cpu_st->selector);
> > > +                break;
> > > +            }
> > > +            /*
> > > +             * OSPM has returned with eject. Hence, it is now safe to put the
> > > +             * cpu device on powered-off state.
> > > +             */
> > > +            trace_acpi_cpuos_if_ejecting_cpu(cpu_st->selector);
> > > +            dev = DEVICE(cdev->cpu);
> > > +            qdev_sync_disable(dev, &error_fatal);
> > > +        }
> > > +        break;
> > > +    case ACPI_CPU_MR_CMD_OFFSET_WO:
> > > +        trace_acpi_cpuos_if_write_cmd(cpu_st->selector, data);
> > > +        if (data < ACPI_CMD_MAX) {
> > > +            cpu_st->command = data;
> > > > +            if (cpu_st->command == ACPI_GET_NEXT_CPU_WITH_EVENT_CMD) {
> > > +                uint32_t iter = cpu_st->selector;
> > > +
> > > +                do {
> > > +                    cdev = &cpu_st->devs[iter];
> > > +                    if (cdev->devchk_pending || cdev->ejrqst_pending) {
> > > +                        cpu_st->selector = iter;
> > > +                        trace_acpi_cpuos_if_cpu_has_events(cpu_st->selector,
> > > +                            cdev->devchk_pending, cdev->ejrqst_pending);
> > > +                        break;
> > > +                    }
> > > +                    iter = iter + 1 < cpu_st->dev_count ? iter + 1 : 0;
> > > +                } while (iter != cpu_st->selector);
> > > +            }
> > > +        }
> > > +        break;
> > > +    case ACPI_CPU_MR_CMD_DATA_OFFSET_RW:
> > > +        switch (cpu_st->command) {
> > > +        case ACPI_OST_EVENT_CMD: {
> > > +           cdev = &cpu_st->devs[cpu_st->selector];
> > > +           cdev->ost_event = data;
> > > > +           trace_acpi_cpuos_if_write_ost_ev(cpu_st->selector, cdev->ost_event);
> > > +           break;
> > > +        }
> > > +        case ACPI_OST_STATUS_CMD: {
> > > +           cdev = &cpu_st->devs[cpu_st->selector];
> > > +           cdev->ost_status = data;
> > > +           info = acpi_cpu_ospm_ost_status(cpu_st->selector, cdev);
> > > +           qapi_event_send_acpi_device_ost(info);
> > > +           qapi_free_ACPIOSTInfo(info);
> > > +           trace_acpi_cpuos_if_write_ost_status(cpu_st->selector,
> > > +                                                cdev->ost_status);
> > > +           break;
> > > +        }
> > > +        default:
> > > +           trace_acpi_cpuos_if_write_invalid_cmd(cpu_st->selector,
> > > +                                                 cpu_st->command);
> > > +           break;
> > > +        }
> > > +        break;
> > > +    default:
> > > +        trace_acpi_cpuos_if_write_invalid_offset(cpu_st->selector, addr);
> > > +        break;
> > > +    }
> > > +}
> > > +
> > > +static const MemoryRegionOps cpu_common_mr_ops = {
> > > +    .read = acpi_cpu_ospm_intf_mr_read,
> > > +    .write = acpi_cpu_ospm_intf_mr_write,
> > > +    .endianness = DEVICE_LITTLE_ENDIAN,
> > > +    .valid = {
> > > +        .min_access_size = 1,
> > > +        .max_access_size = ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
> > > +    },
> > > +    .impl = {
> > > +        .min_access_size = 1,
> > > +        .max_access_size = ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
> > > +        .unaligned = false,
> > > +    },
> > > +};
> > > +
> > > > +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> > > > +                                        AcpiCpuOspmState *state,
> > > > +                                        hwaddr base_addr)
> > > > +{
> > > +    MachineState *machine = MACHINE(qdev_get_machine());
> > > +    MachineClass *mc = MACHINE_GET_CLASS(machine);
> > > +    const CPUArchIdList *id_list;
> > > +    int i;
> > > +
> > > +    assert(mc->possible_cpu_arch_ids);
> > > +    id_list = mc->possible_cpu_arch_ids(machine);
> > > +    state->dev_count = id_list->len;
> > > +    state->devs = g_new0(typeof(*state->devs), state->dev_count);
> > > +    for (i = 0; i < id_list->len; i++) {
> > > +        state->devs[i].cpu =  CPU(id_list->cpus[i].cpu);
> > > +        state->devs[i].arch_id = id_list->cpus[i].arch_id;
> > > +    }
> > > > +    memory_region_init_io(&state->ctrl_reg, owner, &cpu_common_mr_ops, state,
> > > > +                          "ACPI CPU OSPM State Interface Memory Region",
> > > > +                          ACPI_CPU_OSPM_IF_REG_LEN);
> > > > +    memory_region_add_subregion(as, base_addr, &state->ctrl_reg);
> > > > +}
> > > +
> > > +static AcpiCpuOspmStateStatus *
> > > > +acpi_get_cpu_status(AcpiCpuOspmState *cpu_st, DeviceState *dev)
> > > > +{
> > > +    CPUClass *k = CPU_GET_CLASS(dev);
> > > +    uint64_t cpu_arch_id = k->get_arch_id(CPU(dev));
> > > +    int i;
> > > +
> > > +    for (i = 0; i < cpu_st->dev_count; i++) {
> > > +        if (cpu_arch_id == cpu_st->devs[i].arch_id) {
> > > +            return &cpu_st->devs[i];
> > > +        }
> > > +    }
> > > +    return NULL;
> > > +}
> > > +
> > > > +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                              uint32_t event_st, Error **errp)
> > > > +{
> > > +    AcpiCpuOspmStateStatus *cdev;
> > > +    cdev = acpi_get_cpu_status(cpu_st, dev);
> > > +    if (!cdev) {
> > > +        return;
> > > +    }
> > > +    assert(cdev->cpu);
> > > +
> > > +    /*
> > > > +     * Tell OSPM via GED IRQ(GSI) that a powered-off cpu is being powered-on.
> > > > +     * Also, mark 'device-check' event pending for this cpu. This will
> > > > +     * eventually result in OSPM evaluating the ACPI _EVT method and scan of
> > > > +     * cpus
> > > > +     */
> > > > +    cdev->devchk_pending = true;
> > > > +    acpi_send_event(cpu_st->acpi_dev, event_st);
> > > > +}
> > > +
> > > > +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                              uint32_t event_st, Error **errp)
> > > > +{
> > > +    AcpiCpuOspmStateStatus *cdev;
> > > +    cdev = acpi_get_cpu_status(cpu_st, dev);
> > > +    if (!cdev) {
> > > +        return;
> > > +    }
> > > +    assert(cdev->cpu);
> > > +
> > > +    /*
> > > > +     * Tell OSPM via GED IRQ(GSI) that a cpu wants to power-off or go on standby
> > > > +     * Also,mark 'eject-request' event pending for this cpu. (graceful shutdown)
> > > > +     */
> > > > +    cdev->ejrqst_pending = true;
> > > > +    acpi_send_event(cpu_st->acpi_dev, event_st);
> > > > +}
> > > +
> > > +void
> > > > +acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev, Error **errp)
> > > > +{
> > > > +    /* TODO: possible handling here */
> > > > +}
> > > +
> > > +static const VMStateDescription vmstate_cpu_ospm_state_sts = {
> > > +    .name = "CPU OSPM state status",
> > > +    .version_id = 1,
> > > +    .minimum_version_id = 1,
> > > +    .fields = (const VMStateField[]) {
> > > +        VMSTATE_BOOL(devchk_pending, AcpiCpuOspmStateStatus),
> > > +        VMSTATE_BOOL(ejrqst_pending, AcpiCpuOspmStateStatus),
> > > +        VMSTATE_UINT32(ost_event, AcpiCpuOspmStateStatus),
> > > +        VMSTATE_UINT32(ost_status, AcpiCpuOspmStateStatus),
> > > +        VMSTATE_END_OF_LIST()
> > > +    }
> > > +};
> > > +
> > > +const VMStateDescription vmstate_cpu_ospm_state = {
> > > +    .name = "CPU OSPM state",
> > > +    .version_id = 1,
> > > +    .minimum_version_id = 1,
> > > +    .fields = (const VMStateField[]) {
> > > +        VMSTATE_UINT32(selector, AcpiCpuOspmState),
> > > +        VMSTATE_UINT8(command, AcpiCpuOspmState),
> > > > +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(devs, AcpiCpuOspmState,
> > > +                                             dev_count,
> > > +                                             vmstate_cpu_ospm_state_sts,
> > > +                                             AcpiCpuOspmStateStatus),
> > > +        VMSTATE_END_OF_LIST()
> > > +    }
> > > +};
> > > +
> > > > +void acpi_build_cpus_aml(Aml *table, hwaddr base_addr, const char *root,
> > > > +                         const char *event_handler_method)
> > > > +{
> > > +    MachineState *machine = MACHINE(qdev_get_machine());
> > > +    MachineClass *mc = MACHINE_GET_CLASS(machine);
> > > +    const CPUArchIdList *arch_ids = mc->possible_cpu_arch_ids(machine);
> > > +    Aml *sb_scope = aml_scope("_SB"); /* System Bus Scope */
> > > +    Aml *ifctx, *field, *method, *cpu_res_dev, *cpus_dev;
> > > +    Aml *zero = aml_int(0);
> > > +    Aml *one = aml_int(1);
> > > +
> > > +    cpu_res_dev = aml_device("%s.%s", root, CPU_RES_DEVICE);
> > > +    {
> > > +        Aml *crs;
> > > +
> > > +        aml_append(cpu_res_dev,
> > > +            aml_name_decl("_HID", aml_eisaid("PNP0A06")));
> > > +        aml_append(cpu_res_dev,
> > > +            aml_name_decl("_UID", aml_string("CPU OSPM Interface resources")));
> > > +        aml_append(cpu_res_dev, aml_mutex(CPU_LOCK, 0));
> > > +
> > > +        crs = aml_resource_template();
> > > +        aml_append(crs, aml_memory32_fixed(base_addr, ACPI_CPU_OSPM_IF_REG_LEN,
> > > +                   AML_READ_WRITE));
> > > +
> > > +        aml_append(cpu_res_dev, aml_name_decl("_CRS", crs));
> > > +
> > > +        /* declare CPU OSPM Interface MMIO region related access fields */
> > > +        aml_append(cpu_res_dev,
> > > +                   aml_operation_region("PRST", AML_SYSTEM_MEMORY,
> > > +                                        aml_int(base_addr),
> > > +                                        ACPI_CPU_OSPM_IF_REG_LEN));
> > > +
> > > +        /*
> > > +         * define named fields within PRST region with 'Byte' access widths
> > > +         * and reserve fields with other access width
> > > +         */
> > > +        field = aml_field("PRST", AML_BYTE_ACC, AML_NOLOCK, AML_PRESERVE);
> > > +        /* reserve CPU 'selector' field (size in bits) */
> > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_SELECTOR_SIZE_BITS);
> > > +        /* Flag::Enabled Bit(RO) - Read '1' if enabled */
> > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_ENABLED_F, 1);
> > > +        /* Flag::Devchk Bit(RW) - Read '1', has an event. Write '1' to clear */
> > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_DEVCHK_F, 1);
> > > +        /* Flag::Ejectrq Bit(RW) - Read '1', has an event. Write '1' to clear */
> > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_EJECTRQ_F, 1);
> > > +        /* Flag::Eject Bit(WO) - OSPM evals _EJx, initiates CPU Eject in QEMU */
> > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_EJECT_F, 1);
> > > +        /* Flag::Bit(ACPI_CPU_FLAGS_USED_BITS)-Bit(7) - Reserve left over bits */
> > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_FLAG_BITS);
> > > +        /* Reserved space: padding after flags */
> > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_FLAGS_SIZE_BITS);
> > > +        /* Command field written by OSPM */
> > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_COMMAND,
> > > +                                  ACPI_CPU_MR_CMD_SIZE_BITS);
> > > +        /* Reserved space: padding after command field */
> > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_CMD_SIZE_BITS);
> > > +        /* Command data: 64-bit payload associated with command */
> > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_CMD_DATA_SIZE_BITS);
> > > +        aml_append(cpu_res_dev, field);
> > > +
> > > +        /*
> > > +         * define named fields with 'Dword' access widths and reserve fields
> > > +         * with other access width
> > > +         */
> > > +        field = aml_field("PRST", AML_DWORD_ACC, AML_NOLOCK, AML_PRESERVE);
> > > +        /* CPU selector, write only */
> > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_SELECTOR,
> > > +                                  ACPI_CPU_MR_SELECTOR_SIZE_BITS);
> > > +        aml_append(cpu_res_dev, field);
> > > +
> > > +        /*
> > > +         * define named fields with 'Qword' access widths and reserve fields
> > > +         * with other access width
> > > +         */
> > > +        field = aml_field("PRST", AML_QWORD_ACC, AML_NOLOCK, AML_PRESERVE);
> > > +        /*
> > > +         * Reserve space: selector, flags, reserved flags, command, reserved
> > > +         * command for Qword alignment.
> > > +         */
> > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_SELECTOR_SIZE_BITS +
> > > +                                            ACPI_CPU_MR_FLAGS_SIZE_BITS +
> > > +                                            ACPI_CPU_MR_RES_FLAGS_SIZE_BITS +
> > > +                                            ACPI_CPU_MR_CMD_SIZE_BITS +
> > > +                                            ACPI_CPU_MR_RES_CMD_SIZE_BITS);
> > > +        /* Command data accessible via Qword */
> > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_DATA,
> > > +                                  ACPI_CPU_MR_CMD_DATA_SIZE_BITS);
> > > +        aml_append(cpu_res_dev, field);
> > > +    }
> > > +    aml_append(sb_scope, cpu_res_dev);
> > > +
> > > +    cpus_dev = aml_device("%s.%s", root, CPU_DEVICE);
> > > +    {
> > > +        Aml *ctrl_lock = AML_CPU_RES_DEV(root, CPU_LOCK);
> > > +        Aml *cpu_selector = AML_CPU_RES_DEV(root, CPU_SELECTOR);
> > > +        Aml *is_enabled = AML_CPU_RES_DEV(root, CPU_ENABLED_F);
> > > +        Aml *dvchk_evt = AML_CPU_RES_DEV(root, CPU_DEVCHK_F);
> > > +        Aml *ejrq_evt = AML_CPU_RES_DEV(root, CPU_EJECTRQ_F);
> > > +        Aml *ej_evt = AML_CPU_RES_DEV(root, CPU_EJECT_F);
> > > +        Aml *cpu_cmd = AML_CPU_RES_DEV(root, CPU_COMMAND);
> > > +        Aml *cpu_data = AML_CPU_RES_DEV(root, CPU_DATA);
> > > +        int i;
> > > +
> > > +        aml_append(cpus_dev, aml_name_decl("_HID", aml_string("ACPI0010")));
> > > +        aml_append(cpus_dev, aml_name_decl("_CID", aml_eisaid("PNP0A05")));
> > > +
> > > +        method = aml_method(CPU_NOTIFY_METHOD, 2, AML_NOTSERIALIZED);
> > > +        for (i = 0; i < arch_ids->len; i++) {
> > > +            Aml *cpu = aml_name(CPU_NAME_FMT, i);
> > > +            Aml *uid = aml_arg(0);
> > > +            Aml *event = aml_arg(1);
> > > +
> > > +            ifctx = aml_if(aml_equal(uid, aml_int(i)));
> > > +            {
> > > +                aml_append(ifctx, aml_notify(cpu, event));
> > > +            }
> > > +            aml_append(method, ifctx);
> > > +        }
> > > +        aml_append(cpus_dev, method);
> > > +
> > > +        method = aml_method(CPU_STS_METHOD, 1, AML_SERIALIZED);
> > > +        {
> > > +            Aml *idx = aml_arg(0);
> > > +            Aml *sta = aml_local(0);
> > > +            Aml *else_ctx;
> > > +
> > > +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> > > +            aml_append(method, aml_store(idx, cpu_selector));
> > > +            aml_append(method, aml_store(zero, sta));
> > > +            ifctx = aml_if(aml_equal(is_enabled, one));
> > > +            {
> > > +                /* cpu is present and enabled */
> > > +                aml_append(ifctx, aml_store(aml_int(0xF), sta));
> > > +            }
> > > +            aml_append(method, ifctx);
> > > +            else_ctx = aml_else();
> > > +            {
> > > +                /* cpu is present but disabled */
> > > +                aml_append(else_ctx, aml_store(aml_int(0xD), sta));
> > > +            }
> > > +            aml_append(method, else_ctx);
> > > +            aml_append(method, aml_release(ctrl_lock));
> > > +            aml_append(method, aml_return(sta));
> > > +        }
> > > +        aml_append(cpus_dev, method);
> > > +
> > > +        method = aml_method(CPU_EJECT_METHOD, 1, AML_SERIALIZED);
> > > +        {
> > > +            Aml *idx = aml_arg(0);
> > > +
> > > +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> > > +            aml_append(method, aml_store(idx, cpu_selector));
> > > +            aml_append(method, aml_store(one, ej_evt));
> > > +            aml_append(method, aml_release(ctrl_lock));
> > > +        }
> > > +        aml_append(cpus_dev, method);
> > > +
> > > +        method = aml_method(CPU_SCAN_METHOD, 0, AML_SERIALIZED);
> > > +        {
> > > +            Aml *has_event = aml_local(0); /* Local0: Loop control flag */
> > > +            Aml *uid = aml_local(1); /* Local1: Current CPU UID */
> > > +            /* Constants */
> > > +            Aml *dev_chk = aml_int(1); /* Notify: device check to enable */
> > > +            Aml *eject_req = aml_int(3); /* Notify: eject for removal */
> > > +            Aml *next_cpu_cmd = aml_int(ACPI_GET_NEXT_CPU_WITH_EVENT_CMD);
> > > +
> > > +            /* Acquire CPU lock */
> > > +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> > > +
> > > +            /* Initialize loop */
> > > +            aml_append(method, aml_store(zero, uid));
> > > +            aml_append(method, aml_store(one, has_event));
> > > +
> > > +            Aml *while_ctx = aml_while(aml_land(
> > > +                aml_equal(has_event, one),
> > > +                aml_lless(uid, aml_int(arch_ids->len))
> > > +            ));
> > > +            {
> > > +                aml_append(while_ctx, aml_store(zero, has_event));
> > > +                /*
> > > +                 * Issue scan cmd: QEMU will return next CPU with event in
> > > +                 * cpu_data
> > > +                 */
> > > +                aml_append(while_ctx, aml_store(uid, cpu_selector));
> > > +                aml_append(while_ctx, aml_store(next_cpu_cmd, cpu_cmd));
> > > +
> > > +                /* If scan wrapped around to an earlier UID, exit loop */
> > > +                Aml *wrap_check = aml_if(aml_lless(cpu_data, uid));
> > > +                aml_append(wrap_check, aml_break());
> > > +                aml_append(while_ctx, wrap_check);
> > > +
> > > +                /* Set UID to scanned result */
> > > +                aml_append(while_ctx, aml_store(cpu_data, uid));
> > > +
> > > +                /* send CPU device-check(resume) event to OSPM */
> > > +                Aml *if_devchk = aml_if(aml_equal(dvchk_evt, one));
> > > +                {
> > > +                    aml_append(if_devchk,
> > > +                        aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
> > > +                    /* clear local device-check event sent flag */
> > > +                    aml_append(if_devchk, aml_store(one, dvchk_evt));
> > > +                    aml_append(if_devchk, aml_store(one, has_event));
> > > +                }
> > > +                aml_append(while_ctx, if_devchk);
> > > +
> > > +                /*
> > > +                 * send CPU eject-request event to OSPM to gracefully handle
> > > +                 * OSPM related tasks running on this CPU
> > > +                 */
> > > +                Aml *else_ctx = aml_else();
> > > +                Aml *if_ejrq = aml_if(aml_equal(ejrq_evt, one));
> > > +                {
> > > +                    aml_append(if_ejrq,
> > > +                        aml_call2(CPU_NOTIFY_METHOD, uid, eject_req));
> > > +                    /* clear local eject-request event sent flag */
> > > +                    aml_append(if_ejrq, aml_store(one, ejrq_evt));
> > > +                    aml_append(if_ejrq, aml_store(one, has_event));
> > > +                }
> > > +                aml_append(else_ctx, if_ejrq);
> > > +                aml_append(while_ctx, else_ctx);
> > > +
> > > +                /* Increment UID */
> > > +                aml_append(while_ctx, aml_increment(uid));
> > > +            }
> > > +            aml_append(method, while_ctx);
> > > +
> > > +            /* Release cpu lock */
> > > +            aml_append(method, aml_release(ctrl_lock));
> > > +        }
> > > +        aml_append(cpus_dev, method);
> > > +
> > > +        method = aml_method(CPU_OST_METHOD, 4, AML_SERIALIZED);
> > > +        {
> > > +            Aml *uid = aml_arg(0);
> > > +            Aml *ev_cmd = aml_int(ACPI_OST_EVENT_CMD);
> > > +            Aml *st_cmd = aml_int(ACPI_OST_STATUS_CMD);
> > > +
> > > +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> > > +            aml_append(method, aml_store(uid, cpu_selector));
> > > +            aml_append(method, aml_store(ev_cmd, cpu_cmd));
> > > +            aml_append(method, aml_store(aml_arg(1), cpu_data));
> > > +            aml_append(method, aml_store(st_cmd, cpu_cmd));
> > > +            aml_append(method, aml_store(aml_arg(2), cpu_data));
> > > +            aml_append(method, aml_release(ctrl_lock));
> > > +        }
> > > +        aml_append(cpus_dev, method);
> > > +
> > > +        /* build Processor object for each processor */
> > > +        for (i = 0; i < arch_ids->len; i++) {
> > > +            Aml *dev;
> > > +            Aml *uid = aml_int(i);
> > > +
> > > +            dev = aml_device(CPU_NAME_FMT, i);
> > > +            aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0007")));
> > > +            aml_append(dev, aml_name_decl("_UID", uid));
> > > +
> > > +            method = aml_method("_STA", 0, AML_SERIALIZED);
> > > +            aml_append(method, aml_return(aml_call1(CPU_STS_METHOD, uid)));
> > > +            aml_append(dev, method);
> > > +
> > > +            if (CPU(arch_ids->cpus[i].cpu) != first_cpu) {
> > > +                method = aml_method("_EJ0", 1, AML_NOTSERIALIZED);
> > > +                aml_append(method, aml_call1(CPU_EJECT_METHOD, uid));
> > > +                aml_append(dev, method);
> > > +            }
> > > +
> > > +            method = aml_method("_OST", 3, AML_SERIALIZED);
> > > +            aml_append(method,
> > > +                aml_call4(CPU_OST_METHOD, uid, aml_arg(0),
> > > +                          aml_arg(1), aml_arg(2))
> > > +            );
> > > +            aml_append(dev, method);
> > > +            aml_append(cpus_dev, dev);
> > > +        }
> > > +    }
> > > +    aml_append(sb_scope, cpus_dev);
> > > +    aml_append(table, sb_scope);
> > > +
> > > +    method = aml_method(event_handler_method, 0, AML_NOTSERIALIZED);
> > > +    aml_append(method, aml_call0("\\_SB.CPUS." CPU_SCAN_METHOD));
> > > +    aml_append(table, method);
> > > +}
> > > diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
> > > index 73f02b9691..6d83396ab4 100644
> > > --- a/hw/acpi/meson.build
> > > +++ b/hw/acpi/meson.build
> > > @@ -8,6 +8,8 @@ acpi_ss.add(files(
> > >  ))
> > >  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_true: files('cpu.c', 'cpu_hotplug.c'))
> > >  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_false: files('acpi-cpu-hotplug-stub.c'))
> > > +acpi_ss.add(when: 'CONFIG_ACPI_CPU_OSPM_INTERFACE', if_true: files('cpu_ospm_interface.c'))
> > > +acpi_ss.add(when: 'CONFIG_ACPI_CPU_OSPM_INTERFACE', if_false: files('acpi-cpu-ospm-interface-stub.c'))
> > >  acpi_ss.add(when: 'CONFIG_ACPI_MEMORY_HOTPLUG', if_true: files('memory_hotplug.c'))
> > >  acpi_ss.add(when: 'CONFIG_ACPI_MEMORY_HOTPLUG', if_false: files('acpi-mem-hotplug-stub.c'))
> > >  acpi_ss.add(when: 'CONFIG_ACPI_NVDIMM', if_true: files('nvdimm.c'))
> > > diff --git a/hw/acpi/trace-events b/hw/acpi/trace-events
> > > index edc93e703c..c0ecbdd48f 100644
> > > --- a/hw/acpi/trace-events
> > > +++ b/hw/acpi/trace-events
> > > @@ -40,6 +40,23 @@ cpuhp_acpi_fw_remove_cpu(uint32_t idx) "0x%"PRIx32
> > >  cpuhp_acpi_write_ost_ev(uint32_t slot, uint32_t ev) "idx[0x%"PRIx32"] OST EVENT: 0x%"PRIx32
> > >  cpuhp_acpi_write_ost_status(uint32_t slot, uint32_t st) "idx[0x%"PRIx32"] OST STATUS: 0x%"PRIx32
> > >
> > > +#cpu_ospm_interface.c
> > > +acpi_cpuos_if_invalid_idx_selected(uint32_t idx) "selector idx[0x%"PRIx32"]"
> > > +acpi_cpuos_if_read_flags(uint32_t idx, uint8_t flags) "cpu idx[0x%"PRIx32"] flags: 0x%"PRIx8
> > > +acpi_cpuos_if_write_idx(uint32_t idx) "set active cpu idx: 0x%"PRIx32
> > > +acpi_cpuos_if_write_cmd(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] cmd: 0x%"PRIx8
> > > +acpi_cpuos_if_write_invalid_cmd(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] invalid cmd: 0x%"PRIx8
> > > +acpi_cpuos_if_write_invalid_offset(uint32_t idx, uint64_t addr) "cpu idx[0x%"PRIx32"] invalid offset: 0x%"PRIx64
> > > +acpi_cpuos_if_read_cmd_data(uint32_t idx, uint32_t data) "cpu idx[0x%"PRIx32"] data: 0x%"PRIx32
> > > +acpi_cpuos_if_read_invalid_cmd_data(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] invalid cmd: 0x%"PRIx8
> > > +acpi_cpuos_if_cpu_has_events(uint32_t idx, bool devchk, bool ejrqst) "cpu idx[0x%"PRIx32"] device-check pending: %d, eject-request pending: %d"
> > > +acpi_cpuos_if_clear_devchk_evt(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> > > +acpi_cpuos_if_clear_ejrqst_evt(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> > > +acpi_cpuos_if_ejecting_invalid_cpu(uint32_t idx) "invalid cpu idx[0x%"PRIx32"]"
> > > +acpi_cpuos_if_ejecting_cpu(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> > > +acpi_cpuos_if_write_ost_ev(uint32_t idx, uint32_t ev) "cpu idx[0x%"PRIx32"] OST Event: 0x%"PRIx32
> > > +acpi_cpuos_if_write_ost_status(uint32_t idx, uint32_t st) "cpu idx[0x%"PRIx32"] OST Status: 0x%"PRIx32
> > > +
> > >  # pcihp.c
> > >  acpi_pci_eject_slot(unsigned bsel, unsigned slot) "bsel: %u slot: %u"
> > >  acpi_pci_unplug(int bsel, int slot) "bsel: %d slot: %d"
> > > diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
> > > index 2aa4b5d778..c9991e00c7 100644
> > > --- a/hw/arm/Kconfig
> > > +++ b/hw/arm/Kconfig
> > > @@ -39,6 +39,7 @@ config ARM_VIRT
> > >      select VIRTIO_MEM_SUPPORTED
> > >      select ACPI_CXL
> > >      select ACPI_HMAT
> > > +    select ACPI_CPU_OSPM_INTERFACE
> > >
> > >  config CUBIEBOARD
> > >      bool
> > > diff --git a/include/hw/acpi/cpu_ospm_interface.h b/include/hw/acpi/cpu_ospm_interface.h
> > > new file mode 100644
> > > index 0000000000..5dda327a34
> > > --- /dev/null
> > > +++ b/include/hw/acpi/cpu_ospm_interface.h
> > > @@ -0,0 +1,78 @@
> > > +/*
> > > + * ACPI CPU OSPM Interface Handling.
> > > + *
> > > + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> > > + *
> > > + * Author: Salil Mehta <salil.mehta@huawei.com>
> > > + *
> > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License as published by
> > > + * the Free Software Foundation; either version 2 of the License, or
> > > + * (at your option) any later version.
> > > + */
> > > +#ifndef CPU_OSPM_INTERFACE_H
> > > +#define CPU_OSPM_INTERFACE_H
> > > +
> > > +#include "qapi/qapi-types-acpi.h"
> > > +#include "hw/qdev-core.h"
> > > +#include "hw/acpi/acpi.h"
> > > +#include "hw/acpi/aml-build.h"
> > > +#include "hw/boards.h"
> > > +
> > > +/**
> > > + * Total size (in bytes) of the ACPI CPU OSPM Interface MMIO region.
> > > + *
> > > + * This region contains control and status fields such as CPU selector,
> > > + * flags, command register, and data register. It must exactly match the
> > > + * layout defined in the AML code and the memory region implementation.
> > > + *
> > > + * Any mismatch between this definition and the AML layout may result in
> > > + * runtime errors or build-time assertion failures (e.g., _Static_assert),
> > > + * breaking correct device emulation and guest OS coordination.
> > > + */
> > > +#define ACPI_CPU_OSPM_IF_REG_LEN 16
> > > +
> > > +typedef struct  {
> > > +    CPUState *cpu;
> > > +    uint64_t arch_id;
> > > +    bool devchk_pending; /* device-check pending */
> > > +    bool ejrqst_pending; /* eject-request pending */
> > > +    uint32_t ost_event;
> > > +    uint32_t ost_status;
> > > +} AcpiCpuOspmStateStatus;
> > > +
> > > +typedef struct AcpiCpuOspmState {
> > > +    DeviceState *acpi_dev;
> > > +    MemoryRegion ctrl_reg;
> > > +    uint32_t selector;
> > > +    uint8_t command;
> > > +    uint32_t dev_count;
> > > +    AcpiCpuOspmStateStatus *devs;
> > > +} AcpiCpuOspmState;
> > > +
> > > +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > +                              uint32_t event_st, Error **errp);
> > > +
> > > +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > +                               uint32_t event_st, Error **errp);
> > > +
> > > +void acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > +                       Error **errp);
> > > +
> > > +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> > > +                                        AcpiCpuOspmState *state,
> > > +                                        hwaddr base_addr);
> > > +
> > > +void acpi_build_cpus_aml(Aml *table, hwaddr base_addr, const char *root,
> > > +                         const char *event_handler_method);
> > > +
> > > +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st,
> > > +                           ACPIOSTInfoList ***list);
> > > +
> > > +extern const VMStateDescription vmstate_cpu_ospm_state;
> > > +#define VMSTATE_CPU_OSPM_STATE(cpuospm, state) \
> > > +    VMSTATE_STRUCT(cpuospm, state, 1, \
> > > +                   vmstate_cpu_ospm_state, AcpiCpuOspmState)
> > > +#endif /* CPU_OSPM_INTERFACE_H */
> >   
> 
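For readers following the quoted AML: CPU_STS_METHOD returns 0xF when the
Enabled flag reads 1 and 0xD otherwise. The sketch below shows how those two
values decode under the standard ACPI _STA bit layout. It is only an
illustration; the names cpu_sta_value() and STA_* are hypothetical and are not
part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ACPI _STA result bits (macro names are illustrative, not from the patch) */
#define STA_PRESENT     (1u << 0)
#define STA_ENABLED     (1u << 1)
#define STA_SHOW_IN_UI  (1u << 2)
#define STA_FUNCTIONING (1u << 3)

/* Mirrors the if/else in the quoted CPU_STS_METHOD: yields 0xF or 0xD */
static uint32_t cpu_sta_value(bool is_enabled)
{
    /* every possible CPU is reported present, visible and functioning */
    uint32_t sta = STA_PRESENT | STA_SHOW_IN_UI | STA_FUNCTIONING;

    if (is_enabled) {
        sta |= STA_ENABLED;   /* only the Enabled bit differs */
    }
    return sta;
}
```

In other words, a disabled vCPU differs from an enabled one only in bit 1,
which is what lets the guest see it as present but not yet usable.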



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 11/24] hw/arm/acpi: MADT change to size the guest with possible vCPUs
       [not found]     ` <0175e40f70424dd9a29389b8a4f16c42@huawei.com>
@ 2025-10-07 12:20       ` Igor Mammedov
  2025-10-10  3:15         ` Salil Mehta
  0 siblings, 1 reply; 67+ messages in thread
From: Igor Mammedov @ 2025-10-07 12:20 UTC (permalink / raw)
  To: Salil Mehta
  Cc: salil.mehta@opnsrc.net, qemu-devel@nongnu.org,
	qemu-arm@nongnu.org, mst@redhat.com, maz@kernel.org,
	jean-philippe@linaro.org, Jonathan Cameron, lpieralisi@kernel.org,
	peter.maydell@linaro.org, richard.henderson@linaro.org,
	armbru@redhat.com, andrew.jones@linux.dev, david@redhat.com,
	philmd@linaro.org, eric.auger@redhat.com, will@kernel.org,
	ardb@kernel.org, oliver.upton@linux.dev, pbonzini@redhat.com,
	gshan@redhat.com, rafael@kernel.org, borntraeger@linux.ibm.com,
	alex.bennee@linaro.org, gustavo.romero@linaro.org,
	npiggin@gmail.com, harshpb@linux.ibm.com, linux@armlinux.org.uk,
	darren@os.amperecomputing.com, ilkka@os.amperecomputing.com,
	vishnu@os.amperecomputing.com, gankulkarni@os.amperecomputing.com,
	karl.heubaum@oracle.com, miguel.luis@oracle.com, zhukeqian,
	wangxiongfeng (C), wangyanan (Y), Wangzhou (B), Linuxarm,
	jiakernel2@gmail.com, maobibo@loongson.cn, lixianglai@loongson.cn,
	shahuang@redhat.com, zhao1.liu@intel.com

On Tue, 7 Oct 2025 11:34:48 +0000
Salil Mehta <salil.mehta@huawei.com> wrote:

> Hi Igor,
> 
> > From: Igor Mammedov <imammedo@redhat.com>
> > Sent: Friday, October 3, 2025 4:09 PM
> > To: salil.mehta@opnsrc.net
> > 
> > On Wed,  1 Oct 2025 01:01:14 +0000
> > salil.mehta@opnsrc.net wrote:
> >   
> > > From: Salil Mehta <salil.mehta@huawei.com>
> > >
> > > When QEMU builds the MADT table, modifications are needed to include
> > > information about possible vCPUs that are exposed as ACPI-disabled (i.e.,
> > > `_STA.Enabled=0`). This new information will help the guest kernel pre-size
> > > its resources during boot time. Pre-sizing based on possible vCPUs will
> > > facilitate the future hot-plugging of the currently disabled vCPUs.
> > >
> > > Additionally, this change addresses updates to the ACPI MADT GIC CPU
> > > interface flags, as introduced in the UEFI ACPI 6.5 specification [1].
> > > These updates enable deferred virtual CPU onlining in the guest kernel.
> > >
> > > Reference:
> > > [1] 5.2.12.14. GIC CPU Interface (GICC) Structure (Table 5.37 GICC CPU Interface Flags)
> > >     Link: https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#gic-cpu-interface-gicc-structure
> > >
> > > Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> > > Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> > > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > > ---
> > >  hw/arm/virt-acpi-build.c | 40 ++++++++++++++++++++++++++++++++++------
> > >  hw/core/machine.c        | 14 ++++++++++++++
> > >  include/hw/boards.h      | 20 ++++++++++++++++++++
> > >  3 files changed, 68 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> > > index b01fc4f8ef..7c24dd6369 100644
> > > --- a/hw/arm/virt-acpi-build.c
> > > +++ b/hw/arm/virt-acpi-build.c
> > > @@ -760,6 +760,32 @@ static void build_append_gicr(GArray *table_data, uint64_t base, uint32_t size)
> > >      build_append_int_noprefix(table_data, size, 4); /* Discovery Range Length */
> > >  }
> > >
> > > +static uint32_t virt_acpi_get_gicc_flags(CPUState *cpu)
> > > +{
> > > +    MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());
> > > +    const uint32_t GICC_FLAG_ENABLED = BIT(0);
> > > +    const uint32_t GICC_FLAG_ONLINE_CAPABLE = BIT(3);
> > > +
> > > +    /* ARM architecture does not support vCPU hotplug yet */
> > > +    if (!cpu) {
> > > +        return 0;
> > > +    }
> > > +
> > > +    /*
> > > +     * If the machine does not support online-capable CPUs, report the GICC as
> > > +     * 'enabled' only.
> > > +     */
> > > +    if (!mc->has_online_capable_cpus) {
> > > +        return GICC_FLAG_ENABLED;
> > > +    }
> > > +
> > > +    /*
> > > +     * ACPI 6.5, 5.2.12.14 (GICC): mark the boot CPU 'enabled' and all others
> > > +     * 'online-capable'.
> > > +     */
> > > +    return (cpu == first_cpu) ? GICC_FLAG_ENABLED : GICC_FLAG_ONLINE_CAPABLE;
> > > +}
> > > +
> > >  static void
> > >  build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> > >  {
> > > @@ -785,12 +811,14 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> > >      build_append_int_noprefix(table_data, vms->gic_version, 1);
> > >      build_append_int_noprefix(table_data, 0, 3);   /* Reserved */
> > >
> > > -    for (i = 0; i < MACHINE(vms)->smp.cpus; i++) {
> > > -        ARMCPU *armcpu = ARM_CPU(qemu_get_cpu(i));
> > > +    for (i = 0; i < MACHINE(vms)->smp.max_cpus; i++) {  
> >                                      ^^^^^^^^^^^^  
> > > +        CPUState *cpu = machine_get_possible_cpu(i);  
> > ...  
> > > +        CPUArchId *archid = machine_get_possible_cpu_arch_id(i);  
> > 
> > What does the complexity above add? (And then you say that instantiating an
> > ARM VM is slow.)
> > 
> > I'd drop machine_get_possible_cpu/machine_get_possible_cpu_arch_id
> > altogether and mimic what acpi_build_madt() does.  
> 
> 
> We can do that here but I need this function elsewhere in the monitor code as well
> to iterate over the possible CPUs and if I remember correctly I was getting compilation
> errors there. But I will check if this can be removed.
> 
> I would like to keep machine_get_possible_cpu().

If you iterate over the CPUs with this helper, you are basically introducing
O(n^2) complexity at that point.
But those are details; we will sort them out eventually.
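
To make the complexity point concrete, here is a minimal, self-contained
sketch. It uses simplified stand-in types (SlotSketch and SlotListSketch are
hypothetical, not QEMU's CPUArchId/CPUArchIdList) and counts how many slots
are touched when each CPU is found through a linear-search helper versus
walking the slot array directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for QEMU's CPUArchId/CPUArchIdList (hypothetical) */
typedef struct {
    int64_t cpu_index;   /* logical index of the CPU in this slot */
    uint64_t arch_id;    /* e.g. the MPIDR on Arm */
} SlotSketch;

typedef struct {
    size_t len;
    SlotSketch *cpus;
} SlotListSketch;

/* Linear search by cpu_index, like the quoted helper; counts slot visits */
static const SlotSketch *lookup_by_index(const SlotListSketch *l,
                                         int64_t cpu_index, size_t *visits)
{
    for (size_t i = 0; i < l->len; i++) {
        (*visits)++;
        if (l->cpus[i].cpu_index == cpu_index) {
            return &l->cpus[i];
        }
    }
    return NULL;
}

/* Helper-based walk: one linear search per CPU, so O(n^2) slot visits */
static size_t walk_with_helper(const SlotListSketch *l)
{
    size_t visits = 0;

    for (size_t i = 0; i < l->len; i++) {
        (void)lookup_by_index(l, (int64_t)i, &visits);
    }
    return visits;
}

/* Direct walk over the slot array: each slot is visited exactly once */
static size_t walk_direct(const SlotListSketch *l)
{
    size_t visits = 0;

    for (size_t i = 0; i < l->len; i++) {
        (void)l->cpus[i].arch_id;   /* use the slot in place */
        visits++;
    }
    return visits;
}
```

With n slots whose cpu_index equals the slot number, the helper-based walk
touches n(n+1)/2 slots while the direct walk touches n.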

> 
> I think you've misunderstood the reason for the boot time delay mentioned to you in RFC V5.
> It is because of the realization leg, i.e. qdev_realize(), of the vCPU and not because of this
> initialization leg.

I did misunderstand wrt slow vCPU creation.
I did object to lazy creation in general, and, well, I still dislike it.
For more on this topic, see my reply to the cover letter; let's continue the
discussion there.

> 
> 
> >   
> > > +        uint32_t flags = virt_acpi_get_gicc_flags(cpu);
> > > +        uint64_t mpidr = archid->arch_id;
> > >
> > >          if (vms->gic_version == VIRT_GIC_VERSION_2) {
> > >              physical_base_address = memmap[VIRT_GIC_CPU].base;
> > > @@ -805,7 +833,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> > >          build_append_int_noprefix(table_data, i, 4);    /* GIC ID */
> > >          build_append_int_noprefix(table_data, i, 4);    /* ACPI Processor UID */
> > >          /* Flags */
> > > -        build_append_int_noprefix(table_data, 1, 4);    /* Enabled */
> > > +        build_append_int_noprefix(table_data, flags, 4);
> > >          /* Parking Protocol Version */
> > >          build_append_int_noprefix(table_data, 0, 4);
> > >          /* Performance Interrupt GSIV */
> > > @@ -819,7 +847,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> > >          build_append_int_noprefix(table_data, vgic_interrupt, 4);
> > >          build_append_int_noprefix(table_data, 0, 8);    /* GICR Base Address*/
> > >          /* MPIDR */
> > > -        build_append_int_noprefix(table_data, arm_cpu_mp_affinity(armcpu), 8);
> > > +        build_append_int_noprefix(table_data, mpidr, 8);
> > >          /* Processor Power Efficiency Class */
> > >          build_append_int_noprefix(table_data, 0, 1);
> > >          /* Reserved */
> > > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > > index 69d5632464..65388d859a 100644
> > > --- a/hw/core/machine.c
> > > +++ b/hw/core/machine.c
> > > @@ -1383,6 +1383,20 @@ CPUState *machine_get_possible_cpu(int64_t cpu_index)
> > >      return NULL;
> > >  }
> > >
> > > +CPUArchId *machine_get_possible_cpu_arch_id(int64_t cpu_index)
> > > +{
> > > +    MachineState *ms = MACHINE(qdev_get_machine());
> > > +    CPUArchIdList *possible_cpus = ms->possible_cpus;
> > > +
> > > +    for (int i = 0; i < possible_cpus->len; i++) {
> > > +        if (possible_cpus->cpus[i].cpu &&
> > > +            possible_cpus->cpus[i].cpu->cpu_index == cpu_index) {
> > > +            return &possible_cpus->cpus[i];
> > > +        }
> > > +    }
> > > +    return NULL;
> > > +}
> > > +
> > >  static char *cpu_slot_to_string(const CPUArchId *cpu)
> > >  {
> > >      GString *s = g_string_new(NULL);
> > > diff --git a/include/hw/boards.h b/include/hw/boards.h
> > > index 3ff77a8b3a..fe51ca58bf 100644
> > > --- a/include/hw/boards.h
> > > +++ b/include/hw/boards.h
> > > @@ -461,6 +461,26 @@ struct MachineState {
> > >      bool acpi_spcr_enabled;
> > >  };
> > >
> > > +/*
> > > + * machine_get_possible_cpu_arch_id:
> > > + * @cpu_index: logical cpu_index to search for
> > > + *
> > > + * Return a pointer to the CPUArchId entry matching the given @cpu_index
> > > + * in the current machine's MachineState. The possible_cpus array holds
> > > + * the full set of CPUs that the machine could support, including those
> > > + * that may be created as disabled or taken offline.
> > > + *
> > > + * The slot index in ms->possible_cpus[] is always sequential, but the
> > > + * logical cpu_index values are assigned by QEMU and may or may not be
> > > + * sequential depending on the implementation of a particular machine.
> > > + * Direct indexing by cpu_index is therefore unsafe in general. This
> > > + * helper performs a linear search of the possible_cpus array to find
> > > + * the matching entry.
> > > + *
> > > + * Returns: pointer to the matching CPUArchId, or NULL if not found.
> > > + */
> > > +CPUArchId *machine_get_possible_cpu_arch_id(int64_t cpu_index);
> > > +
> > >  /*
> > >   * The macros which follow are intended to facilitate the
> > >   * definition of versioned machine types, using a somewhat  
> >   
> 
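As a compact summary of the flag selection in the quoted
virt_acpi_get_gicc_flags(), the sketch below restates its three cases with
plain booleans. It is a simplified stand-in, not the patch code: the function
name gicc_flags_sketch() and its parameters are hypothetical, where 'present'
models a non-NULL CPUState, 'is_boot_cpu' models cpu == first_cpu, and
'online_capable' models mc->has_online_capable_cpus.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GICC flag bits as used in the quoted patch (ACPI 6.5, Table 5.37) */
#define GICC_FLAG_ENABLED        (1u << 0)
#define GICC_FLAG_ONLINE_CAPABLE (1u << 3)

/* Hypothetical restatement of the quoted flag-selection logic */
static uint32_t gicc_flags_sketch(bool present, bool is_boot_cpu,
                                  bool online_capable)
{
    if (!present) {
        return 0;                    /* no CPU object behind this slot */
    }
    if (!online_capable) {
        return GICC_FLAG_ENABLED;    /* machine reports 'enabled' only */
    }
    /* boot CPU is 'enabled'; every other CPU is 'online-capable' */
    return is_boot_cpu ? GICC_FLAG_ENABLED : GICC_FLAG_ONLINE_CAPABLE;
}
```

Note that the two flags are never set together, which matches the quoted code
returning exactly one of them per CPU.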



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 22/24] monitor,qdev: Introduce 'device_set' to change admin state of existing devices
  2025-10-01  1:01 ` [PATCH RFC V6 22/24] monitor, qdev: Introduce 'device_set' to change admin state of existing devices salil.mehta
@ 2025-10-09  8:55   ` Markus Armbruster
  2025-10-09 12:51     ` Igor Mammedov
  0 siblings, 1 reply; 67+ messages in thread
From: Markus Armbruster @ 2025-10-09  8:55 UTC (permalink / raw)
  To: salil.mehta
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, andrew.jones, david, philmd, eric.auger, will, ardb,
	oliver.upton, pbonzini, gshan, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu

salil.mehta@opnsrc.net writes:

> From: Salil Mehta <salil.mehta@huawei.com>
>
> This patch adds a "device_set" interface for modifying properties of devices
> that already exist in the guest topology. Unlike 'device_add'/'device_del'
> (hot-plug), 'device_set' does not create or destroy devices. It is intended
> for guest-visible hot-add semantics where hardware is provisioned at boot but
> logically enabled/disabled later via administrative policy.
>
> Compared to the existing 'qom-set' command, which is less intuitive and works
> only with object IDs, device_set provides a more device-oriented interface.
> It can be invoked at the QEMU prompt using natural device arguments, and the
> new '-deviceset' CLI option allows properties to be set at boot time, similar
> to how '-device' specifies device creation.

Why can't we use -device?

> While the initial implementation focuses on "admin-state" changes (e.g.,
> enable/disable a CPU already described by ACPI/DT), the interface is designed
> to be generic. In future, it could be used for other per-device set/unset
> style controls — beyond administrative power-states — provided the target
> device explicitly allows such changes. This enables fine-grained runtime
> control of device properties.

Beware, designing a generic interface can be harder, sometimes much
harder, than designing a specialized one.

device_add and qom-set are generic, and they have issues:

* device_add effectively bypasses QAPI by using 'gen': false.

  This bypasses QAPI's enforcement of documentation.  Property
  documentation is separate and poor.

  It also defeats introspection with query-qmp-schema.  You need to
  resort to other means instead, say QOM introspection (which is a bag
  of design flaws on its own), then map from QOM to qdev.

* device_add lets you specify any qdev property, even properties that
  are intended only for use by C code.

  This results in accidental external interfaces.

  We tend to name properties like "x-prop" to discourage external use,
  but I wouldn't bet my own money on us getting that always right.
  Moreover, there are beauties like "x-origin".

* qom-set & friends effectively bypass QAPI by using type 'any'.

  Again, the bypass results in poor documentation and a defeat of
  query-qmp-schema.

* qom-set lets you mess with any QOM property with a setter callback.

  Again, accidental external interfaces: most of these properties are
  not meant for use with qom-set.  For some, qom-set works, for some it
  silently does nothing, and for some it crashes.  A lot more dangerous
  than device_add.

  The "x-" convention can't help here: some properties are intended for
  external use with object-add, but not with qom-set.

We should avoid such issues in new interfaces.

We'll examine how this applies to device_set when I review the QAPI
schema.

> Key pieces:
>   * QMP: qmp_device_set() to update an existing device. The device can be
>     located by "id" or via driver+property match using a DeviceListener
>     callback (qdev_find_device()).
>   * HMP: "device_set" command with tab-completion. Errors are surfaced via
>     hmp_handle_error().
>   * CLI: "-deviceset" option for setting startup/admin properties at boot,
>     including a JSON form. Options are parsed into qemu_deviceset_opts and
>     applied after device creation.
>   * Docs/help: HMP help text and qemu-options.hx additions explain usage and
>     explicitly note that no hot-plug occurs.
>   * Safety: disallowed during live migration (migration_is_idle() check).
>
> Semantics:
>   * Operates on an existing DeviceState; no enumeration/new device appears.
>   * Complements device_add/device_del by providing state mutation only.
>   * Backward compatible: no behavior change unless "device_set"/"-deviceset"
>     is used.
>
> Examples:
>   HMP:
>     (qemu) device_set host-arm-cpu,core-id=3,admin-state=enable
>
>   CLI (at boot):
>     -smp cpus=4,maxcpus=4 \
>     -deviceset host-arm-cpu,core-id=2,admin-state=disable
>
>   QMP (JSON form):
>     { "execute": "device_set",
>       "arguments": {
>         "driver": "host-arm-cpu",
>         "core-id": 1,
>         "admin-state": "disable"
>       }
>     }

{"error": {"class": "CommandNotFound", "desc": "The command device_set has not been found"}}

Clue below.

> NOTE: The qdev_enable()/qdev_disable() hooks for acting on admin-state will be
> added in subsequent patches. Device classes must explicitly support any
> property they want to expose through device_set.
>
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> ---
>  hmp-commands.hx         |  30 +++++++++
>  hw/arm/virt.c           |  86 +++++++++++++++++++++++++
>  hw/core/cpu-common.c    |  12 ++++
>  hw/core/qdev.c          |  21 ++++++
>  include/hw/arm/virt.h   |   1 +
>  include/hw/core/cpu.h   |  11 ++++
>  include/hw/qdev-core.h  |  22 +++++++
>  include/monitor/hmp.h   |   2 +
>  include/monitor/qdev.h  |  30 +++++++++
>  include/system/system.h |   1 +
>  qemu-options.hx         |  51 +++++++++++++--
>  system/qdev-monitor.c   | 139 +++++++++++++++++++++++++++++++++++++++-
>  system/vl.c             |  39 +++++++++++
>  13 files changed, 440 insertions(+), 5 deletions(-)

Clue: no update to the QAPI schema, i.e. the QMP command does not exist.

>
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index d0e4f35a30..18056cf21d 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -707,6 +707,36 @@ SRST
>    or a QOM object path.
>  ERST
>  
> +{
> +    .name       = "device_set",
> +    .args_type  = "device:O",
> +    .params     = "driver[,prop=value][,...]",
> +    .help       = "set/unset existing device property",
> +    .cmd        = hmp_device_set,
> +    .command_completion = device_set_completion,
> +},
> +
> +SRST
> +``device_set`` *driver[,prop=value][,...]*
> +  Change the administrative power state of an existing device.
> +
> +  This command enables or disables a known device (e.g., CPU) using the
> +  "device_set" interface. It does not hotplug or add a new device.
> +
> +  Depending on platform support (e.g., PSCI or ACPI), this may trigger
> +  corresponding operational changes — such as powering down a CPU or
> +  transitioning it to active use.
> +
> +  Administrative state:
> +    * *enabled*  — Allows the guest to use the device (e.g., CPU_ON)
> +    * *disabled* — Prevents guest use; device is powered off (e.g., CPU_OFF)
> +
> +  Note: The device must already exist (be declared during machine creation).
> +
> +  Example:
> +      (qemu) device_set host-arm-cpu,core-id=3,admin-state=disabled
> +ERST

How exactly is the device selected?  You provide a clue above: 'can be
located by "id" or via driver+property match'.

I assume locating by "id" works just like device_del, i.e. by qdev ID or
QOM path.

What "driver+property match" means is not obvious.  Which of the arguments
are for matching, and which are for setting?

If "id" is specified, is there any matching?

The matching feature complicates this interface quite a bit.  I doubt
it's worth the complexity.  If you think it is, please split it off into
a separate patch.

Next question.  Is there a way for management applications to detect
whether a certain device supports device_set for a certain property?

Without that, what are management applications supposed to do?  Hard-code
what works?  Run the command and see whether it fails?

I understand right now the command supports just "admin-state" for a
certain set of devices, so hard-coding would be possible.  But every new
(device, property) pair then requires management application updates,
and the hard-coded information becomes version specific.  This will
become unworkable real quick.  Not good enough for a command designed to
be generic.

> +
>      {
>          .name       = "cpu",
>          .args_type  = "index:i",

[...]

> diff --git a/qemu-options.hx b/qemu-options.hx
> index 83ccde341b..f517b91042 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -375,7 +375,10 @@ SRST
>      This is different from CPU hotplug where additional CPUs are not even
>      present in the system description. Administratively disabled CPUs appear in
>      ACPI tables i.e. are provisioned, but cannot be used until explicitly
> -    enabled via QMP/HMP or the deviceset API.
> +    enabled via QMP/HMP or the deviceset API. On ACPI guests, each vCPU counted
> +    by 'disabledcpus=' is provisioned with '\ ``_STA``\ ' reporting Present=1
> +    and Enabled=0 (present-offline) at boot; it becomes Enabled=1 when brought
> +    online via 'device_set ... admin-state=enable'.
>  
>      On boards supporting CPU hotplug, the optional '\ ``maxcpus``\ ' parameter
>      can be set to enable further CPUs to be added at runtime. When both
> @@ -455,6 +458,15 @@ SRST
>  
>          -smp 2
>  
> +    Note: The cluster topology will only be generated in ACPI and exposed
> +    to guest if it's explicitly specified in -smp.
> +
> +    Note: Administratively disabled CPUs (specified via 'disabledcpus=' and
> +    '-deviceset' at CLI during boot) are especially useful for platforms like
> +    ARM that lack native CPU hotplug support. These CPUs will appear to the
> +    guest as unavailable, and any attempt to bring them online must go through
> +    QMP/HMP commands like 'device_set'.
> +
>      Examples using 'disabledcpus':
>  
>      For a board without CPU hotplug, enable 4 CPUs at boot and provision
> @@ -472,9 +484,6 @@ SRST
>      ::
>  
>          -smp cpus=4,disabledcpus=2,maxcpus=8
> -
> -    Note: The cluster topology will only be generated in ACPI and exposed
> -    to guest if it's explicitly specified in -smp.
>  ERST
>  
>  DEF("numa", HAS_ARG, QEMU_OPTION_numa,
> @@ -1281,6 +1290,40 @@ SRST
>  
>  ERST
>  
> +DEF("deviceset", HAS_ARG, QEMU_OPTION_deviceset,
> +    "-deviceset driver[,prop[=value]][,...]\n"
> +    "                Set administrative power state of an existing device.\n"
> +    "                Does not hotplug a new device. Can disable or enable\n"
> +    "                devices (such as CPUs) at boot based on policy.\n"
> +    "                Example:\n"
> +    "                    -deviceset host-arm-cpu,core-id=2,admin-state=disabled\n"
> +    "                Use '-deviceset help' for supported drivers\n"
> +    "                Use '-deviceset driver,help' for driver-specific properties\n",
> +    QEMU_ARCH_ALL)
> +SRST
> +``-deviceset driver[,prop[=value]][,...]``
> +    Configure an existing device's administrative power state or properties.
> +
> +    Unlike ``-device``, this option does not create a new device. Instead,
> +    it sets startup properties (such as administrative power state) for
> +    a device already declared via -smp or other machine configuration.
> +
> +    Example:
> +        -smp cpus=4
> +        -deviceset host-arm-cpu,core-id=2,admin-state=disabled
> +
> +    The above disables CPU core 2 at boot using administrative offlining.
> +    The guest may later re-enable the core (if permitted by platform policy).
> +
> +    ``state=enabled|disabled``
> +        Sets the administrative state of the device:
> +        - ``enabled``: device is made available at boot
> +        - ``disabled``: device is administratively disabled and powered off
> +
> +    Use ``-deviceset help`` to view all supported drivers.
> +    Use ``-deviceset driver,help`` for property-specific help.
> +ERST
> +
>  DEF("name", HAS_ARG, QEMU_OPTION_name,
>      "-name string1[,process=string2][,debug-threads=on|off]\n"
>      "                set the name of the guest\n"
> diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
> index 2ac92d0a07..1099b1237d 100644
> --- a/system/qdev-monitor.c
> +++ b/system/qdev-monitor.c
> @@ -263,12 +263,20 @@ static DeviceClass *qdev_get_device_class(const char **driver, Error **errp)
>      }
>  
>      dc = DEVICE_CLASS(oc);
> -    if (!dc->user_creatable) {
> +    if (!dc->user_creatable && !dc->admin_power_state_supported) {
>          error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "driver",
>                     "a pluggable device type");
>          return NULL;
>      }
>  
> +    if (phase_check(PHASE_MACHINE_READY) &&
> +        (!dc->hotpluggable || !dc->admin_power_state_supported)) {
> +        error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "driver",
> +                   "a pluggable device type or which supports changing power-"
> +                   "state administratively");
> +        return NULL;
> +    }
> +
>      if (object_class_dynamic_cast(oc, TYPE_SYS_BUS_DEVICE)) {
>          /* sysbus devices need to be allowed by the machine */
>          MachineClass *mc = MACHINE_CLASS(object_get_class(qdev_get_machine()));
> @@ -939,6 +947,76 @@ void qmp_device_del(const char *id, Error **errp)
>      }
>  }
>  
> +void qmp_device_set(const QDict *qdict, Error **errp)
> +{
> +    const char *state;
> +    const char *driver;
> +    DeviceState *dev;
> +    DeviceClass *dc;
> +    const char *id;
> +
> +    driver = qdict_get_try_str(qdict, "driver");
> +    if (!driver) {
> +        error_setg(errp, "Parameter 'driver' is missing");
> +        return;
> +    }
> +
> +    /* check driver exists and we are at the right phase of machine init */
> +    dc = qdev_get_device_class(&driver, errp);
> +    if (!dc) {

Since qdev_get_device_class() sets an error when it fails, *errp is not
null here, ...

> +        error_setg(errp, "driver '%s' not supported", driver);

... which makes this wrong.  Caught by error_setv()'s assertion.

Please test your error paths.
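
A minimal sketch of the pattern, for illustration only.  It uses simplified
stand-ins rather than QEMU's real Error API: the Error struct here and the
sketch_*() functions are hypothetical, and the only behaviour borrowed from
the real code is that setting an error into a non-null *errp trips an
assertion.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for QEMU's Error object; not the real definition. */
typedef struct Error {
    char msg[128];
} Error;

/* Stand-in for error_setg(): it refuses to overwrite an existing error. */
static void sketch_error_setg(Error **errp, const char *msg)
{
    if (!errp) {
        return;                 /* caller is not interested in the error */
    }
    assert(*errp == NULL);      /* setting an error twice is a bug */
    *errp = calloc(1, sizeof(Error));
    if (!*errp) {
        abort();
    }
    snprintf((*errp)->msg, sizeof((*errp)->msg), "%s", msg);
}

/* Stand-in for a lookup that reports its own failure through errp. */
static const char *sketch_get_class(const char *driver, Error **errp)
{
    if (driver[0] == '\0') {
        sketch_error_setg(errp, "Parameter 'driver' expects a device type");
        return NULL;
    }
    return driver;
}

/* Returns 0 on success, -1 on failure with *errp set exactly once. */
static int sketch_device_set(const char *driver, Error **errp)
{
    if (!sketch_get_class(driver, errp)) {
        /* errp already holds the callee's error: propagate, don't re-set. */
        return -1;
    }
    return 0;
}
```

Adding a second sketch_error_setg() call inside the failure branch, as the
quoted hunk does with error_setg(), would trip the stand-in assertion in the
same way.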

> +        return;
> +    }
> +
> +    if (migration_is_running()) {
> +        error_setg(errp, "device_set not allowed while migrating");
> +        return;
> +    }
> +
> +    id = qdict_get_try_str(qdict, "id");
> +
> +    if (id) {
> +        /* Lookup by ID */
> +        dev = find_device_state(id, false, errp);
> +        if (errp && *errp) {
> +            error_prepend(errp, "Device lookup failed for ID '%s': ", id);
> +            return;
> +        }
> +    } else {
> +        /* Lookup using driver and properties */
> +        dev = qdev_find_device(qdict, errp);
> +        if (errp && *errp) {
> +            error_prepend(errp, "Device lookup for %s failed: ", driver);
> +            return;
> +        }
> +    }
> +    if (!dev) {
> +        error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
> +                  "No device found for driver '%s'", driver);
> +        return;
> +    }
> +
> +    state = qdict_get_try_str(qdict, "admin-state");
> +    if (!state) {
> +        error_setg(errp, "no device state change specified for device %s ",
> +                   dev->id);
> +        return;
> +    } else if (!strcmp(state, "enable")) {
> +
> +        if (!qdev_enable(dev, qdev_get_parent_bus(DEVICE(dev)), errp)) {
> +            return;
> +        }
> +    } else if (!strcmp(state, "disable")) {
> +        if (!qdev_disable(dev, qdev_get_parent_bus(DEVICE(dev)), errp)) {
> +            return;
> +        }
> +    } else {
> +        error_setg(errp, "unrecognized specified state *%s* for device %s",
> +                   state, dev->id);
> +        return;
> +    }
> +}
> +
>  int qdev_sync_config(DeviceState *dev, Error **errp)
>  {
>      DeviceClass *dc = DEVICE_GET_CLASS(dev);

[...]



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 01/24] hw/core: Introduce administrative power-state property and its accessors
  2025-10-01  1:01 ` [PATCH RFC V6 01/24] hw/core: Introduce administrative power-state property and its accessors salil.mehta
@ 2025-10-09 10:48   ` Miguel Luis
  0 siblings, 0 replies; 67+ messages in thread
From: Miguel Luis @ 2025-10-09 10:48 UTC (permalink / raw)
  To: salil.mehta@opnsrc.net
  Cc: qemu-devel@nongnu.org, qemu-arm@nongnu.org, mst@redhat.com,
	salil.mehta@huawei.com, maz@kernel.org, jean-philippe@linaro.org,
	jonathan.cameron@huawei.com, lpieralisi@kernel.org,
	peter.maydell@linaro.org, richard.henderson@linaro.org,
	imammedo@redhat.com, armbru@redhat.com, andrew.jones@linux.dev,
	david@redhat.com, philmd@linaro.org, eric.auger@redhat.com,
	will@kernel.org, ardb@kernel.org, oliver.upton@linux.dev,
	pbonzini@redhat.com, gshan@redhat.com, rafael@kernel.org,
	borntraeger@linux.ibm.com, alex.bennee@linaro.org,
	gustavo.romero@linaro.org, npiggin@gmail.com,
	harshpb@linux.ibm.com, linux@armlinux.org.uk,
	darren@os.amperecomputing.com, ilkka@os.amperecomputing.com,
	vishnu@os.amperecomputing.com, gankulkarni@os.amperecomputing.com,
	Karl Heubaum, zhukeqian1@huawei.com, wangxiongfeng2@huawei.com,
	wangyanan55@huawei.com, wangzhou1@hisilicon.com,
	linuxarm@huawei.com, jiakernel2@gmail.com, maobibo@loongson.cn,
	lixianglai@loongson.cn, shahuang@redhat.com, zhao1.liu@intel.com

Hi Salil,

> On 1 Oct 2025, at 01:01, salil.mehta@opnsrc.net wrote:
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> 
> Some devices cannot be hot-unplugged, either because removal is not meaningful
> (e.g. on-board devices) or not supported (e.g. certain PCIe devices). Others,
> such as CPUs on architectures like ARM, lack native hotplug support but can
> still have their availability controlled through host policy. In all these
> cases, a mechanism is needed to track and control a device’s *administrative*
> power state — independent of its runtime operational state — so QEMU can:
> 
>  - Disable a device while keeping it described in firmware, ACPI, or other
>    configuration.
>  - Prevent guest use until explicitly re-enabled.
>  - Coordinate transitions with platform-specific power handlers and migration
>    logic.
> 
> This patch introduces the core qdev support for administrative power state —
> defining the property, enum, and accessors — without yet applying it to any
> device. Later patches in this series integrate it with helper APIs
> (qdev_disable(), qdev_enable(), etc.) and specific device types such as CPUs,
> completing the flow with platform-specific handlers.
> 
> Key additions:
>  - New enum DeviceAdminPowerState with ENABLED, DISABLED, and REMOVED states,
>    defaulting to ENABLED.
>  - New DeviceClass flag admin_power_state_supported to advertise support for
>    administrative transitions.
>  - New QOM property "admin_power_state" to query or set the state on supported
>    devices.
>  - Internal accessors device_get_admin_power_state() and
>    device_set_admin_power_state() to manage state changes, including safe
>    handling when the device is not yet realized.
> 
> The enum models *policy* rather than electrical or functional power state, and
> is distinct from runtime mechanisms (e.g. PSCI for ARM CPUs). The actual
> operational state of a device is maintained by platform-specific or device-
> specific code, which enforces runtime behaviour based on the administrative
> setting. Every device starts administratively ENABLED by default. A DISABLED
> device remains logically present but blocked from operation; a REMOVED device
> is logically absent.
> 
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> ---
> hw/core/qdev.c         | 62 ++++++++++++++++++++++++++++++++++++++++++
> include/hw/qdev-core.h | 54 ++++++++++++++++++++++++++++++++++++
> target/arm/cpu.c       |  1 +

I’d suggest splitting this into two patches: one that adds the functionality
and another that enables it for the arch. That may ease integration overall;
moreover, the two may be independent of one another.

Thanks
Miguel

> 3 files changed, 117 insertions(+)
> 
> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> index f600226176..8502d6216f 100644
> --- a/hw/core/qdev.c
> +++ b/hw/core/qdev.c
> @@ -633,6 +633,53 @@ static bool device_get_hotplugged(Object *obj, Error **errp)
>     return dev->hotplugged;
> }
> 
> +static int device_get_admin_power_state(Object *obj, Error **errp)
> +{
> +    DeviceState *dev = DEVICE(obj);
> +
> +    return dev->admin_power_state;
> +}
> +
> +static void
> +device_set_admin_power_state(Object *obj, int new_state, Error **errp)
> +{
> +    DeviceState *dev = DEVICE(obj);
> +    DeviceClass *dc = DEVICE_GET_CLASS(dev);
> +
> +    if (!dc->admin_power_state_supported) {
> +        error_setg(errp, "Device '%s' admin power state change not supported",
> +                   object_get_typename(obj));
> +        return;
> +    }
> +
> +    switch (new_state) {
> +    case DEVICE_ADMIN_POWER_STATE_DISABLED: {
> +        /*
> +         * TODO: Operational state transition triggered by administrative action
> +         * Powering off the realized device either synchronously or via OSPM.
> +         */
> +
> +        qatomic_set(&dev->admin_power_state, DEVICE_ADMIN_POWER_STATE_DISABLED);
> +        smp_wmb();
> +        break;
> +    }
> +    case DEVICE_ADMIN_POWER_STATE_ENABLED: {
> +        /*
> +         * TODO: Operational state transition triggered by administrative action
> +         * Powering on the device and restoring migration registration.
> +         */
> +
> +        qatomic_set(&dev->admin_power_state, DEVICE_ADMIN_POWER_STATE_ENABLED);
> +        smp_wmb();
> +        break;
> +    }
> +    default:
> +        error_setg(errp, "Invalid admin power state %d for device '%s'",
> +                   new_state, dev->id);
> +        break;
> +    }
> +}
> +
> static void device_initfn(Object *obj)
> {
>     DeviceState *dev = DEVICE(obj);
> @@ -644,6 +691,7 @@ static void device_initfn(Object *obj)
> 
>     dev->instance_id_alias = -1;
>     dev->realized = false;
> +    dev->admin_power_state = DEVICE_ADMIN_POWER_STATE_ENABLED;
>     dev->allow_unplug_during_migration = false;
> 
>     QLIST_INIT(&dev->gpios);
> @@ -731,6 +779,15 @@ device_vmstate_if_get_id(VMStateIf *obj)
>     return qdev_get_dev_path(dev);
> }
> 
> +static const QEnumLookup device_admin_power_state_lookup = {
> +    .array = (const char *const[]) {
> +        [DEVICE_ADMIN_POWER_STATE_ENABLED]  = "enabled",
> +        [DEVICE_ADMIN_POWER_STATE_REMOVED]  = "removed",
> +        [DEVICE_ADMIN_POWER_STATE_DISABLED] = "disabled",
> +    },
> +    .size = DEVICE_ADMIN_POWER_STATE_MAX,
> +};
> +
> static void device_class_init(ObjectClass *class, const void *data)
> {
>     DeviceClass *dc = DEVICE_CLASS(class);
> @@ -765,6 +822,11 @@ static void device_class_init(ObjectClass *class, const void *data)
>                                    device_get_hotpluggable, NULL);
>     object_class_property_add_bool(class, "hotplugged",
>                                    device_get_hotplugged, NULL);
> +    object_class_property_add_enum(class, "admin_power_state",
> +                                   "DeviceAdminPowerState",
> +                                   &device_admin_power_state_lookup,
> +                                   device_get_admin_power_state,
> +                                   device_set_admin_power_state);
>     object_class_property_add_link(class, "parent_bus", TYPE_BUS,
>                                    offsetof(DeviceState, parent_bus), NULL, 0);
> }
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index 530f3da702..3bc212ab3a 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -159,6 +159,7 @@ struct DeviceClass {
>      */
>     bool user_creatable;
>     bool hotpluggable;
> +    bool admin_power_state_supported;
> 
>     /* callbacks */
>     /**
> @@ -217,6 +218,55 @@ typedef QLIST_HEAD(, NamedGPIOList) NamedGPIOListHead;
> typedef QLIST_HEAD(, NamedClockList) NamedClockListHead;
> typedef QLIST_HEAD(, BusState) BusStateHead;
> 
> +/**
> + * enum DeviceAdminPowerState - Administrative control states for a device
> + *
> + * This enum defines abstract administrative states used by QEMU to enable,
> + * disable, or logically remove a device from the virtual machine. These
> + * states reflect administrative control over a device's power availability
> + * and presence in the system. These administrative states are distinct from
> + * runtime operational power states (e.g., PSCI states for ARM CPUs). They
> + * represent administrative *policy* rather than physical, electrical, or
> + * functional state.
> + *
> + * Administrative state is managed externally "via QMP, firmware, or other
> + * host-side policy agents" and acts as a gating policy that determines
> + * whether guest software is permitted to interact with the device. Most
> + * devices default to the ENABLED state unless explicitly disabled or removed.
> + *
> + * Changing a device administrative state may directly or indirectly affect
> + * its operational behavior. For example, a DISABLED device will reject guest
> + * attempts to power it on or transition it out of a suspended state. Not all
> + * devices support dynamic transitions between administrative states.
> + *
> + * - DEVICE_ADMIN_POWER_STATE_ENABLED:
> + *     The device is administratively enabled (i.e., logically present and
> + *     permitted to operate). Guest software may change its operational state
> + *     (e.g., activate, deactivate, suspend) within allowed architectural
> + *     semantics. This is the default state for most devices unless explicitly
> + *     disabled or unplugged.
> + *
> + * - DEVICE_ADMIN_POWER_STATE_DISABLED:
> + *     The device is administratively disabled. It remains logically present
> + *     but is blocked from functional operation. Guest-initiated transitions
> + *     are either suppressed or ignored. This is typically used to enforce
> + *     shutdown, deny execution, or offline the device without removing it.
> + *
> + * - DEVICE_ADMIN_POWER_STATE_REMOVED:
> + *     The device has been logically removed (e.g., via hot-unplug). It is no
> + *     longer considered present or visible to the guest. This state exists
> + *     for representational or transitional purposes only. In most cases,
> + *     once removed, the corresponding DeviceState object is destroyed and
> + *     no longer tracked. This concept may not apply to some devices as
> + *     architectural limitations might make unplug not meaningful.
> + */
> +typedef enum DeviceAdminPowerState {
> +    DEVICE_ADMIN_POWER_STATE_ENABLED = 0,
> +    DEVICE_ADMIN_POWER_STATE_DISABLED,
> +    DEVICE_ADMIN_POWER_STATE_REMOVED,
> +    DEVICE_ADMIN_POWER_STATE_MAX
> +} DeviceAdminPowerState;
> +
> /**
>  * struct DeviceState - common device state, accessed with qdev helpers
>  *
> @@ -240,6 +290,10 @@ struct DeviceState {
>      * @realized: has device been realized?
>      */
>     bool realized;
> +    /**
> +     * @admin_power_state: device administrative power state
> +     */
> +    DeviceAdminPowerState admin_power_state;
>     /**
>      * @pending_deleted_event: track pending deletion events during unplug
>      */
> diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> index e2b2337399..0c9a2e7ea4 100644
> --- a/target/arm/cpu.c
> +++ b/target/arm/cpu.c
> @@ -2765,6 +2765,7 @@ static void arm_cpu_class_init(ObjectClass *oc, const void *data)
>     cc->gdb_get_core_xml_file = arm_gdb_get_core_xml_file;
>     cc->gdb_stop_before_watchpoint = true;
>     cc->disas_set_info = arm_disas_set_info;
> +    dc->admin_power_state_supported = true;
> 
> #ifdef CONFIG_TCG
>     cc->tcg_ops = &arm_tcg_ops;
> -- 
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 02/24] hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter
  2025-10-01  1:01 ` [PATCH RFC V6 02/24] hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter salil.mehta
@ 2025-10-09 11:28   ` Miguel Luis
  2025-10-09 13:17     ` Igor Mammedov
  2025-10-09 11:51   ` Markus Armbruster
  1 sibling, 1 reply; 67+ messages in thread
From: Miguel Luis @ 2025-10-09 11:28 UTC (permalink / raw)
  To: salil.mehta@opnsrc.net
  Cc: qemu-devel@nongnu.org, qemu-arm@nongnu.org, mst@redhat.com,
	salil.mehta@huawei.com, maz@kernel.org, jean-philippe@linaro.org,
	jonathan.cameron@huawei.com, lpieralisi@kernel.org,
	peter.maydell@linaro.org, richard.henderson@linaro.org,
	imammedo@redhat.com, armbru@redhat.com, andrew.jones@linux.dev,
	david@redhat.com, philmd@linaro.org, eric.auger@redhat.com,
	will@kernel.org, ardb@kernel.org, oliver.upton@linux.dev,
	pbonzini@redhat.com, gshan@redhat.com, rafael@kernel.org,
	borntraeger@linux.ibm.com, alex.bennee@linaro.org,
	gustavo.romero@linaro.org, npiggin@gmail.com,
	harshpb@linux.ibm.com, linux@armlinux.org.uk,
	darren@os.amperecomputing.com, ilkka@os.amperecomputing.com,
	vishnu@os.amperecomputing.com, gankulkarni@os.amperecomputing.com,
	Karl Heubaum, zhukeqian1@huawei.com, wangxiongfeng2@huawei.com,
	wangyanan55@huawei.com, wangzhou1@hisilicon.com,
	linuxarm@huawei.com, jiakernel2@gmail.com, maobibo@loongson.cn,
	lixianglai@loongson.cn, shahuang@redhat.com, zhao1.liu@intel.com

Hi Salil,

> On 1 Oct 2025, at 01:01, salil.mehta@opnsrc.net wrote:
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> 
> Add support for a new SMP configuration parameter, 'disabledcpus', which
> specifies the number of additional CPUs that are present in the virtual
> machine but administratively disabled at boot. These CPUs are visible in
> firmware (e.g. ACPI tables) yet unavailable to the guest until explicitly
> enabled via QMP/HMP, or via the 'device_set' API (introduced in later
> patches).
> 
> This feature is intended for architectures that lack native CPU hotplug
> support but can change the administrative power state of present CPUs.
> It allows simulating CPU hot-add–like scenarios while all CPUs remain
> physically present in the topology at boot time.
> 
> Note: ARM is the first architecture to support this concept.
> 
> Changes include:
> - Extend CpuTopology with a 'disabledcpus' field.
> - Update machine_parse_smp_config() to account for disabled CPUs when
>   computing 'cpus' and 'maxcpus'.
> - Update SMPConfiguration in QAPI to accept 'disabledcpus'.
> - Extend -smp option documentation to describe 'disabledcpus' usage and
>   behavior.
> 

Specifying a new parameter for the user seems unnecessary when the system could
infer the number of present-but-disabled CPUs as (maxcpus - cpus); what this
patch calls "disabledcpus" could be obtained that way.
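
As a rough sketch of that inference (sketch_inactive_cpus() is a hypothetical
helper, not something in this patch or in QEMU):

```c
#include <assert.h>

/*
 * Hypothetical helper: number of CPUs that are present but not brought
 * online at boot, derived from the two existing -smp values.
 */
static unsigned sketch_inactive_cpus(unsigned cpus, unsigned maxcpus)
{
    assert(maxcpus >= cpus);    /* caller must have validated the topology */
    return maxcpus - cpus;
}
```

For example, -smp cpus=4,maxcpus=6 would yield 2 such CPUs. The quoted
documentation also shows -smp cpus=4,disabledcpus=2,maxcpus=8, where the
difference is 4; the sketch does not address how that case would be expressed.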

Naming is hard, but in my opinion we shouldn't be calling these 'disabledcpus'
here. I understand the name is carried over from the administrative power
state of the previous patch, but machine-smp sits at a different abstraction
level, so the administrative power state concept could be decoupled from the
machine-smp realm.

My suggestion would be calling those cpus 'inactive' and not carry previous
patch's nomenclature.

CPUs in the 'inactive' state are still present in the virtual machine, although
this pre-condition may require follow-up actions, such as being explicitly
'enabled'/activated via [QH]MP.

Overall, I believe the above should be all it takes to simplify the
accommodation of CPUs that are not to be brought online at boot time within
this patch's context.

Thanks
Miguel


> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> ---
> hw/core/machine-smp.c | 24 +++++++-----
> include/hw/boards.h   |  2 +
> qapi/machine.json     |  3 ++
> qemu-options.hx       | 86 +++++++++++++++++++++++++++++++++----------
> system/vl.c           |  3 ++
> 5 files changed, 89 insertions(+), 29 deletions(-)
> 
> diff --git a/hw/core/machine-smp.c b/hw/core/machine-smp.c
> index 0be0ac044c..c1a09fdc3f 100644
> --- a/hw/core/machine-smp.c
> +++ b/hw/core/machine-smp.c
> @@ -87,6 +87,7 @@ void machine_parse_smp_config(MachineState *ms,
> {
>     MachineClass *mc = MACHINE_GET_CLASS(ms);
>     unsigned cpus    = config->has_cpus ? config->cpus : 0;
> +    unsigned disabledcpus = config->has_disabledcpus ? config->disabledcpus : 0;
>     unsigned drawers = config->has_drawers ? config->drawers : 0;
>     unsigned books   = config->has_books ? config->books : 0;
>     unsigned sockets = config->has_sockets ? config->sockets : 0;
> @@ -166,8 +167,13 @@ void machine_parse_smp_config(MachineState *ms,
>         sockets = sockets > 0 ? sockets : 1;
>         cores = cores > 0 ? cores : 1;
>         threads = threads > 0 ? threads : 1;
> +
> +        maxcpus = drawers * books * sockets * dies * clusters *
> +                    modules * cores * threads;
> +        cpus = maxcpus - disabledcpus;
>     } else {
> -        maxcpus = maxcpus > 0 ? maxcpus : cpus;
> +        maxcpus = maxcpus > 0 ? maxcpus : cpus + disabledcpus;
> +        cpus = cpus > 0 ? cpus : maxcpus - disabledcpus;
> 
>         if (mc->smp_props.prefer_sockets) {
>             /* prefer sockets over cores before 6.2 */
> @@ -207,12 +213,8 @@ void machine_parse_smp_config(MachineState *ms,
>         }
>     }
> 
> -    total_cpus = drawers * books * sockets * dies *
> -                 clusters * modules * cores * threads;
> -    maxcpus = maxcpus > 0 ? maxcpus : total_cpus;
> -    cpus = cpus > 0 ? cpus : maxcpus;
> -
>     ms->smp.cpus = cpus;
> +    ms->smp.disabledcpus = disabledcpus;
>     ms->smp.drawers = drawers;
>     ms->smp.books = books;
>     ms->smp.sockets = sockets;
> @@ -226,6 +228,8 @@ void machine_parse_smp_config(MachineState *ms,
>     mc->smp_props.has_clusters = config->has_clusters;
> 
>     /* sanity-check of the computed topology */
> +    total_cpus = maxcpus = drawers * books * sockets * dies * clusters *
> +                modules * cores * threads;
>     if (total_cpus != maxcpus) {
>         g_autofree char *topo_msg = cpu_hierarchy_to_string(ms);
>         error_setg(errp, "Invalid CPU topology: "
> @@ -235,12 +239,12 @@ void machine_parse_smp_config(MachineState *ms,
>         return;
>     }
> 
> -    if (maxcpus < cpus) {
> +    if (maxcpus < (cpus + disabledcpus)) {
>         g_autofree char *topo_msg = cpu_hierarchy_to_string(ms);
>         error_setg(errp, "Invalid CPU topology: "
> -                   "maxcpus must be equal to or greater than smp: "
> -                   "%s == maxcpus (%u) < smp_cpus (%u)",
> -                   topo_msg, maxcpus, cpus);
> +                   "maxcpus must be equal to or greater than smp[+disabledcpus]:"
> +                   "%s == maxcpus (%u) < smp_cpus (%u) [+ offline cpus (%u)]",
> +                   topo_msg, maxcpus, cpus, disabledcpus);
>         return;
>     }
> 
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index f94713e6e2..2b182d7817 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -361,6 +361,7 @@ typedef struct DeviceMemoryState {
> /**
>  * CpuTopology:
>  * @cpus: the number of present logical processors on the machine
> + * @disabledcpus: the number additional present but admin disabled cpus
>  * @drawers: the number of drawers on the machine
>  * @books: the number of books in one drawer
>  * @sockets: the number of sockets in one book
> @@ -373,6 +374,7 @@ typedef struct DeviceMemoryState {
>  */
> typedef struct CpuTopology {
>     unsigned int cpus;
> +    unsigned int disabledcpus;
>     unsigned int drawers;
>     unsigned int books;
>     unsigned int sockets;
> diff --git a/qapi/machine.json b/qapi/machine.json
> index 038eab281c..e45740da33 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -1634,6 +1634,8 @@
> #
> # @cpus: number of virtual CPUs in the virtual machine
> #
> +# @disabledcpus: number of additional present but disabled(or offline) CPUs
> +#
> # @maxcpus: maximum number of hotpluggable virtual CPUs in the virtual
> #     machine
> #
> @@ -1657,6 +1659,7 @@
> ##
> { 'struct': 'SMPConfiguration', 'data': {
>      '*cpus': 'int',
> +     '*disabledcpus': 'int',
>      '*drawers': 'int',
>      '*books': 'int',
>      '*sockets': 'int',
> diff --git a/qemu-options.hx b/qemu-options.hx
> index ab23f14d21..83ccde341b 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -326,12 +326,15 @@ SRST
> ERST
> 
> DEF("smp", HAS_ARG, QEMU_OPTION_smp,
> -    "-smp [[cpus=]n][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets]\n"
> -    "               [,dies=dies][,clusters=clusters][,modules=modules][,cores=cores]\n"
> -    "               [,threads=threads]\n"
> -    "                set the number of initial CPUs to 'n' [default=1]\n"
> -    "                maxcpus= maximum number of total CPUs, including\n"
> -    "                offline CPUs for hotplug, etc\n"
> +    "-smp [[cpus=]n][,disabledcpus=disabledcpus][,maxcpus=maxcpus][,drawers=drawers][,books=books]\n"
> +    "               [,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules]\n"
> +    "               [,cores=cores][,threads=threads]\n"
> +    "                set the initial number of CPUs present and\n"
> +    "                  administratively enabled at boot time to 'n' [default=1]\n"
> +    "                disabledcpus= number of present but administratively\n"
> +    "                  disabled CPUs (unavailable to the guest at boot)\n"
> +    "                maxcpus= maximum total CPUs (present + hotpluggable)\n"
> +    "                  on machines without CPU hotplug, defaults to n + disabledcpus\n"
>     "                drawers= number of drawers on the machine board\n"
>     "                books= number of books in one drawer\n"
>     "                sockets= number of sockets in one book\n"
> @@ -351,22 +354,49 @@ DEF("smp", HAS_ARG, QEMU_OPTION_smp,
>     "      For a particular machine type board, an expected CPU topology hierarchy\n"
>     "      can be defined through the supported sub-option. Unsupported parameters\n"
>     "      can also be provided in addition to the sub-option, but their values\n"
> -    "      must be set as 1 in the purpose of correct parsing.\n",
> +    "      must be set as 1 in the purpose of correct parsing.\n"
> +    "                                                          \n"
> +    "      Administratively disabled CPUs: Some machine types do not support vCPU\n"
> +    "      hotplug but their CPUs can be marked disabled (powered off) and kept\n"
> +    "      unavailable to the guest. Later, such CPUs can be enabled via QMP/HMP\n"
> +    "      (e.g., 'device_set ... admin-state=enable'). This is similar to hotplug,\n"
> +    "      except all disabled CPUs are already present at boot. Useful on\n"
> +    "      architectures that lack architectural CPU hotplug.\n",
>     QEMU_ARCH_ALL)
> SRST
> -``-smp [[cpus=]n][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules][,cores=cores][,threads=threads]``
> -    Simulate a SMP system with '\ ``n``\ ' CPUs initially present on
> -    the machine type board. On boards supporting CPU hotplug, the optional
> -    '\ ``maxcpus``\ ' parameter can be set to enable further CPUs to be
> -    added at runtime. When both parameters are omitted, the maximum number
> +``-smp [[cpus=]n][,disabledcpus=disabledcpus][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules][,cores=cores][,threads=threads]``
> +    Simulate a SMP system with '\ ``n``\ ' CPUs initially present & enabled on
> +    the machine type board. Furthermore, on architectures that support changing
> +    the administrative power state of CPUs, optional '\ ``disabledcpus``\ '
> +    parameter specifies *additional* CPUs that are present in firmware (e.g.,
> +    ACPI) but are administratively disabled (i.e., not usable by the guest at
> +    boot time).
> +
> +    This is different from CPU hotplug where additional CPUs are not even
> +    present in the system description. Administratively disabled CPUs appear in
> +    ACPI tables i.e. are provisioned, but cannot be used until explicitly
> +    enabled via QMP/HMP or the deviceset API.
> +
> +    On boards supporting CPU hotplug, the optional '\ ``maxcpus``\ ' parameter
> +    can be set to enable further CPUs to be added at runtime. When both
> +    '\ ``n``\ ' & '\ ``maxcpus``\ ' parameters are omitted, the maximum number
>     of CPUs will be calculated from the provided topology members and the
> -    initial CPU count will match the maximum number. When only one of them
> -    is given then the omitted one will be set to its counterpart's value.
> -    Both parameters may be specified, but the maximum number of CPUs must
> -    be equal to or greater than the initial CPU count. Product of the
> -    CPU topology hierarchy must be equal to the maximum number of CPUs.
> -    Both parameters are subject to an upper limit that is determined by
> -    the specific machine type chosen.
> +    initial CPU count will match the maximum number. When only one of them is
> +    given then the omitted one will be set to its counterpart's value. Both
> +    parameters may be specified, but the maximum number of CPUs must be equal
> +    to or greater than the initial CPU count. Product of the CPU topology
> +    hierarchy must be equal to the maximum number of CPUs. Both parameters are
> +    subject to an upper limit that is determined by the specific machine type
> +    chosen. Boards that support administratively disabled CPUs but do *not*
> +    support CPU hotplug derive the maximum number of CPUs implicitly:
> +    '\ ``maxcpus``\ ' is treated as '\ ``n + disabledcpus``\ ' (the total CPUs
> +    present in firmware). If '\ ``maxcpus``\ ' is provided, it must equal
> +    '\ ``n + disabledcpus``\ '. The topology product must equal this derived
> +    maximum as well.
> +
> +    Note: Administratively disabled CPUs will appear to the guest as
> +    unavailable, and any attempt to bring them online must go through QMP/HMP
> +    commands like 'device_set'.
> 
>     To control reporting of CPU topology information, values of the topology
>     parameters can be specified. Machines may only support a subset of the
> @@ -425,6 +455,24 @@ SRST
> 
>         -smp 2
> 
> +    Examples using 'disabledcpus':
> +
> +    For a board without CPU hotplug, enable 4 CPUs at boot and provision
> +    2 additional administratively disabled CPUs (maximum is derived
> +    implicitly as 6 = 4 + 2):
> +
> +    ::
> +
> +        -smp cpus=4,disabledcpus=2
> +
> +    For a board that supports CPU hotplug and 'disabledcpus', enable 4 CPUs
> +    at boot, provision 2 administratively disabled CPUs, and allow hotplug of
> +    2 more CPUs (for a maximum of 8):
> +
> +    ::
> +
> +        -smp cpus=4,disabledcpus=2,maxcpus=8
> +
>     Note: The cluster topology will only be generated in ACPI and exposed
>     to guest if it's explicitly specified in -smp.
> ERST
> diff --git a/system/vl.c b/system/vl.c
> index 3b7057e6c6..2f0fd21a1f 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -736,6 +736,9 @@ static QemuOptsList qemu_smp_opts = {
>         {
>             .name = "cpus",
>             .type = QEMU_OPT_NUMBER,
> +        }, {
> +            .name = "disabledcpus",
> +            .type = QEMU_OPT_NUMBER,
>         }, {
>             .name = "drawers",
>             .type = QEMU_OPT_NUMBER,
> -- 
> 2.34.1
> 



* Re: [PATCH RFC V6 02/24] hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter
  2025-10-01  1:01 ` [PATCH RFC V6 02/24] hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter salil.mehta
  2025-10-09 11:28   ` Miguel Luis
@ 2025-10-09 11:51   ` Markus Armbruster
  1 sibling, 0 replies; 67+ messages in thread
From: Markus Armbruster @ 2025-10-09 11:51 UTC (permalink / raw)
  To: salil.mehta
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, andrew.jones, david, philmd, eric.auger, will, ardb,
	oliver.upton, pbonzini, gshan, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu

salil.mehta@opnsrc.net writes:

> From: Salil Mehta <salil.mehta@huawei.com>
>
> Add support for a new SMP configuration parameter, 'disabledcpus', which
> specifies the number of additional CPUs that are present in the virtual
> machine but administratively disabled at boot. These CPUs are visible in
> firmware (e.g. ACPI tables) yet unavailable to the guest until explicitly
> enabled via QMP/HMP, or via the 'device_set' API (introduced in later
> patches).
>
> This feature is intended for architectures that lack native CPU hotplug
> support but can change the administrative power state of present CPUs.
> It allows simulating CPU hot-add–like scenarios while all CPUs remain
> physically present in the topology at boot time.
>
> Note: ARM is the first architecture to support this concept.
>
> Changes include:
>  - Extend CpuTopology with a 'disabledcpus' field.
>  - Update machine_parse_smp_config() to account for disabled CPUs when
>    computing 'cpus' and 'maxcpus'.
>  - Update SMPConfiguration in QAPI to accept 'disabledcpus'.
>  - Extend -smp option documentation to describe 'disabledcpus' usage and
>    behavior.
>
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>

[...]

> diff --git a/qapi/machine.json b/qapi/machine.json
> index 038eab281c..e45740da33 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -1634,6 +1634,8 @@
>  #
>  # @cpus: number of virtual CPUs in the virtual machine
>  #
> +# @disabledcpus: number of additional present but disabled(or offline) CPUs
> +#
>  # @maxcpus: maximum number of hotpluggable virtual CPUs in the virtual
>  #     machine
>  #
> @@ -1657,6 +1659,7 @@
>  ##
>  { 'struct': 'SMPConfiguration', 'data': {
>       '*cpus': 'int',
> +     '*disabledcpus': 'int',
>       '*drawers': 'int',
>       '*books': 'int',
>       '*sockets': 'int',

We prefer words-with-dashes to wordsruntogether in QAPI/QMP, for
readability: disabled-cpus.

Missing here even before the patch: how these values are related.  That
information appears to be buried in machine_parse_smp_config(), which is
160 lines long and with too many conditionals for me to make sense of
now.
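
As far as the quoted documentation and hunks go, the relations seem to come
down to two checks: the topology product must equal maxcpus, and maxcpus must
be at least cpus + disabledcpus. A minimal sketch of that reading follows; the
struct and function names are made up for illustration and are not QEMU's.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the -smp values; not QEMU's CpuTopology. */
struct smp_values {
    unsigned cpus;             /* enabled at boot */
    unsigned disabledcpus;     /* present but administratively disabled */
    unsigned maxcpus;          /* upper bound for the VM's lifetime */
    unsigned topology_product; /* drawers * books * ... * cores * threads */
};

/* Returns true when both documented relations hold. */
static bool smp_values_consistent(const struct smp_values *v)
{
    assert(v);

    /* The topology product must match maxcpus exactly. */
    if (v->topology_product != v->maxcpus) {
        return false;
    }
    /* cpus + disabledcpus <= maxcpus, written to avoid unsigned wraparound. */
    return v->cpus <= v->maxcpus &&
           v->disabledcpus <= v->maxcpus - v->cpus;
}
```

With this reading, cpus=4,disabledcpus=2,maxcpus=8 is accepted only when the
topology parameters multiply out to 8.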

> diff --git a/qemu-options.hx b/qemu-options.hx
> index ab23f14d21..83ccde341b 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -326,12 +326,15 @@ SRST
>  ERST
>  
>  DEF("smp", HAS_ARG, QEMU_OPTION_smp,
> -    "-smp [[cpus=]n][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets]\n"
> -    "               [,dies=dies][,clusters=clusters][,modules=modules][,cores=cores]\n"
> -    "               [,threads=threads]\n"
> -    "                set the number of initial CPUs to 'n' [default=1]\n"
> -    "                maxcpus= maximum number of total CPUs, including\n"
> -    "                offline CPUs for hotplug, etc\n"
> +    "-smp [[cpus=]n][,disabledcpus=disabledcpus][,maxcpus=maxcpus][,drawers=drawers][,books=books]\n"
> +    "               [,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules]\n"
> +    "               [,cores=cores][,threads=threads]\n"
> +    "                set the initial number of CPUs present and\n"
> +    "                  administratively enabled at boot time to 'n' [default=1]\n"
> +    "                disabledcpus= number of present but administratively\n"
> +    "                  disabled CPUs (unavailable to the guest at boot)\n"
> +    "                maxcpus= maximum total CPUs (present + hotpluggable)\n"
> +    "                  on machines without CPU hotplug, defaults to n + disabledcpus\n"
>      "                drawers= number of drawers on the machine board\n"
>      "                books= number of books in one drawer\n"
>      "                sockets= number of sockets in one book\n"
> @@ -351,22 +354,49 @@ DEF("smp", HAS_ARG, QEMU_OPTION_smp,
>      "      For a particular machine type board, an expected CPU topology hierarchy\n"
>      "      can be defined through the supported sub-option. Unsupported parameters\n"
>      "      can also be provided in addition to the sub-option, but their values\n"
> -    "      must be set as 1 in the purpose of correct parsing.\n",
> +    "      must be set as 1 in the purpose of correct parsing.\n"
> +    "                                                          \n"
> +    "      Administratively disabled CPUs: Some machine types do not support vCPU\n"
> +    "      hotplug but their CPUs can be marked disabled (powered off) and kept\n"
> +    "      unavailable to the guest. Later, such CPUs can be enabled via QMP/HMP\n"
> +    "      (e.g., 'device_set ... admin-state=enable'). This is similar to hotplug,\n"
> +    "      except all disabled CPUs are already present at boot. Useful on\n"
> +    "      architectures that lack architectural CPU hotplug.\n",
>      QEMU_ARCH_ALL)
>  SRST
> -``-smp [[cpus=]n][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules][,cores=cores][,threads=threads]``
> -    Simulate a SMP system with '\ ``n``\ ' CPUs initially present on
> -    the machine type board. On boards supporting CPU hotplug, the optional
> -    '\ ``maxcpus``\ ' parameter can be set to enable further CPUs to be
> -    added at runtime. When both parameters are omitted, the maximum number
> +``-smp [[cpus=]n][,disabledcpus=disabledcpus][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules][,cores=cores][,threads=threads]``
> +    Simulate a SMP system with '\ ``n``\ ' CPUs initially present & enabled on
> +    the machine type board. Furthermore, on architectures that support changing
> +    the administrative power state of CPUs, optional '\ ``disabledcpus``\ '
> +    parameter specifies *additional* CPUs that are present in firmware (e.g.,
> +    ACPI) but are administratively disabled (i.e., not usable by the guest at
> +    boot time).
> +
> +    This is different from CPU hotplug where additional CPUs are not even
> +    present in the system description. Administratively disabled CPUs appear in
> +    ACPI tables i.e. are provisioned, but cannot be used until explicitly
> +    enabled via QMP/HMP or the deviceset API.
> +
> +    On boards supporting CPU hotplug, the optional '\ ``maxcpus``\ ' parameter
> +    can be set to enable further CPUs to be added at runtime. When both
> +    '\ ``n``\ ' & '\ ``maxcpus``\ ' parameters are omitted, the maximum number
>      of CPUs will be calculated from the provided topology members and the
> -    initial CPU count will match the maximum number. When only one of them
> -    is given then the omitted one will be set to its counterpart's value.
> -    Both parameters may be specified, but the maximum number of CPUs must
> -    be equal to or greater than the initial CPU count. Product of the
> -    CPU topology hierarchy must be equal to the maximum number of CPUs.
> -    Both parameters are subject to an upper limit that is determined by
> -    the specific machine type chosen.
> +    initial CPU count will match the maximum number. When only one of them is
> +    given then the omitted one will be set to its counterpart's value. Both
> +    parameters may be specified, but the maximum number of CPUs must be equal
> +    to or greater than the initial CPU count. Product of the CPU topology
> +    hierarchy must be equal to the maximum number of CPUs. Both parameters are
> +    subject to an upper limit that is determined by the specific machine type
> +    chosen. Boards that support administratively disabled CPUs but do *not*
> +    support CPU hotplug derive the maximum number of CPUs implicitly:
> +    '\ ``maxcpus``\ ' is treated as '\ ``n + disabledcpus``\ ' (the total CPUs
> +    present in firmware). If '\ ``maxcpus``\ ' is provided, it must equal
> +    '\ ``n + disabledcpus``\ '. The topology product must equal this derived
> +    maximum as well.
> +
> +    Note: Administratively disabled CPUs will appear to the guest as
> +    unavailable, and any attempt to bring them online must go through QMP/HMP
> +    commands like 'device_set'.
>  
>      To control reporting of CPU topology information, values of the topology
>      parameters can be specified. Machines may only support a subset of the
> @@ -425,6 +455,24 @@ SRST
>  
>          -smp 2
>  
> +    Examples using 'disabledcpus':
> +
> +    For a board without CPU hotplug, enable 4 CPUs at boot and provision
> +    2 additional administratively disabled CPUs (maximum is derived
> +    implicitly as 6 = 4 + 2):
> +
> +    ::
> +
> +        -smp cpus=4,disabledcpus=2
> +
> +    For a board that supports CPU hotplug and 'disabledcpus', enable 4 CPUs
> +    at boot, provision 2 administratively disabled CPUs, and allow hotplug of
> +    2 more CPUs (for a maximum of 8):
> +
> +    ::
> +
> +        -smp cpus=4,disabledcpus=2,maxcpus=8
> +

So, maxcpus = cpus + disabledcpus?

Is drawers * books * sockets * dies * clusters * modules * cores *
threads equal to or greater than maxcpus?

>      Note: The cluster topology will only be generated in ACPI and exposed
>      to guest if it's explicitly specified in -smp.
>  ERST
> diff --git a/system/vl.c b/system/vl.c
> index 3b7057e6c6..2f0fd21a1f 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -736,6 +736,9 @@ static QemuOptsList qemu_smp_opts = {
>          {
>              .name = "cpus",
>              .type = QEMU_OPT_NUMBER,
> +        }, {
> +            .name = "disabledcpus",
> +            .type = QEMU_OPT_NUMBER,
>          }, {
>              .name = "drawers",
>              .type = QEMU_OPT_NUMBER,




* Re: [PATCH RFC V6 23/24] monitor,qapi: add 'info cpus-powerstate' and QMP query (Admin + Oper states)
  2025-10-01  1:01 ` [PATCH RFC V6 23/24] monitor, qapi: add 'info cpus-powerstate' and QMP query (Admin + Oper states) salil.mehta
@ 2025-10-09 11:53   ` Markus Armbruster
  0 siblings, 0 replies; 67+ messages in thread
From: Markus Armbruster @ 2025-10-09 11:53 UTC (permalink / raw)
  To: salil.mehta
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, andrew.jones, david, philmd, eric.auger, will, ardb,
	oliver.upton, pbonzini, gshan, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu

salil.mehta@opnsrc.net writes:

> From: Salil Mehta <salil.mehta@huawei.com>
>
> The existing 'info hotpluggable-cpus' applies to platforms with true CPU
> hotplug. On ARM, vCPUs are not hotpluggable: resources are allocated at
> boot and policy is enforced administratively (e.g. via ACPI _STA) to
> achieve a hotplug-like effect. As a result, the hotpluggable interface
> cannot describe ARM CPU state, whether administrative or runtime.
>
> Operators need a clear view of both administrative policy (Enabled,
> Disabled, Removed) and guest runtime status (On, Standby, Off, Unknown)
> for all possible vCPUs. This separation is essential to debug CPU life
> cycle flows on ARM, where PSCI CPU_ON/CPU_OFF and ACPI methods are used,
> and to distinguish CPUs that are enumerated but administratively blocked
> from those actually executing in the guest.
>
> The new interface is independent of hotplug and coexists with 'info
> hotpluggable-cpus' on platforms that support it (e.g. x86). By default
> devices are administratively Enabled; on hotpluggable systems, absent
> CPUs appear as Removed here.
>
> This patch introduces:
>   * QMP 'query-cpus-powerstate' returning CPUPowerStateInfo per possible
>     vCPU.
>   * HMP 'info cpus-powerstate' for human-readable output.
>   * Enums:
>       - CPUPowerAdminState { enabled, disabled, removed }
>       - CPUOperPowerState  { on, standby, off, unknown }
>   * CPUPowerStateInfo with admin/oper state, optional topology ids, and
>     qom-path.
>
> Operational state semantics:
>   * 'on'      : CPU is on and runnable.
>   * 'standby' : Reserved for suspend-with-context (e.g. PSCI CPU_SUSPEND).
>                 Not emitted yet.
>   * 'off'     : CPU is powered off.
>                 - At initial boot, admin-disabled vCPUs may be left
>                   unrealized (lazy realize) and are reported Off.
>                 - After an admin enable, the vCPU is realized; if later
>                   powered down, it remains realized and reported Off.
>   * 'unknown' : State cannot be determined (very early init/teardown,
>                 transient hot-(un)plug window, or no power-state handler).
>
> Migration semantics:
>   * Admin-disabled (unrealized) vCPUs do not migrate.
>   * Admin-enabled vCPUs migrate their operational state, including Off.
>
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>

[...]

> diff --git a/qapi/machine.json b/qapi/machine.json
> index e45740da33..3856785b27 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -1069,6 +1069,93 @@
>  { 'command': 'query-hotpluggable-cpus', 'returns': ['HotpluggableCPU'],
>               'allow-preconfig': true }
>  
> +##
> +# @CPUOperPowerState:
> +#
> +# Guest-visible operational state of the CPU.
> +# This reflects runtime status such as guest online/offline status or
> +# suspended state (e.g., CPU halted, suspended in a WFI loop).
> +#
> +# .. note::
> +#    This field is read-only. It is derived by QEMU from runtime
> +#    information (e.g., CPU execution/architectural state, PSCI power
> +#    status, vCPU runstate) and cannot be set by management tools or
> +#    user commands.
> +#
> +# @on: CPU is online and executing.
> +# @standby: CPU is idle or suspended (e.g., WFI).
> +# @off: CPU is guest-offlined or halted.
> +# @unknown: State cannot be determined at this time (e.g., very early
> +#           init/teardown, transient hotplug/hotremove window, no
> +#           power-state handler registered, or the target/platform does
> +#           not expose a queryable CPU state).
> +##
> +{ 'enum': 'CPUOperPowerState',
> +  'data': ['on', 'standby', 'off', 'unknown'] }
> +
> +##
> +# @CPUAdminPowerState:
> +#
> +# Host-side administrative power state of the CPU device.
> +# Controls guest visibility and lifecycle.
> +#
> +# @enabled: CPU is administratively enabled (can be used by guest)
> +# @disabled: CPU is administratively disabled (guest-visible but unusable)
> +# @removed: CPU is logically removed (not visible to guest)
> +##
> +{ 'enum': 'CPUAdminPowerState',
> +  'data': ['enabled', 'disabled', 'removed'] }
> +
> +##
> +# @CPUPowerStateInfo:
> +#
> +# CPU status combining both administrative and operational/runtime state.
> +#
> +# @id: CPU index
> +# @core-id: Core ID (optional)
> +# @socket-id: Socket ID (optional)
> +# @cluster-id: Cluster ID (optional)
> +# @thread-id: Thread ID (optional)
> +# @node-id: NUMA node ID (optional)
> +# @drawer-id: Drawer ID (optional)
> +# @book-id: Book ID (optional)
> +# @die-id: Die ID (optional)
> +# @module-id: Module ID (optional)
> +# @vcpus-count: Number of threads under this logical CPU (optional)
> +# @qom-path: QOM object path (optional)
> +# @admin-state: Administrative power state (enabled/disabled/removed)
> +# @oper-state: Guest-visible runtime power state (on/standby/off)
> +##
> +{ 'struct': 'CPUPowerStateInfo',
> +  'data': {
> +    'id': 'int',
> +    '*core-id': 'int',
> +    '*socket-id': 'int',
> +    '*cluster-id': 'int',
> +    '*thread-id': 'int',
> +    '*node-id': 'int',
> +    '*drawer-id': 'int',
> +    '*book-id': 'int',
> +    '*die-id': 'int',
> +    '*module-id': 'int',
> +    '*vcpus-count': 'int',
> +    '*qom-path': 'str',
> +    'admin-state': 'CPUAdminPowerState',
> +    'oper-state': 'CPUOperPowerState'
> +  } }
> +
> +##
> +# @query-cpus-power-state:
> +#
> +# Returns all CPUs and their power state info, combining host policy and
> +# runtime guest status. This is useful for debugging vCPU hotplug,
> +# suspend/resume, admin power states or offline state flows.
> +#
> +# Returns: a list of @CPUPowerStateInfo
> +##
> +{ 'command': 'query-cpus-power-state',
> +  'returns': ['CPUPowerStateInfo'] }
> +
>  ##
>  # @set-numa-node:
>  #

Have you considered adding the information to existing query-cpus-fast?




* Re: [PATCH RFC V6 03/24] hw/arm/virt: Clamp 'maxcpus' as-per machine's vCPU deferred online-capability
  2025-10-01  1:01 ` [PATCH RFC V6 03/24] hw/arm/virt: Clamp 'maxcpus' as-per machine's vCPU deferred online-capability salil.mehta
@ 2025-10-09 12:32   ` Miguel Luis
  2025-10-09 13:11     ` Igor Mammedov
  0 siblings, 1 reply; 67+ messages in thread
From: Miguel Luis @ 2025-10-09 12:32 UTC (permalink / raw)
  To: salil.mehta@opnsrc.net
  Cc: qemu-devel@nongnu.org, qemu-arm@nongnu.org, mst@redhat.com,
	salil.mehta@huawei.com, maz@kernel.org, jean-philippe@linaro.org,
	jonathan.cameron@huawei.com, lpieralisi@kernel.org,
	peter.maydell@linaro.org, richard.henderson@linaro.org,
	imammedo@redhat.com, armbru@redhat.com, andrew.jones@linux.dev,
	david@redhat.com, philmd@linaro.org, eric.auger@redhat.com,
	will@kernel.org, ardb@kernel.org, oliver.upton@linux.dev,
	pbonzini@redhat.com, gshan@redhat.com, rafael@kernel.org,
	borntraeger@linux.ibm.com, alex.bennee@linaro.org,
	gustavo.romero@linaro.org, npiggin@gmail.com,
	harshpb@linux.ibm.com, linux@armlinux.org.uk,
	darren@os.amperecomputing.com, ilkka@os.amperecomputing.com,
	vishnu@os.amperecomputing.com, gankulkarni@os.amperecomputing.com,
	Karl Heubaum, zhukeqian1@huawei.com, wangxiongfeng2@huawei.com,
	wangyanan55@huawei.com, wangzhou1@hisilicon.com,
	linuxarm@huawei.com, jiakernel2@gmail.com, maobibo@loongson.cn,
	lixianglai@loongson.cn, shahuang@redhat.com, zhao1.liu@intel.com

Hi Salil,

> On 1 Oct 2025, at 01:01, salil.mehta@opnsrc.net wrote:
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> 
> To support a vCPU hot-add–like model on ARM, the virt machine may be setup with
> more CPUs than are active at boot. These additional CPUs are fully realized in
> KVM and listed in ACPI tables from the start, but begin in a disabled state.
> They can later be brought online or taken offline under host or platform policy
> control. The CPU topology is fixed at VM creation time and cannot change
> dynamically on ARM. Therefore, we must determine precisely the 'maxcpus' value
> that applies for the full lifetime of the VM.
> 
> On ARM, this deferred online-capable model is only valid if:
>  - The GIC version is 3 or higher, and
>  - Each non-boot CPU’s GIC CPU Interface is marked “online-capable” in its
>    ACPI GICC structure (UEFI ACPI Specification 6.5, §5.2.12.14, Table 5.37
>    “GICC CPU Interface Flags”), and
>  - The chosen accelerator supports safe deferred CPU online:
>      * TCG with multi-threaded TCG (MTTCG) enabled
>      * KVM (on supported hosts)
>      * Not HVF or QTest
> 
> This patch sizes the machine’s max-possible CPUs during VM init:
>  - If all conditions are satisfied, retain the full set of CPUs corresponding
>    to (`-smp cpus` + `-smp disabledcpus`), allowing the additional (initially
>    disabled) CPUs to participate in later policy-driven online.
>  - Otherwise, clamp the max-possible CPUs to the boot-enabled count
>    (`-smp disabledcpus=0` equivalent) to avoid advertising CPUs the guest can
>    never use.
> 
> A new MachineClass flag, `has_online_capable_cpus`, records whether the machine
> supports deferred vCPU online. This is usable by other machine types as well.


By the definition of

 * @has_hotpluggable_cpus:
 *    If true, board supports CPUs creation with -device/device_add.

 in include/hw/boards.h,

it seems one could reuse MachineClass's has_hotpluggable_cpus field instead of
creating a new has_online_capable_cpus one.
(Again, IMHO ‘online capable’ is ACPI nomenclature and doesn’t need to be
brought into MachineClass.)

Variable which would be initialized in machvirt_init on an assignment based on
GIC version and/or wether there's inactive CPUs and proceed from there anyways,
making the default assignment in machine_virt_class_init superfluous.

We're at hw/arm/virt and we know these CPUs are administratively power state
coordinated so admin_power_state_supported can still be set there in the
presence of inactive CPUs.
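
A rough sketch of what I mean, with trimmed-down stand-in structs rather
than the real MachineClass/DeviceClass. Combining the two conditions with
"and", and mirroring the flag into admin_power_state_supported, are
assumptions made for the sake of the example:

```c
#include <stdbool.h>

/* Hypothetical stand-ins; not QEMU's MachineClass or DeviceClass. */
struct sketch_machine_class {
    bool has_hotpluggable_cpus;
};

struct sketch_cpu_class {
    bool admin_power_state_supported;
};

/*
 * Decide the flag at machine init from run-time facts, so that no default
 * assignment is needed at class-init time. gic_version is a plain integer
 * here and inactive_cpus is the number of CPUs not enabled at boot.
 */
static void sketch_init_cpu_flags(struct sketch_machine_class *mc,
                                  struct sketch_cpu_class *cc,
                                  int gic_version,
                                  unsigned int inactive_cpus)
{
    mc->has_hotpluggable_cpus = gic_version >= 3 && inactive_cpus > 0;
    cc->admin_power_state_supported = mc->has_hotpluggable_cpus;
}
```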

Thanks
Miguel

> 
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> ---
> hw/arm/virt.c       | 84 ++++++++++++++++++++++++++++++---------------
> include/hw/boards.h |  1 +
> 2 files changed, 57 insertions(+), 28 deletions(-)
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index ef6be3660f..76f21bd56a 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -2168,8 +2168,7 @@ static void machvirt_init(MachineState *machine)
>     bool has_ged = !vmc->no_ged;
>     unsigned int smp_cpus = machine->smp.cpus;
>     unsigned int max_cpus = machine->smp.max_cpus;
> -
> -    possible_cpus = mc->possible_cpu_arch_ids(machine);
> +    DeviceClass *dc;
> 
>     /*
>      * In accelerated mode, the memory map is computed earlier in kvm_type()
> @@ -2186,7 +2185,7 @@ static void machvirt_init(MachineState *machine)
>          * we are about to deal with. Once this is done, get rid of
>          * the object.
>          */
> -        cpuobj = object_new(possible_cpus->cpus[0].type);
> +        cpuobj = object_new(machine->cpu_type);
>         armcpu = ARM_CPU(cpuobj);
> 
>         pa_bits = arm_pamax(armcpu);
> @@ -2201,6 +2200,57 @@ static void machvirt_init(MachineState *machine)
>      */
>     finalize_gic_version(vms);
> 
> +    /*
> +     * The maximum number of CPUs depends on the GIC version, or on how
> +     * many redistributors we can fit into the memory map (which in turn
> +     * depends on whether this is a GICv3 or v4).
> +     */
> +    if (vms->gic_version == VIRT_GIC_VERSION_2) {
> +        virt_max_cpus = GIC_NCPU;
> +    } else {
> +        virt_max_cpus = virt_redist_capacity(vms, VIRT_GIC_REDIST);
> +        if (vms->highmem_redists) {
> +            virt_max_cpus += virt_redist_capacity(vms, VIRT_HIGH_GIC_REDIST2);
> +        }
> +    }
> +
> +    if ((tcg_enabled() && !qemu_tcg_mttcg_enabled()) || hvf_enabled() ||
> +        qtest_enabled() || vms->gic_version == VIRT_GIC_VERSION_2) {
> +        max_cpus = machine->smp.max_cpus = smp_cpus;
> +        if (mc->has_online_capable_cpus) {
> +            if (vms->gic_version == VIRT_GIC_VERSION_2) {
> +                warn_report("GICv2 does not support online-capable CPUs");
> +            }
> +            mc->has_online_capable_cpus = false;
> +        }
> +    }
> +
> +    if (mc->has_online_capable_cpus) {
> +        max_cpus = smp_cpus + machine->smp.disabledcpus;
> +        machine->smp.max_cpus = max_cpus;
> +    }
> +
> +    if (max_cpus > virt_max_cpus) {
> +        error_report("Number of SMP CPUs requested (%d) exceeds max CPUs "
> +                     "supported by machine 'mach-virt' (%d)",
> +                     max_cpus, virt_max_cpus);
> +        if (vms->gic_version != VIRT_GIC_VERSION_2 && !vms->highmem_redists) {
> +            error_printf("Try 'highmem-redists=on' for more CPUs\n");
> +        }
> +
> +        exit(1);
> +    }
> +
> +    dc = DEVICE_CLASS(object_class_by_name(machine->cpu_type));
> +    if (!dc) {
> +        error_report("CPU type '%s' not registered", machine->cpu_type);
> +        exit(1);
> +    }
> +    dc->admin_power_state_supported = mc->has_online_capable_cpus;
> +
> +    /* uses smp.max_cpus to initialize all possible vCPUs */
> +    possible_cpus = mc->possible_cpu_arch_ids(machine);
> +
>     if (vms->secure) {
>         /*
>          * The Secure view of the world is the same as the NonSecure,
> @@ -2235,31 +2285,6 @@ static void machvirt_init(MachineState *machine)
>         vms->psci_conduit = QEMU_PSCI_CONDUIT_HVC;
>     }
> 
> -    /*
> -     * The maximum number of CPUs depends on the GIC version, or on how
> -     * many redistributors we can fit into the memory map (which in turn
> -     * depends on whether this is a GICv3 or v4).
> -     */
> -    if (vms->gic_version == VIRT_GIC_VERSION_2) {
> -        virt_max_cpus = GIC_NCPU;
> -    } else {
> -        virt_max_cpus = virt_redist_capacity(vms, VIRT_GIC_REDIST);
> -        if (vms->highmem_redists) {
> -            virt_max_cpus += virt_redist_capacity(vms, VIRT_HIGH_GIC_REDIST2);
> -        }
> -    }
> -
> -    if (max_cpus > virt_max_cpus) {
> -        error_report("Number of SMP CPUs requested (%d) exceeds max CPUs "
> -                     "supported by machine 'mach-virt' (%d)",
> -                     max_cpus, virt_max_cpus);
> -        if (vms->gic_version != VIRT_GIC_VERSION_2 && !vms->highmem_redists) {
> -            error_printf("Try 'highmem-redists=on' for more CPUs\n");
> -        }
> -
> -        exit(1);
> -    }
> -
>     if (vms->secure && !tcg_enabled() && !qtest_enabled()) {
>         error_report("mach-virt: %s does not support providing "
>                      "Security extensions (TrustZone) to the guest CPU",
> @@ -3245,6 +3270,9 @@ static void virt_machine_class_init(ObjectClass *oc, const void *data)
>     hc->plug = virt_machine_device_plug_cb;
>     hc->unplug_request = virt_machine_device_unplug_request_cb;
>     hc->unplug = virt_machine_device_unplug_cb;
> +
> +    mc->has_online_capable_cpus = true;
> +
>     mc->nvdimm_supported = true;
>     mc->smp_props.clusters_supported = true;
>     mc->auto_enable_numa_with_memhp = true;
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 2b182d7817..b27c2326a2 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -302,6 +302,7 @@ struct MachineClass {
>     bool rom_file_has_mr;
>     int minimum_page_bits;
>     bool has_hotpluggable_cpus;
> +    bool has_online_capable_cpus;
>     bool ignore_memory_transaction_failures;
>     int numa_mem_align_shift;
>     const char * const *valid_cpu_types;
> -- 
> 2.34.1
> 



* Re: [PATCH RFC V6 22/24] monitor,qdev: Introduce 'device_set' to change admin state of existing devices
  2025-10-09  8:55   ` [PATCH RFC V6 22/24] monitor,qdev: " Markus Armbruster
@ 2025-10-09 12:51     ` Igor Mammedov
  2025-10-09 14:03       ` Daniel P. Berrangé
  2025-10-09 14:55       ` Markus Armbruster
  0 siblings, 2 replies; 67+ messages in thread
From: Igor Mammedov @ 2025-10-09 12:51 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: salil.mehta, qemu-devel, qemu-arm, mst, salil.mehta, maz,
	jean-philippe, jonathan.cameron, lpieralisi, peter.maydell,
	richard.henderson, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, gshan, rafael, borntraeger,
	alex.bennee, gustavo.romero, npiggin, harshpb, linux, darren,
	ilkka, vishnu, gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, devel

On Thu, 09 Oct 2025 10:55:40 +0200
Markus Armbruster <armbru@redhat.com> wrote:

> salil.mehta@opnsrc.net writes:
> 
> > From: Salil Mehta <salil.mehta@huawei.com>
> >
> > This patch adds a "device_set" interface for modifying properties of devices
> > that already exist in the guest topology. Unlike 'device_add'/'device_del'
> > (hot-plug), 'device_set' does not create or destroy devices. It is intended
> > for guest-visible hot-add semantics where hardware is provisioned at boot but
> > logically enabled/disabled later via administrative policy.
> >
> > Compared to the existing 'qom-set' command, which is less intuitive and works
> > only with object IDs, device_set provides a more device-oriented interface.
> > It can be invoked at the QEMU prompt using natural device arguments, and the
> > new '-deviceset' CLI option allows properties to be set at boot time, similar
> > to how '-device' specifies device creation.  
> 
> Why can't we use -device?

That was my concern/suggestion in my reply to the cover letter
(as the place to put high-level review and what can be done for the next revision).

(PS: It looks like I'm having issues receiving email, i.e. I'm not getting my
own emails bounced back to me from the mailing list, so threading is all broken
on my side and I might miss replies. On the positive side, it looks like my
replies reach the list and the CCed people just fine.)


> > While the initial implementation focuses on "admin-state" changes (e.g.,
> > enable/disable a CPU already described by ACPI/DT), the interface is designed
> > to be generic. In future, it could be used for other per-device set/unset
> > style controls — beyond administrative power-states — provided the target
> > device explicitly allows such changes. This enables fine-grained runtime
> > control of device properties.  
> 
> Beware, designing a generic interface can be harder, sometimes much
> harder, than designing a specialized one.
> 
> device_add and qom-set are generic, and they have issues:
> 
> * device_add effectively bypasses QAPI by using 'gen': false.
> 
>   This bypasses QAPI's enforcement of documentation.  Property
>   documentation is separate and poor.
> 
>   It also defeats introspection with query-qmp-schema.  You need to
>   resort to other means instead, say QOM introspection (which is a bag
>   of design flaws on its own), then map from QOM to qdev.
> 
> * device_add lets you specify any qdev property, even properties that
>   are intended only for use by C code.
> 
>   This results in accidental external interfaces.
> 
>   We tend to name properties like "x-prop" to discourage external use,
>   but I wouldn't bet my own money on us getting that always right.
>   Moreover, there's beauties like "x-origin".
> 
> * qom-set & friends effectively bypass QAPI by using type 'any'.
> 
>   Again, the bypass results in poor documentation and a defeat of
>   query-qmp-schema.
> 
> * qom-set lets you mess with any QOM property with a setter callback.
> 
>   Again, accidental external interfaces: most of these properties are
>   not meant for use with qom-set.  For some, qom-set works, for some it
>   silently does nothing, and for some it crashes.  A lot more dangerous
>   than device_add.
> 
>   The "x-" convention can't help here: some properties are intended for
>   external use with object-add, but not with qom-set.
> 
> We should avoid such issues in new interfaces.
> 
> We'll examine how this applies to device_set when I review the QAPI
> schema.
> 
> > Key pieces:
> >   * QMP: qmp_device_set() to update an existing device. The device can be
> >     located by "id" or via driver+property match using a DeviceListener
> >     callback (qdev_find_device()).
> >   * HMP: "device_set" command with tab-completion. Errors are surfaced via
> >     hmp_handle_error().
> >   * CLI: "-deviceset" option for setting startup/admin properties at boot,
> >     including a JSON form. Options are parsed into qemu_deviceset_opts and
> >     applied after device creation.
> >   * Docs/help: HMP help text and qemu-options.hx additions explain usage and
> >     explicitly note that no hot-plug occurs.
> >   * Safety: disallowed during live migration (migration_is_idle() check).
> >
> > Semantics:
> >   * Operates on an existing DeviceState; no enumeration/new device appears.
> >   * Complements device_add/device_del by providing state mutation only.
> >   * Backward compatible: no behavior change unless "device_set"/"-deviceset"
> >     is used.
> >
> > Examples:
> >   HMP:
> >     (qemu) device_set host-arm-cpu,core-id=3,admin-state=enable
> >
> >   CLI (at boot):
> >     -smp cpus=4,maxcpus=4 \
> >     -deviceset host-arm-cpu,core-id=2,admin-state=disable
> >
> >   QMP (JSON form):
> >     { "execute": "device_set",
> >       "arguments": {
> >         "driver": "host-arm-cpu",
> >         "core-id": 1,
> >         "admin-state": "disable"
> >       }
> >     }  
> 
> {"error": {"class": "CommandNotFound", "desc": "The command device_set has not been found"}}
> 
> Clue below.
> 
> > NOTE: The qdev_enable()/qdev_disable() hooks for acting on admin-state will be
> > added in subsequent patches. Device classes must explicitly support any
> > property they want to expose through device_set.
> >
> > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > ---
> >  hmp-commands.hx         |  30 +++++++++
> >  hw/arm/virt.c           |  86 +++++++++++++++++++++++++
> >  hw/core/cpu-common.c    |  12 ++++
> >  hw/core/qdev.c          |  21 ++++++
> >  include/hw/arm/virt.h   |   1 +
> >  include/hw/core/cpu.h   |  11 ++++
> >  include/hw/qdev-core.h  |  22 +++++++
> >  include/monitor/hmp.h   |   2 +
> >  include/monitor/qdev.h  |  30 +++++++++
> >  include/system/system.h |   1 +
> >  qemu-options.hx         |  51 +++++++++++++--
> >  system/qdev-monitor.c   | 139 +++++++++++++++++++++++++++++++++++++++-
> >  system/vl.c             |  39 +++++++++++
> >  13 files changed, 440 insertions(+), 5 deletions(-)  
> 
> Clue: no update to the QAPI schema, i.e. the QMP command does not exist.
> 
> >
> > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > index d0e4f35a30..18056cf21d 100644
> > --- a/hmp-commands.hx
> > +++ b/hmp-commands.hx
> > @@ -707,6 +707,36 @@ SRST
> >    or a QOM object path.
> >  ERST
> >  
> > +{
> > +    .name       = "device_set",
> > +    .args_type  = "device:O",
> > +    .params     = "driver[,prop=value][,...]",
> > +    .help       = "set/unset existing device property",
> > +    .cmd        = hmp_device_set,
> > +    .command_completion = device_set_completion,
> > +},
> > +
> > +SRST
> > +``device_set`` *driver[,prop=value][,...]*
> > +  Change the administrative power state of an existing device.
> > +
> > +  This command enables or disables a known device (e.g., CPU) using the
> > +  "device_set" interface. It does not hotplug or add a new device.
> > +
> > +  Depending on platform support (e.g., PSCI or ACPI), this may trigger
> > +  corresponding operational changes — such as powering down a CPU or
> > +  transitioning it to active use.
> > +
> > +  Administrative state:
> > +    * *enabled*  — Allows the guest to use the device (e.g., CPU_ON)
> > +    * *disabled* — Prevents guest use; device is powered off (e.g., CPU_OFF)
> > +
> > +  Note: The device must already exist (be declared during machine creation).
> > +
> > +  Example:
> > +      (qemu) device_set host-arm-cpu,core-id=3,admin-state=disabled
> > +ERST  
> 
> How exactly is the device selected?  You provide a clue above: 'can be
> located by "id" or via driver+property match'.
> 
> I assume by "id" is just like device_del, i.e. by qdev ID or QOM path.
> 
> By "driver+property match" is not obvious.  Which of the arguments are
> for matching, and which are for setting?
> 
> If "id" is specified, is there any matching?
> 
> The matching feature complicates this interface quite a bit.  I doubt
> it's worth the complexity.  If you think it is, please split it off into
> a separate patch.

It's likely /me who is to blame, for asking to invent a generic
device-set QMP command.
I see another application for it (besides ARM CPU power-on/off):
simulating powering PCI devices on/off at runtime without
actually removing the device.

wrt the command,
I'd use only 'id' with it to identify the target device
(i.e. no template matching and no QOM path either).
That enforces the rule that what the user hasn't named explicitly by providing
an 'id' isn't meant to be accessed/managed by the user later on.

Potentially we can invent a specialized power_set/get command as
an alternative if it makes the design easier.
But then we would be spawning similar commands for other things,
whereas device-set would cover it all. Then again, I might be
over-complicating things by suggesting a generic approach.
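
To illustrate the id-only rule, here is a toy sketch. The struct and the
function names are hypothetical stand-ins, not QEMU's DeviceState or any
existing helper; the point is only that a device without a user-assigned
id can never be selected:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device; not QEMU's DeviceState. */
struct sketch_dev {
    const char *id;       /* NULL if the user never named the device */
    int admin_enabled;    /* 1 = enabled, 0 = disabled */
};

/* Return the device whose user-assigned id matches, or NULL if none does. */
static struct sketch_dev *sketch_find_by_id(struct sketch_dev *devs, size_t n,
                                            const char *id)
{
    if (!id) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        if (devs[i].id && strcmp(devs[i].id, id) == 0) {
            return &devs[i];
        }
    }
    return NULL;
}

/* Change admin state by id only; returns 0 on success, -1 if not found. */
static int sketch_device_set(struct sketch_dev *devs, size_t n,
                             const char *id, int enable)
{
    struct sketch_dev *dev = sketch_find_by_id(devs, n, id);

    if (!dev) {
        return -1;
    }
    dev->admin_enabled = enable;
    return 0;
}
```

There is no driver+property matching path at all, so the question of which
arguments match and which ones set does not come up.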

> Next question.  Is there a way for management applications to detect
> whether a certain device supports device_set for a certain property?

Is there some kind of QMP command to check what a device supports,
or at least which properties it supports? Can we piggy-back on that?
 
> 
> Without that, what are management application supposed to do?  Hard-code
> what works?  Run the command and see whether it fails?

Adding the libvirt list to the discussion for possible ideas on what can be done here.

> I understand right now the command supports just "admin-state" for a
> certain set of devices, so hard-coding would be possible.  But every new
> (device, property) pair then requires management application updates,
> and the hard-coded information becomes version specific.  This will
> become unworkable real quick.  Not good enough for a command designed to
> be generic.
> 
> > +
> >      {
> >          .name       = "cpu",
> >          .args_type  = "index:i",  

We still do have a few legacy uses of cpu index (CLI|HMP), but
I'd avoid using cpu index or something similar in new interfaces.

> [...]
> 
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 83ccde341b..f517b91042 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -375,7 +375,10 @@ SRST
> >      This is different from CPU hotplug where additional CPUs are not even
> >      present in the system description. Administratively disabled CPUs appear in
> >      ACPI tables i.e. are provisioned, but cannot be used until explicitly
> > -    enabled via QMP/HMP or the deviceset API.
> > +    enabled via QMP/HMP or the deviceset API. On ACPI guests, each vCPU counted
> > +    by 'disabledcpus=' is provisioned with '\ ``_STA``\ ' reporting Present=1
> > +    and Enabled=0 (present-offline) at boot; it becomes Enabled=1 when brought
> > +    online via 'device_set ... admin-state=enable'.
> >  
> >      On boards supporting CPU hotplug, the optional '\ ``maxcpus``\ ' parameter
> >      can be set to enable further CPUs to be added at runtime. When both
> > @@ -455,6 +458,15 @@ SRST
> >  
> >          -smp 2
> >  
> > +    Note: The cluster topology will only be generated in ACPI and exposed
> > +    to guest if it's explicitly specified in -smp.
> > +
> > +    Note: Administratively disabled CPUs (specified via 'disabledcpus=' and
> > +    '-deviceset' at CLI during boot) are especially useful for platforms like
> > +    ARM that lack native CPU hotplug support. These CPUs will appear to the
> > +    guest as unavailable, and any attempt to bring them online must go through
> > +    QMP/HMP commands like 'device_set'.
> > +
> >      Examples using 'disabledcpus':
> >  
> >      For a board without CPU hotplug, enable 4 CPUs at boot and provision
> > @@ -472,9 +484,6 @@ SRST
> >      ::
> >  
> >          -smp cpus=4,disabledcpus=2,maxcpus=8
> > -
> > -    Note: The cluster topology will only be generated in ACPI and exposed
> > -    to guest if it's explicitly specified in -smp.
> >  ERST
> >  
> >  DEF("numa", HAS_ARG, QEMU_OPTION_numa,
> > @@ -1281,6 +1290,40 @@ SRST
> >  
> >  ERST
> >  
> > +DEF("deviceset", HAS_ARG, QEMU_OPTION_deviceset,
> > +    "-deviceset driver[,prop[=value]][,...]\n"
> > +    "                Set administrative power state of an existing device.\n"
> > +    "                Does not hotplug a new device. Can disable or enable\n"
> > +    "                devices (such as CPUs) at boot based on policy.\n"
> > +    "                Example:\n"
> > +    "                    -deviceset host-arm-cpu,core-id=2,admin-state=disabled\n"
> > +    "                Use '-deviceset help' for supported drivers\n"
> > +    "                Use '-deviceset driver,help' for driver-specific properties\n",
> > +    QEMU_ARCH_ALL)
> > +SRST
> > +``-deviceset driver[,prop[=value]][,...]``
> > +    Configure an existing device's administrative power state or properties.
> > +
> > +    Unlike ``-device``, this option does not create a new device. Instead,
> > +    it sets startup properties (such as administrative power state) for
> > +    a device already declared via -smp or other machine configuration.
> > +
> > +    Example:
> > +        -smp cpus=4
> > +        -deviceset host-arm-cpu,core-id=2,admin-state=disabled
> > +
> > +    The above disables CPU core 2 at boot using administrative offlining.
> > +    The guest may later re-enable the core (if permitted by platform policy).
> > +
> > +    ``state=enabled|disabled``
> > +        Sets the administrative state of the device:
> > +        - ``enabled``: device is made available at boot
> > +        - ``disabled``: device is administratively disabled and powered off
> > +
> > +    Use ``-deviceset help`` to view all supported drivers.
> > +    Use ``-deviceset driver,help`` for property-specific help.
> > +ERST
> > +
> >  DEF("name", HAS_ARG, QEMU_OPTION_name,
> >      "-name string1[,process=string2][,debug-threads=on|off]\n"
> >      "                set the name of the guest\n"
> > diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
> > index 2ac92d0a07..1099b1237d 100644
> > --- a/system/qdev-monitor.c
> > +++ b/system/qdev-monitor.c
> > @@ -263,12 +263,20 @@ static DeviceClass *qdev_get_device_class(const char **driver, Error **errp)
> >      }
> >  
> >      dc = DEVICE_CLASS(oc);
> > -    if (!dc->user_creatable) {
> > +    if (!dc->user_creatable && !dc->admin_power_state_supported) {
> >          error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "driver",
> >                     "a pluggable device type");
> >          return NULL;
> >      }
> >  
> > +    if (phase_check(PHASE_MACHINE_READY) &&
> > +        (!dc->hotpluggable || !dc->admin_power_state_supported)) {
> > +        error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "driver",
> > +                   "a pluggable device type or which supports changing power-"
> > +                   "state administratively");
> > +        return NULL;
> > +    }
> > +
> >      if (object_class_dynamic_cast(oc, TYPE_SYS_BUS_DEVICE)) {
> >          /* sysbus devices need to be allowed by the machine */
> >          MachineClass *mc = MACHINE_CLASS(object_get_class(qdev_get_machine()));
> > @@ -939,6 +947,76 @@ void qmp_device_del(const char *id, Error **errp)
> >      }
> >  }
> >  
> > +void qmp_device_set(const QDict *qdict, Error **errp)
> > +{
> > +    const char *state;
> > +    const char *driver;
> > +    DeviceState *dev;
> > +    DeviceClass *dc;
> > +    const char *id;
> > +
> > +    driver = qdict_get_try_str(qdict, "driver");
> > +    if (!driver) {
> > +        error_setg(errp, "Parameter 'driver' is missing");
> > +        return;
> > +    }
> > +
> > +    /* check driver exists and we are at the right phase of machine init */
> > +    dc = qdev_get_device_class(&driver, errp);
> > +    if (!dc) {  
> 
> Since qdev_get_device_class() sets an error when it fails, *errp is not
> null here, ...
> 
> > +        error_setg(errp, "driver '%s' not supported", driver);  
> 
> ... which makes this wrong.  Caught by error_setv()'s assertion.
> 
> Please test your error paths.
> 
> > +        return;
> > +    }
> > +
> > +    if (migration_is_running()) {
> > +        error_setg(errp, "device_set not allowed while migrating");
> > +        return;
> > +    }
> > +
> > +    id = qdict_get_try_str(qdict, "id");
> > +
> > +    if (id) {
> > +        /* Lookup by ID */
> > +        dev = find_device_state(id, false, errp);
> > +        if (errp && *errp) {
> > +            error_prepend(errp, "Device lookup failed for ID '%s': ", id);
> > +            return;
> > +        }
> > +    } else {
> > +        /* Lookup using driver and properties */
> > +        dev = qdev_find_device(qdict, errp);
> > +        if (errp && *errp) {
> > +            error_prepend(errp, "Device lookup for %s failed: ", driver);
> > +            return;
> > +        }
> > +    }
> > +    if (!dev) {
> > +        error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
> > +                  "No device found for driver '%s'", driver);
> > +        return;
> > +    }
> > +
> > +    state = qdict_get_try_str(qdict, "admin-state");
> > +    if (!state) {
> > +        error_setg(errp, "no device state change specified for device %s ",
> > +                   dev->id);
> > +        return;
> > +    } else if (!strcmp(state, "enable")) {
> > +
> > +        if (!qdev_enable(dev, qdev_get_parent_bus(DEVICE(dev)), errp)) {
> > +            return;
> > +        }
> > +    } else if (!strcmp(state, "disable")) {
> > +        if (!qdev_disable(dev, qdev_get_parent_bus(DEVICE(dev)), errp)) {
> > +            return;
> > +        }
> > +    } else {
> > +        error_setg(errp, "unrecognized specified state *%s* for device %s",
> > +                   state, dev->id);
> > +        return;
> > +    }
> > +}
> > +
> >  int qdev_sync_config(DeviceState *dev, Error **errp)
> >  {
> >      DeviceClass *dc = DEVICE_GET_CLASS(dev);  
> 
> [...]
> 




* Re: [PATCH RFC V6 03/24] hw/arm/virt: Clamp 'maxcpus' as-per machine's vCPU deferred online-capability
  2025-10-09 12:32   ` Miguel Luis
@ 2025-10-09 13:11     ` Igor Mammedov
  0 siblings, 0 replies; 67+ messages in thread
From: Igor Mammedov @ 2025-10-09 13:11 UTC (permalink / raw)
  To: Miguel Luis
  Cc: salil.mehta@opnsrc.net, qemu-devel@nongnu.org,
	qemu-arm@nongnu.org, mst@redhat.com, salil.mehta@huawei.com,
	maz@kernel.org, jean-philippe@linaro.org,
	jonathan.cameron@huawei.com, lpieralisi@kernel.org,
	peter.maydell@linaro.org, richard.henderson@linaro.org,
	armbru@redhat.com, andrew.jones@linux.dev, david@redhat.com,
	philmd@linaro.org, eric.auger@redhat.com, will@kernel.org,
	ardb@kernel.org, oliver.upton@linux.dev, pbonzini@redhat.com,
	gshan@redhat.com, rafael@kernel.org, borntraeger@linux.ibm.com,
	alex.bennee@linaro.org, gustavo.romero@linaro.org,
	npiggin@gmail.com, harshpb@linux.ibm.com, linux@armlinux.org.uk,
	darren@os.amperecomputing.com, ilkka@os.amperecomputing.com,
	vishnu@os.amperecomputing.com, gankulkarni@os.amperecomputing.com,
	Karl Heubaum, zhukeqian1@huawei.com, wangxiongfeng2@huawei.com,
	wangyanan55@huawei.com, wangzhou1@hisilicon.com,
	linuxarm@huawei.com, jiakernel2@gmail.com, maobibo@loongson.cn,
	lixianglai@loongson.cn, shahuang@redhat.com, zhao1.liu@intel.com,
	devel

On Thu, 9 Oct 2025 12:32:15 +0000
Miguel Luis <miguel.luis@oracle.com> wrote:

> Hi Salil,
> 
> > On 1 Oct 2025, at 01:01, salil.mehta@opnsrc.net wrote:
> > 
> > From: Salil Mehta <salil.mehta@huawei.com>
> > 
> > To support a vCPU hot-add–like model on ARM, the virt machine may be setup with
> > more CPUs than are active at boot. These additional CPUs are fully realized in
> > KVM and listed in ACPI tables from the start, but begin in a disabled state.
> > They can later be brought online or taken offline under host or platform policy
> > control. The CPU topology is fixed at VM creation time and cannot change
> > dynamically on ARM. Therefore, we must determine precisely the 'maxcpus' value
> > that applies for the full lifetime of the VM.
> > 
> > On ARM, this deferred online-capable model is only valid if:
> >  - The GIC version is 3 or higher, and
> >  - Each non-boot CPU’s GIC CPU Interface is marked “online-capable” in its
> >    ACPI GICC structure (UEFI ACPI Specification 6.5, §5.2.12.14, Table 5.37
> >    “GICC CPU Interface Flags”), and
> >  - The chosen accelerator supports safe deferred CPU online:
> >      * TCG with multi-threaded TCG (MTTCG) enabled
> >      * KVM (on supported hosts)
> >      * Not HVF or QTest
> > 
> > This patch sizes the machine’s max-possible CPUs during VM init:
> >  - If all conditions are satisfied, retain the full set of CPUs corresponding
> >    to (`-smp cpus` + `-smp disabledcpus`), allowing the additional (initially
> >    disabled) CPUs to participate in later policy-driven online.
> >  - Otherwise, clamp the max-possible CPUs to the boot-enabled count
> >    (`-smp disabledcpus=0` equivalent) to avoid advertising CPUs the guest can
> >    never use.
> > 
> > A new MachineClass flag, `has_online_capable_cpus`, records whether the machine
> > supports deferred vCPU online. This is usable by other machine types as well.  
> 
> 
> By the definition of
> 
>  * @has_hotpluggable_cpus:
>  *    If true, board supports CPUs creation with -device/device_add.
> 
>  in include/hw/boards.h

It should be fine to rename it to has_pluggable_cpus.

But we should add support to arm/virt for -device/device_add cpu_foo,
to avoid the awkward -deviceset and the mangling of -smp.

device_add in the arm/virt case should probably be limited to the non-hotplug use case.

> it seems one could reuse MachineClass's has_hotpluggable_cpus field
> instead of creating a new has_online_capable_cpus one.
> (Again, IMHO ‘online capable’ is ACPI nomenclature and doesn’t need to be brought
> into MachineClass.)

The issue with has_hotpluggable_cpus might be QMP ABI,
where libvirt might use it to figure out whether certain commands are supported.

CCing libvirt to check if that would break something.

> 
> That field would be assigned in machvirt_init, based on the GIC version and/or
> whether there are inactive CPUs, and the code would proceed from there, making the
> default assignment in virt_machine_class_init superfluous.
> 
> We're at hw/arm/virt and we know these CPUs are administratively power state
> coordinated so admin_power_state_supported can still be set there in the
> presence of inactive CPUs.
> 
> Thanks
> Miguel
> 
> > 
> > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > ---
> > hw/arm/virt.c       | 84 ++++++++++++++++++++++++++++++---------------
> > include/hw/boards.h |  1 +
> > 2 files changed, 57 insertions(+), 28 deletions(-)
> > 
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index ef6be3660f..76f21bd56a 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -2168,8 +2168,7 @@ static void machvirt_init(MachineState *machine)
> >     bool has_ged = !vmc->no_ged;
> >     unsigned int smp_cpus = machine->smp.cpus;
> >     unsigned int max_cpus = machine->smp.max_cpus;
> > -
> > -    possible_cpus = mc->possible_cpu_arch_ids(machine);
> > +    DeviceClass *dc;
> > 
> >     /*
> >      * In accelerated mode, the memory map is computed earlier in kvm_type()
> > @@ -2186,7 +2185,7 @@ static void machvirt_init(MachineState *machine)
> >          * we are about to deal with. Once this is done, get rid of
> >          * the object.
> >          */
> > -        cpuobj = object_new(possible_cpus->cpus[0].type);
> > +        cpuobj = object_new(machine->cpu_type);
> >         armcpu = ARM_CPU(cpuobj);
> > 
> >         pa_bits = arm_pamax(armcpu);
> > @@ -2201,6 +2200,57 @@ static void machvirt_init(MachineState *machine)
> >      */
> >     finalize_gic_version(vms);
> > 
> > +    /*
> > +     * The maximum number of CPUs depends on the GIC version, or on how
> > +     * many redistributors we can fit into the memory map (which in turn
> > +     * depends on whether this is a GICv3 or v4).
> > +     */
> > +    if (vms->gic_version == VIRT_GIC_VERSION_2) {
> > +        virt_max_cpus = GIC_NCPU;
> > +    } else {
> > +        virt_max_cpus = virt_redist_capacity(vms, VIRT_GIC_REDIST);
> > +        if (vms->highmem_redists) {
> > +            virt_max_cpus += virt_redist_capacity(vms, VIRT_HIGH_GIC_REDIST2);
> > +        }
> > +    }
> > +
> > +    if ((tcg_enabled() && !qemu_tcg_mttcg_enabled()) || hvf_enabled() ||
> > +        qtest_enabled() || vms->gic_version == VIRT_GIC_VERSION_2) {
> > +        max_cpus = machine->smp.max_cpus = smp_cpus;
> > +        if (mc->has_online_capable_cpus) {
> > +            if (vms->gic_version == VIRT_GIC_VERSION_2) {
> > +                warn_report("GICv2 does not support online-capable CPUs");
> > +            }
> > +            mc->has_online_capable_cpus = false;
> > +        }
> > +    }
> > +
> > +    if (mc->has_online_capable_cpus) {
> > +        max_cpus = smp_cpus + machine->smp.disabledcpus;
> > +        machine->smp.max_cpus = max_cpus;
> > +    }
> > +
> > +    if (max_cpus > virt_max_cpus) {
> > +        error_report("Number of SMP CPUs requested (%d) exceeds max CPUs "
> > +                     "supported by machine 'mach-virt' (%d)",
> > +                     max_cpus, virt_max_cpus);
> > +        if (vms->gic_version != VIRT_GIC_VERSION_2 && !vms->highmem_redists) {
> > +            error_printf("Try 'highmem-redists=on' for more CPUs\n");
> > +        }
> > +
> > +        exit(1);
> > +    }
> > +
> > +    dc = DEVICE_CLASS(object_class_by_name(machine->cpu_type));
> > +    if (!dc) {
> > +        error_report("CPU type '%s' not registered", machine->cpu_type);
> > +        exit(1);
> > +    }
> > +    dc->admin_power_state_supported = mc->has_online_capable_cpus;
> > +
> > +    /* uses smp.max_cpus to initialize all possible vCPUs */
> > +    possible_cpus = mc->possible_cpu_arch_ids(machine);
> > +
> >     if (vms->secure) {
> >         /*
> >          * The Secure view of the world is the same as the NonSecure,
> > @@ -2235,31 +2285,6 @@ static void machvirt_init(MachineState *machine)
> >         vms->psci_conduit = QEMU_PSCI_CONDUIT_HVC;
> >     }
> > 
> > -    /*
> > -     * The maximum number of CPUs depends on the GIC version, or on how
> > -     * many redistributors we can fit into the memory map (which in turn
> > -     * depends on whether this is a GICv3 or v4).
> > -     */
> > -    if (vms->gic_version == VIRT_GIC_VERSION_2) {
> > -        virt_max_cpus = GIC_NCPU;
> > -    } else {
> > -        virt_max_cpus = virt_redist_capacity(vms, VIRT_GIC_REDIST);
> > -        if (vms->highmem_redists) {
> > -            virt_max_cpus += virt_redist_capacity(vms, VIRT_HIGH_GIC_REDIST2);
> > -        }
> > -    }
> > -
> > -    if (max_cpus > virt_max_cpus) {
> > -        error_report("Number of SMP CPUs requested (%d) exceeds max CPUs "
> > -                     "supported by machine 'mach-virt' (%d)",
> > -                     max_cpus, virt_max_cpus);
> > -        if (vms->gic_version != VIRT_GIC_VERSION_2 && !vms->highmem_redists) {
> > -            error_printf("Try 'highmem-redists=on' for more CPUs\n");
> > -        }
> > -
> > -        exit(1);
> > -    }
> > -
> >     if (vms->secure && !tcg_enabled() && !qtest_enabled()) {
> >         error_report("mach-virt: %s does not support providing "
> >                      "Security extensions (TrustZone) to the guest CPU",
> > @@ -3245,6 +3270,9 @@ static void virt_machine_class_init(ObjectClass *oc, const void *data)
> >     hc->plug = virt_machine_device_plug_cb;
> >     hc->unplug_request = virt_machine_device_unplug_request_cb;
> >     hc->unplug = virt_machine_device_unplug_cb;
> > +
> > +    mc->has_online_capable_cpus = true;
> > +
> >     mc->nvdimm_supported = true;
> >     mc->smp_props.clusters_supported = true;
> >     mc->auto_enable_numa_with_memhp = true;
> > diff --git a/include/hw/boards.h b/include/hw/boards.h
> > index 2b182d7817..b27c2326a2 100644
> > --- a/include/hw/boards.h
> > +++ b/include/hw/boards.h
> > @@ -302,6 +302,7 @@ struct MachineClass {
> >     bool rom_file_has_mr;
> >     int minimum_page_bits;
> >     bool has_hotpluggable_cpus;
> > +    bool has_online_capable_cpus;
> >     bool ignore_memory_transaction_failures;
> >     int numa_mem_align_shift;
> >     const char * const *valid_cpu_types;
> > -- 
> > 2.34.1
> >   
> 




* Re: [PATCH RFC V6 02/24] hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter
  2025-10-09 11:28   ` Miguel Luis
@ 2025-10-09 13:17     ` Igor Mammedov
  0 siblings, 0 replies; 67+ messages in thread
From: Igor Mammedov @ 2025-10-09 13:17 UTC (permalink / raw)
  To: Miguel Luis
  Cc: salil.mehta@opnsrc.net, qemu-devel@nongnu.org,
	qemu-arm@nongnu.org, mst@redhat.com, salil.mehta@huawei.com,
	maz@kernel.org, jean-philippe@linaro.org,
	jonathan.cameron@huawei.com, lpieralisi@kernel.org,
	peter.maydell@linaro.org, richard.henderson@linaro.org,
	armbru@redhat.com, andrew.jones@linux.dev, david@redhat.com,
	philmd@linaro.org, eric.auger@redhat.com, will@kernel.org,
	ardb@kernel.org, oliver.upton@linux.dev, pbonzini@redhat.com,
	gshan@redhat.com, rafael@kernel.org, borntraeger@linux.ibm.com,
	alex.bennee@linaro.org, gustavo.romero@linaro.org,
	npiggin@gmail.com, harshpb@linux.ibm.com, linux@armlinux.org.uk,
	darren@os.amperecomputing.com, ilkka@os.amperecomputing.com,
	vishnu@os.amperecomputing.com, gankulkarni@os.amperecomputing.com,
	Karl Heubaum, zhukeqian1@huawei.com, wangxiongfeng2@huawei.com,
	wangyanan55@huawei.com, wangzhou1@hisilicon.com,
	linuxarm@huawei.com, jiakernel2@gmail.com, maobibo@loongson.cn,
	lixianglai@loongson.cn, shahuang@redhat.com, zhao1.liu@intel.com

On Thu, 9 Oct 2025 11:28:40 +0000
Miguel Luis <miguel.luis@oracle.com> wrote:

> Hi Salil,
> 
> > On 1 Oct 2025, at 01:01, salil.mehta@opnsrc.net wrote:
> > 
> > From: Salil Mehta <salil.mehta@huawei.com>
> > 
> > Add support for a new SMP configuration parameter, 'disabledcpus', which
> > specifies the number of additional CPUs that are present in the virtual
> > machine but administratively disabled at boot. These CPUs are visible in
> > firmware (e.g. ACPI tables) yet unavailable to the guest until explicitly
> > enabled via QMP/HMP, or via the 'device_set' API (introduced in later
> > patches).
> > 
> > This feature is intended for architectures that lack native CPU hotplug
> > support but can change the administrative power state of present CPUs.
> > It allows simulating CPU hot-add–like scenarios while all CPUs remain
> > physically present in the topology at boot time.
> > 
> > Note: ARM is the first architecture to support this concept.
> > 
> > Changes include:
> > - Extend CpuTopology with a 'disabledcpus' field.
> > - Update machine_parse_smp_config() to account for disabled CPUs when
> >   computing 'cpus' and 'maxcpus'.
> > - Update SMPConfiguration in QAPI to accept 'disabledcpus'.
> > - Extend -smp option documentation to describe 'disabledcpus' usage and
> >   behavior.
> >   
> 
> Specifying a new parameter for the user seems unnecessary when the system could
> infer the number of present-but-disabled CPUs as (maxcpus - cpus); the CPUs this
> patch calls "disabledcpus" could be obtained this way.
> 
> Naming is hard, but in my opinion we shouldn't be calling these
> 'disabledcpus' here. I understand the name is carried over from the previous
> patch's administrative power state, but machine-smp sits at a different
> abstraction level, so the administrative power state concept could be
> decoupled from the machine-smp realm.


> My suggestion would be calling those cpus 'inactive' and not carry previous
> patch's nomenclature.
> 
> CPUs in 'inactive' state are still present in the virtual machine although this
> pre-condition may require post actions like being explicitly 'enabled'/active via
> [QH]MP.
> 
> Overall, I believe the above should be all it takes to simplify accommodation of
> CPUs not to be brought online at boot time within this patch's context.

See my reply to the cover letter: we shouldn't touch -smp at all.
Instead, make -device/device_add work with virt/arm CPUs;
that solves a number of issues that this and the -deviceset patches introduce.
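
For reference, the inference Miguel describes above is just arithmetic on
the existing -smp values. A minimal sketch follows; the helper name
smp_infer_inactive() is hypothetical and is not code from this series, and
it assumes that an omitted maxcpus is passed as 0 and defaults to cpus:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the series): derive the number of
 * present-but-inactive CPUs from 'cpus' and 'maxcpus' alone, without
 * a separate 'disabledcpus' parameter.
 *
 * Returns false if the combination is invalid (maxcpus < cpus).
 */
static bool smp_infer_inactive(unsigned cpus, unsigned maxcpus,
                               unsigned *inactive)
{
    assert(inactive != NULL);

    /* Assumption: maxcpus == 0 means "omitted", so it defaults to cpus. */
    if (maxcpus == 0) {
        maxcpus = cpus;
    }
    if (maxcpus < cpus) {
        return false;
    }
    *inactive = maxcpus - cpus;
    return true;
}
```

With cpus=4,maxcpus=6 this yields 2 inactive CPUs, the same split that
'-smp cpus=4,disabledcpus=2' expresses in the patch.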

> Thanks
> Miguel
> 
> 
> > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > ---
> > hw/core/machine-smp.c | 24 +++++++-----
> > include/hw/boards.h   |  2 +
> > qapi/machine.json     |  3 ++
> > qemu-options.hx       | 86 +++++++++++++++++++++++++++++++++----------
> > system/vl.c           |  3 ++
> > 5 files changed, 89 insertions(+), 29 deletions(-)
> > 
> > diff --git a/hw/core/machine-smp.c b/hw/core/machine-smp.c
> > index 0be0ac044c..c1a09fdc3f 100644
> > --- a/hw/core/machine-smp.c
> > +++ b/hw/core/machine-smp.c
> > @@ -87,6 +87,7 @@ void machine_parse_smp_config(MachineState *ms,
> > {
> >     MachineClass *mc = MACHINE_GET_CLASS(ms);
> >     unsigned cpus    = config->has_cpus ? config->cpus : 0;
> > +    unsigned disabledcpus = config->has_disabledcpus ? config->disabledcpus : 0;
> >     unsigned drawers = config->has_drawers ? config->drawers : 0;
> >     unsigned books   = config->has_books ? config->books : 0;
> >     unsigned sockets = config->has_sockets ? config->sockets : 0;
> > @@ -166,8 +167,13 @@ void machine_parse_smp_config(MachineState *ms,
> >         sockets = sockets > 0 ? sockets : 1;
> >         cores = cores > 0 ? cores : 1;
> >         threads = threads > 0 ? threads : 1;
> > +
> > +        maxcpus = drawers * books * sockets * dies * clusters *
> > +                    modules * cores * threads;
> > +        cpus = maxcpus - disabledcpus;
> >     } else {
> > -        maxcpus = maxcpus > 0 ? maxcpus : cpus;
> > +        maxcpus = maxcpus > 0 ? maxcpus : cpus + disabledcpus;
> > +        cpus = cpus > 0 ? cpus : maxcpus - disabledcpus;
> > 
> >         if (mc->smp_props.prefer_sockets) {
> >             /* prefer sockets over cores before 6.2 */
> > @@ -207,12 +213,8 @@ void machine_parse_smp_config(MachineState *ms,
> >         }
> >     }
> > 
> > -    total_cpus = drawers * books * sockets * dies *
> > -                 clusters * modules * cores * threads;
> > -    maxcpus = maxcpus > 0 ? maxcpus : total_cpus;
> > -    cpus = cpus > 0 ? cpus : maxcpus;
> > -
> >     ms->smp.cpus = cpus;
> > +    ms->smp.disabledcpus = disabledcpus;
> >     ms->smp.drawers = drawers;
> >     ms->smp.books = books;
> >     ms->smp.sockets = sockets;
> > @@ -226,6 +228,8 @@ void machine_parse_smp_config(MachineState *ms,
> >     mc->smp_props.has_clusters = config->has_clusters;
> > 
> >     /* sanity-check of the computed topology */
> > +    total_cpus = maxcpus = drawers * books * sockets * dies * clusters *
> > +                modules * cores * threads;
> >     if (total_cpus != maxcpus) {
> >         g_autofree char *topo_msg = cpu_hierarchy_to_string(ms);
> >         error_setg(errp, "Invalid CPU topology: "
> > @@ -235,12 +239,12 @@ void machine_parse_smp_config(MachineState *ms,
> >         return;
> >     }
> > 
> > -    if (maxcpus < cpus) {
> > +    if (maxcpus < (cpus + disabledcpus)) {
> >         g_autofree char *topo_msg = cpu_hierarchy_to_string(ms);
> >         error_setg(errp, "Invalid CPU topology: "
> > -                   "maxcpus must be equal to or greater than smp: "
> > -                   "%s == maxcpus (%u) < smp_cpus (%u)",
> > -                   topo_msg, maxcpus, cpus);
> > +                   "maxcpus must be equal to or greater than smp[+disabledcpus]:"
> > +                   "%s == maxcpus (%u) < smp_cpus (%u) [+ offline cpus (%u)]",
> > +                   topo_msg, maxcpus, cpus, disabledcpus);
> >         return;
> >     }
> > 
> > diff --git a/include/hw/boards.h b/include/hw/boards.h
> > index f94713e6e2..2b182d7817 100644
> > --- a/include/hw/boards.h
> > +++ b/include/hw/boards.h
> > @@ -361,6 +361,7 @@ typedef struct DeviceMemoryState {
> > /**
> >  * CpuTopology:
> >  * @cpus: the number of present logical processors on the machine
> > + * @disabledcpus: the number additional present but admin disabled cpus
> >  * @drawers: the number of drawers on the machine
> >  * @books: the number of books in one drawer
> >  * @sockets: the number of sockets in one book
> > @@ -373,6 +374,7 @@ typedef struct DeviceMemoryState {
> >  */
> > typedef struct CpuTopology {
> >     unsigned int cpus;
> > +    unsigned int disabledcpus;
> >     unsigned int drawers;
> >     unsigned int books;
> >     unsigned int sockets;
> > diff --git a/qapi/machine.json b/qapi/machine.json
> > index 038eab281c..e45740da33 100644
> > --- a/qapi/machine.json
> > +++ b/qapi/machine.json
> > @@ -1634,6 +1634,8 @@
> > #
> > # @cpus: number of virtual CPUs in the virtual machine
> > #
> > +# @disabledcpus: number of additional present but disabled(or offline) CPUs
> > +#
> > # @maxcpus: maximum number of hotpluggable virtual CPUs in the virtual
> > #     machine
> > #
> > @@ -1657,6 +1659,7 @@
> > ##
> > { 'struct': 'SMPConfiguration', 'data': {
> >      '*cpus': 'int',
> > +     '*disabledcpus': 'int',
> >      '*drawers': 'int',
> >      '*books': 'int',
> >      '*sockets': 'int',
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index ab23f14d21..83ccde341b 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -326,12 +326,15 @@ SRST
> > ERST
> > 
> > DEF("smp", HAS_ARG, QEMU_OPTION_smp,
> > -    "-smp [[cpus=]n][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets]\n"
> > -    "               [,dies=dies][,clusters=clusters][,modules=modules][,cores=cores]\n"
> > -    "               [,threads=threads]\n"
> > -    "                set the number of initial CPUs to 'n' [default=1]\n"
> > -    "                maxcpus= maximum number of total CPUs, including\n"
> > -    "                offline CPUs for hotplug, etc\n"
> > +    "-smp [[cpus=]n][,disabledcpus=disabledcpus][,maxcpus=maxcpus][,drawers=drawers][,books=books]\n"
> > +    "               [,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules]\n"
> > +    "               [,cores=cores][,threads=threads]\n"
> > +    "                set the initial number of CPUs present and\n"
> > +    "                  administratively enabled at boot time to 'n' [default=1]\n"
> > +    "                disabledcpus= number of present but administratively\n"
> > +    "                  disabled CPUs (unavailable to the guest at boot)\n"
> > +    "                maxcpus= maximum total CPUs (present + hotpluggable)\n"
> > +    "                  on machines without CPU hotplug, defaults to n + disabledcpus\n"
> >     "                drawers= number of drawers on the machine board\n"
> >     "                books= number of books in one drawer\n"
> >     "                sockets= number of sockets in one book\n"
> > @@ -351,22 +354,49 @@ DEF("smp", HAS_ARG, QEMU_OPTION_smp,
> >     "      For a particular machine type board, an expected CPU topology hierarchy\n"
> >     "      can be defined through the supported sub-option. Unsupported parameters\n"
> >     "      can also be provided in addition to the sub-option, but their values\n"
> > -    "      must be set as 1 in the purpose of correct parsing.\n",
> > +    "      must be set as 1 in the purpose of correct parsing.\n"
> > +    "                                                          \n"
> > +    "      Administratively disabled CPUs: Some machine types do not support vCPU\n"
> > +    "      hotplug but their CPUs can be marked disabled (powered off) and kept\n"
> > +    "      unavailable to the guest. Later, such CPUs can be enabled via QMP/HMP\n"
> > +    "      (e.g., 'device_set ... admin-state=enable'). This is similar to hotplug,\n"
> > +    "      except all disabled CPUs are already present at boot. Useful on\n"
> > +    "      architectures that lack architectural CPU hotplug.\n",
> >     QEMU_ARCH_ALL)
> > SRST
> > -``-smp [[cpus=]n][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules][,cores=cores][,threads=threads]``
> > -    Simulate a SMP system with '\ ``n``\ ' CPUs initially present on
> > -    the machine type board. On boards supporting CPU hotplug, the optional
> > -    '\ ``maxcpus``\ ' parameter can be set to enable further CPUs to be
> > -    added at runtime. When both parameters are omitted, the maximum number
> > +``-smp [[cpus=]n][,disabledcpus=disabledcpus][,maxcpus=maxcpus][,drawers=drawers][,books=books][,sockets=sockets][,dies=dies][,clusters=clusters][,modules=modules][,cores=cores][,threads=threads]``
> > +    Simulate a SMP system with '\ ``n``\ ' CPUs initially present & enabled on
> > +    the machine type board. Furthermore, on architectures that support changing
> > +    the administrative power state of CPUs, optional '\ ``disabledcpus``\ '
> > +    parameter specifies *additional* CPUs that are present in firmware (e.g.,
> > +    ACPI) but are administratively disabled (i.e., not usable by the guest at
> > +    boot time).
> > +
> > +    This is different from CPU hotplug where additional CPUs are not even
> > +    present in the system description. Administratively disabled CPUs appear in
> > +    ACPI tables i.e. are provisioned, but cannot be used until explicitly
> > +    enabled via QMP/HMP or the deviceset API.
> > +
> > +    On boards supporting CPU hotplug, the optional '\ ``maxcpus``\ ' parameter
> > +    can be set to enable further CPUs to be added at runtime. When both
> > +    '\ ``n``\ ' & '\ ``maxcpus``\ ' parameters are omitted, the maximum number
> >     of CPUs will be calculated from the provided topology members and the
> > -    initial CPU count will match the maximum number. When only one of them
> > -    is given then the omitted one will be set to its counterpart's value.
> > -    Both parameters may be specified, but the maximum number of CPUs must
> > -    be equal to or greater than the initial CPU count. Product of the
> > -    CPU topology hierarchy must be equal to the maximum number of CPUs.
> > -    Both parameters are subject to an upper limit that is determined by
> > -    the specific machine type chosen.
> > +    initial CPU count will match the maximum number. When only one of them is
> > +    given then the omitted one will be set to its counterpart's value. Both
> > +    parameters may be specified, but the maximum number of CPUs must be equal
> > +    to or greater than the initial CPU count. Product of the CPU topology
> > +    hierarchy must be equal to the maximum number of CPUs. Both parameters are
> > +    subject to an upper limit that is determined by the specific machine type
> > +    chosen. Boards that support administratively disabled CPUs but do *not*
> > +    support CPU hotplug derive the maximum number of CPUs implicitly:
> > +    '\ ``maxcpus``\ ' is treated as '\ ``n + disabledcpus``\ ' (the total CPUs
> > +    present in firmware). If '\ ``maxcpus``\ ' is provided, it must equal
> > +    '\ ``n + disabledcpus``\ '. The topology product must equal this derived
> > +    maximum as well.
> > +
> > +    Note: Administratively disabled CPUs will appear to the guest as
> > +    unavailable, and any attempt to bring them online must go through QMP/HMP
> > +    commands like 'device_set'.
> > 
> >     To control reporting of CPU topology information, values of the topology
> >     parameters can be specified. Machines may only support a subset of the
> > @@ -425,6 +455,24 @@ SRST
> > 
> >         -smp 2
> > 
> > +    Examples using 'disabledcpus':
> > +
> > +    For a board without CPU hotplug, enable 4 CPUs at boot and provision
> > +    2 additional administratively disabled CPUs (maximum is derived
> > +    implicitly as 6 = 4 + 2):
> > +
> > +    ::
> > +
> > +        -smp cpus=4,disabledcpus=2
> > +
> > +    For a board that supports CPU hotplug and 'disabledcpus', enable 4 CPUs
> > +    at boot, provision 2 administratively disabled CPUs, and allow hotplug of
> > +    2 more CPUs (for a maximum of 8):
> > +
> > +    ::
> > +
> > +        -smp cpus=4,disabledcpus=2,maxcpus=8
> > +
> >     Note: The cluster topology will only be generated in ACPI and exposed
> >     to guest if it's explicitly specified in -smp.
> > ERST
> > diff --git a/system/vl.c b/system/vl.c
> > index 3b7057e6c6..2f0fd21a1f 100644
> > --- a/system/vl.c
> > +++ b/system/vl.c
> > @@ -736,6 +736,9 @@ static QemuOptsList qemu_smp_opts = {
> >         {
> >             .name = "cpus",
> >             .type = QEMU_OPT_NUMBER,
> > +        }, {
> > +            .name = "disabledcpus",
> > +            .type = QEMU_OPT_NUMBER,
> >         }, {
> >             .name = "drawers",
> >             .type = QEMU_OPT_NUMBER,
> > -- 
> > 2.34.1
> >   
> 




* Re: [PATCH RFC V6 22/24] monitor,qdev: Introduce 'device_set' to change admin state of existing devices
  2025-10-09 12:51     ` Igor Mammedov
@ 2025-10-09 14:03       ` Daniel P. Berrangé
  2025-10-09 14:55       ` Markus Armbruster
  1 sibling, 0 replies; 67+ messages in thread
From: Daniel P. Berrangé @ 2025-10-09 14:03 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Markus Armbruster, salil.mehta, qemu-devel, qemu-arm, mst,
	salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, andrew.jones, david, philmd,
	eric.auger, will, ardb, oliver.upton, pbonzini, gshan, rafael,
	borntraeger, alex.bennee, gustavo.romero, npiggin, harshpb, linux,
	darren, ilkka, vishnu, gankulkarni, karl.heubaum, miguel.luis,
	zhukeqian1, wangxiongfeng2, wangyanan55, wangzhou1, linuxarm,
	jiakernel2, maobibo, lixianglai, shahuang, zhao1.liu, devel

On Thu, Oct 09, 2025 at 02:51:25PM +0200, Igor Mammedov via Devel wrote:
> On Thu, 09 Oct 2025 10:55:40 +0200
> Markus Armbruster <armbru@redhat.com> wrote:
> 
> > salil.mehta@opnsrc.net writes:
> > 
> > > From: Salil Mehta <salil.mehta@huawei.com>
> > >
> > > This patch adds a "device_set" interface for modifying properties of devices
> > > that already exist in the guest topology. Unlike 'device_add'/'device_del'
> > > (hot-plug), 'device_set' does not create or destroy devices. It is intended
> > > for guest-visible hot-add semantics where hardware is provisioned at boot but
> > > logically enabled/disabled later via administrative policy.
> > >
> > > Compared to the existing 'qom-set' command, which is less intuitive and works
> > > only with object IDs, device_set provides a more device-oriented interface.
> > > It can be invoked at the QEMU prompt using natural device arguments, and the
> > > new '-deviceset' CLI option allows properties to be set at boot time, similar
> > > to how '-device' specifies device creation.  
> > 
> > Why can't we use -device?
> 
> that was my concern/suggestion in reply to the cover letter
> (as a place to put high level review and what can be done for the next revision)
> 
> (PS: It looks like I'm having email receiving issues (i.e. not getting from
> the mailing list my own emails that it bounces to me, so threading is all
> broken on my side and I might miss replies). But on the positive side it
> looks like my replies reach the list and those CCed just fine)
> 
> 
> > > While the initial implementation focuses on "admin-state" changes (e.g.,
> > > enable/disable a CPU already described by ACPI/DT), the interface is designed
> > > to be generic. In future, it could be used for other per-device set/unset
> > > style controls — beyond administrative power-states — provided the target
> > > device explicitly allows such changes. This enables fine-grained runtime
> > > control of device properties.  
> > 
> > Beware, designing a generic interface can be harder, sometimes much
> > harder, than designing a specialized one.
> > 
> > device_add and qom-set are generic, and they have issues:
> > 
> > * device_add effectively bypasses QAPI by using 'gen': false.
> > 
> >   This bypasses QAPI's enforcement of documentation.  Property
> >   documentation is separate and poor.
> > 
> >   It also defeats introspection with query-qmp-schema.  You need to
> >   resort to other means instead, say QOM introspection (which is a bag
> >   of design flaws on its own), then map from QOM to qdev.
> > 
> > * device_add lets you specify any qdev property, even properties that
> >   are intended only for use by C code.
> > 
> >   This results in accidental external interfaces.
> > 
> >   We tend to name properties like "x-prop" to discourage external use,
> >   but I wouldn't bet my own money on us getting that always right.
> >   Moreover, there's beauties like "x-origin".
> > 
> > * qom-set & friends effectively bypass QAPI by using type 'any'.
> > 
> >   Again, the bypass results in poor documentation and a defeat of
> >   query-qmp-schema.
> > 
> > * qom-set lets you mess with any QOM property with a setter callback.
> > 
> >   Again, accidental external interfaces: most of these properties are
> >   not meant for use with qom-set.  For some, qom-set works, for some it
> >   silently does nothing, and for some it crashes.  A lot more dangerous
> >   than device_add.
> > 
> >   The "x-" convention can't help here: some properties are intended for
> >   external use with object-add, but not with qom-set.
> > 
> > We should avoid such issues in new interfaces.
> > 
> > We'll examine how this applies to device_set when I review the QAPI
> > schema.
> > 
> > > Key pieces:
> > >   * QMP: qmp_device_set() to update an existing device. The device can be
> > >     located by "id" or via driver+property match using a DeviceListener
> > >     callback (qdev_find_device()).
> > >   * HMP: "device_set" command with tab-completion. Errors are surfaced via
> > >     hmp_handle_error().
> > >   * CLI: "-deviceset" option for setting startup/admin properties at boot,
> > >     including a JSON form. Options are parsed into qemu_deviceset_opts and
> > >     applied after device creation.
> > >   * Docs/help: HMP help text and qemu-options.hx additions explain usage and
> > >     explicitly note that no hot-plug occurs.
> > >   * Safety: disallowed during live migration (migration_is_idle() check).
> > >
> > > Semantics:
> > >   * Operates on an existing DeviceState; no enumeration/new device appears.
> > >   * Complements device_add/device_del by providing state mutation only.
> > >   * Backward compatible: no behavior change unless "device_set"/"-deviceset"
> > >     is used.
> > >
> > > Examples:
> > >   HMP:
> > >     (qemu) device_set host-arm-cpu,core-id=3,admin-state=enable
> > >
> > >   CLI (at boot):
> > >     -smp cpus=4,maxcpus=4 \
> > >     -deviceset host-arm-cpu,core-id=2,admin-state=disable
> > >
> > >   QMP (JSON form):
> > >     { "execute": "device_set",
> > >       "arguments": {
> > >         "driver": "host-arm-cpu",
> > >         "core-id": 1,
> > >         "admin-state": "disable"
> > >       }
> > >     }  
> > 
> > {"error": {"class": "CommandNotFound", "desc": "The command device_set has not been found"}}
> > 
> > Clue below.
> > 
> > > NOTE: The qdev_enable()/qdev_disable() hooks for acting on admin-state will be
> > > added in subsequent patches. Device classes must explicitly support any
> > > property they want to expose through device_set.
> > >
> > > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > > ---
> > >  hmp-commands.hx         |  30 +++++++++
> > >  hw/arm/virt.c           |  86 +++++++++++++++++++++++++
> > >  hw/core/cpu-common.c    |  12 ++++
> > >  hw/core/qdev.c          |  21 ++++++
> > >  include/hw/arm/virt.h   |   1 +
> > >  include/hw/core/cpu.h   |  11 ++++
> > >  include/hw/qdev-core.h  |  22 +++++++
> > >  include/monitor/hmp.h   |   2 +
> > >  include/monitor/qdev.h  |  30 +++++++++
> > >  include/system/system.h |   1 +
> > >  qemu-options.hx         |  51 +++++++++++++--
> > >  system/qdev-monitor.c   | 139 +++++++++++++++++++++++++++++++++++++++-
> > >  system/vl.c             |  39 +++++++++++
> > >  13 files changed, 440 insertions(+), 5 deletions(-)  
> > 
> > Clue: no update to the QAPI schema, i.e. the QMP command does not exist.

On that point...

No new pure HMP commands please.  We consider implementation of the
QMP command to be the mandatory first step in any patch series. Any
HMP command must follow and must be implemented by calling the QMP
command handler.
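
To illustrate the layering, here is a rough, self-contained sketch. It does
not use the real QEMU types: the Error struct is a minimal stand-in, and
qmp_cpu_set_power() / hmp_cpu_set_power() are hypothetical names, not
commands that exist in QEMU or in this series. In the real tree the HMP
handler would take a Monitor and a QDict and report failures through
hmp_handle_error().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal stand-in for QEMU's Error object; the real one is richer. */
typedef struct Error {
    const char *msg;
} Error;

/*
 * Hypothetical QMP handler: the single place where validation and the
 * actual state change live.
 */
static void qmp_cpu_set_power(const char *id, bool on, Error **errp)
{
    static Error missing_id = { "parameter 'id' is missing" };

    if (id == NULL || id[0] == '\0') {
        if (errp) {
            *errp = &missing_id;
        }
        return;
    }
    /* Real code would look up the device by id and change its state. */
    (void)on;
}

/*
 * Hypothetical HMP wrapper: no logic of its own, it only forwards to
 * the QMP handler and reports the error. Returns 0 on success.
 */
static int hmp_cpu_set_power(const char *id, bool on)
{
    Error *err = NULL;

    qmp_cpu_set_power(id, on, &err);
    if (err) {
        fprintf(stderr, "Error: %s\n", err->msg);
        return -1;
    }
    return 0;
}
```

The point of the split is that HMP can never do something QMP cannot, and
the QAPI schema stays the authoritative description of the feature.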


> > > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > > index d0e4f35a30..18056cf21d 100644
> > > --- a/hmp-commands.hx
> > > +++ b/hmp-commands.hx
> > > @@ -707,6 +707,36 @@ SRST
> > >    or a QOM object path.
> > >  ERST
> > >  
> > > +{
> > > +    .name       = "device_set",
> > > +    .args_type  = "device:O",
> > > +    .params     = "driver[,prop=value][,...]",
> > > +    .help       = "set/unset existing device property",
> > > +    .cmd        = hmp_device_set,
> > > +    .command_completion = device_set_completion,
> > > +},
> > > +
> > > +SRST
> > > +``device_set`` *driver[,prop=value][,...]*
> > > +  Change the administrative power state of an existing device.
> > > +
> > > +  This command enables or disables a known device (e.g., CPU) using the
> > > +  "device_set" interface. It does not hotplug or add a new device.
> > > +
> > > +  Depending on platform support (e.g., PSCI or ACPI), this may trigger
> > > +  corresponding operational changes — such as powering down a CPU or
> > > +  transitioning it to active use.
> > > +
> > > +  Administrative state:
> > > +    * *enabled*  — Allows the guest to use the device (e.g., CPU_ON)
> > > +    * *disabled* — Prevents guest use; device is powered off (e.g., CPU_OFF)
> > > +
> > > +  Note: The device must already exist (be declared during machine creation).
> > > +
> > > +  Example:
> > > +      (qemu) device_set host-arm-cpu,core-id=3,admin-state=disabled
> > > +ERST  
> > 
> > How exactly is the device selected?  You provide a clue above: 'can be
> > located by "id" or via driver+property match'.
> > 
> > I assume by "id" is just like device_del, i.e. by qdev ID or QOM path.
> > 
> > By "driver+property match" is not obvious.  Which of the arguments are
> > for matching, and which are for setting?
> > 
> > If "id" is specified, is there any matching?
> > 
> > The matching feature complicates this interface quite a bit.  I doubt
> > it's worth the complexity.  If you think it is, please split it off into
> > a separate patch.
> 
> It's likely /me who is to blame for asking to invent a generic
> device-set QMP command.
> I see another application (besides ARM CPU power-on/off) for it:
> PCI devices, to simulate powering them on/off at runtime without
> actually removing the device.
> 
> wrt the command,
> I'd use only 'id' with it to identify the target device
> (i.e. no template matching nor QMP path either).
> This enforces the rule that what the user hasn't named explicitly by
> providing 'id' isn't meant to be accessed/managed by the user later on.
> 
> Potentially we can invent a specialized power_set/get command as
> an alternative if it makes the design easier.
> But then we would be spawning similar commands for other things,
> whereas device-set would cover it all. But then I might be
> over-complicating things by suggesting a generic approach.

The generic set/get design feels convenient because you don't
need to create new commands, but it has significant downsides
both for QEMU and the users of QEMU.

From a QEMU POV the main burden is that we lose understanding
of how users of QEMU are consuming our interface / functionality
at a conceptual level and at the low level. This in turn means
we either struggle to offer a stable API, or our hands are tied
behind our back for future changes.

Consumers of QEMU are similarly exposed to the raw low level
details, which has many downsides:

 * If QEMU ever changes impl, but retains the conceptual
   functionality, apps are broken.
 * If a given feature is more complex than a single property,
   apps will be invoking a whole set of commands to set many
   props to achieve a given task.
 * If certain sequences of prop changes are needed, apps
   have no guidance on the ordering dependencies - which
   might even change between QEMU versions
 * If setting one prop fails, apps may need to manually
   roll back previous prop changes they made
 * The schema is unable to describe what functionality is
   now available since device properties are invisible.
 * If two devices expose the same functionality, but via
   different properties, apps have inconsistent interfaces

I'm highly sceptical that exposing 'device_set' is a good
idea.
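
As a rough illustration of the rollback burden, consider a toy model of a
generic per-property setter. Everything in this sketch (ToyDevice,
toy_prop_set(), client_apply_all()) is hypothetical and exists only for
illustration; none of it is QEMU API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NPROPS = 3 };

/* Toy device with three integer properties; purely illustrative. */
typedef struct ToyDevice {
    int props[NPROPS];
    int reject_index;   /* index whose set fails, or -1 for none */
} ToyDevice;

/* Generic setter in the style of a "set any property" command. */
static bool toy_prop_set(ToyDevice *dev, size_t idx, int value)
{
    assert(dev != NULL);
    if (idx >= NPROPS || (int)idx == dev->reject_index) {
        return false;
    }
    dev->props[idx] = value;
    return true;
}

/*
 * What every client ends up writing: remember the old values, apply
 * the new ones one by one, and undo the earlier ones on failure.
 */
static bool client_apply_all(ToyDevice *dev, const int new_vals[NPROPS])
{
    int old[NPROPS];
    size_t i;

    for (i = 0; i < NPROPS; i++) {
        old[i] = dev->props[i];
        if (!toy_prop_set(dev, i, new_vals[i])) {
            /* Undo props 0..i-1, newest first. The undo itself could
             * fail in a real system; that is ignored here. */
            while (i-- > 0) {
                (void)toy_prop_set(dev, i, old[i]);
            }
            return false;
        }
    }
    return true;
}
```

A purpose-built command would let QEMU perform the whole change, and any
rollback, internally, so that each application does not have to
reimplement this loop.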

> > Next question.  Is there a way for management applications to detect
> > whether a certain device supports device_set for a certain property?
> 
> is there some kind of QMP command to check what does a device support,
> or at least what properties it supports? Can we piggy-back on that?

Note, querying whether a device supports a property is conceptually
quite different from querying whether QEMU supports a given operation,
because it requires apps to first connect the dots between the low
level property change, and the conceptual effect they want to produce.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 22/24] monitor,qdev: Introduce 'device_set' to change admin state of existing devices
  2025-10-09 12:51     ` Igor Mammedov
  2025-10-09 14:03       ` Daniel P. Berrangé
@ 2025-10-09 14:55       ` Markus Armbruster
  2025-10-09 15:19         ` Peter Maydell
  2025-10-17 14:50         ` Igor Mammedov
  1 sibling, 2 replies; 67+ messages in thread
From: Markus Armbruster @ 2025-10-09 14:55 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: salil.mehta, qemu-devel, qemu-arm, mst, salil.mehta, maz,
	jean-philippe, jonathan.cameron, lpieralisi, peter.maydell,
	richard.henderson, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, gshan, rafael, borntraeger,
	alex.bennee, gustavo.romero, npiggin, harshpb, linux, darren,
	ilkka, vishnu, gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, devel

Igor Mammedov <imammedo@redhat.com> writes:

> On Thu, 09 Oct 2025 10:55:40 +0200
> Markus Armbruster <armbru@redhat.com> wrote:
>
>> salil.mehta@opnsrc.net writes:
>> 
>> > From: Salil Mehta <salil.mehta@huawei.com>
>> >
>> > This patch adds a "device_set" interface for modifying properties of devices
>> > that already exist in the guest topology. Unlike 'device_add'/'device_del'
>> > (hot-plug), 'device_set' does not create or destroy devices. It is intended
>> > for guest-visible hot-add semantics where hardware is provisioned at boot but
>> > logically enabled/disabled later via administrative policy.
>> >
>> > Compared to the existing 'qom-set' command, which is less intuitive and works
>> > only with object IDs, device_set provides a more device-oriented interface.
>> > It can be invoked at the QEMU prompt using natural device arguments, and the
>> > new '-deviceset' CLI option allows properties to be set at boot time, similar
>> > to how '-device' specifies device creation.  
>> 
>> Why can't we use -device?
>
> that was my concern/suggestion in reply to the cover letter
> (as a place to put high level review and what can be done for the next revision)

Yes.

> (PS: It looks like I'm having email receiving issues (i.e. not getting from
> the mailing list my own emails that it bounces to me, so threading is all
> broken on my side and I might miss replies). But on the positive side it
> looks like my replies reach the list and CCed just fine)

For what it's worth, your replies arrive fine here.

>> > While the initial implementation focuses on "admin-state" changes (e.g.,
>> > enable/disable a CPU already described by ACPI/DT), the interface is designed
>> > to be generic. In future, it could be used for other per-device set/unset
>> > style controls — beyond administrative power-states — provided the target
>> > device explicitly allows such changes. This enables fine-grained runtime
>> > control of device properties.  
>> 
>> Beware, designing a generic interface can be harder, sometimes much
>> harder, than designing a specialized one.
>> 
>> device_add and qom-set are generic, and they have issues:
>> 
>> * device_add effectively bypasses QAPI by using 'gen': false.
>> 
>>   This bypasses QAPI's enforcement of documentation.  Property
>>   documentation is separate and poor.
>> 
>>   It also defeats introspection with query-qmp-schema.  You need to
>>   resort to other means instead, say QOM introspection (which is a bag
>>   of design flaws on its own), then map from QOM to qdev.
>> 
>> * device_add lets you specify any qdev property, even properties that
>>   are intended only for use by C code.
>> 
>>   This results in accidental external interfaces.
>> 
>>   We tend to name properties like "x-prop" to discourage external use,
>>   but I wouldn't bet my own money on us getting that always right.
>>   Moreover, there's beauties like "x-origin".
>> 
>> * qom-set & friends effectively bypass QAPI by using type 'any'.
>> 
>>   Again, the bypass results in poor documentation and a defeat of
>>   query-qmp-schema.
>> 
>> * qom-set lets you mess with any QOM property with a setter callback.
>> 
>>   Again, accidental external interfaces: most of these properties are
>>   not meant for use with qom-set.  For some, qom-set works, for some it
>>   silently does nothing, and for some it crashes.  A lot more dangerous
>>   than device_add.
>> 
>>   The "x-" convention can't help here: some properties are intended for
>>   external use with object-add, but not with qom-set.
>> 
>> We should avoid such issues in new interfaces.

[...]

>> > diff --git a/hmp-commands.hx b/hmp-commands.hx
>> > index d0e4f35a30..18056cf21d 100644
>> > --- a/hmp-commands.hx
>> > +++ b/hmp-commands.hx
>> > @@ -707,6 +707,36 @@ SRST
>> >    or a QOM object path.
>> >  ERST
>> >  
>> > +{
>> > +    .name       = "device_set",
>> > +    .args_type  = "device:O",
>> > +    .params     = "driver[,prop=value][,...]",
>> > +    .help       = "set/unset existing device property",
>> > +    .cmd        = hmp_device_set,
>> > +    .command_completion = device_set_completion,
>> > +},
>> > +
>> > +SRST
>> > +``device_set`` *driver[,prop=value][,...]*
>> > +  Change the administrative power state of an existing device.
>> > +
>> > +  This command enables or disables a known device (e.g., CPU) using the
>> > +  "device_set" interface. It does not hotplug or add a new device.
>> > +
>> > +  Depending on platform support (e.g., PSCI or ACPI), this may trigger
>> > +  corresponding operational changes — such as powering down a CPU or
>> > +  transitioning it to active use.
>> > +
>> > +  Administrative state:
>> > +    * *enabled*  — Allows the guest to use the device (e.g., CPU_ON)
>> > +    * *disabled* — Prevents guest use; device is powered off (e.g., CPU_OFF)
>> > +
>> > +  Note: The device must already exist (be declared during machine creation).
>> > +
>> > +  Example:
>> > +      (qemu) device_set host-arm-cpu,core-id=3,admin-state=disabled
>> > +ERST  
>> 
>> How exactly is the device selected?  You provide a clue above: 'can be
>> located by "id" or via driver+property match'.
>> 
>> I assume by "id" is just like device_del, i.e. by qdev ID or QOM path.
>> 
>> By "driver+property match" is not obvious.  Which of the arguments are
>> for matching, and which are for setting?
>> 
>> If "id" is specified, is there any matching?
>> 
>> The matching feature complicates this interface quite a bit.  I doubt
>> it's worth the complexity.  If you think it is, please split it off into
>> a separate patch.
>
> It's likely /me who is to blame for asking to invent a generic
> device-set QMP command.
> I see another application (beside ARM CPU power-on/off) for it:
> PCI devices, to simulate powering them on/off at runtime without
> actually removing the device.

I prefer generic commands over collecting ad hoc single-purpose
commands, too.  Getting the design right can be difficult.

> wrt command,
> I'd use only 'id' with it to identify the target device
> (i.e. no template matching nor QOM path either).
> To enforce the rule that what the user hasn't named explicitly by
> providing 'id' isn't meant to be accessed/managed by the user later on.

Works well, except when we need to access / manage onboard devices.
That's still an unsolved problem.

> potentially we can invent specialized power_set/get command as
> an alternative if it makes design easier.
> But then we would be spawning similar commands for other things,
> whereas device-set would cover it all. But then I might be
> over-complicating things by suggesting a generic approach.

Unclear.

I feel it's best to start the design process with envisaged uses.  Can
you tell me a bit more about the uses you have in mind?

>> Next question.  Is there a way for management applications to detect
>> whether a certain device supports device_set for a certain property?
>
> is there some kind of QMP command to check what does a device support,
> or at least what properties it supports? Can we piggy-back on that?

Maybe.

QAPI schema introspection (query-qmp-schema) has been a success.  It has
a reasonably expressive type system, deprecation information, and hides
much implementation detail.  Sadly, it doesn't cover most of QOM and all
of qdev due to QAPI schema bypass.

QOM type introspection (qom-list-types and qom-list-properties) is weak.
You can retrieve a property's name and type.  The latter is seriously
underspecified, and somewhere between annoying and impossible to use
reliably.  Properties created in certain ways are not visible here.
These are rare.

QOM object introspection (qom-list) is the same for concrete objects
rather than types.

qdev introspection (device-list-properties) is like QOM type
introspection.  I'm not sure why it exists.  Use QOM type introspection
instead.

QOM introspection is serviceable for checking whether a certain property
exists.  Examining a property's type is inadvisable.
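
A minimal sketch in C of that "existence only" check, assuming the
management application has already parsed the reply of
qom-list-properties into an array of name/type pairs.  The struct and
helper names are hypothetical, and the type strings in any such array
are deliberately never inspected.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one introspected property. */
struct prop_info {
    const char *name;
    const char *type;   /* carried along, but never interpreted */
};

/* Return true if a property called 'name' is present in the list. */
static bool prop_exists(const struct prop_info *props, size_t n,
                        const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(props[i].name, name) == 0) {
            return true;
        }
    }
    return false;
}
```

This answers "does the property exist?", not "can device_set change it
at run time?", which is the question management applications actually
need answered.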

>> Without that, what are management applications supposed to do?  Hard-code
>> what works?  Run the command and see whether it fails?
>
> Adding libvirt list to discussion and possible ideas on what can be done here.
>
>> I understand right now the command supports just "admin-state" for a
>> certain set of devices, so hard-coding would be possible.  But every new
>> (device, property) pair then requires management application updates,
>> and the hard-coded information becomes version specific.  This will
>> become unworkable real quick.  Not good enough for a command designed to
>> be generic.

[...]



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 22/24] monitor,qdev: Introduce 'device_set' to change admin state of existing devices
  2025-10-09 14:55       ` Markus Armbruster
@ 2025-10-09 15:19         ` Peter Maydell
  2025-10-10  4:59           ` Markus Armbruster
  2025-10-17 14:50         ` Igor Mammedov
  1 sibling, 1 reply; 67+ messages in thread
From: Peter Maydell @ 2025-10-09 15:19 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Igor Mammedov, salil.mehta, qemu-devel, qemu-arm, mst,
	salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	richard.henderson, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, gshan, rafael, borntraeger,
	alex.bennee, gustavo.romero, npiggin, harshpb, linux, darren,
	ilkka, vishnu, gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, devel

On Thu, 9 Oct 2025 at 15:56, Markus Armbruster <armbru@redhat.com> wrote:
> qdev introspection (device-list-properties) is like QOM type
> introspection.  I'm not sure why it exists.

It exists because it is the older of the two interfaces:
device-list-properties was added in 2012, whereas
qom-list-properties was only added in 2018.

device-list-properties also does some device-specific
sanitization that may or may not be helpful: it won't
let you try it on an abstract base class, for instance,
and it won't list "legacy-" properties.
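
As a rough illustration of those two sanitization steps (a sketch of
the behaviour described above, not QEMU's actual code; type_desc and
list_device_props() are made-up names):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a device type; not a QEMU structure. */
struct type_desc {
    const char *name;
    bool is_abstract;
    const char *const *props;   /* every property name on the type */
    size_t nprops;
};

static bool is_legacy_prop(const char *name)
{
    return strncmp(name, "legacy-", strlen("legacy-")) == 0;
}

/*
 * Copy the listable property names into out[] (capacity 'cap').
 * Returns the number of names copied, or -1 for an abstract type.
 */
static int list_device_props(const struct type_desc *t,
                             const char **out, size_t cap)
{
    size_t n = 0;

    if (t->is_abstract) {
        return -1;              /* refuse abstract base classes */
    }
    for (size_t i = 0; i < t->nprops && n < cap; i++) {
        if (!is_legacy_prop(t->props[i])) {
            out[n++] = t->props[i];   /* "legacy-" names are hidden */
        }
    }
    return (int)n;
}
```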

One problem you don't mention with QOM introspection is
that we have no marking for whether properties are intended
to be user-facing knobs, configurable things to be set
by other parts of QEMU, or purely details of the implementation.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 14/24] arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms
  2025-10-07 12:06       ` Igor Mammedov
@ 2025-10-10  3:00         ` Salil Mehta
  0 siblings, 0 replies; 67+ messages in thread
From: Salil Mehta @ 2025-10-10  3:00 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Salil Mehta, qemu-devel@nongnu.org, qemu-arm@nongnu.org,
	mst@redhat.com, maz@kernel.org, jean-philippe@linaro.org,
	Jonathan Cameron, lpieralisi@kernel.org, peter.maydell@linaro.org,
	richard.henderson@linaro.org, armbru@redhat.com,
	andrew.jones@linux.dev, david@redhat.com, philmd@linaro.org,
	eric.auger@redhat.com, will@kernel.org, ardb@kernel.org,
	oliver.upton@linux.dev, pbonzini@redhat.com, gshan@redhat.com,
	rafael@kernel.org, borntraeger@linux.ibm.com,
	alex.bennee@linaro.org, gustavo.romero@linaro.org,
	npiggin@gmail.com, harshpb@linux.ibm.com, linux@armlinux.org.uk,
	darren@os.amperecomputing.com, ilkka@os.amperecomputing.com,
	vishnu@os.amperecomputing.com, gankulkarni@os.amperecomputing.com,
	karl.heubaum@oracle.com, miguel.luis@oracle.com, zhukeqian,
	wangxiongfeng (C), wangyanan (Y), Wangzhou (B), Linuxarm,
	jiakernel2@gmail.com, maobibo@loongson.cn, lixianglai@loongson.cn,
	shahuang@redhat.com, zhao1.liu@intel.com


Hi Igor,

On Tue, Oct 7, 2025 at 12:06 PM Igor Mammedov <imammedo@redhat.com> wrote:

> On Tue, 7 Oct 2025 11:15:47 +0000
> Salil Mehta <salil.mehta@huawei.com> wrote:
>
> > Hi Igor,
> >
> > Thanks for the reviews and sorry for the late reply. Please find my
> > replies inline.
> >
> >
> > > From: Igor Mammedov <imammedo@redhat.com>
> > > Sent: Friday, October 3, 2025 3:58 PM
> > >
> > > On Wed,  1 Oct 2025 01:01:17 +0000
> > > salil.mehta@opnsrc.net wrote:
> > >
> > > > From: Salil Mehta <salil.mehta@huawei.com>
> > > >
> > > > The existing ACPI CPU hotplug interface is built for x86 platforms
> > > > where CPUs can be inserted or removed and resources are allocated
> > > > dynamically. On ARM, CPUs are never hotpluggable: resources are
> > > > allocated at boot and QOM vCPU objects always exist. Instead, CPUs are
> > > > administratively managed by toggling ACPI _STA to enable or disable
> > > > them, which gives a hotplug-like effect but does not match the x86 model.
> > > >
> > > > Reusing the x86 hotplug AML code would complicate maintenance since
> > > > much of its logic relies on toggling the _STA.Present bit to notify
> > > > OSPM about CPU insertion or removal. Such usage is not architecturally
> > > > valid on ARM, where CPUs cannot appear or disappear at runtime. Mixing
> > > > both models in one interface would increase complexity and make the
> > > > AML harder to extend. A separate path is therefore required. The new
> > > > design is heavily inspired by the CPU hotplug interface but avoids its
> > > > unsuitable semantics.
> > >
> > > Let me ask how much existing CPUHP AML code will become, if you reuse it
> > > and add handling of 'enabled' bit there?
> > >
> > > Would it be the same 700LOC as in this patch, which is basically
> > > duplication of existing CPUHP ACPI interface?
> >
> >
> > It is by design as we have adopted non-hotplug approach now and closely
> > aligned ourselves with what PSCI standard perceives to be the definition
> > of CPU hotplug on ARM platforms - at least, as of today! And it is *NOT*
> > what 'CPU hotplug' means on x86 platform.
>
> There is no argument that they are different but,
> Could you point to PSCI specific parts in this patch?
>

Yes, sure.

https://lore.kernel.org/qemu-devel/20251001010127.3092631-1-salil.mehta@opnsrc.net/T/#m926978ce8b91a1f2cca88b5b579a8aedd9e62d2c
https://lore.kernel.org/qemu-devel/20251001010127.3092631-1-salil.mehta@opnsrc.net/T/#mb96b36ebf68b0455b657ce495cac2aee9fbf0f67



>
> > In crux, this means,
> > 1. Dropping any hotplug/unplug infrastructure and its related
> >     paraphernalia from the ARM implementation till the time the meaning
> >     of physical CPU hotplug is not clear as per the specification. We do
> >     not want to model in Qemu something which does not exist or defined,
> >     especially for the CPU hotplug case.
>
> there is 'opts' config struct that lets user to opt in/out from specific
> AML being generated. You could use that to disable some hotplug only bits
> of AML. Other bits that are more generic/reusable, just refactor/rename
> them to a more generic names.
>
>

Sure, but what you are suggesting is a code reuse strategy, not a design
problem. We can take this as a cleanup activity later on or even in
parallel. I've no reservations about that. Why hold ARM patches hostage
to these benign optimizations?

BTW, that 'opts' is one of the ugliest parts of this function. Why do
other architectures have to worry about initializing legacy bits of x86
in the common code before calling the AML function?


> > 2. This also means *NOT* enabling the  ACPI_CPU_HOTPLUG compilation
> >      switch to preserve the sanctity of the clean design.
>
> that's semantics, I'd suggest renaming that to ACPI_CPU.
>


It is not about this. To make it hotplug agnostic we would need to start
with new minimal code and then incrementally add on top of it what is
present in acpi/cpu.c. This can be done in parallel by

1. accepting the minimal new code. This will keep the ball rolling for ARM
2. adding the x86 stuff incrementally over that minimal new AML file (with
   another name)
3. testing that the new file doesn't break x86 functionality
4. replacing the old file acpi/cpu.c with the new common file.


>
> > 3. Yes, there is a code duplicity for now but that’s a case of further
> >     optimization and cleanup not a design issue. Some of them are:
> >     (1)  ACPI device-check and eject-request  handling code can be
> >     extracted and made generic for all devices not just for CPUs.
>
> make it more generic in acpi/cpu.c, instead of copying.
> I don't have any objections to refactoring existing code if it makes sense
> and we can share the code.
>

In principle I agree with your point about carving out common code that
works for all. I'm only asking that this be done in a non-disruptive way,
without holding the current ARM patches for this change. This change will
require time and, most importantly, testing across many architectures.


>
> >     (2)  Right now, acpi/cpu.c is assuming that resources and templates
> >      should be same for all the CPUs using CPUs AML described in it.
> >      There is no need for such a restriction. Every platform should be
> >      free to choose the way it wants to manage the resources and the
> >      interpretation of the fields inside it.
>
> be more specific why you'd need different resources/MMIO layout for this
> series?
>

IIRC in RFC V5, you objected to adding the 'enabled' bit as it was breaking
the x86 ABI because of some backward compatibility issue?


>
> The thing is if we copied every time when we needed something that's a bit
> different, we would end up with unsupportable/bloated QEMU.
>

For sure. I totally agree with it, but it was an informed decision here
internally to keep the ACPI part separate for this series because of the
past review comments and the difficulty in dealing with the very
minimalistic changes we proposed for the ACPI part.


>
> >     (3) Call backs used with GED  makes an assumption of HOTPLUG
> >     interface etc.
> >     (4) In fact, the prototype of the GED event handler makes a similar
> >      mistake of assuming that GED is only meant for devices supporting
> >      hotplug when this is not the case even as per the ACPI specification.
> please be more specific and point to problematic code.
>

void build_ged_aml(Aml *table, const char *name, HotplugHandler *hotplug_dev,
                   uint32_t ged_irq, AmlRegionSpace rs, hwaddr ged_base)


GED is not just for hotplug handling.


>
> current acpi/cpu.c might be compiled under ACPI_CPU_HOTPLUG knob but it's
> not really limited to hotplug, the reason for being compiled as such is
> that hotplug was the sole reason for building CPUs AML at all.
>

Got it. So, as you rightly suggested, it needs refactoring. The only request
I'm making is that we do it in parallel and incrementally, without holding
the ARM patches for this change.


>
> What I see in the patch is simplifying current code somewhat by dropping
> some hotplug related bits and a bunch of renaming.
> Otherwise it's pretty much duplicating current acpi/cpu.c.
>
> Beside that simplification, I don't see any reason why duplicating such
> amount is good idea.
> Consider making existing acpi/cpu.c more generic instead.
>

The initial idea was to make it generic enough for any device which uses
device-check and eject-request for adding/removing the device, but we left
it for later discussions.


> > RFC V5 was an attempt to implement this feature using the hotplug
> > infrastructure and this RFC V6 is a deviation from previous approach
> > towards non-hotplug. We do not want a hotchpotch approach because that’s
> > a recipe for future disaster.
> >
> >
> > Many Thanks!
> > Salil.
> >
> >
> > >
> > > >
> > > > This patch adds a dedicated CPU OSPM (Operating System Power
> > > > Management) interface. It provides a memory-mapped control region with
> > > > selector, flags, command, and data fields, and AML methods for
> > > > device-check, eject request, and _OST reporting. OSPM is notified
> > > > through GED events and can coordinate CPU events directly with QEMU.
> > > > Other ARM-like architectures may also use this interface.
> > > >
> > > > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > > > ---
> > > >  hw/acpi/Kconfig                        |   3 +
> > > >  hw/acpi/acpi-cpu-ospm-interface-stub.c |  41 ++
> > > >  hw/acpi/cpu_ospm_interface.c           | 747 +++++++++++++++++++++++++
> > > >  hw/acpi/meson.build                    |   2 +
> > > >  hw/acpi/trace-events                   |  17 +
> > > >  hw/arm/Kconfig                         |   1 +
> > > >  include/hw/acpi/cpu_ospm_interface.h   |  78 +++
> > > >  7 files changed, 889 insertions(+)
> > > >  create mode 100644 hw/acpi/acpi-cpu-ospm-interface-stub.c
> > > >  create mode 100644 hw/acpi/cpu_ospm_interface.c
> > > >  create mode 100644 include/hw/acpi/cpu_ospm_interface.h
> > > >
> > > > diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
> > > > index 1d4e9f0845..aa52f0468f 100644
> > > > --- a/hw/acpi/Kconfig
> > > > +++ b/hw/acpi/Kconfig
> > > > @@ -21,6 +21,9 @@ config ACPI_ICH9
> > > >  config ACPI_CPU_HOTPLUG
> > > >      bool
> > > >
> > > > +config ACPI_CPU_OSPM_INTERFACE
> > > > +    bool
> > > > +
> > > >  config ACPI_MEMORY_HOTPLUG
> > > >      bool
> > > >      select MEM_DEVICE
> > > > diff --git a/hw/acpi/acpi-cpu-ospm-interface-stub.c b/hw/acpi/acpi-cpu-ospm-interface-stub.c
> > > > new file mode 100644
> > > > index 0000000000..f6f333f641
> > > > --- /dev/null
> > > > +++ b/hw/acpi/acpi-cpu-ospm-interface-stub.c
> > > > @@ -0,0 +1,41 @@
> > > > +/*
> > > > + * ACPI CPU OSPM Interface Handling.
> > > > + *
> > > > + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> > > > + *
> > > > + * Author: Salil Mehta <salil.mehta@huawei.com>
> > > > + *
> > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or modify
> > > > + * it under the terms of the GNU General Public License as published by
> > > > + * the Free Software Foundation; either version 2 of the License, or
> > > > + * (at your option) any later version.
> > > > + */
> > > > +
> > > > +#include "qemu/osdep.h"
> > > > +#include "hw/acpi/cpu_ospm_interface.h"
> > > > +
> > > > +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                              uint32_t event_st, Error **errp) { }
> > > > +
> > > > +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                               uint32_t event_st, Error **errp) { }
> > > > +
> > > > +void acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                       Error **errp) { }
> > > > +
> > > > +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> > > > +                                        AcpiCpuOspmState *state,
> > > > +                                        hwaddr base_addr) { }
> > > > +
> > > > +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st,
> > > > +                           ACPIOSTInfoList ***list) { }
> > > > diff --git a/hw/acpi/cpu_ospm_interface.c b/hw/acpi/cpu_ospm_interface.c
> > > > new file mode 100644
> > > > index 0000000000..61aab8a793
> > > > --- /dev/null
> > > > +++ b/hw/acpi/cpu_ospm_interface.c
> > > > @@ -0,0 +1,747 @@
> > > > +/*
> > > > + * ACPI CPU OSPM Interface Handling.
> > > > + *
> > > > + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> > > > + *
> > > > + * Author: Salil Mehta <salil.mehta@huawei.com>
> > > > + *
> > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or modify
> > > > + * it under the terms of the GNU General Public License as published by
> > > > + * the Free Software Foundation; either version 2 of the License, or
> > > > + * (at your option) any later version.
> > > > + */
> > > > +
> > > > +#include "qemu/osdep.h"
> > > > +#include "migration/vmstate.h"
> > > > +#include "hw/core/cpu.h"
> > > > +#include "qapi/error.h"
> > > > +#include "trace.h"
> > > > +#include "qapi/qapi-events-acpi.h"
> > > > +#include "hw/acpi/cpu_ospm_interface.h"
> > > > +
> > > > +/* CPU identifier and resource device */
> > > > +#define CPU_NAME_FMT      "C%.03X" /* CPU name format (e.g., C001) */
> > > > +#define CPU_RES_DEVICE    "CPUR" /* CPU resource device name */
> > > > +#define CPU_DEVICE        "CPUS" /* CPUs device name */
> > > > +#define CPU_LOCK          "CPLK" /* CPU lock object */
> > > > +/* ACPI method(_STA, _EJ0, etc.) handlers */
> > > > +#define CPU_STS_METHOD    "CSTA" /* CPU status method (_STA.Enabled) */
> > > > +#define CPU_SCAN_METHOD   "CSCN" /* CPU scan method for enumeration */
> > > > +#define CPU_NOTIFY_METHOD "CTFY" /* Notify method for CPU events */
> > > > +#define CPU_EJECT_METHOD  "CEJ0" /* CPU eject method (_EJ0) */
> > > > +#define CPU_OST_METHOD    "COST" /* OSPM status reporting (_OST) */
> > > > +/* CPU MMIO region fields (in PRST region) */
> > > > +#define CPU_SELECTOR      "CSEL" /* CPU selector index (WO) */
> > > > +#define CPU_ENABLED_F     "CPEN" /* Flag: CPU enabled status(_STA) (RO) */
> > > > +#define CPU_DEVCHK_F      "CDCK" /* Flag: Device-check event (RW) */
> > > > +#define CPU_EJECTRQ_F     "CEJR" /* Flag: Eject-request event (RW)*/
> > > > +#define CPU_EJECT_F       "CEJ0" /* Flag: Ejection trigger (WO) */
> > > > +#define CPU_COMMAND       "CCMD" /* Command register (RW) */
> > > > +#define CPU_DATA          "CDAT" /* Data register (RW) */
> > > > +
> > > > + /*
> > > > + * CPU OSPM Interface MMIO Layout (Total: 16 bytes)
> > > > + *
> > > > + * +--------+--------+--------+--------+--------+--------+--------+--------+
> > > > + * |  0x00  |  0x01  |  0x02  |  0x03  |  0x04  |  0x05  |  0x06  |  0x07  |
> > > > + * +--------+--------+--------+--------+--------+--------+--------+--------+
> > > > + * |       Selector (DWord, write-only)         | Flags  |Command |Reserved|
> > > > + * |                                            | (RO/RW)|  (WO)  |(2B pad)|
> > > > + * |        4 bytes (32 bits)                   | 1B     |   1B   | 2B     |
> > > > + * +-----------------------------------------------------------------------+
> > > > + * |  0x08  |  0x09  |  0x0A  |  0x0B  |  0x0C  |  0x0D  |  0x0E  |  0x0F  |
> > > > + * +--------+--------+--------+--------+--------+--------+--------+--------+
> > > > + * |                        Data (QWord, read/write)                       |
> > > > + * |               Used by CPU scan and _OST methods (64 bits)             |
> > > > + * +-----------------------------------------------------------------------+
> > > > + *
> > > > + * Field Overview:
> > > > + *
> > > > + * - Selector: 4 bytes @0x00 (DWord, WO)
> > > > + *               - Selects target CPU index for the current operation.
> > > > + * - Flags:    1 byte  @0x04 (RO/RW)
> > > > + *               - Bit 0: ENABLED  – CPU is powered on (RO)
> > > > + *               - Bit 1: DEVCHK   – Device-check completed (RW)
> > > > + *               - Bit 2: EJECTRQ  – Guest requests CPU eject (RW)
> > > > + *               - Bit 3: EJECT    – Trigger CPU ejection (WO)
> > > > + *               - Bits 4–7: Reserved (write 0)
> > > > + * - Command:  1 byte  @0x05 (WO)
> > > > + *               - Specifies control operation (e.g., scan, _OST, eject).
> > > > + * - Reserved: 2 bytes @0x06–0x07
> > > > + *               - Alignment padding; must be zero on write.
> > > > + * - Data:     8 bytes @0x08 (QWord, RW)
> > > > + *               - Input/output for command-specific data.
> > > > + *               - Used by CPU scan or _OST.
> > > > + */
> > > > +
> > > > +/*
> > > > + * Macros defining the CPU MMIO region layout. Change field sizes here to
> > > > + * alter the overall MMIO region size.
> > > > + */
> > > > +/* Sub-Field sizes (in bytes) */
> > > > +#define ACPI_CPU_MR_SELECTOR_SIZE  4 /* Write-only (DWord access) */
> > > > +#define ACPI_CPU_MR_FLAGS_SIZE     1 /* Read-write (Byte access) */
> > > > +#define ACPI_CPU_MR_RES_FLAGS_SIZE 0 /* Reserved padding */
> > > > +#define ACPI_CPU_MR_CMD_SIZE       1 /* Write-only (Byte access) */
> > > > +#define ACPI_CPU_MR_RES_CMD_SIZE   2 /* Reserved padding */
> > > > +#define ACPI_CPU_MR_CMD_DATA_SIZE  8 /* Read-write (QWord access) */
> > > > +
> > > > +#define ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE \
> > > > +    MAX_CONST(ACPI_CPU_MR_CMD_DATA_SIZE, \
> > > > +    MAX_CONST(ACPI_CPU_MR_SELECTOR_SIZE, \
> > > > +    MAX_CONST(ACPI_CPU_MR_CMD_SIZE, ACPI_CPU_MR_FLAGS_SIZE)))
> > > > +
> > > > +/* Validate layout against exported total length */
> > > > +_Static_assert(ACPI_CPU_OSPM_IF_REG_LEN ==
> > > > +               (ACPI_CPU_MR_SELECTOR_SIZE +
> > > > +                ACPI_CPU_MR_FLAGS_SIZE +
> > > > +                ACPI_CPU_MR_RES_FLAGS_SIZE +
> > > > +                ACPI_CPU_MR_CMD_SIZE +
> > > > +                ACPI_CPU_MR_RES_CMD_SIZE +
> > > > +                ACPI_CPU_MR_CMD_DATA_SIZE),
> > > > +               "ACPI_CPU_OSPM_IF_REG_LEN mismatch with internal MMIO layout");
> > > > +
> > > > +/* Sub-Field sizes (in bits) */
> > > > +#define ACPI_CPU_MR_SELECTOR_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_SELECTOR_SIZE * BITS_PER_BYTE)  /* Write-only (DWord Acc) */
> > > > +#define ACPI_CPU_MR_FLAGS_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_FLAGS_SIZE * BITS_PER_BYTE)     /* Read-write (Byte Acc) */
> > > > +#define ACPI_CPU_MR_RES_FLAGS_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_RES_FLAGS_SIZE * BITS_PER_BYTE) /* Reserved padding */
> > > > +#define ACPI_CPU_MR_CMD_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_CMD_SIZE * BITS_PER_BYTE)       /* Write-only (Byte Acc) */
> > > > +#define ACPI_CPU_MR_RES_CMD_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_RES_CMD_SIZE * BITS_PER_BYTE)   /* Reserved padding */
> > > > +#define ACPI_CPU_MR_CMD_DATA_SIZE_BITS \
> > > > +    (ACPI_CPU_MR_CMD_DATA_SIZE * BITS_PER_BYTE)  /* Read-write (QWord Acc) */
> > > > +
> > > > +/* Field offsets (in bytes) */
> > > > +#define ACPI_CPU_MR_SELECTOR_OFFSET_WO  0
> > > > +#define ACPI_CPU_MR_FLAGS_OFFSET_RW \
> > > > +    (ACPI_CPU_MR_SELECTOR_OFFSET_WO + \
> > > > +     ACPI_CPU_MR_SELECTOR_SIZE)
> > > > +#define ACPI_CPU_MR_CMD_OFFSET_WO \
> > > > +    (ACPI_CPU_MR_FLAGS_OFFSET_RW + \
> > > > +     ACPI_CPU_MR_FLAGS_SIZE + \
> > > > +     ACPI_CPU_MR_RES_FLAGS_SIZE)
> > > > +#define ACPI_CPU_MR_CMD_DATA_OFFSET_RW \
> > > > +    (ACPI_CPU_MR_CMD_OFFSET_WO + \
> > > > +     ACPI_CPU_MR_CMD_SIZE + \
> > > > +     ACPI_CPU_MR_RES_CMD_SIZE)
> > > > +
> > > > +/* ensure all offsets are at their natural size alignment boundaries */
> > > > +#define STATIC_ASSERT_FIELD_ALIGNMENT(offset, type, field_name)      \
> > > > +    _Static_assert((offset) % sizeof(type) == 0,                     \
> > > > +                   field_name " is not aligned to its natural boundary")
> > > > +
> > > > +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_SELECTOR_OFFSET_WO,
> > > > +                              uint32_t, "Selector");
> > > > +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_FLAGS_OFFSET_RW,
> > > > +                              uint8_t, "Flags");
> > > > +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_CMD_OFFSET_WO,
> > > > +                              uint8_t, "Command");
> > > > +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_CMD_DATA_OFFSET_RW,
> > > > +                              uint64_t, "Command Data");
> > > > +
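
To make the offset arithmetic above easier to follow, here is a minimal
standalone sketch of the same derivation. The byte sizes are assumptions
picked only so that the numbers work out (the real ACPI_CPU_MR_*_SIZE values
are defined in the header and are not visible in this hunk), and the names
SEL_SIZE, field_offset() etc. are hypothetical, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field sizes in bytes; the real values live in the header. */
enum {
    SEL_SIZE = 4,       /* selector, DWord access */
    FLAGS_SIZE = 1,     /* flags, Byte access */
    RES_FLAGS_SIZE = 1, /* assumed padding after flags */
    CMD_SIZE = 1,       /* command, Byte access */
    RES_CMD_SIZE = 1,   /* assumed padding after command */
    DATA_SIZE = 8       /* command data, QWord access */
};

/* Offsets chained the same way as the ACPI_CPU_MR_*_OFFSET_* macros. */
enum {
    SEL_OFF = 0,
    FLAGS_OFF = SEL_OFF + SEL_SIZE,
    CMD_OFF = FLAGS_OFF + FLAGS_SIZE + RES_FLAGS_SIZE,
    DATA_OFF = CMD_OFF + CMD_SIZE + RES_CMD_SIZE,
    REG_LEN = DATA_OFF + DATA_SIZE
};

/* Natural-alignment checks, as STATIC_ASSERT_FIELD_ALIGNMENT does. */
_Static_assert(SEL_OFF % sizeof(uint32_t) == 0, "Selector misaligned");
_Static_assert(DATA_OFF % sizeof(uint64_t) == 0, "Command Data misaligned");

/* Return the byte offset of field n (0..3), or -1 if n is out of range. */
static int field_offset(int n)
{
    static const int off[] = { SEL_OFF, FLAGS_OFF, CMD_OFF, DATA_OFF };

    return (n >= 0 && n < 4) ? off[n] : -1;
}
```

With these assumed sizes the padding after the command byte is what pushes
the 64-bit data field onto an 8-byte boundary; changing either padding size
without the other would trip the second static assertion at build time.
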
> > > > +/* Flag bit positions (used within 'flags' subfield) */
> > > > +#define ACPI_CPU_FLAGS_USED_BITS 4
> > > > +#define ACPI_CPU_MR_FLAGS_BIT_ENABLED BIT(0)
> > > > +#define ACPI_CPU_MR_FLAGS_BIT_DEVCHK  BIT(1)
> > > > +#define ACPI_CPU_MR_FLAGS_BIT_EJECTRQ BIT(2)
> > > > +#define ACPI_CPU_MR_FLAGS_BIT_EJECT   BIT(ACPI_CPU_FLAGS_USED_BITS - 1)
> > > > +
> > > > +#define ACPI_CPU_MR_RES_FLAG_BITS (BITS_PER_BYTE - ACPI_CPU_FLAGS_USED_BITS)
> > > > +
> > > > +enum {
> > > > +    ACPI_GET_NEXT_CPU_WITH_EVENT_CMD = 0,
> > > > +    ACPI_OST_EVENT_CMD = 1,
> > > > +    ACPI_OST_STATUS_CMD = 2,
> > > > +    ACPI_CMD_MAX
> > > > +};
> > > > +
> > > > +#define AML_APPEND_MR_RESVD_FIELD(mr_field, size_bits)       \
> > > > +    do {                                                        \
> > > > +        if ((size_bits) != 0) {                                 \
> > > > +            aml_append((mr_field), aml_reserved_field(size_bits)); \
> > > > +        }                                                       \
> > > > +    } while (0)
> > > > +
> > > > +#define AML_APPEND_MR_NAMED_FIELD(mr_field, name, size_bits)    \
> > > > +    do {                                                        \
> > > > +        if ((size_bits) != 0) {                                 \
> > > > +            aml_append((mr_field), aml_named_field((name), (size_bits))); \
> > > > +        }                                                       \
> > > > +    } while (0)
> > > > +
> > > > +#define AML_CPU_RES_DEV(base, field) \
> > > > +        aml_name("%s.%s.%s", (base), CPU_RES_DEVICE, (field))
> > > > +
> > > > +static ACPIOSTInfo *
> > > > +acpi_cpu_ospm_ost_status(int idx, AcpiCpuOspmStateStatus *cdev)
> > > > +{
> > > > +    ACPIOSTInfo *info = g_new0(ACPIOSTInfo, 1);
> > > > +
> > > > +    info->source = cdev->ost_event;
> > > > +    info->status = cdev->ost_status;
> > > > +    if (cdev->cpu) {
> > > > +        DeviceState *dev = DEVICE(cdev->cpu);
> > > > +        if (dev->id) {
> > > > +            info->device = g_strdup(dev->id);
> > > > +        }
> > > > +    }
> > > > +    return info;
> > > > +}
> > > > +
> > > > +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st, ACPIOSTInfoList ***list)
> > > > +{
> > > > +    ACPIOSTInfoList ***tail = list;
> > > > +    int i;
> > > > +
> > > > +    for (i = 0; i < cpu_st->dev_count; i++) {
> > > > +        QAPI_LIST_APPEND(*tail, acpi_cpu_ospm_ost_status(i, &cpu_st->devs[i]));
> > > > +    }
> > > > +}
> > > > +
> > > > +static uint64_t
> > > > +acpi_cpu_ospm_intf_mr_read(void *opaque, hwaddr addr, unsigned size)
> > > > +{
> > > > +    AcpiCpuOspmState *cpu_st = opaque;
> > > > +    AcpiCpuOspmStateStatus *cdev;
> > > > +    uint64_t val = 0;
> > > > +
> > > > +    if (cpu_st->selector >= cpu_st->dev_count) {
> > > > +        return val;
> > > > +    }
> > > > +    cdev = &cpu_st->devs[cpu_st->selector];
> > > > +    switch (addr) {
> > > > +    case ACPI_CPU_MR_FLAGS_OFFSET_RW:
> > > > +        val |= qdev_check_enabled(DEVICE(cdev->cpu)) ?
> > > > +                                  ACPI_CPU_MR_FLAGS_BIT_ENABLED : 0;
> > > > +        val |= cdev->devchk_pending ? ACPI_CPU_MR_FLAGS_BIT_DEVCHK : 0;
> > > > +        val |= cdev->ejrqst_pending ? ACPI_CPU_MR_FLAGS_BIT_EJECTRQ : 0;
> > > > +        trace_acpi_cpuos_if_read_flags(cpu_st->selector, val);
> > > > +        break;
> > > > +    case ACPI_CPU_MR_CMD_DATA_OFFSET_RW:
> > > > +        switch (cpu_st->command) {
> > > > +        case ACPI_GET_NEXT_CPU_WITH_EVENT_CMD:
> > > > +           val = cpu_st->selector;
> > > > +           break;
> > > > +        default:
> > > > +           trace_acpi_cpuos_if_read_invalid_cmd_data(cpu_st->selector,
> > > > +                                                     cpu_st->command);
> > > > +           break;
> > > > +        }
> > > > +        trace_acpi_cpuos_if_read_cmd_data(cpu_st->selector, val);
> > > > +        break;
> > > > +    default:
> > > > +        break;
> > > > +    }
> > > > +    return val;
> > > > +}
> > > > +
> > > > +static void
> > > > +acpi_cpu_ospm_intf_mr_write(void *opaque, hwaddr addr, uint64_t data,
> > > > +                            unsigned int size)
> > > > +{
> > > > +    AcpiCpuOspmState *cpu_st = opaque;
> > > > +    AcpiCpuOspmStateStatus *cdev;
> > > > +    ACPIOSTInfo *info;
> > > > +
> > > > +    assert(cpu_st->dev_count);
> > > > +    if (addr) {
> > > > +        if (cpu_st->selector >= cpu_st->dev_count) {
> > > > +            trace_acpi_cpuos_if_invalid_idx_selected(cpu_st->selector);
> > > > +            return;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    switch (addr) {
> > > > +    case ACPI_CPU_MR_SELECTOR_OFFSET_WO: /* current CPU selector */
> > > > +        cpu_st->selector = data;
> > > > +        trace_acpi_cpuos_if_write_idx(cpu_st->selector);
> > > > +        break;
> > > > +    case ACPI_CPU_MR_FLAGS_OFFSET_RW: /* set is_* fields  */
> > > > +        cdev = &cpu_st->devs[cpu_st->selector];
> > > > +        if (data & ACPI_CPU_MR_FLAGS_BIT_DEVCHK) {
> > > > +            /* clear device-check pending event */
> > > > +            cdev->devchk_pending = false;
> > > > +            trace_acpi_cpuos_if_clear_devchk_evt(cpu_st->selector);
> > > > +        } else if (data & ACPI_CPU_MR_FLAGS_BIT_EJECTRQ) {
> > > > +            /* clear eject-request pending event */
> > > > +            cdev->ejrqst_pending = false;
> > > > +            trace_acpi_cpuos_if_clear_ejrqst_evt(cpu_st->selector);
> > > > +        } else if (data & ACPI_CPU_MR_FLAGS_BIT_EJECT) {
> > > > +            DeviceState *dev = NULL;
> > > > +            if (!cdev->cpu || cdev->cpu == first_cpu) {
> > > > +                trace_acpi_cpuos_if_ejecting_invalid_cpu(cpu_st->selector);
> > > > +                break;
> > > > +            }
> > > > +            /*
> > > > +             * OSPM has returned with eject. Hence, it is now safe to put the
> > > > +             * cpu device on powered-off state.
> > > > +             */
> > > > +            trace_acpi_cpuos_if_ejecting_cpu(cpu_st->selector);
> > > > +            dev = DEVICE(cdev->cpu);
> > > > +            qdev_sync_disable(dev, &error_fatal);
> > > > +        }
> > > > +        break;
> > > > +    case ACPI_CPU_MR_CMD_OFFSET_WO:
> > > > +        trace_acpi_cpuos_if_write_cmd(cpu_st->selector, data);
> > > > +        if (data < ACPI_CMD_MAX) {
> > > > +            cpu_st->command = data;
> > > > +            if (cpu_st->command == ACPI_GET_NEXT_CPU_WITH_EVENT_CMD) {
> > > > +                uint32_t iter = cpu_st->selector;
> > > > +
> > > > +                do {
> > > > +                    cdev = &cpu_st->devs[iter];
> > > > +                    if (cdev->devchk_pending || cdev->ejrqst_pending) {
> > > > +                        cpu_st->selector = iter;
> > > > +                        trace_acpi_cpuos_if_cpu_has_events(cpu_st->selector,
> > > > +                            cdev->devchk_pending, cdev->ejrqst_pending);
> > > > +                        break;
> > > > +                    }
> > > > +                    iter = iter + 1 < cpu_st->dev_count ? iter + 1 : 0;
> > > > +                } while (iter != cpu_st->selector);
> > > > +            }
> > > > +        }
> > > > +        break;
> > > > +    case ACPI_CPU_MR_CMD_DATA_OFFSET_RW:
> > > > +        switch (cpu_st->command) {
> > > > +        case ACPI_OST_EVENT_CMD: {
> > > > +           cdev = &cpu_st->devs[cpu_st->selector];
> > > > +           cdev->ost_event = data;
> > > > +           trace_acpi_cpuos_if_write_ost_ev(cpu_st->selector, cdev->ost_event);
> > > > +           break;
> > > > +        }
> > > > +        case ACPI_OST_STATUS_CMD: {
> > > > +           cdev = &cpu_st->devs[cpu_st->selector];
> > > > +           cdev->ost_status = data;
> > > > +           info = acpi_cpu_ospm_ost_status(cpu_st->selector, cdev);
> > > > +           qapi_event_send_acpi_device_ost(info);
> > > > +           qapi_free_ACPIOSTInfo(info);
> > > > +           trace_acpi_cpuos_if_write_ost_status(cpu_st->selector,
> > > > +                                                cdev->ost_status);
> > > > +           break;
> > > > +        }
> > > > +        default:
> > > > +           trace_acpi_cpuos_if_write_invalid_cmd(cpu_st->selector,
> > > > +                                                 cpu_st->command);
> > > > +           break;
> > > > +        }
> > > > +        break;
> > > > +    default:
> > > > +        trace_acpi_cpuos_if_write_invalid_offset(cpu_st->selector, addr);
> > > > +        break;
> > > > +    }
> > > > +}
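
The wrap-around search performed for ACPI_GET_NEXT_CPU_WITH_EVENT_CMD in the
write handler above can be looked at in isolation. The following is only a
sketch of that loop under simplifying assumptions: struct cpu_events and
next_cpu_with_event() are hypothetical stand-ins for AcpiCpuOspmStateStatus
and the inline do/while, and tracing is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU pending flags used in the patch. */
struct cpu_events {
    bool devchk_pending;
    bool ejrqst_pending;
};

/*
 * Starting at 'start', walk the array once with wrap-around and return the
 * index of the first CPU that has a pending event. If nothing is pending,
 * return 'start' unchanged, which mirrors the selector being left as-is.
 */
static uint32_t next_cpu_with_event(const struct cpu_events *devs,
                                    uint32_t count, uint32_t start)
{
    uint32_t iter = start;

    assert(count > 0 && start < count);
    do {
        if (devs[iter].devchk_pending || devs[iter].ejrqst_pending) {
            return iter;
        }
        iter = (iter + 1 < count) ? iter + 1 : 0;
    } while (iter != start);
    return start;
}
```

Because the search wraps, the returned index can be lower than the starting
one, which appears to be why the AML scan method further down compares the
value read back from the data field against the current UID before using it.
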
> > > > +
> > > > +static const MemoryRegionOps cpu_common_mr_ops = {
> > > > +    .read = acpi_cpu_ospm_intf_mr_read,
> > > > +    .write = acpi_cpu_ospm_intf_mr_write,
> > > > +    .endianness = DEVICE_LITTLE_ENDIAN,
> > > > +    .valid = {
> > > > +        .min_access_size = 1,
> > > > +        .max_access_size = ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
> > > > +    },
> > > > +    .impl = {
> > > > +        .min_access_size = 1,
> > > > +        .max_access_size = ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
> > > > +        .unaligned = false,
> > > > +    },
> > > > +};
> > > > +
> > > > +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> > > > +                                        AcpiCpuOspmState *state,
> > > > +                                        hwaddr base_addr)
> > > > +{
> > > > +    MachineState *machine = MACHINE(qdev_get_machine());
> > > > +    MachineClass *mc = MACHINE_GET_CLASS(machine);
> > > > +    const CPUArchIdList *id_list;
> > > > +    int i;
> > > > +
> > > > +    assert(mc->possible_cpu_arch_ids);
> > > > +    id_list = mc->possible_cpu_arch_ids(machine);
> > > > +    state->dev_count = id_list->len;
> > > > +    state->devs = g_new0(typeof(*state->devs), state->dev_count);
> > > > +    for (i = 0; i < id_list->len; i++) {
> > > > +        state->devs[i].cpu =  CPU(id_list->cpus[i].cpu);
> > > > +        state->devs[i].arch_id = id_list->cpus[i].arch_id;
> > > > +    }
> > > > +    memory_region_init_io(&state->ctrl_reg, owner, &cpu_common_mr_ops, state,
> > > > +                          "ACPI CPU OSPM State Interface Memory Region",
> > > > +                          ACPI_CPU_OSPM_IF_REG_LEN);
> > > > +    memory_region_add_subregion(as, base_addr, &state->ctrl_reg);
> > > > +}
> > > > +
> > > > +static AcpiCpuOspmStateStatus *
> > > > +acpi_get_cpu_status(AcpiCpuOspmState *cpu_st, DeviceState *dev)
> > > > +{
> > > > +    CPUClass *k = CPU_GET_CLASS(dev);
> > > > +    uint64_t cpu_arch_id = k->get_arch_id(CPU(dev));
> > > > +    int i;
> > > > +
> > > > +    for (i = 0; i < cpu_st->dev_count; i++) {
> > > > +        if (cpu_arch_id == cpu_st->devs[i].arch_id) {
> > > > +            return &cpu_st->devs[i];
> > > > +        }
> > > > +    }
> > > > +    return NULL;
> > > > +}
> > > > +
> > > > +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                              uint32_t event_st, Error **errp)
> > > > +{
> > > > +    AcpiCpuOspmStateStatus *cdev;
> > > > +    cdev = acpi_get_cpu_status(cpu_st, dev);
> > > > +    if (!cdev) {
> > > > +        return;
> > > > +    }
> > > > +    assert(cdev->cpu);
> > > > +
> > > > +    /*
> > > > +     * Tell OSPM via GED IRQ(GSI) that a powered-off cpu is being powered-on.
> > > > +     * Also, mark 'device-check' event pending for this cpu. This will
> > > > +     * eventually result in OSPM evaluating the ACPI _EVT method and scan of
> > > > +     * cpus
> > > > +     */
> > > > +    cdev->devchk_pending = true;
> > > > +    acpi_send_event(cpu_st->acpi_dev, event_st);
> > > > +}
> > > > +
> > > > +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                              uint32_t event_st, Error **errp)
> > > > +{
> > > > +    AcpiCpuOspmStateStatus *cdev;
> > > > +    cdev = acpi_get_cpu_status(cpu_st, dev);
> > > > +    if (!cdev) {
> > > > +        return;
> > > > +    }
> > > > +    assert(cdev->cpu);
> > > > +
> > > > +    /*
> > > > +     * Tell OSPM via GED IRQ(GSI) that a cpu wants to power-off or go on standby.
> > > > +     * Also, mark 'eject-request' event pending for this cpu. (graceful shutdown)
> > > > +     */
> > > > +    cdev->ejrqst_pending = true;
> > > > +    acpi_send_event(cpu_st->acpi_dev, event_st);
> > > > +}
> > > > +
> > > > +void
> > > > +acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev, Error **errp)
> > > > +{
> > > > +    /* TODO: possible handling here */
> > > > +}
> > > > +
> > > > +static const VMStateDescription vmstate_cpu_ospm_state_sts = {
> > > > +    .name = "CPU OSPM state status",
> > > > +    .version_id = 1,
> > > > +    .minimum_version_id = 1,
> > > > +    .fields = (const VMStateField[]) {
> > > > +        VMSTATE_BOOL(devchk_pending, AcpiCpuOspmStateStatus),
> > > > +        VMSTATE_BOOL(ejrqst_pending, AcpiCpuOspmStateStatus),
> > > > +        VMSTATE_UINT32(ost_event, AcpiCpuOspmStateStatus),
> > > > +        VMSTATE_UINT32(ost_status, AcpiCpuOspmStateStatus),
> > > > +        VMSTATE_END_OF_LIST()
> > > > +    }
> > > > +};
> > > > +
> > > > +const VMStateDescription vmstate_cpu_ospm_state = {
> > > > +    .name = "CPU OSPM state",
> > > > +    .version_id = 1,
> > > > +    .minimum_version_id = 1,
> > > > +    .fields = (const VMStateField[]) {
> > > > +        VMSTATE_UINT32(selector, AcpiCpuOspmState),
> > > > +        VMSTATE_UINT8(command, AcpiCpuOspmState),
> > > > +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(devs, AcpiCpuOspmState,
> > > > +                                             dev_count,
> > > > +                                             vmstate_cpu_ospm_state_sts,
> > > > +                                             AcpiCpuOspmStateStatus),
> > > > +        VMSTATE_END_OF_LIST()
> > > > +    }
> > > > +};
> > > > +
> > > > +void acpi_build_cpus_aml(Aml *table, hwaddr base_addr, const char *root,
> > > > +                         const char *event_handler_method)
> > > > +{
> > > > +    MachineState *machine = MACHINE(qdev_get_machine());
> > > > +    MachineClass *mc = MACHINE_GET_CLASS(machine);
> > > > +    const CPUArchIdList *arch_ids = mc->possible_cpu_arch_ids(machine);
> > > > +    Aml *sb_scope = aml_scope("_SB"); /* System Bus Scope */
> > > > +    Aml *ifctx, *field, *method, *cpu_res_dev, *cpus_dev;
> > > > +    Aml *zero = aml_int(0);
> > > > +    Aml *one = aml_int(1);
> > > > +
> > > > +    cpu_res_dev = aml_device("%s.%s", root, CPU_RES_DEVICE);
> > > > +    {
> > > > +        Aml *crs;
> > > > +
> > > > +        aml_append(cpu_res_dev,
> > > > +            aml_name_decl("_HID", aml_eisaid("PNP0A06")));
> > > > +        aml_append(cpu_res_dev,
> > > > +            aml_name_decl("_UID", aml_string("CPU OSPM Interface resources")));
> > > > +        aml_append(cpu_res_dev, aml_mutex(CPU_LOCK, 0));
> > > > +
> > > > +        crs = aml_resource_template();
> > > > +        aml_append(crs, aml_memory32_fixed(base_addr, ACPI_CPU_OSPM_IF_REG_LEN,
> > > > +                   AML_READ_WRITE));
> > > > +
> > > > +        aml_append(cpu_res_dev, aml_name_decl("_CRS", crs));
> > > > +
> > > > +        /* declare CPU OSPM Interface MMIO region related access fields */
> > > > +        aml_append(cpu_res_dev,
> > > > +                   aml_operation_region("PRST", AML_SYSTEM_MEMORY,
> > > > +                                        aml_int(base_addr),
> > > > +                                        ACPI_CPU_OSPM_IF_REG_LEN));
> > > > +
> > > > +        /*
> > > > +         * define named fields within PRST region with 'Byte' access widths
> > > > +         * and reserve fields with other access width
> > > > +         */
> > > > +        field = aml_field("PRST", AML_BYTE_ACC, AML_NOLOCK, AML_PRESERVE);
> > > > +        /* reserve CPU 'selector' field (size in bits) */
> > > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_SELECTOR_SIZE_BITS);
> > > > +        /* Flag::Enabled Bit(RO) - Read '1' if enabled */
> > > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_ENABLED_F, 1);
> > > > +        /* Flag::Devchk Bit(RW) - Read '1', has an event. Write '1' to clear */
> > > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_DEVCHK_F, 1);
> > > > +        /* Flag::Ejectrq Bit(RW) - Read 1, has event. Write 1 to clear */
> > > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_EJECTRQ_F, 1);
> > > > +        /* Flag::Eject Bit(WO) - OSPM evals _EJx, initiates CPU Eject in QEMU */
> > > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_EJECT_F, 1);
> > > > +        /* Flag::Bit(ACPI_CPU_FLAGS_USED_BITS)-Bit(7) - Reserve left over bits */
> > > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_FLAG_BITS);
> > > > +        /* Reserved space: padding after flags */
> > > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_FLAGS_SIZE_BITS);
> > > > +        /* Command field written by OSPM */
> > > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_COMMAND,
> > > > +                                  ACPI_CPU_MR_CMD_SIZE_BITS);
> > > > +        /* Reserved space: padding after command field */
> > > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_CMD_SIZE_BITS);
> > > > +        /* Command data: 64-bit payload associated with command */
> > > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_CMD_DATA_SIZE_BITS);
> > > > +        aml_append(cpu_res_dev, field);
> > > > +
> > > > +        /*
> > > > +         * define named fields with 'Dword' access widths and reserve fields
> > > > +         * with other access width
> > > > +         */
> > > > +        field = aml_field("PRST", AML_DWORD_ACC, AML_NOLOCK, AML_PRESERVE);
> > > > +        /* CPU selector, write only */
> > > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_SELECTOR,
> > > > +                                  ACPI_CPU_MR_SELECTOR_SIZE_BITS);
> > > > +        aml_append(cpu_res_dev, field);
> > > > +
> > > > +        /*
> > > > +         * define named fields with 'Qword' access widths and reserve fields
> > > > +         * with other access width
> > > > +         */
> > > > +        field = aml_field("PRST", AML_QWORD_ACC, AML_NOLOCK, AML_PRESERVE);
> > > > +        /*
> > > > +         * Reserve space: selector, flags, reserved flags, command, reserved
> > > > +         * command for Qword alignment.
> > > > +         */
> > > > +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_SELECTOR_SIZE_BITS +
> > > > +                                         ACPI_CPU_MR_FLAGS_SIZE_BITS +
> > > > +                                         ACPI_CPU_MR_RES_FLAGS_SIZE_BITS +
> > > > +                                         ACPI_CPU_MR_CMD_SIZE_BITS +
> > > > +                                         ACPI_CPU_MR_RES_CMD_SIZE_BITS);
> > > > +        /* Command data accessible via Qword */
> > > > +        AML_APPEND_MR_NAMED_FIELD(field, CPU_DATA,
> > > > +                                  ACPI_CPU_MR_CMD_DATA_SIZE_BITS);
> > > > +        aml_append(cpu_res_dev, field);
> > > > +    }
> > > > +    aml_append(sb_scope, cpu_res_dev);
> > > > +
> > > > +    cpus_dev = aml_device("%s.%s", root, CPU_DEVICE);
> > > > +    {
> > > > +        Aml *ctrl_lock = AML_CPU_RES_DEV(root, CPU_LOCK);
> > > > +        Aml *cpu_selector = AML_CPU_RES_DEV(root, CPU_SELECTOR);
> > > > +        Aml *is_enabled = AML_CPU_RES_DEV(root, CPU_ENABLED_F);
> > > > +        Aml *dvchk_evt = AML_CPU_RES_DEV(root, CPU_DEVCHK_F);
> > > > +        Aml *ejrq_evt = AML_CPU_RES_DEV(root, CPU_EJECTRQ_F);
> > > > +        Aml *ej_evt = AML_CPU_RES_DEV(root, CPU_EJECT_F);
> > > > +        Aml *cpu_cmd = AML_CPU_RES_DEV(root, CPU_COMMAND);
> > > > +        Aml *cpu_data = AML_CPU_RES_DEV(root, CPU_DATA);
> > > > +        int i;
> > > > +
> > > > +        aml_append(cpus_dev, aml_name_decl("_HID", aml_string("ACPI0010")));
> > > > +        aml_append(cpus_dev, aml_name_decl("_CID", aml_eisaid("PNP0A05")));
> > > > +
> > > > +        method = aml_method(CPU_NOTIFY_METHOD, 2, AML_NOTSERIALIZED);
> > > > +        for (i = 0; i < arch_ids->len; i++) {
> > > > +            Aml *cpu = aml_name(CPU_NAME_FMT, i);
> > > > +            Aml *uid = aml_arg(0);
> > > > +            Aml *event = aml_arg(1);
> > > > +
> > > > +            ifctx = aml_if(aml_equal(uid, aml_int(i)));
> > > > +            {
> > > > +                aml_append(ifctx, aml_notify(cpu, event));
> > > > +            }
> > > > +            aml_append(method, ifctx);
> > > > +        }
> > > > +        aml_append(cpus_dev, method);
> > > > +
> > > > +        method = aml_method(CPU_STS_METHOD, 1, AML_SERIALIZED);
> > > > +        {
> > > > +            Aml *idx = aml_arg(0);
> > > > +            Aml *sta = aml_local(0);
> > > > +            Aml *else_ctx;
> > > > +
> > > > +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> > > > +            aml_append(method, aml_store(idx, cpu_selector));
> > > > +            aml_append(method, aml_store(zero, sta));
> > > > +            ifctx = aml_if(aml_equal(is_enabled, one));
> > > > +            {
> > > > +                /* cpu is present and enabled */
> > > > +                aml_append(ifctx, aml_store(aml_int(0xF), sta));
> > > > +            }
> > > > +            aml_append(method, ifctx);
> > > > +            else_ctx = aml_else();
> > > > +            {
> > > > +                /* cpu is present but disabled */
> > > > +                aml_append(else_ctx, aml_store(aml_int(0xD), sta));
> > > > +            }
> > > > +            aml_append(method, else_ctx);
> > > > +            aml_append(method, aml_release(ctrl_lock));
> > > > +            aml_append(method, aml_return(sta));
> > > > +        }
> > > > +        aml_append(cpus_dev, method);
> > > > +
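
For reference, the two _STA values returned by the status method above (0xF
and 0xD) differ only in the "enabled" bit. A minimal sketch of how they
decompose, using the _STA bit positions from the ACPI specification; the
macro names and cpu_sta() are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* _STA bit positions as defined by the ACPI specification. */
#define STA_PRESENT     (1u << 0) /* device is present */
#define STA_ENABLED     (1u << 1) /* device is enabled and decoding resources */
#define STA_SHOWN_IN_UI (1u << 2) /* device should be shown in the UI */
#define STA_FUNCTIONING (1u << 3) /* device is functioning properly */

/* Build the _STA value for a CPU that is always present, as in the patch. */
static uint32_t cpu_sta(bool enabled)
{
    uint32_t sta = STA_PRESENT | STA_SHOWN_IN_UI | STA_FUNCTIONING;

    if (enabled) {
        sta |= STA_ENABLED;
    }
    return sta;
}
```

So a CPU reported with 0xD stays visible to OSPM as present but not enabled,
rather than disappearing from the namespace.
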
> > > > +        method = aml_method(CPU_EJECT_METHOD, 1, AML_SERIALIZED);
> > > > +        {
> > > > +            Aml *idx = aml_arg(0);
> > > > +
> > > > +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> > > > +            aml_append(method, aml_store(idx, cpu_selector));
> > > > +            aml_append(method, aml_store(one, ej_evt));
> > > > +            aml_append(method, aml_release(ctrl_lock));
> > > > +        }
> > > > +        aml_append(cpus_dev, method);
> > > > +
> > > > +        method = aml_method(CPU_SCAN_METHOD, 0, AML_SERIALIZED);
> > > > +        {
> > > > +            Aml *has_event = aml_local(0); /* Local0: Loop control flag */
> > > > +            Aml *uid = aml_local(1); /* Local1: Current CPU UID */
> > > > +            /* Constants */
> > > > +            Aml *dev_chk = aml_int(1); /* Notify: device check to enable */
> > > > +            Aml *eject_req = aml_int(3); /* Notify: eject for removal */
> > > > +            Aml *next_cpu_cmd = aml_int(ACPI_GET_NEXT_CPU_WITH_EVENT_CMD);
> > > > +
> > > > +            /* Acquire CPU lock */
> > > > +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> > > > +
> > > > +            /* Initialize loop */
> > > > +            aml_append(method, aml_store(zero, uid));
> > > > +            aml_append(method, aml_store(one, has_event));
> > > > +
> > > > +            Aml *while_ctx = aml_while(aml_land(
> > > > +                aml_equal(has_event, one),
> > > > +                aml_lless(uid, aml_int(arch_ids->len))
> > > > +            ));
> > > > +            {
> > > > +                aml_append(while_ctx, aml_store(zero, has_event));
> > > > +                /*
> > > > +                 * Issue scan cmd: QEMU will return next CPU with event in
> > > > +                 * cpu_data
> > > > +                 */
> > > > +                aml_append(while_ctx, aml_store(uid, cpu_selector));
> > > > +                aml_append(while_ctx, aml_store(next_cpu_cmd, cpu_cmd));
> > > > +
> > > > +                /* If scan wrapped around to an earlier UID, exit loop */
> > > > +                Aml *wrap_check = aml_if(aml_lless(cpu_data, uid));
> > > > +                aml_append(wrap_check, aml_break());
> > > > +                aml_append(while_ctx, wrap_check);
> > > > +
> > > > +                /* Set UID to scanned result */
> > > > +                aml_append(while_ctx, aml_store(cpu_data, uid));
> > > > +
> > > > +                /* send CPU device-check(resume) event to OSPM */
> > > > +                Aml *if_devchk = aml_if(aml_equal(dvchk_evt, one));
> > > > +                {
> > > > +                    aml_append(if_devchk,
> > > > +                        aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
> > > > +                    /* clear local device-check event sent flag */
> > > > +                    aml_append(if_devchk, aml_store(one, dvchk_evt));
> > > > +                    aml_append(if_devchk, aml_store(one, has_event));
> > > > +                }
> > > > +                aml_append(while_ctx, if_devchk);
> > > > +
> > > > +                /*
> > > > +                 * send CPU eject-request event to OSPM to gracefully handle
> > > > +                 * OSPM related tasks running on this CPU
> > > > +                 */
> > > > +                Aml *else_ctx = aml_else();
> > > > +                Aml *if_ejrq = aml_if(aml_equal(ejrq_evt, one));
> > > > +                {
> > > > +                    aml_append(if_ejrq,
> > > > +                        aml_call2(CPU_NOTIFY_METHOD, uid, eject_req));
> > > > +                    /* clear local eject-request event sent flag */
> > > > +                    aml_append(if_ejrq, aml_store(one, ejrq_evt));
> > > > +                    aml_append(if_ejrq, aml_store(one, has_event));
> > > > +                }
> > > > +                aml_append(else_ctx, if_ejrq);
> > > > +                aml_append(while_ctx, else_ctx);
> > > > +
> > > > +                /* Increment UID */
> > > > +                aml_append(while_ctx, aml_increment(uid));
> > > > +            }
> > > > +            aml_append(method, while_ctx);
> > > > +
> > > > +            /* Release cpu lock */
> > > > +            aml_append(method, aml_release(ctrl_lock));
> > > > +        }
> > > > +        aml_append(cpus_dev, method);
> > > > +
> > > > +        method = aml_method(CPU_OST_METHOD, 4, AML_SERIALIZED);
> > > > +        {
> > > > +            Aml *uid = aml_arg(0);
> > > > +            Aml *ev_cmd = aml_int(ACPI_OST_EVENT_CMD);
> > > > +            Aml *st_cmd = aml_int(ACPI_OST_STATUS_CMD);
> > > > +
> > > > +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> > > > +            aml_append(method, aml_store(uid, cpu_selector));
> > > > +            aml_append(method, aml_store(ev_cmd, cpu_cmd));
> > > > +            aml_append(method, aml_store(aml_arg(1), cpu_data));
> > > > +            aml_append(method, aml_store(st_cmd, cpu_cmd));
> > > > +            aml_append(method, aml_store(aml_arg(2), cpu_data));
> > > > +            aml_append(method, aml_release(ctrl_lock));
> > > > +        }
> > > > +        aml_append(cpus_dev, method);
> > > > +
> > > > +        /* build Processor object for each processor */
> > > > +        for (i = 0; i < arch_ids->len; i++) {
> > > > +            Aml *dev;
> > > > +            Aml *uid = aml_int(i);
> > > > +
> > > > +            dev = aml_device(CPU_NAME_FMT, i);
> > > > +            aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0007")));
> > > > +            aml_append(dev, aml_name_decl("_UID", uid));
> > > > +
> > > > +            method = aml_method("_STA", 0, AML_SERIALIZED);
> > > > +            aml_append(method, aml_return(aml_call1(CPU_STS_METHOD, uid)));
> > > > +            aml_append(dev, method);
> > > > +
> > > > +            if (CPU(arch_ids->cpus[i].cpu) != first_cpu) {
> > > > +                method = aml_method("_EJ0", 1, AML_NOTSERIALIZED);
> > > > +                aml_append(method, aml_call1(CPU_EJECT_METHOD, uid));
> > > > +                aml_append(dev, method);
> > > > +            }
> > > > +
> > > > +            method = aml_method("_OST", 3, AML_SERIALIZED);
> > > > +            aml_append(method,
> > > > +                aml_call4(CPU_OST_METHOD, uid, aml_arg(0),
> > > > +                          aml_arg(1), aml_arg(2))
> > > > +            );
> > > > +            aml_append(dev, method);
> > > > +            aml_append(cpus_dev, dev);
> > > > +        }
> > > > +    }
> > > > +    aml_append(sb_scope, cpus_dev);
> > > > +    aml_append(table, sb_scope);
> > > > +
> > > > +    method = aml_method(event_handler_method, 0, AML_NOTSERIALIZED);
> > > > +    aml_append(method, aml_call0("\\_SB.CPUS." CPU_SCAN_METHOD));
> > > > +    aml_append(table, method);
> > > > +}
> > > > diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
> > > > index 73f02b9691..6d83396ab4 100644
> > > > --- a/hw/acpi/meson.build
> > > > +++ b/hw/acpi/meson.build
> > > > @@ -8,6 +8,8 @@ acpi_ss.add(files(
> > > >  ))
> > > >  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_true: files('cpu.c', 'cpu_hotplug.c'))
> > > >  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_false: files('acpi-cpu-hotplug-stub.c'))
> > > > +acpi_ss.add(when: 'CONFIG_ACPI_CPU_OSPM_INTERFACE', if_true: files('cpu_ospm_interface.c'))
> > > > +acpi_ss.add(when: 'CONFIG_ACPI_CPU_OSPM_INTERFACE', if_false: files('acpi-cpu-ospm-interface-stub.c'))
> > > >  acpi_ss.add(when: 'CONFIG_ACPI_MEMORY_HOTPLUG', if_true: files('memory_hotplug.c'))
> > > >  acpi_ss.add(when: 'CONFIG_ACPI_MEMORY_HOTPLUG', if_false: files('acpi-mem-hotplug-stub.c'))
> > > >  acpi_ss.add(when: 'CONFIG_ACPI_NVDIMM', if_true: files('nvdimm.c'))
> > > > diff --git a/hw/acpi/trace-events b/hw/acpi/trace-events
> > > > index edc93e703c..c0ecbdd48f 100644
> > > > --- a/hw/acpi/trace-events
> > > > +++ b/hw/acpi/trace-events
> > > > @@ -40,6 +40,23 @@ cpuhp_acpi_fw_remove_cpu(uint32_t idx) "0x%"PRIx32
> > > >  cpuhp_acpi_write_ost_ev(uint32_t slot, uint32_t ev) "idx[0x%"PRIx32"] OST EVENT: 0x%"PRIx32
> > > >  cpuhp_acpi_write_ost_status(uint32_t slot, uint32_t st) "idx[0x%"PRIx32"] OST STATUS: 0x%"PRIx32
> > > >
> > > > +#cpu_ospm_interface.c
> > > > +acpi_cpuos_if_invalid_idx_selected(uint32_t idx) "selector idx[0x%"PRIx32"]"
> > > > +acpi_cpuos_if_read_flags(uint32_t idx, uint8_t flags) "cpu idx[0x%"PRIx32"] flags: 0x%"PRIx8
> > > > +acpi_cpuos_if_write_idx(uint32_t idx) "set active cpu idx: 0x%"PRIx32
> > > > +acpi_cpuos_if_write_cmd(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] cmd: 0x%"PRIx8
> > > > +acpi_cpuos_if_write_invalid_cmd(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] invalid cmd: 0x%"PRIx8
> > > > +acpi_cpuos_if_write_invalid_offset(uint32_t idx, uint64_t addr) "cpu idx[0x%"PRIx32"] invalid offset: 0x%"PRIx64
> > > > +acpi_cpuos_if_read_cmd_data(uint32_t idx, uint32_t data) "cpu idx[0x%"PRIx32"] data: 0x%"PRIx32
> > > > +acpi_cpuos_if_read_invalid_cmd_data(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] invalid cmd: 0x%"PRIx8
> > > > +acpi_cpuos_if_cpu_has_events(uint32_t idx, bool devchk, bool ejrqst) "cpu idx[0x%"PRIx32"] device-check pending: %d, eject-request pending: %d"
> > > > +acpi_cpuos_if_clear_devchk_evt(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> > > > +acpi_cpuos_if_clear_ejrqst_evt(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> > > > +acpi_cpuos_if_ejecting_invalid_cpu(uint32_t idx) "invalid cpu idx[0x%"PRIx32"]"
> > > > +acpi_cpuos_if_ejecting_cpu(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> > > > +acpi_cpuos_if_write_ost_ev(uint32_t idx, uint32_t ev) "cpu idx[0x%"PRIx32"] OST Event: 0x%"PRIx32
> > > > +acpi_cpuos_if_write_ost_status(uint32_t idx, uint32_t st) "cpu idx[0x%"PRIx32"] OST Status: 0x%"PRIx32
> > > > +
> > > >  # pcihp.c
> > > >  acpi_pci_eject_slot(unsigned bsel, unsigned slot) "bsel: %u slot: %u"
> > > >  acpi_pci_unplug(int bsel, int slot) "bsel: %d slot: %d"
> > > > diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
> > > > index 2aa4b5d778..c9991e00c7 100644
> > > > --- a/hw/arm/Kconfig
> > > > +++ b/hw/arm/Kconfig
> > > > @@ -39,6 +39,7 @@ config ARM_VIRT
> > > >      select VIRTIO_MEM_SUPPORTED
> > > >      select ACPI_CXL
> > > >      select ACPI_HMAT
> > > > +    select ACPI_CPU_OSPM_INTERFACE
> > > >
> > > >  config CUBIEBOARD
> > > >      bool
> > > > diff --git a/include/hw/acpi/cpu_ospm_interface.h b/include/hw/acpi/cpu_ospm_interface.h
> > > > new file mode 100644
> > > > index 0000000000..5dda327a34
> > > > --- /dev/null
> > > > +++ b/include/hw/acpi/cpu_ospm_interface.h
> > > > @@ -0,0 +1,78 @@
> > > > +/*
> > > > + * ACPI CPU OSPM Interface Handling.
> > > > + *
> > > > + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> > > > + *
> > > > + * Author: Salil Mehta <salil.mehta@huawei.com>
> > > > + *
> > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or modify
> > > > + * it under the terms of the GNU General Public License as published by
> > > > + * the Free Software Foundation; either version 2 of the License, or
> > > > + * (at your option) any later version.
> > > > + */
> > > > +#ifndef CPU_OSPM_INTERFACE_H
> > > > +#define CPU_OSPM_INTERFACE_H
> > > > +
> > > > +#include "qapi/qapi-types-acpi.h"
> > > > +#include "hw/qdev-core.h"
> > > > +#include "hw/acpi/acpi.h"
> > > > +#include "hw/acpi/aml-build.h"
> > > > +#include "hw/boards.h"
> > > > +
> > > > +/**
> > > > + * Total size (in bytes) of the ACPI CPU OSPM Interface MMIO region.
> > > > + *
> > > > + * This region contains control and status fields such as CPU selector,
> > > > + * flags, command register, and data register. It must exactly match the
> > > > + * layout defined in the AML code and the memory region implementation.
> > > > + *
> > > > + * Any mismatch between this definition and the AML layout may result in
> > > > + * runtime errors or build-time assertion failures (e.g., _Static_assert),
> > > > + * breaking correct device emulation and guest OS coordination.
> > > > + */
> > > > +#define ACPI_CPU_OSPM_IF_REG_LEN 16
> > > > +
> > > > +typedef struct  {
> > > > +    CPUState *cpu;
> > > > +    uint64_t arch_id;
> > > > +    bool devchk_pending; /* device-check pending */
> > > > +    bool ejrqst_pending; /* eject-request pending */
> > > > +    uint32_t ost_event;
> > > > +    uint32_t ost_status;
> > > > +} AcpiCpuOspmStateStatus;
> > > > +
> > > > +typedef struct AcpiCpuOspmState {
> > > > +    DeviceState *acpi_dev;
> > > > +    MemoryRegion ctrl_reg;
> > > > +    uint32_t selector;
> > > > +    uint8_t command;
> > > > +    uint32_t dev_count;
> > > > +    AcpiCpuOspmStateStatus *devs;
> > > > +} AcpiCpuOspmState;
> > > > +
> > > > +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                              uint32_t event_st, Error **errp);
> > > > +
> > > > +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                               uint32_t event_st, Error **errp);
> > > > +
> > > > +void acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> > > > +                       Error **errp);
> > > > +
> > > > +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> > > > +                                        AcpiCpuOspmState *state,
> > > > +                                        hwaddr base_addr);
> > > > +
> > > > +void acpi_build_cpus_aml(Aml *table, hwaddr base_addr, const char *root,
> > > > +                         const char *event_handler_method);
> > > > +
> > > > +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st,
> > > > +                           ACPIOSTInfoList ***list);
> > > > +
> > > > +extern const VMStateDescription vmstate_cpu_ospm_state;
> > > > +#define VMSTATE_CPU_OSPM_STATE(cpuospm, state) \
> > > > +    VMSTATE_STRUCT(cpuospm, state, 1, \
> > > > +                   vmstate_cpu_ospm_state, AcpiCpuOspmState)
> > > > +#endif /* CPU_OSPM_INTERFACE_H */
> > >
> >
>
>
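
[Illustration for the ACPI_CPU_OSPM_IF_REG_LEN comment quoted above.] The
header says that a mismatch between the register length and the AML layout
should surface as a build-time assertion failure. A minimal sketch of how such
a check could look is given below. Only the value 16 and the field names
(selector, flags, command, data) come from the quoted patch; the struct name,
the field offsets, the reserved fields and the range-check helper are
assumptions made purely for illustration and are not taken from the series.

```c
#include <stddef.h>
#include <stdint.h>

#define ACPI_CPU_OSPM_IF_REG_LEN 16   /* value taken from the quoted header */

/*
 * Hypothetical register layout, for illustration only: the quoted header
 * names the fields but does not give their offsets.
 */
typedef struct {
    uint32_t selector;     /* assumed offset 0  */
    uint8_t  flags;        /* assumed offset 4  */
    uint8_t  command;      /* assumed offset 5  */
    uint8_t  reserved[2];  /* assumed padding   */
    uint32_t data;         /* assumed offset 8  */
    uint32_t reserved2;    /* assumed offset 12 */
} OspmRegsSketch;

/* The build fails if the layout and the advertised length drift apart. */
_Static_assert(sizeof(OspmRegsSketch) == ACPI_CPU_OSPM_IF_REG_LEN,
               "register layout must match ACPI_CPU_OSPM_IF_REG_LEN");
_Static_assert(offsetof(OspmRegsSketch, data) == 8,
               "data register moved; the AML field offsets must follow");

/* Returns nonzero if an access of 'size' bytes at 'addr' stays in range. */
static int ospm_access_in_range(uint64_t addr, unsigned size)
{
    return size <= ACPI_CPU_OSPM_IF_REG_LEN &&
           addr <= ACPI_CPU_OSPM_IF_REG_LEN - size;
}
```

A runtime check like ospm_access_in_range() is the kind of guard that would
sit in front of a trace point such as acpi_cpuos_if_write_invalid_offset.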



* Re: [PATCH RFC V6 11/24] hw/arm/acpi: MADT change to size the guest with possible vCPUs
  2025-10-07 12:20       ` Igor Mammedov
@ 2025-10-10  3:15         ` Salil Mehta
  0 siblings, 0 replies; 67+ messages in thread
From: Salil Mehta @ 2025-10-10  3:15 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Salil Mehta, qemu-devel@nongnu.org, qemu-arm@nongnu.org,
	mst@redhat.com, maz@kernel.org, jean-philippe@linaro.org,
	Jonathan Cameron, lpieralisi@kernel.org, peter.maydell@linaro.org,
	richard.henderson@linaro.org, armbru@redhat.com,
	andrew.jones@linux.dev, david@redhat.com, philmd@linaro.org,
	eric.auger@redhat.com, will@kernel.org, ardb@kernel.org,
	oliver.upton@linux.dev, pbonzini@redhat.com, gshan@redhat.com,
	rafael@kernel.org, borntraeger@linux.ibm.com,
	alex.bennee@linaro.org, gustavo.romero@linaro.org,
	npiggin@gmail.com, harshpb@linux.ibm.com, linux@armlinux.org.uk,
	darren@os.amperecomputing.com, ilkka@os.amperecomputing.com,
	vishnu@os.amperecomputing.com, gankulkarni@os.amperecomputing.com,
	karl.heubaum@oracle.com, miguel.luis@oracle.com, zhukeqian,
	wangxiongfeng (C), wangyanan (Y), Wangzhou (B), Linuxarm,
	jiakernel2@gmail.com, maobibo@loongson.cn, lixianglai@loongson.cn,
	shahuang@redhat.com, zhao1.liu@intel.com

[-- Attachment #1: Type: text/plain, Size: 10777 bytes --]

Hi Igor,

On Tue, Oct 7, 2025 at 12:20 PM Igor Mammedov <imammedo@redhat.com> wrote:

> On Tue, 7 Oct 2025 11:34:48 +0000
> Salil Mehta <salil.mehta@huawei.com> wrote:
>
> > Hi Igor,
> >
> > > From: Igor Mammedov <imammedo@redhat.com>
> > > Sent: Friday, October 3, 2025 4:09 PM
> > > To: salil.mehta@opnsrc.net
> > >
> > > On Wed,  1 Oct 2025 01:01:14 +0000
> > > salil.mehta@opnsrc.net wrote:
> > >
> > > > From: Salil Mehta <salil.mehta@huawei.com>
> > > >
> > > > When QEMU builds the MADT table, modifications are needed to include
> > > > information about possible vCPUs that are exposed as ACPI-disabled
> > > > (i.e., `_STA.Enabled=0`).
> > > > This new information will help the guest kernel pre-size its resources
> > > > during boot time. Pre-sizing based on possible vCPUs will facilitate
> > > > the future hot-plugging of the currently disabled vCPUs.
> > > >
> > > > Additionally, this change addresses updates to the ACPI MADT GIC CPU
> > > > interface flags, as introduced in the UEFI ACPI 6.5 specification [1].
> > > > These updates enable deferred virtual CPU onlining in the guest kernel.
> > > >
> > > > Reference:
> > > > [1] 5.2.12.14. GIC CPU Interface (GICC) Structure (Table 5.37 GICC CPU Interface Flags)
> > > >     Link: https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#gic-cpu-interface-gicc-structure
> > > >
> > > > Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> > > > Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> > > > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > > > ---
> > > >  hw/arm/virt-acpi-build.c | 40 ++++++++++++++++++++++++++++++++++------
> > > >  hw/core/machine.c        | 14 ++++++++++++++
> > > >  include/hw/boards.h      | 20 ++++++++++++++++++++
> > > >  3 files changed, 68 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> > > > index b01fc4f8ef..7c24dd6369 100644
> > > > --- a/hw/arm/virt-acpi-build.c
> > > > +++ b/hw/arm/virt-acpi-build.c
> > > > @@ -760,6 +760,32 @@ static void build_append_gicr(GArray *table_data, uint64_t base, uint32_t size)
> > > >      build_append_int_noprefix(table_data, size, 4); /* Discovery Range Length */
> > > >  }
> > > >
> > > > +static uint32_t virt_acpi_get_gicc_flags(CPUState *cpu)
> > > > +{
> > > > +    MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());
> > > > +    const uint32_t GICC_FLAG_ENABLED = BIT(0);
> > > > +    const uint32_t GICC_FLAG_ONLINE_CAPABLE = BIT(3);
> > > > +
> > > > +    /* ARM architecture does not support vCPU hotplug yet */
> > > > +    if (!cpu) {
> > > > +        return 0;
> > > > +    }
> > > > +
> > > > +    /*
> > > > +     * If the machine does not support online-capable CPUs, report the GICC as
> > > > +     * 'enabled' only.
> > > > +     */
> > > > +    if (!mc->has_online_capable_cpus) {
> > > > +        return GICC_FLAG_ENABLED;
> > > > +    }
> > > > +
> > > > +    /*
> > > > +     * ACPI 6.5, 5.2.12.14 (GICC): mark the boot CPU 'enabled' and all others
> > > > +     * 'online-capable'.
> > > > +     */
> > > > +    return (cpu == first_cpu) ? GICC_FLAG_ENABLED : GICC_FLAG_ONLINE_CAPABLE;
> > > > +}
> > > > +
> > > >  static void
> > > >  build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> > > >  {
> > > > @@ -785,12 +811,14 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> > > >      build_append_int_noprefix(table_data, vms->gic_version, 1);
> > > >      build_append_int_noprefix(table_data, 0, 3);   /* Reserved */
> > > >
> > > > -    for (i = 0; i < MACHINE(vms)->smp.cpus; i++) {
> > > > -        ARMCPU *armcpu = ARM_CPU(qemu_get_cpu(i));
> > > > +    for (i = 0; i < MACHINE(vms)->smp.max_cpus; i++) {
> > >                                      ^^^^^^^^^^^^
> > > > +        CPUState *cpu = machine_get_possible_cpu(i);
> > > ...
> > > > +        CPUArchId *archid = machine_get_possible_cpu_arch_id(i);
> > >
> > > What does the complexity above add? (And then you say that
> > > instantiating an ARM VM is slow.)
> > >
> > > I'd drop machine_get_possible_cpu/machine_get_possible_cpu_arch_id
> > > altogether and mimic what acpi_build_madt() does.
> >
> >
> > We can do that here, but I need this function elsewhere in the monitor
> > code as well to iterate over the possible CPUs, and if I remember
> > correctly I was getting compilation errors there. But I will check if
> > this can be removed.
> >
> > I would like to keep machine_get_possible_cpu().
>
> If you iterate over the CPUs with this helper, you are basically
> introducing O(n^2) complexity at that point.
> But those are details; we will sort them out eventually.
>

Sure, you might be right here. I do not intend to disagree. I'll surely
look into it.

Thanks for this.
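
[Illustration of the complexity point above.] The sketch below is not QEMU
code. It only models why calling a linear-search helper once per CPU index
costs O(n^2) comparisons, while walking the array directly costs O(n). The
Slot type and all function names are hypothetical stand-ins for a
possible_cpus[] entry and for helpers such as
machine_get_possible_cpu_arch_id(); the step counters exist only to make the
cost visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a possible_cpus[] entry; not the QEMU type. */
typedef struct {
    int64_t cpu_index;   /* logical index assigned to the CPU */
    uint64_t arch_id;    /* e.g. the MPIDR on Arm */
} Slot;

/* Helper-style lookup: linear search for one cpu_index, O(n) per call. */
static const Slot *find_slot(const Slot *slots, size_t n, int64_t cpu_index,
                             size_t *steps)
{
    for (size_t i = 0; i < n; i++) {
        (*steps)++;
        if (slots[i].cpu_index == cpu_index) {
            return &slots[i];
        }
    }
    return NULL;
}

/* Calling the helper once per index costs O(n^2) comparisons overall. */
static uint64_t sum_ids_via_helper(const Slot *slots, size_t n, size_t *steps)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        const Slot *s = find_slot(slots, n, (int64_t)i, steps);
        if (s) {
            sum += s->arch_id;
        }
    }
    return sum;
}

/* Walking the array directly visits each slot exactly once: O(n). */
static uint64_t sum_ids_direct(const Slot *slots, size_t n, size_t *steps)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        (*steps)++;
        sum += slots[i].arch_id;
    }
    return sum;
}
```

Both functions return the same result; only the number of slot visits
differs, and the gap grows with the number of possible CPUs.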


>
> >
> > I think you've misunderstood the reason for the boot-time delay mentioned
> > to you in RFC V5. It is because of the realization leg, i.e.
> > qdev_realize(), of the vCPU and not because of this initialization leg.
>
> I did misunderstand wrt slow vCPU creation.
>

No issues


> I did object to lazy creation in general, and well I still dislike it.
>

For sure, and I respect your apprehensions. I would like to understand the
technical reasons why you think this approach could be problematic.

We have used this because:
1. It drastically changes the boot time on a 500+ core system. In fact, boot
   time becomes almost constant, independent of the core count. I had even
   shared the numbers for this earlier, in the KVM Forum 2023 conference
   slides.
2. Just for this series, RFC V6, we have a bigger problem in leaving the
   threads of disabled vCPUs running. These add to the KVM lock contention
   during VM initialization, and cpu_reset() can fail due to a failure in
   the ICC_CTLR_EL1 fetch.

Point 2 is a blocker.


> For more on this topic see my reply to the cover letter; let's continue
> the discussion about that there.
>

Sure, I will do that later today.

Sorry for the gaps in my replies. Either my mails are not reaching the
mailing list and the people, or I'm not receiving theirs. For legal reasons
we must keep technical discussion public, so I'm refraining from replying
from my official ID until this problem is identified.

Otherwise the discussion appears under broken links, defeating
traceability. The latter is a legal requirement.

I've been told the lore.kernel.org server is rejecting the emails. I don't
know whom to contact.


>
> >
> >
> > >
> > > > +        uint32_t flags = virt_acpi_get_gicc_flags(cpu);
> > > > +        uint64_t mpidr = archid->arch_id;
> > > >
> > > >          if (vms->gic_version == VIRT_GIC_VERSION_2) {
> > > >              physical_base_address = memmap[VIRT_GIC_CPU].base;
> > > > @@ -805,7 +833,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> > > >          build_append_int_noprefix(table_data, i, 4);    /* GIC ID */
> > > >          build_append_int_noprefix(table_data, i, 4);    /* ACPI Processor UID */
> > > >          /* Flags */
> > > > -        build_append_int_noprefix(table_data, 1, 4);    /* Enabled */
> > > > +        build_append_int_noprefix(table_data, flags, 4);
> > > >          /* Parking Protocol Version */
> > > >          build_append_int_noprefix(table_data, 0, 4);
> > > >          /* Performance Interrupt GSIV */
> > > > @@ -819,7 +847,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> > > >          build_append_int_noprefix(table_data, vgic_interrupt, 4);
> > > >          build_append_int_noprefix(table_data, 0, 8);    /* GICR Base Address*/
> > > >          /* MPIDR */
> > > > -        build_append_int_noprefix(table_data, arm_cpu_mp_affinity(armcpu), 8);
> > > > +        build_append_int_noprefix(table_data, mpidr, 8);
> > > >          /* Processor Power Efficiency Class */
> > > >          build_append_int_noprefix(table_data, 0, 1);
> > > >          /* Reserved */
> > > > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > > > index 69d5632464..65388d859a 100644
> > > > --- a/hw/core/machine.c
> > > > +++ b/hw/core/machine.c
> > > > @@ -1383,6 +1383,20 @@ CPUState *machine_get_possible_cpu(int64_t cpu_index)
> > > >      return NULL;
> > > >  }
> > > >
> > > > +CPUArchId *machine_get_possible_cpu_arch_id(int64_t cpu_index)
> > > > +{
> > > > +    MachineState *ms = MACHINE(qdev_get_machine());
> > > > +    CPUArchIdList *possible_cpus = ms->possible_cpus;
> > > > +
> > > > +    for (int i = 0; i < possible_cpus->len; i++) {
> > > > +        if (possible_cpus->cpus[i].cpu &&
> > > > +            possible_cpus->cpus[i].cpu->cpu_index == cpu_index) {
> > > > +            return &possible_cpus->cpus[i];
> > > > +        }
> > > > +    }
> > > > +    return NULL;
> > > > +}
> > > > +
> > > >  static char *cpu_slot_to_string(const CPUArchId *cpu)
> > > >  {
> > > >      GString *s = g_string_new(NULL);
> > > > diff --git a/include/hw/boards.h b/include/hw/boards.h
> > > > index 3ff77a8b3a..fe51ca58bf 100644
> > > > --- a/include/hw/boards.h
> > > > +++ b/include/hw/boards.h
> > > > @@ -461,6 +461,26 @@ struct MachineState {
> > > >      bool acpi_spcr_enabled;
> > > >  };
> > > >
> > > > +/*
> > > > + * machine_get_possible_cpu_arch_id:
> > > > + * @cpu_index: logical cpu_index to search for
> > > > + *
> > > > + * Return a pointer to the CPUArchId entry matching the given @cpu_index
> > > > + * in the current machine's MachineState. The possible_cpus array holds
> > > > + * the full set of CPUs that the machine could support, including those
> > > > + * that may be created as disabled or taken offline.
> > > > + *
> > > > + * The slot index in ms->possible_cpus[] is always sequential, but the
> > > > + * logical cpu_index values are assigned by QEMU and may or may not be
> > > > + * sequential depending on the implementation of a particular machine.
> > > > + * Direct indexing by cpu_index is therefore unsafe in general. This
> > > > + * helper performs a linear search of the possible_cpus array to find
> > > > + * the matching entry.
> > > > + *
> > > > + * Returns: pointer to the matching CPUArchId, or NULL if not found.
> > > > + */
> > > > +CPUArchId *machine_get_possible_cpu_arch_id(int64_t cpu_index);
> > > > +
> > > >  /*
> > > >   * The macros which follow are intended to facilitate the
> > > >   * definition of versioned machine types, using a somewhat
> > >
> >
>
>
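
[Illustration of the GICC flag selection in the quoted
virt_acpi_get_gicc_flags() hunk.] The decision made there can be modelled
outside QEMU as a small pure function. The bit positions (Enabled = bit 0,
Online Capable = bit 3) are the ones used in the quoted patch; the function
name and its three boolean parameters are hypothetical and replace the
CPUState, first_cpu and MachineClass lookups of the real code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as used in the quoted patch (ACPI 6.5, GICC flags). */
#define GICC_FLAG_ENABLED        (1u << 0)
#define GICC_FLAG_ONLINE_CAPABLE (1u << 3)

/*
 * Hypothetical stand-alone model of the flag selection:
 *  - no CPU object behind the slot: report no flags;
 *  - machine without online-capable support: every CPU is 'enabled';
 *  - otherwise the boot CPU is 'enabled', all others 'online-capable'.
 */
static uint32_t gicc_flags_sketch(bool cpu_present, bool is_boot_cpu,
                                  bool machine_online_capable)
{
    if (!cpu_present) {
        return 0;
    }
    if (!machine_online_capable) {
        return GICC_FLAG_ENABLED;
    }
    return is_boot_cpu ? GICC_FLAG_ENABLED : GICC_FLAG_ONLINE_CAPABLE;
}
```

In this model the two flags are never set together for the same CPU, which
matches the quoted code.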



* Re: [PATCH RFC V6 22/24] monitor,qdev: Introduce 'device_set' to change admin state of existing devices
  2025-10-09 15:19         ` Peter Maydell
@ 2025-10-10  4:59           ` Markus Armbruster
  0 siblings, 0 replies; 67+ messages in thread
From: Markus Armbruster @ 2025-10-10  4:59 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Igor Mammedov, salil.mehta, qemu-devel, qemu-arm, mst,
	salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	richard.henderson, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, gshan, rafael, borntraeger,
	alex.bennee, gustavo.romero, npiggin, harshpb, linux, darren,
	ilkka, vishnu, gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, devel

Peter Maydell <peter.maydell@linaro.org> writes:

> On Thu, 9 Oct 2025 at 15:56, Markus Armbruster <armbru@redhat.com> wrote:
>> qdev introspection (device-list-properties) is like QOM type
>> introspection.  I'm not sure why it exists.
>
> It exists because it is the older of the two interfaces:
> device-list-properties was added in 2012, whereas
> qom-list-properties was only added in 2018.

I suspected it was, but didn't want to make unchecked claims.  Thanks
for checking!

> device-list-properties also does some device-specific
> sanitization that may or may not be helpful: it won't
> let you try it on an abstract base class, for instance,

Introspecting abstract bases is probably not useful.  But what harm
could it do?  Can't see why preventing it is worth the bother.  Of
course, changing it now is not worth the bother, either :)

> and it won't list "legacy-" properties.

I remember these exist, but not what they're good for :)

Should we deprecate device-list-properties in favour of
qom-list-properties?

> One problem you don't mention with QOM introspection is
> that we have no marking for whether properties are intended
> to be user-facing knobs, configurable things to be set
> by other parts of QEMU, or purely details of the implementation.

Yes.  This is what I had in mind when I pointed out "accidental external
interfaces".




* Re: [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (24 preceding siblings ...)
  2025-10-06 14:00 ` [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch Igor Mammedov
@ 2025-10-13  0:34 ` Gavin Shan
  2025-10-22 10:07 ` Gavin Shan
  26 siblings, 0 replies; 67+ messages in thread
From: Gavin Shan @ 2025-10-13  0:34 UTC (permalink / raw)
  To: salil.mehta, qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, zhukeqian1, wangxiongfeng2, wangyanan55, wangzhou1,
	linuxarm, jiakernel2, maobibo, lixianglai, shahuang, zhao1.liu

Hi Salil,

On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
> From: Salil Mehta <salil.mehta@huawei.com>
> 
> [!] Sending again: It looks like mails sent from my official ID are being held
> somewhere. Hence, I am using my other email address. Sorry for any inconvenience
> this may have caused.
> 

[...]

> 
> ============================
> (X) ORGANIZATION OF PATCHES
> ============================
> 
>   [Patch 1-2, 22-23] New HMP/QMP interface ('device_set') related changes
>      (*) New ('DeviceState::admin_power_state') property; Enabled/Disabled States and handling
>      (*) New Qemu CLI parameter ('-smp CPUS, disabled=N') handling
>      (*) Logic to find the existing object not part of the QOM
>   [Patch 3-5, 10] logic required during machine init.
>      (*) Some validation checks.
>      (*) Introduces core-id,socket-id,cluster-id property and some util functions required later.
>      (*) Logic to setup lazy realization of the QOM vCPUs
>      (*) Logic to pre-create vCPUs in the KVM host kernel.
>   [Patch 6-7, 8-9] logic required to size the GICv3 State
>      (*) GIC initialization pre-sized with possible vCPUs.
>      (*) Introduction of the GICv3 CPU Interface `accessibility` property & accessors
>      (*) Refactoring to make KVM & TCG 'GICv3CPUState' initialization common.
>      (*) Changes in GICv3 post/pre-load function for migration
>   [Patch 11,14-16,19] logic related to ACPI at machine init time.
>      (*) ACPI CPU OSPM interface for ACPI _STA.Enable/Disable handling
>      (*) ACPI GED framework to cater to CPU DeviceCheck/Eject Events.
>      (*) ACPI DSDT, MADT changes.
>   [Patch 12-13, 17] Qdev, Virt Machine, PowerState Handler Changes
>      (*) Changes to introduce 'PowerStateHandler' and its abstract interface.
>      (*) Qdev changes to handle the administrative enabling/disabling of device
>      (*) Virt Machine implementation of 'PowerStateHandler' Hooks
>      (*) vCPU thread user-space parking and unparking logic.
>   [Patch 18,20-21,24] Misc.
>      (*) Handling of SMCC Hypercall Exits by KVM to Qemu for PSCI.
>      (*) Mitigation to avoid using 'pause_all_vcpus' during ICC_CTLR_EL1 reset.
>      (*) Mitigation when TCG 'TB Code Cache' is found saturated
> 

[...]

> 
> ================
> (XII) Change Log
> ================
> 

The changelog from RFC V5 -> RFC V6 seems to be missing here?

> RFC V4 -> RFC V5:
> -----------------
> 1. Dropped "[PATCH RFC V4 19/33] target/arm: Force ARM vCPU *present* status ACPI *persistent*"
>     - Separated the architecture agnostic ACPI changes required to support vCPU Hotplug
>       Link: https://lore.kernel.org/qemu-devel/20241014192205.253479-1-salil.mehta@huawei.com/#t
> 2. Dropped "[PATCH RFC V4 02/33] cpu-common: Add common CPU utility for possible vCPUs"
>     - Dropped qemu{present,enabled}_cpu() APIs. Commented by Gavin (Redhat), Miguel(Oracle), Igor(Redhat)
> 3. Added "Reviewed-by: Miguel Luis <miguel.luis@oracle.com>" to [PATCH RFC V4 01/33]
> 3. Dropped the `CPUState::disabled` flag and introduced `GICv3State::num_smp_cpus` flag
>     - All `GICv3CPUState' between [num_smp_cpus,num_cpus) are marked as 'inaccessible` during gicv3_common_realize()
>     - qemu_enabled_cpu() not required - removed!
>     - removed usage of `CPUState::disabled` from virt.c and hw/cpu64.c
> 4. Removed virt_cpu_properties() and introduced property `mp-affinity` get accessor
> 5. Dropped "[PATCH RFC V4 12/33] arm/virt: Create GED device before *disabled* vCPU Objects are destroyed"
> 

[...]

This may be a known issue, but there are a bunch of failing qtests, listed below.

# make -j 60 check-qtest
   :
Summary of Failures:

10/29 qemu:qtest+qtest-aarch64 / qtest-aarch64/xlnx-versal-trng-test     ERROR            0.48s   killed by signal 6 SIGABRT
11/29 qemu:qtest+qtest-aarch64 / qtest-aarch64/xlnx-canfd-test           ERROR            0.49s   killed by signal 6 SIGABRT
12/29 qemu:qtest+qtest-aarch64 / qtest-aarch64/ast2700-gpio-test         ERROR            0.47s   killed by signal 6 SIGABRT
13/29 qemu:qtest+qtest-aarch64 / qtest-aarch64/ast2700-hace-test         ERROR            0.48s   killed by signal 6 SIGABRT
14/29 qemu:qtest+qtest-aarch64 / qtest-aarch64/ast2700-smc-test          ERROR            0.48s   killed by signal 6 SIGABRT
23/29 qemu:qtest+qtest-aarch64 / qtest-aarch64/test-hmp                  ERROR            3.14s   killed by signal 6 SIGABRT
26/29 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test          ERROR            8.83s   killed by signal 6 SIGABRT
27/29 qemu:qtest+qtest-aarch64 / qtest-aarch64/qom-test                  ERROR           12.75s   killed by signal 6 SIGABRT
28/29 qemu:qtest+qtest-aarch64 / qtest-aarch64/qos-test                  ERROR           16.32s   killed by signal 6 SIGABRT

Ok:                20
Fail:              9

Thanks,
Gavin




* Re: [PATCH RFC V6 22/24] monitor,qdev: Introduce 'device_set' to change admin state of existing devices
  2025-10-09 14:55       ` Markus Armbruster
  2025-10-09 15:19         ` Peter Maydell
@ 2025-10-17 14:50         ` Igor Mammedov
  2025-10-20 11:22           ` Markus Armbruster
  1 sibling, 1 reply; 67+ messages in thread
From: Igor Mammedov @ 2025-10-17 14:50 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: salil.mehta, qemu-devel, qemu-arm, mst, salil.mehta, maz,
	jean-philippe, jonathan.cameron, lpieralisi, peter.maydell,
	richard.henderson, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, gshan, rafael, borntraeger,
	alex.bennee, gustavo.romero, npiggin, harshpb, linux, darren,
	ilkka, vishnu, gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, devel

On Thu, 09 Oct 2025 16:55:54 +0200
Markus Armbruster <armbru@redhat.com> wrote:

> Igor Mammedov <imammedo@redhat.com> writes:
> 
> > On Thu, 09 Oct 2025 10:55:40 +0200
> > Markus Armbruster <armbru@redhat.com> wrote:
> >  
> >> salil.mehta@opnsrc.net writes:
> >>   
> >> > From: Salil Mehta <salil.mehta@huawei.com>
> >> >
> >> > This patch adds a "device_set" interface for modifying properties of devices
> >> > that already exist in the guest topology. Unlike 'device_add'/'device_del'
> >> > (hot-plug), 'device_set' does not create or destroy devices. It is intended
> >> > for guest-visible hot-add semantics where hardware is provisioned at boot but
> >> > logically enabled/disabled later via administrative policy.
> >> >
> >> > Compared to the existing 'qom-set' command, which is less intuitive and works
> >> > only with object IDs, device_set provides a more device-oriented interface.
> >> > It can be invoked at the QEMU prompt using natural device arguments, and the
> >> > new '-deviceset' CLI option allows properties to be set at boot time, similar
> >> > to how '-device' specifies device creation.    
> >> 
> >> Why can't we use -device?  
> >
> > > that was my concern/suggestion in reply to the cover letter
> > > (as a place to put high-level review and what can be done for the next revision)
> 
> Yes.
> 
> > > (PS: It looks like I'm having email receiving issues (i.e. not getting from
> > > the mailing list my own emails that it bounces to me, so threading is all
> > > broken on my side and I might miss replies). But on the positive side it
> > > looks like my replies reach the list and the CCed people just fine)
> 
> For what it's worth, your replies arrive fine here.
> 
> >> > While the initial implementation focuses on "admin-state" changes (e.g.,
> >> > enable/disable a CPU already described by ACPI/DT), the interface is designed
> >> > to be generic. In future, it could be used for other per-device set/unset
> >> > style controls — beyond administrative power-states — provided the target
> >> > device explicitly allows such changes. This enables fine-grained runtime
> >> > control of device properties.    
> >> 
> >> Beware, designing a generic interface can be harder, sometimes much
> >> harder, than designing a specialized one.
> >> 
> >> device_add and qom-set are generic, and they have issues:
> >> 
> >> * device_add effectively bypasses QAPI by using 'gen': false.
> >> 
> >>   This bypasses QAPI's enforcement of documentation.  Property
> >>   documentation is separate and poor.
> >> 
> >>   It also defeats introspection with query-qmp-schema.  You need to
> >>   resort to other means instead, say QOM introspection (which is a bag
> >>   of design flaws on its own), then map from QOM to qdev.
> >> 
> >> * device_add lets you specify any qdev property, even properties that
> >>   are intended only for use by C code.
> >> 
> >>   This results in accidental external interfaces.
> >> 
> >>   We tend to name properties like "x-prop" to discourage external use,
> >>   but I wouldn't bet my own money on us getting that always right.
> >>   Moreover, there's beauties like "x-origin".
> >> 
> >> * qom-set & friends effectively bypass QAPI by using type 'any'.
> >> 
> >>   Again, the bypass results in poor documentation and a defeat of
> >>   query-qmp-schema.
> >> 
> >> * qom-set lets you mess with any QOM property with a setter callback.
> >> 
> >>   Again, accidental external interfaces: most of these properties are
> >>   not meant for use with qom-set.  For some, qom-set works, for some it
> >>   silently does nothing, and for some it crashes.  A lot more dangerous
> >>   than device_add.
> >> 
> >>   The "x-" convention can't help here: some properties are intended for
> >>   external use with object-add, but not with qom-set.
> >> 
> >> We should avoid such issues in new interfaces.  
> 
> [...]
> 
> >> > diff --git a/hmp-commands.hx b/hmp-commands.hx
> >> > index d0e4f35a30..18056cf21d 100644
> >> > --- a/hmp-commands.hx
> >> > +++ b/hmp-commands.hx
> >> > @@ -707,6 +707,36 @@ SRST
> >> >    or a QOM object path.
> >> >  ERST
> >> >  
> >> > +{
> >> > +    .name       = "device_set",
> >> > +    .args_type  = "device:O",
> >> > +    .params     = "driver[,prop=value][,...]",
> >> > +    .help       = "set/unset existing device property",
> >> > +    .cmd        = hmp_device_set,
> >> > +    .command_completion = device_set_completion,
> >> > +},
> >> > +
> >> > +SRST
> >> > +``device_set`` *driver[,prop=value][,...]*
> >> > +  Change the administrative power state of an existing device.
> >> > +
> >> > +  This command enables or disables a known device (e.g., CPU) using the
> >> > +  "device_set" interface. It does not hotplug or add a new device.
> >> > +
> >> > +  Depending on platform support (e.g., PSCI or ACPI), this may trigger
> >> > +  corresponding operational changes — such as powering down a CPU or
> >> > +  transitioning it to active use.
> >> > +
> >> > +  Administrative state:
> >> > +    * *enabled*  — Allows the guest to use the device (e.g., CPU_ON)
> >> > +    * *disabled* — Prevents guest use; device is powered off (e.g., CPU_OFF)
> >> > +
> >> > +  Note: The device must already exist (be declared during machine creation).
> >> > +
> >> > +  Example:
> >> > +      (qemu) device_set host-arm-cpu,core-id=3,admin-state=disabled
> >> > +ERST    
> >> 
> >> How exactly is the device selected?  You provide a clue above: 'can be
> >> located by "id" or via driver+property match'.
> >> 
> >> I assume by "id" is just like device_del, i.e. by qdev ID or QOM path.
> >> 
> >> By "driver+property match" is not obvious.  Which of the arguments are
> >> for matching, and which are for setting?
> >> 
> >> If "id" is specified, is there any matching?
> >> 
> >> The matching feature complicates this interface quite a bit.  I doubt
> >> it's worth the complexity.  If you think it is, please split it off into
> >> a separate patch.  
> >
> > > It's likely me who is to blame for asking to invent a generic
> > > device-set QMP command.
> > > I see another application (besides ARM CPU power-on/off) for it:
> > > PCI devices, to simulate powering them on/off at runtime without
> > > actually removing the device.
> 
> I prefer generic commands over collecting ad hoc single-purpose
> commands, too.  Getting the design right can be difficult.
> 
> > > wrt the command,
> > > I'd use only 'id' with it to identify the target device
> > > (i.e. no template matching nor QMP path either),
> > > to enforce the rule that what the user hasn't named explicitly by
> > > providing 'id' isn't meant to be accessed/managed by the user later on.
> 
> Works well, except when we need to access / manage onboard devices.
> That's still an unsolved problem.
> 
> > potentially we can invent specialized power_set/get command as
> > an alternative if it makes design easier.
> > But then we would be spawning similar commands for other things,
> > whereas device-set would cover it all. But then I might be
> > over-complicating things by suggesting a generic approach.   
> 
> Unclear.
> 
> I feel it's best to start the design process with envisaged uses.  Can
> you tell me a bit more about the uses you have in mind?

We have the NIC failover 'feature'
   https://www.qemu.org/docs/master/system/virtio-net-failover.html
To make it work we abuse hotplug, and that poses problems
during migration, since:
  - unplugging the primary device releases its resources (which might
    not be possible to claim back if migration fails)
  - it's similar on the destination side, where an attempt to hotplug
    the primary might fail due to insufficient resources, leaving the
    guest on a 'degraded' virtio-net link.

The idea was that instead of hotplug we can power off the primary device
(it will still exist and keep its resources), initiate migration,
and then on the target do the same, starting with the primary fully realized
but powered off (failing migration early if it can't claim resources,
and safely resuming QEMU on the source incl. the primary link). The guest
failover driver on the destination would then power the primary on as part
of switching to the primary link.

The above would require, at a minimum, -device/device_add support for
specifying a device's power state.
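
The flow described above can be sketched roughly as follows. This is a toy
model in plain C, not QEMU code: Host, PrimaryNic, nic_realize(),
nic_set_power() and migrate_with_power_off() are all hypothetical names made
up for illustration, and "resources" is reduced to a simple counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a host with a limited pool of device resources. */
typedef struct Host {
    int free_resources;
} Host;

/* Hypothetical model of the primary NIC of a failover pair. */
typedef struct PrimaryNic {
    bool has_resources;   /* host resources are claimed */
    bool powered_on;      /* guest can use the device */
} PrimaryNic;

/* Claim host resources for the device; fails if the host has none left. */
bool nic_realize(Host *host, PrimaryNic *nic, bool powered_on)
{
    assert(host && nic);
    if (host->free_resources == 0) {
        return false;
    }
    host->free_resources--;
    nic->has_resources = true;
    nic->powered_on = powered_on;
    return true;
}

/* A power state change never touches the resource claim. */
void nic_set_power(PrimaryNic *nic, bool on)
{
    assert(nic);
    nic->powered_on = on;
}

/*
 * Migrate with the primary powered off rather than unplugged.
 * Returns true if the guest ends up on the destination.
 */
bool migrate_with_power_off(Host *dst_host, PrimaryNic *src, PrimaryNic *dst)
{
    nic_set_power(src, false);          /* source keeps its resources */

    if (!nic_realize(dst_host, dst, false)) {
        /* Fail early: nothing was released, so the source just resumes. */
        nic_set_power(src, true);
        return false;
    }

    /* ... the migration stream would be transferred here ... */

    nic_set_power(dst, true);           /* done by the guest failover driver */
    return true;
}
```

The point of the sketch is only that the failure path leaves the source
device with its resources intact, which is what unplug-based failover cannot
guarantee.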

> >> Next question.  Is there a way for management applications to detect
> >> whether a certain device supports device_set for a certain property?  
> >
> > is there some kind of QMP command to check what does a device support,
> > or at least what properties it supports? Can we piggy-back on that?  
> 
> Maybe.
> 
> QAPI schema introspection (query-qmp-schema) has been a success.  It has
> a reasonably expressive type system, deprecation information, and hides
> much implementation detail.  Sadly, it doesn't cover most of QOM and all
> of qdev due to QAPI schema bypass.
> 
> QOM type introspection (qom-list-types and qom-list-properties) is weak.
> You can retrieve a property's name and type.  The latter is seriously
> underspecified, and somewhere between annoying and impossible to use
> reliably.  Properties created in certain ways are not visible here.
> These are rare.
> 
> QOM object introspection (qom-list) is the same for concrete objects
> rather than types.
> 
> qdev introspection (device-list-properties) is like QOM type
> introspection.  I'm not sure why it exists.  Use QOM type introspection
> instead.
> 
> QOM introspection is serviceable for checking whether a certain property
> exists.  Examining a property's type is unadvisable.
> 
> >> Without that, what are management applications supposed to do?  Hard-code
> >> what works?  Run the command and see whether it fails?  
> >
> > Adding libvirt list to discussion and possible ideas on what can be done here.
> >  
> >> I understand right now the command supports just "admin-state" for a
> >> certain set of devices, so hard-coding would be possible.  But every new
> >> (device, property) pair then requires management application updates,
> >> and the hard-coded information becomes version specific.  This will
> >> become unworkable real quick.  Not good enough for a command designed to
> >> be generic.  
> 
> [...]
> 




* Re: [PATCH RFC V6 22/24] monitor,qdev: Introduce 'device_set' to change admin state of existing devices
  2025-10-17 14:50         ` Igor Mammedov
@ 2025-10-20 11:22           ` Markus Armbruster
  0 siblings, 0 replies; 67+ messages in thread
From: Markus Armbruster @ 2025-10-20 11:22 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: salil.mehta, qemu-devel, qemu-arm, mst, salil.mehta, maz,
	jean-philippe, jonathan.cameron, lpieralisi, peter.maydell,
	richard.henderson, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, gshan, rafael, borntraeger,
	alex.bennee, gustavo.romero, npiggin, harshpb, linux, darren,
	ilkka, vishnu, gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, devel

Igor Mammedov <imammedo@redhat.com> writes:

> On Thu, 09 Oct 2025 16:55:54 +0200
> Markus Armbruster <armbru@redhat.com> wrote:
>
>> Igor Mammedov <imammedo@redhat.com> writes:

[...]

>> > It's likely /me who is to blame for asking to invent a generic
>> > device-set QMP command.
>> > I see another application (besides ARM CPU power-on/off) for it:
>> > PCI devices, to simulate powering them on/off at runtime without
>> > actually removing the device.
>> 
>> I prefer generic commands over collecting ad hoc single-purpose
>> commands, too.  Getting the design right can be difficult.
>> 
>> > wrt command,
>> > I'd use only 'id' with it to identify target device
>> > (i.e. no template matching nor QMP path either).
>> > To enforce the rule: what the user hasn't named explicitly by providing
>> > 'id' isn't meant to be accessed/managed by the user later on.
>> 
>> Works well, except when we need to access / manage onboard devices.
>> That's still an unsolved problem.
>> 
>> > potentially we can invent specialized power_set/get command as
>> > an alternative if it makes design easier.
>> > But then we would be spawning similar commands for other things,
>> > whereas device-set would cover it all. But then I might be
>> > over-complicating things by suggesting a generic approach.   
>> 
>> Unclear.
>> 
>> I feel it's best to start the design process with envisaged uses.  Can
>> you tell me a bit more about the uses you have in mind?
>
> We have nic failover 'feature'
>    https://www.qemu.org/docs/master/system/virtio-net-failover.html
> to make it work we do abuse hotplug and that poses problem
> during migration, since:
>   - unplugging primary device releases resources (which might not be
>     possible to claim back in case of migration failure)

Serious reliability issue with no work-around.

>   - it's similar on destination side, where attempt to hotplug
>     primary might fail due to insufficient resources leaving guest
>     on 'degraded' virtio-net link.

Obvious work-around is failing the migration.  Same as we do when we
can't create devices.

> Idea was that instead of hotplug we can power off primary device,
> (it will still exist and keep resources), initiate migration,
> and then on target do the same starting with primary fully realized
> but powered off (and failing migration early if it can't claim resources,
> safely resuming QEMU on source incl. primary link), and then guest
> failover driver on destination would power primary on as part of
> switching to primary link.

I can see how power on / off makes more sense than hot plug / unplug.

> Above would require -device/device_add support for specifying device's
> power state as minimum.

The obvious way to control a device's power state with -device /
device_add is a qdev property.  Easy enough.

Do we need to control a device's power state after it's created?  If I
understand your use case correctly, the answer is yes.  -device /
device_add can't do that.
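
To make the creation-time vs. run-time distinction concrete, here is a
minimal sketch in plain C. It is not QEMU code and does not use the qdev
property machinery; ToyDevice, AdminState, admin_state_parse(),
toy_device_create() and toy_device_set_admin_state() are hypothetical names
used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical administrative power state of a device. */
typedef enum AdminState {
    ADMIN_STATE_ENABLED,
    ADMIN_STATE_DISABLED,
} AdminState;

/* Hypothetical device carrying the state as a property-like field. */
typedef struct ToyDevice {
    bool realized;
    AdminState admin_state;
} ToyDevice;

/* Parse the textual property value; returns false on unknown input. */
bool admin_state_parse(const char *str, AdminState *out)
{
    assert(str && out);
    if (strcmp(str, "enabled") == 0) {
        *out = ADMIN_STATE_ENABLED;
        return true;
    }
    if (strcmp(str, "disabled") == 0) {
        *out = ADMIN_STATE_DISABLED;
        return true;
    }
    return false;
}

/* Creation path: the value is applied before the device is realized. */
bool toy_device_create(ToyDevice *dev, const char *admin_state)
{
    assert(dev);
    dev->realized = false;
    if (!admin_state_parse(admin_state, &dev->admin_state)) {
        return false;
    }
    dev->realized = true;
    return true;
}

/* Run-time path: only valid on a device that already exists. */
bool toy_device_set_admin_state(ToyDevice *dev, const char *admin_state)
{
    assert(dev);
    if (!dev->realized) {
        return false;
    }
    /* On a parse error the previous state is left untouched. */
    return admin_state_parse(admin_state, &dev->admin_state);
}
```

The creation path is the easy part. The run-time setter is the piece that
needs some command to reach it, and that is what the rest of this discussion
is about.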

qom-set could, but friends don't let friends use it in production.

Any other prior art for controlling device state at run time via QMP?

[...]




* Re: [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch
  2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
                   ` (25 preceding siblings ...)
  2025-10-13  0:34 ` Gavin Shan
@ 2025-10-22 10:07 ` Gavin Shan
  2025-10-24  6:55   ` Gavin Shan
  26 siblings, 1 reply; 67+ messages in thread
From: Gavin Shan @ 2025-10-22 10:07 UTC (permalink / raw)
  To: salil.mehta, qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, zhukeqian1, wangxiongfeng2, wangyanan55, wangzhou1,
	linuxarm, jiakernel2, maobibo, lixianglai, shahuang, zhao1.liu

Hi Salil,

On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
> 
> ===================
> (VII) Commands Used
> ===================
> 
> A. Qemu launch commands to init the machine (with 6 possible vCPUs):
> 
> $ qemu-system-aarch64 --enable-kvm -machine virt,gic-version=3 \
> -cpu host -smp cpus=4,disabled=2 \
> -m 300M \
> -kernel Image \
> -initrd rootfs.cpio.gz \
> -append "console=ttyAMA0 root=/dev/ram rdinit=/init maxcpus=2 acpi=force" \
> -nographic \
> -bios QEMU_EFI.fd \
> 

The parameter 'disabled=2' isn't correct here; it needs to be 'disabledcpus=2'.
Otherwise the VM fails to start because the parameter isn't recognized:

$ /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64       \
   --enable-kvm -machine virt,gic-version=3 -cpu host,sve=off    \
   -smp cpus=4,disabled=2 -m 1024M                               \
   -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image \
   -initrd /home/gavin/sandbox/images/rootfs.cpio.xz -nographic
qemu-system-aarch64: Parameter 'smp.disabled' is unexpected

Thanks,
Gavin




* Re: [PATCH RFC V6 05/24] arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-01  1:01 ` [PATCH RFC V6 05/24] arm/virt, kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init salil.mehta
@ 2025-10-22 10:36   ` Gavin Shan
  2025-10-22 18:18     ` Salil Mehta
  0 siblings, 1 reply; 67+ messages in thread
From: Gavin Shan @ 2025-10-22 10:36 UTC (permalink / raw)
  To: salil.mehta, qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, zhukeqian1, wangxiongfeng2, wangyanan55, wangzhou1,
	linuxarm, jiakernel2, maobibo, lixianglai, shahuang, zhao1.liu,
	Keqian Zhu

Hi Salil,

On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
> From: Salil Mehta <salil.mehta@huawei.com>
> 
> ARM CPU architecture does not allow CPUs to be plugged after system has
> initialized. This is a constraint. Hence, the Kernel must know all the CPUs
> being booted during its initialization. This applies to the Guest Kernel as
> well and therefore, the number of KVM vCPU descriptors in the host must be
> fixed at VM initialization time.
> 
> Also, the GIC must know all the CPUs it is connected to during its
> initialization, and this cannot change afterward. This must also be ensured
> during the initialization of the VGIC in KVM. This is necessary because:
> 
> 1. The association between GICR and MPIDR must be fixed at VM initialization
>     time. This is represented by the register
>     `GICR_TYPER(mp_affinity, proc_num)`.
> 2. Memory regions associated with GICR, etc., cannot be changed (added,
>     deleted, or modified) after the VM has been initialized. This is not an
>     ARM architectural constraint but rather invites a difficult and messy
>     change in VGIC data structures.
> 
> To enable a hot-add–like model while preserving these constraints, the virt
> machine may enumerate more CPUs than are enabled at boot using
> `-smp disabledcpus=N`. Such CPUs are present but start offline (i.e.,
> administratively disabled at init). The topology remains fixed at VM
> creation time; only the online/offline status may change later.
> 
> Administratively disabled vCPUs are not realized in QOM until first enabled,
> avoiding creation of unnecessary vCPU threads at boot. On large systems, this
> reduces startup time proportionally to the number of disabled vCPUs. Once a
> QOM vCPU is realized and its thread created, subsequent enable/disable actions
> do not unrealize it. This behaviour was adopted following review feedback and
> differs from earlier RFC versions.
> 
> Co-developed-by: Keqian Zhu <zhuqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhuqian1@huawei.com>
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> ---
>   accel/kvm/kvm-all.c    |  2 +-
>   hw/arm/virt.c          | 77 ++++++++++++++++++++++++++++++++++++++----
>   hw/core/qdev.c         | 17 ++++++++++
>   include/hw/qdev-core.h | 19 +++++++++++
>   include/system/kvm.h   |  8 +++++
>   target/arm/cpu.c       |  2 ++
>   target/arm/kvm.c       | 40 +++++++++++++++++++++-
>   target/arm/kvm_arm.h   | 11 ++++++
>   8 files changed, 168 insertions(+), 8 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 890d5ea9f8..0e7d9d5c3d 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -460,7 +460,7 @@ static void kvm_reset_parked_vcpus(KVMState *s)
>    *
>    * @returns: 0 when success, errno (<0) when failed.
>    */
> -static int kvm_create_vcpu(CPUState *cpu)
> +int kvm_create_vcpu(CPUState *cpu)
>   {
>       unsigned long vcpu_id = kvm_arch_vcpu_id(cpu);
>       KVMState *s = kvm_state;
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 4ded19dc69..f4eeeacf6c 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -2152,6 +2152,49 @@ static void virt_post_cpus_gic_realized(VirtMachineState *vms,
>       }
>   }
>   
> +static void
> +virt_setup_lazy_vcpu_realization(Object *cpuobj, VirtMachineState *vms)
> +{
> +    /*
> +     * Present & administratively disabled vCPUs:
> +     *
> +     * These CPUs are marked offline at init via '-smp disabledcpus=N'. We
> +     * intentionally do not realize them during the first boot, since it is
> +     * not known if or when they will ever be enabled. The decision to enable
> +     * such CPUs depends on policy (e.g. guided by SLAs or other deployment
> +     * requirements).
> +     *
> +     * Realizing all disabled vCPUs up front would make boot time proportional
> +     * to 'maxcpus', even if policy permits only a small subset to be enabled.
> +     * This can lead to unacceptable boot delays in some scenarios.
> +     *
> +     * Instead, these CPUs remain administratively disabled and unrealized at
> +     * boot, to be instantiated and brought online only if policy later allows
> +     * it.
> +     */
> +
> +    /* set this vCPU to be administratively 'disabled' in QOM */
> +    qdev_disable(DEVICE(cpuobj), NULL, &error_fatal);
> +
> +    if (vms->psci_conduit != QEMU_PSCI_CONDUIT_DISABLED) {
> +        object_property_set_int(cpuobj, "psci-conduit", vms->psci_conduit,
> +                                NULL);
> +    }
> +
> +    /*
> +     * [!] Constraint: The ARM CPU architecture does not permit new CPUs
> +     * to be added after system initialization.
> +     *
> +     * Workaround: Pre-create KVM vCPUs even for those that are not yet
> +     * online i.e. powered-off, keeping them `parked` and in an
> +     * `unrealized (at-least during boot time)` state within QEMU until
> +     * they are powered-on and made online.
> +     */
> +    if (kvm_enabled()) {
> +        kvm_arm_create_host_vcpu(ARM_CPU(cpuobj));
> +    }
> +}
> +
>   static void machvirt_init(MachineState *machine)
>   {
>       VirtMachineState *vms = VIRT_MACHINE(machine);
> @@ -2319,10 +2362,6 @@ static void machvirt_init(MachineState *machine)
>           Object *cpuobj;
>           CPUState *cs;
>   
> -        if (n >= smp_cpus) {
> -            break;
> -        }
> -
>           cpuobj = object_new(possible_cpus->cpus[n].type);
>           object_property_set_int(cpuobj, "mp-affinity",
>                                   possible_cpus->cpus[n].arch_id, NULL);
> @@ -2427,8 +2466,34 @@ static void machvirt_init(MachineState *machine)
>               }
>           }
>   
> -        qdev_realize(DEVICE(cpuobj), NULL, &error_fatal);
> -        object_unref(cpuobj);
> +        /* start secondary vCPUs in a powered-down state */
> +        if (n && mc->has_online_capable_cpus) {
> +            object_property_set_bool(cpuobj, "start-powered-off", true, NULL);
> +        }
> +
> +        if (n < smp_cpus) {
> +            /* 'Present' & 'Enabled' vCPUs */
> +            qdev_realize(DEVICE(cpuobj), NULL, &error_fatal);
> +            object_unref(cpuobj);
> +        } else {
> +            /* 'Present' & 'Disabled' vCPUs */
> +            virt_setup_lazy_vcpu_realization(cpuobj, vms);
> +        }
> +
> +        /*
> +         * All possible vCPUs should have QOM vCPU Object pointer & arch-id.
> +         * 'cpus_queue' (accessed via qemu_get_cpu()) contains only realized and
> +         * enabled vCPUs. Hence, we must now populate the 'possible_cpus' list.
> +         */
> +        if (kvm_enabled()) {
> +            /*
> +             * Override the default architecture ID with the one retrieved
> +             * from KVM, as they currently differ.
> +             */
> +            machine->possible_cpus->cpus[n].arch_id =
> +                arm_cpu_mp_affinity(ARM_CPU(cs));
> +        }
> +        machine->possible_cpus->cpus[n].cpu = cs;
>       }
>   
>       /* Now we've created the CPUs we can see if they have the hypvirt timer */
> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> index 8502d6216f..5816abae39 100644
> --- a/hw/core/qdev.c
> +++ b/hw/core/qdev.c
> @@ -309,6 +309,23 @@ void qdev_assert_realized_properly(void)
>                                      qdev_assert_realized_properly_cb, NULL);
>   }
>   
> +bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp)
> +{
> +    g_assert(dev);
> +
> +    if (bus) {
> +        error_setg(errp, "Device %s 'disable' operation not supported",
> +                   object_get_typename(OBJECT(dev)));
> +        return false;
> +    }
> +
> +    /* devices like cpu don't have bus */
> +    g_assert(!DEVICE_GET_CLASS(dev)->bus_type);
> +
> +    return object_property_set_str(OBJECT(dev), "admin_power_state", "disabled",
> +                                   errp);
> +}
> +
>   bool qdev_machine_modified(void)
>   {
>       return qdev_hot_added || qdev_hot_removed;
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index 3bc212ab3a..2c22b32a3f 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -570,6 +570,25 @@ bool qdev_realize(DeviceState *dev, BusState *bus, Error **errp);
>    */
>   bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp);
>   
> +/**
> + * qdev_disable - Initiate administrative disablement and power-off of device
> + * @dev:   The device to be administratively powered off
> + * @bus:   The bus on which the device resides (may be NULL for CPUs)
> + * @errp:  Pointer to a location where an error can be reported
> + *
> + * This function initiates an administrative transition of the device into a
> + * DISABLED state. This may trigger a graceful shutdown process depending on
> + * platform capabilities. For ACPI platforms, this typically involves notifying
> + * the guest via events such as Notify(..., 0x03) and executing _EJx.
> + *
> + * Once completed, the device's operational power is turned off and it is
> + * marked as administratively DISABLED. Further guest usage is blocked until
> + * re-enabled by host-side policy.
> + *
> + * Returns true on success; false if an error occurs, with @errp populated.
> + */
> +bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp);
> +
>   /**
>    * qdev_unrealize: Unrealize a device
>    * @dev: device to unrealize
> diff --git a/include/system/kvm.h b/include/system/kvm.h
> index 3c7d314736..4896a3c9c5 100644
> --- a/include/system/kvm.h
> +++ b/include/system/kvm.h
> @@ -317,6 +317,14 @@ int kvm_create_device(KVMState *s, uint64_t type, bool test);
>    */
>   bool kvm_device_supported(int vmfd, uint64_t type);
>   
> +/**
> + * kvm_create_vcpu - Gets a parked KVM vCPU or creates a KVM vCPU
> + * @cpu: QOM CPUState object for which KVM vCPU has to be fetched/created.
> + *
> + * @returns: 0 when success, errno (<0) when failed.
> + */
> +int kvm_create_vcpu(CPUState *cpu);
> +
>   /**
>    * kvm_park_vcpu - Park QEMU KVM vCPU context
>    * @cpu: QOM CPUState object for which QEMU KVM vCPU context has to be parked.
> diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> index 7e0d5b2ed8..a5906d1672 100644
> --- a/target/arm/cpu.c
> +++ b/target/arm/cpu.c
> @@ -1500,6 +1500,8 @@ static void arm_cpu_initfn(Object *obj)
>           /* TCG and HVF implement PSCI 1.1 */
>           cpu->psci_version = QEMU_PSCI_VERSION_1_1;
>       }
> +
> +    CPU(obj)->thread_id = 0;
>   }
>   
>   /*
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index 6672344855..1962eb29b2 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -991,6 +991,38 @@ void kvm_arm_reset_vcpu(ARMCPU *cpu)
>       write_list_to_cpustate(cpu);
>   }
>   
> +void kvm_arm_create_host_vcpu(ARMCPU *cpu)
> +{
> +    CPUState *cs = CPU(cpu);
> +    unsigned long vcpu_id = cs->cpu_index;
> +    int ret;
> +
> +    ret = kvm_create_vcpu(cs);
> +    if (ret < 0) {
> +        error_report("Failed to create host vcpu %ld", vcpu_id);
> +        abort();
> +    }
> +
> +    /*
> +     * Initialize the vCPU in the host. This will reset the sys regs
> +     * for this vCPU and related registers like MPIDR_EL1 etc. also
> +     * get programmed during this call to host. These are referenced
> +     * later while setting device attributes of the GICR during GICv3
> +     * reset.
> +     */
> +    ret = kvm_arch_init_vcpu(cs);
> +    if (ret < 0) {
> +        error_report("Failed to initialize host vcpu %ld", vcpu_id);
> +        abort();
> +    }
> +
> +    /*
> +     * park the created vCPU. shall be used during kvm_get_vcpu() when
> +     * threads are created during realization of ARM vCPUs.
> +     */
> +    kvm_park_vcpu(cs);
> +}
> +

I don't think we can simply call kvm_arch_init_vcpu() in the lazy-realization
path. Doing so aborts QEMU with a core dump on my NVIDIA Grace Hopper machine,
where SVE is supported by default.

In the current implementation (without this series), kvm_arch_init_vcpu() is
only called in the realization path, because the parameters (features) passed
to KVM_ARM_VCPU_INIT are populated at vCPU realization time.

$ /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
   --enable-kvm -machine virt,gic-version=3 -cpu host               \
   -smp cpus=4,disabledcpus=2 -m 1024M                              \
   -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image    \
   -initrd /home/gavin/sandbox/images/rootfs.cpio.xz -nographic
qemu-system-aarch64: Failed to initialize host vcpu 4
Aborted (core dumped)

Backtrace
=========
(gdb) bt
#0  0x0000ffff9106bc80 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x0000ffff9101aa40 [PAC] in raise () at /lib64/libc.so.6
#2  0x0000ffff91005988 [PAC] in abort () at /lib64/libc.so.6
#3  0x0000aaaab1cc26b8 [PAC] in kvm_arm_create_host_vcpu (cpu=0xaaaab9ab1bc0)
     at ../target/arm/kvm.c:1081
#4  0x0000aaaab1cd0c94 in virt_setup_lazy_vcpu_realization (cpuobj=0xaaaab9ab1bc0, vms=0xaaaab98870a0)
     at ../hw/arm/virt.c:2483
#5  0x0000aaaab1cd180c in machvirt_init (machine=0xaaaab98870a0) at ../hw/arm/virt.c:2777
#6  0x0000aaaab160f220 in machine_run_board_init
     (machine=0xaaaab98870a0, mem_path=0x0, errp=0xfffffa86bdc8) at ../hw/core/machine.c:1722
#7  0x0000aaaab1a25ef4 in qemu_init_board () at ../system/vl.c:2723
#8  0x0000aaaab1a2635c in qmp_x_exit_preconfig (errp=0xaaaab38a50f0 <error_fatal>)
     at ../system/vl.c:2821
#9  0x0000aaaab1a28b08 in qemu_init (argc=15, argv=0xfffffa86c1f8) at ../system/vl.c:3882
#10 0x0000aaaab221d9e4 in main (argc=15, argv=0xfffffa86c1f8) at ../system/main.c:71
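
The ordering dependency behind this can be shown with a toy model. This is a
minimal sketch in plain C, not QEMU or KVM code, and it does not reproduce
the actual failure: ToyVcpu, TOY_FEATURE_SVE, toy_vcpu_realize() and
toy_vcpu_init() are hypothetical stand-ins. The only assumption carried over
from the report above is that the init step consumes data that is filled in
at realization time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bit, standing in for whatever realize fills in. */
#define TOY_FEATURE_SVE (1u << 0)

typedef struct ToyVcpu {
    bool realized;
    bool features_valid;   /* set once realize has populated 'features' */
    uint32_t features;
} ToyVcpu;

/* Realize-time step: decide which features this vCPU is created with. */
void toy_vcpu_realize(ToyVcpu *cpu, bool host_has_sve)
{
    assert(cpu);
    cpu->features = host_has_sve ? TOY_FEATURE_SVE : 0;
    cpu->features_valid = true;
    cpu->realized = true;
}

/*
 * Stand-in for the host-side vCPU init. It consumes 'features', so it
 * returns a negative value if they have not been populated yet.
 */
int toy_vcpu_init(const ToyVcpu *cpu)
{
    assert(cpu);
    if (!cpu->features_valid) {
        return -1;
    }
    return 0;
}
```

In this model, calling toy_vcpu_init() on a vCPU that has not gone through
toy_vcpu_realize() fails, which mirrors the shape of the abort in the
backtrace: the init call is reached from machvirt_init() for a vCPU that has
been deliberately left unrealized.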

Thanks,
Gavin

>   /*
>    * Update KVM's MP_STATE based on what QEMU thinks it is
>    */
> @@ -1876,7 +1908,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
>           return -EINVAL;
>       }
>   
> -    qemu_add_vm_change_state_handler(kvm_arm_vm_state_change, cpu);
> +    /*
> +     * Install VM change handler only when vCPU thread has been spawned
> +     * i.e. vCPU is being realized
> +     */
> +    if (cs->thread_id) {
> +        qemu_add_vm_change_state_handler(kvm_arm_vm_state_change, cpu);
> +    }
>   
>       /* Determine init features for this CPU */
>       memset(cpu->kvm_init_features, 0, sizeof(cpu->kvm_init_features));
> diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
> index 6a9b6374a6..ec9dc95ee8 100644
> --- a/target/arm/kvm_arm.h
> +++ b/target/arm/kvm_arm.h
> @@ -98,6 +98,17 @@ bool kvm_arm_cpu_post_load(ARMCPU *cpu);
>   void kvm_arm_reset_vcpu(ARMCPU *cpu);
>   
>   struct kvm_vcpu_init;
> +
> +/**
> + * kvm_arm_create_host_vcpu:
> + * @cpu: ARMCPU
> + *
> + * Called to pre-create possible KVM vCPU within the host during the
> + * `virt_machine` initialization phase. This pre-created vCPU will be parked and
> + * will be reused when ARM QOM vCPU is actually hotplugged.
> + */
> +void kvm_arm_create_host_vcpu(ARMCPU *cpu);
> +
>   /**
>    * kvm_arm_create_scratch_host_vcpu:
>    * @fdarray: filled in with kvmfd, vmfd, cpufd file descriptors in that order




* Re: [PATCH RFC V6 05/24] arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-22 10:36   ` [PATCH RFC V6 05/24] arm/virt,kvm: " Gavin Shan
@ 2025-10-22 18:18     ` Salil Mehta
  2025-10-22 18:50       ` Salil Mehta
  0 siblings, 1 reply; 67+ messages in thread
From: Salil Mehta @ 2025-10-22 18:18 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, armbru, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, Keqian Zhu

Hi Gavin,

On Wed, Oct 22, 2025 at 10:37 AM Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Salil,
>
> On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
> > From: Salil Mehta <salil.mehta@huawei.com>
> >
> > ARM CPU architecture does not allow CPUs to be plugged after system has
> > initialized. This is a constraint. Hence, the Kernel must know all the CPUs
> > being booted during its initialization. This applies to the Guest Kernel as
> > well and therefore, the number of KVM vCPU descriptors in the host must be
> > fixed at VM initialization time.
> >
> > Also, the GIC must know all the CPUs it is connected to during its
> > initialization, and this cannot change afterward. This must also be ensured
> > during the initialization of the VGIC in KVM. This is necessary because:
> >
> > 1. The association between GICR and MPIDR must be fixed at VM initialization
> >     time. This is represented by the register
> >     `GICR_TYPER(mp_affinity, proc_num)`.
> > 2. Memory regions associated with GICR, etc., cannot be changed (added,
> >     deleted, or modified) after the VM has been initialized. This is not an
> >     ARM architectural constraint but rather invites a difficult and messy
> >     change in VGIC data structures.
> >
> > To enable a hot-add–like model while preserving these constraints, the virt
> > machine may enumerate more CPUs than are enabled at boot using
> > `-smp disabledcpus=N`. Such CPUs are present but start offline (i.e.,
> > administratively disabled at init). The topology remains fixed at VM
> > creation time; only the online/offline status may change later.
> >
> > Administratively disabled vCPUs are not realized in QOM until first enabled,
> > avoiding creation of unnecessary vCPU threads at boot. On large systems, this
> > reduces startup time proportionally to the number of disabled vCPUs. Once a
> > QOM vCPU is realized and its thread created, subsequent enable/disable actions
> > do not unrealize it. This behaviour was adopted following review feedback and
> > differs from earlier RFC versions.
> >
> > Co-developed-by: Keqian Zhu <zhuqian1@huawei.com>
> > Signed-off-by: Keqian Zhu <zhuqian1@huawei.com>
> > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > ---
> >   accel/kvm/kvm-all.c    |  2 +-
> >   hw/arm/virt.c          | 77 ++++++++++++++++++++++++++++++++++++++----
> >   hw/core/qdev.c         | 17 ++++++++++
> >   include/hw/qdev-core.h | 19 +++++++++++
> >   include/system/kvm.h   |  8 +++++
> >   target/arm/cpu.c       |  2 ++
> >   target/arm/kvm.c       | 40 +++++++++++++++++++++-
> >   target/arm/kvm_arm.h   | 11 ++++++
> >   8 files changed, 168 insertions(+), 8 deletions(-)
> >
[...]
> >
> > +static void
> > +virt_setup_lazy_vcpu_realization(Object *cpuobj, VirtMachineState *vms)
> > +{
> > +    /*
> > +     * Present & administratively disabled vCPUs:
> > +     *
> > +     * These CPUs are marked offline at init via '-smp disabledcpus=N'. We
> > +     * intentionally do not realize them during the first boot, since it is
> > +     * not known if or when they will ever be enabled. The decision to enable
> > +     * such CPUs depends on policy (e.g. guided by SLAs or other deployment
> > +     * requirements).
> > +     *
> > +     * Realizing all disabled vCPUs up front would make boot time proportional
> > +     * to 'maxcpus', even if policy permits only a small subset to be enabled.
> > +     * This can lead to unacceptable boot delays in some scenarios.
> > +     *
> > +     * Instead, these CPUs remain administratively disabled and unrealized at
> > +     * boot, to be instantiated and brought online only if policy later allows
> > +     * it.
> > +     */
> > +
> > +    /* set this vCPU to be administratively 'disabled' in QOM */
> > +    qdev_disable(DEVICE(cpuobj), NULL, &error_fatal);
> > +
> > +    if (vms->psci_conduit != QEMU_PSCI_CONDUIT_DISABLED) {
> > +        object_property_set_int(cpuobj, "psci-conduit", vms->psci_conduit,
> > +                                NULL);
> > +    }
> > +
> > +    /*
> > +     * [!] Constraint: The ARM CPU architecture does not permit new CPUs
> > +     * to be added after system initialization.
> > +     *
> > +     * Workaround: Pre-create KVM vCPUs even for those that are not yet
> > +     * online i.e. powered-off, keeping them `parked` and in an
> > +     * `unrealized (at-least during boot time)` state within QEMU until
> > +     * they are powered-on and made online.
> > +     */
> > +    if (kvm_enabled()) {
> > +        kvm_arm_create_host_vcpu(ARM_CPU(cpuobj));
> > +    }
> > +}
> > +
> >   static void machvirt_init(MachineState *machine)
> >   {
> >       VirtMachineState *vms = VIRT_MACHINE(machine);
> > @@ -2319,10 +2362,6 @@ static void machvirt_init(MachineState *machine)
> >           Object *cpuobj;
> >           CPUState *cs;
> >
> > -        if (n >= smp_cpus) {
> > -            break;
> > -        }
> > -
> >           cpuobj = object_new(possible_cpus->cpus[n].type);
> >           object_property_set_int(cpuobj, "mp-affinity",
> >                                   possible_cpus->cpus[n].arch_id, NULL);
> > @@ -2427,8 +2466,34 @@ static void machvirt_init(MachineState *machine)
> >               }
> >           }
> >
> > -        qdev_realize(DEVICE(cpuobj), NULL, &error_fatal);
> > -        object_unref(cpuobj);
> > +        /* start secondary vCPUs in a powered-down state */
> > +        if (n && mc->has_online_capable_cpus) {
> > +            object_property_set_bool(cpuobj, "start-powered-off", true, NULL);
> > +        }
> > +
> > +        if (n < smp_cpus) {
> > +            /* 'Present' & 'Enabled' vCPUs */
> > +            qdev_realize(DEVICE(cpuobj), NULL, &error_fatal);
> > +            object_unref(cpuobj);
> > +        } else {
> > +            /* 'Present' & 'Disabled' vCPUs */
> > +            virt_setup_lazy_vcpu_realization(cpuobj, vms);
> > +        }
> > +
> > +        /*
> > +         * All possible vCPUs should have QOM vCPU Object pointer & arch-id.
> > +         * 'cpus_queue' (accessed via qemu_get_cpu()) contains only realized and
> > +         * enabled vCPUs. Hence, we must now populate the 'possible_cpus' list.
> > +         */
> > +        if (kvm_enabled()) {
> > +            /*
> > +             * Override the default architecture ID with the one retrieved
> > +             * from KVM, as they currently differ.
> > +             */
> > +            machine->possible_cpus->cpus[n].arch_id =
> > +                arm_cpu_mp_affinity(ARM_CPU(cs));
> > +        }
> > +        machine->possible_cpus->cpus[n].cpu = cs;
> >       }
> >
> >       /* Now we've created the CPUs we can see if they have the hypvirt timer */
> > diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> > index 8502d6216f..5816abae39 100644
> > --- a/hw/core/qdev.c
> > +++ b/hw/core/qdev.c
> > @@ -309,6 +309,23 @@ void qdev_assert_realized_properly(void)
> >                                      qdev_assert_realized_properly_cb, NULL);
> >   }
> >

[...]

> > +void kvm_arm_create_host_vcpu(ARMCPU *cpu)
> > +{
> > +    CPUState *cs = CPU(cpu);
> > +    unsigned long vcpu_id = cs->cpu_index;
> > +    int ret;
> > +
> > +    ret = kvm_create_vcpu(cs);
> > +    if (ret < 0) {
> > +        error_report("Failed to create host vcpu %ld", vcpu_id);
> > +        abort();
> > +    }
> > +
> > +    /*
> > +     * Initialize the vCPU in the host. This will reset the sys regs
> > +     * for this vCPU and related registers like MPIDR_EL1 etc. also
> > +     * get programmed during this call to host. These are referenced
> > +     * later while setting device attributes of the GICR during GICv3
> > +     * reset.
> > +     */
> > +    ret = kvm_arch_init_vcpu(cs);
> > +    if (ret < 0) {
> > +        error_report("Failed to initialize host vcpu %ld", vcpu_id);
> > +        abort();
> > +    }
> > +
> > +    /*
> > +     * park the created vCPU. shall be used during kvm_get_vcpu() when
> > +     * threads are created during realization of ARM vCPUs.
> > +     */
> > +    kvm_park_vcpu(cs);
> > +}
> > +
>
> I don't think we're able to simply call kvm_arch_init_vcpu() in the lazily realized
> path. Otherwise, it can trigger a crash dump on my Nvidia's grace-hopper machine where
> SVE is supported by default.

Thanks for reporting this. That is not true. As long as we initialize KVM
correctly and finalize the features like SVE, we should be fine. In fact,
this is precisely what we are doing right now.

To understand the crash, I need a bit more info.

1.  Is it happening because KVM_ARM_VCPU_INIT is failing? If yes, then can you
    check within KVM whether it is because
    a. the features specified by QEMU do not match the defaults within KVM
       (hint: check kvm_vcpu_init_check_features()), or
    b. KVM is complaining about an init feature change
       (kvm_vcpu_init_changed())?
2.  Or is it happening while setting the vector length or finalizing the
    features?

int kvm_arch_init_vcpu(CPUState *cs)
{
    [...]
    /* Do KVM_ARM_VCPU_INIT ioctl */
    ret = kvm_arm_vcpu_init(cpu);   ---->[1]
    if (ret) {
        return ret;
    }
    if (cpu_isar_feature(aa64_sve, cpu)) {
        ret = kvm_arm_sve_set_vls(cpu); ---->[2]
        if (ret) {
            return ret;
        }
        ret = kvm_arm_vcpu_finalize(cpu, KVM_ARM_VCPU_SVE); ---->[3]
        if (ret) {
            return ret;
        }
    }
    [...]
}

I think it is happening because the vector length is left uninitialized.
That initialization happens in the context of arm_cpu_finalize_features(),
which I forgot to call before the KVM finalize step.
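
To illustrate the suspected ordering problem, here is a minimal toy model in
C. It is not QEMU code: ToyCpu, toy_finalize_features(), toy_kvm_init_vcpu()
and toy_create_host_vcpu() are hypothetical stand-ins, and the -EINVAL
returned for an empty vector-length set is an assumption about how the
failure would surface, not something confirmed on the failing machine.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the ARMCPU fields relevant here (hypothetical). */
typedef struct ToyCpu {
    bool has_sve;     /* guest CPU advertises SVE */
    uint64_t vq_map;  /* bitmap of enabled vector lengths; 0 means unset */
} ToyCpu;

/* Models arm_cpu_finalize_features(): settles the vector-length set. */
static void toy_finalize_features(ToyCpu *cpu)
{
    assert(cpu != NULL);
    if (cpu->has_sve && cpu->vq_map == 0) {
        cpu->vq_map = 0x1; /* assumption: at least one length gets enabled */
    }
}

/* Models steps [2] and [3]; assumed to reject an empty vector-length set. */
static int toy_kvm_init_vcpu(const ToyCpu *cpu)
{
    assert(cpu != NULL);
    if (cpu->has_sve && cpu->vq_map == 0) {
        return -EINVAL;
    }
    return 0;
}

/* Lazy path: optionally finalize the features, then initialize in "KVM". */
static int toy_create_host_vcpu(ToyCpu *cpu, bool finalize_first)
{
    if (finalize_first) {
        toy_finalize_features(cpu);
    }
    return toy_kvm_init_vcpu(cpu);
}
```

In this model, calling toy_create_host_vcpu() with finalize_first set to
false fails for an SVE-capable CPU, while passing true succeeds. That is the
behaviour I would expect from the real code if the diagnosis above is right.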

>
> kvm_arch_init_vcpu() is supposed to be called in the realization path in current
> implementation (without this series) because the parameters (features) to KVM_ARM_VCPU_INIT
> is populated at vCPU realization time.

Not necessarily. It is just meant to initialize the vCPU in KVM. If we take
care of the KVM requirements in the same way the realize path does, we should
be fine. Can you try adding the patch below to your code and test if it works?

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index c4b68a0b17..1091593478 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1068,6 +1068,9 @@ void kvm_arm_create_host_vcpu(ARMCPU *cpu)
         abort();
     }

+     /* finalize the features like SVE, SME etc */
+     arm_cpu_finalize_features(cpu, &error_abort);
+
     /*
      * Initialize the vCPU in the host. This will reset the sys regs
      * for this vCPU and related registers like MPIDR_EL1 etc. also




>
> $ home/gavin/sandbox/qemu.main/build/qemu-system-aarch64           \
>    --enable-kvm -machine virt,gic-version=3 -cpu host               \
>    -smp cpus=4,disabledcpus=2 -m 1024M                              \
>    -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image    \
>    -initrd /home/gavin/sandbox/images/rootfs.cpio.xz -nographic
> qemu-system-aarch64: Failed to initialize host vcpu 4
> Aborted (core dumped)
>
> Backtrace
> =========
> (gdb) bt
> #0  0x0000ffff9106bc80 in __pthread_kill_implementation () at /lib64/libc.so.6
> #1  0x0000ffff9101aa40 [PAC] in raise () at /lib64/libc.so.6
> #2  0x0000ffff91005988 [PAC] in abort () at /lib64/libc.so.6
> #3  0x0000aaaab1cc26b8 [PAC] in kvm_arm_create_host_vcpu (cpu=0xaaaab9ab1bc0)
>      at ../target/arm/kvm.c:1081
> #4  0x0000aaaab1cd0c94 in virt_setup_lazy_vcpu_realization (cpuobj=0xaaaab9ab1bc0, vms=0xaaaab98870a0)
>      at ../hw/arm/virt.c:2483
> #5  0x0000aaaab1cd180c in machvirt_init (machine=0xaaaab98870a0) at ../hw/arm/virt.c:2777
> #6  0x0000aaaab160f220 in machine_run_board_init
>      (machine=0xaaaab98870a0, mem_path=0x0, errp=0xfffffa86bdc8) at ../hw/core/machine.c:1722
> #7  0x0000aaaab1a25ef4 in qemu_init_board () at ../system/vl.c:2723
> #8  0x0000aaaab1a2635c in qmp_x_exit_preconfig (errp=0xaaaab38a50f0 <error_fatal>)
>      at ../system/vl.c:2821
> #9  0x0000aaaab1a28b08 in qemu_init (argc=15, argv=0xfffffa86c1f8) at ../system/vl.c:3882
> #10 0x0000aaaab221d9e4 in main (argc=15, argv=0xfffffa86c1f8) at ../system/main.c:71


Thank you for this. Please let me know if the above fix works, and please
also share the return values in case you still encounter errors.
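
If it helps when collecting those return values, a small helper along these
lines could be used while debugging. This is only a sketch: describe_failure()
is a hypothetical name, not an existing QEMU function, and it assumes the
failing call follows the usual convention of returning a negative errno.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the failed step and its return value into buf.
 * Assumes 'ret' is a negative errno value, so it is negated for strerror().
 * Returns the number of characters snprintf() would have written.
 */
static int describe_failure(char *buf, size_t len, const char *step, int ret)
{
    assert(buf != NULL && step != NULL);
    return snprintf(buf, len, "%s failed: ret=%d (%s)",
                    step, ret, strerror(-ret));
}
```

Printing the resulting string next to each of the steps [1], [2] and [3]
would show both which step fails and with which error code.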

Many thanks!
Salil.


>
> Thanks,
> Gavin
>
> >   /*
> >    * Update KVM's MP_STATE based on what QEMU thinks it is
> >    */
> > @@ -1876,7 +1908,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >           return -EINVAL;
> >       }
> >
> > -    qemu_add_vm_change_state_handler(kvm_arm_vm_state_change, cpu);
> > +    /*
> > +     * Install VM change handler only when vCPU thread has been spawned
> > +     * i.e. vCPU is being realized
> > +     */
> > +    if (cs->thread_id) {
> > +        qemu_add_vm_change_state_handler(kvm_arm_vm_state_change, cpu);
> > +    }
> >
> >       /* Determine init features for this CPU */
> >       memset(cpu->kvm_init_features, 0, sizeof(cpu->kvm_init_features));
> > diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
> > index 6a9b6374a6..ec9dc95ee8 100644
> > --- a/target/arm/kvm_arm.h
> > +++ b/target/arm/kvm_arm.h
> > @@ -98,6 +98,17 @@ bool kvm_arm_cpu_post_load(ARMCPU *cpu);
> >   void kvm_arm_reset_vcpu(ARMCPU *cpu);
> >
> >   struct kvm_vcpu_init;
> > +
> > +/**
> > + * kvm_arm_create_host_vcpu:
> > + * @cpu: ARMCPU
> > + *
> > + * Called to pre-create possible KVM vCPU within the host during the
> > + * `virt_machine` initialization phase. This pre-created vCPU will be parked and
> > + * will be reused when ARM QOM vCPU is actually hotplugged.
> > + */
> > +void kvm_arm_create_host_vcpu(ARMCPU *cpu);
> > +
> >   /**
> >    * kvm_arm_create_scratch_host_vcpu:
> >    * @fdarray: filled in with kvmfd, vmfd, cpufd file descriptors in that order
>


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 05/24] arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-22 18:18     ` Salil Mehta
@ 2025-10-22 18:50       ` Salil Mehta
  2025-10-23  0:14         ` Gavin Shan
  0 siblings, 1 reply; 67+ messages in thread
From: Salil Mehta @ 2025-10-22 18:50 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, armbru, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, Keqian Zhu

Hi Gavin,

On Wed, Oct 22, 2025 at 6:18 PM Salil Mehta <salil.mehta@opnsrc.net> wrote:
>
> Hi Gavin,
>
> On Wed, Oct 22, 2025 at 10:37 AM Gavin Shan <gshan@redhat.com> wrote:
> >
> > Hi Salil,
> >
> > On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
> > > From: Salil Mehta <salil.mehta@huawei.com>
> > >
> > > ARM CPU architecture does not allow CPUs to be plugged after system has
> > > initialized. This is a constraint. Hence, the Kernel must know all the CPUs
> > > being booted during its initialization. This applies to the Guest Kernel as
> > > well and therefore, the number of KVM vCPU descriptors in the host must be
> > > fixed at VM initialization time.
> > >
> > > Also, the GIC must know all the CPUs it is connected to during its
> > > initialization, and this cannot change afterward. This must also be ensured
> > > during the initialization of the VGIC in KVM. This is necessary because:
> > >
> > > 1. The association between GICR and MPIDR must be fixed at VM initialization
> > >     time. This is represented by the register
> > >     `GICR_TYPER(mp_affinity, proc_num)`.
> > > 2. Memory regions associated with GICR, etc., cannot be changed (added,
> > >     deleted, or modified) after the VM has been initialized. This is not an
> > >     ARM architectural constraint but rather invites a difficult and messy
> > >     change in VGIC data structures.
> > >
> > > To enable a hot-add–like model while preserving these constraints, the virt
> > > machine may enumerate more CPUs than are enabled at boot using
> > > `-smp disabledcpus=N`. Such CPUs are present but start offline (i.e.,
> > > administratively disabled at init). The topology remains fixed at VM
> > > creation time; only the online/offline status may change later.
> > >
> > > Administratively disabled vCPUs are not realized in QOM until first enabled,
> > > avoiding creation of unnecessary vCPU threads at boot. On large systems, this
> > > reduces startup time proportionally to the number of disabled vCPUs. Once a
> > > QOM vCPU is realized and its thread created, subsequent enable/disable actions
> > > do not unrealize it. This behaviour was adopted following review feedback and
> > > differs from earlier RFC versions.
> > >
> > > Co-developed-by: Keqian Zhu <zhuqian1@huawei.com>
> > > Signed-off-by: Keqian Zhu <zhuqian1@huawei.com>
> > > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > > ---
> > >   accel/kvm/kvm-all.c    |  2 +-
> > >   hw/arm/virt.c          | 77 ++++++++++++++++++++++++++++++++++++++----
> > >   hw/core/qdev.c         | 17 ++++++++++
> > >   include/hw/qdev-core.h | 19 +++++++++++
> > >   include/system/kvm.h   |  8 +++++
> > >   target/arm/cpu.c       |  2 ++
> > >   target/arm/kvm.c       | 40 +++++++++++++++++++++-
> > >   target/arm/kvm_arm.h   | 11 ++++++
> > >   8 files changed, 168 insertions(+), 8 deletions(-)
> > >

[...]

> > > +void kvm_arm_create_host_vcpu(ARMCPU *cpu)
> > > +{
> > > +    CPUState *cs = CPU(cpu);
> > > +    unsigned long vcpu_id = cs->cpu_index;
> > > +    int ret;
> > > +
> > > +    ret = kvm_create_vcpu(cs);
> > > +    if (ret < 0) {
> > > +        error_report("Failed to create host vcpu %ld", vcpu_id);
> > > +        abort();
> > > +    }
> > > +
> > > +    /*
> > > +     * Initialize the vCPU in the host. This will reset the sys regs
> > > +     * for this vCPU and related registers like MPIDR_EL1 etc. also
> > > +     * get programmed during this call to host. These are referenced
> > > +     * later while setting device attributes of the GICR during GICv3
> > > +     * reset.
> > > +     */
> > > +    ret = kvm_arch_init_vcpu(cs);
> > > +    if (ret < 0) {
> > > +        error_report("Failed to initialize host vcpu %ld", vcpu_id);
> > > +        abort();
> > > +    }
> > > +
> > > +    /*
> > > +     * park the created vCPU. shall be used during kvm_get_vcpu() when
> > > +     * threads are created during realization of ARM vCPUs.
> > > +     */
> > > +    kvm_park_vcpu(cs);
> > > +}
> > > +
> >
> > I don't think we're able to simply call kvm_arch_init_vcpu() in the lazily realized
> > path. Otherwise, it can trigger a crash dump on my Nvidia's grace-hopper machine where
> > SVE is supported by default.
>
> Thanks for reporting this. That is not true. As long as we initialize
> KVM correctly and
> finalize the features like SVE we should be fine. In fact, this is
> precisely what we are
> doing right now.
>
> To understand the crash, I need a bit more info.
>
> 1#  is happening because KVM_ARM_VCPU_INIT is failing. If yes, the can you check
>       within the KVM if it is happening because
>      a.  features specified by QEMU are not matching the defaults within the KVM
>            (HInt: check kvm_vcpu_init_check_features())?
>      b. or complaining about init feate change kvm_vcpu_init_changed()?
> 2#  or it is happening during the setting of vector length or
> finalizing features?
>
> int kvm_arch_init_vcpu(CPUState *cs)
> {
>    [...]
>          /* Do KVM_ARM_VCPU_INIT ioctl */
>         ret = kvm_arm_vcpu_init(cpu);   ---->[1]
>         if (ret) {
>            return ret;
>        }
>           if (cpu_isar_feature(aa64_sve, cpu)) {
>         ret = kvm_arm_sve_set_vls(cpu); ---->[2]
>         if (ret) {
>             return ret;
>         }
>         ret = kvm_arm_vcpu_finalize(cpu, KVM_ARM_VCPU_SVE);--->[3]
>         if (ret) {
>             return ret;
>         }
>     }
> [...]
> }
>
> I think it's happening because vector length is going uninitialized.
> This initialization
> happens in context to  arm_cpu_finalize_features() which I forgot to call before
> calling KVM finalize.
>
> >
> > kvm_arch_init_vcpu() is supposed to be called in the realization path in current
> > implementation (without this series) because the parameters (features) to KVM_ARM_VCPU_INIT
> > is populated at vCPU realization time.
>
> Not necessarily. It is just meant to initialize the KVM. If we take care of the
> KVM requirements in the similar way the realize path does we should be
> fine. Can you try to add the patch below in your code and test if it works?
>
>  diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index c4b68a0b17..1091593478 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -1068,6 +1068,9 @@ void kvm_arm_create_host_vcpu(ARMCPU *cpu)
>          abort();
>      }
>
> +     /* finalize the features like SVE, SME etc */
> +     arm_cpu_finalize_features(cpu, &error_abort);
> +
>      /*
>       * Initialize the vCPU in the host. This will reset the sys regs
>       * for this vCPU and related registers like MPIDR_EL1 etc. also
>
>
>
>
> >
> > $ home/gavin/sandbox/qemu.main/build/qemu-system-aarch64           \
> >    --enable-kvm -machine virt,gic-version=3 -cpu host               \
> >    -smp cpus=4,disabledcpus=2 -m 1024M                              \
> >    -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image    \
> >    -initrd /home/gavin/sandbox/images/rootfs.cpio.xz -nographic
> > qemu-system-aarch64: Failed to initialize host vcpu 4
> > Aborted (core dumped)
> >
> > Backtrace
> > =========
> > (gdb) bt
> > #0  0x0000ffff9106bc80 in __pthread_kill_implementation () at /lib64/libc.so.6
> > #1  0x0000ffff9101aa40 [PAC] in raise () at /lib64/libc.so.6
> > #2  0x0000ffff91005988 [PAC] in abort () at /lib64/libc.so.6
> > #3  0x0000aaaab1cc26b8 [PAC] in kvm_arm_create_host_vcpu (cpu=0xaaaab9ab1bc0)
> >      at ../target/arm/kvm.c:1081
> > #4  0x0000aaaab1cd0c94 in virt_setup_lazy_vcpu_realization (cpuobj=0xaaaab9ab1bc0, vms=0xaaaab98870a0)
> >      at ../hw/arm/virt.c:2483
> > #5  0x0000aaaab1cd180c in machvirt_init (machine=0xaaaab98870a0) at ../hw/arm/virt.c:2777
> > #6  0x0000aaaab160f220 in machine_run_board_init
> >      (machine=0xaaaab98870a0, mem_path=0x0, errp=0xfffffa86bdc8) at ../hw/core/machine.c:1722
> > #7  0x0000aaaab1a25ef4 in qemu_init_board () at ../system/vl.c:2723
> > #8  0x0000aaaab1a2635c in qmp_x_exit_preconfig (errp=0xaaaab38a50f0 <error_fatal>)
> >      at ../system/vl.c:2821
> > #9  0x0000aaaab1a28b08 in qemu_init (argc=15, argv=0xfffffa86c1f8) at ../system/vl.c:3882
> > #10 0x0000aaaab221d9e4 in main (argc=15, argv=0xfffffa86c1f8) at ../system/main.c:71
>
>
> Thank you for this. Please let me know if the above fix works and also
> the return values in
> case you encounter errors.

I've pushed the fix to the branch below for your convenience:

Branch: https://github.com/salil-mehta/qemu/commits/virt-cpuhp-armv8/rfc-v6.2
Fix: https://github.com/salil-mehta/qemu/commit/1f1fbc0998ffb1fe26140df3c336bf2be2aa8669

Thanks
Salil.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 05/24] arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-22 18:50       ` Salil Mehta
@ 2025-10-23  0:14         ` Gavin Shan
  2025-10-23  0:35           ` Salil Mehta
  0 siblings, 1 reply; 67+ messages in thread
From: Gavin Shan @ 2025-10-23  0:14 UTC (permalink / raw)
  To: Salil Mehta
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, armbru, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, Keqian Zhu

Hi Salil,

On 10/23/25 4:50 AM, Salil Mehta wrote:
> On Wed, Oct 22, 2025 at 6:18 PM Salil Mehta <salil.mehta@opnsrc.net> wrote:
>> On Wed, Oct 22, 2025 at 10:37 AM Gavin Shan <gshan@redhat.com> wrote:
>>>
>>> Hi Salil,
>>>
>>> On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
>>>> From: Salil Mehta <salil.mehta@huawei.com>
>>>>
>>>> ARM CPU architecture does not allow CPUs to be plugged after system has
>>>> initialized. This is a constraint. Hence, the Kernel must know all the CPUs
>>>> being booted during its initialization. This applies to the Guest Kernel as
>>>> well and therefore, the number of KVM vCPU descriptors in the host must be
>>>> fixed at VM initialization time.
>>>>
>>>> Also, the GIC must know all the CPUs it is connected to during its
>>>> initialization, and this cannot change afterward. This must also be ensured
>>>> during the initialization of the VGIC in KVM. This is necessary because:
>>>>
>>>> 1. The association between GICR and MPIDR must be fixed at VM initialization
>>>>      time. This is represented by the register
>>>>      `GICR_TYPER(mp_affinity, proc_num)`.
>>>> 2. Memory regions associated with GICR, etc., cannot be changed (added,
>>>>      deleted, or modified) after the VM has been initialized. This is not an
>>>>      ARM architectural constraint but rather invites a difficult and messy
>>>>      change in VGIC data structures.
>>>>
>>>> To enable a hot-add–like model while preserving these constraints, the virt
>>>> machine may enumerate more CPUs than are enabled at boot using
>>>> `-smp disabledcpus=N`. Such CPUs are present but start offline (i.e.,
>>>> administratively disabled at init). The topology remains fixed at VM
>>>> creation time; only the online/offline status may change later.
>>>>
>>>> Administratively disabled vCPUs are not realized in QOM until first enabled,
>>>> avoiding creation of unnecessary vCPU threads at boot. On large systems, this
>>>> reduces startup time proportionally to the number of disabled vCPUs. Once a
>>>> QOM vCPU is realized and its thread created, subsequent enable/disable actions
>>>> do not unrealize it. This behaviour was adopted following review feedback and
>>>> differs from earlier RFC versions.
>>>>
>>>> Co-developed-by: Keqian Zhu <zhuqian1@huawei.com>
>>>> Signed-off-by: Keqian Zhu <zhuqian1@huawei.com>
>>>> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
>>>> ---
>>>>    accel/kvm/kvm-all.c    |  2 +-
>>>>    hw/arm/virt.c          | 77 ++++++++++++++++++++++++++++++++++++++----
>>>>    hw/core/qdev.c         | 17 ++++++++++
>>>>    include/hw/qdev-core.h | 19 +++++++++++
>>>>    include/system/kvm.h   |  8 +++++
>>>>    target/arm/cpu.c       |  2 ++
>>>>    target/arm/kvm.c       | 40 +++++++++++++++++++++-
>>>>    target/arm/kvm_arm.h   | 11 ++++++
>>>>    8 files changed, 168 insertions(+), 8 deletions(-)
>>>>
> 
> [...]
> 
>>>> +void kvm_arm_create_host_vcpu(ARMCPU *cpu)
>>>> +{
>>>> +    CPUState *cs = CPU(cpu);
>>>> +    unsigned long vcpu_id = cs->cpu_index;
>>>> +    int ret;
>>>> +
>>>> +    ret = kvm_create_vcpu(cs);
>>>> +    if (ret < 0) {
>>>> +        error_report("Failed to create host vcpu %ld", vcpu_id);
>>>> +        abort();
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Initialize the vCPU in the host. This will reset the sys regs
>>>> +     * for this vCPU and related registers like MPIDR_EL1 etc. also
>>>> +     * get programmed during this call to host. These are referenced
>>>> +     * later while setting device attributes of the GICR during GICv3
>>>> +     * reset.
>>>> +     */
>>>> +    ret = kvm_arch_init_vcpu(cs);
>>>> +    if (ret < 0) {
>>>> +        error_report("Failed to initialize host vcpu %ld", vcpu_id);
>>>> +        abort();
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * park the created vCPU. shall be used during kvm_get_vcpu() when
>>>> +     * threads are created during realization of ARM vCPUs.
>>>> +     */
>>>> +    kvm_park_vcpu(cs);
>>>> +}
>>>> +
>>>
>>> I don't think we're able to simply call kvm_arch_init_vcpu() in the lazily realized
>>> path. Otherwise, it can trigger a crash dump on my Nvidia's grace-hopper machine where
>>> SVE is supported by default.
>>
>> Thanks for reporting this. That is not true. As long as we initialize
>> KVM correctly and
>> finalize the features like SVE we should be fine. In fact, this is
>> precisely what we are
>> doing right now.
>>
>> To understand the crash, I need a bit more info.
>>
>> 1#  is happening because KVM_ARM_VCPU_INIT is failing. If yes, the can you check
>>        within the KVM if it is happening because
>>       a.  features specified by QEMU are not matching the defaults within the KVM
>>             (HInt: check kvm_vcpu_init_check_features())?
>>       b. or complaining about init feate change kvm_vcpu_init_changed()?
>> 2#  or it is happening during the setting of vector length or
>> finalizing features?
>>
>> int kvm_arch_init_vcpu(CPUState *cs)
>> {
>>     [...]
>>           /* Do KVM_ARM_VCPU_INIT ioctl */
>>          ret = kvm_arm_vcpu_init(cpu);   ---->[1]
>>          if (ret) {
>>             return ret;
>>         }
>>            if (cpu_isar_feature(aa64_sve, cpu)) {
>>          ret = kvm_arm_sve_set_vls(cpu); ---->[2]
>>          if (ret) {
>>              return ret;
>>          }
>>          ret = kvm_arm_vcpu_finalize(cpu, KVM_ARM_VCPU_SVE);--->[3]
>>          if (ret) {
>>              return ret;
>>          }
>>      }
>> [...]
>> }
>>
>> I think it's happening because vector length is going uninitialized.
>> This initialization
>> happens in context to  arm_cpu_finalize_features() which I forgot to call before
>> calling KVM finalize.
>>
>>>
>>> kvm_arch_init_vcpu() is supposed to be called in the realization path in current
>>> implementation (without this series) because the parameters (features) to KVM_ARM_VCPU_INIT
>>> is populated at vCPU realization time.
>>
>> Not necessarily. It is just meant to initialize the KVM. If we take care of the
>> KVM requirements in the similar way the realize path does we should be
>> fine. Can you try to add the patch below in your code and test if it works?
>>
>>   diff --git a/target/arm/kvm.c b/target/arm/kvm.c
>> index c4b68a0b17..1091593478 100644
>> --- a/target/arm/kvm.c
>> +++ b/target/arm/kvm.c
>> @@ -1068,6 +1068,9 @@ void kvm_arm_create_host_vcpu(ARMCPU *cpu)
>>           abort();
>>       }
>>
>> +     /* finalize the features like SVE, SME etc */
>> +     arm_cpu_finalize_features(cpu, &error_abort);
>> +
>>       /*
>>        * Initialize the vCPU in the host. This will reset the sys regs
>>        * for this vCPU and related registers like MPIDR_EL1 etc. also
>>
>>
>>
>>
>>>
>>> $ home/gavin/sandbox/qemu.main/build/qemu-system-aarch64           \
>>>     --enable-kvm -machine virt,gic-version=3 -cpu host               \
>>>     -smp cpus=4,disabledcpus=2 -m 1024M                              \
>>>     -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image    \
>>>     -initrd /home/gavin/sandbox/images/rootfs.cpio.xz -nographic
>>> qemu-system-aarch64: Failed to initialize host vcpu 4
>>> Aborted (core dumped)
>>>
>>> Backtrace
>>> =========
>>> (gdb) bt
>>> #0  0x0000ffff9106bc80 in __pthread_kill_implementation () at /lib64/libc.so.6
>>> #1  0x0000ffff9101aa40 [PAC] in raise () at /lib64/libc.so.6
>>> #2  0x0000ffff91005988 [PAC] in abort () at /lib64/libc.so.6
>>> #3  0x0000aaaab1cc26b8 [PAC] in kvm_arm_create_host_vcpu (cpu=0xaaaab9ab1bc0)
>>>       at ../target/arm/kvm.c:1081
>>> #4  0x0000aaaab1cd0c94 in virt_setup_lazy_vcpu_realization (cpuobj=0xaaaab9ab1bc0, vms=0xaaaab98870a0)
>>>       at ../hw/arm/virt.c:2483
>>> #5  0x0000aaaab1cd180c in machvirt_init (machine=0xaaaab98870a0) at ../hw/arm/virt.c:2777
>>> #6  0x0000aaaab160f220 in machine_run_board_init
>>>       (machine=0xaaaab98870a0, mem_path=0x0, errp=0xfffffa86bdc8) at ../hw/core/machine.c:1722
>>> #7  0x0000aaaab1a25ef4 in qemu_init_board () at ../system/vl.c:2723
>>> #8  0x0000aaaab1a2635c in qmp_x_exit_preconfig (errp=0xaaaab38a50f0 <error_fatal>)
>>>       at ../system/vl.c:2821
>>> #9  0x0000aaaab1a28b08 in qemu_init (argc=15, argv=0xfffffa86c1f8) at ../system/vl.c:3882
>>> #10 0x0000aaaab221d9e4 in main (argc=15, argv=0xfffffa86c1f8) at ../system/main.c:71
>>
>>
>> Thank you for this. Please let me know if the above fix works and also
>> the return values in
>> case you encounter errors.
> 
> I've pushed the fix to below branch for your convenience:
> 
> Branch: https://github.com/salil-mehta/qemu/commits/virt-cpuhp-armv8/rfc-v6.2
> Fix: https://github.com/salil-mehta/qemu/commit/1f1fbc0998ffb1fe26140df3c336bf2be2aa8669
> 

I guess the rfc-v6.2 branch isn't ready for testing, because it runs into
another crash, as shown below.

host$ /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64                     \
       -accel kvm -machine virt,gic-version=host,nvdimm=on                         \
       -cpu host,sve=on                                                            \
       -smp maxcpus=4,cpus=2,disabledcpus=2,sockets=2,clusters=2,cores=1,threads=1 \
       -m 4096M,slots=16,maxmem=128G                                               \
       -object memory-backend-ram,id=mem0,size=2048M                               \
       -object memory-backend-ram,id=mem1,size=2048M                               \
       -numa node,nodeid=0,memdev=mem0,cpus=0-1                                    \
       -numa node,nodeid=1,memdev=mem1,cpus=2-3                                    \
       -L /home/gavin/sandbox/qemu.main/build/pc-bios                              \
       -monitor none -serial mon:stdio -nographic -gdb tcp::6666                   \
       -qmp tcp:localhost:5555,server,wait=off                                     \
       -bios /home/gavin/sandbox/qemu.main/build/pc-bios/edk2-aarch64-code.fd      \
       -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image               \
       -initrd /home/gavin/sandbox/images/rootfs.cpio.xz                           \
       -append memhp_default_state=online_movable
         :
         :
guest$ cd /sys/devices/system/cpu/
guest$ cat present enabled online
0-3
0-1
0-1
(qemu) device_set host-arm-cpu,socket-id=1,cluster-id=0,core-id=0,thread-id=0,admin-state=enable
qemu-system-aarch64: kvm_init_vcpu: kvm_arch_init_vcpu failed (2): Operation not permitted
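
For reference, the sysfs files read in the guest above use the kernel's CPU
list format (comma-separated ids and ranges), so "0-3" means four present
CPUs and "0-1" means two enabled and two online, which matches cpus=2 and
disabledcpus=2 on the command line. A minimal sketch of counting such a list
in C follows; cpu_list_count() is a hypothetical helper written for this
mail, not part of QEMU or the kernel.

```c
#include <stdlib.h>

/*
 * Count the CPUs named by a sysfs-style CPU list such as "0-3" or "0-1,4".
 * A trailing newline is tolerated. Returns -1 on malformed input.
 */
static int cpu_list_count(const char *list)
{
    const char *p = list;
    int count = 0;

    while (*p != '\0' && *p != '\n') {
        char *end;
        long first = strtol(p, &end, 10);
        long last = first;

        if (end == p || first < 0) {
            return -1;              /* no digits, or a negative id */
        }
        p = end;
        if (*p == '-') {            /* a range "first-last" */
            last = strtol(p + 1, &end, 10);
            if (end == p + 1 || last < first) {
                return -1;
            }
            p = end;
        }
        count += (int)(last - first + 1);
        if (*p == ',') {
            p++;
            if (*p == '\0' || *p == '\n') {
                return -1;          /* dangling comma */
            }
        } else if (*p != '\0' && *p != '\n') {
            return -1;              /* unexpected character */
        }
    }
    return count;
}
```

Applied to the three lines printed above, this yields 4 present, 2 enabled
and 2 online CPUs.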

I also picked the fix (the last patch in the rfc-v6.2 branch) onto the rfc-v6
branch, and the same crash can be seen.

root@nvidia-grace-hopper-01:/home/gavin/sandbox/qemu.main# git log --oneline HEAD | head -n 1
82dbd9a8f6 tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc
root@nvidia-grace-hopper-01:/home/gavin/sandbox/qemu.main# git diff
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 254303727b..c4f89e7db6 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2470,6 +2470,9 @@ virt_setup_lazy_vcpu_realization(Object *cpuobj, VirtMachineState *vms)
      /* set operational state of disabled CPUs as OFF */
      ARM_CPU(cpuobj)->power_state = PSCI_OFF;
  
+    /* finalize the features like SVE, SME etc */
+    arm_cpu_finalize_features(ARM_CPU(cpuobj), &error_fatal);
+

Thanks,
Gavin



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 05/24] arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-23  0:14         ` Gavin Shan
@ 2025-10-23  0:35           ` Salil Mehta
  2025-10-23  1:29             ` Salil Mehta
  2025-10-23  1:58             ` Gavin Shan
  0 siblings, 2 replies; 67+ messages in thread
From: Salil Mehta @ 2025-10-23  0:35 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, armbru, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, Keqian Zhu

Hi Gavin,

On Thu, Oct 23, 2025 at 12:14 AM Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Salil,
>
> On 10/23/25 4:50 AM, Salil Mehta wrote:
> > On Wed, Oct 22, 2025 at 6:18 PM Salil Mehta <salil.mehta@opnsrc.net> wrote:
> >> On Wed, Oct 22, 2025 at 10:37 AM Gavin Shan <gshan@redhat.com> wrote:
> >>>
> >>> Hi Salil,
> >>>
> >>> On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
> >>>> From: Salil Mehta <salil.mehta@huawei.com>
> >>>>
> >>>> ARM CPU architecture does not allow CPUs to be plugged after system has
> >>>> initialized. This is a constraint. Hence, the Kernel must know all the CPUs
> >>>> being booted during its initialization. This applies to the Guest Kernel as
> >>>> well and therefore, the number of KVM vCPU descriptors in the host must be
> >>>> fixed at VM initialization time.
> >>>>
> >>>> Also, the GIC must know all the CPUs it is connected to during its
> >>>> initialization, and this cannot change afterward. This must also be ensured
> >>>> during the initialization of the VGIC in KVM. This is necessary because:
> >>>>
> >>>> 1. The association between GICR and MPIDR must be fixed at VM initialization
> >>>>      time. This is represented by the register
> >>>>      `GICR_TYPER(mp_affinity, proc_num)`.
> >>>> 2. Memory regions associated with GICR, etc., cannot be changed (added,
> >>>>      deleted, or modified) after the VM has been initialized. This is not an
> >>>>      ARM architectural constraint but rather invites a difficult and messy
> >>>>      change in VGIC data structures.
> >>>>
> >>>> To enable a hot-add–like model while preserving these constraints, the virt
> >>>> machine may enumerate more CPUs than are enabled at boot using
> >>>> `-smp disabledcpus=N`. Such CPUs are present but start offline (i.e.,
> >>>> administratively disabled at init). The topology remains fixed at VM
> >>>> creation time; only the online/offline status may change later.
> >>>>
> >>>> Administratively disabled vCPUs are not realized in QOM until first enabled,
> >>>> avoiding creation of unnecessary vCPU threads at boot. On large systems, this
> >>>> reduces startup time proportionally to the number of disabled vCPUs. Once a
> >>>> QOM vCPU is realized and its thread created, subsequent enable/disable actions
> >>>> do not unrealize it. This behaviour was adopted following review feedback and
> >>>> differs from earlier RFC versions.
> >>>>
> >>>> Co-developed-by: Keqian Zhu <zhuqian1@huawei.com>
> >>>> Signed-off-by: Keqian Zhu <zhuqian1@huawei.com>
> >>>> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> >>>> ---
> >>>>    accel/kvm/kvm-all.c    |  2 +-
> >>>>    hw/arm/virt.c          | 77 ++++++++++++++++++++++++++++++++++++++----
> >>>>    hw/core/qdev.c         | 17 ++++++++++
> >>>>    include/hw/qdev-core.h | 19 +++++++++++
> >>>>    include/system/kvm.h   |  8 +++++
> >>>>    target/arm/cpu.c       |  2 ++
> >>>>    target/arm/kvm.c       | 40 +++++++++++++++++++++-
> >>>>    target/arm/kvm_arm.h   | 11 ++++++
> >>>>    8 files changed, 168 insertions(+), 8 deletions(-)
> >>>>
> >
> > [...]
> >
> >>>> +void kvm_arm_create_host_vcpu(ARMCPU *cpu)
> >>>> +{
> >>>> +    CPUState *cs = CPU(cpu);
> >>>> +    unsigned long vcpu_id = cs->cpu_index;
> >>>> +    int ret;
> >>>> +
> >>>> +    ret = kvm_create_vcpu(cs);
> >>>> +    if (ret < 0) {
> >>>> +        error_report("Failed to create host vcpu %ld", vcpu_id);
> >>>> +        abort();
> >>>> +    }
> >>>> +
> >>>> +    /*
> >>>> +     * Initialize the vCPU in the host. This will reset the sys regs
> >>>> +     * for this vCPU and related registers like MPIDR_EL1 etc. also
> >>>> +     * get programmed during this call to host. These are referenced
> >>>> +     * later while setting device attributes of the GICR during GICv3
> >>>> +     * reset.
> >>>> +     */
> >>>> +    ret = kvm_arch_init_vcpu(cs);
> >>>> +    if (ret < 0) {
> >>>> +        error_report("Failed to initialize host vcpu %ld", vcpu_id);
> >>>> +        abort();
> >>>> +    }
> >>>> +
> >>>> +    /*
> >>>> +     * park the created vCPU. shall be used during kvm_get_vcpu() when
> >>>> +     * threads are created during realization of ARM vCPUs.
> >>>> +     */
> >>>> +    kvm_park_vcpu(cs);
> >>>> +}
> >>>> +
> >>>
> >>> I don't think we're able to simply call kvm_arch_init_vcpu() in the lazily realized
> >>> path. Otherwise, it can trigger a crash dump on my Nvidia's grace-hopper machine where
> >>> SVE is supported by default.
> >>
> >> Thanks for reporting this. That is not true. As long as we initialize
> >> KVM correctly and
> >> finalize the features like SVE we should be fine. In fact, this is
> >> precisely what we are
> >> doing right now.
> >>
> >> To understand the crash, I need a bit more info.
> >>
> >> 1#  is happening because KVM_ARM_VCPU_INIT is failing. If yes, the can you check
> >>        within the KVM if it is happening because
> >>       a.  features specified by QEMU are not matching the defaults within the KVM
> >>             (HInt: check kvm_vcpu_init_check_features())?
> >>       b. or complaining about init feate change kvm_vcpu_init_changed()?
> >> 2#  or it is happening during the setting of vector length or
> >> finalizing features?
> >>
> >> int kvm_arch_init_vcpu(CPUState *cs)
> >> {
> >>     [...]
> >>           /* Do KVM_ARM_VCPU_INIT ioctl */
> >>          ret = kvm_arm_vcpu_init(cpu);   ---->[1]
> >>          if (ret) {
> >>             return ret;
> >>         }
> >>            if (cpu_isar_feature(aa64_sve, cpu)) {
> >>          ret = kvm_arm_sve_set_vls(cpu); ---->[2]
> >>          if (ret) {
> >>              return ret;
> >>          }
> >>          ret = kvm_arm_vcpu_finalize(cpu, KVM_ARM_VCPU_SVE);--->[3]
> >>          if (ret) {
> >>              return ret;
> >>          }
> >>      }
> >> [...]
> >> }
> >>
> >> I think it's happening because vector length is going uninitialized.
> >> This initialization
> >> happens in context to  arm_cpu_finalize_features() which I forgot to call before
> >> calling KVM finalize.
> >>
> >>>
> >>> kvm_arch_init_vcpu() is supposed to be called in the realization path in current
> >>> implementation (without this series) because the parameters (features) to KVM_ARM_VCPU_INIT
> >>> is populated at vCPU realization time.
> >>
> >> Not necessarily. It is just meant to initialize the KVM. If we take care of the
> >> KVM requirements in the similar way the realize path does we should be
> >> fine. Can you try to add the patch below in your code and test if it works?
> >>
> >>   diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> >> index c4b68a0b17..1091593478 100644
> >> --- a/target/arm/kvm.c
> >> +++ b/target/arm/kvm.c
> >> @@ -1068,6 +1068,9 @@ void kvm_arm_create_host_vcpu(ARMCPU *cpu)
> >>           abort();
> >>       }
> >>
> >> +     /* finalize the features like SVE, SME etc */
> >> +     arm_cpu_finalize_features(cpu, &error_abort);
> >> +
> >>       /*
> >>        * Initialize the vCPU in the host. This will reset the sys regs
> >>        * for this vCPU and related registers like MPIDR_EL1 etc. also
> >>
> >>
> >>
> >>
> >>>
> >>> $ home/gavin/sandbox/qemu.main/build/qemu-system-aarch64           \
> >>>     --enable-kvm -machine virt,gic-version=3 -cpu host               \
> >>>     -smp cpus=4,disabledcpus=2 -m 1024M                              \
> >>>     -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image    \
> >>>     -initrd /home/gavin/sandbox/images/rootfs.cpio.xz -nographic
> >>> qemu-system-aarch64: Failed to initialize host vcpu 4
> >>> Aborted (core dumped)
> >>>
> >>> Backtrace
> >>> =========
> >>> (gdb) bt
> >>> #0  0x0000ffff9106bc80 in __pthread_kill_implementation () at /lib64/libc.so.6
> >>> #1  0x0000ffff9101aa40 [PAC] in raise () at /lib64/libc.so.6
> >>> #2  0x0000ffff91005988 [PAC] in abort () at /lib64/libc.so.6
> >>> #3  0x0000aaaab1cc26b8 [PAC] in kvm_arm_create_host_vcpu (cpu=0xaaaab9ab1bc0)
> >>>       at ../target/arm/kvm.c:1081
> >>> #4  0x0000aaaab1cd0c94 in virt_setup_lazy_vcpu_realization (cpuobj=0xaaaab9ab1bc0, vms=0xaaaab98870a0)
> >>>       at ../hw/arm/virt.c:2483
> >>> #5  0x0000aaaab1cd180c in machvirt_init (machine=0xaaaab98870a0) at ../hw/arm/virt.c:2777
> >>> #6  0x0000aaaab160f220 in machine_run_board_init
> >>>       (machine=0xaaaab98870a0, mem_path=0x0, errp=0xfffffa86bdc8) at ../hw/core/machine.c:1722
> >>> #7  0x0000aaaab1a25ef4 in qemu_init_board () at ../system/vl.c:2723
> >>> #8  0x0000aaaab1a2635c in qmp_x_exit_preconfig (errp=0xaaaab38a50f0 <error_fatal>)
> >>>       at ../system/vl.c:2821
> >>> #9  0x0000aaaab1a28b08 in qemu_init (argc=15, argv=0xfffffa86c1f8) at ../system/vl.c:3882
> >>> #10 0x0000aaaab221d9e4 in main (argc=15, argv=0xfffffa86c1f8) at ../system/main.c:71
> >>
> >>
> >> Thank you for this. Please let me know if the above fix works and also
> >> the return values in
> >> case you encounter errors.
> >
> > I've pushed the fix to below branch for your convenience:
> >
> > Branch: https://github.com/salil-mehta/qemu/commits/virt-cpuhp-armv8/rfc-v6.2
> > Fix: https://github.com/salil-mehta/qemu/commit/1f1fbc0998ffb1fe26140df3c336bf2be2aa8669
> >
>
> I guess rfc-v6.2 branch isn't ready for test because it runs into another crash
> dump with rfc-v6.2 branch, like below.


rfc-v6.2 is not crashing on the Kunpeng920 where I tested. But this
chip lacks some ARM extensions such as SVE, so unfortunately I
can't test SVE/SME/PAuth support.

Can you disable SVE and check whether the guest comes up, just to
narrow down the case?

>
> host$ /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64                     \
>        -accel kvm -machine virt,gic-version=host,nvdimm=on                         \
>        -cpu host,sve=on                                                            \
>        -smp maxcpus=4,cpus=2,disabledcpus=2,sockets=2,clusters=2,cores=1,threads=1 \
>        -m 4096M,slots=16,maxmem=128G                                               \
>        -object memory-backend-ram,id=mem0,size=2048M                               \
>        -object memory-backend-ram,id=mem1,size=2048M                               \
>        -numa node,nodeid=0,memdev=mem0,cpus=0-1                                    \
>        -numa node,nodeid=1,memdev=mem1,cpus=2-3                                    \
>        -L /home/gavin/sandbox/qemu.main/build/pc-bios                              \
>        -monitor none -serial mon:stdio -nographic -gdb tcp::6666                   \
>        -qmp tcp:localhost:5555,server,wait=off                                     \
>        -bios /home/gavin/sandbox/qemu.main/build/pc-bios/edk2-aarch64-code.fd      \
>        -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image               \
>        -initrd /home/gavin/sandbox/images/rootfs.cpio.xz                           \
>        -append memhp_default_state=online_movable
>          :
>          :
> guest$ cd /sys/devices/system/cpu/
> guest$ cat present enabled online
> 0-3
> 0-1
> 0-1
> (qemu) device_set host-arm-cpu,socket-id=1,cluster-id=0,core-id=0,thread-id=0,admin-state=enable
> qemu-system-aarch64: kvm_init_vcpu: kvm_arch_init_vcpu failed (2): Operation not permitted


Ah, I see. I think I understand the issue. It's complaining
about finalize being called twice. Could you check this, as
I do not have a way to test it?


int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature)
{
    switch (feature) {
    case KVM_ARM_VCPU_SVE:
        [...]
        if (kvm_arm_vcpu_sve_finalized(vcpu))
            return -EPERM;  ----> is this where it fails?
    [...]
}
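
To make the suspicion concrete, here is a minimal user-space model of a
once-only finalize. It is only a sketch, not the kernel's code:
`toy_vcpu` and `toy_vcpu_finalize_sve()` are hypothetical names, and the
model assumes that finalizing an already-finalized vCPU is rejected
with -EPERM.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-vCPU state tracked by KVM. */
struct toy_vcpu {
    bool sve_finalized;
};

/*
 * Toy model of a once-only finalize: the first call succeeds and
 * latches the flag; any later call is rejected with -EPERM.
 */
int toy_vcpu_finalize_sve(struct toy_vcpu *vcpu)
{
    if (vcpu->sve_finalized) {
        return -EPERM;
    }
    vcpu->sve_finalized = true;
    return 0;
}
```

If the pre-created (parked) vCPU was already finalized at machine init,
a second finalize issued when the vCPU is enabled later would take the
-EPERM path, which would match the "Operation not permitted" message
above.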


>
> I picked the fix (the last patch in rfc-v6.2 branch) to rfc-v6 branch, same crash dump
> can be seen.

Are you getting the previously reported abort or the new problem above?


Thanks
Salil.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 05/24] arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-23  0:35           ` Salil Mehta
@ 2025-10-23  1:29             ` Salil Mehta
  2025-10-23  4:14               ` Gavin Shan
  2025-10-23  1:58             ` Gavin Shan
  1 sibling, 1 reply; 67+ messages in thread
From: Salil Mehta @ 2025-10-23  1:29 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, armbru, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, Keqian Zhu

Hi Gavin,

On Thu, Oct 23, 2025 at 12:35 AM Salil Mehta <salil.mehta@opnsrc.net> wrote:
>
> HI Gavin,
>
> On Thu, Oct 23, 2025 at 12:14 AM Gavin Shan <gshan@redhat.com> wrote:
> >
> > Hi Salil,
> >
> > On 10/23/25 4:50 AM, Salil Mehta wrote:
> > > On Wed, Oct 22, 2025 at 6:18 PM Salil Mehta <salil.mehta@opnsrc.net> wrote:
> > >> On Wed, Oct 22, 2025 at 10:37 AM Gavin Shan <gshan@redhat.com> wrote:
> > >>>
> > >>> Hi Salil,
> > >>>
> > >>> On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
> > >>>> From: Salil Mehta <salil.mehta@huawei.com>
> > >>>>
> > >>>> ARM CPU architecture does not allow CPUs to be plugged after system has
> > >>>> initialized. This is a constraint. Hence, the Kernel must know all the CPUs
> > >>>> being booted during its initialization. This applies to the Guest Kernel as
> > >>>> well and therefore, the number of KVM vCPU descriptors in the host must be
> > >>>> fixed at VM initialization time.
> > >>>>
> > >>>> Also, the GIC must know all the CPUs it is connected to during its
> > >>>> initialization, and this cannot change afterward. This must also be ensured
> > >>>> during the initialization of the VGIC in KVM. This is necessary because:
> > >>>>
> > >>>> 1. The association between GICR and MPIDR must be fixed at VM initialization
> > >>>>      time. This is represented by the register
> > >>>>      `GICR_TYPER(mp_affinity, proc_num)`.
> > >>>> 2. Memory regions associated with GICR, etc., cannot be changed (added,
> > >>>>      deleted, or modified) after the VM has been initialized. This is not an
> > >>>>      ARM architectural constraint but rather invites a difficult and messy
> > >>>>      change in VGIC data structures.
> > >>>>
> > >>>> To enable a hot-add–like model while preserving these constraints, the virt
> > >>>> machine may enumerate more CPUs than are enabled at boot using
> > >>>> `-smp disabledcpus=N`. Such CPUs are present but start offline (i.e.,
> > >>>> administratively disabled at init). The topology remains fixed at VM
> > >>>> creation time; only the online/offline status may change later.
> > >>>>
> > >>>> Administratively disabled vCPUs are not realized in QOM until first enabled,
> > >>>> avoiding creation of unnecessary vCPU threads at boot. On large systems, this
> > >>>> reduces startup time proportionally to the number of disabled vCPUs. Once a
> > >>>> QOM vCPU is realized and its thread created, subsequent enable/disable actions
> > >>>> do not unrealize it. This behaviour was adopted following review feedback and
> > >>>> differs from earlier RFC versions.
> > >>>>
> > >>>> Co-developed-by: Keqian Zhu <zhuqian1@huawei.com>
> > >>>> Signed-off-by: Keqian Zhu <zhuqian1@huawei.com>
> > >>>> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > >>>> ---
> > >>>>    accel/kvm/kvm-all.c    |  2 +-
> > >>>>    hw/arm/virt.c          | 77 ++++++++++++++++++++++++++++++++++++++----
> > >>>>    hw/core/qdev.c         | 17 ++++++++++
> > >>>>    include/hw/qdev-core.h | 19 +++++++++++
> > >>>>    include/system/kvm.h   |  8 +++++
> > >>>>    target/arm/cpu.c       |  2 ++
> > >>>>    target/arm/kvm.c       | 40 +++++++++++++++++++++-
> > >>>>    target/arm/kvm_arm.h   | 11 ++++++
> > >>>>    8 files changed, 168 insertions(+), 8 deletions(-)
> > >>>>
> > >
> > > [...]
> > >
> > >>>> +void kvm_arm_create_host_vcpu(ARMCPU *cpu)
> > >>>> +{
> > >>>> +    CPUState *cs = CPU(cpu);
> > >>>> +    unsigned long vcpu_id = cs->cpu_index;
> > >>>> +    int ret;
> > >>>> +
> > >>>> +    ret = kvm_create_vcpu(cs);
> > >>>> +    if (ret < 0) {
> > >>>> +        error_report("Failed to create host vcpu %ld", vcpu_id);
> > >>>> +        abort();
> > >>>> +    }
> > >>>> +
> > >>>> +    /*
> > >>>> +     * Initialize the vCPU in the host. This will reset the sys regs
> > >>>> +     * for this vCPU and related registers like MPIDR_EL1 etc. also
> > >>>> +     * get programmed during this call to host. These are referenced
> > >>>> +     * later while setting device attributes of the GICR during GICv3
> > >>>> +     * reset.
> > >>>> +     */
> > >>>> +    ret = kvm_arch_init_vcpu(cs);
> > >>>> +    if (ret < 0) {
> > >>>> +        error_report("Failed to initialize host vcpu %ld", vcpu_id);
> > >>>> +        abort();
> > >>>> +    }
> > >>>> +
> > >>>> +    /*
> > >>>> +     * park the created vCPU. shall be used during kvm_get_vcpu() when
> > >>>> +     * threads are created during realization of ARM vCPUs.
> > >>>> +     */
> > >>>> +    kvm_park_vcpu(cs);
> > >>>> +}
> > >>>> +
> > >>>
> > >>> I don't think we're able to simply call kvm_arch_init_vcpu() in the lazily realized
> > >>> path. Otherwise, it can trigger a crash dump on my Nvidia's grace-hopper machine where
> > >>> SVE is supported by default.
> > >>
> > >> Thanks for reporting this. That is not true. As long as we initialize
> > >> KVM correctly and
> > >> finalize the features like SVE we should be fine. In fact, this is
> > >> precisely what we are
> > >> doing right now.
> > >>
> > >> To understand the crash, I need a bit more info.
> > >>
> > >> 1#  is happening because KVM_ARM_VCPU_INIT is failing. If yes, the can you check
> > >>        within the KVM if it is happening because
> > >>       a.  features specified by QEMU are not matching the defaults within the KVM
> > >>             (HInt: check kvm_vcpu_init_check_features())?
> > >>       b. or complaining about init feate change kvm_vcpu_init_changed()?
> > >> 2#  or it is happening during the setting of vector length or
> > >> finalizing features?
> > >>
> > >> int kvm_arch_init_vcpu(CPUState *cs)
> > >> {
> > >>     [...]
> > >>           /* Do KVM_ARM_VCPU_INIT ioctl */
> > >>          ret = kvm_arm_vcpu_init(cpu);   ---->[1]
> > >>          if (ret) {
> > >>             return ret;
> > >>         }
> > >>            if (cpu_isar_feature(aa64_sve, cpu)) {
> > >>          ret = kvm_arm_sve_set_vls(cpu); ---->[2]
> > >>          if (ret) {
> > >>              return ret;
> > >>          }
> > >>          ret = kvm_arm_vcpu_finalize(cpu, KVM_ARM_VCPU_SVE);--->[3]
> > >>          if (ret) {
> > >>              return ret;
> > >>          }
> > >>      }
> > >> [...]
> > >> }
> > >>
> > >> I think it's happening because vector length is going uninitialized.
> > >> This initialization
> > >> happens in context to  arm_cpu_finalize_features() which I forgot to call before
> > >> calling KVM finalize.
> > >>
> > >>>
> > >>> kvm_arch_init_vcpu() is supposed to be called in the realization path in current
> > >>> implementation (without this series) because the parameters (features) to KVM_ARM_VCPU_INIT
> > >>> is populated at vCPU realization time.
> > >>
> > >> Not necessarily. It is just meant to initialize the KVM. If we take care of the
> > >> KVM requirements in the similar way the realize path does we should be
> > >> fine. Can you try to add the patch below in your code and test if it works?
> > >>
> > >>   diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> > >> index c4b68a0b17..1091593478 100644
> > >> --- a/target/arm/kvm.c
> > >> +++ b/target/arm/kvm.c
> > >> @@ -1068,6 +1068,9 @@ void kvm_arm_create_host_vcpu(ARMCPU *cpu)
> > >>           abort();
> > >>       }
> > >>
> > >> +     /* finalize the features like SVE, SME etc */
> > >> +     arm_cpu_finalize_features(cpu, &error_abort);
> > >> +
> > >>       /*
> > >>        * Initialize the vCPU in the host. This will reset the sys regs
> > >>        * for this vCPU and related registers like MPIDR_EL1 etc. also
> > >>
> > >>
> > >>
> > >>
> > >>>
> > >>> $ home/gavin/sandbox/qemu.main/build/qemu-system-aarch64           \
> > >>>     --enable-kvm -machine virt,gic-version=3 -cpu host               \
> > >>>     -smp cpus=4,disabledcpus=2 -m 1024M                              \
> > >>>     -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image    \
> > >>>     -initrd /home/gavin/sandbox/images/rootfs.cpio.xz -nographic
> > >>> qemu-system-aarch64: Failed to initialize host vcpu 4
> > >>> Aborted (core dumped)
> > >>>
> > >>> Backtrace
> > >>> =========
> > >>> (gdb) bt
> > >>> #0  0x0000ffff9106bc80 in __pthread_kill_implementation () at /lib64/libc.so.6
> > >>> #1  0x0000ffff9101aa40 [PAC] in raise () at /lib64/libc.so.6
> > >>> #2  0x0000ffff91005988 [PAC] in abort () at /lib64/libc.so.6
> > >>> #3  0x0000aaaab1cc26b8 [PAC] in kvm_arm_create_host_vcpu (cpu=0xaaaab9ab1bc0)
> > >>>       at ../target/arm/kvm.c:1081
> > >>> #4  0x0000aaaab1cd0c94 in virt_setup_lazy_vcpu_realization (cpuobj=0xaaaab9ab1bc0, vms=0xaaaab98870a0)
> > >>>       at ../hw/arm/virt.c:2483
> > >>> #5  0x0000aaaab1cd180c in machvirt_init (machine=0xaaaab98870a0) at ../hw/arm/virt.c:2777
> > >>> #6  0x0000aaaab160f220 in machine_run_board_init
> > >>>       (machine=0xaaaab98870a0, mem_path=0x0, errp=0xfffffa86bdc8) at ../hw/core/machine.c:1722
> > >>> #7  0x0000aaaab1a25ef4 in qemu_init_board () at ../system/vl.c:2723
> > >>> #8  0x0000aaaab1a2635c in qmp_x_exit_preconfig (errp=0xaaaab38a50f0 <error_fatal>)
> > >>>       at ../system/vl.c:2821
> > >>> #9  0x0000aaaab1a28b08 in qemu_init (argc=15, argv=0xfffffa86c1f8) at ../system/vl.c:3882
> > >>> #10 0x0000aaaab221d9e4 in main (argc=15, argv=0xfffffa86c1f8) at ../system/main.c:71
> > >>
> > >>
> > >> Thank you for this. Please let me know if the above fix works and also
> > >> the return values in
> > >> case you encounter errors.
> > >
> > > I've pushed the fix to below branch for your convenience:
> > >
> > > Branch: https://github.com/salil-mehta/qemu/commits/virt-cpuhp-armv8/rfc-v6.2
> > > Fix: https://github.com/salil-mehta/qemu/commit/1f1fbc0998ffb1fe26140df3c336bf2be2aa8669
> > >
> >
> > I guess rfc-v6.2 branch isn't ready for test because it runs into another crash
> > dump with rfc-v6.2 branch, like below.
>
>
> rfc-6.2 is not crashing on Kunpeng920 where I tested. But this
> chip does not have some ARM extensions like SVE etc so
> Unfortunately, I can't test SVE/SME/PAuth etc support.
>
> Can you disable SVE and then try if it comes up just to corner
> the case?
>
> >
> > host$ /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64                     \
> >        -accel kvm -machine virt,gic-version=host,nvdimm=on                         \
> >        -cpu host,sve=on                                                            \
> >        -smp maxcpus=4,cpus=2,disabledcpus=2,sockets=2,clusters=2,cores=1,threads=1 \
> >        -m 4096M,slots=16,maxmem=128G                                               \
> >        -object memory-backend-ram,id=mem0,size=2048M                               \
> >        -object memory-backend-ram,id=mem1,size=2048M                               \
> >        -numa node,nodeid=0,memdev=mem0,cpus=0-1                                    \
> >        -numa node,nodeid=1,memdev=mem1,cpus=2-3                                    \
> >        -L /home/gavin/sandbox/qemu.main/build/pc-bios                              \
> >        -monitor none -serial mon:stdio -nographic -gdb tcp::6666                   \
> >        -qmp tcp:localhost:5555,server,wait=off                                     \
> >        -bios /home/gavin/sandbox/qemu.main/build/pc-bios/edk2-aarch64-code.fd      \
> >        -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image               \
> >        -initrd /home/gavin/sandbox/images/rootfs.cpio.xz                           \
> >        -append memhp_default_state=online_movable
> >          :
> >          :
> > guest$ cd /sys/devices/system/cpu/
> > guest$ cat present enabled online
> > 0-3
> > 0-1
> > 0-1
> > (qemu) device_set host-arm-cpu,socket-id=1,cluster-id=0,core-id=0,thread-id=0,admin-state=enable
> > qemu-system-aarch64: kvm_init_vcpu: kvm_arch_init_vcpu failed (2): Operation not permitted
>
>
> Ah, I see. I think I understand the issue. It's complaining
> about calling the  finalize twice. Is it possible to check as
> I do not have a way to test it?
>
>
> int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature)
> {
> switch (feature) {
> case KVM_ARM_VCPU_SVE:
> [...]
> if (kvm_arm_vcpu_sve_finalized(vcpu))
> return -EPERM;-----> this where it must be popping?
> [...]
> }

I've pushed a fix to the same rfc-v6.2 branch that avoids
finalizing the SVE feature (KVM_ARM_VCPU_FINALIZE)
twice.
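
The idea can be sketched as below. This is not the actual patch on the
branch: `toy_cpu`, `backend_finalize_sve()` and `finalize_sve_once()`
are hypothetical names used only for illustration, and the backend stub
assumes the -EPERM-on-second-call behaviour discussed above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-vCPU bookkeeping kept on the caller's side. */
struct toy_cpu {
    bool sve_finalized;   /* set once finalize has succeeded */
    int backend_calls;    /* counts calls that reach the backend stub */
};

/* Stub for the real finalize request: rejects a second call. */
static int backend_finalize_sve(struct toy_cpu *cpu)
{
    return (++cpu->backend_calls > 1) ? -EPERM : 0;
}

/* Finalize SVE at most once; later calls are a successful no-op. */
int finalize_sve_once(struct toy_cpu *cpu)
{
    int ret;

    if (cpu->sve_finalized) {
        return 0;
    }
    ret = backend_finalize_sve(cpu);
    if (ret == 0) {
        cpu->sve_finalized = true;
    }
    return ret;
}
```

With such a guard, the init path and the later enable path can both
request finalization, and only the first request reaches the backend.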

Could you please validate the fix again and check that
SVE works on the NVIDIA Grace Hopper machine?

Many thanks!

Best regards
Salil.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 05/24] arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-23  0:35           ` Salil Mehta
  2025-10-23  1:29             ` Salil Mehta
@ 2025-10-23  1:58             ` Gavin Shan
  2025-10-23 11:17               ` Salil Mehta
  1 sibling, 1 reply; 67+ messages in thread
From: Gavin Shan @ 2025-10-23  1:58 UTC (permalink / raw)
  To: Salil Mehta
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, armbru, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, Keqian Zhu

Hi Salil,

On 10/23/25 10:35 AM, Salil Mehta wrote:
> On Thu, Oct 23, 2025 at 12:14 AM Gavin Shan <gshan@redhat.com> wrote:
>> On 10/23/25 4:50 AM, Salil Mehta wrote:
>>> On Wed, Oct 22, 2025 at 6:18 PM Salil Mehta <salil.mehta@opnsrc.net> wrote:
>>>> On Wed, Oct 22, 2025 at 10:37 AM Gavin Shan <gshan@redhat.com> wrote:
>>>>> On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
>>>>>> From: Salil Mehta <salil.mehta@huawei.com>

[...]

>>>>>> +void kvm_arm_create_host_vcpu(ARMCPU *cpu)
>>>>>> +{
>>>>>> +    CPUState *cs = CPU(cpu);
>>>>>> +    unsigned long vcpu_id = cs->cpu_index;
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = kvm_create_vcpu(cs);
>>>>>> +    if (ret < 0) {
>>>>>> +        error_report("Failed to create host vcpu %ld", vcpu_id);
>>>>>> +        abort();
>>>>>> +    }
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Initialize the vCPU in the host. This will reset the sys regs
>>>>>> +     * for this vCPU and related registers like MPIDR_EL1 etc. also
>>>>>> +     * get programmed during this call to host. These are referenced
>>>>>> +     * later while setting device attributes of the GICR during GICv3
>>>>>> +     * reset.
>>>>>> +     */
>>>>>> +    ret = kvm_arch_init_vcpu(cs);
>>>>>> +    if (ret < 0) {
>>>>>> +        error_report("Failed to initialize host vcpu %ld", vcpu_id);
>>>>>> +        abort();
>>>>>> +    }
>>>>>> +
>>>>>> +    /*
>>>>>> +     * park the created vCPU. shall be used during kvm_get_vcpu() when
>>>>>> +     * threads are created during realization of ARM vCPUs.
>>>>>> +     */
>>>>>> +    kvm_park_vcpu(cs);
>>>>>> +}
>>>>>> +
>>>>>
>>>>> I don't think we're able to simply call kvm_arch_init_vcpu() in the lazily realized
>>>>> path. Otherwise, it can trigger a crash dump on my Nvidia's grace-hopper machine where
>>>>> SVE is supported by default.
>>>>
>>>> Thanks for reporting this. That is not true. As long as we initialize
>>>> KVM correctly and
>>>> finalize the features like SVE we should be fine. In fact, this is
>>>> precisely what we are
>>>> doing right now.
>>>>
>>>> To understand the crash, I need a bit more info.
>>>>
>>>> 1#  is happening because KVM_ARM_VCPU_INIT is failing. If yes, the can you check
>>>>         within the KVM if it is happening because
>>>>        a.  features specified by QEMU are not matching the defaults within the KVM
>>>>              (HInt: check kvm_vcpu_init_check_features())?
>>>>        b. or complaining about init feate change kvm_vcpu_init_changed()?
>>>> 2#  or it is happening during the setting of vector length or
>>>> finalizing features?
>>>>
>>>> int kvm_arch_init_vcpu(CPUState *cs)
>>>> {
>>>>      [...]
>>>>            /* Do KVM_ARM_VCPU_INIT ioctl */
>>>>           ret = kvm_arm_vcpu_init(cpu);   ---->[1]
>>>>           if (ret) {
>>>>              return ret;
>>>>          }
>>>>             if (cpu_isar_feature(aa64_sve, cpu)) {
>>>>           ret = kvm_arm_sve_set_vls(cpu); ---->[2]
>>>>           if (ret) {
>>>>               return ret;
>>>>           }
>>>>           ret = kvm_arm_vcpu_finalize(cpu, KVM_ARM_VCPU_SVE);--->[3]
>>>>           if (ret) {
>>>>               return ret;
>>>>           }
>>>>       }
>>>> [...]
>>>> }
>>>>
>>>> I think it's happening because vector length is going uninitialized.
>>>> This initialization
>>>> happens in context to  arm_cpu_finalize_features() which I forgot to call before
>>>> calling KVM finalize.
>>>>
>>>>>
>>>>> kvm_arch_init_vcpu() is supposed to be called in the realization path in current
>>>>> implementation (without this series) because the parameters (features) to KVM_ARM_VCPU_INIT
>>>>> is populated at vCPU realization time.
>>>>
>>>> Not necessarily. It is just meant to initialize the KVM. If we take care of the
>>>> KVM requirements in the similar way the realize path does we should be
>>>> fine. Can you try to add the patch below in your code and test if it works?
>>>>
>>>>    diff --git a/target/arm/kvm.c b/target/arm/kvm.c
>>>> index c4b68a0b17..1091593478 100644
>>>> --- a/target/arm/kvm.c
>>>> +++ b/target/arm/kvm.c
>>>> @@ -1068,6 +1068,9 @@ void kvm_arm_create_host_vcpu(ARMCPU *cpu)
>>>>            abort();
>>>>        }
>>>>
>>>> +     /* finalize the features like SVE, SME etc */
>>>> +     arm_cpu_finalize_features(cpu, &error_abort);
>>>> +
>>>>        /*
>>>>         * Initialize the vCPU in the host. This will reset the sys regs
>>>>         * for this vCPU and related registers like MPIDR_EL1 etc. also
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> $ home/gavin/sandbox/qemu.main/build/qemu-system-aarch64           \
>>>>>      --enable-kvm -machine virt,gic-version=3 -cpu host               \
>>>>>      -smp cpus=4,disabledcpus=2 -m 1024M                              \
>>>>>      -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image    \
>>>>>      -initrd /home/gavin/sandbox/images/rootfs.cpio.xz -nographic
>>>>> qemu-system-aarch64: Failed to initialize host vcpu 4
>>>>> Aborted (core dumped)
>>>>>
>>>>> Backtrace
>>>>> =========
>>>>> (gdb) bt
>>>>> #0  0x0000ffff9106bc80 in __pthread_kill_implementation () at /lib64/libc.so.6
>>>>> #1  0x0000ffff9101aa40 [PAC] in raise () at /lib64/libc.so.6
>>>>> #2  0x0000ffff91005988 [PAC] in abort () at /lib64/libc.so.6
>>>>> #3  0x0000aaaab1cc26b8 [PAC] in kvm_arm_create_host_vcpu (cpu=0xaaaab9ab1bc0)
>>>>>        at ../target/arm/kvm.c:1081
>>>>> #4  0x0000aaaab1cd0c94 in virt_setup_lazy_vcpu_realization (cpuobj=0xaaaab9ab1bc0, vms=0xaaaab98870a0)
>>>>>        at ../hw/arm/virt.c:2483
>>>>> #5  0x0000aaaab1cd180c in machvirt_init (machine=0xaaaab98870a0) at ../hw/arm/virt.c:2777
>>>>> #6  0x0000aaaab160f220 in machine_run_board_init
>>>>>        (machine=0xaaaab98870a0, mem_path=0x0, errp=0xfffffa86bdc8) at ../hw/core/machine.c:1722
>>>>> #7  0x0000aaaab1a25ef4 in qemu_init_board () at ../system/vl.c:2723
>>>>> #8  0x0000aaaab1a2635c in qmp_x_exit_preconfig (errp=0xaaaab38a50f0 <error_fatal>)
>>>>>        at ../system/vl.c:2821
>>>>> #9  0x0000aaaab1a28b08 in qemu_init (argc=15, argv=0xfffffa86c1f8) at ../system/vl.c:3882
>>>>> #10 0x0000aaaab221d9e4 in main (argc=15, argv=0xfffffa86c1f8) at ../system/main.c:71
>>>>
>>>>
>>>> Thank you for this. Please let me know if the above fix works and also
>>>> the return values in
>>>> case you encounter errors.
>>>
>>> I've pushed the fix to below branch for your convenience:
>>>
>>> Branch: https://github.com/salil-mehta/qemu/commits/virt-cpuhp-armv8/rfc-v6.2
>>> Fix: https://github.com/salil-mehta/qemu/commit/1f1fbc0998ffb1fe26140df3c336bf2be2aa8669
>>>
>>
>> I guess rfc-v6.2 branch isn't ready for test because it runs into another crash
>> dump with rfc-v6.2 branch, like below.
> 
> 
> rfc-6.2 is not crashing on Kunpeng920 where I tested. But this
> chip does not have some ARM extensions like SVE etc so
> Unfortunately, I can't test SVE/SME/PAuth etc support.
> 
> Can you disable SVE and then try if it comes up just to corner
> the case?
> 

Right, this crash dump shouldn't be encountered if SVE isn't supported. I already
use the workaround "-cpu host,sve=off" to keep my tests moving forward...

>>
>> host$ /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64                     \
>>         -accel kvm -machine virt,gic-version=host,nvdimm=on                         \
>>         -cpu host,sve=on                                                            \
>>         -smp maxcpus=4,cpus=2,disabledcpus=2,sockets=2,clusters=2,cores=1,threads=1 \
>>         -m 4096M,slots=16,maxmem=128G                                               \
>>         -object memory-backend-ram,id=mem0,size=2048M                               \
>>         -object memory-backend-ram,id=mem1,size=2048M                               \
>>         -numa node,nodeid=0,memdev=mem0,cpus=0-1                                    \
>>         -numa node,nodeid=1,memdev=mem1,cpus=2-3                                    \
>>         -L /home/gavin/sandbox/qemu.main/build/pc-bios                              \
>>         -monitor none -serial mon:stdio -nographic -gdb tcp::6666                   \
>>         -qmp tcp:localhost:5555,server,wait=off                                     \
>>         -bios /home/gavin/sandbox/qemu.main/build/pc-bios/edk2-aarch64-code.fd      \
>>         -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image               \
>>         -initrd /home/gavin/sandbox/images/rootfs.cpio.xz                           \
>>         -append memhp_default_state=online_movable
>>           :
>>           :
>> guest$ cd /sys/devices/system/cpu/
>> guest$ cat present enabled online
>> 0-3
>> 0-1
>> 0-1
>> (qemu) device_set host-arm-cpu,socket-id=1,cluster-id=0,core-id=0,thread-id=0,admin-state=enable
>> qemu-system-aarch64: kvm_init_vcpu: kvm_arch_init_vcpu failed (2): Operation not permitted
> 
> 
> Ah, I see. I think I understand the issue. It's complaining
> about calling the  finalize twice. Is it possible to check as
> I do not have a way to test it?
> 
> 
> int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature)
> {
> switch (feature) {
> case KVM_ARM_VCPU_SVE:
> [...]
> if (kvm_arm_vcpu_sve_finalized(vcpu))
> return -EPERM;-----> this where it must be popping?
> [...]
> }
> 

Right, I think that's the case: QEMU tries to finalize the SVE capability twice,
which is the real problem. I'm explaining what I found below, which should be
helpful for the forthcoming revisions.

machvirt_init
   virt_setup_lazy_vcpu_realization
     arm_cpu_finalize_features
     kvm_arm_create_host_vcpu
       kvm_create_vcpu                       // New fd is created
       kvm_arch_init_vcpu
         kvm_arm_vcpu_init
         kvm_arm_sve_set_vls
         kvm_arm_vcpu_finalize               // (A) SVE capability is finalized

device_set_admin_power_state
   device_pre_poweron
     virt_machine_device_pre_poweron
       virt_cpu_pre_poweron
         qdev_realize
           arm_cpu_realizefn
             cpu_exec_realizefn
             arm_cpu_finalize_features       // Called for the second time
             qemu_init_vcpu
               kvm_start_vcpu_thread
                 kvm_vcpu_thread_fn
                   kvm_init_vcpu
                     kvm_create_vcpu         // Called for the second time
                     kvm_arch_init_vcpu      // Called for the second time
                       kvm_arm_vcpu_init
                       kvm_arm_sve_set_vls   // (B) Failed here
                       kvm_arm_vcpu_finalize

(B) is where we try to finalize the SVE capability again, although it has
     already been finalized at (A). Finalizing the SVE capability twice is
     disallowed by KVM on the host side.
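
To illustrate the double-finalize pattern in the two call chains above, here is
a minimal standalone sketch. It is not the fix pushed to the rfc-v6.2 branch and
it does not use real QEMU or KVM APIs: struct mock_vcpu, mock_kvm_finalize_sve()
and mock_arch_init_vcpu() are hypothetical names. The mock host call only models
the -EPERM behaviour quoted above, and the guard shows one possible way for the
second init path to skip a feature that was already finalized.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-vCPU state; not a QEMU or KVM structure. */
struct mock_vcpu {
    bool sve_finalized; /* models the host-side "SVE already finalized" state */
};

/*
 * Models the host behaviour quoted above: finalizing SVE a second time
 * is rejected with -EPERM.
 */
static int mock_kvm_finalize_sve(struct mock_vcpu *vcpu)
{
    assert(vcpu != NULL);
    if (vcpu->sve_finalized) {
        return -EPERM;
    }
    vcpu->sve_finalized = true;
    return 0;
}

/*
 * Hypothetical guard on the init path: when the vCPU was pre-created and
 * finalized at machine init (A), the later realize path (B) skips the
 * finalize step instead of issuing it again.
 */
static int mock_arch_init_vcpu(struct mock_vcpu *vcpu)
{
    assert(vcpu != NULL);
    if (vcpu->sve_finalized) {
        return 0; /* nothing left to do for this feature */
    }
    return mock_kvm_finalize_sve(vcpu);
}
```

With such a guard, calling mock_arch_init_vcpu() once at machine init and again
at hot-add succeeds both times, whereas calling mock_kvm_finalize_sve() directly
the second time returns -EPERM, which matches the "Operation not permitted"
message in the log.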


>>
>> I picked the fix (the last patch in rfc-v6.2 branch) to rfc-v6 branch, same crash dump
>> can be seen.
> 
> Are you getting previously reported abort or above new problem?
> 

Previously, the VM couldn't be started. With your fix applied, the VM is able to start.
This is a new problem: the QEMU crash is seen on an attempt to hot-add a vCPU.

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 05/24] arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-23  1:29             ` Salil Mehta
@ 2025-10-23  4:14               ` Gavin Shan
  2025-10-23 11:27                 ` Salil Mehta
  0 siblings, 1 reply; 67+ messages in thread
From: Gavin Shan @ 2025-10-23  4:14 UTC (permalink / raw)
  To: Salil Mehta
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, armbru, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, Keqian Zhu

Hi Salil,

On 10/23/25 11:29 AM, Salil Mehta wrote:

[...]

>>
>> Ah, I see. I think I understand the issue. It's complaining
>> about calling the  finalize twice. Is it possible to check as
>> I do not have a way to test it?
>>
>>
>> int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature)
>> {
>> switch (feature) {
>> case KVM_ARM_VCPU_SVE:
>> [...]
>> if (kvm_arm_vcpu_sve_finalized(vcpu))
>> return -EPERM;-----> this where it must be popping?
>> [...]
>> }
> 
> I've pushed the fix to avoid calling the finalizing SVE
> feature (KVM_ARM_VCPU_FINALIZE) twice on the
> same RFC-V6.2 branch.
> 
> May I kindly request you to validate the fix again and
> check SVE works on NVIDIA grace-hopper?
> 

With the latest rfc-v6.2 branch, I no longer hit the issue. The vCPU can be hot-added
and removed on the Grace Hopper host.

host$ /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64                     \
       -accel kvm -machine virt,gic-version=host,nvdimm=on                         \
       -cpu host,sve=on                                                            \
       -smp maxcpus=4,cpus=2,disabledcpus=2,sockets=2,clusters=2,cores=1,threads=1 \
       -m 4096M,slots=16,maxmem=128G                                               \
       -object memory-backend-ram,id=mem0,size=2048M                               \
       -object memory-backend-ram,id=mem1,size=2048M                               \
       -numa node,nodeid=0,memdev=mem0,cpus=0-1                                    \
       -numa node,nodeid=1,memdev=mem1,cpus=2-3                                    \
       -L /home/gavin/sandbox/qemu.main/build/pc-bios                              \
       -monitor none -serial mon:stdio -nographic -gdb tcp::6666                   \
       -qmp tcp:localhost:5555,server,wait=off                                     \
       -bios /home/gavin/sandbox/qemu.main/build/pc-bios/edk2-aarch64-code.fd      \
       -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image               \
       -initrd /home/gavin/sandbox/images/rootfs.cpio.xz                           \
       -append memhp_default_state=online_movable
          :
          :
guest$ cd /sys/devices/system/cpu
guest$ cat present enabled online
0-3
0-1
0-1
(qemu) device_set host-arm-cpu,socket-id=1,cluster-id=0,core-id=0,thread-id=0,admin-state=enable
guest$ echo 1 > cpu2/online
guest$ cat present enabled online
0-3
0-2
0-2
         :
         :
guest$ cd /sys/devices/system/cpu
guest$ cat present enabled online
0-3
0-2
0-2
(qemu) device_set host-arm-cpu,socket-id=1,cluster-id=0,core-id=0,thread-id=0,admin-state=disable
guest$ cat present enabled online
0-3
0-1
0-1

Thanks,
Gavin



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 05/24] arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-23  1:58             ` Gavin Shan
@ 2025-10-23 11:17               ` Salil Mehta
  0 siblings, 0 replies; 67+ messages in thread
From: Salil Mehta @ 2025-10-23 11:17 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, armbru, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, Keqian Zhu

Hi Gavin

On Thu, Oct 23, 2025 at 1:58 AM Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Salil,
>
> On 10/23/25 10:35 AM, Salil Mehta wrote:
> > On Thu, Oct 23, 2025 at 12:14 AM Gavin Shan <gshan@redhat.com> wrote:
> >> On 10/23/25 4:50 AM, Salil Mehta wrote:
> >>> On Wed, Oct 22, 2025 at 6:18 PM Salil Mehta <salil.mehta@opnsrc.net> wrote:
> >>>> On Wed, Oct 22, 2025 at 10:37 AM Gavin Shan <gshan@redhat.com> wrote:
> >>>>> On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
> >>>>>> From: Salil Mehta <salil.mehta@huawei.com>

[...]

> >> guest$ cd /sys/devices/system/cpu/
> >> guest$ cat present enabled online
> >> 0-3
> >> 0-1
> >> 0-1
> >> (qemu) device_set host-arm-cpu,socket-id=1,cluster-id=0,core-id=0,thread-id=0,admin-state=enable
> >> qemu-system-aarch64: kvm_init_vcpu: kvm_arch_init_vcpu failed (2): Operation not permitted
> >
> >
> > Ah, I see. I think I understand the issue. It's complaining
> > about calling the  finalize twice. Is it possible to check as
> > I do not have a way to test it?
> >
> >
> > int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature)
> > {
> > switch (feature) {
> > case KVM_ARM_VCPU_SVE:
> > [...]
> > if (kvm_arm_vcpu_sve_finalized(vcpu))
> > return -EPERM;-----> this where it must be popping?
> > [...]
> > }
> >
>
> Right, I think that's the case: QEMU tries to finalize the SVE capability twice,
> which is the real problem. I'm explaining what I found below, which should be
> helpful for the forthcoming revisions.
>
> machvirt_init
>    virt_setup_lazy_vcpu_realization
>      arm_cpu_finalize_features
>      kvm_arm_create_host_vcpu
>        kvm_create_vcpu                       // New fd is created
>        kvm_arch_init_vcpu
>          kvm_arm_vcpu_init
>          kvm_arm_sve_set_vls
>          kvm_arm_vcpu_finalize               // (A) SVE capability is finalized
>
> device_set_admin_power_state
>    device_pre_poweron
>      virt_machine_device_pre_poweron
>        virt_cpu_pre_poweron
>          qdev_realize
>            arm_cpu_realizefn
>              cpu_exec_realizefn
>              arm_cpu_finalize_features       // Called for the second time
>              qemu_init_vcpu
>                kvm_start_vcpu_thread
>                  kvm_vcpu_thread_fn
>                    kvm_init_vcpu
>                      kvm_create_vcpu         // Called for the second time
>                      kvm_arch_init_vcpu      // Called for the second time
>                        kvm_arm_vcpu_init
>                        kvm_arm_sve_set_vls   // (B) Failed here
>                        kvm_arm_vcpu_finalize
>
> (B) is where we try to finalize the SVE capability again, although it has
>      already been finalized at (A). Finalizing the SVE capability twice is
>      disallowed by KVM on the host side.
>
>
> >>
> >> I picked the fix (the last patch in rfc-v6.2 branch) to rfc-v6 branch, same crash dump
> >> can be seen.
> >
> > Are you getting previously reported abort or above new problem?
> >
>
> Previously, the VM couldn't be started. With your fix applied, the VM is able to start.
> This is a new problem: the QEMU crash is seen on an attempt to hot-add a vCPU.


Thanks for confirming this as well.

Cheers
Salil.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 05/24] arm/virt,kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init
  2025-10-23  4:14               ` Gavin Shan
@ 2025-10-23 11:27                 ` Salil Mehta
  0 siblings, 0 replies; 67+ messages in thread
From: Salil Mehta @ 2025-10-23 11:27 UTC (permalink / raw)
  To: Gavin Shan
  Cc: qemu-devel, qemu-arm, mst, salil.mehta, maz, jean-philippe,
	jonathan.cameron, lpieralisi, peter.maydell, richard.henderson,
	imammedo, armbru, andrew.jones, david, philmd, eric.auger, will,
	ardb, oliver.upton, pbonzini, rafael, borntraeger, alex.bennee,
	gustavo.romero, npiggin, harshpb, linux, darren, ilkka, vishnu,
	gankulkarni, karl.heubaum, miguel.luis, zhukeqian1,
	wangxiongfeng2, wangyanan55, wangzhou1, linuxarm, jiakernel2,
	maobibo, lixianglai, shahuang, zhao1.liu, Keqian Zhu

[!] Sending this again, to keep conversation *legally* correct,
as this did not appear in the mailing-list when sent from my
official ID.

Sorry for any inconvenience caused due to this.

On Thu, Oct 23, 2025 at 4:14 AM Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Salil,
>
> On 10/23/25 11:29 AM, Salil Mehta wrote:
>
> [...]
>
> >>
> >> Ah, I see. I think I understand the issue. It's complaining
> >> about calling the  finalize twice. Is it possible to check as
> >> I do not have a way to test it?
> >>
> >>
> >> int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature)
> >> {
> >> switch (feature) {
> >> case KVM_ARM_VCPU_SVE:
> >> [...]
> >> if (kvm_arm_vcpu_sve_finalized(vcpu))
> >> return -EPERM;-----> this where it must be popping?
> >> [...]
> >> }
> >
> > I've pushed the fix to avoid calling the finalizing SVE
> > feature (KVM_ARM_VCPU_FINALIZE) twice on the
> > same RFC-V6.2 branch.
> >
> > May I kindly request you to validate the fix again and
> > check SVE works on NVIDIA grace-hopper?
> >
>
> With the latest rfc-v6.2 branch, I don't hit the issue. The vCPU can be hot added
> and removed on grace-hopper host.

Excellent, SVE/SME and other ARM extensions have not been tested earlier.
It would be of immense help if all of these can be validated as I do not have
capable hardware to test them.

Many thanks for your proactive efforts in reporting, reviewing the fixes and
validating them as well. I appreciate it!

For anyone who wants to try, the fix is here:

Branch: https://github.com/salil-mehta/qemu/commits/virt-cpuhp-armv8/rfc-v6.2
Fix: https://github.com/salil-mehta/qemu/commit/cd58e65a79c224a59407553c1a6288ed667b19ed


Many thanks!
Salil.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 08/24] arm/virt, gicv3: Guard CPU interface access for admin disabled vCPUs
  2025-10-01  1:01 ` [PATCH RFC V6 08/24] arm/virt, gicv3: Guard CPU interface access for admin disabled vCPUs salil.mehta
@ 2025-10-24  4:07   ` Gavin Shan
  0 siblings, 0 replies; 67+ messages in thread
From: Gavin Shan @ 2025-10-24  4:07 UTC (permalink / raw)
  To: salil.mehta, qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, zhukeqian1, wangxiongfeng2, wangyanan55, wangzhou1,
	linuxarm, jiakernel2, maobibo, lixianglai, shahuang, zhao1.liu

Hi Salil,

On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
> From: Salil Mehta <salil.mehta@huawei.com>
> 
> Per Arm GIC Architecture Specification (IHI0069H_b, §11.1), the CPU interface
> and its Processing Element (PE) share a power domain. If the PE is powered down
> or administratively disabled, the CPU interface must be quiescent or off, and
> any access is architecturally UNPREDICTABLE. Without explicit checks, QEMU may
> issue GICC register operations for vCPUs that are offline, removed, or
> otherwise unavailable—risking inconsistent state or undefined behavior in both
> TCG and KVM accelerators.
> 
> To address this, introduce a per-vCPU gicc_accessible flag that reflects the
> administrative enablement of the corresponding QOM vCPU in accordance with the
> policy. This is permissible when the GICC (GIC CPU Interface) is online-capable,
> meaning vCPUs can be brought online in the guest kernel after boot. The flag is
> set during GIC realization and used to skip VGIC register reads/writes, SGI
> generation, and CPU interface updates when the GICC is not accessible. This
> prevents unsafe operations and ensures compliance when managing administratively
> disabled but present vCPUs.
> 
> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> ---
>   hw/core/qdev.c                     | 26 +++++++++++++++++
>   hw/intc/arm_gicv3_common.c         | 23 +++++++++++++++
>   hw/intc/arm_gicv3_cpuif.c          |  8 +++++
>   hw/intc/arm_gicv3_cpuif_common.c   | 47 ++++++++++++++++++++++++++++++
>   hw/intc/arm_gicv3_kvm.c            | 18 ++++++++++++
>   include/hw/intc/arm_gicv3_common.h | 24 +++++++++++++++
>   include/hw/qdev-core.h             | 24 +++++++++++++++
>   7 files changed, 170 insertions(+)
> 
> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> index 5816abae39..8e9a4da6b5 100644
> --- a/hw/core/qdev.c
> +++ b/hw/core/qdev.c
> @@ -326,6 +326,32 @@ bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp)
>                                      errp);
>   }
>   
> +int qdev_get_admin_power_state(DeviceState *dev)
> +{
> +    DeviceClass *dc;
> +
> +    if (!dev) {
> +        return DEVICE_ADMIN_POWER_STATE_REMOVED;
> +    }
> +
> +    dc = DEVICE_GET_CLASS(dev);
> +    if (dc->admin_power_state_supported) {
> +        return object_property_get_enum(OBJECT(dev), "admin_power_state",
> +                                        "DeviceAdminPowerState", NULL);
> +    }
> +
> +    return DEVICE_ADMIN_POWER_STATE_ENABLED;
> +}
> +
> +bool qdev_check_enabled(DeviceState *dev)
> +{
> +   /*
> +    * if device supports power state transitions, check if it is not in
> +    * 'disabled' state.
> +    */
> +    return qdev_get_admin_power_state(dev) == DEVICE_ADMIN_POWER_STATE_ENABLED;
> +}
> +
>   bool qdev_machine_modified(void)
>   {
>       return qdev_hot_added || qdev_hot_removed;
> diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
> index f6a9f1c68b..f4428ad165 100644
> --- a/hw/intc/arm_gicv3_common.c
> +++ b/hw/intc/arm_gicv3_common.c
> @@ -439,6 +439,29 @@ static void arm_gicv3_common_realize(DeviceState *dev, Error **errp)
>           CPUState *cpu = machine_get_possible_cpu(i);
>           uint64_t cpu_affid;
>   
> +        /*
> +         * Ref: Arm Generic Interrupt Controller Architecture Specification
> +         * (GIC Architecture version 3 and version 4), IHI0069H_b,
> +         * Section 11.1: Power Management
> +         * https://developer.arm.com/documentation/ihi0069
> +         *
> +         * According to this specification, the CPU interface and the
> +         * Processing Element (PE) must reside in the same power domain.
> +         * Therefore, when a CPU/PE is powered off, its corresponding CPU
> +         * interface must also be in the off state or in a quiescent state—
> +         * depending on the state of the associated Redistributor.
> +         *
> +         * The Redistributor may reside in a separate power domain and may
> +         * remain powered even when the associated PE is turned off.
> +         *
> +         * Accessing the GIC CPU interface while the PE is powered down can
> +         * lead to UNPREDICTABLE behavior.
> +         *
> +         * Accordingly, the QOM object `GICv3CPUState` should be marked as
> +         * either accessible or inaccessible based on the power state of the
> +         * associated `CPUState` vCPU.
> +         */
> +        s->cpu[i].gicc_accessible = qdev_check_enabled(DEVICE(cpu));
>           s->cpu[i].cpu = cpu;
>           s->cpu[i].gic = s;
>           /* Store GICv3CPUState in CPUARMState gicv3state pointer */
> diff --git a/hw/intc/arm_gicv3_cpuif.c b/hw/intc/arm_gicv3_cpuif.c
> index a7904237ac..6430b2c649 100644
> --- a/hw/intc/arm_gicv3_cpuif.c
> +++ b/hw/intc/arm_gicv3_cpuif.c
> @@ -1052,6 +1052,10 @@ void gicv3_cpuif_update(GICv3CPUState *cs)
>       ARMCPU *cpu = ARM_CPU(cs->cpu);
>       CPUARMState *env = &cpu->env;
>   
> +    if (!gicv3_gicc_accessible(OBJECT(cs->gic), CPU(cpu)->cpu_index)) {
> +        return;
> +    }
> +
>       g_assert(bql_locked());
>   
>       trace_gicv3_cpuif_update(gicv3_redist_affid(cs), cs->hppi.irq,
> @@ -2036,6 +2040,10 @@ static void icc_generate_sgi(CPUARMState *env, GICv3CPUState *cs,
>       for (i = 0; i < s->num_cpu; i++) {
>           GICv3CPUState *ocs = &s->cpu[i];
>   
> +        if (!gicv3_gicc_accessible(OBJECT(s), i)) {
> +            continue;
> +        }
> +
>           if (irm) {
>               /* IRM == 1 : route to all CPUs except self */
>               if (cs == ocs) {
> diff --git a/hw/intc/arm_gicv3_cpuif_common.c b/hw/intc/arm_gicv3_cpuif_common.c
> index f9a9b2d8a3..8f9a5b6fa2 100644
> --- a/hw/intc/arm_gicv3_cpuif_common.c
> +++ b/hw/intc/arm_gicv3_cpuif_common.c
> @@ -12,6 +12,9 @@
>   #include "qemu/osdep.h"
>   #include "gicv3_internal.h"
>   #include "cpu.h"
> +#include "qemu/log.h"
> +#include "monitor/monitor.h"
> +#include "qapi/visitor.h"
>   
>   void gicv3_set_gicv3state(CPUState *cpu, GICv3CPUState *s)
>   {
> @@ -21,6 +24,41 @@ void gicv3_set_gicv3state(CPUState *cpu, GICv3CPUState *s)
>       env->gicv3state = (void *)s;
>   };
>   
> +static void
> +gicv3_get_gicc_accessibility(Object *obj, Visitor *v, const char *name,
> +                             void *opaque, Error **errp)
> +{
> +    GICv3CPUState *cs = (GICv3CPUState *)opaque;
> +    bool value = cs->gicc_accessible;
> +
> +    visit_type_bool(v, name, &value, errp);
> +}
> +
> +static void
> +gicv3_set_gicc_accessibility(Object *obj, Visitor *v, const char *name,
> +                             void *opaque, Error **errp)
> +{
> +    GICv3CPUState *gcs = opaque;
> +    CPUState *cs = gcs->cpu;
> +    bool value;
> +
> +    visit_type_bool(v, name, &value, errp);
> +
> +    /* Block external attempts to set */
> +    if (monitor_cur_is_qmp()) {
> +        error_setg(errp, "Property 'gicc-accessible' is read-only externally");
> +        return;
> +    }
> +
> +    if (gcs->gicc_accessible != value) {
> +        gcs->gicc_accessible = value;
> +
> +        qemu_log_mask(LOG_UNIMP,
> +                      "GICC accessibility changed: vCPU %d = %s\n",
> +                      cs->cpu_index, value ? "accessible" : "inaccessible");
> +    }
> +}
> +

Despite the check above, the property can still be modified externally with
'qom-set' from the monitor prompt:

(qemu) qom-list /machine/unattached
device[2] (child<kvm-arm-gicv3>)

(qemu) qom-get /machine/unattached/device[2] gicc-accessible[0]
true
(qemu) qom-set /machine/unattached/device[2] gicc-accessible[0] false
(qemu) qom-get /machine/unattached/device[2] gicc-accessible[0]
false
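
A possible explanation, assuming the (qemu) prompt above is the HMP monitor, is
that a check which only rejects QMP does not cover HMP writes. The standalone
sketch below models that difference. It uses no QEMU APIs: enum write_origin,
struct mock_gicc and both setters are hypothetical and exist only to contrast
"reject QMP" with "accept internal writes only".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical origin of a property write; QEMU has no such enum. */
enum write_origin {
    WRITE_FROM_QMP,
    WRITE_FROM_HMP,
    WRITE_INTERNAL,
};

/* Hypothetical stand-in for the per-vCPU GICC state. */
struct mock_gicc {
    bool accessible;
};

/*
 * Mirrors the shape of the check in the patch: only QMP writes are
 * rejected, so an HMP write still goes through.
 */
static bool set_blocking_qmp_only(struct mock_gicc *g, bool value,
                                  enum write_origin origin)
{
    assert(g != NULL);
    if (origin == WRITE_FROM_QMP) {
        return false;
    }
    g->accessible = value;
    return true;
}

/*
 * Alternative policy: accept the write only when it comes from internal
 * code, which rejects both QMP and HMP.
 */
static bool set_internal_only(struct mock_gicc *g, bool value,
                              enum write_origin origin)
{
    assert(g != NULL);
    if (origin != WRITE_INTERNAL) {
        return false;
    }
    g->accessible = value;
    return true;
}
```

In this model, set_blocking_qmp_only() lets the HMP write flip the flag, which
matches the 'qom-get' output above, while set_internal_only() leaves the flag
untouched for any monitor-originated write.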

Thanks,
Gavin

>   void gicv3_init_cpuif(GICv3State *s)
>   {
>       ARMGICv3CommonClass *agcc = ARM_GICV3_COMMON_GET_CLASS(s);
> @@ -28,6 +66,15 @@ void gicv3_init_cpuif(GICv3State *s)
>   
>       /* define and register `system registers` with the vCPU  */
>       for (i = 0; i < s->num_cpu; i++) {
> +        g_autofree char *propname = g_strdup_printf("gicc-accessible[%d]", i);
> +        object_property_add(OBJECT(s), propname, "bool",
> +                            gicv3_get_gicc_accessibility,
> +                            gicv3_set_gicc_accessibility,
> +                            NULL, &s->cpu[i]);
> +
> +        object_property_set_description(OBJECT(s), propname,
> +            "Per-vCPU GICC interface accessibility (internal set only)");
> +
>           agcc->init_cpu_reginfo(s->cpu[i].cpu);
>       }
>   }
> diff --git a/hw/intc/arm_gicv3_kvm.c b/hw/intc/arm_gicv3_kvm.c
> index 4ca889da45..e97578f59a 100644
> --- a/hw/intc/arm_gicv3_kvm.c
> +++ b/hw/intc/arm_gicv3_kvm.c
> @@ -457,6 +457,16 @@ static void kvm_arm_gicv3_put(GICv3State *s)
>           GICv3CPUState *c = &s->cpu[ncpu];
>           int num_pri_bits;
>   
> +        /*
> +         * We must ensure that we do not attempt to access or update KVM GICC
> +         * registers if their corresponding QOM `GICv3CPUState` is marked as
> +         * 'inaccessible', because their corresponding QOM vCPU objects
> +         * are in administratively 'disabled' state.
> +         */
> +        if (!gicv3_gicc_accessible(OBJECT(s), ncpu)) {
> +            continue;
> +        }
> +
>           kvm_gicc_access(s, ICC_SRE_EL1, ncpu, &c->icc_sre_el1, true);
>           kvm_gicc_access(s, ICC_CTLR_EL1, ncpu,
>                           &c->icc_ctlr_el1[GICV3_NS], true);
> @@ -615,6 +625,14 @@ static void kvm_arm_gicv3_get(GICv3State *s)
>           GICv3CPUState *c = &s->cpu[ncpu];
>           int num_pri_bits;
>   
> +        /*
> +         * don't attempt to access KVM VGIC for the disabled vCPUs where
> +         * GICv3CPUState is inaccessible.
> +         */
> +        if (!gicv3_gicc_accessible(OBJECT(s), ncpu)) {
> +            continue;
> +        }
> +
>           kvm_gicc_access(s, ICC_SRE_EL1, ncpu, &c->icc_sre_el1, false);
>           kvm_gicc_access(s, ICC_CTLR_EL1, ncpu,
>                           &c->icc_ctlr_el1[GICV3_NS], false);
> diff --git a/include/hw/intc/arm_gicv3_common.h b/include/hw/intc/arm_gicv3_common.h
> index 3720728227..bbf899184e 100644
> --- a/include/hw/intc/arm_gicv3_common.h
> +++ b/include/hw/intc/arm_gicv3_common.h
> @@ -27,6 +27,7 @@
>   #include "hw/sysbus.h"
>   #include "hw/intc/arm_gic_common.h"
>   #include "qom/object.h"
> +#include "qapi/error.h"
>   
>   /*
>    * Maximum number of possible interrupts, determined by the GIC architecture.
> @@ -164,6 +165,7 @@ struct GICv3CPUState {
>       uint64_t icc_apr[3][4];
>       uint64_t icc_igrpen[3];
>       uint64_t icc_ctlr_el3;
> +    bool gicc_accessible;
>   
>       /* Virtualization control interface */
>       uint64_t ich_apr[3][4]; /* ich_apr[GICV3_G1][x] never used */
> @@ -329,4 +331,26 @@ void gicv3_init_irqs_and_mmio(GICv3State *s, qemu_irq_handler handler,
>    */
>   const char *gicv3_class_name(void);
>   
> +/**
> + * gicv3_gicc_accessible:
> + * @obj: QOM object implementing the GICv3 device
> + * @cpu: Index of the vCPU whose GICC accessibility is being queried
> + *
> + * Returns: true if the GICC interface for vCPU @cpu is accessible.
> + * Uses QOM property lookup for "gicc-accessible[%d]".
> + */
> +static inline bool gicv3_gicc_accessible(Object *obj, int cpu)
> +{
> +    g_autofree gchar *propname = g_strdup_printf("gicc-accessible[%d]", cpu);
> +    Error *local_err = NULL;
> +    bool value;
> +
> +    value = object_property_get_bool(obj, propname, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        return false;
> +    }
> +
> +    return value;
> +}
>   #endif
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index 2c22b32a3f..b1d3fa4a25 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -589,6 +589,30 @@ bool qdev_realize_and_unref(DeviceState *dev, BusState *bus, Error **errp);
>    */
>   bool qdev_disable(DeviceState *dev, BusState *bus, Error **errp);
>   
> +/**
> + * qdev_check_enabled - Check if a device is administratively enabled
> + * @dev:  The device to check
> + *
> + * This function returns whether the device is currently in administrative
> + * ENABLED state. It does not reflect runtime operational power state, but
> + * rather the host policy on whether the guest may interact with the device.
> + *
> + * Returns true if the device is administratively enabled; false otherwise.
> + */
> +bool qdev_check_enabled(DeviceState *dev);
> +
> +/**
> + * qdev_get_admin_power_state - Query administrative power state of a device
> + * @dev:  The device whose state is being queried
> + *
> + * Returns the current administrative power state (ENABLED or DISABLED),
> + * as stored in the device's internal admin state field. This reflects
> + * host-level policy—not the operational runtime state seen by the guest.
> + *
> + * Returns an integer from the DeviceAdminPowerState enum.
> + */
> +int qdev_get_admin_power_state(DeviceState *dev);
> +
>   /**
>    * qdev_unrealize: Unrealize a device
>    * @dev: device to unrealize



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC V6 14/24] arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms
  2025-10-01  1:01 ` [PATCH RFC V6 14/24] arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms salil.mehta
  2025-10-03 14:58   ` Igor Mammedov
@ 2025-10-24  4:47   ` Gavin Shan
  1 sibling, 0 replies; 67+ messages in thread
From: Gavin Shan @ 2025-10-24  4:47 UTC (permalink / raw)
  To: salil.mehta, qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, zhukeqian1, wangxiongfeng2, wangyanan55, wangzhou1,
	linuxarm, jiakernel2, maobibo, lixianglai, shahuang, zhao1.liu

Hi Salil,

On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
> From: Salil Mehta <salil.mehta@huawei.com>
> 
> The existing ACPI CPU hotplug interface is built for x86 platforms where CPUs
> can be inserted or removed and resources are allocated dynamically. On ARM, CPUs
> are never hotpluggable: resources are allocated at boot and QOM vCPU objects
> always exist. Instead, CPUs are administratively managed by toggling ACPI _STA
> to enable or disable them, which gives a hotplug-like effect but does not match
> the x86 model.
> 
> Reusing the x86 hotplug AML code would complicate maintenance since much of its
> logic relies on toggling the _STA.Present bit to notify OSPM about CPU insertion
> or removal. Such usage is not architecturally valid on ARM, where CPUs cannot
> appear or disappear at runtime. Mixing both models in one interface would
> increase complexity and make the AML harder to extend. A separate path is
> therefore required. The new design is heavily inspired by the CPU hotplug
> interface but avoids its unsuitable semantics.
> 
> This patch adds a dedicated CPU OSPM (Operating System Power Management)
> interface. It provides a memory-mapped control region with selector, flags,
> command, and data fields, and AML methods for device-check, eject request, and
> _OST reporting. OSPM is notified through GED events and can coordinate CPU
> events directly with QEMU. Other ARM-like architectures may also use this
> interface.
> 
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> ---
>   hw/acpi/Kconfig                        |   3 +
>   hw/acpi/acpi-cpu-ospm-interface-stub.c |  41 ++
>   hw/acpi/cpu_ospm_interface.c           | 747 +++++++++++++++++++++++++
>   hw/acpi/meson.build                    |   2 +
>   hw/acpi/trace-events                   |  17 +
>   hw/arm/Kconfig                         |   1 +
>   include/hw/acpi/cpu_ospm_interface.h   |  78 +++
>   7 files changed, 889 insertions(+)
>   create mode 100644 hw/acpi/acpi-cpu-ospm-interface-stub.c
>   create mode 100644 hw/acpi/cpu_ospm_interface.c
>   create mode 100644 include/hw/acpi/cpu_ospm_interface.h
> 
> diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
> index 1d4e9f0845..aa52f0468f 100644
> --- a/hw/acpi/Kconfig
> +++ b/hw/acpi/Kconfig
> @@ -21,6 +21,9 @@ config ACPI_ICH9
>   config ACPI_CPU_HOTPLUG
>       bool
>   
> +config ACPI_CPU_OSPM_INTERFACE
> +    bool
> +
>   config ACPI_MEMORY_HOTPLUG
>       bool
>       select MEM_DEVICE
> diff --git a/hw/acpi/acpi-cpu-ospm-interface-stub.c b/hw/acpi/acpi-cpu-ospm-interface-stub.c
> new file mode 100644
> index 0000000000..f6f333f641
> --- /dev/null
> +++ b/hw/acpi/acpi-cpu-ospm-interface-stub.c
> @@ -0,0 +1,41 @@
> +/*
> + * ACPI CPU OSPM Interface Handling.
> + *
> + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> + *
> + * Author: Salil Mehta <salil.mehta@huawei.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/acpi/cpu_ospm_interface.h"
> +
> +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                              uint32_t event_st, Error **errp)
> +{
> +}
> +
> +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                               uint32_t event_st, Error **errp)
> +{
> +}
> +
> +void acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev, Error **errp)
> +{
> +}
> +
> +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> +                                        AcpiCpuOspmState *state,
> +                                        hwaddr base_addr)
> +{
> +}
> +
> +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st, ACPIOSTInfoList ***list)
> +{
> +}
> diff --git a/hw/acpi/cpu_ospm_interface.c b/hw/acpi/cpu_ospm_interface.c
> new file mode 100644
> index 0000000000..61aab8a793
> --- /dev/null
> +++ b/hw/acpi/cpu_ospm_interface.c
> @@ -0,0 +1,747 @@
> +/*
> + * ACPI CPU OSPM Interface Handling.
> + *
> + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> + *
> + * Author: Salil Mehta <salil.mehta@huawei.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "migration/vmstate.h"
> +#include "hw/core/cpu.h"
> +#include "qapi/error.h"
> +#include "trace.h"
> +#include "qapi/qapi-events-acpi.h"
> +#include "hw/acpi/cpu_ospm_interface.h"
> +
> +/* CPU identifier and resource device */
> +#define CPU_NAME_FMT      "C%.03X" /* CPU name format (e.g., C001) */
> +#define CPU_RES_DEVICE    "CPUR" /* CPU resource device name */
> +#define CPU_DEVICE        "CPUS" /* CPUs device name */
> +#define CPU_LOCK          "CPLK" /* CPU lock object */
> +/* ACPI method(_STA, _EJ0, etc.) handlers */
> +#define CPU_STS_METHOD    "CSTA" /* CPU status method (_STA.Enabled) */
> +#define CPU_SCAN_METHOD   "CSCN" /* CPU scan method for enumeration */
> +#define CPU_NOTIFY_METHOD "CTFY" /* Notify method for CPU events */
> +#define CPU_EJECT_METHOD  "CEJ0" /* CPU eject method (_EJ0) */
> +#define CPU_OST_METHOD    "COST" /* OSPM status reporting (_OST) */
> +/* CPU MMIO region fields (in PRST region) */
> +#define CPU_SELECTOR      "CSEL" /* CPU selector index (WO) */
> +#define CPU_ENABLED_F     "CPEN" /* Flag: CPU enabled status(_STA) (RO) */
> +#define CPU_DEVCHK_F      "CDCK" /* Flag: Device-check event (RW) */
> +#define CPU_EJECTRQ_F     "CEJR" /* Flag: Eject-request event (RW)*/
> +#define CPU_EJECT_F       "CEJ0" /* Flag: Ejection trigger (WO) */
> +#define CPU_COMMAND       "CCMD" /* Command register (RW) */
> +#define CPU_DATA          "CDAT" /* Data register (RW) */
> +
> + /*
> + * CPU OSPM Interface MMIO Layout (Total: 16 bytes)
> + *
> + * +--------+--------+--------+--------+--------+--------+--------+--------+
> + * |  0x00  |  0x01  |  0x02  |  0x03  |  0x04  |  0x05  |  0x06  |  0x07  |
> + * +--------+--------+--------+--------+--------+--------+--------+--------+
> + * |       Selector (DWord, write-only)         | Flags  |Command |Reserved|
> + * |                                            | (RO/RW)|  (WO)  |(2B pad)|
> + * |        4 bytes (32 bits)                   | 1B     |   1B   | 2B     |
> + * +-----------------------------------------------------------------------+
> + * |  0x08  |  0x09  |  0x0A  |  0x0B  |  0x0C  |  0x0D  |  0x0E  |  0x0F  |
> + * +--------+--------+--------+--------+--------+--------+--------+--------+
> + * |                        Data (QWord, read/write)                       |
> + * |               Used by CPU scan and _OST methods (64 bits)             |
> + * +-----------------------------------------------------------------------+
> + *
> + * Field Overview:
> + *
> + * - Selector: 4 bytes @0x00 (DWord, WO)
> + *               - Selects target CPU index for the current operation.
> + * - Flags:    1 byte  @0x04 (RO/RW)
> + *               - Bit 0: ENABLED  – CPU is powered on (RO)
> + *               - Bit 1: DEVCHK   – Device-check completed (RW)
> + *               - Bit 2: EJECTRQ  – Guest requests CPU eject (RW)
> + *               - Bit 3: EJECT    – Trigger CPU ejection (WO)
> + *               - Bits 4–7: Reserved (write 0)
> + * - Command:  1 byte  @0x05 (WO)
> + *               - Specifies control operation (e.g., scan, _OST, eject).
> + * - Reserved: 2 bytes @0x06–0x07
> + *               - Alignment padding; must be zero on write.
> + * - Data:     8 bytes @0x08 (QWord, RW)
> + *               - Input/output for command-specific data.
> + *               - Used by CPU scan or _OST.
> + */
> +
> +/*
> + * Macros defining the CPU MMIO region layout. Change field sizes here to
> + * alter the overall MMIO region size.
> + */
> +/* Sub-Field sizes (in bytes) */

I would drop these two comments, since the MMIO register layout is already
explained clearly in the comment block above them.

> +#define ACPI_CPU_MR_SELECTOR_SIZE  4 /* Write-only (DWord access) */
> +#define ACPI_CPU_MR_FLAGS_SIZE     1 /* Read-write (Byte access) */
> +#define ACPI_CPU_MR_RES_FLAGS_SIZE 0 /* Reserved padding */
> +#define ACPI_CPU_MR_CMD_SIZE       1 /* Write-only (Byte access) */
> +#define ACPI_CPU_MR_RES_CMD_SIZE   2 /* Reserved padding */
> +#define ACPI_CPU_MR_CMD_DATA_SIZE  8 /* Read-write (QWord access) */
> +

The layout comment above describes 5 fields, but 6 field sizes are defined
here. The zero-sized '#define ACPI_CPU_MR_RES_FLAGS_SIZE 0' seems unnecessary.

> +#define ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE \
> +    MAX_CONST(ACPI_CPU_MR_CMD_DATA_SIZE, \
> +    MAX_CONST(ACPI_CPU_MR_SELECTOR_SIZE, \
> +    MAX_CONST(ACPI_CPU_MR_CMD_SIZE, ACPI_CPU_MR_FLAGS_SIZE)))
> +

Since the widest field is the 8-byte data register, this could simply be:

#define ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE	8
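
As a rough standalone sketch of what I mean (not built against this series),
the constant could be hard-coded and still be protected against a later
field-size change. The _Static_assert guard below is my own hypothetical
addition, not something the patch has; the field sizes are copied from the
hunk above.

```c
#include <assert.h>

/* Field sizes in bytes, copied from the patch above. */
#define ACPI_CPU_MR_SELECTOR_SIZE  4
#define ACPI_CPU_MR_FLAGS_SIZE     1
#define ACPI_CPU_MR_CMD_SIZE       1
#define ACPI_CPU_MR_CMD_DATA_SIZE  8

/* The widest field is the 8-byte data register, so hard-code it. */
#define ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE 8

/*
 * Hypothetical guard (not in the patch): fail the build if any field
 * ever grows beyond the hard-coded maximum access size.
 */
_Static_assert(ACPI_CPU_MR_SELECTOR_SIZE <= ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE &&
               ACPI_CPU_MR_FLAGS_SIZE    <= ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE &&
               ACPI_CPU_MR_CMD_SIZE      <= ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE &&
               ACPI_CPU_MR_CMD_DATA_SIZE <= ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
               "a field is wider than ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE");
```

Whether the guard is worth keeping is a judgment call; the point is only that
the nested MAX_CONST() chain is not needed to get the value 8.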

> +/* Validate layout against exported total length */
> +_Static_assert(ACPI_CPU_OSPM_IF_REG_LEN ==
> +               (ACPI_CPU_MR_SELECTOR_SIZE +
> +                ACPI_CPU_MR_FLAGS_SIZE +
> +                ACPI_CPU_MR_RES_FLAGS_SIZE +
> +                ACPI_CPU_MR_CMD_SIZE +
> +                ACPI_CPU_MR_RES_CMD_SIZE +
> +                ACPI_CPU_MR_CMD_DATA_SIZE),
> +               "ACPI_CPU_OSPM_IF_REG_LEN mismatch with internal MMIO layout");
> +

It seems ACPI_CPU_MR_RES_FLAGS_SIZE can be dropped here.
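
For illustration only, a minimal sketch of the layout with the zero-sized
field removed. ACPI_CPU_OSPM_IF_REG_LEN lives in the header, which is not
quoted here, so I am assuming it is 16 as per "Total: 16 bytes" in the layout
comment. The shortened offset names are mine. Dropping the macro would also
mean dropping ACPI_CPU_MR_RES_FLAGS_SIZE_BITS and its uses in
acpi_build_cpus_aml().

```c
#include <assert.h>

/* Assumption: taken from "Total: 16 bytes" in the layout comment. */
#define ACPI_CPU_OSPM_IF_REG_LEN   16

/* Field sizes in bytes, as in the patch, minus ACPI_CPU_MR_RES_FLAGS_SIZE. */
#define ACPI_CPU_MR_SELECTOR_SIZE  4
#define ACPI_CPU_MR_FLAGS_SIZE     1
#define ACPI_CPU_MR_CMD_SIZE       1
#define ACPI_CPU_MR_RES_CMD_SIZE   2
#define ACPI_CPU_MR_CMD_DATA_SIZE  8

/* Validate layout against the total length: five fields, no zero-sized pad. */
_Static_assert(ACPI_CPU_OSPM_IF_REG_LEN ==
               (ACPI_CPU_MR_SELECTOR_SIZE +
                ACPI_CPU_MR_FLAGS_SIZE +
                ACPI_CPU_MR_CMD_SIZE +
                ACPI_CPU_MR_RES_CMD_SIZE +
                ACPI_CPU_MR_CMD_DATA_SIZE),
               "ACPI_CPU_OSPM_IF_REG_LEN mismatch with internal MMIO layout");

/* Field offsets in bytes (hypothetical shortened names for this sketch). */
#define SKETCH_SELECTOR_OFFSET  0
#define SKETCH_FLAGS_OFFSET     (SKETCH_SELECTOR_OFFSET + ACPI_CPU_MR_SELECTOR_SIZE)
#define SKETCH_CMD_OFFSET       (SKETCH_FLAGS_OFFSET + ACPI_CPU_MR_FLAGS_SIZE)
#define SKETCH_DATA_OFFSET      (SKETCH_CMD_OFFSET + ACPI_CPU_MR_CMD_SIZE + \
                                 ACPI_CPU_MR_RES_CMD_SIZE)
```

The offsets still come out as 0x04, 0x05 and 0x08, matching the layout
comment, so nothing else should need to change.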

Thanks,
Gavin

> +/* Sub-Field sizes (in bits) */
> +#define ACPI_CPU_MR_SELECTOR_SIZE_BITS \
> +    (ACPI_CPU_MR_SELECTOR_SIZE * BITS_PER_BYTE)  /* Write-only (DWord Acc) */
> +#define ACPI_CPU_MR_FLAGS_SIZE_BITS \
> +    (ACPI_CPU_MR_FLAGS_SIZE * BITS_PER_BYTE)     /* Read-write (Byte Acc) */
> +#define ACPI_CPU_MR_RES_FLAGS_SIZE_BITS \
> +    (ACPI_CPU_MR_RES_FLAGS_SIZE * BITS_PER_BYTE) /* Reserved padding */
> +#define ACPI_CPU_MR_CMD_SIZE_BITS \
> +    (ACPI_CPU_MR_CMD_SIZE * BITS_PER_BYTE)       /* Write-only (Byte Acc) */
> +#define ACPI_CPU_MR_RES_CMD_SIZE_BITS \
> +    (ACPI_CPU_MR_RES_CMD_SIZE * BITS_PER_BYTE)   /* Reserved padding */
> +#define ACPI_CPU_MR_CMD_DATA_SIZE_BITS \
> +    (ACPI_CPU_MR_CMD_DATA_SIZE * BITS_PER_BYTE)  /* Read-write (QWord Acc) */
> +
> +/* Field offsets (in bytes) */
> +#define ACPI_CPU_MR_SELECTOR_OFFSET_WO  0
> +#define ACPI_CPU_MR_FLAGS_OFFSET_RW \
> +    (ACPI_CPU_MR_SELECTOR_OFFSET_WO + \
> +     ACPI_CPU_MR_SELECTOR_SIZE)
> +#define ACPI_CPU_MR_CMD_OFFSET_WO \
> +    (ACPI_CPU_MR_FLAGS_OFFSET_RW + \
> +     ACPI_CPU_MR_FLAGS_SIZE + \
> +     ACPI_CPU_MR_RES_FLAGS_SIZE)
> +#define ACPI_CPU_MR_CMD_DATA_OFFSET_RW \
> +    (ACPI_CPU_MR_CMD_OFFSET_WO + \
> +     ACPI_CPU_MR_CMD_SIZE + \
> +     ACPI_CPU_MR_RES_CMD_SIZE)
> +
> +/* ensure all offsets are at their natural size alignment boundaries */
> +#define STATIC_ASSERT_FIELD_ALIGNMENT(offset, type, field_name)               \
> +    _Static_assert((offset) % sizeof(type) == 0,                              \
> +                   field_name " is not aligned to its natural boundary")
> +
> +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_SELECTOR_OFFSET_WO,
> +                              uint32_t, "Selector");
> +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_FLAGS_OFFSET_RW,
> +                              uint8_t, "Flags");
> +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_CMD_OFFSET_WO,
> +                              uint8_t, "Command");
> +STATIC_ASSERT_FIELD_ALIGNMENT(ACPI_CPU_MR_CMD_DATA_OFFSET_RW,
> +                              uint64_t, "Command Data");
> +
> +/* Flag bit positions (used within 'flags' subfield) */
> +#define ACPI_CPU_FLAGS_USED_BITS 4
> +#define ACPI_CPU_MR_FLAGS_BIT_ENABLED BIT(0)
> +#define ACPI_CPU_MR_FLAGS_BIT_DEVCHK  BIT(1)
> +#define ACPI_CPU_MR_FLAGS_BIT_EJECTRQ BIT(2)
> +#define ACPI_CPU_MR_FLAGS_BIT_EJECT   BIT(ACPI_CPU_FLAGS_USED_BITS - 1)
> +
> +#define ACPI_CPU_MR_RES_FLAG_BITS (BITS_PER_BYTE - ACPI_CPU_FLAGS_USED_BITS)
> +
> +enum {
> +    ACPI_GET_NEXT_CPU_WITH_EVENT_CMD = 0,
> +    ACPI_OST_EVENT_CMD = 1,
> +    ACPI_OST_STATUS_CMD = 2,
> +    ACPI_CMD_MAX
> +};
> +
> +#define AML_APPEND_MR_RESVD_FIELD(mr_field, size_bits)       \
> +    do {                                                        \
> +        if ((size_bits) != 0) {                                 \
> +            aml_append((mr_field), aml_reserved_field(size_bits)); \
> +        }                                                       \
> +    } while (0)
> +
> +#define AML_APPEND_MR_NAMED_FIELD(mr_field, name, size_bits)    \
> +    do {                                                        \
> +        if ((size_bits) != 0) {                                 \
> +            aml_append((mr_field), aml_named_field((name), (size_bits))); \
> +        }                                                       \
> +    } while (0)
> +
> +#define AML_CPU_RES_DEV(base, field) \
> +        aml_name("%s.%s.%s", (base), CPU_RES_DEVICE, (field))
> +
> +static ACPIOSTInfo *
> +acpi_cpu_ospm_ost_status(int idx, AcpiCpuOspmStateStatus *cdev)
> +{
> +    ACPIOSTInfo *info = g_new0(ACPIOSTInfo, 1);
> +
> +    info->source = cdev->ost_event;
> +    info->status = cdev->ost_status;
> +    if (cdev->cpu) {
> +        DeviceState *dev = DEVICE(cdev->cpu);
> +        if (dev->id) {
> +            info->device = g_strdup(dev->id);
> +        }
> +    }
> +    return info;
> +}
> +
> +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st, ACPIOSTInfoList ***list)
> +{
> +    ACPIOSTInfoList ***tail = list;
> +    int i;
> +
> +    for (i = 0; i < cpu_st->dev_count; i++) {
> +        QAPI_LIST_APPEND(*tail, acpi_cpu_ospm_ost_status(i, &cpu_st->devs[i]));
> +    }
> +}
> +
> +static uint64_t
> +acpi_cpu_ospm_intf_mr_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    AcpiCpuOspmState *cpu_st = opaque;
> +    AcpiCpuOspmStateStatus *cdev;
> +    uint64_t val = 0;
> +
> +    if (cpu_st->selector >= cpu_st->dev_count) {
> +        return val;
> +    }
> +    cdev = &cpu_st->devs[cpu_st->selector];
> +    switch (addr) {
> +    case ACPI_CPU_MR_FLAGS_OFFSET_RW:
> +        val |= qdev_check_enabled(DEVICE(cdev->cpu)) ?
> +                                  ACPI_CPU_MR_FLAGS_BIT_ENABLED : 0;
> +        val |= cdev->devchk_pending ? ACPI_CPU_MR_FLAGS_BIT_DEVCHK : 0;
> +        val |= cdev->ejrqst_pending ? ACPI_CPU_MR_FLAGS_BIT_EJECTRQ : 0;
> +        trace_acpi_cpuos_if_read_flags(cpu_st->selector, val);
> +        break;
> +    case ACPI_CPU_MR_CMD_DATA_OFFSET_RW:
> +        switch (cpu_st->command) {
> +        case ACPI_GET_NEXT_CPU_WITH_EVENT_CMD:
> +           val = cpu_st->selector;
> +           break;
> +        default:
> +           trace_acpi_cpuos_if_read_invalid_cmd_data(cpu_st->selector,
> +                                                     cpu_st->command);
> +           break;
> +        }
> +        trace_acpi_cpuos_if_read_cmd_data(cpu_st->selector, val);
> +        break;
> +    default:
> +        break;
> +    }
> +    return val;
> +}
> +
> +static void
> +acpi_cpu_ospm_intf_mr_write(void *opaque, hwaddr addr, uint64_t data,
> +                            unsigned int size)
> +{
> +    AcpiCpuOspmState *cpu_st = opaque;
> +    AcpiCpuOspmStateStatus *cdev;
> +    ACPIOSTInfo *info;
> +
> +    assert(cpu_st->dev_count);
> +    if (addr) {
> +        if (cpu_st->selector >= cpu_st->dev_count) {
> +            trace_acpi_cpuos_if_invalid_idx_selected(cpu_st->selector);
> +            return;
> +        }
> +    }
> +
> +    switch (addr) {
> +    case ACPI_CPU_MR_SELECTOR_OFFSET_WO: /* current CPU selector */
> +        cpu_st->selector = data;
> +        trace_acpi_cpuos_if_write_idx(cpu_st->selector);
> +        break;
> +    case ACPI_CPU_MR_FLAGS_OFFSET_RW: /* set is_* fields  */
> +        cdev = &cpu_st->devs[cpu_st->selector];
> +        if (data & ACPI_CPU_MR_FLAGS_BIT_DEVCHK) {
> +            /* clear device-check pending event */
> +            cdev->devchk_pending = false;
> +            trace_acpi_cpuos_if_clear_devchk_evt(cpu_st->selector);
> +        } else if (data & ACPI_CPU_MR_FLAGS_BIT_EJECTRQ) {
> +            /* clear eject-request pending event */
> +            cdev->ejrqst_pending = false;
> +            trace_acpi_cpuos_if_clear_ejrqst_evt(cpu_st->selector);
> +        } else if (data & ACPI_CPU_MR_FLAGS_BIT_EJECT) {
> +            DeviceState *dev = NULL;
> +            if (!cdev->cpu || cdev->cpu == first_cpu) {
> +                trace_acpi_cpuos_if_ejecting_invalid_cpu(cpu_st->selector);
> +                break;
> +            }
> +            /*
> +             * OSPM has returned with eject. Hence, it is now safe to put the
> +             * cpu device on powered-off state.
> +             */
> +            trace_acpi_cpuos_if_ejecting_cpu(cpu_st->selector);
> +            dev = DEVICE(cdev->cpu);
> +            qdev_sync_disable(dev, &error_fatal);
> +        }
> +        break;
> +    case ACPI_CPU_MR_CMD_OFFSET_WO:
> +        trace_acpi_cpuos_if_write_cmd(cpu_st->selector, data);
> +        if (data < ACPI_CMD_MAX) {
> +            cpu_st->command = data;
> +            if (cpu_st->command == ACPI_GET_NEXT_CPU_WITH_EVENT_CMD) {
> +                uint32_t iter = cpu_st->selector;
> +
> +                do {
> +                    cdev = &cpu_st->devs[iter];
> +                    if (cdev->devchk_pending || cdev->ejrqst_pending) {
> +                        cpu_st->selector = iter;
> +                        trace_acpi_cpuos_if_cpu_has_events(cpu_st->selector,
> +                            cdev->devchk_pending, cdev->ejrqst_pending);
> +                        break;
> +                    }
> +                    iter = iter + 1 < cpu_st->dev_count ? iter + 1 : 0;
> +                } while (iter != cpu_st->selector);
> +            }
> +        }
> +        break;
> +    case ACPI_CPU_MR_CMD_DATA_OFFSET_RW:
> +        switch (cpu_st->command) {
> +        case ACPI_OST_EVENT_CMD: {
> +           cdev = &cpu_st->devs[cpu_st->selector];
> +           cdev->ost_event = data;
> +           trace_acpi_cpuos_if_write_ost_ev(cpu_st->selector, cdev->ost_event);
> +           break;
> +        }
> +        case ACPI_OST_STATUS_CMD: {
> +           cdev = &cpu_st->devs[cpu_st->selector];
> +           cdev->ost_status = data;
> +           info = acpi_cpu_ospm_ost_status(cpu_st->selector, cdev);
> +           qapi_event_send_acpi_device_ost(info);
> +           qapi_free_ACPIOSTInfo(info);
> +           trace_acpi_cpuos_if_write_ost_status(cpu_st->selector,
> +                                                cdev->ost_status);
> +           break;
> +        }
> +        default:
> +           trace_acpi_cpuos_if_write_invalid_cmd(cpu_st->selector,
> +                                                 cpu_st->command);
> +           break;
> +        }
> +        break;
> +    default:
> +        trace_acpi_cpuos_if_write_invalid_offset(cpu_st->selector, addr);
> +        break;
> +    }
> +}
> +
> +static const MemoryRegionOps cpu_common_mr_ops = {
> +    .read = acpi_cpu_ospm_intf_mr_read,
> +    .write = acpi_cpu_ospm_intf_mr_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .valid = {
> +        .min_access_size = 1,
> +        .max_access_size = ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
> +    },
> +    .impl = {
> +        .min_access_size = 1,
> +        .max_access_size = ACPI_CPU_OSPM_IF_MAX_FIELD_SIZE,
> +        .unaligned = false,
> +    },
> +};
> +
> +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> +                                        AcpiCpuOspmState *state,
> +                                        hwaddr base_addr)
> +{
> +    MachineState *machine = MACHINE(qdev_get_machine());
> +    MachineClass *mc = MACHINE_GET_CLASS(machine);
> +    const CPUArchIdList *id_list;
> +    int i;
> +
> +    assert(mc->possible_cpu_arch_ids);
> +    id_list = mc->possible_cpu_arch_ids(machine);
> +    state->dev_count = id_list->len;
> +    state->devs = g_new0(typeof(*state->devs), state->dev_count);
> +    for (i = 0; i < id_list->len; i++) {
> +        state->devs[i].cpu =  CPU(id_list->cpus[i].cpu);
> +        state->devs[i].arch_id = id_list->cpus[i].arch_id;
> +    }
> +    memory_region_init_io(&state->ctrl_reg, owner, &cpu_common_mr_ops, state,
> +                          "ACPI CPU OSPM State Interface Memory Region",
> +                          ACPI_CPU_OSPM_IF_REG_LEN);
> +    memory_region_add_subregion(as, base_addr, &state->ctrl_reg);
> +}
> +
> +static AcpiCpuOspmStateStatus *
> +acpi_get_cpu_status(AcpiCpuOspmState *cpu_st, DeviceState *dev)
> +{
> +    CPUClass *k = CPU_GET_CLASS(dev);
> +    uint64_t cpu_arch_id = k->get_arch_id(CPU(dev));
> +    int i;
> +
> +    for (i = 0; i < cpu_st->dev_count; i++) {
> +        if (cpu_arch_id == cpu_st->devs[i].arch_id) {
> +            return &cpu_st->devs[i];
> +        }
> +    }
> +    return NULL;
> +}
> +
> +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                              uint32_t event_st, Error **errp)
> +{
> +    AcpiCpuOspmStateStatus *cdev;
> +    cdev = acpi_get_cpu_status(cpu_st, dev);
> +    if (!cdev) {
> +        return;
> +    }
> +    assert(cdev->cpu);
> +
> +    /*
> +     * Tell OSPM via GED IRQ(GSI) that a powered-off cpu is being powered-on.
> +     * Also, mark 'device-check' event pending for this cpu. This will
> +     * eventually result in OSPM evaluating the ACPI _EVT method and scan of
> +     * cpus
> +     */
> +    cdev->devchk_pending = true;
> +    acpi_send_event(cpu_st->acpi_dev, event_st);
> +}
> +
> +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                              uint32_t event_st, Error **errp)
> +{
> +    AcpiCpuOspmStateStatus *cdev;
> +    cdev = acpi_get_cpu_status(cpu_st, dev);
> +    if (!cdev) {
> +        return;
> +    }
> +    assert(cdev->cpu);
> +
> +    /*
> +     * Tell OSPM via GED IRQ(GSI) that a cpu wants to power-off or go on standby
> +     * Also,mark 'eject-request' event pending for this cpu. (graceful shutdown)
> +     */
> +    cdev->ejrqst_pending = true;
> +    acpi_send_event(cpu_st->acpi_dev, event_st);
> +}
> +
> +void
> +acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev, Error **errp)
> +{
> +    /* TODO: possible handling here */
> +}
> +
> +static const VMStateDescription vmstate_cpu_ospm_state_sts = {
> +    .name = "CPU OSPM state status",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (const VMStateField[]) {
> +        VMSTATE_BOOL(devchk_pending, AcpiCpuOspmStateStatus),
> +        VMSTATE_BOOL(ejrqst_pending, AcpiCpuOspmStateStatus),
> +        VMSTATE_UINT32(ost_event, AcpiCpuOspmStateStatus),
> +        VMSTATE_UINT32(ost_status, AcpiCpuOspmStateStatus),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +const VMStateDescription vmstate_cpu_ospm_state = {
> +    .name = "CPU OSPM state",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (const VMStateField[]) {
> +        VMSTATE_UINT32(selector, AcpiCpuOspmState),
> +        VMSTATE_UINT8(command, AcpiCpuOspmState),
> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(devs, AcpiCpuOspmState,
> +                                             dev_count,
> +                                             vmstate_cpu_ospm_state_sts,
> +                                             AcpiCpuOspmStateStatus),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +void acpi_build_cpus_aml(Aml *table, hwaddr base_addr, const char *root,
> +                         const char *event_handler_method)
> +{
> +    MachineState *machine = MACHINE(qdev_get_machine());
> +    MachineClass *mc = MACHINE_GET_CLASS(machine);
> +    const CPUArchIdList *arch_ids = mc->possible_cpu_arch_ids(machine);
> +    Aml *sb_scope = aml_scope("_SB"); /* System Bus Scope */
> +    Aml *ifctx, *field, *method, *cpu_res_dev, *cpus_dev;
> +    Aml *zero = aml_int(0);
> +    Aml *one = aml_int(1);
> +
> +    cpu_res_dev = aml_device("%s.%s", root, CPU_RES_DEVICE);
> +    {
> +        Aml *crs;
> +
> +        aml_append(cpu_res_dev,
> +            aml_name_decl("_HID", aml_eisaid("PNP0A06")));
> +        aml_append(cpu_res_dev,
> +            aml_name_decl("_UID", aml_string("CPU OSPM Interface resources")));
> +        aml_append(cpu_res_dev, aml_mutex(CPU_LOCK, 0));
> +
> +        crs = aml_resource_template();
> +        aml_append(crs, aml_memory32_fixed(base_addr, ACPI_CPU_OSPM_IF_REG_LEN,
> +                   AML_READ_WRITE));
> +
> +        aml_append(cpu_res_dev, aml_name_decl("_CRS", crs));
> +
> +        /* declare CPU OSPM Interface MMIO region related access fields */
> +        aml_append(cpu_res_dev,
> +                   aml_operation_region("PRST", AML_SYSTEM_MEMORY,
> +                                        aml_int(base_addr),
> +                                        ACPI_CPU_OSPM_IF_REG_LEN));
> +
> +        /*
> +         * define named fields within PRST region with 'Byte' access widths
> +         * and reserve fields with other access width
> +         */
> +        field = aml_field("PRST", AML_BYTE_ACC, AML_NOLOCK, AML_PRESERVE);
> +        /* reserve CPU 'selector' field (size in bits) */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_SELECTOR_SIZE_BITS);
> +        /* Flag::Enabled Bit(RO) - Read '1' if enabled */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_ENABLED_F, 1);
> +        /* Flag::Devchk Bit(RW) - Read '1', has a event. Write '1', to clear */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_DEVCHK_F, 1);
> +        /* Flag::Ejectrq Bit(RW) - Read 1, has event. Write 1 to clear */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_EJECTRQ_F, 1);
> +        /* Flag::Eject Bit(WO) - OSPM evals _EJx, initiates CPU Eject in Qemu*/
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_EJECT_F, 1);
> +        /* Flag::Bit(ACPI_CPU_FLAGS_USED_BITS)-Bit(7) - Reserve left over bits*/
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_FLAG_BITS);
> +        /* Reserved space: padding after flags */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_FLAGS_SIZE_BITS);
> +        /* Command field written by OSPM */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_COMMAND,
> +                                  ACPI_CPU_MR_CMD_SIZE_BITS);
> +        /* Reserved space: padding after command field */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_RES_CMD_SIZE_BITS);
> +        /* Command data: 64-bit payload associated with command */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_CMD_DATA_SIZE_BITS);
> +        aml_append(cpu_res_dev, field);
> +
> +        /*
> +         * define named fields with 'Dword' access widths and reserve fields
> +         * with other access width
> +         */
> +        field = aml_field("PRST", AML_DWORD_ACC, AML_NOLOCK, AML_PRESERVE);
> +        /* CPU selector, write only */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_SELECTOR,
> +                                  ACPI_CPU_MR_SELECTOR_SIZE_BITS);
> +        aml_append(cpu_res_dev, field);
> +
> +        /*
> +         * define named fields with 'Qword' access widths and reserve fields
> +         * with other access width
> +         */
> +        field = aml_field("PRST", AML_QWORD_ACC, AML_NOLOCK, AML_PRESERVE);
> +        /*
> +         * Reserve space: selector, flags, reserved flags, command, reserved
> +         * command for Qword alignment.
> +         */
> +        AML_APPEND_MR_RESVD_FIELD(field, ACPI_CPU_MR_SELECTOR_SIZE_BITS +
> +                                            ACPI_CPU_MR_FLAGS_SIZE_BITS +
> +                                            ACPI_CPU_MR_RES_FLAGS_SIZE_BITS +
> +                                            ACPI_CPU_MR_CMD_SIZE_BITS +
> +                                            ACPI_CPU_MR_RES_CMD_SIZE_BITS);
> +        /* Command data accessible via Qword */
> +        AML_APPEND_MR_NAMED_FIELD(field, CPU_DATA,
> +                                  ACPI_CPU_MR_CMD_DATA_SIZE_BITS);
> +        aml_append(cpu_res_dev, field);
> +    }
> +    aml_append(sb_scope, cpu_res_dev);
> +
> +    cpus_dev = aml_device("%s.%s", root, CPU_DEVICE);
> +    {
> +        Aml *ctrl_lock = AML_CPU_RES_DEV(root, CPU_LOCK);
> +        Aml *cpu_selector = AML_CPU_RES_DEV(root, CPU_SELECTOR);
> +        Aml *is_enabled = AML_CPU_RES_DEV(root, CPU_ENABLED_F);
> +        Aml *dvchk_evt = AML_CPU_RES_DEV(root, CPU_DEVCHK_F);
> +        Aml *ejrq_evt = AML_CPU_RES_DEV(root, CPU_EJECTRQ_F);
> +        Aml *ej_evt = AML_CPU_RES_DEV(root, CPU_EJECT_F);
> +        Aml *cpu_cmd = AML_CPU_RES_DEV(root, CPU_COMMAND);
> +        Aml *cpu_data = AML_CPU_RES_DEV(root, CPU_DATA);
> +        int i;
> +
> +        aml_append(cpus_dev, aml_name_decl("_HID", aml_string("ACPI0010")));
> +        aml_append(cpus_dev, aml_name_decl("_CID", aml_eisaid("PNP0A05")));
> +
> +        method = aml_method(CPU_NOTIFY_METHOD, 2, AML_NOTSERIALIZED);
> +        for (i = 0; i < arch_ids->len; i++) {
> +            Aml *cpu = aml_name(CPU_NAME_FMT, i);
> +            Aml *uid = aml_arg(0);
> +            Aml *event = aml_arg(1);
> +
> +            ifctx = aml_if(aml_equal(uid, aml_int(i)));
> +            {
> +                aml_append(ifctx, aml_notify(cpu, event));
> +            }
> +            aml_append(method, ifctx);
> +        }
> +        aml_append(cpus_dev, method);
> +
> +        method = aml_method(CPU_STS_METHOD, 1, AML_SERIALIZED);
> +        {
> +            Aml *idx = aml_arg(0);
> +            Aml *sta = aml_local(0);
> +            Aml *else_ctx;
> +
> +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> +            aml_append(method, aml_store(idx, cpu_selector));
> +            aml_append(method, aml_store(zero, sta));
> +            ifctx = aml_if(aml_equal(is_enabled, one));
> +            {
> +                /* cpu is present and enabled */
> +                aml_append(ifctx, aml_store(aml_int(0xF), sta));
> +            }
> +            aml_append(method, ifctx);
> +            else_ctx = aml_else();
> +            {
> +                /* cpu is present but disabled */
> +                aml_append(else_ctx, aml_store(aml_int(0xD), sta));
> +            }
> +            aml_append(method, else_ctx);
> +            aml_append(method, aml_release(ctrl_lock));
> +            aml_append(method, aml_return(sta));
> +        }
> +        aml_append(cpus_dev, method);
> +
> +        method = aml_method(CPU_EJECT_METHOD, 1, AML_SERIALIZED);
> +        {
> +            Aml *idx = aml_arg(0);
> +
> +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> +            aml_append(method, aml_store(idx, cpu_selector));
> +            aml_append(method, aml_store(one, ej_evt));
> +            aml_append(method, aml_release(ctrl_lock));
> +        }
> +        aml_append(cpus_dev, method);
> +
> +        method = aml_method(CPU_SCAN_METHOD, 0, AML_SERIALIZED);
> +        {
> +            Aml *has_event = aml_local(0); /* Local0: Loop control flag */
> +            Aml *uid = aml_local(1); /* Local1: Current CPU UID */
> +            /* Constants */
> +            Aml *dev_chk = aml_int(1); /* Notify: device check to enable */
> +            Aml *eject_req = aml_int(3); /* Notify: eject for removal */
> +            Aml *next_cpu_cmd = aml_int(ACPI_GET_NEXT_CPU_WITH_EVENT_CMD);
> +
> +            /* Acquire CPU lock */
> +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> +
> +            /* Initialize loop */
> +            aml_append(method, aml_store(zero, uid));
> +            aml_append(method, aml_store(one, has_event));
> +
> +            Aml *while_ctx = aml_while(aml_land(
> +                aml_equal(has_event, one),
> +                aml_lless(uid, aml_int(arch_ids->len))
> +            ));
> +            {
> +                aml_append(while_ctx, aml_store(zero, has_event));
> +                /*
> +                 * Issue scan cmd: QEMU will return next CPU with event in
> +                 * cpu_data
> +                 */
> +                aml_append(while_ctx, aml_store(uid, cpu_selector));
> +                aml_append(while_ctx, aml_store(next_cpu_cmd, cpu_cmd));
> +
> +                /* If scan wrapped around to an earlier UID, exit loop */
> +                Aml *wrap_check = aml_if(aml_lless(cpu_data, uid));
> +                aml_append(wrap_check, aml_break());
> +                aml_append(while_ctx, wrap_check);
> +
> +                /* Set UID to scanned result */
> +                aml_append(while_ctx, aml_store(cpu_data, uid));
> +
> +                /* send CPU device-check(resume) event to OSPM */
> +                Aml *if_devchk = aml_if(aml_equal(dvchk_evt, one));
> +                {
> +                    aml_append(if_devchk,
> +                        aml_call2(CPU_NOTIFY_METHOD, uid, dev_chk));
> +                    /* clear local device-check event sent flag */
> +                    aml_append(if_devchk, aml_store(one, dvchk_evt));
> +                    aml_append(if_devchk, aml_store(one, has_event));
> +                }
> +                aml_append(while_ctx, if_devchk);
> +
> +                /*
> +                 * send CPU eject-request event to OSPM to gracefully handle
> +                 * OSPM related tasks running on this CPU
> +                 */
> +                Aml *else_ctx = aml_else();
> +                Aml *if_ejrq = aml_if(aml_equal(ejrq_evt, one));
> +                {
> +                    aml_append(if_ejrq,
> +                        aml_call2(CPU_NOTIFY_METHOD, uid, eject_req));
> +                    /* clear local eject-request event sent flag */
> +                    aml_append(if_ejrq, aml_store(one, ejrq_evt));
> +                    aml_append(if_ejrq, aml_store(one, has_event));
> +                }
> +                aml_append(else_ctx, if_ejrq);
> +                aml_append(while_ctx, else_ctx);
> +
> +                /* Increment UID */
> +                aml_append(while_ctx, aml_increment(uid));
> +            }
> +            aml_append(method, while_ctx);
> +
> +            /* Release cpu lock */
> +            aml_append(method, aml_release(ctrl_lock));
> +        }
> +        aml_append(cpus_dev, method);
> +
> +        method = aml_method(CPU_OST_METHOD, 4, AML_SERIALIZED);
> +        {
> +            Aml *uid = aml_arg(0);
> +            Aml *ev_cmd = aml_int(ACPI_OST_EVENT_CMD);
> +            Aml *st_cmd = aml_int(ACPI_OST_STATUS_CMD);
> +
> +            aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
> +            aml_append(method, aml_store(uid, cpu_selector));
> +            aml_append(method, aml_store(ev_cmd, cpu_cmd));
> +            aml_append(method, aml_store(aml_arg(1), cpu_data));
> +            aml_append(method, aml_store(st_cmd, cpu_cmd));
> +            aml_append(method, aml_store(aml_arg(2), cpu_data));
> +            aml_append(method, aml_release(ctrl_lock));
> +        }
> +        aml_append(cpus_dev, method);
> +
> +        /* build Processor object for each processor */
> +        for (i = 0; i < arch_ids->len; i++) {
> +            Aml *dev;
> +            Aml *uid = aml_int(i);
> +
> +            dev = aml_device(CPU_NAME_FMT, i);
> +            aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0007")));
> +            aml_append(dev, aml_name_decl("_UID", uid));
> +
> +            method = aml_method("_STA", 0, AML_SERIALIZED);
> +            aml_append(method, aml_return(aml_call1(CPU_STS_METHOD, uid)));
> +            aml_append(dev, method);
> +
> +            if (CPU(arch_ids->cpus[i].cpu) != first_cpu) {
> +                method = aml_method("_EJ0", 1, AML_NOTSERIALIZED);
> +                aml_append(method, aml_call1(CPU_EJECT_METHOD, uid));
> +                aml_append(dev, method);
> +            }
> +
> +            method = aml_method("_OST", 3, AML_SERIALIZED);
> +            aml_append(method,
> +                aml_call4(CPU_OST_METHOD, uid, aml_arg(0),
> +                          aml_arg(1), aml_arg(2))
> +            );
> +            aml_append(dev, method);
> +            aml_append(cpus_dev, dev);
> +        }
> +    }
> +    aml_append(sb_scope, cpus_dev);
> +    aml_append(table, sb_scope);
> +
> +    method = aml_method(event_handler_method, 0, AML_NOTSERIALIZED);
> +    aml_append(method, aml_call0("\\_SB.CPUS." CPU_SCAN_METHOD));
> +    aml_append(table, method);
> +}
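
For readers following the quoted CPU_OST_METHOD: it takes the lock, stores the UID into the selector, then issues two command/data pairs (event first, status second). A minimal C model of that handshake is sketched below. The command encodings, the `NR_CPUS` bound, and every struct and function name here are hypothetical and chosen only for illustration; the real values of ACPI_OST_EVENT_CMD and ACPI_OST_STATUS_CMD and the device-side handlers live elsewhere in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command encodings, not the values used by the patch. */
enum { CMD_OST_EVENT = 1, CMD_OST_STATUS = 2 };

#define NR_CPUS 4   /* hypothetical number of possible CPUs */

struct cpu_slot {
    uint32_t ost_event;
    uint32_t ost_status;
};

struct ospm_if {
    uint32_t selector;              /* CPU that later accesses refer to */
    uint8_t command;                /* how the next data write is decoded */
    struct cpu_slot devs[NR_CPUS];
};

/* Device side: decode a data write using the latched selector/command.
 * Returns 0 on success, -1 for an invalid selector or command. */
static int write_data(struct ospm_if *s, uint32_t val)
{
    if (s->selector >= NR_CPUS) {
        return -1;
    }
    switch (s->command) {
    case CMD_OST_EVENT:
        s->devs[s->selector].ost_event = val;
        return 0;
    case CMD_OST_STATUS:
        s->devs[s->selector].ost_status = val;
        return 0;
    default:
        return -1;
    }
}

/* Guest side: same store order as the AML method
 * (selector, event cmd + data, status cmd + data). */
static int report_ost(struct ospm_if *s, uint32_t uid,
                      uint32_t event, uint32_t status)
{
    assert(s != NULL);
    s->selector = uid;
    s->command = CMD_OST_EVENT;
    if (write_data(s, event) < 0) {
        return -1;
    }
    s->command = CMD_OST_STATUS;
    return write_data(s, status);
}
```

Because the selector and command are latched in shared registers, the two command/data pairs are only meaningful if nothing else touches the interface in between, which is presumably why the AML wraps the whole sequence in ctrl_lock.
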
> diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
> index 73f02b9691..6d83396ab4 100644
> --- a/hw/acpi/meson.build
> +++ b/hw/acpi/meson.build
> @@ -8,6 +8,8 @@ acpi_ss.add(files(
>   ))
>   acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_true: files('cpu.c', 'cpu_hotplug.c'))
>   acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_false: files('acpi-cpu-hotplug-stub.c'))
> +acpi_ss.add(when: 'CONFIG_ACPI_CPU_OSPM_INTERFACE', if_true: files('cpu_ospm_interface.c'))
> +acpi_ss.add(when: 'CONFIG_ACPI_CPU_OSPM_INTERFACE', if_false: files('acpi-cpu-ospm-interface-stub.c'))
>   acpi_ss.add(when: 'CONFIG_ACPI_MEMORY_HOTPLUG', if_true: files('memory_hotplug.c'))
>   acpi_ss.add(when: 'CONFIG_ACPI_MEMORY_HOTPLUG', if_false: files('acpi-mem-hotplug-stub.c'))
>   acpi_ss.add(when: 'CONFIG_ACPI_NVDIMM', if_true: files('nvdimm.c'))
> diff --git a/hw/acpi/trace-events b/hw/acpi/trace-events
> index edc93e703c..c0ecbdd48f 100644
> --- a/hw/acpi/trace-events
> +++ b/hw/acpi/trace-events
> @@ -40,6 +40,23 @@ cpuhp_acpi_fw_remove_cpu(uint32_t idx) "0x%"PRIx32
>   cpuhp_acpi_write_ost_ev(uint32_t slot, uint32_t ev) "idx[0x%"PRIx32"] OST EVENT: 0x%"PRIx32
>   cpuhp_acpi_write_ost_status(uint32_t slot, uint32_t st) "idx[0x%"PRIx32"] OST STATUS: 0x%"PRIx32
>   
> +# cpu_ospm_interface.c
> +acpi_cpuos_if_invalid_idx_selected(uint32_t idx) "selector idx[0x%"PRIx32"]"
> +acpi_cpuos_if_read_flags(uint32_t idx, uint8_t flags) "cpu idx[0x%"PRIx32"] flags: 0x%"PRIx8
> +acpi_cpuos_if_write_idx(uint32_t idx) "set active cpu idx: 0x%"PRIx32
> +acpi_cpuos_if_write_cmd(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] cmd: 0x%"PRIx8
> +acpi_cpuos_if_write_invalid_cmd(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] invalid cmd: 0x%"PRIx8
> +acpi_cpuos_if_write_invalid_offset(uint32_t idx, uint64_t addr) "cpu idx[0x%"PRIx32"] invalid offset: 0x%"PRIx64
> +acpi_cpuos_if_read_cmd_data(uint32_t idx, uint32_t data) "cpu idx[0x%"PRIx32"] data: 0x%"PRIx32
> +acpi_cpuos_if_read_invalid_cmd_data(uint32_t idx, uint8_t cmd) "cpu idx[0x%"PRIx32"] invalid cmd: 0x%"PRIx8
> +acpi_cpuos_if_cpu_has_events(uint32_t idx, bool devchk, bool ejrqst) "cpu idx[0x%"PRIx32"] device-check pending: %d, eject-request pending: %d"
> +acpi_cpuos_if_clear_devchk_evt(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> +acpi_cpuos_if_clear_ejrqst_evt(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> +acpi_cpuos_if_ejecting_invalid_cpu(uint32_t idx) "invalid cpu idx[0x%"PRIx32"]"
> +acpi_cpuos_if_ejecting_cpu(uint32_t idx) "cpu idx[0x%"PRIx32"]"
> +acpi_cpuos_if_write_ost_ev(uint32_t idx, uint32_t ev) "cpu idx[0x%"PRIx32"] OST Event: 0x%"PRIx32
> +acpi_cpuos_if_write_ost_status(uint32_t idx, uint32_t st) "cpu idx[0x%"PRIx32"] OST Status: 0x%"PRIx32
> +
>   # pcihp.c
>   acpi_pci_eject_slot(unsigned bsel, unsigned slot) "bsel: %u slot: %u"
>   acpi_pci_unplug(int bsel, int slot) "bsel: %d slot: %d"
> diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
> index 2aa4b5d778..c9991e00c7 100644
> --- a/hw/arm/Kconfig
> +++ b/hw/arm/Kconfig
> @@ -39,6 +39,7 @@ config ARM_VIRT
>       select VIRTIO_MEM_SUPPORTED
>       select ACPI_CXL
>       select ACPI_HMAT
> +    select ACPI_CPU_OSPM_INTERFACE
>   
>   config CUBIEBOARD
>       bool
> diff --git a/include/hw/acpi/cpu_ospm_interface.h b/include/hw/acpi/cpu_ospm_interface.h
> new file mode 100644
> index 0000000000..5dda327a34
> --- /dev/null
> +++ b/include/hw/acpi/cpu_ospm_interface.h
> @@ -0,0 +1,78 @@
> +/*
> + * ACPI CPU OSPM Interface Handling.
> + *
> + * Copyright (c) 2025 Huawei Technologies R&D (UK) Ltd.
> + *
> + * Author: Salil Mehta <salil.mehta@huawei.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +#ifndef CPU_OSPM_INTERFACE_H
> +#define CPU_OSPM_INTERFACE_H
> +
> +#include "qapi/qapi-types-acpi.h"
> +#include "hw/qdev-core.h"
> +#include "hw/acpi/acpi.h"
> +#include "hw/acpi/aml-build.h"
> +#include "hw/boards.h"
> +
> +/**
> + * Total size (in bytes) of the ACPI CPU OSPM Interface MMIO region.
> + *
> + * This region contains control and status fields such as CPU selector,
> + * flags, command register, and data register. It must exactly match the
> + * layout defined in the AML code and the memory region implementation.
> + *
> + * Any mismatch between this definition and the AML layout may result in
> + * runtime errors or build-time assertion failures (e.g., _Static_assert),
> + * breaking correct device emulation and guest OS coordination.
> + */
> +#define ACPI_CPU_OSPM_IF_REG_LEN 16
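
The comment above asks for the region length to match the AML layout exactly. One way to enforce that at build time is sketched below. Only the 16-byte length comes from the patch; the field offsets, the reserved padding, and the names `ospm_if_regs` and `ospm_if_access_ok` are hypothetical, since the actual layout is defined by the AML and the memory region code rather than by this header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ACPI_CPU_OSPM_IF_REG_LEN 16

/* Hypothetical register layout; offsets are illustrative only. */
struct ospm_if_regs {
    uint32_t selector;   /* offset 0:  CPU selector */
    uint8_t  flags;      /* offset 4:  per-CPU flags */
    uint8_t  command;    /* offset 5:  command register */
    uint8_t  rsvd0[2];   /* offset 6:  reserved */
    uint32_t data;       /* offset 8:  data register */
    uint8_t  rsvd1[4];   /* offset 12: reserved */
};

/* Fail the build if the struct and the advertised length diverge. */
static_assert(sizeof(struct ospm_if_regs) == ACPI_CPU_OSPM_IF_REG_LEN,
              "register layout must match ACPI_CPU_OSPM_IF_REG_LEN");
static_assert(offsetof(struct ospm_if_regs, data) == 8,
              "data register moved; update the AML field offsets too");

/* True if an access of 'size' bytes at 'off' stays inside the region. */
static bool ospm_if_access_ok(uint64_t off, unsigned size)
{
    return size != 0 && off < ACPI_CPU_OSPM_IF_REG_LEN &&
           size <= ACPI_CPU_OSPM_IF_REG_LEN - off;
}
```
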
> +
> +typedef struct  {
> +    CPUState *cpu;
> +    uint64_t arch_id;
> +    bool devchk_pending; /* device-check pending */
> +    bool ejrqst_pending; /* eject-request pending */
> +    uint32_t ost_event;
> +    uint32_t ost_status;
> +} AcpiCpuOspmStateStatus;
> +
> +typedef struct AcpiCpuOspmState {
> +    DeviceState *acpi_dev;
> +    MemoryRegion ctrl_reg;
> +    uint32_t selector;
> +    uint8_t command;
> +    uint32_t dev_count;
> +    AcpiCpuOspmStateStatus *devs;
> +} AcpiCpuOspmState;
> +
> +void acpi_cpu_device_check_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                              uint32_t event_st, Error **errp);
> +
> +void acpi_cpu_eject_request_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                               uint32_t event_st, Error **errp);
> +
> +void acpi_cpu_eject_cb(AcpiCpuOspmState *cpu_st, DeviceState *dev,
> +                       Error **errp);
> +
> +void acpi_cpu_ospm_state_interface_init(MemoryRegion *as, Object *owner,
> +                                        AcpiCpuOspmState *state,
> +                                        hwaddr base_addr);
> +
> +void acpi_build_cpus_aml(Aml *table, hwaddr base_addr, const char *root,
> +                         const char *event_handler_method);
> +
> +void acpi_cpus_ospm_status(AcpiCpuOspmState *cpu_st,
> +                           ACPIOSTInfoList ***list);
> +
> +extern const VMStateDescription vmstate_cpu_ospm_state;
> +#define VMSTATE_CPU_OSPM_STATE(cpuospm, state) \
> +    VMSTATE_STRUCT(cpuospm, state, 1, \
> +                   vmstate_cpu_ospm_state, AcpiCpuOspmState)
> +#endif  /* CPU_OSPM_INTERFACE_H */




* Re: [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch
  2025-10-22 10:07 ` Gavin Shan
@ 2025-10-24  6:55   ` Gavin Shan
  0 siblings, 0 replies; 67+ messages in thread
From: Gavin Shan @ 2025-10-24  6:55 UTC (permalink / raw)
  To: salil.mehta, qemu-devel, qemu-arm, mst
  Cc: salil.mehta, maz, jean-philippe, jonathan.cameron, lpieralisi,
	peter.maydell, richard.henderson, imammedo, armbru, andrew.jones,
	david, philmd, eric.auger, will, ardb, oliver.upton, pbonzini,
	rafael, borntraeger, alex.bennee, gustavo.romero, npiggin,
	harshpb, linux, darren, ilkka, vishnu, gankulkarni, karl.heubaum,
	miguel.luis, zhukeqian1, wangxiongfeng2, wangyanan55, wangzhou1,
	linuxarm, jiakernel2, maobibo, lixianglai, shahuang, zhao1.liu

Hi Salil,

On 10/22/25 8:07 PM, Gavin Shan wrote:
> On 10/1/25 11:01 AM, salil.mehta@opnsrc.net wrote:
>>
>> ===================
>> (VII) Commands Used
>> ===================
>>
>> A. Qemu launch commands to init the machine (with 6 possible vCPUs):
>>
>> $ qemu-system-aarch64 --enable-kvm -machine virt,gic-version=3 \
>> -cpu host -smp cpus=4,disabled=2 \
>> -m 300M \
>> -kernel Image \
>> -initrd rootfs.cpio.gz \
>> -append "console=ttyAMA0 root=/dev/ram rdinit=/init maxcpus=2 acpi=force" \
>> -nographic \
>> -bios QEMU_EFI.fd \
>>
> 
> The parameter 'disabled=2' isn't correct here and it needs to be 'disabledcpus=2'.
> Otherwise, the VM won't be started due to the unrecognized parameter.
> 
> $ /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64       \
>    --enable-kvm -machine virt,gic-version=3 -cpu host,sve=off    \
>    -smp cpus=4,disabled=2 -m 1024M                               \
>    -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image \
>    -initrd /home/gavin/sandbox/images/rootfs.cpio.xz -nographic
> qemu-system-aarch64: Parameter 'smp.disabled' is unexpected
> 

In addition to the issues reported previously, I found a few more, listed
below. Some of them may turn out to be invalid.

The VM is always started with the following command line.

host$ /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64               \
-accel kvm -machine virt,gic-version=host,nvdimm=on                         \
-cpu host,sve=off                                                           \
-smp maxcpus=4,cpus=2,disabledcpus=2,sockets=2,clusters=2,cores=1,threads=1 \
-m 4096M,slots=16,maxmem=128G -object memory-backend-ram,id=mem0,size=2048M \
-object memory-backend-ram,id=mem1,size=2048M                               \
-numa node,nodeid=0,memdev=mem0,cpus=0-1                                    \
-numa node,nodeid=1,memdev=mem1,cpus=2-3                                    \
-L /home/gavin/sandbox/qemu.main/build/pc-bios                              \
-monitor none -serial mon:stdio                                             \
-nographic -gdb tcp::6666 -qmp tcp:localhost:5555,server,wait=off           \
-bios /home/gavin/sandbox/qemu.main/build/pc-bios/edk2-aarch64-code.fd      \
-kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image               \
-initrd /home/gavin/sandbox/images/rootfs.cpio.xz                           \
-append memhp_default_state=online_movable
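
As a sanity check on the -smp values above, the arithmetic can be sketched in C. QEMU requires the product of the topology levels to equal maxcpus; the second condition, that cpus plus disabledcpus must not exceed maxcpus, is an assumption about how this series treats 'disabledcpus' and is not taken from the patches. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the -smp values used in this report. */
struct smp_cfg {
    unsigned maxcpus, cpus, disabledcpus;
    unsigned sockets, clusters, cores, threads;
};

static bool smp_cfg_consistent(const struct smp_cfg *c)
{
    unsigned topo = c->sockets * c->clusters * c->cores * c->threads;

    /* The topology product must match maxcpus (existing QEMU rule). */
    if (topo != c->maxcpus) {
        return false;
    }
    /* Assumed rule: enabled plus disabled vCPUs fit within maxcpus. */
    return c->cpus + c->disabledcpus <= c->maxcpus;
}
```

With the values used here (2 sockets x 2 clusters x 1 core x 1 thread = 4 = maxcpus, and 2 + 2 = 4), the check passes.
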

[Issue-1]: Inconsistent output from 'qom-list /machine/unattached'. The disabled
CPU device doesn't show up at the beginning, but it appears after it's hot added.
However, the CPU device is still seen after it's hot removed.

(qemu) qom-list /machine/unattached
device[0] (child<host-arm-cpu>)
device[1] (child<host-arm-cpu>)
   :
(qemu) device_set host-arm-cpu,socket-id=1,admin-state=enable
(qemu) qom-list /machine/unattached
device[0] (child<host-arm-cpu>)
device[1] (child<host-arm-cpu>)
device[42] (child<host-arm-cpu>)
   :
(qemu) device_set host-arm-cpu,socket-id=1,admin-state=disable
(qemu) qom-list /machine/unattached
device[0] (child<host-arm-cpu>)
device[1] (child<host-arm-cpu>)
device[42] (child<host-arm-cpu>)

[Issue-2]: The hot added CPU disappears after a system reset.

guest$ cat /sys/devices/system/cpu/online
0-1
(qemu) device_set host-arm-cpu,socket-id=1,admin-state=enable
guest$ echo 1 > /sys/devices/system/cpu/cpu2/online
guest$ cat /sys/devices/system/cpu/online
0-2

(qemu) system_reset
guest$ cat /sys/devices/system/cpu/online
0-1

[Issue-3]: A PC-DIMM cannot be hot added.

(qemu) object_add memory-backend-ram,id=hp-mem0,size=512M
(qemu) device_add pc-dimm,id=hp-dimm0,memdev=hp-mem0,node=0
Error: Parameter 'driver' expects a pluggable device type or which supports changing power-state administratively

Thanks,
Gavin




Thread overview: 67+ messages
2025-10-01  1:01 [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 01/24] hw/core: Introduce administrative power-state property and its accessors salil.mehta
2025-10-09 10:48   ` Miguel Luis
2025-10-01  1:01 ` [PATCH RFC V6 02/24] hw/core, qemu-options.hx: Introduce 'disabledcpus' SMP parameter salil.mehta
2025-10-09 11:28   ` Miguel Luis
2025-10-09 13:17     ` Igor Mammedov
2025-10-09 11:51   ` Markus Armbruster
2025-10-01  1:01 ` [PATCH RFC V6 03/24] hw/arm/virt: Clamp 'maxcpus' as-per machine's vCPU deferred online-capability salil.mehta
2025-10-09 12:32   ` Miguel Luis
2025-10-09 13:11     ` Igor Mammedov
2025-10-01  1:01 ` [PATCH RFC V6 04/24] arm/virt, target/arm: Add new ARMCPU {socket, cluster, core, thread}-id property salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 05/24] arm/virt, kvm: Pre-create KVM vCPUs for 'disabled' QOM vCPUs at machine init salil.mehta
2025-10-22 10:36   ` [PATCH RFC V6 05/24] arm/virt,kvm: " Gavin Shan
2025-10-22 18:18     ` Salil Mehta
2025-10-22 18:50       ` Salil Mehta
2025-10-23  0:14         ` Gavin Shan
2025-10-23  0:35           ` Salil Mehta
2025-10-23  1:29             ` Salil Mehta
2025-10-23  4:14               ` Gavin Shan
2025-10-23 11:27                 ` Salil Mehta
2025-10-23  1:58             ` Gavin Shan
2025-10-23 11:17               ` Salil Mehta
2025-10-01  1:01 ` [PATCH RFC V6 06/24] arm/virt, gicv3: Pre-size GIC with possible " salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 07/24] arm/gicv3: Refactor CPU interface init for shared TCG/KVM use salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 08/24] arm/virt, gicv3: Guard CPU interface access for admin disabled vCPUs salil.mehta
2025-10-24  4:07   ` Gavin Shan
2025-10-01  1:01 ` [PATCH RFC V6 09/24] hw/intc/arm_gicv3_common: Migrate & check 'GICv3CPUState' accessibility mismatch salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 10/24] arm/virt: Init PMU at host for all present vCPUs salil.mehta
2025-10-03 15:02   ` Igor Mammedov
2025-10-01  1:01 ` [PATCH RFC V6 11/24] hw/arm/acpi: MADT change to size the guest with possible vCPUs salil.mehta
2025-10-03 15:09   ` Igor Mammedov
     [not found]     ` <0175e40f70424dd9a29389b8a4f16c42@huawei.com>
2025-10-07 12:20       ` Igor Mammedov
2025-10-10  3:15         ` Salil Mehta
2025-10-01  1:01 ` [PATCH RFC V6 12/24] hw/core: Introduce generic device power-state handler interface salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 13/24] qdev: make admin power state changes trigger platform transitions via ACPI salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 14/24] arm/acpi: Introduce dedicated CPU OSPM interface for ARM-like platforms salil.mehta
2025-10-03 14:58   ` Igor Mammedov
     [not found]     ` <7da6a9c470684754810414f0abd23a62@huawei.com>
2025-10-07 12:06       ` Igor Mammedov
2025-10-10  3:00         ` Salil Mehta
2025-10-24  4:47   ` Gavin Shan
2025-10-01  1:01 ` [PATCH RFC V6 15/24] acpi/ged: Notify OSPM of CPU administrative state changes via GED salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 16/24] arm/virt/acpi: Update ACPI DSDT Tbl to include 'Online-Capable' CPUs AML salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 17/24] hw/arm/virt, acpi/ged: Add PowerStateHandler hooks for runtime CPU state changes salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 18/24] target/arm/kvm, tcg: Handle SMCCC hypercall exits in VMM during PSCI_CPU_{ON, OFF} salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 19/24] target/arm/cpu: Add the Accessor hook to fetch ARM CPU arch-id salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 20/24] target/arm/kvm: Write vCPU's state back to KVM on cold-reset salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 21/24] hw/intc/arm-gicv3-kvm: Pause all vCPUs & cache ICC_CTLR_EL1 for userspace PSCI CPU_ON salil.mehta
2025-10-01  1:01 ` [PATCH RFC V6 22/24] monitor, qdev: Introduce 'device_set' to change admin state of existing devices salil.mehta
2025-10-09  8:55   ` [PATCH RFC V6 22/24] monitor,qdev: " Markus Armbruster
2025-10-09 12:51     ` Igor Mammedov
2025-10-09 14:03       ` Daniel P. Berrangé
2025-10-09 14:55       ` Markus Armbruster
2025-10-09 15:19         ` Peter Maydell
2025-10-10  4:59           ` Markus Armbruster
2025-10-17 14:50         ` Igor Mammedov
2025-10-20 11:22           ` Markus Armbruster
2025-10-01  1:01 ` [PATCH RFC V6 23/24] monitor, qapi: add 'info cpus-powerstate' and QMP query (Admin + Oper states) salil.mehta
2025-10-09 11:53   ` [PATCH RFC V6 23/24] monitor,qapi: " Markus Armbruster
2025-10-01  1:01 ` [PATCH RFC V6 24/24] tcg: Defer TB flush for 'lazy realized' vCPUs on first region alloc salil.mehta
2025-10-01 21:34   ` Richard Henderson
2025-10-02 12:27     ` Salil Mehta via
2025-10-02 15:41       ` Richard Henderson
2025-10-07 10:14         ` Salil Mehta via
2025-10-06 14:00 ` [PATCH RFC V6 00/24] Support of Virtual CPU Hotplug-like Feature for ARMv8+ Arch Igor Mammedov
2025-10-13  0:34 ` Gavin Shan
2025-10-22 10:07 ` Gavin Shan
2025-10-24  6:55   ` Gavin Shan
